I called it when I wrote it: they are just burning their goodwill to the ground.
I will note that one of the main startups in the space worked with us directly, refunded our costs, and fixed the bug in their crawler. Facebook never replied to our emails, and the link in their User-Agent led to a 404 -- an engineer at the company saw our post and reached out, giving me the right email -- which I then emailed three times and never got a reply.
AI firms seem to be leading from a position that goodwill is irrelevant: a $100bn pile of capital, like an 800lb gorilla, does what it wants. AI will be incorporated into all products whether you like it or not; it will absorb all data whether you like it or not.
"Why should we care about open source maintainers" is just a microcosm of the much larger "why should we care about literally anybody" mindset.
And this is why AI training is not "fair use". The AI companies seek to train models in order to compete with the authors of the content used to train the models.
A possible eventual downfall of AI is that the risk of losing a copyright infringement lawsuit is not going away. If a court determines that the AI output you've used is close enough to be considered a derivative work, it's infringement.
If it's owned by a few, as it is right now, it's an existential threat to the life, liberty, and pursuit of happiness of everyone else on the planet.
We should be seriously considering what we're going to do in response to that threat if something doesn't change soon.
This has to change somehow.
"Machines will do everything and we'll just reap the profits" is a vision that techno-millenialists are repeating since the beginnings of the Industrial Revolution, but we haven't seen that happening anywhere.
For some strange reason, technological progress seem to be always accompanied with an increase on human labor. We're already past the 8-hours 5-days norm and things are only getting worse.
This isn't a consequence of capitalism. The notion of having to work to survive - assuming you aren't a fan of slavery - is baked into things at a much more fundamental level. And lots of people don't work, and are paid by a welfare state funded by capitalism-generated taxes.
> "Machines will do everything and we'll just reap the profits" is a vision that techno-millenialists are repeating since the beginnings of the Industrial Revolution, but we haven't seen that happening anywhere.
They were wrong, but the work is still there to do. You haven't come up with the utopian plan you're comparing this to.
> For some strange reason, technological progress always seems to be accompanied by an increase in human labor.
No, it doesn't. What happens is that not enough people are needed to do a job any more, so they go find another job. No one was opening barista-staffed coffee shops on every corner back when 30% of the world was doing agricultural labour.
Yes, it is. The fact that we have welfare isn't a refutation of that; it's proof. Welfare is a band-aid over the fundamental flaws of capitalism. A purely capitalist system is so evil it is unthinkable. The people currently on welfare would, in a truly free labor market, die and rot in the street. We, collectively, decided that's not a good idea and went against that.
That's why the labor market, and truly all our markets, are not free. Free markets suck major ass. We all know it. Six year olds have no business being in coal mines, no matter how much the invisible hand demands it.
I think this should be an axiom which should be respected by any copyright rule.
Let's not forget the basis:
> [The Congress shall have Power . . . ] To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.
Is our current implementation of copyright promoting the progress of science and useful arts?
Or will science and the useful arts be accelerated by culling back the current cruft of copyright laws?
For example, imagine if copyright were non-transferable and did not permit exclusive licensing agreements.
Copyright isn't the problem. Over-financialization is the problem.
Realize what it already has.
A foundational language model with no additional training is already quite powerful.
And that genie isn't going back into the bottle.
"The upside of my gambit is so great for the world, that I should be able to consume everyone else's resources for free. I promise to be a benevolent ruler."
When Google first came out in 1998, it was amazing, spooky how good it was. Then people figured out how to game pagerank and Google's accuracy cratered.
AI is now in a similar bubble period. Throwing out all of copyright law just for the benefit of a few oligarchs would be utter foolishness. Given who is in power right now I'm sure that prospect will find a few friends, but I think the odds of it actually happening before the bubble bursts are pretty small.
If software and ideas become commodities and the legal ecosystem around creating captive markets disappears, then we will all be much better off.
The people agitating for such things are usually leeches who want everything free and do, in fact, hold an infantile worldview that doesn't consider how necessary remuneration is to whatever it is they want so badly (media pirates being another example).
Not that I haven't "pirated" media, but this is usually the result of it not being available for purchase or my already having purchased it.
When I read someone else’s essay I may intend to write essays like that author. When I read someone else’s code I may intend to write code like that author.
AI training is no different from any other training.
> If a court determines that the AI output you've used is close enough to be considered a derivative work, it's infringement.
Do you mean the output of the AI training process (the model), or the output of the AI model? If the former, yes, sure: if a model actually contains within it copies of data, then it’s a copy of that work.
But we should all be very wary of any argument that the ability to create a new work which is identical to a previous work is itself derivative. A painter may be able to copy van Gogh, but neither the painter’s brain nor his non-copy paintings (even those in the style of van Gogh) are copies of van Gogh’s work.
I completely agree — that’s why I explicitly wrote ‘non-copy paintings’ in my example.
> The AI argument that passing original content through an algorithm insulates the output from claims of infringement because of "fair use" is pigwash.
Sure, but the argument that training an AI on content is necessarily infringement is equally pigwash. So long as the resulting model does not contain copies, it is not infringement; and so long as it does not produce a copy, it is not infringement.
That's not true.
The article specifically deals with training by scraping sites. That does necessarily involve producing a copy from the server to the machine(s) doing the scraping & training. If the TOS of the site incorporates robots.txt or otherwise denies a license for such activity, it is arguably infringement. Sourcehut's TOS for example specifically denies the use of automated tools to obtain information for profit.
Will it mean longer and longer clips are "fair use", or will we just stop making new content because it can't avoid copying patterns of the past?
https://www.vice.com/en/article/musicians-algorithmically-ge...
They did this in 2020. The article points out that "Whether this tactic actually works in court remains to be seen" and I haven't been following along with the story, so I don't know the current status.
Yes, it is. One is done by a computer program, and one is done by a human.
I believe in the rights and liberties of human beings. I have no reason to believe in rights for silicon. You, and every other AI apologist, are never able to produce anything to back up what is largely seen as an outrageous world view.
You cannot simply jump the gun and compare AI training to human training like it's a foregone conclusion. No, it doesn't work that way. Explain why AI should have rights. Explain if AI should be considered persons. Explain what I, personally, will gain from extending rights to AI. And explain what we, collectively, will gain from it.
Once you have an army of robot slaves ... you've rendered the whole concept of money irrelevant. Your skynet just barters rare earth metals with other skynets and your robot slaves furnish your desired lifestyle as best they can given the amount of rare earth metals your skynet can get its hands on. Or maybe a better skynet / slave army kills your skynet / slave army, but tough tits, sucks to be you and rules to be whoever's skynet killed yours.
they are heavily outnumbered and "outfunded"
Ubiquitous surveillance is another.
At some point in the future, if you aren't using AI, you won't be able to compete in the job market.
the tools feed back to the mothership what you are accepting and what you aren't
this is a far better signal than anything they get from crawling the internet
A job market is formed by the presence of needs and the ability to satisfy them. AI does not reduce the ability to satisfy needs, so the only situations where you won't be able to compete are either that socialists seize power and ban competition, or that all needs get met in some other way. In any other situation there will be a job market, and people will compete in it.
Maybe there will be. I'm sure there's also a market for Walkmans somewhere; it's just exceedingly small.
The proclaimed goal is to displace workers on a grand scale. This is basically the vision of any AI company and literally the only way you could even remotely justify their valuations given the heavy losses they incur right now.
> Job market is formed by the presence of needs and the presence of the ability to satisfy them
The needs of a job market are largely shaped by the overall economy. Many industrial nations are largely service-based economies with a lot of white collar jobs in particular. These white collar jobs are generally easier to replace with AI than blue collar jobs because you don't have to deal with pesky things like the real, physical world. The problem is: if white collar workers are kicked out of their jobs en masse, it also negatively affects the "value" of the remaining people with employment (exhibit A: the tech job market right now).
> is either the socialists will seize power and ban competition,
I am really having a hard time understanding where this obsession with mythical socialism comes from. The reality we live in is largely capitalistic and a striving towards a monopoly - i.e. a lack of competition - is basically the entire purpose of a corporation, which is only kept in check by government regulations.
It doesn't matter. What you need to understand is that at the root of the job market are needs, the ability to meet those needs, and the ability to exchange those abilities with one another. None of those are hindered by AI.
>Many industrial nations are largely service based economies with a lot of white collar jobs in particular.
Again: at the end of the day it doesn't change anything. At the end of the day you need a cooked dinner, a built house and everything else. So someone must build a house and exchange it for cooked dinners. That's what is happening (white collar workers and international trade balances included) and that's what the job market is. AI doesn't change the nature of those relationships. Maybe it replaces white collar workers, maybe even almost all of them - that only means they will go satisfy other unsatisfied needs of other people in exchange for the satisfaction of their own. The job market won't go anywhere; if anything, the amount of satisfied needs will go up, not down.
>if white collar workers are kicked out of their jobs en masse, it also negatively affects the "value" of the remaining people with employment
No, it doesn't. I mean it would if they were simply kicked out, but that's not the case - they would be replaced by AI. So society gets all the benefits they were creating plus an additional labor force to satisfy previously unsatisfied needs.
>exhibit A: the tech job market right now
I don't have the stats at hand, but aren't blue collar workers doing better now than ever before?
>I am really having a hard time understanding where this obsession with mythical socialism comes from
From the history of the 20th century? I mean it's not an obsession, but we are discussing scenarios of the disappearance (or significant shrinking) of the job market, and socialists are the most (if not the only) realistic cause for that at the moment.
>The reality we live in is largely capitalistic and a striving towards a monopoly
Yes, and that monopoly, the monopoly, is called "socialism".
>corporation, which is only kept in check by government regulations.
Generally, corporations are kept in check by the economic freedom of other economic agents, and it is government regulation that protects monopolies from the free market. I mean, why would a government regulate in the other direction? A small number of big corporations is way easier for a government to control and extract personal benefits from.
You should read some history. This view is so naive and overconfident.
* https://phys.org/news/2023-08-people-pointless-meaningless-j...
Parliament had made a law phasing in the introduction of automated looms; specifically so that existing weavers were first on the list to get one. Britain's oligarchy completely ignored this and bought or built looms anyway; and because Parliament is part of that oligarchy, the law effectively turned into "weavers get looms last". That's why they were smashing looms - to bring the oligarchy back to the negotiating table.
The oligarchy responded the way all violent thugs do: killing their detractors and lying about their motives.
Why would this happen? Money is simply a medium of exchange for the value that these contractors, mechanics and other hardcore blue collar trades are creating. How can they be broke if AI doesn't disturb their ability to create value and exchange it?
Money means nothing. It is simply a medium of exchange. The question is: is there anything to exchange? The answer is yes, and the position of white collar workers doesn't affect the availability of things to exchange. There's no reason for a recession; there is nothing that can hinder the ability of blue collar workers to create goods and services, all the things that, taken together, are called "wealth".
Don't think in the meaningless category of "what set of digits will be printed on the piece of paper called a paycheck?". Think in the terms that are actually implied: "what goods and services can blue collar workers not afford for themselves?". Then it becomes clear that the set of goods and services unaffordable to blue collar workers will shrink because of the replacement of white collar workers with AI, because it does not hinder their ability to create those goods and services.
You think so? Give me the contents of your checking, savings, and retirement accounts and then get back to me on that.
> the position of white collar workers doesn't affect the availability of things to exchange.
You appear to be confused about the concept of consumers, let me help. Consumers are the people who buy things. When there are fewer consumers in a market, demand for products and services declines. This means less sales. So no, you don't get to unemploy big chunks of the population and expect business to continue thriving.
No, demand is unlimited and defined by the amount of production.
>You don't get to unemploy big chunks of the population and expect business to continue thriving.
I mean, generally, replacing workers with instruments is the main way for a business (and society) to thrive. In other words, what goods and services will become less affordable to blue collar workers?
Enough of your trolling, go waste someone else's time.
Obviously it does affect it. The supply of goods increases and their relative market value increases - how can this not increase their incomes?
I mean yes, the value of consumed goods will decrease, so blue collar workers will be able to consume more. That's exactly what an increase in income means.
AI is poised to disrupt large swaths of the workforce. If large swaths of the workforce are disrupted this necessarily means a bunch of people will see their income negatively impacted (job got replaced by AI). Broke people by definition don't have money to spend on things, and will prioritize tier one of Maslow's Hierarchy out of necessity. Since shit like pergolas and oil changes are not directly on tier 1 they will be deprioritized. This in turn cuts business to blue collar service providers. Net result: everyone who isn't running an AI company or controlling some currently undefined minimum amount of capital is fucked.
If you're trying to suggest that any notional increases in productivity created by AI will in any way benefit working class individuals either individually or as a group you are off the edge of the map economically speaking. Historical precedents and observed executive tier depravity both suggest any increase in productivity will be used as an excuse to cut labor costs.
No, it doesn't. Where does that come from?
I mean, look at the situation from the perspective of blue collar service providers: what exactly are the goods and services that they were able to afford for themselves but that AI will make unaffordable? Pretty obviously, there are about none. So, in the big picture, the whole process you described doesn't lead to any disadvantage for blue collar workers.
Similar to how advertising and legal services are required for everything but have ambiguous ROI at best, AI is set to become a major “cost of doing business“ tax everywhere. Large corporations welcome this even if it’s useless, because it drags down smaller competitors and digs a deeper moat.
Executives large and small mostly have one thing in common though.. they have nothing but contempt for both their customers and their employees, and would much rather play the mergers and acquisitions type of games than do any real work in their industry (which is how we end up in a world where the doors are flying off airplanes mid flight). Either they consolidate power by getting bigger or they get a cushy exit, so.. who cares about any other kind of collateral damage?
Building things IS a contribution to society, but the people who build things typically aren't the ultimate owners. And even in cases where the builders and owners are the same, entitling the builders and all of their future heirs to rent seek for the rest of eternity is an inordinate reward.
This goes both ways. Let's say there is something you want but you're having trouble obtaining it. You'd need to give something in exchange.
But the seller of what you want doesn't need the things you can easily acquire, because they can get those things just as easily themselves.
The economy collapses back into self-sufficiency. That's why most Minecraft economy servers stagnate and die.
They would use some of the goods/services produced themselves, and also trade with other owners to live happy lives with everything they need, no workers involved.
Non-owners may let the jobless working class inhabit unwanted land, until they change their minds.
I miss the star trek visions of the future
now the "good" outcome is a world sized north korea, with elon as ruler
and the bad outcome is the ruler using his army of robots to eliminate the possibility of the peasant revolt once and for all
This economic relationship can be collectively[1] described as "feudalism". This is a system in which:
- The vast majority of people are obligated to perform menial labor, i.e. peasant farmers.
- Class mobility is forbidden by law and ownership predominantly stays within families.
- The vast majority of wealth in the economy is in the form of rents paid to owners.
We often use the word "capitalist" to describe all businesses, but that's a modern simplification. Businesses can absolutely engage in feudalist economies just as well, or better, than they can engage in capitalist ones. The key difference is that, under capitalism, businesses have to provide goods or services that people are willing to pay for. Feudalism makes no such demand; your business is just renting out a thing you own.
Assuming AI does what it says on the tin (which isn't at all obvious), the endgame of AI automation is an economy of roughly fifty elite oligarchs who own the software to make the robots that do all work. They will be in a constant state of cold war, having to pay their competitors for access to the work they need done, with periodic wars (kinetic, cyber, legal, whatever) being fought whenever a company intrudes upon another's labor-enclave.
The question of "well, who pays for the robots" misunderstands what money is ultimately for. Money is a token that tracks tax payments for coercive states. It is minted specifically to fund wars of conquest; you pay your soldiers in tax tokens so the people they conquer will have to barter for money to pay the tax collector with[2]. But this logic assumes your soldiers are engaging in a voluntary exchange. If your 'soldiers' are killer robots that won't say no and only demand payment in energy and ammunition, then you don't need money. You just need to seize critical energy and mineral reserves that can be harvested to make more robots.
So far, AI companies have been talking of first-order effects like mass unemployment and hand-waving about UBI to fix it. On a surface level, UBI sounds a lot like the law necessary to make all this AI nonsense palatable. Sam Altman even paid to have a study done on UBI, and the results were... not great. Everyone who got money saw real declines in their net worth. Capital-c Conservative types will get a big stiffy from the finding that UBI did lead people to work less, but that's only part of the story. UBI as promoted by AI companies is bribing the peasants. In the world where the AI companies win, what is the economic or political restraining bolt stopping the AI companies from just dialing the UBI back and keeping more of the resources for themselves once traditional employment is scaled back? Like, at that point, they already own all the resources and the means of production. What makes them share?
[0] Depending on your definition of institutional continuity - i.e. whether or not Istanbul is still Constantinople - you could argue the Roman Empire survived until WWI.
[1] Inasmuch as the complicated and idiosyncratic economic relationships of medieval Europe could even be summed up in one word.
[2] Ransomware vendors accidentally did this, establishing Bitcoin (and a few other cryptos) as money by demanding it as payment for a data ransom.
I agree with you in the case of AI companies, but the desire to own everything and be completely unconstrained is the dream of every large corporation.
how has this been any different from the past 10,000 years of human conquest and domination?
It can be better or worse depending on what those with power choose to do. Probably worse. There has been conquest and domination for a long time, but ordinary people have also lived in relative peace gathering and growing food in large parts of the world in the past, some for entire generations. But now the world is rapidly becoming unable to support much of that as abundance and carrying capacity are deleted through human activity. And eventually the robot armies controlled by a few people will probably extract and hoard everything that's left. Hopefully in some corners some people and animals can survive, probably by being seen as useful to the owners.
Be fruitful, and multiply, so that you may enjoy a comfortable middle age and senescence exploiting the shit out of numerous naive 25-year-olds! If it's robots, we can ramp down the population of both humans and robots until the planet can once again easily provide abundance.
That's why even though technology could theoretically be used to save us from many of our problems, it isn't primarily used that way.
But presumably petty tyrants with armies of slave robots are less interested than consensus in a long-term vision for humanity that involves feeding and housing a population of 10 billion.
So after whatever horrific holocaust follows the AI wars the way is clear for a hundred thousand humans to live in the lap of luxury with minimal impact on the planet. Even if there are a few intervening millennia of like 200 humans living in the lap of luxury and 99,800 living in sex slavery.
One issue was a pic with text in it, like a store sign. Users were complaining that it kept asking for better focus on the text in the background, before allowing a photo. Alpha quality junk.
Which is what AI is, really.
AI will be incorporated into the government, whether you like it or not.
FTFY!
Like why else can we just spam these AI endpoints and pay $0.07 at the end of the month? There is some incredible competition going on. And so far everyone except big tech is the winner so that’s nice.
I had to do a double take here. I run (mostly using dedicated servers) infrastructure that handles a few hundred TB of traffic per month, and my traffic costs are on the order of $0.50 to $3 per TB (mostly depending on the geographical location). AWS egress costs are just nuts.
Yes, it is.
> and way bigger problem then some AI companies that ignore robot.txt.
No, it absolutely is not. I think you underestimate just how hard these AI companies hammer services - it is bringing down systems that have weathered significant past traffic spikes with no issues, and the traffic volumes are at the level where literally any other kind of company would've been banned by their upstream for "carrying out DDoS attacks" months ago.
Yes, I completely don't understand this, and I don't understand comparing it with DDoS attacks. How is it different from what search engines are doing, and in what way is it worse? It's simply scraping data; what significant problems can it cause? Cache pollution? And that's it? I mean, even when we're talking about ignoring robots.txt (which search engines often do too) and hitting costly endpoints - what's the problem with adding a captcha or rate limiter to those endpoints?
Send a bill to their accounts payable team instead.
Have the terms of use charge them per page load under some definition of abuse.
Profit... By sending them invoices :-)
At which point does the crawling cease to be a bug/oversight and constitute a DDOS?
Depending on the number of simultaneous requesting connections, you may be able to do this without a significant change to your infrastructure. There are ways to do it that don't exhaust your number of (IP, port) available too, if that is an issue.
Then the hard part is deciding which connections to slow, but you can start with a proportional delay based on the number of bytes per source IP block or do it based on certain user agents. Might turn into a small arms race but it's a start.
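To make that concrete, here's a toy sketch in Go (all names and thresholds invented, and counting requests per /24 rather than bytes, just to show the shape) of a middleware that slows down heavy source blocks proportionally:

```go
package main

import (
	"net"
	"net/http"
	"sync"
	"time"
)

// throttler counts recent requests per /24 block and delays responses
// proportionally. Counts are reset once per minute by a background loop.
type throttler struct {
	mu     sync.Mutex
	counts map[string]int
	next   http.Handler
}

func newThrottler(next http.Handler) *throttler {
	t := &throttler{counts: map[string]int{}, next: next}
	go func() {
		for range time.Tick(time.Minute) {
			t.mu.Lock()
			t.counts = map[string]int{}
			t.mu.Unlock()
		}
	}()
	return t
}

func (t *throttler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	host, _, _ := net.SplitHostPort(r.RemoteAddr)
	block := host
	if ip := net.ParseIP(host); ip != nil && ip.To4() != nil {
		block = ip.Mask(net.CIDRMask(24, 32)).String() // group by /24
	}

	t.mu.Lock()
	t.counts[block]++
	n := t.counts[block]
	t.mu.Unlock()

	// Delay grows with the request count; cap it so real users behind a
	// shared NAT are merely slowed, not locked out.
	delay := time.Duration(n) * 10 * time.Millisecond
	if delay > 5*time.Second {
		delay = 5 * time.Second
	}
	time.Sleep(delay)

	t.next.ServeHTTP(w, r)
}

func main() {
	http.ListenAndServe(":8080", newThrottler(http.DefaultServeMux))
}
```

Sleeping in the handler ties up a goroutine per delayed request, which is cheap in Go; the same idea could also be pushed down into the reverse proxy instead.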
Without going into too much detail, this tracks with the trends in inquiries we're getting from new programs and existing members. A few years ago, the requests were almost exclusively related to performance, uptime, implementing OWASP rules in a WAF, or more generic volumetric impact. Now, AI scraping is increasingly something that FOSS orgs come to us for help with.
Not sure what to tell you but I surely feel quite human
Three of the pages told me to contact customer support, and the other two were a hard and useless block wall. Only from Codeberg did I get a useful response; the other two customer supports gave the typical "have you tried clearing your cookies" and restart-the-router advice, which is counterproductive because cookie tracking is often what lets one pass. Support is not prepared to deal with this, which means I can't shop at the stores whose blocking algorithms erroneously go off. I also don't think any normal person would ever contact support; I only do it to help them realise there's a problem and that they're blocking legitimate people from using the internet normally.
Beware if you employ this...
On the other hand CloudFlare and Akamai mistakenly block me all the damn time.
But to your point, the real kicker is the "many sites aren't going to get feedback from the real people they've blocked" since those tools inherently decided that the traffic was not human. You start getting into Westworld "doesn't look like anything to me" territory.
You don't know if each entry in the log is a real customer until they buy products proportional to some fraction of their page load rate, or real people until they submit useful content or whatever your site is about. Many people just read information without contributing to the site itself and that's okay, too. A list of blocked systems won't help; I run a server myself, I see the legit-looking user agent strings doing hundreds of thousands of requests, crawling past every page in sequence, but if there wasn't this inhuman request pattern and I just saw this user agent and IP address and other metadata among a list of blocked access attempts, I'd have no clue if the ban is legit or not
With these protection services, you can't know how much frustration is hiding in that paper trail, so I'm not blocking anyone from my sites; I'm making the system stand up to crawling. You have to do that regardless for search engines and traffic spikes like from HN
>I'm Not a Robot (film) https://en.m.wikipedia.org/wiki/I%27m_Not_a_Robot_(film)
Edit: and it's on YouTube in full! Was wondering which streaming service I'd have to buy for this niche genre of Dutch sci-fi but that makes life easy: https://www.youtube.com/watch?v=4VrLQXR7mKU
Final update: well, that was certainly special. Favorite moment was 10:26–10:36 ^^. Don't think that comes fully across in the baked-in English subtitles though. Overall it could have been an episode of Black Mirror, just shorter. Thanks again for the tip :)
I have to assume the Dutch movie industry just isn't too big.
I guess it's a side effect of America's media, but when I went to Europe including the Netherlands almost everyone spoke English at an almost native level.
It almost felt like playing a video game where there is an immersive mode you can just turn off if it gets too difficult ( subtitles in English at all public facilities).
One piece of feedback: Could you add some explanation (for humans) what we're supposed to do and what is happening when met by that page?
I know there is a loading animation widget thingy, but the first time I saw that page (some weeks ago at the Gnome issue tracker), it was proof-of-work'ing for like 20 seconds, and I wasn't sure what was going on, I initially thought I got blocked or that the captcha failed to load.
Of course, now I understand what it is, but I'm not sure it's 100% clear when you just see the "checking if you're a bot" page in isolation.
All of this is placeholder wording, layouts, CSS, and more. It'll be fixed in time. This is teething pain that I will get through.
Network effects, anyone? So yes, we should work on a different way of indexing the web than via Google, but that's easier said than done, I think...
Also
> https://news.ycombinator.com/item?id=43422781
Integrate a way to calculate micro-amounts of the shitcoin of your choice and we might have the another actually legitimately useful application of cryptocurrencies on our hands..!
If a GPU was required per scrape then >90% simply couldn't afford it at scale.
Oh dear, somebody is going to implement this in about an hour, aren't they....
Regardless, I think something like this is the way forward if one doesn't want to throw privacy entirely out the window.
A sha256 hash is a bunch of bytes like this:
394d1cc82924c2368d4e34fa450c6b30d5d02f8ae4bb6310e2296593008ff89f
We usually write it out in hex form, but that's literally what the bytes in ram look like. In a proof of work validation system, you take some base value (the "challenge") and a rapidly incrementing number (the "nonce"), so the thing you end up hashing is this: await sha256(`${challenge}${nonce}`);
The "difficulty" is how many leading zeroes the generated hash needs to have. When a client requests to pass the challenge, they include the nonce they used. The server then only has to do one sha256 operation: the one that confirms that the challenge (generated from request metadata) and the nonce (provided by the client) match the difficulty number of leading zeroes.The other trick is that presenting the challenge page is super cheap. I wrote that page with templ (https://templ.guide) so it compiles to native Go. This makes it as optimized as Go is modulo things like variable replacement. If this becomes a problem I plan to prerender things as much as possible. Rendering the challenge page from binary code or ram is always always always going to be so much cheaper than your webapp ever will be.
I'm planning on adding things like changing out the hash in use, but right now sha256 is the best option because most CPUs in active deployment have instructions to accelerate sha256 hashing. This combined with webcrypto jumping to heavily optimized C++ and the JIT in JS being shockingly good means that this super naïve approach is probably the most efficient way to do things right now.
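For the curious, that server-side check boils down to something like this minimal Go sketch (helper names are mine, not the actual implementation): one sha256 call plus a prefix comparison.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// verify re-does exactly one sha256 over challenge+nonce and checks that
// the hex digest starts with `difficulty` zero characters.
func verify(challenge, nonce string, difficulty int) bool {
	sum := sha256.Sum256([]byte(challenge + nonce))
	digest := hex.EncodeToString(sum[:])
	return strings.HasPrefix(digest, strings.Repeat("0", difficulty))
}

func main() {
	// The client has already searched for this nonce; the server only
	// needs the single hash inside verify to accept or reject it.
	fmt.Println(verify("example-challenge", "123456", 2))
}
```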
I'm shocked that this all works so well and I'm so glad to see it take off like it has.
I imagine it costs more resources to access the protected website, but would this stop the bots? Wouldn't they be able to pass the challenge and scrape the data afterwards? Or do normal scrape bots usually time out after a small amount of time/resources is used?
Like spam, this kind of mass-scraping only works because the cost of sending/requesting is virtually zero. Any cost is going to be a massive increase compared to 'virtually zero', at the kind of scale they operate at, even if it would be small to a normal user.
That's exactly how it works (easy for server, hard for client). Once the client completed the Proof-of-Work challenge, the server doesn't need to complete the same challenge, it only needs to validate that the results checks out.
Similar to how in Proof-of-Work blockchains where coming up with the block hashes is difficult, but validating them isn't nearly as compute-intensive.
This asymmetric computation requirement is probably the most fundamental property of Proof-of-Work, Wikipedia has more details if you're curious: https://en.wikipedia.org/wiki/Proof_of_work
Fun fact: it seems Proof-of-Work was used as a DoS preventing technique before it was used in Bitcoin/blockchains, so seems we've gone full circle :)
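To make the asymmetry concrete, here's a toy solver in Go (illustrative only, not any particular product's code): the client loops over nonces until one hashes to the required prefix, while the verifier checks any submission with a single hash. Expected client work grows roughly as 16^difficulty with this hex-prefix scheme.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// solve brute-forces nonces until sha256(challenge+nonce) starts with
// `difficulty` zero hex characters.
func solve(challenge string, difficulty int) (nonce string, tries int) {
	prefix := strings.Repeat("0", difficulty)
	for i := 0; ; i++ {
		candidate := strconv.Itoa(i)
		sum := sha256.Sum256([]byte(challenge + candidate))
		if strings.HasPrefix(hex.EncodeToString(sum[:]), prefix) {
			return candidate, i + 1
		}
	}
}

func main() {
	nonce, tries := solve("example-challenge", 4)
	fmt.Printf("found nonce %s after %d hashes\n", nonce, tries)
}
```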
Edit: I will probably send a pull request to fix it.
Because you can put your site behind an auth wall, but these new bots can solve the captchas and imitate real users like never before. Particularly if they're hitting you from residential IPs and with fake user agents like the ones in the article -- or even real user agents because they're wired up to something like Playwright.
What's left except for sites to start requiring credit cards, Worldcoin, or some equally depressing equivalent.
I don't mind registering an account for private communities, but for stuff which people put up thinking it is just going to be publicly visible it's really annoying.
I don't think these business owners really understand. Most normies just think everyone has a Facebook/Instagram account and can't even imagine a world where that is not the case.
I agree with you that it is extremely frustrating.
The people without a basic internet presence aren't likely to be customers anyway so it's not a huge loss. It's trivial to setup a basic account for any site that doesn't contain any personal data you want to keep hidden, if you aren't willing to do that, you're in a tiny minority.
It's equally trivial for a restaurant to set up a custom domain with their own 2-page website (overview and menu) on any of a hundred platforms that provide this service.
Most of these services are not free like FB, but any business that can afford a landline phone can afford a real website.
There are free ones as well, just as a subdomain (something.wordpress.com or something.wix.com), not a full top level custom domain.
Sure but they don't want to. If you want to see the menu they have online you need to follow their rules, not your own.
Obviously the restaurant has enough other customers and I have enough other restaurants to go to, so we both will be fine.
Sure, but putting their menu behind a trivial to access account shows they don't want you as a customer. You're the one complaining, not them.
I'm not sure why you think people who don't have a Facebook account wouldn't eat at restaurants.
https://github.com/mikf/gallery-dl https://git.ao2.it/tweeper.git
And those 30 seconds are a harrowing pit of despair out of which comes the rest of your life filled with advertisements, tracking, second-guessing, and accusations of being a hypocrite.
Not that it means you should just make an account to make their tracking easier...
Just to say the quiet part out loud here.. one of the biggest reasons this is depressing is that it's not only vandalism but actually vandalism with huge compounding benefits for the assholes involved and grabbing the content is just the beginning. If they take down the site forever due to increasing costs? Great, because people have to use AI for all documentation. If we retreat from captcha and force people to put in credit cards or telephone numbers? Great, because the internet is that much less anonymous. Data exfiltration leads to extra fraud? Great, you're gonna need AI to combat that. It's all pretty win-win for the bad actors.
People have discussed things like the balkanization of the internet for a long time now. One might think that the threat of that and/or the fact that it's such an unmitigated dumpster fire already might lead to some caution about making it worse! But pushing the bounds of harassment and friction that people are willing to deal with is moot anyway, because of course they have no real choice in the matter.
That we live in an internet where getting too many visitors is an existential crisis for websites should tell you that our internet is not one that can survive long.
Now there's a new generation of hungry hungry hippo indexers that didn't agree to that and who feel intense pressure from competition to scoop up as much data as they can, who just ignore it.
Legislation should have been made anyway, and those that ignore robots.txt blocked / fined / throttled / etc.
There’s other options besides a blanket ban.
If you are hosting a Forgejo instance, I strongly recommend setting DISABLE_DOWNLOAD_SOURCE_ARCHIVES to true. The crawlers will still peg your CPU but at least your disk won't be filled with zip files.
That's bad software design to generate ZIP files on the fly.
It'd be better to totally stream it of course, but that's not always an option for one reason or another.
Hm, so it's a cache then? Requesting the same tarball 100 times shouldn't create 100 zip files if they're cached, and if they aren't cached they shouldn't fill up the disk.
Clearly generating zip files, writing them fully to disk and then sending them to the client all at once is a completely awful and unusable design, compared to the proper design of incrementally generating and transmitting them to the client with minimal memory consumption and no disk usage at all.
The fact that such an absurd design is present is a sign that most likely the developers completely disregarded efficiency when making the software, and it's thus probably full of similar catastrophic issues.
For example, from a cursory look at the Forgejo source code, it appears that it spawns "git" processes to perform all git operations rather than using a dedicated library and while I haven't checked, I wouldn't be surprised if those operations were extremely far from the most efficient way of performing a given operation.
It's not surprising that the CPU is pegged at 100% load and the server is unavailable when running such extremely poor software.
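For illustration, a minimal sketch in Go of the streaming alternative (handler and paths invented): archive/zip writes compressed output straight into the HTTP response, so memory stays roughly constant and nothing is written to disk.

```go
package main

import (
	"archive/zip"
	"io"
	"net/http"
	"os"
	"path/filepath"
)

// archiveHandler streams a zip of ./repo-snapshot directly into the
// response body; no temporary file is ever created.
func archiveHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/zip")
	w.Header().Set("Content-Disposition", `attachment; filename="snapshot.zip"`)

	zw := zip.NewWriter(w) // compressed bytes go straight to the socket
	defer zw.Close()

	root := "repo-snapshot"
	filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		entry, err := zw.Create(path)
		if err != nil {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		_, err = io.Copy(entry, f) // copied in chunks, constant memory
		return err
	})
}

func main() {
	http.HandleFunc("/archive.zip", archiveHandler)
	http.ListenAndServe(":8080", nil)
}
```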
But Forgejo is not the only piece of software that can have CPU intensive endpoints. If I can't fence those off with robots.txt, should I just not be allowed to have them in the open? And if I forced people to have an account to view my packages, then surely I'd have close to 0 users for them.
This is because the only way to stop the bots is with a captcha, and this also stops search indexers from indexing your site. This will result in search engines not indexing sites, and hence providing no value anymore.
There's probably going to be a small lag as the knowledge in current LLMs dries up because no one can scrape the web in an automated fashion anymore.
It'll all burn down.
We're kind of stuck between a rock and a hard place here. Which do you prefer, entrenched incumbents or affordable/open hosting?
Anonymous browsing and potentially-malicious bots look identical. This was sort of OK up until now.
If you thought browser fingerprinting for ad tracking was creepy, just wait until they're using your actual fingerprint.
AI companies with the best anti-captcha mechanics will win and will inject ads into LLM output in more sophisticated ways.
OpenAI is going through the initial cycle of enshittification. Google is too big right now. Once they establish dominance you will have to see 5 unskippable ads between prompts, even on a paid plan.
I solved user problems for myself. Most of my web projects use client side processing. I moved to github pages. So clients can use my projects with no down time. Pages use SQLite as source of data. First browser downloads the SQLite model, then it uses it to display data on client side.
Example 'search' project: https://rumca-js.github.io/search
> I solved user problems for myself. Most of my web projects use client side processing. I moved to github pages. So clients can use my projects with no down time. Pages use SQLite as source of data. First browser downloads the SQLite model, then it uses it to display data on client side.
> Example 'search' project: https://rumca-js.github.io/search
That is not really a solution. Since typical indexing still works for the masses, your approach is currently unique. But in the end, bots will be capable of reading web page content if a human is capable of reading it. And we get back to the original problem where we try to tell bots from humans. It's the only way.
Is it going to become another race like the adblocker -> detect adblocker -> bypass adblocker detector and so on...?
Only if you operate at the scale of Cloudflare, etc. can you see which IP addresses are hitting a large number of servers in a short time span.
(I am pretty sure the next step is that they will hand out N free LLM requests per month in exchange for user machines doing the scraping, if blocking gets more successful.)
I fear the only solution in the end are CDNs, making visits expensive using challenges, or requiring users to log in.
I don't really know about this proposal; the majority of bots are going to be coming from residential IPs the minute you do this.[1]
[1] The AI SaaS will simply run a background worker on the client to do their search indexing.
Granted, I'm not looking forward to some LLM condensing all the garbage and handing me a Definitive Answer (TM) based on the information it deems relevant for inclusion.
Same for us, our forum and our Gitlab are getting hammered by AI companies bots.
Most of them don’t respect robots.txt…
You know, flood the zone with s***, Bannon-style ...
It won't work for well-structured sites where the bots know the exact endpoint they want to scrape, but might slow down the more exploratory spider threads.
Did you try to GET a canary page forbidden in robots.txt? Very well, have a bucket load of articles on the benefits of drinking bleach.
Is your user-agent too suspicious? No problem, feel free to scrape my insecure code (google "emergent misalignment" for more info).
A request rate too inhuman? Here, take those generated articles about positive effect of catching measles on performance in bed.
And so on, and so forth ...
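A toy sketch of the canary trick in Go (paths and the gibberish generator are invented for illustration): any client that fetches a URL disallowed in robots.txt gets flagged and served junk from then on.

```go
package main

import (
	"fmt"
	"math/rand"
	"net"
	"net/http"
	"sync"
)

var (
	mu       sync.Mutex
	poisoned = map[string]bool{} // IPs that fetched the canary
)

func clientIP(r *http.Request) string {
	host, _, _ := net.SplitHostPort(r.RemoteAddr)
	return host
}

// gibberish produces low-value filler text for flagged scrapers.
func gibberish() string {
	words := []string{"lorem", "ipsum", "dolor", "sit", "amet"}
	out := ""
	for i := 0; i < 200; i++ {
		out += words[rand.Intn(len(words))] + " "
	}
	return out
}

func main() {
	// robots.txt forbids the canary; honest crawlers never request it.
	http.HandleFunc("/robots.txt", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "User-agent: *\nDisallow: /canary/\n")
	})

	// Anything that requests the canary gets remembered.
	http.HandleFunc("/canary/", func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		poisoned[clientIP(r)] = true
		mu.Unlock()
		fmt.Fprint(w, gibberish())
	})

	// Normal pages: flagged clients get gibberish instead of content.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		bad := poisoned[clientIP(r)]
		mu.Unlock()
		if bad {
			fmt.Fprint(w, gibberish())
			return
		}
		fmt.Fprint(w, "real content")
	})

	http.ListenAndServe(":8080", nil)
}
```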
I love that the solution to LLM scraping is to serve the browser a proof of work before allowing access. I wonder if things like new sites start to do this... It would mean they won't be indexed by search engines, but it would help to protect the IP.
This is an aspect that a lot of PoW haters miss. While PoW is a waste, there are long-term economic incentives to minimize it, either by making it a side effect of something actually useful or by using energy that would go to waste anyway, so its overall effect gravitates toward neutral.
Unfortunately, such second-order effects are hard to explain to most people.
Say a hash challenge gets widely adopted, and scraping becomes more costly, maybe even requires GPUs. This is great, you can declare victory.
But what if after a while the scraping companies, with more resources than real users, are better able to solve the hash?
Crypto appeals here because you could make the scrapers cover the cost of serving their page.
Ofc if you’re leery of crypto you could try to find something else for bots to do. xkcd 810 for example. Or folding at home or something. But something to make the bot traffic productive, because if it’s just a hardware capability check maybe the scrapers get better hardware than the real users. Or not, no clue idk
The problem is that these companies are fairly well funded and renting infrastructure isn't an issue.
You can allow API access from cloud IPs, as long as you don't do anything expensive before you've authenticated the client.
“…they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses - mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure - actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.”
So it looks like much of the traffic, particularly from China, is indeed using consumer ips to disguise itself. That’s why they blocked based on browser type (MS Edge, in this case).
(I described my bot woes a few weeks ago at https://news.ycombinator.com/item?id=43208623. The "just block bots!" replies were well-intentioned but naive -- I've still found no signal that works reliably well to distinguish bots from real traffic.)
Either the LLM devs got more funding, or maybe the authorities took down the botnet they were using.
The advantage of a third party service is that you're sharing intel of bad actors.
How do they know that these are LLM crawlers and not anything else?
Like from their own ASNs you're saying? Or how are you connecting the IPs with the company?
> is that it's 100s of IPs doing 1 request
Are all of those IPs within the same ranges or scattered?
Thanks a lot for taking the time to talk about your experience btw, as someone who hasn't been hit by this it's interesting to have more details about it before it eventually happens.
Those are the ones that make it obvious, yes. It's not exclusive, though, but enough to connect the dots.
> Are all of those IPs within the same ranges or scattered?
The IP ranges are all over the place. Alibaba seems to have tons of small ASNs, for instance.
I can tell you what it looks like in the case of a git web interface like cgit: you get a burst of one or two isolated requests from a large number of IPs, each for very obscure (but different) URLs, like file contents at a specific commit id. And the user agent suggests it's coming from an iPhone or Android.
- We cannot block them because we can’t differentiate legitimate traffic from illegitimate traffic…
- …but we can conclusively identify this traffic as coming from AI crawlers.
Getting caught isn't a big deal. Getting caught in the act is. As long as they get their data, it doesn't matter if they're caught afterwards.
It's awful and it was costing me non-trivial amounts of money just from the constant pinging at all hours, for thousands of pages that absolutely do not need to be scraped. Which is just insane, because I actively design robots.txt to direct the robots to the correct pages to scrape.
So far so good with the honeypots, but I'll probably be creating more and clamping down harder on robots.txt to simply whitelist instead of blacklist. I'm thinking of even throwing in a robots honeypot directly in sitemap.xml that should bait robots to visit when they're not following the robots.txt.
It's really, really ridiculous.
"tens of thousands" ? I think not:
% sudo fail2ban-client status gitbots | more
Status for the jail: gitbots
|- Filter
| |- Currently failed: 0
| |- Total failed: 573555
| `- File list: /var/log/nginx/gitea_access.log
`- Actions
|- Currently banned: 78671
|- Total banned: 573074
Even though these bots are using different IPs with each request, that IP may be reused for a different website, and donating those IPs to a central system could help identify entire subnets to block.
Another trick was “tar-pitting” suspect senders (browser agent for example) to slow their message down and delay their process.
Bust the kneecaps of all the people responsible for those crawlers. Publicly. And all of them: from the person entering the command to the CEO of the company going through all the middle management. You did not go against this policy? Intact kneecaps are a privilege which just got revoked in your case.
https://en.m.wikipedia.org/wiki/Jack_Higgins
https://en.m.wikipedia.org/wiki/Liam_Devlin
Ideally a site would get scraped once, and then the scraper would check if content has changed, e.g. etag, while also learning how frequently content changes. So rather than just hammer some poor personal git repo over and over, it would learn that Monday is a good time to check if something changed and then back off for a week.
Wheels that haven't changed in years, with a "Last-Modified" and "ETag" that haven't changed.
The only thing that makes sense to me is it's cheaper them to re-pull and re-analyze the data than to develop a cache.
For what it's worth, mj12bot.com is even worse. They pull down every wheel every two or three days, even though something like chemfp-3.4-cp35-cp35m-manylinux1_x86_64.whl hasn't changed in years - it's for Python 3.5, after all.
that's not what I meant.
and it is not they, it is it.
i.e. the web server, not bots or devs on the other end of the connection, is what tells you the needed info. all you have to do is check it and act accordingly, i.e. download the changed resource or don't download the unchanged one.
google:
http header last modified
and look for the etag link too.
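For example, a hedged sketch in Go of what a polite re-crawl could do (the URL and validator values are placeholders): replay the saved ETag / Last-Modified and skip the download entirely on a 304.

```go
package main

import (
	"fmt"
	"net/http"
)

// fetchIfChanged re-requests a URL using the validators saved from the
// previous fetch; a 304 response means nothing needs to be downloaded.
func fetchIfChanged(url, etag, lastModified string) error {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	if etag != "" {
		req.Header.Set("If-None-Match", etag)
	}
	if lastModified != "" {
		req.Header.Set("If-Modified-Since", lastModified)
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusNotModified {
		fmt.Println("unchanged, skipping download")
		return nil
	}
	// Otherwise read the body and store the new validators for next time.
	fmt.Println("changed, new ETag:", resp.Header.Get("ETag"))
	return nil
}

func main() {
	_ = fetchIfChanged("https://example.org/packages/somepkg.whl",
		`"abc123"`, "Mon, 02 Jan 2006 15:04:05 GMT")
}
```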
It's that not doing so means they can increase their profit numbers just a skoshe more.
And at least as long as they haven't IPOed, that number's the only thing that matters. Everything getting in the way of increasing it is just an obstacle to be removed.
The poor implementation is not really relevant; it's companies deciding they own the internet, can take whatever they want, and can let everyone else deal with the consequences. The companies do not care what the impact of their AI nonsense is.
Crawlers have existed forever in search engine space and mostly behave.
This sort of no rate limit, fake user agent, 100s of IPs approach used by AI teams is obviously intentionally not caring who it fucks over. More malicious than sloppy implementation
I'll disagree that it's not at least individual malicious choice, though. Someone decided that they needed to fake/change user agents (as one example), and implemented it. Most likely it was more than one person- some manager(s)/teams probably also either suggested or agreed to this choice.
I would like to think at some point in this decision making process, someone would have considered 'is it ethical to change user agents to get around bans? Is it ethical to ignore robots.txt?' and decided not to proceed, but apparently that's not happening here...
the result? a mixed up version of 5000 plagiarised "baby's first webcrawler" github projects
It is literally the point of public websites to answer HTTP requests. If yours can't you're doing something wrong.
If you can't conclusively identify bots, you'll end up serving 'poisoned' responses to actual users. Doesn't seem like a viable solution.
We've never had one of these arms races end up with the defenders winning.
They will never respect you, but the second they notice this hurts their business more than it gains them, they will stop.
Thankfully, these bots were easy enough to block at the firewall level, but that may not work forever.
Given this info, the natural next question is “who is doing the harm?”
The answer is “AI companies”. Most people would now view the situation as having a lot to do with AI companies.
Bing used to do the same thing. (It might still do it, I just haven't heard about it in a while.)
Surely has very little to do with "intelligence".
And yet they are not. So what does that tell you?
What these crawlers are doing is akin to DDoS attacks.
You could also get Cloudflare, or some other CDN, but depending on your size that might not be within your budget. I don't get why the rest of the internet should subsidize these AI companies. They're not profitable, they live off venture capital, and they increase the operating costs of everyone else.
And you just know they'll gladly bill you for egress charges for their own bot traffic, too.
EDIT: Actually, this is an excellent question. By default, these bots would likely appear to come from "the internet" and thus be subject to egress charges for data transfers. Since all three major cloud providers also have significant interests in AI, wouldn't this be a sort of "silent" price increase, or a form of exploitive revenue pumping? There's nothing stopping Google, Microsoft/OpenAI, or Amazon from sending an army of bots against your sites, scraping the data, and then stiffing you with the charges for their own bots' traffic. Would be curious if anyone has read the T&Cs of their own rate cards closely enough to see if that's the case, or has proof in their billing metrics.
---
Original post continues below:
One topic of conversation I think worth having in light of this is why we still agree to charge for bandwidth consumed instead of bandwidth available, just as general industry practice. Bits are cheap in the grand scheme of things, even free, since all the associated costs are for the actual hardware infrastructure and human labor involved in setup and maintenance - the actual cost per bit transmitted is ridiculously small, infinitesimally so to be practical to bill.
It seems to me a better solution is to go back to charging for capacity instead of consumption, at least in an effort to reduce consumption charges for projects hosted. In the meantime, I'm 100% behind blocking entire ASNs and IP blocks from accessing websites or services in an effort to reduce abuse. I know a prior post about blocking the entirety of AWS ingress traffic got a high degree of skepticism and flack from the HN community about its utility, but now more than ever it seems highly relevant to those of us managing infrastructure.
Also, as an aside: all the more reason not to deploy SRV records for home-hosted services. I suspect these bots are just querying standard HTTP/S ports, and so my gut (but NOT data - I purposely don't collect analytics, even at home, so I have NO HARD EVIDENCE FOR THIS CLAIM) suggests that having nothing directly available on 80/443 will greatly limit potential scrapers.
1. Using the web would become much more compute/energy intensive and old devices would quickly lose access to the modern web.
2. Some hosts would inevitably double-dip by implementing this and ads or by "overcharging" the amount of work. There would have to be some kind of limit on how much work can be required by hosts - or at least some way to monitor and hold hosts accountable for the amount of work they charge.
3. There would need to be a cheap and reliable way to prove the client's work was correct and accurate. Otherwise people will inevitably find a way to spoof the work in order to reduce their compute/energy cost.
All you need is a central clearing house service that can handle billions of 0.000001 transactions per day.
Incidentally, I doubt the Bitcoin chain could handle that...
It's a solution that already has adoption, does not require everyone to sign up with a centralized service, and does not require everyone to pay money (they can pay with small amounts of computation instead) so it remains accessible to ~everyone.
Of course it isn't very secure, because if the client sees a mined block they might have the technical savvy to keep it. But you'd be forcing big web scrapers to run a horribly inefficient mining operation and they'd hate it. Plus you can run a blacklist of hated clients and double the difficulty for them, which is very low-cost for false positives and very high-cost for real scrapers - that isn't a result of using Bitcoin but it'd be funny.
Honestly, I don't see it necessarily as a bad thing.
I mean, at Communick I offer Matrix, Mastodon, Funkwhale and Lemmy accounts only to paying customers. As such, I have implemented payments via Stripe for convenience, but that didn't stop me from getting customers who wanted to pay directly via crypto, SEPA and even cash. It also didn't stop me from bypassing the whole system and giving my friends and family an account directly.
Why would any third party rely on authentication based on the relationship between my service and my customers?
Sounds like sanctioned racism.
I'm talking about social proof as in "You are a student of the city university, so you get an account at the library", "Julie from the book reading group wanted an account at our Bookwyrm server, so I made an account for her" or even "Unnamed customer who signed up for Cingular Wireless and was given an authorization code to access Level 2 support directly".
You are taking one thing I said (service providers will require some form of payment or social proof to give credentials to people who want to access the service), assumed the worst possible interpretation (people will only implement the worst possible forms of social proofing), and to top it off you added something else (gatekeeping) entirely on your own.
I can't dictate how you interpret my comment, but maybe you could be a bit more charitable and assume positive intent when talking with people you've never met?
Is Common Crawl data not fresh enough? Is there some other deficiency?
Whatever the problem is, a single crawler which every AI company can reference seems like a compromise to solve this issue, doesn't it?
20 years ago the fear of AI was that it would take over the world and try to kill us. Today we can clearly see that the threat of AI is the amoral humans that control it.
Nepenthes
https://arstechnica.com/tech-policy/2025/01/ai-haters-build-...
What I am thinking about may be an even worse idea.
For people actively working on these projects, how about putting the git server on a private net with VPN or SSH access?
Expose a separate, read-only static git server to the net.
Anyway, why not git clone the project and parse it locally instead of scraping the web pages? I understand that scraping works on every kind of content, but given the scale, git clone plus a periodic git fetch could save money even for the scrapers.
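For illustration, here's a rough sketch (the repository URL is hypothetical) of what a polite scraper could do instead: mirror-clone once, fetch incrementally, and read the whole history locally with plain git commands, at a tiny fraction of the cost of hammering the web UI:

    import subprocess
    from pathlib import Path

    # Hypothetical repository URL; any public git remote works the same way.
    REPO_URL = "https://example.org/project.git"
    MIRROR = Path("project.git")

    def run(*args: str) -> str:
        return subprocess.run(args, check=True, capture_output=True, text=True).stdout

    def sync() -> None:
        # One clone up front, then cheap incremental fetches: the server only
        # sends objects it hasn't sent before, instead of re-rendering
        # thousands of blame/diff pages.
        if MIRROR.exists():
            run("git", "--git-dir", str(MIRROR), "fetch", "--prune", "origin")
        else:
            run("git", "clone", "--mirror", REPO_URL, str(MIRROR))

    def all_commits() -> list[str]:
        # Parse the history locally; no further load on the forge.
        out = run("git", "--git-dir", str(MIRROR), "log", "--all",
                  "--format=%H%x00%an%x00%s")
        return out.splitlines()

    if __name__ == "__main__":
        sync()
        print(f"{len(all_commits())} commits mirrored locally")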
Finally, all of this reminds me of Peter Watts's Maelstrom, when viruses infested the Internet so much (at about this time in history) that nobody was using it anymore [1].
That's two big "ifs" for something I'm not aware of a standardized way of announcing. And the entire thing crumbles as soon as someone who wants every drop of data possible says "crawl their sites anyway to make sure they didn't forget to publish anything into the 2nd system."
This way crawlers might contribute back by providing extra storage and bandwidth.
Though something like ZeroNet seems a better approach to allow dynamic content.
I'm gonna do experiments with xeiaso.net as the main testing ground.
They really do use headless chrome for everything. My testing has shown a lot of them are on Digital Ocean. I have a list of IP addresses in case someone from there is reading this and can have a come to jesus conversation with those AI companies.
a proof of work function will end up selecting FOR them!
and now you have an experience where the bots have an easier time accessing your content than legitimate visitors
There's another aspect to this too: China and DeepSeek. While this was released by a private company, I think there's a not-insignificant chance that it reflects Chinese government policy to "commoditize your complements" [1]. Companies like OpenAI want to hide their secret sauce so it can't be reproduced. Training an LLM is expensive. If there are high-quality LLMs out there for free that you can just download, then this moat completely evaporates.
[1]: https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/
We've seen random stuff break when AWS has had outages, not because we used AWS ourselves, but because suppliers do.
Technically I'm all for kicking AWS off the internet for a day or two, for failing to police their customers, but it would just break a lot.
nothing good comes from there
unfortunately then they instantly switch to home IPs
Companies like DataImpulse [1] or ScraperAPI [2] will happily publicize their services with that specific target.
--
0: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html
1: https://dataimpulse.com/use-cases/ai-proxies/
2: https://www.scraperapi.com/solutions/ai-data/
Unethical, definitely. Illegal, no.
Examples?
At this point, I think we're well under 1% actual users on a good day.
The good news is that it's easy to disrupt these crawlers with some easy hacks. Tactical reverse slowloris is probably gonna make a comeback.
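For the curious, a toy sketch of a reverse-slowloris tarpit: if the request smells like a bot, keep the connection open and drip the response out a byte every few seconds so the crawler's connection pool silts up. The bot check here is just a hard-coded user-agent list for illustration; in practice you'd key it off whatever signal you actually trust:

    import socket
    import threading
    import time

    SUSPECT_SUBSTRINGS = (b"GPTBot", b"CCBot", b"Bytespider")  # illustrative list

    def looks_like_bot(request: bytes) -> bool:
        return any(s in request for s in SUSPECT_SUBSTRINGS)

    def tarpit(conn: socket.socket) -> None:
        # Dribble out a "valid" response one byte at a time, keeping the
        # crawler's connection (and its patience) tied up for minutes.
        conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")
        try:
            for ch in b"<html><body>" + b"Please hold." * 1000:
                conn.sendall(bytes([ch]))
                time.sleep(5)
        except OSError:
            pass  # the bot gave up; mission accomplished
        finally:
            conn.close()

    def serve(port: int = 8080) -> None:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
            srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            srv.bind(("", port))
            srv.listen()
            while True:
                conn, _ = srv.accept()
                request = conn.recv(4096)
                if looks_like_bot(request):
                    threading.Thread(target=tarpit, args=(conn,), daemon=True).start()
                else:
                    conn.sendall(b"HTTP/1.1 200 OK\r\n\r\nhello, human\r\n")
                    conn.close()

    if __name__ == "__main__":
        serve()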
The thing about AI scrapers is that they don't just do this once. They do this every day in case every file in a glibc commit from 15 years ago changed. It's absolutely maddening and I don't know why AI companies do this, but if this is not handled then the git forge falls over and nobody can use it.
Anubis is a solution that should not have to exist, but the problem it solves is catastrophically bad, so it needs to exist.
amusingly some scammy companies have crowdfunded off this type of "traffic" by citing "interest in asian markets"
However, they could just do an end run round this. In the UK they're planning to get the government to help them just grab everything for free: https://www.gov.uk/government/consultations/copyright-and-ar...
There's your answer. Lawfare works in favor of the party with deeper pockets.
This costs money, time, and ongoing commitment. FOSS isn't typically known for being overflowing with cash.
There is not really any incentive or reward for 3rd-party organizations to step in and do this.
Because reckless and greedy AI operators not only endanger FOSS projects, they threaten to collapse the freely accessible internet as a whole. Sooner or later, we will need to fight for our freedom, our rights as individual humans, against rogue AI and the overwhelming power of the mega-corporations, just as we need to fight against the concentration of content behind corporate gates today.
And I don't see any other way than taking legal action against these operators. They don't give a sh*t about the little humans, nor even about copyrights and other legal regulations.
I don't really like blocking an entire ASN, especially since I don't mind (responsible) crawling to begin with, but I was left with no choice
[1]: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...
Obviously there's still ways to pay people to run the browser but it would be nice for this activity to cost the AI company something without blocking actual people.
LLM bots are doing a great job of stress-testing infra, so if you are running abominations like GitLab or any terribly coded site and you are exposing it to the internet, you are just asking for trouble. If anything, GitLab should stop pumping out bloat and focus on some performance, because it's really bad. I would hope FOSS projects would stick to something like Forgejo, although I am not sure of the state of its CI/CD. Though my guess is that they are 85% there with 1/10 of GitLab's resources.
On the other side are of course badly coded bots that are aggressively trying to download everything. This was happening before LLMs and it just increased significantly because of them. I think we will reach a tipping point soon and then we will just assume those bots are just another malicious actor (like regular DDOS), and we will start actively taking them down, even with help of law enforcement.
Last thing I wanna see is 3-second bot challenges on every single site I visit; cookie banners are more than enough of a nightmare already.
I'll grant it can be a problem for super-heavy "application" websites where every GET is a serious computation. So I'm not surprised GitLab is having problems. Theirs is literally the most bloated and heaviest website I've ever seen. Maybe applications shouldn't be websites.
But this spreading social hysteria, this belief that all non-humans are dangerous and must be blocked is a nerd snipe. It really doesn't apply to most situations. Just running a website as we've always run them, public, and letting all user-agents access, is much less resource intensive than these various "mitigations" people are implementing. Mitigations which end up being worse than the problem itself in terms of preventing actual humans from reading text.
It's sad that it has to come to this. But especially when those "scrapers" are in a foreign country, you can't even do anything legally.
Personally, that's the approach I've followed since I first got connected to the internet around 1999: I don't share things I'm not OK with others using for whatever they want.
In physical environments where people are bombarded with low-information “noise”, gatekeeping (ie credentialism) emerges as a natural mechanism.
Like some sort of legal honeypot trap.
in all likelihood all of these assholes are paying some unscrupulous suppliers for the data so the terabytes of traffic aren’t immediately attributable to them.
> out of those only 3% passed Anubis' proof of work, hinting at 97% of the traffic being bots
This doesn't follow. If I open a link from my phone and it shows a spinner and gets hot, I'm closing it long before it gets to one minute and maybe looking for a way to contact the site's maintainer to tell them how annoying it was.
It's a shitty solution, but as it stands the status quo is quite untenable and will eventually have cloudflare as a spooky MITM for all the web's traffic.
It's also just wasting more of the planet's resources as compared to blocking
And more effort, with the only upside being that it's not immediately obvious to the bot that it is being blocked, so it'll suck in more of your pages
I understand that people are exploring options but I wouldn't label this as a solution to anything as of today's state of the art
Simple example: no legitimate user has "GPTBot" in their user agent string.
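As a minimal sketch, a WSGI middleware that refuses requests whose User-Agent contains a known AI-crawler marker. The list is illustrative and certainly incomplete, and it only catches bots honest enough to identify themselves:

    # Deny requests whose User-Agent contains a known AI-crawler marker.
    AI_UA_MARKERS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider", "PerplexityBot")

    class BlockAICrawlers:
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if any(marker in ua for marker in AI_UA_MARKERS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"AI crawling is not permitted on this site.\n"]
            return self.app(environ, start_response)

    # Usage with any WSGI app, e.g. Flask:
    #   app.wsgi_app = BlockAICrawlers(app.wsgi_app)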
It'll still trickle down to the people using the system and waste people's time (from development, to the people working to produce all the things used by these companies and mopping up the impact this has, to eventually the users), but that would definitely resolve one of my concerns
Nope.
You don't have only the AI crawlers; you also have scans and hack attempts (which look like script-kiddy stuff), all the time. Some smell of AI strapped to javascript web engines (or click farms with real humans???).
Smart: IP ranges from all over the world, and "clouds" make that even worse, since the pwned systems or bad actors (the guys who scan the whole IPv4 internet for its own good AND MANY SELL THE F* SCAN DATA: onyphe, stretchoid.com, etc.) are "moving". In other words, clouds are protecting those guys and weaponizing hackers with their massive network resources, wrecking small hosting. No cloud is spared: aws, microsoft, google, ovh, ucloud.cn, etc.
I send good vibes to the brave small hosters of open source software (as long as they are noscript/basic (x)html compatible ofc).
Many fixed-IPv4 pwned systems have been referenced by security communities, often for months, sometimes years, and the people with the right leverage don't seem to do a damn thing about it.
Currently, I wonder if I should not block all digital ocean IP ranges... and I was about to do the same with ucloud.cn IP ranges.
The second you host anything on the net, it WILL take a significant amount of your time. Presume you will be pwned; that's why security communities reference each other too.
Then I am thinking of going towards 2 types of "hosting". First type: private IPv6+port ("randomized" for each client, maybe transient in time depending on the service), thanks to those /64 prefixes (maybe /92 prefixes are a thing for mobile internet?). Yes, this is complicated and convoluted. Second type: a "standard" permanent IP, but with services implemented in a _HARDCORE_ simple way, if possible near 100% static. I am thinking of going even further: assembly on bare metal, a custom kernel based on hand compilation of linux code (RISC-V hardware ofc, FPGA for bigger hosting?).
I don't think anything will improve unless carrier scale network operators start to show their teeth.
First, it was Facebook https://news.ycombinator.com/item?id=23490367 and now it's these other companies.
What's worse? They completely ignore a simple HTTP 429 status.
503 is at least apparently understood by more crawlers/bots, but they still like to blame the victim: YouTube sends me a condescending (and inaccurate) email when it gets a 503 for ignoring cache headers and other basics it seems...
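Which is maddening, because being polite is trivial. A rough sketch of what a well-behaved fetcher could do, assuming the server sends the usual validators and Retry-After headers (the user-agent string and contact address are made up):

    import time
    import requests

    def polite_fetch(url: str, cache: dict) -> bytes | None:
        # Send validators from the previous fetch so an unchanged page costs
        # the server a 304 instead of a full render.
        headers = {"User-Agent": "example-crawler/0.1 (contact@example.org)"}
        prev = cache.get(url, {})
        if "etag" in prev:
            headers["If-None-Match"] = prev["etag"]
        if "last_modified" in prev:
            headers["If-Modified-Since"] = prev["last_modified"]

        resp = requests.get(url, headers=headers, timeout=30)

        if resp.status_code in (429, 503):
            # Back off instead of blaming the victim.
            retry = resp.headers.get("Retry-After", "60")
            time.sleep(int(retry) if retry.isdigit() else 60)
            return None
        if resp.status_code == 304:
            return prev.get("body")

        cache[url] = {"body": resp.content}
        if resp.headers.get("ETag"):
            cache[url]["etag"] = resp.headers["ETag"]
        if resp.headers.get("Last-Modified"):
            cache[url]["last_modified"] = resp.headers["Last-Modified"]
        return resp.content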
Since apparently they (the scrapers) have no intent of doing so (releasing), expect commercial open source to achieve de facto protocol status very soon. And the rest may not exist in such a centralised and free manner anymore.
The special sauce is in parsing, tokenizing, enriching etc. There is no value in re-scraping, and massive cost, right?
Just do it.
You simply can't get 40 terabytes of text without mass scraping.
Are you sure it is not just very inefficient?
Also, set AI tarpits as fake links with recursive calls. Make them mad with non-curated bullshit made from Markov chain generators until their cache begins to rot forever.
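A toy sketch of such a tarpit: every URL resolves to a deterministic page of filler plus links to more fake URLs, so a crawler that follows links never runs out. (The word list stands in for a real Markov model; the port is arbitrary.)

    import random
    from wsgiref.simple_server import make_server

    WORDS = "the of a model data scraper cache token link rot forever".split()

    def babble(seed: int, n: int = 120) -> str:
        # Cheap stand-in for a real Markov chain: deterministic per-URL noise,
        # so the same fake page always looks the same to the crawler.
        rng = random.Random(seed)
        return " ".join(rng.choice(WORDS) for _ in range(n))

    def tarpit_app(environ, start_response):
        path = environ.get("PATH_INFO", "/")
        seed = hash(path) & 0xFFFFFFFF
        rng = random.Random(seed)
        # Every fake page links to more fake pages: a maze with no exit.
        links = "".join(
            f'<a href="{path.rstrip("/")}/{rng.randrange(10**9)}">more</a> '
            for _ in range(10)
        )
        body = f"<html><body><p>{babble(seed)}</p>{links}</body></html>".encode()
        start_response("200 OK", [("Content-Type", "text/html")])
        return [body]

    if __name__ == "__main__":
        make_server("", 8081, tarpit_app).serve_forever()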
Also I would argue that not having capitalist incentives baked directly into the network is what made the web work, for good or bad. Xanadu would never have gotten off the ground if people had to pay their ISP then pay for every website, or every packet, or every clicked link or whatever.
Reading the Xanadu page on Wikipedia tells me "Every document can contain a royalty mechanism at any desired degree of granularity to ensure payment on any portion accessed, including virtual copies ("transclusions") of all or part of the document."
That would be absolute chaos at scale.
I agree that the lack of monetization was important to the development and that it would have been chaos as proposed, but will the current setup be sustainable forever in the world of AI?
We have projects like Ethereum that are specifically intended to merge payments and computing, and I wouldn't be surprised if at some point in the future, some kind of small access fee negotiated in the background without direct user involvement become a component of access. I wouldn't expect people to pay ISPs but rather some kind of token exchange to occur that would benefit both the network operators and the web hosts by verifying classes of users. Non-fungible token exchanges could be used as a kind of CATPCHA replacement by cryptographically verifying users anonymously with a third-party token holder as the intermediary.
For example, let's say Mullvad or some other VPN company purchased a small amount of verification tokens for its subscribers who pay them anonymously for an account. On the other side, let's say a government requires people to register through their ISP, and the ISP purchases the same tokens on behalf of the user, and then exchanges the tokens on behalf of the user. In either case, the person can stand behind a third party who both sends them the data they requested and exchanges the verification tokens, which the site operator could then exchange for reimbursement of their services to their hosting provider.
This is just a high-level idea of how we might get around the challenges of a web dominated by bots and AI, but I'm sure the reality of it will be more interesting.
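To make that slightly more concrete, here's a deliberately oversimplified sketch of the token flow: the intermediary mints single-use bearer tokens for its verified users, and the site operator redeems them without learning who the user is. A real design would want blind signatures so even the issuer can't link a token to a user; all the names, lifetimes, and secrets below are hypothetical:

    import hashlib
    import hmac
    import os
    import time

    # Shared secret between the token issuer (e.g. a VPN provider or ISP acting
    # as intermediary) and the verifier. Purely illustrative.
    ISSUER_SECRET = os.urandom(32)
    SPENT = set()

    def issue_token() -> str:
        # The intermediary mints a bearer token for a paying/verified user,
        # without embedding who that user is.
        nonce = os.urandom(16).hex()
        expiry = str(int(time.time()) + 3600)
        mac = hmac.new(ISSUER_SECRET, f"{nonce}:{expiry}".encode(), hashlib.sha256).hexdigest()
        return f"{nonce}:{expiry}:{mac}"

    def redeem_token(token: str) -> bool:
        # The site operator checks the signature, expiry, and single use, then
        # bills the issuer for the visit -- learning nothing about the user.
        try:
            nonce, expiry, mac = token.split(":")
        except ValueError:
            return False
        expected = hmac.new(ISSUER_SECRET, f"{nonce}:{expiry}".encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(mac, expected):
            return False
        if int(expiry) < time.time() or nonce in SPENT:
            return False
        SPENT.add(nonce)
        return True

    if __name__ == "__main__":
        t = issue_token()
        assert redeem_token(t) and not redeem_token(t)  # valid once, then spent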
Meanwhile as profit motives begin to dominate (as they inevitably would,) access to information and resources becomes more and more of a privilege than a right, and everything becomes more commercialized, faster.
I won't claim to have a better idea, though. The best solutions in my mind are simply not publishing anything to the web and letting AI choke on its own vomit, or poisoning anything you do publish, somehow.
The other day, I logged into Usenet using eternalseptember, and found out that it consisted of 95% zombies sending spam you could recognize from the start of the millennium. On one hand, it made me feel pretty nostalgic. Yay, 9/11 conspiracy theories! Yay, more all-caps deranged Illuminati conspiracies! Yay, Nigerian princes! Yay, dick pills! And an occasional on-topic message which strangely felt out of place.
On the other hand, I felt like I was in a half-dark mall bereft of most of its tenants, where the only places left are an 85-year-old watch repair shop and a photocopy service on the other end of the floor. On still another hand, it turns out I haven't missed much by not being on Usenet, as all-caps deranged conspiracy shit abounds on Facebook.
I would welcome a modern replacement for Usenet, but I feel like it would need a thorough redesign based on modern connectivity patterns and computing realities.
But I guess realistically you can't fight entropy forever. Even Hacker News, aggressively moderated as it is, is slowly but irrevocably degrading over time.
Also, I often access FIDO over NNTP.
> and found out that it consisted of 95% zombies sending spam you could recognize from the start of the millennium
I like to imagine a forgotten server, running since the mid-90s, its owners long since imprisoned for tax fraud, still pumping out its daily quota of penis enlargement spam.
The distributed nature of git is fine until you want to serve it to the world - then, you're back to bad actors. They're looking for commits because it's nicely chunked, I'm taking a guess.
They're not looking for anything specifically from what I can tell. If that was the case, they would be just cloning the git repository, as it would be the easiest way to ingest such information. Instead, they just want to guzzle every single URL they can get hold of. And a web frontend for git generates thousands of those. Every file in a repository results in dozens, if not hundreds of unique links for file revisions, blame, etc. and many of those are expensive to serve. Which is why they are often put in robots.txt, so everything was fine until the LLM crawlers came along and ignored robots.txt.
What the article describes, though, is possibly the worst way a machine can access a git repository, which is using a web UI and scraping that, instead of cloning it and adding all the commits to its training set. I feel like they simply don't give a shit. They got such a huge capital injection that they feel they can afford not to give a shit about their own cost efficiency and to go with scorched-earth tactics. After all, even their own LLMs can produce a naive scraper that wreaks havoc on the internet infrastructure, and they just let it loose. Got mine, fuck you all the way!
But then they will release some DeepSeek R(xyz), and yay, all the hackernews who were roasting them for such methods, will be applauding them for a new version of an "open source" stochastic parrot. Yay indeed.
fb (meta) & big tech put their user-contributed stuff behind a paywall, yet abuse open systems.
Where we could have once wrapped our mostly static websites in Varnish or a scalable P2P cache like Coral CDN, now we must fiddle and twiddle with robots.txt and appeal to the goodwill of megacorps who never cared about being good netizens before, even when they weren't profiting from scraping to such a degree.
This is yet another chance for me to scream into the void that we're still doing this all wrong. Our sites should work more like htmx, with full static functionality, adding dynamic embellishment when available. Business logic should happen deterministically in one place on the backend or "serverless" with some kind of distributed consensus protocol like Raft/Paxos or a CRDT, then propagate to the frontend through a RESTful protocol, similarly to how Firebase or Ruby Hotwire/Laravel Livewire work. The way that we mostly all do form validation wrong in 2 places with 2 languages is almost hilariously tragic in how predictably it happens.
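To make the validation point concrete, a tiny sketch of "define the rules once": the same table drives both the HTML attributes the browser enforces and the server-side check, so the two can't drift apart. The field names and rules here are made up:

    import re

    RULES = {
        "username": {"required": True, "maxlength": 32, "pattern": r"[a-z0-9_]+"},
        "email":    {"required": True, "maxlength": 254, "pattern": r"[^@\s]+@[^@\s]+"},
    }

    def render_input(field: str) -> str:
        # Emit the same constraints as HTML5 attributes for the browser.
        attrs = " ".join(
            "required" if k == "required" and v else f'{k}="{v}"'
            for k, v in RULES[field].items() if v
        )
        return f'<input name="{field}" {attrs}>'

    def validate(form: dict) -> dict:
        # Enforce the same constraints authoritatively on the server.
        errors = {}
        for field, rule in RULES.items():
            value = form.get(field, "")
            if rule.get("required") and not value:
                errors[field] = "required"
            elif len(value) > rule.get("maxlength", 10**6):
                errors[field] = "too long"
            elif value and not re.fullmatch(rule.get("pattern", ".*"), value):
                errors[field] = "invalid format"
        return errors

    if __name__ == "__main__":
        print(render_input("email"))
        print(validate({"username": "drew", "email": "not-an-email"}))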
But the real tragedy is that the wealthiest and most powerful companies that could have fixed web development decades ago don't care about you. Amazon, Google and Microsoft would rather double down on byzantine cloud infrastructure than devote even a fraction of their profit to pure research into actually fixing all of this.
Meanwhile the rest of us sit and spin, sacrificing the hours and days and years of our lives building out other people's ideas to make rent. Many of us know exactly how to fix things, but with infinite backlogs and never truly exiting burnout, we're too tired at the end of the day to contribute to FOSS projects and get real work done. Our valiant quest to win the internet lottery has become a death march through a seemingly inescapable tragedy of the commons.
Instead of fixing the web at a foundational level from first principles, we'll do the wrong thing like we always do and lock everything down behind login walls and endless are-you-human/2FA challenges. Then the LLMs will evolve past us and wrap our cryptic languages and frameworks in human language to a level where even pair programming won't be enough for us to decipher the code or maintain it ourselves.
If I was the developer tasked with hardening a website to LLMs, the first thing I would do is separate the static and dynamic content. I'd fix most of the responses to respect standard HTTP cache headers. Then I'd put that behind the first Cloudlare competitor I could find that promises to never have a human challenge screen. Then I'd wrap every backend API endpoint in Russian doll caching middleware. Then I'd shard the database by user id as a last resort, avoiding that at all cost by caching queries and/or using modern techniques like materialized views to put the burden of scaling on the database and scale vertically or gradually migrate the heaviest queries to a document or column-oriented store. Better yet, move to a stronger store that's already solved all of these problems, like CouchDB/PouchDB.
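As a rough sketch of that first step (making responses honestly cacheable), something like this WSGI middleware: hash the body into an ETag, answer If-None-Match with a 304, and give static-looking paths a longer max-age. The paths and lifetimes are placeholders, not recommendations:

    import hashlib

    class CacheHeaders:
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            captured = {}

            def capture(status, headers, exc_info=None):
                captured["status"], captured["headers"] = status, list(headers)

            body = b"".join(self.app(environ, capture))
            etag = '"' + hashlib.sha256(body).hexdigest()[:32] + '"'

            # Unchanged content costs a hash comparison, not a full response.
            if environ.get("HTTP_IF_NONE_MATCH") == etag:
                start_response("304 Not Modified", [("ETag", etag)])
                return [b""]

            headers = captured["headers"] + [("ETag", etag)]
            if environ.get("PATH_INFO", "").startswith(("/static/", "/assets/")):
                headers.append(("Cache-Control", "public, max-age=86400"))
            else:
                headers.append(("Cache-Control", "public, max-age=60"))
            start_response(captured["status"], headers)
            return [body]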
Then I'd build a time machine to convince everyone to do things right the first time instead of building a tech industry upon unforced errors. Oh wait, former me already tried sounding the alarm and nobody cared anyway. How can I even care anymore, when honestly I don't see any way to get out of this mess on any practical timescale? I guess the irony is that only LLMs can save us now.
robots.txt should allow excluding all AI crawlers; AI crawlers should be forced to add "AI" to their crawler User-Agent headers and also to respect a robots.txt saying they can't crawl the website
right now we need to do this:
User-agent: *
Disallow: /
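which blocks everyone, including crawlers you might actually want. If you only want to opt out of AI training, the closest thing today is listing the user-agent tokens the big operators say they honor, something like the following (the list goes stale quickly, and it only works on bots that identify themselves and actually obey robots.txt):
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: CCBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: PerplexityBot
User-agent: Bytespider
Disallow: /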
Maybe that is the problem? They misspelled robots.txt
All licenses need a clause like the following:
This software is for humans. AI training is prohibited and carries a default
penalty of $1 trillion.
That's not "no longer strong enough". That's a very strong system applying leverage to a powerful actor.
If we instead adopt the view of free software (https://www.gnu.org/philosophy/open-source-misses-the-point....), the fact that OpenAI and other large corporations train their large-language models behind closed doors - with no disclosure of their training corpus - effectively represents the biggest attack on GPL-licensed code to date.
No evidence suggests that OpenAI and others exclude GPL-licensed repositories from their training sets. And nothing prevents the incorporation of GPL-licensed code into proprietary codebases. Note that a few papers have documented the regurgitation of literal text snippets by large language models (one example: https://arxiv.org/pdf/2409.12367v2).
To me, this seems like the LLM-version of using coin-mixing to obscure the trail of Bitcoin transactions in the blockchain. The current situation also reminds me of how the generalization of the SaaS model led to the creation of the Affero GPL license (https://www.gnu.org/licenses/why-affero-gpl.html).
LLMs enable the circumvention of the spirit of free software licenses, as well as of the legal mechanisms to enforce them.
Also, I don't think a restriction on the FSF's freedom 1, "The freedom to study how the program works", based on what tools you use and how you use them fits with FSF philosophy, nor do I think it is appropriate. You should be able to run whatever analysis tools you have available to study the program. Being able to ingest a program into a local LLM model and then ask questions about the codebase before you understand it yourself is valuable. Or if you aren't a programmer, or aren't familiar with the language used, a local LLM could help you make the changes needed to add a new feature. In that situation LLMs can enable practical software freedom, for those who can't afford to pay/convince a programmer to make the changes they want.
https://www.gnu.org/philosophy/free-sw.html
In addition, OpenAI clearly do not respect copyrights and licenses in general, so would ignore any anti-AI clauses, which would make them ineffective and thus pointless. So, I think we should tackle the LLM problem through the law, and not through licenses. That is already happening with various caselaw in software, writing, artwork etc.
It isn't possible or practical to change the existing body of Free Software to use new anti-AI clauses anyway. https://juliareda.eu/2021/07/github-copilot-is-not-infringin...
BTW, LLMs could also in theory be used to licensewash proprietary software; see "Does free software benefit from ML models being derived works of training data?" by Matthew Garrett:
Regarding the licensing, I'll restate my point that the Affero license was created precisely in a moment where the existing licenses could no longer uphold the freedoms that the Free Software Foundation set out to defend. A change of license was the right solution at that particular point in time and, if it worked then, I think we can all agree that there is at least a precedent that such a course of action might work and should at the very least be considered as a possible solution for today's problems.
That said, my own personal view is more aligned with demanding that nation states pressure big corporations so that currently closed-source software becomes at least open source (either by law, or simply by stopping using it and investing their budgets in free alternatives instead). Note I said open source and not free. I just would like to read their code and feed it to my LLMs :)
On Affero, that was indeed definitely needed, although some folks on HN seem to think that privately modifying code is allowed by copyright, even if the modified version is outputting a public website, thus what the license says is irrelevant. That seems bogus to me, but seems a loophole if it is legit. Anyway, personally I think that people should simply just never use SaaS, nor web apps. It also doesn't help with data portability.
I'd go further and advocate for legally mandated source code escrow for copyright validity, and GPL like rights to the code once public, which would happen if the software is off the market for N years.
I agree 100%.
Sure, if you want to try to prevent AI training by licensing, do that, but it's no longer FOSS, so please don't call it that.
MIT license requires this:
> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
Do the AI companies follow this requirement?
I haven't seen any LLMs being able to reproduce full copies or even "substantial portions" of any existing software, unless we're talking "famous" functions like those from Quake and such.
You have any examples of that happening? I might have missed it
Also, isn't this basically just extortion? "I know you're minding your own business, FOSS maintainer, but move your code to our recommended forge so we can stop DDoSing you?"
I was actually thinking of a more general thing than just code, eg similar to CommonCrawl, but maybe a code specific thing is what is needed.
Doesn't your suggestion shift the responsibility to likely under-sponsored FOSS maintainers rather than companies? Also, how do people agree to switch to some centralized repository and how long does that take? Even if people move over, would that solve the issue? How would a scraper know not to crawl a maintainer's site? Scrapers already ignore robots.txt, so they'd probably still crawl even if you verified you've uploaded the latest content.
If you put some data in a central repository, they will take it.
Then they will go and DDoS the rest of the Internet in order to take all the rest of the data.
What?! "AI"?!?! We are talking about traffic abusers!...
To be clear, you could still have anonymous spaces like Reddit where arbitrary user IDs are used and real identities are discarded. People could opt-in to those spaces. But for most people most of the time, things get better when you can verify sources. Everything from DDOS to spam, to malware infections to personal attacks and threats will be reduced when anonymity is removed.
Yes there are downsides to this idea but I'd like people to have real conversations around those rather than throw the baby out with the bath water.
It's hard to have a serious conversation when you present a couple of upsides but completely understate/not mention the downsides.
Eliminating anonymity comes with real danger. What about whistleblowers and marginalized groups? The increased likelihood of targeted harassment, stalking, and chilling effects on free speech? The increase in surveillance? The reduction in content creation and legitimate criticism of companies/products/etc? The power imbalance granted to whoever validates identities?
pjc50 brings up some other great points, which got me thinking even more:
Removing anonymity creates a greater incentive to steal identities, has a slew of logistical issues (who/how are IDs verified, what IDs are accepted, what are the enforcement mechanisms and who enforces them, etc.), creates issues with shared accounts and corporate/brand accounts, would require cooperation across every country with internet access (good luck!) otherwise it doesn't really work, and probably a million other things if I keep thinking about it.
Doesn't this just create an even worse market for identity theft and botnets?
How does this apply to countries without a national ID system like the United States?
What do you do with an ID traced to a different country, anyway?
> personal attacks and threats will be reduced when anonymity is removed
People are happy to make death threats under their real name, newspaper byline, blue tick, or on the presidential letterhead if they're doing so from a position of power.
So do I support a fully authenticated internet? Fuck no. If we can get good at bot detection, zip bomb the fuckers. In the meantime, work as hard as we can to dismantle the hellscape that the internet has become. I'm all for decentralized, sovereign identity systems that aren't owned by some profiteering corpo cretins or some proto-fascist state, but I don't want it to be a requirement to look at photos of dogs or plan my next trip.
Such as living under logging. Which, you know (you know?), some people will radically refuse, with several crucial justifications. One of them is that privacy is a condition for Dignity. Another is Prudence. Another one is a warning millennia old, about who you should choose as a confidant. And more.
Nothing should be $$ free unless you already paid with your tax. Same principle -> if HN started charging every account, I'd be happy to pay a small amount per month. This token amount of pay per account would also reduce the number of bots.
FOSS is generally built on the idea that anyone can use the code for anything, if you start to add a price for that, not only do you effectively gate your project from "poor people", but it also kind of erodes some of the core principles behind FOSS.
There's access via (e.g.) the git protocol (git://....) and access via http.
These attacks all happen via the latter, since the former is already access-controlled.
We mirror to github for public access; our developers all use git itself, not the web interface, for interacting with the repo.
How/what github et al. are doing to deal with this, I do not know.
Edit: nevermind, I see you are using Gitea rather than cgit/etc. I guess Gitea can't disable the problematic commit/etc views.
Well, yes and no. If you had a cost to access the source code, I'm pretty sure I'd stop calling that FOSS. If you only have a price for downloading binaries, sure, still FOSS, since we're talking source code licensing.
> Nothing should be $$ free
I took this statement at face value, and assumed parent argued for basically eliminating FOSS.
Even the FSF thinks you can charge money for free software and still call it FOSS: https://www.gnu.org/philosophy/selling.html.
In other words, it was written with no consideration for performance at all.
A competent engineer would use Rust or C++ with an in-process git library, perhaps rewrite part of the git library or git storage system if necessary for high performance, and would design a fast storage system with SSDs, and rate-limit slow storage access if there has to be slow storage.
That's the actual problem, LLMs are seemingly just adding a bit of load that is exposing the extremely amateurish design of their software, unsuitable for being exposed on the public Internet.
Anyway, they can work around the problem by restricting their systems to logged-in users (and restricting registration if necessary), and by mirroring their content to well-implemented external services like GitHub or GitLab and redirecting users there.
The issue is, there aren't any fully featured ones of these yet. Sure, they do exist, but you run into issues. Spawning a git process isn't about not considering performance, it's about correctness. You simply won't be able to support a lot of people if you don't just spawn a git process.
This is a bold assumption to make on such little data other than "your opinion".
Developing in Python is not a negative and, depending on the people, the scope of the product, and the intended use, is completely acceptable. The balance of "it performs what it's needed to do in an acceptable window of performance while providing x, y, z benefits" is almost certainly a discussion the company and its developers have had.
What it never tried to solve was scaling to LLM and crawler abuse. Claiming that they have made no performance considerations because they can't scale to handle a use case they never supported is just idiotic.
>That's the actual problem, LLMs are seemingly just adding a bit of load that is exposing the extremely amateurish design of their software.
"Just adding a bit of load" != 75%+ of calls. You can't be discussing this in good faith and make simplistic reductions like this. Either you are trolling or naively blaming the victims without any rational thought or knowledge.