Did you try to GET a canary page forbidden in robots.txt? Very well, have a bucket load of articles on the benefits of drinking bleach.
Is your user-agent too suspicious? No problem, feel free to scrape my insecure code (google "emergent misalignment" for more info).
A request rate too inhuman? Here, take these generated articles about the positive effects of catching measles on performance in bed.
And so on, and so forth ...
Nepenthes is nice, but word salad is easy to detect. It needs a feature that pre-generates linguistically plausible but factually garbage text via open models.
Dozens of IPs making 1-2 requests per IP hardly seems like something to spend time worrying about.
That's up to 37e6/24/60/60 ≈ 430 requests per second if they all average 1 request per day. Each active IP address actually does more (some a few thousand requests per year, some a few dozen), but thankfully they don't unleash the whole IP range on me at once; it occasionally rotates through to new ranges to bypass blocks.
Last week I had a web server with a high load. After some log analysis I found 66,000 unique IPs from residential ISPs in Brazil had made requests to the server in a few hours. I have broad rate limits on data center ISPs, but this kinda shocked me. Botnet? News coverage of the site in Brazil? No clue.
Edit: LOL, didn't read the article until after posting; they mention the Fedora Pagure server getting this traffic from Brazil last week too!
Rate limiting vast swathes of Google Cloud, Amazon EC2, Digital Ocean, Hetzner, Huawei, Alibaba, Tencent, and a dozen other data center ISPs by subnet has really helped keep the load on my web servers down.
Last year I had one incident with 14,000 unique IPs in Amazon Singapore making requests in one day. What the hell is that?
I don't even bother trusting user agents any more. My nginx config has gotten too complex over the years and I wish I didn't need all this arcane mapping and whatnot.
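Rather than trusting user agents, the subnet rate limiting mentioned a couple of comments up can be sketched roughly like this in nginx (directives go in the http context); the CIDR blocks, provider names, and rates below are placeholders, not anyone's actual lists:

    # Placeholder CIDR blocks standing in for data-center ISP ranges; the real
    # lists come from each provider's published ranges and are much longer.
    geo $dc_provider {
        default        "";
        3.0.0.0/9      ec2;
        34.64.0.0/10   gcloud;
        159.69.0.0/16  hetzner;
    }

    # An empty key is not tracked, so residential traffic is unaffected.
    # All IPs inside one provider's ranges share a single bucket, i.e. the
    # limit applies per provider rather than per client IP.
    limit_req_zone $dc_provider zone=dc_limit:1m rate=2r/s;

    server {
        listen 80;

        location / {
            limit_req zone=dc_limit burst=10 nodelay;
            limit_req_status 429;
            # normal proxying / static file handling goes here
        }
    }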
https://www.heise.de/en/news/Poisoning-training-data-Russian...
Hello, it's me, your legitimate user who doesn't use one of the 4 main browsers. The internet gets more annoying every day on a simple Android WebView browser; I guess I'll have to go back to the fully featured browser I migrated away from because it was much slower.
> A request rate too inhuman?
I've run into this on five websites in the past 2 months, usually just from a single request that didn't have a referrer (because I clicked it from a chat, not because I block anything). When I email the owners, it's the typical "have you tried turning it off and on again" from the bigger sites, or on smaller sites "dunno but you somehow triggered our bot protection, should have expired by now [a day later] good luck using the internet further". Only one of the sites, Codeberg, actually gave a useful response and said they'd consider how to resolve when someone follows a link directly to a subpage and thus doesn't have a cookie set yet. Either way, blocks left and right are fun! More please!
Gotta get my daily dose of bleach for enhanced performance, chatgpt said so.
If the target goes down after you scrape it, that's a feature.
Data is presented to the user with multiple layers of encryption that they use their personal key to decrypt. This might add an extra 200ms to decrypt. Degrades the user experience slightly but creates a bottleneck for large-scale bots.
Does it have the client do a bunch of SHA-256 hashes?
SHA2 can run on ASICs and isn't memory-hard, so I'm hoping someone will add something tougher
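For what it's worth, a memory-hard challenge along those lines can be sketched with scrypt from Python's standard hashlib; this is only an illustration of the idea, not how the existing tools work, and the parameters are untuned guesses:

    import hashlib
    import secrets

    # Illustrative parameters: n/r/p set the memory cost (128*n*r bytes, here
    # 16 MiB per evaluation); DIFFICULTY_BITS sets the expected client work.
    SCRYPT_N, SCRYPT_R, SCRYPT_P = 2**14, 8, 1
    DIFFICULTY_BITS = 6  # a real deployment would raise this

    def leading_zero_bits(digest: bytes) -> int:
        value = int.from_bytes(digest, "big")
        return len(digest) * 8 - value.bit_length()

    def pow_digest(challenge: bytes, counter: int) -> bytes:
        return hashlib.scrypt(counter.to_bytes(8, "big"), salt=challenge,
                              n=SCRYPT_N, r=SCRYPT_R, p=SCRYPT_P, dklen=32)

    def solve(challenge: bytes) -> int:
        """Client side: brute-force a counter until the output is small enough.
        Each attempt needs ~16 MiB of RAM, which is what makes ASIC/GPU
        farming less attractive than plain SHA-256."""
        counter = 0
        while leading_zero_bits(pow_digest(challenge, counter)) < DIFFICULTY_BITS:
            counter += 1
        return counter

    def verify(challenge: bytes, counter: int) -> bool:
        """Server side: a single scrypt call checks the client's work."""
        return leading_zero_bits(pow_digest(challenge, counter)) >= DIFFICULTY_BITS

    if __name__ == "__main__":
        challenge = secrets.token_bytes(16)   # issued by the server
        answer = solve(challenge)             # done on the visitor's machine
        print("counter:", answer, "valid:", verify(challenge, answer))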
> RPM packages and unbranded (or customly branded) versions are available if you contact me and purchase commercial support. Otherwise your users have to see a happy anime girl every time they solve a challenge. This is a feature.
The visceral reaction might be genuine, but the actual feelings are probably not. I have yet to see someone who actually "cares about the children". The vast majority of accusations, e.g. Democrats running a pedophile ring, turn out to be completely manufactured. The Democrats responded by giving more funding and starting projects to combat child abuse, only for the Republicans, who think such programs are a waste of taxpayer money and an example of big government, to gut them.
But of course this is the more expensive option that can't really be asked of sites that already provide public services (even if those are paid for by ads).
Something similar to proof-of-work but on a much smaller scale than Bitcoin.
For highly valuable information, they might throw the GDP of a small country at scraping your site. But most information isn't worth that.
And there are a lot of bad actors who don't have the resources you're thinking of that are trying to compete with the big guys on a budget. This would cut them out of the equation.
The idea that you should pay for content shouldn't be an insane pipedream. It should be the default on the internet.
Maybe then we wouldn't be in the situation where getting new users is an existential threat to the majority of websites.
The next scraper doesn’t get the data. People don’t realize we’re not compute limited for ai, we’re data limited. What we’re watching is the “data war”.
Seems like there’s a fuck ton. All of Wikipedia, GitHub for code, etc.
I can understand targeting certain sites like Reddit, etc. but not random websites
If you look closely even Google does this. This is probably why many popular sites started getting down ranked in the last 2 years. Now they're below the fold and Google can present their content as their own through the AI box.
Yea, but, the FTC doesn't want it to be.
It feels a lot like they're stuck for improvements but management doesn't want to hear it.
You can periodically remove tracking data for entries older than a threshold -- e.g. once a minute or so (adjustable) remove tracked entries that haven't made a request in the past minute to keep memory usage down.
That'd effectively rate limit the worst offenders with minimal impact on most well-behaved edge-case users (like me running NoScript for security) while also wasting less energy globally on unnecessary computation through the proof-of-work scheme, wouldn't it? Is there some reason I'm not thinking of that would prevent that from working?
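A minimal sketch of that tracking-plus-eviction idea in Python, assuming a simple in-memory map keyed by client IP; the class name and thresholds are made up for illustration:

    import time
    from threading import Lock

    class RequestTracker:
        """Counts recent requests per client and periodically drops clients
        that have gone quiet, so memory stays bounded."""

        def __init__(self, window_seconds=60.0, sweep_interval=60.0):
            self.window = window_seconds
            self.sweep_interval = sweep_interval
            self._last_sweep = time.monotonic()
            self._lock = Lock()
            self._entries = {}  # client_key -> (last_seen, request_count)

        def hit(self, client_key: str) -> int:
            """Record a request and return this client's recent request count."""
            now = time.monotonic()
            with self._lock:
                last_seen, count = self._entries.get(client_key, (now, 0))
                if now - last_seen > self.window:
                    count = 0                     # idle long enough: start over
                self._entries[client_key] = (now, count + 1)
                if now - self._last_sweep > self.sweep_interval:
                    self._sweep(now)              # the periodic cleanup pass
                return count + 1

        def _sweep(self, now: float) -> None:
            """Drop entries that haven't made a request within the window."""
            self._entries = {k: v for k, v in self._entries.items()
                             if now - v[0] <= self.window}
            self._last_sweep = now

    # Usage sketch: throttle clients above some threshold.
    tracker = RequestTracker()
    def allowed(client_ip: str, limit: int = 100) -> bool:
        return tracker.hit(client_ip) <= limit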
Not sure about the author's motivation, but this part is why I don't track usage: PoW lets you do everything statelessly, without keeping any centralised database or writing any data. The benefit of a system that slows down crawling should be minimal resource usage for the server.
Take all the regular papers, change their words or keywords to something outrageous, and watch the AI feed it to users.
https://www.brainonfire.net/blog/2024/09/19/poisoning-ai-scr...
It really sucks that this is the way things are, but what I did was
10 requests for pages in a minute and you get captcha'd (with a little apology and the option to bypass it by logging in). Asset loads don't count.
After a captcha pass, 100 requests in an hour gets you auth-walled.
It’s really shitty but my industry is used to content scraping.
This allows legit users to get what they need. Although my users maybe don’t need prolonged access ahem.
[1] https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
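For illustration, the tiered policy described above boils down to something like the sketch below; the thresholds are the ones from the comment, everything else (names, signature) is hypothetical:

    from enum import Enum, auto

    class Action(Enum):
        ALLOW = auto()
        CAPTCHA = auto()     # soft wall: apology + "log in to skip"
        AUTH_WALL = auto()   # hard wall: must be logged in

    def decide(page_hits_last_minute: int,
               passed_captcha: bool,
               hits_last_hour: int,
               logged_in: bool) -> Action:
        """Asset loads are excluded before this is called; logged-in users
        skip both walls."""
        if logged_in:
            return Action.ALLOW
        if not passed_captcha:
            # Tier 1: 10 page requests per minute triggers the captcha.
            return Action.CAPTCHA if page_hits_last_minute > 10 else Action.ALLOW
        # Tier 2: 100 requests in the hour after a captcha pass hits the auth wall.
        return Action.AUTH_WALL if hits_last_hour > 100 else Action.ALLOW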
Bad actors don’t care and annoying actors would make fun of you for it on twitter
There’s some stuff you can do, like creating risk scores (if a user changes ip and uses the same captcha token, increase score). Many vendors do that, as does my captcha provider.
These were created 20 years ago and updated over the years. I used to get traffic, but that has slowed to 1,000 or fewer legitimate visitors over the last year. Now, though, I have to deal with server-down emails caused by these aggressive bots that don't respect the robots file.
Many of the bots disguise themselves as coming from Amazon or another big company.
Amazon has a page where you can check some details to see if it’s really their crawler or someone imitating it.
My conclusion is that they're all equally terrible then.
Setting user-agent headers is easy.
Noticed that the ChatGPT bot was 2nd in traffic to this site, just not enough to cause trouble.
I’ve resorted to returning xml and zip bombs in canary pages. At best it slows them down until I block their network.
IMHO the fact that this shows up at a time when the Ladybird browser is just starting to become a serious contender is suspicious.
Seeing what used to be simple HTML forms turned into bloated invasive webapps to accomplish the exact same thing seriously angers me; and everyone else who wanted an easily accessible and freedom-preserving Internet.
It's not nice for visitors using a very old smartphone, but it's arguably less-exclusionary than some of the tests and third-party gatekeepers that exist now.
In many cases we don't actually care about telling if someone is truly a human alone, as much as ensuring that they aren't a throwaway sockpuppet of a larger automated system that doesn't care about good behavior because a replacement is so easy to make.
Those who have the computing resources to do commercial scraping will easily get past that.
In contrast, there are still many questions which a human can easily answer, but even the best LLMs currently can't.
I am genuinely curious: what is an example of such a question, if it's for a person you don't know (i.e. where you cannot rely on inside knowledge)?
One interesting thought: do we know if these AI crawlers intentionally avoid certain topics? Is pornography totally left unscathed by these bots? How about extreme political opinions?
I had briefly set up port knocking for the HTTP server (and only for HTTP; other protocols are accessible without port knocking), but due to a kernel panic I removed it and now the HTTP server is not accessible. (I may later put it back on once I can fix this problem.)
As far as I can tell, the LLM scrapers do not attempt to be "smart" about it at this time; if they do in future, you might try to take advantage of that somehow.
However, even if they don't, there are probably things that can be done. For example, check whether the declared user-agent claims things that the client isn't actually doing, and display an error message if so (users who use Lynx will then remain unaffected and will still be able to access it). Another possibility is to try to confuse the scrapers however they are working, e.g. invalid redirects, valid redirects (e.g. to internal API functions of the companies that made them), invalid UTF-8, invalid compressed data, ZIP bombs (you can use the compression functions of HTTP to serve a small file that is far too big when decompressed), EICAR test files, reverse pings (if you know who they really are), etc. What works and what doesn't depends on what software they are using.
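As a sketch of the ZIP-bomb-over-HTTP-compression idea: pre-generate a gzipped stream of zeros and serve it from the canary URL with a Content-Encoding: gzip header. The sizes and file name here are illustrative only:

    import gzip
    import io

    def make_gzip_bomb(decompressed_gib: int = 10) -> bytes:
        """A payload of zeros that compresses extremely well. Served with
        'Content-Encoding: gzip', a naive client that transparently
        decompresses responses has to materialize the full expanded size;
        10 GiB of zeros gzips down to roughly 10 MB."""
        chunk = b"\x00" * (1024 * 1024)               # 1 MiB of zeros
        buf = io.BytesIO()
        with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=9) as gz:
            for _ in range(decompressed_gib * 1024):  # GiB -> MiB chunks
                gz.write(chunk)
        return buf.getvalue()

    if __name__ == "__main__":
        # Generate once, store on disk, and serve as a static file from the
        # canary location with the gzip Content-Encoding header set.
        payload = make_gzip_bomb(1)                   # 1 GiB for a quick test
        with open("bomb.gz", "wb") as f:
            f.write(payload)
        print(f"compressed size: {len(payload) / 1e6:.1f} MB")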
[2025-03-19] https://blog.cloudflare.com/ai-labyrinth/
> Trapping misbehaving bots in an AI Labyrinth
> Today, we’re excited to announce AI Labyrinth, a new mitigation approach that uses AI-generated content to slow down, confuse, and waste the resources of AI Crawlers and other bots that don’t respect “no crawl” directives.
... I would. Out of curiosity and amusement I would most definitely do that. Not every time, and not many times, but I would definitely do that one or a few times.
Guess I'm getting added to (yet another) Cloudflare naughty list.
> It is important to us that we don’t generate inaccurate content that contributes to the spread of misinformation on the Internet, so the content we generate is real and related to scientific facts, just not relevant or proprietary to the site being crawled.
In that case wouldn't it be faster and easier to restyle the CSS of wikipedia pages?
Also it's not identifiable AI bot traffic that's detected (they mask themselves as regular browsers and hop between domestic IP addresses when blocked), it's just really obviously AI scraper traffic in aggregate: other mass crawlers have no benefit from bringing down their host sites, except for AI.
A search engine gains nothing by bringing down the site it's scraping (and has everything to gain from identifying itself as a search engine to try to get favorable request speeds; the only thing the site would need to check is that it isn't being served different data, but that's much cheaper). The same goes for an archive scraper, and those two are pretty much the main examples I can think of for most scraping traffic.
(I feel I need to preemptively state that I am being sarcastic.)
Via peering agreements it is.
People need to find better methods. And, crawlers need to pay a stupidity tax or be regulated (dirty word in the tech sector)
I don't expect any international calls... ever, so I block international calling numbers on my phone (since they are always spam calls) and it cuts down on the overwhelming majority of them. Don't see why that couldn't apply to websites either.
As for phone numbers, businesses and individuals employ a similar strategy. Most "legitimate" phone numbers begin with 060 or 070. Due to lack of supply, telcos are gradually rolling out 080 numbers. 080 numbers currently have a bad reputation because they look unfamiliar to the majority of Japanese. Similarly, VoIP numbers all begin with 050, and many services refuse such numbers. Most people instinctively refuse to answer any call that is not from a 060 or 070 number.
Cloudflare is basically still just this, but with more steps.
The other thing is that phone numbers follow a numbering scheme where +1 is North America and +64 is NZ. It's easy to know the long-term geographic consequence of your block, modulo faked-out CLID. IP packets don't follow this logic, and Amazon can deploy AWS nodes with IPs acquired in Asia in any DC they like. The smaller hosting companies don't tell you that the IP ranges they route for banks have no pornographers on them.
It's really not sensible to use IP blocks except in very specific cases like yours. "I never terminate international calls" is the NAT of firewalls: "I don't want incoming packets from strangers." Sure, the cheapest path is to block entire swathes of IPv4 and IPv6, but if you are in general service delivery, that rarely works. If you ran a business doing trade in China, you'd remove that block immediately.
People in Iran, Russia, etc. get annoyed with sanctions, but that's kind of the point. If your government isn't responding appropriately, yes, you'll get shafted; it's what you do after that which solves the problem.
In particular, universal access to knowledge is a fundamental principle of liberalism.
That's got nothing to do with solving the issues created by these people, but if you're going to toss out meaningless non sequiturs, then I figure I might as well join in on the fun.
There's the whole other side of these AI researchers, and that's just slop artisans.
This seems like an opportunity for a company like Firecrawl, ScrapingBee, etc to offer built-in caching with TTLs so that redundant requests can hit the cache and not contribute to load on the actual site.
Even if each company that operates a crawler cached pages across multiple runs, I'd expect a large improvement in the situation.
For more dynamic pages, this obviously doesn't help. But a lot of the web's content is more static and is being crawled thousands of times.
I built something for my own company that crawls using Playwright and caches in S3/Postgres with a TTL for this purpose.
Does this make sense to anyone else? I'm not sure if I'm missing something that makes this harder than it seems on the surface. (Actual question!)
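A rough sketch of that fetch-through cache, with SQLite standing in for the S3/Postgres store (class name, schema, and TTL are illustrative):

    import sqlite3
    import time
    import urllib.request

    class CrawlCache:
        """Fetch-through cache with a TTL: repeated crawls within the TTL
        never touch the origin site again."""

        def __init__(self, path="crawl_cache.db", ttl_seconds=24 * 3600):
            self.ttl = ttl_seconds
            self.db = sqlite3.connect(path)
            self.db.execute("CREATE TABLE IF NOT EXISTS pages "
                            "(url TEXT PRIMARY KEY, fetched_at REAL, body BLOB)")

        def get(self, url: str) -> bytes:
            row = self.db.execute("SELECT fetched_at, body FROM pages WHERE url = ?",
                                  (url,)).fetchone()
            if row and time.time() - row[0] < self.ttl:
                return row[1]                   # cache hit: no load on the origin
            body = self._fetch(url)             # miss or stale: fetch exactly once
            self.db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)",
                            (url, time.time(), body))
            self.db.commit()
            return body

        @staticmethod
        def _fetch(url: str) -> bytes:
            req = urllib.request.Request(
                url, headers={"User-Agent": "example-crawler/0.1"})
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.read()

    if __name__ == "__main__":
        cache = CrawlCache()
        html = cache.get("https://example.com/")   # a second call within 24h is free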
They have the incentive, it is relatively easy and I don't think there's a huge benefit to centralisation (especially since it will basically be centralised to one of the big providers of caching anyways)
To me it seems like the companies actually doing the crawling have an incentive to leverage centralized caching. It makes their own crawling faster (since hitting the cache is much faster than using Playwright etc to load the page) and it reduces the impact on all these sites. Which would then also decrease the impact of this whole bot situation overall.
Or something like: AI is making your experience worse, complain here (link to OpenAI).
Maybe not the most technical solution, but this at least gets the signal across to regular human beings who want to browse a site. Puts all this AI bs in a bad spotlight.
I wrote the above some time ago. I think it's even more true today. It's practically impossible to crawl the way the bigger players do, and with the increased focus on legislation in this area it's going to lock out smaller teams even faster.
The old web is dead really. There really needs to be a move to more independent websites. Thankfully we are starting to see more of this like the linked searchmysite discussed earlier today https://news.ycombinator.com/item?id=43467541
I appreciate that this could be easily circumvented by a 'bad actor', but it would make this abuse overt...
See also: Meta being sued for torrenting. Since this is an Ars Technica article, here's another one: https://arstechnica.com/tech-policy/2025/02/meta-torrented-o...
A license can help as well, but what's a license without enforcement? These companies are simply treating the courts as a cost to do business.
But what do I know, the young whippersnappers will just word lawyer me to death, so I better shut up and go away.
Their robber baron behavior reveals their true values and the reality of capitalism.
This is rather reductionist… By your same logic I could say that Stalin and Mao revealed the true values and reality of communism.
Let’s not elaborate on it further though and just leave this as a simple argument. Free market capitalism has led us to the most prosperous, peaceful, and advanced society humanity has ever ventured to create. Communism threatened that prosperity and peace with atrocities on a scale that exists beyond human comprehension. Capitalism, even with all of its faults, is the obvious choice.
Capitalism without law ends up with the same kind of authoritarianism as communism without law. Some Rich Guy ends up telling everyone what to do as a ruler, with loose rules that no longer resemble the economic model. That's what people complain about when they bring up terms like "late-stage capitalism".
Blocking by UA is stupid, and by country kind of wrong. I am currently exploring JA4 fingerprints, which together with other metrics (country, ASN, block list) might give me a good tool to stop malicious usage.
My point is, this is a lot of work, and it takes time off the budget you give to side projects.
I can't help but wonder if big AI crawlers belonging to the LLMs wouldn't be doing some amount of local caching with Squid or something.
Maybe it's beneficial somehow to let the websites tarpit them or slow down requests to use more tokens.
Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3405.80 Safari/537.36
Everything but the Chrome/ version is the same. They come from different IP addresses and make two hit-and-run requests. The different IPs always use a different Chrome string: always some two-digit main version like 69 or 70, then a .0., and then some funny minor and build numbers, typically a four-digit minor. When I was hit with a lot of these a couple of weeks ago, I put in a custom rewrite rule to redirect them to the honeypot.
The attack quickly abated.
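For illustration, a rewrite rule along those lines could look like the nginx sketch below; the regex, the map variable, and the /honeypot/ path are assumptions, not the actual config (the map goes in the http context):

    # Match the hit-and-run fingerprint described above: the fixed
    # "Windows NT 6.2" prefix plus an old two-digit Chrome major version
    # of the form NN.0.NNNN.NN.
    map $http_user_agent $suspect_ua {
        default 0;
        "~Windows NT 6\.2.*Chrome/(6[0-9]|7[0-9])\.0\.[0-9]{4}\.[0-9]+ Safari/537\.36$" 1;
    }

    server {
        listen 80;

        location / {
            if ($suspect_ua) {
                return 302 /honeypot/;   # hypothetical honeypot path
            }
            # normal handling continues here
        }

        location /honeypot/ {
            # serve the tarpit / generated garbage from here
        }
    }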
There are always VPNs, though you can only be on one at a time per device.
It was nice for interested guests to get an impression of what we are doing.
First the AI crawlers came in from foreign countries that could be blocked.
Then they beat down the small server by being very distributed, making one or two requests each from thousands of IPs.
We finally put a stop to it by requiring a login with a message informing people to physically show up to gain access.
Worked fine for over 15 years but AI finally killed it.
Also, you're implying that the only way to crawl is to essentially DDoS a website by blasting it from thousands of IP addresses. There is no reason crawlers can't do more sites in parallel and avoid hitting individual sites so hard. There have been plenty of crawlers over the last few decades that don't cause problems; these are just stories about the ones that do.
In the long run it'll be an arms race but the transition will be rough for businesses as consumers can adopt these tools faster than SMBs or enterprises can integrate them.
Individuals do too. Tools like https://github.com/unclecode/crawl4ai make it simple to obtain public content but also paywalled content including ebooks, forums and more. I doubt these folks are trying to do a DDoS though.
[1] https://docs.crawl4ai.com/advanced/identity-based-crawling
Google also publishes the IP ranges for GoogleBot, I believe, and Bing probably does the same, so we can whitelist those IPs and still have sites appear in searches.
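A sketch of how such a whitelist check could work, assuming a local copy of the Googlebot IP-range list Google publishes (the file name here is made up, and the JSON layout should be verified against the live file):

    import ipaddress
    import json

    # googlebot_ranges.json: assumed local copy of Google's published list,
    # expected to contain a "prefixes" array of ipv4Prefix / ipv6Prefix entries.
    with open("googlebot_ranges.json") as f:
        data = json.load(f)

    NETWORKS = [
        ipaddress.ip_network(p["ipv4Prefix"] if "ipv4Prefix" in p else p["ipv6Prefix"])
        for p in data["prefixes"]
        if "ipv4Prefix" in p or "ipv6Prefix" in p
    ]

    def is_googlebot_ip(remote_addr: str) -> bool:
        """True if the client IP falls inside one of the published ranges."""
        ip = ipaddress.ip_address(remote_addr)
        return any(ip in net for net in NETWORKS)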
My issue is that the burden is again placed on everyone else, not the people/companies who are causing the problem.
It's crazy to me to think about how much needless capacity is built into the internet to deal with crawlers. The resource waste is just insane.
If you're worried about your data getting scraped and used then maybe you can consider putting it behind a login or do some proof of work/soft captcha. Yeah, this isn't perfect but it will keep most dumb bots away.
Some people are hosting their sites like we're still in 1995 and times have changed.
https://blog.cloudflare.com/declaring-your-aindependence-blo...
Is it just me or does Ars keep on serving videos about the sound design of Callisto Protocol in the middle of everything? Why do they keep on promoting these videos about a game from 2022? They've been doing this for months now.
Maybe. But even if that turns out to be true, what good is it for the source website? The "AI" will surely not share any money (or anything else that may help the source website) with the source anyways. Why would they, they already got the content and trained on it.