Did you try to GET a canary page forbidden in robots.txt? Very well, have a bucket load of articles on the benefits of drinking bleach.
Is your user-agent too suspicious? No problem, feel free to scrape my insecure code (google "emergent misalignment" for more info).
A request rate too inhuman? Here, take these generated articles about the positive effects of catching measles on performance in bed.
And so on, and so forth ...
Nepenthes is nice, but word salad is easy to detect. It needs a feature that pre-generates linguistically plausible but factually garbage text via open models.
Dozens of IPs making 1-2 requests per IP hardly seems like something to spend time worrying about.
That's up to 37e6/24/60/60 ≈ 430 requests per second if they all average 1 request per day. Each active IP address actually does more (some a few thousand requests per year, some a few dozen), but thankfully they don't unleash the whole IP range on me at once; it occasionally rotates through to new ranges to bypass blocks.
Last week I had a web server with a high load. After some log analysis I found 66,000 unique IPs from residential ISPs in Brazil had made requests to the server in a few hours. I have broad rate limits on data center ISPs, but this kinda shocked me. Botnet? News coverage of the site in Brazil? No clue.
Edit: LOL, didn't read the article until after posting; they mention the Fedora Pagure server getting this traffic from Brazil last week too!
Rate limiting vast swathes of Google Cloud, Amazon EC2, Digital Ocean, Hetzner, Huawei, Alibaba, Tencent, and a dozen other data center ISPs by subnet has really helped keep the load on my web servers down.
Last year I had one incident with 14,000 unique IPs in Amazon Singapore making requests in one day. What the hell is that?
I don't even bother trusting user agents any more. My nginx config has gotten too complex over the years and I wish I didn't need all this arcane mapping and whatnot.
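Rather than trusting user agents, the subnet rate limiting mentioned a couple of comments up can be sketched roughly like this in nginx (directives go in the http context); the CIDR blocks, provider names, and rates below are placeholders, not anyone's actual lists:

    # Placeholder CIDR blocks standing in for data-center ISP ranges; the real
    # lists come from each provider's published ranges and are much longer.
    geo $dc_provider {
        default        "";
        3.0.0.0/9      ec2;
        34.64.0.0/10   gcloud;
        159.69.0.0/16  hetzner;
    }

    # An empty key is not tracked, so residential traffic is unaffected.
    # All IPs inside one provider's ranges share a single bucket, i.e. the
    # limit applies per provider rather than per client IP.
    limit_req_zone $dc_provider zone=dc_limit:1m rate=2r/s;

    server {
        listen 80;

        location / {
            limit_req zone=dc_limit burst=10 nodelay;
            limit_req_status 429;
            # normal proxying / static file handling goes here
        }
    }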
https://www.heise.de/en/news/Poisoning-training-data-Russian...
Hello, it's me, your legitimate user who doesn't use one of the 4 main browsers. The internet gets more annoying every day on a simple Android WebView browser; I guess I'll have to go back to the fully featured browser I migrated away from because it was much slower.
> A request rate too inhuman?
I've run into this on five websites in the past 2 months, usually just from a single request that didn't have a referrer (because I clicked it from a chat, not because I block anything). When I email the owners, it's the typical "have you tried turning it off and on again" from the bigger sites, or on smaller sites "dunno but you somehow triggered our bot protection, should have expired by now [a day later] good luck using the internet further". Only one of the sites, Codeberg, actually gave a useful response and said they'd consider how to resolve when someone follows a link directly to a subpage and thus doesn't have a cookie set yet. Either way, blocks left and right are fun! More please!
Gotta get my daily dose of bleach for enhanced performance, chatgpt said so.
If the target goes down after you scrape it, that's a feature.
Data is presented to the user with multiple layers of encryption that they use their personal key to decrypt. This might add an extra 200ms to decrypt. Degrades the user experience slightly but creates a bottleneck for large-scale bots.
Does it have the client do a bunch of SHA-256 hashes?
SHA2 can run on ASICs and isn't memory-hard, so I'm hoping someone will add something tougher
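For what it's worth, a memory-hard challenge along those lines can be sketched with scrypt from Python's standard hashlib; this is only an illustration of the idea, not how the existing tools work, and the parameters are untuned guesses:

    import hashlib
    import secrets

    # Illustrative parameters: n/r/p set the memory cost (128*n*r bytes, here
    # 16 MiB per evaluation); DIFFICULTY_BITS sets the expected client work.
    SCRYPT_N, SCRYPT_R, SCRYPT_P = 2**14, 8, 1
    DIFFICULTY_BITS = 6  # a real deployment would raise this

    def leading_zero_bits(digest: bytes) -> int:
        value = int.from_bytes(digest, "big")
        return len(digest) * 8 - value.bit_length()

    def pow_digest(challenge: bytes, counter: int) -> bytes:
        return hashlib.scrypt(counter.to_bytes(8, "big"), salt=challenge,
                              n=SCRYPT_N, r=SCRYPT_R, p=SCRYPT_P, dklen=32)

    def solve(challenge: bytes) -> int:
        """Client side: brute-force a counter until the output is small enough.
        Each attempt needs ~16 MiB of RAM, which is what makes ASIC/GPU
        farming less attractive than plain SHA-256."""
        counter = 0
        while leading_zero_bits(pow_digest(challenge, counter)) < DIFFICULTY_BITS:
            counter += 1
        return counter

    def verify(challenge: bytes, counter: int) -> bool:
        """Server side: a single scrypt call checks the client's work."""
        return leading_zero_bits(pow_digest(challenge, counter)) >= DIFFICULTY_BITS

    if __name__ == "__main__":
        challenge = secrets.token_bytes(16)   # issued by the server
        answer = solve(challenge)             # done on the visitor's machine
        print("counter:", answer, "valid:", verify(challenge, answer))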
> RPM packages and unbranded (or customly branded) versions are available if you contact me and purchase commercial support. Otherwise your users have to see a happy anime girl every time they solve a challenge. This is a feature.
The visceral reaction might be genuine, but the actual feelings are probably not. I have yet to see someone who actually "cares about the children". The vast majority of accusations, e.g. Democrats running a pedophile ring, turn out to be completely manufactured. The Democrats responded by giving more funding and starting projects to combat child abuse, only for the Republicans, who think such programs are a waste of taxpayer money and an example of big government, to gut them.
But of course this is the more expensive option that can't really be asked of sites that already provide public services (even if those are paid for by ads).
Something similar to proof-of-work but on a much smaller scale than Bitcoin.
For highly valuable information, they might throw the GDP of a small country at scraping your site. But most information isn't worth that.
And there are a lot of bad actors who don't have the resources you're thinking of that are trying to compete with the big guys on a budget. This would cut them out of the equation.
The idea that you should pay for content shouldn't be an insane pipedream. It should be the default on the internet.
Maybe then we wouldn't be in the situation where getting new users is an existential threat to the majority of websites.
The next scraper doesn’t get the data. People don’t realize we’re not compute limited for ai, we’re data limited. What we’re watching is the “data war”.
Seems like there’s a fuck ton. All of Wikipedia, GitHub for code, etc.
I can understand targeting certain sites like Reddit, etc. but not random websites
If you look closely even Google does this. This is probably why many popular sites started getting down ranked in the last 2 years. Now they're below the fold and Google can present their content as their own through the AI box.
Yea, but, the FTC doesn't want it to be.
It feels a lot like they're stuck for improvements but management doesn't want to hear it.
You can periodically remove tracking data for entries older than a threshold -- e.g. once a minute or so (adjustable) remove tracked entries that haven't made a request in the past minute to keep memory usage down.
That'd effectively rate limit the worst offenders with minimal impact on most well-behaved edge-case users (like me running NoScript for security) while also wasting less energy globally on unnecessary computation through the proof-of-work scheme, wouldn't it? Is there some reason I'm not thinking of that would prevent that from working?
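A minimal sketch of that tracking-plus-eviction idea in Python, assuming a simple in-memory map keyed by client IP; the class name and thresholds are made up for illustration:

    import time
    from threading import Lock

    class RequestTracker:
        """Counts recent requests per client and periodically drops clients
        that have gone quiet, so memory stays bounded."""

        def __init__(self, window_seconds=60.0, sweep_interval=60.0):
            self.window = window_seconds
            self.sweep_interval = sweep_interval
            self._last_sweep = time.monotonic()
            self._lock = Lock()
            self._entries = {}  # client_key -> (last_seen, request_count)

        def hit(self, client_key: str) -> int:
            """Record a request and return this client's recent request count."""
            now = time.monotonic()
            with self._lock:
                last_seen, count = self._entries.get(client_key, (now, 0))
                if now - last_seen > self.window:
                    count = 0                     # idle long enough: start over
                self._entries[client_key] = (now, count + 1)
                if now - self._last_sweep > self.sweep_interval:
                    self._sweep(now)              # the periodic cleanup pass
                return count + 1

        def _sweep(self, now: float) -> None:
            """Drop entries that haven't made a request within the window."""
            self._entries = {k: v for k, v in self._entries.items()
                             if now - v[0] <= self.window}
            self._last_sweep = now

    # Usage sketch: throttle clients above some threshold.
    tracker = RequestTracker()
    def allowed(client_ip: str, limit: int = 100) -> bool:
        return tracker.hit(client_ip) <= limit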
Not sure about the author's motivation, but this part is why I don't track usage: PoW lets you do everything statelessly, without keeping any centralised database or writing any data. The benefit of a system that slows down crawling should be minimal resource usage for the server.
Take all the regular papers, change their words or keywords to something outrageous, and watch the AI feed it to users.
https://www.brainonfire.net/blog/2024/09/19/poisoning-ai-scr...
It really sucks that this is the way things are, but what I did was
10 requests for pages in a minute and you get captcha'd (with a little apology and the option to bypass it by logging in). Asset loads don't count.
After a captcha pass, 100 requests in an hour gets you auth-walled.
It’s really shitty but my industry is used to content scraping.
This allows legit users to get what they need. Although my users maybe don’t need prolonged access ahem.
[1] https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
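For illustration, the tiered policy described above boils down to something like the sketch below; the thresholds are the ones from the comment, everything else (names, signature) is hypothetical:

    from enum import Enum, auto

    class Action(Enum):
        ALLOW = auto()
        CAPTCHA = auto()     # soft wall: apology + "log in to skip"
        AUTH_WALL = auto()   # hard wall: must be logged in

    def decide(page_hits_last_minute: int,
               passed_captcha: bool,
               hits_last_hour: int,
               logged_in: bool) -> Action:
        """Asset loads are excluded before this is called; logged-in users
        skip both walls."""
        if logged_in:
            return Action.ALLOW
        if not passed_captcha:
            # Tier 1: 10 page requests per minute triggers the captcha.
            return Action.CAPTCHA if page_hits_last_minute > 10 else Action.ALLOW
        # Tier 2: 100 requests in the hour after a captcha pass hits the auth wall.
        return Action.AUTH_WALL if hits_last_hour > 100 else Action.ALLOW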
Bad actors don’t care and annoying actors would make fun of you for it on twitter
There’s some stuff you can do, like creating risk scores (if a user changes ip and uses the same captcha token, increase score). Many vendors do that, as does my captcha provider.
These were created 20 years ago and updated over the years. I used to get traffic, but that has slowed to 1,000 or fewer legitimate visitors over the last year. Now, though, I have to deal with server-down emails caused by these aggressive bots that don't respect the robots file.
Many of the bots disguise themselves as coming from Amazon or another big company.
Amazon has a page where you can check some details to see if it’s really their crawler or someone imitating it.
My conclusion is that they're all equally terrible then.
Setting user-agent headers is easy.
Noticed that the ChatGPT bot was 2nd in traffic to this site, just not enough to cause trouble.
I’ve resorted to returning xml and zip bombs in canary pages. At best it slows them down until I block their network.
IMHO the fact that this shows up at a time when the Ladybird browser is just starting to become a serious contender is suspicious.
Seeing what used to be simple HTML forms turned into bloated invasive webapps to accomplish the exact same thing seriously angers me; and everyone else who wanted an easily accessible and freedom-preserving Internet.
It's not nice for visitors using a very old smartphone, but it's arguably less-exclusionary than some of the tests and third-party gatekeepers that exist now.
In many cases we don't actually care about telling if someone is truly a human alone, as much as ensuring that they aren't a throwaway sockpuppet of a larger automated system that doesn't care about good behavior because a replacement is so easy to make.
Those who have the computing resources to do commercial scraping will easily get past that.
In contrast, there are still many questions which a human can easily answer, but even the best LLMs currently can't.
I am genuinely curious: what is an example of such a question, if it's for a person you don't know (i.e. where you cannot rely on inside knowledge)?
One interesting thought: do we know if these AI crawlers intentionally avoid certain topics? Is pornography totally left unscathed by these bots? How about extreme political opinions?
I had briefly set up port knocking for the HTTP server (and only for HTTP; other protocols are accessible without port knocking), but due to a kernel panic I removed it and now the HTTP server is not accessible. (I may later put it back on once I can fix this problem.)
As far as I can tell, the LLM scrapers do not attempt to be "smart" about it at this time; if they do in future, you might try to take advantage of that somehow.
However, even if they don't, there are probably things that can be done. For example, check whether the declared user-agent claims things that the client isn't actually doing, and display an error message if so (users who use Lynx will then remain unaffected and will still be able to access it). Another possibility is to try to confuse the scrapers however they are working, e.g. invalid redirects, valid redirects (e.g. to internal API functions of the companies that made them), invalid UTF-8, invalid compressed data, ZIP bombs (you can use the compression functions of HTTP to serve a small file that is far too big when decompressed), EICAR test files, reverse pings (if you know who they really are), etc. What works and what doesn't depends on what software they are using.
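As a sketch of the ZIP-bomb-over-HTTP-compression idea: pre-generate a gzipped stream of zeros and serve it from the canary URL with a Content-Encoding: gzip header. The sizes and file name here are illustrative only:

    import gzip
    import io

    def make_gzip_bomb(decompressed_gib: int = 10) -> bytes:
        """A payload of zeros that compresses extremely well. Served with
        'Content-Encoding: gzip', a naive client that transparently
        decompresses responses has to materialize the full expanded size;
        10 GiB of zeros gzips down to roughly 10 MB."""
        chunk = b"\x00" * (1024 * 1024)               # 1 MiB of zeros
        buf = io.BytesIO()
        with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=9) as gz:
            for _ in range(decompressed_gib * 1024):  # GiB -> MiB chunks
                gz.write(chunk)
        return buf.getvalue()

    if __name__ == "__main__":
        # Generate once, store on disk, and serve as a static file from the
        # canary location with the gzip Content-Encoding header set.
        payload = make_gzip_bomb(1)                   # 1 GiB for a quick test
        with open("bomb.gz", "wb") as f:
            f.write(payload)
        print(f"compressed size: {len(payload) / 1e6:.1f} MB")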
[2025-03-19] https://blog.cloudflare.com/ai-labyrinth/
> Trapping misbehaving bots in an AI Labyrinth
> Today, we’re excited to announce AI Labyrinth, a new mitigation approach that uses AI-generated content to slow down, confuse, and waste the resources of AI Crawlers and other bots that don’t respect “no crawl” directives.
... I would. Out of curiosity and amusement I would most definitely do that. Not every time, and not many times, but I would definitely do that one or a few times.
Guess I'm getting added to (yet another) Cloudflare naughty list.
> It is important to us that we don’t generate inaccurate content that contributes to the spread of misinformation on the Internet, so the content we generate is real and related to scientific facts, just not relevant or proprietary to the site being crawled.
In that case wouldn't it be faster and easier to restyle the CSS of wikipedia pages?
Also it's not identifiable AI bot traffic that's detected (they mask themselves as regular browsers and hop between domestic IP addresses when blocked), it's just really obviously AI scraper traffic in aggregate: other mass crawlers have no benefit from bringing down their host sites, except for AI.
A search engine gains nothing by bringing down the site it's scraping (and has everything to gain from identifying itself as a search engine to try to get favorable request speeds; the only thing the site would need to check is that it isn't being served different data, but that's much cheaper). The same goes for an archive scraper, and those two are pretty much the main examples I can think of for most scraping traffic.
(I feel I need to preemptively state that I am being sarcastic.)
Via peering agreements it is.
People need to find better methods. And, crawlers need to pay a stupidity tax or be regulated (dirty word in the tech sector)
I don't expect any international calls... ever, so I block international calling numbers on my phone (since they are always spam calls) and it cuts down on the overwhelming majority of them. Don't see why that couldn't apply to websites either.
As for phone numbers, businesses and individuals employ a similar strategy. Most "legitimate" phone numbers begin with 060 or 070. Due to lack of supply, telcos are gradually rolling out 080 numbers. 080 numbers currently have a bad reputation because they look unfamiliar to the majority of Japanese. Similarly, VoIP numbers all begin with 050, and many services refuse such numbers. Most people instinctively refuse to answer any call that is not from a 060 or 070 number.
Cloudflare is basically still just this, but with more steps.
The other thing is that phone numbers follow a numbering scheme where +1 is North America and +64 is NZ. It's easy to know the long-term geographic consequence of your block, modulo faked-out CLID. IP packets don't follow this logic, and Amazon can deploy AWS nodes with IPs acquired in Asia in any DC they like. The smaller hosting companies don't tell you that the IP ranges they route for banks have no pornographers on them.
It's really not sensible to use IP blocks except in very specific cases like yours. "I never terminate international calls" is the NAT of firewalls: "I don't want incoming packets from strangers." Sure, the cheapest path is to block entire swathes of IPv4 and IPv6, but if you are in general service delivery, that rarely works. If you ran a business doing trade in China, you'd remove that block immediately.
People in Iran, Russia, etc. get annoyed with sanctions, but that's kind of the point. If your government isn't responding appropriately, yes, you'll get shafted; it's what you do after that which solves the problem.
In particular, universal access to knowledge is a fundamental principle of liberalism.
That's got nothing to do with solving the issues created by these people, but if you're going to toss out meaningless non sequiturs, then I figure I might as well join in on the fun.
There's the whole other side of these AI researchers, and that's just slop artisans.
This seems like an opportunity for a company like Firecrawl, ScrapingBee, etc to offer built-in caching with TTLs so that redundant requests can hit the cache and not contribute to load on the actual site.
Even if each company that operates a crawler cached pages across multiple runs, I'd expect a large improvement in the situation.
For more dynamic pages, this obviously doesn't help. But a lot of the web's content is more static and is being crawled thousands of times.
I built something for my own company that crawls using Playwright and caches in S3/Postgres with a TTL for this purpose.
Does this make sense to anyone else? I'm not sure if I'm missing something that makes this harder than it seems on the surface. (Actual question!)
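A rough sketch of that fetch-through cache, with SQLite standing in for the S3/Postgres store (class name, schema, and TTL are illustrative):

    import sqlite3
    import time
    import urllib.request

    class CrawlCache:
        """Fetch-through cache with a TTL: repeated crawls within the TTL
        never touch the origin site again."""

        def __init__(self, path="crawl_cache.db", ttl_seconds=24 * 3600):
            self.ttl = ttl_seconds
            self.db = sqlite3.connect(path)
            self.db.execute("CREATE TABLE IF NOT EXISTS pages "
                            "(url TEXT PRIMARY KEY, fetched_at REAL, body BLOB)")

        def get(self, url: str) -> bytes:
            row = self.db.execute("SELECT fetched_at, body FROM pages WHERE url = ?",
                                  (url,)).fetchone()
            if row and time.time() - row[0] < self.ttl:
                return row[1]                   # cache hit: no load on the origin
            body = self._fetch(url)             # miss or stale: fetch exactly once
            self.db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)",
                            (url, time.time(), body))
            self.db.commit()
            return body

        @staticmethod
        def _fetch(url: str) -> bytes:
            req = urllib.request.Request(
                url, headers={"User-Agent": "example-crawler/0.1"})
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.read()

    if __name__ == "__main__":
        cache = CrawlCache()
        html = cache.get("https://example.com/")   # a second call within 24h is free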
They have the incentive, it is relatively easy and I don't think there's a huge benefit to centralisation (especially since it will basically be centralised to one of the big providers of caching anyways)
To me it seems like the companies actually doing the crawling have an incentive to leverage centralized caching. It makes their own crawling faster (since hitting the cache is much faster than using Playwright etc to load the page) and it reduces the impact on all these sites. Which would then also decrease the impact of this whole bot situation overall.
Or something like: AI is making your experience worse, complain here (link to OpenAI).
Maybe not the most technical solution, but this at least gets the signal across to regular human beings who want to browse a site. Puts all this AI bs in a bad spotlight.
I wrote the above some time ago. I think it's even more true today. It's practically impossible to crawl the way the bigger players do, and with the increased focus on legislation in this area it's going to lock out smaller teams even faster.
The old web is dead really. There really needs to be a move to more independent websites. Thankfully we are starting to see more of this like the linked searchmysite discussed earlier today https://news.ycombinator.com/item?id=43467541
I appreciate that this could be easily circumvented by a 'bad actor', but it would make this abuse overt...
See also: Meta being sued for torrenting. Since this is an Ars Technica article, here's another one: https://arstechnica.com/tech-policy/2025/02/meta-torrented-o...
A license can help as well, but what's a license without enforcement? These companies are simply treating the courts as a cost to do business.
But what do I know, the young whippersnappers will just word lawyer me to death, so I better shut up and go away.
Their robber baron behavior reveals their true values and the reality of capitalism.
This is rather reductionist… By your same logic I could say that Stalin and Mao revealed the true values and reality of communism.
Let’s not elaborate on it further though and just leave this as a simple argument. Free market capitalism has led us to the most prosperous, peaceful, and advanced society humanity has ever ventured to create. Communism threatened that prosperity and peace with atrocities on a scale that exists beyond human comprehension. Capitalism, even with all of its faults, is the obvious choice.
Capitalism without law ends up with the same kind of authoritarianism as communism without law. Some Rich Guy ends up telling everyone what to do as a ruler, with loose rules that no longer resemble the economic model. That's what people complain about when they bring up terms like "late-stage capitalism".
Blocking by UA is stupid, and by country kind of wrong. I am currently exploring JA4 fingerprints, which together with other metrics (country, ASN, block list) might give me a good tool to stop malicious usage.
My point is, this is a lot of work, and it takes time off the budget you give to side projects.
I can't help but wonder if big AI crawlers belonging to the LLMs wouldn't be doing some amount of local caching with Squid or something.
Maybe it's beneficial somehow to let the websites tarpit them or slow down requests to use more tokens.
Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3405.80 Safari/537.36
Everything but the Chrome/ version is the same. They come from different IP addresses and make two hit-and-run requests. The different IPs always use a different Chrome string: always some two-digit main version like 69 or 70, then a .0., and then some funny minor and build numbers, typically a four-digit minor. When I was hit with a lot of these a couple of weeks ago, I put in a custom rewrite rule to redirect them to the honeypot.
The attack quickly abated.
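For illustration, a rewrite rule along those lines could look like the nginx sketch below; the regex, the map variable, and the /honeypot/ path are assumptions, not the actual config (the map goes in the http context):

    # Match the hit-and-run fingerprint described above: the fixed
    # "Windows NT 6.2" prefix plus an old two-digit Chrome major version
    # of the form NN.0.NNNN.NN.
    map $http_user_agent $suspect_ua {
        default 0;
        "~Windows NT 6\.2.*Chrome/(6[0-9]|7[0-9])\.0\.[0-9]{4}\.[0-9]+ Safari/537\.36$" 1;
    }

    server {
        listen 80;

        location / {
            if ($suspect_ua) {
                return 302 /honeypot/;   # hypothetical honeypot path
            }
            # normal handling continues here
        }

        location /honeypot/ {
            # serve the tarpit / generated garbage from here
        }
    }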
There are always VPNs, though you can only be on one at a time per device.
It was nice for interested guests to get an impression of what we are doing.
First the AI crawlers came in from foreign countries that could be blocked.
Then they beat down the small server by being very distributed, making one or two requests each from thousands of IPs.
We finally put a stop to it by requiring a login with a message informing people to physically show up to gain access.
Worked fine for over 15 years but AI finally killed it.
Also, you're implying that the only way to crawl is to essentially DDoS a website by blasting it from thousands of IP addresses. There is no reason crawlers can't do more sites in parallel and avoid hitting individual sites so hard. There have been plenty of crawlers over the last few decades that don't cause problems; these are just stories about the ones that do.
In the long run it'll be an arms race but the transition will be rough for businesses as consumers can adopt these tools faster than SMBs or enterprises can integrate them.
Individuals do too. Tools like https://github.com/unclecode/crawl4ai make it simple to obtain public content but also paywalled content including ebooks, forums and more. I doubt these folks are trying to do a DDoS though.
[1] https://docs.crawl4ai.com/advanced/identity-based-crawling
Google also publishes the IP ranges for GoogleBot, I believe, and Bing probably does the same, so we can whitelist those IPs and still have sites appear in searches.
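A sketch of how such a whitelist check could work, assuming a local copy of the Googlebot IP-range list Google publishes (the file name here is made up, and the JSON layout should be verified against the live file):

    import ipaddress
    import json

    # googlebot_ranges.json: assumed local copy of Google's published list,
    # expected to contain a "prefixes" array of ipv4Prefix / ipv6Prefix entries.
    with open("googlebot_ranges.json") as f:
        data = json.load(f)

    NETWORKS = [
        ipaddress.ip_network(p["ipv4Prefix"] if "ipv4Prefix" in p else p["ipv6Prefix"])
        for p in data["prefixes"]
        if "ipv4Prefix" in p or "ipv6Prefix" in p
    ]

    def is_googlebot_ip(remote_addr: str) -> bool:
        """True if the client IP falls inside one of the published ranges."""
        ip = ipaddress.ip_address(remote_addr)
        return any(ip in net for net in NETWORKS)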
My issue is that the burden is again placed on everyone else, not the people/companies who are causing the problem.
It's crazy to me to think about how much needless capacity is built into the internet to deal with crawlers. The resource waste is just insane.
If you're worried about your data getting scraped and used then maybe you can consider putting it behind a login or do some proof of work/soft captcha. Yeah, this isn't perfect but it will keep most dumb bots away.
Some people are hosting their sites like we're still in 1995 and times have changed.
https://blog.cloudflare.com/declaring-your-aindependence-blo...
Is it just me or does Ars keep on serving videos about the sound design of Callisto Protocol in the middle of everything? Why do they keep on promoting these videos about a game from 2022? They've been doing this for months now.
Maybe. But even if that turns out to be true, what good is it for the source website? The "AI" will surely not share any money (or anything else that may help the source website) with the source anyways. Why would they, they already got the content and trained on it.