"However, when authorizers later expressly revoke authorization—for example, through unambiguous written cease and desist communications that defendants receive and understand—the Department will consider defendants from that point onward not to be authorized."
So, you get a lawyer to write an "unambiguous cease and desist" letter. You have it delivered to Amazon by either registered mail or a process server, as recommended by the lawyer. Probably both, plus email.
Then you wait and see if Amazon stops.
If they don't stop, you can file a criminal complaint. That will get Amazon's attention.
That’s if the requests are actually coming from Amazon, which seems very unlikely given some of the details in the post (rotating user agents, residential IPs, seemingly not interpreting robots.txt). The Amazon bot should come from known Amazon IP ranges and respect robots.txt. An Amazon engineer confirmed it in another comment: https://news.ycombinator.com/item?id=42751729
The blog post mentions things like changing user agent strings, ignoring robots.txt, and residential IP blocks. If the only thing that matches Amazon is the “AmazonBot” User Agent string but not the IP ranges or behavior then lighting your money on fire would be just as effective as hiring a lawyer to write a letter to Amazon.
> If I don't get a response by next Tuesday, I'm getting a lawyer to write a formal cease and desist letter.
Given the details, I wouldn’t waste your money on lawyers unless you have some information other than the user agent string.
Really puts me off it.
And even then, it's probably not going to be easy
(Disallowing * isn’t usually an option since it makes you disappear from search engines).
Put a link somewhere on your site that no human would visit, disallow it in robots.txt (under a wildcard, because apparently OpenAI's crawler specifically ignores wildcards), and when an IP address visits the link, ban it for 24 hours.
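Something like this is enough for the ban bookkeeping (a minimal sketch; TRAP_PATH, the 24-hour window, and handle_request are all illustrative names you'd wire into whatever server or middleware you actually run):

import time

TRAP_PATH = "/do-not-follow"        # also listed under Disallow: in robots.txt
BAN_SECONDS = 24 * 60 * 60          # 24-hour ban
banned = {}                         # ip -> unix timestamp when the ban expires

def is_banned(ip):
    expiry = banned.get(ip)
    if expiry is None:
        return False
    if time.time() > expiry:
        del banned[ip]              # ban has lapsed, forget the IP
        return False
    return True

def handle_request(ip, path):
    """Return an HTTP status code for this request."""
    if is_banned(ip):
        return 403
    if path == TRAP_PATH:
        banned[ip] = time.time() + BAN_SECONDS   # first hit on the trap bans the IP
        return 403
    return 200                      # otherwise serve normally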
but due to the number of IPs involved this did not have any impact on the amount of traffic
my suggestion is to look very closely at the headers that you receive (varnishlog is very nice for this), and if you stare long enough at them you might spot something that all those requests have in common that would allow you to easily identify them (like a very specific and unusual combination of reported language and geolocation, or the same outdated browser version, etc.)
"The figure shows that although the probers use thousands of source IP addresses, they cannot be fully independent, because they share a small number of TCP timestamp sequences"
If they rotate IPs, ban by ASN, have a page with some randomized pseudo-looking content in the source (not static), and explain that the traffic allocated to this ASN has exceeded normal user limits and has been rate limited (to a crawl).
Have graduated responses, starting at a 72-hour ban where every request thereafter, regardless of URI, results in that page and rate limit. Include a contact email address that is dynamically generated by bucket, and validate that all inbound mail matches DMARC for Amazon. Be ready to provide a log of abusive IP addresses.
That way, if Amazon wants to take action, they can, but it's in their ballpark. You gatekeep what they can do on your site with your bandwidth. Letting them run hog wild and steal bandwidth from you programmatically is unacceptable.
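A rough sketch of that graduated-ban bookkeeping, assuming you have some prefix-to-ASN lookup (asn_of is a placeholder, and the doubling schedule is just one possible escalation):

import time

BASE_BAN = 72 * 60 * 60    # first offence: 72-hour ban
strikes = {}               # asn -> number of offences so far
ban_until = {}             # asn -> unix timestamp the ban expires

def asn_of(ip):
    # Placeholder: plug in your own prefix-to-ASN lookup here
    # (e.g. a table built from BGP/RouteViews dumps).
    raise NotImplementedError

def record_abuse(ip):
    asn = asn_of(ip)
    strikes[asn] = strikes.get(asn, 0) + 1
    # Graduated response: each repeat offence doubles the ban (72h, 144h, 288h, ...).
    ban_until[asn] = time.time() + BASE_BAN * 2 ** (strikes[asn] - 1)

def is_rate_limited(ip):
    return time.time() < ban_until.get(asn_of(ip), 0.0)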
<https://www.routeviews.org/routeviews/>
That also provides the associated AS, enabling blocking at that level as well, if warranted.
In any case, I agree with the sarcasm. Blocking data center IPs may not help the OP, because some of the bots are resorting to residential IP addresses.
Ultimately in two party communications, computers are mostly constrained by determinism, and the resulting halting/undecidability problems (in core computer science).
All AI models are really bad at solving stochastic types of problems. They can approximate generally only to a point, after which it falls off. Temporal consistency in time series data is also a major weakness. Throw the two together, and models can't really solve it. They can pattern match to a degree, but that is the limit.
8ish years ago, at the shop I worked at we had a server taken down. It was an image server for vehicles. How did it go down? Well, the crawler in question somehow had access to vehicle image links we had due to our business. Unfortunately, the perfect storm of the image not actually existing (can't remember why, mighta been one of those weird cases where we did a re-inspection without issuing new inspection ID) resulted in them essentially DOSing our condition report image server. Worse, there was a bug in the error handler somehow, such that the server process restarted when this condition happened. This had the -additional- disadvantage of invalidating our 'for .NET 2.0, pretty dang decent' caching implementation...
It comes to mind because I'm pretty sure we started doing some canary techniques just to be safe. (Ironically, doing some simple ones was still cheaper than even adding a different web server.... yes, we also fixed the caching issue... yes, we also added a way to 'scream' if we got too many bad requests on that service.)
... And then found their own crawlers can't parse their own manifests.
It complied, but it was absolutely not fast or efficient. I aimed at compliance first, good code second, but never got to the second because of more human-oriented issues that killed the project.
If AI crawlers want access they can either behave, or pay. The consequence will be almost universal blocks otherwise!
Not sure how to implement it in the cloud though, never had the need for that there yet.
[1] https://gist.github.com/flaviovs/103a0dbf62c67ff371ff75fc62f...
Their site is down at the moment, but luckily they haven't stopped Wayback Machine from crawling it: https://web.archive.org/web/20250117030633/https://zadzmo.or...
How? The difficulty of doing that is the problem, isn't it? (Otherwise we'd just be doing that already.)
Not quite what the original commenter meant but: WE ARE.
A major consequence of this reckless AI scraping is that it turbocharged the move away from the web and into closed ecosystems like Discord. Away from the prying eyes of most AI scrapers ... and the search engine indexes that made the internet so useful as an information resource.
Lots of old websites & forums are going offline as their hosts either cannot cope with the load or send a sizeable bill to the webmaster who then pulls the plug.
That counts as barely imho.
I found this out after OpenAI was decimating my site and ignoring the wildcard deny all. I had to add entries specifically for their three bots to get them to stop.
there is a case to be made about the value of the traffic you'll get from oai search though...
And how often does it check robots.txt? ClaudeBot will make hundreds of thousands of requests before it re-checks robots.txt to see that you asked it to please stop DDoSing you.
New reason preventing your pages from being indexed
Search Console has identified that some pages on your site are not being indexed
due to the following new reason:
Indexed, though blocked by robots.txt
If this reason is not intentional, we recommend that you fix it in order to get
affected pages indexed and appearing on Google.
Open indexing report
Message type: [WNC-20237597]
Or they could at least have the courtesy to scrape during nighttime / off-peak hours.
Who cares? They've already scraped the content by then.
> It's futile to block AI crawler bots because they lie, change their user agent, use residential IP addresses as proxies, and more.
Impersonating crawlers from big companies is a common technique for people trying to blend in. The fact that requests are coming from residential IPs is a big red flag that something else is going on.
Based on the internal information I have been able to gather, it is highly unlikely this is actually Amazon. Amazonbot is supposed to respect robots.txt and should always come from an Amazon-owned IP address (You can see verification steps here: https://developer.amazon.com/en/amazonbot).
I've forwarded this internally just in case there is some crazy internal team I'm not aware of pulling this stunt, but I would strongly suggest the author treats this traffic as malicious and lying about its user agent.
Believe what you want though. Search for `xeiaso.net` in ticketing if you want proof.
I'd still be surprised if an Amazon domain resolved to a residential IP
This type of thing is commercially available as a service[1]. Hundreds of Millions of networks backdoored and used as crawlers/scrapers because of an included library somewhere -- and ostensibly legal because somewhere in some ToS they had some generic line that could plausibly be extended to using you as a patsy for quasi-legal activities.
If the traffic is coming from residential IPs then it’s most likely someone using these services and putting “AmazonBot” as a user agent to trick people.
Whatever happened to courtesy in scraping?
Money happened. AI companies are financially incentivized to take as much data as possible, as quickly as possible, from anywhere they can get it, and for now they have so much cash to burn that they don't really need to be efficient about it.
the hubris reminds me of dot-com era. that bust left a huge wreckage. not sure how this one is going to land.
When various companies got the signal that, at least for now, they have a huge Overton window of what is acceptable for AI to ingest, they are going to take all they can before regulation even tries to clamp down.
The bigger danger is that one of these companies, even (or, especially) one that claims to be 'Open', does so but gets to the point of being considered 'too big to fail' from an economic/natsec interest...
Remember, Facebook famously made it easy to scrape your friends from MySpace, and then banned that exact same activity from their site once they got big.
Wake the f*ck up.
Requests coming from residential IPs are really suspicious.
Edit: the motivation for such a DDoS might be targeting Amazon, by taking down smaller sites and making it look like amazon is responsible.
If it is Amazon, one place to start is blocking all the IP ranges they publish. Although it sounds like there are requests outside those ranges...
They pay you for your bandwidth while they resell it to 3rd parties, which is why a lot of bot traffic looks like it comes from residential IPs.
If this person is seeing a lot of traffic from residential IPs then I would be shocked if it’s really Amazon. I think someone else is doing something sketchy and they put “AmazonBot” in the user agent to make victims think it’s Amazon.
You can set the user agent string to anything you want, as we all know.
They are very, very, very expensive for the amount of data you get. You are paying per bit of data. Even with Amazon's money, the numbers quickly become untenable.
It was literally cheaper for us to subscribe to business ADSL/cable/fiber optic services to our corp office buildings and trunk them together.
> I worked for Microsoft doing malware detection back 10+ years ago, and questionably sourced proxies were well and truly on the table
Big Company Crawlers using questionably sourced proxies - this seems striking. What can you share about it?
Although I’m not necessarily gonna make that accusation, because it would be pretty serious misconduct if it were true.
You'd be surprised...
> You'd be surprised...
Surprised by what? What do you know?
I'm curious how OP figured out it's Amazon's crawler to blame. I would love to point the finger of blame.
What if instead it was possible to feed the bots clearly damaging and harmful content?
If done on a larger scale, and Amazon discovers the poisoned pills, they could have to spend money rooting it out quickly, and make attempts to stop their bots from ingesting it.
Of course nobody wants to have that stuff on their own site though. That is the biggest problem with this.
With all respect, you're completely misunderstanding the scope of AI companies' misbehaviour.
These scrapers already gleefully chow down on CSAM and all other likewise horrible things. OpenAI had some of their Kenyan data-tagging subcontractors quit on them over this. (2023, Time)
The current crop of AI firms do not care about data quality. Only quantity. The only thing you can do to harm them is to hand them 0 bytes.
You would go directly to jail for things even a tenth as bad as Sam Altman has authorized.
- Bytespider (59%) and Amazonbot (21%) together accounted for 80% of the total traffic to our Git server.
- ClaudeBot drove more traffic through our Redmine in a month than it saw in the combined 5 years prior to ClaudeBot.
What agent name should we put in robots.txt to deny your crawler without using a wildcard? I can't see that documented anywhere.
User-agent: Crawlspace
Disallow: /
> May I ask why you’d want to block it, even if it crawls respectfully?

The main audience for the product seems to be AI companies, and some people just aren't interested in feeding that beast. Lots of sites block Common Crawl even though their bot is usually polite.
Yes, it should — I use the library below, and it should split at the slash character, treating it as a prefix match, per spec.
https://github.com/samclarke/robots-parser/blob/master/Robot...
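I haven't tested that exact library, but Python's stdlib parser shows the same behaviour if you want a quick sanity check (it matches on the product token before the slash):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: Crawlspace",
    "Disallow: /",
])

# The token before the "/" in the crawler's UA is what gets matched.
print(rp.can_fetch("Crawlspace/1.0", "https://example.com/any/page"))    # False
print(rp.can_fetch("SomeOtherBot/2.0", "https://example.com/any/page"))  # True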
Maybe worth contacting law enforcement?
Although it might not actually be Amazon.
Or, folks failing the 'shared security model' of AWS and their stuff is compromised with botnets running on AWS.
Or, folks that are quasi-spoofing 'AmazonBot' because they think it will have a better not-block rate than anonymous or other requests...
* There is knowledge that the intended access was unauthorised
* There is an intention to secure access to any program or data held in a computer
I imagine US law has similar definitions of unauthorized access?
`robots.txt` is the universal standard for defining what is unauthorised access for bots. No programmer could argue they aren't aware of this, and ignoring it, for me personally, is enough to show knowledge that the intended access was unauthorised. Is that enough for a court? Not a goddamn clue. Maybe we need to find out.
Quite the assumption, you just upset a bunch of alien species.
(But again, I don't know UK law.)
The last part basically means the robots.txt file can be circumstantial evidence of intent, but there needs to be other factors at the heart of the case.
<https://www.imperva.com/legal/website-terms-of-use/>
Many, many, many hits for this or similar language:
<https://duckduckgo.com/?q=%22By+accessing+this+Site%2C+you+a...>
Mind: just because it's written doesn't mean it's enforceable, but to argue that what you've just denied isn't a widely-used premise of online contracts of adhesion fails the simplest empirical test.
Plus you'll want to allow access to /robots.txt.
Of course, if they're hammering new connections, then automatically adding temporary firewall rules if the user agent requests anything but /robots.txt might be the easiest solution. Well or just stick Cloudflare in front of everything.
Honestly I think this might end up being the mid-term solution.
For legitimate traffic it's not too onerous, and recognized users can easily have bypasses. For bulk traffic it's extremely costly, and can be scaled to make it more costly as abuse happens. Hashcash is a near-ideal corporate-bot combat system, and layers nicely with other techniques too.
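For anyone who hasn't seen it, the core of a hashcash-style check is only a few lines (toy sketch, not any particular production implementation; the challenge string and difficulty are arbitrary):

import hashlib
import itertools

def solve(challenge, difficulty):
    # Client side: brute-force a nonce until the hash has `difficulty` leading zero hex digits.
    target = "0" * difficulty
    for nonce in itertools.count():
        if hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest().startswith(target):
            return nonce

def verify(challenge, nonce, difficulty):
    # Server side: one hash to check the submitted nonce.
    return hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest().startswith("0" * difficulty)

# Cheap for a single human visitor, expensive at bulk-crawl volume, and the
# difficulty can be cranked up for clients that look abusive.
nonce = solve("per-session-challenge-123", 4)
assert verify("per-session-challenge-123", nonce, 4)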
https://fossil-scm.org/home/doc/trunk/www/antibot.wiki
I ran into this weeks ago and was super impressed to solve a self-hosted captcha and login as "anonymous". I use cgit currently but have dabbled with fossil previously and if bots were a problem I'd absolutely consider this
while (true) {
  // Naive crawl loop: pull a URL off the queue, fetch the page, store it,
  // and enqueue every link found on it.
  const page = await load_html_page(read_from_queue());
  save_somewhere(page);
  for (const link of page.links) {
    enqueue(link);
  }
}
This means that every link on every page gets enqueued and saved to do something. Naturally, this means that every file of every commit gets enqueued and scraped.

Having everything behind auth defeats the point of making the repos public.
Sure you might get bandwidth saturated, but that can happen with any type of content
Is there any bot string in the user agent? I'd wonder if it's GPTBot as I believe they don't respect a robots.txt deny wildcard.
Also, just put a limit on requests per IP? https://nginx.org/en/docs/http/ngx_http_limit_req_module.htm...
While facetious in nature, my point is that people walking around in real brick and mortar locations simply do not care. If you want police to enforce laws, those are the kinds of people that need to care about your problem. Until that occurs, you'll have to work around the problem.
Do they keep retrieving the same data from the same links over and over and over again, like stuck in a forever loop, that runs week after week?
Or are they crawling your site in a hyper-aggressive way but getting more and more data? So it may take them, say, 2 days to crawl over it and then they go away?
I run a couple of public facing websites on a NUC and it just… chugs along? This is also amidst the constant barrage of OSINT attempts at my IP.
https://docs.aws.amazon.com/vpc/latest/userguide/aws-ip-rang...
Regardless, it sucks that you have to deal with this. The fact that you’re a customer makes it all the more absurd.
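For anyone who wants to act on that published list, a small sketch against ip-ranges.json (assuming its documented layout of "prefixes" entries with an "ip_prefix" field; IPv6 lives in a separate "ipv6_prefixes" list handled the same way):

import ipaddress
import json
import urllib.request

IP_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

with urllib.request.urlopen(IP_RANGES_URL) as resp:
    data = json.load(resp)

amazon_nets = [ipaddress.ip_network(p["ip_prefix"]) for p in data["prefixes"]]

def is_amazon_ip(ip):
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in amazon_nets)

# A residential IP claiming to be Amazonbot should fail this check.
print(is_amazon_ip("203.0.113.7"))   # TEST-NET example address -> False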
Do these bots use some client software (browser plugin, desktop app) that’s consuming unsuspecting users bandwidth for distributed crawling?
Edit: oh, I got your point now.
map $http_user_agent $bottype {
    # ~* makes the regex case-insensitive, so GPTBot, SemrushBot, MJ12bot, etc. are caught.
    default "";
    "~*Amazonbot" "amazon";
    "~*ImagesiftBot" "imagesift";
    "~*Googlebot" "google";
    "~*ClaudeBot" "claude";
    "~*gptbot" "gpt";
    "~*semrush" "semrush";
    "~*mj12" "mj12";
    "~*Bytespider" "bytedance";
    "~*facebook" "facebook";
}
limit_req_zone $bottype zone=bots:10m rate=6r/m;
limit_req zone=bots burst=10 nodelay;
limit_req_status 429;
You can still have other limits by IP. 429s tend to slow the scrapers, and it means you are spending a lot less on bandwidth and compute when they get too aggressive. Monitor and adjust the regex list over time as needed.

Note that if SEO is a goal, this does make you vulnerable to blackhat SEO by someone faking a UA of a search engine you care about and eating their 6 req/minute quota with fake bots. You could treat Google differently.
This approach won't solve for the case where the UA is dishonest and pretends to be a browser - that's an especially hard problem if they have a large pool of residential IPs and emulate / are headless browsers, but that's a whole different problem that needs different solutions.
You can ingest this IP list periodically and set rules based on those IPs instead. Makes you not prone to the blackhat SEO tactic you mentioned. In fact, you could completely block GoogleBot UA strings that don’t match the IPs, without harming SEO, since those UA strings are being spoofed ;)
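If you'd rather not depend on the list format, Google also documents reverse-DNS verification for its crawlers, which is easy to script (a sketch; verify_googlebot is my name, not an existing API):

import socket

def verify_googlebot(ip):
    # Reverse-DNS the IP, check the hostname is under googlebot.com/google.com,
    # then forward-resolve the hostname and confirm it maps back to the same IP.
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return socket.gethostbyname(hostname) == ip
    except OSError:
        return False

# e.g. only exempt the Googlebot UA from the limits above when verify_googlebot(remote_ip) is True.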
It’s like the friggin tobacco companies or something. Is anyone being the “good guys” on this?
Just cache static content and throttle by IP. This isn't a DDoS, just rotating IPs
> /wp-content/uploads/2014/09/contact-us/referanslar/petrofac/wp-content/uploads/2014/09/products_and_services/products_and_services/catalogue/references/capabilities/about-company/wp-content/uploads/2014/09/wp-content/themes/domain/images/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/themes/domain/images/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/themes/domain/images/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/themes/domain/images/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/themes/domain/images/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/themes/domain/images/wp-content/themes/domain/images/wp-content/themes/domain/images/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/themes/domain/images/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/themes/domain/images/wp-content/uploads/2014/09/wp-content/themes/domain/images/wp-content/uploads/2014/09/wp-content/themes/domain/images/about-company/corporate-philosophy/index.htm
"User-Agent":["Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36"
Could this be prevented by having a link that when followed would serve a dynamically generated page that does all of the following:
A) insert some fake content outlining the oligarchs' more lurid rumours or whichever disinformation you choose to push
B) embed links to assets hosted by the oligarchs' companies so they get hit with some bandwidth
C) dynamically create new random pages that link to themselves
And thus create an infinite loop, similar to a gzip bomb, which could potentially taint the model if done by enough people.
...
The closest you could get to any meaningful influence is option C, with the following general observations (a tiny sketch of such a page generator follows the list):
1. You'd need to 'randomize' the generated output link
2. You'd also want to maximize cacheability of the replayed content to minimize work.
3. Add layers of obfuscation on the frontend side, for instance a 'hidden' link to a random bad URL (maybe with some prompt fuckery if you are brave) inside the HTML of your normal pages.
4. Randomize parts of the honeypot link pattern. At some point someone monitoring logs/etc will see that it's a loop and blacklist the path.
5. Keep at it with 4 and eventually they'll hopefully stop crawling.
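To make option C concrete, a generator along these lines keeps the content deterministic per path (so it caches well, point 2) while the links look endlessly new (points 1 and 4); honeypot_page and the /trap/ prefix are made-up names:

import hashlib
import random

WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur", "adipiscing"]

def honeypot_page(path):
    # Seed from the requested path: repeat fetches of the same URL return the
    # same (cacheable) page, but every page links to five "new" trap URLs.
    rng = random.Random(hashlib.sha256(path.encode()).hexdigest())
    body = " ".join(rng.choice(WORDS) for _ in range(200))
    links = " ".join(
        '<a href="/trap/%016x">more</a>' % rng.getrandbits(64) for _ in range(5)
    )
    return "<html><body><p>%s</p>%s</body></html>" % (body, links)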
---
On the lighter side...
1. do some combination of the above, but have all honeypot links contain the right words that an LLM will just nope out of for regulatory reasons.
That said, all the above will do is minimize pain (except, perhaps ironically, the joke response, which will more likely get you blacklisted, but could potentially get you on a list or a TLA visit)...
... Most pragmatically, I'd start by suggesting the best option is a combination of nonlinear rate limiting, both on the ramp-up and the ramp-down. That is, the faster requests come in, the more you increment their `valueToCheckAgainstLimit`. The longer it's been since last request, the more you decrement.
Also pragmatically, if you can extend that to put together even semi-sloppy code to detect when a request to a junk link that results in a ban is immediately followed by another IP trying to hit the same URL... well, ban that IP as soon as you see it, at least for a while.
With the right sort of lookup table, IP Bans can be fairly simple to handle on a software level, although the 'first-time' elbow grease can be a challenge.
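A bare-bones version of that decaying counter (every threshold here is made up; the point is just that tight bursts push the score up much faster than idle time drains it):

import time

DECAY_PER_SEC = 0.2   # score drained per second of silence
LIMIT = 30.0          # score above which we throttle or refuse
scores = {}           # ip -> (score, last_seen)

def should_limit(ip):
    now = time.time()
    score, last_seen = scores.get(ip, (0.0, now - 60.0))   # new IPs start cold
    gap = now - last_seen
    score = max(0.0, score - gap * DECAY_PER_SEC)          # ramp down while idle
    score += min(5.0, 1.0 / max(gap, 0.1))                 # tight bursts ramp up faster
    scores[ip] = (score, now)
    return score > LIMIT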
He's got "(Amazon)" while Amazon lists their useragent as "(Amazonbot/0.1;"
Nobody has problems with the Google Search indexer trying to crawl websites in a responsible way
I'm really just pointing out the inconsistent technocrat attitude towards labor, sovereignty, and resources.
I've recently blocked everything that isn't offering a user agent. If it had only pulled text I probably wouldn't have cared, but it was pulling images as well (bot designers, take note - you can have orders of magnitude less impact if you skip the images).
For me personally, what's left isn't eating enough bandwidth for me to care, and I think any attempt to serve some bots is doomed to failure.
If I really, really hated chatbots (I don't), I'd look at approaches that poison the well.
There is one official mod who steps in occasionally. He is not the one flagging stories or comments. If a big story becomes unflagged, that is his doing more often than not