NewsBlur is an open-source RSS news reader (full source available at [0]), something we should all agree is necessary to support the open web! But Cloudflare blocking all of my feed fetchers is bizarre behavior. We've been on the verified bots list for years, and it hasn't made a difference.
Let me know what I can do. NewsBlur publishes a list of IPs that it uses for feed fetching, which I've shared with Cloudflare as well, but that hasn't made a difference either.
I'm hoping Cloudflare uses the IP address list that I publish and adds them to their allowlist so NewsBlur can keep fetching (and archiving) millions of feeds.
[0]: https://newsblur.com
I run an RSS feed on my blog out of principle, and I don't bother reading the other feeds I'm subscribed to.
When I'm bored I come here, I go on Mastodon, and gods save me, I go on Reddit
Clearly they are not 100% consenting, or at best one of them (the content publisher) is misconfiguring or misunderstanding their setup. They enabled RSS on their service, then set up a rule requiring human verification to access that RSS feed.
It's like a business advertising a singles only area, then hiring a security company and telling them to only allow couples in the building.
NewsBlur was the first SaaS I could afford as a student. I have been a subscriber for something like 20 years now. And I will keep doing it to the grave. Best money ever spent.
NewsBlur is "only" 15 years old (and GReader was there up until 11 years ago).
I used to get my internet from a small local ISP, and IP blacklisting basically meant no one in our zip code could have reliable internet.
These days, the 10-20% of us with an unobstructed sky view switched to starlink and didn’t look back.
The thing is, both ISPs use CGNAT, but there's no way Cloudflare is going to block Musk like they do the mom-and-pop shop.
Anyway, apparently residential proxy networks work pretty well if you hit a spurious IP block. I've had good luck with Apple Private Relay too.
I’m hoping service providers realize how useless and damaging ip blocking is to their reputations, but I’m not holding my breath. Sometimes I think the endgame is just routing 100% of residential traffic through 8.8.8.8.
If you are worried about DoS attacks that may hammer on your feeds, you can use the same configuration rule to ignore the query string for cache keys (if your feed doesn't use query strings) and override the caching settings if your server doesn't set the proper headers. This way Cloudflare will cache your feed and you can serve any number of visitors without putting load on your origin.
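Conceptually, ignoring the query string for the cache key boils down to something like this (a sketch of the idea only; the function name and URLs are mine, not Cloudflare's actual rule syntax):

    from urllib.parse import urlsplit, urlunsplit

    def feed_cache_key(url: str) -> str:
        # Cache on scheme + host + path only, dropping the query string,
        # so /feed.xml?bust=12345 and /feed.xml share one cached copy
        # and cache-busting parameters can't reach the origin.
        scheme, host, path, _query, _fragment = urlsplit(url)
        return urlunsplit((scheme, host, path, "", ""))

    assert feed_cache_key("https://example.com/feed.xml?bust=42") == feed_cache_key("https://example.com/feed.xml")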
As for Cloudflare fixing the defaults, it seems unlikely to happen. It has been broken for years, Cloudflare's own blog is affected. They have been "actively working" on fixing it for at least 2 years according to their VP of product: https://news.ycombinator.com/item?id=33675847
One particularly effective strategy we've implemented is using separate subdomains for services designed for different types of traffic, allowing us to apply customized firewall and page rules to each subdomain.
For example:
- www.listennotes.com is dedicated to human users. E.g., https://www.listennotes.com/podcast-realtime/
- feeds.listennotes.com is tailored for bots, providing access to RSS feeds. E.g., https://feeds.listennotes.com/listen/wenbin-fangs-podcast-pl...
- audio.listennotes.com serves both humans and bots, handling audio URL proxies. E.g., https://audio.listennotes.com/e/p/1a0b2d081cae4d6d9889c49651...
This subdomain-based approach enables us to fine-tune security and performance settings for each type of traffic, ensuring optimal service delivery.
We only need to provide the sitemap (with custom paths, not publicly available) in a few specific places, like Google Search Console. This means the rules for managing sitemaps are quite manageable. It’s not a perfect setup, but once we configure it, we can usually leave it untouched for a long time.
I tried for a long time to get around it, but now when I hit a website like this I just close the tab and don't bother anymore.
That's true in some cases, I'm sure, but also remember that most site owners deal with lots of tedious abuse. For example, some people get really annoyed about Tor being blocked, but for most sites Tor is a tiny fraction of total traffic and a fairly large percentage of the abuse: probing for vulnerabilities, guessing passwords, spamming contact forms, etc. So while I sympathize with the legitimate users, I also completely understand why a busy site operator is going to flip a switch that makes their log noise go down by a double-digit percentage.
I've been creating accounts every time I need to visit Reddit now to read a thread about [insert subject]. They do not validate E-Mail, so I just use `example@example.com`, whatever random username it suggests, and `example` as a password. I've created at least a thousand accounts at this point.
Malicious Compliance, until they disable this last effort at accessing their content.
A good client is either Lagrange (multiplatform), the old Lynx, or Dillo with the Gopher plugin.
However, the undeniable reality is that accessing the website with a non-residential IP is a very, very strong indicator of sinister behaviour. Anyone that’s been in a position to operate one of these services will tell you that. For every…let’s call them ‘privacy-conscious’ user, there are 10 (or more) nefarious actors that present largely the same way. It’s easy to forget this as a user.
I’m all but certain that if Reddit or LinkedIn could differentiate, they would. But they can’t. That’s kinda the whole point.
> From a privacy POV, your VPN is doing nothing to them, because your IP address means very little to them from a tracking POV.
I disagree. (1) Since I have javascript disabled, IP address is generally their next best thing to go on. (2) I don't want to give them IP address to correlate with the other data they have on me, because if they sell that data, now someone else who only has my IP address suddenly can get a bunch of other stuff with it too.
But anyone making malicious POST requests, like spamming chatGPT comments, first makes GET requests to load the submission and find comments to reply to. If they think you're a low quality user, I don't see why they'd bother just locking down POSTs.
GET parameters can be abused like any other parameter: could be SQL injection, could be directory traversal attempts, brute-force username attempts, you name it.
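For instance, here's a minimal sketch of the directory-traversal case: a hypothetical endpoint that takes a filename from a GET parameter and has to keep it inside its document root (the paths and names here are made up):

    from pathlib import Path

    DOC_ROOT = Path("/var/www/docs").resolve()  # hypothetical document root

    def safe_read(filename: str) -> bytes:
        # Reject GET parameters like ?file=../../etc/passwd by making sure
        # the resolved path stays inside the document root.
        target = (DOC_ROOT / filename).resolve()
        if not target.is_relative_to(DOC_ROOT):
            raise PermissionError("directory traversal attempt")
        return target.read_bytes()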
I am absolutely not a fan of all these "are you human?" checks at all, doubly so when ad-blockers trigger them. I think there are very legitimate reasons for wanting to access certain sites without being tracked - anything related to health is an example.
Maybe I should have made a more substantive comment, but I don't believe this is as simple a problem as reducing it to request types.
Telegram channels have been a good alternative, but even that is going downhill thanks to French authorities.
Cloudflare and Google also often treat us like bots (endless captchas, etc) which makes it even more difficult.
And each one of these could potentially create thousands of accounts, and do 100x as many requests as a normal user would.
Even if only 1% of the people using your service are fraudsters, a normal user has at most a few accounts, while fraudsters may try to create thousands per day. This means that e.g. 90% of your signups are fraudulent, despite the population of fraudsters being extremely small.
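A quick back-of-the-envelope version of that, with made-up numbers chosen only to match the shape of the argument:

    legit_users, fraudsters = 990, 10          # 1% of users are fraudsters
    legit_signups = legit_users * 1            # a normal user signs up once
    fraud_signups = fraudsters * 1_000         # each fraudster scripts ~1,000 signups/day

    share = fraud_signups / (fraud_signups + legit_signups)
    print(f"{share:.0%} of signups are fraudulent")  # ~91%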
It's like at my current and previous companies: they impose a lot of security restrictions. The problem is, if somebody wants to get data out (or in), they can do it anytime. The security department says the restrictions are against "accidental" leaks. I'm still waiting for a single instance where they caught an "accidental" leak; instead they just introduce extra steps, and in the end I achieve the exact same thing. Even when I caused a real potential leak, nobody stopped me from doing it. The only reason they have these security services/apps is to push responsibility onto other companies.
I discovered this when I set up IPv6 using hurricane electric as a tunnel broker for IPv6 connectivity.
Seemingly Google has all HE.net IPv6 tunnel subnets listed for such behaviour, without it being documented anywhere. It was extremely annoying until I figured out what was going on.
Sounds suspiciously like how product managers talk to developers as well.
If only there was a law that allowed one to be excluded from automatic behavior profiling...
(The actual process at this restaurant is to sit down, fuss with your phone a bit, then get up like you're about to leave; someone will arrive promptly to take your order.)
Like, you visit Site A too often while blocking some javascript, and now Site B doesn't work for no apparent reason, and there's no resolution path. Worse, the bad information may become permanent if an owner uses it to taint your account, again with no clear reason or appeal.
I suspect Reddit effectively killed my 10+ year account (appeal granted, but somehow still shadowbanned) because I once used the "wrong" public wifi to access it.
Site owners probably don't even see these bounced visits, and it's such a tiny percentage of visitors who do this that it won't make a difference. Meh, it's just another annoyance to be able to use the web on our own terms.
I would get a different captcha each time, one so convoluted it wouldn't even load the required images.
And I would get the oops-sorry dog page for everything.
I finally contacted Amazon, gave them my (static) IP address, and it was good.
In other locations, I have to solve a 6-distorted-letter captcha to log in, but that's the extent of it.
But not always. My most recent stumbling block is https://www.napaonline.com. Guess I'm buying oxygen sensors somewhere else.
Yeah, that's my solution as well. I take those annoyances as the website telling me that they don't want me there, so I grant them their wish.
Another problem is that "resist fingerprinting" prevents some canvas processing, and many websites like Bluesky, LinkedIn, or Substack use canvas to handle image uploads, so your images appear as stripes of pixels.
Then you have mobile apps that just don't run if you don't have a Google account, like ChatGPT's native app.
I understand why people give up, trying to fight for your privacy is an uphill battle with no end in sight.
Is that true? At least on iOS you can log into the ChatGPT app with the same email/password as the website.
I never use Google login for stuff and ChatGPT works fine for me.
In an adversarial environment, especially with both AI scrapers and AI posters, websites have to be able to identify and ban persistent abusers. Which unfortunately implies having some kind of identification of everybody.
I couldn't disagree more. The way to protect privacy is to make privacy the standard at the implementation layer, and to make it costly and difficult to breach it.
Trying to rely on political institutions without the practical and technical incentives favoring privacy will inevitably result in the political institutions themselves becoming the main instrument that erodes privacy.
If people who valued privacy really controlled the implementation layer we wouldn't have gotten to this point in the first place.
That's not true, I use ChatGPT's app on my phone without logging into a Google account.
You don't even need any kind of account at all to use it.
An Android phone asks you to link a Google account when you use it for the first time. It takes a very dedicated user to refuse that, and then to avoid logging into the Gmail, YouTube, or app store apps, which will all also link your phone to your Google account when you sign in.
But I do actively avoid this, I use Aurora, F-droid, K9 and NewPipeX, so no link to google.
But then no ChatGPT app. When I start it, I get hit with a login page for the app store, and it's game over.
I haven't tried the ChatGPT app, but I know that, for example my bank and other financial services apps work with on-device fingerprint authentication and no Google account on /e/OS.
In the end, the fact remains: no ChatGPT app without giving up your privacy, to Google no less.
Of course, since Google doesn't claim they do this, many people would consider it unreasonably fearful/cynical.
Yes? I mean, not "leaks" - it's designed to upload your private data to Google and others.
https://www.tcd.ie/news_events/articles/study-reveals-scale-...
> Even when minimally configured and the handset is idle, with the notable exception of e/OS, these vendor-customised Android variants transmit substantial amounts of information to the OS developer and to third parties such as Google, Microsoft, LinkedIn, and Facebook that have pre-installed system apps. There is no opt-out from this data collection.
It's the opposite stance that would be bonkers.
Google and Apple are both heavily invested in ads (Apple made 4.7 billion from ads in 2022); they have a track record of exfiltrating your data (remember contractors listening to your Siri recordings?) and of lying to customers (remember the home button scandal on iPhone?), and they control a device that holds your whole life yet runs partially on code you can't evaluate.
Trusting those people makes no sense at all. You have a business relationship with them, that's it.
I suspect that people operating Web sites have no idea how many legitimate users are blocked by CloudFlare.
And, based on the responses I got when I contacted two of the companies whose sites were chronically blocked by CloudFlare for months, it seemed like it wasn't worth any employee's time to try to diagnose.
Also, I'm frequently blocked by CloudFlare when running Tor Browser. Blocking by Tor exit node IP address (if that's what's happening) is much more understandable than blocking Firefox from a residential IP address, but still makes CloudFlare not a friend of people who want or need to use Tor.
I sometimes wonder if all Cloudflare employees are on some kind of whitelist that makes them not realize the ridiculous false positive rate of their bot detection.
The adversarial aspect of all this is a problem: P(malicious|Tor) is much higher than P(malicious|!Tor)
I'm guessing if it's really Resist Fingerprinting on Firefox (something Mullvad also has on by default), then there are other settings that aren't being enabled causing the issue. Mullvad actually lists the settings related to resisting fingerprinting here - https://mullvad.net/en/browser/hard-facts
I've contacted companies about this and they usually just tell me to use a different browser or computer, which is like "duh, really?", but also doesn't solve the problem for me or anyone else.
The most egregious is Microsoft (just about every Microsoft service/page, really), where all you get is a "The request is blocked." and a few pointless identifiers listed at the bottom, purely because it thinks your browser is too old.
CF's captcha page isn't any better either, usually putting me in an endless loop if it doesn't like my User-Agent.
https://github.com/rails/rails/pull/50505/files#diff-dce8d06...
But like, why is it a website's job to tell me what browser version to use? Unless my outdated browser is lacking legitimate functionality which is required by your website, just serve the page and be done with it.
def blocked?
  user_agent_version_reported? && unsupported_browser?
end
Well, you know what to do here :)

You're best off just picking real ones. We got hit by a botnet sending 10k+ requests from 40 different ASNs with thousands of different IPs. The only way we were able to identify/block the traffic was excluding user agents matching some regex (for whatever reason they weren't spoofing real user agents, but they weren't sending actual ones either).
Different browsers use TLS in slightly different ways, send data in a slightly different order, have a different set of supported extensions / algorithms etc.
If your user agent says Safari 18, but your TLS fingerprint looks like Curl and not Safari, sophisticated services will immediately detect that something isn't right.
[1]: https://addons.mozilla.org/en-US/firefox/addon/random_user_a...
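A toy sketch of how such a mismatch check might look. The fingerprint values below are invented, and real services compare JA3/JA4-style hashes of the TLS ClientHello rather than short strings:

    # Hypothetical lookup table: browser family -> TLS fingerprints it is
    # known to produce (values are made up for illustration).
    EXPECTED_TLS = {
        "Chrome": {"bbb222"},   # checked before Safari: Chrome UAs also contain "Safari"
        "Safari": {"aaa111"},
        "Firefox": {"ccc333"},
        "curl": {"ddd444"},
    }

    def looks_spoofed(user_agent: str, tls_fingerprint: str) -> bool:
        # Flag requests whose claimed browser family doesn't match the TLS
        # handshake fingerprint actually observed on the connection.
        for family, fingerprints in EXPECTED_TLS.items():
            if family in user_agent:
                return tls_fingerprint not in fingerprints
        return True  # unknown browser family: treat as suspicious

    # Claims Safari in the User-Agent but handshakes like curl -> flagged.
    print(looks_spoofed("Mozilla/5.0 (Macintosh) Version/18.0 Safari/605.1.15", "ddd444"))  # True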
From experience, a lot of the things people do in hopes of protecting their privacy only makes them far easier to profile.
- The website judges your fingerprint based on how unique it is, but assumes that it's otherwise persistent. Randomizing my User-Agent serves the exact opposite - a given User-Agent might be more unique than using the default, but I randomize it to throw trackers off.
- To my knowledge, its "One in x browsers" metric (and by extension the "Bits of identifying information" and the final result) are based off of visitor statistics, which would likely be skewed as most of its visitors are privacy-conscious. They only say they have a "database of many other Internet users' configurations," so I can't verify this.
- Most of the measurements it makes rely on javascript support. For what it's worth, it claims my fingerprint is not unique when javascript is disabled, which is how I browse the web by default.
The other extreme would be fixing my User-Agent to the most common value, but I don't think that'd offer me much privacy unless I also used a proxy/NAT shared by many users.
But yes, without javascript a lot of tracking functions fail to operate. That is good for privacy, and EFF notes that on the site.
You can fix your UA to a common value, it's about providing the least amount of identifying bits, and randomizing it just provides another bit to identify you by. Always remember: an absence of information is also valuable information!
(I'm not saying I agree with it, just that it exists.)
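For what it's worth, the "bits of identifying information" number those fingerprinting reports show is just self-information: a value shared by one in x browsers carries log2(x) bits. A sketch of the arithmetic (the example counts are made up):

    import math

    def identifying_bits(one_in_x: float) -> float:
        # A value shared by 1 in every `one_in_x` browsers contributes
        # log2(one_in_x) bits of identifying information.
        return math.log2(one_in_x)

    print(identifying_bits(2))          # 1.0 bit: half of all browsers share it
    print(identifying_bits(1500))       # ~10.6 bits: a fairly unusual value
    print(math.log2(8_000_000_000))     # ~33 bits is enough to single out one person on Earth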
Though what you mention does beg the question "is there really much privacy gain in that over using Referrer-Policy: same-origin and having referrer based pages work right?" I suppose so if you're randomizing your identity in an untrackable way for each connection it could be attractive... though I think that'd trigger being suspected as a bot far before the lack of proper same origin info :p.
Whenever I click a link to another site, i get a new tab in either a pre-assigned container or else in a “tmpNNNN” container, and i think either by default or I have it configured to omit Referer headers on those new tab navigations.
B. Cloudflare has healthy competition with AWS, Akamai, Fastly, Bunny.net, Mux, Google Cloud, Azure, you name it, there's a competitor. This isn't even an Apple vs Google situation.
And it is the DDoS prevention measures at issue here.
Nowadays, Cloudflare has image compression and CDN services, video storage and delivery services, serverless compute with Workers, domain registration, (soon) container support with optional GPUs, durable objects (basically serverless storage), serverless SQL databases (D1), even an AWS S3 competitor with R2. They even have bespoke services like Cloudflare Tunnels - what's AWS got that's anything like it?
Cloudflare is getting close to full-on AWS. At least, the parts most customers use. If they just added boring old VPSs, people would realize very quickly how full featured they are.
As for DDoS mitigation - you’ve still got AWS Shield, Akamai, Azure, Radware, F5, even Oracle (Dyn) competing in that market. Unless you could show Cloudflare did illegal tying as a monopolist specifically to sell DDoS prevention, there’s no case.
And yes, it's sad that the "make internet work again" is behind an expensive paywall..
If you have an enterprise plan, you can have custom rules, including allowing by URL.
I'm not sure either if RSS bots could be added to good bots, but if anyone has traffic from them, we can definitely try. (No high hopes though, given the responses I got from support so far)
Sure, tech wise it might work great, but from your users perspective: it's trash.
You simply shouldn't have any challenges whatsoever on an RSS feed. They're literally meant to be read by a machine.
The issue here is that Cloudflare's content type check is naive. And the fact that CF is checking the content-type header directly needs to be made more explicit OR they need to do a file type check.
There were compatibility issues with other type headers, at least in the past.
'application/rss+xml' (for RSS)
'application/atom+xml' (for Atom)
As soon as a majority of sites use the correct types, clients can start requiring them for newly added feeds, which in turn will make webmasters get it right if they want their feed to work.
'application/rss+xml' seems to be the best option though in my opinion. The '+xml' in the media type tells (good) parsers to fall back to using an XML parser if they don't understand the 'rss' part, but the 'rss' part provides more accurate information on the content's type for parsers that do understand RSS.
All that said, it's a mess.
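For what it's worth, getting the header right on the publishing side is a one-liner; here's a minimal sketch using Python's standard library (the path and feed contents are placeholders):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    FEED = b"""<?xml version="1.0" encoding="UTF-8"?>
    <rss version="2.0"><channel><title>Example feed</title></channel></rss>"""

    class FeedHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/feed.xml":
                self.send_response(200)
                # Declare the feed as a feed, not text/html or a bare text/xml.
                self.send_header("Content-Type", "application/rss+xml; charset=utf-8")
                self.send_header("Content-Length", str(len(FEED)))
                self.end_headers()
                self.wfile.write(FEED)
            else:
                self.send_error(404)

    if __name__ == "__main__":
        HTTPServer(("", 8000), FeedHandler).serve_forever()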
Would not surprise me if Cloudflare lumps this in with text/html protections.
Since the user-agent has no way to distinguish scripts injected by cloudflare from scripts originating from the actual website, in order to pass the challenge they are forced to execute arbitrary code from an untrusted party. And malicious Javascript is practically ubiquitous on the general internet.
I tried reaching out to Cloudflare with issues like this in the past. The response is dozens of employees hitting my LinkedIn page, yet no responses to basic, reproducible technical issues.
You need to fix this internally as it's a reputational problem now. Less screwing around using Salesforce as your private Twitter, more leadership in triage. Your devs obviously aren't motivated to fix this stuff independently and for whatever reason they keep breaking the web.
I'm not saying this to say it's a good thing; it isn't.
Here's something to consider though: Why are we going after Cloudflare for this? Isn't the website operator far, far more at-fault? They chose Cloudflare. They configure Cloudflare. They, in theory, publish an RSS feed, which is broken because of infrastructure decisions they made. You're going after Ryobi because you've got a leaky pipe. But beyond that: isn't this tool Cloudflare publishes doing exactly what the website operators intended it to do? It blocks non-human traffic. RSS clients are non-human traffic. Maybe the reason you don't want to go after the website operators is because you know you're in the wrong? Why can't these RSS clients detect when they encounter this situation, and prompt the user with a captive portal to get past it?
There will always be niche technologies and nascent standards and we're taking Cloudflare to task today because if they continue to stomp on them, we get nowhere.
"Don't use Cloudflare" is an option, but we can demand both.
I mean that somewhat sarcastically; but there does come a point where the demands are unreasonable, the technology is dead. There are probably more people browsing with JavaScript disabled than using RSS feeds. There are probably more people browsing on Windows XP than using RSS feeds. Do I yell at you because your personal blog doesn't support IE6 anymore?
This is a matter between You and the Website Operators, period. Cloudflare has nothing to do with this. This article puts "Cloudflare" in the title because it's fun to hate on Cloudflare and it gets upvotes. Cloudflare is a tool. These website operators are using Cloudflare The Tool to block inhuman access to their websites. RSS CLIENTS ARE NOT HUMAN. Let me repeat that: Cloudflare's bot detection is working fully appropriately here, because RSS Clients are Bots. Everything here is working as expected. The part where change should be asked for is: Website operators should allow inhuman actors past the Cloudflare bot detection firewall specifically for RSS feeds. They can FULLY DO THIS. Cloudflare has many, many knobs and buttons that Website Operators can tweak; one of those is e.g. a page rule to turn off bot detection for specific routes, such as `/feed.xml`.
If your favorite website is not doing this, it's NOT CLOUDFLARE'S FAULT.
Take it up with the Website Operators, Not Cloudflare. Or build an RSS Client which supports a captive portal to do human authorization. God, this is so boring; y'all just love shaking your fist and yelling at big tech for LITERALLY no reason. I suspect it's actually because half of y'all are concerningly uneducated on what we're talking about.
It's not that hard. If the content being requested is RSS (or Atom, or some other syndication format intended for consumption by software), just don't do bot checks, use other mechanisms like rate limiting if you must stop abuse.
As an example: would you put a captcha on robots.txt as well?
As other stories here can attest to, Cloudflare is slowly killing off independent publishing on the web through poor product management decisions and technology implementations, and the fix seems pretty simple.
As another commenter noted, not even CF's own RSS feed seems to get the content type right. This issue could clearly use some work.
I understand that there are some more interactive rss readers, but from personal experience it’s more like “hey I’m a good bot, let me in”
Ideally you could make it a simple switch in the config, something like: "Allow automated access on RSS endpoints".
From the feed reader perspective it is a 403 response. For example my reader has been trying to read https://blog.cloudflare.com/rss/ and the last successful response it got was on 2021-11-17. It has been backing off due to "errors" but it still is checking every 1-2 weeks and gets a 403 every time.
This obviously isn't limited to the Cloudflare blog; I see it on many sites "protected by" (or in this case broken by) Cloudflare. I could tell you what public cloud IPs my reader comes from or which user agent it uses, but that is beside the point. This is a URL which is clearly intended for bots, so it shouldn't be bot-blocked by default.
When people reach out to customer support we tell them that this is a bug for the site and there isn't much we can do. They can try contacting the site owner but this is most likely the default configuration of Cloudflare causing problems that the owner isn't aware of. I often recommend using a service like FeedBurner to proxy the request as these services seem to be on the whitelist of Cloudflare and other scraping prevention firewalls.
I think the main solution would be to detect intended-for-robots content and exclude it from scraping prevention by default (at least to a huge degree).
Another useful mechanism would be to allow these to be accessed when the target page is cacheable, as the cache will protect the origin from overload-type DoS attacks anyway. Some care needs to be taken to ensure that adding a ?bust={random} query parameter can't break through to the origin, but this would be a powerful tool for endpoints that need protection from overload but not against scraping (like RSS feeds). Unfortunately cache headers for feeds are far from universal, so this wouldn't fix all feeds on its own. (For example, the Cloudflare blog's feed doesn't set any caching headers and is labeled as `cf-cache-status: DYNAMIC`.)
Perhaps a solution would be for Cloudflare to have default page rules that disable bot-blocking features for common RSS feed URLs? Or pop-up a notice with instructions on how to create these page rules to users that appear to have RSS feeds on their website?
[1] Here is Overcast’s owner raising the issue in 2022: https://x.com/OvercastFM/status/1578755654587940865
It’s particularly frustrating that they give their own WARP service a pass. I’ve run into many sites that will block VPN traffic, including iCloud Privacy Relay, but WARP traffic goes through just fine.
If that guy makes money with that and has an issue with the Great Firewall Of America, there's a (bad) solution.
I wrote my own RSS bridge that scrapes websites using the Scrapfly web scraping API, which bypasses all of that, because it's so annoying that I can't even scrape some company's /blog that they are literally buying ads for, yet which somehow has anti-bot protection enabled that blocks all RSS readers.
Modern web is so anti social that the web 2.0 guys should be rolling in their "everything will be connected with APIs" graves by now.
The state of the art isn't much better today, it seems. Similar outcome with more steps.
In the end we had to use Cloudflare to rate limit the RSS endpoint.
I think this is fine. You are solving a specific problem and still allowing some traffic. The problem with the Cloudflare default settings is that they block all requests leading to users failing to get any updates even when fetching the feed at a reasonable rate.
BTW in this case another solution may just be to configure proper caching headers. Even if you only cache for 5 minutes at a time, that will be at most 1 request every 5 minutes per Cloudflare caching location (I don't know the exact configuration, but they typically use ~5 locations per origin, so that would be only 1 req/min, which is trivial load and will handle both these inconsiderate scrapers and regular users. You can also configure all fetches to come from a single location, and then you would only need to actually serve the feed once per 5 minutes).
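A rough sketch of that arithmetic and the header that enables it (the TTL and the location count are just the assumptions above; actual Cloudflare behaviour varies):

    # Back-of-the-envelope origin load with a 5-minute edge cache,
    # assuming ~5 Cloudflare cache locations hit the origin independently.
    ttl_minutes = 5
    cache_locations = 5
    origin_requests_per_min = cache_locations / ttl_minutes
    print(origin_requests_per_min)  # 1.0 request/minute, regardless of subscriber count

    # The header on the feed response that makes this possible:
    CACHE_CONTROL = "public, max-age=300, s-maxage=300"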
Isn't the correct solution to use CF to cache RSS endpoints aggressively?
That said, for publicly hosted open source documentation, we turn down the security settings almost all the way. Security level is set to "essentially off" (that's the actual setting name), no browser integrity check, TOR friendly (onion routing on), etc. We still have rate limits in place but they're pretty generous (~4 req/s sustained). For sites that don't require a login and don't accept inbound leads or something like that, that's probably around the right level. Our domains where doc authors manage their docs have higher security settings.
That said, being too generous can get you into trouble so I understand why people crank up the settings and just block some legitimate traffic. See our past post where AI scrapers scraped almost 100TB (https://news.ycombinator.com/item?id=41072549).
Unfixed for 4 months.
Can you whitelist URLs to be read by bots on Cloudflare? Maybe this is a good solution, where you as a site maintainer can include your RSS feeds, sitemaps, and other content for bots.
Also, Cloudflare could ship a feature: a dedicated section in the admin panel that lets the user add and whitelist RSS feeds and sitemaps, making it easier (and educating users) to avoid blocking bots that aren't a threat to your site, of course still applying rules to avoid DDoS on these URLs, like massive request volumes or other things that common RSS reader bots don't do.
Here are some DNS details:
The main Reddit site (www.reddit.com) uses Fastly. Old Reddit (old.reddit.com) also uses Fastly. However, the "vomit" address (which often returns 403s for RSS requests) uses AWS DNS. Is Old Reddit not behind Cloudflare, or is there another reason why it handles RSS requests differently?
No idea if CF already does this, but allowing users to generate access tokens for 3rd party services would be another way of easing access alongside their apparent URL and IP whitelisting.
P.S. When I mentioned this here on HN a few weeks back, it was implied that I probably did not respect robots.txt (I do; Cloudflare does not) or that I should get in touch with the site administrators (impossible to do in any reasonably effective way at scale).
Also, other companies offering similar services, like Imperva, seem to straight-up ban my IP after one visit to a website with uBlock Origin: I first get a captcha, then a page saying I am not allowed, and whatever I do, even using an extension-less Chrome browser with a new profile, I can't visit it anymore because my IP is banned.
The other thing to think about is the lack of enforcement: you can’t complain to the bot police when some dude in China decides to harvest your data, and if you try blocking by user-agent or IP you’ll play whack-a-mole trying to stay ahead of the bot operators who will spoof the former and churn the latter. After developing an appreciation for why security people talk about validating correctness rather than trying to enumerate badness, you’ll end up with a combination of rate-limiting and broader blocking for the same reasons. Yes, it’s no fun but the problem isn’t the sites but the people abusing the free services we’ve been given.
This is part of what's leading to the bludgeoning approach you see with blocking. They are not an individual thing that can be blocked.
Looks like it should be possible under the WAF
For something like entirely static content, it's so much easier (and cheaper, all of the static hosting providers have an extremely generous free tier) to use static hosting.
And I say this as an SRE by heart who runs Kubernetes and Nomad for fun across a number of nodes at home and in various providers - my blog is on a static host. Use the appropriate solution for each task.
Before that, it was on a mediocre-even-at-the-time dedicated-cores VM. That caused performance problems... because its Internet "pipe" was straw-sized, it turned out. The server itself was fine.
Web server performance has regressed amazingly badly in the world of the Cloud. Even "serious" sites have decided the performance equivalent of shitty shared-host Web hosting is a great idea and that introducing all the problems of distributed computing at the architecture level will help their moderate-traffic site work better (LOL; LMFAO), so now they need Cloudflare and such just so their "scalable" solution doesn't fall over in a light breeze.
Once you start chasing views, it's going to come at the detriment of everything else.
I am glad to see other people calling out the problem. Hopefully, a solution will emerge.
if you don’t like it, make your own Internet: assumedly one not funded by ads