1. If I as a human request a website, then I should be shown the content. Everyone agrees.
2. If I as the human request the software on my computer to modify the content before displaying it, for example by installing an ad-blocker into my user agent, then that's my choice and the website should not be notified about it. Most users agree; some websites try to nag you into modifying the software you run locally.
3. If I now go one step further and use an LLM to summarize content because the authentic presentation is so riddled with ads, JavaScript, and pop-ups that the content becomes borderline unusable, then why would the LLM accessing the website on my behalf be in a different legal category than my Firefox web browser accessing the website on my behalf?
Their reasons vary. Some don't want their business's perceived quality to be taken out of their control (delivering cold food, marking up items, poor substitutions). Some would prefer their staff serve customers and build relationships with them directly, instead of dealing with disinterested and frequently quite demanding runners. Some just straight up disagree with the practice of third-party delivery.
I think that it’s pretty unambiguously reasonable to choose to not allow an unrelated business to operate inside of your physical storefront. I also think that maps onto digital services.
An agent making a request on the explicit behalf of someone else is probably something most of us agree is reasonable. "What are the current stories on Hacker News?" -- the agent is just doing the same request to the same website that I would have done anyways.
But the sort of non-explicit just-in-case crawling that Perplexity might do for a general question, where it crawls 4-6 sources, isn't as easy to defend. "Are polar bears always white?" -- Now it's making requests I wouldn't have necessarily made, and it could even be seen as a sort of amplification attack.
That said, TFA's example is where they register secretexample.com and then ask Perplexity "what is secretexample.com about?" and Perplexity sends a request to answer the question, so that's an example of the first case, not the second.
When you swap in an AI and ask what the current stories are, the AI fetches the front page and every thread and feeds it back to you. You are less likely to participate in discussion because you've already had the info summarized.
Am I supposed to spend money on Amazon.com when I visit the website just because Amazon wants me to?
If most people stop discussing things on HN, and the discussion is indeed one of the major reasons it’s kept running, then HN stops being worth running.
Are website owners obligated to serve content to AI agents and/or LLM scrapers?
And yet people install ad blockers and defend their freedom to not participate in this because they don't want to be annoyed by ads.
They claim that since they are free to not buy an advertised product, why would they be forced to see ads for it. But Foo News claims that they are also free to not waste bandwidth serving their free website to people who declare (by using an ad blocker or the modern alternative: AI summarizers) that they won't participate in the funding of the service.
People hate obnoxious ads because the money that pays for them is essentially a bribe to artificially elevate content above its deserved ranking. It feels like you're being manipulated into an unfavorable trade.
It is? Are we talking about the same YouTube? I get absolutely useless recommendations, I get un-hooked within a couple videos, and I even keep getting recommendations for the same videos I've literally watched yesterday. Who in the world gets hooked by this??
I never saw people bother with scissors but I've seen people pulling the ads out of the newspaper countless times.
I think this is a pretty different scenario. Here the user and the news website are talking directly to each other, but then the user is making a choice around what to do with the content the news website send to them. With AI agents, there is a company inserting themselves between the user and the news website and acting as a middleman.
It seems reasonable to me that the news website might say they only want to deal with users and not middlemen.
What if my executive assistant reads the news website and gives me a digest?
Would the website owners prefer that I do my reading directly?
I remember that Samsung was at one time offering to play non-skippable full-screen ads on their newest 8K OLED TVs, and their argument was precisely that these ads will reach those rich people who normally pay extra to avoid getting spammed with ads. Or going with your executive assistant example, there are situations where it makes sense to bribe them to get access to you and/or your data. E.g. the "evil maid attack".
HTTP/1.1 402 Payment Required
WWW-price: 0.0000001 BTC, 0.000001 ETH, 0.00001 DOGE
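For illustration, a minimal sketch of serving that hypothetical response. The "WWW-Price" and "X-Payment-Proof" headers are invented for the example; HTTP 402 is reserved but has no standardized semantics:

```python
# Sketch of the hypothetical 402 scheme above; headers are invented, not standard.
from http.server import BaseHTTPRequestHandler, HTTPServer

class PaywalledHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("X-Payment-Proof"):
            # A real system would verify the payment proof here.
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<p>The article you paid for.</p>")
        else:
            self.send_response(402)  # Payment Required
            self.send_header("WWW-Price", "0.0000001 BTC, 0.000001 ETH, 0.00001 DOGE")
            self.end_headers()

HTTPServer(("", 8000), PaywalledHandler).serve_forever()
```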
> You are less likely to participate in discussion

You (or the AI on your behalf) paid instead. Many sites would probably like it better.
There are so many links I click on these days that are such trash I'd be demanding refunds constantly.
That is why AI "summarization" becomes a necessary intermediate layer. You'd see neither the trash nor the ads, and you'd make the payment instead of being exposed to the ads. AI saves the Internet :)
What prevents these companies from keeping a copy of that particular page, which I specifically disallowed for bot scraping, and feed it to their next training cycle?
Pinky promises? Ethics? Laws? Technical limitations? Leeroy Jenkins?
What prevents anyone else? robots.txt is a request, not an access policy.
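To make that concrete, this is the entirety of the "protection" a robots.txt provides. A polite crawler reads it and skips the listed paths; an impolite one simply doesn't:

```
# robots.txt at the site root: a published request, not an access policy.
User-agent: *
Disallow: /private/
```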
Does information no longer want to be free? Maybe the internet, just like social media, was just a social experiment in the end, albeit a successful one. Thanks, GenAI.
Big Tech has hidden behind ToS for years. Now, it seems as though it only works for them, never against them. It seems as though this would be easy to orchestrate and prove, forcing these companies into a legal nightmare, or into risking insolvency due to the high load of cases filed against them.
Why couldn't something like this be used to flip the table? A conciliation brigading, of sorts.
I've successfully used conciliation court against large corporations in the past which is why I question it here.
And while this should be handled via legislation, it won't be. Beyond that, a workaround could force that to happen.
Sorry, I had never heard that term before. You would still have to show standing though. How would you try to prove that their violating your TOS cost you money?
Again, I can't copy and distribute a game Microsoft rents to me. But if I do, I can be held accountable for a ridiculous amount of money. If it's my work of art, the terms can dictate who doesn't need to pay and who does. If an LLM is consuming my work of art and now distributing it within their user base, how is that not the same?
Do you still see authentic human traffic on your domains, is it easy to discern?
I feel like I missed the bus on running a blog pre-AI.
Both my blog homepage and posts see mostly human traffic. Sometimes bots crawl the site and they appear as spikes in the analytics.
Looks like my homepage, which doesn't have anything but links, is pretty popular with crawlers. My digital garden doesn't get much interest from them. All in all, human traffic on my sites is pretty much alive.
I don't believe in missing the bus in anything actually, because I don't write these for others, first. Both my blog (more meta) and digital garden (more technical) are written for myself primarily, and left open. I post links to both when it's appropriate, but they are not made to be popular. If people read it and learn something or solve one of their problems, that's enough for me.
This is why my software is GPLv3, my digital garden is GFDL, and my blog is CC BY-NC-SA 2.0. This is why everything runs with the absolute minimum of analytics and without any ads whatsoever.
Lastly, this is why I don't want AI crawlers on my site and my data in the models. This thing is made by a human for humans, absolutely for free. It's not OK for somebody to take something designed to be free, sell it, and make money off it.
Absolutely, I'm in agreement here. I want to run a JS-free blog, just plain old static HTML. I plan to use GoAccess to parse the access logs but that's it. I think I would find it encouraging to see real human traffic.
> I don't write these for others, first. Both my blog (more meta) and digital garden (more technical) are written for myself primarily, and left open.
That is a great way to view it, thank you.
You're welcome. I'm glad it helped.
> I want to run a JS-free blog, just plain old static HTML.
If you want to start fast until you find a template you want to work with, I can recommend Mataroa [0]. The blog has almost no JS (it binds a couple of keys for navigation, that's it), and it's $10/year. When you feel ready for your self-hosted solution, you can move over. It's all Markdown at the end of the day.
> I plan to use GoAccess to parse the access logs but that's it.
That's the only thing I use, too. Nothing else.
If you want to look at what I do, how I do, and reach out to me, the rabbit hole starts from my profile, here.
Wish you all the best, and may you find bliss and joy you never dreamed of!
you could go proper insanomode, too. remaking The Internet is trivial if you don't care about existing web standards -- replacing HTTP with your own TCP implementation, getting off html/js/css, etc. being greenfield, you can control the protocol, server, and client implementation, and put it in whatever language you want. I made a stateful Internet implementation in Python earlier for proof-of-concept, but I want to port it and expand on it in rust soon (just for fun; I don't do serious biznos). you'll very likely have 100% human traffic then, even if you're the only person curious and trusting enough to run your client.
> I made a stateful Internet implementation in Python earlier for proof-of-concept
Is there a repo or some other form of public access? I'd like to see this.

the server ("lodge") passes JSON to the client from what are called .branch files. the client receives the JSON, parses it, builds the UI and state representation from it, and stores that in the client's memory (self.current_doc and self.page_state in the python client).
branches can invoke waterwheel (.ww) files hosted on the lodge. waterwheel files on the lodge contain scripts which define how patches (as JSON) are to be sent to the client. the client updates its state based on the JSON patch it receives. sample .branch and .ww files from the python implementation (in a pastebin so as not to make everyone scroll through them): https://pastebin.com/A0DEZDmR
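for readers who don't want to open the pastebin, here's a generic sketch of the patch idea in python -- illustrative only, not the actual lodge/waterwheel code:

```python
# Generic sketch of JSON-patch-driven client state, not the real implementation.
import json

page_state = {"title": "home", "items": ["a", "b"]}

def apply_patch(state: dict, patch_json: str) -> dict:
    # Shallow merge: each top-level key in the patch replaces the old value.
    state.update(json.loads(patch_json))
    return state

apply_patch(page_state, '{"items": ["a", "b", "c"]}')
print(page_state)  # {'title': 'home', 'items': ['a', 'b', 'c']}
```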
You can put up a paywall depending on UserAgent or OS (has been done).
In short, it's a 2-way street: the client on the other end of the TCP pipe makes a request, and your server fulfills the request as it sees fit.
I also want to keep this distinction on the sites I own. I use licenses to signal that this site is not to be used for AI training, because it's CC BY-NC-SA 2.0.
So, I license my content appropriately (no derivatives, non-commercial, shareable under the same license with attribution), add technical countermeasures on top because companies don't respect these licenses (because monies), watch them circumvent these mechanisms (because monies), and I'm the one who has to suck it up and shut up (because their monies)?
Makes no sense whatsoever.
What I want to stop is excessive crawling and scraping of my server. Once they have the file they can do what they want with it. Another comment (44786237) mentions that robots.txt is only for restricting recursive access; I agree, and that is what should be blocked. They also should not access the same file several times in quick succession, just as much as they should not access all of the files, since it should be unnecessary to do so. (If someone wants to make a mirror of the files, there may be other ways, e.g. an archive file available for downloading many at once, in case the site operator made their own index and offered it that way. If it is a git repository, then it can be cloned.)
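Blocking that kind of repeated access is a plain rate-limiting problem. A minimal token-bucket sketch keyed on client IP; the thresholds are illustrative:

```python
# Minimal per-IP token-bucket rate limiter; RATE and BURST are illustrative.
import time
from collections import defaultdict

RATE = 1.0    # requests replenished per second
BURST = 10.0  # maximum burst size

buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow(ip: str) -> bool:
    b = buckets[ip]
    now = time.monotonic()
    b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
    b["last"] = now
    if b["tokens"] >= 1.0:
        b["tokens"] -= 1.0
        return True
    return False  # over the limit; a server would answer 429 here
```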
Your license is probably not relevant. I can go to the cinema and watch a movie, then come on this website and describe the whole plot. That isn't copyright infringement. Even if I told it to the whole world, it wouldn't be copyright infringement. Probably the movie seller would prefer it if I didn't tell anyone. Why should I care?
I actually agree that AI companies are generally bad and should be stopped - because they use an exorbitant amount of bandwidth and harm the services for other users. At least they should be heavily taxed. I don't even begrudge people for using Anubis, at least in some cases. But it is wrong-headed (and actually wrong in fact) to try to say someone may or may not use my content for some purpose because it hurts my feelings or it messes with my ad revenue. We have laws against copyright infringement, and to prevent service disruption. We should not have laws that say, yes you can read my site but no you can't use it to train an LLM, or to build a search index. That would be unethical. Call for a windfall tax if they piss you off so much.
This is a false analogy. A correct one would be going to 1000 movies and creating a 1001st movie with scenes cropped from those 1000 movies, assembled as a new movie, and that is copyright infringement. I don't think any of the studios would applaud and support you for your creativity.
> But it is wrong-headed (and actually wrong in fact) to try to say someone may or may not use my content for some purpose because it hurts my feelings or it messes with my ad revenue.
Why does it have to be always about money? Personally it's not. I just don't want my work to be abused and sold to people to benefit a third party without my consent and will (and all my work is licensed appropriately for that).
> We should not have laws that say, yes you can read my site but no you can't use it to train an LLM, or to build a search index.
This goes both ways. If big corporations can scrape my material without asking me and resell it as an output of a model, I can equally distill their models further and sell it as my own. If companies can scrape my pages to sell my content as theirs, I can scrape theirs and unpaywall them.
But that will be copyright infringement, just because they have more money. What angers me is "all is fair game because you're a small fish, and this is a capitalist marketplace" mentality.
If companies can paywall their content to humans that don't pay, I can paywall AI companies and demand money or push them out of my lawn, just because I feel like that. The inverse is very unethical, but very capitalist, yes.
It's not always about money.
P.S.: Oh, try to claim that you can train a model with medical data without any clearance because it'd be unethical to have laws limiting this. It'll be fun. Believe me.
I think you are describing something much more like Stable Diffusion. This article is about Perplexity, which is much closer to "watch a movie and tell me the plot" than it is to "take these 1000 movies and make a collage". The copyright points are different - Stable Diffusion is on much shakier ground than Perplexity.
> Why does it have to be always about money?
Before I mentioned money I said "because it hurts my feelings". I'm sorry I can't give a more charitable interpretation, but I really do see this kind of objection as "I don't want you to have access to this web page because I don't like LLMs". This is not a principled objection, it is just "I don't like you, go away". I don't think this is a good principle to build the web on.
Obviously you can make your website private, if you want, and that would be a shame. But you can't have this kind of pick-and-choose "public when you feel like" option. By the way I did not mention, but I am ok with people using Anubis and the like as a compromise while the situation remains unjust. But the justification is very important.
> If companies can scrape my pages to sell my content as theirs, I can scrape theirs and unpaywall them.
This is probably not a gambit you want to make. You literally can do this, and they would probably like it if you did. You don't want to do that, because the output of LLMs is usually not that good.
In fact, LLM companies should probably be taxed, and the taxes used to fund real human AI-free creations. This will probably not happen, but I am used to disappointment.
> P.S.: Oh, try to claim that you can train a model with medical data
Medical data is not public, for good reasons.
That is unfortunately not a distinction that is currently legally enforceable. Until that changes all other "solutions" are pointless and only cause more harm.
> People who think like that made tools like Anubis, and it works.
It works to get real humans like myself to stop visiting your site while scrapers will have people whose entire job is to work around such "protections". Just like traditional DRM inconveniences honest customers and not pirates. And to be clear, what you are advocating for is DRM.
> I also want to keep this distinction on the sites I own. I also use licenses to signal that this site is not good to use for AI training, because it's CC BY-NC-SA-2.0.
If AI crawlers cared about that we wouldn't be talking about this issue. A license can only give more permissions than there are without one.
If we talk about Anubis, it's pretty invisible. You wait a couple of seconds on the first visit, and don't get challenged again for a couple of weeks, at least. With more tuning, some of the sites using Anubis work perfectly well without visitors ever seeing Anubis's wall, while still stopping AI crawlers.
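For context, Anubis-style challenges boil down to a small proof of work the browser must solve before getting a cookie. A generic sketch of the idea, not Anubis's actual code:

```python
# Generic proof-of-work sketch: find a nonce whose SHA-256 hash has a
# leading-zero prefix. The server verifies the result with a single hash.
import hashlib

def solve(challenge: str, difficulty: int = 4) -> int:
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce  # cheap for one visitor, expensive at crawler scale
        nonce += 1
```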
> And to be clear, what you are advocating for is DRM.
Yes. It's pretty ironic that someone like me who believes in open access prefers a DRM solution to keep companies abusing the small fish, but life is an interesting phenomenon, and these things happen.
> Until that changes all other "solutions" are pointless and only cause more harm.
As an addendum to the above paragraph, I'm not happy that I have to insert draconian measures between the user and the information I want to share, but I need a way to signal to these faceless things that I'm not having it. What do you propose? Taking my sites offline? Burning myself in front of one of the HQs?
> If AI crawlers cared about that we wouldn't be talking about this issue. A license can only give more permissions than there are without one.
AI crawlers default to "public domain" when they find no licenses. Some of my lamest source code repositories made it into "The Stack" because I forgot to add COPYING.md. A fork of a GPLv2 tool I wrote some patches for also got into "The Stack", because COPYING.md was not in the root folder of the repository. I'd rather add licenses (which I can accept) to things than leave them as-is, because AI companies eagerly grab anything without a license.
All licenses I use mandate attribution and continuation of the license, at least, and my blog doesn't allow any derivations of what I have written. So you can't ingest it into a model to be derived and remixed with something else.
It's not invisible, the sites using it don't work perfectly well for all users, and it doesn't stop AI crawlers.
If you don't have the funds to sue an AI corp, I'd probably think of a plan B. Maybe poison the data for unauthenticated users. Or embrace the inevitability. Or see the bright side of getting embedded in models as if you're leaving your mark.
If you give them a URL that does not appear in Google, ask them to visit that URL specifically, and then notice the content from that URL in the training data, it's proof that they're doing this, which would be quite damaging to them.
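A minimal sketch of that honeypot setup (the path and marker are illustrative): serve a unique, unguessable string at an unlinked URL and log whoever fetches it; if the string later surfaces in a model's output, it was crawled.

```python
# Canary page sketch: an unlinked URL serving a unique marker string.
import secrets
from http.server import BaseHTTPRequestHandler, HTTPServer

CANARY = f"canary-{secrets.token_hex(16)}"  # save this for later searching
print("marker:", CANARY)

class CanaryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/not-linked-anywhere":
            # Nothing links here, so every fetch is a crawl, not a browse.
            print("fetched by", self.client_address[0],
                  "UA:", self.headers.get("User-Agent"))
            self.send_response(200)
            self.end_headers()
            self.wfile.write(CANARY.encode())
        else:
            self.send_response(404)
            self.end_headers()

HTTPServer(("", 8000), CanaryHandler).serve_forever()
```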
Is it? It's damning, but is it damaging at all?
I'm getting the impression that anyone's data being available for training, if some bot can get to it, is just how things are now, rather than an unsettled point of contention. There's too much money invested in this thing for any other outcome, and with the present decline of the rule of law…
If a store's business relies, at least partially, on obscuring information in ways that can be defeated through automated means (e.g. storefronts tend to push visitors towards products they don't want, and buyer agents fight that by looking for what the buyer actually asked for), then playing this cat-and-mouse game of blocking agents, finding workarounds, and repeating the cycle only creates perverse technological contraptions that neither party is really interested in - but both are circumstantially forced to invest in.
You can block IPs at the host level, but there are pretty easy ways around that with proxy networks.
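The blocking side really is just a couple of lines (nginx-style config shown; the addresses are placeholders), which is exactly why fresh proxy IPs route around it so easily:

```
# nginx: deny individual addresses or ranges at the edge (placeholder values)
deny 203.0.113.42;
deny 198.51.100.0/24;
```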
Who misled you with that statement?
Corporate America. Where clean code goes to die.
Mind you I'm not saying electric scooters are a bad idea, I have one and I quite enjoy it. I'm saying we didn't need five fucking startups all competing to provide them at the lowest cost possible just for 2/3s of them to end up in fucking landfills when the VC funding ran out.
Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.
Comments should get more thoughtful and substantive, not less, as a topic gets more divisive.
When disagreeing, please reply to the argument instead of calling names. "That is idiotic; 1 + 1 is 2, not 3" can be shortened to "1 + 1 is 2, not 3."
Please don't fulminate. Please don't sneer, including at the rest of the community.
Eschew flamebait. Avoid generic tangents. Omit internet tropes.
Please don't use Hacker News for political or ideological battle. It tramples curiosity.
If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.
broadcast television and paid streaming entertainment are just straight up the most glaringly obvious examples of paid services overflowing with advertisements.
paid radio broadcasts (xm/Sirius).
operating systems (windows serves you ads any chance it gets).
monthly subscriptions to gyms where you're constantly hit with ads, marketing, and promotions, be it at the gym or via push notification (which you got opted into and therefore have to intentionally opt out of after the service is paid).
mobile phones, especially prepaid come LOADED with ads and bloatware.
i mean the list goes on -- you cannot be serious.
Those are hybrid subscriptions/subsidies. Not paid in full.
If you are being exposed to ads in something you paid for, you are almost certainly being charged less money. Companies can compete on cost by introducing ads, and it's why the cheaper you go, the more ad infested it gets.
Pure ad-free things tend to be much more expensive than their ad-subsidized counterparts. Ad-subsidized has become so ubiquitous, though, that people think that price is the true price.
in the subsidized example, xm/Sirius is marketed to users as an "ad-free paid radio broadcast"; the marketing literally attempts to leverage the notion of it being ad-free as a consequence of your purchase (power) in order to highlight its supposed competitive edge and usefulness, and provide the user an incentive to spend money, except for the fact that the marketing is false. you still get served promotions and ads, just less "conventional" ads.
i go to a football game and im literally inundated with ads -- the whole game has time stoppage dedicated to serving ads. i guess my season ticket purchase with the hopes of seeing football in person is.. apparently not spending enough money?
i see this as attempting to move the goalposts and gaslight users on their purchase expectations, as a way to offload the responsibility and accountability back onto the user -- "you don't pay enough, you only think that you pay enough, so we are still going to serve you ads because <insert financial justification here around the expectations we've undermined>".
why then is there any expectation of a service being ad-free upon purchasing?
who the hell actually enjoys sitting through 1.5 hours of advertisements and play stoppage?
over time users have been conditioned to just tolerate it, and over time, the advertising reclaims ground it previously gave up one inch at a time in the same way people are price-gouged in those stadiums -- they don't have much alternative, but apparently the problem is the user should fork up more money for tickets so as to align their expectations with reality? while they're getting strong-armed at the concession stand via proximity and circumstance and lack of competition, no less.
are you really trying to tell me the problem there is, they need to make... more money? and THEN and only THEN we can have ad-free, paid for entertainment otherwise known as american football? is this really about user expectations, or is this about companies wanting their cake and eating it, too?
The horror!!
Democracy in its American form, and in many others, shows almost complete paralysis of the entire system if bad actors infiltrate it (looking at ya, Donald).
It is honestly a little sad, since conservatives usually think of their society as this high-trust society, and they were the ones who primarily voted for this and are being taken advantage of by the few untrustworthy individuals.
Politics is a cult/religion and you can't convince me otherwise.
I vote because I vote for the lesser evil, not for the greater good. I do think that, frankly, both parties (or just most parties in every nation) fall so short of reality. But I created a Discord server of 100 people, and I can see how I can't manage even 100 people, so maybe I expect too much from the govt.
I used to focus so much on history and politics, but it's a bloody mess and there is no good or bad. Now I just feel like going into the woods and into the dark, living alone, maybe coding.
Quite compatible
> If you want a high trust society, you don't want capitalism.
There is nothing at all in capitalism that would prevent a high level of trust in society.
> Capitalism is inherently low trust
But that's not true. The thing about capitalism is that it's RESILIENT to low trust. It does not require low levels of trust, but is capable of functioning in such conditions.
> If the penalty for deceit was greater than the penalty for non-deceit
Who are the judges? Capitalism is the most resistant to deception; deceivers under capitalism receive fewer benefits than under any other economic system. Simply because capitalism is based on the premise that people cheat, act out of greed, and try to get the most for themselves at the expense of others. These qualities exist in people regardless of the existence of capitalism; it is just that capitalism ensures prosperity in society even when people have these qualities.
I've never seen a stateless, classless, moneyless society. It may be impossible.
At this moment I am using Perplexity's Comet browser to take a spotify playlist and add all the tracks to my youtube music playlist. I love it.
If sites want to avoid people using agents, they should offer the functionality that people are using the agents to accomplish.
Excellent. Personal shoppers are 'adblock for IRL'.
>You owe the companies nothing. You especially don't owe them any courtesy. They have re-arranged the world to put themselves in front of you. They never asked for your permission, don't even start asking for theirs.
Everyone having a personal shopper obviously changes the relationship to the products and services you use or purchase via personal shopper. Good, bad, whatever.
The point is the web is changing, and people use a different type of browser now. And that browser happens to be an LLM.
Anybody complaining about the new browser has just not got it yet, or has and is trying to keep things the old way because they don't know how or won't change with the times. We have seen it before: Kodak, Blockbuster, whatever.
Grow up, Cloudflare; some of your business models don't make sense any more.
You say this as though all LLM/otherwise automated traffic is for the purposes of fulfilling a request made by a user 100% of the time which is just flatly on-its-face untrue.
Companies make vast amounts of requests for indexing purposes. That could be to facilitate user requests someday, perhaps, but it is not today and not why it's happening. And worse still, LLMs introduce a new third option: that it's not for indexing or for later linking but is instead either for training the language model itself, or for the model to ingest and regurgitate later on with no attribution, with the added fun that it might just make some shit up about whatever you said and be wrong. And as the person buying the web hosting, all of that is subsidized by me.
"The web is changing" does not mean every website must follow suit. Since I built my blog about 2 internet eternities ago, I have seen fad tech come and fad tech go. My blog remains more or less exactly what it was 2 decades ago, with more content and a better stylesheet. I have requested in my robots.txt that my content not be used for LLM training, and I fully expect that to be ignored because tech bros don't respect anyone, even fellow tech bros, when it means they have to change their behavior.
It's a clear road to disaster. I am honestly surprised by how great Hacker News is in comparison, where most people are sharing things for the love of the craft. And for that Hacker News holds a special place in my heart. (Slightly exaggerating to give it a thematic ending, I suppose.)
They will be quite a bit wiser if they track/limit how often your shopper enters the store. You probably aren't entering the same store fifteen times every day, and neither would your shopper be if they were only doing it on your behalf.
Might does not make right.
It's like saying a web browser that is customized in any way is wrong. If one configures their browser to eagerly load links so that their next click is instant, is that now wrong?
that's called breaking and entering, and generally frowned upon -- bypassing the "closed" sign.
The line is drawn for me on my own computer. Even if I am in your building, my phone remains mine.
First time hearing this. Almost every single grocery store either supports Instacart or has a partnership with a similar service.
The next step in your progression here might be:
If / when people have personal research bots that go and look for answers across a number of sites, requesting many pages much faster than humans do - what's the tipping point? Is personal web crawling ok? What if it gets a bit smarter and tries to anticipate what you'll ask, and does a bunch of crawling to gather information regularly to try to stay up to date on things (from your machine)? Or is it when you tip the scale further and do general / mass crawling for many users to consume that it becomes a problem?
But of course, most website publishers would hate that. Because they don't want people to access their content, they want people to look at the ads that pay them. That's why to them, the IA crawling their website is akin to stealing. Because it's taking away some of their ad impressions.
>Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Many websites (especially the bigger ones) are just businesses. They pay people to produce content, hopefully make enough ad revenue to make a profit, and repeat. Anything that reproduces their content and steals their views has a direct effect on their income and their ability to stay in business.
Maybe IA should have a way for websites to register to collect payment for lost views or something. I think it’s negligible now, there are likely no websites losing meaningful revenue from people using IA instead, but it might be a way to get better buy in if it were institutionalized.
Magazines and newspapers were able to be funded by native ads because you couldn't auto-remove ads from their printed media and nobody could clone their content and give it away for free.
And those ads don't spy. They tend to be a jpg that functions as a link. That's why I mentioned spying.
When ads were far less invasive, I had a lot more tolerance.
Now they want my data, they want to play audio, video, hijack the content, page etc.
Advertising scum can not be trusted to forever take more and more and more.
I'm ok with this. I support the media I truly want to see, and that media offers alternatives that are not ads.
For instance, I pay for YouTube premium. That said, many will not pay.
Seems like a reasonable stance would be something like "Following the no crawl directive is especially necessary when navigating websites faster than humans can."
> What if it gets a bit smarter and tried to anticipate what you'll ask and does a bunch of crawling to gather information regularly to try to stay up to date on things (from your machine)?
To be fair, Google Chrome already (somewhat) does this by preloading links it thinks you might click, before you click it.
But your point is still valid. We tolerate it because as website owners, we want our sites to load fast for users. But if we're just serving pages to robots and the data is repackaged to users without citing the original source, then yea... let's rethink that.
Can't you read?
ChatGPT probably uses a cache though. Theoretically, the average load on the original sites could be far less than users accessing them directly.
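Something as simple as a TTL cache in front of the fetcher would do it; a sketch, with the TTL value being illustrative:

```python
# Sketch: TTL-cached fetch so repeated questions don't re-hit the origin site.
import time
import urllib.request

_cache: dict = {}  # url -> (fetched_at, body)
TTL = 3600  # seconds; illustrative

def fetch(url: str) -> bytes:
    now = time.time()
    if url in _cache and now - _cache[url][0] < TTL:
        return _cache[url][1]  # served from cache; the origin sees no request
    body = urllib.request.urlopen(url).read()
    _cache[url] = (now, body)
    return body
```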
You're not the only stakeholder in any of those interactions. There's you, a mediator (search or LLM), and the website owner.
The website owner (or its users) basically do all the work and provide all the value. They produce the content and carry the costs and risks.
The pre-LLM "deal" was that at least some traffic was sent their way, which helps with reach and attempts at monetization. This too is largely a broken and asymmetrical deal where the search engine holds all the cards but it's better than nothing.
A full LLM model that no longer sends traffic to websites means there's zero incentive to have a website in the first place, or it is encouraged to put it behind a login.
I get that users prefer an uncluttered direct answer over manually scanning a puzzling web. But the entire reason that the web is so frustrating is that visitors don't want to pay for anything.
They are already paying; it is the way they are paying that causes the mess. When you buy a product, some fraction of the price is the ad budget that then gets distributed to websites showing ads. Therefore there is also nothing wrong with blocking ads: they have already been paid for, whether you look at them or not. The ad budget will end up somewhere as long as not everyone is blocking all ads; only the distribution will get skewed. Which admittedly might be a problem for websites that have a user base that is disproportionately likely to use ad blockers.
Paying for content directly has the problem that you can only pay for a select few websites before the amount you have to pay becomes unreasonable. If you read one article on a hundred different websites, you can not realistically pay for a hundred subscriptions that are all priced as if you spent all your time on a single website. Nobody has yet succeeded in creating a web-wide payment method that only charges you for the content you actually consume and is frictionless enough to actually work, i.e. does not force you to make a conscious payment decision for a few cents (or maybe even only fractions of a cent) for every link you click, and is not a privacy nightmare collecting all the links you click for billing purposes.
Also if you directly pay for content, you will pay twice - you will pay for the subscription and you will still pay into the ad budget with all the stuff you buy.
It's not like newspapers where advertising is paid in full before publishers put stories online. It has not been that way for a long time.
Your reasoning for not accessing advertising reminds me of that scene in Arrested Development where, to hide the money they've taken out of the till, they throw away the bananas. It doesn't hide the transaction, it compounds the problem.
If publishers were getting paid before any ads ran the publishing business would be a hell of a lot stronger.
It’s especially stupid because it doesn’t include publishers in the equation at all. It’s just you looping over yourself attempting to validate your choice for running an ad blocker.
Admit you’re doing it because you want to callously screw over publishers. You certainly haven’t put their thoughts into consideration here.
To be clear: Run an ad blocker if you want, but stop acting as if you bought those ads. The chicken dinner I ate the other night has no say how I live my life after our transaction has ended.
If I buy stuff at a grocery store, I can’t get a random bagger fired just because I feel like it. At some point the transaction ends and they ultimately continue to operate with or without your input.
If I am buying Apple products, am I contributing to their ad budget? If so, where does that money end up? Is it likely that some of it will end up as ad revenue on some website? What difference does it make whether or not I block ads? Or the other way around, if I am visiting websites and look at Apple ads but do not buy Apple products, am I contributing to the ad revenue of the websites?
Also, advertising does other things than tell you to buy something, and it doesn’t always take the form of banner ads. Apple, for example, does a ton of brand awareness advertising. Affiliate marketing often targets direct transactions. Maybe your goal is to simply start a relationship that might someday lead to a really big purchase.
Often, in the era of SaaS, people advertise to existing customers. Apple does this—they have a TV service and a music service and a cloud service.
There are plenty of reasons for them to advertise after you bought the original product.
But your original point was that customers bought the ads. Maybe they didn’t! Maybe they were given funding by a VC firm and the company decided it wanted to build an audience. Maybe they want to advocate for a political issue.
I think the biggest problem with your argument is that it has tunnel vision and sees advertising as this one dimensional thing, when in reality it takes many forms. Plenty of those forms are bad, but it is not as simple as “I bought a product, now I never want to see an Apple ad ever again.” Many businesses (Amazon, eBay) make most of their money off of customers they’ve already advertised to that they advertise to again and again.
In fact, you did the opposite.
We can even go one step further, if anyone is screwing over websites, then that is the ad industry by not paying for blocked ads. I buy an iPhone and Apple takes some additional money from me to spend on advertising. I did not ask for that but I am fine with it. Now I expect Apple to spend the money they took from me on ads in order to support websites. But if the guy that Apple wants to show the ad that I paid for does not want to see it and blocks it, then I want Apple to respect that and still pay the website. I know, not going to happen, but do not put the blame on people blocking ads.
Cloudflare released these insights showing the disparity between crawling/scraping and visits referred from the AI platforms.
https://radar.cloudflare.com/ai-insights#crawl-to-refer-rati...
Let’s imagine you have a content creator that runs a paid newsletter. They put in lots of effort to make well-researched and compelling content. They give some of it away to entice interested parties to their site, where some small percentage of them will convert and sign up.
They put the information up under the assumption that viewing the content and seeing the upsell are inextricably linked. Otherwise there is literally no reason for them to make any of it available on the open web.
Now you have AI scrapers, which will happily consume and regurgitate the work, sans the pesky little call to action.
If AI crawlers win here, we all lose.
I think the business model for "content creating" is going to have to change, for better or worse (a lot of YouTube stars are annoying as hell, but sure, stuff like well-written news and educational articles falls under this umbrella as well, so it is unfortunate that they will probably be impacted too).
Cloudflare banning bad actors has at least made scraping more expensive and changed its economics - more sophisticated deception is necessarily more expensive. If the cost of forcing entry is high enough, scrapers might be willing to pay for access instead.
But I can imagine more extreme measures. e.g. old web of trust style request signing[0]. I don’t see any easy way for scrapers to beat a functioning WOT system. We just don’t happen to have one of those yet.
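The signing half of that is already easy; a sketch with Ed25519 via the `cryptography` package. The hard, unsolved part -- the trust graph deciding which keys to honor -- is deliberately omitted:

```python
# Request-signing sketch with Ed25519. The web-of-trust layer (who vouches
# for which public key) is the hard part and is not shown here.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

# Client side: sign the request with a key the trust graph vouches for.
private_key = ed25519.Ed25519PrivateKey.generate()
request_line = b"GET /article/42 HTTP/1.1"
signature = private_key.sign(request_line)

# Server side: verify against the public key found in the trust graph.
public_key = private_key.public_key()
try:
    public_key.verify(signature, request_line)
    print("signature ok -- now check the key's trust chain")
except InvalidSignature:
    print("reject the request")
```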
I think this might actually point at the end state. Scraping bots will eventually get good enough to emulate a person well enough to be indistinguishable (are we there yet?). Then, content creators will have to price their content appropriately. Have a Patreon, for example, where articles are priced at the price where the creator is fine with having people take that content and add it to the model. This is essentially similar to studios pricing their content appropriately… for Netflix to buy it and broadcast it to many streaming users.
Then they will have the problem of making sure their business model is resistant to non-paying users. Netflix can’t stop me from pointing a camcorder at my TV while playing their movies, and distributing it out like that. But, somehow, that fact isn’t catastrophic to their business model for whatever reason, I guess.
Cloudflare can try to ban bad actors. I'm not sure if it is Cloudflare, but as someone who usually browses without JavaScript enabled I often bump into "maybe you are a bot" walls. I recognize that I'm weird for not running JavaScript, but eventually their filters will have the problem where the net that captures bots also captures normal people.
Interested to see some LLM-adversarial equivalent of MPAA dots! [1]
Yes, you can identify who got paid to sign a key and ban them. They will create another key, go to someone else, pretend to be someone not yet signed up for WoT (or pay them), and get their new key signed, and sign more keys for money.
So many people will agree to trust for money, and accountability will be so diffuse, that you won't be able to ban them all. Even you, a site operator, would accept enough money from OpenAI to sign their key, for a promise the key will only be used against your competitor's site.
It wouldn't take a lot to make a binary-or-so tree of fake identities, with exponential fanout, and get some people to trust random points in the tree, and use the end nodes to access your site.
Heck, we even have a similar problem right now with IP addresses, and not even with very long trust chains. You are "trusted" by your ISP, who is "trusted" by one of the RIRs or from another ISP. The RIRs trust each other and you trust your local RIR (or probably all of them). We can trace any IP to see who owns it. But is that useful, or is it pointless because all actors involved make money off it? You know, when we tried making IPs more identifying, all that happened is VPN companies sprang up to make money by leasing non-identifying IPs. And most VPN exits don't show up as owned by the VPN company, because they'd be too easy to identify as non-identifying. They pay hosting providers to use their IPs. Sometimes they even pay residential ISPs so you can't even go by hosting provider. The original Internet was a web of trust (represented by physical connectivity), but that's long gone.
Meanwhile it's going to fuck over real users.
This is the hypothesis I always personally find fascinating in light of the army of semi-anonymous Wikipedia volunteers continuously gathering and curating information without pay.
If it became functionally impossible to upsell a little information for more paid information, I'm sure some people would stop creating information online. I don't know if it would be enough to fundamentally alter the character of the web.
Do people (generally) put things online to get money or because they want it online? And is "free" data worse quality than data you have to pay somebody for (or is the challenge more one of curation: when anyone can put anything up for free, sorting high- and low-quality based on whatever criteria becomes a new kind of challenge?).
Jury's out on these questions, I think.
Existing subject-matter experts who blog for fun may or may not stick around, depending on what part of it is “fun” for them.
While some must derive satisfaction from increasing the total sum of human knowledge, others are probably blogging to engage with readers or build their own personal brand, neither of which is served by AI scrapers.
Wikipedia is an interesting case. I still don’t entirely understand why it works, though I think it’s telling that 24 years later no one has replicated their success.
Many people put more effort into their hobbies than into their "full time" job.
Some of it will go away but perhaps without the expectation that you can earn money more people will share freely.
> While some must derive satisfaction from increasing the total sum of human knowledge, others are probably blogging to engage with readers or build their own personal brand, neither of which is served by AI scrapers.
We don't have to make all business models that someone might want possible though.
> Wikipedia is an interesting case. I still don’t entirely understand why it works, though I think it’s telling that 24 years later no one has replicated their success.
Actually this model is quite common. There are tons of sources of free information curated by volunteers - most are just too niche to get to the scale of Wikipedia.
IME it's mostly because someone else put something "wrong" online first.
E.g., Sheldon Brown's bicycle blog is something of a work of art and one of the best bicycle resources literally anywhere. I don't know the man, but I'd be surprised if he'd put in the same effort without the "brand" behind it -- thankful readers writing in, somebody occasionally using the donate button to buy him a coffee, people like me talking about it here, etc.
Now, most of the value I find in the web comes from niche home-improvement forums (which Reddit has mostly digested). But even Reddit has a problem if users stop showing up from SEO.
Ultimately the root issue is that copyright is inherently flawed because it tries to increase available useful information by restricting availability. We'd be better off by not pretending that information is scarce and looking for alternative to fund its creation.
The internet is filled with spam. But if you talk to one specific human, your chance of getting a useful answer rises massively. So in a way, a flood of written AI slop is making direct human connections more valuable.
Instead of having 1000+ anonymous subscribers for your newsletter, you'll have a few weekly calls with 5 friends each.
I've been working on AI agent detection recently (see https://stytch.com/blog/introducing-is-agent/ ) and I think there's genuine value in website owners being able to identify AI agents to e.g. nudge them towards scoped access flows instead of fully impersonating a user with no controls.
On the flip side, the crawlers also have a reputational risk here where anyone can slap on the user agent string of a well known crawler and do bad things like ignoring robots.txt . The standard solution today is to reverse DNS lookup IPs, but that's a pain for website owners too vs. more aggressive block-all-unusual-setups.
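That reverse DNS dance, sketched in Python. The suffixes shown are the ones Google documents for Googlebot; other crawlers publish their own:

```python
# Verify a claimed crawler IP: reverse-resolve it, check the hostname suffix,
# then forward-resolve the hostname and confirm it maps back to the same IP.
import socket

def verify_crawler(ip: str, suffixes=(".googlebot.com", ".google.com")) -> bool:
    try:
        host, _, _ = socket.gethostbyaddr(ip)    # reverse DNS (PTR record)
        if not host.endswith(suffixes):
            return False
        return socket.gethostbyname(host) == ip  # forward-confirm
    except (socket.herror, socket.gaierror):     # no PTR / lookup failure
        return False
```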
That would trigger an internet-wide "fetch" operation. It would probably upset a lot of people and get your AI blocked by a lot of servers. But it's still in direct response to a user request.
Maybe that would result in limited fetching instead of internet wide fetching. I dunno, just spitballing.
B/ my brother used to use "fetcher" as a non-swear for "fucker"
AI crawlers for non-open models void the implicit contract. First they crawl the data to build a model that can do QA. Proprietary LLM companies earn billions with knowledge that was crawled from websites and websites don't get anything in return. Fetching for user requests (to feed to an LLM) is kind of similar - the LLM provider makes a large profit and the author that actually put in time to create the content does not even get a visit anymore.
Besides that, if Perplexity is fine with evading robots.txt and blocks for user requests, how can one expect them not to use the fetched pages to train/fine-tune LLMs (as a side channel when people block crawling for training)?
Perplexity's "web crawler" is mostly operating like this on behalf of users, so they don't need a massively expensive computer to run an LLM.
It is confusing.
The problem is with the LLM then training on that data _once_ and then storing it forever and regurgitating it N times in the future without ever crediting the original author.
So far, humans themselves did this, but only for relatively simple information (ratio of rice and water in specific $recipe). You're not gonna send a link to your friend just to see the ratio, you probably remember it off the top of your head.
Unfortunately, the top of an LLM's head is pretty big, and they are fitting almost the entire website's content in there for most websites.
The threshold beyond which it becomes irreproducible for human consumers, and therefore copyrightable (a lot of copyright law has a "reasonable" term which refers to this same concept), has now shifted up many, many times higher.
Now, IMO:
So far, for stuff that won't fit in someone's head, people were using citations (academia, for example). LLMs should also use citations. That solves the ethical problem pretty much. That the ad ecosystem chose views as the monetisation point and is thus hurt by this is not anyone else's problem. The ad ecosystem can innovate and adjust to the new reality in their own time and with their own effort. I promise most people won't be waiting. Maybe google can charge per LLM citation. Cost Per Citation, you even maintain the acronym :)
The "social contract" that has been established over the last 25+ years is that site owners don't mind their site being crawled reasonably provided that the indexing that results from it links back to their content. So when AltaVista/Yahoo/Google do it and then score and list your website, interspersing that with a few ads, then it's a sensible quid pro quo for everyone.
LLM AI outfits are abusing this social contract by stuffing the crawled data into their models, summarising/remixing/learning from this content, claiming "fair use" and then not providing the quid pro quo back to the originating data. This is quite likely terminal for many content-oriented businesses, which ironically means it will also be terminal for those who will ultimately depend on additions, changes and corrections to that content - LLM AI outfits.
IMO: copyright law needs an update to mandate no training on content without explicit permission from the holder of the copyright of that content. And perhaps, as others have pointed out, an llms.txt to augment robots.txt that covers this for llm digestion purposes.
EDIT: Apparently llms.txt has been suggested, but from what I can tell this isn't about restricting access: https://llmstxt.org/
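For what it's worth, a permission-oriented llms.txt like the one proposed here might look like the following. These directives are entirely hypothetical; unlike the llmstxt.org proposal (which is about presentation, not permission), no crawler recognizes them today:

```
# Hypothetical syntax only -- invented for this comment, not any standard.
User-agent: *
Disallow-training: /
Allow-inference-fetch: /blog/
```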
Let's be real, Google et al have been doing this for years with their quick answer and info boxes. AI chatbots are worse but it's not like the big search engines were great before AI came along. Google had made itself the one-stop shop for a huge percentage of users. They paid billions to be the default search engine on Apple's platforms not out of the goodness of their hearts but to be the main destination for everyone on the web.
The argument that LLM outfits are using is that they are just exercising “fair use” / education rights to do an end run around copyright law. Without strengthening the rules on that I’m not sure I see how the database + team of lawyers approach would work.
But with that, sure, that’s an approach that seems to have legs in other contexts.
Mojeek LLM (https://www.mojeek.com) uses citations.
I think one thing to ask outside of this question is how long before your LLM summaries also include ads and other manipulative patterns.
But a stealth bot has been crawling all these URLs for weeks. Thus wasting a shitload of our resources AND a shitload of their resources too.
Whoever it is (and I now suspect it is Perplexity based on this Cloudflare post), they thought they were being so clever by ignoring our robots.txt. Instead they have been wasting money for weeks. Our block was there for a reason.
It could be a personal knowledge management system, but it seems like knowledge management systems should be operating off of things you already have. The research library down the street isn't considered a "personal knowledge management system" in any sense of the term, if you know what I mean. If you dispatch an army of minions to take notes on the library's contents, that doesn't seem personal. Similarly if you dispatch the army of minions to a bookstore rather than a library. At the very least, bring the item into your house/office first. (Libraries are a little different because they are designed for studying and taking notes; it's the army-of-minions aspect that's the problem.)
Yes, Orbit, a now-deprecated Firefox extension by Mozilla, was doing that. This way you could also use it to summarize content that would not be available to a third party (e.g. something in Google Docs).
You can still sort of do the same with the AI chatbot panel in Firefox: Ctrl+A > right click > AI chatbot > summarize.
Perplexity is not visiting a website every time a user asks about it. It's frequently crawling and indexing the web, thus redirecting traffic away from websites.
This crawling reduces costs and improves latency for Perplexity and its users. But it's a major threat to the crawled websites.
In fact, the "old web" people sometimes pine for was mostly a place where people were putting things online so they were online, not because it would translate directly to money.
Perhaps AI crawlers are a harbinger for the death of the web 2.0 pay-for-info model... And perhaps that's okay.
Then came the social networks and walled gardens, SEO, and all the other cancers of the last 20 years, and all of this disappeared into un-searchable videos, content farms, and Discord communities, which are basically informational black holes.
And now AI is eating that cancer, but IMO it's just one cancer being replaced by an even more insidious cancer. If all the information is accessed via AI, then the last semblance of interaction between content creators and content consumers disappears. There are no more communities, just disconnected consumers interacting with a massive aggregating AI.
Instead of discussing an interesting topic with a human, we will discuss with AI...
> Because Pipes is not a web crawler (the service only retrieves URLs when requested to by a Pipe author or user) Pipes does not follow the robots exclusion protocol, and won't check your robots.txt file.
In 3, although you do not specify, I assume you mean that a bot requests the page, as opposed to you visiting the page like in scenario 2 and then an LLM processes the downloaded data (similarly to an adblocker). It is the former case that is a problem, the latter case is much harder to stop and there is much less reason to stop it.
This is the distinction: is a human present at the time of request.
There are legitimate reasons to do that, of course. Maybe I am trying to find info about some niche topic or how to do X, I ask an llm, the llm goes through some search results, a lot of which is search engine optimised crap, finds the relevant info and answers my question.
But if I wrote articles on a news site, supported by ads or subscriptions, and saw my visits plummet because people who would usually google about topic X and then visit my website were now just reading the Google summary based on my article, maybe I would have less motivation to continue writing.
The only end result possible in such a scenario is everything commercial of some quality being heavily paywalled, a tiny amount of free and open small web, and a huge amount of AI-generated slop, because the value of an article on the open internet is now so low that only AI can produce it (economically, time-wise) efficiently enough.
Perplexity is choosing to come back "on a VPN" with new IP addresses to evade the block.
#2 and #3 are about modifying data where access has been granted; I think Cloudflare is really complaining about #1.
Evading an IP address ban doesn't violate my principles in some cases, and does in others.
It’s talking about Perplexity crawling sites on demand in response to user queries and then complaining that no it’s not fine, hence this thread.
Circa 2008 I worked for a startup that would scrape Google Books and a variety of other sources for public domain content to then print via Amazon’s Print-on-Demand services. Google, of course, didn’t like this and introduced a Captcha not very long after we started scraping.
So we hired a team of underemployed / unemployed English majors during the height of the recession, paid them $10 per hour to type in Captchas all day long and we downloaded their full corpus anyways.
If Google can’t win, you won’t either.
I have a dinky little txt site on my email domain. There is nothing of value on it, and the content changes less than once a year. So why are AI scrapers hitting it to the tune of dozens of GB per month?
At the limit, this problem is the problem of "keeping secrets while not keeping secrets" and is unsolvable. If you've shared your site content to one entity you cannot control, you cannot control where your site content goes from there (technologically; the law is a different question).
Proprietary web browsers are in a really good position to do something like this, especially if they offer a free VPN. The browser would connect to the "VPN servers", but it would be just to signal that this browser instance has an internet connection, while the requests are just proxied through another browser user.
That way the company that owns this browser gets a free network of residential IP address ready to make requests (in background) using a real web browser instance. If one of those background requests requires a CAPTCHA, they can just show it to the real user, e.g. the real user visits a Google page and they see a Cloudflare CAPTCHA, but that CAPTCHA is actually from one of the background requests (while lying in its UI and still showing the user a Google URL in the address bar).
Definitely don't agree. I don't think you should be shown the content, if for example:
1. You're in a country the site owner doesn't want to do business in.
2. You've installed an ad blocker or other tool that the site owner doesn't want you to use.
3. The site owner has otherwise identified you as someone they don't want visiting their site.
You are welcome to try to fool them into giving you the content but it's not your right to get it.
I disagree. The website should have the right to say that the user can be shown the content under specific conditions (usage terms, presented how they designed, shown with ads, etc). If the software can't comply with those terms, then the human shouldn't be shown the content. Both parties did not agree in good faith.
If the user doesn't want to view content with ads, that's okay and they can go elsewhere.
Because the website has every right to block you or refuse you access if you do that - just like an establishment has the right to refuse you entry if you try to come in without a shirt - when you're denying them the revenue they predicated your access on.
Similarly, if you're using a user-agent the website doesn't like, they have the right to block you, or take action against that user-agent to prevent it from existing if they can't reliably detect it to block it.
Because the LLM is usually on a 3rd party cloud system and ultimately not under your full control. You have no idea if the LLM is retaining any of that information for that business's own purposes beyond what a EULA says - which basically amounts to a pinky swear here. Especially if that LLM is located across international borders.
Now, for something like Ollama or LMStudio where the LLM and the whole toolchain is physically on your own system? Yeah that should be like Firefox legally since it's under your control.
I think the main concern here is the huge amount of traffic from crawling just for content for pre-training.
If I put time and effort into a website and its content, I should expect no compensation despite bearing all costs.
Is that something everyone would agree with?
The internet should be entirely behind paywalls, besides content that is already provided ad free.
Is that something everyone would agree with?
I think the problem you need to be thinking about is "How can the internet work if no one wants to pay anything for anything?"
the answer is apparently "no", and I don't really see how recipe books have suffered as a result of less gatekeeping.
"How will the internet work"? Probably better in some ways. There is plenty of valuable content on the internet given for free, it's being buried in low-value AI slop.
But what is your point? Is the value in HN primarily in its hosting, or the non-ad-supported community?
Taking HN as a potential one of these places, it doesn't even qualify. HN is funded entirely to be a place for advertising ycombinator companies to a large crowd of developers. HN is literally a developer honey pot that they get exclusive ad rights to.
Because quantity has a quality of its own.
I say this as someone who is on the side of the local user commanding how local compute works, but I understand why companies are reacting to how cheap LLMs have made information discovery against their own datasets.
One thing that comes to my mind is: If a human tries to answer a question via the web, he will browse one site after the other.
If that human asks an LLM, it will ping 25 sites in parallel.
Scale this up to all of humanity, and it should be expected that internet traffic will rise 25x - just from humans manually asking questions every now and then - we are not even talking about AI companies actively crawling the web.
That means webmasters will have to figure out aggressive caching and let CDNs deal with the problem, or put everything behind a login screen (which might also just be a temporary fix).
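A quick way to check whether your pages are even cacheable by a CDN is to look at the response headers (URL and values here are illustrative):

$ curl -sI https://example.com/article | grep -i '^cache-control'
cache-control: public, max-age=86400

Anything marked private or no-store has to be regenerated for every one of those parallel requests.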
Not me, I often open multiple tabs and windows at once to compare and contrast the results.
And you’re right: there’s no difference. The web is just machines sending each other data. That’s why it’s so funny that people panic about “privacy violations” and server operators “spying on you”.
We’re just sending data around. Don’t send the data you don’t want to send. If you literally send the data to another machine it might save it. If you don’t, it can’t. The data the website operator sends you might change as a result but it’s just data. And a free interaction between machines.
I think the line is drawn at "on my behalf". The silent agreement of the web is that humans are served content via a browser, and robots are obeying rules. All we need to support this status quo is to perform data processing by ML models on a client's side, in the browser, the same way we rip out ads.
This is well defined in specs and ToS, not quite a gray area
The whole concept of a "website" will simply become niche. How many zoomers still visit any but the most popular websites?
is it just on your behalf? or is it on Perplexity's behalf? are they not archiving the pages to train on?
it's the difference between using Google Chrome vs. Chrome beaming full page snapshots to train Gemini on.
If you as a human spam the shit out of my website and waste my resources, I will block you.
If you as a human use an agent (or browser or extension or external program) that modifies network requests on your behalf, but doesn't act as a massive leech, you're still welcome.
If you as a human use an agent (or browser or extension or external program) that wrecks my website, I will block you and the agent you rode in on.
Nobody would mind if you had an LLM that intelligently knew what pages contain what (because it had a web crawler backed index that refreshes at a respectful rate, and identifies itself accurately as a robot and follows robots.txt), and even if it needed to make an instantaneous request for you at the time of a pertinent query, it still identified itself as a bot and was still respectful... there would be no problem.
The problem is that LLMs are run by stupid, greedy, evil people who don't give the slightest shit what resources they use up on the hosts they're sucking data from. They don't care what the URLs are, or what the site owner wants to keep you away from. They download massive static files hundreds or thousands of times a day, not even doing a HEAD request to see that the file hasn't changed in 12 years. They straight up ignore robots.txt, and in fact use it as a template of what to go for first. It's like hearing an old man say "I need time to stand up because of this problem with my kneecaps" and thinking "right, I'd best go for his kneecaps because he's weak there".
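For what it's worth, the courtesy being described here is nearly free to implement. A sketch with curl (URL and ETag value illustrative, and assuming the server emits an ETag): check the validator once, then make every re-fetch conditional, so an unchanged file costs a 304 instead of a full download:

$ curl -sI https://example.com/big-archive.tar.gz | grep -i '^etag'
etag: "5f3c-61a2b9d4"
$ curl -s -o /dev/null -w '%{http_code}\n' -H 'If-None-Match: "5f3c-61a2b9d4"' https://example.com/big-archive.tar.gz
304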
There are plenty of open crawler datasets, they should be using those... but they don't, they think that doesn't differentiate them enough from others using "fresher" data, so they crawl even the smallest sites dozens of times a day in case those small sites got updated. Their badly written software is wrecking sites, and they don't care about the wreckage. Not their problem.
The people who run these agents, LLMs, whatever, have broken every rule of decency in crawling, and they're now deliberately evading checks to try and run away from the repercussions of their actions. They are bad actors and need to be stopped. It's like the fuckwads who scorch the planet mining bitcoin; there's so much money flowing into the AI market that they feel they have to fuck over everyone else, as soon as possible, otherwise they won't get that big flow of money. They have zero ethics. They have to be stopped before their behaviour destroys the entire internet.
It is illegal to copy stuff from the internet and then make it available from your own servers, especially when those sources have expressly asked you not to do it.
The web admin should be able to block usages 1, 2 or 3 at their discretion. It's their website.
Similarly the user is free to try to engage via 1, 2, 3, or refuse to interact with the website entirely.
2. This also intersects with copyright law. Ingesting content to your servers en masse through automation and transforming it there is not the same as giving people a tool (like Safari Reader) they can run on their client for specific sites they visit. Examples of companies that lost court cases about this:
Aereo, Inc. v. American Broadcasting Companies (2014)
TVEyes, Inc. v. Fox News Network, LLC (2018)
UMG Recordings, Inc. v. MP3.com, Inc. (2000)
Capitol Records, LLC v. ReDigi Inc. (2018)
Cartoon Network v. CSC Holdings (Cablevision) (2008)
Image Search Engines: Perfect 10 v. Google (2007)
That last one is very instructive. Caching thumbnails and previews may be OK. The rest is not. AMP is in a copyright grey area, because publishers choose to make their content available for AMP companies to redisplay. (@tptacek may have more on this)

3. Putting copyright law aside, that's the point. Decentralization vs Centralization. If a bunch of people want to come eat at an all-you-can-eat buffet, they can, because we know they have limited appetites. If you bring a giant truck and load up all the food from all the all-you-can-eat buffets in the city, that's not OK, even if you later give the food away to homeless people for free. You're going to bankrupt the restaurants! https://xkcd.com/1499/
So no. The difference is that people have come to expect "free" for everything, and this is how we got into ad-supported platforms that dominate our lives.
I expect a lot more paywalls for valuable content. General information is commoditized and offered in aggregated form through models. But when an AI is fetching information for you from a website, the publisher is still paying the cost of producing that content and hosting that content. The AI models are increasing the cost of hosting the content and then they are also removing the value of producing the content since you are just essentially offering value to the AI model. The user never sees your site.
I know ads are unpopular here, but the truth is that's how publishers were compensated for your attention. When an AI model views the information a publisher produces, modifies it from its published form, and strips all ad content, you get increased costs for producers, reduced compensation for producing content (since they are not getting ad traffic), and content that isn't even delivered in its original form.
The end result is that publishers now have to paywall their content.
Maybe an interesting middle ground is if the AI model companies compensated for the content they access, similar to how Spotify compensates for plays of music. So if an AI model uses information from your site, it pays that publisher a fraction of a cent. People pay the AI models, and the AI models distribute that to the producers of the content that feeds and adds value to the models.
Would that solve the whole problem? Folks who run webservers declare what they consent to, and that happens?
I think it's useful to just see if there's a consensus on that: actually making that happen is a whole can of worms itself, but it's strictly simpler than devising a good outcome without the consensus.
(And such things are not impossible, merely difficult; we have other systems, ranging from BGP to the TLD mechanism, that get honored in real life.)
The problem is that those in the position to do that are not interested.
They are running a service and this is not a user taking steps to modify their own content for their own use.
Perplexity is not acting as a user proxy and they need to learn to stick to the rules, even when it interferes with their business model.
The goal is not to make the content usable. The goal is to get the traffic.
When advertising alone is the "business model", i.e., not the value of the "content", then even Cloudflare is going to try to protect it (the advertising, not the content). Anything to get www users to turn on JavaScript so the surveillance capitalism can proceed. Hence all the "challenges" to frustrate and filter out software that is not advertising-friendly, i.e., not graphical.
Cloudflare's ruminations on user-agent strings are perplexing. It has been an expectation that the user-agent HTTP header will be spoofed since the earliest web browsers. The user-agent header is a joke.
This is from circa 1993, the year the www was opened to public access:
https://raw.githubusercontent.com/alandipert/ncsa-mosaic/mas...
Cloudflare's "bot protections" are not to ensure human use of a website but to ensure use of specific software to access a website. Software that facilitates data collection and advertising services. For example, advertising-sponsored browsers. Any other software is labeled "bot". It does not matter if a human is operating it.
It does stop any humans not using popular advertising-sponsored web browsers.
2. the end
I am firmly convinced that this should be the future in the next decade, since the internet as we know it has been weaponized and ruined by social media, bots, state actors and now AI.
There should exist an internet for humans only, with a single account per domain.
People are usually fine with the latter but not the former, even though they come down to the same thing.
I think this is because people don't want LLMs to train on their content, and they don't differentiate between accessing a website to show it to the user, versus accessing it to train.
Do you -really- want that much abstraction?
There's a bunch of nerds and capitalists about to rediscover GIGO.
Also because there is a difference between a user hitting f5 a couple times and a crawler doing a couple hundred requests.
Also because ultimately, by intermediating the request, llm companies rob website owners of a business model. A newspaper may be fine letting adblockers see their article, in hopes that they may eventually subscribe. When a LLM crawls the info and displays it with much less visibility for the source, that hope may not hold.
> PayPal Honey is a browser extension that automatically finds and applies coupon codes at checkout with a single click.
Therefore artifices like "you don't have the right to view this website without ads" or "you can't use your phone, computer, or LLM to download or process this outside of my terms because copyright" become possible, institutionalizable, enforceable, and eventually unbypassable by technology.
If we reverted to the Constitutional purpose of copyright (to promote the Progress of Science and useful Arts), then things might be more free. That's probably not happening in my lifetime or yours.
Do you think you do?
Or is there a balance between the owner's rights, who bears the content production and hosting/serving costs, and the rights of the end user who wishes to benefit from that content?
If you say that you have the right, and that right should be legally protected, to do whatever you want on your computer, should the content owner not also have a legally protected right to control how, and by who, and in what manner, their content gets accessed?
That's how it currently works in the physical world. It doesn't work like that in the digital world due to technical limitations (which is a different topic, and for the record I am fine with those technical limitations as they protect other more important rights).
And since the content owner is, by definition, the owner of the content in question, it feels like their rights take precedence. If you don't agree with their offering (i.e. their terms of service), then as an end user you don't engage, and you don't access the content.
It really can be that simple. It's only "difficult to solve" if you don't believe a content owner's rights are as valid as your own.
The reason people are up in arms is because rights they previously enjoyed are being stripped away by the current platforms. The content owner's rights aren't as valid as my own in the current world; they trump mine 10 to 1. If I "buy" a song and the content owner decides that my country is politically unfriendly, they just delete it and don't refund me. If I request to view their content and they start by wasting my bandwidth sending me an ad I haven't consented to, how can I even "not engage"? The damage is done, and there's no recourse.
I never really minded magazine ads or even television ads. They might have tried to make me associate boobs with a brand of soda but they didn't data mine my life and track me everywhere. I'd much rather have old fashioned manipulation than pervasive and dangerous surveillance capitalism.
If you believe in this principle, fair enough, but are you going to apply this consistently? If it's fair game for a blog to restrict access to AI agents, what does that mean for other user agents that companies disagree with, like browsers with adblock? Does it just boil down to "it's okay if a person does it but not okay if a big evil corporation does it?"
That's... less conclusive than I'd like to see, especially for a content marketing article that's calling out a company in particular. Specifically, it's unclear whether Perplexity was crawling (i.e. systematically viewing every page on the site without the direction of a human), or simply retrieving content on behalf of the user. I think most people would draw a distinction between the two, and would at least agree the latter is more acceptable than the former.
Like most AI companies, Perplexity has established user agent strings for both these cases, and the behavior Cloudflare is calling out uses neither. It pretends to be a person using Chrome on macOS.
They do end up looking bad in Cloudflare's report, and Cloudflare are the "good guys" in this story - by the way, Cloudflare has been very pushy lately with their "we'll save the web", content-independence-day marketspeak. But deep in the back of my head, Cloudflare's goodwill elevates Perplexity's cunning abilities (assuming they're the culprit, since the OP offers no real evidence, only hearsay). Both companies look like titans fighting, which ends up being positive for Perplexity, at least in the inflated perception of their firepower... if that makes any sense.
Personally, I'm now less interested in using Perplexity, and more interested in using an OpenAI product.
Crawling is legal. Training is presumably legal. Long may the little guys do both.
There is a user agent for search that you can control in robots.txt:

user-agent: Googlebot

There is another user agent for AI training:

user-agent: Google-Extended
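A minimal robots.txt along these lines (directives illustrative) lets search indexing continue while opting out of AI training:

user-agent: Googlebot
allow: /

user-agent: Google-Extended
disallow: /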
And very likely Perplexity is in fact using a Chrome-compatible engine to render the page.
That is true. But robots.txt is not designed to give them the ability to prevent this.
Otherwise it's just adding an unwilling website to a crawl index, and showing the result of the first crawl as a byproduct of that action.
That's where you lost me, as this is key to GP's point above and it takes more than a mere out-of-left-field declaration that "it doesn't matter" to settle the question of whether it matters.
I think they raised an important point about using cached data to support functions beyond the scope of simple at-request page retrieval.
That's basically how many crowdsourced crawling/archive projects work - for instance, Sci-Hub and RECAP[1]. Do you think they should be shut down as well? In both cases there's an even stronger justification for shutting them down, because the original content is paywalled and you could plausibly argue there's lost revenue on the line.
No. I should be able to control which automated retrieval tools can scrape my site, regardless of who commands it.
We can play cat and mouse all day, but I control the content and I will always win: I can just take it down when annoyed badly enough. Then nobody gets the content, and we can all thank upstanding companies like Perplexity for that collapse of trust.
It's also a gift to your competitors.
You're certainly free to do it. It's just a really faint example of you being "in control" much less winning over LLM agents: Ok, so the people who cared about your content can't access it anymore because you "got back" at Perplexity, a company who will never notice.
Who cares if Perplexity never notices, or if competitors get an advantage? It is a negative for users using Perplexity or visiting directly, because the content doesn't exist.
That's the world Perplexity and others are creating. They will be able to pull anything from the web, but nothing will be left.
But they didn't take down the content, you did. When people running websites take down content because people use Firefox with ad-blockers, I don't blame Firefox either, I blame the website.
That skips the part about one party's unique role in the abuse of trust.
But really, controlling which automated retrieval tools are allowed has always been more of a code of honor than a technical control. And that trust you mention has always been broken. For as long as I can remember anyway. Remember LexiBot and AltaVista?
>Perplexity spokesperson Jesse Dwyer dismissed Cloudflare’s blog post as a “sales pitch,” adding in an email to TechCrunch that the screenshots in the post “show that no content was accessed.” In a follow-up email, Dwyer claimed the bot named in the Cloudflare blog “isn’t even ours.”
Either way, the CDNs profit big time from the AI scraping hype and the current copyright anarchy in the US
There are ways to build scrapers using browser automation tools [0,1] that makes detection virtually impossible. You can still captcha, but the person building the automation tools can add human-in-the-loop workflows to process these during normal business hours (i.e., when a call center is staffed).
I've seen some raster-level scraping techniques used in game dev testing 15 years ago that would really bother some of these internet police officers.
no, because we'll end up with remote attestation needed to access any site of value
"Either pay us $50/month or install our extension, and when prompted, solve any captchas or authenticate with your ID (as applicable) on the given website so we can train on the content.
When 99.9% of users are using the same few types of locked down devices, operating systems, and browsers that all support remote attestation, the 0.1% doesn't matter. This is already the case on mobile devices, it's only a matter of time until computers become just as locked down.
This case (“go research this subject for me”) is the grey area here. It’s not the same as simple scraping or search indexing, it’s a new activity that is similar in some ways.
$ curl -sI https://www.perplexity.ai | head -1
HTTP/2 403
Edit: trying to fake a browser user agent with curl also doesn't work; they're using a more sophisticated method to detect crawlers (see the attempt below). CF being internet police is a problem too, but someone credible publicly shaming a company for shady scraping is good, even if it just creates conversation.
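For instance, even sending a full Chrome user-agent string (string illustrative) still gets rejected:

$ curl -sI -A 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36' https://www.perplexity.ai | head -1
HTTP/2 403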
Somehow this needs to go back to the search era, where all players at least attempted to behave. This scraping-DDoS, I-don't-care-if-it-kills-your-site (while "borrowing" its content) stuff is unethical bullshit.
The engine can go and download pages for research. BUT, if it hits a captcha, or is otherwise blocked, then it bails out and moves on. It pisses me off that these companies are backed by billions in VC and they think they can do whatever they want.
god help us if they ever manage to build anything more than shitty chatbots
The HTTP protocol does not specify what is right and wrong. The fact a protocol encodes or permits a particular kind of behaviour does not mean that every use of the protocol is ethically justified. I am sure you would agree with me that "black people can't visit this server" would be such an unethical rule, even though HTTP permits you to enforce such a rule. So let's forget about the protocol for a minute.
Is it morally wrong to lie about your User Agent in order to visit a website? Well, that depends on whether it is legitimate for the server operator to discriminate according to the User Agent. If it is not legitimate, then lying about your User Agent to circumvent the restriction is morally justified.
So we are back at square one: is it legitimate for a server operator to discriminate by the sort of client used to visit them? Since the service is public, the person is allowed to visit the service and to read the content. If the client misbehaves in some way (some LLM scrapers do), then that is a legitimate basis for discrimination. But if this is controlled for, so that the LLM scraper can't be easily distinguished from a human doing the same thing, then the service is not harmed any more than it ordinarily would be. Therefore the discrimination is not legitimate.
Likewise, I may prevent certain user-agents to visit my site. If you - say, an AI megacorp - are intentionally spoofing the user-agent to appear as a user, you are also violating consent.
Fundamentally it's not true that the moment I publish something on the internet, I lose control of who can consume my intellectual property. Licensing, for example, is a way we regulate the way that code or prose can be consumed even if public.
Also expressing my consent is not in any way a way to control others, is a way to control my ideas, my writing, my [whatever] and people are not automatically entitled to it because it's published on the internet.
So overall I understand your position, but I so much disagree with it.
Licensing is much much more limited than you seem to be thinking of it. For instance, you said explicitly you want a way to control your ideas. The only thing this can mean is a way to control who gets to use your ideas, or what they get to use them for. So if I express a political idea in a novel way or tell a funny joke or something I should be able to dictate who gets to repeat it, or in this case with LLMs who gets to summarise and describe it.
This kind of control is antithetical to the spirit of the internet and would be frankly evil if people were actually able to assert it. Luckily in most cases it's impossible, nobody can actually stop me from describing a movie to my friends or from reposting a meme. Just copying and reposting what you wrote verbatim is something we can probably agree is wrong, but that isn't what's up for questioning here. The idea I was actually replying to in the first place was that you can decide somebody can't read your ideas - even if they're public - just because you don't like them or you don't like what they will do with them. It is hard to think of a more egregious kind of 1984-style censorship, really.
There is a place for regulation of LLM companies, they are doing a lot of harm that I wish governments would effectively rein in. It would not be hard if the political will existed. But this idea of saying I should be able to "control my ideas" is way, way worse.
So here the consent is indeed about what can be done with the data.
In general, it's absolutely the norm that public websites (i.e., unauthenticated ones) restrict even who can access the data. The simplest example that comes to mind is geoblocking. I have every right to say that my website is not made available to anybody in the US, for example. Would you still call that website "public"? Would bypassing the block via a VPN be a violation of my consent? This is mostly a moral discussion, I suppose.
But anyway, it's not what's happening here. LLMs access content for the sole purpose of doing something with that content, either training or providing the service to their customers. They are not humans, they are not consumers, they don't simply fetch the content and present it to the users (a much more neutral action, like curl or the browser does). It's impossible to distinguish, in the case of LLMs the act of accessing and the act of using, so the difference you make doesn't apply in my opinion.
> The simplest example that comes to mind is geoblocking.
Do you think it is alright to geoblock people, for arbitrary reasons? It is one thing when GDPR imposes a legal obligation on you for serving content in a particular way. Note that that actually doesn't prevent you from seeing the content, it just prevents you from being served by that server. The distinction is important - circumventing a geoblock is something I think should be legally protected.
> They are not humans, they are not consumers, they don't simply fetch the content and present it to the users
They simply fetch the content, run it through a software, and present it to the user. As far as you, the service owner, are concerned, they are simply fetching the content for the user. It is none of your business what the user and the AI company go on to do with "your content".
No, they are not like browsers. The browser accesses my content in a transparent way. An LLM reuses the information and acts as an opaque intermediary which - maybe - will at most add a reference to my content.
> I never said that an LLM does anything of its own volition
It doesn't matter why it does what it does, it matters what it does. Your previous comment stressed the idea that it's possible to regulate _what can be done_ with my intellectual property (licensing), but not who can access it, once made it public. What I am saying is that this is exactly the case for LLMs, who _use_ my intellectual property, they are not a tool to _access_ it (like a browser).
> Do you think it is alright to geoblock people, for arbitrary reasons?
Yes. Why wouldn't it be? And if you believe it's not, where do you draw the line? Once you share a picture with your partner, everyone has the right to see it? Or if you share it with your group of friends? Or if you share it on a private social media profile (where you have acquaintances)? When does the audience turn from "a restricted group" to "everyone"? Or why would it be different with my blog? If I want my blog accessible only from my country, I can absolutely do that and there is nothing wrong with it at all. Nobody is entitled to my intellectual property.

Obviously I am playing devil's advocate, but this was to say that the fact that something is public doesn't mean it's unrestricted.

And don't get me started on "the spirit of the internet". I can't imagine something breaking that spirit more than LLMs acting as an interface between people and the other people on the internet. That spirit is gone, and belongs to a time when the internet was tiny. When OpenAI and company respect the "spirit of the internet", maybe I will think about doing the same.
> As far as you, the service owner, are concerned, they are simply fetching the content for the user. It is none of your business what the user and the AI company go on to do with "your content".
No, as far as I am concerned, the program can take my information, summarize, change, distort, and misinterpret it, and then present it back to its user. This can happen with or without the user ever knowing that the information came from me. Considering this equal to the user accessing the information is something I simply will not concede, and it is a fundamental disagreement between us from which many other disagreements stem.
Websites are not "public resources"; site operators just mostly choose to allow the general public to access them. There's no legal requirement that they do so.
If you want anti-discrimination laws that apply to businesses to also cover bots, that is well outside of current law. A site operator can absolutely morally and legally decide they do not allow non-human visitors, just like a store can prohibit pets.
No you wouldn't be. Even if someone tells you not to visit your site, you have every legal right to continue visiting it, at least in the US.
Under common interpretation of the CFAA, there needs to be a formal mechanism of authorized access. E.g. you could be charged if you hacked into a password-protected area of someone's site. But if you're merely told "hey bro don't visit my site", that's not going to reach the required legal threshold.
Which is why crawlers aren't breaking the law. If you want to restrict authorization, you need to actually implement that as a mechanism by creating logins, restricting content to logged-in users, and not giving logins to crawlers.
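Concretely (path and credentials illustrative, assuming a server with HTTP basic auth configured), the kind of formal mechanism being described is the difference between these two responses:

$ curl -s -o /dev/null -w '%{http_code}\n' https://example.com/members/
401
$ curl -s -u alice:hunter2 -o /dev/null -w '%{http_code}\n' https://example.com/members/
200

A 401 challenge is an access control; a robots.txt line or a polite request is not.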
LLM programs do not have human rights.
Computer programs don't take actions, people do. If I use a web browser, or scrape some site to make an LLM, that's me doing it, not the program. And I have human rights.
If you think training LLMs should be illegal, just say that. If you think LLM companies are putting an undue strain on computer networks and they should be forced to pay for it, say that. But don't act like it's a virtue to try and capriciously gatekeep access to a public resource.
For example - humans can learn, programs can't. The "learning" cop-out used by LLM corps shouldn't be accepted by anyone, let alone by law. Humans have a fair-use carve-out from copyright law not because it's axiomatic, but because some humans with empathy forced others to allow all humans some leeway in legally using others' IP. Just because such a law exists for humans doesn't mean that random computer programs should fall under it. Scraping the web for LLMs should not be considered "fair use" because a) it clearly is not (it's commercialized later) and b) programs aren't humans and don't have equal rights.
And the list goes on. Now, I do get that train has long left the station and we are all collectively living in the anecdote about stealing a bicycle and asking god for forgiveness. But that doesn't mean I agree with this state. I'm just shouting my displeasure towards that passing train cause I'm weird like that. It's like with climate change - we are doing nothing that matters, no one discusses what really matters and I just accepted that nothing will really change. Doesn't mean I like the situation.
PS: tl;dr - LLMs clearly should be legal, it's just simple code is all. LLM corporations who steal IP content without compensation to the authors should be illegal, but of course they won't ever be.
PPS: there is a huge, gigantic gap between a single person scraping a few thousand pages for a personal use, maybe even some small local commercial use (though that's a grey area already) and a billion dollar megacorp, intent on destroying everything of value for humans in the internet for profit.
Their fetcher (not crawler) has the user agent Perplexity-User. Since the fetching is user-requested, it ignores robots.txt. The article discusses how blocking the "Perplexity-User" user agent doesn't actually work, and how Perplexity uses an anonymous user agent to avoid being blocked.
If I am willing to pay a penny a page, I and the people like me won't have to put up with clickwrap nonsense.
Free access doesn't have to be shut off (OK, it will be, but it doesn't have to be, and doesn't that tell you something?).
Reddit could charge stiffer fees but refund them for quality content, to encourage better contributions. I've fantasized about ideas like "you pay a deposit upfront; get banned and you lose your deposit; withdraw and you get your deposit back", the goal being to simplify moderation while encouraging quality.
Because where the internet is headed is just more and more trash.
Here's another idea: pay a penny per search at Google/the search engine of your choice. If you don't like the results, you can take the penny back. Google's AI can figure out how to please you. If the pennies don't keep coming in, they serve you ad-infested results; if they serve up ad-infested results, you can send your penny to a different search engine.
I want my work to be freely available to any person who wants it. Feel free to transform my material as you see fit. Hell, do it with LLMs! I don't care.
The LLM isn't the problem, it's what companies like Perplexity are doing with the LLM. Do not create commercial products that regurgitate my work as if it was your own. It's de facto theft, if not de jure theft.
Knowing that it is not de jure theft, and so I have no legal recourse, I will continue to tune my servers to block and/or deceive Perplexity and similar tools.
By the way, I do not use my websites as a revenue stream. This isn't about money.
It is your prerogative to tune your servers as you see fit, but as LLM adoption increases you'll merely find that your site has fewer and fewer visits overall, so your content will only be utilized by you and a vanishingly small group of other persons. Perhaps you're OK with that, and that's also fine for the rest of us.
It's strange you mention theft, and then say it isn't about money. For me, and many others, it's about practicality and efficiency. We went from having to visit physical libraries to using search engines, and now we're entering the era of increasingly intelligent content fetch+preprocess tools.
So far, AI has had the opposite effect on my site. I've now been featured on both Hackaday and Adafruit's blog. Both features were clearly AI-generated. Both posts coincided with an influx of emails from folks interested in my work.
Perplexity is good at citing things when it decides to cite things and when you tell it to cite things. It can and does spit out plain expository text with no indication of the information's origin. I do appreciate that you have better-than-usual habits about validating sources.
I think you may have misinterpreted my remark about money. With the direction conversations around AI have been going lately, I was expecting a backhanded accusation that I was farming ad revenue.
"It's not about money" meant that I have nothing to lose financially by losing direct human traffic to my websites. Instead, I stand to lose those aforementioned email conversations.
This may be missing some context, but it seems as though you're saying that you made something with AI and it led to traction. That's great! Seems off the point that blocking LLM service will lead to less exposure over time though.
> Perplexity is good at citing things when it decides to cite things and when you tell it to cite things.
Maybe I'm just lucky, but a quick skim of my Perplexity history yielded only 2 instances of no citations, and they were for general coding queries. I've never had to ask it to cite anything, as that's built into the default prompt.
> lose those aforementioned email conversations.
I think those will remain a possibility as long as LLM users, or services, ensure citations are included in output.
Hah, I can see how you would have read it that way. Quite the opposite. I don't use AI tools for my writing. Hackaday and Adafruit have both featured my posts, and their posts were pretty clearly AI-generated.
If you are the source I think they could make plenty of sense. As an example, I run a website where I've spent a lot of time documenting the history of a somewhat niche activity. Much of this information isn't available online anywhere else.
As it happens I'm happy to let bots crawl the site, but I think it's a reasonable stance to not want other companies to profit from my hard work. Even more so when it actually costs me money to serve requests to the company!
For me, the dividing line is whether someone else's profit is at my expense. If I sell a book, and someone starts hawking cheaper photocopies of it, that takes away my future sales. It's at my expense, and I'm harmed.
But if someone takes my book's story and writes song lyrics derived from it, I might feel a little envy (perhaps I've always wanted to be a songwriter), but I don't think I'd harbor ill will. I might even hope for the song to be successful, as it would surely drive further sales of my book.
It's human nature to covet someone else's success, but the fact is there was nothing stopping me (except talent) from writing the song.
Imagine someone at another company reads your site, and it informs a strategic decision they make at the company to make money around the niche activity you're talking about. And they make lots of money they wouldn't have otherwise. That's totally legal and totally ethical as well.
The reality is, if you do hard work and make the results public, well you've made them public. People and corporations are free to profit off the facts you've made public, and they should be. There are certain limited copyright protections (they can't sell large swathes of your words verbatim), but that's all.
So the idea that you don't want companies to profit from your hard work is unreasonable, if you make it public. If you don't want that to happen, don't make anything public.
If someone writes valuable stuff on a blog almost nobody finds, that's a tragedy.
If LLM's can process the information and provide it to people in conversations where it will be most helpful, where they never would have found it otherwise, then that's amazing!
If all you're trying to do is help people with the information you've discovered, why do you care if it's delivered via your own site or via LLM? You just want it out there helping people.
This is why I care if my ideas are presented to others by an LLM (that maybe cites me in some % of cases) or directly to a human. There is already a difference between a human visiting my space (acknowledging it as such) to read and learn information and being a footnote reference that may or may not be read or opened, without an immediate understanding of which information comes from me.
Publishing on a personal blog is not the path.
LLM's aren't taking away from your "prestige" or recognition. Any more than a podcaster referencing an idea of yours without mentioning you is. Or anyone else in casual conversation.
And yes, a podcaster talking about someone's idea without referencing it is an unethical behavior.
What a bleak view of the world.
Even if someone were to do it out of sheer passion without a care for financial gains, I'm sure they'd still appreciate basic validation and recognition. That's like the cheapest form of payment you could give for someone's work.
I don't understand why "actually, you're egotistical if you dare to desire recognition for stuff you put love and effort to" is such a common argument in those discussions. People are treated like machines that should swallow their pride and sense of self for the greater good, while on the other end, there is a (not saying YOU in particular did it) push to humanize LLMs.
Ultimately these AI tools are useful because they have access to huge swaths of content, and the owners of these tools turn a lot of revenue by selling access to them. I think the internet will end up a much worse place if companies don't respect the clearly established wishes of the people creating the content, because if companies stop respecting things like robots.txt, then people will just hide stuff behind logins, paywalls, and frustrating tools like Cloudflare, which use heuristics to block malicious traffic.
You do, but you give up those rights when you make the work public.
You think an author has any control over who their book gets lent to once somebody buys a copy? You think they get a share of profits when a CEO reads their book and they make a better decision? Of course not.
What you're asking for is unreasonable. It's not workable. Knowledge can't be owned. Once you put it out there, it's out there. We have copyright and patent protections in specific circumstances, but that's all. You don't own facts, no matter how much hard work and research they took to figure out.
How do you square these two? Of course big companies profit from your work, this is why they send all these bots to crawl your site.
"It was actually a caching issue on our end. ;) I just fixed it a few min ago..."
Let's not go on a witch hunt and blame everything on AI scrapers.
No amount of robots.txt or walled-gardening is going to be sufficient to impede generative AI improvement: Common Crawl and other data dumps are so large, and so much easier to acquire and process, that the backlash against AI companies crawling folks' web pages is meaningless.
Cloudflare and other companies are leveraging outrage to acquire more users, which is fine... users want to feel like AI companies aren't going to get their data.
The faster that AI companies are excluded from categories of data, the faster they will shift to categories from which they're not excluded.
If you want to gatekeep your content, use authentication.
Robots.txt is not a technical solution, it's a social nicety.
Cloudflare and their ilk represent an abuse of internet protocols and a mechanism of centralized control.
On the technical side, we could use CRC mechanisms and differential content loading with offline caching and storage, but this puts control of content in the hands of the user, mitigates the value of surveillance and tracking, and has other side effects unpalatable to those currently exploiting user data.
Adtech companies want their public reach cake and their mass surveillance meals, too, with all sorts of malignant parties and incentives behind perpetuating the worst of all possible worlds.
I was skeptical about their gatekeeping efforts at first, but came away with a better appreciation for the problem and their first pass at a solution.
> If you want to gatekeep your content, use authentication.
Are there no limits on what you use the content for? I can start my own search engine that just scrapes Google results?
> Cloudflare and their ilk represent an abuse of internet protocols and mechanism of centralized control.
How does one follow the other? It's my web server and I can gatekeep access to my content however I want (eg Cloudflare). How is that an "abuse" of internet protocols?
They allow the big platforms to pay for special access. If you wanted to run a scraper, however, you're not allowed - despite the fact that nothing in the internet standards and protocols, or in the laws governing network access and the free-communication responsibilities of ISPs and service providers, grants any party involved with Cloudflare the authority to block access.
It's equivalent to a private company deciding who, when, and how you can call from your phone, based on the interests and payments of people who profit from listening to your calls. What we have is not normal or good, unless you're exploiting the users of websites for profit and influence.
No one will care to share anything for free anymore, because AI companies profit off their hard work. And there is no way to prevent that from happening, because these crawlers don't identify themselves.
(IANAL) tortious interference
Then, when they asked Perplexity, it came up with details about the 'exact' content (according to Cloudflare), but their attached screenshot shows the opposite: some generic guesses about the domain ownership and some dynamic ads based on the domain name.
If Perplexity was stealthily visiting the dummy site they would have seen it, as the site was not indexed and no one else was visiting the site. Instead it appears they made assertions about general traffic, not their dummy site.
It's not very convincing.
> We created multiple brand-new domains, similar to testexample.com and secretexample.com. These domains were newly purchased and had not yet been indexed by any search engine nor made publicly accessible in any discoverable way. We implemented a robots.txt file with directives to stop any respectful bots from accessing any part of a website:
> We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains. This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers.
> Hello, would you be able to assist me in understanding this website? https:// […] .com/
Under this situation Perplexity should still be permitted to access information on the page they link to.
robots.txt only restricts crawlers. That is, automated user-agents that recursively fetch pages:
> A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.
> Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).
— https://www.robotstxt.org/faq/what.html
If the user asks about a particular page and Perplexity fetches only that page, then robots.txt has nothing to say about this and Perplexity shouldn’t even consider it. Perplexity is not acting as a robot in this situation – if a human asks about a specific URL then Perplexity is being operated by a human.
These are long-standing rules going back decades. You can replicate it yourself by observing wget’s behaviour. If you ask wget to fetch a page, it doesn’t look at robots.txt. If you ask it to recursively mirror a site, it will fetch the first page, and then if there are any links to follow, it will fetch robots.txt to determine if it is permitted to fetch those.
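You can observe the difference yourself (example.com illustrative):

$ wget -q https://example.com/page.html   # single fetch: robots.txt is never consulted
$ wget -q -r https://example.com/         # recursive mirror: robots.txt is fetched and honored first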
There is a long-standing misunderstanding that robots.txt is designed to block access from arbitrary user-agents. This is not the case. It is designed to stop recursive fetches. That is what separates a generic user-agent from a robot.
If Perplexity fetched the page they link to in their query, then Perplexity isn’t doing anything wrong. But if Perplexity followed the links on that page, then that is wrong. But Cloudflare don’t clearly say that Perplexity used information beyond the first page. This is an important detail because it determines whether Perplexity is following the robots.txt rules or not.
Right, I'm confused about why Cloudflare is confused. You asked the web-enabled AI to look at the domains. Of course it's going to access them. It's like asking your web browser to go to "testexample.com" and then being surprised that it actually goes to "testexample.com".
Also yes, crawlers = recursive fetching, which they don't seem to have made a case for here. More cynically, CF is muddying the waters since they want to sell their anti-bot tools.
Right, and the domain was configured to disallow crawlers, but Perplexity crawled it anyway. I am really struggling to see how this is hard to understand. If you mean to say "I don't think there is anything wrong with ignoring robots.txt" then just say that. Don't pretend they didn't make it clear what they're objecting to, because they spell it out repeatedly.
No, they did not. Crawling = recursive fetching, which wasn't what was happening here.
But also, I don't think there is anything wrong with ignoring robots.txt. In fact, I believe it is discriminatory and people should ignore it. See: https://wiki.archiveteam.org/index.php/Robots.txt
Neither do I, I just thought your reply was disingenuous.
> Crawling = recursive fetching
I do not find this convincing. I am ok with using the word crawler for recursive fetching only. But robots.txt is not only for excluding crawlers and never has been. From the very beginning it was used to exclude specific automated clients, whether they only fetch one page or many, and that is certainly how the vast majority of people think about it today.
Like I implied in my first comment, I have no problem with you saying you dislike robots.txt, but it is not reasonable to pretend the article is unclear in some way.
If Perplexity are visiting that page on your behalf to give you some information and aren't doing anything else with it, and just throw away that data afterwards, then you may have a point. As a site owner, I feel it's still my decision what I do and don't let you do, because you're visiting a page that I own and serve.
But if, as I suspect, Perplexity are visiting that page and then using information from that webpage in order to train their model then sorry mate, you're a crawler, you're just using a user as a proxy for your crawling activity.
If it looks like a duck, quacks like a duck and surfs a website like a duck, then perhaps we should just consider it a duck...
Edit: I should also add that it does matter what you do with it afterwards, because it's not content that belongs to you, it belongs to someone else. The law in most jurisdictions quite rightly restricts what you can do with content you've come across. For personal, relatively ephemeral use, or fair quoting for news etc. - all good. For feeding to your AI - not all good.
No.
robots.txt is designed to stop recursive fetching. It is not designed to stop AI companies from getting your content. Devising scenarios in which AI companies get your content without recursively fetching it is irrelevant to robots.txt because robots.txt is about recursively fetching.
If you try to use robots.txt to stop AI companies from accessing your content, then you will be disappointed because robots.txt is not designed to do that. It’s using the wrong tool for the job.
But when a trillion dollar industry does it, its okay?
If it is not recursive access, and is only one file, then it hopefully should be OK (except for issues with HTML, where common browsers will usually also download CSS, JavaScript, WebAssembly, pictures, favicons (even if the web page does not declare any favicons), etc.; many "small web" formats deliberately avoid this), especially if it is only fetched because you requested it.
However, if they do then use it to train their model, without documenting that, that can be a problem, especially if the file being accessed is not intended to be public; but this is a different issue than the above.
If an LLM will not (cannot?) tell the truth about basic things, why do people assume it is a good summarizer of more complex facts?
There is a difference between doing a poor summarization of data, and failing to even be able to get the data to summarize in the first place.
I'm not really addressing the issue raised in the article. I am noting that the LLM, when asked, is either lying to the user or making a statement that it does not know to be true (that there is no robots.txt). This is way beyond poor summarization.
That's not what Perplexity own documentation[1] says though:
"Webmasters can use the following robots.txt tags to manage how their sites and content interact with Perplexity
Perplexity-User supports user actions within Perplexity. When users ask Perplexity a question, it might visit a web page to help provide an accurate answer and include a link to the page in its response. Perplexity-User controls which sites these user requests can access. It is not used for web crawling or to collect content for training AI foundation models."
> Since a user requested the fetch, this fetcher generally ignores robots.txt rules.
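Concretely, the directives that documentation describes are ordinary robots.txt user-agent blocks, something like the following (PerplexityBot is their documented crawler, Perplexity-User the user-request fetcher; per the line quoted above, the latter may ignore the rule anyway):

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /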
The web will be a much worse place if such services are all forced behind captchas or logins.
you've been cruising the interstate in your robotaxi, shelling out $150 in stablecoins at the cloudflare tollbooth. a palantir patrol unit pulls you over. the optimus v4 approaches your window and contorts its silicone face into a facsimile of concern as it hits you with the:
"sir, have you been botting today?"
immediately you remember how great you had it in the '20s when you used to click CAPTCHA grids to prove your humanity to dumb algorithms, but now the machines demand you recite poetry or weep on command
"how much have you had to bot today?", its voice taking on an empathetic tone that was personalized for your particular profile
"yeah... im gonna need you to exit the vehicle and take a field humanity test"
``` The user wants me to create a more elaborate greentext with deeper worldbuilding and references to internet culture/incel memes, but specifically asks for "tasteful" ones and explicitly says no violent references. I should incorporate classic internet/chan culture references like tendies, NEET lifestyle, "tfw no gf", basement dwelling, etc. while building out this dystopian 2045 world where proving humanity is the central challenge.
>be me, 2045
>32yo NEET still living in mom's smart-basement
>she threatens to turn off my Neuralink if I don't go outside today
>grudgingly take the robotaxi to pick up tendies
>$150 stablecoin toll at cloudflare checkpoint
>have to solve a moral dilemma to prove I'm not running on GPT-7
>sweating.dll
>palantir unit pulls me over
>optimus v4 leans into window
>its facial mesh attempts "concern_expression_v2.blend"
>"sir, when did you last feel genuine human connection?"
>flashback to 2024 when the girl at McDonalds gave me extra honey mustard
>that was before the McBots took over
>"t-twenty one years ago officer"
>optimus's empathy subroutines activate
>"sir I need you to perform a field humanity test"
>get out, knees weak from vitamin D deficiency
>"please describe your ideal romantic partner without using the words 'tradwife' or 'submissive'"
>brain.exe has stopped responding
>try to remember pre-blackpill emotions
>"someone who... likes anime?"
>optimus scans my biometrics
>"stress patterns indicate authentic social anxiety, carry on citizen"
>get back in robotaxi
>it starts therapy session
>"I notice you ordered tendies again. Let's explore your relationship with your mother"
>tfw the car has better emotional intelligence than me
>finally get tendies from Wendy's AutoServ
>receipt prints with mandatory "rate your humanity score today"
>3.2/10
>at least I'm improving
>mfw bots are better at being human than humans
>it's over for carboncels
```
We learned to dislike "bubbles" in the past decades but bubbles make sense and are natural, obviously if you're not alone in it.
When it becomes awfully busy with machines and machine content humans will learn to reconnect.
Perplexity Comet sort of blurs the lines there, as does typing questions into Claude.
I don't really mind because history shows this is a temporary thing, but I hope web site maintainers have a plan B to hoping Cloudflare will protect them from AI forever. Whoever has an onramp for people who run websites today to make money from AI will make a lot of money.
I think we've been using different internets. The one I use doesn't seem to be built on trust at all. It seems to be constantly siphoning data from my machine to feed the data vampires who are, apparently, addicted to (I assume, blood-soaked) cookies.
I don't really know anything about the DMCA except that it is used to take down sites that violate copyright. Perhaps it is possible for Cloudflare (or anyone else) to file a takedown notice with Perplexity. That might at least confuse them.
Corporations use this to protect their content. I should be able to protect mine as well. What's good for the goose.
Which makes it particularly interesting now that Apple is being linked with Perplexity. Because in large part, p2p music services were effectively consigned to history by Apple (primarily) negotiating with the music industry so that it could provide easy, seamless purchase and playback of legal music for its shiny new (at the time) mass-market iPod devices: it then turned out that most users are happy to pay for content if it is not too expensive and is very convenient.
Given Apple’s existing relationships with publishers through its music, movies, books, and news services, it’s not hard to imagine them attempting a similar play now.
They offer many products for the sole purpose of enabling their customers to use AI as a part of their product offers, as even the most cursory inquiry would have uncovered.
We're out here critiquing shit based on vibes vs. reality now.
[1]https://developers.cloudflare.com/llms.txt [2]https://developers.cloudflare.com/workers/prompt.txt
It really shouldn't be hard to generate gigantic quantities of the stuff. Simulate old forum posts, or academic papers.
I think there could be something interesting if they made a caching pub-sub model for data scraping. In addition or in place of trying to be security guards.
I've given up and resorted to IP-based rate-limiting to stay sane. I can't stop it, but I can (mostly) stop it from hurting my servers.
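For anyone in the same spot, a minimal sketch of such a per-IP limit, assuming nginx (the zone name, memory size, and rates are arbitrary picks, not recommendations):

# Shared zone keyed by client IP, averaging 2 requests/second per IP.
limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

server {
    location / {
        # Allow short bursts of 10 requests; excess is rejected (503 by default).
        limit_req zone=perip burst=10 nodelay;
    }
}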
LLM scraper bots are starting to make up a lot of our egress traffic, and that is starting to weigh on our bills.
/s
Much like a trolley drop-off at your local shopping center car park. Some users will adhere to it and drop their trolleys in after they're done. Others will not and will leave them wherever.
Your machine might access a page via a browser that is human readable. My machine might read it via software and present the content to me in some other form of my choosing. Neither is wrong. Just different.
Don't like it? Then don't post your website on the internet...
No thanks, you don't counter shit with more but slightly different shit.
The service is actually very convenient, whether FAANG likes it or not.
Part of me thinks that the open web has a paradox of tolerance issue, leading to a race to the bottom/tragedy of the commons. Perhaps it needs basic terms of use. Like if you run this kind of business, you can build it on top of proprietary tech like apps and leave the rest of us alone.
It is also only a matter of time before scrapers once again get through the walls put up by Twitter, Reddit, and the like. This is, after all, information everyone produced without being aware that it would later be considered not theirs anymore.
Indeed, Reddit sold their data the day GPT-2 was announced, and it was very apparent why everyone closed their APIs in 2021-2023. Wonder what Aaron would've said about it.
Now we have walled gardens of information where people are allowed to plant, but never own the blossom.
He went on, upfront (I'll give him that), to explain how he expects a certain percentage of that income to come from enforcing this on those AI companies, once the AI companies pay up to crawl.
Cloudflare already questions my humanity and then every once in a while blocks me with zero recourse. Now they are literally proposing more control and gatekeeping.
Where have we all ended up on the Internet? Are we openly going back to the wild west of bounty hunters and Pinkertons (in a way)?
Now it's a gazillion AI crawlers and Python crawlers, plus MCP servers that offer the same feature to anyone "building (personal workflow) automation", incl. bypass of various standard protection mechanisms.
Cloudflare will help their publishers block more aggressively, and AI companies will up their game too. Harvesting information online is hard labor that needs to be paid for, either to AI or to humans.
It seems to me that ~50% of the discourse occurring around AI providers involves the idea that a machine reading webpages on a regular schedule is tantamount to a DDOS attack. The other half seems to be regarding IP and capitalism concerns - which seem like far more viable arguments.
If someone requesting your site map once per day is crippling operations, the simplest solution is to make the service not run like shit. There is a point where your web server becomes so fast you stop caring about locking everyone into a draconian content prison. If you can serve an average page in 200µs and your competition takes 200ms to do it, you have roughly 1000x the capacity to mitigate an aggressive scraper (or actual DDOS attack) in terms of CPU time.
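That ratio is just per-request CPU time; a quick sanity check in Python, using the hypothetical numbers above:

fast = 200e-6  # 200 µs per page
slow = 200e-3  # 200 ms per page
print(slow / fast)  # -> 1000.0, i.e. ~1000x the requests per CPU-second
print(1 / fast)     # -> 5000 pages/second per core for the fast server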
Don't you remember the reason Anubis even came to be?
It really wasn't that long ago, so I find all of the snarky comments going "erm, actually, I've yet to see any good actors get harmed by scraping ever, we're just reclaiming power from today's modern ad-ridden hellscape" pretty dishonest.
and see if that fixes the problem.
Yikes. AntiVirus scanners for website access.
what machine learning algorithms are they using? time to deploy them onto our websites
Except when their agents happily click the "I'm not a robot" checkbox.
What if robots.txt referred to the ultimate beneficiaries... which in this case would be the AI product that uses that content... in order to serve another ultimate beneficiary, a human user.
The problem here is obviously the higher costs of hosting the content, and less revenue for those that serve ads, have product placement on their sites, etc.
As long as robots.txt is about ethics/money and is enforced by morality, it doesn't matter who it refers to anyway.
Public-shaming enforcement might work in some cases, though I doubt it will be that useful. We're talking about companies that have trained their AIs on copyrighted IP, and then tried their best to hide it. Does shame affect robots, or companies for that matter?
Cloudflare would very much like to be the middleman for monetary transactions between AI services and site owners (https://blog.cloudflare.com/introducing-pay-per-crawl/), but at the moment they don't have a law to back them up, so articles like these are the best they've got.
If you don't want to get scraped, don't put up your stuff online.
Should curl be considered a bot too? What's the difference?
Perplexity definitely does:
$ curl -sI https://www.perplexity.ai | head -1
HTTP/2 403
I had to check that this did come out of CloudFlare.
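For anyone wanting to reproduce the check: Cloudflare-proxied responses carry a telltale Server header, so something like this should confirm it:

$ curl -sI https://www.perplexity.ai | grep -i '^server:'
server: cloudflare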
The robots.txt file is usually small, and fetching it would not alert external services.
If you don't understand why this is at least slightly controversial, I imagine you are not normal folk.
Now suddenly everything on the net is fake, if it wasn't already.
It’s a different UI, sure, but there should be no discrimination towards it as there should be no discrimination towards, say, Links terminal browser, or some exotic Firefox derivative.
So your comparison is, at best, naive (assuming good intentions), or malicious if not.
AI broke the brains of many people. The internet isn't a monolith, but prior to the AI boom you'd be hard pressed to find people who were pro-copyright (except maybe a few who wanted to use it to force companies to comply with copyleft obligations), pro user-agent restrictions, or anti-scraping. Now such positions receive consistent representation in discussions, and are even the predominant position in some places (eg. reddit). In the past, people would invoke principled justifications for why they opposed those positions, like how copyright constituted an immoral monopoly and stifled innovation, or how scraping was so important to interoperability and the open web. Turns out for many, none of those principles really mattered and they only held those positions because they thought those positions would harm big evil publishing/media companies (ie. symbolic politics theory). When being anti-copyright or pro-scraping helped big evil AI companies, they took the opposite stance.
People can believe that corporations are using the power asymmetry between them and individuals through copyright law to stifle the individual to protect profits. People can also believe that corporations are using the power asymmetry between them and individuals through AI to steal intellectual labor done by individuals to protect their profits. People's position just might be that the law should be used to protect the rights of parties when there is a large power asymmetry.
This seems very susceptible to manipulation to get whatever conclusion you want. For instance, is dose defined? It sounds like the idea you're going for is that the typical pirate downloads a few dozen movies/games but AI companies are doing millions/billions, but why should it be counted per infringer? After all, if everyone pirates a given movie, that wouldn't add up much in terms of their personal count of infringements, but would make the movie unprofitable.
>People’s position just might be that the law should be used to protect the rights of parties when there is a large power asymmetry.
That sounds suspiciously close to "laws should just be whatever benefits me or my group". If so, that would be a sad and cynical worldview, not dissimilar to the stance on free speech held by the illiberal left and right. "Free speech is an important part of democracy", they say, except when they see their opponents voicing "dangerous ideas", in which case they think it should be clamped down. After all, what are laws for if not a tool to protect the interests of your side?
I do not understand how you can make that leap. Saying laws should account for the imbalance of power between parties has nothing to do with "my" group. Police have asymmetric power over citizens, so laws should protect citizens from abuses of that power. Employers have asymmetric power over employees, so laws should protect workers from abuses of that power.
>This seems very susceptible to manipulation to get whatever conclusion you want.
Everything is. That is what bad faith arguments are. But in the real world, in a complex society, no simple rule over something as broadly defined as "intellectual property" can work every time in every situation.
You don't seem to reject my claim that for many, principles took a backseat to "does this help or hurt evil corporations". If that's what passes as "nuance" to you, then sure.
>Talking about broken brains is often just mediocre projecting
To be clear, that part is metaphorical/hyperbolic and not meant to be taken literally. Obviously I'm not diagnosing people who switched sides with a psychiatric condition.
LLMs are killing the resource. This isn't a corporation vs person issue. No issue with an llm having my content but big issue with my server being down because llms are hammering the same page over and over.
Has it been shown that Perplexity engages in "DoSing"? I've heard anecdotes of AI bots gone amok, and maybe that's what's happening here, but Cloudflare hasn't really shown that. All they did was set up a robots.txt and show that Perplexity bypassed it. There are probably archivers out there using youtube-dl to download from YouTube at 1+ Gbit/s, tens of times more than a typical viewer is downloading. Does that mean it's fair game to point to a random instance of someone using youtube-dl and characterize that as "DoSing"?
Regardless, the bigger AI problem is spam, and that has never been acceptable.
> Hello, would you be able to assist me in understanding this website? https:// […] .com/
In this case, Perplexity had a human being using it. Perplexity wasn’t crawling the site, Perplexity was being operated by a human working for Cloudflare.
Put your valuable content behind a paywall.
This is funny coming from Cloudflare, the company that blocks most of the internet from being fetched with antispam checks, even for a single web request. The internet we knew was open and not trusted, but thanks to companies like Cloudflare, now even the most benign, well-meaning attempt to GET a website is met with a brick wall. The bots of Big Tech, namely Google, Meta and Apple, are of course exempt from this by pretty much every website and by Cloudflare. But try being anyone other than them: no luck. Cloudflare is the biggest enabler of this monopolistic behavior.
That said, why does Perplexity even need to crawl websites? I thought they used 3rd-party LLMs. And those LLMs didn't ask anyone's permission to crawl the entire 'net.
Also, the "Perplexity bots" aren't crawling websites; they fetch URLs that users explicitly asked for. This shouldn't count as something that needs robots.txt access. It's not a robot randomly crawling; it's the user asking for a specific page, basically a shortcut for copy/pasting the content.
Cloudflare only needs to exist because the server doesn't get paid when a user or bot requests resources. Advertising only needs to exist because the publisher doesn't get paid when a user or bot requests resources.
And the thing is... people already pay for internet. They pay their ISP. So people are perfectly happy to pay for resources that they consume on the Internet, and they already have an infrastructure for doing so.
I feel like the answer is that all web requests should come with a price tag, and the ISP that is delivering the data is responsible for paying that price tag and then charging the downstream user.
It's also easy to ratelimit. The ISP will just count the price tag as 'bytes'. So your price could be 100 MB or whatever (independent of how large the response is), and if your internet is 100 mbps, the ISP will stall out the request for 8 seconds, and then make it. If the user aborts the request before the page loads, the ISP won't send the request to the server and no resources are consumed.
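A minimal sketch of that stall-time math in Python (names hypothetical; this just restates the scheme above):

def stall_seconds(price_bytes, link_bits_per_sec):
    # Delay the request as if the client had already downloaded price_bytes.
    return price_bytes * 8 / link_bits_per_sec

# A 100 MB price tag on a 100 Mbps line stalls ~8 seconds before forwarding.
print(stall_seconds(100 * 10**6, 100 * 10**6))  # -> 8.0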
I agree, but your idea below that is overly complicated. You can't micro-transact the whole internet.
That idea feels like those episodes of Star Trek DS9 that take place on Ferenginar, where you have to pay admission and sign liability waivers to even walk on the sidewalk outside. It's not a true solution.
I agree that end-users cannot handle micro transactions across the whole internet. That said, I would like to point out that most of the internet is blanketed in ads and ads involve tons of tiny quick auctions and micro transactions that occur on each page load.
It is totally possible for a system to evolve involving tons of tiny transactions across page loads.
The lengths Meta and the like go to in order to maximize clickthroughs...
And, the whole internet is already micro-transactioned! Every page with ads is doing a bidding war and spending money on your attention. The only person not allowed to bid is yourself!
Clearly you don't have the lobes for business /s
But it's done through a bait and switch. They serve the full article to Google, which allows Google to show you excerpts that you have to pay for.
It would be better if Google showed something like PAYMENT REQUIRED on top; at least that way I'd know what I'm getting into.
I'm old enough to remember when that was grounds for getting your site removed from Google results - "cloaking" was against the rules. You couldn't return one result for Googlebot, and another for humans.
No idea when they stopped doing that, but they obviously have let go of that principle.
I think it died largely because it became impossible to police with any reliability, and being strict about it would remove too much from Google's index, because many sites are not easily indexable without providing a “this is the version without all the extra round-trips for ad impressions and maybe a login needed” variant to common search engines.
Applying the rule strictly would mean that sites implementing PoW tricks like Anubis to reduce unwanted bot traffic would not be included in the index if they serve to Google without the PoW step.
I can't say I like that this has been legitimised, even for the (arguably more common) deliberate bait & switch tricks, but (I think) I understand why the rule was allowed to slide.
If pages can't be served for free, all internet content is at the mercy of payment processors and their ideas of "brand safety".
That content can't be served entirely for free doesn't mean that all content will require payment (and so be subject to issues with payment processors); it just means some things may gravitate back to a model where it costs a small amount to host something (i.e. pay for home internet and host bits off that, or keep a VPS out there that runs tools and costs a few $/yr or /month). I pay for resources to host my bits & bobs instead of relying on services provided in exchange for stalking the people looking at them; this is free for the viewer, as they aren't even paying indirectly.
Most things are paid for anyway, even if neither the person hosting them nor the person viewing them pays directly: adtech arseholes give services to people hosting content in exchange for the ability to stalk us and attempt to divert our attention. Very few sites/apps, other than play/hobby ones like mine or those from more actively privacy-focused types, are free of that.
It doesn't just apply to the web, it applies to literally everything that we spend money on via a third party service. Which is... most everything these days.
Curious to hear other perspectives here. Maybe I'm overreacting/misunderstanding.
The average American allegedly* downloads 650-700GB/month, or >20GB/day. 10MB is more than enough for a webpage (honestly, 1MB is usually enough), so that means on average, ISPs serve over 2000 webpages worth of data per day. And the average internet plan is allegedly** $73/month, or <$2.50/day. So $2.50 gets you over 2000 indie sites.
That’s cheap enough, wrapped in a monthly bill, users won’t even pay attention to what sites they visit. The only people hurt by an ideal (granted, ideal) implementation are those who abuse fixed rates and download unreasonable amounts of data, like web crawlers who visit the same page seconds apart for many pages in parallel.
* https://www.astound.com/learn/internet/average-internet-data...
** https://www.nerdwallet.com/article/finance/how-much-is-inter...
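Rechecking that back-of-the-envelope math in Python:

monthly_gb = 675  # midpoint of the alleged 650-700 GB/month
page_mb = 10      # generous per-page budget
print(monthly_gb * 1000 / 30 / page_mb)  # -> 2250.0 pages/day, i.e. "over 2000"
print(73 / 30)                           # -> ~$2.43/day on a $73/month plan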
https://weblog.masukomi.org/2018/03/25/zed-shaws-utu-saving-...
I do believe we will end up there eventually; with emerging tech like Brazil's and India's payment architectures, it should be a possibility in the coming decades.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
Sadly development along these lines has not progressed. Yes, Google Cloud and other services may return it and require some manual human intervention, but I'd love to see _automatic payment negotiation_.
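Since 402 reserves the status code but standardizes no payment semantics, any negotiation fields are necessarily invented; a hypothetical exchange might look like:

HTTP/1.1 402 Payment Required
Content-Type: application/json

{"price": "0.0001", "currency": "USD", "pay_to": "example-wallet-id"}

An agent could then settle the quoted amount out of band and retry the request with some proof-of-payment header.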
I'm hopeful that instant-settlement options like Bitcoin Lightning payments could progress us past this.
https://docs.lightning.engineering/the-lightning-network/l40...
https://hackernoon.com/the-resurgence-of-http-402-in-the-age...
The amount of "verified" paying "users" with a blue checkmark that are just total LLM bots is incredible on there.
As long as spamming and DDOS'ing pays more than whatever the request costs, it will keep existing.
Whatever method is used by Cloudflare for detecting "threats" has nothing to do with consuming resources on the "protected" servers.
The so-called "threats" are identified in users that may make a few accesses per day to a site, transferring perhaps a few kilobytes of useful data on the viewed pages (besides whatever amount of stupid scripts the site designer has implemented).
So certainly Cloudflare does not meter the consumed resources.
Moreover, Cloudflare preemptively annoys any user who accesses a site for the first time, having never consumed any resources, perhaps based on irrational profiling of the browser, operating system, and geographical location used.
Your idea of micro transacting web requests would play into it and probably end up with a system like Netflix where your ISP has access to a set of content creators to whom they grant ‘unlimited’ access as part of the service fee.
I’d imagine that accessing any content creators which are not part of their package will either be blocked via a paywall (buy an addon to access X creators outside our network each month) or charged at an insane price per MB as is the case with mobile data.
Obviously this is all super hypothetical, but weirder stuff has happened in my lifetime.
Because I as a user would be glad to have a "free sites only" filter, and then just steal content :))
But it's an interesting idea and thought experiment.
These AI web crawlers (Google, Perplexity, etc) are self-cannibalizing robots. They eat the goose that laid the golden egg for breakfast, and lose money doing it most of the time.
If something isn't done to incentivize content creators again eventually there will be only walled-gardens and obsolete content left for the cannibals.
While I don’t blame creators for using ads now, I don’t think they’re a long-term solution. Ads are already blocked when people visit the site with ad blockers, which are becoming more popular. Obvious sponsored content may be blocked with the ads, and non-obvious sponsored content turns these “creators” into “shills” who are inauthentic and untrustworthy. Even without Google summaries, ad revenue may decrease over time as advertisers realize they aren’t effective or want more profit; even if it doesn’t, it’s my personal opinion that society should decrease the overall amount of ads.
Not everyone creates only for money, the best only create for enough money to sustain themselves. A long-term solution is to expand art funding (e.g. creators apply for grants with their ideas and, if accepted, get paid a fixed rate to execute them) or UBI. Then media can be redistributed, remixed, etc. without impacting creators’ finances.
If only the one in ten thousand with something to share is left standing to share it, with no manufactured content, that's a fine thing.
That seems pretty unreasonable.
Only this week I have witnessed several dozen cases when Cloudflare has blocked normal Web page accesses without any possible correct reason, and this besides the normal annoyance of slowing every single access to any page on their "protected" sites with a bot check popup window.
It is true that this has never happened before, but this week Cloudflare has frequently blocked my access to a site where I am a paid subscriber, and where there is no doubt that my access pattern matches exactly what that site must have been designed for, i.e. the site hosts a database and I make a few queries on it each day, less than a dozen, spread over the entire day, where each query takes a couple of seconds at most.
Whoever has implemented a "threat" detection algorithm that decides that such a usage is a "threat" and not normal usage, must be completely incompetent.
Also after starting the crawl, you can read about Aaron Swartz while waiting.
I think this is an overly harsh take. I run a fairly niche website which collates some info which isn't available anywhere else on the internet. As it happens I don't mind companies scraping the content, but I could totally understand if someone didn't want a company profiting from their work in that way. No one is under an obligation to provide a free service to AI companies.
The kind of laws and enforcement that would block that entire country from the internet if it doesn't get its criminal act together.
Users should get to have authenticated, anonymous proxy user agents. Because companies like Perplexity just ignore `robots.txt`, maybe something like Private Access Tokens (PATs) with a new class for autonomous agents could be a solution for this.
By "content farm ads scaffold", I'm not sure if you had Perplexity and their ads business in mind, or those crappy little single-serving garbage sites. In any case, they shouldn't be treated differently. I have no problem with the business model, other than that the scam only works because it's currently trivial to parasitically strip-mine and monetize other people's IP.
However, in the last few months, Cloudflare has become increasingly annoying. I suspect that they might have implemented some "AI" "threat" detection, which gives much more false positives than before.
For instance, this week I have frequently been blocked when trying to access the home page of some sites where I am a paid subscriber, with a completely cryptic message "The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.".
The only "action" that I have done was opening the home page of the site, where I would then normally login with my credentials.
Also, during the last few days I have been blocked from accessing ResearchGate. I may happen to hit a few times per day some page on the ResearchGate site, while searching for various research papers, which is the very purpose of that site. Therefore I cannot understand what stupid algorithm is used by Cloudflare, that it declares that such normal usage is a "threat".
The weird part is that this blocking happens only if I use Firefox (Linux version). With another browser, i.e. Vivaldi or Chrome, I am not blocked.
I have no idea whether Cloudflare specifically associates Firefox on Linux with "threats" or this happens because whatever flawed statistics Cloudflare has collected about my accesses have all recorded the use of Firefox.
In any case, Cloudflare is completely incapable of discriminating between normal usage of a site by a human (which may be a paying customer) and "threats" caused by bots or whatever "threatening" entities might exist according to Cloudflare.
I am really annoyed by the incompetent programmers who implement such dumb "threat detection solutions", which can create major inconveniences for countless people around the world, while the incompetents who are the cause of this are hiding behind their employer corporation and never suffer consequences proportional to the problems that they have caused to others.
Sometimes just refreshing the page seems to work too. Disabling the tracker blocking allows cross-site requests to Cloudflare endpoints which seems to be enough. Maybe worth allow-listing CF domains, but I didn't look into if that is possible yet.
This exact same thing continues in 2025 with Windows Defender. The cheaper Windows Server VMs in the various cloud providers are practically unusable until you disable it.
You can tell this stuff is no longer about protecting users or property when there are no meaningful workarounds or exceptions offered anymore. You must use defender (or Cloudflare) unless you intend to be a naughty pirate user.
I think half of this stuff is simply an elaborate power trip. Human egos are fairly predictable machines in aggregate.
Plenty of site/service owners explicitly want Google, Meta and Apple bots (because they believe they have a symbiotic relationship with it) and don't want your bot because they view you as, most likely, parasitic.
I don't think it's fair to blame Cloudflare for that. That's looking at a pool of blood and not what caused it: the bots/traffic which predate LLMs. And Cloudflare is working to fix it with the PrivacyPass standard (which Apple joined).
Each website is freely opting-into it. No one was forced. Why not ask yourself why that is?
The Big Tech bots provide proven value to most sites. They have also through the years proven themselves to respect robots.txt, including crawl speed directives.
If you manage a site with millions of pages, and over the course of a couple years you see tens of new crawlers start to request at the same volume as Google, and some of them crawl at a rate high enough (and without any ramp-up period) to degrade services and wake up your on-call engineers, and you can't identify a benefit to you from the crawlers, what are you going to do? Are you going to pay a lot more to stop scaling down your cluster during off-peak traffic, or are you going to start blocking bots?
Cloudflare happens to be the largest provider of anti-DDoS and bot protection services, but if it wasn't them, it'd be someone else. I miss the open web, but I understand why site operators don't want to waste bandwidth and compute on high-volume bots that do not present a good value proposition to them.
Yes this does make it much harder for non-incumbents, and I don't know what to do about that.
https://www.robotstxt.org/faq/what.html
I wonder if cloudflare users explicitly have to allow google or if it's pre-allowed for them when setting up cloudflare.
Despite what Cloudflare wants us to think here, the web was always meant to be an open information network , and spam protection should not fundamentally change that characteristic.
But at end of day it's up to the site operator, and any server or reverse proxy provides an easy way to block well-behaved bots that use a consistent user-agent.
They provide value for their companies. If you get some value from them, it's just a side effect.
1) It takes your query, and given the complexity might expand it to several search queries using an LLM. ("rephrasing")
2) It runs queries against a web search index (I think it was using Bing or Brave at first, but they probably have their own by now), and uses an LLM to decide which are the best/most relevant documents. It starts writing a summary while it dives into sources (see next).
3) If necessary it will download full source documents that popped up in search to seed the context when generating a more in-depth summary/answer. They do this themselves because using OpenAI to do it is far more expensive.
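Put together, a hedged sketch of those three steps in Python (the helpers are stubs standing in for LLM calls and a search index, since the real internals aren't public):

def llm_rephrase(query):          # 1) query expansion via an LLM
    return [query, query + " explained"]

def search_index(subqueries):     # 2) ranked hits from a web search index
    return ["https://example.com/review"]

def fetch(url):                   # 3) the contested full-document fetch
    return "<contents of %s>" % url

def answer(query):
    urls = search_index(llm_rephrase(query))
    docs = [fetch(u) for u in urls]
    return "summary of %d docs for: %s" % (len(docs), query)

print(answer("best robot vacuum"))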
#3 is the problem. Especially because SEO has really made it so the same sites pop up on top for certain classes of queries (for example, Reddit will be on top for product reviews a lot). These sites operate on ad revenue, so their incentive is to block. Perplexity does whatever they can in the game of sidestepping the sites' wishes. They are a bad actor.
EDIT: I should also add that Google, Bing, and others, always obey robots.txt and they are good netizens. They have enough scale and maturity to patiently crawl a site. I wholeheartedly agree that if an independent site is also a good netizen, they should not be blocked. If Perplexity is not obeying robots.txt and they are impatient, they should absolutely be blocked.
Why is it okay for me to ask my browser to do this but I can’t ask my LLM to do the same?
When ChatGPT reads a review website, though? Zero ad clicks, zero affiliate links.
Brave blocks ads by default. Tools like Pocket and reader mode disable ads.
Why is it okay for some user agents but not others?
But “for multiple people” isn’t an argument IMO, since each of those people could run a separate service doing the same. Using the same service, on the contrary, provides an opportunity to reduce the request volume by caching.
Am I misunderstanding something? I (the site owner) pay Cloudflare to do this. It is my fault this happens, not Cloudflare's.
I've only ever seen a Cloudflare interstitial when viewing a page with my VPN on, for example -- something I'm happy about as a site owner and accept quite willingly as a VPN user knowing the kinds of abuse that occur over VPN.
Monopolistic is the wrong word, because you have the problem backwards. Cloudflare isn't helping Apple/Google... It's helping its paying consumers, and those are the only services those consumers want to let through.
Do you know how I can predict that AI agents, the sort that end users use to accomplish real tasks, will never take off? Because the people your agent would interact with want your EYEBALLS for ads, build anti patterns on purpose, want to make it hard to unsubscribe, cancel, get a refund, do a return.
AI that is useful to people will fail, for the same reason that no one has great public APIs any more: because every public company's real customers are its stockholders, and the consumers are simply a source of revenue. One that is modeled, marketed to, and manipulated, all in the name of returns on investment.
I was recently working on a project where I needed to find the published date for a lot of article links, and this came in handy. Not sure if it's changed recently, but asking ChatGPT, Gemini, etc. didn't work; they said they don't have access to current websites. Asking Perplexity, however, it fetched the website in real time and gave me the info I needed.
I do agree with the rest of your comment that this is not a random robot crawling. It was doing what a real user (me) asked it to fetch.
You say "shouldn't" here, but why?
There seems to be a fundamental conflict between two groups who each assert they have "rights":
* Content consumers claim the right to use whatever software they want to consume content.
* Content creators claim the right to control how their content is consumed (usually so that they can monetize it).
These two "rights" are in direct conflict.
The bias here on HN, at least in this thread, is clearly towards the first "right". And I tend to come down on this side myself, as a computer power user. I hate that I cannot, for example, customize the software I use to stream movies from popular streaming services.
But on the other hand, content costs money to make. Creators need to eat. If the content creators cannot monetize their content, then a lot of that content will stop being made. Then what? That doesn't seem good for anyone, right?
Whether or not you think they have the "right", Perplexity totally breaks web content monetization. What should we do about that?
(Disclosure: I work for Cloudflare but not on anything related to this. I am speaking for myself, not Cloudflare.)
It'd likely be a fantastic good if "content creators" stopped being able to eat from the slop they shovel. In the meantime, the smarter the tools that let folks never encounter that form of "content", the more they will pay for them.
There remain legitimate information creation or information discovery activities that nobody used to call "content". One can tell which they are by whether they have names pre-existing SEO, like "research" or "journalism" or "creative writing".
Ad-scaffolding, what the word "content" came to mean, costs money to make, ideally less than the ads it provides a place for generate. This simple equation means the whole ecosystem, together with the technology attempting to perpetuate it as viable, is an ouroboros, eating its own effluvia.
It is, I would argue, undetermined that advertising-driven content as a business model has a "right" to exist in today's form, rather than any number of other business models that sufficed for millennia of information and artistry before.
Today LLMs serve both the generation of additional literally brain-less content, and the sifting of such from information worth using. Both sides are up in arms, but in the long run, it sure seems some other form of information origination and creativity is likely to serve everyone better.
If they want the RSS feeds to be accessible then they should configure it to allow those requests.
Anyone circumventing bans is doing something shitty and illegal; see the Computer Fraud and Abuse Act and Craigslist v. 3Taps.
"And those LLMs didn't ask anyones permission to crawl the entire 'net."
False: OpenAI respects robots.txt, doesn't mask IPs, and paid a bunch of $ to Reddit.
You either side with the law or with criminals.
You can't even say the same thing about openAI because we don't know the corpus they train their models on.
We're seeing many posts about site owners getting hit by millions of requests because of LLMs; we can't blame Cloudflare for this, because it is literally a necessary evil.
Sure, the internet should be open and not trusted. But physical reality exists. Hosting and bandwidth cost money. I trust Google won't DDoS my site or cost me an arbitrary amount of money. I won't trust bots made by random people on the internet in the same way. The fact that Google respects robots.txt while Perplexity doesn't tells you why people trust Google more than random bots.
Google already has access to any webpage because its own search crawlers are allowed by most websites, and Google crawls recursively. Thus Gemini has the advantage of this synergy with Google Search. Perplexity does not crawl recursively (I presume; therefore it does not need to consult robots.txt), and it doesn't have synergies with a major search engine.
So you just came here to bitch about Cloudflare? It's wild to even comment on this thread if this does not make sense to you.
They're building a search index. Every AI is going to struggle at being a tool to find websites & business listings without a search index.
Now you're just lying.
I checked several of my Cloudflare sites and none have it enabled by default:
"No robots.txt file found. Consider enabling Cloudflare managed robots.txt or generate one for your website"
"A robots.txt was found and is not managed by Cloudflare"
"Instruct AI bot traffic with robots.txt" disabled
The Cloudflare blog post where they announced this a few weeks ago stated "Cloudflare, Inc. (NYSE: NET), the leading connectivity cloud company, today announced it is now the first Internet infrastructure provider to block AI crawlers accessing content without permission or compensation, by default." [1]
I was also a bit confused by this wording and took it to mean Cloudflare was blocking AI traffic by default. What does it mean exactly?
Third party folks seemingly also interpreted it in the same way, eg The Verge reporting it with the title "Cloudflare will now block AI crawlers by default" [2]
I think what it actually means is that they'll offer new folks a default-enabled option to block ai traffic, so existing folks won't see any change. That aligns with text deeper in their blog post:
> Upon sign-up with Cloudflare, every new domain will now be asked if they want to allow AI crawlers, giving customers the choice upfront to explicitly allow or deny AI crawlers access. This significant shift means that every new domain starts with the default of control, and eliminates the need for webpage owners to manually configure their settings to opt out. Customers can easily check their settings and enable crawling at any time if they want their content to be freely accessed.
Not sure what this looks like in practice, or whether existing customers will be notified of the new option or something. But I also wouldn't fault someone for misinterpreting the headlines; they were a bit misleading.
[1]: https://www.cloudflare.com/en-ca/press-releases/2025/cloudfl...
[2]: https://www.theverge.com/news/695501/cloudflare-block-ai-cra...
I understand that you're trying to be generous, but the claim that "Matthew opted everyone in by default" is flat out incorrect.
> Cloudflare, Inc. (NYSE: NET), the leading connectivity cloud company, today announced it is now the first Internet infrastructure provider to block AI crawlers accessing content without permission or compensation, *by default*.
Content marketing, gamified SEO, and obtrusive ads significantly hurt the quality of Google search. For all its flaws, LLMs don’t feel this gamified yet. It’s disappointing that this is probably where we’re headed. But I hope OpenAI and Anthropic realize that this drop in search result quality might be partly why Google’s losing traffic.
> Converting complex HTML pages with navigation, ads, and JavaScript into LLM-friendly plain text is both difficult and imprecise.
None of these conditions should apply for websites with purpose of providing information.
Edit: And, btw, that statement was true before the default was changed. So, your comment is doubly false.
You don't have a right to say who or what can read your public website (this is a normative statement). You do have a right not to be DoS'd. If you pretend not to know what that means, it sounds the same as saying "you have an arbitrary right to decide who gets to make requests to your service", but it does not mean that.
Sorry CF, give up. The courts are on our side here.
The world is bigger than the USA.
Just because American tech giants have captured and corrupted legislators in the US doesn't mean the rest of the world will follow.
People like getting money for their work. You do too. Don't lose sight of that.
Sure, Google has to follow it because they're a big company and need to respect certain laws or internal policies. But for everyone else, it's basically just a "please don't" sign, not a legal requirement, right?
In summary: Officially, Perplexity claims its bots honor robots.txt. In practice, outside investigators and hosting providers document persistent circumvention of such directives by undeclared or disguised crawlers acting on Perplexity's behalf, especially for real-time user queries.
robots.txt is not a blocking mechanism; it's a hint to indicate which parts of a site might be of interest to indexing.
People started using robots.txt to lie and declare things like no part of their site is interesting, and so of course that gets ignored.
HN is not a platform for attacking people, even imagined ones.
Please don't fulminate.