1. If I as a human request a website, then I should be shown the content. Everyone agrees.
2. If I as the human request the software on my computer to modify the content before displaying it, for example by installing an ad-blocker into my user agent, then that's my choice and the website should not be notified about it. Most users agree; some websites try to nag you into modifying the software you run locally.
3. If I now go one step further and use an LLM to summarize content because the authentic presentation is so riddled with ads, JavaScript, and pop-ups that the content becomes borderline unusable, then why would the LLM accessing the website on my behalf be in a different legal category than my Firefox web browser accessing the website on my behalf?
Their reasons vary. Some don't want their business's perceived quality to be taken out of their control (delivering cold food, marking up items, poor substitutions). Some would prefer their staff serve customers and build relationships with them directly, instead of dealing with disinterested and frequently quite demanding runners. Some just straight up disagree with the practice of third-party delivery.
I think that it’s pretty unambiguously reasonable to choose to not allow an unrelated business to operate inside of your physical storefront. I also think that maps onto digital services.
An agent making a request on the explicit behalf of someone else is probably something most of us agree is reasonable. "What are the current stories on Hacker News?" -- the agent is just doing the same request to the same website that I would have done anyways.
But the sort of non-explicit just-in-case crawling that Perplexity might do for a general question, where it crawls 4-6 sources, isn't as easy to defend. "Are polar bears always white?" -- Now it's making requests I wouldn't have necessarily made, and it could even be seen as a sort of amplification attack.
That said, TFA's example is where they register secretexample.com and then ask Perplexity "what is secretexample.com about?" and Perplexity sends a request to answer the question, so that's an example of the first case, not the second.
When you swap in an AI and ask what the current stories are, the AI fetches the front page and every thread and feeds it back to you. You are less likely to participate in discussion because you've already had the info summarized.
Am I supposed to spend money on Amazon.com when I visit the website just because Amazon wants me to?
If most people stop discussing things on HN, and the discussion is indeed one of the major reasons it’s kept running, then HN stops being worth running.
Are website owners obligated to serve content to AI agents and/or LLM scrapers?
And yet people install ad blockers and defend their freedom to not participate in this because they don't want to be annoyed by ads.
They claim that since they are free to not buy an advertised product, why would they be forced to see ads for it. But Foo News claims that they are also free to not waste bandwidth serving their free website to people who declare (by using an ad blocker or the modern alternative: AI summarizers) that they won't participate in the funding of the service.
People hate obnoxious ads because the money that pays for them is essentially a bribe to artificially elevate content above its deserved ranking. It feels like you're being manipulated into an unfavorable trade.
It is? Are we talking about the same YouTube? I get absolutely useless recommendations, I get un-hooked within a couple videos, and I even keep getting recommendations for the same videos I've literally watched yesterday. Who in the world gets hooked by this??
I never saw people bother with scissors but I've seen people pulling the ads out of the newspaper countless times.
I think this is a pretty different scenario. Here the user and the news website are talking directly to each other, but then the user is making a choice around what to do with the content the news website send to them. With AI agents, there is a company inserting themselves between the user and the news website and acting as a middleman.
It seems reasonable to me that the news website might say they only want to deal with users and not middlemen.
What if my executive assistant reads the news website and gives me a digest?
Would the website owners prefer that I do my reading directly?
I remember that Samsung was at one time offering to play non-skippable full-screen ads on their newest 8K OLED TVs, and their argument was precisely that these ads will reach those rich people who normally pay extra to avoid getting spammed with ads. Or going with your executive assistant example, there are situations where it makes sense to bribe them to get access to you and/or your data. E.g. the "evil maid attack".
HTTP/1.1 402 Payment Required
WWW-price: 0.0000001 BTC, 0.000001 ETH, 0.00001 DOGE
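For illustration, a minimal sketch of serving that hypothetical response. The "WWW-Price" and "X-Payment-Proof" headers are invented for the example; HTTP 402 is reserved but has no standardized semantics:

```python
# Sketch of the hypothetical 402 scheme above; headers are invented, not standard.
from http.server import BaseHTTPRequestHandler, HTTPServer

class PaywalledHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("X-Payment-Proof"):
            # A real system would verify the payment proof here.
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<p>The article you paid for.</p>")
        else:
            self.send_response(402)  # Payment Required
            self.send_header("WWW-Price", "0.0000001 BTC, 0.000001 ETH, 0.00001 DOGE")
            self.end_headers()

HTTPServer(("", 8000), PaywalledHandler).serve_forever()
```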
> You are less likely to participate in discussion

You (or the AI on your behalf) paid instead. Many sites would probably like it better.
There are so many links I click on these days that are such trash I'd be demanding refunds constantly.
That is why AI "summarization" becomes a necessary intermediate layer. You'd see neither the trash nor the ads, and you'd make the payment instead of being exposed to the ads. AI saves the Internet :)
What prevents these companies from keeping a copy of that particular page, which I specifically disallowed for bot scraping, and feed it to their next training cycle?
Pinky promises? Ethics? Laws? Technical limitations? Leeroy Jenkins?
What prevents anyone else? robots.txt is a request, not an access policy.
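To make that concrete, this is the entirety of the "protection" a robots.txt provides. A polite crawler reads it and skips the listed paths; an impolite one simply doesn't:

```
# robots.txt at the site root: a published request, not an access policy.
User-agent: *
Disallow: /private/
```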
Does information no longer want to be free? Maybe the internet, just like social media, was just a social experiment in the end, albeit a successful one. Thanks, GenAI.
Big Tech has hidden behind ToS for years. Now, it seems as though it only works for them, never against them. It seems as though this would be easy to orchestrate and prove, forcing these companies into a legal nightmare, or into risking insolvency due to the high load of cases filed against them.
Why couldn't something like this be used to flip the table? A conciliation brigading, of sorts.
I've successfully used conciliation court against large corporations in the past which is why I question it here.
And while this should be handled via legislation, it won't be. Beyond that, a workaround could force that to happen.
Sorry, I had never heard that term before. You would still have to show standing though. How would you try to prove that their violating your TOS cost you money?
Again, I can't copy and distribute a game Microsoft rents to me. But if I do, I can be held accountable for a ridiculous amount of money. If it's my work of art, the terms can dictate who doesn't need to pay and who does. If an LLM is consuming my work of art and now distributing it within their user base, how is that not the same?
Do you still see authentic human traffic on your domains, is it easy to discern?
I feel like I missed the bus on running a blog pre-AI.
Both my blog homepage and posts see mostly human traffic. Sometimes bots crawl the site and they appear as spikes in the analytics.
Looks like my homepage, which doesn't have anything but links, is pretty popular with crawlers. My digital garden doesn't get much interest from them. All in all, human traffic on my sites is pretty much alive.
I don't believe in missing the bus in anything actually, because I don't write these for others, first. Both my blog (more meta) and digital garden (more technical) are written for myself primarily, and left open. I post links to both when it's appropriate, but they are not made to be popular. If people read it and learn something or solve one of their problems, that's enough for me.
This is why my software is GPLv3, my digital garden is GFDL, and my blog is CC BY-NC-SA 2.0. This is why everything runs with the absolute minimum of analytics and without any ads whatsoever.
Lastly, this is why I don't want AI crawlers on my site and my data in the models. This thing is made by a human for humans, absolutely for free. It's not OK for somebody to take something designed to be free, sell it, and make money off it.
Absolutely, I'm in agreement here. I want to run a JS-free blog, just plain old static HTML. I plan to use GoAccess to parse the access logs but that's it. I think I would find it encouraging to see real human traffic.
> I don't write these for others, first. Both my blog (more meta) and digital garden (more technical) are written for myself primarily, and left open.
That is a great way to view it, thank you.
You're welcome. I'm glad it helped.
> I want to run a JS-free blog, just plain old static HTML.
If you want to start fast until you find a template you want to work with, I can recommend Mataroa [0]. The blog has almost no JS (it binds a couple of keys for navigation, that's it), and it's $10/year. When you feel ready for your self-hosted solution, you can move over. It's all Markdown at the end of the day.
> I plan to use GoAccess to parse the access logs but that's it.
That's the only thing I use, too. Nothing else.
If you want to look at what I do, how I do, and reach out to me, the rabbit hole starts from my profile, here.
Wish you all the best, and may you find bliss and joy you never dreamed of!
you could go proper insanomode, too. remaking The Internet is trivial if you don't care about existing web standards -- replacing HTTP with your own TCP implementation, getting off html/js/css, etc. being greenfield, you can control the protocol, server, and client implementation, and put it in whatever language you want. I made a stateful Internet implementation in Python earlier for proof-of-concept, but I want to port it and expand on it in rust soon (just for fun; I don't do serious biznos). you'll very likely have 100% human traffic then, even if you're the only person curious and trusting enough to run your client.
> I made a stateful Internet implementation in Python earlier for proof-of-concept
Is there a repo or some other form of public access? I'd like to see this.

the server ("lodge") passes JSON to the client from what are called .branch files. the client receives the JSON, parses it, builds the UI and state representation from it, and stores that in the client's memory (self.current_doc and self.page_state in the python client).
branches can invoke waterwheel (.ww) files hosted on the lodge. waterwheel files on the lodge contain scripts which define how patches (as JSON) are to be sent to the client. the client updates its state based on the JSON patch it receives. sample .branch and .ww files from the python implementation (in a pastebin so as not to make everyone scroll through them): https://pastebin.com/A0DEZDmR
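for readers who don't want to open the pastebin, here's a generic sketch of the patch idea in python -- illustrative only, not the actual lodge/waterwheel code:

```python
# Generic sketch of JSON-patch-driven client state, not the real implementation.
import json

page_state = {"title": "home", "items": ["a", "b"]}

def apply_patch(state: dict, patch_json: str) -> dict:
    # Shallow merge: each top-level key in the patch replaces the old value.
    state.update(json.loads(patch_json))
    return state

apply_patch(page_state, '{"items": ["a", "b", "c"]}')
print(page_state)  # {'title': 'home', 'items': ['a', 'b', 'c']}
```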
You can put up a paywall depending on UserAgent or OS (has been done).
In short, it's a 2-way street: the client on the other end of the TCP pipe makes a request, and your server fulfills the request as it sees fit.
I also want to keep this distinction on the sites I own. I use licenses to signal that this site is not to be used for AI training, because it's CC BY-NC-SA 2.0.
So, I license my content appropriately (no derivatives, non-commercial, shareable under the same license with attribution), add technical countermeasures on top because companies don't respect these licenses (because monies), watch them circumvent these mechanisms (because monies), and I'm the one who has to suck it up and shut up (because their monies)?
Makes no sense whatsoever.
What I want to stop is excessive crawling and scraping of my server. Once they have the file they can do what they want with it. Another comment (44786237) mentions that robots.txt is only for restricting recursive access; I agree, and that is what should be blocked. They also should not access the same file several times in quick succession, just as much as they should not access all of the files, since it should be unnecessary to do so. (If someone wants to make a mirror of the files, there may be other ways, e.g. an archive file available for downloading many at once, in case the site operator made their own index and offered it that way. If it is a git repository, then it can be cloned.)
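Blocking that kind of repeated access is a plain rate-limiting problem. A minimal token-bucket sketch keyed on client IP; the thresholds are illustrative:

```python
# Minimal per-IP token-bucket rate limiter; RATE and BURST are illustrative.
import time
from collections import defaultdict

RATE = 1.0    # requests replenished per second
BURST = 10.0  # maximum burst size

buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow(ip: str) -> bool:
    b = buckets[ip]
    now = time.monotonic()
    b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
    b["last"] = now
    if b["tokens"] >= 1.0:
        b["tokens"] -= 1.0
        return True
    return False  # over the limit; a server would answer 429 here
```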
Your license is probably not relevant. I can go to the cinema and watch a movie, then come on this website and describe the whole plot. That isn't copyright infringement. Even if I told it to the whole world, it wouldn't be copyright infringement. Probably the movie seller would prefer it if I didn't tell anyone. Why should I care?
I actually agree that AI companies are generally bad and should be stopped - because they use an exorbitant amount of bandwidth and harm the services for other users. At least they should be heavily taxed. I don't even begrudge people for using Anubis, at least in some cases. But it is wrong-headed (and actually wrong in fact) to try to say someone may or may not use my content for some purpose because it hurts my feelings or it messes with my ad revenue. We have laws against copyright infringement, and to prevent service disruption. We should not have laws that say, yes you can read my site but no you can't use it to train an LLM, or to build a search index. That would be unethical. Call for a windfall tax if they piss you off so much.
This is a false analogy. A correct one would be going to 1000 movies and creating a 1001st movie with scenes cropped from those 1000 movies, assembled as a new movie, and that is copyright infringement. I don't think any of the studios would applaud and support you for your creativity.
> But it is wrong-headed (and actually wrong in fact) to try to say someone may or may not use my content for some purpose because it hurts my feelings or it messes with my ad revenue.
Why does it have to be always about money? Personally it's not. I just don't want my work to be abused and sold to people to benefit a third party without my consent and will (and all my work is licensed appropriately for that).
> We should not have laws that say, yes you can read my site but no you can't use it to train an LLM, or to build a search index.
This goes both ways. If big corporations can scrape my material without asking me and resell it as an output of a model, I can equally distill their models further and sell it as my own. If companies can scrape my pages to sell my content as theirs, I can scrape theirs and unpaywall them.
But that will be copyright infringement, just because they have more money. What angers me is "all is fair game because you're a small fish, and this is a capitalist marketplace" mentality.
If companies can paywall their content to humans that don't pay, I can paywall AI companies and demand money or push them out of my lawn, just because I feel like that. The inverse is very unethical, but very capitalist, yes.
It's not always about money.
P.S.: Oh, try to claim that you can train a model with medical data without any clearance because it'd be unethical to have laws limiting this. It'll be fun. Believe me.
I think you are describing something much more like Stable Diffusion. This article is about Perplexity, which is much closer to "watch a movie and tell me the plot" than it is to "take these 1000 movies and make a collage". The copyright points are different - Stable Diffusion is on much shakier ground than Perplexity.
> Why does it have to be always about money?
Before I mentioned money I said "because it hurts my feelings". I'm sorry I can't give a more charitable interpretation, but I really do see this kind of objection as "I don't want you to have access to this web page because I don't like LLMs". This is not a principled objection, it is just "I don't like you, go away". I don't think this is a good principle to build the web on.
Obviously you can make your website private, if you want, and that would be a shame. But you can't have this kind of pick-and-choose "public when you feel like" option. By the way I did not mention, but I am ok with people using Anubis and the like as a compromise while the situation remains unjust. But the justification is very important.
> If companies can scrape my pages to sell my content as theirs, I can scrape theirs and unpaywall them.
This is probably not a gambit you want to make. You literally can do this, and they would probably like it if you did. You don't want to do that, because the output of LLMs is usually not that good.
In fact, LLM companies should probably be taxed, and the taxes used to fund real human AI-free creations. This will probably not happen, but I am used to disappointment.
> P.S.: Oh, try to claim that you can train a model with medical data
Medical data is not public, for good reasons.
That is unfortunately not a distinction that is currently legally enforceable. Until that changes all other "solutions" are pointless and only cause more harm.
> People who think like that made tools like Anubis, and it works.
It works to get real humans like myself to stop visiting your site while scrapers will have people whose entire job is to work around such "protections". Just like traditional DRM inconveniences honest customers and not pirates. And to be clear, what you are advocating for is DRM.
> I also want to keep this distinction on the sites I own. I also use licenses to signal that this site is not good to use for AI training, because it's CC BY-NC-SA-2.0.
If AI crawlers cared about that we wouldn't be talking about this issue. A license can only give more permissions than there are without one.
If we talk about Anubis, it's pretty invisible. You wait a couple of seconds on the first visit, and don't get challenged again for a couple of weeks, at least. With more tuning, some of the sites using Anubis work perfectly well without visitors ever seeing Anubis's wall, while still stopping AI crawlers.
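For context, Anubis-style challenges boil down to a small proof of work the browser must solve before getting a cookie. A generic sketch of the idea, not Anubis's actual code:

```python
# Generic proof-of-work sketch: find a nonce whose SHA-256 hash has a
# leading-zero prefix. The server verifies the result with a single hash.
import hashlib

def solve(challenge: str, difficulty: int = 4) -> int:
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce  # cheap for one visitor, expensive at crawler scale
        nonce += 1
```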
> And to be clear, what you are advocating for is DRM.
Yes. It's pretty ironic that someone like me who believes in open access prefers a DRM solution to keep companies abusing the small fish, but life is an interesting phenomenon, and these things happen.
> Until that changes all other "solutions" are pointless and only cause more harm.
As an addendum to the above paragraph, I'm not happy that I have to insert draconian measures between the user and the information I want to share, but I need a way to signal to these faceless things that I'm not having it. What do you propose? Taking my sites offline? Burning myself in front of one of the HQs?
> If AI crawlers cared about that we wouldn't be talking about this issue. A license can only give more permissions than there are without one.
AI crawlers default to "public domain" when they find no licenses. Some of my lamest source code repositories made it into "The Stack" because I forgot to add COPYING.md. A fork of a GPLv2 tool I wrote some patches for also got into "The Stack", because COPYING.md was not in the root folder of the repository. I'd rather add licenses (which I can accept) to things than leave them as-is, because AI companies eagerly grab anything without a license.
All licenses I use mandate attribution and continuation of the license, at least, and my blog doesn't allow any derivations of what I have written. So you can't ingest it into a model to be derived and remixed with something else.
It's not invisible, the sites using it don't work perfectly well for all users, and it doesn't stop AI crawlers.
If you don't have the funds to sue an AI corp, I'd probably think of a plan B. Maybe poison the data for unauthenticated users. Or embrace the inevitability. Or see the bright side of getting embedded in models as if you're leaving your mark.
If you give them a URL that does not appear in Google, ask them to visit that URL specifically, and then notice the content from that URL in the training data, it's proof that they're doing this, which would be quite damaging to them.
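A minimal sketch of that honeypot setup (the path and marker are illustrative): serve a unique, unguessable string at an unlinked URL and log whoever fetches it; if the string later surfaces in a model's output, it was crawled.

```python
# Canary page sketch: an unlinked URL serving a unique marker string.
import secrets
from http.server import BaseHTTPRequestHandler, HTTPServer

CANARY = f"canary-{secrets.token_hex(16)}"  # save this for later searching
print("marker:", CANARY)

class CanaryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/not-linked-anywhere":
            # Nothing links here, so every fetch is a crawl, not a browse.
            print("fetched by", self.client_address[0],
                  "UA:", self.headers.get("User-Agent"))
            self.send_response(200)
            self.end_headers()
            self.wfile.write(CANARY.encode())
        else:
            self.send_response(404)
            self.end_headers()

HTTPServer(("", 8000), CanaryHandler).serve_forever()
```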
Is it? It's damning, but is it damaging at all?
I'm getting the impression that anyone's data being available for training, if some bot can get to it, is just how things are now, rather than an unsettled point of contention. There's too much money invested in this thing for any other outcome, and with the present decline of the rule of law…
If a store's business relies, at least partially, on obscuring information in ways that can be defeated through automated means (e.g. storefronts tend to push visitors towards products they don't want, and buyer agents fight that by looking for what the buyer actually asked for), then playing this cat-and-mouse game of blocking agents, finding workarounds, and repeating the cycle only creates perverse technological contraptions that neither party is really interested in - but both are circumstantially forced to invest in.
You can block IPs at the host level, but there are pretty easy ways around that with proxy networks.
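The blocking side really is just a couple of lines (nginx-style config shown; the addresses are placeholders), which is exactly why fresh proxy IPs route around it so easily:

```
# nginx: deny individual addresses or ranges at the edge (placeholder values)
deny 203.0.113.42;
deny 198.51.100.0/24;
```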
Who misled you with that statement?
Corporate America. Where clean code goes to die.
Mind you I'm not saying electric scooters are a bad idea, I have one and I quite enjoy it. I'm saying we didn't need five fucking startups all competing to provide them at the lowest cost possible just for 2/3s of them to end up in fucking landfills when the VC funding ran out.
Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.
Comments should get more thoughtful and substantive, not less, as a topic gets more divisive.
When disagreeing, please reply to the argument instead of calling names. "That is idiotic; 1 + 1 is 2, not 3" can be shortened to "1 + 1 is 2, not 3."
Please don't fulminate. Please don't sneer, including at the rest of the community.
Eschew flamebait. Avoid generic tangents. Omit internet tropes.
Please don't use Hacker News for political or ideological battle. It tramples curiosity.
If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.
broadcast television and paid streaming entertainment are just straight up the most glaringly obvious examples of paid services overflowing with advertisements.
paid radio broadcasts (xm/Sirius).
operating systems (windows serves you ads any chance it gets).
monthly subscriptions to gyms where you're constantly hit with ads, marketing, and promotions, be it at the gym or via push notification (which you got opted into and therefore have to intentionally opt out of after the service is paid).
mobile phones, especially prepaid come LOADED with ads and bloatware.
i mean the list goes on -- you cannot be serious.
Those are hybrid subscriptions/subsidies. Not paid in full.
If you are being exposed to ads in something you paid for, you are almost certainly being charged less money. Companies can compete on cost by introducing ads, and it's why the cheaper you go, the more ad infested it gets.
Pure ad-free things tend to be much more expensive than their ad-subsidized counterparts. Ad-subsidized has become so ubiquitous, though, that people think that price is the true price.
in the subsidized example, xm/Sirius is marketed to users as an "ad-free paid radio broadcast"; the marketing literally attempts to leverage the notion of it being ad-free as a consequence of your purchase (power) in order to highlight its supposed competitive edge and usefulness, and provide the user an incentive to spend money, except for the fact that the marketing is false. you still get served promotions and ads, just less "conventional" ads.
i go to a football game and im literally inundated with ads -- the whole game has time stoppage dedicated to serving ads. i guess my season ticket purchase with the hopes of seeing football in person is.. apparently not spending enough money?
i see this as attempting to move the goalposts and gaslight users on their purchase expectations, as a way to offload the responsibility and accountability back onto the user -- "you don't pay enough, you only think that you pay enough, so we are still going to serve you ads because <insert financial justification here around the expectations we've undermined>".
why then is there any expectation of a service being ad-free upon purchasing?
who the hell actually enjoys sitting through 1.5 hours of advertisements and play stoppage?
over time users have been conditioned to just tolerate it, and over time, the advertising reclaims ground it previously gave up one inch at a time in the same way people are price-gouged in those stadiums -- they don't have much alternative, but apparently the problem is the user should fork up more money for tickets so as to align their expectations with reality? while they're getting strong-armed at the concession stand via proximity and circumstance and lack of competition, no less.
are you really trying to tell me the problem there is, they need to make... more money? and THEN and only THEN we can have ad-free, paid for entertainment otherwise known as american football? is this really about user expectations, or is this about companies wanting their cake and eating it, too?
The horror!!
Democracy in its American form, and in many others, shows almost complete paralysis of the entire system if bad actors infiltrate it (looking at ya, Donald).
It is honestly a little sad, since conservatives usually think of their society as this high-trust society, and they were the ones who primarily voted for this and are being taken advantage of by the few untrustworthy individuals.
Politics is a cult/religion and you can't convince me otherwise.
I vote because I vote for the lesser evil, not for the greater good. I do think that, frankly, both parties (or just most parties in every nation) fall so short of reality. But I created a Discord server of 100 people, and I can see how I can't manage even 100 people, so maybe I expect too much from the govt.
I used to focus so much on history and politics, but it's a bloody mess and there is no good or bad. Now I just feel like going into the woods and into the dark, living alone, maybe coding.
Quite compatible
> If you want a high trust society, you don't want capitalism.
There is nothing at all in capitalism that would prevent a high level of trust in society.
> Capitalism is inherently low trust
But that's not true. The thing about capitalism is that it's RESILIENT to low trust. It does not require low levels of trust, but is capable of functioning in such conditions.
> If the penalty for deceit was greater than the penalty for non-deceit
Who are the judges? Capitalism is the most resistant to deception; deceivers under capitalism receive fewer benefits than under any other economic system. Simply because capitalism is based on the premise that people cheat, act out of greed, and try to get the most for themselves at the expense of others. These qualities exist in people regardless of the existence of capitalism; it is just that capitalism ensures prosperity in society even when people have these qualities.
I've never seen a stateless, classless, moneyless society. It may be impossible.
At this moment I am using Perplexity's Comet browser to take a spotify playlist and add all the tracks to my youtube music playlist. I love it.
If sites want to avoid people using agents, they should offer the functionality that people are using the agents to accomplish.
Excellent. Personal shoppers are 'adblock for IRL'.
>You owe the companies nothing. You especially don't owe them any courtesy. They have re-arranged the world to put themselves in front of you. They never asked for your permission, don't even start asking for theirs.
Everyone having a personal shopper obviously changes the relationship to the products and services you use or purchase via personal shopper. Good, bad, whatever.
The point is the web is changing, and people use a different type of browser now. And that browser happens to be an LLM.
Anybody complaining about the new browser has just not got it yet, or has and is trying to keep things the old way because they don't know how or won't change with the times. We have seen it before: Kodak, Blockbuster, whatever.
Grow up, Cloudflare; some of your business models don't make sense any more.
You say this as though all LLM/otherwise automated traffic is for the purposes of fulfilling a request made by a user 100% of the time which is just flatly on-its-face untrue.
Companies make vast amounts of requests for indexing purposes. That could be to facilitate user requests someday, perhaps, but it is not today and not why it's happening. And worse still, LLMs introduce a new third option: that it's not for indexing or for later linking but is instead either for training the language model itself, or for the model to ingest and regurgitate later on with no attribution, with the added fun that it might just make some shit up about whatever you said and be wrong. And as the person buying the web hosting, all of that is subsidized by me.
"The web is changing" does not mean every website must follow suit. Since I built my blog about 2 internet eternities ago, I have seen fad tech come and fad tech go. My blog remains more or less exactly what it was 2 decades ago, with more content and a better stylesheet. I have requested in my robots.txt that my content not be used for LLM training, and I fully expect that to be ignored because tech bros don't respect anyone, even fellow tech bros, when it means they have to change their behavior.
It's a clear road to disaster. I am honestly surprised by how great Hacker News is in comparison, where most people are sharing things for the love of the craft. And for that Hacker News holds a special place in my heart. (Slightly exaggerating to give it a thematic ending, I suppose.)
They will be quite a bit wiser if they track/limit how often your shopper enters the store. You probably aren't entering the same store fifteen times every day, and neither would your shopper be if they were only doing it on your behalf.
Might does not make right.
It's like saying a web browser that is customized in any way is wrong. If one configures their browser to eagerly load links so that their next click is instant, is that now wrong?
that's called breaking and entering, and generally frowned upon -- bypassing the "closed" sign.
The line is drawn for me on my own computer. Even if I am in your building, my phone remains mine.
First time hearing this. Almost every single grocery store either supports Instacart or has a partnership with a similar service.
The next step in your progression here might be:
If / when people have personal research bots that go and look for answers across a number of sites, requesting many pages much faster than humans do - what's the tipping point? Is personal web crawling ok? What if it gets a bit smarter and tries to anticipate what you'll ask, and does a bunch of crawling to gather information regularly to try to stay up to date on things (from your machine)? Or is it when you tip the scale further and do general / mass crawling for many users to consume that it becomes a problem?
But of course, most website publishers would hate that. Because they don't want people to access their content, they want people to look at the ads that pay them. That's why to them, the IA crawling their website is akin to stealing. Because it's taking away some of their ad impressions.
>Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Many websites (especially the bigger ones) are just businesses. They pay people to produce content, hopefully make enough ad revenue to make a profit, and repeat. Anything that reproduces their content and steals their views has a direct effect on their income and their ability to stay in business.
Maybe IA should have a way for websites to register to collect payment for lost views or something. I think it’s negligible now, there are likely no websites losing meaningful revenue from people using IA instead, but it might be a way to get better buy in if it were institutionalized.
Magazines and newspapers were able to be funded by native ads because you couldn't auto-remove ads from their printed media and nobody could clone their content and give it away for free.
And those ads don't spy. They tend to be a jpg that functions as a link. That's why I mentioned spying.
When ads were far less invasive, I had a lot more tolerance.
Now they want my data, they want to play audio, video, hijack the content, page etc.
Advertising scum can not be trusted to forever take more and more and more.
I'm ok with this. I support the media I truly want to see, and that media offers alternatives that are not ads.
For instance, I pay for YouTube premium. That said, many will not pay.
Seems like a reasonable stance would be something like "Following the no crawl directive is especially necessary when navigating websites faster than humans can."
> What if it gets a bit smarter and tried to anticipate what you'll ask and does a bunch of crawling to gather information regularly to try to stay up to date on things (from your machine)?
To be fair, Google Chrome already (somewhat) does this by preloading links it thinks you might click, before you click it.
But your point is still valid. We tolerate it because as website owners, we want our sites to load fast for users. But if we're just serving pages to robots and the data is repackaged to users without citing the original source, then yea... let's rethink that.
Can't you read?
ChatGPT probably uses a cache though. Theoretically, the average load on the original sites could be far less than users accessing them directly.
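Something as simple as a TTL cache in front of the fetcher would do it; a sketch, with the TTL value being illustrative:

```python
# Sketch: TTL-cached fetch so repeated questions don't re-hit the origin site.
import time
import urllib.request

_cache: dict = {}  # url -> (fetched_at, body)
TTL = 3600  # seconds; illustrative

def fetch(url: str) -> bytes:
    now = time.time()
    if url in _cache and now - _cache[url][0] < TTL:
        return _cache[url][1]  # served from cache; the origin sees no request
    body = urllib.request.urlopen(url).read()
    _cache[url] = (now, body)
    return body
```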
You're not the only stakeholder in any of those interactions. There's you, a mediator (search or LLM), and the website owner.
The website owner (or its users) basically do all the work and provide all the value. They produce the content and carry the costs and risks.
The pre-LLM "deal" was that at least some traffic was sent their way, which helps with reach and attempts at monetization. This too is largely a broken and asymmetrical deal where the search engine holds all the cards but it's better than nothing.
A full LLM model that no longer sends traffic to websites means there's zero incentive to have a website in the first place, or it is encouraged to put it behind a login.
I get that users prefer an uncluttered direct answer over manually scanning a puzzling web. But the entire reason that the web is so frustrating is that visitors don't want to pay for anything.
They are already paying; it is the way they are paying that causes the mess. When you buy a product, some fraction of the price is the ad budget that then gets distributed to websites showing ads. Therefore there is also nothing wrong with blocking ads: they have already been paid for, whether you look at them or not. The ad budget will end up somewhere as long as not everyone is blocking all ads; only the distribution will get skewed. Which admittedly might be a problem for websites that have a user base that is disproportionately likely to use ad blockers.
Paying for content directly has the problem that you can only pay for a select few websites before the amount you have to pay becomes unreasonable. If you read one article on a hundred different websites, you can not realistically pay for a hundred subscriptions that are all priced as if you spent all your time on a single website. Nobody has yet succeeded in creating a web-wide payment method that only charges you for the content you actually consume and is frictionless enough to actually work, i.e. does not force you to make a conscious payment decision for a few cents (or maybe even only fractions of a cent) for every link you click, and is not a privacy nightmare collecting all the links you click for billing purposes.
Also if you directly pay for content, you will pay twice - you will pay for the subscription and you will still pay into the ad budget with all the stuff you buy.
It's not like newspapers where advertising is paid in full before publishers put stories online. It has not been that way for a long time.
Your reasoning for not accessing advertising reminds me of that scene in Arrested Development where, to hide the money they've taken out of the till, they throw away the bananas. It doesn't hide the transaction, it compounds the problem.
If publishers were getting paid before any ads ran the publishing business would be a hell of a lot stronger.
It’s especially stupid because it doesn’t include publishers in the equation at all. It’s just you looping over yourself attempting to validate your choice for running an ad blocker.
Admit you’re doing it because you want to callously screw over publishers. You certainly haven’t put their thoughts into consideration here.
To be clear: Run an ad blocker if you want, but stop acting as if you bought those ads. The chicken dinner I ate the other night has no say how I live my life after our transaction has ended.
If I buy stuff at a grocery store, I can’t get a random bagger fired just because I feel like it. At some point the transaction ends and they ultimately continue to operate with or without your input.
If I am buying Apple products, am I contributing to their ad budget? If so, where does that money end up? Is it likely that some of it will end up as ad revenue on some website? What difference does it make whether or not I block ads? Or the other way around, if I am visiting websites and look at Apple ads but do not buy Apple products, am I contributing to the ad revenue of the websites?
Also, advertising does other things than tell you to buy something, and it doesn’t always take the form of banner ads. Apple, for example, does a ton of brand awareness advertising. Affiliate marketing often targets direct transactions. Maybe your goal is to simply start a relationship that might someday lead to a really big purchase.
Often, in the era of SaaS, people advertise to existing customers. Apple does this—they have a TV service and a music service and a cloud service.
There are plenty of reasons for them to advertise after you bought the original product.
But your original point was that customers bought the ads. Maybe they didn’t! Maybe they were given funding by a VC firm and the company decided it wanted to build an audience. Maybe they want to advocate for a political issue.
I think the biggest problem with your argument is that it has tunnel vision and sees advertising as this one dimensional thing, when in reality it takes many forms. Plenty of those forms are bad, but it is not as simple as “I bought a product, now I never want to see an Apple ad ever again.” Many businesses (Amazon, eBay) make most of their money off of customers they’ve already advertised to that they advertise to again and again.
In fact, you did the opposite.
We can even go one step further, if anyone is screwing over websites, then that is the ad industry by not paying for blocked ads. I buy an iPhone and Apple takes some additional money from me to spend on advertising. I did not ask for that but I am fine with it. Now I expect Apple to spend the money they took from me on ads in order to support websites. But if the guy that Apple wants to show the ad that I paid for does not want to see it and blocks it, then I want Apple to respect that and still pay the website. I know, not going to happen, but do not put the blame on people blocking ads.
Cloudflare released these insights showing the disparity between crawling/scraping and visits referred from the AI platforms.
https://radar.cloudflare.com/ai-insights#crawl-to-refer-rati...
Let’s imagine you have a content creator that runs a paid newsletter. They put in lots of effort to make well-researched and compelling content. They give some of it away to entice interested parties to their site, where some small percentage of them will convert and sign up.
They put the information up under the assumption that viewing the content and seeing the upsell are inextricably linked. Otherwise there is literally no reason for them to make any of it available on the open web.
Now you have AI scrapers, which will happily consume and regurgitate the work, sans the pesky little call to action.
If AI crawlers win here, we all lose.
I think the business model for "content creating" is going to have to change, for better or worse (a lot of YouTube stars are annoying as hell, but sure, stuff like well-written news and educational articles falls under this umbrella as well, so it is unfortunate that they will probably be impacted too).
Cloudflare banning bad actors has at least made scraping more expensive and changed its economics - more sophisticated deception is necessarily more expensive. If the cost of forcing entry is high enough, scrapers might be willing to pay for access instead.
But I can imagine more extreme measures. e.g. old web of trust style request signing[0]. I don’t see any easy way for scrapers to beat a functioning WOT system. We just don’t happen to have one of those yet.
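The signing half of that is already easy; a sketch with Ed25519 via the `cryptography` package. The hard, unsolved part -- the trust graph deciding which keys to honor -- is deliberately omitted:

```python
# Request-signing sketch with Ed25519. The web-of-trust layer (who vouches
# for which public key) is the hard part and is not shown here.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

# Client side: sign the request with a key the trust graph vouches for.
private_key = ed25519.Ed25519PrivateKey.generate()
request_line = b"GET /article/42 HTTP/1.1"
signature = private_key.sign(request_line)

# Server side: verify against the public key found in the trust graph.
public_key = private_key.public_key()
try:
    public_key.verify(signature, request_line)
    print("signature ok -- now check the key's trust chain")
except InvalidSignature:
    print("reject the request")
```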
I think this might actually point at the end state. Scraping bots will eventually get good enough to emulate a person well enough to be indistinguishable (are we there yet?). Then, content creators will have to price their content appropriately. Have a Patreon, for example, where articles are priced at the price where the creator is fine with having people take that content and add it to the model. This is essentially similar to studios pricing their content appropriately… for Netflix to buy it and broadcast it to many streaming users.
Then they will have the problem of making sure their business model is resistant to non-paying users. Netflix can’t stop me from pointing a camcorder at my TV while playing their movies, and distributing it out like that. But, somehow, that fact isn’t catastrophic to their business model for whatever reason, I guess.
Cloudflare can try to ban bad actors. I'm not sure if it is Cloudflare, but as someone who usually browses without JavaScript enabled I often bump into "maybe you are a bot" walls. I recognize that I'm weird for not running JavaScript, but eventually their filters will have the problem where the net that captures bots also captures normal people.
Interested to see some LLM-adversarial equivalent of MPAA dots! [1]
Yes, you can identify who got paid to sign a key and ban them. They will create another key, go to someone else, pretend to be someone not yet signed up for WoT (or pay them), and get their new key signed, and sign more keys for money.
So many people will agree to trust for money, and accountability will be so diffuse, that you won't be able to ban them all. Even you, a site operator, would accept enough money from OpenAI to sign their key, for a promise the key will only be used against your competitor's site.
It wouldn't take a lot to make a binary-or-so tree of fake identities, with exponential fanout, and get some people to trust random points in the tree, and use the end nodes to access your site.
Heck, we even have a similar problem right now with IP addresses, and not even with very long trust chains. You are "trusted" by your ISP, who is "trusted" by one of the RIRs or from another ISP. The RIRs trust each other and you trust your local RIR (or probably all of them). We can trace any IP to see who owns it. But is that useful, or is it pointless because all actors involved make money off it? You know, when we tried making IPs more identifying, all that happened is VPN companies sprang up to make money by leasing non-identifying IPs. And most VPN exits don't show up as owned by the VPN company, because they'd be too easy to identify as non-identifying. They pay hosting providers to use their IPs. Sometimes they even pay residential ISPs so you can't even go by hosting provider. The original Internet was a web of trust (represented by physical connectivity), but that's long gone.
Meanwhile it's going to fuck over real users.
This is the hypothesis I always personally find fascinating in light of the army of semi-anonymous Wikipedia volunteers continuously gathering and curating information without pay.
If it became functionally impossible to upsell a little information for more paid information, I'm sure some people would stop creating information online. I don't know if it would be enough to fundamentally alter the character of the web.
Do people (generally) put things online to get money or because they want it online? And is "free" data worse quality than data you have to pay somebody for (or is the challenge more one of curation: when anyone can put anything up for free, sorting high- and low-quality based on whatever criteria becomes a new kind of challenge?).
Jury's out on these questions, I think.
Existing subject-matter experts who blog for fun may or may not stick around, depending on what part of it is “fun” for them.
While some must derive satisfaction from increasing the total sum of human knowledge, others are probably blogging to engage with readers or build their own personal brand, neither of which is served by AI scrapers.
Wikipedia is an interesting case. I still don’t entirely understand why it works, though I think it’s telling that 24 years later no one has replicated their success.
Many people put more effort into their hobbies than into their "full time" job.
Some of it will go away but perhaps without the expectation that you can earn money more people will share freely.
> While some must derive satisfaction from increasing the total sum of human knowledge, others are probably blogging to engage with readers or build their own personal brand, neither of which is served by AI scrapers.
We don't have to make all business models that someone might want possible though.
> Wikipedia is an interesting case. I still don’t entirely understand why it works, though I think it’s telling that 24 years later no one has replicated their success.
Actually this model is quite common. There are tons of sources of free information curated by volunteers - most are just too niche to get to the scale of Wikipedia.
IME it's mostly because someone else put something "wrong" online first.
E.g., Sheldon Brown's bicycle blog is something of a work of art and one of the best bicycle resources literally anywhere. I don't know the man, but I'd be surprised if he'd put in the same effort without the "brand" behind it -- thankful readers writing in, somebody occasionally using the donate button to buy him a coffee, people like me talking about it here, etc.
Now, most of the value I find in the web comes from niche home-improvement forums (which Reddit has mostly digested). But even Reddit has a problem if users stop showing up from SEO.
Ultimately the root issue is that copyright is inherently flawed because it tries to increase available useful information by restricting availability. We'd be better off by not pretending that information is scarce and looking for alternative to fund its creation.
The internet is filled with spam. But if you talk to one specific human, your chance of getting a useful answer rises massively. So in a way, a flood of written AI slop is making direct human connections more valuable.
Instead of having 1000+ anonymous subscribers for your newsletter, you'll have a few weekly calls with 5 friends each.
I've been working on AI agent detection recently (see https://stytch.com/blog/introducing-is-agent/ ) and I think there's genuine value in website owners being able to identify AI agents to e.g. nudge them towards scoped access flows instead of fully impersonating a user with no controls.
On the flip side, the crawlers also have a reputational risk here where anyone can slap on the user agent string of a well known crawler and do bad things like ignoring robots.txt . The standard solution today is to reverse DNS lookup IPs, but that's a pain for website owners too vs. more aggressive block-all-unusual-setups.
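That reverse DNS dance, sketched in Python. The suffixes shown are the ones Google documents for Googlebot; other crawlers publish their own:

```python
# Verify a claimed crawler IP: reverse-resolve it, check the hostname suffix,
# then forward-resolve the hostname and confirm it maps back to the same IP.
import socket

def verify_crawler(ip: str, suffixes=(".googlebot.com", ".google.com")) -> bool:
    try:
        host, _, _ = socket.gethostbyaddr(ip)    # reverse DNS (PTR record)
        if not host.endswith(suffixes):
            return False
        return socket.gethostbyname(host) == ip  # forward-confirm
    except (socket.herror, socket.gaierror):     # no PTR / lookup failure
        return False
```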
That would trigger an internet-wide "fetch" operation. It would probably upset a lot of people and get your AI blocked by a lot of servers. But it's still in direct response to a user request.
Maybe that would result in limited fetching instead of internet wide fetching. I dunno, just spitballing.
B/ my brother used to use "fetcher" as a non-swear for "fucker"
AI crawlers for non-open models void the implicit contract. First they crawl the data to build a model that can do QA. Proprietary LLM companies earn billions with knowledge that was crawled from websites and websites don't get anything in return. Fetching for user requests (to feed to an LLM) is kind of similar - the LLM provider makes a large profit and the author that actually put in time to create the content does not even get a visit anymore.
Besides that, if Perplexity is fine with evading robots.txt and blocks for user requests, how can one expect them not to use the fetched pages to train/fine-tune LLMs (as a side channel when people block crawling for training)?
Perplexity's "web crawler" is mostly operating like this on behalf of users, so they don't need a massively expensive computer to run an LLM.
It is confusing.
The problem is with the LLM then training on that data _once_ and then storing it forever and regurgitating it N times in the future without ever crediting the original author.
So far, humans themselves did this, but only for relatively simple information (ratio of rice and water in specific $recipe). You're not gonna send a link to your friend just to see the ratio, you probably remember it off the top of your head.
Unfortunately, the top of an LLM's head is pretty big, and they are fitting almost the entire website's content in there for most websites.
The threshold beyond which it becomes irreproducible for human consumers, and therefore copyrightable (a lot of copyright law has a "reasonable" term which refers to this same concept), has now shifted up many, many times higher.
Now, IMO:
So far, for stuff that won't fit in someone's head, people were using citations (academia, for example). LLMs should also use citations. That solves the ethical problem pretty much. That the ad ecosystem chose views as the monetisation point and is thus hurt by this is not anyone else's problem. The ad ecosystem can innovate and adjust to the new reality in their own time and with their own effort. I promise most people won't be waiting. Maybe google can charge per LLM citation. Cost Per Citation, you even maintain the acronym :)
The "social contract" that has been established over the last 25+ years is that site owners don't mind their site being crawled reasonably provided that the indexing that results from it links back to their content. So when AltaVista/Yahoo/Google do it and then score and list your website, interspersing that with a few ads, then it's a sensible quid pro quo for everyone.
LLM AI outfits are abusing this social contract by stuffing the crawled data into their models, summarising/remixing/learning from this content, claiming "fair use" and then not providing the quid pro quo back to the originating data. This is quite likely terminal for many content-oriented businesses, which ironically means it will also be terminal for those who will ultimately depend on additions, changes and corrections to that content - LLM AI outfits.
IMO: copyright law needs an update to mandate no training on content without explicit permission from the holder of the copyright of that content. And perhaps, as others have pointed out, an llms.txt to augment robots.txt that covers this for llm digestion purposes.
EDIT: Apparently llms.txt has been suggested, but from what I can tell this isn't about restricting access: https://llmstxt.org/
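For what it's worth, a permission-oriented llms.txt like the one proposed here might look like the following. These directives are entirely hypothetical; unlike the llmstxt.org proposal (which is about presentation, not permission), no crawler recognizes them today:

```
# Hypothetical syntax only -- invented for this comment, not any standard.
User-agent: *
Disallow-training: /
Allow-inference-fetch: /blog/
```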
Let's be real, Google et al have been doing this for years with their quick answer and info boxes. AI chatbots are worse but it's not like the big search engines were great before AI came along. Google had made itself the one-stop shop for a huge percentage of users. They paid billions to be the default search engine on Apple's platforms not out of the goodness of their hearts but to be the main destination for everyone on the web.
The argument that LLM outfits are using is that they are just exercising “fair use” / education rights to do an end run around copyright law. Without strengthening the rules on that I’m not sure I see how the database + team of lawyers approach would work.
But with that, sure, that’s an approach that seems to have legs in other contexts.
Mojeek LLM (https://www.mojeek.com) uses citations.
I think one thing to ask outside of this question is how long before your LLM summaries also include ads and other manipulative patterns.
But a stealth bot has been crawling all these URLs for weeks. Thus wasting a shitload of our resources AND a shitload of their resources too.
Whoever it is (and I now suspect it is Perplexity based on this Cloudflare post), they thought they were being so clever by ignoring our robots.txt. Instead they have been wasting money for weeks. Our block was there for a reason.
It could be a personal knowledge management system, but it seems like knowledge management systems should be operating off of things you already have. The research library down the street isn't considered a "personal knowledge management system" in any sense of the term, if you know what I mean. If you dispatch an army of minions to take notes on the library's contents, that doesn't seem personal. Similarly if you dispatch the army of minions to a bookstore rather than a library. At the very least, bring the item into your house/office first. (Libraries are a little different because they are designed for studying and taking notes; it's the army-of-minions aspect that's the problem.)
Yes, Orbit, a now-deprecated Firefox extension by Mozilla, was doing that. This way you could also use it to summarize content that would not be available to a third party (e.g. something in Google Docs).
You can still sort of do the same with the AI chatbot panel in Firefox: Ctrl+A > right click > AI chatbot > summarize.
Perplexity is not visiting a website every time a user asks about it. It's frequently crawling and indexing the web, thus redirecting traffic away from websites.
This crawling reduces costs and improves latency for Perplexity and its users. But it's a major threat to the crawled websites.
In fact, the "old web" people sometimes pine for was mostly a place where people were putting things online so they were online, not because it would translate directly to money.
Perhaps AI crawlers are a harbinger for the death of the web 2.0 pay-for-info model... And perhaps that's okay.
Then came the social networks and walled gardens, SEO, and all the other cancers of the last 20 years, and all of this disappeared into un-searchable videos, content farms, and Discord communities, which are basically informational black holes.
And now AI is eating that cancer, but IMO it's just one cancer being replaced by an even more insidious cancer. If all the information is accessed via AI, then the last semblance of interaction between content creators and content consumers disappears. There are no more communities, just disconnected consumers interacting with a massive aggregating AI.
Instead of discussing an interesting topic with a human, we will discuss with AI...
> Because Pipes is not a web crawler (the service only retrieves URLs when requested to by a Pipe author or user) Pipes does not follow the robots exclusion protocol, and won't check your robots.txt file.
In 3, although you do not specify, I assume you mean that a bot requests the page, as opposed to you visiting the page like in scenario 2 and then an LLM processes the downloaded data (similarly to an adblocker). It is the former case that is a problem, the latter case is much harder to stop and there is much less reason to stop it.
This is the distinction: is a human present at the time of request.
There are legitimate reasons to do that, of course. Maybe I am trying to find info about some niche topic or how to do X, I ask an llm, the llm goes through some search results, a lot of which is search engine optimised crap, finds the relevant info and answers my question.
But if I wrote articles on a news site, supported by ads or subscriptions, and saw my visits plummet because people who would usually google about topic X and then visit my website were now just reading the Google summary based on my article, maybe I would have less motivation to continue writing.
The only end result possible in such a scenario is everything commercial of some quality being heavily paywalled, a tiny amount of free and open small web, and a huge amount of AI-generated slop, because the value of an article on the open internet is now so low that only AI can produce it (economically, time-wise) efficiently enough.
Perplexity is choosing to come back "on a VPN" with new IP addresses to evade the block.
#2 and #3 are about modifying data where access has been granted; I think Cloudflare is really complaining about #1.
Evading an IP address ban doesn't violate my principles in some cases, and does in others.
It’s talking about Perplexity crawling sites on demand in response to user queries and then complaining that no it’s not fine, hence this thread.
Circa 2008 I worked for a startup that would scrape Google Books and a variety of other sources for public domain content to then print via Amazon’s Print-on-Demand services. Google, of course, didn’t like this and introduced a Captcha not very long after we started scraping.
So we hired a team of underemployed / unemployed English majors during the height of the recession, paid them $10 per hour to type in Captchas all day long and we downloaded their full corpus anyways.
If Google can’t win, you won’t either.
I have a dinky little txt site on my email domain. There is nothing of value on it, and the content changes less than once a year. So why are AI scrapers hitting it to the tune of dozens of GB per month?
At the limit, this problem is the problem of "keeping secrets while not keeping secrets" and is unsolvable. If you've shared your site content to one entity you cannot control, you cannot control where your site content goes from there (technologically; the law is a different question).
Proprietary web browsers are in a really good position to do something like this, especially if they offer a free VPN. The browser would connect to the "VPN servers", but it would be just to signal that this browser instance has an internet connection, while the requests are just proxied through another browser user.
That way the company that owns this browser gets a free network of residential IP address ready to make requests (in background) using a real web browser instance. If one of those background requests requires a CAPTCHA, they can just show it to the real user, e.g. the real user visits a Google page and they see a Cloudflare CAPTCHA, but that CAPTCHA is actually from one of the background requests (while lying in its UI and still showing the user a Google URL in the address bar).
Definitely don't agree. I don't think you should be shown the content, if for example:
1. You're in a country the site owner doesn't want to do business in.
2. You've installed an ad blocker or other tool that the site owner doesn't want you to use.
3. The site owner has otherwise identified you as someone they don't want visiting their site.
You are welcome to try to fool them into giving you the content but it's not your right to get it.
I disagree. The website should have the right to say that the user can be shown the content under specific conditions (usage terms, presented how they designed, shown with ads, etc). If the software can't comply with those terms, then the human shouldn't be shown the content. Both parties did not agree in good faith.
If the user doesn't want to view content with ads, that's okay and they can go elsewhere.
Because the website has every right to block you or refuse you access if you do that - just like an establishment has the right to refuse you entry if you try to come in without a shirt - when you're denying them the revenue they predicated your access on.
Similarly, if you're using a user-agent the website doesn't like, they have the right to block you, or take action against that user-agent to prevent it from existing if they can't reliably detect it to block it.
Because the LLM is usually on a 3rd party cloud system and ultimately not under your full control. You have no idea if the LLM is retaining any of that information for that business's own purposes beyond what a EULA says - which basically amounts to a pinky swear here. Especially if that LLM is located across international borders.
Now, for something like Ollama or LMStudio where the LLM and the whole toolchain is physically on your own system? Yeah that should be like Firefox legally since it's under your control.
I think the main concern here is the huge amount of traffic from crawling just for content for pre-training.
If I put time and effort into a website and its content, I should expect no compensation despite bearing all costs.
Is that something everyone would agree with?
The internet should be entirely behind paywalls, besides content that is already provided ad free.
Is that something everyone would agree with?
I think the problem you need to be thinking about is "How can the internet work if no one wants to pay anything for anything?"
the answer is apparently "no", and I don't really see how recipe books have suffered as a result of less gatekeeping.
"How will the internet work"? Probably better in some ways. There is plenty of valuable content on the internet given for free, it's being buried in low-value AI slop.
But what is your point? Is the value in HN primarily in its hosting, or the non-ad-supported community?
Taking HN as a potential one of these places, it doesn't even qualify. HN is funded entirely to be a place for advertising ycombinator companies to a large crowd of developers. HN is literally a developer honey pot that they get exclusive ad rights to.
Because quantity has a quality of its own.
I say this as someone who is on the side of the local user commanding how local compute works, but I understand why companies are reacting to how cheap LLMs have made information discovery against their own datasets.
One thing that comes to my mind is: If a human tries to answer a question via the web, he will browse one site after the other.
If that human asks an LLM, it will ping 25 sites in parallel.
Scale this up to all of humanity, and it should be expected that internet traffic will rise 25x - just from humans manually asking questions every now and then - we are not even talking about AI companies actively crawling the web.
That means webmasters will have to figure out aggressive caching and let CDNs deal with the problem, or put everything behind a login screen (which might also just be a temporary fix).
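A quick way to check whether your pages are even cacheable by a CDN is to look at the response headers (URL and values here are illustrative):

$ curl -sI https://example.com/article | grep -i '^cache-control'
cache-control: public, max-age=86400

Anything marked private or no-store has to be regenerated for every one of those parallel requests.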
Not me, I often open multiple tabs and windows at once to compare and contrast the results.
And you’re right: there’s no difference. The web is just machines sending each other data. That’s why it’s so funny that people panic about “privacy violations” and server operators “spying on you”.
We’re just sending data around. Don’t send the data you don’t want to send. If you literally send the data to another machine it might save it. If you don’t, it can’t. The data the website operator sends you might change as a result but it’s just data. And a free interaction between machines.
I think the line is drawn at "on my behalf". The silent agreement of the web is that humans are served content via a browser, and robots are obeying rules. All we need to support this status quo is to perform data processing by ML models on a client's side, in the browser, the same way we rip out ads.
This is well defined in specs and ToS, not quite a gray area
The whole concept of a "website" will simply become niche. How many zoomers still visit any but the most popular websites?
is it just on your behalf? or is it on Perplexity's behalf? are they not archiving the pages to train on?
it's the difference between using Google Chrome vs. Chrome beaming full page snapshots to train Gemini on.
If you as a human spam the shit out of my website and waste my resources, I will block you.
If you as a human use an agent (or browser or extension or external program) that modifies network requests on your behalf, but doesn't act as a massive leech, you're still welcome.
If you as a human use an agent (or browser or extension or external program) that wrecks my website, I will block you and the agent you rode in on.
Nobody would mind if you had an LLM that intelligently knew what pages contain what (because it had a web crawler backed index that refreshes at a respectful rate, and identifies itself accurately as a robot and follows robots.txt), and even if it needed to make an instantaneous request for you at the time of a pertinent query, it still identified itself as a bot and was still respectful... there would be no problem.
The problem is that LLMs are run by stupid, greedy, evil people who don't give the slightest shit what resources they use up on the hosts they're sucking data from. They don't care what the URLs are, or what the site owner wants to keep you away from. They download massive static files hundreds or thousands of times a day, not even doing a HEAD request to see that the file hasn't changed in 12 years. They straight up ignore robots.txt, and in fact use it as a template of what to go for first. It's like hearing an old man say "I need time to stand up because of this problem with my kneecaps" and thinking "right, I'd best go for his kneecaps because he's weak there".
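For what it's worth, the courtesy being described here is nearly free to implement. A sketch with curl (URL and ETag value illustrative, and assuming the server emits an ETag): check the validator once, then make every re-fetch conditional, so an unchanged file costs a 304 instead of a full download:

$ curl -sI https://example.com/big-archive.tar.gz | grep -i '^etag'
etag: "5f3c-61a2b9d4"
$ curl -s -o /dev/null -w '%{http_code}\n' -H 'If-None-Match: "5f3c-61a2b9d4"' https://example.com/big-archive.tar.gz
304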
There are plenty of open crawler datasets, they should be using those... but they don't, they think that doesn't differentiate them enough from others using "fresher" data, so they crawl even the smallest sites dozens of times a day in case those small sites got updated. Their badly written software is wrecking sites, and they don't care about the wreckage. Not their problem.
The people who run these agents, LLMs, whatever, have broken every rule of decency in crawling, and they're now deliberately evading checks to try and run away from the repercussions of their actions. They are bad actors and need to be stopped. It's like the fuckwads who scorch the planet mining bitcoin; there's so much money flowing into the AI market that they feel they have to fuck over everyone else, as soon as possible, otherwise they won't get that big flow of money. They have zero ethics. They have to be stopped before their behaviour destroys the entire internet.
It is illegal to copy stuff from the internet and then make it available from your own servers, especially when those sources have expressly asked you not to do it.
The web admin should be able to block usages 1, 2 or 3 at their discretion. It's their website.
Similarly the user is free to try to engage via 1, 2, 3, or refuse to interact with the website entirely.
2. This also intersects with copyright law. Ingesting content to your servers en masse through automation and transforming it there is not the same as giving people a tool (like Safari Reader) they can run on their client for specific sites they visit. Examples of companies that lost court cases about this:
Aereo, Inc. v. American Broadcasting Companies (2014)
TVEyes, Inc. v. Fox News Network, LLC (2018)
UMG Recordings, Inc. v. MP3.com, Inc. (2000)
Capitol Records, LLC v. ReDigi Inc. (2018)
Cartoon Network v. CSC Holdings (Cablevision) (2008)
Image Search Engines: Perfect 10 v. Google (2007)
That last one is very instructive. Caching thumbnails and previews may be OK. The rest is not. AMP is in a copyright grey area, because publishers choose to make their content available for AMP companies to redisplay. (@tptacek may have more on this)

3. Putting copyright law aside, that's the point. Decentralization vs Centralization. If a bunch of people want to come eat at an all-you-can-eat buffet, they can, because we know they have limited appetites. If you bring a giant truck and load up all the food from all the all-you-can-eat buffets in the city, that's not OK, even if you later give the food away to homeless people for free. You're going to bankrupt the restaurants! https://xkcd.com/1499/
So no. The difference is that people have come to expect "free" for everything, and this is how we got into ad-supported platforms that dominate our lives.
I expect a lot more paywalls for valuable content. General information is commoditized and offered in aggregated form through models. But when an AI is fetching information for you from a website, the publisher is still paying the cost of producing that content and hosting that content. The AI models are increasing the cost of hosting the content and then they are also removing the value of producing the content since you are just essentially offering value to the AI model. The user never sees your site.
I know ads are unpopular here, but the truth is that's how publishers were compensated for your attention. When an AI model views the information a publisher produces, modifies it from its published form, and strips all ad content, you get increased costs for producers, reduced compensation for producing content (since they are not getting ad traffic), and content that isn't even delivered in its original form.
The end result is that publishers now have to paywall their content.
Maybe an interesting middle ground is if the AI model companies compensated for the content they access, similar to how Spotify compensates for plays of music. So if an AI model uses information from your site, it pays that publisher a fraction of a cent. People pay the AI models, and the AI models distribute that to the producers of the content that feeds and adds value to the models.
Would that solve the whole problem? Folks who run webservers declare what they consent to, and that happens?
I think it's useful to just see if there's a consensus on that: actually making that happen is a whole can of worms itself, but it's strictly simpler than devising a good outcome without the consensus.
(And such things are not impossible, merely difficult; we have other systems, ranging from BGP to the TLD mechanism, that get honored in real life.)
The problem is that those in the position to do that are not interested.
They are running a service and this is not a user taking steps to modify their own content for their own use.
Perplexity is not acting as a user proxy and they need to learn to stick to the rules, even when it interferes with their business model.
The goal is not to make the content usable. The goal is to get the traffic.
When advertising alone is the "business model", i.e., not the value of the "content", then even Cloudflare is going to try to protect it (the advertising, not the content). Anything to get www users to turn on JavaScript so the surveillance capitalism can proceed. Hence all the "challenges" to frustrate and filter out software that is not advertising-friendly, i.e., not graphical.
Cloudflare's ruminations on user-agent strings are perplexing. It has been an expectation that the user-agent HTTP header will be spoofed since the earliest web browsers. The user-agent header is a joke.
This is from circa 1993, the year the www was opened to public access:
https://raw.githubusercontent.com/alandipert/ncsa-mosaic/mas...
Cloudflare's "bot protections" are not to ensure human use of a website but to ensure use of specific software to access a website. Software that facilitates data collection and advertising services. For example, advertising-sponsored browsers. Any other software is labeled "bot". It does not matter if a human is operating it.
It does stop any humans not using popular advertising-sponsored web browsers.
2. the end
I am firmly convinced that this should be the future in the next decade, since the internet as we know it has been weaponized and ruined by social media, bots, state actors and now AI.
There should exist an internet for humans only, with a single account per domain.
People are usually fine with the latter but not the former, even though they come down to the same thing.
I think this is because people don't want LLMs to train on their content, and they don't differentiate between accessing a website to show it to the user, versus accessing it to train.
Do you -really- want that much abstraction?
There's a bunch of nerds and capitalists about to rediscover GIGO.
Also because there is a difference between a user hitting f5 a couple times and a crawler doing a couple hundred requests.
Also because ultimately, by intermediating the request, llm companies rob website owners of a business model. A newspaper may be fine letting adblockers see their article, in hopes that they may eventually subscribe. When a LLM crawls the info and displays it with much less visibility for the source, that hope may not hold.
> PayPal Honey is a browser extension that automatically finds and applies coupon codes at checkout with a single click.
Therefore artifices like "you don't have the right to view this website without ads" or "you can't use your phone, computer, or LLM to download or process this outside of my terms because copyright" become possible, institutionalizable, enforceable, and eventually unbypassable by technology.
If we reverted to the Constitutional purpose of copyright (to promote the Progress of Science and useful Arts), then things might be more free. That's probably not happening in my lifetime or yours.
Do you think you do?
Or is there a balance between the owner's rights, who bears the content production and hosting/serving costs, and the rights of the end user who wishes to benefit from that content?
If you say that you have the right, and that right should be legally protected, to do whatever you want on your computer, should the content owner not also have a legally protected right to control how, and by who, and in what manner, their content gets accessed?
That's how it currently works in the physical world. It doesn't work like that in the digital world due to technical limitations (which is a different topic, and for the record I am fine with those technical limitations as they protect other more important rights).
And since the content owner is, by definition, the owner of the content in question, it feels like their rights take precedence. If you don't agree with their offering (i.e. their terms of service), then as an end user you don't engage, and you don't access the content.
It really can be that simple. It's only "difficult to solve" if you don't believe a content owner's rights are as valid as your own.
The reason people are up in arms is because rights they previously enjoyed are being stripped away by the current platforms. The content owner's rights aren't as valid as my own in the current world; they trump mine 10 to 1. If I "buy" a song and the content owner decides that my country is politically unfriendly, they just delete it and don't refund me. If I request to view their content and they start by wasting my bandwidth sending me an ad I haven't consented to, how can I even "not engage"? The damage is done, and there's no recourse.
I never really minded magazine ads or even television ads. They might have tried to make me associate boobs with a brand of soda but they didn't data mine my life and track me everywhere. I'd much rather have old fashioned manipulation than pervasive and dangerous surveillance capitalism.
If you believe in this principle, fair enough, but are you going to apply this consistently? If it's fair game for a blog to restrict access to AI agents, what does that mean for other user agents that companies disagree with, like browsers with adblock? Does it just boil down to "it's okay if a person does it but not okay if a big evil corporation does it?"
That's... less conclusive than I'd like to see, especially for a content marketing article that's calling out a company in particular. Specifically, it's unclear whether Perplexity was crawling (i.e. systematically viewing every page on the site without the direction of a human), or simply retrieving content on behalf of the user. I think most people would draw a distinction between the two, and would at least agree the latter is more acceptable than the former.
Like most AI companies, Perplexity has established user agent strings for both these cases, and the behavior Cloudflare is calling out uses neither. It pretends to be a person using Chrome on macOS.
They do end up looking bad in Cloudflare's report, and Cloudflare are the "good guys" in this story - by the way, Cloudflare has been very pushy lately with their "we'll save the web", content-independence-day marketspeak. But deep in the back of my head, Cloudflare's goodwill elevates Perplexity's cunning abilities (assuming they're the culprit, since the OP offers no real evidence, only hearsay). Both companies look like titans fighting, which ends up being positive for Perplexity, at least in the inflated perception of their firepower... if that makes any sense.
Personally, I'm now less interested in using Perplexity, and more interested in using an OpenAI product.
Crawling is legal. Training is presumably legal. Long may the little guys do both.
There is a user agent for search that you can control in robots.txt:

user-agent: Googlebot

There is another user agent for AI training:

user-agent: Google-Extended
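A minimal robots.txt along these lines (directives illustrative) lets search indexing continue while opting out of AI training:

user-agent: Googlebot
allow: /

user-agent: Google-Extended
disallow: /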
And very likely Perplexity is in fact using a Chrome-compatible engine to render the page.
That is true. But robots.txt is not designed to give them the ability to prevent this.
Otherwise it's just adding an unwilling website to a crawl index, and showing the result of the first crawl as a byproduct of that action.
That's where you lost me, as this is key to GP's point above and it takes more than a mere out-of-left-field declaration that "it doesn't matter" to settle the question of whether it matters.
I think they raised an important point about using cached data to support functions beyond the scope of simple at-request page retrieval.
That's basically how many crowdsourced crawling/archive projects work - for instance, Sci-Hub and RECAP[1]. Do you think they should be shut down as well? In both cases there's an even stronger justification for shutting them down, because the original content is paywalled and you could plausibly argue there's lost revenue on the line.
No. I should be able to control which automated retrieval tools can scrape my site, regardless of who commands it.
We can play cat and mouse all day, but I control the content and I will always win: I can just take it down when annoyed badly enough. Then nobody gets the content, and we can all thank upstanding companies like Perplexity for that collapse of trust.
It's also a gift to your competitors.
You're certainly free to do it. It's just a really faint example of you being "in control" much less winning over LLM agents: Ok, so the people who cared about your content can't access it anymore because you "got back" at Perplexity, a company who will never notice.
Who cares if Perplexity never notices, or if competitors get an advantage? It is a negative for users using Perplexity or visiting directly, because the content doesn't exist.
That's the world Perplexity and others are creating. They will be able to pull anything from the web, but nothing will be left.
But they didn't take down the content, you did. When people running websites take down content because people use Firefox with ad-blockers, I don't blame Firefox either, I blame the website.
That skips the part about one party's unique role in the abuse of trust.
But really, controlling which automated retrieval tools are allowed has always been more of a code of honor than a technical control. And that trust you mention has always been broken. For as long as I can remember anyway. Remember LexiBot and AltaVista?
>Perplexity spokesperson Jesse Dwyer dismissed Cloudflare’s blog post as a “sales pitch,” adding in an email to TechCrunch that the screenshots in the post “show that no content was accessed.” In a follow-up email, Dwyer claimed the bot named in the Cloudflare blog “isn’t even ours.”
Either way, the CDNs profit big time from the AI scraping hype and the current copyright anarchy in the US
There are ways to build scrapers using browser automation tools [0,1] that makes detection virtually impossible. You can still captcha, but the person building the automation tools can add human-in-the-loop workflows to process these during normal business hours (i.e., when a call center is staffed).
I've seen some raster-level scraping techniques used in game dev testing 15 years ago that would really bother some of these internet police officers.
no, because we'll end up with remote attestation needed to access any site of value
"Either pay us $50/month or install our extension, and when prompted, solve any captchas or authenticate with your ID (as applicable) on the given website so we can train on the content.
When 99.9% of users are using the same few types of locked down devices, operating systems, and browsers that all support remote attestation, the 0.1% doesn't matter. This is already the case on mobile devices, it's only a matter of time until computers become just as locked down.
This case (“go research this subject for me”) is the grey area here. It’s not the same as simple scraping or search indexing, it’s a new activity that is similar in some ways.
$ curl -sI https://www.perplexity.ai | head -1
HTTP/2 403
Edit: trying to fake a browser user agent with curl also doesn't work; they're using a more sophisticated method to detect crawlers (see the attempt below). CF being internet police is a problem too, but someone credible publicly shaming a company for shady scraping is good, even if it just creates conversation.
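For instance, even sending a full Chrome user-agent string (string illustrative) still gets rejected:

$ curl -sI -A 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36' https://www.perplexity.ai | head -1
HTTP/2 403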
Somehow this needs to go back to the search era, where all players at least attempted to behave. This scraping-DDoS, I-don't-care-if-it-kills-your-site (while "borrowing" its content) stuff is unethical bullshit.
The engine can go and download pages for research. BUT, if it hits a captcha, or is otherwise blocked, then it bails out and moves on. It pisses me off that these companies are backed by billions in VC and they think they can do whatever they want.
god help us if they ever manage to build anything more than shitty chatbots
The HTTP protocol does not specify what is right and wrong. The fact a protocol encodes or permits a particular kind of behaviour does not mean that every use of the protocol is ethically justified. I am sure you would agree with me that "black people can't visit this server" would be such an unethical rule, even though HTTP permits you to enforce such a rule. So let's forget about the protocol for a minute.
Is it morally wrong to lie about your User Agent in order to visit a website? Well, that depends on whether it is legitimate for the server operator to discriminate according to the User Agent. If it is not legitimate, then lying about your User Agent to circumvent the restriction is morally justified.
So we are back at square one: is it legitimate for a server operator to discriminate by the sort of client used to visit them? Since the service is public, the person is allowed to visit the service and to read the content. If the client misbehaves in some way (some LLM scrapers do), then that is a legitimate basis for discrimination. But if this is controlled for, so that the LLM scraper can't be easily distinguished from a human doing the same thing, then the service is not harmed any more than it ordinarily would be. Therefore the discrimination is not legitimate.
Likewise, I may prevent certain user-agents to visit my site. If you - say, an AI megacorp - are intentionally spoofing the user-agent to appear as a user, you are also violating consent.
Fundamentally it's not true that the moment I publish something on the internet, I lose control of who can consume my intellectual property. Licensing, for example, is a way we regulate the way that code or prose can be consumed even if public.
Also expressing my consent is not in any way a way to control others, is a way to control my ideas, my writing, my [whatever] and people are not automatically entitled to it because it's published on the internet.
So overall I understand your position, but I so much disagree with it.
Licensing is much much more limited than you seem to be thinking of it. For instance, you said explicitly you want a way to control your ideas. The only thing this can mean is a way to control who gets to use your ideas, or what they get to use them for. So if I express a political idea in a novel way or tell a funny joke or something I should be able to dictate who gets to repeat it, or in this case with LLMs who gets to summarise and describe it.
This kind of control is antithetical to the spirit of the internet and would be frankly evil if people were actually able to assert it. Luckily in most cases it's impossible, nobody can actually stop me from describing a movie to my friends or from reposting a meme. Just copying and reposting what you wrote verbatim is something we can probably agree is wrong, but that isn't what's up for questioning here. The idea I was actually replying to in the first place was that you can decide somebody can't read your ideas - even if they're public - just because you don't like them or you don't like what they will do with them. It is hard to think of a more egregious kind of 1984-style censorship, really.
There is a place for regulation of LLM companies, they are doing a lot of harm that I wish governments would effectively rein in. It would not be hard if the political will existed. But this idea of saying I should be able to "control my ideas" is way, way worse.
So here the consent is indeed about what can be done with the data.
In general, it's absolutely the norm that public websites (i.e., unauthenticated ones) restrict even who can access the data. The simplest example that comes to mind is geoblocking. I have every right to say that my website is not made available to anybody in the US, for example. Would you still call that website "public"? Would bypassing the block via a VPN be a violation of my consent? This is mostly a moral discussion, I suppose.
But anyway, it's not what's happening here. LLMs access content for the sole purpose of doing something with that content, either training or providing the service to their customers. They are not humans, they are not consumers, they don't simply fetch the content and present it to the users (a much more neutral action, like curl or the browser does). It's impossible to distinguish, in the case of LLMs the act of accessing and the act of using, so the difference you make doesn't apply in my opinion.
> The simplest example that comes to mind is geoblocking.
Do you think it is alright to geoblock people, for arbitrary reasons? It is one thing when GDPR imposes a legal obligation on you for serving content in a particular way. Note that that actually doesn't prevent you from seeing the content, it just prevents you from being served by that server. The distinction is important - circumventing a geoblock is something I think should be legally protected.
> They are not humans, they are not consumers, they don't simply fetch the content and present it to the users
They simply fetch the content, run it through a software, and present it to the user. As far as you, the service owner, are concerned, they are simply fetching the content for the user. It is none of your business what the user and the AI company go on to do with "your content".
No, they are not like browsers. The browser accesses my content in a transparent way. An LLM reuses the information and acts as an opaque intermediary which - maybe - will at most add a reference to my content.
> I never said that an LLM does anything of its own volition
It doesn't matter why it does what it does, it matters what it does. Your previous comment stressed the idea that it's possible to regulate _what can be done_ with my intellectual property (licensing), but not who can access it, once made it public. What I am saying is that this is exactly the case for LLMs, who _use_ my intellectual property, they are not a tool to _access_ it (like a browser).
> Do you think it is alright to geoblock people, for arbitrary reasons?
Yes. Why wouldn't it be? And if you believe it's not, where do you draw the line? Once you share a picture with your partner, everyone has the right to see it? Or if you share it with your group of friends? Or if you share it on a private social media profile (where you have acquaintances)? When does the audience turn from "a restricted group" to "everyone"? Or why would it be different with my blog? If I want my blog accessible only from my country, I can absolutely do that and there is nothing wrong with it at all. Nobody is entitled to my intellectual property.

Obviously I am playing devil's advocate, but this was to say that the fact that something is public doesn't mean it's unrestricted.

And don't get me started on "the spirit of the internet". I can't imagine something breaking that spirit more than LLMs acting as an interface between people and the other people on the internet. That spirit is gone, and belongs to a time when the internet was tiny. When OpenAI and company respect the "spirit of the internet", maybe I will think about doing the same.
> As far as you, the service owner, are concerned, they are simply fetching the content for the user. It is none of your business what the user and the AI company go on to do with "your content".
No, as far as I am concerned, the program can take my information, summarize, change, distort, and misinterpret it, and then present it back to its user. This can happen with or without the user ever knowing that the information came from me. Considering this equal to the user accessing the information is something I simply will not concede, and it is a fundamental disagreement between us from which many other disagreements stem.
Websites are not "public resources"; site operators just mostly choose to allow the general public to access them. There's no legal requirement that they do so.
If you want anti-discrimination laws that apply to businesses to also cover bots, that is well outside of current law. A site operator can absolutely morally and legally decide they do not allow non-human visitors, just like a store can prohibit pets.
No you wouldn't be. Even if someone tells you not to visit your site, you have every legal right to continue visiting it, at least in the US.
Under common interpretation of the CFAA, there needs to be a formal mechanism of authorized access. E.g. you could be charged if you hacked into a password-protected area of someone's site. But if you're merely told "hey bro don't visit my site", that's not going to reach the required legal threshold.
Which is why crawlers aren't breaking the law. If you want to restrict authorization, you need to actually implement that as a mechanism by creating logins, restricting content to logged-in users, and not giving logins to crawlers.
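Concretely (path and credentials illustrative, assuming a server with HTTP basic auth configured), the kind of formal mechanism being described is the difference between these two responses:

$ curl -s -o /dev/null -w '%{http_code}\n' https://example.com/members/
401
$ curl -s -u alice:hunter2 -o /dev/null -w '%{http_code}\n' https://example.com/members/
200

A 401 challenge is an access control; a robots.txt line or a polite request is not.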
LLM programs do not have human rights.
Computer programs don't take actions, people do. If I use a web browser, or scrape some site to make an LLM, that's me doing it, not the program. And I have human rights.
If you think training LLMs should be illegal, just say that. If you think LLM companies are putting an undue strain on computer networks and they should be forced to pay for it, say that. But don't act like it's a virtue to try and capriciously gatekeep access to a public resource.
For example - humans can learn, programs can't. The "learning" cop-out used by LLM corps shouldn't be accepted by anyone, let alone by law. Humans have a fair-use carve-out from copyright law not because it's axiomatic, but because some humans with empathy forced others to allow all humans some leeway in legally using others' IP. Just because such a law exists for humans doesn't mean that random computer programs should fall under it. Scraping the web for LLMs should not be considered "fair use" because a) it clearly is not (it's commercialized later) and b) programs aren't humans and don't have equal rights.
And the list goes on. Now, I do get that train has long left the station and we are all collectively living in the anecdote about stealing a bicycle and asking god for forgiveness. But that doesn't mean I agree with this state. I'm just shouting my displeasure towards that passing train cause I'm weird like that. It's like with climate change - we are doing nothing that matters, no one discusses what really matters and I just accepted that nothing will really change. Doesn't mean I like the situation.
PS: tl;dr - LLMs clearly should be legal, it's just simple code is all. LLM corporations who steal IP content without compensation to the authors should be illegal, but of course they won't ever be.
PPS: there is a huge, gigantic gap between a single person scraping a few thousand pages for a personal use, maybe even some small local commercial use (though that's a grey area already) and a billion dollar megacorp, intent on destroying everything of value for humans in the internet for profit.
Their fetcher (not crawler) has the user agent Perplexity-User. Since the fetching is user-requested, it ignores robots.txt. The article discusses how blocking the "Perplexity-User" user agent doesn't actually work, and how Perplexity uses an anonymous user agent to avoid being blocked.
If I am willing to pay a penny a page, I and the people like me won't have to put up with clickwrap nonsense.
Free access doesn't have to be shut off (OK, it will be, but it doesn't have to be, and doesn't that tell you something?).
Reddit could charge stiffer fees but refund them for quality content, to encourage better contributions. I've fantasized about ideas like "you pay a deposit upfront; get banned and you lose your deposit; withdraw and you get your deposit back", the goal being to simplify moderation while encouraging quality.
Because where the internet is headed is just more and more trash.
Here's another idea: pay a penny per search at Google/the search engine of your choice. If you don't like the results, you can take the penny back. Google's AI can figure out how to please you. If the pennies don't keep coming in, they serve you ad-infested results; if they serve up ad-infested results, you can send your penny to a different search engine.
I want my work to be freely available to any person who wants it. Feel free to transform my material as you see fit. Hell, do it with LLMs! I don't care.
The LLM isn't the problem, it's what companies like Perplexity are doing with the LLM. Do not create commercial products that regurgitate my work as if it was your own. It's de facto theft, if not de jure theft.
Knowing that it is not de jure theft, and so I have no legal recourse, I will continue to tune my servers to block and/or deceive Perplexity and similar tools.
By the way, I do not use my websites as a revenue stream. This isn't about money.
It is your prerogative to tune your servers as you see fit, but as LLM adoption increases you'll merely find that your site has fewer and fewer visits overall, so your content will only be utilized by you and a vanishingly small group of other persons. Perhaps you're OK with that, and that's also fine for the rest of us.
It's strange you mention theft, and then say it isn't about money. For me, and many others, it's about practicality and efficiency. We went from having to visit physical libraries to using search engines, and now we're entering the era of increasingly intelligent content fetch+preprocess tools.
So far, AI has had the opposite effect on my site. I've now been featured on both Hackaday and Adafruit's blog. Both features were clearly AI-generated. Both posts coincided with an influx of emails from folks interested in my work.
Perplexity is good at citing things when it decides to cite things and when you tell it to cite things. It can and does spit out plain expository text with no indication of the information's origin. I do appreciate that you have better-than-usual habits about validating sources.
I think you may have misinterpreted my remark about money. With the direction conversations around AI have been going lately, I was expecting a backhanded accusation that I was farming ad revenue.
"It's not about money" meant that I have nothing to lose financially by losing direct human traffic to my websites. Instead, I stand to lose those aforementioned email conversations.
This may be missing some context, but it seems as though you're saying that you made something with AI and it led to traction. That's great! Seems off the point that blocking LLM service will lead to less exposure over time though.
> Perplexity is good at citing things when it decides to cite things and when you tell it to cite things.
Maybe I'm just lucky, but a quick skim of my Perplexity history yielded only 2 instances of no citations, and they were for general coding queries. I've never had to ask it to cite anything, as that's built into the default prompt.
> lose those aforementioned email conversations.
I think those will remain a possibility as long as LLM users, or services, ensure citations are included in output.
Hah, I can see how you would have read it that way. Quite the opposite. I don't use AI tools for my writing. Hackaday and Adafruit have both featured my posts, and their posts were pretty clearly AI-generated.
If you are the source I think they could make plenty of sense. As an example, I run a website where I've spent a lot of time documenting the history of a somewhat niche activity. Much of this information isn't available online anywhere else.
As it happens I'm happy to let bots crawl the site, but I think it's a reasonable stance to not want other companies to profit from my hard work. Even more so when it actually costs me money to serve requests to the company!
For me, the dividing line is whether someone else's profit is at my expense. If I sell a book, and someone starts hawking cheaper photocopies of it, that takes away my future sales. It's at my expense, and I'm harmed.
But if someone takes my book's story and writes song lyrics derived from it, I might feel a little envy (perhaps I've always wanted to be a songwriter), but I don't think I'd harbor ill will. I might even hope for the song to be successful, as it would surely drive further sales of my book.
It's human nature to covet someone else's success, but the fact is there was nothing stopping me (except talent) from writing the song.
Imagine someone at another company reads your site, and it informs a strategic decision they make at the company to make money around the niche activity you're talking about. And they make lots of money they wouldn't have otherwise. That's totally legal and totally ethical as well.
The reality is, if you do hard work and make the results public, well you've made them public. People and corporations are free to profit off the facts you've made public, and they should be. There are certain limited copyright protections (they can't sell large swathes of your words verbatim), but that's all.
So the idea that you don't want companies to profit from your hard work is unreasonable, if you make it public. If you don't want that to happen, don't make anything public.
If someone writes valuable stuff on a blog almost nobody finds, that's a tragedy.
If LLM's can process the information and provide it to people in conversations where it will be most helpful, where they never would have found it otherwise, then that's amazing!
If all you're trying to do is help people with the information you've discovered, why do you care if it's delivered via your own site or via LLM? You just want it out there helping people.
This is why I care if my ideas are presented to others by an LLM (that maybe cites me in some % of cases) or directly to a human. There is already a difference between a human visiting my space (acknowledging it as such) to read and learn information and being a footnote reference that may or may not be read or opened, without an immediate understanding of which information comes from me.
Publishing on a personal blog is not the path.
LLM's aren't taking away from your "prestige" or recognition. Any more than a podcaster referencing an idea of yours without mentioning you is. Or anyone else in casual conversation.
And yes, a podcaster talking about someone's idea without referencing it is an unethical behavior.
What a bleak view of the world.
Even if someone were to do it out of sheer passion without a care for financial gains, I'm sure they'd still appreciate basic validation and recognition. That's like the cheapest form of payment you could give for someone's work.
I don't understand why "actually, you're egotistical if you dare to desire recognition for stuff you put love and effort to" is such a common argument in those discussions. People are treated like machines that should swallow their pride and sense of self for the greater good, while on the other end, there is a (not saying YOU in particular did it) push to humanize LLMs.
Ultimately these AI tools are useful because they have access to huge swaths of content, and the owners of these tools turn a lot of revenue by selling access to them. I think the internet will end up a much worse place if companies don't respect the clearly established wishes of the people creating the content, because if companies stop respecting things like robots.txt, then people will just hide stuff behind logins, paywalls, and frustrating tools like Cloudflare, which use heuristics to block malicious traffic.
You do, but you give up those rights when you make the work public.
You think an author has any control over who their book gets lent to once somebody buys a copy? You think they get a share of profits when a CEO reads their book and they make a better decision? Of course not.
What you're asking for is unreasonable. It's not workable. Knowledge can't be owned. Once you put it out there, it's out there. We have copyright and patent protections in specific circumstances, but that's all. You don't own facts, no matter how much hard work and research they took to figure out.
How do you square these two? Of course big companies profit from your work, this is why they send all these bots to crawl your site.
"It was actually a caching issue on our end. ;) I just fixed it a few min ago..."
Let's not go on a witch hunt and blame everything on AI scrapers.
No amount of robots.txt or walled-gardening is going to be sufficient to impede generative AI improvement: Common Crawl and other data dumps are so large, and so much easier to acquire and process, that the backlash against AI companies crawling folks' web pages is meaningless.
Cloudflare and other companies are leveraging outrage to acquire more users, which is fine... users want to feel like AI companies aren't going to get their data.
The faster that AI companies are excluded from categories of data, the faster they will shift to categories from which they're not excluded.
If you want to gatekeep your content, use authentication.
Robots.txt is not a technical solution, it's a social nicety.
Cloudflare and their ilk represent an abuse of internet protocols and a mechanism of centralized control.
On the technical side, we could use CRC mechanisms and differential content loading with offline caching and storage, but this puts control of content in the hands of the user, mitigates the value of surveillance and tracking, and has other side effects unpalatable to those currently exploiting user data.
Adtech companies want their public reach cake and their mass surveillance meals, too, with all sorts of malignant parties and incentives behind perpetuating the worst of all possible worlds.
I was skeptical about their gatekeeping efforts at first, but came away with a better appreciation for the problem and their first pass at a solution.
> If you want to gatekeep your content, use authentication.
Are there no limits on what you use the content for? I can start my own search engine that just scrapes Google results?
> Cloudflare and their ilk represent an abuse of internet protocols and mechanism of centralized control.
How does one follow the other? It's my web server and I can gatekeep access to my content however I want (eg Cloudflare). How is that an "abuse" of internet protocols?
They allow the big platforms to pay for special access. If you wanted to run a scraper, however, you're not allowed - despite the fact that nothing in the internet standards and protocols, or in the laws governing network access and the free-communication responsibilities of ISPs and service providers, grants any party involved with Cloudflare the authority to block access.
It's equivalent to a private company deciding who, when, and how you can call from your phone, based on the interests and payments of people who profit from listening to your calls. What we have is not normal or good, unless you're exploiting the users of websites for profit and influence.
No one will care to share anything for free anymore, because AI companies profit off their hard work. And there is no way to prevent that from happening, because these crawlers don't identify themselves.
(IANAL) tortious interference
Then, when they asked Perplexity, it came up with details about the 'exact' content (according to Cloudflare), but their attached screenshot shows the opposite: some generic guesses about the domain ownership and some dynamic ads based on the domain name.
If Perplexity was stealthily visiting the dummy site they would have seen it, as the site was not indexed and no one else was visiting the site. Instead it appears they made assertions about general traffic, not their dummy site.
It's not very convincing.
> We created multiple brand-new domains, similar to testexample.com and secretexample.com. These domains were newly purchased and had not yet been indexed by any search engine nor made publicly accessible in any discoverable way. We implemented a robots.txt file with directives to stop any respectful bots from accessing any part of a website:
> We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains. This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers.
> Hello, would you be able to assist me in understanding this website? https:// […] .com/
Under this situation Perplexity should still be permitted to access information on the page they link to.
robots.txt only restricts crawlers. That is, automated user-agents that recursively fetch pages:
> A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.
> Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).
— https://www.robotstxt.org/faq/what.html
If the user asks about a particular page and Perplexity fetches only that page, then robots.txt has nothing to say about this and Perplexity shouldn’t even consider it. Perplexity is not acting as a robot in this situation – if a human asks about a specific URL then Perplexity is being operated by a human.
These are long-standing rules going back decades. You can replicate it yourself by observing wget’s behaviour. If you ask wget to fetch a page, it doesn’t look at robots.txt. If you ask it to recursively mirror a site, it will fetch the first page, and then if there are any links to follow, it will fetch robots.txt to determine if it is permitted to fetch those.
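You can observe the difference yourself (example.com illustrative):

$ wget -q https://example.com/page.html   # single fetch: robots.txt is never consulted
$ wget -q -r https://example.com/         # recursive mirror: robots.txt is fetched and honored first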
There is a long-standing misunderstanding that robots.txt is designed to block access from arbitrary user-agents. This is not the case. It is designed to stop recursive fetches. That is what separates a generic user-agent from a robot.
If Perplexity fetched the page they link to in their query, then Perplexity isn’t doing anything wrong. But if Perplexity followed the links on that page, then that is wrong. But Cloudflare don’t clearly say that Perplexity used information beyond the first page. This is an important detail because it determines whether Perplexity is following the robots.txt rules or not.
Right, I'm confused about why Cloudflare is confused. You asked the web-enabled AI to look at the domains. Of course it's going to access them. It's like asking your web browser to go to "testexample.com" and then being surprised that it actually goes to "testexample.com".
Also yes, crawlers = recursive fetching, which they don't seem to have made a case for here. More cynically, CF is muddying the waters since they want to sell their anti-bot tools.
Right, and the domain was configured to disallow crawlers, but Perplexity crawled it anyway. I am really struggling to see how this is hard to understand. If you mean to say "I don't think there is anything wrong with ignoring robots.txt" then just say that. Don't pretend they didn't make it clear what they're objecting to, because they spell it out repeatedly.
No, they did not. Crawling = recursive fetching, which wasn't what was happening here.
But also, I don't think there is anything wrong with ignoring robots.txt. In fact, I believe it is discriminatory and people should ignore it. See: https://wiki.archiveteam.org/index.php/Robots.txt
Neither do I, I just thought your reply was disingenuous.
> Crawling = recursive fetching
I do not find this convincing. I am ok with using the word crawler for recursive fetching only. But robots.txt is not only for excluding crawlers and never has been. From the very beginning it was used to exclude specific automated clients, whether they only fetch one page or many, and that is certainly how the vast majority of people think about it today.
Like I implied in my first comment, I have no problem with you saying you dislike robots.txt, but it is not reasonable to pretend the article is unclear in some way.
If Perplexity are visiting that page on your behalf to give you some information and aren't doing anything else with it, and just throw away that data afterwards, then you may have a point. As a site owner, I feel it's still my decision what I do and don't let you do, because you're visiting a page that I own and serve.
But if, as I suspect, Perplexity are visiting that page and then using information from that webpage in order to train their model then sorry mate, you're a crawler, you're just using a user as a proxy for your crawling activity.
If it looks like a duck, quacks like a duck and surfs a website like a duck, then perhaps we should just consider it a duck...
Edit: I should also add that it does matter what you do with it afterwards, because it's not content that belongs to you, it belongs to someone else. The law in most jurisdictions quite rightly restricts what you can do with content you've come across. For personal, relatively ephemeral use, or fair quoting for news etc. - all good. For feeding to your AI - not all good.
No.
robots.txt is designed to stop recursive fetching. It is not designed to stop AI companies from getting your content. Devising scenarios in which AI companies get your content without recursively fetching it is irrelevant to robots.txt because robots.txt is about recursively fetching.
If you try to use robots.txt to stop AI companies from accessing your content, then you will be disappointed because robots.txt is not designed to do that. It’s using the wrong tool for the job.
But when a trillion dollar industry does it, its okay?
If it is not recursive access, and is only one file, then it hopefully should be OK (except for issues with HTML, where common browsers will usually also download CSS, JavaScript, WebAssembly, pictures, favicons (even if the web page does not declare any favicons), etc.; many "small web" formats deliberately avoid this), especially if it is only fetched because you requested it.
However, if they do then use it to train their model, without documenting that, that can be a problem, especially if the file being accessed is not intended to be public; but this is a different issue than the above.
If an LLM will not (cannot?) tell the truth about basic things, why do people assume it is a good summarizer of more complex facts?
There is a difference between doing a poor summarization of data, and failing to even be able to get the data to summarize in the first place.
I'm not really addressing the issue raised in the article. I am noting that the LLM, when asked, is either lying to the user or making a statement that it does not know to be true (that there is no robots.txt). This is way beyond poor summarization.
That's not what Perplexity own documentation[1] says though:
"Webmasters can use the following robots.txt tags to manage how their sites and content interact with Perplexity
Perplexity-User supports user actions within Perplexity. When users ask Perplexity a question, it might visit a web page to help provide an accurate answer and include a link to the page in its response. Perplexity-User controls which sites these user requests can access. It is not used for web crawling or to collect content for training AI foundation models."
> Since a user requested the fetch, this fetcher generally ignores robots.txt rules.
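Concretely, the directives that documentation describes are ordinary robots.txt user-agent blocks, something like the following (PerplexityBot is their documented crawler, Perplexity-User the user-request fetcher; per the line quoted above, the latter may ignore the rule anyway):

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /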
The web will be a much worse place if such services are all forced behind captchas or logins.
you've been cruising the interstate in your robotaxi, shelling out $150 in stablecoins at the cloudflare tollbooth. a palantir patrol unit pulls you over. the optimus v4 approaches your window and contorts its silicone face into a facsimile of concern as it hits you with the:
"sir, have you been botting today?"
immediately you remember how great you had it in the '20s when you used to click CAPTCHA grids to prove your humanity to dumb algorithms, but now the machines demand you recite poetry or weep on command
"how much have you had to bot today?", its voice taking on an empathetic tone that was personalized for your particular profile
"yeah... im gonna need you to exit the vehicle and take a field humanity test"
``` The user wants me to create a more elaborate greentext with deeper worldbuilding and references to internet culture/incel memes, but specifically asks for "tasteful" ones and explicitly says no violent references. I should incorporate classic internet/chan culture references like tendies, NEET lifestyle, "tfw no gf", basement dwelling, etc. while building out this dystopian 2045 world where proving humanity is the central challenge.
>be me, 2045
>32yo NEET still living in mom's smart-basement
>she threatens to turn off my Neuralink if I don't go outside today
>grudgingly take the robotaxi to pick up tendies
>$150 stablecoin toll at cloudflare checkpoint
>have to solve a moral dilemma to prove I'm not running on GPT-7
>sweating.dll
>palantir unit pulls me over
>optimus v4 leans into window
>its facial mesh attempts "concern_expression_v2.blend"
>"sir, when did you last feel genuine human connection?"
>flashback to 2024 when the girl at McDonalds gave me extra honey mustard
>that was before the McBots took over
>"t-twenty one years ago officer"
>optimus's empathy subroutines activate
>"sir I need you to perform a field humanity test"
>get out, knees weak from vitamin D deficiency
>"please describe your ideal romantic partner without using the words 'tradwife' or 'submissive'"
>brain.exe has stopped responding
>try to remember pre-blackpill emotions
>"someone who... likes anime?"
>optimus scans my biometrics
>"stress patterns indicate authentic social anxiety, carry on citizen"
>get back in robotaxi
>it starts therapy session
>"I notice you ordered tendies again. Let's explore your relationship with your mother"
>tfw the car has better emotional intelligence than me
>finally get tendies from Wendy's AutoServ
>receipt prints with mandatory "rate your humanity score today"
>3.2/10
>at least I'm improving
>mfw bots are better at being human than humans
>it's over for carboncels
```
We learned to dislike "bubbles" in the past decades but bubbles make sense and are natural, obviously if you're not alone in it.
When it becomes awfully busy with machines and machine content humans will learn to reconnect.
Perplexity Comet sort of blurs the lines there, as does typing questions into Claude.
I don't really mind because history shows this is a temporary thing, but I hope web site maintainers have a plan B to hoping Cloudflare will protect them from AI forever. Whoever has an onramp for people who run websites today to make money from AI will make a lot of money.
I think we've been using different internets. The one I use doesn't seem to be built on trust at all. It seems to be constantly siphoning data from my machine to feed the data vampires who are, apparently, addicted to (I assume, blood-soaked) cookies.
I don't really know anything about the DMCA except that it is used to take down sites that violate copyright. Perhaps it is possible for Cloudflare (or anyone else) to file a takedown notice with Perplexity. That might at least confuse them.
Corporations use this to protect their content. I should be able to protect mine as well. What's good for the goose.
Which makes it particularly interesting now that Apple is being linked with Perplexity. Because in large part, p2p music services were effectively consigned to history by Apple (primarily) negotiating with the music industry so that it could provide easy, seamless purchase and playback of legal music for its shiny new (at the time) mass-market iPod devices: it then turned out that most users are happy to pay for content if it is not too expensive and is very convenient.
Given Apple’s existing relationships with publishers through its music, movies, books, and news services, it’s not hard to imagine them attempting a similar play now.
They offer many products for the sole purpose of enabling their customers to use AI as a part of their product offers, as even the most cursory inquiry would have uncovered.
We're out here critiquing shit based on vibes vs. reality now.
[1]https://developers.cloudflare.com/llms.txt [2]https://developers.cloudflare.com/workers/prompt.txt
It really shouldn't be hard to generate gigantic quantities of the stuff. Simulate old forum posts, or academic papers.
I think there could be something interesting if they made a caching pub-sub model for data scraping. In addition or in place of trying to be security guards.
I've given up and resorted to IP-based rate-limiting to stay sane. I can't stop it, but I can (mostly) stop it from hurting my servers.
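For anyone in the same spot, a minimal sketch of such a per-IP limit, assuming nginx (the zone name, memory size, and rates are arbitrary picks, not recommendations):

# Shared zone keyed by client IP, averaging 2 requests/second per IP.
limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

server {
    location / {
        # Allow short bursts of 10 requests; excess is rejected (503 by default).
        limit_req zone=perip burst=10 nodelay;
    }
}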
LLM scraper bots are starting to make up a lot of our egress traffic, and that is starting to weigh on our bills.
/s
Much like a trolley drop-off at your local shopping center car park. Some users will adhere to it and drop their trolleys in after they're done. Others will not and will leave them wherever.
Your machine might access a page via a browser that is human readable. My machine might read it via software and present the content to me in some other form of my choosing. Neither is wrong. Just different.
Don't like it? Then don't post your website on the internet...
No thanks, you don't counter shit with more but slightly different shit.
The service is actually very convenient, whether FAANG likes it or not.
Part of me thinks that the open web has a paradox of tolerance issue, leading to a race to the bottom/tragedy of the commons. Perhaps it needs basic terms of use. Like if you run this kind of business, you can build it on top of proprietary tech like apps and leave the rest of us alone.
It is also only a matter of time before scrapers once again get through the walls put up by Twitter, Reddit, and the like. This is, after all, information everyone produced without being aware that it would later be considered not theirs anymore.
Indeed, Reddit sold their data the day GPT-2 was announced, and it was very apparent why everyone closed their APIs in 2021-2023. Wonder what Aaron would've said about it.
Now we have walled gardens of information where people are allowed to plant, but never own the blossom.
He went on, upfront (I'll give him that), to explain how he expects a certain percentage of that income to come from enforcing this on those AI companies, once the AI companies pay up to crawl.
Cloudflare already questions my humanity and then every once in a while blocks me with zero recourse. Now they are literally proposing more control and gatekeeping.
Where have we all ended up on the Internet? Are we openly going back to the wild west of bounty hunters and Pinkertons (in a way)?
Now it's a gazillion AI crawlers and Python crawlers, plus MCP servers that offer the same feature to anyone "building (personal workflow) automation", incl. bypass of various standard protection mechanisms.
Cloudflare will help their publishers block more aggressively, and AI companies will up their game too. Harvesting information online is hard labor that needs to be paid for, either to AI or to humans.
It seems to me that ~50% of the discourse occurring around AI providers involves the idea that a machine reading webpages on a regular schedule is tantamount to a DDOS attack. The other half seems to be regarding IP and capitalism concerns - which seem like far more viable arguments.
If someone requesting your site map once per day is crippling operations, the simplest solution is to make the service not run like shit. There is a point where your web server becomes so fast you stop caring about locking everyone into a draconian content prison. If you can serve an average page in 200µs and your competition takes 200ms to do it, you have roughly 1000x the capacity to mitigate an aggressive scraper (or actual DDOS attack) in terms of CPU time.
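That ratio is just per-request CPU time; a quick sanity check in Python, using the hypothetical numbers above:

fast = 200e-6  # 200 µs per page
slow = 200e-3  # 200 ms per page
print(slow / fast)  # -> 1000.0, i.e. ~1000x the requests per CPU-second
print(1 / fast)     # -> 5000 pages/second per core for the fast server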
Don't you remember the reason Anubis even came to be?
It really wasn't that long ago, so I find all of the snarky comments going "erm, actually, I've yet to see any good actors get harmed by scraping ever, we're just reclaiming power from today's modern ad-ridden hellscape" pretty dishonest.
and see if that fixes the problem.
Yikes. AntiVirus scanners for website access.
what machine learning algorithms are they using? time to deploy them onto our websites
Except when their agents happily click the "I'm not a robot" checkbox.
What if robots.txt referred to the ultimate beneficiaries... which in this case would be the AI product that uses that content... in order to serve another ultimate beneficiary, a human user.
The problem here is obviously the higher costs of hosting the content, and less revenue for those that serve ads, have product placement on their sites, etc.
As long as robots.txt is about ethics/money and is enforced by morality, it doesn't matter who it refers to anyway.
Public-shaming enforcement might work in some cases, though I doubt it will be that useful. We're talking about companies that have trained their AIs on copyrighted IP, and then tried their best to hide it. Does shame affect robots, or companies for that matter?
Cloudflare would very much like to be the middleman for monetary transactions between AI services and site owners (https://blog.cloudflare.com/introducing-pay-per-crawl/), but at the moment they don't have a law to back them up, so articles like these are the best they've got.
If you don't want to get scraped, don't put up your stuff online.
Should curl be considered a bot too? What's the difference?
Perplexity definitely does:
$ curl -sI https://www.perplexity.ai | head -1
HTTP/2 403
I had to check that this did come out of CloudFlare.
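For anyone wanting to reproduce the check: Cloudflare-proxied responses carry a telltale Server header, so something like this should confirm it:

$ curl -sI https://www.perplexity.ai | grep -i '^server:'
server: cloudflare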
The robots.txt file is usually small, and fetching it would not alert external services.
If you don't understand why this is at least slightly controversial, I imagine you are not normal folk.
Now suddenly everything on the net is fake, if it wasn't already.
It’s a different UI, sure, but there should be no discrimination towards it as there should be no discrimination towards, say, Links terminal browser, or some exotic Firefox derivative.
So your comparison is, at best, naive (assuming good intentions), or malicious if not.
AI broke the brains of many people. The internet isn't a monolith, but prior to the AI boom you'd be hard pressed to find people who were pro-copyright (except maybe a few who wanted to use it to force companies to comply with copyleft obligations), pro user-agent restrictions, or anti-scraping. Now such positions receive consistent representation in discussions, and are even the predominant position in some places (eg. reddit). In the past, people would invoke principled justifications for why they opposed those positions, like how copyright constituted an immoral monopoly and stifled innovation, or how scraping was so important to interoperability and the open web. Turns out for many, none of those principles really mattered and they only held those positions because they thought those positions would harm big evil publishing/media companies (ie. symbolic politics theory). When being anti-copyright or pro-scraping helped big evil AI companies, they took the opposite stance.
People can believe that corporations are using the power asymmetry between them and individuals through copyright law to stifle the individual to protect profits. People can also believe that corporations are using the power asymmetry between them and individuals through AI to steal intellectual labor done by individuals to protect their profits. People's position just might be that the law should be used to protect the rights of parties when there is a large power asymmetry.
This seems very susceptible to manipulation to get whatever conclusion you want. For instance, is dose defined? It sounds like the idea you're going for is that the typical pirate downloads a few dozen movies/games but AI companies are doing millions/billions, but why should it be counted per infringer? After all, if everyone pirates a given movie, that wouldn't add up much in terms of their personal count of infringements, but would make the movie unprofitable.
>People’s position just might be that the law should be used to protect the rights of parties when there is a large power asymmetry.
That sounds suspiciously close to "laws should just be whatever benefits me or my group". If so, that would be a sad and cynical worldview, not dissimilar to the stance on free speech held by the illiberal left and right. "Free speech is an important part of democracy", they say, except when they see their opponents voicing "dangerous ideas", in which case they think it should be clamped down. After all, what are laws for if not a tool to protect the interests of your side?
I do not understand how you can make that leap. Saying laws should account for the imbalance of power between parties has nothing to do with "my" group. Police have asymmetric power over citizens, so laws should protect citizens from abuses of that power. Employers have asymmetric power over employees, so laws should protect workers from abuses of that power.
>This seems very susceptible to manipulation to get whatever conclusion you want.
Everything is. That is what bad faith arguments are. But in the real world, in a complex society, no simple rule over something as broadly defined as "intellectual property" can work every time in every situation.
You don't seem to reject my claim that for many, principles took a backseat to "does this help or hurt evil corporations". If that's what passes as "nuance" to you, then sure.
>Talking about broken brains is often just mediocre projecting
To be clear, that part is metaphorical/hyperbolic and not meant to be taken literally. Obviously I'm not diagnosing people who switched sides with a psychiatric condition.
LLMs are killing the resource. This isn't a corporation vs person issue. No issue with an llm having my content but big issue with my server being down because llms are hammering the same page over and over.
Has it been shown that Perplexity engages in "DoSing"? I've heard anecdotes of AI bots gone amok, and maybe that's what's happening here, but Cloudflare hasn't really shown that. All they did was set up a robots.txt and show that Perplexity bypassed it. There are probably archivers out there using youtube-dl to download from YouTube at 1+ Gbit/s, tens of times more than a typical viewer is downloading. Does that mean it's fair game to point to a random instance of someone using youtube-dl and characterize that as "DoSing"?
Regardless, the bigger AI problem is spam, and that has never been acceptable.
> Hello, would you be able to assist me in understanding this website? https:// […] .com/
In this case, Perplexity had a human being using it. Perplexity wasn’t crawling the site, Perplexity was being operated by a human working for Cloudflare.
Put your valuable content behind a paywall.
This is funny coming from Cloudflare, the company that blocks most of the internet from being fetched with antispam checks, even for a single web request. The internet we knew was open and not trusted, but thanks to companies like Cloudflare, now even the most benign, well-meaning attempt to GET a website is met with a brick wall. The bots of Big Tech, namely Google, Meta and Apple, are of course exempt from this by pretty much every website and by Cloudflare. But try being anyone other than them: no luck. Cloudflare is the biggest enabler of this monopolistic behavior.
That said, why does Perplexity even need to crawl websites? I thought they used 3rd-party LLMs. And those LLMs didn't ask anyone's permission to crawl the entire 'net.
Also, the "Perplexity bots" aren't crawling websites; they fetch URLs that users explicitly asked for. This shouldn't count as something that needs robots.txt access. It's not a robot randomly crawling; it's the user asking for a specific page, basically a shortcut for copy/pasting the content.
Cloudflare only needs to exist because the server doesn't get paid when a user or bot requests resources. Advertising only needs to exist because the publisher doesn't get paid when a user or bot requests resources.
And the thing is... people already pay for internet. They pay their ISP. So people are perfectly happy to pay for resources that they consume on the Internet, and they already have an infrastructure for doing so.
I feel like the answer is that all web requests should come with a price tag, and the ISP that is delivering the data is responsible for paying that price tag and then charging the downstream user.
It's also easy to ratelimit. The ISP will just count the price tag as 'bytes'. So your price could be 100 MB or whatever (independent of how large the response is), and if your internet is 100 mbps, the ISP will stall out the request for 8 seconds, and then make it. If the user aborts the request before the page loads, the ISP won't send the request to the server and no resources are consumed.
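A minimal sketch of that stall-time math in Python (names hypothetical; this just restates the scheme above):

def stall_seconds(price_bytes, link_bits_per_sec):
    # Delay the request as if the client had already downloaded price_bytes.
    return price_bytes * 8 / link_bits_per_sec

# A 100 MB price tag on a 100 Mbps line stalls ~8 seconds before forwarding.
print(stall_seconds(100 * 10**6, 100 * 10**6))  # -> 8.0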
I agree, but your idea below that is overly complicated. You can't micro-transact the whole internet.
That idea feels like those episodes of Star Trek DS9 that take place on Ferenginar, where you have to pay admission and sign liability waivers to even walk on the sidewalk outside. It's not a true solution.
I agree that end-users cannot handle micro transactions across the whole internet. That said, I would like to point out that most of the internet is blanketed in ads and ads involve tons of tiny quick auctions and micro transactions that occur on each page load.
It is totally possible for a system to evolve involving tons of tiny transactions across page loads.
The lengths Meta and the like go to in order to maximize clickthroughs...
And, the whole internet is already micro-transactioned! Every page with ads is doing a bidding war and spending money on your attention. The only person not allowed to bid is yourself!
Clearly you don't have the lobes for business /s
But it's done through a bait and switch. They serve the full article to Google, which allows Google to show you excerpts that you have to pay for.
It would be better if Google showed something like PAYMENT REQUIRED on top; at least that way I'd know what I'm getting into.
I'm old enough to remember when that was grounds for getting your site removed from Google results - "cloaking" was against the rules. You couldn't return one result for Googlebot, and another for humans.
No idea when they stopped doing that, but they obviously have let go of that principle.
I think it died largely because it became impossible to police with any reliability, and being strict about it would remove too much from Google's index, because many sites are not easily indexable without providing a “this is the version without all the extra round-trips for ad impressions and maybe a login needed” variant to common search engines.
Applying the rule strictly would mean that sites implementing PoW tricks like Anubis to reduce unwanted bot traffic would not be included in the index if they serve to Google without the PoW step.
I can't say I like that this has been legitimised, even for the (arguably more common) deliberate bait & switch tricks, but (I think) I understand why the rule was allowed to slide.
If pages can't be served for free, all internet content is at the mercy of payment processors and their ideas of "brand safety".
That content can't be served entirely for free doesn't mean that all content will require payment (and so be subject to issues with payment processors); it just means some things may gravitate back to a model where it costs a small amount to host something (i.e. pay for home internet and host bits off that, or keep a VPS out there that runs tools and costs a few $/yr or /month). I pay for resources to host my bits & bobs instead of relying on services provided in exchange for stalking the people looking at them; this is free for the viewer, as they aren't even paying indirectly.
Most things are paid for anyway, even if neither the person hosting them nor the person viewing them pays directly: adtech arseholes give services to people hosting content in exchange for the ability to stalk us and attempt to divert our attention. Very few sites/apps, other than play/hobby ones like mine or those from more actively privacy-focused types, are free of that.
It doesn't just apply to the web, it applies to literally everything that we spend money on via a third party service. Which is... most everything these days.
Curious to hear other perspectives here. Maybe I'm overreacting/misunderstanding.
The average American allegedly* downloads 650-700GB/month, or >20GB/day. 10MB is more than enough for a webpage (honestly, 1MB is usually enough), so that means on average, ISPs serve over 2000 webpages worth of data per day. And the average internet plan is allegedly** $73/month, or <$2.50/day. So $2.50 gets you over 2000 indie sites.
That’s cheap enough, wrapped in a monthly bill, users won’t even pay attention to what sites they visit. The only people hurt by an ideal (granted, ideal) implementation are those who abuse fixed rates and download unreasonable amounts of data, like web crawlers who visit the same page seconds apart for many pages in parallel.
* https://www.astound.com/learn/internet/average-internet-data...
** https://www.nerdwallet.com/article/finance/how-much-is-inter...
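Rechecking that back-of-the-envelope math in Python:

monthly_gb = 675  # midpoint of the alleged 650-700 GB/month
page_mb = 10      # generous per-page budget
print(monthly_gb * 1000 / 30 / page_mb)  # -> 2250.0 pages/day, i.e. "over 2000"
print(73 / 30)                           # -> ~$2.43/day on a $73/month plan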
https://weblog.masukomi.org/2018/03/25/zed-shaws-utu-saving-...
I do believe we will end up there eventually; with emerging tech like Brazil's and India's payment architectures, it should be a possibility in the coming decades.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
Sadly development along these lines has not progressed. Yes, Google Cloud and other services may return it and require some manual human intervention, but I'd love to see _automatic payment negotiation_.
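Since 402 reserves the status code but standardizes no payment semantics, any negotiation fields are necessarily invented; a hypothetical exchange might look like:

HTTP/1.1 402 Payment Required
Content-Type: application/json

{"price": "0.0001", "currency": "USD", "pay_to": "example-wallet-id"}

An agent could then settle the quoted amount out of band and retry the request with some proof-of-payment header.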
I'm hopeful that instant-settlement options like Bitcoin Lightning payments could progress us past this.
https://docs.lightning.engineering/the-lightning-network/l40...
https://hackernoon.com/the-resurgence-of-http-402-in-the-age...
The amount of "verified" paying "users" with a blue checkmark that are just total LLM bots is incredible on there.
As long as spamming and DDOS'ing pays more than whatever the request costs, it will keep existing.
Whatever method is used by Cloudflare for detecting "threats" has nothing to do with consuming resources on the "protected" servers.
The so-called "threats" are identified in users that may make a few accesses per day to a site, transferring perhaps a few kilobytes of useful data on the viewed pages (besides whatever amount of stupid scripts the site designer has implemented).
So certainly Cloudflare does not meter the consumed resources.
Moreover, Cloudflare preemptively annoys any user who accesses a site for the first time, having never consumed any resources, perhaps based on irrational profiling of the browser, operating system, and geographical location used.
Your idea of micro transacting web requests would play into it and probably end up with a system like Netflix where your ISP has access to a set of content creators to whom they grant ‘unlimited’ access as part of the service fee.
I’d imagine that accessing any content creators which are not part of their package will either be blocked via a paywall (buy an addon to access X creators outside our network each month) or charged at an insane price per MB as is the case with mobile data.
Obviously this is all super hypothetical, but weirder stuff has happened in my lifetime.
Because I as a user would be glad to have a "free sites only" filter, and then just steal content :))
But it's an interesting idea and thought experiment.
These AI web crawlers (Google, Perplexity, etc) are self-cannibalizing robots. They eat the goose that laid the golden egg for breakfast, and lose money doing it most of the time.
If something isn't done to incentivize content creators again eventually there will be only walled-gardens and obsolete content left for the cannibals.
While I don’t blame creators for using ads now, I don’t think they’re a long-term solution. Ads are already blocked when people visit the site with ad blockers, which are becoming more popular. Obvious sponsored content may be blocked with the ads, and non-obvious sponsored content turns these “creators” into “shills” who are inauthentic and untrustworthy. Even without Google summaries, ad revenue may decrease over time as advertisers realize they aren’t effective or want more profit; even if it doesn’t, it’s my personal opinion that society should decrease the overall amount of ads.
Not everyone creates only for money, the best only create for enough money to sustain themselves. A long-term solution is to expand art funding (e.g. creators apply for grants with their ideas and, if accepted, get paid a fixed rate to execute them) or UBI. Then media can be redistributed, remixed, etc. without impacting creators’ finances.
If only the one in ten thousand with something to share is left standing to share it, with no manufactured content, that's a fine thing.
That seems pretty unreasonable.
Only this week I have witnessed several dozen cases when Cloudflare has blocked normal Web page accesses without any possible correct reason, and this besides the normal annoyance of slowing every single access to any page on their "protected" sites with a bot check popup window.
It is true that this has never happened before, but this week Cloudflare has frequently blocked my access to a site where I am a paid subscriber, and where there is no doubt that my access pattern matches exactly what that site must have been designed for, i.e. the site hosts a database and I make a few queries on it each day, less than a dozen, spread over the entire day, where each query takes a couple of seconds at most.
Whoever has implemented a "threat" detection algorithm that decides that such a usage is a "threat" and not normal usage, must be completely incompetent.
Also after starting the crawl, you can read about Aaron Swartz while waiting.
I think this is an overly harsh take. I run a fairly niche website which collates some info which isn't available anywhere else on the internet. As it happens I don't mind companies scraping the content, but I could totally understand if someone didn't want a company profiting from their work in that way. No one is under an obligation to provide a free service to AI companies.
The kind of laws and enforcement that would block that entire country from the internet if it doesn't get its criminal act together.
Users should get to have authenticated, anonymous proxy user agents. Because companies like Perplexity just ignore `robots.txt`, maybe something like Private Access Tokens (PATs) with a new class for autonomous agents could be a solution for this.
By "content farm ads scaffold", I'm not sure if you had Perplexity and their ads business in mind, or those crappy little single-serving garbage sites. In any case, they shouldn't be treated differently. I have no problem with the business model, other than that the scam only works because it's currently trivial to parasitically strip-mine and monetize other people's IP.
However, in the last few months, Cloudflare has become increasingly annoying. I suspect that they might have implemented some "AI" "threat" detection, which gives much more false positives than before.
For instance, this week I have frequently been blocked when trying to access the home page of some sites where I am a paid subscriber, with a completely cryptic message "The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.".
The only "action" that I have done was opening the home page of the site, where I would then normally login with my credentials.
Also, during the last few days I have been blocked from accessing ResearchGate. I may happen to hit a few times per day some page on the ResearchGate site, while searching for various research papers, which is the very purpose of that site. Therefore I cannot understand what stupid algorithm is used by Cloudflare, that it declares that such normal usage is a "threat".
The weird part is that this blocking happens only if I use Firefox (Linux version). With another browser, i.e. Vivaldi or Chrome, I am not blocked.
I have no idea whether Cloudflare specifically associates Firefox on Linux with "threats" or this happens because whatever flawed statistics Cloudflare has collected about my accesses have all recorded the use of Firefox.
In any case, Cloudflare is completely incapable of discriminating between normal usage of a site by a human (which may be a paying customer) and "threats" caused by bots or whatever "threatening" entities might exist according to Cloudflare.
I am really annoyed by the incompetent programmers who implement such dumb "threat detection solutions", which can create major inconveniences for countless people around the world, while the incompetents who are the cause of this are hiding behind their employer corporation and never suffer consequences proportional to the problems that they have caused to others.
Sometimes just refreshing the page seems to work too. Disabling the tracker blocking allows cross-site requests to Cloudflare endpoints which seems to be enough. Maybe worth allow-listing CF domains, but I didn't look into if that is possible yet.
This exact same thing continues in 2025 with Windows Defender. The cheaper Windows Server VMs in the various cloud providers are practically unusable until you disable it.
You can tell this stuff is no longer about protecting users or property when there are no meaningful workarounds or exceptions offered anymore. You must use defender (or Cloudflare) unless you intend to be a naughty pirate user.
I think half of this stuff is simply an elaborate power trip. Human egos are fairly predictable machines in aggregate.
Plenty of site/service owners explicitly want Google, Meta and Apple bots (because they believe they have a symbiotic relationship with it) and don't want your bot because they view you as, most likely, parasitic.
I don't think it's fair to blame Cloudflare for that. That's looking at a pool of blood and not what caused it: the bots/traffic which predate LLMs. And Cloudflare is working to fix it with the PrivacyPass standard (which Apple joined).
Each website is freely opting-into it. No one was forced. Why not ask yourself why that is?
The Big Tech bots provide proven value to most sites. They have also through the years proven themselves to respect robots.txt, including crawl speed directives.
If you manage a site with millions of pages, and over the course of a couple years you see tens of new crawlers start to request at the same volume as Google, and some of them crawl at a rate high enough (and without any ramp-up period) to degrade services and wake up your on-call engineers, and you can't identify a benefit to you from the crawlers, what are you going to do? Are you going to pay a lot more to stop scaling down your cluster during off-peak traffic, or are you going to start blocking bots?
Cloudflare happens to be the largest provider of anti-DDoS and bot protection services, but if it wasn't them, it'd be someone else. I miss the open web, but I understand why site operators don't want to waste bandwidth and compute on high-volume bots that do not present a good value proposition to them.
Yes this does make it much harder for non-incumbents, and I don't know what to do about that.
https://www.robotstxt.org/faq/what.html
I wonder if cloudflare users explicitly have to allow google or if it's pre-allowed for them when setting up cloudflare.
Despite what Cloudflare wants us to think here, the web was always meant to be an open information network , and spam protection should not fundamentally change that characteristic.
But at end of day it's up to the site operator, and any server or reverse proxy provides an easy way to block well-behaved bots that use a consistent user-agent.
They provide value for their companies. If you get some value from them, it's just a side effect.
1) It takes your query, and given the complexity might expand it to several search queries using an LLM. ("rephrasing")
2) It runs queries against a web search index (I think it was using Bing or Brave at first, but they probably have their own by now), and uses an LLM to decide which are the best/most relevant documents. It starts writing a summary while it dives into sources (see next).
3) If necessary it will download full source documents that popped up in search to seed the context when generating a more in-depth summary/answer. They do this themselves because using OpenAI to do it is far more expensive.
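Put together, a hedged sketch of those three steps in Python (the helpers are stubs standing in for LLM calls and a search index, since the real internals aren't public):

def llm_rephrase(query):          # 1) query expansion via an LLM
    return [query, query + " explained"]

def search_index(subqueries):     # 2) ranked hits from a web search index
    return ["https://example.com/review"]

def fetch(url):                   # 3) the contested full-document fetch
    return "<contents of %s>" % url

def answer(query):
    urls = search_index(llm_rephrase(query))
    docs = [fetch(u) for u in urls]
    return "summary of %d docs for: %s" % (len(docs), query)

print(answer("best robot vacuum"))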
#3 is the problem. Especially because SEO has really made it so the same sites pop up on top for certain classes of queries (for example, Reddit will be on top for product reviews a lot). These sites operate on ad revenue, so their incentive is to block. Perplexity does whatever they can in the game of sidestepping the sites' wishes. They are a bad actor.
EDIT: I should also add that Google, Bing, and others, always obey robots.txt and they are good netizens. They have enough scale and maturity to patiently crawl a site. I wholeheartedly agree that if an independent site is also a good netizen, they should not be blocked. If Perplexity is not obeying robots.txt and they are impatient, they should absolutely be blocked.
Why is it okay for me to ask my browser to do this but I can’t ask my LLM to do the same?
When ChatGPT reads a review website, though? Zero ad clicks, zero affiliate links.
Brave blocks ads by default. Tools like Pocket and reader mode disable ads.
Why is it okay for some user agents but not others?
But “for multiple people” isn’t an argument IMO, since each of those people could run a separate service doing the same. Using the same service, on the contrary, provides an opportunity to reduce the request volume by caching.
Am I misunderstanding something? I (the site owner) pay Cloudflare to do this. It is my fault this happens, not Cloudflare's.
I've only ever seen a Cloudflare interstitial when viewing a page with my VPN on, for example -- something I'm happy about as a site owner and accept quite willingly as a VPN user knowing the kinds of abuse that occur over VPN.
Monopolistic is the wrong word, because you have the problem backwards. Cloudflare isn't helping Apple/Google... It's helping its paying consumers, and those are the only services those consumers want to let through.
Do you know how I can predict that AI agents, the sort that end users use to accomplish real tasks, will never take off? Because the people your agent would interact with want your EYEBALLS for ads, build anti patterns on purpose, want to make it hard to unsubscribe, cancel, get a refund, do a return.
AI that is useful to people will fail, for the same reason that no one has great public APIs any more: because every public company's real customers are its stockholders, and the consumers are simply a source of revenue. One that is modeled, marketed to, and manipulated, all in the name of returns on investment.
I was recently working on a project where I needed to find the published date for a lot of article links, and this came in handy. Not sure if it's changed recently, but asking ChatGPT, Gemini, etc. didn't work; they said they don't have access to current websites. Asking Perplexity, however, it fetched the website in real time and gave me the info I needed.
I do agree with the rest of your comment that this is not a random robot crawling. It was doing what a real user (me) asked it to fetch.
You say "shouldn't" here, but why?
There seems to be a fundamental conflict between two groups who each assert they have "rights":
* Content consumers claim the right to use whatever software they want to consume content.
* Content creators claim the right to control how their content is consumed (usually so that they can monetize it).
These two "rights" are in direct conflict.
The bias here on HN, at least in this thread, is clearly towards the first "right". And I tend to come down on this side myself, as a computer power user. I hate that I cannot, for example, customize the software I use to stream movies from popular streaming services.
But on the other hand, content costs money to make. Creators need to eat. If the content creators cannot monetize their content, then a lot of that content will stop being made. Then what? That doesn't seem good for anyone, right?
Whether or not you think they have the "right", Perplexity totally breaks web content monetization. What should we do about that?
(Disclosure: I work for Cloudflare but not on anything related to this. I am speaking for myself, not Cloudflare.)
It'd likely be a fantastic good if "content creators" stopped being able to eat from the slop they shovel. In the meantime, the smarter the tools that let folks never encounter that form of "content", the more they will pay for them.
There remain legitimate information creation or information discovery activities that nobody used to call "content". One can tell which they are by whether they have names pre-existing SEO, like "research" or "journalism" or "creative writing".
Ad-scaffolding, what the word "content" came to mean, costs money to make, ideally less than the ads it provides a place for generate. This simple equation means the whole ecosystem, together with the technology attempting to perpetuate it as viable, is an ouroboros, eating its own effluvia.
It is, I would argue, undetermined that advertising-driven content as a business model has a "right" to exist in today's form, rather than any number of other business models that sufficed for millennia of information and artistry before.
Today LLMs serve both the generation of additional literally brain-less content, and the sifting of such from information worth using. Both sides are up in arms, but in the long run, it sure seems some other form of information origination and creativity is likely to serve everyone better.
If they want the RSS feeds to be accessible then they should configure it to allow those requests.
Anyone circumventing bans is doing something shitty and illegal; see the Computer Fraud and Abuse Act and Craigslist v. 3Taps.
"And those LLMs didn't ask anyones permission to crawl the entire 'net."
False: OpenAI respects robots.txt, doesn't mask IPs, and paid a bunch of $ to Reddit.
You either side with the law or with criminals.
You can't even say the same thing about openAI because we don't know the corpus they train their models on.
We're seeing many posts about site owners getting hit by millions of requests because of LLMs; we can't blame Cloudflare for this, because it is literally a necessary evil.
Sure, the internet should be open and not trusted. But physical reality exists. Hosting and bandwidth cost money. I trust Google won't DDoS my site or cost me an arbitrary amount of money. I won't trust bots made by random people on the internet in the same way. The fact that Google respects robots.txt while Perplexity doesn't tells you why people trust Google more than random bots.
Google already has access to any webpage because its own search crawlers are allowed by most websites, and Google crawls recursively. Thus Gemini has the advantage of this synergy with Google Search. Perplexity does not crawl recursively (I presume; therefore it does not need to consult robots.txt), and it doesn't have synergies with a major search engine.
So you just came here to bitch about Cloudflare? It's wild to even comment on this thread if this does not make sense to you.
They're building a search index. Every AI is going to struggle at being a tool to find websites & business listings without a search index.
Now you're just lying.
I checked several of my Cloudflare sites and none have it enabled by default:
"No robots.txt file found. Consider enabling Cloudflare managed robots.txt or generate one for your website"
"A robots.txt was found and is not managed by Cloudflare"
"Instruct AI bot traffic with robots.txt" disabled
The Cloudflare blog post where they announced this a few weeks ago stated "Cloudflare, Inc. (NYSE: NET), the leading connectivity cloud company, today announced it is now the first Internet infrastructure provider to block AI crawlers accessing content without permission or compensation, by default." [1]
I was also a bit confused by this wording and took it to mean Cloudflare was blocking AI traffic by default. What does it mean exactly?
Third party folks seemingly also interpreted it in the same way, eg The Verge reporting it with the title "Cloudflare will now block AI crawlers by default" [2]
I think what it actually means is that they'll offer new folks a default-enabled option to block ai traffic, so existing folks won't see any change. That aligns with text deeper in their blog post:
> Upon sign-up with Cloudflare, every new domain will now be asked if they want to allow AI crawlers, giving customers the choice upfront to explicitly allow or deny AI crawlers access. This significant shift means that every new domain starts with the default of control, and eliminates the need for webpage owners to manually configure their settings to opt out. Customers can easily check their settings and enable crawling at any time if they want their content to be freely accessed.
Not sure what this looks like in practice, or whether existing customers will be notified of the new option or something. But I also wouldn't fault someone for misinterpreting the headlines; they were a bit misleading.
[1]: https://www.cloudflare.com/en-ca/press-releases/2025/cloudfl...
[2]: https://www.theverge.com/news/695501/cloudflare-block-ai-cra...
I understand that you're trying to be generous, but the claim that "Matthew opted everyone in by default" is flat out incorrect.
> Cloudflare, Inc. (NYSE: NET), the leading connectivity cloud company, today announced it is now the first Internet infrastructure provider to block AI crawlers accessing content without permission or compensation, *by default*.
Content marketing, gamified SEO, and obtrusive ads significantly hurt the quality of Google search. For all its flaws, LLMs don’t feel this gamified yet. It’s disappointing that this is probably where we’re headed. But I hope OpenAI and Anthropic realize that this drop in search result quality might be partly why Google’s losing traffic.
> Converting complex HTML pages with navigation, ads, and JavaScript into LLM-friendly plain text is both difficult and imprecise.
None of these conditions should apply for websites with purpose of providing information.
Edit: And, btw, that statement was true before the default was changed. So, your comment is doubly false.
You don't have a right to say who or what can read your public website (this is a normative statement). You do have a right not to be DoS'd. If you pretend not to know what that means, it sounds the same as saying "you have an arbitrary right to decide who gets to make requests to your service", but it does not mean that.
Sorry CF, give up. The courts are on our side here.
The world is bigger than the USA.
Just because American tech giants have captured and corrupted legislators in the US doesn't mean the rest of the world will follow.
People like getting money for their work. You do too. Don't lose sight of that.
Sure, Google has to follow it because they're a big company and need to respect certain laws or internal policies. But for everyone else, it's basically just a "please don't" sign, not a legal requirement, right?
In summary: Officially, Perplexity claims its bots honor robots.txt. In practice, outside investigators and hosting providers document persistent circumvention of such directives by undeclared or disguised crawlers acting on Perplexity's behalf, especially for real-time user queries.
robots.txt is not a blocking mechanism; it's a hint to indicate which parts of a site might be of interest to indexing.
People started using robots.txt to lie and declare things like no part of their site is interesting, and so of course that gets ignored.
HN is not a platform for attacking people, even imagined ones.
Please don't fulminate.