I called it when I wrote it: they are just burning their goodwill to the ground.
I will note that one of the main startups in the space worked with us directly, refunded our costs, and fixed the bug in their crawler. Facebook never replied to our emails, and the link in their User-Agent led to a 404 -- an engineer at the company saw our post and reached out, giving me the right email -- which I then emailed three times and never got a reply.
AI firms seem to be leading from a position that goodwill is irrelevant: a $100bn pile of capital, like an 800lb gorilla, does what it wants. AI will be incorporated into all products whether you like it or not; it will absorb all data whether you like it or not.
"Why should we care about open source maintainers" is just a microcosm of the much larger "why should we care about literally anybody" mindset.
And this is why AI training is not "fair use". The AI companies seek to train models in order to compete with the authors of the content used to train the models.
A possible eventual downfall of AI is that the risk of losing a copyright infringement lawsuit is not going away. If a court determines that the AI output you've used is close enough to be considered a derivative work, it's infringement.
If it's owned by a few, as it is right now, it's an existential threat to the life, liberty, and pursuit of happiness of everyone else on the planet.
We should be seriously considering what we're going to do in response to that threat if something doesn't change soon.
This has to change somehow.
"Machines will do everything and we'll just reap the profits" is a vision that techno-millenialists are repeating since the beginnings of the Industrial Revolution, but we haven't seen that happening anywhere.
For some strange reason, technological progress seem to be always accompanied with an increase on human labor. We're already past the 8-hours 5-days norm and things are only getting worse.
This isn't a consequence of capitalism. The notion of having to work to survive - assuming you aren't a fan of slavery - is baked into things at a much more fundamental level. And lots of people don't work, and are paid by a welfare state funded by capitalism-generated taxes.
> "Machines will do everything and we'll just reap the profits" is a vision that techno-millenialists are repeating since the beginnings of the Industrial Revolution, but we haven't seen that happening anywhere.
They were wrong, but the work is still there to do. You haven't come up with the utopian plan you're comparing this to.
> For some strange reason, technological progress always seems to be accompanied by an increase in human labor.
No, it doesn't. What happens is that not enough people are needed to do a job any more, so they go find another job. No one was opening barista-staffed coffee shops on every corner back when 30% of the world was doing agricultural labour.
Yes, it is. The fact that we have welfare isn't a refutation of that; it's proof. Welfare is a band-aid over the fundamental flaws of capitalism. A purely capitalist system is so evil it is unthinkable. The people currently on welfare would, in a truly free labor market, die and rot in the street. We, collectively, decided that's not a good idea and went against that.
That's why the labor market, and truly all our markets, are not free. Free markets suck major ass. We all know it. Six year olds have no business being in coal mines, no matter how much the invisible hand demands it.
I think this should be an axiom which should be respected by any copyright rule.
Let's not forget the basis:
> [The Congress shall have Power . . . ] To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.
Is our current implementation of copyright promoting the progress of science and useful arts?
Or will science and the useful arts be accelerated by culling back the current cruft of copyright laws?
For example, imagine if copyright were non-transferable and did not permit exclusive licensing agreements.
Copyright isn't the problem. Over-financialization is the problem.
Realize what it already has.
A foundational language model with no additional training is already quite powerful.
And that genie isn't going back into the bottle.
"The upside of my gambit is so great for the world, that I should be able to consume everyone else's resources for free. I promise to be a benevolent ruler."
When Google first came out in 1998, it was amazing, spooky how good it was. Then people figured out how to game pagerank and Google's accuracy cratered.
AI is now in a similar bubble period. Throwing out all of copyright law just for the benefit of a few oligarchs would be utter foolishness. Given who is in power right now I'm sure that prospect will find a few friends, but I think the odds of it actually happening before the bubble bursts are pretty small.
If software and ideas become commodities and the legal ecosystem around creating captive markets disappears, then we will all be much better off.
The people agitating for such things are usually leeches who want everything free and do, in fact, hold an infantile worldview that doesn't consider how necessary remuneration is to whatever it is they want so badly (media pirates being another example).
Not that I haven't "pirated" media, but this is usually the result of it not being available for purchase or my already having purchased it.
When I read someone else’s essay I may intend to write essays like that author. When I read someone else’s code I may intend to write code like that author.
AI training is no different from any other training.
> If a court determines that the AI output you've used is close enough to be considered a derivative work, it's infringement.
Do you mean the output of the AI training process (the model), or the output of the AI model? If the former, yes, sure: if a model actually contains within it copies of data, then it’s a copy of that work.
But we should all be very wary of any argument that the ability to create a new work which is identical to a previous work is itself derivative. A painter may be able to copy van Gogh, but neither the painter’s brain nor his non-copy paintings (even those in the style of van Gogh) are copies of van Gogh’s work.
I completely agree — that’s why I explicitly wrote ‘non-copy paintings’ in my example.
> The AI argument that passing original content through an algorithm insulates the output from claims of infringement because of "fair use" is pigwash.
Sure, but the argument that training an AI on content is necessarily infringement is equally pigwash. So long as the resulting model does not contain copies, it is not infringement; and so long as it does not produce a copy, it is not infringement.
That's not true.
The article specifically deals with training by scraping sites. That does necessarily involve producing a copy from the server to the machine(s) doing the scraping & training. If the TOS of the site incorporates robots.txt or otherwise denies a license for such activity, it is arguably infringement. Sourcehut's TOS for example specifically denies the use of automated tools to obtain information for profit.
Will it mean longer and longer clips are "fair use", or will we just stop making new content because it can't avoid copying patterns of the past?
https://www.vice.com/en/article/musicians-algorithmically-ge...
They did this in 2020. The article points out that "Whether this tactic actually works in court remains to be seen" and I haven't been following along with the story, so I don't know the current status.
Yes, it is. One is done by a computer program, and one is done by a human.
I believe in the rights and liberties of human beings. I have no reason to believe in rights for silicon. You, and every other AI apologist, are never able to produce anything to back up what is largely seen as an outrageous world view.
You cannot simply jump the gun and compare AI training to human training like it's a foregone conclusion. No, it doesn't work that way. Explain why AI should have rights. Explain if AI should be considered persons. Explain what I, personally, will gain from extending rights to AI. And explain what we, collectively, will gain from it.
Once you have an army of robot slaves ... you've rendered the whole concept of money irrelevant. Your skynet just barters rare earth metals with other skynets and your robot slaves furnish your desired lifestyle as best they can given the amount of rare earth metals your skynet can get its hands on. Or maybe a better skynet / slave army kills your skynet / slave army, but tough tits, sucks to be you and rules to be whoever's skynet killed yours.
they are heavily outnumbered and "outfunded"
Ubiquitous surveillance is another.
At some point in the future, if you aren't using AI, you won't be able to compete in the job market.
the tools feed back to the mothership what you are accepting and what you aren't
this is a far better signal than anything they get from crawling the internet
A job market is formed by the presence of needs and the ability to satisfy them. AI does not reduce the ability to satisfy needs, so the only situations where you won't be able to compete are either that socialists seize power and ban competition, or that all needs get met in some other way. In any other situation there will be a job market, and people will compete in it.
Maybe there will be. I'm sure there's also a market for Walkmans somewhere; it's just exceedingly small.
The proclaimed goal is to displace workers on a grand scale. This is basically the vision of any AI company and literally the only way you could even remotely justify their valuations given the heavy losses they incur right now.
> Job market is formed by the presence of needs and the presence of the ability to satisfy them
The needs of a job market are largely shaped by the overall economy. Many industrial nations are largely service-based economies with a lot of white collar jobs in particular. These white collar jobs are generally easier to replace with AI than blue collar jobs because you don't have to deal with pesky things like the real, physical world. The problem is: if white collar workers are kicked out of their jobs en masse, it also negatively affects the "value" of the remaining people with employment (exhibit A: the tech job market right now).
> is either the socialists will seize power and ban competition,
I am really having a hard time understanding where this obsession with mythical socialism comes from. The reality we live in is largely capitalistic and a striving towards a monopoly - i.e. a lack of competition - is basically the entire purpose of a corporation, which is only kept in check by government regulations.
It doesn't matter. What you need to understand is that at the root of the job market are needs, the ability to meet those needs, and the ability to exchange those abilities with one another. None of those are hindered by AI.
>Many industrial nations are largely service based economies with a lot of white collar jobs in particular.
Again: at the end of the day it doesn't change anything. At the end of the day you need a cooked dinner, a built house and everything else. So someone must build a house and exchange it for cooked dinners. That's what is happening (white collar workers and international trade balances included) and that's what the job market is. AI doesn't change the nature of those relationships. Maybe it replaces white collar workers, maybe even almost all of them - that only means they will go satisfy other unsatisfied needs of other people in exchange for the satisfaction of their own. The job market won't go anywhere; if anything, the amount of satisfied needs will go up, not down.
>if white collar workers are kicked out of their jobs en masse, it also negatively affects the "value" of the remaining people with employment
No, it doesn't. I mean it would if they were simply kicked out, but that's not the case - they would be replaced by AI. So society gets all the benefits they were creating plus an additional labor force to satisfy previously unsatisfied needs.
>exhibit A: the tech job market right now
I don't have the stats at hand, but aren't blue collar workers doing better now than ever before?
>I am really having a hard time understanding where this obsession with mythical socialism comes from
From the history of the 20th century? I mean it's not an obsession, but we are discussing scenarios of the disappearance (or significant shrinking) of the job market, and socialists are the most (if not the only) realistic cause for that at the moment.
>The reality we live in is largely capitalistic and a striving towards a monopoly
Yes, and that monopoly, the monopoly, is called "socialism".
>corporation, which is only kept in check by government regulations.
Generally, corporations are kept in check by the economic freedom of other economic agents, and it is government regulation that protects monopolies from the free market. I mean, why would a government regulate in the other direction? A small number of big corporations is way easier for a government to control and extract personal benefits from.
You should read some history. This view is so naive and overconfident.
* https://phys.org/news/2023-08-people-pointless-meaningless-j...
Parliament had made a law phasing in the introduction of automated looms; specifically so that existing weavers were first on the list to get one. Britain's oligarchy completely ignored this and bought or built looms anyway; and because Parliament is part of that oligarchy, the law effectively turned into "weavers get looms last". That's why they were smashing looms - to bring the oligarchy back to the negotiating table.
The oligarchy responded the way all violent thugs do: killing their detractors and lying about their motives.
Why would this happen? Money is simply a medium of exchange for the value that these contractors, mechanics and other hardcore blue collar trades are creating. How can they be broke if AI doesn't disturb their ability to create value and exchange it?
Money means nothing. It is simply a medium of exchange. The question is: is there anything to exchange? The answer is yes, and the position of white collar workers doesn't affect the availability of things to exchange. There's no reason for a recession; there is nothing that can hinder the ability of blue collar workers to create goods and services, all the things that, taken together, are called "wealth".
Don't think in the meaningless category of "what set of digits will be printed on the piece of paper called a paycheck?". Think in the terms that are actually implied: "what goods and services can blue collar workers not afford for themselves?". Then it becomes clear that the set of goods and services unaffordable to blue collar workers will shrink because of the replacement of white collar workers with AI, because it does not hinder their ability to create those goods and services.
You think so? Give me the contents of your checking, savings, and retirement accounts and then get back to me on that.
> the position of white collar workers doesn't affect the availability of things to exchange.
You appear to be confused about the concept of consumers, let me help. Consumers are the people who buy things. When there are fewer consumers in a market, demand for products and services declines. This means less sales. So no, you don't get to unemploy big chunks of the population and expect business to continue thriving.
No, demand is unlimited and defined by the amount of production.
>You don't get to unemploy big chunks of the population and expect business to continue thriving.
I mean, generally, replacing workers with instruments is the main way for a business (and society) to thrive. In other words, what goods and services will become less affordable to blue collar workers?
Enough of your trolling, go waste someone else's time.
Obviously it does affect it. The supply of goods increases and their relative market value increases - how can this not increase their incomes?
I mean yes, the value of consumed goods will decrease, so blue collar workers will be able to consume more. That's exactly what an increase in income means.
AI is poised to disrupt large swaths of the workforce. If large swaths of the workforce are disrupted this necessarily means a bunch of people will see their income negatively impacted (job got replaced by AI). Broke people by definition don't have money to spend on things, and will prioritize tier one of Maslow's Hierarchy out of necessity. Since shit like pergolas and oil changes are not directly on tier 1 they will be deprioritized. This in turn cuts business to blue collar service providers. Net result: everyone who isn't running an AI company or controlling some currently undefined minimum amount of capital is fucked.
If you're trying to suggest that any notional increases in productivity created by AI will in any way benefit working class individuals either individually or as a group you are off the edge of the map economically speaking. Historical precedents and observed executive tier depravity both suggest any increase in productivity will be used as an excuse to cut labor costs.
No, it doesn't. Where does that come from?
I mean, look at the situation from the perspective of blue collar service providers: what exactly are the goods and services that they were able to afford for themselves but that AI will make unaffordable? Pretty obviously, there are about none. So, in the big picture, the whole process you described doesn't lead to any disadvantage for blue collar workers.
Similar to how advertising and legal services are required for everything but have ambiguous ROI at best, AI is set to become a major “cost of doing business“ tax everywhere. Large corporations welcome this even if it’s useless, because it drags down smaller competitors and digs a deeper moat.
Executives large and small mostly have one thing in common though.. they have nothing but contempt for both their customers and their employees, and would much rather play the mergers and acquisitions type of games than do any real work in their industry (which is how we end up in a world where the doors are flying off airplanes mid flight). Either they consolidate power by getting bigger or they get a cushy exit, so.. who cares about any other kind of collateral damage?
Building things IS a contribution to society, but the people who build things typically aren't the ultimate owners. And even in cases where the builders and owners are the same, entitling the builders and all of their future heirs to rent seek for the rest of eternity is an inordinate reward.
This goes both ways. Let's say there is something you want but you're having trouble obtaining it. You'd need to give something in exchange.
But the seller of what you want doesn't need the things you can easily acquire, because they can get those things just as easily themselves.
The economy collapses back into self-sufficiency. That's why most Minecraft economy servers stagnate and die.
They would use some of the goods/services produced themselves, and also trade with other owners to live happy lives with everything they need, no workers involved.
Non-owners may let the jobless working class inhabit unwanted land, until they change their minds.
I miss the star trek visions of the future
now the "good" outcome is a world sized north korea, with elon as ruler
and the bad outcome is the ruler using his army of robots to eliminate the possibility of the peasant revolt once and for all
This economic relationship can be collectively[1] described as "feudalism". This is a system in which:
- The vast majority of people are obligated to perform menial labor, i.e. peasant farmers.
- Class mobility is forbidden by law and ownership predominantly stays within families.
- The vast majority of wealth in the economy is in the form of rents paid to owners.
We often use the word "capitalist" to describe all businesses, but that's a modern simplification. Businesses can absolutely engage in feudalist economies just as well, or better, than they can engage in capitalist ones. The key difference is that, under capitalism, businesses have to provide goods or services that people are willing to pay for. Feudalism makes no such demand; your business is just renting out a thing you own.
Assuming AI does what it says on the tin (which isn't at all obvious), the endgame of AI automation is an economy of roughly fifty elite oligarchs who own the software to make the robots that do all work. They will be in a constant state of cold war, having to pay their competitors for access to the work they need done, with periodic wars (kinetic, cyber, legal, whatever) being fought whenever a company intrudes upon another's labor-enclave.
The question of "well, who pays for the robots" misunderstands what money is ultimately for. Money is a token that tracks tax payments for coercive states. It is minted specifically to fund wars of conquest; you pay your soldiers in tax tokens so the people they conquer will have to barter for money to pay the tax collector with[2]. But this logic assumes your soldiers are engaging in a voluntary exchange. If your 'soldiers' are killer robots that won't say no and only demand payment in energy and ammunition, then you don't need money. You just need to seize critical energy and mineral reserves that can be harvested to make more robots.
So far, AI companies have been talking of first-order effects like mass unemployment and hand-waving about UBI to fix it. On a surface level, UBI sounds a lot like the law necessary to make all this AI nonsense palatable. Sam Altman even paid to have a study done on UBI, and the results were... not great. Everyone who got money saw real declines in their net worth. Capital-c Conservative types will get a big stiffy from the finding that UBI did lead people to work less, but that's only part of the story. UBI as promoted by AI companies is bribing the peasants. In the world where the AI companies win, what is the economic or political restraining bolt stopping the AI companies from just dialing the UBI back and keeping more of the resources for themselves once traditional employment is scaled back? Like, at that point, they already own all the resources and the means of production. What makes them share?
[0] Depending on your definition of institutional continuity - i.e. whether or not Istanbul is still Constantinople - you could argue the Roman Empire survived until WWI.
[1] Inasmuch as the complicated and idiosyncratic economic relationships of medieval Europe could even be summed up in one word.
[2] Ransomware vendors accidentally did this, establishing Bitcoin (and a few other cryptos) as money by demanding it as payment for a data ransom.
I agree with you in the case of AI companies, but the desire to own everything and be completely unconstrained is the dream of every large corporation.
how has this been any different from the past 10,000 years of human conquest and domination?
It can be better or worse depending on what those with power choose to do. Probably worse. There has been conquest and domination for a long time, but ordinary people have also lived in relative peace gathering and growing food in large parts of the world in the past, some for entire generations. But now the world is rapidly becoming unable to support much of that as abundance and carrying capacity are deleted through human activity. And eventually the robot armies controlled by a few people will probably extract and hoard everything that's left. Hopefully in some corners some people and animals can survive, probably by being seen as useful to the owners.
Be fruitful, and multiply, so that you may enjoy a comfortable middle age and senescence exploiting the shit out of numerous naive 25-year-olds! If it's robots, we can ramp down the population of both humans and robots until the planet can once again easily provide abundance.
That's why even though technology could theoretically be used to save us from many of our problems, it isn't primarily used that way.
But presumably petty tyrants with armies of slave robots are less interested than consensus in a long-term vision for humanity that involves feeding and housing a population of 10 billion.
So after whatever horrific holocaust follows the AI wars the way is clear for a hundred thousand humans to live in the lap of luxury with minimal impact on the planet. Even if there are a few intervening millennia of like 200 humans living in the lap of luxury and 99,800 living in sex slavery.
One issue was a pic with text in it, like a store sign. Users were complaining that it kept asking for better focus on the text in the background, before allowing a photo. Alpha quality junk.
Which is what AI is, really.
AI will be incorporated into the government, whether you like it or not.
FTFY!
Like why else can we just spam these AI endpoints and pay $0.07 at the end of the month? There is some incredible competition going on. And so far everyone except big tech is the winner so that’s nice.
I had to do a double take here. I run (mostly using dedicated servers) infrastructure that handles a few hundred TB of traffic per month, and my traffic costs are on the order of $0.50 to $3 per TB (mostly depending on the geographical location). AWS egress costs are just nuts.
Yes, it is.
> and way bigger problem then some AI companies that ignore robot.txt.
No, it absolutely is not. I think you underestimate just how hard these AI companies hammer services - it is bringing down systems that have weathered significant past traffic spikes with no issues, and the traffic volumes are at the level where literally any other kind of company would've been banned by their upstream for "carrying out DDoS attacks" months ago.
Yes, I completely don't understand this, and I don't understand comparing it with DDoS attacks. How is it different from what search engines are doing, and in what way is it worse? It's simply scraping data; what significant problems can it cause? Cache pollution? And that's it? I mean, even when we're talking about ignoring robots.txt (which search engines often do too) and hitting costly endpoints - what's the problem with adding a captcha or rate limiter to those endpoints?
Send a bill to their accounts payable team instead.
Have the terms of use charge them per page load under some definition of abuse.
Profit... By sending them invoices :-)
At which point does the crawling cease to be a bug/oversight and constitute a DDOS?
Depending on the number of simultaneous requesting connections, you may be able to do this without a significant change to your infrastructure. There are ways to do it that don't exhaust your number of (IP, port) available too, if that is an issue.
Then the hard part is deciding which connections to slow, but you can start with a proportional delay based on the number of bytes per source IP block or do it based on certain user agents. Might turn into a small arms race but it's a start.
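To make that concrete, here's a toy sketch in Go (all names and thresholds invented, and counting requests per /24 rather than bytes, just to show the shape) of a middleware that slows down heavy source blocks proportionally:

```go
package main

import (
	"net"
	"net/http"
	"sync"
	"time"
)

// throttler counts recent requests per /24 block and delays responses
// proportionally. Counts are reset once per minute by a background loop.
type throttler struct {
	mu     sync.Mutex
	counts map[string]int
	next   http.Handler
}

func newThrottler(next http.Handler) *throttler {
	t := &throttler{counts: map[string]int{}, next: next}
	go func() {
		for range time.Tick(time.Minute) {
			t.mu.Lock()
			t.counts = map[string]int{}
			t.mu.Unlock()
		}
	}()
	return t
}

func (t *throttler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	host, _, _ := net.SplitHostPort(r.RemoteAddr)
	block := host
	if ip := net.ParseIP(host); ip != nil && ip.To4() != nil {
		block = ip.Mask(net.CIDRMask(24, 32)).String() // group by /24
	}

	t.mu.Lock()
	t.counts[block]++
	n := t.counts[block]
	t.mu.Unlock()

	// Delay grows with the request count; cap it so real users behind a
	// shared NAT are merely slowed, not locked out.
	delay := time.Duration(n) * 10 * time.Millisecond
	if delay > 5*time.Second {
		delay = 5 * time.Second
	}
	time.Sleep(delay)

	t.next.ServeHTTP(w, r)
}

func main() {
	http.ListenAndServe(":8080", newThrottler(http.DefaultServeMux))
}
```

Sleeping in the handler ties up a goroutine per delayed request, which is cheap in Go; the same idea could also be pushed down into the reverse proxy instead.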
Without going into too much detail, this tracks with the trends in inquiries we're getting from new programs and existing members. A few years ago, the requests were almost exclusively related to performance, uptime, implementing OWASP rules in a WAF, or more generic volumetric impact. Now, AI scraping is increasingly something that FOSS orgs come to us for help with.
Not sure what to tell you but I surely feel quite human
Three of the pages told me to contact customer support, and the other two were a hard and useless block wall. Only from Codeberg did I get a useful response; the other two customer supports gave the typical "have you tried clearing your cookies" and restart-the-router advice, which is counterproductive because cookie tracking is often what lets one pass. Support is not prepared to deal with this, which means I can't shop at the stores whose blocking algorithms erroneously go off. I also don't think any normal person would ever contact support; I only do it to help them realise there's a problem and that they're blocking legitimate people from using the internet normally.
Beware if you employ this...
On the other hand CloudFlare and Akamai mistakenly block me all the damn time.
But to your point, the real kicker is the "many sites aren't going to get feedback from the real people they've blocked" since those tools inherently decided that the traffic was not human. You start getting into Westworld "doesn't look like anything to me" territory.
You don't know if each entry in the log is a real customer until they buy products proportional to some fraction of their page load rate, or real people until they submit useful content or whatever your site is about. Many people just read information without contributing to the site itself and that's okay, too. A list of blocked systems won't help; I run a server myself, I see the legit-looking user agent strings doing hundreds of thousands of requests, crawling past every page in sequence, but if there wasn't this inhuman request pattern and I just saw this user agent and IP address and other metadata among a list of blocked access attempts, I'd have no clue if the ban is legit or not
With these protection services, you can't know how much frustration is hiding in that paper trail, so I'm not blocking anyone from my sites; I'm making the system stand up to crawling. You have to do that regardless for search engines and traffic spikes like from HN
>I'm Not a Robot (film) https://en.m.wikipedia.org/wiki/I%27m_Not_a_Robot_(film)
Edit: and it's on YouTube in full! Was wondering which streaming service I'd have to buy for this niche genre of Dutch sci-fi but that makes life easy: https://www.youtube.com/watch?v=4VrLQXR7mKU
Final update: well, that was certainly special. Favorite moment was 10:26–10:36 ^^. Don't think that comes fully across in the baked-in English subtitles though. Overall it could have been an episode of Black Mirror, just shorter. Thanks again for the tip :)
I have to assume the Dutch movie industry just isn't too big.
I guess it's a side effect of America's media, but when I went to Europe including the Netherlands almost everyone spoke English at an almost native level.
It almost felt like playing a video game where there is an immersive mode you can just turn off if it gets too difficult ( subtitles in English at all public facilities).
One piece of feedback: Could you add some explanation (for humans) what we're supposed to do and what is happening when met by that page?
I know there is a loading animation widget thingy, but the first time I saw that page (some weeks ago at the Gnome issue tracker), it was proof-of-work'ing for like 20 seconds, and I wasn't sure what was going on, I initially thought I got blocked or that the captcha failed to load.
Of course, now I understand what it is, but I'm not sure it's 100% clear when you just see the "checking if you're a bot" page in isolation.
All of this is placeholder wording, layouts, CSS, and more. It'll be fixed in time. This is teething pain that I will get through.
Network effects, anyone? So yes, we should work on a different way of indexing the web than via Google, but that's easier said than done, I think...
Also
> https://news.ycombinator.com/item?id=43422781
Integrate a way to calculate micro-amounts of the shitcoin of your choice and we might have the another actually legitimately useful application of cryptocurrencies on our hands..!
If a GPU was required per scrape then >90% simply couldn't afford it at scale.
Oh dear, somebody is going to implement this in about an hour, aren't they....
Regardless, I think something like this is the way forward if one doesn't want to throw privacy entirely out the window.
A sha256 hash is a bunch of bytes like this:
394d1cc82924c2368d4e34fa450c6b30d5d02f8ae4bb6310e2296593008ff89f
We usually write it out in hex form, but that's literally what the bytes in ram look like. In a proof of work validation system, you take some base value (the "challenge") and a rapidly incrementing number (the "nonce"), so the thing you end up hashing is this: await sha256(`${challenge}${nonce}`);
The "difficulty" is how many leading zeroes the generated hash needs to have. When a client requests to pass the challenge, they include the nonce they used. The server then only has to do one sha256 operation: the one that confirms that the challenge (generated from request metadata) and the nonce (provided by the client) match the difficulty number of leading zeroes.The other trick is that presenting the challenge page is super cheap. I wrote that page with templ (https://templ.guide) so it compiles to native Go. This makes it as optimized as Go is modulo things like variable replacement. If this becomes a problem I plan to prerender things as much as possible. Rendering the challenge page from binary code or ram is always always always going to be so much cheaper than your webapp ever will be.
I'm planning on adding things like changing out the hash in use, but right now sha256 is the best option because most CPUs in active deployment have instructions to accelerate sha256 hashing. This combined with webcrypto jumping to heavily optimized C++ and the JIT in JS being shockingly good means that this super naïve approach is probably the most efficient way to do things right now.
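For the curious, that server-side check boils down to something like this minimal Go sketch (helper names are mine, not the actual implementation): one sha256 call plus a prefix comparison.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// verify re-does exactly one sha256 over challenge+nonce and checks that
// the hex digest starts with `difficulty` zero characters.
func verify(challenge, nonce string, difficulty int) bool {
	sum := sha256.Sum256([]byte(challenge + nonce))
	digest := hex.EncodeToString(sum[:])
	return strings.HasPrefix(digest, strings.Repeat("0", difficulty))
}

func main() {
	// The client has already searched for this nonce; the server only
	// needs the single hash inside verify to accept or reject it.
	fmt.Println(verify("example-challenge", "123456", 2))
}
```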
I'm shocked that this all works so well and I'm so glad to see it take off like it has.
I imagine it costs more resources to access the protected website, but would this stop the bots? Wouldn't they be able to pass the challenge and scrape the data afterwards? Or do normal scrape bots usually time out after a small amount of time/resources is used?
Like spam, this kind of mass-scraping only works because the cost of sending/requesting is virtually zero. Any cost is going to be a massive increase compared to 'virtually zero', at the kind of scale they operate at, even if it would be small to a normal user.
That's exactly how it works (easy for server, hard for client). Once the client completed the Proof-of-Work challenge, the server doesn't need to complete the same challenge, it only needs to validate that the results checks out.
Similar to how in Proof-of-Work blockchains where coming up with the block hashes is difficult, but validating them isn't nearly as compute-intensive.
This asymmetric computation requirement is probably the most fundamental property of Proof-of-Work, Wikipedia has more details if you're curious: https://en.wikipedia.org/wiki/Proof_of_work
Fun fact: it seems Proof-of-Work was used as a DoS preventing technique before it was used in Bitcoin/blockchains, so seems we've gone full circle :)
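To make the asymmetry concrete, here's a toy solver in Go (illustrative only, not any particular product's code): the client loops over nonces until one hashes to the required prefix, while the verifier checks any submission with a single hash. Expected client work grows roughly as 16^difficulty with this hex-prefix scheme.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// solve brute-forces nonces until sha256(challenge+nonce) starts with
// `difficulty` zero hex characters.
func solve(challenge string, difficulty int) (nonce string, tries int) {
	prefix := strings.Repeat("0", difficulty)
	for i := 0; ; i++ {
		candidate := strconv.Itoa(i)
		sum := sha256.Sum256([]byte(challenge + candidate))
		if strings.HasPrefix(hex.EncodeToString(sum[:]), prefix) {
			return candidate, i + 1
		}
	}
}

func main() {
	nonce, tries := solve("example-challenge", 4)
	fmt.Printf("found nonce %s after %d hashes\n", nonce, tries)
}
```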
Edit: I will probably send a pull request to fix it.
Because you can put your site behind an auth wall, but these new bots can solve the captchas and imitate real users like never before. Particularly if they're hitting you from residential IPs and with fake user agents like the ones in the article -- or even real user agents because they're wired up to something like Playwright.
What's left except for sites to start requiring credit cards, Worldcoin, or some equally depressing equivalent.
I don't mind registering an account for private communities, but for stuff which people put up thinking it is just going to be publicly visible it's really annoying.
I don't think these business owners really understand. Most normies just think everyone has a Facebook/Instagram account and can't even imagine a world where that is not the case.
I agree with you that it is extremely frustrating.
The people without a basic internet presence aren't likely to be customers anyway so it's not a huge loss. It's trivial to setup a basic account for any site that doesn't contain any personal data you want to keep hidden, if you aren't willing to do that, you're in a tiny minority.
It's equally trivial for a restaurant to set up a custom domain with their own 2-page website (overview and menu) on any of a hundred platforms that provide this service.
Most of these services are not free like FB, but any business that can afford a landline phone can afford a real website.
There are free ones as well, just as a subdomain (something.wordpress.com or something.wix.com), not a full top level custom domain.
Sure but they don't want to. If you want to see the menu they have online you need to follow their rules, not your own.
Obviously the restaurant has enough other customers and I have enough other restaurants to go to, so we both will be fine.
Sure, but putting their menu behind a trivial to access account shows they don't want you as a customer. You're the one complaining, not them.
I'm not sure why you think people who don't have a Facebook account wouldn't eat at restaurants.
https://github.com/mikf/gallery-dl https://git.ao2.it/tweeper.git
And those 30 seconds are a harrowing pit of despair out of which comes the rest of your life filled with advertisements, tracking, second-guessing, and accusations of being a hypocrite.
Not that it means you should just make an account to make their tracking easier...
Just to say the quiet part out loud here.. one of the biggest reasons this is depressing is that it's not only vandalism but actually vandalism with huge compounding benefits for the assholes involved and grabbing the content is just the beginning. If they take down the site forever due to increasing costs? Great, because people have to use AI for all documentation. If we retreat from captcha and force people to put in credit cards or telephone numbers? Great, because the internet is that much less anonymous. Data exfiltration leads to extra fraud? Great, you're gonna need AI to combat that. It's all pretty win-win for the bad actors.
People have discussed things like the balkanization of the internet for a long time now. One might think that the threat of that and/or the fact that it's such an unmitigated dumpster fire already might lead to some caution about making it worse! But pushing the bounds of harassment and friction that people are willing to deal with is moot anyway, because of course they have no real choice in the matter.
That we live in an internet where getting too many visitors is an existential crisis for websites should tell you that our internet is not one that can survive long.
Now there's a new generation of hungry hungry hippo indexers that didn't agree to that and who feel intense pressure from competition to scoop up as much data as they can, who just ignore it.
Legislation should have been made anyway, and those that ignore robots.txt blocked / fined / throttled / etc.
There’s other options besides a blanket ban.
If you are hosting a Forgejo instance, I strongly recommend setting DISABLE_DOWNLOAD_SOURCE_ARCHIVES to true. The crawlers will still peg your CPU but at least your disk won't be filled with zip files.
That's bad software design to generate ZIP files on the fly.
It'd be better to totally stream it of course, but that's not always an option for one reason or another.
Hm, so it's a cache then? Requesting the same tarball 100 times shouldn't create 100 zip files if they're cached, and if they aren't cached they shouldn't fill up the disk.
Clearly generating zip files, writing them fully to disk and then sending them to the client all at once is a completely awful and unusable design, compared to the proper design of incrementally generating and transmitting them to the client with minimal memory consumption and no disk usage at all.
The fact that such an absurd design is present is a sign that most likely the developers completely disregarded efficiency when making the software, and it's thus probably full of similar catastrophic issues.
For example, from a cursory look at the Forgejo source code, it appears that it spawns "git" processes to perform all git operations rather than using a dedicated library and while I haven't checked, I wouldn't be surprised if those operations were extremely far from the most efficient way of performing a given operation.
It's not surprising that the CPU is pegged at 100% load and the server is unavailable when running such extremely poor software.
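For illustration, a minimal sketch in Go of the streaming alternative (handler and paths invented): archive/zip writes compressed output straight into the HTTP response, so memory stays roughly constant and nothing is written to disk.

```go
package main

import (
	"archive/zip"
	"io"
	"net/http"
	"os"
	"path/filepath"
)

// archiveHandler streams a zip of ./repo-snapshot directly into the
// response body; no temporary file is ever created.
func archiveHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/zip")
	w.Header().Set("Content-Disposition", `attachment; filename="snapshot.zip"`)

	zw := zip.NewWriter(w) // compressed bytes go straight to the socket
	defer zw.Close()

	root := "repo-snapshot"
	filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		entry, err := zw.Create(path)
		if err != nil {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		_, err = io.Copy(entry, f) // copied in chunks, constant memory
		return err
	})
}

func main() {
	http.HandleFunc("/archive.zip", archiveHandler)
	http.ListenAndServe(":8080", nil)
}
```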
But Forgejo is not the only piece of software that can have CPU intensive endpoints. If I can't fence those off with robots.txt, should I just not be allowed to have them in the open? And if I forced people to have an account to view my packages, then surely I'd have close to 0 users for them.
This is because the only way to stop the bots is with a captcha, and this also stops search indexers from indexing your site. This will result in search engines not indexing sites, and hence providing no value anymore.
There's probably going to be a small lag as the knowledge in current LLMs dries up because no one can scrape the web in an automated fashion anymore.
It'll all burn down.
We're kind of stuck between a rock and a hard place here. Which do you prefer, entrenched incumbents or affordable/open hosting?
Anonymous browsing and potentially-malicious bots look identical. This was sort of OK up until now.
If you thought browser fingerprinting for ad tracking was creepy, just wait until they're using your actual fingerprint.
AI companies with the best anti-captcha mechanics will win and will inject ads into LLM output in more sophisticated ways.
OpenAI is going through the initial cycle of enshittification. Google is too big right now. Once they establish dominance you will have to see 5 unskippable ads between prompts, even on a paid plan.
I solved user problems for myself. Most of my web projects use client side processing. I moved to github pages. So clients can use my projects with no down time. Pages use SQLite as source of data. First browser downloads the SQLite model, then it uses it to display data on client side.
Example 'search' project: https://rumca-js.github.io/search
> I solved user problems for myself. Most of my web projects use client side processing. I moved to github pages. So clients can use my projects with no down time. Pages use SQLite as source of data. First browser downloads the SQLite model, then it uses it to display data on client side.
> Example 'search' project: https://rumca-js.github.io/search
That is not really a solution. Since typical indexing still works for the masses, your approach is currently unique. But in the end, bots will be capable of reading web page content if a human is capable of reading it. And we get back to the original problem where we try to tell bots from humans. It's the only way.
Is it going to become another race like the adblocker -> detect adblocker -> bypass adblocker detector and so on...?
Only if you operate at the scale of Cloudflare, etc. can you see which IP addresses are hitting a large number of servers in a short time span.
(I am pretty sure the next step is that they will hand out N free LLM requests per month in exchange for user machines doing the scraping, if blocking gets more successful.)
I fear the only solution in the end are CDNs, making visits expensive using challenges, or requiring users to log in.
I don't really know about this proposal; the majority of bots are going to be coming from residential IPs the minute you do this.[1]
[1] The AI SaaS will simply run a background worker on the client to do their search indexing.
Granted, I'm not looking forward to some LLM condensing all the garbage and handing me a Definitive Answer (TM) based on the information it deems relevant for inclusion.
Same for us, our forum and our Gitlab are getting hammered by AI companies bots.
Most of them don’t respect robots.txt…
You know, flood the zone with s***, Bannon-style ...
It won't work for well-structured sites where the bots know the exact endpoint they want to scrape, but might slow down the more exploratory spider threads.
Did you try to GET a canary page forbidden in robots.txt? Very well, have a bucket load of articles on the benefits of drinking bleach.
Is your user-agent too suspicious? No problem, feel free to scrape my insecure code (google "emergent misalignment" for more info).
A request rate too inhuman? Here, take those generated articles about positive effect of catching measles on performance in bed.
And so on, and so forth ...
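A toy sketch of the canary trick in Go (paths and the gibberish generator are invented for illustration): any client that fetches a URL disallowed in robots.txt gets flagged and served junk from then on.

```go
package main

import (
	"fmt"
	"math/rand"
	"net"
	"net/http"
	"sync"
)

var (
	mu       sync.Mutex
	poisoned = map[string]bool{} // IPs that fetched the canary
)

func clientIP(r *http.Request) string {
	host, _, _ := net.SplitHostPort(r.RemoteAddr)
	return host
}

// gibberish produces low-value filler text for flagged scrapers.
func gibberish() string {
	words := []string{"lorem", "ipsum", "dolor", "sit", "amet"}
	out := ""
	for i := 0; i < 200; i++ {
		out += words[rand.Intn(len(words))] + " "
	}
	return out
}

func main() {
	// robots.txt forbids the canary; honest crawlers never request it.
	http.HandleFunc("/robots.txt", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "User-agent: *\nDisallow: /canary/\n")
	})

	// Anything that requests the canary gets remembered.
	http.HandleFunc("/canary/", func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		poisoned[clientIP(r)] = true
		mu.Unlock()
		fmt.Fprint(w, gibberish())
	})

	// Normal pages: flagged clients get gibberish instead of content.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		bad := poisoned[clientIP(r)]
		mu.Unlock()
		if bad {
			fmt.Fprint(w, gibberish())
			return
		}
		fmt.Fprint(w, "real content")
	})

	http.ListenAndServe(":8080", nil)
}
```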
I love that the solution to LLM scraping is to serve the browser a proof of work before allowing access. I wonder if things like new sites start to do this... It would mean they won't be indexed by search engines, but it would help to protect the IP.
This is an aspect that a lot of PoW haters miss. While PoW is a waste, there are long-term economic incentives to minimize it, either by making it a side effect of something actually useful or by using energy that would go to waste anyway, so its overall effect gravitates toward neutral.
Unfortunately, such second-order effects are hard to explain to most people.
Say a hash challenge gets widely adopted, and scraping becomes more costly, maybe even requires GPUs. This is great, you can declare victory.
But what if after a while the scraping companies, with more resources than real users, are better able to solve the hash?
Crypto appeals here because you could make the scrapers cover the cost of serving their page.
Ofc if you’re leery of crypto you could try to find something else for bots to do. xkcd 810 for example. Or folding at home or something. But something to make the bot traffic productive, because if it’s just a hardware capability check maybe the scrapers get better hardware than the real users. Or not, no clue idk
The problem is that these companies are fairly well funded and renting infrastructure isn't an issue.
You can allow API access from cloud IPs, as long as you don't do anything expensive before you've authenticated the client.
“…they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses - mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure - actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.”
So it looks like much of the traffic, particularly from China, is indeed using consumer ips to disguise itself. That’s why they blocked based on browser type (MS Edge, in this case).
(I described my bot woes a few weeks ago at https://news.ycombinator.com/item?id=43208623. The "just block bots!" replies were well-intentioned but naive -- I've still found no signal that works reliably well to distinguish bots from real traffic.)
Either the LLM devs got more funding, or maybe the authorities took down the botnet they were using.
The advantage of a third party service is that you're sharing intel of bad actors.
How do they know that these are LLM crawlers and not anything else?
Like from their own ASNs you're saying? Or how are you connecting the IPs with the company?
> is that it's 100s of IPs doing 1 request
Are all of those IPs within the same ranges or scattered?
Thanks a lot for taking the time to talk about your experience btw, as someone who hasn't been hit by this it's interesting to have more details about it before it eventually happens.
Those are the ones that make it obvious, yes. It's not exclusive, though, but enough to connect the dots.
> Are all of those IPs within the same ranges or scattered?
The IP ranges are all over the place. Alibaba seems to have tons of small ASNs, for instance.
I can tell you what it looks like in the case of a git web interface like cgit: you get a burst of one or two isolated requests from a large number of IPs, each for very obscure (but different) URLs, like file contents at a specific commit id. And the user agent suggests it's coming from an iPhone or Android.
- We cannot block them because we can’t differentiate legitimate traffic from illegitimate traffic…
- …but we can conclusively identify this traffic as coming from AI crawlers.
Getting caught isn't a big deal. Getting caught in the act is. As long as they get their data, it doesn't matter if they're caught afterwards.
It's awful and it was costing me non-trivial amounts of money just from the constant pinging at all hours, for thousands of pages that absolutely do not need to be scraped. Which is just insane, because I actively design robots.txt to direct the robots to the correct pages to scrape.
So far so good with the honeypots, but I'll probably be creating more and clamping down harder on robots.txt to simply whitelist instead of blacklist. I'm thinking of even throwing in a robots honeypot directly in sitemap.xml that should bait robots to visit when they're not following the robots.txt.
It's really, really ridiculous.
"tens of thousands" ? I think not:
% sudo fail2ban-client status gitbots | more
Status for the jail: gitbots
|- Filter
| |- Currently failed: 0
| |- Total failed: 573555
| `- File list: /var/log/nginx/gitea_access.log
`- Actions
|- Currently banned: 78671
|- Total banned: 573074
Even though these bots are using different IPs with each request, that IP may be reused for a different website, and donating those IPs to a central system could help identify entire subnets to block.
Another trick was “tar-pitting” suspect senders (browser agent for example) to slow their message down and delay their process.
Bust the kneecaps of all the people responsible for those crawlers. Publicly. And all of them: from the person entering the command to the CEO of the company going through all the middle management. You did not go against this policy? Intact kneecaps are a privilege which just got revoked in your case.
https://en.m.wikipedia.org/wiki/Jack_Higgins
https://en.m.wikipedia.org/wiki/Liam_Devlin
Ideally a site would get scraped once, and then the scraper would check if content has changed, e.g. etag, while also learning how frequently content changes. So rather than just hammer some poor personal git repo over and over, it would learn that Monday is a good time to check if something changed and then back off for a week.
Wheels that haven't changed in years, with a "Last-Modified" and "ETag" that haven't changed.
The only thing that makes sense to me is it's cheaper them to re-pull and re-analyze the data than to develop a cache.
For what it's worth, mj12bot.com is even worse. They pull down every wheel every two or three days, even though something like chemfp-3.4-cp35-cp35m-manylinux1_x86_64.whl hasn't changed in years - it's for Python 3.5, after all.
that's not what I meant.
and it is not they, it is it.
i.e. the web server, not bots or devs on the other end of the connection, is what tells you the needed info. all you have to do is check it and act accordingly, i.e. download the changed resource or don't download the unchanged one.
google:
http header last modified
and look for the etag link too.
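For example, a hedged sketch in Go of what a polite re-crawl could do (the URL and validator values are placeholders): replay the saved ETag / Last-Modified and skip the download entirely on a 304.

```go
package main

import (
	"fmt"
	"net/http"
)

// fetchIfChanged re-requests a URL using the validators saved from the
// previous fetch; a 304 response means nothing needs to be downloaded.
func fetchIfChanged(url, etag, lastModified string) error {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	if etag != "" {
		req.Header.Set("If-None-Match", etag)
	}
	if lastModified != "" {
		req.Header.Set("If-Modified-Since", lastModified)
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusNotModified {
		fmt.Println("unchanged, skipping download")
		return nil
	}
	// Otherwise read the body and store the new validators for next time.
	fmt.Println("changed, new ETag:", resp.Header.Get("ETag"))
	return nil
}

func main() {
	_ = fetchIfChanged("https://example.org/packages/somepkg.whl",
		`"abc123"`, "Mon, 02 Jan 2006 15:04:05 GMT")
}
```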
It's that not doing so means they can increase their profit numbers just a skoshe more.
And at least as long as they haven't IPOed, that number's the only thing that matters. Everything getting in the way of increasing it is just an obstacle to be removed.
The poor implementation is not really relevant; it's companies deciding they own the internet, can take whatever they want, and can let everyone else deal with the consequences. The companies do not care what the impact of their AI nonsense is.
Crawlers have existed forever in search engine space and mostly behave.
This sort of no rate limit, fake user agent, 100s of IPs approach used by AI teams is obviously intentionally not caring who it fucks over. More malicious than sloppy implementation
I'll disagree that it's not at least individual malicious choice, though. Someone decided that they needed to fake/change user agents (as one example), and implemented it. Most likely it was more than one person- some manager(s)/teams probably also either suggested or agreed to this choice.
I would like to think at some point in this decision making process, someone would have considered 'is it ethical to change user agents to get around bans? Is it ethical to ignore robots.txt?' and decided not to proceed, but apparently that's not happening here...
the result? a mixed up version of 5000 plagiarised "baby's first webcrawler" github projects
It is literally the point of public websites to answer HTTP requests. If yours can't you're doing something wrong.
If you can't conclusively identify bots, you'll end up serving 'poisoned' responses to actual users. Doesn't seem like a viable solution.
We've never had one of these arms races end up with the defenders winning.
They will never respect you, but the second they notice this hurts their business more than it gains them, they will stop.
Thankfully, these bots were easy enough to block at the firewall level, but that may not work forever.
Given this info, the natural next question is “who is doing the harm?”
The answer is “AI companies”. Most people would now view the situation as having a lot to do with AI companies.
Bing used to do the same thing. (It might still do it, I just haven't heard about it in a while.)
Surely has very little to do with "intelligence".
And yet they are not. So what does that tell you?
What these crawlers are doing is akin to DDoS attacks.
You could also get Cloudflare, or some other CDN, but depending on your size that might not be within your budget. I don't get why the rest of the internet should subsidize these AI companies. They're not profitable, they live off venture capital, and they increase the operating costs of everyone else.
And you just know they'll gladly bill you for egress charges for their own bot traffic, too.
EDIT: Actually, this is an excellent question. By default, these bots would likely appear to come from "the internet" and thus be subject to egress charges for data transfers. Since all three major cloud providers also have significant interests in AI, wouldn't this be a sort of "silent" price increase, or a form of exploitive revenue pumping? There's nothing stopping Google, Microsoft/OpenAI, or Amazon from sending an army of bots against your sites, scraping the data, and then stiffing you with the charges for their own bots' traffic. Would be curious if anyone has read the T&Cs of their own rate cards closely enough to see if that's the case, or has proof in their billing metrics.
---
Original post continues below:
One topic of conversation I think worth having in light of this is why we still agree to charge for bandwidth consumed instead of bandwidth available, just as general industry practice. Bits are cheap in the grand scheme of things, even free, since all the associated costs are for the actual hardware infrastructure and human labor involved in setup and maintenance - the actual cost per bit transmitted is ridiculously small, infinitesimally so to be practical to bill.
It seems to me a better solution is to go back to charging for capacity instead of consumption, at least in an effort to reduce consumption charges for projects hosted. In the meantime, I'm 100% behind blocking entire ASNs and IP blocks from accessing websites or services in an effort to reduce abuse. I know a prior post about blocking the entirety of AWS ingress traffic got a high degree of skepticism and flack from the HN community about its utility, but now more than ever it seems highly relevant to those of us managing infrastructure.
Also, as an aside: all the more reason not to deploy SRV records for home-hosted services. I suspect these bots are just querying standard HTTP/S ports, and so my gut (but NOT data - I purposely don't collect analytics, even at home, so I have NO HARD EVIDENCE FOR THIS CLAIM) suggests that having nothing directly available on 80/443 will greatly limit potential scrapers.
1. Using the web would become much more compute/energy intensive and old devices would quickly lose access to the modern web.
2. Some hosts would inevitably double-dip by implementing this and ads or by "overcharging" the amount of work. There would have to be some kind of limit on how much work can be required by hosts - or at least some way to monitor and hold hosts accountable for the amount of work they charge.
3. There would need to be a cheap and reliable way to prove the client's work was correct and accurate. Otherwise people will inevitably find a way to spoof the work in order to reduce their compute/energy cost.
All you need is a central clearing house service that can handle billions of 0.000001 transactions per day.
Incidentally, I doubt the Bitcoin chain could handle that...
It's a solution that already has adoption, does not require everyone to sign up with a centralized service, and does not require everyone to pay money (they can pay with small amounts of computation instead) so it remains accessible to ~everyone.
Of course it isn't very secure, because if the client sees a mined block they might have the technical savvy to keep it. But you'd be forcing big web scrapers to run a horribly inefficient mining operation and they'd hate it. Plus you can run a blacklist of hated clients and double the difficulty for them, which is very low-cost for false positives and very high-cost for real scrapers - that isn't a result of using Bitcoin but it'd be funny.
Honestly, I don't see it necessarily as a bad thing.
I mean, at Communick I offer Matrix, Mastodon, Funkwhale and Lemmy accounts only to paying customers. As such, I have implemented payments via Stripe for convenience, but that didn't stop me from getting customers who wanted to pay directly via crypto, SEPA and even cash. It also didn't stop me from bypassing the whole system and giving my friends and family an account directly.
Why would any third party rely on authentication based on the relationship between my service and my customers?
Sounds like sanctioned racism.
I'm talking about social proof as in "You are a student of the city university, so you get an account at the library", "Julie from the book reading group wanted an account at our Bookwyrm server, so I made an account for her" or even "Unnamed customer who signed up for Cingular Wireless and was given an authorization code to access Level 2 support directly".
You are taking one thing I said (service providers will require some form of payment or social proof to give credentials to people who want to access the service), assumed the worst possible interpretation (people will only implement the worst possible forms of social proofing), and to top it off you added something else (gatekeeping) entirely on your own.
I can't dictate how you interpret my comment, but maybe you could be a bit more charitable and assume positive intent when talking with people you've never met?
Is Common Crawl data not fresh enough? Is there some other deficiency?
Whatever the problem is, a single crawler which every AI company can reference seems like a compromise to solve this issue, doesn't it?
20 years ago the fear of AI was that it would take over the world and try to kill us. Today we can clearly see that the threat of AI is the amoral humans that control it.
Nepenthes
https://arstechnica.com/tech-policy/2025/01/ai-haters-build-...
What I am thinking about may be an even worse idea.
For people actively working on these projects, how about putting the git server on a private net with VPN or SSH access?
Expose a separate, read-only static git server to the net.
Anyway, why not git clone the project and parse it locally instead of scraping the web pages? I understand that scraping works on every kind of content, but given the scale, git clone plus a periodic git fetch could save money even for the scrapers.
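For illustration, here's a rough sketch (the repository URL is hypothetical) of what a polite scraper could do instead: mirror-clone once, fetch incrementally, and read the whole history locally with plain git commands, at a tiny fraction of the cost of hammering the web UI:

    import subprocess
    from pathlib import Path

    # Hypothetical repository URL; any public git remote works the same way.
    REPO_URL = "https://example.org/project.git"
    MIRROR = Path("project.git")

    def run(*args: str) -> str:
        return subprocess.run(args, check=True, capture_output=True, text=True).stdout

    def sync() -> None:
        # One clone up front, then cheap incremental fetches: the server only
        # sends objects it hasn't sent before, instead of re-rendering
        # thousands of blame/diff pages.
        if MIRROR.exists():
            run("git", "--git-dir", str(MIRROR), "fetch", "--prune", "origin")
        else:
            run("git", "clone", "--mirror", REPO_URL, str(MIRROR))

    def all_commits() -> list[str]:
        # Parse the history locally; no further load on the forge.
        out = run("git", "--git-dir", str(MIRROR), "log", "--all",
                  "--format=%H%x00%an%x00%s")
        return out.splitlines()

    if __name__ == "__main__":
        sync()
        print(f"{len(all_commits())} commits mirrored locally")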
Finally, all of this reminds me of Peter Watts's Maelstrom, when viruses infested the Internet so much (at about this time in history) that nobody was using it anymore [1].
That's two big "ifs" for something I'm not aware of a standardized way of announcing. And the entire thing crumbles as soon as someone who wants every drop of data possible says "crawl their sites anyway to make sure they didn't forget to publish anything into the 2nd system."
This way crawlers might contribute back by providing extra storage and bandwidth.
Though something like ZeroNet seems a better approach to allow dynamic content.
I'm gonna do experiments with xeiaso.net as the main testing ground.
They really do use headless chrome for everything. My testing has shown a lot of them are on Digital Ocean. I have a list of IP addresses in case someone from there is reading this and can have a come to jesus conversation with those AI companies.
a proof of work function will end up selecting FOR them!
and now you have an experience where the bots have an easier time accessing your content than legitimate visitors
There's another aspect to this too: China and DeepSeek. While this was released by a private company, I think there's a not-insignificant chance that it reflects Chinese government policy to "commoditize your complements" [1]. Companies like OpenAI want to hide their secret sauce so it can't be reproduced. Training an LLM is expensive. If there are high-quality LLMs out there for free that you can just download, then this moat completely evaporates.
[1]: https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/
We've seen random stuff break when AWS has had outages, not because we used AWS ourselves, but because suppliers do.
Technically I'm all for kicking AWS off the internet for a day or two, for failing to police their customers, but it would just break a lot.
nothing good comes from there
unfortunately then they instantly switch to home IPs
Companies like DataImpulse [1] or ScraperAPI [2] will happily publicize their services with that specific target.
--
0: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html
1: https://dataimpulse.com/use-cases/ai-proxies/
2: https://www.scraperapi.com/solutions/ai-data/
Unethical, definitely. Illegal, no.
Examples?
At this point, I think we're well under 1% actual users on a good day.
The good news is that it's easy to disrupt these crawlers with some easy hacks. Tactical reverse slowloris is probably gonna make a comeback.
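For the curious, a toy sketch of a reverse-slowloris tarpit: if the request smells like a bot, keep the connection open and drip the response out a byte every few seconds so the crawler's connection pool silts up. The bot check here is just a hard-coded user-agent list for illustration; in practice you'd key it off whatever signal you actually trust:

    import socket
    import threading
    import time

    SUSPECT_SUBSTRINGS = (b"GPTBot", b"CCBot", b"Bytespider")  # illustrative list

    def looks_like_bot(request: bytes) -> bool:
        return any(s in request for s in SUSPECT_SUBSTRINGS)

    def tarpit(conn: socket.socket) -> None:
        # Dribble out a "valid" response one byte at a time, keeping the
        # crawler's connection (and its patience) tied up for minutes.
        conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")
        try:
            for ch in b"<html><body>" + b"Please hold." * 1000:
                conn.sendall(bytes([ch]))
                time.sleep(5)
        except OSError:
            pass  # the bot gave up; mission accomplished
        finally:
            conn.close()

    def serve(port: int = 8080) -> None:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
            srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            srv.bind(("", port))
            srv.listen()
            while True:
                conn, _ = srv.accept()
                request = conn.recv(4096)
                if looks_like_bot(request):
                    threading.Thread(target=tarpit, args=(conn,), daemon=True).start()
                else:
                    conn.sendall(b"HTTP/1.1 200 OK\r\n\r\nhello, human\r\n")
                    conn.close()

    if __name__ == "__main__":
        serve()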
The thing about AI scrapers is that they don't just do this once. They do this every day in case every file in a glibc commit from 15 years ago changed. It's absolutely maddening and I don't know why AI companies do this, but if this is not handled then the git forge falls over and nobody can use it.
Anubis is a solution that should not have to exist, but the problem it solves is catastrophically bad, so it needs to exist.
amusingly some scammy companies have crowdfunded off this type of "traffic" by citing "interest in asian markets"
However, they could just do an end run round this. In the UK they're planning to get the government to help them just grab everything for free: https://www.gov.uk/government/consultations/copyright-and-ar...
There's your answer. Lawfare works in favor of the party with deeper pockets.
This costs money, time, and ongoing commitment. FOSS isn't typically known for being overflowing with cash.
There is not really any incentive or reward for 3rd-party organizations to step in and do this.
Because reckless and greedy AI operators not only endanger FOSS projects, they threaten to collapse the freely accessible internet as a whole. Sooner or later, we will need to fight for our freedom, our rights as individual humans, against rogue AI and the overwhelming power of the mega-corporations, just as we need to fight against the concentration of content behind corporate gates today.
And I don't see any other way than taking legal action against these operators. They don't give a sh*t about the little humans, nor even about copyrights and other legal regulations.
I don't really like blocking an entire ASN, especially since I don't mind (responsible) crawling to begin with, but I was left with no choice
[1]: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...
Obviously there's still ways to pay people to run the browser but it would be nice for this activity to cost the AI company something without blocking actual people.
LLM bots are doing a great job of stress-testing infra, so if you are running abominations like GitLab or any terribly coded site and you are exposing it to the internet, you are just asking for trouble. If anything, GitLab should stop pumping out bloat and focus on some performance, because it's really bad. I would hope FOSS projects would stick to something like Forgejo, although I am not sure of the state of its CI/CD. Though my guess is that they are 85% there with 1/10 of GitLab's resources.
On the other side are of course badly coded bots that are aggressively trying to download everything. This was happening before LLMs and it just increased significantly because of them. I think we will reach a tipping point soon and then we will just assume those bots are just another malicious actor (like regular DDOS), and we will start actively taking them down, even with help of law enforcement.
Last thing I wanna see is 3-second bot challenges on every single site I visit; cookie banners are more than enough of a nightmare already.
I'll grant it can be a problem for super-heavy "application" websites where every GET is a serious computation. So I'm not surprised GitLab is having problems. Theirs is literally the most bloated and heaviest website I've ever seen. Maybe applications shouldn't be websites.
But this spreading social hysteria, this belief that all non-humans are dangerous and must be blocked is a nerd snipe. It really doesn't apply to most situations. Just running a website as we've always run them, public, and letting all user-agents access, is much less resource intensive than these various "mitigations" people are implementing. Mitigations which end up being worse than the problem itself in terms of preventing actual humans from reading text.
It's sad that it has to come to this. But especially when those "scrapers" are in a foreign country, you can't even do anything legally.
Personally, that's the approach I've followed since I first got connected to the internet around 1999: I don't share things I'm not OK with others using for whatever they want.
In physical environments where people are bombarded with low-information “noise”, gatekeeping (ie credentialism) emerges as a natural mechanism.
Like some sort of legal honeypot trap.
in all likelihood all of these assholes are paying some unscrupulous suppliers for the data so the terabytes of traffic aren’t immediately attributable to them.
> out of those only 3% passed Anubis' proof of work, hinting at 97% of the traffic being bots
This doesn't follow. If I open a link from my phone and it shows a spinner and gets hot, I'm closing it long before it gets to one minute and maybe looking for a way to contact the site's maintainer to tell them how annoying it was.
It's a shitty solution, but as it stands the status quo is quite untenable and will eventually have cloudflare as a spooky MITM for all the web's traffic.
It's also just wasting more of the planet's resources as compared to blocking
And more effort, with the only upside being that it's not immediately obvious to the bot that it is being blocked, so it'll suck in more of your pages
I understand that people are exploring options but I wouldn't label this as a solution to anything as of today's state of the art
Simple example: no legitimate user has "GPTBot" in their user agent string.
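As a minimal sketch, a WSGI middleware that refuses requests whose User-Agent contains a known AI-crawler marker. The list is illustrative and certainly incomplete, and it only catches bots honest enough to identify themselves:

    # Deny requests whose User-Agent contains a known AI-crawler marker.
    AI_UA_MARKERS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider", "PerplexityBot")

    class BlockAICrawlers:
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if any(marker in ua for marker in AI_UA_MARKERS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"AI crawling is not permitted on this site.\n"]
            return self.app(environ, start_response)

    # Usage with any WSGI app, e.g. Flask:
    #   app.wsgi_app = BlockAICrawlers(app.wsgi_app)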
It'll still trickle down to the people using the system and waste people's time (from development, to the people working to produce all the things used by these companies and mopping up the impact this has, to eventually the users), but that would definitely resolve one of my concerns
Nope.
You don't have only the AI crawlers; you also have scans and hack attempts (which look like script-kiddy stuff), all the time. Some smell of AI strapped to javascript web engines (or click farms with real humans???).
Smart: IP ranges from all over the world, and "clouds" make that even worse, since the pwned systems or bad actors (the guys who scan the whole IPv4 internet for its own good AND MANY SELL THE F* SCAN DATA: onyphe, stretchoid.com, etc.) are "moving". In other words, clouds are protecting those guys and weaponizing hackers with their massive network resources, wrecking small hosting. No cloud is spared: aws, microsoft, google, ovh, ucloud.cn, etc.
I send good vibes to the brave small hosters of open source software (as long as they are noscript/basic (x)html compatible ofc).
Many fixed-IPv4 pwned systems have been referenced by security communities, often for months, sometimes years, and the people with the right leverage don't seem to do a damn thing about it.
Currently, I wonder if I should not block all digital ocean IP ranges... and I was about to do the same with ucloud.cn IP ranges.
The second you host anything on the net, it WILL take a significant amount of your time. Presume you will be pwned; that's why security communities reference each other too.
Then I am thinking of going towards 2 types of "hosting". First type: private IPv6+port ("randomized" for each client, maybe transient in time depending on the service), thanks to those /64 prefixes (maybe /92 prefixes are a thing for mobile internet?). Yes, this is complicated and convoluted. Second type: a "standard" permanent IP, but with services implemented in a _HARDCORE_ simple way, if possible near 100% static. I am thinking of going even further: assembly on bare metal, a custom kernel based on hand compilation of linux code (RISC-V hardware ofc, FPGA for bigger hosting?).
I don't think anything will improve unless carrier scale network operators start to show their teeth.
First, it was Facebook https://news.ycombinator.com/item?id=23490367 and now it's these other companies.
What's worse? They completely ignore a simple HTTP 429 status.
503 is at least apparently understood by more crawlers/bots, but they still like to blame the victim: YouTube sends me a condescending (and inaccurate) email when it gets a 503 for ignoring cache headers and other basics it seems...
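Which is maddening, because being polite is trivial. A rough sketch of what a well-behaved fetcher could do, assuming the server sends the usual validators and Retry-After headers (the user-agent string and contact address are made up):

    import time
    import requests

    def polite_fetch(url: str, cache: dict) -> bytes | None:
        # Send validators from the previous fetch so an unchanged page costs
        # the server a 304 instead of a full render.
        headers = {"User-Agent": "example-crawler/0.1 (contact@example.org)"}
        prev = cache.get(url, {})
        if "etag" in prev:
            headers["If-None-Match"] = prev["etag"]
        if "last_modified" in prev:
            headers["If-Modified-Since"] = prev["last_modified"]

        resp = requests.get(url, headers=headers, timeout=30)

        if resp.status_code in (429, 503):
            # Back off instead of blaming the victim.
            retry = resp.headers.get("Retry-After", "60")
            time.sleep(int(retry) if retry.isdigit() else 60)
            return None
        if resp.status_code == 304:
            return prev.get("body")

        cache[url] = {"body": resp.content}
        if resp.headers.get("ETag"):
            cache[url]["etag"] = resp.headers["ETag"]
        if resp.headers.get("Last-Modified"):
            cache[url]["last_modified"] = resp.headers["Last-Modified"]
        return resp.content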
Since apparently they (the scrapers) have no intent of doing so (releasing), expect commercial open source to achieve de facto protocol status very soon. And the rest may not exist in such a centralised and free manner anymore.
The special sauce is in parsing, tokenizing, enriching etc. There is no value in re-scraping, and massive cost, right?
Just do it.
You simply can't get 40 terabytes of text without mass scraping.
Are you sure it is not just very inefficient?
Also, set AI tarpits as fake links with recursive calls. Make them mad with non-curated bullshit made from Markov chain generators until their cache begins to rot forever.
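A toy sketch of such a tarpit: every URL resolves to a deterministic page of filler plus links to more fake URLs, so a crawler that follows links never runs out. (The word list stands in for a real Markov model; the port is arbitrary.)

    import random
    from wsgiref.simple_server import make_server

    WORDS = "the of a model data scraper cache token link rot forever".split()

    def babble(seed: int, n: int = 120) -> str:
        # Cheap stand-in for a real Markov chain: deterministic per-URL noise,
        # so the same fake page always looks the same to the crawler.
        rng = random.Random(seed)
        return " ".join(rng.choice(WORDS) for _ in range(n))

    def tarpit_app(environ, start_response):
        path = environ.get("PATH_INFO", "/")
        seed = hash(path) & 0xFFFFFFFF
        rng = random.Random(seed)
        # Every fake page links to more fake pages: a maze with no exit.
        links = "".join(
            f'<a href="{path.rstrip("/")}/{rng.randrange(10**9)}">more</a> '
            for _ in range(10)
        )
        body = f"<html><body><p>{babble(seed)}</p>{links}</body></html>".encode()
        start_response("200 OK", [("Content-Type", "text/html")])
        return [body]

    if __name__ == "__main__":
        make_server("", 8081, tarpit_app).serve_forever()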
Also I would argue that not having capitalist incentives baked directly into the network is what made the web work, for good or bad. Xanadu would never have gotten off the ground if people had to pay their ISP then pay for every website, or every packet, or every clicked link or whatever.
Reading the Xanadu page on Wikipedia tells me "Every document can contain a royalty mechanism at any desired degree of granularity to ensure payment on any portion accessed, including virtual copies ("transclusions") of all or part of the document."
That would be absolute chaos at scale.
I agree that the lack of monetization was important to the development and that it would have been chaos as proposed, but will the current setup be sustainable forever in the world of AI?
We have projects like Ethereum that are specifically intended to merge payments and computing, and I wouldn't be surprised if at some point in the future, some kind of small access fee negotiated in the background without direct user involvement become a component of access. I wouldn't expect people to pay ISPs but rather some kind of token exchange to occur that would benefit both the network operators and the web hosts by verifying classes of users. Non-fungible token exchanges could be used as a kind of CATPCHA replacement by cryptographically verifying users anonymously with a third-party token holder as the intermediary.
For example, let's say Mullvad or some other VPN company purchased a small amount of verification tokens for its subscribers who pay them anonymously for an account. On the other side, let's say a government requires people to register through their ISP, and the ISP purchases the same tokens on behalf of the user, and then exchanges the tokens on behalf of the user. In either case, the person can stand behind a third party who both sends them the data they requested and exchanges the verification tokens, which the site operator could then exchange for reimbursement of their services to their hosting provider.
This is just a high-level idea of how we might get around the challenges of a web dominated by bots and AI, but I'm sure the reality of it will be more interesting.
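To make that slightly more concrete, here's a deliberately oversimplified sketch of the token flow: the intermediary mints single-use bearer tokens for its verified users, and the site operator redeems them without learning who the user is. A real design would want blind signatures so even the issuer can't link a token to a user; all the names, lifetimes, and secrets below are hypothetical:

    import hashlib
    import hmac
    import os
    import time

    # Shared secret between the token issuer (e.g. a VPN provider or ISP acting
    # as intermediary) and the verifier. Purely illustrative.
    ISSUER_SECRET = os.urandom(32)
    SPENT = set()

    def issue_token() -> str:
        # The intermediary mints a bearer token for a paying/verified user,
        # without embedding who that user is.
        nonce = os.urandom(16).hex()
        expiry = str(int(time.time()) + 3600)
        mac = hmac.new(ISSUER_SECRET, f"{nonce}:{expiry}".encode(), hashlib.sha256).hexdigest()
        return f"{nonce}:{expiry}:{mac}"

    def redeem_token(token: str) -> bool:
        # The site operator checks the signature, expiry, and single use, then
        # bills the issuer for the visit -- learning nothing about the user.
        try:
            nonce, expiry, mac = token.split(":")
        except ValueError:
            return False
        expected = hmac.new(ISSUER_SECRET, f"{nonce}:{expiry}".encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(mac, expected):
            return False
        if int(expiry) < time.time() or nonce in SPENT:
            return False
        SPENT.add(nonce)
        return True

    if __name__ == "__main__":
        t = issue_token()
        assert redeem_token(t) and not redeem_token(t)  # valid once, then spent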
Meanwhile as profit motives begin to dominate (as they inevitably would,) access to information and resources becomes more and more of a privilege than a right, and everything becomes more commercialized, faster.
I won't claim to have a better idea, though. The best solutions in my mind are simply not publishing anything to the web and letting AI choke on its own vomit, or poisoning anything you do publish, somehow.
The other day, I logged into Usenet using eternalseptember, and found out that it consisted of 95% zombies sending spam you could recognize from the start of the millennium. On one hand, it made me feel pretty nostalgic. Yay, 9/11 conspiracy theories! Yay, more all-caps deranged Illuminati conspiracies! Yay, Nigerian princes! Yay, dick pills! And an occasional on-topic message which strangely felt out of place.
On the other hand, I felt like I was in a half-dark mall bereft of most of its tenants, where the only places left are an 85-year-old watch repair shop and a photocopy service on the other end of the floor. On still another hand, it turns out I haven't missed much by not being on Usenet, as all-caps deranged conspiracy shit abounds on Facebook.
I would welcome a modern replacement for Usenet, but I feel like it would need a thorough redesign based on modern connectivity patterns and computing realities.
But I guess realistically you can't fight entropy forever. Even Hacker News, aggressively moderated as it is, is slowly but irrevocably degrading over time.
Also, I often access FIDO over NNTP.
> and found out that it consisted of 95% zombies sending spam you could recognize from the start of the millennium
I like to imagine a forgotten server, running since the mid-90s, its owners long since imprisoned for tax fraud, still pumping out its daily quota of penis enlargement spam.
The distributed nature of git is fine until you want to serve it to the world - then, you're back to bad actors. They're looking for commits because it's nicely chunked, I'm taking a guess.
They're not looking for anything specifically from what I can tell. If that was the case, they would be just cloning the git repository, as it would be the easiest way to ingest such information. Instead, they just want to guzzle every single URL they can get hold of. And a web frontend for git generates thousands of those. Every file in a repository results in dozens, if not hundreds of unique links for file revisions, blame, etc. and many of those are expensive to serve. Which is why they are often put in robots.txt, so everything was fine until the LLM crawlers came along and ignored robots.txt.
What the article describes, though, is possibly the worst way a machine can access a git repository, which is using a web UI and scraping that, instead of cloning it and adding all the commits to its training set. I feel like they simply don't give a shit. They got such a huge capital injection that they feel they can afford not to give a shit about their own cost efficiency and to go with scorched-earth tactics. After all, even their own LLMs can produce a naive scraper that wreaks havoc on the internet infrastructure, and they just let it loose. Got mine, fuck you all the way!
But then they will release some DeepSeek R(xyz), and yay, all the hackernews who were roasting them for such methods, will be applauding them for a new version of an "open source" stochastic parrot. Yay indeed.
fb (meta) & big tech put their user-contributed stuff behind a paywall, yet abuse open systems.
Where we could have once wrapped our mostly static websites in Varnish or a scalable P2P cache like Coral CDN, now we must fiddle and twiddle with robots.txt and appeal to the goodwill of megacorps who never cared about being good netizens before, even when they weren't profiting from scraping to such a degree.
This is yet another chance for me to scream into the void that we're still doing this all wrong. Our sites should work more like htmx, with full static functionality, adding dynamic embellishment when available. Business logic should happen deterministically in one place on the backend or "serverless" with some kind of distributed consensus protocol like Raft/Paxos or a CRDT, then propagate to the frontend through a RESTful protocol, similarly to how Firebase or Ruby Hotwire/Laravel Livewire work. The way that we mostly all do form validation wrong in 2 places with 2 languages is almost hilariously tragic in how predictably it happens.
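To make the validation point concrete, a tiny sketch of "define the rules once": the same table drives both the HTML attributes the browser enforces and the server-side check, so the two can't drift apart. The field names and rules here are made up:

    import re

    RULES = {
        "username": {"required": True, "maxlength": 32, "pattern": r"[a-z0-9_]+"},
        "email":    {"required": True, "maxlength": 254, "pattern": r"[^@\s]+@[^@\s]+"},
    }

    def render_input(field: str) -> str:
        # Emit the same constraints as HTML5 attributes for the browser.
        attrs = " ".join(
            "required" if k == "required" and v else f'{k}="{v}"'
            for k, v in RULES[field].items() if v
        )
        return f'<input name="{field}" {attrs}>'

    def validate(form: dict) -> dict:
        # Enforce the same constraints authoritatively on the server.
        errors = {}
        for field, rule in RULES.items():
            value = form.get(field, "")
            if rule.get("required") and not value:
                errors[field] = "required"
            elif len(value) > rule.get("maxlength", 10**6):
                errors[field] = "too long"
            elif value and not re.fullmatch(rule.get("pattern", ".*"), value):
                errors[field] = "invalid format"
        return errors

    if __name__ == "__main__":
        print(render_input("email"))
        print(validate({"username": "drew", "email": "not-an-email"}))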
But the real tragedy is that the wealthiest and most powerful companies that could have fixed web development decades ago don't care about you. Amazon, Google and Microsoft would rather double down on byzantine cloud infrastructure than devote even a fraction of their profit to pure research into actually fixing all of this.
Meanwhile the rest of us sit and spin, sacrificing the hours and days and years of our lives building out other people's ideas to make rent. Many of us know exactly how to fix things, but with infinite backlogs and never truly exiting burnout, we're too tired at the end of the day to contribute to FOSS projects and get real work done. Our valiant quest to win the internet lottery has become a death march through a seemingly inescapable tragedy of the commons.
Instead of fixing the web at a foundational level from first principles, we'll do the wrong thing like we always do and lock everything down behind login walls and endless are-you-human/2FA challenges. Then the LLMs will evolve past us and wrap our cryptic languages and frameworks in human language to a level where even pair programming won't be enough for us to decipher the code or maintain it ourselves.
If I was the developer tasked with hardening a website to LLMs, the first thing I would do is separate the static and dynamic content. I'd fix most of the responses to respect standard HTTP cache headers. Then I'd put that behind the first Cloudlare competitor I could find that promises to never have a human challenge screen. Then I'd wrap every backend API endpoint in Russian doll caching middleware. Then I'd shard the database by user id as a last resort, avoiding that at all cost by caching queries and/or using modern techniques like materialized views to put the burden of scaling on the database and scale vertically or gradually migrate the heaviest queries to a document or column-oriented store. Better yet, move to a stronger store that's already solved all of these problems, like CouchDB/PouchDB.
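As a rough sketch of that first step (making responses honestly cacheable), something like this WSGI middleware: hash the body into an ETag, answer If-None-Match with a 304, and give static-looking paths a longer max-age. The paths and lifetimes are placeholders, not recommendations:

    import hashlib

    class CacheHeaders:
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            captured = {}

            def capture(status, headers, exc_info=None):
                captured["status"], captured["headers"] = status, list(headers)

            body = b"".join(self.app(environ, capture))
            etag = '"' + hashlib.sha256(body).hexdigest()[:32] + '"'

            # Unchanged content costs a hash comparison, not a full response.
            if environ.get("HTTP_IF_NONE_MATCH") == etag:
                start_response("304 Not Modified", [("ETag", etag)])
                return [b""]

            headers = captured["headers"] + [("ETag", etag)]
            if environ.get("PATH_INFO", "").startswith(("/static/", "/assets/")):
                headers.append(("Cache-Control", "public, max-age=86400"))
            else:
                headers.append(("Cache-Control", "public, max-age=60"))
            start_response(captured["status"], headers)
            return [body]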
Then I'd build a time machine to convince everyone to do things right the first time instead of building a tech industry upon unforced errors. Oh wait, former me already tried sounding the alarm and nobody cared anyway. How can I even care anymore, when honestly I don't see any way to get out of this mess on any practical timescale? I guess the irony is that only LLMs can save us now.
robots.txt should allow excluding all AI crawlers; AI crawlers should be forced to add "AI" to their crawler User-Agent headers and also to respect a robots.txt saying they can't crawl the website
right now we need to do this:
User-agent: *
Disallow: /
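which blocks everyone, including crawlers you might actually want. If you only want to opt out of AI training, the closest thing today is listing the user-agent tokens the big operators say they honor, something like the following (the list goes stale quickly, and it only works on bots that identify themselves and actually obey robots.txt):
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: CCBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: PerplexityBot
User-agent: Bytespider
Disallow: /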
Maybe that is the problem? They misspelled robots.txt
All licenses need a clause like the following:
This software is for humans. AI training is prohibited and carries a default
penalty of $1 trillion.
That's not "no longer strong enough". That's a very strong system applying leverage to a powerful actor.
If we instead adopt the view of free software (https://www.gnu.org/philosophy/open-source-misses-the-point....), the fact that OpenAI and other large corporations train their large-language models behind closed doors - with no disclosure of their training corpus - effectively represents the biggest attack on GPL-licensed code to date.
No evidence suggests that OpenAI and others exclude GPL-licensed repositories from their training sets. And nothing prevents the incorporation of GPL-licensed code into proprietary codebases. Note that a few papers have documented the regurgitation of literal text snippets by large language models (one example: https://arxiv.org/pdf/2409.12367v2).
To me, this seems like the LLM-version of using coin-mixing to obscure the trail of Bitcoin transactions in the blockchain. The current situation also reminds me of how the generalization of the SaaS model led to the creation of the Affero GPL license (https://www.gnu.org/licenses/why-affero-gpl.html).
LLMs enable the circumvention of the spirit of free software licenses, as well as of the legal mechanisms to enforce them.
Also, I don't think a restriction on the FSF's freedom 1, "The freedom to study how the program works", based on what tools you use and how you use them fits with FSF philosophy, nor do I think it is appropriate. You should be able to run whatever analysis tools you have available to study the program. Being able to ingest a program into a local LLM model and then ask questions about the codebase before you understand it yourself is valuable. Or if you aren't a programmer, or aren't familiar with the language used, a local LLM could help you make the changes needed to add a new feature. In that situation LLMs can enable practical software freedom, for those who can't afford to pay/convince a programmer to make the changes they want.
https://www.gnu.org/philosophy/free-sw.html
In addition, OpenAI clearly do not respect copyrights and licenses in general, so would ignore any anti-AI clauses, which would make them ineffective and thus pointless. So, I think we should tackle the LLM problem through the law, and not through licenses. That is already happening with various caselaw in software, writing, artwork etc.
It isn't possible or practical to change the existing body of Free Software to use new anti-AI clauses anyway. https://juliareda.eu/2021/07/github-copilot-is-not-infringin...
BTW, LLMs could also in theory be used to licensewash proprietary software; see "Does free software benefit from ML models being derived works of training data?" by Matthew Garrett:
Regarding the licensing, I'll restate my point that the Affero license was created precisely in a moment where the existing licenses could no longer uphold the freedoms that the Free Software Foundation set out to defend. A change of license was the right solution at that particular point in time and, if it worked then, I think we can all agree that there is at least a precedent that such a course of action might work and should at the very least be considered as a possible solution for today's problems.
That said, my own personal view is more aligned with demanding that nation states pressure big corporations so that currently closed-source software becomes at least open source (either by law, or simply by stopping using it and investing their budgets in free alternatives instead). Note I said open source and not free. I just would like to read their code and feed it to my LLMs :)
On Affero, that was indeed definitely needed, although some folks on HN seem to think that privately modifying code is allowed by copyright, even if the modified version is outputting a public website, thus what the license says is irrelevant. That seems bogus to me, but seems a loophole if it is legit. Anyway, personally I think that people should simply just never use SaaS, nor web apps. It also doesn't help with data portability.
I'd go further and advocate for legally mandated source code escrow for copyright validity, and GPL like rights to the code once public, which would happen if the software is off the market for N years.
I agree 100%.
Sure, if you want to try to prevent AI training by licensing, do that, but it's no longer FOSS, so please don't call it that.
MIT license requires this:
> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
Do the AI companies follow this requirement?
I haven't seen any LLMs being able to reproduce full copies or even "substantial portions" of any existing software, unless we're talking "famous" functions like those from Quake and such.
You have any examples of that happening? I might have missed it
Also, isn't this basically just extortion? "I know you're minding your own business, FOSS maintainer, but move your code to our recommended forge so we can stop DDoSing you?"
I was actually thinking of a more general thing than just code, eg similar to CommonCrawl, but maybe a code specific thing is what is needed.
Doesn't your suggestion shift the responsibility to likely under-sponsored FOSS maintainers rather than companies? Also, how do people agree to switch to some centralized repository and how long does that take? Even if people move over, would that solve the issue? How would a scraper know not to crawl a maintainer's site? Scrapers already ignore robots.txt, so they'd probably still crawl even if you verified you've uploaded the latest content.
If you put some data in a central repository, they will take it.
Then they will go and DDoS the rest of the Internet in order to take all the rest of the data.
What?! "AI"?!?! We are talking about traffic abusers!...
To be clear, you could still have anonymous spaces like Reddit where arbitrary user IDs are used and real identities are discarded. People could opt-in to those spaces. But for most people most of the time, things get better when you can verify sources. Everything from DDOS to spam, to malware infections to personal attacks and threats will be reduced when anonymity is removed.
Yes there are downsides to this idea but I'd like people to have real conversations around those rather than throw the baby out with the bath water.
It's hard to have a serious conversation when you present a couple of upsides but completely understate/not mention the downsides.
Eliminating anonymity comes with real danger. What about whistleblowers and marginalized groups? The increased likelihood of targeted harassment, stalking, and chilling effects on free speech? The increase in surveillance? The reduction in content creation and legitimate criticism of companies/products/etc? The power imbalance granted to whoever validates identities?
pjc50 brings up some other great points, which got me thinking even more:
Removing anonymity creates a greater incentive to steal identities, has a slew of logistical issues (who/how are IDs verified, what IDs are accepted, what are the enforcement mechanisms and who enforces them, etc.), creates issues with shared accounts and corporate/brand accounts, would require cooperation across every country with internet access (good luck!) otherwise it doesn't really work, and probably a million other things if I keep thinking about it.
Doesn't this just create an even worse market for identity theft and botnets?
How does this apply to countries without a national ID system like the United States?
What do you do with an ID traced to a different country, anyway?
> personal attacks and threats will be reduced when anonymity is removed
People are happy to make death threats under their real name, newspaper byline, blue tick, or on the presidential letterhead if they're doing so from a position of power.
So do I support a fully authenticated internet? Fuck no. If we can get good at bot detection, zip bomb the fuckers. In the meantime, work as hard as we can to dismantle the hellscape that the internet has become. I'm all for decentralized, sovereign identity systems that aren't owned by some profiteering corpo cretins or some proto-fascist state, but I don't want it to be a requirement to look at photos of dogs or plan my next trip.
Such as living under logging. Which, you know (you know?), some people will radically refuse, with several crucial justifications. One of them is that privacy is a condition for Dignity. Another is Prudence. Another one is a warning millennia old, about who you should choose as a confidant. And more.
Nothing should be $$ free unless you already paid with your tax. Same principle -> if HN started charging every account, I'd be happy to pay a small amount per month. This token amount of pay per account would also reduce the number of bots.
FOSS is generally built on the idea that anyone can use the code for anything, if you start to add a price for that, not only do you effectively gate your project from "poor people", but it also kind of erodes some of the core principles behind FOSS.
There's access via (e.g.) the git protocol (git://....) and access via http.
These attacks all happen via the latter, since the former is already access-controlled.
We mirror to github for public access; our developers all use git itself, not the web interface, for interacting with the repo.
How/what github et al. are doing to deal with this, I do not know.
Edit: nevermind, I see you are using Gitea rather than cgit/etc. I guess Gitea can't disable the problematic commit/etc views.
Well, yes and no. If you had a cost to access the source code, I'm pretty sure I'd stop calling that FOSS. If you only have a price for downloading binaries, sure, still FOSS, since we're talking source code licensing.
> Nothing should be $$ free
I took this statement at face value, and assumed parent argued for basically eliminating FOSS.
Even the FSF thinks you can charge money for free software and still call it FOSS: https://www.gnu.org/philosophy/selling.html.
In other words, it was written with no consideration for performance at all.
A competent engineer would use Rust or C++ with an in-process git library, perhaps rewrite part of the git library or git storage system if necessary for high performance, and would design a fast storage system with SSDs, and rate-limit slow storage access if there has to be slow storage.
That's the actual problem, LLMs are seemingly just adding a bit of load that is exposing the extremely amateurish design of their software, unsuitable for being exposed on the public Internet.
Anyway, they can work around the problem by restricting their systems to logged-in users (and restricting registration if necessary), and by mirroring their content to well-implemented external services like GitHub or GitLab and redirecting users there.
The issue is, there aren't any fully featured ones of these yet. Sure, they do exist, but you run into issues. Spawning a git process isn't about not considering performance, it's about correctness. You simply won't be able to support a lot of people if you don't just spawn a git process.
This is a bold assumption to make on such little data other than "your opinion".
Developing in Python is not a negative and, depending on the people, the scope of the product, and the intended use, is completely acceptable. The balance of "it performs what it's needed to do in an acceptable window of performance while providing x, y, z benefits" is almost certainly a discussion the company and its developers have had.
What it never tried to solve was scaling to LLM and crawler abuse. Claiming that they have made no performance considerations because they can't scale to handle a use case they never supported is just idiotic.
>That's the actual problem, LLMs are seemingly just adding a bit of load that is exposing the extremely amateurish design of their software.
"Just adding a bit of load" != 75%+ of calls. You can't be discussing this in good faith and make simplistic reductions like this. Either you are trolling or naively blaming the victims without any rational thought or knowledge.