1: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
This paper by economists from the University of Chicago found zero false positives on 1,992 human-written documents and over 99% recall in detecting AI documents. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5407424
Find me a clean public dataset with no AI involvement and I will be happy to report Pangram's false positive rate on it.
You’re punishing them for claiming to do a good job. If they truly are doing a bad job, surely there is a better criticism you could provide.
EditLens (Ours)
Predicted Label
Human Mix AI
┌─────────┬─────────┬─────────┐
Human │ 1770 │ 111 │ 0 │
├─────────┼─────────┼─────────┤
True Mix │ 265 │ 1945 │ 28 │
Label ├─────────┼─────────┼─────────┤
AI │ 0 │ 186 │ 1695 │
└─────────┴─────────┴─────────┘
It looks like ~5% of human texts from your paper are marked as mixed, and 5-10% of AI texts are marked as mixed, from your paper. I guess I don’t see that this is much better than what’s come before, using your own paper.
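For what it's worth, the percentages quoted above can be checked directly against the confusion matrix as posted (rows are true labels, columns are predicted labels); this is just arithmetic on the quoted counts, not anything from the paper itself:

```python
# Sanity-check the misclassification rates from the confusion matrix
# quoted above (rows = true label, columns = predicted label).
matrix = {
    "human": {"human": 1770, "mix": 111,  "ai": 0},
    "mix":   {"human": 265,  "mix": 1945, "ai": 28},
    "ai":    {"human": 0,    "mix": 186,  "ai": 1695},
}

for true_label, row in matrix.items():
    total = sum(row.values())
    for pred_label, count in row.items():
        if pred_label != true_label and count > 0:
            # share of this true class that was mislabeled as pred_label
            print(f"{true_label} -> {pred_label}: {count / total:.1%}")
```

This gives ~5.9% of human texts labeled as mixed and ~9.9% of AI texts labeled as mixed, consistent with the "5%" and "5-10%" figures above.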
Edit: this is an irresponsible Nature news article, too - we should see a graph of this detector over the past ten years to see how much of this ‘deluge’ is algorithmic error
In this case, several important distinctions are drawn, including being open about criteria, about such things as "perplexity" and "burstiness" as properties being tested for, and an explanation of why they incorrectly claim the Declaration of Independence is AI generated (it's ubiquitous). So it seems like a lot of important distinctions are being drawn that testify to the credibility of the model, which has to matter to you if you're going to start moralizing.
https://www.pangram.com/blog/why-perplexity-and-burstiness-f...
Pangram is fundamentally different technology, it's a large deep learning based model that is trained on hundreds of millions of human and AI examples. Some people see a dozen failed attempts at a problem as proof that the problem is impossible, but I would like to remind you that basically every major and minor technology was preceded by failed attempts.
I was, with total certainty, under the impression that detecting AI-written text was an impossible-to-solve problem. I think that's because it's just so deceptively intuitive to believe that "for every detector, there'll just be a better LLM and it'll never stop."
I had recently published a macOS app called Pudding to help humans prove they wrote a text mainly under the assumption that this problem can't be solved with measurable certainty and traditional methods.
Now I'm of course a bit sad that the problem (and hence my solution) can be solved much more directly. But, hey, I fell in love with the problem, so I'm super impressed with what y'all are accomplishing at and with Pangram!
The big AI providers don't have any obvious incentive to do this. If it happens 'naturally' in the pursuit of quality then sure, but explicitly training for stealth is a brand concern in the same way that offering a fully uncensored model would be.
Smaller providers might do this (again in the same way they now offer uncensored models), but they occupy a miniscule fraction of the market and will be a generation or two behind the leaders.
There are also rings of reviewer fraud going on, where groups of people in these niche areas all get assigned each other's papers and recommend acceptance, and in many cases the AC is part of this as well. I'm not saying this is common, but it is occurring.
It feels as if every layer of society is in maximum-extraction mode, and this is just a single example. No one is spending time to carefully and deeply review a paper because they care and they feel on principle that's the right thing to do. People used to do this.
Without professions, there are no more professional communities really, no more professional standards to uphold, no reason to get in the way of somebody’s publications.
I would propose that the evolution you speak of is more related to our technology (and I am not just saying AI, far from it) and how it is now possible to perform the very minimum requirements of a task with little effort.
She opens with an example of a bank. She walked in and asked for a debit card. The teller told her to take a seat. 30 minutes later, the teller told her the bank doesn't issue debit cards. Firstly, what kind of bank doesn't issue debit cards, and secondly, what kind of bank takes 30 minutes to figure out whether or not it issues debit cards? And this is just one of many examples of things that society does that have no reason not to work, that should have been selected away long ago if they did not work - that bank should have been bankrupt long ago - but for some reason this is not happening and everything is just getting clogged with bullshit and non-working solutions.
It's back to OP's point. There's no such thing as professions now. Just jobs. We put them on and off like hats. With that churn comes lack of institutional knowledge and a rule set handed down from the C Suite for front line employees completely detached from the front line work.
Enshittification run rampant.
Nothing like the textile mills of the 1900s. You'll need to do better.
The normal functioning of markets would be that badly-working things are slowly driven out, while well-working things grow and replace them. Even without any reference to financial markets, this is simply what you expect to happen when people have a variety of things to choose from.
I could hypothesize that markets have evolved to the point where it's impossible for new things to grow unless they are already shit. Perhaps because everyone's too busy working for the shit things (which is partly because the government keeps printing money for the previously successful things in order to prevent the economy collapsing, and therefore landlords get to charge exorbitant rent). Or perhaps because people just don't have any money because of the above, and can only afford the cheap shit things (but a lot of the shit things are expensive?). Or perhaps because people are afraid to start new things because they're afraid of the government (I've observed that not infrequently on HN; also something something testosterone microplastics). Or perhaps because advertising effectiveness has reached the point where new things never become discoverable and stay crowded out as old things ramp up advertisement to compensate. Or perhaps we're just all depressed (because of the housing market, probably).
But what's the proof? How do you prove (with any rigor) a given text is AI-generated?
From what I remember, (long before generative AI) you would still occasionally get very crappy reviews (as an author). When I participated (a couple of times) in review committees, when there was high variance between reviews, the crappy reviews were rather easy to spot and eliminate.
Now it's not bad to detect crappy (or AI) reviews, but I wonder if it would change the end result much compared to other potential interventions.
They wrote a paper describing how they did it. https://arxiv.org/pdf/2510.03154
you cannot. beyond extra data (metadata) embedded in the content, it is impossible to tell whether a given text was generated by an LLM or not (and I think the distinction is rather puerile, personally)
We also wanted to quantify our EditLens model's FPR on the same domain, so we ran all of ICLR's 2022 reviews. Of 10,202 reviews, Pangram marked 10,190 as fully human, 10 as lightly AI-edited, 1 as moderately AI-edited, 1 as heavily AI-edited, and none as fully AI-generated.
That's ~1 in 1k FPR for light AI edits, 1 in 10k FPR for heavy AI edits.
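A quick back-of-the-envelope check of those rates, using only the counts reported in the comment above (10,202 all-human reviews, of which 10 were flagged as lightly, 1 as moderately, and 1 as heavily AI-edited):

```python
# Back-of-the-envelope false positive rates from the quoted counts.
total = 10_202      # ICLR 2022 reviews, all presumed human-written
light = 10          # flagged as lightly AI-edited
moderate = 1        # flagged as moderately AI-edited
heavy = 1           # flagged as heavily AI-edited

fpr_light = light / total                      # ~1 in 1,000
fpr_heavy = heavy / total                      # ~1 in 10,000
fpr_any = (light + moderate + heavy) / total   # ~1 in 850 overall

print(f"light: {fpr_light:.5f}")
print(f"heavy: {fpr_heavy:.6f}")
print(f"any flag: {fpr_any:.5f}")
```

The "~1 in 1k" and "1 in 10k" figures check out, and even counting every flag of any severity, the overall false positive rate on this set is about 0.12%.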
> AI or not, it's hard to tell them apart.
Apparently not for this tool.
We can't use this to convict a single reviewer, but we can almost surely say that many reviewers just gave the review work to an AI.
21%...? Am I reading it right? I bet no one expected it to be so low when they clicked this title.
In accident investigation we often refer to "holes in the swiss cheese lining up." Dereliction of duty is commonly one of the holes that lines up with all the others, and is apparently rampant in this field.
he didn't say he read it carefully after running it through the slop machine.
I think there is a far more interesting discussion to be had here about how useful the 21% were. How well does an AI execute a peer review?
This is a conference purporting to do PEER review. No matter how good the AI, it's not a peer review.
And that's not necessarily a bad thing. If I set up RAG correctly, then tell the AI to generate K samples, then spend time to pick out the best one, that's still significant human input, and likely very good output too. It's just invisible what the human did.
And as models get better, the necessary K will become smaller....
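The best-of-K workflow described above can be sketched as follows; note this is purely illustrative, and `generate()` and `score()` are placeholders (a real setup would call an LLM with retrieval-augmented context, and the scoring would be human judgment):

```python
# Illustrative sketch of the best-of-K workflow: generate K candidate
# drafts, then pick the best one. generate() and score() are stand-ins,
# not a real API.
def generate(prompt: str, seed: int) -> str:
    # placeholder for an LLM call with RAG context; here, drafts just
    # differ in length so the selection step has something to compare
    return f"{'very ' * seed}rough draft for: {prompt}"

def score(draft: str) -> float:
    # placeholder for human judgment picking the best candidate
    return len(draft)

def best_of_k(prompt: str, k: int) -> str:
    candidates = [generate(prompt, seed) for seed in range(k)]
    return max(candidates, key=score)

print(best_of_k("summarize the paper", 5))
```

The human effort is concentrated in `score()`, which is exactly the part that leaves no trace in the final text.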
I occasionally get people telling me AI is unreliable, and I tell them the same thing: the tech is nearly infinitely flexible (computing over the space of ideas!), so that says a lot more about how they’re using it.
ok of course the human reviewers could still use AI here but then so could the authors, ad infinitum..
The problem is the entire article is made up. Sure, the author can trace client-side traffic, but the vast majority of start-ups would be making calls to LLMs in their backend (a sequence diagram in the article even points this out!!), where it would be untraceable. There is certainly no way the author can make a broad statement that he knows what's happening across hundreds of startups.
Yet lots of comments just take these conclusions at face value. Worse, when other commenters and I pointed out the blatant impossibility of the author's conclusion, we got responses just rehashing how the author said they "traced network traffic", even though that doesn't make any sense, as they wouldn't have access to the backends of these companies.
In general, what bothers me the most is the lack of transparency from researchers that use LLMs. Like, give me the text and explicitly mention that you used LLM for it. Even better, if one links the prompt history.
The lack of transparency causes greater damage than using an LLM to generate text does. Otherwise, we will keep chasing the perfect AI detector, which to me seems pointless.
h/t to Paul Cantrell https://hachyderm.io/@inthehands/115633840133507279
For the record I actually like the AI writing style. It's a huge improvement in readability over most academic writing I used to come across.
However, almost every peer review I was a part of, pre- and post-LLM, had one reviewer who provided a questionable review. Sometimes I'd wonder if they'd even read the submission, and sometimes, there were borderline unethical practices like trying to farm citations through my submission. Luckily, at least one other diligent reviewer would provide a counterweight.
Safe to say that I don't find it surprising, and hearing / reading others' experiences tells me it's yet another symptom of a barely functioning mechanism that is peer review today.
Sadly, it's the best mechanism that institutions are willing to support.
Many of the researchers may not have native command of English, and even if they do, AI can help with writing in general.
Obviously I’m not referring to pure AI generated BS.
On the one hand (and the most important thing, IMO), it's really bad to judge people on the basis of "AI detectors", especially when this can have an impact on their career. It's also used in education, and that sucks even more. AI detectors have bad error rates, can't detect concentrated efforts (i.e. finetunes will trick every detector out there; I've tried), can have insane false positives (the first ones that got to "market" were rating the Declaration of Independence as 100% AI-written), and at best they'll only catch the most vanilla outputs.
On the other hand, working with these things and just being online, it's impossible to say that I don't see the signs everywhere. Vanilla LLMs fixate on some language patterns, and once you notice them, you see them everywhere. It's not just x; it was truly y. Followed by one supportive point, the second supportive point, and the third supportive point. And so on. Coupled with that vague overview style and not much depth, it's really easy to call blatant generations as you see them. It's like everyone writes in LinkedIn-infused mania episodes now. It's getting old fast.
So I feel for the people who got slop reviews. I'd be furious. Especially when it's a faux pas to call it out.
I also feel for the reviewers that maybe got caught in this mess for merely "spell checking" their (hopefully) human written reviews.
I don't know how we'll fix it. The only reasonable thing for the moment seems to be drilling into everyone that at the end of the day they own their stuff. Be it a homework, a PR or a comment on a blog. Some are obviously more important than the others, but still. Don't submit something you can't defend, especially when your education/career/reputation depends on it.
But you can see the slippery slope: first you ask your favorite LLM to check your grammar, and before you think about it, you are just asking it to write the whole thing.
Where you purposefully put spaces.
Like this.
And the kicker is?
You get my point. I don't see a way out of this in the social media context because it's just spam. Producing the slop takes an order of magnitude less effort than parsing it. But when it comes to peer reviews and papers I think some kind of reputation system might help. If you get caught doing this shit you need to pay some consequence.
Humans optimize for effort.
We have expanded the market for lemons.
People can say they are doing the work, use AI, and offload testing on the other party.
Buyers will respond by moving their purchase price down. People selling quality content will realize they don’t have a chance to get the fair value and exit the market.
Anyone arguing otherwise needs to explain how, or who, is going to handle the added burden of verification that has been foisted on all of society.
That is a tad bit too much...
Many of us use AI not to write text, but to re-write text. My favorite prompt: "Write this better." In other words, AI is often used to fix awkward phrasing, poor flow, bad English, bad grammar, etc.
It's very unlikely that an author or reviewer purely relies on AI written text, with none of their original ideas incorporated.
As AI detectors cannot tell rewrites from AI-incepted writing, it's fair to call them BS.
Ignore...
hoisted by your own petard
I’ve been to CVPR, NeurIPS and AGI conferences over the last decade and they used to be where progress in AI was displayed.
No longer. Progress is all in your github and increasingly only dominated by the “new” AI companies (Deepmind, OAI, Anthropic, Alibaba etc…)
No major landscape-shifting breakthroughs have come out of CSAIL, BAIR, NYU, TUM, etc. in ~the last 5 years.
I’d expect this will continue, as the only things that matter at this point are architecture, data, and compute.
And, if your AI can't write a paper, are you even any good as an AI researcher? :^)
It's inevitable that faces will be devoured by AI Leopards.
I increasingly see AI generated slop across the internet - on twitter, nytimes comments, blog/substack posts from smart people. Most of it is obvious AI garbage and it's really f*ing annoying. It largely has the same obnoxious style and really bad analogies. Here's an (impossible to realize) proposal: any time AI-generated text is used, we should get to see the whole interaction chain that led to its production. It would be like a student writing an essay who asks a parent or friend for help revising it. There's clearly a difference between revisions and substantial content contribution.
The notion that AI is ready to be producing research or peer reviews is just dumb. If AI correctly identifies flaws in a paper, the paper was probably real trash. Much of the time, errors are quite subtle. When I review, after I write my review and identify subtle issues, I pass the paper through AI. It rarely finds the subtle issues. (Not unlike a time it tried to debug my code and spent all its time focused on an entirely OK floating point comparison.)
For anecdotal issues with PL: I am working on a 500 word conference abstract. I spent a long while working on it but then dropped it into opus 4.5 to see what would happen. It made very minimal changes to the actual writing, but the abstract (to me) reads a lot better even with its minimal rearrangements. That surprises me. (But again, these were very minimal rearrangements: I provided ~550 words and got back a slightly reduced, 450 words.) Perhaps more interestingly, PL's characterizations are unstable. If I check the original claude output, I get "fully AI-generated, medium". If I drop in my further refined version (where I clean up claude's output), I get fully human. Some of the aspects which PL says characterize the original as AI-generated (particular n-grams in the text) are actually from my original work.
The realities are these: a) ai content sucks (especially in style); b) people will continue to use AI (often to produce crap) because doing real work is hard and everyone else is "sprinting ahead" using the semi-undetectable (or at least plausibly deniable) ai garbage; c) slowly the style of AI will almost certainly infect the writing style of actual people (ugh) - this is probably already happening; I think I can feel it in my own writing sometimes; d) AI detection may not always work, but AI-generated content is definitely proliferating. This *is* a problem, but in the long run we likely have few solutions.
The thought of thousands of people having to do what she had to do is depressing. I was sitting in the room with her while she wrote it and submitted it for checking: "AI" detected! She found the only way to avoid this was to go over it again and again, simplifying and dumbing it down to use very basic sentence structures, which ended up reading like something from a primary schooler. The whole thing is ass-backwards.
If they had a conference on, say, the Americans, wouldn't it be fair for Americans to have a seat at the table?
Yes, AI slop is an issue. But throwing more AI at detecting this, and most importantly, not weighing that detection properly, is an even bigger problem.
And, HN-wise, "this seems like AI" seems like a very good inclusion in the "things not to complain about" FAQ. Address the idea, not the form of the message, and if it's obviously slop (or SEO, or self-promotion), just downvote (or ignore) and move on...
Those are all legitimate concerns or even valid complaints, though, and, once raised, those concerns can be addressed by fixing the problem, if the person responsible for the state of affairs chooses to do so.
If someone is accused falsely of using AI or anything else that they genuinely didn’t do, like a paywall, then I can see your “downvote and move on” strategy as being perhaps expedient, but I don’t think your comparison is a helpful framing. Accessibility concerns are valid for the same reason as paywall concerns: it’s a valid position to desire our shared knowledge and culture to be accessible by one and by all without requiring a ticket to ride, entry through a turnstile, or submitting to profiling or tracking. If someone releases their ideas into the world, it’s now part of our shared consciousness and social fabric. Ideas can’t be owned once they’re shared, nor can knowledge be siloed once it’s dispersed.
It seems that you’re saying that simply because there isn’t a good rejoinder to false claims of AI usage that we shouldn’t make such claims at all, even legitimate ones, but this gives cover to bad actors and limits discourse to acceptable approved topics, and perhaps lowers the level of discourse by preventing necessary expectations of disclosure of AI usage from forming. If we throw in the towel on AI usage being expected to be disclosed, then that’s the whole ballgame. Folks will use it and not say so, because it will be considered rude to even suggest that AI was used, which isn’t helpful to the humans who have to live in such a society.
We ought to have good methodological reasons for the things we publish if we believe them to be true, and I’m not trying to be a naysayer or anything, but I respectfully disagree with your statement generally and on the points. All of the things you mentioned should be called out for cause, even if there isn’t much interesting discussion to be had, because the facts of the matters you mention are worth mentioning themselves in their own right. Just like we should let people like things, we should let people dislike things, and saying so adds checks and balances to our producer-consumer dynamic.