The paper was https://openreview.net/forum?id=0ZnXGzLcOg and the problem flagged was "Two authors are omitted and one (Kyle Richardson) is added. This paper was published at ICLR 2024." I.e., for one cited paper, the author list was off and the venue was wrong. The citation appeared in the background section and was not fundamental to the validity of the paper. So the citation was not fabricated, but it was incorrectly attributed (perhaps via use of an AI autocomplete).
I think there are some egregious papers in their dataset, and this error does make me pause to wonder how much of the rest of the paper used AI assistance. That said, the "single error" papers in the dataset seem similar to the one I checked: relatively harmless and minor errors (which would be immediately caught by a DOI checker), and so I have to assume some of these were included in the dataset mainly to amplify the author's product pitch. It succeeded.
And this is what's operative here. The error spotted, and the entire class of error spotted, is easily checked and verified by a non-domain expert. These are the errors we can confirm readily, with an obvious and unmistakable signature of hallucination.

If these are the only errors, we are not troubled. However: we do not know if these are the only errors; they are merely a signature that the paper was submitted without being thoroughly checked for hallucinations. They are a signature that some LLM was used to generate parts of the paper and that the responsible authors used this LLM without care.
Checking the rest of the paper requires domain expertise, perhaps requires an attempt at reproducing the authors' results. That the rest of the paper is now in doubt, and that this problem is so widespread, threatens the validity of the fundamental activity these papers represent: research.
I am troubled by people using an LLM at all to write academic research papers.
It's a shoddy, irresponsible way to work. And also plagiarism, when you claim authorship of it.
I'd see a failure of the 'author' to catch hallucinations as more like a failure to hide evidence of misconduct.
If academic venues are saying that using an LLM to write your papers is OK ("so long as you look it over for hallucinations"?), then those academic venues deserve every bit of operational pain and damaged reputation that will result.
Google Translate et al were never good enough at this task to actually allow people to use the results for anything professional. Previous tools were limited to getting a rough gloss of what words in another language mean.
But LLMs can be used in this way, and are being used in this way; and this is increasingly allowing non-English-fluent academics to publish papers in English-language journals (thus engaging with the English-language academic community), where previously those academics may have felt "stuck" publishing in what few journals exist for their discipline in their own language.
Would you call this "shoddy" or "irresponsible"?
It reminds me of kids these days and their fancy calculators! Those newfangled doohickeys just aren't reliable, and the kids never realize that they won't always have a calculator on them! Everyone should just do it the good old-fashioned way with slide rules!
Or these darn kids and their unreliable sources like Wikipedia! Everyone knows that you need a nice solid reliable source that's made out of dead trees and fact-checked by up to 3 paid professionals!
Sure, maybe someday LLMs will be able to report facts in a mostly reliable fashion (like a typical calculator), but we're definitely not even close to that yet, so until we are the skepticism is very much warranted. Especially when the details really do matter, as in scientific research.
LLMs do not work reliably; that's not their purpose.

If you use them that way it's akin to using a butter knife as a screwdriver. You might get away with it once or twice, but then you slip and stab yourself. Better to go find a screwdriver if you need reliability.
I do think they can be used in research but not without careful checking. In my own work I’ve found them most useful as search aids and brainstorming sounding boards.
I made this a separate comment, because it's wildly off topic, but... they actually aren't. Especially for very large numbers or for high precision. When's the last time you did a firmware update on yours?
It's fairly trivial to find lists of calculator flaws and then identify them in research papers. I recall reading a research paper about it in the 00's.
Of course you are right. It is the same with all tools, calculators included, if you use them improperly you get poor results.
In this case they're stochastic, which isn't something people are used to happening with computers yet. You have to understand that and learn how to use them or you will get poor results.
I do think it can be used in research but not without careful checking. In my own work I've found it most useful as a search aid and for brainstorming.
^ this same comment 10 years ago
This is really just restating what I already said in this thread, but you're right. That's because wikipedia isn't a primary source and was never, ever meant to be. You are SUPPOSED to go read it then click through to the primary sources and cite those.
Lots of people use it incorrectly and get bad results because they still haven't realized this... all these years later.
Same thing with treating stochastic LLMs like sources of truth and knowledge. Those folks are just doing it wrong.
In an academic paper, you condense a lot of thinking and work into a writeup.
Why would you blow off the writeup part, and impose AI slop upon the reviewers and the research community?
As a professional mathematician I used Wikipedia all the time to look up quick facts before verifying them myself or elsewhere. A calculator? Well, I can use an actual programming language.
Up until this point, neither of those tools was advertised or used by people to entirely replace human input.
AI People: "AI is a completely unprecedented technology where its introduction is unlike the introduction of any other transformative technology in history! We must treat it totally differently!"
Also AI People: "You're worried about nothing, this is just like when people were worried about the internet."
Also everyone I know has been relying on google scholar for 10+ years. Is that AI-ish? There are definitely errors on there. If you would extrapolate from citation issues to the content in the age of LLMs, were you doing so then as well?
It's the age-old debate about spelling/grammar issues in technical work. In my experience it rarely gets to the point that these errors, e.g. from non-native speakers, affect my interpretation. Others claim to infer shoddy content.
I am unconvinced that the particular error mentioned above is a hallucination, and even less convinced that it is a sign of some kind of rampant use of AI.
I hope to find better examples later in the comment section.
What I find more interesting is how easy these errors are to introduce and how unlikely they are to be caught. As you point out, a DOI checker would immediately flag this. But citation verification isn’t a first-class part of the submission or review workflow today.
We’re still treating citations as narrative text rather than verifiable objects. That implicit trust model worked when volumes were lower, but it doesn’t seem to scale anymore
There's a project I'm working on at Duke University, where we are building a system that tries to address exactly this gap by making references and review labor explicit and machine-verifiable at the infrastructure level. There's a short explainer here that lays out what we mean, if more context is useful: https://liberata.info/
I wouldn't trust today's GPT-5-with-web-search to turn a bullet-point list of papers into proper citations without checking myself, but maybe I will trust GPT-X-plus-agent to do this.
...and including the erroneous entry is squarely the author's fault.
Papers should be carefully crafted, not churned out.
I guess that makes me sweetly naive
Typically when you add a reference you get the info from another paper or copy the BibTeX entry from Google Scholar, but it's really at most 10 minutes of work, more likely 2-5. Every paper might have 5-10 new entries in the bibliography, so that's 1 hour or less of work?
> Papers should be carefully crafted, not churned out.
I think you can say the same thing for code and yet, even with code review, bugs slip by. People aren't perfect and problems happen. Trying to prevent 100% of problems is usually a bad cost/benefit trade-off.
The entire idea of super-detailed citations is itself quite outdated in my view. Sure, citing the work you rely on is important, but that could be done just as well via hyperlinks. It's not like anybody (exclusively) relies on printed versions any more.
Well, the title says "hallucinations", not "fabrications". What you describe sounds exactly like what AI builders call hallucinations.
There was dumb stuff like this before the GPT era; it's far from convincing.
I don’t think the point being made is “errors didn’t happen pre-GPT”, rather the tasks of detecting errors have become increasingly difficult because of the associated effects of GPT.
Did the increase to submissions to NeurIPS from 2020 to 2025 happen because ChatGPT came out in November of 2022? Or was AI getting hotter and hotter during this period, thereby naturally increasing submissions to ... an AI conference?
I'm sure people made mistakes on their bibliographies at that time as well!
And did we all really dig up and read Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller (1953)?
Edited to add: Someone made a chart! Here: https://papercopilot.com/statistics/neurips-statistics/
You can see the big bump after the book-length restriction was lifted, and the exponential rise starting ~2016.
I had to go to the basement of the library, use some sort of weird rotating knob to move a heavy stack of journals over, find some large bound book of the year's journals, and navigate to the paper. When I got to the page, it had been cut out by somebody previous and replaced with a photocopied version.
(I also invested a HUGE amount of my time into my bibliography in every paper I've written as first author, curating a database and writing scripts to format in the various journal formats. This involved multiple independent checks from several sources, repeated several times.

The real challenges there aren't the "biggies" above, though, they're the ones in obscure journals you have to get copies of by inter-library agreements. My PhD was in applied probability and I was always happy if there were enough equations so that I could parse out the French or Russian-language explanation nearby.)
If you didn't, you are lying. Full stop.
If you cite something, yes, I expect that you, at least, went back and read the original citation.
The whole damn point of a citation is to provide a link for the reader. If you didn't find it worth the minimal amount of time to go read, then why would your reader? And why did you inflict it on them?
Also, in my field (economics), by far the biggest source of finding old papers invalid (or less valid, most papers state multiple results) is good old fashioned coding bugs. I'd like to see the software engineers on this site say with a straight face that writing bugs should lead to jail time.
My hand is up.
I do not believe in gaol, but I do agree with the sentiment.
Mr. Turing and his halting problem would like to politely disagree with this assertion.
Getting all possible software correct is impossible, clearly. Getting all the software you release correct is more possible, because you can choose not to release the software that is too hard to prove correct.
Not that the suggestion is practical or likely, but your assertion that it is impossible is incorrect.
There is already a problem with papers falsifying data/samples/etc, LLMs being able to put out plausible papers is just going to make it worse.
On the bright side, maybe this will get the scientific community and science journalists to finally take reproducibility more seriously. I'd love to see future reporting that instead of saying "Research finds amazing chemical x which does y" you see "Researcher reproduces amazing results for chemical x which does y. First discovered by z".
Until we can change how we fund science at a fundamental level, i.e. how we assign grants, it will indeed be a very hard problem to deal with.
But the problem isn’t just funding, it’s time. Successfully running a replication doesn’t get you a publication to help your career.
The question is, how can universities coordinate to add this requirement and gain status from it?
Grant awarding institutions like the NIH and NSF presumably? The NSF has as one of its functions, “to develop and encourage the pursuit of a national policy for the promotion of basic research and education in the sciences”. Encouraging the replication of research as part of graduate degree curricula seems to fall within bounds. And the government’s interest in science isn’t novelty per se, it’s the creation and dissemination of factually correct information that can be useful to its constituents.
> And the government’s interest in science isn’t novelty per se, it’s the creation and dissemination of factually correct information that can be useful to its constituents.
This sounds very naive.
A single university or even department could make this change - reproduction is the important work, reproduction is what earns a PhD. Or require some split, where maybe 20-50% novel work is also expected. Now the incentives are changed. Potentially, this university develops a reputation for reliable research. Others may follow suit.
Presumably, there's a step in this process where money incentivizes the opposite of my suggestion, but I'm not familiar enough with the process to know which step.
Is it the university itself which will be starved of resources if it's not pumping out novel (yet unreproducible) research?
> Is it the university itself which will be starved of resources if it's not pumping out novel (yet unreproducible) research?
Researchers apply for grants to fund their research; the university is generally not paying for it and instead receives a cut of the grant money if it is awarded (i.e. the grant covers the costs to the university for providing the facilities to do the research). If a researcher could get funding to reproduce a result then they could absolutely do it, but that's not what funds are usually being handed out for.
That is good practice
It is rare, not common. Managers and funders pay for features
Unreliable insecure software sells very well, so making reliable secure software is a "waste of money", generally
In a lot of cases, the salary for a grad student or tech is small potatoes next to the cost of the consumables they use in their work.
For example, I work for a lab that does a lot of sequencing, and if we're busy one tech can use $10k worth of reagents in a week.
But two, and more importantly, no one is checking.
Tree falls in the forest, no one hears, yadi-yada.
You'll notice you can click on author names and you'll get links to their various scholar pages but notably DBLP, which makes it easy to see how frequently authors publish with other specific authors.
Some of those authors have very high citation counts... in the thousands, with 3 having over 5k each (one with over 18k).
I think this is the big part of it. There is no incentive to do it even when the study can be reproduced.
The final bit is a thing I think most people miss when they think about replication. A lot of papers don't get replicated directly, but their measurements do when other researchers try to use that data to perform their own experiments, at least in the more physical sciences; this gets tougher the more human-centric the research is. You can't fake or be wrong for long when you're writing papers about the properties of compounds and molecules. Someone is going to come try to base some new idea off your data and find out you're wrong when their experiment doesn't work (or spend months trying to figure out what's wrong and finally double-check the original data).
(People are better about this in psychology, now: schoolchildren are taught about some of the more egregious cases, even before university, and individual researchers are much more willing to take a sceptical view of certain suspect classes of "prevailing understanding". The fact that even I, a non-psychologist, know about this, is good news. But what of the fields whose practitioners don't know they have this problem?)
But without replication being impactful to your career, and with the pressure to quickly and constantly push new work, a failure to reproduce is generally considered a reason to move on and tackle a different domain. It takes longer to trace the failure and the bar is higher to counter an existing work. It's much more likely you've made a subtle mistake. It's much more likely the other work had a subtle success. It's much more likely the other work simply wasn't written such that it could be sufficiently reproduced.
I speak from experience too. I still remember in grad school I was failing to reproduce a work that was the main competitor to the work I had done (I needed to create comparisons). I emailed the author and got no response. Luckily my advisor knew the author's advisor and we got a meeting set up and I got the code. It didn't do what was claimed in the paper and the code structure wasn't what was described either. The result? My work didn't get published and we moved on. The other work was from a top 10 school and the choice was to burn a bridge and put a black mark on my reputation (from someone with far more merit and prestige) or move on.
That type of thing won't change in a reproduction system but needs an open system and open reproduction system as well. Mistakes are common and we shouldn't punish them. The only way to solve these issues is openness
Not if the result you're building off of is a model; you can just assume it.
of course the problem is that academia likes to assert its autonomy (and grant orgs are staffed by academia largely)
academia is too fragmented and extremely inefficient
Most people (that I talk to, at least) in science agree that there's a reproducibility crisis. The challenge is there really isn't a good way to incentivize that work.
Fundamentally (unless you're independently wealthy and funding your own work), you have to measure productivity somehow, whether you're at a university, government lab, or the private sector. That turns out to be very hard to do.
If you measure raw number of papers (more common in developing countries and low-tier universities), you incentivize a flood of junk. Some of it is good, but there is such a tidal wave of shit that most people write off your work as a heuristic based on the other people in your cohort.
So, instead it's more common to try to incorporate how "good" a paper is, to reward people with a high quantity of "good" papers. That's quantifying something subjective though, so you might try to use something like citation count as a proxy: if a work is impactful, usually it gets cited a lot. Eventually you may arrive at something like the H-index, which is defined as the largest number H such that you have written H papers with at least H citations each. Now, the trouble with this method is people won't want to "waste" their time on incremental work.
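To make that definition concrete, here is a minimal sketch (purely illustrative; the function name and the example citation counts are made up) of how an H-index would be computed from a list of per-paper citation counts:

    # Minimal sketch: the H-index is the largest H such that at least
    # H papers have >= H citations each.
    def h_index(citations):
        counts = sorted(citations, reverse=True)
        h = 0
        for rank, count in enumerate(counts, start=1):
            if count >= rank:
                h = rank   # there are `rank` papers with at least `rank` citations
            else:
                break
        return h

    # Example: counts [25, 8, 5, 3, 3, 1] give an H-index of 3
    # (three papers with >= 3 citations, but not four with >= 4).
    print(h_index([25, 8, 5, 3, 3, 1]))  # -> 3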
And that's the struggle here; even if we funded and rewarded people for reproducing results, they will always be bumping up the citation count of the original discoverer. But it's worse than that, because literally nobody is going to cite your work. In 10 years, they just see the original paper, a few citing works reproducing it, and to save time they'll just cite the original paper only.
There's clearly a problem with how we incentivize scientific work. And clearly we want to be in a world where people test reproducibility. However, it's very very hard to get there when one's prestige and livelihood is directly tied to discovery rather than reproducibility.
This would especially help newer grad students learn how to begin to do this sort of research.
Maybe doing enough reproductions could unlock incentives. Like if you do 5 reproductions then the AC would assign your next paper double the reviewers. Or, more invasively, maybe you can't submit to the conference until you complete some reproduction.
1. https://en.wikipedia.org/wiki/Quis_custodiet_ipsos_custodes%...
If you are thinking about this from an academic angle then sure it sounds weird to say "Two Staff jobs in a row from the University of LinkedIn" as a degree. But I submit this as basically the certificate you desire.
What if we got Undergrads (with hope of graduate studies) to do it? Could be a great way to train them on the skills required for research without the pressure of it also being novel?
If you're a tenure-track academic, your livelihood is much safer from having them try new ideas (that you will be the corresponding author on, increasing your prestige and ability to procure funding) instead of incrementing.
And if you already have tenure, maybe you have the undergrad do just that. But the tenure process heavily filters for ambitious researchers, so it's unlikely this would be a priority.
If instead you did it as coursework, you could get them to maybe reproduce the work, but if you only have the students for a semester, that's not enough time to write up the paper and make it through peer review (which can take months between iterations)
It's the Google search algorithm all over again. And it's the certificate trust hierarchy all over again. We keep working on the same problems.
Like the two cases I mentioned, this is a matter of making adjustments until you have the desired result. Never perfect, always improving (well, we hope). This means we need liquidity with the rules and heuristics. How do we best get that?
First X people that reproduce Y get Z percent of patent revenue.
Or something similar.
But nobody wants to pay for it.
sometimes you can just do something new and assume the previous result, but that's more the exception. you're almost always going to at least in part reproduce the previous one. and if issues come up, it's often evident.

that's why citations work as a good proxy. X number of people have done work based around this finding and nobody has seen a clear problem.

there's a problem of people fabricating and fudging data and not making their raw data available ("on request" or with not enough metadata to be useful), which wastes everyone's time and almost never leads to negative consequences for the authors.
The difficult part is surfacing that information to readers of the original paper. The semantic scholar people are beginning to do some work in this area.
give it a published paper and it runs through the papers that have cited it and gives you an evaluation
"Dr Alice failed to reproduce 20 would-be headline-grabbing papers, preventing them from sucking all the air out of the room in cancer research" is something laudable, but we're not lauding it.
No, you do not have to. You give people with the skills and interest in doing research the money. You need to ensure it's spent correctly, that is all. People will be motivated by wanting to build a reputation and the intrinsic reward of the work.
This is exactly what rewarding replication papers (that reproduce and confirm an existing paper) will lead to.
Catch-22 is a fun game to get caught in.
Ban publication of any research that hasn't been reproduced.
Unless it is published, nobody will know about it and thus nobody will try to reproduce it.
Your second point is the important one. AI may be the thing that finally forces the community to take reproducibility, attribution, and verification seriously. That's very much the motivation behind projects like Liberata, which try to shift publishing away from novelty-first narratives and toward explicit credit for replication, verification, and follow-through. If that cultural shift happens, this moment might end up being a painful but necessary correction.
https://blog.plan99.net/replication-studies-cant-fix-science...
Funding replication studies in the current environment would just lead to lots of invalid papers being promoted as "fully replicated" and people would be fooled even harder than they already are. There's got to be a fix for the underlying quality issues before replication becomes the next best thing to do.
HN is very tedious/lazy when it comes to science criticism -- very much agree with you on this.
My only point is replication is necessary to establish validity, even if it is not sufficient. Whether it gives a scientist a false sense of security doesn't change the math of sampling.
I also agree with you on quality issues. I think alternative investment strategies (other than project grants) would be a useful step for reducing perverse incentives, for example. But there's a lot of things science could do.
i don't know how any of that writing generalizes to other parts of academic research. i mean, i know that you say it does, but i don't think it does. what exactly do you think most academic research institutions and the federal government spend money on? for example, wet lab research. you don't know anything about wet lab research. i think if you took a look at a typical e.g. basic science in immunology paper, built on top of mouse models, you would literally lose track of any of its meaning after the first paragraph, you would feed it into chatgpt, and you would struggle to understand the topic well enough to read another immunology paper, you would have an immense challenge talking about it with a researcher in the field. it would take weeks of reading. you have no medicine background, so you wouldn't understand the long horizon context of any of it. you wouldn't be able to "chatbot" your way into it, it would be a real education. so after all of that, would you still be able to write the conclusion you wrote in the medium post? i don't think so, because you would see that by many measures, you cannot generalize a froo-froo policy between "subjective political dispute about COVID-19" writing and wet lab research. you'd gain the wisdom to see that they're different things, and you lack the background, and you'd be much more narrow in what you'd say.
it doesn't even have to be in the particulars, it's just about wisdom. that is my feedback. you are at once saying that there is greater wisdom to be had in the organization and conduct of research, and then, you go and make the highly low wisdom move to generalize about all academic research. which you are obviously doing not because it makes sense to, you're a smart guy. but because you have some unknown beef with "academics" that stems from anger about valid, common but nonetheless subjective political disputes about COVID-19.
- Alzheimers
- Cancer
- Alzheimers
- Skin lesions (first paper discussed in the linked blog post)
- Epidemiology (COVID)
- Epidemiology (COVID, foot and mouth disease, Zika)
- Misinformation/bot studies
- More misinformation/bot studies
- Archaeology/history
- PCR testing (in general, discussion opens with testing of whooping cough)
- Psychology, twice (assuming you count "men would like to be more muscular" as a psych claim)
- Misinformation studies
- COVID (the highlighted errors in the paper are objective, not subjective)
- COVID (the highlighted errors are software bugs, i.e. objective)
- COVID (a fake replication report that didn't successfully replicate anything)
- Public health (from 2010)
- Social science
I don't agree that your summary of this as being about a "valid and common but subjective political dispute" is accurate. There's no politics involved in any of these discussions or problems, just bad science.
Immunology has the same issues as most other medical fields. Sure, there's also fraud that requires genuinely deep expertise to find, but there's plenty that doesn't. Here's a random immunology paper from a few days ago identified as having image duplications, Photoshopping of western blots, numerous irrelevant citations and weird sentence breaks all suggestive that the paper might have been entirely faked or at least partly generated by AI: https://pubpeer.com/publications/FE6C57F66429DE2A9B88FD245DD...
The authors reply, claiming the problems are just rank incompetence, and each time someone finds yet another problem with the paper leading to yet another apology and proclamation of incompetence. It's just another day on PubPeer, nothing special about this paper. I plucked it off the front page. Zero wet lab experience is needed to understand why the exact same image being presented as two different things in two different papers is a problem.
And as for other fields, they're often extremely shallow. I actually am an expert in bot detection but that doesn't help at all in detecting validity errors in social science papers, because they do things like define a bot as anyone who tweets five times after midnight from a smartphone. A 10 year old could notice that this isn't true.
Paper A, by bob, bill, brad. Validated by Paper B by carol, clare, charlotte.
or
Paper A, by bob, bill, brad. Unvalidated.
Google Scholar's PDF reader extension turns every hyperlinked citation into a popout card that shows citation counts inline in the PDF: https://chromewebstore.google.com/detail/google-scholar-pdf-...
I am still reviewing papers that propose solutions based on a technique X, conveniently ignoring research from two years ago that shows that X cannot be used on its own. Both the paper I reviewed and the research showing X cannot be used are in the same venue!
There is also the reality that "one paper" or "one study" can be found contradicting almost anything, so if you just went with "some other paper/study debunks my premise" then you'd end up producing nothing. Plus many on the inside know that there's a lot of slop out there that gets published, so they can (sometimes reasonably IMHO) dismiss that "one paper" even when they do know about it.
It's (mostly) not fraud or malicious intent or ignorance, it's (mostly) humans existing in the system in which they must live.
However, given the feedback by other reviewers, I was the only one who knew that X doesn’t work. I am not sure how these people mark themselves as “experts” in the field if they are not following the literature themselves.
It's like buying a piece of furniture from IKEA, except you just get an Allen key, a hint at what parts to buy, and blurry instructions.
If correct form (LaTeX two-column formatting, quoting the right papers and authors of the year etc.) has been allowing otherwise reject-worthy papers to slip through peer review, academia arguably has bigger problems than LLMs.
Perhaps repro should become the basis of peer review?
There seems to be a rule in every field that "99% of everything is crap." I guess AI adds a few more nines to the end of that.
The gems are lost in a sea of slop.
So I see useless output (e.g. crap on the app store) as having negative value, because it takes up time and space and energy that could have been spent on something good.
My point with all this is that it's not a new problem. It's always been about curation. But curation doesn't scale. It already didn't. I don't know what the answer to that looks like.
This is just article publishers not doing the most basic verification and failing to notice that the citations in the article don't exist.
What this should trigger is a black mark for all of the authors and their institutions, both of which should receive significant reputational repercussions for publishing fake information. If they fake the easiest to verify information (does the cited work exist) what else are they faking?
> to finally take reproducibility more seriously
I've long argued for this, as reproduction is the cornerstone of science. There are a lot of potential ways to do this, but one that I like is linking to the original work. Suppose you're looking at the OpenReview page and they have a link for "reproduction efforts", with at minimum an annotation for confirmation or failure. This is incredibly helpful to the community as a whole. Reproduction failures can be incredibly helpful even when the original work has no fraud. In those cases a reproduction failure reveals important information about the necessary conditions that the original work relies on.
But honestly, we'll never get this until we drop the entire notion of "novel" or "impact" and "publish or perish". Novel is in the eye of the reviewer, and the lower the reviewer's expertise the less novel a work seems (nothing is novel at a high enough level). Impact can almost never be determined a priori, and when it can you already have people chasing those directions because why the fuck would they not? But publish or perish is the biggest sin. It's one of those ideas that looks nice on paper, like you are meaningfully determining who is working hard and who is hardly working. But the truth is that you can't tell without being in the weeds. The real result is that this stifles creativity, novelty, and impact as it forces researchers to chase lower-hanging fruit. Things you're certain will work and can get published. It creates a negative feedback loop as we compete: "X publishes 5 papers a year, why can't you?" I've heard these words even when X has far fewer citations (each of my works had "more impact").

Frankly, I believe fraud would dramatically reduce were researchers not risking job security. The fraud is incentivized by the cutthroat system where you're constantly trying to defend your job, your work, and your grants. There'll always be some fraud but (with a few exceptions) researchers aren't rockstar millionaires. It takes a lot of work to get to the point where fraud even works, so there's a natural filter.
I have the same advice as Mervin Kelly, former director of Bell Labs:
How do you manage genius?
You don't.

> When reached for comment, the NeurIPS board shared the following statement: “The usage of LLMs in papers at AI conferences is rapidly evolving, and NeurIPS is actively monitoring developments. In previous years, we piloted policies regarding the use of LLMs, and in 2025, reviewers were instructed to flag hallucinations. Regarding the findings of this specific work, we emphasize that significantly more effort is required to determine the implications. Even if 1.1% of the papers have one or more incorrect references due to the use of LLMs, the content of the papers themselves are not necessarily invalidated. For example, authors may have given an LLM a partial description of a citation and asked the LLM to produce bibtex (a formatted reference). As always, NeurIPS is committed to evolving the review and authorship process to best ensure scientific rigor and to identify ways that LLMs can be used to enhance author and reviewer capabilities.”
Maybe I'm overreacting, but this feels like an insanely biased response. They found the one potentially innocuous reason and latched onto that as a way to hand-wave the entire problem away.
Science already had a reproducibility problem, and it now has a hallucination problem. Considering the massive influence the private sector has on both the work and the institutions themselves, the future of open science is looking bleak.
They’re right that a citation error doesn’t automatically invalidate the technical content of a paper, and that there are relatively benign ways these mistakes get introduced. But focusing on intent or severity sidesteps the fact that citations, claims, and provenance are still treated as narrative artifacts rather than things we systematically verify
Once that’s the case, the question isn’t whether any single paper is “invalid” but whether the workflow itself is robust under current incentives and tooling.
A student group at Duke has been trying to think about this with Liberata, i.e. what publishing looks like if verification, attribution, and reproducibility are first-class rather than best-effort.

They have a short explainer here that lays out the idea, if more context is useful: https://liberata.info/
How did these 100 sources even get through the validation process?
> Isn't disqualifying X months of potentially great research due to a misformed, but existing reference harsh?
It will serve as a reminder not to cut any corners.
I wouldn't call a misformed reference a critical issue, it happens. That's why we have peer reviews. I would contend drawing superficially valid conclusions from studies through use of AI is a much more burning problem that speaks more to the integrity of the author.
> It will serve as a reminder not to cut any corners.
Or yet another reason to ditch academic work for industry. I doubt the rise of scientific AI tools like AlphaXiv [1], whether you consider them beneficial or detrimental, can be avoided - calling for a level of pragmatism.
Crazy to say this in a discussion where peer review missed hallucinated citations
Seems like CYA, seems like hand wave. Seems like excuses.
It's like arguing against strict liability for drunk driving because maybe somebody accidentally let their grape juice sit too long and they didn't know it was fermented... I can conceive of such a thing, but that doesn't mean we should go easy on drunk driving.
Who would pay them? Conference organizers are already unpaid and understaffed, and most conferences aren't profitable.
I think rejections shouldn't be automatic. Sometimes there are just typos. Sometimes authors don't understand BibTeX. This needs to be done in a way that reduces the workload for reviewers.
One way of doing this would be for GPTZero to annotate each paper during the review step. If reviewers could review a version of each paper with yellow-highlighted "likely-hallucinated" references in the bibliography, then they'd bring it up in their review and they'd know to be on their guard for other probably LLM-isms. If there's only a couple likely typos in the references, then reviewers could understand that, and if they care about it, they'd bring it up in their reviews and the author would have the usual opportunity to rebut.
I don't know if GPTZero is willing to provide this service "for free" to the academic community, but if they are, it's probably worth bringing up at the next PAMI-TC meeting for CVPR.
This statement isn’t wrong, as the rest of the paper could still be correct.
However, when I see a blatant falsification somewhere in a paper I’m immediately suspicious of everything else. Authors who take lazy shortcuts when convenient usually don’t just do it once, they do it wherever they think they can get away with it. It’s a slippery slope from letting an LLM handle citations to letting the LLM write things for you to letting the LLM interpret the data. The latter opens the door to hallucinated results and statistics, as anyone who has experimented with LLMs for data analysis will discover eventually.
Labor is the bottleneck. There aren't enough academics who volunteer to help organize conferences.
(If a reader of this comment is qualified to review papers and wants to step up to the plate and help do some work in this area, please email the program chairs of your favorite conference and let them know. They'll eagerly put you to work.)
One "simple" way of doing this would be to automate it. Have authors step through a lint step when their camera-ready paper is uploaded. Authors would be asked to confirm each reference and link it to a google scholar citation. Maybe the easy references could be auto-populated. Non-public references could be resolved by uploading a signed statement or something.
There's no current way of using this metadata, but it could be nice for future systems.
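For what it's worth, here is a rough sketch of what such a lint pass could look like. This is only an illustration: Google Scholar has no public API, so it assumes each entry carries a DOI and checks it against the Crossref REST API, with an arbitrary title-similarity threshold; the entry in the usage example is hypothetical.

    # Sketch of a bibliography lint step: for each reference with a DOI,
    # ask Crossref what the DOI actually resolves to and compare titles.
    import difflib
    import requests  # third-party: pip install requests

    def check_reference(ref):
        """ref is a dict like {"title": ..., "doi": ...}; returns a verdict string."""
        if not ref.get("doi"):
            return "no DOI: needs manual confirmation"
        resp = requests.get(f"https://api.crossref.org/works/{ref['doi']}", timeout=10)
        if resp.status_code != 200:
            return "DOI not found in Crossref: flag for review"
        titles = resp.json()["message"].get("title", [])
        registered = titles[0] if titles else ""
        similarity = difflib.SequenceMatcher(
            None, ref["title"].lower(), registered.lower()).ratio()
        if similarity < 0.8:  # arbitrary threshold for this sketch
            return f"title mismatch with registered record ({registered!r}): flag for review"
        return "looks consistent"

    # Hypothetical entry, purely for illustration.
    print(check_reference({"title": "An Example Paper Title", "doi": "10.1000/example-doi"}))

A real pass would also need to handle DataCite/arXiv DOIs, preprints without DOIs, and the non-public references mentioned above, but the structure is the same: treat each entry as a verifiable object rather than narrative text.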
Even the Scholar team within Google is woefully understaffed.
My gut tells me that it's probably more efficient to just drag authors who do this into some public execution or twitter mob after-the-fact. CVPR does this every so often for authors who submit the same paper to multiple venues. You don't need a lot of samples for deterrence to take effect. That's kind of what this article is doing, in a sense.
> For example, authors may have given an LLM a partial description of a citation and asked the LLM to produce bibtex

This is equivalent to a typo. I’d like to know which “hallucinations” are completely made up, and which have a corresponding paper but contain some error in how it’s cited. The latter I don’t think matters.

Here's a random one I picked as an example.
Paper: https://openreview.net/pdf?id=IiEtQPGVyV
Reference: Asma Issa, George Mohler, and John Johnson. Paraphrase identification using deep contextualized representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 517–526, 2018.
Asma Issa and John Johnson don't appear to exist. George Mohler does, but it doesn't look like he works in this area (https://www.georgemohler.com/). No paper with that title exists. There are some with sort of similar titles (https://arxiv.org/html/2212.06933v2 for example), but none that really make sense as a citation in this context. EMNLP 2018 exists (https://aclanthology.org/D18-1.pdf), but that page range is not a single paper. There are papers in there that contain the phrases "paraphrase identification" and "deep contextualized representations", so you can see how an LLM might have come up with this title.
Institutions can choose an arbitrary approach to mistakes; maybe they don't mind a lot of them because they want to take risks and be on the bleeding edge. But any flexible attitude towards fabrications is simply corruption. The connected in-crowd will get mercy and the outgroup will get the hammer. Anybody criticizing the differential treatment will be accused of supporting the outgroup fraudsters.
Think of it this way: if I wanted to commit pure academic fraud maliciously, I wouldn't make up a fake reference. Instead, I'd find an existing related paper and merely misrepresent it to support my own claims. That way, the deception is much harder to discover and I'd have plausible deniability -- "oh I just misunderstood what they were saying."
I think most academic fraud happens in the figures, not the citations. Researchers are more likely to be successful at making up data points than making up references, because it's impossible to know without the data files.
In fairness, NeurIPS is just saying out loud what everyone already knows. Most citations in published science are useless junk: it’s either mutual back-scratching to juice h-index, or it’s the embedded and pointless practice of overcitation, like “Human beings need clean water to survive (Franz, 2002)”.
Really, hallucinated citations are just forcing a reckoning which has been overdue for a while now.
Can't say that matches my experience at all. Once I've found a useful paper on a topic thereafter I primarily navigate the literature by traveling up and down the citation graph. It's extremely effective in practice and it's continued to get easier to do as the digitization of metadata has improved over the years.
A somewhat-related parable: I once worked in a larger lab with several subteams submitting to the same conference. Sometimes the work we did was related, so we both cited each other's paper which was also under review at the same venue. (These were flavor citations in the "related work" section for completeness, not material to our arguments.) In the review copy, the reference lists the other paper as written by "anonymous (also under review at XXXX2025)," also emphasized by a footnote to explain the situation to reviewers. When it came time to submit the camera-ready copy, we either removed the anonymization or replaced it with an arxiv link if the other team's paper got rejected. :-) I doubt this practice improved either paper's chances of getting accepted.
Are these the sorts of citation rings you're talking about? If authors misrepresented the work as if it were accepted, or pretended it was published last year or something, I'd agree with you, but it's not too uncommon in my area for well-connected authors to cite manuscripts in process. I don't think it's a problem as long as they don't lean on them.
By using an LLM to fabricate citations, authors are moving away from this noble pursuit of knowledge built on the "shoulders of giants" and show that behind the curtain output volume is what really matters in modern US research communities.
(If you're qualified to review papers, please email the program chair of your favorite conference and let them know -- they really need the help!)
As for my review, the review form has a textbox for a summary, a textbox for strengths, a textbox for weaknesses, and a textbox for overall thoughts. The review I received included one complete set of summary/strengths/weaknesses/closing thoughts in the summary text box, another distinct set of summary/strengths/weaknesses/closing thoughts in the strengths, another complete and distinct review in the weaknesses, and a fourth complete review in the closing thoughts. Each of these four reviews were slightly different and contradicted each other.
The reviewer put my paper down as a weak reject, but also said "the pros greatly outweigh the cons."
They listed "innovative use of synthetic data" as a strength, and "reliance on synthetic data" as a weakness.
He was against establishment dogma, not pro-anti intellectualism.
Including coca cola and Linux!
I won't deny I am terrible at articulating my point, but I will maintain it. We can undeniably say that science, scientific institutions, scientific periodic journals, funding and any other financial instrument constructed to promote scientific advancements is rotten by design and should be abandoned immediately. This joke serves no good.
"But what about muh scientific method?" Yeah yeah yeah, whoever thinks modern science honors logic and reason is part of the problem and has being played, and forever will be
Most big tech PhD intern job postings have NeurIPS/ICML/ICLR/etc. first author paper as a de facto requirement to be considered. It's like getting your SAG card.
If you get one of these internships, it effectively doubles or triples your salary that year right away. You will make more in that summer than your PhD stipend. Plus you can now apply in future summers and the jobs will be easier to get. And it sets your career on a good path.
A conservative estimate of the discounted cash value of a student's first NeurIPS paper would certainly be five figures. It's potentially much higher depending on how you think about it, considering potential path dependent impacts on future career opportunities.
We should not be surprised to see cheating. Nonetheless, it's really bad for science that these attempts get through. I also expect some people did make legitimate mistakes letting AI touch their .bib.
Most industry AI jobs that aren’t research based know that NeurIPS publications are a huge deal. Many of the managers don’t even know what a workshop is (so you can pass off NeurIPS workshop work as just “NeurIPS”)
A single first-author main conference work effectively allows a non-PhD holder to be treated like they have a PhD (be qualified for professional researcher jobs). This means that a decent engineer with 1 NeurIPS publication is easily worth 300K+ YOY assuming US citizen. Even if all they have is a BS ;)
And if you are lucky to get a spotlight or an oral, that’s probably worth closer to 7 figures…
If we grant that good carrots are hard to grow, what's the argument against leaning into the stick? Change university policies and processes so that getting caught fabricating data or submitting a paper with LLM hallucinations is a career ending event. Tip the expected value of unethical behaviours in favour of avoiding them. Maybe we can't change the odds of getting caught but we certainly can change the impact.
This would not be easy, but maybe it's more tractable than changing positive incentives.
i don't think there are any AI detection tools that are sufficiently reliable that I would feel comfortable expelling a student or ending someone's career based on their output.
for example, we can all see what's going on with these papers (and it appears to be even worse among ICLR submissions). but it is possible to make an honest mistake with your BibTeX. Or to use AI for grammar editing, which is widely accepted, and have it accidentally modify a data point or citation. There are many innocent mistakes which also count as plausible excuses.
in some cases further investigation maybe can reveal a smoking gun like fabricated data, which is academic misconduct whether done by hand or because an AI generated the LaTeX tables. punishments should be harsher for this than they are.
It’s for sure plausible that it’s increasing, but I’m certain this kind of thing happened with humans too.
Better detectors, like the article implies, won’t solve the problem, since AI will likely keep improving
It’s about the fact that our publishing workflows implicitly assume good faith manual verification, even as submission volume and AI assisted writing explode. That assumption just doesn’t hold anymore
A student initiative at Duke University has been working on what it might look like to address this at the publishing layer itself, by making references, review labor, and accountability explicit rather than implicit
There’s a short explainer video for their system: https://liberata.info/
It’s hard to argue that the current status quo will scale, so we need novel solutions like this.
Not great, but to be clear this is different from fabricating the whole paper or the authors inventing the citations. (In this case at least.)
GPTZero of course knows this. "100 hallucinations across 53 papers at prestigious conference" hits different than "0.07% of citations had issues, compared to unknown baseline, in papers whose actual findings remain valid."
In the past, a single paper with questionable or falsified results at a top tier conference was big news.
Something that casts doubt on the validity of 53 papers at a top AI conference is at least notable.
> whose actual findings remain valid
Remain valid according to who? The same group that missed hundreds of hallucinated citations?
What is the base rate of bad citations pre-AI?
And finally, yes. Peer review does not mean clicking every link in the footnotes to make sure the original paper didn't mislink, though I'm sure after this brouhaha this too will be automated.
It wasn't just broken links, but citing authors like "lastname, firstname" and made up titles.
I have done peer reviews for a (non-AI) CS conference and did at least skim the citations. For papers related to my domain, I was familiar with most of the citations already, and looked into any that looked odd.
Being familiar with the state of the art is, in theory, what qualifies you to do peer reviews.
Then people's CVs could say "My inventions have led to $1M in licensing revenue" rather than "I presented a useless idea at a decent conference because I managed to make it sound exciting enough to get accepted".
>GPTZero finds 100 new hallucinations in NeurIPS 2025 accepted papers
And I'm left wondering if they mean 100 papers or 100 hallucinations
The subheading says
>GPTZero's analysis 4841 papers accepted by NeurIPS 2025 show there are at least 100 with confirmed hallucinations
Which accidentally a word, but seems to clarify that they do legitimately mean 100 papers.
A later heading says
>Table of 100 Hallucinated Citations in Published Across 53 NeurIPS Papers
Which suggests either the opposite, or that they chose a subset of their findings to point out a coincidentally similar number of incidents.
How many papers did they find hallucinations in? I'm still not certain. Is it 100, 53, or some other number altogether? Does their quality of scrutiny match the quality of their communication? If they did in fact find 100 hallucinations in 53 papers, would the inconsistency against their claim of "papers accepted by NeurIPS 2025 show there are at least 100 with confirmed hallucinations" meet their own bar for a hallucination?
When training a student, normally we expect a lack of knowledge early, and reward self-awareness, self-evaluation and self-disclosure of that.
But the very first epoch of a model training run, when the model has all the ignorance of a dropped plate of spaghetti, we optimize the network to respond to information, as anything from a typical human to an expert, without any base of understanding.
So the training practice for models is inherently extreme enforced “fake it until you make it”, to a degree far beyond any human context or culture.
(Regardless, humans need to verify, not to mention read, the sources they cite. But it will be nice when models can be trusted to accurately assess what they know/don't-know too.)
I guess GPTZero has such a tool. I'm confused why it isn't used more widely by paper authors and reviewers
In my experience you will see considerable variation in citation formats, even in journals that strictly define it and require using BibTeX. And lots of journals leave their citation format rules very vague. It's a problem that runs deep.
a) p-hacking and suppressing null results
b) hallucinations
c) falsifying data
Would be cool to see an analysis of this
You gotta horse trade if you want to win. Take one for the team or get out of the way.
To me, it's no different than stealing a car or tricking an old lady into handing over her fidelity account. You are stealing, and society says stealing is a criminal act.
If they actually committed theft, well then that already is illegal too.
But right now, doing "shitty research" isn't illegal and it's unlikely it ever will be.
If you do a search for "contractor imprisoned for fraud" you'll find plenty of cases where a private contract dispute resulted in criminal convictions for people who took money and then didn't do the work.
I don't know if taking money and then merely pretending to do the research would rise to the level of criminal fraud, but it doesn't seem completely outlandish.
EDIT - The threshold amount varies. Sometimes it's as low as a few hundred dollars. However, the point stands on its own, because there's no universe where the sum in question is in misdemeanor territory.
Most institutions aren't very chill with grant money being misused, so we already don't need to burden the state with getting Johnny municipal prosecutor to try and figure out if gamma crystallization imaging sources were incorrect.
If you're taking public funds (directly or otherwise) with the intent to either:
A) Do little to no real work, and pass off the work of an AI as being your own work, or
B) Knowingly publish falsified data
Then you are, without a single shred of doubt, in criminal fraud territory. Further, the structural damage you inflict when you do the above is orders of magnitude greater than the initial fraud itself. That is a matter for civil courts ("Our company based its development on X's fraudulent data; it cost us Y in damages").
Whether or not charges are pressed is going to happen way after all the internal reviews have demonstrated the person being charged has gone beyond the "honest mistake" threshold. It's like Walmart not bothering to call the cops until you're into felony territory, there's no point in doing so.
These are not all the submissions that they received. The review process can be... brutal for some people (depending on the quality of their submission)
Also: there were 15,000 submissions that were rejected at NeurIPS; it would be very interesting to see what % of those rejected were partially or fully AI generated/hallucinated. Are the ratios comparable?
Sharing code enables others to validate the method on a different dataset.
Even before LLMs came around there were lots of methods that looked good on paper but turned out not to work outside of accepted benchmarks
If I drop a loaded gun and it fires, killing someone, we don't go after the gun's manufacturer in most cases.
Go look up the P320 pistol and the tons of accidental discharges that it's caused.
https://stateline.org/2025/03/10/more-law-enforcement-agenci...
Should be extremely easy for AI to successfully detect hallucinated references as they are semi-structured data with an easily verifiable ground truth.
Publishing is just the way to get grants.
A PI explained it to me once, something like this
Idea(s) -> Grant -> Experiments -> Data -> Paper(s) -> Publication(s) -> Idea(s) -> Grant(s)
That's the current cycle ... remove any step and it's a dead end.
It’s a problem. The previous regime prior to publishing-mania was essentially a clubby game of reputation amongst peers based on cocktail party socialization.
The publication metrics came out of the harder sciences, I believe, and then spread to the softest of humanities. It was always easy to game a bit if you wanted to try, but now it’s trivial to defeat.
But here's the thing: let's say you're an university or a research institution that wants to curtail it. You catch someone producing LLM slop, and you confirm it by analyzing their work and conducting internal interviews. You fire them. The fired researcher goes public saying that they were doing nothing of the sort and that this is a witch hunt. Their blog post makes it to the front page of HN, garnering tons of sympathy and prompting many angry calls to their ex-employer. It gets picked up by some mainstream outlets, too. It happened a bunch of times.
In contrast, there are basically no consequences to institutions that let it slide. No one is angrily calling the employers of the authors of these 100 NeurIPS papers, right? If anything, there's the plausible deniability of "oh, I only asked ChatGPT to reformat the citations, the rest of the paper is 100% legit, my bad".
I'm sure plenty of more nuanced facts are also entirely without basis.
The best possible outcome is that these two purposes are disconflated, with follow-on consequences for the conferences and journals.
In conference publications, it's less common.
Conference publications (like NeurIPS) are treated as announcements of results, not verified work.
These clearly aren't being peer-reviewed, so there's no natural check on LLM usage (which is different than what we see in work published in journals).
We verify: is the stuff correct, and is it worthy of publication (in the given venue) given that it is correct.
There is still some trust in the authors not to submit made-up stuff, though it is diminishing.
Fake references are more common in the introduction where you list relevant material to strengthen your results. They often don't change the validity of the claim, but the potential impact or value.
Consider the unit economics. Suppose NeurIPS gets 20,000 papers in one year. Suppose each author should expect three good reviews, so area chairs assign five reviewers per paper. In total, 100,000 reviews need to be written. It's a lot of work, even before factoring emergency reviewers in.
NeurIPS is one venue alongside CVPR, [IE]CCV, COLM, ICML, EMNLP, and so on. Not all of these conferences are as large as NeurIPS, but the field is smaller than you'd expect. I'd guess there are 300k-1m people in the world who are qualified to review AI papers.
Another problem is that conferences move slowly and it's hard to adjust the publication workflow in such an invasive way. CVPR only recently moved from Microsoft's CMT to OpenReview to accept author submissions, for example.
There's a lot of opportunity for innovation in this space, but it's hard when everyone involved would need to agree to switch to a different workflow.
(Not shooting you down. It's just complicated because the people who would benefit are far away from the people who would need to do the work to support it...)
This says just as much about the humans involved.
When a reviewer is outgunned by the volume of generative slop, the structure of peer review collapses because it was designed for human-to-human accountability, not for verifying high-speed statistical mimicry. In these papers, the hallucinations are a dead giveaway of a total decoupling of intelligence from any underlying "self" or presence. The machine calculates a plausible-looking citation, and an exhausted reviewer fails to notice the "Soul" of the research is missing.
It feels like we’re entering a loop where the simulation is validated by the system, which then becomes the training data for the next generation of simulation. At that point, the human element of research isn't just obscured—it's rendered computationally irrelevant.
But I saw it in Apple News, so MISSION ACCOMPLISHED!
I even know PIs who got fame and funding based on some research direction that supposedly is going to be revolutionary. Except all they had were preliminary results that from one angle, if you squint, you can envision some good result. But then the result never comes. That's why I say, "fake it, and never make it".
As we get more and more papers that may be citing information that was originally hallucinated in the first place we have a major reliability issue here. What is worse is people that did not use AI in the first place will be caught in the crosshairs since they will be referencing incorrect information.
There needs to be a serious amount of education done on what these tools can and cannot do and importantly where they fail. Too many people see these tools as magic since that is what the big companies are pushing them as.
Other than that we need to put in actual repercussions for publishing work created by an LLM without validating it (or just say you can’t in the first place but I guess that ship has sailed) or it will just keep happening. We can’t just ignore it and hope it won’t be a problem.
And yes, humans can make mistakes too. The difference is accountability and the ability to actually be unsure about something so you question yourself to validate.
If we go back to Google, before its transformation into an AI powerhouse — as it gutted its own SERPs, shoving traditional blue links below AI-generated overlords that synthesize answers from the web’s underbelly, often leaving publishers starving for clicks in a zero-click apocalypse — what was happening?
The same kind of human “evaluators” were ranking pages. Pushing garbage forward. The same thing is happening with AI. Just as the human "evaluators" trained search engines to elevate clickbait, the very same humans now train large language models to mimic the judgment of those very same evaluators. A feedback loop of mediocrity — supervised by the... well, not the best among us. The machines still work the way Stephen Wolfram described: for any given sequence (e.g., “The cat sat on the...”), the model doesn’t just pick one word. It calculates a probability score for every single word in its vast vocabulary (e.g., “mat” = 40% chance, “floor” = 15%, “car” = 0.01%), and voilà! — you have a “creative” text: one of a gazillion mindlessly produced, soulless, garbage “vile bile” sludge emissions that pollute our collective brains and render us a bunch of idiots, ready to swallow any corporate poison sent our way.
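For anyone who hasn't seen that mechanism spelled out, here is a toy sketch of the next-token step being described, with completely made-up probabilities; it illustrates the sampling idea only, not any real model's vocabulary, weights, or training:

```python
# Toy next-token sampling with invented probabilities, mirroring the
# "The cat sat on the..." example above. Real models produce these scores
# with a softmax over an enormous vocabulary; this only shows the sampling.
import random

def sample_next_word(probs):
    words = list(probs)
    weights = [probs[w] for w in words]  # weights need not sum to 1
    return random.choices(words, weights=weights, k=1)[0]

context = "The cat sat on the"
next_word_probs = {"mat": 0.40, "floor": 0.15, "sofa": 0.10, "car": 0.0001}
print(context, sample_next_word(next_word_probs))
```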
In my opinion, even worse: the corporates are pushing toward “safety” (likely from lawsuits), and the AI systems are trained to sell, soothe, and please — not to think, or enhance our collective experience.
AI Overview: Based on the research, [Chen and N. Flammarion (2022)](https://gptzero.me/news/neurips/) investigate why Sharpness-Aware Minimization (SAM) generalizes better than SGD, focusing on optimization perspectives
The link is a link to the OP web page calling the "research" a hallucination.
The problem is consequences (lack of).
Doing this should get you barred from research. It won’t.
One thing that has bothered me for a very long time is that computer science (and I assume other scientific fields) has long since decided that English is the lingua franca, and if you don't speak it you can't be part of it. Can you imagine being told that you could only do your research if you were able to write technical papers in a language you didn't speak, maybe even using glyphs you didn't know? It's crazy when you think about it even a little bit, but we ask it of so many. And that's before considering that 90% of the English-speaking population couldn't crank out a paper to the required vocabulary level anyway.
A very legitimate, not trying to cheat, use for LLMs is translation. While it would be an extremely broad and dangerous brush to paint with, I wonder if there is a correlation between English-as-a-Second (or even third)-Language authors and the hallucinations. That would indicate that they were trying to use LLMs to help craft the paper to the expected writing level. The only problem being that it sometimes mangles citations, and if you've done good work and got 25+ citations, it's easy for those errors to slip through.
Just ask authors to submit their bib file so we don't need to do OCR on the PDF. Flag the unknown citations and ask reviewers to verify their existence. Then contact authors and ban if they can't produce the cited work.
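Something like the sketch below would go a long way (my own rough illustration, not an existing OpenReview feature): parse the submitted .bib, look each title up against Crossref's bibliographic search, and hand reviewers a list of entries to double-check. The regex parsing and the matching rule are placeholders; a real pipeline would use a proper BibTeX parser and fuzzier matching.

```python
# Rough sketch of the "submit your .bib" idea: flag entries whose titles
# Crossref can't find, so reviewers know exactly what to verify by hand.
import re
import requests

def bib_titles(bib_text: str):
    # Naive extraction of title = {...} fields; use a real BibTeX parser
    # (e.g. bibtexparser) in practice.
    return re.findall(r'title\s*=\s*[{"](.+?)[}"]\s*,?\s*$', bib_text, flags=re.I | re.M)

def flag_for_review(bib_text: str):
    flagged = []
    for title in bib_titles(bib_text):
        resp = requests.get(
            "https://api.crossref.org/works",
            params={"query.bibliographic": title, "rows": 1},
            timeout=10,
        )
        items = resp.json().get("message", {}).get("items", [])
        found = items[0]["title"][0].lower() if items and items[0].get("title") else ""
        # Flag when neither title contains the other -- crude, but it surfaces
        # candidates for a human to check rather than auto-rejecting anything.
        if title.lower() not in found and found not in title.lower():
            flagged.append(title)
    return flagged
```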
This is low hanging fruit here!
Detecting slop where the authors vet citations is much harder. The big problem with all the review rules is they have no teeth. If it were up to me we'd review in the open, or at least like ICLR. Publish the list of known bad actors and let us look at the network. The current system is too protective of egregious errors like plagiarism. Authors can get detected in one conference, pull the submission, and resubmit to another, rolling the dice. We can't allow that to happen, and we should discourage people from associating with these con artists.
AI is certainly a problem in the world of science review, but it's far from the only one and I'm not even convinced it's the biggest. The biggest is just that reviewers are lazy and/or not qualified to review the works they're assigned. It takes at least an hour to properly review a paper in your niche, much more when it's outside it. We're overworked as it is, with 5+ works to review, not to mention all the time we have to spend reworking our own works that were rejected by the slot machine. We could do much better if we dropped this notion of conference/journal prestige and focused on the quality of the works and reviews.
Addressing those issues also addresses the AI issues because, frankly, *it doesn't matter if the whole work was done by AI, what matters is if the work is real.*
No one cares about citations. They are hallucinated because they are required to be present for political reasons, even though they have no relevance.
Many such cases of this. More than 100!
They claim to have custom detection for GPT-5, Gemini, and Claude. They're making that up!
Most people getting flagged are getting flagged because they actually used AI and couldn’t even be bothered to manually deslop it.
People who are too lazy to put even a tiny bit of human intentionality into their work deserve it.
Although then why not just cite existing papers for bogus reasons?
There need to be dis-incentives for sloppy work. There is a tension between quality and quantity in almost every product. Unfortunately academia has become a numbers-game with paper-mills.
This feels a bit like the "LED stoplights shouldn't be used because they don't melt snow" argument.
Thank you for that perfect example of a strawman argument! No, spellcheckers that use AI are not the main concern behind disclosing the use of AI in generating scientific papers, government reports, or any large block of nonfiction text that you paid for and that is supposed to make sense.
Maybe? There's certainly a push to force the perception of inevitability.
What people are pissed about is the fact their tax dollars fund fake research. It's just fraud, pure and simple. And fraud should be punished brutally, especially in these cases, because the long tail of negative effects produces enormous damage.
For people who think this is too harsh, just remember we aren't talking about undergrads who cheat on a course paper here. We're talking about people who were given money (often from taxpayers) who committed fraud. This is textbook white collar crime, not some kid being lazy. At a minimum we should be taking all that money back from them and barring them from ever receiving grant money again. In some cases I think fines exceeding the money they received would be appropriate.