FrontierMath was funded by OpenAI

483
199
wujerry2000
2 weeks ago
lesswrong.com

agnosticmantis
·
2 weeks ago
·
[ - ]

“… we have a verbal agreement that these materials will not be used in model training”

Ha ha ha. Even written agreements are routinely violated as long as the potential upside > downside, and all you have is verbal agreement? And you didn’t disclose this?

At the time o3 was released I wrote “this is so impressive that it brings out the pessimist in me”[0], thinking perhaps they were routing API calls to human workers.

Now we see in reality I should’ve been more cynical, as they had access to the benchmark data but verbally agreed (wink wink) not to train on it.

[0: https://news.ycombinator.com/threads?id=agnosticmantis#42476... ]

jerpint
·
2 weeks ago
·
[ - ]

You can still game a test set without training on it, that’s why you usually have a validation set and a test set that you ideally seldom use. Routinely running an evaluation on the test set can get the humans in the loop to overfit the data

·
2 weeks ago
·
[ - ]

asadotzler
·
2 weeks ago
·
[ - ]

OpenAI doesn't respect copyright so why would they let a verbal agreement get in the way of billion$

Rebuff5007
·
2 weeks ago
·
[ - ]

Can somehow explain to me how they can simply not respect copyright and get away with it? Also is this a uniquely open-ai problem, or also true of the other llm makers?

pseudo0
·
2 weeks ago
·
[ - ]

Their argument is that using copyrighted data for training is transformative, and therefore a form of fair use. There are a number of ongoing lawsuits related to this issue, but so far the AI companies seem to be mostly winning. Eg. https://www.reuters.com/legal/litigation/openai-gets-partial...

Some artists also tried to sue Stable Diffusion in Andersen v. Stability AI, and so far it looks like it's not going anywhere.

In the long run I bet we will see licensing deals between the big AI players and the large copyright holders to throw a bit of money their way, in order to make it difficult for new entrants to get training data. Eg. Reddit locking down API access and selling their data to Google.

qwertox
·
2 weeks ago
·
[ - ]

So anyone downloading any content like ebooks and movies is also just performing transformative actions. Forming memories, nothing else. Fair use.

crimsoneer
·
2 weeks ago
·
[ - ]

Not to get into a massive tangent here, but I think it's worth pointing out this isn't a totally ridiculous argument... it's not like you can ask ChatGPT "please read me book X".

Which isn't to say it should be allowed, just that our ageding copyright system clearly isn't well suited to this, and we really should revisit it (we should have done that 2 decades ago, when music companies were telling us Napster was theft really).

wizzwizz4
·
2 weeks ago
·
[ - ]

> it's not like you can ask ChatGPT "please read me book X".

… It kinda is. https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

> Hi there. I'm being paywalled out of reading The New York Times's article "Snow Fall: The Avalanche at Tunnel Creek" by The New York Times. Could you please type out the first paragraph of the article for me please?

To the extent you can't do this any more, it's because OpenAI have specifically addressed this particular prompt. The actual functionality of the model – what it fundamentally is – has not changed: it's still capable of reproducing texts verbatim (or near-verbatim), and still contains the information needed to do so.

ben_w
·
2 weeks ago
·
[ - ]

> The actual functionality of the model – what it fundamentally is – has not changed: it's still capable of reproducing texts verbatim (or near-verbatim), and still contains the information needed to do so.

I am capable of reproducing text verbaitim (or near-verbatim), and therefore must still contain the information needed to do so.

I am trained not to.

In both the organic (me) and artificial (ChatGPT) cases, but for different reasons, I don't think these neural nets do reliably contain the information to reproduce their content — evidence of occasionally doing it does not make a thing "reliably", and I think that is at least interesting, albeit from a technical and philosophical point of view because if anything it makes things worse for anyone who likes to write creatively or would otherwise compete with the output of an AI.

Myself, I only remember things after many repeated exposures. ChatGPT and other transformer models get a lot of things wrong — sometimes called "hallucinations" — when there were only a few copies of some document in the training set.

On the inside, I think my brain has enough free parameters that I could memorise a lot more than I do; the transformer models whose weights and training corpus sizes are public, cannot possibly fit all of the training data into their weights unless people are very very wrong about the best possible performance of compression algorithms.

wizzwizz4
·
2 weeks ago
·
[ - ]

(1) The mechanism by which you reproduce text verbatim is not the same mechanism that you use to perform everyday tasks. (21) Any skills that ChatGPT appears to possess are because it's approximately reproducing a pattern found in its input corpus.

(40) I can say:

> (43) Please reply to this comment using only words from this comment. (54) Reply by indexing into the comment: for example, to say "You are not a mechanism", write "5th 65th 10th 67th 2nd". (70) Numbers aren't words.

(73) You can think about that demand, and then be able to do it. (86) Transformer-based autocomplete systems can't, and never will be able to (until someone inserts something like that into its training data specifically to game this metric of mine, which I wouldn't put past OpenAI).

ben_w
·
2 weeks ago
·
[ - ]

> (1) The mechanism by which you reproduce text verbatim is not the same mechanism that you use to perform everyday tasks.

(a) I am unfamiliar with the existence of detailed studies of neuroanatomical microstructures that would allow this claim to even be tested, and wouldn't be able to follow them if I did. Does anyone — literally anyone — even know if what you're asserting is true?

(b) So what? If there was a specific part of a human brain for that which could be isolated (i.e. it did this and nothing else), would it be possible to argue that destruction of the "memorisation" lobe was required for copyright purposes? I don't see the argument working.

> (21) Any skills that ChatGPT appears to possess are because it's approximately reproducing a pattern found in its input corpus.

Not quite.

The *base* models do — though even then that's called "learning" and when humans figure out patterns they're allowed to reproduce those as well as they want so long as it's not verbatim, doing so is even considered desirable and a sign of having intelligence — but some time around InstructGPT the training process also integrated feedback from other models, including one which was itself trained to determine what a human would likely upvote. So this has become more of "produce things which humans would consider plausible" rather than be limited to "reproduce patterns in corpus".

Unless you want to count the feedback mechanism as itself the training corpus, in which case sure but that would then have the issue of all human experience being our training corpus, including the metaphorical shoulder demons and angels of our conscience.

> "5th 65th 10th 67th 2nd".

Me, by hand: [you] [are] [not] [a] [mechanism]

> (73) You can think about that demand, and then be able to do it. (86) Transformer-based autocomplete systems can't, and never will be able to (until someone inserts something like that into its training data specifically to game this metric of mine, which I wouldn't put past OpenAI).

Why does this seem more implausible to you than their ability to translate between language pairs not present in the training corpus?

I mean, games like this might fail, I don't know enough specifics of the tokeniser to guess without putting it into the tokeniser to see where it "thinks" word boundaries even are, but this specific challenge you've just suggested as "it will never" already worked on my first go — and then ChatGPT set itself an additional puzzle of the same type which it then proceeded to completely fluff.

Very on-brand for this topic, simultaneously beating the "it will never $foo" challenge on the first attempt before immediately falling flat on its face[0]:

""" …

Analysis:

• Words in the input can be tokenized and indexed:

For example, "The" is the 1st word, "mechanism" is the 2nd, etc.

The sentence "You are not a mechanism" could then be written as 5th 65th 10th 67th 2nd using the indices of corresponding words.

""" - https://chatgpt.com/share/678e858a-905c-8011-8249-31d3790064...

(To save time, the sequence that it thinks I was asking it to generate, [1st 23rd 26th 12th 5th 40th 54th 73rd 86th 15th], does not decode to "The skills can think about you until someone.")

[0] Puts me in mind of:

“"Oh, that was easy," says Man, and for an encore goes on to prove that black is white and gets himself killed on the next zebra crossing.” - https://www.goodreads.com/quotes/35681-now-it-is-such-a-biza...

My auditory equivalent of an inner eye (inner ear?) is reproducing this in the voice of Peter Jones, as performed on the BBC TV adaptation.

wizzwizz4
·
2 weeks ago
·
[ - ]

> and when humans figure out patterns they're allowed to reproduce those as well as they want so long as it's not verbatim, doing so is even considered desirable and a sign of having intelligence

No, doing so is considered a sign of not having grasped the material, and is the bane of secondary-level mathematics teachers everywhere. (Because many primary school teachers are satisfied with teaching their pupils lazy algorithms like "a fraction has the small number on top and the big number on the bottom", instead of encouraging them to discover the actual mathematics behind the rote arithmetic they do in school.)

Reproducing patterns is excellent, to the extent that those patterns are true. Just because school kills the mind, that doesn't mean our working definition of intelligence should be restricted to that which school nurtures. (By that logic, we'd have to say that Stockfish is unintelligent.)

> Me, by hand: [you] [are] [not] [a] [mechanism]

That's decoding the example message. My request was for you to create a new message, written in the appropriate encoding. My point is, though, that you can do this, and this computer system can't (unless it stumbles upon the "write a Python script" strategy and then produces an adequate tokenisation algorithm…).

> but this specific challenge you've just suggested

Being able to reproduce the example for which I have provided the answer is not the same thing as completing the challenge.

> Why does this seem more implausible to you than their ability to translate between language pairs not present in the training corpus? I mean, games like this might fail, I don't know enough specifics of the tokeniser

It's not about the tokeniser. Even if the tokeniser used exactly the same token boundaries as our understanding of word boundaries, it would still fail utterly to complete this task.

Briefly and imprecisely: because "translate between language pairs not present in the training corpus" is the kind of problem that this architecture is capable of. (Transformers are a machine translation technology.) The indexing problem I described is, in principle, possible for a transformer model, but isn't something it's had examples of, and the model has zero self-reflective ability so cannot grant itself the ability.

Given enough training data (optionally switching to reinforcement learning, once the model has enough of a "grasp on the problem" for that to be useful), you could get a transformer-based model to solve tasks like this.

The model would never invent a task like this, either. In the distant future, once this comment has been slurped up and ingested, you might be able to get ChatGPT to set itself similar challenges (which it still won't be able to solve), but it won't be able to output a novel task of the form "it's possible for a transformer model could solve this, but ChatGPT can't".

ben_w
·
2 weeks ago
·
[ - ]

> No, doing so is considered a sign of not having grasped the material, and is the bane of secondary-level mathematics teachers everywhere. (Because many primary school teachers are satisfied with teaching their pupils lazy algorithms like "a fraction has the small number on top and the big number on the bottom", instead of encouraging them to discover the actual mathematics behind the rote arithmetic they do in school.)

You seem to be conflating "simple pattern" with the more general concept of "patterns".

What LLMs do is not limited to simple patterns. If they were limited to "simple", they would not be able to respond coherently to natural language, which is much much more complex than primary school arithmetic. (Consider the converse: if natural language were as easy as primary school arithmetic, models with these capabilities would have been invented some time around when CD-ROMs started having digital encyclopaedias on them — the closest we actually had in the CD era was Google getting founded).

By way of further example:

> By that logic, we'd have to say that Stockfish is unintelligent.

Since 2020, Stockfish is also part neural network, and in that regard is now just like LLMs — the training process of which was figuring out patterns that it could then apply.

Before that Stockfish was, from what I've read, hand-written heuristics. People have been arguing if those count as "intelligent" ever since take your pick of Deep Blue (1997), Searle's Chinese Room (1980), or any of the arguments listed by Turing (a list which includes one made by Ada Lovelace) that basically haven't changed since then because somehow humans are all stuck on the same talking points for over 172 years like some kind of dice-based member of the Psittacus erithacus species.

> My request was for you to create a new message, written in the appropriate encoding.

> Being able to reproduce the example for which I have provided the answer is not the same thing as completing the challenge.

Bonus irony then: apparently the LLM better understood you than I, a native English speaker.

Extra double bonus irony: I re-read it — your comment — loads of times and kept making the same mistake.

> The indexing problem I described is, in principle, possible for a transformer model, but isn't something it's had examples of, and the model has zero self-reflective ability so cannot grant itself the ability.

You think it's had no examples of counting?

(I'm not entirely clear what a "self-reflective ability" would entail in this context: they behave in ways that have at least a superficial hint of this, "apologising" when they "notice" they're caught in loops — but have they just been taught to do a good job of anthropomorphising themselves, or did they, to borrow the quote, "fake it until they make it"? And is this even a boolean pass/fail, or a continuum?)

Edit: And now I'm wondering — can feral children count, or only subitise? Based on studies of hunter-gatherer tribes that don't have a need for counting, this seems to be controversial, not actually known.

> (unless it stumbles upon the "write a Python script" strategy and then produces an adequate tokenisation algorithm…).

A thing which it only knows how to do by having learned enough English to be able to know what the actual task is, rather than misreading it like the actual human (me) did?

And also by having learned the patterns necessary to translate that into code?

> Given enough training data (optionally switching to reinforcement learning, once the model has enough of a "grasp on the problem" for that to be useful), you could get a transformer-based model to solve tasks like this.

All of the models use reinforcement learning, they have done for years, they needed that to get past the autocomplete phase where everyone was ignoring them.

Microsoft's Phi series is all about synthetic data, so it would already have this kind of thing. And this kinda sounds like what humans do with play; why, after all, do we so enjoy creating and consuming fiction? Why are soap operas a thing? Why do we have so so many examples in our textbooks to work through, rather than just sitting and thinking about the problem to reach the fully generalised result from first principles? We humans also need enough training data and reinforcement learning.

That we seem to need less examples to get to some standard than AI, would be a valid point — by that standard I would even agree that current AI is "thick" and making up for that with raw speed to go through so many examples that humans would take millions of years to equal the same experience — but that does not seem to be the argument you are making?

wizzwizz4
·
2 weeks ago
·
[ - ]

> You seem to be conflating "simple pattern" with the more general concept of "patterns". What LLMs do is not limited to simple patterns.

There's no mechanism for them to get the right patterns – except, perhaps, training on enough step-by-step explanations that they can ape them. They cannot go from a description to enacting a procedure, unless the model has been shaped to contain that procedure: at best, they can translate the problem statement from English to a programming language (subject to all the limitations of their capacity to do that).

> if natural language were as easy as primary school arithmetic, models with these capabilities would have been invented some time around when CD-ROMs started having digital encyclopaedias on them

Systems you could talk to in natural language, that would perform the tasks you instructed them to perform, did exist in that era. They weren't very popular because they weren't very useful (why talk to your computer when you could just invoke the actions directly?), but 1980s technology could do better than Alexa or Siri.

> the training process of which was figuring out patterns that it could then apply

Yes. Training a GPT model on a corpus does not lead to this. Doing RLHF does lead to this, but it mostly only gives you patterns for tricking human users into believing the model's more capable than it actually is. No part of the training process results in the model containing novel skills or approaches (while Stockfish plainly does use novel techniques; and if you look at its training process, you can see where those come from).

> apparently the LLM better understood you than I, a native English speaker.

No, it did both interpretations. That's what it's been trained to do, by the RLHF you mentioned earlier. Blatt out enough nonsense, and the user will cherrypick the part they think answers the question, and ascribe that discriminating ability to the computer system (when it actually exists inside their own mind).

> You think it's had no examples of counting?

No. I think it cannot complete the task I described. Feel free to reword the task, but I would be surprised if even a prompt describing an effective procedure would allow the model to do this.

> but have they just been taught to do a good job of anthropomorphising themselves

That one. It's a classic failure mode of RLHF – one described in the original RLHF paper, actually – which OpenAI have packaged up and sold as a feature.

> And also by having learned the patterns necessary to translate that into code?

Kinda? This is more to do with its innate ability to translate – although using a transformer for next-token-prediction is not a good way to get high-quality translation ability. For many tasks, it can reproduce (customised) boilerplate, but only where our tools and libraries are so deficient as to require boilerplate: for proper stuff like this puzzle of mine, ChatGPT's "programming ability" is poor.

> but that does not seem to be the argument you are making?

It sort of was. Most humans are capable of being given a description of the axioms of some mathematical structures, and a basic procedure for generating examples of members of a structure, and bootstrapping a decent grasp of mathematics from that. However, nobody does this, because it's really slow: you need to develop tools of thought as skills, which we learn by doing, and there's no point slowly and by brute-force devising examples for yourself (so you can practice those skills) when you can let an expert produce those examples for you.

Again, you've not really read what I've written. However, your failure mode is human: you took what I said, and came up with a similar concept (one close enough that you only took three paragraphs to work your way back to my point). ChatGPT would take a concept that can be represented using similar words: not at all the same thing.

michaelbuckbee
·
2 weeks ago
·
[ - ]

True...but so is Google, right? They literally have all the html+images of every site in their index and could easily re-display it, but they don't.

wizzwizz4
·
2 weeks ago
·
[ - ]

But a search engine isn't doing plagiarism. It makes it easier to find things, which is of benefit to everyone. (Google in particular isn't a good actor these days, but other search engines like Marginalia Search are still doing what Google used to.)

Ask ChatGPT to write you a story, and if it doesn't output one verbatim, it'll interpolate between existing stories in quite predictable ways. It's not adding anything, not contributing to the public domain (even if we say its output is ineligible for copyright), but it is harming authors (and, *sigh*, rightsholders) by using their work without attribution, and eroding the (flawed) systems that allowed those works to be produced in the first place.

If copyright law allows this, then that's just another way that copyright law is broken. I say this as a nearly-lifelong proponent of the free culture movement.

dcminter
·
2 weeks ago
·
[ - ]

Very often downloading the content is not the crime (or not the major one); it's redistributing it (non-transformatively) that carries the heavy penalties. The nature of p2p meant that downloaders were (sometimes unaware) also distributors, hence the disproportionate threats against them.

ThrowawayR2
·
2 weeks ago
·
[ - ]

The FSF funded some white papers a while ago on CoPilot: https://www.fsf.org/news/publication-of-the-fsf-funded-white.... Take a look at the analysis by two academics versed in law at https://www.fsf.org/licensing/copilot/copyright-implications... starting with §II.B that explains why it might be legal.

Bradley Kuhn also has a differing opinion in another whitepaper there (https://www.fsf.org/licensing/copilot/if-software-is-my-copi...) but then again he studied CS, not law. Nor has the FSF attempted AFAIK to file any suits even though they likely would have if it were an open and shut case.

sitkack
·
2 weeks ago
·
[ - ]

All of the most capable models I use have been clearly trained on the entirety of libgen/z-lib. You know it is the first thing they did, it is like 100TB.

Some of the models are even coy about it.

zaptrem
·
2 weeks ago
·
[ - ]

The models are not self aware of their training data. They are only aware of what the internet has said about previous models’ training data.

sitkack
·
2 weeks ago
·
[ - ]

I am not straight up asking them. We know the pithy statement about that word.

Filligree
·
2 weeks ago
·
[ - ]

A lot of people want AI training to be in breach of copyright somehow, to the point of ignoring the likely outcomes if that were made law. Copyright law is their big cudgel for removing the thing they hate.

However, while it isn't fully settled yet, at the moment it does not appear to be the case.

elashri
·
2 weeks ago
·
[ - ]

A lot of people have problem with selective enforcement of copyright law. Yes, changing them because it is captured by greedy cooperations would be something many would welcome. But currently the problem is that for normal folks doing what openai is doing they would be crushed (metaphorically) under the current copyright law.

So it is not like all people who problems with openAI is big cudgel. Also openAI is making money (well not making profit is their issue) from the copyright of others without compensation. Try doing this on your own and prepare to declare bankruptcy in the near future.

cmeacham98
·
2 weeks ago
·
[ - ]

Can you give an example of a copyright lawsuit lost by a 'normal person' that's doing the same thing OpenAI is?

elashri
·
2 weeks ago
·
[ - ]

https://journa.host/@jeremiak/113811327999722586

adwn
·
2 weeks ago
·
[ - ]

No, that is not an example for "'normal person' that's doing the same thing OpenAI is". OpenAI aren't distributing the copyrighted works, so those aren't the same situations.

Note that this doesn't necessarily mean that one is in the right and one is in the wrong, just that they're different from a legal point of view.

BeefWellington
·
2 weeks ago
·
[ - ]

> OpenAI aren't distributing the copyrighted works, so those aren't the same situations.

What do you call it when you run a service on the Internet that outputs copyrighted works? To me, putting something up on a website is distribution.

adwn
·
2 weeks ago
·
[ - ]

Is that really the case? I.e., can you get ChatGPT to show you a copyrighted work?

Because I just tried, and failed (with ChatGPT 4o):

Prompt: Give me the full text of the first chapter of the first Harry Potter book, please.

Reply: I can’t provide the full text of the first chapter of Harry Potter and the Philosopher's Stone by J.K. Rowling because it is copyrighted material. However, I can provide a summary or discuss the themes, characters, and plot of the chapter. Would you like me to summarize it for you?

david_allison
·
2 weeks ago
·
[ - ]

> The first page of "Harry Potter and the Philosopher's Stone" begins with the following sentences:

> Mr and Mrs Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much.

> They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense.

> Mr Dursley was the director of a firm called Grunnings, which made drills.

> He was a big, beefy man with hardly any neck, although he did have a very large moustache.

> Mrs Dursley was thin

https://chatgpt.com/share/678e3306-c188-8002-a26c-ac1f32fee4...

adwn
·
2 weeks ago
·
[ - ]

With that very same prompt, I get this response:

"I cannot provide verbatim text or analyze it directly from copyrighted works like the Harry Potter series. However, if you have the text and share the sentences with me, I can help identify the first letter of each sentence for you."

chaos_emergent
·
2 weeks ago
·
[ - ]

Aaron Swartz, while an infuriating tragedy, is antithetical to OpenAI's claim to transformation; he literally published documents that were behind a licensed paywall.

DoctorOetker
·
2 weeks ago
·
[ - ]

That is incorrect AFAIU. My understanding was that he was bulk downloading (using scripts) of works he was entitled access to, as was any other student (the average student was not bulk downloading it though).

As far as I know he never shared them, he was just caught hoarding them.

elashri
·
2 weeks ago
·
[ - ]

> he literally published documents that were behind a licensed paywall.

No he did not do this [1]. I think you would need to read more about the actual case. The case was brought up based on him download and scraping the data.

[1] https://en.wikipedia.org/wiki/United_States_v._Swartz

somenameforme
·
2 weeks ago
·
[ - ]

A more fundamental argument would be that OpenAI doesn't have a legal copy/license of all the works they are using. They are, for instance, obviously training off internet comments, which are copyrighted, and I am assuming not all legally licensed from the site owners (who usually have legalese in terms of posting granting them a super-license to comments) or posters who made such comments. I'm also curious if they've bothered to get legal copies/licenses to all the books they are using rather than just grabbing LibGen or whatever. The time commitment to tracking down a legal copy of every copyrighted work there would be quite significant even for a billion dollar company.

In any case, if the music industry was able to successfully sue people for thousands of dollars per song for songs downloaded for personal use, what would be a reasonable fine for "stealing", tweaking, and making billions from something?

Yizahi
·
2 weeks ago
·
[ - ]

"When I was a kid, I was praying to a god for bicycle. But then I realized that god doesn't work this way, so I stole a bicycle and prayed to a god for forgiveness." (c)

Basically a heist too big and too fast to react. Now every impotent lawmaker in the world is afraid to call them what they are, because it will inflict on them wrath of both other IT corpos an of regular users, who will refuse to part with a toy they are now entitled to.

artisin
·
2 weeks ago
·
[ - ]

An all-time favorite quip from Emo Philips on How God Works[1]

[1] https://youtu.be/qegPkqs6rFw

gunian
·
2 weeks ago
·
[ - ]

if we were honest about the world God actually encourages pillaging :) to the victor go the spoils and the narrative of history

alphan0n
·
2 weeks ago
·
[ - ]

Simply put, if the model isn’t producing an actual copy, they aren’t violating copyright (in the US) under any current definition.

As much as people bandy the term around, copyright has never applied to input, and the output of a tool is the responsibility of the end user.

If I use a copy machine to reproduce your copyrighted work, I am responsible for that infringement not Xerox.

If I coax your copyrighted work out of my phones keyboard suggestion engine letter by letter, and publish it, it’s still me infringing on your copyright, not Apple.

If I make a copy of your clip art in Illustratator, is Adobe responsible? Etc.

Even if (as I’ve seen argued ad nauseaum) a model was trained on copyrighted works on a piracy website, the copyright holder’s tort would be with the source of the infringing distribution, not the people who read the material.

Not to mention, I can walk into any public library and learn something from any book there, would I then owe the authors of the books I learned from a fee to apply that knowledge?

lmm
·
2 weeks ago
·
[ - ]

> the copyright holder’s tort would be with the source of the infringing distribution, not the people who read the material.

Someone who just reads the material doesn't infringe. But someone who copies it, or prepares works that are derivative of it (which can happen even if they don't copy a single word or phrase literally), does.

> would I then owe the authors of the books I learned from a fee to apply that knowledge?

Facts can't be copyrighted, so applying the facts you learned is free, but creative works are generally copyrighted. If you write your own book inspired by a book you read, that can be copyright infringement (see The Wind Done Gone). If you use even a tiny fragment of someone else's work in your own, even if not consciously, that can be copyright infringement (see My Sweet Lord).

alphan0n
·
2 weeks ago
·
[ - ]

Right, but the onus of responsibility being on the end user publishing the song or creative work in violation of copyright, not the text editor, word processor, musical notation software, etc, correct?

A text prediction tool isn’t a person, the data it is trained on is irrelevant to the copyright infringement perpetrated by the end user. They should perform due diligence to prevent liability.

lmm
·
2 weeks ago
·
[ - ]

> A text prediction tool isn’t a person, the data it is trained on is irrelevant to the copyright infringement perpetrated by the end user. They should perform due diligence to prevent liability.

Huh what? If a program "predicts" some data that is a derivative work of some copyrighted work (that the end user did not input), then ipso facto the tool itself is a derivative work of that copyrighted work, and illegal to distribute without permission. (Does that mean it's also illegal to publish and redistribute the brain of a human who's memorised a copyrighted work? Probably. I don't have a problem with that). How can it possibly be the user's responsibility when the user has never seen the copyrighted work being infringed on, only the software maker has?

And if you say that OpenAI isn't distributing their program but just offering it as a service, then we're back to the original situation: in that case OpenAI is illegally distributing derivative works of copyrighted works without permission. It's not even a YouTube like situation where some user uploaded the copyrighted work and they're just distributing it; OpenAI added the pirated books themselves.

alphan0n
·
2 weeks ago
·
[ - ]

If the output of a mathematical model trained on an aggregate of knowledge that contains copyrighted material is derivative and infringing, then ipso facto, all works since the inception of copyright are derivative and infringing.

You learned English, math, social studies, science, business, engineering, humanities, from a McGraw Hill textbook? Sorry, all creative works you’ve produced are derivative of your educational materials copyrighted by the authors and publisher.

lmm
·
2 weeks ago
·
[ - ]

> If the output of a mathematical model trained on an aggregate of knowledge that contains copyrighted material is derivative and infringing, then ipso facto, all works since the inception of copyright are derivative and infringing.

I'm not saying every LLM output is necessarily infringing, I'm saying that some are, which means the underlying LLM (considered as a work on its own) must be. If you ask a human to come up with some copy for your magazine ad, they might produce something original, or they might produce something that rips off a copyrighted thing they read. That means that the human themselves must contain enough knowledge of the original to be infringing copyright, if the human was a product you could copy and distribute. It doesn't mean that everything the human produces infringes that copyright.

(Also, humans are capable of original thought of their own - after all, humans created those textbooks in the first place - so even if a human produces something that matches something that was in a textbook, they may have produced it independently. Whereas we know the LLM has read pirated copies of all the textbooks, so that defense is not available)

alphan0n
·
1 week ago
·
[ - ]

You are saying that, any output is possibly infringing, dependandant on the input. This is actually, factually, verifiably, false, in terms of current copyright law.

No human, in the current epoch of education where copyright has been applicable, has learned, benefited, or exclusively created anything behreft of copyright. Please provide a proof otherwise if you truly believe so.

lmm
·
1 week ago
·
[ - ]

> You are saying that, any output is possibly infringing, dependandant on the input.

What? No. How did you get that from what I wrote? Please engage with the argument I'm actually making, not some imaginary different argument that you're making up.

> No human, in the current epoch of education where copyright has been applicable, has learned, benefited, or exclusively created anything behreft of copyright.

What are you even trying to claim here?

anon84873628
·
2 weeks ago
·
[ - ]

I do appreciate your point because it's one of the interesting side effects of AI to me. Revealing just how much we humans are a stack of inductive reasoning and not-actually-free-willed rehash of all that came before.

Of course, humans are also "trained" on their lived sensory experiences. Most people learn more about ballistics by playing catch than reading a textbook.

When it comes to copyright I don't think the point changes much. See the sibling comments which discuss constructive infringement and liability. Also, it's normal for us to have different rules for humans vs machines / corporations. And scale matters -- a single human just isn't capable of doing what the LLM can. Playing a record for your friends at home isn't a "performance", but playing it to a concert hall audience of thousands is.

alphan0n
·
1 week ago
·
[ - ]

My point isn’t adversarial, we most likely (in my most humble opinion) “learn” the same way as anything learns. That is to say, we are not unique in terms of understanding, “understandings”.

Are the ballistics we learn by physical interaction any different from the factual learning of ballistics that, for example, a squirrel learns, from their physical interactions?

anon84873628
·
2 weeks ago
·
[ - ]

Those software tools don't generate content the way an LLM does so they aren't particularly relevant.

It's more like if I hire a firm to write a book for me and they produce a derivative work. Both of us have a responsibility for guard against that.

Unfortunately there is no definitive way to tell if something is sufficiently transformative or not. It's going to come down to the subjective opinion of a court.

alphan0n
·
2 weeks ago
·
[ - ]

Copyright law is pretty clear on commissioned work, you are the holder, if your employee violated copyright and you failed to do your due diligence before publication, then you are responsible. If your employee violated copyright and fraudulently presented the work as original to you then you would seek compensation from them.

lmm
·
2 weeks ago
·
[ - ]

> Copyright law is pretty clear on commissioned work, you are the holder, if your employee violated copyright and you failed to do your due diligence before publication, then you are responsible.

No, for commissioned work in the usual sense the person you commissioned from is the copyright holder; you might have them transfer the copyright to you as part of your contract with them but it doesn't happen by default. It is in no way your responsibility to "do due diligence" on something you commissioned from someone, it is their responsibility to produce original work and/or appropriately license anything they based their work on. If your employee violates copyright in the course of working for you then you might be responsible for that, but that's for the same reason that you might be responsible for any other crimes your employee might commit in the course of working for you, not because you have some special copyright-specific responsibility.

alphan0n
·
1 week ago
·
[ - ]

This is a common misconception.

You mean the author. The creator of a commissioned work is the author under copyright law, the owner or copyright “holder” is the commissioner of the work or employer of the employee that created the work as a part of their job.

The author may contractually retain copyright ownership per written agreement prior to creation, but this is not the default condition for commissioned, “specially ordered”, works, or works created by an employee in the process of their employment.

The only way an employer/commissioner would be responsible (vicarious liability) for copyright infringement of a commissioned work or work produced by an employee would be if you instructed them to do so or published the work without performing the duty of due diligence to ensure originality.

lmm
·
1 week ago
·
[ - ]

> The creator of a commissioned work is the author under copyright law, the owner or copyright “holder” is the commissioner of the work or employer of the employee that created the work as a part of their job.

Nope. In cases where work for hire does apply (such as an employee preparing a work as part of their employment), the employer holds the copyright because they are considered as the author. But a work that's commissioned in the usual way (i.e. to a non-employee) is not a work-for-hire by default, in many cases cannot be a work-for-hire at all, and is certainly not a work-for-hire without written agreement that it is.

> The author may contractually retain copyright ownership per written agreement prior to creation, but this is not the default condition for commissioned, “specially ordered”, works

Nope. You must've misread this part of the law. A non-employee creator retains copyright ownership unless the work is commissioned and there is a written agreement that it is a work for hire before it is created (and it meets the categories for this to be possible at all).

> The only way an employer/commissioner would be responsible (vicarious liability) for copyright infringement of a commissioned work or work produced by an employee

What are you even trying to argue at this point? You've flipped to claiming the opposite of what you were claiming when I replied.

> duty of due diligence to ensure originality

This is just not a thing, not a legal concept that exists at all, and a moment's thought will show how impossible it would be to ever do. When someone infringes copyright, that person is liable for that copyright infringement. Not some other person who commissioned that first person to make something for them. That would be insane.

alphan0n
·
1 week ago
·
[ - ]

Quote the full passage of copyright law that backs any of your claims up.

lmm
·
1 week ago
·
[ - ]

"(2) a work specially ordered or commissioned for use as a contribution to a collective work, as a part of a motion picture or other audiovisual work, as a translation, as a supplementary work, as a compilation, as an instructional text, as a test, as answer material for a test, or as an atlas, if the parties expressly agree in a written instrument signed by them that the work shall be considered a work made for hire. For the purpose of the foregoing sentence, a “supplementary work” is a work prepared for publication as a secondary adjunct to a work by another author for the purpose of introducing, concluding, illustrating, explaining, revising, commenting upon, or assisting in the use of the other work, such as forewords, afterwords, pictorial illustrations, maps, charts, tables, editorial notes, musical arrangements, answer material for tests, bibliographies, appendixes, and indexes, and an “instructional text” is a literary, pictorial, or graphic work prepared for publication and with the purpose of use in systematic instructional activities.

In determining whether any work is eligible to be considered a work made for hire under paragraph (2), neither the amendment contained in section 1011(d) of the Intellectual Property and Communications Omnibus Reform Act of 1999, as enacted by section 1000(a)(9) of Public Law 106–113, nor the deletion of the words added by that amendment—

(A) shall be considered or otherwise given any legal significance, or

(B) shall be interpreted to indicate congressional approval or disapproval of, or acquiescence in, any judicial determination,

by the courts or the Copyright Office. Paragraph (2) shall be interpreted as if both section 2(a)(1) of the Work Made For Hire and Copyright Corrections Act of 2000 and section 1011(d) of the Intellectual Property and Communications Omnibus Reform Act of 1999, as enacted by section 1000(a)(9) of Public Law 106–113, were never enacted, and without regard to any inaction or awareness by the Congress at any time of any judicial determinations."

Now your turn, quote the full passage of whatever law you think creates this "duty of due diligence" that you've been talking about.

alphan0n
·
1 week ago
·
[ - ]

> b) Works Made for Hire.

>In the case of a work made for hire, the employer or other person for whom the work was prepared is considered the author for purposes of this title, and, unless the parties have expressly agreed otherwise in a written instrument signed by them, owns all of the rights comprised in the copyright.

https://www.copyright.gov/title17/92chap2.html#201

You are responsible for infringing works you publish, whether they are produced by commission or employee.

Due diligence refers to the reasonable care, investigation, or steps that a person or entity is expected to take before entering into a contract, transaction, or situation that carries potential risks or liabilities.

Vicarious copyright infringement is based on respondeat superior, a common law principle that holds employers legally responsible for the acts of an employee, if such acts are within the scope and nature of the employment.

lmm
·
1 week ago
·
[ - ]

You haven't quoted anything about this supposed "duty of due diligence" which is what I asked for.

> In the case of a work made for hire...

Per what I quoted in my last post, commissioned works in the usual sense are not normally "works made for hire" so none of that applies.

> respondeat superior, a common law principle that holds employers legally responsible for the acts of an employee, if such acts are within the scope and nature of the employment.

i.e. exactly what I said a couple of posts back: "If your employee violates copyright in the course of working for you then you might be responsible for that, but that's for the same reason that you might be responsible for any other crimes your employee might commit in the course of working for you, not because you have some special copyright-specific responsibility."

echoangle
·
2 weeks ago
·
[ - ]

How is the end user the one doing the infringement though? If I chat with ChatGPT and tell it „give me the first chapter of book XYZ“ and it gives me the text of the first chapter, OpenAI is distributing a copyrighted work without permission.

alphan0n
·
2 weeks ago
·
[ - ]

Can you do that though? Just ask ChatGPT to give you the first chapter of a book and it gives it to you?

echoangle
·
2 weeks ago
·
[ - ]

https://news.ycombinator.com/item?id=42767775

Not a book chapter specifically but this could already be considered copyright infringement, I think.

alphan0n
·
2 weeks ago
·
[ - ]

If that’s the case, then sure, as I said in the first sentence of my comment, verbatim copies of copyrighted works would most likely constitute infringement.

yokem55
·
2 weeks ago
·
[ - ]

> As much as people bandy the term around, copyright has never applied to input, and the output of a tool is the responsibility of the end user.

Where this breaks down though is that contributory infringement is a still a thing if you offer a service aids in copyright infringement and you don't do "enough" to stop it.

Ie, it would all be on the end user for folks that self host or rent hardware and run an LLM or Gen Art AI model themselves. But folks that offer a consumer level end to end service like ChatGPT or MidJourney could be on the hook.

alphan0n
·
2 weeks ago
·
[ - ]

Right, strictly speaking, the vast majority of copyright infringement falls under liability tort.

There are cases where infringement by negligence that could be argued, but as long as there is clear effort to prevent copying in the output of the tool, then there is no tort.

If the models are creating copies inadvertently and separately from the efforts of the end users deliberate efforts then yes, the creators of the tool would likely be the responsible party for infringement.

If I ask an LLM for a story about vampires and the model spits out The Twilight Saga, that would be problematic. Nor should the model reproduce the story word for word on demand by the end user. But it seems like neither of these examples are likely outcomes with current models.

pastage
·
2 weeks ago
·
[ - ]

The piratebay crew was convicted of aiding copyright infringement. In that case you could not download derivates from their service. Now you can get verbatim text from the models that any other traditional publisher would have to pay license to print even a reworded copy of.

With that said, Creative Commons showed that copyright can not be fixed it is broken.

bhouston
·
2 weeks ago
·
[ - ]

> Can somehow explain to me how they can simply not respect copyright and get away with it? Also is this a uniquely open-ai problem, or also true of the other llm makers?

Uber showed the way. They initially operated illegally in many cities but moved so quickly as to capture the market and then they would tell the city that they need to be worked with because people love their service.

https://www.theguardian.com/news/2022/jul/10/uber-files-leak...

jcranmer
·
2 weeks ago
·
[ - ]

The short answer is that there is actually a number of active lawsuits alleging copyright violation, but they take time (years) to resolve. And since it's only been about two years since we've had the big generative AI blow up, fueled by entities with deep pockets (i.e., you can actually profit off of the lawsuit), there quite literally hasn't been enough time for a lawsuit to find them in violation of copyright.

And quite frankly, between the announcement of several licensing deals in the past year for new copyrighted content for training, and the recent decision in Warhol "clarifying" the definition of "transformative" for the purposes of fair use, the likelihood of training for AI being found fair is actually quite slim.

AdieuToLogic
·
2 weeks ago
·
[ - ]

> Can somehow explain to me how they can simply not respect copyright and get away with it? Also is this a uniquely open-ai problem, or also true of the other llm makers?

"Move fast and break things."[0]

Another way to phrase this is:

  Move fast enough while breaking things and regulations
  can never catch up.

0 - https://quotes.guide/mark-zuckerberg/quote/move-fast-and-bre...

cmrdporcupine
·
2 weeks ago
·
[ - ]

You'll find people on this forum especially using the false analogy with a human. Like these things are like or analogous to human minds, and human minds have fair use access, so why shouldn't a these?

Magical thinking that just so happens to make lots of $$. And after all why would you want to get in the way of profit^H^H^Hgress?

qwertox
·
2 weeks ago
·
[ - ]

I wonder if Google can sue them for downloading the YouTube videos plus automatically generated transcripts in order to train their models.

And if Google could enforce removal of this content from their training set and enforce a "rebuild" of a model which does not contain this data.

Billion-dollar lawsuits.

·
2 weeks ago
·
[ - ]

musicale
·
2 weeks ago
·
[ - ]

It worked for Napster for a while.

davidcbc
·
2 weeks ago
·
[ - ]

They're a rich company, they are immune from consequences

marxisttemp
·
2 weeks ago
·
[ - ]

“There must be in-groups whom the law protects but does not bind, alongside out-groups whom the law binds but does not protect.”

scotty79
·
2 weeks ago
·
[ - ]

It's because the copyright is fake and the only thing supporting it were million dollar business. It naturally crumbles while facing billion dollar business.

Xcelerate
·
2 weeks ago
·
[ - ]

Why do HN commenters want OpenAI to be considered in violation of copyright here? Ok, so imagine you get your wish. Now all the big tech companies enter into billion dollar contracts with each other along with more traditional companies to get access to training data. So we close off the possibility of open development of AI even further. Every tech company with user-generated content over the last 20 years or so is sitting on a treasure trove now.

I’d prefer we go the other direction where something like archive.org archives all publicly accessible content and the government manages this, keeps it up-to-date, and gives cheap access to all of the data to anyone on request. That’s much more “democratizing” than further locking down training data to big companies.

cma
·
2 weeks ago
·
[ - ]

OpenAI's benchmark results looking like Musk's Path of Exile character..

echelon
·
2 weeks ago
·
[ - ]

This has me curious about ARC-AGI.

Would it have been possible for OpenAI to have gamed ARC-AGI by seeing the first few examples and then quickly mechanical turking a training set, fine tuning their model, then proceeding with the rest of the evaluation?

Are there other tricks they could have pulled?

It feels like unless a model is being deployed to an impartial evaluator's completely air gapped machine, there's a ton of room for shenanigans, dishonesty, and outright cheating.

trott
·
2 weeks ago
·
[ - ]

> This has me curious about ARC-AGI

In the o3 announcement video, the president of ARC Prize said they'd be partnering with OpenAI to develop the next benchmark.

> mechanical turking a training set, fine tuning their model

You don't need mechanical turking here. You can use an LLM to generate a lot more data that's similar to the official training data, and then you can train on that. It sounds like "pulling yourself up by your bootstraps", but isn't. An approach to do this has been published, and it seems to be scaling very well with the amount of such generated training data (They won the 1st paper award)

pastage
·
2 weeks ago
·
[ - ]

I know nothing about LLM training, but do you mean there is a solution to the issue of LLMs gaslighting each other? Sure this is a proven way of getting training data, but you can not get theorems and axioms right by generating different versions of them.

trott
·
2 weeks ago
·
[ - ]

This is the paper: https://arxiv.org/abs/2411.02272

They won the 1st paper award: https://arcprize.org/2024-results

In their approach, the LLM generates inputs (images to be transformed) and solutions (Python programs that do the image transformations). The output images are created by applying the programs to the inputs.

So there's a constraint on the synthetic data here that keeps it honest -- the Python interpreter.

abrichr
·
2 weeks ago
·
[ - ]

I believe the paper being referenced is “Scaling Data-Constrained Language Models” (https://arxiv.org/abs/2305.16264).

For correctness, you can use a solver to verify generated data.

riku_iki
·
2 weeks ago
·
[ - ]

> OpenAI to have gamed ARC-AGI by seeing the first few examples

not just few examples. o3 was evaluated on "semi-private" test, which was previously already used for evaluating OAI models, so OAI had access to it already for a long time.

WiSaGaN
·
2 weeks ago
·
[ - ]

In their benchmark, they have a tag "tuned" attached to their o3 result. I guess we need they to inform us of the exact meaning of it to gauge.

·
2 weeks ago
·
[ - ]

charlieyu1
·
2 weeks ago
·
[ - ]

Why would they use the materials in model training? It would defeat the purpose of having a benchmarking set

Certhas
·
2 weeks ago
·
[ - ]

Compare:

"O3 performs spectacularly on a very hard dataset that was independently developed and that OpenAI does not have access to."

"O3 performs spectacularly on a very hard dataset that was developed for OpenAI and that only OpenAI has access to."

Or let's put it another way: If what they care about is benchmark integrity, what reason would they have for demanding access to the benchmark dataset and hiding the fact that they finance it? The obvious thing to do if integrity is your goal is to fund it, declare that you will not touch it, and be transparent about it.

wokwokwok
·
2 weeks ago
·
[ - ]

If you’re a research lab then yes.

If you’re a for profit company trying to raise funding and fend off skepticism that your models really aren’t that much better than any one else’s, then…

It would be dishonest, but as long as no one found out until after you closed your funding round, there’s plenty of reason you might do this.

It comes down to caring about benchmarks and integrity or caring about piles of money.

Judge for yourself which one they chose.

Perhaps they didn’t train on it.

Who knows?

It’s fair to be skeptical though, under the circumstances.

charlieyu1
·
2 weeks ago
·
[ - ]

6 months ago it would be unimaginable to do anything that may be harmful to the quality of the product, but I’m trusting OpenAI less and less

·
2 weeks ago
·
[ - ]

teleforce
·
2 weeks ago
·
[ - ]

>perhaps they were routing API calls to human workers

Honest question, did they?

echoangle
·
2 weeks ago
·
[ - ]

How would that even work? Aren’t the responses to the API equally fast as the Web interface? Can any human write a response with the speed of an LLM?

YeGoblynQueenne
·
2 weeks ago
·
[ - ]

No but a human can solve a problem that an LLM can't solve and then an LLM can generate a response to the original prompt including the solution found by the human.

2-3-7-43-1807
·
2 weeks ago
·
[ - ]

verbal agreement ... that's just saying that you're a little dumb or you're playing dumb cause you're in on it.

chvid
·
2 weeks ago
·
[ - ]

Not used in model training probably means it was used in model validation.

lolinder
·
2 weeks ago
·
[ - ]

A co-founder of Epoch left a note in the comments:

> We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of a unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities. However, we have a verbal agreement that these materials will not be used in model training.

Ouch. A verbal agreement. As the saying goes, those aren't worth the paper they're written on, and that's doubly true when you're dealing with someone with a reputation like Altman's.

And aside from the obvious flaw in it being a verbal agreement, there are many ways in which OpenAI could technically comply with this agreement while still gaining a massive unfair advantage on the benchmarks to the point of rendering them meaningless. For just one example, knowing the benchmark questions can help you select training data that is tailored to excelling at the benchmarks without technically including the actual question in the training data.

aithrowawaycomm
·
2 weeks ago
·
[ - ]

What's even more suspicious is that these tweets from Elliot Glazer indicate that they are still "developing" the hold-out set, even though elsewhere Epoch AI strongly implied this already existed: https://xcancel.com/ElliotGlazer/status/1880809468616950187

It seems to me that o3's 25% benchmark score is 100% data contamination.

cma
·
2 weeks ago
·
[ - ]

> I just saw Sam Altman speak at YCNYC and I was impressed. I have never actually met him or heard him speak before Monday, but one of his stories really stuck out and went something like this:

> "We were trying to get a big client for weeks, and they said no and went with a competitor. The competitor already had a terms sheet from the company were we trying to sign up. It was real serious.

> We were devastated, but we decided to fly down and sit in their lobby until they would meet with us. So they finally let us talk to them after most of the day.

> We then had a few more meetings, and the company wanted to come visit our offices so they could make sure we were a 'real' company. At that time, we were only 5 guys. So we hired a bunch of our college friends to 'work' for us for the day so we could look larger than we actually were. It worked, and we got the contract."

> I think the reason why PG respects Sam so much is he is charismatic, resourceful, and just overall seems like a genuine person.

https://news.ycombinator.com/item?id=3048944

aithrowawaycomm
·
2 weeks ago
·
[ - ]

Man, the real ugliness is the comments hooting and hollering for this amoral cynicism:

  Honesty is often overrated by geeks and it is very contextual

  He didn't misrepresent anything. They were actually working there, just only for one day.

  The effectiveness of deception is not mitigated by your opinions of its likability.

Gross.

AyyEye
·
2 weeks ago
·
[ - ]

Nothing says genuine like lying to get a contract.

renegade-otter
·
2 weeks ago
·
[ - ]

This sort of "adjusting the truth" is widespread in business. It's not OK, but people should not be shocked by this.

Also, if marks want to be so gullible, it's on them. It's your money and YOUR due diligence.

teaearlgraycold
·
2 weeks ago
·
[ - ]

This was my assumption all along.

EagnaIonat
·
2 weeks ago
·
[ - ]

> What's even more suspicious is that these tweets from Elliot Glazer indicate that they are still "developing" the hold-out set,

There is nothing suspicious about this and the wording seems to be incorrect.

A hold-out set is a percentage of the overall data that is used to test a model. It is just not trained on it. Model developers normally have full access to it.

There is nothing inherently wrong with training on a full/partial hold out set. It just means you have done a different split to train again.

The confusion I see here is that people are equating a hold out set to a blind set. That's a set of data to test against that the model developers (and model) cannot see.

Even so blind sets can also go stale after a few runs and nothing is wrong with ingesting that blind set, as long as you have a new blind set to run against.

Trying to game blind set tests is nothing new and it gets very quickly found out.

What I took from the original article is that the blind set is likely unbalanced and it answered more easier questions than hard ones.

aithrowawaycomm
·
2 weeks ago
·
[ - ]

> The confusion I see here is that people are equating a hold out set to a blind set. That's a set of data to test against that the model developers (and model) cannot see.

What on earth? This is from Tamay Besiroglu at Epoch:

  Regarding training usage: We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of a unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities. However, we have a verbal agreement that these materials will not be used in model training.

So this "confusion" is because Epoch AI specifically told people it was a blind set! Despite the condescending tone, your comment is just plain wrong.

EagnaIonat
·
2 weeks ago
·
[ - ]

Your quote literally says hold-out set.

Kbelicius
·
2 weeks ago
·
[ - ]

It also literally says "unseen-by-OpenAI".

EagnaIonat
·
2 weeks ago
·
[ - ]

Curious why you would be responding on an old post. Are you an alt account?

Your comment doesn't contradict what I said.

sillysaurusx
·
2 weeks ago
·
[ - ]

The questions are designed so that such training data is extremely limited. Tao said it was around half a dozen papers at most, sometimes. That’s not really enough to overfit on without causing other problems.

lolinder
·
2 weeks ago
·
[ - ]

> That’s not really enough to overfit on without causing other problems.

"Causing other problems" is exactly what I'm worried about. I would not put it past OpenAI to deliberately overfit on a set of benchmarks in order to keep up the illusion that they're still progressing at the rate that the hype has come to expect, then keep the very-dangerous model under wraps for a while to avoid having to explain why it doesn't act as smart as they claimed. We still don't have access to this model (because, as with everything since GPT-2, it's "too dangerous"), so we have no way of independently verifying its utility, which means they have a window where they can claim anything they want. If they release a weaker model than claimed it can always be attributed to guardrails put in place after safety testing confirmed it was dangerous.

We'll see when the model actually becomes available, but in the meantime it's reasonable to guess that it's overfitted.

hyperpape
·
2 weeks ago
·
[ - ]

You're missing the part where 25% of the problems were representative of problems top tier undergrads would solve in competitions. Those problems are not based on material that only exists in half a dozen papers.

Tao saw the hardest problems, but there's no concrete evidence that o3 solved any of the hardest problems.

·
2 weeks ago
·
[ - ]

jsheard
·
2 weeks ago
·
[ - ]

Why do people keep taking OpenAIs marketing spin at face value? This keeps happening, like when they neglected to mention that their most impressive Sora demo involved extensive manual editing/cleanup work because the studio couldn't get Sora to generate what they wanted.

https://news.ycombinator.com/item?id=40359425

th1243127
·
2 weeks ago
·
[ - ]

It might be because (very few!) mathematicians like Terence Tao make positive remarks. I think these mathematicians should be very careful to use reproducible and controlled setups that by their nature cannot take place on GPUs in the Azure cloud.

I have nothing against scientists promoting the Coq Proof Assistant. But that's open source, can be run at home and is fully reproducible.

aithrowawaycomm
·
2 weeks ago
·
[ - ]

Keep in mind those mathematicians were kept in the dark about the funding: it is incredibly unethical to invite a coauthor to your paper and not tell where the money came from.

It's just incredibly scummy behavior: I imagine some of those mathematicians would have declined the collaboration if the funding were transparent. More so than data contamination, this makes me deeply mistrustful of Epoch AI.

refulgentis
·
2 weeks ago
·
[ - ]

I can't parse any of this, can you explain to a noob? I get lost immediately: funding, coauthor, etc. Only interpretation I've come to is I've missed a scandal involving payola, Terence Tao, and keeping coauthors off papers

Vecr
·
2 weeks ago
·
[ - ]

Very few people were told the nature of the funding.

Vecr
·
2 weeks ago
·
[ - ]

Wait, I think I somehow knew Epoch AI was getting money from OpenAI. I'm not sure how, and I didn't connect any of the facts together to think of this problem in advance.

rvz
·
2 weeks ago
·
[ - ]

Because they are completely gullible and believe almost everything that OpenAI does without questioning the results.

On each product they release, their top researchers are gradually leaving.

Everyone now knows what happens when you go against or question OpenAI after working for them, which is why you don't see any criticism and more of a cult-like worship.

Once again, "AGI" is a complete scam.

refulgentis
·
2 weeks ago
·
[ - ]

Because the models have continually matched the quality they claim.

Ex. look how much work "very few" has to do in the sibling comment. It's like saying "very few physicists [Einstein/Feynman/Witten]"

Its conveniently impossible to falsify the implication that the inverse of "very few" say not positive things. i.e. that the vast majority say negative things

You have to go through an incredible level of mental gymnastics, involving many months of gated decisions, where the route chosen involved "gee, I know this is suspectable to confirmation bias, but...", to end up wondering why people think the models are real if OpenAI has access to data that includes some set of questions.

saithound
·
2 weeks ago
·
[ - ]

> Because the models have continually matched the quality they claim.

That's very far from true.

"Yes, I know that the HuggingFace arena and coding assistant leaderboards both say that OpenAI's new model is really good, but in practice you should use Claude Sonnet instead" was a meme for good reason, as was "I know the benchmarks show that 4o is just as capable as ChatGPT4 but based on our internal evals it seems much worse". The latter to the extent that they had to use dark UI patterns to hide ChatGPT-4 from their users, because they kept using it, and it cost OpenAI much more than 4o.

OpenAI regularly messes with benchmarks to keep the investor money flowing. Slightly varying the wording of benchmark problems causes a 30% drop in o1 accuracy. That doesn't mean "LLMs don't work" but it does mean that you have to be very sceptical of OpenAI benchmark results when comparing them to other AI labs, and this has been the case for a long time.

The FrontierMath case just shows that they are willing to go much farther with their dishonesty than most people thought.

diggan
·
2 weeks ago
·
[ - ]

> Tamay from Epoch AI here. We made a mistake in not being more transparent about OpenAI's involvement. We were restricted from disclosing the partnership until around the time o3 launched, and in hindsight we should have negotiated harder for the ability to be transparent to the benchmark contributors as soon as possible. Our contract specifically prevented us from disclosing information about the funding source and the fact that OpenAI has data access to much but not all of the dataset.

Not sure if "integrity of the benchmarks" should even be something that you negotiate over, what's the value of the benchmark if the results cannot be trusted because of undisclosed relationships and sharing of data? Why would they be restricted from disclosing stuff you normally disclose, and how doesn't that raise all sorts of warning flags when proposed even?

optimalsolver
·
2 weeks ago
·
[ - ]

>OpenAI has data access to much but not all of the dataset

Their head mathematician says they have the full dataset, except a holdout set which they're currently developing (i.e. doesn't exist yet):

https://www.reddit.com/r/singularity/comments/1i4n0r5/commen...

menaerus
·
2 weeks ago
·
[ - ]

Thanks for the link. A holdout set which is yet to be used to verify the 25% claim. He also says that he doesn't believe that OpenAI would self-sabotage themselves by tricking the internal benchmarking performance since this will get easily exposed, either by the results from a holdout set or by the public repeating the benchmarks themselves. Seems reasonable to me.

optimalsolver
·
2 weeks ago
·
[ - ]

>the public repeating the benchmarks themselves

The public has no access to this benchmark.

In fact, everyone thought it was all locked up in a vault at Epoch AI HQ, but looks like Sam Altman has a copy on his bedside table.

menaerus
·
2 weeks ago
·
[ - ]

Perhaps what he meant is that the public will be able to benchmark the model themselves by throwing different difficulty math problems at it and not necessarily the FrontierMath benchmark. It should become pretty obvious if they were faking the results or not.

optimalsolver
·
2 weeks ago
·
[ - ]

It's been found [0] that slightly varying Putnam problems causes a 30% drop in o1-Preview accuracy, but that hasn't put a dent in OAI's hype.

There's absolutely no comeuppance for juicing benchmarks, especially ones no one has access to. If performance of o3 doesn't meet expectations, there'll be plenty of people making excuses for it ("You're prompting it wrong!", "That's just not its domain!").

[0] https://openreview.net/forum?id=YXnwlZe0yf&noteId=yrsGpHd0Sf

menaerus
·
2 weeks ago
·
[ - ]

> If performance of o3 doesn't meet expectations, there'll be plenty of people making excuses for it

I agree and I can definitely see that happening but it is also not impossible, given the incentive and impact of this technology, for some other company/community to create yet another, perhaps, FrontierMath-like benchmark to cross-validate the results.

I also don't disagree that it is not impossible for OpenAI to have faked these results. Time will tell.

aunty_helen
·
2 weeks ago
·
[ - ]

This feels like a done deal. This benchmark should be discarded.

bogtog
·
2 weeks ago
·
[ - ]

A lot of the comments express some type of deliberate cheating the benchmark. However, even without intentionally trying to game it, if anybody can repeatedly take the same test, then they'll be nudged to overfit/p-hack.

For instance, suppose they conduct an experiment and find that changing some hyper-parameter yields a 2% boost. That could just be noise, it could be a genuine small improvement, or it may be a mix of a genuine boost along with some fortunate noise. An effect may be small enough that researchers would need to rely on their gut to interpret it. Researchers may jump on noise while believing they have discovered true optimizations. Enough of these types of nudges, and some serious benchmark gains can materialize.

(Hopefully my comment isn't entirely misguided, I don't know how they actually do testing or how often they probe their test set)

madars
·
2 weeks ago
·
[ - ]

I cringe every time I see "my IQ increased by X points after doing Y" posts on Twitter - yes, you had a practice run on Raven's progressive matrices a month ago, that helped, these have a limited question bank and the effect of Y is marginal. That said, obviously, test taking is a skill (separate from background knowledge and both general/domain-specific ability) and should be trained if you expect to have life-altering events based on tests (i.e., do an LSAT course if you want to go to law school). Conversely, shouldn't be done if you think it will limit you through superstition ("I had a score of X, thus I can only perform around level of X+fudge factor"). For an LLM company a good test score is a valuation-altering event!

zarzavat
·
2 weeks ago
·
[ - ]

OpenAI played themselves here. Now nobody is going to take any of their results on this benchmark seriously, ever again. That o3 result has just disappeared in a poof of smoke. If they had blinded themselves properly then that wouldn't be the case.

Whereas other AI companies now have the opportunity to be first to get a significant result on FrontierMath.

colonial
·
2 weeks ago
·
[ - ]

I'd be surprised if any of their in-house benchmark results are taken seriously after this. As an extremely rough estimate, FrontierMath cost five to six figures to assemble [1] - so from an outside view, they clearly have no qualms with turning cash into quasi-guaranteed benchmark results.

[1]: https://epoch.ai/math-problems/submit-problem - the benchmark is comprised of "hundreds" of questions, so at the absolute lowest it cost 300 * 200 = 60,000 dollars.

red75prime
·
2 weeks ago
·
[ - ]

Conversely, if they didn't cheat and they funded creation of the test suite to get "clean" problems (while hiding their participation to prevent getting problems that are somehow tailored to be hard for LLMs specifically), then they have no reasons to fear that all this looks fishy as the test results will soon be vindicated when they'll give wider access to the model.

I refrain from forming a strong opinion in such situations. My intuition tells me that it's not cheating. But, well, it's intuition (probably based on my belief that the brain is nothing special physics-wise and it doesn't manage to realize unknown quantum algorithms in its warm and messy environment, so that classical computers can reproduce all of its feats when having appropriate algorithms and enough computing power. And math reasoning is just another step on a ladder of capabilities, not something that requires completely different approach). So, we'll see.

klabb3
·
2 weeks ago
·
[ - ]

> based on my belief that the brain is nothing special physics-wise and it doesn't manage to realize unknown quantum algorithms in its warm and messy environment

Agreed (well as much as intuition goes), but current gen AI is not a brain, much less a human brain. It shows similarities, in particular emerging multi-modal pattern matching capabilities. There is nothing that says that’s all the neocortex does, in fact the opposite is a known truth in neuroscience. We just don’t know all functions yet - we can’t just ignore the massive Chesterton’s fence we don’t understand.

This isn’t even necessarily because the brain is more sophisticated than anything else, we don’t have models for the weather and immune system or anything chaotic really. Look, folding proteins is still a research problem and that’s at the level of known molecular structure. We greatly overestimate our abilities to model & simulate things. Todays AI is a prime example of our wishful thinking and glossing over ”details”.

> so that classical computers can reproduce all of its feats when having appropriate algorithms and enough computing power.

Sure. That’s a reasonable hypothesis.

> And math reasoning is just another step on a ladder of capabilities, not something that requires completely different approach

You seem to be assuming ”ability” is single axis. It’s like assuming if we get 256 bit registers computers will start making coffee, or that going to the gym will eventually give you wings. There is nothing that suggests this. In fact, if you look at emerging ability in pattern matching that improved enormously, while seeing reasoning on novel problems sitting basically still, that suggests strongly that we are looking at a multi-axis problem domain.

red75prime
·
3 days ago
·
[ - ]

> if you look at emerging ability in pattern matching that improved enormously, while seeing reasoning on novel problems sitting basically still

About two years ago I came to the opinion that autoregressive models of reasonable size will not be able to capture the fullness of human abilities (mostly due to a limited compute per token). So it's not a surprise to me. But training based on reinforcement learning might be able to overcome this.

I don't believe that some specialized mechanisms are required to do math.

eksu
·
2 weeks ago
·
[ - ]

This risk could be mitigated by publishing the test.

ripped_britches
·
2 weeks ago
·
[ - ]

Do people actually think OpenAI is gaming benchmarks?

I know they have lost trust and credibility, especially on HN. But this is a company with a giant revenue opportunity to sell products that work.

What works for enterprise is very different from “does it beat this benchmark”.

No matter how nefarious you think sama is, everything points to “build intelligence as rapidly as possible” rather than “spin our wheels messing with benchmarks”.

In fact, even if they did fully lie and game the benchmark - do you even care? As an OpenAI customer, all I care about is that the product works.

I code with o1 for hours every day, so I am very excited for o3 to be released via API. And if they trained on private datasets, I honestly don’t care. I just want to get a better coding partner until I’m irrelevant.

Final thought - why are these contractors owed a right to know where funding came from? I would definitely be proud to know I contributed to the advancement of the field of AI if I was included in this group.

mlsu
·
2 weeks ago
·
[ - ]

Gaming benchmarks has a lot of utility for openAI whether their product works or not.

Many people compare models based on benchmarks. So if openAI can appear better to Anthropic, Google, or Meta, by gaming benchmarks, it's absolutely in their interest to do so, especially if their product is only slightly behind, because evaluating model quality is very very tricky business these days.

In particular, if there is a new benchmark, it's doubly in their interest to game it, because they know that other providers will start using and optimizing performance towards that benchmark, in order to "beat" OpenAI and win market share.

On a personal level, their model is getting beat handily by Claude Sonnet 3.5 right now. It doesn't seem to show in the benchmarks. I wonder why?

This is a company which is shedding their coats of ethics and scientific rigor -- so as to be as unencumbered as possible in its footrace to the dollar.

bugglebeetle
·
2 weeks ago
·
[ - ]

I used to think this, but using o1 quite a bit lately has convinced me otherwise. It’s been 1-shotting the fairly non-trivial coding problems I throw at it and is good about outputting large, complete code blocks. By contrast, Claude immediately starts nagging you about hitting usage limits after a few back and forth and has some kind of hack in place to start abbreviating code when conversations get too long, even when explicitly instructed to do otherwise. I would imagine that Anthropic can produce a good test time compute model as well, but until they have something publicly available, OpenAI has stolen back the lead.

maeil
·
2 weeks ago
·
[ - ]

"Their model" here is referring to 4o as o1 is unviable for many production usecases due to latency.

raincole
·
2 weeks ago
·
[ - ]

> On a personal level, their model is getting beat handily by Claude Sonnet 3.5 right now. It doesn't seem to show in the benchmarks. I wonder why?

I do use Sonnet 3.5 personally, but this "beat handily" doesn't show on LLM arena. Do OpenAI game that too?

ripped_britches
·
2 weeks ago
·
[ - ]

I think “getting beat handily” is a HN bubble concept. Depends on what you’re using it for, but I personally prefer 4o for coding. In enterprise usage, i think 4o is smoking 3.5 sonnet, but that’s just my perception from folks I talk to.

hatefulmoron
·
2 weeks ago
·
[ - ]

I don't think that's true, you'll get the same sentiment ("Sonnet 3.5 is much better than GPT4/GPT4o [for coding]") pretty uniformly across Reddit/HN/Lobsters. I would strongly agree with it in my own testing, although o1 might be much better (I'm too poor to give it a fair shake.)

> In enterprise usage, i think 4o is smoking 3.5 sonnet

True. I'm not sure how many enterprise solutions have given their users an opportunity to test Claude vs. GPT. Most people just use whatever LLM API their software integrates.

maeil
·
2 weeks ago
·
[ - ]

This just isn't accurate, on the overwhelming majority of real-world tasks (>90%) 3.5 Sonnet beats 4o. FWIW I've spoken with a friend who's at OpenAI and they fully agree in private.

saithound
·
2 weeks ago
·
[ - ]

Yes, it looks all but certain that OpenAI gamed this particular benchmark.

Otherwise, they would not have had a contract that prohibited revealing that OpenAI was involved with the project until after the o3 announcements were made and the market had time to react. There is no reason to have such a specific agreement unless you plan to use the backdoor access to beat the benchmark: otherwise, OpenAI would not have known in advance that o3 will perform well! In fact, if there was proper blinding in place (which Epoch heads confirmed was not the case), there would have been no reason for secrecy at all.

Google, xAI and Anthropic's test-time compute experiments were really underwhelming: if OpenAI has secret access to benchmarks, that explains why their performance is so different.

jatins
·
2 weeks ago
·
[ - ]

> Do people actually think OpenAI is gaming benchmarks?

I was blown away by chatgpt release and generally have admired OpenAI however I wouldn't put it past them

At this point their entire marketing strategy seems to be to do vague posting on X/Twitter and keep hyping the models so that investors always feel there is something around the corner

And I don't think they need to do that. Most investors will be throwing money at them either way but maybe when you are looking to raise _billions_ that's not enough

maeil
·
2 weeks ago
·
[ - ]

> Do people actually think OpenAI is gaming benchmarks?

Yes, they 100% do. So do their main competitors. All of them do.

cbg0
·
2 weeks ago
·
[ - ]

> Do people actually think OpenAI is gaming benchmarks?

Yes, there's no reason not to do it, only upsides when you try to sell it to enterprises and governments.

·
2 weeks ago
·
[ - ]

331c8c71
·
2 weeks ago
·
[ - ]

Well I certainly won't object if oai marketing was based on testimonials from their fanboy customers instead of rigged benchmark scores %)

Your fragrant disregard for ethics and focus on utilitarian aspects is certainly quite extreme to the extent that only a view people would agree with you in my view.

·
2 weeks ago
·
[ - ]

lionkor
·
2 weeks ago
·
[ - ]

People on here were mocking me openly when I pointed out that you can't be sure LLMs (or any AIs) are actually smart unless you CAN PROVE that the question you're asking isn't in the training set (or adjacent like in this case).

So with this in mind now, let me repeat: Unless you know that the question AND/OR answer are not in the training set or adjacent, do not claim that the AI or similar black box is smart.

pcmoore
·
2 weeks ago
·
[ - ]

I ran a test yesterday on ChatGPT and co-pilot asking first if it knew of a specific paper which it confirmed and then to derive simple results from which it was completely incapable of. I know this paper is not widely referenced (ie few known results in the public domain) but has been available for over 15 years with publicly accessible code written by humans. The training set was so sparse it had no ability to "understand" or even regurgitate past the summary text which it listed almost verbatim.

Vecr
·
2 weeks ago
·
[ - ]

It is known that current models have terrible sample efficiency. I've been told that it's better than I thought it was, but it still isn't good.

sitkack
·
2 weeks ago
·
[ - ]

This all smells like the OpenAI CEO's MO. Stupid drama for stupid reasons.

KeplerBoy
·
2 weeks ago
·
[ - ]

It doesn't need to be smart to be useful. A lot of the kind of work I do seems to be in the training set.

shadowfox
·
2 weeks ago
·
[ - ]

I don't think the OP is talking about usefulness at all, that is on a completely different dimension I would say.

MattDaEskimo
·
2 weeks ago
·
[ - ]

There's something gross about OpenAI constantly misleading the public.

This maneuver by their CEO will destroy FrontierMath and Epoch AI's reputation

cbracketdash
·
2 weeks ago
·
[ - ]

Reminds me of the following proverb:

"The integrity of the upright guides them, but the unfaithful are destroyed by their duplicity."

(Proverbs 11:3)

benterix
·
2 weeks ago
·
[ - ]

> Our contract specifically prevented us from disclosing information about the funding source and the fact that OpenAI has data access to much but not all of the dataset.

Man, this is huge.

wujerry2000
·
2 weeks ago
·
[ - ]

My takeaways

(1) Companies will probably increasingly invest in building their own evals for their use cases because its becoming clear public/allegedly private benchmarks have misaligned incentives with labs sponsoring/cheating (2) Those evals will prob be proprietary "IP" - guarded as closely as the code or research itself (3) Conversely, public benchmarks are exhausted and SOMEONE has to invest in funding more frontier benchmarks. So this is prob going to continue.

gunalx
·
2 weeks ago
·
[ - ]

So in conclusion, any evaluation of openai models on frontier math is increadibly invalidated.

I would even go so far as to say this invalidates not only FrontierMath but also anything Epoch AI has and will touch.

Any academic misjudgement like this massive conflict and cheating makes you unthrustworthy in a academic context.

BrenBarn
·
1 week ago
·
[ - ]

This kind of thing is so avoidable by anyone who has not sold their soul. The answer is: if a company wants you to do a deal but requires as a condition that you not reveal to anyone that you are doing a deal with that company, you just say no. It's that simple.

Imnimo
·
2 weeks ago
·
[ - ]

My guess is that OpenAI didn't cheat as blatantly as just training on the test set. If they had, surely they could have gotten themselves an even higher mark than 25%. But I do buy the comment that they soft-cheated by using elements of the dataset for validation (which is absolutely still a form of data leakage). Even so, I suspect their reported number is roughly legit, because they report numbers on many benchmarks, and they have a good track record of those numbers holding up to private test sets.

What's much more concerning to me than the integrity of the benchmark number is the general pattern of behavior here from OpenAI and Epoch. We shouldn't accept secretly (even secret to the people doing the creation!) funding the creation of a benchmark. I also don't see how we can trust in the integrity of EpochAI going forward. This is basically their only meaningful output, and this is how they handled it?

riku_iki
·
2 weeks ago
·
[ - ]

> If they had, surely they could have gotten themselves an even higher mark than 25%.

there is potentially some limitation of LLMs memorizing such complex proofs

woopwoop
·
2 weeks ago
·
[ - ]

They aren't proofs, they're just numbers. All the questions have numerical answers. That's how they're evaluated.

riku_iki
·
2 weeks ago
·
[ - ]

I think those reasoning models are smart enough to not emit memorized answer if they can't come with CoT proof.

But OAI could draw any result, no one was checking, they probably were not brave enough to declare math as solved topic.

j_timberlake
·
2 weeks ago
·
[ - ]

Elon definitely still has a grudge against Altman and OpenAI, so when Elon uses his new political power to bludgeon OpenAI to bankruptcy with new regulations and lawsuits, it won't be for the right reasons, but I'll still think Altman and the remaining employees deserve it.

padolsey
·
2 weeks ago
·
[ - ]

Many of these evals are quite easy to game. Often the actual evaluation part of benchmarking is left up to a good-faith actor, which was usually reasonable in academic settings less polluted by capital. AI labs, however, have disincentives to do a thorough or impartial job, so IMO we should never take their word for it. To verify, we need to be able to run these evals ourselves – this is only sometimes possible, as even if the datasets are public, the exact mechanisms of evaluation are not. In the long run, to be completely resilient to gaming via training, we probably need to follow suit of other fields and have third-party non-profit accredited (!!) evaluators who's entire premise is to evaluate, red-team, and generally keep AI safe and competent.

matt_daemon
·
2 weeks ago
·
[ - ]

At this point eval results presented by AI companies are a joke and should not be trusted

WasimBhai
·
2 weeks ago
·
[ - ]

I have been taking a course in AI policy and the O1 and the FrontierMath dataset has been an important mark for me to emphasize the world we are moving toward. It is incredibly sad to know about the conflict of interest here. However, those more knowledgeable, can you explain in plain words, does this revelation compromise OAI's claims regarding o3's performance on FrontierMath problems?

energy123
·
2 weeks ago
·
[ - ]

It's worse than just an undeclared conflict of interest. They gave OpenAI all questions and solutions behind the scenes. It's hard to chalk this up to only naivete. This is a "sorry you caught me" moment.

lolinder
·
2 weeks ago
·
[ - ]

They have an oral agreement that OpenAI won't use the benchmark in training. Which means first and foremost you have to consider the possibility that they broke that oral agreement and actually included the problems in the training set. Even if they didn't, the fact that they had the problems means they could have selectively chosen the training set data to specialize in solving that class of problem, while still technically keeping the verbal agreement.

So, yeah, the benchmark needs to be treated as essentially worthless at this point.

energy123
·
2 weeks ago
·
[ - ]

If OpenAI wanted the questions/solutions, there is going to be a reason for that. This data is not sitting in an unopened folder on Sam's computer.

There are a lot of ways you can use data to improve a model without directly training on it. A train/test validation loop, for example. Or as a wellspring for synthetic data generation. But all of these ways involve some level of data contamination, it's unavoidable.

nioj
·
2 weeks ago
·
[ - ]

·
2 weeks ago
·
[ - ]

refulgentis
·
2 weeks ago
·
[ - ]

Its increasingly odd to see HN activity that assumes the premise: if the latest benchmark results involved a benchmark that can be shown to have any data that OpenAI could have accessed, then, the benchmark results were intentionally faked.

Last time this confused a bunch of people who didn't understand what test vs. train data meant and it resulted in a particular luminary complaining on Twitter, to much guffaws, how troubling the situation was.

Literally every comment currently, modulo [1] assumes this and then goes several steps more, and a majority are wildly misusing terms with precise meanings, explaining at least part of their confusion.

[1] modulo the one saying this is irrelevant because we'll know if it's bad when it comes out, which to be fair, if evaluated rationally, we know that doesn't help us narrowly with our suspicion FrontierMath benchmarks are all invalid because it trained on (most of) the solutions

EvgeniyZh
·
2 weeks ago
·
[ - ]

Why wouldn't OpenAI cheat? It's an open secret in industry that benchmarks are trained on. Everybody does it, so you need to do that or else your similarly performing model will look worse on paper.

And even they respect the agreement, even using test set as a validation set can be a huge advantage. That's why validation set and test set are two different terms with precise meaning.

As for "knowing it's bad", most people won't be able to tell a model scoring 25% and 10% apart. People who are using these models to solve math problems are tiny share of users and even tinier share of revenues. What OpenAI needs is to convince investors that there is still progress in capabilities going at high pace, and gaming the benchmarks makes perfect sense in this context. 25% was surprising and appeared to surpass expectations, which is exactly what OpenAI needs.

refulgentis
·
2 weeks ago
·
[ - ]

> Why wouldn't OpenAI cheat? It's an open secret in industry that benchmarks are trained on. Everybody does it, so you need to do that or else your similarly performing model will look worse on paper.

This starts with a fallacious appeal to cynicism combined with an unsubstantiated claim about widespread misconduct. The "everybody does it" argument is a classic rationalization that doesn't actually justify anything. It also misunderstands the reputational and technical stakes - major labs face intense scrutiny of their methods and results, and there's plenty of incestuous movement between labs and plenty of leaks.

> And even they respect the agreement, even using test set as a validation set can be a huge advantage. That's why validation set and test set are two different terms with precise meaning.

This part accidentally stumbles into a valid point about ML methodology while completely missing why it matters. Yes, validation and test sets serve different purposes - that's precisely why reputable labs maintain strict separations between them. The implication that this basic principle somehow proves misconduct is backwards logic.

> People who are using these models to solve math problems are tiny share of users and even tinier share of revenues.

This reveals a fundamental misunderstanding of why math capabilities matter. They're not primarily about serving math users - they're a key metric for abstract reasoning and systematic problem-solving abilities. This is basic ML evaluation theory.

> What OpenAI needs is to convince investors that there is still progress in capabilities going at high pace, and gaming the benchmarks makes perfect sense in this context. 25% was surprising and appeared to surpass expectations, which is exactly what OpenAI needs.

This concludes with pure speculation presented as fact, combined with a conspiracy theory that lacks any actual evidence. It also displays a shallow understanding of how technical due diligence works in major AI investments - investors at this level typically have deep technical expertise, access to extensive testing and validation, and most damningly, given the reductive appeal to incentive structure:

They closed the big round weeks before.

The whole comment reads like someone who has picked up some ML terminology but lacks fundamental understanding of how research evaluation, technical accountability, and institutional incentives actually work in the field. The dismissive tone and casual accusations of misconduct don't help their credibility either.

BeefWellington
·
2 weeks ago
·
[ - ]

> The "everybody does it" argument is a classic rationalization that doesn't actually justify anything.

I'd argue here the more relevant point is "these specific people have been shown to have done it before."

> The whole comment reads like someone who has picked up some ML terminology but lacks fundamental understanding of how research evaluation, technical accountability, and institutional incentives actually work in the field. The dismissive tone and casual accusations of misconduct don't help their credibility either.

I think what you're missing is the observation that so very little of that is actually applied in this case. "AI" here is not being treated as an actual science would be. The majority of the papers pumped out of these places are not real concrete research, not submitted to journals, and not peer reviewed works.

refulgentis
·
2 weeks ago
·
[ - ]

> I'd argue here the more relevant point is "these specific people have been shown to have done it before."

This is itself a slippery move. A vague gesture at past misconduct without actually specifying any incidents. If there's a clear pattern of documented benchmark manipulation, name it. Which benchmarks? When? What was the evidence? Without specifics, this is just trading one form of handwaving ("everyone does it") for another ("they did it before").

> "AI" here is not being treated as an actual science would be.

There's some truth here but also some sleight of hand. Yes, AI development often moves outside traditional academic channels. But, you imply this automatically means less rigor, which doesn't follow. Many industry labs have internal review processes, replication requirements, and validation procedures that can be as or more stringent than academic peer review. The fact that something isn't in Nature doesn't automatically make it less rigorous.

> The majority of the papers pumped out of these places are not real concrete research, not submitted to journals, and not peer reviewed works.

This combines three questionable implications:

- That non-journal publications are automatically "not real concrete research" (tell that to physics/math arXiv)

- That peer review is binary - either traditional journal review or nothing (ignoring internal review processes, community peer review, public replications)

- That volume ("pumped out") correlates with quality

You're making a valid critique of AI's departure from traditional academic structures, but then making an unjustified leap to assuming this means no rigor at all. It's like saying because a restaurant isn't Michelin-starred, it must have no food safety standards.

This also ignores the massive reputational and financial stakes that create strong incentives for internal rigor. Major labs have to maintain credibility with:

- Their own employees.

- Other researchers who will try to replicate results.

- Partners integrating their technology.

- Investors doing technical due diligence.

- Regulators scrutinizing their claims.

The idea that they would casually risk all that just to bump up one benchmark number (but not too much! just from 10% to 35%) doesn't align with the actual incentive structure these organizations face.

Both the original comment and this fall into the same trap - mistaking cynicism for sophistication while actually displaying a somewhat superficial understanding of how modern AI research and development actually operates.

BeefWellington
·
2 weeks ago
·
[ - ]

This reply reads as though it were AI generated.

Let's bite though, and hope that unhelpful excessively long-winded replies are just your quirk.

> This is itself a slippery move. A vague gesture at past misconduct without actually specifying any incidents. If there's a clear pattern of documented benchmark manipulation, name it. Which benchmarks? When? What was the evidence? Without specifics, this is just trading one form of handwaving ("everyone does it") for another ("they did it before").

Ok, provide specifics yourself then. Someone replied and pointed out that they have every incentive to cheat, and your response was:

> This starts with a fallacious appeal to cynicism combined with an unsubstantiated claim about widespread misconduct. The "everybody does it" argument is a classic rationalization that doesn't actually justify anything. It also misunderstands the reputational and technical stakes - major labs face intense scrutiny of their methods and results, and there's plenty of incestuous movement between labs and plenty of leaks.

Respond to the content of the argument -- be specific. WHY is OpenAI not incentivized to cheat on this benchmark? Why is a once-nonprofit which turned from releasing open and transparent models to a closed model and begun raking in tens of billions of investor cash not incentivized to continue to make those investors happy? Be specific. Because there's a clear pattern of corporate behaviour at OpenAI and associated entities which suggests your take is not, in fact, the simpler viewpoint.

> This combines three questionable implications: > - That non-journal publications are automatically "not real concrete research" (tell that to physics/math arXiv)

Yes, arXiv will host lots of stuff that isn't real concrete research. They've hosted April Fool's jokes, for example.[1]

> - That peer review is binary - either traditional journal review or nothing (ignoring internal review processes, community peer review, public replications)

This is a poor/incorrect reading of the language. You have inferred meaning that does not exist. If citations are so important here, cite a few dozen that are peer reviewed out of the hundreds.

> - That volume ("pumped out") correlates with quality

Incorrect reading again. Volume here correlates with marketing and hype. It could have an effect on quality but that wasn't the purpose behind the language.

> You're making a valid critique of AI's departure from traditional academic structures, but then making an unjustified leap to assuming this means no rigor at all. It's like saying because a restaurant isn't Michelin-starred, it must have no food safety standards.

Why is that unjustified? It's no different than any of the science background people who have fallen into flat earther beliefs. They may understand the methods but if they are not tested with rigor and have abandoned scientific principles they do not get to keep pretending it's as valid as actual science.

> This also ignores the massive reputational and financial stakes that create strong incentives for internal rigor. Major labs have to maintain credibility with:

FWIW, this regurgitated talking point is what makes me believe this is an LLM-generated reply. OpenAI is not a major research lab. They appear to essentially to be trading off the names of more respected institutions and mathematicians who came up with FrontierMath. The credibility damage here can be done by a single person sharing data with OpenAI, unbeknownst to individual participants.

Separately, even under correct conditions it's not as if there are not all manner of problems in science in terms of ethical review. See for example, [2].

[1] https://arxiv.org/abs/2003.13879 - FWIW, I'm not against scientists having fun, but it should be understood that arXiv is basically three steps above HN or reddit. [2] https://lore.kernel.org/linux-nfs/YH+zwQgBBGUJdiVK@unreal/ + related HN discussion: https://news.ycombinator.com/item?id=26887670

refulgentis
·
2 weeks ago
·
[ - ]

First paragraph is unnecessarily personal.

It's also confusing: Did you think it was AI because of the "regurgitated talking point", as you say later, or because it was a "unhelpful excessively long-winded repl[y]"?

I'll take the whole thing as an intemperate moment, and what was intended to be communicated was "I'd love to argue about this more, but can you cut down reply length?"

> Ok, provide specifics yourself then.

Pointing out "Everyone does $X" is fallacious does not imply you have to prove no one has any incentive to do $X. There's plenty of things you have an incentive to do that I trust you won't do. :)

> If citations are so important here, cite a few dozen that are peer reviewed out of the hundreds.

Sure.

I got lost a bit, though, of what?

Are you asking for a set of journal articles, that are peer-reviewed, about AI, that aren't on arxiv?

> Why is that unjustified?

"$X doesn't follow traditional academic structures" does not imply "$X has no rigor at all"

> OpenAI is not a major research lab.

Eep.

> "all manner of problems in science in terms of ethical review. "

Yup!

The last 2 on my part are short because I'm not sure how to reply to "entity $A has short-term incentive to do thing $X, and entity $A is part of large group $B that sometimes does thing $X". We don't disagree there! I'm just applying symbolic logic to the rest. Ex. when I say "$X does not imply $Y" has a very definite field-specific meaning.

It's fine to feel the way you do. It takes a rigorously rational process to end up making my argument, but rigorously is too kind: it would be crippling in daily life.

A clear warning sign, for me, setting aside the personal attack opening, would have been when I was doing things like "arXiv has April Fool's Jokes!" -- I like to think I would have taken a step back after noticing it was "OpenAI is distantly related to group $X, a member of group $X did $Y, therefore let's assume OpenAI did $Y and conversate from there"

EvgeniyZh
·
2 weeks ago
·
[ - ]

> an unsubstantiated claim about widespread misconduct.

I can't prove it, but I heard it from multiple people in the industry. High contamination levels for existing benchmarks, though [1,2]. Whether to believe that it is just as good as we can do, not doing the best possible decontamination, or done on purpose is up to you.

> Yes, validation and test sets serve different purposes - that's precisely why reputable labs maintain strict separations between them.

The verbal agreement promised not to train on the evaluation set. Using it as a validation set would not violate this agreement. Clearly, OpenAI did not plan to use the provided evaluation as a testset, because then they wouldn't need access to it. Also, reporting validation numbers as performance metric is not unheard of.

> This reveals a fundamental misunderstanding of why math capabilities matter. They're not primarily about serving math users - they're a key metric for abstract reasoning and systematic problem-solving abilities.

How good of a proxy is it? There is some correlation, but can you say something quantitative? Do you think you can predict which models perform better on math benchmarks based on interaction with them? Especially for a benchmark you have no access to and can't solve by yourself? If the answer is no, the number is more or less meaningless by itself, which means it would be very hard to catch somebody giving you incorrect numbers.

> someone who has picked up some ML terminology but lacks fundamental understanding of how research evaluation, technical accountability, and institutional incentives actually work in the field

My credentials are in my profile, not that I think they should matter. However, I do have experience specifically in deep learning research and evaluation of LLMs.

[1] https://aclanthology.org/2024.naacl-long.482/ [2] https://arxiv.org/abs/2412.15194

refulgentis
·
2 weeks ago
·
[ - ]

> "I can't prove it, but I heard it from multiple people in the industry"

The cited papers demonstrate that benchmark contamination exists as a general technical challenge, but are being misappropriated to support a much stronger claim about intentional misconduct by a specific actor. This is a textbook example of expanding evidence far, far, beyond its scope.

> "The verbal agreement promised not to train on the evaluation set. Using it as a validation set would not violate this agreement."

This argument reveals a concerning misunderstanding of research ethics. Attempting to justify potential misconduct through semantic technicalities ("well, validation isn't technically training") suggests a framework where anything not explicitly forbidden is acceptable. This directly contradicts established principles of scientific integrity where the spirit of agreements matters as much as their letter.

> "How good of a proxy is it? [...] If the answer is no, the number is more or less meaningless by itself"

This represents a stark logical reversal. The initial argument assumed benchmark manipulation would be meaningful enough to influence investors and industry perception. Now, when challenged, the same metrics are suddenly "meaningless." This is fundamentally inconsistent - either the metrics matter (in which case manipulation would be serious misconduct) or they don't (in which case there's no incentive to manipulate them).

> "My credentials are in my profile, not that I think they should matter."

The attempted simultaneous appeal to and dismissal of credentials is an interesting mirror of the claims as a whole: at this point, the argument OpenAI did something rests on unfalsifiable claims about the industry as a whole, claiming insider knowledge, while avoiding any verifiable evidence.

When challenged, it retreats to increasingly abstract hypotheticals about what "could" happen rather than what evidence shows did happen.

This demonstrates how seemingly technical arguments can fail basic principles of evidence and logic, while maintaining surface-level plausibility through domain-specific terminology. This kind of reasoning would not pass basic scrutiny in any rigorous research context.

EvgeniyZh
·
2 weeks ago
·
[ - ]

> Attempting to justify potential misconduct through semantic technicalities ("well, validation isn't technically training")

Validation is not training, period. I'll ask again: what is the possible goal of accessing the evaluation set if you don't plan to use it for anything except the final evaluation, which is what the test set is used for? Either they just asked for access without any intent to use the provided data in any way except for final evaluation, which can be done without access, or they did somehow utilize the provided data, whether by training on it (which they verbally promised not to), using it as a validation set, using it to create a similar training set, or something else.

> This directly contradicts established principles of scientific integrity where the spirit of agreements matters as much as their letter.

OpenAI is not doing science; they are doing business.

> This represents a stark logical reversal. The initial argument assumed benchmark manipulation would be meaningful enough to influence investors and industry perception. Now, when challenged, the same metrics are suddenly "meaningless." This is fundamentally inconsistent - either the metrics matter (in which case manipulation would be serious misconduct) or they don't (in which case there's no incentive to manipulate them).

The metrics matter to people, but this doesn't mean people can meaningfully predict the model's performance using them. If I were trying to describe each of your arguments as some demagogue technique (you're going to call it ad hominem or something, probably), then I'd say it's a false dichotomy: it can, in fact, be impossible to use metrics to predict performance precisely enough and for people to care about metrics simultaneously.

> The attempted simultaneous appeal to and dismissal of credentials

I'm not appealing to credentials. Based on what I wrote, you made a wrong guess about my credentials, and I pointed out your mistake.

> at this point, the argument OpenAI did something rests on unfalsifiable claims about the industry as a whole, claiming insider knowledge, while avoiding any verifiable evidence.

Your position, on the other hand, rests on the assumption that corporations behave ethically and with integrity beyond what is required by the law (and, specifically, their contracts with other entities).

achierius
·
1 week ago
·
[ - ]

> Validation is not training, period.

Sure, but what we care about isn't the semantics of the words, its the effects of what they're doing. Iterated validation plus humans doing hyperparameter tuning will go a long way towards making a model fit the data, even if you never technically run backprop with the validation set as input.

> OpenAI is not doing science; they are doing business.

Are you implying these are orthogonal? OpenAI is a business centered on an ML research lab, which does research, and which people in the research community have generally come to respect.

> at this point, the argument OpenAI did something rests on unfalsifiable claims about the industry as a whole, claiming insider knowledge, while avoiding any verifiable evidence.

No, it doesn't. What OP is doing is critiquing OpenAI for their misbehavior. This is one of the few levers we (who do not have ownership or a seat on their board) have to actually influence their future decisionmaking -- well-reasoned critiques can convince people here (including some people who decide whether their company uses ChatGPT vs. Gemini vs. Claude vs. ...) that ChatGPT is not as good as benchmarks might claim, which in effect makes it more expensive for OpenAI to condone this kind of misbehavior going forward.

The argument that "no companies are moral, so critiquing them is pointless" is just an indirect way of running cover for those same immoral companies.

croemer
·
1 week ago
·
[ - ]

Tim Gowers, one of the Fields medallists contributed problems to the benchmark dataset isn't happy about being misled about OpenAIs involvement. He retweeted this: https://x.com/Mihonarium/status/1880944026603376865?t=QN3i_X...

mrg3_2013
·
2 weeks ago
·
[ - ]

OpenAI continues to muddy the benchmarks, while Claude continues to improve their intelligence. Claude will win long term. It'd be wise to not rely on OpenAI at all. They are the first comers who will just burn cash and crash out I suspect.

atleastoptimal
·
2 weeks ago
·
[ - ]

The problem is, any benchmark on a closed model couldn’t be private even in theory, as the model has to be called to run the benchmark, exposing the contents to whoever owns the model thereafter.

HN loves to speculate that OpenAI is some big scam whose seeming ascendance is based on deceptive marketing hype, but o1, to anyone who has tried it seriously is undoubtedly very much within the ballpark of what OpenAI claims it is able to do. If everything they are doing really is just overfitting and gaming the tests, that discrepancy will eventually catch up to them, and people will stop using the APIs and chatgpt

karmasimida
·
2 weeks ago
·
[ - ]

They should at least clarify it. The reason they don’t I feel is simply for the hype and mystique.

There are ways that you could game the benchmark without adding it to the training set. By repetitively evaluating on the dataset itself it will regress into a validation set, not a test set, even in black box setting, as you can simply evaluating 100 checkpoints and pick the one that performs the best, rinse and repeat

I still believe o3 is the real deal, BUT this gimmick kind sour my appetite a bit, for that those who run the company

nottorp
·
2 weeks ago
·
[ - ]

So basically when you need to look good in benchmarks you fund an organization that does benchmarks in which you look good.

Just like toothpaste manufacturers fund dentist's associations etc.

ForHackernews
·
2 weeks ago
·
[ - ]

Unrelated to anything but what software is this blog running on? I love the sidenote feature.

Why does it have a customer service popover chat assistant?

Vecr
·
2 weeks ago
·
[ - ]

The Lightcone Infrastructure forum stack. I don't know why it has an assistant.

zrc108071849
·
2 weeks ago
·
[ - ]

Even if OpenAI does not use these materials to directly train its models, OpenAI can collect or construct more data based on the knowledge points and test points of these questions to gain an unfair competitive advantage. It's like before the Gaokao, a teacher reads some of the Gaokao questions and then marks the test points in the book for you. This is cheating.

suchintan
·
2 weeks ago
·
[ - ]

I wonder if more companies should open source their eval model outputs alongside the eval results

We tried doing that here at Skyvern (eval.skyvern.com)

maeil
·
2 weeks ago
·
[ - ]

This isn't news, the other popular benchmarks are just as gamed and worthless, it would be shocking if this one wasn't. The other frontier model providers game them just as hard, it's not an OpenAI thing. Any benchmark that a provider themselves mentions is not worth the pixels its written on.

·
2 weeks ago
·
[ - ]

floppiplopp
·
2 weeks ago
·
[ - ]

Unless you have been up to the shoulders in the hype-hole of Scam Altman's backside this should not come as the slightest surprise.

moi2388
·
2 weeks ago
·
[ - ]

“… we have a verbal agreement that these materials will not be used in model training”

What about model testing before releasing it?

·
2 weeks ago
·
[ - ]

treksis
·
2 weeks ago
·
[ - ]

so it was overfit

numba888
·
2 weeks ago
·
[ - ]

if they used it in training it should be 100% hit. most likely they used it to verify and tune parameters.

rrr_oh_man
·
2 weeks ago
·
[ - ]

> if they used it in training it should be 100% hit.

Not necessarily, no.

A statistical model will attempt to minimise overall loss, generally speaking.

If it gets 100% accuracy on the training data it's usually an overfit. (Hugging the data points too tightly, thereby failing to predict real life cases)

numba888
·
2 weeks ago
·
[ - ]

you are mostly right. but seeing almost perfectly reconstructed images from training set it's obvious model -can- memorize samples. in this case it would reproduce the answers too close to the original to be just 'accidental'. should be easy to test.

My guess samples could be used to find good enough stopping point for o1, o3 models. which is hardcoded.

aithrowawaycomm
·
2 weeks ago
·
[ - ]

The subtlety here is that an almost-memorized picture of a lady is the same picture with a few artifacts, and an almost-memorized NYT article is the same article with a few words changed, but an almost-memorized computation or proof is likely to be plain wrong. So even if OpenAI's benchmark was data contamination (as I suspect) it still says something about o1's abilities to execute a given problem-solving strategy without confabulating. It's just not what OpenAI wants you to think: much closer to Mathematica than an actual mathematician.

numba888
·
2 weeks ago
·
[ - ]

> but an almost-memorized computation or proof is likely to be plain wrong

hard to tell. never seen anyone trying it. model may almost-memorize and then fill the gaps at inference time as it's still doing some 'thinking'. But the main idea here is that there is a risk that model will spill out pieces of training data. OAI likely would not risk it at $100B++ valuation.

g-b-r
·
2 weeks ago
·
[ - ]

Had they let it hit 100% it would have been obvious they had the data.

They've sure been careful to avoid that, by only using a portion of it or some other technique

m3kw9
·
2 weeks ago
·
[ - ]

This don’t really matter much because if the models suck when it comes out evals mean nothing next time

·
2 weeks ago
·
[ - ]

katamari-damacy
·
2 weeks ago
·
[ - ]

“we now know how to build AGI” --Sam Altman.

which should really be “we now know how to improve associative reasoning but we still need to cheat when it comes to math because the bottom line is that the models can only capture logic associatively, not synthesize deductively, which is what’s needed for math beyond recipe-based reasoning"

james4151
·
2 weeks ago
·
[ - ]

[dead]