They even provide a quote from Terence Tao, who helped create the benchmark (alongside other Fields medalists and IMO problem writers):
> “These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages…”
Surprisingly, prediction markets [1] are putting 62% on AI achieving > 85% performance on the benchmark before 2028.
[1]: https://manifold.markets/MatthewBarnett/will-an-ai-achieve-8...
The problem with all benchmarks, one that we just don't know how to solve, is leakage. Systematically, LLMs do much better on benchmarks created before they were trained than on ones created after. There are countless papers showing significant leakage between models' training and test sets.
This is in part why so many LLMs are so strong according to benchmarks, particularly older popular benchmarks, but then prove to be so weak in practice when you try them out.
In addition to leakage, people also over-tune their LLMs to specific datasets. They also go out and collect more data that looks like the dataset they want to perform well on.
There's a lot of behind the scenes talk about unethical teams that collect data which doesn't technically overlap test sets, but is extremely close. You can detect this if you look at the pattern of errors these models make. But no one wants to go out and accuse specific teams, at least not for now.
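As a rough illustration of the kind of check involved (a minimal sketch of my own, not a method anyone in this thread described): character n-gram overlap between candidate training documents and benchmark items catches the "extremely close but not technically identical" cases that exact-match deduplication misses.

```python
# A minimal sketch (illustrative only): flag near-duplicate contamination by
# measuring character n-gram overlap between a training document and a test item.

def char_ngrams(text: str, n: int = 13) -> set[str]:
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

def overlap_score(train_doc: str, test_item: str, n: int = 13) -> float:
    """Fraction of the test item's n-grams that also appear in the training document."""
    test_grams = char_ngrams(test_item, n)
    if not test_grams:
        return 0.0
    return len(test_grams & char_ngrams(train_doc, n)) / len(test_grams)

# Anything above some threshold (say 0.3) deserves a manual look:
# suspicious = [doc for doc in corpus if overlap_score(doc, question) > 0.3]
```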
> Short of the companies fishing out the questions from API logs (which seems quite unlikely)
They all pretty clearly state[1] versions of "We use your queries (removing personal data) to improve the models" so I'm not sure why that's unlikely.
https://help.openai.com/en/articles/5722486-how-your-data-is...
Or they know the ancient technique of training on the test set. I know most of the questions are kept secret, but they are being regularly sent over the API to every LLM provider.
Just letting the AI train on its own wrong output wouldn't help. The benchmark already gives them lots of time for trial and error.
Two months later there's a bombshell exposé detailing insider reports of how they cheated the test by cooking their training data with an army of PhDs hand-solving the problems. Shame.
At a minimum investor confidence goes down the drain, if it doesn't trigger lawsuits from their investors. Then you're looking at maybe another CEO ouster fiasco with a crisis of faith across their workforce. That workforce might be loyal now, but that's because their RSUs are worth something and not tainted by fraud allegations.
If you're right, I suppose it really depends on how well they could hide it via layers of indirection and compartmentalization, and how hard they could spin it. I don't really have high hopes for that given the number of folks there talking to the press lately.
Doesn't cause too much of a scandal lol
Why surprisingly?
The time until 2028 is about twice as long as capable LLMs have existed to date. By "capable" here I mean capable enough to even remotely consider the idea of LLMs solving such tasks in the first place. ChatGPT/GPT-3.5 isn't even 2 years old!
4 years is a lot of time. It's kind of silly to assume LLM capabilities have already plateaued.
An exponential pace of progress isn't usually just one thing; if you zoom in, any particular thing may plateau, but its impact compounds by enabling growth of successors, variations, and related inventions. Nor is it a smooth curve, if you look closely. I feel statements like "a 405b model is not 5 times better than a 70b model" are zooming in on a specific class of models so much you can see the pixels of the pixel grid. There's plenty of open and promising research in tweaking the current architecture in training or inference (see e.g. other thread from yesterday[0]), on top of changes to architecture, methodology, and methods of controlling or running inference on existing models by lobotomizing them or grafting networks onto networks, etc. The field is burning hot right now; we're measuring the gap between incremental improvements and interesting research directions in weeks. The overall exponential of "language model" power may well continue when you zoom out a little bit further.
--
E.g. if someone scores 60% at a high school exam, is it impossible for anyone to be more than 67% smarter than this person at that subject?
Then what if you have another benchmark where GPT-3.5 scores 0% but GPT-4 scores 2%? Does that make GPT-4 infinitely better?
E.g. supposedly there was one LLM that scored 2% on FrontierMath.
Being able to solve self-contained exercises is obviously very challenging, but there are other kinds of skills that may or may not be related and that have to be mastered as well.
Not really. It would just need to do more steps in a sequence than current models do. And that number has been going up consistently. So it would be just another narrow AI expert system. It is very likely that the benchmark will be solved, but it is very unlikely that the system solving it will be generally capable in the sense most researchers understand AGI today.
"All exponents in nature are s-curves" isn't really useful unless you can point at the limiting factors more precisely than "total energy in observable universe" or something. And you definitely need more than "did you know that exponents are really s-curves?" to even assume we're anywhere close to the inflection point.
The people making them are specialists attempting to apply their skills to areas unrelated to LLM performance, a bit like a sprinter making a training regimen for a fighter jet.
What matters is the data structures that underlie the problem space - graph traversal. First, finding a path between two nodes; second, identifying the most efficient path; and third, deriving implicit nodes and edges based on a set of rules.
Currently, all LLMs are so limited that they struggle with journeys longer than four edges, even when given a full itinerary of all edges in the graph. Until they can consistently manage a number of steps greater than what is contained in any math proof in the validation data, they aren’t genuinely solving these problems; they’re merely regurgitating memorized information.
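For concreteness, here is a minimal sketch (my own illustration, not from the comments above) of those three tasks on an explicit edge list, i.e. the kind of "full itinerary" an LLM would be handed:

```python
# A minimal sketch of the three graph tasks: find a path, find the most efficient
# (fewest-edge) path, and derive implicit edges from a rule (transitivity here).
from collections import deque

def find_shortest_path(edges: set[tuple[str, str]], start: str, goal: str):
    """BFS over a directed edge list; returns a fewest-edge path or None."""
    adj: dict[str, list[str]] = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def close_under_transitivity(edges: set[tuple[str, str]]) -> set[tuple[str, str]]:
    """Derive implicit edges under one rule: (a, b) and (b, c) imply (a, c)."""
    closed, changed = set(edges), True
    while changed:
        changed = False
        for a, b in list(closed):
            for c, d in list(closed):
                if b == c and (a, d) not in closed:
                    closed.add((a, d))
                    changed = True
    return closed
```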
This is probably not the case for LLMs in the o1 series and possibly Claude 3.5 Sonnet. Have you tested them on this claim?
I found that they are good at logic and math problems but still hallucinate. I didn’t try to stretch test them with hard problems though.
Under natural deduction, all proofs are subtrees of the graph induced by the inference rules from the premises. Right now LLMs can't even complete a linear proof if it gets too long, even when given all the induced vertices.
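To make "linear proof" concrete, here is a tiny example (my own illustration, assuming Lean 4 syntax): a proof that is just a chain of modus ponens steps, one edge of the induced graph per step. Lengthen the chain and, per the claim above, models start dropping steps.

```lean
-- A linear proof: p → q → r → s, followed edge by edge from the premise p.
example (p q r s : Prop)
    (hpq : p → q) (hqr : q → r) (hrs : r → s) (hp : p) : s :=
  hrs (hqr (hpq hp))
```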
It’s not about whether a random human can solve them. It’s whether AI, in general, can. Humans, in general, have proven to be able to solve them already.
> It will be a useful benchmark to validate claims by people like Sam Altman about having achieved AGI.
I think it is possible to achieve AGI without creating an AGI that is an expert mathematician, and that it is possible to create a system that can do FrontierMath without achieving AGI. I.e. I think failure or success at FrontierMath is orthogonal to achieving AGI (though success at it may be a step on the way). Some humans can do it, and some AGIs could do it, but people and AI systems can have human-level intelligence without being able to do it. OTOH I think it would be hard to claim you have ASI if it can't do FrontierMath.
I think we have to keep in mind that humans have specialized. Some do law. Some do math. Some are experts at farming. Some are experts at dance history. It's not the average AI vs the average human. It's the best AI vs the best humans at one particular task.
The point of FrontierMath is that we can summon at least one human in the world who can solve each problem. No AI can, in 2024.
If you have a single system that can solve any problem any human can, I'd call that ASI, as it's way smarter than any human. It's an extremely high bar, and before we reach it I think we'll have very intelligent systems that can do more than most humans, so it seems strange not to call those AGIs (they would meet the definition of AGI on Wikipedia [1]).
[1] https://en.wikipedia.org/wiki/Artificial_general_intelligenc...
I don't think that's the popular definition.
AGI = solve any problem any human can. In this case, we've not reached AGI since it can't solve most FrontierMath problems.
ASI = intelligence far surpasses even the smartest humans.
If the definition of AGI is that it's more intelligent than the average human, you can argue that we already have AGI today. But no one thinks we have AGI today. Therefore, AGI is not Claude 3.5.
Hence, I think the most acceptable definition for AGI is that it can solve any problem any human can.
People have all sorts of definitions for AGI. Some are more popular than others, but at this point there is no one true definition. Even OpenAI's definition is different from what you have just said. They define it as "highly autonomous systems that outperform humans in most economically valuable tasks".
>AGI = solve any problem any human can.
That's a definition some people use, yes, but a machine that can solve any problem any human can is by definition super-intelligent and super-capable, because there exists no single human who can solve any problem any human can.
>If the definition of AGI is that it's more intelligent than the average human, you can argue that we already have AGI today. But no one thinks we have AGI today.
There are certainly people who do, some of whom are pretty well respected in the community, like Norvig.
https://www.noemamag.com/artificial-general-intelligence-is-...
We don't need every human in the world to learn complex topology math like Terence Tao. Some need to be farmers. Some need to be engineers. Some need to be kindergarten teachers. When we need someone to solve those problems, we can call Terence Tao.
When AI needs to solve those problems, it can't do it without humans in 2024. Period.
That's the whole point of this discussion.
The definition of ASI historically is that it's an intelligence that far surpasses humans - not at the level of the best humans.
It doesn't have much to do with need. Not every human can become that capable regardless of how much need or time you allocate for them. And some humans are head and shoulders above their peers in one field but come up a bit short in another closely related one they've sunk a lot of time into.
Like I said, arguing about a one true definition is pointless. It doesn't exist.
>The definition of ASI historically is that it's an intelligence that far surpasses humans - not at the level of the best humans.
A machine that is expert level in every single field would likely far surpass the output of any human very quickly. Yes, there might exist intelligences that are significantly more "super", but that is irrelevant. Competence, like generality, is a spectrum. You can have two super-human intelligences with a competence gap between them.
ASI is when a system is able to develop a much better version of itself and then iteratively go past all of that.
There are currently no tools that let LLMs do this, and no one is building the tools for answering open-ended questions.
An AI that can be onboarded to a random white collar job, and be interchangeably integrated into organisations, surely is AGI for all practical purposes, without eliminating the value of 100% of human experts.
Source?
Also, AIMOv2 is running stage 2 of their math challenge; they are now at "national olympiad" level of difficulty, with a new set of questions. Last year's winner (27/50 points) got 2/50 on the new set. In the first 3 weeks of the competition the top score is 10/50 on the new set, mostly with Qwen2.5-Math. Given that this is a purposely made new set of problems, and according to the organizers "made to be AI hard", I'd say the regurgitation stuff is getting pretty stale.
Also also, the fact that Claude 3.5 can start coding in an invented language with ~20-30k tokens of "documentation" about that invented language is also some kind of proof that the stochastic parrots here are the dismissers.
That's exactly what countless techniques related to chain of thought do.
People have found that even letting LLMs generate gibberish tokens produces better final outputs, which isn't a surprise when you realise that the only way an LLM can do computation is by outputting tokens.
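For concreteness, a minimal sketch of the contrast (the `complete` function is a hypothetical stand-in for any text-completion API, not a real client): the chain-of-thought variant simply lets the model spend tokens, its only working memory, before committing to an answer.

```python
# `complete(prompt) -> str` is a hypothetical stand-in for any LLM completion API.

def answer_directly(complete, question: str) -> str:
    # One shot: the model must produce the result with no intermediate tokens.
    return complete(f"{question}\nAnswer with only the final result.")

def answer_with_chain_of_thought(complete, question: str) -> str:
    # Let the model emit intermediate steps first; those generated tokens are the
    # only scratch space it has, so spending tokens is how it "computes".
    reasoning = complete(f"{question}\nLet's think step by step.")
    return complete(f"{question}\n{reasoning}\nTherefore, the final answer is:")
```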
People call it agentic AI, but that's a term without a definition.
Needless to say better LLMs help with my work, the same way that a stronger horse makes plowing easier.
The difference is that, unlike the horse breeders (e.g. Anthropic and OpenAI), I want to get to the internal combustion engine and the tractor.
We should evaluate LLMs on text from beyond their knowledge cutoff date, by computing their per-byte perplexity or per-byte compression ratio. There's a deep theoretical connection between compression and learning.
The intuition here is that being able to predict the future of science (or any topic, really) is indicative of true understanding. Slightly more formally: When ICLR 2025 announces and publishes the accepted papers, Yoshua Bengio is less surprised/perplexed by what's new than a fresh PhD student. And Terence Tao is less surprised/perplexed by what will be proven in math in the next 10 years than a graduate student in a related field.
This work has it right: https://ar5iv.labs.arxiv.org/html//2402.00861
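As a minimal sketch of that evaluation (assuming the Hugging Face transformers API; the model name and file path below are placeholders): score post-cutoff text by bits per byte, which doubles as a per-byte compression-ratio proxy, and per-byte perplexity is just 2 raised to that number.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def bits_per_byte(model_name: str, text: str) -> float:
    """Average bits per UTF-8 byte the model needs to encode `text` (lower = less surprised)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids  # text must fit the context window
    with torch.no_grad():
        # Hugging Face returns mean cross-entropy (in nats) over the predicted tokens.
        mean_nats = model(input_ids=ids, labels=ids).loss.item()
    total_nats = mean_nats * (ids.shape[1] - 1)  # undo the per-token averaging
    return total_nats / math.log(2) / len(text.encode("utf-8"))

# e.g. compare two models on abstracts published after their training cutoff:
# print(bits_per_byte("gpt2", open("post_cutoff_abstracts.txt").read()))
```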
> [Not even 2%]
> Abstract: We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics -- from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end questions, multiple days. FrontierMath uses new, unpublished problems and automated verification to reliably evaluate models while minimizing risk of data contamination. Current state-of-the-art AI models solve under 2% of problems, revealing a vast gap between AI capabilities and the prowess of the mathematical community. As AI systems advance toward expert-level mathematical abilities, FrontierMath offers a rigorous testbed that quantifies their progress.
- "TheoremQA: A Theorem-driven [STEM] Question Answering dataset" (2023) https://github.com/TIGER-AI-Lab/TheoremQA
Put a bit more poetically: a Prolog benchmark can adequately test an LLM’s ability to create proofs in Euclidean geometry. But it will never test an LLM’s ability to reason whether a given axiomatization of geometry is actually a reasonable abstraction of physical space. And if our LLMs can do novel Euclidean proofs but are not able to meta-mathematically reason about novel axioms, then they aren’t really using intelligence. Formal logical puzzles are only a small subset of logical reasoning.
Likewise, when Euclidean proofs were a fun pastime among European upper-classes, the real work was being done by mathematicians who built new tools for projective and analytic geometry. In some sense our LLM benchmarks are focusing on the pastime and not the work. But in another sense LLMs are focusing on the tricksy and annoying sides of actually proving things, leaving humans free to think about deeper problems. So I’m not skeptical of LLMs’ utility in mathematical research, but rather the overinflated (and investor-focused) claims that this stuff is a viable path to AGI.
I don't think it's a requirement that a system claiming to be AGI should be able to solve these problems, 99.99% of humans can't either.
Saying that you need to solve these to be considered AGI is ridiculously strict.
I’d say these problems strongly encourage that sort of behavior.
I’m also someone who thinks building abilities like this into LLMs would broadly benefit the LLMs and the world, because I think this stuff generalizes. But even if not, it would be hard to say that an LLM that could score 80% on this benchmark would not be useful to a research mathematician. Terence Tao’s dream is something like this that can hook up to Lean, leaving research mathematicians as editors, advisors, and occasional workers on the really hard parts while the rest is automated and provably correct. There’s no doubt in my mind that a high-scoring LLM on this benchmark would be helpful toward that vision.
Also, coming up with good problems is an art in its own right; the Soviets were famous for institutionalizing anti-Semitism via special math puzzles given to Jewish applicants in Moscow University entrance exams. Those questions were constructed to be hard to solve yet to have elementary solutions, in order to deflect criticism.