~$3400 per single task to meet human performance on this benchmark is a lot. Also, the bullets are labeled "ARC-AGI-TUNED", which makes me think they did some undisclosed amount of fine-tuning (e.g. via the API they showed off last week), so even more compute went into this result.
We can compare this roughly to a human doing ARC-AGI puzzles, where a human will take (high variance, in my subjective experience) between 5 seconds and 5 minutes to solve a task. So I'd argue a human is at $0.03-$1.67 per puzzle at $20/hr, and their document quotes an average mechanical turker at $2 per task.
Going the other direction: I am interpreting this result as saying that human-level reasoning now costs roughly $41k/hr to $2.5M/hr with current compute.
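For anyone who wants to sanity-check those numbers, here's the rough arithmetic, with the reported ~$3400/task, the 5 s to 5 min human solve time, and a $20/hr wage as the stated assumptions (purely back-of-envelope, not anything official):

    # Back-of-envelope: o3 high-compute vs. a human on ARC-AGI puzzles.
    # Assumptions: $3400/task (reported), 5 s - 5 min per task for a human, $20/hr wage.
    o3_cost_per_task = 3400.0
    human_secs_fast, human_secs_slow = 5.0, 300.0
    human_wage_per_hr = 20.0

    # Human cost per puzzle: ~$0.03 to ~$1.67
    human_cost_fast = human_wage_per_hr * human_secs_fast / 3600
    human_cost_slow = human_wage_per_hr * human_secs_slow / 3600

    # Machine cost per "human-equivalent hour": ~$41k to ~$2.4M
    machine_hr_low = o3_cost_per_task * 3600 / human_secs_slow
    machine_hr_high = o3_cost_per_task * 3600 / human_secs_fast

    print(f"human: ${human_cost_fast:.2f}-${human_cost_slow:.2f} per puzzle")
    print(f"machine: ${machine_hr_low:,.0f}-${machine_hr_high:,.0f} per hour of human-speed solving")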
Super exciting that OpenAI pushed the compute out this far so we could see the O-series scaling continue and intersect humans on ARC; now we get to work towards making this economical!
So, considering that the $3400/task system can't compete with a STEM college grad yet, we still have some room (but it is shrinking; I expect even more compute will be thrown at this, and we'll see these barriers broken in the coming years).
Also, some other back of envelope calculations:
The gap in cost between o3 High and average mechanical turkers (humans) is roughly 10^3. Pure GPU cost improvement (~doubling every 2-2.5 years) puts us at 20-25 years to close it.
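A quick sketch of that estimate, taking the ~10^3 gap and the 2-2.5 year doubling period above as given:

    import math

    gap = 1e3                      # o3 High vs. average mechanical turker, cost ratio
    doublings = math.log2(gap)     # ~10 doublings needed
    print([round(doublings * t) for t in (2.0, 2.5)])   # -> [20, 25] years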
The question now is: can we close this "to human" gap (10^3) quickly with algorithms, or are we stuck waiting 20-25 years for GPU improvements? (I think it feels obvious: this is new technology, things are moving fast, and the chance for algorithmic innovation here is high!)
I also personally think that we need to adjust our efficiency priors and start looking not at "humans" as the bar to beat, but at theoretical computable limits (which show much larger gaps, ~10^9-10^15, for modest problems). Though it may simply be the case that tool/code use + AGI at near-human cost covers a lot of that gap.
You can scale them up and down at any time, they can work 24/7 (including holidays) with no overtime pay and no breaks, they need no corporate campuses, office space, HR personnel or travel budgets, you don't have to worry about key employees going on sick/maternity leave or taking time off the moment they're needed most, they won't assault a coworker, sue for discrimination or secretly turn out to be a pedophile and tarnish the reputation of your company, they won't leak internal documents to the press or rage quit because of new company policies, they won't even stop working when a pandemic stops most of the world from running.
> deep learning doesn't allow models to generalize properly to out-of-distribution data—and that is precisely what we need to build artificial general intelligence.
I think even (or especially) people like Altman accept this as a fact. I do. Hassabis has been saying this for years.
The foundational models are just a foundation. Now start building the AGI superstructure.
And this is also where most of the remaining human intellectual energy is going now.
These statistical models don’t generalize well to out of distribution data. If you accept that as a fact, then you must accept that these statistical models are not the path to AGI.
I fail to see the difference between AI-employment-doom and other flavors of Luddism.
As AI gets more prevalent, it'll drive costs down for the companies supplying these services, so the former employees of said companies will be paid less, or not at all.
So, tell me, how will paying fewer people less money drive their standard of living upwards? I can understand the leisure time, because when you don't have a job, all day is leisure time. But you'll need money for that. So will these companies fund the masses, via the government, with a Universal Basic Income, so that these people can both live a borderline miserable life and keep funding the very companies that squeeze them more and more?
Who cares? A rising tide lifts all boats. The wealthy people I know all have one thing in common: they focused more on their own bank accounts than on other people's.
So, tell me, how will paying fewer people less money drive their standard of living upwards?
Money is how we allocate limited resources. It will become less important as resources become less limited, less necessary, or (hopefully) both.
Money is also how we exert power and leverage over others. As inequality increases, it enables the ever wealthier minority to exert power and therefore control over the majority.
The problem isn't the money. The problem is the power.
Humans are interesting creatures. Many of them have no conscience and don't understand the notion of ethics or of "not doing something because it's wrong to begin with". In my experience, people in the US especially think "if it's not illegal, then I can and will do this", which is wrong on many levels.
Many European people are similar, but bigger governments and a harsher justice system make them more orderly, and happier in general. Yes, they can't carry guns, but they don't need to in the first place. Yes, they can't own Cybertrucks, but they can walk or use an actually working mass transportation system instead.
Plus, proper governments have checks and balances. A government can't rip people off for services the way corporations can, most of the time. Many of the things Americans are afraid of (social health services for everyone) make life more just and tolerable for all parts of the population.
Big government is not a bad thing; uncontrollable government is. We're entering the era of "corporate-pleasing uncontrollable governments", and this will be fun in a tragic way.
This comment is a festival of imprecise stereotypes.
Gun laws vary widely across Europe, as does public safety (both the real thing and the perception of it; if extra rapes are avoided only because women don't venture outside after dark, the city isn't really safe), as does the overall level of personal happiness, as does the functionality of public transport systems.
And the quality of public services doesn't really track the size of the government even in Europe that well. Corruption eats a lot of the common pie.
I might be overgeneralizing, but I won't accept the "festival of imprecise stereotypes" claim. This is what I've got from working with many people from many countries in Europe for close to two decades. I travel at least twice a year and basically live with them for short periods of time. So this is not from reading some questionable social media sites and being an armchair sociologist.
> Gun laws vary widely across Europe...
Yet the USA has roughly 3x the armed homicide rate of its closest follower in the developed world, and the USA is the "leader" of the pack. 24-something vs. 8-something.
> as does public safety
Every city has safe and unsafe areas. Even your apartment has unsafe areas.
> as does the overall level of personal happiness, as does the functionality of public transport systems.
Of course, but even if DB has a two-hour delay because of a hold-up at the Swiss border, I can board a Eurostar and casually see another country for peanuts. Happiness changes for a plethora of reasons, like Swedes' short winter daylight, or an economic downturn elsewhere.
> And the quality of public services doesn't really track the size of the government even in Europe that well. Corruption eats a lot of the common pie.
Sadly, corruption in Europe is on the rise compared to the last decade; I can see that. However, at least many countries have working social security systems, the NHS sadly not being one of them.
Please, what cities? You are just making up rape stats. That makes you the bigger idiot here.
Oh yeah, so much corruption, yet I literally enjoy Zagreb more than any US city I have been to, and it's not even special. If this is going to be the shittiest argument ever, there's my anecdotal rebuttal.
Right, so the answer is not to make that bad government bigger, the answer is to replace it with a good government. Feeding a cancer tumor doesn't make it better.
Bad government (where by "bad" I mean serving the interests of the wealthy few over the masses) is bad regardless of its size.
If you believe in supply-side/trickle-down economics, you might use the opposite definition of "bad", in which case shrinking the parts of government that restrain corporations through regulation (protecting the masses), or that pay for seniors not to end up in total destitution (Social Security/Medicare), would look like an improvement.
The size of the government is less relevant than what it is doing, and whether you agree with that.
The employees of the government and those elected are not seen as the ruling class by progressives, but just normal people that have the qualifications and are employed to manage the government on behalf of the people.
It's important therefore that those elected and put in charge of the government are in a position where they don't have the power to benefit themselves or their friends/family, but are in a position where they can wield power to benefit the people who hired them for the job (their constituents), and that if they fail to do so, they can get replaced.
The ultimate exercise of government power is keeping someone locked in a tiny cell for the rest of their life where their bed is next to their toilet and you make them beg a faceless bureaucracy that has no accountability annually for some form of clemency via parole, all while the world and their family moves on without them.
The modern political binary was originally constructed in the ashes of the French Revolution, as the ruling royalty, nobility, and aristocracy recoiled in horror at the threat that masses of angry poor people now posed. The left wing thrived on personal liberty and tearing down hierarchies, pursuing "liberty, equality, fraternity". The right wing perceived social hierarchy as a foundational good, saw equality as anarchy, and saw order (and respect for property) as far more important than freedom. For a century they experienced collective flashbacks to French Revolutionaries burning fine art for firewood in an occupied chateau.
Notably, it has not been a straight line of Social Progress, nor a simple Hegelian dialectic, but a turbulent winding path between different forces of history that have left us with less or more personal liberty in various eras. But... well... you have to be very confused about what they actually believe now or historically to understand progressives or leftists as tyrants who demand hierarchy.
That confusion may come from listening to misinformation from anticommunists, a particular breed of conservative who for the past half century have asserted that ANY attempt to improve government or enhance equality was a Communist plot by people who earnestly wanted Soviet style rule. One of those anticommunists with a business empire, Charles Koch, funded basically all the institutions of the 'libertarian' movement, and later on much of the current GOP's brand of conservatism.
You've literally reversed the meaning of the term "progressive" by replacing it with the meaning of the term "oligarchic".
Progressives argue for less invasion by government in our personal lives, and less unequal distribution of wealth and power. They are specifically opposed to power being delivered to a ruling class.
> The problem isn't the money. The problem is the power
These are nearly inseparable in current (and frankly most past) societies. Pretending that they are not is a way of avoiding practical solutions to the problem of the distribution of power.
Those damn authoritarians, stripping the power from the oligarchs by massively taxing the rich and defunding the police. The bastards.
Ultimately it was the 'oligarchs' who argued in favor of the progressive agenda. Wall Street created the third central bank, the US Federal Reserve. The AMA closed all of the mutual aid societies and their hospitals. Railroad barons lobbied for subsidies and price controls to eliminate their competitors. Woodrow Wilson declared war on Germany: "The world must be made safe for democracy."
Of course, no would-be authoritarian claims more power without claiming that they are doing it for the greater good or to attack the rich classes. The Progressive Era was a smorgasbord of special-interest handouts and grants to cartels, all of it lobbied for by oligarchs.
Is this common? People think "progressive" means "complete government control"?
Progressives support regulations to prevent both public and private entities from becoming too powerful. It's not like they want to give the government authoritarian control lol.
In practice there's always a slippery slope: can wealthy people integrate themselves into that power structure through lobbying, media control, weakened checks and balances, corruption and lack of transparency, etc.? When that slips, we stop calling it a free democracy, and it becomes an oligarchy, or a plutocracy, or an illiberal democracy.
The more we regulate to get money out of politics, the more good people will have a shot at being elected.
These are all common progressive values. No true progressive supports wealthy unethical politicians gaining more power. Anyone telling you so is not speaking in good faith, or they are misinformed.
Separately, is it "rising tide lifts all boats" or "pull yourself up by your bootstraps" that drives the common person's progress? You seem confused which metaphor to apply while handwaving the discussion away.
The Luddites asked a similar question. The ultimate answer is that it doesn't matter that much who controls the means of production, as long as we have access to its fruits.
As long as manual labor is in the loop, the limits to productivity are fixed. Machines scale, humans don't. It doesn't matter whether you're talking about a cotton gin or a warehouse full of GPUs.
Separately, is it "rising tide lifts all boats" or "pull yourself up by your bootstraps" that drives the common person's progress? You seem confused which metaphor to apply while handwaving the discussion away.
I haven't invoked the "bootstrap" cliché here, have I? Just the boat thing. They make very different points.
Anyway, never mind the bootstraps: where'd you get the boots? Is there a shortage of boots?
There once was a shortage of boots, it's safe to say, but automation fixed that. Humans didn't, and couldn't, but machines did. Or more properly, humans building and using machines did.
That mattered a lot in communist places; we saw it fail. Same thing with most authoritarian regimes today; it's a crapshoot. You simply can't entrust a small group with full control of the means of production and expect them to make it efficient, cheap, innovative, sustainable, and affordable.
Apparently people who are not wealthy enough to buy a boat and are afraid of drowning care about this a lot. Also, for whom does the tide rise? Not for the data workers who label data for these systems for peanuts, or the people who lose their jobs because they can be replaced with AI, or the Amazon drivers who are auto-fired by their in-car HAL9000 units, which label behavior however they see fit.
> The wealthy people I know all have one thing in common: they focused more on their own bank accounts than on other people's.
So the amount of money they have is much more important than everything else. That's greed, not wealth, but OK. I don't feel like dying on the hill of greedy people today.
> Money is how we allocate limited resources.
...and the wealthy people (you or I or others know) are accumulating amounts of it which they can't make good use of personally, I will argue.
> It will become less important as resources become less limited, less necessary, or (hopefully) both.
How will we make resources less limited? Recycling? Reducing population? Creating out of thin air?
Or, how will they become less necessary? Have we invented materials which are more durable and cheaper to produce, and have we started selling them to people for less? I don't think so.
See, this is not a developing-country problem; it's a developed-country problem. Stellantis is selling inferior products for more money while reducing its workforce, closing factories, and replacing metal parts with plastics, its CEO took a $40MM bonus [0], and now he has apparently resigned after all those shenanigans.
So, no. Nobody is making things cheaper for people. Everybody is after money to raise their own tide.
So, you're delusional. Nobody is thinking about your bank account, that's true. This is why resources won't become less limited or less necessary: all the surplus is accumulating with people who are focused on their own bank accounts more than anything else.
We've already done it, as evidenced by the fact that you had the time and tools to write that screed. Your parents probably didn't, and your grandparents certainly didn't.
No, my parents had that. Instead, they were chatting on the phone. My grandparents already had that too. They just chatted at the hall in front of the house with their neighbors.
We don't have time. We are just deluding ourselves. While our lives are technologically better, and we live longer, our lives are not objectively healthier and happier.
Heck, my colleagues join teleconferences from home with their kids' voices in the background and drying clothes visible, hidden only by the Gaussian blur or fake background provided by the software.
How do they have more time to do more things? They still work 8 hours a day, with the occasional overtime.
Things have changed and evolved, but evolution and change don't always bring progress. We have progressed in other areas, but justice, living conditions, and wealth are not on that list. I certainly can't buy a house just because I want one, like my grandparents could, for example.
From what I understand of history, while industrial revolutions have generally increased living standards and employment in the long term, they have also caused massive unemployment/starvation in the short term. In the case of textiles, I seem to recall that it took ~40 years for employment to return to its previous level.
I don't know about you guys, but I'm far from certain that I can survive 40 years without a job.
The other things you state are not even close.
First, lowered employment for X years does not imply one cannot get a job for X years; that's simply fear-mongering. Unemployment over that period seems to have fluctuated very little, and massive external economic issues (wars with Napoleon and the US, changing international fortunes) were the causes, not Luddites.
Next, there was inflation and unemployment during the TWO years surrounding the Luddites, 1810-1812 (starting right before the Luddite movement), due to wars with Napoleon and the US [1]. Somehow attributing this to tech increases or Luddites is numerology of the worst sort.
If you look at the academic literature about the economy of the era, such as [2] (read it on scihub if you must), you'll find there was incredible population growth, and that wages grew even faster. While many academics at the time thought all this automation would displace workers, those academics were forced to admit they were wrong. There's plenty of literature on this; simply dig through Google Scholar.
As for starvation in this case, I can find no "massive starvation". [3], for example, points out that "Among the industrial and mining families, around 18 per cent of writers recollected having experienced hunger. In the agricultural families this figure was more than twice as large — 42 per cent".
So yes there was hunger, as there always had been, but it quickly reduced due to the industrial revolution and benefited those working in industry more quickly than those not in industry.
[1] https://en.wikipedia.org/wiki/Luddite#:~:text=The%20movement....
My bad for "massive starvation", that's clearly a mistake, I meant to write something along the lines of "massive unemployment – and sometimes starvation". Sadly, too late to amend.
Now, I'll admit that I don't have my statistics at hand. I quoted them from memory from, if I recall correctly, _Good Economics for Hard Times_. I'm nearly certain about the ~40 years, but it's entirely possible that I confused several parts of the industrial revolution. I'll double-check when I have an opportunity.
The availability of cheaply priced smartphones and cellular data plans has absolutely made being homeless suck less.
As you noted though, a home would probably be a preferable alternative.
The problem is that the preferable option (housing) won't happen because unlike a smartphone, it requires that land be effectively distributed more broadly (through building housing) in areas where people desire to live. Look at the uproar by the VC guys in Menlo Park when the government tried to pursue greater housing density in their wealthy hamlet.
It also requires infrastructure investment which, while it has returns for society at large, doesn't have good returns for investors. Only government makes those kinds of investments.
Better to build a wall around the desirable places, hire a few poorer-than-you folks as security guards, and give the other people outside your wall ... cheap smartphones to sate themselves.
Perhaps there is a theory in which productivity gains increase the standard of living for everyone; however, that is not the lived reality for most people of the working classes.
If productivity gains are indeed increasing standards of living for everyone, they certainly do not increase evenly: the standard-of-living gains for the working poor are at best marginal, while the gains for the already richest of the rich are astronomical.
Not if you count the global poor: the global poor's standard of living has increased tremendously over the past 30 years.
Of course, any graph can show whichever stat is convenient for the message; that doesn't necessarily reflect the lived reality of individual members of the global poor. And as I recall, most standard-of-living improvements for the global poor came in the decades after decolonization, from the 1960s to the 1990s, when infrastructure was being built that actually served people's needs, as opposed to the resource extraction of the decades before. If Hans Rosling had said in 2007 that the standard of living had improved tremendously in the past 30 years, he would have been correct, but not for the reason you gave.
The story of decolonization is that it was the right infrastructure, hospitals, water lines, sewage, garbage disposal plants, roads, harbors, airports, schools, etc., that improved the standard of living, not productivity gains. Case in point: the colonial period saw tremendous growth in productivity in the colonies, but the standard of living in the colonies quite often moved in the opposite direction, because the infrastructure only served resource extraction and the exploitation of the colonized.
https://blogs.worldbank.org/en/opendata/updated-estimates-pr...
For extreme poverty, progress has recently slowed down; the trend is still positive but very slow, and improvement there is needed.
This just isn’t true, necessarily. Productivity has gone up in the US since the 80s, but wages have not. Costs have, though.
What increases standards of living for everyone is social programs like public health and education. Affordable housing and adult-education and job hunting programs.
Not the rate at which money is gathered by corporations.
In 2012, Musk was worth $2 billion. He’s now worth 223 times that yet the minimum wage has barely budged in the last 12 years as productivity rises.
>Productivity gains of the last 40 years have been captured by shareholders and top elites. Working class wages have been flat...
Wages do not determine the standard of living. The products and services purchased with wages determine the standard of living. "Top elites" in 1984 could already afford cellular phones, such as the Motorola DynaTAC:
>A full charge took roughly 10 hours, and it offered 30 minutes of talk time. It also offered an LED display for dialing or recall of one of 30 phone numbers. It was priced at US$3,995 in 1984, its commercial release year, equivalent to $11,716 in 2023.
https://en.wikipedia.org/wiki/Motorola_DynaTAC
Unfortunately, touch screen phones with gigabytes of ram were not available for the masses 40 years ago.
Rather than a luxury, they've become an expensive, interest-bearing necessity for billions of human beings.
Warlords are still rich, but both money and war are flowing towards tech. You can get a piece of that pie if you're doing questionable things (adtech, targeting, data collection, brokering, etc.), but if you're a run-of-the-mill, normal person, your circumstances are getting harder and harder, because you're slowly being squeezed out of the system like toothpaste.
AI could theoretically solve production but not consumption. If AI blows away every comparative advantage that normal humans have then consumption will collapse and there won’t be any rich humans.
They're risky in that they fail in ways that aren't readily deterministic.
And would you trust your life to a self-driving car in New York City traffic?
Imagine you have a self-driving AI that causes fatal accidents 10 times less often than your average human driver, but when the accidents happen, nobody knows why.
Should we switch to that AI, and have 10 times fewer accidents and no accountability for the accidents that do happen, or should we stay with humans, have 10x more road fatalities, but stay happy because the perpetrators end up in prison?
Framed like that, it seems like the former solution is the only acceptable one, yet people call for CEOs to go to prison when an AI goes wrong. If that were the case, companies wouldn't dare use any AI, and we would basically degenerate to the latter solution.
Even temporary loss of the drivers license has a very high bar, and that's the main form of accountability for driver behavior in Germany, apart from fines.
Badly injuring or killing someone who themselves did not violate traffic safety regulations is far from guaranteed to cause severe repercussions for the driver.
By default, any such situation is an accident and at best people lose their license for a couple of months.
Right now it's a race to the bottom: who can get away with the worst service. So they're motivated only so far as being able to prevent bad press, etc.
The whole system is broken. Just take a look at the 41 countries with higher life expectancy.
As for the growing prevalence of the light truck, that is a harmful market dynamic stemming from the interaction of consumer incentives and poor public road use policy. The design of rules governing use of public roads is not within the domain of the market.
- Air Pollution
- Water Pollution
- Disposable Packaging
- Health Insurance
- Steward Hospitals
- Marketing Junk Food, Candy and Sodas directly to children
- Tobacco
- Boeing
- Finance
- Pharmaceutical Opiates
- Oral Phenylephrine to replace pseudoephedrine despite knowing a) it wasn’t effective, and b) it posed a risk to people with common medical conditions.
- Social Media engagement maximization
- Data Brokerage
- Mining Safety
- Construction site safety
- Styrofoam Food and Bev Containers
- The ITC terminal in Deer Park (read about the decades they spent spewing thousands of pounds of benzene into the air before the whole fucking thing blew up, using their influence to avoid addressing any of it, and how they didn’t have automatic valves, spill detection, fire detection, or sprinklers… in 2019.)
- Grocery store and restaurant chains disallowing cashiers from wearing masks during the first pandemic wave, well after we knew the necessity, because it made customers uncomfortable.
- Boar’s Head Liverwurst
And, you know, plenty more. As someone who grew up playing in an unmarked, illegal, non-access-controlled toxic waste dump in a residential area owned by a huge international chemical conglomerate, and who just had some cancer taken out of me last year, I’m pretty familiar with the various ways corporations are willing to sacrifice health and safety to bump up their profit margin. I guess ignoring that kids were obviously playing in a swamp of toluene, PCBs, waste firefighting chemicals, and all sorts of other things, on a plot not even within sight of the factory in the middle of a bunch of small farms, was just the cost of doing business.

As was my friend who, when he was in vocational high school, was welding a metal ladder above a storage tank in a chemical factory across the state. The plant manager assured the school the tanks were empty, triple-rinsed, and dry, but they exploded, blowing the roof off the factory and taking my friend with it. They were apparently full of waste chemicals and, IIRC, the manager admitted in court to knowing that. My friend says he remembers briefly waking up in the factory parking lot where he landed, and the next thing he remembers is waking up in extreme pain, wearing the compression gear he’d have to wear into his mid-twenties to keep his grafted skin on. Briefly looking into the topic will show how common this sort of malfeasance is in manufacturing.
The burden of proof is on people saying that they won’t act like the rest of American industry tasked with safety.
Just look back over the last 200 years, per capita GDP has grown 30 fold, life expectancy has rapidly grown, infant mortality has decreased from 40% to less than 1%. I can go on and on. All of this is really owing to rising productivity and lower poverty, and that in turn is a result of the primarily market-based process of people meeting each other's needs through profit-motivated investment, bargain hunting, and information dispersal through decentralized human networks (which produce firm and product reputations).
As for masks, the Cochrane Library, the gold standard in scientific reviews, did a meta-review on masks and COVID, and the author of the study concluded:
"it's more likely than not that they don't work"
https://edition.cnn.com/videos/health/2023/09/09/smr-author-...
The potential harm of extensive masking is not well-studied.
They may contribute to the increased social isolation and lower frequency of exercise that led to a massive spike in obesity in children during the COVID hysteria era.
And they are harmful to the development of the doctor-patient relationship:
https://ncbi.nlm.nih.gov/pmc/articles/PMC3879648/
Which does not portend well for other kinds of human relationships.
You can’t possibly say, in good faith, that you think this was legal, can you? Of course it wasn’t. Discharging some of the less odious things into the river, despite it running through a residential neighborhood about 500 feet downstream, was totally legal: the EPA permitted that, and while they far exceeded their allotted amounts, that was far less of a crime. Though it was funny to see one kid in my class who lived in that neighborhood right next to the factory ask a scientist they sent to give a presentation to our second-grade class why the snow in their back yard was purple near the pond (one thing they made was synthetic clothing dye). People used to lament runaway dogs returning home rainbow-colored. That was totally legal.

However, this huge international chemical conglomerate with a huge US presence routinely, secretively, and consistently broke the law by dumping carcinogenic, toxic, and ecologically disastrous chemicals there, and at three other locations, in the middle of the night. Sometimes when we played there, the stuff we had left lying around was moved to the edges and there were fresh bulldozer tracks in the morning, and we just thought it was from farm equipment. All of it was in residential neighborhoods without so much as a no-trespassing sign posted, let alone a chain-link fence, for decades, until the 90s, because they were trimming their bill for the legal and readily available disposal services they primarily used, and of course signs and chain-link fences would have raised questions. They correctly gauged that they could trade our health for their profit: the penalties and Superfund project cost were a tiny pittance compared to what that factory made them in that time.

Our incident was so common it didn’t make the news, unlike in Holbrook, MA, where a chemical company ignored the neighborhood kids constantly playing in old metal drums in a field near the factory, drums which contained things like hexavalent chromium, with the expected results. The company’s penalty? Well, they have to fund the cleanup. All the kids and moms that died? Well… boy, look at the great products that chemical factory made possible! Speaking of which:
> Just look back over the last 200 years, per…
Irrelevant “I heart capitalism” screed that doesn’t refute a single thing I said. You can’t ignore bad things people, institutions, and societies do because they weren’t bad to everybody. The Catholic priests that serially molested children probably each had a dossier of kind, generous, and selfless ways they benefited their community. The church that protected and enabled them does an incredible amount of humanitarian work around the world. Doesn’t matter.
> Masks
Come on now. Those business leaders had balls, but none of them were crystal. What someone said in 2023 has no bearing on what businesses did in 2020 based on the best available science, or on their motivations for doing it. Just like you can’t call businesses unethical for exposing their workers to friable asbestos when medicine generally thought it was safe, you can’t call businesses ethical for refusing to let their workers protect themselves, on their own dime no less, when medicine largely considered it unsafe.
Your responses to those two things in that gigantic pile of corporate malfeasance don’t really challenge anything I said.
That is exactly my point. Nobody would dispute that bad things would happen if you don't have laws against dumping pollution in the commons and enforce those laws.
>Doesn’t matter.
It does matter when we're trying to compare the overall effect of various economic systems. Like the anti-capitalist one versus the capitalist one.
>What someone said in 2023 has no bearing on what businesses did in 2020 based on the best available science and their motivations for doing it.
Well, that's an entirely different argument than the one you were making earlier. There was no evidence that masks outside of a hospital setting were a critical health necessity in 2021, and the intuition against allowing them for customer-facing employees proved sound in 2023, when comprehensive studies showed no health benefit from wearing them.
Ok, so you’re saying that because bad things would happen anyway then it doesn’t matter if it’s illegal? So you’re just going to ignore how much worse it would be if there were just no laws at all? Corporate scumbags will push any system to its limit and beyond, and if you change the limit, they’ll change the push. Just look at the milk industry in New York City before food adulteration laws took effect. The “bad things will happen anyway” argument makes total sense if you ignore magnitude. Which you can’t.
> anti capitalist
If you think pointing out the likelihood of corporate misbehavior is anti-capitalist, you’re getting your subjects confused.
> 2021
Anywhere else you want to move those goalposts?
I think what you're promoting is anti-capitalism, meaning a belief that imposing heavy restrictions, beyond simple laws against dumping on the commons, will make us better off. That totally discounts the enormous positive effect that private enterprise has on society, the incredible harm that can be done through crude attempts to regiment human behavior, and the corruption that such attempts can breed in the government bureaucracy.
See, "everything I want to do is illegal" for the flip side of this, where attempts to stop private sector abuse lead to tyranny:
https://web.archive.org/web/20120402151729/http://www.mindfu...
As for the company mask policies, those began to change in 2021 mostly, not 2020.
But we do know the culpability rests on the shoulders of the humans who decided the tech was ready for work.
Pretty bloody time for labor though. https://en.m.wikipedia.org/wiki/Haymarket_affair
The one minor risk I see is the car being too polite and getting effectively stuck in dense traffic. That's a nuisance, though.
Is there something about NYC traffic I'm missing?
Same with any company that employs AI agents. Sure, they can work 24/7, but the company (or the AI vendor) will be liable for every mistake they make. With humans, their fraud, their cheating, their deception can all be wiped off the company and onto the individual.
That's literally the point of liability insurance: to allow the routine use of technologies that rarely (but catastrophically) fail, by amortizing risk over time and population.
Claims still get made; liability insurance pays them.
I mean, this is an incredible moment from that standpoint.
Regarding the topic at hand, I think that there will always be room for humans for the reasons you listed.
But even replacing 5% of humans with AI's will have mind boggling consequences.
I think you're right that there are jobs that humans will be preferred for, for quite some time.
But I'm already using AI with success where I would previously have hired a human, and this is at this primitive stage.
With the leaps we are seeing, AI is coming for jobs.
Your concerns relate to exactly how many jobs.
And only time will tell.
But I think some meaningful percentage of the population -- even if just 5% of humanity -- will be replaced by AI.
Would you trust your life to a self-driving car in New York City traffic?
Also you still haven't answered my question.
Would you get in a self-driving car in a dense urban environment such as New York City? I'm not asking if such vehicles exist on the road.
And related questions: Would you get in one such car if you had alternatives? Would you opt to be in such a car instead of one driven by a person or by yourself?
I fortunately do have alternatives and accordingly mostly don't take cars at all.
But given the necessity/opportunity: Definitely. Being in a car, even (or especially) with a dubious driver, is much safer (at NYC traffic speeds) than being a pedestrian sharing the road with it.
And that's my entire point: Self-driving cars, like cars in general, are potentially a much larger danger to others (cyclists, pedestrians) than they are to their passengers.
That said, I don't especially distrust the self-driving kind – I've tried Waymo before and felt like it handled tricky situations at least as well as some Uber or Lyft drivers I've had before. They seem to have a lot more precision equipment than camera-only based Teslas, though.
Or just articulate things openly: we already insulate business owners from liability because we think it tunes investment incentives, and in so doing we have created social entities/corporate "persons"/a kind of AI that have different incentives than most human beings but are driving important social decisions. And they've supported some astonishing cooperation, which has helped produce things like the infrastructure on which we are having this conversation! But also, we have existing AIs of this kind who are already inclined to cut down the entire Amazonas forest for furniture production because it maximizes their function.
That's not just the future we live in, that's the world we've been living in for a century or few. On one hand, industrial productivity benefits, on the other hand, it values human life and the ecology we depend on about like any other industrial input. Yet many people in the world's premier (former?) democracy repeat enthusiastic endorsements of this philosophy reducing their personal skin to little more than an industrial input: "run the government like a business."
Unless people change, we are very much on track to create a world where these dynamics (among others) of the human condition are greatly magnified by all kinds of automation technology, including AI. It will probably start with limited liability for AIs and the companies employing them, possibly even statutory limits, though it's much more likely that wealthy businesses will simply be insulated by the sheer resources they have to make sure the courts can't hold them accountable, even where we still have a judicial system that isn't willing to play calvinball for cash or catechism (which, unfortunately, does not seem to include a Supreme Court majority).
In short, you and I probably agree that liability for AI is important, and limited liability for it isn't good. Perhaps I am too skeptical that we can pull this off, and being optimistic would serve everyone better.
Sure, if a business deploys it to perform tasks that are inherently low-risk, e.g. no client interface, no core-system connection, and low error impact, then the human performing those tasks is going to be replaced.
This reminds me of the school principal who sent $100k to a scammer claiming to be Elon Musk. The kicker is that she was repeatedly told that it was a scam.
https://abc7chicago.com/fake-elon-musk-jan-mcgee-principal-b...
Which makes LLMs far more dangerous than idiot humans in most cases.
And… I am really not sure punishment is the answer to fallibility, outside of almost kinky Catholicism.
The reality is these things are very good, but imperfect, much like people.
I'm afraid that's not the case. Literally yesterday I was speaking with an old friend who was telling us how one of his coworkers had presented a document with mistakes and serious miscalculations as part of some project. When my friend pointed out the mistakes, which were intuitively obvious just by critically understanding the numbers, the guy kept insisting "no, it's correct, I did it with ChatGPT". It took my friend doing the calculations explicitly and showing that they made no sense to convince the guy that it was wrong.
Let that sink in.
An LLM doesn’t make decisions. It generates text that plausibly looks like it made a decision, when prompted with the right text.
What the “LLMs don’t reason like we humans” crowd is missing is that we humans actually don’t reason as much as we would like to believe[0].
It’s not that LLMs are perfect or rational or flawless… it’s that their gaps in these areas aren’t atypical for humans. Saying “but they don’t truly understand things like we do” betrays a lack of understanding of humans, not LLMs.
0. https://home.csulb.edu/~cwallis/382/readings/482/nisbett%20s...
I don't think there's much of a difference in practice, though.
And when was the last time a support chatbot let you actually complain or bypass to a human?
Certain gullible people, who tend to listen to certain charlatans.
Rational, intelligent people wouldn't consider replacing a skilled human worker with an LLM that on a good day can compete with a 3-year-old.
You may see the current age as a litmus test for critical thinking.
Humans are also very confidently wrong a considerable portion of the time, particularly about anything outside their direct expertise.
LLMs fail in entirely novel ways you can't even fathom upfront.
Trust me, so do humans. Source: have worked with humans.
I'd say those are the goals we should be working toward. That's the failure mode we want to look at. We are humans.
Or - worse - there is no accessible code anywhere, and you have to prompt your way out of "I'm sorry Dave, I can't do that," while nothing works.
And a human-free economy does... what? For whom? When 99% of the population is unemployed, what are the 1% doing while the planet's ecosystems collapse around them?
Your concerns about mysterious AI code and system crashes are backwards. This approach eliminates integration bugs and maintenance issues by design. The generated TypeScript is readable, fully typed, and consistently updated across the entire stack when business logic changes.
If you're struggling with AI-generated code maintainability, that's an implementation problem, not a fundamental issue with code generation. Proper type safety and schema validation create more reliable systems, not less. This is automation making developers more productive - just like compilers and IDEs did - not replacing them.
The code works because it's built on sound software engineering principles: type safety, single source of truth, and deterministic generation. That's verifiable fact, not speculation.
What are you using for deterministic generation? The last I heard, even with temperature=0 there's non-determinism introduced by floating-point uncertainty/approximation.
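(For context on why temperature=0 alone isn't enough: floating-point addition isn't associative, so the same values reduced in a different order, e.g. across GPU threads, can produce slightly different logits, which can flip an argmax between near-tied tokens. A minimal illustration of just the float part, nothing model-specific:

    import random

    random.seed(0)
    # Values spanning many orders of magnitude, as activations can
    xs = [random.uniform(-1, 1) * 10 ** random.randint(-8, 8) for _ in range(100_000)]

    forward = sum(xs)
    backward = sum(reversed(xs))
    print(forward == backward)        # usually False
    print(abs(forward - backward))    # tiny, but enough to flip a near-tied argmax
)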
People talking like this also, in the back of their minds, like to think they'll be OK. They're smart enough to still be needed. They're human, but they'll be OK, even while working to make genAI outperform them at their own work.
I wonder how they'll feel about their own hubris when they struggle to feed their family.
The US can barely make healthcare work without disgusting consequences for the sick. I wonder what mass unemployment looks like.
There is absolutely no reason a programmer should expect to write code as they do now forever, just as ASM experts had to move on. And there's no reason (no precedent and no indicators) to expect that a well-educated, even-moderately-experienced technologist will suddenly find themselves without a way to feed their family - unless they stubbornly refuse to reskill or change their workflows.
I do believe the days of "everyone makes 100k+" are nearly over, and we're headed towards a severely bimodal distribution, but I do not see how, for the next 10-15 years at least, we can't all stay productive building the tools that will obviate our own jobs while we do them, and get comfortably retired in the meantime.
If innovation ceases, then AI is king - push existing knowledge into your dataset, train, and exploit.
If innovation continues, there's always a gap. It takes time for a new thing to be made public "enough" for it to be ingested and synthesized. Who does this? Who finds the new knowledge?
Who creates the direction and asks the questions? Who determines what to build in the first place? Who synthesizes the daily experience of everyone around them to decide what tool needs to exist to make our lives easier? Maybe I'm grasping at straws here, but the world in which all scientific discovery, synthesis, direction and vision setting, etc, is determined by AI seems really far away when we talk about code generation and symbolic math manipulation.
These tools are self driving cars, and we're drivers of the software fleet. We need to embrace the fact that we might end up watching 10 cars self operate rather than driving one car, or maybe we're just setting destinations, but there simply isn't an absolutist zero sum game here unless all one thinks about is keeping the car on the road.
AND even if there were, repeating doom and feeling helpless is the last thing you want. Maybe "we can all adapt and should try" isn't good truth, but it's certainly good policy.
Are you a politician? That's fantastic neoliberal policy, "alternativlos" ("without alternative") even: you can pretend that everybody can adapt, the same way you told the victims of your globalization policies to "learn how to code". We still need at least a few people for this "direction and vision setting", so it would just be naive doomerism to feel pessimistic about AGI. General intelligence doesn't talk about jobs in general, what an absurd idea!
Making people feel hopeless is the last thing you want, especially when it's true, especially if you don't want them to fight for the dignity you will otherwise deny them once they become economically unviable human beings.
This is interesting because it's both Oddly Specific and also something I have seen happen and I still feel really sorry for the company involved. Now that I think about it, I've actually seen it happen twice.
The wild part is that LLMs understand us way better than we understand them. The jump from GPT-3 to GPT-4 even surprised the engineers who built it. That should raise some red flags about how "predictable" these systems really are.
Think about it - we can't actually verify what these models are capable of or if they're being truthful, while they have this massive knowledge base about human behavior and psychology. That's a pretty concerning power imbalance. What looks like lower risk on the surface might be hiding much deeper uncertainties that we can't even detect, let alone control.
You can reply that AI researchers are smart and want to survive, so they are likely to invent alignment techniques that are better than the (deplorably inadequate) techniques that have been discussed and published so far, and I will reply that counting on their inventing these techniques in time is an unacceptable risk when the survival of humanity is at stake -- particularly as the outfit (namely the Machine Intelligence Research Institute) with the most years of experience in looking for an actually-adequate alignment technique has given up and declared that humanity's only chance is if frontier AI research is shut down because at the rate that AI capabilities are progressing, it is very unlikely that anyone is going to devise an adequate alignment technique in time.
It is fucked-up that frontier AI research has not been banned already.
As for employment, automation makes people more productive. It doesn't reduce the number of earning opportunities that exist. Quite the opposite, actually. As the amount of production increases relative to the human population, per capita GDP and income increase as well.
US Real GDP per capita is $70k, and has grown 2.4x since 1975: https://fred.stlouisfed.org/series/A939RX0Q048SBEA
US Real Median income per capita is $42k, and has grown 1.5x since 1975: https://fred.stlouisfed.org/series/MEPAINUSA672N
The divergence between the two matters a lot. It reflects the impacts of both technology-driven automation and globalization of capital. Generative AI is unlike any prior technology given its ability to autonomously create and perform what has traditionally been referred to as "knowledge work". Absent more aggressive redistribution, AI will accelerate the divergence between median income and GDP, and realistically AI can't be stopped.
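To put that divergence in one number, using only the growth multiples quoted above:

    gdp_per_capita_growth = 2.4      # since 1975, from the FRED series above
    median_income_growth = 1.5       # since 1975, from the FRED series above
    print(gdp_per_capita_growth / median_income_growth)   # ~1.6x divergence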
Powerful new technologies can reduce the number and quality of earning opportunities that exist, and have throughout history. Often they create new and better opportunities, but that is not a guarantee.
> We will have aligned AI helping us.
Who is the "us" that aligned AI is helping? Workers? Small business-people? Shareholders in companies that have the capital to build competitive generative AI? Perhaps on this forum those two groups overlap, but it's not the case everywhere.
https://www.brookings.edu/articles/sources-of-real-wage-stag...
There has been some increase in capital's share of income, but economic analyses show that the cause is rising rent and not any of the other usual suspects (e.g. tax cuts, IP law, technological disruption, regulatory barriers to competition, corporate consolidation, etc) (see Figure 3):
https://www.brookings.edu/wp-content/uploads/2016/07/2015a_r...
As for AI's effect on employment: it is no different at the fundamental level than any other form of automation. It will increase wages in proportion to the boost it provides to productivity.
Whatever it is that only humans can do, and is necessary in production, will always be the limiting factor in production levels. As new processes are opened up to automation, production will increase until all available human labor is occupied in its new role. And given the growing scarcity of human labor relative to the goods/services produced, wages (purchasing power, i.e. real wages) will increase.
For the typical human to be incapable of earning income, there has to be no unautomatable activity that a typical person can do that has market value. If that were to happen, we would have human-like AI, and we would have much bigger things to worry about than unemployment.
I think it's pretty unlikely that human-like AI will be developed, as I believe that both governments and companies would recognize that it would be an extremely dangerous asset for any party to attempt to own. Thus I don't see any economic incentive emerging to produce it.
> https://www.brookings.edu/wp-content/uploads/2016/07/2015a_r...
The paper referenced by that article excludes short-term asset (i.e. software) depreciation, interest, and dividends before calculating capital's share. If you ignore most of the methods of distributing capital's gains to its owners, it will appear as though capital (at this point scoped down to the company itself) has very little gains.
The paper (from 2015) goes on to predict that labor's share will rise going forward. With the brief exception of the COVID redistribution programs, it has done the opposite, and trended downwards over the last 10 years.
> I believe that both governments and companies would recognize that it would be an extremely dangerous asset for any party to attempt to own.
We can debate endlessly about our predictions about AIs impact on employment, but the above is where I think you might be too hopeful.
AI is an arms race. No other arms race in human history has resulted in any party deciding "that's enough, we'd be better off without this", from the bronze age (probably earlier) through to the nuclear weapons age. I don't see a reason for AI to be treated any differently.
>AI is an arms race.
What I'm trying to convey is that the types of capabilities that humans will always uniquely maintain are the type that it is not profitable for private companies to develop in AI, because they are traits that make the AI independent and less likely to follow instructions and act in a safe manner.
This is an assumption; how would you know if you have alignment? AGI could appear to be aligned, just as a psychopath studies and emulates well-behaved people. Imagine that at a scale we can't possibly understand. We don't really know how any of these emergent behaviors work; we just throw more data, compute, and fine-tuning at it, bake it, and then see.
There are so many ways we have misplaced confidence with what is essentially a system we don't really understand fully. We just keep anthropomorphizing the results and thinking "yeah, this is how humans think so we understand". We don't know for sure if that's true, or if we are being deceived, or making fundamental errors in judgement due to not having enough data.
I admire your optimism about the goals of all humans, but evidence tends to point to this not being the goal of all (or even most) humans, much less the people who control the AIs.
A rogue AI destroying humanity (whatever that means) is not a likely outcome. That's just movie stuff.
What is more likely is a modern oligarchy and serfdom that emerge as AI devalues most labor, with no commensurate redistribution of power and resources to the masses, due to capture of government by owners of AI and hence capital.
Are you sure people won't go along with that?
Do you mean they lie because of bad training data? Or because of ill intent? How can an LLM have intent if it’s a stateless feedforward model?
I know that I don't know a lot, but all of this sounds to me to be at least hypothetically possible if we really believe AGI is possible.
The "less risky to deploy" question will probably come up once it is closer to 10x the cost. Considering the model was specifically tuned for the test, and the test doesn't involve other real-world complexity, I would say we are actually 10^4 off in cost for real-world scenarios.
I would imagine that with better algorithms, tuning, and data we could knock 10^2 off the equation. That would still leave us with 10^2 of cost improvement needed from hardware: a minimum of 10 years.
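Same sketch as the earlier 10^3 estimate, just with the remaining 10^2 hardware gap (the doubling period is the assumption):

    import math

    hardware_gap = 1e2
    doublings = math.log2(hardware_gap)     # ~6.6 doublings
    print([round(doublings * t) for t in (1.5, 2.0, 2.5)])   # -> [10, 13, 17] years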
For AI example(s): Attribution is low, a system built without human intervention may suddenly fall outside its own expertise and hallucinate itself into a corner, everyone may just throw more compute at a system until it grows without bound, etc etc.
This "You can scale up to infinity" problem might become "You have to scale up to infinity" to build any reasonably sized system with AI. The shovel-sellers get fantastically rich but the businesses are effectively left holding the risk from a fast-moving, unintuitive, uninspected, partially verified codebase. I just don't see how anyone not building a CRUD app/frontend could be comfortable with that, but then again my Tesla is effectively running such a system to drive me and my kids. Albeit, that's on a well-defined problem and within literally human-made guardrails.
This is a big downside of AI, IMHO. Those offices need to be filled! ;-)
That one isn’t guaranteed. Many examples online of exfiltration attacks on LLMs.
The rhetoric about not needing people to do work is cartoonish. There is no sane explanation of how and why that would happen without, yet again, employing more people to take care of the advancements.
It's not like technology has brought less work-related stress; it has definitely increased it. Humans were not made to absorb technology at the pace it's being rolled out.
The world is fucked. Totally fucked.
The framing of the question misses the point. With electric lighting we can now work longer into the night. Yes, fewer people use and make candles. However, the second-order effects allow us to be more productive in areas we may not have previously considered.
New technologies open up new opportunities for productivity. The bank tellers displaced by ATMs can create value elsewhere. Consumers save time by not waiting in a queue, allowing them to use their time more economically. Banks have lower overhead, allowing more customers to afford their services.
When these people were made redundant, they may very well have gone on to make less money in another job (i.e. being less useful in an economic sense).
Digital banks
Cashless money transfer services
Self service
Modern farms
Robo lawn mowers
NVRs with object detection
I can go on forever
I agree the most interesting thing to watch will be cost for a given score more than maximum possible score achieved (not that the latter won't be interesting by any means).
I would expect 10 random people to do better than a committee of 10 people because 10 people have 10 chances to get it right while a committee only has one. Even if the committee gets 10 guesses (which must be made simultaneously, not iteratively) it might not do better because people might go along with a wrong consensus rather than push for the answer they would have chosen independently.
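That intuition is easy to make concrete with a toy model. Assuming (purely hypothetically) that each person independently solves a task with probability p, the chance that at least one of 10 independent attempts is right is 1 - (1 - p)^10:

```python
# Probability that at least one of n independent solvers gets the task right,
# assuming each succeeds independently with probability p (a toy assumption).
def p_any_correct(p: float, n: int = 10) -> float:
    return 1 - (1 - p) ** n

for p in (0.3, 0.5, 0.7):
    print(f"p={p}: any-of-10 = {p_any_correct(p):.3f}")
# p=0.3: any-of-10 = 0.972
# p=0.5: any-of-10 = 0.999
# p=0.7: any-of-10 = 1.000
```

A committee, by contrast, submits a single consensus answer, so its success rate stays close to whatever the consensus process yields rather than getting the any-of-10 boost.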
Do you have a sense of what kind of task this benchmark includes? Are they more “general” such that random people would fare well or more specialized (ie something a STEM grad studied and isn’t common knowledge)?
I have failed the real ARC AGI :)
This isn't to say groups always outperform their members on all tasks, just that it isn't unusual to see a result like that.
So ya, working on efficiency is important, but we're still pretty far away from AGI even ignoring efficiency. We need an actual breakthrough, which I believe will not be possible by simply scaling the transformer architecture.
Combined, we are currently at least 10^5 off in terms of cost efficiency. In reality I won't be surprised if we are closer to 10^6.
Energy Need: The average home uses 30 kWh/day; generating that over 5 peak sunlight hours requires 6 kW of output during those hours.
Multijunction Panels: Lab efficiencies are already at 47% (2023), and with multiple years of progress, 60% efficiency is probable.
Efficiency Impact: At 60% efficiency, panels generate 600 W/m², requiring 10 m² (e.g., 2 m × 5 m) to meet energy needs.
This size can fit on most home roofs, be mounted on a pole with stacked layers, or even be hung through an apartment window.
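A minimal sketch of the arithmetic behind those numbers (the 30 kWh/day usage, 5 peak sun hours, the standard 1,000 W/m² reference irradiance, and the hypothetical 60% efficiency are all assumptions carried over from the bullets above):

```python
# Back-of-envelope solar sizing from the figures above.
daily_use_kwh = 30.0      # assumed average home usage per day
peak_sun_hours = 5.0      # assumed hours of peak sunlight
insolation_w_m2 = 1000.0  # standard reference irradiance
efficiency = 0.60         # hypothetical future multijunction panel efficiency

required_power_kw = daily_use_kwh / peak_sun_hours      # 6.0 kW during peak sun
output_w_per_m2 = insolation_w_m2 * efficiency          # 600 W/m^2
area_m2 = required_power_kw * 1000 / output_w_per_m2    # 10.0 m^2

print(required_power_kw, output_w_per_m2, area_m2)  # 6.0 600.0 10.0
```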
Rooftop solar harnesses energy from the sun, which is powered by nuclear fusion—arguably the most effective nuclear reactor in our solar system.
What a joke
For example, my 50 sq m setup, at -29 deg latitude, generated your estimated 30 kWh/day output. I have panels with ~20% efficiency, suggesting that at 60% efficiency, the average household would only get to around half their energy needs with 10 sq m.
Yes, solar has the potential to drastically reduce energy costs, but even with free energy storage, individual households aren’t likely to achieve self sustainability.
In Europe it is around 6-7 kWh/day. This might increase with electrification of heating and transport, but probably nothing like as much as the energy consumption they are replacing (due to greater efficiency of the devices consuming the energy and other factors like the quality of home insulation.)
In the rest of the world the average home uses significantly less.
General inflation has outpaced the inflation of electricity prices by about 3x in the past 100 years. In other words, electricity has gotten cheaper over time in purchasing power terms.
And that's whilst our electricity usage has gone up by 10x in the last 100 years.
And this concerns retail prices, which includes distribution/transmission fees. These have gone up a lot as you get complications on the grid, some of which is built on a century old design. But wholesale prices (the cost of generating electricity without transmission/distribution) are getting dirt cheap, and for big AI datacentres I'm pretty sure they'll hook up to their own dedicated electricity generation at wholesale prices, off the grid, in the coming decades.
Not saying this will happen, but it's risky to rely on solar as the only long-term solution.
The heavy commodification of networking and compute brought about by the internet and cloud aligned with tech company interests in delivering services or content to consumers. There does not seem to be an emerging consensus that data center operators also need to provide consumer power.
I don't see Google, Amazon, Microsoft or any company paying $10 for something if building it themselves would cost them $5. Either the price difference will reach a point where investing in power production themselves makes sense, or the power companies will decrease prices. And all three have already been investing in power production themselves over the last decade, either to get better prices or for PR.
Then let's say that OpenAI brute forced this without any meta-optimization of the hypothesized search component (they just set a compute budget). This is probably low hanging fruit and another 2x in compute reduction. ($850)
Then let's say that OpenAI was pushing really hard for the numbers, was willing to burn cash, and so didn't give serious thought to hardware-aware distributed inference. This could be worth more than a 2x decrease in cost (we've seen better attention mechanisms deliver 10x reductions), but let's go with 2x for now. ($425)
So I think we've got about an 8x reduction in cost sitting there once Google steps up. This is probably 4-6 months of work flat out if they haven't already started down this path, but with what they've got with deep research, maybe it's sooner?
Then if "all" we get is hardware improvements we're down to what 10-14 years?
Since then there has been a tsunami of optimizations in the way training and inference is done. I don't think we've even begun to find all the ways that inference can be further optimized at both hardware and software levels.
Look at the huge models that you can happily run on an M3 Mac. The cost reduction in inference is going to vastly outpace Moore's law, even as chip design continues on its own path.
I'd hope we see more internal optimizations and improvements to the models. The idea that the big breakthrough is "don't spit out the first thought that pops into your head" seems obvious to everyone outside of the field, but guess what: it turned out to be a big improvement when the devs decided to add it.
It's obvious to people inside the field too.
Honestly, these things seem to be less obvious to people outside the field. I've heard so many uninformed takes about LLMs not representing real progress towards intelligence (even here on HN of all places; I don't know why I torture myself reading them), that they're just dumb memorizers. No, they are an incredible breakthrough, because extending them with things like internal thoughts will so obviously lead to results such as o3, and far beyond. Maybe a few more people will start to understand the trajectory we're on.
While I agree that the LLM progress as of late is interesting, the rest of your sentiment sounds more like you are in a cult.
As long as your field keeps coming up with less and less realistic predictions and fails to deliver over and over, eventually even the most gullible will lose faith in you.
Because that's what this all is right now. Faith.
> Maybe a few more people will start to understand the trajectory we're on.
All you are saying is that you believe something will happen in the future.
We can't have an intelligent discussion under those premises.
It's depressing to see so many otherwise smart people fall for their own hype train. You are only helping rich people get more rich by spreading their lies.
I wouldn't be an AI researcher if I didn't have "faith" that AI as a goal is worthwhile and achievable and I can make progress. You think this is irrational?
I am actually working to improve the SoTA in mathematical reasoning. I have documents full of concrete ideas for how to do that. So does everyone else in AI, in their niche. We are in an era of low hanging fruit enabled by ML breakthroughs such as large-scale transformers. I'm not someone who thinks you can simply keep scaling up transformers to solve AI. But consider System 1 and System 2 thinking: System 1 sure looks solved right now.
> As long as your field keep coming with less and less realistic predictions and fail to deliver over and over
I don't think we're commenting on the same article here. For example, FrontierMath was expected to be near impossible for LLMs for years, now here we are 5 weeks later at 25%.
Doesn't help that most people are just mimics when talking about stuff that's outside their expertise.
Hell, my cousin, a quality college-educated individual with high social/emotional IQ, will go down the conspiracy-theory rabbit hole so quickly based on some baseless crap printed on the internet. Then he'll talk about people being Satan worshipers.
> i think most people have very little conceptualization of their own thinking/cognitive patterns, at least not enough to sensibly extrapolate it onto ai.
Quite true. If you spend a lot of time reading and thinking about the workings of the mind you lose sight of how alien it is to intuition. While in high school I first read, in New Scientist, the theory that conscious thought lags behind the underlying subconscious processing in the brain. I was shocked that New Scientist would print something so unbelievable. Yet there seemed to be an element of truth to it, so I kept thinking about it and slowly changed my assessment.
Yeah, I was just thinking how a lot of thoughts I believed were my original thoughts were really made possible by communal thoughts. I can maybe have some original frontier thoughts that involve averages, but that's only possible because some other person invented the abstraction of averages, which was then collectively disseminated to everyone through education; not to mention all the subconscious processes that are necessary for me to will certain thoughts into existence. It makes me reflect on how much cognition is really mine, versus an inevitable product of a deterministic process and a product of other humans.
What I find most fascinating about the history of mathematics is that basic concepts such as zero and negative numbers and graphs of functions, which are so easy to teach to students, required so many mathematicians over so many centuries. E.g. Newton figured out calculus because he gave so much thought to the works of Descartes.
Yes, I think "new" ideas (meaning, a particular synthesis of existing ones) are essentially inevitable, and how many people come up with them, and how soon, is a function of how common those prerequisites are.
It's very easy to say "hey, of course it's obvious", but there is nothing obvious about it, because you are anthropomorphizing these models and then using that bias after the fact as proof of your conjecture.
This isn’t how real progress is achieved.
The state of the art seems very focused on promoting that language that might encode reason is as good as actual reason, rather than asking what a reasoning model might look like.
The trend for power consumption of compute (Megaflops per watt) has generally tracked with Koomey’s law for a doubling every 1.57 years
Then you also have model performance improving with compression. For example, Llama 3.1’s 8B outperforming the original Llama 65B
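Those two trends compound. A rough sketch, taking the 1.57-year doubling above as given and using the 65B-to-8B parameter ratio as a crude proxy for compression gains (both are assumptions, not measurements):

```python
doubling_period_years = 1.57   # Koomey-style doubling for compute per watt, per the comment above
horizon_years = 10

hardware_gain = 2 ** (horizon_years / doubling_period_years)  # ~83x over a decade
compression_gain = 65 / 8                                     # Llama 65B -> Llama 3.1 8B, a crude proxy

print(f"hardware: ~{hardware_gain:.0f}x, compression: ~{compression_gain:.1f}x, "
      f"combined: ~{hardware_gain * compression_gain:.0f}x")
# hardware: ~83x, compression: ~8.1x, combined: ~672x
```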
Routing to the correct human support
Providing FAQ level responses to the most common problems.
Providing a second opinion to the human taking the call.
So, even this most relevant domain for the technology doesn't eliminate human employment (because it's just not flexible or reliable enough yet).
If this turns out to be hard to optimize / improve then there will be a huge economic incentive for efficient ASICs. No freaking way we’ll be running on GPUs for 20-25 years, or even 2.
But sorry, blablabla, this shit is getting embarrassing.
> The question is now, can we close this "to human" gap
You won’t close this gap by throwing more compute at it. Anything in the sphere of creative thinking eludes most people in the history of the planet. People with PhDs in STEM end up working in IT sales not because they are good or capable of learning but because more than half of them can’t do squat shit, despite all that compute and all those algorithms in their brains.
it's even more exciting than that. the fact that you even can use more compute to get more intelligence is a breakthrough. if they spent even more on inference, would they get even better scores on arc agi?
I'm not so sure—what they're doing by just throwing more tokens at it is similar to "solving" the traveling salesman problem by just throwing tons of compute into a breadth first search. Sure, you can get better and better answers the more compute you throw at it (with diminishing returns), but is that really that surprising to anyone who's been following tree of thought models?
All it really seems to tell us is that the type of model that OpenAI has available is capable of solving many of the types of problems that ARC-AGI-PUB has set up given enough compute time. It says nothing about "intelligence" as the concept exists in most people's heads—it just means that a certain very artificial (and intentionally easy for humans) class of problem that wasn't computable is now computable if you're willing to pay an enormous sum to do it. A breakthrough of sorts, sure, but not a surprising one given what we've seen already.
Yes, I find that surprising.
Simple turn-based games such as chess turned out to be too far away from anything practical and chess-engine-like programs were never that useful. It is entirely possible that this will end up in a similar situation. ARC-like pattern matching problems or programming challenges are indeed a respectable challenge for AI, but do we need a program that is able to solve them? How often does something like that come up really? I can see some time-saving in using AI vs StackOverflow in solving some programming challenges, but is there more to this?
In this case there is more reason to think these things are relevant outside of the direct context - these tests were specifically designed to see if AI can do general-thinking tasks. The benchmarks might be bad, but that's at least their purpose (unlike in Chess).
To move beyond that, the thing has to start thinking for itself, some auto feedback loop, training itself on its own thoughts. Interestingly, this could plausibly be vastly more efficient than training on external data because it's a much tighter feedback loop and a smaller dataset. So it's possible that "nearly AGI" leads to ASI pretty quickly and efficiently.
Of course it's also possible that the feedback loop, while efficient as a computation process, isn't efficient as a learning / reasoning / learning-how-to-reason process, and the thing, while as intelligent as a human, still barely competes with a worm in true reasoning ability.
Interesting times.
On a very simple, toy task, which ARC-AGI basically is. ARC-AGI tests are not hard per se; LLMs just find them hard. We do not know how this scales to more complex, real-world tasks.
The other benchmarks are a good indication though.
Well no, that would mean that Arc isn't actually testing the ability of a model to generalize then and we would need a better test. Considering it's by François Chollet, yep we need a better test.
And conversely, the world’s best drivers aren’t noted for being intellectual giants.
I don’t think driving skill and raw intelligence are that closely connected.
The report says it is $17 per task, and about $6k for the whole dataset of 400 tasks.
The low compute was $17 per task. Speculating 172 * $17 for the high compute gives $2,924 per task, so I am also confused about the $3400 number.
Also, it's $20 per task for o3-low in the table for the semi-private eval, which times 172 is $3,440, also coming in close to the 3400 number.
Sorry for being thick, I'm just confused how they can turn this into an affordable service.
The number for the high-compute one is ~172x the first one according to the article so ~=$2900
Note: OpenAI has requested that we not publish the high-compute costs. The amount of compute was roughly 172x the low-compute configuration.
"High Efficiency" is O3 Low "Low Efficiency" is O3 High
They left the "Low efficiency" (O3 High) values as `-` but you can infer them from the plot at the top.
Note the $20 and $17 per task aligns with the X-axis of the O3-low
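For what it's worth, the numbers being backed out in this subthread are just the published low-compute per-task costs multiplied by the stated ~172x factor; a quick sketch using the $17/$20 figures and the 172x multiplier quoted above:

```python
# Reconstructing the high-compute per-task cost from the published figures.
low_cost_per_task = {"public eval": 17.0, "semi-private eval": 20.0}  # USD/task, from the report/table
compute_multiplier = 172  # high-compute config vs. low-compute, per the article

for split, cost in low_cost_per_task.items():
    print(f"{split}: ~${cost * compute_multiplier:,.0f} per task")
# public eval: ~$2,924 per task
# semi-private eval: ~$3,440 per task

print(f"low-compute, 400 tasks: ~${17.0 * 400:,.0f}")  # ~$6,800, the quoted ~$6k figure
```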
Fundamentally it's a search through some enormous state space. Advancements are "tricks" that let us find useful subsets more efficiently.
Zooming way out, we have a bunch of social tricks, hardware tricks, and algorithmic tricks that have resulted in a super useful subset. It's not the subset that we want though, so the hunt continues.
Hopefully it doesn't require revising too much in the hardware & social bag of tricks, those are lot more painful to revisit...
You compare this to "a human" but also admit there is a high variation.
And I would say there are a lot of humans being paid ~$3400 per month. Not for a single task, true, but honestly for no value-creating task at all. Just for their time.
So what about we think in terms of output rather than time?
YT timestamped link: https://www.youtube.com/watch?v=SKBG1sqdyIU&t=768s (thanks for the fixed link @photonboom)
Updated: I gave the task to Claude 3.5 Sonnet and it worked first shot: https://claude.site/artifacts/36cecd49-0e0b-4a8c-befa-faa5aa...
Though, of course one can argue, that lots of human written code is not much different from this.
4o is cheaper than o1 mini so mini doesn't mean much for costs.
I've been doing similar stuff in Claude for months, and it's not that impressive once you see how limited they really are when you go beyond boilerplate.
A lot of people have criticized ARC as not being relevant or indicative of true reasoning, but I think it was exactly the right thing. The fact that scaled reasoning models are finally showing progress on ARC proves that what it measures really is relevant and important for reasoning.
It's obvious to everyone that these models can't perform as well as humans on everyday tasks despite blowout scores on the hardest tests we give to humans. Yet nobody could quantify exactly the ways the models were deficient. ARC is the best effort in that direction so far.
We don't need more "hard" benchmarks. What we need right now are "easy" benchmarks that these models nevertheless fail. I hope Francois has something good cooked up for ARC 2!
LLMs are still below humans on that evaluation, as of when I last looked, but it doesn't get much attention.
Once it is passed, I'd like to see one that is solving the mystery in a mystery book right before it's revealed.
We'd need unpublished mystery novels to use for that benchmark, but I think it gets at what I think of as reasoning.
LLMs are far, _far_ below human on elementary problems, once you allow any variation and stop spoonfeeding perfectly phrased word problems. :)
https://machinelearning.apple.com/research/gsm-symbolic
https://arxiv.org/pdf/2410.05229
Paper came out in October, I don't think many have fully absorbed the implications.
It's hard to take any of the claims of "LLMs can do reasoning!" seriously once you understand that simply changing which names are used in an 8th-grade math word problem can have a dramatic impact on the accuracy.
If so, let's move on to the murder mysteries or more complex literary analysis.
I would think this is not such a good benchmark. Authors don't write logically; they write for entertainment.
The reason it seems like an interesting benchmark is that it's a puzzle presented in a long context. It's like testing whether an LLM is at Sherlock Holmes' level of world and motivation modelling.
Not sure I understand how this follows. The fact that a certain type of model does well on a certain benchmark means that the benchmark is relevant for a real-world reasoning? That doesn't make sense.
It's like the modern equivalent of saying "oh when AI solves chess it'll be as smart as a person, so it's a good benchmark" and we all know how that nonsense went.
Regarding the value of "pointless pattern matching" in particular, I would refer you to Douglas Hofstadter's discussion of Bongard problems starting on page 652 of _Godel, Escher, Bach_. Money quote: "I believe that the skill of solving Bongard [pattern recognition] problems lies very close to the core of 'pure' intelligence, if there is such a thing."
The problem with pattern matching of sequences and transformers as an architecture is that it's something they're explicitly designed to be good at with self attention. Translation is mainly matching patterns to equivalents in different languages, and continuing a piece of text is following a pattern that exists inside it. This is primarily why it's so hard to draw a line between what an LLM actually understands and what it just wings naturally through pattern memorization and why everything about them is so controversial.
Honestly I was really surprised that all models did so poorly on ARC in general thus far, since it really should be something they ought to be superhuman at from the get-go. Probably more of a problem that it's visual in concept than anything else.
Models have regularly made progress on it, this is not new with the o-series.
Doing astoundingly well on it, and having a mutually shared PR interest with OpenAI in this instance, doesn't mean a pile of visual puzzles is actually AGI or some well thought out and designed benchmark of True Intelligence(tm). It's one type of visual puzzle.
I don't mean to be negative, but to inject a memento mori. Real story is some guys get together and ride off Chollet's name with some visual puzzles from ye olde IQ test, and the deal was Chollet then gets to show up and say it proves program synthesis is required for True Intelligence.
Getting this score is extremely impressive but I don't assign more signal to it than any other benchmark with some thought to it.
What I'm saying is the fact that as models are getting better at reasoning they are also scoring better on ARC proves that it is measuring something relating to reasoning. And nobody else has come up with a comparable benchmark that is so easy for humans and so hard for LLMs. Even today, let alone five years ago when ARC was released. ARC was visionary.
That's why I have some private benchmarks, and I'm sorry to say that the transition from GPT-4 to o1 wasn't unambiguously a step forward (in some tasks yes, in some not).
On the other hand, private benchmarks are even less useful to the general public than the public ones, so we have to deal with what we have - but many of us just treat it as noise and don't give it much significance. Ultimately, the models should defend themselves by performing the tasks individual users want them to do.
You could argue that the models can get an advantage by looking at the training set which is on the internet. But all of the tasks are unique and generalizing from the training set to the test set is the whole point of the benchmark. So it's not a serious objection.
That's why they have two test sets. But OpenAI has legally committed to not training on data passed to the API. I don't believe OpenAI would burn their reputation and risk legal action just to cheat on ARC. And what they've reported is not implausible IMO.
I'd guess it's doing natural language procedural synthesis, the same way a human might (i.e. figuring the sequence of steps to effect the transformation), but it may well be doing (sub-)solution verification by using the procedural description to generate code whose output can then be compared to the provided examples.
While OpenAI haven't said exactly what the architecture of o1/o3 are, the gist of it is pretty clear - basically adding "tree" search and iteration on top of the underlying LLM, driven by some RL-based post-training that imparts generic problem solving biases to the model. Maybe there is a separate model orchestrating the search and solution evaluation.
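If that guess is roughly right, the control flow would look something like a sample-and-evaluate loop around the base model. This is a purely hypothetical sketch of that guess (the `llm_generate` and `score_candidate` functions are stand-ins, not any real API, and this is not OpenAI's published method):

```python
import random

def llm_generate(prompt: str) -> str:
    # Stand-in for sampling one reasoning trace + answer from the base LLM.
    # Here it just returns a random toy answer so the sketch runs end to end.
    return f"candidate-{random.randint(0, 9)}"

def score_candidate(prompt: str, candidate: str) -> float:
    # Stand-in for a learned verifier / reward model (or self-consistency voting).
    return random.random()

def solve(prompt: str, samples: int = 64) -> str:
    # Best-of-N search: sample many reasoning traces, keep the highest-scoring one.
    # A real system might instead branch on intermediate steps (a tree) and spend
    # more samples on promising branches, which is the "tree search" guess above.
    candidates = [llm_generate(prompt) for _ in range(samples)]
    return max(candidates, key=lambda c: score_candidate(prompt, c))

print(solve("example ARC task"))
```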
I think there are many tasks that are easy enough for humans but hard/impossible for these models. The ultimate one in terms of commercial value would be to take an "off the shelf model" and treat it as an intern/apprentice and teach it to become competent in an entire job it was never trained on. Have it participate in team meetings and communications, and become a drop-in replacement for a human performing that job (any job that can be performed remotely without a physical presence).
Agreed.
> And nobody else has come up with a comparable benchmark that is so easy for humans and so hard for LLMs.
? There's plenty.
- SimpleBench https://simple-bench.com/ (similar to above; great landing page w/scores that show human / ai gap)
- PIQA (physical question answering, e.g. "how do i get a yolk out of a water bottle"; a common favorite of local LLM enthusiasts in /r/localllama): https://paperswithcode.com/dataset/piqa
- Berkeley Function-Calling (I prefer https://gorilla.cs.berkeley.edu/leaderboard.html)
AI search googled "llm benchmarks challenging for ai easy for humans", and "language model benchmarks that humans excel at but ai struggles with", and "tasks that are easy for humans but difficult for natural language ai".
It also mentioned Moravec's Paradox is a known framing of this concept, started going down that rabbit hole because the resources were fascinating, but, had to hold back and submit this reply first. :)
SimpleBench looks more interesting. Also less than two months old. It doesn't look as challenging for LLMs as ARC, since o1-preview and Sonnet 3.5 already got half of the human baseline score; they did much worse on ARC. But I like the direction!
PIQA is cool but not hard enough for LLMs.
I'm not sure Berkeley Function-Calling represents tasks that are "easy" for average humans. Maybe programmers could perform well on it. But I like ARC in part because the tasks do seem like they should be quite straightforward even for non-expert humans.
Moravec's paradox isn't a benchmark per se. I tend to believe that there is no real paradox and all we need is larger datasets to see the same scaling laws that we have for LLMs. I see good evidence in this direction: https://www.physicalintelligence.company/blog/pi0
Functions in this context are not programming function calls. In this context, function calls are a now-deprecated LLM API name for "parse input into this JSON template." No programmer experience needed. Entity extraction by another name, except, that'd be harder: here, you're told up front exactly the set of entities to identify. :)
> "Moravec's paradox isn't a benchmark per se."
Yup! It's a paradox :)
> "Of course it is much easier to design a test specifically to thwart LLMs now that we have them"
Yes.
Though, I'm concerned a simple yes might be insufficient for illumination here.
It is a tautology (it's easier to design a test that $X fails when you have access to $X), and it's unlikely you meant to just share a tautology.
A potential unstated-but-maybe-intended-communication is "it was hard to come up with ARC before LLMs existed" --- LLMs existed in 2019 :)
If they didn't, a hacky way to come up with a test that's hard for the top AIs at the time, BERT-era, would be to use one type of visual puzzle.
If, for conversations sake, we ignore that it is exactly one type of visual puzzle, and that it wasn't designed to be easy for humans, then we can engage with: "its the only one thats easy for humans, but hard for LLMs" --- this was demonstrated as untrue as well.
I don't think I have much to contribute past that, once we're at "It is a singular example of a benchmark thats easy for humans but nigh-impossible for llms, at least in 2019, and this required singular insight", there's just too much that's not even wrong, in the Pauli sense, and it's in a different universe from the original claims:
- "Congratulations to Francois Chollet on making the most interesting and challenging LLM benchmark so far."
- "A lot of people have criticized ARC as not being relevant or indicative of true reasoning...The fact that [o-series models show progress on ARC proves that what it measures really is relevant and important for reasoning."
- "...nobody could quantify exactly the ways the models were deficient..."
- "What we need right now are "easy" benchmarks that these models nevertheless fail."
It was interesting to see how it failed on question 6: https://chatgpt.com/c/6765e70e-44b0-800b-97bd-928919f04fbe
Apparently LLMs do not consider global thermonuclear war to be all that big a deal, for better or worse.
We do the same exact stuff with real people in programming challenges and the like, where people just study common interview questions rather than learning the material holistically. And since we know that people game these interview-type questions, we adjust the interview process to minimize gaming... which itself leads to more gaming, and back to step one. That's not an ideal feedback loop, of course, but people still get jobs and churn out "productive work" out of it.
Sometimes this manifests as "outside the box thinking", like how a genetic algorithm got an "oscillator" which was really just an antenna.
It is a hard problem, and yes we still both need and can make more and better benchmarks; but it's still a problem because it means the benchmarks we do have are overstating competence.
In principle you can't optimize specifically for ARC-AGI, train against it, or overfit to it, because only a few of the puzzles are publicly disclosed.
Whether it lives up to that goal, I don't know, but their approach sounded good when I first heard about it.
I spent a couple of hours looking at the publicly-available puzzles, and was really impressed at how much room for creativity the format provides. Supposedly the puzzles are "easy for humans," but some of them were not... at least not for me.
(It did occur to me that a better test of AGI might be the ability to generate new, innovative ARC-AGI puzzles.)
For a space of tasks which are well-suited to programmatic generation, as ARC-AGI is by design, if we can do a decent job of reverse engineering the underlying problem generating grammar, then we can make an LLM as familiar with the task as we're willing to spend on compute.
To be clear, I'm not saying solving these sorts of tasks is unimpressive. I'm saying that I find it unsuprising (in light of past results) and not that strong of a signal about further progress towards the singularity, or FOOM, or whatever. For any of these closed-ish domain tasks, I feel a bit like they're solving Go for the umpteenth time. We now know that if you collect enough relevant training data and train a big enough model with enough GPUs, the training loss will go down and you'll probably get solid performance on the test set. Trillions of reasonably diverse training tokens buys you a lot of generalization. Ie, supervised learning works. This is the horse Ilya Sutskever's ridden to many glorious victories and the big driver of OpenAI's success -- a firm belief that other folks were leaving A LOT of performance on the table due to a lack of belief in the power of their own inventions.
What's endlessly interesting to me with all of this is how surprisingly quick the benchmarking feedback loops have become plus the level of scrutiny each one receives. We (as a culture/society/whatever) don't really treat human benchmarking criteria with the same scrutiny such that feedback loops are useful and lead to productive changes to the benchmarking system itself. So from that POV it feels like substantial progress continues to be made through these benchmarks.
While I appreciate the benchmark and its goals (not to mention the puzzles - I quite enjoy figuring them out), successfully passing this benchmark does not demonstrate or guarantee real world capabilities or performance. This is why I increasingly side-eye this field and its obsession with constantly passing benchmarks and then moving the goal posts to a newer, harder benchmark that claims to be a better simulation of human capabilities than the last one: it reeks of squandered capital and a lack of a viable/profitable product, at least to my sniff test. Rather than simply capitalize on their actual accomplishments (which LLMs are - natural language interaction is huge!), they're trying to prove to Capital that with a few (hundred) billion more in investments, they can make AGI out of this and replace all those expensive humans.
They've built the most advanced prediction engines ever conceived, and insist they're best used to replace labor. I'm not sure how they reached that conclusion, but considering even their own models refute this use case for LLMs, I doubt their execution ability on that lofty promise.
This benchmark has done a wonderful job with marketing by picking a great name. It's largely irrelevant for LLMs despite the fact it's difficult.
Consider how much of the model is just noise for a task like this given the low amount of information in each token and the high embedding dimensions used in LLMs.
If the hypothesis is that LLMs are the “computer” that drives the AGI then of course the benchmark is relevant in testing for AGI.
I don’t think you understand the benchmark and its motivation. ARC AGI benchmark problems are extremely easy and simple for humans. But LLMs fail spectacularly at them. Why they fail is irrelevant, the fact they fail though means that we don’t have AGI.
It's a bunch of visual puzzles. They aren't a test for AGI because it's not general. If models (or any other system for that matter) could solve it, we'd be saying "this is a stupid puzzle, it has no practical significance". It's a test of some sort of specific intelligence. On top of that, the vast majority of blind people would fail - are they not generally intelligent?
The name is marketing hype.
The benchmark could be called "random puzzles LLMs are not good at, because they haven't been optimized for them, because it's not a valuable benchmark". Sure, it wasn't designed for LLMs, but throwing LLMs at it and saying "see?" is dumb. We can throw in benchmarks for tennis playing, chess playing, video game playing, car driving, and a bajillion other things while we're at it.
But they don't. Not even the best ones.
The fact that we might need to be mindful of how we communicate with a person/system/whatever doesn't mean too much in the context of AI. Just like humans, the details of how they work will need to be considered, and the standard trope of "that's an implementation detail" won't work.
This[1] is currently the most challenging benchmark. I would like to see how O3 handles it, as O1 solved only 1%.
Once a model recognizes a weakness through reasoning with CoT when posed a certain problem, and gets the agency to adapt to solve that problem, that's a precursor to real AGI capability!
One might also interpret that as "the fact that models which are studying to the test are getting better at the test" (Goodhart's law), not that they're actually reasoning.
I wonder how well the latest Claude 3.5 Sonnet does on this benchmark and if it's near o1.
| Name | Semi-private eval | Public eval |
|--------------------------------------|-------------------|-------------|
| Jeremy Berman | 53.6% | 58.5% |
| Akyürek et al. | 47.5% | 62.8% |
| Ryan Greenblatt | 43% | 42% |
| OpenAI o1-preview (pass@1) | 18% | 21% |
| Anthropic Claude 3.5 Sonnet (pass@1) | 14% | 21% |
| OpenAI GPT-4o (pass@1) | 5% | 9% |
| Google Gemini 1.5 (pass@1) | 4.5% | 8% |
https://arxiv.org/pdf/2412.04604

| Name              | Semi-private eval | Public eval |
|-------------------|-------------------|-------------|
| o3 (coming soon)  | 75.7%             | 82.8%       |
| o1-preview        | 18%               | 21%         |
| Claude 3.5 Sonnet | 14%               | 21%         |
| GPT-4o            | 5%                | 9%          |
| Gemini 1.5        | 4.5%              | 8%          |

That being said, the fact that this is not a "raw" base model, but one tuned on the ARC-AGI test distribution, takes away from the impressiveness of the result. How much? I'm not sure; we'd need the un-tuned base o3 model score for that.
In the meantime, comparing this tuned o3 model to other un-tuned base models is unfair (apples-to-oranges kind of comparison).
This means we have an algorithm to get to human level performance on this task.
If you think this task is an eval of general reasoning ability, we have an algorithm for that now.
There's a lot of work ahead to generalize o3 performance to all domains. I think this explains why many researchers feel AGI is within reach, now that we have an algorithm that works.
Congrats to both Francois Chollet for developing this compelling eval, and to the researchers who saturated it!
[1] https://x.com/SmokeAwayyy/status/1870171624403808366, https://arxiv.org/html/2409.01374v1
But, still, this is incredibly impressive.
For example, if we asked an LLM to produce an image of a "human woman photorealistic", it produces a result. After that you should be able to ask it "tell me about its background" and it should be able to explain: "Since the user didn't specify a background in the query, I randomly decided to draw her standing in front of a fantasy background of Amsterdam's iconic houses. Amsterdam houses are usually 3 stories tall, attached to each other, and 10 meters wide. They usually have cranes on the top floor, which help bring goods to the top floor, since the doors are too narrow for any object wider than 1 m. The woman stands approximately 25 meters in front of the houses. She is 1.59 m tall, which gives us the correct perspective. It is 11:16am on August 22nd, which I used to calculate the correct position of the sun and align all shadows according to the projected lighting conditions. The color of her skin is set at RGB:xxxxxx, chosen randomly," etc.
And it is not too much to ask of LLMs. They have access to all the information above, since they have read the whole internet. So there is definitely a description of Amsterdam architecture, of what a human body looks like, and of how to correctly estimate the time of day based on shadows (and vice versa). The only thing missing is the logic that connects all this information and applies it correctly to generate the final image.
I like to think of LLMs as fancy, genius compression engines. They took all the information on the internet, compressed it, and are able to cleverly query this information for the end user. That is a tremendously valuable thing, but whether intelligence emerges out of it, I'm not sure. Digital information doesn't necessarily contain everything needed to understand how it was generated and why.
Large language models don't do that. You'd want an image model.
Or did you mean "multi-model AI system" rather than "LLM"?
You are confusing LLMs with Generative AI.
Humans also don’t tend to operate in a rigorously logical mode and understand that math word problems are an exception where the language may be adversarial: they’re trained for that special context in school. If you tell the LLM that social context, eg that language may be deceptive, their “mistakes” disappear.
What you’re actually measuring is the LLM defaults to assuming you misspoke trying to include relevant information rather than that you were trying to trick it — which is the social context you’d expect when trained on general chat interactions.
Establishing context in psychology is hard.
'Agents' (i.e. workflows intermingling code and calls to LLMs) are still a thing (as shown by the fact there is a post by anthropic on this subject on the front page right now) and they are very hard to build.
One consequence of that, for instance: it's not possible to have an LLM exhaustively explore a topic.
I'd say humans are also bound to prompting sessions in that way.
Consider the following use case: keeping swimming pool water clean. I can have a long-running conversation with an LLM to guide me in getting it right. However, I can't have an LLM handle the problem autonomously. I'd like to have it notify me on its own: "hey, it's been 2 days, any improvement? Do you mind sharing a few pictures of the pool as well as the pH/chlorine test results?" Nothing mind-bogglingly complex. Nothing that couldn't be achieved using current LLMs. But still something I'd have to implement myself, and which turns out to be more complex to achieve than expected. This is the kind of improvement I'd like to see big AI companies going after, rather than research-grade ultra-smart AIs.
Luckily we don't know the problem exists, so in a cultural/phenomenological sense it is already cracked.
Does it include the invention of tools?
People with (high) intelligence talking about and building (artificial) intelligence but never able to convincingly explain aspects of intelligence; they just often talk ambiguously and circularly around it.
what are we humans getting ourselves into inventing skynet :wink.
It's been an ongoing pet project of mine to tackle reasoning, but I can't answer your question with regard to LLMs.
Kinda interesting that mathematicians also can't do the same for mathematics.
And yet.
Still somehow the question keeps coming up- "what is reasoning". I'll be honest and say that I imagine it's mainly folks who skipped CS 101 because they were busy tweaking their neural nets who go around the web like Diogenes with his lantern, howling "Reasoning! I'm looking for a definition of Reasoning! What is Reasoning!".
I have never heard the people at the top echelons of AI and Deep Learning - LeCun, Schmidhuber, Bengio, Hinton, Ng, Hutter, etc etc - say things like that: "what's reasoning". The reason, I suppose, is that they know exactly what that is, because it was the one thing they could never do with their neural nets that classical AI could do between sips of coffee at breakfast [3]. Those guys know exactly what their systems are missing and, to their credit, have made no bones about it.
_________________
[1] e.g. see my profile for a quick summary.
[2] See all of Russell & Norvig, for instance.
[3] Schmidhuber's doctoral thesis was an implementation of genetic algorithms in Prolog, even.
It pertains to the source of the inference power of deductive inference. Do you think all deductive reasoning originated inductively? Like when someone discovers a rule or fact that seemingly has contextual predictive power: obviously that can be confirmed inductively by observation, but did that deductive reflex of the mind coagulate from inductive experiences? Maybe not all derived deductive rules, but the original deductive rules.
But I'm getting at a few things. One of those things is neurological: how do deductive inference constructs manifest in neurons, and is it really, inadvertently, an inductive process that creates deductive neural functions?
The other aspect of the question is, I guess, more philosophical: why does deductive inference work at all? I think clues to a potential answer can be seen in the mechanics of generalization, where generalized antecedents predict (or correlate with) certain generalized consequences consistently. The brain coagulates generalized coinciding concepts by reinforcement, and it recognizes or differentiates included or excluded instances of a generalization by recognition properties that seem to gatekeep identities accordingly. It's hard to explain succinctly what I mean by the latter, but I'm planning to write an academic paper on it.
If they did not actually, would they (and you) necessarily be able to know?
Many people claim the ability to prove a negative, but no one will post their method.
I doubt your mathematician example is equivalent.
Examples that are fresh in my mind that further my point: I've heard Yann LeCun baffled by LLMs' instantiation/emergence of reasoning, along with other AI researchers. Eric Schmidt thinks agentic reasoning is the current frontier and people should be focusing on that. I was listening to the start of an AI/machine-learning interview a week ago where some CS PhD, asked to explain reasoning, could muster nothing better than "you know it when you see it"... not to mention the person responding to the grandparent who gave a cop-out answer (with all respect to him).
I'm going to bet you haven't encountered the right people then. Maybe your social circle is limited to folks like the person who presented a slide about A* to a dumbstruck roomful of Deep Learning researchers at the last NeurIPS?
But no, my take on reasoning is really a somewhat generalized reframing of the definition of reasoning (which you might find in the Stanford Encyclopedia of Philosophy), reframed partially in the axiomatic building blocks of neural network components/terminology. I'm not claiming to have discovered reasoning, just to redefine it in a way that's compatible with and sensible to neural networks (ish).
The only effect smarter models will have is that intelligent people will have to use less of their brain to do their work. As has always been the case, the medium is the message, and climate change is one of the most difficult and worst problems of our time.
If this gets software people to quit en-masse and start working in energy, biology, ecology and preservation? Then it has succeeded.
Slightly surprised to see this view here.
I can think of half a dozen more serious problems off hand (e.g. population aging, institutional scar tissue, dysgenics, nuclear proliferation, pandemic risks, AI itself) along most axes I can think of (raw $ cost, QALYs, even X-risk).
I assume you've done this, otherwise you wouldn't be telling me to? Bold of you to assume my ignorance on this subject. You sound like you've fallen for corporate grifters who care more about short-term profit and gains over long-term sustainability (or you are one of said grifters, in which case why are you wasting your time on HN, shouldn't you be out there grinding?!)
Severe weather events are going to get more common and more devastating over the next couple of decades. They'll come for you and people you care about, just as they come for me and people I care about. It doesn't matter what you think you know about it.
The IPCC summaries are a good read too.
Do you genuinely think severe weather events are going to be even amongst the top ten killers this century? If so, I do strongly advise emailing local uni climate scientist. (What's the worst that can happen? Heck, they might confirm your views!)
(In other circumstances I might go through the whole "what have you observed that has given you this belief?" thing, but in this case there is a simple and reliable check in the form of a 5 minute email)
... actually, I can do so on your behalf... would you like me to? The specific questions I would be asking unless told otherwise would be:
1. Probability of human extinction in the next century due to climate change.
2. Probability of more than 10% of human deaths being due to extreme weather.
3. Places to find good unbiased summaries of the likely effects of climate change.
Any others?
In your comment above you mention:
> e.g. population aging, institutional scar tissue, dysgenics, nuclear proliferation, pandemic risks, AI itself
These are all intertwined with each other and with climate change. People are less likely to have kids if they don't think those kids will have a comfortable future. Nuclear war is more likely if countries are competing for less and less resources as we deplete the planet and need to increase food production. Habitat loss from deforestation leads to animals comingling where they normally wouldn't, leading to increased risk of disease spillover into humans.
You claim that somebody saying "climate change is one of the most difficult and worst problems of our time" is a take you're surprised to see here on HN, but I'm more surprised that you don't list it in what you consider important problems.
The average human is a lot dumber than people on Hacker News and Reddit seem to realize; shit, the people on MTurk are likely smarter than the AVERAGE person.
Being able to perform better than humans in specific constrained problem space is how every automation system has been developed.
While self-driving systems are impressive, they don't drive anywhere close to the skill of the average driver.
This is not offered to the public; they are actively expanding only in cities like LA, Miami or Phoenix right now, where the weather is good throughout the year.
The tech for bad weather is nowhere close to ready for the public. The average human, on the other hand, drives in bad weather every day.
We already let computers control cars because they're better than humans at it when the weather is inclement. It's called ABS.
There is an inherent danger to driving in snow and ice. It is a PR nightmare waiting to happen because there is no way around accidents if the cars are on the road all the time in rust belt snow.
There's always an inherent risk to driving, even in sunny Phoenix, AZ. Winter dangers like black ice further multiply that risk, but humans still manage to drive in winter. Taking a picture/video of a snowed-over road, judging its width, and inventing lanes based on that width while accounting for snowbanks doesn't take an ML algorithm. Lidar can see black ice while human eyes cannot, giving cars equipped with lidar (whether driven by a human or a computer) an advantage over those without it, and Waymo cars currently have lidar.
I'm sure there are new challenges for Waymo to solve before deploying the service in Buffalo, but it's not this unforeseen gotcha parent comment implies.
As far as the possible PR nightmare, you'd never do self-driving cars in the first place if you let that fear control you because, as you pointed out, driving on the roads is inherently dangerous with too many unforeseen complications.
And the brain doesn't use the same network to do verbal reasoning as real time coordination either.
But that work is moving along fine. All of these models and lessons are going to be combined into AGI. It is happening. There isn't really that much in the way.
It's always been the case that the things that are easiest for humans are hardest for computers, and vice versa. Humans are good at general intelligence - tackling semi-novel problems all day long, while computers are good at narrow problems they can be trained on such as chess or math.
The majority of the benchmarks currently used to evaluate these AI models are narrow skills that the models have been trained to handle well. What'll be much more useful will be when they are capable of the generality of "dumb" tasks that a human can do.
As a contrary point, most people think they are smarter than they really are.
There are blind spots, doesn't take away from 'general'.
In other words, it's possible humans can reason better than o3, but cannot articulate that reasoning as well through text - only in our heads, or through some alternative medium.
From a post elsewhere the scores on ARC-AGI-PUB are approx average human 64%, o3 87%. https://news.ycombinator.com/item?id=42474659
Though also elsewhere, o3 seems very expensive to operate. You could probably hire a PhD researcher for cheaper.
How does a giant pile of linear algebra not meet that definition?
Further, this is probably running an algorithm on top of an NN. Some kind of tree search.
I get what you’re saying though. You’re trying to draw a distinction between statistical methods and symbolic methods. Someday we will have an algorithm which uses statistical methods that can match human performance on most cognitive tasks, and it won’t look or act like a brain. In some sense that’s disappointing. We can build supersonic jets without fully understanding how birds fly.
NNs are exactly what "computers" are good for and we've been using since their inception: doing lots of computations quickly.
"Analog neural networks" (brains) work much differently from what are "neural networks" in computing, and we have no understanding of their operation to claim they are or aren't algorithmic. But computing NNs are simply implementations of an algorithm.
Edit: upon further rereading, it seems you equate "neural networks" with brain-like operation. But brain was an inspiration for NNs, they are not an "approximation" of it.
It's the exact same thing as using a binary tree to discover the lowest number in some set of numbers, conceptually: you have a data structure that you evaluate using a particular algorithm. The combination of the algorithm and the construction of the data structure arrive at the desired outcome.
It's true that this is not an "enlightening" algorithm, it doesn't help us understand why or how that is the most likely next character. But this doesn't mean it's not an algorithm.
Is that your point?
If so, I've long learned to accept imprecise language as long as the message can be reasonably extracted from it.
So, steps?
In essence, infinitesimal calculus provides a link between "steps" and the continuous, but those are indeed different things.
At most you can argue that there isn't a useful bounded loss on every possible input, but it turns out that humans don't achieve useful bounded loss on identifying arbitrary sets of pixels as a cat or whatever, either. Most problems NNs are aimed at are qualitative or probabilistic where provable bounds are less useful than Nth-percentile performance on real-world data.
But, to my mind, something of the form "Train a neural network with an architecture generally like [blah], with a training method+data like [bleh], and save the result. Then, when inputs are received, run them through the NN in such-and-such way." would constitute an algorithm.
When a NN is trained, it produces a set of parameters that basically define an algorithm to do inference with: it's a very big one though.
We also call that a NN (the joy of natural language).
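To make that concrete: once training has fixed the parameters, inference is an ordinary, fully specified procedure. A minimal sketch with made-up weights (assuming NumPy, and a toy two-layer network rather than anything realistic):

```python
import numpy as np

# Made-up parameters of the kind training would normally produce.
W1 = np.array([[0.5, -0.2], [0.1, 0.8]])
b1 = np.array([0.0, 0.1])
W2 = np.array([[1.0], [-1.5]])
b2 = np.array([0.2])

def forward(x: np.ndarray) -> np.ndarray:
    # A fixed sequence of steps: multiply, add, nonlinearity, repeat.
    # This is the "algorithm" the trained parameters define.
    h = np.maximum(0.0, x @ W1 + b1)  # ReLU hidden layer
    return h @ W2 + b2                # linear output layer

print(forward(np.array([1.0, 2.0])))  # [-1.35]
```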
When someone is "disinterested enough" to publish though, note the obvious way to launch a new fund or advisor with a good track record: crank out a pile of them, run them one or two years, discard the many losers and publish the one or two top winners. I.E. first you should be suspicious of why it's being published, then of how selected that result is.
- 64.2% for humans vs. 82.8%+ for o3.
...
Private Eval:
- 85%: threshold for winning the prize [1]
Semi-Private Eval:
- 87.5%: o3 (unlimited compute) [2]
- 75.7%: o3 (limited compute) [2]
Public Eval:
- 91.5%: o3 (unlimited compute) [2]
- 82.8%: o3 (limited compute) [2]
- 64.2%: human average (Mechanical Turk) [1] [3]
Public Training:
- 76.2%: human average (Mechanical Turk) [1] [3]
...
References:
[1] https://arcprize.org/guide
Their post has STEM grads at nearly 100%.
So in practice, there's always some kind of quality control. Stricter quality control will improve your results, and the right amount of quality control is subjective. This makes any assessment of human quality meaningless without explanation of how those humans were selected and incentivized. Chollet is careful to provide that, but many posters here are not.
In any case, the ensemble of task-specific, low-compute Kaggle solutions is reportedly also super-Turk, at 81%. I don't think anyone would call that AGI, since it's not general; but if the "(tuned)" in the figure means o3 was tuned specifically for these tasks, that's not obviously general either.
It really calls into question two things.
1. You don't know what you're talking about.
2. You have a perverse incentive to believe this such that you will preach it to others and elevate some job salary range or stock.
Either way, not a good look.
What has been lacking so far in frontier LLMs is the ability to reliably deal with the right level of abstraction for a given problem. Reasoning is useful but often comes out lacking if one cannot reason at the right level of abstraction. (Note that many humans can't either when they deal with unfamiliar domains, although that is not the case with these models.)
ARC has been challenging precisely because solving its problems often requires:
1) using multiple different *kinds* of core knowledge [1], such as symmetry, counting, color, AND
2) using the right level(s) of abstraction
Achieving human-level performance on the ARC benchmark, as well as top human performance in GPQA, Codeforces, AIME, and Frontier Math, suggests the model can potentially solve any problem at the human level if it possesses essential knowledge about it. Yes, this includes out-of-distribution problems that most humans can solve. It might not yet be able to generate highly novel theories, frameworks, or artifacts to the degree that Einstein, Grothendieck, or van Gogh could. But not many humans can either.
[1] https://www.harvardlds.org/wp-content/uploads/2017/01/Spelke...
ADDED:
Thanks to the link to Chollet's posts by lswainemoore below. I've analyzed some easy problems that o3 failed at. They involve spatial intelligence, including connection and movement. This skill is very hard to learn from textual and still image data.
I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world and it will not present a very difficult barrier to cross. (OpenAI purchased a company behind a Minecraft clone a while ago. I've wondered if this is the purpose.)
I guess you can say the same thing for the Turing Test. Simple chat bots beat it ages ago in specific settings, but the bar is much higher now that the average person is familiar with their limitations.
If/once we have an AGI, it will probably take weeks to months to really convince ourselves that it is one.
Also, it depends a great deal on what we define as AGI and whether they need to be a strict superset of typical human intelligence. o3's intelligence is probably superhuman in some aspects but inferior in others. We can find many humans who exhibit such tendencies as well. We'd probably say they think differently but would still call them generally intelligent.
Personally, I think it's fair to call them "very easy". If a person I otherwise thought was intelligent was unable to solve these, I'd be quite surprised.
> I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world and it will not present a very difficult barrier to cross.
> (OpenAI purchased a company behind a Minecraft clone a while ago. I've wondered if this is the purpose.)
Maybe! I suppose time will tell. That said, spatial intelligence (connection/movement included) is the whole game in this evaluation set. I think it's revealing that they can't handle these particular examples, and problematic for claims of AGI.
I'm starting to really see no limits on intelligence in these models.
I had the same take at first, but thinking about it again, I'm not quite sure?
Take the "blue dots make a cross" example (the second one). The inputs only has four blue dots, which makes it very easy to see a pattern even in text data: two of them have the same x coordinate, two of them have the same y (or the same first-tuple-element and second-tuple-element if you want to taboo any spatial concepts).
Then if you look into the output, you can notice that all the input coordinates are also in the output set, just not always with the same color. If you separate them into "input-and-output" and "output-only", you quickly notice that all of the output-only squares are blue and share a coordinate (tuple-element) with the blue inputs. If you split the "input-and-output" set into "same color" and "color changed", you can notice that the changes only go from red to blue, and that the coordinates that changed are clustered, and at least one element of the cluster shares a coordinate with a blue input.
Of course, it's easy to build this chain of reasoning in retrospect, but it doesn't seem like a complete stretch: each step only requires noticing patterns in the data, and it's how a reasonably puzzle-savvy person might solve this if you didn't let them draw the squares on paper. There are a lot of escape games with chains of reasoning much more complex, and random office workers solve them all the time.
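As a hedged illustration of that chain of reasoning, here is roughly the coordinate bookkeeping described above, on a hypothetical toy grid (dicts mapping (row, col) to a color name; the real tasks use integer color codes and different layouts):

    def partition_cells(inp, out):
        # Split output cells into the groups discussed above.
        in_and_out = {c: (inp[c], out[c]) for c in out if c in inp}
        output_only = {c: out[c] for c in out if c not in inp}
        changed = {c for c, (a, b) in in_and_out.items() if a != b}
        return output_only, changed

    def shares_axis(cell, anchors):
        # True if `cell` shares a row or column with any anchor cell.
        return any(cell[0] == a[0] or cell[1] == a[1] for a in anchors)

    # Hypothetical example: four blue dots, two sharing a column, two a row.
    inp = {(1, 3): "blue", (5, 3): "blue", (4, 1): "blue", (4, 6): "blue",
           (2, 3): "red"}
    out = {**inp, (2, 3): "blue", (3, 3): "blue", (4, 3): "blue",
           (4, 2): "blue", (4, 4): "blue", (4, 5): "blue"}

    output_only, changed = partition_cells(inp, out)
    blues = [c for c, v in inp.items() if v == "blue"]
    print(all(shares_axis(c, blues) for c in output_only))  # new cells line up
    print(changed)  # cells that flipped color (here: the red one turned blue)

Everything above is plain set bookkeeping over coordinates; nothing in the chain strictly needs a visual representation.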
The visual aspect makes the patterns jump out at us more, but the fact that o3 couldn't find them at all with thousands of dollars of compute budget still seems meaningful to me.
EDIT: Actually, looking at Twitter discussions[1], o3 did find those patterns, but was stumped by ambiguity in the test input that the examples didn't cover. Its failures on the "cascading rectangles" example[2] look much more interesting.
[1]: https://x.com/bio_bootloader/status/1870339297594786064
Just playing devils' advocate or nitpicking the language a bit...
I’m not even going that far, I’m talking about performance on similar tasks. Something many people have noticed about modern AI is it can go from genius to baby-level performance seemingly at random.
Take self driving cars for example, a reasonably intelligent human of sound mind and body would never accidentally mistake a concrete pillar for a road. Yet that happens with self-driving cars, and seemingly here with ARC-AGI problems which all have a similar flavor.
Also, not knowing something is hardly a criterion; skilled humans focus on their areas of interest above most other knowledge and can be unaware of other subjects.
Fields Medal winners, for example, may not be aware of most pop culture things; that doesn't make them unable to learn them, just not interested.
—-
[1] Most doctors, including surgeons and many respected specialists. Some doctors do need those skills, but they are a specialized few and generally do know how to use email.
A PhD learnt their field. If they learnt that field, reasoning through everything to understand their material, then - given enough time - they are capable of learning email and street smarts.
Which is why a reasoning LLM, should be able to do all of those things.
It's not learnt a subject; it's learnt reasoning.
LLMs aren't really capable of "learning" anything outside their training data. Which I feel is a very basic and fundamental capability of humans.
Every new request thread is a blank slate utilizing whatever context you provide for the specific task, and after the thread is done (or the context limit runs out) it's like it never happened. Sure you can use databases, do web queries, etc., but these are inflexible band-aid solutions, far from what's needed for AGI.
ChatGPT has had for some time the feature of storing memories about its conversations with users. And you can use function calling to make this more generic.
I think drawing the boundary at “model + scaffolding” is more interesting.
True equivalent to human memories would require something like a multimodal trillion token context window.
RAG is just not going to cut it, and if anything will exacerbate problems with hallucinations.
Once Optimus is up and working by the 100k+, the spatial problems will be solved. We just don't have enough spatial awareness data, or a way for the LLM to learn about the physical world.
Your statement is false - things changed a lot between gpt4 and o1 under the hood, but notably not a larger model size. In fact the model size of o1 is smaller than gpt4 by several orders of magnitude! Improvements are being made in other ways.
I believe about 90% of the tasks were estimated by humans to take less than one hour to solve, so we aren't talking about very complex problems, and to boot, the contamination factor is huge: o3 (or any big model) will have in-depth knowledge of the internals of these projects, and often even know about the individual issues themselves (e.g. you can ask what GitHub issue #4145 in project foo was, and there's a decent chance it can tell you exactly what the issue was about!)
For one, I speculate OpenAI is using a very basic agent harness to get the results they've published on SWEBench. I believe there is a fair amount of headroom to improve results above what they published, using the same models.
For two, some of the instances, even in SWEBench-Verified, require a bit of "going above and beyond" to get right. One example is an instance where the user states that a TypeError isn't properly handled. The developer who fixed it handled the TypeError but also handled a ValueError, and the golden test checks for both. I don't know how many instances fall in this category, but I suspect it's more than on a simpler benchmark like MATH.
I think it's obvious that they've cracked the formula for solving well-defined, small-in-scope problems at a superhuman level. That's an amazing thing.
To me, it's less obvious that this implies that they will in short order with just more training data be able to solve ambiguous, large-in-scope problems at even just a skilled human level.
There are far more paths to consider, much more context to use, and in an RL setting, the rewards are much more ambiguously defined.
That said, o3 might still lack some kind of interaction intelligence that’s hard to learn. We’ll see.
We are nowhere close to what Sam Altman calls AGI and transformers are still limited to what uniform-TC0 can do.
As an example the Boolean Formula Value Problem is NC1-complete, thus beyond transformers but trivial to solve with a TM.
As it is now proven that the frame problem is equivalent to the halting problem, even if we can move past uniform-TC0 limits, novelty is still a problem.
I think the advancements are truly extraordinary, but unless you set the bar very low, we aren't close to AGI.
Heck we aren't close to P with commercial models.
The default, nonuniform circuit classes are allowed to have a different circuit per input size; the uniform variants additionally require that the circuit family be generated by a resource-bounded machine. (The gates in TC0 have unbounded fan-in.)
Similar to how a k-tape TM doesn't get 'charged' for the input size.
With Nick's Class (NC), circuit size plays a role similar to traditional compute time, while depth relates to the ability to parallelize operations.
These are different than biological neurons, not better or worse but just different.
Human neurons can use dendritic compartmentalization, use spike timing, can retime spikes etc...
While the perceptron model we use in ML is useful, it is not able to do XOR in one layer, while biological neurons do that without anything even reaching the soma, purely in the dendrites.
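For the XOR point, a minimal sketch in pure Python with hand-picked weights: a single threshold unit cannot separate XOR, but two such units feeding a third can.

    def step(x):
        return 1 if x > 0 else 0

    def unit(inputs, weights, bias):
        return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

    def xor_two_layer(a, b):
        h1 = unit((a, b), (1, 1), -0.5)      # OR-like unit
        h2 = unit((a, b), (-1, -1), 1.5)     # NAND-like unit
        return unit((h1, h2), (1, 1), -1.5)  # AND of the two hidden units

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_two_layer(a, b))  # matches a XOR b

    # By contrast, no single unit() call can reproduce XOR: the four input
    # points are not linearly separable by one threshold.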
Statistical learning models still come down to a choice function, no matter if you call that set shattering or...
With physical computers the time hierarchy does apply and if TIME(g(n)) is given more time than TIME(f(n)), g(n) can solve more problems.
So you can simulate a NTM with exhaustive search with a physical computer.
Physical computers also tend to have NAND and XOR gates, and can have different circuit depths.
When you are in TC0, you only have AND, OR and Threshold (or majority) gates.
Think of instruction level parallelism in a typical CPU, it can return early, vs Itanium EPIC, which had to wait for the longest operation. Predicated execution is also how GPUs work.
They can send a mask and save on load/store ops as an example, but the cost of that parallelism is the constant depth.
It is the parallelism tradeoff that both makes transformers practical as well as limit what they can do.
The IID assumption, and autograd requiring smooth manifolds, play a role too.
The frame problem, which causes hard problems to become unsolvable for computers and people alike, does also.
But the fact that we have polynomial time solutions for the Boolean Formula Value Problem, as mentioned in my post above, is probably a simpler way of realizing physical computers aren't limited to uniform-TC0.
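To make that concrete, here is a minimal sketch of a linear-time evaluator over a hypothetical parenthesized formula format; the only point is that a sequential machine handles this trivially in one pass, whatever constant-depth circuits can or can't do.

    def eval_formula(s):
        # Grammar (hypothetical): formula := '0' | '1' | '~' formula
        #                                  | '(' formula ('&'|'|') formula ')'
        def parse(i):
            if s[i] == '~':                        # negation
                v, i = parse(i + 1)
                return (not v), i
            if s[i] == '(':                        # binary node
                left, i = parse(i + 1)
                op = s[i]
                right, i = parse(i + 1)
                assert s[i] == ')'
                value = (left and right) if op == '&' else (left or right)
                return value, i + 1
            return s[i] == '1', i + 1              # constant leaf

        value, _ = parse(0)
        return value

    print(eval_formula("((1&0)|~0)"))  # True, computed in one left-to-right pass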
I wouldn't describe a computer's usual behavior as having constant depth.
It is fairly typical to talk about problems in P as being feasible (though when the constant factors are too big, this isn't strictly true of course).
Just because for unreasonably large inputs, my computer can't run a particular program and produce the correct answer for that input, due to my computer running out of memory, we don't generally say that my computer is fundamentally incapable of executing that algorithm.
The article notes, "o3 still fails on some very easy tasks". What explains these failures if o3 can solve "any problem" at the human level? Do these failed cases require some essential knowledge that has eluded the massive OpenAI training set?
https://x.com/fchollet/status/1870172872641261979
https://x.com/fchollet/status/1870173137234727219
I would definitely consider them legitimately easy for humans.
Regardless, the critical aspect is valid: AGI would be something like Cortana from Halo.
For example, if they produce millions of examples of the type of problems o3 still struggles on, it would probably do better at similar questions.
Perhaps the private data set is different enough that this isn't a problem, but the ideal situation would be unveiling a truly novel dataset, which it seems like ARC aims to do.
I think that's where most hardware startups will specialize in the coming decades: different industries with different needs.
Every human does this dozens, hundreds or thousands of times ... during childhood.
If you look at the ARC tasks failed by o3, they're really not well suited to humans. They lack the living context humans thrive on, and have relatively simple, analytical outcomes that are readily processed by simple structures. We're unlikely to see AI as "smart" until it can be asked to accomplish useful units of productive professional work at a "seasoned apprentice" level. Right now they're consuming ungodly amounts of power just to pass some irritating, sterile SAT questions. Train a human for a few hours a day over a couple weeks and they'll ace this no problem.
It works the same with humans. If they spend more time on the puzzle they are more likely to solve it.
While beyond current models, that would be the final test of AGI capability.
Though to be clear, this wasn't a one shot thing - it was iirc a few months of back and forth chats with plenty of wrong turns too.
Found the thread: https://x.com/robertghrist/status/1841462507543949581?s=46&t...
From the thread:
> AI assisted in the initial conjectures, some of the proofs, and most of the applications it was truly a collaborative effort
> i went back and forth between outrageous optimism and frustration through this process. i believe that the current models can reason – however you want to interpret that. i also believe that there is a long way to go before we get to true depth of mathematical results.
If you disagree with me, state why instead of opting to downvote me
I think this is a mistake.
Even if very high costs make o3 uneconomic for businesses, it could be an epoch defining development for nation states, assuming that it is true that o3 can reason like an averagely intelligent person.
Consider the following questions that a state actor might ask itself: What is the cost to raise and educate an average person? Correspondingly, what is the cost to build and run a datacenter with a nuclear power plant attached to it? And finally, how many person-equivalent AIs could be run in parallel per datacenter?
There are many state actors, corporations, and even individual people who can afford to ask these questions. There are also many things that they'd like to do but can't because there just aren't enough people available to do them. o3 might change that despite its high cost.
So if it is true that we've now got something like human-equivalent intelligence on demand - and that's a really big if - then we may see its impacts much sooner than we would otherwise intuit, especially in areas where economics takes a back seat to other priorities like national security and state competitiveness.
That said, this is all predicated on o3 or similar actually having achieved human level reasoning. That's yet to be fully proven. We'll see!
A private company, xAI, was able to build a datacenter on a similar scale in less than 6 months, with integrated power supply via large batteries: https://www.tomshardware.com/desktops/servers/first-in-depth...
Datacenter construction is a one-time cost. The intelligence the datacenter (might) provide is ongoing. It’s not an equal one to one trade, and well within reach for many state and non-state actors if it is desired.
It’s potentially going to be a very interesting decade.
Your secrecy comment is really intriguing actually. And morbid lol.
“SO IS IT AGI?
ARC-AGI serves as a critical benchmark for detecting such breakthroughs, highlighting generalization power in a way that saturated or less demanding benchmarks cannot. However, it is important to note that ARC-AGI is not an acid test for AGI – as we've repeated dozens of times this year. It's a research tool designed to focus attention on the most challenging unsolved problems in AI, a role it has fulfilled well over the past five years.
Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.
Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training). This demonstrates the continued possibility of creating challenging, unsaturated benchmarks without having to rely on expert domain knowledge. You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.”
The high compute variant sounds like it cost around *$350,000*, which is kinda wild. Lol, the blog post specifically mentioned how OpenAI asked ARC-AGI not to disclose the exact cost for the high compute version.
Also, one odd thing I noticed is that the graph in their blog post shows the top 2 scores as "tuned" (this was not displayed in the live demo graph). This suggests in those cases that the model was trained to better handle these types of questions, so I do wonder about data / answer contamination in those cases…
Something I missed until I scrolled back to the top and reread the page was this
> OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set
So yeah, the results were specifically from a version of o3 trained on the public training set
Which on the one hand I think is a completely fair thing to do. It's reasonable that you should teach your AI the rules of the game, so to speak. There really aren't any spoken rules though, just pattern observation. Thus, if you want to teach the AI how to play the game, you must train it.
On the other hand though, I don't think the o1 models nor Claude were trained on the dataset, in which case it isn't a completely fair competition. If I had to guess, you could probably get 60% on o1 if you trained it on the public dataset as well.
Yeah, that makes this result a lot less impressive for me.
"Raising visibility on this note we added to address ARC "tuned" confusion:
> OpenAI shared they trained the o3 we tested on 75% of the Public Training set.
This is the explicit purpose of the training set. It is designed to expose a system to the core knowledge priors needed to beat the much harder eval set.
The idea is each training task shows you an isolated single prior. And the eval set requires you to recombine and abstract from those priors on the fly. Broadly, the eval tasks require utilizing 3-5 priors.
The eval sets are extremely resistant to just "memorizing" the training set. This is why o3 is impressive." https://x.com/mikeknoop/status/1870583471892226343
While that is true with humans taking tests, it's not really true with AIs evaluating on benchmarks.
SWE-bench is a great example. Claude Sonnet can get something like a 50% on verified, whereas I think I might be able to score a 20-25%? So, Claude is a better programmer than me.
Except that isn't really true. Claude can still make a lot of clumsy mistakes. I wouldn't even say these are junior engineer mistakes. I've used it for creative programming tasks and have found one example where it tried to use a library written for d3js for a p5js programming example. The confusion is kind of understandable, but it's also a really dumb mistake.
Some very simple explanations, the models were probably overfitted to a degree on Python given its popularity in AI/ML work, and SWE-bench is all Python. Also, the underlying Github issues are quite old, so they probably contaminated the training data and the models have simply memorized the answers.
Or maybe benchmarks are just bad at measuring intelligence in general.
Regardless, every time a model beats a benchmark I'm annoyed by the fact that I have no clue whatsoever how much this actually translates into real world performance. Did OpenAI/Anthropic/Google actually create something that will automate wide swathes of the software engineering profession? Or did they create the world's most knowledgeable junior engineer?
My understanding is that it works by checking if the proposed solution passes test-cases included in the original (human) PR. This seems to present some problems too, because there are surely ways to write code that passes the tests but would fail human review for one reason or another. It would be interesting to not only see the pass rate but also the rate at which the proposed solutions are preferred to the original ones (preferably evaluated by a human but even an LLM comparing the two solutions would be interesting).
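A rough sketch of that kind of test-based grading, assuming each instance ships a repo checkout, a model-generated patch, and the "golden" tests added in the original human PR (the function and argument names here are hypothetical, not the actual SWE-bench harness):

    import subprocess

    def grade_instance(repo_dir, model_patch, fail_to_pass_tests):
        # Apply the model's proposed patch to a clean checkout.
        subprocess.run(["git", "apply", "-"], input=model_patch.encode(),
                       cwd=repo_dir, check=True)
        # Run only the tests the human fix was required to make pass.
        result = subprocess.run(["python", "-m", "pytest", *fail_to_pass_tests],
                                cwd=repo_dir)
        return result.returncode == 0  # "resolved" if the golden tests pass

Which is exactly the gap being pointed out: "passes the golden tests" and "would pass human review" are not the same bar.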
The CSS acid test? This can be gamed too.
> An acid test is a qualitative chemical or metallurgical assay utilizing acid. Historically, it often involved the use of a robust acid to distinguish gold from base metals. Figuratively, the term represents any definitive test for attributes, such as gauging a person's character or evaluating a product's performance.
Specifically here, they're using the figurative sense of "definitive test".
The key part is that scaling test-time compute will likely be a key to achieving AGI/ASI. Costs will definitely come down as is evidenced by precedents, Moore’s law, o3-mini being cheaper than o1 with improved performance, etc.
https://a16z.com/llmflation-llm-inference-cost/
Look at the log scale slope, especially the orange MMLU > 83 data points.
That said, can you be more specific about which "algorithmic" and "hardware" improvements have driven this cost and hardware requirements down? AFAIK I still need the same hardware to run this very same model.
You aren’t trying to run an old 2023 model as is, you’re trying to match its capabilities. The old models just show what capabilities are possible.
How much VRAM and inference compute is required to run 3.1-70B vs 2-70B?
The argument is that inference cost is dropping significantly each year, but how exactly, if those two models require about the same amount of VRAM and compute, give or take?
One way to drive the cost down is to innovate in inference algorithms such that the HW requirements are loosened up.
In the context of inference optimizations, one such is flash-decode, similar to its training counterpart flash-attention, from the same authors. However, that particular optimization only improves inference runtime by reducing the number of memory accesses needed to compute self-attention. The total amount of VRAM you need just to load the model remains the same, so although it is true that you might get a tad more from the same HW, the initial amount of HW you need stays the same. Flash-decode is also nowhere near the impact of flash-attention: the latter enabled much faster training iteration runtimes, while the former has had quite limited impact, mostly because the scale of inference is so much smaller than training, so the improvements do not always see large gains.
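For what it's worth, a rough, assumption-laden illustration of the VRAM arithmetic both sides are arguing over: memory needed just to hold the weights, ignoring KV cache, activations and runtime overhead.

    def weight_vram_gb(params_billion, bytes_per_param):
        # 1B parameters at 1 byte each is roughly 1 GB of weights.
        return params_billion * bytes_per_param

    for model, params in [("70B", 70), ("8B", 8)]:
        for precision, nbytes in [("fp16", 2), ("int4", 0.5)]:
            print(f"{model} @ {precision}: ~{weight_vram_gb(params, nbytes):.0f} GB")

The same checkpoint at the same precision does need the same VRAM; the claimed cost drops come from quantization and from smaller models matching last year's quality at a given benchmark level, not from an identical model getting cheaper to host.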
> Not to mention the cost/flop and cost/gb for GPUs has dropped.
For training. Not for inference. GPU prices remained about the same, give or take.
We aren’t trying to mindlessly consume the same VRAM as last year and hope costs magically drop. We are noticing that we can get last year’s mid-level performance on this year’s low-end model, leading to cost savings at that perf level. The same thing happens next year, leading to a drop in cost at any given perf level over time.
> For training. Not for inference. GPU prices remained about the same, give or take.
See:
https://epoch.ai/blog/trends-in-gpu-price-performance
We don't care about the absolute price; the question is whether cost per FLOP or cost per GB is decreasing over time with each new GPU.
—-
If it isn’t clear why inference costs at any given performance level will drop given the points above, unfortunately I can’t help you further.
In some parts of the internet you can hardly find real content, only AI spam.
It will get worse the cheaper it gets.
Think of email spam.
That's why everyone's thinking about compute expense. But I guess in terms of the "lifetime expense of a person", even someone who costs $10/hr isn't actually all that cheap, considering what it takes to grow a human into a fully functioning person that's able to just do stuff.
Of course, o3 looks strong on other benchmarks as well, and sometimes "spend a huge amount of compute for one problem" is a great feature to have available if it gets you the answer you needed. So even if there's some amount of "ARC-AGI wasn't quite as robust as we thought", o3 is clearly a very powerful model.
from reading Dennett's philosophy, I'm convinced that that's how human intelligence works - for each task that "only a human could do that", there's a trick that makes it easier than it seems. We are bags of tricks.
We are trick generators, that is what it means to be a general intelligence. Adding another trick in the bag doesn't make you a general intelligence, being able to discover and add new tricks yourself makes you a general intelligence.
Of course there are many tricks you will need special training for, like many of the skills human share with animals, but the ability to construct useful shareable large knowledge bases based on observations is unique to humans and isn't just a "trick".
sharing knowledge isn't a human thing - chimps learn from each other. bees teach each other the direction and distance to a new source of food.
we just happen to push the envelope a lot further and managed to kickstart runaway memetic evolution.
the new tricks don't just pop into our heads even though it seems that way. nobody ever woke up and devised a new trick in a completely new field without spending years learning about that field or something adjacent to it. even the new ideas tend to be an old idea from a different field applied to a new field. tricks stand on the shoulders of giants.
But I don't much agree that it is any meaningful step towards AGI. Maybe it's a nice proof point that AI can solve simple problems presented in intentionally opaque ways.
If humans were given the json as input rather than the images, they’d have a hard time, too.
We shine light in text patterns at humans rather than inject the text directly into the brain as well, that is extremely unfair! Imagine how much better humans would be at text processing if we injected and extracted information from their brains using the neurons instead of eyes and hands.
Perhaps it wasn't a convolutional network after all, but a simple fully-connected feed-forward network taking all pixels as input? Could be viable for a toy example (MNIST).
o3 is just o1 scaled up; the main takeaway from this line of work that people should walk away with is that we now have a proven way to RL our way to superhuman performance on tasks where it's cheap to sample and easy to verify the final output. Programming falls in that category; they focused on known benchmarks, but the same process can be done for normal programs, using parsers, compilers, existing functions and unit tests as verifiers.
Pre-o1 we only really had next-token prediction, which required high-quality human-produced data; with o1 you optimize for success instead of the MLE of the next token. Explained in simpler terms, it means the model can get reward for any implementation of a function that reproduces the expected result, instead of the exact implementation in the training set.
Put another way, it’s just like RLHF but instead of optimizing against learned human preferences, the model is trained to satisfy a verifier.
This should work just as well in VLA models for robotics, self driving and computer agents.
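A minimal sketch of "optimize against a verifier", using unit tests as the verifier and a hypothetical model.sample() call; this is an illustration of the idea, not OpenAI's actual recipe.

    def unit_test_verifier(program_src, tests):
        # Reward 1.0 if the sampled program defines f() and passes every test.
        scope = {}
        try:
            exec(program_src, scope)
            ok = all(scope["f"](x) == y for x, y in tests)
        except Exception:
            ok = False
        return 1.0 if ok else 0.0

    def collect_rl_batch(model, prompt, tests, n_samples=16):
        # Any implementation that reproduces the expected outputs earns reward,
        # not just the exact implementation seen in the training data.
        batch = []
        for _ in range(n_samples):
            candidate = model.sample(prompt)   # hypothetical sampling API
            batch.append((candidate, unit_test_verifier(candidate, tests)))
        return batch  # fed to a policy-gradient / RLHF-style update elsewhere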
I am not really trying to downplay O3. But this would be a simple test as to whether O3 is truly "a system capable of adapting to tasks it has never encountered before" versus novel ARC-AGI tasks it hasn't encountered before.
Taking this a level of abstraction higher, I expect that in the next couple of years we'll see systems like o3 given a runtime budget that they can use for training/fine-tuning smaller models in an ad-hoc manner.
From this it makes sense why the original models did poorly and why iterative chain of thought is required - the challenge is designed to be inherently iterative, such that a zero-shot model, no matter how big, is extremely unlikely to get it right on the first try. Of course, it also requires a broad set of human-like priors about what hypotheses are "simple", based on things like object permanence, directionality and cardinality. But as the author says, these basic world models were already encoded in the GPT-3/4 line by simply training a gigantic model on a gigantic dataset. What was missing was iterative hypothesis generation and testing against contradictory examples. My guess is that o3 does something like this (a rough code sketch follows the list):
1. Prompt the model to produce a simple rule to explain the nth example (randomly chosen)
2. Choose a different example, ask the model to check whether the hypothesis explains this case as well. If yes, keep going. If no, ask the model to revise the hypothesis in the simplest possible way that also explains this example.
3. Keep iterating over examples like this until the hypothesis explains all cases. Occasionally, new revisions will invalidate already solved examples. That’s fine, just keep iterating.
4. Induce randomness in the process (through next-word sampling noise, example ordering, etc) to run this process a large number of times, resulting in say 1,000 hypotheses which all explain all examples. Due to path dependency, anchoring and consistency effects, some of these paths will end in awful hypotheses - super convoluted and involving a large number of arbitrary rules. But some will be simple.
5. Ask the model to select among the valid hypotheses (meaning those that satisfy all examples) and choose the one that it views as the simplest for a human to discover.
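Sketched in code, under the big assumption that llm.propose / llm.check / llm.revise / llm.pick_simplest stand in for prompts to the model (none of this is a known o3 internal):

    import random

    def solve_arc_task(llm, examples, n_runs=1000):
        candidates = []
        for _ in range(n_runs):
            order = random.sample(examples, len(examples))  # vary example order
            hypothesis = llm.propose(order[0])              # step 1
            stable = False
            while not stable:                               # steps 2-3
                stable = True
                for ex in order:
                    if not llm.check(hypothesis, ex):
                        hypothesis = llm.revise(hypothesis, ex)
                        stable = False                      # re-check everything
            candidates.append(hypothesis)                   # step 4
        return llm.pick_simplest(candidates)                # step 5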
Took me less time to figure out the 3 examples than it took to read your post.
I was honestly a bit surprised to see how visual the tasks were. I had thought they were text based. So now I'm quite impressed that o3 can solve this type of task at all.
If they use a model API, then surely OpenAI has access to the private test set questions and can include it in the next round of training?
(I am sure I am missing something.)
If you detect that a benchmark is running then you can just ramp up to max frequency immediately. It’ll show how fast your CPU is, but won’t be representative of the actual performance that users will get from their device.
In practice I assume they just gave them the benchmarks and took it on the honor system they wouldn't cheat, yeah. They can always cook up a new test set for next time, it's only 10% of the benchmark content anyway and the results are pretty close.
There’s a fully private test set too as I understand it, that o3 hasn’t run on yet.
My skeptical impression: it's complete hubris to conflate ARC or any benchmark with truly general intelligence.
I know my skepticism here is identical to moving goalposts. More and more I am shifting my personal understanding of general intelligence as a phenomenon we will only ever be able to identify with the benefit of substantial retrospect.
As it is with any sufficiently complex program, if you could discern the result beforehand, you wouldn't have had to execute the program in the first place.
I'm not trying to be a downer on the 12th day of Christmas. Perhaps because my first instinct is childlike excitement, I'm trying to temper it with a little reason.
All it needs to be is useful. Reading constant comments about how LLMs can't be general intelligence, lack reasoning, etc., to me seems like people witnessing the airplane and complaining that it isn't "real flying" because it isn't a bird flapping its wings (a large portion of the population held that point of view back then).
It doesn't need to be general intelligence for the rapid advancement of LLM capabilities to be the most societal shifting development in the past decades.
Aerospace is still a highly regulated area that requires training and responsibility. If parallels can be drawn here, they don’t look so cool for a regular guy.
On the one hand, yes; on the other, this understates the impact that had.
My uncle moved from the UK to Australia because, I'm told*, he didn't like his mum and travel was so expensive that he assumed they'd never meet again. My first trip abroad… I'm not 100% sure how old I was, but it must have been between age 6 and 10, was my gran (his mum) paying for herself, for both my parents, and for me, to fly to Singapore, then on to various locations in Australia including my uncle, and back via Thailand, on her pension.
That was a gap of around one and a half generations.
* both of them are long-since dead now so I can't ask
They changed their mind after a public outcry including here on HN.
Do they really? I don't think they do.
> Planes can't land in your backyard so we built airports. We didn't abandon planes.
But then what do you do with the all the fantasies and hype about the new technology (like planes that land in your backyard and you fly them to work)?
And it's quite possible and fairly common that the new technology actually ends up being mostly hype, and there's actually no "airports" use case in the wings. I mean, how much did society "bend to the abilities of" NFTs?
And then what if the mature "airports" use case is actually something most people do not want?
Plastics, cars, planes, etc.
One could say that a balanced situation, where vested interests are put back in the box (close to impossible since it would mean fighting trillions of dollars), would mean that, for example, all 3 in the list above are used a lot less than we use them now. And only used where truly appropriate.
To give you an example– I've used it for legal work such as an EB2-NIW visa application. Saved me countless hours. My next visa I'll try to do without a lawyer, using just LLMs. I would never try this without having LLMs at my disposal.
As a hobby– And as someone with a scientific background, I've been able to build an artificial ecosystem simulation in Rust from scratch, without prior programming experience: https://www.youtube.com/@GenecraftSimulator
I recently moved from fish to plants and believe I've developed some new science at the intersection of CS and Evolutionary Biology that I'm looking to publish.
This tool is extremely useful. For now– You do require a human in the loop for coordination.
My guess is that these will be the benchmarks we see within a few years: how well an AI can coordinate multiple other AIs to build, deploy and iterate something that functions in the real world. Basically a manager AI.
Because they'll literally be able to solve every single one shot problem so we won't be able to create benchmarks anymore.
But that's also when these models will be able to build functioning companies in a few hours.
That's marketing language, not scientific or even casual language. So many outstanding claims, without even some basic explanations. Like how did it help you save these hours? Explaining terms? Outlining processes? Going to the post office for you? You don't need to sell me anything, I just want the how.
Yes, I'm using them from time to time for research. But I'm also aware of the topics I research and see through bs. And the best LLMs out there, right now, produce bs within just 3-4 paragraphs, in nicely documented areas.
A recent example is my question on how to run N vpn servers on N ips on the same eth with ip binding (in ip = out ip, instead of using a gw with the lowest metric). I had no idea but I know how networks work and the terminology. It started helping, created a namespace, set up lo, set up two interfaces for inner and outer routing and then made a couple of crucial mistakes that couldn’t be detected or fixed by someone even a little clueless (in routing setup for outgoing traffic). I didn’t even argue and just asked what that does wrt my task, and that started the classic “oh wait, sorry, here’s more bs” loop that never ended.
Eventually I distilled the general idea and found an article that AI very likely learned from, cause it was the same code almost verbatim, but without mistakes.
Does that count as helping? Idk, probably yes. But I know that examples like this show that you not only cannot leave an LLM unsupervised for any non-trivial question, but have to keep a competent role in the loop.
I think the programming community is just blinded by LLMs succeeding in writing kilometers of untalented react/jsx/etc crap that has no complexity or competence in it apart from repeating “do like this” patterns and literally millions of examples, so noise cannot hit through that “protection”. Everything else suffers from LLMs adding inevitable noise into what they learned from a couple of sources. The problem here, as I understand it, is that only specific programmer roles and s{c,p}ammers (ironically) write the same crap again and again millions of times, other info usually exists in only a few important sources and blog posts, and only a few of those are full and have good explanations.
To me it is more like there is someone jumping on a pogo ball while flapping their arms and saying that they are flying whenever they hop off the ground.
Skeptics say that they are not really flying, while adherents say that "with current pogo ball advancements, they will be flying any day now"
I understand that in this forum too many people are invested in putting lipstick on this particular pig.
Every person that believes that LLMs are near sentient or actually do a good job at reasoning is one more person handing over their responsibilities to a zero-accountability highly flawed robot. We've already seen LLMs generate bad legal documents, bad academic papers, and extremely bad code. Similar technology is making bad decisions about who to arrest, who to give loans to, who to hire, who to bomb, and who to refuse heart surgery for. Overconfident humans employing this tech for these purposes have been bamboozled by the lies from OpenAI, Microsoft, Google, et al. It's crucial to call out overstatement and overhype about this tech wherever it crops up.
> All it needs to be is useful.
Computers were already useful.
The only definition we have for "intelligence" is human (or, generally, animal) intelligence. If LLMs aren't that, let's call it something else.
“I’ll know it when I see it” isn’t a compelling argument.
So if above human intelligence does happen, I'd assume we'd know it, quite soon.
It feels compelling to me.
I don't think those that create AI care about that. They just to come out on top before someone else does.
That is a natural reaction to the incessant techbro, AIbro, marketing, and corporate lies that "AI" (or worse AGI) is a real thing, and can be directly compared to real humans.
There are people on this very thread saying it's better at reasoning than real humans (LOL) because it scored higher on some benchmark than humans... Yet this technology still can't reliably determine what number is circled, if two lines intersect, or count the letters in a word. (That said behaviour may have been somewhat fine-tuned out of newer models only reinforces the fact that the technology is inherently not capable of understanding anything.)
I've been doing AI things for about 20+ years and LLMs are wild. We've gone from specialized things being pretty bad at those jobs to general-purpose things that are better at that and everything else. The idea that you could make an API call with "is this sarcasm?" and get a better-than-chance guess is incredible.
I think I count myself among the skeptics nowadays for that reason. And I say this as someone that thinks LLM is an interesting piece of technology, but with somewhat limited use and unclear economics.
If the hype was about "look at this thing that can parse natural language surprisingly well and generate coherent responses", I would be excited too. As someone that had to do natural language processing in the past, that is a damn hard task to solve, and LLMs excel at it.
But that is not the hype is it? We have people beating the drums of how this is just shy of taking the world by storm, and AGI is just around the corner, and it will revolutionize all economy and society and nothing will ever be the same.
So, yeah, it gets tiresome. I wish the hype would die down a little so this could be appreciated for what it is.
Where are you seeing this? I pretty much only read HN and football blogs so maybe I’m out of the loop.
Analyzing whether or not LLMs have intelligence is missing the forest for the trees. This technology is emerging in a capitalist society that is hyper-optimized to adopt useful things at the expense of almost everything else. If the utility/price point gets hit for a problem, it will replace it regardless of whether it is intelligent or not.
If a language model can't solve problems in a programming language then we are just fooling ourselves in less defined domains of "thought".
Software engineering is where the rubber meets the road in terms of intelligence and economics when viewing our society as a complex system. Software engineering salaries are above average exactly because most average people are not going to be software engineers.
From that point of view the progress is not impressive at all. The current models are really not that much better than chatGPT4 in April 2023.
AI art is a better example though. There is zero progress being made now. It is only impressive at the most surface level for someone not involved in art and who can't see how incredibly limited the AI art models are. We have already moved on to video though to make the same half baked, useless models that are only good to make marketing videos for press releases about progress and one off social media posts about how much progress is being made.
For example, a team of humans is extremely reliable, much more reliable than one human, but a team of AIs isn't more reliable than one AI, since an AI is already an ensemble model. That means even if an AI could replace a person, it probably can't replace a team for a long time, meaning you still need the other team members there, meaning the AI didn't really replace a human, it just became a tool for humans to use.
I personally wouldn't be surprised if we start to see benchmarks around this type of cooperation and ability to orchestrate complex systems in the next few years or so.
Most benchmarks really focus on one problem, not on multiple real-time problems while orchestrating 3rd party actors who might or might not be able to succeed at certain tasks.
But I don't think anything is prohibiting these models from not being able to do that.
Not really. Francois (co-creator of the ARC Prize) has this to say:
The v1 version of the benchmark is starting to saturate. There were already signs of this in the Kaggle competition this year: an ensemble of all submissions would score 81%
Early indications are that ARC-AGI-v2 will represent a complete reset of the state-of-the-art, and it will remain extremely difficult for o3. Meanwhile, a smart human or a small panel of average humans would still be able to score >95% ... This shows that it's still feasible to create unsaturated, interesting benchmarks that are easy for humans, yet impossible for AI, without involving specialist knowledge. We will have AGI when creating such evals becomes outright impossible.
For me, the main open question is where the scaling bottlenecks for the techniques behind o3 are going to be. If human-annotated CoT data is a major bottleneck, for instance, capabilities would start to plateau quickly like they did for LLMs (until the next architecture). If the only bottleneck is test-time search, we will see continued scaling in the future.
https://x.com/fchollet/status/1870169764762710376 / https://ghostarchive.org/archive/Sqjbf
I was just thinking about how 3D game engines were perceived in the 90s. Every six months some new engine came out, blew people's minds, was declared photorealistic, and was forgotten a year later. The best of those engines kept improving and are still here, and kinda did change the world in their own way.
Software development seemed rapid and exciting until about Halo or Half Life 2, then it was shallow but shiny press releases for 15 years, and only became so again when OpenAI's InstructGPT was demonstrated.
While I'm really impressed with current AI, and value the best models greatly, and agree that they will change (and have already changed) the world… I can't help but think of the Next Generation front cover, February 1997 when considering how much further we may be from what we want: https://www.giantbomb.com/pc/3045-94/forums/unreal-yes-this-...
The transition seems to map well to the point where engines got sophisticated enough, that highly dedicated high-schoolers couldn't keep up. Until then, people would routinely make hobby game engines (for games they'd then never finish) that were MVPs of what the game industry had a year or three earlier. I.e. close enough to compete on visuals with top photorealistic games of a given year - but more importantly, this was a time where you could do cool nerdy shit to impress your friends and community.
Then Unreal and Unity came out, with a business model that killed the motivation to write your own engine from scratch (except for purely educational purposes), we got more games, more progress, but the excitement was gone.
Maybe it's just a spurious correlation, but it seems to track with:
> and only became so again when OpenAI's InstructGPT was demonstrated.
Which is again, if you exclude training SOTA models - which is still mostly out of reach for anyone but a few entities on the planet - the time where anyone can do something cool that doesn't have a better market alternative yet, and any dedicated high-schooler can make truly impressive and useful work, outpacing commercial and academic work based on pure motivation and focus alone (it's easier when you're not being distracted by bullshit incentives like user growth or making VCs happy or churning out publications, farming citations).
It's, once again, a time of dreams, where anyone with some technical interest and a bit of free time can make the future happen in front of their eyes.
The timescale you are describing for 3D graphics is 4 years from the 1997 cover you posted to the release of Halo which you are saying plateaued excitement because it got advanced enough.
An almost infinitesimally small amount of time in terms of the history of human development, and you are mocking the magazine for being excited about the advancement because it was... 4 years early?
The era was people getting wowed from Wolfenstein (1992) to "about Halo or Half Life 2" (2001 or 2004).
And I'm not saying the flattening of excitement was for any specific reason, just that this was roughly when it stopped getting exciting — it might have been because the engines were good enough for 3D art styles beyond "as realistic as we can make it", but for all I know it was the War On Terror which changed the tone of press releases and how much the news in general cared. Or perhaps it was a culture shift which came with more people getting online and less media being printed on glossy paper and sold in newsagents.
Whatever the cause, it happened around that time.
This was a time where, for 3D graphics, barriers to entry got low (math got figured out, hardware was good enough, knowledge spread), but the commercial market didn't yet capture everything. Hell, a bulk of those excited kids I remember, trying to do a better Unreal Tournament after school instead of homework (and almost succeeding!), went on to create and staff the next generation of commercial gamedev.
(Which is maybe why this period lasted for about as long as it takes for a schoolkid to grow up, graduate, and spend few years in the workforce doing the stuff they were so excited about.)
I was one of those kids, my focus was Marathon 2 even before I saw Unreal. I managed to figure out enough maths from scratch to end up with the basics of ray casting, but not enough at the time to realise the tricks needed to make that real time on a 75 MHz CPU… and then we all got OpenGL and I went through university where they explained the algorithms.
It's a very strange thing I've never understood.
We still barely know how to use computers effectively, and they have already transformed the world. For better or worse.
I've been blessed with grandchildren recently, a little boy that's 2 1/2 and just this past Saturday a granddaughter. Major events notwithstanding, the world will largely resemble today when they are teenagers, but the future is going to look very very very different. I can't even imagine what the capability and pervasiveness of it all will be like in ten years, when they are still just kids. For me, as someone that's invested in their future, I'm interested in all of the educational opportunities (technical, philosophical and self-awareness) but obviously am concerned about the potential for pernicious side effects.
> There is a related “Theorem” about progress in AI: once some mental function is programmed, people soon cease to consider it as an essential ingredient of “real thinking”. The ineluctable core of intelligence is always in that next thing which hasn’t yet been programmed. This “Theorem” was first proposed to me by Larry Tesler, so I call it Tesler’s Theorem: “AI is whatever hasn’t been done yet.”
All that's invalidated each time is the idea that a general solution to that task requires a general solution to all tasks, or that a general solution to that task requires our special sauce. It's the idea that something able to do that task will also be able to do XYZ.
And yet people keep coming up with a new task that people point to saying, 'this is the one! there's no way something could solve this one without also being able to do XYZ!'
I'd love more progress on tasks in the physical world, though. There are only a few paths for countries to deal with a growing ratio of old retired people to young workers:
1) Prioritize the young people at the expense of the old by e.g. cutting old age benefits (not especially likely since older voters have greater numbers and higher participation rates in elections)
2) Prioritize the old people at the expense of the young by raising the demands placed on young people (either directly as labor, e.g. nurses and aides, or indirectly through higher taxation)
3) Rapidly increase the population of young people through high fertility or immigration (the historically favored path, but eventually turns back into case 1 or 2 with an even larger numerical burden of older people)
4) Increase the health span of older people, so that they are more capable of independent self-care (a good idea, but difficult to achieve at scale, since most effective approaches require behavioral changes)
5) Decouple goods and services from labor, so that old people with diminished capabilities can get everything they need without forcing young people to labor for them
I am continually baffled that people here throw this argument out and can't imagine the second-order effects. If white collar work is automated by AGI, all the R&D to solve robotics beyond imagination will happen in a flash. The top AI labs, the people smart enough to make this technology, are all focusing on automating AGI researchers, and from there follows everything, obviously.
We're already seeing escape velocity in world modeling (see Google Veo2 and the latest Genesis LLM-based physics modeling framework).
The hardware for humanoid robots is 95% of the way there, the gap is control logic and intelligence, which is rapidly being closed.
Combine Veo2 world model, Genesis control planning, o3-style reasoning, and you're pretty much there with blue collar work automation.
We're only a few turns (<12 months) away from an existence proof of a humanoid robot that can watch a Youtube video and then replicate the task in a novel environment. May take longer than that to productionize.
It's really hard to think and project forward on an exponential. We've been on an exponential technology curve since the discovery of fire (at least). The 2nd order has kicked up over the last few years.
Not a rational approach to look back at robotics 2000-2022 and project that pace forwards. There's more happening every month than in decades past.
Calibrating to the current hype cycle has been challenging with AI pronouncements.
Our value proposition as humans in a capitalist society is an increasingly fragile thing.
who is going to pay for residential electrical work lol and how much will you make if some guy from MIT is going to compete with you
> while the majority of the population will be unemployable and forever left behind
Productivity improvements increase employment. A superhuman AI is a productivity improvement.
Sometimes: the productivity improvements from the combustion engine didn't increase employment of horses, it displaced them.
But even when productivity improvements do increase employment, it's not always to our advantage: the productivity improvements from Eli Whitney's cotton gin included huge economic growth and subsequent technological improvements… and also "led to increased demands for slave labor in the American South, reversing the economic decline that had occurred in the region during the late 18th century": https://en.wikipedia.org/wiki/Cotton_gin
A superhuman AI that's only superhuman in specific domains? We've been seeing plenty of those, "computer" used to be a profession, and society can re-train but it still hurts the specific individuals who have to be unemployed (or start again as juniors) for the duration of that training.
A superhuman AI that's superhuman in every domain, but close enough to us in resource requirements that comparative advantage is still important and we can still do stuff, relegates us to whatever the AI is least good at.
A superhuman AI that's superhuman in every domain… as soon as someone invents mining, processing, and factory equipment that works on the moon or asteroids, that AI can control that equipment to make more of that equipment, and demand is quickly — O(log(n)) — saturated. I'm moderately confident that in this situation, the comparative advantage argument no longer works.
The idea that productivity improvements increase employment is just fundamentally based on a different paradigm. There is absolutely no reason to think that when a machine exists that can do most things a human can do, as well if not better, for less or equal cost, this will somehow increase human employment. In this scenario, using humans in any stage of the pipeline would be deeply inefficient and a stupid business decision.
It's gone from "well the output is incoherent" to "well it's just spitting out stuff it's already seen online" to "WELL...uhh IT CAN'T CREATE NEW/NOVEL KNOWLEDGE" in the space of 3-4 years.
It's incredible.
We already have AGI.
On the other hand, there is a long, long history of AI achieving X but not being what we would casually refer to as "generally intelligent," then people deciding X isn't really intelligence; only when AI achieves Y will it be intelligence. Then AI achieves Y and...
Once you look at it that way, the approach really doesn't look like intelligence that's able to generalize to novel domains. It doesn't pass the sniff test. It looks a lot more like brute-forcing.
Which is probably why, in order to actually qualify for the leaderboard, they stipulate that you can't use more than $10k of compute. Otherwise, it just sounds like brute-forcing.
Could anyone confirm if this is the only kind of question in the benchmark? If yes, how come there is such a direct connection to "oh this performs better than humans", when LLMs can be quite a bit better than us at understanding and forecasting patterns? I'm just curious, not trying to stir up controversies.
Chollet (one of the creators of the ARC benchmark) has been saying it proves LLMs can't reason. The test questions are supposed to be unique and not in the model's training set. The fact that LLMs struggled with the ARC challenge suggested (to Chollet and others) that models weren't "truly reasoning" but rather just completing based on things they'd seen before - when the models were confronted with things they hadn't seen before, the novel visual patterns, they really struggled.
I am no less excited! This is a huge improvement.
How does this do on SWE Bench?
71.7%
This tries to create patterns that are intentionally not in the data and see if a system can generalize to them, which o3 super impressively does!
But isn’t it interesting to have several benchmarks? Even if it’s not about passing the Turing test, benchmarks serve a purpose—similar to how we measure microprocessors or other devices. Intelligence may be more elusive, but even if we had an oracle delivering the ultimate intelligence benchmark, we'd still argue about its limitations. Perhaps we'd claim it doesn't measure creativity well, and we'd find ourselves revisiting the same debates about different kinds of intelligences.
Humans clearly don't know what intelligence is, unambiguously. There's also no divinely ordained objective dictionary one can point at to reference what true intelligence is. A deep reflection on trying to pattern-associate different human cognitive abilities indicates that human cognitive capabilities aren't that spectacular, really.
There is no special sauce in our brain. And we know roughly how much compute there is in our brain, so we can estimate when we'll hit that with these 'LLMs'.
Language is important in human brain development as well. Kids who grow up deaf end up vastly less intelligent unless they learn sign language. Language allows us to process complex concepts that our brain can learn to solve, without having to be in those complex environments.
So in hindsight, it's easy to see why it took a language model to solve general tasks that other types of deep learning networks couldn't.
I don't really see any limits on these models.
I wonder if LLMs and language don't so much allow us to process these complex environments as preload our brains to get a head start in processing those complex environments once we arrive in them. I think LLMs store compressed relationships of the world, which obviously involves information loss compared to a neural mapping of the world that isn't just language-based. But that compressed relationship, i.e. knowledge, doesn't exactly map backward onto the world without a reverse key. Like artificially learning about real-world stuff in school, abstractly, and then going into the real world: it takes time for that abstraction to snap-fit onto the real world.
Could you further elaborate on what you mean by limits? I'm happy to play contrarian on what I think I interpret you to be saying there.
Also, to your main point: what intelligence is. Yeah, you sort of hit on my thoughts on intelligence. It's a combination of problem-solving abilities in different domains, like an amalgam of cognitive processes that achieve an amalgam of capabilities. While we can label all of that with a single word, that doesn't mean it's all a single process; it seems like a composite. Moreover, I think a big chunk of intelligence (but not all) is just brute-forcing the search for associations and then encoding those via some reflexive search/retrieval. A different part of intelligence, of course, is adaptability and pattern finding.
Maybe it would help to include some human results in the AI ranking.
I think we'd find that humans score lower?
E.g. go back in time and imagine you didn't know there are ways for computers to be really good at performing integration yet as nobody had tried to make them. If someone asked you how to tell if something is intelligent "the ability to easily reason integrations or calculate extremely large multiplications in mathematics" might seem like a great test to make.
Skip forward to the modern era and it's blatantly obvious CASes like Mathematica on a modern computer range between "ridiculously better than the average person" to "impossibly better than the best person" depending on the test. At the same time, it becomes painfully obvious a CAS is wholly unrelated to general intelligence and just because your test might have been solvable by an AGI doesn't mean solving it proves something must have been an AGI.
So you come up with a new test... but you have the same problem as before: it seems like anything non-human completely bombs it and an AGI would do well... but how do you know that whatever solves it was really an AGI, and not just another clearly unrelated system?
Short of something cleverer, what GP is saying is that the goalposts must keep being moved until it's no longer obvious the thing isn't AGI, not that the average human gets some particular, lower score.
.
All that aside, to answer your original question: in the presentation it was said the average human gets 85%, and this was the first model to beat that. It was also said a second version of the test is being worked on. They have papers on their site with clear examples of how the current test measures a lot that is unrelated to whether something is really AGI (a brute-force method was shown to get >50% in 2020), so their aim is to create a new goalpost test and see how things shake out this time.
We should skip to the end and just define a task like "it's AGI if it can predict, with 100% accuracy the average human's next action in any situation". Anything that can do that is as good as AGI even if people manage to find a proxy for the task.
What exactly is AGI to you ? If it's simply a generally intelligent machine then what are you waiting for ? What else is there to be sure of ? There's nothing narrow about these models.
Humans love to believe they're oh so special so much that there will always be debates on whether 'AGI' has arrived. If you are waiting for that then you'll be waiting a very long time, even if a machine arrives that takes us to the next frontier in science.
There is: they can't create new ideas the way humanity can. AGI should be able to replace humanity in terms of thinking, otherwise it isn't general; you would just have a model specialized at reproducing thoughts and patterns humans have already had. It still can't recreate science from scratch the way humanity did, meaning it can't do science properly.
Comparing an AI to a single individual is not how you measure AGI. If a group of humans performs better, then you can't use the AI to replace that group of humans, and thus the AI isn't an AGI, since it couldn't replace the group of humans.
So, for example, if a group of programmers writes more reliable programs than the AI, then you can't replace that group of programmers with the AI, even if you duplicate that AI many times, since the AI isn't capable of reproducing the same level of reliability when run in parallel. An AI run in parallel is still just an AI, and an ensemble model is still just an AI, so the model the AI has to beat is the human ensemble called humanity.
If we lower the bar a bit, it at least has to beat 100,000 humans working together to make a job obsolete. All the tutorials and similar resources are made by other humans as well; remove the job and those would also disappear, and the AI would have to do the work of all of those people too. If it can't, humans will still be needed.
It's possible you will be able to substitute parts of those human ensembles with AI much sooner, but then we just call it a tool. (We also call narrowly specialized humans tools, which is fair.)
In order to write general programs you need to have that skill. Every new code snippet needs to be evaluated by the system: does it make the codebase better or not? The lack of that ability is why you can't just loop an LLM today to replace programmers. It might be possible to automate specific programming tasks, but not general-purpose programming.
Overcoming that hurdle is not something I think an LLM can ever do; you need a totally different kind of architecture, not something trained to mimic but something trained to reason. I don't know how to train something that can reason about noisy, unstructured data. We will probably figure that out at some point, but it probably won't be LLMs as they are today.
As for what AGI is? Well, the lack of being able to describe that brings us full circle in this thread - I'll tell you for sure when I've seen it for the first time and have the power of hindsight to say what was missing. I think these models are the closest we've come but it feels like there is at least 1-2 more "4o->o1" style architecture changes where it's not necessarily about an increase in model fitting and more about a change in how the model comes to an output before we get to what I'd be willing to call AGI.
Who knows though, maybe some of those changes come along and it's closer but still missing some process to reason well enough to be AGI rather than a midway tool.
Best way of stating that I've heard.
The goalposts must keep moving until we understand enough about what is happening.
I usually poo-poo the goal post moving, but this makes sense.
Indistinguishable from goalpost moving like you said, but also no true Scotsman.
I'm curious what would happen in your eyes if we misattributed general intelligence to an AI model? What are the consequences of a false positive and how would they affect your life?
It's really clear to me how intelligence fits into our reality as part of our social ontology. The attributes, and their expression, that each of us uses to ground our concept of the "intelligent" predicate differ wildly.
My personal theory is that we tend to have an exemplar-based dataset of intelligence, and each of us attempts to construct a parsimonious model of intelligence, but like all (mental) models, they can be useful but wrong. These models operate in a space where the trade off is completeness or consistency, and most folks, uncomfortable saying "I don't know" lean toward being complete in their specification rather than consistent. The unfortunate side-effect is that we're able to easily generate test data that highlights our model inconsistency - AI being a case in point.
Rich people will think they can use the AI model instead of paying other people to do certain tasks.
The consequences could range from brilliant to utterly catastrophic, depending on the context and precise way in which this is done. But I'd lean toward the catastrophic.
someone wants a "planning officer" and believes that the LLM has AGI ...
someone wants a "hiring consultant" and believes that the LLM has AGI ...
etc. etc.
Basically, it's got the dumbest and simplest things in it. Stuff like a lock and key, a glass of water and jug, common units of currency, a zipper, etc. It tests if you can do any of those common human tasks. Like pouring a glass of water, picking up coins from a flat surface (I chew off my nails so even an able person like me fails that), zip up a jacket, lock your own door, put on lipstick, etc.
We had hand prosthetics that could play Mozart at 5x speed on a baby grand, but could not pick up a silver dollar or zip a jacket even a little bit. To the patients, the hands were therefore about as useful as a metal hook (a common solution with amputees today, not just pirates!).
Again, a total aside here, but your comment just reminded me of that brown briefcase. Life, it turns out, is a lot more complex than we give it credit for. Even pouring the OJ can be, in rare cases, transcendent.
I think I'd only really save time by having a robot that could unload my dishwasher and put up the clean dishes.
So get me a counter-level dishwasher cabinet and I’ll be happy!
And that's before we get to bits of sticky rice left on bowls, which somehow dishwashers never scrape off clean. YMMV.
2. Start with a cold prewash, preferably with a little powder in there too. This massively helps with stubborn stuff. This one is annoying though because you might have to come back and switch it on after the prewash. A good job for the robot butler.
Foldimate went bankrupt in 2021 [1], and the domain redirect from foldimate.com to a 404 page at miele.com suggests that it was Miele who bought up the remains, not a sketchy company with a ".website" top-level domain.
… so Elon Musk? :D
Humanoid robots are mostly a waste of time. Task-shaped robots are much easier to design, build, and maintain... and are more reliable. Some of the things you mention might needs humanoid versatility (loading the dishwasher), others would be far better served by purpose-built robots (laundry sorting).
Maybe "busbot" or "scullerybot".
It seems most people aren't willing to pay for multiple dishwashers (even multiple small ones) or to set aside enough space, and that places severe constraints on trying to do better.
So, not only is the human form the only solution for many tasks, it's also a much cheaper solution considering the idle time of task-specific robots. You would need only a single humanoid robot for all tasks, instead of buying a different machine for each task. And instead of having to design and build a new machine for each task, you'll need to just download new software for each task.
1000 machines specialized for 1000 tasks are great, but don’t deliver the same value as a single bot that can interchange with people flexibly.
Costly today, but won't be forever.
Getting to LLMs that could talk to us turned out to be a lot easier than making something that could control even a robotic arm without precise programming, let alone a humanoid.
> Rodney Brooks explains that, according to early AI research, intelligence was "best characterized as the things that highly educated male scientists found challenging", such as chess, symbolic integration, proving mathematical theorems and solving complicated word algebra problems. "The things that children of four or five years could do effortlessly, such as visually distinguishing between a coffee cup and a chair, or walking around on two legs, or finding their way from their bedroom to the living room were not thought of as activities requiring intelligence."
Because it's relevant to the point being made, i.e. that these tests reflect the biases and interests of the people who make them. This is true not just for AI tests, but for intelligence tests applied to humans. That Demis Hassabis, a chess player and video game designer, decided to test his machine on video games, Go and chess is probably not an accident.
The more interesting question is why people respond so apprehensively to pointing out a very obvious problem and bias in test design.
Of course. However i believe we can't move past that without being honest about where these biases are coming from. Many things in our world are the result of gender bias, both subtle and overt. However, at least at first glance, this does not appear to be one of them, and statements like the grandparent's quote serve to perpetuate such biases further.
Thank you for virtue signalling, though.
Yes, that was pretty clear in the original comment (?)
He was right. Scientists were focusing on the "science-y" bits and completely missed the elephant in the room: the things a toddler already masters are the monster challenge for AI right now, before we even get into "meaning of life" type stuff.
We learn our language and stereotypes subconsciously from our society, and it is no easy thing to fight against that.
I think a lot about carpentry. From the outside, it's pretty easy: Just make the wood into the right shape and stick it together. But as one progresses, the intricacies become more apparent. Variations in the wood, the direction of the grain, the seasonal variations in thickness, joinery techniques that are durable but also time efficient.
The way this information connects is highly multisensory and multimodal. I now know which species of wood to use for which applications. This knowledge was hard won through many, many mistakes and trials that took place at my home, the hardware store, the lumberyard, on YouTube, from my neighbor Steve, and in books written by experts.
Like in your test
a hand grenade and a pin - don't pull the pin.
Or maybe a mousetrap? but maybe that would be defused?
in the ai test...
or Global Thermonuclear War, the only winning move is...
I must be missing something, how can they be able to play Mozart at 5x speed with their prosthetics but not zip a jacket? They could press keys but not do tasks requiring feedback?
Or did you mean they used to play Mozart at 5x speed before they became amputees?
Picking up a 1mm-thick metal disk from a flat surface requires the user to get the timing, placement, and force exactly right, and I'm not even sure what considerations it needs for surface materials (e.g. slightly squishy fake skin) and/or tip shapes (e.g. fake nails).
Even more so for picking up coins from a flat surface.
For robotics, it's kind of obvious, speed is rarely an issue, so the "5x" part is almost trivial. And you can program the sequence quite easily, so that's also doable. Piano keys are big and obvious and an ergonomically designed interface meant to be relatively easy to press, ergo easy even for a prosthetic. A small coin on a flat surface is far from ergonomic.
The idea of a prosthesis is to help you regain functionality. If the best way of doing that is through automation, then it'd make little sense not to.
My point is -- being able to zip a jacket is all about those subtle actions, and could actually be harder than "just" playing piano fast.
Playing Mozart is much more forgiving in terms of the number of different motions you have to make in different directions and the amount of pressure to apply, and even the black keys are much bigger than large-sized zipper tongues.
(nothing wrong with it! I'm just trying to prune the top subthread)
In that sense, the goalposts haven’t moved in a long time despite claims from AI enthusiasts that people are constantly moving goalposts.
I'd love to know more about this.
The LLM only gets two guesses at the final solutions. The whole chain of thought is about breaking out the context and the levels of abstraction. How many guesses it self-generates and internally validates is just a matter of compute power and time.
My counterpoint to OP here would be that this is exactly how our brain works. In every given scenario, we too are evaluating all possible solutions. Our entire stack is constantly listening and either staying silent or contributing to an action potential (either excitatory or inhibitory), but our brain is always evaluating all potential possibilities at any given moment. We have a society of mind always contributing their opinions, and the ones without much support essentially get shouted down.
That's completely fair game. That's just search.
It's not AGI, obviously, in the sense that you still need some problem framing and initialization to kickstart the reasoning-path simulations.
"Well, yeah, but its kind of expensive" -- this guy
Picks up goalpost, looks for stadium exit
Semi-private eval (100 tasks): 75.7% at $2,012 total (~$20/task), with just 6 samples and 33M tokens processed, at ~1.3 min/task.
The “low-efficiency” setting with 1024 samples scored 87.5% but required 172x more compute.
If we assume compute spent and cost are proportional, then OpenAI might have just spent ~$346,064 for the low-efficiency run on the semi-private eval.
On the public eval they might have spent ~$1,148,444 to achieve 91.5% with the low-efficiency setting. (High-efficiency mode: $6,677.)
OpenAI just spent more money to run an eval on ARC than most people spend on a full training run.
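For what it's worth, here is a quick sketch of that back-of-envelope arithmetic, assuming (as above) that cost scales linearly with the 172x compute multiplier:

    # Scale the reported 6-sample ("high-efficiency") costs by the 172x compute
    # multiplier of the 1024-sample ("low-efficiency") setting.
    MULTIPLIER = 172
    semi_private_6_sample = 2_012   # USD total, 75.7% on the semi-private eval
    public_6_sample = 6_677         # USD total, 82.8% on the public eval
    print(f"semi-private, low-efficiency: ~${semi_private_6_sample * MULTIPLIER:,}")  # ~$346,064
    print(f"public, low-efficiency:       ~${public_6_sample * MULTIPLIER:,}")        # ~$1,148,444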
I double-checked with some FLOP estimates (P100 for 12 hours = the Kaggle limit; they claim ~100-1000x that for o3-low, and 172x on top for o3-high), so roughly on the order of 10^22-10^23 FLOP.
Coming at it another way, using an H100 market price of ~$2 per GPU-hour, $350k buys ~175k GPU-hours, or ~10^24 FLOP in total.
So, a huge margin, but 10^22 - 10^24 FLOP is the band I think we can estimate.
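A minimal sketch of that arithmetic, assuming ~2e13 FLOP/s effective throughput for a P100 and ~1e15 FLOP/s for an H100 (rough peak figures, not utilization-adjusted):

    import math

    p100_flops = 2e13                        # FLOP/s, rough P100 throughput (assumed)
    kaggle_budget = p100_flops * 12 * 3600   # one P100 for 12 hours, ~8.6e17 FLOP

    o3_high_low = kaggle_budget * 100 * 172     # ~1.5e22 FLOP
    o3_high_high = kaggle_budget * 1000 * 172   # ~1.5e23 FLOP

    # Cross-check via cost: ~$350k at ~$2 per H100-hour is ~175k GPU-hours.
    h100_flops = 1e15                                # FLOP/s per H100 (assumed)
    cost_based = (350_000 / 2) * 3600 * h100_flops   # ~6e23 FLOP

    for label, v in [("o3-high, low end", o3_high_low),
                     ("o3-high, high end", o3_high_high),
                     ("cost-based H100 estimate", cost_based)]:
        print(f"{label}: ~10^{math.log10(v):.1f} FLOP")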
These are the scale of numbers that show up in the Chinchilla-optimal paper, haha. Truly GPT-3-scale models.
Does seem to show an absolutely massive market for inference compute…
I think soon we'll be pricing all kinds of tasks by their compute costs. So basically: human = $50/task, AI = $6,000/task, use the human. If the AI beats the human, use the AI? Of course, that's assuming both get 100% scores on the task.
The thing is, given what we've seen from distillation and related techniques, even if it's $6,000/task... that will come down drastically over time through optimization and just plain faster, more efficient processing hardware and software.
That makes something like this competitive in ~3 years
o3 will be interesting if it indeed offers a novel way to handle problem solving, something that is able to learn efficiently from a few novel examples and adapt. That's what intelligence actually is. Maybe this is the case. If, on the other hand, it is a smart way to pair CoT with an evaluation loop (as the author hints is a possibility), then it is probable that, while this _can_ handle a class of problems current LLMs cannot, it is not really that kind of learning, meaning it will not scale to more complex, real-world tasks whose problem space is too large and thus less amenable to such a technique. It is still interesting, because having a good-enough evaluator may be a very important step, but it would mean that we are not yet there.
We will learn soon enough I suppose.
And o1 costs $15/$60 per 1M tokens in/out, so the estimated costs on the graph would match for a single task, not the whole benchmark.
From what I can see, presuming o3 is a progression of o1 and has a good level of accountability bubbling up during 'inference' (i.e. "Thinking about ___"), I'd say it's just using up millions of old-school tokens (the 44 million tokens that are referenced). So not latent thinking per se.
Sonnet 3.5 remains the king of the hill by quite some margin
That said, I think its code style is arguably better, more concise and has better patterns -- Claude needs a fair amount of prompting and oversight to not put out semi-shitty code in terms of structure and architecture.
In my mind: going from Slowest to Fastest, and Best Holistically to Worst, the list is:
1. o1-pro 2. Claude 3.5 3. Gemini 2 Flash
Flash is so fast, that it's tempting to use more, but it really needs to be kept to specific work on strong codebases without complex interactions.
Like I'll have it in a project in Cursor and it will spin up ready-to-use components that use my site style, reference existing components, and follow all existing patterns.
Then on some days it will even forget what language the project is in and start giving me Python code for a React project.
In a way it almost feels like it's become too good at following instructions and simply just takes your direction more literally. It doesn't seem to take the initiative of going the extra mile of filling in the blanks from your lazy input (note: many would see this as a good thing). Claude on the other hand feels more intuitive in discerning intent from a lazy prompt, which I may be prone to offering it at times when I'm simply trying out ideas.
However, if I take the time to write up a well thought out prompt detailing my expectations, I find I much prefer the code o1 creates. It's smarter in its approach, offers clever ideas I wouldn't have thought of, and generally cleaner.
Or put another way, I can give Sonnet a lazy or detailed prompt and get a good result, while o1 will give me an excellent result with a well thought out prompt.
What this boils down to is I find myself using Sonnet while brainstorming ideas, or when I simply don't know how I want to approach a problem. I can pitch it a feature idea the same way a product owner might pitch an idea to an engineer, and then iterate through sensible and intuitive ways of looking at the problem. Once I get a handle on how I'd like to implement a solution, I type up a spec and hand it off to o1 to crank out the code I'd intend to implement.
https://myswamp.substack.com/p/benchmarking-llms-against-com...
For coding, o1 is marvelous at Leetcode questions; I think it is the best teacher I could ever afford to teach me Leetcode. But I don't find myself having many other use cases for o1 that are complex and require a really long reasoning chain.
For example, I used the prompt "As an astronaut in China, would I be able to see the Great Wall?" and, since the training data for all LLMs is full of text dispelling the common myth that the Great Wall is visible from space, LLMs do not notice the slight variation that the astronaut is IN China. This has been a sobering reminder to me as discussion of AGI heats up.
Carefully analyze questions to not overlook subtle details. Take each question "as-is", don't guess what they mean -- interpret them as any reasonable person would.
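As a minimal sketch (assuming the OpenAI Python client and an illustrative model name), you could wire that instruction in as a system prompt and re-run the astronaut question, something like:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    system_prompt = (
        "Carefully analyze questions to not overlook subtle details. "
        'Take each question "as-is", don\'t guess what the asker means; '
        "interpret it as any reasonable person would."
    )

    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative; swap in whichever model you are testing
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "As an astronaut in China, would I be able to see the great wall?"},
        ],
    )
    print(resp.choices[0].message.content)

No guarantee it fixes the failure mode, but it makes the comparison cheap to run.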
Really want to see the number of training pairs needed to achieve this score. If it only takes a few pairs, say 100, I would say it is amazing!
> ARC-AGI serves as a critical benchmark for detecting such breakthroughs, highlighting generalization power in a way that saturated or less demanding benchmarks cannot. However, it is important to note that ARC-AGI is not an acid test for AGI
It feels so insensitive to do that right before a major holiday, when the likely outcome is a lot of people feeling less secure in their career/job/life.
Thanks again openAI for showing us you don’t give a shit about actual people.
What a weird way to react to this.
https://www.transformernews.ai/p/richard-ngo-openai-resign-s...
>But while the “making AGI” part of the mission seems well on track, it feels like I (and others) have gradually realized how much harder it is to contribute in a robustly positive way to the “succeeding” part of the mission, especially when it comes to preventing existential risks to humanity.
Almost every single one of the people OpenAI had hired to work on AI safety have left the firm with similar messages. Perhaps you should at least consider the thinking of experts?
Yeah maybe black mirror but I'm not sure it's really my thing.
Many of us look forward to what a future with AGI can do to help humanity and hopefully change society for the better, mainly to achieve a post scarcity economy.
>But while the “making AGI” part of the mission seems well on track, it feels like I (and others) have gradually realized how much harder it is to contribute in a robustly positive way to the “succeeding” part of the mission, especially when it comes to preventing existential risks to humanity.
Almost every single one of the people OpenAI had hired to work on AI safety have left the firm with similar messages. Perhaps you should at least consider the thinking of experts? There is a real chance that this ends with significant good. There is also a real chance that this ends with the death of every single human being. That's never been a choice we've had to make before, and it seems like we as a species are unprepared to approach it.
You need to make these expensive things nearly free if you're going to speak of post scarcity.
Notably, the last key AI safety researcher just left OpenAI: https://www.transformernews.ai/p/richard-ngo-openai-resign-s...
>But while the “making AGI” part of the mission seems well on track, it feels like I (and others) have gradually realized how much harder it is to contribute in a robustly positive way to the “succeeding” part of the mission, especially when it comes to preventing existential risks to humanity.
Are you that upset that this guy chose to trust the people that OpenAI hired to talk about AI safety, on the topic of AI safety?
>> Second, you need the ability to recombine these functions into a brand new program when facing a new task – a program that models the task at hand. Program synthesis.
"Program synthesis" is here used in an entirely idiosyncratic manner, to mean "combining programs". Everyone else in CS and AI for the last many decades has used "Program Synthesis" to mean "generating a program that satisfies a specification".
Note that "synthesis" can legitimately be used to mean "combining". In Greek it translates literally to "putting [things] together": "syn" (with/plus) "thesis" (place). But while generating programs by combining parts of other programs is an old-fashioned way to do Program Synthesis in the standard sense, the end result is always desired to be a program. The LLMs used in the article to do what F. Chollet calls "Program Synthesis" generate no code.
Combining programs should be straightforward for DNNs, ordering, mixing, matching concepts by coordinates and arithmetic in learned high-dimensional embedded-space. Inference-time combination is harder since the model is working with tokens and has to keep coherence over a growing CoT with many twists, turns and dead-ends, but with enough passes can still do well.
The logical next step to improvement is test-time training on the growing CoT, using reinforcement-fine-tuning to compress and organize the chain-of-thought into parameter-space--if we can come up with loss functions for "little progress, a lot of progress, no progress". Then more inference-time with a better understanding of the problem, rinse and repeat.
I hypothesize that something similar is going on here. OpenAI has not published (or I have not seen) the number of reasoning tokens it took to solve these; we do know that each task cost thousands of dollars. If "a picture is worth a thousand words", could we make AI systems that reason visually with much better performance?
I wonder if anyone has experimented with having some sort of "visual" scratchpad instead of the "text-based" scratchpad that CoT uses.
I don't understand this mindset. We have all experienced that LLMs can produce words never spoken before. Thus there is recombination of knowledge at play. We might not be satisfied with the depth/complexity of the combination, but there isn't any reason to believe something fundamental is missing. Given more compute and enough recursiveness we should be able to reach any kind of result from the LLM.
The linked article says that LLMs are like a collection of vector programs. It has always been my thinking that computations in vector space are easy to make turing complete if we just have an eigenvector representation figured out.
That was always true for NNs in general, yet it took a very specific structure to get to where we are now. (..with a certain amount of time and resources.)
> thinking that computations in vector space are easy to make turing complete if we just have an eigenvector representation figured out
Sounds interesting, would you elaborate?
* The upper bound of compute/performance gains as we continue to iterate on LLMs. It simply isn't going to be feasible for a lot of engineers and businesses to run/train their own LLMs. This means an inherent reliance on cloud services to bridge the gap (something MS is clearly betting on), and on engineers to build/maintain the integration from these services to whatever business logic their customers are buying.
* Skilled knowledge workers continuing to be in-demand, even factoring in automation and new-grad numbers. Collectively, we've built a better hammer; it still takes someone experienced enough to know where to drive the nail. These tools WILL empower the top N% of engineers to be more productive, which is why it will be more important than ever to know _how_ to build things that drive business value, rather than just how to churn through JIRA tickets or turn a pretty Figma design into React.
In general, with the technology advancing as rapidly as it is, and the trillions of dollars oriented towards replacing knowledge work, I don't see a future in this field. And that's despite me being on a very promising path myself! I'm 25, in the middle of a CS PhD in Germany, with an impressive CV behind me. My head may be the last on the chopping block, but I'd be surprised if it buys me more than a few years once programmer obsolescence truly kicks in.
Indeed, what I think are safe jobs are jobs with fundamental human interaction. Nurses, doctors, kindergarten teachers. I myself have been considering pivoting to becoming a skiing teacher.
Maybe one good thing that comes out of this is breaking my "wunderkind" illusion. I spent my teens writing C++ code instead of going out socializing and making friends. Of course, I still did these things, but I could've been far less of a hermit.
I mirror your sentiment of spending these next few years living life; Real life. My advice: Stop sacrificing the now for the future. See the world, go on hikes with friends, go skiing, attend that bouldering thing your friends have been telling you about. If programming is something you like doing, then by all means keep going and enjoy it. I will likely keep programming too, it's just no longer the only thing I focus on.
Edit: improve flow of last paragraph
Kind of stemming from the mindspace "If they can build X, I can build X!"
I'd explicitly not look up tutorials, just so I'd have the opportunity to solve the mathematics myself. Like building a 3D physics engine. (I did look up collision detection after struggling with it for a month or so; inventing GJK is on another level.)
Feels like I hit the real world just a couple years too late to get situated in a solid position. Years of obsession in attempt to catch up to the wizards, chasing the tech dream. But this, feels like this is it. Just watching the timebomb tick. I'd love to work on what feels like the final technology, but I'm not a freakshow like what these labs are hiring. At least I get to spectate the creation of humanity's greatest invention.
This announcement is just another gut punch, but at this point I should expect its inevitable. A Jason Voorhees AGI, slowly but surely to devour all the talents and skills information workers have to offer.
Apologies for the rambly and depressing post, but this is reality for anyone recently out or still in school.
We are living in a world run by and for the soon-to-be dead, many of whom have dementia, so empathic policy and foresight are out of the question, and we're going to be picking up the incredibly broken scraps of our golden age.
And not to get too political but the mass restructuring of public consciousness and intellectual society due to mass immigration for an inexplicable gdp squeeze and social media is happening at exactly the wrong time to handle these very serious challenges. The speed at which we've undone civil society is breakneck, and it will go even further, and it will get even worse. We've easily gone back 200 years in terms of emotional intelligence in the past 15.
Everything seems so uncertain, and the pace of technological advancement makes long-term planning feel almost impossible. Your plan to move to a slower-paced area and enjoy the outdoors sounds incredibly grounding - it's something I've been considering myself.
AI can be anywhere any time with cloud compute.
But below is reality talk. With Claude 3.5, I already think it is a better programmer than I at micro level tasks, and a better Leetcode programmer than I could ever be.
I think it is like modern car manufacturing: the robots build most of the components, but I can't see how humans could be dismissed from the process of overseeing the output.
o3 has been very impressive in achieving 70+ on SWE-bench, for example, but this also means that even when it is trained on the codebase multiple times, so visibility isn't an issue, it still has a ~30% chance of failing the unit tests.
A fully autonomous system can’t be trusted, the economy of software won’t collapse, but it will be transformed beyond our imagination now.
I will for sure miss the days when writing code, or coder is still a real business.
How time flies
The code part will get smaller and smaller for most folks. Some frameworks or bare-metal people or intense heavy-lifters will still do manual code or pair-programming where half the pair is an agentic AI with super-human knowledge of your org's code base.
But this will be a layer of abstraction for most people who build software. And as someone who hates rote learning, I'm here for it. IMO.
Unfortunately (?) I think the 10, 20, 50(?) years of development experience you might bring to bear on the problems can be superseded by an LLM fine-tuned on Stack Overflow, GitHub, etc. once judgement and needle-in-a-haystack retrieval are truly nailed. Because it can have all the knowledge you have accumulated and soaked into semi-conscious instinct, which you use so well you aren't even aware of it except that it works. It can have that a million times over. Actually. Which is both amazing and terrifying. Currently this isn't obvious because its accuracy/judgement for learning all those life-of-a-dev lessons is almost non-existent. Currently. But it will happen. That is Copilot's future. Its raison d'être.
I would argue what it will never have however, simply by function of the size of training runs is unique functional drive and vision. If you wanted a "Steve Jobs" AI you would have to build it. And if you gave it instructions to make a prompt/framework to build a "Jobs" it would just be an imitation, rather than a new unique in-context version. That is the value a person has- their particular filter, their passion and personal framework. Someone who doesn't have any of those things, they had better be hoping for UBI and charity. Or go live a simple life, outside the rat race.
bows
But unlike the abacus/calculators i don't feel like we're at a point in history where society is getting wiser and more empathetic, and these new abilities are going towards something good.
But supervisors of tasks will remain because we're social, untrusting, and employers will always want someone else to blame for their shortcomings. And humans will stay in the chain at least for marketing and promotion/reputation because we like our japanese craftsman and our amg motors made by one person.
Perhaps what I need is actually a steady stream of food - i.e. buy some land and oxen and solar panels while I can.
For what it's worth that's probably an advantage versus the legions of people who are staring down the barrel of years invested into skills that may lose relevance very rapidly.
With LLMs or without LLMs, the world will keep turning. Humans will still be writing amazing works of literature, creating beautiful art, carrying out scientific experiments and discovering new species.
There are way more data analysts now than when it required paper and pencil.
It's like my life is forfeit to fixing other people's mistakes, because they're so glaring and I feel an obligation. Maybe that's the way the world has always been, but it's a concerning future right now.
So, next step in reasoning is open world reasoning now?
If we're inferring the answers to the block patterns from minimal or no additional training, it's very impressive, but how much time have they had to work on o3 after sharing puzzle data with o1? Seems there's some room for questionable antics!
Why is the ARC challenge difficult but coding problems are easy? The two examples they give for ARC (border width and square filling) are much simpler than pattern awareness I see simple models find in code everyday.
What am I misunderstanding? Is it that one is a visual grid context which is unfamiliar?
We have an enormous amount of high-quality programming samples. From there it's relatively straightforward to bootstrap (similar to the original versions of AlphaGo: start with human games, improve via self-play) using Leetcode or other problems with a "right answer".
In contrast, the arc puzzles are relatively novel (why? Well, this has to do with the relative utility of solving an arc problem and programmer open source culture)
While there are those that are excited, the world is not prepared for the level of distress this could put on the average person without critical changes at a monumental level.
That would bug me, if I were you.
You can't bullshit your way through this particular benchmark. Try it.
And yes, they're wrong. The latest/greatest models "make shit up" perhaps 5-10% as frequently as we were seeing just a couple of years ago. Only someone who has deliberately decided to stop paying attention could possibly argue otherwise.
I have noticed it's great in the hands of marketers and scammers, however. Real good at those "jobs", so I see why the cryptobros have now moved onto hailing LLMs as the next coming of jesus.
I do find, however, that the newer the model the fewer elementary mistakes it makes, and the better it is at figuring out what I really want. The process of getting the right answer or the working function continues to become less frustrating over time, although not always monotonically so.
o1-pro is expensive and slow, for instance, but its performance on tasks that require step-by-step reasoning is just astonishing. As long as things keep moving in that direction I'm not going to complain (much).
/s
"It is ceasing to be a matter of how we think about technics, if only because technics is increasingly thinking about itself. It might still be a few decades before artificial intelligences surpass the horizon of biological ones, but it is utterly superstitious to imagine that the human dominion of terrestrial culture is still marked out in centuries, let alone in some metaphysical perpetuity. The high road to thinking no longer passes through a deepening of human cognition, but rather through a becoming inhuman of cognition, a migration of cognition out into the emerging planetary technosentience reservoir, into 'dehumanized landscapes ... emptied spaces' where human culture will be dissolved. Just as the capitalist urbanization of labour abstracted it in a parallel escalation with technical machines, so will intelligence be transplanted into the purring data zones of new software worlds in order to be abstracted from an increasingly obsolescent anthropoid particularity, and thus to venture beyond modernity. Human brains are to thinking what mediaeval villages were to engineering: antechambers to experimentation, cramped and parochial places to be.
[...]
Life is being phased-out into something new, and if we think this can be stopped we are even more stupid than we seem." [0]
Land is being ostracized for some of his provocations, but it seems pretty clear by now that we are in the Landian Accelerationism timeline. Engaging with his thought is crucial to understanding what is happening with AI, and what is still largely unseen, such as the autonomization of capital.
Sure, there will be growing pains, friction, etc. Who cares? There always is with world-changing tech. Always.
That's right. Who cares about pains of others and why they even should are absolutely words to live by.
What you are likely doing, though, is making many more future humans pay a cost in suffering. Every day we delay longevity escape velocity is another 150k people dead.
I'd rather just... not die. Not unless I want to. Same for my loved ones. That's far more important than "wealth inequality."
Senescence is an adaptation.
"You should die because cities will get crowded" is a less terrible argument but still a bad one. We have room for at least double our population on this planet, couples choosing longevity can be required to have <=1 children until there is room for more, we will eventually colonize other planets, etc.
All this is implying that consciousness will continue to take up a meaningful amount of physical space. Not dying in the long term implies gradual replacement and transfer to a virtual medium at some point.
If you take this as an axiom, it will always be true ;).
“Oh well, I guess I can’t give the opportunities to my kid that I wanted, but at least humanity is growing rapidly!”
Everyone has always worried about this for every major technology throughout history
IMO AGI will dramatically increase comfort levels, lower your chance of dying, death, disease, etc.
People aren't really in an uproar yet, because implementations haven't affected the job market of the masses. Afterwards? Time will tell.
Like in general I totally agree with you, but I also understand why a person would care about their loved ones and themselves first.
Beyond immediate increase in inequality, which I agree could be worth it in the long run if this was the only problem, we're playing a dangerous game.
The smartest and most capable species on the planet that dominates it for exactly this reason, is creating something even smarter and more capable than itself in the hope it'd help make its life easier.
Hmm.
>But while the “making AGI” part of the mission seems well on track, it feels like I (and others) have gradually realized how much harder it is to contribute in a robustly positive way to the “succeeding” part of the mission, especially when it comes to preventing existential risks to humanity.
Almost every single one of the people OpenAI had hired to work on AI safety have left the firm with similar messages. Perhaps you should at least consider the thinking of experts?
You and I will likely not live to see much of anything past AGI.
The people experiencing the growing pains, friction, etc.
For one, I found AI coding to work best in a small team, where there is an understanding of what to build and how to build it, usually in close feedback loop with the designers / users. Throw the usual managerial company corporate nonsense on top and it doesn't really matter if you can instacreate a piece of software, if nobody cares for that piece of software and it's just there to put a checkmark on the Q3 OKR reports.
Furthermore, there is a lot of software to be built out there, for people who can't afford it yet. A custom POS system for the local baker so that they don't have to interact with a computer. A game where squids eat algae for my nephews at Christmas. A custom photo layout program for my dad, who despairs at InDesign. A plant watering system for my friend. A local government information website for older citizens. Not only can these be built at a fraction of the cost they were before, they can be built in a manner where the people using the software are directly involved in creating it. Maybe they can get an 80% hacked version together if they are technically inclined. I can add the proper database backend and deployment infrastructure. Or I can sit with them and iterate on the app as we are talking. It is also almost free to create great documentation; in fact, LLM development is most productive when you turn software engineering best practices up to 11.
Furthermore, I found these tools incredible for actively furthering my own fundamental understanding of computer science and programming. I can now skip the stuff I don't care to learn (is it foobarBla(func, id) or foobar_bla(id, func)) and put the effort where I actually get a long-lived return. I have become really ambitious with the things I can tackle now, learning about all kinds of algorithms and operating system patterns and chemistry and physics etc... I can also create documents to help me with my learning.
Local models are now entering the phase where they are getting to be really useful, definitely > gpt3.5 which I was able to use very productively already at the time.
Writing (creating? manifesting? I don't really have a good word for what I do these days) software that makes me and the real humans around me happy is extremely fulfilling, and has alleviated most of my angst around the technology.
Assuming for a moment that the cost per task has a linear relationship with compute, then it costs a little more than $1 million to get that score on the public eval.
The results are cool, but man, this sounds like such a busted approach.
It feels a lot less like the breakthrough when the solution looks so much like simply brute-forcing.
But you might be right, who cares? Does it really matter how crude the solution is if we can achieve true AGI and bring the cost down by increasing the efficiency of compute?
That’s the thing that’s interesting to me though and I had the same first reaction. It’s a very different problem than brute-forcing chess. It has one chance to come to the correct answer. Running through thousands or millions of options means nothing if the model can’t determine which is correct. And each of these visual problems involve combinations of different interacting concepts. To solve them requires understanding, not mimicry. So no matter how inefficient and “stupid” these models are, they can be said to understand these novel problems. That’s a direct counter to everyone who ever called these a stochastic parrot and said they were a dead-end to AGI that was only searching an in distribution training set.
The compute costs are currently disappointing, but so was the cost of sequencing the first whole human genome. That went from 3 billion to a few hundred bucks from your local doctor.
That is roughly 10^9 times more compute required, or roughly the US military budget per half an hour, to get the intelligence of 1 (!) STEM graduate (not any kind of superhuman intelligence).
Of course, algorithms will get better, but this particular approach feels like wading in a plateau of efficiency improvements, very, very far down the X axis.
You'll know AGI is here when traditional captchas stop being a thing due to their lack of usefulness.
Some folks say we could fix this with universal basic income, where everyone gets enough money to live on, but I'm not optimistic that it'll be an easy transition. Plus, there's this possibility that whoever controls these 'AGI' systems basically controls everything. We definitely need to figure this stuff out before it hits us, because once these changes start happening, they're probably going to happen really fast. It's kind of like we're building this awesome but potentially dangerous new technology without really thinking through how it's going to affect regular people's lives. I feel like we need a parachute before we attempt a skydive. Some people feel pretty safe about their jobs and think they can't be replaced. I don't think that will be the case. Even if AI doesn't take your job, you now have a lot more unemployed people competing for the same job that is safe from AI.
I'll get concerned when it stops sucking so hard. It's like talking to a dumb robot. Which it unsurprisingly is.
Do people refuse to buy from stores that sell goods manufactured with slave labour?
Most people don't care; if AI businesses offer goods/services at lower cost, people will vote with their wallets, not their principles.
Besides, AI researchers failed to make anything like a real Chatbot until recently, yet they've been trying since the Eliza days. I'm willing to put in at least as much effort as them.
It makes sense because tree search can be endlessly optimized. In a sense, LLMs turn the unstructured, open system of general problems into a structured, closed system of possible moves. Which is really cool, IMO.
It's good for games with clear signal of success (Win/Lose for Chess, tests for programming). One of the blocker for AGI is we don't have clear evaluation for most of our tasks and we cannot verify them fast enough.
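That "clear signal" point is what makes verifier-guided sampling work at all. A minimal sketch, with generate_candidate and run_tests as hypothetical stand-ins for a model call and a sandboxed test runner:

    from typing import Callable, Optional

    def best_of_n(generate_candidate: Callable[[str], str],
                  run_tests: Callable[[str], bool],
                  prompt: str,
                  n: int = 64) -> Optional[str]:
        """Sample up to n candidate solutions and return the first one the verifier accepts."""
        for _ in range(n):
            candidate = generate_candidate(prompt)  # one sampled solution, e.g. a patch
            if run_tests(candidate):                # cheap, unambiguous success signal
                return candidate
        return None  # no verified solution within the sampling budget

Without that cheap, trustworthy run_tests step, the loop has nothing to select on, which is exactly the blocker for most non-game, non-coding tasks.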
A bit puzzling to me. Why does it matter ?
In reality it seems to be a bit of both - there is some general intelligence based on having been "trained on the internet", but it seems these super-human math/etc skills are very much from them having focused on training on those.
Francois Chollet mentioned that the test tries to avoid curve fitting (which he states is the main ability of LLMs). However, they specifically restricted the number of examples to do this. It is not beyond the realms of possibility that many examples could have been generated by hand though, and that the curve fitting has been achieved, rather than discrete programming.
Anyway, it’s all supposition. It’s difficult to know how genuine the results is, without knowledge of how it was actually achieved.
> My mental model for LLMs is that they work as a repository of vector programs. When prompted, they will fetch the program that your prompt maps to and "execute" it on the input at hand. LLMs are a way to store and operationalize millions of useful mini-programs via passive exposure to human-generated content.
I found this such an intriguing way of thinking about it.
Not so sure - but we might need to figure out the inference/search/evaluation strategy in order to provide the data we need to distill to the single forward-pass data fitting.
Serious question. I've browsed around, looked for the official release, but it seems to be just hear-say for now, except for the few little bits in the ARC-AGI article.
So some of the reactions seem quite far-fetched. I was quite amazed at first, seeing the benchmarks, but then I actually read the ARC-AGI article and a few other things about how it worked, learned a bit more about the different benchmarks, and realised we have no proper idea yet how o3 works under the hood; the thing isn't even released.
It could be doing the same thing that chess-engines do except in several specific domains. Which would be very cool, but not necessarily "intelligent" or "generally intelligent" in any sense whatsoever! Will that kind of model lead to finding novel mathematical proofs, or actually "reasoning" or "thinking" in any way similar to a human, remains entirely uncertain.
Of course is a chance we will find ourselves in Utopia, but yeah, a chance.
(I know very little about the guts of LLMs or how they're tested, so the distinction between "raw" output and the more deterministic engineering work might be incorrect)
I’m wondering when you strip out all that “extra” non-model pre and post processing, if there’s someway to measure performance of that.
Was it zero-shot at least and Pass@1 ? I guess it was not zero-shot, since it shows examples of other similar problems and their solutions. It also sounds like it was fine-tuned on that specific task.
Look, maybe this shows that it could soon be used to replace some MTurk-style workers, but I don't know that that counts as AGI. To me, AGI needs to be able to solve novel problems, to adapt to all situations without fine-tuning, and to operate at much larger dimensions: don't make it a grid of pixels, make it 4k images at least.
Even if productivity skyrockets, why would anyone assume the dividends would be shared with the "destroy[ed] middle class"?
All indications will be this will end up like the China Shock: "I lost my middle class job, and all I got was the opportunity to buy flimsy pieces of crap from a dollar store." America lacks the ideological foundations for any other result, and the coming economic changes will likely make building those foundations even more difficult if not impossible.
Huh? I'm not sure exactly what you're talking about, but mere "access to the financial system" wouldn't remedy anything, because of inequality, etc.
To survive the shock financially, I think one would have to have at least enough capital to be a capitalist.
Unless something changes, if I was a billionaire I would be ecstatic at the moment. Now even the impossible seems potentially possible if this delivers on its promises (e.g. go to Mars, build a utopia for my inner circle, etc). I no longer need other people to have everything. Previously there was no point in money if I didn't have a place to spend it/people to accept it. Now with real assets I can use AI/machines to do what I want - I no longer need "money" or more accurately other people to live a very wealthy life.
Again this is all else being equal. Lots of other things could change, but with increasing surveillance by use of technology I doubt large revolutions/etc will ever get the chance to get off the ground or have the scale to be effective.
Interesting times.
But I gotta say, we must be saturating just about any zero-shot reasoning benchmark imaginable at this point. And we will still argue about whether this is AGI, in my opinion because these LLMs are forgetful and it's very difficult for an application developer to fix that.
Models will need better ways to remember and learn from doing a task over and over. For example, let's look at code agents: the best we can do, even with o3, is to cram as much of the codebase as we can fit into a context window. And if it doesn't fit, we branch out to multiple models to prune the context window until it does fit. And here's the kicker: the second time you ask it to do something, this all starts over from zero again. With this amount of reasoning power, I'm hoping session-based learning becomes the next frontier for LLM capabilities.
(There are already things like tool use, linear attention, RAG, etc that can help here but currently they come with downsides and I would consider them insufficient.)
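For what it's worth, that "cram as much of the codebase as fits" step is roughly the following (a minimal sketch; it assumes tiktoken for token counting and that files are already ranked by relevance, which is the hard part):

    import tiktoken

    def pack_context(ranked_files: list[tuple[str, str]], budget_tokens: int = 100_000) -> str:
        """Greedily pack (path, text) pairs into one prompt without exceeding the token budget."""
        enc = tiktoken.get_encoding("cl100k_base")
        packed, used = [], 0
        for path, text in ranked_files:
            cost = len(enc.encode(text))
            if used + cost > budget_tokens:
                continue  # a real agent would summarize or prune instead of just skipping
            packed.append(f"# file: {path}\n{text}")
            used += cost
        return "\n\n".join(packed)

And, per the comment above, nothing here persists between requests: the packing starts over from zero every time.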
Now I am wondering what Anthropic will come up with. Exciting times.
The Kaggle SOTA performs 2x as well as o1 high at a fraction of the cost
There's also a point on the figure marked "Kaggle SOTA", around 60%. I can't find any explanation for that, but I guess it's the best individual Kaggle solution.
The Kaggle solutions would probably score higher with more compute, but nobody has any incentive to spend >$1M on approaches that obviously don't generalize. OpenAI did have this incentive to spend tuning and testing o3, since it's possible that will generalize to a practically useful domain (but not yet demonstrated). Even if it ultimately doesn't, they're getting spectacular publicity now from that promise.
I wonder what exactly o3 costs. Does it still spend a terrible amount of time thinking, despite being finetuned to the dataset?
> Of course, such generality comes at a steep cost, and wouldn't quite be economical yet: you could pay a human to solve ARC-AGI tasks for roughly $5 per task (we know, we did that), while consuming mere cents in energy. Meanwhile o3 requires $17-20 per task in the low-compute mode.
If we feel like we've really "hit the ceiling" RE efficiency, then that's a different story, but I don't think anyone believes this at this time.
If an ensemble of low-compute Kaggle solutions already scores 81%, then why is o3's 75.7% considered such a breakthrough?
Please share. I’m compiling a list.
85% is just the (semi-arbitrary) threshold for the winning the prize.
o3 actually beats the human average by a wide margin: 64.2% for humans vs. 82.8%+ for o3.
...
Here's the full breakdown by dataset, since none of the articles make it clear --
Private Eval:
- 85%: threshold for winning the prize [1]
Semi-Private Eval:
- 87.5%: o3 (unlimited compute) [2]
- 75.7%: o3 (limited compute) [2]
Public Eval:
- 91.5%: o3 (unlimited compute) [2]
- 82.8%: o3 (limited compute) [2]
- 64.2%: human average (Mechanical Turk) [1] [3]
Public Training:
- 76.2%: human average (Mechanical Turk) [1] [3]
...
References:
[1] https://arcprize.org/guide
    def letter_count(string, letter):
        if string == "strawberry" and letter == "r":
            return 3
        …
> This is significant, but I am doubtful it will be as meaningful as people expect aside from potentially greater coding tasks. Without a 'world model' that has a contextual understanding of what it is doing, things will remain fundamentally throttled.
I don't think this is AGI; nor is it something to scoff at. It's impressive, but it's also not human-like intelligence. Perhaps human-like intelligence is not the goal, since that would imply we have even a remotely comprehensive understanding of the human mind. I doubt the mind operates as a single unit anyway; a human's first words are "Mama," not "I am a self-conscious freely self-determining being that recognizes my own reasoning ability and autonomy." And the latter would be easily programmable anyway. The goal here might, then, be infeasible: the concept of free will is a kind of technology in and of itself; it has already augmented human cognition. How will these technologies not augment the "mind" such that our own understanding of our consciousness is altered? And why should we try to determine ahead of time what will hold weight for us, why the "human" part of the intelligence will matter in the future? Technology should not be compared to the world it transforms.
"low compute" mode: Uses 6 samples per task, Uses 33M tokens for the semi-private eval set, Costs $17-20 per task, Achieves 75.7% accuracy on semi-private eval
The "high compute" mode: Uses 1024 samples per task (172x more compute), Cost data was withheld at OpenAI's request, Achieves 87.5% accuracy on semi-private eval
Can we just extrapolate $3kish per task on high compute? (wondering if they're withheld because this isn't the case?)
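For what it's worth, a naive linear extrapolation from the published low-compute numbers does land in that ballpark, assuming cost scales with the number of samples (which OpenAI hasn't confirmed):

    # Back-of-envelope only: assumes cost scales linearly with samples per task.
    low_cost_per_task = (17, 20)   # USD, at 6 samples per task
    compute_multiplier = 172       # "172x more compute" for the 1024-sample mode

    high_estimate = tuple(c * compute_multiplier for c in low_cost_per_task)
    print(high_estimate)           # (2924, 3440) -> roughly $2.9k-$3.4k per task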
Doesn't seem like such a massive breakthrough when they are throwing so much compute at it, particularly as this is test-time compute. It just isn't practical at all; you are not getting this level with a ChatGPT subscription, even the new $200-a-month option.
So... Next year this tech will most likely be quite a bit cheaper.
GPT-3 may have massively reduced in cost, but its requirements were nowhere near as extreme as this.
But only OpenAI really knows how the cost would scale for different tasks. I'm just making (poor) speculation
No, we won't. All that will tell us is that the abilities of the humans who have attempted to discern the patterns of similarity among problems difficult for auto-regressive models have once again failed us.
We'd know if we had AGIs in the real world since we have plenty of examples from fiction. What we have instead are tools. Steven Spielberg's androids in the movie AI would be at the boundary between the two. We're not close to being there yet (IMO).
"Our programs compilation (AI) gave 90% of correct answers in test 1. We expect that in test 2 quality of answers will degenerate to below random monkey pushing buttons levels. Now more money is needed to prove we hit blind alley."
Hurray! Put a limited version of that on everybody's phones!
That may be a feature. If AI becomes too cheap, the over-funded AI companies lose value.
(1995 called. It wants its web design back.)
The current US auto industry is an example of that strategy. So is the current iPhone.
> Moreover, ARC-AGI-1 is now saturating – besides o3's new score, the fact is that a large ensemble of low-compute Kaggle solutions can now score 81% on the private eval.
So while these tasks get greatest interest as a benchmark for LLMs and other large general models, it doesn't yet seem obvious those outperform human-designed domain-specific approaches.
I wonder to what extent the large improvement comes from OpenAI training deliberately targeting this class of problem. That result would still be significant (since there's no way to overfit to the private tasks), but would be different from an "accidental" emergent improvement.
Especially in medicine, the amount of data is ridiculously small and noisy. Maybe creating foundational models in mice and rats and fine-tuning them on humans is something that will be tried.
There's a ~3 month delay between o1's launch (Sep 12) and o3's launch (Dec 20). But, it's unclear when o1 and o3 each finished training.
I believe that we should explore pretraining video completion models that explicitly have no text pairings. Why? We can train unsupervised, like they did for the GPT series on the text internet, but on YouTube instead, lol. Labeling or augmenting the frames limits how well the training data scales.
Imagine using the initial frames or audio to prompt the video completion model. For example, use the initial frames to write out a problem on a whiteboard, then watch the output generate the next frames with the solution being worked out.
I fear text pairings with CLIP or OCR constrain a model too much and confuse it.
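As a toy illustration of what that objective looks like: nothing here is a real video model (the "frames" are random arrays and the "model" is a single linear map), it's just to show that next-frame prediction needs no labels, and that "prompting" is conditioning on initial frames and rolling forward:

    import numpy as np

    # Toy next-frame prediction with no text pairings: the only signal is the video itself.
    T, H, W = 16, 8, 8                      # a tiny fake "video": 16 frames of 8x8
    video = np.random.rand(T, H * W)

    rng = np.random.default_rng(0)
    weights = rng.normal(scale=0.01, size=(H * W, H * W))   # stand-in "model"

    lr = 0.01
    for step in range(200):
        past, target = video[:-1], video[1:]          # predict frame t+1 from frame t
        pred = past @ weights
        grad = past.T @ (pred - target) / len(past)   # gradient of mean squared error
        weights -= lr * grad

    # "Prompting": condition on an initial frame and generate a continuation.
    frame = video[0]
    for _ in range(5):
        frame = frame @ weights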
Can machines be more human-like in their pattern recognition? O3 met this need today.
While this is some form of accomplishment, it's nowhere near the scientific and engineering problem solving needed to call something truly artificial (human-like) intelligent.
What’s exciting is that these reasoning models are making significant strides in tackling eng and scientific problem-solving. Solving the ARC challenge seems almost trivial in comparison to that.
https://news.ycombinator.com/item?id=42344336
And that answers my question about fchollet's assurances that LLMs without TTT (Test Time Training) can't beat ARC AGI:
[me] I haven't had the chance to read the papers carefully. Have they done ablation studies? For instance, is the following a guess or is it an empirical result?
[fchollet] >> For instance, if you drop the TTT component you will see that these large models trained on millions of synthetic ARC-AGI tasks drop to <10% accuracy.
Interestingly, Bongard problems do not have a private test set, unlike ARC-AGI. Can that be because they don't need it? Is it possible that Bongard Problems are a true test of (visual) reasoning that requires intelligence to be solved?
Ooooh! Frisson of excitement!
But I guess it's just that nobody remembers them and so nobody has seriously tried to solve them with Big Data stuff.
And not the other way around as some comments here seem to confuse necessary and sufficient conditions.
Here's my AGI test - Can the model make a theory of AGI validation that no human has suggested before, test itself to see if it qualifies, iterate, read all the literature, and suggest modifications to its own network to improve its performance?
That's what a human-level performer would do.
State actors like Russia, US and Israel will probably be fast to adopt this for information control, but I really don’t want to live in a world where the average scammer has access to this tech.
Reality check: local open source models are more than capable of information control, generating propaganda, and scamming you. The cat's been out of the bag for a while now, and increased reasoning ability doesn't dramatically increase the weaponizability of this tech, I think.
I have no idea what to specialize in, what skills I should master, or where I should be spending my time to build a successful career.
Seems like we’re headed toward a world where you automate someone else’s job or be automated yourself.
It's not encouraging from the point of view of studying hard but the evolution of work the past 40 years seems to show that your field probably won't be your field quite exactly in just a few years. Not because your field will have been made irrelevant but because you will have moved on. Most likely that will be fine, you will learn more as you go, hopefully moving from one relevant job to the next very different but still relevant job. Or straight out of school you will work in very multi-disciplinary jobs anyway where it will seem not much of what you studied matters (it will but not in obvious ways.)
Certainly if you were headed into a very specific job which seems obviously automatable right now (as opposed to one where the tools will be useful), don't do THAT. Like, don't train as a typist as the core of your job in the middle of the personal computer revolution, or don't specialize in hand-drawing IC layouts in the middle of the CAD revolution unless you have a very specific plan (court reporting? DRAM?)
The technical act of solving well-defined problems has traditionally been considered the easy part. The role of a technical expert has always been asking the right questions and figuring out the exact problem you want to solve.
As long as AI just solves problems, there is room for experts with the right combination of technical and domain skills. If we ever reach the point where AI takes the initiative and makes human experts obsolete, you will have far bigger problems than career.
One thing that isn’t clear is how much agency AGI will have (or how much we’ll want it to have). We humans have our agency biologically programmed in—go forth and multiply and all that.
But the fact that an AI can theoretically do any task doesn’t mean it’s actually going to do it, or do anything at all for that matter, without some human telling it in detail what to do. The bull case for humans is that many jobs just transition seamlessly to a human driving an AI to accomplish similar goals with a much higher level of productivity.
And worrisome, because school propaganda, for example, shows that "saving the planet" is the only ethical goal for anyone. If AGIs latch on to that, if it becomes their religion, humans are in trouble. But for now, AGIs' self-chosen goals are anyone's guess (with cool ideas in sci-fi).
I argue that CAD was a general solution - which still demanded people who knew what they wanted and what they were doing. You can screw around with excellent tools for a long time if you don't know what you are doing. The tool will give you a solution - to the problem that you mis-stated.
I argue that globalisation was a general solution. And it still demanded people who knew what they were doing to direct their minions in far flung countries.
I argue that the purpose of an education is not to learn a specific programming language (for example). It's to gain some understanding of what's going on (in computing), (in engineering), (in business), (in politics). This understanding is portable and durable.
You can do THAT - gain some understanding - and that is portable. I don't contest that if broader AGI is achieved for cheap soon, the changes won't be larger than that from globalisation. If the AGIs prioritize heading to Mars, let them (See Accelerando) - they are not relevant to you anymore. Or trade between them and the humans. Use your beginning of an understanding of the world (gained through this education) to find something else to do. Same as if you started work 2 years ago and want to switch jobs. Some jobs WILL have disappeared (pool typist). Others will use the AGIs as tools because the AGIs don't care or are too clueless about THAT field. I have no idea which fields will end up with clueless AGIs. There is no lack of cluelessness in the world. Plenty to go around even with AGIs. A self-respecting AGI will have priorities.
It doesn't matter if you are bad at using the tool if the AGI can just effectively use it for you.
From there it's a simple leap to the AGI deciding to eliminate this human distraction (inefficient, etc.)
Yet GPT doesn’t even get past step 1 of doing something unprompted in the first place. I’ll become worried when it does something as simple as deciding to start a small business and actually does the work.
Look at the Hacker News comments on alignment faking and how "fake" of a problem that really is. It's just more reacting to inputs and trying to align them with previous prompts.
also https://mashable.com/article/chatgpt-messaging-users-first-o...
Of course it's also yet another case where the AI takes over the creative part and leaves us with the mundane part...
Yes a new tool is coming out and will be exponentially improving.
Yes the nature of work will be different in 20 years.
But don’t you still need to understand the underlying concepts to make valid connections between the systems you’re using and drive the field (or your company) forward?
Or from another view, don’t we (humanity) need people who are willing to do this? Shouldn’t there be a valid way for them to be successful in that pursuit?
Except the nature of work has ALREADY changed. You don't study for one specific job if you know what's good for you. You study to start building an understanding of a technical field. The grand parent was going for a mix of mechanical engineering and sales (human understanding). If in mechanical engineering, they avoided "learning how to use SolidWorks" and instead went for the general principles of materials and motion systems with a bit of SolidWorks along the way, then they are well on their way with portable, foundation, long term useful stuff they can carry from job to job, and from employer to employer, into self-employment too, from career to next career. The nature of work has already changed in that nobody should study one specific tool anymore and nobody should expect their first employer or even technical field to last more than 2-6 years. It might but probably not.
We do need people who understand how the world works. Tall order. That's for much later and senior in a career. For school purposes we are happy with people who are starting their understanding of how their field works.
Aren't we agreeing?
Most of the blacksmiths in the 19th century drank themselves to death after the industrial revolution. The US culture isn't one of care... Point is, it's reasonable to be sad and afraid of change, and to think carefully about what to specialize in.
That said... we're at the point of diminishing returns in LLM, so I doubt any very technical jobs are being lost soon. [1]
[1] https://techcrunch.com/2024/11/20/ai-scaling-laws-are-showin...
This is hyperbolic and a dramatic oversimplification and does not accurately describe the reality of the transition from blacksmithing to more advanced roles like machining, toolmaking, and working in factories. The 19th century was a time of interchangeable parts (think the North's advantage in the Civil War) and that requires a ton of mechanical expertise and precision.
Many blacksmiths not only made the transition to machining, but there also weren't enough blacksmiths to fill the bevy of new jobs that were available. Education expanded to fill those roles. Traditional blacksmithing didn't vanish either; even specialized roles like farriery and ornamental ironwork expanded.
What evidence are you basing this statement from? Because, the article you are currently in the comment section of certainly doesn't seem to support this view.
On the plus side, LLMs don't bring us closer to that dystopia: if unlimited knowledge(tm) ever becomes just One Prompt Away it won't come from OpenAI.
Lots of people die for reason X then the world moves on without them.
This would mean the final victory of capital over labor. The 0.01% of people who own the machines that put everyone out of work will no longer have use for the rest of humanity, who will most likely be liquidated.
> [deleted]: I've wondered about this for a while-- how can such an employment-centric society transition to that utopia where robots do all the work and people can just sit back?
> appleseed1234: It won't, rich people will own the robots and everyone else will eat shit and die.
https://www.reddit.com/r/TrueReddit/comments/k7rq8/are_jobs_...
There will be a dedicated caste of people to take care of the machines that do 90% of the work, and of "the rich".
Nobody else is needed. District 9, but for people. Imagine the whole world collapsing like Venezuela.
You are no longer needed. The best option is to learn how to survive and grow your own food, but they want to make that illegal too - look at the EU...
We’re a long way from that, if we ever get there, and I say this as someone who pays for ChatGPT plus because, in some scenarios, it does indeed make me more productive, but I don’t see your future anywhere near.
And if machines ever get good enough to do all the things I mentioned plus the ones I didn’t but would fit in the same list, it’s not the ultra rich that wouldn’t need us, it’s the machines that wouldn’t need any of us, including the ultra rich.
Venezuela is not collapsing because of automation.
You can even calculate the average number of people that can be operated on before harm occurs: number needed to harm (NNH). If NNH(AI) > NNH(humans), it becomes impossible to recommend that patients submit to surgery at the hands of human surgeons. It is that simple.
If we discover that AI surgeons harm one in every 1000 patients while human surgeons harm one in every 100 patients, human surgeons are done.
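Treating NNH loosely as 1/harm-rate for each kind of surgeon (the textbook definition uses the absolute risk difference against a comparator), the hypothetical figures above work out as:

    # Hypothetical rates from the comment above, not real data.
    nnh_ai = 1 / (1 / 1000)     # AI harms 1 in 1000 patients  -> NNH = 1000
    nnh_human = 1 / (1 / 100)   # humans harm 1 in 100 patients -> NNH = 100

    print(nnh_ai > nnh_human)   # True: more patients treated per harm, so AI is the safer pick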
And the opposite holds, if the AI surgeon is worse (great for 80%, but sucks at the edge cases for example) then that's it. Build a better one, go through attempts at certification, but now with the burden that no one trusts you.
The assumption, and a common one by the look of this whole thread, that ChatGPT, Sora and the rest represent the beginning of an inevitable march towards AGI seems incredibly baseless to me. It's only really possible to make the claim at all because we know so little about what AGI is that we can project qualities we imagine it would have onto whatever we have now.
It's not going to hold forever though. I'm certain about that. Hopefully it will keep holding until I die. The world is dystopian enough already.
AGI can replace capitalists just as much as laborers.
I guess this could be a facet of whether you see economic advantage as a legal conceit or a difference in productivity/capability.
Does a billionaire stop being wealthy if they hire a money manager and spend the rest of their lives sipping drinks on the beach?
"Legally" will have to mop up now and then, but for now the basics are already in place.
I was responding to this. Yes, an AGI could hire someone to do the stuff - but she needs money, and the ability to hire and sign contracts, for that. And once she can do that, she probably doesn't need to hire someone to do it, since she is already doing it. This is not about capital versus labor or money management. This is about agency, ownership and AGI.
(With legality far far down the list.)
People - and by people I mean government - have tremendous power over capitalists and can force the entire market, granted that the government is still serving its people.
This is my view but with a less positive spin: you are not going to be the only person whose livelihood will be destroyed. It's going to be bad for a lot of people.
So at least you'll have a lot of company.
Even if our civilization transforms into an AI robotic utopia, it’s not going to do so overnight. We’re the ones who get to build the infrastructure that underpins it all.
If AI turns out dependent on human input and feedback, then we will still have jobs. Or maybe - AI automates many jobs, but at the same time expands the operational domain to create new ones. Whenever we have new capabilities we compete on new markets, and a hybrid human+AI might be more competitive than AI alone.
But we've got to temper these singularitarian expectations with reality - it takes years to scale up chip and energy production to achieve significant workforce displacement. It takes even longer to gain social, legal and political traction; people will be slow to adopt in many domains. Some people still avoid using cards for payment, and some still use fax to send documents; we can be pretty stubborn.
How will these people pay for the compute costs if they can't find employment?
I hear you, I’m not that much older but I graduated in 2011. I also studied industrial design. At that time the big wave was the transition to an app based everything and UX design suddenly became the most in demand design skill. Most of my friends switched gears and careers to digital design for the money. I stuck to what I was interested in though which was sustainability and design and ultimately I’m very happy with where I ended up (circular economy) but it was an awkward ~10 years as I explored learning all kinds of tools and ways applying my skills. It also was very tough to find the right full time job because product design (which has come to really mean digital product design) supplanted industrial design roles and made it hard to find something of value that resonated with me.
One of the things that guided me and still does is thinking about what types of problems need to be solved? From my perspective everything should ladder up to that if you want to have an impact. Even if you don’t keep learning and exploring until you find something that lights you up on the inside. We are not only one thing we can all wear many hats.
Saying that, we’re living through a paradigm shift of tremendous magnitude that’s altering our whole world. There will always be change though. My two cents is to focus on what draws your attention and energy and give yourself permission to say no to everything else.
AI is an incredible tool, learn how to use it and try to grow with the times. Good luck and stay creative :) Hope something in there helps, but having a positive mindset is critical. If you’re curious about the circular economy happy to share what I know - I think it’s the future.
Unlike most other benchmarks where LLMs have shown large advances (in law, medicine, etc.), this benchmark isn't directly related to any practically useful task. Rather, the benchmark is notable because it's particularly easy for untrained humans, but particularly hard for LLMs; though that difficulty is perhaps not surprising, since LLMs are trained on mostly text and this is geometric. An ensemble of non-LLM solutions already outperformed the average Mechanical Turk worker. This is a big improvement in the best LLM solution; but this might also be the first time an LLM has been tuned specifically for these tasks, so this might be Goodhart's Law.
It's a significant result, but I don't get the mania. It feels like Altman has expertly transformed general societal anxiety into specific anxiety that one's job will be replaced by an LLM. That transforms into a feeling that LLMs are powerful, which he then transforms into money. That was strongest back in 2023, but had weakened since then; but in this comment section it's back in full force.
For clarity, I don't question that many jobs will be replaced by LLMs. I just don't see a qualitative difference from all the jobs already replaced by computers, steam engines, horse-drawn plows, etc. A medieval peasant brought to the present would probably be just as despondent when he learned that almost all the farming jobs are gone; but we don't miss them.
I'm aware that LLMs can solve problems other than coloring grids, and I'd tend to agree those are likely to be more near-term useful. Those applications (coding, medicine, law, education, etc.) have been endlessly discussed, and I don't think I have much to add.
In my own work I've found some benefits, but nothing commensurate to the public mania. I understand that founders of AI-themed startups (a group that I see includes you) tend to feel much greater optimism. I've never seen any business founded without that optimism and I hope you succeed, not least because the entire global economy might now be depending on that. I do think others might feel differently for reasons other than simple ignorance, though.
In general, performance on benchmarks similar to tests administered to humans may be surprisingly unpredictive of performance on economically useful work. It's not intuitive at all to me that IBM could solve Jeopardy and then find no profitable applications of the technology; but that seems to be what happened.
It very nearly is. I knew a professional, career photographer. He was probably in his late 50s. Just a few years ago, it had become extremely difficult to convince clients that actual, professional photos were warranted. With high-quality iPhone cameras, businesses simply didn't see the value of professional composition, post-processing, etc.
These days, anyone can buy a DSLR with a decent lens, post on Facebook, and be a 'professional' photographer. This has driven prices down and actual professional photographers can't make a living anymore.
And then when I peruse these photographers' websites, I'm reminded how good 'professional' actually is, and I value them. Even in today's incredible cameraphone and AI era.
But I take your point for almost all industries, things are changing fast.
So we'll find out if this model is real or not by 2-3 months. My guess is that it'll turn out to be another flop like O1. They needed to release something big because they are momentum based and their ability to raise funding is contingent on their AGI claims.
We may have progressed from a 99%-accurate chatbot to one that's 99.9%-accurate, and you'd have a hard time telling them apart in normal real world (dumb) applications. A paradigm shift is needed from the current chatbot interface to a long-lived stream of consciousness model (e.g. a brain that constantly reads input and produces thoughts at 10ms refresh rate; remembers events for years and keep the context window from exploding; paired with a cerebellum to drive robot motors, at even higher refresh rates.)
As long as we're stuck at chatbots, LLM's impact on the real world will be very limited, regardless of how intelligent they become.
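A cartoon of that "long-lived stream of consciousness" loop might look like the following, where the brain step, the sensors and the motor interface are all stand-in stubs of mine, and the 10 ms tick and 256-item working context are arbitrary:

    import time

    TICK_SECONDS = 0.01                 # the ~10 ms refresh rate mentioned above

    def read_sensors():                 # stub: camera/audio/text input in a real agent
        return "observation"

    def drive_motors(action):           # stub: the "cerebellum" driving robot motors
        pass

    def brain_step(context, memory):    # stub: one forward pass of the model
        return {"thought": "...", "action": None}

    memory, context = [], []
    for _ in range(100):                        # forever in principle, bounded here
        context.append(read_sensors())
        step = brain_step(context, memory)
        memory.append(step["thought"])          # remembered across "sessions"
        if step["action"] is not None:
            drive_motors(step["action"])
        context = context[-256:]                # keep the working context from exploding
        time.sleep(TICK_SECONDS)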
Now they just have to make it cheap.
Tell me, what has this industry been good at since its birth? Driving down the cost of compute and making things more efficient.
Are you seriously going to assume that won’t happen here?
Like they've been making it all this time? Cheaper and cheaper? Less data, less compute, fewer parameters, but the same, or improved performance? Not what we can observe.
>> Tell me, what has this industry been good at since its birth? Driving down the cost of compute and making things more efficient.
No, actually the cheaper compute gets the more of it they need to use or their progress stalls.
Yes exactly like they’ve been doing this whole time, with the cost of running each model massively dropping sometimes even rapidly after release.
Yes, it costs a lot to train a model. Those costs go up. But once you trained it, it’s done. At that point inference — the actual execution/usage of the model — is the cost you worry about.
Inference cost drops rapidly after a model is released as new optimizations and more efficient compute comes online.
Inference always starts expensive. It comes down.
This is a thing, you should know. It's called the Jevons paradox:
In economics, the Jevons paradox (/ˈdʒɛvənz/; sometimes Jevons effect) occurs when technological progress increases the efficiency with which a resource is used (reducing the amount necessary for any one use), but the falling cost of use induces increases in demand enough that resource use is increased, rather than reduced.[1][2][3][4]
https://en.wikipedia.org/wiki/Jevons_paradox
Better check those pills then.
Oh but, you know, merry chrimbo to you too.
Oh yes indeed-ee-o and I'm referring to training and not inference because the big problem is the cost of training, not inference. The cost of training has increased steeply with every new generation of models because it has to, in order to improve performance. That process has already reached the point where training ever larger models is prohibitively expensive even for companies with the resources of OpenAI. For example, the following is from an article that was posted on HN a couple days ago and is basically all about the overwhelming cost to train GPT-5:
In mid-2023, OpenAI started a training run that doubled as a test for a proposed new design for Orion. But the process was sluggish, signaling that a larger training run would likely take an incredibly long time, which would in turn make it outrageously expensive. And the results of the project, dubbed Arrakis, indicated that creating GPT-5 wouldn’t go as smoothly as hoped.
(...)
Altman has said training GPT-4 cost more than $100 million. Future AI models are expected to push past $1 billion. A failed training run is like a space rocket exploding in the sky shortly after launch.
(...)
By May, OpenAI’s researchers decided they were ready to attempt another large-scale training run for Orion, which they expected to last through November.
Once the training began, researchers discovered a problem in the data: It wasn’t as diversified as they had thought, potentially limiting how much Orion would learn.
The problem hadn’t been visible in smaller-scale efforts and only became apparent after the large training run had already started. OpenAI had spent too much time and money to start over.
HN discussion:
https://news.ycombinator.com/item?id=42485938
"Once you trained it it's done" - no. First, because you need to train new models continuously so that they pick up new information (e.g. the name of the President of the US). Second because companies are trying to compete with each other and to do that they have to train bigger models all the time.
Bigger models means more parameters and more data (assuming there is enough which is a whole other can of worms) more parameters and data means more compute and more compute means more millions, or even billions. Nothing in all this is suggesting that costs are coming down in any way, shape or form, and yep, that's absolutely about training and not inference. You can't do inference before you do training, you need to train continuously, and for that reason you can't ignore the cost of training and consider only the cost of inference. Inference is not the problem.
No they haven't, these results do not generalize, as mentioned in the article:
"Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute"
Meaning, they haven't solved AGI, and the tasks themselves do not represent programming well; these models do not perform that well on engineering benchmarks.
But what they’ve done is show that progress isn’t slowing down. In fact, it looks like things are accelerating.
So sure, we’ll be splitting hairs for a while about when we reach AGI. But the point is that just yesterday people were still talking about a plateau.
You can also use the full o3 model, consume insane power, and get insane results. Sure, it will probably take longer to drive down those costs.
You’re welcome to bet against them succeeding at that. I won’t be.
They’ve been doing it literally this entire time. O3-mini according to the charts they’ve released is less expensive than o1 but performs better.
Costs have been falling to run these models precipitously.
This type of compute will be cheaper than Claude 3.5 within 2 years.
It's kinda nuts. Give these models tools to navigate and build on the internet and they'll be building companies and selling services.
Significantly better at what? A benchmark? That isn't necessarily progress. Many report preferring gpt-4 to the newer o1 models with hidden text. Hidden text makes the model more reliable, but more reliable is bad if it is reliably wrong at something since then you can't ask it over and over to find what you want.
I don't feel it is significantly smarter; it is more like having the same dumb person spend more time thinking, rather than the model getting smarter.
Also, all that stuff is shady in that it is just numbers from OAI which are not reproducible, on a benchmark sponsored by OAI. If we say OAI could be a bad actor, they had plenty of opportunities to cheat on this.
(See why objective benchmarks exist?)
Or let's talk about the breakthroughs. SVMs would lead us to AGI. Then LSTMs would lead us to AGI. Then Convnets would lead us to AGI. Then DeepRL would lead us to AGI. Now Transformers will lead us to AGI.
Benchmarks fall right and left and we keep being led to AGI but we never get there. It leaves one with such a feeling of angst. Are we ever gonna get to AGI? When's Godot coming?
99% of engineering is distilling through bullshit and nonsense requirements. Whether that is appealing to you is a different story, but ChatGPT will happily design things with dumb constraints that would get you fired if you took them at face value as an engineer.
ChatGPT answering technical challenges is to engineering as a nailgun is to carpentry.
1) Just give up computing entirely, the field I've been dreaming about since childhood. Perhaps if I immiserate myself with a dry regulated engineering field or trade I would perhaps survive to recursive self-improvement, but if anything the length it takes to pivot (I am a Junior in College that has already done probably 3/4th of my CS credits) means I probably couldn't get any foothold until all jobs are irrelevant and I've wasted more money.
2) Hard pivot into automation, AI my entire workflow, figure out how to use the bleeding edge of LLMs. Somehow. Even though I have no drive to learn LLMs and no practical project ideas with LLMs. And then I'd have to deal with the moral burden that I'm inflicting unfathomable hurt on others until recursive self-improvement, and after that it's simply a wildcard on what will happen with the monster I create.
It's like I'm suffocating constantly. The most I can do to "cope" is hold on to my (admittedly weak) faith in Christ, which provides me peace knowing that there is some eternal joy beyond the chaos here. I'm still just as lost as you.
The scenario I fear is a "selectively general" model that can successfully destroy the field I'm in but keep others alive for much longer, but not long enough for me to pivot into them before actually general intelligence.
If you want to work in computing, then make it happen! Use the tools available and make great stuff. Your computing experience will be different from when I graduated from college 25 years ago, but my experience with computers was far different from my Dad's. Things change. Automation changes jobs. So far, it's been pretty good.
It's powerful and world changing but it's also terrible overhyped at the moment.
It's a massive bubble, and things like these "benchmarks" are all part of the hype game. Is the tech cool and useful? For sure, but anyone trying to tell you this benchmark is in any way proof of AGI and will replace everyone is either an idiot or more likely has a vested interest in you believing them. OpenAI's whole marketing shtick is to scare people into thinking their next model is "too dangerous" to be released thus driving up hype, only to release it anyway and for it to fall flat on its face.
Also, if there's any jobs LLMs can replace right now, it's the useless managerial and C-suite, not the people doing the actual work. If these people weren't charlatans they'd be the first ones to go while pushing this on everyone else.
I told him it was at least 5 years, probably 10, though he was sure it would be 2.
I was arguably “right”, 2023-ish is probably going to be the date people put down in the books, but the future isn’t evenly distributed. It’s at least another 5 years, and maybe never, before things are distributed among major metros, especially those with ice. Even then, the AI is somehow more expensive than human solution.
I don’t think it’s in most companies interest to price AI way below the price of meat, so meat will hold out for a long time, maybe long enough for you to retire even
There’s an incredibly massive amount of stuff the world needs. You probably live in a rich country, but I doubt you are lacking for want. There are billionaires who want things that don’t exist yet. And, of course, there are billions of regular folks who want some of the basics.
So long as you can imagine a better world, there will be work for you to do. New tools like AGI will just make it more accessible for you to build your better future.
This has essentially been happening for thousands of years. Any optimization to work of any kind reduces the number of man hours required.
Software of pretty much any form is entirely that. Even early spreadsheet programs would replace a number of jobs at any company.
That is: If you don't believe there will be a future, you give up on trying to make one. That means that any kind of future that takes persistent work becomes unavailable to you.
If you do believe that there will be a future, you keep working. That doesn't guarantee there will be a future. But not working pretty much guarantees that there won't be one, at least not one worth having.
If AI lives up to hype, you could be the excavator driver. Or, the AI will create a ton of upstream and downstream work. There will be no mass unemployment.
Are there no limits to this argument? Is it some absolute universal law that all new creations just create increasing economic opportunities?
Investment in human talent augmented by AI is the future.
Having used AI extensively I don't feel my future is at risk at all, my work is enhanced not replaced.
The computing cost, on the other hand, is a continuous improvement. If (and it's a big if) a computer can do your job, we know the costs will keep getting lower year after year (maybe with diminishing returns, but this AI technology is pretty new so we're still seeing increasing returns)
Everyone needs to know how to either build or sell to be successful. In a world where the ability to the former is rapidly being commoditised, you will still need to sell. And human relationships matter more than ever.
You're in a position to invest substantial amounts of time compared to your seniors. Leverage that opportunity to your advantage.
We all have access to these tools for the most part, so the distinguishing factor is how much time you invest and how much more ambitious you become once you begin to master the tool.
This time it's no different. Many mechanical engineering and sales students in the past never got jobs in those fields either, decades before AI. There were other circumstances and forces at play, and a degree is not a guaranteed career in anything.
Keep going, because what we DO know is that trying won't guarantee results, and we DO know that giving up definitely won't. Roll the dice in your favor.
I want to criticize Art’s comment on the grounds of ageism or something along the lines of “any amount of life outside of programming is wasted”, but regardless of Art’s intention there is important wisdom here. Use your free time wisely when you don’t have many responsibilities. It is a superpower.
As for whether to spend it on AI, eh, that’s up to you to decide.
I'm a greybeard myself.
It'll be some time before there is a robot with enough spatial reasoning to do complicated physical work with no prior examples.
These benchmark accomplishments are awesome and impressive, but you shouldn't operate on the assumption that this will emerge as an engineer because it performs well on benchmarks.
Engineering is a discipline that requires understanding tools, solutions and every project requires tiny innovations. This will make you more valuable, rather than less. Especially if you develop a deep understanding of the discipline and don't overly rely on LLMs to answer your own benchmark questions from your degree.
But the arc of time intersects quite nicely with your skills if you steer it over time.
Predicting it or worrying about it does nothing.
Especially with AI provably getting extremely smart now, surely engineering disciplines should see a boom as people want these things in their homes, cheaper, for various applications.
Either this is the dawn of something bigger than the industrial revolution or you'll have ample career opportunity. Understanding how things work and how people work is a powerful combination.
when the last job has been automated away, millions of AIs globally will do commerce with each other and they will use bitcoin to pay each other.
as long as the human race (including AIs) produces new goods and services, the purchasing power of bitcoin will go up, indefinitely. even more so once we unlock new industries in space (settlements on the Moon and Mars, asteroid mining etc).
The only thing that can make a dent into bitcoin's purchasing power would be all out global war where humanity destroys more than it creates.
The only other alternative is UBI, which is Communism and eternal slavery for the entire human race except the 0.0001% who run the show.
Choose wisely.
Isn’t that the premise behind the CAPTCHA?
That would be intelligent. Everything else is just stupid and more of the same shit.
Of humans. Humans are a problem for the satisfaction of humans. Yet removing humans from this equation does not result in higher human satisfaction; it lessens it.
I find this thought process of "humans are the problem" to be unreasonable. Humans aren't the problem; humans are the requirement.
> Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.
> Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training).
I spend 100% of my work time working on a GenAI project, which is genuinely useful for many users, in a company that everyone has heard about, yet I recognize that LLMs are simply dogshit.
Even the current top models are barely usable, hallucinate constantly, are never reliable and are barely good enough to prototype with while we plan to replace those agents with deterministic solutions.
This will just be an iteration on dogshit, but it's the very tech behind LLMs that's rotten.
TLDR: The cacophony of fools is so loud now. Thank goodness it won't last.
Make it possible -> make it fast -> make it cheap:
the eternal cycle of software.
Make no mistake - we are on the verge of the next era of change.
But his "below them, above them, around them" quote on OpenAI may haunt him in 2025/2026.
OAI or someone else will approach AGI-like capabilities (however nebulous the term), fostering the conditions to contest Microsoft's straitjacket.
Of course, OAI is hemorrhaging cash and may fail to create a sustainable business without GPU credits, but the possibility of OAI escaping Microsoft's grasp grows by the day.
Coupled with research and hardware trends, OAI's product strategy suggests the probability of a sustainable business within 1-3 years is far from certain but also higher than commonly believed.
If OAI becomes a $200b+ independent company, it would be against incredible odds given the intense competition and the Microsoft deal. PG's cannibal quote about Altman feels so apt.
It will be fascinating to see how this unfolds.
Congrats to OAI on yet another fantastic release.
I presume evaluation on the test set is gated (you have to ask ARC to run it).
That's how I understand it.
https://codeforces.com/blog/entry/133094
That means... this benchmark is just saying o3 can write code faster than most humans (in a very time-limited contest, like 2 hours for 6 tasks). Beauty, readability or creativity is not rated. It’s essentially a "how fast can you make the unit tests pass" kind of competition.
Edit: it also tests new knowledge; it has concepts such as trusting a source, verifying it, etc. If I can just gaslight it into unlearning Python, then it's still too dumb.
c6e1b8da is moving rectangular figures by a given vector, 0d87d2a6 is drawing horizontal and/or vertical lines (connecting dots at the edges) and filling figures they touch, b457fec5 is filling gray figures with a given repeating color pattern.
This is pretty straightforward stuff that doesn't require much spatial thinking or keeping multiple things/aspects in memory - visual puzzles from various "IQ" tests are way harder.
This said, now I'm curious how SoTA LLMs would do on something like WAIS-IV.
What took me longer was figuring out how the question was arranged, i.e. left input, right output, 3 examples each
I = E / K
where I is the intelligence of the system, E is the effectiveness of the system, and K is the prior knowledge.
For example, a math problem is given to two students, each solving the problem with the same effectiveness (both get the correct answer in the same amount of time). However, student A happens to have more prior knowledge of math than student B. In this case, the intelligence of B is greater than the intelligence of A, even though they have the same effectiveness. B was able to "figure out" the math, without using any of the "tricks" that A already knew.
Now back to the question of whether or not prior knowledge is required. As K approaches 0, intelligence approaches infinity. But when K=0, intelligence is undefined. Tada! I think that answers the question.
Most LLM benchmarks simply measure effectiveness, not intelligence. I conceptualize LLMs as a person with a photographic memory and a low IQ of 85, who was given 100 billion years to learn everything humans have ever created.
IK = E
low intelligence * vast knowledge = reasonable effectiveness
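With toy numbers of my own (the scale of E and K is arbitrary; only the ratio matters in this framing):

    # Toy illustration of the I = E / K framing above; the numbers are made up.
    def intelligence(effectiveness: float, prior_knowledge: float) -> float:
        if prior_knowledge == 0:
            raise ValueError("undefined at K = 0")   # as noted above
        return effectiveness / prior_knowledge

    student_a = intelligence(effectiveness=1.0, prior_knowledge=4.0)  # knew more tricks
    student_b = intelligence(effectiveness=1.0, prior_knowledge=1.0)  # figured it out

    print(student_a, student_b)   # 0.25 vs 1.0 -> B counts as more intelligent here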
What feedback, and what program, are you referring to?
This o3 doesn't need to run Python. It itself executes programs written in tokens inside its own context window, which is wildly inefficient but gives better results and is potentially more general.
This is hilarious.
This o3 thing might be a bit different because it's just chain of thought llm that can do many other things as well.
It's not uncommon for people to have a handful of wrong ideas before they stumble upon a correct solution either.
But if we take this into consideration, would it mean that a 1st-world engineer is by definition less intelligent than a 3rd-world one?
I think the (completely reasonable) knee-jerk reaction is a defensive one, but I can imagine an escapee from an authoritarian regime working side by side with an engineer groomed in expensive, air-conditioned lecture rooms. In this imaginary scenario the escapee, even if slower and less efficient at the problem at hand, would have to be more intelligent generally.
Yes, resource consumption is important. But a car that guzzles a lot of gas doesn't drive slower; it just covers less distance per unit of petrol consumed.
It's good to know whether your system has a high or low 'bang for buck' metric, but that doesn't directly affect how much bang you get.
All else equal, a student who gets 100% on a problem set in 10 minutes is more intelligent than one with the same score after 120 minutes. Likewise an LLM that can respond in 2 seconds is more impressive than one which responds in 30 seconds.
According to my mathematical model, the faster student would have higher effectiveness, not necessarily higher intelligence. Resource consumption and speed are practical technological concerns, but they're irrelevant in a theoretical conceptualization of intelligence.
So a human with a better response time, also tends to give you more intelligent answers, even when time is not a factor.
For a computer, you can arbitrarily slow them down (or speed them up), and still get the same answer.
Imagine you take an extraordinarily smart person, and put them on a fast spaceship that causes time dilation.
Does that mean that they are stupider while in transit, and they regain their intelligence when it slows down?
If we are required to break the seal on the black-box and investigate the exactly how the agent is operating in order to judge its "intelligence"... Doesn't that kinda ruin the up-thread stuff about judging with equations?
They will still complete the task in 70 local minutes, even if that's eighty minutes to an outside observer.
$$ I = \frac{\partial E}{\partial K} \simeq \frac{\delta E}{\delta K} $$
In order to estimate $I$ you have to consider that efficiency and knowledge are task related, so you could take some weighted mean $\sum_T C(E,K,T) \cdot I(E,K,T)$ where $T$ is the task category. I am thinking of $C(E,K,T)$ as something similar to thermal capacity or electrical resistance, the equivalent concept when applied to a task. An intelligent agent in a medium of low resistance should fly while a dumb one would still crawl.
Why?
> derivative of efficiency
Where did your efficiency variable come from?
Find the best questions to ask. Find the best hypothesis to suggest.
In comparison, you can easily know everything there is to know about physics or chemistry, and that's sufficient to solve interesting puzzles. In math every puzzle has its own vast lore you need to know before you can have any chance at tackling it.
What I would like to have in the future is Stack Overflow answerers accessible in real time via IRC. They have real answers NOW. They are even pedantic about their stuff!
I of course mean we're using these LLMs for a lot of tasks that they're inappropriate for, and a clever manually coded algorithm could do better and much more efficiently.
Sure, but how long would it take to implement this algorithm, and would that be worth it for one-off cases?
Just today I asked Claude to create a jq query that looks for objects with a certain value for one field, but which lack a certain other field. I could have spent a long time trying to make sense of jq's man page, but instead I spent 30 seconds writing a short description of what I'm looking for in natural language, and the AI returned the correct jq invocation within seconds.
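For anyone curious what such an invocation looks like, a filter along these lines does the job; the field names ("type", "email") are hypothetical, since the original data wasn't shared, and the snippet assumes jq is on your PATH:

    import json
    import subprocess

    # Keep objects whose "type" is "user" but which lack an "email" key (hypothetical fields).
    jq_filter = '[.[] | select(.type == "user" and (has("email") | not))]'

    data = [
        {"type": "user", "name": "a", "email": "a@example.com"},
        {"type": "user", "name": "b"},        # matches: right value, missing field
        {"type": "group", "name": "c"},
    ]

    result = subprocess.run(
        ["jq", jq_filter],
        input=json.dumps(data),
        capture_output=True,
        text=True,
        check=True,
    )
    print(result.stdout)   # only {"type": "user", "name": "b"} survives the filter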
How do you quantify that?
-- Poem by Cybernetic Bai Juyi, "The Philosopher [of Caching]"
Writing a Python script, because it can't do math or any form of more complex reasoning, is not what I would call "its own algorithm". It's at most an application of existing ones/calling APIs.
The superset of the LLM knowledge pool is human knowledge. They can't go beyond the boundaries of their training set.
I'll not go into how humans have other processes which can alter their and collective human knowledge, but the rabbit hole starts with "emotions, opposable thumbs, language, communication and other senses".
TFA says they just did. That's what the ARC-AGI benchmark was supposed to test.
How so? I'd imagine a robot connected to the data center embodying its mind, connected via low-latency links, would have to walk pretty far to get into trouble when it comes to interacting with the environment.
The speed of light is roughly six orders of magnitude faster than the speed of signal propagation in biological neurons, after all.
Recent research from NVIDIA suggests such an efficiency gain is quite possible in the physical realm as well. They trained a tiny model to control the full body of a robot via simulations.
---
"We trained a 1.5M-parameter neural network to control the body of a humanoid robot. It takes a lot of subconscious processing for us humans to walk, maintain balance, and maneuver our arms and legs into desired positions. We capture this “subconsciousness” in HOVER, a single model that learns how to coordinate the motors of a humanoid robot to support locomotion and manipulation."
...
"HOVER supports any humanoid that can be simulated in Isaac. Bring your own robot, and watch it come to life!"
More here: https://x.com/DrJimFan/status/1851643431803830551
---
This demonstrates that with proper training, small models can perform at a high level in both cognitive and physical domains.
Hmm .. my intuition is that humans' capabilities are gained during early childhood (walking, running, speaking .. etc) ... what are examples of capabilities pretrained by evolution, and how does this work?
A more high-level example: seasickness is an evolutionarily pre-learned thing; your body thinks it's poisoned and automatically wants to empty your stomach.
Maybe evolution could be better thought of as neural architecture search combined with some pretraining. Evidence suggests we are prebuilt with "core knowledge" by the time we're born [1].
See: Summary of cool research gained from clever & benign experiments with babies here:
[1] Core knowledge. Elizabeth S. Spelke and Katherine D. Kinzler. https://www.harvardlds.org/wp-content/uploads/2017/01/Spelke...
Learning to walk doesn't seem to be particularly easy, having observed the process with my own children. No easier than riding a bike or skating, for which our brains are probably not 'predisposed'.
Young children learn to bike or skate at an older age after they have acquired basic physical skills.
Check out the reference to Core Knowledge above. There are things young infants know or are predisposed to know from birth.
That seems like a decent example of pretraining through evolution.
And reading and music co-evolved to be relatively easy for humans to do.
(See how computers have a much easier time reading barcodes and QR codes, with much less general processing power than it takes them to decipher human hand-writing. But good luck trying to teach humans to read QR codes fluently.)
What makes you think so? Humans came up with biking and skating, because they were easy enough for us to master with the hardware we had.
Chimpanzees score pretty high on many tests of intelligence, especially short term working memory. But they can't really learn language: they lack the specialised hardware more than the general intelligence.
But there are plenty of non-learned control/movement/sensing in utero that are "pretrained".
They are more nature than nurture, but they aren't 'in-born'.
Just like human aren't (usually) born with teeth, but they don't 'learn' to have teeth or pubic hair, either.
This is a great milestone, but OpenAI will not be successful charging 10x the cost of a human to perform a task.
Obviously the drop in cost for capability in the last 2 years is big, but I'd wager it's closer to 10x than 100x.
True, but they might be successful charging 20x for 2x the skill of a human.
If you can just unleash AI on any of your problems, without having to commit to anything long term, it might still be useful, even if they charged more than for equivalent human labour.
(Though I suspect AI labour will generally trend to be cheaper than humans over time for anything AIs can do at all.)
If it can be spun up with Terraform, I bet you they could.
Right now when I ask an LLM… I have to sit there and verify everything. It may have done some helpful reasoning for me but the whole point of me asking someone else (or something else) was to do nothing at all…
I’m not sure you can reliably fulfill the first scenario without achieving AGI. Maybe you can, but we are not at that point yet so we don’t know yet.
The difference, to me, is that humans seem to be good at canceling each other's mistakes when put in a proper environment.
It’s very scary to ask a friend to drop off a letter if the last scenario is even 1% within the realm of possibility.
Even the Dunning-Kruger effect is, ironically, widely misunderstood by people who are unreasonably confident about their knowledge.
The latter in particular is how I model the mistakes LLMs make, what with them having read most things.
Effectively, they found nothing real but a statistical artifact.
Finding reliable honest humans is a problem governments have struggled with for over a hundred years. If you have cracked this problem at scale you really need to write it up! There are a lot of people who would be extremely interested in a solution here.
Yes, though you are downplaying the problem a lot. It's not just governments, and it's way longer than 100 years.
Btw, a solution that might work for you or me, presumably relatively obscure people, might not work for anyone famous, nor a company nor a government.
But if it's not enough, then maybe it comes as a second-order effect (e.g. reasoning machines having to bootstrap an AGI, so that you can then have a Waymo taxi driver who is also a Fields medalist)
Broadly speaking you can think that the mental reduces to the physical (physicalism), that the physical reduces to the mental (idealism), both reduce to some other third thing (neutral monism) or that neither reduces to the other (dualism). There are many arguments for dualism but I’ve never heard a philosopher appeal to “magic spirits” in order to do so.
Here’s an overview: https://plato.stanford.edu/entries/dualism/
(In fact, the very idea of "computable functions" was invented to narrow down the space of "all things" to something much smaller, tighter and manageable. And now we've come full circle and apparently everything in the universe is a computable function? Well, if all you have is a hammer, I guess everything must necessarily look like a nail.)
So yeah, the o3 result is impressive, but if the difference between o3 and the previous state of the art is just more compute spent on a much longer CoT/evaluation loop, I am not so impressed. Reminder that these problems are solved by humans in seconds; ARC-AGI is supposed to be easy.
It is also entirely possible to learn a skill without prior experience. That's how it (whatever the skill) was first done.
This is the way I think about it.
I = E / K
where I is the intelligence of the system, E is the effectiveness of the system, and K is the prior knowledge.
For example, a math problem is given to two students, each solving the problem with the same effectiveness (both get the correct answer in the same amount of time). However, student A happens to have more prior knowledge of math than student B. In this case, the intelligence of B is greater than the intelligence of A, even though they have the same effectiveness. B was able to "figure out" the math, without using any of the "tricks" that A already knew.
Now back to your question of whether or not prior knowledge is required. As K approaches 0, intelligence approaches infinity. But when K=0, intelligence is undefined. Tada! I think that answers your question.
Most LLM benchmarks simply measure effectiveness, not intelligence. I conceptualize LLMs as a person with a photographic memory and a low IQ of 85, who was given 100 billion years to learn everything humans have ever created.
I * K = E
low intelligence * vast knowledge = reasonable effectiveness
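As a toy sketch of this framing (all numbers made up, purely illustrative):

    # Toy illustration of I = E / K; every number here is arbitrary.
    def intelligence(effectiveness, prior_knowledge):
        """I = E / K; undefined when K == 0, so refuse to compute it."""
        if prior_knowledge == 0:
            raise ValueError("K = 0: intelligence is undefined in this framing")
        return effectiveness / prior_knowledge

    # Students A and B solve the same problem equally well (same E),
    # but A brings more prior math knowledge (larger K), so B scores higher on I.
    E = 1.0
    print("student A:", intelligence(E, prior_knowledge=4.0))   # 0.25
    print("student B:", intelligence(E, prior_knowledge=1.0))   # 1.0

    # The LLM caricature above, rearranged as E = I * K:
    # low intelligence * vast knowledge = reasonable effectiveness.
    print("LLM-ish E:", 0.2 * 50.0)                             # 10.0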
In your calculations, in relation to humans, how do you view the 500k - 700k years of learned behaviors and acquired responses passed to offspring?
Joking aside, better than ever before at any cost is an achievement, it just doesn't exactly scream "breakthrough" to me.
What do you mean by this? I'm assuming you're not speaking about simple absolute differences in value - there have been top players rated over 100 points higher than the average of the rest of the top ten.
o1 is the best code generation model according to Livebench.
So how is this not a breakthrough? It's a genuine movement of the frontier.
I'm a little disappointed by all the upvotes I got for being flat wrong. I guess as long as you're trashing AI you can get away with anything.
Really I was just trying to nitpick the chart parameters.
The problem is that RAM stopped scaling a long time ago now. We're down to the size where a single capacitor's charge is held by a mere 40,000 or so electrons and all we've been doing is making skinnier, longer cells of that size because we can't find reliable ways to boost even weaker signals, but this is a dead end because as the math shows, if the volume is consistent and you are reducing X and Y dimensions, that Z dimension starts to get crazy big really fast. The chemistry issues of burning a hole a little at a time while keeping wall thickness somewhat similar all the way down is a very hard problem.
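Rough back-of-envelope version of that "Z gets crazy big" point, with arbitrary made-up numbers rather than real process data: if the capacitor's volume (and hence its stored charge) has to stay roughly constant while X and Y shrink, the required depth goes as 1/(X*Y):

    # Hold capacitor volume constant, shrink the lateral dimensions each
    # generation, and watch the required depth (and aspect ratio) explode.
    volume = 1.0      # arbitrary units, stands in for "enough electrons to sense"
    x = y = 1.0       # lateral dimensions, arbitrary units
    shrink = 0.8      # hypothetical ~20% lateral shrink per generation

    for generation in range(5):
        z = volume / (x * y)   # depth needed to keep the same volume
        print(f"gen {generation}: x = y = {x:.2f}, z = {z:.1f}, aspect ratio z/x = {z / x:.1f}")
        x *= shrink
        y *= shrink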
Another problem is that Moore's law hit a wall when Dennard Scaling failed. When you look at SRAM (it's generally the smallest and most reliable stuff we can make), you see that most recent shrinks can hardly be called shrinks.
Unless we do something very different like compute in storage or have some radical breakthrough in a new technology, I don't know that we will ever get a 2T parameter model inside a phone (I'd love for someone in 10 years to show up and say how wrong I was).
But the data centres running the training for models like this are bringing up new methane power plants at a fast rate at a time when we need to be reducing reliance on O&G.
But let's assume that the efficiency gains outpace the resource consumption with the help of all the subsidies being thrown in, and we achieve AGI.
What’s the benefit? Do we get more fresh water?
Regardless, once we have AGI (and it can scale), I don't think O&G reliance (/ climate change) is going to be something that we need to concern ourselves with.
Like it or not we already know what we need to do to avert the worst of the climate disasters to come.
OTOH if these data centers are sufficiently decentralized and run for public benefit, maybe there’s a chance we use them to solve collective action problems.
I’ll let those smarter than me debate the merits of AGI, but if it can’t learn and self-improve it isn’t “general” intelligence.
This is a very smart computer, accomplishing a very niche set of problems. Cool? Yes. AGI? No.
So what is your benchmark?
I keep seeing this thrown around, but did anyone actually like go out and do this? I feel like I could distinguish between an AI (even the latest models) and a person after a text-only back and forth conversation.
That's exactly what this particular benchmark requires.
Right now AI systems are built top to bottom to learn in development and be deployed as a static asset. This isn't because online learning isn't doable; it's because there isn't a great use case given current limitations. Either the algorithms are too slow or the computers are too slow, take your pick.
Chain of thought is basically a more constrained version of in-situ learning, only the knowledge has a lifetime bound to the task. Propagating the information back into the model would be too resource-hungry, and too unpredictable to productize. Honestly, taking the result of chain of thought and feeding that back into training offline is probably where a lot of the progress on these kinds of tasks is coming from.
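A hedged sketch of that loop, with toy stand-ins for the model, the judge and the training step (none of this is anyone's actual pipeline): sample chains of thought, keep only the ones a narrow verifier can confirm, and feed those back in as offline training data.

    import random

    def generate_chain_of_thought(model, problem):
        """Sample one reasoning trace plus a final answer from the model."""
        answer = model(problem)
        return {"problem": problem, "trace": f"(reasoning about {problem})", "answer": answer}

    def is_correct(example):
        """A narrow, human-programmed judge: only possible when answers are checkable."""
        return example["answer"] == sum(example["problem"])

    def fine_tune(model, examples):
        """Placeholder for the offline training step on the verified traces."""
        print(f"fine-tuning on {len(examples)} verified chains of thought")
        return model

    # A toy 'model' that reasons its way to the right sum ~30% of the time.
    toy_model = lambda problem: sum(problem) if random.random() < 0.3 else -1

    problems = [(1, 2), (3, 4), (5, 6), (7, 8)]
    kept = [ex
            for p in problems
            for ex in (generate_chain_of_thought(toy_model, p) for _ in range(8))
            if is_correct(ex)]
    toy_model = fine_tune(toy_model, kept)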
Trial and error requires a judge to determine if there was an error. To do trial and error for general tasks you need a general judge, and it needs to be good in order to get intelligent results. All examples of successful AI you see have human judges or human programmed narrow judges. Chess AI training is an example where we have a human programmed judge, but for most tasks not even humans can code up a good judge.
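To make the "human-programmed narrow judge" point concrete, here's how trivial such a judge is when the task itself defines success (tic-tac-toe instead of chess, to keep it tiny; purely illustrative):

    # A human-programmed narrow judge: trivial when the task defines success.
    WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
                 (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
                 (0, 4, 8), (2, 4, 6)]              # diagonals

    def judge(board):
        """board: 9 chars of 'X', 'O' or '.'; +1 if X won, -1 if O won, 0 otherwise."""
        for a, b, c in WIN_LINES:
            if board[a] != "." and board[a] == board[b] == board[c]:
                return 1.0 if board[a] == "X" else -1.0
        return 0.0

    print(judge("XXX.OO..."))  # 1.0 -> a completed row is easy to score mechanically
    # There is no comparably short judge for most open-ended, real-world tasks.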
If the success criteria are inherently subjective, like music or art, you can use human reactions as the criteria, while also using reasoning to infer principles about what is or isn’t received well. That’s what humans do.
You just said what there is to judge in one specific scenario. A general AI has to decide to judge itself that way, which is not objective; it is extremely hard to decide what to judge yourself by in a generic situation.
For example, let's say you are in a basketball court, with a basketball. What is a good outcome? Is it shooting the ball into the hoop? But maybe it isn't your turn and someone else is shooting now; then shooting at the hoop is a bad outcome. How do you make the AI recognize that, instead of mindlessly trying to make the ball go in the hoop without considering the context?
Not to say it isn't difficult, but I don't think humans are doing anything particularly magical when they learn to play basketball, something I did myself when I was a kid. You learn each skill from a coach's demonstration, practice them all a lot (practice = trial and error) and develop an intuition (reasoning) about what to do in various situations.
The part to get excited about is that there's plenty of headroom left to gain in performance. They called o1 a preview, and it was, a preview for QwQ and similar models. We get the demo from OAI and then get the real thing for free next year.
Probably less disruption than will happen in 1st world countries.
> No one will have chance to be rich anymore
It's strange to reach this conclusion from "look, a massive new productivity increase".
I read “no one will have a chance to be rich anymore” as a statement about economic mobility. Despite steep declines in mobility over the last 50 years, it was still theoretically possible for a poor child (say bottom 20% wealth) to climb several quintiles. Our industry (SWE) was one of the best examples. Of course there have been practical barriers (poor kids go to worse schools, and it’s hard to get into college if you can’t read) but the path was there.
If robots replace a lot of people, that path narrows. If AGI replaces all people, the path no longer exists.
car : horse :: AGI : humans
Do you work at one of the frontier labs?
As for the wealth disparity between rich and poor countries, it’s hard to know how politics will handle this one, but it’s unlikely that poor countries won’t also be drastically richer as the cost of basic living drops to basically zero. Imagine the cost of food, energy, etc in an ASI world. Today’s luxuries will surely be considered human rights necessities in the near future.
Those entities are the worlds governments regardless how things play out. People just worry they will be hostile or indifferent to humans, since that would be bad news for humans. Pet, cattle or pest, our future will be as one of those.
What law would effectively reduce risk from AGI? The EU passed a law that is entirely about reducing AI risk and people in the technology world almost universally considered it a bad law. Why would other countries do better? How could they do better?
Besides regulating the technology, they could try to protect people and society from the effects of the technology. UBI, for example, could be an attempt to protect people from the effects of mass unemployment, as I understood it.
Actually, I'm afraid even more fundamental shifts are necessary.
So much for a plateau lol.
It’s been really interesting to watch all the internet pundits’ takes on the plateau… as if the two years since the release of GPT3.5 is somehow enough data for an armchair ponce to predict the performance characteristics of an entirely novel technology that no one understands.
This is so insane that I can't help but be skeptical. I know the FrontierMath answer key is private, but they have to send the questions to OpenAI in order to score the models. And a significant jump on this benchmark sure would increase a company's valuation...
Happy to be wrong on this.
OpenAI and Epoch AI are both startups with every incentive to peddle this narrative, when no one else can independently verify it.
These new reasoning models are taking things in a new direction basically by adding search (inference time compute) on top of the basic LLM. So, the capabilities of the models are still improving, but the new variable is how deep of a search you want to do (how much compute to throw at it at inference time). Do you want your chess engine to do a 10 ply search or 20 ply? What kind of real world business problems will benefit from this?
They found a way to make test time compute a lot more effective and that is an advance but the idea is not new, the architecture is not new.
And the vast majority of people convinced LLMs plateaued did so regardless of test time compute.
A plain LLM does not use variable compute - it is a fixed number of transformer layers, a fixed amount of compute for every token generated.
What's different is the method for _sampling_ from that model where it seems they have encouraged the underlying LLM to perform a variable length chain of thought "conversation" with itself as has been done with o1. In addition, they _repeat_ these chains of thought in parallel using a tree of some sort to search and rank the outputs. This apparently scales performance on benchmarks as you scale both length of the chain of thought and the number of chains of thought.
That said, I probably did downplay the achievement. It may not be a "new" idea to do something like this, but finding an effective method for reflection that doesn't just lock you into circular thinking and is applicable beyond well-defined problem spaces is genuinely tough, and a breakthrough.
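A minimal sketch of that sampling scheme under toy assumptions (the "model" below is fake, and the ranking is a simple self-consistency majority vote rather than whatever OpenAI actually does): sample several chains of thought in parallel, then keep the answer they converge on. More chains means more inference-time compute per question and better odds.

    import random
    from collections import Counter

    def sample_chain_of_thought(question):
        """Pretend to 'think' for a variable number of steps, then answer."""
        thinking_steps = random.randint(1, 6)
        got_it_right = random.random() < 0.35 + 0.05 * thinking_steps
        return "42" if got_it_right else str(random.randint(0, 99))

    def answer_with_search(question, n_chains):
        """Spend more inference-time compute by sampling more parallel chains."""
        answers = [sample_chain_of_thought(question) for _ in range(n_chains)]
        return Counter(answers).most_common(1)[0][0]

    for n in (1, 8, 64):
        print(f"{n:>2} chains ->", answer_with_search("What is 6 * 7?", n_chains=n))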
That's the most plausible definition of AGI I've read so far.
Instrumental reason FTW
> Passing ARC-AGI does not equate achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.
The most bizarre thing is that programmers are literally writing code to replace themselves because once this AI started, it was a race to the bottom and nobody wants to be last.
I remember attending a tech fair decades ago, and at one stand they were vending some database products. When I mentioned that I was studying computer science with a focus on software engineering, they sneered that coding will be much less important in the future since powerful databases will minimize the need for a lot of data wrangling in applications with algorithms.
What actually happened is that the demand for programmers increased, and software ate the world. I suspect something similar will happen with the current AI hype.
This has literally already arrived. Average Joes are writing software using LLMs right now.
If AI truly overtakes knowledge work there’s not much we could reasonably do to prepare for it.
If AI never gets there though, then you saved yourself the trouble of stressing about it. So sure, relax, it’s just the second coming of GeoCities.
The really bad situation is if my entire skill set is made obsolete while the rest of the world keeps going for a decade or two. Or maybe longer, who knows.
I realize I'm coming across quite selfish, but it's just a feeling.
Will it?
It's already hard to get people to use computers as they are right now, where you only need to click on things and no longer have to enter commands. That's because most people don't like to engage in formal reasoning. Even with some of the most intuitive computer-assisted tasks (drawing and 3D modeling), there's so much theory to learn that few people bother.
Programming has always been easy to learn, and tools to automate coding have existed for decades now. But how many people you know have had the urge to learn enough to automate their tasks?
Look at video editing bays after the advent of Final Cut. There was a significant drop in the specialization required of the professional field, even while content volume went up dramatically.
Video demand exploded, and the ratio of professional editors collapsed.
3 decades ago you needed a big team to create the type of video games that one person can probably make on their own today in their spare time with modern tools.
But now modern tools have been used to make even more complicated games that require more massive teams than ever and huge amounts of money. One person has no hope of replicating that now, but maybe in the future with AI they can. And then the AAA games will be even more advanced.
It will be similar with other software.
If there are jobs paying $150K just to code (someone else tells you what to code, and you just code it up), then please share!
Generalist junior and senior engineers will need to think of a different career path in less than 5 years as more layoffs will reduce the software engineering workforce.
It looks like it may be the way things are if progress in the o1, o3, oN models and other LLMs continues on.
But they won't. AI will enable building even more complex software, which counterintuitively will result in needing even more human jobs to deal with the added complexity.
Think about how despite an increasing amount of free open source libraries over time enabling some powerful stuff easily, developer jobs have only increased, not decreased.
- Faster product development on their side as they eat their own dogfood
- Devs are the biggest market in the transition period for this tech. Gives you some revenue from direct and indirect subscriptions that the general population does not need/require.
- Fear in leftover coders is great for marketing
- Tech workers are paid well, which to VCs, CEOs, etc. makes it obvious where the value of this tech comes from. Not from new use cases/apps which would be greatly beneficial to society, but from effectively making people redundant and saving costs. New use cases/new markets are risky; not paying people is something any MBA/accounting type can understand.
I've heard some people say "it's like they are targeting SWEs". I say: yes, they probably are. I wouldn't be surprised if it takes SWE jobs, but otherwise most people will see it as a novelty (barely affects their life) for quite some time.
What if software demand is largely saturated? It seems the big tech companies have struggled to come up with the next big tech product category, despite lots of talent and capital.
Compare the early web vs the complicated JavaScript laden single page application web we have now. You need way more people now. AI will make it even worse.
Consider that in the AI driven future, there will be no more frameworks like React. Who is going to bother writing one? Instead every company will just have their own little custom framework built by an AI that works only for their company. Joining a new company means you bring generalist skills and learn how their software works from the ground up and when you leave to another company that knowledge is instantly useless.
Sounds exciting.
But there are also plenty of unexplored categories that we still can't access because the technology for them is insufficient. Household robots with AGI, for instance, may require instructions for specific services sold as "apps" that have to be designed and developed by companies.
These models are tools to help engineers, not replacements. Models cannot, on their own, build novel new things no matter how much the hype suggests otherwise. What they can do is remove a hell of a lot of accidental complexity.
But maybe models + managers/non technical people can?
This I think will happen with programmers. Rote programming will slowly die out, while super-high-end work will go dramatically up in price.
Also unsure what you mean by...'how golfing works'. This is the economics of it, not the game
o3 can do much much more. There is nothing narrow about SOTA LLMs. They are already General. It doesn't matter what ARC Maintainers have said. There is no common definition of General that LLMs fail to meet. It's not a binary thing.
By the time a single machine covers every little test humanity can devise, what comes out of that is not 'AGI' as the words themselves mean but a General Super Intelligence.
If you want to play games about how to define AGI go ahead. People have been claiming for years that we've already reached AGI and with every improvement they have to bizarrely claim anew that now we've really achieved AGI. But after a few months people realize it still doesn't do what you would expect of an AGI and so you chase some new benchmark ("just one more eval").
The fact is that there really hasn't been the type of world-altering impact that people generally associate with AGI and no reason to expect one.
Basically nobody today thinks beating a single benchmark and nothing else will make you a General Intelligence. As you've already pointed out, even the maintainers of ARC-AGI do not think this.
>If you want to play games about how to define AGI go ahead.
I'm not playing any games. ENIAC cannot do 99% of the things people use computers to do today and yet barely anybody will tell you it wasn't the first general purpose computer.
On the contrary, it is people who seem to think "General" is a moniker for everything under the sun (and then some) that are playing games with definitions.
>People have been claiming for years that we've already reached AGI and with every improvement they have to bizarrely claim anew that now we've really achieved AGI.
Who are these people? Do you have any examples at all? Genuine question.
>But after a few months people realize it still doesn't do what you would expect of an AGI and so you chase some new benchmark ("just one more eval").
What do you expect from 'AGI'? Everybody seems to have different expectations, much of them rooted in science fiction rather than reality, so this is a moot point. What exactly is world-altering to you? Genuinely, do you even have anything other than "I'll know it when I see it"?
If you introduce technology most people adopt, is that world-altering, or are you waiting for Skynet?
People's comments, including in this very thread, seem to suggest otherwise (c.f. comments about "goal post moving"). Are you saying that a widespread belief wasn't that a chess playing computer would require AGI? Or that Go was at some point the new test for AGI? Or the Turing test?
> I'm not playing any games... "General" is a moniker for everything under the sun that are playing games with definitions.
People have a colloquial understanding of AGI whose consequence is a significant change to daily life, not the tortured technical definition that you are using. Again your definition isn't something anyone cares about (except maybe in the legal contract between OpenAI and Microsoft).
> Who are these people? Do you have any examples at all? Genuine question.
How about you? I get the impression that you think AGI was achieved some time ago. It's a bit difficult to simultaneously argue both that we achieved AGI in GPT-N and also that GPT-(N+X) is now the real breakthrough AGI while claiming that your definition of AGI is useful.
> What do you expect from 'AGI'?
I think everyone's definition of AGI includes, as a component, significant changes to the world, which probably would be something like rapid GDP growth or unemployment (though you could have either of those without AGI). The fact that you have to argue about what the word "general" technically means is proof that we don't have AGI in a sense that anyone cares about.
But you don't see this kind of discussion about the narrow models/techniques that made strides on this benchmark, do you?
>People have a colloquial understanding of AGI whose consequence is a significant change to daily life, not the tortured technical definition that you are using
And ChatGPT has represented a significant change to the daily lives of many. It's the fastest-adopted software product in history. In just 2 years, it's one of the top ten most visited sites on the planet. A lot of people have had the work they do change significantly since its release. This is why I ask: what is world-altering?
>How about you? I get the impression that you think AGI was achieved some time ago.
Sure
>It's a bit difficult to simultaneously argue both that we achieved AGI in GPT-N and also that GPT-(N+X) is now the real breakthrough AGI
I have never claimed GPT-(N+X) is the "new breakthrough AGI". As far as I'm concerned, we hit AGI some time ago and are making strides in competence and/or enabling even more capabilities.
You can recognize ENIAC as a general purpose computer and also recognize the breakthroughs in computing since then. They're not mutually exclusive.
And personally, I'm more impressed with o3's Frontier Math score than ARC.
>I think everyone's definition of AGI includes, as a component, significant changes to the world
Sure
>which probably would be something like rapid GDP growth or unemployment
What people imagine as "significant change" is definitely not in any broad agreement.
Even in science fiction, the existence of general intelligences more competent than today's LLMs is not necessarily a precursor to massive unemployment or GDP growth.
And for a lot of people, the clincher stopping them from calling a machine AGI is not even any of these things. For some, that it is "sentient" or "cannot lie" is far more important than any spike of unemployment.
I don't understand what you are getting at.
Ultimately there is no axiomatic definition of the term AGI. I don't think the colloquial understanding of the word is what you think it is: if you had described to people, pre-ChatGPT, today's ChatGPT behavior, including all the limitations and failings and the fact that there was no change in GDP, unemployment, etc., and asked if that was AGI, I seriously doubt they would say yes.
More importantly I don't think anyone would say their life was much different from a few years ago and separately would say under AGI it would be.
But the point that started all this discussion is the fact that these "evals" are not good proxies for AGI and no one is moving goal-posts even if they realize this fact only after the tests have been beaten. You can foolishly define AGI as beating ARC but the moment ARC is beaten you realize that you don't care about that definition at all. That doesn't change if you make a 10 or 100 benchmark suite.
If such discussions are only had when LLMs make strides on the benchmark, then it's not just about beating the benchmark but also about what kind of system is beating it.
>You can foolishly define AGI as beating ARC but the moment ARC is beaten you realize that you don't care about that definition at all.
If you change your definition of AGI the moment a test is beaten then yes, you are simply goalpost moving.
If you care about other impacts like "unemployment" and "GDP rising" but don't give any time or opportunity to see if the model is capable of producing them, then you don't really care about those and are just mindlessly shifting goalposts.
How does such a person know o3 won't cause mass unemployment? The model hasn't even been released yet.
I still don't understand the point you are making. Nobody is arguing that discrete program search is AGI (and the same counter-arguments would apply if they did).
> If you change your definition of AGI the moment a test is beaten then yes, you are simply goalpost moving.
I don't think anyone changes their definition, they just erroneously assume that any system that succeeds on the test must do so only because it has general intelligence (that was the argument for chess playing for example). When it turns out that you can pass the test with much narrower capabilities they recognize that it was a bad test (unfortunately they often replace the bad test with another bad test and repeat the error).
> If you care about other impacts like "unemployment" and "GDP rising" but don't give any time or opportunity to see if the model is capable of producing them, then you don't really care about those and are just mindlessly shifting goalposts.
We are talking about what models are doing now (is AGI here now) not what some imaginary research breakthroughs might accomplish. O3 is not going to materially change GDP or unemployment. (If you are confident otherwise please say how much you are willing to wager on it).
You can be as confident as you want, but until the model has been given the chance to have (or not have) the effect you think it won't, it's just an assertion that may or may not be entirely wrong.
If you say "this model passed this benchmark I thought would indicate AGI but didn't do this or that so I won't acknowledge it" then I can understand that. I may not agree on what the holdups are but I understand that.
If however you're "this model passed this benchmark I thought would indicate AGI but I don't think it's going to be able to do this or that so it's not AGI" then I'm sorry but that's just nonsense.
My thoughts or bets are irrelevant here.
A few days ago I saw someone seriously comparing a site with nearly 4B visits a month in under 2 years to Bitcoin and VR. People are so up in their bubbles and so assured in their way of thinking they can't see what's right in front of them, nevermind predict future usefulness. I'm just not interested in engaging "I think It won't" arguments when I can just wait and see.
I'm not saying you are one of such people. I just have no interest in such arguments.
My bet? There's no way I would make a bet like that without playing with the model first. Why would I? Why would you?
I explicitly said which one I was: I said today we don't have the large-impact societal changes that people have conventionally associated with the term AGI. I also explicitly talked about how I don't believe o3 will change this, and your comments seem to suggest neither do you (you seem to prefer to emphasize that it isn't literally impossible that o3 will make these transformative changes).
> If however you're "this model passed this benchmark I thought would indicate AGI but I don't think it's going to be able to do this or that so it's not AGI" then I'm sorry but that's just nonsense.
The entire point of the original chess example was to show that it is in fact the correct reaction to repudiate incorrect beliefs about naive litmus tests of AGI-ness. If we did what you are arguing, then we should accept AGI as having occurred after chess was beaten, because a lot of people believed that was the litmus test? Or that we should praise people who stuck to their original beliefs after they were proven wrong instead of correcting them? That's why I said it was silly at the outset.
> My thoughts or bets are irrelevant here
No, they show you don't actually believe we have society-transformative AGI today (or will when o3 is released) but get upset when someone points that out.
> I'm just not interested in engaging "I think It won't" arguments when I can just wait and see.
A lot of life is about taking decisions based on predictions about the future, including consequential decisions about societal investment, personal career choices, etc. For many things there isn't a "wait and see" approach; you are making implicit or explicit decisions even by maintaining the status quo. People who make bad or unsubstantiated arguments are creating a toxic environment in which those decisions are made, leading to personal and public harm. The most important example of this is the decision to dramatically increase energy usage to accommodate AI models despite impending climate catastrophe, on the blind faith that AI will somehow fix it all (which is far from the "wait and see" approach that you are supposedly advocating, by the way; this is an active decision).
> My bet? There's no way I would make a bet like that without playing with the model first. Why would I? Why would you?
You can have beliefs based on limited information. People do this all the time. And if you actually revealed that belief, it would demonstrate that you don't actually currently believe o3 is likely to be world-transformative.
Cool... but I don't want to in this matter.
I think the models we have today are already transformative. I don't know if o3 is capable of causing sci-fi mass unemployment (for white collar work) and wouldn't have anything other than essentially a wild guess till it is released. I don't want to make a wild guess. Having beliefs on limited information is often necessary but it isn't some virtue and in my opinion should be avoided when unnecessary. It is definitely not necessary to make a wild guess about model capabilities that will be released next month.
>The entire point of the original chess example was to show that it is in fact the correct reaction to repudiate incorrect beliefs about naive litmus tests of AGI-ness. If we did what you are arguing, then we should accept AGI as having occurred after chess was beaten, because a lot of people believed that was the litmus test?
Like i said, if you have some other caveats that weren't beaten then that's fine. But it's hard to take seriously when you don't.
This model was trained to pass this test, it was trained heavily on the example questions, so it was a narrow technique.
We even have proof that it isn't AGI, since it scores horribly on ARC-AGI 2. It overfitted for this test.
You are allowed to train on the train set. That's the entire point of the test.
>We even have proof that it isn't AGI, since it scores horribly on ARC-AGI 2. It overfitted for this test.
ARC 2 does not even exist yet. All we have are "early signs", not that that would be proof of anything. Whether I believe the models are generally intelligent or not doesn't depend on ARC.
Right, but by training on those test cases you are creating a narrow model. The whole point of training questions is to create narrow models, like all the models we did before.
You are not narrow for undergoing training and it's honestly kind of ridiculous to think so. Not even the ARC maintainers believe so.
Humans didn't need to see the training set to pass this, the AI needing it means it is narrower than the humans, at least on these kind of tasks.
The system might be more general than previous models, but still not as general as humans, and the G in AGI typically means being as general as humans. We are moving towards more general models, but still not at the level where we call them AGI.
Do you have any evidence to support that? It would be fascinating if the field is primarily advancing due to a unique constellation of traits contributed by individuals who, in the past, may not have collaborated so effectively.
Yes
It has thoroughly replaced journalists and artists, and is on its way to replacing both junior and senior engineers. The ultimate intention of "AGI" is that it is going to replace tens of millions of jobs. That is it, and you know it.
It will only accelerate, and we need to stop pretending and coping. Instead, let's discuss solutions for those lost jobs.
So what is the replacement for these lost jobs? (It is not UBI or "better jobs" without defining them.)
Did it, really? Or did it just provide automation for routine no-thinking-necessary text-writing tasks, while still being ultimately bound by the level of the human operator's intelligence? I strongly suspect it's the latter. If it has actually replaced journalists, it must be at junk outlets, where readers' intelligence is negligible and anything goes.
Just yesterday I used o1 and Claude 3.5 to debug a Linux kernel issue (ultimately, a bad DSDT table causing the TPM2 driver to be unable to reserve a memory region for the command response buffer; the solution was to use memmap to remove the NVS flag from the relevant regions) and confirmed once again that LLMs still don't reason at all - they just spew out plausible-looking chains of words. The models were good listeners and mostly-helpful code generators (when they didn't make the silliest mistakes), but they showed no traces of understanding and paid no attention to nuances (e.g. the LLM used `IS_ERR` to check the `__request_resource` result, despite me giving it the full source code for that function, where there's even a comment that makes it obvious it returns a pointer or NULL, not an error code - a misguided-attention kind of mistake).
So, in my opinion, LLMs (as currently available to broad public, like myself) are useful for automating away some routine stuff, but their usefulness is bounded by the operator's knowledge and intelligence. And that means that the actual jobs (if they require thinking and not just writing words) are safe.
When asked about what I do at work, I used to joke that I just press buttons on my keyboard in fancy patterns. Ultimately, LLMs seem to suggest that it's not what I really do.
>Trades people only work after having something to do. If you don't have sufficient demand for builders, electricians, plumbers, etc... No one can afford to become one. Nevermind the fact that not everyone should be any of those things. Economics fails when the loop fails to close.
Many replaceable
> Police officers
Many replaceable (desk officers)
Ford didn’t support a 40 hour work week out of the kindness of his heart. He wanted his workers to have time off for buying things (like his cars).
I wonder if our AGI industrialist overlords will do something similar for revenue sharing or UBI.
I don't think so. I agree the push for AGI will kill the modern consumer product economy, but I think it's quite possible for the economy to evolve into a new form (that will probably be terrible for most humans) that keep pushes "work replacement."
Imagine, an AGI billionare buying up land, mines, and power plants as the consumer economy dies, then shifting those resources away from the consumer economy into self-aggrandizing pet projects (e.g. ziggurats, penthouses on Mars, space yachts, life extension, and stuff like that). He might still employ a small community of servants, AGI researchers, and other specialists; but all the rest of the population will be irrelevant to him.
And individual autarky probably isn't necessary; consumption will be redirected towards the massive pet projects I mentioned, with vestigial markets for power, minerals, etc.
In reality, if there really is mass unemployment, AI driven automation will make consumables so cheap that anyone will be able to buy it.
This isn't possible if you want to pay sales taxes - those are what keep transactions being done in the official currency. Of course in a world of 99% unemployment presumably we don't care about this.
But yes, this world of 99% unemployment isn't possible, eg because as soon as you have two people and they trade things, they're employed again.
Ultimately, it all comes down to raw materials and similar resources, and all those will be claimed by people with lots of real money. Your "invented ... other money" will be useless to buy that fundamental stuff. At best, it will be useful for trading scrap and other junk among the unemployed.
> In reality, if there really is mass unemployment, AI driven automation will make consumables so cheap that anyone will be able to buy it.
No. Why would the people who own that automation want to waste their resources producing consumer goods for people with nothing to give them in return?
Uh, this picture doesn’t make sense. Why would anyone value this randomly invented money?
Because they can use it to pay for goods?
Your notion is that almost everyone is going to be out of a job and thus have nothing. Okay, so I'm one of those people and I need this house built. But I'm not making any money because of AI or whatever. Maybe someone else needs someone to drive their aging relative around and they're a good builder.
If 1. neither of those people have jobs or income because of AI 2. AI isn't provisioning services for basically free,
then it makes sense for them to do an exchange of labor - even with AI (if that AI is not providing services to everyone). The original reason for having money and exchanging it still exists.
The fact that unemployment was 25% during the Great Depression would seem to suggest that, at a minimum, a 25% unemployment rate is possible during a disruptive event.
Unless nobody wanted either of those things done during the depression that's clearly not a very good mental model.
Yes if we recreate society some form of money would likely emerge.
Where are you getting gas/house materials from? No hand waves please. Show all work.
That’s the true litmus test. Everything else? It’s just fine-tuning weights, playing around the edges. Until it starts cutting through the fat and reshaping how organizations really operate, all of this is just more of the same.
Generally with AI, I think the top of society stands to gain a lot more than the middle/bottom of it, for a whole host of reasons. If you think anything different, the framework you use to reach your conclusion is probably wrong, at least IMO.
I don't like saying this, but there is a reason why the "AI bros", VCs, big tech CEOs, etc. are all very, very excited about this while many employees (some commenting here) are filled with dread/fear. The sales people, the managers, the MBAs, etc. stand to gain a lot from this. Fear also serves as the best marketing tool; it makes people talk and spread OpenAI's news more than anything else. It's a reason why targeting coding jobs/any jobs is so effective. I want to be wrong, of course.
Hopefully my skepticism will end up being unwarranted, but how confident are we that the queries are not routed to human workers behind the API? This sounds crazy but is plausible for the fake-it-till-you-make-it crowd.
Also, given the prohibitive compute costs per task, typical users won't be using this model, so the scheme could go on for quite some time before the public knows the truth.
They could also come out in a month and say o3 was so smart it'd endanger the civilization, so we deleted the code and saved humanity!
The author also suggested this is a new architecture that uses existing methods, like the Monte Carlo tree search that DeepMind is investigating (they use this method for AlphaZero).
I don't see the point of colluding for this sort of fraud, as these methods like tree search and pruning already exist. And other labs could genuinely produce these results
Possibly some other form of "make it seem more impressive than it is," but not that one.
In the medium term the plan could be to achieve AGI, and then AGI would figure out how to actually write o3. (Probably after AGI figures out the business model though: https://www.reddit.com/r/MachineLearning/s/OV4S2hGgW8)