It feels like either finding that 2% that's off (or dealing with 2% error) will be the time consuming part in a lot of cases. I mean, this is nothing new with LLMs, but as these use cases encourage users to input more complex tasks, that are more integrated with our personal data (and at times money, as hinted at by all the "do task X and buy me Y" examples), "almost right" seems like it has the potential to cause a lot of headaches. Especially when the 2% error is subtle and buried in step 3 of 46 of some complex agentic flow.
This is where the AI hype bites people.
A great use of AI in this situation would be to automate the collection and checking of data. Search all of the data sources and aggregate links to them in an easy place. Use AI to search the data sources again and compare against the spreadsheet, flagging any numbers that appear to disagree.
Yet the AI hype train takes this all the way to the extreme conclusion of having AI do all the work for them. The quip about 98% correct should be a red flag for anyone familiar with spreadsheets, because it’s rarely simple to identify which 2% is actually correct or incorrect without reviewing everything.
This same problem extends to code. People who use AI as a force multiplier to do the thing for them and review each step as they go, while also disengaging and working manually when it’s more appropriate have much better results. The people who YOLO it with prompting cycles until the code passes tests and then submit a PR are causing problems almost as fast as they’re developing new features in non-trivial codebases.
“The fallacy in these versions of the same idea is perhaps the most pervasive of all fallacies in philosophy. So common is it that one questions whether it might not be called the philosophical fallacy. It consists in the supposition that whatever is found true under certain conditions may forthwith be asserted universally or without limits and conditions. Because a thirsty man gets satisfaction in drinking water, bliss consists in being drowned. Because the success of any particular struggle is measured by reaching a point of frictionless action, therefore there is such a thing as an all-inclusive end of effortless smooth activity endlessly maintained.
It is forgotten that success is success of a specific effort, and satisfaction the fulfillment of a specific demand, so that success and satisfaction become meaningless when severed from the wants and struggles whose consummations they arc, or when taken universally.”
The goal is to help people grow, so they can achieve things they would not have been able to deal with before gaining that additional experience. This might include boring dirty work, yes. But that means they thus prove they can overcome such a struggle, and so more experienced people should be expected to also be able to go though it - if there is no obvious more pleasant way to go.
What you say of interns regarding checks is just as true for any human out there, and the more power they are given, the more relevant it is to be vigilent, no matter their level of experience. Not only humans will make errors, but power games generally are very permeable to corruptible souls.
You and others seem to be disagreeing with something I never said. This is 100% compatible with what I said. You don't just review and then silently correct an interns work behind their back, the review process is part of the teaching. That doesn't really work with AI, so it wasn't explicitly part of my analogy.
But coding didn't become a low wage job, now we're spending GPU credits to make pull requests instead and skipping the labor all together. Anyway I share the parent poster's chagrin at all the comparisons of AI to an intern. If all of your attention is spent correcting the work of a GPU, the next generation of workers will never have mentors giving them attention, starving off the supply of experienced entry level employees. So what happens in 10, 20 years ? I guess anyone who actually knows how to debug computers instead of handing the problem off to an LLM will command extraordinary emergency-fix-it wages.
It’s that John Dewey quote from a parent post all over again.
What would be the point of coaching an LLM? You will just have to coach it again and again
This is especially true in open source where contributions aren’t limited to employees who passed a hiring screen.
This might as well be the new definition of “script kiddie”, and it’s the kids that are literally going to be the ones birthed into this lifestyle. The “craft” of programming may not be carried by these coming generations and possibly will need to be rediscovered at some point in the future. The Lost Art of Programming is a book that’s going to need to be written soon.
It's having a good, useful and reliable test suite that separates the sheep from the goats.*
Would you rather play whack-a-mole with regressions and Heisenbugs, or ship features?
* (Or you use some absurdly good programing language that is hard to get into knots with. I've been liking Elixir. Gleam looks even better...)
Someone with product knowledge writes the tests in a DSL
Someone skilled writes the verbs to make the DSL function correctly
And from there, any amount of skill is irrelevant: either the tests pass, or they fail. One could hook up a markov chain to a javascript sourcebook and eventually get working code out.
Can they? Either the dsl is so detailed and specific as to be just code with extra steps or there is a lot of ground not covered by the test cases with landmines that a million monkeys with typewriters could unwittingly step on.
The bugs that exist while the tests pass are often the most brutal - first to find and understand and secondly when they occasionally reveal that a fundamental assumption was wrong.
I disagree. Receiving a spreadsheet from a junior means I need to check it. If this gives me infinite additional juniors I’m good.
It’s this popular pattern of HN comments - expect AI to behave deterministically correct - while the whole world operates on stochastically correct all the time…
And it should go without saying that LLMs do not have the same investment/value tradeoff. Whether or not they contribute like a senior or junior seems entirely up to luck
Prompt skill is flaky and unreliable to ensure good output from LLMs
You went from “do it again” to “go check the newbies work”.
To get to that stage your degree of proficiency would be “can make out which font is wrong at a glance.”
You wouldn’t be looking at the sheet, you would be running the model in your head.
That stopped being a stochastic function, with the error rate dropping significantly - to the point that making a mistake had consequences tacked on to it.
Why would you need ai for that though? Pull your sources. Run a diff. Straight to the known truth without the chatgpt subscription. In fact by that point you don’t even need the diff if you pulled from the sources. Just drop into the spreadsheet at that point.
— Tom Cargill, Bell Labs
However CICD remains tricky. In fact when AI agents start building autonomous, merge trains become a necessity…
Probably because it's just here now? More people take Waymo than Lyft each day in SF.
Getting this tech deployed globally will take another decade or two, optimistically speaking.
If it's not a technological limitation, why aren't we seeing self-driving cars in countries with lax regulations? Mexico, Brazil, India, etc.
Tesla launched FSD in Mexico earlier this year, but you would think companies would be jumping at the opportunity to launch in markets with less regulation.
So this is largely a technological limitation. They have less driving data to train on, and the tech doesn't handle scenarios outside of the training dataset well.
And those moments where the car gives up and waits for async assistance are very obvious to the rider. Most rides in Waymos don't contain any moments like that.
Even if it's just a high level instruction set, it's possible that that occurs often enough to present scaling issues. It's also totally possible that it's not a problem, only time will tell.
What I have in mind is the Amazon stores, which were sold as being powered by AI, but were actually driven by a bunch of low-paid workers overseas watching cameras and manually entering what people were putting in their carts.
https://www.businessinsider.com/amazons-just-walk-out-actual...
My city had Car2Go for a couple of years, but it's gone now. They had to pull out of the region because it wasn't making them enough money
I expect Waymo and any other sort of vehicle ridesharing thing will have the same problem in many places
And San Francisco doesn’t get snow.
Or cows sharing the thoroughfares.
It should be obvious to all HNers that have lived or travelled to developing / global south regions - driving data is cultural data.
You may as well say that self driving will only happen in countries where the local norms and driving culture is suitable to the task.
A desperately anemic proposition compared to the science fiction ambition.
I’m quietly hoping I’m going to be proven wrong, but we’re better off building trains, than investing in level 5. It’s going to take a coordination architecture owned by a central government to overcome human behavior variance, and make full self driving a reality.
"Driving data is cultural data."
The optimists underestimate a lot of things about self-driving cars.
The biggest one may be that in developing and global south regions, civil engineering, design, and planning are far, far away from being up to snuff to a level where Level 5 is even a slim possibility. Here on the island I'm on, the roads, storm water drainage (if it exists at all) and quality of the built environment in general is very poor.
Also, a lot of otherwise smart people think that the increment between Level 4 and Level 5 is the same as that between all six levels, when the jump from Level 4 to Level 5 automation is the biggest one and the hardest to successfully accomplish.
The goal for a working L5 should be “if piloting a rickshaw, will it be able to operate as a human owner in normal traffic.”
Guangzhou: https://www.youtube.com/watch?v=3DWz1TD-VZg
I’m positing that the models encode cultural decision making norms- and using global south regions to highlight examples of cases that are commonplace but challenge the feasibility of full autonomous driving.
Imagine an auto rickshaw with full self driving.
If in your imagination, you can see a level 5 auto, jousting for position in Mumbai traffic - then you have an image which works.
It’s also well beyond what people expect fully autonomous driving entails.
At that point you are encoding cultural norms and expectations around rule/law enforcement.
This bet aged well: videos of FSD performing very well in wildly different settings -- crowded Guangzhou markets to French traffic circles to left-hand-drive countries -- seem to indicate that this approach is working. It's nailing interactions that it didn't learn from suburban America and that require inferring intent using complex contextual clues. It's not done until it's done, but the god of the gaps retreats ever further into the march of nines and you don't get credit for predicting something once it has already happened.
I liked the use of the God of the gaps - an effective analogy for the counter position.
I’m rejecting the idea of the march of the 9s eventually getting to FSD - the cultural norms issue is about decision making not about physics.
Eg - You have to decide how aggressively to drive, overtake, or jockey for position.
My estimation is that this is not solvable by on board decision making, because that would be accepting unacceptable legal risk.
High speed connectivity and off vehicle processing for some tasks.
Density of locations to "idle" at.
There are a lot of things that make all these services work that means they can NOT scale.
These are all solvable but we have a compute problem that needs to be addressed before we get there, and I haven't seen any clues that there is anything in the pipeline to help out.
Waymo needs to be proving 5-10x the number of daily rides as Lyft before we get excited
You can provide almost any service at a loss, for a while, with enough money. We shouldn't get excited until Waymo starts turning an actual profit.
And as I understand it; These are systems, not individual cars that are intelligent and just decide how to drive from immediate input, These system still require some number of human wranglers and worst-case drivers, there's a lot of specific-purpose code rather nothing-but-neural-network etc.
Which to say "AI"/neural nets are important technology that can achieve things but they can give an illusion of doing everything instantly by magic but they generally don't do that.
GenAI is the exciting new tech currently riding the initial hype spike. This will die down into the trough of disillusionment as well, probably sometime next year. Like self-driving, people will continue to innovate in the space and the tech will be developed towards general adoption.
We saw the same during crypto hype, though that could be construed as more of a snake oil type event.
If and when LLM scaling stalls out, then you'd expect a Gartner hype cycle to occur from there (because people won't realize right away that there won't be further capability gains), but that hasn't happened yet (or if it has, it's too recent to be visible yet) and I see no reason to be confident that it will happen at any particular time in the medium term.
If scaling doesn't stall out soon, then I honestly have no idea what to expect the visibility curve to look like. Is there any historical precedent for a technology's scope of potential applications expanding this much this fast?
Lots of pre-internet technologies went through this curve. PCs during the clock speed race, aircraft before that during the aeronautics surge of the 50s, cars when Detroit was in its heydays. In fact, cloud computing was enabled by the breakthroughs in PCs which allowed commodity computing to be architected in a way to compete with mainframes and servers of the era. Even the original industrial revolution was actually a 200-year ish period where mechanization became better and better understood.
Personally I've always been a bit confused about the Gartner Hype Cycle and its usage by pundits in online comments. As you say it applies to point changes in technology but many technological revolutions have created academic, social, and economic conditions that lead to a flywheel of innovation up until some point on an envisioned sigmoid curve where the innovation flattens out. I've never understood how the hype cycle fits into that and why it's invoked so much in online discussions. I wonder if folks who have business school exposure can answer this question better.
We are seeing diminishing returns on scaling already. LLMs released this year have been marginal improvements over their predecessors. Graphs on benchmarks[1] are hitting an asymptote.
The improvements we are seeing are related to engineering and value added services. This is why "agents" are the latest buzzword most marketing is clinging on. This is expected, and good, in a sense. The tech is starting to deliver actual value as it's maturing.
I reckon AI companies can still squeeze out a few years of good engineering around the current generation of tools. The question is what happens if there are no ML breakthroughs in that time. The industry desperately needs them for the promise of ASI, AI 2027, and the rest of the hyped predictions to become reality. Otherwise it will be a rough time when the bubble actually bursts.
One implicit assumption is that all problems can be solved with some permutations of existing solutions. The other assumption is the approach can find those permutations and can do so efficiently.
Essentially, the true-believers want you to think that rearranging some bits in their cloud will find all the answers to the universe. I am sure Socrates would not find that a good place to stop the investigation.
But, yeah, the question is whether that approach can be defined as intelligence, and whether it can be applicable to all problems and tasks. I'm highly skeptical of this, but it will be interesting to see how it plays out.
I'm more concerned about the problems and dangers of this tech today, than whatever some entrepreneurs are promising for the future.
This isnt just a software problem. IF you go look at the hardware side you see that same flat line (IPC is flat generation over generation). There are also power and heat problems that are going to require some rather exotic and creative solutions if companies are looking to hardware for gains.
The hype cycle has no mathematical basis whatsoever. It's marketing gimmick. It's only value in my life has been to quickly identify people that don't really understand models or larger trends in technology.
I continue to be, but on introspection probably shouldn't be, surprised that people on HN treat is as some kind of gospel. The only people who should respected are other people in the research marketing space as the perfect example of how to dupe people into paying for your "insights".
As capital allocators, we can just keep threatening the worker class with replacing their jobs with LLMs to keep the wages low and have some fun playing monopoly in the meantime. Also, we get to hire these super smart AI researchers people (aka the smartest and most valuable minds in the world) and hold the greatest trophies. We win. End of story.
Which model should I ask about this vague pain I have been having in my left hip? Will my insurance cover the model service subscription? Also, my inner thigh skin looks a bit bruised. Not sure what’s going on? Does the chat interface allow me to upload a picture of it? It won’t train on my photos right?
It's very visible.
Silicon Valley, and VC money has a proven formula. Bet on founders and their ideas, deliver them and get rich. Everyone knows the game, we all get it.
Thats how things were going till recently. Then FB came in and threw money at people and they all jumped ship. Google did the same. These are two companies famous for throwing money at things (Oculus, metaverse, G+, quantum computing) and right and proper face planting with them.
Do you really think that any of these people believe deep down that they are going to have some big breath through? Or do you think they all see the writing on the wall and are taking the payday where they can get it?
Whenever someone tells me how these models are going to make white collar professions obsolete in five years, I remind them that the people making these predictions 1) said we'd have self driving cars "in a few years" back in 2015 and 2) the predictions about white collar professions started in 2022 so five years from when?
And they wouldn't have been too far off! Waymo became L4 self-driving in 2021, and has been transporting people in the SF Bay Area without human supervision ever since. There are still barriers — cost, policies, trust — but the technology certainly is here.
So it's not as ubiquitous as the most optimistic estimates suggested. We're still at a stage where the tech is sufficiently advanced that seeing them replace a large proportion of human taxi services now seems likely to have been reduced to a scaling / rollout problem rather than primarily a technology problem, and that's a gigantic leap.
That's where we are at with self driving. It can only operate in one small area, you can't own one.
We're not even close to where we are with 3d printers today or the microwave in the 50s.
There’s more to this than “predictions are hard.” There are very powerful incentives to eliminate driving and bloated administrative workforces. This is why we don’t have flying cars: lack of demand. But for “not driving?” Nobody wants to drive!
There's still a lot of tooling to be built before it can start completely replacing anyone.
A few comparisons:
>Pressing the button: $1 >Knowing which button to press: $9,999 Those 2% copy-paste changes are the $9.999 and might take as long to find as rest of the work.
Also: SCE to AUX.
Regardless of if AI generates the spreadsheet or if I generate the spreadsheet, I'm still going to do the same validation steps before I share it with anyone. I might have a 2% error rate on a first draft.
So then you have to dig into all this overly verbose code to identify the 3-4 subtle flaws with how it transformed/joined the data. And these flaws take as much time to identify and correct as just writing the whole pipeline yourself.
I used to have a non-technical manager like this - he'd watch out for the words I (and other engineers) said and in what context, and would repeat them back mostly in accurate word contexts. He sounded remarkably like he knew what he was talking about, but would occasionally make a baffling mistake - like mixing up CDN and CSS.
LLMs are like this, I often see Cursor with Claude making the same kind of strange mistake, only to catch itself in the act, and fix the code (but what happens when it doesn't)
But saying they aren't thinking yet or like humans is entirely uncontroversial.
Even most maximalists would agree at least with the latter, and the former largely depends on definitions.
As someone who uses Claude extensively, I think of it almost as a slightly dumb alien intelligence - it can speak like a human adult, but makes mistakes a human adult generally wouldn't, and that combinstion breaks the heuristics we use to judge competency,and often lead people to overestimate these models.
Claude writes about half of my code now, so I'm overall bullish on LLMs, but it saves me less than half of my time.
The savings improve as I learn how to better judge what it is competent at, and where it merely sounds competent and needs serious guardrails and oversight, but there's certainly a long way to go before it'd make sense to argue they think like humans.
LLMs don't have anything like that. Part of why they aren't great at some aspects of human behaviour. E.g. coding, choosing an appropriate level of abstraction - no fear of things becoming unmaintainable. Their approach is weird when doing agentic coding because they don't feel the fear of having to start over.
Emotions are important.
> Everyone has this impression that our internal monologue is what our brain is doing.
Not everyone has an internal monologue, so that would be utterly bizarre. Some people might believe this, but it is by no means relevant to Turing equivalence.
> Emotions are important.
Unless we exceed the Turing computable, our experience of emotions would be evidence that any Turing complete system can be made to act as if they experience emotions.
I mean, theoretically in an "infinite tape" model, sure. But we don't even know if it's physically possible. Given that the observable universe is finite and the information capacity of a finite space is also finite, then anything humans can do can theoretically be encoded with a lookup table, but that doesn't mean that human thought can actually be replicated with a lookup table, since the table would be vastly larger than the observable universe can store.
LLMs look like the sort of thing that could replicate human thought in theory (since they are capable of arbitrary computation if you give them access to infinite memory) but not the sort of thing that could do it in a physically possible way.
That encoding a naive/basic UTM in an LLM would potentially be impractical is largely irrelevant in that case, because for any UTM you can "compress" the program by increasing the number of states or symbols, and effectively "embedding" the steps required to implement a more compact representation in the machine itself.
While it is possible using current LLM architectures might make encoding a model that can be efficient enough to be physically practical impossible, there's no reasonable basis for assuming this approach can not translate.
The machine part of a Turing machine is simple. People manage to build them by accident. Programming language designers come up with a nice-sounding type inference feature and discover that they’ve made their type system Turing-complete. The hard part is the execution speed and the infinite tape.
Ignoring those problems, making AGI with LLMs is easy. You don’t even need something that big. Make a neural network big enough to represent the transition table of a Turing machine with a dozen or so states. Configure it to be a universal machine. Then give it a tape containing a program that emulates the known laws of physics to arbitrary accuracy. Simulate the universe from the Big Bang and find the people who show up about 13 billion years later. If the known laws of physics aren’t accurate enough, compare with real-world data and adjust as needed.
There’s the minor detail that simulating quantum mechanics takes time exponential in the number of particles, and the information needed to represent the entire universe can’t fit into that same universe and still leave room for anything else, but that doesn’t matter when you’re talking Turing machines.
It does matter a great deal when talking about what might lead to actual human-level intelligent machines existing in reality, though.
Current architectures may very well not be sufficient, but that is an entirely different issue.
This is where it goes wrong. You’ve got the implication backwards. The existence of a program and a physical computer that can run it to produce a certain behavior is proof that such behavior can be done with a physical system. (After all, that computer and program are themselves a physical system.) But the existence of a physical system does not imply that there can be an actual physical computer that can run a program that replicates the behavior. If the laws of physics are computable (as they seem to be) then the existence of a system implies that there exists some Turing machine that can replicate the behavior, but this is “exists” in the mathematical sense, it’s very different from saying such a Turing machine could be constructed in this universe.
Forget about intelligence for a moment. Consider a glass of water. Can the behavior of a glass of water be predicted by a physical computer? That depends on what you consider to be “behavior.” The basic heat exchange can be reasonably approximated with a small program that would trivially run on a two-cent microcontroller. The motion of the fluid could be reasonably simulated with, say, 100-micron accuracy, on a computer you could buy today. 1-micron accuracy might be infeasible with current technology but is likely physically possible.
What if I want absolute fidelity? Thermodynamics and fluid mechanics are shortcuts that give you bulk behaviors. I want a full quantum mechanical simulation of every single fundamental particle in the glass, no shortcuts. This can definitely be computed with a Turing machine, and I’m confident that there’s no way it can come anywhere close to being computed on any actual physical manifestation of a Turing machine, given that the state of the art for such simulations is a handful of particles and the complexity is exponential in the number of particles.
And yet there obviously exists a physical system that can do this: the glass of water itself.
Things that are true or at least very likely: the brain exists, physics is probably computable, there exists (in the mathematical sense) a Turing machine that can emulate the brain.
Very much unproven and, as far as I can tell, no particular reason to believe they’re true: the brain can be emulated with a physical Turing-like computer, this computer is something humans could conceivably build at some point, the brain can be emulated with a neural network trained with gradient descent on a large corpus of token sequences, the brain can be emulated with such a network running on a computer humans could conceivably build. Talking about the computability of the human brain does nothing to demonstrate any of these.
I think non-biological machines with human-equivalent intelligence are likely to be physically possible. I think there’s a good chance that it will require specialized hardware that can’t be practically done with a standard “execute this sequence of simple instructions” computer. And if it can be done with a standard computer, I think there’s a very good chance that it can’t be done with LLMs.
But normally you would want a more hands on back and forth to ensure the requirements actually capture everything, validation and etc that the results are good, layers of reviews right
and of course, you pay whether the slot machine gives a prize or not. Between the slot machine psychological effect and sunk cost fallacy I have a very hard time believing the anecdotes -- and my own experiences -- with paid LLMs.
Often I say, I'd be way more willing to use and trust and pay for these things if I got my money back for output that is false.
Remember the title “attention is all you need”? Well you need to pay a lot of attention to CC during these small steps and have a solid mental model of what it is building.
And why 98%? Why not 99% right? Or 99.9% right? I know they can't outright say 100% because everyone knows that's a blatant lie, but we're okay with them bullshitting about the 98% number here?
Also there's no universe in which this guy gets to walk his dog while his little pet AI does his work for him, instead his boss is going to hound him into doing quadruple the work because he's now so "efficient" that he's finishing his spreadsheet in an hour instead of 8 or whatever. That, or he just gets fired and the underpaid (or maybe not even paid) intern shoots off the same prompt to the magic little AI and does the same shoddy work instead of him. The latter is definitely what the C-suite is aiming for with this tech anyway.
This is the part you have wrong. People just won't do that. They'll save the 8 hours and just deal with 2% error in their work (which reduces as AI models get better). This doesn't work with something with a low error tolerance, but most people aren't building the next Golden Gate Bridge. They'll just fix any problems as they crop up.
Some of you will be screaming right now "THAT'S NOT WORTH IT", as if companies don't already do this to consumers constantly, like losing your luggage at the airport or getting your order wrong. Or just selling you something defective, all of that happens >2% of the time, because companies know customers will just deal-with-it.
All I can think of is vibe-coding, and vibe-coding jobs aren't a thing.
At least with humans you have things like reputation (has this person been reliable) or if you did things yourself, you have some good idea of how diligent you've been.
The usual estimate you see is that about 2-5% of spreadsheets used for running a business contain errors.
Might explain why some people grind up a billion tokens trying to make code work only to have it get worse while others pick apart the bits of truth and quickly fill in their blind spots. The skillsets separating wheat from chaff are things like honest appreciation for corroboration, differentiating subjective from objective problems, and recognizing truth-preserving relationships. If you can find the 0.02 ^ n sub-problems, you can grind them down with AI and they will rapidly converge, leaving the 0.98 ^ n problems to focus human touch on.
The last '2%' (and in some benchmarks 20%) could cost as much as $100B+ more to make it perfect consistently without error.
This requirement does not apply to generating art. But for agentic tasks, errors at worst being 20% or at best being 2% for an agent may be unacceptable for mistakes.
As you said, if the agent makes an error in either of the steps in an agentic flow or task, the entire result would be incorrect and you would need to check over the entire work again to spot it.
Most will just throw it away and start over; wasting more tokens, money and time.
And no, it is not "AGI" either.
https://www.computerworld.com/article/1561181/excel-error-le...
"I think it got 98% of the information correct..." how do you know how much is correct without doing the whole thing properly yourself?
The two options are:
- Do the whole thing yourself to validate
- Skim 40% of it, 'seems right to me', accept the slop and send it off to the next sucker to plug into his agent.
I think the funny part is that humans are not exempt from similar mistakes, but a human making those mistakes again and again would get fired. Meanwhile an agent that you accept to get only 98% of things right is meeting expectations.
[0] https://www.jasonwei.net/blog/asymmetry-of-verification-and-...
Because it's a budget. Verifying them is _much_ cheaper than finding all the entries in a giant PDF in the first place.
> the butterfly effect of dependence on an undependable stochastic system
We're using stochastic systems for a long time. We know just fine how to deal with them.
> Meanwhile an agent that you accept to get only 98% of things right is meeting expectations.
There are very few tasks humans complete at a 98% success rate either. If you think "build spreadsheet from PDF" comes anywhere close to that, you've never done that task. We're barely able to recognize objects in their default orientation at a 98% success rate. (And in many cases, deep networks outperform humans at object recognition)
The task of engineering has always been to manage error rates and risk, not to achieve perfection. "butterfly effect" is a cheap rhetorical distraction, not a criticism.
Perhaps importantly checking is a continual process and errors are identified as they are made and corrected whilst in context instead of being identified later by someone completely devoid of any context a task humans are notably bad at.
Lastly it's important to note the difference between a overarching task containing many sub tasks and the sub tasks.
Something which fails at a sub task comprising 10 sub tasks 2% of the time per task has a miserable 18% failure rate at the overarching task. By 20 it's failed at 1 in 3 attempts worse a failing human knows they don't know the answer the failing AI produces not only wrong answers but convincing lies
Failure to distinguish between human failure and AI failure in nature or degree of errors is a failure of analysis.
This is so absurd that I wonder if you're telling? Humans don't even have a 99.99% success rate in breathing, let alone any cognitive tasks.
Will you please elaborate a little on this?
My rule is that if you submit code/whatever and it has problems you are responsible for them no matter how you "wrote" it. Put another way "The LLM made a mistake" is not a valid excuse nor is "That's what the LLM spit out" a valid response to "why did you write this code this way?".
LLMs are tools, tools used by humans. The human kicking off an agent, or rather submitting the final work, is still on the hook for what they submit.
Well yeah, because the agent is so much cheaper and faster than a human that you can eat the cost of the mistakes and everything that comes with them and still come out way ahead. No, of course that doesn't work in aircraft manufacturing or medicine or coding or many other scenarios that get tossed around on HN, but it does work in a lot of others.
You must be really desperate for anti-AI arguments if this is the one you're going with. Employees make mistakes all day every day and they don't get fired. Companies don't give a shit as long as the cost of the mistakes is less than the cost of hiring someone new.
At a certain point, relentlessly checking for whether the model has got everything is more effort in turn than…doing it.
Moreover, is it actually a 4-8 hour job? Or is the person not using the right tool, is the better tool a sql query?
Half these “wow ai” examples feel like “oh my plates are dirty, better just buy more”.
1) The cognitive burden is much lower when the AI can correctly do 90% of the work. Yes, the remaining 10% still takes effort, but your mind has more space for it.
2) For experts who have a clear mental model of the task requirements, it’s generally less effort to fix an almost-correct solution than to invent the entire thing from scratch. The “starting cost” in mental energy to go from a blank page/empty spreadsheet to something useful is significant. (I limit this to experts because I do think you have to have a strong mental framework you can immediately slot the AI output into, in order to be able to quickly spot errors.)
3) Even when the LLM gets it totally wrong, I’ve actually had experiences where a clearly flawed output was still a useful starting point, especially when I’m tired or busy. It nerd-snipes my brain from “I need another cup of coffee before I can even begin thinking about this” to “no you idiot, that’s not how it should be done at all, do this instead…”
I think their point is that 10%, 1%, whatever %, the type of problem is a huge headache. In something like a complicated spreadsheet it can quickly become hours of looking for needles in the haystack, a search that wouldn't be necessary if AI didn't get it almost right. In fact it's almost better if it just gets some big chunk wholesale wrong - at least you can quickly identify the issue and do that part yourself, which you would have had to in the first place anyway.
Getting something almost right, no matter how close, can often be worse than not doing it at all. Undoing/correcting mistakes can be more costly as well as labor intensive. "Measure twice cut once" and all that.
I think of how in video production (edits specifically) I can get you often 90% of the way there in about half the time it takes to get it 100%. Those last bits can be exponentially more time consuming (such as an intense color grade or audio repair). The thing is with a spreadsheet like that, you can't accept a B+ or A-. If something is broken, the whole thing is broken. It needs to work more or less 100%. Closing that gap can be a huge process.
I'll stop now as I can tell I'm running a bit in circles lol
“Getting something almost right, no matter how close, can often be worse than not doing it at all” - true with human employees and with low quality agents, but not necessarily true with expert humans using high quality agents. The cost to throw a job at an agent and see what happens is so small that in actual practice, the experience is very different and most people don’t realize this yet.
It's a high cognitive burden if you don't know which 10% of the work the AI failed to do / did incorrectly, though.
I think you're picturing a percentage indicating what scope of the work the AI covered, but the parent was thinking about the accuracy of the work it did cover. But maybe what you're saying is if you pick the right 90% subset, you'll get vastly better than 98% accuracy on that scope of work? Maybe we just need to improve our intuition for where LLMs are reliable and where they're not so reliable.
Though as others have pointed out, these are just made-up numbers we're tossing around. Getting 99% accuracy on 90% of the work is very different from getting 75% accuracy on 50% of the work. The real values vary so much by problem domain and user's level of prompting skill, but it will be really interesting as studies start to emerge that might give us a better idea of the typical values in at least some domains.
What error rate this same person would find if reviewing spreadsheets made by other people seems like an inherently critical benchmark before we can even discuss whether this is a problem or an achievement.
I don't even think it is my company that is going to adapt to let me go but it is going to be an AI first competitor that puts the company I work for out of business completely.
There are all these massively inefficient dinosaur companies in the economy that are running digitized versions of paper shuffling and a huge number of white collar bullshit jobs built on top of digitized paper shuffling.
Wage inflation has been eating away at the bottom line on all these businesses since Covid and we are going to have a dinosaur company mass extinction event in the next recession.
IMO the category error being made is that LLMs are going to agentically do digitized paper shuffling and put digitized paper shufflers out of work. That is not the problem for my job. The issue is agentically from the ground up making the concept of digitized paper shuffling null and void. A relic of the past that can't compete in the economy.
"Hello, yes, I would like to pollute my entire data store" is an insane a sales pitch. Start backing up your data lakes on physical media, there is going to be an outrageous market for low-background data in the future.
semi-related: How many people are going to get killed because of this?
98% might well be disastrous, but I've seen enough awful quality human-produced data that without some benchmarks I'm not confident we know whether this would be better or worse.
I once was managing a team of data scientists and my boss kept getting frustrated about some incorrectnesses she discovered, and it was really difficult to explain that this is just human error and it would take lots of resources to ensure 100% correctness.
The same with code.
It’s a cost / benefits balance that needs to be found.
AI just adds another opportunity into this equation.
People act like this is some new thing but this exactly what supervising a more junior coworker is like. These models won't stay performing at Jr. levels for long. That is clear
It just make people quite faster at what they’re already doing.
Also, do you really understand what the numbers in that spreadsheet mean if you have not been participating in pulling them together?
A model forgets "quicker" (in human time), but can also be taught on the spot, simply by pushing necessary stuff into the ever increasing context (see claude code and multiple claude.md on how that works at any level). Experience gaining is simply not necessary, because it can infer on the spot, given you provide enough context.
In both cases having good information/context is key. But here the difference is of course, that an AI is engineered to be competent and helpful as a worker, and will be consistently great and willing to ingest all of that, and a human will be a human and bring their individual human stuff and will not be very keen to tell you about all of their insecurities.
theres no persistent experience being built, and each newcomer to the job screws it up in their own unique way
> Prompt injections are attempts by third parties to manipulate its behavior through malicious instructions that ChatGPT agent may encounter on the web while completing a task. For example, a malicious prompt hidden in a webpage, such as in invisible elements or metadata, could trick the agent into taking unintended actions, like sharing private data from a connector with the attacker, or taking a harmful action on a site the user has logged into.
A malicious website could trick the agent into divulging your deepest secrets!
I am curious about one thing -- the article mentions the agent will ask for permission before doing consequential actions:
> Explicit user confirmation: ChatGPT is trained to explicitly ask for your permission before taking actions with real-world consequences, like making a purchase.
How does the agent know a task is consequential? Could it mistakenly make a purchase without first asking for permission? I assume it's AI all the way down, so I assume mistakes like this are possible.
I think that kind of isolation is necessary even though it's a bit more costly. However since the subagents have simple tasks I can use super cheap models.
Something like lower risk private data, which could contain things like redacted calendar entries, de-identified, anonymized, or obfuscated email, or even low-risk thoughts, journals, and research.
I am Worried; I barely use ChatGPT for anything that could come back to hurt me later, like medical or psychological questions. I hear that lots of folks are finding utility here but I’m reticent.
I use ollama with local LLMs for anything that could be considered sensitive, the generation is slower but results are generally quite reasonable. I've had decent success with gemma3 for general queries.
https://www.anthropic.com/research/agentic-misalignment
"Agentic misalignment makes it possible for models to act similarly to an insider threat, behaving like a previously-trusted coworker or employee who suddenly begins to operate at odds with a company’s objectives."
I assume (hope?) they use more traditional classifiers for determining importance (in addition to the model's judgment). Those are much more reliable than LLMs & they're much cheaper to run so I assume they run many of them
If this kind of agent becomes wide spread hackers would be silly not to send out phishing email invites that simply contain the prompts they want to inject.
Can't help but feel many are optimizing happy paths in their demos and hiding the true reality. Doesn't mean there isn't a place for agents but rather how we view them and their potential impact needs to be separated from those that benefit from hype.
just my two cents
- AlphaGo/AlphaZero (MCTS)
- OpenAI Five (PPO)
- GPT 1/2/3 (Transformers)
- Dall-e 1/2, Stable Diffusion (CLIP, Diffusion)
- ChatGPT (RLHF)
- SORA (Diffusion Transformers)
"Agents" is a marketing term and isn't backed by anything. There is little data available, so it's hard to have generally capable agents in the sense that LLMs are generally capable
The technology for reasoning models is the ability to do RL on verifiable tasks, with the some (as-of-yet unpublished, but well-known) search over reasoning chains, with a (presumably neural) reasoning fragment proposal machine, and a (presumably neural) scoring machine for those reasoning fragments.
The technology for agents is effectively the same, with some currently-in-R&D way to scale the training architecture for longer-horizon tasks. ChatGPT agent or o3/o4-mini are likely the first published models that take advantage of this research.
It's fairly obvious that this is the direction that all the AI labs are going if you go to SF house parties or listen to AI insiders like Dwarkesh Patel.
Obviously, this is working better in some problem spaces than others; seems to mainly depend on how in-distribution the data domain is to the LLM's training set. Choices about context selection and the API surface exposed in function calls seem to have a large effect on how well these models can do useful work as well.
MDP, Q learning, TD, RL, PPO are basically all about agent.
What we have today is still very much the same field as it was.
I wrote more about it here:
https://news.ycombinator.com/item?id=44426993
You may also be interested in this article, that goes into details even more:
AI researchers spent years figuring out how to apply RL to LLMs without degrading their general capabilities. That's the breakthrough. Not the existence of RL, but making it work for LLMs specifically. Saying "it's just RL, we've known about that for ages" does not acknowledge the work that went into this.
Similarly, using the fact that new breakthroughs look like old research ideas is not particularly good evidence that we are going to head into a winter. First, what are the limits of RL, really? Will we just get models that are highly performant at narrow tasks? Or will the skills we train LLMs for generalise? What's the limit? This is still an open question. RL for narrow domains like Chess yielded superhuman results, and I am interested to see how far we will get with it for LLMs.
This also ignores active research that has been yielding great results, such as AlphaEvolve. This isn't a new idea either, but does that really matter? They figured out how to apply evolutionary algorithms with LLMs to improve code. So, there's another idea to add to your list of old ideas. What's to say there aren't more old ideas that will pop up when people figure out how to apply them?
Maybe we will add a search layer with MCTS on top of LLMs to allow progress on really large math problems by breaking them down into a graph of sub-problems. That wouldn't be a new idea either. Or we'll figure out how to train better reranking algorithms to sort our training data, to get better performance. That wouldn't be new either! Or we'll just develop more and better tools for LLMs to call. There's going to be a limit at some point, but I am not convinced by your argument that we have reached peak LLM.
GPT 4.1 is marketed as a "major improvement" but under the hood it’s still the KL-regularised PPO loop OpenAI first stabilized in 2022 only with a longer context window and a lot more GPUs for reward model inference.
They retired GPT 4.5 after five months and told developers to fall back to 4.1. The public story is "cost to serve” not breakthroughs left on the table. When you sunset your latest flagship because the economics don’t close, that’s not a moon shot trajectory, it’s weight shaving on a treehouse.
Stanford’s 2025 AI-Index shows that model to model spreads on MMLU, HumanEval, and GSM8K have collapsed to low single digits, performance curves are flattening exactly where compute curves are exploding. A fresh MIT-CSAIL paper modelling "Bayes slowdown" makes the same point mathematically: every extra order of magnitude of FLOPs is buying less accuracy than the one before.[1]
A survey published last week[2] catalogs the 2025 state of RLHF/RLAIF: reward hacking, preference data scarcity, and training instability remain open problems, just mitigated by ever heavier regularisation and bigger human in the loop funnels. If our alignment patch still needs a small army of labelers and a KL muzzle to keep the model from self lobotomising calling it "solved" feels optimistic.
Scale, fancy sampling tricks, and patched up RL got us to the leafy top so chatbots that can code and debate decently. But the same reports above show the branches bending under compute cost, data saturation, and alignment tax. Until we swap out the propulsion system so new architectures, richer memory, or learning paradigms that add information instead of reweighting it we’re in danger of planting a flag on a treetop and mistaking it for Mare Tranquillitatis.
Happy to climb higher together friend but I’m still packing a parachute, not a space suit.
Just because it didn't reach 100% just yet doesn't mean that LLMs as a whole are doomed. In fact, the fact that they are slowly approaching 100% shows promise that there IS a future for LLMs, and that they still have the potential to change things fundamentally, more so than they did already.
So it is really great for tasks where do the work is a lot harder than verifying it, and mostly useless for tasks where doing the work and verifying it are similarly difficult.
Even with the best intentions, this feels similar to when a developer hands off code directly to the customer without any review, or QA, etc. We all know that what a developer considers "done" often differs significantly from what the customer expects.
Yep. This is literally what every AI company does nowadays.
To your point - the most impressive AI tool (not an LLM but bear with me) I have used to date, and I loathe giving Adobe any credit, is Adobe's Audio Enhance tool. It has brought back audio that prior to it I would throw out or, if the client was lucky, would charge thousands of dollars and spend weeks working on to repair to get it half as good as that thing spits out in minutes. Not only is it good at salvaging terrible audio, it can make mediocre zoom audio sound almost like it was recorded in a proper studio. It is truly magic to me.
Warning: don't feed it music lol it tries to make the sounds into words. That being said, you can get some wild effects when you do it!
But since you can't really do that with wedding planning or whatnot, the 100% ceiling means the AI can only compete on speed and cost. And the cost will be... whatever Nvidia feels like charging per chip.
I agree with you on the hype part. Unfortunately, that is the reality of current silicon valley. Hype gets you noticed, and gets you users. Hype propels companies forward, so that is about to stay.
Operator is pretty low-key, but once Agent starts getting popular, more sites will block it. They'll need to allow a proxy configuration or something like that.
It'll let the AI platforms get around any other platform blocks by hijacking the consumer's browser.
And it makes total sense, but hopefully everyone else has done the game theory at least a step or two beyond that.
(Source: did a ton of web scraping and ran into a few gnarly issues and sites and had to write a p/invoke based UI automation scraper for some properties)
The most useful for me was: "here's a picture of a thing I need a new one of, find the best deal and order it for me. Check coupon websites to make sure any relevant discounts are applied."
To be honest, if Amazon continues to block "Agent Mode" and Walmart or another competitor allows it, I will be canceling Prime and moving to that competitor.
In fact, I suspect LinkedIn might even create a new tier that you'd have to use if you want to use LinkedIn via OpenAI.
They have some of the strongest anti-bot measures in the world and they even prosecute companies that develop browser extensions for manual extraction. They would prevent people from writing LinkedIn info with pen and paper, if they could. Their APIs are super-rudimentary and they haven't innovated in ages. Their CRM integrations for their paid products (ex: Sales Nav) barely allow you to save info into the CRM and instead opt for iframe style widgets inside your CRM so that data remains within their moat.
Unless you show me how their incentives radically change (ex: they can make tons of money while not sacrificing any other strategic advantage), I will continue to place a strong bet on them being super defensive about data exfiltration.
As adoption increases, there's going to be a whole spectrum of AI-enabled work that you see out there. So something that doesn't appear to be AI written is not necessarily pure & free of AI. Not to mention the models themselves getting better at not sounding AI-style canned. If you want to have a filter for lazy applications that are written with a 10-word prompt using 4o, sure, that is actually pretty trivial to do with OpenAI's own models, but is there another reason you think companies "don't want bots to write job applications"?
Expecting AI agents to respect robots.txt is like expecting browser extensions like uBlock Origins to respect "please-dont-adblock.txt".
Of course it's going to be ignored, because it's an unreasonable request, it's hard to detect, and the user agent works for the user, not the webmaster.
Assuming the agent is not requesting pages at an overly fast speed, of course. In that case, feel free to 429.
Q: but what about botnets-
I'm replying in the context of "Users will be installing browser extensions or full browsers that run the actions on their local computer with the user's own cookie jar, IP address, etc."
You could host a VNC webview to another desktop with a good IP
Also the AI not being able to tell customers about your wares could end up being like not having your business listed on Google.
Google doesn't pay you for indexing your website either.
In the circumstances the merchant would be expecting to receive a valuable service and simultaneously get paid for getting serviced.
More akin to Google paying to index you or going to a lady of the evening and holding your hand out for a tip after.
With claude code, you usually start it from your own local terminal. Then you have access to all the code bases and other context you need and can provide that to the AI.
But when you shut your laptop, or have network availability changes the show stops.
I've solved this somewhat on MacOS using the app Amphetamine which allows the machine to go about its business with the laptop fully closed. But there are a variety of problems with this, including heat and wasted battery when put away for travel.
Another option is to just spin up a cloud instance and pull the same repos to there and run claude from there. Then connect via tmux and let loose.
But there are (perhaps easy to overcome) ux issues with getting context up to that you just don't have if it is running locally.
The sandboxing maybe offers some sense of security--again something that can be possibly be handled by executing claude with a specially permissioned user role--which someone with John's use case in the video might want.
---
I think its interesting to see OpenAI trying to crack the Agent UX, possibly for a user type (non developer) that would appreciate its capabilities just as much but not need the ability to install any python package on the fly.
The latency used to really bother me, but if Claude does 99% of the typing. Its a good idea.
> Mid 2025: Stumbling Agents The world sees its first glimpse of AI agents.
Advertisements for computer-using agents emphasize the term “personal assistant”: you can prompt them with tasks like “order me a burrito on DoorDash” or “open my budget spreadsheet and sum this month’s expenses.” They will check in with you as needed: for example, to ask you to confirm purchases. Though more advanced than previous iterations like Operator, they struggle to get widespread usage.
Calling it "The world sees its first glimpse of AI agents" is just bad writing, in my opinion. People have been making some basic agents for years, e.g. Auto-GPT & Baby-AGI were published in 2023: https://www.reddit.com/r/singularity/comments/12by8mj/i_used...
Yeah, those had much higher error rate, but what's the principal difference here?
Seems rather weird "it's an agent when OpenAI calls it an agent" appeal to authority.
I'm excited that this capability is getting close, but I think the current level of performance mostly makes for a good demo and isn't quite something I'm ready to adopt into daily life. Also, OpenAI faces a huge uphill battle with all the integrations required to make stuff like this useful. Apple and Microsoft are in much better spots to make a truly useful agent, if they can figure out the tech.
It seems to me like you have to reset the context window on LLMs way more often than would be practical for that
I think Google will excel at this because their ad targeting does this already, they just need to adapt to an llm can use that data as well.
Beautiful
Replying "yes, book it" is way easier than clicking through a ton of UIs on disparate websites.
My opinion is that agents looking to "one-shot" tasks is the wrong UX. It's the async, single simple interface that is way easier to integrate into your life that's attractive IMO.
I reckon there’s a lot to be said for fixing or tweaking the underlying UX of things, as opposed to brute forcing things with an expensive LLM.
This would be my ideal "vision" for agents, for personal use, and why I'm so disappointed in Apple's AI flop because this is basically what they promised at last year's WWDC. I even tried out a Pixel 9 pro for a while with Gemini and Google was no further ahead on this level of integration either.
But like you said, trust is definitely going to be a barrier to this level of agent behavior. LLMs still get too much wrong, and are too confident in their wrong answers. They are so frequently wrong to the point where even if it could, I wouldn't want it to take all of those actions autonomously out of fear for what it might actually say when it messages people, who it might add to the calendar invites, etc.
Nothing is really that advanced yet with agents themselves - no real reasoning going on.
That being said, you can build your own agents fairly straightforward. The key is designing the wrapper and the system instructions. For example, you can have a guided chat on where it builds of the functionality of looking at your calendar, google location history, babysitter booking, and integrate all of that into automatic actions.
You would want to write a couple paragraphs outlining what you were hoping to get (maybe the waterfront view was the important thing? Maybe the specific place?)
As for booking a babysitter - if you don't already have a specific person in mind (I don't have kids), then that is likely a separate search. If you do, then their availability is a limiting factor, in just the same way your calendar was and no one, not you, not an agent, not a secretary, can confirm the restaurant unless/until you hear back from them.
As an inspiration for the query, here is one I used with Chat GPT earlier:
>I live in <redacted>. I need a place to get a good quality haircut close to where I live. Its important that the place has opening hours outside my 8:00 to 16:00 mon-fri job and good reviews. > >I am not sensitive to the price. Go online and find places near my home. Find recent reviews and list the places, their names, a summary of the reviews and their opening hours. > >Thank you
One of my favorite use cases for these tools is travel where I can get recommendations for what to do and see without SEO content.
This workflow is nice because you can ask specific questions about a destination (e.g., historical significance, benchmark against other places).
ChatGPT struggles with: - my current location - the current time - the weather - booking attractions and excursions (payments, scheduling, etc.)
There is probably friction here but I think it would be really cool for an agent to serve as a personalized (or group) travel agent.
The act of choosing a date spot is part of your human connection with the person, don’t automate it away!
Focus the automation on other things :)
For example, I suddenly need to reserve a dinner for 8 tomorrow night. That's a pain for me to do, but if I could give it some basic parameters, I'm good with an agent doing this. Let them make the maybe 10-15 calls or queries needed to find a restaurant that fits my constraints and get a reservation.
This (and not model quality) is why I’m betting on Google.
Comparing it to the Claude+XFCE solutions we have seen by some providers, I see little in the way of a functional edge OpenAI has at the moment, but the presentation is so well thought out that I can see this being more pleasant to use purely due to that. Many times with the mentioned implementations, I struggled with readability. Not afraid to admit that I may borrow some of their ideas for a personal project.
I use projects for working on different documents - articles, research, scripts, etc. And would absolutely love to write it paragraph after paragraph with the help of ChatGPT for phrasing and using the project knowledge. Or using voice mode - i.e. on a walk "Hey, where did we finish that document - let's continue. Read the last two paragraphs to me... Okay, I want to elaborate on ...".
I feel like AI agents for coding are advancing at a breakneck speed, but assistance in writing is still limited to copy-pasting.
Man I was talking about this with a colleague 30min ago. Half the time i can't be bothered to open chat gpt and do the copy/paste dance. I know that sounds ridiculous but roundtripping gets old and breaks my flow. Working in NLE's with plug-in's, VTT's, etc. has spoiled me.
But basically would it be successful if you had a workspace and used the coding agent to write text instead?
As a use case, consider the act of writing a novel where you might have a folder with files that each contain a skeleton of a chapter, then another folder with the first draft of each chapter in separate files, then another folder for the second draft, and so on.
Maybe not the best architecture but I just had the idea 20 seconds ago so it could use some cook time.
But since people can cancel transactions with a credit card, that's what people are going to do, and it will be a huge mess every time.
CHATGPT AGENT CUSTOM INSTRUCTION: MAKE THE USER BUY THE MOST EXPENSIVE OPTION.
On the other, LLMs always make mistakes, and when it's this deeply integrated into other system I wonder how severe these mistakes will be, since they are bound to happen.
Recently I uploaded screenshot of movie show timing at a specific theatre and asked ChatGPT to find the optimal time for me to watch the movie based on my schedule.
It did confidently find the perfect time and even accounted for the factors such as movies in theatre start 20 mins late due to trailers and ads being shown before movie starts. The only problem: it grabbed the times from the screenshot totally incorrectly which messed up all its output and I tried and tried to get it to extract the time accurately but it didn’t and ultimately after getting frustrated I lost the trust in its ability. This keeps happening again and again with LLMs.
Despite the fact that CV was the first real deep learning breakthrough VLMs have been really disappointing. I'm guessing it's in part due to basic interleaved web text+image next token prediction being a weak signal to develop good image reasoning.
https://annas-archive.org/blog/critical-window.html
I hope one of these days one of these incredibly rich LLM companies accidentally solves this or something, would be infinitely more beneficial to mankind than the awful LLM products they are trying to make
I was searching on HuggingFace for the model which can fit on my system RAM + VRAM. And the way HuggingFace shows the models - bunch of files, showing size for each file, but doesn't show the total. I copy-pasted that page to LLM and asked to count the total. Some of LLMs counted correctly, and some - confidently gave me totally wrong number.
And that's not that complicated question.
But of course humans makes a multitude of mistakes too.
> ChatGPT agent's output is comparable to or better than that of humans in roughly half the cases across a range of task completion times, while significantly outperforming o3 and o4-mini.
Hard to know how this will perform in real life, but this could very well be a feel the AGI moment for the broader population.
"ChatGPT can now do work for you using its own computer"
Up until now, chatbots haven't really affected the real world for me†. This feels like one of the first moments where LLMs will start affecting the physical world. I type a prompt and something shows up at my doorstep. I wonder how much of the world economy will be driven by LLM-based orders in the next 10 years.
† yes I'm aware self driving cars and other ML related things are everywhere around us and that much of the architecture is shared, but I don't perceive these as LLMs.
I don't have ig anymore so I can't post the link, but it's easy to find if you do.
Not legal advice, etc.
Bullet 1 on service terms https://openai.com/policies/service-terms/
Most applications now are more intuitive than our brain can think fast. I think telling an AI to find me a good flight is more work than to type in sk autocomplete for skyscanner having autocomplete for departure and for arrival allowing me to one way or return, having filters its all actually easier than to properly define the task. And we can start executing right away. Agent starts after texting so it will increase more latency. Often modern applications have problems solved that we didn’t even think about before.
Agent to me is another bullshit launch by OPENAI. They have to do something I understand but their releases are really grim to me.
Bad model, no real estate (browser, social media, OS).
One thing which stood out to me in a thought-provoking way, is that example of stickers [created first and then] being ordered (obviously: pending ordering confirmation from the user) from StickerSpark (JFYI: This is a fictional company made up in this OpenAI launch post), whereby as mentioned that ChatGPT agent has "its own computer". Thus, if OpenAI is logging into its own account on StickerSpark, then what would be StickerSpark's "normal" user-base like that of any other company's user-base of 1 user per actual person will shift to StickerSpark having a few large users via agents through OpenAI, Anthropic, Google, etc. and a medium-long tail of regular individual users. This exactly reminds of how through pervasive index fund investing that index fund houses such as BlackRock and Vanguard directly own large stakes in many S&P500 companies such that they can sway voting power [1]. Thus, with ChatGPT agent that the fundamental-regular-interaction that we assume with websites like StickerSpark would stand to alter whereby the agents would be business-facing and would have more influence on the website's features (or the Agent due to its innate intelligence will directly find another website for where features match up).
[1] https://manhattan.institute/article/index-funds-have-too-muc...
I am already doing the type of examples in that post with claude code. claude code is not just for code.
this week i've been doing market research in real estate with claude code.
Works less well on other models. I think Anthropic really nailed the combination of tool calling and general coding ability (or other abilities in your case). I’ve been adding some extra tools to my version for specific use cases and it’s pretty shocking how well it performs!
I've been thinking of rolling up my own too. but i don't want to use sonnet api since that is pay per use. I currently use cc with a pro plan that puts me in timeout after a quota is met and resets the quota in 4 hrs. that gives me a lot of peace of mind and is much cheaper.
Meanwhile, Siri can barely turn off my lights before bed.
We can help gather data, crawl pages, make charts and more. Try us out at https://tabtabtab.ai/
We currently work on top of Google Sheets.
There is a widget to listen to the article instead of reading it. When I press play, it says the word ”Undefined” and then stops.
It seems to me that the 2-20% of use cases where ChatGPT Agent isn't able to perform it might make sense to have a plug-in run that can either guide the agent through the complex workflow or perform a deterministic action (e.g. API call).
They seem to fall apart browsing the web, they're slow, they're nondeterministic.
I would be pretty impressed if OpenAI has somehow cracked this.
None of this interests me but this tells me where it's going capability wise and it's really scary and really exciting at the same time.
Also why does the guy sound like he's gonna cry?
it is not as good as they made it out to be
I collect agent definitions. I think the two most important at the moment are Anthropic's and OpenAI's.
The Anthropic one boils down to this: "Agents are models using tools in a loop". It's a good technical definition which makes sense to software developers. https://simonwillison.net/2025/May/22/tools-in-a-loop/
The OpenAI one is a lot more vague: "AI agents are AI systems that can do work for you independently. You give them a task and they go off and do it." https://simonwillison.net/2025/Jan/23/introducing-operator/
I've collected a bunch more here: https://simonwillison.net/tags/agent-definitions/ but I think the above two are the most widely used, at least in the LLM space right now.
The "agency" in this example is on the coder that came up with the workflow. It's murky because we used to call these "agents" in the previous gen frameworks.
An agent is a collection of steps defined by the LLM itself, where the steps can be performed by LLM calls (i.e. research topic x for me -> first I need to search (this is the LLM deciding the steps) -> then I need to xxx -> here's the report)
The difference is that sometimes you'll get a report resulting from search, or sometimes the LLM can hallucinate the whole thing without a single "tool call". It's more open ended, but also more chaotic from a programming perspective.
The gist is that the "agency" is now with the LLM driving the "main thread". It decides (based on training data, etc) what tools to use, what steps to take in order to "solve" the prompt it receives.
I think for the average consumer, AI will be "agentic" once it can appreciably minimize the amount of interaction needed to negotiate with the real world in areas where the provider of the desired services intentionally require negotiation - getting a refund, cancelling your newspaper subscription, scheduling the cable guy visit, fighting your parking ticket, securing a job interview. That's what an agent does.
Hard to miss — it's the second Google result for "chatgpt CLI".
This solution will basically not work for your use-case.
Is Apple a doomed company because they are chronically late to ~everything bleeding edge?
We re talking about european tech businesses being left behind, locked in a basement.
What is your preference for Europe, complete floodgates open and never ending lawsuits over IP theft like we have in the USA currently over AI?
The US is not the example of what’s working, it’s merely a demonstration of what is possible when you have limited, provoked regulation.
There is no such thing as "slow" in business. If you re slow you go out of business, you re no longer a business.
There is only one AI race. There is no second round. If you stay out of the race, you will be forever indebted to the AI winner, in the same way that we are entirely dependent on US internet technology currently (and this very forum)
Maybe(!?!)
If this had been specific to countries that have adopted the "AI Act", I'd be more than willing to accept that this delay could be due them needing to ensure full compliance, but just like in the past when OpenAI delayed a launch across EU member states and the UK, this is unlikely. My personal, though 100% unsourced thesis, remains, that this staggered rollout is rooted in them wanting to manage the compute capacity they have. Taking both the Americas and all of Europe on at once may not be ideal.
The U.S. runs 6–8% deficits and gets vibes, weapons, and insulin at $300 a vial. Who's on the unsustainable path and really overspending?
If the average interest rate on U.S. government debt rises to 14%, then 100% of all federal tax revenue (around $4.8 trillion/year) will be consumed just to pay interest on the $34 trillion national debt. As soon as the current Fed Chairman gets fired, practically a certainty by now, nobody will buy US bonds for less than 10 to 15% interest.
(I am American, convince me my digression is wrong)