The latest "meta" in AI programming appears to be agent teams (or swarms or clusters or whatever) that are designed to run for long periods of time autonomously.
Through that lens, these changes make more sense. They're not designing UX for a human sitting there watching the agent work. They're designing for horizontally scaling agents that work in uninterrupted stretches where the only thing that matters is the final output, not the steps it took to get there.
That said, I agree with you in the sense that the "going off the rails" problem is very much not solved even on the latest models. It's not clear to me how we can trust a team of AI agents working autonomously to actually build the right thing.
As soon as you start to work with a codebase that you care about and need to seriously maintain, you'll see what a mess these agents make.
The practical and opportunistic response is to tell them "Tough cookies" and watch the problems steadily compound into more lucrative revenue opportunities for us. I really have no remorse for these people, because half of them were explicitly warned against this approach upfront but were psychologically incapable of adjusting expectations or delaying LLM deployment until the technology proved itself. If you've ever had your professional opinion dismissed by the same people who regard you as the SME, you understand my pain.
I suppose I'm just venting now. While we are now extracting money from the dumbassery, the client entitlement and the emotional management that come with putting out these fires never make for a good time.
At what point do we realize that the best way to prompt is with formal language? I.e. a programming language?
Now that's a scary thought that basically goes against "1 trillion dollars can't be wrong".
Now, LLMs are probably great range extenders, but they're not wonder weapons.
E.g. I use these tools to clean up or reorganize old tests (with coverage and diff viewers checking for things I might miss), update documentation with cross-links (with documentation linters checking for errors I miss), convert tests into benchmarks running as part of CI, make log-file visualizers, and more.
These tools are amazing for dealing with the long tail of boring issues that you never get to, and when used in this fashion they actually abruptly increase the quality of the codebase.
But any time someone mentions using AI without proof of success? Vibe coding sucks.
I concur it is different from what you call vibecoding.
Despite being soul-sucking, I do it because A: it lets me achieve goals, despite lacking energy/time, on projects that don't require the level of commitment or care I provide professionally. B: it reduces how much RSI I experience. Typing is a serious concern for me these days.
To mitigate the soul-sucking, I've been side-projecting better review tools, which frankly I could use for work anyway, as reviewing PRs from humans could be better too. Also, in line with review tools, I think a lot of the soul-sucking is having to provide specificity, so I hope to integrate LLMs into the review tool and speak to it more naturally. E.g. I believe some IDEs (VS Code? no idea) can let Claude/etc. see the cursor, so you can say "this code looks incorrect" without needing to be extremely specific. A suite of tooling that improves this code sharing with Claude/etc. would also reduce the inane specificity that seems to be required to make LLMs even remotely reliable for me.
[1]: though we don't seem to have a term for varying amounts of vibe. Some people consider vibe to be 100% complete ignorance of the architecture/code being built, in which case IMO nothing I do is vibe, which is absurd to me, but I digress.
What you are doing is by definition not vibe coding.
We use agents very aggressively, combined with beads, tons of tests, etc.
You treat them like any developer, and review the code in PRs, provide feedback, have the agents act, and merge when it's good.
We have gained tremendous velocity and have been able to tackle far more out of the backlog that we'd been forced to keep in the icebox before.
This idea of setting the bar at "agents work without code reviews" is nuts.
I'm seeing amazing results too with agents, when they're provided a well-formed knowledge base and directed through each piece of work like it's a sprint: review and iron out scope requirements and the API surface/contract, have the agents create multi-phase implementation plans and technical specifications in a shared dev directory, keep high-quality change logs, and document future considerations and any bugs/issues found that can be deferred. Every phase gets a human code review, along with Gemini, which is great at catching drift from spec and bugs in less obvious places.
While I'm sure an enterprise codebase could still be an issue and would require even more direction (and I won't let Opus touch Java; it codes like an enterprise Java greybeard who loves to create an interface/factory for everything), I think that's still just a tooling issue.
I'm not in the super pro-AI camp, but I have followed its development and used it throughout, and for the first time I am actually amazed and bothered, and convinced that if people don't embrace these tools, they will be left behind. No, they don't 10-100x a junior dev, but if someone has proper domain knowledge to direct the agent and does the research alongside it to iron things out, with the human actually understanding the problem space, 2-5x seems quite reasonable currently when driven by a capable developer. But this just moves the work to review and documentation maintenance/crafting, which has its own fatigue and is less rewarding for a programmer's mind that loves to solve challenges and gets dopamine from doing so.
But given how many people are averse... I don't think anyone who embraces it is going to have job-security issues and be replaced, but there are many capable engineers who might, due to their own reservations. I'm amazed by how many intelligent and capable people treat LLMs/agents like a political straw man; there is no reasoning with them. They say vibe coding sucks (it does, for anything more than a small throwaway that won't be maintained), yet their example of agents/LLMs not working is that they can't just take a prompt, produce the best code ever, and automatically manifest the knowledge needed to work on their codebase. You still need to put in effort and learn to actually perform the engineering with the tools, but if they don't take a paragraph with no AGENTS.md and turn it into a feature or bug fix, the tools are no good to them. Yeah, they will get distracted and fuck up, just like 9/10 developers would if you threw them into the same situation with no knowledge of the codebase or domain and told them to have their PR in by noon.
I know people have emotional responses to this, but if you think people aren’t effectively using agents to ship code in lots of domains, including existing legacy code bases, you are incorrect.
Do we know exactly how to do that well, of course not, we still fruitlessly argue about how humans should write software. But there is a growing body of techniques on how to do agent first development, and a lot of those techniques are naturally converging because they work.
This is not to suggest that AI tools do not have value, but that “I just have agents writing code and it works great!” has yet to be put to the test.
I get it; I do. It's rapidly challenging the paradigm that we've set up over the years in a way that is incredibly jarring, but this is going to be our new reality, or you're going to be left behind in MOST industries; highly regulated industries are a different beast.
So, instead of dismissing this out of hand, figure out the best ways to integrate agents into your and your teams'/companies' workstreams. It will accelerate the work and change your role from what it is today to something different; something that takes time and experience to work with.
But that's not the argument. The argument is that these tools produce lower-quality output and that checking this output often takes more time than doing the work oneself. It's not that "we're conservative and afraid of change"; heck, you're talking to a crowd that used to celebrate a new JS framework every week!
There is a push to accept lower quality and to treat it as a new normal, and people who appreciate high-quality architecture and code express their concern.
This doesn't hurt to try and will give valuable and detailed feedback much more quickly than even an experienced developer seeing the project for the first time.
> It will accelerate the work and change your role from what it is today to something different;
We have yet to see if different is good. My short experience with an LLM reviewing my code is that its output is overly explanatory and it slows me down.
> something that takes time and experience to work with.
So you invite us to participate in the sunk cost fallacy. I'm available for consulting when you need something done correctly.
I've been using LLMs to augment development since early December 2023. I've expanded the scope and complexity of the changes made since then as the models grew. Before beads existed, I used a folder of markdown files for externalized memory.
Just because you were late to the party doesn't mean all of us were.
It wasn't a party I liked back in 2023. I'm just repeating the same stuff I see said over and over again here, but there has been a step change with Opus 4.5.
You can still see it in action now, because the other models are still where Opus was a while ago. I recently needed to make a small change to a script I was using. It is a tiny (50-line) script written with the help of AIs ages ago, but it was subtly wrong in so many ways. It's now clear that neither the AIs (I used several and cross-checked) nor I had a clue what we were dealing with. The current "seems to work" version was created after much blood was spilt over misunderstandings, exposing bugs that had to be fixed.
I asked Claude 4.6 to fix yet another misunderstanding, and the result was a patch changing the minimum number of lines to get the job done. Just reviewing such a surgical modification was far easier than doing it myself.
I gave exactly the same prompt to Gemini. The result was a wholesale rearrangement of the code. Maybe it was good, but the effort to verify that was far larger than just doing it myself. It was a very 2023 experience.
The usual 2023 experience for me was asking an AI to write some greenfield code and getting a result that looked like someone had changed the variable names in something they found on the web after a brief search for code that looked like it might do a similar job. If you got lucky, it might have found something that was indeed very similar, but in my case that was rare. Asking it to modify code unlike anything it had seen before was like asking someone to poke your eyes with a stick.
As I said, some of the organisers of this style of party seem to have gotten their act together, so now it is well worth joining their parties. But this is a newish development.
This may be a result of me using the tools poorly, or more likely of me weighing merits that matter less than I think. But I don't think we can tell yet, since people have only just invented these agent workflows.
Note that the situation was not that different before LLMs. I've seen PMs with all the tickets set up, engineers making PRs with reviews, etc., and still no progress on the product. The process can be emulated without substantive work.
Source? Proofs? It's not the first, second or even third round on this rodeo.
In other words, notto disu shittu agen.
Not to say that there's no value in AI written code in these codebases, because there is plenty. But this whole thing where 6 agents run overnight and "tada" in the morning with production ready code is...not real.
Similarly, a lot of the AGI-hype comments exist to expand the scope of the space. It's not real, but it helps to position products and win arguments based on hypotheticals.
Proprietary embedded system documentation is not exactly ubiquitous. You must provide reference material and guardrails where the training is weakest.
This applies to everything in ML: it will be weakest at the edges.
You can get extremely good results assuming your spec is actually correct (and you're willing to chew through massive quantities of tokens / wait long enough).
Is it ever the case that the spec is entirely correct (and without underspecified parts)? I thought the reason we write code is because it's much easier to express a spec as code than it is to get a similar level of precision in prose.
The bots even now can really help you identify technical problems / mistakes / gaps / bad assumptions, but there's no replacing "I know what the business wants/needs, and I know what makes my product manager happy, and I know what 'feels' good" type stuff.
Also known as "compiling source code".
No pesky developers siphoning away equity!
EDIT: fixed typo
But in any case, we're definitely coming up on the need for that.
The Bing AI summary tells me that AI companies invested $202.3 billion in AI last year. Users are going to have to pay that back at some point. This is going to be even worse as a cost control situation than AWS.
That’s not how VC investments work. Just because something costs a lot to build doesn’t mean that anyone will pay for it. I’m pretty sure I haven’t worked for any startup that ever returned a profit to its investors.
I suspect you are right in that inference costs currently seem underpriced, so users will get nickel-and-dimed for a while until the providers can leverage a better margin per user.
Some of the players are aiming for AGI. If they hit that goal, the cost is easily worth it. The remaining players are trying to capture market share and build a moat where none currently exists.
Yes, currency is at times exchanged at a loss for power, but very rarely not for more currency down the road.
In operator/supervisor mode (interactive CLI), you need high-signal observability while it’s running so you can abort or re-scope when it’s reading the wrong area or compounding assumptions. In batch/autonomous mode (headless / “run overnight”), you don’t need a live scrollback feed, but you still need a complete trace for audit/debug after the fact.
Collapsing file paths into counters is a batch optimization leaking into operator mode. The fix isn’t “verbose vs not” so much as separating channels: keep a small status line/spine (phase, current target, last tool call), keep an event-level trace (file paths / commands / searches) that’s persisted and greppable, and keep a truly-verbose mode for people who want every hook/subagent detail.
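A minimal sketch of that channel separation, under the assumption of a hypothetical agent loop (the event names, file layout, and functions here are illustrative, not Claude Code's actual internals):

```python
import json
import sys
import time

TRACE_PATH = "agent-trace.jsonl"  # persisted, greppable event-level trace


def record_event(kind, detail):
    """Append every file read / command / search to the trace file."""
    event = {"ts": time.time(), "kind": kind, "detail": detail}
    with open(TRACE_PATH, "a") as f:
        f.write(json.dumps(event) + "\n")


def update_status_line(phase, target, last_tool):
    """Compact operator-facing spine: overwrite a single terminal line."""
    sys.stdout.write(f"\r[{phase}] {target} (last: {last_tool})\033[K")
    sys.stdout.flush()


# The operator watches one moving status line, while the full event stream
# lands in agent-trace.jsonl for audit/debug after the fact.
for path in ["src/app.py", "src/db.py", "tests/test_db.py"]:
    record_event("read_file", path)
    update_status_line("exploring", path, "read_file")
```

A truly-verbose mode is then just a matter of tailing the trace file alongside (or instead of) the status line.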
All the more reason to catch them early; otherwise we have to wait even longer. In fact, hiding would be more correct if the AI were less autonomous, right?
What fills the holes are best practices, what can ruin the result is wrong assumptions.
I don't see how full autonomy can work either without checkpoints along the way.
And at the end of the day it's not the agents who are accountable for the code running in the production. It's the human engineers.
Still makes this change from Anthropic stupid.
If a singular agent has a 1% chance of making an incorrect assumption, then 10 agents have that same 1% chance in aggregate.
I can attest that it works well in practice, and my organization is already deploying this technique internally.
This is one example of an orchestration workflow. There are others.
> Then spin up a new agent to decide which approach of the three is best on the merits. Repeat this analysis in fresh contexts and sample until there is clear consensus on one.
If there are several agents doing the analysis of solutions, how do you define a consensus? Should it be unanimous or above some threshold? Are the agents' scores soft or hard? How is the threshold defined if scores are soft? There is a whole lot of science in voting approaches; which voting approach is best here? Is it possible for the analyzing agents to choose the best of several wrong solutions? E.g., the longest remembered table of FizzBuzz answers amongst remembered tables of FizzBuzz answers.
To me, our discussion shows that what you presented as a simple thing is not simple at all: even the voting is complex, and actually getting a good result is so hard it warrants omitting an answer altogether.
Okay then, agentic coding is nothing but a complex task requiring knowledge of unbiased voting (what is that thing, really?) and, apparently, the use of a necessarily heavy test suite and/or theorem provers.
This is NOT the same as asking “are you sure?” The sycophantic nature of LLMs would make them biased on that. But fresh agents with unbiased, detached framing in the prompt will show behavior that is probabilistically consistent with the underlying truth. Consistent enough for teasing out signal from noise with agent orchestration.
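As a rough sketch of what "sample fresh contexts until there is clear consensus" can look like in practice, here is plain majority voting with a threshold; `ask_fresh_agent` is a stand-in for spinning up a new agent with detached framing, not a real API:

```python
import random
from collections import Counter


def ask_fresh_agent(question):
    """Stand-in: in practice, start a new agent with a clean context and a
    neutral, detached framing of the question, and return its verdict."""
    return random.choice(["approach A", "approach A", "approach B"])


def consensus(question, min_samples=5, max_samples=15, threshold=0.7):
    votes = Counter()
    for n in range(1, max_samples + 1):
        votes[ask_fresh_agent(question)] += 1
        if n >= min_samples:
            winner, count = votes.most_common(1)[0]
            if count / n >= threshold:  # clear consensus reached
                return winner, count / n
    return None, 0.0  # no clear consensus: escalate to a human


print(consensus("Which of the three approaches is best on the merits?"))
```

The threshold and sample counts are knobs, not gospel; the point is only that independent, detached samples let you separate signal from noise.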
What is Codex doing differently to solve for this problem?
Even in that case they should still be logging what they're doing for later investigation/auditing if something goes wrong. Regardless of whether a human or an AI ends up doing the auditing.
As tedious as it is a lot of the time (and I wish there were an in-between "allow this session", not just allow once or "allow all"), it's invaluable for catching when the model has tried to fix the problem in entirely the wrong project.
Working on a monolithic code-base with several hundred library projects, it's essential that it doesn't start digging in the wrong place.
It's better than it used to be, but the failure mode can be extreme; I've come back to 20+ minutes of it going around in circles, frustrating itself because of a wrong meaning it ascribed to an instruction.
https://code.claude.com/docs/en/settings#permission-settings
You can configure it at the project level
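As a sketch of what that amounts to (the keys below are from memory of the linked docs, so verify the exact schema there before relying on it), pre-approving or blocking specific actions at the project level is just a small settings file checked into the repo:

```python
import json
import pathlib

# Assumed schema: permission rules of the form Tool(pattern), per the docs above.
settings = {
    "permissions": {
        "allow": ["Read(./**)", "Bash(git diff:*)"],  # pre-approved actions
        "deny": ["Bash(rm:*)"],                        # always refused
    }
}

path = pathlib.Path(".claude/settings.json")  # project-level settings file
path.parent.mkdir(exist_ok=True)
path.write_text(json.dumps(settings, indent=2))
```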
If you have an unlimited budget, obviously you will tend to let it run and correct it in the next iteration.
If you often run right up against your 5-hour window, you're going to be more likely to babysit it.
Since it's just reading at that stage there's no tracked changes.
There are three separate layers here:
1. What the model internally computes
2. What the product exposes to the user
3. What developers need for debugging and control
Most outrage conflates all three.
Exposing raw reasoning tokens sounds transparent, but in practice it often leaks messy intermediate steps, half-formed logic, or artifacts that were never meant to be user-facing. That doesn’t automatically make a product more trustworthy. Sometimes it just creates noise.
The real issue is not whether internal thoughts are hidden. It’s whether developers can:
• Inspect tool calls
• See execution traces
• Debug failure modes
• Reproduce behavior deterministically
If those are restricted, that’s a serious product problem. If what’s being “hidden” is just chain-of-thought verbosity, that’s a UI decision, not deception.
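To make that concrete, here is a toy sketch of the kind of operational hook I mean: record each tool call to a trace you can inspect, and optionally replay recorded results for deterministic reproduction. Nothing here is any vendor's real API:

```python
import functools
import json

TRACE = []   # execution trace; could be persisted as JSONL for later inspection
REPLAY = {}  # call-signature -> recorded result, for deterministic reruns


def traced_tool(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        key = json.dumps([fn.__name__, args, kwargs], sort_keys=True, default=str)
        if key in REPLAY:  # reproduce a past run exactly
            return REPLAY[key]
        result = fn(*args, **kwargs)
        TRACE.append({"tool": fn.__name__, "args": args,
                      "kwargs": kwargs, "result": result})
        return result
    return wrapper


@traced_tool
def run_search(query):
    # stand-in for a real tool call (web search, grep, etc.)
    return f"results for {query!r}"


run_search("verbose mode flag")
print(json.dumps(TRACE, indent=2, default=str))
```

The replay map is the piece that buys deterministic reproduction; the trace alone already covers inspection and debugging.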
There’s also a business angle people don’t want to acknowledge. As models become productized infrastructure, vendors will protect internal mechanics the same way cloud providers abstract away hardware-level details. Full introspection is rarely a permanent feature in mature platforms.
Developers don’t actually want full transparency. They want reliability and control. If the system behaves predictably and exposes the right operational hooks, most people won’t care about hidden internal tokens.
The real question is: where should the abstraction boundary sit for a developer tool?
How to comply with a demand to show more information by showing less information.
It's just a whole new world where words suddenly mean something completely different, and you can no longer understand programs by just reading what labels they use for various things; you also need to look up whether what they think "verbose" means matches the meaning you've built up an understanding of first.
EDIT: Ah, looks like verbose mode might show less than it used to, and you need to use a new mode (^o) to show very verbose.
I didn't know about the ^o mode though, so good that the verbose information is at least still available somewhere. Even though now it seems like an enormously complicated maneuver with no purpose.
What I think they are forgetting in this silly stubbornness is that competition is really fierce, and just as they have gained appreciation from developers, they might very quickly lose it because of this sort of stupidity (for no good reason).
Do they believe that owning the harness (Claude Code) itself will lead to significantly more money? I can sort of see that, but I wouldn't think they are necessarily betting on it?
I use Anthropic's models wherever, whenever I can, be it cursor, copilot, you name it. I can't stand Claude Code for some reason, but I'll kill for those models.
On the other hand, I've seen some non-tech people have their "Holy shit!" moment with Claude Co-work (which I personally haven't tried yet) — and that's a market I can see them wanting to hold on to in order to branch out of the dev niche. The same moment happened when they tried the Excel integration — they were completely mind-blown.
[0] https://generativeai.pub/cursors-pricing-change-sparks-outra...
You have to go into /models then use the left/right arrow keys to change it. It’s a horrible UI design and I had no idea mine was set to high. You can only tell by the dim text at the bottom and the 3 potentially highlighted bars.
On high, it would think for 30+ minutes and make a plan; then, when I started the plan, it would either compact and reread all my files, or start fresh and read my files, then compact after 2-3 changes and reread the files.
High reasoning is unusable with Opus 4.6 in my opinion. They need at least 1M context for this to work.
I still think it’d be nice to allow an output mode for you folks who are married to the previous approach since it clearly means a lot to you.
First, I agree with most commentators that they should just offer 3 modes of visibility: "default", "high", "verbose" or whatever
But I'm with you that this mode of working where you watch the agent work in real-time seems like it will be outdated soon. Even if we're not quite there, we've all seen how quickly these models improve. Last year I was saying Cursor was better because it allowed me to better understand every single change. I'm not really saying that anymore.
How do you do this? Do you follow traditional testing practices or do you have novel strategies like agents with separate responsibilities?
Curious what plans you’re using? running 24/7 x 5 agents would eat up several $200 subscriptions pretty fast
The way Claude does research has dramatically changed for the worse. Instead of piping through code logically, it's now spawning dozens of completely unrelated research threads to look at simple problems. I let it spin for over 30 minutes last night before realizing it was just "lost".
I have since been looking for these moments and killing it immediately. I tell Claude "just look at the related code" and it says, "sorry I'll look at this directly".
WTF Anthropic?
Although, this post surely isn't "news" as much as it is, as you said, a summary of a conversation being held on other platform(s).
So maybe it is just a blog post?
It breaks a spec (or freeform input) down into a structured json plan, then kicks off a new non-interactive session of Claude or codex for each task. Sounds like it could fit your workflow pretty well.
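Roughly the shape of that, as a sketch. The plan format below is made up, and I'm assuming `claude -p` as the non-interactive invocation; substitute whatever headless command your harness uses:

```python
import json
import subprocess

# Hypothetical structured plan produced from a spec or freeform input.
plan = json.loads("""
{
  "tasks": [
    {"id": 1, "prompt": "Add input validation to the /signup handler"},
    {"id": 2, "prompt": "Write unit tests covering the new validation"}
  ]
}
""")

for task in plan["tasks"]:
    # One fresh, non-interactive session per task; output captured for review.
    result = subprocess.run(
        ["claude", "-p", task["prompt"]],
        capture_output=True,
        text=True,
    )
    print(f"task {task['id']}:\n{result.stdout}")
```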
Would you like to proceed?
> 1. Yes, clear context and auto-accept edits (shift+tab)
2. Yes, auto-accept edits
3. Yes, manually approve edits
4. Type here to tell Claude what to change
So the default is to do it in a new context. If you examine what this actually does, it clears the context, then says "here's the plan", points to the plan file, and also points to the logs of the previous discussion so that, if it determines it should go back and look at them, it can.
I love the terminal more than the next guy but at some point it feels like you're looking at production nginx logs, just a useless stream of info that is very difficult to parse.
I vibe coded my own ADE for this called OpenADE (https://github.com/bearlyai/openade) it uses the native harnesses, has nice UIs and even comes with things like letting Claude and Codex work together on plans. Still very beta but has been my daily driver for a few weeks now.
Your interface looks pretty cool! I built something similar-ish though with a different featureset / priority (https://github.com/kzahel/yepanywhere - meant to be a mobile first interface but I also use it at my desk almost exclusively)
It sounds like you have some features to comment directly on markdown? That sounds pretty useful. I love how Antigravity has that feature.
Seriously? This can't be a comparable experience in terms of UX.
Otherwise it seems like a minor UI decision any other app would make, and it's surprising there are whole articles on it.
That was very much not my read of it.
Anthropic doesn't want you to be able to easily jump off Claude Code onto open code + an open-weight LLM.
Hiding filenames turns the workflow into a black box. It’s like removing the speedometer from a car because "it distracts the driver". Sure it looks clean, but it's deadly for both my wallet and my context window
I guess that fell on deaf ears.
Yeah, I used to sit and read all of these(at one of the largest video game publishers - does that count?). 95% of them were "your game sucks" but we fixed many bugs thanks to detailed descriptions that people have provided through that box.
When an agent can read, modify, and orchestrate multiple parts of a codebase, you need the equivalent of logs, traces, and diffs — not just summaries. Otherwise debugging becomes guesswork.
Traditional software became reliable only after we built strong observability tooling around it. Agent workflows will need the same evolution: clear execution traces, deterministic diffs, and full transparency into what happened and why.
Seems like this is the most probable outcome: LLM gets to fix the issues undisrupted while keeping the operator happy.
" A GitHub issue on the subject drew a response from Boris Cherny, creator and head of Claude Code at Anthropic, that "this isn't a vibe coding feature, it's a way to simplify the UI so you can focus on what matters, diffs and bash/mcp outputs." He suggested that developers "try it out for a few days" and said that Anthropic's own developers "appreciated the reduced noise.""
Seriously man, whatever happened to configs that you can set once? They obviously realise that people want it, given the Ctrl-O shortcut, but why make them do this over and over without a way to just config it, or whatever the CLI does, like maybe:
./clod-code -v
or something. Man, I dislike these AI bros so much; they're always about "your personal preferences are wrong", but you know they are lying through their smirking teeth: they want you to burn tokens so the earth's habitability can die a few minutes earlier.
If you rely on monitoring the behaviors of an individual coding agent to produce the output you want, you won't scale
One of the interesting things about working on distributed systems is that you can reproduce problems without having to reproduce or mock a long stack trace.
So I certainly don’t see the case you’re talking about where it takes hours to reproduce or understand a problem without a debugger. Of course there are still many times when a debugger should be consulted! There is always a right tool for a given job.
If they tried to create a better product I'd expect them to just add the awesome option, not hide something that saves thousands of tokens and context if the model goes the wrong way.
The answer in both cases is: You don't. If it happens, it's because you sometimes make bad decisions, because it's hard to make good decisions.
Two months ago, Claude was great for "here is a specific task I want you to do to this file". Today, they seem to be pivoting towards "I don't know how to code but want this feature" usage. Which might be a good product decision, but makes it worse as a substitute for writing the code myself.
Pro tip: "git diff"
https://news.ycombinator.com/item?id=9224
Or this pulls the exchange under the famous HN post itself:
Ultimately, the problem is the tool turning against the user. Maybe it is time to get a new tool.