The latest "meta" in AI programming appears to be agent teams (or swarms or clusters or whatever) that are designed to run for long periods of time autonomously.
Through that lens, these changes make more sense. They're not designing UX for a human sitting there watching the agent work. They're designing for horizontally scaling agents that work in uninterrupted stretches where the only thing that matters is the final output, not the steps it took to get there.
That said, I agree with you in the sense that the "going off the rails" problem is very much not solved even on the latest models. It's not clear to me how we can trust a team of AI agents working autonomously to actually build the right thing.
As soon as you start to work with a codebase that you care about and need to seriously maintain, you'll see what a mess these agents make.
The practical and opportunistic response is to tell them "Tough cookies" and watch the problems steadily compound into more lucrative revenue opportunities for us. I really have no remorse for these people, because half of them were explicitly warned against this approach upfront but were psychologically incapable of adjusting expectations or delaying LLM deployment until the technology proved itself. If you've ever had your professional opinion dismissed by the same people who regard you as the SME, you understand my pain.
I suppose I'm just venting now. While we are now extracting money from the dumbassery, the client entitlement and the emotional management that come with putting out these fires never make for a good time.
At what point do we realize that the best way to prompt is with formal language? I.e. a programming language?
Now that's a scary thought that basically goes against "1 trillion dollars can't be wrong".
Now, LLMs are probably great range extenders, but they're not wonder weapons.
E.g. I use these tools to clean up or reorganize old tests (with coverage and diff viewers checking for things I might miss), update documentation with cross-links (with documentation linters checking for errors I miss), convert tests into benchmarks running as part of CI, make log-file visualizers, and more.
These tools are amazing for dealing with the long tail of boring issues that you never get to, and when used in this fashion they actually abruptly increase the quality of the codebase.
But any time someone mentions using AI without proof of success? Vibe coding sucks.
I concur it is different from what you call vibecoding.
Despite being soul-sucking, I do it because A: it lets me achieve goals, despite lacking energy/time, on projects that don't require the level of commitment or care I provide professionally. B: it reduces how much RSI I experience. Typing is a serious concern for me these days.
To mitigate the soul-sucking, I've been side-projecting better review tools, which frankly I could use for work anyway, as reviewing PRs from humans could be better too. Also, in line with review tools, I think a lot of the soul-sucking is having to provide specificity, so I hope to integrate LLMs into the review tool and speak to it more naturally. E.g. I believe some IDEs (VS Code? no idea) can let Claude/etc. see the cursor, so you can say "this code looks incorrect" without needing to be extremely specific. A suite of tooling that improves this code sharing with Claude/etc. would also reduce the inane specificity that seems to be required to make LLMs even remotely reliable for me.
[1]: though we don't seem to have a term for varying amounts of vibe. Some people consider vibe to be 100% complete ignorance of the architecture/code being built, in which case IMO nothing I do is vibe, which is absurd to me, but I digress.
What you are doing is by definition not vibe coding.
We use agents very aggressively, combined with beads, tons of tests, etc.
You treat them like any developer, and review the code in PRs, provide feedback, have the agents act, and merge when it's good.
We have gained tremendous velocity and have been able to tackle far more out of the backlog that we'd been forced to keep in the icebox before.
This idea of setting the bar at "agents work without code reviews" is nuts.
I'm seeing amazing results too with agents, when they're provided a well-formed knowledge base and directed through each piece of work like it's a sprint: review and iron out scope requirements and the API surface/contract, have the agents create multi-phase implementation plans and technical specifications in a shared dev directory, keep high-quality change logs, and document future considerations and any bugs/issues found that can be deferred. Every phase gets a human code review, along with Gemini, which is great at catching drift from spec and bugs in less obvious places.
While I'm sure an enterprise codebase could still be an issue and would require even more direction (and I won't let Opus touch Java; it codes like an enterprise Java greybeard who loves to create an interface/factory for everything), I think that's still just a tooling issue.
I'm not in the super pro-AI camp, but I have followed its development and used it throughout, and for the first time I am actually amazed and bothered, and convinced that if people don't embrace these tools, they will be left behind. No, they don't 10-100x a junior dev, but if someone has proper domain knowledge to direct the agent and does the research alongside it to iron things out, with the human actually understanding the problem space, 2-5x seems quite reasonable currently when driven by a capable developer. But this just moves the work to review and documentation maintenance/crafting, which has its own fatigue and is less rewarding for a programmer's mind that loves to solve challenges and gets dopamine from doing so.
But given how many people are averse... I don't think anyone who embraces it is going to have job-security issues and be replaced, but there are many capable engineers who might, due to their own reservations. I'm amazed by how many intelligent and capable people treat LLMs/agents like a political straw man; there is no reasoning with them. They say vibe coding sucks (it does, for anything more than a small throwaway that won't be maintained), yet their example of agents/LLMs not working is that they can't just take a prompt, produce the best code ever, and automatically manifest the knowledge needed to work on their codebase. You still need to put in effort and learn to actually perform the engineering with the tools, but if they don't take a paragraph with no AGENTS.md and turn it into a feature or bug fix, the tools are no good to them. Yeah, they will get distracted and fuck up, just like 9/10 developers would if you threw them into the same situation with no knowledge of the codebase or domain and told them to have their PR in by noon.
I know people have emotional responses to this, but if you think people aren’t effectively using agents to ship code in lots of domains, including existing legacy code bases, you are incorrect.
Do we know exactly how to do that well, of course not, we still fruitlessly argue about how humans should write software. But there is a growing body of techniques on how to do agent first development, and a lot of those techniques are naturally converging because they work.
This is not to suggest that AI tools do not have value, but that “I just have agents writing code and it works great!” has yet to be put to the test.
I get it; I do. It's rapidly challenging the paradigm that we've set up over the years in a way that is incredibly jarring, but this is going to be our new reality, or you're going to be left behind in MOST industries; highly regulated industries are a different beast.
So, instead of dismissing this out of hand, figure out the best ways to integrate agents into your and your teams'/companies' workstreams. It will accelerate the work and change your role from what it is today to something different; something that takes time and experience to work with.
But that's not the argument. The argument is that these tools produce lower-quality output and that checking this output often takes more time than doing the work oneself. It's not that "we're conservative and afraid of change"; heck, you're talking to a crowd that used to celebrate a new JS framework every week!
There is a push to accept lower quality and to treat it as a new normal, and people who appreciate high-quality architecture and code express their concern.
This doesn't hurt to try and will give valuable and detailed feedback much more quickly than even an experienced developer seeing the project for the first time.
> It will accelerate the work and change your role from what it is today to something different;
We have yet to see if different is good. My short experience with an LLM reviewing my code is that its output is overly explanatory and it slows me down.
> something that takes time and experience to work with.
So you invite us to participate in the sunk cost fallacy. I'm available for consulting when you need something done correctly.
I've been using LLMs to augment development since early December 2023. I've expanded the scope and complexity of the changes made since then as the models grew. Before beads existed, I used a folder of markdown files for externalized memory.
Just because you were late to the party doesn't mean all of us were.
It wasn't a party I liked back in 2023. I'm just repeating the same stuff I see said over and over again here, but there has been a step change with Opus 4.5.
You can still see it in action now, because the other models are still where Opus was a while ago. I recently needed to make a small change to a script I was using. It is a tiny (50-line) script written with the help of AIs ages ago, but it was subtly wrong in so many ways. It's now clear that neither the AIs (I used several and cross-checked) nor I had a clue what we were dealing with. The current "seems to work" version was created after much blood was spilt over misunderstandings, exposing bugs that had to be fixed.
I asked Claude 4.6 to fix yet another misunderstanding, and the result was a patch changing the minimum number of lines to get the job done. Just reviewing such a surgical modification was far easier than doing it myself.
I gave exactly the same prompt to Gemini. The result was a wholesale rearrangement of the code. Maybe it was good, but the effort to verify that was far larger than just doing it myself. It was a very 2023 experience.
The usual 2023 experience for me was asking an AI to write some greenfield code and getting a result that looked like someone had changed the variable names in something they found on the web after a brief search for code that looked like it might do a similar job. If you got lucky, it might have found something that was indeed very similar, but in my case that was rare. Asking it to modify code unlike anything it had seen before was like asking someone to poke your eyes with a stick.
As I said, some of the organisers of this style of party seem to have gotten their act together, so now it is well worth joining their parties. But this is a newish development.
This may be a result of me using the tools poorly, or more likely of me weighing merits that matter less than I think. But I don't think we can tell yet, since people have only just invented these agent workflows.
Note that the situation was not that different before LLMs. I've seen PMs with all the tickets set up, engineers making PRs with reviews, etc., and still no progress on the product. The process can be emulated without substantive work.
Source? Proofs? It's not the first, second or even third round on this rodeo.
In other words, notto disu shittu agen.
Not to say that there's no value in AI written code in these codebases, because there is plenty. But this whole thing where 6 agents run overnight and "tada" in the morning with production ready code is...not real.
Similarly, a lot of the AGI-hype comments exist to expand the scope of the space. It's not real, but it helps to position products and win arguments based on hypotheticals.
Proprietary embedded system documentation is not exactly ubiquitous. You must provide reference material and guardrails where the training is weakest.
This applies to everything in ML: it will be weakest at the edges.
You can get extremely good results assuming your spec is actually correct (and you're willing to chew through massive quantities of tokens / wait long enough).
Is it ever the case that the spec is entirely correct (and without underspecified parts)? I thought the reason we write code is because it's much easier to express a spec as code than it is to get a similar level of precision in prose.
The bots even now can really help you identify technical problems / mistakes / gaps / bad assumptions, but there's no replacing "I know what the business wants/needs, and I know what makes my product manager happy, and I know what 'feels' good" type stuff.
Also known as "compiling source code".
No pesky developers siphoning away equity!
EDIT: fixed typo
But in any case, we're definitely coming up on the need for that.
The Bing AI summary tells me that AI companies invested $202.3 billion in AI last year. Users are going to have to pay that back at some point. This is going to be even worse as a cost control situation than AWS.
That’s not how VC investments work. Just because something costs a lot to build doesn’t mean that anyone will pay for it. I’m pretty sure I haven’t worked for any startup that ever returned a profit to its investors.
I suspect you are right in that inference costs currently seem underpriced, so users will get nickel-and-dimed for a while until the providers can leverage a better margin per user.
Some of the players are aiming for AGI. If they hit that goal, the cost is easily worth it. The remaining players are trying to capture market share and build a moat where none currently exists.
Yes, currency is at times exchanged at a loss for power, but very rarely not for more currency down the road.
In operator/supervisor mode (interactive CLI), you need high-signal observability while it’s running so you can abort or re-scope when it’s reading the wrong area or compounding assumptions. In batch/autonomous mode (headless / “run overnight”), you don’t need a live scrollback feed, but you still need a complete trace for audit/debug after the fact.
Collapsing file paths into counters is a batch optimization leaking into operator mode. The fix isn’t “verbose vs not” so much as separating channels: keep a small status line/spine (phase, current target, last tool call), keep an event-level trace (file paths / commands / searches) that’s persisted and greppable, and keep a truly-verbose mode for people who want every hook/subagent detail.
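A minimal sketch of that channel separation, under the assumption of a hypothetical agent loop (the event names, file layout, and functions here are illustrative, not Claude Code's actual internals):

```python
import json
import sys
import time

TRACE_PATH = "agent-trace.jsonl"  # persisted, greppable event-level trace


def record_event(kind, detail):
    """Append every file read / command / search to the trace file."""
    event = {"ts": time.time(), "kind": kind, "detail": detail}
    with open(TRACE_PATH, "a") as f:
        f.write(json.dumps(event) + "\n")


def update_status_line(phase, target, last_tool):
    """Compact operator-facing spine: overwrite a single terminal line."""
    sys.stdout.write(f"\r[{phase}] {target} (last: {last_tool})\033[K")
    sys.stdout.flush()


# The operator watches one moving status line, while the full event stream
# lands in agent-trace.jsonl for audit/debug after the fact.
for path in ["src/app.py", "src/db.py", "tests/test_db.py"]:
    record_event("read_file", path)
    update_status_line("exploring", path, "read_file")
```

A truly-verbose mode is then just a matter of tailing the trace file alongside (or instead of) the status line.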
All the more reason to catch them early; otherwise we have to wait even longer. In fact, hiding would be more correct if the AI were less autonomous, right?
What fills the holes are best practices, what can ruin the result is wrong assumptions.
I don't see how full autonomy can work either without checkpoints along the way.
And at the end of the day it's not the agents who are accountable for the code running in the production. It's the human engineers.
Still makes this change from Anthropic stupid.
If a singular agent has a 1% chance of making an incorrect assumption, then 10 agents have that same 1% chance in aggregate.
I can attest that it works well in practice, and my organization is already deploying this technique internally.
This is one example of an orchestration workflow. There are others.
> Then spin up a new agent to decide which approach of the three is best on the merits. Repeat this analysis in fresh contexts and sample until there is clear consensus on one.
If there are several agents doing the analysis of solutions, how do you define a consensus? Should it be unanimous or above some threshold? Are the agents' scores soft or hard? How is the threshold defined if scores are soft? There is a whole lot of science in voting approaches; which voting approach is best here? Is it possible for the analyzing agents to choose the best of several wrong solutions? E.g., the longest remembered table of FizzBuzz answers amongst remembered tables of FizzBuzz answers.
To me, our discussion shows that what you presented as a simple thing is not simple at all: even the voting is complex, and actually getting a good result is so hard it warrants omitting an answer altogether.
Okay then, agentic coding is nothing but a complex task requiring knowledge of unbiased voting (what is that thing, really?) and, apparently, the use of a necessarily heavy test suite and/or theorem provers.
This is NOT the same as asking “are you sure?” The sycophantic nature of LLMs would make them biased on that. But fresh agents with unbiased, detached framing in the prompt will show behavior that is probabilistically consistent with the underlying truth. Consistent enough for teasing out signal from noise with agent orchestration.
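As a rough sketch of what "sample fresh contexts until there is clear consensus" can look like in practice, here is plain majority voting with a threshold; `ask_fresh_agent` is a stand-in for spinning up a new agent with detached framing, not a real API:

```python
import random
from collections import Counter


def ask_fresh_agent(question):
    """Stand-in: in practice, start a new agent with a clean context and a
    neutral, detached framing of the question, and return its verdict."""
    return random.choice(["approach A", "approach A", "approach B"])


def consensus(question, min_samples=5, max_samples=15, threshold=0.7):
    votes = Counter()
    for n in range(1, max_samples + 1):
        votes[ask_fresh_agent(question)] += 1
        if n >= min_samples:
            winner, count = votes.most_common(1)[0]
            if count / n >= threshold:  # clear consensus reached
                return winner, count / n
    return None, 0.0  # no clear consensus: escalate to a human


print(consensus("Which of the three approaches is best on the merits?"))
```

The threshold and sample counts are knobs, not gospel; the point is only that independent, detached samples let you separate signal from noise.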
What is Codex doing differently to solve for this problem?
Even in that case they should still be logging what they're doing for later investigation/auditing if something goes wrong. Regardless of whether a human or an AI ends up doing the auditing.
As tedious as it is a lot of the time (and I wish there were an in-between "allow this session", not just allow once or "allow all"), it's invaluable for catching when the model has tried to fix the problem in entirely the wrong project.
Working on a monolithic code-base with several hundred library projects, it's essential that it doesn't start digging in the wrong place.
It's better than it used to be, but the failure mode can be extreme; I've come back to 20+ minutes of it going around in circles, frustrating itself because of a wrong meaning it ascribed to an instruction.
https://code.claude.com/docs/en/settings#permission-settings
You can configure it at the project level
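As a sketch of what that amounts to (the keys below are from memory of the linked docs, so verify the exact schema there before relying on it), pre-approving or blocking specific actions at the project level is just a small settings file checked into the repo:

```python
import json
import pathlib

# Assumed schema: permission rules of the form Tool(pattern), per the docs above.
settings = {
    "permissions": {
        "allow": ["Read(./**)", "Bash(git diff:*)"],  # pre-approved actions
        "deny": ["Bash(rm:*)"],                        # always refused
    }
}

path = pathlib.Path(".claude/settings.json")  # project-level settings file
path.parent.mkdir(exist_ok=True)
path.write_text(json.dumps(settings, indent=2))
```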
If you have an unlimited budget, obviously you will tend to let it run and correct it in the next iteration.
If you often run right up against your 5-hour window, you're going to be more likely to babysit it.
Since it's just reading at that stage there's no tracked changes.
There are three separate layers here:
1. What the model internally computes
2. What the product exposes to the user
3. What developers need for debugging and control
Most outrage conflates all three.
Exposing raw reasoning tokens sounds transparent, but in practice it often leaks messy intermediate steps, half-formed logic, or artifacts that were never meant to be user-facing. That doesn’t automatically make a product more trustworthy. Sometimes it just creates noise.
The real issue is not whether internal thoughts are hidden. It’s whether developers can:
• Inspect tool calls
• See execution traces
• Debug failure modes
• Reproduce behavior deterministically
If those are restricted, that’s a serious product problem. If what’s being “hidden” is just chain-of-thought verbosity, that’s a UI decision, not deception.
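To make that concrete, here is a toy sketch of the kind of operational hook I mean: record each tool call to a trace you can inspect, and optionally replay recorded results for deterministic reproduction. Nothing here is any vendor's real API:

```python
import functools
import json

TRACE = []   # execution trace; could be persisted as JSONL for later inspection
REPLAY = {}  # call-signature -> recorded result, for deterministic reruns


def traced_tool(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        key = json.dumps([fn.__name__, args, kwargs], sort_keys=True, default=str)
        if key in REPLAY:  # reproduce a past run exactly
            return REPLAY[key]
        result = fn(*args, **kwargs)
        TRACE.append({"tool": fn.__name__, "args": args,
                      "kwargs": kwargs, "result": result})
        return result
    return wrapper


@traced_tool
def run_search(query):
    # stand-in for a real tool call (web search, grep, etc.)
    return f"results for {query!r}"


run_search("verbose mode flag")
print(json.dumps(TRACE, indent=2, default=str))
```

The replay map is the piece that buys deterministic reproduction; the trace alone already covers inspection and debugging.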
There’s also a business angle people don’t want to acknowledge. As models become productized infrastructure, vendors will protect internal mechanics the same way cloud providers abstract away hardware-level details. Full introspection is rarely a permanent feature in mature platforms.
Developers don’t actually want full transparency. They want reliability and control. If the system behaves predictably and exposes the right operational hooks, most people won’t care about hidden internal tokens.
The real question is: where should the abstraction boundary sit for a developer tool?
How to comply with a demand to show more information by showing less information.
It's just a whole new world where words suddenly mean something completely different, and you can no longer understand programs by just reading what labels they use for various things; you also need to look up whether what they think "verbose" means matches the meaning you've built up an understanding of first.
EDIT: Ah, looks like verbose mode might show less than it used to, and you need to use a new mode (^o) to show very verbose.
I didn't know about the ^o mode though, so good that the verbose information is at least still available somewhere. Even though now it seems like an enormously complicated maneuver with no purpose.
What I think they are forgetting in this silly stubbornness is that competition is really fierce, and just as they have gained appreciation from developers, they might very quickly lose it because of this sort of stupidity (for no good reason).
Do they believe that owning the harness (Claude Code) itself will lead to significantly more money? I can sort of see that, but I wouldn't think they are necessarily betting on it?
I use Anthropic's models wherever, whenever I can, be it cursor, copilot, you name it. I can't stand Claude Code for some reason, but I'll kill for those models.
On the other hand, I've seen some non-tech people have their "Holy shit!" moment with Claude Co-work (which I personally haven't tried yet) — and that's a market I can see them wanting to hold on to in order to branch out of the dev niche. The same moment happened when they tried the Excel integration — they were completely mind-blown.
[0] https://generativeai.pub/cursors-pricing-change-sparks-outra...
You have to go into /models then use the left/right arrow keys to change it. It’s a horrible UI design and I had no idea mine was set to high. You can only tell by the dim text at the bottom and the 3 potentially highlighted bars.
On high, it would think for 30+ minutes and make a plan; then, when I started the plan, it would either compact and reread all my files, or start fresh and read my files, then compact after 2-3 changes and reread the files.
High reasoning is unusable with Opus 4.6 in my opinion. They need at least 1M context for this to work.
I still think it’d be nice to allow an output mode for you folks who are married to the previous approach since it clearly means a lot to you.
First, I agree with most commentators that they should just offer 3 modes of visibility: "default", "high", "verbose" or whatever
But I'm with you that this mode of working where you watch the agent work in real-time seems like it will be outdated soon. Even if we're not quite there, we've all seen how quickly these models improve. Last year I was saying Cursor was better because it allowed me to better understand every single change. I'm not really saying that anymore.
How do you do this? Do you follow traditional testing practices or do you have novel strategies like agents with separate responsibilities?
Curious what plans you’re using? running 24/7 x 5 agents would eat up several $200 subscriptions pretty fast
The way Claude does research has dramatically changed for the worse. Instead of piping through code logically, it's now spawning dozens of completely unrelated research threads to look at simple problems. I let it spin for over 30 minutes last night before realizing it was just "lost".
I have since been looking for these moments and killing it immediately. I tell Claude "just look at the related code" and it says, "sorry I'll look at this directly".
WTF Anthropic?
Although, this post surely isn't "news" as much as it is, as you said, a summary of a conversation being held on other platform(s).
So maybe it is just a blog post?
It breaks a spec (or freeform input) down into a structured json plan, then kicks off a new non-interactive session of Claude or codex for each task. Sounds like it could fit your workflow pretty well.
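Roughly the shape of that, as a sketch. The plan format below is made up, and I'm assuming `claude -p` as the non-interactive invocation; substitute whatever headless command your harness uses:

```python
import json
import subprocess

# Hypothetical structured plan produced from a spec or freeform input.
plan = json.loads("""
{
  "tasks": [
    {"id": 1, "prompt": "Add input validation to the /signup handler"},
    {"id": 2, "prompt": "Write unit tests covering the new validation"}
  ]
}
""")

for task in plan["tasks"]:
    # One fresh, non-interactive session per task; output captured for review.
    result = subprocess.run(
        ["claude", "-p", task["prompt"]],
        capture_output=True,
        text=True,
    )
    print(f"task {task['id']}:\n{result.stdout}")
```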
Would you like to proceed?
> 1. Yes, clear context and auto-accept edits (shift+tab)
2. Yes, auto-accept edits
3. Yes, manually approve edits
4. Type here to tell Claude what to change
So the default is to do it in a new context. If you examine what this actually does, it clears the context, then says "here's the plan", points to the plan file, and also points to the logs of the previous discussion so that, if it determines it should go back and look at them, it can.
I love the terminal more than the next guy but at some point it feels like you're looking at production nginx logs, just a useless stream of info that is very difficult to parse.
I vibe coded my own ADE for this called OpenADE (https://github.com/bearlyai/openade) it uses the native harnesses, has nice UIs and even comes with things like letting Claude and Codex work together on plans. Still very beta but has been my daily driver for a few weeks now.
Your interface looks pretty cool! I built something similar-ish though with a different featureset / priority (https://github.com/kzahel/yepanywhere - meant to be a mobile first interface but I also use it at my desk almost exclusively)
It sounds like you have some features to comment directly on markdown? That sounds pretty useful. I love how Antigravity has that feature.
Seriously? This can't be a comparable experience in terms of UX.
Otherwise it seems like a minor UI decision any other app would make, and it's surprising there are whole articles on it.
That was very much not my read of it.
Anthropic doesn't want you to be able to easily jump off Claude Code onto open code + an open-weight LLM.
Hiding filenames turns the workflow into a black box. It’s like removing the speedometer from a car because "it distracts the driver". Sure it looks clean, but it's deadly for both my wallet and my context window
I guess that fell on deaf ears.
Yeah, I used to sit and read all of these(at one of the largest video game publishers - does that count?). 95% of them were "your game sucks" but we fixed many bugs thanks to detailed descriptions that people have provided through that box.
When an agent can read, modify, and orchestrate multiple parts of a codebase, you need the equivalent of logs, traces, and diffs — not just summaries. Otherwise debugging becomes guesswork.
Traditional software became reliable only after we built strong observability tooling around it. Agent workflows will need the same evolution: clear execution traces, deterministic diffs, and full transparency into what happened and why.
Seems like this is the most probable outcome: LLM gets to fix the issues undisrupted while keeping the operator happy.
" A GitHub issue on the subject drew a response from Boris Cherny, creator and head of Claude Code at Anthropic, that "this isn't a vibe coding feature, it's a way to simplify the UI so you can focus on what matters, diffs and bash/mcp outputs." He suggested that developers "try it out for a few days" and said that Anthropic's own developers "appreciated the reduced noise.""
Seriously man, whatever happened to configs that you can set once? They obviously realise that people want it, given the Ctrl-O shortcut, but why make them do this over and over without a way to just config it, or whatever the CLI does, like maybe:
./clod-code -v
or something. Man, I dislike these AI bros so much; they're always about "your personal preferences are wrong", but you know they are lying through their smirking teeth: they want you to burn tokens so the earth's habitability can die a few minutes earlier.
If you rely on monitoring the behaviors of an individual coding agent to produce the output you want, you won't scale
One of the interesting things about working on distributed systems is that you can reproduce problems without having to reproduce or mock a long stack trace.
So I certainly don’t see the case you’re talking about where it takes hours to reproduce or understand a problem without a debugger. Of course there are still many times when a debugger should be consulted! There is always a right tool for a given job.
If they tried to create a better product I'd expect them to just add the awesome option, not hide something that saves thousands of tokens and context if the model goes the wrong way.
The answer in both cases is: You don't. If it happens, it's because you sometimes make bad decisions, because it's hard to make good decisions.
Two months ago, Claude was great for "here is a specific task I want you to do to this file". Today, they seem to be pivoting towards "I don't know how to code but want this feature" usage. Which might be a good product decision, but makes it worse as a substitute for writing the code myself.
Pro tip: "git diff"
https://news.ycombinator.com/item?id=9224
Or this pulls the exchange under the famous HN post itself:
Ultimately, the problem is the tool turning against the user. Maybe it is time to get a new tool.