This avoids the typical stateless one-shot pattern of current coding agents and enables multi-step changes without losing intermediate reasoning, test failures, or partial progress.
The tool is useful for tasks that require many small, serial modifications: increasing test coverage, large refactors, dependency upgrades guided by release notes, or framework migrations.
Blog post about this: https://anandchowdhary.com/blog/2025/running-claude-code-in-...
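Roughly, the shape is a loop of non-interactive runs, with progress landing in the repo between iterations. A minimal sketch of that shape (not the actual script, and assuming the claude CLI's -p/--print non-interactive mode plus a placeholder npm test as the project's test command):

# Minimal sketch of the loop idea, not the real continuous-claude script.
# Assumes `claude -p` (non-interactive mode) and that `npm test` is the
# project's test command; swap in whatever your repo uses.
max_iterations=10
for ((i = 1; i <= max_iterations; i++)); do
  claude -p "Pick the next failing or missing test, fix or add it, and commit the change."
  if npm test; then
    echo "tests green after $i iteration(s)"
    break
  fi
done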
I had a coworker do this with Windsurf + manual driving a while back and it was an absolute mess. Awful tests that were unmaintainable and next to useless (too much mocking, testing that the code “works the way it was written”, etc.). Writing a useful test suite is one of the most important parts of a codebase and requires careful, deliberate thought. Without a deep understanding of the business logic (which takes time and is often lost after the initial devs move on) you're not gonna get great tests.
To be fair to AI, we hired a “consultant” that also got us this same level of testing so it’s not like there is a high bar out there. It’s just not the kind of problem you can solve in 2 weeks.
Ask a coding agent to build tests for a project that has none and you're likely to get all sorts of messy mocks and tests that exercise internals when really you want them to exercise the top level public API of the project.
Give them just a few starting examples that demonstrate how to create a good testable environment without mocking and that test the higher-level APIs, and they are much less likely to make a catastrophic mess.
You're still going to have to keep an eye on what they're doing and carefully review their work though!
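For example, the kind of seed test I mean might look something like this in shell, for a hypothetical project that ships a `mytool` CLI (names are made up): run the real binary end to end in a throwaway directory and assert on its observable output, no mocks, no reaching into internals.

# Hypothetical seed test (assumes the project ships a `mytool` CLI; names made up).
# Exercise the public entry point end to end in a temp directory, no mocks,
# no poking at internals, and clean up afterwards.
set -euo pipefail
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

mytool init "$workdir/project"
output=$(mytool status "$workdir/project")

case "$output" in
  *"0 pending"*) echo "PASS: fresh project reports nothing pending" ;;
  *) echo "FAIL: unexpected status output: $output" >&2; exit 1 ;;
esac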
I find this to be true for all AI coding, period. When I have the problem fully solved in my head, and I write the instructions to explicitly and fully describe my solution, the code that is generated works remarkably well. If I am not sure how it should work and give more vague instructions, things don't work so well.
Starting point: small-ish codebase, no tests at all:
> I'd like to add a test suite to this project. It should follow language best practices. It should use standard tooling as much as possible. It should focus on testing real code, not on mocking/stubbing, though mocking/stubbing is ok for things like third party services and parts of the code base that can't reasonably run in a test environment. What are some design options we could consider? Don't write any code yet, present me the best of the options and let me guide you.
> Ok, I like option number two. Put the basic framework in place and write a couple of dummy tests.
> Great, let's go ahead and write some real tests for module X.
And so on. For a project with an existing and mature test suite, it's much easier:
> I'd like to add a test (or improve a test) for module X. Use the existing helpers, and if you find yourself needing new helpers, ask me about the approach before implementing.
I've also found it helpful to put things in AGENTS.md or CLAUDE.md about tests and my preferences, such as:
- Tests should not rely on sleep to avoid timing issues. If there is a timing issue, present me with options and let me guide you (a poll-based sketch follows this list)
- Tests should not follow an extreme DRY pattern, favor human readability over absolute DRYness
- Tests should focus on testing real code, not on mocking/stubbing, though mocking/stubbing is ok for things like third party services and parts of the code base that can't reasonably run in a test environment.
- Tests should not make assumptions about the current running state of the environment, nor should they do anything that isn't cleaned up before completing the test to avoid polluting future tests
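For the sleep point above, the replacement I usually steer it toward is a small poll-with-deadline helper. A rough bash sketch (the helper name and the probe command are made up; adapt to your project):

# Rough sketch of a poll-with-deadline helper instead of a fixed sleep.
wait_until() {
  local timeout=$1; shift
  local deadline=$((SECONDS + timeout))
  until "$@"; do
    if ((SECONDS >= deadline)); then
      echo "timed out waiting for: $*" >&2
      return 1
    fi
    sleep 1   # short poll interval; the deadline, not the sleep, bounds the wait
  done
}

# Example: fail within 10 seconds if the dev server never comes up.
wait_until 10 curl -sf http://localhost:8080/health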
I do want to stress that every project and framework is different and has different needs. As you discover the AI doing something you don't like, add it to the prompts or the AGENTS.md/CLAUDE.md. Eventually it will get pretty decent, though never blindly trust it, because a butterfly flapping its wings in Canada sometimes causes it to do unexpected things.
Or it would return early from Playwright tests when the desired targets couldn't be found, instead of failing.
But I agree that with some guidance and a better CLAUDE.md, it can work well!
Code assistance tools might speed up your workflow by maybe 50% or even 100%, but that's not the geometric scaling commonly touted as the benefit of autonomous agentic AI.
And this is not a model capability issue that will go away with newer generations; it's a human input problem.
For example, you can spend a few hours writing a really good set of initial tests that cover 10% of your codebase, and another few hours with an AGENTS.md that gives the LLM enough context about the rest of the codebase. But after that, there's a free* lunch because the agent can write all the other tests for you using that initial set and the context.
This also works with "here's how I created the Slack API integration, please create the Teams integration now" because it has enough to learn from, so that's free* too. This kind of pattern recognition means that prompting is O(1) but the model can do O(n) from that (I know, terrible analogy).
*Also literally becomes free as the cost of tokens approaches zero
I recently had a bunch of Claude credits, so I got it to write a language implementation for me. It probably took 4 hours of my time, but judging by other implementations online I'd say the average implementation time is hundreds of hours.
The fact that the model knew the language and that there were existing tests I could use made a radical difference.
But "throw a vague prompt in the AI's direction" does about as well as doing the same thing with an intern.
An agent does a good job fixing its own bad ideas when it can run tests, but the biggest blocker I've been having is the agent writing bad tests and getting stuck, or claiming success by lobotomizing a test. I got pretty far with myself being the test critic, and that being mostly the only input the agent got after the initial prompt. I'm just betting it could be done with a second agent.
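A rough sketch of that second-agent idea, assuming the claude CLI's -p non-interactive mode (the prompts are made up): one invocation writes the tests, a second one only critiques the resulting diff.

# Rough sketch of the writer/critic split (assumes `claude -p`; prompts are made up).
claude -p "Add tests for the parser module. Do not delete or weaken existing assertions."
git diff | claude -p "Act as a strict test critic reviewing this diff of test changes: flag gutted assertions, excessive mocking, and tests that merely restate the implementation."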
(https://github.com/AnandChowdhary/continuous-claude/blob/mai...)
shopt -s extglob
case "$1"
# Flag support - allow -xyz z-takes-params
-@(a|b|c)*) _flag=${1:1:1}; _rest=${1:2}; shift; set -- "-$_flag" "-$_rest" "$@";;
# Param=Value support
-?(-)*=*) _key=${1%%=*}; _value=${1#*=}; shift; set -- "${_key}" "$_value" "$@";;
esac