As mentioned in the article, the big trick is having clear specs. In my case I sat down for 2 hours and wrote a 12 step document on how I would implement this (along with background information). Claude went through step by step and wrote the code. I imagine this saved me probably 6-10 hours. I’m now reviewing and am going to test etc. and start adjusting and adding future functionality.
Its success was rooted in the fact I knew exactly how to do what it needed to do. I wrote out all the steps and it just followed my lead.
It makes it clear to me that mid and senior developers aren’t going anywhere.
That said, it was amazing to just see it go through the requirements and implement modules full of organised documented code that I didn’t have to write.
This gives me excellent results with far less typing and time.
I don't want it to replace me; I use it to replace reading the docs, googling, and repetitive tasks.
It's hit or miss sometimes, but I get to review every snippet.
If I generated a lot of code at once, like for a full project, reviewing it would melt my brain.
I keep my normal development flow and iterate; no waterfall.
I think when they advertise "thinking", it just does a few more iterations of getting closer to the "number in your head" from the clues you've given it (the requirements).
I saw someone once say that LLMs are a kind of "word calculator" and I feel that's quite a good description.
A simple example:
Prompt: Make it yellow
Think: The user wants something yellow but hasn't said what it is. Previously the user talked about creating a Button, so it must be the button, but I should clarify by asking.
Response: Is it the button that you want yellow?
On the other hand, there have been quite a few moments in the last week where I'm actually starting to question if it's really faster. Some of the random mistakes can be major, depending on how quickly it gets something wrong. I feel like I'm in a computer game: I need to save every time I make progress (commit my work).
I'm still on the fence about it honestly.
What’s the advantage here for you with a process like this?
In your flow you also have multiple review steps and corrections as well, which adds even more friction.
I can see the advantage in what parent is describing however.
I prompt like: give me Go structs for this JSON "pasted json" and write two functions to save it to and load it from "nosql db I use".
That basically speeds up writing glue code.
The business logic I write myself.
It's faster to do the business logic myself than to review what the AI did.
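Roughly the shape of glue code I mean, as a sketch; the Order fields and the Store interface are stand-ins for the real pasted JSON and the real NoSQL client:

    package orders

    import (
        "context"
        "encoding/json"
    )

    // Order mirrors the pasted JSON; the fields here are hypothetical.
    type Order struct {
        ID    string  `json:"id"`
        Email string  `json:"email"`
        Total float64 `json:"total"`
    }

    // Store stands in for whatever NoSQL client is actually in use.
    type Store interface {
        Put(ctx context.Context, key string, value []byte) error
        Get(ctx context.Context, key string) ([]byte, error)
    }

    // SaveOrder marshals the order and writes it under a key derived from its ID.
    func SaveOrder(ctx context.Context, s Store, o Order) error {
        b, err := json.Marshal(o)
        if err != nil {
            return err
        }
        return s.Put(ctx, "order:"+o.ID, b)
    }

    // LoadOrder reads the order back by ID and unmarshals it.
    func LoadOrder(ctx context.Context, s Store, id string) (Order, error) {
        var o Order
        b, err := s.Get(ctx, "order:"+id)
        if err != nil {
            return o, err
        }
        err = json.Unmarshal(b, &o)
        return o, err
    }

None of it is hard; it's just mechanical typing I'd rather not do.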
Those tools existed before LLMs and were local, fast, and most importantly free.
Why people continue to re-invent tools and workflows we already have, I don't know. Perhaps they just like to be able to say "but this uses AI!"
Has anyone managed to set up a "reactive" way to interact with LLMs in a codebase, so that when an LLM extends or updates some part of the territory, it also extends or updates the map?
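One crude starting point might be a staleness check that compares each directory's code against its summary doc and flags "maps" that have fallen behind the territory, run from CI or a file watcher. A sketch; the per-directory SUMMARY.md convention is just an assumption:

    // stalemaps flags summary docs ("maps") that are older than the Go files
    // ("territory") in the same directory.
    package main

    import (
        "fmt"
        "os"
        "path/filepath"
        "strings"
    )

    func main() {
        filepath.WalkDir(".", func(path string, d os.DirEntry, err error) error {
            if err != nil || !d.IsDir() {
                return err
            }
            summary := filepath.Join(path, "SUMMARY.md")
            sinfo, err := os.Stat(summary)
            if err != nil {
                return nil // this directory has no map; skip it
            }
            entries, _ := os.ReadDir(path)
            for _, e := range entries {
                if e.IsDir() || !strings.HasSuffix(e.Name(), ".go") {
                    continue
                }
                info, err := e.Info()
                if err == nil && info.ModTime().After(sinfo.ModTime()) {
                    fmt.Printf("stale map: %s (newer code: %s)\n", summary, e.Name())
                    break
                }
            }
            return nil
        })
    }

The output could then be handed back to the agent as "these maps need updating", but I'd love to hear about something less manual.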
I take it as a manifestation of the temporal bigotry in computer science: that anything not new is bad, which is absolutely untrue. Old is not bad, new is not good. Where something exists in time has almost no bearing on its quality. Most knowledge and good ideas do not survive.
I've been building a programming language using Claude, and these are my findings, too.
Which, after discovering this, makes sense. There are a LOT of small decisions that go into programming. Without detailed guidance, LLMs will end up making educated guesses for a lot of these decisions, many of which will be incorrect. This creates a compounding effect where the net result is a wrong solution.
1. I use the Socratic Coder[1] system prompt to have a back and forth conversation about the idea, which helps me hone the idea and improve it. This conversation forces me to think about several aspects of the idea and how to implement it.
2. I use the Brainstorm Specification[2] user prompt to turn that conversation into a specification.
3. I use the Brainstorm Critique[3] user prompt to critique that specification and find flaws in it which I might have missed.
4. I use a modified version of the Brainstorm Specification user prompt to refine the specification based on the critique and have a final version of the document, which I can either use on my own or feed to something like Claude Code for context.
Doing those things improved the quality of the code and work spit out by the LLMs I use by a significant amount, but more importantly, it helped me write much better code on my own because I now have something to guide me, whereas before I used to go in blind.
As a bonus, it also helped me decide if an idea was worth it or not; there are times I'm talking with the LLM and it asks me questions I don't feel like answering, which tells me I'm probably not into that idea as much as I initially thought; it was just my ADHD hyperfocusing on something.
[1]: https://github.com/jamesponddotco/llm-prompts/blob/trunk/dat...
[2]: https://github.com/jamesponddotco/llm-prompts/blob/trunk/dat...
[3]: https://github.com/jamesponddotco/llm-prompts/blob/trunk/dat...
> I use the Socratic Coder[1] system prompt to have a back and forth conversation about the idea. (prompt starts with: 1. Ask only one question at a time)
Why only one? IMHO it's better to write a long prompt explaining yourself as much as possible (it exercises your brain and you figure things out), and to request as many clarifying questions, reviews, and suggestions as possible, all at once. This is better because:
1. It makes you think deeper and practice writing clearly.
2. Even though each interaction is quite slower, since you are more active and engaged it feels shorter (try it), and you minimize interactions significantly.
3. It's less wasteful than going back and forth.
4. You converge in much shorter time as your misconceptions, misunderstandings, problems expressing yourself, or confusion on the part of the LLM are all addressed very early.
5. I find it annoying to wait for the replies.
I guess if you use a fast-response conversational system like the ChatGPT app it would make more sense. But I don't think you can have deep conversations that way unless you have a stellar working memory. I don't, so it's better for me to write and read, and re-write, and re-read...

I start with an idea between <idea> tags, write as much as I possibly can between those tags, and then go one question at a time, answering the questions with as much detail as I possibly can.
Sometimes I'll feed the idea to yet another prompt, Computer Science PhD[1], and use the response as the basis for my conversation with the socratic coder, as the new basis might fill in gaps that I forgot to include initially.
[1]: https://github.com/jamesponddotco/llm-prompts/blob/trunk/dat...
[2]: Something like "Based on my idea, can you provide your thoughts on how the service should be built, please? Technologies to use, database schema, user roles, permissions, architectural ideas, and so on."
I've had moderate success throwing a braindump at the LLM, asking it to write a .md with a plan, and then going ahead with the implementation. Specialized thinking prompts seem like overkill (or dumbo-level coding skills are enough for me).
What's the benefit of putting the original idea between <idea> tags when it seems to be the main body of the prompt anyway? Or are you supplying the Socratic Coder prompt and the idea in the same prompt?
Edit: Sorry, I had a brain fart for a second, thought you were talking about other prompts. I prefer to keep those as chats with the API, not Claude Code, but yeah, they might work as slash commands too.
It will make you much better at development to learn the way today's senior devs learned.
http://pchristensen.com/blog/articles/first-impressions-of-v...
"Review <codebase> and create a spec for <algorithm/pattern/etc.>"
It gives you a good starting point to jump off from.
Step 1: back and forth chat about the functionality we want. What do we want it to do? What are the inputs and outputs? Then generate a spec/requirements sheet.
Step 2: identify what language, technologies, frameworks to use to accomplish the goal. Generate a technical spec.
Step 3: architecture. Get a layout of the different files that need to be created and a general outline of what each will do.
Step 4: combine your docs and tell it to write the code.
I kinda feel like this is a self-placating statement that is not going to stay true for that long. We are so early in the process of developing AI good enough to do any of these things. Yes, right now you need senior level design skills and programming knowledge, but that doesn't mean that will stay true.
So you really think that in a few years some guy with no coding experience will ask the AI "Make me a GTA 6 clone that happens in Europe" and the AI will actually make it, the code will just work, and the performance will be excellent?
The LLMs can't do that; they are attracted to solutions they've seen in their training data. This means sometimes they overcomplicate things, they do not see clever solutions or apply theory, and sometimes they are just stupid and hallucinate variable names and functions, like say 50% of the time it would use speed and 50% of the time it would use velocity, and the code will fail because of undefined stuff.
I am not afraid of LLMs taking my job, I am afraid of bullshit marketing that convinces the CEO/management that if they buy me Claude then I must work 10x faster.
What is outside coding?
- writing? an AI that can replace hard core developers that write optimized game engines should also be able to generate quests
- art? same, AI should be able to get already created models and change them here and there to make them not look stolen
- marketing? why can't AI replace those people
so?
I don't know the answer, as much as anyone else, and obviously I'm skeptical that it'll happen.
But then if I think back to 2018 and imagine what I would think if I saw even GPT-OSS-20b back then, it would have been close to magic and absolutely not something I would have expected. I felt the same about GPT-2 when it first launched too, when LLMs started to show a small bit of promise. GPT-3 was insane even when it launched.
So I guess I wouldn't base "what could happen in the future" on what I personally believe is possible, because LLMs definitely fell into that camp just a few years ago, so why not with larger coding tasks too, which I see as unlikely today?
There is definitely a path from here to the future where the most senior engineer in your org/dept/team decides he can make some big project without some subset of more-junior employees because he has Claude. The managers or PMs won’t be coding without engineers, but it’s definitely possible for engineers to code with fewer teammates, especially if the very experienced ones are the ones planning and guiding the effort.
> The LLMs can't do that, they are attracted to solutions they seen in their training, this means…
None of the things you’ve said this means match my experience using LLMs to write real, usable, viable code. It might not be the most performant or perfect code, but it’s certainly usable, and most software isn’t written at Google or whatever and doesn’t need to support hundreds of millions of customers at scale. If it took a day instead of a month, then “the business” might decide that’s a worthy tradeoff.
It depends on your project. I've seen a lot of stupidity from the AI, like in a Lua project where arrays were 1-indexed it would 0-index them; somehow the C-like behaviour was too strong a force pulling the model in that direction.
For example, when I test an image generator I ask it to create a photo of the front of a book store and to include no brands, labels or text (because they always include English text and most of the time there are spelling errors), but the AIs can't make a shop without the branding/text above the door; they are just so overtrained on this concept that explicit commands can't fix it.
So the same with LLMs: they are attracted to the average, most popular stuff they've seen in the training data, so without instructions from you (or maybe from the provider behind the hidden prompts) it will output outdated JavaScript using "var", it will output unoptimized algorithms, and even if you used a specific variable name it will be strongly pushed to rename it to whatever is most popular in the training data.
Yes, I can make the LLMs write some good code, but only if I babysit it: tell it exactly what files to read as inspiration, what features to use, and what to do. For sure I can't just paste the text of a ticket and let it run free.
I also use it to review my code for bugs; it can find up to 50% of the bugs and hallucinate others that are not possible (like it would suggest that if $x is null then something would crash and I should check for that, but the type system already ensures $x can't be null). So it really needs more training to do simple stuff... To be original and not just regurgitate the most popular things it was trained on, it would need to be something not based on the LLM architecture.
Small side remark, but what is the value added of the AI-generated documentation for the AI-generated code? It's just a burden that increases context size whenever the AI needs to re-analyse or change the existing code. It's not like any human is ever going to read the code docs when they can just ask the AI what it's about.
This is useful because if you just have Claude Code read all the code every time, it'll run out of context very quickly, whereas if you have a dozen 50 line files that summarize the 200-2000 lines of code they represent, they can always be fresh in context. Context management is king.
1) When your cloud LLM has an outage, your manager probably still expects you to be able to do your work for the most part, not to go home because OpenAI is down lol. You being productive as an engineer should not depend on the cloud working.
2) You may want to manually write code for certain parts of the project. Important functions, classes, modules, etc. Having good auto-generated docs is still useful when using a traditional IDE like IntelliJ, WebStorm, etc.
3) Code review. I’m assuming your team does code review as part of your SDLC??? Documentation can be helpful when reviewing code.
lol where do you work? This obviously isn't true for the entire industry. If GitHub or AWS or your WiFi/ISP is down, productivity is greatly reduced. Many SaaS companies don't have local dev, so they rely on the cloud broadly being up. "Should" hasn't been the reality in industry for years.
It's not a question of what you can do, but where the comfort level reduction outweighs the project importance/pay.
I continue writing code and unit tests while I wait for the cloud to work again. If the outage is a long one, I may even spin up “DynamoDB Local” via Docker for some of our simpler services that only interact with DynamoDB. Our Apache Flink services that read from Kafka are a lost cause obviously lol.
It’s also a good opportunity to tackle any minor refactoring that you’ve been hoping to do. Also possible without the cloud.
You can also work on _designing_ new features (whiteboarding, creating a design document, etc). Often when doing so you need to look at the source to see how the current implementation works. That’s much easier with code comments.
You are making different points.
I try to prompt-enforce no line-by-line documentation, but encourage function/class/module level documentation that will help future developers/AI coding agents. Humans are generally better, but AI sometimes needs help to stop it misunderstanding a piece of code's context and just writing its own new function that does the same thing.
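Roughly the level I mean, as a made-up Go example: one doc comment that captures intent and context for the next reader (human or agent), and no line-by-line narration.

    package billing

    // Transaction, Invoice, and Match are stand-ins for the real domain types.
    type Transaction struct {
        Amount   int64
        Currency string
    }

    type Invoice struct {
        ID       string
        Amount   int64
        Currency string
    }

    type Match struct {
        InvoiceID string
    }

    // ReconcileInvoices pairs imported bank transactions with open invoices by
    // exact amount and currency. Context worth recording for future developers
    // and coding agents: partial payments are handled elsewhere, so extend the
    // matching rules here rather than adding another matching helper.
    func ReconcileInvoices(txs []Transaction, open []Invoice) []Match {
        var matches []Match
        for _, tx := range txs {
            for _, inv := range open {
                if tx.Currency == inv.Currency && tx.Amount == inv.Amount {
                    matches = append(matches, Match{InvoiceID: inv.ID})
                    break
                }
            }
        }
        return matches
    }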
It puts you in a different mind space to sit down and think about it instead of iterating too much and in the end feeling productive while actually not achieving much and going mostly in circles.
The parent wrote:
>I imagine this saved me probably 6-10 hours. I’m now reviewing and am going to test etc.
Guessing the time saved prior to reviewing and testing seems premature from my end.
A few times while writing the doc I had to go back and update the previous steps to add missing features.
Also, I knew when to stop. It’s not fully finished yet. There are additional stages I need to implement. But as an experienced developer, I knew when I had enough for “core functionality” that was well defined.
What worries me is how do you become a good developer if AI is writing it all?
One of my strengths as a developer is understanding the problem and breaking it down into steps, creating requirements documents like I’ve discussed.
But that’s a hard-earned skill from years of client work where I wrote the code. I have a huge step up in getting the most from these agents now.
The downside that Agile sought to remedy was inflexibility, which is an issue greatly ameliorated by coding agents.
Yes, and then it gets pumped back to the top of the waterfall and goes through the entire process. Many organizations became so rigid that this was a problem. It is what Tom Smykowski in Office Space is a parody of. It's why you get much of the early web having things like "feature creep" and "if web designers were architects".
Waterfall failed because of politics mingled into the process; it was the worst sort of design by committee. If you want a sense of how this plays out, you simply have to look at Wayland development. The fact that it has been as successful as it is, is a testament to the will and patience of those involved.
That said, I think that the differing UIs of Cursor (in the IDE) and Claude (in the CLI) fundamentally change how you approach problems with them.
Cursor is “too available”. It’s right there and you can be lazy and just ask it anything.
Claude nudges you to think more deeply and construct longer prompts before engaging with it.
That's my experience, anyway.
There are a lot of subjective, ambiguous instructions that really won't affect what Claude writes. Remember it's not a human; it's not performing careful reasoning over each individual LOC.
Context rot is a thing (https://news.ycombinator.com/item?id=44564248 ).
As of today, you cannot squeeze a true rule system out of a single file given as context. Many of us have made this mistake at some point – believing that you can specify arbitrarily many rules and that they'll be honored.
If you really care about every such rule, you'd have to create sub-agents, one per rule, and make the agents a required part of a deterministic (non-AI orchestrated) pipeline. Then costs would explode of course.
You can slash the costs by using cheap LLMs once your workflow is stable (but pricey to run!). Fine-tuning, prompt optimization, special distillation techniques, this is a well covered area.
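To make that concrete, the pipeline I have in mind is plain deterministic code where only the per-rule check is an LLM call, roughly like this sketch (askRuleAgent is a placeholder, not a real API):

    package main

    import (
        "fmt"
        "os"
    )

    // Each rule gets its own focused sub-agent call, every run, in order.
    var rules = []string{
        "No exported function without a doc comment",
        "All errors are wrapped with context",
        "No new dependencies without approval",
    }

    // askRuleAgent sends a single rule plus the diff to a (cheap) model and
    // returns pass/fail with a reason. Placeholder only; wire up your own client.
    func askRuleAgent(rule, diff string) (bool, string) {
        return true, ""
    }

    func main() {
        diff, err := os.ReadFile("change.diff") // the change under review
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        failed := false
        for _, rule := range rules { // deterministic: every rule, every time
            ok, reason := askRuleAgent(rule, string(diff))
            if !ok {
                failed = true
                fmt.Printf("FAIL: %s\n  %s\n", rule, reason)
            }
        }
        if failed {
            os.Exit(1)
        }
        fmt.Println("all rules passed")
    }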
Sometimes I've wanted to implement it but I sense that someone else will sooner or later, putting in more resources than I could currently.
In the meantime I'm happy with vanilla CC usage.
My expectations have shifted from "magic black box" to "fancy autocomplete". i.e. CC is for me an autocomplete for specific intents, in small steps, prompted in specific terms. I do the thinking.
I do put effort in crafting good context though.
I’m still unable to get Claude Code to contribute meaningful features directly to my large web app at work. Specs will sometimes help it get close but it eventually veers off course and enters a feedback loop of bad decisions. Some of this might be attempting tasks it’s not suited well for, or perhaps my specs just aren’t precise enough, but I had enough failed attempts that I stopped trying to do anything that I’d describe as “challenging” or needing too much domain knowledge.
A friend recommended I try it for less brainy backlog tasks, especially the kinds of things I can run casually in the background and not feel too invested in. This keeps failure from being too frustrating because there’s minimal effort and success becomes a pleasant surprise.
My first attempt with this was writing Playwright tests of the large web app in a new workspace within the monorepo. It was a huge success. I explained some user experiences the way I’d walk a person through them, pointed it at a path on my dev server, and told it the process I wanted it to follow: use Playwright MCP to load the page and discover the specifics of using the feature, document execution steps, write playwright tests based on what it learned from discovery, run the tests and debug errors with Playwright MCP. I instructed it to seek out the UI code within the project and add data-testid selectors as needed. I had it write this process to a master task.md, then make more task markdown files for each feature to be tested. It was very effective. Some of the features were somewhat complex, requiring two users with two browsers interacting in non-trivial ways. Not 100% accurate and more complex features needed more contextual and code corrections, but overall it probably saved days of frustrating work.
- Essential project context and purpose
- A minimal project structure to help locate types, interfaces, and helpers
- Common commands to avoid parsing package.json repeatedly.
Regarding the specific practices mentioned:
Implementation Flow: I've noticed Claude Code often tries to write all tests at once, then implements everything when the import fails (not true TDD). To address this, I created a TDD-Guard hook that enforces one test at a time, makes sure the test fails for the right reason, and only implements the minimal code to make the test pass.
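For anyone unfamiliar with the flow the hook enforces, it is the classic loop, shown here with a made-up example: one test, confirm it fails for the right reason, then only the minimal code to make it pass.

    // slug_test.go (step 1): a single test. Run it and confirm it fails
    // because Slugify does not exist yet (the "right reason"), not because
    // of a typo or a broken import.
    package text

    import "testing"

    func TestSlugifyReplacesSpaces(t *testing.T) {
        if got := Slugify("hello world"); got != "hello-world" {
            t.Fatalf("got %q, want %q", got, "hello-world")
        }
    }

    // slug.go (step 2): the minimal implementation that makes that one test
    // pass. Lowercasing, trimming, and so on wait for their own tests.
    package text

    import "strings"

    func Slugify(s string) string {
        return strings.ReplaceAll(s, " ", "-")
    }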
Code quality: I've had good success automating these with husky, lint-staged, and commitlint. This gives deterministic results and frees up the context for more important information.
When Stuck: I agree that developer intervention is often the best path. I'm just afraid the specific guidance here might be too generic.
For anyone interested in this automated approach:
https://github.com/nizos/tdd-guard (includes example configuration)
https://github.com/typicode/husky
Some additional thoughts:
- I like to start with an ideation session with Claude in the web console. I explain the goals of the project, work through high level domain modeling, and break the project down into milestones with a target releasable goal in mind. For a small project, this might be a couple hours of back and forth. The output of this is the first version of CLAUDE.md.
- Then I start the project with Claude Code, have it read my global CLAUDE.md and the project CLAUDE.md and start going. Each session begins this way.
- I have Claude Code update the project CLAUDE.md as it goes. I have it mark its progress through the plan as it goes. Usually, at the end of the session, I will have it rewrite a special section that contains its summary of the project, how it works, and how to navigate the code. I treat this like Claude's long term memory basically. I have found it helps a lot.
- Even with good guidelines, Claude seems to have a tendency to get ahead of itself. I like to keep it focused and build little increments, as I would myself, if it is something I care about. If it's just some one-off or prototype, I let it go crazy and churn out whatever works.
I’m curious about the tool but I wonder if it requires more significant investment to be a daily driver.
Not sure about cursor. But if you want to use Claude Code daily for more than 2-3hrs/day, the $20 plan will feel limiting
In my experience, the $100 plan is pretty good, although you still run into the rate limits if you use it for a long time everyday (especially if you use Opus, which seems to run out in the first 30min of usage)
I've been using only cursor for now and I really like having it in the ide. Being able to see the diffs and accept/ reject them and navigate my codebase is really nice.
At the end of each phase, I ask claude to update my implementation plan with new context for a new instance of claude to pick it up. This way it propagates context forward, and then I can clear the context window to start fresh on the next phase.
I suspect most open source projects will go that way; it fits the needs of a single human being, and 'that' kind of software (utilities) will become throwaway code generated in a single LLM sitting.
So I looked at the code more closely and it was using the React frontend and useEffect instead of a proper game engine. It's also not great at following hook rules and understanding their timing in advanced scenarios. So now I'm prompting it to use a proper tick-based game engine and rebuilding the game up, doing code reviews. It's going 'slower' now, but it's going much better.
My goal is to make a Show HN post when I have a good demo.
# Keep a separate Claude Code config (and login) for personal use:
alias claude-personal="CLAUDE_CONFIG_DIR=~/.claude-personal claude"
I tried Mistral and the code is often buggy.
ChatGPT is a good middle ground.
All free tier only.
Yeah, people say that. I even was sitting next to some 'expert' (not him saying it; others saying it) who told me this, and we did a CC session with Opus 4 & Sonnet 4. He had this well-written, clear spec. It really didn't do even an inch better than my ad hoc shooting in of features as they came to me, in /clear contexts without CLAUDE.md. His workflow kept forgetting vital things (even though they are in the context doc), making up things that are NOT in the context doc and sometimes forbidden, etc. While I just typed stuff like "now add a crud page for invoices, first study the codebase" and got far better results. It is anecdotal obviously, but I have now written 100+ projects with Claude and, outside hooks to prevent it from overstepping, I found no flow working better than another; people keep claiming it does, but when asked to 'show me', it turns out they are spending countless hours fixing completely wrong stuff EVEN when told explicitly NOT to do things like that in CLAUDE.md.
I still don't feel the need for an agent. The writings of the loose specs is either done offline on paper, through rounds of discussions with stakeholders, and/or with a lot of reading. When I'm hit with an error while coding, that's usually a signal that I don't know something and should probably stop to learn about it.
When it comes to tweaking, fast feedback is king. I know where the knobs are and checking the adjustment should be quick. So it's mostly tests, linting, or live editing environment.
Claude Code power users, what would you say makes it superior to other agents?
The big thing with Claude Code seems to be agentic process they've baked into it.
And I don't think we have a great eval benchmark that exactly measures this capability yet. SWE Bench seems to be pretty good, but there's already a lot of anecdotal comments that Claude is still better at coding than GPT 5, despite having similar scores on SWE Bench.
https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/o... https://longbench2.github.io/
My guess for why GPT5 scores more on benchmarks is that they evaluate on well defined tasks with all instructions given at the start.
Real life is multi turn. Multiple set of prompts to adhere to. This is where Claude is likely better.
I hated at first that it wasn’t like Cursor, sitting in the IDE. Then I realised I was using Cursor completely differently, using it often for small tasks where it’s only moderately helpful (refactoring, adding small functions, autocompleting)
With Claude I have to stop, think and plan before engaging with it, meaning it delivers much more impactful changes.
Put another way, it demands more from me meaning I treat it with more respect and get more out of it
However, there are a lot of Claude Code clones out there now that are basically the same (Gemini CLI, Codex, now Cursor CLI etc.). Claude still seems to lead the pack, I think? Perhaps it’s some combination of better coding performance due to the underlying LLM (usually Sonnet 4) being fine-tuned on the agent tool calls, plus Claude is just a little more mature in terms of configuration options etc.?
Gemini's been very quick to dive in and start changing things, even when I don't want it to. But those changes almost always fall short of what I'm after. They don't run or they leave failing tests, and when I ask it to fix the tests or the underlying issue, it churns without success. Claude is significantly slower and definitely not right all the time, but it seems to do a better job of stepping through a problem and resolving it well enough, while also improving results when I interject.
Meanwhile Gemini got itself stuck in a loop of compile/fail/try to fix/compile/fail again. Eventually it just gave up and said "I'm not able to figure this out". It does seem to have a kind of self-esteem problem in these scenarios, whereas Claude is more bullish on itself (maybe not always a good thing).
Claude seems to be the best at getting something that actually works. I do think Gemini will end up being tough competition, if nothing else because of the price, but Google really need a bit of a quality push on it. A free AI agent is worthless if it can't solve anything for me.
Who has had success using Claude Code on features in older, bigger, messier projects?
Then, I explored a product feature in an existing app of mine that I also had put off because I didn't feel it was worth spending several days exploring the idea. It's something that would've required me to look up tutorials and APIs on how to do some basic things and then write some golang code which I hadn't done in a while. With Claude Code, I was able to get a prototype of the idea from a client app and a golang service working within an hour!
Today I started prototyping yet another app idea I came up with yesterday. I started off doing the core of the prototype in a couple of hours by hand and then figured I'd pull Claude in to add features on top of it. I ended up spending several hours building this idea since I was making so much fantastic progress. It was genuinely addictive.
A few days ago I used it to help me explore how I should refactor a messy architecture I ended up with. I didn't initially consider it would even be useful at all but I was wowed by how it was able to understand the design I came up with and it gave me several starting points for a refactor. I ended up doing the refactor myself just because I really wanted to be sure I understood how it worked in case something went wrong. I suspect in a few weeks, I'll get used to just pairing with Claude on something like that.
Asking the agent to perform a code review on its own work is surprisingly fruitful.
I do this routinely with its suggestions, usually before I apply them. It is surprising how often Claude immediately dumps on its own last output, talking both of us out of it, and usually with good reasons. I'd like to automate this double-take.

If you review the code and it needs a change, do you run it back through Claude with the requested changes, or do you make the changes yourself?
The system helps you build out a spec first, then uses a few subagents which are tuned for placing files, reviewing for best practice, etc.
I've been using it for about a week and about 70% of my Claude Code usage runs through /feature right now.
The nice thing is you can give it a _lot_ of requests and let it run for 10-15 minutes without interruption. Plus, it makes a set of planning documents before it implements, so you can see exactly what it thought it was changing.
Some of the language is geared toward a Laravel project and it’s a composer package but the ideas are pretty general.
I bet it could be generalized!
I need to take the time to see what the Laravel side of this is so perhaps I can adapt for my RN app too.
If you happen to catch it and you're quick to hit "esc" and just tell it to find a simpler solution, it's surprisingly great at reconsidering, resolving the issue simply, and picking up where it left off before the blunder.
I started with an idea but no spec. I got it to a happy place I can deploy yesterday. Spent around $75 on tokens. It was starting to feel expensive towards the end.
I did wonder if I had started with a clearer specification could I have got there quicker and for less money.
The thing is though, looking back at the conversations I had with it, the back and forth (vibe coding, I guess) helped me refine what I was actually after, so I'm in two minds about whether a proper tight specification upfront would have been the best thing.
For research, investigation, and proof of concept, it is good to be flexible and a bit imprecise.
But once a path seems clear, writing a single detailed document (even with “help”) is valuable before working with a separate AI assistant.
The challenge is recognizing that transition point. It’s very easy to just meander from zero to sort-of-product without making this separation.
If you're using an AI for the "architecture" / spec phase, play a few of the models off each other.
I will start with a conversation in Cursor (with appropriate context) and ask Gemini 2.5 Pro to ask clarifying questions and then propose a solution, and once I've got something, switch the model to O3 (or your other preferred thinking model of choice - GPT-5 now?). Add the line "please review the previous conversation and critique the design, ask clarifying questions, and propose alternatives if you think this is the wrong direction."
Do that a few times back and forth and with your own brain input, you should have a pretty robust conversation log and outline of a good solution.
Export that whole conversation into an .md doc, and use THAT in context with Claude Code to actually dive in and start writing code.
You'll still need to review everything and there will still be errors and bad decisions, but overall this has worked surprisingly well and efficiently for me so far.
One tip I picked up from a video recently to avoid sycophancy was to take the resulting spec and instead of telling the reviewing LLM "I wrote this spec", tell it "an engineer on my team wrote this spec". When it doesn't think it's insulting you, it tends to be a bit more critical.
Yep, that’ll do it.
I personally really like to use Claude Code together with Zen MCP https://github.com/BeehiveInnovations/zen-mcp-server to analyse existing and review fresh code with additional eyes from Gpt5 and Gemini.
And do it very step by step, in what would equate to a tiny PR that gradually rolls out the functionality. Too big and I find lots of ugly surprises and bugs and reorganizations that don’t make sense.
But everything is better if it can close the loop. So I instead instruct it to always use the puppeteer tool to launch the app and use some test credentials and see if the functionality works.
That's for a web app but you can see how you can do this for other things. Either unit tests, integration tests, or the appropriate MCP.
It needs to see what it's done and observe the resulting world. Not just attempt to reason to it.
Claude also leans towards what it's good at. Repetition costs it nothing, so it doesn't mind implementing the same thing 5 times. One thing it did when I started was implement a sidebar on every page rather than using a component. So you need to provide some pressure against that with your prompts, or at least force it to refactor at the end.