If you look at the past, whenever Google announces something major, OpenAI almost always releases something as well.
People forget realize that OpenAI was started to compete with Google on AI.
In my experience it take weeks if not months to coordinate a release, from testing to documentation to drafting press releases in multiple languages to benchmarks and website updates.
I’m old and I’ve been in this industry most of my life. I have never once seen or heard of all of that work being done and the company just waiting on competitors before pulling the trigger.
Economics is important. Best bang for the buck seems to be OpenAI ChatGPT 4.1 mini[6]. Does a decent job, doesn't flood my context window with useless tokens like Claude does, API works every time. Gets me out of bad spots. Can get confused, but I've been able to muddle through with it.
1: https://openrouter.ai/anthropic/claude-opus-4.1
2: https://openrouter.ai/anthropic/claude-sonnet-4
3: https://block.github.io/goose/
4: https://openrouter.ai/anthropic/claude-3.5-sonnet
I find the token/credit restrictions on Opus to be near useless even when using Claude Code. I only ever switch to it so get another model's take on the issue. Five minutes of use and I have hit the limit.
We have the $200 plans for work and despite only using Opus, we rarely hit the limits. CCUsage suggests the same via API would have been ~$2000 over the last month (we work 5 hours a day, 4 days a week, almost always with Claude).
Same context length and throughput limits?
Anecdotally I find gpt4.1 (and mini) were pretty good at those agentic programming tasks but the lack of token caching made the costs blow up with long context.
Unfortunately there's no easy tool to inspect usage. I started a project to parse the Claude logs using Claude and generate a Chrome trace with it. It's promising but it was taking my tokens away from my core project.
Do you mostly use opus?
It uses way less tokens or much more effectively when running locally.
Also there's a cli argument that lets you specify the model. try `claude --help`.
Maybe I'm out of touch, but I'm not handing out my phone number to sign up for random SaaS tools.
I wouldn't be surprised if asking for a phone number lowers the fraud rate enough to compensate for the added friction.
[0] Incidentally, this is also why many AI API providers ask for your money upfront (buy credits) unless you're big enough and/or have existing relationship with them.
There's additional storage costs with google caching, around $3.75 for 5 minutes/Mtok, and Claude Opus is $3.75 for 5minute Cache Writes / Mtok.
For cached reads Gemini Pro is 5X cheaper than Opus and like $0.01 more than Sonnet.
E.g. if need a self-contained script to do some data processing, for example, Opus can often do that in one shot. 500 line Python script would cost around $1, and as long as it's not tricky it just works - you don't need back-and-forth.
I don't think it's possible to employ any human to make 500 line Python script for $1 (unless it's a free intern or a student), let alone do it in one minute.
Of course, if you use LLM interactively, for many small tasks, Opus might be too expensive, and you probably want a faster model anyway. Really depends on how you use it.
(You can do quite a lot in file-at-once mode. E.g. Gemini 2.5 Flash could write 35 KB of code of a full ML experiment in Python - self-contained with data loading, model setup training, evaluation, all in one file, pretty much on the first try.)
Small models are for querying the context
Opus is cheap if you use it for its niche
> Small models are for querying the context
I respectfully disagree.
My experience is that large models are capable of understanding large contexts much better. Of course they are more expensive and slower, too. But in terms of accuracy, large models are always better at querying the context.
It's still pretty much impossible to have any LLM one-shot a complex implementation. There's just too many details to figure out and too much to explain for it to get correct. Often, there's uncertainty and ambiguity that I only understand the correct answer (or rather less bad answer) after I've spent time deep in the code. Having Opus spit out a possibly correct solution just isn't useful to me. I need to understand _why_ we got to that solution and _why_ it's a correct solution for the context I'm working in.
For me, this means that I largely have an iteratively driven implementation approach where any particular task just isn't that complex. Therefore, Sonnet is completely sufficient for my day-to-day needs.
It's very, very helpful. However, there are still a lot of problems I only discover/figure out after I've been working in the code.
I started adding an instruction file along the lines of "Always tell me your plan to solve the issue first with short example code, never edit files without explicit confirmation of your plan" at the start and it is like a day and night difference in how useful it becomes. It also starts to feel like programming again where you can read through various files and instead of thinking in your head, you write out your thoughts. You end up getting confirmation or push back on errors that you can clean up.
Reading through a sort of wrong sort of right implementation spread across various files after every prompt just really sucked.
I'm not one shotting massive amounts of files, but I am enjoying the lack of grunt work.
A major part of software engineering is identifying and resolving issues during implementation. Plans are a good outline of what needs to be done, but they're always incomplete and inaccurate.
Maybe Opus just is better
Subagents seem pretty similar to using zen mcp w/ OpenRouter but maybe better or at least more turnkey? I'll be checking them out.
Interestingly I found that prompting it to ask the o3 submodel (which they call The Oracle) to check Sonnet's working on a debugging solution was helpful. Extra interesting to me was the fact that Sonnet appeared to do a better job once I'd prompted that (like chain of thought prompting, perhaps asking it to put forward an explanation to be checked actually triggered more effective thinking).
Example: you need to review some code to see if it has proper test coverage.
If you use the "main" context, it'll waste tokens on reading the codebase and running tests to see coverage results.
But if you launch an agent (a subprocess pretty much), it can use a "disposable" context to do that and only return with the relevant data - which bits of the code need more tests.
Now you can either use the main context to implement the tests or if you're feeling really fancy launch another sub-agent to do it.
can look at primal check the mean or dual get out of local minima
in all cases, model, tokenizer, etc is just enough different that will generally pay off in spaces quickly
Given that there’s nothing close to scientific analysis going on, I find it hard to tell how big the “Sonnet is overall better, not just sometimes” crowd is. I think part of the problem is that “The bigger model is better” feels obvious to say, so why say it? Whereas “the smaller model is better actually” feels both like unobvious advice and also the kind of thing that feels smart to say, both of which would lead to more people who believe it saying it, possibly creating the illusion of consensus.
I was trying to dig into this yesterday, but every time I come across a new thread the things people are saying and the proportions saying what are different.
I suppose one useful takeaway is this: If you’re using Claude Max and get downgraded from Opus to Sonnet for a few hours, you don’t have to worry too much about it being a harsh downgrade in quality.
I stick with Sonnet for most things because it's generally good enough and I hit my token limits with it far less often.
Opus gives you a bit more rope to hang yourself with imo. Yes, it "thinks" slightly better, but still not good enough to me. But it can be good enough to convince you that it can do the job.. so i dunno, i almost dislike it in this regard. I find Sonnet just easier to predict in this regard.
Could i use Opus like i do Sonnet? Yes definitely, and generally i do. But then i don't really see much difference since i'm hand-holding so much.
I use Opus exclusively and don't hit limits. ccusage reports I'm using the API-equivalent of $2000/mo
But I totally agree there's no way it lasts. I'm mostly only using this for side projects and I'm sitting there interacting with it, not YOLO'ing, I do sometimes have two sessions going at the same time but I'm not firing off swarms or anything crazy. Just have it set to Opus and I chat with it.
I don't believe anyone saying Sonnet yields better results than Opus though, as my experience has been exactly the opposite. But trade-off wise, I can definitely see it being a better experience when used interactively because of its speed and lower cost.
E.g. prompt to read a paper, read some source, then write out a terse document meant to be read by machine not human.
Then switch to Sonnet, have it read that document, and do the actual implementation work.
It's my experience that Opus is better at solving architectural challenges where sonnet struggles.
So this release might change that consensus? If you believe the benchmarks are reflective of reality anyways.
That's a big "if." But yeah, I can't tell a difference subjectively between Opus and Sonnet, other than maybe a sort of placebo effect. I'm more careful to write quality prompts when using Opus, because I don't want to waste the 5x more expensive tokens.
Sonnet is great at banging it out.
(He had been stuck in the Team Rocket hideout (I believe) for weeks)
When can we replace doctors with it?
At least Sonnet 4 is still usable, but I'll be honest, it's been producing worse and worse slob all day.
I've basically wasted the morning on Claude Code when I should've just been doing it all myself.
I get that it's not an easy problem to solve, but how is Anthropic supposed to solve the actual alignment problem if they can't even stop their production LLMs from glazing the user all the time? And OpenAI is somehow even worse.
I expect to be completely blown away by GPT-5 in the first few days and then over time I will figure out the limitations of the model. Then I will be less impressed because you don't know what it can't do at first.
I do agree it did hit the token limit a lot quicker than before where I could chat for hours without worrying about it.
Either way, still have one last yak to shave for this project so we'll see how efficient it is with that. If it accomplishes the task before burning through all the tokens then win, win, I suppose.
Welcome to the machine
Sonnet 4 has definitely been the best model for our product's use case, but I'd be interested in trying Haiku 4 (or 4.1?) just due to the cost savings.
I'm surprised Anthropic hasn't mentioned anything about Haiku 4 yet since they released the other models.
> Windsurf reports Opus 4.1 delivers a one standard deviation improvement over Opus 4 on their junior developer benchmark, showing roughly the same performance leap as the jump from Sonnet 3.7 to Sonnet 4.
I.e. it seems we don't get much more than new training run levels of improvement anymore. Which is better than nothing, but a shame compared to the early scaling.
Instead, ideally they’d run the benchmark tests many times, and share all of the results so we could make statistical determinations.
> We plan to release substantially larger improvements to our models in the coming weeks.
Let's see: we have Claude Code vs. Claude the API vs. Claude the website, and they're totally different from each other? One is command line, one integrates into your IDE (which IDE?) and one is just browser based, I guess. Then you have the different pricing plans, Free, Pro, and Max? But then there's also Claude Team and Claude Enterprise? These are monthly plans that only work with Claude the Website, but Claude Code is per-request? Or is it Claude API that's per-request? I have no idea. Then you have the models: Claude Opus and Claude Sonnet, with various version numbers for each?? Then there's Cline and Cursor and GOOD GRIEF! I just want to putz around with something in VSCode for a few hours!
Fwiw I have a Claude pro plan and have no interest in using other offerings so I'm not sure if they're super simple (one model, one interface, one pricing plan)?
It's almost as if companies sell more than one product.
Why is this the top comment on so many threads about tech products?
And of course you could be doing it right but the people saying it works great could themselves be wrong about how good it is.
On top of that it costs both money and time/effort investment to figure out if you're doing it wrong. It's understandable to want some clarity. I think it's pretty different from buying shoes.
Shoe shopping is pretty complex, more so than trialing an AI model in my opinion.
Are you a construction worker, a banker, a cashier or a driver? Are you walking 5 miles everyday or mostly sedentary? Do you require steel toed shoes? How long are you expecting them to last and what are you willing to pay? Are you going to wear them on long runs or take them river kayaking? Do they need to be water resistant, waterproof or highly breathable? Do you want glued, welted, or stitch down construction? What about flat feet or arch support? Does shoe weight matter? What clothing are you going to wear them with? Are you going to be dancing with them? Do the shoes need a break in period or are they ready to wear? Does the available style match your preferences? What about availability, are you ok having them made to order or do you require something in stock now?
By comparison I can try 10 different AI services without even needing to stand up for a break while I can't buy good dress shoes in the same physical store as a pair of football cleats.
Oh c'mon, now you're just being disingenuous, trying to make an argument for argument's sake.
No, shoe shopping is not more complicated than trialing a LLM. For all of those questions about shoes you are posing, either a) a purchaser won't care and won't need to ask them, or b) they already know they have specific requirements and will know what to ask.
With an LLM, a newbie doesn't even know what they're getting into, let alone what to ask or where to start.
> By comparison I can try 10 different AI services without even needing to stand up for a break
I can't. I have no idea how to do that. It sounds like you've been following the space for a while, and you're letting your knowledge blind you to the idea that many (most?) people don't have your experience.
Maybe there's a need to try ten different ones but I just stuck with one and can now convince it to do what I want it to do pretty successfully.
Here's a quick guide to get you started with AI coding assistants:
## Quick Start Options (Easiest)
*1. Web-based (Nothing to Download)* - *Claude.ai* - You're here! I can help with code, debug, explain concepts - *ChatGPT* - Similar capabilities, different model - *GitHub Copilot Chat* - Web interface if you have GitHub account
*2. IDE Extensions (Most Popular)* - *Cursor* - Full VS Code replacement with AI built-in. Download from cursor.com, works out of the box - *GitHub Copilot* - Install as VS Code/JetBrains extension ($10/month), autocompletes as you type - *Continue* - Free, open-source VS Code extension, lets you use multiple models
*3. Command Line* - *Claude Code* - Anthropic's terminal tool for autonomous coding tasks. Install via `npm install -g @anthropic-ai/claude-code` - *Aider* - Open-source CLI tool that edits files directly
## What They Do
- *Autocomplete tools* (Copilot, Cursor) - Suggest code as you type, finish functions - *Chat tools* (Claude, ChatGPT) - Explain, debug, design systems, write full programs - *Autonomous tools* (Claude Code, Aider) - Actually edit your files, make changes across codebases
## My Recommendation to Start
1. Try *Cursor* first - download it, paste in some code, and ask it questions. It's the most beginner-friendly 2. Or just start here in Claude - paste your code and I can help debug, explain, or write new features 3. Once comfortable, try GitHub Copilot for in-line suggestions while coding
The key is just picking one and trying it - you don't need to understand everything upfront!
Maybe the problem is I don't take shoes seriously enough? Something to work on...
If you allow yourself to be a novice and a learner with AI and LLMs and don't expect to start out as a "shoe expert" where you never even think about this in your life and it's not even an annoyance, you'll find that it's the exact same journey.
But for someone who hasn't been immersed in the "LLM scene", it's hard to understand why you might want to use one particular model of another. It's hard to understand why you might want to do per-request API pricing vs. a bucketed usage plan. This is a new technology, and the landscape is changing weekly.
I think maybe it might be nice if folks around here were a bit more charitable and empathetic about this stuff. There's no reason to get all gatekeep-y about this kind of knowledge, and complaining about these questions just sounds condescending and doesn't do anyone any good.
Because you overestimate the difference that the representative person understands.
A more accurate analogy is that Nike sells green-blue shoes and Nike sells blue-green shoes, but the blue-green shoes add 3 feet to your jump and green-blue shoes add 20 mph to your 100 yard dash sprint.
You know you need one of them for tomorrow's hurdles race but have no idea which is meaningful for your need.
With all this LLM cruft all you get is essentially the same old chat interface that's like the year 2000 called and wants its on-line chat websites back. The only thing other than a text box that you usually get is a model selector dropdown squirreled away in a corner somewhere. And that dropdown doesn't really explain the differences between the cryptic sounding options (GPT-something, Claude Whatever...). Of course this confuses people!
What you're looking for, are the landing pages of the B2B API products underlying these B2C experiences. That would be https://www.anthropic.com/claude, https://openai.com/api/, etc. (In general, search "[AI company] API".)
From those B2B landing pages, you can usually click through to pages with details about each of their models.
Here's the model page corresponding to this news announcement, for example: https://www.anthropic.com/claude/opus
(Also, note how these B2B pages are on the AI companies' own corporate domains; whereas their B2C products have their own dedicated domains. From their perspective, their B2C offerings are essentially treated as separate companies that happen to consume their APIs — a "reference use-case" — rather than as a part of what the B2B company sells.)
I do know the answer to OP's question but that's because I pickle my brain in this stuff. It is legitimately confusing.
The analogy to different SKUs strikes me also inaccurate. This isn't the difference between shoes, shirts, and shorts - it's more as if a company sells three t-shirts but you can't really tell what's different about them.
It's Claude, Claude, and Claude. Which ones code for you? Well, actually, all of them (Code, web/desktop Claude, and the API can all do this)
Which ones do you ask about daily sundry queries? Well, two of them (web/desktop Claude, but also the API, but not Code). Well, except if your sundry query is about a programming topic, in which case Code can also do that!
Ok, if I do want to use this to write code, which one should I use? Honestly, any of them, and the company does a poor job of explaining why you would use each option.
"Which of these very similar-seeming t-shirts should I get?" "You knob. How are posts like this even being posted." is just an extremely poor way to approach other people, IMO.
Thanks for articulating the confusion better than I could! I feel it's a similar branding problem as other tech companies have: I'm watching Apple TV+ on my Apple TV software running on my Apple TV connected to my Google TV that isn't actually manufactured by Google. But that Google TV also has an Apple TV app that can play Apple TV+.
I'm not sure if you ever got a good rundown, but the tl;dr is that the 3 products ("Desktop", Code, and API) all expose the same underlying models, but are given different prompts, tools, and context management techniques that make them behave fairly differently and affect how you interact with them.
- The API is the bare model itself. It has some coding ability because that's inherent to the model - you can ask it to generate code and copy and paste it for example. You normally wouldn't use this except that if you're using some Copilot-type IDE integration where the IDE is doing the work of talking to the model for you and integrating it into your developer experience. In that case you provide API key and the IDE does the heavy lifting.
- The desktop app is actually a half-decent coder. It's capable of producing specific artifacts, distinguishing between multiple "files" it's writing for you, and revisiting previously-written code. "Oh, actually rewrite this in Go." is for example a thing it can totally do. I find it useful for diagnosing issues interactively.
- "Claude Code" is a CLI-only wrapper around the model. Think of it like Anthropic's first-party IDE integration, except there's not an IDE, just the CLI. In this case the integration gives the tool broad powers to actually navigate your filesystem, read specific files, write to specific files, run shell commands like builds and tests, etc. These are all functions that an IDE integration would also give you, but this is done in a Claude-y way.
My personal take is: try Claude Code, since as long as you're halfway comfortable with a CLI it's pretty usable. If you really want a direct IDE integration you can go with the IDE+API key route, though keep in mind that you might end up paying more (Claude Code is all-you-can-eat-with-rate-limits, where API keys will... just keep going).
And to some extent it is like the PC race. Imagine going to work and writing software for whatever devices your company writes software for in whatever toolchain your company uses. Then 2-3 years after the PC race began heating up, asking "Hey I only really write code for whatever devices my employer gives me access to. Now I want to buy one of these new PCs but I don't really understand why I'd choose an Intel over a Motorolla chipset or why I'd prioritize more ROM or more RAM, and I keep hearing about this thing called RISC that's way better than CISC and some of these chips claim to have different addressing modes that are better?"
Contrast to something like OpenAI. They've got gpt4.1, 4o, and o4. Which of these are newer than one another? How do people remember which of o4 and 4o are which?
At least with those you can buy whatever you think is coolest. Which Claude model and interface should the average programmer use?
That's a silly claim to me, we're talking about a completely new environment where you prompt an AI to develop code, and therefore an "average programmer" is unlikely to have any meaningful experience or intuition with this flow. That is exactly what GP is talking about - where does he plug in the AI? What tradeoffs are there to different options?
The other day I had someone judge me for asking this question by dismissively saying "dont say youve still been using ChatGPT and copy/paste", which made me laugh - I don't use AI at all, so who was he looking down on?
And it seems the story you shared sort of proves the point: the web interface worked fine for you and you didn't need to question it until someone was needlessly rude about it.
In what way is this analogous? Running scripts is vastly different than AI codemod. I could easily answer how when and why a build system would be plugged in, and linting and formatting are long-established pathways.
On the flipside there are barely even established practices, let alone best ones, for using AI. The point being offered is that AI companies offer shockingly little guidance on how to use their apparently amazing tool.
I personally have never used AI to author code, so I don't really know how the story I provided proves anything to you. I like it to answer questions about why something isn't working to help give me some leads, and it is good at telling you how to use a new framework quickly, but that's a pretty different practice than it authoring code. Seems like you're kinda dodging the question too.
It's not like running a tool in your IDE or CLI where the only difference is the interface. It would be like if gcc ran from your IDE had faster compile times, but gcc run from the CLI gives better optimizations.
The fact that no one is recommending any baseline to start with proves the point that it's confusing. And we haven't even touched on Sonnet v Opus
I absolutely loathe this timeline we're stuck in.
Or maybe that's me, but still whether its through the likes of those vibe coding apps like lovable bolt etc.
at the end of the day, Most people are using the same tool which is claude since its mostly superior in coding (questionable now with oss models, but I still use it through kiro).
People expect this stuff to be simple when in reality its not and there is some frustation I suppose.
You're comparing well understood products that are wildly different to products with code names. Even someone who has never wore a t-shirt will see it on a mannequin and know where it goes.
I'm sorry but I cannot tell what the difference is between sonnet and opus. Unless one is for music...
So in this case you read the docs. Which is, in your analogy, you going to the Nike store and reading up on if a tshirt goes on your upper or lower body.
It's more like going to the Nike store and asking about the difference between the Vaporfly 3 and the Pegasus 41. I know they're all shoes and therefore go on my feet, but I don't know what the difference is unless one is better for riding horses?
This is a well-known and documented phenomenon - the paradox of choice.
I've been working in machine learning and AI for nearly 20 years and the number of options out there is overwhelming.
I've found many of the tools out there do some things I want, but not others, so even finding the model or platform that does exactly what I want or does it the best is a time-consuming process.
Claude Code is currently best-in-class, so no point in starting elsewhere, but you do need to read the documentation.
Actually, to try it out, prepaid token billing is fine. You are not required to have a subscription for claude code cli. Even just $5 gave me enough breathing room to get a feeling for its potential, personally. I do not touch code often these days so I was relieved not to have to subscribe and cancel again just to play around a little and have it write some basic scripts for me.
I haven't tried it myself, but I've heard from people that Opus can be slow when using it for coding tasks. I've only been using Sonnet, and it's performed well enough for my purposes.
I prefer configuring it to use Sonnet for things that don't require much reasoning/intelligence, with Opus as the coordinator.
Anthropic has this useful quick start guide: https://docs.anthropic.com/en/docs/claude-code/quickstart
My use case so far is usually requesting mechanic work I would rather describe than write myself like certain test suites, and sometimes discovery on messy code bases.
If you like an IDE, for example VS Code you can have the terminal open at the bottom and run Claude Code in that. You can put your instructions there and any edits it makes are visibile in the IDE immediately.
Personally I just keep a separate terminal open and have the terminal and VSCode open on two monitors - seems to work OK for me.
But I would recommend just starting using Claude in the browser, talk through an idea for a project you have and ask it to build it for you. Go ahead and have a brain storming session before you actually ask it to code - it'll help make sure the model has all of the context. Don't be afraid to overload it with requirements - it's generally pretty good at putting together a coherent plan. If the project is small/fits in a single file - say a one page web app or a complicated data schema + sql queries - then it can usually do a pretty good job in one place. Then just copy+paste the code and run it out of the browser.
This workflow works well for exploring and understanding new topics and technologies.
Cursor is nice because it's an AI integrated IDE (smoother than the VSCode experience above) where you can select which models to use. IMO it seems better at tracking project context than Gemini+VSCode.
Hope this helps!
Claude code is actually one of the most straightforward products I've used as far as onboarding goes. You download the tool, and follow the instructions. You can use one of the 3 plans, and everything else is automatic. You can figure out token usage and what models and versions to use and how to use MCP servers and all of that -- there's a lot of power -- but you don't need to do ANY of that to get started trying it out.
You're not being:
> That critic who doesn't try the stuff he criticizes
You're being:
> That critic who is trying to confirm their biases
Create a new directory in your terminal
Open that directory, type in "Claude" to run Claude
Press Shit + Tab to go into planning mode
Tell Claude what you want to build - recommend something simple to start with. Specify the languages, environment, frameworks you want, etc.
Claude will come up with a plan. Modify the plan or break it into smaller chunks if necessary
Once plan is approved, ask it to start coding. It will ask you for permissions and give you the finished code
It really is something when you actually watch it go.
It is actually one of my most useful use cases of this tech. Nice to have a way to ask in private so you don’t get snarky answers like: it’s just like buying shoes!
Cursor imports in your VSCode setup. Setting it up should be trivial.
Use Agent mode. Use it in a preexisting repo.
You're off the races.
There is a lot more you can do, but you should start seeing value at this point.
Agree that the offering is a bit confusing and it's hard to know where to start.
Just FYI: Claude Code is a terminal-based app. You run it in the working directory of your project, and use your regular editor that you're used to, but of course that means there's no editor integration (unlike something like Cursor). I personally like it that way, but YMMV.
1) Completely separate in your mind the auto-completion features from the agentic coding features. The auto-completion features are a neat trick but I personally find those to be a bit annoying overall, even if they sometimes hit it completely right. If I'm writing the code, I mostly don't want the LLM autocompletion.
2) Pay the $20 to get a month of Claude Pro access and then install Claude Code. Then, either wait until you have a small task in mind or your stuck on some stupid issue that you've been banging your head on and then open your terminal and fire up Claude Code. Explain to it in plain English what you want it to do. Pretend it's a colleague that you're giving a task to over Slack. And then watch it go. It works directly on your source code. There is no copying and pasting code.
3) Bookmark the Claude website. The next time you'd Google something technical, ask it Claude instead. General questions like "how does one typically implement a flizzle using the floppity-do framework"? "I'm trying to accomplish X, what are my options when using this stack?". General questions like that.
From there you'll start to get it and you'll get better at leverage the tool to do what you want. Then you can branch out the rest of the tool ecosystem.
"The card game state is a structure that contains a Deck of cards, represented by a list of type Card, and a list of Players, each containing a Hand which is also a list of type Card, dealt randomly, round-robin from the Deck object." I could have input the data structure and logic myself in the amount of time it took to describe that.
Also, I don't remember what model Copilot uses by default, especially the free version, but the model absolutely makes a difference. That's why I say to spend the $20. That gives you access to Sonnet 4 which is where, imo, these models took a giant leap forward in terms of quality of output.
I hope when I state it that way you start to realize the error in your thinking process. You don't send trivial tasks to the GPU because the overhead is too high.
You have to experiment and gain experience with agent coding. Just imagine that there are tasks where the overhead of explaining what to do and reviewing the output are dwarfed by the actual implementation. You have to calibrate yourself so you can recognize those tasks and offload them to the agent.
But not too general, because then it can get lost in the sauce and do something profoundly wrong.
IMO it's worth the effort to know these tools, because once you have a more intuitive sense for the right level of abstraction it really does help.
So not "make this very basic data structure for me based on my specs", and more like "rewrite this sequential logic into parallel batches", which might take some actual effort but also doesn't require the model to make too many decisions by itself.
It's also pretty good at tests, which tends to be very boilerplate-y, and by default that means you skip some cases, do a lot of brain-melting typing, or copy-and-paste liberally (and suffer the consequences when you missed that one search and replace). The model doesn't tire, and it's a simple enough task that the reliability is high. "Generate test cases for this object, making sure to cover edges cases A, B, and C" is a pretty good ROI in terms of your-time-spent vs. results.
I just googled "using claude from vscode" and the first page had a link that brought me to anthropic's step by step guide on how to set this up exactly.
Why care about pricing and product names and UI until it's a problem?
> Someone on HN told me Copilot sucks, use Claude.
I concur, but I'm also just a dude saying some stuff on HN :)
If you want to understand how all of this works, the best way is to build a coding agent manually. Its not that hard
1. Start with Ollama running locally and Gemma3 QAT models. https://ollama.com/library/gemma3
2. Write a wrapper around Ollama using your favorite language. The idea is that you want to be able to intercept responses coming back from the model.
3. Create a system prompt that tells the model things like "if the user is asking you to create a file, reply in this format:...". Generally to start, you can specify instructions for read file, write file, and execute file
4. In your wrapper, when you send the input chat prompt, and get the model response back, you look for those formats, and make the wrapper actually execute the action. For example if the model replies back with the format to read file, you read the file from your wrapper code and send it back to the model.
Every coding assistant is basically this under the hood with just a lot more fluff and their own IDE integration.
The benefit of doing your own is that you can customize it to your own needs, and when you direct a model with more precision even the small models perform very well with much faster speed.
Github Copilot is autocomplete, highly useful if you use VS Code, but if you are using e.g. Jetbrains then you have other options. Copilot comes with a bunch of other stuff that I rarely use.
Claude code is project-wide editing, from the CLI.
They complement each other well.
As far as I'm concerned the utility of the AI-focused editors has been diminished by the existence of Claude code, though not entirely made redundant.
Copilot isn't locked to a specific LLM, though. You can select the model from a panel, but I don't think you can plug in your own right now, and the ones you can select might not be SOTA because of that.
For single-line autocomplete, which is how I use it, pretty much anything will do the job. I use Copilot only because it integrates well with VS Code. I find the other features to be inferior.
When it works, it's great though. I've used it to vibe-code some nice little desktop apps to automate things I needed and it produced way more polished UI than I would have spent the time doing, and the code is pretty much how I would have written it myself. I just set it going and go do some other task for 10 mins and come back to see what changes it made.
That bunch of other stuff includes the chat, and more recently "Agent Mode". I find it pretty useful, and the autocomplete near useless.
[1] https://platform.openai.com/docs/guides/flex-processing?api-...
I'm talking multiple tries of claude 4 opus, Gemini 2.5 pro, o3 etc resulting in sometimes hundreds of lines of code.
Versus o3-pro (very slowly) analyzing and then fixing something that seemed completely unrelated in a one or two line change and truly fixing the root cause.
o3-pro level LLMs at reduced cost and increased speed will already be amazing..
Which it does a lot...
I've used Aider for a while, and I kind of liked if, but it felt like it needed way more manual work, and I also want to use different models, probably locally hosted. Haven't used Aider in 2 or 3 months, so I don't know if it already has evolved in that way...
edit: in the other hand, the automatic feedback loop means it sometimes go very crazy and the API costs skyrocket easily. But maybe that's another reason to run it locally.
there's also claude-code-proxy to make claude code use other models.
I uploaded a web design of mine (jpeg) and asked Claude to create the html/css. Asked GPT to do the same. GPT's code looked the closet to the design I created and uploaded. Just five to ten small tweaks and I was done vs. Claude it would have taken me almost triple the steps.
I actually subscribed to both today (resubscribed to GPT) and going to keep testing which one is the better front-end developer (i am, but got to embrace AI ).
These benchmark gains aren't that high, so I doubt it is that obvious.
One obvious explanation is that pricing is strongly related to the price to them, and that their only incentive is for people to use an expensive model of they really need it.
I forget which one of the GPT models was better, faster, and cheaper than the previous model. The incentive there is obviously, "If you want to use the old model for whatever reason, fine, but we really want you to use the new one because costs us less to run."
LLMs are non-deterministic, I think benchmarks should be more about averages of N runs, rather than single shot experiments.
Claude Mad is tens of hours of opus a month, or you can pay per token and have unlimited.
Or did you mean “I wish it was cheaper”?
I'm outputting a PR every 6 minutes. The reviewers are using Claude to review everything. It used to take a day to add 100 lines to the codebase.. now I can add 100 lines in one prompt
If I want even more productivity (at risk of making the rest of my team look slow) I can tell Claude to output double the lines and ship it off for review. My performance metrics are incredible
What's 100x productivity multiplied by 100 instances of Claude? 10,000x productivity
Now to be fair and a bit more realistic it's not actually 10000x because it takes longer to push the PR because the file sizes are so big. Let's call it 9800x. That's still a sizable improvement
It's not 10x, but those guys do seem like they've hit somewhere around 2x improvement overall.
It's not always a literal 10x time for taskA w/ AI vs taskA w/o AI...
Because I've found it to work pretty amazingly for things that don't need to be exact (like data modeling) or don't have any security implications (public apps). But for everything else I end up having to find all the little bugs by reading the code line by line, which is much slower than just writing the code in the first place.
My current bottleneck is having to review the huge amounts of code that these models spit out. I do TDD, use auto-linting and type-checking.... but the model makes insidious changes that are only visible on deep inspection.
We're all bottlenecked on reviewing now. That's a good thing.
Lapses of judgement and syntax errors happen, but they're easier to spot because you know exactly what you're looking at. When code is written by a model, I have to review it 3 times.
1st to understand the code. 2nd to identify lapses in suspicious areas. 3rd to confirm my suspicions through interactive tests, because the model can use patterns I'm unfamiliar with, and it takes me some googling to confirm if certain patterns used by the model are outright bugs or not. The biggest time sink is fixing an identified bug, because now you're doing it in someone-else's (model's) legacy code rather than a greenfield feature implementation.
It's a big productivity bump. But, if reviewing is the bottleneck, then that upper bounds the productivity gains at ~4x for me. Still incredible technology, but the death of software-engineering that it is claimed to be.
What the point of these?
Kind of interesting that we live in an area of AI super advanced, but still make basic UI/UX mistake. The tagline of this blog post shouldn't be "1 min read".
It's not even accurate. I timed myself not reading fast but not slow, took me 3 min 30s. Maybe the images need be OCRed to make the estimation more accurate.
It's making really stupid errors and I have to work three times as much to get the same results as last week.
This makes them (Anthropic) worse than OpenAI in terms of openness.
Since in this case as we all know. [0]
"What will permanently change everything is open source and transparent AI models that are smaller and more powerful than GPT-3 or even GPT-4."
They might not fit your personal definition of "openness", but they do fit many other equally valid interpretations of that contept.