To be clear, since this confuses a lot of people in every thread: Anthropic will let you use their API with any coding tools you want. You just have to go through the public API and pay the same rate as everyone else. They have not "blocked" or "banned" any coding tools from using their API, even though a lot of the clickbait headlines have tried to insinuate as much.
Anthropic never sold subscription plans as being usable with anything other than their own tools. They were specifically offered as a way to use their own apps for a flat monthly fee.
They obviously set the limits and pricing according to typical use patterns of these tools, because the typical users aren't maxing out their credits in every usage window.
Some of the open source tools reverse engineered the protocol (which wasn't hard) and people started using the plans with other tools. This situation went on for a while without enforcement until it got too big to ignore, and they began protecting the private endpoints explicitly.
The subscription plans were never sold as a way to use the API with other programs, but I think they let it slide for a while because it was only a small number of people doing it. Once the tools started getting more popular they started closing loopholes to use the private API with other tools, which shouldn't really come as a surprise.
Problem is, most people don't do this, choosing convenience in any given moment without thinking about the longer-term impact. This hurts us collectively by letting governments, companies, etc. tighten their grip over time. This comes from my lived experience.
I agree anticompetitive behavior is bad, but the productivity gains to be had by using Anthropic models and tools are undeniable.
Eventually the open tools and models will catch up, so I'm all for using them locally as well, especially if sensitive data or IP is involved.
I can't comment on Opus in CC because I've never bitten the bullet and paid for the subscription, but I have worked my way up to the $200/month Cursor subscription, and the 5.2 Codex models blow Opus out of the water in my experience (obviously very subjective).
I arrived at making plans with Opus and then implementing with the OpenAI model. The speed of Opus is much better for planning.
I'm willing to believe that CC/Opus is truly the overall best; I'm only commenting because you mentioned Cursor, where I'm fairly confident it's not. I'm basing my judgement on "how frequently does it do what I want the first time".
I've tried explaining the implementation word for word and it still prefers to create a whole new implementation, reimplementing some parts, instead of just doing what I tell it to. The only time it works is if I actually give it the code, but at that point there's no reason to use it.
There would be nothing wrong with this approach if it actually came with guarantees, but current models are an extremely bad fit for it.
For actual work that I bill for, I go in with instructions to make minimal changes, and then I carefully review/edit everything.
That being said, the "toy" fully-AI projects I work with have evolved to the point where I regularly accomplish things I never (never ever) would have without the models.
I agree with all posts in the chain: Opus is good, Anthropic have burned good will, I would like to use other models...but Opus is too good.
What I find most frustrating is that I'm not sure whether actual model quality is even the blocker with other models. Gemini sometimes just goes off the rails with strange bugs, like writing random text continuously and burning output tokens; Grok seems to have system prompts that result in odd behaviour (no bugs, just weird things); Gemini Flash models seem to output massive quantities of text for no reason... it often feels like very stupid things.
Also, there are huge issues with adopting some of these open models in terms of IP. Third parties are running these models and you are just sending them all your code...with a code of conduct promise from OpenRouter?
I also don't think there needs to be a huge improvement in models. Opus feels somewhat close to the reasonable limit: useful, still outputs nonsense, misses things sometimes...there are open models that can reach the same 95th percentile but the median is just the model outputting complete nonsense and trying to wipe your file system.
The day for open models will come but it still feels so close and so far.
If people start using the Claude Max plans with other agent harnesses that don't use the same kinds of optimizations, the economics may no longer work out.
(But I also buy that they're going for horizontal control of the stack here and banning other agent harnesses was a competitive move to support that.)
They seem to have started rejecting 3rd party usage of the sub a few weeks ago, before Claw blew up.
By the way, does anyone know about the Agents SDK? Apparently you can use it with an auth token, is anyone doing that? Or is it likely to get your account in trouble as well?
I've had a similar experience with opencode, but I find that works better with my local models anyway.
(There probably is, but I found it very hard to make sense of the UI and how everything works. Hard to change models, no chat history etc.?)
> hitting that limit is within the terms of the agreement with Anthropic
It's not, because the agreement says you can only use CC.
Selling dollars for $0.50 does that. It sounds like a business-model issue to me.
Without knowing the numbers it's hard to tell if the business model for these AI providers actually works, and I suspect it probably doesn't at the moment, but selling an oversubscribed product with baked in usage assumptions is a functional business model in a lot of spaces (for varying definitions of functional, I suppose). I'm surprised this is so surprising to people.
Being a common business model and it being functional are two different things. I agree they are prevalent, but they are actively user hostile in nature. You are essentially saying that if people use your product at the advertised limit, then you will punish them. I get why the business does it, but it is an adversarial business model.
There are already many serious concerns about sharing code and information with 3rd parties, and those Chinese open models are dangerously close to destroying their entire value proposition.
It's within their capability to provision for higher usage by alternative clients. They just don't want to.
it's like Apple: you can use macOS only on our Macs, iOS only on iPhones, etc. but at least in the case of Apple, you pay (mostly) for the hardware while the software it comes with is "free" (as in free beer).
Could have just turned a blind eye.
(Edit due to rate-limiting: I see, thanks -- I wasn't aware there was more than one token type.)
That's not the product you buy when you buy a Claude Code token, though.
This confused me for a while: having two separate "products" which are sold differently, but can be used by the same tool.
If a company is going to automate our jobs, we shouldn't be giving them money and data to do so. They're using us to put ourselves out of work, and they're not giving us the keys.
I'm fine with non-local, open weights models. Not everything has to run on a local GPU, but it has to be something we can own.
I'd like a large, non-local Qwen3-Coder that I can launch in a RunPod or similar instance. I think on-demand non-local cloud compute can serve as a middle ground.
I can also imagine a dysfunctional future where developers spend half their time convincing their AI agents that the software they're writing is actually aligned with the model's set of values.
And yeah, I got three (for some reason) emails titled "Your account has been suspended" whose content said "An internal investigation of suspicious signals associated with your account indicates a violation of our Usage Policy. As a result, we have revoked your access to Claude.". There is a link to a Google Form which I filled out, but I don't expect to hear back.
I did nothing even remotely suspicious with my Anthropic subscription so I am reasonably sure this mirroring is what got me banned.
Edit: BTW I have since iterated on doing the same mirroring using OpenCode with Codex, then Codex with Codex, and now Pi with GPT-5.2 (non-Codex), and OpenAI hasn't banned me yet. I don't think they will, as they've decided to explicitly support using your subscription with third-party coding agents following Anthropic's crackdown on OpenCode.
It’d be cool if Anthropic were bound by their terms of use that you had to sign. Of course, they may well be broad enough to fire customers at will. Not that I suggest you expend any more time fighting this behemoth of a company though. Just sad that this is the state of the art.
I'm not so sure. It doesn't sound like you were circumventing any technical measures meant to enforce the ToS which I think places them in the wrong.
Unless I'm missing some obvious context (I don't use Mac and am unfamiliar with the Bun.spawn API) I don't understand how hooking a TUI up to a PTY and piping text around is remotely suspicious or even unusual. Would they ban you for using a custom terminal emulator? What about a custom fork of tmux? The entire thing sounds absurd to me. (I mean the entire OpenCode thing also seems absurd and wrong to me but at least that one is unambiguously against the ToS.)
* Subscription plans, which are (probably) subsidized and definitely oversubscribed (ie, 100% of subscribers could not use 100% of their tokens 100% of the time).
* Wholesale tokens, which are (probably) profitable.
If you try to use one product as the other product, it breaks their assumptions and business model.
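As a toy illustration (all numbers entirely made up, not Anthropic's actual figures), an oversubscribed plan is priced off average usage, not the cap, which is why it breaks when heavy users treat the cap as an entitlement:

```python
# Hypothetical oversubscription arithmetic; every number here is assumed.
subscribers = 1000
price_per_month = 100          # $/subscriber
token_cap = 10_000_000         # tokens each subscriber *could* use per month
cost_per_million = 15          # provider's serving cost, $/1M tokens
avg_utilization = 0.15         # typical subscriber uses 15% of the cap

revenue = subscribers * price_per_month                                              # 100,000
expected_cost = subscribers * token_cap * avg_utilization * cost_per_million / 1e6   # 22,500
worst_case_cost = subscribers * token_cap * cost_per_million / 1e6                   # 150,000

# Profitable at typical usage, deeply unprofitable if everyone maxes out:
print(revenue > expected_cost, worst_case_cost > revenue)  # True True
```

The wholesale API avoids this entirely by charging per token, so usage patterns can't break the pricing.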
I don't really see how this is weaponized malaise; capacity planning and some form of over-subscription is a widely accepted thing in every industry and product in the universe?
Also, this is more like "I sell a service called take a bike to the grocery store" with a clause in the contract saying "only ride the bike to the grocery store." I do this because I am assuming that most users will ride the bike to the grocery store 1 mile away a few times a week, so they will remain available, even though there is an off chance that some customers will ride laps to the store 24/7. However, I also sell a separate, more expensive service called Bikes By the Hour.
My customers suddenly start using the grocery store plan to ride to a pub 15 miles away, so I kick them off of the grocery store plan and make them buy Bikes By the Hour.
They could, of course, price your 10GB plan under the assumption that you would max out your connection 24 hours a day.
I fail to see how this would be advantageous to the vast majority of the customers.
Please list what capabilities you would like our local model to have and how you would like to have it served to you.
[1] a sovereign digital nation built on a national framework rather than a for-profit or even non-profit framework, will be available at https://stateofutopia.com (you can see some of my recent posts or comments here on HN.)
[2] https://www.youtube.com/live/0psQ2l4-USo?si=RVt2PhGy_A4nYFPi
OpenCode et al continue to work with my Max subscription.
What Anthropic blocked is using OpenCode with the Claude "individual plans" (like the $20/month Pro or $100/month Max plan), which Anthropic intends to be used only with the Claude Code client.
OpenCode had implemented some basic client spoofing so that this was working, but Anthropic updated to a more sophisticated client fingerprinting scheme which blocked OpenCode from using these individual plans.
I recommend Ghostty for Mac users. Alacritty probably works too.
{
  "plugin": [
    "opencode-anthropic-auth@latest"
  ]
}

It's that simple.

Everyone else is trying to compete in other ways and Anthropic are pushing to dominate the market.
They'll eventually lose their performance edge and suddenly they will be back to being cute and fluffy.
I've cancelled a Claude sub, but still have one.
I've tried all of the models available right now, and Claude Opus is by far the most capable.
I had an assertion failure triggered in a fairly complex open-source C library I was using, and Claude Opus not only found the cause, but wrote a self-contained reproduction code I could add to a GitHub issue. And it also added tests for that issue, and fixed the underlying issue.
I am sincerely impressed by the capabilities of Claude Opus. Too bad its usage is so expensive.
I wonder what they are up to.
You are doing that all the time. You just draw the line, arbitrarily.
It's like the old adage "Our brains are poor masters and great slaves". We basically just want to survive, and we've trained ourselves to follow the orders of our old corporate slave masters, who are now failing us. Out of fear we keep paying for and supporting anticompetitive behavior, and our internal dissonance is stopping us from changing it (along with fear of survival, fear of missing out, and so forth).
The global marketing by the slave master class isn't helping. We can draw a line, however arbitrary, and it's still better and more helpful than complaining "you drew a line arbitrarily" while not doing any of the hard, courageous work of drawing lines in the first place.
I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude code well enough to be useful.
Maybe this will be the one? This Unsloth guide from a sibling comment suggests it might be: https://unsloth.ai/docs/models/qwen3-coder-next
I use opencode and have done a few toy projects and little changes in small repositories and can get pretty speedy and stable experience up to a 64k context.
It would probably fall apart if I wanted to use it on larger projects, but I've often set tasks running on it, stepped away for an hour, and had a solution when I return. It's definitely useful for smaller project, scaffolding, basic bug fixes, extra UI tweaks etc.
I don't think "usable" a binary thing though. I know you write lot about this, but it'd be interesting to understand what you're asking the local models to do, and what is it about what they do that you consider unusable on a relative monster of a laptop?
I'm hoping for an experience where I can tell my computer to do a thing - write code, check for logged errors, find something in a bunch of files - and get an answer a few moments later.
Setting a task and then coming back to see if it worked an hour later is too much friction for me!
What's interesting to me about this model is how good it allegedly is with no thinking mode. That's my main complaint about qwen3:30b, how verbose its reasoning is. For the size it's astonishing otherwise.
I've had mild success with GPT-OSS-120b (MXFP4, ends up taking ~66GB of VRAM for me with llama.cpp) and Codex.
I'm wondering if maybe one could crowdsource chat logs for GPT-OSS-120b running with Codex, then seed another post-training run to fine-tune the 20b variant with the good runs from 120b, if that'd make a big difference. Both models with the reasoning_effort set to high are actually quite good compared to other downloadable models, although the 120b is just about out of reach for 64GB so getting the 20b better for specific use cases seems like it'd be useful.
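A minimal sketch of that filtering step, assuming the crowdsourced logs are dicts with hypothetical `success` and `messages` fields (a real pipeline would also deduplicate and strip sensitive content):

```python
import json

def build_sft_dataset(runs, out_path):
    """Keep only successful 120b runs and write them as JSONL for fine-tuning the 20b.

    `success` and `messages` are hypothetical field names for illustration;
    `success` might mean "tests passed" or "user accepted the patch".
    """
    kept = 0
    with open(out_path, "w") as out:
        for run in runs:
            if not run.get("success"):  # drop the bad runs
                continue
            out.write(json.dumps({"messages": run["messages"]}) + "\n")
            kept += 1
    return kept
```

Each kept line is one prompt/response trajectory, which is the usual shape for a supervised distillation pass.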
I wonder if it has to do with the message format, since it should be able to do tool use afaict.
Who pays for a free model? GPU training isn't free!
I remember early on people saying 100B+ models will run on your phone like nowish. They were completely wrong and I don't think it's going to ever really change.
People always will want the fastest, best, easiest setup method.
"Good enough" massively changes when your marketing team is managing k8s clusters with frontier systems in the near future.
Just the other day I was reading a paper about ANNs whose connections aren't strictly feedforward but, rather, circular connections proliferate. It increases expressiveness at the (huge) cost of eliminating the current gradient descent algorithms. As compute gets cheaper and cheaper, these things will become feasible (greater expressiveness, after all, equates to greater intelligence).
On an M1 64GB, Q4_K_M on llama.cpp gives only 20 tok/s, while on MLX it is more than twice as fast. However, MLX has problems with KV cache consistency, and especially with branching. So while in theory it is twice as fast as llama.cpp, it often has to do the prompt processing (PP) all over again, which completely trashes performance, especially with agentic coding.
So the agony is deciding whether to endure half the possible speed but get much better KV caching in return, or to have twice the speed but then often have to sit through prompt processing again.
But who knows, maybe Qwen gives them a hand? (hint,hint)
Now if you are not happy with the last answer, maybe you want to simply regenerate it or change your last question: this is branching the conversation. Llama.cpp is capable of re-using the KV cache up to that point, while MLX is not (I am using the MLX server from the MLX community project). I haven't tried with LM Studio. Maybe worth a try, thanks for the heads-up.
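The reuse behaviour can be sketched as a longest-common-prefix check over token IDs (a simplification; the real cache management in either engine is more involved):

```python
def common_prefix_len(cached_tokens, new_tokens):
    """Number of leading tokens shared between the cached sequence and the new prompt."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

cached = [1, 2, 3, 4, 5]    # KV entries already computed for these tokens
branch = [1, 2, 3, 9, 10]   # conversation branched: edited from the 4th token on
reuse = common_prefix_len(cached, branch)
to_process = branch[reuse:]  # only these tokens need prompt processing again
```

An engine that reuses the prefix only pays PP for `to_process`; one that doesn't reprocesses the whole `branch`, which is exactly the slowdown described above.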
Perhaps I'm grossly wrong -- I guess time will tell.
There is also the counter-intuitive phenomenon where training a model on a wider variety of content than apparently necessary for the task makes it better somehow. For example, models trained only on English content exhibit measurably worse performance at writing sensible English than those trained on a handful of languages, even when controlling for the size of the training set. It doesn't make sense to me, but it probably does to credentialed AI researchers who know what's going on under the hood.
System info:
$ ./llama-server --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
version: 7897 (3dd95914d)
built with GNU 11.4.0 for Linux x86_64
llama.cpp command-line: $ ./llama-server --host 0.0.0.0 --port 2000 --no-warmup \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--jinja --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --fit on \
--ctx-size 32768

Not as good as running the entire thing on the GPU, of course.
Asking from a place of pure ignorance here, because I don't see the answer on HF or in your docs: Why would I (or anyone) want to run this instead of Qwen3's own GGUFs?
The green/yellow/red indicators for different levels of hardware support are really helpful, but far from enough IMO.
Great work as always btw!
brew upgrade llama.cpp # or brew install if you don't have it yet
Then:

llama-cli \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--fit on \
--seed 3407 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
--jinja
That opened a CLI interface. For a web UI on port 8080, along with an OpenAI chat-completions-compatible endpoint, do this:

llama-server \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--fit on \
--seed 3407 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
--jinja
It's using about 28GB of RAM.

There's no reason for a coding model to contain all of AO3 and Wikipedia =)
I had not considered that, seems like a great solution for local models that may be more resource-constrained.
If we knew how to create a SOTA coding model by just putting coding stuff in there, that is how we would build SOTA coding models.
Besides, programming is far from just knowing how to autocomplete syntax, you need a model that's proficient in the fields that the automation is placed in, otherwise they'll be no help in actually automating it.
Video is sped up. I ran it through LM Studio and then OpenCode. Wrote a bit about how I set it all up here: https://www.tommyjepsen.com/blog/run-llm-locally-for-coding
From very limited testing, it seems to be slightly worse than MiniMax M2.1 Q6 (a model about twice its size). I'm impressed.
I tried FP8 in vLLM and it used 110GB and then my machine started to swap when I hit it with a query. Only room for 16k context.
I suspect there will be some optimizations over the next few weeks that will pick up the performance on these type of machines.
I have it writing some Rust code and it's definitely slower than using a hosted model but it's actually seeming pretty competent. These are the first results I've had on a locally hosted model that I could see myself actually using, though only once the speed picks up a bit.
I suspect the API providers will offer this model for nice and cheap, too.
I'm asking it to do some analysis/explain some Rust code in a rather large open source project and it's working nicely. I agree this is a model I could possibly, maybe use locally...
Overall, it's allowed me to maintain more consistent workflows as I'm less dependent on Opus. Now that Mastra has introduced the concept of Workspaces, which allow for more agentic development, this approach has become even more powerful.
Related: as an actual magician, although no longer performing professionally, I was telling another magician friend the other day that IMHO, LLMs are the single greatest magic trick ever invented judging by pure deceptive power. Two reasons:
1. Great magic tricks exploit flaws in human perception and reasoning by seeming to be something they aren't. The best leverage more than one. By their nature, LLMs perfectly exploit the ways humans assess intelligence in themselves and others - knowledge recall, verbal agility, pattern recognition, confident articulation, etc. No other magic trick stacks so many parallel exploits at once.
2. But even the greatest magic tricks don't fool their inventors. David Copperfield doesn't suspect the lady may be floating by magic. Yet, some AI researchers believe the largest, most complex LLMs actually demonstrate emergent thinking and even consciousness. It's so deceptive it even fools people who know how it works. To me, that's a great fucking trick.
Granted these 80B models are probably optimized for H100/H200 which I do not have. Here's to hoping that OpenClaw compat. survives quantization
Hope they update the model page soon https://chat.qwen.ai/settings/model
Sorry, but we're talking about models as content now? There's almost always a better word than "content" if you're describing something that's in tech or online.
Does anyone have any experience with these, and is this release actually workable in practice?
The benchmark consists of a bunch of tasks. The chart shows the distribution of the number of turns taken over all those tasks.
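In other words, the chart is just a histogram of per-task turn counts, something like this (with made-up data):

```python
from collections import Counter

# Hypothetical per-task results: number of agent turns the model took on each task.
turns_per_task = [3, 5, 5, 8, 12, 5, 3, 20, 8, 5]
distribution = Counter(turns_per_task)

# x-axis: turns taken, y-axis: how many tasks took that many turns.
print(sorted(distribution.items()))  # [(3, 2), (5, 4), (8, 2), (12, 1), (20, 1)]
```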
It's one thing running the model without any context, but coding agents build it up close to the max and that slows down generation massively in my experience.
Compared to RISC core designs or IC optimization, the pace of AI innovation is slow and easy to follow.
I'm currently using Qwen 2.5 16b, and it works really well.
On a misc note: What's being used to create the screen recordings? It looks so smooth!
I got stuff done with Sonnet 3.7 just fine, it did need a bunch of babysitting, but still it was a net positive to productivity. Now local models are at that level, closing up on the current SOTA.
When "anyone" can run an Opus 4.5 level model at home, we're going to be getting diminishing returns from closed online-only models.
Don't forget that they want to make money in the end. They release small models for free because the publicity is worth more than they could charge for them, but they won't just give away models that are good enough that people would pay significant amounts of money to use them.
In practice, I've found the economics work like this:
1. Code generation (boilerplate, tests, migrations) - smaller models are fine, and latency matters more than peak capability
2. Architecture decisions, debugging subtle issues - worth the cost of frontier models
3. Refactoring existing code - the model needs to "understand" before changing, so context and reasoning matter more
The 3B active parameters claim is the key unlock here. If this actually runs well on consumer hardware with reasonable context windows, it becomes the obvious choice for category 1 tasks. The question is whether the SWE-Bench numbers hold up for real-world "agent turn" scenarios where you're doing hundreds of small operations.
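A back-of-envelope sketch of why the active-parameter count is the unlock (every number here is an assumption, not a spec from the release):

```python
# Assumed figures for an 80B-total / 3B-active MoE at a ~4.4-bit quant.
total_params = 80e9        # total MoE parameters
active_params = 3e9        # parameters activated per token
bytes_per_param = 0.55     # ~4.4 bits/param for a Q4_K-style quant (rough)

weights_gb = total_params * bytes_per_param / 1e9   # memory you must hold
flops_per_token = 2 * active_params                  # rough decode compute per token

# Memory scales with TOTAL params, per-token compute with ACTIVE params,
# so an 80B MoE can decode roughly like a 3B dense model once it fits in RAM.
print(round(weights_gb), flops_per_token / 1e9)  # 44 6.0
```

That asymmetry is exactly what makes it plausible as a consumer-hardware model for the quick category-1 tasks, provided memory bandwidth keeps up.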
And at the end of the day, does it matter?