To be clear, since this confuses a lot of people in every thread: Anthropic will let you use their API with any coding tools you want. You just have to go through the public API and pay the same rate as everyone else. They have not "blocked" or "banned" any coding tools from using their API, even though a lot of the clickbait headlines have tried to insinuate as much.
Anthropic never sold subscription plans as being usable with anything other than their own tools. They were specifically offered as a way to use their own apps for a flat monthly fee.
They obviously set the limits and pricing according to typical use patterns of these tools, because the typical users aren't maxing out their credits in every usage window.
Some of the open source tools reverse engineered the protocol (which wasn't hard) and people started using the plans with other tools. This situation went on for a while without enforcement until it got too big to ignore, and they began protecting the private endpoints explicitly.
The subscription plans were never sold as a way to use the API with other programs, but I think they let it slide for a while because it was only a small number of people doing it. Once the tools started getting more popular they started closing loopholes to use the private API with other tools, which shouldn't really come as a surprise.
Problem is, most people don't do this, choosing convenience in any given moment without thinking about the longer-term impact. This hurts us collectively by letting governments, companies, etc. tighten their grip over time. This comes from my lived experience.
I agree anticompetitive behavior is bad, but the productivity gains to be had by using Anthropic models and tools are undeniable.
Eventually the open tools and models will catch up, so I'm all for using them locally as well, especially if sensitive data or IP is involved.
I can't comment on Opus in CC because I've never bitten the bullet and paid for the subscription, but I have worked my way up to the $200/month Cursor subscription, and the 5.2 Codex models blow Opus out of the water in my experience (obviously very subjective).
I arrived at making plans with Opus and then implementing with the OpenAI model. The speed of Opus is much better for planning.
I'm willing to believe that CC/Opus is truly the overall best; I'm only commenting because you mentioned Cursor, where I'm fairly confident it's not. I'm basing my judgement on "how frequently does it do what I want the first time".
I've tried explaining the implementation word for word and it still prefers to create a whole new implementation, reimplementing some parts, instead of just doing what I tell it to. The only time it works is if I actually give it the code, but at that point there's no reason to use it.
There would be nothing wrong with this approach if it actually came with guarantees, but current models are an extremely bad fit for it.
For actual work that I bill for, I go in with instructions to make minimal changes, and then I carefully review/edit everything.
That being said, the "toy" fully-AI projects I work with have evolved to the point where I regularly accomplish things I never (never ever) would have without the models.
I agree with all posts in the chain: Opus is good, Anthropic have burned good will, I would like to use other models...but Opus is too good.
What I find most frustrating is that I'm not sure whether actual model quality is even the blocker with other models. Gemini sometimes just goes off the rails with strange bugs, like writing random text continuously and burning output tokens; Grok seems to have system prompts that result in odd behaviour (no bugs, just weird things); Gemini Flash models seem to output massive quantities of text for no reason... it often feels like very stupid things.
Also, there are huge issues with adopting some of these open models in terms of IP. Third parties are running these models and you are just sending them all your code...with a code of conduct promise from OpenRouter?
I also don't think there needs to be a huge improvement in models. Opus feels somewhat close to the reasonable limit: useful, still outputs nonsense, misses things sometimes...there are open models that can reach the same 95th percentile but the median is just the model outputting complete nonsense and trying to wipe your file system.
The day for open models will come but it still feels so close and so far.
If people start using the Claude Max plans with other agent harnesses that don't use the same kinds of optimizations, the economics may no longer work out.
(But I also buy that they're going for horizontal control of the stack here and banning other agent harnesses was a competitive move to support that.)
They seem to have started rejecting 3rd party usage of the sub a few weeks ago, before Claw blew up.
By the way, does anyone know about the Agents SDK? Apparently you can use it with an auth token, is anyone doing that? Or is it likely to get your account in trouble as well?
I've had a similar experience with opencode, but I find that works better with my local models anyway.
(There probably is, but I found it very hard to make sense of the UI and how everything works. Hard to change models, no chat history etc.?)
> hitting that limit is within the terms of the agreement with Anthropic
It's not, because the agreement says you can only use CC.
Selling dollars for $0.50 does that. It sounds like a business-model issue to me.
Without knowing the numbers it's hard to tell if the business model for these AI providers actually works, and I suspect it probably doesn't at the moment, but selling an oversubscribed product with baked in usage assumptions is a functional business model in a lot of spaces (for varying definitions of functional, I suppose). I'm surprised this is so surprising to people.
Being a common business model and it being functional are two different things. I agree they are prevalent, but they are actively user hostile in nature. You are essentially saying that if people use your product at the advertised limit, then you will punish them. I get why the business does it, but it is an adversarial business model.
There are already many serious concerns about sharing code and information with 3rd parties, and those Chinese open models are dangerously close to destroying their entire value proposition.
It's within their capability to provision for higher usage by alternative clients. They just don't want to.
it's like Apple: you can use macOS only on our Macs, iOS only on iPhones, etc. but at least in the case of Apple, you pay (mostly) for the hardware while the software it comes with is "free" (as in free beer).
Could have just turned a blind eye.
(Edit due to rate-limiting: I see, thanks -- I wasn't aware there was more than one token type.)
That's not the product you buy when you buy a Claude Code token, though.
This confused me for a while: having two separate "products" which are sold differently, but can be used by the same tool.
If a company is going to automate our jobs, we shouldn't be giving them money and data to do so. They're using us to put ourselves out of work, and they're not giving us the keys.
I'm fine with non-local, open weights models. Not everything has to run on a local GPU, but it has to be something we can own.
I'd like a large, non-local Qwen3-Coder that I can launch in a RunPod or similar instance. I think on-demand non-local cloud compute can serve as a middle ground.
I can also imagine a dysfunctional future where developers spend half their time convincing their AI agents that the software they're writing is actually aligned with the model's set of values.
And yeah, I got three (for some reason) emails titled "Your account has been suspended" whose content said "An internal investigation of suspicious signals associated with your account indicates a violation of our Usage Policy. As a result, we have revoked your access to Claude.". There is a link to a Google Form which I filled out, but I don't expect to hear back.
I did nothing even remotely suspicious with my Anthropic subscription so I am reasonably sure this mirroring is what got me banned.
Edit: BTW I have since iterated on doing the same mirroring using OpenCode with Codex, then Codex with Codex, and now Pi with GPT-5.2 (non-Codex), and OpenAI hasn't banned me yet. I don't think they will, as they've decided to explicitly support using your subscription with third-party coding agents following Anthropic's crackdown on OpenCode.
It’d be cool if Anthropic were bound by their terms of use that you had to sign. Of course, they may well be broad enough to fire customers at will. Not that I suggest you expend any more time fighting this behemoth of a company though. Just sad that this is the state of the art.
I'm not so sure. It doesn't sound like you were circumventing any technical measures meant to enforce the ToS which I think places them in the wrong.
Unless I'm missing some obvious context (I don't use Mac and am unfamiliar with the Bun.spawn API) I don't understand how hooking a TUI up to a PTY and piping text around is remotely suspicious or even unusual. Would they ban you for using a custom terminal emulator? What about a custom fork of tmux? The entire thing sounds absurd to me. (I mean the entire OpenCode thing also seems absurd and wrong to me but at least that one is unambiguously against the ToS.)
* Subscription plans, which are (probably) subsidized and definitely oversubscribed (ie, 100% of subscribers could not use 100% of their tokens 100% of the time).
* Wholesale tokens, which are (probably) profitable.
If you try to use one product as the other product, it breaks their assumptions and business model.
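As a toy illustration (all numbers entirely made up, not Anthropic's actual figures), an oversubscribed plan is priced off average usage, not the cap, which is why it breaks when heavy users treat the cap as an entitlement:

```python
# Hypothetical oversubscription arithmetic; every number here is assumed.
subscribers = 1000
price_per_month = 100          # $/subscriber
token_cap = 10_000_000         # tokens each subscriber *could* use per month
cost_per_million = 15          # provider's serving cost, $/1M tokens
avg_utilization = 0.15         # typical subscriber uses 15% of the cap

revenue = subscribers * price_per_month                                              # 100,000
expected_cost = subscribers * token_cap * avg_utilization * cost_per_million / 1e6   # 22,500
worst_case_cost = subscribers * token_cap * cost_per_million / 1e6                   # 150,000

# Profitable at typical usage, deeply unprofitable if everyone maxes out:
print(revenue > expected_cost, worst_case_cost > revenue)  # True True
```

The wholesale API avoids this entirely by charging per token, so usage patterns can't break the pricing.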
I don't really see how this is weaponized malaise; capacity planning and some form of over-subscription is a widely accepted thing in every industry and product in the universe?
Also, this is more like "I sell a service called take a bike to the grocery store" with a clause in the contract saying "only ride the bike to the grocery store." I do this because I am assuming that most users will ride the bike to the grocery store 1 mile away a few times a week, so they will remain available, even though there is an off chance that some customers will ride laps to the store 24/7. However, I also sell a separate, more expensive service called Bikes By the Hour.
My customers suddenly start using the grocery store plan to ride to a pub 15 miles away, so I kick them off of the grocery store plan and make them buy Bikes By the Hour.
They could, of course, price your 10GB plan under the assumption that you would max out your connection 24 hours a day.
I fail to see how this would be advantageous to the vast majority of the customers.
Please list what capabilities you would like our local model to have and how you would like to have it served to you.
[1] a sovereign digital nation built on a national framework rather than a for-profit or even non-profit framework, will be available at https://stateofutopia.com (you can see some of my recent posts or comments here on HN.)
[2] https://www.youtube.com/live/0psQ2l4-USo?si=RVt2PhGy_A4nYFPi
OpenCode et al continue to work with my Max subscription.
What Anthropic blocked is using OpenCode with the Claude "individual plans" (like the $20/month Pro or $100/month Max plan), which Anthropic intends to be used only with the Claude Code client.
OpenCode had implemented some basic client spoofing so that this was working, but Anthropic updated to a more sophisticated client fingerprinting scheme which blocked OpenCode from using these individual plans.
I recommend Ghostty for Mac users. Alacritty probably works too.
{
  "plugin": [
    "opencode-anthropic-auth@latest"
  ]
}

It's that simple.

Everyone else is trying to compete in other ways and Anthropic are pushing to dominate the market.
They'll eventually lose their performance edge and suddenly they will be back to being cute and fluffy.
I've cancelled a Claude sub, but still have one.
I've tried all of the models available right now, and Claude Opus is by far the most capable.
I had an assertion failure triggered in a fairly complex open-source C library I was using, and Claude Opus not only found the cause, but wrote a self-contained reproduction code I could add to a GitHub issue. And it also added tests for that issue, and fixed the underlying issue.
I am sincerely impressed by the capabilities of Claude Opus. Too bad its usage is so expensive.
I wonder what they are up to.
You are doing that all the time. You just draw the line, arbitrarily.
It's like the old adage "Our brains are poor masters and great slaves". We basically just want to survive, and we've trained ourselves to follow the orders of our old corporate slave masters, who are now failing us. Out of fear we keep paying for and supporting anticompetitive behavior, and our internal dissonance is stopping us from changing it (along with fear of survival, fear of missing out, and so forth).
The global marketing by the slave master class isn't helping. We can draw a line, however arbitrary, and it's still better and more helpful than complaining "you drew a line arbitrarily" while not doing any of the hard, courageous work of drawing lines in the first place.
I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude code well enough to be useful.
Maybe this will be the one? This Unsloth guide from a sibling comment suggests it might be: https://unsloth.ai/docs/models/qwen3-coder-next
I use opencode and have done a few toy projects and little changes in small repositories and can get pretty speedy and stable experience up to a 64k context.
It would probably fall apart if I wanted to use it on larger projects, but I've often set tasks running on it, stepped away for an hour, and had a solution when I return. It's definitely useful for smaller project, scaffolding, basic bug fixes, extra UI tweaks etc.
I don't think "usable" a binary thing though. I know you write lot about this, but it'd be interesting to understand what you're asking the local models to do, and what is it about what they do that you consider unusable on a relative monster of a laptop?
I'm hoping for an experience where I can tell my computer to do a thing - write code, check for logged errors, find something in a bunch of files - and get an answer a few moments later.
Setting a task and then coming back to see if it worked an hour later is too much friction for me!
What's interesting to me about this model is how good it allegedly is with no thinking mode. That's my main complaint about qwen3:30b, how verbose its reasoning is. For the size it's astonishing otherwise.
I've had mild success with GPT-OSS-120b (MXFP4, ends up taking ~66GB of VRAM for me with llama.cpp) and Codex.
I'm wondering if maybe one could crowdsource chat logs for GPT-OSS-120b running with Codex, then seed another post-training run to fine-tune the 20b variant with the good runs from 120b, if that'd make a big difference. Both models with the reasoning_effort set to high are actually quite good compared to other downloadable models, although the 120b is just about out of reach for 64GB so getting the 20b better for specific use cases seems like it'd be useful.
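A minimal sketch of that filtering step, assuming the crowdsourced logs are dicts with hypothetical `success` and `messages` fields (a real pipeline would also deduplicate and strip sensitive content):

```python
import json

def build_sft_dataset(runs, out_path):
    """Keep only successful 120b runs and write them as JSONL for fine-tuning the 20b.

    `success` and `messages` are hypothetical field names for illustration;
    `success` might mean "tests passed" or "user accepted the patch".
    """
    kept = 0
    with open(out_path, "w") as out:
        for run in runs:
            if not run.get("success"):  # drop the bad runs
                continue
            out.write(json.dumps({"messages": run["messages"]}) + "\n")
            kept += 1
    return kept
```

Each kept line is one prompt/response trajectory, which is the usual shape for a supervised distillation pass.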
I wonder if it has to do with the message format, since it should be able to do tool use afaict.
Who pays for a free model? GPU training isn't free!
I remember early on people saying 100B+ models will run on your phone like nowish. They were completely wrong and I don't think it's going to ever really change.
People always will want the fastest, best, easiest setup method.
"Good enough" massively changes when your marketing team is managing k8s clusters with frontier systems in the near future.
Just the other day I was reading a paper about ANNs whose connections aren't strictly feedforward but, rather, circular connections proliferate. It increases expressiveness at the (huge) cost of eliminating the current gradient descent algorithms. As compute gets cheaper and cheaper, these things will become feasible (greater expressiveness, after all, equates to greater intelligence).
On an M1 64GB, Q4_K_M on llama.cpp gives only 20 tok/s, while on MLX it is more than twice as fast. However, MLX has problems with KV cache consistency, and especially with branching. So while in theory it is twice as fast as llama.cpp, it often has to do the prompt processing (PP) all over again, which completely trashes performance, especially with agentic coding.
So the agony is deciding whether to endure half the possible speed but get much better KV caching in return, or to have twice the speed but then often have to sit through prompt processing again.
But who knows, maybe Qwen gives them a hand? (hint,hint)
Now if you are not happy with the last answer, maybe you want to simply regenerate it or change your last question: this is branching the conversation. Llama.cpp is capable of re-using the KV cache up to that point, while MLX is not (I am using the MLX server from the MLX community project). I haven't tried with LM Studio. Maybe worth a try, thanks for the heads-up.
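The reuse behaviour can be sketched as a longest-common-prefix check over token IDs (a simplification; the real cache management in either engine is more involved):

```python
def common_prefix_len(cached_tokens, new_tokens):
    """Number of leading tokens shared between the cached sequence and the new prompt."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

cached = [1, 2, 3, 4, 5]    # KV entries already computed for these tokens
branch = [1, 2, 3, 9, 10]   # conversation branched: edited from the 4th token on
reuse = common_prefix_len(cached, branch)
to_process = branch[reuse:]  # only these tokens need prompt processing again
```

An engine that reuses the prefix only pays PP for `to_process`; one that doesn't reprocesses the whole `branch`, which is exactly the slowdown described above.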
Perhaps I'm grossly wrong -- I guess time will tell.
There is also the counter-intuitive phenomenon where training a model on a wider variety of content than apparently necessary for the task makes it better somehow. For example, models trained only on English content exhibit measurably worse performance at writing sensible English than those trained on a handful of languages, even when controlling for the size of the training set. It doesn't make sense to me, but it probably does to credentialed AI researchers who know what's going on under the hood.
System info:
$ ./llama-server --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
version: 7897 (3dd95914d)
built with GNU 11.4.0 for Linux x86_64
llama.cpp command-line: $ ./llama-server --host 0.0.0.0 --port 2000 --no-warmup \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--jinja --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --fit on \
--ctx-size 32768

Not as good as running the entire thing on the GPU, of course.
Asking from a place of pure ignorance here, because I don't see the answer on HF or in your docs: Why would I (or anyone) want to run this instead of Qwen3's own GGUFs?
The green/yellow/red indicators for different levels of hardware support are really helpful, but far from enough IMO.
Great work as always btw!
brew upgrade llama.cpp # or brew install if you don't have it yet
Then:

llama-cli \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--fit on \
--seed 3407 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
--jinja
That opened a CLI interface. For a web UI on port 8080, along with an OpenAI chat-completions-compatible endpoint, do this:

llama-server \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--fit on \
--seed 3407 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
--jinja
It's using about 28GB of RAM.

There's no reason for a coding model to contain all of AO3 and Wikipedia =)
I had not considered that, seems like a great solution for local models that may be more resource-constrained.
If we knew how to create a SOTA coding model by just putting coding stuff in there, that is how we would build SOTA coding models.
Besides, programming is far from just knowing how to autocomplete syntax, you need a model that's proficient in the fields that the automation is placed in, otherwise they'll be no help in actually automating it.
Video is sped up. I ran it through LM Studio and then OpenCode. Wrote a bit about how I set it all up here: https://www.tommyjepsen.com/blog/run-llm-locally-for-coding
From very limited testing, it seems to be slightly worse than MiniMax M2.1 Q6 (a model about twice its size). I'm impressed.
I tried FP8 in vLLM and it used 110GB and then my machine started to swap when I hit it with a query. Only room for 16k context.
I suspect there will be some optimizations over the next few weeks that will pick up the performance on these type of machines.
I have it writing some Rust code and it's definitely slower than using a hosted model but it's actually seeming pretty competent. These are the first results I've had on a locally hosted model that I could see myself actually using, though only once the speed picks up a bit.
I suspect the API providers will offer this model for nice and cheap, too.
I'm asking it to do some analysis/explain some Rust code in a rather large open source project and it's working nicely. I agree this is a model I could possibly, maybe use locally...
Overall, it's allowed me to maintain more consistent workflows as I'm less dependent on Opus. Now that Mastra has introduced the concept of Workspaces, which allow for more agentic development, this approach has become even more powerful.
Related: as an actual magician, although no longer performing professionally, I was telling another magician friend the other day that IMHO, LLMs are the single greatest magic trick ever invented judging by pure deceptive power. Two reasons:
1. Great magic tricks exploit flaws in human perception and reasoning by seeming to be something they aren't. The best leverage more than one. By their nature, LLMs perfectly exploit the ways humans assess intelligence in themselves and others - knowledge recall, verbal agility, pattern recognition, confident articulation, etc. No other magic trick stacks so many parallel exploits at once.
2. But even the greatest magic tricks don't fool their inventors. David Copperfield doesn't suspect the lady may be floating by magic. Yet, some AI researchers believe the largest, most complex LLMs actually demonstrate emergent thinking and even consciousness. It's so deceptive it even fools people who know how it works. To me, that's a great fucking trick.
Granted these 80B models are probably optimized for H100/H200 which I do not have. Here's to hoping that OpenClaw compat. survives quantization
Hope they update the model page soon https://chat.qwen.ai/settings/model
Sorry, but we're talking about models as content now? There's almost always a better word than "content" if you're describing something that's in tech or online.
Does anyone have any experience with these, and is this release actually workable in practice?
The benchmark consists of a bunch of tasks. The chart shows the distribution of the number of turns taken over all those tasks.
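In other words, the chart is just a histogram of per-task turn counts, something like this (with made-up data):

```python
from collections import Counter

# Hypothetical per-task results: number of agent turns the model took on each task.
turns_per_task = [3, 5, 5, 8, 12, 5, 3, 20, 8, 5]
distribution = Counter(turns_per_task)

# x-axis: turns taken, y-axis: how many tasks took that many turns.
print(sorted(distribution.items()))  # [(3, 2), (5, 4), (8, 2), (12, 1), (20, 1)]
```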
It's one thing running the model without any context, but coding agents build it up close to the max and that slows down generation massively in my experience.
Compared to RISC core designs or IC optimization, the pace of AI innovation is slow and easy to follow.
I'm currently using Qwen 2.5 16b, and it works really well.
On a misc note: What's being used to create the screen recordings? It looks so smooth!
I got stuff done with Sonnet 3.7 just fine, it did need a bunch of babysitting, but still it was a net positive to productivity. Now local models are at that level, closing up on the current SOTA.
When "anyone" can run an Opus 4.5 level model at home, we're going to be getting diminishing returns from closed online-only models.
Don't forget that they want to make money in the end. They release small models for free because the publicity is worth more than they could charge for them, but they won't just give away models that are good enough that people would pay significant amounts of money to use them.
In practice, I've found the economics work like this:
1. Code generation (boilerplate, tests, migrations) - smaller models are fine, and latency matters more than peak capability
2. Architecture decisions, debugging subtle issues - worth the cost of frontier models
3. Refactoring existing code - the model needs to "understand" before changing, so context and reasoning matter more
The 3B active parameters claim is the key unlock here. If this actually runs well on consumer hardware with reasonable context windows, it becomes the obvious choice for category 1 tasks. The question is whether the SWE-Bench numbers hold up for real-world "agent turn" scenarios where you're doing hundreds of small operations.
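A back-of-envelope sketch of why the active-parameter count is the unlock (every number here is an assumption, not a spec from the release):

```python
# Assumed figures for an 80B-total / 3B-active MoE at a ~4.4-bit quant.
total_params = 80e9        # total MoE parameters
active_params = 3e9        # parameters activated per token
bytes_per_param = 0.55     # ~4.4 bits/param for a Q4_K-style quant (rough)

weights_gb = total_params * bytes_per_param / 1e9   # memory you must hold
flops_per_token = 2 * active_params                  # rough decode compute per token

# Memory scales with TOTAL params, per-token compute with ACTIVE params,
# so an 80B MoE can decode roughly like a 3B dense model once it fits in RAM.
print(round(weights_gb), flops_per_token / 1e9)  # 44 6.0
```

That asymmetry is exactly what makes it plausible as a consumer-hardware model for the quick category-1 tasks, provided memory bandwidth keeps up.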
And at the end of the day, does it matter?