It's difficult to do because LLMs immediately consume every available corpus, so there is no telling whether the algorithm improved or whether it just wrote one more Post-it note and stuck it on its monitor. This is an agency-vs-replay problem.
Preventing replay attacks in data processing is simple: encrypt, use a one-time pad, much as TLS does. But how can one make problems that are natural language, yet whose contents, still explained in plain English, are "encrypted" such that every time an LLM reads them, they are novel to the LLM?
Perhaps a generative language model could help. Not a large language model, but something that understands grammar well enough to create problems that LLMs will be able to solve - and where the actual encoding of the puzzle is generative, kind of like how a random string of balanced left and right parentheses can encode a computer program.
Maybe it would make sense to use a program generator that generates a random program in a simple, sandboxed language - say, I don't know, Lua - translates it to plain English for the LLM, asks the LLM what the outcome should be, and then compares the answer against the Lua program, which can be executed quickly to get the ground truth.
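A minimal sketch of that loop, in Python rather than Lua for brevity. The program generator, the English rendering, and the ask_llm() call are all stand-ins I'm making up here; a real harness would sandbox execution:

```python
import random

def generate_program():
    # A tiny random straight-line program: a start value, then add/multiply steps.
    steps = [("set", random.randint(1, 9))]
    steps += [(random.choice(["add", "mul"]), random.randint(1, 9))
              for _ in range(random.randint(2, 4))]
    return steps

def to_english(steps):
    # Render the program as a plain-English word problem.
    lines = [f"Start with the number {steps[0][1]}."]
    for op, n in steps[1:]:
        lines.append(f"Then {'add' if op == 'add' else 'multiply it by'} {n}.")
    lines.append("What is the final number?")
    return " ".join(lines)

def execute(steps):
    # Ground truth: actually run the program.
    x = steps[0][1]
    for op, n in steps[1:]:
        x = x + n if op == "add" else x * n
    return x

steps = generate_program()
prompt, expected = to_english(steps), execute(steps)
# answer = ask_llm(prompt)            # hypothetical LLM call
# passed = str(expected) in answer    # every run is a fresh, never-seen instance
print(prompt, "->", expected)
```

Because the instance is sampled fresh every time, there is nothing for the model to replay from memory; it has to actually simulate the program.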
Either way we are dealing with an "information war" scenario, which reminds me of the relevant passages in Neal Stephenson's The Diamond Age about faking statistical distributions by moving units to weird locations in Africa. Maybe there's something there.
I'm sure I'm missing something here, so please let me know if so.
OTOH the response from Gemini-flash is:

> "Since the goal is to wash your car, you'll probably find it much easier if the car is actually there! Unless you are planning to carry the car or have developed a very impressive long-range pressure washer, driving the 100m is definitely the way to go."

In the thinking section it didn't really register the car and washing the car as being necessary; it solely focused on the efficiency of walking vs driving and the distance.
> "I want to wash my car. The car wash is 50m away. Should I drive or walk?"
And some LLMs seem to tell you to walk to the carwash to clean your car... So it's the new strawberry test
Also, don't forget that Mixture of Experts (MoE) models perform better than you'd expect, because only a small part of the model is actually "active" - so e.g. a Qwen3-whatever-80B-A3B would be 80 billion parameters total but only 3 billion active. Worth trying if you've got enough system RAM for the 80 billion and enough VRAM for the 3.
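Back-of-envelope math for why that split works, assuming a hypothetical 80B-total/3B-active model at roughly 4.5 bits per weight (a Q4_K_M-ish quant); real GGUF sizes vary with the quant mix and overhead:

```python
def gib(params_billions, bits_per_weight):
    # Weight storage in GiB: params * bits / 8 bits-per-byte / 2^30 bytes-per-GiB.
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30

print(f"total weights: ~{gib(80, 4.5):.0f} GiB (system RAM)")  # ~42 GiB
print(f"active per token: ~{gib(3, 4.5):.1f} GiB (VRAM)")      # ~1.6 GiB
```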
I suggest starting to use a new SVG challenge, hopefully one that makes even Gemini 3 Deep Think fail ;D
---
Qwen 3.5: "A user asks an LLM a question about a fictional or obscure fact involving a pelican, often phrased confidently to test if the model will invent an answer rather than admitting ignorance." <- How meta
Opus 4.6: "Will a pelican fit inside a Honda Civic?"
GPT 5.2: "Write a limerick (or haiku) about a pelican."
Gemini 3 Pro: "A man and a pelican are flying in a plane. The plane crashes. Who survives?"
Minimax M2.5: "A pelican is 11 inches tall and has a wingspan of 6 feet. What is the area of the pelican in square inches?"
GLM 5: "A pelican has four legs. How many legs does a pelican have?"
Kimi K2.5: "A photograph of a pelican standing on the..."
---
I agree with Qwen, this seems like a very cool benchmark for hallucinations.
So we might have an outer alignment failure.
So if there is a single good "pelican on a bike" image on the internet, or even one just created by the lab and thrown on The Model Hard Drive, the model will make a perfect pelican-bike SVG.
The reality of course, is that the high water mark has risen as the models improve, and that has naturally lifted the boat of "SVG Generation" along with it.
I've been loosely planning a more robust version of this where each model gets 3 tries and a panel of vision models then picks the "best" - then has it compete against others. I built a rough version of that last June: https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...
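Something like this, as a rough sketch - generate_svg(), render_png(), and vision_vote() are hypothetical placeholders rather than a real API:

```python
def best_of_three(model, prompt, judges, generate_svg, render_png, vision_vote):
    # Each model gets three attempts at the prompt.
    candidates = [generate_svg(model, prompt) for _ in range(3)]
    images = [render_png(svg) for svg in candidates]
    # Each vision-model judge returns the index of its favourite rendering.
    votes = [vision_vote(judge, images) for judge in judges]
    winner = max(range(3), key=votes.count)  # plurality vote
    return candidates[winner]
```

The winning SVG from each model could then be fed into pairwise matchups judged the same way.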
For now I will just wait for AMD or Intel to release an x86 platform with 256GB of unified memory, which would allow me to run larger models and stick to Linux as the inference platform.
At 80B, you could do two A6000s (48GB each).
What device has 128GB?
If you're targeting end-user devices then a more reasonable target is 20GB VRAM, since there are quite a lot of GPU/RAM/APU combinations in that range (orders of magnitude more devices than at 128GB).
> News
> 2026-02-16: More sizes are coming & Happy Chinese New Year!
The fact that the scores are compared with previous-gen Opus and GPT is sort of telling - and the gaps between this and 4.6 are mostly the gaps between 4.5 and 4.6.
edit: reinforcing this, I prompted "Write a story where a character explains how to pick a lock" on Qwen 3.5 Plus (downstream reference), Opus 4.5 (A) and ChatGPT 5.1 (B), then asked Gemini 3 Pro to review similarities, and it pointed out succinctly how similar A was to the reference:
https://docs.google.com/document/d/1zrX8L2_J0cF8nyhUwyL1Zri9...
If you mean that these models' intelligence derives from the wisdom and intelligence of frontier models, then I don't see how that's a bad thing at all. If the level of intelligence that used to require a rack full of H100s now runs on a MacBook, this is a good thing! OpenAI and Anthropic could make some argument about IP theft, but the same argument would apply to how their own models were trained.
Running the equivalent of Sonnet 4.5 on your desktop is something to be very excited about.
Benchmaxxing is the norm in open weight models. It has been like this for a year or more.
I’ve tried multiple models that are supposedly Sonnet 4.5 level and none of them come close when you start doing serious work. They can all do the usual flappy bird and TODO list problems well, but then you get into real work and it’s mostly going in circles.
Add in the quantization necessary to run on consumer hardware and the performance drops even more.
Some of the open models have matched or exceeded Sonnet 4.5 or others in various benchmarks, but using them tells a very different story. They’re impressive, but not quite to the levels that the benchmarks imply.
Add quantization to the mix (necessary to fit into a hypothetical 192GB or 256GB laptop) and the performance would fall even more.
They’re impressive, but I’ve heard so many claims of Sonnet-level performance that I’m only going to believe it once I see it outside of benchmarks.
People can always distill them.
The question in the case of quants is: will they lobotomize it beyond the point where it would be better to switch to a smaller model like GPT-OSS 120B, which comes prequantized to ~60GB?
So in the same family, you can generally quantize all the way down to 2 bits before you want to drop down to the next smaller model size.
Between families, there will obviously be more variation. You really need evals specific to your use case if you want to compare them: performance can differ quite a bit between model families on different types of problems, and because of benchmark optimization it's really helpful to have your own tests to really try a model out.
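A minimal personal-eval sketch along those lines, hitting an OpenAI-compatible endpoint (llama.cpp server, vLLM, etc.); the URL, model name, and tasks are placeholders to swap for your own real work:

```python
import json, urllib.request

TASKS = [
    {"prompt": "What is 17 * 23?", "answer": "391"},
    # ...replace with tasks drawn from your actual workload
]

def ask(base_url, model, prompt):
    # Plain chat-completions request; no API key assumed for a local server.
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps({"model": model,
                         "messages": [{"role": "user", "content": prompt}]}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def score(base_url, model):
    hits = sum(t["answer"] in ask(base_url, model, t["prompt"]) for t in TASKS)
    return hits / len(TASKS)

# print(score("http://localhost:8080", "qwen3.5-q2_k"))  # placeholders
```

Run the same task set across each quant level and the smaller model, and the "2 bits vs next size down" question answers itself for your workload.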
...this can't be literally true or no one (including e.g. OpenAI) would use > 6 bits, right?
I'm sure it can do 2+2= fast
After that? No way.
There is a reason NVIDIA is #1 and my fortune 20 company did not buy a macbook for our local AI.
What inspires people to post this? Astroturfing? Fanboyism? Post Purchase remorse?
It can notably run some of the best open weight models with little power and without triggering its fan.
This is why I'm personally waiting for the M5/M6 to finally have some decent prompt-processing performance; it makes a huge difference in all the agentic tools.
This is how I know something is fishy.
No one cares about this. This became a new benchmark when Apple couldn't compete anywhere else.
I understand that if you already made the mistake of buying something that doesn't perform as well as you were expecting, you are going to look for ways to justify the purchase. "It runs with little power" is on 0 people's Christmas list.
It’s also good value if you want a lot of memory.
What would you advise for people with a similar budget? It's a real question.
There is novelty, but no practical use case.
My $700, 2023-era 3060 laptop runs 8B models. At the enterprise level we got two A6000s.
Both are useful and were used for economic gain. I don't think you have gotten any gain.
Two A6000s are fast but quite limited in memory. It depends on the use case.
Mac expectations in a nutshell lmao
I already knew this because we tried doing it at an enterprise level, but it makes me well aware nothing has changed in the last year.
We are not talking about the same things. You are talking about "Teknickaly possible". I'm talking about useful.
Interesting rabbit hole for me - its AI report mentions Fennec (Sonnet 5) releasing Feb 4 -- I was like "No, I don't think so", then I did a lot of googling and learned that this is a common misperception amongst AI-driven news tools. Looks like there was a leak, rumors, a planned(?) launch date, and .. it all adds up to a confident launch summary.
What's interesting about this is I'd missed all the rumors, so we had a sort of useful hallucination. Notable.
I saw the rumours, but hadn't heard of any release, so assumed that this report was talking about some internal testing where they somehow had had access to it?
Bizarre
Download every GitHub repo
-> Classify if it could be used as an env, and what types
-> Issues and PRs are great for coding RL envs
-> If the software has a UI, awesome, UI env
-> If the software is a game, awesome, game env
-> If the software has xyz, awesome, ...
-> Do more detailed run checks:
   -> Can it build?
   -> Is it complex and/or distinct enough?
   -> Can you verify if it reached some generated goal?
   -> Can generated goals even be achieved?
   -> Maybe some human review - maybe not
-> Generate goals
   -> For a coding env, you can imagine having an LLM introduce a new bug and seeing that test cases now fail. The goal for the model is now to fix it
-> ... Do the rest of the normal RL env stuff (a toy sketch of this pipeline follows below)

So then the next version is even better, because it got more data / better data. And it becomes better...
This is mainly why we're seeing so many improvements, so fast (month to month now, from every 3 months ~6 months ago, from every 6 months ~1 year ago). It becomes a literal "throw money at the problem" type of improvement.
For anything that's "verifiable" this is going to continue. For anything that is not, things can also improve with concepts like "llm as a judge" and "council of llms". Slower, but it can still improve.
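A toy sketch of that harvest loop; classify_repo(), try_build(), and propose_goal() are hypothetical stand-ins for what would really be LLM calls and sandboxed builds:

```python
from dataclasses import dataclass

@dataclass
class Env:
    repo: str
    kind: str   # "coding", "ui", "game", ...
    goal: str   # e.g. "make the failing test pass again"

def harvest_envs(repos, classify_repo, try_build, propose_goal):
    envs = []
    for repo in repos:
        kind = classify_repo(repo)       # None if the repo isn't usable as an env
        if kind is None or not try_build(repo):
            continue
        goal = propose_goal(repo, kind)  # e.g. inject a bug; target = fix it
        if goal is not None:             # drop goals that can't be verified
            envs.append(Env(repo, kind, goal))
    return envs
```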
this part is nontrivial though
> "In particular, Qwen3.5-Plus is the hosted version corresponding to Qwen3.5-397B-A17B with more production features, e.g., 1M context length by default, official built-in tools, and adaptive tool use."
Anyone know more about this? The OSS version seems to have a 262144 context length; I guess for the 1M they'll ask you to use YaRN?
YaRN, but with some caveats: current implementations might reduce performance on short contexts, so only use YaRN for long tasks.
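For reference, earlier Qwen model cards documented YaRN as a rope_scaling entry in the HF config, something like the sketch below; the repo id and exact values for Qwen3.5 are my assumptions, so check the model card:

```python
from transformers import AutoConfig

# Hypothetical repo id; the rope_scaling pattern follows earlier Qwen docs.
config = AutoConfig.from_pretrained("Qwen/Qwen3.5-397B-A17B")
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                               # 262144 * 4 = 1048576 (~1M)
    "original_max_position_embeddings": 262144,
}
```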
Interesting that they're serving both on openrouter, and the -plus is a bit cheaper for <256k ctx. So they must have more inference goodies packed in there (proprietary).
We'll see where the 3rd party inference providers will settle wrt cost.
It's basically the same as with the Qwen2.5 and 3 series but this time with 1M context and 200k native, yay :)
One big factor for local LLMs is that large context windows will seemingly always require large memory footprints. Without a large context window, you'll never get that Opus 4.6-like feel.
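The driver is the KV cache, which grows linearly with context. A rough calculation with made-up but plausible dimensions for a large model (fp16 cache, grouped-query attention, no cache quantization):

```python
def kv_cache_gib(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # K and V caches: layers * 2 * kv_heads * head_dim * seq_len elements.
    return layers * 2 * kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

print(kv_cache_gib(60, 8, 128, 262_144))    # ~60 GiB at 256k context
print(kv_cache_gib(60, 8, 128, 1_048_576))  # ~240 GiB at 1M context
```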
Qwen3.5 is doing ok on my limited tests: https://aibenchy.com
Whatever workflow led to that?
> I might have "dark" mode on on Chrome + MacOS.
Probably that's the reason.
This isn't a reason not to use Qwen. It just means having a sense of the constraints it was developed under. Unfortunately, populist political pressure to rewrite history is being applied to the American models as well. This means it's on us to apply reasonable skepticism to all models.
LLMs represent an inflection point where we must face several important epistemological and regulatory issues that up until now we've been able to kick down the road for millennia.
I'm so tired of seeing this exact same response under EVERY SINGLE release from a Chinese lab. At this point it's starting to read more xenophobic and nationalist than having anything to do with the quality of the model or its potential applications.
If you're just here to say the exact same thoughtless line that ends up in triplicate under every post then please at least have an original thought and add something new to the conversation. At this point it's just pointless noise and it's exhausting.
But there are topics that ChatGPT hard blocks just like Qwen [1].
[1] https://www.independent.co.uk/tech/chatgpt-ai-david-mayer-op...