Codex is notably higher quality but also has me waiting forever. Hopefully these small models get better and better, not just at benchmarks.
- Tool calling doesn't work properly with OpenCode
- It repeats itself very quickly. This is addressed in the Unsloth guide and can be "fixed" by setting --dry-multiplier to 1.1 or higher (a sample invocation is shown after this list)
- It makes a lot of spelling errors such as replacing class/file name characters with "1". Or when I asked it to check AGENTS.md it tried to open AGANTS.md
I tried both the Q4_K_XL and Q5_K_XL quantizations and they both suffer from these issues.
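For reference, here's roughly what that DRY setting looks like if you serve the model with llama-server; a sketch with a placeholder model path, and --dry-multiplier set per the Unsloth guide:

llama-server -ngl 999 --ctx-size 32768 --dry-multiplier 1.1 -m /path/to/GLM-4.7-Flash-Q4_K_XL.gguf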
This user has also done a bunch of good quants:
https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs
It isn't much until you get down to very small quants.
The flash model in this thread is more than 10x smaller (30B).
https://huggingface.co/models?other=base_model:quantized:zai...
Probably as:
issue to follow: https://github.com/ggml-org/llama.cpp/issues/18931
And while it usually leads to higher quality output, sometimes it doesn't, and I'm left with BS AI slop that would have taken Opus just a couple of minutes to generate anyway.
Also notice that this is the "-Flash" version. They were previously at 4.5-Flash (they skipped 4.6-Flash). This is supposed to be equivalent to Haiku. Even on their coding plan docs, they mention this model is supposed to be used for `ANTHROPIC_DEFAULT_HAIKU_MODEL`.
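For context, pointing Claude Code's Haiku tier at it is just environment variables; a rough sketch, where the base URL and the exact model slug are assumptions you should verify against z.ai's coding-plan docs:

export ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic"    # assumed z.ai Anthropic-compatible endpoint (verify)
export ANTHROPIC_AUTH_TOKEN="your-zai-api-key"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-4.7-flash"          # model slug is an assumption; check their docs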
However they are using more thinking internally and that makes them seem slow.
They also charge full price for the same cached tokens on every request/response, so I burned through $4 on one relatively simple coding task - it would've cost <$0.50 with GPT-5.2-Codex or any other caching-enabled model besides Opus and maybe Sonnet. And it would've been much faster.
1 million tokens per minute, 24 million tokens per day. BUT: cached tokens count full, so if you have 100,000 tokens of context you can burn a minute of tokens in a few requests.
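To put numbers on that: with a 100,000-token context counted at full price, roughly ten requests already add up to ~1 million tokens (the entire per-minute allowance), and around 240 such requests would exhaust the 24-million-token daily cap.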
Yes, it has some restrictions as well, but it still works for free. I have a private repository where I ended up creating a Puppeteer instance where I can just type something into a CLI and get the output back in the CLI.
With current agents, I don't see why I couldn't just expand that with a cheap model (I think MiniMax M2.1 is pretty good for agents) and have the agent write the files, do the work, and run in a loop.
I think the repository might have gotten deleted after I reset my old system or something similar, but I can look for it if this interests you.
Cerebras is such a good company. I talked to their CEO on Discord once and have been following them for a year or two now. I hope they don't get enshittified by the recent OpenAI deal, and that they improve their developer experience, because people genuinely want to pay them - instead I had to resort to a shenanigan that was free (though honestly I was just curious about how Puppeteer works and wanted to see if the idea was possible at all, and I didn't really use it much after building it).
It's even cheaper to just use it through z.ai themselves I think.
Due to US foreign policy, I quit Claude yesterday and picked up MiniMax M2.1. We wrote a whole design spec for a project I'd previously written a spec for with Claude (with some changes to the architecture this time - adjacent, not the same).
My gut feel? I prefer MiniMax M2.1 with OpenCode to Claude. Easiest boycott ever.
(I even picked the $10 plan; it was fine for now.)
People talk about these models like they are "catching up", not seeing that they are just trailers hooked up to a truck that is pulling them along.
They usually lagged for large sets of users: Linux was not as advanced as Solaris, PostgreSQL lacked important features contained in Oracle. The practical effect of this is that it puts the proprietary implementation on a treadmill of improvement where there are two likely outcomes: 1) the rate of improvement slows enough to let the OSS catch up or 2) improvement continues, but smaller subsets of people need the further improvements so the OSS becomes "good enough." (This is similar to how most people now do not pay attention to CPU speeds because they got "fast enough" for most people well over a decade ago.)
Proxmox became good and reliable enough as an open-source alternative for server management. Especially for the Linux enthusiasts out there.
That's going to be pretty hard for OpenAI to figure out, and even if they do figure it out and stop me, there will be thousands of other companies willing to do that arbitrage. (Just for the record, I'm not doing this, but I'm sure people are.)
They would need to be very restrictive about who is and isn't allowed to use the API, and that would kill their growth, because customers would just go to Google or another provider that is less restrictive.
Number two, I'm not sure random samples collected from even a moderately large number of users make a great base of training examples for distillation. I would expect they'd need more focused samples in very specific areas to achieve good results.
This is a terrible "test" of model quality. All these models fail when your UI is out of distribution; Codex gets close but still fails.
And yet, in terms of coding performance (at least as measured by SWE-Bench Verified), it seems to be roughly on par with o3/GPT-5 mini, which would be pretty impressive if it translated to real-world usage, for something you can realistically run at home.
If you were happy with Claude at its Sonnet 3.7 & 4 levels 6 months ago, you'll be fine with them as a substitute.
But they're nowhere near Opus 4.5
https://github.com/ggml-org/llama.cpp/releases
https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/blob/main/G...
https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#sup...
llama-server -ngl 999 --ctx-size 32768 -m GLM-4.7-Flash-Q4_K_M.gguf
You can then chat with it at http://127.0.0.1:8080 or use the OpenAI-compatible API at http://127.0.0.1:8080/v1/chat/completions

Seems to work okay, but there usually are subtle bugs in the implementation or chat template when a new model is released, so it might be worthwhile to update both model and server in a few days.
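For a quick smoke test against that endpoint (a sketch assuming the default port; llama-server serves whatever model it has loaded, so no model field is needed):

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a one-line hello world in Python."}]}'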
ollama run hf.co/ngxson/GLM-4.7-Flash-GGUF:Q4_K_M
It's really fast! But, for now it outputs garbage because there is no (good) template. So I'll wait for a model/template on ollama.com.

We've launched GLM-4.7-Flash, a lightweight and efficient model designed as the free-tier version of GLM-4.7, delivering strong performance across coding, reasoning, and generative tasks with low latency and high throughput.
The update brings competitive coding capabilities at its scale, offering best-in-class general abilities in writing, translation, long-form content, role play, and aesthetic outputs for high-frequency and real-time use cases.
https://docs.z.ai/release-notes/new-released

This seems pretty darn good for a 30B model. That's significantly better than the full Qwen3-Coder 480B model at 55.4.
It's still not as accurate as benchmarking on your own workflows, but it's better than the original benchmark, or any other public benchmark.
GLM 4.7 is good enough to be a daily driver but it does frustrate me at times with poor instruction following.
Not for code. The quality is so low, it's roughly on par with Sonnet 3.5
In my experience, small-tier models are good for simple tasks like translation and trivia answering, but are useless for anything more complex. The 70B class and above is where models really start to shine.
My recommendation would be to use other tools built to support pluggable model backends better. If you're looking for a Claude Code alternative, I've been liking OpenCode lately, and if you're looking for a Cursor alternative, I've heard great things about Roo/Cline/KiloCode, although I personally still just use Continue out of habit.
https://huggingface.co/inference/models?model=zai-org%2FGLM-...
Slow inference is also present on z.ai; eyeballing it, the 4.7 Flash model is about twice as slow as regular 4.7 right now.
I am interested in whether I can run it on a 24GB RTX 4090.
Also, would vllm be a good option?
Should be able to run this in 22GB vram so your 4090 (and a 3090) would be safe. This model also uses MLA so you can run pretty large context windows without eating up a ton of extra vram.
edit: 19GB vram for a Q4_K_M - MLX4 is around 21GB so you should be clear to run a lower quant version on the 4090. Full BF16 is close to 60GB so probably not viable.
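Back-of-the-envelope, for anyone checking those numbers: ~30B parameters at 2 bytes each is ~60GB for BF16, while a roughly 4.5-5 bit quant like Q4_K_M works out to about 17-19GB before the KV cache, which lines up with the figures above.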
I suppose Flash is merely a distillation of that. Filed under mildly interesting for now.
Also, according to the gpt-oss model card, the 20b scores 60.7 on SWE-Bench Verified (GLM claims they measured 34 for that model) and the 120b scores 62.7, versus the 59.7 that GLM reports.
HEAD of ollama with Q8_0, versus vLLM with BF16 and then FP8.
BF16 was predictably bad. I'm surprised FP8 performed so poorly, but I might not have things tuned that well. New at this.
┌─────────┬───────────┬──────────┬───────────┐
│ │ vLLM BF16 │ vLLM FP8 │ Ollama Q8 │
├─────────┼───────────┼──────────┼───────────┤
│ Tok/sec │ 13-17 │ 11-19 │ 32 │
├─────────┼───────────┼──────────┼───────────┤
│ Memory │ ~62GB │ ~28GB │ ~32GB │
└─────────┴───────────┴──────────┴───────────┘
Most importantly, it actually worked nicely in opencode, which I couldn't get Nemotron to do.

Tolerating this is very bad form from openrouter, as they default-select the lowest price - meaning people who just jump into using openrouter and don't know about this fuckery get facepalm'd by the perceived model quality.
ssh admin.hotaisle.app
Yes, this should be made easier to just get a VM with it pre-installed. Working on that.
It took me quite some time to figure out the magic combination of versions and commits, and to build each dependency successfully to run on an MI325x.
Here is the magic (assuming a 4x)...
docker run -it --rm \
--pull=always \
--ipc=host \
--network=host \
--privileged \
--cap-add=CAP_SYS_ADMIN \
--device=/dev/kfd \
--device=/dev/dri \
--device=/dev/mem \
--group-add render \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v /home/hotaisle:/mnt/data \
-v /root/.cache:/mnt/model \
rocm/vllm-dev:nightly
mv /root/.cache /root/.cache.foo
ln -s /mnt/model /root/.cache
VLLM_ROCM_USE_AITER=1 vllm serve zai-org/GLM-4.7-FP8 \
--tensor-parallel-size 4 \
--kv-cache-dtype fp8 \
--quantization fp8 \
--enable-auto-tool-choice \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--load-format fastsafetensors \
--enable-expert-parallel \
--allowed-local-media-path / \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--mm-encoder-tp-mode data

My Mac Mini probably isn't up for the task, but in the future I might be interested in a Mac Studio just to churn at long-running data enrichment types of projects.