I'm not sure how these new models compare to the biggest and baddest models, but if price, speed, and reliability are a concern for your use cases I cannot recommend Mistral enough.
Very excited to try out these new models! To be fair, mistral-3-medium-0525 still occasionally produces gibberish in ~0.1% of my use cases (vs gpt-5's 15% failure rate). Will report back if that goes up or down with these new models.
On the API side of things my experience is that the model behaving as expected is the greatest feature.
There I also switched to OpenRouter instead of paying directly so I can use whatever model fits best.
The recent buzz about ad-based chatbot services is probably because the companies no longer have an edge, despite what the benchmarks say; users are noticing and cancelling paid plans. Just today OpenAI offered me a 1-month free trial, as if I wasn’t using it two months ago. I guess they hope I forget to cancel.
Was really plug and play. There are still small nuances to each one, but compared to a year ago prompts are much more portable
Business model of most subscription based services.
I feel like at least for normies if they are familiar with ChatGPT, it might be hard to make them switch especially if they are subscribed.
What is your use-case?
Mine is: I use "Pro"/"Max"/"DeepThink" models to iterate on novel cross-domain applications of existing mathematics.
My interaction is: I craft a detailed prompt in my editor, hand it off, come back 20-30 minutes later, review the reply, and then repeat if necessary.
My experience is that they're all very, very different from one another.
Sure, they produce different output, so sometimes I will run the same thing on a few different models when I’m not sure or not happy, but I don’t actually delegate the thinking part; I always give a direction in my prompts. I don’t see myself running 30-minute queries because I will never trust the output and will have to do all the work myself. Instead I like to go step by step together.
excited to add mistral to the rotation!
There's a new model seemingly every week, so finding a way to evaluate them repeatedly would be nice.
The answer may be that it's so bespoke you have to handroll every time, but my gut says there's a set of best practices that are generally applicable.
1. Sample a set of prompts / answers from historical usage.
2. Run that through various frontier models again and if they don't agree on some answers, hand-pick what you're looking for.
3. Test different models using OpenRouter and score each along cost / speed / accuracy dimensions against your test set.
4. Analyze the results and pick the best, then prompt-optimize to make it even better. Repeat as needed.
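The scoring half of steps 3 and 4 can be sketched roughly like this. It is a minimal sketch, not anyone's actual harness: the `Result` fields, the model names, the numbers, and the "accuracy floor, then cheapest first" ranking rule are all assumptions made for illustration, and the rows would really come from running your sampled prompts through each model.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Result:
    """One model's graded answer to one test-set prompt (hypothetical shape)."""
    model: str
    correct: bool      # did the answer match the hand-picked reference?
    cost_usd: float    # cost of the call
    latency_s: float   # wall-clock time of the call


def summarize(results):
    """Aggregate per-model accuracy, mean cost, and mean latency."""
    by_model = defaultdict(list)
    for r in results:
        by_model[r.model].append(r)
    return {
        model: {
            "accuracy": mean(1.0 if r.correct else 0.0 for r in rs),
            "cost_usd": mean(r.cost_usd for r in rs),
            "latency_s": mean(r.latency_s for r in rs),
        }
        for model, rs in by_model.items()
    }


def rank(summary, min_accuracy=0.9):
    """Drop models below the accuracy floor, then sort by cost, with latency as tiebreak."""
    ok = [(m, s) for m, s in summary.items() if s["accuracy"] >= min_accuracy]
    ok.sort(key=lambda ms: (ms[1]["cost_usd"], ms[1]["latency_s"]))
    return [m for m, _ in ok]


# Made-up rows purely for illustration.
results = [
    Result("model-a", True, 0.002, 1.1),
    Result("model-a", True, 0.002, 0.9),
    Result("model-b", True, 0.010, 3.0),
    Result("model-b", False, 0.010, 2.0),
    Result("model-c", True, 0.001, 4.0),
    Result("model-c", True, 0.001, 6.0),
]
summary = summarize(results)
ranking = rank(summary)
print(ranking)
```

Treating accuracy as a hard floor rather than one term in a weighted score is just one choice; if latency matters more than cost for your use case, swap the sort key.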
The only exception I can think of is models trained on synthetic data like Phi.
Also, we should be aware of people cynically playing into that bias to try to advertise their app, like OP who has managed to spam a link in the first line of a top comment on this popular front page article by telling the audience exactly what they want to hear ;)
I labor over every word, every button, every line of code, every blog post. I would say it is as hand-crafted as something digital can be.
Full transparency, the first backend version of phrasing was 'vibe-coded' (long before vibe coding was a thing). I didn't like the results, I didn't like the experience, I didn't feel good ethically, and I didn't like my own development.
I rewrote the application (completely, from scratch, new repo new language new framework) and all of a sudden I liked the results, I loved the process, I had no moral qualms, and I improved leaps and bounds in all areas I worked on.
Automation has some amazing use cases (I am building an automation product at the end of the day) but so does doing hard things yourself.
Although most important is just to enjoy what you do; or perhaps do something you can be proud of.
Does Mistral even have a Tool Use model? That would be awesome to have a new coder entrant beyond OpenAI, Anthropic, Grok, and Qwen.
I then tried multiple models, and they all failed in spectacular ways. Only Grok and Mistral had an acceptable success rate, although Grok did not follow the formatting instructions as well as Mistral.
Phrasing is a language learning application, so the formatting is very complicated, with multiple languages and multiple scripts intertwined with markdown formatting. I do include dozens of examples in the prompts, but it's something many models struggle with.
This was a few months ago, so to be fair, it's possible gpt-5.1 or gemini-3 or the new deepseek model may have caught up. I have not had the time or need to compare, as Mistral has been sufficient for my use cases.
I mean, I'd love to get that 0.1% error rate down, but there have always been more pressing issues XD
I tried using it for a very small and quick summarization task that needed low latency and any level above that took several seconds to get a response. Using minimal brought that down significantly.
Weirdly, GPT-5's reasoning levels don't map to the OpenAI API's reasoning effort levels.
These are screenshots from that week: https://x.com/barrelltech/status/1995900100174880806
I'm not going to share the prompt because (1) it's very long (2) there were dozens of variations and (3) it seems like poor business practices to share the most indefensible part of your business online XD
Impressive, I haven't seen that myself yet, I've only used 5 conversationally, not via API yet.
And yes, this only happens when I ask it to apply my formatting rules. If you let GPT format itself, I would be surprised if this ever happens.
But I'm no expert. I can't say I've used mistral much outside of my own domain.
But in that case the larger the better. If mistral medium can run on your M2 Ultra then it should be up to the task. Should edge out ministral and be just shy of the biggest frontier models.
But I wouldn’t even trust GPT-5 or Claude Opus or Gemini 3 Pro to get close to a zero percent failure rate, and for a task such as this I would not expect mistral medium to outperform the big boys.
It's a good thing that open source models use the best arch available. K2 does the same but at least mentions "Kimi K2 was designed to further scale up Moonlight, which employs an architecture similar to DeepSeek-V3".
---
vllm/model_executor/models/mistral_large_3.py
```
from vllm.model_executor.models.deepseek_v2 import DeepseekV3ForCausalLM


class MistralLarge3ForCausalLM(DeepseekV3ForCausalLM):
    ...  # class body not shown in this excerpt
```
"Science has always thrived on openness and shared discovery." btw
Okay I'll stop being snarky now and try the 14B model at home. Vision is good additional functionality on Large.
To quote the hf page:
>Behind vision-first models in multimodal tasks: Mistral Large 3 can lag behind models optimized for vision tasks and use cases.
Of course models purely made for image stuff will completely wipe it out. The vision language models are useful for their generalist capabilities
Pelicans are OK but not earth-shattering: https://simonwillison.net/2025/Dec/2/introducing-mistral-3/
Ouch
https://ai.google.dev/gemini-api/docs/video-understanding#tr...
Mistral had the best small models on consumer GPUs for a while, hopefully Ministral 14B lives up to their benchmarks.
Had they gone to the EU, Mistral would have gotten a minuscule grant from the EU to train their AI models.
There is a bit of it, yes, although how much exactly is difficult to know. It’s not all tax breaks and subventions; several public agencies are using it, including in the army so finding out the details is not trivial.
2. Did ASML invest in Mistral in their first round of venture funding or was it US VCs all along that took that early risk and backed them from the very start?
Risk aversion is in the DNA and in almost every plot of land in Europe, such that US VCs saw something in Mistral before even European giants like ASML did.
ASML would have passed on Mistral from the start and Mistral would have instead begged to the EU for a grant.
2. ASML was propped up by ASM and Philips, stepping in as "VCs"
Isn't that then a chicken and egg?
No. VC’s historical capital has come from institutional investors. Pensions. Endowments. Foundations.
Mistral Large 3 is ranked 28, behind all the other major SOTA models. The delta between Mistral and the leader is only 1418 vs. 1491 though. I *think* that means the difference is relatively small.
Does that also mean that Gemini-3 (the top ranked model) loses to mistral 3 40% of the time?
Does that make Gemini 1.5x better, or mistral 2/3rd as good as Gemini, or can we not quantify the difference like that?
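One way to sanity-check that 40% figure, assuming the leaderboard scores behave like conventional Elo ratings (base 10, 400-point scale). That scale is an assumption here; the leaderboard's own fitting method may differ in detail, and ties are counted as half a win.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Elo-style expected score of A against B (a tie counts as half a win)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


leader = 1491   # top score quoted above
mistral = 1418  # Mistral Large 3 score quoted above

p_leader = expected_score(leader, mistral)
p_mistral = expected_score(mistral, leader)
odds = p_leader / p_mistral  # how many head-to-head wins per loss

print(f"leader ~{p_leader:.1%}, mistral ~{p_mistral:.1%}, odds ~{odds:.2f}:1")
```

Under that assumption a 73-point gap is roughly a 60/40 split in head-to-head votes, so "about 1.5x" is fair as betting odds on a single comparison. It says nothing about one model being 1.5x "better" in any absolute sense, since the scale only encodes pairwise preference.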
You can literally "improve" your model on LMArena by just adding a bunch of emojis.
The company I work for, for example, a mid-sized tech business, is currently investigating its local hosting options for LLMs. So Mistral certainly will be an option, among the Qwen family and DeepSeek.
Mistral is positioning themselves for that market, not the one you have in mind. Comparing their models with Claude etc. would mean associating themselves with the data leeches, which they probably try to avoid.
Funded mostly by US VCs?
Hosted primarily on Azure?
Do you really have to go out of your way to start calling their competition "data leeches" for out-executing them?
I mean why do you think those guys left Meta? It reminds me of a time ten years ago I was sitting on a flight with a guy who works for the natural gas industry. I was (cough still am) a pretty naive environmentalist, so I asked him what he thought of solar, wind, etc. and why should we be investing in natural gas when there are all these other options. His response was simple. Natural gas can serve as a bridge from hydrocarbons to true green energy sources. Leverage that dense energy to springboard the other sources in the mix and you build a path forward to carbon free energy.
I see Mistral's use of US VCs the same way. Those VCs are hedging their bets and maybe hoping to make a few bucks. A few of them are probably involved because they're buddies with the former Meta guys "back in the day." If Mistral executes on their plan of being a transparent b2b option with solid data protections then they used those VCs the way they deserve to be used and the VCs make a few bucks. If Europe ever catches up to the US in terms of data centers, would Mistral move off of Azure? I'd bet $5 that they would.
Pan-nationalism is a hell of a drug: a company that does not know you exist puts out an objectively awful release, and people take frank discussion of it as a personal slight.
How so?
There are also plenty of reasons not to use proprietary US models for comparison: The major US models haven't been living up to their benchmarks; their releases rarely include training & architectural details; they're not terribly cost effective; they often fail to compare with non-US models; and the performance delta between model releases has plateaued.
A decent number of users in r/LocalLlama have reported that they've switched back from Opus 4.5 to Sonnet 4.5 because Opus' real world performance was worse. From my vantage point it seems like trust in OpenAI, Anthropic, and Google is waning and this lack of comparison is another symptom.
We’re actually at a unique point right now where the gap is larger than it has been in some time. Consensus since the latest batch of releases is that we haven’t found the wall yet. 5.1 Max, Opus 4.5, and G3 are absolutely astounding models and unless you have unique requirements some way down the price/perf curve I would not even look at this release (which is fine!)
- Mistral Large 3 is comparable with the previous Deepseek release.
- Ministral 3 LLMs are comparable with older open LLMs of similar sizes.
To be fair, the SOTA models aren't even a single LLM these days. They are doing all manner of tool use and specialised submodel calls behind the scenes - a far cry from in-model MoE.
I think that Qwen3 8B and 4B are SOTA for their size. The GPQA Diamond accuracy chart is weird: both Qwen3 8B and 4B have higher scores, so they used this weird chart where the "x" axis shows the number of output tokens. I missed the point of this.
Why should they compare apples to oranges? Mistral Large 3 costs ~1/10th of Sonnet 4.5. They clearly target different users. If you want a coding assistant you probably wouldn't choose this model for various reasons. There is a place for more than only the benchmark king.
Why would they? They know they can't compete against the heavily closed-source models.
They are not even comparing against GPT-OSS.
That is absolutely and shockingly bearish.
edit: typos
DeepMind is not a UK company; it's Google, aka US.
Mistral is a real EU based company.
And an EU company can't be forced by the US Gov to hand over data.
The CLOUD Act and the current US administration doing things like sanctioning the ICC demonstrate why the location of those desks is important.
And no, it's not only Americans. I keep hearing this thing from people living in Europe as well (or better, in the EU). I also very often hear phrases like "Switzerland is not in Europe" to indicate that the country is not part of the European Union.
Google DeepMind does exist.
Open weight means secondary sales channels like their fine tuning service for enterprises [0].
They can't compete with large proprietary providers but they can erode and potentially collapse them.
Open weights and open research build on themselves, advancing their participants and creating an environment that has a shot against proprietary services.
Transparency, control, privacy, cost etc. do matter to people and corporations.
gpt-oss is killing the ongoing AIMO3 competition on Kaggle. They're using a hidden, new set of problems, IMO level, handcrafted to be "AI hardened". And gpt-oss submissions are at ~33/50 right now, two weeks into the competition. The benchmarks (at least for math) were not gamed at all. They are really good at math.
The next "public" model is qwen30b-thinking at 23/50.
Competition is limited to 1 H100 (80GB) and 5h runtime for 50 problems. So larger open models (deepseek, larger qwens) don't fit.
[1] https://www.kaggle.com/competitions/ai-mathematical-olympiad...
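A back-of-the-envelope way to see why the single 80GB H100 limit rules out the larger open models. This only counts weight storage; the 10% headroom left for KV cache and activations is a guess rather than a measured figure, the parameter counts are the ones quoted in this thread, and the bit widths are illustrative.

```python
def weight_gb(params_billions, bits_per_param):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8


def fits(params_billions, bits_per_param, gpu_gb=80, headroom=0.9):
    """True if the weights fit in the share of GPU memory left after headroom."""
    return weight_gb(params_billions, bits_per_param) <= gpu_gb * headroom


# (label, parameters in billions, bits per parameter)
candidates = [
    ("30B at 16-bit", 30, 16),
    ("30B at 4-bit", 30, 4),
    ("675B at 4-bit", 675, 4),
]
for label, params, bits in candidates:
    print(f"{label}: {weight_gb(params, bits):.1f} GB, fits={fits(params, bits)}")
```

Even at 4 bits per parameter, a 675B model needs several times the memory of one 80GB card for weights alone, before any runtime overhead or the 5h time budget comes into play.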
The token use chart in the OP release page demonstrates the Qwen issue well.
Token churn does help smaller models on math tasks, but for general purpose stuff it seems to hurt.
Releasing a near state-of-the-art open model instantly catapults companies to a valuation of several billion dollars, making it possible to raise money to acquire GPUs and train more SOTA models.
Now, what happens if such a business model does not emerge? I hope we won't find out!
Granted, this is a subject that is very well present in the training data but still.
Unfortunately that doesn't pay the electricity bill
There are a lot of businesses that do not want to hand over their sensitive data to hackers, employees of their competitors, and various world governments. There's inherent risk in choosing a proprietary option, and that doesn't just go for LLMs. You can get your feet swept out from under you.
I feel we're only a year or two away from hitting a plateau with the frontier closed models having diminishing returns vs what's "open"
Do things ever work that way? What if Google did open-source Gemini? Would you say the same? You never know. There's never "supposed" and "purpose" like that.
OpenAI went closed (despite open literally being in the name) once they had the advantage. Meta also is going closed now that they've caught up.
Open-source makes sense to accelerate to catch up, but once ahead, closed will come back to retain advantage.
Frankly, I don't actually care about or want "general intelligence" -- I want it to make good code, follow instructions, and find bugs. Gemini wasn't bad at the last bit, but wasn't great at the others.
They're all trying to make general purpose AI, but I just want really smart augmentation / tools.
It's also slower than both Opus 4.5 and Sonnet.
In prior posts you oddly attack "Palantir-partnered Anthropic" as well.
Are things that grim at OpenAI that this sort of FUD is necessary? I mean, I know they're doing the whole code red thing, but I guarantee that posting nonsense like this on HN isn't the way.
Trust no one, test your use case yourself is pretty much the only approach, because people either don't run benchmarks correctly or have the incentive not to.
- Gemini 3.0 Pro : 84.8
- DeepSeek 3.2 : 83.6
- GPT-5.1 : 69.2
- Claude Opus 4.5 : 67.4
- Kimi-K2 (1.2T) : 42.0
- Mistral Large 3 (675B) : 41.9
- Deepseek-3.1 (670B) : 39.7
The 14B, 8B & 3B models are SOTA though, and do not have Chinese censorship like Qwen3.
Maybe an architectural leap?
Most likely reason is that the instruct model underperforms compared to the open competition (even among non-reasoners like Kimi K2).
https://huggingface.co/mistralai/Ministral-3-14B-Instruct-25...
The unsloth quants are here:
https://huggingface.co/unsloth/Ministral-3-14B-Instruct-2512...
https://www.llama.com/docs/how-to-guides/vision-capabilities...
If that doesn't even meet the threshold for "terrible", then what does?
I guess it says a bit about the state of European AI
I also think most people do not consider open weights as OSS.
Like how does 14B compare to Qwen30B-A3B?
(Which I think is a lot of people's go-to, or its instruct/coding variant, from what I've seen in local model circles)
My guess is the vast scale of google data. They've been hoovering data for decades now, and have had curation pipelines (guided by real human interactions) since forever.
Went to actually use it, got a message saying that I missed a payment 8 months previously and thus wasn't allowed to use Pro despite having paid for Pro for the previous 8 months. The lady I contacted in support simply told me to pay the outstanding balance. You would think if you missed a payment it would relate to simply that month that was missed not all subsequent months.
Utterly ridiculous that one missed payment can justify not providing the service (otherwise paid for in full) at all.
Basically if you find yourself in this situation you're actually better off deleting the account and signing up again under a different email.
We really need to get our shit together in the EU on this sort of stuff, I was a paying customer purely out of sympathy but that sympathy dried up pretty quick with hostile customer service.
This sounds like you expect your subscription to work as an on-demand service? It seems quite obvious that to be able to use a service you would need to be up to date on your payments; that would be no different in any other subscription/lease/rental agreement. Now Mistral might certainly look back at their records and see that you actually didn't use their service at all for the last few months and waive the missed payment. And that could be good customer service, but they might not even have a record that you didn't use it, or at least those records might not be available to the billing department.
I understand perfectly well; the issue is that I don't agree with that approach.
If I paid for 11/12 months I should get 11/12 months subscription not 1/12 months. They happily just took a year's subscription and provided nothing in return. Even if I fixed the outstanding balance they would have provided 2/12 months of service at a cost of 12/12 months of payment.
Also a lot of Europeans are upset at US tech dominance. It's a position we've roped ourselves into, so any commentary that criticises an EU tech success story is seen as being unnecessarily negative.
However I do mean it as a warning to others, I got burned even with good intentions.