Btw, you can also run Mistral locally with the Docker Model Runner on a Mac.
I've run that using both Ollama (easiest) and MLX. Here are the Ollama models: https://ollama.com/library/mistral-small3.1/tags - the 15GB one works fine.
For MLX https://huggingface.co/mlx-community/Mistral-Small-3.1-24B-I... and https://huggingface.co/mlx-community/Mistral-Small-3.1-24B-I... should work, I use the 8bit one like this:
llm install llm-mlx
llm mlx download-model mlx-community/Mistral-Small-3.1-Text-24B-Instruct-2503-8bit -a mistral-small-3.1
llm chat -m mistral-small-3.1
The Ollama one supports image inputs too:
llm install llm-ollama
ollama pull mistral-small3.1
llm -m mistral-small3.1 'describe this image' \
-a https://static.simonwillison.net/static/2025/Mpaboundrycdfw-1.png
Output here: https://gist.github.com/simonw/89005e8aa2daef82c53c2c2c62207...
Qwen 3 8B on MLX runs in just 5GB of RAM and can write basic code, but I don't know if it would be good enough for anything interesting: https://simonwillison.net/2025/May/2/qwen3-8b/
Honestly though with that little memory I'd stick to running against hosted LLMs - Claude 3.7 Sonnet, Gemini 2.5 Pro, o4-mini are all cheap enough that it's hard to spend much money with them for most coding workflows.
I tried to run some of the different sizes of DeepSeek R1 locally when they first came out, but couldn't manage to run any of them at the time. And I had to download a lot of data just to try. So if you know a specific size of DeepSeek R1 that will work with 64GB of RAM on a MacBook Pro M2 Max, or another great local LLM for coding on that machine, that would be super appreciated.
Specifically, the `Q6_K` quant[1] looks solid at ~27gb. That leaves enough headroom on your 64gb MacBook that you can actually load a decent amount of context. (It takes extra VRAM for every token of context you need.)
Rough math, based on this[0] calculator, is that it's around ~10gb per 32k tokens of context. And that doesn't seem to change based on using a different quant size -- you just have to have enough headroom.
So with 64gb:
- ~25gb for Q6 quant
- 10-20gb for context of 32-64k
That leaves you around 20gb for application memory and _probably_ enough context to actually be useful for larger coding tasks! (It just might be slow, but you can use a smaller quant to get more speed.)
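If it helps, here's that same budget as a tiny Python sketch. It only uses the rough figures from above (64gb total, ~27gb for the Q6_K quant, ~10gb per 32k tokens of context), so treat it as back-of-the-envelope arithmetic rather than measurements:

    # Rough VRAM budget for the Q6_K quant on a 64GB Mac, using the
    # approximate figures quoted above (not measured values).
    total_ram_gb = 64
    model_gb = 27            # ~27GB for the Q6_K quant
    kv_gb_per_32k = 10       # ~10GB of KV cache per 32k tokens of context

    for context_tokens in (32_000, 64_000):
        kv_gb = kv_gb_per_32k * context_tokens / 32_000
        leftover_gb = total_ram_gb - model_gb - kv_gb
        print(f"{context_tokens:>6} tokens: ~{kv_gb:.0f}GB KV cache, "
              f"~{leftover_gb:.0f}GB left for macOS and apps")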
I hope that helps!
0: https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calcul...
1: https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-32...
I don't know if they'll be good enough for general coding tasks though - I've been spoiled by API access to Claude 3.7 Sonnet and o4-mini and Gemini 2.5 Pro.
I've yet to find a good overview of how much memory each model needs for different context lengths (other than back of the envelope #weights * bits). LM Studio warns you if a model will likely not fit, but it's not very exact.
Still, it reports accurate peak memory usage for tensors living on GPU, but seems to miss some of the non-Metal overhead, however small (https://github.com/aukejw/mlx_transformers_benchmark/issues/...).
I'm benchmarking runtime and memory usage for a few of them: https://aukejw.github.io/mlx_transformers_benchmark/
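For a rough per-model overview, the back-of-the-envelope version is weights (#params * bits) plus a KV cache that grows linearly with context length. Here's a small sketch; the architecture numbers are assumptions for a Qwen2.5-32B-style model, and real runtimes add some overhead on top (the calculator above reports somewhat higher figures, presumably for that reason):

    # Back-of-the-envelope memory estimate: weights plus KV cache.
    # The config values below are assumptions for a Qwen2.5-32B-style model.
    params_b = 32.8          # billions of parameters
    weight_bits = 6.5        # roughly a Q6_K quant
    n_layers = 64
    n_kv_heads = 8           # grouped-query attention
    head_dim = 128
    kv_bytes_per_elem = 2    # fp16 KV cache

    weights_gb = params_b * 1e9 * weight_bits / 8 / 1e9
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes_per_elem
    for context in (8_000, 32_000, 64_000):
        kv_gb = kv_bytes_per_token * context / 1e9
        print(f"{context:>6} ctx: ~{weights_gb:.0f}GB weights + ~{kv_gb:.1f}GB KV cache")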
Right now, for a coding LLM on a Mac, the standard is Qwen 3 32b, which runs great on any M1 mac with 32gb memory or better. Qwen 3 235b is better, but fewer people have 128gb memory.
Anything smaller than 32b, you start seeing a big drop off in quality. Qwen 3 14b Q4_K_M is probably your best option at 16gb memory, but it's significantly worse in quality than 32b.
I have LMStudio installed, and use Continue in VSCode, but it doesn't feel nearly as feature rich compared to using something like Cursor's IDE, or the GitHub Copilot plugin.
A Raspberry Pi? An old ThinkPad? A fully specced-out latest-gen MacBook?
edit: One of those old Mac Pros?
https://github.com/garagesteve1155/Overload
(As announced this morning in the FB group "Dull Men's Club"!)
Indeed. At work, we are experimenting with this. Using a cloud platform is a non-starter for data confidentiality reasons. On-premise is the way to go. Also, they’re not American, which helps.
> Btw, you can also run Mistral locally within the Docker model runner on a Mac.
True, but you can do that only with their open-weight models, right? They are very useful and work well, but their commercial models are bigger and hopefully better (I use some of their free models every day, but none of their commercial ones).
1. A typical contract transfers the rights to the work. The ownership of AI generated code is legally a wee bit disputed. If you modify and refactor generated code heavily it's probably fine, but if you just accept AI generated code en masse, making your client think that you wrote it and it is therefore their copyright, that seems dangerous.
2. A typical contract or NDA also contains non-disclosure provisions, i.e. you can't share confidential information, e.g. code (including code you _just_ wrote, due to #1) with external parties or the general public willy-nilly. Whether any terms-of-service assurances from OpenAI or Anthropic that your model inputs and outputs will probably not be used for training are legally sufficient, I have doubts.
IANAL, and _perhaps_ I'm wrong about one or both of these, in one or more countries, but by and large I'd say the risk is not worth the benefit.
I mostly use third party LLMs like I would StackOverflow: Don't post company code there verbatim, make an isolated example. And also don't paste from SO verbatim. I tried other ways of using LLMs for programming a few times in personal projects and can't say I worry about lower productivity with these limitations. YMMV.
(All this also generally goes for employees with typical employment contracts: It's probably a contract violation.)
Note that this is not a statement about the fairness or morality of LLM building. But to think that the legality of AI code generation is something to reasonably worry about is to bet against multiple large players and their hundreds of billions of dollars in investment right now, and that likely puts you in a bad spot in reality.
From what I've been following it seems very likely that, at least in the US, AI-generated anything can't actually be copyrighted and thus can't have ownership at all! The legal implications of this are yet to percolate through the system though.
https://llmlitigation.com/case-updates.html
Personally I have roughly zero trust in US courts on this type of issue, but we'll see how it goes. Arguably there are cases to be made where LLMs cough up code cribbed from repos with certain licenses without crediting authors and so on. It's probably a matter of time until some aggressively litigious actors make serious, systematic attempts at getting money out of this, producing case law as a by-product.
Edit: Oh right, Butterick et al went after Copilot and image generation too.
The parent statement reminds me of the smug French in a castle north of London circa 1200, with furious locals standing outside the gates, dressed in rags with farm tools as weapons. One well-equipped tower guard says to another, "No one is seriously disputing the administration of these lands."
I mean sure, but I think of my little agency providing value, for a price. Clients have budgets, they have limited benefits from any software they build, and in order to be competitive against other agencies or their internal teams, overall, I feel we need to provide a good bang for buck.
But since it's not all that much about typing in code, and since even that activity isn't all that sped up by LLMs, not if quality and stability matters, I would still agree that it's completely fine.
I meant that I don't care enough to spearhead and drive this effort within the client orgs. They have their own processes, and internal employees would surely also like to use AI, so maybe they'll get there eventually. And meanwhile I'll just use it in the approved ways.
Sure, you can say "I'd just lie about it". But I don't know how many people would just casually lie in court. I sure wouldn't. Ethics is one thing, it takes a lot of guts, considering the possible repercussions.
Legally speaking, you also want to be careful about your dependencies and their licenses, a company that's afraid to get sued usually goes to quite some lengths to ensure they play this stuff safe. A lot of smaller companies and startups don't know or don't care.
From a professional ethics perspective, personally, I don't want to put my clients in that position unless they consciously decide they want that. They hire professionals not just to get work done they fully understand, but to a large part to have someone who tells them what they don't know.
I'm still sorting all this stuff out personally. I like LLMs when I work in an area I know well. But vibing in areas of technology that I don't know well just feels weird.
In the LLM case, I think it’s more of an open question whether the LLM output is republishing the copyrighted content without notice, or simply providing access to copyrighted content. I think the former would put the LLM provider in hot water, while the latter would put the user in hot water.
Plenty of those startups will also use Google, OpenAI, or the built-in Microsoft AI.
This is clearly for companies that need to keep sensitive data under their control. I think they also get support with additional training to personalize the model for their needs.
The main players all allow some form of zero data retention but I'm sure the more cautious CISO/CIOs flat out don't trust it.
Still, I find local models very much worth using after taking the time to set them up with Emacs, open-codex, etc.
Same process, less people being called out for "cheating" in a professional setting.
Unless you have experience hosting and maintaining models at scale and with an enterprise feature set, I believe what they are offering is beyond (for now) what you'd be able to put up on your own.
https://www.grammar-monster.com/easily_confused/premise_prem...
The key thing they'd need to nail to make this better than what's already out there is the integrations. If they can make it seamless to integrate with all the key third-party enterprise systems then they'll have something strong here, otherwise it's not obvious how much they're adding over Open WebUI, LibreChat, and the other self-hosted AI agent tooling that's already available.
Those who don't have the time and desire to wire it all up probably make up a larger part of the market than those who do. It's a long-tail proposition, and that might be a problem.
> I have most of the features that they've listed here already set up on my desktop at home
I think your boss and your boss' boss are the audience they are going for. In my org there's concern over the democratization of locally run LLMs and the loss of data control that comes with it.
Mistral's product would allow IT or Ops or whatever department to set guardrails for the organization. The selling point that it's turn-key means that a small organization doesn't have to invest a ton of time into all the tooling needed to run it and maintain it.
Edit: I just re-read your comment and I do have to agree though. "game-changer" is a bit strong of a word.
We had a Mac Studio here that nobody was using, and we now use it as a tiny AI station. If we like, we could even embed our codebases, but it hasn't been necessary yet. Otherwise it should be easy to just buy a decent consumer PC with a stronger GPU, but performance isn't too bad even for autocomplete.
Efficiently? I thought macOS doesn't have an API that would let Docker use the GPU.
You might be talking about small tech companies that have no other options.
One can deploy a similar solution (on-prem) using better and more cost-efficient open-source models and infrastructure already.
What Mistral offers here is managing that deployment for you, but there's nothing stopping other companies doing the same with fully open stack. And those will have the benefit of not wasting money on R&D.
Is this an API endpoint? A model enterprises deploy locally? A piece of software plus a local model?
There is so much corporate synergy speak there that I can't tell what they're selling.
https://console.cloud.google.com/marketplace/product/mistral...
Which says:
"Managed Services are fully hosted, managed and supported by the service providers. Although you register with the service provider to use the service, Google handles all billing."
My assumption is that they're using Google Marketplace for discovery and billing, and they offer a hosted option or an on-prem option.
But agreed, it isn't clear!
- it joins billing with other stuff
- I guess it's easier to get approval
- and more important (at least in our case), it allows you to reach your Google Cloud (or AWS) contract commitments of expense, and keep your discounts :)
Unless you have sufficient VRAM to keep all potential specialized models loaded simultaneously (which negates some of the "lightweight" benefit for the overall system), you'll be forced into model swapping. Constantly loading and unloading models to and from VRAM is a notoriously slow process.
If you have concurrent users with diverse needs (e.g., a developer requiring code generation and a marketing team member needing creative text), the system would have to swap models in and out if they can't co-exist in VRAM. This drastically increases latency before the selected model even begins processing the actual request.
The latency from model swapping directly translates to a poor user experience. Users, especially in an enterprise context, are unlikely to tolerate waiting for a minute or more just for the system to decide which model to use and then load it. This can quickly lead to dissatisfaction and abandonment.
This external routing mechanism is, in essence, an attempt to implement a sort of Mixture-of-Experts (MoE) architecture manually and at a much coarser grain. True MoE models (like the recently released Qwen3-30B-A3B, for instance) are designed from the ground up to handle this routing internally, often with shared parameter components and highly optimized switching mechanisms that minimize latency and resource contention.
To mitigate the latency from swapping, you'd be pressured to provision significantly more GPU resources (more cards, more VRAM) to keep a larger pool of specialized models active. This increases costs and complexity, potentially outweighing the benefits of specialization if a sufficiently capable generalist model (or a true MoE) could handle the workload with fewer resources. And a lot of those additional resources would likely sit idle for most of the time, too.
For some context: this is a fairly limited exploratory deployment which runs alongside other priority projects for me, so I'm not too obsessed with optimizing the decision-making time. Those three seconds are relatively minor when compared with the 20–60 seconds it takes to unload the old and load a new model.
I can see semantic router being really useful in scenarios built around commercial, API-accessed models, though. There, it could yield significant cost savings by, for example, intelligently directing simpler queries to a less capable but cheaper model instead of the latest and greatest (and likely significantly more expensive) model users might feel drawn to. You're basically burning money if you let your employees use Claude 3.7 to format a .csv file.
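As a toy illustration of that idea, here's a minimal sketch of cost-based routing. The model names and the keyword heuristic are made up for the example; a real semantic router would classify the query with embeddings or a small classifier model instead of string matching:

    # Minimal sketch: send obviously mechanical requests to a cheap model,
    # everything else to the expensive one. Model names are placeholders.
    CHEAP_MODEL = "mistral-small-latest"   # assumption: cheap workhorse model
    EXPENSIVE_MODEL = "claude-3-7-sonnet"  # assumption: premium model

    SIMPLE_HINTS = ("format", "convert", "rename", "csv", "sort")

    def route(query: str) -> str:
        """Crude stand-in for a semantic router's classification step."""
        q = query.lower()
        if len(q) < 200 and any(hint in q for hint in SIMPLE_HINTS):
            return CHEAP_MODEL
        return EXPENSIVE_MODEL

    print(route("Please format this .csv file as a markdown table"))      # cheap
    print(route("Refactor this module and explain the design tradeoffs")) # expensive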
It's LLMs, all the way down.
I don't see any reason you couldn't stack more layers of routing in front, to select the model. However, this starts to seem inefficient.
I think the optimal solution will eventually be companies training and publishing hyper-focused expert models that are designed to be used with other models and a router. Then interface vendors can purchase different experts and assemble the models themselves, like how a phone manufacturer purchases parts from many suppliers, even their competitors, in order to create the best final product. The bigger players (e.g. Apple for this analogy) might make more parts in house, but even the latest iPhone still has Samsung chips in it in teardowns.
I think all providers guarantee that they will not use your API inputs for training, it's meant as the pro version after all.
Plus it's dirt cheap, I query them several times per day, with access to high end thinking models, and pay just a few € per month.
The intro video highlights searching email alongside other tools.
What email clients will this support? Are there related tools that will do this?
The EU is more similar to NAFTA or five eyes, and culturally the loyalty is more similar to the US vs the anglosphere, like how Americans think of Australia, UK and Canada. Well, again, until recently. Things are changing fast.
Comparing it to states is much more far-fetched. The UK left the union without any serious retaliation, let alone military conflict. What would happen if Texas or California seriously tried to secede?
Mistral has been consistently last place, or at least last place among ChatGPT, Claude, Llama, and Gemini/Gemma.
I know this because I had to use a permissive license for a side project and I was tortured by how miserably bad Mistral was, and how much better every other LLM was.
Need the best? ChatGPT
Need local stuff? Llama (maybe Gemma)
Need to do barely legal things that break most companies' TOS? Mistral... although DeepSeek probably beats it in 2025.
For people outside Europe, we don't have patriotism for our LLMs, we just use the best. Mistral has barely any usecase.
You probably want to replace Llama with Qwen in there. And Gemma is not even close.
> Mistral has been consistently last place, or at least last place among ChatGPT, Claude, Llama, and Gemini/Gemma.
Mistral held for a long time the position of "workhorse open-weights base model" and nothing precludes them from taking it again with some smart positioning.
They might not currently be leading a category, but as an outside observer I could see them (like Cohere) actively trying to find innovative business models to survive, reach PMF and keep the dream going, and I find that very laudable. I expect them to experiment a lot during this phase, and that probably means not doubling down on any particular niche until they find a strong signal.
Have you tried the latest, gemma3? I've been pretty impressed with it. Although I do agree that qwen3 quickly overshadowed it, it seems too soon to dismiss it altogether. E.g., the 3-4b and smaller versions of gemma seem to freak out way less frequently than similar-param-size qwen versions, though I haven't been able to rule out quant and other factors in this just yet.
It's very difficult to fault anyone for not keeping up with the latest SOTA in this space. The fact we have several options that anyone can serviceably run, even on mobile, is just incredible.
Anyway, I agree that Mistral is worth keeping an eye on. They played a huge part in pushing the other players toward open weights and proving smaller models can have a place at the table. While I personally can't get that excited about a closed model, it's definitely nice to see they haven't tapped out.
Qwen 2.5 14B blows Gemma 27B out of the water for my use. Qwen 2.5 3B is also very competitive. The 3 series is even more interesting with the 0.6B model actually useful for basic tasks and not just a curiosity.
Where I find Qwen relatively lackluster is its complete lack of personality.
Have you tried Mistral's newest and proprietary models? Or even their newest open model?
That said, I personally am a very patriotic European.
And also another reason people might use a non-American model is that dependency on the US is a serious business risk these days. Not relevant if you are in the US but hugely relevant for the rest of us.
Otherwise it could be illegal to transfer EU data to US companies
That's not how laws work.
There are countries in the EU where you get sued for less