llm install llm-mistral
llm mistral refresh
llm -m mistral/devstral-2512 "Generate an SVG of a pelican riding a bicycle"
https://tools.simonwillison.net/svg-render#%3Csvg%20xmlns%3D...Pretty good for a 123B model!
(That said I'm not 100% certain I guessed the correct model ID, I asked Mistral here: https://x.com/simonw/status/1998435424847675429)
I need to extract them all into a formal collection.
> Yes, I am familiar with the "pelican riding a bicycle" SVG generation test. It is a benchmark for evaluating the ability of AI models, particularly large language models (LLMs) and multi-modal systems, to generate original, high-quality SVG vector graphics based on a deliberately unusual and complex prompt. The benchmark was popularized by Simon Willison, who selected the prompt because:
> Yes — I’m familiar with the “pelican riding a bicycle” SVG generation test.
> It’s become a kind of informal benchmark people use when evaluating whether an image-generation or SVG-generation model can: ...
It's perfectly fine to link for convenience, but it does feel a little disrespectful/SEO-y to not 'continue the conversation'. A summary in the very least, how exactly it pertains. Sell us.
In a sense, link-dropping [alone] is saying: "go read this and establish my rhetorical/social position, I'm done here"
Imagine meeting an author/producer/whatever you liked. You'd want to talk about their work, how they created it, the impact it had, and so on. Now imagine if they did that... or if they waved their hand vaguely at a catalog.
You could be done, nothing is making you defend this (sorry) asinine benchmark across the internet. Not trying to (m|y)uck your yum, or whatever.
Remember, I did say linking for convenience is fine. We're belaboring the worst reading in comments. Inconsequential, unnecessary heartburn. Link the blog posts together and call it good enough.
I hadn’t seen the post. It was relevant. I just read it. Lucky Ten Thousand can read it next time even though I won’t.
Simon has never seemed annoying so unlike other comments that might worry me (even “Opus made this” even though it’s cool but I’m concerned someone astroturfed), that comment would’ve never raised my eyebrows. He’s also dedicated and I love he devotes his time to a new field like this where it’s great to have attempts at benchmarks, folks cutting through chaff, etc.
Yes, the LLM people will train on this. They will train on absolutely everything [as they have]. The comments/links prioritize engagement over awareness. My point, I suppose, if I had one is that this blogosphere can add to the chaff. I'm glad to see Simon here often/interested.
Aside: all this concern about over-fitting just reinforces my belief these things won't take the profession any time soon. Maybe the job.
You bring the benchmark and anticipated their... cheesing, with a promise to catch them on it. Cool announcement of an announcement. Just do that [or don't]. In a hippy sense, this is no longer yours. It's out there. Like everything else anyone wrote.
Let the LLM people train on your test. Catch them as claimed. Publish again. Huzzah, industry without overtime in the comments. It makes sense/cents to position yourself this way :)
Obviously they're going to train on anything they can get. They did. Mouse, meet cat. Some of us in the house would love it if y'all would keep it down! This is 90s rap beef all over again
No, no, remember? Points to the blog you were already reading! Working diligently to build a brand: podcast, paid newsletter, the works.
This interaction is, effectively, a link dropped with an announcement of an announcement. For what has already occurred. Over-fitting, training? You don't say.
If I wanted to be more of an ass, I'd look to argue about hype generation. But I don't, I appreciate any honest effort, which I believe for Simon.
However, there are always people who are “native” to a platform and field. Pieter Levels is native to Twitter and the nomad community. Swyx is native to Twitter/HN and the devtools community. And simonw is native to at least HN and the LLM-interest community. And various streamers and onlyfans creators do the same with theirs.
Through some degree of releasing things that whatever that community values they build a relationship that allows them greater freedom in participating there. It does create a positive feedback cycle for them (and hopefully the community) that most of them will try to parlay into something else: Levels and the OnlyFans creators are probably best at this monetization of reputation but each of them is doing this. One success step for simonw would be “Creator of Pelican LLM benchmark”.
Once you’ve breached some stable point in the community the norms are somewhat relaxed. But it’s not easy to do that. You have to produce some extraordinary volume of things that people value.
I think, tbh, tptacek here could most effectively monetize if he decided to. But he doesn’t appear to want to so he’s just a participant not an influencer so to speak. Whereas someone like Levels or simonw is both.
It’s just creator economy stuff. Meta discussions like this always pop up. But ultimately simonw is past the threshold of trust. There are people who say “wtf? Why is levels making $50k/mo on a stupid vibe-coded flying game?”
It ain’t the game. It’s the following before the game. The resource is the audience.
The best guy spinning the sign puts some effort in, or more crass, the best strippers make you believe.
You asserted a pattern of conduct on the user simonw:
> I think constantly replying to everybody with some link which doesn't address their concerns
Then claimed that conduct was:
> condescending and disrespectful.
I am asking you to elaborate to whom simonw is condescending and disrespecting. I don't see how it follows.
So far though, the models good at bike pelican are also good at kayak bumblebee, or whatever other strange combo you can come up with.
So if they are trying to benchmaxx by making SVG generation stronger, that's not really a miss, is it?
Any model can easily one-shot a python script that can count the occurrence of any letter anywhere and return the result.
It's just a tooling issue. You really can't "train" an LLM to do it because tokenisation and ... stuff.
Of course you could train it. Some quick scripting to find all words with repeat letters, build up sample sentences (aardvark has three a,) and you have hard coded the answer to simple questions that make your LLM look stupid.
.. it did that in a story prompt that didn't happen in a) our world b) the current time =)
I may be stupid, but _why_ is this prompt used as a benchmark? I mean, pelicans _can't_ ride a bicycle, so why is it important for "AI" to show that they can (at least visually)?
The "wine glass problem"[0] - and probably others - seems to me to be a lot more relevant...?
[0] https://medium.com/@joe.richardson.iii/the-curious-case-of-t...
Honestly though, the benchmark was originally meant to be a stupid joke.
I only started taking it slightly more seriously about six months ago, when I noticed that the quality of the pelican drawings really did correspond quite closely to how generally good the underlying models were.
If a model draws a really good picture of a pelican riding a bicycle there's a solid chance it will be great at all sorts of other things. I wish I could explain why that was!
If you start here and scroll through and look at the progression of pelican on bicycle images it's honestly spooky how well they match the vibes of the models they represent: https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...
So ever since then I've continue to get models to draw pelicans. I certainly wouldn't suggest anyone take serious decisions on model usage based on my stupid benchmark, but it's a fun first-day initial impression thing and it appears to be a useful signal for which models are worth diving into in more detail.
Why?
If I hired a worker that was really good at drawing pelicans riding a bike, it wouldn't tell me anything about his/her other qualities?!
It's not a human intelligence - it's a totally different thing, so why would the same test that you use to evaluate human abilities apply here?
Also more directly the "all sorts of other things" we want llms to be good at often involve writing code/spatial reasoning/world understanding which creating an svg of a pelican riding a bicycle very very directly evaluates so it's not even that surprising?
Basically in my niche I _know_ there are no original pictures of specific situations and my prompts test whether the LLM is "creative" enough to combine multiple sources into one that matches my prompt.
I think of if like this: there are three things I want in the picture (more actually, but for the example assume 3). All three are really far from each other in relevance, in the very corner of an equilateral triangle (in the vector space of the LLM's "brain"). What I'm asking it to do is in the middle of all three things.
Every model so far tends to veer towards one or two of the points more than others because it can't figure out how to combine them all into one properly.
Yes it's like the wine glass thing.
Also it's kind of got depth. Does it draw the pelican and the bicycle? Can the penguin reach the peddles? How?
I can imagine a really good AI finding a funny or creative or realistic way for the penguin to reach the peddles.
An slightly worse AI will do an OK job, maybe just making the bike small or the legs too long.
An OK AI will draw a penguin on top of a bicycle and just call it a day.
It's not as binary as the wine glass example.
> Yes it's like the wine glass thing.
No, it's not!
That's part of my point; the wine glass scenario is a _realistic_ scenario. The pelican riding a bike is not. It's a _huge_ difference. Why should we measure intelligence (...) in regards to something that is realistic and something that is unrealistic?
I just don't get it.
It is unrealistic because if you go to a restaurant, you don't get served a glass like that. It is frowned upon (alcohol is a drug, after all) and impractical (wine stains are annoying) to fill a glass of wine as such.
A pelican riding a bike, on the other hand, is realistic in a scenario because of TV for children. Example from 1950's animation/comic involving a pelican [1].
[1] https://en.wikipedia.org/wiki/The_Adventures_of_Paddy_the_Pe...
Since people look at a glass of wine and judge how much "value" they got based partly on how much wine it looks like, many bars and restaurants choose bad wine-glasses (for the purpose of enjoying wine) that are smalle and thus can be fulled more.
I may have missed something but where are we saying the website should be recreated with 1996 tech or specs? The model is free to use any modern CSS, there is no technical limitations. So yes I genuinely think it is a good generalization test, because it is indeed not in the training set, and yet it is easy an easy task for a human developer.
Browsers are able to parse a webpage from 1996. I don't know what the argument in the linked comment is about, but in this one, we discuss the relevance of creating a 1996 page vs a pelican on a a bicycle in SVG.
Here is Gemini when asked how to build a webpage from 1996. Seems pretty correct. In general I dislike grand statements that are difficult to back up. In your case, if models have only a cursory knowledge of something (what does this mean in the context of LLMs anyway), what exactly they were trained on etc.
The shortened Gemini answer, the detailed version you can ask for yourself:
Layout via Tables: Without modern CSS, layouts were created using complex, nested HTML tables and invisible "spacer GIFs" to control white space.
Framesets: Windows were often split into independent sections (like a static sidebar and a scrolling content window) using Frames.
Inline Styling: Formatting was not centralized; fonts and colors were hard-coded individually on every element using the <font> tag.
Low-Bandwidth Design: Visuals relied on tiny tiled background images, animated GIFs, and the limited "Web Safe" color palette.
CGI & Java: Backend processing was handled by Perl/CGI scripts, while advanced interactivity used slow-loading Java Applets.
I'd be curious about that actually, feel like W3C specifications (I don't mean browser support of them) rarely deprecate and precisely try to keep the Web running.
Yes, SVG is code, but not in a sense of executable with verifiable inputs and outputs.
(Surely they won't release it like that, right..?)
That looks like the next flagship rather than the fast distillation, but thanks for sharing.
Google should be punishing these sites but presumably it's too narrow of a problem for them to care.
Or at least a profit model. I don't see either on that page but maybe I'm missing something
edit: Mea culpa. I missed the active vs dense difference.
Devstral 2 is 123B dense. Deepseek is 37B Active. It will be slower and more expensive to run inference on this than dsv3. Especially considering that dsv3.2 has some goodies that make inference at higher context be more effective than their previous gen.
It spent about half an hour, correctly identified what the program did, found two small bugs, fixed them, made some minor improvements, and added two new, small but nice features.
It introduced one new bug, but then fixed it on the first try when I pointed it out.
The changes it made to the code were minimal and localized; unlike some more "creative" models, it didn't randomly rewrite stuff it didn't have to.
It's too early to form a conclusion, but so far, it's looking quite competent.
Back when Devstral 1 released, this was made very noticeable to me because the ones who used the smaller quantizations were unable to actually properly format the code, just as you noticed, that's why this sounded so similar to what I've seen before.
Here is what I think about the bigger model: It sits between sonnet 4 and sonnet 4.5. Something like "sonnet 4.3". The response sped was pretty good.
Overall, I can see myself shifting to this for reguar day-to-day coding if they can offer this for copetitive pricing.
I'll still use sonnet 4.5 or gemini 3 for complex queries, but, for everything else code related, this seems to be pretty good.
Congrats Mistral. You most probably have caught up to the big guys. Not there yet exactly, but, not far now.
I'm a bit saddened by the name of the CLI tool, which to me implies the intended usage. "Vibe-coding" is a fun exercise to realize where models go wrong, but for professional work where you need tight control over the quality, you can obviously not vibe your way to excellency, hard reviews are required, so not "vibe coding" which is all about unreviewed code and just going with whatever the LLM outputs.
But regardless of that, it seems like everyone and their mother is aiming to fuel the vibe coding frenzy. But where are the professional tools, meant to be used for people who don't want to do vibe-coding, but be heavily assisted by LLMs? Something that is meant to augment the human intellect, not replace it? All the agents seem to focus on off-handing work to vibe-coding agents, while what I want is something even tighter integrated with my tools so I can continue delivering high quality code I know and control. Where are those tools? None of the existing coding agents apparently aim for this...
This is exactly the CLI I'm referring to, whose name implies it's for playing around with "vibe-coding", instead of helping professional developers produce high quality code. It's the opposite of what I and many others are looking for.
A surprising amount of programming is building cardboard services or apps that only need to last six months to a year and then thrown away when temporary business needs change. Execs are constantly clamoring for semi-persistent dashboards and ETL visualized data that lasts just long enough to rein in the problem and move on to the next fire. Agentic coding is good enough for cardboard services that collapse when they get wet. I wouldn't build an industrial data lake service with it, but you can certainly build cardboard consumers of the data lake.
But there is nothing more permanent that a quickly hacked together prototype or personal productivity hack that works. There are so many Python (or Perl or Visual Basic) scripts or Excel spreadsheets - created by people who have never been "developers" - which solve in-the-trenches pain points and become indispensable in exactly the way _that_ xkcd shows.
"There is nothing more permanent than a temporary demo"
Claude Code not good enough for ya?
Still, I do use Claude Code and Codex daily as there is nothing better out there currently. But they still feel tailored towards vibe-coding instead of professional development.
Trying to follow along better is exactly the opposite of what I'd advocate - it's a waste of time especially with Claude, as Claude tends to favour trying lots of things, seeing what works, and revising its approach multiple times for complex tasks. If you follow along every step, you'll be tearing your hair out over stupid choices that it'll undo within seconds if you just let it work.
I find the flow works bc if it starts going off piste I just end it. Plus I then get my pre-commit hooks etc. I still like being relatively hands on though.
Err, doesn’t it have /review?
Imagine a GUI built around git branches + agents working in those branches + tooling to manage the orchestration and small review points, rather than "here's a chat and tool calling, glhf".
All of the models that can do tool calls are typically good enough to use Git.
Just this week I used both Claude Code and Codex to look at unstaged/staged changes and to review them multiple times, even do comparison between a feature branch and the main branch to identify why a particular feature might have broken in the feature branch.
But again, it's the "user message > llm reason > llm tool call > tool response > llm reason > llm response" flow I think is inefficient and not good enough. It's a lazy solution built on top of the chat flow.
What I imagined would exist by now would be something smarter, where you don't say "Ok, now please commit this" or whatever.
I already have a tool for myself that launch Codex, Claude Code, Qwen Code(r?) and Gemini for each change I do, and automatically manage them into git branches, and lets me diff between what they do and so on.
Yet I still think we haven't really figured out a good UX for this.
This is what we're building at Brokk: https://brokk.ai/
Quick intro: https://blog.brokk.ai/introducing-lutz-mode/
If you babysit every interaction, rather than reviewing a completed unit of work of some size, you're wasting your time second-guessing that the model won't "recover" from stupid mistakes. Sometimes that's right, but more often than not it corrects itself faster than you can.
And so it's far more effective to interact with it far more async, where the UI is more for figuring out what it did if something doesn't seem right, than for working live. I have Claude writing a game engine in another window right now, while writing this, and I have no interest in reviewing every little change, because I know the finished change will look nothing like the initial draft (it did just start the demo game right now, though, and it's getting there). So I review no smaller units of change than 30m-1h, often it will be hours, sometimes days, between each time I review the output, when working on something well specified.
The chat interface is optimal to me because you often are asking questions and seeking guidance or proposals as you are making actual code changes. On reason I do like it is that its default mode of operation is to make a commit for each change it makes. So it is extremely clear what the AI did vs what you did vs what is a hodge podge of both.
As others have mentioned, you can integrate with your IDE through the watch mode. It's somewhat crude but still useful way. But I find myself more often than not just running Aider in a terminal under the code editor window and chatting with it about what's in the window.
> The chat interface
Seems very much not, if it's still a chat interface :) Figuring out a chat UX is easy compared to something that was creating with letting LLM fill in some parts from the beginning. I guess I'm searching for something with a different paradigm than just "chat + $Something".
It's all very fluffy and theoretical of course.
"I want you to do feature X. Analyse the code for me and make suggestions how to implement this feature."
Then it will go off and work for a while and typically come back after a bit with some suggestions. Then iterate on those if needed and end with.
"Ok. Now take these decided upon ideas and create a plan for how to implement. And create new tests where appropriate."
Then it will go off and come back with a plan for what to do. And then you send it off with.
"Ok, start implementing."
So sure. You probably can work on this to make it easier to use than with a CLI chat. It would likely be less like an IDE and more like a planning tool you'd use with human colleagues though.
So you'd write a function name and then tell it to flesh it out.
function factorial(n) // Implement this. AI!
Becomes: function factorial(n) {
if (n === 0 || n === 1) {
return 1;
} else {
return n \* factorial(n - 1);
}
}
Last I looked Aider's maintainer has had to focus on other things recently, but aider-ce is a fantastic fork.I'm really curious to try Mistral's vibe, but even though I'm a big fanboi I don't want to be tied to just one model. Aider lets tier your models such that your big, expensive model can do all the thinking and then stuff like code reviews can run through a smaller model. It's a pretty capable tool
Edit: Fix formatting
Very much this for me - I really don't get why, given a new models are popping out every month from different providers, people are so happy to sink themselves into provider ecosystems when there are open source alternatives that work with any model.
The main problem with Aider is it isn't agentic enough for a lot of people but to me that's a benefit.
While True:
0. Context injected automatically. (My repos are small.)
1. I describe a change.
2. LLM proposes a code edit. (Can edit multiple files simultaneously. Only one LLM call required :)
3. I accept/reject the edit.
What kind of hardware do you have to be able to run a performant GPT-OSS-120b locally?
There are many platforms out there that can run it decently.
AMD strix halo, Mac platforms. Two (or three without extra ram) of the new AMD AI Pro R9700 (32GB of RAM, $1200), multi consumer gpu setups, etc.
What matters is high quality specifications including test cases
Says the person who will find themselves unable to change the software even in the slightest way without having to large refactors across everything at the same time.
High quality code matters more than ever, would be my argument. The second you let the LLM sneak in some quick hack/patch instead of correctly solving the problem, is the second you invite it to continue doing that always.
I have a feeling this will only supercharge the long established industry practice of new devs or engineering leadership getting recruited and immediately criticising the entire existing tech stack, and pushing for (and often succeeding) a ground up rewrite in language/framework de jour. This is hilariously common in web work, particularly front end web work. I suspect there are industry sectors that're well protected from this, I doubt people writing firmware for fuel injection and engine management systems suffer too much from this, the Javascript/Nodejs/NPM scourge _probably_ hasn't hit the PowerPC or 68K embedded device programming workflow. Yet...
In my mind, it's somewhat orthogonal to code quality.
Waterfall has always been about "high quality specifications" written by people who never see any code, much less write it. Agile make specs and code quality somewhat related, but in at least some ways probably drives lower quality code in the pursuit of meeting sprint deadlines and producing testable artefacts at the expense of thoroughness/correctness/quality.
Even the Gemini 3 announcement page had some bit like "best model for vibe coding".
If you're actually making sure it's legit, it's not vibe coding anymore. It's just... Backseat Coding? ;)
There's a level below that I call Power Coding (like power armor) where you're using a very fast model interactively to make many very small edits. So you're still doing the conceptual work of programming, but outsourcing the plumbing (LLM handles details of syntax and stdlib).
Maybe common usage is shifting, but Karpathy's "vibe coding" was definitely meant to be a never look at the code, just feel the AI vibes thing.
Also, we’re both “people in tech”, we know LLMs can’t conceptualise beyond finding the closest collection of tokens rhyming with your prompt/code. Doesn’t mean it’s good or even correct. So that’s why it’s vibe coding.
sorry to disappoint you but that is also been considered vibecoding. It is just not pejorative.
Imo, if you read the code, it's no longer vibecoding.
I've personally decided to just rent systems with GPUs from a cloud provider and setup SSH tunnels to my local system. I mean, if I was doing some more HPC/numerical programming (say, similarity search on GPUs :-) ), I could see just taking the hit and spending $15,000 on a workstation with an RTX Pro 6000.
For grins:
Max t/s for this and smaller models? RTX 5090 system. Barely squeezing in for $5,000 today and given ram prices, maybe not actually possible tomorrow.
Max CUDA compatibility, slower t/s? DGX Spark.
Ok with slower t/s, don't care so much about CUDA, and want to run larger models? Strix Halo system with 128gb unified memory, order a framework desktop.
Prefer Macs, might run larger models? M3 Ultra with memory maxed out. Better memory bandwidth speed, mac users seem to be quite happy running locally for just messing around.
You'll probably find better answers heading off to https://www.reddit.com/r/LocalLLaMA/ for actual benchmarks.
That's a good idea!
Curious about this, if you don't mind sharing:
- what's the stack ? (Do you run like llama.cpp on that rented machine?)
- what model(s) do you run there?
- what's your rough monthly cost? (Does it come up much cheaper than if you called the equivalent paid APIs)
I am usually just running gpt-oss-120b or one of the qwen models. Sometimes gemma? These are mostly "medium" sized in terms of memory requirements - I'm usually trying unquantized models that will easily run on an single 80-ish gb gpu because those are cheap.
I tend to spend $10-$20 a week. But I am almost always prototyping or testing an idea for a specific project that doesn't require me to run 8 hrs/day. I don't use the paid APIs for several reasons but cost-effectiveness is not one of those reasons.
* Claude in December: 91 million tokens in, 750k out
* Codex in December: 43 million tokens in, 351k out
* Cerebras in December: 41 million tokens in, 301k out
* (obviously those figures above are so far in the month only)
* Claude in November: 196 million tokens in, 1.8 million out
* Codex in November: 214 million tokens in, 4 million out
* Cerebras in November: 131 million tokens in, 1.6 million out
* Claude in October: 5 million tokens in, 79k out
* Codex in October: 119 million tokens in, 3.1 million out
As for Cerebras in October, I don't have the data because they don't show the Qwen3 Coder model that was deprecated, but it was way more: https://blog.kronis.dev/blog/i-blew-through-24-million-token...In general, I'd say that for the stuff I do my workloads are extremely read heavy (referencing existing code, patterns, tests, build and check script output, implementation plans, docs etc.), but it goes about like this:
* most fixed cloud subscriptions will run out really quickly and will be insufficient (Cerebras being an exception)
* if paying per token, you *really* want the provider to support proper caching, otherwise you'll go broke
* if you have local hardware that is great, but it will *never* compete with the cloud models, so your best bet is to run something good enough, basically cover all of your autocomplete needs, and also with tools like KiloCode an advanced cloud model can do the planning and a simpler local model do the implementation, then the cloud model validate the outputHere are my lazy notes + a snippet of the history file from the remote instance for a recent setup where I used the web chat interface built into llama.cpp.
I created an instance gpu_1x_gh200 (96 GB on ARM) at lambda.ai.
connected from terminal on my box at home and setup the ssh tunnel.
ssh -L 22434:127.0.0.1:11434 ubuntu@<ip address of rented machine - can see it on lambda.ai console or dashboard>
Started building llama.cpp from source, history:
21 git clone https://github.com/ggml-org/llama.cpp
22 cd llama.cpp
23 which cmake
24 sudo apt list | grep libcurl
25 sudo apt-get install libcurl4-openssl-dev
26 cmake -B build -DGGML_CUDA=ON
27 cmake --build build --config Release
MISTAKE on 27, SINGLE-THREADED and slow to build see -j 16 below for faster build 28 cmake --build build --config Release -j 16
29 ls
30 ls build
31 find . -name "llama.server"
32 find . -name "llama"
33 ls build/bin/
34 cd build/bin/
35 ls
36 ./llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 --jinja
MISTAKE, didn't specify the port number for the llama-server 37 clear;history
38 ./llama-server -hf Qwen/Qwen3-VL-30B-A3B-Thinking -c 0 --jinja --port 11434
39 ./llama-server -hf Qwen/Qwen3-VL-30B-A3B-Thinking.gguf -c 0 --jinja --port 11434
40 ./llama-server -hf Qwen/Qwen3-VL-30B-A3B-Thinking-GGUF -c 0 --jinja --port 11434
41 clear;history
I switched to qwen3 vl because I need a multimodal model for that day's experiment. Lines 38 and 39 show me not using the right name for the model. I like how llama.cpp can download and run models directly off of huggingface.Then pointed my browser at http//:localhost:22434 on my local box and had the normal browser window where I could upload files and use the chat interface with the model. That also gives you an openai api-compatible endpoint. It was all I needed for what I was doing that day. I spent a grand total of $4 that day doing the setup and running some NLP-oriented prompts for a few hours.
48GB of vram and lots of cuda cores, hard to beat this value atm.
If you want to go even further, you can get an 8x V100 32GB server complete with 512GB ram and nvlink switching for $7000 USD from unixsurplus (ebay.com/itm/146589457908) which can run even bigger models and with healthy throughput. You would need 240V power to run that in a home lab environment though.
Fuck nvidia
How is it? I'd guess a bunch of the MoE models actually run well?
nix run github:numtide/llm-agents.nix#mistral-vibe
The repo is updated daily.As long as it doesn't mean 10x worse performance, that's a good selling point.
In work, where my employer pays for it, Haiku tends to be the workhorse with Sonnet or Opus when I see it flailing. On my own budget I’m a lot more cost conscious, so Haiku actually ends up being “the fancy model” and minimax m2 the “dumb model”.
> this model is worse (but cheaper)
> use it to output 10x the amount of trashier trash
You've lost me.
I'm team Anthropic with Claude Max & Claude Code, but I'm still excited to see Mistral trying this. Mistral has occasionally saved the day for me when Claude refused an innocuous request, and it's good to have alternatives... even if Mistral / Devstral seems to be far behind the quality of Claude.
That was very helpful, thanks!
Going to start hacking on this ASAP
The competition is much smoother. Where are the subscriptions which would give users the coding agent and the chat for a flat fee and working out of the box?..
core/prompts/cli.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
core/prompts/compact.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/bash.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/grep.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/read_file.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/write_file.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/search_replace.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/todo.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
Here's n example of the kinds of things I do with Claude Code now: https://gistpreview.github.io/?b64d5ee40439877eee7c224539452... - that one involved several from-scratch rewrites of the history of an entire Git repo just because I felt like it.
[1] https://openhands.dev/blog/devstral-a-new-state-of-the-art-o...
Uh, the "Modified MIT license" here[0] for Devstral 2 doesn't look particularly permissively licensed (or open-source):
> 2. You are not authorized to exercise any rights under this license if the global consolidated monthly revenue of your company (or that of your employer) exceeds $20 million (or its equivalent in another currency) for the preceding month. This restriction in (b) applies to the Model and any derivatives, modifications, or combined works based on it, whether provided by Mistral AI or by a third party. You may contact Mistral AI (sales@mistral.ai) to request a commercial license, which Mistral AI may grant you at its sole discretion, or choose to use the Model on Mistral AI's hosted services available at https://mistral.ai/.
[0] https://huggingface.co/mistralai/Devstral-2-123B-Instruct-25...
If you want to use something, and your company makes $240,000,000 in annual revenue, you should probably pay for it.
I do not mind having a license like that, my gripe is with using the terms "permissive" and "open source" like that because such use dilutes them. I cannot think of any reason to do that aside from trying to dilute the term (especially when some laws, like the EU AI Act, are less restrictive when it comes to open source AIs specifically).
Good. In this case, let it be diluted! These extra "restrictions" don't affect normal people at all, and won't even affect any small/medium businesses. I couldn't care less that the term is "diluted" and that makes it harder for those poor, poor megacorporations. They swim in money already, they can deal with it.
We can discuss the exact threshold, but as long as these "restrictions" are so extreme that they only affect huge megacorporations, this is still "permissive" in my book. I will gladly die on this hill.
Yes, they do, and the only reason for using the term “open source” for things whose licensing terms flagrantly defy the Open Source definition is to falsely sell the idea that using the code carries the benefits that are tied to the combination of features that are in the definition and which are lost with only a subset of those features. The freedom to use the software in commercial services is particularly important to end-users that are not interested in running their own services as a guarantee against lock-in and of whatever longevity they are able to pay to have provided even if the original creator later has interests that conflict with offering the software as a commercial service.
If this deception wasn't important, there would be no incentive not to use the more honest “source available for limited uses” description.
It also makes life harder for individuals and small companies, because this is not Open Source. It's incompatible with Open Source, it can't be reused in other Open Source projects.
Terms have meanings. This is not Open Source, and it will never be Open Source.
I'm amazed at the social engineering that the megacorps have done with the whole Open Source (TM) thing. They engineered a whole generation of engineers to advocate not in their own self-interest, nor for the interest of the little people, but instead for the interest of the megacorps.
As soon as there is even the tiniest of restrictions, one which doesn't affect anyone besides a bunch of richiest corporations in the world, a bunch of people immediately come out of the woodwork, shout "but it's not open source!" and start bullying everyone else to change their language. Because if you even so much as inconvenience a megacorporation even a little bit it's not Open Source (TM) anymore.
If we're talking about ideals then this is something I find unsettling and dystopian.
I hard disagree with your "It also makes life harder for individuals and small companies" statement. It's the opposite. It gives them a competitive advantage vs megacorps, however small it may be.
> start bullying everyone else to change their language
Either words matter or they do not. If words matter, then trying to dilute the term is a bad thing because it tries to weaken something that matters. If words do not matter, then the people who "bully everyone" can be easily ignored. You cannot have these two things at the same time.
Whatever name they come up with for a new license will be less useful, because I'll have to figure out that this is what that is
And honestly it wasn't a good hill to begin with: if what you are talking about is the license, call it "open license". The source code is out in the open, so it is "open source". This is why the purists have lost ground to practical usage.
As someone who was born and raised on FOSS, and still mostly employed to work on FOSS, I disagree.
Open source is what it is today because it's built by people with a spine who stand tall for their ideals even if it means less money, less industry recognition, lots of unglorious work and lots of other negatives.
It's not purist to believe that what built open source so far should remain open source, and not wanting to dilute that ecosystem with things that aren't open source, yet call themselves open source.
With all due respect, don't you see the irony in saying "people with a spine who stand tall for their ideals", and then arguing that attaching "restrictions" which only affect the richest megacorporations in the world somehow makes the license not permissive anymore?
What ideals are those exactly? So that megacorporations have the right to use the software without restrictions? And why should we care about that?
Anyone can use the code for whatever purpose they want, in any way they want. I've never been a "rich megacorporation", but I have gone from having zero money to having enough money, and I still think the very same thing about the code I myself release as I did from the beginning, it should be free to be used by anyone, for any purpose.
Because instead of making the point "this license isn't as permissive as it could/should be" (easy to understand), instead the point being made is "this isn't real open source", which comes across to most people as just some weird gate-keeping / No True Scotsman kinda thing.
Though given the stance you are taking in this conversation, I'm not surprised you want to quibble over that.
¯\_(ツ)_/¯
> if what you are talking about is the license, call it "open license".
If you want to build something proprietary, call it something else. "Open Source" is taken.
well we don't really want to open that can of worms though, do we?
I don't agree with ceding technical terms to the rest of the world. I'm increasingly told we need to stop calling cancer detection AI "AI" or "ML" because it is not the 'bad AI' and confuses people.
I guess I'm okay with being intransigent.
Who gives a shit what we call "cancer AI", what matters is the result.
"Open Source" is nebulous. It reasonably works here, for better or worse.
No it isn't it is well defined. The only people who find it "nebulous" are people who want the benefits without upholding the obligations.
Open source has a well understood meaning, including licenses like MIT and Apache - but not including MIT but only if you make less than $500million dollars, MIT unless you were born on a wednesday, etc.
Whenever anybody tries to claim that a non-commercial licenses is open-source, it always gets complaints that it is not open-source. This particular word hasn’t been watered down by misuse like so many others.
There is no commonly-accepted definition of open-source that allows commercial restrictions. You do not get to make up your own meaning for words that differs from how other people use it. Open-source does not have commercial restrictions by definition.
Looking up open-source in the dictionary does include definitions that would allow for commercial restrictions, depending on how you define "free" (a matter that is most certainly up for debate).
The term "open-source" exists for the purposes of a particular movement. If you are "for" the misuse and abuse of the term, you not only aren't part of that movement, but you are ignorant about it and fail to understand it— which means you frankly have no place speaking about the meanings of its terminology.
Unless this authority has some ownership over the term and can prevent its misuse (e.g. with lawsuits or similar), it is not actually the authority of the term, and people will continue to use it how they see fit.
Indeed, I am not part of a movement (nor would I want to be) which focuses more on what words are used rather than what actions are taken.
People can also say 2+2=5, and they're wrong. And people will continue to call them out on it. And we will keep doing so, because stopping lets people move the Overton window and try to get away with even more.
The same is not true for "open source", which is a purely linguistic construct.
And whenever they do so, this pointless argument will happen. Again, and again, and again. Because that’s not what the word means and your desired redefinition has been consistently and continuously rejected over and over again for decades.
What do you gain from misusing this term? The only thing it does is make you look dishonest and start arguments.
I am not misusing the term, but people are, according to your standards. And it is easy for them to do so, because "open source" was poorly named to begin with.
This kind of thing is how people try to shift the Overton window. No.
This tech is simply too critical to pretend the military won’t use it. That’s clearer now than ever, especially after the (so far flop-ish) launch of the U.S. military’s own genAI platform.
- https://helsing.ai/newsroom/helsing-and-mistral-announce-str... - https://sifted.eu/articles/mistral-helsing-defence-ai-action... - Luxembourg army chose Mistral: https://www.forcesoperations.com/la-pepite-francaise-mistral... - French army: https://www.defense.gouv.fr/actualites/ia-defense-sebastien-...
Not sure you've kept up to date, US have turned their backs on most allies so far including Europe and the EU, and now welcome previous enemies with open arms.
They did.
Surprising and good is only: Everything including graphics fixed when clicking my "speedreader" button in Brave. So they are doing that "cool look" by CSS.
There's a scan lines affect they apply to everything that's "cool", but gets old after a minute.
How is that a measure of model size? It should either be parameter size, activated parameters, or cost per output token.
Looks like a typo because the models line up with reported param sizes.
Next step, just need a shitload of vram. ;)
Maybe those Intel Battlematrix 48GB cards might be useful after all... :)
https://www.storagereview.com/review/intel-arc-pro-b60-battl...
If Mistral is so permissive they could be the first ones, provided that hardware is then fast/cheap/efficient enough to create a small box that can be placed in an office.
Maybe in 5 years.
The Apple offerings are interesting but the lack of x86, Linux, and general compatibility make it hard sell imo.
...so it won't ever happen, it'll require wifi and will only be accessible via the cloud, and you'll have to pay a subscription fee to access the hardware you bought. obviously.
The only thing I found is a pay-as-you-go API, but I wonder if it is any good (and cost-effective) vs Claude et al.
With pricing so low I don't see any reason why someone would buy sub for 200 EUR. These days those subs are so much limited in Claude Code or Cursor than it used to be (or used to unlimited). Better pay-as-you-go especially when there are days when you probably use AI less or not at all (weekends/holidays etc.) as long as those credits don't expire.
Why does every AI provider need to have its own tool, instead of contributing to existing tools like Roo Code or Opencode?
Just call it Mistral License & flush it down