25t/s prompt processing
63t/s token generation
Overall processing time per image is ~15 secs, no matter what size the image is. The small 4B model already has very decent output, describing different images pretty well.

Steps to reproduce:
git clone https://github.com/ggml-org/llama.cpp.git
cmake -B build
cmake --build build --config Release -j 12 --clean-first
# download model and mmproj files...
build/bin/llama-server \
--model gemma-3-4b-it-Q4_K_M.gguf \
--mmproj mmproj-model-f16.gguf
Then open http://127.0.0.1:8080/ for the web interface.

Note: if you are not using -hf, you must include the --mmproj switch, otherwise the web interface gives an error message that multimodal is not supported by the model.
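You can also script against the server instead of using the web UI. A minimal sketch, assuming your llama-server build exposes the OpenAI-compatible /v1/chat/completions endpoint and accepts base64 data-URL images there (photo.jpg is a placeholder):

# encode a local image and ask the model to describe it
IMG_B64=$(base64 < photo.jpg | tr -d '\n')   # tr strips GNU base64 line wraps
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe this image." },
        { "type": "image_url",
          "image_url": { "url": "data:image/jpeg;base64,${IMG_B64}" } }
      ]
    }
  ]
}
EOF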
I have used the official ggml-org/gemma-3-4b-it-GGUF quants, I expect the unsloth quants from danielhanchen to be a bit faster.
> This image shows a diverse group of people in various poses, including a man wearing a hat, a woman in a wheelchair, a child with a large head, a man in a suit, and a woman in a hat.
No, none of these things are in the images.
I don't even know how to begin debugging that.
Not sure why it's not working
So instead of saying "I can't help you with this picture", the thing hallucinates something.
That is the expected behavior by now. Not hard to imagine at all.
https://github.com/ggml-org/llama.cpp/discussions/4167
I wonder if it's the encoder that isn't optimized?
Since you are a photographer, I used a picture from your website; Gemma 4B produces the following:
"A stylish woman stands in the shade of a rustic wooden structure, overlooking a landscape of rolling hills and distant mountains. She is wearing a flowing, patterned maxi dress with a knotted waist and strappy sandals. The overall aesthetic is warm, summery, and evokes a sense of relaxed elegance."
This description is pretty spot on.
The picture I used is from the series L'Officiel.02 (L-officel_lanz_08_1369.jpg) from zamadatix' website.
That said, I'm not as impressed by the description. The structure has some wood but it's certainly not just wooden; there are distant mountains but not much in the way of rolling hills to speak of. The dress is flowing but the waist is not knotted - the more striking note might have been the sleeves.
For 4 GB of model I'm not going to ding it too badly though. The question about which quant was mainly about the tokens/second angle (Q4 needs about 1/4 of the memory bandwidth the full model would) rather than the quality angle. As a note: a larger multimodal model gets all of these points right (e.g. "wooden and stone rustic structure"), so they aren't just things I noted myself.
(source: vision was already available in llama.cpp, but it was Very Hard to use - I've been maintaining an implementation)
(n.b. it's great work, extremely welcome, and new in that the vision code badly needed a rebase and refactoring after a year or two of each model adding in more stuff)
(also, would you mind sharing a code pointer if you have any handy? I found this https://github.com/ggml-org/llama.cpp/blob/master/tools/mtmd... but not sure if that's the codepath taken)
You'll have to compile llama.cpp from source, and you should get a llama-mtmd-cli program.
I made some quants with vision support - literally run:
./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl -1
./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-12b-it-GGUF:Q4_K_XL -ngl -1
./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-27b-it-GGUF:Q4_K_XL -ngl -1
./llama.cpp/llama-mtmd-cli -hf unsloth/Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_XL -ngl -1
Then load the image with /image image.png inside the chat, and chat away!
EDIT: -ngl -1 is not needed anymore for the Metal backend (for CUDA it still is) - llama.cpp will now auto-offload to the GPU by default. -1 means all layers are offloaded to the GPU.
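If your build of llama-mtmd-cli also accepts the one-shot flags the old llava-cli had (--image and -p; check --help, I'm assuming they carried over), you can skip the interactive chat and script a single description:

./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL \
  --image photo.jpg -p "Describe this image in one paragraph."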
using tailscale for the internal network works really well
This is perfect for a real-time home video surveillance system. That's one of the ideas for my next hobby project!
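A rough sketch of that idea, assuming an RTSP camera, ffmpeg, jq, and a llama-server already running with a vision model on port 8080 (camera URL, interval, and prompt are placeholders):

# grab one frame every 30s and ask the model what it sees
CAM_URL="rtsp://192.168.1.50:554/stream1"
while true; do
  ffmpeg -loglevel error -rtsp_transport tcp -i "$CAM_URL" -frames:v 1 -y frame.jpg
  B64=$(base64 < frame.jpg | tr -d '\n')
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"messages\":[{\"role\":\"user\",\"content\":[
          {\"type\":\"text\",\"text\":\"Is anyone in this frame? Answer yes or no, then describe what you see.\"},
          {\"type\":\"image_url\",\"image_url\":{\"url\":\"data:image/jpeg;base64,${B64}\"}}]}]}" \
    | jq -r '.choices[0].message.content'
  sleep 30
done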
llama-server -hf ggml-org/SmolVLM-Instruct-GGUF
llama-server -hf ggml-org/SmolVLM-256M-Instruct-GGUF
llama-server -hf ggml-org/SmolVLM-500M-Instruct-GGUF
llama-server -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF
llama-server -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF
llama-server -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF
Similar to how we ended up with the huggingface/tokenizers library for text-only Transformers.
Very nice for something that's self hosted.
If you want to see, here it is:
https://gist.github.com/Q726kbXuN/f300149131c008798411aa3246...
It's wrapped up in a bunch of POC code around talking to LLMs, so it's very very messy, but it does work. It will probably even work for someone who's not me.
I'm sure there's a context limit if you have enough images, where you'd need to start map-reducing things, but even that wouldn't be too hard.
Here's an example of the kind of detail it built up for me for one image:
https://q726kbxun.github.io/llama_cpp_vision/index.html
It's not perfect, by any means, but between the keywords and description text, it's good enough for me to be able to find images in a larger collection.
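Not the author's gist, but a minimal sketch of the same idea against a running llama-server (folder, prompt, and search keyword are placeholders): write one sidecar text file per image, then grep those files to find pictures later.

for img in ~/Pictures/*.jpg; do
  b64=$(base64 < "$img" | tr -d '\n')
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"messages\":[{\"role\":\"user\",\"content\":[
          {\"type\":\"text\",\"text\":\"Give 10 keywords and a short description of this image.\"},
          {\"type\":\"image_url\",\"image_url\":{\"url\":\"data:image/jpeg;base64,${b64}\"}}]}]}" \
    | jq -r '.choices[0].message.content' > "${img%.jpg}.txt"
done
grep -il "bicycle" ~/Pictures/*.txt   # list the sidecar files (and thus images) that match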
On macOS I downloaded the llama-b5332-bin-macos-arm64.zip file and then had to run this to get it to work:
unzip llama-b5332-bin-macos-arm64.zip
cd build/bin
sudo xattr -rd com.apple.quarantine llama-server llama-mtmd-cli *.dylib
Then I could run the interactive terminal (with a 3.2GB model download) like this (borrowing from https://news.ycombinator.com/item?id=43943370):
./llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99
Or start the localhost 8080 web server (with a UI and API) like this:
./llama-server -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99
I wrote up some more detailed notes here: https://simonwillison.net/2025/May/10/llama-cpp-vision/

Btw, the brew version will be updated in the next few hours, so after that you will be able to simply "brew upgrade llama.cpp" and you will be good to go!
Llama-server allowing vision support is definitely super cool - was waiting for it for a while!
Edit: sorry this is only true on Metal. For CUDA or other GPU backends, you still need to manually specify -ngl
I have no idea how to specify custom layer specs with multi GPU, but that is interesting!
(See the code inside llama_model_default_params())
Any benefit on a Mac with apple silicon? Any experiences someone could share?
1. Because the support in llama.cpp is horizontally integrated within the ggml ecosystem, we can optimize it to run even faster than ollama.
For example, the pixtral/mistral small 3.1 models use a 2D-RoPE trick that takes less memory than ollama's implementation. Same for flash attention (which will be added very soon): it will allow the vision encoder to run faster while using less memory.
2. llama.cpp simply supports more models than ollama. For example, ollama supports neither pixtral nor smolvlm.
Use case: I am working on a hobby project that uses TS/React as the frontend. I can use local or cloud LLMs in VSCode, but even those with vision require that I take a screenshot and paste it into a chat. Ideally, I would want it all automated until some stop criterion is met (even if only n iterations). But even an extension that would screenshot a preview and paste it into the chat (triggered by a keyboard shortcut) would be a big time-saver.
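Not a VSCode extension, but a rough sketch of just the screenshot-to-model step, assuming macOS (screencapture ships with the OS; on Linux something like ImageMagick's import would replace it), jq, and a llama-server running with a vision model; bind it to a keyboard shortcut:

# interactively select a region of the screen and ask the model about it
screencapture -i /tmp/shot.png
B64=$(base64 < /tmp/shot.png | tr -d '\n')
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"messages\":[{\"role\":\"user\",\"content\":[
        {\"type\":\"text\",\"text\":\"Does this rendered page look correct? List any visual problems.\"},
        {\"type\":\"image_url\",\"image_url\":{\"url\":\"data:image/png;base64,${B64}\"}}]}]}" \
  | jq -r '.choices[0].message.content'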
I think that if we're realistic with ourselves, AI will become exponentially more expensive to train, but without additional high quality data (not you, synthetic data), we're back to 1980s era AI (expert systems), just with enhanced fossil fuel usage to keep up with the TPUs. What's old is new again, I suppose!
I sincerely hope to be proven wrong, of course, but I think recent AI innovation has stagnated in terms of new things it can do. It's a great tool, when you use it to leverage that distribution (eg, semantic search), but it might not fundamentally be the approach to AGI (unless your goal is to replicate what we can, but less spikey)
In other words, the way forward seems to be to put models in loops. Which includes internal 'thinking' and external feedback. Make them use generated and newly acquired data. Lossy-compress the data periodically. And we have another race of algorithms.
This was the premise of symbolic AI, but this approach seems to have been abandoned now.
They’re still doing text and math tests on every new model because it’s so bad
just trying to understand, awesome work so far.
Which part of a PDF file can you use LLMs for? PDF is a binary format...
PDF isn't really a binary format: it starts with a text header, the structure is mostly text-based objects, and you can parse many PDFs as plain text. They tend to contain embedded binary data though, which is the specific part these vision models can help you with, assuming they're images. The rest a "normal" LLM can parse just fine.
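A quick way to see both halves of that, assuming poppler-utils is installed (pdftotext, pdfimages) and doc.pdf is a placeholder; the extracted images are what you'd hand to the vision model:

head -c 64 doc.pdf           # starts with a plain-text "%PDF-1.x" header
pdftotext doc.pdf doc.txt    # the text objects come out as ordinary text
pdfimages -j doc.pdf page    # embedded images land as page-000.jpg / page-NNN.ppm
# then feed page-*.jpg to llama-mtmd-cli or llama-server for the parts plain text misses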