I have my own fork here: https://github.com/HorizonXP/voxtral.c where I’m working on a CUDA implementation, plus some other niceties. It’s working quite well so far, but I haven’t got it to match Mistral AI’s API endpoint speed just yet.
There's a distinction between tokens per second and time to first token.
Delays come for me when I have to load a new model, or if I'm swapping in a particularly large context.
Most of the time, since the model is already loaded, and I'm starting with a small context that builds over time, tokens per second is the biggest impactor.
It's worth noting I don't do much fancy stuff, a tiny bit of agent stuff, I mainly use qwen-coder 30a3b or qwen2.5 code instruct/base 7b.
I'm finding more complex agent stuff where multiple agents are used can really slow things down if they're swapping large contexts. ik_llama has prompt caching which help speed this up when swapping between agent contexts up until a point.
tldr: loading weights each time isn't much of a problem, unless you're having to switch between models and contexts a lot, which modern agent stuff is starting to.
The first cut will probably not be a streaming implementation
will report a GitHub issue
That said, I now agree with your original statement and really want Voxtral support...
Is it possible to rig this up so it really is realtime, displaying the transcription within a second or two of the user saying something out loud?
The Hugging Face server-side demo at https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim... manages that, but it's using a much larger (~8.5GB) server-side model running on GPUs.
This isn't even close to realtime on M4 Max. Whisper's ~realtime on any device post-2022 with an ONNX implementation. The extra inference cost isn't worth the WER decrease on consumer hardware, or at least, wouldn't be worth the time implementing.
panorama panorama panorama panorama panorama panorama panorama panorama� molest rist moundothe exh� Invothe molest Yan artist��������� Yan Yan Yan Yan Yanothe Yan Yan Yan Yan Yan Yan Yan
or... not talking anything generate random German sentences.
Anything I can do to fix/try it on Brave?