In the end Omarchy's new support for voxtype.io provided the nicest UX, followed by Whisper.cpp, and despite being slower, OpenAI's Whisper is still a solid local transcription option.
Also very impressed with both the performance and price of Mistral's new Voxtral Transcription API [2] - near-instant and really cheap ($0.003/min); IMO it's the best option in CPU/disk-constrained environments.
[1] https://llmspy.org/docs/features/voice-input
[2] https://docs.mistral.ai/models/voxtral-mini-transcribe-26-02
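For anyone curious, a call is roughly this from Python. The endpoint path, model id, and response field below are my assumptions based on the usual OpenAI-style shape; check [2] for the exact values.

# Minimal sketch of a Voxtral transcription call (endpoint/model assumed).
import os
import requests

with open("/path/to/audio.wav", "rb") as f:
    resp = requests.post(
        "https://api.mistral.ai/v1/audio/transcriptions",  # assumed path
        headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
        files={"file": f},
        data={"model": "voxtral-mini-transcribe"},  # assumed model id, see [2]
    )
resp.raise_for_status()
print(resp.json()["text"])  # assumed response field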
(I wasn't able to find anything at a glance)
Handy claims to have an overlay, but it doesn't seem to work on my system.
Although as llms-py is a local web app, I had to build my own visual indicator [2], which also displays a red microphone next to the prompt while it's recording. It supports both tap on/off and hold-to-record modes. When using voxtype I'm only using the tool for transcription (i.e. not Omarchy's OS-wide dictation feature), like:
$ voxtype transcribe /path/to/audio.wav
If you're interested, the Python source code supporting multiple voice transcription backends is at [3]; a simplified sketch of the idea is below the links.
[1] https://learn.omacom.io/2/the-omarchy-manual/107/ai
[2] https://llmspy.org/docs/features/voice-input
[3] https://github.com/ServiceStack/llms/blob/main/llms/extensio...
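To give a feel for the approach without reading the full source, here's a hypothetical sketch of the backend-selection idea; the names and fallback order are illustrative, not the actual llms-py code at [3].

# Illustrative only: route a wav to whichever backend is available.
import shutil
import subprocess

def transcribe(wav_path: str, backend: str = "auto") -> str:
    if backend == "auto":
        backend = "voxtype" if shutil.which("voxtype") else "whisper"
    if backend == "voxtype":
        # Assumes the voxtype CLI prints the transcript to stdout.
        out = subprocess.run(["voxtype", "transcribe", wav_path],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()
    if backend == "whisper":
        import whisper  # openai-whisper, loaded lazily as a fallback
        return whisper.load_model("base").transcribe(wav_path)["text"]
    raise ValueError(f"unknown backend: {backend}")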
--from-mic only supports macOS. I'm able to capture audio with ffmpeg, but adapting the ffmpeg example to capture from the mic hasn't worked yet:
ffmpeg -f pulse -channels 1 -i 1 -f s16le - 2>/dev/null | ./voxtral -d voxtral-model --stdin
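One guess I still want to try: pulse captures at 48 kHz by default, while Whisper-family models generally expect 16 kHz mono, so forcing the sample rate and channel count on the output side might help:

ffmpeg -f pulse -i 1 -ac 1 -ar 16000 -f s16le - 2>/dev/null | ./voxtral -d voxtral-model --stdin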
It's possible my system is simply under spec for the default model.
I'd like to be able to use this with the voxtral-q4.gguf quantized model from here: https://huggingface.co/TrevorJS/voxtral-mini-realtime-gguf
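If -d also accepts a single file path (I haven't confirmed that it does), swapping in the quantized weights would presumably just be:

./voxtral -d /path/to/voxtral-q4.gguf --stdin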
I can, for example, capture audio from the same source with Audacity or OBS Studio and transcribe it later, so real-time should be possible too, assuming my machine can keep up.
(But take that with a grain of salt; I haven't tried it yet)
Given that it took 19.64 minutes to transcribe the 11-second sample wav, it's possible I just didn't wait long enough :)
Cool project!
Any ideas from the HN crowd currently involved in speech-to-text models?