In the end Omarchy's new support for voxtype.io provided the nicest UX, followed by Whisper.cpp, and despite being slower, OpenAI's Whisper is still a solid local transcription option.
Also very impressed with both the performance and price of Mistral's new Voxtral Transcription API [2] - near-instant and really cheap ($0.003/min); IMO it's the best option in CPU/disk-constrained environments.
[1] https://llmspy.org/docs/features/voice-input
[2] https://docs.mistral.ai/models/voxtral-mini-transcribe-26-02
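For anyone curious, a call is roughly this from Python. The endpoint path, model id, and response field below are my assumptions based on the usual OpenAI-style shape; check [2] for the exact values.

# Minimal sketch of a Voxtral transcription call (endpoint/model assumed).
import os
import requests

with open("/path/to/audio.wav", "rb") as f:
    resp = requests.post(
        "https://api.mistral.ai/v1/audio/transcriptions",  # assumed path
        headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
        files={"file": f},
        data={"model": "voxtral-mini-transcribe"},  # assumed model id, see [2]
    )
resp.raise_for_status()
print(resp.json()["text"])  # assumed response field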
(I wasn't able to find anything at a glance)
Handy claims to have an overlay, but it doesn't seem to work on my system.
Although as llms-py is a local web app, I had to build my own visual indicator [2], which also displays a red microphone next to the prompt while it's recording. It supports both tap on/off and hold-to-record modes. When using voxtype I'm only using the tool for transcription (i.e. not Omarchy's OS-wide dictation feature), like:
$ voxtype transcribe /path/to/audio.wav
If you're interested, the Python source code supporting multiple voice transcription backends is at [3]; a simplified sketch of the idea is below the links.
[1] https://learn.omacom.io/2/the-omarchy-manual/107/ai
[2] https://llmspy.org/docs/features/voice-input
[3] https://github.com/ServiceStack/llms/blob/main/llms/extensio...
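To give a feel for the approach without reading the full source, here's a hypothetical sketch of the backend-selection idea; the names and fallback order are illustrative, not the actual llms-py code at [3].

# Illustrative only: route a wav to whichever backend is available.
import shutil
import subprocess

def transcribe(wav_path: str, backend: str = "auto") -> str:
    if backend == "auto":
        backend = "voxtype" if shutil.which("voxtype") else "whisper"
    if backend == "voxtype":
        # Assumes the voxtype CLI prints the transcript to stdout.
        out = subprocess.run(["voxtype", "transcribe", wav_path],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()
    if backend == "whisper":
        import whisper  # openai-whisper, loaded lazily as a fallback
        return whisper.load_model("base").transcribe(wav_path)["text"]
    raise ValueError(f"unknown backend: {backend}")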
--from-mic only supports macOS. I'm able to capture audio with ffmpeg, but adapting the ffmpeg example to capture from the mic hasn't worked yet:
ffmpeg -f pulse -channels 1 -i 1 -f s16le - 2>/dev/null | ./voxtral -d voxtral-model --stdin
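One guess I still want to try: pulse captures at 48 kHz by default, while Whisper-family models generally expect 16 kHz mono, so forcing the sample rate and channel count on the output side might help:

ffmpeg -f pulse -i 1 -ac 1 -ar 16000 -f s16le - 2>/dev/null | ./voxtral -d voxtral-model --stdin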
It's possible my system is simply under spec for the default model.
I'd like to be able to use this with the voxtral-q4.gguf quantized model from here: https://huggingface.co/TrevorJS/voxtral-mini-realtime-gguf
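If -d also accepts a single file path (I haven't confirmed that it does), swapping in the quantized weights would presumably just be:

./voxtral -d /path/to/voxtral-q4.gguf --stdin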
I can, for example, capture audio from the same source with Audacity or OBS Studio and transcribe it later, so real-time should be possible too, assuming my machine can keep up.
(But take that with a grain of salt; I haven't tried it yet)
Given that it took 19.64 minutes to transcribe the 11-second sample wav, it's possible I just didn't wait long enough :)
Cool project!
Any ideas from the HN crowd currently involved in speech-to-text models?