I really like dictation. For years, I relied on transcription tools that were almost good, but they were all closed-source. Even a lot of them that claimed to be “local” or “on-device” were still black boxes that left me wondering where my audio really went.
So I built Whispering. It’s open-source, local-first, and most importantly, transparent with your data. Your data is stored locally on your device, and your audio goes directly from your machine to a local provider (Whisper C++, Speaches, etc.) or your chosen cloud provider (Groq, OpenAI, ElevenLabs, etc.). For me, the features were good enough that I left my paid tools behind (I used Superwhisper and Wispr Flow before).
Productivity apps should be open-source and transparent with your data, but they also need to match the UX of paid, closed-software alternatives. I hope Whispering is near that point. I use it for several hours a day, from coding to thinking out loud while carrying pizza boxes back from the office.
Here’s an overview: https://www.youtube.com/watch?v=1jYgBMrfVZs, and here’s how I personally am using it with Claude Code these days: https://www.youtube.com/watch?v=tpix588SeiQ.
There are plenty of transcription apps out there, but I hope Whispering adds some extra competition from the OSS ecosystem (one of my other OSS favorites is Handy https://github.com/cjpais/Handy). Whispering has a few tricks up its sleeve, like a voice-activated mode for hands-free operation (no button holding), and customizable AI transformations with any prompt/model.
Whispering used to be in my personal GH repo, but I recently moved it into a larger project called Epicenter (https://github.com/epicenter-so/epicenter), which I should explain a bit...
I’m basically obsessed with local-first open-source software. I think there should be an open-source, local-first version of every app, and I would like them all to work together. The idea of Epicenter is to store your data in a folder of plaintext and SQLite, and build a suite of interoperable, local-first tools on top of this shared memory. Everything is totally transparent, so you can trust it.
Whispering is the first app in this effort. It’s not there yet regarding memory, but it’s getting there. I’ll probably write more about the bigger picture soon, but mainly I just want to make software and let it speak for itself (no pun intended in this case!), so this is my Show HN for now.
I just finished college and was about to move back with my parents and work on this instead of getting a job…and then I somehow got into YC. So my current plan is to cover my living expenses and use the YC funding to support maintainers, our dependencies, and people working on their own open-source local-first projects. More on that soon.
Would love your feedback, ideas, and roasts. If you would like to support the project, star it on GitHub here (https://github.com/epicenter-so/epicenter) and join the Discord here (https://go.epicenter.so/discord). Everything’s MIT licensed, so fork it, break it, ship your own version, copy whatever you want!
On key press, start recording microphone to /tmp/dictate.mp3:
# Save up to 10 mins. Minimize buffering. Save pid
ffmpeg -f pulse -i default -ar 16000 -ac 1 -t 600 -y -c:a libmp3lame -q:a 2 -flush_packets 1 -avioflags direct -loglevel quiet /tmp/dictate.mp3 &
echo $! > /tmp/dictate.pid
On key release, stop recording, transcribe with whisper.cpp, trim whitespace and print to stdout:
# Stop recording
kill $(cat /tmp/dictate.pid)
# Transcribe
whisper-cli --language en --model $HOME/.local/share/whisper/ggml-large-v3-turbo-q8_0.bin --no-prints --no-timestamps /tmp/dictate.mp3 | tr -d '\n' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//'
I keep these in a dictate.sh script and bind to press/release on a single key. A programmable keyboard helps here. I use https://git.sr.ht/%7Egeb/dotool to turn the transcription into keystrokes. I've also tried ydotool and wtype, but they seem to swallow keystrokes.
bindsym XF86Launch5 exec dictate.sh start
bindsym --release XF86Launch5 exec echo "type $(dictate.sh stop)" | dotoolc
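For reference, a minimal sketch of what that dictate.sh wrapper could look like, just the commands above wrapped in start/stop cases (the actual script isn't shown here, so treat the structure as an assumption):

#!/bin/sh
# dictate.sh -- sketch of a press/release wrapper around the commands above
case "$1" in
  start)
    # Record the mic to /tmp/dictate.mp3 in the background and remember the pid
    ffmpeg -f pulse -i default -ar 16000 -ac 1 -t 600 -y -c:a libmp3lame -q:a 2 \
      -flush_packets 1 -avioflags direct -loglevel quiet /tmp/dictate.mp3 &
    echo $! > /tmp/dictate.pid
    ;;
  stop)
    # Stop the recorder, then transcribe and print the trimmed text
    kill "$(cat /tmp/dictate.pid)"
    whisper-cli --language en \
      --model "$HOME/.local/share/whisper/ggml-large-v3-turbo-q8_0.bin" \
      --no-prints --no-timestamps /tmp/dictate.mp3 \
      | tr -d '\n' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//'
    ;;
esac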
This gives a very functional push-to-talk setup. I'm very impressed with https://github.com/ggml-org/whisper.cpp. Transcription quality with large-v3-turbo-q8_0 is excellent IMO, and a Vulkan build is very fast on my 6600XT. It takes about 1s for an average sentence to appear after I release the hotkey.
I'm keeping an eye on the NVidia models, hopefully they work on ggml soon too. E.g. https://github.com/ggml-org/whisper.cpp/issues/3118.
Whisper on Windows, the openai-whisper package, doesn't have these q8_0 models; it has like 8 models, and I always get an error about Triton cores (something about timestamping, I guess), which Windows doesn't have. I've transcribed >1000 hours of audio with this setup, so I'm used to the workflow.
> It is difficult to get a man to understand something, when his salary depends upon his not understanding it!
in the special case where the thing to be understood is "your app doesn't need to be a Big Fucking Deal". Maybe it pleases some users to wrap this in layers of additional abstraction and chrome and clicky buttons and storefronts, but in the end the functionality is already there with a couple of FOSS projects glued together in a bash script.
I used to think the likes of Suckless were brutalist zealots, but more and more I think they (and the Unix patriarchs) were right and the path to enlightenment is expressed in plain text.
Yes! This. I have almost no experience w/ tts, but if/when I explore the space, I'll start w/ Whispering -- because of Epicenter. Starred the repo, and will give some thought to other apps that might make sense to contribute there. Bravo, thanks for publishing these and sharing, and congrats on getting into YC! :)
https://github.com/epicenter-so/epicenter/pull/655
After this pushes, we'll have far more extensive local transcription support. Just fixing a few more small things :)
We all should be.
My biggest gripe perhaps is not being able to get decent content out of a thought stream; the models can't properly filter out the pauses and "uuuuhmms", much less handle on-the-fly corrections to what I've been saying, like going back and repeating something with a slight variation and whatnot.
This is a challenging problem I'd love to see tackled well by open models I can run on my computer or phone. Are there newer models more capable of this? Or is it not just a model thing, and am I missing a good app too?
In the meantime, I'll keep typing, even though it can be quite a bit less convenient; that's especially true for note-taking on the go.
One of the features of the project posted above is "transformations" that you can run on transcripts. They feed the text into an LLM to clean it up. If you're willing to pay for the tokens, I think you could not only remove filler words, but could probably even get the semantically aware editing (corrections) you're talking about.
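Roughly, a cleanup pass like that is just a prompt wrapped around the transcript. A sketch against an OpenAI-compatible chat endpoint (model name, prompt, and the jq dependency are just examples, not the app's actual pipeline):

# Sketch: clean up a raw transcript with an LLM (assumes $OPENAI_API_KEY and jq)
TRANSCRIPT="so uhm I think we should ship on, no wait, let's ship it next Friday"
curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg t "$TRANSCRIPT" '{
    model: "gpt-4o-mini",
    messages: [
      {role: "system", content: "Clean up this dictation: remove filler words and false starts, apply spoken self-corrections, otherwise keep the meaning and wording."},
      {role: "user", content: $t}
    ]
  }')" | jq -r '.choices[0].message.content'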
I have mixed feelings about OS-integration. I'm currently working on a project to use a foot-pedal for push-to-transcribe - it speaks USB-HID so it works anywhere without software, and it doesn't clobber my clipboard. That said, an app like yours really opens up some cool possibilities! For example, in a keyboard-emulation strategy like mine, I can't easily adjust the text prompt/hint for the transcription model.
With an application running on the host though, you can inject relevant context/prompts/hints (either for transcription, or during your post-transformations). These might be provided intentionally by the user, or, if they really trust your app, this context could even be scraped from what's currently on-screen (or which files are currently being worked on).
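With the whisper.cpp CLI from the setup upthread, for instance, that kind of hint can go into the model's initial prompt. A sketch (the glossary file is hypothetical, and flag support may vary by version):

# Feed per-project vocabulary to whisper.cpp as an initial prompt (hypothetical glossary file)
HINT="$(head -c 500 "$HOME/notes/project-glossary.txt")"
whisper-cli --language en \
  --model "$HOME/.local/share/whisper/ggml-large-v3-turbo-q8_0.bin" \
  --no-prints --no-timestamps --prompt "$HINT" /tmp/dictate.mp3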
Another thing I've thought about doing is using a separate keybind (or button/pedal) that appends the transcription directly to a running notes file. I often want to make a note to reference later, but which I don't need immediately. It's a little extra friction to have to actually have my notes file open in a window somewhere.
Will keep an eye on epicenter, appreciate the ethos.
Actually dictating code, but they do it in a rather smart way.
Do you have any sense of whether this type of model would work with children's speech? There are plenty of educational applications that would value a privacy-first locally deployed model. But, my understanding is that Whisper performs pretty poorly with younger speakers.
I wonder if it changes with time for people who use dictation often.
Similarly, I've used dictation when working on something physical, like reverse engineering some hardware: my table is full of disassembled electronics, I might be carefully holding a probe or something like that, and having to put everything down just to write "X volts on probe Y" would slow me down.
I use whisperfile, which is a multiplatform implementation of Whisper that works really well.
Honestly, I'm getting tired of subscription-based apps. If it's truly offline, shouldn't it support a one-time purchase model? The whole point of local-first is that you're not dependent on ongoing cloud services, so why structure pricing like you are?
That said, will definitely give Whispering a try - always happy to see more open source alternatives in this space, especially with the local whisper.cpp integration that just landed.
Right now the app is entirely free, and we are trying to expand our local model support to make it truly free. Any cloud subscriptions are up to the user right now.
Thanks for giving us a shot, and no pressure on using it! At the end of the day, I just want to build something that is open source and trustworthy, and hopefully will fit into the Epicenter ecosystem, the data layer that I talked about earlier in my post.
I understand the fatigue but not the outright indignation.
The LLM part should be very much doable, but I'm not sure if speaker recognition exists in a sufficiently working state?
Eventually I'm trying to get around to using it in conjunction with a fine-tuned whisper model to make transcriptions. Just haven't found the time yet.
I spent three months perfecting the speaker diarization pipeline and I think you'll be quite pleased with the results.
When you are in an environment where you can dictate, it really is a game changer. Not only is dictating much faster than typing, even if you're a fast typist, but I find you don't get stuck composing a message quite as much. It also makes my writing feel more like natural speech.
I have both the record and cancel actions bound to side buttons on my mouse, and paste to a third; the auto-paste feature is frustrating in my opinion.
I do miss having a taskbar icon to see if I'm recording or not. Sometimes I accidentally leave it running and sometimes the audio cues break until I restart it.
Transformations are great, though even with an extreme amount of prompt engineering I can't seem to stop the transformation model from occasionally responding to my message rather than just transforming it.
- allow flexible recording toggle shortcuts
- show a visual icon with waves etc. showing recording
- how the clipboard is handled during recording (does it copy to clipboard? does it clear it after text output?)
VoiceInk is nearly there in terms of good behavior on these dimensions, and I hope to ditch my Wispr Flow sub soon.
For the Whispering dev: would it be possible to set "right shift" as a toggle? Also, it would be great to handle it like VoiceInk, which is:
- either a short right shift press -> it starts, and a short right shift press again stops it
- or a "long right shift press" (e.g. pressed for at least 0.5s) -> it starts and just waits for you to release right shift to stop
it's quite convenient
Another really cool feature would be the same "mini-recorder" that pops up on screen like VoiceInk when you record; once you're done, it would display the current transcript and any of your "transformation" actions, and let you choose which one (or multiple) you want to apply, each time pasting the result to the clipboard.
Like Leftium said, the local-first Whisper C++ implementation was just posted a few hours ago.
Their point is they aren’t a middleman with this, and you can use your preferred supplier or run something locally.
> All your data is stored locally on your device,
is fundamentally incompatible with half of the following sentence.
I'd write it as
> All your data is stored locally on your device, unless you explicitly decide to use a cloud provider for dictation.
If you add Deepgram listen API compatibility, you can do live transcription via either Deepgram (duh) or OWhisper: https://news.ycombinator.com/item?id=44901853
(I haven’t gotten the Deepgram JS SDK working with it yet, currently awaiting a response by the maintainers)
https://github.com/epicenter-so/epicenter/pull/661
In the middle of a huge release that sets up FFMPEG integration (OWhisper needs very specifically formatted files), but hoping to add this after!
edit: nvm, this overview explains the different options: https://www.gladia.io/blog/best-open-source-speech-to-text-m... and https://www.gladia.io/blog/thinking-of-using-open-source-whi...
That way I can switch to the dictation keyboard, press dictate, and have the transcription inserted in any application (first or third party).
MacWhisper is fantastic for macOS system dictation but the same abilities don't exist on iOS yet. The native iOS dictation is quite good but not as accurate with bespoke technical words / acronyms as Whisper cpp.
I get the Whisper models, and then do what? How do I run them on a device without internet? There's no documentation about it...
After this pushes, we'll have far more extensive local transcription support. Just fixing a few more small things :)
On the other hand, kudos to the developer, who's already working to make it happen!
Record the voice permanently (without a hotkey), e.g. "run" compiles and runs a script, "code" switches back to the code editor.
Under Windows I use AutoHotkey v2, but I would replace it with simple voice commands.
On Win11, I installed ffmpeg using winget, but the app isn't detecting it; running ffmpeg -version works, yet the app doesn't pick it up.
One thing: how can we reduce the number of notifications received?
I like the system prompt option too.
https://github.com/epicenter-so/epicenter/issues/674
We hope to fix notifications too. Thank you for the feedback, and happy to hear you liked the system prompt!
Whisper treats transcription as transforming audio data into LLM-style text output. The transcripts generally have proper casing and punctuation, and can usually stick to a specific domain based on the surrounding context.
For macOS there is also the great VoiceInk, which is similar and open source: https://github.com/Beingpax/VoiceInk/
It's uncanny how good / fast it is
Local, using WhisperX. Precompiled binaries available.
I'm hoping to find and try a local-first setup for an nvidia/canary-style model (like https://huggingface.co/nvidia/canary-qwen-2.5b), since it's almost twice as fast as Whisper with an even lower word error rate.
Allegedly Groq will be offering diarization with their cloud offering and super fast API which will be huge for those willing to go off-local.
Still lots of quality headroom in this space. I’ll def revisit whispering
However what gives me pause is the sheer number of possibly compromised microphones all around me (phones, tablets, laptops, tv etc) at all times, which makes spying much easier than if I use a keyboard.
I love the idea of epicenter. I love open source local-first software.
Something I've been hacking on for a minute would fit so well, if encryption wasn't a requirement for the profit model.
But uh yes thank you for making my life easier, and I hope to return the favor soon
say "This is a test message" --voice="Bubbles"
EDIT: I'm having way too much fun with this lol
say "This is a test message" --voice="Organ"
say "This is a test message" --voice="Good News"
say "This is a test message" --voice="Bad News"
say "This is a test message" --voice="Jester"
$ apt install espeak-ng
$ espeak-ng 'Hello, World!'
It takes some adjustment and sounds a lot worse than what e.g. Google ships proprietarily on your phone, but after ~30 seconds of listening (if I haven't used it recently) I understand it just as well as I understand the TTS engine on my phone.

If there's a more modern package that sounds more human that's a similar no-brainer to install, I'd be interested, but just to note that this part of the problem has been solved for many years now, even if the better-sounding models are usually not as openly licensed, orders of magnitude more resource-intensive, limited to a few languages, and often less reliable/predictable in their pronunciation of new or compound words (usually not all of these issues at once).
$ apt install festival
$ echo "Hello, World!" | festival --tts
Not impressively better, but I find festival slightly more intelligible.
$ sudo apt update
$ sudo apt install -y python3 python3-pip libsndfile1 ffmpeg
$ python3 -m venv ./venv/piper-tts
$ ./venv/piper-tts/bin/pip install piper-tts
$ ./venv/piper-tts/bin/python3 -m piper.download_voices en_US-lessac-medium
$ ./venv/piper-tts/bin/piper -m en_US-lessac-medium -- 'This will play on your speakers.'
To manage the install graphically, you can use Pied (https://pied.mikeasoft.com/), which has a snap and a flatpak. That one's really cool because you can choose the voice graphically, which makes it easy to try them out or switch voices. To play sound you just use "spd-say 'Hello, world!'".

More crazy: Home Assistant did a "Year of Voice" project (https://www.home-assistant.io/blog/2022/12/20/year-of-voice/) that culminated in a real open-source voice assistant product (https://www.home-assistant.io/voice-pe/)!!! And it's only $60??
If it's still an issue, feel free to build it locally on your machine to ensure your supply chain is clean! I'll add more instructions in the README in the future.
---7.3.0---
This release popped up just a few minutes ago, so here are VirusTotal results for the 7.3.0 EXE and MSI installers:
EXE (still running behavior checks but Arctic Wolf says Unsafe and AVG & Avast say PUP): https://www.virustotal.com/gui/file/816b21b7435295d0ac86f6a8...
MSI nothing flags immediately, still running behavior checks (https://www.virustotal.com/gui/file/e022a018c4ac6f27696c145e...)
---7.2.2/7.2.1 below---
I do note one bit of weirdness: the Windows downloads show 7.2.2, but the download links themselves are 7.2.1. 7.2.1 is also what shows on the release from 3 days ago, even though it's numbered 7.2.2.
I didn't check the Mac or Linux installers, but for Windows VirusTotal flags nothing on the 7.2.1/7.2.2 MSI (https://www.virustotal.com/gui/file/7a2d4fec05d1b24b7deda202...) and 3 flags on the EXE (ArcticWolf Unsafe, AVG & Avast PUP) (https://www.virustotal.com/gui/file/a30388127ad48ca8a42f9831...)
https://github.com/epicenter-so/epicenter/issues/440
Thank you again for bringing this to my attention! Need to step up my Windows development.