https://i.horizon.pics/4sEVXh8GpI (27s)
It starts with an intro, too. Really strange
[S1] It really sounds as if they've started using NPR to source TTS models
[S2] Yeah... yeah... it's kind of disturbing (laughs dejectedly).
[S3] I really wish, that they would just Stop with this.
https://i.horizon.pics/Tx2PrPTRM3
Insane how much low-hanging fruit there is for audio models right now. A team of two picking things up over a few months can build something that still competes with large players with tons of funding.
You can get hours of audio out of it for free with Eleven Reader, which suggests that their inference costs aren't that high. Meanwhile, those same few hours of audio, at the exact same quality, would cost something like $100 when generated through their website or API, a lot more than any other provider out there. Their pricing (and especially API pricing) makes no sense, not unless it's just price discrimination.
Somebody with slightly deeper pockets than academics or one guy in a garage needs to start competing with them and drive costs down.
Open TTS models don't even seem to utilize audiobooks or data scraped off the internet; most are still trained on LibriVox / LJ Speech. That's like training an LLM on just Wikipedia and expecting great results. That may have worked in 2018, but even in 2020 we knew better, not to mention 2025.
TTS models never had their "Stable Diffusion moment"; it's time we got one. I think all it would take is somebody doing open-weight models and applying the lessons we learned from LLMs and image generation to TTS models: namely more data, more scraping, more GPUs, fewer qualms, and less safety. Eleven Labs already did, and they're profiting from it handsomely.
With a budget one-tenth that of Stable Diffusion and fewer ethical qualms, you could easily 10x or 100x this.
I've tried "EPUB to audiobook" tools, but they are really miles behind what a real narrator accomplishes and make the audiobook impossible to engage with
Of course, but it's not always available.
For example, I would love an audiobook for Stanisław Lem's "The Invincible," as I just finished its video game adaptation, yet it simply doesn't exist in my native language.
It's quite seldom that the author narrates the audiobooks I listen to, and sometimes the narrator does a horrible job, butchering the characters with exaggerated tones.
The author could be better, because they at least have other information beyond the text to rely on; they can go off-script or add little details, etc.
The most skilled readers will make you want to read books _just because they narrated them_. They add a unique quality to the story that you do not get from reading yourself or from watching a video adaptation.
Currently I'm in The Age of Madness, read by Steven Pacey. He's fantastic. The late Roy Dotrice is worth a mention as well, for voicing Game of Thrones and claiming the Guinness world record for most distinct voices (224) in one series.
It will be awesome if we can create readings automatically, but it will be a while before TTS can compete with the best readers out there.
1. It’s a job that seems worthwhile to support, especially as it’s “practice” that only adds to a lifetime of work and improves their central skill set
2. A voice actor will bring their own flair, just like any actor does to their job
3. They (should) prepare for the book, understanding what it’s about in its entirety, and bring that context to the reading
> [S1] Oh fire! Oh my goodness! What's the procedure? What do we do people? The smoke could be coming through an air duct!
Seriously impressive. Wish I could direct link the audio.
Kudos to the Dia team.
Is there some sort of system prompt or hint at how it should be voiced, or does it interpret it from the text?
Because it would be hilarious if it just derived it from the text and it did this sort of voice acting when you didn't want it to, like reading a matter-of-fact warning label.
Unlike TTS models that generate each speaker turn and stitch them together, Dia generates the entire conversation in a single pass. This makes it faster, more natural, and easier to use for dialogue generation.
It also supports audio prompts — you can condition the output on a specific voice/emotion and it will continue in that style.
Demo page comparing it to ElevenLabs and Sesame-1B https://yummy-fir-7a4.notion.site/dia
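For reference, here's a minimal sketch of generating a dialogue with the Python API (model name and signatures per the repo's README at the time of writing; check there for the exact details):

```python
import soundfile as sf
from dia.model import Dia

# Load the 1.6B checkpoint from Hugging Face (name per the repo; may change).
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# One script, two speakers, generated as a single conversation in one pass.
text = "[S1] Dia generates the whole dialogue at once. [S2] Right, no stitching turns together. (laughs)"
audio = model.generate(text)

# The model outputs 44.1 kHz audio.
sf.write("dialogue.wav", audio, 44100)
```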
We started this project after falling in love with NotebookLM’s podcast feature. But over time, the voices and content started to feel repetitive. We tried to replicate the podcast feel with APIs, but it did not sound like human conversation.
So we decided to train a model ourselves. We had no prior experience with speech models and had to learn everything from scratch — from large-scale training to audio tokenization. It took us a bit over 3 months.
Our work is heavily inspired by SoundStorm and Parakeet. We plan to release a lightweight technical report to share what we learned and accelerate research.
We’d love to hear what you think! We are a tiny team, so open source contributions are extra welcome. Please feel free to check out the code, and share any thoughts or suggestions with us.
If I cut up a song or TV show & put it on YouTube (and screech about fair use/parody law), then that's fine, but people will balk at something like this.
AI is here, people.
It's quite concerning that the community around here is usually livid about FOSS license violations, which typically use copyright law as leverage, but somehow is perfectly OK with training models on copyrighted work and just labels that as "fair use".
1. What GPU did you use to train the model? I'd love to train a model like this, but currently I only have a 16GB MacBook. Thinking about buying a 5090 if it's worth it.
2. Is it possible to use this for real time audio generation, similar to the demo on the Sesame website?
This is a tangent, but it would have been nicer if it weren't a Notion site. You could put the same page on GitHub Pages and it would be much lighter to open, navigate, and link to (like people trying to link some audio).
> When we apply CFG to Parakeet sampling, quality is significantly improved. However, on inspecting generations, there tends to be a dramatic speed-up over the duration of the sample (i.e. the rate of speaking increases significantly over time). Our intuition for this problem is as follows: Say that our model is (at some level) predicting phonemes and the ground truth distribution for the next phoneme occurring is 25% at a given timestep. Our conditional model may predict 20%, but because our unconditional model cannot see the text transcription, its prediction for the correct next phoneme will be much lower, say 5%. With a reasonable level of CFG, because [the logit delta] will be large for the correct next phoneme, we’ll obtain a much higher final probability, say 50%, which biases our generation towards faster speech. [emphasis mine]
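To make the arithmetic concrete, here's a toy illustration (not Parakeet's actual code) of how the standard CFG combine pushes the 20%/5% example above to roughly 50%:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy next-"phoneme" distribution over 5 tokens; index 0 is the correct phoneme.
cond = np.log([0.20, 0.20, 0.20, 0.20, 0.20])             # conditional model: 20%
uncond = np.log([0.05, 0.2375, 0.2375, 0.2375, 0.2375])   # unconditional model: 5%

w = 2.0  # hypothetical guidance scale
guided = uncond + w * (cond - uncond)  # standard CFG combine in logit space

print(softmax(guided)[0])  # ~0.54: the correct phoneme is pushed well past 20%,
                           # which is the bias toward faster speech described above
```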
Parakeet details a solution to this, though this was not adopted (yet?) by Dia:
> To address this, we introduce CFG-filter, a modification to CFG that mitigates the speed drift. The idea is to first apply the CFG calculation to obtain a new set of logits as before, but rather than use these logits to sample, we use these logits to obtain a top-k mask to apply to our original conditional logits. Intuitively, this serves to constrict the space of possible “phonemes” to text-aligned phonemes without heavily biasing the relative probabilities of these phonemes (or for example, start next word vs pause more). [emphasis mine]
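A rough sketch of that CFG-filter idea (my reading of the quote, not Parakeet's or Dia's actual code): use the CFG-combined logits only to pick a top-k mask, then sample from the conditional logits restricted to that mask:

```python
import numpy as np

def cfg_filter_sample(cond_logits, uncond_logits, w=2.0, k=10, rng=None):
    rng = rng or np.random.default_rng()
    guided = uncond_logits + w * (cond_logits - uncond_logits)  # standard CFG combine
    topk = np.argsort(guided)[-k:]                              # text-aligned candidates only

    # Sample from the *conditional* logits restricted to the top-k set, so the
    # relative probabilities (e.g. next phoneme vs. pause) are not skewed.
    masked = np.full_like(cond_logits, -np.inf)
    masked[topk] = cond_logits[topk]
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return rng.choice(len(cond_logits), p=probs)
```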
The paper contains audio samples with ablations you can listen to.
[1]: https://jordandarefsky.com/blog/2024/parakeet/#classifier-fr...
The example voices seem overly loud and over-excited, like Andrew Tate, Speed, or an advertisement. They're lacking calm, normal conversation or normal podcast-like interaction.
(Later) It did nicely for the default example text, but just made weird sounds for a "hello all" prompt. And took longer?!
> This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
> This project offers a high-fidelity speech generation model *intended solely for research and educational use*. The following uses are strictly forbidden:
> Identity Misuse: Do not produce audio resembling real individuals without permission.
> ...
Specifically, the phrase "intended solely for research and educational use".
Thanks for the feedback :)
For anyone wanting a quick way to spin this up locally with a web UI and API access, I put together a FastAPI server wrapper around the model: https://github.com/devnen/Dia-TTS-Server
The setup is just a standard pip install -r requirements.txt (works on Linux/Windows). It pulls the model from HF automatically – defaulting to the faster BF16 safetensors (ttj/dia-1.6b-safetensors), but that's configurable in the .env. You get an OpenAI-compatible API endpoint (/v1/audio/speech) for easy integration, plus a custom one (/tts) to control all the Dia parameters. The web UI gives you a simple way to type text, adjust sliders, and test voice cloning. It'll use your CUDA GPU if you have one configured, otherwise, it runs on the CPU.
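For example, a minimal call to the OpenAI-compatible endpoint might look like the sketch below (the port, model, and voice values are placeholders; check the server's README and .env for the real ones):

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",  # placeholder port -- use your configured one
    json={
        "model": "dia-1.6b",                  # placeholder model name
        "input": "[S1] Hello from the API. [S2] Nice to meet you. (laughs)",
        "voice": "default",                   # placeholder voice value
        "response_format": "wav",
    },
)
resp.raise_for_status()
with open("out.wav", "wb") as f:
    f.write(resp.content)
```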
Might be a useful starting point or testing tool for someone. Feedback is welcome!
Parakeet references WhisperD which is at https://huggingface.co/jordand/whisper-d-v1a and doesn't include a full list of non-speech events that it's been trained with, except "(coughs)" and "(laughs)".
Not saying the authors didn't do anything interesting here. They put in the work to reproduce the blog post and open source it, a praiseworthy achievement in itself, and they even credit Parakeet. But they might not have the list of commands for more straightforward reasons.
It's also a valid criticism that we haven’t yet audited the dataset for the existing list of tags. That’s something we’ll be improving soon.
As for Dia’s architecture, we largely followed existing models to build the 1.6B version. Since we only started learning about speech AI three months ago, we chose not to innovate too aggressively early on. That said, we're planning to introduce MoE and Sliding Window Attention in our larger models, so we're excited to push the frontier in future iterations.
> We plan to release our fine-tuned whisper models and possibly the generative model (and/or future improved versions). The generative model would have to be released under a non-commercial license due to our datasets.
> TODO Docker support
Got this adapted pretty easily. Just the latest NVIDIA CUDA container, throw Python and the modules on it, and change the server to serve on 0.0.0.0. Does mean it pulls the model every time on startup though, which isn't ideal.
Do a clip with the speakers you want as the audio prompt, add the text of that clip (with speaker tags) at the beginning of your text prompt, and it clones the voices from your audio prompt for the output.
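A rough sketch of that recipe in Python (based on the repo's example API; the parameter name for the reference audio in particular is an assumption, so check the example scripts):

```python
import soundfile as sf
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Transcript of the reference clip, with speaker tags, prefixed to the new lines
# you want generated in those voices.
prompt_transcript = "[S1] This is the first speaker in my reference clip. [S2] And this is the second."
new_lines = " [S1] Now say something new in that same voice. [S2] Works for me as well."

audio = model.generate(
    prompt_transcript + new_lines,
    audio_prompt_path="reference_clip.wav",  # assumed keyword for the audio prompt
)
sf.write("cloned.wav", audio, 44100)
```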
The outputs are a bit unstable; we might need to add cleaner training data and run longer training sessions. Hopefully we can do something like OAI Whisper and update with better performing checkpoints!
Surely it just downloads to a directory that can be volume mapped?
Literally got CUDA containers working earlier today, so I haven't spent a huge amount of time figuring things out.
A high quality, affordable TTS model that can consistently nail medical terminology while maintaining an American accent has been frustratingly elusive.
But AFAIK, even if you have just a few hours of audio containing specific terminology (and correct pronunciation), fine-tuning on that data will significantly improve performance.
It started with Ira Glass voice and now the default voice is someone that sounds like they're not certain they should be saying the very banal thing they are about to say, followed by a hand-shake protocol of nervous laughter.
Might actually be helpful for others if you ever feel like documenting how you got started and what the process looked like. I’ve never worked with TTS models myself, and honestly wouldn’t know where to begin. Either way, awesome work. Big respect.
Time to first audio is crucial for us in reducing latency - wondering if Dia works with output streaming?
The Python code snippet seems to imply that the entire audio is generated in one go?
It's way past bedtime where I live, so will be able to get back to you after a few hours. Thanks for the interest :)
I've been waiting for this ever since reading some interview with Orson Scott Card ages ago. It turns out he thinks of his novels as radio theater, not books. Which is a very different way to experience the audio.
Also, you don't need to explicitly create and activate a venv if you're using uv - it deals with that nonsense itself. Just `uv sync`.
We are aware that you do not need to create a venv when uv is already installed. Just added it for people spinning up new GPU instances in the cloud. But I'll update the README to make that a bit clearer. Thanks for the feedback :)
I would absolutely love something like this for practicing Chinese, or even just adding Chinese dialogue to a project.
The full version of Dia requires around 10GB of VRAM to run.
If you have 16GB of VRAM, I guess you could pair this with a 3B-param model alongside it, or realistically probably only a 1B-param model with a reasonable context window.
We've seen Bark from Suno go from 16GB requirement -> 4GB requirement + running on CPUs. Won't be too hard, just need some time to work on it.
It's going to be an interesting decade of the new equivalent of "No, Tiffany, Bill Gates will NOT be sending you $100 for forwarding that email." Except it's going to be AI celebrities making appeals for donations to help them become billionaires or something.
Sounds awesome in the demo page though.
Note that the model was not fine-tuned on a specific voice. Hence, you will get different voices every time you run the model. You can keep speaker consistency by either adding an audio prompt (a guide coming VERY soon - try it with the second example on Gradio or HF Space for now), or fixing the seed.
I've just been massively disappointed by Sesame's CSM: the Gradio demo on their website was generating flawless dialogs with amazing voice cloning. When running it locally, the voice cloning performance is awful.
What're the recommended GPU cloud providers for using such open-weights models?
The old Bark TTS is noisy and often unreliable, but pretty great at coughs, throat clears, and yelling. Even dialogs... sometimes. Same Dia prompt in Bark: https://vocaroo.com/12HsMlm1NGdv
Dia sounds much more clear and reliable, wild what 2 people can do in 3 months.