This release supports English text-to-speech applications in eight voices: four male and four female. The model is quantized to int8 + fp16, and it uses ONNX for the runtime. The model is designed to run literally anywhere, e.g. Raspberry Pi, low-end smartphones, wearables, browsers, etc. No GPU required!
We're releasing this to give early users a sense of the latency and voices that will be available in our next release (hopefully next week). We'd love your feedback! Just FYI, this model is an early checkpoint trained on less than 10% of our total data.
We started working on this because existing expressive OSS models require big GPUs to run on-device, and the cloud alternatives are too expensive for high-frequency use. We think there's a need for frontier open-source models that are tiny enough to run on edge devices!
It doesn't sound that good. It's an excellent technical achievement, and it may well keep improving! But for now I can't use it for consumer-facing applications.
The comparison to make is expressiveness and correct intonation for long sentences vs something like espeak. It actually sounds amazing for the size. The closest thing is probably KokoroTTS at 82M params and ~300MB.
I use TTS on my phone regularly and recently also tried this new project on F-Droid called SherpaTTS, which grabs some models from Huggingface. They're super heavy (the phone suspends other apps to disk while this runs) and sound good, but in the first news article there were already one or two mispronunciations, because it guesses how to say uncommon or new words rather than turning text into speech with rule-based logic.
Google and Samsung each have a TTS engine pre-installed on my device, and those sound and work fine. A tad monotonous, but they seem to always pronounce things the same way, so you can always work out what the text said.
Espeak (or -ng) is the absolute worst, but after 30 seconds of listening closely you get used to it and can understand everything fine. I don't know if it's the best open-source option (there are probably others I should be trying), but it's at least the most reliable: you'll always get what is happening, and you can install it on any device without licensing issues.
Such hardware is not general-purpose, and upgrading the model would not be possible, but there are plenty of use cases where this is reasonable.
The tech is still public and the research is available
This way they could offload as much of the "LLM" work as possible onto a device that lives in the home; all family-linked phones and devices could use it for local inference.
It's way overpowered as is anyway, so why not use it for something useful?
Ubuntu 24, Razer Blade 16, Intel Core i9-14900HX
Performance Results:
Initial Latency: ~315ms for short text
Audio Generation Speed (seconds of audio per second of processing):
- Short text (12 chars): 3.35x realtime
- Medium text (100 chars): 5.34x realtime
- Long text (225 chars): 5.46x realtime
- Very Long text (306 chars): 5.50x realtime
Findings:
- Model loads in ~710ms
- Generates audio at ~5x realtime speed (excluding initial latency)
- Performance is consistent across different voices (4.63x - 5.28x realtime)
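For anyone wanting to reproduce rough numbers like these, a minimal timing sketch could look like the following. The `KittenTTS` class, its `generate()` method, the model name string, and the 24 kHz output rate are assumptions taken from the repo README and the CLI shown elsewhere in this thread, so double-check them against the actual package:

    # Rough timing sketch, not the exact script used for the results above.
    # Assumes the KittenTTS Python API from the repo README and a 24 kHz
    # output sample rate -- both worth verifying against the repo.
    import time
    from kittentts import KittenTTS

    texts = {
        "short (12 chars)": "Hello there!",
        "medium (~100 chars)": "This is a medium length sentence meant to roughly "
                               "match the hundred character benchmark above.",
    }

    m = KittenTTS("KittenML/kitten-tts-nano-0.1")  # model name is a placeholder
    SAMPLE_RATE = 24_000                           # assumed output rate

    for label, text in texts.items():
        start = time.perf_counter()
        audio = m.generate(text, voice="expr-voice-3-m")
        elapsed = time.perf_counter() - start
        audio_seconds = len(audio) / SAMPLE_RATE
        print(f"{label}: {audio_seconds / elapsed:.2f}x realtime ({elapsed:.2f}s wall)")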
It sounds ok, but impressive for the size.
We've had formant synths for several decades, and they're perfectly understandable and require a tiny amount of computing power, but people tend not to want to listen to them:
https://en.wikipedia.org/wiki/Software_Automatic_Mouth
https://simulationcorner.net/index.php?page=sam (try it yourself to hear what it sounds like)
DECtalk[1,2] would be a much better example; that's as formant as you get.
[1] https://en.wikipedia.org/wiki/DECtalk [2] https://webspeak.terminal.ink
Other than the video, the only relevant content is on the about page [2]. It says the voice is a collaboration between 5 different entities, including advocacy groups, marketing firms and a music producer.
The video is the only example of the voice in use. There is no API, weights, SDK, etc.
I suspect this was a one-off marketing stunt sponsored by Copenhagen Pride before the pandemic. The initial reaction was strong enough that a couple of years later they were still getting a small but steady flow of traffic. One of the involved marketing firms decided to monetize the asset and defaced it with blog spam.
It’s a good choice for a robot voice. It’s easier to understand than the formant synths or deliberately distorted human voices. The genderless aspect is alien enough to avoid the uncanny valley. You intuitively know you’re dealing with something a little different.
I agree with your wider point. I use Google TTS with Moon+Reader all the time (I tried audio books read by real humans but I prefer the consistency of TTS)
Well sure, the BBC have already established that it's supposed to sound like a brit doing an impersonation of an American: https://www.youtube.com/watch?v=LRq_SAuQDec
"This first Book proposes, first in brief, the whole Subject, Mans disobedience, and the loss thereupon of Paradise wherein he was plac't: Then touches the prime cause of his fall, the Serpent, or rather Satan in the Serpent; who revolting from God, and drawing to his side many Legions of Angels, was by the command of God driven out of Heaven with all his Crew into the great Deep."
It takes a while until it starts generating sound on my i7 cores but it kind of works.
This also works:
"blah. bleh. blih. bloh. blyh. bluh."
So I don't think it's a limit on punctuation. Voice quality is quite bad though, not as far from the old school C64 SAM (https://discordier.github.io/sam/) of the eighties as I expected.
If anyone else wants to try:
> Kitten TTS is an open-source series of tiny and expressive text-to-speech models for on-device applications. Our smallest model is less than 25 megabytes.
https://clowerweb.github.io/node_modules/onnxruntime-web/dis...
(seems reverted now)
Doesn't seem to work with Thai.
Plus, Python software is dependency hell in general, while webpages have to be self-contained by their nature (thank god we no longer have Silverlight and Java applets...)
Hopefully open source will render them irrelevant in the future.
Have you seen the code[1] in the repo? It uses phonemizer[2] which is GPL-3.0 licensed. In its current state, it's effectively GPL licensed.
[1]: https://github.com/KittenML/KittenTTS/blob/main/kittentts/on...
[2]: https://github.com/bootphon/phonemizer
Edit: It looks like I replied to an LLM generated comment.
And it isn't something you can fix, because the model was trained on bad phonemes (everyone uses Whisper and then phonemizes the text transcript).
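To make the "Whisper, then phonemize" point concrete, the common data-prep pipeline is roughly the sketch below (using the openai-whisper and phonemizer packages; "clip.wav" and the model size are placeholders). Any ASR mistranscription turns directly into a wrong phoneme label in the training data:

    # Sketch of the usual training-data pipeline described above:
    # ASR transcript from Whisper, then grapheme-to-phoneme via phonemizer.
    import whisper                    # openai-whisper package
    from phonemizer import phonemize  # GPL-3.0, wraps espeak-ng

    asr = whisper.load_model("base")                 # model size chosen arbitrarily
    transcript = asr.transcribe("clip.wav")["text"]  # "clip.wav" is a placeholder

    phonemes = phonemize(transcript, language="en-us", backend="espeak", strip=True)
    print(transcript)
    print(phonemes)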
There is a third option: asking the project for an exception.
Though that is unlikely to be granted[1], leaving you back with just the other two options.
And of course a fourth choice: just ignore the license. This is the option taken by companies like Onyx, whose products I might otherwise be interested in…
----
[1] Those of us who pick GPL3 or AGPL generally do so to keep things definite and an exception would muddy the waters, also it might not even be possible if the project has many maintainers as relicensing would require agreement from all who have provided code that is in the current release. Furthermore, if it has inherited the license from one of its dependencies, an exception is even less practical.
IIUC, the project isn't at the liberty to grant such an exception because it inherits its GPL license from espeak-ng.
Any user would still effectively be bound by the GPL-3.0, but if someone can remove the GPL dependencies they could use the project under Apache
Obviously you can't run them (with the original library) without complying with the GPL. But I don't see why I couldn't, independently of that, also give you this text file under Apache 2.0 to do with as you want (which, for the record, still doesn't allow you to run them with the original library without complying with the GPL, but that'd be phonemizer forcing you to do that, not this project).
You would have to be very specific about the dual-licensing to avoid confusion about what you are allowed to do under Apache conditions though. You can't just say "it's dual-licensed"
If my MIT-licensed one-line Python library has this line of code…
    run(["bash", "-c", "echo hello"])
…I'm not suddenly subject to bash's licensing. For anyone wanting to run my stuff though, they're going to need to make sure they themselves have bash installed. (But, to argue against my own point, if an OS vendor ships my library alongside a copy of bash, do they have to now relicense my library as GPL?)
However, this has never actually been proven in court, and there are many good arguments that linking doesn't count as a derivative work.
Old post by a lawyer someone else found (version 3 wouldn't affect this) [1]
For me personally, I don't really understand how, if dynamic linking were viral, using Linux to run code isn't viral. Surely at some level what Linux does to run your code calls GPLed code.
It doesn't really matter though, since the FSF's stance is enough to scare companies away from using it, and any individual is highly unlikely to be sued.
The Linux kernel has an explicit exception for userspace software:
> NOTE! This copyright does not cover user programs that use kernel services by normal system calls
> The "System Libraries" of an executable work include anything, other than the work as a whole, that (a) is included in the normal form of packaging a Major Component, but which is not part of that Major Component, and (b) serves only to enable use of the work with that Major Component, or to implement a Standard Interface for which an implementation is available to the public in source code form. A "Major Component", in this context, means a major essential component (kernel, window system, and so on) of the specific operating system (if any) on which the executable work runs, or a compiler used to produce the work, or an object code interpreter used to run it.
> The "Corresponding Source" for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work's System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work.
As far as I understand the FSF's interpretation of their license, that's not true. Even if you only dynamically link to GPL-licensed code, you create a combined work which has to be licensed, as a whole, under the GPL.
I don't believe that this extends to calling an external program via its CLI, but that's not what the code in question seems to be doing.
(This is not an endorsement, but merely my understanding on how the GPL is supposed to work.)
Running bash (via exec()/fork()/spawn()/etc) isn't the same as (statically or dynamically) linking with its codebase. If your MIT-licensed one-liner links to code that's GPL licensed, then it gets infected by the GPL license.
I don't know if this has ever been tested in court.
[1]: https://www.gnu.org/licenses/gpl-faq.html#MereAggregation
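For readers skimming this subthread, the distinction being argued over looks roughly like the sketch below (illustrative only, not a legal opinion): spawning espeak-ng as a separate process versus importing the GPL-licensed phonemizer module into your own process, which is what the KittenTTS code linked above does.

    # Illustrative contrast of the two mechanisms being debated; it says
    # nothing about which one a court would treat as creating a derivative work.
    import subprocess
    from phonemizer import phonemize  # GPL-3.0 code loaded into this process

    # 1) Invoking a GPL tool as a separate program via its CLI:
    result = subprocess.run(
        ["espeak-ng", "--ipa", "-q", "hello world"],
        capture_output=True, text=True,
    )
    print(result.stdout.strip())

    # 2) Importing and calling GPL-licensed library code in-process
    #    (this is what the KittenTTS code does with phonemizer):
    print(phonemize("hello world", language="en-us", backend="espeak"))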
Silliness
The result can only be distributed under the terms of the GPL-3. That's actually a crucial difference: there's nothing preventing Kitten TTS from being Apache licensed, soliciting technical contributions under that license, and parts of its code being re-used in other software under that license. Yes, for the time being, this limits what you can do with Kitten TTS if you want to use the software as a whole (e.g. by embedding it into your product), but the license itself is still Apache and that can have value.
Morals may stop you but other than that? IMHO all open source code is public domain code if anyone is willing to spend some AI tokens.
There's a standard way to approach this, called clean-room engineering.
https://en.m.wikipedia.org/wiki/Clean-room_design
One person reads the code and produces a detailed technical specification. Someone reviews it to ensure that there is nothing in there that could be classified as copyrighted material, then a third person (who has never seen the original code) implements the spec.
You could use an LLM at both stages, but you'd have to be able to prove that the LLM doing the implementation had no prior knowledge of the code in question... which, given how LLMs have been trained, seems to me to be very dubious territory until that legal situation gets resolved.
You don't give the whole codebase to an LLM and expect it to have one-shot output. Instead, you break it down and write the code block by block. Then the size of the codebase doesn't matter. You use the LLM as a tool; it is not supposed to replace you. You don't try to become George from the Jetsons, who is just pressing a button and doesn't touch anything; instead you are on top of it as the LLM does the coding. You test the code at every step to see if the implementation behaves as expected. Do enough of this and you have proper, full "bespoke" software.
https://github.com/espeak-ng/espeak-ng/blob/a4ca101c99de3534...
eSpeak NG's data files take about 12 MB (multi-lingual).
I guess this one may generate more natural-sounding speech, but older or lower-end computers were capable of decent speech synthesis previously as well.
$ ls -lh /usr/bin/flite
Listed as 27K last I checked.
I recall some Blind users were able to decode Gordon 8-bit dialogue at speeds most people found incomprehensible. =3
What about the training data? Is everyone 100% confident that models are not a derived work of the training inputs now, even if they can reproduce input exactly?
I am curious how fast this is with CPU only.
System Requirements
Works literally everywhere
Haha, on one of my machines my Python version is too old, and the package/dependencies don't want to install. On another machine the Python version is too new, and the package/dependencies don't want to install.
https://github.com/KittenML/KittenTTS/pull/21 https://github.com/KittenML/KittenTTS/pull/24 https://github.com/KittenML/KittenTTS/pull/25
If you have `uv` installed, you can try my merged ref that has all of these PRs (and #22, a fix for short generation being trimmed unnecessarily) with
uvx --from git+https://github.com/akx/KittenTTS.git@pr-21-22-24-25 kittentts --output output.wav --text "This high quality TTS model works without a GPU"
I found the TTS a bit slow so I piped the output into ffplay with 1.2x speedup to make it sound a bit better
uvx --from git+https://github.com/akx/KittenTTS.git@pr-21-22-24-25 kittentts --text "I serve 12 different beers at my restaurant for over 1000000 customers" --voice expr-voice-3-m --output - | ffplay -af "atempo=1.2" -f wav -
Nice one, thanks!
uv installation: https://docs.astral.sh/uv/guides/tools/
This package is the epitome of dependency hell.
Seriously, stick with piper-tts.
Easy to install, 50MB gives you excellent results and 100MB gives you good results with hundreds of voices.
> g++ --version
g++ (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2)
Copyright (C) 2025 Free Software Foundation, Inc.
Why is Python so bad at that? It's less kludgy than Bash scripts, but even those are easier to get working.
JS/TS/npm is just as bad with probably more build tools/frameworks.
Rust is a mess.
Go, well.
Even perl was quite complicated.
This is flat out wrong. NPM packages by default are local to a directory. And I haven't seen a package rely on a specific minor version of Node in literally years. Node's back compat is also great; there was one hiccup 5 or 6 years ago where a super popular native package was deprecated, but that's been about it.
I can take current LTS node and run just about any package from the NPM repo written within the last 4 or 5 years and it will just work. Meanwhile plenty of python packages somehow need specific point releases. What the unholy hell.
Node Version Manager does exist, and it can be set up to work per directory, which is super cool, but I haven't needed NVM in literal years.
man python
There you go.

    PYTHON(1)                 General Commands Manual                 PYTHON(1)

    NAME
           python - an object-oriented programming language

    SYNOPSIS
           python [ -c command | script | - ] [ arguments ]

    DESCRIPTION
           Python is the standard programming language.
Computer scientists love Python, not just because whitespace comes first ASCIIbetically, but because it's the standard. Everyone else loves Python because it's PYTHON! With no other language are you expected to maintain several entirely different versions of the language, each of which is a relatively large installation. Can you imagine if we all had five different llvms or gccs just to compile five different modern C projects?
I'm going to get downvoted to oblivion, but it doesn't change the reality that Python in 2025 is unnecessarily fragile.
> if we all had five different llvms or gccs
Oof, those are poor examples. Most compilers using LLVM other than clang do ship with their own LLVM patches, and cross-compiling with GCC does require installing a toolchain for each target.
Yes, because all I have to do is look at the real world.
Anyway, I think I'll stick with Festival 1.96 for TTS. It's super fast even on my Core 2 Duo, and I have exactly zero chance of getting this Python-3-ish script to run on any machine with an OS older than a handful of years.
Pretty impressive but this seems to be a staple of most AI/ML projects.
"Works on my machine" or "just use docker", although here the later doesn't even seem to be an option.
I send you a 500kb Windows .exe file and claim it runs literally everywhere.
Would it be ignorant to say anything against it because of its size?
Now, RISC architectures are much more common, so instead of the rare 68K Apple/Amiga/etc computer that existed at the time, it's super common to want to run software on an ARM or occasionally RISC-V processor, so writing in x86 assembly language would require emulation, making for worse performance than a compiled language.
To make the setup easier and add a few features people are asking for here (like GPU support and long text handling), I built a self-hosted server for this model: https://github.com/devnen/Kitten-TTS-Server
The goal was a setup that "just works" using a standard Python virtual environment to avoid dependency conflicts.
The setup is just the standard git clone, pip install in a venv, and python server.py.
ONNX runtime is a single library, with C#'s package being ~115MB compressed.
Not tiny, but usually only a few lines to actually run and only a single dependency.
Which is completely reasonable imho, but obviously comes with tradeoffs.
You would need to constrain the vocabulary to see any benefits, but that could be reasonable. For example, an enumeration of numbers, units, and metric names could handle dynamic time, temperature, and other dashboard items.
For something more complex like offline navigation, you already need to store a map. You could store street names as tokens instead of text. Add a few turn commands, and you have offline spoken directions without on device pre-processing.
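A minimal sketch of that idea, with entirely hypothetical names and data: store a small fixed vocabulary once, then compose each spoken direction from token IDs at runtime, so no free-text normalization is needed on device.

    # Hypothetical constrained-vocabulary table for offline spoken directions.
    # Street names and turn commands are stored once; a direction is just a
    # sequence of token IDs, so the TTS input stays within a known vocabulary.
    TOKENS = {
        0: "turn left onto",
        1: "turn right onto",
        2: "continue on",
        3: "Main Street",      # street names would come from the map data
        4: "Fifth Avenue",
    }

    def render_direction(token_ids):
        """Compose the utterance from pre-approved tokens only."""
        return " ".join(TOKENS[t] for t in token_ids)

    # [1, 4] -> "turn right onto Fifth Avenue", which is then handed to the TTS model
    print(render_direction([1, 4]))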
Aside: Are there any models for understanding voice to text, fully offline, without training?
I will be very impressed when we are able to have a conversation with an AI at a natural rate and not "probe, space, response".
My mid-range AMD CPU is multiple times faster than realtime with parakeet.
OpenAI's whisper is a few years old and pretty solid.
[0]: https://github.com/openai/whisper/discussions/679 [1]: https://github.com/openai/whisper/discussions/928 [2]: https://github.com/openai/whisper/discussions/2608
Average duration per generation: 1.28 seconds
Characters processed per second: 30.35
--
"Um"
Average duration per generation: 0.22 seconds
Characters processed per second: 9.23
--
"The brown fox jumps over the lazy dog.. The brown fox jumps over the lazy dog.."
Average duration per generation: 2.25 seconds
Characters processed per second: 35.04
--
processor : 0
vendor_id : AuthenticAMD
cpu family : 25
model : 80
model name : AMD Ryzen 7 5800H with Radeon Graphics
stepping : 0
microcode : 0xa50000c
cpu MHz : 1397.397
cache size : 512 KB
I suppose it would make sense if you want to include it on top of an LLM that's already occupying most of a GPU and this could run in the limited VRAM that's left.
While I think this is indeed impressive and has a specific use case (e.g. in the embedded sector), I'm not totally convinced that the quality is good enough to replace bigger models.
With fish-speech[1] and f5-tts[2] there are at least 2 open-source models pushing the quality limits of offline text-to-speech. I tested F5-TTS with an old NVidia 1660 (6GB VRAM) and it worked ok-ish, so running it on slightly more modern hardware will not cost you a fortune and will produce MUCH higher quality, with multi-language and zero-shot support.
For Android there is SherpaTTS[3], which plays pretty well with most TTS Applications.
1: https://github.com/fishaudio/fish-speech
Also, what are those two models' VRAM requirements? This model has 15 million parameters, which might run on low-power, sub-$100 computers with up-to-date software. Your hardware was an out-of-date 6GB GPU.
For STT, Whisper is really amazing. But I miss a good TTS, and I don't mind throwing GPU power at it. But anyway, this isn't it either; this sounds worse than Kokoro.
This isn't for you, then. You should evaluate quality here based on the fact you don't need a GPU.
Back in the pre-Tacotron2 days, I was running slim TTS and vocoder models like GlowTTS and MelGAN on Digital Ocean droplets. No GPU to speak of. It cost next to nothing to run.
Since then, the trend has been to scale up. We need more models to scale down.
In the future we'll see small models living on-device. Embedded within toys and tools that don't need or want a network connection. Deployed with Raspberry Pi.
Edge AI will be huge for robotics, toys and consumer products, and gaming (i.e. world models).
I know, but it was more of a general comment. A really good TTS just isn't around yet in the OSS sphere. I looked at some of the other suggestions here, but they have too many quirks. Dia sounds great, but messages must have certain lengths etc., and it picks a random voice every time. I'd love to have something self-hosted that's as good as OpenAI's.
But it might be worth setting it up and seeing if it improves over time.
Why does it happen? I'm genuinely curious.
Looks like they are sidestepping these kinds of issues by generating the phonemes with the preprocessing stage of traditional speech synthesizers, and using the LLM only to turn those phonemes into natural-ish sounding speech. That limits how natural the model can become, but it should be able to correctly pronounce anything the preprocessing can pronounce
For scriptwriting when doing voice overs we always explicitly write out everything. So instead of 1 000 000 we would write one million or a million. This is a trivial example but if the number was 1 548 736 you will almost never be able to just read that off. However one million, five hundred and forty eight thousand, seven hundred and thirty six can just be read without parsing.
Same with URLs: W W W dot Google dot com.
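A common workaround is to normalize numbers and URLs before the text ever reaches the model. Here's a rough sketch using the num2words package; the replacements are deliberately naive and would need more care for dates, decimals, phone numbers, and so on:

    # Minimal text-normalization sketch: spell out digits and "www." before TTS.
    import re
    from num2words import num2words

    def normalize_for_tts(text: str) -> str:
        text = text.replace("www.", "W W W dot ").replace(".com", " dot com")
        return re.sub(r"\d+", lambda m: num2words(int(m.group())), text)

    print(normalize_for_tts("We served 1548736 customers via www.example.com"))
    # -> "We served one million, five hundred and forty-eight thousand,
    #     seven hundred and thirty-six customers via W W W dot example dot com"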
And yes, I added instructions along the lines of what you describe to my prompt. It's just sad that we have to. After all, LLM TTS has solved a bunch of real problems, like switching languages in a text, or foreign words. The pronunciation is better than anything we ever had. But it fails to read short numbers. I feel like that small issue could probably have been solved by doing some fine-tuning. But I actually don't really understand the tech for it, so...
This might be on purpose and part of the training data because "for example" just sounds much better than "e.g.". Presumably for most purposes, linguistic naturalness is more important than fidelity.
In any case, I’d like TTS to not take that kind of artistic freedom.
https://github.com/KittenML/KittenTTS
This is the model and GitHub page; the blog post looks very much AI-generated.
Foundational tools like this open up the possibility of one-time-payment or even free tools.
[0] https://old.reddit.com/r/LocalLLaMA/comments/1mhyzp7/kitten_...
(it does, however, explain why many of these comments are older than the thread they are now children of)
After testing this locally, it still sounds quite mechanical, and fails catastrophically for simple phrases with numbers ("easy as 1-2-3"). If the 80M model can improve on this and keep the expressiveness seen in the reddit post, that looks promising.
It would be great if the training data were released too!
The pronunciation sounds about right - I thought that was the hard part, and the model does it well. But voice timbre should be simpler to fix? Like, a simple FIR might improve it?
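For anyone curious what "a simple FIR" might look like in practice, here's an illustrative post-filter using SciPy. It's a generic band-emphasis EQ applied to the generated file, not anything from the KittenTTS project, and whether any fixed filter actually helps the timbre is exactly the open question:

    # Illustrative only: run a fixed FIR band-pass EQ over generated (mono) audio.
    # Timbre problems usually aren't just spectral tilt, so this may not help at all.
    import numpy as np
    import soundfile as sf
    from scipy.signal import firwin, lfilter

    audio, sr = sf.read("output.wav")                        # a previously generated file
    taps = firwin(101, [300, 6000], pass_zero=False, fs=sr)  # crude band-pass emphasis
    filtered = lfilter(taps, 1.0, audio)
    filtered /= max(1e-9, float(np.max(np.abs(filtered))))   # re-normalize the level
    sf.write("output_eq.wav", filtered, sr)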
I ask because their models are pretty small. Some sound awesome and there is no dependency hell like I'm seeing here.
Example: https://rhasspy.github.io/piper-samples/#en_US-ryan-high
Very well done. The quality is excellent and the technical parameters are, simply, unbelievable. Makes me want to try to embed this on a board just to see if it's possible.
For instance, try adding `np.random.shuffle(ref_s[0])` after the line `ref_s = self.voices[voice]`...
EDIT: be careful with your system volume settings if you do this.
I'm curious, but right now I don't want to install the package and run some code.
In a couple of tests, the "Male 2" voice sounds reasonable, but I've found it has problems with some groups of words, especially when played with little context. I think it's short sentences.
For example, if you try to do just "Hey gang!", it will sound something like "Chay yang". But if you add an additional sentence after that, it will sound a bit different (but still weird).
I've been using a custom AI audiobook generation program [0] with piper for quite a while now and am very excited to look at integrating kitten. Historically piper has been the only good option for a free CPU-only local model so I am super happy to see more competition in the space. Easy installation is a big deal, since piper historically has had issues with that. (Hence why I had to add auto installation support in [0])
But yeah, if it's like any of the others we'll likely see a different "model" per language down the line based on the same techniques
(but somehow LLMs handle multilingual input perfectly fine! that's a bit strange, if you think about that)
BEAT THIS!
I tried to use it...
Its Python venv has grown to 6 GB in size. The demo sentence
> "This high quality TTS model works without a GPU"
works; it takes 3s to render the audio. The audio sounds like a voice in a tin can.
I tried to have a news article read aloud and failed with
> [E:onnxruntime:, sequential_executor.cc:572 ExecuteKernel] Non-zero status code returned while running Expand node. Name:'/bert/Expand' > Status Message: invalid expand shape
If you are interested in TTS, you should explore alternatives
ls -lah /usr/bin/say
-rwxr-xr-x 1 root wheel 193K 15 Nov 2024 /usr/bin/say
Usage: M1-Mac-mini ~ % say "hello world this is the kitten TTS model speaking"
That being said, the ‘classical’ (pre-AI) speech synthesisers are much smaller than kitten, so you’re not wrong per se, just for the wrong reason.
https://project64.c64.org/Software/SAM10.TXT
Obviously it's not fair to compare these with ML models.
Running `man say` reveals that "this tool uses the Speech Synthesis manager", so I'm guessing the Apple Intelligence stuff is kicking in.
> speech synthesiser manager (the term manager was used for OS components in Classic Mac OS)
Especially fun to play with on the rainbow iMacs back then, too.
I'm surprised that phone manufacturers do not include good TTS models in their browser APIs for example. So that websites can build good audio interfaces.
I for one would love to build a text editor that the user can use completely via audio. Text input might already be feasible via the "speak to type" feature that both Android and iOS offer.
But there seems to be no good way to output spoken text without doing round-trips to a server and generate the audio there.
The interface I would like would offer a way to talk to write and then commands like "Ok editor, read the last paragraph" or "Ok editor, delete the last sentence".
It could be cool to do writing this way while walking. Just with a headset connected to a phone that sits in one's pocket.
(I've just tried it again without seeing that issue within a few pages)
> Siri voice or some older voices
You can choose "Enhanced" and "Premium" versions of voices which are larger and sound nice and modern to me. The "Serena Premium" voice I was using is over 200Mb and far better that this Show HN. It's very natural but kind of ruined by diabolical pronunciation of anything slightly non-standard which sadly seems to cover everything I read e.g. people/place names, technical/scientific terms or any neologisms in scifi/fantasy.
It's so wildly incomprehensible for e.g. Tibetan names in a mountaineering book, that you have to check the text. If the word being butchered is frequently repeated e.g. main character’s name, then it's just too painful to use.
> But there seems to be no good way to output spoken text without doing round-trips to a server and generate the audio there
As people have been pointing out, we've had mediocre TTS since the 80s. If it was a real benefit people would be using even the inadequate version.
"A new tool is stirring up excitement and debate in the programming community"
Just give me the facts without American style embellishments. You're not trying to sell me anything =)
There might well be other issues behind that, and it's unclear whether it needs any other dependencies that kitten doesn't rely on directly, like torch or torchaudio. So not 5-minutes easy, but it looks like the issues could be worked through...
For reference this is all I was trying basically:
Mix.install([:pythonx])
Pythonx.uv_init("""
[project]
name = "project"
version = "0.0.0"
requires-python = ">=3.8"
dependencies = [
"kittentts @ https://github.com/KittenML/KittenTTS/releases/download/0.1/kittentts-0.1.0-py3-none-any.whl"
]
""")
to get the above error.