Here's the script I'm using: https://github.com/simonw/tools/blob/main/python/q3_tts.py
You can try it with uv (downloads a 4.5GB model on first run) like this:
uv run https://tools.simonwillison.net/python/q3_tts.py \
'I am a pirate, give me your gold!' \
-i 'gruff voice' -o pirate.wav

I shared a recording of audio I generated with that here: https://simonwillison.net/2026/Jan/22/qwen3-tts/
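For anyone wondering how a bare "uv run URL" knows what to install: the script carries a PEP 723 inline metadata block at the top, which is what uv reads before running it. A rough sketch of what that kind of header looks like - the dependency names here are illustrative guesses, not copied from q3_tts.py:

    # /// script
    # requires-python = ">=3.10"
    # dependencies = [
    #     "torch",       # guess: the real script may pin different packages
    #     "soundfile",   # guess
    # ]
    # ///
    # uv parses the block above, builds a throwaway environment with those
    # dependencies, then executes the rest of the file as an ordinary script.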
Far more terrifying is Big Tech having access to a closed version of the same models, in the hands of powerful people with a history of unethical behavior (e.g. Zuckerberg's "Dumb Fucks" comments). In fact it's a miracle and a bit ironic that the Chinese would be the ones to release a plethora of capable open source models, instead of the scraps like we've seen from Google, Meta, OpenAI, etc.
There are far more good and interesting use cases for this technology. Games will let users clone their voices and create virtual avatars and heroes. People will have access to creative tools that let them make movies and shows with their likeness. People that couldn't sing will make music.
Nothing was more scary than the invention of the nuclear weapon. And we're all still here.
Life will go on. And there will be incredible benefits that come out of this.
Except that building a nuclear weapon was not available to everyone, certainly not to dumb people whose brains have been fed with social media content.
I simply think people don't really know that the new world requires a new set of rules of engagement for anything that exists behind a screen (for now).
That said, I am likewise looking forward to the cool things to come out of this.
I was with you, until
But, yeah. Life will go on.
I'm a filmmaker. I've done photons-on-glass production for fifteen years. Meisner trained, have performed every role from cast to crew. I'm elated that these tools are going to enable me to do more with a smaller budget. To have more autonomy and creative control.
It's not so much of an issue with art for art's sake aided by AI. It's an issue with artistic work becoming unviable work.
All that is before the fact that streaming services are stuffing playlists with AI generated music to further reduce the payouts to artists.
> Yet here we are, hundreds of years later, live music is still desirable, plays still happen, and faceless voices are still around...
Yes all those things still happen, but it's increasingly untenable to make a living through it.
I presume this is due to using the base model, and not the one tuned for more expressiveness.
edit: Or more likely, the demo not exposing the expressiveness controls.
The 1.7B model was much better at ignoring slight background noise in the reference audio compared to the 0.6B model though. The 0.6B would inject some of that into the generated audio, whereas the 1.7B model would not.
Also, without FlashAttention it was dog slow on my 5090, running at 0.3X realtime with just 30% GPU usage. Though I guess that's to be expected. No significant difference in generation speed between the two models.
Overall though, I'm quite impressed. I haven't checked out all the recent TTS models, but I've tried a fair number, and this one is certainly one of the better ones I've heard in terms of voice cloning quality.
The HF demo is very similar to the GitHub demo, so easy to try out.
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --no-flash-attn --ip 127.0.0.1 --port 8000
Skipped FlashAttention since I'm on Windows and I haven't gotten FlashAttention 2 to work there yet (I found some precompiled FA3 files[3] but Qwen3-TTS isn't FA3 compatible yet).

[1]: https://github.com/QwenLM/Qwen3-TTS?tab=readme-ov-file#quick...
Haven't looked into the demo to see if it could be optimized by moving certain bits to CPU for example.
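If anyone does want to poke at it, the kind of thing I mean is just checking where each submodule lives and pushing the less latency-critical ones to the CPU. A toy PyTorch sketch (the module names here are stand-ins, not the actual Qwen3-TTS attribute names):

    import torch
    import torch.nn as nn

    # Toy stand-in; in the real demo you'd inspect the loaded Qwen3-TTS module instead.
    model = nn.ModuleDict({
        "backbone": nn.Linear(8, 8),   # imagine this is the big transformer
        "vocoder": nn.Linear(8, 8),    # imagine this is a lighter post-processing stage
    })

    # See which device each top-level piece currently sits on.
    for name, module in model.named_children():
        devices = {p.device.type for p in module.parameters()}
        print(name, devices)

    # Idea: keep the heavy part on the GPU, push the lighter part to the CPU.
    if torch.cuda.is_available():
        model["backbone"].to("cuda")
    model["vocoder"].to("cpu")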
That's not really rational considering the internet is full of examples of my voice that anyone could use though. Here's a recent podcast clip: https://www.youtube.com/watch?v=lVDhQMiAbR8&t=3006s
What am I doing wrong?
Using speaker Ryan seems to be the most consistent. I tried speaker Eric and it sounded like someone putting on a fake, exaggerated Chinese accent to mock Chinese speakers.
If it wasn't for the unpredictable level of emotions from each chunk, I'd say this is easily the highest quality TTS model I've tried.
> Read this in a calm, clear, and wise audiobook tone.
> Do not rush. Allow the meaning to sink in.
But maybe I should experiment with something more detailed. Do you have any suggestions?
The Tao Te Ching audiobook came in at 62 mins in length and it ran for 102 mins, which gives an RTF of 1.645.
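For anyone unfamiliar with RTF, it's just wall-clock generation time divided by the duration of the audio produced, so anything above 1.0 means slower than real time:

    # Real-time factor: generation time / audio duration.
    generation_minutes = 102
    audio_minutes = 62
    print(f"RTF = {generation_minutes / audio_minutes:.3f}")  # RTF = 1.645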
I do get a warning about flash-attn not being installed, which says that it'll slow down inference. I'm not sure if that feature can be supported on the 1080 and I wasn't up for tinkering to try.
Although I like the model, I don't like the leadership of that company, how closed it is, and how divisive they are in terms of politics.
Have you tested alternatives? I grabbed OpenCode and a MiniMax M2.1 subscription, even just the $10/mo one, to test with.
Result? We designed, from scratch, a spec for a slight variation of a tool I had previously written a spec for with Claude - same problem (a process supervisor tool).
Honestly, it worked great. I have played a little further with generating code (this time golang), and again, I am happy.
Beyond that, GLM 4.7 should also be great.
See https://dev.to/kilocode/open-weight-models-are-getting-serio...
It's a recent case study of vibe-coding a smaller tool with Kilo Code, comparing output from MiniMax M2.1 and GLM 4.7.
Honestly, just give it a whirl - no need to send money to companies/nations you disagree with.
$20/month is a bit of an insane ask when the most valuable thing Anthropic makes is the free Claude Code CLI.
What do you mean by this?
https://www.bloomberg.com/news/articles/2026-01-20/anthropic...
And that's the rub.
Many of us are not.
I prefer to have more open models. On the other hand China closes up their open models once they start to show a competitive edge.
Being critical of favorable actions towards a rival country shouldn't be divisive, and if it is, well, I don't think the problem is in the criticism.
Also the link doesn't mention open source? From a google search, he doesn't seem to care much for it.
I still have a small Claude account to do some code reviews. Opus 4.5 does good reviews but at this point GLM 4.7 usually can do the same code reviews.
If cost is an issue (for me it is, I pay out of pocket) go with GLM 4.7
Regardless of how productive those numbers may seem, that amount of code being published so quickly is concerning, to say the least. It couldn't have possibly been reviewed by a human or properly tested.
If this is the future of software development, society is cooked.
I spent 20 minutes yesterday trying to get GLM 4.7 to understand that a simple modal on a web page (vanilla JS and HTML!) wasn't displaying when a certain button was clicked. I hooked it up to Chrome MCP in Open Code as well.
It constantly told me that it fixed the problem. In frustration, I opened Claude Code and just typed "Why won't the button with ID 'edit' work???!"
It fixed the problem in one shot. This isn't even a hard problem (and I could have just fixed it myself but I guess sunk cost fallacy).
My experience is that all of the models seem to do a decent job of writing a whole application from scratch, up to a certain point of complexity. But as soon as you ask them for non-trivial modifications and bugfixes, they _usually_ go deep into rationalized rabbit holes into nowhere.
I burned through a lot of credits to try them all and Gemini tended to work the best for the things I was doing. But as always, YMMV.
This has been my consistent experience with every model prior to Opus 4.5, and every single open model I've given a go.
Hopefully we will get there in another 6 months when Opus is distilled into new open models, but I've always been shocked at some of the claims around open models, when I've been entirely unable to replicate them.
Hell, even Opus 4.5 shits the bed with semi-regularity on anything that's not completely greenfield for my usage, once I'm giving it tasks beyond some unseen complexity boundary.
That evening, for kicks, I brought the problem to GLM 4.7 Flash (Flash!) and it one-shot the right solution.
It's not apples to apples, because when it comes down to it LLMs are statistical token extruders, and it's a lot easier to extrude the likely tokens from an isolated query than from a whole workspace that's already been messed up somewhat by said LLM. That, and data is not the plural of anecdote. But still, I'm easily amused, and this amused me. (I haven't otherwise pushed GLM 4.7 much and I don't have a strong opinion about it.)
But seriously, given the consistent pattern of knitting ever larger carpets to sweep errors under that Claude seems to exhibit over and over instead of identifying and addressing root causes, I'm curious what the codebases of people who use it a lot look like.
I use Opus 4.5 for planning; when I reach my usage limits I fall back to GLM 4.7 just for implementing the plan. It still struggles, even though I configure GLM 4.7 as both the smaller and the heavier model in Claude Code.
China would need an architectural breakthrough to leapfrog American labs given the huge compute disparity.
A financial jackknifing of the AI industry seems to be one very plausible outcome as these promises/expectations of the AI companies start meeting reality.
1. Chinese researcher in China, to be more specific.
They need a training-multiplier breakthrough that would allow them to train SOTA models on a fraction of the compute that the US does. And this would also have to be kept a secret and be well hidden (often multiple researchers from around the world put the pieces together on a problem at around the same time, so the breakthrough would have to be something pretty difficult to discover for the greatest minds in the field) to prevent the US from using it to multiply their model strength with their greater compute.
1. e.g. select any DeepSeek release, and read the accompanying paper
Your 'cope' accusation has no place here, I have no dog in the race and do not need to cope with anything.
I will rephrase my statement and continue to stand by it: "Denying the volume of original AI research being done by China - a falsifiable metric - betrays some level of cope."
You seem to agree on the fact that China has surpassed the US. As for quality, I'll say expertise is a result of execution. At some point in time during off-shoring, the US had qualitatively better machinists than China, despite manufacturing volumes. That is no longer the case today - as they say, cream floats to the top, and that holds true for a pot or an industrial-sized vat.
Now, maybe the results were cherrypicked. i know everyone else who has released one of these cherry-picks which samples to publish. However, this is the first time i've considered it plausible to use AI TTS to remaster old radio plays and the like, where a section of audio is unintelligible but can be deduced from context, like a tape glitch where someone says "HEY [...]LAR!" and it's an episode of Yours Truly, Johnny Dollar...
I have dozens of hours of audio of like Bob Bailey and people of that era.
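The splicing side of that is the easy part; a rough sketch with pydub, assuming the replacement line has already been generated with whatever voice-cloning TTS you like (file names and timestamps below are made up for illustration; pydub needs ffmpeg installed for mp3 I/O):

    from pydub import AudioSegment

    # Original episode plus a short replacement line generated by a voice-cloning TTS.
    episode = AudioSegment.from_file("episode.mp3")
    patch = AudioSegment.from_file("patched_line.wav")

    # Say the tape glitch runs for 2.5 seconds starting at 12:34 (made-up numbers).
    start_ms = (12 * 60 + 34) * 1000
    end_ms = start_ms + 2500

    # Cut out the damaged span, drop the generated line in, and export the result.
    repaired = episode[:start_ms] + patch + episode[end_ms:]
    repaired.export("episode_repaired.mp3", format="mp3")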
besides, they know what side their bread is buttered on. I feel like this is almost not the real announcement; or, the engineers that wrote this up and did the demos just ran it that way. The normal speech voices are fine (lower down the page than the anime ones). i agree that the first few are very infantile. I'll change that word if i can think of a better one.
Observe, original: https://www.youtube.com/watch?v=YiRcOVDAryM
my edit (took about an hour, if memory serves, to set up. forgot render time...): https://www.youtube.com/watch?v=xazubVJ0jz4
i say "was [...] software" because the last 2 times i've tried to use it, it did imperceptible cleanup, making it worthless. Anyhow, all my radio plays are from OTRR, i think.

Audio.Restoration.DeNoise.DeNoiseLF.2.8.3_WiN.OSX is a more recent version, i think
p.s. are you a "dude named Ben"?
This is needed for processing an indie game's voice recordings, where the voice actors weren't native speakers and had some accent.
Parakeet is pretty good, but there are times it struggles. Would be interesting to see how Qwen compares once Handy has it in.
There are some samples. If you have a GPU you might want to fork and improve this; otherwise it's slow, but usable on CPU as well.
And if you ask me, I think these models were trained on tween fiction podcasts. (My kids listen to a lot of these and dramatic over-acting seems to be the industry standard.)
Also, their middle-aged adult with an "American English" accent doesn't sound like any American I've ever met. More like a bad Sean Connery impersonator.
100% I was thinking the same thing.
Edit: "Cross-lingual Voice Clone" https://qwen.ai/blog?id=qwen3tts-0115#voice-clone