Gemini 3 Pro Preview has superlative audio listening comprehension. If I send it a recording of myself in a car, with me talking in English, another passenger talking to the driver in Spanish, and the radio playing in Portuguese, Gemini can parse all four audio streams, plus other background noises, and give a translation for each one, including figuring out which voice belongs to which person and what everyone's names are (if that's possible to figure out from the conversation).
I'm sure it would have superlative audio generation capabilities too, if such a feature were enabled.
With the prompt "WWII Plane Japan Kawasaki Ki-61 flying by, propeller airplane", looping turned on, and the duration set manually to 30 sec instead of auto (the duration predictor fails pretty badly on this prompt; you need to be logged in to set the duration manually), it works pretty well. No idea if it's close to that specific airplane, but it sounds like a WW2 plane to me.
Audio really is a blue ocean compared to text/image ML. The barriers aren't primarily compute or data - they're knowledge. You can't scale your way out of bad preprocessing or codec choices.
When 4 researchers can build Moshi from scratch in 6 months while big labs consider voice "solved," it shows we're still in a phase where domain expertise matters more than scale. There's an enormous opportunity here for teams who understand both ML and signal processing fundamentals.
[0] Which I do agree with, particularly if you need it to be higher quality or labeled in a particular way: the Fisher database mentioned is narrowband and 8-bit mu-law quantized, and while there are timestamps, they are not accurate enough for millisecond-level active-speech determination. It is also fewer than 6000 conversations totaling less than 1000 hours (x2 speakers, but each is silent over half the time, a fact that can also throw a wrench in some standard algorithms, like volume normalization), and it is English-only.
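To make the volume-normalization point concrete: a quick sketch (my own, not anything from the Fisher docs) of why files that are more than half silence break naive normalization. RMS measured over the whole file is dragged down by the silent samples, so a gain chosen to hit a target level over-amplifies the actual speech; the energy gate below is a crude stand-in for a real VAD and its threshold is illustrative.

    #include <math.h>
    #include <stddef.h>

    /* RMS over all samples, silence included. */
    double rms_all(const float *x, size_t n) {
        double acc = 0.0;
        for (size_t i = 0; i < n; i++) acc += (double)x[i] * x[i];
        return sqrt(acc / (double)(n ? n : 1));
    }

    /* RMS over "active" samples only, using a crude energy gate
       (a real pipeline would use a proper VAD here). */
    double rms_active(const float *x, size_t n, float gate) {
        double acc = 0.0;
        size_t count = 0;
        for (size_t i = 0; i < n; i++)
            if (fabsf(x[i]) > gate) { acc += (double)x[i] * x[i]; count++; }
        return count ? sqrt(acc / (double)count) : 0.0;
    }

    /* Gain to reach a target RMS. On a file that is >50% silence,
       target / rms_all() yields a much larger gain than
       target / rms_active(), i.e. the speech gets over-amplified. */
    double normalization_gain(double measured_rms, double target_rms) {
        return measured_rms > 0.0 ? target_rms / measured_rms : 1.0;
    }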
The reason it matters is that soon, any time somebody sees a comment they don't like or think is stupid, they'll just say, "eh a bot said that," and totally dilute the rest of the discussion, even if the comment was real.
Edit: 2 day old account posting stuff that doesn't pass the sniff test. Hmmmm... baited by a bot?
The last example I've seen was at one large company, done by a developer lacking audio/DSP experience: they used ffmpeg's resampling lib, but after every 10 ms audio frame processed by the resampler they'd invoke flush(), just for the convenience of having the same number of input and output buffers ... :)
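Roughly what that anti-pattern looks like with libswresample (a sketch, not their actual code; context setup is omitted and buffer handling is simplified). Flushing is done by passing NULL input to swr_convert(), which drains the resampler's internal filter state, so doing it after every 10 ms frame resamples each chunk in isolation and leaves discontinuities at every frame boundary:

    #include <stdint.h>
    #include <libswresample/swresample.h>

    /* WRONG: flush after every 10 ms chunk so input/output buffer counts
       line up. The flush (NULL input) drains the filter delay line, so the
       next chunk starts from a cold filter and the seams are audible. */
    int resample_chunk_with_flush(SwrContext *swr,
                                  uint8_t **out, int out_capacity,
                                  const uint8_t **in, int in_samples)
    {
        int n = swr_convert(swr, out, out_capacity, in, in_samples);
        if (n < 0)
            return n;
        return swr_convert(swr, out, out_capacity, NULL, 0);
    }

    /* BETTER: just keep feeding chunks, accept that the output count per
       call varies, and flush once at end of stream. */
    int resample_chunk_streaming(SwrContext *swr,
                                 uint8_t **out, int out_capacity,
                                 const uint8_t **in, int in_samples)
    {
        return swr_convert(swr, out, out_capacity, in, in_samples);
    }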
They'll optimize down the stack once they've sucked all the oxygen out of the room.
Little players won't be able to grow through the ceiling the giants create.
NVIDIA's basically the galaxy's most successful arms dealer, selling to both sides while convincing everyone they're just "enabling innovation." The real rebels would be training audio models on potato-patched RP2040s. Brave souls, if they exist.
..plenty of money to be made elsewhere
Wisprflow does not create its own models, but I know Willow Voice did do extensive fine-tuning to improve the quality and speed of their transcription models, so you may count them.
You get all kinds of weird noises and random words. Jack is often apologetic about the problem you are having with the Hyperion xt5000 smart hub.
Audio is too niche and porn is too ethically messy and legally risky.
There's also music, which the giants don't touch either. Suno is actually really impressive.