Some technical details:
- Predicts conversational floor ownership, not speech endpoints
- Audio-native streaming model, no ASR dependency
- Human-timed responses without silence-based delays
- Zero interruptions at sub-100ms median latency
- In benchmarks, Sparrow-1 beats all existing models on real-world turn-taking baselines
I wrote more about the work here: https://www.tavus.io/post/sparrow-1-human-level-conversation...
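Here's a toy sketch of the control-loop shape this enables. To be clear, it's illustrative only: the FloorPredictor class, its method, and the threshold are inventions for this comment, not Sparrow-1's actual API.

    class FloorPredictor:
        """Toy stand-in: maps incoming audio frames to P(user holds the floor)."""
        def update(self, frame: bytes) -> float:
            # A real audio-native model would consume the raw frame directly;
            # this stub just returns a constant for illustration.
            return 0.9

    predictor = FloorPredictor()
    THRESHOLD = 0.35  # illustrative: below this, the user has yielded the floor

    def respond():
        print("agent takes the turn")

    def on_audio_frame(frame: bytes):
        # Called for every ~20ms frame from the microphone stream.
        p_user_holds_floor = predictor.update(frame)
        # Note there is no silence timer anywhere: the decision comes from
        # predicted floor ownership, not from how long the pause has lasted.
        if p_user_holds_floor < THRESHOLD:
            respond()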
In both conversational approaches, the AI can respond with simple acknowledgements. When prompted by the user, it could go into longer discussions and explanations.
It might be nice for the AI to quickly confirm it hears me and to give me subtle cues that it’s listening: backchannels (“yeah”) and non-verbal sounds (“mhmm”). So I can imagine having a developer assistant that feels more like working with another dev than working with a computer.
That being said, there is room for all of these modes, sometimes at once and shifting between them at different times. A lot of the time I just don’t want to talk at all.
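A rough sketch of what that backchannel behavior could look like, assuming a floor-ownership signal like Sparrow's. All names and thresholds here are hypothetical:

    import random

    BACKCHANNELS = ["yeah", "mhmm", "right"]

    def say(text: str):
        print(f"(softly) {text}")

    def take_turn():
        print("agent gives a full response")

    def on_floor_update(p_user_holds_floor: float, micro_pause: bool):
        if p_user_holds_floor > 0.8 and micro_pause:
            # The user still owns the floor but left a brief gap:
            # acknowledge without taking the turn.
            say(random.choice(BACKCHANNELS))
        elif p_user_holds_floor < 0.35:
            take_turn()  # the user has actually yielded the floor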
Could Sparrow instead be used to produce high-quality transcriptions that incorporate non-verbal cues?
Or even, use Sparrow AND an existing transcription/ASR system to augment the transcript with non-verbal cues?
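Mechanically, that augmentation could be as simple as merging the ASR's word timestamps with cue events by time. Both input formats below are made up for illustration:

    words = [  # from the ASR system: (start_sec, text)
        (0.0, "so"), (0.4, "the"), (0.6, "deploy"), (1.1, "failed"),
    ]
    cues = [  # from a turn-taking/paralinguistic model: (start_sec, label)
        (1.5, "[listener: mhmm]"), (2.0, "[pause: holding floor]"),
    ]

    merged = sorted(words + cues, key=lambda ev: ev[0])
    print(" ".join(text for _, text in merged))
    # -> so the deploy failed [listener: mhmm] [pause: holding floor]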
My main use case for OpenAI/ChatGPT at this point is realtime voice chats.
OpenAI has done a pretty great job w/ realtime (their realtime API is pretty fantastic out of the box... not perfect, but pretty fantastic, with a dead simple setup). I can have what feels like a legitimate conversation with AI and it's downright magical feeling.
That said, the output is created by OpenAI models so it's... not my favorite.
I sometimes use ChatGPT realtime to think through/work through a problem/idea, have it create a detailed summary, then upload that summary to Claude and let 4.5 Opus rewrite/audit it and come up with a better final output.
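For anyone curious what "dead simple" means here, this is roughly the websocket setup; the model name, header, and event types are from how the docs looked when I set this up, so verify against the current docs before relying on them:

    import asyncio, json, os
    import websockets  # pip install websockets

    async def main():
        url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
        headers = {
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "OpenAI-Beta": "realtime=v1",
        }
        # On websockets < 14 the kwarg is extra_headers instead.
        async with websockets.connect(url, additional_headers=headers) as ws:
            # Ask the model to speak first, then print events as they stream in.
            await ws.send(json.dumps({"type": "response.create"}))
            async for message in ws:
                event = json.loads(message)
                print(event.get("type"))  # e.g. response.audio.delta, response.done

    asyncio.run(main())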
===
ME: "OK, so, I have a question about the economics of medicine. Uh..." [pauses to gather thoughts to ask question]
GEMINI: "Sure! Medical economics is the field of..."
===
And it's aggravated by the fact that all the LLMs love to give you page-long responses before it's your turn to talk again!
But the actual flow of the best conversations is deeply semantic, and the rules are very much a "dance" or a negotiation between partners.
It also implies that being the person who has something to say but can't get into the conversation while following its semantics is akin to going to a dance in your nice clothes and not being able to find a dance partner.
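The failure mode in the exchange above is easy to state in code: a silence-based endpointer fires on any pause past a fixed timeout, and a thinking pause looks identical to an end of turn. Numbers below are illustrative:

    SILENCE_TIMEOUT = 0.7  # seconds; a "gathering my thoughts" pause easily exceeds this

    def silence_endpointer(pause_sec: float) -> bool:
        # Fires the moment silence exceeds the timeout; it has no way to
        # tell "Uh..." followed by thinking from an actual end of turn.
        return pause_sec > SILENCE_TIMEOUT

    print(silence_endpointer(1.2))  # True: the model barges in mid-thought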
The turn-taking models were evaluated in a controlled environment with no additional cascaded steps (LLM, TTS, Phoenix rendering). This matters for an apples-to-apples comparison: it keeps the rest of the pipeline's variability from influencing the measurements.
The video conversation examples show Sparrow-1 within the full pipeline. Those responses aren’t as fast as Sparrow-1 itself because the LLM, TTS, facial rendering, and network transport also take time; without Sparrow-1 they would be slower still. Sparrow-1 is what makes the responses as fast as they are, and with a faster CVI pipeline configuration responses can be as fast as 430ms in my testing.
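To make the decomposition concrete, here is an illustrative budget; the per-stage numbers are made up to show how ~430ms can break down, not measurements:

    budget_ms = {
        "turn detection (Sparrow-1)": 60,
        "LLM time-to-first-token": 150,
        "TTS first audio": 120,
        "facial rendering": 60,
        "network transport": 40,
    }
    print(sum(budget_ms.values()), "ms end-to-end")  # 430 ms with these assumed numbers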