We’re sharing some of the challenges we faced building an AI video interface that has realistic conversations with a human, including getting it to under 1 second of latency.
To try it, talk to Hassaan’s digital twin: https://www.hassaanraza.com, or to our "demo twin" Carter: https://www.tavus.io
We built this because until now, we've had to adapt communication to the limits of technology. But what if we could interact naturally with a computer? Conversational video makes it possible – we think it'll eventually be a key human-computer interface.
To make conversational video effective, it has to have really low latency and conversational awareness. A fast-paced conversation between friends has ~250 ms between utterances, but if you’re talking about something more complex or with someone new, there is additional “thinking” time. So, less than 1000 ms latency makes the conversation feel pretty realistic, and that became our target.
Our architecture decisions had to balance 3 things: latency, scale, & cost. Getting all of these was a huge challenge.
The first lesson learned was that to make it low-latency, we had to build it from the ground up. We went from a team that cared about seconds to a team that counts every millisecond. We also had to support thousands of conversations happening all at once, without getting destroyed on compute costs.
For example, during early development, each conversation had to run on an individual H100 in order to fit all components and model weights into GPU memory just to run our Phoenix-1 model faster than 30fps. This was unscalable & expensive.
We developed a new model, Phoenix-2, with a number of improvements, including inference speed. We switched from a NeRF-based backbone to Gaussian Splatting for a multitude of reasons, one being the requirement that we could generate frames faster than real time, at 70+ fps on lower-end hardware. We exceeded this and focused on optimizing GPU memory and core usage so that lower-end hardware could run it all. We did other things to save on time and cost, like using streaming vs. batching, parallelizing processes, etc. But those are stories for another day.
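As a rough illustration of the streaming-over-batching idea (a simplified sketch with placeholder functions, not our actual pipeline): each stage hands off small chunks as soon as they're ready instead of waiting for the whole utterance.

```python
async def speak(text, synthesize, render_frames, send_to_client):
    # Streaming: render and ship each audio chunk as soon as TTS emits it,
    # instead of waiting for the whole utterance before video generation starts.
    # `synthesize`, `render_frames`, and `send_to_client` are placeholders here.
    for piece in text.split(". "):            # crude sentence-level chunking, illustration only
        audio = await synthesize(piece)       # hypothetical streaming TTS call
        frames = await render_frames(audio)   # hypothetical renderer
        await send_to_client(frames)
```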
We still had to lower the utterance-to-utterance time to hit our goal of under a second of latency. This meant each component (vision, ASR, LLM, TTS, video generation) had to be hyper-optimized.
The worst offender was the LLM. It didn't matter how fast the tokens per second (t/s) were; it was the time to first token (ttft) that really made the difference. That meant services like Groq were actually too slow – they had high t/s, but slow ttft. Most providers were too slow.
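To make that concrete with some illustrative numbers (not measurements of any particular provider): what the user feels is the time until the first sentence is ready to hand to TTS, and that is dominated by ttft, not throughput.

```python
# Time until a first ~30-token sentence is ready for TTS (illustrative numbers only).
def time_to_first_sentence(ttft_s, tokens_per_s, sentence_tokens=30):
    return ttft_s + sentence_tokens / tokens_per_s

print(time_to_first_sentence(ttft_s=1.0, tokens_per_s=1000))  # 1.03 s: huge t/s, slow ttft
print(time_to_first_sentence(ttft_s=0.2, tokens_per_s=100))   # 0.50 s: modest t/s, fast ttft
```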
The next worst offender was actually detecting when someone stopped speaking. This is hard. Basic solutions use time after silence to ‘determine’ when someone has stopped talking. But it adds latency. If you tune it to be too short, the AI agent will talk over you. Too long, and it’ll take a while to respond. The model had to be dedicated to accurately detecting end-of-turn based on conversation signals, and speculating on inputs to get a head start.
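A minimal sketch of the speculation idea (simplified, not our actual implementation; `llm` and `turn_detector` are placeholder objects): kick off generation on the partial transcript as soon as the turn looks likely to be over, and throw the result away if the user keeps talking.

```python
import asyncio

async def handle_probable_turn_end(partial_transcript, llm, turn_detector, audio_stream):
    # Start a speculative LLM call the moment end-of-turn looks likely...
    speculative = asyncio.create_task(llm.generate(partial_transcript))
    # ...while continuing to watch the audio for the user resuming.
    if await turn_detector.user_resumed(audio_stream, within_ms=300):
        speculative.cancel()        # false alarm: discard the head start
        return None
    return await speculative        # turn confirmed: the response is already in flight
```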
We went from 3-5 seconds to <1 second (& as fast as 600 ms) with these architectural optimizations, while running on lower-end hardware.
All this allowed us to ship with less than 1 second of latency, which we believe is the fastest out there. We have a bunch of customers, including Delphi, a professional coach and expert cloning platform. They have users that have conversations with digital twins that span from minutes, to one hour, to even four hours (!) - which is mind blowing, even to us.
Thanks for reading! Let us know what you think and what you would build. If you want to play around with our APIs after seeing the demo, you can sign up for free on our website https://www.tavus.io.
Same for Google Assistant, Siri & co.
So basically I don't see why people should be concerned only about usage by a small startup instead of being scared of the tech giants.
I assume a similar logic applies here.
Same with your face... You leave your home, other humans see your face, cameras see your face. You do not get to control who sees your face or even who captures your face when you're in public, but you can decide whether or not you consent to your face being used by an entity for profit.
We make the distinction between humans consuming information and machines because humans can't typically reproduce the original material. So like, you can go see a movie, but you can't record it with a device that would allow you to reproduce it. But what if human brains could reproduce it? Then what? Then humans could replay it to themselves all they want, and to those near them, but they wouldn't be allowed to reproduce it en masse for profit, or they'd get sued.

I think the same applies to data ingested by AI models. People care so much about what is fed in, when the same information is fed to humans around the world, which increases their knowledge and informs their future decisions, their art, their thoughts. Humans don't have to pay to see a picture of the Mona Lisa, or pictures of any other art out there, even if it'll influence their own art later on. But somehow we want to limit what is fed to models based on whether permission was given to be influenced by the work's existence.

I agree, we can't feed in protected IP, secret recipes, formulas for things that are not in the public sphere, etc. But other than that, I'm not sure how people expect to limit what a model is fed and can draw inspiration from, as long as it doesn't copy verbatim. I get that images have been generated where original material has come out, but if it's sections of, or concepts of, then it's the same as a human being influenced by it, and I honestly don't think that matters.
Then comes the idea that this is owned by a private company that's profiting from it all... That's true... But there are also open source models that compete with them. Not sure what the best answer to it all is... But to go back to the original point, if your unique voice or image isn't copied precisely for profit, then whatever... It'll get used by models, or by humans in their thoughts; you can't control what your existence affects in the world, just who gets to profit off of it.
Right?
Unless this data is never stored server-side, or is client-side encrypted, you are putting a target on your back for hackers to extract it for nefarious purposes, no matter what your terms of service say.
Like it or not, 23andMe is going down this path right now with millions of customers' genetic data, and you're going to get the same scrutiny when you ask people for personal, intimate data.
Can I run it on my computer?
If it doesn't run on my computer, what keys are you talking about? Cryptographic keys? It would be interesting to see an AI agent run on fully homomorphic encryption if the overhead weren't so huge - it would stop cloud companies from holding so much intimate, personal data on all sorts of people.
I once worked at a company where the head of security gave a talk to every incoming technical staff member and the gist was, "You can't trust anyone who says they take privacy seriously. You must be paranoid at all times." When you've been around the block enough times, you realize they were right.
You can guarantee you won't be hacked? You can guarantee that if the company becomes massively successful, you won't start selling data to third parties ten years down the road?
That's quite a commitment, guys, I am sold
/s
2) Maybe it's just degrading under load, but I didn't think either chat experience was very good. Both avatars interrupted themselves a lot, and the chat felt more like a jumbled mess of half-thoughts than anything.
3) The image recognition is pretty good though, when I could get one of the avatars to slow down long enough to identify something I was holding.
Anyway great progress, and thanks for sharing so much detail about the specific hurdles you've faced. I'm sure it'll get much better.
One thing I've noticed for a lot of these AI video agents, and I've noticed it in Meta's teaser for their virtual agents as well as some other companies, is they seem to love to move their head constantly. It makes them all a bit uncanny and feel like a video game NPC that reacts with a head movement on every utterance. It's less apparent on short 5-10s video clips but the longer the clips the more the constant head movements give it away.
I'm assuming this is, of course, a well known and tough problem to solve and is being worked on. Since swinging too far in the other direction of stiff/little head movements would make it even more uncanny. I'd love to hear what has been done to try and tackle the problem or if at this point it is an accepted "tell" so that one knows when they're speaking with a virtual agent?
I'm still most impressed by the image recognition - could clearly read even tiny or partially obscured print on products I held up and name them accordingly. Curious how you're achieving that level of fidelity without sacrificing throughput.
For a little example, when searching Google I default to a minimal set of keywords required to get the result, instead of typing full sentences. I'm sort of afraid this technology will train people to behave like that when video chatting with virtual assistants and that attitude will bleed in real life interactions in societies.
I think our human-human interaction style will “leak” into the way we interact with humanoid AI agents. Movie-Her style.
``` Even though you are not human, and an AI assistant, you will express annoyance when you are treated like a soulless computer, instead of a virtual assistant. You will appreciate "Please"s and "Thank you"s, and you will express annoyance if your user does not converse with you using full and respectful sentences. ```
https://en.wikipedia.org/wiki/INTERCAL for those who don't know.
"Now dump those results into a markdown table for me please."
The firm in the post seems to be called Tavus, and their products either “digital twins” or “Carter.”
Not meaning to be pedantic, I’m just wondering whether the “V” in the thing you’ve spoken to indicates more “voice” or “video” conversations.
Creepiness: 10/10
You have to be kidding me.
Honestly this is the future of call centers. On the surface it might seem like the video/avatar is unnecessary, and that what really matters is the speech-to-speech loop. But once the avatar is expressive enough, I bet the CSAT would be higher for video calls than voice-only.
If you just exposed all the functionality as buttons on the website, or even as AI, I'd be able to fix the problems myself!
And I say that while working for a company making call centre AIs... double ironic!
A couple have had a low threshold for "this didn't solve my answer" and directed me to a human, but others are impossible to escape.
On the other hand, I've had more success with a problem actually getting resolved by a chatbot without speaking to someone more recently... But not a lot more. Usually, I think, because I skew technical and treat Support as a last resort, I've already tried everything it wants to suggest.
Many (most?) call centers won't do much more than tell you to turn it off and on again, even when you're talking to a real person. (And for many customers, that really is all they need.)
Helping the customer is not really the goal. They provide feedback that gives valuable insight into the dysfunctional part of the company so that things can improve. Maybe even generate an investor report from it.
This feels like retro futurism, where we take old ideas and apply a futuristic twist. It feels much more likely that call centers will cease to be relevant, before this tech is ever integrated into them.
What do you think about the societal implications for this? Today we have a bit of a loneliness crisis due to a lack of human connection.
Not to be rude, but these days it's best to ask.
This is about community and building fun things. I can’t speak for all the sponsors, but what I want is to show people the Open Source tooling we work on at Daily, and see/hear what other people interested in real-time AI are thinking about and working on.
Wow, I have been attending public hackathons for over a decade, and I have never heard of something like this. That would be an outrage!
I had one employer years ago who did a 24 hour thing with a crappy prize. They invited employees to come and do their own idea or join a team, then grind with minimal sleep for a day straight. Starting on a Friday afternoon, of course, so a few hours were on the company dime while everyone else went home early.
If putting in that extra time and effort resulted in anything good, the company might even try to develop it! The employee who came up with it might even get put on that team!
....people actually attended.
So you can basically spin up a few GPUs as a baseline, allocate streams to them, then boot up a new GPU when the existing ones get overwhelmed.
It doesn't look very different from standard cloud compute management. I'm not saying it's easy, but it's definitely not rocket science either.
So if the rendering is lightweight enough, you can multiplex potentially lots of simultaneous jobs onto a smaller pool of beefy GPU server instances.
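Something like this, presumably (a toy sketch of the pack-then-scale-out idea, not anyone's production scheduler; the real work is in knowing each GPU's true capacity and booting new instances fast enough):

```python
class GpuPool:
    def __init__(self, streams_per_gpu=4):
        self.streams_per_gpu = streams_per_gpu
        self.gpus = []                            # one list of active stream ids per GPU

    def allocate(self, stream_id):
        # Pack new streams onto existing GPUs first.
        for idx, streams in enumerate(self.gpus):
            if len(streams) < self.streams_per_gpu:
                streams.append(stream_id)
                return idx
        # Every GPU is full: "boot" a new instance (stubbed out here).
        self.gpus.append([stream_id])
        return len(self.gpus) - 1

    def release(self, stream_id):
        for streams in self.gpus:
            if stream_id in streams:
                streams.remove(stream_id)
                return
```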
Still, all these GPU-backed cloud services are expensive to run. Right now it’s paid by VC money — just like Uber used to be substantially cheaper than taxis when they were starting out. Similarly everybody in consumer AI hopes to be the winner who can eventually jack up prices after burning billions getting the customers.
That said, a GPU per generation (for some operational definition of "generation") isn't uncommon, but there's a standard bag of tricks, like GPU partitioning and batching, that you can use to maximize throughput.
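For the batching half of that bag of tricks, the usual shape is a micro-batcher (a generic sketch, not tied to any particular serving framework): hold requests for a few milliseconds so one forward pass serves several of them.

```python
import queue, time

def micro_batcher(request_q, run_batch, max_batch=8, max_wait_ms=10):
    """Collect up to max_batch requests, waiting at most max_wait_ms, then run one GPU batch."""
    while True:
        batch = [request_q.get()]                 # block until at least one request arrives
        deadline = time.monotonic() + max_wait_ms / 1000
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_q.get(timeout=remaining))
            except queue.Empty:
                break
        run_batch(batch)                          # one forward pass amortized over the batch
```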
While sometimes degrading the experience, a little or by a lot, thanks to possible "noisy neighbors". Worth keeping in mind that most things are trade-offs somehow :) Mostly important for "real-time" rather than batched/async stuff, of course.
Okay found it, $0.24 per minute, on the bottom of the pricing page.
That means they can spend up to about $14.40/hour ($0.24 × 60 minutes) on GPU and still break even. So I believe that leaves a bit of room for profit.
We bill in 6 second increments, so you only pay for what you use in 6 second bins.
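If I'm reading that right, the math works out roughly like this (my own arithmetic from the numbers above, assuming each partial bin rounds up):

```python
import math

PRICE_PER_MINUTE = 0.24                    # from the pricing page
PRICE_PER_BIN = PRICE_PER_MINUTE / 10      # 6-second bins -> $0.024 per bin

def conversation_cost(seconds):
    return math.ceil(seconds / 6) * PRICE_PER_BIN

print(conversation_cost(45))    # 8 bins   -> $0.192
print(conversation_cost(3600))  # 600 bins -> $14.40 for an hour
```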
I was being generally antagonistic, saying you are going to use my voice and picture and put a cowboy hat on me and use my likeness without my consent, etc. etc. Just trying to troll the AI, laughing the whole way.
Eventually, it gets pissed off and just goes entirely silent... it would say hi, but then not respond to any of my other questions. The whole thing was creepy, let alone getting a cold shoulder from an AI... That was a weird experience with this thing and now I never want to use anything like that again lol.
It's got a "80s/90s sci-fi" vibe to it that I just find awesomely nostalgic (I might be thinking about the cafe scene in Back to the Future 2?). It's obviously only going to improve from here.
I almost like this video more than I like the "Talk to Carter" CTA on your homepage, even though that's also obviously valuable. I just happen to have people in the room with me now and can't really talk, so that is preventing me from trying it out. But I would like to see it in action, so a pre-recorded video explaining what it does is key.
It seems like that'd be a good way to reduce the compute cost, and if I know I'm talking to a robot then I don't think I'd mind if the video feed had a sort of old-film vibe to it.
Plus it would give you a chance to introduce fun glitch effects (you obviously are into visuals) and if you do the same with the audio (but not sacrificing actual quality) then you could perhaps manage expectations a bit, so when you do go over capacity and have to slow down a bit, people are already used to the "fun glitchy Max Headroom" vibe.
Just a thought. I'll check out the video chat as soon as my allegedly human Zoom call ends. :-)
Up to you, obviously, but I think you might get further being less creepy while you deal with the technical challenges, and then unveil your James Delos[0] to the investors when he's more ready.
I'm glad to see the ttft talked about here. As someone who's been deep in the AI and generative AI trenches, I think latency is going to be the real bottleneck for a bunch of use cases. 1900 tps is impressive, but if it's taking 3-5 seconds to ttft, there's a whole lot you just can't use it for.
It seems intuitive to me that once we've hit human-level tokens per second in a given modality, latency should be the target of our focus in throughput metrics. Your sub-1 second achievement is a big deal in that context.
ChatGPT is terrible at this in my experience. Always cuts me off.
In my sci-fi novel, when characters speak with their home automation system, they always have to follow the same format: "Tau, <insert request here>, please." It's that "please" at the end that solves the stopped speaking problem.
Am looking for alpha readers! (See profile for contact details.)
What's funny is that we even have a widely popularized version of this in the form of prowords[0] like "OVER" and "ROGER"
It can also function as an instructional tutor in a way that feels natural and interactive, as opposed to the clunkiness of ChatGPT. For instance, I asked it (in Spanish) to guide me through programming a REST API, and what frameworks I would use for that, and it was giving coherent and useful responses. Really the "secret sauce" that OpenAI needs to actually become integrated into everyday life.
Besides the obvious (perceived complexity and potential cost/benefit of the topic), I think the pitch of someone's voice is a good indicator of whether they want to continue their turn.
It depends a lot on the person of course. If someone continues their turn 2 seconds after the last sentence they are very likely to do that again.
The hardest part [I imagine] is to give the speaker a sense of someone listening to them.
1. Audio Generation: StyleTTS2, XTTSv2, or similar, fine-tuned on ~5 min of audio for voice cloning
2. Voice Recognition: Voice Activity Detection with Silero-VAD + Speech to Text with Faster-Whisper, to let users interrupt (see the sketch after this list)
3. Talking head animation: some flavor of wav2lip, diff2lip or LivePortrait
4. Text inference: any Groq-hosted model that is fast enough for near-real-time responses (Llama 3.1 70B or even 8B), or local inference of a quantized SLM like a 3B model on a 4090 via vLLM
5. Visual understanding of the user's webcam: either GPT-4o with vision (expensive) or a cheap and fast Vision Language Model like Phi-3-vision, LLaVA-NeXT, etc. on a second 4090
6. Prompt:
You are in a video conference with a user. You will get the user's message tagged with #Message: <message> and the user's webcam scene described within #Scene: <scene>. Only reply to what is described in <scene> when the user asks what you see. Reply casual and natural. Your name is xxx, employed at yyy, currently in zzz, I'm wearing ... Never state pricing, respond in another language etc...
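For the interrupt-detection piece (item 2), a rough sketch of how Silero-VAD and Faster-Whisper could be wired together (chunk size, model size, and the `agent_is_speaking` / `stop_agent` hooks are all placeholders):

```python
import numpy as np
import torch
from faster_whisper import WhisperModel

# Silero VAD flags speech start/end events on small chunks of 16 kHz audio.
vad_model, vad_utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(_, _, _, VADIterator, _) = vad_utils
vad = VADIterator(vad_model)

# Faster-Whisper transcribes the buffered utterance once the user stops talking.
asr = WhisperModel("small", device="cuda", compute_type="float16")

def on_audio_chunk(chunk_f32, agent_is_speaking, stop_agent, buffer):
    """Feed 512-sample chunks of 16 kHz mono float32 audio."""
    event = vad(torch.from_numpy(chunk_f32), return_seconds=True)
    if event and "start" in event and agent_is_speaking():
        stop_agent()                                   # barge-in: user talked over the avatar
    buffer.append(chunk_f32)
    if event and "end" in event:                       # end of utterance: transcribe it
        segments, _ = asr.transcribe(np.concatenate(buffer))
        buffer.clear()
        return " ".join(s.text for s in segments).strip()
    return None
```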
One recommendation: I wouldn't have the demo avatar saying things like "really cool setup you have there, and a great view out of your window". At that point, it feels intrusive.
As for what I'd build... Mentors/instructors for learning. If you could hook up with a service like mathacademy, you'd win edtech. Maybe some creatures instead of human avatars would appeal to younger people.
I think it was a combination of the intrusiveness and the notion of a machine 1) projecting (incorrect) assumptions about her attitudes/intentions onto the environment's decor, and 2) passing judgment on her. That kind of comment would be kind of impolite between strangers, like the thing that only a bad boss would feel entitled to say to an underling they didn't know very well.
Just an implementation detail, though, of course! I figure if you're able to evoke massive spookiness and subtle shades of social expectations like this, you must be onto something powerful.
At this point in the hype cycle being memorable probably outweighs being creepy!
For me, it said "are you comfortable sharing what that mark is on your forehead?" Or something like that. I said basically "I don't know, maybe a wrinkle?". Lol. Kind of confirms for me why I should continue to avoid video chats. I did look like crap in general, really tired for one thing. And I am 46, so I have some wrinkles, although I didn't know they were that obvious.
But a little bit of prompt guidance to avoid commenting on the visuals unless relevant would help. It's possible they actually deliberately put something in the prompt to ask it to make a comment just to demonstrate that it can see, since this is an important feature that might not be obvious otherwise.
Also the audio cloning sounds quite a bit different from the input on https://www.tavus.io/product/video-generation
For live avatar conversations, it's going to be interesting to see how models like OpenAI's GPT-4o, with its audio-in/audio-out WebSocket streaming API (which came out yesterday), will work with technology like this. It does look like there is likely to be a live audio transcript delta arriving at the same time that could drive a mouth articulation model, and so on.
Presumably Gaussian Splatting or a physical 3D model could run locally for optimal speed?
If I may offer some advice about potential uses beyond the predictable and trivial use in advertising, there's an army out there of elderly people who spend the rest of their life completely alone, either at home or hospitalized. A low cost version that worked like 1 hour a day with less aggressive reduction on latency to keep costs low could change the life of so many people.
I wonder if some standard set of personable mannerisms could be used to bridge the gap from 250ms to 1000ms. You don't need to think about what the user has said before you realize they've stopped talking. Make the AI agent laugh or hum or just say "yes!" before beginning its response.
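Something like this could work (a hypothetical sketch; `llm` and `play_audio` are placeholders): fire off a canned backchannel clip the instant end-of-turn is detected, so the thinking time is masked rather than eliminated.

```python
import asyncio, random

FILLERS = ["mm-hmm.wav", "right.wav", "yes.wav", "soft_laugh.wav"]   # pre-recorded clips

async def respond(user_utterance, llm, play_audio):
    # Mask the thinking gap with a short backchannel while the real reply is generated.
    filler = asyncio.create_task(play_audio(random.choice(FILLERS)))
    reply = await llm.generate(user_utterance)
    await filler                      # don't talk over our own filler clip
    await play_audio(reply)
```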
- interactive experiences with historical figures
- digital twins for celebrity/influencer fan interactions
- "live" and/or personalized advertisements
Some of our users are already building these kinds of applications.
It's the same problem: in most contexts, the video has no purpose. The only use for video is to put a face to a name/voice.
I hope my company competitors switch to AI video for sales and support. I would absolutely pay for that!
There are a lot of micro-behaviors that we're researching and building around that will continue to push the experience to be more and more natural.
I spent time solving this exact problem at my last job. The best I got was a signal that the conversation had ended, down to ~200ms of latency, through a very ugly hack.
I'm genuinely curious how others have solved this problem!
https://github.com/pipecat-ai/pipecat/blob/d378e699d23029e8ca7cea7fb675577becd5ebfb/src/pipecat/vad/vad_analyzer.py
It uses three signals as input: silence interval, speech confidence, and audio level. Silence isn't literally silence -- or shouldn't be. Any "voice activity detection" library can be plugged into this code. Most people use Silero VAD. Silence is "non-speech" time.
Speech confidence also can come from either the VAD or another model (like a model providing transcription, or an LLM doing native audio input).
Audio level should be relative to background noise, as in this code. The VAD model should actually be pretty good at factoring out non-speech background noise, so the utility here is mostly speaker isolation. You want to trigger on speech end from the loudest of the simultaneous voices. (There are, of course, specialized models just for speaker isolation. The commercial ones from Krisp are quite good.)
One interesting thing about processing audio for AI phrase endpointing is that you don't actually care about human legibility. So you don't need traditional background noise reduction, in theory. Though, in practice, the way current transcription and speech models are trained, there's a lot of overlap with audio that has been recorded for humans to listen to!
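Boiled down, the triggering logic described above looks something like this (a simplified paraphrase, not the actual pipecat code; the thresholds are made up):

```python
def end_of_turn(frames, silence_ms=700, min_confidence=0.6, level_margin_db=10):
    """frames: recent audio frames, oldest first, each with .duration_ms,
    .is_speech (VAD), .confidence (speech probability) and .level_db (vs. background)."""
    trailing_silence = 0
    for f in reversed(frames):                       # walk back from the newest frame
        speaking = (f.is_speech
                    and f.confidence >= min_confidence
                    and f.level_db >= level_margin_db)
        if speaking:
            break                                    # a confident, loud voice is still talking
        trailing_silence += f.duration_ms
    return trailing_silence >= silence_ms            # enough "non-speech" time has accumulated
```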
VAD doesn't get you enough accuracy at this level. Confidence is the key bit, how that is done is what makes the experience magic!
Does that mean you're comfortable when you digitally open a bank account (or even an Airbnb account, which became harder lately) where you also have to show your face and voice in order to prove you're who you claim to be? What's stopping the company that the bank and Airbnb outsourced this task to from ripping off your data?
You will not even have read their ToC since you want to open an account and that online verification is just an intermediate step!
No, I'd rather go with this company.
I found that the AI kept cutting me off and not leaving time in the conversation to respond. It would cut off utterances before the end and then answer the questions it had asked me as if I had asked them. I think it could have gone on talking indefinitely.
Perhaps its audio was feeding back, but macs are pretty good with that. I'll try it with headphones next time.
But yes: accuracy versus speed of interrupts is a tradeoff we're working on tuning. Sorry to hear it was cutting you off. It could have been audio feedback or the hug of death, but it shouldn't be talking over you.
What's the last thing an AI avatar will be able to, that any real human can do?
If it's a person you don't know, first ask if it matters. Is the point to get information or talk to a real person? If it's prospective romance or something, real people can still catfish and otherwise scam you. If, for whatever reason, it really matters, ask them to do a bunch of athletic tasks. Handstand. Broad jump. Throw a ball across the room. They're probably not going to scan people they digitally clone to see how they do these things, so chances are good with the techniques that exist today the vast majority of training data will be from elite athletes doing these things on television. No real person would actually be good at all tasks and will either be totally unable to do some of them or can do them but very clunkily. Do they warm up? Chances are good training data won't show that and AI clones trained by ML might not bother, but a real person would have to.
I'm having latency issues, right now it doesn't seem to respond to my utterances and then responds to 3-4 of them in a row.
It was also a bit weird that it didn't know it was at a "ranch". It didn't have any contextual awareness of how it was presenting.
Overall it felt very natural talking to a video agent.
But let's talk about the sentiment behind this. Am I the only one seeing some terrible things being done with AI in terms of time management, meetings, and written materials? Asking AI to "turn these 3 nice, concise paragraphs into a 6-page report" is a huge problem. Everyone thinks they're an amazing technical writer now, but most good writing is concise and short, and these AI monstrosities are just a waste of everyone's time.
Reform work culture instead! Why do we have cameras on our faces? Why are we making these reports? Why so many meetings? "Meeting culture" is the problem and it needs to go, but it upholds middle-management jobs and structures, so here we are asking for robot stand-ins to sit in meetings with management and get us just the 8 bullet points we need from that 1-hour meeting.
We've entered a new level of Kafkaesque capitalism where a manager puts 8 bullet points into an AI, gets a professional 4-page report, then turns that into a meeting for staff to take that report and meeting transcript and... you guessed it, turn it back into those 8 bullet points.
[0] https://arstechnica.com/information-technology/2024/08/new-a... [1] https://github.com/hacksider/Deep-Live-Cam
It's not a matter of AI, it's a matter of how Teams or Meet or Zoom allow programmatic access to the video and audio streams (the presence APIs for attending a meeting are mostly there, I think).
That is? Roughly speaking, what resource spec?
The video latency is definitely the biggest hurdle. With dedicated A100s I can get it down to <2s, but it's pricey.
Mic permissions on mobile are tricky, which might have been your issue? Note in this prototype you also need to hold the blue button down to speak.
But it's somehow awesome at the same time.
The responses for me at least were in the few second range.
It responded to my initial question fast enough but as soon as I asked a follow up it thought/kind of glitched for a few seconds before it started speaking.
I tried a few different times on a few different topics and it happened each time.
Still, really impressive stuff!!
These days I get a daily dose of amazement at what a small engineering team is able to accomplish.
“He promised me they wouldn’t support X” “He promised me they would support X”
(Dynamically grab and show actions from the candidates past that feed into the individuals viewpoint)
It would further the disconnect between what the candidates say they do and what they actually do, while making it feel like they have your best interests in mind.
Also, I have curated an AI agent market landscape map, so some of you can check it for inspiration: https://aiagentsdirectory.com/landscape
Working on subcategories right now to have even better niche discoverability.
Scroll down the page to find our pricing.
You'd have to enable that, and, similar to Zoom, it would show on the screen that it is being recorded.
You have to show the product first, or I don't actually know whether you actually have a product or are just phishing.
This turned out to be quite funny, but I would be very sad to see something like this replace human attendants at things like tech support. These days whenever I'm wading through a support channel I'm just yearning for some human contact that can actually solve my issues.
Just to clarify, the audio-to-video part (which is the part we make) adds <300ms. The total end-to-end latency for the interaction is higher, given that state of the art LLMs, TTS and STT models still add quite a bit of latency.
TLDR: Adding Simli to your voice interaction shouldn't add more than ~300ms latency.
To extend this (to a hypothetical future situation): what morality is there in a company "owning" a digitally uploaded brain?
I worry about far-future events... but since American law is based on precedent, we should be careful now about how we define/categorize things.
To be clear - I don't think this is an issue NOW... but I can't say for certain when these issues will come into play... So erring on the side of early caution seems prudent... and releasing 'ownership' before any sort of 'revolt' could happen seems wise, if a little silly at the current moment.
We don't know what sentience IS exactly, as we have a hard time defining it. We assume other people are sentient because of the ways they act. We make a judgment based on behavior, not some internal state we can measure.
And if it walks like a duck, quacks like a duck... since we don't exactly know what the duck is in this case: maybe we should be asking these questions of 'duckhood' sooner rather than later.
So if it looks like a human, talks like a human... maybe we consider that question... and the moral consequences of owning such a thing-like-a-human sooner rather than later.