> Under the hood, the Realtime API lets you create a persistent WebSocket connection to exchange messages with GPT-4o. The API supports function calling, which makes it possible for voice assistants to respond to user requests by triggering actions or pulling in new context.
-
This sounds really interesting, and I see great use cases for it. However, I'm wondering if the API provides a text transcription of both the input and output so that I can store the data directly in a database without needing to transcribe the audio separately.
-
Edit: Apparently it does.
It sends `conversation.item.input_audio_transcription.completed` [0] events when the input transcription is done (I guess a couple of them in real-time)
and `response.done` [1] with the response text.
[0] https://platform.openai.com/docs/api-reference/realtime-serv...
[1] https://platform.openai.com/docs/api-reference/realtime-serv...
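For what it's worth, a rough sketch of listening for those two events and stashing the transcripts (assuming the `websockets` package and the beta Realtime endpoint/headers; the exact shape of the output transcript inside `response.done` is an assumption on my part and may differ):

    import asyncio, json, os
    import websockets

    URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    HEADERS = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }

    def save_to_db(role, text):
        print(role, text)  # stand-in for your actual storage layer

    async def log_transcripts():
        # extra_headers works on older websockets releases; newer ones call it additional_headers
        async with websockets.connect(URL, extra_headers=HEADERS) as ws:
            async for raw in ws:
                event = json.loads(raw)
                if event["type"] == "conversation.item.input_audio_transcription.completed":
                    save_to_db("user", event["transcript"])
                elif event["type"] == "response.done":
                    # response.done carries the full response; audio content parts
                    # should include a text transcript alongside the audio
                    for item in event["response"]["output"]:
                        for part in item.get("content", []):
                            if "transcript" in part:
                                save_to_db("assistant", part["transcript"])

    asyncio.run(log_transcripts())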
Outputs are sent as text + audio, but you'll get the text very quickly and the audio a bit slower, and of course the audio takes time to play back. The text also doesn't currently have timing cues, so it's up to you if you want to try to play it "in sync". If the user interrupts the audio, you need to send back a truncation event so it can roll its own context back, and if you never presented the text to the user you'll need to truncate it on your side as well, to ensure your storage isn't polluted with fragments the user never heard.
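For the truncation part, the client event is `conversation.item.truncate` per the API reference; a hedged fragment, assuming an open `ws` connection like the sketch above, and that `interrupted_item_id` and `ms_actually_played` are your own bookkeeping:

    # Tell the server how much of the assistant audio the user actually heard,
    # so the model's context gets rolled back to match.
    await ws.send(json.dumps({
        "type": "conversation.item.truncate",
        "item_id": interrupted_item_id,   # assistant message that was cut off
        "content_index": 0,               # which content part of that item
        "audio_end_ms": ms_actually_played,
    }))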
As for radiologists, I'm not sure that image model fine-tuning + LLMs is all we need to get there.
Your example makes me think it will merely move QA into essentially providing countless cases and then updating them over time to improve the AI's data.
And is it really gonna be cheaper than human support?
What's gonna happen when we find out (see how impossible it already is to reach a human when interacting with many companies) that this is gonna bring costs down (maybe, eventually), but revenue down too, because pissed-off customers will move elsewhere?
Doesn’t matter what the computer becomes — AI, AGI or God-incarnate — there’s always a role between that and the end-user. That role today is called software engineer. Tomorrow, it’ll be called whatever. Perhaps paid the same or less or more. Doesn’t matter.
There’s always an intermediary to deal with the shit.
Hmm, I wonder if that’s the roles priests & the clergy have been playing all this while. Except, maybe humanity is the shit God (as an end user) has to deal with
But separate from that, you typically want some application-specific storage of the current "conversation" in a very different format than raw request logging.
https://openai.com/careers/full-stack-software-engineer-leve...
Their web UI was a glitchy mess for over a year. Rollouts of just data are staggered and often delayed. They still can’t adhere to a JSON schema accurately, even though others have figured this out ages ago. There are global outages regularly. Etc…
I’m impressed by some aspects of their rapid growth, but these are financial achievements (credit due Sam) more than technical ones.
1. For a Linux user, you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem. From Windows or Mac, this FTP account could be accessed through built-in software.
2. It doesn't actually replace a USB drive. Most people I know e-mail files to themselves or host them somewhere online to be able to perform presentations, but they still carry a USB drive in case there are connectivity problems. This does not solve the connectivity issue.
3. It does not seem very "viral" or income-generating. I know this is premature at this point, but without charging users for the service, is it reasonable to expect to make money off of this?
> They still can’t adhere to a JSON schema accurately
Strict mode for structured output fixes at least this though.
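For reference, strict mode is opt-in per request via `response_format`; a minimal sketch with the Python SDK (the schema and model snapshot here are just illustrative):

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": "File a ticket: checkout page 500s on submit."}],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "ticket",
                "strict": True,  # constrained decoding: output must match the schema
                "schema": {
                    "type": "object",
                    "properties": {
                        "category": {"type": "string"},
                        "priority": {"type": "integer"},
                    },
                    "required": ["category", "priority"],
                    "additionalProperties": False,
                },
            },
        },
    )
    print(resp.choices[0].message.content)  # guaranteed to parse against the schema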
It depends on how you define startup but I don't think they will surpass Uber, ByteDance, or SpaceX until this next rumored funding round.
I'm excluding companies that have raised funding post-IPO since that's an obvious cutoff for startups. The other cutoff being break-even, in which case Uber has raised well over $20 billion.
Why not use an array of key value pairs if you want to maintain ordering without breaking traditional JSON rules?
[ {"key1": "value1"}, {"key2": "value2"} ]
Most tools preserve the order. I consider it to be an unofficial feature of JSON at this point. A lot of people think of it as a soft guarantee, but it’s a hard guarantee in all recent JavaScript and Python versions. There are some common places where it’s lost, like JSONB in Postgres, but it’s good to be aware that this unofficial feature is commonly being used.
[1]: https://platform.openai.com/docs/guides/structured-outputs/s...
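Concretely, in recent Python the standard library round-trips key order, and modern JavaScript engines do the same for string keys:

    import json

    s = '{"z": 1, "a": 2, "m": 3}'
    d = json.loads(s)      # dicts preserve insertion order since Python 3.7
    print(list(d))         # ['z', 'a', 'm']
    print(json.dumps(d))   # {"z": 1, "a": 2, "m": 3} -- order survives the round trip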
It's nice to have a solution from OpenAI, given how much they use a variant of this internally. I've tried like 5 YC startups and I don't think anyone's really solved this.
There's the very real risk of vendor lock-in but quickly scanning the docs seems like it's a pretty portable implementation.
- Introducing the Realtime API: https://openai.com/index/introducing-the-realtime-api/
- Introducing vision to the fine-tuning API: https://openai.com/index/introducing-vision-to-the-fine-tuni...
- Prompt Caching in the API: https://openai.com/index/api-prompt-caching/
- Model Distillation in the API: https://openai.com/index/api-model-distillation/
Docs updates:
- Realtime API: https://platform.openai.com/docs/guides/realtime
- Vision fine-tuning: https://platform.openai.com/docs/guides/fine-tuning/vision
- Prompt Caching: https://platform.openai.com/docs/guides/prompt-caching
- Model Distillation: https://platform.openai.com/docs/guides/distillation
- Evaluating model performance: https://platform.openai.com/docs/guides/evals
Additional updates from @OpenAIDevs: https://x.com/OpenAIDevs/status/1841175537060102396
- New prompt generator on https://playground.openai.com
- Access to the o1 model is expanded to developers on usage tier 3, and rate limits are increased (to the same limits as GPT-4o)
Additional updates from @OpenAI: https://x.com/OpenAI/status/1841179938642411582
- Advanced Voice is rolling out globally to ChatGPT Enterprise, Edu, and Team users. Free users will get a sneak peek of it (except in the EU).
So regular paying users from EU are still left out in the cold.
(There is an exemption for "AI systems placed on the market strictly for medical or safety reasons, such as systems intended for therapeutical use", but Advanced Voice probably doesn't benefit from that exemption.)
So it seems to be possible to use this in a personal context.
https://artificialintelligenceact.eu/recital/44/
> Therefore, the placing on the market, the putting into service, or the use of AI systems intended to be used to detect the emotional state of individuals in situations related to the workplace and education should be prohibited. That prohibition should not cover AI systems placed on the market strictly for medical or safety reasons, such as systems intended for therapeutical use.
But then I don't get the spirit of that limitation, as it should be just as applicable to TVs listening in on your conversations and trying to infer your emotions. Then again, I guess that for these cases there are other rules in place which prohibit doing this without the explicit consent of the user.
> I think this
> I don't get the spirit of that limitation
> I guess that
In a nutshell, this uncertainty is why firms are going to slow-roll EU rollout of AI and, for designated gatekeepers, other features. Until there is a body of litigated cases to use as reference, companies would be placing themselves on the hook for tremendous fines, not to mention the distraction of the executives.
Which, not making any value judgement here, is the point of these laws. To slow down innovation so that society, government, regulation, can digest new technologies. This is the intended effect, and the laws are working.
I use those words because I've never read any of the points in the EU AIA.
I would wager this -- OpenAI lawyers have looked at the situation. They have not been able to credibly say "yes, this is okay", and so management makes the decision to wait. Obviously, they would prefer to compete in Europe if it were a no-brainer decision.
It may be possible that the path to get to "yes, definitely" includes some amount of discussion with the relevant EU authorities and/or product modification. These things will take time.
The two examples shown at DevDay are the things I don't really want to do in the future. I don't want to talk to anybody, and I don't want to wait for their answer in a human form. That's why I order my food through an app or WhatsApp, or why I prefer to buy my tickets online. In the rare case I call to order food, it's because I have a weird question or a weird request (can I pick it up in X minutes? Can you prepare it in a different way?)
I hope we don't start seeing apps using conversations as interfaces, because it would be really horrible (leaving aside the fact that a lot of people don't know how to express themselves, plus different accents, sound environments, etc.), while clicking or typing works almost the same for everyone (at least it's much more normalized than talking).
The market for realistic voice agents is huge, but also very fragmented. Customer service is the obvious example, large companies employ tens of thousands of customer service phone agents, and a large # of those calls can be handled, at least in part, with a sufficiently smart voice agent.
Sales is another, just calling back leads and checking in on them. Voice clone the original sales agent, give the AI enough context about previous interactions, and a lot of boring legwork can be handled by AI.
Answering simple questions is another great example: restaurants get slammed with calls during their busiest hours (seriously, getting ahold of restaurant staff during peak hours can be literally impossible!), so having an AI that can pick up the phone and answer basic questions (what's in certain dishes, what is the current wait time, what is the largest group that can be seated together, etc.) is super useful.
A lot of small businesses with only a single employee can benefit from having a voice AI assistant picking up the phone and answering the easy everyday queries and then handing everything else off to the owner.
The key is that these voice AIs should be seamless, you ask your question, they answer, and you ideally don't even know it is an AI.
The AI isn't changing that equation at all.
1. AI instructions are legible. There is no record of asking John to sell the customer things they don't need. There is a record if the AI does it.
2. AI interactions are legible. If a sales guy tells you something false on a zoom call, there is no record of it. If the AI does, there is a record.
"What are today's most important tasks? Anything I forgot before I log off? Can you write John to check the blocking PR? Let's fix this bug together".
> Audio in the Chat Completions API will be released in the coming weeks, as a new model `gpt-4o-audio-preview`. With `gpt-4o-audio-preview`, developers can input text or audio into GPT-4o and receive responses in text, audio, or both.
> The Realtime API uses both text tokens and audio tokens. Text input tokens are priced at $5 per 1M and $20 per 1M output tokens. Audio input is priced at $100 per 1M tokens and output is $200 per 1M tokens. This equates to approximately $0.06 per minute of audio input and $0.24 per minute of audio output. Audio in the Chat Completions API will be the same price.
As usual, OpenAI failed to emphasize the real game-changer feature at their Dev Day: audio output from the standard generation API.
This has severe implications for text-to-speech apps, particularly if the audio output style is as steerable as the gpt-4o voice demos.
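As a back-of-the-envelope check on the quoted prices, the per-minute figures imply roughly how many audio tokens a minute of speech turns into:

    # Implied audio tokens per minute, from the quoted pricing above.
    input_cost_per_token = 100 / 1_000_000    # $100 per 1M audio input tokens
    output_cost_per_token = 200 / 1_000_000   # $200 per 1M audio output tokens

    print(0.06 / input_cost_per_token)    # ~600 input tokens per minute of audio
    print(0.24 / output_cost_per_token)   # ~1200 output tokens per minute of audio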
That is substantially more expensive than TTS (text-to-speech) which already is quite expensive.
If OpenAI decides to fully ignore ethics and dive deep into voice cloning, then all bets are off.
The configuration of the session accepts a parameter (modalities) that could restrict the response only to text. See it in https://platform.openai.com/docs/api-reference/realtime-clie....
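So something like this on an open Realtime session should keep responses text-only (hedged sketch, assuming a `ws` WebSocket connection as in the linked reference):

    # Restrict subsequent responses to text only.
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {"modalities": ["text"]},
    }))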
I guess this is using their "old" turn-based voice system?
Audio output is in the API now, but you lose image input. Why? That's a shame.
"10:30 They started with some demos of o1 being used in applications, and announced that the rate limit for o1 doubled to 10000 RPM (from 5000 RPM) - same as GPT-4 now."
If you squint at it, this is what chat bots do now, except with a “terminal” style text UI instead of a GUI or true Web UI.
The first incremental step had already been taken: pretty-printing of maths and code. Interactive components are a logical next step.
It would be a mere afternoon of work to write a web server where the dozens of “controllers” are replaced with a single call to an LLM API that simply sends the previous page's HTML and the raw HTTP request, headers and all.
“Based on the previous HTML above and the HTTP request below, output the response HTML.”
Just sprinkle on some function calling and a database schema, and the site is done!
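A rough sketch of that afternoon of work, assuming Flask and the OpenAI Python SDK (the prompt wording, model name, and single-page "state" are illustrative, not a real recipe):

    from flask import Flask, request
    from openai import OpenAI

    app = Flask(__name__)
    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    previous_html = "<html><body><h1>Home</h1></body></html>"  # seed page

    PROMPT = ("Based on the previous HTML above and the HTTP request below, "
              "output only the response HTML, with no commentary.")

    @app.route("/", defaults={"path": ""}, methods=["GET", "POST"])
    @app.route("/<path:path>", methods=["GET", "POST"])
    def llm_controller(path):
        global previous_html
        raw_request = (f"{request.method} /{path} HTTP/1.1\n"
                       f"{request.headers}"
                       f"{request.get_data(as_text=True)}")
        completion = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user",
                       "content": f"{previous_html}\n\n{PROMPT}\n\n{raw_request}"}],
        )
        previous_html = completion.choices[0].message.content  # the LLM *is* the controller
        return previous_html

    if __name__ == "__main__":
        app.run(port=8000)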
Fine-tuning the model based on example pages and responses might be all that’s required for a sufficient level of consistency.
An immediate use-case might be prototyping in-place.
If you have an existing site, you can capture the request-response pairs and train the AI on it, annotated with the spec docs. Then tell it to implement some new functionality and it should be able to. Just route a subset of the site to the AI instead of the normal controllers.
One could “design” new components and functionality in English and try it instantly with no compilation or deployment steps!