
OpenAI launches advanced AI voice assistants with new Realtime API models

Deveshi Dabbawala

May 11, 2026

Until now, using an AI voice assistant often felt limited to simple question-and-answer exchanges. You would ask something, get a response, and then the interaction would reset. The conversation rarely maintained context, the system could not remember details from earlier exchanges, and it did not seem capable of genuinely reasoning through complex requests.

That is exactly what OpenAI’s latest release aims to change. On May 7, 2026, the company introduced three new audio models through its Realtime API: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. Together, these models push AI voice assistants beyond simple call-and-response interactions and closer to real-time conversational intelligence.

The three models

GPT-Realtime-2: This is the one that matters. It is OpenAI’s first voice model with GPT-5-class reasoning abilities, meaning it can hold a conversation, use tools mid-dialogue, recover from mistakes without getting stuck, and keep context through extended interactions. The maximum context window increased from 32K to 128K tokens, an important change for anything beyond answering single questions.

Many of these reasoning gains come from the same improvements behind GPT-5.5, including instruction retention, contextual comprehension, and multi-step reasoning. These developments are explained in more detail in our GPT-5.5 comprehensive guide, along with their influence on business AI solutions and communication processes. On the Big Bench Audio benchmark, GPT-Realtime-2 scored 15.2% better than the previous version; on Audio MultiChallenge, instruction retention rose from 36.7% to 70.8%.

GPT-Realtime-Translate: It handles live speech translation from 70+ input languages into 13 output languages without buffering or segmenting the audio. Deutsche Telekom is already testing it for cross-language customer conversations, and Vimeo is exploring live video localization.
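
To make the shape of this concrete, here is a minimal sketch of what configuring a live translation session over the Realtime API’s WebSocket interface might look like. The model name gpt-realtime-translate and the output_language field are assumptions based on this announcement, not published reference documentation.

```python
import asyncio
import json
import os

import websockets  # pip install websockets (v14+; older releases use extra_headers=)

URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-translate"  # assumed name

async def main() -> None:
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # Input language is auto-detected; choose one of the 13 output languages.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"output_language": "de"},  # field name assumed
        }))
        # From here, stream microphone audio via "input_audio_buffer.append"
        # events and play the translated audio deltas as they arrive.

asyncio.run(main())
```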

GPT-Realtime-Whisper: This model does one thing, live transcription of speech in real time with no post-processing and no waiting. The service costs $0.017 per minute (about $1.02 per hour of audio), which makes it practical for any business communication that needs a live text transcript of spoken words.
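
Below is a rough sketch of how a live transcription session might be wired up. The streaming event names (input_audio_buffer.append, input_audio_buffer.commit, and the transcription-completed event) come from the existing Realtime API; the model name gpt-realtime-whisper is an assumption from the announcement and may differ in the final reference.

```python
import asyncio
import base64
import json
import os

import websockets  # pip install websockets (v14+; older releases use extra_headers=)

URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-whisper"  # assumed name

async def transcribe(chunks: list[bytes]) -> None:
    """Stream raw PCM16 microphone frames and print finished transcripts."""
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    async with websockets.connect(URL, additional_headers=headers) as ws:
        for chunk in chunks:
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(chunk).decode("ascii"),
            }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        async for message in ws:
            event = json.loads(message)
            # Event name taken from the existing Realtime API; may differ here.
            if event.get("type") == "conversation.item.input_audio_transcription.completed":
                print(event["transcript"])
```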

What's new here

The call-and-response pattern was the defining limitation of every voice AI product before this. Previous models could sound natural. What they couldn't do was think while talking.

GPT-Realtime-2 breaks that pattern in a few specific ways worth noting:

  • Preambles: The system can offer statements such as “let me check that” or “one moment” while it works on a request, so the user never hears dead air
  • Parallel task processing: It can call more than one tool at once and narrate what it is doing (e.g., “Checking your calendar, looking that up now”) to keep the conversation moving
  • Error recovery by speaking: Instead of going silent, the system will say “I’m having trouble with that right now” and keep working rather than stalling
  • Effort level control: Developers can set reasoning.effort to “high” when the system must handle a complex task, allowing for additional reasoning (see the sketch after this list)
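
As a rough illustration of that last point, here is a minimal sketch of raising the effort level on a session over the Realtime API’s WebSocket interface. The model name gpt-realtime-2 in the URL and the exact placement of the reasoning.effort field are assumptions based on this announcement, not published reference documentation.

```python
import asyncio
import json
import os

import websockets  # pip install websockets (v14+; older releases use extra_headers=)

URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"  # assumed model name

async def main() -> None:
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # Ask the session to spend extra reasoning effort on hard tasks.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"reasoning": {"effort": "high"}},  # field shape assumed
        }))
        print(json.loads(await ws.recv()))  # server replies with a session event

asyncio.run(main())
```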

How developers are leveraging them

OpenAI has highlighted three major ways developers are beginning to use these models in real products.

  • Voice-to-action: It lets users speak naturally while the AI voice assistant handles tasks in the background. Zillow, for example, is building systems where users can search for homes within a budget, avoid certain neighborhoods, and schedule tours through a single conversation (see the tool sketch after this list).
  • Systems-to-voice: It takes the opposite approach. Instead of waiting for user input, the software proactively speaks to the user. An airline app could automatically alert travelers about a delayed flight, updated gate details, and the fastest route through the airport terminal.
  • Voice-to-voice: It focuses on real-time translation. Deutsche Telekom is already testing support systems where customers speak in their preferred language while GPT-Realtime-Translate translates the conversation live.  
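
As a rough sketch of the voice-to-action pattern, the snippet below registers a hypothetical search_homes tool on a session. The tool schema follows the existing Realtime API; the model name, the tool, and its parameters are illustrative assumptions, not Zillow’s actual implementation.

```python
import asyncio
import json
import os

import websockets  # pip install websockets (v14+; older releases use extra_headers=)

URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"  # assumed model name

SEARCH_HOMES = {  # hypothetical tool for a Zillow-style flow
    "type": "function",
    "name": "search_homes",
    "description": "Search listings by budget, excluding given neighborhoods.",
    "parameters": {
        "type": "object",
        "properties": {
            "max_price": {"type": "number"},
            "excluded_neighborhoods": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["max_price"],
    },
}

async def main() -> None:
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    async with websockets.connect(URL, additional_headers=headers) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"tools": [SEARCH_HOMES]},
        }))
        # When the model decides to call search_homes mid-conversation, it
        # emits a function-call event; the app runs the search and returns
        # the result while the model keeps talking to the user.

asyncio.run(main())
```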

This is no longer theoretical: companies have already started designing systems around these models, and now that the Realtime API is out of beta, the framework should be significantly more stable.

Source: OpenAI

Why this matters for AI voice assistant development

The truth is that almost all AI voice assistant products have suffered less from a lack of voice realism than from a lack of reasoning ability. The models could sound plausible, but they could not work through a complex problem or remember what had been said a few minutes earlier. That gap was the real constraint on what developers could build.

GPT-Realtime-2 does not remove that limitation entirely, but it pushes it much further. A 128K context window means a customer support conversation lasting 40 minutes can stay in memory throughout the interaction.  

This is the same shift driving more advanced enterprise AI systems, including platforms like GoML’s AI Matic, where contextual understanding and multi-step reasoning are becoming essential for building AI assistants that can manage real business workflows instead of just responding to isolated prompts.  

It is also worth noting that the Realtime API is out of beta as of this announcement, which means OpenAI now treats it as stable, production-level technology.