OpenAI adds live voice, translation, and transcription models
OpenAI has added three new audio models to its API: GPT-Realtime-2 for real-time voice intelligence, GPT-Realtime-Translate for live translation from 70+ input languages into 13 output languages, and GPT-Realtime-Whisper for low-latency streaming speech-to-text transcription. The models are meant to help developers build more natural, intelligent, and responsive voice applications through better reasoning, context handling, and real-time processing.
Key Takeaways
- GPT-Realtime-2 is OpenAI’s first voice model with GPT-5-class reasoning, and its context window grows from 32K to 128K to support longer sessions.
- GPT-Realtime-Translate supports more than 70 input languages and 13 output languages, and OpenAI cites use cases including customer support, cross-border sales, education, events, media, and creator platforms.
- GPT-Realtime-Whisper is a streaming transcription model that turns speech into text live as the speaker talks.
- OpenAI says the Realtime API supports EU Data Residency and includes active classifiers that can halt sessions flagged for harmful content.
- Pricing starts at $32 per 1M audio input tokens for GPT-Realtime-2, $0.034 per minute for GPT-Realtime-Translate, and $0.017 per minute for GPT-Realtime-Whisper.
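To put the quoted rates in perspective, here is a minimal sketch of the arithmetic in Python. It uses only the three starting prices listed above; the function names and the 250,000-token figure are hypothetical, chosen for illustration, and the sketch ignores any other charges (such as output tokens) that the summary does not cover.

```python
from decimal import Decimal

# Starting rates quoted in the summary above, in USD.
REALTIME_2_PER_1M_AUDIO_INPUT_TOKENS = Decimal("32")
TRANSLATE_PER_MINUTE = Decimal("0.034")
WHISPER_PER_MINUTE = Decimal("0.017")


def per_minute_cost(rate_per_minute: Decimal, minutes: int) -> Decimal:
    """Cost of a per-minute-billed model for a given number of minutes."""
    return rate_per_minute * Decimal(minutes)


def audio_input_token_cost(tokens: int) -> Decimal:
    """Cost of GPT-Realtime-2 audio input tokens at the quoted starting rate."""
    return REALTIME_2_PER_1M_AUDIO_INPUT_TOKENS * Decimal(tokens) / Decimal(1_000_000)


# One hour of audio on each per-minute model.
translate_hour = per_minute_cost(TRANSLATE_PER_MINUTE, 60)
whisper_hour = per_minute_cost(WHISPER_PER_MINUTE, 60)

# Hypothetical token count, used only to show the per-token arithmetic.
sample_token_cost = audio_input_token_cost(250_000)

print(f"Translate, 60 min: ${translate_hour:.2f}")
print(f"Whisper, 60 min: ${whisper_hour:.2f}")
print(f"Realtime-2, 250,000 audio input tokens: ${sample_token_cost:.2f}")
```

At these rates an hour of live translation comes to about $2.04 and an hour of streaming transcription to about $1.02. The token-based figure depends on how many audio tokens a session actually consumes, which the summary does not state.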
Why It Matters
OpenAI is pushing real-time audio beyond simple turn-taking toward models that can reason, translate, and transcribe while a conversation is still in progress. That matters for voice interfaces in the workflows the company explicitly lists, such as captions, live translation, support, and other spoken interactions. The ecosystem signal is that OpenAI is positioning voice as a production API layer, not just a demo feature, with named examples from Zillow, Deutsche Telekom, Priceline, Vimeo, and BolnaAI. What to watch next: whether developers adopt GPT-Realtime-2’s larger context window and adjustable reasoning modes in shipped products, not just in the Playground.
Read full article at openai.com
