Advancing voice intelligence with new models in the API
2 hours ago
- #multilingual translation
- #realtime audio API
- #AI voice models
- OpenAI introduces three new audio models: GPT‑Realtime‑2 (a voice model with GPT‑5-class reasoning), GPT‑Realtime‑Translate (a live translation model that translates 70+ input languages into 13 output languages), and GPT‑Realtime‑Whisper (streaming speech-to-text for real-time transcription).
- GPT‑Realtime‑2 is designed for live voice interactions, with features including preambles, parallel tool calls, stronger recovery, a longer 128K context, better domain understanding, controllable tone, and adjustable reasoning effort. It shows significant gains in evaluations (e.g., 15.2% higher on Big Bench Audio).
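As a rough sketch of how those features might be configured, the event below follows the general shape of the Realtime API's `session.update` message. Note the model identifier, the `reasoning_effort` field, and the example tool are all assumptions inferred from the summary above, not a confirmed schema.

```python
import json

# Hypothetical session configuration for a GPT-Realtime-2 voice session.
# Field names marked "assumed" are illustrative, not a published schema.
session_update = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime-2",            # assumed model identifier
        "instructions": "Answer in a calm, concise tone.",  # controllable tone
        "reasoning_effort": "low",            # assumed knob for adjustable reasoning
        "tools": [                            # tools the model may call in parallel
            {
                "type": "function",
                "name": "lookup_listing",     # hypothetical example tool
                "description": "Fetch details for a real-estate listing.",
                "parameters": {
                    "type": "object",
                    "properties": {"listing_id": {"type": "string"}},
                    "required": ["listing_id"],
                },
            }
        ],
    },
}

# In practice the event would be serialized to JSON and sent over the
# API's WebSocket connection.
payload = json.dumps(session_update)
```

The point of the sketch is that tone, reasoning effort, and tool availability are all session-level settings rather than per-turn prompt engineering.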
- Use cases highlight emerging patterns: voice-to-action (e.g., Zillow assistant for real estate tasks), systems-to-voice (e.g., travel app providing proactive guidance), and voice-to-voice (e.g., Deutsche Telekom for multilingual support), with companies like Priceline integrating these for end-to-end travel management.
- GPT‑Realtime‑Translate enables real-time multilingual conversations, useful in customer support, education, and media, with lower Word Error Rates and latency improvements noted in testing (e.g., 12.5% lower WER in Indian languages).
- GPT‑Realtime‑Whisper offers low-latency transcription for applications like live captions, meeting notes, and voice agents, enhancing responsiveness in business workflows such as healthcare, sales, and support.
- The Realtime API includes safety measures such as active classifiers to prevent misuse, complies with EU Data Residency requirements, and requires developers to disclose AI interactions to users. Pricing is $32 per 1M input tokens for GPT‑Realtime‑2, $0.034/min for translation, and $0.017/min for Whisper.
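At these rates, per-call costs are straightforward to estimate. A minimal sketch follows; the session lengths and token count are illustrative assumptions, not published figures.

```python
# Listed prices from the announcement.
REALTIME_2_INPUT_PER_1M_TOKENS = 32.00   # $ per 1M input tokens
TRANSLATE_PER_MIN = 0.034                # $ per minute
WHISPER_PER_MIN = 0.017                  # $ per minute

def realtime2_input_cost(tokens: int) -> float:
    """Input-token cost for a GPT-Realtime-2 session."""
    return tokens / 1_000_000 * REALTIME_2_INPUT_PER_1M_TOKENS

def per_minute_cost(minutes: float, rate: float) -> float:
    """Cost of a model metered by the minute."""
    return minutes * rate

# A hypothetical voice session consuming 50,000 input tokens:
print(f"${realtime2_input_cost(50_000):.2f}")            # $1.60
# A 30-minute live-translated call:
print(f"${per_minute_cost(30, TRANSLATE_PER_MIN):.2f}")  # $1.02
# An hour of streaming transcription:
print(f"${per_minute_cost(60, WHISPER_PER_MIN):.2f}")    # $1.02
```

Note that output-token pricing for GPT‑Realtime‑2 is not stated in the summary, so the token-based estimate covers input only.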