OpenAI is rolling out new AI models for transcription and voice generation on its API, which the company claims improve on its previous releases. The models fit into OpenAI's broader vision of building automated systems that can accomplish tasks independently. One interpretation of that vision, according to OpenAI's Olivier Godement, is a chatbot that can speak with a business's customers.
In a briefing with TechCrunch, Godement said to expect more of these automated agents in the near future; the goal is to help customers and developers make use of agents that are helpful, accessible, and accurate. OpenAI is introducing a new text-to-speech model, "gpt-4o-mini-tts," which not only produces more realistic speech but is also easier to customize than previous models: developers can direct it to speak in different tones and styles, such as a "mad scientist" or a "mindfulness teacher."
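Based on OpenAI's published speech endpoint, steering the new model's style comes down to one extra request field. The sketch below assembles the JSON body for such a call rather than sending it (no API key required); the voice name and style instruction are illustrative, not details from the article.

```python
import json


def build_tts_request(text: str, style: str, voice: str = "alloy") -> dict:
    """Assemble the JSON body for a gpt-4o-mini-tts synthesis call.

    The `instructions` field carries free-form steering of tone and
    persona, e.g. "mad scientist" or "mindfulness teacher".
    """
    return {
        "model": "gpt-4o-mini-tts",
        "voice": voice,
        "input": text,
        "instructions": f"Speak in the style of a {style}.",
    }


body = build_tts_request("Your order has shipped.", "mindfulness teacher")
print(json.dumps(body, indent=2))
# In a live call, this body would be POSTed to
# https://api.openai.com/v1/audio/speech with an
# Authorization: Bearer <API key> header, and the
# response would contain the synthesized audio.
```

Changing only the `instructions` string switches the delivery without retraining or picking a different base voice, which is the customization the new model adds over its predecessors.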

Additionally, OpenAI offers sample voices for users to choose from, such as a "true crime-style" voice or a female "professional" voice. OpenAI's Jeff Harris emphasized the importance of letting developers tailor the voice experience and context to different scenarios. The new speech-to-text models, "gpt-4o-transcribe" and "gpt-4o-mini-transcribe," effectively replace OpenAI's long-standing Whisper transcription model. They are designed to better capture varied accents and speech patterns, even in noisy environments, with fewer errors than Whisper.
OpenAI also says the new models are less likely than Whisper to hallucinate, that is, to fabricate words or whole passages during transcription. Accuracy still varies by language: for languages such as Tamil, Telugu, Malayalam, and Kannada, the word error rate may be as high as 30%. Unlike Whisper, OpenAI does not plan to release the new transcription models openly, as they are more advanced than Whisper and not suitable for running locally; the company says it remains interested in open-source models tailored for end-user devices. Updated benchmarks and clarifications on word error rates were provided on March 20, 2025.
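For the transcription side, a request to OpenAI's documented transcriptions endpoint supplies the audio file plus a model name. The sketch below only builds the form fields for such a call (the file name and helper function are illustrative); the actual upload step is described in comments.

```python
def build_transcribe_request(filename: str, mini: bool = False) -> dict:
    """Assemble the form fields for a gpt-4o-transcribe request.

    Set mini=True to target the smaller, cheaper
    gpt-4o-mini-transcribe model instead. In a live call the audio
    file itself is attached as a multipart/form-data upload to
    https://api.openai.com/v1/audio/transcriptions.
    """
    model = "gpt-4o-mini-transcribe" if mini else "gpt-4o-transcribe"
    return {
        "model": model,
        "file": filename,           # attached as a multipart upload
        "response_format": "text",  # return a plain-text transcript
    }


fields = build_transcribe_request("support_call.wav")
print(fields["model"])  # gpt-4o-transcribe
```

Swapping in the mini model is a one-field change, which makes it easy to trade a little accuracy for lower cost in high-volume use cases like the customer-service agents Godement describes.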