OpenAI is rolling out new AI models for transcription and voice generation on its API, which the company claims improve on its previous releases. The models fit into OpenAI's broader vision of building automated systems that can accomplish tasks independently. One interpretation of that vision, according to OpenAI's Olivier Godement, is a chatbot that can speak with a business's customers.
In a briefing with TechCrunch, Godement said to expect more of these automated agents in the near future; the goal is to help customers and developers make use of agents that are helpful, accessible, and accurate. OpenAI is introducing a new text-to-speech model, "gpt-4o-mini-tts," which not only produces more realistic speech but is also easier to customize than previous models: developers can direct it to speak in different tones and styles, such as a "mad scientist" or a "mindfulness teacher."
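Based on OpenAI's published speech endpoint, steering the new model's style comes down to one extra request field. The sketch below assembles the JSON body for such a call rather than sending it (no API key required); the voice name and style instruction are illustrative, not details from the article.

```python
import json


def build_tts_request(text: str, style: str, voice: str = "alloy") -> dict:
    """Assemble the JSON body for a gpt-4o-mini-tts synthesis call.

    The `instructions` field carries free-form steering of tone and
    persona, e.g. "mad scientist" or "mindfulness teacher".
    """
    return {
        "model": "gpt-4o-mini-tts",
        "voice": voice,
        "input": text,
        "instructions": f"Speak in the style of a {style}.",
    }


body = build_tts_request("Your order has shipped.", "mindfulness teacher")
print(json.dumps(body, indent=2))
# In a live call, this body would be POSTed to
# https://api.openai.com/v1/audio/speech with an
# Authorization: Bearer <API key> header, and the
# response would contain the synthesized audio.
```

Changing only the `instructions` string switches the delivery without retraining or picking a different base voice, which is the customization the new model adds over its predecessors.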

Additionally, OpenAI offers sample voices for users to choose from, such as a "true crime-style" voice or a female "professional" voice. OpenAI's Jeff Harris emphasized the importance of letting developers tailor the voice experience and context to different scenarios. The new speech-to-text models, "gpt-4o-transcribe" and "gpt-4o-mini-transcribe," effectively replace OpenAI's long-standing Whisper transcription model. They are designed to better capture varied accents and speech patterns, even in noisy environments, with fewer errors than Whisper.
OpenAI also says the new models are less likely than Whisper to hallucinate, that is, to fabricate words or whole passages during transcription. Accuracy still varies by language: for languages such as Tamil, Telugu, Malayalam, and Kannada, the word error rate may be as high as 30%. Unlike Whisper, OpenAI does not plan to release the new transcription models openly, as they are more advanced than Whisper and not suitable for running locally; the company says it remains interested in open-source models tailored for end-user devices. Updated benchmarks and clarifications on word error rates were provided on March 20, 2025.
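For the transcription side, a request to OpenAI's documented transcriptions endpoint supplies the audio file plus a model name. The sketch below only builds the form fields for such a call (the file name and helper function are illustrative); the actual upload step is described in comments.

```python
def build_transcribe_request(filename: str, mini: bool = False) -> dict:
    """Assemble the form fields for a gpt-4o-transcribe request.

    Set mini=True to target the smaller, cheaper
    gpt-4o-mini-transcribe model instead. In a live call the audio
    file itself is attached as a multipart/form-data upload to
    https://api.openai.com/v1/audio/transcriptions.
    """
    model = "gpt-4o-mini-transcribe" if mini else "gpt-4o-transcribe"
    return {
        "model": model,
        "file": filename,           # attached as a multipart upload
        "response_format": "text",  # return a plain-text transcript
    }


fields = build_transcribe_request("support_call.wav")
print(fields["model"])  # gpt-4o-transcribe
```

Swapping in the mini model is a one-field change, which makes it easy to trade a little accuracy for lower cost in high-volume use cases like the customer-service agents Godement describes.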