OpenAI Unveils Advanced Audio Models for Developers

OpenAI has launched new audio models in its application programming interface (API), enhancing performance in both speech-to-text transcription and text-to-speech (TTS) functions. The San Francisco-based AI company introduced three innovative models designed to empower developers in creating applications with sophisticated workflows. These advancements are expected to streamline customer support operations and improve overall user experience.

New Audio Models Enhance Performance

In a recent blog post, OpenAI outlined the features of its new API-specific audio models. The company emphasized its history of developing AI agents, including Operator, Deep Research, and the Responses API, which incorporate built-in tools. However, OpenAI noted that the full potential of these agents can only be realized when they operate intuitively and interact across various mediums beyond text.

The newly introduced models include GPT-4o-transcribe and GPT-4o-mini-transcribe for speech-to-text tasks, alongside GPT-4o-mini-tts for text-to-speech applications. OpenAI asserts that these models outperform its Whisper models, released in 2022. Unlike their predecessors, however, the new models are not open source, which may limit accessibility for developers who rely on downloadable weights.
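As an illustration, a transcription call to the new models through the official `openai` Python SDK might look like the sketch below. The file name and environment setup (an `OPENAI_API_KEY` variable) are assumptions for the example, not details from the announcement.

```python
def transcribe(path: str) -> str:
    """Transcribe an audio file with the new speech-to-text model.

    A minimal sketch using the official `openai` Python SDK; assumes
    OPENAI_API_KEY is set in the environment and `path` points to a
    real audio file (e.g. an MP3 or WAV).
    """
    from openai import OpenAI  # imported lazily so the sketch stays self-contained

    client = OpenAI()
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(
            model="gpt-4o-transcribe",  # or "gpt-4o-mini-transcribe" for lower cost
            file=audio,
        )
    return result.text


if __name__ == "__main__":
    print(transcribe("speech.mp3"))  # "speech.mp3" is a hypothetical file
```

Swapping the `model` string between the two transcription models is the only change needed to trade accuracy for cost.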

Specifically, the GPT-4o-transcribe model achieves a lower word error rate (WER) on the Few-shot Learning Evaluation of Universal Representations of Speech (FLEURS) benchmark, which assesses multilingual speech across more than 100 languages. OpenAI attributes these gains to targeted training techniques, including reinforcement learning and extensive mid-training on high-quality audio datasets.

Robust Features for Diverse Applications

The new speech-to-text models are designed to excel in challenging environments, effectively capturing audio even with heavy accents, background noise, and varying speech speeds. This capability is crucial for applications that require high accuracy in transcription, such as customer service and content creation.

Similarly, the GPT-4o-mini-tts model brings significant advancements, allowing developers to steer inflection, intonation, and emotional expressiveness. This makes it suitable for a wide range of tasks, from customer support to creative storytelling. However, it is important to note that the model currently offers only preset, artificial voices.
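In the SDK, this steerability is exposed as an `instructions` prompt passed alongside the text. The sketch below assumes an `OPENAI_API_KEY` in the environment; the voice name and output path are illustrative choices, not details from the announcement.

```python
def synthesize(text: str, out_path: str = "speech.mp3") -> None:
    """Generate spoken audio with GPT-4o-mini-tts.

    A sketch against the official `openai` Python SDK; the
    `instructions` prompt is how the voice's tone and delivery
    can be steered.
    """
    from openai import OpenAI  # lazy import keeps the sketch importable without the SDK

    client = OpenAI()
    response = client.audio.speech.create(
        model="gpt-4o-mini-tts",
        voice="coral",  # one of the preset voices; choice assumed here
        input=text,
        instructions="Speak in a calm, reassuring customer-support tone.",
    )
    response.write_to_file(out_path)


if __name__ == "__main__":
    synthesize("Thanks for calling. How can I help you today?")
```

Changing only the `instructions` string is enough to move the same text from, say, a support-desk register to a storytelling one.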

Pricing and Availability

OpenAI has detailed the pricing structure for its new audio models on its API pricing page. The GPT-4o-based audio model is priced at $40 (approximately Rs. 3,440) per million input tokens and $80 (around Rs. 6,880) per million output tokens. In contrast, the GPT-4o mini-based audio models are available at a lower rate of $10 (about Rs. 860) per million input tokens and $20 (approximately Rs. 1,720) per million output tokens.
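At those rates, cost scales linearly with token counts. The helper below turns the quoted USD figures into a per-request estimate; the dictionary keys and token counts are illustrative labels, not official model identifiers.

```python
# USD per one million tokens, as quoted on the API pricing page.
RATES = {
    "gpt-4o-audio": {"input": 40.0, "output": 80.0},
    "gpt-4o-mini-audio": {"input": 10.0, "output": 20.0},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for a given token usage."""
    rate = RATES[model]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000


# Example: 2M input tokens and 0.5M output tokens on the mini model
# costs 2 x $10 + 0.5 x $20 = $30.
print(estimate_cost("gpt-4o-mini-audio", 2_000_000, 500_000))  # → 30.0
```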

All audio models are now accessible to developers via the API. Additionally, OpenAI is launching an integration with its Agents software development kit (SDK) to assist users in building voice agents, further expanding the capabilities of its AI offerings.


