Google Launches Gemini 2.5 with Native Audio Dialog Access

Google unveiled new audio generation features for its Gemini 2.5 models at the Google I/O 2025 event. Developers and users can now experiment with these capabilities in Google AI Studio. The two standout features are native audio dialog, which generates human-like audio responses, and a controllable text-to-speech (TTS) function that transforms scripts into conversational speech. However, these features are currently not accessible to developers through application programming interfaces (APIs).
Exploring Gemini 2.5 Flash’s Audio Features
In a recent blog post, Google elaborated on the innovative audio generation capabilities of the Gemini 2.5 Flash models. These features are designed to enhance user experiences by enabling developers to create more interactive applications. Users can explore the native audio dialog feature in the stream tab of Google AI Studio, while the TTS functionality is available in the generate media tab.
The native audio dialog allows for real-time interactions between users and the AI. Users can either type or speak their prompts, and the AI responds with generated audio. Because the model generates audio directly, there is no intermediate text phase, which makes the conversation more fluid. The system can also recognize the emotional tone of the user's voice, enabling it to respond appropriately to feelings such as fear, anger, or surprise.
Capabilities of Controllable Text-to-Speech
The controllable TTS feature offers a range of functionalities that enhance the quality of audio output. It can generate multi-speaker dialogues and infuse emotions and accents into the narration of scripts. Additionally, users can control the delivery speed and emphasize specific pronunciations, making the audio output more engaging and relatable. This feature also supports the same 24 languages as the native audio dialog, allowing for language mixing and diverse communication styles.
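As a rough illustration of what a multi-speaker script with delivery directions might look like before it is pasted into a TTS tool, here is a minimal Python sketch. The `Line` class, the `build_script` function, and the "Speaker (style): text" layout are all hypothetical choices made for this example; the article does not describe the input format Google expects, and no Gemini API is called.

```python
from dataclasses import dataclass


@dataclass
class Line:
    speaker: str      # speaker label, e.g. "Host" (hypothetical)
    text: str         # words to be spoken
    style: str = ""   # optional free-text direction, e.g. "calm, slower pace"


def build_script(lines: list[Line], overall_direction: str = "") -> str:
    """Render dialogue lines as a plain-text script with per-line style notes.

    The layout is an assumption for illustration, not a documented format.
    """
    parts = []
    if overall_direction:
        parts.append(overall_direction.strip())
    for line in lines:
        note = f" ({line.style})" if line.style else ""
        parts.append(f"{line.speaker}{note}: {line.text}")
    return "\n".join(parts)


script = build_script(
    [
        Line("Host", "Welcome back to the show!", style="cheerful, brisk"),
        Line("Guest", "Thanks for having me.", style="calm, slower pace"),
    ],
    overall_direction="Read this as a relaxed podcast conversation.",
)
print(script)
```

Keeping the speaker, text, and style in separate fields makes it easy to adjust tone or pacing for one speaker without rewriting the whole script.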
Google emphasizes that these audio generation capabilities have undergone thorough risk assessments throughout their development. The company used both internal mechanisms and red teaming to identify and address potential vulnerabilities. Furthermore, all audio output generated by these models is embedded with SynthID, Google's watermarking technology, so that it can be identified as AI-generated.
Implications for Developers and Users
The introduction of these audio generation features marks a significant advancement in AI technology, providing developers with powerful tools to create more immersive experiences. By leveraging the capabilities of Gemini 2.5, developers can build applications that engage users in more meaningful ways. The ability to generate human-like audio responses and control speech delivery opens up new possibilities for interactive storytelling, virtual assistants, and customer service applications.
As these features are still in the testing phase, developers are encouraged to explore their potential within Google AI Studio. The feedback gathered during this testing period will be crucial for refining the technology and ensuring it meets user needs. With the integration of advanced audio generation capabilities, Google is poised to lead the way in transforming how users interact with AI, making conversations more natural and intuitive.