How to Build Voice AI Agents (with AutoGen)
# 34 | How to enable "voice mode" with your AutoGen Agents!

With OpenAI's recent release of improved audio models (both speech-to-text and text-to-speech), building sophisticated voice-based multi-agent systems has become more accessible than ever.
Voice agents are AI-powered systems that enable natural, speech-based interactions between users and applications. They build upon the core concept of agents: entities that can reason, act, communicate, and adapt to solve tasks.
Importantly, the new models are instructable, i.e., as the developer you can control not only what the model says but also how it says it. This allows you to create voice agents with specific personalities and emotional qualities (e.g., "talk like a sympathetic customer service agent" or "speak like a professional narrator"). This opens up possibilities for more natural, context-appropriate voice interactions that were previously difficult to achieve.
TLDR;
In this article we will cover the following in detail:
Why voice interfaces matter in modern applications
Voice agent architecture options: speech-to-speech vs. chained approach
Implementing a chained voice agent with four key components:
Voice input through speech-to-text transcription
Multi-agent processing using a deep research team
Output summarization for concise, voice-friendly responses
Text-to-speech with instructable voice generation
Building a user-friendly interface with Chainlit
Full source code and implementation at the end of the article.
Why Voice?
One of my favorite go-to activities is using ChatGPT in voice mode with my 6-year-old. I start with a preamble: "I am hanging out with my 6-year-old and I'd like you to create an interactive quiz game about animal facts—for example, 'How many legs does a spider have?' We will tell you our response, and then you tell us if it's right or wrong before moving to the next question."
And that's it ... a highly engaging, hands-free, and very natural application that I can supervise while learning with my child. Importantly, there's no need to build an app, worry about the interface, or deal with other technical issues. This is just one of many ways voice interfaces bring value.
Voice interfaces offer a natural, hands-free way for users to interact with software. The ability to speak to an application and receive an audio response creates a more intuitive and accessible experience. This is particularly valuable for:
Users with mobility limitations or visual impairments
Hands-busy scenarios (cooking, driving, exercising)
Environments where typing is impractical
Creating more personal, human-like interactions
As a developer, think of ways you can extend the capabilities of any multi-agent system by integrating a voice modality and new form factors for your application or service. Imagine an agent that can participate in a Discord or Teams call, or run on devices like Alexa and Google Home.
Voice Agent Architectures
There are two primary approaches to building voice applications that integrate modern generative AI models. (See the OpenAI tutorial on voice agents).
Speech-to-speech (multimodal): This approach uses a single model that processes audio directly and responds with generated speech.
While this creates low-latency, fluid conversations, it cannot perform actions beyond what the model can do natively. Business applications typically require integration with external tools, databases, multi-step reasoning, and custom business logic—capabilities not available in this approach.
Chained approach: This architecture breaks the process into three steps:
Speech-to-text transcription
Text processing and agent/application logic
Text-to-speech conversion
The valuable advantage of the chained approach is its flexibility to control application logic (particularly with multi-agent systems) and tailor it precisely to address specific business needs. For example, you can incorporate database queries, implement security filters, perform correctness verification, and execute complex business rules within your processing layer. This architecture gives you complete control over each step in the conversation flow, allowing for sophisticated integrations with existing systems. While offering these powerful capabilities, this design does come with trade-offs: it introduces additional latency and requires more development effort to maintain a natural user experience.
In this tutorial, we'll use the chained approach as it provides greater flexibility and control, allowing us to implement complex agent behaviors using tools like AutoGen.
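To make the chained flow concrete, here is a minimal orchestration sketch of one conversational turn. The function names speech_to_text, run_research_team, and text_to_speech are placeholders for the components described in the rest of the article; only the first two appear in this section.
async def handle_voice_turn(audio_file):
    """One conversational turn through the chained pipeline (sketch)."""
    # Step 1: speech-to-text transcription of the user's audio.
    user_text = await speech_to_text(audio_file)
    if user_text is None:
        return None

    # Step 2: text processing / agent logic (the deep research team below).
    agent_reply = await run_research_team(user_text)

    # Step 3: text-to-speech conversion of the agents' final answer.
    reply_audio = await text_to_speech(agent_reply)
    return reply_audio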
Deep Research Voice Agent
In the following section we will implement a deep research agent and UI that can address general tasks, e.g., “what is the latest news on the LLAMA 4 model series”, “what is the latest news on the CHIPS Act”, “what is the state of the art in laparoscopic surgery, including pros and cons”.
Let's break down our implementation into four essential parts:
Voice Input: Audio Transcription
The first step in our chained approach is capturing and transcribing user speech. Here, we'll use a speech-to-text model to convert audio input into text that our agent can process:
async def speech_to_text(audio_file):
    """Convert speech to text using a transcription model."""
    try:
        # Here we're using OpenAI's model, but you could substitute any transcription service.
        # Assumes `openai_client` is an AsyncOpenAI client created elsewhere,
        # e.g. openai_client = AsyncOpenAI().
        response = await openai_client.audio.transcriptions.create(
            model="gpt-4o-mini-transcribe",
            file=audio_file,
            response_format="text"
        )
        # With response_format="text", the response is the transcript string itself.
        transcript = response
        return transcript
    except Exception as e:
        print(f"Error in transcription: {str(e)}")
        return None
The improved transcription models excel at handling challenging scenarios like accents, background noise, and varying speech speeds, making them particularly useful for real-world applications.
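As a usage sketch (not from the article), assuming an AsyncOpenAI client and a recorded clip on disk, calling the function above could look like this; the file name is purely illustrative:
import asyncio
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def main():
    # Transcribe a recorded audio clip (hypothetical file name).
    with open("user_question.wav", "rb") as audio_file:
        transcript = await speech_to_text(audio_file)
    print(transcript)

asyncio.run(main())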
Deep Research Agent Team
The core of our system is a team of specialized agents that work together on deep research tasks. Our implementation uses the SelectorGroupChat pattern in AutoGen with four specialized agents:
Research Assistant: Performs web searches and analyzes information (has access to a Google Search API tool)
Verifier: Evaluates progress and ensures completeness of the research
Summary Agent: Provides concise, clear reports of findings
User Agent: Can delegate to the user to obtain clarification or feedback
This agent team architecture allows for a division of responsibilities, with each agent focusing on its specific expertise. The SelectorGroupChat coordinates the conversation flow by choosing which agent should respond next based on the current context.
The agent team can be replaced or customized to suit different use cases. For example, you might create a customer service agent team with specialists in product information, troubleshooting, and order management.
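As a rough sketch of how such a team can be wired up with AutoGen's AgentChat API (autogen-agentchat 0.4), the snippet below uses simplified system messages and a stubbed google_search tool; the prompts and tooling in the full implementation may differ.
import asyncio
from autogen_agentchat.agents import AssistantAgent, UserProxyAgent
from autogen_agentchat.teams import SelectorGroupChat
from autogen_agentchat.conditions import TextMentionTermination, MaxMessageTermination
from autogen_ext.models.openai import OpenAIChatCompletionClient

async def google_search(query: str) -> str:
    """Placeholder for a real Google Search API call."""
    return f"(search results for: {query})"

model_client = OpenAIChatCompletionClient(model="gpt-4o")

research_assistant = AssistantAgent(
    name="research_assistant",
    model_client=model_client,
    tools=[google_search],
    description="Performs web searches and analyzes the results.",
    system_message="You are a research assistant. Search the web and analyze findings.",
)

verifier = AssistantAgent(
    name="verifier",
    model_client=model_client,
    description="Evaluates progress and checks that the research is complete.",
    system_message="Critically review the research so far and point out gaps.",
)

summary_agent = AssistantAgent(
    name="summary_agent",
    model_client=model_client,
    description="Writes the final concise report.",
    system_message="Summarize the verified findings clearly. End with TERMINATE.",
)

# Lets the team hand control back to a human for clarification (console input by default).
user_agent = UserProxyAgent(name="user_agent", description="A human who can clarify the task.")

# The SelectorGroupChat uses the model client to pick the next speaker from the context.
team = SelectorGroupChat(
    participants=[research_assistant, verifier, summary_agent, user_agent],
    model_client=model_client,
    termination_condition=TextMentionTermination("TERMINATE") | MaxMessageTermination(20),
)

async def run_research_team(task: str) -> str:
    result = await team.run(task=task)
    return result.messages[-1].content  # the summary agent's final report

# Example: asyncio.run(run_research_team("What is the latest news on the CHIPS Act?"))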