OpenAI recently released Operator, an AI agent that autonomously navigates web browsers to complete user tasks. Operator is impressive, but the more important question is: how can we build and customize a tool like this?
In this tutorial, we'll create a similar system using AutoGen's high-level AgentChat API (v0.4) in ~50 lines of code.
Our multi-agent system will:
Browse websites autonomously
Process visual information
Handle complex multi-step tasks
Recover from errors gracefully
Know when to ask for human help (get feedback)
Want to skip to the code? It's here. The rest of the article covers installation, selecting a pattern, implementing the agents, and limitations (autonomy/control tradeoffs, memory, etc.).
Step 1 - Installation
Let's get your environment ready with all the necessary components:
Create a fresh conda environment:
conda create -n webagent python=3.11
conda activate webagent
Install the required AutoGen packages:
# Core packages (the extras are quoted so shells such as zsh do not expand the brackets)
pip install -U autogen-agentchat "autogen-ext[openai,web-surfer]"
playwright install
Environment variables:
export OPENAI_API_KEY="your-key-here"
Important Notes:
The `autogen-ext[openai,web-surfer]` package includes all necessary dependencies for web browsing capabilities
Playwright handles browser automation, and `playwright install` downloads the browser binaries it drives, including Chromium. We need this because the MultimodalWebSurfer agent we will use controls a Chromium browser through Playwright.
Make sure you have a valid OpenAI API key with access to GPT-4o for vision capabilities.
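Before moving on, it can help to confirm the setup from Python. The following is a minimal sketch, not part of AutoGen: `preflight` and `REQUIRED_MODULES` are hypothetical names, and the module names are assumed to be the import names of the packages installed above. It only checks that the packages are importable and that the API key is set; it does not verify that the Playwright browser binaries were downloaded.

```python
import importlib.util
import os

# Import names assumed to correspond to the pip packages installed above.
REQUIRED_MODULES = ("autogen_agentchat", "autogen_ext", "playwright")


def preflight(env=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_MODULES:
        # find_spec returns None for a top-level module that is not installed.
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing package: {name}")
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


if __name__ == "__main__":
    issues = preflight()
    print("ready" if not issues else "\n".join(issues))
```

Running the script prints either `ready` or one line per problem found.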
Now that our environment is set up, we can move on to implementing the agent architecture!
Step 2 - Select a Multi-Agent Pattern
For our web agent system, we need a flexible pattern that can handle:
Error recovery
Task verification (is the task done, and when should we end?)
Dynamic agent coordination (selecting the right agent at the right time)
Human intervention when needed
We'll use AutoGen's SelectorGroupChat preset, which provides these capabilities through a team-based approach:
Shared Context: All agents share a conversation thread, giving them full context of the task and previous actions.
Dynamic Selection: The GroupChat functions like an implicit orchestrator (powered by GPT-4o-mini) and chooses which agent should act next based on the current context.
Flexible Turn-Taking: Unlike rigid round-robin approaches, agents are selected based on their capabilities and the current needs of the task.
Built-in Error Recovery: If the web browsing agent encounters issues, the shared thread provides a basic form of recovery: other agents can respond to the failure or request human assistance.
Note: GroupChat is not the only pattern that could work here, and it has limitations, which are discussed below.
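To make the pattern concrete, here is a toy sketch of a selector-driven group chat in plain Python. It is not AutoGen's implementation: `run_group_chat`, the stub agents, and `rule_selector` are all hypothetical, and a simple rule stands in for the LLM that would normally pick the next speaker. It illustrates the shared thread, dynamic selection, and how another agent can react when the surfer reports an error.

```python
from typing import Callable, Dict, List, Tuple

Message = Tuple[str, str]  # (speaker name, text)
Agent = Callable[[List[Message]], str]
Selector = Callable[[List[Message], List[str]], str]


def run_group_chat(task: str, agents: Dict[str, Agent], select: Selector,
                   max_turns: int = 6) -> List[Message]:
    """Run a toy selector-driven chat and return the shared history."""
    history: List[Message] = [("user", task)]
    for _ in range(max_turns):
        speaker = select(history, list(agents))  # dynamic selection
        reply = agents[speaker](history)         # every agent sees the full thread
        history.append((speaker, reply))
        if "TERMINATE" in reply:
            break
    return history


# Stub agents: canned replies instead of model or browser calls.
def surfer(history: List[Message]) -> str:
    # Fail on the first attempt to show how the other agent can react.
    return "ERROR: page timed out" if len(history) == 1 else "Found the page."


def assistant(history: List[Message]) -> str:
    last_text = history[-1][1]
    if "ERROR" in last_text:
        return "Retry with another source. keep going"
    return "Summary of findings. TERMINATE"


def rule_selector(history: List[Message], names: List[str]) -> str:
    # Stand-in for the LLM selector: the assistant reviews whatever the surfer did.
    return "assistant" if history[-1][0] == "surfer" else "surfer"


if __name__ == "__main__":
    chat = run_group_chat("find x", {"surfer": surfer, "assistant": assistant}, rule_selector)
    for speaker, text in chat:
        print(f"{speaker}: {text}")
```

In the real system the selector is a model call and the agents are far less predictable, which is where the limitations of this pattern come from.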
Step 3 - Implementation of Agents
We will be implementing 3 agents in our multi-agent system.
Let's break down each agent's role and implementation:
1. WebSurfer Agent
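from autogen_ext.agents.web_surfer import MultimodalWebSurfer
from autogen_ext.models.openai import OpenAIChatCompletionClient

# GPT-4o is multimodal, so the agent can reason over page screenshots.
model_client = OpenAIChatCompletionClient(model="gpt-4o-2024-11-20")
websurfer_agent = MultimodalWebSurfer(
    name="websurfer_agent",
    description="an agent that solves tasks by browsing the web",
    model_client=model_client,
    headless=False,  # show the browser window while the agent works
)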
The WebSurfer agent serves as our primary web navigator. Built on the MultimodalWebSurfer class, it combines browser automation with visual processing capabilities. Through GPT-4o integration, it can understand and interact with web pages much as a human would, handling everything from basic navigation to form filling. The agent maintains its own browser session (a Chromium browser driven by Playwright) and can process visual information to make informed decisions about how to interact with web elements.
2. Assistant Agent
from autogen_agentchat.agents import AssistantAgent

assistant_agent = AssistantAgent(
    name="assistant_agent",
    description="an agent that verifies and summarizes information",
    system_message="""You are a task verification assistant who is working with a web surfer agent to solve tasks.
At each point, check if the task has been completed as requested by the user.
If the websurfer_agent responds and the task has not yet been completed, respond with what is left to do and then say 'keep going'.
If and only when the task has been completed, summarize and present a final answer that directly addresses the user task in detail and then respond with TERMINATE.""",
    model_client=model_client,
)
The AssistantAgent acts as our quality control supervisor, continuously monitoring the progress and accuracy of the web surfing operations. It reviews the WebSurfer's actions, verifies task completion, and provides clear summaries of findings. This agent is crucial for maintaining task focus and ensuring that all user requirements are met before concluding the operation. When necessary, it can guide the WebSurfer toward missing information or request clarification through the User Proxy.
3. User Proxy Agent
from autogen_agentchat.agents import UserProxyAgent

user_proxy = UserProxyAgent(
    name="user_proxy",
    description="a human user that should be consulted only when the assistant_agent is unable to verify the information provided by the websurfer_agent",
)
Including a UserProxyAgent keeps the system autonomous while preserving human oversight. The intended design is that for straightforward tasks the UserProxyAgent is never called; for example, “Who is Jeff Dean and what papers has he written” is essentially a simple web search task. However, a task like “book me a flight to San Diego next week” is missing a lot of context and needs clarifications and approvals, all of which should be routed to a user for feedback via the UserProxyAgent.
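How the human's reply is collected can also be customized. The sketch below builds a small prompt-to-reply function in plain Python; `make_input_func` and `ask_human` are hypothetical names, and the default reply text is an arbitrary choice made for illustration.

```python
from typing import Callable


def make_input_func(
    reader: Callable[[str], str] = input,
    default: str = "No preference, please continue.",
) -> Callable[[str], str]:
    """Build a prompt -> reply function that never returns an empty answer."""

    def ask_human(prompt: str) -> str:
        reply = reader(f"[agent needs input] {prompt}\n> ").strip()
        # An empty reply would give the agents nothing to act on.
        return reply or default

    return ask_human
```

Assuming the installed version of `UserProxyAgent` accepts an `input_func` argument taking a prompt string and returning a string (worth checking against its API reference), the function could be wired in as `UserProxyAgent(name="user_proxy", input_func=make_input_func())`. Injecting `reader` also makes the human step easy to replace with a canned answer in tests.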
4. SelectorGroupChat Team Orchestration
selector_prompt = """You are the coordinator of a role play game. The following roles are available:
{roles}. Given a task, the websurfer_agent will be tasked to address it by browsing the web and providing information. The assistant_agent will be tasked with verifying the information provided by the websurfer_agent and summarizing the information to present a final answer to the user.
If the task needs assistance from a human user (e.g., providing feedback, preferences, or the task is stalled), you should select the user_proxy role to provide the necessary information.
Read the following conversation. Then select the next role from {participants} to play. Only return the role.
{history}
Read the above conversation. Then select the next role from {participants} to play. Only return the role.
"""
from autogen_agentchat.conditions import MaxMessageTermination, TextMentionTermination
from autogen_agentchat.teams import SelectorGroupChat

# Stop after 20 messages or as soon as any message contains TERMINATE.
termination = MaxMessageTermination(max_messages=20) | TextMentionTermination("TERMINATE")

# websurfer_agent, assistant_agent and user_proxy are the agents defined above.
team = SelectorGroupChat(
    [websurfer_agent, assistant_agent, user_proxy],
    selector_prompt=selector_prompt,
    model_client=OpenAIChatCompletionClient(model="gpt-4o-mini"),
    termination_condition=termination,
)
The SelectorGroupChat preset in the AutoGen AgentChat API serves as the team orchestrator that ties everything together. Note how the selector prompt contains explicit instructions for deciding which agent takes the next turn in the conversation, including rules for when to delegate to a human user. The team also has a termination condition: the run ends once 20 messages have been exchanged or an agent responds with TERMINATE.
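To see what the selector model might actually receive, the sketch below fills the three placeholders of a shortened template by hand. This is an illustration only: `fill_selector_prompt` is a hypothetical helper, the template is abbreviated, and the exact text AutoGen substitutes for `{roles}`, `{participants}`, and `{history}` may be formatted differently.

```python
SELECTOR_TEMPLATE = (
    "The following roles are available:\n{roles}\n"
    "Read the following conversation. Then select the next role from "
    "{participants} to play. Only return the role.\n{history}\n"
)


def fill_selector_prompt(template, descriptions, history):
    """Fill {roles}, {participants} and {history} with plain-text renderings."""
    # One "name: description" line per agent.
    roles = "\n".join(f"{name}: {desc}" for name, desc in descriptions.items())
    # The candidate names, rendered as a Python-style list.
    participants = str(list(descriptions))
    # One "speaker: text" line per message in the shared thread.
    transcript = "\n".join(f"{speaker}: {text}" for speaker, text in history)
    return template.format(roles=roles, participants=participants, history=transcript)


if __name__ == "__main__":
    print(fill_selector_prompt(
        SELECTOR_TEMPLATE,
        {"websurfer_agent": "browses the web", "assistant_agent": "verifies and summarizes"},
        [("user", "Who is Francois Chollet?"), ("websurfer_agent", "Searched Bing.")],
    ))
```

Printing the filled prompt for a short conversation is a quick way to check whether the agent descriptions give the selector enough information to choose well.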
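What is expected to happen when you run this team? Note that run_stream returns an async stream of messages, so it must be awaited and consumed, for example with the Console helper. Top-level await works in a notebook; in a script, wrap the call in an async function and pass it to asyncio.run.
from autogen_agentchat.ui import Console
await Console(team.run_stream(task="Who is Francois Chollet? Is he a researcher? Provide a detailed list of his recent research papers"))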
We can summarize it as follows:
Initial search: The selector first chooses the WebSurfer agent, which starts at Bing.com and searches for "Francois Chollet research papers". The agent captures the search results page, extracts relevant information, and shares it back to the group conversation.
Iterative verification: The Assistant agent then analyzes the initial search results. It might notice that while we have some information about Chollet, we need more specific details about his research papers. The Assistant would instruct the WebSurfer to navigate to more authoritative sources like Google Scholar or Semantic Scholar for comprehensive publication data.
Iterative Refinement: The agents then enter an iterative cycle:
WebSurfer → visits recommended links and extracts information
Assistant → verifies gathered information and identifies gaps
WebSurfer → navigates to additional sources if needed
Assistant → synthesizes all findings
This continues until one of the termination conditions is met:
The Assistant determines that we have complete information about Chollet and his recent publications, so the task is complete and it responds with "TERMINATE"
The conversation reaches the maximum allowed turns (20 messages in our configuration)
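The way the two stopping rules combine with `|` can be sketched in a few lines of plain Python. This is a toy model, not AutoGen's implementation: `Condition`, `MaxMessages`, `TextMention`, and `AnyOf` are hypothetical classes that only mimic the "stop when either condition fires" behavior described above.

```python
class Condition:
    """Toy stop condition; check() returns a reason string or None."""

    def check(self, messages):
        raise NotImplementedError

    def __or__(self, other):
        # `a | b` stops when either a or b fires.
        return AnyOf(self, other)


class MaxMessages(Condition):
    def __init__(self, limit):
        self.limit = limit

    def check(self, messages):
        return "max messages reached" if len(messages) >= self.limit else None


class TextMention(Condition):
    def __init__(self, text):
        self.text = text

    def check(self, messages):
        hit = any(self.text in message for message in messages)
        return f"'{self.text}' mentioned" if hit else None


class AnyOf(Condition):
    def __init__(self, *conditions):
        self.conditions = conditions

    def check(self, messages):
        # The first condition that fires supplies the stop reason.
        for condition in self.conditions:
            reason = condition.check(messages)
            if reason:
                return reason
        return None


stop = MaxMessages(20) | TextMention("TERMINATE")
```

Calling `stop.check(messages)` after each turn returns `None` while the run should continue and a short reason string once either limit is hit.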