When One Agent Isn't Enough

Everything we've built so far in this track has been a single agent: one LLM with a system prompt, a set of tools, and a loop that runs until the task is done. That works remarkably well for focused tasks — answer a question, write a function, search a database. But what happens when the task is too big, too varied, or too complex for one agent to handle alone?

Consider building a web application from a natural-language spec. Someone needs to break the spec into components. Someone needs to write the frontend code. Someone else writes the backend. A reviewer checks for bugs and security issues. A tester writes and runs tests. In a human team, these are different people with different skills and different contexts. Could separate LLM-powered agents fill each role?

A multi-agent system is exactly this: multiple LLM-powered agents that communicate and collaborate to accomplish a task that none of them could (or should) handle alone. Each agent has its own system prompt defining its role, its own set of tools, and its own conversation history. They interact by passing messages — the output of one becomes the input of another.

Why not just give one agent all the tools and all the instructions? Three reasons come up repeatedly in practice:

  • Context limits: a single agent accumulates all tool results, all intermediate reasoning, and all task context in one conversation. For complex tasks, this quickly fills up the context window. Separate agents each maintain their own, smaller context — only the information relevant to their role.
  • Specialisation: a system prompt that says "You are a senior security reviewer. Your job is to find vulnerabilities" produces different behaviour than one that says "You are a Python developer. Write clean, tested code." One agent trying to be everything at once tends to be mediocre at each role. Separate agents with focused system prompts are individually better at their assigned tasks.
  • Parallelism: a single agent is inherently sequential — it does one thing at a time. If the task involves searching five different databases, a single agent calls them one by one. Five parallel agents can search simultaneously, reducing wall-clock time by up to a factor of five.

These benefits are real, but as we'll see at the end of this article, they come with real costs. The question is not "should I use multi-agent?" but "does the task actually need it?"

Orchestration Patterns

If you have multiple agents, you need a way to coordinate them. Who talks to whom? Who decides what gets done next? How do results flow between agents? The answer depends on the pattern you choose, and there are four major ones.

Sequential (pipeline). The simplest pattern: Agent A finishes, passes its output to Agent B, which finishes and passes to Agent C. Think of an assembly line: a Researcher agent gathers information, a Writer agent turns it into prose, and an Editor agent polishes the prose. Each agent sees only the output of the previous one. This is easy to implement, easy to debug (you can inspect the output at each stage), and predictable. The downside is speed: there's no parallelism. If the Researcher takes 30 seconds, the Writer waits. If any agent in the chain fails, everything downstream is blocked.

# Pseudocode: sequential multi-agent pipeline
def run_pipeline(topic):
    # Stage 1: Research
    research = researcher_agent.run(
        f"Research the topic: {topic}. Return key facts and sources."
    )

    # Stage 2: Write (receives research output)
    draft = writer_agent.run(
        f"Write a blog post based on this research:\n{research}"
    )

    # Stage 3: Edit (receives draft)
    final = editor_agent.run(
        f"Edit this draft for clarity and accuracy:\n{draft}"
    )

    return final

Parallel (fan-out / fan-in). Multiple agents work simultaneously on different subtasks, and a final agent combines the results. For example, searching five different data sources in parallel: each search agent queries one source, and a synthesiser agent merges the results into a single answer. This is fast — the total time is the time of the slowest agent, not the sum — but coordination is harder. The synthesiser needs to reconcile conflicting information from different sources, handle partial failures (what if one search agent times out?), and produce a coherent whole from disparate parts.

import asyncio

async def parallel_search(query):
    # Fan-out: launch search agents concurrently
    # (assumes each agent's run() is a coroutine)
    tasks = [
        search_agent_arxiv.run(query),
        search_agent_wikipedia.run(query),
        search_agent_news.run(query),
        search_agent_github.run(query),
    ]
    # return_exceptions=True keeps one failed or timed-out
    # agent from sinking the whole fan-out
    results = await asyncio.gather(*tasks, return_exceptions=True)
    successes = [r for r in results if not isinstance(r, Exception)]

    # Fan-in: synthesise the surviving results
    combined = "\n\n".join(
        f"Source {i+1}:\n{r}" for i, r in enumerate(successes)
    )
    answer = synthesiser_agent.run(
        f"Synthesise these search results into a single answer:\n{combined}"
    )
    return answer

Hierarchical (manager / worker). A manager agent decomposes the task into subtasks and delegates each to a specialised worker agent. The manager sees the big picture and coordinates; the workers focus on execution. This mirrors how human teams work: a project manager assigns frontend work to the frontend developer, backend work to the backend developer, and testing to the QA engineer. The manager can dynamically decide which workers to invoke, how to partition the work, and when to iterate — if the tester finds a bug, the manager routes it back to the coder. This is the most flexible pattern but also the most complex: the manager agent itself needs to be capable enough to plan, decompose, and route effectively.

Debate / discussion. Instead of dividing labour, this pattern has agents with different perspectives argue and refine an answer. One agent proposes a solution, a second agent critiques it, and they go back and forth until they converge. This approach was formalised in the paper "Improving Factuality and Reasoning in Language Models through Multiagent Debate" (Du et al., 2023), which showed that multi-agent debate improves factuality on benchmarks like TruthfulQA and mathematical reasoning on GSM8K. The idea is that a single model can be confidently wrong, but when forced to defend its answer against a critic, errors surface and get corrected. The cost is that this is the most token-expensive pattern: each round of debate doubles (or more) the LLM calls, and convergence is not guaranteed.

💡 These patterns are not mutually exclusive. A hierarchical system might use a sequential pipeline within one worker, or fan out multiple workers in parallel. Real-world systems often mix patterns based on the structure of the task.

CrewAI

Now that we have the theory, how do these patterns look in code? Several frameworks have emerged to make multi-agent systems easier to build. Let's start with CrewAI (GitHub), one of the most popular multi-agent frameworks. Its core metaphor is a crew: a group of agents with defined roles that collaborate on a set of tasks.

Each agent in CrewAI is defined by four things:

  • Role: a job title like "Senior Data Analyst" or "Technical Writer". This becomes part of the agent's system prompt.
  • Goal: what the agent is trying to accomplish, e.g. "Analyse quarterly sales data and identify trends."
  • Backstory: additional context that shapes the agent's persona and behaviour: "You are a 10-year veteran analyst known for finding insights others miss."
  • Tools: the specific tools this agent can access (a search tool, a code-execution tool, a file-reading tool, etc.).

Tasks are defined separately with descriptions and expected outputs, then assigned to specific agents. The crew then orchestrates execution — either sequentially (agents run one after another, each receiving the previous agent's output) or hierarchically (a manager agent delegates tasks to workers). The role-playing metaphor is what makes CrewAI distinctive: by giving agents explicit personas, goals, and backstories, the framework leverages the model's ability to role-play, which in practice leads to more focused and on-task outputs.

from crewai import Agent, Task, Crew, Process

# Define agents with roles, goals, and backstories
researcher = Agent(
    role="Senior Research Analyst",
    goal="Find comprehensive, accurate information on the given topic",
    backstory="You are an experienced researcher who excels at finding "
              "reliable sources and extracting key insights from them.",
    tools=[search_tool, web_scraper_tool],
    verbose=True,
)

writer = Agent(
    role="Technical Writer",
    goal="Write clear, engaging content based on research findings",
    backstory="You are a skilled technical writer who translates complex "
              "topics into accessible prose without losing accuracy.",
    tools=[],  # Writer doesn't need external tools
    verbose=True,
)

# Define tasks and assign to agents
research_task = Task(
    description="Research the topic '{topic}' thoroughly. "
                "Find key facts, recent developments, and expert opinions.",
    expected_output="A structured research brief with key findings and sources.",
    agent=researcher,
)

writing_task = Task(
    description="Write a blog post based on the research brief. "
                "Make it engaging, accurate, and well-structured.",
    expected_output="A polished blog post of 800-1200 words.",
    agent=writer,
)

# Create the crew and run
crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, writing_task],
    process=Process.sequential,  # researcher runs first, writer second
    verbose=True,
)

result = crew.kickoff(inputs={"topic": "quantum error correction"})

Behind the scenes, CrewAI handles the plumbing: it passes the researcher's output into the writer's context, manages the conversation history for each agent, and provides logging so you can see exactly what each agent did. The framework supports both OpenAI and open-source models, and it can be extended with custom tools.

AutoGen and LangGraph

CrewAI's role-playing metaphor works well for straightforward delegation, but more complex multi-agent patterns — dynamic conversations, conditional branching, human-in-the-loop approval — need more flexible frameworks. Two of the most prominent are AutoGen and LangGraph.

AutoGen (Wu et al., 2023) is Microsoft's multi-agent conversation framework. Its core abstraction is the conversable agent: an entity that can send and receive messages. Agents don't just run tasks in isolation — they have conversations with each other. An AssistantAgent (LLM-powered) might discuss a problem with a UserProxyAgent (which represents a human or executes code on the human's behalf), passing messages back and forth until they reach a solution.

What makes AutoGen distinctive is that a human is just another agent in the conversation. The UserProxyAgent can be configured to always pass messages to a human for approval, to auto-approve certain types of actions, or to execute code and return the output. This makes human-in-the-loop workflows natural rather than bolted on. AutoGen also supports group chat: multiple agents share a single conversation thread, with a manager that decides who speaks next. This enables the debate pattern naturally — a proposer, a critic, and a moderator can all participate in one thread.

from autogen import AssistantAgent, UserProxyAgent

# An LLM-powered assistant
assistant = AssistantAgent(
    name="analyst",
    system_message="You are a data analyst. Write Python code to answer "
                   "questions about data. Always explain your reasoning.",
    llm_config={"model": "gpt-4o"},
)

# A proxy that executes code on behalf of the user
user_proxy = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",        # auto-execute, no human approval
    code_execution_config={"work_dir": "workspace"},
    max_consecutive_auto_reply=5,     # stop after 5 back-and-forth rounds
)

# Start a conversation: user_proxy sends a task, agents chat until done
user_proxy.initiate_chat(
    assistant,
    message="Analyse the dataset in 'sales.csv'. "
            "What month had the highest revenue?",
)
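The group-chat mechanism can also be sketched without the framework: all agents read one shared thread, and a manager picks the next speaker each round. The stub agents and the round-robin selection rule below are illustrative simplifications, not AutoGen's actual implementation (its manager is typically an LLM choosing the speaker):

```python
# Sketch of a group chat: one shared thread, a manager picks who speaks.

def proposer(thread: list[str]) -> str:
    # A real agent would condition on the whole thread so far.
    return "Proposal: cache the query results."

def critic(thread: list[str]) -> str:
    return "Critique: the cache must be invalidated on writes."

def moderator(thread: list[str]) -> str:
    return "Decision: adopt caching with write-through invalidation."

AGENTS = [("proposer", proposer), ("critic", critic), ("moderator", moderator)]

def group_chat(task: str, max_rounds: int = 3) -> list[str]:
    thread = [f"user: {task}"]               # the shared conversation thread
    for i in range(max_rounds):
        name, agent = AGENTS[i % len(AGENTS)]  # manager: round-robin here
        thread.append(f"{name}: {agent(thread)}")
    return thread

for msg in group_chat("Speed up the dashboard query"):
    print(msg)
```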

LangGraph (GitHub) takes a fundamentally different approach. Where AutoGen models agents as conversational participants, LangGraph models them as nodes in a graph. Edges between nodes represent transitions: after the researcher node finishes, control flows to the writer node (or to the reviewer node, depending on a condition). The graph maintains a shared state object that all nodes can read from and write to, and routing logic at each edge determines which node runs next.

This state-machine approach gives you fine-grained control over the flow. You can define conditional edges ("if the code passes tests, go to the deploy node; if it fails, go back to the coder node"), cycles ("loop between coder and tester until all tests pass"), and parallel branches ("run the frontend and backend nodes simultaneously, then join at the integrator node"). The tradeoff is more boilerplate: you're explicitly wiring together a graph, which is more code than CrewAI's declarative crew definition but gives you more control over exactly what happens and when.

from langgraph.graph import StateGraph, END
from typing import TypedDict

class AppState(TypedDict):
    task: str
    code: str
    test_result: str
    attempts: int

def coder_node(state: AppState) -> dict:
    # LLM generates or fixes code based on state
    code = coding_agent.run(state["task"], previous_code=state.get("code"))
    return {"code": code, "attempts": state.get("attempts", 0) + 1}

def tester_node(state: AppState) -> dict:
    # Run tests on the generated code
    result = testing_agent.run(state["code"])
    return {"test_result": result}

def should_retry(state: AppState) -> str:
    if "PASS" in state["test_result"]:
        return "done"
    if state["attempts"] >= 3:
        return "done"       # give up after 3 attempts
    return "retry"

# Build the graph
graph = StateGraph(AppState)
graph.add_node("coder", coder_node)
graph.add_node("tester", tester_node)

graph.set_entry_point("coder")
graph.add_edge("coder", "tester")           # coder always goes to tester
graph.add_conditional_edges("tester", should_retry, {
    "retry": "coder",   # loop back if tests fail
    "done": END,         # finish if tests pass or max attempts reached
})

app = graph.compile()
result = app.invoke({"task": "Write a function to sort a linked list"})

💡 The key difference: CrewAI is "define the team and let them work", AutoGen is "define the conversation and let them talk", and LangGraph is "define the graph and control the flow". The right choice depends on whether your problem is best modelled as a team, a conversation, or a workflow.

OpenAI Swarm and Anthropic's Approach

While the frameworks above offer full-featured multi-agent orchestration, there's a compelling argument for keeping things minimal. Two approaches from major model providers illustrate this philosophy.

OpenAI Swarm (GitHub) is an experimental, educational framework that reduces multi-agent systems to two primitives: agents (a system prompt plus a list of functions) and handoffs (one agent transferring control to another). A handoff is implemented as an ordinary function that returns the next agent — when the triage agent decides the user's question is about billing, it calls a transfer_to_billing() function that returns the billing agent, and the conversation continues seamlessly with the billing agent's system prompt and tools.

from swarm import Swarm, Agent

client = Swarm()

# A handoff is just a function that returns another agent
def transfer_to_billing():
    """Transfer the conversation to the billing specialist."""
    return billing_agent

def transfer_to_technical():
    """Transfer the conversation to technical support."""
    return technical_agent

triage_agent = Agent(
    name="Triage",
    instructions="You are a customer service triage agent. "
                 "Determine if the user needs billing help or technical help, "
                 "then transfer to the appropriate specialist.",
    functions=[transfer_to_billing, transfer_to_technical],
)

billing_agent = Agent(
    name="Billing",
    instructions="You are a billing specialist. Help with invoices, "
                 "payments, and subscription changes.",
    functions=[lookup_invoice, process_refund],
)

technical_agent = Agent(
    name="Technical Support",
    instructions="You are a technical support specialist. "
                 "Help debug issues and guide users through fixes.",
    functions=[check_system_status, create_ticket],
)

# Run: Swarm manages handoffs automatically
response = client.run(
    agent=triage_agent,
    messages=[{"role": "user", "content": "I was charged twice last month"}],
)

Swarm is explicitly not a production framework. OpenAI describes it as an educational resource — a pattern you can understand in an afternoon and adapt for your own needs. There's no persistent state, no built-in memory, no production features. The value is in the idea: handoffs are a clean, minimal primitive for multi-agent coordination, and you can implement the same pattern in a few dozen lines of code without any framework at all.
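Here is that framework-free version of the handoff pattern. The keyword-based routing stands in for the LLM deciding which transfer function to call, and the agent dictionaries are illustrative, not Swarm's API:

```python
# Handoffs without a framework: an agent is just instructions plus a
# handler, and a handoff is a function that returns the next agent.

billing_agent = {
    "name": "Billing",
    "handle": lambda msg: f"Billing: refund issued for '{msg}'",
}
technical_agent = {
    "name": "Technical",
    "handle": lambda msg: f"Technical: ticket opened for '{msg}'",
}

def triage(message: str) -> dict:
    # A real triage agent would be an LLM choosing a transfer function;
    # keyword matching plays that role in this sketch.
    if "charged" in message or "invoice" in message:
        return billing_agent      # handoff: return the next agent
    return technical_agent

def run(message: str) -> str:
    agent = triage(message)       # triage hands the conversation off
    return agent["handle"](message)

print(run("I was charged twice last month"))
```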

Anthropic's approach with Claude is different. Rather than building a multi-agent framework, Claude Code uses a single orchestrator agent that can spawn sub-agents for parallel tasks. The orchestrator handles the conversation with the user, makes high-level plans, and delegates specific pieces of work to sub-agents that run independently and return their results. This is structurally similar to the hierarchical manager/worker pattern, but with a key difference: the sub-agents are created dynamically based on what the task requires, rather than being predefined. If the orchestrator decides it needs to search three files in parallel, it spins up three sub-agents on the fly. If it doesn't need parallel work, it just does everything itself.

This design reflects a practical insight: for most tasks, a single capable agent with good tools is sufficient. Multi-agent coordination adds latency (agents need to communicate), token cost (each agent has its own context), and failure modes (what if one agent misunderstands another's output?). Spawning sub-agents only when needed keeps the common case simple and fast while preserving the ability to parallelise when it genuinely helps.
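The "spawn only when needed" logic can be sketched with a thread pool standing in for sub-agents. The `search_file` stub and the one-file threshold are illustrative assumptions, not Claude Code's implementation:

```python
# Sketch of on-demand sub-agents: the orchestrator spawns parallel
# workers only when the plan calls for it.
from concurrent.futures import ThreadPoolExecutor

def search_file(path: str) -> str:
    return f"matches in {path}"   # stand-in for a sub-agent's work

def orchestrator(task: str, files: list[str]) -> list[str]:
    if len(files) <= 1:
        # Common case: no parallel work needed, just do it directly
        return [search_file(f) for f in files]
    # Spawn one sub-agent per file, with the count decided at run time
    with ThreadPoolExecutor(max_workers=len(files)) as pool:
        return list(pool.map(search_file, files))

print(orchestrator("find usages of foo", ["a.py", "b.py", "c.py"]))
```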

When to Use Multi-Agent vs Single Agent

With all these frameworks and patterns available, it's tempting to reach for multi-agent systems by default. But the honest answer is: most tasks that people build multi-agent systems for can be done with a single well-prompted agent with good tools. Multi-agent adds complexity — more moving parts, more failure modes, more debugging effort — and that complexity needs to be justified by a concrete benefit.

A single agent is the right choice when the task is well-defined, the number of tools is manageable (say, under 15-20), and the full context of the task fits within one context window. A coding assistant that reads files, writes code, and runs tests? One agent. A customer-support bot that looks up orders and processes returns? One agent. A research assistant that searches the web and summarises results? Probably one agent.

Multi-agent becomes genuinely valuable in four situations:

  • Genuine parallelism: the task involves multiple independent subtasks that can run simultaneously. Searching five databases, processing five documents, or generating five alternatives — these benefit from parallel agents because the wall-clock time is the time of the slowest agent, not the sum.
  • Different security contexts: agents that need different permissions shouldn't share the same tool set. An agent that reads customer data shouldn't have access to the deployment pipeline. An agent that writes code shouldn't have access to the production database. Separate agents with separate tool sets enforce these boundaries naturally.
  • Context overflow: when the conversation history, tool results, and task context genuinely exceed what one agent can track. A complex codebase refactoring might involve hundreds of files — an orchestrator that delegates to file-level workers keeps each worker's context manageable.
  • Adversarial verification: when you need one agent to check another's work. A code-writing agent and a code-reviewing agent with different system prompts catch different classes of errors. A proposal agent and a critique agent converge on better answers than either would produce alone.

If none of these apply, start with a single agent. You can always decompose it into multiple agents later if the single agent hits a wall. The reverse — simplifying a multi-agent system back down to a single agent — is much harder because you've already built the coordination infrastructure.

💡 The field is still figuring out when multi-agent is genuinely better versus when it's unnecessary complexity. A useful heuristic: if you can describe your multi-agent system as "Agent A does X, then Agent B does Y", ask yourself whether a single agent with the instruction "First do X, then do Y" would work just as well. Often it does.

Quiz

Test your understanding of multi-agent systems, orchestration patterns, and when to use them.

In the hierarchical (manager/worker) orchestration pattern, what is the manager agent's primary responsibility?

In OpenAI's Swarm framework, how is a handoff between agents implemented?

Which commonly cited benefit of multi-agent systems is NOT, on its own, a strong reason to prefer them over a single agent?

What distinguishes LangGraph's approach to multi-agent orchestration from CrewAI's?