When does a single agent stop being enough?
A single AI agent works well for focused tasks. Draft this document. Analyze this dataset. Review this pull request. The problems start when the task requires breadth, multiple tool sets, or steps that could run in parallel.
As Anthropic's engineering team put it after working with dozens of enterprise teams: the most successful implementations use "simple, composable patterns rather than complex frameworks" (Schluntz & Zhang, December 2024). They define a clear split: workflows are predefined code paths where the steps are known in advance; agents are LLM-directed processes where the model decides what to do next. Most production systems need workflows, not agents.
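To make the split concrete, here is a minimal sketch. The `llm` callable and the tool names are placeholders, not any framework's real API; the point is that a workflow's control flow lives in your code, while an agent's lives in the model.

```python
from typing import Callable

Llm = Callable[[str], str]  # placeholder: any text-in, text-out model call

def run_workflow(llm: Llm, document: str) -> str:
    """Workflow: the steps are known in advance; the model fills them in."""
    summary = llm(f"Summarize:\n{document}")
    critique = llm(f"List weaknesses in this summary:\n{summary}")
    return llm(f"Rewrite the summary to address:\n{critique}")

def run_agent(llm: Llm, task: str, tools: dict[str, Callable[[str], str]]) -> str:
    """Agent: the model picks each next step; the loop ends when it says so."""
    transcript = task
    for _ in range(10):  # hard cap so a confused model cannot loop forever
        choice = llm(
            f"Progress so far:\n{transcript}\n"
            f"Pick one tool from {sorted(tools)} or say DONE."
        ).strip()
        if choice == "DONE":
            break
        tool = tools.get(choice, lambda _: "unknown tool requested")
        transcript += "\n" + tool(transcript)  # model-chosen, not a fixed path
    return llm(f"Produce the final answer:\n{transcript}")
```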
The single-agent ceiling shows up in practice as tools accumulate. Anthropic's research on tool design found that "more tools don't always lead to better outcomes" (Aizawa, September 2025). As instruction complexity grows, adherence to specific rules degrades and error rates compound. A five-server MCP setup can have 58 tools consuming roughly 55,000 tokens before any conversation even starts.
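A back-of-envelope for where a figure like that comes from, assuming the common (and rough) ~4 characters-per-token heuristic; the numbers are illustrative:

```python
import json

def estimate_tool_tokens(tool_schemas: list[dict]) -> int:
    """Rough token count for tool definitions sent with every request."""
    return len(json.dumps(tool_schemas)) // 4  # ~4 chars/token heuristic

# 58 tools averaging ~950 tokens each of name, description, and JSON
# schema is ~55,000 tokens of context before the user says anything.
print(58 * 950)  # 55100
```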
That's when you split. Not because multi-agent is trendy, but because the task genuinely requires different expertise, different tool sets, or parallel execution.
What does multi-agent orchestration actually look like?
Anthropic published the most concrete data on this in June 2025. Their multi-agent research system used Claude Opus 4 as a lead agent coordinating Claude Sonnet 4 sub-agents. The results were significant: the multi-agent system outperformed single-agent Claude Opus 4 by 90.2% on their internal research evaluation. Parallel tool calling cut research time by up to 90% for complex queries. Token usage explained 80% of performance variance. And tool description improvements alone yielded a 40% decrease in task completion time.
But the architecture isn't free. Multi-agent systems consume roughly 15x more tokens than single-agent chats (Hadfield, Zhang, Lien et al., Anthropic, June 2025). The coordination overhead is real.
Of the five patterns that cover the majority of production use cases, one dominates deployments: orchestrator-worker. Microsoft's Magentic-One system (Fourney et al., November 2024) uses five agents (an Orchestrator plus WebSurfer, FileSurfer, Coder, and ComputerTerminal) and achieves competitive performance on the GAIA and WebArena benchmarks without per-task modification. The orchestrator decides what to do. The workers do it.
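A minimal sketch of the pattern's shape, with the same placeholder `llm` callable as above; this is the generic pattern, not Magentic-One's implementation:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def orchestrate(llm: Callable[[str], str],
                workers: dict[str, Callable[[str], str]],
                task: str) -> str:
    # 1. The orchestrator plans: one "worker: subtask" line per step.
    plan = llm(f"Split this task into subtasks, one per line as "
               f"'worker: subtask'. Workers: {sorted(workers)}. Task: {task}")
    assignments = [line.split(":", 1) for line in plan.splitlines()
                   if ":" in line and line.split(":", 1)[0].strip() in workers]

    # 2. Workers execute independent subtasks in parallel.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(
            lambda a: workers[a[0].strip()](a[1].strip()), assignments))

    # 3. The orchestrator synthesizes worker outputs into one answer.
    return llm(f"Combine these results into a final answer for '{task}':\n"
               + "\n".join(results))
```

The load-bearing design choice is the one-way dependency: workers never call each other or the orchestrator, which is what keeps the authority chain, and the debugging story, simple.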
How do you choose a coordination topology?
Three topologies cover most real-world deployments:
Centralized (supervisor). One agent receives every task, decomposes it, delegates to specialists, and synthesizes results. Clear authority chain, full visibility, easy debugging. The weakness: single point of failure, and the supervisor's context window limits how many parallel workflows it can track. This is the dominant pattern in production today.
Hierarchical (tree). Multi-level structure with a top orchestrator, mid-level supervisors, and specialist agents. Combines the control of centralized with the scalability needed when you have 10+ agents. MultiAgentBench (Zhu et al., March 2025) found that a graph coordination topology (agents connected in a flexible graph rather than a rigid tree) performed best across their benchmark tasks.
Decentralized (peer-to-peer). Agents operate autonomously and communicate directly. No single point of failure, scales horizontally. Google's A2A protocol (April 2025, launched with 50+ technology partners including Salesforce, SAP, and ServiceNow) enables this pattern by letting agents advertise capabilities via standardized Agent Cards and discover each other through capability matching. The weakness: much harder to debug, and consensus between agents is expensive.
The choice depends on your needs. If you need auditability and control: centralized. If you're coordinating 10+ specialized agents across domains: hierarchical. If agents come from different organizations or need to interoperate across company boundaries: decentralized with A2A.
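For the decentralized option, capability discovery is the load-bearing piece. Below is a simplified sketch of an Agent Card plus naive capability matching; the field names are paraphrased from the public A2A spec rather than copied exactly, and the endpoint is hypothetical:

```python
# An agent advertises what it can do; peers match on skills.
AGENT_CARD = {
    "name": "research-agent",
    "description": "Deep-dives a topic and returns cited findings",
    "url": "https://agents.example.com/research",  # hypothetical endpoint
    "skills": ["web_search", "summarization", "citation"],
}

def find_agent(cards: list[dict], needed_skill: str) -> dict | None:
    """Naive matching: the first card advertising the skill wins."""
    return next((c for c in cards if needed_skill in c.get("skills", [])),
                None)
```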
Why is this a distributed systems problem?
Because errors compound and failures cascade. In a five-step workflow where each step has 99% reliability, the end-to-end success rate is 95%. At ten steps, 90%. Real-world agent workflows involve dozens of operations (Inngest, February 2026).
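The arithmetic decays faster than intuition suggests, which is worth checking directly:

```python
# End-to-end success of an n-step chain where each step independently
# succeeds with probability p: p ** n.
def chain_success(p: float, n: int) -> float:
    return p ** n

print(f"{chain_success(0.99, 5):.3f}")   # 0.951 -> ~95% at five steps
print(f"{chain_success(0.99, 10):.3f}")  # 0.904 -> ~90% at ten steps
print(f"{chain_success(0.99, 50):.3f}")  # 0.605 -> dozens of steps bite hard
```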
Research on multi-agent system failures found that framework-level design choices alone can increase latency by over 100x and reduce planning accuracy by up to 30% (Zhu et al., companion paper to MultiAgentBench, 2025). And the failure modes are different from traditional distributed systems. In a microservice architecture, a failing service returns an error code. In a multi-agent system, a failing agent returns a plausible-sounding wrong answer that passes validation checks.
This is why durable execution matters. Frameworks like Temporal (Davis, July 2025) and Inngest provide automatic state persistence, checkpoint replay, and exactly-once execution semantics. If an agent fails mid-workflow, the system resumes from the last checkpoint instead of restarting from scratch. As Temporal's engineering team describes it: AI applications are "distributed systems on steroids."
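Conceptually, durable execution reduces to "persist completed steps, replay on restart." A toy sketch, with a JSON file standing in for the state store and the constraint that step results are JSON-serializable; real frameworks add retries, timers, and exactly-once semantics on top:

```python
import json
import pathlib

def run_durably(steps, checkpoint_file="workflow_state.json"):
    """Run (name, fn) steps in order, checkpointing each result."""
    path = pathlib.Path(checkpoint_file)
    state = json.loads(path.read_text()) if path.exists() else {}
    for name, step in steps:
        if name in state:            # finished on a previous run: replay, skip
            continue
        state[name] = step(state)    # a crash here loses only this one step
        path.write_text(json.dumps(state))  # checkpoint after every step
    return state

# Usage sketch: if "draft" fails, a rerun resumes after "research".
# run_durably([("research", do_research), ("draft", do_draft)])
```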
Confluent's engineering team (Falconer & Sellers, February 2025) advocates event-driven patterns for multi-agent coordination: agents communicate through Kafka topics rather than direct dependencies, the immutable event log serves as a single source of truth, and the system supports replay and recovery by design.
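A minimal sketch of that handoff using the kafka-python client; the topic names and payloads are illustrative:

```python
import json
from kafka import KafkaConsumer, KafkaProducer

# The research agent publishes findings instead of calling anyone.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode())
producer.send("research.findings", {"task_id": "t-42", "summary": "..."})
producer.flush()

# The writer agent consumes findings at its own pace; replaying the
# topic from offset zero reconstructs the full history on demand.
consumer = KafkaConsumer(
    "research.findings",
    bootstrap_servers="localhost:9092",
    group_id="writer-agents",
    value_deserializer=lambda b: json.loads(b.decode()))
for event in consumer:  # blocks, waiting for new findings
    draft_prompt = f"Draft a report from: {event.value['summary']}"
```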
What does the production landscape actually look like?
LangChain's State of Agent Engineering survey (1,340 respondents, late 2025) found that 57.3% of organizations now have agents in production. But quality is the top blocker at 32%, followed by latency at 20%.
The observability gap is telling: 89% report some form of monitoring, but only 62% have detailed tracing. Among production deployments specifically, 94% have observability and 71.5% have full tracing. The teams that make it to production invest in visibility. The teams that don't invest get stuck.
Klarna provides the clearest enterprise case study for AI agents at scale. Their AI assistant handled 2.3 million customer service conversations in its first month (two-thirds of all customer service), cut resolution time from 11 minutes to under 2 minutes, and reduced repeat inquiries by 25%. The estimated profit improvement: $40 million annually. That's a single-agent deployment. Multi-agent orchestration at enterprise scale is still emerging.
What the research says
"While building our agent for SWE-bench, we actually spent more time optimizing our tools than the overall prompt."
"Token usage explains 80% of performance variance."
"AI applications are distributed systems on steroids."
Our take
The instinct when a single agent struggles is to add more agents. Sometimes that's right. If you need parallel research across five topics, five sub-agents running simultaneously is genuinely 5x faster. If you need code generation reviewed by a separate quality agent, that separation of concerns produces better output.
But in our experience, most "multi-agent" problems are context and tooling problems in disguise. Before splitting into multiple agents, ask: does the agent have the right context? Are the tool descriptions clear enough? Is the prompt decomposing the task properly? Anthropic found that tool description improvements alone cut task completion time by 40%, without adding a single agent.
When you do go multi-agent, treat it as a distributed systems problem from day one. Invest in durable execution so failures don't cascade. Invest in tracing so you can debug across agent boundaries. And invest in cost tracking, because 15x the token usage adds up fast. The teams that succeed aren't the ones with the most agents. They're the ones with the best orchestration infrastructure.
Key takeaway
Multi-agent systems outperform single agents by 90% on complex tasks, but consume 15x more tokens and introduce distributed systems complexity. Start with one agent. Split only when you have a clear reason: parallel execution, distinct tool sets, or separation of concerns. And invest in orchestration infrastructure (durable execution, tracing, cost tracking) before you invest in more agents.
FAQ
When should I split a single agent into multiple agents?
When the task requires genuine parallelism (five independent research threads), distinct tool sets (a code agent and a design agent need different tools), or separation of concerns (one agent generates, another evaluates). Don't split just because the task is hard. First check if the single agent has the right context and tools.
How many agents can you realistically coordinate?
A centralized supervisor can typically manage 5-7 direct sub-agents before context window limitations and coordination overhead become significant. Beyond that, hierarchical architectures with mid-level supervisors are needed. Research found that over 75% of multi-agent systems become difficult to manage past five agents.
What's the cost of multi-agent vs single-agent?
Anthropic reported that multi-agent systems consume roughly 15x more tokens than equivalent single-agent chats. Parallelization costs N times more (where N is the number of parallel agents). The tradeoff is speed and quality: 90% less research time, 90.2% better performance, but significantly higher token bills.
What is durable execution and why does it matter for agents?
Durable execution means the system automatically saves checkpoints during a multi-step workflow. If an agent fails at step 7 of 10, it resumes from step 7 instead of restarting from step 1. Frameworks like Temporal and Inngest provide this. Without it, a 10-step workflow at 99% per-step reliability only succeeds 90% of the time.
What's the difference between MCP and A2A?
MCP (Model Context Protocol, Anthropic) connects agents to tools and data sources. A2A (Agent-to-Agent, Google) connects agents to other agents. MCP is how your agent talks to GitHub, Slack, or a database. A2A is how your research agent discovers and delegates to a specialized analysis agent. Together they form the interoperability stack.
How do you debug a multi-agent system?
Distributed tracing. Every agent invocation creates a trace with parent/child relationships, tokens consumed, cost, and success/failure. LangChain's survey found that among teams with agents in production, 94% have observability and 71.5% have full tracing. Without tracing, multi-agent debugging is guesswork.
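If you are not on a managed tracing product, even a minimal span record captures the fields that matter. A sketch with illustrative field names:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class AgentSpan:
    """One agent invocation; parent_id links child agents to callers."""
    agent: str
    parent_id: str | None = None  # None marks the root orchestrator
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    tokens_in: int = 0
    tokens_out: int = 0
    cost_usd: float = 0.0
    success: bool = True
```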
Is event-driven coordination better than direct orchestration?
Depends on the pattern. Direct orchestration (supervisor calls workers) is simpler and easier to debug. Event-driven (agents communicate through message topics) is more resilient and scales better, because agents don't have direct dependencies on each other. Confluent advocates event-driven patterns for production multi-agent systems, using Kafka's immutable event log as a single source of truth.