When does a single agent stop being enough?
A single AI agent works well for focused tasks. Draft this document. Analyze this dataset. Review this pull request. The problems start when the task requires breadth, multiple tool sets, or steps that could run in parallel.
As Anthropic's engineering team put it after working with dozens of enterprise teams: the most successful implementations use "simple, composable patterns rather than complex frameworks" (Schluntz & Zhang, December 2024). They define a clear split: workflows are predefined code paths where the steps are known in advance; agents are LLM-directed processes where the model decides what to do next. Most production systems need workflows, not agents.
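To make the split concrete, here is a minimal sketch. The `llm` callable and the tool names are placeholders, not any framework's real API; the point is that a workflow's control flow lives in your code, while an agent's lives in the model.

```python
from typing import Callable

Llm = Callable[[str], str]  # placeholder: any text-in, text-out model call

def run_workflow(llm: Llm, document: str) -> str:
    """Workflow: the steps are known in advance; the model fills them in."""
    summary = llm(f"Summarize:\n{document}")
    critique = llm(f"List weaknesses in this summary:\n{summary}")
    return llm(f"Rewrite the summary to address:\n{critique}")

def run_agent(llm: Llm, task: str, tools: dict[str, Callable[[str], str]]) -> str:
    """Agent: the model picks each next step; the loop ends when it says so."""
    transcript = task
    for _ in range(10):  # hard cap so a confused model cannot loop forever
        choice = llm(
            f"Progress so far:\n{transcript}\n"
            f"Pick one tool from {sorted(tools)} or say DONE."
        ).strip()
        if choice == "DONE":
            break
        tool = tools.get(choice, lambda _: "unknown tool requested")
        transcript += "\n" + tool(transcript)  # model-chosen, not a fixed path
    return llm(f"Produce the final answer:\n{transcript}")
```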
The single-agent ceiling shows up in practice as tools accumulate. Anthropic's research on tool design found that "more tools don't always lead to better outcomes" (Aizawa, September 2025). As instruction complexity grows, adherence to specific rules degrades and error rates compound. A five-server MCP setup can have 58 tools consuming roughly 55,000 tokens before any conversation even starts.
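A back-of-envelope for where a figure like that comes from, assuming the common (and rough) ~4 characters-per-token heuristic; the numbers are illustrative:

```python
import json

def estimate_tool_tokens(tool_schemas: list[dict]) -> int:
    """Rough token count for tool definitions sent with every request."""
    return len(json.dumps(tool_schemas)) // 4  # ~4 chars/token heuristic

# 58 tools averaging ~950 tokens each of name, description, and JSON
# schema is ~55,000 tokens of context before the user says anything.
print(58 * 950)  # 55100
```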
That's when you split. Not because multi-agent is trendy, but because the task genuinely requires different expertise, different tool sets, or parallel execution.
What does multi-agent orchestration actually look like?
Anthropic published the most concrete data on this in June 2025. Their multi-agent research system used Claude Opus 4 as a lead agent coordinating Claude Sonnet 4 sub-agents. The results were significant: the multi-agent system outperformed single-agent Claude Opus 4 by 90.2% on their internal research evaluation. Parallel tool calling cut research time by up to 90% for complex queries. Token usage explained 80% of performance variance. And tool description improvements alone yielded a 40% decrease in task completion time.
But the architecture isn't free. Multi-agent systems consume roughly 15x more tokens than single-agent chats (Hadfield, Zhang, Lien et al., Anthropic, June 2025). The coordination overhead is real.
Of the five patterns that cover the majority of production use cases, one dominates deployments: orchestrator-worker. Microsoft's Magentic-One system (Fourney et al., November 2024) uses five agents (an Orchestrator plus WebSurfer, FileSurfer, Coder, and ComputerTerminal) and achieves competitive performance on the GAIA and WebArena benchmarks without per-task modification. The orchestrator decides what to do. The workers do it.
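A minimal sketch of the pattern's shape, with the same placeholder `llm` callable as above; this is the generic pattern, not Magentic-One's implementation:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def orchestrate(llm: Callable[[str], str],
                workers: dict[str, Callable[[str], str]],
                task: str) -> str:
    # 1. The orchestrator plans: one "worker: subtask" line per step.
    plan = llm(f"Split this task into subtasks, one per line as "
               f"'worker: subtask'. Workers: {sorted(workers)}. Task: {task}")
    assignments = [line.split(":", 1) for line in plan.splitlines()
                   if ":" in line and line.split(":", 1)[0].strip() in workers]

    # 2. Workers execute independent subtasks in parallel.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(
            lambda a: workers[a[0].strip()](a[1].strip()), assignments))

    # 3. The orchestrator synthesizes worker outputs into one answer.
    return llm(f"Combine these results into a final answer for '{task}':\n"
               + "\n".join(results))
```

The load-bearing design choice is the one-way dependency: workers never call each other or the orchestrator, which is what keeps the authority chain, and the debugging story, simple.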
How do you choose a coordination topology?
Three topologies cover most real-world deployments:
Centralized (supervisor). One agent receives every task, decomposes it, delegates to specialists, and synthesizes results. Clear authority chain, full visibility, easy debugging. The weakness: single point of failure, and the supervisor's context window limits how many parallel workflows it can track. This is the dominant pattern in production today.
Hierarchical (tree). Multi-level structure with a top orchestrator, mid-level supervisors, and specialist agents. Combines the control of centralized with the scalability needed when you have 10+ agents. MultiAgentBench (Zhu et al., March 2025) found that a graph coordination topology (agents connected in a flexible graph rather than a rigid tree) performed best across their benchmark tasks.
Decentralized (peer-to-peer). Agents operate autonomously and communicate directly. No single point of failure, scales horizontally. Google's A2A protocol (April 2025, launched with 50+ technology partners including Salesforce, SAP, and ServiceNow) enables this pattern by letting agents advertise capabilities via standardized Agent Cards and discover each other through capability matching. The weakness: much harder to debug, and consensus between agents is expensive.
The choice depends on your needs. If you need auditability and control: centralized. If you're coordinating 10+ specialized agents across domains: hierarchical. If agents come from different organizations or need to interoperate across company boundaries: decentralized with A2A.
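For the decentralized option, capability discovery is the load-bearing piece. Below is a simplified sketch of an Agent Card plus naive capability matching; the field names are paraphrased from the public A2A spec rather than copied exactly, and the endpoint is hypothetical:

```python
# An agent advertises what it can do; peers match on skills.
AGENT_CARD = {
    "name": "research-agent",
    "description": "Deep-dives a topic and returns cited findings",
    "url": "https://agents.example.com/research",  # hypothetical endpoint
    "skills": ["web_search", "summarization", "citation"],
}

def find_agent(cards: list[dict], needed_skill: str) -> dict | None:
    """Naive matching: the first card advertising the skill wins."""
    return next((c for c in cards if needed_skill in c.get("skills", [])),
                None)
```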
Why is this a distributed systems problem?
Because errors compound and failures cascade. In a five-step workflow where each step has 99% reliability, the end-to-end success rate is 95%. At ten steps, 90%. Real-world agent workflows involve dozens of operations (Inngest, February 2026).
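The arithmetic decays faster than intuition suggests, which is worth checking directly:

```python
# End-to-end success of an n-step chain where each step independently
# succeeds with probability p: p ** n.
def chain_success(p: float, n: int) -> float:
    return p ** n

print(f"{chain_success(0.99, 5):.3f}")   # 0.951 -> ~95% at five steps
print(f"{chain_success(0.99, 10):.3f}")  # 0.904 -> ~90% at ten steps
print(f"{chain_success(0.99, 50):.3f}")  # 0.605 -> dozens of steps bite hard
```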
Research on multi-agent system failures found that framework-level design choices alone can increase latency by over 100x and reduce planning accuracy by up to 30% (Zhu et al., companion paper to MultiAgentBench, 2025). And the failure modes are different from traditional distributed systems. In a microservice architecture, a failing service returns an error code. In a multi-agent system, a failing agent returns a plausible-sounding wrong answer that passes validation checks.
This is why durable execution matters. Frameworks like Temporal (Davis, July 2025) and Inngest provide automatic state persistence, checkpoint replay, and exactly-once execution semantics. If an agent fails mid-workflow, the system resumes from the last checkpoint instead of restarting from scratch. As Temporal's engineering team describes it: AI applications are "distributed systems on steroids."
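Conceptually, durable execution reduces to "persist completed steps, replay on restart." A toy sketch, with a JSON file standing in for the state store and the constraint that step results are JSON-serializable; real frameworks add retries, timers, and exactly-once semantics on top:

```python
import json
import pathlib

def run_durably(steps, checkpoint_file="workflow_state.json"):
    """Run (name, fn) steps in order, checkpointing each result."""
    path = pathlib.Path(checkpoint_file)
    state = json.loads(path.read_text()) if path.exists() else {}
    for name, step in steps:
        if name in state:            # finished on a previous run: replay, skip
            continue
        state[name] = step(state)    # a crash here loses only this one step
        path.write_text(json.dumps(state))  # checkpoint after every step
    return state

# Usage sketch: if "draft" fails, a rerun resumes after "research".
# run_durably([("research", do_research), ("draft", do_draft)])
```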
Confluent's engineering team (Falconer & Sellers, February 2025) advocates event-driven patterns for multi-agent coordination: agents communicate through Kafka topics rather than direct dependencies, the immutable event log serves as a single source of truth, and the system supports replay and recovery by design.
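A minimal sketch of that handoff using the kafka-python client; the topic names and payloads are illustrative:

```python
import json
from kafka import KafkaConsumer, KafkaProducer

# The research agent publishes findings instead of calling anyone.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode())
producer.send("research.findings", {"task_id": "t-42", "summary": "..."})
producer.flush()

# The writer agent consumes findings at its own pace; replaying the
# topic from offset zero reconstructs the full history on demand.
consumer = KafkaConsumer(
    "research.findings",
    bootstrap_servers="localhost:9092",
    group_id="writer-agents",
    value_deserializer=lambda b: json.loads(b.decode()))
for event in consumer:  # blocks, waiting for new findings
    draft_prompt = f"Draft a report from: {event.value['summary']}"
```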
What does the production landscape actually look like?
LangChain's State of Agent Engineering survey (1,340 respondents, late 2025) found that 57.3% of organizations now have agents in production. But quality is the top blocker at 32%, followed by latency at 20%.
The observability gap is telling: 89% report some form of monitoring, but only 62% have detailed tracing. Among production deployments specifically, 94% have observability and 71.5% have full tracing. The teams that make it to production invest in visibility. The teams that don't invest get stuck.
Klarna provides the clearest enterprise case study for AI agents at scale. Their AI assistant handled 2.3 million customer service conversations in its first month (two-thirds of all customer service), cut resolution time from 11 minutes to under 2 minutes, and reduced repeat inquiries by 25%. The estimated profit improvement: $40 million annually. That's a single-agent deployment. Multi-agent orchestration at enterprise scale is still emerging.
What the research says
"While building our agent for SWE-bench, we actually spent more time optimizing our tools than the overall prompt."
"Token usage explains 80% of performance variance."
"AI applications are distributed systems on steroids."
Our take
The instinct when a single agent struggles is to add more agents. Sometimes that's right. If you need parallel research across five topics, five sub-agents running simultaneously is genuinely 5x faster. If you need code generation reviewed by a separate quality agent, that separation of concerns produces better output.
But in our experience, most "multi-agent" problems are context and tooling problems in disguise. Before splitting into multiple agents, ask: does the agent have the right context? Are the tool descriptions clear enough? Is the prompt decomposing the task properly? Anthropic found that tool description improvements alone cut task completion time by 40%, without adding a single agent.
When you do go multi-agent, treat it as a distributed systems problem from day one. Invest in durable execution so failures don't cascade. Invest in tracing so you can debug across agent boundaries. And invest in cost tracking, because 15x the token usage adds up fast. The teams that succeed aren't the ones with the most agents. They're the ones with the best orchestration infrastructure.
Key takeaway
Multi-agent systems outperform single agents by 90% on complex tasks, but consume 15x more tokens and introduce distributed systems complexity. Start with one agent. Split only when you have a clear reason: parallel execution, distinct tool sets, or separation of concerns. And invest in orchestration infrastructure (durable execution, tracing, cost tracking) before you invest in more agents.
FAQ
When should I split a single agent into multiple agents?
When the task requires genuine parallelism (five independent research threads), distinct tool sets (a code agent and a design agent need different tools), or separation of concerns (one agent generates, another evaluates). Don't split just because the task is hard. First check if the single agent has the right context and tools.
How many agents can you realistically coordinate?
A centralized supervisor can typically manage 5-7 direct sub-agents before context window limitations and coordination overhead become significant. Beyond that, hierarchical architectures with mid-level supervisors are needed. Research found that over 75% of multi-agent systems become difficult to manage past five agents.
What's the cost of multi-agent vs single-agent?
Anthropic reported that multi-agent systems consume roughly 15x more tokens than equivalent single-agent chats. Parallelization costs N times more (where N is the number of parallel agents). The tradeoff is speed and quality: 90% less research time, 90.2% better performance, but significantly higher token bills.
What is durable execution and why does it matter for agents?
Durable execution means the system automatically saves checkpoints during a multi-step workflow. If an agent fails at step 7 of 10, it resumes from step 7 instead of restarting from step 1. Frameworks like Temporal and Inngest provide this. Without it, a 10-step workflow at 99% per-step reliability only succeeds 90% of the time.
What's the difference between MCP and A2A?
MCP (Model Context Protocol, Anthropic) connects agents to tools and data sources. A2A (Agent-to-Agent, Google) connects agents to other agents. MCP is how your agent talks to GitHub, Slack, or a database. A2A is how your research agent discovers and delegates to a specialized analysis agent. Together they form the interoperability stack.
How do you debug a multi-agent system?
Distributed tracing. Every agent invocation creates a trace with parent/child relationships, tokens consumed, cost, and success/failure. LangChain's survey found that among teams with agents in production, 94% have observability and 71.5% have full tracing. Without tracing, multi-agent debugging is guesswork.
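If you are not on a managed tracing product, even a minimal span record captures the fields that matter. A sketch with illustrative field names:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class AgentSpan:
    """One agent invocation; parent_id links child agents to callers."""
    agent: str
    parent_id: str | None = None  # None marks the root orchestrator
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    tokens_in: int = 0
    tokens_out: int = 0
    cost_usd: float = 0.0
    success: bool = True
```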
Is event-driven coordination better than direct orchestration?
Depends on the pattern. Direct orchestration (supervisor calls workers) is simpler and easier to debug. Event-driven (agents communicate through message topics) is more resilient and scales better, because agents don't have direct dependencies on each other. Confluent advocates event-driven patterns for production multi-agent systems, using Kafka's immutable event log as a single source of truth.