Why is AI agent observability different from traditional monitoring?
Traditional application monitoring answers simple questions. Did the request succeed? How long did it take? Was the response code 200 or 500?
AI agents break this model. A "successful" agent response (200, fast, no errors) can be completely wrong. The agent confidently summarized a document that doesn't exist. It generated code that compiles but violates your architecture. It answered a customer question with plausible but fabricated information. Standard HTTP monitoring sees a healthy system. Your users see garbage.
This is why the Air Canada chatbot ruling matters (Moffatt v. Air Canada, BC Civil Resolution Tribunal, February 2024). Air Canada's chatbot gave a customer incorrect bereavement fare information. Air Canada argued the chatbot was a separate entity. The tribunal rejected this: "It should be obvious to Air Canada that it is responsible for all the information on its website." The damages were small (C$812). The precedent is enormous: you're liable for what your agents say, whether you monitored them or not.
What does the adoption landscape actually look like?
LangChain's State of Agent Engineering survey (1,340 respondents, late 2025) found that 57.3% of organizations now have agents in production. Among those teams, observability adoption varies: 89% have some form of monitoring in place, but only 62% capture detailed step-by-step tracing.
The gap between "some monitoring" (89%) and "detailed tracing" (62%) is where agents fail silently. You know the agent ran. You can't tell why it produced bad output.
Cleanlab's survey of 95 production teams (August 2025, screened from 1,837 respondents) found that fewer than 1 in 3 teams are satisfied with their observability and guardrail solutions. 62% plan to improve observability in the next year, making it the most urgent investment area. 70% of regulated enterprises rebuild their AI agent stack quarterly or faster.
Gartner projects that LLM observability will reach 50% of GenAI deployments by 2028, up from 15% today (Pankaj Prasad, March 2026). As Prasad put it: "As enterprises scale GenAI, the trust requirement grows faster than the technology itself."
Meanwhile, Gartner also predicts that over 40% of agentic AI projects will be canceled by end of 2027 (Anushree Verma, June 2025). The stated reasons: escalating costs, unclear business value, and inadequate risk controls. All three are observability problems.
What does an AI agent observability stack look like?
Three layers, each answering a different question.
Layer 1: Tracing (What did the agent do?)
Distributed tracing for AI agents works like distributed tracing for microservices, but with agent-specific semantics. Every agent invocation creates a trace span with parent/child relationships, token counts, model used, latency, and tool calls. When an orchestrator delegates to sub-agents, the trace forms a tree you can inspect.
The OpenTelemetry GenAI semantic conventions are the emerging standard, with contributors from Amazon, Elastic, Google, IBM, and Microsoft. As of 2026, they're still in "Development" status, not yet stable. That's itself a data point: the industry standard for AI observability isn't finalized yet. Teams are building on evolving foundations.
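To make the span-tree idea concrete, here is a minimal sketch using plain Python dataclasses. The `Span` class is a stand-in for a real tracing SDK, not any vendor's API; the attribute keys (`gen_ai.request.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`) follow the still-evolving OpenTelemetry GenAI semantic conventions, and the agent names and token counts are hypothetical.

```python
from dataclasses import dataclass, field

# Stand-in for a tracing SDK span. Attribute keys follow the (still
# in-development) OpenTelemetry GenAI semantic conventions.
@dataclass
class Span:
    name: str
    attributes: dict
    children: list = field(default_factory=list)

    def child(self, name: str, attributes: dict) -> "Span":
        # Parent/child linkage is what turns delegation into a tree.
        span = Span(name, attributes)
        self.children.append(span)
        return span

    def total_tokens(self) -> int:
        # Sum input + output tokens across the whole sub-tree.
        own = (self.attributes.get("gen_ai.usage.input_tokens", 0)
               + self.attributes.get("gen_ai.usage.output_tokens", 0))
        return own + sum(c.total_tokens() for c in self.children)

# An orchestrator delegating to two sub-agents forms a tree you can inspect.
root = Span("orchestrator", {"gen_ai.request.model": "gpt-4o",
                             "gen_ai.usage.input_tokens": 900,
                             "gen_ai.usage.output_tokens": 200})
research = root.child("research_agent", {"gen_ai.usage.input_tokens": 4000,
                                         "gen_ai.usage.output_tokens": 800})
research.child("tool:web_search", {})  # tool calls appear as leaf spans
root.child("code_agent", {"gen_ai.usage.input_tokens": 1500,
                          "gen_ai.usage.output_tokens": 600})

print(root.total_tokens())  # 8000
```

Walking this tree answers the Layer 1 question directly: which sub-agent ran, in what order, and what each step consumed.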
Production tooling exists. Datadog launched LLM Observability in GA (June 2024) with end-to-end tracing across LLM chains, token usage, and cost tracking, integrated with existing APM. Langfuse (24.7K GitHub stars, MIT licensed, YC W23) provides open-source tracing with agent graph visualization. Braintrust (used by Notion, Stripe, Zapier, Vercel) combines tracing with evaluation. Notion reported going from fixing 3 issues per day to 30 after adopting Braintrust's observability.
Research from Dong, Lu & Zhu (AgentOps: Enabling Observability of LLM Agents, November 2024) identified the core gap: most DevOps tools for agents focus on LLM-specific metrics and prompt management with "limited support for agent-specific artifacts like goals, plans, and tools." Tracing the LLM call isn't enough. You need to trace the agent's reasoning: what goal was it pursuing, what plan did it form, which tools did it choose, and why.
Layer 2: Evaluation (Was the output good?)
Tracing tells you what happened. Evaluation tells you whether it should have happened. This is where pass@k vs pass^k from our Evals & Guardrails deep dive becomes operational.
In production, you run two types of evaluations:
Online evaluation: Check every output against guardrails before it reaches the user. Does it contain hallucinations? Does it violate brand guidelines? Does it include data above the user's clearance level? These run as fast classifiers (often a smaller model) on the output path.
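A minimal sketch of an online guardrail on the output path, under stated assumptions: the `guard` function, the blocked-phrase list, and the clearance levels are all hypothetical stand-ins, and a production system would typically use a small classifier model rather than string rules.

```python
# Hypothetical guardrail stubs: real deployments usually run a fast
# classifier model here, not keyword rules.
BLOCKED_PHRASES = ("as an ai language model",)  # brand-guideline stub

def violates_brand_guidelines(text: str) -> bool:
    return any(p in text.lower() for p in BLOCKED_PHRASES)

def exceeds_clearance(user_clearance: int, doc_clearance: int) -> bool:
    # Block output derived from documents above the user's clearance.
    return doc_clearance > user_clearance

def guard(text: str, user_clearance: int, doc_clearance: int) -> dict:
    """Run every check before the output reaches the user."""
    if violates_brand_guidelines(text):
        return {"allowed": False, "reason": "brand_guidelines"}
    if exceeds_clearance(user_clearance, doc_clearance):
        return {"allowed": False, "reason": "data_clearance"}
    return {"allowed": True, "reason": None}

print(guard("Your refund is on its way.", user_clearance=1, doc_clearance=1))
print(guard("As an AI language model, I cannot...", user_clearance=1, doc_clearance=1))
```

The design point is that `guard` sits on the response path: the output is checked every time, and a failed check returns a machine-readable reason you can trace and alert on.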
Offline evaluation: Sample agent outputs weekly and grade them against known-good answers. Track scores over time. A declining score signals drift, and drift in an AI system is the equivalent of a slow memory leak: invisible until it causes an incident.
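The offline loop can be sketched as a drift check over weekly scores. The threshold, window size, and score values below are illustrative assumptions, not recommendations.

```python
from statistics import mean

def detect_drift(weekly_scores, baseline, tolerance=0.05, window=3):
    """Flag drift when the rolling mean of recent weekly eval scores
    falls more than `tolerance` below baseline. Thresholds are
    illustrative, not recommendations."""
    if len(weekly_scores) < window:
        return False  # not enough samples to judge
    recent = mean(weekly_scores[-window:])
    return recent < baseline - tolerance

# Hypothetical weekly grades (fraction of sampled outputs matching
# known-good answers). Each week-over-week drop looks harmless;
# the trend does not.
scores = [0.91, 0.90, 0.88, 0.86, 0.84, 0.82]
print(detect_drift(scores, baseline=0.90))  # True: drift flagged
```

The point of the rolling window is exactly the memory-leak analogy: no single week triggers anything, but the aggregate trend does.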
Layer 3: Cost attribution (What did it cost?)
Token spend is the compute bill of an AI system, and without per-agent attribution, you're flying blind.
Helicone (YC W23, open-source) showed that prompt optimization and caching alone yield 30-50% cost reduction, and up to 85% with fine-tuning. But you can't optimize what you can't measure.
Cost attribution needs to track at three levels: Per agent (which agents are expensive? Is the research agent consuming 10x the tokens of the code agent?), Per task (what does a single user request cost end-to-end across all agents involved?), and Per client (in multi-tenant deployments, what's the cost per client per month?).
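As a sketch of three-level attribution, here is a minimal ledger that tags every model call with agent, task, and client, then rolls spend up by any of the three. The prices, agent names, and token counts are hypothetical; real per-token prices vary by model and vendor.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens; check your vendor's pricing.
PRICE_PER_M = {"gpt-4o": {"in": 2.50, "out": 10.00}}

class CostLedger:
    """One row per model call, tagged with agent, task, and client,
    so spend can be rolled up at all three levels."""
    def __init__(self):
        self.rows = []

    def record(self, agent, task_id, client, model, tokens_in, tokens_out):
        p = PRICE_PER_M[model]
        cost = (tokens_in * p["in"] + tokens_out * p["out"]) / 1_000_000
        self.rows.append({"agent": agent, "task": task_id,
                          "client": client, "cost": cost})

    def by(self, key):
        # Aggregate cost by "agent", "task", or "client".
        totals = defaultdict(float)
        for row in self.rows:
            totals[row[key]] += row["cost"]
        return dict(totals)

ledger = CostLedger()
ledger.record("research", "task-1", "acme", "gpt-4o", 40_000, 8_000)
ledger.record("code",     "task-1", "acme", "gpt-4o", 4_000,  1_000)
print(ledger.by("agent"))   # per-agent: is research ~10x the code agent?
print(ledger.by("client"))  # per-tenant spend for chargeback
```

The same rows answer all three questions; the only design requirement is that every call is tagged at record time, because you can't re-attribute spend after the fact.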
Budget enforcement matters too. Hard limits prevent a runaway agent from burning through your monthly budget in an afternoon. Soft limits trigger alerts when spending spikes above 2x baseline.
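A minimal sketch of both limit types, assuming a simple fixed daily baseline; the class name, limits, and the choice to raise on the hard limit are all illustrative.

```python
class BudgetGuard:
    """Hard limit halts further spend; soft limit flags when daily
    spend exceeds 2x baseline. Numbers here are illustrative."""
    def __init__(self, hard_limit_usd: float, baseline_daily_usd: float):
        self.hard_limit = hard_limit_usd
        self.baseline = baseline_daily_usd
        self.spent_total = 0.0
        self.spent_today = 0.0

    def charge(self, cost_usd: float) -> str:
        # Hard limit: refuse the charge entirely, stopping the agent.
        if self.spent_total + cost_usd > self.hard_limit:
            raise RuntimeError("hard budget limit reached: halting agent")
        self.spent_total += cost_usd
        self.spent_today += cost_usd
        # Soft limit: allow the charge but raise an alert.
        if self.spent_today > 2 * self.baseline:
            return "alert: daily spend above 2x baseline"
        return "ok"

guard = BudgetGuard(hard_limit_usd=100.0, baseline_daily_usd=5.0)
print(guard.charge(4.0))   # ok
print(guard.charge(8.0))   # alert: 12.0 > 2 * 5.0
```

Checking the hard limit before applying the charge is the point: a runaway loop fails closed instead of overshooting the budget by one more call.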
Why is "you can't manage what you can't see" literally true for agents?
Traditional software is deterministic. The same input produces the same output. If something breaks, you reproduce it.
AI agents are stochastic. The same input can produce different outputs. An agent that worked perfectly yesterday might hallucinate today because the context window contained slightly different information. Without tracing that shows exactly what the agent saw, what it decided, and what it produced, debugging is guesswork.
The academic research frames this as a taxonomy problem: agent observability needs to capture not just system metrics (latency, tokens, cost) but pipeline artifacts (goals, plans, tool selections, intermediate reasoning). Most tools capture the first. Few capture the second. The second is what you need to debug a wrong answer.
What the research says
"As enterprises scale GenAI, the trust requirement grows faster than the technology itself. XAI provides visibility into why a model responded a certain way, while LLM observability validates how that response was generated and whether it can be relied on." (Pankaj Prasad, Gartner)
"Most agentic AI projects right now are early-stage experiments or proof of concepts that are mostly driven by hype." (Anushree Verma, Gartner)
"It should be obvious to Air Canada that it is responsible for all the information on its website." (BC Civil Resolution Tribunal, Moffatt v. Air Canada)
Our take
The observability gap is the biggest risk in enterprise AI right now. Not model quality. Not cost. Not security. The risk is that most teams can't trace what their agents are actually doing in production.
The teams we've seen reach production successfully share one pattern: they invest in observability before they invest in more agents. Tracing comes before the second agent. Cost attribution comes before the third client. Evaluation pipelines come before the first production deployment. The teams that skip this end up in the 40% Gartner predicts will cancel their projects.
The three-layer stack (tracing, evaluation, cost attribution) isn't expensive. Open-source tools like Langfuse and Helicone cover tracing and cost. Braintrust covers evaluation. OpenTelemetry is converging on the standard. The investment isn't tooling. It's the discipline to instrument before you scale.
Key takeaway
You can't manage agents you can't see. 89% of teams have some monitoring, but only 62% can trace individual steps, and fewer than 1 in 3 are satisfied. Build three layers: tracing (what the agent did), evaluation (whether it was good), and cost attribution (what it cost). Do this before you add more agents, not after.
FAQ
Why can't I just use my existing APM tools for AI agents?
Existing tools (Datadog, New Relic) track system metrics: latency, error rates, throughput. AI agents need semantic monitoring: what goal was the agent pursuing, what plan did it form, which tools did it use, and was the output correct? A "successful" agent call (200 status, fast response) can produce completely wrong output. Standard APM doesn't catch this.
What is distributed tracing for AI agents?
Every agent invocation creates a trace span with parent/child relationships, tokens consumed, model used, tool calls, and latency. When an orchestrator delegates to sub-agents, the trace forms a tree. You can follow a single user request from the orchestrator through every sub-agent decision to the final output. OpenTelemetry GenAI semantic conventions are the emerging standard.
How do you evaluate agent output quality in production?
Two approaches. Online: run fast guardrail checks on every output before it reaches the user (hallucination detection, compliance checks, data leakage filters). Offline: sample outputs weekly and grade them against known-good answers. Track scores over time. A declining score is the AI equivalent of a slow memory leak.
What does cost attribution look like for AI agents?
Track token usage and cost at three levels: per agent (which agents are expensive), per task (what does a user request cost end-to-end), and per client (in multi-tenant systems). Set hard budget limits to prevent runaway agents, and soft limits that alert when spending exceeds 2x baseline.
Is the OpenTelemetry standard ready for AI agents?
Not yet. The GenAI semantic conventions are in "Development" status as of 2026, with contributors from Amazon, Google, IBM, Microsoft, and Elastic. The standard covers LLM call tracing, token counts, and model attributes, but agent-level semantics (goals, plans, tool selection reasoning) are still being defined.
What's the cost of not monitoring AI agents?
The Air Canada chatbot ruling (February 2024) established that companies are legally liable for what their AI says, whether monitored or not. Gartner projects 40%+ of agentic AI projects will be canceled by 2027, with "inadequate risk controls" as one of three causes. The cost isn't just incidents. It's abandoned projects after months of investment.
What open-source tools exist for AI agent observability?
Langfuse (24.7K GitHub stars, MIT licensed) for tracing and agent graph visualization. Helicone (YC W23) for cost monitoring and optimization. Braintrust for evaluation with 25+ built-in scorers. All integrate with major frameworks (LangChain, LlamaIndex, OpenAI SDK).