April 2026 · 7 min read

How Do You Monitor AI Agents You Can't See Inside?

89% of teams building AI agents have some form of observability. But only 62% can trace individual agent steps, and fewer than 1 in 3 production teams are satisfied with their monitoring. Unlike traditional software where a failed API call returns an error code, a failing AI agent returns a confident-sounding wrong answer. The observability stack for agents needs three layers: tracing (what did the agent do), evaluation (was the output good), and cost attribution (what did it cost, per agent, per task, per client).

DN · Founder, Astrohive
[Figure: Traditional monitoring vs. AI agent monitoring. Both panels report Status: 200 OK, Latency: 150ms, Errors: None. The traditional panel concludes "ALL SYSTEMS HEALTHY"; the agent panel adds Output: confident hallucination and concludes "OUTPUT WRONG, METRICS HEALTHY."]
Traditional metrics show a healthy system. The agent is confidently returning wrong answers.

Why is AI agent observability different from traditional monitoring?

Traditional application monitoring answers simple questions. Did the request succeed? How long did it take? Was the response code 200 or 500?

AI agents break this model. A "successful" agent response (200, fast, no errors) can be completely wrong. The agent confidently summarized a document that doesn't exist. It generated code that compiles but violates your architecture. It answered a customer question with plausible but fabricated information. Standard HTTP monitoring sees a healthy system. Your users see garbage.

This is why the Air Canada chatbot ruling matters (Moffatt v. Air Canada, BC Civil Resolution Tribunal, February 2024). Air Canada's chatbot gave a customer incorrect bereavement fare information. Air Canada argued the chatbot was a separate entity. The tribunal rejected this: "It should be obvious to Air Canada that it is responsible for all the information on its website." The damages were small (C$812). The precedent is enormous: you're liable for what your agents say, whether you monitored them or not.

[Chart: Some monitoring 89% · Can trace steps 62% · Satisfied 31%. Sources: LangChain 2026 / Cleanlab 2025.]
89% of teams have some monitoring. Only 31% are satisfied with what they can actually see.

What does the adoption landscape actually look like?

LangChain's State of Agent Engineering survey (1,340 respondents, late 2025) found that 57.3% of organizations now have agents in production. Among those teams, observability adoption varies:

Observability level                    All teams    Production teams
Some form of monitoring                89%          94%
Detailed tracing (individual steps)    62%          71.5%
Run offline evaluations                52.4%        n/a

The gap between "some monitoring" (89%) and "detailed tracing" (62%) is where agents fail silently. You know the agent ran. You can't tell why it produced bad output.

Cleanlab's survey of 95 production teams (August 2025, screened from 1,837 respondents) found that fewer than 1 in 3 teams are satisfied with their observability and guardrail solutions. 62% plan to improve observability in the next year, making it the most urgent investment area. 70% of regulated enterprises rebuild their AI agent stack quarterly or faster.

Gartner projects that LLM observability will reach 50% of GenAI deployments by 2028, up from 15% today (Pankaj Prasad, March 2026). As Prasad put it: "As enterprises scale GenAI, the trust requirement grows faster than the technology itself."

Meanwhile, Gartner also predicts that over 40% of agentic AI projects will be canceled by end of 2027 (Anushree Verma, June 2025). The stated reasons: escalating costs, unclear business value, and inadequate risk controls. All three are observability problems.

[Diagram: TRACING, "What did the agent do?" (Spans · Tokens · Tool Calls); EVALUATION, "Was the output good?" (Guardrails · Grading · Drift); COST ATTRIBUTION, "What did it cost?" (Per Agent · Per Task · Per Client).]
Three layers of AI agent observability, each answering a different question.

What does an AI agent observability stack look like?

Three layers, each answering a different question.

Layer 1: Tracing (What did the agent do?)

Distributed tracing for AI agents works like distributed tracing for microservices, but with agent-specific semantics. Every agent invocation creates a trace span with parent/child relationships, token counts, model used, latency, and tool calls. When an orchestrator delegates to sub-agents, the trace forms a tree you can inspect.

The OpenTelemetry GenAI semantic conventions are the emerging standard, with contributors from Amazon, Elastic, Google, IBM, and Microsoft. As of 2026, they're still in "Development" status, not yet stable. That's itself a data point: the industry standard for AI observability isn't finalized yet. Teams are building on evolving foundations.
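To make this concrete, here is a minimal sketch using the OpenTelemetry Python SDK: an orchestrator span with a nested sub-agent span and tool-call span, annotated with gen_ai.* attributes from the in-development conventions. The attribute names may shift before the spec stabilizes, and the service name, model, and token counts are placeholders.

```python
# Minimal trace-tree sketch with the OpenTelemetry Python SDK. The gen_ai.*
# attribute names follow the in-development GenAI semantic conventions and
# may change; model name and token counts are placeholders.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-tracing-demo")

# The orchestrator span is the parent; each sub-agent and tool call nests
# under it, forming the inspectable tree described above.
with tracer.start_as_current_span("orchestrator") as orchestrator:
    orchestrator.set_attribute("gen_ai.operation.name", "invoke_agent")
    with tracer.start_as_current_span("research_agent") as agent:
        agent.set_attribute("gen_ai.request.model", "gpt-4o")
        agent.set_attribute("gen_ai.usage.input_tokens", 1841)
        agent.set_attribute("gen_ai.usage.output_tokens", 312)
        with tracer.start_as_current_span("tool.web_search") as tool:
            tool.set_attribute("gen_ai.tool.name", "web_search")
```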

Production tooling exists. Datadog launched LLM Observability in GA (June 2024) with end-to-end tracing across LLM chains, token usage, and cost tracking, integrated with existing APM. Langfuse (24.7K GitHub stars, MIT licensed, YC W23) provides open-source tracing with agent graph visualization. Braintrust (used by Notion, Stripe, Zapier, Vercel) combines tracing with evaluation. Notion reported going from fixing 3 issues per day to 30 after adopting Braintrust's observability.

Research from Dong, Lu & Zhu (AgentOps: Enabling Observability of LLM Agents, November 2024) identified the core gap: most DevOps tools for agents focus on LLM-specific metrics and prompt management with "limited support for agent-specific artifacts like goals, plans, and tools." Tracing the LLM call isn't enough. You need to trace the agent's reasoning: what goal was it pursuing, what plan did it form, which tools did it choose, and why.
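A hedged sketch of what closing that gap could look like, reusing the tracer from the example above. The agent.* attribute names are hypothetical, not part of any standard; they simply attach the goal, plan, and tool-selection reasoning to the same span tree.

```python
# Hypothetical agent-level attributes (not part of any standard) capturing
# the goal, plan, and tool-selection artifacts most tools currently miss.
with tracer.start_as_current_span("research_agent") as span:
    span.set_attribute("agent.goal", "summarize Q3 churn drivers")
    span.set_attribute("agent.plan", "search -> read top 3 results -> synthesize")
    span.set_attribute("agent.tool.selected", "web_search")
    span.set_attribute("agent.tool.selection_reason", "no cached data fresher than 30 days")
```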

[Trace tree, 0-2.4s timeline: User Request (2.4s) → Orchestrator (2.3s) → Research Agent (1.1s) with Tool: web_search (0.7s), Code Agent (0.7s), QA Agent (0.3s).]
A trace tree showing parent/child spans across an orchestrator and three sub-agents.

Layer 2: Evaluation (Was the output good?)

Tracing tells you what happened. Evaluation tells you whether it should have happened. This is where pass@k vs pass^k from our Evals & Guardrails deep dive becomes operational.

In production, you run two types of evaluations:

Online evaluation: Check every output against guardrails before it reaches the user. Does it contain hallucinations? Does it violate brand guidelines? Does it include data above the user's clearance level? These run as fast classifiers (often a smaller model) on the output path.
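A minimal sketch of an online guardrail gate, with hypothetical check names throughout; the keyword rules below stand in for the fast classifier models a real deployment would call on the output path.

```python
from dataclasses import dataclass

@dataclass
class GuardrailResult:
    passed: bool
    reason: str

# Stand-in rules; in production each check is a fast classifier model.
def cites_unknown_source(output: str, known_sources: set[str]) -> bool:
    return any(tok.startswith("doc:") and tok not in known_sources
               for tok in output.split())

def leaks_restricted_data(output: str, clearance: str) -> bool:
    return clearance != "restricted" and "[RESTRICTED]" in output

def run_guardrails(output: str, clearance: str,
                   known_sources: set[str]) -> GuardrailResult:
    if cites_unknown_source(output, known_sources):
        return GuardrailResult(False, "possible hallucinated citation")
    if leaks_restricted_data(output, clearance):
        return GuardrailResult(False, "data above user clearance")
    return GuardrailResult(True, "ok")

result = run_guardrails("see doc:q3-report [RESTRICTED]", "public", {"doc:q2-report"})
if not result.passed:
    print(f"Blocked before reaching the user: {result.reason}")
```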

Offline evaluation: Sample agent outputs weekly and grade them against known-good answers. Track scores over time. A declining score signals drift, and drift in an AI system is the equivalent of a slow memory leak: invisible until it causes an incident.
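A sketch of the offline loop under stated assumptions: grade() is a crude token-overlap stand-in for an LLM judge or task-specific scorer, and the drift threshold is illustrative.

```python
from statistics import mean

def grade(output: str, reference: str) -> float:
    """Stand-in grader via token overlap; real pipelines use an LLM judge
    or task-specific scorers."""
    out, ref = set(output.lower().split()), set(reference.lower().split())
    return len(out & ref) / max(len(ref), 1)

def weekly_eval(samples: list[dict], baseline: float,
                drift_threshold: float = 0.05) -> float:
    """Grade a weekly sample against known-good answers and flag drift."""
    score = mean(grade(s["output"], s["reference"]) for s in samples)
    if baseline - score > drift_threshold:
        print(f"DRIFT ALERT: weekly score {score:.2f} vs baseline {baseline:.2f}")
    return score

samples = [{"output": "refunds are issued within 5 days",
            "reference": "refunds are issued within 5 business days"}]
weekly_eval(samples, baseline=0.95)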

Layer 3: Cost attribution (What did it cost?)

Token costs are the compute bill of AI systems, and without per-agent attribution, you're flying blind.

Helicone (YC W23, open-source) showed that prompt optimization and caching alone yield 30-50% cost reduction, and up to 85% with fine-tuning. But you can't optimize what you can't measure.

Cost attribution needs to track at three levels (see the sketch after this list):

Per agent: which agents are expensive? Is the research agent consuming 10x the tokens of the code agent?

Per task: what does a single user request cost end-to-end across all agents involved?

Per client: in multi-tenant deployments, what's the cost per client per month?
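A sketch of the rollup, assuming per-span cost records like those a tracing backend might export. The field names and dollar amounts are illustrative (they mirror the figure below), not from any specific tool.

```python
from collections import defaultdict

# Illustrative span cost records; field names are assumptions.
spans = [
    {"agent": "research", "task_id": "t-481", "client": "acme", "cost_usd": 0.12},
    {"agent": "code",     "task_id": "t-481", "client": "acme", "cost_usd": 0.08},
    {"agent": "qa",       "task_id": "t-481", "client": "acme", "cost_usd": 0.03},
]

def rollup(spans: list[dict], key: str) -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        totals[span[key]] += span["cost_usd"]
    return dict(totals)

print(rollup(spans, "agent"))    # which agents are expensive
print(rollup(spans, "task_id"))  # end-to-end cost of one request: 0.23
print(rollup(spans, "client"))   # per-client totals for multi-tenant billing
```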

Budget enforcement matters too. Hard limits prevent a runaway agent from burning through your monthly budget in an afternoon. Soft limits trigger alerts when spending spikes above 2x baseline.
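A minimal budget-guard sketch; the specific thresholds and the daily reset policy are assumptions.

```python
class BudgetGuard:
    """Hard stop plus soft alert at 2x baseline, as described above."""

    def __init__(self, hard_limit_usd: float, baseline_daily_usd: float):
        self.hard_limit = hard_limit_usd
        self.soft_limit = 2 * baseline_daily_usd  # soft alert threshold
        self.spent_today = 0.0  # assume an external job resets this daily

    def record(self, cost_usd: float) -> None:
        self.spent_today += cost_usd
        if self.spent_today >= self.hard_limit:
            raise RuntimeError("hard budget limit hit: halt the agent")
        if self.spent_today >= self.soft_limit:
            print(f"ALERT: ${self.spent_today:.2f} spent, above 2x baseline")

guard = BudgetGuard(hard_limit_usd=500.0, baseline_daily_usd=40.0)
guard.record(0.23)  # call after each task completes
```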

[Figure: Per agent: Research $0.12, Code $0.08, QA $0.03. Per task: end-to-end $0.23. Per client: monthly total $47.50.]
Cost attribution at three levels: per agent, per task, and per client.

Why is "you can't manage what you can't see" literally true for agents?

Traditional software is deterministic. The same input produces the same output. If something breaks, you reproduce it.

AI agents are stochastic. The same input can produce different outputs. An agent that worked perfectly yesterday might hallucinate today because the context window contained slightly different information. Without tracing that shows exactly what the agent saw, what it decided, and what it produced, debugging is guesswork.

The academic research frames this as a taxonomy problem: agent observability needs to capture not just system metrics (latency, tokens, cost) but pipeline artifacts (goals, plans, tool selections, intermediate reasoning). Most tools capture the first. Few capture the second. The second is what you need to debug a wrong answer.

89% · of teams have some AI observability, but only 62% can trace steps (LangChain, 2026)
<1 in 3 · production teams satisfied with their observability solutions (Cleanlab, August 2025)
15% · of GenAI deployments have real LLM observability today (Gartner, March 2026)
40%+ · of agentic AI projects will be canceled by end of 2027 (Gartner, June 2025)
62% · of production teams plan to improve observability next year, their #1 priority (Cleanlab, August 2025)
10x · improvement in issue resolution after adopting AI observability (Notion via Braintrust)
30-50% · cost reduction through prompt optimization and caching (Helicone, March 2025)

What the research says

"As enterprises scale GenAI, the trust requirement grows faster than the technology itself. XAI provides visibility into why a model responded a certain way, while LLM observability validates how that response was generated and whether it can be relied on."

Pankaj Prasad, Gartner, March 2026

"Most agentic AI projects right now are early-stage experiments or proof of concepts that are mostly driven by hype."

Anushree Verma, Gartner, June 2025

"It should be obvious to Air Canada that it is responsible for all the information on its website."

Tribunal Member Rivers, Moffatt v. Air Canada, February 2024

Our take

The observability gap is the biggest risk in enterprise AI right now. Not model quality. Not cost. Not security. The fact that most teams can't trace what their agents are actually doing in production.

What we've found is that the teams reaching production successfully share one pattern: they invest in observability before they invest in more agents. Tracing comes before the second agent. Cost attribution comes before the third client. Evaluation pipelines come before the first production deployment. The teams that skip this end up in the 40% Gartner predicts will cancel their projects.

The three-layer stack (tracing, evaluation, cost attribution) isn't expensive. Open-source tools like Langfuse and Helicone cover tracing and cost. Braintrust covers evaluation. OpenTelemetry is converging on the standard. The investment isn't tooling. It's the discipline to instrument before you scale.

Key takeaway

You can't manage agents you can't see. 89% of teams have some monitoring, but only 62% can trace individual steps, and fewer than 1 in 3 are satisfied. Build three layers: tracing (what the agent did), evaluation (whether it was good), and cost attribution (what it cost). Do this before you add more agents, not after.

FAQ

Why can't I just use my existing APM tools for AI agents?

Existing tools (Datadog, New Relic) track system metrics: latency, error rates, throughput. AI agents need semantic monitoring: what goal was the agent pursuing, what plan did it form, which tools did it use, and was the output correct? A "successful" agent call (200 status, fast response) can produce completely wrong output. Standard APM doesn't catch this.

What is distributed tracing for AI agents?

Every agent invocation creates a trace span with parent/child relationships, tokens consumed, model used, tool calls, and latency. When an orchestrator delegates to sub-agents, the trace forms a tree. You can follow a single user request from the orchestrator through every sub-agent decision to the final output. OpenTelemetry GenAI semantic conventions are the emerging standard.

How do you evaluate agent output quality in production?

Two approaches. Online: run fast guardrail checks on every output before it reaches the user (hallucination detection, compliance checks, data leakage filters). Offline: sample outputs weekly and grade them against known-good answers. Track scores over time. A declining score is the AI equivalent of a slow memory leak.

What does cost attribution look like for AI agents?

Track token usage and cost at three levels: per agent (which agents are expensive), per task (what does a user request cost end-to-end), and per client (in multi-tenant systems). Set hard budget limits to prevent runaway agents, and soft limits that alert when spending exceeds 2x baseline.

Is the OpenTelemetry standard ready for AI agents?

Not yet. The GenAI semantic conventions are in "Development" status as of 2026, with contributors from Amazon, Google, IBM, Microsoft, and Elastic. The standard covers LLM call tracing, token counts, and model attributes, but agent-level semantics (goals, plans, tool selection reasoning) are still being defined.

What's the cost of not monitoring AI agents?

The Air Canada chatbot ruling (February 2024) established that companies are legally liable for what their AI says, whether monitored or not. Gartner projects 40%+ of agentic AI projects will be canceled by 2027, with "inadequate risk controls" as one of three causes. The cost isn't just incidents. It's abandoned projects after months of investment.

What open-source tools exist for AI agent observability?

Langfuse (24.7K GitHub stars, MIT licensed) for tracing and agent graph visualization. Helicone (YC W23) for cost monitoring and optimization. Braintrust for evaluation with 25+ built-in scorers. All integrate with major frameworks (LangChain, LlamaIndex, OpenAI SDK).
