The Problem: Stateless Agents Are Expensive by Default
Production AI agents are stateless. Every API call starts from scratch. The agent doesn't remember the last thing it did, what your company's design system looks like, or what coding conventions you follow. You have to tell it every time.
For a typical enterprise agent call, that looks something like:
- ~3,000 tokens of agent instructions (who you are, what you do, your constraints)
- ~5,000-8,000 tokens of company context (design tokens, code conventions, architecture rules, terminology)
- ~2,000-5,000 tokens of the actual task (the specific thing you want done right now)
If only 20-30% of every call is unique, you're paying full price for 70-80% of tokens that are identical to the last call.
The Fix: Layer Your Prompts for Caching
Most LLM providers now support prompt caching. Anthropic gives you up to 4 independent cache breakpoints per request. The cache works on exact prefix matching: if the first N tokens of your prompt are byte-identical to a recent call, you get a 90% discount on those tokens (writing them to the cache on the first call costs a small premium over the base input price).
The key insight: not all context changes at the same rate. Structure your prompt in layers from most stable to least stable:
Layer 1: Agent instructions. These are the same for every client, every call. "You are a code review agent. You follow these rules. You output in this format." This changes when you update the agent, which is maybe once a month. Cache this globally.
Layer 2a: Stable client context. Design tokens, coding conventions, architecture constraints, brand voice rules, glossary of company-specific terms. This changes when the client updates their standards, which is maybe once a quarter. Cache this per client.
Layer 2b: Per-call task data. The actual problem to solve, the code to review, the evidence to analyze. This is unique every time. Don't even try to cache it.
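Here's roughly what that layering looks like with Anthropic's Messages API, as a minimal sketch: the model ID and the context strings are placeholders, but the `cache_control` breakpoints on the system blocks are the mechanism doing the work.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

AGENT_INSTRUCTIONS = "You are a code review agent. Follow these rules..."      # Layer 1
CLIENT_CONTEXT = "Design tokens, naming conventions, architecture rules..."    # Layer 2a
pr_diff = "diff --git a/src/button.tsx b/src/button.tsx ..."                   # Layer 2b

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model name
    max_tokens=1024,
    system=[
        # Layer 1: global agent instructions -- first breakpoint, cached across all clients
        {
            "type": "text",
            "text": AGENT_INSTRUCTIONS,
            "cache_control": {"type": "ephemeral"},
        },
        # Layer 2a: stable client context -- second breakpoint, cached per client
        {
            "type": "text",
            "text": CLIENT_CONTEXT,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    # Layer 2b: per-task data -- no breakpoint, unique every call, never cached
    messages=[{"role": "user", "content": pr_diff}],
)
```

Each breakpoint caches everything before it, which is exactly why the stable layers have to come first.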
The Math
Say your agent processes 1,000 calls per month for a client. Each call is ~13,000 input tokens.
Without caching:
- 13,000 tokens x 1,000 calls = 13M tokens
- At $3/MTok (Sonnet): $39/month
With three-layer caching (layers 1 + 2a make up ~70% of each call and hit cache every time):
- Cached portion: 9,100 tokens x 1,000 calls = 9.1M tokens at $0.30/MTok = $2.73
- Dynamic portion: 3,900 tokens x 1,000 calls = 3.9M tokens at $3/MTok = $11.70
- Total: $14.43/month (63% reduction)
And that's conservative. Agents that work heavily with stable context (code generation against a design system, content writing against brand guidelines) can see 85%+ of their input cached.
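If you want to plug in your own numbers, the arithmetic is easy to script. This sketch reproduces the figures above using Sonnet's base input price and the 90% cache-read discount; it ignores the one-time cache-write premium for simplicity.

```python
CALLS_PER_MONTH = 1_000
TOKENS_PER_CALL = 13_000
CACHED_FRACTION = 0.70                 # layers 1 + 2a as a share of each call

INPUT_PRICE = 3.00 / 1_000_000         # $3 per million input tokens (Sonnet)
CACHE_READ_PRICE = 0.30 / 1_000_000    # 90% discount on cache reads

total_tokens = CALLS_PER_MONTH * TOKENS_PER_CALL
cached_tokens = total_tokens * CACHED_FRACTION
dynamic_tokens = total_tokens - cached_tokens

without_caching = total_tokens * INPUT_PRICE
with_caching = cached_tokens * CACHE_READ_PRICE + dynamic_tokens * INPUT_PRICE

print(f"without caching: ${without_caching:.2f}/month")              # $39.00
print(f"with caching:    ${with_caching:.2f}/month")                 # $14.43
print(f"savings:         {1 - with_caching / without_caching:.0%}")  # 63%
```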
The Catch: Order Matters
Cache invalidation follows the hierarchy. If you change something in Layer 1, the cache for Layers 2a and 2b is gone too. If you change Layer 2a, Layer 2b's cache resets.
This means:
- Put your most stable content first (agent instructions)
- Put per-client context in the middle
- Put per-task data last
- Never mix them. If you concatenate agent instructions and client context into one big system prompt, you lose the ability to cache them independently.
Also worth knowing: the default cache TTL is 5 minutes, refreshed each time the cached prefix is reused. If your agent isn't called within that window, the cache expires. For agents that run in bursts (processing a batch of tasks), this works great. For agents that get called once an hour, you'll mostly miss cache unless you opt into the longer 1-hour TTL, which costs more to write.
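The easiest way to tell whether your layering is actually working is the usage block the API returns: cache writes and cache reads are reported separately from regular input tokens. A quick sketch, using the field names the Python SDK exposes:

```python
# After each call, the usage object splits input tokens into three buckets.
usage = response.usage

print("cache writes:", usage.cache_creation_input_tokens)  # written to cache (first call, or after expiry)
print("cache reads: ", usage.cache_read_input_tokens)      # served from cache at the discounted rate
print("uncached:    ", usage.input_tokens)                  # per-task tokens billed at full input price

# A healthy layered agent in steady state: large cache reads, small uncached input, near-zero cache writes.
```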
What This Looks Like in Practice
Imagine you're running a code review agent across a 50-person engineering team. Every PR triggers the agent. The agent needs to know your company's architecture rules, naming conventions, and design system patterns.
Without layered caching, every PR review re-sends all of that context. With it, the agent instructions cache globally (shared across all your clients if you're running a platform), the client's conventions cache per-client (shared across all 50 engineers on that team), and only the actual PR diff is new.
If your team submits 200 PRs per month, the savings compound quickly. And if you're running multiple agent types (code review, test generation, documentation, design audit), each agent family maintains its own independent cache chain. They don't interfere with each other.
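In code, that separation can be as thin as one wrapper per agent family that composes the layers, with only the client ID and the task varying between calls. A sketch, with hypothetical placeholder content and model name:

```python
import anthropic

api = anthropic.Anthropic()

# Layer 1: fixed per agent family, shared across every client on the platform.
CODE_REVIEW_INSTRUCTIONS = "You are a code review agent. Apply the rules below to the diff you are given..."

# Layer 2a: one stable context blob per client (in practice, loaded from your own store).
CLIENT_CONTEXT = {
    "acme": "Architecture rules, naming conventions, design-system patterns for Acme...",
}

def review_pr(client_id: str, pr_diff: str) -> str:
    """One code review call with a three-layer, cache-friendly prompt."""
    response = api.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=2048,
        system=[
            # Breakpoint 1: identical for every client -> cached globally.
            {"type": "text", "text": CODE_REVIEW_INSTRUCTIONS,
             "cache_control": {"type": "ephemeral"}},
            # Breakpoint 2: identical for every engineer at this client -> cached per client.
            {"type": "text", "text": CLIENT_CONTEXT[client_id],
             "cache_control": {"type": "ephemeral"}},
        ],
        # Layer 2b: only the diff changes call to call.
        messages=[{"role": "user", "content": pr_diff}],
    )
    return response.content[0].text
```

A test-generation or documentation agent would use its own instruction block, so its prefix, and therefore its cache entries, never collides with this one.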
Our Take
Most teams we've worked with treat prompt engineering as a creative exercise. Write a good prompt, ship it, move on. What they miss is that prompt structure is an infrastructure decision with direct cost implications.
We think about agent prompts in the same way you'd think about a database schema. The structure determines performance, cost, and maintainability. When we design agent architectures, we separate "what the agent knows how to do" from "what it needs to know about this specific client" from "what it needs to do right now." That separation isn't just clean architecture. It's what makes the difference between an AI bill that scales linearly with usage and one that plateaus.
The other thing most people get wrong: they optimize for the model tier (should I use the expensive model or the cheap one?) before optimizing the prompt structure. In our experience, restructuring prompts for caching saves more money than downgrading the model. And it doesn't sacrifice quality.
Key Takeaway
Structure your agent prompts in layers from most stable to least stable. Put agent instructions first, client context second, task-specific data last. Each layer caches independently. This alone can cut your per-call costs by 60-85% with zero quality loss. It's the highest-ROI optimization most AI teams aren't doing yet.