Why are enterprises still overspending on AI?
LLM pricing has collapsed. In 2021, GPT-3 cost $60 per million tokens; by 2026, equivalent-quality models cost $0.02 per million tokens, a roughly 3,000x decline in five years. Appenzeller (a16z, November 2024) pegs the trend at about 10x per year, or 1,000x over three years. Epoch AI's analysis (March 2025) found that after January 2024 the median decline accelerated to 200x per year in some task categories.
And yet enterprise API spending doubled from $3.5B in late 2024 to $8.4B by mid-2025. Gartner forecasts $644B in GenAI spending in 2025, up 76.4% from 2024. An analysis by LeanLM found that 50-90% of enterprise LLM spend is addressable without quality loss (Cholette, February 2026), citing case studies where companies achieved 5x to 17x cost reductions through systematic optimization.
The problem isn't that AI is expensive. It's that most teams deploy it naively and never optimize.
Why do output tokens cost so much more than input tokens?
This is the single most important economic fact in LLM infrastructure, and most teams don't build around it.
The reason is hardware, not markup. Input tokens are processed in parallel in a single forward pass, so the model weights are read from memory once for the entire prompt. Output tokens are generated one at a time, and each one requires its own forward pass that streams the full weights again. LLM inference is memory-bandwidth bound on current hardware (Chng, May 2024), and output generation is the bottleneck.
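A back-of-the-envelope calculation makes the asymmetry concrete. The model size and bandwidth figures below are assumptions chosen for illustration, not measurements of any particular deployment:

```python
# Back-of-the-envelope look at why decoding is memory-bandwidth bound.
# The model size and bandwidth figures are illustrative assumptions
# (70B parameters in FP16, H100-class HBM), not measurements.

weights_gb = 70e9 * 2 / 1e9      # 70B params x 2 bytes (FP16) = 140 GB
bandwidth_gb_per_s = 3350        # ~3.35 TB/s of memory bandwidth

# Each output token needs its own forward pass, which streams the full
# weights from memory, so bandwidth sets a floor on decode latency.
sec_per_output_token = weights_gb / bandwidth_gb_per_s
print(f"Decode floor: {sec_per_output_token * 1000:.0f} ms/token "
      f"(~{1 / sec_per_output_token:.0f} tokens/s per sequence)")

# Input (prefill) tokens are processed in parallel in one pass, so that
# same weight read is amortized across the entire prompt.
```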
The FinOps Foundation's analysis (Collier & Tepper, September 2025) confirmed that output tokens at 3-5x input cost are the dominant cost driver, and identified context window creep (conversation history compounding across turns) as the single largest hidden cost multiplier. In a 20-turn conversation, turn 20 requires roughly 25x more input tokens than turn 1 for identical output quality.
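A minimal sketch of that compounding. The message sizes are assumptions picked only to make the arithmetic concrete; with them, turn 20 lands in the same ballpark as the ~25x figure above:

```python
# Context window creep: every prior exchange is re-sent as input with
# each new request. Message sizes below are illustrative assumptions.

system_tokens = 300       # fixed system prompt
user_tokens = 150         # assumed average user message
assistant_tokens = 450    # assumed average assistant reply

history = 0
turn_input = {}
for turn in range(1, 21):
    turn_input[turn] = system_tokens + history + user_tokens
    history += user_tokens + assistant_tokens

print(f"turn 1 input:  {turn_input[1]:,} tokens")
print(f"turn 20 input: {turn_input[20]:,} tokens "
      f"(~{turn_input[20] / turn_input[1]:.0f}x turn 1)")
```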
The architectural implication: read aggressively, write precisely. Every design decision should minimize how much your AI writes, not how much it reads.
What does the optimization stack look like?
Six layers, each compounding on the last. Applied in sequence, they achieve roughly 92% total cost reduction.
Layer 1: Model routing (51% reduction)
Not every request needs a frontier model. Classification and tagging? Use a small model. Code generation? Use a mid-tier model. Complex multi-step reasoning? That's where you send it to the expensive one.
RouteLLM (Ong et al., LMSYS/UC Berkeley, July 2024) achieved over 85% cost reduction on MT-Bench while maintaining 95% of GPT-4 performance, and was over 40% cheaper than commercial routing alternatives. IBM Research (Martineau, October 2024) found their router connecting 11 LLMs slightly beat GPT-4 overall while saving 5 cents per query.
A practical routing split: 30% of requests to a small model (Haiku-class), 50% to a mid-tier model (Sonnet-class), 20% to a frontier model (Opus-class). That alone cuts blended cost from $15/MTok to roughly $7.40/MTok.
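A minimal routing sketch under those assumptions. The tier prices and the task-type heuristic are placeholders; production routers such as RouteLLM train a classifier on preference data rather than matching keywords, and the exact blended figure depends on the real per-tier prices you pay:

```python
# Minimal model-routing sketch. Tier prices and the task-type mapping are
# illustrative placeholders; a production router (e.g. RouteLLM) would use
# a trained classifier to score request complexity instead.

from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    price_per_mtok: float    # assumed blended $/MTok, illustration only

SMALL = Tier("small (Haiku-class)", 1.00)
MID = Tier("mid (Sonnet-class)", 6.00)
FRONTIER = Tier("frontier (Opus-class)", 15.00)

def route(task_type: str) -> Tier:
    """Send each request to the cheapest tier that can handle it."""
    if task_type in {"classification", "tagging", "extraction"}:
        return SMALL
    if task_type in {"code_generation", "summarization", "drafting"}:
        return MID
    return FRONTIER    # multi-step reasoning, planning, open-ended analysis

# Blended cost under the 30/50/20 split described above.
split = {SMALL: 0.30, MID: 0.50, FRONTIER: 0.20}
blended = sum(t.price_per_mtok * share for t, share in split.items())
print(f"Blended: ${blended:.2f}/MTok vs ${FRONTIER.price_per_mtok:.2f}/MTok all-frontier")
```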
Layer 2: Prompt engineering (15% reduction)
Shorter, clearer prompts cost less and often produce better results. A 2,000-token system prompt compressed to 750 tokens with the same information density saves 63% on that segment across every request. Microsoft Research's LLMLingua achieves 20x compression with only 1.5% performance loss on reasoning tasks.
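The per-request saving compounds with volume. A quick illustration using the figures above; the request volume and input price are assumptions:

```python
# Savings from compressing a 2,000-token system prompt to 750 tokens.
# Request volume and input price are illustrative assumptions.

original_tokens = 2_000
compressed_tokens = 750
requests_per_month = 1_000_000    # assumed volume
input_price_per_mtok = 3.00       # assumed $/MTok for input

saved_mtok = (original_tokens - compressed_tokens) * requests_per_month / 1e6
print(f"Segment reduced by {1 - compressed_tokens / original_tokens:.1%}, "
      f"saving ${saved_mtok * input_price_per_mtok:,.0f}/month")
```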
Layer 3: Context management (30% reduction)
This is the context window creep problem. In multi-turn conversations, every prior message gets re-sent with every new request, so per-request input grows linearly with turn count and cumulative token consumption across the conversation grows quadratically.
Solutions: conversation summarization every 10-15 exchanges (80% context reduction), sliding window with summary, and just-in-time context loading where you retrieve only what's relevant instead of sending everything. RAG queries cost roughly $0.00008 each versus $0.10 for equivalent long-context queries, a 1,250x difference (Rengifo & Martin, Elasticsearch, July 2025).
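A sketch of the sliding-window-plus-summary pattern. The window size is an arbitrary choice and the summarize() call is left abstract; in practice it would be a cheap-model request issued every 10-15 exchanges:

```python
# Sliding window + running summary: send a short summary of older turns
# plus only the recent exchanges, instead of the full history. The window
# size and the summarize() placeholder are illustrative.

MAX_RECENT_MESSAGES = 20    # last ~10 user/assistant exchanges kept verbatim

def summarize(messages: list[dict]) -> str:
    """Placeholder: in practice, compress older turns with a small, cheap model."""
    return f"[summary of {len(messages)} earlier messages]"

def build_request(history: list[dict], user_msg: str) -> list[dict]:
    older, recent = history[:-MAX_RECENT_MESSAGES], history[-MAX_RECENT_MESSAGES:]
    messages = []
    if older:
        messages.append({"role": "user",
                         "content": f"Conversation so far: {summarize(older)}"})
    messages.extend(recent)
    messages.append({"role": "user", "content": user_msg})
    return messages
```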
Layer 4: Prompt caching (60% reduction on remaining cost)
Anthropic's prompt caching (August 2024) gives a 90% discount on cached input tokens and up to 85% latency reduction on long prompts. The minimum cacheable segment is 1,024 tokens on most models, and because cache writes cost only 25% more than regular input tokens, a cached segment pays for itself after roughly one reuse.
In production, a three-layer caching strategy works: system prompt (~2-5K tokens) at 99% cache hit rate, client/project context (~10-50K tokens) at 95% hit rate, and task-specific context (~5-30K tokens) at 60-80% hit rate.
Overall, 70-85% of input tokens hit cache. This alone cuts total costs by 40-60%.
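A sketch of that three-layer structure using Anthropic's cache_control breakpoints. The model id and the block contents are placeholders, and each cached block has to clear the minimum cacheable length to actually be stored:

```python
# Three-layer prompt caching with Anthropic's cache_control breakpoints.
# Model id and block contents are placeholders; each marked block must
# exceed the minimum cacheable length (1,024 tokens on most models).

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "..."      # Layer 1: stable instructions (near-100% hit rate)
PROJECT_CONTEXT = "..."    # Layer 2: client/project context (changes rarely)
TASK_CONTEXT = "..."       # Layer 3: task-specific material for this request

response = client.messages.create(
    model="claude-sonnet-4-20250514",    # placeholder model id
    max_tokens=1024,
    system=[
        {"type": "text", "text": SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": PROJECT_CONTEXT,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": TASK_CONTEXT,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": "Summarize the open issues for this client."},
        ],
    }],
)
print(response.usage)    # cache_read_input_tokens shows what hit the cache
```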
Layer 5: Batch processing (15% reduction)
Anthropic's Batch API (October 2024) offers a straight 50% discount on both input and output tokens for asynchronous processing within a 24-hour window. Any workload that doesn't need real-time response (code audits, documentation generation, eval suites, weekly reports) should batch.
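A sketch of submitting a non-urgent audit workload through the Batch API. The file list, prompts, and model id are placeholders:

```python
# Submitting non-urgent work to Anthropic's Message Batches API: 50% off
# input and output tokens, results within 24 hours. Workload, prompts,
# and model id below are illustrative placeholders.

import anthropic

client = anthropic.Anthropic()

files_to_audit = ["auth.py", "billing.py", "export.py"]   # placeholder workload

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"audit-{name}",
            "params": {
                "model": "claude-sonnet-4-20250514",    # placeholder model id
                "max_tokens": 2048,
                "messages": [{"role": "user",
                              "content": f"Audit {name} for security issues."}],
            },
        }
        for name in files_to_audit
    ]
)
print(batch.id, batch.processing_status)   # poll for results asynchronously
```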
Layer 6: Output control (17% reduction)
Since output tokens cost 3-5x more than input, controlling output length has outsized impact. Structured output formats (JSON over prose), concise response instructions, and format-specific generation (return only the diff, not the entire file) all reduce output tokens significantly.
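A sketch of those levers in a single request: a hard max_tokens cap, a JSON-only instruction, and a diff-only format. The instruction wording, the cap, and the model id are illustrative choices:

```python
# Output-control levers: cap the output budget, demand structured JSON,
# and ask for only the changed lines rather than the whole file.
# Instruction wording, cap, and model id are illustrative choices.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",    # placeholder model id
    max_tokens=500,                      # hard ceiling on output spend
    system=('Respond with JSON only: {"summary": "<one sentence>", '
            '"diff": "<unified diff of changed lines only>"}. '
            "Do not restate unchanged code or add commentary."),
    messages=[{"role": "user",
               "content": "Rename get_user to fetch_user in user_service.py."}],
)
print(response.content[0].text)
```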
What does this look like in practice?
Starting from an all-frontier baseline of $15/MTok and applying each layer in sequence, the blended cost falls by roughly 92%.
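A minimal sketch of the compounding arithmetic, using the per-layer reductions quoted above and the $15/MTok all-frontier baseline:

```python
# Compounding the six layers from the $15/MTok all-frontier baseline,
# using the per-layer reductions quoted above.

layers = [
    ("Model routing",       0.51),
    ("Prompt engineering",  0.15),
    ("Context management",  0.30),
    ("Prompt caching",      0.60),
    ("Batch processing",    0.15),
    ("Output control",      0.17),
]

cost = 15.00    # $/MTok, all-frontier baseline
for name, reduction in layers:
    cost *= 1 - reduction
    print(f"{name:<20} -> ${cost:5.2f}/MTok")

print(f"Total reduction: {1 - cost / 15.00:.0%}")
```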
At enterprise scale, this is the difference between $500K-$1M per month and $50K-$150K per month. Per-user cost drops from $50-100/month to $5-15/month.
What the research says
"LLMflation is real: 1,000x cost reduction over 3 years for equivalent-performance LLMs. Costs are declining approximately 10x per year."
"Output tokens cost 3-5x more than input tokens. Context window creep is the single largest hidden cost driver."
"50-90% of enterprise LLM spend is addressable without quality loss. Enterprise API spending doubled from $3.5B to $8.4B in under a year."
Our take
The mistake we see is teams optimizing the wrong variable. They negotiate volume discounts with their provider or switch to a cheaper model. Those help, but they're single-digit improvements. The 10x savings come from architectural decisions: routing requests to the right model tier, caching the context that doesn't change between requests, batching non-urgent work, and controlling output verbosity.
What we've found is that token cost is a solved problem for teams willing to invest in the infrastructure. The six-layer stack isn't theoretical. Each layer uses production-ready tools and APIs that exist today. The teams still spending $500K/month aren't doing it because AI is expensive. They're doing it because they deployed the naive architecture and never revisited it.
The bigger strategic question isn't how to make AI cheap. It's what you do with the savings. When your per-user cost drops from $100/month to $10/month, you can afford to give every person in the organization an AI co-pilot. That's when the real value unlocks.
Key takeaway
LLM costs dropped 1,000x in three years, but the default deployment wastes 50-90% of spend. A six-layer optimization stack (routing, prompts, context management, caching, batching, output control) achieves 92% reduction. The result: enterprise AI at $5-15 per user per month, which makes universal deployment economically viable for the first time.
FAQ
Why do output tokens cost so much more than input?
Hardware, not markup. Input tokens are processed in one forward pass (weights loaded once). Output tokens require a separate forward pass per token, reloading model weights each time. LLMs are memory-bandwidth bound, and output generation is the bottleneck. This is why output typically costs 3-5x more than input, depending on the model and provider.
What is prompt caching and how much does it save?
Prompt caching stores frequently reused input segments (system prompts, client context) so you don't pay full price every time. Anthropic's implementation gives a 90% discount on cached tokens and up to 85% latency reduction. In production with a three-layer cache strategy, 70-85% of input tokens hit cache, cutting total costs by 40-60%.
What is model routing?
Instead of sending every request to your most expensive model, a router classifies the request and sends it to the cheapest model that can handle it. Simple tasks (classification, extraction) go to small models. Complex reasoning goes to frontier models. RouteLLM achieved 85% cost reduction while maintaining 95% of GPT-4 quality.
How does batch processing reduce costs?
Anthropic's Batch API offers 50% off both input and output tokens for requests processed asynchronously within 24 hours. Any workload that doesn't need real-time response (reports, audits, evals, documentation) should batch. Combined with caching, batch processing can reach 95% total savings.
What is context window creep?
In multi-turn conversations, every previous message gets re-sent with each new request. By turn 20, you're sending roughly 25x more tokens than turn 1 for the same output quality. Solutions: summarize conversation history every 10-15 turns, use sliding windows, or load context just-in-time via RAG instead of carrying everything in the window.
How much does RAG save versus long-context approaches?
A RAG query costs roughly $0.00008. An equivalent long-context query costs about $0.10. That's 1,250x cheaper. At enterprise scale with thousands of queries per day, this is the single largest cost lever after model routing.
Is DeepSeek really 98% cheaper than OpenAI?
DeepSeek V3 is priced at $0.14/MTok for GPT-4-equivalent quality, versus GPT-4's original $20/MTok. That's a real price point that has compressed the entire market. But pricing is only part of the equation. Latency, reliability, data residency, and compliance requirements often matter more than per-token cost for enterprise deployments.