COST & OPS · 9 min read

Cutting your LLM bill by 60% without losing quality

February 10, 2026 · By the AgentsCortex team

If your LLM bill is more than a rounding error, you almost certainly have 40–70% of fat to cut without the user ever noticing. Here’s the exact playbook we run on every engagement.

Most teams set up a production agent, watch the first month’s bill, and treat it as a fixed cost of doing business. It isn’t. The difference between a naive and a tuned agent pipeline is often 3–5× on cost with zero perceptible change in quality. Here is how we get there, in order of effort.

Step 1: Route to the cheapest model that works

Not every request needs your smartest model. A query classifier, a summarization of a short paragraph, a simple rewrite — these run perfectly well on a small model that costs 1/20th as much. The naive setup sends everything to GPT-4 or Claude Sonnet. The tuned setup routes.

How to build it: a small router step (itself a small, cheap model call) decides per-request which tier the work needs. Tier 1 goes to GPT-4o mini or Haiku; tier 2 to a mid-size model; tier 3 to your best. Track tier distribution weekly. In every engagement we've run, over half of requests end up in tier 1 after tuning.
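As a minimal sketch, the router is just a function from request to model name. The model names and the `classify_tier` heuristic below are illustrative placeholders; in production the tier decision is itself a small, cheap model call:

```python
TIER_MODELS = {
    1: "gpt-4o-mini",  # cheap: classification, short rewrites, FAQ matches
    2: "gpt-4o",       # mid: multi-step reasoning
    3: "o1",           # expensive: hard, open-ended work
}

def classify_tier(request: str) -> int:
    """Stand-in for the router model call: returns 1, 2, or 3."""
    # Hypothetical heuristic; swap in a small-model classifier in practice.
    if len(request) < 200 and "?" in request:
        return 1
    if len(request) < 1000:
        return 2
    return 3

def route(request: str) -> str:
    """Pick the cheapest model tier this request needs."""
    return TIER_MODELS[classify_tier(request)]
```

Logging the tier per request is what makes the weekly tier-distribution review trivial.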

Step 2: Cache aggressively — but exact-match first

Semantic caching sounds clever, but it's a trap for most workloads: you risk returning a near-miss result that subtly contradicts the exact query. Exact-match caching on normalized inputs is boring, cheap, and safe. Start there.

Normalize the input (lowercase, strip whitespace, canonical ordering of keys), hash it, cache the response with a TTL appropriate for the task. For knowledge-base questions, 24 hours is often fine. For internal ops, longer.
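The whole pattern fits in a few lines. This sketch uses an in-memory dict to stay self-contained; in production the store would be Redis or similar:

```python
import hashlib
import json
import time

_cache: dict[str, tuple[float, str]] = {}

def _key(payload: dict) -> str:
    # Normalize: lowercase, strip whitespace, canonical key ordering.
    normalized = json.dumps(
        {k: v.strip().lower() if isinstance(v, str) else v
         for k, v in payload.items()},
        sort_keys=True,
    )
    return hashlib.sha256(normalized.encode()).hexdigest()

def cached_call(payload: dict, call_fn, ttl_s: float = 86_400) -> str:
    """Exact-match cache with a TTL (default 24h, per-task tunable)."""
    key = _key(payload)
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < ttl_s:
        return hit[1]           # cache hit: no model call
    result = call_fn(payload)   # cache miss: pay for the call once
    _cache[key] = (time.time(), result)
    return result
```

Note that "What is X?" and "  what is x?  " hash to the same key after normalization, which is where most of the hit rate comes from.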

Step 3: Compress the prompt

Prompts grow. Engineers add one more example, one more rule, one more edge case. Six months in, you're paying for 3,000 input tokens on every call, 90% of which the model doesn't need for this particular request.

The fix: audit the prompt. Separate truly static context (cachable via provider prompt caching) from dynamic content. Split long system prompts into a base + task-specific addendum and only include the relevant addendum. Use prompt caching features from OpenAI and Anthropic — they can cut input token cost by 90% on repeated prefixes.
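The base + addendum split can be sketched like this. The addendum names and placeholder text are hypothetical; the structural point is that the large static base always comes first, so provider prefix caching can reuse it, and only the one relevant addendum is appended per request:

```python
# Static base: identical on every call, eligible for provider prompt caching.
BASE_PROMPT = "You are the support agent for Acme. [static rules, tone, format]"

# Task-specific addenda: only one is included per request.
ADDENDA = {
    "refunds": "Refund policy details: [refund-specific rules and examples]",
    "shipping": "Shipping policy details: [shipping-specific rules and examples]",
}

def build_system_prompt(task: str) -> str:
    """Stable cacheable prefix first, dynamic addendum last."""
    return BASE_PROMPT + "\n\n" + ADDENDA[task]
```

Ordering matters: provider prompt caching works on repeated prefixes, so anything dynamic must come after everything static.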

Step 4: Batch where possible

If your workload has any latency tolerance — overnight reports, async classification, enrichment — use batch APIs. Anthropic and OpenAI both offer a ~50% discount on batch endpoints. For jobs that currently run one-at-a-time but could wait an hour, this is free money.
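For OpenAI's Batch API, the work is mostly building a JSONL file of requests. The line shape below (custom_id / method / url / body) follows the documented format; the model name and prompts are placeholders:

```python
import json

def batch_lines(prompts: list[str], model: str = "gpt-4o-mini") -> str:
    """Build the JSONL body for an OpenAI Batch API submission."""
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"req-{i}",  # used to match results back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }))
    return "\n".join(lines)
```

Upload the resulting file, create the batch, and poll for completion — same requests, roughly half the price.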

Step 5: Shorten the output

Output tokens cost 3–5× more than input tokens. Most agent responses are longer than they need to be — default verbosity, redundant preambles, re-stating the question. Tighten the system prompt (“respond in under 60 words unless asked”) and see your output tokens drop 30%.
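The back-of-envelope math makes the asymmetry concrete. The per-million-token prices below are illustrative, not current list prices; the point is that when output tokens cost 5× input tokens, a 30% output cut moves the bill more than it looks like it should:

```python
def monthly_cost(calls: int, in_tokens: int, out_tokens: int,
                 in_price_per_m: float = 3.0,
                 out_price_per_m: float = 15.0) -> float:
    """Monthly spend in dollars, given per-call token counts and
    illustrative $/million-token prices."""
    return calls * (in_tokens * in_price_per_m
                    + out_tokens * out_price_per_m) / 1_000_000

before = monthly_cost(100_000, 2_000, 400)  # $1,200/month
after = monthly_cost(100_000, 2_000, 280)   # 30% fewer output tokens: $1,020
```

A 30% output cut here saves 15% of the total bill from one line in the system prompt.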

A real example

A recent client was spending $4,200/month on their customer-support RAG chatbot, all on Claude Sonnet. We applied the above: routed 58% of requests to Haiku (simple FAQ matches), added exact-match caching on normalized queries (hit rate 34%), moved the knowledge-base preamble into prompt caching, and trimmed output verbosity by 25%.

Result: $1,520/month. Same quality, measured via blind grading on a held-out sample. Payback on the engineering time: seven days.

When to stop

Cost optimization has diminishing returns. Once you're under $1k/month per workflow, further effort is probably worth less than shipping the next feature. But while you're over that — and almost every team is — there's easy money on the table.

Want this in your business?

We help teams deploy this kind of system. 10-day delivery, fixed price, measurable KPIs.

Start a project