LLM cost optimization

How to cut your LLM spend

Q: Does optimizing LLM cost add latency?

It does not have to. The work of classifying, trimming, caching, and shaping a request can run in the request path in real time. Phantm adds under 200 milliseconds end to end (P50 181.6ms, P95 206ms), because the classifier is a small local model rather than another API call, and streaming passes straight through.

Why model bills explode once you leave the prototype, and how to bring them back down, in real time and without losing answer quality. A practical guide, with results measured across 13,491 requests.

PhantmJune 25, 202612 min read

If you run anything on top of an LLM, you have probably watched the bill climb faster than you expected.

For most of machine learning's history, the expensive part was training. Now it is inference: the cost of answering requests, one token at a time, every time someone uses the product. It has become a real line item for teams of every size, and the providers clearly know it. Prompt caching and half-price batch tiers exist because tokens are the cost, and everyone is trying to use fewer of them.

The awkward part is that the bill scales with success. More users mean more calls, more calls mean more tokens, and the invoice grows every month the product does well. For a team with real traffic, monthly model spend can reach the same order as an engineer's salary, and it keeps climbing from there.

It usually shows up the moment a feature leaves the prototype and meets production traffic, which is more repetitive and more multi-turn than anything you tested with. Across the industry, enterprise spend on model APIs more than doubled in a year, from $3.5 billion to $8.4 billion, even as the price of a single token kept falling (Menlo Ventures).

Most of that spending is avoidable. A large share of a typical bill pays for tokens you did not need, or a model larger than the task required, and both can be reduced without the user noticing. This guide is how we approach that at Phantm, the LLM gateway we build: where the money goes, the techniques that bring it down, and what each one did in our evaluation of 13,491 requests, where the full pipeline cut cost 47.1 percent with no measurable loss in quality.

Where the money actually goes

Two problems drive almost every oversized bill, and both are fixable. Requests get sent to models far more powerful than the task needs, and those requests carry far more tokens than the answer requires.

The token problem is sneakier than it looks, because not all tokens cost the same. Output is priced five to six times higher than input across the major providers, so a model that answers a yes-or-no question in three paragraphs is burning the most expensive tokens there are. And context repeats: in a twenty-turn conversation, a 1,500-token system prompt is sent, and billed, on all twenty turns. Wrap that inside an agent that makes a dozen model calls for a single user action, and the waste stops being a rounding error. It becomes the bill.

Put plainly, there are four places to attack the problem: the model you choose, the context you repeat, the prompt you send, and the response you allow. The rest of this guide is each one, roughly in the order it pays off.

Right-sizing the model

Start with the model, because it is the largest single lever. Most traffic is easier than the model it runs on, and the reason that matters is the gap between tiers:

Model	Input	Cached input	Output
GPT-5.5	$5.00	$0.50	$30.00
GPT-5.4	$2.50	$0.25	$15.00
GPT-5.4-mini	$0.75	$0.075	$4.50
Claude Opus 4.8	$5.00	$0.50	$25.00
Claude Sonnet 4.6	$3.00	$0.30	$15.00
Claude Haiku 4.5	$1.00	$0.10	$5.00

List prices per million tokens, June 2026. Sources: OpenAI, Anthropic.

A token on Opus costs five times a token on Haiku; GPT-5.4 costs more than three times its mini. Sending a simple classification to a flagship model is like couriering a postcard. If you could send the easy requests to the cheap model and keep only the genuinely hard ones on the expensive one, you would cut the model bill without touching quality. That is model routing, and it is the most popular place teams start.

It is also why so many teams reach for tools like LiteLLM or OpenRouter. Those are good plumbing: one endpoint that can talk to many models. But the plumbing is the easy part. The hard part is deciding, for each request, which model can actually handle it, and on those tools that decision is usually a rule someone wrote by hand. Send anything under 500 tokens to the cheap model. It works right up until a 200-token request asks for a contract summary, the cheap model gets it wrong, and a real customer sees the mistake. A static rule cannot tell a short easy question from a short hard one, and in production that is the entire game.

Phantm replaces the rule with a classifier. Before anything is routed, a small model we call the gate reads the request and judges how hard it actually is, what kind of task it is, and how long a good answer should be. Easy, well-defined work drops to a mini or nano tier; reasoning, code, math, and planning stay on the flagship; and the small fraction a cheaper model would get wrong are caught and up-routed back to a stronger one. If a provider errors or rate-limits, the call fails over to another. In our evaluation this moved 60.8 percent of general traffic and 72.8 percent of customer-support traffic off the baseline model, and the quality held. The lesson underneath it is simple: routing is only ever as good as the thing deciding where to route, so the accuracy of that classifier, not the plumbing beneath it, is what separates real savings from a stream of support tickets.

Never paying twice for the same context

Routing decides which model answers. The next saving is making every call carry less, and the easiest place to begin is the context you send over and over.

Most production prompts open with a large, stable block, a system prompt, instructions, a few examples, that is identical across thousands of calls. Both OpenAI and Anthropic will cache that block and bill a repeat at roughly a tenth of the normal input price, a 90 percent discount on tokens you were going to send anyway. It is the lowest-risk saving there is, because the model sees exactly the same input; you are simply charged less for the part it has already seen.

There is a catch that costs teams more than they realize. Providers only cache a prefix once it crosses a size threshold, around 1,024 tokens. A 900-token system prompt sits just under that line, so it never caches, and you pay full price on every single call, indefinitely. Phantm watches for exactly this case and pads the prompt past the threshold with stable filler, so the same 900-token prompt suddenly qualifies and starts billing at the cached rate. It is a small trick that turns a permanent full-price charge into a 90 percent discount, and it is the kind of thing you only catch if you are sitting in the request path looking at every prompt. For workloads that genuinely repeat, asking the same things in slightly different words, an optional semantic cache goes a step further and returns a stored answer for a close-enough request, skipping the call entirely, guarded by a similarity threshold so it never answers something it should not. Across our customer-support evaluation the input-token cache-hit rate reached 89.8 percent on the Anthropic track and 36.3 percent on OpenAI.

Trimming the prompt and shaping the response

Caching handles the context that repeats. The rest of the prompt, and the response that comes back, still tend to be heavier than they need to be, and two more steps run on every request to fix that.

On the way in, cleanup strips dead filler and repeated framing without touching anything structural like a code block, and pruning trims old conversation turns once a thread runs long, keeping the system prompt and the most recent exchange while dropping the least relevant middle. On the way out, output shaping caps the response to the length the task calls for and nudges it toward the format you actually need, a short answer or a JSON object instead of an essay. That last step matters out of proportion to its simplicity, because output is the expensive half of the bill. The only requests left untouched are tool calls, which need room to work. In our evaluation cleanup fired on 14 to 24 percent of requests, history trimming on 22 to 45 percent, and output shaping on nearly all of them, since almost any response can stand to be a little leaner.

The catch: all of this has to happen in real time

Here is the objection that stops most teams from doing any of this inline, and it is a fair one. Every step so far, classifying the request, trimming the prompt, checking a cache, shaping the response, is work that happens before the model is even called. Do it naively and you have traded a money problem for a latency problem, and latency is its own tax. Users abandon slow answers, and an agent that pauses even half a second on each of a dozen calls becomes unusable. This is exactly why most cost tools settle for watching from the sidelines, a dashboard that reports what you spent last week, rather than changing the request as it happens.

Phantm runs the entire pipeline in the request path, in real time, and the whole thing adds under 200 milliseconds, a P50 of 181.6 and a P95 of 206. It can do that because the part that looks expensive, the classifier, is not another API call out to a model; it is a small local model that answers in a couple of milliseconds, for a fraction of a cent, and it is logged like everything else. Streaming passes straight through. From the outside the optimization is invisible: your application sends a request and gets back a normal, streamed response, only cheaper. You do not have to choose between a smaller bill and a fast product, which is the choice most teams quietly assume they are stuck with.

How much it adds up to

On their own these are modest. Together they compound, because each one works on what the last one left behind. Take a support assistant handling a million requests a month, each carrying a 1,200-token reused system prompt plus 800 tokens of fresh input and returning 500 tokens, all on GPT-5.4.

Applied	Monthly cost	vs baseline
Baseline, no optimization	$12,500	n/a
Cache the reused system prompt	$9,800	22% lower
Shape output, 500 to 350 tokens	$7,550	40% lower
Route 60% of traffic to a mini model	$4,379	65% lower

Illustrative arithmetic on June 2026 list prices. Your real figure depends on your traffic.

The baseline is two billion input tokens at $2.50 plus 500 million output tokens at $15, or $12,500. Caching the 1,200-token prefix moves it to the cached rate and cuts input cost from $5,000 to $2,300. Shaping output from 500 tokens to 350 drops output cost from $7,500 to $5,250. Routing 60 percent of the rest to a mini tier brings the total to about $4,379, a 65 percent reduction, before batching or a fine-tuned model is even in the picture. The exact number is not the point; the point is that the techniques stack, and the inputs that decide your number, your prompt length, your output length, and how much of your traffic is easy, are things you can measure. The cost calculator runs the same math on yours.

Without losing quality

All of this is worthless if the answers get worse, and the honest fear, the one that keeps teams from optimizing at all, is that a downroute or a trimmed prompt quietly degrades quality in a way nobody notices until a customer does. The only real answer is to make every change visible and to prove the quality held.

So every decision the pipeline makes is recorded with a reason code: which model handled the request and why, what was trimmed, whether a cache was hit, and why anything was blocked. You can read it per request, which means the optimization is never a black box rewriting your prompts behind your back. The only extra model call in the whole path is the gate, and because it costs a fraction of a cent and is logged like the rest, nothing happens off the books. Cost itself is held in place by policy rather than trust, through per-tenant budgets, rate limits, and an allowed list of models that fail closed, refusing a request they cannot check rather than waving it through.

The savings number is measured, not asserted. When you start, your traffic passes through untouched while the gateway measures a real baseline in parallel and compares it against the optimized path, so the figure you see is your own cost, before and after, on your own workload. We held the whole pipeline to that same standard in our evaluation, across 13,491 prompts in two phases, 6,500 on standard public benchmarks and 6,991 on customer-support workloads, against a realistic enterprise baseline.

Phase	Baseline	Optimized	Reduction
General benchmarks	$30.46	$19.71	35.3%
Customer-support workloads	$27.52	$10.97	60.1%
Combined	$57.97	$30.68	47.1%

The cost figure is only half the result. Cutting spend is trivial if you are allowed to return worse answers, so we tested quality with formal equivalence testing against a strict threshold, the same method used to show two treatments are interchangeable in clinical research, and every source landed under the line at which a person would notice a difference. The full corpus, scoring, and per-stage results are in the evaluation report and the whitepaper.

Which brings it back to that bill that rivaled a salary. The same product, the same code, the same answers, for closer to $4,400 a month than $12,500, with a quality result you can put in front of anyone who asks. That is what cutting LLM spend looks like when it happens on every request, in real time, and is measured end to end, rather than guessed at from a dashboard after the money is already gone.

Go deeper

This is the overview. Each technique has its own deep dive, with more data and worked examples.

Model routing

Does model routing actually save money?

Why hand-written rules misroute production traffic, how a classifier fixes it, and what the savings really are.

Coming soon

Caching

OpenAI vs Anthropic prompt caching

The thresholds, the economics, the padding trick, and the real cache-hit rates we measured in production.

Coming soon

Output

How to reduce output token costs

Why output is the dominant cost, and the controls that cut it without hurting answers.

Coming soon

Methodology

Proving quality held after optimization

How to show an optimization did not degrade quality, using formal equivalence testing.

Coming soon

Frequently asked questions

How much can you actually reduce LLM costs?

On real workloads, combining model routing, caching, prompt trimming, and output control typically removes 40 to 90 percent of inference cost. The exact figure depends on how repetitive your traffic is, how much request difficulty varies, and how long your prompts are. In a 13,491-request evaluation, Phantm measured a 47.1 percent combined reduction, and 60.1 percent on customer-support traffic, while holding quality inside a 0.2-point equivalence bound.

What is the fastest way to cut LLM spend without hurting quality?

Start with caching and output limits. Caching a reused system prompt cuts its input cost by about 90 percent on both OpenAI and Anthropic and changes nothing about the response. Capping and shaping the output attacks the most expensive tokens, since output is priced several times higher than input. Both are low risk because they do not change which model answers or what it is asked.

Is OpenRouter or LiteLLM enough to optimize LLM costs?

They are good plumbing: one endpoint that can reach many models. But routing on them is usually a hand-written rule, and a static rule cannot tell a short easy request from a short hard one, so it misroutes real traffic. The savings come from an accurate per-request difficulty classifier and an up-route path, applied in real time, which is what a purpose-built gateway like Phantm adds on top.

Does optimizing LLM cost add latency?

It does not have to. Classifying, trimming, caching, and shaping a request can all run in the request path in real time. Phantm adds under 200 milliseconds end to end (P50 181.6ms, P95 206ms), because the classifier is a small local model rather than another API call, and streaming passes straight through.

How do I reduce output token costs?

Output tokens are the dominant cost, priced several times higher than input. Cap output length per request type, ask for structured output such as JSON or a fixed schema instead of prose, and remove instructions that invite the model to explain at length when you do not read the explanation.

Do I have to change my code to cut LLM spend?

No. You can implement each technique yourself, or route traffic through an OpenAI-compatible gateway that applies routing, caching, trimming, and output control in the request path. With a gateway you change one base URL and keep your existing SDK, and the optimizations run on every request before it reaches the provider.

See what it does to your bill

Run one endpoint through Phantm in shadow mode for 14 to 21 days. You get token and cost deltas by optimization step, routing logs, quality proxies, and a concrete savings number on your own traffic.

Start a free pilot Read the evaluation