A two-phase evaluation of Phantm's adaptive optimization pipeline across nine general AI benchmarks and six production customer-experience workloads. Quality, measured by deterministic accuracy and LLM-judge scoring, was statistically indistinguishable from the baseline under standard equivalence testing.
Phantm is a drop-in LLM optimization proxy. It exposes an OpenAI-compatible endpoint and applies a six-stage adaptive pipeline to every request before forwarding it to the upstream provider. No SDK changes, no prompt rewrites, no model swapping by the customer.
This report covers a 13,491-prompt evaluation conducted across two phases: a general AI benchmark phase (6,500 prompts across nine standard benchmarks) and a customer-experience production workload phase (6,991 prompts across six CX datasets spanning retail, banking, ecommerce, multi-domain dialog, and agentic vertical support). Across the full evaluation, Phantm reduced inference cost by 47.1% against a realistic enterprise baseline. Quality was preserved: the deterministic-benchmark accuracy gap is 0.013 ± 0.019, and every LLM-judged source shows a quality gap below the threshold at which users notice a difference, per established rating-scale research (see §3.5).
| Phase | Baseline | Optimized | Savings | Reduction |
|---|---|---|---|---|
| General benchmarks | $30.46 | $19.71 | $10.75 | 35.3% |
| CX production workloads | $27.52 | $10.97 | $16.55 | 60.1% |
| Combined | $57.97 | $30.68 | $27.30 | 47.1% |
| Metric | General | CX |
|---|---|---|
| Deterministic benchmark accuracy gap (aggregate) | 0.013 ± 0.019 | — |
| LLM-judge quality gap (overall, 5-pt scale) | 0.091 ± 0.039 | 0.067 ± 0.020 |
All CX responses were LLM-judged; deterministic graders apply only to the general benchmark phase.
Phantm sits between an application and its LLM providers as an OpenAI-compatible proxy. Every request passes through an adaptive optimization pipeline that runs in real time before the upstream call is dispatched. Total pipeline overhead is under 200 milliseconds, and no application-side changes are required to use it.
The cost reductions reported here are the joint result of these stages firing across 13,491 requests, not the contribution of any single stage. The activation rate beneath each card is the fraction of requests on which that stage did meaningful work.
Classifies the incoming request to drive routing and shaping decisions downstream. Dominates the latency budget at 130.7ms mean. Every other stage combined adds under 22ms.
Orchestrates provider-side prompt caches on Anthropic and OpenAI tracks. On the CX phase, Anthropic input-token hit rate reaches 89.8%; OpenAI 36.3%.
Fires when the input contains compressible filler: boilerplate instructions, repeated framing, dead context. Conservative by default; activates on 14.0% of general traffic, 23.5% of CX traffic.
Compresses conversation history when present and compressible. Fires on 22.1% of general requests and 45.1% of CX requests. The CX uplift comes from long, repeated multi-turn dialogs.
Routes the request to a model that the gate predicts can solve it acceptably. Downroutes the majority of traffic, but also up-routes ~0.2–1.5% of requests to a top-tier reasoning model when the baseline tier is insufficient.
Applies length and format guidance to the request so the upstream model returns a tightly-shaped response. Fires on nearly every request (100% general, 99.5% CX) and is invisible to the calling application.
The evaluation was structured in two phases to answer two distinct questions. The general benchmark phase asks whether the optimization pipeline preserves quality on tasks where quality is measurable against known-correct answers. The CX phase asks how much the pipeline saves on workloads that resemble actual customer-facing deployment traffic: long, repeated system prompts; many short user turns per prompt; conversation histories of varying length. Both phases share the same optimization configuration and the same baseline assignment logic.
| Source | N | Task type | Grading |
|---|---|---|---|
| WildChat | 1,500 | Open-domain chat | LLM-judged contrastive |
| HotPotQA | 1,000 | Factoid QA | LLM-graded EM/F1 |
| MMLU | 1,000 | Multiple choice QA | Deterministic accuracy |
| BFCL | 750 | Tool use | Deterministic tool-call equivalence |
| BBH | 500 | Reasoning | LLM-graded exact match |
| LongBench | 500 | Long-context QA | LLM-judged contrastive |
| Hermes FC | 500 | Tool use (chat) | LLM-judged tool use |
| IFEval | 500 | Instruction following | Deterministic constraint validation |
| DialogSum | 250 | Summarization | LLM-judged contrastive |
| Source | N | CX vertical | Tool use |
|---|---|---|---|
| ABCD | 1,750 | Retail customer support | No |
| Nemotron | 1,500 | Agentic vertical CS (rentals, parking, security, sports retail, theme park, vet telehealth) | Yes |
| Taskmaster-2 | 1,500 | Multi-vertical (food, hotels, movies, sports, flights, restaurant search) | No |
| Banking77 | 750 | Banking | No |
| Bitext | 750 | Ecommerce support | No |
| MultiWOZ | 741 | Multi-domain (attraction, hotel, restaurant, taxi) | No |
The CX corpus is structured around 21 distinct system prompts averaging 333 user queries per prompt (median 250, range 179 to 1,750). This mirrors real CX traffic, where a small number of stable prompts are reused across high volumes of customer interaction.
Each prompt in the corpus is baselined against a specific model chosen to represent how that workload would be deployed without Phantm. Two model tiers are used in the baseline: solver tier (gpt-5.4 / claude-sonnet-4-6) and mini tier (gpt-5.4-mini / claude-haiku-4-5). Baseline tier is assigned per source based on the technical complexity of the task: long-context, tool-use, and reasoning-heavy sources are baselined to solver tier; open-domain chat, factoid QA, multiple-choice QA, summarization, and instruction-following sources are baselined to mini tier. A token-floor rule promotes any general-benchmark request with input length above 3,000 tokens to solver tier regardless of source. Provider track is assigned stratified-randomly per source with a target split of 60% OpenAI / 40% Anthropic.
General benchmarks are scored using a mix of deterministic graders and LLM-correctness judges. MMLU, IFEval, and BFCL are scored fully deterministically. HotPotQA and BBH are scored by an LLM-correctness judge; agreement rate between deterministic and LLM-correctness scoring on overlapping MMLU rows was 88.1%. Sources without ground-truth answers (WildChat, LongBench, DialogSum, Hermes FC) are scored by an LLM judge that compares optimized and baseline responses directly. CX workloads are fully LLM-judged using two rubrics: a general contrastive rubric for conversational sources (5,491 rows) and a tool-use rubric for the agentic CS corpus (1,500 rows on nemotron).
Cross-provider judging is used throughout: OpenAI-track responses are judged by claude-opus-4-6; Anthropic-track responses are judged by gpt-5.4. This eliminates the self-preference effect that occurs when a model judges its own outputs. The A/B position of the two responses is deterministically flipped per record using a SHA-256 hash of the record ID, mitigating position bias. Each response is scored on a 1–5 scale across multiple quality dimensions.
We test whether Phantm's quality matches the baseline using the standard two one-sided tests (TOST) procedure (Lakens, 2017) against a threshold of ±0.2 points on the 5-point Likert scale. A source passes equivalence at this threshold if its entire confidence interval falls within ±0.2.
The ±0.2 threshold draws from two well-established results. Research on patient-reported outcomes (Norman, Sloan & Wyrwich, Medical Care, 2003) finds that users typically can't reliably perceive differences smaller than about half a standard deviation, or roughly 0.5 points on a 5-point Likert scale. Studies that ask raters directly to identify the smallest meaningful difference (Anvari & Lakens, Journal of Experimental Social Psychology, 2021) put the threshold even lower, in the 0.20–0.39 range for single Likert items. We use 0.2, the strictest threshold supported by this literature, as the equivalence band.
The standard deviation of per-row quality deltas in this evaluation is approximately 0.83 across all LLM-judged rows (and 0.90 for the general phase considered separately). This is materially below the per-rating noise floor of 1.0–1.2 documented in the LLM-as-judge literature. For deterministic-benchmark accuracy, the confidence interval is computed from row-level paired binary outcomes.
Public evaluations of LLM cost-optimization products fall into two patterns. Academic frameworks, most notably Martian's RouterBench (Hu et al., 2024) and LMSYS's RouteLLM (Ong et al., 2024), publish full methodology, datasets, and code, summarizing cost-quality tradeoffs through point comparisons or composite scalars. Vendor reports typically lead with headline cost reductions and report response quality either informally or not at all. To our knowledge, no prior public evaluation in this product category has used formal equivalence testing against a pre-set threshold. The methodology, TOST against a smallest-effect-size-of-interest drawn from measurement-theory literature, is the standard procedure for testing equivalence claims in clinical research and the social sciences.
Each bar represents that phase's baseline spend, scaled to the combined-baseline total ($57.97). The green segment is what Phantm spent; the red segment is what it cut.
CX savings are substantially higher than general benchmark savings for two reasons. First, the CX corpus structure (stable system prompts, many queries per prompt) naturally enables cache orchestration to amortize over many requests. Second, CX traffic contains a larger fraction of straightforward requests that route safely to smaller models than the general benchmark mix does. Both of these conditions resemble real production CX deployments more closely than the general benchmark corpus.
The reported cost reductions are net of all routing decisions in both directions. While the optimizer routes most requests to smaller models, it also up-routes a small fraction to more capable models when the gate classifier identifies the input as too complex for the baseline tier. Approximately 1.5% of general-benchmark requests and 0.2% of CX requests were up-routed in this way, typically to a top-tier reasoning model. These up-routes increase per-request cost on the affected rows but preserve response quality where the baseline would have been insufficient. They are a feature of the routing logic, not an exception to the cost story.
Phantm's optimization pipeline applies six stages adaptively to each request. The cost reduction is the joint result of these stages, not the contribution of any single one. The activation rates below indicate, for each stage, the fraction of requests on which the stage did meaningful work.
Quality is reported across two views: deterministic benchmark accuracy (the most concrete measure of correctness) and quality gaps from LLM judging (the closest signal to user-perceptible quality across heterogeneous tasks). Both views address different questions and should be read together. Confidence intervals and equivalence tests use the methodology described in §3.5.
A note on win/tie/loss verdicts. Contrastive forced-choice judging, the industry-standard metric for LLM evaluation, requires the judge to pick a winner even when both responses are functionally equivalent, which artificially amplifies small underlying differences. Raw score gaps and deterministic accuracy, with their associated confidence intervals and equivalence tests, are the more representative measures of user-perceptible quality.
Each row shows baseline accuracy (steel dot) and optimized accuracy (blue dot) on a 0–1 scale.
Aggregated across all five graders, n-weighted: gap = 0.013 ± 0.019. The interval crosses zero; the data are consistent with no real difference. Three of five benchmarks have gaps under 2 percentage points; tool-use accuracy (BFCL) is functionally equivalent; the largest gap is on BBH at 3.2 percentage points.
Two horizontal lines per dimension show baseline (steel) and optimized (blue / green) scores on the full 1–5 axis. The visual takeaway: lines are essentially the same length, and the gap is barely visible at full scale.
Overall quality gaps are 0.091 ± 0.039 in the general phase and 0.067 ± 0.020 in the CX phase, on a 5-point scale. Both averages sit well below the ±0.2 equivalence threshold (§3.5): the general phase at less than half the threshold, the CX phase at roughly one-third. The accuracy dimension is the most concrete measure of correctness within the LLM-judge framework and shows the smallest gap in both phases.
Each interval is the per-source quality gap ± 1.96σ. The shaded band is the ±0.2 equivalence threshold. A source passes equivalence if its entire interval falls inside the band.
| Source | Vertical | N | Quality gap | TOST at ±0.2 |
|---|---|---|---|---|
| Banking77 | Banking | 750 | 0.001 ± 0.060 | ✓ pass |
| Bitext | Ecommerce CS | 750 | 0.015 ± 0.060 | ✓ pass |
| Nemotron | Agentic CS | 1,497 | 0.041 ± 0.042 | ✓ pass |
| ABCD | Retail support | 1,750 | 0.051 ± 0.039 | ✓ pass |
| MultiWOZ | Multi-domain | 741 | 0.113 ± 0.060 | ✓ pass |
| Taskmaster-2 | Multi-vertical | 1,500 | 0.150 ± 0.042 | ✓ pass |
All six CX sources have quality gaps that pass equivalence at the ±0.2 threshold, meaning each gap, even at the upper end of its interval, stays below the level at which users would notice a difference. Three sources (Banking77, Bitext, Nemotron) have intervals that cross zero, meaning the gap cannot be statistically distinguished from no difference at all.
| Metric | End-to-end (ms) | Optimizer-internal (ms) |
|---|---|---|
| Mean | 178.1 | 153.2 |
| Median (P50) | 181.6 | 157.6 |
| P95 | 206.0 | 178.5 |
| Stage | Mean ms |
|---|---|
| Gate (real-time classifier) | 130.7 |
| Compression | 13.2 |
| Routing | 4.1 |
| Provider cache lookup / padding | 2.4 |
| Cleanup | 1.2 |
| Pruning | 0.5 |
The real-time classifier that drives adaptive routing dominates the latency budget. All other stages combined add under 22 milliseconds. At ~170ms end-to-end, the pipeline adds less than 5% to the latency of a typical upstream LLM call, well below any threshold at which users would notice an additional delay.
Across 13,491 prompts spanning general AI benchmarks and production customer-experience workloads, Phantm reduced inference cost by 47.1% with no compromise to response quality. Accuracy on deterministic benchmarks was effectively unchanged from the baseline. Every LLM-judged source produced responses whose quality gap fell below the threshold at which users would notice a difference, and three of the six CX sources showed no measurable difference from the baseline at all.
Customers can deploy Phantm in front of existing LLM workloads to capture these savings without changing their application code, swapping models, or sacrificing the quality of the responses their users receive.
| From → To | N | Share |
|---|---|---|
| Full → mini downroute | 1,754 | 25.1% |
| Full → nano downroute | 1,535 | 22.0% |
| Mini → mini same | 1,587 | 22.7% |
| Mini → nano downroute | 1,495 | 21.4% |
| Nano → nano same | 310 | 4.4% |
| Nano → mini uproute | 288 | 4.1% |
| Mini → full uproute | 14 | 0.2% |
| Full → full same | 6 | 0.1% |
| Nano → full uproute | 2 | 0.0% |
| Aggregate downroute | 4,784 | 68.4% |
| From → To | N | Share |
|---|---|---|
| Mini → nano downroute | 1,995 | 32.0% |
| Mini → mini same | 1,276 | 20.5% |
| Nano → nano same | 1,115 | 17.9% |
| Full → mini downroute | 1,022 | 16.4% |
| Full → nano downroute | 507 | 8.1% |
| Nano → mini uproute | 204 | 3.3% |
| Full → full same | 55 | 0.9% |
| Mini → full uproute | 54 | 0.9% |
| Nano → full uproute | 6 | 0.1% |
| Aggregate downroute | 3,524 | 56.5% |