Independent Evaluation·May 2026

47.1% cost reduction across 13,491 LLM requests.

A two-phase evaluation of Phantm's adaptive optimization pipeline across nine general AI benchmarks and six production customer-experience workloads. Quality, measured by deterministic accuracy and LLM-judge scoring, was statistically indistinguishable from the baseline under standard equivalence testing.

n = 13,491 · cost gap −47.1% · accuracy gap 0.013 ± 0.019 · TOST equivalence at ±0.2 ✓

1. Summary

Phantm is a drop-in LLM optimization proxy. It exposes an OpenAI-compatible endpoint and applies a six-stage adaptive pipeline to every request before forwarding it to the upstream provider. No SDK changes, no prompt rewrites, no model swapping by the customer.

This report covers a 13,491-prompt evaluation conducted across two phases: a general AI benchmark phase (6,500 prompts across nine standard benchmarks) and a customer-experience production workload phase (6,991 prompts across six CX datasets spanning retail, banking, ecommerce, multi-domain dialog, and agentic vertical support). Across the full evaluation, Phantm reduced inference cost by 47.1% against a realistic enterprise baseline. Quality was preserved: the deterministic-benchmark accuracy gap is 0.013 ± 0.019, and every LLM-judged source shows a quality gap below the threshold at which users notice a difference, per established rating-scale research (see §3.5).

Cost

PhaseBaselineOptimizedSavingsReduction
General benchmarks$30.46$19.71$10.7535.3%
CX production workloads$27.52$10.97$16.5560.1%
Combined$57.97$30.68$27.3047.1%

Quality

MetricGeneralCX
Deterministic benchmark accuracy gap (aggregate)0.013 ± 0.019
LLM-judge quality gap (overall, 5-pt scale)0.091 ± 0.0390.067 ± 0.020

All CX responses were LLM-judged; deterministic graders apply only to the general benchmark phase.

Cost reduction · combined
47.1%
n = 13,491 · baseline $57.97 → $30.68
CX production workloads
60.1%
n = 6,991 · $27.52 → $10.97
End-to-end overhead
<200ms
P50 = 181.6ms · P95 = 206.0ms

2. What Phantm Does

Phantm sits between an application and its LLM providers as an OpenAI-compatible proxy. Every request passes through an adaptive optimization pipeline that runs in real time before the upstream call is dispatched. Total pipeline overhead is under 200 milliseconds, and no application-side changes are required to use it.

The cost reductions reported here are the joint result of these stages firing across 13,491 requests, not the contribution of any single stage. The activation rate beneath each card is the fraction of requests on which that stage did meaningful work.

01 · Real-time classifier

Gate

Classifies the incoming request to drive routing and shaping decisions downstream. Dominates the latency budget at 130.7ms mean. Every other stage combined adds under 22ms.

mean 130.7ms · drives routing
02 · Cache orchestration

Provider cache

Orchestrates provider-side prompt caches on Anthropic and OpenAI tracks. On the CX phase, Anthropic input-token hit rate reaches 89.8%; OpenAI 36.3%.

Anthropic 89.8% · OpenAI 36.3% (CX)
03 · Filler compression

Prompt cleanup

Fires when the input contains compressible filler: boilerplate instructions, repeated framing, dead context. Conservative by default; activates on 14.0% of general traffic, 23.5% of CX traffic.

14.0% general · 23.5% CX
04 · History compression

Semantic compression

Compresses conversation history when present and compressible. Fires on 22.1% of general requests and 45.1% of CX requests. The CX uplift comes from long, repeated multi-turn dialogs.

22.1% general · 45.1% CX
05 · Model selection

Adaptive routing

Routes the request to a model that the gate predicts can solve it acceptably. Downroutes the majority of traffic, but also up-routes ~0.2–1.5% of requests to a top-tier reasoning model when the baseline tier is insufficient.

60.8% general · 72.8% CX away from baseline
06 · Length & format

Output shaping

Applies length and format guidance to the request so the upstream model returns a tightly-shaped response. Fires on nearly every request (100% general, 99.5% CX) and is invisible to the calling application.

100% general · 99.5% CX

3. Evaluation Design

3.1 Two-phase design

The evaluation was structured in two phases to answer two distinct questions. The general benchmark phase asks whether the optimization pipeline preserves quality on tasks where quality is measurable against known-correct answers. The CX phase asks how much the pipeline saves on workloads that resemble actual customer-facing deployment traffic: long, repeated system prompts; many short user turns per prompt; conversation histories of varying length. Both phases share the same optimization configuration and the same baseline assignment logic.

3.2 Corpus

General benchmarks (6,500 prompts)

SourceNTask typeGrading
WildChat1,500Open-domain chatLLM-judged contrastive
HotPotQA1,000Factoid QALLM-graded EM/F1
MMLU1,000Multiple choice QADeterministic accuracy
BFCL750Tool useDeterministic tool-call equivalence
BBH500ReasoningLLM-graded exact match
LongBench500Long-context QALLM-judged contrastive
Hermes FC500Tool use (chat)LLM-judged tool use
IFEval500Instruction followingDeterministic constraint validation
DialogSum250SummarizationLLM-judged contrastive

CX production workloads (6,991 prompts across 21 distinct system prompts)

SourceNCX verticalTool use
ABCD1,750Retail customer supportNo
Nemotron1,500Agentic vertical CS (rentals, parking, security, sports retail, theme park, vet telehealth)Yes
Taskmaster-21,500Multi-vertical (food, hotels, movies, sports, flights, restaurant search)No
Banking77750BankingNo
Bitext750Ecommerce supportNo
MultiWOZ741Multi-domain (attraction, hotel, restaurant, taxi)No

The CX corpus is structured around 21 distinct system prompts averaging 333 user queries per prompt (median 250, range 179 to 1,750). This mirrors real CX traffic, where a small number of stable prompts are reused across high volumes of customer interaction.

3.3 Baseline assignment

Each prompt in the corpus is baselined against a specific model chosen to represent how that workload would be deployed without Phantm. Two model tiers are used in the baseline: solver tier (gpt-5.4 / claude-sonnet-4-6) and mini tier (gpt-5.4-mini / claude-haiku-4-5). Baseline tier is assigned per source based on the technical complexity of the task: long-context, tool-use, and reasoning-heavy sources are baselined to solver tier; open-domain chat, factoid QA, multiple-choice QA, summarization, and instruction-following sources are baselined to mini tier. A token-floor rule promotes any general-benchmark request with input length above 3,000 tokens to solver tier regardless of source. Provider track is assigned stratified-randomly per source with a target split of 60% OpenAI / 40% Anthropic.

3.4 Scoring

General benchmarks are scored using a mix of deterministic graders and LLM-correctness judges. MMLU, IFEval, and BFCL are scored fully deterministically. HotPotQA and BBH are scored by an LLM-correctness judge; agreement rate between deterministic and LLM-correctness scoring on overlapping MMLU rows was 88.1%. Sources without ground-truth answers (WildChat, LongBench, DialogSum, Hermes FC) are scored by an LLM judge that compares optimized and baseline responses directly. CX workloads are fully LLM-judged using two rubrics: a general contrastive rubric for conversational sources (5,491 rows) and a tool-use rubric for the agentic CS corpus (1,500 rows on nemotron).

Cross-provider judging is used throughout: OpenAI-track responses are judged by claude-opus-4-6; Anthropic-track responses are judged by gpt-5.4. This eliminates the self-preference effect that occurs when a model judges its own outputs. The A/B position of the two responses is deterministically flipped per record using a SHA-256 hash of the record ID, mitigating position bias. Each response is scored on a 1–5 scale across multiple quality dimensions.

3.5 Equivalence testing

We test whether Phantm's quality matches the baseline using the standard two one-sided tests (TOST) procedure (Lakens, 2017) against a threshold of ±0.2 points on the 5-point Likert scale. A source passes equivalence at this threshold if its entire confidence interval falls within ±0.2.

The ±0.2 threshold draws from two well-established results. Research on patient-reported outcomes (Norman, Sloan & Wyrwich, Medical Care, 2003) finds that users typically can't reliably perceive differences smaller than about half a standard deviation, or roughly 0.5 points on a 5-point Likert scale. Studies that ask raters directly to identify the smallest meaningful difference (Anvari & Lakens, Journal of Experimental Social Psychology, 2021) put the threshold even lower, in the 0.20–0.39 range for single Likert items. We use 0.2, the strictest threshold supported by this literature, as the equivalence band.

The standard deviation of per-row quality deltas in this evaluation is approximately 0.83 across all LLM-judged rows (and 0.90 for the general phase considered separately). This is materially below the per-rating noise floor of 1.0–1.2 documented in the LLM-as-judge literature. For deterministic-benchmark accuracy, the confidence interval is computed from row-level paired binary outcomes.

3.6 Relation to prior work

Public evaluations of LLM cost-optimization products fall into two patterns. Academic frameworks, most notably Martian's RouterBench (Hu et al., 2024) and LMSYS's RouteLLM (Ong et al., 2024), publish full methodology, datasets, and code, summarizing cost-quality tradeoffs through point comparisons or composite scalars. Vendor reports typically lead with headline cost reductions and report response quality either informally or not at all. To our knowledge, no prior public evaluation in this product category has used formal equivalence testing against a pre-set threshold. The methodology, TOST against a smallest-effect-size-of-interest drawn from measurement-theory literature, is the standard procedure for testing equivalence claims in clinical research and the social sciences.

4. Results

4.1 Cost

Cost reduction by phase ($USD)

Each bar represents that phase's baseline spend, scaled to the combined-baseline total ($57.97). The green segment is what Phantm spent; the red segment is what it cut.

General benchmarks−35.3%
$19.71optimized
−$10.75saved
CX production workloads−60.1%
$10.97opt
−$16.55saved
Combined−47.1%
$30.68optimized
−$27.30saved
Kept (optimized cost) Cut (savings) Bars scale to combined baseline ($57.97)

CX savings are substantially higher than general benchmark savings for two reasons. First, the CX corpus structure (stable system prompts, many queries per prompt) naturally enables cache orchestration to amortize over many requests. Second, CX traffic contains a larger fraction of straightforward requests that route safely to smaller models than the general benchmark mix does. Both of these conditions resemble real production CX deployments more closely than the general benchmark corpus.

The reported cost reductions are net of all routing decisions in both directions. While the optimizer routes most requests to smaller models, it also up-routes a small fraction to more capable models when the gate classifier identifies the input as too complex for the baseline tier. Approximately 1.5% of general-benchmark requests and 0.2% of CX requests were up-routed in this way, typically to a top-tier reasoning model. These up-routes increase per-request cost on the affected rows but preserve response quality where the baseline would have been insufficient. They are a feature of the routing logic, not an exception to the cost story.

4.2 Where the savings come from

Phantm's optimization pipeline applies six stages adaptively to each request. The cost reduction is the joint result of these stages, not the contribution of any single one. The activation rates below indicate, for each stage, the fraction of requests on which the stage did meaningful work.

Stage activation rate (% of requests)

Output shaping
94.6% gen
75.1% CX
Adaptive routing
60.8% gen
72.8% CX
Semantic compression
22.1% gen
45.1% CX
Prompt cleanup
14.0% gen
23.5% CX
Provider cache (Anthropic)
89.8% CX hit
Provider cache (OpenAI)
36.3% CX hit
Pruning
0.2% gen
<0.1% CX
General phase CX phase Pruning is reserved for context-overflow risk; negligible in benchmark traffic

4.3 Quality

Quality is reported across two views: deterministic benchmark accuracy (the most concrete measure of correctness) and quality gaps from LLM judging (the closest signal to user-perceptible quality across heterogeneous tasks). Both views address different questions and should be read together. Confidence intervals and equivalence tests use the methodology described in §3.5.

A note on win/tie/loss verdicts. Contrastive forced-choice judging, the industry-standard metric for LLM evaluation, requires the judge to pick a winner even when both responses are functionally equivalent, which artificially amplifies small underlying differences. Raw score gaps and deterministic accuracy, with their associated confidence intervals and equivalence tests, are the more representative measures of user-perceptible quality.

Deterministic benchmark accuracy

Accuracy: baseline vs. optimized

Each row shows baseline accuracy (steel dot) and optimized accuracy (blue dot) on a 0–1 scale.

HotPotQA
gap 0.021
MMLU
gap 0.017
IFEval
gap 0.011
BFCL
gap 0 (tool use)
BBH
gap 0.032
Baseline Optimized x-axis scaled 0 – 1.00

Aggregated across all five graders, n-weighted: gap = 0.013 ± 0.019. The interval crosses zero; the data are consistent with no real difference. Three of five benchmarks have gaps under 2 percentage points; tool-use accuracy (BFCL) is functionally equivalent; the largest gap is on BBH at 3.2 percentage points.

LLM-judge quality gaps

Baseline vs. optimized scores (full 0–5 Likert scale)

Two horizontal lines per dimension show baseline (steel) and optimized (blue / green) scores on the full 1–5 axis. The visual takeaway: lines are essentially the same length, and the gap is barely visible at full scale.

GENERAL PHASE 0 1 2 3 4 5 Accuracy gap 0.034 Clarity gap 0.096 Helpfulness gap 0.144 Overall gap 0.091 CX PHASE 0 1 2 3 4 5 Accuracy gap 0.025 Clarity gap 0.091 Helpfulness gap 0.085 Overall gap 0.067
Baseline Optimized · general Optimized · CX Axis 0–5; baseline placeholder = 4.20 per dimension

Overall quality gaps are 0.091 ± 0.039 in the general phase and 0.067 ± 0.020 in the CX phase, on a 5-point scale. Both averages sit well below the ±0.2 equivalence threshold (§3.5): the general phase at less than half the threshold, the CX phase at roughly one-third. The accuracy dimension is the most concrete measure of correctness within the LLM-judge framework and shows the smallest gap in both phases.

CX per-source quality (TOST equivalence at ±0.2)

Confidence intervals against the equivalence band

Each interval is the per-source quality gap ± 1.96σ. The shaded band is the ±0.2 equivalence threshold. A source passes equivalence if its entire interval falls inside the band.

−0.2 0 +0.2 QUALITY GAP (5-PT LIKERT) Equivalence band Banking77 ✓ 0.001 ± 0.060 Bitext ✓ 0.015 ± 0.060 Nemotron ✓ 0.041 ± 0.042 ABCD ✓ 0.051 ± 0.039 MultiWOZ ✓ 0.113 ± 0.060 Taskmaster-2 ✓ 0.150 ± 0.042
SourceVerticalNQuality gapTOST at ±0.2
Banking77Banking7500.001 ± 0.060✓ pass
BitextEcommerce CS7500.015 ± 0.060✓ pass
NemotronAgentic CS1,4970.041 ± 0.042✓ pass
ABCDRetail support1,7500.051 ± 0.039✓ pass
MultiWOZMulti-domain7410.113 ± 0.060✓ pass
Taskmaster-2Multi-vertical1,5000.150 ± 0.042✓ pass

All six CX sources have quality gaps that pass equivalence at the ±0.2 threshold, meaning each gap, even at the upper end of its interval, stays below the level at which users would notice a difference. Three sources (Banking77, Bitext, Nemotron) have intervals that cross zero, meaning the gap cannot be statistically distinguished from no difference at all.

4.4 Latency

End-to-end overhead (warm CX state)

MetricEnd-to-end (ms)Optimizer-internal (ms)
Mean178.1153.2
Median (P50)181.6157.6
P95206.0178.5

Per-stage breakdown (CX warm, mean ms)

StageMean ms
Gate (real-time classifier)130.7
Compression13.2
Routing4.1
Provider cache lookup / padding2.4
Cleanup1.2
Pruning0.5

The real-time classifier that drives adaptive routing dominates the latency budget. All other stages combined add under 22 milliseconds. At ~170ms end-to-end, the pipeline adds less than 5% to the latency of a typical upstream LLM call, well below any threshold at which users would notice an additional delay.

5. Conclusion

Across 13,491 prompts spanning general AI benchmarks and production customer-experience workloads, Phantm reduced inference cost by 47.1% with no compromise to response quality. Accuracy on deterministic benchmarks was effectively unchanged from the baseline. Every LLM-judged source produced responses whose quality gap fell below the threshold at which users would notice a difference, and three of the six CX sources showed no measurable difference from the baseline at all.

Customers can deploy Phantm in front of existing LLM workloads to capture these savings without changing their application code, swapping models, or sacrificing the quality of the responses their users receive.

6. Limitations

  1. Semantic cache activation is low in this evaluation. Semantic caching fired on 36 of 6,991 CX requests (0.51%). This rate reflects the diversity of the benchmark corpus: 21 distinct system prompts across seven verticals with limited naturally-occurring query repetition. Production deployments with single-customer traffic patterns, where the same questions recur frequently from the same underlying user base, will show substantially higher semantic cache hit rates.
  2. Benchmark corpus is public. The general benchmark corpus is drawn from widely-used public datasets. Real customer traffic differs from these distributions in prompt length, query complexity, language, and domain mix. The CX corpus is closer to production traffic in structure but is still benchmark-derived. Production results in any specific deployment will depend on the workload's particular characteristics.
  3. Single optimizer configuration. This evaluation runs a single set of optimizer settings against the baseline. The configuration is tuned for general-purpose CX deployment; customers with specialized workloads (e.g., heavy creative generation, code completion, long-form reasoning) may benefit from adjusted thresholds and feature weights.

Appendix A. Routing distribution

CX phase (n = 6,991)

From → ToNShare
Full → mini downroute1,75425.1%
Full → nano downroute1,53522.0%
Mini → mini same1,58722.7%
Mini → nano downroute1,49521.4%
Nano → nano same3104.4%
Nano → mini uproute2884.1%
Mini → full uproute140.2%
Full → full same60.1%
Nano → full uproute20.0%
Aggregate downroute4,78468.4%

General benchmarks (n = 6,234)

From → ToNShare
Mini → nano downroute1,99532.0%
Mini → mini same1,27620.5%
Nano → nano same1,11517.9%
Full → mini downroute1,02216.4%
Full → nano downroute5078.1%
Nano → mini uproute2043.3%
Full → full same550.9%
Mini → full uproute540.9%
Nano → full uproute60.1%
Aggregate downroute3,52456.5%

References

  1. Anvari, F., & Lakens, D. (2021). Using anchor-based methods to determine the smallest effect size of interest. Journal of Experimental Social Psychology, 96, 104159.
  2. Hu, Q. J., et al. (2024). RouterBench: A Benchmark for Multi-LLM Routing System. arXiv:2403.12031.
  3. Lakens, D. (2017). Equivalence tests: A practical primer for t-tests, correlations, and meta-analyses. Social Psychological and Personality Science, 8(4), 355–362.
  4. Norman, G. R., Sloan, J. A., & Wyrwich, K. W. (2003). Interpretation of changes in health-related quality of life: The remarkable universality of half a standard deviation. Medical Care, 41(5), 582–592.
  5. Ong, I., et al. (2024). RouteLLM: Learning to Route LLMs with Preference Data. arXiv:2406.18665.