Proof — Phantm Evaluation Report (May 2026)

1. Summary

Phantm is a drop-in LLM optimization proxy. It exposes an OpenAI-compatible endpoint and applies a six-stage adaptive pipeline to every request before forwarding it to the upstream provider. No SDK changes, no prompt rewrites, no model swapping by the customer.

This report covers a 13,491-prompt evaluation conducted across two phases: a general AI benchmark phase (6,500 prompts across nine standard benchmarks) and a customer-experience production workload phase (6,991 prompts across six CX datasets spanning retail, banking, ecommerce, multi-domain dialog, and agentic vertical support). Across the full evaluation, Phantm reduced inference cost by 47.1% against a realistic enterprise baseline. Quality was preserved: the deterministic-benchmark accuracy gap is 0.013 ± 0.019, and every LLM-judged source shows a quality gap below the threshold at which users notice a difference, per established rating-scale research (see §3.5).

Cost

Phase	Baseline	Optimized	Savings	Reduction
General benchmarks	$30.46	$19.71	$10.75	35.3%
CX production workloads	$27.52	$10.97	$16.55	60.1%
Combined	$57.97	$30.68	$27.30	47.1%

Quality

Metric	General	CX
Deterministic benchmark accuracy gap (aggregate)	0.013 ± 0.019	—
LLM-judge quality gap (overall, 5-pt scale)	0.091 ± 0.039	0.067 ± 0.020

All CX responses were LLM-judged; deterministic graders apply only to the general benchmark phase.

Cost reduction · combined

47.1%

n = 13,491 · baseline $57.97 → $30.68

CX production workloads

60.1%

n = 6,991 · $27.52 → $10.97

End-to-end overhead

<200ms

P50 = 181.6ms · P95 = 206.0ms

2. What Phantm Does

Phantm sits between an application and its LLM providers as an OpenAI-compatible proxy. Every request passes through an adaptive optimization pipeline that runs in real time before the upstream call is dispatched. Total pipeline overhead is under 200 milliseconds, and no application-side changes are required to use it.

The cost reductions reported here are the joint result of these stages firing across 13,491 requests, not the contribution of any single stage. The activation rate beneath each card is the fraction of requests on which that stage did meaningful work.

01 · Real-time classifier

Gate

Classifies the incoming request to drive routing and shaping decisions downstream. Dominates the latency budget at 130.7ms mean. Every other stage combined adds under 22ms.

mean 130.7ms · drives routing

02 · Cache orchestration

Provider cache

Orchestrates provider-side prompt caches on Anthropic and OpenAI tracks. On the CX phase, Anthropic input-token hit rate reaches 89.8%; OpenAI 36.3%.

Anthropic 89.8% · OpenAI 36.3% (CX)

03 · Filler compression

Prompt cleanup

Fires when the input contains compressible filler: boilerplate instructions, repeated framing, dead context. Conservative by default; activates on 14.0% of general traffic, 23.5% of CX traffic.

14.0% general · 23.5% CX

04 · History compression

Semantic compression

Compresses conversation history when present and compressible. Fires on 22.1% of general requests and 45.1% of CX requests. The CX uplift comes from long, repeated multi-turn dialogs.

22.1% general · 45.1% CX

05 · Model selection

Adaptive routing

Routes the request to a model that the gate predicts can solve it acceptably. Downroutes the majority of traffic, but also up-routes ~0.2–1.5% of requests to a top-tier reasoning model when the baseline tier is insufficient.

60.8% general · 72.8% CX away from baseline

06 · Length & format

Output shaping

Applies length and format guidance to the request so the upstream model returns a tightly-shaped response. Fires on nearly every request (100% general, 99.5% CX) and is invisible to the calling application.

100% general · 99.5% CX

3. Evaluation Design

3.1 Two-phase design

The evaluation was structured in two phases to answer two distinct questions. The general benchmark phase asks whether the optimization pipeline preserves quality on tasks where quality is measurable against known-correct answers. The CX phase asks how much the pipeline saves on workloads that resemble actual customer-facing deployment traffic: long, repeated system prompts; many short user turns per prompt; conversation histories of varying length. Both phases share the same optimization configuration and the same baseline assignment logic.

3.2 Corpus

General benchmarks (6,500 prompts)

Source	N	Task type	Grading
WildChat	1,500	Open-domain chat	LLM-judged contrastive
HotPotQA	1,000	Factoid QA	LLM-graded EM/F1
MMLU	1,000	Multiple choice QA	Deterministic accuracy
BFCL	750	Tool use	Deterministic tool-call equivalence
BBH	500	Reasoning	LLM-graded exact match
LongBench	500	Long-context QA	LLM-judged contrastive
Hermes FC	500	Tool use (chat)	LLM-judged tool use
IFEval	500	Instruction following	Deterministic constraint validation
DialogSum	250	Summarization	LLM-judged contrastive

CX production workloads (6,991 prompts across 21 distinct system prompts)

Source	N	CX vertical	Tool use
ABCD	1,750	Retail customer support	No
Nemotron	1,500	Agentic vertical CS (rentals, parking, security, sports retail, theme park, vet telehealth)	Yes
Taskmaster-2	1,500	Multi-vertical (food, hotels, movies, sports, flights, restaurant search)	No
Banking77	750	Banking	No
Bitext	750	Ecommerce support	No
MultiWOZ	741	Multi-domain (attraction, hotel, restaurant, taxi)	No

The CX corpus is structured around 21 distinct system prompts averaging 333 user queries per prompt (median 250, range 179 to 1,750). This mirrors real CX traffic, where a small number of stable prompts are reused across high volumes of customer interaction.

3.3 Baseline assignment

Each prompt in the corpus is baselined against a specific model chosen to represent how that workload would be deployed without Phantm. Two model tiers are used in the baseline: solver tier (gpt-5.4 / claude-sonnet-4-6) and mini tier (gpt-5.4-mini / claude-haiku-4-5). Baseline tier is assigned per source based on the technical complexity of the task: long-context, tool-use, and reasoning-heavy sources are baselined to solver tier; open-domain chat, factoid QA, multiple-choice QA, summarization, and instruction-following sources are baselined to mini tier. A token-floor rule promotes any general-benchmark request with input length above 3,000 tokens to solver tier regardless of source. Provider track is assigned stratified-randomly per source with a target split of 60% OpenAI / 40% Anthropic.

3.4 Scoring

General benchmarks are scored using a mix of deterministic graders and LLM-correctness judges. MMLU, IFEval, and BFCL are scored fully deterministically. HotPotQA and BBH are scored by an LLM-correctness judge; agreement rate between deterministic and LLM-correctness scoring on overlapping MMLU rows was 88.1%. Sources without ground-truth answers (WildChat, LongBench, DialogSum, Hermes FC) are scored by an LLM judge that compares optimized and baseline responses directly. CX workloads are fully LLM-judged using two rubrics: a general contrastive rubric for conversational sources (5,491 rows) and a tool-use rubric for the agentic CS corpus (1,500 rows on nemotron).

Cross-provider judging is used throughout: OpenAI-track responses are judged by claude-opus-4-6; Anthropic-track responses are judged by gpt-5.4. This eliminates the self-preference effect that occurs when a model judges its own outputs. The A/B position of the two responses is deterministically flipped per record using a SHA-256 hash of the record ID, mitigating position bias. Each response is scored on a 1–5 scale across multiple quality dimensions.

3.5 Equivalence testing

We test whether Phantm's quality matches the baseline using the standard two one-sided tests (TOST) procedure (Lakens, 2017) against a threshold of ±0.2 points on the 5-point Likert scale. A source passes equivalence at this threshold if its entire confidence interval falls within ±0.2.

The ±0.2 threshold draws from two well-established results. Research on patient-reported outcomes (Norman, Sloan & Wyrwich, Medical Care, 2003) finds that users typically can't reliably perceive differences smaller than about half a standard deviation, or roughly 0.5 points on a 5-point Likert scale. Studies that ask raters directly to identify the smallest meaningful difference (Anvari & Lakens, Journal of Experimental Social Psychology, 2021) put the threshold even lower, in the 0.20–0.39 range for single Likert items. We use 0.2, the strictest threshold supported by this literature, as the equivalence band.

The standard deviation of per-row quality deltas in this evaluation is approximately 0.83 across all LLM-judged rows (and 0.90 for the general phase considered separately). This is materially below the per-rating noise floor of 1.0–1.2 documented in the LLM-as-judge literature. For deterministic-benchmark accuracy, the confidence interval is computed from row-level paired binary outcomes.

3.6 Relation to prior work

Public evaluations of LLM cost-optimization products fall into two patterns. Academic frameworks, most notably Martian's RouterBench (Hu et al., 2024) and LMSYS's RouteLLM (Ong et al., 2024), publish full methodology, datasets, and code, summarizing cost-quality tradeoffs through point comparisons or composite scalars. Vendor reports typically lead with headline cost reductions and report response quality either informally or not at all. To our knowledge, no prior public evaluation in this product category has used formal equivalence testing against a pre-set threshold. The methodology, TOST against a smallest-effect-size-of-interest drawn from measurement-theory literature, is the standard procedure for testing equivalence claims in clinical research and the social sciences.

4. Results

4.1 Cost

Cost reduction by phase ($USD)

Each bar represents that phase's baseline spend, scaled to the combined-baseline total ($57.97). The green segment is what Phantm spent; the red segment is what it cut.

General benchmarks−35.3%

$19.71optimized

−$10.75saved

CX production workloads−60.1%

$10.97opt

−$16.55saved

Combined−47.1%

$30.68optimized

−$27.30saved

Kept (optimized cost) Cut (savings) Bars scale to combined baseline ($57.97)

CX savings are substantially higher than general benchmark savings for two reasons. First, the CX corpus structure (stable system prompts, many queries per prompt) naturally enables cache orchestration to amortize over many requests. Second, CX traffic contains a larger fraction of straightforward requests that route safely to smaller models than the general benchmark mix does. Both of these conditions resemble real production CX deployments more closely than the general benchmark corpus.

The reported cost reductions are net of all routing decisions in both directions. While the optimizer routes most requests to smaller models, it also up-routes a small fraction to more capable models when the gate classifier identifies the input as too complex for the baseline tier. Approximately 1.5% of general-benchmark requests and 0.2% of CX requests were up-routed in this way, typically to a top-tier reasoning model. These up-routes increase per-request cost on the affected rows but preserve response quality where the baseline would have been insufficient. They are a feature of the routing logic, not an exception to the cost story.

4.2 Where the savings come from

Phantm's optimization pipeline applies six stages adaptively to each request. The cost reduction is the joint result of these stages, not the contribution of any single one. The activation rates below indicate, for each stage, the fraction of requests on which the stage did meaningful work.

Stage activation rate (% of requests)

Output shaping

94.6% gen

75.1% CX

Adaptive routing

60.8% gen

72.8% CX

Semantic compression

22.1% gen

45.1% CX

Prompt cleanup

14.0% gen

23.5% CX

Provider cache (Anthropic)

—

89.8% CX hit

Provider cache (OpenAI)

—

36.3% CX hit

Pruning

0.2% gen

<0.1% CX

General phase CX phase Pruning is reserved for context-overflow risk; negligible in benchmark traffic

4.3 Quality

Quality is reported across two views: deterministic benchmark accuracy (the most concrete measure of correctness) and quality gaps from LLM judging (the closest signal to user-perceptible quality across heterogeneous tasks). Both views address different questions and should be read together. Confidence intervals and equivalence tests use the methodology described in §3.5.

A note on win/tie/loss verdicts. Contrastive forced-choice judging, the industry-standard metric for LLM evaluation, requires the judge to pick a winner even when both responses are functionally equivalent, which artificially amplifies small underlying differences. Raw score gaps and deterministic accuracy, with their associated confidence intervals and equivalence tests, are the more representative measures of user-perceptible quality.

Deterministic benchmark accuracy

Accuracy: baseline vs. optimized

Each row shows baseline accuracy (steel dot) and optimized accuracy (blue dot) on a 0–1 scale.

HotPotQA

gap 0.021

MMLU

gap 0.017

IFEval

gap 0.011

BFCL

gap 0 (tool use)

BBH

gap 0.032

Baseline Optimized x-axis scaled 0 – 1.00

Aggregated across all five graders, n-weighted: gap = 0.013 ± 0.019. The interval crosses zero; the data are consistent with no real difference. Three of five benchmarks have gaps under 2 percentage points; tool-use accuracy (BFCL) is functionally equivalent; the largest gap is on BBH at 3.2 percentage points.

LLM-judge quality gaps

Baseline vs. optimized scores (full 0–5 Likert scale)

Two horizontal lines per dimension show baseline (steel) and optimized (blue / green) scores on the full 1–5 axis. The visual takeaway: lines are essentially the same length, and the gap is barely visible at full scale.

Baseline Optimized · general Optimized · CX Axis 0–5; baseline placeholder = 4.20 per dimension

Overall quality gaps are 0.091 ± 0.039 in the general phase and 0.067 ± 0.020 in the CX phase, on a 5-point scale. Both averages sit well below the ±0.2 equivalence threshold (§3.5): the general phase at less than half the threshold, the CX phase at roughly one-third. The accuracy dimension is the most concrete measure of correctness within the LLM-judge framework and shows the smallest gap in both phases.

CX per-source quality (TOST equivalence at ±0.2)

Confidence intervals against the equivalence band

Each interval is the per-source quality gap ± 1.96σ. The shaded band is the ±0.2 equivalence threshold. A source passes equivalence if its entire interval falls inside the band.

Source	Vertical	N	Quality gap	TOST at ±0.2
Banking77	Banking	750	0.001 ± 0.060	✓ pass
Bitext	Ecommerce CS	750	0.015 ± 0.060	✓ pass
Nemotron	Agentic CS	1,497	0.041 ± 0.042	✓ pass
ABCD	Retail support	1,750	0.051 ± 0.039	✓ pass
MultiWOZ	Multi-domain	741	0.113 ± 0.060	✓ pass
Taskmaster-2	Multi-vertical	1,500	0.150 ± 0.042	✓ pass

All six CX sources have quality gaps that pass equivalence at the ±0.2 threshold, meaning each gap, even at the upper end of its interval, stays below the level at which users would notice a difference. Three sources (Banking77, Bitext, Nemotron) have intervals that cross zero, meaning the gap cannot be statistically distinguished from no difference at all.

4.4 Latency

End-to-end overhead (warm CX state)

Metric	End-to-end (ms)	Optimizer-internal (ms)
Mean	178.1	153.2
Median (P50)	181.6	157.6
P95	206.0	178.5

Per-stage breakdown (CX warm, mean ms)

Stage	Mean ms
Gate (real-time classifier)	130.7
Compression	13.2
Routing	4.1
Provider cache lookup / padding	2.4
Cleanup	1.2
Pruning	0.5

The real-time classifier that drives adaptive routing dominates the latency budget. All other stages combined add under 22 milliseconds. At ~170ms end-to-end, the pipeline adds less than 5% to the latency of a typical upstream LLM call, well below any threshold at which users would notice an additional delay.

5. Conclusion

Across 13,491 prompts spanning general AI benchmarks and production customer-experience workloads, Phantm reduced inference cost by 47.1% with no compromise to response quality. Accuracy on deterministic benchmarks was effectively unchanged from the baseline. Every LLM-judged source produced responses whose quality gap fell below the threshold at which users would notice a difference, and three of the six CX sources showed no measurable difference from the baseline at all.

Customers can deploy Phantm in front of existing LLM workloads to capture these savings without changing their application code, swapping models, or sacrificing the quality of the responses their users receive.

6. Limitations

Semantic cache activation is low in this evaluation. Semantic caching fired on 36 of 6,991 CX requests (0.51%). This rate reflects the diversity of the benchmark corpus: 21 distinct system prompts across seven verticals with limited naturally-occurring query repetition. Production deployments with single-customer traffic patterns, where the same questions recur frequently from the same underlying user base, will show substantially higher semantic cache hit rates.
Benchmark corpus is public. The general benchmark corpus is drawn from widely-used public datasets. Real customer traffic differs from these distributions in prompt length, query complexity, language, and domain mix. The CX corpus is closer to production traffic in structure but is still benchmark-derived. Production results in any specific deployment will depend on the workload's particular characteristics.
Single optimizer configuration. This evaluation runs a single set of optimizer settings against the baseline. The configuration is tuned for general-purpose CX deployment; customers with specialized workloads (e.g., heavy creative generation, code completion, long-form reasoning) may benefit from adjusted thresholds and feature weights.

Appendix A. Routing distribution

CX phase (n = 6,991)

From → To	N	Share
Full → mini downroute	1,754	25.1%
Full → nano downroute	1,535	22.0%
Mini → mini same	1,587	22.7%
Mini → nano downroute	1,495	21.4%
Nano → nano same	310	4.4%
Nano → mini uproute	288	4.1%
Mini → full uproute	14	0.2%
Full → full same	6	0.1%
Nano → full uproute	2	0.0%
Aggregate downroute	4,784	68.4%

General benchmarks (n = 6,234)

From → To	N	Share
Mini → nano downroute	1,995	32.0%
Mini → mini same	1,276	20.5%
Nano → nano same	1,115	17.9%
Full → mini downroute	1,022	16.4%
Full → nano downroute	507	8.1%
Nano → mini uproute	204	3.3%
Full → full same	55	0.9%
Mini → full uproute	54	0.9%
Nano → full uproute	6	0.1%
Aggregate downroute	3,524	56.5%

References

Anvari, F., & Lakens, D. (2021). Using anchor-based methods to determine the smallest effect size of interest. Journal of Experimental Social Psychology, 96, 104159.
Hu, Q. J., et al. (2024). RouterBench: A Benchmark for Multi-LLM Routing System. arXiv:2403.12031.
Lakens, D. (2017). Equivalence tests: A practical primer for t-tests, correlations, and meta-analyses. Social Psychological and Personality Science, 8(4), 355–362.
Norman, G. R., Sloan, J. A., & Wyrwich, K. W. (2003). Interpretation of changes in health-related quality of life: The remarkable universality of half a standard deviation. Medical Care, 41(5), 582–592.
Ong, I., et al. (2024). RouteLLM: Learning to Route LLMs with Preference Data. arXiv:2406.18665.