Why LLM Inference Costs Are the Wrong Unit of Measure
Every time an LLM solves a problem, it forgets how it solved it. Feed the model the same class of problem tomorrow, and it starts from scratch — burning the same tokens, taking the same computational path, producing (hopefully) the same answer. For one-off tasks this is fine. For tasks that recur across thousands of instances, it is a straightforward waste.
This is the gap ReaComp addresses. Published May 6, 2026 by researchers at Carnegie Mellon University (arXiv:2605.05485), ReaComp asks a simple question: if an LLM can reason through a problem once, can we compile that reasoning into a deterministic symbolic solver that handles the same class of problems indefinitely — at zero inference cost?
The answer, according to the paper, is yes. And the numbers are striking.
What ReaComp Actually Does
The name stands for Reasoning Compiler. The core pipeline has three stages:
Stage 1 — Collect reasoning traces. Run an LLM on a representative sample of problems in a domain. Log the complete reasoning chains, not just the final answers. These traces expose the patterns the model uses: what rules it applies, in what order, and under what conditions.
Stage 2 — Compile traces into a symbolic solver. A coding agent (the paper uses a GPT-4-class model) analyzes the traces and synthesizes a Python program that implements the identified rules as a deterministic symbolic algorithm. The solver operates over a constrained domain-specific language (DSL) — a small, finite set of operations appropriate to the task domain.
Stage 3 — Deploy the solver. The compiled solver runs on all future instances of the same problem class. No LLM calls. No token cost. Pure Python execution.
When the solver fails — because a problem falls outside its rule coverage — control passes to the LLM in a hybrid mode. This fallback is intentional, not a bug. The solver handles the common distribution; the LLM handles the tail.
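In code terms, the pipeline shape is roughly the following. This is an illustrative sketch, not the paper's API: llm_trace, synthesize, and llm_solve are stand-ins for the trace-collection LLM, the coding agent, and the fallback model.

from typing import Callable

def compile_solver(sample_problems: list,
                   llm_trace: Callable,    # Stage 1: problem -> reasoning trace
                   synthesize: Callable):  # Stage 2: traces -> solver function
    """Stages 1-2: collect reasoning traces, then compile them into a solver."""
    traces = [llm_trace(p) for p in sample_problems]
    return synthesize(traces)

def answer(problem, solver: Callable, llm_solve: Callable):
    """Stage 3, hybrid mode: solver first; the LLM runs only on solver failure."""
    result = solver(problem)               # deterministic, zero token cost
    return result if result is not None else llm_solve(problem)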
The Benchmarks Behind the Claims
The paper evaluates ReaComp on two main benchmarks: PBEBench and SLR-Bench.
PBEBench (arXiv:2505.23126)
PBEBench is a programming-by-examples benchmark inspired by historical linguistics — specifically, the problem of inducing sound change laws from before/after word pairs. Given input–output examples like [("kaim", "keim"), ("bail", "beil")], a solver must infer the rule ai → ei and apply it correctly to unseen instances.
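In Python terms, the induced rule is a one-line rewrite. This is a toy illustration of the task, not the benchmark's own harness:

import re

# Rule induced from the pairs above: ai -> ei
print(re.sub('ai', 'ei', 'kaim'))  # 'keim'
print(re.sub('ai', 'ei', 'bail'))  # 'beil'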
The benchmark is notable for several reasons:
- It does not require specialized linguistic knowledge
- Difficulty emerges from formal properties of the rule space (feeding, bleeding, counter-feeding, counter-bleeding interactions between rules)
- It resists data contamination because problems can be generated synthetically at arbitrary difficulty
- It mirrors real scientific and engineering tasks where the goal is to induce generalizable rules, not just pattern-match surface forms
PBEBench-Lite is designed to discriminate between models of varying capability. PBEBench-Hard requires inducing programs of a complexity comparable to what historical linguists actually construct; in the original benchmark paper's evaluation, even expensive test-time scaling solves fewer than 5% of hard instances.
ReaComp Results on PBEBench
| System | PBEBench-Lite | PBEBench-Hard | Token Cost |
|---|---|---|---|
| LLM baseline (GPT-4o) | — | 68.4% | 1× |
| LLM + test-time scaling | — | ~68.4% | high |
| ReaComp solver (standalone) | 91.3% | 84.7% | ~0 |
| ReaComp hybrid (solver + LLM) | — | 85.8% | 0.22× (−78%) |
The standalone solver ensemble — compiled from LLM traces and then run without any LLM calls — reaches 84.7% on PBEBench-Hard, outperforming LLMs with test-time scaling by +16.3 percentage points. On the easier Lite variant it reaches 91.3%.
The hybrid mode (solver first, LLM fallback) reaches 85.8% — slightly above the standalone solver — while cutting reported token usage by 78%. This is the production-relevant configuration for most real deployments.
SLR-Bench Results
SLR-Bench (arXiv:2506.15787) is a scalable logical reasoning benchmark with 19k+ prompts across 20 curriculum difficulty levels. On its hard tier, the baseline LLM achieves 34.4%. The ReaComp hybrid raises this to 58.0% — a substantial improvement that transfers across domain boundaries, demonstrating that the solver induction approach is not benchmark-specific.
The paper also reports a zero-shot transfer to a real historical linguistics task (predicting actual sound changes in natural language data), where the solver ensemble reaches 80.1% accuracy under ensembling and recovers plausible linguistic rules that align with known phonological patterns.
The Economics of Amortized Inference
The key insight here is about amortized cost, a concept borrowed from algorithm analysis and compiler design.
Building the solver costs something upfront: LLM calls to generate traces, coding agent time to synthesize the solver code. But this is a one-time fixed cost. Once the solver exists, every subsequent instance it handles costs zero tokens. The more instances you run, the cheaper the effective per-instance cost becomes.
If you run 1,000 instances and the solver handles 800 of them (80% coverage), you pay:
- 1 solver construction cost (fixed, amortized)
- 200 LLM fallback calls for the remaining 20%
Compared to running 1,000 full LLM calls, this is roughly a 78–80% token reduction — which matches what the paper reports.
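A back-of-envelope version of that arithmetic, in Python. The 20-call construction cost is an assumption for illustration, not a figure from the paper:

n_instances = 1_000
coverage    = 0.80   # fraction of instances the solver handles
llm_cost    = 1.0    # normalized cost of one full LLM call
build_cost  = 20.0   # one-time induction cost (assumed: ~20 calls' worth)

baseline = n_instances * llm_cost
hybrid   = build_cost + (1 - coverage) * n_instances * llm_cost
print(f"token reduction: {1 - hybrid / baseline:.0%}")  # token reduction: 78%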
This amortization logic is why ReaComp scales favorably compared to approaches that try to reduce per-call cost through quantization or smaller models. Those approaches make every call slightly cheaper. ReaComp eliminates most calls entirely.
A Minimal Sandbox Reproduction
Effloow Lab ran a minimal PoC to verify the conceptual mechanics. Using Python 3.12, we built a string-rewrite solver that induces rules from input–output examples — mirroring the PBEBench DSL without the full coding-agent pipeline.
import re
from itertools import combinations, permutations

def apply_program(program: list[tuple[str, str]], text: str) -> str:
    """Apply a sequence of regex rewrite rules to a string."""
    for pattern, replacement in program:
        text = re.sub(pattern, replacement, text)
    return text

CANDIDATE_RULES = [
    (r'p', 'f'),    # p → f (First Consonant Shift analog)
    (r'b', 'p'),    # b → p
    (r'ai', 'ei'),  # vowel shift
    (r'au', 'o'),   # diphthong collapse
    # ... additional candidates
]

def induce_program(examples: list[tuple[str, str]], max_rules: int = 3):
    """Find a minimal ordered rule sequence explaining all examples."""
    for n in range(1, max_rules + 1):
        for combo in combinations(CANDIDATE_RULES, n):
            for ordered in permutations(combo):  # rule order matters (feeding/bleeding)
                if all(apply_program(list(ordered), s) == t for s, t in examples):
                    return list(ordered)
    return None  # no covering program found: the signal for LLM fallback
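As a usage sketch with the functions defined above (the xaz → yab pair is the out-of-coverage case discussed below):

examples = [('pat', 'fat'), ('pit', 'fit')]
program = induce_program(examples)           # [('p', 'f')]
print(apply_program(program, 'pan'))         # 'fan'
print(induce_program([('xaz', 'yab')]))      # None: out of coverage, fall back to the LLM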
Running this against three task types produced the following results:
Task 1: p→f consonant shift
Training: [('pat', 'fat'), ('pit', 'fit')]
Induced program: [('p', 'f')]
✓ pan → fan ✓ pin → fin ✓ pup → fuf
Task 2: b→p→f two-step chain
Training: [('bat', 'fat'), ('bit', 'fit')]
Induced program: [('b', 'p'), ('p', 'f')]
✓ ban → fan ✓ big → fig
Task 3: ai→ei vowel shift
Training: [('kaim', 'keim'), ('bail', 'beil')]
Induced program: [('ai', 'ei')]
✓ rain → rein ✓ tail → teil
The solvers generalize correctly to unseen inputs with zero additional computation after induction. The failure cases were also informative: when the rule space was insufficient (an xaz → yab mapping that requires rules outside our candidate set), the inducer returned None, which is exactly the signal that should trigger the LLM fallback.
Limitations of this sandbox: The PoC uses brute-force rule search over a manually defined candidate set. In ReaComp, the coding agent generates the solver code from trace analysis, which allows it to discover DSL structure rather than search a predefined space. The full system is substantially more capable than this minimal reproduction.
No GitHub repository was available as of 2026-05-11 — the paper does not yet have a public code release.
How to Apply This Pattern in Your Own Systems
The ReaComp architecture is not limited to the specific benchmarks it was evaluated on. The underlying pattern — trace collection → solver induction → hybrid deployment — applies anywhere you have a class of structurally similar tasks that an LLM solves repeatedly.
Good candidates for ReaComp-style solver induction:
- Data transformation pipelines — if you use an LLM to normalize records, parse date formats, or extract structured fields from text, a solver can likely handle 70–90% of cases after seeing a few hundred examples
- Code generation for constrained domains — SQL query generation for a fixed schema, configuration file generation, or regex construction from natural language descriptions
- Classification with fixed label spaces — the LLM's reasoning about which label applies can often be compiled into a rule-based classifier for common patterns
- API call routing — if an LLM routes requests to backend endpoints, the routing logic can frequently be extracted as a deterministic decision tree
The implementation workflow:
- Run your LLM on 50–200 representative examples, logging full reasoning chains
- Use a coding agent (Claude Code, GPT-4o in agent mode) to analyze the traces and synthesize a Python solver
- Evaluate the solver's coverage on a held-out test set
- Deploy solver first; route low-confidence or failed cases to the LLM fallback
- Periodically retrain the solver as new edge cases accumulate in the LLM fallback logs
The key engineering requirement is maintaining a fallback path. Do not deploy a ReaComp-style solver without LLM fallback — the solver will fail on tail cases, and you need a way to handle them gracefully.
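A minimal deployment wrapper along these lines, with the coverage instrumentation from step 4 built in. This is an illustrative sketch; log_tail_case is a hypothetical helper for persisting fallback inputs for re-induction:

from collections import Counter

coverage_stats = Counter()

def solve_with_fallback(instance, solver, llm_solve, log_tail_case=print):
    """Solver-first dispatch with coverage instrumentation."""
    result = solver(instance)
    if result is not None:
        coverage_stats['solver_hit'] += 1
        return result
    coverage_stats['llm_fallback'] += 1
    log_tail_case(instance)   # these cases seed the next re-induction run
    return llm_solve(instance)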
What Makes PBEBench-Hard Actually Hard
Understanding why PBEBench-Hard is hard illuminates why ReaComp's results are significant.
The benchmark is inspired by historical phonology — the branch of linguistics that studies how sounds in languages change over time. Linguists have identified four types of rule interactions that create systematic difficulty:
- Feeding: Rule A creates the environment that triggers Rule B. For example, b → p creates new p's, which then trigger p → f. The rules must be applied in order.
- Bleeding: Rule A destroys the environment Rule B would need. If ai → ei applies first, a subsequent rule targeting a (say a → o) has fewer inputs to act on, because the a in ai is already gone.
- Counter-feeding: Rules that would feed each other are ordered so the feeding does not occur.
- Counter-bleeding: Rules that would bleed each other are ordered to prevent the bleeding.
LLMs struggle with these interactions because they require tracking state across rule applications — essentially simulating an interpreter. The symbolic solver, by contrast, is an interpreter. It executes the rules mechanically and never confuses rule ordering.
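Using apply_program from the sandbox earlier, the order sensitivity is easy to demonstrate:

# Feeding order: b → p runs first, creating new p's that p → f then rewrites.
feeding         = [('b', 'p'), ('p', 'f')]
counter_feeding = [('p', 'f'), ('b', 'p')]
print(apply_program(feeding, 'bat'))          # 'fat' (b → p → f)
print(apply_program(counter_feeding, 'bat'))  # 'pat' (p → f saw no p yet)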
This is why the standalone solver reaches 84.7% while LLMs reach 68.4% even with expensive test-time scaling. The solver architecture is structurally suited to the problem in a way that neural next-token prediction is not.
Common Mistakes When Building Solver Pipelines
Mistake 1: Skipping the trace collection step.
Trying to synthesize a solver by describing the problem to the coding agent without first collecting LLM reasoning traces usually produces a solver that covers the obvious cases but misses the edge cases that traces would have exposed.
Mistake 2: Deploying the solver without coverage tracking.
A solver that handles 80% of cases looks great until a production incident reveals that the 20% it misses are all high-priority requests. Instrument every solver invocation with a success/failure signal and route failures to the LLM.
Mistake 3: Treating the solver as a permanent artifact.
Task distributions shift. A solver induced from 2024 traces may underperform on 2026 data. Treat solver re-induction as a routine maintenance task, not a one-time deployment.
Mistake 4: Using the full LLM for trace collection when a smaller model suffices.
Trace collection quality depends on the LLM's reasoning ability, but not all domains require a frontier model. For constrained DSLs, a smaller model's traces may be sufficient and significantly cheaper.
The Broader Neuro-Symbolic Picture
ReaComp sits at the intersection of two research traditions that have historically operated in parallel: neuro-symbolic AI and program synthesis.
Neuro-symbolic AI has long argued that symbolic reasoning is more reliable for structured tasks — better at handling rule composition, transitive inference, and systematic generalization. The challenge has always been how to bridge the neural and symbolic worlds: symbolic systems need formal specifications that are expensive to write by hand.
ReaComp's contribution is using the LLM's reasoning trace as that specification — without requiring a human to write formal rules. The LLM acts as a general-purpose trace generator. The coding agent acts as the formalizer. The symbolic solver acts as the efficient inference engine.
This three-component architecture sidesteps the traditional neuro-symbolic bottleneck (human-authored formal rules) while retaining the efficiency advantage of symbolic execution.
It is also worth noting what ReaComp does not claim: it does not claim that symbolic solvers should replace LLMs generally. The hybrid architecture is the point. LLMs remain essential for induction and for handling the distribution tail. Solvers handle the common case efficiently. The system is better as a whole than either component alone.
FAQ
Q: Does ReaComp require access to a specific LLM API?
The paper does not tie the approach to a specific model. Trace collection can use any reasoning-capable LLM. The coding agent step benefits from a model strong at code generation. In practice, GPT-4-class models (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Pro) are likely sufficient for most domains.
Q: How many training examples does solver induction require?
The paper does not give a universal answer — it varies by domain complexity. For the PBEBench tasks, a relatively small number of traces appears sufficient. For more complex domains, more traces will be needed to expose the full rule space. Start with 50–100 examples and evaluate coverage before expanding.
Q: Is the compiled solver interpretable?
Yes, and this is one of its advantages. Because the solver is Python code synthesized by a coding agent, a developer can read it, debug it, and modify it. This is substantially more auditable than a neural network's internal representations. For regulated domains where explainability matters, this interpretability is not incidental — it is a core feature.
Q: What happens when the solver and the LLM disagree on an answer?
The paper uses the solver as a first-pass filter. If the solver produces an answer, that answer is used directly. The LLM fallback is only invoked when the solver signals failure (returns None or an explicit failure code). There is no ensemble voting between solver and LLM in the main experimental setup.
Q: Can this approach handle domains with continuous (non-symbolic) outputs?
The current ReaComp formulation targets discrete symbolic outputs — programs, rules, classifications. For continuous outputs like generated text, the compilation step would need to be significantly extended. The paper does not address this case.
Key Takeaways
- ReaComp compiles LLM reasoning traces into deterministic Python solvers via a coding agent pipeline
- Standalone solver ensembles reach 84.7% on PBEBench-Hard, outperforming LLMs by +16.3 percentage points at zero token cost
- Hybrid solver+LLM mode achieves 85.8% accuracy while reducing token usage by 78%
- The approach generalizes across benchmarks (SLR-Bench hard tier: 34.4% → 58.0%) and transfers to real historical linguistics data (80.1%)
- The key economic insight is amortized inference: fixed induction cost, then zero marginal cost per solved instance
- Practical deployment requires maintaining an LLM fallback, coverage tracking, and periodic solver re-induction as data distributions shift
- No public code release is available as of 2026-05-11; the paper is at arXiv:2605.05485
Bottom Line
ReaComp demonstrates that LLM reasoning traces are not just intermediate computation — they are extractable artifacts that can be compiled into zero-cost symbolic solvers. The 84.7% standalone accuracy on PBEBench-Hard and 78% token reduction in hybrid mode are not marginal improvements; they represent a different cost structure for recurring inference tasks. For any system that runs an LLM on structurally similar problems at scale, the solver induction pattern is worth evaluating seriously.
The ReaComp pattern joins a growing toolkit of techniques — speculative decoding, adaptive KV cache quantization, prompt caching — for reducing the operational cost of LLM inference without sacrificing accuracy. What distinguishes it is that it targets the architectural cost of inference, not just the per-token cost.