We Deleted 77% of Our Code and Got Better Results: NeuralBridge V10
The brutal truth about why simpler is better in AI API reliability.
The Wake-Up Call
We built NeuralBridge to solve one problem: AI API calls fail, and developers need automatic recovery without babysitting.
After 9 versions and 13,000 lines of code, we made a disturbing discovery.
Our "smartest" version was actually our worst.
The V9 Bayesian Experiment: When More Complexity Hurts
In V9, we implemented Bayesian inference for fault diagnosis. It sounded brilliant:
- Probabilistic fault classification
- Prior probabilities updated with new evidence
- Theoretically optimal decision-making
The results?
| Version | Timeout Recovery | Invalid Model | Total |
|---|---|---|---|
| V8.2 (Rules) | 100% | 100% | 100% |
| V9 (Bayesian) | 99.3% | 100% | 99.7% |
V9 was mathematically worse than V8.2's simple if-else statements.
The "smart" Bayesian approach added overhead that actually hurt performance. After 600 real API calls testing both approaches, the evidence was undeniable.
The Numbers That Changed Everything
100 Rounds of Real API Calls (Zero Mocks)
| Fault Type | Strategy | Recovery Rate | Avg Latency |
|---|---|---|---|
| Timeout | SimpleRetry | 83% | 3,941ms |
| Timeout | LiteLLM | 87% | 4,928ms |
| Timeout | V8.2 Flywheel | 98% | 2,211ms |
| Invalid Model | SimpleRetry | 0% | 3,901ms |
| Invalid Model | LiteLLM | 100% | 5,363ms |
| Invalid Model | V8.2 Flywheel | 100% | 5,239ms |
Key insight: V8.2 recovers 11 more timeouts than LiteLLM with half the latency.
The rule-based flywheel isn't just better—it's 2x faster.
What We Learned
1. Network Effect Flywheel
Every customer hits unique failure patterns. When one customer solves a problem, that solution helps everyone.
Customer A hits error "TPM limit at 2:30 AM UTC"
↓
Solution discovered and stored
↓
Customer B hits similar error at 3:00 AM UTC
↓
Instant recovery from knowledge base
This is why we crawl GitHub Issues, Stack Overflow, and Status Pages. They're free ammunition for our flywheel.
2. The Dual-Flywheel Architecture
We separated two concerns that shouldn't be mixed:
- Training Flywheel (Offline): Parallel strategy testing, can be slow, discovers optimal solutions
- Execution Flywheel (Real-time): Must be fast, just looks up pre-tested strategies
# Training (can take seconds)
fault → test 5 strategies in parallel → store best in knowledge base
# Execution (must be milliseconds)
fault → lookup knowledge base → execute best strategy immediately
The execution path has zero ML, zero statistics, just rules. Because rules are fast.
3. Real Failure Data > Theoretical Models
We scraped:
- 50+ GitHub Issues (openai-python, anthropic-sdk)
- 30+ Stack Overflow questions
- 15+ OpenAI status incidents
- 8+ Anthropic status incidents
Real error messages → Real patterns → Real strategies
No synthetic data. No simulated failures. Every strategy was tested against actual API errors.
V10: The Rebuild
What We Removed
| Component | Lines | Reason |
|---|---|---|
| Bayesian inference | 800 | Slower, worse accuracy than rules |
| MDP/POMDP planning | 1,200 | Overhead unjustified by results |
| Complex retry budgets | 600 | Better to have simple backoff |
| Unused providers | 400 | Maintenance burden, no value |
| Total Removed | ~3,000 | Better performance |
What We Kept
| Strategy | Why |
|---|---|
param_fix |
Solves 40% of param errors |
immediate_switch_model |
Instant recovery for bad models |
dynamic_backoff |
Actually works unlike fancy alternatives |
Result
V9: ~13,000 lines
V10: ~2,200 lines
Reduction: 77%
The code fits in your head now.
The Brutal Honest Status
What We've Validated ✅
- 2 fault types tested: timeout, invalid_model
- 600 real API calls across 100 rounds
- V8.2 vs LiteLLM comparison complete
What We Haven't Tested Yet ❌
- Rate limiting scenarios (no live 429s during testing)
- Quota exceeded (billing edge cases)
- Connection errors (network instability)
Our Current Reality
- 0 paying customers
- Open source only (GitHub blocked in China, can't even push code)
- Building in public
We're not pretending to be production-ready. We're showing you the data and letting you decide.
Why We Published This
GitHub is blocked in China. We can't push code or build a community there.
So we're using content marketing to reach developers who might benefit from what we've learned:
- Simpler can be better - The data proves it
- Real testing > theoretical optimization - 600 calls, zero mocks
- Network effects work - Every failure you solve makes the system smarter for everyone
The Architecture That Made It Possible
┌─────────────────────────────────────────────────────────────┐
│ NeuralBridge V10 │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Training │────▶│ Knowledge │ │
│ │ Flywheel │ │ Base │ │
│ │ (Offline/slow) │ │ (Strategies) │ │
│ └─────────────────┘ └─────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Execution Flywheel (Real-time) ││
│ │ ┌──────────┐ ┌──────────┐ ┌──────────────────────┐ ││
│ │ │Diagnoser │─▶│Strategy │─▶│ Executor │ ││
│ │ │(Rules) │ │Router │ │ switch_model │ ││
│ │ │ │ │(Lookup) │ │ fix_params │ ││
│ │ │ │ │ │ │ retry_with_delay │ ││
│ │ └──────────┘ └──────────┘ └──────────────────────┘ ││
│ └─────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────┘
Try It Yourself
from flywheel import quick_recover
error = TimeoutError("Request timed out after 30s")
result = quick_recover(error)
if result.success:
print(f"Recovered with: {result.strategy_used}")
print(f"Latency: {result.latency_ms}ms")
Or integrate directly:
from flywheel import NeuralBridgeV10
engine = NeuralBridgeV10(api_key="your-key")
try:
response = openai.ChatCompletion.create(**params)
except Exception as e:
# Auto-heal and retry
fixed_params = engine.recover_sync(e, params)
response = openai.ChatCompletion.create(**fixed_params)
The Honest Numbers (One More Time)
| Metric | Value |
|---|---|
| Code reduction | 77% (13,000 → 2,200 lines) |
| Timeout recovery | 98% (vs LiteLLM's 87%) |
| Speed improvement | 2x faster (2,211ms vs 4,928ms) |
| Fault types validated | 2 |
| Real API calls | 600 |
| Paying customers | 0 |
We deleted 77% of our code and got better results. That's the story.
NeuralBridge is open source. We're building in public and sharing what we learn. No hype, just data.
Tags: ai, api, reliability, selfhealing, openai, llm, error-handling
Published 2024-07-07 | Last updated: 2024-07-07
Top comments (0)