# The Hidden Cost of Calling AI Too Early
I stopped calling AI on every request — and everything got better.
## The Problem
In one of my projects, I was generating AI-based insights from user activity.
The initial design was simple:
Every request for today’s insight → call the AI model → return a fresh response.
```
GET /api/insights/today
```
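In code, that first version amounted to something like the minimal sketch below. The controller shape, `AiClient`, and `Insight` are illustrative stand-ins, not the project's actual code:

```java
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

// Naive design: every request for today's insight goes straight to the model.
// AiClient and Insight are hypothetical stand-ins for the real types.
@RestController
class InsightController {

    private final AiClient aiClient;

    InsightController(AiClient aiClient) {
        this.aiClient = aiClient;
    }

    @GetMapping("/api/insights/today")
    Insight today(@RequestParam String userId) {
        // No gating, no cache, no cooldown: model calls scale 1:1 with traffic.
        return aiClient.generateInsightFor(userId);
    }
}
```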
At first, this felt clean and correct.
But in practice, it created serious problems:
- 429 rate limit errors within hours
- Daily quota exhausted before noon
- Random failures affecting users
- Costs scaling linearly with traffic
The system was working — but it wasn’t sustainable.
## The Real Issue
The problem wasn’t the AI provider.
It was the trigger model.
The system never asked basic questions before making an expensive call:
- Has anything actually changed?
- Did I already generate a response recently?
- Is the user even active today?
Without these checks, every request was treated as:
“Generate a new insight now.”
That assumption was the real bug.
## The New Approach
Instead of adding caching on top, I redesigned the system into an event-driven pipeline.
AI became the last step, not the default.
## System Flow
Here’s the simplified request flow:
```mermaid
flowchart TD
    A[Request for today's insight] --> B{Activity today?}
    B -- No --> C[Reuse latest insight or fallback]
    B -- Yes --> D{Meaningful change?}
    D -- No --> C
    D -- Yes --> E{Cooldown passed?}
    E -- No --> C
    E -- Yes --> F{Daily cap reached?}
    F -- Yes --> C
    F -- No --> G{Global AI limit reached?}
    G -- Yes --> H[Use deterministic fallback]
    G -- No --> I[Call AI model]
    I --> J[Persist insight]
    H --> J
    C --> J
```
Most requests now end at a simple database read — not an AI call.
## The Five-Layer Redesign
### 1. Activity Gate
Start with the cheapest check:
```java
// Layer 1: the cheapest check. If the user did nothing today, skip the model entirely.
boolean hasActivity = activityService.hasActivityToday(userId, context);
if (!hasActivity) {
    return getLatestOrFallback(userId, today);
}
```
If nothing happened → don’t call AI.
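The `getLatestOrFallback` helper used throughout is deliberately boring: a cheap read, no model involved. A minimal sketch, assuming a repository that stores one insight per user per day; the repository and factory names are illustrative:

```java
// Cheap path: reuse today's stored insight if it exists, otherwise the most recent one,
// otherwise a deterministic non-AI fallback.
private Insight getLatestOrFallback(String userId, LocalDate today) {
    return insightRepository.findByUserIdAndDate(userId, today)
            .or(() -> insightRepository.findLatestByUserId(userId))
            .orElseGet(() -> Insight.deterministicFallback(userId, today));
}
```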
### 2. Event-Driven Triggers
AI should only run when something meaningful changes.
Examples:
- user updates intent
- significant behavior change
- threshold crossed
No change → reuse previous insight.
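What counts as "meaningful" stays deterministic. A minimal sketch of the trigger check, assuming activity is summarized as a numeric score and using the `activity-delta: 30` threshold from the config further down; `ActivitySnapshot` and the accessor names are illustrative:

```java
// Layer 2: only treat today as "changed" when the activity score moved by at least
// the configured delta since the last generated insight.
// activityDelta maps to the activity-delta setting (30 by default).
private boolean hasMeaningfulChange(ActivitySnapshot current, Insight lastInsight) {
    if (lastInsight == null) {
        return true; // nothing generated yet, so any activity counts as a change
    }
    int delta = Math.abs(current.score() - lastInsight.activityScoreAtGeneration());
    return delta >= activityDelta;
}
```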
### 3. Cooldown Window
Avoid frequent re-generation:
```java
// Layer 3: skip regeneration while the last insight is still fresh.
// lastGeneratedAt is the timestamp of the user's most recent insight.
Duration cooldown = Duration.ofMinutes(30);
Duration elapsed = Duration.between(lastGeneratedAt, Instant.now());
if (elapsed.compareTo(cooldown) < 0) {
    return getLatestOrFallback(userId, today);
}
```
This prevents unnecessary repeated calls.
### 4. Per-User Daily Cap
```java
// Layer 4: per-user budget, at most 10 generated insights per day.
if (todayCount >= 10) {
    return getLatestOrFallback(userId, today);
}
```
Even active users shouldn’t trigger unlimited AI calls.
### 5. Global AI Guard
```java
// Layer 5: hard system-wide ceiling. dailyAiCalls is a counter reset each day
// (e.g. an AtomicInteger); once 50 calls are used, everything falls back.
if (dailyAiCalls.get() >= 50) {
    useFallback = true;
}
```
This acts as a system-wide circuit breaker.
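Putting the five layers together, the request path reads roughly like the sketch below. It is a condensed illustration under the same assumptions as the snippets above; `withinCooldown`, `dailyCountFor`, and the repository/client names are hypothetical helpers, not the project's actual code:

```java
// Condensed decision pipeline: cheapest checks first, the AI call last.
public Insight getTodayInsight(String userId, ActivityContext context) {
    LocalDate today = LocalDate.now();

    // 1. Activity gate: no activity, no AI.
    if (!activityService.hasActivityToday(userId, context)) {
        return getLatestOrFallback(userId, today);
    }

    // 2-4. Event trigger, cooldown, and per-user daily cap.
    Insight last = insightRepository.findLatestByUserId(userId).orElse(null);
    if (!hasMeaningfulChange(activityService.snapshot(userId), last)
            || withinCooldown(last)
            || dailyCountFor(userId, today) >= dailyCapPerUser) {
        return getLatestOrFallback(userId, today);
    }

    // 5. Global guard: once the system-wide budget is spent, use the deterministic fallback.
    Insight insight;
    if (dailyAiCalls.get() >= maxAiCallsPerDay) {
        insight = Insight.deterministicFallback(userId, today);
    } else {
        dailyAiCalls.incrementAndGet();
        insight = aiClient.generateInsight(userId, context);
    }

    insightRepository.save(insight);
    return insight;
}
```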
## Configuration
All thresholds are configurable:
```yaml
insight:
  activity-delta: 30
  cooldown-minutes: 30
  daily-cap-per-user: 10
  max-ai-calls-per-day: 50
  freshness-window-hours: 8
```
This allows tuning without redeploying code.
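If the backend is Spring Boot (an assumption on my part; the post never names the framework, though the Java snippets and YAML layout suggest it), the block above can be bound to a typed properties class. A sketch with an illustrative class name:

```java
import org.springframework.boot.context.properties.ConfigurationProperties;

// Binds the insight: block from application.yml so thresholds can be tuned
// without touching code. The class and field names are illustrative.
@ConfigurationProperties(prefix = "insight")
public record InsightProperties(
        int activityDelta,
        int cooldownMinutes,
        int dailyCapPerUser,
        int maxAiCallsPerDay,
        int freshnessWindowHours
) {}
```

Spring Boot's relaxed binding maps the kebab-case keys to these camelCase components; the record just needs to be registered, for example via `@ConfigurationPropertiesScan`.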
## What Changed
After this redesign:
- AI calls dropped from ~100/day → ~5–10/day
- Rate limit errors disappeared
- Most requests became fast database reads
- Free-tier usage became sustainable
- System behavior became more predictable
## Engineering Takeaway
AI should be the exception, not the rule.
A well-designed backend should first decide:
“Is this request even worth sending to the model?”
That decision layer — gating, triggers, cooldowns — is where the real engineering happens.
## Final Thought
If most requests can be handled using deterministic logic or cached state:
Do that first.
Use AI only when it actually adds value.
That single shift can make your system:
- cheaper
- faster
- more reliable
—and much easier to scale.
## Blog link
https://anupamkushwaha.me/blog/stopped-calling-ai-on-every-request
## Top comments

**Comment:** Caching and smart routing are underrated. Most people optimize for output quality and ignore call volume entirely until the bill arrives. What was the biggest single change that drove the most reduction?

**Reply:** Honestly, the biggest change was moving from request-driven AI to event-driven AI.

Earlier, every request tried to generate a fresh response, even if nothing meaningful had changed. Once I added trigger checks like:

- did user behavior actually change?
- has enough time passed?
- is the user even active?

…the majority of requests stopped reaching the AI layer entirely.

That one shift reduced most of the unnecessary calls before caching even became important.