Anupam Kushwaha

How I cut AI calls by 95% without losing quality?

The Hidden Cost of Calling AI Too Early

I stopped calling AI on every request — and everything got better.


The Problem

In one of my projects, I was generating AI-based insights from user activity.

The initial design was simple:

Every request for today’s insight → call the AI model → return a fresh response.

GET /api/insights/today

At first, this felt clean and correct.

But in practice, it created serious problems:

  • 429 rate limit errors within hours
  • Daily quota exhausted before noon
  • Random failures affecting users
  • Costs scaling linearly with traffic

The system was working — but it wasn’t sustainable.


The Real Issue

The problem wasn’t the AI provider.

It was the trigger model.

The system never asked basic questions before making an expensive call:

  • Has anything actually changed?
  • Did I already generate a response recently?
  • Is the user even active today?

Without these checks, every request was treated as:

“Generate a new insight now.”

That assumption was the real bug.


The New Approach

Instead of adding caching on top, I redesigned the system into an event-driven pipeline.

AI became the last step, not the default.


System Flow

Here’s the simplified request flow:

flowchart TD
    A[Request for today's insight] --> B{Activity today?}
    B -- No --> C[Reuse latest insight or fallback]
    B -- Yes --> D{Meaningful change?}
    D -- No --> C
    D -- Yes --> E{Cooldown passed?}
    E -- No --> C
    E -- Yes --> F{Daily cap reached?}
    F -- Yes --> C
    F -- No --> G{Global AI limit reached?}
    G -- Yes --> H[Use deterministic fallback]
    G -- No --> I[Call AI model]
    I --> J[Persist insight]
    H --> J
    C --> J

Most requests now end at a simple database read — not an AI call.
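The decision chain in the flowchart can be sketched as a single guard method. This is a minimal illustration, not the original code: `InsightGate`, `Decision`, and the parameter names are all hypothetical, and the thresholds come from the configuration shown later in the post.

```java
import java.time.Duration;

// Hypothetical sketch of the decision chain from the flowchart above.
// All names here are illustrative; thresholds match the post's config.
public class InsightGate {
    enum Decision { REUSE, FALLBACK, CALL_AI }

    static Decision decide(boolean activityToday,
                           boolean meaningfulChange,
                           Duration sinceLastGeneration,
                           int userCallsToday,
                           int globalCallsToday) {
        if (!activityToday)    return Decision.REUSE;         // 1. activity gate
        if (!meaningfulChange) return Decision.REUSE;         // 2. event trigger
        if (sinceLastGeneration.compareTo(Duration.ofMinutes(30)) < 0)
            return Decision.REUSE;                            // 3. cooldown window
        if (userCallsToday >= 10)   return Decision.REUSE;    // 4. per-user daily cap
        if (globalCallsToday >= 50) return Decision.FALLBACK; // 5. global guard
        return Decision.CALL_AI;                              // AI is the last step
    }
}
```

Every branch except the last one resolves without touching the model, which is why most requests end at a database read.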



The Five-Layer Redesign

1. Activity Gate

Start with the cheapest check:

boolean hasActivity = activityService.hasActivityToday(userId, context);

if (!hasActivity) {
    return getLatestOrFallback(userId, today);
}

If nothing happened → don’t call AI.


2. Event-Driven Triggers

AI should only run when something meaningful changes.

Examples:

  • user updates intent
  • significant behavior change
  • threshold crossed

No change → reuse previous insight.
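One way such a trigger check could look, assuming user activity is summarized as a numeric score and the `activity-delta: 30` setting from the config means "regenerate only when the score has moved by at least 30" (that interpretation is an assumption; the post does not define the metric):

```java
// Hypothetical trigger check: assumes activity is summarized as a numeric
// score and "activity-delta" is the minimum movement worth regenerating for.
public class ChangeTrigger {
    static boolean isMeaningfulChange(int lastScore, int currentScore, int delta) {
        return Math.abs(currentScore - lastScore) >= delta;
    }
}
```

Anything below the threshold is treated as noise, and the previous insight is reused.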


3. Cooldown Window

Avoid frequent re-generation:

Duration cooldown = Duration.ofMinutes(30);
// lastGeneratedAt: timestamp of this user's most recent insight
Duration elapsed = Duration.between(lastGeneratedAt, Instant.now());

if (elapsed.compareTo(cooldown) < 0) {
    return getLatestOrFallback(userId, today);
}

This prevents unnecessary repeated calls.


4. Per-User Daily Cap

// todayCount: insights already generated for this user today
if (todayCount >= 10) {
    return getLatestOrFallback(userId, today);
}

Even active users shouldn’t trigger unlimited AI calls.
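The post doesn't show how `todayCount` is tracked; a minimal in-memory sketch (a real service would likely persist this in the database or Redis and reset it at midnight) could look like:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory counter for the per-user daily cap.
// A production service would persist this and reset it each day.
public class UserDailyCap {
    private final Map<String, Integer> counts = new ConcurrentHashMap<>();
    private final int cap;

    UserDailyCap(int cap) { this.cap = cap; }

    // Records an attempt and returns true only while the user is under the cap.
    boolean tryAcquire(String userId) {
        return counts.merge(userId, 1, Integer::sum) <= cap;
    }
}
```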


5. Global AI Guard

// dailyAiCalls: a shared AtomicInteger counting system-wide calls today
if (dailyAiCalls.get() >= 50) {
    useFallback = true;
}

This acts as a system-wide circuit breaker.
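A race-free sketch of that breaker (illustrative names; the post doesn't show how the counter is reset): checking with `get()` and incrementing separately can over-admit under concurrency, whereas `incrementAndGet` makes the check-and-record atomic.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical global circuit breaker: once the system has made the
// configured number of AI calls today, everyone gets the fallback.
public class GlobalAiGuard {
    private final AtomicInteger dailyAiCalls = new AtomicInteger();
    private final int maxPerDay;

    GlobalAiGuard(int maxPerDay) { this.maxPerDay = maxPerDay; }

    // True if this call may reach the model; false means use the fallback.
    // incrementAndGet records and checks in one atomic step.
    boolean tryAcquire() {
        return dailyAiCalls.incrementAndGet() <= maxPerDay;
    }

    void resetForNewDay() { dailyAiCalls.set(0); }
}
```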


Configuration

All thresholds are configurable:

insight:
  activity-delta: 30
  cooldown-minutes: 30
  daily-cap-per-user: 10
  max-ai-calls-per-day: 50
  freshness-window-hours: 8

This allows tuning without redeploying code.
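As a sketch, those thresholds could live in an immutable holder like the record below. The name and shape are assumptions; in a Spring Boot service (the post doesn't name its framework) the YAML block would typically be bound to a class like this via `@ConfigurationProperties`.

```java
// Hypothetical holder for the thresholds above, with the post's defaults.
public record InsightConfig(int activityDelta,
                            int cooldownMinutes,
                            int dailyCapPerUser,
                            int maxAiCallsPerDay,
                            int freshnessWindowHours) {
    static InsightConfig defaults() {
        return new InsightConfig(30, 30, 10, 50, 8);
    }
}
```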


What Changed

After this redesign:

  • AI calls dropped from ~100/day → ~5–10/day
  • Rate limit errors disappeared
  • Most requests became fast database reads
  • Free-tier usage became sustainable
  • System behavior became more predictable

Engineering Takeaway

AI should be the exception, not the rule.

A well-designed backend should first decide:

“Is this request even worth sending to the model?”

That decision layer — gating, triggers, cooldowns — is where the real engineering happens.


Final Thought

If most requests can be handled using deterministic logic or cached state:

Do that first.

Use AI only when it actually adds value.

That single shift can make your system:

  • cheaper
  • faster
  • more reliable

—and much easier to scale.

Blog link: https://anupamkushwaha.me/blog/stopped-calling-ai-on-every-request

Top comments (2)

Keynition:

Caching and smart routing are underrated. Most people optimize for output quality and ignore call volume entirely until the bill arrives. What was the biggest single change that drove the most reduction?

Anupam Kushwaha (author):

Honestly, the biggest change was moving from request-driven AI to event-driven AI.
Earlier, every request tried to generate a fresh response, even if nothing meaningful had changed. Once I added trigger checks like:

  • did user behavior actually change?
  • has enough time passed?
  • is the user even active?

…the majority of requests stopped reaching the AI layer entirely.
That one shift reduced most of the unnecessary calls before caching even became important.