Anupam Kushwaha

How I cut AI calls by 95% without losing quality?

The Hidden Cost of Calling AI Too Early

I stopped calling AI on every request — and everything got better.


The Problem

In one of my projects, I was generating AI-based insights from user activity.

The initial design was simple:

Every request for today’s insight → call the AI model → return a fresh response.

GET /api/insights/today

At first, this felt clean and correct.

But in practice, it created serious problems:

  • 429 rate limit errors within hours
  • Daily quota exhausted before noon
  • Random failures affecting users
  • Costs scaling linearly with traffic

The system was working — but it wasn’t sustainable.


The Real Issue

The problem wasn’t the AI provider.

It was the trigger model.

The system never asked basic questions before making an expensive call:

  • Has anything actually changed?
  • Did I already generate a response recently?
  • Is the user even active today?

Without these checks, every request was treated as:

“Generate a new insight now.”

That assumption was the real bug.


The New Approach

Instead of adding caching on top, I redesigned the system into an event-driven pipeline.

AI became the last step, not the default.


System Flow

Here’s the simplified request flow:

flowchart TD
    A[Request for today's insight] --> B{Activity today?}
    B -- No --> C[Reuse latest insight or fallback]
    B -- Yes --> D{Meaningful change?}
    D -- No --> C
    D -- Yes --> E{Cooldown passed?}
    E -- No --> C
    E -- Yes --> F{Daily cap reached?}
    F -- Yes --> C
    F -- No --> G{Global AI limit reached?}
    G -- Yes --> H[Use deterministic fallback]
    G -- No --> I[Call AI model]
    I --> J[Persist insight]
    H --> J
    C --> J

Most requests now end at a simple database read — not an AI call.
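The decision chain in the flowchart can be sketched as a single guard method. This is a minimal illustration, not the original code: `InsightGate`, `Decision`, and the parameter names are all hypothetical, and the thresholds come from the configuration shown later in the post.

```java
import java.time.Duration;

// Hypothetical sketch of the decision chain from the flowchart above.
// All names here are illustrative; thresholds match the post's config.
public class InsightGate {
    enum Decision { REUSE, FALLBACK, CALL_AI }

    static Decision decide(boolean activityToday,
                           boolean meaningfulChange,
                           Duration sinceLastGeneration,
                           int userCallsToday,
                           int globalCallsToday) {
        if (!activityToday)    return Decision.REUSE;         // 1. activity gate
        if (!meaningfulChange) return Decision.REUSE;         // 2. event trigger
        if (sinceLastGeneration.compareTo(Duration.ofMinutes(30)) < 0)
            return Decision.REUSE;                            // 3. cooldown window
        if (userCallsToday >= 10)   return Decision.REUSE;    // 4. per-user daily cap
        if (globalCallsToday >= 50) return Decision.FALLBACK; // 5. global guard
        return Decision.CALL_AI;                              // AI is the last step
    }
}
```

Every branch except the last one resolves without touching the model, which is why most requests end at a database read.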



The Five-Layer Redesign

1. Activity Gate

Start with the cheapest check:

boolean hasActivity = activityService.hasActivityToday(userId, context);

if (!hasActivity) {
    return getLatestOrFallback(userId, today);
}

If nothing happened → don’t call AI.


2. Event-Driven Triggers

AI should only run when something meaningful changes.

Examples:

  • user updates intent
  • significant behavior change
  • threshold crossed

No change → reuse previous insight.
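One way such a trigger check could look, assuming user activity is summarized as a numeric score and the `activity-delta: 30` setting from the config means "regenerate only when the score has moved by at least 30" (that interpretation is an assumption; the post does not define the metric):

```java
// Hypothetical trigger check: assumes activity is summarized as a numeric
// score and "activity-delta" is the minimum movement worth regenerating for.
public class ChangeTrigger {
    static boolean isMeaningfulChange(int lastScore, int currentScore, int delta) {
        return Math.abs(currentScore - lastScore) >= delta;
    }
}
```

Anything below the threshold is treated as noise, and the previous insight is reused.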


3. Cooldown Window

Avoid frequent re-generation:

Duration cooldown = Duration.ofMinutes(30);
// lastGeneratedAt: timestamp of this user's most recent insight
Duration elapsed = Duration.between(lastGeneratedAt, Instant.now());

if (elapsed.compareTo(cooldown) < 0) {
    return getLatestOrFallback(userId, today);
}

This prevents unnecessary repeated calls.


4. Per-User Daily Cap

// todayCount: insights already generated for this user today
if (todayCount >= 10) {
    return getLatestOrFallback(userId, today);
}

Even active users shouldn’t trigger unlimited AI calls.
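The post doesn't show how `todayCount` is tracked; a minimal in-memory sketch (a real service would likely persist this in the database or Redis and reset it at midnight) could look like:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory counter for the per-user daily cap.
// A production service would persist this and reset it each day.
public class UserDailyCap {
    private final Map<String, Integer> counts = new ConcurrentHashMap<>();
    private final int cap;

    UserDailyCap(int cap) { this.cap = cap; }

    // Records an attempt and returns true only while the user is under the cap.
    boolean tryAcquire(String userId) {
        return counts.merge(userId, 1, Integer::sum) <= cap;
    }
}
```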


5. Global AI Guard

// dailyAiCalls: a shared AtomicInteger counting system-wide calls today
if (dailyAiCalls.get() >= 50) {
    useFallback = true;
}

This acts as a system-wide circuit breaker.
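A race-free sketch of that breaker (illustrative names; the post doesn't show how the counter is reset): checking with `get()` and incrementing separately can over-admit under concurrency, whereas `incrementAndGet` makes the check-and-record atomic.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical global circuit breaker: once the system has made the
// configured number of AI calls today, everyone gets the fallback.
public class GlobalAiGuard {
    private final AtomicInteger dailyAiCalls = new AtomicInteger();
    private final int maxPerDay;

    GlobalAiGuard(int maxPerDay) { this.maxPerDay = maxPerDay; }

    // True if this call may reach the model; false means use the fallback.
    // incrementAndGet records and checks in one atomic step.
    boolean tryAcquire() {
        return dailyAiCalls.incrementAndGet() <= maxPerDay;
    }

    void resetForNewDay() { dailyAiCalls.set(0); }
}
```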


Configuration

All thresholds are configurable:

insight:
  activity-delta: 30
  cooldown-minutes: 30
  daily-cap-per-user: 10
  max-ai-calls-per-day: 50
  freshness-window-hours: 8

This allows tuning without redeploying code.
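As a sketch, those thresholds could live in an immutable holder like the record below. The name and shape are assumptions; in a Spring Boot service (the post doesn't name its framework) the YAML block would typically be bound to a class like this via `@ConfigurationProperties`.

```java
// Hypothetical holder for the thresholds above, with the post's defaults.
public record InsightConfig(int activityDelta,
                            int cooldownMinutes,
                            int dailyCapPerUser,
                            int maxAiCallsPerDay,
                            int freshnessWindowHours) {
    static InsightConfig defaults() {
        return new InsightConfig(30, 30, 10, 50, 8);
    }
}
```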


What Changed

After this redesign:

  • AI calls dropped from ~100/day → ~5–10/day
  • Rate limit errors disappeared
  • Most requests became fast database reads
  • Free-tier usage became sustainable
  • System behavior became more predictable

Engineering Takeaway

AI should be the exception, not the rule.

A well-designed backend should first decide:

“Is this request even worth sending to the model?”

That decision layer — gating, triggers, cooldowns — is where the real engineering happens.


Final Thought

If most requests can be handled using deterministic logic or cached state:

Do that first.

Use AI only when it actually adds value.

That single shift can make your system:

  • cheaper
  • faster
  • more reliable

—and much easier to scale.

Blog link: https://anupamkushwaha.me/blog/stopped-calling-ai-on-every-request

Top comments (2)

Keynition:

Caching and smart routing are underrated. Most people optimize for output quality and ignore call volume entirely until the bill arrives. What was the biggest single change that drove the most reduction?

Anupam Kushwaha (author):

Honestly, the biggest change was moving from request-driven AI to event-driven AI.
Earlier, every request tried to generate a fresh response, even if nothing meaningful had changed. Once I added trigger checks like:

  • did user behavior actually change?
  • has enough time passed?
  • is the user even active?

…the majority of requests stopped reaching the AI layer entirely.
That one shift reduced most of the unnecessary calls before caching even became important.