Sreeraj Sreenivasan
The Resilience & Observability Stack

Building Production-Ready FastAPI in 2026

Your API works. But is it production-ready?


Why This Matters

In 2026, "Contract-First" development means more than an OpenAPI spec. It means three implicit promises to every consumer:

  • Errors are predictable — every failure returns a structured, documented payload
  • Health is visible — logs, metrics, and traces tell a coherent story
  • The system self-heals — transient failures retry; abuse gets throttled

This article covers the two pillars that deliver on those promises:

  • Observability: Structlog + Prometheus + Sentry + Rich
  • Resilience: Tenacity + SlowAPI

Pillar 1: Enterprise Observability

Structlog — Centralized JSON Logging

Cloud environments need machine-readable logs. Structlog gives you JSON in production and human-friendly output locally — toggled by a single env var.

# app/core/logging.py
import structlog

from app.core.config import settings  # assumes a standard settings module exposing ENVIRONMENT

def configure_logging() -> None:
    # JSON in production (machine-readable); coloured console output locally
    renderer = (
        structlog.processors.JSONRenderer()
        if settings.ENVIRONMENT == "production"
        else structlog.dev.ConsoleRenderer(colors=True)
    )
    processors = [
        structlog.contextvars.merge_contextvars,
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        renderer,
    ]
    structlog.configure(processors=processors, cache_logger_on_first_use=True)

In production, every log entry is a clean, indexable JSON object. In development, it's colourised and human-readable — no config changes required.
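To make "clean, indexable JSON object" concrete, here is a dependency-free sketch of the shape such an entry takes. The helper function is hypothetical; the field names (`event`, `level`, `timestamp`) follow structlog's default conventions:

```python
import json
from datetime import datetime, timezone

def render_json_log(event: str, level: str = "info", **fields) -> str:
    """Sketch of the shape a JSON-rendered structlog entry takes."""
    entry = {
        "event": event,
        "level": level,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **fields,
    }
    return json.dumps(entry)

# Every key becomes a queryable field in your log aggregator
print(render_json_log("user_created", user_id=42, request_id="abc-123"))
```

Because each key is a top-level JSON field, an aggregator can filter on `user_id` or `request_id` directly instead of regex-matching a message string.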


Rich — Beautiful Local Console Output

Install Rich tracebacks globally and your terminal shows full variable state at every frame of an exception — invaluable for debugging async SQLAlchemy sessions.

from rich.traceback import install as install_rich_traceback
install_rich_traceback(show_locals=True, width=120)

Instead of a wall of text, you get colour-coded output with file references and local variable values at the exact line that failed.


Prometheus — Metrics in Two Lines

from prometheus_fastapi_instrumentator import Instrumentator

Instrumentator().instrument(app).expose(app, endpoint="/metrics")

You get request counts, latency histograms, and in-flight connections out of the box — ready for Grafana dashboards and SLO alerting.
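To ground "latency histograms": Prometheus histograms store cumulative counts per `le` (less-or-equal) bucket bound, which is what makes SLO queries cheap. A stdlib-only sketch, with bucket bounds and observations invented for illustration:

```python
# Prometheus histograms keep a cumulative count per "le" (less-or-equal) bound
buckets = [0.1, 0.25, 0.5, 1.0, float("inf")]
observations = [0.07, 0.2, 0.3, 0.8, 2.1]  # request latencies in seconds

cumulative = {le: sum(1 for obs in observations if obs <= le) for le in buckets}

# An SLO check: what fraction of requests finished within 0.5s?
slo_ratio = cumulative[0.5] / len(observations)
print(cumulative[0.5], slo_ratio)  # 3 requests under 0.5s -> ratio 0.6
```

This is exactly the arithmetic a Grafana SLO panel performs against the `/metrics` endpoint the instrumentator exposes.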


Sentry — Error Tracking That Talks to Your Frontend

The critical constraint most templates miss: Sentry by default drops HTTPException.detail — the exact string your React frontend reads to show users a meaningful message like "User already exists".

Fix it with a before_send hook:

import sentry_sdk
from fastapi import HTTPException

def before_send(event: dict, hint: dict) -> dict | None:
    exc_info = hint.get("exc_info")
    if exc_info:
        _, exc_value, _ = exc_info
        if isinstance(exc_value, HTTPException):
            event.setdefault("extra", {})
            event["extra"]["http_exception_detail"] = exc_value.detail
            event["extra"]["status_code"] = exc_value.status_code
            event.setdefault("tags", {})["http_status"] = str(exc_value.status_code)
    return event

sentry_sdk.init(dsn=settings.SENTRY_DSN, before_send=before_send, ...)

Now every 409 Conflict in your Sentry dashboard shows exactly what the user saw. Filter by http_status:409 across your entire project instantly.


Pillar 2: Application Resilience

Tenacity — Retries for Transient Failures

Kubernetes rolling deploys, Aurora cold starts, and flaky network hops all introduce brief connectivity gaps. Without retries, those gaps become 500 errors. With Tenacity, they're invisible.

from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from sqlalchemy.exc import OperationalError, DisconnectionError

db_retry = retry(
    retry=retry_if_exception_type((OperationalError, DisconnectionError)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=0.5, max=4),
    reraise=True,  # Still raises after exhaustion — Sentry catches it with full context
)

Apply it as a decorator on any database or external HTTP call:

@db_retry
async def get_by_email(self, db: AsyncSession, email: str) -> User | None:
    result = await db.execute(select(User).where(User.email == email))
    return result.scalar_one_or_none()

reraise=True ensures exhausted retries still propagate normally through your exception handlers, keeping Structlog and Sentry integration intact.
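The retry-then-reraise semantics can be sketched without Tenacity itself. The helper name and the `ConnectionError` trigger below are illustrative, not Tenacity's API:

```python
import time

def retry_transient(fn, attempts: int = 3, base: float = 0.5, cap: float = 4.0):
    """Sketch of Tenacity-style behaviour: exponential backoff, reraise on exhaustion."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts:
                raise  # the reraise=True equivalent: the final error propagates
            time.sleep(min(cap, base * 2 ** (attempt - 1)))

calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network hop")
    return "ok"

print(retry_transient(flaky))  # first two calls fail, third succeeds: prints ok
```

The caller never sees the first two failures; only a third consecutive failure would surface, carrying the original traceback for Sentry.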


SlowAPI — Rate Limiting That Respects Your OpenAPI Contract

The subtle problem with naive rate limiting: the 429 response becomes an undocumented payload that breaks your auto-generated frontend client.

The fix is a custom handler that returns {"detail": "..."} — identical to every other FastAPI error:

from fastapi import Request
from fastapi.responses import JSONResponse
from slowapi.errors import RateLimitExceeded

async def rate_limit_handler(request: Request, exc: RateLimitExceeded) -> JSONResponse:
    return JSONResponse(
        status_code=429,
        content={"detail": f"Rate limit exceeded: {exc.detail}. Please slow down."},
        headers={"Retry-After": "60"},
    )

app.add_exception_handler(RateLimitExceeded, rate_limit_handler)

Apply per-route limits based on risk:

@router.post("/auth/login")
@limiter.limit("10/minute")   # Strict — brute-force bait
async def login(request: Request, payload: LoginRequest): ...

@router.post("/users/")
@limiter.limit("5/minute")    # Tight — prevent account creation spam
async def create_user(request: Request, payload: UserCreate): ...

The Frontend Connection

Every tool above converges on one payoff: your React client always reads the same key.

Scenario              Status   Response
User already exists   409      {"detail": "User already exists"}
Rate limit hit        429      {"detail": "Rate limit exceeded..."}
Validation failure    422      {"detail": [...Pydantic errors]}
Server error          500      {"detail": "Internal server error"}

One Axios interceptor handles all of them:

client.interceptors.response.use(
  (res) => res,
  (error) => {
    const message = error.response?.data?.detail ?? "An unexpected error occurred.";
    toast.error(typeof message === "string" ? message : JSON.stringify(message));
    return Promise.reject(error);
  }
);

No special-casing. No silent failures. One contract, end to end.


Conclusion

The gap between a demo API and a production API isn't features — it's operational maturity.

Tool                       What it solves
Structlog                  Structured logs for cloud aggregators
Rich                       Developer-friendly local debugging
Prometheus                 Latency metrics and SLO visibility
Sentry + before_send       Error tracking with frontend-aware payloads
Tenacity                   Silent recovery from transient failures
SlowAPI + custom handler   Rate limiting that honours your OpenAPI contract

Together, these six tools ensure your FastAPI app behaves with consistency and transparency — whether on a single VPS, a Kubernetes cluster, or a globally distributed edge network.

Ship with confidence.


Tags: fastapi python observability sentry prometheus structlog tenacity slowapi backend
