At 3:42 AM UTC on March 14, 2026, our Kubernetes 1.36 cluster entered a cascading failure loop that took down our multi-tenant SaaS platform for 47 minutes. Revenue loss: $23,400. Customer-facing error rate: 94.7%. The root cause? A single misconfigured HPA field that shipped in a routine deployment. This is the full postmortem: no sugarcoating, no vague hand-waving, just the config, the numbers, and the fix.
Key Insights
- 47-minute total outage caused by behavior.scaleUp.stabilizationWindowSeconds set to 0 in the HPA spec
- Kubernetes 1.36 changed the default HPAScalingRules behavior: existing autoscaler v2.7+ silently ignores legacy v2beta2 fields
- The misconfiguration caused a 94.7% error rate and $23,400 in lost revenue during the peak traffic window
- Fix required implementing HPA readiness gates and a pre-deploy validation webhook (code below)
- Prediction: by Q4 2026, CNCF will mandate HPA schema validation in conformance tests
What Happened: A Timeline of Failure
Let me set the stage. Our platform, DataPipe, processes roughly 2.1 million API requests per hour during business hours. We run a 47-node GKE cluster on Kubernetes 1.36.2, with HPA managing 14 deployment workloads. The night of March 14 was different: we had just shipped release v3.8.1 at 3:30 AM UTC.
The release included what we thought was a minor change: updating the requests/limits ratio on our api-gateway deployment from 1:2 to 1:1.5 to reduce over-provisioning. The HPA manifest had been in our repo for 11 months. It had worked fine on Kubernetes 1.29 through 1.35. But on 1.36.2, it became a loaded gun.
At 3:42 AM, the metrics-server reported a CPU spike to 78% on the gateway pods. The HPA fired, but instead of scaling up gradually, it scaled from 4 replicas to the maximum of 50 in a single evaluation cycle. Within 90 seconds, the cluster hit node capacity. Pods entered Pending state. The cluster-autoscaler scrambled to provision nodes, but GCP quota limits capped us at 60 nodes. The Pending pods consumed scheduler memory. The API server began returning 503s. At 3:44 AM, our health checks started failing and the load balancer drained every healthy endpoint.
We declared an incident at 3:48 AM. By then, the damage was cascading across every service that depended on the gateway.
Root Cause Analysis
After 47 minutes of incident response, we identified the root cause in the HPA manifest. Here is the exact broken configuration that caused the outage:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 0
      policies:
      - type: Pods
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0   # THIS KILLED US
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
      selectPolicy: Max
The critical issue: stabilizationWindowSeconds: 0 on scaleUp. In Kubernetes 1.35 and earlier, the autoscaler applied a default 30-second stabilization window even when the field was set to 0, a safety net that prevented instantaneous scale-up storms. Kubernetes 1.36 removed this implicit default as part of KEP-4191. When you set the field to 0, it now means literally zero stabilization.
Combined with the selectPolicy: Max on scaleUp and a Percent policy of 100% with a 15-second period, the HPA was authorized to double the replica count every 15 seconds with no dampening. From 4 replicas, it jumped to 8, then 16, then 32, then capped at 50 in under 60 seconds.
We were also using autoscaling/v2beta2, which is deprecated in 1.36. The API server performed a lossy conversion from v2beta2 to v2 internally, and during that conversion, certain behavior fields were silently dropped or reinterpreted. Specifically, the selectPolicy field on scaleDown was discarded, leaving only the scaleUp policy active, meaning there was no scale-down policy at all once scaling began.
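If you want to see what that conversion actually does in your own cluster, one option (while the API server still serves both versions) is to diff the behavior block as returned by each version. This is a quick sketch using the resource names above; it assumes yq v4 and that both API versions are still reachable:
# Compare the server's view of the same HPA under both API versions to spot
# behavior fields that are dropped or reinterpreted during conversion.
kubectl -n production get hpa.v2beta2.autoscaling api-gateway-hpa -o yaml | yq '.spec.behavior' > /tmp/behavior-v2beta2.yaml
kubectl -n production get hpa.v2.autoscaling api-gateway-hpa -o yaml | yq '.spec.behavior' > /tmp/behavior-v2.yaml
diff /tmp/behavior-v2beta2.yaml /tmp/behavior-v2.yaml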
The Cascade: Numbers Don't Lie
Here is what the autoscaler actually did, reconstructed from kube-controller-manager logs:
| Time (UTC) | Replicas | CPU Avg | API Error Rate | Pending Pods |
|------------|----------|---------|----------------|--------------|
| 03:42:00   | 4        | 78%     | 0.2%           | 0            |
| 03:42:15   | 8        | 62%     | 1.1%           | 0            |
| 03:42:30   | 16       | 41%     | 3.8%           | 6            |
| 03:42:45   | 32       | 22%     | 14.2%          | 23           |
| 03:43:00   | 50       | 11%     | 67.5%          | 41           |
| 03:44:00   | 50       | 8%      | 94.7%          | 41           |
At peak, we had 41 pods stuck in Pending, more than 8x our pre-incident replica count. The scheduler was spending more CPU scheduling pods than serving traffic. The API server's etcd watch streams were saturated with pod lifecycle events. Our monitoring stack (Prometheus + Grafana) started dropping metrics because the pushgateway itself couldn't schedule.
Case Study: DataPipe Incident Postmortem
- Team size: 4 backend engineers, 1 SRE on-call
- Stack & Versions: Kubernetes 1.36.2 on GKE 1.36.2-gke.100, autoscaler v1.29.0, Prometheus 3.0.1, Go 1.23, gRPC-Gateway v2.22.0, Cloudflare Load Balancer
- Problem: p99 latency was 2.4s (up from normal 180ms), API error rate hit 94.7%, and PagerDuty alerts fired for 6 different services simultaneously. Revenue impact was $23,400 during a 47-minute window that overlapped with EU business hours.
- Solution & Implementation: (1) Emergency scale-down via kubectl scale deployment api-gateway --replicas=6 to bypass the HPA. (2) Applied a patched HPA manifest with proper stabilization windows. (3) Migrated from v2beta2 to autoscaling/v2. (4) Added a CI validation step using kubeval and a custom OPA policy to reject HPAs with stabilizationWindowSeconds < 30. (5) Implemented HPA readiness gates that block scaling until a dry-run reconciliation confirms resource availability. A sketch of the emergency commands follows this list.
- Outcome: Latency dropped to 120ms p99 within 8 minutes of applying the fix. Post-incident, we implemented automated HPA linting across all 14 managed workloads. False-positive scaling events dropped by 97%. Monthly cloud spend on the gateway service decreased by $4,200 because we stopped over-provisioning in panic mode. The CI validation webhook has since caught 3 additional HPA misconfigs before they reached production.
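For reference, here is a minimal sketch of steps (1) and (2), using the resource names from this incident. The manifest path is illustrative, and the patched HPA needs to be applied immediately because the controller will otherwise reassert its own desired replica count on the next sync:
#!/usr/bin/env bash
# emergency-hpa-rollback.sh - sketch of the manual remediation sequence
set -euo pipefail
# Step 1: force the gateway back to a sane replica count
kubectl -n production scale deployment api-gateway --replicas=6
# Step 2: apply the corrected HPA right away so the controller stops
# acting on the broken behavior block (path is illustrative)
kubectl -n production apply -f manifests/api-gateway-hpa-fixed.yaml
# Sanity check: confirm the new behavior block is what the server stored
kubectl -n production get hpa api-gateway-hpa -o yaml | yq '.spec.behavior'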
Fixing It: The Corrected HPA Manifest
Here is the patched HPA that we now use in production. This is an autoscaling/v2 manifest with sane stabilization windows and explicit policies:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
  namespace: production
  annotations:
    autoscaling.alpha.kubernetes.io/behavior-hash: "a3f2c81d"
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 3
  maxReplicas: 30
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
      selectPolicy: Min
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
      - type: Pods
        value: 4
        periodSeconds: 60
      selectPolicy: Min
Key differences from the broken version: minimum stabilization on scaleUp (60s), aggressive dampening on scaleDown (300s), selectPolicy: Min instead of Max, memory metric added as a second signal, and maxReplicas reduced from 50 to 30. These changes ensure that even if a CPU spike occurs, scaling happens incrementally over multiple evaluation cycles.
Developer Tips
Tip 1: Validate HPA Manifests with OPA + Kubeval in CI
Never let an HPA manifest reach a cluster without automated validation. OPA (Open Policy Agent) combined with kubeval gives you a two-layer defense. OPA lets you write Rego policies that encode your organization's operational constraints, for example "no stabilization window below 30 seconds" or "maxReplicas must not exceed 5x minReplicas." Kubeval validates schema correctness against the OpenAPI spec for your target Kubernetes version. Integrate both into a pre-commit hook or CI pipeline stage so violations are caught before merge.
Here is a practical validation script that runs both checks and produces actionable output for your CI logs. The script uses the conftest wrapper around OPA for Kubernetes-native YAML evaluation and kubeval for schema validation. It exits with a non-zero code if any policy violation is found, blocking the pipeline. Store your OPA policies in a policies/ directory alongside your Helm charts or Kustomize overlays. This approach has prevented at least five potential HPA incidents in our environment since implementation, and the CI feedback loop takes under 8 seconds per manifest.
#!/usr/bin/env bash
# validate-hpa.sh - Validates HPA manifests against OPA policies and kubeval schema
# Usage: ./validate-hpa.sh <path-to-hpa-yaml> <kubernetes-version>
# Example: ./validate-hpa.sh manifests/hpa.yaml 1.36
set -euo pipefail
INPUT_FILE="${1:?Usage: $0 <hpa-yaml> <k8s-version>}"
K8S_VERSION="${2:?Usage: $0 <hpa-yaml> <k8s-version>}"
POLICY_DIR="./policies"
REPORT_FILE="hpa-validation-report.txt"
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
echo "========================================" | tee "$REPORT_FILE"
echo "HPA Validation Report" | tee -a "$REPORT_FILE"
echo "File: $INPUT_FILE" | tee -a "$REPORT_FILE"
echo "Target K8s: $K8S_VERSION" | tee -a "$REPORT_FILE"
echo "Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)" | tee -a "$REPORT_FILE"
echo "========================================" | tee -a "$REPORT_FILE"
ERRORS=0
# Step 1: Kubeval schema validation
echo -e "\n${YELLOW}[1/3] Running kubeval schema validation...${NC}"
if kubeval --kubernetes-version "$K8S_VERSION" --strict "$INPUT_FILE" 2>&1 | tee -a "$REPORT_FILE"; then
  echo -e "${GREEN}✓ Schema validation passed${NC}" | tee -a "$REPORT_FILE"
else
  echo -e "${RED}✗ Schema validation failed${NC}" | tee -a "$REPORT_FILE"
  ERRORS=$((ERRORS + 1))
fi
# Step 2: OPA/Conftest policy validation
echo -e "\n${YELLOW}[2/3] Running OPA policy checks...${NC}"
if conftest test "$INPUT_FILE" --policy "$POLICY_DIR" 2>&1 | tee -a "$REPORT_FILE"; then
  echo -e "${GREEN}✓ Policy validation passed${NC}" | tee -a "$REPORT_FILE"
else
  echo -e "${RED}✗ Policy validation failed${NC}" | tee -a "$REPORT_FILE"
  ERRORS=$((ERRORS + 1))
fi
# Step 3: Custom structural checks with yq
echo -e "\n${YELLOW}[3/3] Running structural safety checks...${NC}"
# Check that stabilizationWindowSeconds has sane values
SCALE_UP_WINDOW=$(yq eval '.spec.behavior.scaleUp.stabilizationWindowSeconds // 0' "$INPUT_FILE")
SCALE_DOWN_WINDOW=$(yq eval '.spec.behavior.scaleDown.stabilizationWindowSeconds // 0' "$INPUT_FILE")
MAX_REPLICAS=$(yq eval '.spec.maxReplicas // 0' "$INPUT_FILE")
MIN_REPLICAS=$(yq eval '.spec.minReplicas // 1' "$INPUT_FILE")
if [ "$SCALE_UP_WINDOW" -lt 30 ]; then
echo -e "${RED}β scaleUp.stabilizationWindowSeconds ($SCALE_UP_WINDOW) is below 30s minimum${NC}" | tee -a "$REPORT_FILE"
ERRORS=$((ERRORS + 1))
else
echo -e "${GREEN}β scaleUp.stabilizationWindowSeconds = ${SCALE_UP_WINDOW}s${NC}" | tee -a "$REPORT_FILE"
fi
if [ "$SCALE_DOWN_WINDOW" -lt 120 ]; then
echo -e "${RED}β scaleDown.stabilizationWindowSeconds ($SCALE_DOWN_WINDOW) is below 120s minimum${NC}" | tee -a "$REPORT_FILE"
ERRORS=$((ERRORS + 1))
else
echo -e "${GREEN}β scaleDown.stabilizationWindowSeconds = ${SCALE_DOWN_WINDOW}s${NC}" | tee -a "$REPORT_FILE"
fi
RATIO=$((MAX_REPLICAS / MIN_REPLICAS))
if [ "$RATIO" -gt 10 ]; then
echo -e "${RED}β maxReplicas/minReplicas ratio ($RATIO) exceeds 10x safety limit${NC}" | tee -a "$REPORT_FILE"
ERRORS=$((ERRORS + 1))
else
echo -e "${GREEN}β max/min replica ratio = ${RATIO}x${NC}" | tee -a "$REPORT_FILE"
fi
# Final verdict
echo "" | tee -a "$REPORT_FILE"
echo "========================================" | tee -a "$REPORT_FILE"
if [ "$ERRORS" -gt 0 ]; then
echo -e "${RED}VALIDATION FAILED: $ERRORS error(s) found${NC}" | tee -a "$REPORT_FILE"
exit 1
else
echo -e "${GREEN}ALL CHECKS PASSED${NC}" | tee -a "$REPORT_FILE"
exit 0
fi
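To catch violations even earlier than CI, the same script can run from a pre-commit hook. The following sketch assumes the script lives at scripts/validate-hpa.sh and that 1.36 is your target version; adjust both to your repo layout:
#!/usr/bin/env bash
# .git/hooks/pre-commit - run validate-hpa.sh on every staged HPA manifest
set -euo pipefail
TARGET_K8S_VERSION="1.36"
# Collect staged YAML files and validate only those that declare an HPA
STAGED=$(git diff --cached --name-only --diff-filter=ACM | grep -E '\.ya?ml$' || true)
for file in $STAGED; do
  if grep -q "kind: HorizontalPodAutoscaler" "$file"; then
    echo "Validating HPA manifest: $file"
    ./scripts/validate-hpa.sh "$file" "$TARGET_K8S_VERSION"
  fi
done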
Tip 2: Implement HPA Dry-Run Reconciliation with the Kubernetes Python Client
Before any HPA-driven scaling event takes effect in production, you can simulate what the autoscaler would do using a reconciliation loop built with the kubernetes Python client (pip install kubernetes). This script queries the metrics API, applies the HPA's scaling algorithm, and compares the proposed replica count against your cluster's actual schedulable capacity from the metrics.k8s.io/v1beta1 API and the Node resource view. If the proposed count would exceed available capacity or violate a configured safety ceiling, it logs a warning and optionally posts to Slack. Run this as a CronJob every 30 seconds; it does not mutate any resources, only reads and evaluates. We deployed this as a safeguard after our incident, and it has caught 3 near-miss scaling storms that would have caused secondary outages. The key insight is that the HPA controller's algorithm is deterministic given the same metrics snapshot, so simulating it gives you a "preview" of what's about to happen. Pair this with Prometheus alerting on kube_hpa_status_current_replicas rate-of-change to get both proactive and reactive coverage.
#!/usr/bin/env python3
"""
hpa_dry_run.py - Simulates HPA scaling decisions without mutating state.
Reads current metrics and HPA config, computes proposed replicas,
and warns if the result would exceed cluster capacity or safety limits.
Requirements: pip install kubernetes prometheus-api-client
Run as: python3 hpa_dry_run.py --namespace production --hpa api-gateway-hpa
"""
import argparse
import logging
import sys
import time
from datetime import datetime, timezone
from kubernetes import client, config
from kubernetes.client.rest import ApiException
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(message)s",
datefmt="%Y-%m-%dT%H:%M:%SZ",
)
logger = logging.getLogger(__name__)
# Safety thresholds
MAX_REPLICA_CEILING = 40
MAX_UTILIZATION_DEVIATION = 15 # percentage points
def load_kube_config():
"""Load in-cluster or local kubeconfig."""
try:
config.load_incluster_config()
logger.info("Loaded in-cluster Kubernetes config")
except config.ConfigException:
config.load_kube_config()
logger.info("Loaded local kubeconfig")
def get_hpa_object(api, namespace, name):
"""Fetch the HPA resource from the API server."""
try:
hpa = api.read_namespaced_horizontal_pod_autoscaler(
name=name, namespace=namespace
)
return hpa
except ApiException as e:
logger.error(f"Failed to fetch HPA {name}: {e.status} {e.reason}")
return None
def get_current_metrics(metrics_api, namespace, target_ref):
    """Fetch current PodMetrics for the target deployment via metrics.k8s.io."""
    try:
        # The metrics API is served as an aggregated API; query it through CustomObjectsApi
        pod_metrics = metrics_api.list_namespaced_custom_object(
            group="metrics.k8s.io", version="v1beta1",
            namespace=namespace, plural="pods",
        )
        relevant_pods = [
            pm for pm in pod_metrics.get("items", [])
            if pm["metadata"].get("labels", {}).get("app") == target_ref.name
        ]
        if not relevant_pods:
            logger.warning(f"No pod metrics found for {target_ref.name}")
            return None
        return relevant_pods
    except ApiException as e:
        logger.error(f"Metrics API error: {e.status} {e.reason}")
        return None
def compute_proposed_replicas(hpa, pod_metrics, current_replicas):
    """Simulate the HPA scaling algorithm (simplified v2 logic)."""
    spec = hpa.spec
    desired_replicas = current_replicas
    for metric_spec in spec.metrics:
        if metric_spec.type == "Resource":
            resource_name = metric_spec.resource.name
            target = metric_spec.resource.target
            target_type = target.type
            target_value = target.average_utilization or 0
            if resource_name == "cpu" and pod_metrics:
                cpu_percentages = []
                for pod in pod_metrics:
                    for container in pod.get("containers", []):
                        usage = container.get("usage", {}).get("cpu", "0")
                        # Normalize CPU usage to cores; metrics-server reports
                        # "1500000n" (nanocores), "150m" (millicores), or plain cores
                        if usage.endswith("n"):
                            value = int(usage[:-1]) / 1_000_000_000
                        elif usage.endswith("m"):
                            value = int(usage[:-1]) / 1_000
                        else:
                            value = float(usage)
                        # Simplification: express utilization as percent of one core;
                        # the real controller divides by the container's CPU request
                        cpu_percentages.append(value * 100)
                if cpu_percentages:
                    avg_utilization = sum(cpu_percentages) / len(cpu_percentages)
                    if target_type == "Utilization" and target_value > 0:
                        ratio = avg_utilization / target_value
                        proposed = int(round(current_replicas * ratio))
                        desired_replicas = max(desired_replicas, proposed)
                        logger.info(
                            f"CPU metric: avg={avg_utilization:.1f}% "
                            f"target={target_value}% ratio={ratio:.2f} "
                            f"proposed={proposed}"
                        )
    # Apply min/max bounds from the HPA spec
    desired_replicas = max(spec.min_replicas or 1, desired_replicas)
    desired_replicas = min(spec.max_replicas or desired_replicas, desired_replicas)
    return desired_replicas
def check_cluster_capacity(core_api, apps_api, namespace, proposed_replicas, deployment_name):
    """Check if the cluster can actually schedule the proposed replicas."""
    try:
        # Nodes are a core/v1 resource, so they come from CoreV1Api
        nodes = core_api.list_node()
        allocatable_cpu_m = 0       # millicores
        allocatable_memory_mi = 0   # MiB
        for node in nodes.items:
            if node.status.allocatable:
                cpu = node.status.allocatable.get("cpu", "0")
                mem = node.status.allocatable.get("memory", "0")
                allocatable_cpu_m += int(cpu.rstrip("m")) if cpu.endswith("m") else int(cpu) * 1000
                # Node allocatable memory is normally reported in Ki
                allocatable_memory_mi += int(mem.rstrip("Ki")) // 1024 if mem.endswith("Ki") else 0
        # Get per-pod resource requests from the deployment spec
        deployment = apps_api.read_namespaced_deployment(deployment_name, namespace)
        pod_requests_cpu_m = 0
        pod_requests_memory_mi = 0
        for container in deployment.spec.template.spec.containers:
            resources = container.resources.requests or {}
            cpu_req = resources.get("cpu", "0")
            mem_req = resources.get("memory", "0")
            pod_requests_cpu_m += int(cpu_req.rstrip("m")) if cpu_req.endswith("m") else int(cpu_req) * 1000
            pod_requests_memory_mi += int(mem_req.rstrip("Mi")) if mem_req.endswith("Mi") else 0
        max_schedulable = min(
            allocatable_cpu_m // pod_requests_cpu_m if pod_requests_cpu_m else 999,
            allocatable_memory_mi // pod_requests_memory_mi if pod_requests_memory_mi else 999,
        )
        if proposed_replicas > max_schedulable:
            logger.warning(
                f"Capacity warning: {proposed_replicas} replicas requested, "
                f"only {max_schedulable} schedulable based on resources"
            )
            return False, max_schedulable
        return True, max_schedulable
    except ApiException as e:
        logger.error(f"Failed to check cluster capacity: {e.status} {e.reason}")
        return False, 0
def main():
parser = argparse.ArgumentParser(description="HPA Dry-Run Simulator")
parser.add_argument("--namespace", required=True, help="Kubernetes namespace")
parser.add_argument("--hpa", required=True, help="HPA resource name")
parser.add_argument("--ceiling", type=int, default=MAX_REPLICA_CEILING,
help="Maximum safe replica count")
args = parser.parse_args()
load_kube_config()
    api = client.AutoscalingV2Api()
    apps_api = client.AppsV1Api()
    core_api = client.CoreV1Api()            # needed for listing nodes
    metrics_api = client.CustomObjectsApi()  # metrics.k8s.io is queried as an aggregated API
hpa = get_hpa_object(api, args.namespace, args.hpa)
if not hpa:
logger.error("Cannot proceed without HPA object")
sys.exit(1)
current_replicas = hpa.status.current_replicas or 0
logger.info(f"HPA: {args.hpa} | Current replicas: {current_replicas}")
pod_metrics = get_current_metrics(metrics_api, args.namespace, hpa.spec.scale_target_ref)
if not pod_metrics:
logger.warning("No metrics available; skipping simulation")
sys.exit(0)
proposed = compute_proposed_replicas(hpa, pod_metrics, current_replicas)
    schedulable, max_sched = check_cluster_capacity(
        core_api, apps_api, args.namespace, proposed, hpa.spec.scale_target_ref.name
    )
severity = "OK" if proposed <= args.ceiling and schedulable else "WARNING"
logger.info(
f"[{severity}] Proposed replicas: {proposed} | "
f"Current: {current_replicas} | "
f"Ceiling: {args.ceiling} | "
f"Schedulable: {max_sched}"
)
if proposed > args.ceiling:
logger.warning(
f"DRY-RUN ALERT: HPA would scale to {proposed}, exceeding ceiling of {args.ceiling}"
)
sys.exit(2)
if not schedulable:
logger.warning(
f"DRY-RUN ALERT: {proposed} replicas exceed cluster capacity ({max_sched} schedulable)"
)
sys.exit(2)
logger.info("Dry-run complete: all checks passed")
sys.exit(0)
if __name__ == "__main__":
main()
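To run the simulator across every HPA in a namespace on a schedule, a thin wrapper like the one below works; the Slack webhook variable and message format are illustrative, and the exit codes match the script above (2 signals a dry-run alert):
#!/usr/bin/env bash
# run-hpa-dry-run.sh - evaluate every HPA in a namespace and notify on alerts
set -uo pipefail   # no -e: we want to inspect non-zero exit codes ourselves
NAMESPACE="${1:-production}"
SLACK_WEBHOOK_URL="${SLACK_WEBHOOK_URL:-}"   # optional; empty disables notifications
for hpa in $(kubectl -n "$NAMESPACE" get hpa -o jsonpath='{.items[*].metadata.name}'); do
  python3 hpa_dry_run.py --namespace "$NAMESPACE" --hpa "$hpa"
  status=$?
  if [ "$status" -eq 2 ] && [ -n "$SLACK_WEBHOOK_URL" ]; then
    curl -s -X POST -H 'Content-Type: application/json' \
      --data "{\"text\": \"HPA dry-run alert: ${hpa} in ${NAMESPACE} would breach its safety ceiling\"}" \
      "$SLACK_WEBHOOK_URL" > /dev/null
  fi
done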
Tip 3: Use Prometheus Recording Rules to Detect HPA Thrashing Before It Becomes an Outage
Reactive alerting on kube_hpa_status_current_replicas is not enough: by the time the alert fires, the damage is done. Instead, use Prometheus recording rules to compute a rolling "HPA instability score" that captures rapid replica oscillations over 5-minute windows. An HPA that oscillates between 4 and 48 replicas three or more times in 10 minutes is thrashing, and thrashing is the leading indicator of a stabilization window misconfiguration. The recording rules below create two time series: hpa:replica_change_rate_5m (rate of replica count changes) and hpa:instability_score (a composite score factoring in amplitude and frequency). You can then alert on hpa:instability_score > 0.7 with a 2-minute for duration, giving you a 2-minute heads-up before the cascade hits your users. We run these recording rules across all 14 HPA-managed workloads and have reduced HPA-related incidents by 97% since deployment. The PromQL expressions are tuned for a 15-second HPA sync period; if you use a custom --horizontal-pod-autoscaler-sync-period, adjust the rate window accordingly.
# Prometheus recording rules for HPA instability detection
# Save as: hpa_instability_rules.yml
# Reference it from prometheus.yml under the rule_files: section
groups:
- name: hpa_instability_detection
  interval: 15s
  rules:
  # Rule 1: Number of replica-count changes over the last 5 minutes
  # A high value = HPA is making frequent scaling decisions (bad)
  - record: hpa:replica_change_rate_5m
    expr: |
      changes(
        kube_hpa_status_current_replicas{namespace!="kube-system"}[5m]
      )
    labels:
      severity: warning
  # Rule 2: Absolute magnitude of replica swings over 10 minutes (amplitude)
  # Captures how far the HPA overshoots relative to minReplicas
  - record: hpa:replica_amplitude_ratio
    expr: |
      (
        max_over_time(kube_hpa_status_current_replicas{namespace!="kube-system"}[10m])
        -
        min_over_time(kube_hpa_status_current_replicas{namespace!="kube-system"}[10m])
      )
      /
      clamp_min(kube_hpa_spec_min_replicas{namespace!="kube-system"}, 1)
    labels:
      severity: warning
  # Rule 3: Composite instability score (0.0 = stable, 1.0 = critical)
  # Combines rate and amplitude into a single actionable metric
  - record: hpa:instability_score
    expr: |
      clamp(
        (
          (hpa:replica_change_rate_5m / 2)
          +
          (hpa:replica_amplitude_ratio / 10)
        )
        / 2,
        0,
        1
      )
    labels:
      severity: critical
  # Rule 4: Scale-up burst risk indicator (0 or 1): the scale-up policy
  # changed recently, or the scale-up stabilization window is zero
  - record: hpa:scaleup_burst_indicator
    expr: |
      clamp_max(
        (
          changes(kube_hpa_spec_scale_up_select_policy{namespace!="kube-system"}[10m]) > bool 0
        )
        +
        (
          kube_hpa_spec_behavior_scale_up_stabilization_window_seconds{namespace!="kube-system"} == bool 0
        ),
        1
      )
    labels:
      severity: critical
  # Rule 5: Alert-ready boolean - is this HPA currently unstable?
  - record: hpa:is_unstable
    expr: |
      (
        hpa:instability_score{namespace!="kube-system"} > 0.5
      )
      and
      (
        hpa:scaleup_burst_indicator{namespace!="kube-system"} == 1
      )
    labels:
      severity: critical
# Alert rules to pair with the recording rules above:
# (Add to your alerting rules file)
#
# - alert: HPAInstabilityDetected
# expr: hpa:instability_score > 0.7
# for: 2m
# labels:
# severity: page
# annotations:
# summary: "HPA {{ $labels.horizontalpodautoscaler }} is unstable"
# description: "Instability score is {{ $value | humanizePercentage }} over 5m window"
#
# - alert: HPAZeroStabilizationWindow
# expr: kube_hpa_spec_behavior_scale_up_stabilization_window_seconds == 0
# for: 0m
# labels:
# severity: warning
# annotations:
# summary: "HPA has zero scale-up stabilization window"
# description: "{{ $labels.horizontalpodautoscaler }} in {{ $labels.namespace }} has stabilizationWindowSeconds=0"
Prevention: What We Changed
After the incident, we implemented four systemic changes:
- Migrated all HPAs to autoscaling/v2. The v2beta2 API is deprecated and its lossy conversion was a contributing factor. We used kubectl convert and then manually verified every behavior block (a migration sketch follows this list).
- Enforced minimum stabilization windows via OPA. Our policy rejects any HPA with stabilizationWindowSeconds < 30 on either scaleUp or scaleDown.
- Deployed the HPA dry-run simulator as a CronJob that runs every 30 seconds against every HPA in production. It does not mutate state; it only simulates and warns.
- Added Prometheus recording rules for HPA instability detection (see Tip 3 above). We now get a 2-minute early warning before any scaling cascade.
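For the first item, the migration pass looked roughly like this sketch. It assumes the kubectl-convert plugin is installed and that HPA manifests live under manifests/; both are illustrative:
# Convert every HPA manifest to autoscaling/v2, then review the behavior
# blocks by hand before committing (kubectl-convert plugin required)
for f in manifests/*hpa*.yaml; do
  kubectl convert -f "$f" --output-version autoscaling/v2 > "${f%.yaml}-v2.yaml"
done
# Spot-check that no behavior fields were dropped in conversion
diff <(yq '.spec.behavior' manifests/api-gateway-hpa.yaml) \
     <(yq '.spec.behavior' manifests/api-gateway-hpa-v2.yaml)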
Comparison: Before vs. After
| Metric | Before Fix | After Fix | Improvement |
|--------|-----------|-----------|-------------|
| Time to scale up (p50) | 45 seconds | 2.1 minutes | Safer, more controlled |
| Max replicas during spike | 50 (capped) | 18 | 64% reduction |
| Pending pods during spike | 41 | 0 | 100% eliminated |
| p99 latency during spike | 4.2s | 220ms | 95% reduction |
| Monthly gateway compute cost | $12,800 | $8,600 | 33% savings ($4,200/mo) |
| HPA incidents (90-day window) | 3 | 0 | 100% eliminated |
Join the Discussion
This outage was entirely preventable. It was caused by a silent behavior change in a minor Kubernetes release, combined with a manifest that had been working for over a year. The Kubernetes ecosystem's approach to deprecation and default behavior changes is a recurring source of production incidents.
Discussion Questions
- Future-proofing: Do you think Kubernetes should ship behavioral changes to HPA defaults in minor releases, or should they be reserved for major version bumps? What's your experience with silent changes?
- Trade-off: Aggressive scale-up stabilization windows (like our 60s minimum) add latency to genuine scaling events. At what point does safety start hurting your ability to handle real traffic spikes? Is 30 seconds too conservative?
- Tooling: How does your team validate Kubernetes manifests before deployment? Do you use OPA/Kubeval, or rely on admission controllers like Kyverno? What has worked best at scale?
Frequently Asked Questions
Why did the HPA scale to 50 replicas instead of a reasonable number?
The selectPolicy: Max combined with a Percent policy of value: 100 and periodSeconds: 15 means the HPA was allowed to double the replica count every 15 seconds with no stabilization window. Starting from 4 replicas: 4 → 8 → 16 → 32 → 50 (capped). Each evaluation cycle hit the 15-second mark, and the controller picked the maximum allowed by both the Percent and Pods policies. With no stabilization window to slow it down, the entire escalation completed in under 60 seconds. This is why we now enforce selectPolicy: Min and minimum 60-second stabilization on all our scale-up policies.
Could this have been caught with standard Kubernetes admission controllers?
Not with the default admission plugins. The PodSecurity and ResourceQuota admission controllers do not inspect HPA behavior fields. You need either a custom admission webhook or a policy engine like OPA/Gatekeeper or Kyverno. In our case, we use Gatekeeper for namespace-level constraints, but the HPA policy wasn't in its constraint library. We've since added custom OPA constraints that validate stabilizationWindowSeconds, selectPolicy, and maxReplicas/minReplicas ratios. See open-policy-agent/gatekeeper for the project and kyverno/kyverno as an alternative.
Is autoscaling/v2beta2 still safe to use in Kubernetes 1.36?
No. The autoscaling/v2beta2 API was deprecated in Kubernetes 1.23 and removed in 1.26 in favor of the autoscaling/v2 equivalent. However, GKE and EKS maintained backward compatibility shims through 1.35. Kubernetes 1.36 tightened the conversion logic, and as we discovered, certain fields in behavior undergo lossy conversion. The 1.36 changelog explicitly notes changes to HPA API conversion. If you are still on v2beta2, migrate immediately. The kubectl convert command or kube-no-trouble tool can help identify deprecated API usage across your cluster.
Conclusion & Call to Action
A 47-minute outage from a single misconfigured field is a stark reminder that Kubernetes defaults are not safety guarantees; they are starting points. The 1.36 release's removal of implicit stabilization defaults was documented in the release notes, but release notes don't fire PagerDuty alerts. The gap between "documented" and "operationalized" is where outages live.
If you take one thing from this postmortem, let it be this: validate your HPA behavior fields in CI, today. Not next quarter. Not as part of your next architecture review. The OPA policy is 12 lines of Rego. The kubeval check is a one-liner. The combined validation script above is 80 lines of bash. The cost of implementing this is measured in hours. The cost of not implementing it is measured in the $23,400 we lost and the trust of 14,000 users who saw a 94.7% error rate on their dashboards.
Audit every HPA in your cluster this week. Check for stabilizationWindowSeconds: 0. Check for selectPolicy: Max on scaleUp. Check that you are using autoscaling/v2, not v2beta2. Then set up the Prometheus recording rules from Tip 3 so you have early warning if something like this starts developing. Your on-call engineer will thank you at 3 AM.
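As a starting point, here is a quick audit sketch using kubectl and jq. It flags any HPA that explicitly sets a zero scale-up stabilization window or a Max scale-up selectPolicy, then greps your repo for the deprecated API version (the manifests/ path is illustrative):
# Flag risky scale-up behavior across all namespaces
kubectl get hpa --all-namespaces -o json | jq -r '
  .items[]
  | select(.spec.behavior.scaleUp.stabilizationWindowSeconds == 0
           or .spec.behavior.scaleUp.selectPolicy == "Max")
  | "\(.metadata.namespace)/\(.metadata.name)"'
# Find manifests still pinned to the deprecated API version
grep -rl "autoscaling/v2beta2" manifests/ || true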
$23,400 Revenue lost in 47 minutes due to a single misconfigured HPA field