The Real Cost of Watching Everything
Run a horizontally scaled controller across a large Kubernetes cluster: say, three replicas of a custom resource controller in a 5,000-node cluster. Each replica receives the entire event stream from the API server, deserializes every Pod, every ConfigMap, every change, and then discards roughly two-thirds of it because another replica owns those objects. Multiply that waste by dozens of controllers and you're burning CPU and bandwidth on work that never had to happen.
This is the scaling wall that Kubernetes operators hit, and Kubernetes v1.36 finally addresses it head-on with server-side sharded list and watch.
Why Client-Side Sharding Isn't Enough
Some controllers already implement horizontal sharding—tools like kube-state-metrics assign each replica a slice of the keyspace and discard irrelevant objects locally. Sounds reasonable until you map the actual costs:
- Deserialization waste: N replicas each deserialize the full event stream, even though (N-1) of them will throw away most of it.
- Network scales the wrong way: Bandwidth grows with the number of replicas instead of shrinking with the shard size. Three replicas = three times the API server egress.
- CPU efficiency tanks: Every CPU cycle spent parsing objects you'll discard is a cycle you could've spent on actual work.
The problem isn't the sharding logic—it's that all filtering happens after the data leaves the API server. You're paying the full cost upfront, then hoping the controller will do the right math on its end.
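For context, here is a minimal sketch of that client-side pattern (illustrative only, not kube-state-metrics' actual code; ownedByThisShard and reconcile are hypothetical helpers). The replica receives and decodes every event, then drops whatever falls outside its shard:

watcher, err := clientset.CoreV1().Pods(metav1.NamespaceAll).Watch(ctx, metav1.ListOptions{})
if err != nil {
	return err
}
for event := range watcher.ResultChan() {
	pod, ok := event.Object.(*corev1.Pod)
	if !ok {
		continue
	}
	// The API server already sent this object and client-go already decoded it.
	if !ownedByThisShard(pod.Namespace, pod.Name) {
		continue // full network and CPU cost paid, just to throw it away
	}
	reconcile(pod)
}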
Server-Side Sharding Changes the Game
Instead of filtering downstream, Kubernetes v1.36 moves the filter upstream. Controllers now tell the API server exactly which slice of the keyspace they own, and the API server only sends matching events.
The mechanism is deceptively simple: a new shardSelector field in ListOptions lets you specify a hash range. When you request:
opts := metav1.ListOptions{
	ShardSelector: &metav1.ShardSelector{
		Index: 0, // This replica is shard 0
		Total: 3, // Out of 3 total shards
	},
}
pods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(ctx, opts)
The API server hashes each object's namespace and name, maps it to a shard range, and filters at the source. Only events matching your shard ever leave the server.
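Conceptually, the mapping is just a hash of the object's key reduced to a shard index. A rough sketch of the idea (the exact key format the API server hashes is an implementation detail of the alpha feature):

import "hash/fnv"

// shardFor maps an object's namespace/name to a shard index in [0, total).
// Sketch only: the exact key format the API server hashes is an
// implementation detail of the alpha feature.
func shardFor(namespace, name string, total uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(namespace + "/" + name))
	return h.Sum32() % total
}

// An event reaches a watcher only when shardFor(obj) matches ShardSelector.Index.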
What Actually Changes for You
Immediate wins:
- Lower per-replica CPU: No wasted deserialization cycles. Each replica only processes what it owns.
- Reduced network: API server sends 1/N of the traffic per replica. Scale to 10 replicas? You've slashed per-replica egress by 90%.
- Better controller responsiveness: Smaller event streams mean faster reconciliation loops and lower latency on watch operations.
Tradeoffs to know:
- This is alpha in v1.36, so expect the API surface to evolve. Don't ship it to production yet.
- Your controller code needs to know its shard assignment and pass it on every list/watch call (see the informer sketch after this list). If you're using a framework like kubebuilder, watch for patches that handle this automatically.
- Hash collisions are handled deterministically: objects map to shards based on an fnv.New32a hash of their namespace/name. The distribution is uniform as long as your keyspace is reasonably large.
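Until frameworks catch up, one way to thread the selector through a shared informer factory is client-go's TweakListOptions hook. A sketch, assuming the alpha ShardSelector field described above and a shardIndex/shardTotal computed as in the migration example below:

factory := informers.NewSharedInformerFactoryWithOptions(
	clientset,
	0, // no periodic resync
	informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
		// Assumes the alpha ShardSelector field described above.
		opts.ShardSelector = &metav1.ShardSelector{
			Index: shardIndex,
			Total: shardTotal,
		}
	}),
)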
A Concrete Migration Path
If you maintain a horizontally scaled controller or metrics exporter:
- Check the feature gate: ServerSideShardedListAndWatch=true (alpha).
- Audit your watch/list calls: Any place where you're already doing client-side filtering is a candidate for server-side sharding.
- Implement shard assignment: Use a simple integer (e.g., from a downward API env var or StatefulSet ordinal) to determine your replica's shard.
- Test in a lab cluster first: The hash function is deterministic, but edge cases around large resource counts should be validated before production.
// Example: derive the shard index from the pod ordinal
// (e.g. an ORDINAL env var injected into the pod spec).
ordinal := os.Getenv("ORDINAL") // "0", "1", "2", etc.
shardIndex, err := strconv.Atoi(ordinal)
if err != nil {
	// Fail fast: a replica with an unknown shard would watch nothing it owns.
	log.Fatalf("invalid ORDINAL %q: %v", ordinal, err)
}
shardTotal := 3 // Number of replicas in your deployment

// Apply to every list/watch call this replica makes.
opts.ShardSelector = &metav1.ShardSelector{
	Index: shardIndex,
	Total: shardTotal,
}
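If your replicas run as a StatefulSet instead, one option (again a sketch, not something the feature requires) is to derive the ordinal from the pod's hostname, since StatefulSet pods are named <name>-0, <name>-1, and so on:

// Alternative: derive the ordinal from the StatefulSet pod hostname,
// e.g. "my-controller-2" -> 2. Uses os, strings, and strconv.
func deriveOrdinal() (int, error) {
	hostname, err := os.Hostname()
	if err != nil {
		return 0, err
	}
	parts := strings.Split(hostname, "-")
	return strconv.Atoi(parts[len(parts)-1])
}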
Why This Matters Now
Kubernetes clusters are getting bigger, and the pressure on the API server is mounting. Every optimization that pushes filtering logic upstream—whether it's field selectors, label selectors, or now shard selectors—buys you headroom to scale controllers without hitting a resource wall.
Server-side sharded list and watch is especially important for anyone running high-cardinality watch operations: Pod controllers, node-level agents, cost optimizers, security scanners. For teams operating 1,000+ node clusters with dozens of custom controllers, this can be the difference between stable API server load and constant firefighting.
The Question for Your Cluster
Are your horizontally scaled controllers already doing some form of client-side sharding to stay sane? If so, server-side sharding is probably worth experimenting with in your next lab run. And if you're not doing sharding yet but you've got multiple replicas of a watcher—you're probably leaving performance on the table.
What's the largest cluster you're running, and how many custom controllers are watching the same resources? I'd love to hear whether this lands on your v1.36 roadmap.