Topology Spread Constraints
- High Availability Enforcement: TopologySpreadConstraints ensure Pods are evenly distributed across different topological domains (nodes, zones, regions) to prevent single points of failure. If an entire availability zone goes down, properly spread workloads continue serving from remaining zones.
- Minimizing Skew: The maxSkew parameter defines the maximum allowed difference in pod count between the most-populated and least-populated topology domain. A maxSkew of 1 enforces near-perfect balance; higher values allow more flexibility in scheduling.
- Granular Control with topologyKey: By specifying topologyKey (for example, topology.kubernetes.io/zone for zone-level spreading or kubernetes.io/hostname for node-level spreading), you dictate the axis along which Pods should be distributed.
- Flexible Scheduling: The whenUnsatisfiable field offers two modes: DoNotSchedule (hard constraint -- pods stay Pending if the spread cannot be satisfied) and ScheduleAnyway (soft constraint -- the scheduler tries its best but schedules the pod regardless). This lets you balance between strict HA and scheduling flexibility.
- Complement to Pod Anti-Affinity: TopologySpreadConstraints provide more granular control than Pod Anti-Affinity. Anti-affinity is binary (one pod per domain or zero), while topology spread allows multiple pods per domain as long as the skew across domains stays within bounds.
How do you ensure your application stays online when an entire data center or availability zone goes dark? Even if you have 10 replicas, the Kubernetes scheduler might place all of them on the same node or in the same availability zone (AZ). If that zone experiences an outage, all 10 replicas go down simultaneously.
TopologySpreadConstraints allow you to control how Pods are distributed across your cluster's topology, ensuring true high availability by spreading replicas across failure domains.
1. The Problem: Uneven Distribution
Without topology spread constraints, the Kubernetes scheduler optimizes for resource utilization and node fit, not for distribution across failure domains. Consider a cluster with three availability zones:
- Zone A: 10 nodes, 80% utilized
- Zone B: 5 nodes, 50% utilized
- Zone C: 5 nodes, 50% utilized
When you deploy 6 replicas, the scheduler may place 4 in Zone B and 2 in Zone C (because those zones have the most available resources), leaving zero replicas in Zone A. If Zone B fails, you lose 4 of 6 replicas instantly.
The goal of topology spread is to minimize skew -- the difference in pod count between the most-populated and least-populated zones.
2. Core Parameters
topologyKey
The topologyKey is a node label key that defines the topology domains. The scheduler groups nodes by the value of this label and distributes pods across the resulting groups.
Common topology keys:
| topologyKey | What It Spreads Across | Use Case |
|---|---|---|
| topology.kubernetes.io/zone | Availability zones | Zone-level HA (most common) |
| topology.kubernetes.io/region | Cloud regions | Multi-region spreading |
| kubernetes.io/hostname | Individual nodes | Node-level spreading (similar to anti-affinity) |
| Custom label (e.g., rack) | Custom domains | Rack-aware scheduling in bare-metal clusters |
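Topology domains come from labels on the nodes themselves. As a sketch, a worker node in a typical cloud cluster carries labels like the following (the node name, zone, and region values are illustrative):

```yaml
# Illustrative node labels that define topology domains
apiVersion: v1
kind: Node
metadata:
  name: worker-1
  labels:
    kubernetes.io/hostname: worker-1          # node-level domain
    topology.kubernetes.io/zone: us-east-1a   # zone-level domain
    topology.kubernetes.io/region: us-east-1  # region-level domain
    rack: rack-01                             # custom label for rack-aware spreading
```

Cloud providers set the zone and region labels automatically; custom keys like rack must be applied by you (for example, with kubectl label nodes).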
maxSkew
The maximum degree to which pods may be unevenly distributed across topology domains. Skew is computed per domain as podCount(domain) - minPodCount(across all domains); maxSkew caps this value, so the gap between the most-populated and least-populated domain cannot exceed it.
- maxSkew: 1: The difference between the most-populated and least-populated domain can be at most 1. This enforces near-perfect balance. With 6 replicas across 3 zones, you get 2-2-2.
- maxSkew: 2: Allows more imbalance. With 6 replicas across 3 zones, distributions like 3-2-1 are acceptable.
- maxSkew: 3 or higher: Provides very loose spreading. Useful when you want a soft preference for distribution without strict enforcement.
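The arithmetic for the 6-replica, 3-zone scenario above can be sketched directly in a constraint (the app label below is a hypothetical placeholder; match it to your Deployment's selector):

```yaml
# Worked example: 6 replicas across 3 zones
#   2-2-2 -> skew 0  (satisfies maxSkew: 1)
#   3-2-1 -> skew 2  (violates maxSkew: 1, satisfies maxSkew: 2)
#   4-2-0 -> skew 4  (violates both)
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: web   # hypothetical label
```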
whenUnsatisfiable
Determines what happens when the scheduler cannot place a pod while satisfying the maxSkew constraint.
- DoNotSchedule (hard constraint): The pod stays in the Pending state until a node becomes available that satisfies the skew constraint. Use this for critical production workloads where HA is non-negotiable.
- ScheduleAnyway (soft constraint): The scheduler places the pod on the node that minimizes skew, even if the constraint would be violated. Use this when you prefer balanced distribution but cannot afford pods stuck in Pending.
labelSelector
Defines which pods count toward the skew calculation. Only pods matching this selector are considered when computing the current distribution. This is typically set to match the same labels as your Deployment's selector.
3. YAML Examples
Basic Zone-Level Spreading
# Spread pods evenly across availability zones
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-frontend
namespace: production
spec:
replicas: 6
selector:
matchLabels:
app: web-frontend
template:
metadata:
labels:
app: web-frontend
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone # Spread across AZs
whenUnsatisfiable: DoNotSchedule # Hard constraint
labelSelector:
matchLabels:
app: web-frontend
containers:
- name: frontend
image: myregistry.io/web-frontend:v3.1.0
resources:
requests:
cpu: "250m"
memory: "256Mi"
With this configuration, the scheduler distributes the 6 replicas as evenly as possible across zones: 2-2-2 in a three-zone cluster.
Combined Node and Zone Spreading
# Spread across both zones AND nodes for maximum distribution
apiVersion: apps/v1
kind: Deployment
metadata:
name: payment-service
namespace: production
spec:
replicas: 9
selector:
matchLabels:
app: payment-service
template:
metadata:
labels:
app: payment-service
spec:
topologySpreadConstraints:
# First constraint: spread across zones
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: payment-service
# Second constraint: spread across nodes within each zone
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway # Soft for node-level
labelSelector:
matchLabels:
app: payment-service
containers:
- name: payment
image: myregistry.io/payment-service:v2.0.0
resources:
requests:
cpu: "500m"
memory: "512Mi"
This configuration first ensures pods are evenly distributed across zones (hard constraint), then tries to spread them across nodes within each zone (soft constraint). The result is maximum distribution: 3 pods per zone, each on a different node if possible.
Soft Spreading with ScheduleAnyway
# Best-effort spreading — prefer balance but don't block scheduling
apiVersion: apps/v1
kind: Deployment
metadata:
name: logging-agent
namespace: monitoring
spec:
replicas: 4
selector:
matchLabels:
app: logging-agent
template:
metadata:
labels:
app: logging-agent
spec:
topologySpreadConstraints:
- maxSkew: 2 # Allow some imbalance
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: ScheduleAnyway # Never block scheduling
labelSelector:
matchLabels:
app: logging-agent
containers:
- name: agent
image: myregistry.io/logging-agent:v1.5.0
4. Multiple Constraints
When multiple topologySpreadConstraints are specified, the scheduler evaluates all of them. A pod is only placed on a node that satisfies all constraints simultaneously (for DoNotSchedule constraints) or that minimizes the total skew (for ScheduleAnyway constraints).
If constraints conflict (for example, zone spreading requires placing the pod in Zone A, but node spreading requires placing it on a node in Zone B), DoNotSchedule constraints take priority and the pod stays Pending until a valid placement exists.
5. Interaction with Affinity Rules
TopologySpreadConstraints and affinity/anti-affinity rules can be used together, but understanding their interaction is essential.
With Node Affinity
Node affinity restricts which nodes are eligible. TopologySpreadConstraints then distribute pods among the eligible nodes.
# Combine node affinity with topology spread
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node-type
operator: In
values: ["compute-optimized"] # Only schedule on compute nodes
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: ml-inference
With Pod Anti-Affinity
Pod anti-affinity and topology spread can work together but may conflict. Anti-affinity says "do not place two pods on the same node," while topology spread says "keep zones balanced." If the only node that balances zones already has a pod, anti-affinity blocks placement and the pod stays Pending.
In general, prefer TopologySpreadConstraints over Pod Anti-Affinity for most spreading use cases. TopologySpreadConstraints are more flexible and predictable.
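As an illustration of that trade-off, the node-level exclusivity that required pod anti-affinity provides can usually be approximated with a hard hostname spread; unlike anti-affinity, it degrades gracefully when replicas outnumber nodes (the app label is illustrative):

```yaml
# Sketch: hostname-level spread as a more forgiving alternative to
# required pod anti-affinity
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: my-service   # illustrative label
# With 5 replicas on 3 nodes this allows a 2-2-1 layout, where strict
# anti-affinity would leave 2 pods Pending.
```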
6. Comparison: TopologySpreadConstraints vs. Pod Anti-Affinity
| Aspect | TopologySpreadConstraints | Pod Anti-Affinity |
|---|---|---|
| Granularity | Controls the maximum skew (allows multiple pods per domain) | Binary: one pod per domain or zero |
| Multi-domain | Can spread across multiple topology levels simultaneously | Requires separate affinity rules per level |
| Soft/Hard | DoNotSchedule (hard) or ScheduleAnyway (soft) | required (hard) or preferred (soft) |
| Scaling behavior | Handles any replica count gracefully | Breaks when replicas > domains (e.g., 5 replicas on 3 nodes) |
| Performance | Efficient scheduler evaluation | Can be expensive for large clusters |
| Introduced | Kubernetes 1.16 (GA in 1.19) | Kubernetes 1.4 |
Pod Anti-Affinity is still useful when you truly need at most one pod per node (for DaemonSet-like patterns or stateful workloads). For all other spreading use cases, TopologySpreadConstraints are the preferred mechanism.
7. Default Cluster-Level Topology Spread
You can configure default TopologySpreadConstraints at the cluster level via the kube-scheduler configuration. This ensures all workloads get basic spreading even if individual Deployments do not specify constraints.
# kube-scheduler configuration: default topology spread for all pods
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
- name: PodTopologySpread
args:
defaultConstraints:
- maxSkew: 3
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: ScheduleAnyway
- maxSkew: 5
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
defaultingType: List # Use these defaults for pods without constraints
When defaultingType is List, the specified constraints are applied to pods that do not define their own TopologySpreadConstraints. When set to System, Kubernetes applies built-in defaults (spread by zone and hostname with ScheduleAnyway).
8. Real-World HA Patterns
Three-AZ Production Deployment
The most common HA pattern: deploy replicas across three availability zones with strict zone spreading and soft node spreading.
# Production HA: 3 AZs, strict zone balance, soft node balance
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-gateway
namespace: production
spec:
replicas: 9 # 3 per zone in a 3-zone cluster
selector:
matchLabels:
app: api-gateway
template:
metadata:
labels:
app: api-gateway
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: api-gateway
- maxSkew: 2
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app: api-gateway
containers:
- name: gateway
image: myregistry.io/api-gateway:v4.0.0
resources:
requests:
cpu: "500m"
memory: "512Mi"
# Ensure pods spread across nodes during disruptions
# PDB ensures at least 6 of 9 replicas are always running
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-gateway-pdb
namespace: production
spec:
minAvailable: 6 # At least 6 of 9 must be running
selector:
matchLabels:
app: api-gateway
Rack-Aware Bare-Metal Spreading
For bare-metal clusters, custom node labels define rack topology:
# Spread across racks in a bare-metal cluster
topologySpreadConstraints:
- maxSkew: 1
topologyKey: rack # Custom label: rack=rack-01, rack-02, etc.
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: database-proxy
9. Version History and Feature Gates
TopologySpreadConstraints were introduced as an alpha feature in Kubernetes 1.16, moved to beta in 1.18, and became generally available in Kubernetes 1.19. The minDomains field (which sets a minimum number of eligible domains) was added as alpha in 1.24 and reached GA in 1.30. The matchLabelKeys field was added in 1.25 (alpha) and reached beta in 1.27, allowing the scheduler to incorporate labels such as pod-template-hash into the spread calculation alongside the labelSelector.
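A sketch of the matchLabelKeys usage mentioned above, for clusters where the field is enabled (the app label is illustrative):

```yaml
# Sketch: scope skew to the current rollout so old and new ReplicaSets
# do not count against each other during a rolling update
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: web-frontend        # illustrative label
  matchLabelKeys:
  - pod-template-hash          # set automatically by the Deployment controller
```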
Common Pitfalls
- Uneven zone sizes: If Zone A has 10 nodes and Zone B has 2 nodes, maxSkew: 1 with DoNotSchedule can leave pods Pending because there are not enough nodes in Zone B. Use ScheduleAnyway or ensure zones have roughly equal capacity.
- Forgetting the labelSelector: Without labelSelector, the constraint applies to all pods in the namespace. This almost certainly is not what you want and will produce confusing scheduling behavior.
- Conflicting constraints: Multiple hard constraints that cannot be simultaneously satisfied result in pods stuck in Pending. Use kubectl describe pod to see scheduler events and identify which constraint is blocking placement.
- maxSkew too restrictive during scale-up: With maxSkew: 1 and DoNotSchedule, scaling from 3 to 4 replicas in a 3-zone cluster requires placing the 4th pod in the zone with the fewest pods. If that zone has no available capacity, the pod stays Pending. Use ScheduleAnyway for non-critical workloads.
- Not combining with PodDisruptionBudget: Topology spread only controls initial placement. During node drains or voluntary disruptions, pods may be evicted and rescheduled unevenly. PDBs ensure that enough replicas remain available during disruptions to maintain the HA posture.
- Ignoring minDomains: The scheduler computes skew only over domains that currently contain eligible nodes. In a 3-zone cluster where one zone has been scaled down to zero nodes (for example, by the cluster autoscaler), that zone drops out of the skew calculation and pods concentrate in the remaining two zones. Setting minDomains: 3 makes the scheduler treat the global minimum as zero until three domains are present, preserving the intended spread as the cluster scales back up.
What's Next?
- Apply TopologySpreadConstraints to your critical production Deployments with maxSkew: 1 across availability zones.
- Configure cluster-level default constraints in the kube-scheduler configuration to ensure all workloads get basic spreading.
- Combine topology spread with PodDisruptionBudgets to maintain HA guarantees during node maintenance and voluntary disruptions.
- Explore the minDomains field (GA in Kubernetes 1.30) to force the scheduler to consider all topology domains.
- Use the matchLabelKeys field to automatically match pod-template-hash labels, simplifying topology spread for Deployments with rolling updates.
- Monitor pod distribution with kubectl get pods -o wide and verify that replicas are actually spread across the expected zones and nodes.