Scheduling: Affinity & Anti-Affinity
- Node Affinity: Attracts Pods to specific nodes based on node labels (e.g., hardware types like SSD or GPU). It is the expressive replacement for `nodeSelector`, supporting operators like `In`, `NotIn`, `Exists`, `DoesNotExist`, `Gt`, and `Lt`.
- Hard vs. Soft Rules: `requiredDuringSchedulingIgnoredDuringExecution` rules must be met for a Pod to schedule (the Pod stays Pending otherwise), while `preferredDuringSchedulingIgnoredDuringExecution` rules act as a weighted suggestion the scheduler tries to honor but can ignore.
- Inter-Pod Affinity: Co-locates related Pods (e.g., a web server and Redis cache) on the same node or within the same topology domain (zone, rack) to reduce network latency.
- Anti-Affinity for HA: Spreads replicas of the same application across different nodes or availability zones to ensure High Availability during hardware failures or zone outages.
- topologyKey: Defines the scope of affinity and anti-affinity rules. Using `kubernetes.io/hostname` means "same node," while `topology.kubernetes.io/zone` means "same availability zone."
- Performance Warning: Pod affinity and anti-affinity require the scheduler to inspect pods on every candidate node. At scale (hundreds of nodes, thousands of pods), this can significantly slow scheduling. Use with caution in large clusters.
We covered Taints and Tolerations (which repel pods from nodes). Now let's look at Affinity, which attracts pods to nodes -- or to other pods.
Node Affinity
Node affinity allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node. It is conceptually similar to nodeSelector but far more expressive.
requiredDuringSchedulingIgnoredDuringExecution (Hard)
This is a hard constraint. If no node satisfies the rule, the pod remains Pending indefinitely.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: accelerator # Node must have this label
                operator: In
                values:
                  - nvidia-a100
                  - nvidia-h100 # Matches either GPU type
  containers:
    - name: trainer
      image: ml-trainer:latest
```
Key details about hard affinity:
- Multiple `nodeSelectorTerms` are OR-ed together (any one term matching is sufficient).
- Multiple `matchExpressions` within a single term are AND-ed (all must match).
- The "IgnoredDuringExecution" part means that if a node's labels change after a pod is scheduled, the pod is not evicted. It stays where it is.
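The OR/AND semantics can be sketched in a fragment like this (illustrative only; the `accelerator` and `disktype` labels follow the examples in this section):

```yaml
# The pod schedules if EITHER term matches:
#   term 1: accelerator=nvidia-a100 AND disktype=ssd
#   term 2: the accelerator label exists with any value
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
    - matchExpressions:   # term 1: both expressions must match (AND)
        - key: accelerator
          operator: In
          values:
            - nvidia-a100
        - key: disktype
          operator: In
          values:
            - ssd
    - matchExpressions:   # term 2: an alternative (OR with term 1)
        - key: accelerator
          operator: Exists
```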
preferredDuringSchedulingIgnoredDuringExecution (Soft)
This is a soft constraint with a weight. The scheduler tries to honor it but will place the pod on a non-matching node if necessary.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-server
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80 # Higher weight = stronger preference
          preference:
            matchExpressions:
              - key: disktype
                operator: In
                values:
                  - ssd
        - weight: 20 # Lower-priority preference
          preference:
            matchExpressions:
              - key: region
                operator: In
                values:
                  - us-east-1
  containers:
    - name: nginx
      image: nginx:1.27
```
The weight field ranges from 1 to 100. When scoring nodes, the scheduler adds the weight to a node's score for each satisfied preference. A node that matches the SSD preference gets +80, and one that also matches the region gets +100 total.
Node Affinity Operators
| Operator | Meaning | Example |
|---|---|---|
| `In` | Label value is in the specified set | `region In [us-east-1, us-west-2]` |
| `NotIn` | Label value is not in the set | `env NotIn [production]` |
| `Exists` | Label key exists (value ignored) | `gpu Exists` |
| `DoesNotExist` | Label key does not exist | `spot DoesNotExist` |
| `Gt` | Label value is greater than (integers only) | `cpu-cores Gt 8` |
| `Lt` | Label value is less than (integers only) | `cpu-cores Lt 64` |
The `Gt` and `Lt` operators are particularly useful for numeric node labels. For example, if nodes are labeled with `cpu-cores: 16`, you can target nodes with more than eight cores:
```yaml
matchExpressions:
  - key: cpu-cores
    operator: Gt
    values:
      - "8" # String representation of an integer
```
Pod Affinity & Anti-Affinity
Instead of matching node labels, pod affinity and anti-affinity match pod labels. This lets you express rules like "run this pod near that other pod" or "keep these pods apart."
The topologyKey Concept
Every pod affinity or anti-affinity rule requires a topologyKey. This field specifies the domain over which the rule applies by referencing a node label:
- `kubernetes.io/hostname` -- same node (the most granular scope)
- `topology.kubernetes.io/zone` -- same availability zone
- `topology.kubernetes.io/region` -- same region
- Any custom node label (e.g., `rack`, `building`)
The scheduler groups nodes by their value for the specified label. Two nodes with topology.kubernetes.io/zone: us-east-1a are in the same topology domain.
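As an illustration (hypothetical node names), two nodes carrying the same zone label form one domain for zone-scoped rules but two distinct domains for node-scoped rules:

```yaml
# Node labels (abbreviated). node-a and node-b are in the SAME domain for
# topologyKey: topology.kubernetes.io/zone, but in DIFFERENT domains for
# topologyKey: kubernetes.io/hostname.
- metadata:
    name: node-a
    labels:
      kubernetes.io/hostname: node-a
      topology.kubernetes.io/zone: us-east-1a
- metadata:
    name: node-b
    labels:
      kubernetes.io/hostname: node-b
      topology.kubernetes.io/zone: us-east-1a
```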
Pod Affinity (Co-location)
Schedule pods together in the same topology domain to reduce latency:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - redis-cache # Co-locate with Redis pods
              topologyKey: kubernetes.io/hostname # Same node
      containers:
        - name: web
          image: my-web-app:latest
```
Use case: A web server that makes hundreds of requests per second to a Redis cache. Running them on the same node eliminates network hops and reduces latency from milliseconds to microseconds.
Pod Anti-Affinity (Spreading)
Keep pods apart for high availability:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - web # Don't co-locate with other web pods
              topologyKey: topology.kubernetes.io/zone # Spread across AZs
      containers:
        - name: web
          image: my-web-app:latest
```
With 3 replicas and anti-affinity across zones, each replica lands in a different availability zone. If an entire AZ goes down, two-thirds of your capacity remains.
Warning: If you have only 2 availability zones and request 3 replicas with hard anti-affinity across zones, one replica will stay Pending forever. Use preferred anti-affinity in such cases.
Preferred Pod Anti-Affinity with Weights
For more flexible spreading:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 5
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - api
                topologyKey: topology.kubernetes.io/zone
            - weight: 50
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - api
                topologyKey: kubernetes.io/hostname
      containers:
        - name: api
          image: api-service:latest
```
This configuration says: "Strongly prefer different zones (weight 100), and somewhat prefer different nodes (weight 50)." The scheduler will do its best to spread replicas across zones and nodes but will not leave pods Pending if it cannot satisfy these preferences.
Real-World Example: Co-locate Web + Cache, Spread Across AZs
A production-grade pattern that combines pod affinity and anti-affinity:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        # Attract: run near cache pods in the same zone
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 70
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - redis-cache
                topologyKey: topology.kubernetes.io/zone
        # Repel: spread web replicas across nodes
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - web
                topologyKey: kubernetes.io/hostname
      containers:
        - name: web
          image: web-app:latest
```
This tells the scheduler: "Put me in the same zone as my cache for low latency, but spread me across different nodes for fault tolerance."
Combining Affinity with Taints
Affinity and taints are complementary tools:
| Mechanism | Direction | Scope |
|---|---|---|
| Taint + Toleration | Repels unwanted pods | Node-level |
| Node Affinity | Attracts pods to nodes | Node-level |
| Pod Affinity | Attracts pods to pods | Pod-level |
| Pod Anti-Affinity | Repels pods from pods | Pod-level |
A common pattern for dedicated hardware:
- Taint the GPU node: `kubectl taint nodes gpu-1 hardware=gpu:NoSchedule`
- Label the GPU node: `kubectl label nodes gpu-1 hardware=gpu`
- Tolerate the taint in the ML pod spec.
- Use node affinity to require `hardware=gpu`, ensuring the pod lands on a GPU node (not just any node the toleration permits).
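Putting those steps together, the pod spec might look like this (a sketch; the `hardware=gpu` taint and label follow the commands above, and the image name is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-trainer
spec:
  tolerations:            # step 3: permit scheduling onto the tainted node
    - key: hardware
      operator: Equal
      value: gpu
      effect: NoSchedule
  affinity:               # step 4: require the GPU node's label
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: hardware
                operator: In
                values:
                  - gpu
  containers:
    - name: trainer
      image: ml-trainer:latest
```

Without step 4, the toleration alone would merely allow the pod onto the GPU node; it could still land on any untainted node.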
Performance Considerations
Node affinity is relatively cheap to compute -- the scheduler only needs to check node labels. However, pod affinity and anti-affinity are expensive because the scheduler must:
- Find all pods matching the label selector across the cluster.
- Determine which topology domains those pods occupy.
- Score every candidate node based on the results.
In large clusters (500+ nodes, 10,000+ pods), heavy use of pod affinity and anti-affinity rules can add hundreds of milliseconds to each scheduling decision. Mitigations include:
- Prefer `preferredDuringScheduling` over `requiredDuringScheduling` rules to give the scheduler more flexibility.
- Use `topologySpreadConstraints` (a newer, more efficient alternative) for simple spreading requirements.
- Limit the scope of label selectors with namespaces.
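On the last point, a pod affinity term can restrict its selector to specific namespaces via the `namespaces` field, shrinking the set of pods the scheduler must inspect (a sketch; `backend` is an illustrative namespace):

```yaml
podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: redis-cache
      namespaces:          # only consider pods in this namespace
        - backend
      topologyKey: kubernetes.io/hostname
```

If `namespaces` is omitted, the rule matches pods in the pod's own namespace.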
topologySpreadConstraints: A Modern Alternative
For simple pod spreading requirements, Kubernetes offers `topologySpreadConstraints` as a more efficient and expressive alternative to pod anti-affinity. Stable since Kubernetes 1.19, it lets you define how pods should be distributed across topology domains:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1 # Max difference in pod count between zones
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule # Hard constraint (or ScheduleAnyway)
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: web-app:latest
```
The maxSkew field controls how unevenly pods can be distributed. With maxSkew: 1 and 6 replicas across 3 zones, the scheduler ensures each zone has exactly 2 pods (a difference of at most 1 between any two zones). This is more precise than anti-affinity, which only guarantees that pods are not co-located, not that they are evenly distributed.
Common Pitfalls
- Hard anti-affinity with insufficient topology domains: If you require anti-affinity across zones but have fewer zones than replicas, some pods will be permanently Pending. Use preferred anti-affinity for flexibility.
- Forgetting topologyKey: Pod affinity without `topologyKey` is invalid. You must always specify the scope of the rule.
- Confusing node affinity with pod affinity: Node affinity matches node labels. Pod affinity matches pod labels. Using a node label in a pod affinity selector will never match anything.
- Overly broad label selectors: A pod affinity rule that matches thousands of pods forces the scheduler to inspect all of them. Use specific labels to reduce the search space.
- Ignoring "IgnoredDuringExecution": Both node and pod affinity rules are only enforced at scheduling time. If labels change after scheduling, pods are not relocated. You must manually delete pods to trigger rescheduling.
- Conflicting rules: A pod that requires affinity to `app=redis` but also requires anti-affinity to `app=redis` on the same `topologyKey` will never schedule.
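The conflicting-rules pitfall can be made concrete with an intentionally unsatisfiable fragment (do not use this; it is an anti-example):

```yaml
# Unsatisfiable: the pod must share a node with an app=redis pod
# AND must not share a node with an app=redis pod.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: redis
        topologyKey: kubernetes.io/hostname
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: redis
        topologyKey: kubernetes.io/hostname
```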
Best Practices
- Start with preferred rules: Use hard requirements only when correctness demands it (e.g., regulatory data residency). Preferred rules keep your cluster flexible under pressure.
- Use topologySpreadConstraints for even distribution: For simple "spread evenly" requirements, `topologySpreadConstraints` (stable since Kubernetes 1.19) is more efficient and expressive than pod anti-affinity.
- Label nodes consistently: Establish a labeling convention early (e.g., `topology.kubernetes.io/zone`, `node.kubernetes.io/instance-type`, `team`). Affinity rules are only as good as the labels they reference.
- Test scheduling in staging: Complex affinity rules can have surprising interactions. Test with realistic node counts and pod distributions before deploying to production.
- Combine with PodDisruptionBudgets: Anti-affinity ensures replicas are spread out. PDBs ensure enough replicas stay running during voluntary disruptions (node upgrades, scaling).
- Document your topology assumptions: If your affinity rules assume 3 availability zones, document this. Cluster expansions or cloud region changes could break these assumptions.
What's Next?
- Scheduling (Taints): Learn how taints and tolerations repel pods from nodes, the complementary mechanism to affinity.
- Priority & Preemption: Understand what happens when affinity rules conflict with resource constraints and the scheduler must preempt lower-priority pods.
- Resources: Resource requests directly affect scheduling. A pod requesting 8 CPU cores has fewer candidate nodes than one requesting 100m.
- Troubleshooting: Debug pods stuck in `Pending` due to unsatisfiable affinity rules.