Scheduling: Affinity & Anti-Affinity
- Node Affinity: Attracts Pods to specific nodes based on node labels (e.g., hardware types like SSD or GPU). It is the expressive replacement for `nodeSelector`, supporting operators like `In`, `NotIn`, `Exists`, `DoesNotExist`, `Gt`, and `Lt`.
- Hard vs. Soft Rules: `requiredDuringSchedulingIgnoredDuringExecution` rules must be met for a Pod to schedule (the Pod stays Pending otherwise), while `preferredDuringSchedulingIgnoredDuringExecution` rules act as a weighted suggestion the scheduler tries to honor but can ignore.
- Inter-Pod Affinity: Co-locates related Pods (e.g., a web server and Redis cache) on the same node or within the same topology domain (zone, rack) to reduce network latency.
- Anti-Affinity for HA: Spreads replicas of the same application across different nodes or availability zones to ensure High Availability during hardware failures or zone outages.
- topologyKey: Defines the scope of affinity and anti-affinity rules. Using `kubernetes.io/hostname` means "same node," while `topology.kubernetes.io/zone` means "same availability zone."
- Performance Warning: Pod affinity and anti-affinity require the scheduler to inspect pods on every candidate node. At scale (hundreds of nodes, thousands of pods), this can significantly slow scheduling. Use with caution in large clusters.
We covered Taints and Tolerations (which repel pods from nodes). Now let's look at Affinity, which attracts pods to nodes -- or to other pods.
Node Affinity
Node affinity allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node. It is conceptually similar to nodeSelector but far more expressive.
requiredDuringSchedulingIgnoredDuringExecution (Hard)
This is a hard constraint. If no node satisfies the rule, the pod remains Pending indefinitely.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: accelerator # Node must have this label
                operator: In
                values:
                  - nvidia-a100
                  - nvidia-h100 # Matches either GPU type
  containers:
    - name: trainer
      image: ml-trainer:latest
```
Key details about hard affinity:
- Multiple `nodeSelectorTerms` are OR-ed together (any one term matching is sufficient).
- Multiple `matchExpressions` within a single term are AND-ed (all must match).
- The "IgnoredDuringExecution" part means that if a node's labels change after a pod is scheduled, the pod is not evicted. It stays where it is.
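The OR/AND semantics can be sketched in a fragment like this (illustrative only; the `accelerator` and `disktype` labels follow the examples in this section):

```yaml
# The pod schedules if EITHER term matches:
#   term 1: accelerator=nvidia-a100 AND disktype=ssd
#   term 2: the accelerator label exists with any value
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
    - matchExpressions:   # term 1: both expressions must match (AND)
        - key: accelerator
          operator: In
          values:
            - nvidia-a100
        - key: disktype
          operator: In
          values:
            - ssd
    - matchExpressions:   # term 2: an alternative (OR with term 1)
        - key: accelerator
          operator: Exists
```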
preferredDuringSchedulingIgnoredDuringExecution (Soft)
This is a soft constraint with a weight. The scheduler tries to honor it but will place the pod on a non-matching node if necessary.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-server
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80 # Higher weight = stronger preference
          preference:
            matchExpressions:
              - key: disktype
                operator: In
                values:
                  - ssd
        - weight: 20 # Lower-priority preference
          preference:
            matchExpressions:
              - key: region
                operator: In
                values:
                  - us-east-1
  containers:
    - name: nginx
      image: nginx:1.27
```
The weight field ranges from 1 to 100. When scoring nodes, the scheduler adds the weight to a node's score for each satisfied preference. A node that matches the SSD preference gets +80, and one that also matches the region gets +100 total.
Node Affinity Operators
| Operator | Meaning | Example |
|---|---|---|
| `In` | Label value is in the specified set | `region In [us-east-1, us-west-2]` |
| `NotIn` | Label value is not in the set | `env NotIn [production]` |
| `Exists` | Label key exists (value ignored) | `gpu Exists` |
| `DoesNotExist` | Label key does not exist | `spot DoesNotExist` |
| `Gt` | Label value is greater than (integers only) | `cpu-cores Gt 8` |
| `Lt` | Label value is less than (integers only) | `cpu-cores Lt 64` |
The `Gt` and `Lt` operators are particularly useful for numeric node labels. For example, if nodes are labeled with `cpu-cores: 16`, you can target nodes with more than eight cores:
```yaml
matchExpressions:
  - key: cpu-cores
    operator: Gt
    values:
      - "8" # String representation of an integer
```
Pod Affinity & Anti-Affinity
Instead of matching node labels, pod affinity and anti-affinity match pod labels. This lets you express rules like "run this pod near that other pod" or "keep these pods apart."
The topologyKey Concept
Every pod affinity or anti-affinity rule requires a topologyKey. This field specifies the domain over which the rule applies by referencing a node label:
- `kubernetes.io/hostname` -- same node (the most granular scope)
- `topology.kubernetes.io/zone` -- same availability zone
- `topology.kubernetes.io/region` -- same region
- Any custom node label (e.g., `rack`, `building`)
The scheduler groups nodes by their value for the specified label. Two nodes with topology.kubernetes.io/zone: us-east-1a are in the same topology domain.
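As an illustration (hypothetical node names), two nodes carrying the same zone label form one domain for zone-scoped rules but two distinct domains for node-scoped rules:

```yaml
# Node labels (abbreviated). node-a and node-b are in the SAME domain for
# topologyKey: topology.kubernetes.io/zone, but in DIFFERENT domains for
# topologyKey: kubernetes.io/hostname.
- metadata:
    name: node-a
    labels:
      kubernetes.io/hostname: node-a
      topology.kubernetes.io/zone: us-east-1a
- metadata:
    name: node-b
    labels:
      kubernetes.io/hostname: node-b
      topology.kubernetes.io/zone: us-east-1a
```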
Pod Affinity (Co-location)
Schedule pods together in the same topology domain to reduce latency:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - redis-cache # Co-locate with Redis pods
              topologyKey: kubernetes.io/hostname # Same node
      containers:
        - name: web
          image: my-web-app:latest
```
Use case: A web server that makes hundreds of requests per second to a Redis cache. Running them on the same node eliminates network hops and reduces latency from milliseconds to microseconds.
Pod Anti-Affinity (Spreading)
Keep pods apart for high availability:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - web # Don't co-locate with other web pods
              topologyKey: topology.kubernetes.io/zone # Spread across AZs
      containers:
        - name: web
          image: my-web-app:latest
```
With 3 replicas and anti-affinity across zones, each replica lands in a different availability zone. If an entire AZ goes down, two-thirds of your capacity remains.
Warning: If you have only 2 availability zones and request 3 replicas with hard anti-affinity across zones, one replica will stay Pending forever. Use preferred anti-affinity in such cases.
Preferred Pod Anti-Affinity with Weights
For more flexible spreading:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 5
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - api
                topologyKey: topology.kubernetes.io/zone
            - weight: 50
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - api
                topologyKey: kubernetes.io/hostname
      containers:
        - name: api
          image: api-service:latest
```
This configuration says: "Strongly prefer different zones (weight 100), and somewhat prefer different nodes (weight 50)." The scheduler will do its best to spread replicas across zones and nodes but will not leave pods Pending if it cannot satisfy these preferences.
Real-World Example: Co-locate Web + Cache, Spread Across AZs
A production-grade pattern that combines pod affinity and anti-affinity:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        # Attract: run near cache pods in the same zone
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 70
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - redis-cache
                topologyKey: topology.kubernetes.io/zone
        # Repel: spread web replicas across nodes
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - web
                topologyKey: kubernetes.io/hostname
      containers:
        - name: web
          image: web-app:latest
```
This tells the scheduler: "Put me in the same zone as my cache for low latency, but spread me across different nodes for fault tolerance."
Combining Affinity with Taints
Affinity and taints are complementary tools:
| Mechanism | Direction | Scope |
|---|---|---|
| Taint + Toleration | Repels unwanted pods | Node-level |
| Node Affinity | Attracts pods to nodes | Node-level |
| Pod Affinity | Attracts pods to pods | Pod-level |
| Pod Anti-Affinity | Repels pods from pods | Pod-level |
A common pattern for dedicated hardware:
- Taint the GPU node: `kubectl taint nodes gpu-1 hardware=gpu:NoSchedule`
- Label the GPU node: `kubectl label nodes gpu-1 hardware=gpu`
- Tolerate the taint in the ML pod spec.
- Use node affinity to require `hardware=gpu`, ensuring the pod lands on a GPU node (not just any node the toleration permits).
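Putting those steps together, the pod spec might look like this (a sketch; the `hardware=gpu` taint and label follow the commands above, and the image name is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-trainer
spec:
  tolerations:            # step 3: permit scheduling onto the tainted node
    - key: hardware
      operator: Equal
      value: gpu
      effect: NoSchedule
  affinity:               # step 4: require the GPU node's label
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: hardware
                operator: In
                values:
                  - gpu
  containers:
    - name: trainer
      image: ml-trainer:latest
```

Without step 4, the toleration alone would merely allow the pod onto the GPU node; it could still land on any untainted node.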
Performance Considerations
Node affinity is relatively cheap to compute -- the scheduler only needs to check node labels. However, pod affinity and anti-affinity are expensive because the scheduler must:
- Find all pods matching the label selector across the cluster.
- Determine which topology domains those pods occupy.
- Score every candidate node based on the results.
In large clusters (500+ nodes, 10,000+ pods), heavy use of pod affinity and anti-affinity rules can add hundreds of milliseconds to each scheduling decision. Mitigations include:
- Prefer `preferredDuringScheduling` over `requiredDuringScheduling` rules to give the scheduler more flexibility.
- Use `topologySpreadConstraints` (a newer, more efficient alternative) for simple spreading requirements.
- Limit the scope of label selectors with namespaces.
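On the last point, a pod affinity term can restrict its selector to specific namespaces via the `namespaces` field, shrinking the set of pods the scheduler must inspect (a sketch; `backend` is an illustrative namespace):

```yaml
podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: redis-cache
      namespaces:          # only consider pods in this namespace
        - backend
      topologyKey: kubernetes.io/hostname
```

If `namespaces` is omitted, the rule matches pods in the pod's own namespace.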
topologySpreadConstraints: A Modern Alternative
For simple pod spreading requirements, Kubernetes offers `topologySpreadConstraints` as a more efficient and expressive alternative to pod anti-affinity. Stable since Kubernetes 1.19, it lets you define how pods should be distributed across topology domains:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1 # Max difference in pod count between zones
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule # Hard constraint (or ScheduleAnyway)
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: web-app:latest
```
The maxSkew field controls how unevenly pods can be distributed. With maxSkew: 1 and 6 replicas across 3 zones, the scheduler ensures each zone has exactly 2 pods (a difference of at most 1 between any two zones). This is more precise than anti-affinity, which only guarantees that pods are not co-located, not that they are evenly distributed.
Common Pitfalls
- Hard anti-affinity with insufficient topology domains: If you require anti-affinity across zones but have fewer zones than replicas, some pods will be permanently Pending. Use preferred anti-affinity for flexibility.
- Forgetting topologyKey: Pod affinity without `topologyKey` is invalid. You must always specify the scope of the rule.
- Confusing node affinity with pod affinity: Node affinity matches node labels. Pod affinity matches pod labels. Using a node label in a pod affinity selector will never match anything.
- Overly broad label selectors: A pod affinity rule that matches thousands of pods forces the scheduler to inspect all of them. Use specific labels to reduce the search space.
- Ignoring "IgnoredDuringExecution": Both node and pod affinity rules are only enforced at scheduling time. If labels change after scheduling, pods are not relocated. You must manually delete pods to trigger rescheduling.
- Conflicting rules: A pod that requires affinity to `app=redis` but also requires anti-affinity to `app=redis` on the same `topologyKey` will never schedule.
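The conflicting-rules pitfall can be made concrete with an intentionally unsatisfiable fragment (do not use this; it is an anti-example):

```yaml
# Unsatisfiable: the pod must share a node with an app=redis pod
# AND must not share a node with an app=redis pod.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: redis
        topologyKey: kubernetes.io/hostname
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: redis
        topologyKey: kubernetes.io/hostname
```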
Best Practices
- Start with preferred rules: Use hard requirements only when correctness demands it (e.g., regulatory data residency). Preferred rules keep your cluster flexible under pressure.
- Use topologySpreadConstraints for even distribution: For simple "spread evenly" requirements, `topologySpreadConstraints` (stable since Kubernetes 1.19) is more efficient and expressive than pod anti-affinity.
- Label nodes consistently: Establish a labeling convention early (e.g., `topology.kubernetes.io/zone`, `node.kubernetes.io/instance-type`, `team`). Affinity rules are only as good as the labels they reference.
- Test scheduling in staging: Complex affinity rules can have surprising interactions. Test with realistic node counts and pod distributions before deploying to production.
- Combine with PodDisruptionBudgets: Anti-affinity ensures replicas are spread out. PDBs ensure enough replicas stay running during voluntary disruptions (node upgrades, scaling).
- Document your topology assumptions: If your affinity rules assume 3 availability zones, document this. Cluster expansions or cloud region changes could break these assumptions.
What's Next?
- Scheduling (Taints): Learn how taints and tolerations repel pods from nodes, the complementary mechanism to affinity.
- Priority & Preemption: Understand what happens when affinity rules conflict with resource constraints and the scheduler must preempt lower-priority pods.
- Resources: Resource requests directly affect scheduling. A pod requesting 8 CPU cores has fewer candidate nodes than one requesting 100m.
- Troubleshooting: Debug pods stuck in `Pending` due to unsatisfiable affinity rules.