
Pod Priority & Preemption

Key Takeaways
  • Workload Tiering: Assign different importance levels to Pods using PriorityClasses to ensure critical services remain running when the cluster is under resource pressure.
  • Preemption Mechanics: When a high-priority pod cannot be scheduled due to insufficient resources, the scheduler identifies lower-priority pods to evict, respecting graceful termination periods and PodDisruptionBudgets where possible.
  • System Priority Classes: Kubernetes ships with system-cluster-critical (priority 2000000000) and system-node-critical (priority 2000001000) for essential infrastructure components. User-defined priorities should stay well below these values.
  • Scheduling Order: Priority affects scheduling order, not just preemption. Higher-priority pending pods are scheduled before lower-priority ones, even without preemption.
  • preemptionPolicy: Set to Never to allow a pod to jump the scheduling queue without evicting other pods. Useful for important batch jobs that should not disrupt running workloads.
  • Graceful Termination: Even during preemption, evicted Pods receive their configured terminationGracePeriodSeconds to finish tasks, save state, or deregister from service discovery.

Not all applications are equal. Your production database is more important than a nightly log export job. When a cluster is at capacity, Kubernetes needs to know which pods to keep and which to evict to make room for critical workloads. This is where Pod Priority and Preemption come in.

How Priority Affects the Scheduler

Priority influences the Kubernetes scheduler in two distinct ways:

  1. Scheduling order: When multiple pods are pending, higher-priority pods are sorted ahead of lower-priority ones in the scheduling queue, regardless of creation time.
  2. Preemption: When a high-priority pod cannot be scheduled because no node has enough resources, the scheduler can evict lower-priority pods to free up capacity.
[Diagram: a busy node at capacity (3 pods), running three low-priority pods.]
Preemption: When a node is full, the scheduler evicts (preempts) lower-priority pods to make room for higher-priority pods, such as those using a Critical PriorityClass.

Defining a PriorityClass

A PriorityClass is a cluster-scoped object that maps a name to an integer priority value:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000            # Higher number = higher priority
globalDefault: false      # If true, all pods without a class get this
description: "Use for critical backend services like APIs and databases."
preemptionPolicy: PreemptLowerPriority  # Default: can preempt lower-priority pods

Key fields explained:

  • value: An integer from -2,147,483,648 to 1,000,000,000. Values above 1 billion are reserved for system use.
  • globalDefault: At most one PriorityClass can be the global default. Pods without an explicit priorityClassName receive this priority. If no global default exists, those pods get priority 0.
  • preemptionPolicy: Either PreemptLowerPriority (default) or Never.
  • description: A human-readable string explaining when to use this class.
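As a quick sketch of the workflow, you can apply the manifest above and inspect the resulting classes with kubectl (the filename high-priority.yaml is an assumption for illustration):

```shell
# Create the PriorityClass from the manifest above
# (assumes it was saved as high-priority.yaml)
kubectl apply -f high-priority.yaml

# List all PriorityClasses, including the built-in system classes
kubectl get priorityclasses

# Inspect one class in detail
kubectl describe priorityclass high-priority
```

Because PriorityClasses are cluster-scoped, no namespace flag is needed.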

Assigning a PriorityClass to a Pod

Reference the PriorityClass by name in the pod spec:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment
  template:
    metadata:
      labels:
        app: payment
    spec:
      priorityClassName: high-priority  # Reference the PriorityClass
      containers:
        - name: payment
          image: payment-service:v2.3
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
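At admission time, Kubernetes resolves priorityClassName into an integer .spec.priority field on each Pod. A hedged sketch of verifying this after the Deployment above rolls out:

```shell
# Show the resolved integer priority alongside the class name
kubectl get pods -l app=payment \
  -o custom-columns=NAME:.metadata.name,PRIORITY:.spec.priority,CLASS:.spec.priorityClassName
```

If the class does not exist when the pod is created, admission rejects the pod, so a typo in priorityClassName fails fast rather than silently defaulting.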

System Priority Classes

Kubernetes provides two built-in PriorityClasses for critical system components:

Name                     Value         Purpose
system-node-critical     2000001000    Node-level essentials (kubelet, kube-proxy, CNI plugins)
system-cluster-critical  2000000000    Cluster-level essentials (CoreDNS, metrics-server, cloud controllers)

These classes are reserved for infrastructure pods and, by default, may only be used by pods in the kube-system namespace. Using them for application workloads defeats their purpose and can cause cascading failures if system pods get preempted.

# Example: CoreDNS uses system-cluster-critical
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: kube-system
spec:
  template:
    spec:
      priorityClassName: system-cluster-critical
      containers:
        - name: coredns
          image: registry.k8s.io/coredns/coredns:v1.11.1

How Preemption Works Step by Step

When the scheduler cannot find a node for a pending high-priority pod, the preemption algorithm executes:

  1. Identify candidate nodes: The scheduler examines each node to determine if removing some lower-priority pods would make room for the pending pod.

  2. Select victims: On each candidate node, the scheduler identifies the minimum set of lower-priority pods whose eviction would free enough resources. It prefers to evict:

    • Pods with the lowest priority first.
    • Pods that would violate the fewest PodDisruptionBudgets.
  3. Respect PDBs (best effort): The scheduler tries to avoid violating PodDisruptionBudgets. If all preemption options would violate a PDB, the scheduler may still proceed, but it prefers paths that do not.

  4. Choose the best node: Among all candidate nodes, the scheduler picks the one where preemption causes the least disruption (fewest evictions, lowest-priority victims, fewest PDB violations).

  5. Evict victims: The selected pods receive a deletion signal with their configured terminationGracePeriodSeconds. They are not killed instantly.

  6. Wait and schedule: The scheduler sets a nominatedNodeName on the pending pod. Once the victims terminate and release their resources, the pending pod is scheduled. If conditions change in the meantime (e.g., another pod frees resources on a different node), the scheduler may choose a different node.

Real-World Tiering Example

Here is a practical five-tier priority scheme for a production cluster:

# Tier 1: Critical infrastructure (never preempted by user workloads)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical
value: 1000000
description: "Production databases, payment processing, auth services."
---
# Tier 2: High-priority production services
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high
value: 750000
description: "Production APIs, web frontends, customer-facing services."
---
# Tier 3: Medium-priority supporting services
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium
value: 500000
globalDefault: true  # Default for all untagged pods
description: "Internal tools, monitoring, CI/CD pipelines."
---
# Tier 4: Low-priority background work
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low
value: 250000
description: "Log aggregation, analytics, non-urgent processing."
---
# Tier 5: Batch/preemptible workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch
value: 100000
preemptionPolicy: Never  # Can be preempted but won't preempt others
description: "Batch jobs, ML training, data migrations. Safe to evict."

With this scheme, if the cluster is full and a new critical payment pod needs to schedule, the scheduler will evict batch or low pods first. The medium tier serves as the global default, so developers who forget to set a priority class get a reasonable default.
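Once the tiers are applied, a quick way to review the scheme at a glance:

```shell
# List all PriorityClasses ordered by priority value
kubectl get priorityclasses --sort-by=.value
```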

preemptionPolicy: Never

Setting preemptionPolicy: Never creates an interesting behavior: the pod benefits from priority-based scheduling order (it jumps ahead in the queue) but will never trigger preemption of other pods. If resources are unavailable, it simply waits.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: important-batch
value: 800000
preemptionPolicy: Never  # High in queue, but won't evict anything
description: "Important batch jobs that should schedule soon but not disrupt services."

This is ideal for batch workloads that are time-sensitive but should not disrupt running services. They get scheduled before lower-priority pods waiting in the queue, but they wait patiently for natural resource availability.
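A minimal sketch of a Job that uses this class; the Job name, image, and resource figures are illustrative assumptions:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report              # illustrative name
spec:
  template:
    spec:
      priorityClassName: important-batch  # jumps the queue, never evicts
      restartPolicy: Never
      containers:
        - name: report
          image: report-runner:v1   # illustrative image
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
```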

Interaction with PodDisruptionBudgets

PodDisruptionBudgets (PDBs) protect applications from losing too many replicas at once. During preemption, the scheduler respects PDBs on a best-effort basis:

  • If the scheduler can satisfy the pending pod without violating any PDB, it will choose that path.
  • If all options violate a PDB, the scheduler will still preempt: the high-priority pod's need takes precedence.
  • PDBs are more strictly respected during voluntary disruptions (node drains, cluster upgrades) than during preemption.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: redis-pdb
spec:
  minAvailable: 2  # At least 2 replicas must stay running
  selector:
    matchLabels:
      app: redis

If Redis has 3 replicas and a PDB requiring minAvailable: 2, the scheduler will preempt at most 1 Redis pod. If it needs to preempt 2 Redis pods to fit the high-priority pod, it will look for better options on other nodes first.

When NOT to Use Preemption

Preemption is a powerful tool, but it is not always appropriate:

  1. Stateful workloads with long shutdown times: If your database takes 5 minutes to flush to disk, frequent preemption can cause data corruption or prolonged downtime. Protect these with high priority and PDBs.

  2. Workloads with expensive startup costs: ML training jobs that have been running for hours lose all progress when preempted (unless they checkpoint). Either give them high priority or use preemptionPolicy: Never.

  3. Clusters with adequate capacity: If your cluster is rarely at full capacity, preemption adds complexity without benefit. Focus on right-sizing your cluster instead.

  4. Multi-tenant clusters without governance: If every team assigns critical priority to their pods, preemption becomes meaningless. Enforce priority assignment through admission controllers or OPA/Gatekeeper policies.

Common Pitfalls

  1. Priority inflation: Without governance, teams will assign the highest priority to all their pods. Use admission webhooks or policy engines to enforce that only approved workloads can use high-priority classes.

  2. Forgetting globalDefault: If no PriorityClass has globalDefault: true, pods without a priorityClassName get priority 0 and are the first to be preempted. Always set a sensible default.

  3. Using system priorities for applications: Assigning system-cluster-critical to application pods can prevent CoreDNS, kube-proxy, or other system components from scheduling, bringing down the entire cluster.

  4. Not setting resource requests: Preemption works based on resource requests, not actual usage. If your pods lack resource requests, the scheduler cannot accurately calculate whether evicting them would free enough capacity.

  5. Ignoring nominatedNodeName: After preemption, the pending pod has a nominatedNodeName but is not yet scheduled. Other high-priority pods might "steal" the freed resources. This is normal behavior, not a bug.

  6. Cascading preemption: A preempted pod that is high-priority itself might preempt another pod when it is rescheduled on a different node, causing a chain reaction. Design your priority tiers with sufficient gaps.

Best Practices

  1. Define 3 to 5 priority tiers: Too many tiers create confusion. A simple critical/high/medium/low/batch scheme covers most use cases.

  2. Set a global default: Choose a middle-tier priority as the global default so untagged workloads have a reasonable placement.

  3. Protect with PDBs: Combine priority with PodDisruptionBudgets to limit the blast radius of preemption on stateful or replicated services.

  4. Use preemptionPolicy: Never for batch: Batch jobs should get scheduling priority (they jump the queue) but should not evict running services.

  5. Enforce with policy: Use OPA/Gatekeeper, Kyverno, or admission webhooks to prevent teams from assigning priorities above their allocated tier.

  6. Monitor preemption events: Set up alerts for preemption events (kubectl get events --field-selector reason=Preempted). Frequent preemption is a signal that your cluster needs more capacity.
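As one way to implement the policy enforcement above, here is a hedged sketch of a Kyverno ClusterPolicy that blocks the critical class outside an approved namespace. The policy name and the "platform" namespace are assumptions for illustration:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-critical-priority  # illustrative name
spec:
  validationFailureAction: Enforce
  rules:
    - name: critical-only-in-platform
      match:
        any:
          - resources:
              kinds:
                - Pod
      exclude:
        any:
          - resources:
              namespaces:
                - platform          # assumed approved namespace
      validate:
        message: "Only the platform namespace may use the critical PriorityClass."
        pattern:
          spec:
            priorityClassName: "!critical"  # reject this class elsewhere
```

Equivalent rules can be written with OPA/Gatekeeper or a custom validating admission webhook.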

What's Next?

  • Scheduling (Taints): Learn how taints control which pods can land on which nodes, working alongside priority to determine placement.
  • Scheduling (Affinity): Understand how affinity rules interact with priority when the scheduler must balance placement preferences with resource constraints.
  • Resources: Resource requests and limits are the currency that priority and preemption operate on.
  • Troubleshooting: Debug pods that are unexpectedly preempted or stuck in Pending despite having high priority.