Pod Priority & Preemption
- Workload Tiering: Assign different importance levels to Pods using `PriorityClasses` to ensure critical services remain running when the cluster is under resource pressure.
- Preemption Mechanics: When a high-priority pod cannot be scheduled due to insufficient resources, the scheduler identifies lower-priority pods to evict, respecting graceful termination periods and PodDisruptionBudgets where possible.
- System Priority Classes: Kubernetes ships with `system-cluster-critical` (priority 2000000000) and `system-node-critical` (priority 2000001000) for essential infrastructure components. User-defined priorities should stay well below these values.
- Scheduling Order: Priority affects scheduling order, not just preemption. Higher-priority pending pods are scheduled before lower-priority ones, even without preemption.
- preemptionPolicy: Set to `Never` to allow a pod to jump the scheduling queue without evicting other pods. Useful for important batch jobs that should not disrupt running workloads.
- Graceful Termination: Even during preemption, evicted Pods receive their configured `terminationGracePeriodSeconds` to finish tasks, save state, or deregister from service discovery.
Not all applications are equal. Your production database is more important than a nightly log export job. When a cluster is at capacity, Kubernetes needs to know which pods to keep and which to evict to make room for critical workloads. This is where Pod Priority and Preemption come in.
How Priority Affects the Scheduler
Priority influences the Kubernetes scheduler in two distinct ways:
- Scheduling order: When multiple pods are pending, higher-priority pods are placed in the scheduling queue first, regardless of creation time.
- Preemption: When a high-priority pod cannot be scheduled because no node has enough resources, the scheduler can evict lower-priority pods to free up capacity.
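The queue-ordering rule can be sketched in a few lines of Python. This is an illustration of the ordering logic only, not the scheduler's actual implementation; the pod names and timestamps are invented for the example:

```python
from dataclasses import dataclass

@dataclass
class PendingPod:
    name: str
    priority: int        # resolved from the pod's PriorityClass
    created_at: float    # creation timestamp, in seconds

def scheduling_order(queue: list[PendingPod]) -> list[PendingPod]:
    # Higher priority first; ties broken by earlier creation time.
    return sorted(queue, key=lambda p: (-p.priority, p.created_at))

pods = [
    PendingPod("batch-job", 100_000, created_at=10.0),
    PendingPod("payment-api", 1_000_000, created_at=50.0),
    PendingPod("log-export", 100_000, created_at=5.0),
]
print([p.name for p in scheduling_order(pods)])
# payment-api is placed first despite being created last
```

Note that `payment-api` jumps ahead of two older pending pods purely on priority; creation time only matters within the same priority level.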
Defining a PriorityClass
A PriorityClass is a cluster-scoped object that maps a name to an integer priority value:
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000              # Higher number = higher priority
globalDefault: false        # If true, all pods without a class get this
description: "Use for critical backend services like APIs and databases."
preemptionPolicy: PreemptLowerPriority  # Default: can preempt lower-priority pods
```
Key fields explained:
- value: An integer from -2,147,483,648 to 1,000,000,000. Values above 1 billion are reserved for system use.
- globalDefault: At most one PriorityClass can be the global default. Pods without an explicit `priorityClassName` receive this priority. If no global default exists, those pods get priority 0.
- preemptionPolicy: Either `PreemptLowerPriority` (default) or `Never`.
- description: A human-readable string explaining when to use this class.
Assigning a PriorityClass to a Pod
Reference the PriorityClass by name in the pod spec:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment
  template:
    metadata:
      labels:
        app: payment
    spec:
      priorityClassName: high-priority  # Reference the PriorityClass
      containers:
        - name: payment
          image: payment-service:v2.3
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
```
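Once the Deployment is applied, you can confirm the resolved value: the PriorityClass admission plugin copies the class's integer into each pod's `spec.priority`. These commands assume the `payment` Deployment above:

```shell
# Show each payment pod with its resolved integer priority
kubectl get pods -l app=payment \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.priority}{"\n"}{end}'

# List all PriorityClasses defined in the cluster
kubectl get priorityclasses
```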
System Priority Classes
Kubernetes provides two built-in PriorityClasses for critical system components:
| Name | Value | Purpose |
|---|---|---|
| system-node-critical | 2000001000 | Node-level essentials (kubelet, kube-proxy, CNI plugins) |
| system-cluster-critical | 2000000000 | Cluster-level essentials (CoreDNS, metrics-server, cloud controllers) |
By default, these classes can only be used by Pods in the kube-system namespace, and they should be reserved for infrastructure pods. Using them for application workloads defeats their purpose and can cause cascading failures if system pods get preempted.
```yaml
# Example: CoreDNS uses system-cluster-critical
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: kube-system
spec:
  template:
    spec:
      priorityClassName: system-cluster-critical
      containers:
        - name: coredns
          image: registry.k8s.io/coredns/coredns:v1.11.1
```
How Preemption Works Step by Step
When the scheduler cannot find a node for a pending high-priority pod, the preemption algorithm executes:
1. Identify candidate nodes: The scheduler examines each node to determine if removing some lower-priority pods would make room for the pending pod.
2. Select victims: On each candidate node, the scheduler identifies the minimum set of lower-priority pods whose eviction would free enough resources. It prefers to evict:
   - Pods with the lowest priority first.
   - Pods that would violate the fewest PodDisruptionBudgets.
3. Respect PDBs (best effort): The scheduler tries to avoid violating PodDisruptionBudgets. If all preemption options would violate a PDB, the scheduler may still proceed, but it prefers paths that do not.
4. Choose the best node: Among all candidate nodes, the scheduler picks the one where preemption causes the least disruption (fewest evictions, lowest-priority victims, fewest PDB violations).
5. Evict victims: The selected pods receive a deletion signal with their configured `terminationGracePeriodSeconds`. They are not killed instantly.
6. Wait and schedule: The scheduler sets a `nominatedNodeName` on the pending pod. Once the victims terminate and release their resources, the pending pod is scheduled. If conditions change in the meantime (e.g., another pod frees resources on a different node), the scheduler may choose a different node.
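The victim-selection step can be illustrated with a deliberately simplified sketch: one node, CPU only, lowest-priority pods evicted first. The real scheduler also weighs PDB violations, other resources, and node scoring, and all pod names and numbers here are invented:

```python
from dataclasses import dataclass

@dataclass
class RunningPod:
    name: str
    priority: int
    cpu_request: int  # millicores

def select_victims(node_pods, free_cpu, pending_priority, pending_cpu):
    """Evict lowest-priority pods first until the pending pod fits.

    Returns the list of victims, or None if preemption cannot help on this node.
    """
    victims = []
    # Only pods with strictly lower priority are eligible victims.
    candidates = sorted(
        (p for p in node_pods if p.priority < pending_priority),
        key=lambda p: p.priority,
    )
    for pod in candidates:
        if free_cpu >= pending_cpu:
            break
        victims.append(pod)
        free_cpu += pod.cpu_request
    return victims if free_cpu >= pending_cpu else None

node = [
    RunningPod("batch-1", 100_000, cpu_request=1000),
    RunningPod("web-1", 750_000, cpu_request=500),
    RunningPod("log-1", 250_000, cpu_request=500),
]
victims = select_victims(node, free_cpu=200, pending_priority=1_000_000, pending_cpu=1200)
print([v.name for v in victims])  # evicting batch-1 alone frees enough CPU
```

Note that `web-1` and `log-1` survive: once evicting `batch-1` frees enough capacity, the search stops, matching the "minimum set of lower-priority pods" rule above.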
Real-World Tiering Example
Here is a practical five-tier priority scheme for a production cluster:
```yaml
# Tier 1: Critical infrastructure (never preempted by user workloads)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical
value: 1000000
description: "Production databases, payment processing, auth services."
---
# Tier 2: High-priority production services
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high
value: 750000
description: "Production APIs, web frontends, customer-facing services."
---
# Tier 3: Medium-priority supporting services
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium
value: 500000
globalDefault: true  # Default for all untagged pods
description: "Internal tools, monitoring, CI/CD pipelines."
---
# Tier 4: Low-priority background work
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low
value: 250000
description: "Log aggregation, analytics, non-urgent processing."
---
# Tier 5: Batch/preemptible workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch
value: 100000
preemptionPolicy: Never  # Can be preempted but won't preempt others
description: "Batch jobs, ML training, data migrations. Safe to evict."
```
With this scheme, if the cluster is full and a new critical payment pod needs to schedule, the scheduler will evict batch or low pods first. The medium tier serves as the global default, so developers who forget to set a priority class get a reasonable default.
preemptionPolicy: Never
Setting `preemptionPolicy: Never` creates an interesting behavior: the pod benefits from priority-based scheduling order (it jumps ahead in the queue) but will never trigger preemption of other pods. If resources are unavailable, it simply waits.
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: important-batch
value: 800000
preemptionPolicy: Never  # High in queue, but won't evict anything
description: "Important batch jobs that should schedule soon but not disrupt services."
```
This is ideal for batch workloads that are time-sensitive but should not disrupt running services. They get scheduled before lower-priority pods waiting in the queue, but they wait patiently for natural resource availability.
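The two halves of this behavior can be sketched side by side: `preemptionPolicy` does not change queue order, only eligibility to evict. This is an illustration of the rule, not scheduler code, and the pod names are invented:

```python
pending = [
    {"name": "important-batch", "priority": 800_000, "preemptionPolicy": "Never"},
    {"name": "web-deploy", "priority": 500_000, "preemptionPolicy": "PreemptLowerPriority"},
]

# Queue order still favors the higher-priority pod...
queue = sorted(pending, key=lambda p: -p["priority"])
print([p["name"] for p in queue])  # important-batch schedules first

# ...but only pods without preemptionPolicy: Never may evict others.
can_preempt = [p["name"] for p in queue if p["preemptionPolicy"] != "Never"]
print(can_preempt)  # web-deploy could trigger preemption; important-batch waits
```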
Interaction with PodDisruptionBudgets
PodDisruptionBudgets (PDBs) protect applications from losing too many replicas at once. During preemption, the scheduler respects PDBs on a best-effort basis:
- If the scheduler can satisfy the pending pod without violating any PDB, it will choose that path.
- If all options violate a PDB, the scheduler will still preempt -- the high-priority pod's need takes precedence.
- PDBs are more strictly respected during voluntary disruptions (node drains, cluster upgrades) than during preemption.
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: redis-pdb
spec:
  minAvailable: 2  # At least 2 replicas must stay running
  selector:
    matchLabels:
      app: redis
```
If Redis has 3 replicas and a PDB requiring minAvailable: 2, the scheduler will preempt at most 1 Redis pod. If it needs to preempt 2 Redis pods to fit the high-priority pod, it will look for better options on other nodes first.
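The arithmetic behind that limit is simple to state as a helper (hypothetical function; real PDB accounting also considers pod health and in-flight evictions):

```python
def preemptable_within_pdb(healthy_replicas: int, min_available: int) -> int:
    # Pods that can be evicted without dropping below the PDB floor.
    return max(healthy_replicas - min_available, 0)

print(preemptable_within_pdb(3, 2))  # 1: at most one Redis pod may be preempted
print(preemptable_within_pdb(2, 2))  # 0: no Redis pod can be preempted safely
```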
When NOT to Use Preemption
Preemption is a powerful tool, but it is not always appropriate:
- Stateful workloads with long shutdown times: If your database takes 5 minutes to flush to disk, frequent preemption can cause data corruption or prolonged downtime. Protect these with high priority and PDBs.
- Workloads with expensive startup costs: ML training jobs that have been running for hours lose all progress when preempted (unless they checkpoint). Either give them high priority or use `preemptionPolicy: Never`.
- Clusters with adequate capacity: If your cluster is rarely at full capacity, preemption adds complexity without benefit. Focus on right-sizing your cluster instead.
- Multi-tenant clusters without governance: If every team assigns `critical` priority to their pods, preemption becomes meaningless. Enforce priority assignment through admission controllers or OPA/Gatekeeper policies.
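As one concrete governance option, a Kyverno ClusterPolicy along these lines can reject application Pods that request reserved classes. This is a sketch: the policy name, message, and kube-system exclusion are illustrative, so validate the schema against your Kyverno version before relying on it:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-system-priority   # illustrative name
spec:
  validationFailureAction: Enforce
  rules:
    - name: block-system-priority-classes
      match:
        any:
          - resources:
              kinds:
                - Pod
      exclude:
        any:
          - resources:
              namespaces:
                - kube-system
      validate:
        message: "system-* PriorityClasses are reserved for cluster infrastructure."
        pattern:
          spec:
            # Conditional anchor: only checked if priorityClassName is set.
            =(priorityClassName): "!system-*"
```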
Common Pitfalls
- Priority inflation: Without governance, teams will assign the highest priority to all their pods. Use admission webhooks or policy engines to enforce that only approved workloads can use high-priority classes.
- Forgetting globalDefault: If no PriorityClass has `globalDefault: true`, pods without a `priorityClassName` get priority 0 and are the first to be preempted. Always set a sensible default.
- Using system priorities for applications: Assigning `system-cluster-critical` to application pods can prevent CoreDNS, kube-proxy, or other system components from scheduling, bringing down the entire cluster.
- Not setting resource requests: Preemption works based on resource requests, not actual usage. If your pods lack resource requests, the scheduler cannot accurately calculate whether evicting them would free enough capacity.
- Ignoring nominatedNodeName: After preemption, the pending pod has a `nominatedNodeName` but is not yet scheduled. Other high-priority pods might "steal" the freed resources. This is normal behavior, not a bug.
- Cascading preemption: A preempted pod that is high-priority itself might preempt another pod when it is rescheduled on a different node, causing a chain reaction. Design your priority tiers with sufficient gaps.
Best Practices
- Define 3 to 5 priority tiers: Too many tiers create confusion. A simple critical/high/medium/low/batch scheme covers most use cases.
- Set a global default: Choose a middle-tier priority as the global default so untagged workloads have a reasonable placement.
- Protect with PDBs: Combine priority with PodDisruptionBudgets to limit the blast radius of preemption on stateful or replicated services.
- Use `preemptionPolicy: Never` for batch: Batch jobs should get scheduling priority (they jump the queue) but should not evict running services.
- Enforce with policy: Use OPA/Gatekeeper, Kyverno, or admission webhooks to prevent teams from assigning priorities above their allocated tier.
- Monitor preemption events: Set up alerts for preemption events (`kubectl get events --field-selector reason=Preempted`). Frequent preemption is a signal that your cluster needs more capacity.
What's Next?
- Scheduling (Taints): Learn how taints control which pods can land on which nodes, working alongside priority to determine placement.
- Scheduling (Affinity): Understand how affinity rules interact with priority when the scheduler must balance placement preferences with resource constraints.
- Resources: Resource requests and limits are the currency that priority and preemption operate on.
- Troubleshooting: Debug pods that are unexpectedly preempted or stuck in `Pending` despite having high priority.