
Evictions & Node Pressure

Key Takeaways for AI & Readers
  • Node-Pressure Eviction: The kubelet monitors resource signals (memory.available, nodefs.available, imagefs.available, pid.available) and proactively evicts pods when thresholds are breached to prevent the node from becoming completely unresponsive.
  • Soft vs. Hard Thresholds: Soft evictions trigger after a configurable grace period, giving pods time to terminate gracefully. Hard evictions are immediate -- the kubelet kills pods without waiting.
  • QoS-Based Eviction Order: Pods are evicted in order of their Quality of Service class: BestEffort first (no requests/limits), then Burstable (usage exceeding requests), then Guaranteed (requests equal limits) only as a last resort.
  • API-Initiated Eviction: The Eviction API (POST /eviction) respects PodDisruptionBudgets and is used by kubectl drain, the Descheduler, and cluster autoscalers. Node-pressure eviction does NOT respect PDBs.
  • Node Conditions: When eviction thresholds are breached, the kubelet sets node conditions (MemoryPressure, DiskPressure, PIDPressure) that prevent the scheduler from placing new pods on the affected node.
  • Debugging: Eviction events are recorded on both the Pod (reason: Evicted) and the Node (reason: NodeHasDiskPressure, etc.). Use kubectl describe and node logs to diagnose.

Eviction is the process where the kubelet proactively terminates pods to preserve node stability. This is fundamentally different from a standard pod deletion -- it is the node's survival mechanism when resources are critically low.

1. Node-Pressure Eviction

The kubelet continuously monitors several resource signals on the node. When any signal crosses a configured threshold, the kubelet enters eviction mode.

Eviction Signals

[Diagram: a node with several running pods under increasing resource pressure]
When a node runs out of resources (Memory/Disk), the Kubelet starts Node-Pressure Eviction. It kills pods based on their QoS class to prevent the node from crashing.
Signal            | Description                                                                       | Default Hard Threshold
------------------|-----------------------------------------------------------------------------------|-----------------------
memory.available  | Available memory on the node (free + reclaimable cache)                           | 100Mi
nodefs.available  | Available disk space on the node's root filesystem                                | 10%
nodefs.inodesFree | Available inodes on the root filesystem                                           | 5%
imagefs.available | Available disk space on the image filesystem (where container images are stored)  | 15%
pid.available     | Available process IDs on the node                                                 | 100

These signals are evaluated every 10 seconds by default (configurable via --housekeeping-interval).

How Available Memory Is Calculated

The kubelet calculates memory.available as:

memory.available = node.status.capacity.memory - memory.workingSet

(the workingSet is derived from /proc/meminfo and cgroup stats)

The workingSet includes resident memory (RSS) and active cached pages, but excludes inactive cached pages and buffer pages that the kernel can reclaim. This means the kubelet considers reclaimable cache as "available," which is important for workloads that rely on page cache (databases, file servers).
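As an illustration, the calculation can be sketched in a few lines of Python (all numbers below are made up for the example):

```python
# Sketch of the kubelet's memory.available calculation (illustrative numbers).
MIB = 1024 * 1024

capacity = 16 * 1024 * MIB      # node.status.capacity.memory (16Gi)
usage = 12 * 1024 * MIB         # total memory in use, from cgroup stats
inactive_file = 3 * 1024 * MIB  # inactive page cache the kernel can reclaim

# workingSet excludes the reclaimable inactive cache
working_set = usage - inactive_file
available = capacity - working_set

hard_threshold = 100 * MIB      # --eviction-hard "memory.available<100Mi"
under_pressure = available < hard_threshold
print(available // MIB, under_pressure)  # 7168 False
```

Note that a node with heavy but reclaimable page-cache usage still reports plenty of available memory, which is exactly the behavior database and file-server workloads rely on.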

2. Soft vs. Hard Eviction Thresholds

Hard Eviction Thresholds

When a hard threshold is breached, the kubelet kills pods immediately with no grace period. The pod receives a SIGKILL. Hard thresholds are the last line of defense before the node's OOM killer takes over (which is even more disruptive because the OOM killer has no awareness of Kubernetes priorities).

# kubelet configuration (in KubeletConfiguration or flags)
--eviction-hard="memory.available<100Mi,nodefs.available<10%,imagefs.available<15%,pid.available<100"

Soft Eviction Thresholds

Soft thresholds trigger a grace period. The kubelet waits for the specified duration, and if the resource signal is still below the threshold, it begins evicting pods. This gives transient spikes time to resolve without unnecessary disruption.

# Soft threshold: trigger when memory is below 200Mi
--eviction-soft="memory.available<200Mi,nodefs.available<15%"

# Grace period: wait 90 seconds before evicting
--eviction-soft-grace-period="memory.available=90s,nodefs.available=120s"

# Maximum time to wait for a pod to terminate after soft eviction
--eviction-max-pod-grace-period=60

Minimum Reclaim

After evicting pods, the kubelet may still be near the threshold. The --eviction-minimum-reclaim flag tells the kubelet to keep evicting until a minimum amount of resources has been freed:

--eviction-minimum-reclaim="memory.available=500Mi,nodefs.available=1Gi"

This prevents the kubelet from evicting a single small pod, briefly crossing back above the threshold, and then immediately re-entering eviction mode.

3. The Eviction Order (QoS Classes)

When the kubelet decides to evict pods, it does not pick them randomly. It follows a strict priority based on the pod's Quality of Service (QoS) class and resource consumption.

Step 1: Sort by QoS Class

  1. BestEffort (evicted first): Pods that specify no resource requests or limits. These pods are considered expendable because they made no resource commitments.

  2. Burstable (evicted second): Pods that have requests set but either have no limits or limits that differ from requests. Within this class, pods are sorted by how much their usage exceeds their requests -- the biggest overusers are evicted first.

  3. Guaranteed (evicted last): Pods where every container has requests == limits for both CPU and memory. These pods are only evicted when the node has no other option.

Step 2: Sort Within QoS Class

Within each QoS class, pods are ranked by their resource consumption relative to their requests. For memory evictions, the kubelet considers:

  • Memory usage exceeding request: Pods using more memory than they requested are evicted before pods using less than their request.
  • Pod priority: Higher-priority pods (higher spec.priority) are evicted after lower-priority pods within the same QoS class.
# BestEffort pod -- will be evicted first
apiVersion: v1
kind: Pod
metadata:
  name: expendable-worker
spec:
  containers:
  - name: worker
    image: worker:latest
    # No resources specified -- QoS: BestEffort
---
# Burstable pod -- evicted second
apiVersion: v1
kind: Pod
metadata:
  name: web-server
spec:
  containers:
  - name: nginx
    image: nginx:1.27
    resources:
      requests:
        cpu: "100m"
        memory: "128Mi"
      # No limits -- QoS: Burstable
---
# Guaranteed pod -- evicted last
apiVersion: v1
kind: Pod
metadata:
  name: critical-database
spec:
  containers:
  - name: postgres
    image: postgres:16
    resources:
      requests:
        cpu: "2"
        memory: "4Gi"
      limits:
        cpu: "2"        # requests == limits for CPU
        memory: "4Gi"   # requests == limits for memory
    # QoS: Guaranteed
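The two-step ordering above can be sketched in Python. This is a simplification of the kubelet's actual ranking logic, and the pod data is hypothetical:

```python
# Hypothetical pods: (name, qos_class, memory_usage_mib, memory_request_mib, priority)
pods = [
    ("critical-database", "Guaranteed", 4096, 4096, 1000),
    ("web-server",        "Burstable",   512,  128,    0),
    ("expendable-worker", "BestEffort",  256,    0,    0),
]

QOS_RANK = {"BestEffort": 0, "Burstable": 1, "Guaranteed": 2}

def eviction_key(pod):
    name, qos, usage, request, priority = pod
    # Lower tuple = evicted sooner: QoS class first, then pod priority,
    # then overuse relative to request (biggest overusers go first).
    return (QOS_RANK[qos], priority, -(usage - request))

eviction_order = [p[0] for p in sorted(pods, key=eviction_key)]
print(eviction_order)  # ['expendable-worker', 'web-server', 'critical-database']
```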

4. Node Conditions

When eviction thresholds are breached, the kubelet sets corresponding node conditions that the scheduler reads to avoid placing new pods on stressed nodes.

Condition      | Trigger Signal                                         | Effect
---------------|--------------------------------------------------------|-------
MemoryPressure | memory.available below threshold                       | Scheduler avoids this node for new pods; BestEffort pods are not scheduled here
DiskPressure   | nodefs.available or imagefs.available below threshold  | Scheduler avoids this node; kubelet stops accepting new pods
PIDPressure    | pid.available below threshold                          | Scheduler avoids this node

Node conditions have oscillation dampening -- once a condition is set to True, the kubelet does not set it back to False until the resource signal has been above the threshold for a configurable period (--eviction-pressure-transition-period, default: 5 minutes). This prevents the scheduler from rapidly flip-flopping pods between nodes.

# Check node conditions
kubectl describe node worker-01 | grep -A5 Conditions

5. Protecting Critical Pods

Some pods must survive even during severe node pressure. Kubernetes provides several mechanisms:

Priority Classes

Pods with higher priorityClassName are evicted after lower-priority pods. System-critical pods use the built-in system-cluster-critical and system-node-critical priority classes.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000          # Higher value = higher priority
globalDefault: false
description: "For critical production services"
---
apiVersion: v1
kind: Pod
metadata:
  name: critical-service
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: critical-app:latest
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "500m"
        memory: "1Gi"   # Guaranteed QoS + high priority = maximum protection

Guaranteed QoS

Setting requests == limits for all containers in a pod gives it Guaranteed QoS, making it the last to be evicted. Combined with a high priority class, this provides the strongest protection against node-pressure eviction.
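The classification rules can be sketched as a small function. This is a simplification of the real logic (which lives in the API server), using plain dicts for container resources:

```python
def qos_class(containers):
    """Simplified QoS classification for a list of container resource dicts.

    Each container is a dict with optional 'requests' and 'limits' maps."""
    if not any(c.get("requests") or c.get("limits") for c in containers):
        return "BestEffort"
    # Guaranteed: every container sets cpu + memory limits, and requests
    # (which default to limits when unset) equal limits.
    if all(
        c.get("limits", {}).get("cpu") and c.get("limits", {}).get("memory")
        and c.get("requests", c["limits"]) == c["limits"]
        for c in containers
    ):
        return "Guaranteed"
    return "Burstable"
```

To check the class Kubernetes actually assigned, query the pod status: `kubectl get pod critical-database -o jsonpath='{.status.qosClass}'`.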

6. API-Initiated Eviction vs. Node-Pressure Eviction

These are two fundamentally different mechanisms that share the name "eviction."

Aspect        | API-Initiated Eviction                                       | Node-Pressure Eviction
--------------|--------------------------------------------------------------|-----------------------
Triggered by  | API call (POST /api/v1/namespaces/{ns}/pods/{name}/eviction) | Kubelet detecting a resource threshold breach
Respects PDBs | Yes -- fails if the eviction would violate a PDB             | No -- the kubelet must protect the node
Used by       | kubectl drain, Descheduler, Cluster Autoscaler, VPA          | Kubelet only
Grace period  | Respects terminationGracePeriodSeconds                       | Hard: none; Soft: configurable
Pod status    | Pod is deleted gracefully, as with kubectl delete            | Failed with reason Evicted

PodDisruptionBudgets (PDBs) and Eviction

PDBs only apply to API-initiated evictions. They tell the eviction API: "do not evict this pod if fewer than N replicas would remain healthy." This is critical for kubectl drain and the Descheduler, but the kubelet will ignore PDBs during node-pressure eviction because node survival takes priority.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
  namespace: production
spec:
  minAvailable: 2       # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: web-frontend
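For reference, an API-initiated eviction is a POST of an Eviction object to the pod's eviction subresource. A sketch, using a hypothetical pod name:

```yaml
# POSTed to /api/v1/namespaces/production/pods/web-frontend-abc12/eviction
apiVersion: policy/v1
kind: Eviction
metadata:
  name: web-frontend-abc12   # hypothetical pod name
  namespace: production
```

kubectl drain issues one of these per pod on the node. If the eviction would violate web-pdb, the API rejects it with HTTP 429 Too Many Requests, and drain retries until the budget allows it.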

7. Configuring Kubelet Eviction Thresholds

Eviction thresholds are configured in the kubelet configuration file or via command-line flags. In managed Kubernetes services (EKS, GKE, AKS), some of these are pre-configured and may not be modifiable.

# KubeletConfiguration (kubelet-config.yaml)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
  pid.available: "100"
evictionSoft:
  memory.available: "300Mi"
  nodefs.available: "15%"
evictionSoftGracePeriod:
  memory.available: "1m30s"
  nodefs.available: "2m"
evictionMaxPodGracePeriod: 60
evictionMinimumReclaim:
  memory.available: "500Mi"
  nodefs.available: "1Gi"
evictionPressureTransitionPeriod: "5m0s"   # Dampen oscillation

Reserved Resources

To prevent workloads from consuming all node resources and triggering evictions, reserve resources for the kubelet and system daemons:

# Reserve resources for the system
kubeReserved:
  cpu: "200m"
  memory: "512Mi"
  ephemeral-storage: "1Gi"
systemReserved:
  cpu: "200m"
  memory: "512Mi"
  ephemeral-storage: "1Gi"
enforceNodeAllocatable:
- pods
- kube-reserved
- system-reserved
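These reservations feed into the node's Allocatable value, which the scheduler uses when placing pods. With the memory reservations above and an illustrative 8Gi node, the arithmetic works out as:

```python
MIB = 1024 * 1024

capacity = 8 * 1024 * MIB    # illustrative node memory capacity (8Gi)
kube_reserved = 512 * MIB    # kubeReserved.memory
system_reserved = 512 * MIB  # systemReserved.memory
eviction_hard = 100 * MIB    # memory.available hard eviction threshold

# Allocatable = Capacity - kubeReserved - systemReserved - hard eviction threshold
allocatable = capacity - kube_reserved - system_reserved - eviction_hard
print(allocatable // MIB)  # 7068
```

Compare the result against `kubectl describe node <node-name>`, which lists both Capacity and Allocatable.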

8. Debugging Eviction Events

When pods are unexpectedly evicted, follow this diagnostic process:

Step 1: Check the Pod Events

kubectl describe pod <evicted-pod-name> -n <namespace>
# Look for:
# Status: Failed
# Reason: Evicted
# Message: "The node was low on resource: memory. ..."

Step 2: Check Node Conditions

kubectl describe node <node-name>
# Look for:
# Conditions:
# MemoryPressure: True (or recently True)
# DiskPressure: True

Step 3: Check kubelet Logs

# On the node itself
journalctl -u kubelet | grep -i evict
# Look for messages like:
# "evicting pod ... usage=1.2Gi, request=512Mi"

Step 4: Check Historical Metrics

Use Prometheus or your monitoring stack to correlate the eviction timestamp with node-level resource metrics. Look for:

  • node_memory_MemAvailable_bytes dropping below the hard threshold.
  • container_memory_working_set_bytes for the evicted pods to understand which pod was the largest consumer.
  • kubelet_evictions_total metric for eviction frequency.

9. Common Pitfalls

  1. Not setting resource requests. Pods without requests get BestEffort QoS and are the first to be evicted. Always set requests for production workloads.

  2. Confusing eviction with OOMKill. An OOMKill happens when a container exceeds its memory limit -- the kernel kills the process. An eviction happens when the node is running low on resources. They have different causes and different remedies.

  3. Assuming PDBs protect against node-pressure eviction. PDBs only apply to API-initiated evictions (drain, Descheduler). The kubelet ignores PDBs when the node is under pressure.

  4. Not reserving system resources. Without kubeReserved and systemReserved, workload pods can consume all node memory, starving the kubelet and the container runtime. This can cause the node to become NotReady and trigger mass evictions.

  5. Setting eviction thresholds too aggressively. If hard thresholds are set too high (e.g., memory.available<1Gi), normal workload fluctuations will trigger evictions. If set too low (e.g., memory.available<10Mi), the node may OOM before the kubelet can evict pods.

  6. Ignoring disk pressure. Image pulls, container logs, and emptyDir volumes can fill the node filesystem. Monitor nodefs.available and configure image garbage collection (--image-gc-high-threshold, --image-gc-low-threshold).

  7. Not draining before maintenance. Shutting down a node without draining causes all pods to be killed abruptly. Use kubectl drain for graceful evacuation.
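The image garbage-collection thresholds mentioned in pitfall 6 can also be set in KubeletConfiguration; the values shown below are the defaults:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Start deleting unused images when imagefs usage exceeds 85%...
imageGCHighThresholdPercent: 85
# ...and stop once usage drops back below 80%
imageGCLowThresholdPercent: 80
```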

10. What's Next?

  • Node Management: Learn how kubectl drain uses API-initiated eviction with PDB support to gracefully evacuate nodes. See Node Operations.
  • Resource Requests and Limits: Understand how setting requests and limits determines the QoS class, which directly controls eviction order.
  • VPA: The Vertical Pod Autoscaler adjusts resource requests based on usage, which affects both scheduling and eviction behavior. See VPA.
  • Priority and Preemption: Priority classes affect both scheduling (preemption) and eviction order. Higher-priority pods survive longer during node pressure.
  • Monitoring: Set up Prometheus alerts for kube_pod_status_reason{reason="Evicted"} and kube_node_status_condition{condition="MemoryPressure",status="true"} to catch eviction events proactively.
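A sketch of those alerts as Prometheus rules, assuming kube-state-metrics is installed (group and alert names are arbitrary):

```yaml
groups:
- name: eviction-alerts
  rules:
  - alert: PodsEvicted
    # kube-state-metrics exposes kube_pod_status_reason per pod
    expr: sum(kube_pod_status_reason{reason="Evicted"}) > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "One or more pods were evicted"
  - alert: NodeMemoryPressure
    expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.node }} is under memory pressure"
```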