Node Operations (Maintenance)
- Cordon: kubectl cordon marks a node as unschedulable, preventing new pods from landing on it while existing pods continue running undisturbed.
- Drain: kubectl drain cordons the node and then evicts all pods using the Eviction API, which respects PodDisruptionBudgets. Controllers (Deployments, StatefulSets) recreate evicted pods on other nodes.
- PodDisruptionBudgets (PDBs): PDBs define how many pods in a set must remain available during voluntary disruptions like drains. Without PDBs, drain evicts all pods simultaneously.
- Graceful Node Shutdown: Since Kubernetes 1.21, the kubelet can detect system shutdown signals (systemd inhibitor locks) and gracefully terminate pods in priority order before the node powers off.
- Node Problem Detector: A DaemonSet that monitors for kernel deadlocks, filesystem corruption, container runtime issues, and hardware problems, reporting them as node conditions.
- Spot/Preemptible Nodes: Cloud spot instances can be reclaimed with minimal notice (30s-2min). Handle this with termination handlers, PDBs, and graceful shutdown configuration.
Nodes are physical or virtual machines that require regular maintenance -- OS patches, kernel upgrades, Kubernetes version bumps, hardware replacements, and security updates. As a platform engineer, you must orchestrate these operations without disrupting running workloads.
1. The Node Maintenance Workflow
The standard procedure for taking a node out of service follows three phases:
Phase 1: Cordon (kubectl cordon)
kubectl cordon worker-03
# node/worker-03 cordoned
This sets the node's .spec.unschedulable field to true. The scheduler will no longer place new pods on this node. Existing pods continue running normally. Use cordon when you want to "quietly" stop new work from arriving while current work finishes (e.g., waiting for a batch job to complete).
Phase 2: Drain (kubectl drain)
# --ignore-daemonsets      DaemonSet pods run on every node and cannot be evicted
# --delete-emptydir-data   accept loss of emptyDir volumes
# --grace-period=60        give pods 60s to shut down
# --timeout=300s           fail if the drain takes more than 5 minutes
# --pod-selector           optional: only drain pods matching the selector
kubectl drain worker-03 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=300s \
  --pod-selector='app!=critical'
The drain command:
- Cordons the node (if not already cordoned).
- Lists all pods on the node (excluding DaemonSet pods and mirror pods).
- For each pod, sends an eviction request to the Kubernetes API (a POST to the pod's eviction subresource, /api/v1/namespaces/{namespace}/pods/{name}/eviction).
- The eviction API checks PodDisruptionBudgets. If evicting the pod would violate a PDB, the request is rejected and drain retries.
- Once the eviction is accepted, the pod receives SIGTERM and has terminationGracePeriodSeconds to shut down.
- The pod's controller (Deployment, StatefulSet, Job) notices the pod is gone and creates a replacement on another node.
Phase 3: Uncordon (kubectl uncordon)
# After maintenance is complete
kubectl uncordon worker-03
# node/worker-03 uncordoned
This sets unschedulable back to false. The node is now eligible to receive new pods again. Importantly, Kubernetes does not automatically move existing pods back to the uncordoned node. Pods already running on other nodes stay where they are. Over time, natural churn (new deployments, scaling events) will rebalance the cluster, or you can use the Descheduler to actively rebalance.
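The three phases can be chained into a single maintenance helper. Below is a minimal sketch, assuming kubectl is configured for the cluster; the `maintain_node` function name and the timeout values are illustrative, not from the original text:

```shell
#!/usr/bin/env bash
# Hypothetical helper: cordon and drain a node, pause for manual
# maintenance, then wait for Ready and uncordon. Usage: maintain_node worker-03
maintain_node() {
  local node="$1"

  kubectl cordon "$node"
  kubectl drain "$node" \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --timeout=300s

  # Pause here while the actual maintenance happens (patching, reboot, ...).
  read -r -p "Maintenance on $node done? Press Enter to uncordon. "

  # Block until the node reports Ready again, then reopen it for scheduling.
  kubectl wait --for=condition=Ready "node/$node" --timeout=600s
  kubectl uncordon "$node"
}
```

Wrapping the sequence in a function keeps the cordon/uncordon pairing explicit, which helps avoid the "forgot to uncordon" pitfall discussed later.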
2. PodDisruptionBudgets During Drain
PDBs are the primary mechanism for ensuring application availability during voluntary disruptions like node drains.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: web-frontend-pdb
namespace: production
spec:
# Choose ONE of minAvailable or maxUnavailable (not both)
minAvailable: 2 # At least 2 pods must remain running
# maxUnavailable: 1 # At most 1 pod can be unavailable at a time
selector:
matchLabels:
app: web-frontend
How PDBs Interact with Drain
When kubectl drain sends an eviction request for a pod covered by a PDB:
- The API server checks: "If I evict this pod, will the number of available pods drop below minAvailable (or will the number of unavailable pods exceed maxUnavailable)?"
- If eviction would violate the PDB, the API returns a 429 Too Many Requests error, and drain retries after a short backoff.
- This means drain blocks until replacement pods are running and healthy on other nodes, maintaining the availability guarantee.
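What drain actually submits for each pod is an Eviction object against the pod's eviction subresource. A minimal sketch of the equivalent request body (pod name and namespace are illustrative):

```yaml
# Equivalent of what `kubectl drain` sends per pod:
# POST /api/v1/namespaces/production/pods/web-frontend-abc12/eviction
apiVersion: policy/v1
kind: Eviction
metadata:
  name: web-frontend-abc12
  namespace: production
deleteOptions:
  gracePeriodSeconds: 60
```

Because this goes through the eviction subresource rather than a plain DELETE, the API server gets the chance to consult PDBs before the pod is removed.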
PDB Guidelines
- Every production Deployment and StatefulSet should have a PDB.
- Use maxUnavailable: 1 for most workloads; this allows one pod to be down at a time.
- For StatefulSets, use maxUnavailable: 1 to ensure ordered, one-at-a-time evictions.
- Do not set minAvailable equal to the replica count (e.g., minAvailable: 3 for 3 replicas), as this makes the deployment un-drainable.
Warning: PDB Deadlocks
If you set minAvailable: 100% (or equal to replica count), nodes cannot be drained. The PDB API will reject every eviction request.
- Symptom: kubectl drain hangs forever (or until timeout).
- Fix: Temporarily delete the PDB, or scale up the deployment so the availability requirement can be satisfied during eviction.
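You can catch this condition before starting a drain by listing PDBs whose status currently allows zero disruptions. A sketch, assuming kubectl access; the `blocked_pdbs` helper name is my own, and the jsonpath filter relies on kubectl's filter-expression support:

```shell
# Hypothetical helper: list PDBs that currently allow zero disruptions,
# i.e. any drain touching their pods will block until something changes.
blocked_pdbs() {
  kubectl get pdb --all-namespaces \
    -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'
}
```

Running this as a pre-flight check before maintenance windows turns a mysterious hanging drain into an actionable list of misconfigured budgets.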
3. Node Maintenance Procedures
OS Upgrades and Kernel Patches
# 1. Cordon and drain the node
kubectl drain worker-03 --ignore-daemonsets --delete-emptydir-data
# 2. SSH to the node and perform the upgrade
ssh worker-03
sudo apt update && sudo apt upgrade -y
sudo reboot
# 3. Wait for the node to come back online
kubectl get node worker-03 -w
# Wait until STATUS shows Ready
# 4. Uncordon
kubectl uncordon worker-03
Kubernetes Version Upgrades (Node by Node)
The kubelet may run up to three minor versions behind the kube-apiserver (two minor versions before Kubernetes 1.28), though in practice you should keep the skew as small as possible. The recommended upgrade procedure for worker nodes:
# On each worker node (one at a time):
# 1. Drain the node
kubectl drain worker-01 --ignore-daemonsets --delete-emptydir-data
# 2. Upgrade kubelet and kubectl on the node
ssh worker-01
sudo apt-mark unhold kubelet kubectl
sudo apt update && sudo apt install -y kubelet=1.30.0-1.1 kubectl=1.30.0-1.1  # revision suffix varies by package repo
sudo apt-mark hold kubelet kubectl
# 3. Restart the kubelet
sudo systemctl daemon-reload && sudo systemctl restart kubelet
# 4. Verify the node version
kubectl get node worker-01
# NAME STATUS ROLES VERSION
# worker-01 Ready <none> v1.30.0
# 5. Uncordon
kubectl uncordon worker-01
4. Node Labels and Annotations
Labels and annotations are essential for organizing nodes into logical groups and providing metadata to schedulers and operators.
# Add labels for topology and workload placement
kubectl label node worker-01 topology.kubernetes.io/zone=us-east-1a
kubectl label node worker-01 node.kubernetes.io/instance-type=m5.xlarge
kubectl label node gpu-node-01 accelerator=nvidia-a100
# Add annotations for operational metadata
kubectl annotate node worker-01 maintenance-window="sun-02:00-06:00"
# Use labels for node affinity in pod specs
# Schedule GPU workloads only on labeled GPU nodes
apiVersion: v1
kind: Pod
metadata:
name: ml-training
spec:
nodeSelector:
accelerator: nvidia-a100
containers:
- name: trainer
image: training-job:latest
resources:
limits:
nvidia.com/gpu: 4
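nodeSelector only supports exact label matches. For richer placement rules, the same label works with node affinity; a sketch using the accelerator label from above (the pod name is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-training-affinity
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: schedule only on nodes labeled with an A100.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: accelerator
            operator: In
            values: ["nvidia-a100"]
  containers:
  - name: trainer
    image: training-job:latest
```

The In operator accepts multiple values, so one rule can target several GPU generations, which plain nodeSelector cannot express.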
5. Node Pools and Node Groups
In managed Kubernetes services, nodes are organized into node pools (GKE), node groups (EKS), or VM scale sets (AKS). Each pool shares the same instance type, OS image, and configuration.
Best practices for node pool management:
- Separate pools by workload type: general-purpose (m5.xlarge), memory-optimized (r5.xlarge), GPU (p3.2xlarge), and spot/preemptible.
- Use taints and tolerations: Taint GPU nodes so only GPU workloads schedule there.
- Independent scaling: Each pool has its own autoscaling configuration (min/max nodes).
- Independent upgrades: Upgrade one pool at a time, starting with non-critical workloads.
# Taint a GPU node pool to prevent non-GPU workloads
# (typically configured at the node pool level in cloud provider settings)
# Nodes in this pool have:
# taint: nvidia.com/gpu=present:NoSchedule
# Pod that tolerates the taint
apiVersion: v1
kind: Pod
metadata:
name: gpu-workload
spec:
tolerations:
- key: "nvidia.com/gpu"
operator: "Equal"
value: "present"
effect: "NoSchedule"
containers:
- name: inference
image: inference:latest
resources:
limits:
nvidia.com/gpu: 1
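When the cloud provider does not manage taints at the pool level, the taint from the comments above can be applied by hand. A sketch using the label and taint key from this section; the helper names are my own:

```shell
# Hypothetical helper: taint every labeled GPU node so that only pods
# with a matching toleration can schedule there.
taint_gpu_nodes() {
  kubectl taint nodes -l accelerator=nvidia-a100 \
    nvidia.com/gpu=present:NoSchedule --overwrite
}

# Removing the taint later: same key with a trailing dash.
untaint_gpu_nodes() {
  kubectl taint nodes -l accelerator=nvidia-a100 nvidia.com/gpu-
}
```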
6. Node Problem Detector
The Node Problem Detector (NPD) is a DaemonSet that monitors nodes for various issues and reports them as node conditions or events to the Kubernetes API server.
NPD detects problems such as:
- Kernel issues: deadlocks, hung tasks, corrupted memory (via kernel log parsing).
- Filesystem issues: ext4 errors, read-only filesystem, inode exhaustion.
- Container runtime issues: unresponsive Docker/containerd daemon, high error rates.
- Hardware issues: MCE (Machine Check Exception) errors, NMI (Non-Maskable Interrupt) events.
- Custom checks: You can define custom monitor scripts that NPD runs periodically.
# Deploy Node Problem Detector
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: node-problem-detector
namespace: kube-system
spec:
selector:
matchLabels:
app: node-problem-detector
template:
metadata:
labels:
app: node-problem-detector
spec:
hostPID: true # Access host process information
containers:
- name: node-problem-detector
image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.19
command: ["/node-problem-detector"]
args: ["--logtostderr",
"--config.system-log-monitor=/config/kernel-monitor.json"]
volumeMounts:
- name: log
mountPath: /var/log
readOnly: true
securityContext:
privileged: true
volumes:
- name: log
hostPath:
path: /var/log
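The custom checks mentioned above are configured as JSON monitor definitions mounted into the NPD pod. A sketch of a custom plugin monitor; the script path, condition names, and intervals are illustrative:

```json
{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "60s",
    "timeout": "10s"
  },
  "source": "ntp-custom-plugin-monitor",
  "conditions": [
    {
      "type": "NTPProblem",
      "reason": "NTPIsUp",
      "message": "ntp service is up"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "NTPProblem",
      "reason": "NTPIsDown",
      "path": "/config/plugin/check_ntp.sh",
      "timeout": "5s"
    }
  ]
}
```

A "permanent" rule sets a node condition that stays until cleared, while "temporary" rules emit one-off events; the referenced script signals a problem via its exit code.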
Automated Remediation
Detecting a "Bad Node" is only half the battle. You need to fix it.
- Draino: A tool that watches node conditions (set by NPD) and automatically cordons and drains the node.
- Cluster Autoscaler: Can be configured to terminate nodes that have been Unready for too long.
- Remediation Loop:
NPD detects issue -> Taints Node -> Draino Drains Node -> Autoscaler Terminates Node -> Cloud Provider Replaces Node.
7. Graceful Node Shutdown
Since Kubernetes 1.21, the kubelet supports graceful node shutdown (beta and enabled by default, but inactive until shutdownGracePeriod is configured). When the node receives a shutdown signal (via systemd inhibitor locks), the kubelet terminates pods in priority order before the node powers off.
# KubeletConfiguration for graceful shutdown
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: 60s # Total time for shutdown
shutdownGracePeriodCriticalPods: 15s # Time reserved for critical pods
# Non-critical pods get: 60s - 15s = 45s to terminate
# Critical pods (system-node-critical, system-cluster-critical) get: 15s
Priority-Based Graceful Shutdown (since 1.24)
For finer control, you can define shutdown grace periods per priority class:
shutdownGracePeriodByPodPriority:
- priority: 0 # BestEffort/low-priority pods
shutdownGracePeriodSeconds: 10
- priority: 10000 # Standard workloads
shutdownGracePeriodSeconds: 30
- priority: 100000 # High-priority services
shutdownGracePeriodSeconds: 45
- priority: 2000000000 # system-critical pods
shutdownGracePeriodSeconds: 60
The kubelet terminates pods in ascending priority order -- lowest priority pods are killed first, giving the highest priority pods the most time to shut down gracefully.
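The priority values in that table come from PriorityClasses assigned to pods. A sketch of a class matching the 100000 tier above; the class name and description are illustrative:

```yaml
# Pods referencing this class get priority 100000; under the
# shutdownGracePeriodByPodPriority config above they would get 45s.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-service
value: 100000
globalDefault: false
description: "High-priority services; terminated late during node shutdown."
```

Pods opt in via `priorityClassName: high-priority-service` in their spec, so the same class drives both scheduling preemption and shutdown ordering.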
8. Spot / Preemptible Node Handling
Cloud spot instances (AWS), preemptible VMs (GCP), and spot VMs (Azure) offer 60-90% cost savings but can be reclaimed with as little as 30 seconds notice.
Strategies for Handling Spot Termination
- Termination handlers: Deploy a DaemonSet (e.g., aws-node-termination-handler) that watches for spot termination notices via the instance metadata service and automatically cordons and drains the node.
- PDBs for all workloads: Ensure every workload has a PDB so that the drain triggered by the termination handler does not violate availability guarantees.
- Short grace periods: Set terminationGracePeriodSeconds to 25 seconds or less, because you may have only 30 seconds total.
- Spread constraints: Use topology spread constraints or pod anti-affinity to ensure that not all replicas of a service are on spot nodes in the same availability zone.
- Mixed node pools: Run a baseline of on-demand nodes for critical services and use spot nodes for burst capacity, batch jobs, and stateless workloads.
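A termination handler is conceptually simple: poll the metadata service, and drain when a notice appears. The sketch below targets the AWS spot metadata endpoint; the function name, polling interval, and timeouts are illustrative, and production clusters should run a maintained handler such as aws-node-termination-handler instead:

```shell
# Hypothetical sketch of a spot termination watcher, run on the node itself.
# Usage: watch_spot_termination "$NODE_NAME"
watch_spot_termination() {
  local node="$1"
  while true; do
    # The endpoint returns 404 until AWS issues a termination notice.
    if curl -sf http://169.254.169.254/latest/meta-data/spot/instance-action >/dev/null; then
      kubectl cordon "$node"
      kubectl drain "$node" \
        --ignore-daemonsets \
        --delete-emptydir-data \
        --grace-period=25 \
        --timeout=100s   # leave headroom within the ~2 min AWS notice
      return 0
    fi
    sleep 5
  done
}
```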
# Topology spread: distribute across spot and on-demand nodes
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-frontend
spec:
replicas: 6
template:
spec:
topologySpreadConstraints:
- maxSkew: 2
topologyKey: node.kubernetes.io/lifecycle # "spot" or "on-demand"
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: web-frontend
terminationGracePeriodSeconds: 25 # Fast shutdown for spot
containers:
- name: frontend
image: frontend:latest
9. Common Pitfalls
- Draining without PDBs. Without a PDB, kubectl drain evicts all pods simultaneously. If all replicas of a service are on the drained node, the service experiences a complete outage until pods are rescheduled.
- Forgetting --ignore-daemonsets. DaemonSet pods cannot be evicted (they are meant to run on every node). Without this flag, drain fails immediately.
- Stuck drains due to overly strict PDBs. If minAvailable equals the replica count, drain can never evict any pod, and the drain command hangs until it times out. Always ensure minAvailable < replicaCount or use maxUnavailable >= 1.
- Not waiting for replacement pods. After draining, verify that evicted pods have been rescheduled and are Running before proceeding with maintenance.
- Forgetting to uncordon. A cordoned node still consumes cluster resources but accepts no pods. After maintenance, always uncordon. Set up monitoring alerts for nodes that remain cordoned for more than a few hours.
- Draining nodes with local persistent volumes. Pods using local PersistentVolumes cannot be rescheduled to other nodes. Drain will fail unless you use --force, which deletes the pod without rescheduling. Plan data migration separately.
- Ignoring graceful shutdown configuration. Without graceful shutdown, an unexpected node power-off kills all pods with SIGKILL. Applications that need to flush data, close connections, or deregister from service discovery will lose data or cause downstream errors.
10. What's Next?
- Evictions: Understand how node-pressure eviction (kubelet-initiated) differs from the API-initiated eviction used by drain. See Evictions.
- Descheduler: After uncordoning nodes, the Descheduler can actively rebalance pods to utilize the returned capacity. See Descheduler.
- Cluster Autoscaler: Understand how the Cluster Autoscaler removes underutilized nodes (which involves draining) and adds new ones.
- Custom Schedulers: Learn how custom schedulers can influence pod placement during the rescheduling that follows a drain. See Custom Schedulers.
- Monitoring: Set up alerts for kube_node_spec_unschedulable == 1 (cordoned nodes) and kube_node_status_condition{condition="Ready",status="false"} (unhealthy nodes).
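If you run the Prometheus Operator alongside kube-state-metrics, those metrics can be wired into alert rules directly. A sketch; the rule names, namespace, and thresholds are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-operations-alerts
  namespace: monitoring
spec:
  groups:
  - name: node-operations
    rules:
    - alert: NodeCordonedTooLong
      expr: kube_node_spec_unschedulable == 1
      for: 4h
      labels:
        severity: warning
      annotations:
        summary: "Node {{ $labels.node }} has been cordoned for over 4 hours."
    - alert: NodeNotReady
      expr: kube_node_status_condition{condition="Ready",status="false"} == 1
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "Node {{ $labels.node }} is NotReady."
```

The `for:` durations are what make these actionable: a node cordoned for minutes during a drain is normal, while one cordoned for hours is almost certainly a forgotten uncordon.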