Node Operations (Maintenance)
- Cordon: kubectl cordon marks a node as unschedulable, preventing new pods from landing on it while existing pods continue running undisturbed.
- Drain: kubectl drain cordons the node and then evicts all pods using the Eviction API, which respects PodDisruptionBudgets. Controllers (Deployments, StatefulSets) recreate evicted pods on other nodes.
- PodDisruptionBudgets (PDBs): PDBs define how many pods in a set must remain available during voluntary disruptions like drains. Without PDBs, drain evicts all pods simultaneously.
- Graceful Node Shutdown: Since Kubernetes 1.21, the kubelet can detect system shutdown signals (systemd inhibitor locks) and gracefully terminate pods in priority order before the node powers off.
- Node Problem Detector: A DaemonSet that monitors for kernel deadlocks, filesystem corruption, container runtime issues, and hardware problems, reporting them as node conditions.
- Spot/Preemptible Nodes: Cloud spot instances can be reclaimed with minimal notice (30s-2min). Handle this with termination handlers, PDBs, and graceful shutdown configuration.
Nodes are physical or virtual machines that require regular maintenance -- OS patches, kernel upgrades, Kubernetes version bumps, hardware replacements, and security updates. As a platform engineer, you must orchestrate these operations without disrupting running workloads.
1. The Node Maintenance Workflow
The standard procedure for taking a node out of service follows three phases:
Phase 1: Cordon (kubectl cordon)
kubectl cordon worker-03
# node/worker-03 cordoned
This sets the node's .spec.unschedulable field to true. The scheduler will no longer place new pods on this node. Existing pods continue running normally. Use cordon when you want to "quietly" stop new work from arriving while current work finishes (e.g., waiting for a batch job to complete).
Phase 2: Drain (kubectl drain)
# --ignore-daemonsets      DaemonSet pods run on every node and cannot be evicted
# --delete-emptydir-data   accept loss of emptyDir volumes
# --grace-period=60        give pods 60s to shut down
# --timeout=300s           fail if the drain takes more than 5 minutes
# --pod-selector           optional: only drain pods matching the selector
kubectl drain worker-03 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=300s \
  --pod-selector='app!=critical'
The drain command:
- Cordons the node (if not already cordoned).
- Lists all pods on the node (excluding DaemonSet pods and mirror pods).
- For each pod, sends an eviction request to the Kubernetes API (a POST to the pod's eviction subresource, /api/v1/namespaces/{namespace}/pods/{name}/eviction).
- The eviction API checks PodDisruptionBudgets. If evicting the pod would violate a PDB, the request is rejected and drain retries.
- Once the eviction is accepted, the pod receives SIGTERM and has terminationGracePeriodSeconds to shut down.
- The pod's controller (Deployment, StatefulSet, Job) notices the pod is gone and creates a replacement on another node.
Phase 3: Uncordon (kubectl uncordon)
# After maintenance is complete
kubectl uncordon worker-03
# node/worker-03 uncordoned
This sets unschedulable back to false. The node is now eligible to receive new pods again. Importantly, Kubernetes does not automatically move existing pods back to the uncordoned node. Pods already running on other nodes stay where they are. Over time, natural churn (new deployments, scaling events) will rebalance the cluster, or you can use the Descheduler to actively rebalance.
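The three phases can be chained into a single maintenance helper. Below is a minimal sketch, assuming kubectl is configured for the cluster; the `maintain_node` function name and the timeout values are illustrative, not from the original text:

```shell
#!/usr/bin/env bash
# Hypothetical helper: cordon and drain a node, pause for manual
# maintenance, then wait for Ready and uncordon. Usage: maintain_node worker-03
maintain_node() {
  local node="$1"

  kubectl cordon "$node"
  kubectl drain "$node" \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --timeout=300s

  # Pause here while the actual maintenance happens (patching, reboot, ...).
  read -r -p "Maintenance on $node done? Press Enter to uncordon. "

  # Block until the node reports Ready again, then reopen it for scheduling.
  kubectl wait --for=condition=Ready "node/$node" --timeout=600s
  kubectl uncordon "$node"
}
```

Wrapping the sequence in a function keeps the cordon/uncordon pairing explicit, which helps avoid the "forgot to uncordon" pitfall discussed later.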
2. PodDisruptionBudgets During Drain
PDBs are the primary mechanism for ensuring application availability during voluntary disruptions like node drains.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: web-frontend-pdb
namespace: production
spec:
# Choose ONE of minAvailable or maxUnavailable (not both)
minAvailable: 2 # At least 2 pods must remain running
# maxUnavailable: 1 # At most 1 pod can be unavailable at a time
selector:
matchLabels:
app: web-frontend
How PDBs Interact with Drain
When kubectl drain sends an eviction request for a pod covered by a PDB:
- The API server checks: "If I evict this pod, will the number of available pods drop below minAvailable (or will the number of unavailable pods exceed maxUnavailable)?"
- If eviction would violate the PDB, the API returns a 429 Too Many Requests error, and drain retries after a short backoff.
- This means drain blocks until replacement pods are running and healthy on other nodes, maintaining the availability guarantee.
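What drain actually submits for each pod is an Eviction object against the pod's eviction subresource. A minimal sketch of the equivalent request body (pod name and namespace are illustrative):

```yaml
# Equivalent of what `kubectl drain` sends per pod:
# POST /api/v1/namespaces/production/pods/web-frontend-abc12/eviction
apiVersion: policy/v1
kind: Eviction
metadata:
  name: web-frontend-abc12
  namespace: production
deleteOptions:
  gracePeriodSeconds: 60
```

Because this goes through the eviction subresource rather than a plain DELETE, the API server gets the chance to consult PDBs before the pod is removed.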
PDB Guidelines
- Every production Deployment and StatefulSet should have a PDB.
- Use maxUnavailable: 1 for most workloads; this allows one pod to be down at a time.
- For StatefulSets, use maxUnavailable: 1 to ensure ordered, one-at-a-time evictions.
- Do not set minAvailable equal to the replica count (e.g., minAvailable: 3 for 3 replicas), as this makes the deployment un-drainable.
Warning: PDB Deadlocks
If you set minAvailable: 100% (or equal to replica count), nodes cannot be drained. The PDB API will reject every eviction request.
- Symptom: kubectl drain hangs forever (or until timeout).
- Fix: Temporarily delete the PDB, or scale up the deployment so the availability requirement can be satisfied during eviction.
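You can catch this condition before starting a drain by listing PDBs whose status currently allows zero disruptions. A sketch, assuming kubectl access; the `blocked_pdbs` helper name is my own, and the jsonpath filter relies on kubectl's filter-expression support:

```shell
# Hypothetical helper: list PDBs that currently allow zero disruptions,
# i.e. any drain touching their pods will block until something changes.
blocked_pdbs() {
  kubectl get pdb --all-namespaces \
    -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'
}
```

Running this as a pre-flight check before maintenance windows turns a mysterious hanging drain into an actionable list of misconfigured budgets.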
3. Node Maintenance Procedures
OS Upgrades and Kernel Patches
# 1. Cordon and drain the node
kubectl drain worker-03 --ignore-daemonsets --delete-emptydir-data
# 2. SSH to the node and perform the upgrade
ssh worker-03
sudo apt update && sudo apt upgrade -y
sudo reboot
# 3. Wait for the node to come back online
kubectl get node worker-03 -w
# Wait until STATUS shows Ready
# 4. Uncordon
kubectl uncordon worker-03
Kubernetes Version Upgrades (Node by Node)
The kubelet may run up to three minor versions behind the kube-apiserver (two minor versions before Kubernetes 1.28), though in practice you should keep the skew as small as possible. The recommended upgrade procedure for worker nodes:
# On each worker node (one at a time):
# 1. Drain the node
kubectl drain worker-01 --ignore-daemonsets --delete-emptydir-data
# 2. Upgrade kubelet and kubectl on the node
ssh worker-01
sudo apt-mark unhold kubelet kubectl
sudo apt update && sudo apt install -y kubelet=1.30.0-1.1 kubectl=1.30.0-1.1  # revision suffix varies by package repo
sudo apt-mark hold kubelet kubectl
# 3. Restart the kubelet
sudo systemctl daemon-reload && sudo systemctl restart kubelet
# 4. Verify the node version
kubectl get node worker-01
# NAME STATUS ROLES VERSION
# worker-01 Ready <none> v1.30.0
# 5. Uncordon
kubectl uncordon worker-01
4. Node Labels and Annotations
Labels and annotations are essential for organizing nodes into logical groups and providing metadata to schedulers and operators.
# Add labels for topology and workload placement
kubectl label node worker-01 topology.kubernetes.io/zone=us-east-1a
kubectl label node worker-01 node.kubernetes.io/instance-type=m5.xlarge
kubectl label node gpu-node-01 accelerator=nvidia-a100
# Add annotations for operational metadata
kubectl annotate node worker-01 maintenance-window="sun-02:00-06:00"
# Use labels for node affinity in pod specs
# Schedule GPU workloads only on labeled GPU nodes
apiVersion: v1
kind: Pod
metadata:
name: ml-training
spec:
nodeSelector:
accelerator: nvidia-a100
containers:
- name: trainer
image: training-job:latest
resources:
limits:
nvidia.com/gpu: 4
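nodeSelector only supports exact label matches. For richer placement rules, the same label works with node affinity; a sketch using the accelerator label from above (the pod name is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-training-affinity
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: schedule only on nodes labeled with an A100.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: accelerator
            operator: In
            values: ["nvidia-a100"]
  containers:
  - name: trainer
    image: training-job:latest
```

The In operator accepts multiple values, so one rule can target several GPU generations, which plain nodeSelector cannot express.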
5. Node Pools and Node Groups
In managed Kubernetes services, nodes are organized into node pools (GKE), node groups (EKS), or VM scale sets (AKS). Each pool shares the same instance type, OS image, and configuration.
Best practices for node pool management:
- Separate pools by workload type: general-purpose (m5.xlarge), memory-optimized (r5.xlarge), GPU (p3.2xlarge), and spot/preemptible.
- Use taints and tolerations: Taint GPU nodes so only GPU workloads schedule there.
- Independent scaling: Each pool has its own autoscaling configuration (min/max nodes).
- Independent upgrades: Upgrade one pool at a time, starting with non-critical workloads.
# Taint a GPU node pool to prevent non-GPU workloads
# (typically configured at the node pool level in cloud provider settings)
# Nodes in this pool have:
# taint: nvidia.com/gpu=present:NoSchedule
# Pod that tolerates the taint
apiVersion: v1
kind: Pod
metadata:
name: gpu-workload
spec:
tolerations:
- key: "nvidia.com/gpu"
operator: "Equal"
value: "present"
effect: "NoSchedule"
containers:
- name: inference
image: inference:latest
resources:
limits:
nvidia.com/gpu: 1
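When the cloud provider does not manage taints at the pool level, the taint from the comments above can be applied by hand. A sketch using the label and taint key from this section; the helper names are my own:

```shell
# Hypothetical helper: taint every labeled GPU node so that only pods
# with a matching toleration can schedule there.
taint_gpu_nodes() {
  kubectl taint nodes -l accelerator=nvidia-a100 \
    nvidia.com/gpu=present:NoSchedule --overwrite
}

# Removing the taint later: same key with a trailing dash.
untaint_gpu_nodes() {
  kubectl taint nodes -l accelerator=nvidia-a100 nvidia.com/gpu-
}
```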
6. Node Problem Detector
The Node Problem Detector (NPD) is a DaemonSet that monitors nodes for various issues and reports them as node conditions or events to the Kubernetes API server.
NPD detects problems such as:
- Kernel issues: deadlocks, hung tasks, corrupted memory (via kernel log parsing).
- Filesystem issues: ext4 errors, read-only filesystem, inode exhaustion.
- Container runtime issues: unresponsive Docker/containerd daemon, high error rates.
- Hardware issues: MCE (Machine Check Exception) errors, NMI (Non-Maskable Interrupt) events.
- Custom checks: You can define custom monitor scripts that NPD runs periodically.
# Deploy Node Problem Detector
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: node-problem-detector
namespace: kube-system
spec:
selector:
matchLabels:
app: node-problem-detector
template:
metadata:
labels:
app: node-problem-detector
spec:
hostPID: true # Access host process information
containers:
- name: node-problem-detector
image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.19
command: ["/node-problem-detector"]
args: ["--logtostderr",
"--config.system-log-monitor=/config/kernel-monitor.json"]
volumeMounts:
- name: log
mountPath: /var/log
readOnly: true
securityContext:
privileged: true
volumes:
- name: log
hostPath:
path: /var/log
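The custom checks mentioned above are configured as JSON monitor definitions mounted into the NPD pod. A sketch of a custom plugin monitor; the script path, condition names, and intervals are illustrative:

```json
{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "60s",
    "timeout": "10s"
  },
  "source": "ntp-custom-plugin-monitor",
  "conditions": [
    {
      "type": "NTPProblem",
      "reason": "NTPIsUp",
      "message": "ntp service is up"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "NTPProblem",
      "reason": "NTPIsDown",
      "path": "/config/plugin/check_ntp.sh",
      "timeout": "5s"
    }
  ]
}
```

A "permanent" rule sets a node condition that stays until cleared, while "temporary" rules emit one-off events; the referenced script signals a problem via its exit code.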
Automated Remediation
Detecting a "Bad Node" is only half the battle. You need to fix it.
- Draino: A tool that watches node conditions (set by NPD) and automatically cordons and drains the node.
- Cluster Autoscaler: Can be configured to terminate nodes that have been Unready for too long.
- Remediation Loop:
NPD detects issue -> Taints Node -> Draino Drains Node -> Autoscaler Terminates Node -> Cloud Provider Replaces Node.
7. Graceful Node Shutdown
Since Kubernetes 1.21, the kubelet supports graceful node shutdown (beta and enabled by default, but inactive until shutdownGracePeriod is configured). When the node receives a shutdown signal (via systemd inhibitor locks), the kubelet terminates pods in priority order before the node powers off.
# KubeletConfiguration for graceful shutdown
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: 60s # Total time for shutdown
shutdownGracePeriodCriticalPods: 15s # Time reserved for critical pods
# Non-critical pods get: 60s - 15s = 45s to terminate
# Critical pods (system-node-critical, system-cluster-critical) get: 15s
Priority-Based Graceful Shutdown (since 1.24)
For finer control, you can define shutdown grace periods per priority class:
shutdownGracePeriodByPodPriority:
- priority: 0 # BestEffort/low-priority pods
shutdownGracePeriodSeconds: 10
- priority: 10000 # Standard workloads
shutdownGracePeriodSeconds: 30
- priority: 100000 # High-priority services
shutdownGracePeriodSeconds: 45
- priority: 2000000000 # system-critical pods
shutdownGracePeriodSeconds: 60
The kubelet terminates pods in ascending priority order -- lowest priority pods are killed first, giving the highest priority pods the most time to shut down gracefully.
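The priority values in that table come from PriorityClasses assigned to pods. A sketch of a class matching the 100000 tier above; the class name and description are illustrative:

```yaml
# Pods referencing this class get priority 100000; under the
# shutdownGracePeriodByPodPriority config above they would get 45s.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-service
value: 100000
globalDefault: false
description: "High-priority services; terminated late during node shutdown."
```

Pods opt in via `priorityClassName: high-priority-service` in their spec, so the same class drives both scheduling preemption and shutdown ordering.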
8. Spot / Preemptible Node Handling
Cloud spot instances (AWS), preemptible VMs (GCP), and spot VMs (Azure) offer 60-90% cost savings but can be reclaimed with as little as 30 seconds notice.
Strategies for Handling Spot Termination
- Termination handlers: Deploy a DaemonSet (e.g., aws-node-termination-handler) that watches for spot termination notices via the instance metadata service and automatically cordons and drains the node.
- PDBs for all workloads: Ensure every workload has a PDB so that the drain triggered by the termination handler does not violate availability guarantees.
- Short grace periods: Set terminationGracePeriodSeconds to 25 seconds or less, because you may have only 30 seconds total.
- Spread constraints: Use topology spread constraints or pod anti-affinity to ensure that not all replicas of a service are on spot nodes in the same availability zone.
- Mixed node pools: Run a baseline of on-demand nodes for critical services and use spot nodes for burst capacity, batch jobs, and stateless workloads.
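A termination handler is conceptually simple: poll the metadata service, and drain when a notice appears. The sketch below targets the AWS spot metadata endpoint; the function name, polling interval, and timeouts are illustrative, and production clusters should run a maintained handler such as aws-node-termination-handler instead:

```shell
# Hypothetical sketch of a spot termination watcher, run on the node itself.
# Usage: watch_spot_termination "$NODE_NAME"
watch_spot_termination() {
  local node="$1"
  while true; do
    # The endpoint returns 404 until AWS issues a termination notice.
    if curl -sf http://169.254.169.254/latest/meta-data/spot/instance-action >/dev/null; then
      kubectl cordon "$node"
      kubectl drain "$node" \
        --ignore-daemonsets \
        --delete-emptydir-data \
        --grace-period=25 \
        --timeout=100s   # leave headroom within the ~2 min AWS notice
      return 0
    fi
    sleep 5
  done
}
```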
# Topology spread: distribute across spot and on-demand nodes
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-frontend
spec:
replicas: 6
template:
spec:
topologySpreadConstraints:
- maxSkew: 2
topologyKey: node.kubernetes.io/lifecycle # "spot" or "on-demand"
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: web-frontend
terminationGracePeriodSeconds: 25 # Fast shutdown for spot
containers:
- name: frontend
image: frontend:latest
9. Common Pitfalls
- Draining without PDBs. Without a PDB, kubectl drain evicts all pods simultaneously. If all replicas of a service are on the drained node, the service experiences a complete outage until pods are rescheduled.
- Forgetting --ignore-daemonsets. DaemonSet pods cannot be evicted (they are meant to run on every node). Without this flag, drain fails immediately.
- Stuck drains due to overly strict PDBs. If minAvailable equals the replica count, drain can never evict any pod, and the drain command hangs until it times out. Always ensure minAvailable < replicaCount or use maxUnavailable >= 1.
- Not waiting for replacement pods. After draining, verify that evicted pods have been rescheduled and are Running before proceeding with maintenance.
- Forgetting to uncordon. A cordoned node still consumes cluster resources but accepts no pods. After maintenance, always uncordon. Set up monitoring alerts for nodes that remain cordoned for more than a few hours.
- Draining nodes with local persistent volumes. Pods using local PersistentVolumes cannot be rescheduled to other nodes. Drain will fail unless you use --force, which deletes the pod without rescheduling. Plan data migration separately.
- Ignoring graceful shutdown configuration. Without graceful shutdown, an unexpected node power-off kills all pods with SIGKILL. Applications that need to flush data, close connections, or deregister from service discovery will lose data or cause downstream errors.
10. What's Next?
- Evictions: Understand how node-pressure eviction (kubelet-initiated) differs from the API-initiated eviction used by drain. See Evictions.
- Descheduler: After uncordoning nodes, the Descheduler can actively rebalance pods to utilize the returned capacity. See Descheduler.
- Cluster Autoscaler: Understand how the Cluster Autoscaler removes underutilized nodes (which involves draining) and adds new ones.
- Custom Schedulers: Learn how custom schedulers can influence pod placement during the rescheduling that follows a drain. See Custom Schedulers.
- Monitoring: Set up alerts for kube_node_spec_unschedulable == 1 (cordoned nodes) and kube_node_status_condition{condition="Ready",status="false"} (unhealthy nodes).
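If you run the Prometheus Operator alongside kube-state-metrics, those metrics can be wired into alert rules directly. A sketch; the rule names, namespace, and thresholds are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-operations-alerts
  namespace: monitoring
spec:
  groups:
  - name: node-operations
    rules:
    - alert: NodeCordonedTooLong
      expr: kube_node_spec_unschedulable == 1
      for: 4h
      labels:
        severity: warning
      annotations:
        summary: "Node {{ $labels.node }} has been cordoned for over 4 hours."
    - alert: NodeNotReady
      expr: kube_node_status_condition{condition="Ready",status="false"} == 1
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "Node {{ $labels.node }} is NotReady."
```

The `for:` durations are what make these actionable: a node cordoned for minutes during a drain is normal, while one cordoned for hours is almost certainly a forgotten uncordon.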