Scheduling: Taints & Tolerations

Key Takeaways
  • Node Repulsion: Taints allow nodes to repel Pods that do not specifically "tolerate" the taint. This is the inverse of affinity -- instead of attracting pods, taints push them away.
  • Pod Permission: Tolerations allow Pods to be scheduled on nodes with matching taints, though they do not guarantee scheduling. A toleration merely removes the repulsion; the scheduler still uses other factors (resources, affinity) to make the final placement decision.
  • Three Taint Effects: NoSchedule prevents new pods from landing on a node, PreferNoSchedule is a soft version that the scheduler tries to avoid, and NoExecute evicts already-running pods that lack a matching toleration.
  • Isolation Use Cases: Common patterns include isolating specialized hardware (GPUs, FPGAs), protecting Control Plane nodes from user workloads, and dedicating nodes to specific teams or tenants.
  • Node Maintenance: Taints are essential for marking nodes as "off-limits" during maintenance or drainage operations, allowing graceful workload migration before taking a node offline.
  • DaemonSet Behavior: DaemonSets automatically add tolerations for certain taints (like node.kubernetes.io/not-ready and node.kubernetes.io/unreachable) to ensure system-level pods run on every node.

Kubernetes scheduling is about finding the right node for a Pod. One of the most powerful mechanisms for controlling placement is Taints and Tolerations.

  • Taint: Applied to a Node. It says "Do not schedule anything here unless it has a matching toleration."
  • Toleration: Applied to a Pod. It says "I can tolerate this taint -- I am allowed to schedule on nodes that have it."

Think of taints as a "No Entry" sign on a node's door, and tolerations as the key that lets specific pods through. Critically, having the key does not force you through the door -- it simply allows it. The scheduler still considers resources, affinity rules, and other constraints before making a final decision.

How Taints Work

A taint consists of three parts:

key=value:effect
  • Key: A string identifier that follows Kubernetes label-key syntax (e.g., gpu, team, node.kubernetes.io/not-ready).
  • Value: An optional string value associated with the key (e.g., nvidia, frontend).
  • Effect: One of three values that determines the taint's behavior.
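To make the three-part structure concrete, here is a minimal, illustrative parser for the key=value:effect format. This is not kubectl code; the function name parse_taint is made up for this sketch. Note that the value (and its = sign) may be absent:

```python
def parse_taint(taint: str) -> dict:
    """Split a taint string of the form key=value:effect (value optional)."""
    body, _, effect = taint.rpartition(":")   # effect comes after the last colon
    key, _, value = body.partition("=")       # value is optional; "" if absent
    return {"key": key, "value": value, "effect": effect}

print(parse_taint("gpu=nvidia:NoSchedule"))
# {'key': 'gpu', 'value': 'nvidia', 'effect': 'NoSchedule'}
print(parse_taint("maintenance:NoExecute"))
# {'key': 'maintenance', 'value': '', 'effect': 'NoExecute'}
```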

Taint Effects Explained

NoSchedule

The strictest effect. New pods that do not tolerate this taint will never be scheduled on the node. Pods already running on the node are unaffected.

# Apply a NoSchedule taint to a node
kubectl taint nodes node1 gpu=nvidia:NoSchedule

Use this when you absolutely must prevent certain workloads from landing on a node, such as reserving GPU nodes exclusively for machine learning jobs.

PreferNoSchedule

A soft version of NoSchedule. The scheduler will try to avoid placing pods that lack the toleration on this node, but if no other nodes are available, it will schedule them there anyway.

# Apply a PreferNoSchedule taint
kubectl taint nodes node2 environment=staging:PreferNoSchedule

Use this for preferences rather than hard requirements. For example, you might prefer to keep development workloads off production nodes, but allow overflow if the cluster is under pressure.

NoExecute

The most aggressive effect. This taint applies to both new and existing pods. When a NoExecute taint is added to a node:

  1. New pods without a matching toleration will not be scheduled.
  2. Existing pods without a matching toleration are immediately evicted.
  3. Pods with a matching toleration that specifies tolerationSeconds will be evicted after that duration.

# Apply a NoExecute taint -- this will evict non-tolerating pods
kubectl taint nodes node3 maintenance=true:NoExecute

This is the go-to effect for node maintenance. When you need to drain a node, applying a NoExecute taint ensures all non-essential workloads are moved elsewhere.

Toleration Syntax and Operators

A toleration in a pod spec can use one of two operators:

operator: Equal (default)

Matches when key, value, and effect all match exactly:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job
spec:
  tolerations:
  - key: "gpu"
    operator: "Equal"  # Key and value must match exactly
    value: "nvidia"
    effect: "NoSchedule"
  containers:
  - name: trainer
    image: my-ml-training:latest
    resources:
      limits:
        nvidia.com/gpu: 1

operator: Exists

Matches when the key exists, regardless of value. You can omit the value field entirely:

apiVersion: v1
kind: Pod
metadata:
  name: monitoring-agent
spec:
  tolerations:
  - key: "gpu"
    operator: "Exists"  # Matches any value for key "gpu"
    effect: "NoSchedule"
  containers:
  - name: agent
    image: monitoring-agent:latest
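The matching rules for the two operators can be sketched in a few lines of Python. This is a simplified model of the scheduler's behavior, not its actual code; the function name tolerates is made up for illustration. It also captures two rules discussed later: a toleration that omits effect matches all effects, and an Exists toleration with an empty key matches every taint:

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified sketch of toleration-vs-taint matching."""
    # An omitted effect in the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # Exists: only the key must match; an empty key matches all taints.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    # Equal (the default): key and value must both match exactly.
    return (toleration.get("key") == taint["key"]
            and toleration.get("value") == taint["value"])

taint = {"key": "gpu", "value": "nvidia", "effect": "NoSchedule"}
print(tolerates({"key": "gpu", "operator": "Exists", "effect": "NoSchedule"}, taint))  # True
print(tolerates({"key": "gpu", "operator": "Equal", "value": "amd",
                 "effect": "NoSchedule"}, taint))                                      # False
```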

Tolerating All Taints

You can create a pod that tolerates every possible taint by omitting both key and value and using operator: Exists. This is a powerful (and dangerous) pattern:

apiVersion: v1
kind: Pod
metadata:
  name: omnipresent-agent
spec:
  tolerations:
  - operator: "Exists"  # Tolerates ALL taints on ALL nodes
  containers:
  - name: collector
    image: log-collector:latest

This pattern is typically used only by infrastructure-level pods (monitoring agents, log collectors) that must run on every node, including control plane and tainted nodes.

Control Plane Taints

By default, Kubernetes control plane nodes carry the following taint:

node-role.kubernetes.io/control-plane:NoSchedule

This prevents user workloads from running on control plane nodes. The control plane components (API server, scheduler, controller manager, etcd) tolerate this taint automatically. If you are running a single-node cluster for development, you may need to remove this taint to schedule workloads:

# Remove the control plane taint (single-node clusters only)
kubectl taint nodes my-control-plane node-role.kubernetes.io/control-plane:NoSchedule-

Note the trailing hyphen (-), which removes the taint.

nodeSelector Basics

Before diving deeper into taints, it is worth mentioning nodeSelector, the simplest form of node selection. While taints repel pods, nodeSelector attracts them:

apiVersion: v1
kind: Pod
metadata:
  name: fast-storage-pod
spec:
  nodeSelector:
    disktype: ssd  # Only schedule on nodes labeled disktype=ssd
  containers:
  - name: database
    image: postgres:16

nodeSelector is a hard requirement. If no node has the matching label, the pod stays Pending. For more flexible attraction rules, see Scheduling (Affinity).

Real-World Scenarios

Scenario 1: GPU Node Isolation

Reserve GPU nodes exclusively for ML workloads. Non-GPU pods should never waste expensive GPU node resources:

# Taint all GPU nodes
kubectl taint nodes gpu-node-1 gpu-node-2 gpu-node-3 \
hardware=gpu:NoSchedule

# ML training pod that tolerates the GPU taint
apiVersion: batch/v1
kind: Job
metadata:
  name: model-training
spec:
  template:
    spec:
      tolerations:
      - key: "hardware"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      nodeSelector:
        hardware: gpu  # Also ensure we land on GPU nodes
      containers:
      - name: trainer
        image: training-pipeline:v2.1
        resources:
          limits:
            nvidia.com/gpu: 2
      restartPolicy: Never

Notice the combination of a toleration (to be allowed on the node) with a nodeSelector (to guarantee placement there). The toleration alone does not ensure the pod lands on a GPU node -- it only permits it.
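The "toleration permits, selector guarantees" logic can be sketched as a node-feasibility check. This is an illustrative model, not the scheduler's implementation; the names feasible and tolerated are made up for this sketch, and taint effects are omitted for brevity:

```python
def tolerated(taints: list, tolerations: list) -> bool:
    """Every taint on the node must be matched by some toleration."""
    return all(any(t["key"] == taint["key"] and t.get("value") == taint.get("value")
                   for t in tolerations)
               for taint in taints)

def feasible(node: dict, pod: dict) -> bool:
    """A node is feasible if its taints are tolerated AND its labels match the selector."""
    return (tolerated(node["taints"], pod["tolerations"])
            and all(node["labels"].get(k) == v
                    for k, v in pod["nodeSelector"].items()))

gpu_node = {"labels": {"hardware": "gpu"},
            "taints": [{"key": "hardware", "value": "gpu"}]}
cpu_node = {"labels": {}, "taints": []}
job = {"tolerations": [{"key": "hardware", "value": "gpu"}],
       "nodeSelector": {"hardware": "gpu"}}

print(feasible(gpu_node, job))  # True
print(feasible(cpu_node, job))  # False: the selector keeps it off plain nodes
```

Drop the nodeSelector from the job and the untainted cpu_node becomes feasible too, which is exactly the "toleration alone does not ensure placement" point.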

Scenario 2: Dedicated Team Nodes

In a multi-tenant cluster, dedicate specific nodes to specific teams:

# Reserve nodes for the data-science team
kubectl taint nodes ds-node-1 ds-node-2 team=data-science:NoSchedule

# Data science team pod
apiVersion: v1
kind: Pod
metadata:
  name: jupyter-notebook
  labels:
    team: data-science
spec:
  tolerations:
  - key: "team"
    operator: "Equal"
    value: "data-science"
    effect: "NoSchedule"
  nodeSelector:
    team: data-science
  containers:
  - name: jupyter
    image: jupyter/scipy-notebook:latest

Scenario 3: Node Maintenance with NoExecute

When performing maintenance on a node, use NoExecute with tolerationSeconds to give workloads time to migrate:

# Start maintenance -- evict all non-tolerating pods
kubectl taint nodes node-to-maintain maintenance=scheduled:NoExecute

# A critical pod that can stay during short maintenance windows
apiVersion: v1
kind: Pod
metadata:
  name: critical-cache
spec:
  tolerations:
  - key: "maintenance"
    operator: "Equal"
    value: "scheduled"
    effect: "NoExecute"
    tolerationSeconds: 3600  # Stay for 1 hour, then get evicted
  containers:
  - name: redis
    image: redis:7-alpine

After maintenance is complete, remove the taint:

kubectl taint nodes node-to-maintain maintenance=scheduled:NoExecute-

How DaemonSets Automatically Tolerate Taints

DaemonSets are designed to run one pod per node (for log collectors, monitoring agents, network plugins, etc.). To fulfill this contract, the DaemonSet controller automatically adds tolerations for several built-in taints:

  • node.kubernetes.io/not-ready (NoExecute)
  • node.kubernetes.io/unreachable (NoExecute)
  • node.kubernetes.io/disk-pressure (NoSchedule)
  • node.kubernetes.io/memory-pressure (NoSchedule)
  • node.kubernetes.io/pid-pressure (NoSchedule)
  • node.kubernetes.io/unschedulable (NoSchedule)

This means your DaemonSet pods will continue to run even when nodes are experiencing problems. However, DaemonSets do not automatically tolerate user-defined taints. If you taint a node with gpu=nvidia:NoSchedule, you must explicitly add that toleration to your DaemonSet spec if you want it to run on GPU nodes:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      tolerations:
      - operator: "Exists"  # Tolerate all taints to run everywhere
      containers:
      - name: exporter
        image: prom/node-exporter:latest
        ports:
        - containerPort: 9100

Combining Taints with Node Affinity

Taints and affinity serve complementary purposes. A common pattern is:

  1. Taint the node to repel unwanted pods.
  2. Label the node for identification.
  3. Use nodeSelector or node affinity on desired pods to attract them to the node.
  4. Add a toleration on those same pods so they can pass the taint.

Without this combination, your toleration-bearing pod might land on any node (not just the target), or your affinity-bearing pod might target a tainted node it cannot access.

Common Pitfalls

  1. Toleration without nodeSelector: Adding a toleration allows a pod to schedule on a tainted node but does not guarantee it. Without nodeSelector or node affinity, the pod might land on any non-tainted node instead.

  2. Forgetting to remove taints after maintenance: If you add a NoSchedule taint for maintenance and forget to remove it, new pods will never schedule on that node. Automate taint removal as part of your maintenance runbook.

  3. Typos in taint keys or values: Taints and tolerations are matched by exact string comparison. A toleration for key: "gpu" will not match a taint with key GPU or gpus. There is no validation that a toleration matches any existing taint.

  4. Using NoExecute carelessly: Applying a NoExecute taint immediately evicts all non-tolerating pods. On a busy node, this can cause a cascade of rescheduling. Use kubectl drain for a more controlled approach, or apply NoExecute during a low-traffic window.

  5. Overusing the "tolerate all" pattern: Using operator: Exists without a key means the pod can land anywhere, including control plane nodes. Only do this for infrastructure-level workloads.

  6. Ignoring effect in tolerations: A toleration must match the taint's effect. If a node has gpu=nvidia:NoSchedule and your toleration specifies effect: NoExecute, it will not match. Omitting the effect field in a toleration matches all effects for that key.

Best Practices

  1. Use meaningful taint keys: Adopt a consistent naming convention like team=<name>, hardware=<type>, or environment=<env>. This makes it easy to understand the intent of each taint.

  2. Combine taints with labels: Always label a node with the same category you taint it with. This allows pods to both tolerate the taint and select the node via nodeSelector or affinity.

  3. Document your taint strategy: Maintain a registry of all taints used in your cluster and their purpose. This prevents confusion when new team members join.

  4. Use PreferNoSchedule for soft isolation: When you want to discourage but not forbid scheduling, PreferNoSchedule gives the scheduler flexibility under pressure.

  5. Set tolerationSeconds for NoExecute: When pods tolerate a NoExecute taint, always consider setting tolerationSeconds to prevent pods from running indefinitely on a problematic node.

  6. Automate taint management: Use node lifecycle controllers or admission webhooks to automatically apply taints based on node conditions, labels, or external events.

What's Next?

  • Scheduling (Affinity): Learn how to attract pods to specific nodes using node affinity and co-locate related pods with pod affinity.
  • Priority & Preemption: Understand what happens when tainted nodes are full and the scheduler must choose which pods to evict.
  • Resources: Learn how resource requests and limits interact with scheduling decisions.
  • Troubleshooting: Debug pods stuck in Pending due to taints and other scheduling constraints.