Scheduling: Taints & Tolerations
- Node Repulsion: Taints allow nodes to repel Pods that do not specifically "tolerate" the taint. This is the inverse of affinity -- instead of attracting pods, taints push them away.
- Pod Permission: Tolerations allow Pods to be scheduled on nodes with matching taints, though they do not guarantee scheduling. A toleration merely removes the repulsion; the scheduler still uses other factors (resources, affinity) to make the final placement decision.
- Three Taint Effects: `NoSchedule` prevents new pods from landing on a node, `PreferNoSchedule` is a soft version that the scheduler tries to avoid, and `NoExecute` evicts already-running pods that lack a matching toleration.
- Isolation Use Cases: Common patterns include isolating specialized hardware (GPUs, FPGAs), protecting Control Plane nodes from user workloads, and dedicating nodes to specific teams or tenants.
- Node Maintenance: Taints are essential for marking nodes as "off-limits" during maintenance or drainage operations, allowing graceful workload migration before taking a node offline.
- DaemonSet Behavior: DaemonSets automatically add tolerations for certain taints (like `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable`) to ensure system-level pods run on every node.
Kubernetes scheduling is about finding the right node for a Pod. One of the most powerful mechanisms for controlling placement is Taints and Tolerations.
- Taint: Applied to a Node. It says "Do not schedule anything here unless it has a matching toleration."
- Toleration: Applied to a Pod. It says "I can tolerate this taint -- I am allowed to schedule on nodes that have it."
Think of taints as a "No Entry" sign on a node's door, and tolerations as the key that lets specific pods through. Critically, having the key does not force you through the door -- it simply allows it. The scheduler still considers resources, affinity rules, and other constraints before making a final decision.
How Taints Work
A taint consists of three parts:
`key=value:effect`
- Key: An arbitrary string identifier (e.g., `gpu`, `team`, `node.kubernetes.io/not-ready`).
- Value: An optional string value associated with the key (e.g., `nvidia`, `frontend`).
- Effect: One of three values that determines the taint's behavior.
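The same three parts appear on the Node object itself under `spec.taints`. This excerpt is illustrative (the node name is assumed):

```yaml
# Excerpt of a Node object carrying one taint
apiVersion: v1
kind: Node
metadata:
  name: node1        # Illustrative node name
spec:
  taints:
  - key: "gpu"
    value: "nvidia"
    effect: "NoSchedule"
```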
Taint Effects Explained
NoSchedule
The strictest effect. New pods that do not tolerate this taint will never be scheduled on the node. Pods already running on the node are unaffected.
```bash
# Apply a NoSchedule taint to a node
kubectl taint nodes node1 gpu=nvidia:NoSchedule
```
Use this when you absolutely must prevent certain workloads from landing on a node, such as reserving GPU nodes exclusively for machine learning jobs.
PreferNoSchedule
A soft version of NoSchedule. The scheduler will try to avoid placing pods that lack the toleration on this node, but if no other nodes are available, it will schedule them there anyway.
```bash
# Apply a PreferNoSchedule taint
kubectl taint nodes node2 environment=staging:PreferNoSchedule
```
Use this for preferences rather than hard requirements. For example, you might prefer to keep development workloads off production nodes, but allow overflow if the cluster is under pressure.
NoExecute
The most aggressive effect. This taint applies to both new and existing pods. When a NoExecute taint is added to a node:
- New pods without a matching toleration will not be scheduled.
- Existing pods without a matching toleration are immediately evicted.
- Pods with a matching toleration that specifies `tolerationSeconds` will be evicted after that duration.
```bash
# Apply a NoExecute taint -- this will evict non-tolerating pods
kubectl taint nodes node3 maintenance=true:NoExecute
```
This is the go-to effect for node maintenance. When you need to drain a node, applying a NoExecute taint ensures all non-essential workloads are moved elsewhere.
Toleration Syntax and Operators
A toleration in a pod spec can use one of two operators:
`operator: Equal` (default)
Matches when key, value, and effect all match exactly:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job
spec:
  tolerations:
  - key: "gpu"
    operator: "Equal"     # Key and value must match exactly
    value: "nvidia"
    effect: "NoSchedule"
  containers:
  - name: trainer
    image: my-ml-training:latest
    resources:
      limits:
        nvidia.com/gpu: 1
```
`operator: Exists`
Matches when the key exists, regardless of its value. You can omit the `value` field entirely:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: monitoring-agent
spec:
  tolerations:
  - key: "gpu"
    operator: "Exists"    # Matches any value for key "gpu"
    effect: "NoSchedule"
  containers:
  - name: agent
    image: monitoring-agent:latest
```
Tolerating All Taints
You can create a pod that tolerates every possible taint by omitting both `key` and `value` and using `operator: Exists`. This is a powerful (and dangerous) pattern:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: omnipresent-agent
spec:
  tolerations:
  - operator: "Exists"    # Tolerates ALL taints on ALL nodes
  containers:
  - name: collector
    image: log-collector:latest
```
This pattern is typically used only by infrastructure-level pods (monitoring agents, log collectors) that must run on every node, including control plane and tainted nodes.
Control Plane Taints
By default, Kubernetes control plane nodes carry the following taint:
`node-role.kubernetes.io/control-plane:NoSchedule`
This prevents user workloads from running on control plane nodes. The control plane components (API server, scheduler, controller manager, etcd) tolerate this taint automatically. If you are running a single-node cluster for development, you may need to remove this taint to schedule workloads:
```bash
# Remove the control plane taint (single-node clusters only)
kubectl taint nodes my-control-plane node-role.kubernetes.io/control-plane:NoSchedule-
```
Note the trailing `-`, which removes the taint.
nodeSelector Basics
Before diving deeper into taints, it is worth mentioning nodeSelector, the simplest form of node selection. While taints repel pods, nodeSelector attracts them:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fast-storage-pod
spec:
  nodeSelector:
    disktype: ssd         # Only schedule on nodes labeled disktype=ssd
  containers:
  - name: database
    image: postgres:16
```
nodeSelector is a hard requirement. If no node has the matching label, the pod stays Pending. For more flexible attraction rules, see Scheduling (Affinity).
Real-World Scenarios
Scenario 1: GPU Node Isolation
Reserve GPU nodes exclusively for ML workloads. Non-GPU pods should never waste expensive GPU node resources:
```bash
# Taint all GPU nodes
kubectl taint nodes gpu-node-1 gpu-node-2 gpu-node-3 \
  hardware=gpu:NoSchedule
```
```yaml
# ML training pod that tolerates the GPU taint
apiVersion: batch/v1
kind: Job
metadata:
  name: model-training
spec:
  template:
    spec:
      tolerations:
      - key: "hardware"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      nodeSelector:
        hardware: gpu     # Also ensure we land on GPU nodes
      containers:
      - name: trainer
        image: training-pipeline:v2.1
        resources:
          limits:
            nvidia.com/gpu: 2
      restartPolicy: Never
```
Notice the combination of a toleration (to be allowed on the node) with a nodeSelector (to guarantee placement there). The toleration alone does not ensure the pod lands on a GPU node -- it only permits it.
Scenario 2: Dedicated Team Nodes
In a multi-tenant cluster, dedicate specific nodes to specific teams:
```bash
# Reserve nodes for the data-science team
kubectl taint nodes ds-node-1 ds-node-2 team=data-science:NoSchedule
```
```yaml
# Data science team pod
apiVersion: v1
kind: Pod
metadata:
  name: jupyter-notebook
  labels:
    team: data-science
spec:
  tolerations:
  - key: "team"
    operator: "Equal"
    value: "data-science"
    effect: "NoSchedule"
  nodeSelector:
    team: data-science
  containers:
  - name: jupyter
    image: jupyter/scipy-notebook:latest
```
Scenario 3: Node Maintenance with NoExecute
When performing maintenance on a node, use NoExecute with tolerationSeconds to give workloads time to migrate:
```bash
# Start maintenance -- evict all non-tolerating pods
kubectl taint nodes node-to-maintain maintenance=scheduled:NoExecute
```
```yaml
# A critical pod that can stay during short maintenance windows
apiVersion: v1
kind: Pod
metadata:
  name: critical-cache
spec:
  tolerations:
  - key: "maintenance"
    operator: "Equal"
    value: "scheduled"
    effect: "NoExecute"
    tolerationSeconds: 3600   # Stay for 1 hour, then get evicted
  containers:
  - name: redis
    image: redis:7-alpine
```
After maintenance is complete, remove the taint:
```bash
kubectl taint nodes node-to-maintain maintenance=scheduled:NoExecute-
```
How DaemonSets Automatically Tolerate Taints
DaemonSets are designed to run one pod per node (for log collectors, monitoring agents, network plugins, etc.). To fulfill this contract, the DaemonSet controller automatically adds tolerations for several built-in taints:
- `node.kubernetes.io/not-ready` (NoExecute)
- `node.kubernetes.io/unreachable` (NoExecute)
- `node.kubernetes.io/disk-pressure` (NoSchedule)
- `node.kubernetes.io/memory-pressure` (NoSchedule)
- `node.kubernetes.io/pid-pressure` (NoSchedule)
- `node.kubernetes.io/unschedulable` (NoSchedule)
This means your DaemonSet pods will continue to run even when nodes are experiencing problems. However, DaemonSets do not automatically tolerate user-defined taints. If you taint a node with `gpu=nvidia:NoSchedule`, you must explicitly add that toleration to your DaemonSet spec if you want it to run on GPU nodes:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      tolerations:
      - operator: "Exists"    # Tolerate all taints to run everywhere
      containers:
      - name: exporter
        image: prom/node-exporter:latest
        ports:
        - containerPort: 9100
```
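If granting a DaemonSet blanket access to every node feels too broad, a narrower toleration scoped to the user-defined taint is safer. This sketch assumes the `gpu=nvidia:NoSchedule` taint mentioned above:

```yaml
# Narrower alternative: tolerate only the gpu taint, not everything
tolerations:
- key: "gpu"
  operator: "Exists"    # Any value for key "gpu"
  effect: "NoSchedule"
```

With this, the DaemonSet still covers GPU nodes but remains excluded from nodes carrying unrelated taints, such as the control plane taint.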
Combining Taints with Node Affinity
Taints and affinity serve complementary purposes. A common pattern is:
- Taint the node to repel unwanted pods.
- Label the node for identification.
- Use nodeSelector or node affinity on desired pods to attract them to the node.
- Add a toleration on those same pods so they can pass the taint.
Without this combination, your toleration-bearing pod might land on any node (not just the target), or your affinity-bearing pod might target a tainted node it cannot access.
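The four steps above can be sketched in a single pod spec. This example assumes a node tainted with `dedicated=analytics:NoSchedule` and labeled `dedicated=analytics` (both names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: analytics-worker
spec:
  tolerations:
  - key: "dedicated"            # Pass the taint (step 4)
    operator: "Equal"
    value: "analytics"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "dedicated"    # Target the labeled node (step 3)
            operator: In
            values: ["analytics"]
  containers:
  - name: worker
    image: analytics-worker:latest
```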
Common Pitfalls
- Toleration without nodeSelector: Adding a toleration allows a pod to schedule on a tainted node but does not guarantee it. Without `nodeSelector` or node affinity, the pod might land on any non-tainted node instead.
- Forgetting to remove taints after maintenance: If you add a `NoSchedule` taint for maintenance and forget to remove it, new pods will never schedule on that node. Automate taint removal as part of your maintenance runbook.
- Typos in taint keys or values: Taints and tolerations are matched by exact string comparison. A toleration for `key: "gpu"` will not match a taint with key `GPU` or `gpus`. There is no validation that a toleration matches any existing taint.
- Using NoExecute carelessly: Applying a `NoExecute` taint immediately evicts all non-tolerating pods. On a busy node, this can cause a cascade of rescheduling. Use `kubectl drain` for a more controlled approach, or apply `NoExecute` during a low-traffic window.
- Overusing the "tolerate all" pattern: Using `operator: Exists` without a key means the pod can land anywhere, including control plane nodes. Only do this for infrastructure-level workloads.
- Ignoring effect in tolerations: A toleration must match the taint's effect. If a node has `gpu=nvidia:NoSchedule` and your toleration specifies `effect: NoExecute`, it will not match. Omitting the `effect` field in a toleration matches all effects for that key.
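As a sketch of the last point, omitting `effect` lets one toleration cover any effect attached to the same key:

```yaml
tolerations:
- key: "gpu"
  operator: "Equal"
  value: "nvidia"
  # No "effect" field: matches NoSchedule, PreferNoSchedule, and NoExecute
```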
Best Practices
- Use meaningful taint keys: Adopt a consistent naming convention like `team=<name>`, `hardware=<type>`, or `environment=<env>`. This makes it easy to understand the intent of each taint.
- Combine taints with labels: Always label a node with the same category you taint it with. This allows pods to both tolerate the taint and select the node via `nodeSelector` or affinity.
- Document your taint strategy: Maintain a registry of all taints used in your cluster and their purpose. This prevents confusion when new team members join.
- Use PreferNoSchedule for soft isolation: When you want to discourage but not forbid scheduling, `PreferNoSchedule` gives the scheduler flexibility under pressure.
- Set tolerationSeconds for NoExecute: When pods tolerate a `NoExecute` taint, always consider setting `tolerationSeconds` to prevent pods from running indefinitely on a problematic node.
- Automate taint management: Use node lifecycle controllers or admission webhooks to automatically apply taints based on node conditions, labels, or external events.
What's Next?
- Scheduling (Affinity): Learn how to attract pods to specific nodes using node affinity and co-locate related pods with pod affinity.
- Priority & Preemption: Understand what happens when tainted nodes are full and the scheduler must choose which pods to evict.
- Resources: Learn how resource requests and limits interact with scheduling decisions.
- Troubleshooting: Debug pods stuck in `Pending` due to taints and other scheduling constraints.