
The Descheduler

Key Takeaways
  • Post-Scheduling Optimization: The Descheduler identifies pods that are suboptimally placed and evicts them, allowing the standard scheduler to reschedule them to better nodes. It does not schedule pods itself.
  • Cluster Drift: Over time, clusters become imbalanced due to node additions/removals, label changes, shifting traffic patterns, and pod restarts. The Descheduler corrects this drift.
  • Strategies/Plugins: Key strategies include RemoveDuplicates (spread replicas across nodes), LowNodeUtilization (rebalance from overloaded to underutilized nodes), RemovePodsViolatingInterPodAntiAffinity, RemovePodsViolatingNodeAffinity, and RemovePodsViolatingTopologySpreadConstraint.
  • Safety Mechanisms: The Descheduler respects PodDisruptionBudgets, can limit the maximum number of evictions per run, and can be configured to skip pods with local storage or critical priority classes.
  • Deployment Modes: Can run as a CronJob (periodic batch rebalancing), a Deployment (continuous background rebalancing), or a one-shot Job.

The default Kubernetes scheduler makes a placement decision once -- at the moment a pod is created. After that, the pod stays on its assigned node until it terminates, is evicted, or is explicitly deleted. Over time, this leads to cluster drift: some nodes become overloaded while others sit idle, affinity rules are violated after label changes, and topology spread constraints are no longer satisfied.

The Descheduler addresses this by periodically evaluating the cluster state and evicting pods that would be better placed elsewhere.

1. Why Clusters Become Imbalanced

The standard Scheduler only acts when a pod is created. The Descheduler runs periodically to fix imbalances (like node hotspots or affinity violations) by evicting pods so they land on better nodes.

Several common scenarios cause suboptimal pod placement:

  • Node scaling: New nodes are added to the cluster (via autoscaler or manual action), but existing pods do not migrate to them. The new nodes remain underutilized.
  • Node drains and uncordons: After maintenance, an uncordoned node receives no pods until new pods are created. The cluster remains imbalanced until natural churn occurs.
  • Label and taint changes: A node's labels are updated, but pods that were scheduled based on old labels remain in place, potentially violating their own affinity rules.
  • Resource usage drift: A pod's actual resource consumption changes over time (e.g., traffic shifts), making the original scheduling decision suboptimal.
  • Rolling updates: After a rolling update, new pods may cluster on a subset of nodes based on scheduling conditions at the time, leaving other nodes with stale pods.
  • Spot instance churn: When spot nodes are reclaimed and replaced, pods from the reclaimed nodes may pile up on the remaining stable nodes.

2. How the Descheduler Works

The Descheduler follows a straightforward process:

  1. Snapshot: It reads the current state of all nodes and pods from the Kubernetes API.
  2. Evaluate: It runs each enabled strategy/plugin against the snapshot, identifying pods that should be evicted.
  3. Evict: It sends eviction requests to the Kubernetes API for the identified pods. These evictions go through the standard Eviction API, which respects PodDisruptionBudgets.
  4. Reschedule: The evicted pods' controllers (Deployments, StatefulSets, ReplicaSets) detect the missing pods and create replacements. The standard scheduler places these new pods, now seeing the updated cluster state (including newly added nodes, corrected labels, etc.).

The Descheduler never binds pods to nodes. It only evicts. The standard scheduler (or your custom scheduler) handles the re-placement.
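Concretely, each eviction in step 3 is a request against the pod's `eviction` subresource. A sketch of the object the Descheduler effectively submits (the pod name and namespace here are illustrative):

```yaml
# POSTed to /api/v1/namespaces/default/pods/web-7d4b9c-abc12/eviction.
# Because this goes through the Eviction API, PDBs are checked server-side.
apiVersion: policy/v1
kind: Eviction
metadata:
  name: web-7d4b9c-abc12   # Pod to evict (hypothetical name)
  namespace: default
```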

3. Descheduler Strategies (Plugins)

RemoveDuplicates

Ensures that no more than one replica of a ReplicaSet, Deployment, or StatefulSet runs on the same node. If two replicas of the same Deployment end up on one node (e.g., because other nodes were temporarily unavailable), the Descheduler evicts one so the scheduler can place it on a different node.

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
- name: default
  pluginConfig:
  - name: RemoveDuplicates
    args:
      excludeOwnerKinds:   # Don't deduplicate these controller types
      - DaemonSet
  plugins:
    balance:               # RemoveDuplicates is a balance plugin in v1alpha2
      enabled:
      - RemoveDuplicates

LowNodeUtilization

The most impactful strategy. It identifies nodes whose resource utilization (CPU, memory, or pod count) is above a high threshold (overutilized) and evicts pods from them so they can be rescheduled to nodes whose utilization is below a low threshold (underutilized).

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
- name: default
  pluginConfig:
  - name: LowNodeUtilization
    args:
      thresholds:          # Below these = underutilized
        cpu: 20
        memory: 20
        pods: 15
      targetThresholds:    # Above these = overutilized
        cpu: 70
        memory: 70
        pods: 80
      # Minimum number of underutilized nodes that must exist before
      # the strategy acts; 0 disables the minimum
      numberOfNodes: 0
  plugins:
    balance:
      enabled:
      - LowNodeUtilization

The strategy computes utilization as sum(pod requests) / node allocatable, not actual usage. This is important: it rebalances based on the scheduling resource model, not real-time metrics.

RemovePodsViolatingInterPodAntiAffinity

Finds pods that violate their own inter-pod anti-affinity rules. This happens when pods were scheduled before anti-affinity rules were added to the pod spec, or when anti-affinity was satisfied at scheduling time but a subsequent pod migration broke it.
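This plugin takes no required arguments; a minimal profile enabling it might look like:

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
- name: default
  plugins:
    deschedule:            # Anti-affinity violations are a deschedule plugin
      enabled:
      - RemovePodsViolatingInterPodAntiAffinity
```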

RemovePodsViolatingNodeAffinity

Identifies pods running on nodes that no longer match their requiredDuringSchedulingIgnoredDuringExecution node affinity rules. This occurs when node labels change after scheduling.
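A sketch of the corresponding configuration, restricting the check to hard affinity rules via the nodeAffinityType argument:

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
- name: default
  pluginConfig:
  - name: RemovePodsViolatingNodeAffinity
    args:
      nodeAffinityType:    # Which affinity rules to enforce
      - requiredDuringSchedulingIgnoredDuringExecution
  plugins:
    deschedule:
      enabled:
      - RemovePodsViolatingNodeAffinity
```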

RemovePodsViolatingTopologySpreadConstraint

Detects pods that violate their topology spread constraints. After node additions, removals, or pod deletions, the distribution of pods across topology domains (zones, nodes, racks) may become skewed beyond the allowed maxSkew.
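A minimal profile enabling it, limited to hard (DoNotSchedule) constraints:

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
- name: default
  pluginConfig:
  - name: RemovePodsViolatingTopologySpreadConstraint
    args:
      constraints:
      - DoNotSchedule      # Skip soft (ScheduleAnyway) constraints
  plugins:
    balance:               # Topology spread is a balance plugin
      enabled:
      - RemovePodsViolatingTopologySpreadConstraint
```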

RemovePodsHavingTooManyRestarts

Evicts pods that have restarted more than a specified number of times. This is useful for cleaning up pods stuck in CrashLoopBackOff that are consuming resources.

pluginConfig:
- name: RemovePodsHavingTooManyRestarts
  args:
    podRestartThreshold: 10        # Evict pods with >10 restarts
    includingInitContainers: true  # Count init container restarts too

HighNodeUtilization

The inverse of LowNodeUtilization. It evicts pods from underutilized nodes to pack workloads onto fewer nodes, enabling the Cluster Autoscaler to scale down the empty nodes. This strategy optimizes for cost rather than spreading.
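A sketch of a packing configuration (thresholds are illustrative): nodes below the thresholds are treated as candidates to drain so they can be scaled away.

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
- name: default
  pluginConfig:
  - name: HighNodeUtilization
    args:
      thresholds:          # Nodes below these are drained toward busier nodes
        cpu: 20
        memory: 20
  plugins:
    balance:
      enabled:
      - HighNodeUtilization
```

Note that packing only works if the scheduler itself favors bin-packing (e.g. NodeResourcesFit with the MostAllocated scoring strategy); otherwise the evicted pods may simply spread out again.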

4. Running the Descheduler

As a CronJob (Periodic Rebalancing)

The most common deployment mode. The Descheduler runs every N minutes, evaluates the cluster, evicts suboptimal pods, and exits.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: descheduler
  namespace: kube-system
spec:
  schedule: "*/10 * * * *"    # Run every 10 minutes
  concurrencyPolicy: Forbid   # Don't overlap runs
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: descheduler
          restartPolicy: Never
          containers:
          - name: descheduler
            image: registry.k8s.io/descheduler/descheduler:v0.30.1
            command:
            - /bin/descheduler
            args:
            - --policy-config-file=/policy/policy.yaml
            - --v=3             # Verbosity level
            volumeMounts:
            - name: policy
              mountPath: /policy
          volumes:
          - name: policy
            configMap:
              name: descheduler-policy

As a Deployment (Continuous Background Rebalancing)

For continuous optimization, the Descheduler can run as a long-lived Deployment with a configurable deschedulingInterval:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: descheduler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: descheduler
  template:
    metadata:
      labels:
        app: descheduler
    spec:
      serviceAccountName: descheduler
      containers:
      - name: descheduler
        image: registry.k8s.io/descheduler/descheduler:v0.30.1
        command:
        - /bin/descheduler
        args:
        - --policy-config-file=/policy/policy.yaml
        - --descheduling-interval=5m   # Re-evaluate every 5 minutes
        - --v=3
        volumeMounts:
        - name: policy
          mountPath: /policy
      volumes:
      - name: policy
        configMap:
          name: descheduler-policy

5. Eviction Limits and Safety

The Descheduler provides several safety mechanisms to prevent excessive disruption:

Maximum Pod Eviction Limits

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
# Global eviction limits are top-level policy fields, not plugin args
maxNoOfPodsToEvictPerNode: 5        # Max pods evicted from any single node
maxNoOfPodsToEvictPerNamespace: 10  # Max pods evicted from any single namespace
maxNoOfPodsToEvictTotal: 50         # Max total evictions per run
profiles:
- name: default
  pluginConfig:
  - name: LowNodeUtilization
    args:
      thresholds:
        cpu: 20
        memory: 20
      targetThresholds:
        cpu: 70
        memory: 70

PDB Interaction

The Descheduler uses the standard Kubernetes Eviction API, which respects PodDisruptionBudgets. If evicting a pod would violate a PDB, the eviction request is rejected and the Descheduler skips that pod. This is a critical safety mechanism -- always ensure PDBs are in place for production workloads.
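For example, a PDB like the following (the app label is illustrative) guarantees the Descheduler can never take a workload below two available replicas:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
  namespace: default
spec:
  minAvailable: 2          # Evictions that would drop below 2 are rejected
  selector:
    matchLabels:
      app: web             # Hypothetical workload label
```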

Pod Filtering

You can exclude pods from descheduling based on:

  • Namespace: Exclude system namespaces (kube-system).
  • Priority class: Do not evict pods with system-cluster-critical or system-node-critical priority.
  • Annotations: Skip pods carrying the descheduler.alpha.kubernetes.io/evict=false annotation.
  • Local storage: Skip pods with local storage (emptyDir, hostPath) to avoid data loss.
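The eviction opt-out annotation is set on the pod itself, usually via the workload's pod template (names and image below are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payments-worker    # Hypothetical pod
  annotations:
    descheduler.alpha.kubernetes.io/evict: "false"   # Never evict this pod
spec:
  containers:
  - name: worker
    image: example.com/worker:1.0   # Illustrative image
```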
These filters are configured on the DefaultEvictor plugin, which every strategy consults before evicting a pod:

profiles:
- name: default
  pluginConfig:
  - name: DefaultEvictor
    args:
      evictLocalStoragePods: false    # Don't evict pods with local storage
      evictSystemCriticalPods: false  # Don't evict system-critical pods
      evictDaemonSetPods: false       # Don't evict DaemonSet pods (default)
      minReplicas: 2                  # Don't evict if the owner has <2 replicas
      nodeFit: true                   # Only evict if the pod fits elsewhere

The nodeFit: true option is particularly important: it tells the Descheduler to verify that there is at least one other node where the evicted pod could be scheduled. Without this, the Descheduler might evict a pod only for it to be rescheduled back to the same node, creating a pointless eviction loop.

6. Configuration Example: Production-Ready Policy

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
- name: production
  pluginConfig:
  # Eviction safety settings
  - name: DefaultEvictor
    args:
      evictLocalStoragePods: false
      evictSystemCriticalPods: false
      evictDaemonSetPods: false
      minReplicas: 2
      nodeFit: true
  # Rebalance node utilization
  - name: LowNodeUtilization
    args:
      thresholds:
        cpu: 25
        memory: 25
      targetThresholds:
        cpu: 65
        memory: 65
  # Spread replicas across nodes
  - name: RemoveDuplicates
  # Fix topology spread violations
  - name: RemovePodsViolatingTopologySpreadConstraint
    args:
      constraints:
      - DoNotSchedule      # Only fix hard constraints
  # Clean up crashed pods
  - name: RemovePodsHavingTooManyRestarts
    args:
      podRestartThreshold: 20
      includingInitContainers: true
  plugins:
    balance:
      enabled:
      - RemoveDuplicates   # RemoveDuplicates is a balance plugin in v1alpha2
      - LowNodeUtilization
      - RemovePodsViolatingTopologySpreadConstraint
    deschedule:
      enabled:
      - RemovePodsHavingTooManyRestarts

7. When to Use (and When Not to Use) the Descheduler

Use the Descheduler when:

  • Nodes are added or removed frequently (autoscaling clusters).
  • You observe significant resource imbalance across nodes.
  • You are running topology spread constraints that drift after node changes.
  • You have multiple replicas of services and want them spread across failure domains.
  • You want to facilitate Cluster Autoscaler scale-down by packing pods onto fewer nodes.

Do NOT use the Descheduler when:

  • Your workloads are sensitive to restarts (pods that take minutes to start, stateful services without replication).
  • You do not have PDBs in place for your workloads.
  • Your cluster is small and rarely changes topology.
  • Pods use local storage that would be lost on eviction.
  • The scheduler itself is misconfigured -- fix the scheduler configuration first rather than patching with the Descheduler.

8. Common Pitfalls

  1. Eviction loops. Without nodeFit: true, the Descheduler can evict a pod that gets rescheduled back to the same node (because it is still the best fit). The next Descheduler run evicts it again, creating an infinite loop. Always enable nodeFit.

  2. Missing PDBs. The Descheduler respects PDBs, but if you do not have PDBs, it can evict all replicas of a service simultaneously. Treat PDBs as a prerequisite for running the Descheduler.

  3. Overly aggressive thresholds. Setting LowNodeUtilization thresholds too close together (e.g., underutilized at 40%, overutilized at 50%) causes excessive churn. Maintain at least a 20-30 percentage point gap between thresholds.

  4. Ignoring the Cluster Autoscaler interaction. The Descheduler and Cluster Autoscaler can conflict: the Descheduler evicts pods to balance load, while the Cluster Autoscaler removes empty nodes. If the Descheduler spreads pods across all nodes, the autoscaler cannot scale down. Use HighNodeUtilization if cost optimization is the goal.

  5. Running too frequently. If the Descheduler runs every minute, pods may be evicted before they finish starting up from the previous eviction. A 5-15 minute interval is usually sufficient.

  6. Not monitoring evictions. Track the metric descheduler_pods_evicted in Prometheus to understand how many pods the Descheduler is evicting per run. If the number is consistently high, your scheduling configuration may need tuning.

9. What's Next?

  • Node Operations: The Descheduler is the complement to drain/uncordon operations. See Node Operations.
  • Custom Schedulers: The Descheduler works with any scheduler -- the standard scheduler or your custom scheduler handles the re-placement. See Custom Schedulers.
  • Evictions: Understand how the Eviction API and PDBs work, since the Descheduler relies on them. See Evictions.
  • Cluster Autoscaler: Understand the interaction between the Descheduler and the Cluster Autoscaler for cost-optimized clusters.
  • Metrics: Set up dashboards for descheduler_pods_evicted, descheduler_loop_duration_seconds, and PDB disruption budgets to monitor the Descheduler's impact on your cluster.