Custom Schedulers
- Scheduling Framework: Kubernetes uses a plugin-based scheduling framework with well-defined extension points (QueueSort, PreFilter, Filter, PostFilter, PreScore, Score, Reserve, Permit, PreBind, Bind, PostBind) that allow fine-grained customization.
- Multiple Schedulers: You can run custom schedulers alongside the default scheduler. Pods select their scheduler via spec.schedulerName. Each scheduler operates independently, watching for unscheduled pods that target it.
- Scheduler Plugins: The modern approach to customization -- write a Go plugin that implements one or more extension point interfaces, compile it into the scheduler binary, and enable it via a scheduler configuration profile.
- Scheduler Extenders: A legacy webhook-based approach where the scheduler calls an external HTTP endpoint during Filter and Prioritize phases. Simpler to implement but slower and less capable than plugins.
- Use Cases: GPU topology-aware scheduling, data locality for analytics workloads, cost-aware bin packing, gang scheduling for distributed training, and multi-cluster placement.
- Scheduler Profiles: A single scheduler binary can host multiple scheduling profiles, each with a different set of enabled plugins, eliminating the need to run multiple scheduler deployments.
Kubernetes ships with a high-quality default scheduler that handles most workloads well. But when your scheduling requirements go beyond resource-fit and affinity rules -- when you need to optimize for GPU topology, minimize cloud costs, ensure data locality, or coordinate multi-pod gang scheduling -- you need to extend or replace the scheduler.
1. The Scheduling Cycle
A scheduler has one fundamental job: watch the API server for Pods where spec.nodeName is empty, select a suitable node, and fill in that field (bind the pod).
The default scheduler processes pods through a well-defined pipeline:
Scheduling Cycle (per-pod, synchronous)
- QueueSort: Determines the order in which pending pods are dequeued for scheduling. By default, pods are sorted by priority and then by creation time.
- PreFilter: Performs pre-processing or validation before filtering begins. A plugin can reject a pod here if preconditions are not met (e.g., "this pod requires a PVC that does not exist").
- Filter: Removes nodes that cannot run the pod. Built-in filters check for sufficient CPU/memory, node selectors, taints/tolerations, affinity/anti-affinity, volume constraints, and port conflicts.
- PostFilter: Runs only if Filter eliminated all nodes (no feasible node). The default PostFilter plugin attempts preemption -- finding a lower-priority pod to evict so the current pod can be scheduled.
- PreScore: Performs pre-processing before scoring. Plugins can compute shared state that multiple scoring plugins will use.
- Score: Ranks the remaining feasible nodes. Built-in scoring plugins consider balanced resource utilization, pod topology spread, node affinity preference, and inter-pod affinity.
- NormalizeScore: Normalizes scores to a 0-100 range so scores from different plugins are comparable.
- Reserve: Optimistically claims resources on the selected node before the bind. If the bind fails, the Unreserve callback is invoked to release the claim.
Binding Cycle (asynchronous)
- Permit: A gate that can approve, deny, or delay binding. The "wait" option is used for gang scheduling -- hold a pod until all members of the group are schedulable.
- PreBind: Performs actions required before binding (e.g., provisioning a network volume, setting up a node resource).
- Bind: Writes spec.nodeName to the pod object in the API server. Only one Bind plugin runs (the first that handles the pod).
- PostBind: Performs cleanup or notification after a successful bind. Used for logging, metrics, or triggering follow-up actions.
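To make the division of labor concrete, here is a dependency-free Go sketch of the Filter and Score phases over toy node data. The `node` type and the sizing logic are illustrative stand-ins, not the real API types or built-in plugins:

```go
package main

import "fmt"

// node is a toy stand-in for a real *v1.Node plus its resource snapshot.
type node struct {
	name    string
	freeCPU int64 // millicores available
	freeMem int64 // bytes available
}

// filter keeps only nodes that can fit the request (the Filter phase).
func filter(nodes []node, cpu, mem int64) []node {
	var feasible []node
	for _, n := range nodes {
		if n.freeCPU >= cpu && n.freeMem >= mem {
			feasible = append(feasible, n)
		}
	}
	return feasible
}

// score ranks a feasible node 0-100; here, nodes with more headroom after
// placement score higher (a crude stand-in for built-in scoring plugins).
func score(n node, cpu int64) int64 {
	if n.freeCPU == 0 {
		return 0
	}
	return 100 - (cpu*100)/n.freeCPU
}

// schedule runs Filter then Score and returns the best node name ("" if none).
func schedule(nodes []node, cpu, mem int64) string {
	best, bestScore := "", int64(-1)
	for _, n := range filter(nodes, cpu, mem) {
		if s := score(n, cpu); s > bestScore {
			best, bestScore = n.name, s
		}
	}
	return best
}

func main() {
	nodes := []node{
		{"node-a", 500, 1 << 30},  // too little CPU for the request below
		{"node-b", 4000, 4 << 30},
	}
	fmt.Println(schedule(nodes, 1000, 1<<30)) // prints "node-b"
}
```

The real scheduler does the same thing with plugin interfaces, a node snapshot, and parallel evaluation, but the shape of the decision is this simple loop.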
2. Running Multiple Schedulers Side by Side
You can deploy a custom scheduler as a separate Deployment in your cluster. Pods opt into a specific scheduler by setting spec.schedulerName:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job
spec:
  schedulerName: gpu-topology-scheduler  # Use the custom scheduler
  containers:
  - name: trainer
    image: ml-trainer:latest
    resources:
      limits:
        nvidia.com/gpu: 4
If schedulerName is not specified, the pod is handled by the default scheduler (default-scheduler).
Deploying a Custom Scheduler
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-topology-scheduler
  namespace: kube-system
spec:
  replicas: 1  # Only 1 replica (leader election handles HA)
  selector:
    matchLabels:
      component: gpu-topology-scheduler
  template:
    metadata:
      labels:
        component: gpu-topology-scheduler
    spec:
      serviceAccountName: gpu-topology-scheduler  # Needs RBAC for pods, nodes, bindings
      containers:
      - name: scheduler
        image: my-registry/gpu-scheduler:v1.0
        command:
        - /gpu-scheduler
        - --config=/etc/scheduler/config.yaml  # Scheduler configuration
        - --leader-elect=true                  # Enable leader election for HA
        volumeMounts:
        - name: config
          mountPath: /etc/scheduler
      volumes:
      - name: config
        configMap:
          name: gpu-scheduler-config
RBAC Requirements
A custom scheduler needs permissions to:
- Read Pods, Nodes, PersistentVolumes, PersistentVolumeClaims, Services, ReplicationControllers, StatefulSets, ReplicaSets, and Deployments.
- Create Bindings (to bind pods to nodes).
- Update Pods (to set spec.nodeName and pod status).
- Create Events (to record scheduling decisions and errors).
- Create Leases (for leader election).
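Put together, a ClusterRole covering these permissions might look like the following sketch. The resource and verb lists are illustrative; in practice, compare against the built-in system:kube-scheduler ClusterRole and trim to what your scheduler actually uses:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gpu-topology-scheduler
rules:
- apiGroups: [""]
  resources: ["pods", "nodes", "persistentvolumes", "persistentvolumeclaims", "services", "replicationcontrollers"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["statefulsets", "replicasets", "deployments"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["pods/binding"]   # Bind pods to nodes
  verbs: ["create"]
- apiGroups: [""]
  resources: ["pods/status"]    # Record scheduling conditions
  verbs: ["patch", "update"]
- apiGroups: ["", "events.k8s.io"]
  resources: ["events"]
  verbs: ["create", "patch", "update"]
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]         # Leader election
  verbs: ["create", "get", "update"]
```

Bind this to the scheduler's ServiceAccount with a ClusterRoleBinding.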
3. Writing a Scheduler Plugin
The modern approach to scheduler customization is writing a scheduler plugin in Go. A plugin implements one or more extension point interfaces.
package main

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// DataLocalityPlugin scores nodes higher if they hold the data the pod needs.
type DataLocalityPlugin struct {
	handle framework.Handle
}

// Name returns the plugin name.
func (p *DataLocalityPlugin) Name() string {
	return "DataLocality"
}

// Score returns a higher score for nodes that have the requested dataset.
func (p *DataLocalityPlugin) Score(
	ctx context.Context,
	state *framework.CycleState,
	pod *v1.Pod,
	nodeName string,
) (int64, *framework.Status) {
	// Read the desired dataset from a pod annotation.
	dataset := pod.Annotations["data-locality/dataset"]
	if dataset == "" {
		return 0, framework.NewStatus(framework.Success)
	}
	// Check whether the node holds the dataset (advertised via a node label).
	nodeInfo, err := p.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
	if err != nil {
		return 0, framework.NewStatus(framework.Error, fmt.Sprintf("node not found: %v", err))
	}
	if nodeInfo.Node().Labels["data-locality/dataset"] == dataset {
		return framework.MaxNodeScore, framework.NewStatus(framework.Success) // Maximum score (100)
	}
	return 0, framework.NewStatus(framework.Success)
}

// ScoreExtensions returns nil (no normalization needed for this plugin).
func (p *DataLocalityPlugin) ScoreExtensions() framework.ScoreExtensions {
	return nil
}

// New creates a new instance of the plugin. (The exact factory signature
// varies across Kubernetes versions.)
func New(obj runtime.Object, handle framework.Handle) (framework.Plugin, error) {
	return &DataLocalityPlugin{handle: handle}, nil
}
This plugin is compiled into the scheduler binary -- typically by registering its factory with app.NewSchedulerCommand(app.WithPlugin("DataLocality", New)) from k8s.io/kubernetes/cmd/kube-scheduler/app, the pattern used by the sigs.k8s.io/scheduler-plugins project -- and enabled via the scheduler configuration.
4. Scheduler Configuration and Profiles
Since Kubernetes 1.19, the scheduler supports profiles -- a single scheduler binary can serve multiple scheduling behaviors. Each profile has a unique name and its own set of enabled/disabled plugins.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
# Default profile for general workloads
- schedulerName: default-scheduler
  plugins:
    score:
      enabled:
      - name: NodeResourcesBalancedAllocation
        weight: 1
      - name: NodeResourcesFit
        weight: 1
      - name: InterPodAffinity
        weight: 1
# GPU-aware profile
- schedulerName: gpu-topology-scheduler
  plugins:
    filter:
      enabled:
      - name: GpuTopologyFilter   # Custom plugin
    score:
      enabled:
      - name: GpuTopologyScore    # Custom plugin
        weight: 10                # High weight for GPU topology
      - name: NodeResourcesFit
        weight: 1
# Cost-optimized profile
- schedulerName: cost-aware-scheduler
  plugins:
    score:
      enabled:
      - name: CostAwareScore      # Custom plugin
        weight: 5
      - name: NodeResourcesFit
        weight: 3
      disabled:
      - name: NodeResourcesBalancedAllocation  # Disable balanced allocation
With profiles, you do not need to deploy multiple scheduler binaries. One scheduler process handles all profiles, and pods select the profile via spec.schedulerName.
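For example, a fault-tolerant batch pod could opt into the cost-optimized profile above simply by naming it (hypothetical pod shown):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nightly-report
spec:
  schedulerName: cost-aware-scheduler  # Select the cost-optimized profile
  containers:
  - name: report
    image: nightly-report:latest
```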
5. Scheduler Extenders (Legacy)
Before the scheduling framework, the primary extension mechanism was scheduler extenders -- external HTTP endpoints that the scheduler calls during Filter and Prioritize phases.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
extenders:
- urlPrefix: "https://my-extender.kube-system.svc:443"
  filterVerb: "filter"          # POST /filter
  prioritizeVerb: "prioritize"  # POST /prioritize
  bindVerb: ""                  # Empty means the extender doesn't handle bind
  weight: 5
  enableHTTPS: true
  tlsConfig:
    certFile: "/etc/scheduler/cert.pem"
    keyFile: "/etc/scheduler/key.pem"
  managedResources:
  - name: "example.com/special-resource"
    ignoredByScheduler: true    # Scheduler doesn't check this resource
  ignorable: true               # If the extender fails, proceed without it
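The extender side is just an HTTP service. The sketch below models the POST /filter handler with simplified stand-ins for the ExtenderArgs/ExtenderFilterResult wire types (the real definitions live in k8s.io/kube-scheduler/extender/v1); specialNodes is a placeholder for a real inventory lookup:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Simplified stand-ins for the scheduler's extender wire types.
type extenderArgs struct {
	NodeNames []string `json:"nodenames"`
}
type extenderFilterResult struct {
	NodeNames   []string          `json:"nodenames"`
	FailedNodes map[string]string `json:"failedNodes"`
}

// specialNodes is a placeholder inventory of nodes that hold the
// hypothetical example.com/special-resource.
var specialNodes = map[string]bool{"node-a": true}

// filterNodes keeps nodes with the resource and records why others failed.
func filterNodes(args extenderArgs) extenderFilterResult {
	result := extenderFilterResult{FailedNodes: map[string]string{}}
	for _, n := range args.NodeNames {
		if specialNodes[n] {
			result.NodeNames = append(result.NodeNames, n)
		} else {
			result.FailedNodes[n] = "special-resource not present"
		}
	}
	return result
}

// filterHandler decodes the scheduler's request and returns the verdict.
func filterHandler(w http.ResponseWriter, r *http.Request) {
	var args extenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	json.NewEncoder(w).Encode(filterNodes(args))
}

func main() {
	// A real extender would serve filterHandler over TLS, e.g.:
	//   http.HandleFunc("/filter", filterHandler)
	//   log.Fatal(http.ListenAndServeTLS(":443", certFile, keyFile, nil))
	res := filterNodes(extenderArgs{NodeNames: []string{"node-a", "node-b"}})
	fmt.Printf("feasible: %v, failed: %v\n", res.NodeNames, res.FailedNodes)
}
```

Every scheduling cycle for a managed pod pays the cost of this round trip, which is the main argument for migrating to in-process plugins.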
Extenders vs. Plugins
| Aspect | Plugins (Framework) | Extenders (Webhook) |
|---|---|---|
| Performance | In-process function call, negligible overhead | HTTP round trip per scheduling cycle |
| Extension Points | All 12 extension points | Only Filter and Prioritize |
| Data Access | Full access to scheduler cache | Only sees pod and node list (serialized over HTTP) |
| Error Handling | Framework status codes | HTTP error codes |
| Deployment | Compiled into scheduler binary | Separate service |
| Recommendation | Preferred for new development | Legacy; use only if you cannot compile a plugin |
6. Real-World Use Cases
GPU Topology-Aware Scheduling
Multi-GPU training jobs perform best when GPUs communicate over NVLink rather than PCIe. A custom Filter plugin can reject nodes where the available GPUs are not NVLink-connected, and a Score plugin can prefer nodes where the most tightly connected GPUs are free.
Data Locality
Big data workloads (Spark on Kubernetes, Presto, Trino) benefit from running on nodes that hold the HDFS/Ceph/local-SSD data they need to process. A Score plugin can query a data placement API and rank nodes by the fraction of required data blocks they hold.
Cost Optimization
In multi-cloud or hybrid environments, a cost-aware scheduler can prefer cheaper nodes (spot instances, preemptible VMs) for fault-tolerant workloads and reserve on-demand instances for critical services. The Score plugin reads pricing information from node labels or an external pricing API.
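A minimal sketch of such a Score function, assuming a hypothetical cost/hourly-price-millidollars node label and min/max prices gathered beforehand (in a real plugin, e.g. during PreScore):

```go
package main

import (
	"fmt"
	"strconv"
)

// costScore maps a node's hourly price (read from a hypothetical
// "cost/hourly-price-millidollars" label) to a 0-100 score: the cheapest
// node under consideration scores 100, the priciest scores 0.
func costScore(labels map[string]string, minPrice, maxPrice int64) int64 {
	price, err := strconv.ParseInt(labels["cost/hourly-price-millidollars"], 10, 64)
	if err != nil {
		return 0 // unknown price: rank last rather than fail the cycle
	}
	if maxPrice == minPrice {
		return 100
	}
	return 100 * (maxPrice - price) / (maxPrice - minPrice)
}

func main() {
	spot := map[string]string{"cost/hourly-price-millidollars": "90"}
	onDemand := map[string]string{"cost/hourly-price-millidollars": "300"}
	fmt.Println(costScore(spot, 90, 300), costScore(onDemand, 90, 300)) // prints "100 0"
}
```

Combined with a moderate NodeResourcesFit weight (as in the cost-optimized profile above), this steers fault-tolerant pods toward spot capacity without ignoring resource fit.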
Gang Scheduling
Distributed training jobs (e.g., PyTorch DDP, Horovod) require all N pods to be scheduled simultaneously -- partial scheduling wastes GPU resources. A Permit plugin can hold each pod in a "waiting" state until all members of the gang are schedulable, then release them all at once. The Volcano scheduler and Coscheduling plugin implement this pattern.
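The Permit-side bookkeeping can be modeled in a few lines of plain Go. This is a simplified stand-in for the Coscheduling plugin's logic; a real Permit plugin returns framework.Status values with a timeout and releases held pods through the framework's waiting-pod API:

```go
package main

import (
	"fmt"
	"sync"
)

// gangTracker models a Coscheduling-style Permit plugin: pods carry a gang
// name and size, and each pod waits at Permit until the whole gang arrives.
type gangTracker struct {
	mu      sync.Mutex
	arrived map[string]int // gang name -> pods seen at Permit so far
}

// permit records one pod's arrival and reports "allow" once the gang is
// complete, "wait" otherwise.
func (g *gangTracker) permit(gang string, size int) string {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.arrived[gang]++
	if g.arrived[gang] >= size {
		return "allow" // last member: release the whole gang
	}
	return "wait"
}

func main() {
	g := &gangTracker{arrived: map[string]int{}}
	for i := 0; i < 3; i++ {
		fmt.Println(g.permit("ddp-job", 3)) // prints "wait", "wait", "allow"
	}
}
```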
Topology-Aware Scheduling
Beyond GPUs, some workloads are sensitive to NUMA topology, CPU cache topology, or network topology (same rack, same ToR switch). A Filter or Score plugin can read node topology information and prefer placements that minimize latency.
7. Common Pitfalls
- Conflicts between multiple schedulers. If two schedulers schedule pods onto the same node, they can over-commit it, because neither sees the other's in-flight decisions. Use resource quotas and node affinity to partition the cluster, or use scheduler profiles (a single binary) instead.
- Missing leader election. Without leader election, running multiple replicas of a custom scheduler causes duplicate bind attempts and race conditions. Always enable --leader-elect=true.
- Insufficient RBAC. A custom scheduler needs extensive read permissions across the cluster. Missing RBAC for PVCs, storage classes, or CSI nodes causes scheduling failures for pods with volumes.
- Extender latency. Scheduler extenders add an HTTP round trip to every scheduling cycle. At high pod creation rates (100+ pods/second), this latency becomes a bottleneck. Migrate to in-process plugins if performance is critical.
- Plugin panics crash the scheduler. Unlike extenders (which are isolated processes), a plugin panic brings down the entire scheduler binary. Implement robust error handling and recovery in plugin code.
- Forgetting about preemption. If your custom Filter rejects all nodes, the default PostFilter will try to preempt lower-priority pods. If your scheduling logic requires special preemption behavior, implement a custom PostFilter plugin.
- Not testing with realistic cluster state. The scheduler's behavior depends on the snapshot of node resources at scheduling time. Test against nodes with realistic resource utilization, not an empty cluster.
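For the panic pitfall in particular, a defer/recover wrapper at the top of each plugin callback converts a panic into a normal error instead of taking down the process (sketch with plain functions standing in for the framework types):

```go
package main

import "fmt"

// safeScore wraps a plugin score function so a panic inside it surfaces
// as an error rather than crashing the scheduler binary.
func safeScore(score func() int64) (s int64, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("score plugin panicked: %v", r)
		}
	}()
	return score(), nil
}

func main() {
	s, err := safeScore(func() int64 { return 42 })
	fmt.Println(s, err) // prints "42 <nil>"
	_, err = safeScore(func() int64 { panic("nil node info") })
	fmt.Println(err)
}
```

In a real plugin, the recovered error would be returned as a framework.Status with the Error code so the scheduling cycle fails cleanly for that pod.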
8. What's Next?
- Descheduler: The Descheduler handles the other side of scheduling -- rebalancing pods after initial placement. See The Descheduler.
- AI & GPU Scheduling: Learn about GPU scheduling, the NVIDIA device plugin, and specialized schedulers like Volcano. See AI & GPU Scheduling.
- Node Operations: Understand how node drain and cordon interact with the scheduler during maintenance windows. See Node Operations.
- Priority and Preemption: Learn how pod priority affects scheduling order and preemption decisions.
- Scheduler Source Code: The Kubernetes scheduler is well-documented Go code. Start with the pkg/scheduler/framework/ directory in the kubernetes/kubernetes repository to understand the plugin interfaces.