Custom Schedulers
- Scheduling Framework: Kubernetes uses a plugin-based scheduling framework with well-defined extension points (QueueSort, PreFilter, Filter, PostFilter, PreScore, Score, Reserve, Permit, PreBind, Bind, PostBind) that allow fine-grained customization.
- Multiple Schedulers: You can run custom schedulers alongside the default scheduler. Pods select their scheduler via spec.schedulerName. Each scheduler operates independently, watching for unscheduled pods that target it.
- Scheduler Plugins: The modern approach to customization -- write a Go plugin that implements one or more extension point interfaces, compile it into the scheduler binary, and enable it via a scheduler configuration profile.
- Scheduler Extenders: A legacy webhook-based approach where the scheduler calls an external HTTP endpoint during Filter and Prioritize phases. Simpler to implement but slower and less capable than plugins.
- Use Cases: GPU topology-aware scheduling, data locality for analytics workloads, cost-aware bin packing, gang scheduling for distributed training, and multi-cluster placement.
- Scheduler Profiles: A single scheduler binary can host multiple scheduling profiles, each with a different set of enabled plugins, eliminating the need to run multiple scheduler deployments.
Kubernetes ships with a high-quality default scheduler that handles most workloads well. But when your scheduling requirements go beyond resource-fit and affinity rules -- when you need to optimize for GPU topology, minimize cloud costs, ensure data locality, or coordinate multi-pod gang scheduling -- you need to extend or replace the scheduler.
1. The Scheduling Cycle
A scheduler has one fundamental job: watch the API server for Pods where spec.nodeName is empty, select a suitable node, and fill in that field (bind the pod).
The default scheduler processes pods through a well-defined pipeline:
Scheduling Cycle (per-pod, synchronous)
- QueueSort: Determines the order in which pending pods are dequeued for scheduling. By default, pods are sorted by priority and then by creation time.
- PreFilter: Performs pre-processing or validation before filtering begins. A plugin can reject a pod here if preconditions are not met (e.g., "this pod requires a PVC that does not exist").
- Filter: Removes nodes that cannot run the pod. Built-in filters check for sufficient CPU/memory, node selectors, taints/tolerations, affinity/anti-affinity, volume constraints, and port conflicts.
- PostFilter: Runs only if Filter eliminated all nodes (no feasible node). The default PostFilter plugin attempts preemption -- finding a lower-priority pod to evict so the current pod can be scheduled.
- PreScore: Performs pre-processing before scoring. Plugins can compute shared state that multiple scoring plugins will use.
- Score: Ranks the remaining feasible nodes. Built-in scoring plugins consider balanced resource utilization, pod topology spread, node affinity preference, and inter-pod affinity.
- NormalizeScore: Normalizes scores to a 0-100 range so scores from different plugins are comparable.
- Reserve: Optimistically claims resources on the selected node before the bind. If the bind fails, the Unreserve callback is invoked to release the claim.
Binding Cycle (asynchronous)
- Permit: A gate that can approve, deny, or delay binding. The "wait" option is used for gang scheduling -- hold a pod until all members of the group are schedulable.
- PreBind: Performs actions required before binding (e.g., provisioning a network volume, setting up a node resource).
- Bind: Writes spec.nodeName to the pod object in the API server. Only one Bind plugin runs (the first that handles the pod).
- PostBind: Performs cleanup or notification after a successful bind. Used for logging, metrics, or triggering follow-up actions.
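To make the division of labor concrete, here is a dependency-free Go sketch of the Filter and Score phases over toy node data. The `node` type and the sizing logic are illustrative stand-ins, not the real API types or built-in plugins:

```go
package main

import "fmt"

// node is a toy stand-in for a real *v1.Node plus its resource snapshot.
type node struct {
	name    string
	freeCPU int64 // millicores available
	freeMem int64 // bytes available
}

// filter keeps only nodes that can fit the request (the Filter phase).
func filter(nodes []node, cpu, mem int64) []node {
	var feasible []node
	for _, n := range nodes {
		if n.freeCPU >= cpu && n.freeMem >= mem {
			feasible = append(feasible, n)
		}
	}
	return feasible
}

// score ranks a feasible node 0-100; here, nodes with more headroom after
// placement score higher (a crude stand-in for built-in scoring plugins).
func score(n node, cpu int64) int64 {
	if n.freeCPU == 0 {
		return 0
	}
	return 100 - (cpu*100)/n.freeCPU
}

// schedule runs Filter then Score and returns the best node name ("" if none).
func schedule(nodes []node, cpu, mem int64) string {
	best, bestScore := "", int64(-1)
	for _, n := range filter(nodes, cpu, mem) {
		if s := score(n, cpu); s > bestScore {
			best, bestScore = n.name, s
		}
	}
	return best
}

func main() {
	nodes := []node{
		{"node-a", 500, 1 << 30},  // too little CPU for the request below
		{"node-b", 4000, 4 << 30},
	}
	fmt.Println(schedule(nodes, 1000, 1<<30)) // prints "node-b"
}
```

The real scheduler does the same thing with plugin interfaces, a node snapshot, and parallel evaluation, but the shape of the decision is this simple loop.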
2. Running Multiple Schedulers Side by Side
You can deploy a custom scheduler as a separate Deployment in your cluster. Pods opt into a specific scheduler by setting spec.schedulerName:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job
spec:
  schedulerName: gpu-topology-scheduler  # Use the custom scheduler
  containers:
  - name: trainer
    image: ml-trainer:latest
    resources:
      limits:
        nvidia.com/gpu: 4
If schedulerName is not specified, the pod is handled by the default scheduler (default-scheduler).
Deploying a Custom Scheduler
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-topology-scheduler
  namespace: kube-system
spec:
  replicas: 1  # Only 1 replica (leader election handles HA)
  selector:
    matchLabels:
      component: gpu-topology-scheduler
  template:
    metadata:
      labels:
        component: gpu-topology-scheduler
    spec:
      serviceAccountName: gpu-topology-scheduler  # Needs RBAC for pods, nodes, bindings
      containers:
      - name: scheduler
        image: my-registry/gpu-scheduler:v1.0
        command:
        - /gpu-scheduler
        - --config=/etc/scheduler/config.yaml  # Scheduler configuration
        - --leader-elect=true                  # Enable leader election for HA
        volumeMounts:
        - name: config
          mountPath: /etc/scheduler
      volumes:
      - name: config
        configMap:
          name: gpu-scheduler-config
RBAC Requirements
A custom scheduler needs permissions to:
- Read Pods, Nodes, PersistentVolumes, PersistentVolumeClaims, Services, ReplicationControllers, StatefulSets, ReplicaSets, and Deployments.
- Create Bindings (to bind pods to nodes).
- Update Pods (to set spec.nodeName and pod status).
- Create Events (to record scheduling decisions and errors).
- Create Leases (for leader election).
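Put together, a ClusterRole covering these permissions might look like the following sketch. The resource and verb lists are illustrative; in practice, compare against the built-in system:kube-scheduler ClusterRole and trim to what your scheduler actually uses:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gpu-topology-scheduler
rules:
- apiGroups: [""]
  resources: ["pods", "nodes", "persistentvolumes", "persistentvolumeclaims", "services", "replicationcontrollers"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["statefulsets", "replicasets", "deployments"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["pods/binding"]   # Bind pods to nodes
  verbs: ["create"]
- apiGroups: [""]
  resources: ["pods/status"]    # Record scheduling conditions
  verbs: ["patch", "update"]
- apiGroups: ["", "events.k8s.io"]
  resources: ["events"]
  verbs: ["create", "patch", "update"]
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]         # Leader election
  verbs: ["create", "get", "update"]
```

Bind this to the scheduler's ServiceAccount with a ClusterRoleBinding.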
3. Writing a Scheduler Plugin
The modern approach to scheduler customization is writing a scheduler plugin in Go. A plugin implements one or more extension point interfaces.
package main

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// DataLocalityPlugin scores nodes higher if they hold the data the pod needs.
type DataLocalityPlugin struct {
	handle framework.Handle
}

// Name returns the plugin name.
func (p *DataLocalityPlugin) Name() string {
	return "DataLocality"
}

// Score returns a higher score for nodes that have the requested dataset.
func (p *DataLocalityPlugin) Score(
	ctx context.Context,
	state *framework.CycleState,
	pod *v1.Pod,
	nodeName string,
) (int64, *framework.Status) {
	// Read the desired dataset from a pod annotation.
	dataset := pod.Annotations["data-locality/dataset"]
	if dataset == "" {
		return 0, framework.NewStatus(framework.Success)
	}
	// Check whether the node holds the dataset (advertised via a node label).
	nodeInfo, err := p.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
	if err != nil {
		return 0, framework.NewStatus(framework.Error, fmt.Sprintf("node not found: %v", err))
	}
	if nodeInfo.Node().Labels["data-locality/dataset"] == dataset {
		return framework.MaxNodeScore, framework.NewStatus(framework.Success) // Maximum score (100)
	}
	return 0, framework.NewStatus(framework.Success)
}

// ScoreExtensions returns nil (no normalization needed for this plugin).
func (p *DataLocalityPlugin) ScoreExtensions() framework.ScoreExtensions {
	return nil
}

// New creates a new instance of the plugin. (The exact factory signature
// varies across Kubernetes versions.)
func New(obj runtime.Object, handle framework.Handle) (framework.Plugin, error) {
	return &DataLocalityPlugin{handle: handle}, nil
}
This plugin is compiled into the scheduler binary -- typically by registering its factory with app.NewSchedulerCommand(app.WithPlugin("DataLocality", New)) from k8s.io/kubernetes/cmd/kube-scheduler/app, the pattern used by the sigs.k8s.io/scheduler-plugins project -- and enabled via the scheduler configuration.
4. Scheduler Configuration and Profiles
Since Kubernetes 1.19, the scheduler supports profiles -- a single scheduler binary can serve multiple scheduling behaviors. Each profile has a unique name and its own set of enabled/disabled plugins.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
# Default profile for general workloads
- schedulerName: default-scheduler
  plugins:
    score:
      enabled:
      - name: NodeResourcesBalancedAllocation
        weight: 1
      - name: NodeResourcesFit
        weight: 1
      - name: InterPodAffinity
        weight: 1
# GPU-aware profile
- schedulerName: gpu-topology-scheduler
  plugins:
    filter:
      enabled:
      - name: GpuTopologyFilter   # Custom plugin
    score:
      enabled:
      - name: GpuTopologyScore    # Custom plugin
        weight: 10                # High weight for GPU topology
      - name: NodeResourcesFit
        weight: 1
# Cost-optimized profile
- schedulerName: cost-aware-scheduler
  plugins:
    score:
      enabled:
      - name: CostAwareScore      # Custom plugin
        weight: 5
      - name: NodeResourcesFit
        weight: 3
      disabled:
      - name: NodeResourcesBalancedAllocation  # Disable balanced allocation
With profiles, you do not need to deploy multiple scheduler binaries. One scheduler process handles all profiles, and pods select the profile via spec.schedulerName.
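For example, a fault-tolerant batch pod could opt into the cost-optimized profile above simply by naming it (hypothetical pod shown):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nightly-report
spec:
  schedulerName: cost-aware-scheduler  # Select the cost-optimized profile
  containers:
  - name: report
    image: nightly-report:latest
```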
5. Scheduler Extenders (Legacy)
Before the scheduling framework, the primary extension mechanism was scheduler extenders -- external HTTP endpoints that the scheduler calls during Filter and Prioritize phases.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
extenders:
- urlPrefix: "https://my-extender.kube-system.svc:443"
  filterVerb: "filter"          # POST /filter
  prioritizeVerb: "prioritize"  # POST /prioritize
  bindVerb: ""                  # Empty means the extender doesn't handle bind
  weight: 5
  enableHTTPS: true
  tlsConfig:
    certFile: "/etc/scheduler/cert.pem"
    keyFile: "/etc/scheduler/key.pem"
  managedResources:
  - name: "example.com/special-resource"
    ignoredByScheduler: true    # Scheduler doesn't check this resource
  ignorable: true               # If the extender fails, proceed without it
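The extender side is just an HTTP service. The sketch below models the POST /filter handler with simplified stand-ins for the ExtenderArgs/ExtenderFilterResult wire types (the real definitions live in k8s.io/kube-scheduler/extender/v1); specialNodes is a placeholder for a real inventory lookup:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Simplified stand-ins for the scheduler's extender wire types.
type extenderArgs struct {
	NodeNames []string `json:"nodenames"`
}
type extenderFilterResult struct {
	NodeNames   []string          `json:"nodenames"`
	FailedNodes map[string]string `json:"failedNodes"`
}

// specialNodes is a placeholder inventory of nodes that hold the
// hypothetical example.com/special-resource.
var specialNodes = map[string]bool{"node-a": true}

// filterNodes keeps nodes with the resource and records why others failed.
func filterNodes(args extenderArgs) extenderFilterResult {
	result := extenderFilterResult{FailedNodes: map[string]string{}}
	for _, n := range args.NodeNames {
		if specialNodes[n] {
			result.NodeNames = append(result.NodeNames, n)
		} else {
			result.FailedNodes[n] = "special-resource not present"
		}
	}
	return result
}

// filterHandler decodes the scheduler's request and returns the verdict.
func filterHandler(w http.ResponseWriter, r *http.Request) {
	var args extenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	json.NewEncoder(w).Encode(filterNodes(args))
}

func main() {
	// A real extender would serve filterHandler over TLS, e.g.:
	//   http.HandleFunc("/filter", filterHandler)
	//   log.Fatal(http.ListenAndServeTLS(":443", certFile, keyFile, nil))
	res := filterNodes(extenderArgs{NodeNames: []string{"node-a", "node-b"}})
	fmt.Printf("feasible: %v, failed: %v\n", res.NodeNames, res.FailedNodes)
}
```

Every scheduling cycle for a managed pod pays the cost of this round trip, which is the main argument for migrating to in-process plugins.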
Extenders vs. Plugins
| Aspect | Plugins (Framework) | Extenders (Webhook) |
|---|---|---|
| Performance | In-process function call, negligible overhead | HTTP round trip per scheduling cycle |
| Extension Points | All 12 extension points | Only Filter and Prioritize |
| Data Access | Full access to scheduler cache | Only sees pod and node list (serialized over HTTP) |
| Error Handling | Framework status codes | HTTP error codes |
| Deployment | Compiled into scheduler binary | Separate service |
| Recommendation | Preferred for new development | Legacy; use only if you cannot compile a plugin |
6. Real-World Use Cases
GPU Topology-Aware Scheduling
Multi-GPU training jobs perform best when GPUs communicate over NVLink rather than PCIe. A custom Filter plugin can reject nodes where the available GPUs are not NVLink-connected, and a Score plugin can prefer nodes where the most tightly connected GPUs are free.
Data Locality
Big data workloads (Spark on Kubernetes, Presto, Trino) benefit from running on nodes that hold the HDFS/Ceph/local-SSD data they need to process. A Score plugin can query a data placement API and rank nodes by the fraction of required data blocks they hold.
Cost Optimization
In multi-cloud or hybrid environments, a cost-aware scheduler can prefer cheaper nodes (spot instances, preemptible VMs) for fault-tolerant workloads and reserve on-demand instances for critical services. The Score plugin reads pricing information from node labels or an external pricing API.
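A minimal sketch of such a Score function, assuming a hypothetical cost/hourly-price-millidollars node label and min/max prices gathered beforehand (in a real plugin, e.g. during PreScore):

```go
package main

import (
	"fmt"
	"strconv"
)

// costScore maps a node's hourly price (read from a hypothetical
// "cost/hourly-price-millidollars" label) to a 0-100 score: the cheapest
// node under consideration scores 100, the priciest scores 0.
func costScore(labels map[string]string, minPrice, maxPrice int64) int64 {
	price, err := strconv.ParseInt(labels["cost/hourly-price-millidollars"], 10, 64)
	if err != nil {
		return 0 // unknown price: rank last rather than fail the cycle
	}
	if maxPrice == minPrice {
		return 100
	}
	return 100 * (maxPrice - price) / (maxPrice - minPrice)
}

func main() {
	spot := map[string]string{"cost/hourly-price-millidollars": "90"}
	onDemand := map[string]string{"cost/hourly-price-millidollars": "300"}
	fmt.Println(costScore(spot, 90, 300), costScore(onDemand, 90, 300)) // prints "100 0"
}
```

Combined with a moderate NodeResourcesFit weight (as in the cost-optimized profile above), this steers fault-tolerant pods toward spot capacity without ignoring resource fit.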
Gang Scheduling
Distributed training jobs (e.g., PyTorch DDP, Horovod) require all N pods to be scheduled simultaneously -- partial scheduling wastes GPU resources. A Permit plugin can hold each pod in a "waiting" state until all members of the gang are schedulable, then release them all at once. The Volcano scheduler and Coscheduling plugin implement this pattern.
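The Permit-side bookkeeping can be modeled in a few lines of plain Go. This is a simplified stand-in for the Coscheduling plugin's logic; a real Permit plugin returns framework.Status values with a timeout and releases held pods through the framework's waiting-pod API:

```go
package main

import (
	"fmt"
	"sync"
)

// gangTracker models a Coscheduling-style Permit plugin: pods carry a gang
// name and size, and each pod waits at Permit until the whole gang arrives.
type gangTracker struct {
	mu      sync.Mutex
	arrived map[string]int // gang name -> pods seen at Permit so far
}

// permit records one pod's arrival and reports "allow" once the gang is
// complete, "wait" otherwise.
func (g *gangTracker) permit(gang string, size int) string {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.arrived[gang]++
	if g.arrived[gang] >= size {
		return "allow" // last member: release the whole gang
	}
	return "wait"
}

func main() {
	g := &gangTracker{arrived: map[string]int{}}
	for i := 0; i < 3; i++ {
		fmt.Println(g.permit("ddp-job", 3)) // prints "wait", "wait", "allow"
	}
}
```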
Topology-Aware Scheduling
Beyond GPUs, some workloads are sensitive to NUMA topology, CPU cache topology, or network topology (same rack, same ToR switch). A Filter or Score plugin can read node topology information and prefer placements that minimize latency.
7. Common Pitfalls
- Conflicts between multiple schedulers. If two schedulers schedule pods onto the same node, they can over-commit it, because neither sees the other's in-flight decisions. Use resource quotas and node affinity to partition the cluster, or use scheduler profiles (a single binary) instead.
- Missing leader election. Without leader election, running multiple replicas of a custom scheduler causes duplicate bind attempts and race conditions. Always enable --leader-elect=true.
- Insufficient RBAC. A custom scheduler needs extensive read permissions across the cluster. Missing RBAC for PVCs, storage classes, or CSI nodes causes scheduling failures for pods with volumes.
- Extender latency. Scheduler extenders add an HTTP round trip to every scheduling cycle. At high pod creation rates (100+ pods/second), this latency becomes a bottleneck. Migrate to in-process plugins if performance is critical.
- Plugin panics crash the scheduler. Unlike extenders (which are isolated processes), a plugin panic brings down the entire scheduler binary. Implement robust error handling and recovery in plugin code.
- Forgetting about preemption. If your custom Filter rejects all nodes, the default PostFilter will try to preempt lower-priority pods. If your scheduling logic requires special preemption behavior, implement a custom PostFilter plugin.
- Not testing with realistic cluster state. The scheduler's behavior depends on the snapshot of node resources at scheduling time. Test against nodes with realistic resource utilization, not an empty cluster.
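For the panic pitfall in particular, a defer/recover wrapper at the top of each plugin callback converts a panic into a normal error instead of taking down the process (sketch with plain functions standing in for the framework types):

```go
package main

import "fmt"

// safeScore wraps a plugin score function so a panic inside it surfaces
// as an error rather than crashing the scheduler binary.
func safeScore(score func() int64) (s int64, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("score plugin panicked: %v", r)
		}
	}()
	return score(), nil
}

func main() {
	s, err := safeScore(func() int64 { return 42 })
	fmt.Println(s, err) // prints "42 <nil>"
	_, err = safeScore(func() int64 { panic("nil node info") })
	fmt.Println(err)
}
```

In a real plugin, the recovered error would be returned as a framework.Status with the Error code so the scheduling cycle fails cleanly for that pod.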
8. What's Next?
- Descheduler: The Descheduler handles the other side of scheduling -- rebalancing pods after initial placement. See The Descheduler.
- AI & GPU Scheduling: Learn about GPU scheduling, the NVIDIA device plugin, and specialized schedulers like Volcano. See AI & GPU Scheduling.
- Node Operations: Understand how node drain and cordon interact with the scheduler during maintenance windows. See Node Operations.
- Priority and Preemption: Learn how pod priority affects scheduling order and preemption decisions.
- Scheduler Source Code: The Kubernetes scheduler is well-documented Go code. Start with the pkg/scheduler/framework/ directory in the kubernetes/kubernetes repository to understand the plugin interfaces.