
AI & GPU Scheduling

Key Takeaways
  • Device Plugin Mechanism: Standard Kubernetes only manages CPU and memory. GPUs are exposed to the scheduler through a vendor-specific Device Plugin (e.g., NVIDIA's) that discovers GPUs on nodes and reports them as extended resources (nvidia.com/gpu).
  • GPU Request Syntax: Pods request GPUs via resources.limits (e.g., nvidia.com/gpu: 1). GPUs are exclusively allocated -- a pod gets the entire GPU unless sharing is configured.
  • GPU Sharing Strategies: MIG (Multi-Instance GPU) partitions a physical GPU into isolated hardware slices. Time-slicing shares a GPU across pods with context switching. MPS (Multi-Process Service) provides concurrent GPU access with CUDA-level sharing.
  • GPU Operator: The NVIDIA GPU Operator automates the lifecycle of GPU drivers, device plugins, container toolkits, and monitoring components via a single operator deployment.
  • AI Workload Patterns: Training workloads are batch, GPU-hungry, and benefit from gang scheduling. Inference workloads are latency-sensitive, often auto-scaled, and can leverage GPU sharing for cost efficiency.
  • Specialized Schedulers: Volcano provides gang scheduling, queue management, and fair-share policies for batch AI workloads. KubeRay manages Ray clusters on Kubernetes for distributed AI/ML applications.

With the rise of Large Language Models and AI applications, Kubernetes has become a primary platform for training and serving machine learning models. However, standard Kubernetes only understands CPU and memory. To schedule GPU workloads, you need Device Plugins, specialized operators, and often custom scheduling strategies.

1. GPU Scheduling in Kubernetes

[Diagram: an AI workload pod (PyTorch / LLM) placed on a physical node with an NVIDIA H100 GPU.]
Kubernetes uses Device Plugins to expose specialized hardware like GPUs to containers. The scheduler ensures only nodes with available GPUs are selected.

Kubernetes uses the Device Plugin Framework (introduced in v1.8) to support hardware accelerators. The framework defines a gRPC interface that device plugins implement to:

  1. Discover the hardware devices available on a node (e.g., 8 NVIDIA A100 GPUs).
  2. Report them to the kubelet as extended resources (e.g., nvidia.com/gpu: 8 in node.status.capacity).
  3. Allocate specific devices to pods when the scheduler assigns a GPU-requesting pod to the node.
  4. Mount the required device files, driver libraries, and environment variables into the container.

The kubelet communicates with device plugins via a Unix domain socket at /var/lib/kubelet/device-plugins/. Each device plugin registers itself with the kubelet and implements the ListAndWatch and Allocate RPCs.
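Once a device plugin has registered, its devices show up alongside the core resources in node.status.capacity. A minimal sketch (the helper and the sample capacity map are illustrative, not part of any library) of how extended resources differ from the resources Kubernetes manages natively:

```python
def extended_resources(capacity):
    """Return the non-core (extended) resources from a node.status.capacity map.

    Core resources are managed natively by Kubernetes; everything else
    (e.g. nvidia.com/gpu) was registered by a device plugin.
    """
    core = {"cpu", "memory", "pods", "ephemeral-storage"}
    return {
        name: qty
        for name, qty in capacity.items()
        if name not in core and not name.startswith("hugepages-")
    }

# Capacity as it might be reported by a hypothetical 8-GPU node
capacity = {"cpu": "96", "memory": "768Gi", "pods": "110",
            "hugepages-2Mi": "0", "nvidia.com/gpu": "8"}
print(extended_resources(capacity))  # {'nvidia.com/gpu': '8'}
```

In a real cluster you would see the same split with `kubectl describe node`, where `nvidia.com/gpu` appears under Capacity and Allocatable only on nodes where the plugin is running.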

2. NVIDIA Device Plugin

The NVIDIA device plugin is the standard for running GPU workloads on Kubernetes. It runs as a DaemonSet on every GPU node.

# Deploy the NVIDIA device plugin
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin
  template:
    metadata:
      labels:
        name: nvidia-device-plugin
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        accelerator: nvidia   # Only run on GPU nodes
      containers:
        - name: nvidia-device-plugin
          image: nvcr.io/nvidia/k8s-device-plugin:v0.16.2
          securityContext:
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

Requesting GPUs in Pod Specs

GPUs are requested in the resources.limits field (not requests -- GPUs do not support oversubscription by default):

apiVersion: v1
kind: Pod
metadata:
  name: llm-inference
spec:
  containers:
    - name: model-server
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/gpu: 1   # Request 1 GPU (exclusively allocated)
          cpu: "8"
          memory: "32Gi"
      env:
        # NVIDIA_VISIBLE_DEVICES is set automatically by the device plugin
        # for the allocated GPUs -- do not hard-code it to "all".
        - name: MODEL_NAME
          value: "meta-llama/Llama-3-8B"
      volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
  volumes:
    - name: model-cache
      persistentVolumeClaim:
        claimName: model-cache-pvc

Key points about GPU requests:

  • Specify GPUs under limits. If you also set a GPU value under requests, it must equal the limit; in practice most specs set only the limit, and the kubelet treats it as the request.
  • GPUs are integer resources -- you cannot request 0.5 GPUs (unless using MIG or time-slicing).
  • Each GPU is exclusively allocated to one pod by default. No other pod can use it.
  • You cannot specify which specific GPU a pod gets (e.g., GPU 0 vs GPU 3). The device plugin handles assignment.
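These rules can be checked mechanically. A minimal sketch (the helper name and the shape of the resources dict are illustrative) that validates the GPU portion of a container's resources block against the points above:

```python
def gpu_request_errors(resources):
    """Check a container 'resources' dict against the GPU request rules."""
    errors = []
    limit = resources.get("limits", {}).get("nvidia.com/gpu")
    request = resources.get("requests", {}).get("nvidia.com/gpu")
    if request is not None and limit is None:
        errors.append("GPUs must be set in limits, not only requests")
    elif request is not None and str(request) != str(limit):
        errors.append("GPU request must equal GPU limit")
    if limit is not None and not str(limit).isdigit():
        errors.append("GPU count must be a whole integer")
    return errors

print(gpu_request_errors({"limits": {"nvidia.com/gpu": "1"}}))    # []
print(gpu_request_errors({"limits": {"nvidia.com/gpu": "0.5"}}))  # ['GPU count must be a whole integer']
```

The real enforcement happens in the API server and kubelet; this sketch only mirrors the documented behavior.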

3. GPU Sharing Strategies

Running one workload per GPU is wasteful for many use cases. Three sharing strategies address this.

Multi-Instance GPU (MIG)

MIG (available on NVIDIA Ampere-and-later data-center GPUs such as the A100 and H100) partitions a single physical GPU into up to 7 isolated instances at the hardware level. Each instance has dedicated compute units, memory, and L2 cache -- there is no performance interference between instances.

# Request a MIG slice (1/7 of an A100)
apiVersion: v1
kind: Pod
metadata:
  name: small-inference
spec:
  containers:
    - name: model
      image: inference-server:latest
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1   # 1 GPU instance with 5GB memory
          # Other MIG profiles for A100:
          #   nvidia.com/mig-1g.10gb: 1  # 1 instance, 10GB
          #   nvidia.com/mig-2g.10gb: 1  # 2 instances, 10GB
          #   nvidia.com/mig-3g.20gb: 1  # 3 instances, 20GB
          #   nvidia.com/mig-7g.40gb: 1  # Full GPU, 40GB

MIG configuration is set at the node level and requires a GPU reset. The device plugin exposes each MIG slice as a separate schedulable resource. MIG provides the strongest isolation of the three sharing strategies.

Time-Slicing

Time-slicing shares a GPU among multiple pods by rapidly switching context between them. Each pod gets exclusive access to the GPU for a brief time window, then yields to the next pod.

# NVIDIA device plugin ConfigMap for time-slicing
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # Each physical GPU appears as 4 schedulable GPUs

With replicas: 4, a node with 2 physical GPUs will report nvidia.com/gpu: 8 to the scheduler. Four pods can share a single physical GPU. The trade-off: each pod sees the full GPU memory (there is no memory isolation), and performance degrades proportionally to the number of pods sharing the GPU.
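The arithmetic behind both numbers is simple. A short sketch (the helper names are illustrative) of what the scheduler sees versus what each pod must budget for:

```python
def advertised_gpus(physical_gpus, replicas):
    """GPUs the node reports to the scheduler under time-slicing."""
    return physical_gpus * replicas

def per_pod_memory_budget_gb(gpu_memory_gb, replicas):
    """Memory each pod must voluntarily stay under -- nothing enforces this."""
    return gpu_memory_gb / replicas

print(advertised_gpus(2, 4))            # 8 -- matches the example above
print(per_pod_memory_budget_gb(40, 4))  # 10.0 -- on a 40GB GPU shared 4 ways
```

Because the budget is purely advisory, each workload has to be configured (e.g. via its framework's memory-fraction setting) to stay within its share.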

MPS (Multi-Process Service)

MPS is a CUDA feature that allows multiple processes to share a GPU simultaneously (truly concurrent, not time-sliced). MPS provides better GPU utilization and lower latency than time-slicing for workloads that do not fully saturate the GPU.

  • Advantage over time-slicing: Lower context-switch overhead, better utilization for small models.
  • Disadvantage: No memory isolation, no fault isolation (a CUDA error in one process can crash all processes on the GPU).
  • Best for: Multiple inference servers running small models on the same GPU.

4. GPU Operator

The NVIDIA GPU Operator automates the entire GPU software stack lifecycle on Kubernetes nodes. Instead of manually installing drivers, container toolkit, device plugin, and monitoring, the operator manages all of these as Kubernetes resources.

# Install the GPU Operator via Helm
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set migManager.enabled=true \
  --set dcgmExporter.enabled=true

# The operator deploys these components as DaemonSets:
#   1. nvidia-driver-daemonset  -- Installs NVIDIA drivers on nodes
#   2. nvidia-container-toolkit -- Configures container runtime for GPU access
#   3. nvidia-device-plugin     -- Registers GPUs with kubelet
#   4. nvidia-dcgm-exporter     -- Exports GPU metrics to Prometheus
#   5. nvidia-mig-manager       -- Manages MIG configuration
#   6. gpu-feature-discovery    -- Labels nodes with GPU capabilities

The GPU Operator is the recommended approach for production GPU clusters. It handles driver upgrades, runtime configuration, and monitoring automatically.

5. Training vs. Inference Workload Patterns

AI workloads have fundamentally different scheduling requirements depending on whether they are training or serving models.

Training Workloads

  • Resource profile: GPU-hungry. A single training job may require 8, 16, or even thousands of GPUs.
  • Duration: Hours to weeks.
  • Scheduling: Requires gang scheduling -- all GPUs must be allocated simultaneously. Partial allocation wastes resources.
  • Communication: Pods in a distributed training job communicate heavily via NCCL (NVIDIA Collective Communications Library) over NVLink (intra-node) and RDMA/InfiniBand (inter-node).
  • Fault tolerance: Checkpointing is essential. A single node failure can lose hours of training progress.
  • Priority: Often batch priority -- can be preempted by higher-priority inference workloads.
# Multi-GPU training job using PyTorch DDP
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-training
spec:
  parallelism: 4   # 4 pods (one per node)
  completions: 4
  completionMode: Indexed   # Stable per-pod identities; predictable hostnames also need a headless Service
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: my-training:latest
          resources:
            limits:
              nvidia.com/gpu: 8   # 8 GPUs per pod = 32 GPUs total
              cpu: "96"
              memory: "768Gi"
              rdma/hca: 4         # 4 RDMA NICs for inter-node communication
          env:
            - name: WORLD_SIZE
              value: "32"               # Total GPUs across all pods
            - name: NCCL_IB_DISABLE
              value: "0"                # Enable InfiniBand for NCCL
            - name: MASTER_ADDR
              value: "llm-training-0"   # Head node for coordination
          command:
            - torchrun
            - --nproc_per_node=8   # 8 GPUs per node
            - --nnodes=4           # 4 nodes
            - train.py
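The torchrun flags and WORLD_SIZE must agree with each other. The relationship (sketched with illustrative helper names) is:

```python
def world_size(nnodes, nproc_per_node):
    """Total number of ranks (GPUs) across the training job."""
    return nnodes * nproc_per_node

def global_rank(node_rank, local_rank, nproc_per_node):
    """Global rank torchrun assigns to the process at local_rank on node node_rank."""
    return node_rank * nproc_per_node + local_rank

print(world_size(4, 8))      # 32 -- matches WORLD_SIZE in the Job above
print(global_rank(3, 7, 8))  # 31 -- the last GPU on the last node
```

If these numbers disagree (e.g. WORLD_SIZE set to 32 but only 3 pods scheduled), NCCL initialization hangs waiting for the missing ranks, which is exactly the failure mode gang scheduling prevents.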

Inference Workloads

  • Resource profile: 1-4 GPUs for most models. Large models (70B+ parameters) may need 4-8 GPUs with tensor parallelism.
  • Duration: Long-running services.
  • Scheduling: Standard scheduling is usually sufficient. No gang requirement.
  • Latency: Critical. Time-to-first-token (TTFT) and inter-token latency are key metrics.
  • Autoscaling: Benefits from HPA based on queue depth, GPU utilization, or request latency.
  • GPU sharing: Inference workloads often do not fully utilize a GPU, making them ideal candidates for MIG or time-slicing.
# Inference deployment with HPA
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-serving
  template:
    metadata:
      labels:
        app: llm-serving
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          resources:
            limits:
              nvidia.com/gpu: 1
              cpu: "8"
              memory: "32Gi"
          ports:
            - containerPort: 8000
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120   # Model loading takes time
            periodSeconds: 10
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-serving
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization   # Custom metric from DCGM exporter
        target:
          type: AverageValue
          averageValue: "70"      # Scale up when GPU >70% busy

6. CUDA and Container Toolkits

For containers to access GPUs, the container runtime must mount the NVIDIA driver libraries and device files. The NVIDIA Container Toolkit (formerly nvidia-docker) integrates with containerd and CRI-O to handle this automatically.

The toolkit configures the container runtime to:

  • Mount /dev/nvidia* device files into the container.
  • Mount NVIDIA driver shared libraries (libcuda.so, libnvidia-ml.so) from the host.
  • Set environment variables (NVIDIA_VISIBLE_DEVICES, NVIDIA_DRIVER_CAPABILITIES).

Container images that use GPUs should be based on NVIDIA's CUDA base images (e.g., nvidia/cuda:12.4.0-runtime-ubuntu22.04), which include the CUDA runtime but rely on the host-mounted driver.

7. Memory Management for LLMs

Large Language Models require careful memory planning:

  • Model weights: A 70B parameter model in FP16 requires ~140GB of GPU memory. In INT8 quantization, this drops to ~70GB.
  • KV cache: For serving, the KV cache grows with sequence length and batch size. A 70B model serving 2048-token sequences with a batch of 32 can require an additional 10-20GB of GPU memory.
  • Peak memory: Training requires 3-4x more memory than inference due to optimizer states, gradients, and activations.
Memory requirements for common models (FP16 inference):
+--------------+--------+-------------------+
| Model        | Params | GPU Memory (FP16) |
+--------------+--------+-------------------+
| Llama 3 8B   | 8B     | ~16 GB (1 GPU)    |
| Llama 3 70B  | 70B    | ~140 GB (2 GPUs)  |
| Mixtral 8x7B | 46.7B  | ~93 GB (2 GPUs)   |
| GPT-4 class  | ~1.8T  | ~3.6 TB (multi)   |
+--------------+--------+-------------------+
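These estimates follow from simple arithmetic. A sketch of the two dominant terms (the KV-cache model dimensions below are assumed, loosely Llama-3-70B-like, purely for illustration):

```python
def weights_gb(params_billions, bytes_per_param=2):
    """Approximate weight memory in GB (FP16 = 2 bytes/param)."""
    return params_billions * bytes_per_param

def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Approximate KV-cache size: a K and a V vector per layer, per token, per KV head."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

print(weights_gb(70))  # 140 -- matches the table row for Llama 3 70B
# With assumed dims (80 layers, 8 KV heads, head_dim 128), 2048 tokens, batch 32:
print(round(kv_cache_gb(80, 8, 128, 2048, 32), 1))  # 21.5
```

Quantization scales the weights term directly: pass bytes_per_param=1 for INT8 and the 70B figure halves to ~70 GB, as noted above.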

8. Volcano Scheduler for Batch AI Jobs

Volcano is a CNCF project that provides a Kubernetes-native batch scheduling system. It extends Kubernetes scheduling with features essential for AI/ML and HPC workloads:

  • Gang scheduling: All pods in a job must be schedulable before any are started. This prevents partial allocation of multi-GPU training jobs.
  • Queue management: Jobs are submitted to queues with resource quotas and priority levels.
  • Fair-share scheduling: Resources are distributed fairly among users and teams based on configured shares.
  • Preemption: Lower-priority jobs can be preempted to make room for higher-priority ones.
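The core gang-scheduling decision can be sketched in a few lines (illustrative helper, not Volcano's actual algorithm): admit the job only if every member pod can be placed at once.

```python
def gang_schedulable(pods_needed, free_gpus_per_node, gpus_per_pod):
    """All-or-nothing admission: count how many pods fit across all nodes,
    and start the job only if ALL of its pods can be placed simultaneously."""
    placeable = sum(free // gpus_per_pod for free in free_gpus_per_node)
    return placeable >= pods_needed

print(gang_schedulable(4, [8, 8, 8, 8], 8))  # True  -- all 4 pods fit, job starts
print(gang_schedulable(4, [8, 8, 8, 4], 8))  # False -- only 3 fit, so NONE start
```

The second case is the key difference from the default scheduler, which would happily place 3 of the 4 pods and leave them idling on expensive GPUs.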
# Volcano Job for distributed training
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  minAvailable: 4          # Gang scheduling: all 4 pods must be ready
  schedulerName: volcano   # Use the Volcano scheduler
  queue: ml-training       # Submit to the training queue
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 4
      name: trainer
      template:
        spec:
          containers:
            - name: trainer
              image: training:latest
              resources:
                limits:
                  nvidia.com/gpu: 8
              command: ["torchrun", "--nproc_per_node=8", "train.py"]
          restartPolicy: OnFailure

9. KubeRay for Ray Clusters

KubeRay is a Kubernetes operator for managing Ray clusters. Ray is a popular framework for distributed Python applications, especially for AI/ML workloads (training, tuning, serving).

KubeRay manages:

  • RayCluster: A Ray head node + worker nodes, with auto-scaling support.
  • RayJob: A Ray cluster that runs a specific job and tears down when complete.
  • RayService: A Ray Serve deployment for model serving with rolling upgrades.
# KubeRay cluster for distributed inference
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: inference-cluster
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray-ml:2.9.0-py310-gpu
            resources:
              limits:
                cpu: "4"
                memory: "16Gi"
  workerGroupSpecs:
    - replicas: 4
      minReplicas: 2
      maxReplicas: 8
      groupName: gpu-workers
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray-ml:2.9.0-py310-gpu
              resources:
                limits:
                  nvidia.com/gpu: 1
                  cpu: "8"
                  memory: "32Gi"

10. Common Pitfalls

  1. NVIDIA driver version mismatch. The driver on the host must be compatible with the CUDA version in the container image. Use the GPU Operator to manage driver versions, or pin the CUDA version in your container images.

  2. Not tainting GPU nodes. Without taints, non-GPU workloads will schedule on expensive GPU nodes, wasting costly resources. Always taint GPU nodes (e.g., nvidia.com/gpu=present:NoSchedule) and add tolerations to GPU workloads.

  3. Forgetting readiness probes. GPU model serving containers take minutes to load large models into GPU memory. Without an appropriate initialDelaySeconds on the readiness probe, Kubernetes will kill the pod before the model is loaded.

  4. Ignoring GPU memory when using time-slicing. Time-slicing does not provide memory isolation. If 4 pods share a 40GB GPU and each tries to use 20GB, they will all crash with CUDA out-of-memory errors. Calculate memory budgets carefully.

  5. No gang scheduling for multi-node training. Without gang scheduling (Volcano or Coscheduling plugin), pods may be scheduled one at a time. If only 3 of 4 pods can be placed, those 3 pods sit idle, wasting GPU resources, while waiting for the 4th pod.

  6. Not configuring NCCL for multi-node training. NCCL needs proper network configuration (InfiniBand, RoCE, or TCP). Without setting NCCL_SOCKET_IFNAME, NCCL_IB_DISABLE, and related environment variables, multi-node communication performance can be 10-100x worse than expected.

  7. Over-provisioning GPU nodes. GPU instances are expensive ($2-30/hour per GPU). Use the Cluster Autoscaler with GPU node pools and set scale-down-unneeded-time to a reasonable value (e.g., 10 minutes for inference, 30 minutes for training) to avoid paying for idle GPUs.

11. What's Next?

  • Custom Schedulers: Learn about the scheduling framework and how GPU topology-aware plugins can improve multi-GPU workload placement. See Custom Schedulers.
  • Node Operations: Understand how to drain GPU nodes for maintenance without disrupting training jobs. See Node Operations.
  • Monitoring: Deploy DCGM Exporter and configure Grafana dashboards for GPU utilization, memory usage, temperature, and power consumption.
  • Model Serving: Explore frameworks like vLLM, TGI (Text Generation Inference), and Triton Inference Server for efficient LLM serving on Kubernetes.
  • Cost Optimization: Use spot/preemptible GPU instances for fault-tolerant training jobs with checkpointing, and right-size inference workloads with MIG or time-slicing.