Container Runtimes (CRI)
- CRI as the Interface: The Container Runtime Interface (CRI) is a gRPC-based API that the kubelet uses to communicate with container runtimes, enabling runtime implementations to be swapped without changing Kubernetes core code.
- containerd: The industry-standard runtime, extracted from Docker, providing a stable, high-performance daemon that manages the complete container lifecycle -- image pull, snapshot, container execution, and storage.
- CRI-O: A lightweight, Kubernetes-native runtime that implements only what Kubernetes needs. It is the default for Red Hat OpenShift and follows the principle of minimal scope.
- OCI Runtimes: Both containerd and CRI-O delegate low-level container creation to an OCI-compliant runtime. runc is the default, but alternatives like Kata Containers (VM-based isolation) and gVisor (user-space kernel) provide stronger security boundaries.
- RuntimeClass: The Kubernetes RuntimeClass resource allows pods to select which OCI runtime to use, enabling mixed workloads where some pods run in standard runc containers and others run in hardened sandboxes.
- Docker Deprecation: Kubernetes removed the Dockershim in v1.24. Docker-built images still work (they are OCI-compliant), but the Docker daemon is no longer a supported runtime.
Kubernetes does not know how to start a container. It delegates that task to a Container Runtime through the CRI (Container Runtime Interface). Understanding this boundary is essential for debugging pod startup failures, choosing the right runtime for your security requirements, and planning migrations.
1. The CRI Specification
The CRI is a gRPC API defined by Kubernetes. It consists of two services:
RuntimeService
Handles the lifecycle of pod sandboxes and containers:
- RunPodSandbox / StopPodSandbox / RemovePodSandbox: Create and manage the pod-level isolation boundary (network namespace, IPC namespace, PID namespace).
- CreateContainer / StartContainer / StopContainer / RemoveContainer: Manage individual containers within a sandbox.
- ExecSync / Exec / Attach: Execute commands inside running containers (used by kubectl exec).
- PortForward: Set up port forwarding to a pod (used by kubectl port-forward).
ImageService
Handles container image operations:
- PullImage: Downloads a container image from a registry.
- ListImages: Lists images available on the node.
- RemoveImage: Deletes an image from local storage.
- ImageStatus: Returns metadata about a specific image.
How Kubelet Communicates with the Runtime
The kubelet connects to the runtime via a Unix domain socket. The socket path is configured with the --container-runtime-endpoint flag:
# containerd (default in most distributions)
--container-runtime-endpoint=unix:///run/containerd/containerd.sock
# CRI-O
--container-runtime-endpoint=unix:///var/run/crio/crio.sock
When the scheduler assigns a pod to a node, the kubelet issues a sequence of CRI calls:
1. RunPodSandbox -- creates the pod's namespace structure and calls the CNI plugin to set up networking.
2. PullImage -- for each container, pulls the image if not already cached.
3. CreateContainer -- creates the container within the sandbox (does not start it yet).
4. StartContainer -- starts the container process.
If any step fails, the kubelet reports the failure as a pod event and retries with exponential backoff.
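The sequence and retry behavior above can be sketched in plain Python. This is a toy model: the hypothetical FakeRuntime class stands in for a real CRI runtime (the actual API is gRPC/protobuf over a Unix socket, not Python method calls), and the backoff parameters are illustrative.

```python
import time

class FakeRuntime:
    """Toy stand-in for a CRI runtime; records the calls it receives."""
    def __init__(self):
        self.calls = []

    def run_pod_sandbox(self, pod):              # models RunPodSandbox
        self.calls.append(("RunPodSandbox", pod))
        return f"sandbox-{pod}"

    def pull_image(self, image):                 # models PullImage (ImageService)
        self.calls.append(("PullImage", image))

    def create_container(self, sandbox, image):  # models CreateContainer
        self.calls.append(("CreateContainer", image))
        return f"ctr-{image}"

    def start_container(self, ctr):              # models StartContainer
        self.calls.append(("StartContainer", ctr))

def start_pod(runtime, pod, images, max_retries=3, base_delay=0.5):
    """Issue the CRI calls in kubelet order; retry with exponential backoff."""
    delay = base_delay
    for _attempt in range(max_retries):
        try:
            sandbox = runtime.run_pod_sandbox(pod)
            for image in images:
                runtime.pull_image(image)
                ctr = runtime.create_container(sandbox, image)
                runtime.start_container(ctr)
            return True
        except Exception:
            time.sleep(delay)
            delay *= 2          # double the wait between attempts
    return False

rt = FakeRuntime()
start_pod(rt, "web", ["nginx:1.27"])
print([name for name, _ in rt.calls])
# ['RunPodSandbox', 'PullImage', 'CreateContainer', 'StartContainer']
```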
2. containerd Deep Dive
containerd is the most widely used container runtime in production Kubernetes clusters. It was originally a component of Docker and was extracted into a standalone project to serve as a general-purpose container runtime for orchestrators.
Architecture
containerd follows a modular, plugin-based architecture:
+--------------------------------------------+
|             containerd daemon              |
| +----------+ +----------+ +--------------+ |
| | Content  | | Snapshot | |     Task     | |
| |  Store   | | Manager  | |   Service    | |
| | (images) | | (layers) | | (containers) | |
| +----------+ +----------+ +--------------+ |
| +----------+ +----------+ +--------------+ |
| |   CRI    | |  Image   | |   Runtime    | |
| |  Plugin  | | Service  | |    Plugin    | |
| +----------+ +----------+ +--------------+ |
+--------------------------------------------+
       |                         |
   CRI gRPC                 OCI Runtime
(from kubelet)          (runc, kata, etc.)
- Content Store: Stores image content (layers, manifests, configs) as content-addressable blobs.
- Snapshotter: Manages filesystem snapshots for container root filesystems. Different snapshotters support different storage backends (overlayfs, btrfs, ZFS, devmapper).
- Task Service: Manages running container processes (called "tasks" in containerd terminology).
- CRI Plugin: An in-process plugin that implements the Kubernetes CRI gRPC API, translating kubelet requests into containerd operations.
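The content store's content-addressing can be illustrated with a toy sketch. The hypothetical ContentStore class below keys every blob by its own sha256 digest, much as containerd stores image layers, manifests, and configs as digest-addressed blobs on disk; identical content deduplicates automatically.

```python
import hashlib

class ContentStore:
    """Toy content-addressable store: blobs are keyed by their own sha256."""
    def __init__(self):
        self.blobs = {}

    def put(self, data: bytes) -> str:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        self.blobs[digest] = data      # identical content -> same key -> dedup
        return digest

    def get(self, digest: str) -> bytes:
        return self.blobs[digest]

store = ContentStore()
layer = b"layer-tarball-bytes"
d1 = store.put(layer)
d2 = store.put(layer)   # pushing the same layer twice stores only one blob
print(d1 == d2, len(store.blobs))  # True 1
```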
containerd Configuration
# /etc/containerd/config.toml
version = 2
[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "registry.k8s.io/pause:3.10" # Pause container image
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc" # Default OCI runtime
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true # Use systemd cgroup driver
# Kata Containers runtime (for VM-isolated workloads)
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
runtime_type = "io.containerd.kata.v2"
[plugins."io.containerd.grpc.v1.cri".registry]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
endpoint = ["https://registry-1.docker.io"]
Snapshotter Plugins
The snapshotter determines how container filesystem layers are assembled:
- overlayfs (default): Uses the Linux OverlayFS to stack layers. Fast and efficient, but requires a compatible kernel and filesystem.
- devmapper: Uses device-mapper thin provisioning. Required in some environments where overlayfs is not available. Used by AWS Firecracker.
- stargz (eStargz): Enables lazy image pulling -- containers can start before the full image is downloaded by fetching only the layers and files needed for startup.
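The layering idea behind these snapshotters can be sketched with a hypothetical lookup helper: the topmost layer that mentions a path wins. This is a simplification; real OverlayFS works in the kernel and also handles copy-up and directory merging, with whiteouts modeled here as a None value.

```python
def lookup(path, layers):
    """Resolve a file in an overlay-style union: topmost layer wins.
    layers[0] is the uppermost (writable) layer; a None value marks a
    whiteout, i.e. the file was deleted in that layer."""
    for layer in layers:
        if path in layer:
            return layer[path]      # may be None -> deleted
    return None

layers = [
    {"/etc/app.conf": "patched"},                      # container upper layer
    {"/etc/app.conf": "default", "/bin/app": "v1"},    # image layer 2
    {"/bin/sh": "busybox"},                            # image base layer
]
print(lookup("/etc/app.conf", layers))  # patched  (upper shadows lower)
print(lookup("/bin/sh", layers))        # busybox  (falls through to base)
```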
3. CRI-O
CRI-O is a container runtime built specifically for Kubernetes. Unlike containerd, which is a general-purpose runtime usable outside Kubernetes, CRI-O implements only the CRI interface and nothing more. This "do one thing well" philosophy results in a smaller attack surface and simpler operational model.
CRI-O is the default runtime for Red Hat OpenShift and is tightly integrated with the OpenShift lifecycle (upgrades, configuration, monitoring).
Key characteristics:
- Kubernetes-native: No extra APIs or features beyond what Kubernetes requires.
- Versioned with Kubernetes: CRI-O minor releases track Kubernetes minor releases (CRI-O 1.29 supports Kubernetes 1.29), so runtime and cluster upgrade in lockstep.
- Conmon: CRI-O uses conmon (container monitor), a small per-container process that monitors the container, handles logging, and forwards signals.
- Storage: Uses the containers/image and containers/storage libraries from the Podman ecosystem for image management.
# Check CRI-O status
crictl --runtime-endpoint unix:///var/run/crio/crio.sock info
# List containers managed by CRI-O
crictl --runtime-endpoint unix:///var/run/crio/crio.sock ps
4. runc: The Default OCI Runtime
Both containerd and CRI-O delegate the actual container creation to an OCI runtime. The OCI (Open Container Initiative) Runtime Specification defines how to take a filesystem bundle and a configuration file and create a running container process.
runc is the reference implementation of the OCI runtime spec. When containerd or CRI-O needs to start a container, they:
1. Prepare a filesystem bundle (rootfs + config.json).
2. Call runc create to set up the container's Linux namespaces (mount, PID, network, IPC, UTS, user) and cgroups.
3. Call runc start to execute the container's entrypoint process.
runc directly invokes Linux kernel system calls (clone, unshare, pivot_root, mount) to create the isolation boundary. The container process runs directly on the host kernel -- there is no hypervisor or additional kernel involved.
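A rough sketch of what the bundle's config.json contains, built as a Python dict. This is an illustrative subset only: a real file generated by `runc spec` carries many more fields (mounts, capabilities, cgroup resources, and so on).

```python
import json

# Illustrative subset of an OCI runtime spec; a real config.json from
# `runc spec` contains many more fields (mounts, capabilities, cgroups...).
spec = {
    "ociVersion": "1.0.2",
    "process": {"cwd": "/", "args": ["/bin/sh"]},
    "root": {"path": "rootfs", "readonly": False},
    "linux": {
        # One entry per Linux namespace runc will create for the container.
        "namespaces": [
            {"type": ns} for ns in
            ("mount", "pid", "network", "ipc", "uts", "user")
        ]
    },
}

print([ns["type"] for ns in spec["linux"]["namespaces"]])
# ['mount', 'pid', 'network', 'ipc', 'uts', 'user']
```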
5. Alternative OCI Runtimes
While runc is sufficient for most workloads, some scenarios require stronger isolation boundaries.
Kata Containers (VM-Level Isolation)
Kata Containers runs each pod inside a lightweight virtual machine. Instead of sharing the host kernel, each pod gets its own dedicated guest kernel running inside a VM managed by QEMU/KVM or AWS Firecracker.
- Security: Provides hardware-enforced isolation via VT-x / VT-d. A kernel vulnerability in one pod cannot affect other pods or the host.
- Compatibility: Runs standard OCI container images -- no modifications required.
- Overhead: VM startup adds 1-2 seconds of latency and ~30-50MB of memory per pod for the guest kernel.
- Use cases: Multi-tenant clusters, running untrusted code, compliance requirements that mandate VM-level isolation.
gVisor (User-Space Kernel)
gVisor (by Google) implements a user-space kernel called Sentry that intercepts system calls from the container process and re-implements them in a sandboxed Go application. The container never directly talks to the host kernel.
- Security: Reduces the attack surface by intercepting syscalls before they reach the host kernel. Only a small set of vetted syscalls are forwarded to the host.
- Performance: Syscall-heavy workloads (heavy I/O, many threads) can see 5-30% performance degradation due to the syscall interception overhead.
- Compatibility: Some syscalls are not implemented, which can cause compatibility issues with applications that use exotic kernel features.
- Use cases: Sandboxing untrusted workloads (CI/CD build containers, serverless functions, user-uploaded code).
# Pod spec using gVisor runtime via RuntimeClass
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-workload
spec:
  runtimeClassName: gvisor  # Use gVisor instead of runc
  containers:
  - name: untrusted-app
    image: user-code:latest
    resources:
      limits:
        cpu: "1"
        memory: "512Mi"
6. RuntimeClass in Kubernetes
The RuntimeClass resource allows cluster administrators to define which OCI runtime a pod should use. This enables a cluster to offer multiple isolation levels: standard runc for trusted workloads, Kata for multi-tenant isolation, and gVisor for untrusted code.
# Define a RuntimeClass for Kata Containers
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata   # Must match the handler name in containerd/CRI-O config
overhead:       # Account for VM overhead in scheduling
  podFixed:
    memory: "40Mi"
    cpu: "100m"
scheduling:     # Only schedule to nodes that support Kata
  nodeSelector:
    kata-runtime: "true"
---
# Define a RuntimeClass for gVisor
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc  # gVisor's OCI runtime binary
overhead:
  podFixed:
    memory: "30Mi"
    cpu: "50m"
The overhead field tells the scheduler to account for the additional resources consumed by the runtime itself (e.g., the Kata VM kernel or the gVisor Sentry process). The scheduling field ensures that pods are only placed on nodes where the specified runtime is available.
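The arithmetic the scheduler performs can be sketched with hypothetical helpers, simplified to Mi memory units and millicore CPU only (real Kubernetes quantity parsing handles many more suffixes):

```python
def to_millicores(cpu: str) -> int:
    """'100m' -> 100, '1' -> 1000 (simplified: whole cores or millicores)."""
    return int(cpu[:-1]) if cpu.endswith("m") else int(float(cpu) * 1000)

def to_mib(mem: str) -> int:
    """'512Mi' -> 512 (simplified: Mi suffix only)."""
    assert mem.endswith("Mi")
    return int(mem[:-2])

def pod_footprint(containers, overhead):
    """Resources the scheduler reserves for the pod: the sum over
    containers plus the RuntimeClass podFixed overhead (e.g. the
    Kata guest kernel). Returns (millicores, MiB)."""
    cpu = sum(to_millicores(c["cpu"]) for c in containers)
    mem = sum(to_mib(c["memory"]) for c in containers)
    return cpu + to_millicores(overhead["cpu"]), mem + to_mib(overhead["memory"])

cpu, mem = pod_footprint(
    [{"cpu": "1", "memory": "512Mi"}],   # one container's resources
    {"cpu": "100m", "memory": "40Mi"},   # kata RuntimeClass podFixed overhead
)
print(f"{cpu}m, {mem}Mi")  # 1100m, 552Mi
```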
7. Migration from Docker to containerd
Since Kubernetes v1.24, the Dockershim has been removed. Clusters that were using Docker as the container runtime must migrate to containerd or CRI-O. The key points for this migration:
- Images are compatible. Docker-built images are OCI-compliant and work unchanged with containerd and CRI-O.
- The Docker socket is gone. Any tooling that relied on /var/run/docker.sock (build pipelines, monitoring agents, log collectors) must be updated to use the containerd socket or the CRI API via crictl.
- Image filesystem path changes. Docker stored images in /var/lib/docker/; containerd stores them in /var/lib/containerd/. During migration, images need to be re-pulled.
- Cgroup driver alignment. Both the kubelet and the runtime must use the same cgroup driver (systemd is now the recommended default). Mismatched drivers cause pod startup failures.
# Verify the runtime endpoint after migration
kubectl get nodes -o wide
# The CONTAINER-RUNTIME column should show "containerd://1.7.x"
# Use crictl instead of docker CLI for node-level debugging
crictl ps # List running containers
crictl images # List cached images
crictl logs <container-id> # View container logs
crictl inspect <container-id> # Inspect container metadata
8. Choosing a Runtime
| Criterion | containerd | CRI-O |
|---|---|---|
| Ecosystem | Broad (usable outside K8s) | Kubernetes-only |
| Default In | GKE, EKS, AKS, kubeadm | OpenShift |
| Plugin System | Rich (snapshotters, differs) | Minimal |
| Image Lazy Pull | Stargz/Nydus snapshotter | Not built-in |
| Version Coupling | Independent release cycle | Tied to K8s versions |
| Community | CNCF graduated, Docker lineage | Red Hat maintained |
For most teams, containerd is the right choice due to its broad ecosystem support and the fact that it ships as the default in all major managed Kubernetes services. Choose CRI-O if you are running OpenShift or if you prefer a minimal runtime with strict Kubernetes-only scope.
9. Common Pitfalls
- Cgroup driver mismatch. If the kubelet uses systemd cgroups but containerd uses cgroupfs (or vice versa), pods will fail to start with errors about cgroup creation. Always ensure both are configured to use systemd.
- Socket path misconfiguration. After migration from Docker, tools that hardcode /var/run/docker.sock will fail silently. Audit all DaemonSets (log collectors, monitoring agents) for Docker socket mounts.
- Missing pause image. The sandbox_image in containerd's config must be pullable. In air-gapped environments, pre-load the pause image or configure a mirror.
- RuntimeClass not installed on target nodes. If a pod specifies runtimeClassName: kata but the node does not have Kata installed, the pod will be stuck in ContainerCreating with a cryptic error. Use the scheduling.nodeSelector field in the RuntimeClass to prevent this.
- containerd config version mismatch. containerd v1.x used the version = 1 config format; v2.x requires version = 2. Upgrading containerd without updating the config file causes startup failures.
- Image not re-pulled after migration. After switching from Docker to containerd, existing images in Docker's storage are not visible to containerd. Pods may enter ImagePullBackOff until the images are re-pulled.
10. What's Next?
- Container Security: Explore how RuntimeClass, seccomp profiles, and AppArmor policies work together with the container runtime to provide defense in depth.
- Image Management: Learn about image pull policies, pre-pulling strategies, and image garbage collection configured in the kubelet.
- eBPF Networking: Understand how the CNI plugin is invoked as part of the pod sandbox creation during the CRI flow. See eBPF Networking.
- Node Management: Learn how node maintenance (cordon, drain) interacts with running containers and the runtime. See Node Operations.
- Debugging: Use crictl as your primary tool for node-level container debugging. It speaks the CRI protocol directly and works with any CRI-compliant runtime.