Container Runtimes (CRI)
- CRI as the Interface: The Container Runtime Interface (CRI) is a gRPC-based API that the kubelet uses to communicate with container runtimes, enabling runtime implementations to be swapped without changing Kubernetes core code.
- containerd: The industry-standard runtime, extracted from Docker, providing a stable, high-performance daemon that manages the complete container lifecycle -- image pull, snapshot, container execution, and storage.
- CRI-O: A lightweight, Kubernetes-native runtime that implements only what Kubernetes needs. It is the default for Red Hat OpenShift and follows the principle of minimal scope.
- OCI Runtimes: Both containerd and CRI-O delegate low-level container creation to an OCI-compliant runtime. runc is the default, but alternatives like Kata Containers (VM-based isolation) and gVisor (user-space kernel) provide stronger security boundaries.
- RuntimeClass: The Kubernetes RuntimeClass resource allows pods to select which OCI runtime to use, enabling mixed workloads where some pods run in standard runc containers and others run in hardened sandboxes.
- Docker Deprecation: Kubernetes removed the Dockershim in v1.24. Docker-built images still work (they are OCI-compliant), but the Docker daemon is no longer a supported runtime.
Kubernetes does not know how to start a container. It delegates that task to a Container Runtime through the CRI (Container Runtime Interface). Understanding this boundary is essential for debugging pod startup failures, choosing the right runtime for your security requirements, and planning migrations.
1. The CRI Specification
The CRI is a gRPC API defined by Kubernetes. It consists of two services:
RuntimeService
Handles the lifecycle of pod sandboxes and containers:
- RunPodSandbox / StopPodSandbox / RemovePodSandbox: Create and manage the pod-level isolation boundary (network namespace, IPC namespace, PID namespace).
- CreateContainer / StartContainer / StopContainer / RemoveContainer: Manage individual containers within a sandbox.
- ExecSync / Exec / Attach: Execute commands inside running containers (used by kubectl exec).
- PortForward: Set up port forwarding to a pod (used by kubectl port-forward).
ImageService
Handles container image operations:
- PullImage: Downloads a container image from a registry.
- ListImages: Lists images available on the node.
- RemoveImage: Deletes an image from local storage.
- ImageStatus: Returns metadata about a specific image.
How Kubelet Communicates with the Runtime
The kubelet connects to the runtime via a Unix domain socket. The socket path is configured with the --container-runtime-endpoint flag:
# containerd (default in most distributions)
--container-runtime-endpoint=unix:///run/containerd/containerd.sock
# CRI-O
--container-runtime-endpoint=unix:///var/run/crio/crio.sock
When the scheduler assigns a pod to a node, the kubelet issues a sequence of CRI calls:
1. RunPodSandbox -- creates the pod's namespace structure and calls the CNI plugin to set up networking.
2. PullImage -- for each container, pulls the image if not already cached.
3. CreateContainer -- creates the container within the sandbox (does not start it yet).
4. StartContainer -- starts the container process.
If any step fails, the kubelet reports the failure as a pod event and retries with exponential backoff.
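The sequence and retry behavior above can be sketched in plain Python. This is a toy model: the hypothetical FakeRuntime class stands in for a real CRI runtime (the actual API is gRPC/protobuf over a Unix socket, not Python method calls), and the backoff parameters are illustrative.

```python
import time

class FakeRuntime:
    """Toy stand-in for a CRI runtime; records the calls it receives."""
    def __init__(self):
        self.calls = []

    def run_pod_sandbox(self, pod):              # models RunPodSandbox
        self.calls.append(("RunPodSandbox", pod))
        return f"sandbox-{pod}"

    def pull_image(self, image):                 # models PullImage (ImageService)
        self.calls.append(("PullImage", image))

    def create_container(self, sandbox, image):  # models CreateContainer
        self.calls.append(("CreateContainer", image))
        return f"ctr-{image}"

    def start_container(self, ctr):              # models StartContainer
        self.calls.append(("StartContainer", ctr))

def start_pod(runtime, pod, images, max_retries=3, base_delay=0.5):
    """Issue the CRI calls in kubelet order; retry with exponential backoff."""
    delay = base_delay
    for _attempt in range(max_retries):
        try:
            sandbox = runtime.run_pod_sandbox(pod)
            for image in images:
                runtime.pull_image(image)
                ctr = runtime.create_container(sandbox, image)
                runtime.start_container(ctr)
            return True
        except Exception:
            time.sleep(delay)
            delay *= 2          # double the wait between attempts
    return False

rt = FakeRuntime()
start_pod(rt, "web", ["nginx:1.27"])
print([name for name, _ in rt.calls])
# ['RunPodSandbox', 'PullImage', 'CreateContainer', 'StartContainer']
```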
2. containerd Deep Dive
containerd is the most widely used container runtime in production Kubernetes clusters. It was originally a component of Docker and was extracted into a standalone project to serve as a general-purpose container runtime for orchestrators.
Architecture
containerd follows a modular, plugin-based architecture:
+--------------------------------------------+
|             containerd daemon              |
| +----------+ +----------+ +--------------+ |
| | Content  | | Snapshot | |     Task     | |
| |  Store   | | Manager  | |   Service    | |
| | (images) | | (layers) | | (containers) | |
| +----------+ +----------+ +--------------+ |
| +----------+ +----------+ +--------------+ |
| |   CRI    | |  Image   | |   Runtime    | |
| |  Plugin  | | Service  | |    Plugin    | |
| +----------+ +----------+ +--------------+ |
+--------------------------------------------+
       |                         |
   CRI gRPC                 OCI Runtime
(from kubelet)          (runc, kata, etc.)
- Content Store: Stores image content (layers, manifests, configs) as content-addressable blobs.
- Snapshotter: Manages filesystem snapshots for container root filesystems. Different snapshotters support different storage backends (overlayfs, btrfs, ZFS, devmapper).
- Task Service: Manages running container processes (called "tasks" in containerd terminology).
- CRI Plugin: An in-process plugin that implements the Kubernetes CRI gRPC API, translating kubelet requests into containerd operations.
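The content store's content-addressing can be illustrated with a toy sketch. The hypothetical ContentStore class below keys every blob by its own sha256 digest, much as containerd stores image layers, manifests, and configs as digest-addressed blobs on disk; identical content deduplicates automatically.

```python
import hashlib

class ContentStore:
    """Toy content-addressable store: blobs are keyed by their own sha256."""
    def __init__(self):
        self.blobs = {}

    def put(self, data: bytes) -> str:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        self.blobs[digest] = data      # identical content -> same key -> dedup
        return digest

    def get(self, digest: str) -> bytes:
        return self.blobs[digest]

store = ContentStore()
layer = b"layer-tarball-bytes"
d1 = store.put(layer)
d2 = store.put(layer)   # pushing the same layer twice stores only one blob
print(d1 == d2, len(store.blobs))  # True 1
```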
containerd Configuration
# /etc/containerd/config.toml
version = 2
[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "registry.k8s.io/pause:3.10" # Pause container image
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc" # Default OCI runtime
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true # Use systemd cgroup driver
# Kata Containers runtime (for VM-isolated workloads)
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
runtime_type = "io.containerd.kata.v2"
[plugins."io.containerd.grpc.v1.cri".registry]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
endpoint = ["https://registry-1.docker.io"]
Snapshotter Plugins
The snapshotter determines how container filesystem layers are assembled:
- overlayfs (default): Uses the Linux OverlayFS to stack layers. Fast and efficient, but requires a compatible kernel and filesystem.
- devmapper: Uses device-mapper thin provisioning. Required in some environments where overlayfs is not available. Used by AWS Firecracker.
- stargz (eStargz): Enables lazy image pulling -- containers can start before the full image is downloaded by fetching only the layers and files needed for startup.
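The layering idea behind these snapshotters can be sketched with a hypothetical lookup helper: the topmost layer that mentions a path wins. This is a simplification; real OverlayFS works in the kernel and also handles copy-up and directory merging, with whiteouts modeled here as a None value.

```python
def lookup(path, layers):
    """Resolve a file in an overlay-style union: topmost layer wins.
    layers[0] is the uppermost (writable) layer; a None value marks a
    whiteout, i.e. the file was deleted in that layer."""
    for layer in layers:
        if path in layer:
            return layer[path]      # may be None -> deleted
    return None

layers = [
    {"/etc/app.conf": "patched"},                      # container upper layer
    {"/etc/app.conf": "default", "/bin/app": "v1"},    # image layer 2
    {"/bin/sh": "busybox"},                            # image base layer
]
print(lookup("/etc/app.conf", layers))  # patched  (upper shadows lower)
print(lookup("/bin/sh", layers))        # busybox  (falls through to base)
```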
3. CRI-O
CRI-O is a container runtime built specifically for Kubernetes. Unlike containerd, which is a general-purpose runtime usable outside Kubernetes, CRI-O implements only the CRI interface and nothing more. This "do one thing well" philosophy results in a smaller attack surface and simpler operational model.
CRI-O is the default runtime for Red Hat OpenShift and is tightly integrated with the OpenShift lifecycle (upgrades, configuration, monitoring).
Key characteristics:
- Kubernetes-native: No extra APIs or features beyond what Kubernetes requires.
- Versioned with Kubernetes: CRI-O minor releases track Kubernetes minor releases (CRI-O 1.29 supports Kubernetes 1.29), so runtime and cluster upgrade in lockstep.
- Conmon: CRI-O uses conmon (container monitor), a small per-container process that monitors the container, handles logging, and forwards signals.
- Storage: Uses the containers/image and containers/storage libraries from the Podman ecosystem for image management.
# Check CRI-O status
crictl --runtime-endpoint unix:///var/run/crio/crio.sock info
# List containers managed by CRI-O
crictl --runtime-endpoint unix:///var/run/crio/crio.sock ps
4. runc: The Default OCI Runtime
Both containerd and CRI-O delegate the actual container creation to an OCI runtime. The OCI (Open Container Initiative) Runtime Specification defines how to take a filesystem bundle and a configuration file and create a running container process.
runc is the reference implementation of the OCI runtime spec. When containerd or CRI-O needs to start a container, they:
1. Prepare a filesystem bundle (rootfs + config.json).
2. Call runc create to set up the container's Linux namespaces (mount, PID, network, IPC, UTS, user) and cgroups.
3. Call runc start to execute the container's entrypoint process.
runc directly invokes Linux kernel system calls (clone, unshare, pivot_root, mount) to create the isolation boundary. The container process runs directly on the host kernel -- there is no hypervisor or additional kernel involved.
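A rough sketch of what the bundle's config.json contains, built as a Python dict. This is an illustrative subset only: a real file generated by `runc spec` carries many more fields (mounts, capabilities, cgroup resources, and so on).

```python
import json

# Illustrative subset of an OCI runtime spec; a real config.json from
# `runc spec` contains many more fields (mounts, capabilities, cgroups...).
spec = {
    "ociVersion": "1.0.2",
    "process": {"cwd": "/", "args": ["/bin/sh"]},
    "root": {"path": "rootfs", "readonly": False},
    "linux": {
        # One entry per Linux namespace runc will create for the container.
        "namespaces": [
            {"type": ns} for ns in
            ("mount", "pid", "network", "ipc", "uts", "user")
        ]
    },
}

print([ns["type"] for ns in spec["linux"]["namespaces"]])
# ['mount', 'pid', 'network', 'ipc', 'uts', 'user']
```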
5. Alternative OCI Runtimes
While runc is sufficient for most workloads, some scenarios require stronger isolation boundaries.
Kata Containers (VM-Level Isolation)
Kata Containers runs each pod inside a lightweight virtual machine. Instead of sharing the host kernel, each pod gets its own dedicated guest kernel running inside a VM managed by QEMU/KVM or AWS Firecracker.
- Security: Provides hardware-enforced isolation via VT-x / VT-d. A kernel vulnerability in one pod cannot affect other pods or the host.
- Compatibility: Runs standard OCI container images -- no modifications required.
- Overhead: VM startup adds 1-2 seconds of latency and ~30-50MB of memory per pod for the guest kernel.
- Use cases: Multi-tenant clusters, running untrusted code, compliance requirements that mandate VM-level isolation.
gVisor (User-Space Kernel)
gVisor (by Google) implements a user-space kernel called Sentry that intercepts system calls from the container process and re-implements them in a sandboxed Go application. The container never directly talks to the host kernel.
- Security: Reduces the attack surface by intercepting syscalls before they reach the host kernel. Only a small set of vetted syscalls are forwarded to the host.
- Performance: Syscall-heavy workloads (heavy I/O, many threads) can see 5-30% performance degradation due to the syscall interception overhead.
- Compatibility: Some syscalls are not implemented, which can cause compatibility issues with applications that use exotic kernel features.
- Use cases: Sandboxing untrusted workloads (CI/CD build containers, serverless functions, user-uploaded code).
# Pod spec using gVisor runtime via RuntimeClass
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-workload
spec:
  runtimeClassName: gvisor  # Use gVisor instead of runc
  containers:
  - name: untrusted-app
    image: user-code:latest
    resources:
      limits:
        cpu: "1"
        memory: "512Mi"
6. RuntimeClass in Kubernetes
The RuntimeClass resource allows cluster administrators to define which OCI runtime a pod should use. This enables a cluster to offer multiple isolation levels: standard runc for trusted workloads, Kata for multi-tenant isolation, and gVisor for untrusted code.
# Define a RuntimeClass for Kata Containers
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata   # Must match the handler name in containerd/CRI-O config
overhead:       # Account for VM overhead in scheduling
  podFixed:
    memory: "40Mi"
    cpu: "100m"
scheduling:     # Only schedule to nodes that support Kata
  nodeSelector:
    kata-runtime: "true"
---
# Define a RuntimeClass for gVisor
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc  # gVisor's OCI runtime binary
overhead:
  podFixed:
    memory: "30Mi"
    cpu: "50m"
The overhead field tells the scheduler to account for the additional resources consumed by the runtime itself (e.g., the Kata VM kernel or the gVisor Sentry process). The scheduling field ensures that pods are only placed on nodes where the specified runtime is available.
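The arithmetic the scheduler performs can be sketched with hypothetical helpers, simplified to Mi memory units and millicore CPU only (real Kubernetes quantity parsing handles many more suffixes):

```python
def to_millicores(cpu: str) -> int:
    """'100m' -> 100, '1' -> 1000 (simplified: whole cores or millicores)."""
    return int(cpu[:-1]) if cpu.endswith("m") else int(float(cpu) * 1000)

def to_mib(mem: str) -> int:
    """'512Mi' -> 512 (simplified: Mi suffix only)."""
    assert mem.endswith("Mi")
    return int(mem[:-2])

def pod_footprint(containers, overhead):
    """Resources the scheduler reserves for the pod: the sum over
    containers plus the RuntimeClass podFixed overhead (e.g. the
    Kata guest kernel). Returns (millicores, MiB)."""
    cpu = sum(to_millicores(c["cpu"]) for c in containers)
    mem = sum(to_mib(c["memory"]) for c in containers)
    return cpu + to_millicores(overhead["cpu"]), mem + to_mib(overhead["memory"])

cpu, mem = pod_footprint(
    [{"cpu": "1", "memory": "512Mi"}],   # one container's resources
    {"cpu": "100m", "memory": "40Mi"},   # kata RuntimeClass podFixed overhead
)
print(f"{cpu}m, {mem}Mi")  # 1100m, 552Mi
```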
7. Migration from Docker to containerd
Since Kubernetes v1.24, the Dockershim has been removed. Clusters that were using Docker as the container runtime must migrate to containerd or CRI-O. The key points for this migration:
- Images are compatible. Docker-built images are OCI-compliant and work unchanged with containerd and CRI-O.
- The Docker socket is gone. Any tooling that relied on /var/run/docker.sock (build pipelines, monitoring agents, log collectors) must be updated to use the containerd socket or the CRI API via crictl.
- Image filesystem path changes. Docker stored images in /var/lib/docker/; containerd stores them in /var/lib/containerd/. During migration, images need to be re-pulled.
- Cgroup driver alignment. Both the kubelet and the runtime must use the same cgroup driver (systemd is now the recommended default). Mismatched drivers cause pod startup failures.
# Verify the runtime endpoint after migration
kubectl get nodes -o wide
# The CONTAINER-RUNTIME column should show "containerd://1.7.x"
# Use crictl instead of docker CLI for node-level debugging
crictl ps # List running containers
crictl images # List cached images
crictl logs <container-id> # View container logs
crictl inspect <container-id> # Inspect container metadata
8. Choosing a Runtime
| Criterion | containerd | CRI-O |
|---|---|---|
| Ecosystem | Broad (usable outside K8s) | Kubernetes-only |
| Default In | GKE, EKS, AKS, kubeadm | OpenShift |
| Plugin System | Rich (snapshotters, differs) | Minimal |
| Image Lazy Pull | Stargz/Nydus snapshotter | Not built-in |
| Version Coupling | Independent release cycle | Tied to K8s versions |
| Community | CNCF graduated, Docker lineage | Red Hat maintained |
For most teams, containerd is the right choice due to its broad ecosystem support and the fact that it ships as the default in all major managed Kubernetes services. Choose CRI-O if you are running OpenShift or if you prefer a minimal runtime with strict Kubernetes-only scope.
9. Common Pitfalls
- Cgroup driver mismatch. If the kubelet uses systemd cgroups but containerd uses cgroupfs (or vice versa), pods will fail to start with errors about cgroup creation. Always ensure both are configured to use systemd.
- Socket path misconfiguration. After migration from Docker, tools that hardcode /var/run/docker.sock will fail silently. Audit all DaemonSets (log collectors, monitoring agents) for Docker socket mounts.
- Missing pause image. The sandbox_image in containerd's config must be pullable. In air-gapped environments, pre-load the pause image or configure a mirror.
- RuntimeClass not installed on target nodes. If a pod specifies runtimeClassName: kata but the node does not have Kata installed, the pod will be stuck in ContainerCreating with a cryptic error. Use the scheduling.nodeSelector field in the RuntimeClass to prevent this.
- containerd config version mismatch. containerd v1.x used the version = 1 config format; v2.x requires version = 2. Upgrading containerd without updating the config file causes startup failures.
- Image not re-pulled after migration. After switching from Docker to containerd, existing images in Docker's storage are not visible to containerd. Pods may enter ImagePullBackOff until the images are re-pulled.
10. What's Next?
- Container Security: Explore how RuntimeClass, seccomp profiles, and AppArmor policies work together with the container runtime to provide defense in depth.
- Image Management: Learn about image pull policies, pre-pulling strategies, and image garbage collection configured in the kubelet.
- eBPF Networking: Understand how the CNI plugin is invoked as part of the pod sandbox creation during the CRI flow. See eBPF Networking.
- Node Management: Learn how node maintenance (cordon, drain) interacts with running containers and the runtime. See Node Operations.
- Debugging: Use crictl as your primary tool for node-level container debugging. It speaks the CRI protocol directly and works with any CRI-compliant runtime.