
CNI: The Container Network Interface

Key Takeaways
  • CNI Specification: The CNI spec defines a simple contract -- a JSON config file and a binary that the container runtime calls with ADD, DEL, CHECK, and VERSION commands to configure pod networking.
  • Kubelet Integration: The kubelet tells the container runtime (containerd/CRI-O) to invoke CNI plugins after creating the pod's network namespace but before starting containers.
  • Overlay vs. Direct Routing: Overlay networks (VXLAN, Geneve) encapsulate pod traffic in outer headers for portability; direct routing (BGP, cloud-native) avoids encapsulation for native performance but requires network infrastructure support.
  • IPAM: IP Address Management is a critical CNI responsibility -- plugins must allocate non-overlapping pod CIDRs across nodes using host-local, DHCP, or cloud-provider-specific IPAM backends.
  • CNI Choices: Cilium (eBPF-based, feature-rich), Calico (flexible BGP/overlay, mature policy engine), Flannel (simple overlay), and cloud-native CNIs (AWS VPC CNI, Azure CNI) each serve different operational profiles.

Every Kubernetes cluster needs a CNI plugin: the software responsible for assigning IP addresses to pods and enabling pod-to-pod communication across nodes. Without one, pods cannot communicate and the cluster is effectively non-functional.

1. The CNI Specification

The Container Network Interface is a CNCF specification that defines a minimal contract between container runtimes and network plugins. The spec is deliberately simple: it describes a JSON configuration file and a set of operations that a CNI binary must implement.

CNI Operations

A CNI plugin binary must handle the following commands, passed via the CNI_COMMAND environment variable:

  • ADD: Called when a new pod sandbox is created. The plugin must allocate an IP address, configure the network interface inside the pod's network namespace, and set up routes. It returns a JSON result containing the assigned IP and other details.
  • DEL: Called when a pod is being torn down. The plugin must release the IP address and clean up any network resources (interfaces, routes, iptables rules).
  • CHECK: Called by the runtime to verify that the pod's networking is still correctly configured. If the check fails, the runtime may tear down and recreate the network.
  • VERSION: Returns the CNI spec versions that this plugin supports.
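The contract above is small enough to sketch end to end. The following Python skeleton is illustrative only -- real plugins are typically Go binaries, and the IP address it returns is made up rather than allocated -- but it shows the shape of the protocol: command via CNI_COMMAND, network config JSON on stdin, JSON result on stdout.

```python
import json
import os
import sys

SUPPORTED_VERSIONS = ["0.4.0", "1.0.0"]

def handle(command, config):
    """Dispatch one CNI command and return the JSON-serializable result."""
    if command == "VERSION":
        return {"cniVersion": "1.0.0", "supportedVersions": SUPPORTED_VERSIONS}
    if command == "ADD":
        # A real plugin would allocate from its IPAM backend and create the
        # veth pair here; this static result only shows the expected shape.
        return {
            "cniVersion": config.get("cniVersion", "1.0.0"),
            "interfaces": [{"name": os.environ.get("CNI_IFNAME", "eth0")}],
            "ips": [{"address": "10.244.1.5/24", "gateway": "10.244.1.1"}],
        }
    if command in ("DEL", "CHECK"):
        return {}  # success for DEL/CHECK is exit code 0 with no payload
    raise ValueError(f"unsupported CNI_COMMAND: {command}")

def main():
    # A runtime would invoke this roughly as:
    #   CNI_COMMAND=ADD CNI_IFNAME=eth0 ./plugin < net-config.json
    command = os.environ["CNI_COMMAND"]
    config = json.load(sys.stdin) if command != "VERSION" else {}
    json.dump(handle(command, config), sys.stdout)
```

Note how DEL and CHECK succeed silently: per the spec, only ADD and VERSION are required to write a result object to stdout.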

CNI Configuration

CNI configuration is stored as JSON files in /etc/cni/net.d/ on each node. The container runtime, acting on the kubelet's behalf, loads the lexicographically first configuration file from this directory.

{
  "cniVersion": "1.0.0",
  "name": "my-cluster-network",
  "type": "calico",
  "ipam": {
    "type": "calico-ipam",
    "assign_ipv4": "true",
    "ipv4_pools": ["10.244.0.0/16"]
  },
  "log_level": "info",
  "datastore_type": "kubernetes",
  "nodename_file_optional": true,
  "policy": {
    "type": "k8s"
  }
}

CNI also supports chaining -- multiple plugins are called in sequence. A typical chain might use a primary plugin (Calico) for IP allocation and routing, followed by a bandwidth plugin for traffic shaping and a portmap plugin for host port mappings.
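A chained configuration lives in a .conflist file whose plugins array is executed in order. A sketch of such a chain (the bandwidth and portmap plugins come from the standard CNI plugins repository; values here are illustrative):

```json
{
  "cniVersion": "1.0.0",
  "name": "my-cluster-network",
  "plugins": [
    { "type": "calico", "ipam": { "type": "calico-ipam" } },
    { "type": "bandwidth", "capabilities": { "bandwidth": true } },
    { "type": "portmap", "capabilities": { "portMappings": true } }
  ]
}
```

Each plugin in the chain receives the result of the previous one, so the bandwidth and portmap plugins operate on the interface and IP that Calico set up.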

2. How the Kubelet Invokes CNI Plugins

[Figure: overlay encapsulation. A packet from Pod A on Node 1 is wrapped in a VXLAN header and carried to Pod B on Node 2. Overlay networks wrap packets in another packet (encapsulation); this allows pods to talk across different subnets but adds CPU overhead.]

The flow from pod creation to network readiness follows a precise sequence:

  1. The scheduler assigns a pod to a node. The kubelet on that node receives the pod spec via its watch on the API server.
  2. The kubelet calls the container runtime (containerd or CRI-O) via the CRI (Container Runtime Interface) to create a pod sandbox.
  3. The runtime creates a new network namespace for the pod using Linux unshare(CLONE_NEWNET).
  4. The runtime reads the CNI configuration from /etc/cni/net.d/ and executes the CNI binary (e.g., /opt/cni/bin/calico) with CNI_COMMAND=ADD.
  5. The CNI plugin creates a veth pair -- one end is placed in the pod's network namespace (typically named eth0), and the other end is placed in the host network namespace.
  6. The plugin allocates an IP address from its IPAM backend and configures the interface inside the pod namespace with this IP, a default route, and any required DNS settings.
  7. The plugin programs the host-side networking (routes, iptables rules, eBPF programs, or BGP advertisements) so that other nodes know how to reach this pod IP.
  8. The plugin returns a JSON result to the runtime, which reports readiness to the kubelet.
kubelet --> CRI (containerd) --> create netns --> CNI binary (ADD)
                                                       |
                                                +------v------+
                                                | 1. Create   |
                                                |    veth     |
                                                | 2. Alloc IP |
                                                | 3. Config   |
                                                |    routes   |
                                                +------+------+
                                                       |
                                               return JSON result

3. Overlay vs. Native Routing

There are two fundamental approaches to how pod traffic traverses the physical network between nodes.

Overlay Networks (VXLAN / Geneve)

Overlay networks encapsulate the original pod-to-pod packet inside an outer UDP packet. The outer header uses the node IPs as source and destination, so the physical network only needs to route between nodes -- it does not need to know about pod CIDRs.

  • VXLAN (Virtual Extensible LAN) adds a 50-byte header and is the most widely supported encapsulation. It uses UDP port 4789 by default.
  • Geneve (Generic Network Virtualization Encapsulation) is a newer, extensible format that supports variable-length options in the header. Cilium and OVN prefer Geneve for its extensibility.

MTU considerations: Because encapsulation adds header bytes (50 for VXLAN, 50-80+ for Geneve depending on options), the effective MTU inside pods is reduced. If the underlying network MTU is 1500, pods will have an MTU of 1450 with VXLAN. Failing to account for this causes packet fragmentation and dramatic performance drops. Most CNIs handle this automatically, but it is a common source of subtle bugs when the underlying network MTU is non-standard.
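The arithmetic is simple enough to check by hand. A small sketch, assuming an IPv4 outer header and no VLAN tag (the 50-byte figure is outer Ethernet 14 + IPv4 20 + UDP 8 + VXLAN 8):

```python
# VXLAN encapsulation overhead per packet:
# outer Ethernet (14) + outer IPv4 (20) + UDP (8) + VXLAN (8) = 50 bytes
VXLAN_OVERHEAD = 14 + 20 + 8 + 8

def pod_mtu(underlay_mtu, overhead=VXLAN_OVERHEAD):
    """Largest pod-side MTU that avoids fragmentation on the underlay."""
    return underlay_mtu - overhead

print(pod_mtu(1500))  # 1450 -- the common default with a 1500-byte underlay
print(pod_mtu(9000))  # 8950 -- with jumbo frames
```

The same calculation with an IPv6 outer header or Geneve options yields a larger overhead, which is why CNIs that auto-detect MTU are safer than hard-coded values.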

# Cilium Helm values: VXLAN overlay mode
tunnel: vxlan # Encapsulation mode (vxlan or geneve)
mtu: 1450 # Pod MTU accounting for VXLAN overhead
# autoDirectNodeRoutes: false # Not used in overlay mode

Direct / Native Routing

In direct routing mode, pod CIDRs are injected into the physical network's routing table. Each node advertises its pod CIDR so that upstream routers know to send traffic for those IPs to that node.

  • BGP Peering: Calico and Cilium can peer with the data center's BGP routers (e.g., ToR switches) to advertise pod CIDRs. This provides true native performance with no encapsulation overhead.
  • Cloud-Native Routing: AWS VPC CNI attaches secondary IPs from the VPC directly to pods. Azure CNI similarly assigns VNet IPs. These approaches eliminate overlay overhead entirely but tie you to a specific cloud provider.
  • Direct Routing with IP-in-IP: Calico can combine native routing with lightweight IP-in-IP encapsulation (20-byte overhead) as a middle ground; in CrossSubnet mode it encapsulates only traffic that crosses subnet boundaries and routes natively within a subnet.
# Calico BGP peering configuration
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: rack-tor-switch
spec:
  peerIP: 10.0.0.1                 # ToR switch IP
  asNumber: 64512                  # Remote AS number
  nodeSelector: rack == "rack-01"  # Only nodes in rack-01 peer with this switch
4. Choosing a CNI Plugin

| Feature                | Cilium                      | Calico                    | Flannel                      | AWS VPC CNI               |
|------------------------|-----------------------------|---------------------------|------------------------------|---------------------------|
| Datapath               | eBPF                        | iptables/eBPF             | VXLAN overlay                | VPC native                |
| Network Policy         | L3/L4/L7 + identity         | L3/L4 + global            | None (requires Calico addon) | L3/L4 via security groups |
| Routing                | Overlay, native, BGP        | Overlay, BGP, IPIP        | VXLAN only                   | VPC routing               |
| Encryption             | WireGuard, IPsec            | WireGuard, IPsec          | None                         | VPC-level                 |
| Observability          | Hubble (flows, DNS, HTTP)   | Flow logs                 | None                         | VPC Flow Logs             |
| kube-proxy Replacement | Yes                         | Yes (eBPF mode)           | No                           | No                        |
| Complexity             | Medium-High                 | Medium                    | Low                          | Low (AWS only)            |
| Best For               | Scale, visibility, security | Flexibility, hybrid cloud | Learning, small labs         | AWS-native workloads      |

Calico

Calico is among the most flexible and battle-tested CNIs. It supports overlay networking (VXLAN, IP-in-IP), direct routing via BGP, and can be deployed on bare metal, in private clouds, or on any public cloud. Its policy engine supports Kubernetes NetworkPolicy as well as its own more expressive GlobalNetworkPolicy, and Calico can run in eBPF mode, replacing iptables for its datapath.

Cilium

Cilium is built entirely on eBPF and provides the most advanced feature set: identity-based L7 network policies, transparent encryption, Hubble observability, and a sidecar-free service mesh. It is the default CNI for GKE Dataplane V2 and is increasingly adopted in large-scale production clusters.

Flannel

Flannel is the simplest CNI. It provides a VXLAN overlay and basic IP allocation -- nothing more. It does not support network policies natively (you must add Calico as a policy-only addon). Flannel is appropriate for learning environments and small clusters where simplicity is valued over features.

Cloud-Native CNIs

AWS VPC CNI, Azure CNI, and GKE's native datapath allocate pod IPs directly from the cloud provider's virtual network. This provides the best performance (no encapsulation) and seamless integration with cloud-native services (security groups, load balancers), but eliminates portability across providers.

5. IP Address Management (IPAM)

IPAM is responsible for allocating unique IP addresses to pods without conflicts. Different IPAM backends serve different deployment models.

  • host-local: Each node gets a fixed CIDR (e.g., /24) from the cluster CIDR. Simple and fast, but wastes IP space if nodes have uneven pod counts.
  • Calico IPAM: Uses IP pools with block-based allocation. Blocks can be dynamically reassigned between nodes based on demand, reducing IP waste.
  • Cluster-pool (Cilium): Cilium's default IPAM mode allocates pod CIDRs per node from a cluster-wide pool, similar to host-local but with dynamic sizing.
  • AWS ENI IPAM: Attaches secondary Elastic Network Interfaces to nodes and assigns VPC IPs directly to pods. Limited by the ENI and IP-per-ENI limits of each instance type.
  • DHCP: Allocates IPs from an external DHCP server. Rarely used in Kubernetes but supported by the CNI spec.
# Cilium cluster-pool IPAM configuration
ipam:
  mode: cluster-pool
  operator:
    clusterPoolIPv4PodCIDRList:
      - "10.0.0.0/8"               # Large cluster CIDR
    clusterPoolIPv4MaskSize: 24    # Each node gets a /24 (254 pod IPs)
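The block-carving idea behind host-local and cluster-pool IPAM can be sketched in a few lines. This is an illustrative simplification, not Cilium's or Calico's actual allocator (which must also track releases and reassignments):

```python
import ipaddress

def node_pod_cidrs(cluster_cidr, node_mask, node_count):
    """Carve per-node pod CIDR blocks out of a cluster-wide pool, in order."""
    pool = ipaddress.ip_network(cluster_cidr)
    subnets = pool.subnets(new_prefix=node_mask)  # lazy generator of blocks
    return [str(next(subnets)) for _ in range(node_count)]

print(node_pod_cidrs("10.0.0.0/8", 24, 3))
# ['10.0.0.0/24', '10.0.1.0/24', '10.0.2.0/24']
```

A /8 pool split into /24 blocks yields 65,536 node-sized blocks, which is why the mask size, not the pool size, is usually what limits pods per node.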

6. Performance Considerations

Network performance varies significantly between CNI configurations. Key factors to measure:

  • Throughput (TCP_STREAM): Native routing approaches (BGP, cloud-native) achieve within 1-2% of bare-metal. VXLAN overlay typically costs 5-10% throughput. eBPF datapaths outperform iptables-based datapaths by 10-20% at scale.
  • Latency (TCP_RR): Encapsulation adds 10-30 microseconds per round trip. At high request rates, this compounds significantly.
  • Connection rate (TCP_CRR): eBPF-based CNIs handle 30-50% more new connections per second than iptables-based CNIs at large Service counts because they avoid sequential rule traversal.
  • CPU usage: iptables rule processing consumes CPU proportional to rule count. eBPF hash lookups consume constant CPU regardless of cluster size.

Always benchmark with your workload. Tools like netperf, iperf3, and fortio are standard for CNI benchmarking. Test pod-to-pod (same node), pod-to-pod (cross-node), and pod-to-Service paths.

7. Common Pitfalls

  1. Missing CNI binaries. If the CNI plugin DaemonSet is not yet running when a pod is scheduled to a node, the kubelet will fail to create the pod sandbox with a NetworkPlugin cni failed to set up pod error. Ensure CNI DaemonSets have tolerations for node.kubernetes.io/not-ready.

  2. MTU mismatch. The most common overlay networking issue. If your nodes have an MTU of 9000 (jumbo frames) but the CNI defaults to a pod MTU of 1450, you are leaving significant throughput on the table. Conversely, if the underlying MTU is exactly 1500 and you don't reduce the pod MTU for VXLAN overhead, large packets will be silently dropped.

  3. IP exhaustion. With host-local IPAM and a /24 per node, each node can host at most 254 pods. On large nodes with 100+ cores, this can become a bottleneck. Use a larger mask or a dynamic IPAM backend.

  4. CNI config file ordering. The container runtime picks the first file (alphabetically) in /etc/cni/net.d/. If multiple CNIs are installed (e.g., Flannel leftover + Calico), the wrong one may be selected. Always clean up old CNI configurations during migration.

  5. Forgetting network policies. By default, all pod-to-pod traffic is allowed. Many teams install a CNI and assume they have segmentation -- they do not, until they deploy NetworkPolicy resources. Start with a default-deny policy and explicitly allow required traffic.
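A namespace-wide default-deny policy is a few lines of YAML (the namespace name here is illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: my-namespace
spec:
  podSelector: {}   # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```

Once a pod is selected by any NetworkPolicy, traffic not explicitly allowed by some policy is dropped, so subsequent allow rules can be added incrementally.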

  6. BGP peering instability. When using BGP with Calico or Cilium, a flapping BGP session (due to resource pressure on the node) can cause route withdrawals, making pods on that node unreachable for seconds. Monitor BGP session state and ensure the CNI agent has sufficient CPU and memory resources.

8. What's Next?

  • eBPF Networking: Dive deeper into eBPF and Cilium's architecture in the eBPF Networking guide.
  • Network Policies: Learn how to write Kubernetes NetworkPolicy resources and Calico/Cilium-specific policies to segment your cluster traffic.
  • Service Networking: Understand how Services, kube-proxy, and Endpoints work on top of the CNI layer.
  • IPv6 and Dual-Stack: Explore dual-stack networking, where pods get both IPv4 and IPv6 addresses, and the CNI configuration changes required.
  • Troubleshooting: When pod networking fails, start by checking kubectl describe pod events, then inspect the CNI logs on the node (journalctl -u kubelet and the CNI DaemonSet logs), and finally verify the network namespace configuration with nsenter and ip addr.