The Architecture: Brain & Muscle
- Control Plane vs. Worker Nodes: The cluster is divided into the "Brain" (global decisions, scheduling, reconciliation) and "Muscle" (running application workloads).
- Single Source of Truth: `etcd` is a distributed key-value store that holds all cluster state; its integrity is critical for cluster survival.
- API Server: The central hub — every component and tool communicates through the API Server. It is the only component that reads/writes to etcd.
- Declarative Management: Controllers continuously watch the API Server and reconcile actual state with desired state.
- Node Agents: The `kubelet` ensures containers are running on each node, while `kube-proxy` maintains networking rules for Service routing.
A Kubernetes cluster is split into two logical layers: the Control Plane (the Brain) and the Worker Nodes (the Muscle). Understanding how these components interact is fundamental to debugging, scaling, and securing your cluster.
The Control Plane (The "Brain")
The Control Plane makes global decisions about the cluster — like where to schedule Pods, when to scale workloads, and how to respond to failures. In production, Control Plane components are typically replicated across multiple machines for high availability.
0. The Core Concept: The Infinite Loop
Before diving into components, understand the mental model. Kubernetes is not a "fire and forget" system; it is an eventually consistent system based on continuous reconciliation.
Controllers run in an infinite loop:
- Observe the current state.
- Compare it to the desired state.
- Act to make the current state more like the desired state.
This means the cluster is never "done" syncing. If you manually delete a Pod managed by a Deployment, the system notices the deviation and recreates it. If a node crashes, the system notices and reschedules its work. This self-healing nature is the "magic" of Kubernetes.
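This loop can be sketched in a few lines of Python. This is a conceptual model only: the dictionaries stand in for API objects, not real controller machinery.

```python
# Conceptual sketch of a Kubernetes-style reconciliation loop.
# Keys are object names; values stand in for object specs.

def reconcile(desired: dict, actual: dict) -> dict:
    """One pass of the loop: act until actual matches desired."""
    for name, spec in desired.items():
        if actual.get(name) != spec:
            actual[name] = spec      # create missing objects, correct drifted ones
    for name in list(actual):
        if name not in desired:
            del actual[name]         # garbage-collect objects no longer desired
    return actual

# A Deployment wants 3 replicas; someone manually deleted pod-2.
desired = {"pod-0": "v1", "pod-1": "v1", "pod-2": "v1"}
actual  = {"pod-0": "v1", "pod-1": "v1"}
print(reconcile(desired, actual))  # pod-2 is recreated
```

Because the loop repeats forever, it does not matter *why* the states diverged; the next pass converges them again.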
1. API Server (kube-apiserver)
The API Server is the front door to the entire cluster. Every interaction — from kubectl commands to internal controller operations — goes through the API Server's RESTful HTTP interface.
Key responsibilities:
- Authentication and Authorization: Validates who is making the request (AuthN) and whether they are allowed (AuthZ via RBAC)
- Admission Control: Runs a chain of admission webhooks that can mutate or reject requests before they are persisted
- Validation: Ensures the submitted resource definition is structurally valid
- Persistence: Writes the validated object to etcd
- Watch Mechanism: Supports long-lived watch connections so controllers are notified immediately when objects change
```text
kubectl apply ──→ API Server ──→ Authentication
                                 ──→ Authorization (RBAC)
                                 ──→ Admission Controllers
                                 ──→ Validation
                                 ──→ Write to etcd
                                 ──→ Notify watchers
```
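This pipeline behaves like a chain of stages, each able to mutate or reject a request before anything is persisted. The following is a toy sketch with illustrative function names, not the real API machinery:

```python
# Toy model of the API Server request pipeline (all names are illustrative).

class Rejected(Exception):
    pass

def authenticate(req):              # AuthN: who is making this request?
    if req.get("user") is None:
        raise Rejected("401 Unauthorized")
    return req

def authorize(req):                 # AuthZ: is this verb allowed? (RBAC)
    if req["verb"] not in req.get("allowed_verbs", []):
        raise Rejected("403 Forbidden")
    return req

def admit(req):                     # Admission: webhooks may mutate or reject
    req["object"].setdefault("labels", {})["managed-by"] = "example"
    return req

def handle(req, etcd):
    for stage in (authenticate, authorize, admit):
        req = stage(req)            # any stage can short-circuit with Rejected
    etcd[req["object"]["name"]] = req["object"]   # persistence: write to etcd
    return "201 Created"

etcd = {}
req = {"user": "alice", "verb": "create",
       "allowed_verbs": ["create"], "object": {"name": "web"}}
print(handle(req, etcd))  # 201 Created
```

The ordering matters: a request is never persisted unless every stage before the etcd write has passed.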
Important detail: The API Server is stateless. It does not store any data itself — all state lives in etcd. This means you can run multiple API Server instances behind a load balancer for high availability without any synchronization concerns between them.
2. etcd (The "Memory")
etcd is a consistent, highly-available, distributed key-value store that serves as the single source of truth for all cluster state — every Pod definition, Service, ConfigMap, Secret, and RBAC policy.
Key characteristics:
| Property | Detail |
|---|---|
| Consensus algorithm | RAFT — ensures all nodes agree on the same data |
| CAP trade-off | Chooses Consistency over Availability — if a majority of etcd nodes are lost, the cluster stops accepting writes to prevent data corruption |
| Typical cluster size | 3 or 5 nodes (odd sizes are used because an even extra member adds no RAFT fault tolerance) |
| Data format | Key-value pairs stored under /registry/ prefix |
| Access | Only the API Server communicates with etcd directly |
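The quorum arithmetic behind the "3 or 5 nodes" advice is easy to verify with a worked sketch (this is just the majority formula, not etcd code):

```python
# RAFT quorum: a write commits only once a majority of members agree.

def quorum(n: int) -> int:
    """Smallest majority of an n-member cluster."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Members that can fail while the cluster still accepts writes."""
    return n - quorum(n)

for n in (1, 2, 3, 4, 5):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
# 4 nodes tolerate no more failures than 3, which is why odd sizes are preferred.
```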
Why etcd matters for operators:
- Backup etcd = Backup the cluster. If you lose etcd data, you lose the entire cluster state. Regular etcd snapshots are critical for disaster recovery.
- Performance: etcd is sensitive to disk I/O latency. Use SSDs for etcd data directories. Network latency between etcd nodes should be under 10ms.
- Size limits: Each value in etcd is limited to 1.5MB by default. This affects the maximum size of ConfigMaps, Secrets, and CRDs.
```bash
# Take an etcd snapshot (disaster recovery)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```
3. Scheduler (kube-scheduler)
The Scheduler watches for newly created Pods that have no Node assigned and selects the best Node for them to run on.
The scheduling process has two phases:
- Filtering: Eliminate nodes that cannot run the Pod (insufficient CPU/RAM, taints the Pod doesn't tolerate, node selectors that don't match, PV affinity constraints)
- Scoring: Rank the remaining nodes by desirability (spread Pods across failure domains, prefer nodes with the image already cached, balance resource usage)
Example scenario: You create a Pod requesting 2 CPUs and 4GB RAM. The scheduler:
- Filters out Node A (only 1 CPU free) and Node B (tainted for GPU workloads)
- Scores Node C (8 CPU free, image cached) higher than Node D (4 CPU free, no cache)
- Binds the Pod to Node C
The scheduler is extensible. You can write custom schedulers or use scheduling plugins to implement your own filtering and scoring logic.
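The two-phase decision can be sketched as plain filter and score functions. This is illustrative only; the real scheduler runs these phases through a plugin framework with many more criteria.

```python
# Toy scheduler: filter out infeasible nodes, then score the survivors.
# Node attributes here mirror the example scenario above.

nodes = [
    {"name": "node-a", "free_cpu": 1, "tainted": False, "image_cached": False},
    {"name": "node-b", "free_cpu": 8, "tainted": True,  "image_cached": True},
    {"name": "node-c", "free_cpu": 8, "tainted": False, "image_cached": True},
    {"name": "node-d", "free_cpu": 4, "tainted": False, "image_cached": False},
]

def feasible(node, pod):                       # Filtering phase
    return node["free_cpu"] >= pod["cpu"] and not node["tainted"]

def score(node):                               # Scoring phase
    return node["free_cpu"] + (5 if node["image_cached"] else 0)

def schedule(pod):
    candidates = [n for n in nodes if feasible(n, pod)]
    if not candidates:
        return None                            # Pod stays Pending
    return max(candidates, key=score)["name"]  # bind to the highest scorer

print(schedule({"cpu": 2}))  # node-c: most free CPU plus a cached image
```

Note the failure mode: if filtering eliminates every node, the Pod simply stays `Pending` until the cluster changes, which is exactly what you see with `kubectl describe pod` on an unschedulable Pod.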
4. Controller Manager (kube-controller-manager)
The Controller Manager is a single binary that bundles dozens of independent control loops (controllers). Each controller watches for changes to specific resources and takes action to reconcile actual state with desired state.
Key controllers include:
| Controller | Watches | Action |
|---|---|---|
| ReplicaSet Controller | ReplicaSets | Creates/deletes Pods to match .spec.replicas |
| Deployment Controller | Deployments | Creates/updates ReplicaSets for rolling updates |
| Node Controller | Nodes | Detects unresponsive nodes, marks them NotReady, evicts Pods |
| Job Controller | Jobs | Creates Pods, tracks completions, handles failures |
| EndpointSlice Controller | Services + Pods | Updates EndpointSlices when Pods become ready/unready |
| ServiceAccount Controller | Namespaces | Creates default ServiceAccount in new namespaces |
| Namespace Controller | Namespaces | Cleans up resources when a namespace is deleted |
Each controller follows the same pattern:
Watch for changes → Compare desired vs. actual → Take corrective action → Repeat
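For the ReplicaSet controller, for example, the "take corrective action" step reduces to counting Pods and creating or deleting the difference (a sketch, not the real controller):

```python
# Sketch of the ReplicaSet controller's corrective action.
import itertools

_ids = itertools.count(1)   # stand-in for Kubernetes' generated Pod name suffixes

def reconcile_replicaset(spec_replicas: int, pods: list) -> list:
    """Create or delete Pods so that len(pods) == spec_replicas."""
    diff = spec_replicas - len(pods)
    if diff > 0:
        pods = pods + [f"web-{next(_ids)}" for _ in range(diff)]   # too few: create
    elif diff < 0:
        pods = pods[:spec_replicas]                                # too many: delete surplus
    return pods

pods = reconcile_replicaset(3, ["web-a"])   # scale up: 1 -> 3
print(len(pods))  # 3
pods = reconcile_replicaset(1, pods)        # scale down: 3 -> 1
print(len(pods))  # 1
```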
5. Cloud Controller Manager (cloud-controller-manager)
This component embeds cloud-provider-specific logic, decoupling Kubernetes core from cloud APIs. It handles:
- Node Controller (cloud): Checks with the cloud API whether a node that stopped responding has been deleted from the cloud
- Route Controller: Sets up routes in the cloud infrastructure for Pod networking
- Service Controller: Creates, updates, and deletes cloud load balancers when you create a Service of type `LoadBalancer`
Example: When you create a Service with type: LoadBalancer on EKS, the cloud controller manager calls the AWS API to provision an Elastic Load Balancer, configure health checks, and point it at your node group.
The Worker Nodes (The "Muscle")
Worker Nodes are the machines that actually run your application containers. Each node runs three essential components.
1. kubelet (The "Agent")
The kubelet is an agent running on every node in the cluster. It is the bridge between the Control Plane and the container runtime.
Key responsibilities:
- Watches the API Server for Pods scheduled to its node
- Instructs the container runtime to start, stop, and manage containers
- Reports node status (capacity, conditions, addresses) back to the API Server
- Executes liveness, readiness, and startup probes
- Manages volume mounting and unmounting
- Handles container log rotation
Important: The kubelet does not manage containers that were not created by Kubernetes. It only manages Pods that are either assigned by the API Server or defined as static Pod manifests on the local filesystem.
```bash
# Check kubelet status on a node
systemctl status kubelet

# View kubelet logs
journalctl -u kubelet -f
```
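Probe handling, for instance, amounts to a counting loop: a container is restarted only after a run of consecutive probe failures reaches the threshold. This is a simplified model of that accounting; the parameter name mirrors `failureThreshold` in the Pod spec.

```python
# Simplified liveness-probe accounting, per container.

def needs_restart(results, failure_threshold=3):
    """results: booleans from successive probe executions, oldest first.
    Returns True if the failure threshold was ever reached."""
    consecutive_failures = 0
    for ok in results:
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures >= failure_threshold:
            return True        # kubelet would kill and restart the container
    return False

print(needs_restart([True, False, False, True, False]))   # False: streak reset by a success
print(needs_restart([True, False, False, False]))         # True: three failures in a row
```

This is why a flaky endpoint that fails one probe in three never triggers a restart: the success in between resets the counter.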
2. kube-proxy (The "Networker")
kube-proxy maintains network rules on each node that implement the Kubernetes Service abstraction. When traffic is destined for a Service ClusterIP, kube-proxy ensures it reaches a healthy backend Pod.
Implementation modes:
| Mode | Mechanism | Performance | Notes |
|---|---|---|---|
| iptables (default) | Linux iptables rules | Good for <1000 Services | Picks a backend at random; rule count grows linearly with Services |
| IPVS | Linux IP Virtual Server | Better for >1000 Services | Supports round-robin, least-connection, and other algorithms |
| nftables | Successor to iptables | Available since K8s 1.29 (alpha) | Modern Linux kernels |
Note: In clusters using Cilium as the CNI, kube-proxy is often replaced entirely by Cilium's eBPF-based networking, which avoids iptables overhead.
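The behavioral difference between iptables' random backend selection and IPVS' round-robin can be illustrated with a toy backend picker (conceptual only; this is not how either mode is configured):

```python
# Toy Service load balancing: iptables-style random vs IPVS-style round-robin.
import itertools
import random

backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

def iptables_pick():
    """Each new connection lands on a uniformly random backend."""
    return random.choice(backends)

_rr = itertools.cycle(backends)
def ipvs_round_robin():
    """Each new connection lands on the next backend in turn."""
    return next(_rr)

print([ipvs_round_robin() for _ in range(4)])
# ['10.0.0.1', '10.0.0.2', '10.0.0.3', '10.0.0.1']
```

Random selection is statistically even over many connections but can momentarily pile connections onto one Pod; IPVS' schedulers give tighter distribution guarantees, which matters at high Service counts.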
3. Container Runtime
The software that actually runs containers. It must implement the Container Runtime Interface (CRI) to communicate with the kubelet.
| Runtime | Description | Used By |
|---|---|---|
| containerd | Industry standard, CNCF graduated | EKS, GKE, Kind, K3s |
| CRI-O | Lightweight, built for K8s | OpenShift, some bare-metal setups |
The runtime is responsible for pulling images from registries, creating containers from images, managing the container lifecycle, and providing the container's filesystem and network namespace.
The Reality: Leaky Abstractions
While Kubernetes promises isolation, Nodes are finite physical machines.
- Disk I/O: If one Pod saturates the node's disk IOPS (e.g., logging frantically), other Pods on that node will slow down. Kubernetes does not isolate Disk I/O by default.
- Network Bandwidth: A Pod downloading a massive dataset can starve others of bandwidth.
- Kernel Resources: Exhausting the Conntrack table or PID limit affects the entire node.
Expert Tip: This is why "Noisy Neighbors" are a top cause of intermittent failures. Use Quality of Service (QoS) classes and resource quotas to mitigate this, but know that perfect isolation requires hardware separation (e.g., dedicated node pools).
How a Pod Gets Created: End-to-End Flow
Understanding the full sequence helps with debugging:
1. User runs: kubectl apply -f pod.yaml
2. kubectl sends HTTP POST to API Server
3. API Server authenticates → authorizes → admits → validates
4. API Server writes Pod object to etcd (Status: Pending)
5. Scheduler watches for unscheduled Pods, finds this one
6. Scheduler filters and scores nodes, picks Node C
7. Scheduler writes node assignment to API Server → etcd
8. kubelet on Node C watches for Pods assigned to it
9. kubelet instructs containerd to pull image + start container
10. kubelet updates Pod status to Running via API Server
11. kube-proxy updates iptables/IPVS rules if Pod is behind a Service
If anything goes wrong, kubectl describe pod <name> shows the Events section, which traces exactly where in this flow the failure occurred.
High Availability Architecture
For production clusters, the Control Plane should be highly available:
- Multiple API Server instances behind a load balancer (API Server is stateless)
- 3 or 5 etcd nodes for RAFT quorum (tolerates 1 or 2 node failures respectively)
- Multiple Scheduler and Controller Manager instances with leader election (only one active at a time, others on standby)
```text
            ┌───────────────┐
            │ Load Balancer │
            └───────┬───────┘
       ┌────────────┼────────────┐
       ▼            ▼            ▼
  ┌────────┐   ┌────────┐   ┌────────┐
  │  API   │   │  API   │   │  API   │
  │Server 1│   │Server 2│   │Server 3│
  └────┬───┘   └────┬───┘   └────┬───┘
       │            │            │
  ┌────▼────────────▼────────────▼────┐
  │            etcd cluster           │
  │     (3 nodes, RAFT consensus)     │
  └───────────────────────────────────┘
```
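Leader election among the standby Scheduler and Controller Manager instances reduces to renewing a lease: whoever holds an unexpired lease acts, everyone else waits. The toy model below captures that idea; the real mechanism uses a Lease object written through the API Server.

```python
# Toy lease-based leader election, as used by scheduler/controller-manager replicas.

class Lease:
    def __init__(self, duration_s: float):
        self.holder = None
        self.expires = 0.0
        self.duration = duration_s

    def try_acquire(self, candidate: str, now: float) -> bool:
        """Acquire the lease if free or expired, or renew it if already held."""
        if self.holder is None or now >= self.expires or self.holder == candidate:
            self.holder = candidate
            self.expires = now + self.duration
            return True
        return False            # another instance is the active leader

lease = Lease(duration_s=15)
print(lease.try_acquire("cm-1", now=0))    # True: cm-1 becomes leader
print(lease.try_acquire("cm-2", now=5))    # False: cm-1's lease is still valid
print(lease.try_acquire("cm-2", now=20))   # True: lease expired, cm-2 takes over
```

If the leader crashes, it simply stops renewing, and a standby takes over after at most one lease duration. That bounded hand-off time is the trade-off against running two active leaders by accident.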
Resilience: The Split Brain Problem
What happens if the network partitions and the Control Plane loses quorum (e.g., 2 out of 3 etcd nodes can't talk to each other)?
- Writes Stop: The API Server becomes read-only. You cannot deploy new code or scale up.
- Reads Continue: You can still inspect cluster state; reads served by API Servers on the minority side may be stale.
- Existing Workloads Run: The Kubelets on worker nodes continue running their assigned Pods. The cluster does not stop. It just stops accepting changes.
This "fail-static" behavior is a critical feature. Your application stays online even if the brain is lobotomized.
Common Pitfalls
- etcd disk I/O: Running etcd on slow disks (HDD or network-attached storage with high latency) causes API Server timeouts and cluster instability. Always use fast SSDs.
- Single Control Plane node: In production, a single master node is a single point of failure. Use at least 3 Control Plane nodes.
- Overloaded API Server: Poorly written controllers or scripts that poll the API Server aggressively (instead of using watches) can overwhelm the API Server. Use informers and watch-based patterns.
- kubelet certificate expiry: kubelet certificates rotate automatically, but if the rotation fails (e.g., due to clock skew), the node becomes NotReady.
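The "Overloaded API Server" pitfall is easy to quantify: polling costs a full LIST on every tick regardless of whether anything changed, while a watch costs one long-lived connection plus one message per actual change. A back-of-the-envelope sketch (all numbers are illustrative):

```python
# Request volume over one hour: aggressive polling vs a single watch.

objects = 5000            # pods in the cluster (illustrative)
poll_interval_s = 10      # a "fast" polling script
changes_per_hour = 200    # actual object mutations in the window

poll_requests = 3600 // poll_interval_s       # one LIST per tick
watch_requests = 1 + changes_per_hour         # one WATCH + pushed events

print(f"polling: {poll_requests} LIST calls/hour, each serializing {objects} objects")
print(f"watch:   {watch_requests} messages/hour total")
```

The gap widens with cluster size: each LIST scales with the number of objects, while watch traffic scales only with the change rate.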
Hands-On Exercise
Explore your cluster's architecture:
```bash
# List all nodes and their roles
kubectl get nodes -o wide

# Inspect Control Plane components (running as static pods)
kubectl get pods -n kube-system

# Check component health
kubectl get componentstatuses   # Deprecated but still works in some versions
kubectl get --raw='/healthz'    # API Server health

# View the kubelet configuration on a node
kubectl get configmap kubelet-config -n kube-system -o yaml
```
What's Next?
Now that you understand the architecture, proceed to:
- Setting Up Your Lab — Create a local cluster and inspect the components yourself
- Kubeadm Bootstrapping — Build a cluster from scratch to see each component start
- Pods — Learn about the smallest unit the architecture manages