The Architecture: Brain & Muscle
- Control Plane vs. Worker Nodes: The cluster is divided into the "Brain" (global decisions, scheduling, reconciliation) and "Muscle" (running application workloads).
- Single Source of Truth: `etcd` is a distributed key-value store that holds all cluster state; its integrity is critical for cluster survival.
- API Server: The central hub — every component and tool communicates through the API Server. It is the only component that reads/writes to etcd.
- Declarative Management: Controllers continuously watch the API Server and reconcile actual state with desired state.
- Node Agents: The `kubelet` ensures containers are running on each node, while `kube-proxy` maintains networking rules for Service routing.
A Kubernetes cluster is split into two logical layers: the Control Plane (the Brain) and the Worker Nodes (the Muscle). Understanding how these components interact is fundamental to debugging, scaling, and securing your cluster.
The Control Plane (The "Brain")
The Control Plane makes global decisions about the cluster — like where to schedule Pods, when to scale workloads, and how to respond to failures. In production, Control Plane components are typically replicated across multiple machines for high availability.
0. The Core Concept: The Infinite Loop
Before diving into components, understand the mental model. Kubernetes is not a "fire and forget" system; it is an eventually consistent system based on continuous reconciliation.
Controllers run in an infinite loop:
- Observe the current state.
- Compare it to the desired state.
- Act to make the current state more like the desired state.
This means the cluster is never "done" syncing. If you manually delete a Pod managed by a Deployment, the system notices the deviation and recreates it. If a node crashes, the system notices and reschedules its work. This self-healing nature is the "magic" of Kubernetes.
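This loop can be sketched in a few lines of Python. This is a conceptual model only: the dictionaries stand in for API objects, not real controller machinery.

```python
# Conceptual sketch of a Kubernetes-style reconciliation loop.
# Keys are object names; values stand in for object specs.

def reconcile(desired: dict, actual: dict) -> dict:
    """One pass of the loop: act until actual matches desired."""
    for name, spec in desired.items():
        if actual.get(name) != spec:
            actual[name] = spec      # create missing objects, correct drifted ones
    for name in list(actual):
        if name not in desired:
            del actual[name]         # garbage-collect objects no longer desired
    return actual

# A Deployment wants 3 replicas; someone manually deleted pod-2.
desired = {"pod-0": "v1", "pod-1": "v1", "pod-2": "v1"}
actual  = {"pod-0": "v1", "pod-1": "v1"}
print(reconcile(desired, actual))  # pod-2 is recreated
```

Because the loop repeats forever, it does not matter *why* the states diverged; the next pass converges them again.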
1. API Server (kube-apiserver)
The API Server is the front door to the entire cluster. Every interaction — from kubectl commands to internal controller operations — goes through the API Server's RESTful HTTP interface.
Key responsibilities:
- Authentication and Authorization: Validates who is making the request (AuthN) and whether they are allowed (AuthZ via RBAC)
- Admission Control: Runs a chain of admission webhooks that can mutate or reject requests before they are persisted
- Validation: Ensures the submitted resource definition is structurally valid
- Persistence: Writes the validated object to etcd
- Watch Mechanism: Supports long-lived watch connections so controllers are notified immediately when objects change
```text
kubectl apply ──→ API Server ──→ Authentication
                                 ──→ Authorization (RBAC)
                                 ──→ Admission Controllers
                                 ──→ Validation
                                 ──→ Write to etcd
                                 ──→ Notify watchers
```
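This pipeline behaves like a chain of stages, each able to mutate or reject a request before anything is persisted. The following is a toy sketch with illustrative function names, not the real API machinery:

```python
# Toy model of the API Server request pipeline (all names are illustrative).

class Rejected(Exception):
    pass

def authenticate(req):              # AuthN: who is making this request?
    if req.get("user") is None:
        raise Rejected("401 Unauthorized")
    return req

def authorize(req):                 # AuthZ: is this verb allowed? (RBAC)
    if req["verb"] not in req.get("allowed_verbs", []):
        raise Rejected("403 Forbidden")
    return req

def admit(req):                     # Admission: webhooks may mutate or reject
    req["object"].setdefault("labels", {})["managed-by"] = "example"
    return req

def handle(req, etcd):
    for stage in (authenticate, authorize, admit):
        req = stage(req)            # any stage can short-circuit with Rejected
    etcd[req["object"]["name"]] = req["object"]   # persistence: write to etcd
    return "201 Created"

etcd = {}
req = {"user": "alice", "verb": "create",
       "allowed_verbs": ["create"], "object": {"name": "web"}}
print(handle(req, etcd))  # 201 Created
```

The ordering matters: a request is never persisted unless every stage before the etcd write has passed.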
Important detail: The API Server is stateless. It does not store any data itself — all state lives in etcd. This means you can run multiple API Server instances behind a load balancer for high availability without any synchronization concerns between them.
2. etcd (The "Memory")
etcd is a consistent, highly-available, distributed key-value store that serves as the single source of truth for all cluster state — every Pod definition, Service, ConfigMap, Secret, and RBAC policy.
Key characteristics:
| Property | Detail |
|---|---|
| Consensus algorithm | RAFT — ensures all nodes agree on the same data |
| CAP trade-off | Chooses Consistency over Availability — if a majority of etcd nodes are lost, the cluster stops accepting writes to prevent data corruption |
| Typical cluster size | 3 or 5 nodes (odd sizes are used because an even extra member adds no RAFT fault tolerance) |
| Data format | Key-value pairs stored under /registry/ prefix |
| Access | Only the API Server communicates with etcd directly |
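The quorum arithmetic behind the "3 or 5 nodes" advice is easy to verify with a worked sketch (this is just the majority formula, not etcd code):

```python
# RAFT quorum: a write commits only once a majority of members agree.

def quorum(n: int) -> int:
    """Smallest majority of an n-member cluster."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Members that can fail while the cluster still accepts writes."""
    return n - quorum(n)

for n in (1, 2, 3, 4, 5):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
# 4 nodes tolerate no more failures than 3, which is why odd sizes are preferred.
```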
Why etcd matters for operators:
- Backup etcd = Backup the cluster. If you lose etcd data, you lose the entire cluster state. Regular etcd snapshots are critical for disaster recovery.
- Performance: etcd is sensitive to disk I/O latency. Use SSDs for etcd data directories. Network latency between etcd nodes should be under 10ms.
- Size limits: Each value in etcd is limited to 1.5MB by default. This affects the maximum size of ConfigMaps, Secrets, and CRDs.
```bash
# Take an etcd snapshot (disaster recovery)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```
3. Scheduler (kube-scheduler)
The Scheduler watches for newly created Pods that have no Node assigned and selects the best Node for them to run on.
The scheduling process has two phases:
- Filtering: Eliminate nodes that cannot run the Pod (insufficient CPU/RAM, taints the Pod doesn't tolerate, node selectors that don't match, PV affinity constraints)
- Scoring: Rank the remaining nodes by desirability (spread Pods across failure domains, prefer nodes with the image already cached, balance resource usage)
Example scenario: You create a Pod requesting 2 CPUs and 4GB RAM. The scheduler:
- Filters out Node A (only 1 CPU free) and Node B (tainted for GPU workloads)
- Scores Node C (8 CPU free, image cached) higher than Node D (4 CPU free, no cache)
- Binds the Pod to Node C
The scheduler is extensible. You can write custom schedulers or use scheduling plugins to implement your own filtering and scoring logic.
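The two-phase decision can be sketched as plain filter and score functions. This is illustrative only; the real scheduler runs these phases through a plugin framework with many more criteria.

```python
# Toy scheduler: filter out infeasible nodes, then score the survivors.
# Node attributes here mirror the example scenario above.

nodes = [
    {"name": "node-a", "free_cpu": 1, "tainted": False, "image_cached": False},
    {"name": "node-b", "free_cpu": 8, "tainted": True,  "image_cached": True},
    {"name": "node-c", "free_cpu": 8, "tainted": False, "image_cached": True},
    {"name": "node-d", "free_cpu": 4, "tainted": False, "image_cached": False},
]

def feasible(node, pod):                       # Filtering phase
    return node["free_cpu"] >= pod["cpu"] and not node["tainted"]

def score(node):                               # Scoring phase
    return node["free_cpu"] + (5 if node["image_cached"] else 0)

def schedule(pod):
    candidates = [n for n in nodes if feasible(n, pod)]
    if not candidates:
        return None                            # Pod stays Pending
    return max(candidates, key=score)["name"]  # bind to the highest scorer

print(schedule({"cpu": 2}))  # node-c: most free CPU plus a cached image
```

Note the failure mode: if filtering eliminates every node, the Pod simply stays `Pending` until the cluster changes, which is exactly what you see with `kubectl describe pod` on an unschedulable Pod.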
4. Controller Manager (kube-controller-manager)
The Controller Manager is a single binary that bundles dozens of independent control loops (controllers). Each controller watches for changes to specific resources and takes action to reconcile actual state with desired state.
Key controllers include:
| Controller | Watches | Action |
|---|---|---|
| ReplicaSet Controller | ReplicaSets | Creates/deletes Pods to match .spec.replicas |
| Deployment Controller | Deployments | Creates/updates ReplicaSets for rolling updates |
| Node Controller | Nodes | Detects unresponsive nodes, marks them NotReady, evicts Pods |
| Job Controller | Jobs | Creates Pods, tracks completions, handles failures |
| EndpointSlice Controller | Services + Pods | Updates EndpointSlices when Pods become ready/unready |
| ServiceAccount Controller | Namespaces | Creates default ServiceAccount in new namespaces |
| Namespace Controller | Namespaces | Cleans up resources when a namespace is deleted |
Each controller follows the same pattern:
Watch for changes → Compare desired vs. actual → Take corrective action → Repeat
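For the ReplicaSet controller, for example, the "take corrective action" step reduces to counting Pods and creating or deleting the difference (a sketch, not the real controller):

```python
# Sketch of the ReplicaSet controller's corrective action.
import itertools

_ids = itertools.count(1)   # stand-in for Kubernetes' generated Pod name suffixes

def reconcile_replicaset(spec_replicas: int, pods: list) -> list:
    """Create or delete Pods so that len(pods) == spec_replicas."""
    diff = spec_replicas - len(pods)
    if diff > 0:
        pods = pods + [f"web-{next(_ids)}" for _ in range(diff)]   # too few: create
    elif diff < 0:
        pods = pods[:spec_replicas]                                # too many: delete surplus
    return pods

pods = reconcile_replicaset(3, ["web-a"])   # scale up: 1 -> 3
print(len(pods))  # 3
pods = reconcile_replicaset(1, pods)        # scale down: 3 -> 1
print(len(pods))  # 1
```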
5. Cloud Controller Manager (cloud-controller-manager)
This component embeds cloud-provider-specific logic, decoupling Kubernetes core from cloud APIs. It handles:
- Node Controller (cloud): Checks with the cloud API whether a node that stopped responding has been deleted from the cloud
- Route Controller: Sets up routes in the cloud infrastructure for Pod networking
- Service Controller: Creates, updates, and deletes cloud load balancers when you create a Service of type `LoadBalancer`
Example: When you create a Service with type: LoadBalancer on EKS, the cloud controller manager calls the AWS API to provision an Elastic Load Balancer, configure health checks, and point it at your node group.
The Worker Nodes (The "Muscle")
Worker Nodes are the machines that actually run your application containers. Each node runs three essential components.
1. kubelet (The "Agent")
The kubelet is an agent running on every node in the cluster. It is the bridge between the Control Plane and the container runtime.
Key responsibilities:
- Watches the API Server for Pods scheduled to its node
- Instructs the container runtime to start, stop, and manage containers
- Reports node status (capacity, conditions, addresses) back to the API Server
- Executes liveness, readiness, and startup probes
- Manages volume mounting and unmounting
- Handles container log rotation
Important: The kubelet does not manage containers that were not created by Kubernetes. It only manages Pods that are either assigned by the API Server or defined as static Pod manifests on the local filesystem.
```bash
# Check kubelet status on a node
systemctl status kubelet

# View kubelet logs
journalctl -u kubelet -f
```
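Probe handling, for instance, amounts to a counting loop: a container is restarted only after a run of consecutive probe failures reaches the threshold. This is a simplified model of that accounting; the parameter name mirrors `failureThreshold` in the Pod spec.

```python
# Simplified liveness-probe accounting, per container.

def needs_restart(results, failure_threshold=3):
    """results: booleans from successive probe executions, oldest first.
    Returns True if the failure threshold was ever reached."""
    consecutive_failures = 0
    for ok in results:
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures >= failure_threshold:
            return True        # kubelet would kill and restart the container
    return False

print(needs_restart([True, False, False, True, False]))   # False: streak reset by a success
print(needs_restart([True, False, False, False]))         # True: three failures in a row
```

This is why a flaky endpoint that fails one probe in three never triggers a restart: the success in between resets the counter.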
2. kube-proxy (The "Networker")
kube-proxy maintains network rules on each node that implement the Kubernetes Service abstraction. When traffic is destined for a Service ClusterIP, kube-proxy ensures it reaches a healthy backend Pod.
Implementation modes:
| Mode | Mechanism | Performance | Notes |
|---|---|---|---|
| iptables (default) | Linux iptables rules | Good for <1000 Services | Picks a backend at random; rule count grows linearly with Services |
| IPVS | Linux IP Virtual Server | Better for >1000 Services | Supports round-robin, least-connection, and other algorithms |
| nftables | Successor to iptables | Available since K8s 1.29 (alpha) | Modern Linux kernels |
Note: In clusters using Cilium as the CNI, kube-proxy is often replaced entirely by Cilium's eBPF-based networking, which avoids iptables overhead.
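The behavioral difference between iptables' random backend selection and IPVS' round-robin can be illustrated with a toy backend picker (conceptual only; this is not how either mode is configured):

```python
# Toy Service load balancing: iptables-style random vs IPVS-style round-robin.
import itertools
import random

backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

def iptables_pick():
    """Each new connection lands on a uniformly random backend."""
    return random.choice(backends)

_rr = itertools.cycle(backends)
def ipvs_round_robin():
    """Each new connection lands on the next backend in turn."""
    return next(_rr)

print([ipvs_round_robin() for _ in range(4)])
# ['10.0.0.1', '10.0.0.2', '10.0.0.3', '10.0.0.1']
```

Random selection is statistically even over many connections but can momentarily pile connections onto one Pod; IPVS' schedulers give tighter distribution guarantees, which matters at high Service counts.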
3. Container Runtime
The software that actually runs containers. It must implement the Container Runtime Interface (CRI) to communicate with the kubelet.
| Runtime | Description | Used By |
|---|---|---|
| containerd | Industry standard, CNCF graduated | EKS, GKE, Kind, K3s |
| CRI-O | Lightweight, built for K8s | OpenShift, some bare-metal setups |
The runtime is responsible for pulling images from registries, creating containers from images, managing the container lifecycle, and providing the container's filesystem and network namespace.
The Reality: Leaky Abstractions
While Kubernetes promises isolation, Nodes are finite physical machines.
- Disk I/O: If one Pod saturates the node's disk IOPS (e.g., logging frantically), other Pods on that node will slow down. Kubernetes does not isolate Disk I/O by default.
- Network Bandwidth: A Pod downloading a massive dataset can starve others of bandwidth.
- Kernel Resources: Exhausting the Conntrack table or PID limit affects the entire node.
Expert Tip: This is why "Noisy Neighbors" are a top cause of intermittent failures. Use Quality of Service (QoS) classes and resource quotas to mitigate this, but know that perfect isolation requires hardware separation (e.g., dedicated node pools).
How a Pod Gets Created: End-to-End Flow
Understanding the full sequence helps with debugging:
1. User runs: kubectl apply -f pod.yaml
2. kubectl sends HTTP POST to API Server
3. API Server authenticates → authorizes → admits → validates
4. API Server writes Pod object to etcd (Status: Pending)
5. Scheduler watches for unscheduled Pods, finds this one
6. Scheduler filters and scores nodes, picks Node C
7. Scheduler writes node assignment to API Server → etcd
8. kubelet on Node C watches for Pods assigned to it
9. kubelet instructs containerd to pull image + start container
10. kubelet updates Pod status to Running via API Server
11. kube-proxy updates iptables/IPVS rules if Pod is behind a Service
If anything goes wrong, kubectl describe pod <name> shows the Events section, which traces exactly where in this flow the failure occurred.
High Availability Architecture
For production clusters, the Control Plane should be highly available:
- Multiple API Server instances behind a load balancer (API Server is stateless)
- 3 or 5 etcd nodes for RAFT quorum (tolerates 1 or 2 node failures respectively)
- Multiple Scheduler and Controller Manager instances with leader election (only one active at a time, others on standby)
```text
            ┌───────────────┐
            │ Load Balancer │
            └───────┬───────┘
       ┌────────────┼────────────┐
       ▼            ▼            ▼
  ┌────────┐   ┌────────┐   ┌────────┐
  │  API   │   │  API   │   │  API   │
  │Server 1│   │Server 2│   │Server 3│
  └────┬───┘   └────┬───┘   └────┬───┘
       │            │            │
  ┌────▼────────────▼────────────▼────┐
  │            etcd cluster           │
  │     (3 nodes, RAFT consensus)     │
  └───────────────────────────────────┘
```
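Leader election among the standby Scheduler and Controller Manager instances reduces to renewing a lease: whoever holds an unexpired lease acts, everyone else waits. The toy model below captures that idea; the real mechanism uses a Lease object written through the API Server.

```python
# Toy lease-based leader election, as used by scheduler/controller-manager replicas.

class Lease:
    def __init__(self, duration_s: float):
        self.holder = None
        self.expires = 0.0
        self.duration = duration_s

    def try_acquire(self, candidate: str, now: float) -> bool:
        """Acquire the lease if free or expired, or renew it if already held."""
        if self.holder is None or now >= self.expires or self.holder == candidate:
            self.holder = candidate
            self.expires = now + self.duration
            return True
        return False            # another instance is the active leader

lease = Lease(duration_s=15)
print(lease.try_acquire("cm-1", now=0))    # True: cm-1 becomes leader
print(lease.try_acquire("cm-2", now=5))    # False: cm-1's lease is still valid
print(lease.try_acquire("cm-2", now=20))   # True: lease expired, cm-2 takes over
```

If the leader crashes, it simply stops renewing, and a standby takes over after at most one lease duration. That bounded hand-off time is the trade-off against running two active leaders by accident.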
Resilience: The Split Brain Problem
What happens if the network partitions and the Control Plane loses quorum (e.g., 2 out of 3 etcd nodes can't talk to each other)?
- Writes Stop: The API Server becomes read-only. You cannot deploy new code or scale up.
- Reads Continue: You can still inspect cluster state; reads served by API Servers on the minority side may be stale.
- Existing Workloads Run: The Kubelets on worker nodes continue running their assigned Pods. The cluster does not stop. It just stops accepting changes.
This "fail-static" behavior is a critical feature. Your application stays online even if the brain is lobotomized.
Common Pitfalls
- etcd disk I/O: Running etcd on slow disks (HDD or network-attached storage with high latency) causes API Server timeouts and cluster instability. Always use fast SSDs.
- Single Control Plane node: In production, a single master node is a single point of failure. Use at least 3 Control Plane nodes.
- Overloaded API Server: Poorly written controllers or scripts that poll the API Server aggressively (instead of using watches) can overwhelm the API Server. Use informers and watch-based patterns.
- kubelet certificate expiry: kubelet certificates rotate automatically, but if the rotation fails (e.g., due to clock skew), the node becomes NotReady.
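The "Overloaded API Server" pitfall is easy to quantify: polling costs a full LIST on every tick regardless of whether anything changed, while a watch costs one long-lived connection plus one message per actual change. A back-of-the-envelope sketch (all numbers are illustrative):

```python
# Request volume over one hour: aggressive polling vs a single watch.

objects = 5000            # pods in the cluster (illustrative)
poll_interval_s = 10      # a "fast" polling script
changes_per_hour = 200    # actual object mutations in the window

poll_requests = 3600 // poll_interval_s       # one LIST per tick
watch_requests = 1 + changes_per_hour         # one WATCH + pushed events

print(f"polling: {poll_requests} LIST calls/hour, each serializing {objects} objects")
print(f"watch:   {watch_requests} messages/hour total")
```

The gap widens with cluster size: each LIST scales with the number of objects, while watch traffic scales only with the change rate.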
Hands-On Exercise
Explore your cluster's architecture:
```bash
# List all nodes and their roles
kubectl get nodes -o wide

# Inspect Control Plane components (running as static pods)
kubectl get pods -n kube-system

# Check component health
kubectl get componentstatuses   # Deprecated but still works in some versions
kubectl get --raw='/healthz'    # API Server health

# View the kubelet configuration on a node
kubectl get configmap kubelet-config -n kube-system -o yaml
```
What's Next?
Now that you understand the architecture, proceed to:
- Setting Up Your Lab — Create a local cluster and inspect the components yourself
- Kubeadm Bootstrapping — Build a cluster from scratch to see each component start
- Pods — Learn about the smallest unit the architecture manages