Failure Stories: Anti-Patterns
- Learn from Pain: Production incidents often stem from subtle configuration details (like `ndots:5`) rather than obvious bugs.
- DNS Amplification: The default `ndots:5` configuration in Kubernetes can turn a single external API call into 5+ DNS queries, potentially DDoS-ing your CoreDNS or upstream resolver.
- The Thundering Herd: A massive number of Pods crashing simultaneously (without exponential backoff in custom controllers) can take down the Kubernetes Control Plane.
- Zombie Jobs: Jobs that fail to clean up after themselves can exhaust API server capacity and etcd storage limits.
In Kubernetes, "it works on my machine" often translates to "it took down production." This module explores real-world outage scenarios caused by common architectural anti-patterns.
Story 1: The "DNS DDoS" (ndots:5)
The Symptom: Latency for external API calls (e.g., to AWS S3 or Stripe) is intermittently high (50ms+ extra). CoreDNS CPU usage is spiking.
The Cause:
By default, Kubernetes configures `/etc/resolv.conf` in every Pod with `ndots:5`. This means: "If a domain name has fewer than 5 dots, try appending the search domains first."
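For reference, a freshly created Pod's `/etc/resolv.conf` typically looks like this (the cluster DNS IP and the `default` namespace in the search list are illustrative; yours will differ):

```
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
```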
When your app calls `google.com` (1 dot):
- K8s tries `google.com.<namespace>.svc.cluster.local` → NXDOMAIN
- K8s tries `google.com.svc.cluster.local` → NXDOMAIN
- K8s tries `google.com.cluster.local` → NXDOMAIN
- K8s tries `google.com.us-east-1.compute.internal` (on AWS) → NXDOMAIN
- Finally tries `google.com` → SUCCESS
The Impact: A single external call generates 5 DNS queries. If your app does 1,000 RPS, you are hitting CoreDNS with 5,000 RPS. This can trigger cloud provider DNS throttling (e.g., the AWS Route 53 Resolver limit of 1024 packets/sec per network interface).
The Fix:
- Use FQDNs: End your external domains with a dot (e.g., `google.com.`). This tells the resolver: "stop searching, this name is absolute."
- Tune dnsConfig: If your app doesn't need cross-namespace service discovery, reduce `ndots`:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-optimized
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"  # Only append search domains if the name has fewer than 2 dots
```
Story 2: The API Server Meltdown
The Symptom: kubectl commands time out. The Control Plane is unresponsive. Deployments aren't updating.
The Cause: A developer wrote a custom script (or a naive Controller) to check the status of a job.
- The Logic: "Loop { List all Pods; Check if my-pod is running; Sleep 1s }"
- The Scale: The cluster has 5,000 Pods.
- The Math: 5,000 Pods * 10KB JSON per Pod = 50MB response body.
- The Impact: The script downloads 50MB of data every second. The API Server spends 100% of its CPU serializing JSON and saturates the network bandwidth.
The Fix:
- Use Informers/Watches: Don't poll. Open a long-lived connection and wait for events.
- Pagination: If you MUST list, use `limit=500` and `continue` tokens.
- Field Selectors: Filter on the server side (`kubectl get pods --field-selector status.phase=Running`).
- Priority & Fairness: Use API Priority and Fairness (APF) to deprioritize "heavy" users so system components (Kubelet, Scheduler) don't starve.
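The limit/continue contract can be sketched in pure Python against a toy in-memory store. `list_page` and `fetch_all` are illustrative names, not client-go or kubectl APIs, but the loop mirrors how real clients chain `continue` tokens:

```python
def list_page(items, limit, cont=0):
    """Server side: return one page of at most `limit` items plus a
    continue token (None once the collection is exhausted)."""
    page = items[cont:cont + limit]
    next_token = cont + limit if cont + limit < len(items) else None
    return page, next_token

def fetch_all(items, limit=500):
    """Client side: chain pages until the server stops returning a token.
    Each response stays small instead of one giant 50MB LIST."""
    results, token = [], 0
    while token is not None:
        page, token = list_page(items, limit, token)
        results.extend(page)
    return results
```

In a real cluster the token is an opaque string tied to a resourceVersion, so pages must be consumed promptly before the token expires.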
Story 3: The CrashLoop Cascade
The Symptom: A downstream service (Database) goes down briefly. Suddenly, the entire cluster crawls to a halt, and nodes go NotReady.
The Cause:
- The DB fails.
- 100 replicas of `api-service` crash simultaneously (because they don't handle DB connection failures gracefully).
- Kubernetes restarts them immediately.
- Application startup consumes massive CPU (loading the Spring Boot / Node.js context).
- 100 containers starting at once saturate the Node's CPU.
- Kubelet Starvation: the Kubelet process (which needs CPU to report its heartbeat) gets starved.
- The Control Plane thinks the Node is dead and marks it `NotReady`.
- The Control Plane evicts the Pods to other nodes, spreading the CPU spike and crashing those nodes too.
The Fix:
- Requests & Limits: Set CPU requests and limits so application Pods can't starve system processes (and reserve capacity for the Kubelet via `--system-reserved` / `--kube-reserved`).
- Probe Tuning: Don't use aggressive Liveness Probes that kill the container during a temporary slow startup.
- Backoff: Implement exponential backoff (ideally with jitter) in your application's DB connection logic. Don't retry instantly.
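A minimal sketch of exponential backoff with "full jitter" (the helper names and constants are illustrative, not from any specific client library):

```python
import random

def backoff_delays(retries=6, base=0.5, cap=30.0):
    """Yield one jittered delay per retry: random in [0, min(cap, base * 2^n)].
    Jitter spreads out 100 replicas that all lost the DB at the same moment."""
    for attempt in range(retries):
        ceiling = min(cap, base * (2 ** attempt))
        yield random.uniform(0, ceiling)

def connect_with_backoff(connect, sleep):
    """Retry `connect` with growing pauses instead of hammering the DB."""
    last_error = None
    for delay in backoff_delays():
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            sleep(delay)  # in production: time.sleep(delay)
    raise last_error
```

Without the jitter term, all 100 replicas would still wake up on the same schedule and hit the recovering database in synchronized waves.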
Story 4: The Zombie Job Queue
The Symptom: etcd storage is full. The API Server refuses writes (`etcdserver: mvcc: database space exceeded`).
The Cause: A CronJob runs every minute. It succeeds, but the old Job objects are never deleted.
- Retention: K8s keeps finished Jobs around so you can inspect logs.
- The Accumulation: 1 job/min * 1440 min/day * 30 days = 43,200 Job objects.
- The Impact: etcd (default quota 2 GB, typically capped at 8 GB) fills up with stale object metadata.
The Fix: Always configure history limits on CronJobs.
```yaml
spec:
  successfulJobsHistoryLimit: 3  # Keep only the last 3 successful Jobs
  failedJobsHistoryLimit: 1      # ...and only the last failed Job
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 100  # Auto-delete each Job 100s after it finishes
```