Failure Stories: Anti-Patterns
- Learn from Pain: Production incidents often stem from subtle configuration details (like `ndots:5`) rather than obvious bugs.
- DNS Amplification: The default `ndots:5` configuration in Kubernetes can turn a single external API call into 5+ DNS queries, potentially DDoS-ing your CoreDNS or upstream resolver.
- The Thundering Herd: A massive number of Pods crashing simultaneously (without exponential backoff in custom controllers) can take down the Kubernetes Control Plane.
- Zombie Jobs: Jobs that fail to clean up after themselves can exhaust API server capacity and etcd storage limits.
In Kubernetes, "it works on my machine" often translates to "it took down production." This module explores real-world outage scenarios caused by common architectural anti-patterns.
Story 1: The "DNS DDoS" (ndots:5)
The Symptom: Latency for external API calls (e.g., to AWS S3 or Stripe) is intermittently high (50ms+ extra). CoreDNS CPU usage is spiking.
The Cause:
By default, Kubernetes configures `/etc/resolv.conf` in every Pod with `ndots:5`. This means: "If a domain name has fewer than 5 dots, try appending the search domains first."
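For reference, a freshly created Pod's `/etc/resolv.conf` typically looks like this (the cluster DNS IP and the `default` namespace in the search list are illustrative; yours will differ):

```
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
```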
When your app calls `google.com` (1 dot):
- K8s tries `google.com.<namespace>.svc.cluster.local` → NXDOMAIN
- K8s tries `google.com.svc.cluster.local` → NXDOMAIN
- K8s tries `google.com.cluster.local` → NXDOMAIN
- K8s tries `google.com.us-east-1.compute.internal` (on AWS) → NXDOMAIN
- Finally tries `google.com` → SUCCESS
The Impact: A single external call generates 5 DNS queries. If your app does 1,000 RPS, you are hitting CoreDNS with 5,000 RPS. This can trigger cloud provider DNS throttling (e.g., the AWS Route 53 Resolver limit of 1024 packets/sec per network interface).
The Fix:
- Use FQDNs: End your external domains with a dot (e.g., `google.com.`). This tells the resolver: "stop searching, this name is absolute."
- Tune dnsConfig: If your app doesn't need cross-namespace service discovery, reduce `ndots`:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-optimized
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"  # Only append search domains if the name has fewer than 2 dots
```
Story 2: The API Server Meltdown
The Symptom: kubectl commands time out. The Control Plane is unresponsive. Deployments aren't updating.
The Cause: A developer wrote a custom script (or a naive Controller) to check the status of a job.
- The Logic: "Loop { List all Pods; Check if my-pod is running; Sleep 1s }"
- The Scale: The cluster has 5,000 Pods.
- The Math: 5,000 Pods * 10KB JSON per Pod = 50MB response body.
- The Impact: The script downloads 50MB of data every second. The API Server spends 100% of its CPU serializing JSON and saturates the network bandwidth.
The Fix:
- Use Informers/Watches: Don't poll. Open a long-lived connection and wait for events.
- Pagination: If you MUST list, use `limit=500` and `continue` tokens.
- Field Selectors: Filter on the server side (`kubectl get pods --field-selector status.phase=Running`).
- Priority & Fairness: Use API Priority and Fairness (APF) to deprioritize "heavy" users so system components (Kubelet, Scheduler) don't starve.
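The limit/continue contract can be sketched in pure Python against a toy in-memory store. `list_page` and `fetch_all` are illustrative names, not client-go or kubectl APIs, but the loop mirrors how real clients chain `continue` tokens:

```python
def list_page(items, limit, cont=0):
    """Server side: return one page of at most `limit` items plus a
    continue token (None once the collection is exhausted)."""
    page = items[cont:cont + limit]
    next_token = cont + limit if cont + limit < len(items) else None
    return page, next_token

def fetch_all(items, limit=500):
    """Client side: chain pages until the server stops returning a token.
    Each response stays small instead of one giant 50MB LIST."""
    results, token = [], 0
    while token is not None:
        page, token = list_page(items, limit, token)
        results.extend(page)
    return results
```

In a real cluster the token is an opaque string tied to a resourceVersion, so pages must be consumed promptly before the token expires.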
Story 3: The CrashLoop Cascade
The Symptom: A downstream service (Database) goes down briefly. Suddenly, the entire cluster crawls to a halt, and nodes go NotReady.
The Cause:
- The DB fails.
- 100 replicas of `api-service` crash simultaneously (because they don't handle DB connection failures gracefully).
- Kubernetes restarts them immediately.
- Application startup consumes massive CPU (loading the Spring Boot / Node.js context).
- 100 containers starting at once saturate the Node's CPU.
- Kubelet Starvation: the Kubelet process (which needs CPU to report its heartbeat) gets starved.
- The Control Plane thinks the Node is dead and marks it `NotReady`.
- The Control Plane evicts the Pods to other nodes, spreading the CPU spike and crashing those nodes too.
The Fix:
- Requests & Limits: Set CPU requests and limits so application Pods can't starve system processes (and reserve capacity for the Kubelet via `--system-reserved` / `--kube-reserved`).
- Probe Tuning: Don't use aggressive Liveness Probes that kill the container during a temporary slow startup.
- Backoff: Implement exponential backoff (ideally with jitter) in your application's DB connection logic. Don't retry instantly.
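A minimal sketch of exponential backoff with "full jitter" (the helper names and constants are illustrative, not from any specific client library):

```python
import random

def backoff_delays(retries=6, base=0.5, cap=30.0):
    """Yield one jittered delay per retry: random in [0, min(cap, base * 2^n)].
    Jitter spreads out 100 replicas that all lost the DB at the same moment."""
    for attempt in range(retries):
        ceiling = min(cap, base * (2 ** attempt))
        yield random.uniform(0, ceiling)

def connect_with_backoff(connect, sleep):
    """Retry `connect` with growing pauses instead of hammering the DB."""
    last_error = None
    for delay in backoff_delays():
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            sleep(delay)  # in production: time.sleep(delay)
    raise last_error
```

Without the jitter term, all 100 replicas would still wake up on the same schedule and hit the recovering database in synchronized waves.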
Story 4: The Zombie Job Queue
The Symptom: etcd storage is full. The API Server refuses writes (`etcdserver: mvcc: database space exceeded`).
The Cause: A CronJob runs every minute. It succeeds, but the old Job objects are never deleted.
- Retention: K8s keeps finished Jobs around so you can inspect logs.
- The Accumulation: 1 job/min * 1440 min/day * 30 days = 43,200 Job objects.
- The Impact: etcd (default quota 2 GB, typically capped at 8 GB) fills up with stale object metadata.
The Fix: Always configure history limits on CronJobs.
```yaml
spec:
  successfulJobsHistoryLimit: 3  # Keep only the last 3 successful Jobs
  failedJobsHistoryLimit: 1      # ...and only the last failed Job
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 100  # Auto-delete each Job 100s after it finishes
```