Troubleshooting: The Decision Tree

Key Takeaways for AI & Readers
  • Pod Status as Clue: The initial state (Pending, CrashLoopBackOff, ImagePullBackOff, Error, Terminating, Unknown) provides the most direct hint toward the root cause of an issue. Always start diagnosis by checking pod status.
  • Pending (Scheduling): Often indicates a resource shortage, scheduling constraint (Taints/Affinity), or missing PersistentVolume. Check kubectl describe pod events for the scheduler's reason.
  • CrashLoopBackOff (App Error): Signifies the container starts but then fails repeatedly. Use kubectl logs --previous to see the exit reason from the last crashed container.
  • ImagePullBackOff (Registry/Auth): Occurs when the image name is incorrect, the tag does not exist, or credentials for a private registry are missing or expired.
  • Exit Code Decoding: Exit code 137 means the container was killed with SIGKILL, most often because it was OOMKilled (out of memory); 1 is a generic application error; 126 means the entrypoint is not executable (permission denied); 127 means the command was not found.
  • Systematic Approach: Always follow a top-down methodology: check pod status first, then events, then logs, then exec into the container, then check node health, then check networking.

When your application is not working, the Pod status is your first clue. Kubernetes provides a rich set of diagnostic information, but it can be overwhelming to know where to look first. This guide gives you a systematic approach to debugging, starting from the most common issues and working toward more obscure causes.

Visual Decision Tree

Use this interactive flow to diagnose common Pod issues. You can zoom in and drag the nodes to explore the paths.

The Systematic Debugging Methodology

When something goes wrong, resist the urge to start guessing. Follow this sequence:

  1. What is the pod status? (kubectl get pods)
  2. What do the events say? (kubectl describe pod <name>)
  3. What do the logs say? (kubectl logs <name>)
  4. Can I exec into the container? (kubectl exec -it <name> -- /bin/sh)
  5. Is the node healthy? (kubectl get nodes, kubectl describe node <name>)
  6. Is the network working? (Service endpoints, DNS resolution, kube-proxy)

Each step narrows the problem space. Most issues are resolved by step 3.

Essential kubectl Commands for Debugging

Before diving into specific pod statuses, here is your debugging toolkit:

# Get pod status with extra details (node, IP, restarts)
kubectl get pods -o wide

# Get all pods across all namespaces
kubectl get pods -A

# Detailed pod information including events
kubectl describe pod <pod-name>

# Container logs (current instance)
kubectl logs <pod-name>

# Container logs (previous crashed instance -- critical for CrashLoopBackOff)
kubectl logs <pod-name> --previous

# Logs for a specific container in a multi-container pod
kubectl logs <pod-name> -c <container-name>

# Stream logs in real-time
kubectl logs -f <pod-name>

# Execute a command inside a running container
kubectl exec -it <pod-name> -- /bin/sh

# View cluster events sorted by time
kubectl get events --sort-by='.lastTimestamp'

# Check resource usage (requires metrics-server)
kubectl top pods
kubectl top nodes

# Get YAML output to inspect the full spec
kubectl get pod <pod-name> -o yaml

Pod Status Diagnosis

1. Pending

The Pod has been accepted by the cluster but is not yet running on any node.

Common causes:

  • Insufficient resources: No node has enough CPU or memory to satisfy the pod's requests.
  • Taints and tolerations: The pod does not tolerate taints present on available nodes.
  • Affinity rules: The pod's node affinity or pod affinity cannot be satisfied.
  • Unbound PersistentVolumeClaim: The pod references a PVC that has no matching PV.
  • ResourceQuota exceeded: The namespace has hit its quota limit.

Debugging steps:

# Check the Events section -- the scheduler tells you exactly why
kubectl describe pod <pod-name>

# Look for lines like:
# "0/5 nodes are available: 2 Insufficient cpu, 3 had taints that
# the pod didn't tolerate"

# Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check if PVC is bound
kubectl get pvc

Fixes:

  • Scale up the cluster or reduce resource requests.
  • Add the correct toleration to the pod spec.
  • Relax affinity rules from required to preferred.
  • Create or fix the PersistentVolume.
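As an illustration, here is a hedged sketch of a pod spec applying the first two fixes. The taint key/value and the request sizes are placeholders, not values from this guide; match them to what the scheduler event actually reports:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  # Tolerate a hypothetical taint the scheduler event complained about
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "batch"
      effect: "NoSchedule"
  containers:
    - name: app
      image: nginx:1.25
      resources:
        requests:           # lower these if no node can satisfy them
          cpu: "250m"
          memory: "256Mi"
```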

2. CrashLoopBackOff

The container starts, runs briefly, then exits with an error. Kubernetes restarts it with exponential backoff (10s, 20s, 40s, up to 5 minutes).

Common causes:

  • Application code throws an unhandled exception on startup.
  • Missing or incorrect environment variables.
  • Misconfigured entrypoint or command in the container spec.
  • Missing configuration files (ConfigMap or Secret not mounted correctly).
  • Database connection failures on startup.
  • Liveness probe failing too aggressively.

Debugging steps:

# Check logs from the PREVIOUS (crashed) container
kubectl logs <pod-name> --previous

# Check the exit code in the pod description
kubectl describe pod <pod-name>
# Look for "Last State: Terminated" and "Exit Code"

# If the container crashes too fast to see logs, override the command
kubectl run debug --image=<same-image> --command -- sleep 3600
kubectl exec -it debug -- /bin/sh
# Now manually run the original entrypoint to see the error

Fixes:

  • Fix the application code or configuration.
  • Verify all required environment variables are set.
  • Check that ConfigMaps and Secrets exist and are mounted at the correct paths.
  • Increase liveness probe initialDelaySeconds if the app needs time to start.
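If the crash loop is probe-driven, the last fix above can look like the following. The health path, port, and timings are illustrative assumptions; tune them to your application's actual startup time:

```yaml
containers:
  - name: app
    image: my-app:1.0          # placeholder image
    livenessProbe:
      httpGet:
        path: /healthz         # assumed health endpoint
        port: 8080
      initialDelaySeconds: 30  # give the app time to boot before the first probe
      periodSeconds: 10
      failureThreshold: 3      # 3 consecutive failures trigger a restart
```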

3. ImagePullBackOff / ErrImagePull

Kubernetes cannot pull the container image from the registry.

Common causes:

  • Typo in the image name or tag.
  • The image tag does not exist (e.g., latest was replaced by a specific tag).
  • Private registry requires authentication and imagePullSecrets is not configured.
  • Registry rate limiting (common with Docker Hub's anonymous pull limit).
  • Network connectivity issues between the node and the registry.

Debugging steps:

# Check the exact error message
kubectl describe pod <pod-name>
# Look for "Failed to pull image" in events

# Verify the image exists (from your local machine)
docker pull <image-name>:<tag>

# Check if imagePullSecrets is configured
kubectl get pod <pod-name> -o jsonpath='{.spec.imagePullSecrets}'

# Verify the secret exists and is in the correct namespace
kubectl get secrets

Fixes:

  • Correct the image name and tag.
  • Create and attach an imagePullSecret for private registries.
  • Switch to a registry mirror if rate limited.
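For a private registry, the credential secret is typically created with `kubectl create secret docker-registry` and then referenced from the pod spec. A minimal sketch; the secret name and image path below are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: private-app
spec:
  imagePullSecrets:
    - name: regcred    # must exist in the same namespace as the pod
  containers:
    - name: app
      image: registry.example.com/team/app:1.4.2  # hypothetical private image
```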

4. Error

The container exited with a non-zero exit code and will not be restarted (typically because restartPolicy is Never).

Debugging steps:

kubectl logs <pod-name>
kubectl describe pod <pod-name>
# Check "Exit Code" in the terminated state

5. Terminating (Stuck)

The pod has been asked to terminate but is not shutting down.

Common causes:

  • The application is not handling SIGTERM signals.
  • A finalizer is blocking deletion.
  • The node is unreachable and the kubelet cannot confirm deletion.

Debugging steps:

# Check if the pod has finalizers
kubectl get pod <pod-name> -o jsonpath='{.metadata.finalizers}'

# Force delete if the node is gone (use with caution)
kubectl delete pod <pod-name> --grace-period=0 --force

6. Unknown

The pod status cannot be determined, usually because the node hosting the pod is unreachable.

Debugging steps:

# Check node status
kubectl get nodes
kubectl describe node <node-name>

Understanding Exit Codes

Exit codes reveal exactly how a container died:

| Exit Code | Meaning | Common Cause |
|-----------|---------|--------------|
| 0 | Success | Normal termination (expected for Jobs) |
| 1 | Generic error | Unhandled exception, assertion failure |
| 2 | Misuse of shell command | Invalid arguments to the entrypoint |
| 126 | Permission denied | Entrypoint binary is not executable |
| 127 | Command not found | Wrong entrypoint path, missing binary |
| 128+N | Killed by signal N | Container received a signal |
| 137 | SIGKILL (128+9) | OOMKilled -- exceeded memory limit, or kubelet killed the container |
| 139 | SIGSEGV (128+11) | Segmentation fault in native code |
| 143 | SIGTERM (128+15) | Graceful shutdown requested (normal during rolling updates) |
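The 128+N convention can be decoded directly in a shell. This small sketch recovers the signal number and name from exit code 137:

```shell
#!/bin/sh
code=137
sig=$((code - 128))   # signal number: 9
echo "signal $sig"
kill -l "$sig"        # resolves the number to a name: KILL
```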

Diagnosing OOMKilled (Exit Code 137)

This is one of the most common and misunderstood issues:

# Confirm OOMKill
kubectl describe pod <pod-name>
# Look for: "Last State: Terminated, Reason: OOMKilled"

# Check current memory usage (requires metrics-server)
kubectl top pod <pod-name>

# Check the memory limit set on the container
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources.limits.memory}'

Fixes:

  • Increase the container's memory limit.
  • Fix memory leaks in the application.
  • For Java apps: ensure -Xmx is set to a value below the container limit (leave room for non-heap memory).
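A resources stanza illustrating the first and third fixes. The numbers are examples, not recommendations, and `JAVA_OPTS` is a common but app-specific convention:

```yaml
containers:
  - name: app
    image: my-java-app:2.1   # placeholder image
    resources:
      requests:
        memory: "512Mi"      # used for scheduling decisions
      limits:
        memory: "1Gi"        # exceeding this triggers OOMKill (exit 137)
    env:
      - name: JAVA_OPTS
        value: "-Xmx768m"    # heap below the 1Gi limit, leaving room for non-heap memory
```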

Node-Level Troubleshooting

Node Not Ready

When a node enters NotReady state, all pods on it become Unknown or are evicted.

# Check node status
kubectl get nodes
kubectl describe node <node-name>

# Look for conditions:
# Ready=False
# MemoryPressure=True
# DiskPressure=True
# PIDPressure=True

# Check kubelet logs on the node (SSH required)
journalctl -u kubelet -f

Common causes:

  • Kubelet process has crashed or is unresponsive.
  • Node is out of disk space.
  • Node is out of memory (kernel OOM killer is active).
  • Network partition between the node and the control plane.
  • Container runtime (containerd, CRI-O) is unresponsive.

Node Resource Pressure

Kubernetes monitors node resources and applies conditions when thresholds are crossed:

| Condition | Trigger | Effect |
|-----------|---------|--------|
| MemoryPressure | Available memory below threshold | Node is tainted, pods may be evicted |
| DiskPressure | Available disk below threshold | Node is tainted, image garbage collection triggered |
| PIDPressure | Available PIDs below threshold | Node is tainted, no new pods scheduled |

# View node conditions
kubectl describe node <node-name> | grep -A 5 "Conditions"

# Check actual resource usage
kubectl top node <node-name>

Service Connectivity Debugging

When pods are running but your application is unreachable, the problem is usually in the networking layer.

Step 1: Verify the Service has Endpoints

# Check that the Service exists and has endpoints
kubectl get svc <service-name>
kubectl get endpoints <service-name>

# If endpoints are empty, the label selector does not match any pods
kubectl get pods --show-labels
# Compare pod labels with the Service's selector
kubectl get svc <service-name> -o jsonpath='{.spec.selector}'

A Service with zero endpoints is the most common networking issue. It almost always means the Service selector labels do not match the pod labels.
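In the manifests, the mismatch looks like this: for endpoints to populate, the Service's spec.selector must match the pod template labels exactly. Names and ports below are placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web          # must equal the pod labels below
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web      # a mismatch here (e.g. app: webapp) yields zero endpoints
    spec:
      containers:
        - name: app
          image: nginx:1.25
          ports:
            - containerPort: 8080
```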

Step 2: Test DNS Resolution

# Run a debug pod with DNS tools
kubectl run dns-test --image=busybox:1.36 --rm -it -- nslookup <service-name>

# Full DNS name format:
# <service-name>.<namespace>.svc.cluster.local
kubectl run dns-test --image=busybox:1.36 --rm -it -- \
nslookup <service-name>.<namespace>.svc.cluster.local

# Check if CoreDNS pods are running
kubectl get pods -n kube-system -l k8s-app=kube-dns

Step 3: Test Connectivity from Inside the Cluster

# Run a debug pod and try to reach the service
kubectl run curl-test --image=curlimages/curl --rm -it -- \
curl -v http://<service-name>:<port>/health

# If the service responds from inside but not outside,
# check Ingress or LoadBalancer configuration

Step 4: Check kube-proxy

# Verify kube-proxy is running on all nodes
kubectl get pods -n kube-system -l k8s-app=kube-proxy

# Check kube-proxy logs for errors
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=50

# Verify iptables rules exist (SSH to node)
iptables -t nat -L KUBE-SERVICES | grep <service-name>

Networking Issues Quick Reference

| Symptom | Likely Cause | Check |
|---------|--------------|-------|
| Cannot reach Service by name | DNS issue | nslookup from debug pod |
| DNS works but connection times out | No matching endpoints or NetworkPolicy blocking | kubectl get endpoints |
| Works from same node, not others | kube-proxy issue or CNI plugin problem | kube-proxy logs, CNI pods |
| Intermittent failures | Unhealthy pod in endpoint list | Check readiness probes |
| External traffic not reaching pods | Ingress/LB misconfiguration | Ingress controller logs |

Emergency Procedures

Pod is Consuming All Node Resources

# Identify the resource hog
kubectl top pods --sort-by=memory
kubectl top pods --sort-by=cpu

# Delete the pod immediately
kubectl delete pod <pod-name> --grace-period=0 --force

# If the pod is managed by a Deployment, also scale down
kubectl scale deployment <deploy-name> --replicas=0

Node is Unresponsive

# Cordon the node to prevent new scheduling
kubectl cordon <node-name>

# Drain the node (gracefully move pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# If drain hangs, force it
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force

Cluster-Wide DNS Failure

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100

# Restart CoreDNS
kubectl rollout restart deployment coredns -n kube-system

# Temporary workaround: use IP addresses instead of DNS names
kubectl get svc <service-name> -o jsonpath='{.spec.clusterIP}'

Namespace Stuck in Terminating

# Check for resources with finalizers
kubectl api-resources --verbs=list --namespaced -o name | \
xargs -n 1 kubectl get --show-kind --ignore-not-found -n <namespace>

# Remove finalizers from stuck resources (use with caution)
kubectl get namespace <namespace> -o json | \
jq '.spec.finalizers = []' | \
kubectl replace --raw "/api/v1/namespaces/<namespace>/finalize" -f -

Common Pitfalls

  1. Not checking events: kubectl describe output includes an Events section at the bottom. This is where the scheduler, kubelet, and other components explain exactly what went wrong. Always read events first.

  2. Forgetting --previous flag: When a pod is in CrashLoopBackOff, kubectl logs shows the current (often empty) container. Use --previous to see logs from the container that just crashed.

  3. Debugging the wrong container: In multi-container pods, each container has its own logs. Use -c <container-name> to specify which container's logs to view.

  4. Confusing resource requests with limits: Pods are scheduled based on resource requests but can be OOMKilled when actual usage exceeds limits. A pod with a 256Mi request and 512Mi limit will schedule on a node with 256Mi available but might be killed if it uses 513Mi.

  5. Ignoring init containers: Init containers run before the main container. If an init container fails, the pod stays in Init:Error or Init:CrashLoopBackOff. Check init container logs with kubectl logs <pod> -c <init-container-name>.

  6. Not checking the right namespace: The most common "I can't find my pod" issue is being in the wrong namespace. Always use -n <namespace> or check with kubectl get pods -A.

Best Practices

  1. Add readiness and liveness probes: These give Kubernetes the information it needs to automatically detect and recover from failures. Without probes, Kubernetes assumes a running container is healthy.

  2. Set resource requests and limits: Without these, pods can consume unbounded resources and starve other workloads. Resource requests also give the scheduler information for better placement decisions.

  3. Use structured logging: JSON-formatted logs are easier to search and aggregate. Include request IDs, timestamps, and severity levels.

  4. Keep a debug pod template handy: A pod with curl, dig, nslookup, and other network tools saves time when debugging connectivity issues.

  5. Monitor cluster events: Set up alerting on recurring warning events. Patterns like repeated OOMKills or ImagePullBackOff indicate systemic issues.

  6. Label everything consistently: Consistent labels make it easier to filter pods, match Services to pods, and understand which team owns which workload.
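For item 4, one possible template. The image choice (nicolaka/netshoot, a popular network-tools image) is an assumption; any image bundling curl, dig, and nslookup works:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: debug-toolbox
spec:
  containers:
    - name: tools
      image: nicolaka/netshoot:latest  # bundles curl, dig, nslookup, tcpdump, etc.
      command: ["sleep", "infinity"]   # keep the pod alive for interactive use
  restartPolicy: Never
```

Then attach with kubectl exec -it debug-toolbox -- bash and run your connectivity checks from inside the cluster network.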

What's Next?

  • Health Checks: Learn how to configure liveness, readiness, and startup probes to give Kubernetes better signals about your application's health.
  • Resources: Understand how resource requests and limits prevent OOMKills and scheduling failures.
  • Observability: Set up monitoring and alerting to catch issues before they become incidents.
  • OpenTelemetry: Implement distributed tracing to debug complex request flows across microservices.
  • Service Discovery: Understand how Kubernetes DNS and Services work under the hood.