Probes: Health Checks
- Liveness: Checks if an application is alive. If the liveness probe fails, the kubelet restarts the container. Use this to recover from deadlocks and unrecoverable internal states -- but never to check external dependencies.
- Readiness: Checks if an application can serve traffic. If the readiness probe fails, the pod's IP is removed from Service endpoints so it stops receiving requests. The container is not restarted.
- Startup: Protects slow-starting containers from being killed by liveness probes before they finish initializing. All other probes are disabled until the startup probe succeeds.
- Probe Mechanisms: Kubernetes supports four mechanisms -- HTTP GET, TCP Socket, gRPC, and exec command -- each suited to different application types.
- Pod Disruption Budgets (PDBs): Guarantee that a minimum number of pods remain available during voluntary disruptions like node drains and cluster upgrades.
Kubernetes needs to know the state of your application to manage it effectively. It does this through Probes.
By default, Kubernetes only checks if the container's main process is running. If your app is deadlocked (but the process is still running) or returning HTTP 500 errors on every request, Kubernetes considers it healthy. Probes solve this problem by giving the kubelet explicit instructions on how to verify that your application is truly functioning.
1. Liveness Probe
- Question it answers: "Is the container still alive and functioning?"
- Action on failure: The kubelet restarts the container (subject to the pod's restartPolicy).
- Use cases: Deadlocks, corrupted in-memory state, infinite loops, memory leaks that render the application unresponsive.
The liveness probe is your safety net for situations where the process is technically running but the application has entered a broken state from which it cannot recover on its own. A restart is the only fix.
Critical rule: Never fail a liveness probe because an external dependency (database, cache, downstream API) is unavailable. If the database is down and your liveness probe fails, Kubernetes restarts your pod. The new pod comes up, the database is still down, the probe fails again, and you enter a crash loop -- making a bad situation worse.
2. Readiness Probe
- Question it answers: "Is the container ready to accept traffic?"
- Action on failure: The pod's IP is removed from all Service endpoints. Traffic stops flowing to it. The container is not restarted.
- Use cases: Application is still loading data into memory, warming caches, waiting for a dependent service to become available, or is temporarily overloaded.
The readiness probe is the right place to check dependencies. If your application needs a database connection to serve requests, the readiness probe should verify that connection. When the database recovers, the probe passes again, and the pod is added back to the Service endpoints automatically.
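As a sketch, a readiness probe that delegates the dependency check to the application might look like the following (it assumes your app exposes a /ready endpoint that returns a non-2xx status while its database is unreachable):

```yaml
readinessProbe:
  httpGet:
    path: /ready      # hypothetical endpoint that verifies the DB connection
    port: 8080
  periodSeconds: 5
  failureThreshold: 2
```

Kubernetes only sees pass or fail; the dependency logic lives entirely inside the /ready handler.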
3. Startup Probe
- Question it answers: "Has the container finished its initialization?"
- Action on failure: If the startup probe has not succeeded after failureThreshold * periodSeconds, the kubelet kills the container and it is subject to the pod's restartPolicy.
- Effect on other probes: While the startup probe is running, liveness and readiness probes are disabled.
- Use cases: Legacy applications or Java apps that may take 60-120 seconds (or more) to initialize. Without a startup probe, you would need to set initialDelaySeconds on the liveness probe to an artificially high value, which delays detection of genuine failures after the application is running.
4. Probe Mechanisms
Kubernetes supports four ways to check your application's health. You choose one mechanism per probe.
HTTP GET
The kubelet sends an HTTP GET request to a specified path and port. Any response code between 200 and 399 is considered a success. Anything else is a failure.
livenessProbe:
httpGet:
path: /healthz # The endpoint your app exposes
port: 8080 # Container port (not Service port)
httpHeaders: # Optional custom headers
- name: X-Custom-Header
value: HealthCheck
initialDelaySeconds: 5
periodSeconds: 10
timeoutSeconds: 3
failureThreshold: 3
successThreshold: 1
This is the most common mechanism for web applications. Your application should expose a dedicated health endpoint that performs internal checks (e.g., can it allocate memory, is the event loop responsive).
TCP Socket
The kubelet attempts to open a TCP connection to the specified port. If the connection succeeds, the probe passes. If it fails or times out, the probe fails.
readinessProbe:
tcpSocket:
port: 3306 # Useful for databases or non-HTTP services
initialDelaySeconds: 10
periodSeconds: 5
Use this for services that do not speak HTTP, such as databases, message brokers, or custom TCP servers. Note that a successful TCP connection only confirms the port is open -- it does not verify the application is processing requests correctly.
gRPC
Available since Kubernetes 1.27 as a stable feature. The kubelet performs a gRPC health check using the standard gRPC Health Checking Protocol.
livenessProbe:
grpc:
port: 50051 # Your gRPC server port
service: "" # Empty string checks the overall server health
initialDelaySeconds: 10
periodSeconds: 10
Your application must implement the grpc.health.v1.Health service. An empty service field checks the overall server status. You can also specify a named service to check a specific component.
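For example, a probe targeting a specific named service rather than overall server health could be sketched as follows (the service name here is a placeholder for whatever your server registers with the health service):

```yaml
livenessProbe:
  grpc:
    port: 50051
    service: "mypackage.MyService"  # hypothetical registered health-service name
  periodSeconds: 10
```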
Exec Command
The kubelet executes a command inside the container. If the command exits with code 0, the probe succeeds. Any other exit code is a failure.
livenessProbe:
exec:
command:
- cat
- /tmp/healthy # A file your app creates when healthy
initialDelaySeconds: 5
periodSeconds: 10
This is useful for applications that do not expose a network endpoint or for checking conditions that are easier to verify with a script. Be aware that exec probes create a new process in the container on every check, which can add CPU overhead at high check frequencies.
5. Configuration Parameters
Every probe type accepts the same timing and threshold parameters:
| Parameter | Default | Description |
|---|---|---|
| initialDelaySeconds | 0 | Seconds to wait after container start before the first probe |
| periodSeconds | 10 | How often (in seconds) the probe is performed |
| timeoutSeconds | 1 | Seconds after which the probe times out |
| failureThreshold | 3 | Number of consecutive failures before taking action |
| successThreshold | 1 | Number of consecutive successes to mark the probe as passing (must be 1 for liveness and startup probes) |
How Timing Works in Practice
Consider a startup probe configured as:
startupProbe:
httpGet:
path: /healthz
port: 8080
periodSeconds: 5
failureThreshold: 30
This gives your application 5 * 30 = 150 seconds to start up. The kubelet checks every 5 seconds, and allows up to 30 consecutive failures before killing the container. Once the startup probe passes, liveness and readiness probes take over.
6. Complete Real-World Example
Here is a production-grade pod spec for a web application with all three probes configured:
apiVersion: v1
kind: Pod
metadata:
name: web-app
labels:
app: web-app
spec:
containers:
- name: web-app
image: myregistry/web-app:v2.4.1
ports:
- containerPort: 8080
# Startup probe: give the app up to 120 seconds to initialize
# (periodSeconds=5 * failureThreshold=24 = 120s)
startupProbe:
httpGet:
path: /healthz
port: 8080
periodSeconds: 5
failureThreshold: 24
# Liveness probe: restart if 3 consecutive checks fail
# Only active after startup probe succeeds
livenessProbe:
httpGet:
path: /healthz
port: 8080
periodSeconds: 10
timeoutSeconds: 3
failureThreshold: 3
# Readiness probe: remove from traffic if app reports not ready
# Checks a /ready endpoint that verifies DB connectivity
readinessProbe:
httpGet:
path: /ready
port: 8080
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
successThreshold: 2 # Require 2 passes before adding back
resources:
requests:
cpu: "250m"
memory: "256Mi"
limits:
cpu: "500m"
memory: "512Mi"
Notice the separation of concerns: /healthz tests whether the application process itself is healthy (liveness), while /ready tests whether it can serve user traffic, including whether downstream dependencies are reachable (readiness).
7. Pod Disruption Budgets (PDBs)
Probes protect individual pods, but Pod Disruption Budgets protect your overall service availability during voluntary disruptions such as node drains and cluster upgrades. (Deployment rolling updates are governed by the rollout strategy's own maxUnavailable setting, not by PDBs.)
A PDB declares the minimum number of pods that must remain available at any time:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: web-app-pdb
spec:
# Option A: minimum number of pods that must stay running
minAvailable: 2
# Option B (alternative): maximum number that can be unavailable
# maxUnavailable: 1
selector:
matchLabels:
app: web-app
When an administrator runs kubectl drain node-3, the eviction API checks all PDBs. If evicting a pod from node-3 would drop the number of available replicas below minAvailable, the eviction is refused and the drain retries until the budget allows it -- for example, once a replacement pod is healthy on another node.
Key points about PDBs:
- They only apply to voluntary disruptions -- evictions through the Eviction API, such as node drains. They do not prevent involuntary disruptions (hardware failures, OOM kills).
- Use minAvailable when you know your minimum required capacity. Use maxUnavailable when you want to express it as a percentage or count of pods that can be down simultaneously.
- PDBs with maxUnavailable: 0 or minAvailable equal to the total replica count will block all drains. Avoid this in production -- it prevents cluster maintenance.
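As a sketch of the percentage form, a PDB that tolerates losing at most a quarter of the matching pods at once might look like this (the name is illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb-pct
spec:
  maxUnavailable: 25%   # percentage of matching pods that may be evicted at once
  selector:
    matchLabels:
      app: web-app
```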
Common Pitfalls
1. Liveness probe checks external dependencies. If your liveness probe calls the database and the database goes down, every pod restarts in a cascade. The database comes back, all pods start simultaneously and overwhelm it. Use readiness probes for dependency checks.
2. Liveness probe is too aggressive.
Setting periodSeconds: 1 with failureThreshold: 1 means one slow response triggers a restart. Applications under load may occasionally have slow health check responses. Give your application room to breathe with failureThreshold: 3 and a reasonable timeoutSeconds.
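A more forgiving configuration, sketched below with illustrative values, tolerates transient slowness before restarting:

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10      # check every 10s, not every second
  timeoutSeconds: 3      # allow a slow response under load
  failureThreshold: 3    # restart only after ~30s of sustained failure
```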
3. Missing startup probe for slow-starting applications.
Without a startup probe, you must inflate initialDelaySeconds on the liveness probe. This means after a crash, the kubelet waits that entire delay before checking again, increasing your recovery time.
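The contrast can be sketched as follows (timings illustrative): the startup probe absorbs the slow initialization, so the liveness probe can stay aggressive afterwards:

```yaml
# Instead of a livenessProbe with initialDelaySeconds: 120
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 24   # up to 120s to initialize
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10      # fast failure detection once started
```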
4. Readiness and liveness probes hitting the same endpoint with the same logic. If both probes are identical, a failing readiness check also triggers a restart via the liveness probe. Keep them separate: liveness checks the process, readiness checks the ability to serve.
5. Heavyweight health endpoints. If your health endpoint queries the database or performs significant computation, it can itself become a source of load. Health endpoints should be lightweight -- return quickly and check only essential state.
Best Practices
- Always use readiness probes for any pod that receives traffic through a Service. Without readiness probes, pods receive traffic the moment the container starts, before your application is ready.
- Use startup probes instead of large initialDelaySeconds values. Startup probes give your application as long as it needs to start while still allowing fast failure detection after startup.
- Keep health endpoints fast. Liveness endpoints should respond in under 200ms. If your health check takes multiple seconds, increase timeoutSeconds accordingly and investigate why it is slow.
- Use different endpoints for liveness and readiness. A /healthz endpoint that returns 200 if the process is alive, and a /ready endpoint that also verifies database connections and downstream dependencies.
- Set PDBs on all production workloads. Even if you have 10 replicas, a node drain can evict multiple pods on the same node simultaneously. PDBs ensure controlled disruption.
- Monitor probe failures. Frequent readiness probe failures often indicate your application is under-provisioned or has dependency problems. Track these metrics in your monitoring stack.
- Test your probes. Run kubectl describe pod <name> and look at the Events section. Failed probes appear as Unhealthy events with details about which probe failed and why.
What's Next?
- Resources & HPA -- Learn how to set resource requests and limits, which directly affect pod scheduling and QoS classes that determine eviction order.
- Observability -- Monitor probe failures and application health metrics with Prometheus and Grafana.
- Scheduling & Affinity -- Control where pods are placed to complement your PDB strategy.
- Troubleshooting -- Diagnose why pods are restarting or stuck in CrashLoopBackOff.