Probes: Health Checks
- Liveness: Checks if an application is alive. If the liveness probe fails, the kubelet restarts the container. Use this to recover from deadlocks and unrecoverable internal states -- but never to check external dependencies.
- Readiness: Checks if an application can serve traffic. If the readiness probe fails, the pod's IP is removed from Service endpoints so it stops receiving requests. The container is not restarted.
- Startup: Protects slow-starting containers from being killed by liveness probes before they finish initializing. All other probes are disabled until the startup probe succeeds.
- Probe Mechanisms: Kubernetes supports four mechanisms -- HTTP GET, TCP Socket, gRPC, and exec command -- each suited to different application types.
- Pod Disruption Budgets (PDBs): Guarantee that a minimum number of pods remain available during voluntary disruptions like node drains and cluster upgrades.
Kubernetes needs to know the state of your application to manage it effectively. It does this through Probes.
By default, Kubernetes only checks if the container's main process is running. If your app is deadlocked (but the process is still running) or returning HTTP 500 errors on every request, Kubernetes considers it healthy. Probes solve this problem by giving the kubelet explicit instructions on how to verify that your application is truly functioning.
1. Liveness Probe
- Question it answers: "Is the container still alive and functioning?"
- Action on failure: The kubelet restarts the container (subject to the pod's restartPolicy).
- Use cases: Deadlocks, corrupted in-memory state, infinite loops, memory leaks that render the application unresponsive.
The liveness probe is your safety net for situations where the process is technically running but the application has entered a broken state from which it cannot recover on its own. A restart is the only fix.
Critical rule: Never fail a liveness probe because an external dependency (database, cache, downstream API) is unavailable. If the database is down and your liveness probe fails, Kubernetes restarts your pod. The new pod comes up, the database is still down, the probe fails again, and you enter a crash loop -- making a bad situation worse.
2. Readiness Probe
- Question it answers: "Is the container ready to accept traffic?"
- Action on failure: The pod's IP is removed from all Service endpoints. Traffic stops flowing to it. The container is not restarted.
- Use cases: Application is still loading data into memory, warming caches, waiting for a dependent service to become available, or is temporarily overloaded.
The readiness probe is the right place to check dependencies. If your application needs a database connection to serve requests, the readiness probe should verify that connection. When the database recovers, the probe passes again, and the pod is added back to the Service endpoints automatically.
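As a sketch, a readiness probe that delegates the dependency check to the application might look like the following (it assumes your app exposes a /ready endpoint that returns a non-2xx status while its database is unreachable):

```yaml
readinessProbe:
  httpGet:
    path: /ready      # hypothetical endpoint that verifies the DB connection
    port: 8080
  periodSeconds: 5
  failureThreshold: 2
```

Kubernetes only sees pass or fail; the dependency logic lives entirely inside the /ready handler.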
3. Startup Probe
- Question it answers: "Has the container finished its initialization?"
- Action on failure: If the startup probe has not succeeded after failureThreshold * periodSeconds, the kubelet kills the container and it is subject to the pod's restartPolicy.
- Effect on other probes: While the startup probe is running, liveness and readiness probes are disabled.
- Use cases: Legacy applications or Java apps that may take 60-120 seconds (or more) to initialize. Without a startup probe, you would need to set initialDelaySeconds on the liveness probe to an artificially high value, which delays detection of genuine failures after the application is running.
4. Probe Mechanisms
Kubernetes supports four ways to check your application's health. You choose one mechanism per probe.
HTTP GET
The kubelet sends an HTTP GET request to a specified path and port. Any response code between 200 and 399 is considered a success. Anything else is a failure.
livenessProbe:
httpGet:
path: /healthz # The endpoint your app exposes
port: 8080 # Container port (not Service port)
httpHeaders: # Optional custom headers
- name: X-Custom-Header
value: HealthCheck
initialDelaySeconds: 5
periodSeconds: 10
timeoutSeconds: 3
failureThreshold: 3
successThreshold: 1
This is the most common mechanism for web applications. Your application should expose a dedicated health endpoint that performs internal checks (e.g., can it allocate memory, is the event loop responsive).
TCP Socket
The kubelet attempts to open a TCP connection to the specified port. If the connection succeeds, the probe passes. If it fails or times out, the probe fails.
readinessProbe:
tcpSocket:
port: 3306 # Useful for databases or non-HTTP services
initialDelaySeconds: 10
periodSeconds: 5
Use this for services that do not speak HTTP, such as databases, message brokers, or custom TCP servers. Note that a successful TCP connection only confirms the port is open -- it does not verify the application is processing requests correctly.
gRPC
Available since Kubernetes 1.27 as a stable feature. The kubelet performs a gRPC health check using the standard gRPC Health Checking Protocol.
livenessProbe:
grpc:
port: 50051 # Your gRPC server port
service: "" # Empty string checks the overall server health
initialDelaySeconds: 10
periodSeconds: 10
Your application must implement the grpc.health.v1.Health service. An empty service field checks the overall server status. You can also specify a named service to check a specific component.
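For example, a probe targeting a specific named service rather than overall server health could be sketched as follows (the service name here is a placeholder for whatever your server registers with the health service):

```yaml
livenessProbe:
  grpc:
    port: 50051
    service: "mypackage.MyService"  # hypothetical registered health-service name
  periodSeconds: 10
```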
Exec Command
The kubelet executes a command inside the container. If the command exits with code 0, the probe succeeds. Any other exit code is a failure.
livenessProbe:
exec:
command:
- cat
- /tmp/healthy # A file your app creates when healthy
initialDelaySeconds: 5
periodSeconds: 10
This is useful for applications that do not expose a network endpoint or for checking conditions that are easier to verify with a script. Be aware that exec probes create a new process in the container on every check, which can add CPU overhead at high check frequencies.
5. Configuration Parameters
Every probe type accepts the same timing and threshold parameters:
| Parameter | Default | Description |
|---|---|---|
| initialDelaySeconds | 0 | Seconds to wait after container start before the first probe |
| periodSeconds | 10 | How often (in seconds) the probe is performed |
| timeoutSeconds | 1 | Seconds after which the probe times out |
| failureThreshold | 3 | Number of consecutive failures before taking action |
| successThreshold | 1 | Number of consecutive successes to mark the probe as passing (must be 1 for liveness and startup probes) |
How Timing Works in Practice
Consider a startup probe configured as:
startupProbe:
httpGet:
path: /healthz
port: 8080
periodSeconds: 5
failureThreshold: 30
This gives your application 5 * 30 = 150 seconds to start up. The kubelet checks every 5 seconds, and allows up to 30 consecutive failures before killing the container. Once the startup probe passes, liveness and readiness probes take over.
6. Complete Real-World Example
Here is a production-grade pod spec for a web application with all three probes configured:
apiVersion: v1
kind: Pod
metadata:
name: web-app
labels:
app: web-app
spec:
containers:
- name: web-app
image: myregistry/web-app:v2.4.1
ports:
- containerPort: 8080
# Startup probe: give the app up to 120 seconds to initialize
# (periodSeconds=5 * failureThreshold=24 = 120s)
startupProbe:
httpGet:
path: /healthz
port: 8080
periodSeconds: 5
failureThreshold: 24
# Liveness probe: restart if 3 consecutive checks fail
# Only active after startup probe succeeds
livenessProbe:
httpGet:
path: /healthz
port: 8080
periodSeconds: 10
timeoutSeconds: 3
failureThreshold: 3
# Readiness probe: remove from traffic if app reports not ready
# Checks a /ready endpoint that verifies DB connectivity
readinessProbe:
httpGet:
path: /ready
port: 8080
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
successThreshold: 2 # Require 2 passes before adding back
resources:
requests:
cpu: "250m"
memory: "256Mi"
limits:
cpu: "500m"
memory: "512Mi"
Notice the separation of concerns: /healthz tests whether the application process itself is healthy (liveness), while /ready tests whether it can serve user traffic, including whether downstream dependencies are reachable (readiness).
7. Pod Disruption Budgets (PDBs)
Probes protect individual pods, but Pod Disruption Budgets protect your overall service availability during voluntary disruptions such as node drains and cluster upgrades. (Deployment rolling updates are governed by the rollout strategy's own maxUnavailable setting, not by PDBs.)
A PDB declares the minimum number of pods that must remain available at any time:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: web-app-pdb
spec:
# Option A: minimum number of pods that must stay running
minAvailable: 2
# Option B (alternative): maximum number that can be unavailable
# maxUnavailable: 1
selector:
matchLabels:
app: web-app
When an administrator runs kubectl drain node-3, the eviction API checks all PDBs. If evicting a pod from node-3 would drop the number of available replicas below minAvailable, the eviction is refused and the drain retries until the budget allows it -- for example, once a replacement pod is healthy on another node.
Key points about PDBs:
- They only apply to voluntary disruptions -- evictions through the Eviction API, such as node drains. They do not prevent involuntary disruptions (hardware failures, OOM kills).
- Use minAvailable when you know your minimum required capacity. Use maxUnavailable when you want to express it as a percentage or count of pods that can be down simultaneously.
- PDBs with maxUnavailable: 0 or minAvailable equal to the total replica count will block all drains. Avoid this in production -- it prevents cluster maintenance.
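As a sketch of the percentage form, a PDB that tolerates losing at most a quarter of the matching pods at once might look like this (the name is illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb-pct
spec:
  maxUnavailable: 25%   # percentage of matching pods that may be evicted at once
  selector:
    matchLabels:
      app: web-app
```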
Common Pitfalls
1. Liveness probe checks external dependencies. If your liveness probe calls the database and the database goes down, every pod restarts in a cascade. The database comes back, all pods start simultaneously and overwhelm it. Use readiness probes for dependency checks.
2. Liveness probe is too aggressive.
Setting periodSeconds: 1 with failureThreshold: 1 means one slow response triggers a restart. Applications under load may occasionally have slow health check responses. Give your application room to breathe with failureThreshold: 3 and a reasonable timeoutSeconds.
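A more forgiving configuration, sketched below with illustrative values, tolerates transient slowness before restarting:

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10      # check every 10s, not every second
  timeoutSeconds: 3      # allow a slow response under load
  failureThreshold: 3    # restart only after ~30s of sustained failure
```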
3. Missing startup probe for slow-starting applications.
Without a startup probe, you must inflate initialDelaySeconds on the liveness probe. This means after a crash, the kubelet waits that entire delay before checking again, increasing your recovery time.
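The contrast can be sketched as follows (timings illustrative): the startup probe absorbs the slow initialization, so the liveness probe can stay aggressive afterwards:

```yaml
# Instead of a livenessProbe with initialDelaySeconds: 120
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 24   # up to 120s to initialize
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10      # fast failure detection once started
```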
4. Readiness and liveness probes hitting the same endpoint with the same logic. If both probes are identical, a failing readiness check also triggers a restart via the liveness probe. Keep them separate: liveness checks the process, readiness checks the ability to serve.
5. Heavyweight health endpoints. If your health endpoint queries the database or performs significant computation, it can itself become a source of load. Health endpoints should be lightweight -- return quickly and check only essential state.
Best Practices
- Always use readiness probes for any pod that receives traffic through a Service. Without readiness probes, pods receive traffic the moment the container starts, before your application is ready.
- Use startup probes instead of large initialDelaySeconds values. Startup probes give your application as long as it needs to start while still allowing fast failure detection after startup.
- Keep health endpoints fast. Liveness endpoints should respond in under 200ms. If your health check takes multiple seconds, increase timeoutSeconds accordingly and investigate why it is slow.
- Use different endpoints for liveness and readiness. A /healthz endpoint that returns 200 if the process is alive, and a /ready endpoint that also verifies database connections and downstream dependencies.
- Set PDBs on all production workloads. Even if you have 10 replicas, a node drain can evict multiple pods on the same node simultaneously. PDBs ensure controlled disruption.
- Monitor probe failures. Frequent readiness probe failures often indicate your application is under-provisioned or has dependency problems. Track these metrics in your monitoring stack.
- Test your probes. Run kubectl describe pod <name> and look at the Events section. Failed probes appear as Unhealthy events with details about which probe failed and why.
What's Next?
- Resources & HPA -- Learn how to set resource requests and limits, which directly affect pod scheduling and QoS classes that determine eviction order.
- Observability -- Monitor probe failures and application health metrics with Prometheus and Grafana.
- Scheduling & Affinity -- Control where pods are placed to complement your PDB strategy.
- Troubleshooting -- Diagnose why pods are restarting or stuck in CrashLoopBackOff.