Observability: Metrics, Logs & Traces
- Three Pillars: Observability relies on Metrics (numeric measurements over time), Logs (timestamped event records), and Traces (end-to-end request journeys across services). Each serves a different debugging need and they complement each other.
- Prometheus Pull Model: Prometheus scrapes numeric data from application /metrics endpoints at regular intervals. This pull-based approach means applications do not need to know where to send metrics -- they just expose them.
- Persistent Logging: Pod logs are ephemeral. When a pod is deleted, its logs are gone. A DaemonSet-based log collector (Fluentd, Fluent Bit, Promtail) ships logs to central storage (Loki, Elasticsearch) for retention and search.
- Distributed Tracing: Trace IDs propagated in HTTP headers enable tracking a single request as it flows through multiple microservices, pinpointing exactly where latency occurs.
- Alerting Philosophy: Alert on symptoms (user-visible impact) not causes. "Error rate above 1%" is actionable; "CPU above 80%" often is not.
- Kubernetes-Native Metrics: metrics-server provides resource usage for HPA and kubectl top. kube-state-metrics exposes the state of Kubernetes objects (deployment replicas, pod phases, node conditions).
"Observability" is more than just monitoring. It is having enough data to ask new questions about your system without deploying new code. Monitoring tells you when something is wrong. Observability helps you understand why.
In Kubernetes, we build observability around the "Three Pillars":
1. Metrics (Is it healthy?)
Numbers measured over time. Metrics are the most efficient form of telemetry -- they are cheap to collect, store, and query.
Examples: CPU usage, memory consumption, HTTP request count, error rate, request latency percentiles, queue depth.
Prometheus Architecture
Prometheus is the de facto standard for metrics in Kubernetes. It was the second project to graduate from the CNCF (after Kubernetes itself).
How the Pull Model Works:
- Your application exposes a /metrics endpoint that returns metrics in Prometheus text format.
- Prometheus "scrapes" (sends an HTTP GET to) that endpoint at a configurable interval (default: 15 seconds).
- Prometheus stores the data in its built-in time-series database, indexed by metric name and labels.
- You query the data using PromQL (Prometheus Query Language).
# Example /metrics endpoint output (Prometheus text format)
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET", path="/api/users", status="200"} 14523
http_requests_total{method="GET", path="/api/users", status="500"} 23
http_requests_total{method="POST", path="/api/users", status="201"} 892
# HELP http_request_duration_seconds HTTP request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.01"} 12000
http_request_duration_seconds_bucket{le="0.1"} 14000
http_request_duration_seconds_bucket{le="1.0"} 14500
http_request_duration_seconds_bucket{le="+Inf"} 14523
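The exposition format above is ordinary labeled plain text, so an application can emit it without any special tooling (in practice you would normally use an official Prometheus client library). A minimal hand-rolled sketch of rendering a counter family in that format:

```python
# Sketch: rendering the Prometheus text exposition format by hand.
# Real services should prefer an official Prometheus client library;
# this only shows that the format is plain, labeled text.

def render_counter(name, help_text, samples):
    """samples: list of (labels_dict, value) pairs for one metric family."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

output = render_counter(
    "http_requests_total",
    "Total number of HTTP requests",
    [({"method": "GET", "path": "/api/users", "status": "200"}, 14523),
     ({"method": "GET", "path": "/api/users", "status": "500"}, 23)],
)
```

A real /metrics handler would return this string as the HTTP response body on each scrape.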
Prometheus Service Discovery
In Kubernetes, Prometheus discovers scrape targets automatically. Using ServiceMonitor resources (from the Prometheus Operator) or annotation-based discovery, Prometheus finds all pods exposing metrics:
# ServiceMonitor tells Prometheus what to scrape
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: api-service        # Match Services with this label
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: metrics           # Named port from the Service
      interval: 15s           # Scrape every 15 seconds
      path: /metrics
Alternatively, you can use pod annotations for simpler setups:
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
PromQL Basics
PromQL lets you query and aggregate metrics. Here are essential queries for Kubernetes:
# Request rate over the last 5 minutes
sum(rate(http_requests_total[5m]))
# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
# 99th percentile latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod)
# Memory usage by pod
container_memory_working_set_bytes{namespace="production"}
# Pods in CrashLoopBackOff
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}
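The `histogram_quantile` function works by linear interpolation inside cumulative buckets. A simplified sketch of the idea (real PromQL operates on per-second bucket rates and handles edge cases more carefully), using the bucket counts from the earlier /metrics example:

```python
import math

def histogram_quantile(q, buckets):
    """Approximate a quantile from cumulative histogram buckets
    [(upper_bound, cumulative_count), ...] sorted by bound, ending at +Inf.
    Linearly interpolates inside the bucket containing the target rank."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # quantile falls beyond the last finite bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + fraction * (bound - prev_bound)
        prev_bound, prev_count = bound, count

# Buckets from the example above: 12000 requests under 10ms,
# 14000 under 100ms, 14500 under 1s, 14523 total.
buckets = [(0.01, 12000), (0.1, 14000), (1.0, 14500), (math.inf, 14523)]
p99 = histogram_quantile(0.99, buckets)  # ~0.78 seconds
```

This also shows why bucket boundaries matter: the p99 here is only an interpolated estimate within the wide 0.1s-1.0s bucket.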
Kubernetes-Native Metrics Sources
| Source | What It Provides | Used By |
|---|---|---|
| metrics-server | Current CPU and memory usage per pod/node | HPA, kubectl top |
| kube-state-metrics | State of Kubernetes objects (deployments, pods, nodes) | Prometheus alerting, dashboards |
| cAdvisor | Container-level resource metrics (built into kubelet) | Prometheus via kubelet endpoints |
| node-exporter | Node hardware and OS metrics (disk, network, CPU) | Infrastructure dashboards |
The Hidden Cost: Cardinality
"Cardinality" refers to the number of unique time series.
- Low Cardinality: http_requests_total{method="POST"} (finite methods: GET, POST, PUT, DELETE).
- High Cardinality: http_requests_total{user_id="u-12938"} (millions of users).
Danger: Prometheus stores a new time series for every unique combination of label values. Including user_id, request_id, or pod_ip as a label can explode your memory usage and crash Prometheus. Never include unbounded values in labels.
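The series count is the product of each label's distinct values, which is why a single unbounded label dominates everything else. A back-of-the-envelope sketch:

```python
from math import prod

def series_count(label_cardinalities):
    """Unique time series = product of distinct values per label."""
    return prod(label_cardinalities.values())

# Bounded labels stay manageable:
bounded = series_count({"method": 4, "path": 50, "status": 5})    # 1,000 series
# One unbounded label multiplies everything:
unbounded = series_count({"method": 4, "path": 50, "status": 5,
                          "user_id": 1_000_000})                  # 1,000,000,000 series
```

At roughly 1-2 KB of memory per active series, the second metric alone would need terabytes of RAM.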
Grafana Dashboards
Prometheus stores data but its built-in UI is minimal. Grafana provides rich dashboards for visualization.
The Kubernetes community maintains standard dashboards:
- Cluster overview: Node count, pod count, resource utilization, API server health.
- Namespace detail: Per-namespace CPU, memory, network usage.
- Pod detail: Per-container CPU, memory, network, restart count.
- Workload dashboards: Deployment replica status, rollout progress.
Install Grafana with pre-built dashboards via the kube-prometheus-stack Helm chart, which bundles Prometheus, Grafana, alertmanager, and common dashboards:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prom-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace
2. Logs (Why is it failing?)
Timestamped text records of events. Logs provide the narrative of what happened and why.
Examples: "Database connection failed: timeout after 5s", "NullPointerException at UserService.java:42", "Request processed in 20ms for user_id=1234".
The Ephemeral Log Problem
Kubernetes does not store logs permanently. When a pod is deleted or restarted, its logs are lost. kubectl logs only shows logs from the current (or previous) container instance, and only while the pod exists.
For production systems, you need a log collection pipeline.
Log Collection Patterns
Pattern 1: DaemonSet Collector (Recommended)
A log collector runs as a DaemonSet on every node. It reads log files from the node's filesystem (/var/log/containers/) and ships them to central storage.
# Simplified DaemonSet for log collection
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: log-collector
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:latest
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: containers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: containers
          hostPath:
            path: /var/lib/docker/containers
Pattern 2: Sidecar Collector
A logging sidecar runs alongside the application container in the same pod. The application writes logs to a shared volume, and the sidecar ships them.
Use sidecars when:
- Different applications need different log parsing or formatting.
- You need to enrich logs with application-specific metadata.
- The application writes logs to files rather than stdout.
The DaemonSet approach is preferred for most workloads because it consumes fewer resources (one collector per node vs. one per pod).
Centralized Logging Stacks
| Stack | Components | Best For |
|---|---|---|
| PLG | Promtail + Loki + Grafana | Cost-effective, lightweight, Grafana-native |
| ELK/EFK | Elasticsearch + Logstash/Fluentd + Kibana | Full-text search, complex queries, large scale |
| Fluent Bit + CloudWatch | Fluent Bit + AWS CloudWatch | AWS-native environments |
Loki is increasingly popular because it indexes only metadata labels (like namespace, pod, container) rather than full-text, making it much cheaper to operate than Elasticsearch. It integrates natively with Grafana, so you can correlate logs with metrics in one interface.
Structured Logging
Always emit logs as structured JSON rather than unstructured text. This enables filtering and querying in your logging backend:
{"timestamp": "2025-01-15T10:30:00Z", "level": "error", "service": "auth", "message": "database connection timeout", "duration_ms": 5000, "user_id": "u-1234", "trace_id": "abc123"}
vs. unstructured:
2025-01-15 10:30:00 ERROR auth - database connection timeout after 5000ms for user u-1234
Structured logs are parseable by machines. Unstructured logs require regex patterns that break when the format changes.
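With Python's stdlib logging module, structured JSON output is a small custom formatter away. A minimal sketch (the field names and the static "auth" service label are illustrative):

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname.lower(),
            "service": "auth",              # illustrative static field
            "message": record.getMessage(),
        }
        # Carry through structured extras such as duration_ms or trace_id
        for key in ("duration_ms", "user_id", "trace_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("auth")
logger.addHandler(handler)
logger.propagate = False
logger.setLevel(logging.INFO)

logger.error("database connection timeout",
             extra={"duration_ms": 5000, "user_id": "u-1234", "trace_id": "abc123"})
line = json.loads(stream.getvalue())
```

The extra= dict is how stdlib logging attaches structured fields to a record; most third-party structured loggers offer the same idea with less ceremony.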
3. Traces (Where is it slow?)
The journey of a single request through your system. In a microservices architecture, one user action might touch 10 services. If the overall request takes 3 seconds, tracing tells you which service contributed the most latency.
How Distributed Tracing Works
- The first service (the "entry point") generates a Trace ID and a Span (representing its own work).
- When it calls the next service, it passes the Trace ID in HTTP headers (e.g., traceparent: 00-abc123-def456-01, using the W3C Trace Context standard).
- Each downstream service creates its own Span under the same Trace ID.
- All spans are collected by a tracing backend and assembled into a single trace timeline.
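Context propagation boils down to minting IDs at the entry point and copying one header on every outbound call. A sketch of W3C traceparent handling (OpenTelemetry libraries do all of this for you):

```python
import re
import secrets

def new_traceparent():
    """Entry point: mint a fresh trace ID and root span ID.
    W3C Trace Context format: version-traceid-spanid-flags, lowercase hex."""
    trace_id = secrets.token_hex(16)   # 128-bit trace ID
    span_id = secrets.token_hex(8)     # 64-bit span ID
    return f"00-{trace_id}-{span_id}-01"

def child_traceparent(parent):
    """Downstream service: keep the trace ID, mint a new span ID."""
    version, trace_id, _parent_span, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

root = new_traceparent()     # generated at the entry point
child = child_traceparent(root)  # forwarded header for the next hop
```

Every hop shares the trace ID (so the backend can assemble one timeline) while getting its own span ID (so each hop's latency is attributed separately).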
Instrumentation
Applications need to be instrumented to create and propagate trace contexts. OpenTelemetry is the current standard -- it provides libraries for all major languages that handle trace creation, context propagation, and export:
# OpenTelemetry Collector as a DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:latest
          ports:
            - containerPort: 4317   # gRPC receiver (OTLP)
            - containerPort: 4318   # HTTP receiver (OTLP)
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://tempo.monitoring:4317"
Tracing Backends
| Tool | Maintained By | Storage | Best For |
|---|---|---|---|
| Jaeger | CNCF | Elasticsearch, Cassandra, Kafka | Mature, widely adopted, rich UI |
| Tempo | Grafana Labs | Object storage (S3, GCS) | Cost-effective, Grafana integration |
| Zipkin | OpenZipkin | MySQL, Cassandra, Elasticsearch | Lightweight, simple setup |
Sampling Strategies
Tracing 100% of requests is prohibitively expensive at scale.
- Head-Based Sampling: The first service decides (e.g., "Keep 1% of requests").
- Pros: Simple, low overhead.
- Cons: You might miss the interesting errors (the sampled 1% may happen to contain only successful requests).
- Tail-Based Sampling: Collect everything, wait for the request to finish, and then decide (e.g., "Keep all errors and traces > 2s latency").
- Pros: Guarantees you capture failures.
- Cons: Requires buffering all trace data in memory, which is complex and resource-intensive.
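Head-based sampling is usually implemented as a deterministic function of the trace ID, so every service in the chain makes the same keep/drop decision without any coordination. A sketch of one common approach:

```python
import secrets

def head_sample(trace_id_hex, rate):
    """Keep a trace iff the leading bits of its ID fall below the sample rate.
    Deterministic per trace ID: all hops agree without coordination."""
    bucket = int(trace_id_hex[:8], 16) / 0xFFFFFFFF
    return bucket < rate

trace_id = secrets.token_hex(16)
# Every hop computes the same answer for the same trace ID:
decision_service_a = head_sample(trace_id, 0.01)
decision_service_b = head_sample(trace_id, 0.01)

# Over many traces, roughly `rate` of them are kept:
kept = sum(head_sample(secrets.token_hex(16), 0.01) for _ in range(100_000))
```

Because the decision depends only on the trace ID, either every span of a trace is kept or none is -- which is exactly why head-based sampling cannot know in advance whether a trace will turn out to be an interesting error.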
Tempo is gaining adoption because it stores traces in cheap object storage and integrates directly with Grafana, allowing you to jump from a log line to the related trace with one click (via the trace_id field in structured logs).
4. Alerting Strategy
Collecting data is useless if no one looks at it. Alerting bridges the gap between data collection and human response.
Symptom-Based vs. Cause-Based Alerting
Alert on symptoms (user-facing impact), not causes (internal state):
| Symptom Alert (Good) | Cause Alert (Noisy) |
|---|---|
| Error rate > 1% for 5 minutes | Pod restarted |
| 99th percentile latency > 2s | CPU above 80% |
| Zero successful orders in 10 minutes | Disk usage above 70% |
Cause-based alerts generate noise. CPU at 85% might be perfectly normal during a batch job. But an error rate above 1% always means users are affected.
Prometheus Alerting Rules
# PrometheusRule resource (Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-alerts
  namespace: monitoring
spec:
  groups:
    - name: api-service
      rules:
        # Alert when error rate exceeds 1% for 5 minutes
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5..", namespace="production"}[5m])) /
            sum(rate(http_requests_total{namespace="production"}[5m])) > 0.01
          for: 5m   # Must be true for 5 minutes before firing
          labels:
            severity: critical
          annotations:
            summary: "High error rate in production"
            description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
        # Alert when pods are restarting repeatedly (likely CrashLoopBackOff)
        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total{namespace="production"}[1h]) > 5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
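The for: 5m clause means the alert sits in a "pending" state until the expression has been continuously true for the whole window; a single false evaluation resets the timer. A rough sketch of that state machine (simplified -- real Prometheus evaluates at its configured rule interval):

```python
class ForClauseAlert:
    """Minimal pending -> firing model of a Prometheus `for:` clause.
    feed() takes the current time (seconds) and whether the expression is true."""
    def __init__(self, for_seconds):
        self.for_seconds = for_seconds
        self.true_since = None

    def feed(self, now, expr_is_true):
        if not expr_is_true:
            self.true_since = None        # any false evaluation resets the timer
            return "inactive"
        if self.true_since is None:
            self.true_since = now
        if now - self.true_since >= self.for_seconds:
            return "firing"
        return "pending"

alert = ForClauseAlert(for_seconds=300)                 # for: 5m
states = [alert.feed(t, True) for t in (0, 120, 240)]   # still pending
states.append(alert.feed(300, True))                    # 5 minutes continuously true

flapping = ForClauseAlert(for_seconds=300)
flapping.feed(0, True)
state_after_flap = flapping.feed(100, False)            # timer resets to inactive
```

This is why for: suppresses brief spikes: a condition that flaps never accumulates the continuous duration needed to fire.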
Alertmanager
Prometheus fires alerts to Alertmanager, which handles deduplication, grouping, silencing, and routing to notification channels (Slack, PagerDuty, email, OpsGenie).
5. Monitoring the Monitoring Stack
Your observability stack is itself a critical service. If Prometheus goes down, you lose visibility into everything else.
- Prometheus: Monitor storage usage, scrape duration, targets that are down. Set up a secondary Prometheus instance (or use Thanos/Cortex for HA).
- CoreDNS: Track query rates and errors. DNS failure takes down most applications.
- Log collector: Monitor log pipeline throughput and lag. If Fluent Bit crashes, you lose log coverage.
- Alertmanager: Configure a dead man's switch -- an alert that always fires. If you stop receiving it, your alerting pipeline is broken.
Tool Comparison Summary
| Need | Recommended Stack | Alternative |
|---|---|---|
| Metrics | Prometheus + Grafana | Datadog, New Relic, Victoria Metrics |
| Logging | Fluent Bit + Loki + Grafana | EFK (Elasticsearch + Fluentd + Kibana) |
| Tracing | OpenTelemetry + Tempo + Grafana | Jaeger, Zipkin, Datadog APM |
| Alerting | Alertmanager + PagerDuty/Slack | OpsGenie, VictorOps |
| All-in-one | kube-prometheus-stack (Helm) | Datadog, Grafana Cloud |
The Grafana-centric stack (Prometheus + Loki + Tempo + Grafana) is popular because all three pillars are queryable from a single interface, with the ability to jump between a metric spike, the related logs, and the associated trace in a few clicks.
Common Pitfalls
1. No retention policy for metrics. Prometheus stores data locally. Without configuring retention (e.g., --storage.tsdb.retention.time=30d), the disk fills up and Prometheus crashes. For long-term storage, use Thanos or Cortex.
2. Alerting on symptoms without runbooks. An alert that says "Error rate high" without a link to a runbook telling the on-call engineer what to check is just noise. Every alert should have an annotations.runbook_url.
3. Not monitoring the monitoring stack. If Prometheus runs out of memory and gets OOMKilled, you lose all metrics. Set resource requests/limits and monitor Prometheus itself (even if it is circular -- use a lightweight external check).
4. Logging everything at DEBUG level. Excessive logging overwhelms your log pipeline and increases storage costs. Use INFO in production and enable DEBUG per-pod only when investigating an issue.
5. No trace context propagation. Tracing requires every service in the call chain to propagate the trace context headers. If one service drops the header, the trace is broken into fragments. Validate propagation across all services.
Best Practices
- Deploy kube-prometheus-stack as your starting point. It gives you Prometheus, Grafana, Alertmanager, and standard Kubernetes dashboards in a single Helm install.
- Use structured logging from day one. Retrofitting structured logging across dozens of services is painful. Start with JSON logs and consistent field names.
- Instrument with OpenTelemetry. It is the vendor-neutral standard for metrics, logs, and traces. It prevents vendor lock-in and works with all major backends.
- Alert on SLOs, not resource thresholds. Define Service Level Objectives (e.g., "99.9% of requests complete in under 500ms") and alert when the error budget is being consumed too fast.
- Correlate across pillars. Include trace_id in log entries so you can jump from a log line to the full distributed trace. Include pod labels in metrics so you can jump from a metric spike to the pod's logs.
- Right-size your Prometheus. Each active time series consumes approximately 1-2 KB of memory. 1 million time series requires 1-2 GB RAM. Monitor prometheus_tsdb_head_series to track cardinality.
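The "error budget consumed too fast" idea from the SLO bullet above can be made concrete. With a 99.9% objective the budget is 0.1% of requests per window, and the burn rate is the observed error rate divided by that budget. A sketch (the 14.4 paging threshold is a commonly cited fast-burn value from the Google SRE Workbook, not something this document prescribes):

```python
def burn_rate(error_rate, slo):
    """How many times faster than 'sustainable' the error budget is burning.
    slo=0.999 leaves a 0.1% error budget; burn rate 1.0 exhausts the budget
    exactly at the end of the SLO window, 14.4 burns a 30-day budget in ~2 days."""
    budget = 1.0 - slo
    return error_rate / budget

# 1% observed errors against a 99.9% SLO burns the budget 10x too fast:
rate = burn_rate(error_rate=0.01, slo=0.999)   # ~10.0
should_page = rate >= 14.4                      # fast-burn paging threshold (assumed)
```

Burn-rate alerts replace static thresholds with a single question: at this error rate, will we blow the SLO before the window ends?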
What's Next?
- Observability with OpenTelemetry -- Deep dive into OpenTelemetry instrumentation, collectors, and the OTLP protocol.
- Logging Stack -- Detailed guide to building a production logging pipeline with Loki or Elasticsearch.
- Resources & HPA -- Use Prometheus metrics as custom HPA metrics to scale on application-specific indicators like queue depth or request rate.
- Probes (Health Checks) -- Monitor probe failures in Prometheus to detect application health issues before users are affected.
- Service Discovery (DNS) -- Monitor CoreDNS metrics to detect DNS resolution issues that can cascade across your entire cluster.
- Troubleshooting -- Use your observability data to diagnose common Kubernetes issues.