Observability: Metrics, Logs & Traces

Key Takeaways for AI & Readers
  • Three Pillars: Observability relies on Metrics (numeric measurements over time), Logs (timestamped event records), and Traces (end-to-end request journeys across services). Each serves a different debugging need and they complement each other.
  • Prometheus Pull Model: Prometheus scrapes numeric data from application /metrics endpoints at regular intervals. This pull-based approach means applications do not need to know where to send metrics -- they just expose them.
  • Persistent Logging: Pod logs are ephemeral. When a pod is deleted, its logs are gone. A DaemonSet-based log collector (Fluentd, Fluent Bit, Promtail) ships logs to central storage (Loki, Elasticsearch) for retention and search.
  • Distributed Tracing: Trace IDs propagated in HTTP headers enable tracking a single request as it flows through multiple microservices, pinpointing exactly where latency occurs.
  • Alerting Philosophy: Alert on symptoms (user-visible impact) not causes. "Error rate above 1%" is actionable; "CPU above 80%" often is not.
  • Kubernetes-Native Metrics: metrics-server provides resource usage for HPA and kubectl top. kube-state-metrics exposes the state of Kubernetes objects (deployment replicas, pod phases, node conditions).

"Observability" is more than just monitoring. It is having enough data to ask new questions about your system without deploying new code. Monitoring tells you when something is wrong. Observability helps you understand why.

In Kubernetes, we build observability around the "Three Pillars":


1. Metrics (Is it healthy?)

Numbers measured over time. Metrics are the most efficient form of telemetry -- they are cheap to collect, store, and query.

Examples: CPU usage, memory consumption, HTTP request count, error rate, request latency percentiles, queue depth.

Prometheus Architecture

Prometheus is the de facto standard for metrics in Kubernetes. It was the second project to graduate from the CNCF (after Kubernetes itself).

How the Pull Model Works:

  1. Your application exposes a /metrics endpoint that returns metrics in Prometheus text format.
  2. Prometheus "scrapes" (sends an HTTP GET to) that endpoint at a configurable interval (default: 15 seconds).
  3. Prometheus stores the data in its built-in time-series database, indexed by metric name and labels.
  4. You query the data using PromQL (Prometheus Query Language).
# Example /metrics endpoint output (Prometheus text format)
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET", path="/api/users", status="200"} 14523
http_requests_total{method="GET", path="/api/users", status="500"} 23
http_requests_total{method="POST", path="/api/users", status="201"} 892

# HELP http_request_duration_seconds HTTP request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.01"} 12000
http_request_duration_seconds_bucket{le="0.1"} 14000
http_request_duration_seconds_bucket{le="1.0"} 14500
http_request_duration_seconds_bucket{le="+Inf"} 14523
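The application side of this contract is deliberately simple. Below is a minimal sketch of a /metrics endpoint using only the Python standard library; a real service would normally use the official prometheus_client library instead, and the counter values here are just the illustrative ones from the sample output above:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory counters keyed by (method, path, status) label values.
# A real app would increment these in its request-handling path.
REQUEST_COUNTS = {
    ("GET", "/api/users", "200"): 14523,
    ("GET", "/api/users", "500"): 23,
}

def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total number of HTTP requests",
        "# TYPE http_requests_total counter",
    ]
    for (method, path, status), value in sorted(REQUEST_COUNTS.items()):
        lines.append(
            f'http_requests_total{{method="{method}", '
            f'path="{path}", status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    """Answers the GET /metrics request Prometheus sends on each scrape."""
    def do_GET(self):
        if self.path != "/metrics":
            self.send_response(404)
            self.end_headers()
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

# To expose it on port 8080:
# HTTPServer(("", 8080), MetricsHandler).serve_forever()
```

Note that the application never pushes anything: it only answers GETs, which is what makes the pull model so easy to adopt.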

Prometheus Service Discovery

In Kubernetes, Prometheus discovers scrape targets automatically. Using ServiceMonitor resources (from the Prometheus Operator) or annotation-based discovery, Prometheus finds all pods exposing metrics:

# ServiceMonitor tells Prometheus what to scrape
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: api-service        # Match Services with this label
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: metrics           # Named port from the Service
      interval: 15s           # Scrape every 15 seconds
      path: /metrics

Alternatively, you can use pod annotations for simpler setups:

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"

PromQL Basics

PromQL lets you query and aggregate metrics. Here are essential queries for Kubernetes:

# Request rate over the last 5 minutes
rate(http_requests_total[5m])

# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100

# 99th percentile latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod)

# Memory usage by pod
container_memory_working_set_bytes{namespace="production"}

# Pods in CrashLoopBackOff
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}
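To make histogram_quantile less of a black box, here is a Python sketch of the estimate it produces for the example buckets shown earlier: find the first cumulative bucket containing the target rank, then linearly interpolate within it. This is a simplification of Prometheus's actual implementation, but it matches the core idea:

```python
def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    total = buckets[-1][1]          # the +Inf bucket counts all observations
    rank = q * total                # rank of the target observation
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                return lower_bound  # cannot interpolate into an unbounded bucket
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound

# The buckets from the /metrics example above.
buckets = [(0.01, 12000), (0.1, 14000), (1.0, 14500), (float("inf"), 14523)]
p99 = histogram_quantile(0.99, buckets)
print(round(p99, 2))  # 0.78 -- the 99th percentile falls in the 0.1s-1.0s bucket
```

This also explains why bucket boundaries matter: the estimate can only be as precise as the bucket that contains the quantile.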

Kubernetes-Native Metrics Sources

Source             | What It Provides                                       | Used By
-------------------|--------------------------------------------------------|---------------------------------
metrics-server     | Current CPU and memory usage per pod/node              | HPA, kubectl top
kube-state-metrics | State of Kubernetes objects (deployments, pods, nodes) | Prometheus alerting, dashboards
cAdvisor           | Container-level resource metrics (built into kubelet)  | Prometheus via kubelet endpoints
node-exporter      | Node hardware and OS metrics (disk, network, CPU)      | Infrastructure dashboards

The Hidden Cost: Cardinality

"Cardinality" refers to the number of unique time series.

  • Low Cardinality: http_requests_total{method="POST"} (Finite methods: GET, POST, PUT, DELETE).
  • High Cardinality: http_requests_total{user_id="u-12938"} (Millions of users).

Danger: Prometheus stores a new time series for every unique combination of label values. Including user_id, request_id, or pod_ip as a label can explode your memory usage and crash Prometheus. Never include unbounded values in labels.
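A back-of-envelope check makes the danger concrete: the number of time series for one metric is the product of the distinct values of each label. A sketch (the user count is illustrative):

```python
from math import prod

def series_count(label_values: dict[str, list[str]]) -> int:
    """Unique time series for one metric = product of per-label cardinalities."""
    return prod(len(values) for values in label_values.values())

# Bounded labels: cardinality stays tiny.
bounded = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "status": ["200", "404", "500"],
}
print(series_count(bounded))  # 12 series

# One unbounded label (a value per user) multiplies everything.
unbounded = dict(bounded, user_id=[f"u-{i}" for i in range(1_000_000)])
print(series_count(unbounded))  # 12000000 series
```

At a rough 1-2 KB of memory per active series, 12 million series means tens of gigabytes of Prometheus memory for a single metric.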

Grafana Dashboards

Prometheus stores data but its built-in UI is minimal. Grafana provides rich dashboards for visualization.

The Kubernetes community maintains standard dashboards:

  • Cluster overview: Node count, pod count, resource utilization, API server health.
  • Namespace detail: Per-namespace CPU, memory, network usage.
  • Pod detail: Per-container CPU, memory, network, restart count.
  • Workload dashboards: Deployment replica status, rollout progress.

Install Grafana with pre-built dashboards via the kube-prometheus-stack Helm chart, which bundles Prometheus, Grafana, Alertmanager, and common dashboards:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prom-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace

2. Logs (Why is it failing?)

Timestamped text records of events. Logs provide the narrative of what happened and why.

Examples: "Database connection failed: timeout after 5s", "NullPointerException at UserService.java:42", "Request processed in 20ms for user_id=1234".

The Ephemeral Log Problem

Kubernetes does not store logs permanently. When a pod is deleted or restarted, its logs are lost. kubectl logs only shows logs from the current (or previous) container instance, and only while the pod exists.

For production systems, you need a log collection pipeline.

Log Collection Patterns

Pattern 1: DaemonSet Collector (Recommended)

A log collector runs as a DaemonSet on every node. It reads log files from the node's filesystem (/var/log/containers/) and ships them to central storage.

# Simplified DaemonSet for log collection
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: log-collector
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:latest
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: containers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: containers
          hostPath:
            path: /var/lib/docker/containers

Pattern 2: Sidecar Collector

A logging sidecar runs alongside the application container in the same pod. The application writes logs to a shared volume, and the sidecar ships them.

Use sidecars when:

  • Different applications need different log parsing or formatting.
  • You need to enrich logs with application-specific metadata.
  • The application writes logs to files rather than stdout.

The DaemonSet approach is preferred for most workloads because it consumes fewer resources (one collector per node vs. one per pod).

Centralized Logging Stacks

Stack                   | Components                                 | Best For
------------------------|--------------------------------------------|------------------------------------------------
PLG                     | Promtail + Loki + Grafana                  | Cost-effective, lightweight, Grafana-native
ELK/EFK                 | Elasticsearch + Logstash/Fluentd + Kibana  | Full-text search, complex queries, large scale
Fluent Bit + CloudWatch | Fluent Bit + AWS CloudWatch                | AWS-native environments

Loki is increasingly popular because it indexes only metadata labels (like namespace, pod, container) rather than full-text, making it much cheaper to operate than Elasticsearch. It integrates natively with Grafana, so you can correlate logs with metrics in one interface.

Structured Logging

Always emit logs as structured JSON rather than unstructured text. This enables filtering and querying in your logging backend:

{"timestamp": "2025-01-15T10:30:00Z", "level": "error", "service": "auth", "message": "database connection timeout", "duration_ms": 5000, "user_id": "u-1234", "trace_id": "abc123"}

vs. unstructured:

2025-01-15 10:30:00 ERROR auth - database connection timeout after 5000ms for user u-1234

Structured logs are parseable by machines. Unstructured logs require regex patterns that break when the format changes.
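One way to get there with only the Python standard library is a JSON formatter for the `logging` module. The field names mirror the example above; the hard-coded service name is an assumption for illustration:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Formats each log record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": "auth",  # assumed service name for illustration
            "message": record.getMessage(),
        }
        # Merge structured fields passed via `extra={"fields": {...}}`.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

logger = logging.getLogger("auth")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "database connection timeout",
    extra={"fields": {"duration_ms": 5000, "user_id": "u-1234", "trace_id": "abc123"}},
)
```

Because every entry is valid JSON with consistent keys, the logging backend can filter on `duration_ms > 1000` or join on `trace_id` without any regex parsing.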


3. Traces (Where is it slow?)

The journey of a single request through your system. In a microservices architecture, one user action might touch 10 services. If the overall request takes 3 seconds, tracing tells you which service contributed the most latency.

How Distributed Tracing Works

  1. The first service (the "entry point") generates a Trace ID and a Span (representing its own work).
  2. When it calls the next service, it passes the Trace ID in HTTP headers (e.g., traceparent: 00-abc123-def456-01 using the W3C Trace Context standard).
  3. Each downstream service creates its own Span under the same Trace ID.
  4. All spans are collected by a tracing backend and assembled into a single trace timeline.
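The four steps above can be sketched end to end in a few lines. The helper names are illustrative, and real services would use an OpenTelemetry SDK rather than hand-rolling this, but the header format follows the W3C Trace Context layout (version-traceid-spanid-flags):

```python
import secrets

def start_trace() -> dict:
    """Entry-point service: mint a new trace ID and a root span ID."""
    return {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8)}

def to_traceparent(ctx: dict, sampled: bool = True) -> str:
    """Serialize as a W3C traceparent header: version-traceid-spanid-flags."""
    return f"00-{ctx['trace_id']}-{ctx['span_id']}-{'01' if sampled else '00'}"

def child_context(traceparent: str) -> dict:
    """Downstream service: keep the trace ID, mint a new span ID."""
    _version, trace_id, parent_span_id, _flags = traceparent.split("-")
    return {
        "trace_id": trace_id,          # same trace end to end
        "span_id": secrets.token_hex(8),
        "parent_span_id": parent_span_id,
    }

root = start_trace()
header = to_traceparent(root)   # sent as the `traceparent` HTTP header
child = child_context(header)   # parsed by the next service in the chain
assert child["trace_id"] == root["trace_id"]
assert child["parent_span_id"] == root["span_id"]
```

The parent/child span links are what let the tracing backend reassemble all spans sharing a trace ID into one timeline.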

Instrumentation

Applications need to be instrumented to create and propagate trace contexts. OpenTelemetry is the current standard -- it provides libraries for all major languages that handle trace creation, context propagation, and export:

# OpenTelemetry Collector as a DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:latest
          ports:
            - containerPort: 4317   # gRPC receiver (OTLP)
            - containerPort: 4318   # HTTP receiver (OTLP)
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://tempo.monitoring:4317"

Tracing Backends

Tool   | Maintained By | Storage                         | Best For
-------|---------------|---------------------------------|-------------------------------------
Jaeger | CNCF          | Elasticsearch, Cassandra, Kafka | Mature, widely adopted, rich UI
Tempo  | Grafana Labs  | Object storage (S3, GCS)        | Cost-effective, Grafana integration
Zipkin | OpenZipkin    | MySQL, Cassandra, Elasticsearch | Lightweight, simple setup

Sampling Strategies

Tracing 100% of requests is prohibitively expensive at scale.

  • Head-Based Sampling: The first service decides (e.g., "Keep 1% of requests").
    • Pros: Simple, low overhead.
    • Cons: The keep/drop decision is made before the request's outcome is known, so the sampled 1% may contain none of the failing requests.
  • Tail-Based Sampling: Collect everything, wait for the request to finish, and then decide (e.g., "Keep all errors and traces > 2s latency").
    • Pros: Guarantees you capture failures.
    • Cons: Requires buffering all trace data in memory, which is complex and resource-intensive.
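A head-based decision is usually made deterministically from the trace ID itself, so every service that sees the same ID independently reaches the same keep/drop verdict without coordination. A sketch (the hashing scheme is illustrative):

```python
import hashlib

def keep_trace(trace_id: str, sample_rate: float = 0.01) -> bool:
    """Map the trace ID uniformly into [0, 1) and keep it below the rate."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

# Deterministic: every hop in the call chain agrees on the same trace.
assert keep_trace("abc123") == keep_trace("abc123")

# Roughly 1% of traces survive at the default rate.
kept = sum(keep_trace(f"trace-{i}") for i in range(100_000))
print(kept)  # close to 1,000
```

This determinism is exactly what tail-based sampling gives up: it must buffer complete traces centrally before it can apply outcome-aware rules like "keep all errors".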

Tempo is gaining adoption because it stores traces in cheap object storage and integrates directly with Grafana, allowing you to jump from a log line to the related trace with one click (via the trace_id field in structured logs).


4. Alerting Strategy

Collecting data is useless if no one looks at it. Alerting bridges the gap between data collection and human response.

Symptom-Based vs. Cause-Based Alerting

Alert on symptoms (user-facing impact), not causes (internal state):

Symptom Alert (Good)                 | Cause Alert (Noisy)
-------------------------------------|---------------------
Error rate > 1% for 5 minutes        | Pod restarted
99th percentile latency > 2s         | CPU above 80%
Zero successful orders in 10 minutes | Disk usage above 70%

Cause-based alerts generate noise. CPU at 85% might be perfectly normal during a batch job. But an error rate above 1% always means users are affected.
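A hold duration does the same filtering for symptom alerts: the condition must be true at every evaluation in the window before the alert fires, so transient spikes never page anyone. A toy model (numbers illustrative, mimicking Prometheus's `for:` clause):

```python
def alert_fires(error_rates: list[float], threshold: float = 0.01) -> bool:
    """Fire only if the error rate breaches the threshold at every
    evaluation in the hold window."""
    return all(rate > threshold for rate in error_rates)

# A transient spike that recovers within the window: no page.
print(alert_fires([0.002, 0.05, 0.003, 0.002, 0.001]))  # False

# A sustained breach across the whole window: page the on-call.
print(alert_fires([0.02, 0.03, 0.025, 0.04, 0.02]))     # True
```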

Prometheus Alerting Rules

# PrometheusRule resource (Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-alerts
  namespace: monitoring
spec:
  groups:
    - name: api-service
      rules:
        # Alert when error rate exceeds 1% for 5 minutes
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5..", namespace="production"}[5m])) /
            sum(rate(http_requests_total{namespace="production"}[5m])) > 0.01
          for: 5m   # Must be true for 5 minutes before firing
          labels:
            severity: critical
          annotations:
            summary: "High error rate in production"
            description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"

        # Alert when a pod restarts repeatedly (symptomatic of CrashLoopBackOff)
        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total{namespace="production"}[1h]) > 5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"

Alertmanager

Prometheus fires alerts to Alertmanager, which handles deduplication, grouping, silencing, and routing to notification channels (Slack, PagerDuty, email, OpsGenie).


5. Monitoring the Monitoring Stack

Your observability stack is itself a critical service. If Prometheus goes down, you lose visibility into everything else.

  • Prometheus: Monitor storage usage, scrape duration, targets that are down. Set up a secondary Prometheus instance (or use Thanos/Cortex for HA).
  • CoreDNS: Track query rates and errors. DNS failure takes down most applications.
  • Log collector: Monitor log pipeline throughput and lag. If Fluent Bit crashes, you lose log coverage.
  • Alertmanager: Configure a dead man's switch -- an alert that always fires. If you stop receiving it, your alerting pipeline is broken.

Tool Comparison Summary

Need       | Recommended Stack               | Alternative
-----------|---------------------------------|----------------------------------------
Metrics    | Prometheus + Grafana            | Datadog, New Relic, VictoriaMetrics
Logging    | Fluent Bit + Loki + Grafana     | EFK (Elasticsearch + Fluentd + Kibana)
Tracing    | OpenTelemetry + Tempo + Grafana | Jaeger, Zipkin, Datadog APM
Alerting   | Alertmanager + PagerDuty/Slack  | OpsGenie, VictorOps
All-in-one | kube-prometheus-stack (Helm)    | Datadog, Grafana Cloud

The Grafana-centric stack (Prometheus + Loki + Tempo + Grafana) is popular because all three pillars are queryable from a single interface, with the ability to jump between a metric spike, the related logs, and the associated trace in a few clicks.


Common Pitfalls

1. No retention policy for metrics. Prometheus stores data locally. Without configuring retention (e.g., --storage.tsdb.retention.time=30d), the disk fills up and Prometheus crashes. For long-term storage, use Thanos or Cortex.

2. Alerting on symptoms without runbooks. An alert that says "Error rate high" without a link to a runbook telling the on-call engineer what to check is just noise. Every alert should have an annotations.runbook_url.

3. Not monitoring the monitoring stack. If Prometheus runs out of memory and gets OOMKilled, you lose all metrics. Set resource requests/limits and monitor Prometheus itself (even if it is circular -- use a lightweight external check).

4. Logging everything at DEBUG level. Excessive logging overwhelms your log pipeline and increases storage costs. Use INFO in production and enable DEBUG per-pod only when investigating an issue.

5. No trace context propagation. Tracing requires every service in the call chain to propagate the trace context headers. If one service drops the header, the trace is broken into fragments. Validate propagation across all services.


Best Practices

  1. Deploy kube-prometheus-stack as your starting point. It gives you Prometheus, Grafana, Alertmanager, and standard Kubernetes dashboards in a single Helm install.

  2. Use structured logging from day one. Retrofitting structured logging across dozens of services is painful. Start with JSON logs and consistent field names.

  3. Instrument with OpenTelemetry. It is the vendor-neutral standard for metrics, logs, and traces. It prevents vendor lock-in and works with all major backends.

  4. Alert on SLOs, not resource thresholds. Define Service Level Objectives (e.g., "99.9% of requests complete in under 500ms") and alert when the error budget is being consumed too fast.

  5. Correlate across pillars. Include trace_id in log entries so you can jump from a log line to the full distributed trace. Include pod labels in metrics so you can jump from a metric spike to the pod's logs.

  6. Right-size your Prometheus. Each active time series consumes approximately 1-2 KB of memory. 1 million time series requires 1-2 GB RAM. Monitor prometheus_tsdb_head_series to track cardinality.
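Point 6 reduces to simple arithmetic; a quick sizing helper, using the upper end (2 KB/series) of the rough estimate:

```python
def prometheus_memory_gb(active_series: int, kb_per_series: float = 2.0) -> float:
    """Rough Prometheus heap estimate from the active time-series count."""
    return active_series * kb_per_series / 1024 / 1024

# 1 million active series at ~2 KB each:
print(round(prometheus_memory_gb(1_000_000), 1))  # 1.9
```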


What's Next?

  • Observability with OpenTelemetry -- Deep dive into OpenTelemetry instrumentation, collectors, and the OTLP protocol.
  • Logging Stack -- Detailed guide to building a production logging pipeline with Loki or Elasticsearch.
  • Resources & HPA -- Use Prometheus metrics as custom HPA metrics to scale on application-specific indicators like queue depth or request rate.
  • Probes (Health Checks) -- Monitor probe failures in Prometheus to detect application health issues before users are affected.
  • Service Discovery (DNS) -- Monitor CoreDNS metrics to detect DNS resolution issues that can cascade across your entire cluster.
  • Troubleshooting -- Use your observability data to diagnose common Kubernetes issues.