
Observability: Metrics, Logs & Traces

Key Takeaways for AI & Readers
  • Three Pillars: Observability relies on Metrics (health status), Logs (event records), and Traces (request journeys).
  • Metric Scraping: Prometheus uses a "pull model" to collect numeric data from application /metrics endpoints.
  • Persistent Logging: Since Pod logs are ephemeral, a Log Collector (DaemonSet) is needed to ship logs to central storage like Loki or Elasticsearch.
  • Distributed Tracing: Trace IDs enable debugging of performance bottlenecks by tracking requests as they flow through multiple microservices.

"Observability" is more than just monitoring. It's having enough data to ask new questions about your system without deploying new code.

In Kubernetes, we look at the "Three Pillars":

1. Metrics (Is it healthy?)

Numbers over time.

  • Examples: CPU usage, Memory usage, HTTP Request Count, 500 Error Rate.
  • The Standard Tool: Prometheus.
  • Visualization: Grafana.

How Prometheus Works

It uses a Pull Model.

  1. Your app exposes a /metrics endpoint (text format).
  2. Prometheus "scrapes" (sends an HTTP GET to) that URL at a configured interval, for example every 15s.
  3. Prometheus stores the data in a time-series database.
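The app side of the pull model can be sketched with nothing but the Python standard library. This is a minimal illustration, not a production exporter: the metric name http_requests_total, the REQUEST_COUNTS dictionary, and port 8000 are all assumptions made for the example. A real service would more likely use a Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real app would increment these per request.
REQUEST_COUNTS = {("GET", "200"): 0, ("GET", "500"): 0}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; Prometheus requests this path on each scrape.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Port 8000 is an arbitrary choice for this sketch.
    HTTPServer(("0.0.0.0", 8000), MetricsHandler).serve_forever()
```

The app never pushes anything. It only answers when Prometheus asks, which is why Prometheus has to be told (or has to discover) where each /metrics endpoint lives.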

2. Logs (Why is it failing?)

Text records of events.

  • Examples: "Database connection failed", "NullPointerException", "Request processed in 20ms".
  • The Standard Stack: ELK (Elasticsearch, Logstash, Kibana) or PLG (Promtail, Loki, Grafana).

The Logging Architecture

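Kubernetes does not store logs forever. Container logs live as files on the node: they are rotated by size, deleted when the pod is removed, and lost if the node itself goes away. To keep them, you need a Log Collector (like Fluentd or Promtail) running as a DaemonSet, so that one copy runs on every node. It reads the container log files on its node and ships them to a central server (Loki/Elasticsearch).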
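To make the collector's job concrete, here is a rough sketch of what it does with each log line. It assumes the usual kubelet layout of /var/log/pods/<namespace>_<pod>_<uid>/<container>/<n>.log and the CRI line format "<timestamp> <stream> <tag> <message>". The function names parse_cri_line, enrich, and ship are hypothetical, and ship only prints JSON where a real collector would send the record to Loki or Elasticsearch.

```python
import json
import re

# Assumed path layout: /var/log/pods/<namespace>_<pod>_<uid>/<container>/<n>.log
POD_LOG_PATH = re.compile(
    r"^/var/log/pods/(?P<namespace>[^_/]+)_(?P<pod>[^_/]+)_(?P<uid>[^/]+)"
    r"/(?P<container>[^/]+)/\d+\.log$"
)


def parse_cri_line(line):
    """Split one CRI-format line: '<timestamp> <stream> <tag> <message>'."""
    parts = line.rstrip("\n").split(" ", 3)
    if len(parts) < 3:
        raise ValueError(f"not a CRI log line: {line!r}")
    timestamp, stream, tag = parts[:3]
    message = parts[3] if len(parts) == 4 else ""
    # Tag "P" marks a partial line that continues in the next entry; "F" is a full line.
    return {"time": timestamp, "stream": stream, "partial": tag == "P", "message": message}


def enrich(path, line):
    """Attach namespace/pod/container labels, taken from the file path, to a log line."""
    match = POD_LOG_PATH.match(path)
    if match is None:
        raise ValueError(f"unexpected log path: {path!r}")
    record = parse_cri_line(line)
    record.update(match.groupdict())
    return record


def ship(record):
    """Stand-in for the network call to central storage (hypothetical)."""
    print(json.dumps(record))


if __name__ == "__main__":
    ship(enrich(
        "/var/log/pods/default_web-7d9f_1234-abcd/app/0.log",
        "2024-01-01T00:00:00.000000000Z stdout F Database connection failed\n",
    ))
```

The labels are what make central storage useful: once the pod is gone, they are the only way to tell which workload a line came from.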

3. Tracing (Where is it slow?)

The journey of a request. In a microservices architecture, one user click might hit 10 different services. If it's slow, which one is the bottleneck?

  • The Standard Tool: Jaeger or Tempo.
  • Concept: A "Trace ID" is passed in HTTP headers from service to service, so the spans each service records can be joined into a single trace.
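Trace ID propagation can be sketched as two small functions: one that reads the ID from an incoming request and one that builds the headers for the next outgoing call. This example assumes the W3C Trace Context traceparent header, which is one common choice (the text above does not name a header), and assumes header names arrive lowercased in a plain dict. The function names are hypothetical. In practice a tracing library such as OpenTelemetry handles this for you.

```python
import re
import secrets

# traceparent format: version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def incoming_trace_id(headers):
    """Return the trace ID from the 'traceparent' header, or start a new trace."""
    match = TRACEPARENT.match(headers.get("traceparent", ""))
    # An all-zero trace ID is invalid, so treat it like a missing header.
    if match and match.group(1) != "0" * 32:
        return match.group(1)
    return secrets.token_hex(16)  # 16 random bytes -> 32 hex characters


def outgoing_headers(trace_id, sampled=True):
    """Build headers for a downstream call: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


if __name__ == "__main__":
    request_headers = {
        "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    }
    trace_id = incoming_trace_id(request_headers)
    print(outgoing_headers(trace_id))
```

Every service keeps the trace ID unchanged and only mints a new span ID. That shared ID is what lets a backend like Jaeger or Tempo line up all 10 hops of one click and show which one was slow.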