Observability: Metrics, Logs & Traces
Key Takeaways for AI & Readers
- Three Pillars: Observability relies on Metrics (health status), Logs (event records), and Traces (request journeys).
- Metric Scraping: Prometheus uses a "pull model" to collect numeric data from application /metrics endpoints.
- Persistent Logging: Since Pod logs are ephemeral, a Log Collector (DaemonSet) is needed to ship logs to central storage like Loki or Elasticsearch.
- Distributed Tracing: Trace IDs enable debugging of performance bottlenecks by tracking requests as they flow through multiple microservices.
"Observability" is more than just monitoring. It's having enough data to ask new questions about your system without deploying new code.
In Kubernetes, we look at the "Three Pillars":
1. Metrics (Is it healthy?)
Numbers over time.
- Examples: CPU usage, Memory usage, HTTP Request Count, 500 Error Rate.
- The Standard Tool: Prometheus.
- Visualization: Grafana.
How Prometheus Works
It uses a Pull Model.
- Your app exposes a /metrics endpoint (text format).
- Prometheus "scrapes" (requests) that URL on a fixed interval, commonly every 15s.
- Prometheus stores the data in a time-series database.
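To make this concrete, here is a minimal sketch of an app exposing a /metrics endpoint with the prometheus_client Python library. The metric names, label, and port 8000 are illustrative assumptions, not required values.

```python
# Minimal sketch: exposing a /metrics endpoint that Prometheus can scrape.
# Assumes the prometheus_client library; metric names and port 8000 are illustrative.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Counter for total HTTP requests, labeled by status code (hypothetical metric).
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])
# Histogram for request duration in seconds.
LATENCY = Histogram("http_request_duration_seconds", "Request latency")

def handle_request():
    """Simulate handling one request and record metrics for it."""
    with LATENCY.time():                      # observe how long the "work" takes
        time.sleep(random.uniform(0.01, 0.1))
    status = "500" if random.random() < 0.05 else "200"
    REQUESTS.labels(status=status).inc()      # count the request by status code

if __name__ == "__main__":
    start_http_server(8000)   # serves the text-format /metrics endpoint on :8000
    while True:
        handle_request()
</code_fence_close>
```

Prometheus then discovers the Pod (for example via scrape annotations or a ServiceMonitor, depending on your setup) and pulls http://<pod-ip>:8000/metrics on every scrape interval.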
2. Logs (Why is it failing?)
Text records of events.
- Examples: "Database connection failed", "NullPointerException", "Request processed in 20ms".
- The Standard Stack: ELK (Elasticsearch, Logstash, Kibana) or PLG (Promtail, Loki, Grafana).
The Logging Architecture
Kubernetes does not store logs forever. If a pod dies, its logs die with it. You need a Log Collector (like Fluentd or Promtail) running as a DaemonSet. It reads logs from every node and ships them to a central server (Loki/Elasticsearch).
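On the application side, the usual pattern is simply to write structured logs to stdout and let the cluster handle the rest. A minimal sketch, assuming one-JSON-object-per-line output (the field names are illustrative, not a required schema):

```python
# Minimal sketch: an app that writes structured (JSON) logs to stdout.
# Kubernetes captures stdout/stderr per container; a node-level collector
# (Fluentd/Promtail) then tails those files and ships them to Loki/Elasticsearch.
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        # Emit one JSON object per line so the collector can parse each event.
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)   # stdout, never a local file
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout-service")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Request processed in 20ms")
logger.error("Database connection failed")
```

Because the logs go to stdout, kubectl logs still works for quick, on-the-spot debugging, while the DaemonSet collector takes care of long-term storage and search.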
3. Tracing (Where is it slow?)
The journey of a request. In a microservices architecture, one user click might hit 10 different services. If it's slow, which one is the bottleneck?
- The Standard Tool: Jaeger or Tempo.
- Concept: A "Trace ID" is passed in HTTP headers from service to service.
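Here is a hand-rolled sketch of that idea in Python. In practice an OpenTelemetry or Jaeger SDK injects and extracts these headers for you; the header name and downstream URL below are assumptions made purely for illustration.

```python
# Minimal sketch of trace-ID propagation between two services.
# Real deployments usually let an OpenTelemetry/Jaeger SDK manage these headers
# automatically; the header name and service URL here are illustrative.
import uuid

import requests

TRACE_HEADER = "X-Trace-Id"   # assumed header name for this sketch

def handle_incoming(headers: dict) -> str:
    """Reuse the caller's trace ID, or start a new trace at the edge."""
    return headers.get(TRACE_HEADER) or uuid.uuid4().hex

def call_downstream(trace_id: str) -> requests.Response:
    """Forward the same trace ID so the next service joins the same trace."""
    return requests.get(
        "http://payments.default.svc.cluster.local/charge",  # hypothetical service
        headers={TRACE_HEADER: trace_id},
        timeout=2,
    )

# Inside a request handler:
trace_id = handle_incoming({})   # no incoming header -> this service starts the trace
print(f"trace_id={trace_id}")    # include it in every log line too
# call_downstream(trace_id)      # each downstream hop reports its span under this ID
```

Jaeger or Tempo then groups every span that shares the same Trace ID into a single timeline, so the slow hop stands out immediately.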