Logging Architecture
- Ephemeral Logs: Container logs in Kubernetes are not persistent. When a container is deleted or restarted, its logs are lost. A centralized logging solution is not optional in production — it is essential for debugging, auditing, and compliance.
- Node-Level Collection (DaemonSet): The standard and most efficient pattern. A log collector (FluentBit, Promtail, or Fluentd) runs as a DaemonSet on every node, tailing log files from /var/log/pods and forwarding them to a centralized backend. This requires no application changes.
- Sidecar Pattern: For applications that cannot write to stdout/stderr (legacy apps writing to local files), a sidecar container streams log files to stdout so the node-level agent can collect them. More resource-intensive but necessary for certain workloads.
- Backend Options: Elasticsearch/OpenSearch (full-text search, powerful queries, higher cost) vs. Grafana Loki (label-indexed, lower storage cost, pairs with Grafana) vs. cloud-native solutions (CloudWatch, Cloud Logging, Azure Monitor).
- Structured Logging: JSON-formatted logs with consistent fields (timestamp, level, service, trace_id) dramatically improve searchability and enable automated alerting. This is a prerequisite for effective log analysis at scale.
- Cost Management: Logging at scale is expensive. Controlling log volume through appropriate log levels, filtering, sampling, and retention policies is critical to keeping costs manageable.
In Kubernetes, logs are ephemeral. When a container is deleted, restarted, or evicted, its logs are gone. The container runtime (containerd) stores logs on the node's filesystem at /var/log/pods, but these files are rotated and eventually deleted. If a node is terminated (e.g., by the Cluster Autoscaler or a spot instance reclamation), all logs on that node are lost permanently.
For any production workload, you need a centralized logging pipeline that collects logs from all containers and stores them in a durable, searchable backend.
1. The Three Logging Patterns
Pattern 1: Node-Level DaemonSet (Recommended)
This is the standard approach used by the majority of Kubernetes clusters:
- Application writes to stdout/stderr: Your application logs to the console (this is the 12-factor app best practice).
- Container runtime captures logs: containerd captures stdout/stderr and writes it to files in the CRI log format at /var/log/pods/<namespace>_<pod>_<uid>/<container>/0.log on the node.
- DaemonSet agent tails log files: A log collector (FluentBit, Promtail, Fluentd) runs as a DaemonSet on every node. It tails the log files from /var/log/pods and enriches them with Kubernetes metadata (pod name, namespace, labels, node).
- Agent ships to backend: The agent forwards logs over the network to a centralized backend (Elasticsearch, Loki, CloudWatch, Splunk).
Advantages: No changes to application code. One agent per node (resource-efficient). Automatic Kubernetes metadata enrichment.
Pattern 2: Sidecar Container
For applications that cannot write to stdout/stderr (legacy applications that write logs to files on disk), use a sidecar:
apiVersion: v1
kind: Pod
metadata:
  name: legacy-app
spec:
  containers:
  # The main application writes to a file
  - name: legacy-app
    image: registry.example.com/legacy-app:v1.0
    volumeMounts:
    - name: log-volume
      mountPath: /var/log/app
  # The sidecar reads the file and streams to stdout
  - name: log-streamer
    image: busybox:1.36
    command: ["sh", "-c", "tail -F /var/log/app/application.log"]
    volumeMounts:
    - name: log-volume
      mountPath: /var/log/app
      readOnly: true
  volumes:
  - name: log-volume
    emptyDir: {}
The sidecar streams the file content to its own stdout, which the node-level DaemonSet agent then collects. This adds resource overhead (one extra container per pod) but requires no changes to the legacy application.
Pattern 3: Direct to Backend
The application ships logs directly to the backend using an SDK or library (e.g., a Loki client library, or the Elasticsearch client).
Advantages: No intermediary. Fine-grained control over what is logged.
Disadvantages: Requires application code changes. Tight coupling to the backend. Missing Kubernetes metadata unless manually added. If the backend is down, logs are lost.
This pattern is rarely recommended as the primary logging approach. It is sometimes used alongside the DaemonSet pattern for specific high-volume or high-priority log streams.
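As a concrete illustration of the direct-push pattern, here is a minimal sketch that sends a log line to Loki's HTTP push API (`/loki/api/v1/push`) using only the Python standard library. The gateway hostname and labels are assumptions, and a real implementation would add batching, retries, and local buffering — the lack of which is exactly the "if the backend is down, logs are lost" drawback noted above:

```python
import json
import time
import urllib.request

def build_loki_payload(labels: dict, line: str) -> dict:
    """Build a payload in the shape Loki's push API expects."""
    ts_ns = str(time.time_ns())  # Loki takes nanosecond epoch timestamps as strings
    return {"streams": [{"stream": labels, "values": [[ts_ns, line]]}]}

def push_to_loki(url: str, payload: dict) -> None:
    """POST the payload; raises urllib.error.HTTPError on a non-2xx response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

payload = build_loki_payload(
    {"service": "store-api", "namespace": "prod"},  # labels become the Loki stream
    json.dumps({"level": "error", "message": "Failed to process payment"}),
)
# push_to_loki("http://loki-gateway.logging.svc.cluster.local/loki/api/v1/push", payload)
```

Note that the Kubernetes metadata (namespace, pod) must be supplied by hand here — the automatic enrichment the DaemonSet pattern provides is missing.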
2. Log Collectors: FluentBit vs. Fluentd vs. Promtail
| Feature | FluentBit | Fluentd | Promtail |
|---|---|---|---|
| Language | C | Ruby/C | Go |
| Memory footprint | ~15 MB | ~60-100 MB | ~25 MB |
| CPU overhead | Very low | Moderate | Low |
| Plugin ecosystem | Good (growing) | Excellent (1000+ plugins) | Limited (Loki-focused) |
| Output targets | Elasticsearch, Loki, CloudWatch, S3, Kafka, many more | Almost anything | Loki only |
| Configuration | INI-style or YAML | Ruby-based DSL | YAML |
| Best for | Resource-constrained clusters, multi-backend | Complex log processing pipelines | Loki-only deployments |
| CNCF status | CNCF Graduated (Fluent Bit) | CNCF Graduated (Fluentd) | Part of Grafana Loki project |
Recommendation: Use FluentBit for most deployments — it has the lowest resource overhead and supports all major backends. Use Fluentd only if you need complex log processing (multiple regex transforms, conditional routing to different backends). Use Promtail if your backend is exclusively Grafana Loki.
3. FluentBit DaemonSet Configuration
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
  labels:
    app: fluent-bit
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      tolerations:
      # Run on all nodes, including tainted ones
      - operator: Exists
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:3.0
        resources:
          requests:
            cpu: 50m
            memory: 64Mi
          limits:
            cpu: 200m
            memory: 128Mi
        volumeMounts:
        # Mount node log directory
        - name: varlog
          mountPath: /var/log
          readOnly: true
        # Mount container runtime log directory
        - name: varlogpods
          mountPath: /var/log/pods
          readOnly: true
        # FluentBit configuration
        - name: config
          mountPath: /fluent-bit/etc/
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlogpods
        hostPath:
          path: /var/log/pods
      - name: config
        configMap:
          name: fluent-bit-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush             5
        Log_Level         info
        Daemon            off
        Parsers_File      parsers.conf
        # Health check and metrics endpoint
        HTTP_Server       On
        HTTP_Listen       0.0.0.0
        HTTP_Port         2020

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/pods/*/*/*.log
        # Parse the CRI log format used by containerd
        Parser            cri
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  10

    [FILTER]
        Name              kubernetes
        Match             kube.*
        Kube_URL          https://kubernetes.default.svc:443
        Kube_CA_File      /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File   /var/run/secrets/kubernetes.io/serviceaccount/token
        # Parse JSON log messages into structured fields
        Merge_Log         On
        Merge_Log_Key     log_processed
        K8S-Logging.Parser   On
        K8S-Logging.Exclude  On
        # Include pod labels as fields
        Labels            On

    [FILTER]
        # Drop empty log lines
        Name              grep
        Match             kube.*
        Exclude           log ^$

    [OUTPUT]
        Name              es
        Match             kube.*
        Host              elasticsearch.logging.svc.cluster.local
        Port              9200
        Index             kubernetes-logs
        Type              _doc
        Logstash_Format   On
        Retry_Limit       3
        tls               On
        tls.verify        Off
  parsers.conf: |
    [PARSER]
        Name         cri
        Format       regex
        Regex        ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<log>.*)$
        Time_Key     time
        Time_Format  %Y-%m-%dT%H:%M:%S.%L%z
FluentBit Output for Loki
If using Grafana Loki instead of Elasticsearch:
[OUTPUT]
    Name          loki
    Match         kube.*
    Host          loki-gateway.logging.svc.cluster.local
    Port          80
    Labels        job=fluent-bit
    Label_Keys    $kubernetes['namespace_name'],$kubernetes['pod_name'],$kubernetes['container_name']
    Remove_Keys   kubernetes,stream
    Line_Format   json
4. Backend Comparison: EFK Stack vs. PLG Stack
EFK Stack (Elasticsearch + FluentBit/Fluentd + Kibana)
- Elasticsearch: Full-text search engine. Indexes every word in every log line. Supports complex queries, aggregations, and regex.
- Kibana: Visualization layer. Rich dashboards, log exploration, saved queries.
- Strengths: Powerful free-text search ("find all logs containing 'connection refused'"). Mature ecosystem. Excellent for compliance and audit logs.
- Weaknesses: Expensive to operate. High memory and storage requirements. Index management complexity. Requires significant tuning at scale.
- Cost profile: Approximately 3-10x more expensive than Loki for the same log volume, due to full indexing.
PLG Stack (Promtail + Loki + Grafana)
- Loki: Log aggregation system that indexes only labels (namespace, pod, container), not log content. Stores compressed log chunks in object storage (S3, GCS).
- Grafana: Visualization layer. LogQL query language for filtering and aggregating logs.
- Strengths: Dramatically lower storage costs (only labels are indexed). Pairs naturally with Prometheus (same label model). Simpler to operate.
- Weaknesses: Free-text search is slower (requires scanning log chunks). Less suited for compliance use cases requiring complex full-text queries.
- Cost profile: 3-10x cheaper than Elasticsearch for equivalent log volume.
Cloud-Native Options
- AWS CloudWatch Logs: Managed, no infrastructure to operate. Higher per-GB cost but zero ops overhead.
- GCP Cloud Logging: Integrated with GKE. First 50 GiB/month free.
- Azure Monitor Logs: Integrated with AKS. Based on Log Analytics workspaces.
5. Structured Logging Best Practices
Unstructured log lines like Error processing request from 192.168.1.5 are nearly impossible to search and aggregate at scale. Structured logging uses a consistent format (typically JSON) with well-defined fields:
{
"timestamp": "2024-01-15T14:30:22.123Z",
"level": "error",
"service": "store-api",
"message": "Failed to process payment",
"trace_id": "abc123def456",
"span_id": "789ghi",
"user_id": "user-42",
"error_code": "PAYMENT_DECLINED",
"duration_ms": 1523,
"http_status": 500,
"http_method": "POST",
"http_path": "/api/v2/payments"
}
Key Fields to Include
| Field | Purpose |
|---|---|
| timestamp | When the event occurred (ISO 8601 format) |
| level | Severity: debug, info, warn, error, fatal |
| service | Which microservice emitted the log |
| message | Human-readable description |
| trace_id | Distributed tracing correlation ID |
| error_code | Machine-parseable error classification |
| duration_ms | Request/operation duration for latency analysis |
| http_status | HTTP response code for request logs |
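To make these fields concrete, here is a minimal sketch of a JSON formatter built on Python's standard logging module. The service name and the set of extra fields are illustrative choices, not a specific library's API; in practice you would likely use a library such as structlog or python-json-logger:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""
    # Optional structured fields passed via `extra=` (names are illustrative)
    EXTRA_FIELDS = ("trace_id", "error_code", "duration_ms", "http_status")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname.lower(),
            "service": "store-api",  # hypothetical service name
            "message": record.getMessage(),
        }
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)  # write to stdout for the node agent
handler.setFormatter(JsonFormatter())
log = logging.getLogger("store-api")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.error("Failed to process payment",
          extra={"trace_id": "abc123def456", "error_code": "PAYMENT_DECLINED"})
```

Because every record is a single JSON line on stdout, the node-level agent can parse it without any multi-line handling.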
Log Levels and Filtering
Use log levels consistently across all services and filter aggressively:
- debug: Detailed diagnostic information. Disable in production (except when debugging specific issues).
- info: Normal operational events (request processed, job completed).
- warn: Unexpected conditions that do not prevent operation (retry succeeded, deprecated API used).
- error: Failures that require investigation (payment failed, database timeout).
- fatal: Unrecoverable failures (cannot connect to database on startup).
Configure your log collector to filter out debug-level logs in production to reduce volume and cost.
6. Multi-Line Log Handling
Stack traces and multi-line log entries are a common challenge. By default, each line is treated as a separate log entry, splitting stack traces across multiple records.
FluentBit multi-line configuration:
[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/pods/*/*/*.log
    # Built-in docker/cri parsers plus a custom parser for Java stack traces
    # (multiple parsers are given as one comma-separated list)
    multiline.parser  docker, cri, java-stacktrace

[MULTILINE_PARSER]
    name           java-stacktrace
    type           regex
    flush_timeout  1000
    # A line starting with a date opens a new record
    rule  "start_state"  "/^\d{4}-\d{2}-\d{2}/"               "cont"
    # Continuation lines ("at ...", "...", "Caused by:") append to it
    rule  "cont"         "/^\s+at\s|^\s+\.\.\.|^Caused by:/"  "cont"
The best solution is to use structured JSON logging in your application. When logs are single-line JSON objects, multi-line parsing is unnecessary.
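The grouping behavior can be sketched in a few lines of Python using the same regexes — this illustrates the state machine, not FluentBit's actual implementation:

```python
import re

START = re.compile(r"^\d{4}-\d{2}-\d{2}")              # a new record starts with a date
CONT = re.compile(r"^\s+at\s|^\s+\.\.\.|^Caused by:")  # stack-trace continuation lines

def group_multiline(lines):
    """Merge continuation lines into the preceding record."""
    records = []
    for line in lines:
        if records and CONT.search(line):
            records[-1] += "\n" + line  # append to the current record
        else:
            records.append(line)        # start a new record
    return records

raw = [
    "2024-01-15 14:30:22 ERROR Unhandled exception",
    "    at com.shop.PaymentService.process(PaymentService.java:42)",
    "Caused by: java.sql.SQLException: connection timeout",
    "    at java.sql.Driver.connect(Driver.java:13)",
    "2024-01-15 14:30:23 INFO request completed",
]
records = group_multiline(raw)
# The stack trace collapses into one record; the INFO line stays separate
```

Note that any exception line not matching the continuation pattern would still start a new record — which is why single-line JSON output from the application remains the more robust fix.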
7. Log Rotation and Retention
Node-Level Rotation
Container runtimes (containerd) handle log rotation on the node. Default containerd configuration rotates logs when they reach 10 MiB, keeping up to 5 files per container. Customize via kubelet configuration:
# kubelet configuration
containerLogMaxSize: "50Mi" # rotate after 50 MiB
containerLogMaxFiles: 3 # keep 3 rotated files
Backend Retention
Set retention policies to control storage costs:
Elasticsearch/OpenSearch: Use Index Lifecycle Management (ILM) to automatically delete old indices:
{
"policy": {
"phases": {
"hot": { "actions": { "rollover": { "max_age": "1d", "max_size": "50gb" } } },
"warm": { "min_age": "7d", "actions": { "shrink": { "number_of_shards": 1 } } },
"delete": { "min_age": "30d", "actions": { "delete": {} } }
}
}
}
Loki: Configure retention in the Loki configuration:
# Loki config
limits_config:
  retention_period: 720h  # 30 days
compactor:
  retention_enabled: true
8. Cost of Logging at Scale
Logging is often the second or third largest infrastructure cost after compute. Key cost drivers:
- Ingestion volume: The total GB/day of logs generated.
- Storage: Depends on retention period and backend. Elasticsearch stores data on local SSDs (expensive). Loki stores on object storage (cheap).
- Indexing overhead: Full-text indexing (Elasticsearch) costs ~3x the raw log storage. Label-only indexing (Loki) costs ~0.3x.
- Query cost: Complex queries on large datasets require compute resources.
Cost Reduction Strategies
- Set appropriate log levels: Do not log at DEBUG level in production.
- Filter noisy logs: Drop health check access logs, readiness probe logs, and other high-volume, low-value entries in the FluentBit configuration.
- Sample high-volume logs: For services generating thousands of logs per second, sample 10-25% and extrapolate.
- Use Loki over Elasticsearch for cost-sensitive deployments.
- Tier your retention: Keep 7 days hot (searchable), 30 days warm (slower queries), archive to S3 for compliance.
- Set per-namespace quotas on log ingestion if using a managed backend.
# FluentBit filter to drop health check logs
[FILTER]
    Name     grep
    Match    kube.*
    Exclude  log /health
    Exclude  log /ready
    Exclude  log /livez
Common Pitfalls
- Not logging to stdout/stderr: If your application writes logs to files inside the container, the node-level DaemonSet cannot collect them. Always write to stdout/stderr.
- Missing resource limits on log collectors: FluentBit/Fluentd without memory limits can consume excessive RAM on nodes with high log volume, potentially causing OOM kills for other pods.
- Logging secrets or PII: Structured logging makes it easy to accidentally include passwords, tokens, or personally identifiable information. Implement scrubbing rules in your log pipeline.
- Not parsing JSON logs: If your application outputs JSON and FluentBit does not parse it (Merge_Log On), the entire JSON string is stored as a single text field, defeating the purpose of structured logging.
- Ignoring multi-line logs: Stack traces split across multiple log entries are extremely difficult to debug. Configure multi-line parsing or use single-line JSON output.
- Underestimating logging costs: A cluster with 100 pods each generating 1 MB/minute of logs produces 144 GB/day. At Elasticsearch pricing, that can easily exceed $500/month in storage alone.
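The back-of-the-envelope math in that last point is easy to verify; the per-GB price below is an assumed illustrative figure, not a quote from any provider:

```python
pods = 100
mb_per_pod_per_minute = 1
minutes_per_day = 60 * 24  # 1440

gb_per_day = pods * mb_per_pod_per_minute * minutes_per_day / 1000
print(f"Ingestion: {gb_per_day:.0f} GB/day")  # 144 GB/day

retention_days = 30
stored_gb = gb_per_day * retention_days  # 4320 GB held at any time

usd_per_gb_month = 0.15  # assumed illustrative storage price, not a vendor quote
print(f"~${stored_gb * usd_per_gb_month:.0f}/month for storage alone")
```

Even a modest change in per-pod volume or retention multiplies through directly, which is why filtering and retention tiering pay off so quickly.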
Best Practices
- Use structured JSON logging in all applications. This is the single most impactful improvement.
- Deploy FluentBit as a DaemonSet as the default log collection strategy.
- Set memory limits on log collectors to prevent unbounded memory growth.
- Add Kubernetes labels as log fields for filtering by team, environment, and service.
- Include trace_id in all log entries to correlate logs with distributed traces.
- Filter health check and probe logs at the collector level to reduce volume.
- Set retention policies that balance debugging needs with cost (7-30 days for most workloads).
- Monitor your logging pipeline — set alerts for collector crashes, dropped logs, and backend ingestion errors.
- Use Loki for cost-sensitive deployments and Elasticsearch when full-text search is a hard requirement.
- Do not log sensitive data. Implement scrubbing for known patterns (credit card numbers, SSNs, API keys) in your log pipeline.
What's Next?
- Explore Observability to complement your logging stack with metrics and alerting.
- Learn about Cost Optimization — logging costs can be a significant portion of your infrastructure bill.
- See Multi-Tenancy for per-tenant log isolation and access control.
- Understand Pod Security to ensure log collectors running as DaemonSets have appropriate privileges.