StatefulSets: For Databases
- Identity: Assigns a sticky, unique network ID (e.g., `web-0`) to each Pod. Pod names are deterministic and persist across rescheduling, unlike Deployment Pods, which get random suffixes.
- Persistence: Each Pod maintains its own PersistentVolume across restarts via `volumeClaimTemplates`. When a Pod is rescheduled to a different node, its PVC follows it.
- Ordering: Starts and stops Pods in a predictable, sequential order. Pod N+1 does not start until Pod N is Running and Ready.
- Headless Service Required: StatefulSets require a Headless Service (a Service with `clusterIP: None`) to provide DNS entries for each individual Pod.
- Not the Default Choice: Use Deployments for stateless workloads. Only use StatefulSets when your application genuinely requires stable identity, stable storage, or ordered lifecycle management.
A StatefulSet manages the deployment and scaling of a set of Pods, providing guarantees about the ordering and uniqueness of these Pods.
Unlike a Deployment, a StatefulSet maintains a sticky identity for each of its Pods. These Pods are created from the same spec, but are not interchangeable: each has a persistent identifier that it maintains across any rescheduling. This makes StatefulSets the right choice for workloads where each instance has a distinct role or must retain its data across restarts.
Use Cases
StatefulSets are valuable for applications that require one or more of the following:
- Stable, unique network identifiers (e.g., `mysql-0`, `mysql-1`)
- Stable, persistent storage (e.g., specific PVCs linked to specific Pods)
- Ordered, graceful deployment and scaling.
- Ordered, automated rolling updates.
Examples:
- Databases (PostgreSQL, MySQL, MongoDB)
- Distributed systems (Zookeeper, Kafka, Elasticsearch)
- Message queues (RabbitMQ, NATS with JetStream)
Key Features
1. Stable Network ID
Pods in a StatefulSet get a predictable name: `$(statefulset-name)-$(ordinal)`.
If you define a StatefulSet named `web` with 3 replicas, you get:
`web-0`, `web-1`, `web-2`
Each Pod also gets a stable DNS entry when paired with a Headless Service. If the Headless Service is named nginx in the namespace default, each Pod is addressable at:
```
web-0.nginx.default.svc.cluster.local
web-1.nginx.default.svc.cluster.local
web-2.nginx.default.svc.cluster.local
```
This is critical for distributed systems where nodes need to discover and connect to specific peers. For example, a PostgreSQL replica needs to know the exact address of the primary instance to set up streaming replication.
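Because the naming scheme is purely mechanical, a peer list can be derived from just the StatefulSet name, the Headless Service name, the namespace, and the replica count. Here is a minimal illustrative Python sketch (the names and the default `cluster.local` domain are assumptions, not required values):

```python
def stable_dns_names(statefulset, service, namespace, replicas,
                     cluster_domain="cluster.local"):
    """Return the stable FQDN of each Pod in a StatefulSet.

    Pod N is always named f"{statefulset}-{N}", and with a Headless
    Service each Pod is addressable at
    <pod>.<service>.<namespace>.svc.<cluster-domain>.
    """
    return [
        f"{statefulset}-{i}.{service}.{namespace}.svc.{cluster_domain}"
        for i in range(replicas)
    ]

# The "web" StatefulSet behind the "nginx" Headless Service:
peers = stable_dns_names("web", "nginx", "default", 3)
# -> ['web-0.nginx.default.svc.cluster.local',
#     'web-1.nginx.default.svc.cluster.local',
#     'web-2.nginx.default.svc.cluster.local']
```

A replica's bootstrap script could use exactly this kind of derivation to find its primary without any external service registry.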
2. Headless Service
A StatefulSet requires a Headless Service to provide network identity to its Pods. A Headless Service is a regular Kubernetes Service with `clusterIP: None`:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres
  labels:
    app: postgres
spec:
  ports:
    - port: 5432
      name: postgres
  clusterIP: None
  selector:
    app: postgres
```
With a standard Service, clients connect to a single virtual IP and traffic is load-balanced across Pods. With a Headless Service, a DNS lookup returns the individual Pod IPs, allowing clients to connect to a specific Pod by name. The cluster DNS service (typically CoreDNS) creates A/AAAA records for each Pod automatically.
3. Stable Storage with volumeClaimTemplates
Kubernetes creates a PersistentVolumeClaim (PVC) for each Pod based on a `volumeClaimTemplate`. If `web-0` dies and is rescheduled on a different node, it re-attaches to the same storage volume it had before.

```yaml
volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp3
      resources:
        requests:
          storage: 10Gi
```
This produces PVCs named `data-web-0`, `data-web-1`, `data-web-2`. The naming convention is `$(volumeClaimTemplate-name)-$(statefulset-name)-$(ordinal)`.
Important: When you scale down a StatefulSet, the PVCs are not deleted. This is by design -- it protects against accidental data loss. You must manually delete the PVCs if you want to reclaim the storage.
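Since PVC names are deterministic too, you can predict exactly which claims a StatefulSet owns, which is useful when auditing leftover storage after a scale-down. A small illustrative sketch:

```python
def pvc_names(template_name, statefulset, replicas):
    """PVCs follow $(volumeClaimTemplate-name)-$(statefulset-name)-$(ordinal)."""
    return [f"{template_name}-{statefulset}-{i}" for i in range(replicas)]

# The "data" template on a 3-replica "web" StatefulSet yields:
print(pvc_names("data", "web", 3))
# ['data-web-0', 'data-web-1', 'data-web-2']
```

Comparing this list against `kubectl get pvc` output is a quick way to spot retained claims from ordinals that no longer have a running Pod.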
4. Ordered Deployment and Scaling
- Scale Up: `0 -> 1 -> 2` (sequential). Pod N+1 is not created until Pod N is Running and Ready.
- Scale Down: `2 -> 1 -> 0` (reverse sequential). The highest-ordinal Pod is terminated first, and the next Pod is not terminated until the previous one is fully shut down.
This ordering is essential for distributed databases. For instance, in a PostgreSQL cluster with streaming replication, you typically want the primary (pod-0) to be up before any replicas (pod-1, pod-2) attempt to connect.
You can relax this ordering constraint by setting .spec.podManagementPolicy to Parallel, which causes all Pods to launch or terminate simultaneously. Use this only when your application does not require ordered startup.
5. Pod Management Policies
The `.spec.podManagementPolicy` field controls how Pods are created and deleted:

- `OrderedReady` (default): Pods are created sequentially (0, 1, 2, ...) and each must be Running and Ready before the next is created. On scale-down, the reverse order is followed (2, 1, 0, ...).
- `Parallel`: All Pods are created or deleted simultaneously, without waiting for predecessors. This is useful for workloads like Cassandra or Elasticsearch where all nodes are peers and do not require a specific startup order.
```yaml
spec:
  podManagementPolicy: Parallel
```
Scaling Operations
Scaling a StatefulSet works similarly to other workloads, but with important behavioral differences:
```shell
# Scale up to 5 replicas
kubectl scale statefulset postgres --replicas=5

# Scale down to 2 replicas
kubectl scale statefulset postgres --replicas=2
```
When scaling up from 3 to 5, Pods `postgres-3` and `postgres-4` are created (sequentially with `OrderedReady`). When scaling down from 5 to 2, Pods are removed in reverse order: `postgres-4` is terminated first, then `postgres-3`, then `postgres-2`. The PVCs `data-postgres-2`, `data-postgres-3`, and `data-postgres-4` are retained and will be reattached if you scale back up.
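The order of operations follows directly from the ordinals. The following Python sketch models the observable order under `OrderedReady` (this is bookkeeping for illustration, not how the controller is implemented):

```python
def scale_plan(name, current, target):
    """Return the ordered (action, pod) steps for OrderedReady scaling."""
    if target >= current:
        # Scale up: create ascending ordinals, one at a time.
        return [("create", f"{name}-{i}") for i in range(current, target)]
    # Scale down: terminate descending ordinals, highest first.
    return [("terminate", f"{name}-{i}") for i in range(current - 1, target - 1, -1)]

print(scale_plan("postgres", 3, 5))
# [('create', 'postgres-3'), ('create', 'postgres-4')]
print(scale_plan("postgres", 5, 2))
# [('terminate', 'postgres-4'), ('terminate', 'postgres-3'), ('terminate', 'postgres-2')]
```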
For production database clusters, always use kubectl scale or declarative manifests rather than direct API calls. Consider combining StatefulSet scaling with Pod Disruption Budgets to protect quorum-based systems during scale-down events.
Full YAML Example: PostgreSQL StatefulSet
```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres
  labels:
    app: postgres
spec:
  ports:
    - port: 5432
      name: postgres
  clusterIP: None
  selector:
    app: postgres
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
              name: postgres
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: password
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: "2"
              memory: 2Gi
          readinessProbe:
            exec:
              command:
                - pg_isready
                - -U
                - postgres
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            exec:
              command:
                - pg_isready
                - -U
                - postgres
            initialDelaySeconds: 30
            periodSeconds: 15
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3
        resources:
          requests:
            storage: 20Gi
```
Key observations:
- `serviceName: postgres` links the StatefulSet to the Headless Service.
- `volumeClaimTemplates` ensures each Pod gets its own dedicated 20Gi disk.
- The `readinessProbe` uses `pg_isready`, which is the correct way to check PostgreSQL health.
- `terminationGracePeriodSeconds: 30` gives PostgreSQL time to flush WAL buffers and shut down cleanly.
Update Strategies
StatefulSets support two update strategies via .spec.updateStrategy.type:
RollingUpdate (Default)
When you update the Pod template (for example, changing the image tag), the StatefulSet controller deletes and recreates each Pod one at a time, starting from the highest ordinal and working down to the lowest. Each Pod must become Running and Ready before the next one is updated.
```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
```

Note that `maxUnavailable` for StatefulSets is gated behind the `MaxUnavailableStatefulSet` feature gate (alpha, introduced in Kubernetes 1.24). Without it, the controller updates exactly one Pod at a time.
Partition Updates
The partition field allows you to perform canary updates. When you set a partition value, only Pods with an ordinal greater than or equal to the partition number are updated. Pods with a lower ordinal retain the old configuration.
```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2
```
With 3 replicas (0, 1, 2) and partition: 2, only pod-2 receives the update. This lets you validate the new version on a single replica before rolling it out to the rest by lowering the partition value to 1, then 0.
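The partition rule reduces to a single predicate: a Pod is updated only if its ordinal is greater than or equal to the partition. A small illustrative sketch:

```python
def pods_to_update(name, replicas, partition):
    """With RollingUpdate + partition, only ordinals >= partition are updated."""
    return [f"{name}-{i}" for i in range(replicas) if i >= partition]

print(pods_to_update("web", 3, 2))  # ['web-2']        (canary)
print(pods_to_update("web", 3, 1))  # ['web-1', 'web-2']
print(pods_to_update("web", 3, 0))  # full rollout
```

Lowering the partition step by step is how you widen a canary into a full rollout while keeping the lowest ordinals (often the primary) on the known-good version the longest.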
OnDelete
With OnDelete, the StatefulSet controller does not automatically update Pods when the template changes. You must manually delete each Pod, and the controller recreates it with the new template. This gives you complete control over the rollout process.
```yaml
spec:
  updateStrategy:
    type: OnDelete
```
This strategy is useful when you need human-in-the-loop verification for each instance, which is common for databases in production.
StatefulSet vs. Deployment
| Feature | Deployment | StatefulSet |
|---|---|---|
| Pod names | Random suffix (nginx-6b8f7c-xk4r2) | Ordinal index (nginx-0, nginx-1) |
| Storage | Shared or no persistent storage | Dedicated PVC per Pod |
| Network identity | Pods are interchangeable behind a Service | Each Pod has a unique, stable DNS name |
| Scaling order | Parallel (all at once) | Sequential (ordered) by default |
| Update strategy | RollingUpdate, Recreate | RollingUpdate (with partition), OnDelete |
| PVC lifecycle | Shared; deleted with the workload | Retained after Pod or StatefulSet deletion |
| Use case | Stateless web apps, APIs | Databases, distributed systems, message queues |
The key decision criterion: if your Pods are interchangeable and do not need their own persistent data, use a Deployment. If each Pod has a distinct role, needs its own volume, or must be addressable individually, use a StatefulSet.
Common Pitfalls
- Forgetting the Headless Service: If `serviceName` references a Service that does not exist, Pods are still created, but their stable DNS names do not resolve, which breaks peer discovery. Create the Headless Service before or alongside the StatefulSet.
- PVCs are not deleted on scale-down: When you scale from 3 replicas to 1, the PVCs for `pod-1` and `pod-2` remain. If you later scale back up, the new Pods reattach to those same PVCs. This is intentional but can cause confusion if old data is present.
- Storage class must support dynamic provisioning: If the `storageClassName` in your `volumeClaimTemplates` does not support dynamic provisioning, PVCs will remain in `Pending` status and Pods will not start. Always verify that the storage class exists and is functional in your cluster.
- Ordered startup can be slow: With the default `OrderedReady` pod management policy, a 10-replica StatefulSet must start each Pod sequentially. If each Pod takes 30 seconds to become Ready, the full startup takes 5 minutes. Consider `Parallel` pod management if ordering is not required.
- Updating the `volumeClaimTemplates` is not allowed: You cannot modify the `volumeClaimTemplates` field of an existing StatefulSet. To change storage size or storage class, you must delete the StatefulSet (with `--cascade=orphan`), modify the PVCs manually, and recreate the StatefulSet.
- Node failures require manual intervention: If a node becomes unreachable, the Pods on that node remain in `Terminating` status indefinitely because Kubernetes cannot confirm they have actually stopped. For StatefulSets, this means a replacement Pod is not created automatically. You may need to force-delete the Pod or remove the failed node from the cluster.
Best Practices
- Always set resource requests and limits: Databases are resource-sensitive. Inadequate CPU or memory leads to poor query performance or OOM kills that can corrupt data.
- Use readiness probes: Ensure that the StatefulSet controller waits for each Pod to be truly ready before proceeding to the next one. Use application-specific health checks (e.g., `pg_isready` for PostgreSQL, `mysqladmin ping` for MySQL).
- Set `terminationGracePeriodSeconds` appropriately: Databases need time to flush data to disk before shutting down. The default 30 seconds may not be enough for large databases. Set this to 60 seconds or more for production workloads.
- Back up your data: PersistentVolumes provide durability, not backups. Use tools like `pg_dump`, Velero, or volume snapshots to create regular backups.
- Use Pod Disruption Budgets: A PDB with `maxUnavailable: 1` ensures that node drains and voluntary disruptions do not take down your entire database cluster simultaneously.
- Consider operators for complex databases: For production database deployments, consider using a Kubernetes Operator (e.g., CloudNativePG for PostgreSQL, Percona Operator for MySQL, MongoDB Community Operator). These operators handle replication, failover, backups, and upgrades automatically.
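As a sketch of the Pod Disruption Budget mentioned above (the name and selector assume the PostgreSQL example from earlier; adjust the labels to match your workload):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: postgres-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: postgres
```

With this in place, `kubectl drain` will evict at most one `app: postgres` Pod at a time, giving a quorum-based cluster time to re-elect before the next member goes down.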
What's Next?
- ConfigMaps & Secrets: Learn how to inject database credentials and configuration files into your StatefulSet Pods without baking them into container images.
- DaemonSets: Understand how DaemonSets run one Pod per node for cluster-wide infrastructure like logging and monitoring.
- Health Checks: Deep dive into liveness, readiness, and startup probes that are critical for stateful workloads.
- Storage: Learn about PersistentVolumes, StorageClasses, and CSI drivers that power StatefulSet storage.
- Resource Management: Properly size CPU and memory for database workloads.