StatefulSets: For Databases
- Identity: Assigns a sticky, unique network ID (e.g., `web-0`) to each Pod. Pod names are deterministic and persist across rescheduling, unlike Deployment Pods, which get random suffixes.
- Persistence: Each Pod maintains its own PersistentVolume across restarts via `volumeClaimTemplates`. When a Pod is rescheduled to a different node, its PVC follows it.
- Ordering: Starts and stops Pods in a predictable, sequential order. Pod N+1 does not start until Pod N is Running and Ready.
- Headless Service Required: StatefulSets require a Headless Service (a Service with `clusterIP: None`) to provide DNS entries for each individual Pod.
- Not the Default Choice: Use Deployments for stateless workloads. Only use StatefulSets when your application genuinely requires stable identity, stable storage, or ordered lifecycle management.
A StatefulSet manages the deployment and scaling of a set of Pods, providing guarantees about the ordering and uniqueness of these Pods.
Unlike a Deployment, a StatefulSet maintains a sticky identity for each of its Pods. These Pods are created from the same spec, but are not interchangeable: each has a persistent identifier that it maintains across any rescheduling. This makes StatefulSets the right choice for workloads where each instance has a distinct role or must retain its data across restarts.
Use Cases
StatefulSets are valuable for applications that require one or more of the following:
- Stable, unique network identifiers (e.g., `mysql-0`, `mysql-1`)
- Stable, persistent storage (e.g., specific PVCs linked to specific Pods)
- Ordered, graceful deployment and scaling.
- Ordered, automated rolling updates.
Examples:
- Databases (PostgreSQL, MySQL, MongoDB)
- Distributed systems (Zookeeper, Kafka, Elasticsearch)
- Message queues (RabbitMQ, NATS with JetStream)
Key Features
1. Stable Network ID
Pods in a StatefulSet get a predictable name: `$(statefulset-name)-$(ordinal)`.
If you define a StatefulSet named `web` with 3 replicas, you get:
`web-0`, `web-1`, `web-2`
Each Pod also gets a stable DNS entry when paired with a Headless Service. If the Headless Service is named nginx in the namespace default, each Pod is addressable at:
```
web-0.nginx.default.svc.cluster.local
web-1.nginx.default.svc.cluster.local
web-2.nginx.default.svc.cluster.local
```
This is critical for distributed systems where nodes need to discover and connect to specific peers. For example, a PostgreSQL replica needs to know the exact address of the primary instance to set up streaming replication.
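Because the naming scheme is purely mechanical, a peer list can be derived from just the StatefulSet name, the Headless Service name, the namespace, and the replica count. Here is a minimal illustrative Python sketch (the names and the default `cluster.local` domain are assumptions, not required values):

```python
def stable_dns_names(statefulset, service, namespace, replicas,
                     cluster_domain="cluster.local"):
    """Return the stable FQDN of each Pod in a StatefulSet.

    Pod N is always named f"{statefulset}-{N}", and with a Headless
    Service each Pod is addressable at
    <pod>.<service>.<namespace>.svc.<cluster-domain>.
    """
    return [
        f"{statefulset}-{i}.{service}.{namespace}.svc.{cluster_domain}"
        for i in range(replicas)
    ]

# The "web" StatefulSet behind the "nginx" Headless Service:
peers = stable_dns_names("web", "nginx", "default", 3)
# -> ['web-0.nginx.default.svc.cluster.local',
#     'web-1.nginx.default.svc.cluster.local',
#     'web-2.nginx.default.svc.cluster.local']
```

A replica's bootstrap script could use exactly this kind of derivation to find its primary without any external service registry.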
2. Headless Service
A StatefulSet requires a Headless Service to provide network identity to its Pods. A Headless Service is a regular Kubernetes Service with `clusterIP: None`:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres
  labels:
    app: postgres
spec:
  ports:
    - port: 5432
      name: postgres
  clusterIP: None
  selector:
    app: postgres
```
With a standard Service, clients connect to a single virtual IP and traffic is load-balanced across Pods. With a Headless Service, a DNS lookup returns the individual Pod IPs, allowing clients to connect to a specific Pod by name. The cluster DNS service (typically CoreDNS) creates A/AAAA records for each Pod automatically.
3. Stable Storage with volumeClaimTemplates
Kubernetes creates a PersistentVolumeClaim (PVC) for each Pod based on a `volumeClaimTemplate`. If `web-0` dies and is rescheduled on a different node, it re-attaches to the same storage volume it had before.

```yaml
volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp3
      resources:
        requests:
          storage: 10Gi
```
This produces PVCs named `data-web-0`, `data-web-1`, `data-web-2`. The naming convention is `$(volumeClaimTemplate-name)-$(statefulset-name)-$(ordinal)`.
Important: When you scale down a StatefulSet, the PVCs are not deleted. This is by design -- it protects against accidental data loss. You must manually delete the PVCs if you want to reclaim the storage.
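Since PVC names are deterministic too, you can predict exactly which claims a StatefulSet owns, which is useful when auditing leftover storage after a scale-down. A small illustrative sketch:

```python
def pvc_names(template_name, statefulset, replicas):
    """PVCs follow $(volumeClaimTemplate-name)-$(statefulset-name)-$(ordinal)."""
    return [f"{template_name}-{statefulset}-{i}" for i in range(replicas)]

# The "data" template on a 3-replica "web" StatefulSet yields:
print(pvc_names("data", "web", 3))
# ['data-web-0', 'data-web-1', 'data-web-2']
```

Comparing this list against `kubectl get pvc` output is a quick way to spot retained claims from ordinals that no longer have a running Pod.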
4. Ordered Deployment and Scaling
- Scale Up: `0 -> 1 -> 2` (sequential). Pod N+1 is not created until Pod N is Running and Ready.
- Scale Down: `2 -> 1 -> 0` (reverse sequential). The highest-ordinal Pod is terminated first, and the next Pod is not terminated until the previous one is fully shut down.
This ordering is essential for distributed databases. For instance, in a PostgreSQL cluster with streaming replication, you typically want the primary (pod-0) to be up before any replicas (pod-1, pod-2) attempt to connect.
You can relax this ordering constraint by setting .spec.podManagementPolicy to Parallel, which causes all Pods to launch or terminate simultaneously. Use this only when your application does not require ordered startup.
5. Pod Management Policies
The `.spec.podManagementPolicy` field controls how Pods are created and deleted:

- `OrderedReady` (default): Pods are created sequentially (0, 1, 2, ...) and each must be Running and Ready before the next is created. On scale-down, the reverse order is followed (2, 1, 0, ...).
- `Parallel`: All Pods are created or deleted simultaneously, without waiting for predecessors. This is useful for workloads like Cassandra or Elasticsearch where all nodes are peers and do not require a specific startup order.
```yaml
spec:
  podManagementPolicy: Parallel
```
Scaling Operations
Scaling a StatefulSet works similarly to other workloads, but with important behavioral differences:
```shell
# Scale up to 5 replicas
kubectl scale statefulset postgres --replicas=5

# Scale down to 2 replicas
kubectl scale statefulset postgres --replicas=2
```
When scaling up from 3 to 5, Pods `postgres-3` and `postgres-4` are created (sequentially with `OrderedReady`). When scaling down from 5 to 2, Pods are removed in reverse order: `postgres-4` is terminated first, then `postgres-3`, then `postgres-2`. The PVCs `data-postgres-2`, `data-postgres-3`, and `data-postgres-4` are retained and will be reattached if you scale back up.
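The order of operations follows directly from the ordinals. The following Python sketch models the observable order under `OrderedReady` (this is bookkeeping for illustration, not how the controller is implemented):

```python
def scale_plan(name, current, target):
    """Return the ordered (action, pod) steps for OrderedReady scaling."""
    if target >= current:
        # Scale up: create ascending ordinals, one at a time.
        return [("create", f"{name}-{i}") for i in range(current, target)]
    # Scale down: terminate descending ordinals, highest first.
    return [("terminate", f"{name}-{i}") for i in range(current - 1, target - 1, -1)]

print(scale_plan("postgres", 3, 5))
# [('create', 'postgres-3'), ('create', 'postgres-4')]
print(scale_plan("postgres", 5, 2))
# [('terminate', 'postgres-4'), ('terminate', 'postgres-3'), ('terminate', 'postgres-2')]
```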
For production database clusters, always use kubectl scale or declarative manifests rather than direct API calls. Consider combining StatefulSet scaling with Pod Disruption Budgets to protect quorum-based systems during scale-down events.
Full YAML Example: PostgreSQL StatefulSet
```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres
  labels:
    app: postgres
spec:
  ports:
    - port: 5432
      name: postgres
  clusterIP: None
  selector:
    app: postgres
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
              name: postgres
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: password
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: "2"
              memory: 2Gi
          readinessProbe:
            exec:
              command:
                - pg_isready
                - -U
                - postgres
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            exec:
              command:
                - pg_isready
                - -U
                - postgres
            initialDelaySeconds: 30
            periodSeconds: 15
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3
        resources:
          requests:
            storage: 20Gi
```
Key observations:
- `serviceName: postgres` links the StatefulSet to the Headless Service.
- `volumeClaimTemplates` ensures each Pod gets its own dedicated 20Gi disk.
- The `readinessProbe` uses `pg_isready`, which is the correct way to check PostgreSQL health.
- `terminationGracePeriodSeconds: 30` gives PostgreSQL time to flush WAL buffers and shut down cleanly.
Update Strategies
StatefulSets support two update strategies via .spec.updateStrategy.type:
RollingUpdate (Default)
When you update the Pod template (for example, changing the image tag), the StatefulSet controller deletes and recreates each Pod one at a time, starting from the highest ordinal and working down to the lowest. Each Pod must become Running and Ready before the next one is updated.
```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
```

Note that `maxUnavailable` for StatefulSets is gated behind the `MaxUnavailableStatefulSet` feature gate (alpha, introduced in Kubernetes 1.24). Without it, the controller updates exactly one Pod at a time.
Partition Updates
The partition field allows you to perform canary updates. When you set a partition value, only Pods with an ordinal greater than or equal to the partition number are updated. Pods with a lower ordinal retain the old configuration.
```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2
```
With 3 replicas (0, 1, 2) and partition: 2, only pod-2 receives the update. This lets you validate the new version on a single replica before rolling it out to the rest by lowering the partition value to 1, then 0.
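The partition rule reduces to a single predicate: a Pod is updated only if its ordinal is greater than or equal to the partition. A small illustrative sketch:

```python
def pods_to_update(name, replicas, partition):
    """With RollingUpdate + partition, only ordinals >= partition are updated."""
    return [f"{name}-{i}" for i in range(replicas) if i >= partition]

print(pods_to_update("web", 3, 2))  # ['web-2']        (canary)
print(pods_to_update("web", 3, 1))  # ['web-1', 'web-2']
print(pods_to_update("web", 3, 0))  # full rollout
```

Lowering the partition step by step is how you widen a canary into a full rollout while keeping the lowest ordinals (often the primary) on the known-good version the longest.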
OnDelete
With OnDelete, the StatefulSet controller does not automatically update Pods when the template changes. You must manually delete each Pod, and the controller recreates it with the new template. This gives you complete control over the rollout process.
```yaml
spec:
  updateStrategy:
    type: OnDelete
```
This strategy is useful when you need human-in-the-loop verification for each instance, which is common for databases in production.
StatefulSet vs. Deployment
| Feature | Deployment | StatefulSet |
|---|---|---|
| Pod names | Random suffix (nginx-6b8f7c-xk4r2) | Ordinal index (nginx-0, nginx-1) |
| Storage | Shared or no persistent storage | Dedicated PVC per Pod |
| Network identity | Pods are interchangeable behind a Service | Each Pod has a unique, stable DNS name |
| Scaling order | Parallel (all at once) | Sequential (ordered) by default |
| Update strategy | RollingUpdate, Recreate | RollingUpdate (with partition), OnDelete |
| PVC lifecycle | Shared; deleted with the workload | Retained after Pod or StatefulSet deletion |
| Use case | Stateless web apps, APIs | Databases, distributed systems, message queues |
The key decision criterion: if your Pods are interchangeable and do not need their own persistent data, use a Deployment. If each Pod has a distinct role, needs its own volume, or must be addressable individually, use a StatefulSet.
Common Pitfalls
- Forgetting the Headless Service: If `serviceName` references a Service that does not exist, Pods are still created, but their stable DNS names do not resolve, which breaks peer discovery. Create the Headless Service before or alongside the StatefulSet.
- PVCs are not deleted on scale-down: When you scale from 3 replicas to 1, the PVCs for `pod-1` and `pod-2` remain. If you later scale back up, the new Pods reattach to those same PVCs. This is intentional but can cause confusion if old data is present.
- Storage class must support dynamic provisioning: If the `storageClassName` in your `volumeClaimTemplates` does not support dynamic provisioning, PVCs will remain in `Pending` status and Pods will not start. Always verify that the storage class exists and is functional in your cluster.
- Ordered startup can be slow: With the default `OrderedReady` pod management policy, a 10-replica StatefulSet must start each Pod sequentially. If each Pod takes 30 seconds to become Ready, the full startup takes 5 minutes. Consider `Parallel` pod management if ordering is not required.
- Updating the `volumeClaimTemplates` is not allowed: You cannot modify the `volumeClaimTemplates` field of an existing StatefulSet. To change storage size or storage class, you must delete the StatefulSet (with `--cascade=orphan`), modify the PVCs manually, and recreate the StatefulSet.
- Node failures require manual intervention: If a node becomes unreachable, the Pods on that node remain in `Terminating` status indefinitely because Kubernetes cannot confirm they have actually stopped. For StatefulSets, this means a replacement Pod is not created automatically. You may need to force-delete the Pod or remove the failed node from the cluster.
Best Practices
- Always set resource requests and limits: Databases are resource-sensitive. Inadequate CPU or memory leads to poor query performance or OOM kills that can corrupt data.
- Use readiness probes: Ensure that the StatefulSet controller waits for each Pod to be truly ready before proceeding to the next one. Use application-specific health checks (e.g., `pg_isready` for PostgreSQL, `mysqladmin ping` for MySQL).
- Set `terminationGracePeriodSeconds` appropriately: Databases need time to flush data to disk before shutting down. The default 30 seconds may not be enough for large databases. Set this to 60 seconds or more for production workloads.
- Back up your data: PersistentVolumes provide durability, not backups. Use tools like `pg_dump`, Velero, or volume snapshots to create regular backups.
- Use Pod Disruption Budgets: A PDB with `maxUnavailable: 1` ensures that node drains and voluntary disruptions do not take down your entire database cluster simultaneously.
- Consider operators for complex databases: For production database deployments, consider using a Kubernetes Operator (e.g., CloudNativePG for PostgreSQL, Percona Operator for MySQL, MongoDB Community Operator). These operators handle replication, failover, backups, and upgrades automatically.
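As a sketch of the Pod Disruption Budget mentioned above (the name and selector assume the PostgreSQL example from earlier; adjust the labels to match your workload):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: postgres-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: postgres
```

With this in place, `kubectl drain` will evict at most one `app: postgres` Pod at a time, giving a quorum-based cluster time to re-elect before the next member goes down.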
What's Next?
- ConfigMaps & Secrets: Learn how to inject database credentials and configuration files into your StatefulSet Pods without baking them into container images.
- DaemonSets: Understand how DaemonSets run one Pod per node for cluster-wide infrastructure like logging and monitoring.
- Health Checks: Deep dive into liveness, readiness, and startup probes that are critical for stateful workloads.
- Storage: Learn about PersistentVolumes, StorageClasses, and CSI drivers that power StatefulSet storage.
- Resource Management: Properly size CPU and memory for database workloads.