Advanced Storage Patterns
- CSI Standard: The Container Storage Interface (CSI) decouples storage drivers from the Kubernetes core binary, allowing storage vendors (AWS, Azure, GCP, NetApp, Portworx, Ceph) to develop, release, and update their drivers independently. CSI drivers are installed as Pods in your cluster, not compiled into Kubernetes.
- Dynamic Provisioning with StorageClasses: StorageClasses enable automatic PersistentVolume creation when a PVC is submitted. Critical parameters include `reclaimPolicy` (Delete vs Retain for data safety), `volumeBindingMode` (Immediate vs WaitForFirstConsumer for topology alignment), and provider-specific parameters (disk type, IOPS, encryption).
- Day 2 Operations: Modern CSI drivers support on-the-fly volume expansion (resize a PVC without downtime), point-in-time snapshots for backup and recovery, and volume cloning for creating copies of existing data.
- Topology-Aware Provisioning: Using `volumeBindingMode: WaitForFirstConsumer` ensures storage is provisioned in the same availability zone as the scheduled Pod, preventing cross-AZ mount failures that are a common source of StatefulSet scheduling issues.
- Reclaim Policies Are Critical: The difference between `Delete` (disk destroyed when the PVC is deleted) and `Retain` (disk preserved for manual recovery) is the difference between data loss and data safety. Always use `Retain` for production databases and stateful workloads.
In the Core Concepts, we covered PersistentVolumeClaim (PVC) for requesting storage. Now let's look at how storage actually works under the hood, how to handle Day 2 operations (expansion, snapshots, migration), and how to configure storage for production workloads.
1. Container Storage Interface (CSI)
In the early days of Kubernetes, volume plugins for AWS EBS, GCP PD, NFS, and other storage systems were compiled directly into the Kubernetes binary. This "in-tree" approach meant that adding a new storage backend required changing Kubernetes itself, and storage vendors had to synchronize their release cycles with Kubernetes releases.
The Container Storage Interface (CSI) solved this by defining a standard gRPC-based interface between Kubernetes and storage drivers. Storage vendors now implement CSI drivers as independent containers that run in your cluster.
CSI Architecture
A CSI driver consists of two components:
- Controller Plugin (Deployment): Handles volume lifecycle operations that do not need to run on a specific node -- creating volumes, deleting volumes, taking snapshots, expanding volumes. It communicates with the cloud provider's API to provision the actual storage.
- Node Plugin (DaemonSet): Runs on every node and handles operations that must happen on the node where the pod is scheduled -- mounting the volume to the pod's filesystem, formatting the disk, and unmounting on pod termination.
Both plugins communicate with Kubernetes through sidecar containers provided by the Kubernetes CSI project:
- `csi-provisioner`: Watches for new PVCs and calls the CSI driver's `CreateVolume` RPC.
- `csi-attacher`: Watches for `VolumeAttachment` objects and calls the driver's `ControllerPublishVolume` RPC.
- `csi-resizer`: Watches for PVC size changes and calls `ControllerExpandVolume`.
- `csi-snapshotter`: Watches for `VolumeSnapshot` objects and calls `CreateSnapshot`.
Installing a CSI Driver
CSI drivers are typically installed via Helm:
# Install the AWS EBS CSI driver
helm repo add aws-ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
helm install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
--namespace kube-system \
--set controller.serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::123456789:role/ebs-csi-role
After installation, verify the driver is registered:
kubectl get csidrivers
# NAME ATTACHREQUIRED PODINFOONMOUNT MODES AGE
# ebs.csi.aws.com true false Persistent 5m
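You can also see the sidecar pattern in practice by listing the containers in the controller Deployment. The Deployment name below assumes the Helm chart defaults for the AWS EBS CSI driver; adjust for your driver and namespace:

```shell
# List the containers in the CSI controller Deployment -- expect the driver
# container (e.g. ebs-plugin) alongside the csi-provisioner, csi-attacher,
# csi-resizer, and csi-snapshotter sidecars
kubectl get deployment ebs-csi-controller -n kube-system \
  -o jsonpath='{.spec.template.spec.containers[*].name}'
```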
2. StorageClasses and Dynamic Provisioning
Instead of manually creating PersistentVolume objects for every disk, StorageClasses enable dynamic provisioning -- Kubernetes automatically creates the underlying storage when a PVC is submitted.
StorageClass Configuration
# storageclass-fast-ssd.yaml
# High-performance SSD storage for databases
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: fast-ssd
annotations:
storageclass.kubernetes.io/is-default-class: "false"
provisioner: ebs.csi.aws.com # The CSI driver to use
parameters:
type: gp3 # AWS EBS volume type
iops: "5000" # Provisioned IOPS (gp3 supports up to 16,000)
throughput: "250" # MB/s throughput (gp3 supports up to 1,000)
encrypted: "true" # Encrypt the volume at rest
kmsKeyId: "arn:aws:kms:us-east-1:123456789:key/abcd-1234" # Custom KMS key
reclaimPolicy: Retain # Keep the disk when PVC is deleted
volumeBindingMode: WaitForFirstConsumer # Wait for pod scheduling before provisioning
allowVolumeExpansion: true # Allow PVC resize
mountOptions:
- noatime # Skip updating file access times for performance
StorageClass Parameters by Provider
| Provider | Provisioner | Key Parameters |
|---|---|---|
| AWS EBS | ebs.csi.aws.com | type (gp3, io2, st1), iops, throughput, encrypted, kmsKeyId |
| GCP PD | pd.csi.storage.gke.io | type (pd-standard, pd-ssd, pd-balanced), replication-type |
| Azure Disk | disk.csi.azure.com | skuName (Premium_LRS, StandardSSD_LRS), cachingMode |
| Ceph RBD | rbd.csi.ceph.com | clusterID, pool, imageFeatures |
| NFS | nfs.csi.k8s.io | server, share |
Reclaim Policy
The reclaim policy determines what happens to the underlying storage when the PVC is deleted:
- `Delete` (default for most cloud StorageClasses): The PersistentVolume and the underlying cloud disk are deleted. Data is permanently lost. Use this for ephemeral workloads, test environments, and caches.
- `Retain`: The PersistentVolume is kept in the `Released` state. The underlying disk is preserved but not accessible to new PVCs. You must manually reclaim the data and delete the PV. Always use this for databases and stateful workloads in production.
# Change the reclaim policy of an existing PV (e.g., after provisioning with Delete)
kubectl patch pv pvc-abc123 -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
WaitForFirstConsumer
The volumeBindingMode field controls when dynamic provisioning occurs:
- `Immediate`: Create the disk as soon as the PVC is created. The disk is provisioned in an arbitrary AZ. If the pod is later scheduled in a different AZ, the volume cannot be mounted, and the pod is stuck in `Pending`.
- `WaitForFirstConsumer`: Wait until a pod references the PVC and is scheduled to a specific node. The disk is then created in the same AZ as the node. Always use this for block storage (EBS, PD, Azure Disk) to avoid cross-AZ issues.
Access Modes
- `ReadWriteOnce` (RWO): The volume can be mounted as read-write by a single node. This is the most common mode for block storage (EBS, PD).
- `ReadWriteMany` (RWX): The volume can be mounted as read-write by many nodes simultaneously. Required for shared filesystems (NFS, EFS, Azure Files).
- `ReadOnlyMany` (ROX): The volume can be mounted as read-only by many nodes. Useful for distributing static datasets.
- `ReadWriteOncePod` (RWOP): The volume can be mounted as read-write by a single pod (not just a single node). This provides stronger guarantees than RWO and is available in Kubernetes 1.27+.
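For example, a PVC that enforces single-pod exclusivity simply requests `ReadWriteOncePod` (a sketch; it assumes your CSI driver supports RWOP and reuses the `fast-ssd` StorageClass from above):

```yaml
# pvc-rwop.yaml
# Only one pod at a time may mount this volume read-write
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: exclusive-data
spec:
  accessModes:
  - ReadWriteOncePod   # stronger than ReadWriteOnce: single pod, not single node
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 10Gi
```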
3. Volume Expansion
When a volume runs out of space, you can expand it by editing the PVC's `spec.resources.requests.storage` field. The CSI driver handles the underlying disk resize and filesystem expansion.
# Expand a PVC from 50Gi to 100Gi
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: postgres-data
namespace: production
spec:
accessModes:
- ReadWriteOnce
storageClassName: fast-ssd # Must have allowVolumeExpansion: true
resources:
requests:
storage: 100Gi # Changed from 50Gi to 100Gi
Online expansion: Most modern CSI drivers (EBS, GCP PD, Azure Disk) support online expansion -- the filesystem grows while the volume is mounted and the pod is running. No downtime required.
Offline expansion: Some drivers or filesystem types require the volume to be unmounted before expansion. In this case, you must scale the pod to 0, wait for the expansion to complete, then scale back up.
Important: Volume expansion is one-way. You cannot shrink a PVC. Always set an appropriate initial size and use monitoring to alert when volumes are approaching capacity.
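The expansion can also be triggered and observed from the command line. These are illustrative commands against the `postgres-data` PVC from the manifest above:

```shell
# Trigger the expansion by patching the PVC's storage request
kubectl patch pvc postgres-data -n production \
  -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'

# Watch the resize progress; conditions such as Resizing or
# FileSystemResizePending appear while the expansion is in flight
kubectl describe pvc postgres-data -n production

# status.capacity reflects the new size once the filesystem has grown
kubectl get pvc postgres-data -n production \
  -o jsonpath='{.status.capacity.storage}'
```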
4. Volume Snapshots
Volume snapshots create point-in-time copies of your persistent volumes. They are useful for backups, data protection before upgrades, and creating test environments from production data.
VolumeSnapshotClass
# volumesnapshotclass.yaml
# Define how snapshots are created for EBS volumes
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
name: ebs-snapshot-class
driver: ebs.csi.aws.com # Must match the StorageClass provisioner
deletionPolicy: Retain # Keep snapshot even if VolumeSnapshot CR is deleted
Creating a Snapshot
# volume-snapshot.yaml
# Take a snapshot of the postgres data volume
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: postgres-snapshot-20241215
namespace: production
spec:
volumeSnapshotClassName: ebs-snapshot-class
source:
persistentVolumeClaimName: postgres-data # The PVC to snapshot
Restoring from a Snapshot
# pvc-from-snapshot.yaml
# Create a new PVC from a snapshot (e.g., for a test environment)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: postgres-data-restored
namespace: staging
spec:
accessModes:
- ReadWriteOnce
storageClassName: fast-ssd
resources:
requests:
storage: 100Gi # Must be >= the snapshot size
dataSource:
name: postgres-snapshot-20241215
kind: VolumeSnapshot
apiGroup: snapshot.storage.k8s.io
5. Volume Cloning
Volume cloning creates a copy of an existing PVC without going through the snapshot intermediate step. The new PVC is a full, independent copy of the source data.
# pvc-clone.yaml
# Clone a PVC for testing (data is copied, not shared)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: postgres-data-clone
namespace: staging
spec:
accessModes:
- ReadWriteOnce
storageClassName: fast-ssd
resources:
requests:
storage: 100Gi
dataSource:
name: postgres-data # Source PVC to clone
kind: PersistentVolumeClaim
Cloning is useful for creating staging environments with real data, running one-off analytics on a copy of production data, or testing database migrations on a copy before running them on the real database.
6. Raw Block Volumes
For workloads that need direct block device access without a filesystem (some databases, caching layers, or custom storage engines), Kubernetes supports raw block volumes:
# raw-block-pod.yaml
# Mount a volume as a raw block device instead of a filesystem
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: raw-block-pvc
spec:
accessModes:
- ReadWriteOnce
volumeMode: Block # Block instead of Filesystem
storageClassName: fast-ssd
resources:
requests:
storage: 50Gi
---
apiVersion: v1
kind: Pod
metadata:
name: block-consumer
spec:
containers:
- name: app
image: myapp/block-consumer:v1
volumeDevices: # volumeDevices instead of volumeMounts
- name: data
devicePath: /dev/xvda # Exposed as a block device
volumes:
- name: data
persistentVolumeClaim:
claimName: raw-block-pvc
7. Topology-Aware Provisioning
In multi-zone clusters, topology-aware provisioning ensures that volumes are created in the same availability zone as the pods that use them. This is handled by the WaitForFirstConsumer binding mode in the StorageClass.
For StatefulSets with multiple replicas, Kubernetes uses topologySpreadConstraints or pod anti-affinity to distribute pods across zones, and WaitForFirstConsumer ensures each pod's PVC is created in the correct zone.
# statefulset-with-topology.yaml
# A StatefulSet that spreads across AZs with topology-aligned storage
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgres
namespace: production
spec:
replicas: 3
selector:
matchLabels:
app: postgres
template:
metadata:
labels:
app: postgres
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: postgres
containers:
- name: postgres
image: postgres:15
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: fast-ssd # Uses WaitForFirstConsumer
resources:
requests:
storage: 100Gi
8. Ephemeral Volumes
Not all volumes need to be persistent. Kubernetes provides several ephemeral volume types for temporary data:
emptyDir: Created when a pod is assigned to a node, deleted when the pod is removed. Useful for scratch space, caches, and inter-container data sharing within a pod. Can be backed by the node's disk or RAM (medium: Memory).
volumes:
- name: cache
emptyDir:
sizeLimit: 1Gi # Limit to prevent filling node disk
- name: tmpfs
emptyDir:
medium: Memory # RAM-backed for performance (counts against memory limits)
sizeLimit: 256Mi
configMap and secret volumes: Project ConfigMap data or Secret data as files in the container. Updates to the ConfigMap/Secret are eventually reflected in the mounted files (with a delay of up to the kubelet sync period).
projected volumes: Combine multiple volume sources (configMap, secret, downwardAPI, serviceAccountToken) into a single mount point.
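A minimal sketch of a projected volume, combining a ConfigMap, a Secret, and pod metadata under one mount point (the `app-config` and `app-secrets` object names are assumptions):

```yaml
# projected-volume.yaml
# Present config, secrets, and the pod's own name as files under /etc/app
apiVersion: v1
kind: Pod
metadata:
  name: projected-demo
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ["sleep", "3600"]
    volumeMounts:
    - name: all-in-one
      mountPath: /etc/app
      readOnly: true
  volumes:
  - name: all-in-one
    projected:
      sources:
      - configMap:
          name: app-config      # assumed to exist
      - secret:
          name: app-secrets     # assumed to exist
      - downwardAPI:
          items:
          - path: pod-name
            fieldRef:
              fieldPath: metadata.name
```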
9. Data Migration Strategies
Moving data between storage backends or clusters requires careful planning:
In-cluster migration (change StorageClass): Create a new PVC with the desired StorageClass, run a Job that copies data from the old PVC to the new PVC (using rsync or cp), update the workload to reference the new PVC, then delete the old PVC.
# migration-job.yaml
# Copy data from one PVC to another
apiVersion: batch/v1
kind: Job
metadata:
name: migrate-storage
spec:
template:
spec:
restartPolicy: Never
containers:
- name: migrate
image: alpine:3
command:
- sh
- -c
- "cp -av /source/* /destination/"
volumeMounts:
- name: source
mountPath: /source
readOnly: true
- name: destination
mountPath: /destination
volumes:
- name: source
persistentVolumeClaim:
claimName: old-pvc
- name: destination
persistentVolumeClaim:
claimName: new-pvc
Cross-cluster migration: Use Velero with file-level backup (Restic/Kopia) to back up volumes from the source cluster and restore them in the destination cluster. This works across cloud providers since it operates at the filesystem level.
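A minimal sketch of this flow with the Velero CLI (flag names as of recent Velero releases; adjust the backup name and namespace for your environment):

```shell
# On the source cluster: back up the namespace, including volume data
# via file-system backup (Restic/Kopia)
velero backup create prod-backup \
  --include-namespaces production \
  --default-volumes-to-fs-backup

# On the destination cluster (configured against the same object storage
# bucket): restore everything from that backup
velero restore create --from-backup prod-backup
```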
10. Backup Strategies for Persistent Data
Persistent data backup should be layered:
- Application-level backups: Use the application's native tools (`pg_dump`, `mongodump`, `mysqldump`). These produce logically consistent backups and are the most reliable for databases.
- Volume snapshots: Use VolumeSnapshot objects for point-in-time block-level copies. Fast and efficient but provider-specific.
- File-level backups: Use Velero with Restic/Kopia for portable, cross-provider backups. Slower but works with any volume type.
- Continuous replication: For near-zero RPO, use database-native replication (streaming replication for PostgreSQL, replica sets for MongoDB) to a standby in another region.
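As an illustration of the application-level layer, a nightly `pg_dump` can be scheduled with a CronJob. This is a sketch: the Service name (`postgres`), the `postgres-credentials` Secret, and the `postgres-backups` PVC are all assumptions you would replace with your own objects:

```yaml
# pg-dump-cronjob.yaml
# Nightly logical backup of a PostgreSQL database to a dedicated backup PVC
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-logical-backup
  namespace: production
spec:
  schedule: "0 3 * * *"          # 03:00 every night
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: pg-dump
            image: postgres:15
            command:
            - sh
            - -c
            - pg_dump -h postgres -U app -Fc -f /backups/app-$(date +%Y%m%d).dump app
            env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials   # assumed Secret
                  key: password
            volumeMounts:
            - name: backups
              mountPath: /backups
          volumes:
          - name: backups
            persistentVolumeClaim:
              claimName: postgres-backups      # assumed backup PVC
```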
Common Pitfalls
- Using `volumeBindingMode: Immediate` with block storage: This provisions the disk before the pod is scheduled. If the pod lands in a different AZ, it cannot mount the volume and gets stuck in `Pending`. Always use `WaitForFirstConsumer`.
- Forgetting `allowVolumeExpansion: true`: If your StorageClass does not have this flag, PVC resize requests are rejected by the API server. Unlike most StorageClass fields, `allowVolumeExpansion` is mutable, so it can be added to an existing StorageClass later.
- Using the `Delete` reclaim policy for production databases: One accidental `kubectl delete pvc` destroys your data permanently. Use `Retain` for any stateful workload.
- Not monitoring volume capacity: Kubernetes does not alert when a PVC is nearly full. Use Prometheus with `kubelet_volume_stats_used_bytes` and `kubelet_volume_stats_capacity_bytes` to alert at 80% capacity.
- Ignoring IOPS and throughput limits: Cloud provider disks have IOPS and throughput caps that can cause severe performance degradation. Size your StorageClass parameters based on workload requirements, not just capacity.
- Assuming RWO means single-pod access: RWO allows access from a single node, not a single pod. Multiple pods on the same node can mount the same RWO volume. Use RWOP if you need single-pod exclusivity.
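To address the capacity-monitoring pitfall above, an alert rule along these lines can be deployed. This sketch assumes the Prometheus Operator's `PrometheusRule` CRD is installed and that kubelet volume metrics are being scraped:

```yaml
# pvc-capacity-alert.yaml
# Fire a warning when any PVC exceeds 80% of its capacity for 10 minutes
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pvc-capacity
  namespace: monitoring
spec:
  groups:
  - name: storage
    rules:
    - alert: PersistentVolumeFillingUp
      expr: |
        kubelet_volume_stats_used_bytes
          / kubelet_volume_stats_capacity_bytes > 0.80
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "PVC {{ $labels.persistentvolumeclaim }} is over 80% full"
```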
Best Practices
- Use `WaitForFirstConsumer` for all block storage StorageClasses: This prevents AZ mismatch issues and is the correct default for EBS, PD, and Azure Disk.
- Set the `Retain` reclaim policy for production data: Even if you plan to delete volumes, `Retain` gives you a safety net to recover data after accidental deletion.
- Monitor volume usage: Set up Prometheus alerts on `kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes` to detect volumes approaching capacity before they fill up.
- Automate snapshots: Use CronJobs or Velero schedules to take regular volume snapshots. Do not rely on manual snapshot creation.
- Test restores: Periodically restore from snapshots to a test environment to verify that your backup process actually works.
- Use separate StorageClasses for different workload types: Create `fast-ssd` for databases, `standard` for general workloads, and `cold-storage` for archival. This prevents misconfiguration and makes cost allocation easier.
- Encrypt volumes at rest: Enable encryption in your StorageClass parameters using your cloud provider's KMS. This is a compliance requirement in most regulated industries.
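Snapshot automation can be sketched as a CronJob that stamps out dated VolumeSnapshot objects. This is illustrative only: the `snapshot-creator` ServiceAccount and the RBAC allowing it to create VolumeSnapshot objects are assumed to exist, and the image tag is an example:

```yaml
# snapshot-cronjob.yaml
# Nightly VolumeSnapshot of the postgres-data PVC
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-postgres-snapshot
  namespace: production
spec:
  schedule: "0 2 * * *"          # 02:00 every night
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          serviceAccountName: snapshot-creator   # assumed SA with snapshot RBAC
          containers:
          - name: snapshot
            image: bitnami/kubectl:1.29          # example kubectl image
            command:
            - sh
            - -c
            - |
              kubectl apply -f - <<EOF
              apiVersion: snapshot.storage.k8s.io/v1
              kind: VolumeSnapshot
              metadata:
                name: postgres-snapshot-$(date +%Y%m%d)
                namespace: production
              spec:
                volumeSnapshotClassName: ebs-snapshot-class
                source:
                  persistentVolumeClaimName: postgres-data
              EOF
```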
What's Next?
- Disaster Recovery: Volume snapshots and backup strategies are foundational to disaster recovery. Learn how to integrate storage backups into a comprehensive DR plan with Velero.
- CRDs & Operators: Storage operators like Rook (for Ceph) and OpenEBS use CRDs to manage complex storage systems declaratively within Kubernetes.
- Security Policies: Use RBAC to control who can create PVCs with specific StorageClasses, preventing unauthorized provisioning of expensive storage.