
Advanced Storage Patterns

Key Takeaways for AI & Readers
  • CSI Standard: The Container Storage Interface (CSI) decouples storage drivers from the Kubernetes core binary, allowing storage vendors (AWS, Azure, GCP, NetApp, Portworx, Ceph) to develop, release, and update their drivers independently. CSI drivers are installed as Pods in your cluster, not compiled into Kubernetes.
  • Dynamic Provisioning with StorageClasses: StorageClasses enable automatic PersistentVolume creation when a PVC is submitted. Critical parameters include reclaimPolicy (Delete vs Retain for data safety), volumeBindingMode (Immediate vs WaitForFirstConsumer for topology alignment), and provider-specific parameters (disk type, IOPS, encryption).
  • Day 2 Operations: Modern CSI drivers support on-the-fly volume expansion (resize a PVC without downtime), point-in-time snapshots for backup and recovery, and volume cloning for creating copies of existing data.
  • Topology-Aware Provisioning: Using volumeBindingMode: WaitForFirstConsumer ensures storage is provisioned in the same availability zone as the scheduled Pod, preventing cross-AZ mount failures that are a common source of StatefulSet scheduling issues.
  • Reclaim Policies Are Critical: The difference between Delete (disk destroyed when PVC is deleted) and Retain (disk preserved for manual recovery) is the difference between data loss and data safety. Always use Retain for production databases and stateful workloads.

In the Core Concepts, we covered PersistentVolumeClaim (PVC) for requesting storage. Now let's look at how storage actually works under the hood, how to handle Day 2 operations (expansion, snapshots, migration), and how to configure storage for production workloads.

1. Container Storage Interface (CSI)

In the early days of Kubernetes, volume plugins for AWS EBS, GCP PD, NFS, and other storage systems were compiled directly into the Kubernetes binary. This "in-tree" approach meant that adding a new storage backend required changing Kubernetes itself, and storage vendors had to synchronize their release cycles with Kubernetes releases.

The Container Storage Interface (CSI) solved this by defining a standard gRPC-based interface between Kubernetes and storage drivers. Storage vendors now implement CSI drivers as independent containers that run in your cluster.

CSI Architecture

A CSI driver consists of two components:

  • Controller Plugin (Deployment): Handles volume lifecycle operations that do not need to run on a specific node -- creating volumes, deleting volumes, taking snapshots, expanding volumes. It communicates with the cloud provider's API to provision the actual storage.
  • Node Plugin (DaemonSet): Runs on every node and handles operations that must happen on the node where the pod is scheduled -- mounting the volume to the pod's filesystem, formatting the disk, and unmounting on pod termination.

Both plugins communicate with Kubernetes through sidecar containers provided by the Kubernetes CSI project:

  • csi-provisioner: Watches for new PVCs and calls the CSI driver's CreateVolume RPC.
  • csi-attacher: Watches for VolumeAttachment objects and calls the driver's ControllerPublishVolume RPC.
  • csi-resizer: Watches for PVC size changes and calls ControllerExpandVolume.
  • csi-snapshotter: Watches for VolumeSnapshot objects and calls CreateSnapshot.
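As a rough sketch of how this looks in practice, you can list the containers in the controller pod and see the sidecars running next to the driver container (the label selector and container names below are assumptions; they vary by driver and Helm chart version):

```shell
# List containers in the EBS CSI controller pod -- the driver plus its sidecars.
# The label "app=ebs-csi-controller" is an assumption; check your chart's labels.
kubectl get pods -n kube-system -l app=ebs-csi-controller \
  -o jsonpath='{range .items[0].spec.containers[*]}{.name}{"\n"}{end}'
# Sidecar names such as csi-provisioner, csi-attacher, csi-resizer, and
# csi-snapshotter typically appear alongside the driver container (ebs-plugin).
```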

Installing a CSI Driver

CSI drivers are typically installed via Helm:

# Install the AWS EBS CSI driver
helm repo add aws-ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
helm install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system \
  --set controller.serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::123456789:role/ebs-csi-role

After installation, verify the driver is registered:

kubectl get csidrivers
# NAME              ATTACHREQUIRED   PODINFOONMOUNT   MODES        AGE
# ebs.csi.aws.com   true             false            Persistent   5m

2. StorageClasses and Dynamic Provisioning

Instead of manually creating PersistentVolume objects for every disk, StorageClasses enable dynamic provisioning -- Kubernetes automatically creates the underlying storage when a PVC is submitted.

StorageClass Configuration

# storageclass-fast-ssd.yaml
# High-performance SSD storage for databases
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
provisioner: ebs.csi.aws.com             # The CSI driver to use
parameters:
  type: gp3                              # AWS EBS volume type
  iops: "5000"                           # Provisioned IOPS (gp3 supports up to 16,000)
  throughput: "250"                      # MB/s throughput (gp3 supports up to 1,000)
  encrypted: "true"                      # Encrypt the volume at rest
  kmsKeyId: "arn:aws:kms:us-east-1:123456789:key/abcd-1234"  # Custom KMS key
reclaimPolicy: Retain                    # Keep the disk when PVC is deleted
volumeBindingMode: WaitForFirstConsumer  # Wait for pod scheduling before provisioning
allowVolumeExpansion: true               # Allow PVC resize
mountOptions:
  - noatime                              # Skip updating file access times for performance

StorageClass Parameters by Provider

| Provider   | Provisioner            | Key Parameters                                              |
|------------|------------------------|-------------------------------------------------------------|
| AWS EBS    | ebs.csi.aws.com        | type (gp3, io2, st1), iops, throughput, encrypted, kmsKeyId |
| GCP PD     | pd.csi.storage.gke.io  | type (pd-standard, pd-ssd, pd-balanced), replication-type   |
| Azure Disk | disk.csi.azure.com     | skuName (Premium_LRS, StandardSSD_LRS), cachingMode         |
| Ceph RBD   | rbd.csi.ceph.com       | clusterID, pool, imageFeatures                              |
| NFS        | nfs.csi.k8s.io         | server, share                                               |

Reclaim Policy

The reclaim policy determines what happens to the underlying storage when the PVC is deleted:

  • Delete (default for most cloud StorageClasses): The PersistentVolume and the underlying cloud disk are deleted. Data is permanently lost. Use this for ephemeral workloads, test environments, and caches.
  • Retain: The PersistentVolume is kept in Released state. The underlying disk is preserved but not accessible to new PVCs. You must manually reclaim the data and delete the PV. Always use this for databases and stateful workloads in production.

# Change the reclaim policy of an existing PV (e.g., after provisioning with Delete)
kubectl patch pv pvc-abc123 -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
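After patching, you can confirm the policy took effect, and spot any volumes that a Retain policy has left behind in Released state:

```shell
# Print the reclaim policy of the patched PV (should now be Retain)
kubectl get pv pvc-abc123 -o jsonpath='{.spec.persistentVolumeReclaimPolicy}{"\n"}'

# Find PVs left in Released state after their PVC was deleted
kubectl get pv | grep Released
```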

WaitForFirstConsumer

The volumeBindingMode field controls when dynamic provisioning occurs:

  • Immediate: Create the disk as soon as the PVC is created. The disk is provisioned in an arbitrary AZ. If the pod is later scheduled in a different AZ, the volume cannot be mounted, causing the pod to be stuck in Pending.
  • WaitForFirstConsumer: Wait until a pod references the PVC and is scheduled to a specific node. The disk is then created in the same AZ as the node. Always use this for block storage (EBS, PD, Azure Disk) to avoid cross-AZ issues.

Access Modes

  • ReadWriteOnce (RWO): The volume can be mounted as read-write by a single node. This is the most common mode for block storage (EBS, PD).
  • ReadWriteMany (RWX): The volume can be mounted as read-write by many nodes simultaneously. Required for shared filesystems (NFS, EFS, Azure Files).
  • ReadOnlyMany (ROX): The volume can be mounted as read-only by many nodes. Useful for distributing static datasets.
  • ReadWriteOncePod (RWOP): The volume can be mounted as read-write by a single pod (not just a single node). This provides stronger guarantees than RWO; the feature reached beta in Kubernetes 1.27 and became stable in 1.29.
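For example, a PVC requesting single-pod exclusivity looks like this (a sketch; it requires a CSI driver that supports RWOP, and the PVC name is illustrative):

```yaml
# pvc-rwop.yaml -- single-pod exclusive access (sketch; name is illustrative)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: exclusive-data
spec:
  accessModes:
    - ReadWriteOncePod   # A second pod referencing this PVC will fail to schedule
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 20Gi
```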

3. Volume Expansion


When a volume runs out of space, you can expand it by simply editing the PVC's spec.resources.requests.storage field. The CSI driver handles the underlying disk resize and filesystem expansion.

# Expand a PVC from 50Gi to 100Gi
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: production
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd   # Must have allowVolumeExpansion: true
  resources:
    requests:
      storage: 100Gi           # Changed from 50Gi to 100Gi

Online expansion: Most modern CSI drivers (EBS, GCP PD, Azure Disk) support online expansion -- the filesystem grows while the volume is mounted and the pod is running. No downtime required.

Offline expansion: Some drivers or filesystem types require the volume to be unmounted before expansion. In this case, you must scale the pod to 0, wait for the expansion to complete, then scale back up.

Important: Volume expansion is one-way. You cannot shrink a PVC. Always set an appropriate initial size and use monitoring to alert when volumes are approaching capacity.
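While a resize is in flight, its progress is visible in the PVC's status. A quick way to follow it:

```shell
# Watch the PVC capacity update as the resize completes
kubectl get pvc postgres-data -n production -w

# Inspect resize-related conditions and events; during an expansion you may see
# conditions such as "Resizing" or "FileSystemResizePending"
kubectl describe pvc postgres-data -n production
```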

4. Volume Snapshots

Volume snapshots create point-in-time copies of your persistent volumes. They are useful for backups, data protection before upgrades, and creating test environments from production data.

VolumeSnapshotClass

# volumesnapshotclass.yaml
# Define how snapshots are created for EBS volumes
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-snapshot-class
driver: ebs.csi.aws.com   # Must match the StorageClass provisioner
deletionPolicy: Retain    # Keep snapshot even if VolumeSnapshot CR is deleted
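Note that the snapshot CRDs and the external snapshot controller are not installed by default on every cluster; managed offerings vary. A quick prerequisite check:

```shell
# Verify the snapshot CRDs exist (they ship with the external-snapshotter project)
kubectl get crd volumesnapshots.snapshot.storage.k8s.io \
  volumesnapshotclasses.snapshot.storage.k8s.io \
  volumesnapshotcontents.snapshot.storage.k8s.io
```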

Creating a Snapshot

# volume-snapshot.yaml
# Take a snapshot of the postgres data volume
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-snapshot-20241215
  namespace: production
spec:
  volumeSnapshotClassName: ebs-snapshot-class
  source:
    persistentVolumeClaimName: postgres-data   # The PVC to snapshot
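Snapshot creation is asynchronous. Before restoring from a snapshot, wait for it to become ready:

```shell
# Check whether the snapshot has been cut and is usable
kubectl get volumesnapshot postgres-snapshot-20241215 -n production \
  -o jsonpath='{.status.readyToUse}{"\n"}'
# Prints "true" once the storage backend has completed the snapshot
```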

Restoring from a Snapshot

# pvc-from-snapshot.yaml
# Create a new PVC from a snapshot (e.g., to verify a backup or seed a copy)
# Note: the PVC must be created in the same namespace as the VolumeSnapshot
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-restored
  namespace: production   # Same namespace as the snapshot
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi   # Must be >= the snapshot size
  dataSource:
    name: postgres-snapshot-20241215
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io

5. Volume Cloning

Volume cloning creates a copy of an existing PVC without going through the snapshot intermediate step. The new PVC is a full, independent copy of the source data.

# pvc-clone.yaml
# Clone a PVC for testing (data is copied, not shared)
# Note: the clone must be created in the same namespace as the source PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-clone
  namespace: production   # Same namespace as the source PVC
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi   # Must be >= the source PVC size
  dataSource:
    name: postgres-data   # Source PVC to clone
    kind: PersistentVolumeClaim

Cloning is useful for creating staging environments with real data, running one-off analytics on a copy of production data, or testing database migrations on a copy before running them on the real database.

6. Raw Block Volumes

For workloads that need direct block device access without a filesystem (some databases, caching layers, or custom storage engines), Kubernetes supports raw block volumes:

# raw-block-pod.yaml
# Mount a volume as a raw block device instead of a filesystem
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: raw-block-pvc
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block        # Block instead of Filesystem
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: block-consumer
spec:
  containers:
    - name: app
      image: myapp/block-consumer:v1
      volumeDevices:       # volumeDevices instead of volumeMounts
        - name: data
          devicePath: /dev/xvda   # Exposed as a block device
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: raw-block-pvc

7. Topology-Aware Provisioning

In multi-zone clusters, topology-aware provisioning ensures that volumes are created in the same availability zone as the pods that use them. This is handled by the WaitForFirstConsumer binding mode in the StorageClass.

For StatefulSets with multiple replicas, Kubernetes uses topologySpreadConstraints or pod anti-affinity to distribute pods across zones, and WaitForFirstConsumer ensures each pod's PVC is created in the correct zone.

# statefulset-with-topology.yaml
# A StatefulSet that spreads across AZs with topology-aligned storage
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: production
spec:
  serviceName: postgres   # Required: headless Service governing network identity
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: postgres
      containers:
        - name: postgres
          image: postgres:15
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd   # Uses WaitForFirstConsumer
        resources:
          requests:
            storage: 100Gi

8. Ephemeral Volumes

Not all volumes need to be persistent. Kubernetes provides several ephemeral volume types for temporary data:

emptyDir: Created when a pod is assigned to a node, deleted when the pod is removed. Useful for scratch space, caches, and inter-container data sharing within a pod. Can be backed by the node's disk or RAM (medium: Memory).

volumes:
  - name: cache
    emptyDir:
      sizeLimit: 1Gi    # Limit to prevent filling node disk
  - name: tmpfs
    emptyDir:
      medium: Memory    # RAM-backed for performance (counts against memory limits)
      sizeLimit: 256Mi

configMap and secret volumes: Project ConfigMap data or Secret data as files in the container. Updates to the ConfigMap/Secret are eventually reflected in the mounted files (with a delay of up to the kubelet sync period).
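A minimal sketch of mounting a ConfigMap as files (the ConfigMap name and mount path are illustrative):

```yaml
# Project ConfigMap keys as read-only files (names are illustrative)
volumes:
  - name: app-config
    configMap:
      name: app-settings    # Each key becomes a file in the mount directory
      optional: false       # Pod fails to start if the ConfigMap is missing
# In the container spec:
# volumeMounts:
#   - name: app-config
#     mountPath: /etc/app-config
#     readOnly: true
```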

projected volumes: Combine multiple volume sources (configMap, secret, downwardAPI, serviceAccountToken) into a single mount point.
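A projected volume combining several sources might look like this (a sketch; the ConfigMap and Secret names are assumptions):

```yaml
# Combine multiple sources under one mount point (names are illustrative)
volumes:
  - name: combined
    projected:
      sources:
        - configMap:
            name: app-settings
        - secret:
            name: app-credentials
        - downwardAPI:
            items:
              - path: pod-name
                fieldRef:
                  fieldPath: metadata.name
        - serviceAccountToken:
            path: token
            expirationSeconds: 3600   # Short-lived, auto-rotated token
```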

9. Data Migration Strategies

Moving data between storage backends or clusters requires careful planning:

In-cluster migration (change StorageClass): Create a new PVC with the desired StorageClass, run a Job that copies data from the old PVC to the new PVC (using rsync or cp), update the workload to reference the new PVC, then delete the old PVC.

# migration-job.yaml
# Copy data from one PVC to another
apiVersion: batch/v1
kind: Job
metadata:
  name: migrate-storage
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: alpine:3
          command:
            - sh
            - -c
            - "cp -a /source/. /destination/"   # trailing /. also copies hidden files
          volumeMounts:
            - name: source
              mountPath: /source
              readOnly: true
            - name: destination
              mountPath: /destination
      volumes:
        - name: source
          persistentVolumeClaim:
            claimName: old-pvc
        - name: destination
          persistentVolumeClaim:
            claimName: new-pvc

Cross-cluster migration: Use Velero with file-level backup (Restic/Kopia) to back up volumes from the source cluster and restore them in the destination cluster. This works across cloud providers since it operates at the filesystem level.
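The Velero flow can be sketched as follows (flags assume Velero 1.10+ with file-system backup enabled; namespace and backup names are illustrative):

```shell
# On the source cluster: back up the namespace, including volume data
# via file-system backup (Kopia/Restic)
velero backup create prod-backup \
  --include-namespaces production \
  --default-volumes-to-fs-backup

# On the destination cluster, configured against the same object storage:
velero restore create --from-backup prod-backup
```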

10. Backup Strategies for Persistent Data

Persistent data backup should be layered:

  1. Application-level backups: Use the application's native tools (pg_dump, mongodump, mysqldump). These produce logically consistent backups and are the most reliable for databases.
  2. Volume snapshots: Use VolumeSnapshot objects for point-in-time block-level copies. Fast and efficient but provider-specific.
  3. File-level backups: Use Velero with Restic/Kopia for portable, cross-provider backups. Slower but works with any volume type.
  4. Continuous replication: For near-zero RPO, use database-native replication (streaming replication for PostgreSQL, replica sets for MongoDB) to a standby in another region.
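As a sketch of layer 1, an application-level backup can be scheduled with a CronJob (the image, host, database, Secret, and destination PVC below are all assumptions to adapt):

```yaml
# cronjob-pg-dump.yaml -- nightly logical backup (sketch; names are illustrative)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: production
spec:
  schedule: "0 2 * * *"   # Every night at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: pg-dump
              image: postgres:15
              command:
                - sh
                - -c
                - pg_dump -h postgres -U app mydb | gzip > /backup/mydb-$(date +%F).sql.gz
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials   # Hypothetical Secret
                      key: password
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: backup-storage   # Hypothetical destination PVC
```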

Common Pitfalls

  • Using volumeBindingMode: Immediate with block storage: This provisions the disk before the pod is scheduled. If the pod lands in a different AZ, it cannot mount the volume and gets stuck in Pending. Always use WaitForFirstConsumer.
  • Forgetting allowVolumeExpansion: true: If your StorageClass does not set this flag, the API server rejects PVC resize requests. The field is mutable, so you can enable it on an existing StorageClass without recreating it.
  • Using Delete reclaim policy for production databases: One accidental kubectl delete pvc destroys your data permanently. Use Retain for any stateful workload.
  • Not monitoring volume capacity: Kubernetes does not alert when a PVC is nearly full. Use Prometheus with kubelet_volume_stats_used_bytes and kubelet_volume_stats_capacity_bytes to alert at 80% capacity.
  • Ignoring IOPS and throughput limits: Cloud provider disks have IOPS and throughput caps that can cause severe performance degradation. Size your StorageClass parameters based on workload requirements, not just capacity.
  • Assuming RWO means single-pod access: RWO allows access from a single node, not a single pod. Multiple pods on the same node can mount the same RWO volume. Use RWOP if you need single-pod exclusivity.

Best Practices

  1. Use WaitForFirstConsumer for all block storage StorageClasses: This prevents AZ mismatch issues and is the correct default for EBS, PD, and Azure Disk.
  2. Set Retain reclaim policy for production data: Even if you plan to delete volumes, Retain gives you a safety net to recover data after accidental deletion.
  3. Monitor volume usage: Set up Prometheus alerts on kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes to detect volumes approaching capacity before they fill up.
  4. Automate snapshots: Use CronJobs or Velero schedules to take regular volume snapshots. Do not rely on manual snapshot creation.
  5. Test restores: Periodically restore from snapshots to a test environment to verify that your backup process actually works.
  6. Use separate StorageClasses for different workload types: Create fast-ssd for databases, standard for general workloads, and cold-storage for archival. This prevents misconfiguration and makes cost allocation easier.
  7. Encrypt volumes at rest: Enable encryption in your StorageClass parameters using your cloud provider's KMS. This is a compliance requirement in most regulated industries.
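The capacity alert from point 3 could be expressed as a PrometheusRule (assumes the Prometheus Operator is installed; the 80% threshold and names are examples):

```yaml
# Alert when any PVC crosses 80% usage (sketch; assumes Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pvc-capacity
  namespace: monitoring
spec:
  groups:
    - name: storage
      rules:
        - alert: PersistentVolumeFillingUp
          expr: |
            kubelet_volume_stats_used_bytes
              / kubelet_volume_stats_capacity_bytes > 0.80
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} is over 80% full"
```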

What's Next?

  • Disaster Recovery: Volume snapshots and backup strategies are foundational to disaster recovery. Learn how to integrate storage backups into a comprehensive DR plan with Velero.
  • CRDs & Operators: Storage operators like Rook (for Ceph) and OpenEBS use CRDs to manage complex storage systems declaratively within Kubernetes.
  • Security Policies: Use RBAC to control who can create PVCs with specific StorageClasses, preventing unauthorized provisioning of expensive storage.