Disaster Recovery & Backups
- Comprehensive Backup Strategy: Disaster recovery requires backing up both Kubernetes metadata (all API objects stored in etcd) and persistent application data (databases, files on PersistentVolumes). Neither alone is sufficient for a full restore.
- Velero for Cluster Backups: Velero is the de facto standard for creating consistent, scheduled backups of Kubernetes resources and persistent volumes. It supports cloud-native snapshots, file-level backups via Restic/Kopia, and enables full or selective cluster restoration.
- RPO and RTO Drive Architecture: Your Recovery Point Objective (maximum acceptable data loss) and Recovery Time Objective (maximum acceptable downtime) determine whether you need simple nightly backups or active-active multi-region architectures.
- High Availability etcd: For self-managed clusters, ensuring etcd redundancy (3 or 5 nodes with Raft consensus) is crucial for control plane resilience. etcd snapshots are the last line of defense for cluster state.
- Multi-Region Strategies: Active-passive or active-active configurations across regions provide robust resilience against regional outages, but each pattern comes with trade-offs in cost, complexity, and data consistency.
- Test Your DR Plan: A backup that has never been restored is not a backup. Regular DR drills are essential to validate that your recovery procedures actually work within your RTO.
Kubernetes is self-healing at the container level -- it restarts crashed pods, reschedules workloads from failed nodes, and maintains desired replica counts. But Kubernetes cannot heal from a deleted namespace, a corrupted etcd database, a cloud account compromise, or a regional outage. You need a deliberate disaster recovery strategy that covers both metadata and data.
1. What Needs Backing Up?
A complete Kubernetes disaster recovery plan must address two distinct categories of state:
Metadata (The API Objects): Every Deployment, Service, ConfigMap, Secret, RBAC rule, CRD instance, and Namespace stored in etcd. This is the declarative desired state of your entire cluster. Losing this means losing the blueprint for everything running in your cluster.
Persistent Data: The actual bytes stored on your cloud disks (AWS EBS, Azure Disk, GCP Persistent Disk) or network file systems (NFS, EFS). This includes database files, uploaded assets, and any stateful application data. Losing a PersistentVolume means losing the data your application depends on.
Configuration Outside the Cluster: Do not forget external DNS records, load balancer configurations, IAM roles, and cloud networking (VPCs, subnets) that your cluster depends on but does not manage.
2. RPO and RTO: Defining Your Requirements
Before choosing a DR strategy, you must define two critical metrics:
Recovery Point Objective (RPO) is the maximum amount of data loss your business can tolerate, measured in time. An RPO of 1 hour means you can afford to lose up to 1 hour of data. This dictates your backup frequency. If your RPO is 15 minutes, you need backups at least every 15 minutes.
Recovery Time Objective (RTO) is the maximum amount of downtime your business can tolerate. An RTO of 4 hours means you need to be fully operational within 4 hours of a disaster. This dictates your recovery architecture -- a cold standby cluster takes longer to spin up than a warm standby receiving replicated data.
| Strategy | Typical RPO | Typical RTO | Cost |
|---|---|---|---|
| Nightly Velero backups | 24 hours | 4-8 hours | Low |
| Hourly Velero + volume snapshots | 1 hour | 1-2 hours | Medium |
| Active-passive with replication | Minutes | 15-30 minutes | High |
| Active-active multi-region | Near zero | Near zero | Very high |
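Whatever row of the table you land on, the RPO side can be checked mechanically: alert whenever the newest backup is older than the RPO budget. A minimal bash sketch of that check (the rpo_ok helper and its epoch-second arguments are illustrative assumptions, not Velero tooling):

```bash
#!/usr/bin/env bash
# Sketch: verify the latest backup still satisfies the RPO.
# Usage: rpo_ok LAST_BACKUP_EPOCH NOW_EPOCH RPO_SECONDS
rpo_ok() {
  local last=$1 now=$2 rpo=$3
  local age=$(( now - last ))
  if (( age <= rpo )); then
    echo "ok: last backup ${age}s ago (RPO ${rpo}s)"
  else
    echo "VIOLATION: last backup ${age}s ago exceeds RPO ${rpo}s"
  fi
}

# Example: a backup taken 30 minutes ago, checked against a 1-hour RPO
rpo_ok "$(( $(date +%s) - 1800 ))" "$(date +%s)" 3600
```

In practice the "last backup" timestamp would come from your backup bucket or Velero's metrics; wiring this into a cron job or Prometheus alert turns an RPO target into something that pages you when it slips.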
3. etcd Snapshots: The Foundation
For self-managed clusters (not EKS, GKE, or AKS where the provider manages the control plane), etcd is your most critical component. All cluster state lives in etcd, and losing it without a backup means rebuilding from scratch.
Taking an etcd Snapshot
```bash
# Create a snapshot of the etcd database
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-20241215-030000.db --write-out=table
```
Restoring from an etcd Snapshot
```bash
# Stop the kube-apiserver and etcd on all control plane nodes first
# Then restore on each etcd member with unique names and peer URLs
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20241215-030000.db \
  --name=etcd-node-1 \
  --initial-cluster=etcd-node-1=https://10.0.1.10:2380,etcd-node-2=https://10.0.1.11:2380,etcd-node-3=https://10.0.1.12:2380 \
  --initial-advertise-peer-urls=https://10.0.1.10:2380 \
  --data-dir=/var/lib/etcd-restored
```
Schedule etcd snapshots every 30 minutes via a systemd timer or cron job, and upload them to an off-cluster location (S3, GCS) immediately. An etcd backup stored only on the etcd node itself is not a backup.
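A systemd timer is one way to schedule this. The sketch below assumes a hypothetical wrapper script, /usr/local/bin/etcd-snapshot.sh, that runs the etcdctl snapshot save command and uploads the result to object storage:

```ini
# /etc/systemd/system/etcd-snapshot.service (paths are assumptions)
[Unit]
Description=Take an etcd snapshot and upload it off-cluster

[Service]
Type=oneshot
Environment=ETCDCTL_API=3
ExecStart=/usr/local/bin/etcd-snapshot.sh
```

```ini
# /etc/systemd/system/etcd-snapshot.timer
[Unit]
Description=Run the etcd snapshot service every 30 minutes

[Timer]
OnCalendar=*:0/30
Persistent=true

[Install]
WantedBy=timers.target
```

Enable it with systemctl enable --now etcd-snapshot.timer; Persistent=true ensures a missed run fires after a reboot.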
4. Velero: The Cluster Time Machine
Velero is the de facto standard for Kubernetes cluster backup and restore. It handles both API objects and persistent volume data in a unified workflow.
Velero Architecture
Velero consists of several components working together:
- Velero Server: A Deployment running in your cluster that watches for Backup and Restore custom resources and orchestrates the process.
- BackupStorageLocation (BSL): Defines where backup files (compressed tarballs of JSON-serialized API objects) are stored -- typically an S3-compatible bucket.
- VolumeSnapshotLocation (VSL): Defines where PersistentVolume snapshots are stored, using your cloud provider's native snapshot API.
- Restic/Kopia Integration: For volumes that do not support native snapshots (NFS, hostPath, or when you need cross-provider portability), Velero can use file-level backup via Restic or Kopia.
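The BSL and VSL are themselves custom resources. A sketch of what they look like for AWS (bucket name and region are placeholders):

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-velero-backups
  config:
    region: us-east-1
---
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  config:
    region: us-east-1
```

You can define multiple BSLs (e.g., a second bucket in another region) and target them per backup with --storage-location.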
How a Backup Works
- Velero queries the Kubernetes API for all resources matching the backup spec (specific namespaces, label selectors, or the entire cluster).
- It serializes these resources to JSON and uploads them as a tarball to the BSL (your S3 bucket).
- For each PersistentVolume, Velero calls the cloud provider's snapshot API (e.g., ec2:CreateSnapshot) to create a point-in-time snapshot.
- Metadata about the backup (timestamps, resource counts, warnings) is stored alongside the backup in the BSL.
Installing Velero
```bash
# Install Velero with the AWS plugin
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket my-velero-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --secret-file ./credentials-velero
```
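The --secret-file argument expects a file in the standard AWS shared-credentials format. A sketch with placeholder values, ideally belonging to a dedicated IAM user scoped only to the backup bucket and snapshot APIs:

```ini
; ./credentials-velero (values are placeholders)
[default]
aws_access_key_id=AKIAEXAMPLEKEYID
aws_secret_access_key=EXAMPLESECRETACCESSKEY
```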
Scheduled Backups
```yaml
# velero-schedule.yaml
# Runs a full cluster backup every 6 hours, retaining backups for 30 days
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: cluster-full-backup
  namespace: velero
spec:
  schedule: "0 */6 * * *"            # Every 6 hours
  template:
    ttl: 720h0m0s                    # Retain for 30 days
    includedNamespaces:
      - "*"                          # All namespaces
    excludedNamespaces:
      - kube-system                  # Exclude system namespace if desired
    storageLocation: default
    volumeSnapshotLocations:
      - default
    defaultVolumesToFsBackup: false  # Use native snapshots by default
```

```yaml
# velero-backup-critical.yaml
# On-demand backup of a specific namespace before a risky operation
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: pre-migration-production-db
  namespace: velero
spec:
  includedNamespaces:
    - production
  labelSelector:
    matchLabels:
      app: postgres
  ttl: 168h0m0s  # Retain for 7 days
  storageLocation: default
  volumeSnapshotLocations:
    - default
  hooks:
    resources:
      - name: postgres-freeze
        includedNamespaces:
          - production
        labelSelector:
          matchLabels:
            app: postgres
        # Note: pg_start_backup()/pg_stop_backup() were renamed to
        # pg_backup_start()/pg_backup_stop() in PostgreSQL 15
        pre:  # Run before the snapshot
          - exec:
              container: postgres
              command:
                - /bin/bash
                - -c
                - "psql -U postgres -c 'SELECT pg_start_backup($$velero$$, true);'"
        post:  # Run after the snapshot
          - exec:
              container: postgres
              command:
                - /bin/bash
                - -c
                - "psql -U postgres -c 'SELECT pg_stop_backup();'"
```
Restoring to a New Cluster
```bash
# On the new cluster, install Velero pointing to the same bucket
# Then restore from the most recent backup
velero restore create --from-backup cluster-full-backup-20241215060000

# Restore only a specific namespace
velero restore create --from-backup cluster-full-backup-20241215060000 \
  --include-namespaces production

# Restore with name remapping (useful for cloning environments)
velero restore create --from-backup cluster-full-backup-20241215060000 \
  --include-namespaces production \
  --namespace-mappings production:staging
```
5. Persistent Volume Backup Strategies
Not all persistent volume backup approaches are equal. Choose based on your requirements:
Cloud-Native Snapshots are the fastest and simplest option for cloud-managed disks. They are incremental, cheap, and fast. However, they are provider-specific and cannot be used to migrate data across cloud providers.
File-Level Backup (Restic/Kopia via Velero) copies individual files from mounted volumes. This is slower but portable across providers and works with any volume type (NFS, hostPath, local). Enable it with the annotation backup.velero.io/backup-volumes: my-volume on your pods.
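A sketch of what that annotation looks like in practice (pod, image, and volume names are illustrative); the annotation value is a comma-separated list of volume names from the pod spec:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: reports
  namespace: production
  annotations:
    backup.velero.io/backup-volumes: reports-data  # opt this volume into file-level backup
spec:
  containers:
    - name: app
      image: registry.example.com/reports:1.0
      volumeMounts:
        - name: reports-data
          mountPath: /data
  volumes:
    - name: reports-data
      persistentVolumeClaim:
        claimName: reports-pvc
```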
Application-Level Backups use the application's own backup tooling -- pg_dump for PostgreSQL, mongodump for MongoDB, mysqldump for MySQL. These produce logically consistent backups and are the most reliable for databases. Use Velero backup hooks (as shown above) or dedicated CronJobs.
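For the dedicated-CronJob route, a minimal sketch of a nightly pg_dump job writing compressed dumps to a backup PVC (image tag, database name, service host, secret, and claim names are all assumptions):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-logical-backup
  namespace: production
spec:
  schedule: "0 3 * * *"  # 03:00 daily
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-dump
              image: postgres:16
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials
                      key: password
              command:
                - /bin/bash
                - -c
                - |
                  pg_dump -h postgres.production.svc -U postgres appdb \
                    | gzip > /backup/appdb-$(date +%Y%m%d).sql.gz
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: pg-backup-pvc
```

Pair this with Velero (or an upload step to object storage) so the dumps themselves land off-cluster.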
6. Cross-Region Disaster Recovery Patterns
Active-Passive
One cluster in the primary region handles all traffic. A second cluster in a different region is kept in sync and ready to take over. During failover, DNS is updated to point to the standby cluster.
- Data replication happens at the storage layer (database replication, S3 cross-region replication, or Velero backup/restore).
- The standby cluster can run a minimal footprint (scaled-down replicas) to save costs.
- Failover can be automated with health checks or manual with a runbook.
Active-Active
Both clusters in different regions receive traffic simultaneously via global load balancing (e.g., AWS Route 53, Cloudflare, GCP Global Load Balancer). This provides near-zero RTO but requires your application to handle data consistency across regions, typically via a globally distributed database (CockroachDB, Spanner, Vitess) or eventual consistency patterns.
7. Testing Your DR Plan
A disaster recovery plan that has not been tested is merely a hypothesis. Schedule regular DR drills:
- Monthly: Restore a single namespace backup to a test cluster and verify application functionality.
- Quarterly: Perform a full cluster restore to a new cluster and validate all services, including DNS cutover.
- Annually: Simulate a full regional outage and execute your failover runbook end to end with time tracking.
Document every drill: how long restoration took, what broke, and what needs improvement. Compare actual RTO against your target.
Common Pitfalls
- Backing up only API objects without volumes: Your Deployments restore perfectly, but your databases come back empty. Always include volume snapshots or file-level backups.
- Storing backups in the same region as the cluster: A regional outage takes out both your cluster and your backups. Use cross-region replication for your backup bucket.
- Never testing restores: You discover your backup process is broken only during an actual disaster. Test restores regularly.
- Forgetting cluster-scoped resources: Velero backs up namespaced resources by default. Ensure you include ClusterRoles, ClusterRoleBindings, StorageClasses, and CRDs.
- Ignoring Secrets encryption: Velero backups contain Secrets in plaintext (base64-encoded). Encrypt your backup bucket with server-side encryption and restrict access with IAM policies.
- Not accounting for ordering during restore: Some resources depend on others (CRDs must exist before their instances). Velero handles most ordering, but custom hooks may be needed for complex dependencies.
Best Practices
- Automate everything: Use Velero Schedules, not manual velero backup create commands. Humans forget; cron does not.
- Use the 3-2-1 rule: Keep 3 copies of your data, on 2 different media types, with 1 copy off-site (different region or provider).
- Separate backup credentials: Use dedicated, minimally-scoped IAM credentials for Velero. If your cluster is compromised, the attacker should not be able to delete backups.
- Enable object lock / immutable backups: Use S3 Object Lock or equivalent to prevent backup deletion, even by administrators, for a retention period.
- Monitor backup success: Alert on failed backups. Velero exposes Prometheus metrics (velero_backup_failure_total, velero_backup_success_total) -- use them.
- Document your runbook: A step-by-step disaster recovery runbook should be accessible outside your Kubernetes cluster (a wiki, a printed document, a shared drive).
- Include external dependencies: Your DR plan should cover DNS failover, TLS certificate re-issuance, and external service re-configuration.
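To make the credential-separation idea concrete, here is a coarse sketch of an S3 bucket policy that denies object deletion on the backup bucket (bucket name is a placeholder). Note the trade-off: Velero's own TTL-based garbage collection also needs delete permissions, so real deployments usually prefer S3 Object Lock retention or versioning plus lifecycle rules over a blanket deny:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyBackupDeletion",
      "Effect": "Deny",
      "Principal": "*",
      "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
      "Resource": "arn:aws:s3:::my-velero-backups/*"
    }
  ]
}
```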
Cluster Migration with Velero
Velero is not just for disaster recovery -- it is also the standard tool for migrating workloads between clusters (e.g., upgrading to a new Kubernetes version, migrating from one cloud provider to another, or consolidating clusters).
```bash
# On the source cluster: create a full backup
velero backup create migration-backup --include-namespaces app-ns

# On the destination cluster: install Velero pointing to the same BSL
# Then restore
velero restore create --from-backup migration-backup

# For cross-provider migration, use file-level backups (Restic/Kopia)
# since native snapshots are not portable across providers
velero backup create cross-cloud-migration \
  --include-namespaces app-ns \
  --default-volumes-to-fs-backup
```
What's Next?
- Advanced Storage (CSI): Learn about volume snapshots, StorageClasses, and CSI driver configuration that underpin persistent volume backups.
- GitOps (ArgoCD): Understand how GitOps complements DR by maintaining a declarative, version-controlled record of your cluster's desired state in Git.
- CRDs & Operators: Explore how Velero itself uses CRDs (Backup, Restore, Schedule) to extend the Kubernetes API.