Disaster Recovery & Backups

Key Takeaways for AI & Readers
  • Comprehensive Backup: Disaster recovery requires backing up both Kubernetes metadata (YAMLs in etcd) and persistent application data.
  • Velero for Cluster Backups: Velero is the de facto standard open-source tool for backing up Kubernetes resources and persistent volumes, enabling full cluster restoration.
  • High Availability etcd: For self-managed clusters, ensuring etcd redundancy (e.g., 3 or 5 nodes) is crucial for control plane resilience.
  • Multi-Region Strategies: Active-passive or active-active configurations across regions provide robust resilience against regional outages for critical applications.

Kubernetes is self-healing, but it cannot heal from a deleted database or a deleted cloud account. You need a backup strategy that covers both Metadata and Data.

1. What needs backing up?

  1. Metadata (The YAMLs): Every Deployment, Service, and ConfigMap stored in etcd.
  2. Persistent Data: The actual files stored on your cloud disks (e.g., AWS EBS volumes or Azure Disks).
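To find out which workloads fall into the second category, one rough approach is to scan manifests for PersistentVolumeClaim references. The sketch below assumes the manifests have already been parsed into Python dicts; the function name claims_used_by and the sample Deployment are hypothetical, chosen only for illustration.

```python
def claims_used_by(manifest: dict) -> list[str]:
    """Return the PVC names referenced by a Deployment-style manifest."""
    # Pod templates live under spec.template.spec for Deployments.
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    claims = []
    for volume in pod_spec.get("volumes", []):
        pvc = volume.get("persistentVolumeClaim")
        if pvc:
            claims.append(pvc["claimName"])
    return claims


# Hypothetical Deployment: one PVC-backed volume and one ConfigMap volume.
deployment = {
    "kind": "Deployment",
    "metadata": {"name": "orders-db"},
    "spec": {
        "template": {
            "spec": {
                "volumes": [
                    {"name": "data", "persistentVolumeClaim": {"claimName": "orders-db-data"}},
                    {"name": "config", "configMap": {"name": "orders-db-config"}},
                ]
            }
        }
    },
}

print(claims_used_by(deployment))
```

A workload that returns an empty list here is metadata only; anything else also has disk contents that must be captured separately.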

2. Velero: The Cluster Time Machine

Velero is the de facto standard open-source tool for cluster-wide backup and restore.

[Diagram: Live Cluster → Remote Storage (S3), holding 📦 backup-2025-12-29 (Metadata + Snapshots)]
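Velero backs up your Kubernetes API objects (the state held in etcd) and disk snapshots to an external bucket. If the cluster dies, you can create a new one and restore from the bucket, often within minutes for small clusters; larger volumes take longer.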

How it works:

  1. Snapshot: Velero asks the cloud provider to take a snapshot of the Persistent Volumes included in the backup.
  2. Export: Velero packages the API objects (YAMLs) into a compressed archive and uploads it to object storage such as an S3 bucket.
  3. Restore: On a brand new cluster, you install Velero, point it to the same bucket, and run velero restore create --from-backup <backup-name>. For a small cluster, your apps and data are typically back within minutes.
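The backup and restore steps above map onto two Velero CLI invocations. The sketch below only builds the argument lists so they can be inspected or passed to subprocess.run; it assumes the velero binary is installed and already configured with a storage location. The backup name is taken from the diagram, while the namespace names and helper function names are hypothetical.

```python
import shlex


def backup_command(name: str, namespaces: list[str], ttl: str = "720h0m0s") -> list[str]:
    """Build a `velero backup create` command (ttl controls backup retention)."""
    cmd = ["velero", "backup", "create", name, "--ttl", ttl]
    if namespaces:
        # Limit the backup to selected namespaces; omit to back up all of them.
        cmd += ["--include-namespaces", ",".join(namespaces)]
    return cmd


def restore_command(backup_name: str) -> list[str]:
    """Build a `velero restore create` command for an existing backup."""
    return ["velero", "restore", "create", "--from-backup", backup_name]


# Hypothetical namespaces; print the commands instead of running them.
backup = backup_command("backup-2025-12-29", ["shop", "payments"])
restore = restore_command("backup-2025-12-29")
print(shlex.join(backup))
print(shlex.join(restore))
```

To actually run a command, pass the list to subprocess.run(cmd, check=True). Keeping the commands as lists avoids shell quoting problems.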

3. High Availability etcd

If you manage your own control plane (rather than using a managed service such as EKS or GKE), you must make etcd redundant yourself.

  • Run an odd number of etcd members (3 or 5).
  • etcd stays available as long as a majority (quorum) of members is up: a 3-member cluster survives 1 failure, and a 5-member cluster survives 2.
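The preference for odd sizes follows from etcd's majority rule: a cluster of n members needs floor(n/2) + 1 of them available to accept writes. A minimal sketch of that arithmetic (the function names are illustrative, not part of any etcd API):

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can fail while a majority remains."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates {failures_tolerated(n)} failure(s)")
```

Note that 4 members tolerate only 1 failure, the same as 3, so an even-sized cluster adds cost without adding resilience.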

4. Multi-Region Strategy

For "Mission Critical" apps, one cluster isn't enough.

  • Active-Passive: One cluster serves all traffic while a second cluster in another region stays on standby and takes over if the first fails.
  • Active-Active: Both clusters receive traffic at the same time via Global DNS (e.g., Cloudflare or AWS Route 53).
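The difference between the two modes comes down to which clusters are allowed to receive traffic at a given moment. The sketch below models only that routing decision; in practice a DNS provider makes it based on its own health checks. The Cluster type, the function names, and the sample regions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    region: str
    healthy: bool


def active_passive(primary: Cluster, standby: Cluster) -> list[Cluster]:
    """Send traffic to the primary; fail over to the standby only if it is down."""
    if primary.healthy:
        return [primary]
    if standby.healthy:
        return [standby]
    return []  # Both regions are down: nothing can serve traffic.


def active_active(clusters: list[Cluster]) -> list[Cluster]:
    """Spread traffic across every healthy cluster."""
    return [c for c in clusters if c.healthy]


# Hypothetical scenario: the primary region is experiencing an outage.
east = Cluster("prod-east", "us-east-1", healthy=False)
west = Cluster("prod-west", "us-west-2", healthy=True)

print([c.name for c in active_passive(east, west)])
print([c.name for c in active_active([east, west])])
```

In both modes the surviving cluster must already hold the applications and data, which is where the backup and replication strategy above comes in.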