Progressive Delivery: Beyond Rolling Updates
- Risk Mitigation: Progressive delivery techniques (canary, blue-green, A/B testing) drastically reduce the blast radius of faulty deployments by gradually exposing new versions to real users while monitoring key health metrics.
- Automated Rollbacks: Tools like Argo Rollouts integrate with observability platforms (Prometheus, Datadog, New Relic, CloudWatch) to automatically detect regressions in error rate, latency, or custom business metrics and trigger rollbacks before widespread user impact.
- Canary Deployments: Route a small percentage of live traffic (e.g., 5%) to a new version, run automated analysis against real production metrics, and progressively increase the traffic weight only if the analysis passes at each step.
- Blue-Green Deployments: Maintain two identical environments and instantly switch 100% of traffic between them. This provides the fastest rollback (seconds) but requires double the resources during the transition.
- Argo Rollouts: The most widely adopted progressive delivery controller for Kubernetes. It introduces the Rollout resource as a drop-in replacement for Deployment, with native support for AnalysisTemplates, manual gates, traffic management integrations, and experiment resources.
- Flagger: An alternative progressive delivery tool that works with existing Deployment resources and integrates with service meshes (Istio, Linkerd) and ingress controllers for traffic shifting.
A standard Kubernetes Deployment only supports a basic Rolling Update strategy. During a rolling update, old pods are replaced with new pods incrementally — but there is no automated health validation beyond readiness probes. If you deploy a bug that passes readiness checks but increases error rates, the rollout proceeds to 100% of your pods. By the time you notice, every user is affected. Rolling back requires another full deployment cycle.
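The ceiling of the native strategy is visible in the Deployment spec itself. A minimal sketch (the app name, image, and probe path are illustrative) shows that the only knobs are surge and unavailability limits, and the only health gate is the readiness probe:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: store-api
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2        # up to 2 extra pods during the update
      maxUnavailable: 1  # at most 1 pod below the desired count
  selector:
    matchLabels:
      app: store-api
  template:
    metadata:
      labels:
        app: store-api
    spec:
      containers:
        - name: store-api
          image: registry.example.com/store-api:v2.3.1
          readinessProbe:      # the only health gate a rolling update consults
            httpGet:
              path: /healthz
              port: 8080
```

Nothing here inspects error rates or latency; as long as `/healthz` answers, the rollout marches to 100%.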
Progressive Delivery uses automation and traffic management to reduce the "blast radius" of a bad release to a small fraction of users, validate the release with real production metrics, and automatically roll back if something goes wrong — all without human intervention.
1. Canary Deployments
A canary rollout sends a tiny fraction of live traffic to the new version while the majority continues hitting the stable version. At each step, automated analysis checks whether key metrics (error rate, latency, saturation) have degraded. If the canary is healthy, the traffic weight increases. If it is not, the rollout aborts and all traffic returns to the stable version.
Argo Rollout: Canary
The typical canary progression looks like this:
- Deploy the new version alongside the stable version (new ReplicaSet with minimal replicas).
- Route 5% of traffic to the canary.
- Run an automated analysis for 2-5 minutes — query Prometheus for 5xx rate and p99 latency.
- If analysis passes, increase to 20%.
- Repeat analysis.
- Increase to 50%, then 80%, then 100%.
- Scale down the old ReplicaSet.
If the analysis fails at any step, the controller immediately routes 100% of traffic back to the stable version and scales down the canary pods.
Traffic Splitting Mechanisms
Canary deployments require a traffic splitting mechanism. There are two primary approaches:
- Service mesh-based (Istio, Linkerd): The mesh controls traffic at the proxy level using VirtualService (Istio) or TrafficSplit (SMI). This provides true percentage-based splitting at the request level.
- Ingress-based (NGINX, ALB, Gateway API): The ingress controller splits traffic using weighted backend references. This is simpler to set up but may be less precise at low traffic volumes.
- Pod-ratio-based: Without a mesh or advanced ingress, you can approximate traffic splitting by scaling replicas. With 9 stable pods and 1 canary pod behind the same Service, roughly 10% of traffic hits the canary. This approach is coarse-grained but requires no additional infrastructure.
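As a concrete illustration of the ingress-based approach, here is a sketch using the NGINX ingress controller's canary annotations (the host and Service names are assumptions). A second Ingress for the same host, marked as a canary, siphons off a weighted share of traffic:

```yaml
# Main ingress: routes to the stable Service
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: store-api
spec:
  rules:
    - host: store.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: store-api-stable
                port:
                  number: 80
---
# Canary ingress: takes 5% of the same host's traffic
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: store-api-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"
spec:
  rules:
    - host: store.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: store-api-canary
                port:
                  number: 80
```

A controller like Argo Rollouts or Flagger automates bumping the `canary-weight` value at each step.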
2. Blue-Green Deployments
Blue-green is a simpler model with a different tradeoff:
- Blue is the current live version, receiving 100% of traffic.
- Green is deployed alongside Blue with full capacity but receives zero traffic.
- Automated tests or manual verification run against the Green environment (using a preview Service).
- When ready, the controller switches the active Service selector from Blue to Green — 100% of traffic moves instantly.
- Instant rollback: If something is wrong, the selector switches back to Blue in seconds.
- After a configurable scaledown delay, the Blue environment is removed.
Tradeoff: Blue-green requires double the resources during the transition (both environments run at full capacity). Canary requires only a few extra pods. Choose blue-green when you need atomic cutover and instant rollback, and canary when you want gradual risk exposure with automated metrics validation.
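The cutover mechanism can be sketched without any controller at all: a plain Service whose selector pins a version label (the labels and ports here are illustrative). Flipping one field moves 100% of traffic:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: payments-api-active
spec:
  selector:
    app: payments-api
    version: blue   # flip to "green" to cut over; flip back to roll back
  ports:
    - port: 80
      targetPort: 8080
```

Controllers like Argo Rollouts automate exactly this selector switch, plus the preview Service, analysis, and delayed scaledown described below.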
3. Argo Rollouts
Argo Rollouts is the most widely adopted progressive delivery controller for Kubernetes. It is a CNCF project that introduces a Rollout custom resource as a drop-in replacement for the native Deployment.
Installation
```bash
# Install the Argo Rollouts controller
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml

# Install the kubectl plugin for CLI management
brew install argoproj/tap/kubectl-argo-rollouts  # macOS
# or download from GitHub releases for Linux
```
Canary Rollout with Analysis
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: store-api
  namespace: production
spec:
  replicas: 10
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: store-api
  template:
    metadata:
      labels:
        app: store-api
    spec:
      containers:
        - name: store-api
          image: registry.example.com/store-api:v2.3.1
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              memory: 512Mi
  strategy:
    canary:
      # Traffic management via Istio
      trafficRouting:
        istio:
          virtualServices:
            - name: store-api-vsvc
              routes:
                - primary
          destinationRule:
            name: store-api-destrule
            canarySubsetName: canary
            stableSubsetName: stable
      steps:
        # Step 1: 5% traffic to canary
        - setWeight: 5
        # Step 2: run analysis for ~5 minutes
        - analysis:
            templates:
              - templateName: success-rate-and-latency
            args:
              - name: service-name
                value: store-api-canary
        # Step 3: increase to 25%
        - setWeight: 25
        - pause: { duration: 3m }  # soak time
        # Step 4: manual gate — human must approve
        - pause: {}  # indefinite pause until promoted
        # Step 5: 50% traffic
        - setWeight: 50
        - analysis:
            templates:
              - templateName: success-rate-and-latency
            args:
              - name: service-name
                value: store-api-canary
        # Step 6: full rollout
        - setWeight: 100
      # After an abort, wait 30s before scaling down canary pods
      # so in-flight requests can drain
      abortScaleDownDelaySeconds: 30
      # Anti-affinity to spread canary and stable pods across nodes
      antiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          weight: 100
```
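The Rollout references an Istio VirtualService and DestinationRule that you create yourself; Argo Rollouts only mutates them. A minimal sketch of both (hosts and labels are assumptions consistent with the Rollout above):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: store-api-vsvc
  namespace: production
spec:
  hosts:
    - store-api
  http:
    - name: primary
      route:
        - destination:
            host: store-api
            subset: stable
          weight: 100
        - destination:
            host: store-api
            subset: canary
          weight: 0   # Argo Rollouts rewrites these weights at each step
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: store-api-destrule
  namespace: production
spec:
  host: store-api
  subsets:
    - name: stable
      labels:
        app: store-api
    - name: canary
      labels:
        app: store-api   # the controller adds a pod-template-hash label
                         # here to pin the subset to the canary ReplicaSet
```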
AnalysisTemplate
The AnalysisTemplate defines what metrics to check and what constitutes success or failure:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-and-latency
  namespace: production
spec:
  args:
    - name: service-name
  metrics:
    # Metric 1: HTTP success rate must be >= 99%
    - name: success-rate
      interval: 60s    # check every 60 seconds
      count: 5         # run 5 measurements total
      failureLimit: 1  # allow at most 1 failed measurement
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status=~"2..|3.."
            }[2m])) /
            sum(rate(http_requests_total{
              service="{{args.service-name}}"
            }[2m]))
    # Metric 2: p99 latency must be < 500ms
    - name: p99-latency
      interval: 60s
      count: 5
      failureLimit: 1
      successCondition: result[0] < 500
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_ms_bucket{
                service="{{args.service-name}}"
              }[2m])) by (le)
            )
```
Blue-Green with Argo Rollouts
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-api
spec:
  replicas: 5
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: payments-api
          image: registry.example.com/payments-api:v3.0.0
  strategy:
    blueGreen:
      # The main production Service
      activeService: payments-api-active
      # A preview Service for testing before promotion
      previewService: payments-api-preview
      # Auto-promote after 5 minutes if no issues
      autoPromotionSeconds: 300
      # Keep the old ReplicaSet for 30 seconds after promotion
      scaleDownDelaySeconds: 30
      # Run analysis against the preview before promotion
      prePromotionAnalysis:
        templates:
          - templateName: smoke-tests
        args:
          - name: preview-url
            value: http://payments-api-preview.production.svc.cluster.local
```
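The smoke-tests template referenced above is not defined here. One plausible sketch uses the Argo Rollouts job provider, which passes the metric if a Kubernetes Job completes successfully; the curl image and health endpoint are assumptions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: smoke-tests
  namespace: production
spec:
  args:
    - name: preview-url
  metrics:
    - name: smoke
      count: 1
      provider:
        job:
          spec:
            backoffLimit: 0
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: smoke
                    image: curlimages/curl:8.7.1
                    # Fails the Job (and thus the analysis) on any non-2xx response
                    command: ["curl", "-fsS", "{{args.preview-url}}/healthz"]
```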
Manual Gates and Promotion
Argo Rollouts supports indefinite pauses (pause: {} with no duration) that require manual approval:
```bash
# Check rollout status
kubectl argo rollouts get rollout store-api -n production

# Promote (resume) a paused rollout
kubectl argo rollouts promote store-api -n production

# Abort a rollout (trigger rollback)
kubectl argo rollouts abort store-api -n production

# Retry an aborted rollout
kubectl argo rollouts retry rollout store-api -n production
```
4. Flagger
Flagger is an alternative progressive delivery tool that takes a different approach: instead of replacing Deployment with a custom resource, Flagger watches standard Deployment objects and manages canary/blue-green traffic shifting automatically.
Flagger integrates with:
- Service meshes: Istio, Linkerd, Open Service Mesh
- Ingress controllers: NGINX, Contour, Gloo, Gateway API
- Metrics providers: Prometheus, Datadog, CloudWatch, New Relic
You create a Canary custom resource that references the target Deployment, and Flagger takes over from there:
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: store-api
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: store-api
  service:
    port: 8080
  analysis:
    interval: 1m   # check metrics every minute
    threshold: 5   # max number of failed checks before rollback
    maxWeight: 50  # max canary traffic weight
    stepWeight: 10 # increment per step (10%, 20%, 30%, ...)
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m
```
When you update the Deployment image, Flagger detects the change, creates a canary Deployment, and starts the progressive traffic shift.
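Because a low-traffic canary may see too few requests for meaningful analysis, Flagger setups often add a load-generation webhook. A sketch assuming Flagger's loadtester addon is installed in a `test` namespace (the target URL is illustrative):

```yaml
# Add under spec.analysis of the Canary resource above
webhooks:
  - name: load-test
    url: http://flagger-loadtester.test/
    timeout: 5s
    metadata:
      cmd: "hey -z 1m -q 10 -c 2 http://store-api-canary.production:8080/"
```

During each analysis interval, the loadtester runs the `hey` command against the canary so the metrics queries have traffic to measure.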
5. Metrics-Driven Promotion
The key differentiator of progressive delivery over manual canary testing is automated metrics evaluation. The general pattern is:
- Define SLIs (Service Level Indicators): success rate, latency percentiles, error counts, or custom business metrics (e.g., checkout conversion rate).
- Set thresholds: success rate >= 99.5%, p99 latency < 300ms.
- The controller queries your observability backend at each analysis interval.
- Pass: metrics are within thresholds — proceed to the next step.
- Fail: metrics exceed thresholds — abort the rollout and route all traffic to the stable version.
Provider integrations go beyond Prometheus. Argo Rollouts supports Datadog, New Relic, CloudWatch, Wavefront, Kayenta (Netflix's automated canary analysis), and even arbitrary web-based analysis (any REST endpoint that returns a JSON result).
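As an illustration of the web-based option, a metric can gate on any HTTP endpoint that returns JSON. This is only a sketch: the analyzer URL and the `healthy` field in its response are assumptions, not a real service:

```yaml
# A metric entry for an AnalysisTemplate's spec.metrics list
- name: external-verdict
  interval: 60s
  count: 5
  failureLimit: 1
  successCondition: result == true
  provider:
    web:
      # Hypothetical internal service that evaluates the canary
      url: "http://canary-analyzer.monitoring:8080/verdict?service=store-api"
      # jsonPath extracts the field that successCondition evaluates
      jsonPath: "{$.healthy}"
```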
6. Real-World Progressive Delivery Pipeline
A production-grade progressive delivery pipeline typically looks like this:
- CI pipeline builds, tests, and pushes a new container image.
- GitOps tool (Argo CD, Flux) detects the image change and updates the Rollout manifest in the cluster.
- Argo Rollouts creates the canary ReplicaSet and sets the initial weight (5%).
- AnalysisRun queries Prometheus for error rate and latency.
- Traffic weight increases through configured steps (5% -> 25% -> 50%).
- A manual gate pauses at 50% for human review of dashboards.
- After promotion, traffic shifts to 100% and the old ReplicaSet scales down.
- Notification sent to Slack/Teams via Argo Rollouts notification controller.
If analysis fails at any step, the rollout aborts, traffic returns to stable, and an alert fires to the on-call channel.
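The notification step is configured through the controller's notification ConfigMap (Argo Rollouts reuses the Argo notifications engine). A minimal sketch; the template wording is illustrative and the Slack token is expected to live in the companion notification Secret:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argo-rollouts-notification-configmap
  namespace: argo-rollouts
data:
  service.slack: |
    token: $slack-token   # resolved from the notification secret
  template.rollout-aborted: |
    message: Rollout {{.rollout.metadata.name}} aborted; traffic returned to stable.
  trigger.on-rollout-aborted: |
    - send: [rollout-aborted]
```

A Rollout then subscribes with an annotation such as `notifications.argoproj.io/subscribe.on-rollout-aborted.slack: <channel>`.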
Common Pitfalls
- Insufficient analysis duration: Running analysis for only 30 seconds may not catch latency regressions that only appear under sustained load. Use at least 2-5 minutes per analysis step.
- Missing baseline comparison: Checking if error rate < 1% is good, but checking if error rate increased compared to the stable version is better. Use Kayenta-style comparative analysis when possible.
- Not testing the rollback path: Ensure your application handles version rollbacks gracefully (database migrations, schema changes, cache compatibility).
- Pod-ratio splitting without enough traffic: If you only get 10 requests per minute, a 5% canary weight means the canary might receive zero requests during an analysis window, leading to inconclusive results.
- Forgetting to configure anti-affinity: Canary and stable pods on the same node can mask infrastructure issues.
- Schema migrations in canary deployments: If v2 requires a database schema change that breaks v1, canary splitting will cause errors in the stable version. Use expand-and-contract migration patterns.
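The baseline-comparison pitfall above can be mitigated even without Kayenta by comparing canary and stable in a single PromQL expression. A sketch assuming both versions expose `http_requests_total` under distinct service labels (label values are illustrative):

```yaml
# A metric entry for an AnalysisTemplate's spec.metrics list
- name: error-rate-vs-stable
  interval: 60s
  count: 5
  failureLimit: 1
  # Fail if the canary's 5xx ratio exceeds stable's by more than 0.5 points
  successCondition: result[0] <= 0.005
  provider:
    prometheus:
      address: http://prometheus.monitoring:9090
      query: |
        (sum(rate(http_requests_total{service="store-api-canary", status=~"5.."}[2m]))
           / sum(rate(http_requests_total{service="store-api-canary"}[2m])))
        -
        (sum(rate(http_requests_total{service="store-api-stable", status=~"5.."}[2m]))
           / sum(rate(http_requests_total{service="store-api-stable"}[2m])))
```

This catches regressions relative to the running baseline, not just absolute threshold breaches.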
Best Practices
- Start with blue-green if you want the simplest rollback model. Graduate to canary when you need more granular traffic control and cost efficiency.
- Use a service mesh (Istio, Linkerd) for precise request-level traffic splitting rather than relying on pod ratios.
- Always include a manual gate before promoting past 50% in high-risk services (payments, authentication).
- Define AnalysisTemplates centrally and reuse them across Rollouts for consistency. Store them in a shared namespace or Helm library chart.
- Set abortScaleDownDelaySeconds to give in-flight requests time to complete before canary pods are terminated.
- Monitor the rollout controller itself — if Argo Rollouts is down, rollouts will stall. Set up alerts for controller health.
- Use experiments (Argo Rollouts Experiment resource) to run A/B tests with dedicated baseline and canary ReplicaSets for clean comparison.
What's Next?
- See how the Gateway API provides native traffic splitting for canary deployments without a service mesh.
- Learn about Observability to set up the Prometheus metrics that power AnalysisTemplates.
- Explore Policy as Code to enforce that all production Rollouts include AnalysisTemplates.
- Understand Pod Security to ensure canary pods run with the same security constraints as stable pods.