Vertical Pod Autoscaler (VPA)
- Vertical Scaling: VPA automatically adjusts the CPU and memory requests (and optionally limits) of individual pods based on observed usage, eliminating the guesswork of manual right-sizing.
- Three Components: VPA consists of the Recommender (analyzes metrics and computes recommendations), the Updater (evicts pods that are significantly mis-sized), and the Admission Controller (mutates pod specs at creation time to apply recommendations).
- Update Modes: Off (recommendation only -- safe to start with), Initial (applies recommendations only at pod creation), and Auto/Recreate (evicts and recreates running pods to apply new resource values).
- Complementary to HPA: VPA scales vertically (adjusting per-pod resources) while HPA scales horizontally (adjusting replica count). They should not target the same metric (e.g., CPU) simultaneously, but they can coexist when HPA uses custom metrics.
- Cost Optimization: VPA reduces cloud spend by 20-50% in many organizations by eliminating over-provisioned resource requests that pad capacity "just in case."
- Key Limitation: VPA must restart pods to change their resource requests (in-place pod resize is a separate alpha feature in K8s 1.27+). Ensure sufficient replicas and PodDisruptionBudgets to maintain availability during updates.
While HPA adds more Pods (horizontal scaling), VPA makes existing Pods larger or smaller by adjusting their CPU and Memory requests. This is essential for workloads that cannot scale horizontally -- databases, single-threaded applications, and stateful services where adding replicas is complex or impossible.
1. The Problem VPA Solves
Most Kubernetes resource requests are set once at deployment time and never updated. Engineers either:
- Over-provision (set requests too high): Pods reserve resources they never use, wasting cluster capacity and increasing cloud costs.
- Under-provision (set requests too low): Pods get throttled (CPU) or killed (OOMKilled for memory), causing latency spikes and outages.
In practice, analyses of production clusters commonly find that 50-80% of pods are over-provisioned by 2x or more. VPA continuously monitors actual resource consumption and adjusts requests to match real usage patterns, with a safety margin.
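You can spot this gap yourself by putting requests and actual usage side by side. A sketch (the `production` namespace is an example; `kubectl top` requires metrics-server):

```shell
# Requested CPU/memory per pod (what the scheduler reserves):
kubectl get pods -n production -o custom-columns=\
'NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'

# Actual usage right now (requires metrics-server):
kubectl top pods -n production
```

Pods whose `kubectl top` numbers sit far below their requests are the over-provisioned ones VPA targets.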
2. HPA vs. VPA
| Aspect | HPA | VPA |
|---|---|---|
| Scaling Direction | Horizontal (add/remove pods) | Vertical (resize individual pods) |
| Best For | Stateless, horizontally scalable apps | Stateful apps, databases, single-threaded workloads |
| Metric | CPU, memory, custom metrics | CPU and memory usage history |
| Disruption | No restarts (adds new pods) | Restarts pods (to apply new requests) |
| Speed | Seconds to add a pod | Minutes (must evict and recreate) |
| Conflict | Must not scale on a metric VPA controls | Must not control a metric HPA scales on |
Using HPA and VPA together: You can run both HPA and VPA for the same Deployment, but they must not target the same metric. A common pattern is: HPA scales on a custom metric (e.g., requests-per-second from Prometheus) while VPA adjusts CPU and memory requests. Alternatively, run VPA in Off mode purely for recommendations while HPA handles autoscaling.
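A sketch of that pattern (assuming a metrics adapter such as the Prometheus adapter exposes a `http_requests_per_second` pods metric; all names here are illustrative):

```yaml
# HPA scales replica count on a custom metric...
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # exposed via a metrics adapter
        target:
          type: AverageValue
          averageValue: "100"
---
# ...while VPA right-sizes each pod's CPU and memory requests
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"
```

Because HPA reacts to request rate rather than CPU, raising or lowering CPU requests via VPA does not feed back into the replica-count decision.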
3. VPA Components
VPA is not a single controller -- it is three separate components that work together.
Recommender
The Recommender is the brain of VPA. It watches pod resource usage metrics from the Metrics API (or Prometheus) and computes recommended CPU and memory values. The algorithm:
- Collects CPU and memory usage samples over a rolling window (default: 8 days of history).
- Computes a histogram of usage values, weighting recent samples more heavily (exponential decay with a half-life of 24 hours).
- Sets the target recommendation at the 90th percentile of usage (configurable).
- Sets the lower bound at a safety margin below the target.
- Sets the upper bound as a cap to prevent runaway recommendations.
- Sets the uncapped target as the raw recommendation before min/max constraints are applied.
Updater
The Updater watches for pods whose current resource requests differ significantly from the VPA recommendation. When a pod is outside the recommended range (below the lower bound or above the upper bound), the Updater evicts it. The pod's controller (Deployment, StatefulSet) recreates it, and the Admission Controller applies the new recommendations at creation time.
The Updater respects PodDisruptionBudgets and will not evict pods if doing so would violate the PDB. It also rate-limits evictions to avoid cascading disruptions.
Admission Controller
The VPA Admission Controller is a mutating webhook that intercepts pod creation requests. When a pod is created that matches a VPA object, the webhook modifies the pod's resource requests (and optionally limits) to match the current VPA recommendation before the pod is admitted to the cluster.
VPA Architecture:
```text
Metrics API ──> Recommender ──> VPA Status (recommendations)
                                        |
                                        v
                          Updater (evicts mis-sized pods)
                                        |
                                        v
                          Pod Controller recreates pod
                                        |
                                        v
                          Admission Controller (mutates requests)
                                        |
                                        v
                          Pod runs with optimal resources
```
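Assuming a standard install into `kube-system` (component names can vary by distribution or Helm chart), you can confirm all three components are running:

```shell
kubectl get pods -n kube-system | grep vpa
# Typically shows pods for the recommender, updater, and admission
# controller, e.g. vpa-recommender-..., vpa-updater-...,
# vpa-admission-controller-...
```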
4. VPA Modes
Off (Recommendation Only)
VPA computes and stores recommendations in the VPA object's .status.recommendation field, but takes no action. This is the safest way to start -- you can review recommendations before trusting VPA to act.
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"      # Recommendation only -- no pod changes
  resourcePolicy:
    containerPolicies:
      - containerName: "app"
        minAllowed:
          cpu: "50m"       # Never recommend less than 50m CPU
          memory: "64Mi"   # Never recommend less than 64Mi memory
        maxAllowed:
          cpu: "4"         # Never recommend more than 4 CPUs
          memory: "8Gi"    # Never recommend more than 8Gi memory
        controlledResources:
          - cpu
          - memory
```
Check the recommendation:
```shell
kubectl describe vpa my-app-vpa -n production
# Look for the "Recommendation" section in the status:
#   Target:      cpu=250m, memory=512Mi
#   Lower Bound: cpu=200m, memory=400Mi
#   Upper Bound: cpu=1, memory=2Gi
```
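For scripting, the same numbers can be pulled from the VPA object's status with a JSONPath query (a sketch, assuming the `my-app-vpa` object from above):

```shell
kubectl get vpa my-app-vpa -n production \
  -o jsonpath='{.status.recommendation.containerRecommendations[0].target}'
# Prints the target for the first container, e.g.:
# {"cpu":"250m","memory":"512Mi"}
```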
Initial
VPA applies recommendations only when pods are first created. It does not evict running pods. This is useful for workloads where the initial resource guess is unreliable, but you do not want VPA to cause restarts during operation.
```yaml
spec:
  updatePolicy:
    updateMode: "Initial"   # Apply at pod creation only
```
Auto / Recreate
VPA will actively evict pods that are significantly mis-sized and rely on the controller to recreate them with updated resources. Auto and Recreate behave identically in current versions (future versions may support in-place resize for Auto).
```yaml
spec:
  updatePolicy:
    updateMode: "Auto"   # Evict and recreate pods as needed
    minReplicas: 2       # Require at least 2 replicas before evicting
```
5. How VPA Calculates Recommendations
The VPA Recommender uses a decaying histogram algorithm, not a simple average. This matters because resource usage is typically spiky and non-uniform:
- Sample collection: Every 60 seconds, the Recommender reads CPU and memory usage from the Metrics API.
- Histogram bucketing: Samples are placed into exponential histogram buckets. CPU buckets range from 1m to 1000 cores; memory buckets from 1Mi to 1Ti.
- Exponential decay: Older samples are weighted less than recent ones. The default half-life is 24 hours, meaning a sample from yesterday has half the weight of a sample from today.
- Percentile estimation: The target recommendation is set at the configured percentile (default: 90th for CPU, 90th for memory). This means the recommendation will cover 90% of observed usage patterns.
- Safety margin: A configurable margin (default: 15% for CPU, 20% for memory) is added on top of the percentile estimate.
- Bound clamping: The final recommendation is clamped to the `minAllowed` and `maxAllowed` values specified in the `resourcePolicy`.
The 8-day rolling window means that VPA takes about a week of observation before its recommendations stabilize. During this period, recommendations may fluctuate.
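The decay weighting and percentile steps can be sketched in a few lines of Python. This is an illustration of the idea, not the actual VPA Recommender code: each sample's weight halves every 24 hours, and the target is the smallest usage value covering the chosen fraction of total weight.

```python
# Illustrative sketch of VPA-style decayed-percentile estimation.
# NOT the actual VPA implementation -- just the core idea: weight each
# usage sample by 0.5^(age / half-life), then take a weighted percentile.

HALF_LIFE_HOURS = 24.0  # default half-life used by the Recommender

def decay_weight(age_hours: float) -> float:
    """Weight of a sample that is age_hours old (halves every 24h)."""
    return 0.5 ** (age_hours / HALF_LIFE_HOURS)

def weighted_percentile(samples: list[tuple[float, float]], p: float) -> float:
    """samples: (usage_value, age_hours) pairs. Returns the smallest
    usage value that covers fraction p of the total decayed weight."""
    weighted = sorted((value, decay_weight(age)) for value, age in samples)
    total = sum(w for _, w in weighted)
    cumulative = 0.0
    for value, weight in weighted:
        cumulative += weight
        if cumulative >= p * total:
            return value
    return weighted[-1][0]

# Ten week-old samples at 100m CPU vs three recent spikes to 500m:
# the recent samples dominate, so the 90th-percentile target is 500m.
samples = [(100, 7 * 24)] * 10 + [(500, 1)] * 3
target = weighted_percentile(samples, 0.90)   # 500
cpu_recommendation = target * 1.15            # plus ~15% safety margin
```

This is why a one-off spike yesterday moves the recommendation far more than sustained usage a week ago, and why recommendations drift downward again as spiky samples age out.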
6. VPA YAML Example: Complete Configuration
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: backend-api-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend-api
  updatePolicy:
    updateMode: "Auto"
    minReplicas: 2   # Don't evict if fewer than 2 replicas
  resourcePolicy:
    containerPolicies:
      - containerName: "api-server"
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "8"
          memory: "16Gi"
        controlledResources:
          - cpu
          - memory
        controlledValues: RequestsAndLimits   # Adjust both requests and limits
      - containerName: "sidecar-proxy"
        mode: "Off"   # Don't touch the sidecar's resources
---
# Ensure a PDB exists to protect availability during VPA evictions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend-api-pdb
  namespace: production
spec:
  minAvailable: 1   # At least 1 pod must remain running
  selector:
    matchLabels:
      app: backend-api
```
7. Goldilocks: VPA Visualization and Dashboard
Goldilocks (by Fairwinds) is an open-source tool that creates a VPA in Off mode for every Deployment in a labeled namespace and presents the recommendations in a web dashboard. It answers the question: "What should I set my resource requests to?"
```shell
# Add the Fairwinds Helm repo (if not already added), then install Goldilocks
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install goldilocks fairwinds-stable/goldilocks \
  --namespace goldilocks --create-namespace

# Label a namespace to enable Goldilocks
kubectl label namespace production goldilocks.fairwinds.com/enabled=true

# Access the dashboard
kubectl port-forward -n goldilocks svc/goldilocks-dashboard 8080:80
```
Goldilocks shows per-container recommendations for CPU and memory across all Deployments in labeled namespaces, making it easy to identify the most over-provisioned workloads and prioritize right-sizing efforts.
8. Limitations and Considerations
- Pod restarts are required. VPA cannot change resource requests on a running pod (the in-place pod resize feature in Kubernetes 1.27+ is a separate alpha feature and is not part of VPA). Every adjustment requires evicting and recreating the pod.
- Not suitable for short-lived pods. Pods that live for minutes (batch jobs, CI runners) do not generate enough history for meaningful recommendations.
- Single-replica risks. If a Deployment has only one replica, VPA in Auto mode will cause downtime when it evicts that pod. Always use `minReplicas: 2` or a PDB.
- HPA conflict on same metric. If HPA scales on CPU and VPA also adjusts CPU requests, they can fight: VPA raises the request, which lowers the CPU utilization percentage, which causes HPA to scale in, which increases per-pod load, which causes VPA to raise the request further. Avoid this feedback loop.
- Memory-only mode is safer. Consider controlling only memory with VPA (`controlledResources: [memory]`) and letting HPA handle CPU-based scaling.
- Initial instability. During the first 24-48 hours, VPA recommendations can swing significantly as the histogram fills. Use `Off` mode initially to observe.
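A minimal sketch of the memory-only pattern (names are illustrative; `containerName: "*"` applies the policy to all containers):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa-memory
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources:
          - memory   # VPA manages memory only; HPA can scale on CPU
```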
9. Common Pitfalls
- Not installing the Admission Controller. Without the mutating webhook, VPA can evict pods but the recreated pods will come back with the original (wrong) resource requests. This causes a restart loop with no benefit.
- Setting maxAllowed too low. If the workload genuinely needs 4Gi of memory but `maxAllowed` is set to 2Gi, VPA will cap its recommendation at 2Gi and the pod will continue to be OOMKilled.
- Missing PodDisruptionBudget. VPA respects PDBs, but if you do not create one, VPA can evict all replicas simultaneously, causing a complete outage.
- Ignoring the VPA Updater's resource needs. The Updater and Recommender are Deployments that need their own resource requests. In large clusters with hundreds of VPA objects, they can consume significant CPU and memory.
- VPA on DaemonSets. VPA does not officially support DaemonSets as target references. While it may work in some configurations, it is not tested or recommended.
- Confusing requests and limits. VPA adjusts `requests` by default, not `limits`. If your pods have limits set to the same value as requests (Guaranteed QoS), use `controlledValues: RequestsAndLimits` to keep them in sync.
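To guard against the first pitfall, verify the mutating webhook is actually registered. The exact object name depends on your install method; `vpa-webhook-config` is the name used by the upstream manifests:

```shell
kubectl get mutatingwebhookconfigurations | grep -i vpa
# Expect an entry such as "vpa-webhook-config". If it is missing,
# evicted pods will be recreated with their original requests.
```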
10. What's Next?
- HPA Deep Dive: Understand horizontal autoscaling mechanics and how to combine HPA with VPA safely.
- Resource Requests and Limits: Review how Kubernetes QoS classes (Guaranteed, Burstable, BestEffort) are determined and how they affect scheduling and eviction.
- Evictions: Learn how node-pressure eviction interacts with pod resource requests, which VPA directly affects. See Evictions.
- In-Place Pod Resize: Track the Kubernetes KEP for in-place pod resource updates, which will eventually allow VPA to adjust resources without restarting pods.
- Cluster Autoscaler: VPA adjusts pod resource requests, which may trigger the Cluster Autoscaler to add or remove nodes. Understand the interaction between these two systems.