Resources: Requests, Limits & HPA
- Requests vs. Limits: Requests are a guaranteed resource floor used by the scheduler for placement decisions. Limits are a hard ceiling -- CPU exceeding the limit is throttled, memory exceeding the limit triggers an OOMKill.
- QoS Classes: Every pod is assigned a QoS class (Guaranteed, Burstable, or BestEffort) based on its resource configuration. This class determines eviction priority when a node runs out of memory.
- Auto-Scaling Logic: The Horizontal Pod Autoscaler (HPA) uses the formula
  desiredReplicas = ceil(currentReplicas * (currentMetric / targetMetric))
  and supports resource metrics, custom metrics, and external metrics. HPA v2 has been stable since Kubernetes 1.23.
- Resource Governance: ResourceQuotas cap total namespace resource consumption. LimitRanges enforce default and maximum values per-container, preventing unconstrained pods from being created.
- Right-sizing: Setting requests too high wastes cluster capacity. Setting them too low causes scheduling failures and evictions. Use observability data to right-size over time.
Before you can auto-scale, you must define how much CPU and RAM your application needs. Getting this right is one of the most impactful things you can do for both reliability and cost efficiency.
1. Requests vs Limits
Every container in a Pod can specify requests and limits for CPU and memory.
Requests (The "Guarantee")
- What it means: "I need at least this much to function."
- Scheduler behavior: The Kubernetes Scheduler sums the requests of all containers on a node and only places a new pod on a node that has enough allocatable capacity remaining. If no node has enough free capacity, the pod stays Pending.
- Runtime behavior: The Linux kernel uses requests to set CPU shares (via cgroups). A container with cpu: 500m gets proportionally more CPU time than one with cpu: 250m when the node is contended.
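The shares mapping can be sketched in a few lines. This is a simplified version of the kubelet's cgroup v1 conversion (the constants match the cgroup convention; treat the function name as illustrative, not the kubelet's API):

```python
MIN_SHARES = 2          # kernel-enforced minimum for cpu.shares
SHARES_PER_CPU = 1024   # cgroup convention: 1 full core == 1024 shares

def milli_cpu_to_shares(milli_cpu: int) -> int:
    """Convert a CPU request in millicores (250 for '250m') to cpu.shares."""
    if milli_cpu == 0:
        return MIN_SHARES
    return max(MIN_SHARES, (milli_cpu * SHARES_PER_CPU) // 1000)

print(milli_cpu_to_shares(250))  # 256
print(milli_cpu_to_shares(500))  # 512
```

So under contention, the 500m container gets twice the CPU time of the 250m container, because 512 shares outweigh 256.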
Limits (The "Ceiling")
- What it means: "The container must never exceed this."
- CPU behavior: When a container tries to use more CPU than its limit, the kernel throttles it. The process slows down but continues running. Throttling is invisible to the application -- there are no errors, just increased latency.
- Memory behavior: When a container allocates more memory than its limit, the kernel OOMKills the process. The container is terminated and restarted according to the pod's restartPolicy. This produces a visible OOMKilled status in kubectl get pods.
apiVersion: v1
kind: Pod
metadata:
name: resource-demo
spec:
containers:
- name: app
image: myapp:v1
resources:
requests:
cpu: "250m" # 1000m = 1 full CPU core. 250m = 0.25 cores.
memory: "256Mi" # 256 Mebibytes (binary). Use Mi, not MB.
limits:
cpu: "500m" # Throttled above this. Process slows down.
memory: "512Mi" # OOMKilled above this. Process is terminated.
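The suffixes in the manifest above follow two different systems: binary (Ki, Mi, Gi) and decimal (k, M, G), plus "m" for millicores. A toy parser (illustrative only, not the Kubernetes resource.Quantity implementation) makes the difference concrete:

```python
# Binary vs decimal suffixes; "m" is millicores for CPU.
SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3,  # binary (use these for memory)
    "k": 1000, "M": 1000**2, "G": 1000**3,     # decimal
    "m": 0.001,                                # millicores (CPU)
}

def parse_quantity(q: str) -> float:
    # Try longer suffixes first so "Mi" is not mistaken for "M".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if q.endswith(suffix):
            return float(q[: -len(suffix)]) * SUFFIXES[suffix]
    return float(q)  # plain number, e.g. cpu: "2"

print(parse_quantity("250m"))   # 0.25 cores
print(parse_quantity("128Mi"))  # 134217728.0 bytes
print(parse_quantity("128M"))   # 128000000.0 bytes
```

The 128Mi/128M gap is about 6 MB per container, which adds up across a large deployment and explains why mixing the two suffixes leads to confusing capacity math.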
CPU vs Memory: A Fundamental Difference
Understanding this distinction is essential:
| | CPU | Memory |
|---|---|---|
| Nature | Compressible (can be throttled) | Incompressible (cannot be "taken back") |
| Over-limit behavior | Throttling (slower execution) | OOMKill (container terminated) |
| Under-request impact | Pod gets less CPU time under contention | Pod may be evicted by kubelet |
| Symptom of wrong setting | Increased latency, timeout errors | CrashLoopBackOff with OOMKilled status |
Because CPU is compressible, some teams choose not to set CPU limits at all, allowing containers to burst and use idle CPU on the node. This is a valid strategy when you trust your workloads and prefer throughput over predictability. Memory limits, on the other hand, should almost always be set because there is no graceful degradation for memory overuse.
2. Quality of Service (QoS) Classes
Kubernetes assigns a QoS class to every Pod based on its resource configuration. This determines which pod gets killed first when the node runs out of memory.
Guaranteed (Last to be evicted)
Criteria: Every container in the pod must have requests equal to limits for both CPU and memory.
# Guaranteed QoS: requests == limits for all containers
resources:
requests:
cpu: "500m"
memory: "256Mi"
limits:
cpu: "500m" # Same as request
memory: "256Mi" # Same as request
These pods have the highest priority. They are the last to be evicted and receive the most predictable performance because the kernel never needs to throttle or reclaim resources.
Burstable (Evicted under pressure)
Criteria: At least one container has a request or limit set, but the pod does not meet the Guaranteed criteria (for example, requests lower than limits, or limits missing entirely).
# Burstable QoS: requests < limits
resources:
requests:
cpu: "250m"
memory: "128Mi"
limits:
cpu: "1000m" # Can burst to 4x the request
memory: "512Mi"
When the node is under memory pressure, the kubelet evicts Burstable pods that are using more memory than their request before evicting Guaranteed pods.
BestEffort (First to be evicted)
Criteria: No requests or limits set at all.
# BestEffort QoS: no resource configuration
resources: {}
These pods are the first to be evicted. In production, you should almost never have BestEffort pods -- they make capacity planning impossible and are unreliable under load.
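The classification rules for all three classes can be summarized in a short sketch (simplified from the kubelet's actual QoS logic; the function and data shapes here are illustrative):

```python
def qos_class(containers: list[dict]) -> str:
    """Each container dict may carry "requests" and "limits" sub-dicts
    with "cpu"/"memory" quantity strings."""
    requests = [c.get("requests", {}) for c in containers]
    limits = [c.get("limits", {}) for c in containers]
    # No resources anywhere -> BestEffort.
    if all(not r and not l for r, l in zip(requests, limits)):
        return "BestEffort"
    # Every container has requests == limits for both CPU and memory -> Guaranteed.
    guaranteed = all(
        r.get(res) is not None and r.get(res) == l.get(res)
        for r, l in zip(requests, limits)
        for res in ("cpu", "memory")
    )
    return "Guaranteed" if guaranteed else "Burstable"

print(qos_class([{"requests": {"cpu": "500m", "memory": "256Mi"},
                  "limits":   {"cpu": "500m", "memory": "256Mi"}}]))  # Guaranteed
print(qos_class([{"requests": {"cpu": "250m"}}]))                     # Burstable
print(qos_class([{}]))                                                # BestEffort
```

Note that the rules apply pod-wide: a single container with mismatched requests and limits demotes the entire pod from Guaranteed to Burstable.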
3. Horizontal Pod Autoscaling (HPA)
Once requests are set, HPA can automatically scale the number of Pods based on observed utilization.
The Prerequisite: Metrics Server
HPA is a consumer of data, not a producer. By default, Kubernetes does not know how much CPU or memory a pod is using. You must install the Metrics Server in your cluster.
- The Metrics Server scrapes resource usage data from the kubelet on every node.
- HPA queries the Metrics API (metrics.k8s.io) to make its scaling decisions.
- If kubectl top pods returns an error, your Metrics Server is not working and HPA will not function.
The Algorithm
The HPA controller runs a control loop (default every 15 seconds) and calculates the desired replica count:
desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
Example: You have 3 replicas, average CPU utilization is 75%, and your target is 50%.
desiredReplicas = ceil[3 * (75 / 50)] = ceil[4.5] = 5
The HPA scales from 3 to 5 replicas. On the next check, if average utilization drops sufficiently below the target (outside the controller's ~10% tolerance band), the HPA calculates that fewer replicas are needed and scales down.
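The calculation can be sketched in a few lines, including the default 10% tolerance that suppresses scaling on small deviations (a simplified model: the real controller also averages per-pod metrics and handles not-yet-ready pods):

```python
import math

def desired_replicas(current: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """HPA core formula with the default tolerance band."""
    ratio = current_metric / target_metric
    if abs(1.0 - ratio) <= tolerance:
        return current  # close enough to target: no change
    return math.ceil(current * ratio)

print(desired_replicas(3, 75, 50))  # 5  (the worked example above)
print(desired_replicas(5, 47, 50))  # 5  (within tolerance: no scale-down)
print(desired_replicas(5, 40, 50))  # 4  (scale down)
```

The tolerance band is why HPA does not react to every small wobble in utilization; only deviations beyond roughly 10% of the target trigger a recalculation.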
Metric Types
HPA v2 (stable since Kubernetes 1.23) supports four types of metrics:
| Type | Description | Example |
|---|---|---|
| Resource | CPU or memory utilization from metrics-server | Average CPU at 50% |
| Pods | Custom metric per-pod from your app | Queue depth per pod |
| Object | Metric from a Kubernetes object | Ingress requests-per-second |
| External | Metric from an external system | AWS SQS queue length |
Scaling Behavior Configuration
HPA v2 allows you to control how fast scaling happens, which prevents oscillation (rapid scale-up/scale-down cycles):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: web-app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
minReplicas: 2
maxReplicas: 20
metrics:
# Scale on CPU utilization
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 60 # Target 60% of CPU request
# Also scale on memory
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 75
behavior:
scaleUp:
stabilizationWindowSeconds: 60 # Wait 60s before scaling up again
policies:
- type: Pods
value: 4 # Add at most 4 pods per 60s
periodSeconds: 60
- type: Percent
value: 100 # Or double the pods, whichever is higher
periodSeconds: 60
selectPolicy: Max # Use the policy that allows more replicas
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 minutes before scaling down
policies:
- type: Percent
value: 25 # Remove at most 25% of pods per 60s
periodSeconds: 60
The stabilizationWindowSeconds for scale-down is important. Without it, a brief dip in traffic causes the HPA to scale down, then the remaining pods get overwhelmed and it scales back up, creating a flapping cycle.
Custom Metrics Example
To scale on application-specific metrics (e.g., requests per second from Prometheus), you need a metrics adapter like prometheus-adapter:
metrics:
- type: Pods
pods:
metric:
name: http_requests_per_second # Exposed by prometheus-adapter
target:
type: AverageValue
averageValue: "1000" # Target 1000 req/s per pod
Real-World Scenario: Scaling a Web Service
Imagine you run an e-commerce API. Normal traffic is 500 requests/second, but during flash sales it spikes to 5,000 requests/second.
- Baseline: 3 replicas, each handling ~170 req/s, CPU at 30%.
- Flash sale starts: Traffic jumps to 2,000 req/s. CPU rises to 80%.
- HPA reacts: ceil[3 * (80/60)] = 4. Scales to 4 replicas.
- Traffic continues rising: CPU still at 75%. ceil[4 * (75/60)] = 5. Scales to 5.
- Peak traffic: Eventually reaches 10 replicas, each handling ~500 req/s at 55% CPU.
- Traffic subsides: After 5 minutes of low utilization (stabilization window), HPA scales down gradually (25% at a time).
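The scenario above can be replayed as a toy simulation. The assumption here is that total CPU demand (expressed in "percent of one pod's request") is divided evenly across replicas; the real controller additionally applies tolerance and stabilization windows:

```python
import math

def step(replicas: int, total_demand: float, target: float = 60) -> int:
    """One HPA iteration: demand spread over replicas vs. a 60% target."""
    per_pod_util = total_demand / replicas
    return math.ceil(replicas * per_pod_util / target)

replicas = 3
for demand in (240, 300, 550):  # rising traffic, in pod-percent units
    replicas = step(replicas, demand)
    print(replicas)
# 4, then 5, then 10 -- matching the trace in the scenario
```

Note the last jump: at 5 replicas the per-pod utilization is 110%, so a single iteration doubles capacity to 10 replicas, where utilization settles at 55% and stays within the tolerance band.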
4. Namespace Resource Governance
Admins can enforce rules to prevent one team from consuming the entire cluster.
ResourceQuota (The Hard Cap)
Limits the total resource usage across all pods in a namespace:
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-a-quota
namespace: team-a
spec:
hard:
requests.cpu: "10" # Total CPU requests across all pods
requests.memory: "20Gi" # Total memory requests
limits.cpu: "20" # Total CPU limits
limits.memory: "40Gi" # Total memory limits
pods: "50" # Maximum number of pods
services: "20" # Maximum number of services
persistentvolumeclaims: "10" # Maximum PVCs
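The admission behavior of a quota like the one above can be sketched as a simple check: add the new pod's requests to current namespace usage and reject if any hard cap would be exceeded (a hypothetical helper with quantities pre-converted to numeric units; real quota tracking also covers object counts):

```python
def admit(pod_requests: dict, used: dict, hard: dict) -> bool:
    """Reject the pod if any quota dimension would overflow its cap."""
    return all(used.get(k, 0) + v <= hard[k] for k, v in pod_requests.items())

hard = {"requests.cpu": 10.0, "requests.memory": 20 * 1024**3}
used = {"requests.cpu": 8.5, "requests.memory": 12 * 1024**3}

print(admit({"requests.cpu": 2.0}, used, hard))  # False: 10.5 > 10 cores
print(admit({"requests.cpu": 1.0}, used, hard))  # True: 9.5 <= 10 cores
```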
When a ResourceQuota covers compute resources (such as requests.cpu or limits.memory), every pod in the namespace must specify the corresponding requests and limits. If a user tries to create a pod without them, the API server rejects it. This is where LimitRange becomes essential.
LimitRange (The Defaults and Constraints)
Sets default values for pods that do not specify resources, and enforces min/max boundaries:
apiVersion: v1
kind: LimitRange
metadata:
name: team-a-limits
namespace: team-a
spec:
limits:
- type: Container
default: # Applied if no limit is specified
cpu: "500m"
memory: "256Mi"
defaultRequest: # Applied if no request is specified
cpu: "100m"
memory: "128Mi"
max: # No container can request more than this
cpu: "4"
memory: "8Gi"
min: # No container can request less than this
cpu: "50m"
memory: "64Mi"
This prevents BestEffort pods from being accidentally created and stops individual containers from requesting disproportionate resources.
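The defaulting half of LimitRange behavior can be sketched as a dictionary fill-in (simplified: real admission also validates min/max bounds and request-to-limit ratios; the function name is illustrative):

```python
def apply_limit_range(container: dict, lr: dict) -> dict:
    """Fill in default limits/requests where the container left them unset."""
    res = container.setdefault("resources", {})
    limits = res.setdefault("limits", {})
    requests = res.setdefault("requests", {})
    for key, val in lr["default"].items():
        limits.setdefault(key, val)        # default limit if none given
    for key, val in lr["defaultRequest"].items():
        requests.setdefault(key, val)      # default request if none given
    return container

lr = {"default": {"cpu": "500m", "memory": "256Mi"},
      "defaultRequest": {"cpu": "100m", "memory": "128Mi"}}
pod = apply_limit_range({"name": "app"}, lr)
print(pod["resources"]["requests"]["cpu"])  # 100m
```

A pod created with no resources block therefore lands in the Burstable class (requests below limits) rather than BestEffort.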
Right-Sizing Strategies
Setting accurate requests is an iterative process. Over-provisioning wastes money; under-provisioning causes instability.
- Start with estimates. Use load testing or developer knowledge to set initial values. For a typical web application, cpu: 250m and memory: 256Mi is a reasonable starting point.
- Deploy and observe. Use kubectl top pods and your monitoring stack (Prometheus/Grafana) to measure actual usage over days or weeks. Single-day observations miss weekly traffic patterns.
- Adjust requests to P95 usage. Your request should cover the 95th percentile of actual usage. This handles normal spikes while avoiding over-provisioning.
- Set limits at 2-3x the request for CPU (or omit CPU limits entirely). For memory, set limits at 1.5-2x the request to accommodate temporary spikes while still capping runaway processes.
- Use Prometheus queries for data-driven decisions. The query container_memory_working_set_bytes{pod=~"myapp.*"} over a 7-day window gives you the real memory footprint. Compare this to your requests to find over- or under-provisioned workloads.
- Automate with VPA. The Vertical Pod Autoscaler can observe usage and recommend or automatically adjust requests. See the VPA documentation for details.
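The P95 step can be sketched as follows (hedged: the samples here are hard-coded in Mi for illustration; in practice you would pull them from Prometheus, and the 1.5x memory multiplier is one point in the 1.5-2x range suggested above):

```python
import math

def p95(samples: list[float]) -> float:
    """95th percentile via the nearest-rank method."""
    ordered = sorted(samples)
    idx = math.ceil(0.95 * len(ordered)) - 1
    return ordered[max(0, idx)]

usage_mi = [180, 200, 210, 220, 230, 240, 250, 260, 300, 410]
request = p95(usage_mi)
limit = round(request * 1.5)  # memory limit at ~1.5x the request
print(f"requests.memory: {request}Mi  limits.memory: {limit}Mi")
# requests.memory: 410Mi  limits.memory: 615Mi
```

Note how the single 410Mi spike dominates the request: if that spike is a known one-off (a cache warm-up, say), you may prefer a lower percentile plus a more generous limit instead.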
Common Pitfalls
1. Not setting any resource requests. Without requests, pods are BestEffort and get evicted first. The scheduler also cannot make informed placement decisions, leading to overloaded nodes.
2. Setting CPU limits too low. Aggressive CPU limits cause throttling even when the node has idle CPU. This manifests as mysterious latency spikes that are hard to diagnose. Use container_cpu_cfs_throttled_periods_total in Prometheus to detect this.
3. Setting memory requests much lower than actual usage. The kubelet evicts pods using more memory than their request when the node is under pressure. If your request is 256Mi but your app regularly uses 400Mi, it will be evicted during any node memory contention.
4. Confusing Mi and M. 128Mi is 128 Mebibytes (134,217,728 bytes). 128M is 128 Megabytes (128,000,000 bytes). Always use the Mi suffix to match how Kubernetes displays and computes values internally.
5. Forgetting to install Metrics Server for HPA. HPA silently does nothing without metrics. Check with kubectl get apiservices | grep metrics to verify.
6. HPA and VPA conflicting on the same resource. Do not use HPA and VPA on the same metric (e.g., both targeting CPU). HPA adjusts replica count while VPA adjusts per-pod requests, and they can fight each other. If you use both, configure VPA in "recommend only" mode for CPU and let HPA handle scaling.
Best Practices
- Always set memory requests and limits. Memory is incompressible and OOMKills are disruptive. Every production container should have memory limits.
- Set CPU requests, but consider omitting CPU limits. This lets containers burst when CPU is available. Monitor throttling metrics to validate this approach for your workloads.
- Use HPA with conservative scale-down settings. A 5-minute stabilizationWindowSeconds for scale-down prevents flapping and premature removal of capacity.
- Combine HPA with PodDisruptionBudgets. HPA manages replica count, but PDBs ensure availability during rolling updates and node drains. They complement each other.
- Set ResourceQuotas on all shared namespaces. Without quotas, one team's runaway deployment can starve the entire cluster.
- Review resource utilization monthly. Workloads change over time. A service that needed 2 CPU at launch may only use 500m six months later (or vice versa).
- Use namespace-level LimitRanges alongside ResourceQuotas. Quotas enforce totals but do not prevent a single pod from consuming the entire quota. LimitRanges enforce per-container maximums, ensuring fair distribution within a namespace.
- Monitor HPA decisions in Grafana. The kube_horizontalpodautoscaler_status_current_replicas and kube_horizontalpodautoscaler_spec_target_metric metrics from kube-state-metrics let you build dashboards that show when and why HPA scaled your deployments.
What's Next?
- Probes (Health Checks) -- Ensure your pods are healthy before they receive traffic, which directly affects how HPA counts "ready" replicas.
- Observability -- Use Prometheus and Grafana to monitor resource usage, throttling metrics, and HPA decisions.
- Scheduling & Affinity -- Control pod placement based on node resources, labels, and taints.
- Cluster Autoscaler -- Automatically add or remove nodes when pod scheduling fails due to insufficient cluster capacity.
- Vertical Pod Autoscaler (VPA) -- Automatically adjust requests and limits based on observed usage patterns.
- Cost Optimization -- Strategies for reducing cloud spend through resource right-sizing and spot instances.