Resources: Requests, Limits & HPA
- Requests vs. Limits: Requests are a guaranteed resource floor used by the scheduler for placement decisions. Limits are a hard ceiling -- CPU exceeding the limit is throttled, memory exceeding the limit triggers an OOMKill.
- QoS Classes: Every pod is assigned a QoS class (Guaranteed, Burstable, or BestEffort) based on its resource configuration. This class determines eviction priority when a node runs out of memory.
- Auto-Scaling Logic: The Horizontal Pod Autoscaler (HPA) uses the formula
  desiredReplicas = ceil(currentReplicas * (currentMetric / targetMetric))
  and supports resource metrics, custom metrics, and external metrics. HPA v2 has been stable since Kubernetes 1.23.
- Resource Governance: ResourceQuotas cap total namespace resource consumption. LimitRanges enforce default and maximum values per-container, preventing unconstrained pods from being created.
- Right-sizing: Setting requests too high wastes cluster capacity. Setting them too low causes scheduling failures and evictions. Use observability data to right-size over time.
Before you can auto-scale, you must define how much CPU and RAM your application needs. Getting this right is one of the most impactful things you can do for both reliability and cost efficiency.
1. Requests vs Limits
Every container in a Pod can specify requests and limits for CPU and memory.
Requests (The "Guarantee")
- What it means: "I need at least this much to function."
- Scheduler behavior: The Kubernetes Scheduler sums the requests of all containers on a node and only places a new pod on a node that has enough allocatable capacity remaining. If no node has enough free capacity, the pod stays Pending.
- Runtime behavior: The Linux kernel uses requests to set CPU shares (via cgroups). A container with cpu: 500m gets proportionally more CPU time than one with cpu: 250m when the node is contended.
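The shares mapping can be sketched in a few lines. This is a simplified version of the kubelet's cgroup v1 conversion (the constants match the cgroup convention; treat the function name as illustrative, not the kubelet's API):

```python
MIN_SHARES = 2          # kernel-enforced minimum for cpu.shares
SHARES_PER_CPU = 1024   # cgroup convention: 1 full core == 1024 shares

def milli_cpu_to_shares(milli_cpu: int) -> int:
    """Convert a CPU request in millicores (250 for '250m') to cpu.shares."""
    if milli_cpu == 0:
        return MIN_SHARES
    return max(MIN_SHARES, (milli_cpu * SHARES_PER_CPU) // 1000)

print(milli_cpu_to_shares(250))  # 256
print(milli_cpu_to_shares(500))  # 512
```

So under contention, the 500m container gets twice the CPU time of the 250m container, because 512 shares outweigh 256.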
Limits (The "Ceiling")
- What it means: "The container must never exceed this."
- CPU behavior: When a container tries to use more CPU than its limit, the kernel throttles it. The process slows down but continues running. Throttling is invisible to the application -- there are no errors, just increased latency.
- Memory behavior: When a container allocates more memory than its limit, the kernel OOMKills the process. The container is terminated and restarted according to the pod's restartPolicy. This produces a visible OOMKilled status in kubectl get pods.
apiVersion: v1
kind: Pod
metadata:
name: resource-demo
spec:
containers:
- name: app
image: myapp:v1
resources:
requests:
cpu: "250m" # 1000m = 1 full CPU core. 250m = 0.25 cores.
memory: "256Mi" # 256 Mebibytes (binary). Use Mi, not MB.
limits:
cpu: "500m" # Throttled above this. Process slows down.
memory: "512Mi" # OOMKilled above this. Process is terminated.
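The suffixes in the manifest above follow two different systems: binary (Ki, Mi, Gi) and decimal (k, M, G), plus "m" for millicores. A toy parser (illustrative only, not the Kubernetes resource.Quantity implementation) makes the difference concrete:

```python
# Binary vs decimal suffixes; "m" is millicores for CPU.
SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3,  # binary (use these for memory)
    "k": 1000, "M": 1000**2, "G": 1000**3,     # decimal
    "m": 0.001,                                # millicores (CPU)
}

def parse_quantity(q: str) -> float:
    # Try longer suffixes first so "Mi" is not mistaken for "M".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if q.endswith(suffix):
            return float(q[: -len(suffix)]) * SUFFIXES[suffix]
    return float(q)  # plain number, e.g. cpu: "2"

print(parse_quantity("250m"))   # 0.25 cores
print(parse_quantity("128Mi"))  # 134217728.0 bytes
print(parse_quantity("128M"))   # 128000000.0 bytes
```

The 128Mi/128M gap is about 6 MB per container, which adds up across a large deployment and explains why mixing the two suffixes leads to confusing capacity math.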
CPU vs Memory: A Fundamental Difference
Understanding this distinction is essential:
| | CPU | Memory |
|---|---|---|
| Nature | Compressible (can be throttled) | Incompressible (cannot be "taken back") |
| Over-limit behavior | Throttling (slower execution) | OOMKill (container terminated) |
| Under-request impact | Pod gets less CPU time under contention | Pod may be evicted by kubelet |
| Symptom of wrong setting | Increased latency, timeout errors | CrashLoopBackOff with OOMKilled status |
Because CPU is compressible, some teams choose not to set CPU limits at all, allowing containers to burst and use idle CPU on the node. This is a valid strategy when you trust your workloads and prefer throughput over predictability. Memory limits, on the other hand, should almost always be set because there is no graceful degradation for memory overuse.
2. Quality of Service (QoS) Classes
Kubernetes assigns a QoS class to every Pod based on its resource configuration. This determines which pod gets killed first when the node runs out of memory.
Guaranteed (Last to be evicted)
Criteria: Every container in the pod must have requests equal to limits for both CPU and memory.
# Guaranteed QoS: requests == limits for all containers
resources:
requests:
cpu: "500m"
memory: "256Mi"
limits:
cpu: "500m" # Same as request
memory: "256Mi" # Same as request
These pods have the highest priority. They are the last to be evicted and receive the most predictable performance because the kernel never needs to throttle or reclaim resources.
Burstable (Evicted under pressure)
Criteria: At least one container has a request or limit set, but the pod does not meet the Guaranteed criteria (for example, requests lower than limits, or limits missing entirely).
# Burstable QoS: requests < limits
resources:
requests:
cpu: "250m"
memory: "128Mi"
limits:
cpu: "1000m" # Can burst to 4x the request
memory: "512Mi"
When the node is under memory pressure, the kubelet evicts Burstable pods that are using more memory than their request before evicting Guaranteed pods.
BestEffort (First to be evicted)
Criteria: No requests or limits set at all.
# BestEffort QoS: no resource configuration
resources: {}
These pods are the first to be evicted. In production, you should almost never have BestEffort pods -- they make capacity planning impossible and are unreliable under load.
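The classification rules for all three classes can be summarized in a short sketch (simplified from the kubelet's actual QoS logic; the function and data shapes here are illustrative):

```python
def qos_class(containers: list[dict]) -> str:
    """Each container dict may carry "requests" and "limits" sub-dicts
    with "cpu"/"memory" quantity strings."""
    requests = [c.get("requests", {}) for c in containers]
    limits = [c.get("limits", {}) for c in containers]
    # No resources anywhere -> BestEffort.
    if all(not r and not l for r, l in zip(requests, limits)):
        return "BestEffort"
    # Every container has requests == limits for both CPU and memory -> Guaranteed.
    guaranteed = all(
        r.get(res) is not None and r.get(res) == l.get(res)
        for r, l in zip(requests, limits)
        for res in ("cpu", "memory")
    )
    return "Guaranteed" if guaranteed else "Burstable"

print(qos_class([{"requests": {"cpu": "500m", "memory": "256Mi"},
                  "limits":   {"cpu": "500m", "memory": "256Mi"}}]))  # Guaranteed
print(qos_class([{"requests": {"cpu": "250m"}}]))                     # Burstable
print(qos_class([{}]))                                                # BestEffort
```

Note that the rules apply pod-wide: a single container with mismatched requests and limits demotes the entire pod from Guaranteed to Burstable.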
3. Horizontal Pod Autoscaling (HPA)
Once requests are set, HPA can automatically scale the number of Pods based on observed utilization.
The Prerequisite: Metrics Server
HPA is a consumer of data, not a producer. By default, Kubernetes does not know how much CPU or memory a pod is using. You must install the Metrics Server in your cluster.
- The Metrics Server scrapes resource usage data from the kubelet on every node.
- HPA queries the Metrics API (metrics.k8s.io) to make its scaling decisions.
- If kubectl top pods returns an error, your Metrics Server is not working and HPA will not function.
The Algorithm
The HPA controller runs a control loop (default every 15 seconds) and calculates the desired replica count:
desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
Example: You have 3 replicas, average CPU utilization is 75%, and your target is 50%.
desiredReplicas = ceil[3 * (75 / 50)] = ceil[4.5] = 5
The HPA scales from 3 to 5 replicas. On the next check, if average utilization drops sufficiently below the target (outside the controller's ~10% tolerance band), the HPA calculates that fewer replicas are needed and scales down.
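The calculation can be sketched in a few lines, including the default 10% tolerance that suppresses scaling on small deviations (a simplified model: the real controller also averages per-pod metrics and handles not-yet-ready pods):

```python
import math

def desired_replicas(current: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """HPA core formula with the default tolerance band."""
    ratio = current_metric / target_metric
    if abs(1.0 - ratio) <= tolerance:
        return current  # close enough to target: no change
    return math.ceil(current * ratio)

print(desired_replicas(3, 75, 50))  # 5  (the worked example above)
print(desired_replicas(5, 47, 50))  # 5  (within tolerance: no scale-down)
print(desired_replicas(5, 40, 50))  # 4  (scale down)
```

The tolerance band is why HPA does not react to every small wobble in utilization; only deviations beyond roughly 10% of the target trigger a recalculation.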
Metric Types
HPA v2 (stable since Kubernetes 1.23) supports four types of metrics:
| Type | Description | Example |
|---|---|---|
| Resource | CPU or memory utilization from metrics-server | Average CPU at 50% |
| Pods | Custom metric per-pod from your app | Queue depth per pod |
| Object | Metric from a Kubernetes object | Ingress requests-per-second |
| External | Metric from an external system | AWS SQS queue length |
Scaling Behavior Configuration
HPA v2 allows you to control how fast scaling happens, which prevents oscillation (rapid scale-up/scale-down cycles):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: web-app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
minReplicas: 2
maxReplicas: 20
metrics:
# Scale on CPU utilization
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 60 # Target 60% of CPU request
# Also scale on memory
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 75
behavior:
scaleUp:
stabilizationWindowSeconds: 60 # Wait 60s before scaling up again
policies:
- type: Pods
value: 4 # Add at most 4 pods per 60s
periodSeconds: 60
- type: Percent
value: 100 # Or double the pods, whichever is higher
periodSeconds: 60
selectPolicy: Max # Use the policy that allows more replicas
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 minutes before scaling down
policies:
- type: Percent
value: 25 # Remove at most 25% of pods per 60s
periodSeconds: 60
The stabilizationWindowSeconds for scale-down is important. Without it, a brief dip in traffic causes the HPA to scale down, then the remaining pods get overwhelmed and it scales back up, creating a flapping cycle.
Custom Metrics Example
To scale on application-specific metrics (e.g., requests per second from Prometheus), you need a metrics adapter like prometheus-adapter:
metrics:
- type: Pods
pods:
metric:
name: http_requests_per_second # Exposed by prometheus-adapter
target:
type: AverageValue
averageValue: "1000" # Target 1000 req/s per pod
Real-World Scenario: Scaling a Web Service
Imagine you run an e-commerce API. Normal traffic is 500 requests/second, but during flash sales it spikes to 5,000 requests/second.
- Baseline: 3 replicas, each handling ~170 req/s, CPU at 30%.
- Flash sale starts: Traffic jumps to 2,000 req/s. CPU rises to 80%.
- HPA reacts: ceil[3 * (80/60)] = 4. Scales to 4 replicas.
- Traffic continues rising: CPU still at 75%. ceil[4 * (75/60)] = 5. Scales to 5.
- Peak traffic: Eventually reaches 10 replicas, each handling ~500 req/s at 55% CPU.
- Traffic subsides: After 5 minutes of low utilization (stabilization window), HPA scales down gradually (25% at a time).
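The scenario above can be replayed as a toy simulation. The assumption here is that total CPU demand (expressed in "percent of one pod's request") is divided evenly across replicas; the real controller additionally applies tolerance and stabilization windows:

```python
import math

def step(replicas: int, total_demand: float, target: float = 60) -> int:
    """One HPA iteration: demand spread over replicas vs. a 60% target."""
    per_pod_util = total_demand / replicas
    return math.ceil(replicas * per_pod_util / target)

replicas = 3
for demand in (240, 300, 550):  # rising traffic, in pod-percent units
    replicas = step(replicas, demand)
    print(replicas)
# 4, then 5, then 10 -- matching the trace in the scenario
```

Note the last jump: at 5 replicas the per-pod utilization is 110%, so a single iteration doubles capacity to 10 replicas, where utilization settles at 55% and stays within the tolerance band.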
4. Namespace Resource Governance
Admins can enforce rules to prevent one team from consuming the entire cluster.
ResourceQuota (The Hard Cap)
Limits the total resource usage across all pods in a namespace:
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-a-quota
namespace: team-a
spec:
hard:
requests.cpu: "10" # Total CPU requests across all pods
requests.memory: "20Gi" # Total memory requests
limits.cpu: "20" # Total CPU limits
limits.memory: "40Gi" # Total memory limits
pods: "50" # Maximum number of pods
services: "20" # Maximum number of services
persistentvolumeclaims: "10" # Maximum PVCs
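The admission behavior of a quota like the one above can be sketched as a simple check: add the new pod's requests to current namespace usage and reject if any hard cap would be exceeded (a hypothetical helper with quantities pre-converted to numeric units; real quota tracking also covers object counts):

```python
def admit(pod_requests: dict, used: dict, hard: dict) -> bool:
    """Reject the pod if any quota dimension would overflow its cap."""
    return all(used.get(k, 0) + v <= hard[k] for k, v in pod_requests.items())

hard = {"requests.cpu": 10.0, "requests.memory": 20 * 1024**3}
used = {"requests.cpu": 8.5, "requests.memory": 12 * 1024**3}

print(admit({"requests.cpu": 2.0}, used, hard))  # False: 10.5 > 10 cores
print(admit({"requests.cpu": 1.0}, used, hard))  # True: 9.5 <= 10 cores
```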
When a ResourceQuota covers compute resources (such as requests.cpu or limits.memory), every pod in the namespace must specify the corresponding requests and limits. If a user tries to create a pod without them, the API server rejects it. This is where LimitRange becomes essential.
LimitRange (The Defaults and Constraints)
Sets default values for pods that do not specify resources, and enforces min/max boundaries:
apiVersion: v1
kind: LimitRange
metadata:
name: team-a-limits
namespace: team-a
spec:
limits:
- type: Container
default: # Applied if no limit is specified
cpu: "500m"
memory: "256Mi"
defaultRequest: # Applied if no request is specified
cpu: "100m"
memory: "128Mi"
max: # No container can request more than this
cpu: "4"
memory: "8Gi"
min: # No container can request less than this
cpu: "50m"
memory: "64Mi"
This prevents BestEffort pods from being accidentally created and stops individual containers from requesting disproportionate resources.
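The defaulting half of LimitRange behavior can be sketched as a dictionary fill-in (simplified: real admission also validates min/max bounds and request-to-limit ratios; the function name is illustrative):

```python
def apply_limit_range(container: dict, lr: dict) -> dict:
    """Fill in default limits/requests where the container left them unset."""
    res = container.setdefault("resources", {})
    limits = res.setdefault("limits", {})
    requests = res.setdefault("requests", {})
    for key, val in lr["default"].items():
        limits.setdefault(key, val)        # default limit if none given
    for key, val in lr["defaultRequest"].items():
        requests.setdefault(key, val)      # default request if none given
    return container

lr = {"default": {"cpu": "500m", "memory": "256Mi"},
      "defaultRequest": {"cpu": "100m", "memory": "128Mi"}}
pod = apply_limit_range({"name": "app"}, lr)
print(pod["resources"]["requests"]["cpu"])  # 100m
```

A pod created with no resources block therefore lands in the Burstable class (requests below limits) rather than BestEffort.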
Right-Sizing Strategies
Setting accurate requests is an iterative process. Over-provisioning wastes money; under-provisioning causes instability.
- Start with estimates. Use load testing or developer knowledge to set initial values. For a typical web application, cpu: 250m and memory: 256Mi is a reasonable starting point.
- Deploy and observe. Use kubectl top pods and your monitoring stack (Prometheus/Grafana) to measure actual usage over days or weeks. Single-day observations miss weekly traffic patterns.
- Adjust requests to P95 usage. Your request should cover the 95th percentile of actual usage. This handles normal spikes while avoiding over-provisioning.
- Set limits at 2-3x the request for CPU (or omit CPU limits entirely). For memory, set limits at 1.5-2x the request to accommodate temporary spikes while still capping runaway processes.
- Use Prometheus queries for data-driven decisions. The query container_memory_working_set_bytes{pod=~"myapp.*"} over a 7-day window gives you the real memory footprint. Compare this to your requests to find over- or under-provisioned workloads.
- Automate with VPA. The Vertical Pod Autoscaler can observe usage and recommend or automatically adjust requests. See the VPA documentation for details.
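The P95 step can be sketched as follows (hedged: the samples here are hard-coded in Mi for illustration; in practice you would pull them from Prometheus, and the 1.5x memory multiplier is one point in the 1.5-2x range suggested above):

```python
import math

def p95(samples: list[float]) -> float:
    """95th percentile via the nearest-rank method."""
    ordered = sorted(samples)
    idx = math.ceil(0.95 * len(ordered)) - 1
    return ordered[max(0, idx)]

usage_mi = [180, 200, 210, 220, 230, 240, 250, 260, 300, 410]
request = p95(usage_mi)
limit = round(request * 1.5)  # memory limit at ~1.5x the request
print(f"requests.memory: {request}Mi  limits.memory: {limit}Mi")
# requests.memory: 410Mi  limits.memory: 615Mi
```

Note how the single 410Mi spike dominates the request: if that spike is a known one-off (a cache warm-up, say), you may prefer a lower percentile plus a more generous limit instead.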
Common Pitfalls
1. Not setting any resource requests. Without requests, pods are BestEffort and get evicted first. The scheduler also cannot make informed placement decisions, leading to overloaded nodes.
2. Setting CPU limits too low. Aggressive CPU limits cause throttling even when the node has idle CPU. This manifests as mysterious latency spikes that are hard to diagnose. Use container_cpu_cfs_throttled_periods_total in Prometheus to detect this.
3. Setting memory requests much lower than actual usage. The kubelet evicts pods using more memory than their request when the node is under pressure. If your request is 256Mi but your app regularly uses 400Mi, it will be evicted during any node memory contention.
4. Confusing Mi and M. 128Mi is 128 Mebibytes (134,217,728 bytes). 128M is 128 Megabytes (128,000,000 bytes). Always use the Mi suffix to match how Kubernetes displays and computes values internally.
5. Forgetting to install Metrics Server for HPA. HPA silently does nothing without metrics. Check with kubectl get apiservices | grep metrics to verify.
6. HPA and VPA conflicting on the same resource. Do not use HPA and VPA on the same metric (e.g., both targeting CPU). HPA adjusts replica count while VPA adjusts per-pod requests, and they can fight each other. If you use both, configure VPA in "recommend only" mode for CPU and let HPA handle scaling.
Best Practices
- Always set memory requests and limits. Memory is incompressible and OOMKills are disruptive. Every production container should have memory limits.
- Set CPU requests, but consider omitting CPU limits. This lets containers burst when CPU is available. Monitor throttling metrics to validate this approach for your workloads.
- Use HPA with conservative scale-down settings. A 5-minute stabilizationWindowSeconds for scale-down prevents flapping and premature removal of capacity.
- Combine HPA with PodDisruptionBudgets. HPA manages replica count, but PDBs ensure availability during rolling updates and node drains. They complement each other.
- Set ResourceQuotas on all shared namespaces. Without quotas, one team's runaway deployment can starve the entire cluster.
- Review resource utilization monthly. Workloads change over time. A service that needed 2 CPU at launch may only use 500m six months later (or vice versa).
- Use namespace-level LimitRanges alongside ResourceQuotas. Quotas enforce totals but do not prevent a single pod from consuming the entire quota. LimitRanges enforce per-container maximums, ensuring fair distribution within a namespace.
- Monitor HPA decisions in Grafana. The kube_horizontalpodautoscaler_status_current_replicas and kube_horizontalpodautoscaler_spec_target_metric metrics from kube-state-metrics let you build dashboards that show when and why HPA scaled your deployments.
What's Next?
- Probes (Health Checks) -- Ensure your pods are healthy before they receive traffic, which directly affects how HPA counts "ready" replicas.
- Observability -- Use Prometheus and Grafana to monitor resource usage, throttling metrics, and HPA decisions.
- Scheduling & Affinity -- Control pod placement based on node resources, labels, and taints.
- Cluster Autoscaler -- Automatically add or remove nodes when pod scheduling fails due to insufficient cluster capacity.
- Vertical Pod Autoscaler (VPA) -- Automatically adjust requests and limits based on observed usage patterns.
- Cost Optimization -- Strategies for reducing cloud spend through resource right-sizing and spot instances.