
Cluster Autoscaler (Scale the Nodes)

Key Takeaways for AI & Readers
  • Infrastructure Scaling: Unlike HPA (which scales pods) and VPA (which scales pod resources), the Cluster Autoscaler (CA) adds or removes physical/virtual nodes when the existing cluster capacity is exhausted or underutilized.
  • Trigger Mechanism: CA monitors for pods stuck in Pending state with Insufficient CPU/Memory scheduling failures. When detected, it calculates which node group can satisfy the pending pods and requests new nodes from the cloud provider.
  • Scale-Down Logic: CA identifies underutilized nodes (utilization below 50% by default), verifies that all pods on the node can be safely rescheduled elsewhere, and then drains and terminates the node to reduce cost.
  • Expander Strategies: When multiple node groups can satisfy pending pods, CA uses an expander strategy (random, most-pods, least-waste, priority) to choose the best fit.
  • Karpenter: A next-generation node provisioner that bypasses node groups entirely. It provisions individual nodes with the exact instance type and size needed for pending pods, resulting in faster scaling (seconds vs. minutes) and better bin-packing.
  • Cloud Provider Integration: CA integrates with AWS Auto Scaling Groups, GCP Managed Instance Groups, and Azure VMSS. Karpenter works directly with the cloud provider's compute APIs for more flexible provisioning.

We've covered HPA (horizontal pod autoscaling) and VPA (vertical pod autoscaling). But what happens when you need to run 100 pods and your existing nodes are completely full? No amount of pod scaling helps if there is no compute capacity available. You need to scale the infrastructure — the nodes themselves.

1. Automatic Node Provisioning

Cluster Autoscaler detects when Pods cannot be scheduled due to lack of resources and automatically provisions a new Node from the cloud provider (AWS/GCP).

The Cluster Autoscaler is a Kubernetes component that automatically adjusts the number of nodes in your cluster. It integrates with your cloud provider's compute APIs to add nodes when demand increases and remove them when capacity is no longer needed.

2. How Scale-Up Works

The Cluster Autoscaler scale-up process follows a specific sequence:

  1. Detection: A pod enters the Pending state because the scheduler cannot find a node with sufficient CPU, memory, or other resources. The scheduler sets the condition PodScheduled: False with reason Unschedulable.

  2. Evaluation: CA runs its main loop (default: every 10 seconds) and discovers the unschedulable pods. It evaluates which node groups (ASGs, MIGs, VMSS) could accommodate the pending pods.

  3. Simulation: For each candidate node group, CA simulates adding a node and checks whether the pending pods would fit. It considers node resources, taints, tolerations, affinity rules, and topology spread constraints.

  4. Expansion: CA selects a node group using the configured expander strategy and requests the cloud provider to increase the node group size by the calculated number of nodes.

  5. Node Ready: The cloud provider provisions a new VM. The kubelet starts, registers the node with the API server, and the node transitions to Ready status. Typical time: 1-3 minutes depending on the cloud provider and VM type.

  6. Scheduling: The scheduler places the pending pods on the newly available node.
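The evaluation-and-simulation steps above can be sketched as a bin-packing fit check. This is a simplified illustration (first-fit decreasing on CPU only by default ordering), not the real Cluster Autoscaler code, which also simulates taints, tolerations, affinity, and daemonset overhead:

```python
# Simplified sketch of CA's scale-up simulation: for a candidate node
# group, compute how many nodes of that group's shape are needed to fit
# all pending pods. All names and numbers are illustrative.

def nodes_needed(pending_pods, node_cpu, node_mem):
    """pending_pods: list of (cpu, mem) requests.
    Returns node count, or None if some pod can never fit this shape."""
    nodes = []  # each entry is [free_cpu, free_mem] of a simulated node
    for cpu, mem in sorted(pending_pods, reverse=True):
        for node in nodes:
            if node[0] >= cpu and node[1] >= mem:
                node[0] -= cpu
                node[1] -= mem
                break
        else:  # no simulated node fits: add a fresh one
            if cpu > node_cpu or mem > node_mem:
                return None  # pod is larger than the node shape itself
            nodes.append([node_cpu - cpu, node_mem - mem])
    return len(nodes)

# Three pending pods, each requesting 2 CPU / 4 GiB:
pods = [(2, 4), (2, 4), (2, 4)]
print(nodes_needed(pods, node_cpu=4, node_mem=16))  # 2 nodes of a 4-CPU shape
print(nodes_needed(pods, node_cpu=8, node_mem=32))  # 1 node of an 8-CPU shape
```

CA then asks the cloud provider for exactly that many nodes from the chosen group.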

Scale-Up Configuration

# Cluster Autoscaler deployment args (key parameters)
containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0
  command:
  - ./cluster-autoscaler
  - --v=4
  - --cloud-provider=aws
  - --skip-nodes-with-local-storage=false
  # Scan for pending pods every 10 seconds
  - --scan-interval=10s
  # Node must be unneeded for 10 minutes before scale-down
  - --scale-down-unneeded-time=10m
  # Node utilization below this threshold triggers scale-down consideration
  - --scale-down-utilization-threshold=0.5
  # Timeout: if a new node is not Ready within this time, the scale-up is considered failed
  - --max-node-provision-time=15m
  # Maximum number of nodes the cluster can grow to
  - --max-nodes-total=100
  # Expander strategy
  - --expander=least-waste
  # Balance similar node groups for high availability
  - --balance-similar-node-groups=true
  # Node groups to manage (AWS example)
  - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster

3. How Scale-Down Works

CA also works in reverse, removing underutilized nodes to save money:

  1. Utilization Check: Every scan-interval, CA calculates the utilization of each node. Utilization is defined as the sum of pod requests divided by the node's allocatable capacity.

  2. Underutilization Threshold: If a node's utilization is below the scale-down-utilization-threshold (default: 50%), it becomes a scale-down candidate.

  3. Cool-Down Period: The node must remain underutilized for scale-down-unneeded-time (default: 10 minutes) to avoid thrashing (scaling down and immediately back up).

  4. Safety Checks: Before removing a node, CA verifies:

    • All pods on the node can be rescheduled to other existing nodes.
    • No pod has a PodDisruptionBudget that would be violated.
    • No pod uses local storage (unless --skip-nodes-with-local-storage=false).
    • No pod has the annotation cluster-autoscaler.kubernetes.io/safe-to-evict: "false".
    • The node is not running a mirror pod (static pod) or a pod with no controller (standalone pod).
  5. Drain and Terminate: CA cordons the node (marks it unschedulable), drains all pods (respecting PDBs and graceful termination), and then requests the cloud provider to terminate the VM.
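The utilization check in step 1 can be illustrated with a small calculation (hypothetical numbers; the real CA additionally runs the safety checks above before acting):

```python
# Sketch of CA's scale-down candidacy check: utilization is the sum of
# pod requests divided by the node's allocatable capacity. The node must
# stay below the threshold (default 0.5) for the whole unneeded time.
# All numbers here are illustrative.

def is_scale_down_candidate(pod_cpu_requests, allocatable_cpu,
                            pod_mem_requests, allocatable_mem,
                            threshold=0.5):
    cpu_util = sum(pod_cpu_requests) / allocatable_cpu
    mem_util = sum(pod_mem_requests) / allocatable_mem
    # The node is a candidate only if both dimensions are under the threshold
    return max(cpu_util, mem_util) < threshold

# Node with 4 CPU / 16 GiB allocatable, running two small pods:
print(is_scale_down_candidate([0.5, 0.5], 4, [1, 2], 16))  # True  (25% / 19%)
print(is_scale_down_candidate([1.5, 1.5], 4, [2, 2], 16))  # False (75% CPU)
```

Note that this is based on requests, not actual usage: a node full of over-requesting but idle pods will never be scaled down.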

Preventing Scale-Down

For nodes that should never be removed (e.g., nodes running singleton controllers or special hardware):

# Annotation on the node
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"

For pods that should prevent their node from being removed:

# Annotation on the pod
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

4. Expander Strategies

When multiple node groups can satisfy pending pods, the expander determines which one CA chooses:

Random

The simplest strategy. Picks a node group at random. Works well when all node groups have similar configurations.

Most-Pods

Selects the node group that would schedule the most pending pods. Good for maximizing pod throughput.

Least-Waste

Selects the node group that would have the least idle resources after scheduling the pending pods. This optimizes for cost efficiency and is the most commonly recommended strategy.
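The least-waste idea can be sketched as follows. This is an illustrative scoring function, not the actual expander implementation; node group names and shapes are hypothetical:

```python
# Sketch of the least-waste expander: among node groups whose node shape
# can fit the pending pods, pick the one leaving the fewest idle
# resources. Shapes and numbers are illustrative.

def least_waste(pending_cpu, pending_mem, groups):
    """groups: dict of name -> (node_cpu, node_mem). Returns best name."""
    best, best_waste = None, None
    for name, (cpu, mem) in groups.items():
        if cpu < pending_cpu or mem < pending_mem:
            continue  # one node of this shape cannot hold the pending pods
        # Sum of the idle fractions of the new node's CPU and memory
        waste = (1 - pending_cpu / cpu) + (1 - pending_mem / mem)
        if best_waste is None or waste < best_waste:
            best, best_waste = name, waste
    return best

groups = {"m5.large": (2, 8), "m5.xlarge": (4, 16), "m5.2xlarge": (8, 32)}
# Pending pods need 3 CPU / 10 GiB in total:
print(least_waste(3, 10, groups))  # m5.xlarge leaves the least idle capacity
```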

Priority

Uses a ConfigMap to define a priority order for node groups. CA tries the highest-priority group first and falls back to lower priorities:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    100:
    - spot-node-group-.*       # try spot instances first (cheapest)
    50:
    - on-demand-node-group-.*  # fall back to on-demand
    10:
    - gpu-node-group-.*        # GPU nodes only when needed

This is the most powerful strategy for cost optimization — it ensures CA tries cheaper spot node groups before falling back to on-demand.
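The matching logic behind the priority expander can be sketched like this. The patterns mirror a ConfigMap of this shape; the code itself is illustrative, not the real expander:

```python
# Sketch of the priority expander: candidate node group names are matched
# against regex patterns, and the group with the highest matching priority
# wins. Group names and priorities are illustrative.
import re

priorities = {
    100: [r"spot-node-group-.*"],
    50:  [r"on-demand-node-group-.*"],
    10:  [r"gpu-node-group-.*"],
}

def pick_group(candidate_groups):
    def priority_of(name):
        return max((p for p, patterns in priorities.items()
                    if any(re.fullmatch(pat, name) for pat in patterns)),
                   default=0)
    return max(candidate_groups, key=priority_of)

print(pick_group(["on-demand-node-group-a", "spot-node-group-a"]))
# spot-node-group-a (priority 100 beats 50)
```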

5. Node Group / Node Pool Configuration

Node groups (AWS ASGs, GCP MIGs, Azure VMSS) define the pool of VMs that CA can scale. Key configuration considerations:

AWS Auto Scaling Groups

# AWS ASG tags for CA auto-discovery:
#   k8s.io/cluster-autoscaler/enabled = true
#   k8s.io/cluster-autoscaler/<cluster-name> = owned

# ASG configuration
MinSize: 1
MaxSize: 20
DesiredCapacity: 3

# Mixed instances policy for cost optimization
MixedInstancesPolicy:
  InstancesDistribution:
    OnDemandBaseCapacity: 1                 # first node is on-demand (reliability)
    OnDemandPercentageAboveBaseCapacity: 0  # rest are spot
    SpotAllocationStrategy: capacity-optimized
  LaunchTemplate:
    Overrides:
    - InstanceType: m5.xlarge
    - InstanceType: m5a.xlarge
    - InstanceType: m5d.xlarge
    - InstanceType: m6i.xlarge              # multiple types for spot availability

GKE Node Pools

# Create a GKE node pool with autoscaling
gcloud container node-pools create general-pool \
  --cluster=production \
  --enable-autoscaling \
  --min-nodes=1 \
  --max-nodes=20 \
  --machine-type=e2-standard-4 \
  --spot  # use spot VMs for cost savings

AKS Node Pools

# Create an AKS node pool with autoscaling
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name production \
  --name generalpool \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 20 \
  --node-vm-size Standard_D4s_v3 \
  --priority Spot  # use spot VMs

Multiple Node Groups

A common pattern is to have different node groups for different workload types:

Node Group        | Instance Type          | Use Case                             | CA Priority
spot-general      | m5.xlarge, m5a.xlarge  | Stateless web apps, batch jobs       | Highest (cheapest)
on-demand-general | m5.xlarge              | Stateful apps, controllers           | Medium
gpu-spot          | g4dn.xlarge            | ML inference, optional GPU workloads | Low
gpu-on-demand     | p3.2xlarge             | ML training (cannot be interrupted)  | Lowest

6. Karpenter: The Modern Alternative

Karpenter is a node provisioner originally built by AWS and now a CNCF project. It takes a fundamentally different approach from the Cluster Autoscaler:

Aspect                  | Cluster Autoscaler                         | Karpenter
Provisioning unit       | Node groups (ASGs/MIGs)                    | Individual nodes
Instance type selection | Fixed by node group config                 | Dynamic — chooses optimal instance type per workload
Scale-up speed          | 1-3 minutes (ASG-dependent)                | 30-60 seconds (direct API calls)
Consolidation           | Scale-down only for underutilized nodes    | Active consolidation — replaces nodes with cheaper/smaller ones
Configuration           | Per node group (ASG size, instance types)  | Declarative NodePool CRD
Cloud support           | AWS, GCP, Azure, and others                | AWS (GA), Azure (beta), others in development

Karpenter NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    metadata:
      labels:
        environment: production
    spec:
      # NodeClass reference (cloud-provider-specific config)
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      # Instance type requirements
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64", "arm64"]     # allow ARM for cost savings
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]  # prefer spot
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ["m", "c", "r"]        # general, compute, memory families
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values: ["4"]                  # generation 5+
      # Automatically expire nodes after 720 hours (30 days)
      expireAfter: 720h
  # Disruption behavior
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  # Total resource limits for this NodePool
  limits:
    cpu: 200
    memory: 400Gi
  # Preference among NodePools: higher weight = tried first
  weight: 50

EC2NodeClass (AWS-Specific)

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  # AMI discovery
  amiSelectorTerms:
  - alias: al2023@latest  # Amazon Linux 2023
  # Subnet discovery
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: production
  # Security group discovery
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: production
  # IAM role for the node instance profile
  role: KarpenterNodeRole-production
  # Block device configuration
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      volumeSize: 100Gi
      volumeType: gp3
      deleteOnTermination: true
  # Tags applied to EC2 instances
  tags:
    managed-by: karpenter
    environment: production

How Karpenter Consolidation Works

Karpenter continuously evaluates whether workloads can be packed more efficiently:

  1. Empty node deletion: If a node has no non-daemonset pods, Karpenter terminates it immediately.
  2. Single-node consolidation: If a node's pods can all fit on other existing nodes, Karpenter drains and terminates it.
  3. Multi-node consolidation: If pods from multiple underutilized nodes can be combined onto fewer nodes, Karpenter replaces them.
  4. Instance type optimization: If a smaller or cheaper instance type can run the current workload, Karpenter replaces the node.

This active consolidation is more aggressive than CA's scale-down and results in significantly higher cluster utilization.
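The single-node consolidation check (step 2) can be sketched as a first-fit placement test. This is an illustration of the idea, not Karpenter's actual algorithm, which also weighs PDBs, instance prices, and spot interruption risk:

```python
# Sketch of Karpenter-style single-node consolidation: a node can be
# removed if every pod on it fits into the free capacity of the
# remaining nodes. CPU only, first-fit; all numbers are illustrative.

def can_consolidate(node_pods, other_nodes_free):
    """node_pods: CPU requests of pods on the candidate node.
    other_nodes_free: free CPU on each remaining node."""
    free = list(other_nodes_free)
    for cpu in sorted(node_pods, reverse=True):
        for i, f in enumerate(free):
            if f >= cpu:
                free[i] -= cpu
                break
        else:
            return False  # some pod has nowhere to go
    return True

# Node running pods of 1 + 0.5 CPU; two other nodes with 1 CPU free each:
print(can_consolidate([1, 0.5], [1, 1]))  # True: node can be drained
print(can_consolidate([2, 0.5], [1, 1]))  # False: the 2-CPU pod fits nowhere
```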

7. Priority-Based Expansion

For clusters with workloads of varying importance, priority-based expansion ensures that high-priority workloads get nodes first:

# PriorityClass for critical workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-production
value: 1000000          # high priority
globalDefault: false
description: "Critical production workloads that must always run"
---
# PriorityClass for batch workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-processing
value: 100              # low priority
globalDefault: false
preemptionPolicy: Never # do not preempt other pods
description: "Batch jobs that can wait for capacity"

When nodes are full, the scheduler will preempt low-priority pods to make room for high-priority pods. CA then provisions new nodes for the displaced low-priority pods if needed.
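The victim-selection idea can be sketched as follows. This is a heavily simplified illustration (the real scheduler considers PDBs, graceful termination, and nominates nodes); pod names and numbers are hypothetical:

```python
# Sketch of priority-based preemption: to place an incoming pod on a full
# node, evict the lowest-priority pods first, but only pods with strictly
# lower priority, and only if the incoming pod's PriorityClass allows
# preemption (preemptionPolicy != Never). Illustrative only.

def choose_victims(incoming_priority, incoming_preempts, running_pods, needed_cpu):
    """running_pods: list of (name, priority, cpu). Returns victim names or None."""
    if not incoming_preempts:  # incoming pod has preemptionPolicy: Never
        return None
    victims, freed = [], 0.0
    for name, prio, cpu in sorted(running_pods, key=lambda p: p[1]):
        if prio >= incoming_priority:
            break  # equal/higher priority pods are never preempted
        victims.append(name)
        freed += cpu
        if freed >= needed_cpu:
            return victims
    return None  # preemption cannot free enough capacity

running = [("web", 1000000, 1.0), ("batch-a", 100, 1.0), ("batch-b", 100, 1.0)]
# A critical pod (priority 1000000) needing 2 CPU evicts both batch pods:
print(choose_victims(1000000, True, running, 2.0))  # ['batch-a', 'batch-b']
```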

8. Cloud Provider-Specific Notes

AWS (EKS)

  • CA uses Auto Scaling Groups. Each ASG is a node group.
  • Use --node-group-auto-discovery with ASG tags for automatic configuration.
  • Recommendation: Consider Karpenter on EKS — it is GA and provides superior scaling performance.

GCP (GKE)

  • CA is built into GKE and managed by Google. No separate deployment needed.
  • Configure autoscaling per node pool via gcloud or Terraform.
  • GKE also supports Node Auto-Provisioning (NAP), which automatically creates and deletes node pools based on workload requirements (similar to Karpenter's approach).

Azure (AKS)

  • CA is integrated into AKS. Enabled via --enable-cluster-autoscaler flag.
  • Uses Virtual Machine Scale Sets (VMSS) as node groups.
  • Configure per node pool with --min-count and --max-count.

Common Pitfalls

  1. Setting maxSize equal to minSize: This effectively disables autoscaling. Set maxSize high enough to handle peak load.
  2. Not using PodDisruptionBudgets: Without PDBs, CA can drain critical pods during scale-down, causing downtime. Always set PDBs for production workloads.
  3. Pods without resource requests: CA cannot accurately calculate node utilization or simulate scheduling without resource requests. Pods without requests effectively have zero resource footprint in CA's calculations, leading to incorrect scaling decisions.
  4. Local storage blocking scale-down: Pods using emptyDir with data or hostPath volumes prevent CA from draining the node by default. Use --skip-nodes-with-local-storage=false if you want CA to evict these pods.
  5. Scaling speed expectations: CA + cloud provider node provisioning takes 1-3 minutes. If your workload needs sub-minute scaling, consider over-provisioning with "pause pods" or using Karpenter.
  6. Not balancing across zones: Without --balance-similar-node-groups=true, CA may concentrate nodes in a single availability zone, reducing availability.
  7. Ignoring max-node-provision-time: If a new node takes longer than this timeout (default: 15 minutes) to become Ready, CA marks the scale-up as failed and may try a different node group.

Best Practices

  1. Set resource requests on all pods — accurate requests are essential for CA to make correct scaling decisions.
  2. Use PodDisruptionBudgets on all production workloads to protect against aggressive scale-down.
  3. Use the least-waste or priority expander for cost optimization.
  4. Enable --balance-similar-node-groups to spread nodes across availability zones.
  5. Use multiple instance types per node group (mixed instances policy on AWS) to improve spot instance availability.
  6. Consider Karpenter on AWS for faster provisioning, active consolidation, and simpler configuration.
  7. Monitor CA logs and metrics — watch for ScaleUp, ScaleDown, NoScaleUp events to understand scaling behavior.
  8. Use "pause pods" (low-priority pods that reserve capacity) if you need fast scale-up without waiting for node provisioning.
  9. Set max-nodes-total as a safety valve to prevent runaway scaling from driving up costs unexpectedly.
  10. Review node utilization regularly — if it is consistently below 40%, you have room to right-size nodes or enable more aggressive consolidation.

What's Next?

  • Learn about Cost Optimization strategies including spot instances, VPA right-sizing, and FinOps practices.
  • Explore Pod Security to ensure autoscaled nodes run secure workloads.
  • See how Multi-Tenancy uses ResourceQuotas to control how much capacity each team can consume.
  • Understand Progressive Delivery and how canary deployments interact with autoscaling during rollouts.