Cluster Autoscaler (Scale the Nodes)
- Infrastructure Scaling: Unlike HPA (which scales pods) and VPA (which scales pod resources), the Cluster Autoscaler (CA) adds or removes physical/virtual nodes when the existing cluster capacity is exhausted or underutilized.
- Trigger Mechanism: CA monitors for pods stuck in the `Pending` state with `Insufficient CPU`/`Insufficient Memory` scheduling failures. When detected, it calculates which node group can satisfy the pending pods and requests new nodes from the cloud provider.
- Scale-Down Logic: CA identifies underutilized nodes (utilization below 50% by default), verifies that all pods on the node can be safely rescheduled elsewhere, and then drains and terminates the node to reduce cost.
- Expander Strategies: When multiple node groups can satisfy pending pods, CA uses an expander strategy (random, most-pods, least-waste, priority) to choose the best fit.
- Karpenter: A next-generation node provisioner that bypasses node groups entirely. It provisions individual nodes with the exact instance type and size needed for pending pods, resulting in faster scaling (seconds vs. minutes) and better bin-packing.
- Cloud Provider Integration: CA integrates with AWS Auto Scaling Groups, GCP Managed Instance Groups, and Azure VMSS. Karpenter works directly with the cloud provider's compute APIs for more flexible provisioning.
We've covered HPA (horizontal pod autoscaling) and VPA (vertical pod autoscaling). But what happens when you need to run 100 pods and your existing nodes are completely full? No amount of pod scaling helps if there is no compute capacity available. You need to scale the infrastructure — the nodes themselves.
1. Automatic Node Provisioning
The Cluster Autoscaler is a Kubernetes component that automatically adjusts the number of nodes in your cluster. It integrates with your cloud provider's compute APIs to add nodes when demand increases and remove them when capacity is no longer needed.
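As a concrete illustration (workload name, image, and sizes are hypothetical), all it takes to engage CA is a Deployment whose replicas' combined requests exceed the cluster's current capacity:

```yaml
# Hypothetical workload: if existing nodes cannot fit all 20 replicas,
# the surplus pods go Pending and the Cluster Autoscaler adds nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 20
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.27
        resources:
          requests:          # requests are what CA's scheduling simulation uses
            cpu: "500m"
            memory: 512Mi
```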
2. How Scale-Up Works
The Cluster Autoscaler scale-up process follows a specific sequence:
- Detection: A pod enters the `Pending` state because the scheduler cannot find a node with sufficient CPU, memory, or other resources. The scheduler sets the condition `PodScheduled: False` with reason `Unschedulable`.
- Evaluation: CA runs its main loop (default: every 10 seconds) and discovers the unschedulable pods. It evaluates which node groups (ASGs, MIGs, VMSS) could accommodate the pending pods.
- Simulation: For each candidate node group, CA simulates adding a node and checks whether the pending pods would fit. It considers node resources, taints, tolerations, affinity rules, and topology spread constraints.
- Expansion: CA selects a node group using the configured expander strategy and requests the cloud provider to increase the node group size by the calculated number of nodes.
- Node Ready: The cloud provider provisions a new VM. The kubelet starts, registers the node with the API server, and the node transitions to `Ready` status. Typical time: 1-3 minutes depending on the cloud provider and VM type.
- Scheduling: The scheduler places the pending pods on the newly available node.
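The detection step is visible on the pod itself: a Pending pod's status carries the exact condition CA watches for. An abridged (illustrative) status block looks like this:

```yaml
# Abridged status of a pod that CA will pick up in its next scan
status:
  phase: Pending
  conditions:
  - type: PodScheduled
    status: "False"
    reason: Unschedulable
    message: "0/3 nodes are available: 3 Insufficient cpu."
```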
Scale-Up Configuration
# Cluster Autoscaler deployment args (key parameters)
containers:
- name: cluster-autoscaler
image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0
command:
- ./cluster-autoscaler
- --v=4
- --cloud-provider=aws
- --skip-nodes-with-local-storage=false
# Scan for pending pods every 10 seconds
- --scan-interval=10s
# Node must be unneeded for 10 minutes before scale-down
- --scale-down-unneeded-time=10m
# Node utilization below this threshold triggers scale-down consideration
- --scale-down-utilization-threshold=0.5
# Maximum time to wait for a new node to become Ready before the scale-up is considered failed
- --max-node-provision-time=15m
# Maximum number of nodes the cluster can grow to
- --max-nodes-total=100
# Expander strategy
- --expander=least-waste
# Balance similar node groups for high availability
- --balance-similar-node-groups=true
# Node groups to manage (AWS example)
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
3. How Scale-Down Works
CA also works in reverse, removing underutilized nodes to save money:
- Utilization Check: Every `scan-interval`, CA calculates the utilization of each node. Utilization is defined as the sum of pod requests divided by the node's allocatable capacity.
- Underutilization Threshold: If a node's utilization is below the `scale-down-utilization-threshold` (default: 50%), it becomes a scale-down candidate.
- Cool-Down Period: The node must remain underutilized for `scale-down-unneeded-time` (default: 10 minutes) to avoid thrashing (scaling down and immediately back up).
- Safety Checks: Before removing a node, CA verifies:
  - All pods on the node can be rescheduled to other existing nodes.
  - No pod has a `PodDisruptionBudget` that would be violated.
  - No pod uses local storage (unless `--skip-nodes-with-local-storage=false`).
  - No pod has the annotation `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"`.
  - The node is not running a mirror pod (static pod) or a pod with no controller (standalone pod).
- Drain and Terminate: CA cordons the node (marks it unschedulable), drains all pods (respecting PDBs and graceful termination), and then requests the cloud provider to terminate the VM.
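Because CA's drain respects PodDisruptionBudgets, a PDB is the standard way to keep scale-down from evicting too many replicas of one workload at once. A minimal sketch, assuming an app labeled `app: web` (hypothetical name):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2        # eviction is blocked if it would drop below 2 ready pods
  selector:
    matchLabels:
      app: web           # hypothetical label; match your workload's pods
```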
Preventing Scale-Down
For nodes that should never be removed (e.g., nodes running singleton controllers or special hardware):
# Annotation on the node
metadata:
annotations:
cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"
For pods that should prevent their node from being removed:
# Annotation on the pod
metadata:
annotations:
cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
4. Expander Strategies
When multiple node groups can satisfy pending pods, the expander determines which one CA chooses:
Random
The simplest strategy. Picks a node group at random. Works well when all node groups have similar configurations.
Most-Pods
Selects the node group that would schedule the most pending pods. Good for maximizing pod throughput.
Least-Waste
Selects the node group that would have the least idle resources after scheduling the pending pods. This optimizes for cost efficiency and is the most commonly recommended strategy.
Priority
Uses a ConfigMap to define a priority order for node groups. CA tries the highest-priority group first and falls back to lower priorities:
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-autoscaler-priority-expander
namespace: kube-system
data:
priorities: |-
100:
- spot-node-group-.* # try spot instances first (cheapest)
50:
- on-demand-node-group-.* # fall back to on-demand
10:
- gpu-node-group-.* # GPU nodes only when needed
This is the most powerful strategy for cost optimization — it ensures CA tries cheaper spot node groups before falling back to on-demand.
5. Node Group / Node Pool Configuration
Node groups (AWS ASGs, GCP MIGs, Azure VMSS) define the pool of VMs that CA can scale. Key configuration considerations:
AWS Auto Scaling Groups
# AWS ASG tags for CA auto-discovery
# k8s.io/cluster-autoscaler/enabled = true
# k8s.io/cluster-autoscaler/<cluster-name> = owned
# ASG configuration
MinSize: 1
MaxSize: 20
DesiredCapacity: 3
# Mixed instances policy for cost optimization
MixedInstancesPolicy:
InstancesDistribution:
OnDemandBaseCapacity: 1 # first node is on-demand (reliability)
OnDemandPercentageAboveBaseCapacity: 0 # rest are spot
SpotAllocationStrategy: capacity-optimized
LaunchTemplate:
Overrides:
- InstanceType: m5.xlarge
- InstanceType: m5a.xlarge
- InstanceType: m5d.xlarge
- InstanceType: m6i.xlarge # multiple types for spot availability
GKE Node Pools
# Create a GKE node pool with autoscaling
gcloud container node-pools create general-pool \
--cluster=production \
--enable-autoscaling \
--min-nodes=1 \
--max-nodes=20 \
--machine-type=e2-standard-4 \
--spot # use spot VMs for cost savings
AKS Node Pools
# Create an AKS node pool with autoscaling
az aks nodepool add \
--resource-group myResourceGroup \
--cluster-name production \
--name generalpool \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 20 \
--node-vm-size Standard_D4s_v3 \
--priority Spot # use spot VMs
Multiple Node Groups
A common pattern is to have different node groups for different workload types:
| Node Group | Instance Type | Use Case | CA Priority |
|---|---|---|---|
| spot-general | m5.xlarge, m5a.xlarge | Stateless web apps, batch jobs | Highest (cheapest) |
| on-demand-general | m5.xlarge | Stateful apps, controllers | Medium |
| gpu-spot | g4dn.xlarge | ML inference, optional GPU workloads | Low |
| gpu-on-demand | p3.2xlarge | ML training (cannot be interrupted) | Lowest |
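To steer workloads onto the right group, the usual pattern is to label and taint special groups and have pods select and tolerate them. A sketch, assuming the GPU nodes carry a `node-group` label and an `nvidia.com/gpu` taint (label name and image are hypothetical):

```yaml
# Pod spec fragment pinning an ML workload to the gpu-spot group
spec:
  nodeSelector:
    node-group: gpu-spot          # hypothetical label applied to the ASG's nodes
  tolerations:
  - key: nvidia.com/gpu           # assumes GPU nodes are tainted with this key
    operator: Exists
    effect: NoSchedule
  containers:
  - name: inference
    image: my-inference:latest    # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1         # requires the NVIDIA device plugin on the node
```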
6. Karpenter: The Modern Alternative
Karpenter is a node provisioner originally built by AWS and now a CNCF project. It takes a fundamentally different approach from the Cluster Autoscaler:
| Aspect | Cluster Autoscaler | Karpenter |
|---|---|---|
| Provisioning unit | Node groups (ASGs/MIGs) | Individual nodes |
| Instance type selection | Fixed by node group config | Dynamic — chooses optimal instance type per workload |
| Scale-up speed | 1-3 minutes (ASG-dependent) | 30-60 seconds (direct API calls) |
| Consolidation | Scale-down only for underutilized nodes | Active consolidation — replaces nodes with cheaper/smaller ones |
| Configuration | Per node group (ASG size, instance types) | Declarative NodePool CRD |
| Cloud support | AWS, GCP, Azure, and others | AWS (GA), Azure (beta), others in development |
Karpenter NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: general
spec:
template:
metadata:
labels:
environment: production
spec:
# NodeClass reference (cloud-provider-specific config)
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
# Instance type requirements
requirements:
- key: kubernetes.io/arch
operator: In
values: ["amd64", "arm64"] # allow ARM for cost savings
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"] # prefer spot
- key: karpenter.k8s.aws/instance-category
operator: In
values: ["m", "c", "r"] # general, compute, memory families
- key: karpenter.k8s.aws/instance-generation
operator: Gt
values: ["4"] # generation 5+
# Automatically expire nodes after 720 hours (30 days)
expireAfter: 720h
# Disruption behavior
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
consolidateAfter: 30s
# Total resource limits for this NodePool
limits:
cpu: 200
memory: 400Gi
# Relative priority when multiple NodePools could satisfy pending pods
weight: 50 # higher weight = preferred
EC2NodeClass (AWS-Specific)
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
name: default
spec:
# AMI discovery
amiSelectorTerms:
- alias: al2023@latest # Amazon Linux 2023
# Subnet discovery
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: production
# Security group discovery
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: production
# Instance profile for node IAM role
role: KarpenterNodeRole-production
# Block device configuration
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
volumeSize: 100Gi
volumeType: gp3
deleteOnTermination: true
# Tags applied to EC2 instances
tags:
managed-by: karpenter
environment: production
How Karpenter Consolidation Works
Karpenter continuously evaluates whether workloads can be packed more efficiently:
- Empty node deletion: If a node has no non-daemonset pods, Karpenter terminates it immediately.
- Single-node consolidation: If a node's pods can all fit on other existing nodes, Karpenter drains and terminates it.
- Multi-node consolidation: If pods from multiple underutilized nodes can be combined onto fewer nodes, Karpenter replaces them.
- Instance type optimization: If a smaller or cheaper instance type can run the current workload, Karpenter replaces the node.
This active consolidation is more aggressive than CA's scale-down and results in significantly higher cluster utilization.
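Because consolidation is aggressive, Karpenter provides a per-pod opt-out: the `karpenter.sh/do-not-disrupt` annotation blocks voluntary disruption of the node a pod runs on for as long as the pod is running.

```yaml
# Pod template fragment: Karpenter will not consolidate away a node
# while a pod carrying this annotation is running on it
metadata:
  annotations:
    karpenter.sh/do-not-disrupt: "true"
```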
7. Priority-Based Expansion
For clusters with workloads of varying importance, priority-based expansion ensures that high-priority workloads get nodes first:
# PriorityClass for critical workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: critical-production
value: 1000000 # high priority
globalDefault: false
description: "Critical production workloads that must always run"
---
# PriorityClass for batch workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: batch-processing
value: 100 # low priority
globalDefault: false
preemptionPolicy: Never # do not preempt other pods
description: "Batch jobs that can wait for capacity"
When nodes are full, the scheduler will preempt low-priority pods to make room for high-priority pods. CA then provisions new nodes for the displaced low-priority pods if needed.
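Workloads opt in to a priority level by name. A Deployment fragment referencing the `critical-production` class defined above (image name is hypothetical):

```yaml
# Deployment fragment: these pods schedule ahead of (and may preempt) batch pods
spec:
  template:
    spec:
      priorityClassName: critical-production
      containers:
      - name: api
        image: my-api:stable      # hypothetical image
        resources:
          requests:
            cpu: "1"
            memory: 1Gi
```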
8. Cloud Provider-Specific Notes
AWS (EKS)
- CA uses Auto Scaling Groups. Each ASG is a node group.
- Use `--node-group-auto-discovery` with ASG tags for automatic configuration.
- Recommendation: Consider Karpenter on EKS; it is GA and delivers faster, more flexible scaling.
GCP (GKE)
- CA is built into GKE and managed by Google. No separate deployment needed.
- Configure autoscaling per node pool via `gcloud` or Terraform.
- GKE also supports Node Auto-Provisioning (NAP), which automatically creates and deletes node pools based on workload requirements (similar to Karpenter's approach).
Azure (AKS)
- CA is integrated into AKS and enabled via the `--enable-cluster-autoscaler` flag.
- Uses Virtual Machine Scale Sets (VMSS) as node groups.
- Configure per node pool with `--min-count` and `--max-count`.
Common Pitfalls
- Setting `maxSize` equal to `minSize`: This effectively disables autoscaling. Set `maxSize` high enough to handle peak load.
- Not using PodDisruptionBudgets: Without PDBs, CA can drain critical pods during scale-down, causing downtime. Always set PDBs for production workloads.
- Pods without resource requests: CA cannot accurately calculate node utilization or simulate scheduling without resource requests. Pods without requests effectively have zero resource footprint in CA's calculations, leading to incorrect scaling decisions.
- Local storage blocking scale-down: Pods using `emptyDir` with data or `hostPath` volumes prevent CA from draining the node by default. Use `--skip-nodes-with-local-storage=false` if you want CA to evict these pods.
- Scaling speed expectations: CA plus cloud provider node provisioning takes 1-3 minutes. If your workload needs sub-minute scaling, consider over-provisioning with "pause pods" or using Karpenter.
- Not balancing across zones: Without `--balance-similar-node-groups=true`, CA may concentrate nodes in a single availability zone, reducing availability.
- Ignoring `max-node-provision-time`: If a new node takes longer than this timeout (default: 15 minutes) to become Ready, CA marks the scale-up as failed and may try a different node group.
Best Practices
- Set resource requests on all pods — accurate requests are essential for CA to make correct scaling decisions.
- Use PodDisruptionBudgets on all production workloads to protect against aggressive scale-down.
- Use the `least-waste` or `priority` expander for cost optimization.
- Enable `--balance-similar-node-groups` to spread nodes across availability zones.
- Use multiple instance types per node group (mixed instances policy on AWS) to improve spot instance availability.
- Consider Karpenter on AWS for faster provisioning, active consolidation, and simpler configuration.
- Monitor CA logs and metrics; watch for `ScaleUp`, `ScaleDown`, and `NoScaleUp` events to understand scaling behavior.
- Use "pause pods" (low-priority pods that reserve capacity) if you need fast scale-up without waiting for node provisioning.
- Set `max-nodes-total` as a safety valve to prevent runaway scaling from driving up costs unexpectedly.
- Review node utilization regularly; if it is consistently below 40%, you have room to right-size nodes or enable more aggressive consolidation.
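The "pause pods" technique mentioned above can be sketched as a low-priority placeholder Deployment: its pods reserve a node's worth of capacity, and because their priority is negative, the scheduler preempts them the moment real workloads need the space, while CA provisions a replacement node in the background. Names, replica count, and request sizes are illustrative:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                       # lower than any real workload, so pause pods are preempted first
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2                    # roughly two spare nodes' worth of headroom
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "3500m"         # sized to fill most of one m5.xlarge (4 vCPU)
            memory: 12Gi
```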
What's Next?
- Learn about Cost Optimization strategies including spot instances, VPA right-sizing, and FinOps practices.
- Explore Pod Security to ensure autoscaled nodes run secure workloads.
- See how Multi-Tenancy uses ResourceQuotas to control how much capacity each team can consume.
- Understand Progressive Delivery and how canary deployments interact with autoscaling during rollouts.