Serverless Kubernetes (Knative)
- Scale-to-Zero Capabilities: Knative extends Kubernetes to support serverless workloads, automatically scaling applications down to zero replicas when idle and back up on demand. This eliminates cost for truly idle workloads while maintaining Kubernetes-native deployment workflows.
- Cost Optimization: By terminating idle pods entirely, Knative eliminates resource charges for unused compute. This is ideal for development environments, infrequent batch jobs, webhook handlers, and APIs with spiky or low traffic patterns.
- Serving and Eventing: Knative provides two core components: Serving (request-driven autoscaling, revision management, traffic splitting) and Eventing (event sources, brokers, triggers for event-driven architectures). They can be used independently or together.
- KEDA as an Alternative: KEDA (Kubernetes Event-Driven Autoscaling) provides event-driven scaling without the full Knative stack. It scales standard Kubernetes Deployments based on external event sources (queues, streams, databases) and supports scale-to-zero.
- Cold Start Trade-off: The primary cost of scale-to-zero is "cold start" latency. The first request after idle triggers pod creation, image pulling (if not cached), and application startup. Optimizing container images, using warm pools, and configuring `minScale` mitigate this.
Standard Kubernetes Deployments are designed to be always-on. You specify a replica count, and Kubernetes maintains that many pods at all times, whether they are processing thousands of requests per second or sitting completely idle. For workloads with variable or infrequent traffic, this wastes significant compute resources and money.
Knative extends Kubernetes to provide a serverless experience, including the ability to scale down to zero replicas when a service receives no traffic, and to scale back up automatically when requests arrive.
1. Scale-to-Zero
The defining feature of serverless on Kubernetes is scale-to-zero. When no requests are flowing to a service, Knative terminates all pods. When a new request arrives, the Knative activator component holds the request, triggers pod creation, waits for the pod to become ready, and then forwards the request.
The scale-to-zero flow:
- Idle detection: The Knative autoscaler monitors request concurrency. When no requests have arrived for a configurable window (the stable window, 60 seconds by default), it scales the Deployment to zero replicas.
- Request buffering: When a new request arrives, the Knative activator (a cluster-wide component) intercepts it and holds it in memory.
- Scale-up trigger: The activator signals the autoscaler to scale from zero to one (or more) replicas.
- Pod startup: Kubernetes creates the pod, pulls the image (if not cached), and starts the container.
- Request forwarding: Once the pod passes its readiness probe, the activator forwards the buffered request.
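These defaults are tunable cluster-wide. The sketch below shows the `config-autoscaler` ConfigMap in the `knative-serving` namespace with the relevant keys; the values shown are the documented defaults, and per-revision annotations override them.

```yaml
# Cluster-wide autoscaler settings (values shown are the defaults)
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  enable-scale-to-zero: "true"       # Allow services to scale to zero at all
  stable-window: "60s"               # Window the autoscaler averages traffic over
  scale-to-zero-grace-period: "30s"  # How long to keep routing capacity before removing the last pod
```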
2. Knative Serving
Knative Serving handles the request-driven lifecycle of serverless workloads. It manages autoscaling, revision tracking, and traffic routing.
Knative Service
A Knative Service is the primary resource. It wraps a Kubernetes Deployment, Service, and Ingress into a single declarative resource.
```yaml
# Knative Service with scale-to-zero and concurrency-based autoscaling
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: api-handler
  namespace: production
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev  # Knative Pod Autoscaler
        autoscaling.knative.dev/metric: concurrency   # Scale based on concurrent requests
        autoscaling.knative.dev/target: "10"          # Target 10 concurrent requests per pod
        autoscaling.knative.dev/minScale: "0"         # Allow scale to zero
        autoscaling.knative.dev/maxScale: "50"        # Maximum 50 pods
        autoscaling.knative.dev/scale-down-delay: "30s"  # Wait 30s before scaling down
    spec:
      containerConcurrency: 0  # Unlimited hard concurrency per container
      timeoutSeconds: 300      # Request timeout
      containers:
        - image: myregistry.io/api-handler:v1.2.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "1"
              memory: "512Mi"
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 2
```
Revisions
Every change to a Knative Service creates a new Revision -- an immutable, point-in-time snapshot of the service configuration. Revisions are the foundation for traffic splitting and rollback.
```yaml
# Traffic splitting between two revisions (canary deployment)
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: api-handler
  namespace: production
spec:
  template:
    metadata:
      name: api-handler-v2  # Explicit revision name
    spec:
      containers:
        - image: myregistry.io/api-handler:v2.0.0
  traffic:
    - revisionName: api-handler-v1
      percent: 90  # 90% to the current stable version
    - revisionName: api-handler-v2
      percent: 10  # 10% to the new canary version
    - revisionName: api-handler-v2
      tag: canary  # Named URL: canary-api-handler.example.com
```
This traffic splitting is handled at the networking layer (via Istio, Kourier, or Contour), allowing you to gradually shift traffic to new versions and roll back instantly by updating the traffic percentages.
Autoscaling Modes
Knative supports two autoscaling classes:
- KPA (Knative Pod Autoscaler): The default. Supports scale-to-zero and scale based on concurrency or requests-per-second. Responds faster than HPA for request-driven workloads.
- HPA (Horizontal Pod Autoscaler): Uses the standard Kubernetes HPA. Supports CPU and memory-based scaling but does not support scale-to-zero.
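Switching classes is a matter of annotations on the revision template. The sketch below opts a service into the HPA class for CPU-based scaling; the service name is illustrative.

```yaml
# HPA-class autoscaling: CPU-based, but no scale-to-zero,
# so minScale must be at least 1
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: cpu-bound-worker   # Illustrative name
  namespace: production
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: hpa.autoscaling.knative.dev
        autoscaling.knative.dev/metric: cpu
        autoscaling.knative.dev/target: "80"   # Target 80% CPU utilization
        autoscaling.knative.dev/minScale: "1"  # HPA class cannot reach zero
    spec:
      containers:
        - image: myregistry.io/worker:v1.0.0
```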
3. Knative Eventing
Knative Eventing provides infrastructure for building event-driven architectures. It decouples event producers from event consumers using CloudEvents, the CNCF standard for event formatting.
Core Components
- Sources: Connect to external systems and produce events. Examples: KafkaSource (Kafka topics), ApiServerSource (Kubernetes API events), PingSource (cron-based events), GitHubSource (webhooks).
- Brokers: Event routing hubs that receive events and distribute them to subscribers based on filters.
- Triggers: Define which events a subscriber wants to receive, filtering by event type, source, or custom attributes.
```yaml
# Knative Eventing: a scheduled PingSource feeding a Broker
apiVersion: sources.knative.dev/v1
kind: PingSource
metadata:
  name: daily-report-trigger
  namespace: production
spec:
  schedule: "0 6 * * *"  # Every day at 6 AM UTC
  contentType: "application/json"
  data: '{"report": "daily-summary"}'
  sink:
    ref:
      apiVersion: eventing.knative.dev/v1
      kind: Broker
      name: default
---
# Broker receives events and routes them
apiVersion: eventing.knative.dev/v1
kind: Broker
metadata:
  name: default
  namespace: production
---
# Trigger routes specific events to a Knative Service
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: daily-report-handler
  namespace: production
spec:
  broker: default
  filter:
    attributes:
      type: dev.knative.sources.ping  # Filter by event type
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: report-generator  # This Knative Service handles the event
```
This pattern enables fully serverless event processing: the report-generator service scales to zero when there are no events, spins up to process the daily trigger, and scales back to zero when done.
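On the subscriber side, events arrive as ordinary HTTP requests. The sketch below (a hypothetical handler, not part of any Knative library) shows the shape of a binary-mode CloudEvents consumer: metadata rides in `ce-*` HTTP headers, the payload in the request body.

```python
# Hypothetical sketch of what a report-generator subscriber might do.
# Binary-mode CloudEvents put attributes in "ce-*" HTTP headers and
# the event payload in the request body.
import json


def parse_cloudevent(headers: dict, body: bytes) -> dict:
    """Extract CloudEvents attributes (binary content mode) from HTTP headers."""
    attrs = {
        key.lower()[3:]: value            # "Ce-Type" -> "type", etc.
        for key, value in headers.items()
        if key.lower().startswith("ce-")
    }
    attrs["data"] = json.loads(body) if body else None
    return attrs


def handle(headers: dict, body: bytes) -> tuple:
    """Process daily-report ping events; acknowledge and ignore everything else."""
    event = parse_cloudevent(headers, body)
    if event.get("type") == "dev.knative.sources.ping":
        report = event["data"].get("report", "unknown")
        return 200, f"generating report: {report}"
    return 202, "ignored"
```

A non-2xx response tells the Broker delivery failed, so returning 202 for uninteresting events prevents unnecessary redelivery.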
4. KEDA: Event-Driven Autoscaling
KEDA (Kubernetes Event-Driven Autoscaling) provides an alternative to Knative for event-driven workloads. KEDA extends the Kubernetes HPA with custom scalers that monitor external event sources.
Key differences from Knative:
- Works with standard Deployments: KEDA scales regular Kubernetes Deployments, Jobs, and StatefulSets. No new resource types are required.
- 60+ scalers: KEDA supports scaling based on Kafka lag, RabbitMQ queue depth, AWS SQS, Azure Service Bus, Prometheus metrics, cron schedules, PostgreSQL query results, and many more.
- Simpler to adopt: KEDA requires only a ScaledObject CRD alongside your existing Deployment. No networking layer changes needed.
```yaml
# KEDA ScaledObject: scale a Deployment based on Kafka consumer lag
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor  # Standard Kubernetes Deployment
  minReplicaCount: 0       # Scale to zero when no lag
  maxReplicaCount: 30
  cooldownPeriod: 60       # Wait 60s before scaling to zero
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.production:9092
        consumerGroup: order-processor
        topic: orders
        lagThreshold: "10"  # Scale up when lag exceeds 10 messages
```
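Scalers that talk to authenticated systems usually pair the trigger with a `TriggerAuthentication` resource rather than embedding credentials in the ScaledObject. A hedged sketch, assuming a Secret named `kafka-credentials` exists with the keys shown:

```yaml
# Sketch: pulling Kafka SASL credentials from a Secret
# (Secret name and keys are illustrative)
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: kafka-auth
  namespace: production
spec:
  secretTargetRef:
    - parameter: username
      name: kafka-credentials  # Hypothetical Secret
      key: username
    - parameter: password
      name: kafka-credentials
      key: password
```

The trigger then references it via `authenticationRef: {name: kafka-auth}`, keeping secrets out of the ScaledObject spec.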
```yaml
# KEDA ScaledJob: run batch jobs based on queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: image-processor
  namespace: production
spec:
  jobTargetRef:
    template:
      spec:
        containers:
          - name: processor
            image: myregistry.io/image-processor:v1.0
        restartPolicy: Never
  pollingInterval: 10  # Check the queue every 10 seconds
  maxReplicaCount: 20
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456/images
        queueLength: "5"  # Each job processes 5 messages
```
5. OpenFaaS
OpenFaaS (Open Function as a Service) provides a simpler, function-oriented serverless experience on Kubernetes. It is easier to get started with than Knative but has fewer features for advanced use cases.
OpenFaaS works best for simple function workloads where developers want to write a function handler and deploy it without managing Kubernetes manifests. It supports multiple languages through templates and includes a built-in UI and CLI.
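To give a feel for the programming model, here is a minimal function following the shape of the OpenFaaS `python3` template: a single `handle(req)` that receives the request body as a string and returns the response body. The JSON-envelope behavior is illustrative, not part of any template.

```python
# Minimal OpenFaaS-style function handler (python3 template shape).
# The echo-as-JSON behavior is purely illustrative.
import json


def handle(req: str) -> str:
    """Parse the request body as JSON if possible and echo it back."""
    try:
        payload = json.loads(req) if req else {}
    except json.JSONDecodeError:
        payload = {"raw": req}  # Fall back to wrapping the raw body
    return json.dumps({"received": payload, "status": "ok"})
```

The platform handles HTTP plumbing, scaling, and deployment; the developer writes only this handler and deploys it with the `faas-cli` tool.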
6. Cold Starts: The Fundamental Trade-off
The primary disadvantage of scale-to-zero is cold start latency. When the first request arrives after an idle period, the user must wait for:
- Activator processing: ~10-50ms for Knative to detect and route the request.
- Pod scheduling: ~100-500ms for the scheduler to find a node and create the pod.
- Image pulling: 0ms (if cached) to 30+ seconds (if not cached). This is typically the largest contributor.
- Container startup: Depends on the application. A Go binary starts in milliseconds; a JVM application may take 5-15 seconds.
- Readiness probe: The time until the first readiness probe succeeds.
Mitigating Cold Starts
- Pre-pull images: Use a DaemonSet to pre-pull serverless images on all nodes so that image pull time is eliminated.
- Use small images: Go, Rust, and GraalVM native images start in milliseconds. Avoid large JVM applications for latency-sensitive serverless workloads, or use GraalVM native compilation.
- Set `minScale: 1`: Keep at least one pod running at all times. You lose the cost savings of scale-to-zero but eliminate cold starts entirely.
- Configure `scale-down-delay`: Increase the idle timeout before scaling to zero. A 5-minute delay handles most bursty traffic patterns without cold starts.
- Use init containers wisely: Heavy initialization (database migrations, cache warming) in init containers adds to cold start time. Move these to the application startup path or external jobs.
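The pre-pull strategy above can be sketched as a DaemonSet: an init container pulls the serverless image on every node, then a tiny pause container keeps the pod alive so the image stays in the node's cache. Image names are illustrative.

```yaml
# Pre-pull a serverless image on every node to eliminate pull time
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prepull-api-handler
  namespace: production
spec:
  selector:
    matchLabels:
      app: prepull-api-handler
  template:
    metadata:
      labels:
        app: prepull-api-handler
    spec:
      initContainers:
        - name: pull
          image: myregistry.io/api-handler:v1.2.0  # The image to warm
          command: ["sh", "-c", "true"]            # Exit immediately; pulling is the point
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9         # Minimal footprint keeps the pod alive
```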
7. Knative vs. Cloud-Native Serverless
| Aspect | Knative (K8s Serverless) | Cloud Serverless (Lambda/Cloud Functions) |
|---|---|---|
| Infrastructure | You manage the cluster | Fully managed by cloud provider |
| Cold starts | Higher (pod scheduling + image pull) | Lower (optimized by provider) |
| Execution time | Configurable, no platform-imposed cap | Typically limited (15 min for Lambda) |
| Language support | Any container image | Limited to supported runtimes |
| Vendor lock-in | Portable across clouds | Locked to one cloud provider |
| Cost at scale | Lower (cluster amortization) | Higher (per-invocation pricing) |
| Networking | Full Kubernetes networking | Limited (VPC configuration required) |
Choose Knative when you need container-level flexibility, longer execution times, or multi-cloud portability. Choose cloud-native serverless when you want zero infrastructure management and can accept the provider's constraints.
8. Event-Driven Architecture Patterns
Webhook Processing
External webhooks (GitHub, Stripe, Slack) trigger a Knative Service that processes the event and scales to zero between events.
Scheduled Jobs
A PingSource fires at scheduled intervals, triggering a Knative Service for batch processing (report generation, data aggregation, cleanup tasks).
Stream Processing
KEDA scales consumers based on message lag in Kafka, SQS, or RabbitMQ. When the queue is empty, consumers scale to zero. When messages arrive, consumers scale proportionally to the backlog.
Fan-Out / Fan-In
A Knative Broker receives events and fans them out to multiple subscribers via Triggers. Each subscriber processes the event independently, enabling parallel processing of different aspects of the same event.
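Fan-out falls out of defining multiple Triggers against the same Broker. A sketch, with illustrative event type and service names:

```yaml
# Two Triggers on the same Broker deliver the same event type
# to independent subscribers (names are illustrative)
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: order-created-billing
  namespace: production
spec:
  broker: default
  filter:
    attributes:
      type: com.example.order.created
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: billing-service       # Charges the customer
---
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: order-created-notify
  namespace: production
spec:
  broker: default
  filter:
    attributes:
      type: com.example.order.created
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: notification-service  # Emails the confirmation
```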
Common Pitfalls
- Not accounting for cold start latency: If your SLA requires sub-second response times, scale-to-zero may violate it. Use `minScale: 1` for latency-sensitive services or accept the latency trade-off for non-critical workloads.
- Large container images: A 500 MB image takes significant time to pull, dominating cold start latency. Keep images as small as possible with multi-stage builds, distroless base images, and pre-pull strategies.
- Networking layer overhead: Knative requires a networking layer (Istio, Kourier, or Contour). Each adds operational complexity. Kourier is the lightest option; Istio provides the most features but is the heaviest.
- KEDA and Knative confusion: KEDA and Knative solve overlapping problems but are designed differently. KEDA is best for scaling standard workloads based on external metrics. Knative is best for request-driven, HTTP-based serverless workloads.
- Ignoring resource requests: Pods that scale from zero need accurate resource requests so the scheduler can place them quickly. Under-requesting resources causes scheduling delays; over-requesting wastes capacity.
- Not monitoring scale events: Without visibility into scaling decisions (scale-to-zero events, cold start latency, activator queue depth), you cannot optimize performance. Monitor Knative's autoscaler metrics and alert on prolonged cold starts.
What's Next?
- Deploy Knative Serving using the official quickstart guide and experiment with scale-to-zero and traffic splitting.
- Evaluate KEDA for event-driven workloads that need to scale based on external event sources (Kafka, SQS, RabbitMQ).
- Implement Knative Eventing for event-driven architectures using Brokers and Triggers to decouple producers from consumers.
- Optimize cold start performance by pre-pulling images, using lightweight runtimes (Go, Rust), and configuring appropriate `minScale` values.
- Compare Knative vs. cloud serverless (Lambda, Cloud Functions, Azure Functions) for your specific workload patterns to determine the right balance of flexibility and managed infrastructure.
- Explore Knative Functions (`func`) for a developer-friendly experience that generates Knative Services from function code without writing Kubernetes manifests.