Serverless Kubernetes (Knative)
- Scale-to-Zero Capabilities: Knative extends Kubernetes to support serverless workloads, automatically scaling applications down to zero replicas when idle and back up on demand. This eliminates cost for truly idle workloads while maintaining Kubernetes-native deployment workflows.
- Cost Optimization: By terminating idle pods entirely, Knative eliminates resource charges for unused compute. This is ideal for development environments, infrequent batch jobs, webhook handlers, and APIs with spiky or low traffic patterns.
- Serving and Eventing: Knative provides two core components: Serving (request-driven autoscaling, revision management, traffic splitting) and Eventing (event sources, brokers, triggers for event-driven architectures). They can be used independently or together.
- KEDA as an Alternative: KEDA (Kubernetes Event-Driven Autoscaling) provides event-driven scaling without the full Knative stack. It scales standard Kubernetes Deployments based on external event sources (queues, streams, databases) and supports scale-to-zero.
- Cold Start Trade-off: The primary cost of scale-to-zero is "cold start" latency. The first request after idle triggers pod creation, image pulling (if not cached), and application startup. Optimizing container images, using warm pools, and configuring `minScale` mitigate this.
Standard Kubernetes Deployments are designed to be always-on. You specify a replica count, and Kubernetes maintains that many pods at all times, whether they are processing thousands of requests per second or sitting completely idle. For workloads with variable or infrequent traffic, this wastes significant compute resources and money.
Knative extends Kubernetes to provide a serverless experience, including the ability to scale down to zero replicas when a service receives no traffic, and to scale back up automatically when requests arrive.
1. Scale-to-Zero
The defining feature of serverless on Kubernetes is scale-to-zero. When no requests are flowing to a service, Knative terminates all pods. When a new request arrives, the Knative activator component holds the request, triggers pod creation, waits for the pod to become ready, and then forwards the request.
The scale-to-zero flow:
- Idle detection: The Knative autoscaler monitors request concurrency. When no requests have arrived for a configurable window (the stable window, 60 seconds by default), it scales the Deployment to zero replicas.
- Request buffering: When a new request arrives, the Knative activator (a cluster-wide component) intercepts it and holds it in memory.
- Scale-up trigger: The activator signals the autoscaler to scale from zero to one (or more) replicas.
- Pod startup: Kubernetes creates the pod, pulls the image (if not cached), and starts the container.
- Request forwarding: Once the pod passes its readiness probe, the activator forwards the buffered request.
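These defaults are tunable cluster-wide. The sketch below shows the `config-autoscaler` ConfigMap in the `knative-serving` namespace with the relevant keys; the values shown are the documented defaults, and per-revision annotations override them.

```yaml
# Cluster-wide autoscaler settings (values shown are the defaults)
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  enable-scale-to-zero: "true"       # Allow services to scale to zero at all
  stable-window: "60s"               # Window the autoscaler averages traffic over
  scale-to-zero-grace-period: "30s"  # How long to keep routing capacity before removing the last pod
```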
2. Knative Serving
Knative Serving handles the request-driven lifecycle of serverless workloads. It manages autoscaling, revision tracking, and traffic routing.
Knative Service
A Knative Service is the primary resource. It wraps a Kubernetes Deployment, Service, and Ingress into a single declarative resource.
```yaml
# Knative Service with scale-to-zero and concurrency-based autoscaling
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: api-handler
  namespace: production
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev  # Knative Pod Autoscaler
        autoscaling.knative.dev/metric: concurrency   # Scale based on concurrent requests
        autoscaling.knative.dev/target: "10"          # Target 10 concurrent requests per pod
        autoscaling.knative.dev/minScale: "0"         # Allow scale to zero
        autoscaling.knative.dev/maxScale: "50"        # Maximum 50 pods
        autoscaling.knative.dev/scale-down-delay: "30s"  # Wait 30s before scaling down
    spec:
      containerConcurrency: 0  # Unlimited hard concurrency per container
      timeoutSeconds: 300      # Request timeout
      containers:
        - image: myregistry.io/api-handler:v1.2.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "1"
              memory: "512Mi"
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 2
```
Revisions
Every change to a Knative Service creates a new Revision -- an immutable, point-in-time snapshot of the service configuration. Revisions are the foundation for traffic splitting and rollback.
```yaml
# Traffic splitting between two revisions (canary deployment)
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: api-handler
  namespace: production
spec:
  template:
    metadata:
      name: api-handler-v2  # Explicit revision name
    spec:
      containers:
        - image: myregistry.io/api-handler:v2.0.0
  traffic:
    - revisionName: api-handler-v1
      percent: 90  # 90% to the current stable version
    - revisionName: api-handler-v2
      percent: 10  # 10% to the new canary version
    - revisionName: api-handler-v2
      tag: canary  # Named URL: canary-api-handler.example.com
```
This traffic splitting is handled at the networking layer (via Istio, Kourier, or Contour), allowing you to gradually shift traffic to new versions and roll back instantly by updating the traffic percentages.
Autoscaling Modes
Knative supports two autoscaling classes:
- KPA (Knative Pod Autoscaler): The default. Supports scale-to-zero and scale based on concurrency or requests-per-second. Responds faster than HPA for request-driven workloads.
- HPA (Horizontal Pod Autoscaler): Uses the standard Kubernetes HPA. Supports CPU and memory-based scaling but does not support scale-to-zero.
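Switching classes is a matter of annotations on the revision template. The sketch below opts a service into the HPA class for CPU-based scaling; the service name is illustrative.

```yaml
# HPA-class autoscaling: CPU-based, but no scale-to-zero,
# so minScale must be at least 1
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: cpu-bound-worker   # Illustrative name
  namespace: production
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: hpa.autoscaling.knative.dev
        autoscaling.knative.dev/metric: cpu
        autoscaling.knative.dev/target: "80"   # Target 80% CPU utilization
        autoscaling.knative.dev/minScale: "1"  # HPA class cannot reach zero
    spec:
      containers:
        - image: myregistry.io/worker:v1.0.0
```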
3. Knative Eventing
Knative Eventing provides infrastructure for building event-driven architectures. It decouples event producers from event consumers using CloudEvents, the CNCF standard for event formatting.
Core Components
- Sources: Connect to external systems and produce events. Examples: KafkaSource (Kafka topics), ApiServerSource (Kubernetes API events), PingSource (cron-based events), GitHubSource (webhooks).
- Brokers: Event routing hubs that receive events and distribute them to subscribers based on filters.
- Triggers: Define which events a subscriber wants to receive, filtering by event type, source, or custom attributes.
```yaml
# Knative Eventing: a scheduled PingSource feeding a Broker
apiVersion: sources.knative.dev/v1
kind: PingSource
metadata:
  name: daily-report-trigger
  namespace: production
spec:
  schedule: "0 6 * * *"  # Every day at 6 AM UTC
  contentType: "application/json"
  data: '{"report": "daily-summary"}'
  sink:
    ref:
      apiVersion: eventing.knative.dev/v1
      kind: Broker
      name: default
---
# Broker receives events and routes them
apiVersion: eventing.knative.dev/v1
kind: Broker
metadata:
  name: default
  namespace: production
---
# Trigger routes specific events to a Knative Service
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: daily-report-handler
  namespace: production
spec:
  broker: default
  filter:
    attributes:
      type: dev.knative.sources.ping  # Filter by event type
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: report-generator  # This Knative Service handles the event
```
This pattern enables fully serverless event processing: the report-generator service scales to zero when there are no events, spins up to process the daily trigger, and scales back to zero when done.
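On the subscriber side, events arrive as ordinary HTTP requests. The sketch below (a hypothetical handler, not part of any Knative library) shows the shape of a binary-mode CloudEvents consumer: metadata rides in `ce-*` HTTP headers, the payload in the request body.

```python
# Hypothetical sketch of what a report-generator subscriber might do.
# Binary-mode CloudEvents put attributes in "ce-*" HTTP headers and
# the event payload in the request body.
import json


def parse_cloudevent(headers: dict, body: bytes) -> dict:
    """Extract CloudEvents attributes (binary content mode) from HTTP headers."""
    attrs = {
        key.lower()[3:]: value            # "Ce-Type" -> "type", etc.
        for key, value in headers.items()
        if key.lower().startswith("ce-")
    }
    attrs["data"] = json.loads(body) if body else None
    return attrs


def handle(headers: dict, body: bytes) -> tuple:
    """Process daily-report ping events; acknowledge and ignore everything else."""
    event = parse_cloudevent(headers, body)
    if event.get("type") == "dev.knative.sources.ping":
        report = event["data"].get("report", "unknown")
        return 200, f"generating report: {report}"
    return 202, "ignored"
```

A non-2xx response tells the Broker delivery failed, so returning 202 for uninteresting events prevents unnecessary redelivery.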
4. KEDA: Event-Driven Autoscaling
KEDA (Kubernetes Event-Driven Autoscaling) provides an alternative to Knative for event-driven workloads. KEDA extends the Kubernetes HPA with custom scalers that monitor external event sources.
Key differences from Knative:
- Works with standard Deployments: KEDA scales regular Kubernetes Deployments, Jobs, and StatefulSets. No new resource types are required.
- 60+ scalers: KEDA supports scaling based on Kafka lag, RabbitMQ queue depth, AWS SQS, Azure Service Bus, Prometheus metrics, cron schedules, PostgreSQL query results, and many more.
- Simpler to adopt: KEDA requires only a ScaledObject CRD alongside your existing Deployment. No networking layer changes needed.
```yaml
# KEDA ScaledObject: scale a Deployment based on Kafka consumer lag
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor  # Standard Kubernetes Deployment
  minReplicaCount: 0       # Scale to zero when no lag
  maxReplicaCount: 30
  cooldownPeriod: 60       # Wait 60s before scaling to zero
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.production:9092
        consumerGroup: order-processor
        topic: orders
        lagThreshold: "10"  # Scale up when lag exceeds 10 messages
```
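Scalers that talk to authenticated systems usually pair the trigger with a `TriggerAuthentication` resource rather than embedding credentials in the ScaledObject. A hedged sketch, assuming a Secret named `kafka-credentials` exists with the keys shown:

```yaml
# Sketch: pulling Kafka SASL credentials from a Secret
# (Secret name and keys are illustrative)
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: kafka-auth
  namespace: production
spec:
  secretTargetRef:
    - parameter: username
      name: kafka-credentials  # Hypothetical Secret
      key: username
    - parameter: password
      name: kafka-credentials
      key: password
```

The trigger then references it via `authenticationRef: {name: kafka-auth}`, keeping secrets out of the ScaledObject spec.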
```yaml
# KEDA ScaledJob: run batch jobs based on queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: image-processor
  namespace: production
spec:
  jobTargetRef:
    template:
      spec:
        containers:
          - name: processor
            image: myregistry.io/image-processor:v1.0
        restartPolicy: Never
  pollingInterval: 10  # Check the queue every 10 seconds
  maxReplicaCount: 20
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456/images
        queueLength: "5"  # Each job processes 5 messages
```
5. OpenFaaS
OpenFaaS (Open Function as a Service) provides a simpler, function-oriented serverless experience on Kubernetes. It is easier to get started with than Knative but has fewer features for advanced use cases.
OpenFaaS works best for simple function workloads where developers want to write a function handler and deploy it without managing Kubernetes manifests. It supports multiple languages through templates and includes a built-in UI and CLI.
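To give a feel for the programming model, here is a minimal function following the shape of the OpenFaaS `python3` template: a single `handle(req)` that receives the request body as a string and returns the response body. The JSON-envelope behavior is illustrative, not part of any template.

```python
# Minimal OpenFaaS-style function handler (python3 template shape).
# The echo-as-JSON behavior is purely illustrative.
import json


def handle(req: str) -> str:
    """Parse the request body as JSON if possible and echo it back."""
    try:
        payload = json.loads(req) if req else {}
    except json.JSONDecodeError:
        payload = {"raw": req}  # Fall back to wrapping the raw body
    return json.dumps({"received": payload, "status": "ok"})
```

The platform handles HTTP plumbing, scaling, and deployment; the developer writes only this handler and deploys it with the `faas-cli` tool.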
6. Cold Starts: The Fundamental Trade-off
The primary disadvantage of scale-to-zero is cold start latency. When the first request arrives after an idle period, the user must wait for:
- Activator processing: ~10-50ms for Knative to detect and route the request.
- Pod scheduling: ~100-500ms for the scheduler to find a node and create the pod.
- Image pulling: 0ms (if cached) to 30+ seconds (if not cached). This is typically the largest contributor.
- Container startup: Depends on the application. A Go binary starts in milliseconds; a JVM application may take 5-15 seconds.
- Readiness probe: The time until the first readiness probe succeeds.
Mitigating Cold Starts
- Pre-pull images: Use a DaemonSet to pre-pull serverless images on all nodes so that image pull time is eliminated.
- Use small images: Go, Rust, and GraalVM native images start in milliseconds. Avoid large JVM applications for latency-sensitive serverless workloads, or use GraalVM native compilation.
- Set `minScale: 1`: Keep at least one pod running at all times. You lose the cost savings of scale-to-zero but eliminate cold starts entirely.
- Configure `scale-down-delay`: Increase the idle timeout before scaling to zero. A 5-minute delay handles most bursty traffic patterns without cold starts.
- Use init containers wisely: Heavy initialization (database migrations, cache warming) in init containers adds to cold start time. Move these to the application startup path or external jobs.
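The pre-pull strategy above can be sketched as a DaemonSet: an init container pulls the serverless image on every node, then a tiny pause container keeps the pod alive so the image stays in the node's cache. Image names are illustrative.

```yaml
# Pre-pull a serverless image on every node to eliminate pull time
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prepull-api-handler
  namespace: production
spec:
  selector:
    matchLabels:
      app: prepull-api-handler
  template:
    metadata:
      labels:
        app: prepull-api-handler
    spec:
      initContainers:
        - name: pull
          image: myregistry.io/api-handler:v1.2.0  # The image to warm
          command: ["sh", "-c", "true"]            # Exit immediately; pulling is the point
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9         # Minimal footprint keeps the pod alive
```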
7. Knative vs. Cloud-Native Serverless
| Aspect | Knative (K8s Serverless) | Cloud Serverless (Lambda/Cloud Functions) |
|---|---|---|
| Infrastructure | You manage the cluster | Fully managed by cloud provider |
| Cold starts | Higher (pod scheduling + image pull) | Lower (optimized by provider) |
| Execution time | Configurable, no platform-imposed cap | Typically limited (15 min for Lambda) |
| Language support | Any container image | Limited to supported runtimes |
| Vendor lock-in | Portable across clouds | Locked to one cloud provider |
| Cost at scale | Lower (cluster amortization) | Higher (per-invocation pricing) |
| Networking | Full Kubernetes networking | Limited (VPC configuration required) |
Choose Knative when you need container-level flexibility, longer execution times, or multi-cloud portability. Choose cloud-native serverless when you want zero infrastructure management and can accept the provider's constraints.
8. Event-Driven Architecture Patterns
Webhook Processing
External webhooks (GitHub, Stripe, Slack) trigger a Knative Service that processes the event and scales to zero between events.
Scheduled Jobs
A PingSource fires at scheduled intervals, triggering a Knative Service for batch processing (report generation, data aggregation, cleanup tasks).
Stream Processing
KEDA scales consumers based on message lag in Kafka, SQS, or RabbitMQ. When the queue is empty, consumers scale to zero. When messages arrive, consumers scale proportionally to the backlog.
Fan-Out / Fan-In
A Knative Broker receives events and fans them out to multiple subscribers via Triggers. Each subscriber processes the event independently, enabling parallel processing of different aspects of the same event.
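Fan-out falls out of defining multiple Triggers against the same Broker. A sketch, with illustrative event type and service names:

```yaml
# Two Triggers on the same Broker deliver the same event type
# to independent subscribers (names are illustrative)
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: order-created-billing
  namespace: production
spec:
  broker: default
  filter:
    attributes:
      type: com.example.order.created
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: billing-service       # Charges the customer
---
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: order-created-notify
  namespace: production
spec:
  broker: default
  filter:
    attributes:
      type: com.example.order.created
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: notification-service  # Emails the confirmation
```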
Common Pitfalls
- Not accounting for cold start latency: If your SLA requires sub-second response times, scale-to-zero may violate it. Use `minScale: 1` for latency-sensitive services or accept the latency trade-off for non-critical workloads.
- Large container images: A 500 MB image takes significant time to pull, dominating cold start latency. Keep images as small as possible with multi-stage builds, distroless base images, and pre-pull strategies.
- Networking layer overhead: Knative requires a networking layer (Istio, Kourier, or Contour). Each adds operational complexity. Kourier is the lightest option; Istio provides the most features but is the heaviest.
- KEDA and Knative confusion: KEDA and Knative solve overlapping problems but are designed differently. KEDA is best for scaling standard workloads based on external metrics. Knative is best for request-driven, HTTP-based serverless workloads.
- Ignoring resource requests: Pods that scale from zero need accurate resource requests so the scheduler can place them quickly. Under-requesting resources causes scheduling delays; over-requesting wastes capacity.
- Not monitoring scale events: Without visibility into scaling decisions (scale-to-zero events, cold start latency, activator queue depth), you cannot optimize performance. Monitor Knative's autoscaler metrics and alert on prolonged cold starts.
What's Next?
- Deploy Knative Serving using the official quickstart guide and experiment with scale-to-zero and traffic splitting.
- Evaluate KEDA for event-driven workloads that need to scale based on external event sources (Kafka, SQS, RabbitMQ).
- Implement Knative Eventing for event-driven architectures using Brokers and Triggers to decouple producers from consumers.
- Optimize cold start performance by pre-pulling images, using lightweight runtimes (Go, Rust), and configuring appropriate `minScale` values.
- Compare Knative vs. cloud serverless (Lambda, Cloud Functions, Azure Functions) for your specific workload patterns to determine the right balance of flexibility and managed infrastructure.
- Explore Knative Functions (`func`) for a developer-friendly experience that generates Knative Services from function code without writing Kubernetes manifests.