
Jobs & CronJobs

Key Takeaways for AI & Readers
  • Batch Processing Primitives: Kubernetes Jobs are the fundamental resource for tasks that run to completion rather than running indefinitely like Deployments. They guarantee that a specified number of Pods successfully terminate, with configurable parallelism, retry behavior, and timeouts.
  • Job Patterns: Jobs support three distinct patterns -- non-parallel (single pod), fixed completion count (run N pods to success), and work queue (parallel pods pulling from a shared queue). Each pattern suits different batch processing scenarios.
  • CronJob Scheduling: CronJobs automate Job creation on a recurring schedule using standard cron syntax. Critical settings include concurrencyPolicy (Allow, Forbid, Replace) to control overlapping runs, and startingDeadlineSeconds to handle missed schedules.
  • Indexed Jobs: For parallel processing where each pod needs a unique index (e.g., processing shard 0-9 of a dataset), Indexed Jobs assign each pod a unique completion index accessible via the JOB_COMPLETION_INDEX environment variable.
  • Resource Cleanup: Completed Jobs and their Pods are not automatically deleted. Use ttlSecondsAfterFinished to auto-clean finished Jobs, or they will accumulate in etcd and eventually cause performance degradation.

While Deployments and StatefulSets are designed for long-running processes (web servers, APIs, databases), Kubernetes also has first-class support for batch and scheduled tasks through Jobs and CronJobs.

1. Jobs In Depth

A Job creates one or more Pods and ensures that a specified number of them successfully terminate. Unlike a Deployment, which restarts Pods indefinitely, a Job tracks successful completions and stops creating new Pods once the required number of successes is reached.

Core Fields

Every Job spec supports these essential fields:

  • spec.completions: The number of Pods that must successfully complete for the Job to be considered successful. Defaults to 1.
  • spec.parallelism: The maximum number of Pods that can run concurrently. Defaults to 1.
  • spec.backoffLimit: The number of retries before marking the Job as failed. Each retry uses exponential backoff (10s, 20s, 40s, ..., capped at 6 minutes). Defaults to 6.
  • spec.activeDeadlineSeconds: An absolute time limit for the entire Job. Once this deadline is reached, all running Pods are terminated and the Job is marked as failed. This is your safety net against hung Jobs.
  • spec.ttlSecondsAfterFinished: Automatically delete the Job (and its Pods) this many seconds after it completes. Without this, completed Jobs accumulate forever.
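The backoff progression described above can be sketched numerically. This is a simplified model of the controller's behavior, not the implementation itself:

```python
# Simplified model of the Job retry backoff: the delay doubles from a
# 10-second base and is capped at 6 minutes (360 seconds).
def retry_delays(backoff_limit: int, base: int = 10, cap: int = 360) -> list:
    """Approximate wait (in seconds) before each retry attempt."""
    return [min(base * 2**attempt, cap) for attempt in range(backoff_limit)]

# With the default backoffLimit of 6:
print(retry_delays(6))  # [10, 20, 40, 80, 160, 320]
```

Note that the cap only kicks in from the seventh retry onward, so raising backoffLimit beyond 6 adds at most six minutes per extra attempt.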

Job Pattern 1: Non-Parallel (Single Completion)

The simplest pattern: one Pod runs to completion. If it fails, it is retried up to backoffLimit times.

# job-db-migration.yaml
# Run a one-time database migration
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration-v42
  namespace: production
spec:
  backoffLimit: 3                  # Retry up to 3 times on failure
  activeDeadlineSeconds: 600       # Fail the entire Job after 10 minutes
  ttlSecondsAfterFinished: 3600    # Clean up 1 hour after completion
  template:
    metadata:
      labels:
        app: db-migration
    spec:
      restartPolicy: Never         # Required for Jobs: Never or OnFailure
      containers:
        - name: migrate
          image: myapp/migrations:v42
          command: ["python", "manage.py", "migrate", "--no-input"]
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi

restartPolicy: Never vs restartPolicy: OnFailure: With Never, a failed Pod is left in Failed state and a new Pod is created for the retry. With OnFailure, the same Pod is restarted in place. Use Never when you need to inspect failed Pod logs; use OnFailure to avoid accumulating failed Pods.

Job Pattern 2: Fixed Completion Count

Run a specific number of Pods to success, optionally in parallel. Useful when you know exactly how many units of work need processing.

# job-render-frames.yaml
# Render 50 video frames, 10 at a time
apiVersion: batch/v1
kind: Job
metadata:
  name: render-video-frames
  namespace: batch
spec:
  completions: 50                  # 50 Pods must succeed
  parallelism: 10                  # Run 10 at a time
  backoffLimit: 10                 # Allow up to 10 total failures
  ttlSecondsAfterFinished: 7200    # Clean up after 2 hours
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: renderer
          image: studio/renderer:v3
          command: ["render", "--frame-from-queue"]
          resources:
            requests:
              cpu: "2"
              memory: 4Gi

Job Pattern 3: Work Queue

Set parallelism without completions (leaving completions unset, i.e. null). Pods pull work from an external queue (RabbitMQ, Redis, SQS). Once any Pod exits successfully (indicating the queue is empty), no new Pods are created, and the Job completes when all Pods have terminated.

# job-work-queue.yaml
# Process messages from a queue with 5 parallel workers
apiVersion: batch/v1
kind: Job
metadata:
  name: process-uploads
  namespace: batch
spec:
  parallelism: 5                   # 5 workers pulling from the queue
  # No completions field -- workers exit when the queue is empty
  backoffLimit: 4
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: myapp/queue-worker:v2
          env:
            - name: QUEUE_URL
              value: "amqp://rabbitmq.default.svc:5672/uploads"
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
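The worker image above would typically loop until the queue is drained and then exit 0, which the Job counts as a success. A minimal sketch of that loop, with Python's in-process queue standing in for the real RabbitMQ client (the `process` function is a hypothetical placeholder):

```python
import queue

def process(msg: str) -> str:
    """Placeholder for real message handling (e.g., resizing an upload)."""
    return msg.upper()

def run_worker(q, handled: list) -> int:
    """Pull messages until the queue is drained, then exit 0 (success)."""
    while True:
        try:
            # A real worker would block on RabbitMQ/Redis/SQS with a timeout.
            msg = q.get_nowait()
        except queue.Empty:
            return 0  # empty queue -> exit code 0 -> the Job records a success
        handled.append(process(msg))

# Usage: seed a local queue, run one worker, and watch it drain everything.
q = queue.Queue()
for item in ["a", "b", "c"]:
    q.put(item)
done = []
print(run_worker(q, done), done)  # 0 ['A', 'B', 'C']
```

If `process` raised instead, the Pod would exit non-zero and the Job would retry it, up to backoffLimit.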

Indexed Jobs for Parallel Processing

Indexed Jobs (stable since Kubernetes 1.24) assign each Pod a unique index via the JOB_COMPLETION_INDEX environment variable. This is useful when each Pod should process a specific shard or partition of data.

# job-indexed-processing.yaml
# Process 10 data partitions, each handled by a uniquely indexed Pod
apiVersion: batch/v1
kind: Job
metadata:
  name: partition-processor
  namespace: batch
spec:
  completions: 10                  # 10 partitions to process
  parallelism: 5                   # Process 5 at a time
  completionMode: Indexed          # Assign a unique index to each Pod
  backoffLimit: 5
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: processor
          image: myapp/processor:v1
          command:
            - python
            - process_partition.py
            - --partition=$(JOB_COMPLETION_INDEX)   # Each Pod gets 0, 1, 2, ..., 9
          env:
            - name: JOB_COMPLETION_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
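Inside the container, the script maps its index to a slice of the data. A sketch of what a `process_partition.py` might do, assuming a simple contiguous-range partitioning scheme (the function name and split strategy are illustrative, not prescribed by Kubernetes):

```python
import os

def my_partition(total_items: int, completions: int) -> range:
    """Map this Pod's completion index to a contiguous slice of the dataset."""
    index = int(os.environ["JOB_COMPLETION_INDEX"])  # 0 .. completions-1
    per_pod = total_items // completions
    start = index * per_pod
    # The last Pod also picks up the remainder when the split is uneven.
    end = total_items if index == completions - 1 else start + per_pod
    return range(start, end)

# Usage: the Pod with index 3 (of completions=10) over 105 items.
os.environ["JOB_COMPLETION_INDEX"] = "3"
print(list(my_partition(105, 10))[:3])  # [30, 31, 32]
```

Because the index is stable across retries, a failed Pod's replacement reprocesses exactly the same slice.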

2. CronJobs

Jobs are ephemeral: they run, complete, and then their Pods stop.

A CronJob creates a Job on a repeating schedule. It is the Kubernetes equivalent of the Unix crontab -- but with additional controls for concurrency, missed schedules, and history management.

Schedule Syntax

CronJobs use standard 5-field cron syntax:

┌───────────── minute (0-59)
│ ┌───────────── hour (0-23)
│ │ ┌───────────── day of month (1-31)
│ │ │ ┌───────────── month (1-12)
│ │ │ │ ┌───────────── day of week (0-6, Sunday=0)
│ │ │ │ │
* * * * *

Common examples:

  • "0 2 * * *" -- Every day at 2:00 AM
  • "*/15 * * * *" -- Every 15 minutes
  • "0 9 * * 1" -- Every Monday at 9:00 AM
  • "0 0 1 * *" -- First day of every month at midnight

CronJob Configuration

# cronjob-db-backup.yaml
# Back up the database every night at 2 AM
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-db-backup
  namespace: production
spec:
  schedule: "0 2 * * *"            # Every day at 2:00 AM
  timeZone: "America/New_York"     # Explicit timezone (stable in K8s 1.27+)
  concurrencyPolicy: Forbid        # Do not start a new Job if the previous one is still running
  startingDeadlineSeconds: 600     # If missed by >10 minutes, skip this run
  successfulJobsHistoryLimit: 7    # Keep the last 7 successful Job objects
  failedJobsHistoryLimit: 3        # Keep the last 3 failed Job objects
  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 3600      # Kill the Job if it runs longer than 1 hour
      ttlSecondsAfterFinished: 86400   # Clean up the Job after 24 hours
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              # Note: the stock postgres image does not bundle the AWS CLI;
              # in practice, use an image that includes both pg_dump and aws.
              image: postgres:15
              command:
                - /bin/bash
                - -c
                - |
                  pg_dump -h $DB_HOST -U $DB_USER $DB_NAME | \
                    gzip | \
                    aws s3 cp - s3://backups/db/$(date +%Y%m%d-%H%M%S).sql.gz
              env:
                - name: DB_HOST
                  value: "postgres.production.svc"
                - name: DB_USER
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials
                      key: username
                - name: DB_NAME
                  value: "myapp"
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials
                      key: password
              resources:
                requests:
                  cpu: 250m
                  memory: 512Mi

Concurrency Policy

The concurrencyPolicy field controls what happens when it is time to create a new Job but the previous Job is still running:

  • Allow (default): Multiple Jobs can run concurrently. Use this when Jobs are idempotent and independent.
  • Forbid: Skip the new Job if the previous one is still running. Use this for database backups or any operation that should not overlap.
  • Replace: Delete the currently running Job and start a new one. Use this when only the most recent run matters (e.g., cache warming).
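The three policies boil down to a small decision table when a scheduled run comes due. A simplified model (the function and return strings are illustrative, not controller API):

```python
def on_schedule_fire(policy: str, previous_job_running: bool) -> str:
    """Simplified model of what the controller does at schedule time."""
    if not previous_job_running:
        return "start new job"                # nothing to conflict with
    if policy == "Allow":
        return "start new job"                # runs overlap
    if policy == "Forbid":
        return "skip this run"                # wait for the next schedule
    if policy == "Replace":
        return "kill old job, start new job"  # only the newest run matters
    raise ValueError(f"unknown concurrencyPolicy: {policy}")

print(on_schedule_fire("Forbid", previous_job_running=True))  # skip this run
```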

Starting Deadline Seconds

startingDeadlineSeconds defines how many seconds after the scheduled time a Job can still be started. If the CronJob controller was down or the schedule was missed (e.g., the cluster was being upgraded), this field determines whether missed runs are executed.

If the number of missed schedules exceeds 100 within the startingDeadlineSeconds window, the CronJob controller logs a warning and does not create a Job. This prevents a thundering herd of Jobs after an extended controller outage.
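Both rules combine into one catch-up decision per missed run. A simplified sketch of that logic (a model for intuition, not the controller's actual code):

```python
from typing import Optional

def should_start(missed_schedules: int, seconds_late: float,
                 starting_deadline_seconds: Optional[float]) -> bool:
    """Is a missed run still started? (Simplified CronJob controller model.)"""
    if starting_deadline_seconds is not None and seconds_late > starting_deadline_seconds:
        return False  # past the deadline -- this run is skipped
    if missed_schedules > 100:
        return False  # controller refuses and logs a warning instead
    return True

# Usage: a run 30s late with one missed schedule still starts.
print(should_start(1, 30, 600))     # True
print(should_start(101, 30, None))  # False
```

This is why a tight startingDeadlineSeconds (say, 200 for a */5 schedule) keeps the missed-schedule count small and avoids hitting the 100-miss guard after an outage.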

3. Cleanup and History

# cronjob-report.yaml
# Generate a weekly report with strict cleanup
apiVersion: batch/v1
kind: CronJob
metadata:
  name: weekly-report
  namespace: analytics
spec:
  schedule: "0 8 * * 1"            # Every Monday at 8:00 AM
  timeZone: "UTC"
  successfulJobsHistoryLimit: 4    # Keep 4 weeks of successful Jobs
  failedJobsHistoryLimit: 2        # Keep 2 failed Jobs for debugging
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 604800   # Auto-delete after 7 days
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: analytics/report-generator:v5
              command: ["python", "generate_report.py", "--week=last"]

Common Pitfalls

  • Forgetting ttlSecondsAfterFinished: Completed Jobs and their Pods are never cleaned up by default. Over weeks, thousands of completed Job objects accumulate in etcd, degrading API server performance. Always set ttlSecondsAfterFinished or configure successfulJobsHistoryLimit/failedJobsHistoryLimit on CronJobs.
  • Timezone confusion in CronJobs: Before Kubernetes 1.27, CronJob schedules were interpreted in the kube-controller-manager's timezone, which varied by deployment. Always use the timeZone field (stable since 1.27) to make schedules explicit.
  • Using restartPolicy: Always: Jobs do not support restartPolicy: Always. You must use Never or OnFailure. Kubernetes will reject the manifest at apply time if you use Always.
  • Not setting activeDeadlineSeconds: Without an absolute deadline, a hung Job runs forever, consuming cluster resources indefinitely. Always set a reasonable deadline.
  • Ignoring backoffLimit exhaustion: When all retries are exhausted, the Job is marked as failed but remains in the cluster. Without monitoring, failed Jobs go unnoticed. Alert on kube_job_status_failed Prometheus metrics.
  • CronJob schedule drift: A schedule of "0 */2 * * *" means "at minute 0 of every 2nd hour," not "every 2 hours from now." If the Job takes longer than the interval, you need concurrencyPolicy: Forbid to prevent overlapping runs.

Best Practices

  1. Always set resource requests and limits: Batch Jobs can be greedy. Without limits, a poorly written Job can consume all node resources and evict other workloads.
  2. Use activeDeadlineSeconds as a safety net: Even if your Job should finish in 5 minutes, set a deadline of 30 minutes to catch hangs.
  3. Monitor Job success/failure: Use Prometheus metrics (kube_job_status_succeeded, kube_job_status_failed) and alert on failure. CronJob failures are silent by default.
  4. Prefer restartPolicy: Never for debugging: With Never, failed Pods remain so you can inspect their logs with kubectl logs. With OnFailure, logs from the failed attempt are lost when the container restarts.
  5. Use Pod priority for batch workloads: Set a lower PriorityClass for batch Jobs so they can be preempted by higher-priority services during resource contention.
  6. Idempotency is essential: Jobs can be retried and CronJobs can occasionally run twice (due to the "at least once" guarantee). Design your workloads to be idempotent so that duplicate execution is harmless.
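One common way to get the idempotency from point 6 is to key each run's output on a deterministic identifier and skip work that is already done. A file-based sketch (a real job might use a database row or an object-store key instead; the function and filenames are illustrative):

```python
import os
import tempfile

def generate_report(week: str, out_dir: str) -> str:
    """Write the report only if it does not already exist."""
    out_path = os.path.join(out_dir, f"report-{week}.csv")
    if os.path.exists(out_path):
        return "skipped"  # a duplicate or retried run is a harmless no-op
    with open(out_path, "w") as f:
        f.write("week,value\n")  # placeholder for real report content
    return "generated"

# Usage: the second (duplicate) run does no work.
with tempfile.TemporaryDirectory() as d:
    print(generate_report("2024-W07", d))  # generated
    print(generate_report("2024-W07", d))  # skipped
```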

What's Next?

  • GitOps (ArgoCD): Manage your Job and CronJob definitions declaratively in Git with ArgoCD.
  • Disaster Recovery: Use CronJobs to automate backup processes as part of your disaster recovery strategy.
  • CRDs & Operators: For complex batch workflows, consider operator frameworks like Argo Workflows or Tekton that build on the Job primitive.