Jobs & CronJobs
- Batch Processing Primitives: Kubernetes Jobs are the fundamental resource for tasks that run to completion rather than running indefinitely like Deployments. They guarantee that a specified number of Pods successfully terminate, with configurable parallelism, retry behavior, and timeouts.
- Job Patterns: Jobs support three distinct patterns -- non-parallel (single pod), fixed completion count (run N pods to success), and work queue (parallel pods pulling from a shared queue). Each pattern suits different batch processing scenarios.
- CronJob Scheduling: CronJobs automate Job creation on a recurring schedule using standard cron syntax. Critical settings include `concurrencyPolicy` (Allow, Forbid, Replace) to control overlapping runs, and `startingDeadlineSeconds` to handle missed schedules.
- Indexed Jobs: For parallel processing where each pod needs a unique index (e.g., processing shards 0-9 of a dataset), Indexed Jobs assign each pod a unique completion index accessible via the `JOB_COMPLETION_INDEX` environment variable.
- Resource Cleanup: Completed Jobs and their Pods are not automatically deleted. Use `ttlSecondsAfterFinished` to auto-clean finished Jobs, or they will accumulate in etcd and eventually cause performance degradation.
While Deployments and StatefulSets are designed for long-running processes (web servers, APIs, databases), Kubernetes also has first-class support for batch and scheduled tasks through Jobs and CronJobs.
1. Jobs In Depth
A Job creates one or more Pods and ensures that a specified number of them successfully terminate. Unlike a Deployment, which restarts Pods indefinitely, a Job tracks successful completions and stops creating new Pods once the required number of successes is reached.
Core Fields
Every Job spec supports these essential fields:
- `spec.completions`: The number of Pods that must successfully complete for the Job to be considered successful. Defaults to 1.
- `spec.parallelism`: The maximum number of Pods that can run concurrently. Defaults to 1.
- `spec.backoffLimit`: The number of retries before the Job is marked as failed. Each retry uses exponential backoff (10s, 20s, 40s, ..., capped at 6 minutes). Defaults to 6.
- `spec.activeDeadlineSeconds`: An absolute time limit for the entire Job. Once this deadline is reached, all running Pods are terminated and the Job is marked as failed. This is your safety net against hung Jobs.
- `spec.ttlSecondsAfterFinished`: Automatically delete the Job (and its Pods) this many seconds after it finishes. Without this, completed Jobs accumulate forever.
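The exponential backoff behind `backoffLimit` can be sketched as a short calculation. This is a simplified model of the retry delays described above (base of 10s, doubling, capped at 6 minutes); the real controller also adds jitter and resets the backoff after a Pod runs cleanly for a while:

```python
# Sketch of Job retry backoff: delays double from 10s, capped at 6 minutes.
def retry_delays(backoff_limit: int, base: int = 10, cap: int = 360) -> list[int]:
    """Return the delay (in seconds) before each retry attempt."""
    return [min(base * 2**attempt, cap) for attempt in range(backoff_limit)]

# With the default backoffLimit of 6:
print(retry_delays(6))  # [10, 20, 40, 80, 160, 320]
```

With the default `backoffLimit: 6`, a persistently failing Pod burns through all retries in roughly 10 minutes, which is why pairing it with `activeDeadlineSeconds` is a sensible belt-and-suspenders setup.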
Job Pattern 1: Non-Parallel (Single Completion)
The simplest pattern: one Pod runs to completion. If it fails, it is retried up to `backoffLimit` times.
# job-db-migration.yaml
# Run a one-time database migration
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration-v42
  namespace: production
spec:
  backoffLimit: 3                # Retry up to 3 times on failure
  activeDeadlineSeconds: 600     # Fail the entire Job after 10 minutes
  ttlSecondsAfterFinished: 3600  # Clean up 1 hour after completion
  template:
    metadata:
      labels:
        app: db-migration
    spec:
      restartPolicy: Never       # Required for Jobs: Never or OnFailure
      containers:
        - name: migrate
          image: myapp/migrations:v42
          command: ["python", "manage.py", "migrate", "--no-input"]
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
`restartPolicy: Never` vs `restartPolicy: OnFailure`: With `Never`, a failed Pod is left in the Failed state and a new Pod is created for the retry. With `OnFailure`, the same Pod is restarted in place. Use `Never` when you need to inspect failed Pod logs; use `OnFailure` to avoid accumulating failed Pods.
Job Pattern 2: Fixed Completion Count
Run a specific number of Pods to success, optionally in parallel. Useful when you know exactly how many units of work need processing.
# job-render-frames.yaml
# Render 50 video frames, 10 at a time
apiVersion: batch/v1
kind: Job
metadata:
  name: render-video-frames
  namespace: batch
spec:
  completions: 50                # 50 Pods must succeed
  parallelism: 10                # Run 10 at a time
  backoffLimit: 10               # Allow up to 10 total failures
  ttlSecondsAfterFinished: 7200  # Clean up after 2 hours
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: renderer
          image: studio/renderer:v3
          command: ["render", "--frame-from-queue"]
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
Job Pattern 3: Work Queue
Set `parallelism` and leave `completions` unset (or set it to null). Pods pull work from an external queue (RabbitMQ, Redis, SQS). The Job completes successfully once at least one Pod exits with success and all Pods have terminated -- a successful exit signals that the queue is empty.
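The worker logic for this pattern can be sketched in Python. Here a local `queue.Queue` stands in for RabbitMQ/Redis/SQS, and `run_worker` is a hypothetical name; the essential behavior is that each worker exits with success (code 0) once the queue reports empty:

```python
import queue

def run_worker(work_queue: "queue.Queue[str]", results: list[str]) -> int:
    """Drain tasks until the queue is empty, then exit 0 (success).

    In a real Job, the queue would be an external broker, and a successful
    exit tells the Job controller this worker found no more work.
    """
    while True:
        try:
            task = work_queue.get_nowait()
        except queue.Empty:
            return 0                    # queue drained -> Pod exits successfully
        results.append(task.upper())    # stand-in for real processing

q: "queue.Queue[str]" = queue.Queue()
for t in ["a", "b", "c"]:
    q.put(t)
done: list[str] = []
exit_code = run_worker(q, done)
print(exit_code, done)  # 0 ['A', 'B', 'C']
```

Because every worker runs the same loop, `parallelism: 5` simply means five copies of this loop race to drain the queue; the queue itself provides the work distribution.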
# job-work-queue.yaml
# Process messages from a queue with 5 parallel workers
apiVersion: batch/v1
kind: Job
metadata:
  name: process-uploads
  namespace: batch
spec:
  parallelism: 5                 # 5 workers pulling from the queue
  # No completions field -- workers exit when the queue is empty
  backoffLimit: 4
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: myapp/queue-worker:v2
          env:
            - name: QUEUE_URL
              value: "amqp://rabbitmq.default.svc:5672/uploads"
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
Indexed Jobs for Parallel Processing
Indexed Jobs (stable since Kubernetes 1.24) assign each Pod a unique index via the JOB_COMPLETION_INDEX environment variable. This is useful when each Pod should process a specific shard or partition of data.
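How each Pod turns its completion index into a concrete slice of work is up to your application. A common sketch (the `my_partition` helper is hypothetical, and this assumes the input divides into contiguous, roughly equal chunks) looks like this:

```python
import os

def my_partition(items: list[str], index: int, total: int) -> list[str]:
    """Return the contiguous slice of `items` owned by pod `index` of `total`."""
    per_pod = -(-len(items) // total)   # ceiling division
    return items[index * per_pod:(index + 1) * per_pod]

# Inside the Pod, the index comes from the JOB_COMPLETION_INDEX env var
# (0 through completions-1); default to 0 for local testing.
index = int(os.environ.get("JOB_COMPLETION_INDEX", "0"))
files = [f"part-{i:02d}.csv" for i in range(10)]
print(my_partition(files, index, total=10))
```

With `completions: 10` and ten input partitions, each pod owns exactly one file; if a pod fails and is retried, its replacement receives the same index, so the shard assignment is stable across retries.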
# job-indexed-processing.yaml
# Process 10 data partitions, each handled by a uniquely indexed Pod
apiVersion: batch/v1
kind: Job
metadata:
  name: partition-processor
  namespace: batch
spec:
  completions: 10                # 10 partitions to process
  parallelism: 5                 # Process 5 at a time
  completionMode: Indexed        # Assign a unique index to each Pod
  backoffLimit: 5
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: processor
          image: myapp/processor:v1
          command:
            - python
            - process_partition.py
            - --partition=$(JOB_COMPLETION_INDEX)  # Each pod gets 0, 1, 2, ..., 9
          env:
            - name: JOB_COMPLETION_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
2. CronJobs
A CronJob creates a Job on a repeating schedule. It is the Kubernetes equivalent of the Unix crontab -- but with additional controls for concurrency, missed schedules, and history management.
Schedule Syntax
CronJobs use standard 5-field cron syntax:
┌───────────── minute (0-59)
│ ┌───────────── hour (0-23)
│ │ ┌───────────── day of month (1-31)
│ │ │ ┌───────────── month (1-12)
│ │ │ │ ┌───────────── day of week (0-6, Sunday=0)
│ │ │ │ │
* * * * *
Common examples:
"0 2 * * *"-- Every day at 2:00 AM"*/15 * * * *"-- Every 15 minutes"0 9 * * 1"-- Every Monday at 9:00 AM"0 0 1 * *"-- First day of every month at midnight
CronJob Configuration
# cronjob-db-backup.yaml
# Back up the database every night at 2 AM
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-db-backup
  namespace: production
spec:
  schedule: "0 2 * * *"            # Every day at 2:00 AM
  timeZone: "America/New_York"     # Explicit timezone (stable in K8s 1.27+)
  concurrencyPolicy: Forbid        # Do not start a new Job if the previous one is still running
  startingDeadlineSeconds: 600     # If missed by >10 minutes, skip this run
  successfulJobsHistoryLimit: 7    # Keep the last 7 successful Job objects
  failedJobsHistoryLimit: 3        # Keep the last 3 failed Job objects
  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 3600      # Kill the Job if it runs longer than 1 hour
      ttlSecondsAfterFinished: 86400   # Clean up the Job after 24 hours
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: postgres:15
              command:
                - /bin/bash
                - -c
                - |
                  pg_dump -h $DB_HOST -U $DB_USER $DB_NAME | \
                    gzip | \
                    aws s3 cp - s3://backups/db/$(date +%Y%m%d-%H%M%S).sql.gz
              env:
                - name: DB_HOST
                  value: "postgres.production.svc"
                - name: DB_USER
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials
                      key: username
                - name: DB_NAME
                  value: "myapp"
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials
                      key: password
              resources:
                requests:
                  cpu: 250m
                  memory: 512Mi
Concurrency Policy
The concurrencyPolicy field controls what happens when it is time to create a new Job but the previous Job is still running:
- `Allow` (default): Multiple Jobs can run concurrently. Use this when Jobs are idempotent and independent.
- `Forbid`: Skip the new Job if the previous one is still running. Use this for database backups or any operation that must not overlap.
- `Replace`: Delete the currently running Job and start a new one. Use this when only the most recent run matters (e.g., cache warming).
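The three policies reduce to one small decision when a new run comes due. This is a toy model of that choice (not the controller's actual code), useful for reasoning about which policy fits a workload:

```python
def decide(policy: str, previous_still_running: bool) -> str:
    """Toy model: what the CronJob controller does when a new run is due."""
    if not previous_still_running:
        return "start new Job"
    return {
        "Allow":   "start new Job (runs alongside the old one)",
        "Forbid":  "skip this run",
        "Replace": "delete old Job, start new Job",
    }[policy]

print(decide("Forbid", previous_still_running=True))   # skip this run
print(decide("Replace", previous_still_running=True))  # delete old Job, start new Job
```

Note that when nothing is still running, all three policies behave identically; the policy only matters in the overlap case.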
Starting Deadline Seconds
startingDeadlineSeconds defines how many seconds after the scheduled time a Job can still be started. If the CronJob controller was down or the schedule was missed (e.g., the cluster was being upgraded), this field determines whether missed runs are executed.
If the number of missed schedules exceeds 100 within the startingDeadlineSeconds window, the CronJob controller logs a warning and does not create a Job. This prevents a thundering herd of Jobs after an extended controller outage.
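The two rules above can be sketched as a single gate: run only if a missed schedule falls inside the deadline window, and bail out entirely when more than 100 schedules were missed. This is a toy model of the controller's behavior, not its implementation:

```python
def should_start(now: float, missed_times: list[float],
                 starting_deadline_seconds: float) -> str:
    """Toy model of how startingDeadlineSeconds gates missed CronJob runs."""
    window_start = now - starting_deadline_seconds
    recent = [t for t in missed_times if window_start <= t <= now]
    if not recent:
        return "skip: missed by more than the deadline"
    if len(recent) > 100:
        return "skip: >100 missed schedules, warn and create nothing"
    return "start: run the most recent missed schedule"

# A run due 300s ago with a 600s deadline still starts:
print(should_start(1000.0, [700.0], 600))
# A run due 900s ago is outside the window and is skipped:
print(should_start(1000.0, [100.0], 600))
```

Omitting `startingDeadlineSeconds` means there is effectively no deadline, so a missed run is started as soon as the controller catches up, however late that is.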
3. Cleanup and History
# cronjob-report.yaml
# Generate a weekly report with strict cleanup
apiVersion: batch/v1
kind: CronJob
metadata:
  name: weekly-report
  namespace: analytics
spec:
  schedule: "0 8 * * 1"            # Every Monday at 8:00 AM
  timeZone: "UTC"
  successfulJobsHistoryLimit: 4    # Keep 4 weeks of successful Jobs
  failedJobsHistoryLimit: 2        # Keep 2 failed Jobs for debugging
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 604800  # Auto-delete after 7 days
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: analytics/report-generator:v5
              command: ["python", "generate_report.py", "--week=last"]
Common Pitfalls
- Forgetting `ttlSecondsAfterFinished`: Completed Jobs and their Pods are never cleaned up by default. Over weeks, thousands of completed Job objects accumulate in etcd, degrading API server performance. Always set `ttlSecondsAfterFinished` or configure `successfulJobsHistoryLimit`/`failedJobsHistoryLimit` on CronJobs.
- Timezone confusion in CronJobs: Before Kubernetes 1.27, CronJob schedules were interpreted in the kube-controller-manager's timezone, which varied by deployment. Always use the `timeZone` field (stable since 1.27) to make schedules explicit.
- Using `restartPolicy: Always`: Jobs do not support `restartPolicy: Always`. You must use `Never` or `OnFailure`. Kubernetes will reject the manifest at apply time if you use `Always`.
- Not setting `activeDeadlineSeconds`: Without an absolute deadline, a hung Job runs forever, consuming cluster resources indefinitely. Always set a reasonable deadline.
- Ignoring `backoffLimit` exhaustion: When all retries are exhausted, the Job is marked as failed but remains in the cluster. Without monitoring, failed Jobs go unnoticed. Alert on the `kube_job_status_failed` Prometheus metric.
- CronJob schedule drift: A schedule of `"0 */2 * * *"` means "at minute 0 of every 2nd hour," not "every 2 hours from now." If the Job takes longer than the interval, you need `concurrencyPolicy: Forbid` to prevent overlapping runs.
Best Practices
- Always set resource requests and limits: Batch Jobs can be greedy. Without limits, a poorly written Job can consume all node resources and evict other workloads.
- Use `activeDeadlineSeconds` as a safety net: Even if your Job should finish in 5 minutes, set a deadline of 30 minutes to catch hangs.
- Monitor Job success/failure: Use Prometheus metrics (`kube_job_status_succeeded`, `kube_job_status_failed`) and alert on failures. CronJob failures are silent by default.
- Prefer `restartPolicy: Never` for debugging: With `Never`, failed Pods remain so you can inspect their logs with `kubectl logs`. With `OnFailure`, logs from the failed attempt are lost when the container restarts.
- Use Pod priority for batch workloads: Set a lower PriorityClass for batch Jobs so they can be preempted by higher-priority services during resource contention.
- Idempotency is essential: Jobs can be retried and CronJobs can occasionally run twice (due to the "at least once" guarantee). Design your workloads to be idempotent so that duplicate execution is harmless.
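A common way to make a batch task idempotent is to key each unit of work and skip keys that were already recorded. A minimal sketch (here the `processed` set is in memory; in a real Job it would live in a database table or as object-store markers so it survives Pod restarts):

```python
processed: set[str] = set()   # in production: a DB table or S3 marker objects

def handle(task_id: str, payload: str, results: list[str]) -> None:
    """Process a task at most once, even if the Job is retried."""
    if task_id in processed:
        return                      # duplicate delivery -> harmless no-op
    results.append(payload.upper()) # stand-in for real side effects
    processed.add(task_id)

out: list[str] = []
handle("t1", "hello", out)
handle("t1", "hello", out)          # retried/duplicate run: skipped
print(out)                          # ['HELLO']
```

The same guard protects against both `backoffLimit` retries and the occasional duplicate CronJob run, because in both cases the second execution sees the recorded key and does nothing.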
What's Next?
- GitOps (ArgoCD): Manage your Job and CronJob definitions declaratively in Git with ArgoCD.
- Disaster Recovery: Use CronJobs to automate backup processes as part of your disaster recovery strategy.
- CRDs & Operators: For complex batch workflows, consider operator frameworks like Argo Workflows or Tekton that build on the Job primitive.