Jobs & CronJobs
- Batch Processing Primitives: Kubernetes Jobs are the fundamental resource for tasks that run to completion rather than running indefinitely like Deployments. They guarantee that a specified number of Pods successfully terminate, with configurable parallelism, retry behavior, and timeouts.
- Job Patterns: Jobs support three distinct patterns -- non-parallel (single pod), fixed completion count (run N pods to success), and work queue (parallel pods pulling from a shared queue). Each pattern suits different batch processing scenarios.
- CronJob Scheduling: CronJobs automate Job creation on a recurring schedule using standard cron syntax. Critical settings include `concurrencyPolicy` (Allow, Forbid, Replace) to control overlapping runs, and `startingDeadlineSeconds` to handle missed schedules.
- Indexed Jobs: For parallel processing where each pod needs a unique index (e.g., processing shards 0-9 of a dataset), Indexed Jobs assign each pod a unique completion index accessible via the `JOB_COMPLETION_INDEX` environment variable.
- Resource Cleanup: Completed Jobs and their Pods are not automatically deleted. Use `ttlSecondsAfterFinished` to auto-clean finished Jobs, or they will accumulate in etcd and eventually cause performance degradation.
While Deployments and StatefulSets are designed for long-running processes (web servers, APIs, databases), Kubernetes also has first-class support for batch and scheduled tasks through Jobs and CronJobs.
1. Jobs In Depth
A Job creates one or more Pods and ensures that a specified number of them successfully terminate. Unlike a Deployment, which restarts Pods indefinitely, a Job tracks successful completions and stops creating new Pods once the required number of successes is reached.
Core Fields
Every Job spec supports these essential fields:
- `spec.completions`: The number of Pods that must successfully complete for the Job to be considered successful. Defaults to 1.
- `spec.parallelism`: The maximum number of Pods that can run concurrently. Defaults to 1.
- `spec.backoffLimit`: The number of retries before the Job is marked as failed. Each retry uses exponential backoff (10s, 20s, 40s, ..., capped at 6 minutes). Defaults to 6.
- `spec.activeDeadlineSeconds`: An absolute time limit for the entire Job. Once this deadline is reached, all running Pods are terminated and the Job is marked as failed. This is your safety net against hung Jobs.
- `spec.ttlSecondsAfterFinished`: Automatically delete the Job (and its Pods) this many seconds after it finishes. Without this, completed Jobs accumulate forever.
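The exponential backoff behind `backoffLimit` can be sketched as a short calculation. This is a simplified model of the retry delays described above (base of 10s, doubling, capped at 6 minutes); the real controller also adds jitter and resets the backoff after a Pod runs cleanly for a while:

```python
# Sketch of Job retry backoff: delays double from 10s, capped at 6 minutes.
def retry_delays(backoff_limit: int, base: int = 10, cap: int = 360) -> list[int]:
    """Return the delay (in seconds) before each retry attempt."""
    return [min(base * 2**attempt, cap) for attempt in range(backoff_limit)]

# With the default backoffLimit of 6:
print(retry_delays(6))  # [10, 20, 40, 80, 160, 320]
```

With the default `backoffLimit: 6`, a persistently failing Pod burns through all retries in roughly 10 minutes, which is why pairing it with `activeDeadlineSeconds` is a sensible belt-and-suspenders setup.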
Job Pattern 1: Non-Parallel (Single Completion)
The simplest pattern: one Pod runs to completion. If it fails, it is retried up to `backoffLimit` times.
# job-db-migration.yaml
# Run a one-time database migration
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration-v42
  namespace: production
spec:
  backoffLimit: 3                # Retry up to 3 times on failure
  activeDeadlineSeconds: 600     # Fail the entire Job after 10 minutes
  ttlSecondsAfterFinished: 3600  # Clean up 1 hour after completion
  template:
    metadata:
      labels:
        app: db-migration
    spec:
      restartPolicy: Never       # Required for Jobs: Never or OnFailure
      containers:
        - name: migrate
          image: myapp/migrations:v42
          command: ["python", "manage.py", "migrate", "--no-input"]
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
`restartPolicy: Never` vs `restartPolicy: OnFailure`: With `Never`, a failed Pod is left in the Failed state and a new Pod is created for the retry. With `OnFailure`, the same Pod is restarted in place. Use `Never` when you need to inspect failed Pod logs; use `OnFailure` to avoid accumulating failed Pods.
Job Pattern 2: Fixed Completion Count
Run a specific number of Pods to success, optionally in parallel. Useful when you know exactly how many units of work need processing.
# job-render-frames.yaml
# Render 50 video frames, 10 at a time
apiVersion: batch/v1
kind: Job
metadata:
  name: render-video-frames
  namespace: batch
spec:
  completions: 50                # 50 Pods must succeed
  parallelism: 10                # Run 10 at a time
  backoffLimit: 10               # Allow up to 10 total failures
  ttlSecondsAfterFinished: 7200  # Clean up after 2 hours
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: renderer
          image: studio/renderer:v3
          command: ["render", "--frame-from-queue"]
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
Job Pattern 3: Work Queue
Set `parallelism` and leave `completions` unset (or set it to null). Pods pull work from an external queue (RabbitMQ, Redis, SQS). The Job completes successfully once at least one Pod exits with success and all Pods have terminated -- a successful exit signals that the queue is empty.
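The worker logic for this pattern can be sketched in Python. Here a local `queue.Queue` stands in for RabbitMQ/Redis/SQS, and `run_worker` is a hypothetical name; the essential behavior is that each worker exits with success (code 0) once the queue reports empty:

```python
import queue

def run_worker(work_queue: "queue.Queue[str]", results: list[str]) -> int:
    """Drain tasks until the queue is empty, then exit 0 (success).

    In a real Job, the queue would be an external broker, and a successful
    exit tells the Job controller this worker found no more work.
    """
    while True:
        try:
            task = work_queue.get_nowait()
        except queue.Empty:
            return 0                    # queue drained -> Pod exits successfully
        results.append(task.upper())    # stand-in for real processing

q: "queue.Queue[str]" = queue.Queue()
for t in ["a", "b", "c"]:
    q.put(t)
done: list[str] = []
exit_code = run_worker(q, done)
print(exit_code, done)  # 0 ['A', 'B', 'C']
```

Because every worker runs the same loop, `parallelism: 5` simply means five copies of this loop race to drain the queue; the queue itself provides the work distribution.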
# job-work-queue.yaml
# Process messages from a queue with 5 parallel workers
apiVersion: batch/v1
kind: Job
metadata:
  name: process-uploads
  namespace: batch
spec:
  parallelism: 5                 # 5 workers pulling from the queue
  # No completions field -- workers exit when the queue is empty
  backoffLimit: 4
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: myapp/queue-worker:v2
          env:
            - name: QUEUE_URL
              value: "amqp://rabbitmq.default.svc:5672/uploads"
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
Indexed Jobs for Parallel Processing
Indexed Jobs (stable since Kubernetes 1.24) assign each Pod a unique index via the JOB_COMPLETION_INDEX environment variable. This is useful when each Pod should process a specific shard or partition of data.
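How each Pod turns its completion index into a concrete slice of work is up to your application. A common sketch (the `my_partition` helper is hypothetical, and this assumes the input divides into contiguous, roughly equal chunks) looks like this:

```python
import os

def my_partition(items: list[str], index: int, total: int) -> list[str]:
    """Return the contiguous slice of `items` owned by pod `index` of `total`."""
    per_pod = -(-len(items) // total)   # ceiling division
    return items[index * per_pod:(index + 1) * per_pod]

# Inside the Pod, the index comes from the JOB_COMPLETION_INDEX env var
# (0 through completions-1); default to 0 for local testing.
index = int(os.environ.get("JOB_COMPLETION_INDEX", "0"))
files = [f"part-{i:02d}.csv" for i in range(10)]
print(my_partition(files, index, total=10))
```

With `completions: 10` and ten input partitions, each pod owns exactly one file; if a pod fails and is retried, its replacement receives the same index, so the shard assignment is stable across retries.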
# job-indexed-processing.yaml
# Process 10 data partitions, each handled by a uniquely indexed Pod
apiVersion: batch/v1
kind: Job
metadata:
  name: partition-processor
  namespace: batch
spec:
  completions: 10                # 10 partitions to process
  parallelism: 5                 # Process 5 at a time
  completionMode: Indexed        # Assign a unique index to each Pod
  backoffLimit: 5
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: processor
          image: myapp/processor:v1
          command:
            - python
            - process_partition.py
            - --partition=$(JOB_COMPLETION_INDEX)  # Each pod gets 0, 1, 2, ..., 9
          env:
            - name: JOB_COMPLETION_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
2. CronJobs
A CronJob creates a Job on a repeating schedule. It is the Kubernetes equivalent of the Unix crontab -- but with additional controls for concurrency, missed schedules, and history management.
Schedule Syntax
CronJobs use standard 5-field cron syntax:
┌───────────── minute (0-59)
│ ┌───────────── hour (0-23)
│ │ ┌───────────── day of month (1-31)
│ │ │ ┌───────────── month (1-12)
│ │ │ │ ┌───────────── day of week (0-6, Sunday=0)
│ │ │ │ │
* * * * *
Common examples:
"0 2 * * *"-- Every day at 2:00 AM"*/15 * * * *"-- Every 15 minutes"0 9 * * 1"-- Every Monday at 9:00 AM"0 0 1 * *"-- First day of every month at midnight
CronJob Configuration
# cronjob-db-backup.yaml
# Back up the database every night at 2 AM
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-db-backup
  namespace: production
spec:
  schedule: "0 2 * * *"            # Every day at 2:00 AM
  timeZone: "America/New_York"     # Explicit timezone (stable in K8s 1.27+)
  concurrencyPolicy: Forbid        # Do not start a new Job if the previous one is still running
  startingDeadlineSeconds: 600     # If missed by >10 minutes, skip this run
  successfulJobsHistoryLimit: 7    # Keep the last 7 successful Job objects
  failedJobsHistoryLimit: 3        # Keep the last 3 failed Job objects
  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 3600      # Kill the Job if it runs longer than 1 hour
      ttlSecondsAfterFinished: 86400   # Clean up the Job after 24 hours
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: postgres:15
              command:
                - /bin/bash
                - -c
                - |
                  pg_dump -h $DB_HOST -U $DB_USER $DB_NAME | \
                    gzip | \
                    aws s3 cp - s3://backups/db/$(date +%Y%m%d-%H%M%S).sql.gz
              env:
                - name: DB_HOST
                  value: "postgres.production.svc"
                - name: DB_USER
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials
                      key: username
                - name: DB_NAME
                  value: "myapp"
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials
                      key: password
              resources:
                requests:
                  cpu: 250m
                  memory: 512Mi
Concurrency Policy
The concurrencyPolicy field controls what happens when it is time to create a new Job but the previous Job is still running:
- `Allow` (default): Multiple Jobs can run concurrently. Use this when Jobs are idempotent and independent.
- `Forbid`: Skip the new Job if the previous one is still running. Use this for database backups or any operation that must not overlap.
- `Replace`: Delete the currently running Job and start a new one. Use this when only the most recent run matters (e.g., cache warming).
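The three policies reduce to one small decision when a new run comes due. This is a toy model of that choice (not the controller's actual code), useful for reasoning about which policy fits a workload:

```python
def decide(policy: str, previous_still_running: bool) -> str:
    """Toy model: what the CronJob controller does when a new run is due."""
    if not previous_still_running:
        return "start new Job"
    return {
        "Allow":   "start new Job (runs alongside the old one)",
        "Forbid":  "skip this run",
        "Replace": "delete old Job, start new Job",
    }[policy]

print(decide("Forbid", previous_still_running=True))   # skip this run
print(decide("Replace", previous_still_running=True))  # delete old Job, start new Job
```

Note that when nothing is still running, all three policies behave identically; the policy only matters in the overlap case.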
Starting Deadline Seconds
startingDeadlineSeconds defines how many seconds after the scheduled time a Job can still be started. If the CronJob controller was down or the schedule was missed (e.g., the cluster was being upgraded), this field determines whether missed runs are executed.
If the number of missed schedules exceeds 100 within the startingDeadlineSeconds window, the CronJob controller logs a warning and does not create a Job. This prevents a thundering herd of Jobs after an extended controller outage.
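The two rules above can be sketched as a single gate: run only if a missed schedule falls inside the deadline window, and bail out entirely when more than 100 schedules were missed. This is a toy model of the controller's behavior, not its implementation:

```python
def should_start(now: float, missed_times: list[float],
                 starting_deadline_seconds: float) -> str:
    """Toy model of how startingDeadlineSeconds gates missed CronJob runs."""
    window_start = now - starting_deadline_seconds
    recent = [t for t in missed_times if window_start <= t <= now]
    if not recent:
        return "skip: missed by more than the deadline"
    if len(recent) > 100:
        return "skip: >100 missed schedules, warn and create nothing"
    return "start: run the most recent missed schedule"

# A run due 300s ago with a 600s deadline still starts:
print(should_start(1000.0, [700.0], 600))
# A run due 900s ago is outside the window and is skipped:
print(should_start(1000.0, [100.0], 600))
```

Omitting `startingDeadlineSeconds` means there is effectively no deadline, so a missed run is started as soon as the controller catches up, however late that is.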
3. Cleanup and History
# cronjob-report.yaml
# Generate a weekly report with strict cleanup
apiVersion: batch/v1
kind: CronJob
metadata:
  name: weekly-report
  namespace: analytics
spec:
  schedule: "0 8 * * 1"            # Every Monday at 8:00 AM
  timeZone: "UTC"
  successfulJobsHistoryLimit: 4    # Keep 4 weeks of successful Jobs
  failedJobsHistoryLimit: 2        # Keep 2 failed Jobs for debugging
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 604800  # Auto-delete after 7 days
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: analytics/report-generator:v5
              command: ["python", "generate_report.py", "--week=last"]
Common Pitfalls
- Forgetting `ttlSecondsAfterFinished`: Completed Jobs and their Pods are never cleaned up by default. Over weeks, thousands of completed Job objects accumulate in etcd, degrading API server performance. Always set `ttlSecondsAfterFinished` or configure `successfulJobsHistoryLimit`/`failedJobsHistoryLimit` on CronJobs.
- Timezone confusion in CronJobs: Before Kubernetes 1.27, CronJob schedules were interpreted in the kube-controller-manager's timezone, which varied by deployment. Always use the `timeZone` field (stable since 1.27) to make schedules explicit.
- Using `restartPolicy: Always`: Jobs do not support `restartPolicy: Always`. You must use `Never` or `OnFailure`. Kubernetes will reject the manifest at apply time if you use `Always`.
- Not setting `activeDeadlineSeconds`: Without an absolute deadline, a hung Job runs forever, consuming cluster resources indefinitely. Always set a reasonable deadline.
- Ignoring `backoffLimit` exhaustion: When all retries are exhausted, the Job is marked as failed but remains in the cluster. Without monitoring, failed Jobs go unnoticed. Alert on the `kube_job_status_failed` Prometheus metric.
- CronJob schedule drift: A schedule of `"0 */2 * * *"` means "at minute 0 of every 2nd hour," not "every 2 hours from now." If the Job takes longer than the interval, you need `concurrencyPolicy: Forbid` to prevent overlapping runs.
Best Practices
- Always set resource requests and limits: Batch Jobs can be greedy. Without limits, a poorly written Job can consume all node resources and evict other workloads.
- Use `activeDeadlineSeconds` as a safety net: Even if your Job should finish in 5 minutes, set a deadline of 30 minutes to catch hangs.
- Monitor Job success/failure: Use Prometheus metrics (`kube_job_status_succeeded`, `kube_job_status_failed`) and alert on failures. CronJob failures are silent by default.
- Prefer `restartPolicy: Never` for debugging: With `Never`, failed Pods remain so you can inspect their logs with `kubectl logs`. With `OnFailure`, logs from the failed attempt are lost when the container restarts.
- Use Pod priority for batch workloads: Set a lower PriorityClass for batch Jobs so they can be preempted by higher-priority services during resource contention.
- Idempotency is essential: Jobs can be retried and CronJobs can occasionally run twice (due to the "at least once" guarantee). Design your workloads to be idempotent so that duplicate execution is harmless.
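A common way to make a batch task idempotent is to key each unit of work and skip keys that were already recorded. A minimal sketch (here the `processed` set is in memory; in a real Job it would live in a database table or as object-store markers so it survives Pod restarts):

```python
processed: set[str] = set()   # in production: a DB table or S3 marker objects

def handle(task_id: str, payload: str, results: list[str]) -> None:
    """Process a task at most once, even if the Job is retried."""
    if task_id in processed:
        return                      # duplicate delivery -> harmless no-op
    results.append(payload.upper()) # stand-in for real side effects
    processed.add(task_id)

out: list[str] = []
handle("t1", "hello", out)
handle("t1", "hello", out)          # retried/duplicate run: skipped
print(out)                          # ['HELLO']
```

The same guard protects against both `backoffLimit` retries and the occasional duplicate CronJob run, because in both cases the second execution sees the recorded key and does nothing.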
What's Next?
- GitOps (ArgoCD): Manage your Job and CronJob definitions declaratively in Git with ArgoCD.
- Disaster Recovery: Use CronJobs to automate backup processes as part of your disaster recovery strategy.
- CRDs & Operators: For complex batch workflows, consider operator frameworks like Argo Workflows or Tekton that build on the Job primitive.