The four failure patterns that account for most Databricks job failures
Databricks job failures cluster around a small number of root causes: out-of-memory errors on the driver or executors, data skew pushing one partition far beyond an executor's memory limit, shuffle failures when spilled data cannot be read back after a node failure, and driver heartbeat timeouts caused by garbage collection pauses long enough to break cluster communication.
These patterns are related. Data skew causes OOM on the overloaded executor. Large shuffles spill to disk and then fail when a node restarts. GC pauses are caused by heap pressure — which comes from collecting large datasets to the driver or from accumulating shuffle state.
Identifying which pattern you are dealing with changes the fix entirely. OOM on the driver is solved differently from OOM on an executor. Shuffle failures require increasing shuffle partitions, not driver memory. Getting to the right diagnosis quickly — without rerunning the entire job — is what determines whether a production failure is a thirty-minute incident or a three-hour one.
Out of memory on the driver: causes and fixes
DRIVER_NOT_RESPONDING is the most visible symptom of driver OOM. The Spark driver crashes because its heap fills up — from a collect() that pulled too much data from executors, from a large broadcast join that the driver staged in memory, or from accumulated object references during a long-running job.
The first fix is increasing driver memory in the job cluster configuration: spark.driver.memory from the default (often 4g) to 8g or 16g depending on the workload. This is done in the job compute settings, not in notebook code, because driver memory must be set before the SparkSession starts.
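In the job cluster's Spark config (under the compute's advanced options), that is a single key-value line; the 16g here is illustrative, not a recommendation:

spark.driver.memory 16g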
The second fix is architectural. If the driver is OOMing because of collect() calls on large DataFrames, the pattern needs to change — write results to Delta instead of collecting to the driver. If large broadcasts are the cause, set spark.sql.autoBroadcastJoinThreshold to -1 to disable automatic broadcast joins and let Spark use a sort-merge join instead.
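A minimal sketch of both changes; the results DataFrame and the table name are placeholders:

# Persist results instead of pulling them to the driver with collect()
results.write.format("delta").mode("overwrite").saveAsTable("analytics.job_results")

# Disable automatic broadcast joins so Spark falls back to sort-merge joins
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)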
For executor OOM, the symptom is ExecutorLostFailure in the task logs. Increasing executor memory or reducing the number of cores per executor (so each executor handles fewer partitions at once) resolves most cases.
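On Databricks the worker node type largely determines executor memory, so the practical change is often a larger, memory-optimized worker type; the knobs the paragraph refers to look like this in the cluster Spark config (values illustrative):

spark.executor.memory 16g
spark.executor.cores 2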
Data skew: why one task is 100x slower than the rest
Data skew happens when records are unevenly distributed across partitions after a shuffle. One partition ends up with ten million rows while the others have fifty thousand. The executor handling the large partition runs for twenty minutes while the others finish in thirty seconds. The job waits on the slow executor, eventually times out, or the executor runs out of memory under the load.
The diagnostic signal is visible in the Spark UI: look at the task duration distribution for the failing stage. If one task is significantly longer than all others, skew is the cause.
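To confirm which key is hot before reaching for a fix, a quick row count per join key is enough; the column name key is a placeholder for your actual join column:

from pyspark.sql.functions import desc

# One key with orders of magnitude more rows than the rest confirms skew
df.groupBy("key").count().orderBy(desc("count")).show(10)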
The fix is salting: add a random integer to the join key to artificially distribute records across more partitions.
from pyspark.sql.functions import rand, explode, array, lit

# Add a random salt (0-9) to each row on the large, skewed side
df_salted = df.withColumn("salt", (rand() * 10).cast("int"))

# Replicate the other side once per salt value so every (key, salt) pair can still match
other_salted = other_df.withColumn("salt", explode(array([lit(i) for i in range(10)])))

df_joined = df_salted.join(other_salted, ["key", "salt"])
The salt range (10 in this example) determines how many sub-partitions each key value is split into; the other side of the join must carry every salt value (the explode above) so matching rows still find each other. Larger salt ranges spread skew more aggressively but replicate the other side further and increase the total number of shuffle partitions. After salting, the slow partition splits into ten roughly equal partitions, and the job completes proportionally faster.
Shuffle errors and spark.sql.shuffle.partitions
Shuffle-intensive operations — joins, aggregations, group-by on large datasets — require Spark to redistribute data across executors. The intermediate shuffle files are written to disk on each executor. If an executor restarts mid-shuffle, the files on that node become unreadable to executors trying to fetch them, and the stage fails with a shuffle fetch exception.
The configuration fix for shuffle memory pressure is increasing spark.sql.shuffle.partitions beyond the default of 200. For datasets in the hundreds of gigabytes, values of 400 to 1000 reduce the per-partition size and decrease the memory required per executor during the shuffle phase.
spark.conf.set("spark.sql.shuffle.partitions", 800)
This is a tradeoff: more partitions mean more tasks, which adds scheduling overhead. But for large datasets the reduced per-partition memory pressure is almost always worth the overhead.
For jobs that repeatedly suffer from unstable nodes, enabling speculation (spark.speculation=true) lets Spark launch duplicate copies of slow tasks and use whichever completes first, reducing exposure to a single struggling node.
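Speculation is a scheduler setting, so it belongs in the cluster's Spark config rather than in notebook code; a one-line sketch:

spark.speculation true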
Driver heartbeat timeout: GC pauses and how to detect them
The driver maintains a heartbeat with the cluster manager on a fixed interval (default: 60 seconds). If the driver's JVM is paused for a full GC cycle longer than this interval, the cluster manager considers the driver dead and marks the job as failed with DRIVER_NOT_RESPONDING or similar timeout errors.
GC pause length is proportional to heap occupancy. The more objects in the driver heap, the longer full GC takes. The root cause is always heap pressure — either from a workload that should have been distributed to executors, or from a memory leak in user code (accumulated broadcast variables, unclosed accumulators, growing object graphs in loops).
Detecting GC pressure: in the Spark UI, open the Executors tab (the driver appears as its own row) and look at GC Time as a percentage of Task Time. More than 10% of time spent in GC is a signal of memory pressure; above 20% is the range where GC pauses can start triggering heartbeat timeouts.
Fixes: increase spark.driver.memoryOverhead (off-heap memory allocated alongside the JVM heap), switch to G1GC if the cluster is using the default parallel collector, and avoid patterns that accumulate objects in the driver, particularly collect() calls on results that will only be iterated rather than written.
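A sketch of the corresponding cluster-level Spark config lines (values illustrative; memoryOverhead is an absolute size, not a fraction):

spark.driver.memoryOverhead 2g
spark.driver.extraJavaOptions -XX:+UseG1GC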
Repair Run: the tool most engineers don't use
When a Databricks job fails partway through, most engineers restart the entire job. This is usually wrong — it wastes time and compute budget re-running tasks that already succeeded.
Repair Run re-runs only the failed and skipped tasks in a previous job run, starting from the point of failure. Successful tasks are not re-executed. In a job with ten tasks where task seven failed, Repair Run reruns tasks seven through ten and reuses the output of tasks one through six.
Access it from the Lakeflow Jobs UI: find the failed run, click the run ID, and select Repair run. Alternatively, use the Jobs REST API:
POST /api/2.1/jobs/runs/repair
{"run_id": <run_id of the failed run>}
Repair Run is especially valuable for jobs with expensive upstream tasks — large data ingestion, model training, or heavy Delta merge operations. Re-running those from scratch when only a downstream task failed is expensive and unnecessary.
For transient failures — network flaps, Azure Spot instance preemption — configure retry policies at the task level (Maximum retries: 1-2, Retry on timeout: true) so the job handles them automatically without manual intervention.
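In a Jobs API task definition, those UI settings map to fields like the following; the task_key is a made-up example:

{
  "task_key": "transform_orders",
  "max_retries": 2,
  "min_retry_interval_millis": 60000,
  "retry_on_timeout": true
}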
Production job clusters versus all-purpose clusters
All-purpose clusters are optimized for interactive use. They start once and stay running. They are not auto-scaled for batch workloads and they accumulate JVM state between runs. Using an all-purpose cluster for a production scheduled job means every run inherits the memory state from previous interactive sessions — a reliable source of hard-to-reproduce OOM failures.
Job clusters are the correct choice for production jobs. They start fresh for each run, use only the resources the job needs, shut down when the run completes, and isolate runs from each other. They are also cheaper: you pay only for the duration of the job run, not for idle time between runs.
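As a sketch, a per-run job cluster in a Jobs API definition looks roughly like this; the runtime version, node type, and sizes are placeholders to adapt to your workspace and cloud:

"job_clusters": [{
  "job_cluster_key": "nightly_etl",
  "new_cluster": {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "Standard_E8ds_v5",
    "num_workers": 4,
    "spark_conf": {"spark.driver.memory": "16g"}
  }
}]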
For monitoring from the system tables, the query:

SELECT job_id, run_id, result_state, start_time, end_time, state_message
FROM system.lakeflow.job_run_timeline
WHERE result_state = 'FAILED'
  AND start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER BY start_time DESC
provides a seven-day failure history directly from Databricks, including error messages per run.
MetricSign: job_failed and job_slow incidents
MetricSign monitors Databricks jobs via the Jobs REST API and generates two types of incidents: job_failed when a run ends in a failed state (including the exact state_message from the run), and job_slow when a run significantly exceeds its historical P50 duration — detected using MAD-based anomaly detection.
The job_slow signal is the earlier of the two. A job that normally completes in twelve minutes but is now taking forty-five minutes may not yet have failed — but it is likely heading there. The cause is often growing data volume that is pushing memory usage toward a tipping point, or a data skew pattern that got worse as the source table grew.
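For intuition only, here is a minimal sketch of how a MAD-based check could flag that forty-five-minute run against a history of roughly twelve-minute runs; it is not MetricSign's implementation, and the 3.5 cutoff is a common convention for modified z-scores, not a product setting:

import statistics

def run_is_slow(history_minutes, latest_minutes, cutoff=3.5):
    # Median and median absolute deviation (MAD) of past run durations
    median = statistics.median(history_minutes)
    mad = statistics.median(abs(d - median) for d in history_minutes)
    if mad == 0:
        # No historical variation at all; fall back to a simple ratio check
        return latest_minutes > 2 * median
    # Modified z-score: how many scaled MADs the latest run sits above the median
    return 0.6745 * (latest_minutes - median) / mad > cutoff

history = [11.5, 12.0, 12.4, 11.8, 12.2, 12.1, 11.9]
print(run_is_slow(history, 45.0))   # True: far outside normal variation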
Having both signals means you can act before the outage rather than after it: investigate the slow job, find the skew or memory pressure, and fix it before the next run produces an OOM failure.