MetricSign
Start free
error-reference8 min·

Your Databricks Reconciliation Job Runs Forever Because It Has No Reason to Stop

Reconciliation workloads compare two large datasets row by row. When that comparison never converges, your cluster burns compute until someone notices — or the budget runs out.

Lees dit artikel in het Nederlands →

A reconciliation job without a timeout is an open-ended promise

A Databricks community thread titled "Lakebridge reconciliation code keeps running continuously" captures a pattern that shows up regularly in migration projects. You write reconciliation logic — usually a full outer join between a source system and a lakehouse target — schedule it as a Databricks job, and it never finishes. The cluster stays alive, the run status stays at RUNNING, and nobody gets an alert because the job hasn't technically failed.

The root issue is that Databricks jobs have no default maximum run duration. If you create a job through the UI or the Jobs API without setting timeout_seconds, the run will execute until it succeeds, fails, or is manually cancelled. For short ETL tasks this is fine. For reconciliation workloads that compare millions of rows across two systems with different schemas, different partitioning, and different data distributions, the absence of a timeout is an invitation for the job to run for hours or days.

The Jobs API accepts timeout_seconds at the task level:

``json { "task_key": "reconcile_orders", "timeout_seconds": 7200, "notebook_task": { "notebook_path": "/Repos/prod/reconciliation/orders_recon" } } ``

Set this to something reasonable — two to four times the expected duration of the reconciliation. When the timeout fires, the run terminates with a RUN_DURATION_EXCEEDED state. That state is actionable. An endlessly running job is not.

You should also configure max_concurrent_runs to 1 for reconciliation jobs. If a scheduled trigger fires while the previous run is still going, a second instance starts comparing the same data. Two concurrent reconciliation runs on the same tables will deadlock on Delta table commits or, at minimum, double your compute cost with no benefit.

Full-outer joins on skewed keys produce shuffle partitions that never finish

Reconciliation logic almost always involves a full outer join. You need to find rows present in the source but missing from the target, rows in the target not in the source, and rows where values differ. The canonical pattern looks like this:

``python df_diff = df_source.join(df_target, on="order_id", how="full_outer") .filter( col("source.amount") != col("target.amount") | col("source.order_id").isNull() | col("target.order_id").isNull() ) ``

This works on small tables. On tables with tens of millions of rows, the join triggers a shuffle. Spark's default spark.sql.shuffle.partitions is 200. If your join key has high cardinality but uneven distribution — say 40% of orders belong to three large customers — a handful of shuffle partitions will receive a disproportionate share of the data. Those partitions take orders of magnitude longer than the rest.

You'll see this in the Spark UI: 197 of 200 tasks complete in minutes, and three tasks sit at 99% for hours. The stage never finishes. The job never finishes.

Two fixes work reliably. First, increase shuffle partitions to spread the skew:

``python spark.conf.set("spark.sql.shuffle.partitions", 2000) ``

Second, and more effective, partition your reconciliation by a date column or a modulo of the join key. Instead of one massive full-outer join, run 30 reconciliation passes — one per day of data:

``python for day in date_range: df_source_day = df_source.filter(col("order_date") == day) df_target_day = df_target.filter(col("order_date") == day) reconcile(df_source_day, df_target_day) ``

Each pass fits in memory comfortably. The overall wall-clock time is often shorter because no single shuffle partition becomes a bottleneck.

Pre-launch checklist for Databricks reconciliation jobs 1 Set timeout_seconds to 3x expected duration 2 Set max_concurrent_runs to 1 3 Partition reconciliation by date or key range 4 Use storage-optimized instances (i3 / L-series) 5 Verify spark.local.dir points to NVMe mount 6 Add health rule for RUN_DURATION_SECONDS 7 Review Spark UI Stages tab after first production run
Pre-launch checklist for Databricks reconciliation jobs

Shuffle spills to disk fill the local SSD and the executor stalls silently

When a reconciliation join exceeds available executor memory, Spark spills shuffle data to the local SSD. This is expected behavior — Spark was designed to handle it. But there's a threshold where spilling becomes pathological. If spilled data exceeds the local disk capacity on the worker node, the executor enters a loop: it writes partial shuffle data, runs out of space, garbage-collects aggressively, frees a sliver of memory, writes a bit more, and repeats. The executor doesn't crash. It doesn't throw an OutOfMemoryError. It just runs at near-zero throughput indefinitely.

You can detect this from the Spark UI's Stages tab. Look for tasks where "Shuffle Spill (Disk)" is measured in tens of gigabytes while "Shuffle Spill (Memory)" is a fraction of that. A ratio above 10:1 between disk and memory spill means the executor is thrashing.

The fix depends on cluster configuration. For jobs clusters, choose instance types with NVMe storage proportional to the data volume. On AWS, i3.xlarge instances provide 950 GB of local NVMe. On Azure, Standard_L8s_v2 gives 1.9 TB. If you're running reconciliation on general-purpose instances like m5.2xlarge with only EBS storage, shuffle-heavy joins will hit the disk throughput ceiling long before they exhaust CPU or network.

Also check spark.local.dir. By default it points to /tmp, which on some Databricks runtime versions maps to a small root volume rather than the attached NVMe drives. Explicitly set it to the NVMe mount:

``python spark.conf.set("spark.local.dir", "/local_disk0") ``

This single configuration change has resolved "infinite run" symptoms in cases where the reconciliation logic itself was correct but the executor was silently thrashing on a 50 GB root volume.

The cluster auto-terminates but the job doesn't: a misleading safety net

Teams sometimes assume that cluster auto-termination will kill a runaway job. It won't — at least not the way they expect. Auto-termination applies to all-purpose (interactive) clusters that have been idle for a configurable period. Jobs clusters are different. A jobs cluster is created when the run starts and destroyed when the run ends. If the run never ends, the cluster never terminates.

This distinction matters for cost. A Standard_E8s_v3 instance on Azure costs roughly $0.50/hour. A reconciliation job running unnoticed for a week accumulates $84 per node. On a four-node cluster, that's $336 in pure waste — more if you're running on spot instances that got promoted to on-demand after a preemption.

Databricks provides two mechanisms to guard against this. The first is timeout_seconds on the task, described above. The second is the health block in the job definition, available since Jobs API 2.1:

``json { "health": { "rules": [ { "metric": "RUN_DURATION_SECONDS", "op": "GREATER_THAN", "value": 10800 } ] } } ``

When the health rule triggers, Databricks marks the run as unhealthy and can send a notification via webhook or email. Unlike timeout_seconds, the health rule doesn't kill the run — it alerts. Use both: the health rule as an early warning and the timeout as a hard stop.

You can also query run durations programmatically through the Runs API:

``bash curl -X GET "https://.azuredatabricks.net/api/2.1/jobs/runs/list" \ -H "Authorization: Bearer $TOKEN" \ -d '{"job_id": 12345, "active_only": true}' ``

If the start_time on an active run is more than a few hours old for a job that normally completes in 30 minutes, something is wrong. But polling an API requires someone to build and maintain the polling. Most teams don't.

Reconciliation tools add abstraction layers that hide Spark's feedback

Lakebridge and similar migration reconciliation tools generate Spark SQL or DataFrame code from a configuration layer. You define source and target tables, specify key columns, and the tool produces the comparison logic. This is efficient for setting up reconciliation across hundreds of tables during a migration. The tradeoff is that the generated code may not be optimized for your specific data distribution, and the tool's abstraction layer can obscure Spark's runtime behavior.

When a generated reconciliation query runs for too long, the first instinct is to look at the tool's logs. But the tool's logs typically show "reconciliation in progress" — they don't surface the shuffle spill ratios, skewed partitions, or disk throughput issues happening inside Spark. You need to go to the Spark UI directly.

Open the Databricks cluster's Spark UI, navigate to the Stages tab, and sort by duration. Find the stage that corresponds to the join operation. Click into it and examine task-level metrics. If one or two tasks have durations 100x longer than the median, you have a skew problem. If all tasks are slow and showing high GC time, you have a memory problem. If tasks are fast but there are thousands of them pending, you have a parallelism problem.

For migration projects running reconciliation across many tables, wrap each table's reconciliation in a try/except with a per-table timeout using Python's concurrent.futures:

```python from concurrent.futures import ThreadPoolExecutor, TimeoutError

def reconcile_table(table_name): # reconciliation logic here pass

with ThreadPoolExecutor(max_workers=1) as executor: future = executor.submit(reconcile_table, "orders") try: future.result(timeout=3600) except TimeoutError: print(f"Reconciliation of orders exceeded 1 hour") future.cancel() ```

This gives you table-level granularity on timeouts. MetricSign monitors Databricks job runs and flags when a run exceeds its historical duration baseline, grouping the incident with cluster-level metrics so you can see whether the cause is data skew, executor memory pressure, or a configuration gap — without building custom polling against the Runs API.

A checklist before you run reconciliation at scale

Before scheduling a reconciliation job that compares production-scale tables, walk through these configuration gates. Each one addresses a specific failure mode that causes infinite runs.

Set timeout_seconds on every reconciliation task. There is no valid reason for a reconciliation job to run indefinitely. If you don't know the expected duration, run the reconciliation manually once, measure it, and set the timeout to 3x that value.

Configure max_concurrent_runs: 1. Overlapping reconciliation runs cause Delta table commit conflicts, double the compute cost, and produce duplicate discrepancy records.

Partition the reconciliation by date or key range. A single full-outer join across 500 million rows will shuffle more data than most clusters can handle in a reasonable time. Break it into daily or weekly partitions.

Choose storage-optimized instances for the jobs cluster. Reconciliation is shuffle-heavy. Instances with local NVMe storage (i3 on AWS, L-series on Azure) handle spills without thrashing.

Verify spark.local.dir points to the NVMe mount, not /tmp. This is easy to overlook and has an outsized impact on shuffle-heavy workloads.

Add a health rule with RUN_DURATION_SECONDS as an early-warning alert. Set it to fire before the hard timeout so you have time to investigate before the run is killed.

Check the Spark UI's Stages tab after the first production run. Confirm no single task is taking disproportionately long. If it is, the skew in your join key will only get worse as the data grows.

Each of these is a five-minute configuration change. Collectively, they prevent the scenario that brings teams to community forums at 2 AM: a reconciliation job that's been running for 14 hours with no error, no output, and no indication of when — or whether — it will finish.

Frequently asked questions

Why doesn't my Databricks reconciliation job time out automatically?+
Databricks jobs have no default maximum run duration. Unless you explicitly set timeout_seconds in the task configuration via the Jobs API or UI, a job will run until it succeeds, fails with an error, or is manually cancelled. For reconciliation workloads that can stall on skewed joins or shuffle spills without throwing an exception, this means the job runs indefinitely.
How do I identify whether data skew is causing my reconciliation job to hang?+
Open the Spark UI from the Databricks cluster page, go to the Stages tab, and click into the stage running the join. Look at per-task durations. If 197 of 200 tasks completed in under a minute but three tasks have been running for hours, those tasks received a disproportionate share of data due to skewed join keys. The fix is to increase spark.sql.shuffle.partitions or partition your reconciliation into smaller date-based or key-range-based passes.
Does cluster auto-termination protect me from runaway Databricks jobs?+
No. Auto-termination only applies to all-purpose (interactive) clusters that have been idle. Jobs clusters are created when a run starts and destroyed when it ends. If the run never ends, the jobs cluster stays alive and continues accumulating cost. You need timeout_seconds on the task and optionally a health rule to guard against this.

Related integrations

Related articles