A reconciliation job without a timeout is an open-ended promise
A Databricks community thread titled "Lakebridge reconciliation code keeps running continuously" captures a pattern that shows up regularly in migration projects. You write reconciliation logic — usually a full outer join between a source system and a lakehouse target — schedule it as a Databricks job, and it never finishes. The cluster stays alive, the run status stays at RUNNING, and nobody gets an alert because the job hasn't technically failed.
The root issue is that Databricks jobs have no default maximum run duration. If you create a job through the UI or the Jobs API without setting timeout_seconds, the run will execute until it succeeds, fails, or is manually cancelled. For short ETL tasks this is fine. For reconciliation workloads that compare millions of rows across two systems with different schemas, different partitioning, and different data distributions, the absence of a timeout is an invitation for the job to run for hours or days.
The Jobs API accepts timeout_seconds at the task level:
```json
{
  "task_key": "reconcile_orders",
  "timeout_seconds": 7200,
  "notebook_task": {
    "notebook_path": "/Repos/prod/reconciliation/orders_recon"
  }
}
```
Set this to something reasonable: two to four times the expected duration of the reconciliation. When the timeout fires, Databricks cancels the run and records a TIMEDOUT result state. That state is actionable. An endlessly running job is not.
You should also keep max_concurrent_runs at 1 for reconciliation jobs. That is the default, but it is worth verifying: if the limit has been raised and a scheduled trigger fires while the previous run is still going, a second instance starts comparing the same data. Two concurrent reconciliation runs that write to the same Delta tables can fail with commit conflicts or, at minimum, double your compute cost with no benefit.
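As a rough illustration of the "two to four times the expected duration" rule, the sketch below derives a timeout from a few measured runs. The function name, the sample durations, and the choice to use the slowest observed run as the baseline are assumptions for illustration, not part of any Databricks API.

```python
def suggest_timeout_seconds(observed_seconds, multiplier=3):
    """Return a timeout of `multiplier` times the slowest observed run."""
    if not observed_seconds:
        raise ValueError("need at least one measured run")
    if not 2 <= multiplier <= 4:
        raise ValueError("multiplier should stay in the 2x-4x range")
    return int(max(observed_seconds) * multiplier)

# Hypothetical durations (seconds) from three manual runs.
timeout = suggest_timeout_seconds([2100, 2400, 2250])
print(timeout)  # 7200
```

Basing the value on the slowest run rather than the average leaves headroom for normal variance while still bounding a runaway job.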
Full-outer joins on skewed keys produce shuffle partitions that never finish
Reconciliation logic almost always involves a full outer join. You need to find rows present in the source but missing from the target, rows in the target not in the source, and rows where values differ. The canonical pattern looks like this:
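```python
from pyspark.sql.functions import col

# Alias both sides so each side's columns stay addressable after the join.
src = df_source.alias("source")
tgt = df_target.alias("target")
df_diff = (
    src.join(tgt, col("source.order_id") == col("target.order_id"), "full_outer")
    .filter(
        col("source.order_id").isNull()      # row exists only in the target
        | col("target.order_id").isNull()    # row exists only in the source
        | ~col("source.amount").eqNullSafe(col("target.amount"))  # values differ
    )
)
```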
This works on small tables. On tables with tens of millions of rows, the join triggers a shuffle. Spark's default spark.sql.shuffle.partitions is 200. If the join key is unevenly distributed (say the comparison is keyed on customer_id and 40% of orders belong to three large customers), a handful of shuffle partitions will receive a disproportionate share of the data. Those partitions take orders of magnitude longer than the rest.
You'll see this in the Spark UI: 197 of 200 tasks complete in minutes, and three tasks sit at 99% for hours. The stage never finishes. The job never finishes.
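To make the effect concrete, here is a toy simulation in plain Python. It is not Spark's partitioner (Spark hashes keys differently); it uses a simple modulo on made-up integer keys, with one hypothetical hot key holding 40% of the rows, to show how much of the data ends up in the fullest partition.

```python
from collections import Counter

def largest_partition_share(keys, num_partitions):
    """Fraction of all rows that land in the fullest partition."""
    sizes = Counter(key % num_partitions for key in keys)
    return max(sizes.values()) / len(keys)

# Hypothetical data: 100,000 rows, 40% of which share one hot key (7).
hot = [7] * 40_000
rest = list(range(1_000, 61_000))  # 60,000 distinct keys, one row each
keys = hot + rest

share_200 = largest_partition_share(keys, 200)
share_2000 = largest_partition_share(keys, 2000)
print(f"{share_200:.3f} {share_2000:.3f}")
```

In this toy model the fullest partition never drops below the hot key's 40% share, however many partitions there are, because every row with the same key maps to the same partition.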
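Two fixes work in practice. First, increase shuffle partitions to spread the load. This separates heavy keys that happened to share a partition, though every row for a single key still lands in one partition:
```python
spark.conf.set("spark.sql.shuffle.partitions", 2000)
```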
Second, and more effective, partition your reconciliation by a date column or a modulo of the join key. Instead of one massive full-outer join, run 30 reconciliation passes — one per day of data:
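```python
# date_range: an iterable of the order_date values to reconcile (e.g. 30 dates)
# reconcile: your comparison routine for one slice of source and target
for day in date_range:
    df_source_day = df_source.filter(col("order_date") == day)
    df_target_day = df_target.filter(col("order_date") == day)
    reconcile(df_source_day, df_target_day)
```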
Each pass fits in memory comfortably. The overall wall-clock time is often shorter because no single shuffle partition becomes a bottleneck.
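The loop above iterates over date_range without showing where it comes from. One possible way to build it, assuming daily granularity and an inclusive start and end date (the helper name and the sample dates are hypothetical):

```python
from datetime import date, timedelta

def daily_range(start, end):
    """Return every date from start to end, inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]

date_range = daily_range(date(2024, 6, 1), date(2024, 6, 30))
print(len(date_range), date_range[0], date_range[-1])  # 30 2024-06-01 2024-06-30
```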
Shuffle spills to disk fill the local SSD and the executor stalls silently
When a reconciliation join exceeds available executor memory, Spark spills shuffle data to the local SSD. This is expected behavior — Spark was designed to handle it. But there's a threshold where spilling becomes pathological. If spilled data exceeds the local disk capacity on the worker node, the executor enters a loop: it writes partial shuffle data, runs out of space, garbage-collects aggressively, frees a sliver of memory, writes a bit more, and repeats. The executor doesn't crash. It doesn't throw an OutOfMemoryError. It just runs at near-zero throughput indefinitely.
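You can detect this from the Spark UI's Stages tab. Look for tasks where "Shuffle Spill (Disk)" is measured in tens of gigabytes. "Shuffle Spill (Memory)" reports the in-memory, deserialized size of the same data, so it is normally the larger of the two numbers, and the ratio between them mostly reflects serialization and compression. The signal that matters is the disk spill volume per executor approaching the capacity of the worker's local disk.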
The fix depends on cluster configuration. For jobs clusters, choose instance types with NVMe storage proportional to the data volume. On AWS, i3.xlarge instances provide 950 GB of local NVMe. On Azure, Standard_L8s_v2 gives 1.9 TB. If you're running reconciliation on general-purpose instances like m5.2xlarge with only EBS storage, shuffle-heavy joins will hit the disk throughput ceiling long before they exhaust CPU or network.
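Also check spark.local.dir. Spark's own default is /tmp, which on some Databricks runtime versions maps to a small root volume rather than the attached NVMe drives. The setting is read when the driver and executors start, so changing it with spark.conf.set from a running notebook has no effect. Set it in the cluster's Spark config so that it points to the NVMe mount:
```
spark.local.dir /local_disk0
```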
This single configuration change has resolved "infinite run" symptoms in cases where the reconciliation logic itself was correct but the executor was silently thrashing on a 50 GB root volume.
The cluster auto-terminates but the job doesn't: a misleading safety net
Teams sometimes assume that cluster auto-termination will kill a runaway job. It won't — at least not the way they expect. Auto-termination applies to all-purpose (interactive) clusters that have been idle for a configurable period. Jobs clusters are different. A jobs cluster is created when the run starts and destroyed when the run ends. If the run never ends, the cluster never terminates.
This distinction matters for cost. A Standard_E8s_v3 instance on Azure costs roughly $0.50/hour. A reconciliation job running unnoticed for a week accumulates $84 per node. On a four-node cluster, that's $336 in pure waste — more if you're running on spot instances that got promoted to on-demand after a preemption.
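The arithmetic behind those figures, as a small sketch. The function name is made up, and the calculation covers only the VM rate quoted above; Databricks DBU charges, which the text does not quantify, would come on top.

```python
HOURS_PER_WEEK = 24 * 7

def runaway_cost(hourly_rate, nodes, hours=HOURS_PER_WEEK):
    """VM cost of a run left alone for `hours`, in the rate's currency."""
    return hourly_rate * nodes * hours

per_node = runaway_cost(0.50, nodes=1)
cluster = runaway_cost(0.50, nodes=4)
print(per_node, cluster)  # 84.0 336.0
```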
Databricks provides two mechanisms to guard against this. The first is timeout_seconds on the task, described above. The second is the health block in the job definition, available since Jobs API 2.1:
```json
{
  "health": {
    "rules": [
      {
        "metric": "RUN_DURATION_SECONDS",
        "op": "GREATER_THAN",
        "value": 10800
      }
    ]
  }
}
```
When the health rule triggers, Databricks marks the run as unhealthy and can send a notification via webhook or email. Unlike timeout_seconds, the health rule doesn't kill the run; it only alerts. Use both: the health rule as an early warning and the timeout as a hard stop, which means the rule's threshold must be lower than the timeout that applies to the same run.
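One way to keep the two thresholds consistent is to derive both from the same expected duration. The sketch below builds a job-settings fragment in Python; the helper name and the 2x/3x multipliers are assumptions, while the field names are the ones shown in the JSON examples above.

```python
import json

def duration_guards(expected_seconds, warn_multiplier=2, kill_multiplier=3):
    """Job-settings fragment with a warning threshold below the hard timeout."""
    if warn_multiplier >= kill_multiplier:
        raise ValueError("the warning must fire before the timeout")
    return {
        "timeout_seconds": expected_seconds * kill_multiplier,
        "health": {
            "rules": [
                {
                    "metric": "RUN_DURATION_SECONDS",
                    "op": "GREATER_THAN",
                    "value": expected_seconds * warn_multiplier,
                }
            ]
        },
    }

guards = duration_guards(expected_seconds=3600)
print(json.dumps(guards, indent=2))
```

For a job expected to take one hour, this yields a warning at two hours and a hard stop at three.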
You can also query run durations programmatically through the Runs API:
```bash
curl -X GET "https://
```
If the start_time on an active run is more than a few hours old for a job that normally completes in 30 minutes, something is wrong. But polling an API requires someone to build and maintain the polling. Most teams don't.
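The check itself is small once the run records are in hand. The sketch below works on an already-fetched list of active runs and makes no HTTP call; it assumes each record carries a run_id and a start_time in epoch milliseconds, and the sample payload is invented.

```python
def find_stale_runs(runs, now_ms, max_age_seconds):
    """Return (run_id, age_seconds) for active runs older than max_age_seconds."""
    stale = []
    for run in runs:
        age_seconds = (now_ms - run["start_time"]) / 1000
        if age_seconds > max_age_seconds:
            stale.append((run["run_id"], int(age_seconds)))
    return stale

# Hypothetical payload shaped like a list of active runs.
now_ms = 1_700_000_000_000
active_runs = [
    {"run_id": 101, "start_time": now_ms - 20 * 60 * 1000},       # 20 minutes old
    {"run_id": 102, "start_time": now_ms - 14 * 60 * 60 * 1000},  # 14 hours old
]
stale = find_stale_runs(active_runs, now_ms, max_age_seconds=3 * 3600)
print(stale)  # [(102, 50400)]
```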
Reconciliation tools add abstraction layers that hide Spark's feedback
Lakebridge and similar migration reconciliation tools generate Spark SQL or DataFrame code from a configuration layer. You define source and target tables, specify key columns, and the tool produces the comparison logic. This is efficient for setting up reconciliation across hundreds of tables during a migration. The tradeoff is that the generated code may not be optimized for your specific data distribution, and the tool's abstraction layer can obscure Spark's runtime behavior.
When a generated reconciliation query runs for too long, the first instinct is to look at the tool's logs. But the tool's logs typically show "reconciliation in progress" — they don't surface the shuffle spill ratios, skewed partitions, or disk throughput issues happening inside Spark. You need to go to the Spark UI directly.
Open the Databricks cluster's Spark UI, navigate to the Stages tab, and sort by duration. Find the stage that corresponds to the join operation. Click into it and examine task-level metrics. If one or two tasks have durations 100x longer than the median, you have a skew problem. If all tasks are slow and showing high GC time, you have a memory problem. If tasks are fast but there are thousands of them pending, you have a parallelism problem.
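The three-way triage above can be written down as a small decision function. This is only a sketch of the reasoning: the 100x skew ratio comes from the text, while the GC and pending-task thresholds, the function name, and the input format are assumptions you would tune for your own workload.

```python
from statistics import median

def diagnose_stage(task_seconds, gc_fraction, pending_tasks,
                   skew_ratio=100, gc_limit=0.3, pending_limit=1000):
    """Map task-level symptoms to the three cases described in the text."""
    typical = median(task_seconds)
    if typical > 0 and max(task_seconds) / typical >= skew_ratio:
        return "skew"
    if gc_fraction >= gc_limit:
        return "memory"
    if pending_tasks >= pending_limit:
        return "parallelism"
    return "no obvious problem"

# Hypothetical readings copied by hand from the Stages tab.
print(diagnose_stage([60] * 197 + [7200] * 3, gc_fraction=0.05, pending_tasks=0))
print(diagnose_stage([900] * 200, gc_fraction=0.45, pending_tasks=0))
print(diagnose_stage([5] * 200, gc_fraction=0.02, pending_tasks=4000))
```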
For migration projects running reconciliation across many tables, wrap each table's reconciliation in a try/except with a per-table timeout using Python's concurrent.futures:
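```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def reconcile_table(table_name):
    # reconciliation logic here
    pass

executor = ThreadPoolExecutor(max_workers=1)
future = executor.submit(reconcile_table, "orders")
try:
    future.result(timeout=3600)
except TimeoutError:
    print("Reconciliation of orders exceeded 1 hour")
    # cancel() only stops a task that has not started yet. It cannot interrupt
    # a running thread, so the Spark job itself must be cancelled separately
    # (for example with spark.sparkContext.cancelJobGroup).
    future.cancel()
finally:
    # wait=False so a hung reconciliation does not block the caller here.
    executor.shutdown(wait=False)
```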
This gives you table-level granularity on timeouts. MetricSign monitors Databricks job runs and flags when a run exceeds its historical duration baseline, grouping the incident with cluster-level metrics so you can see whether the cause is data skew, executor memory pressure, or a configuration gap — without building custom polling against the Runs API.
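For the many-table case, the same pattern can be looped so that one slow table does not hold up the rest. The sketch below uses a stand-in reconcile function that just sleeps; the table names, the helper name, and the very short timeout are illustrative only.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def run_with_timeouts(tables, reconcile_fn, timeout_seconds):
    """Run reconcile_fn(table) for each table; return (finished, timed_out)."""
    finished, timed_out = [], []
    for table in tables:
        executor = ThreadPoolExecutor(max_workers=1)
        future = executor.submit(reconcile_fn, table)
        try:
            future.result(timeout=timeout_seconds)
            finished.append(table)
        except TimeoutError:
            timed_out.append(table)
        finally:
            # Do not wait for a worker that is still running.
            executor.shutdown(wait=False)
    return finished, timed_out

def fake_reconcile(table):
    # Stand-in for real reconciliation: only "orders" is slow.
    time.sleep(0.5 if table == "orders" else 0.0)

finished, timed_out = run_with_timeouts(
    ["customers", "orders", "payments"], fake_reconcile, timeout_seconds=0.1
)
print(finished, timed_out)
```

A timed-out worker thread keeps running in the background, so in a real job the underlying Spark work still has to be cancelled explicitly. Exceptions raised by the reconcile function propagate to the caller rather than being recorded as timeouts.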
A checklist before you run reconciliation at scale
Before scheduling a reconciliation job that compares production-scale tables, walk through these configuration gates. Each one addresses a specific failure mode that causes infinite runs.
Set timeout_seconds on every reconciliation task. There is no valid reason for a reconciliation job to run indefinitely. If you don't know the expected duration, run the reconciliation manually once, measure it, and set the timeout to 3x that value.
Configure max_concurrent_runs: 1. Overlapping reconciliation runs cause Delta table commit conflicts, double the compute cost, and produce duplicate discrepancy records.
Partition the reconciliation by date or key range. A single full-outer join across 500 million rows will shuffle more data than most clusters can handle in a reasonable time. Break it into daily or weekly partitions.
Choose storage-optimized instances for the jobs cluster. Reconciliation is shuffle-heavy. Instances with local NVMe storage (i3 on AWS, L-series on Azure) handle spills without thrashing.
Verify spark.local.dir points to the NVMe mount, not /tmp. This is easy to overlook and has an outsized impact on shuffle-heavy workloads.
Add a health rule with RUN_DURATION_SECONDS as an early-warning alert. Set it to fire before the hard timeout so you have time to investigate before the run is killed.
Check the Spark UI's Stages tab after the first production run. Confirm no single task is taking disproportionately long. If it is, the skew in your join key will only get worse as the data grows.
Each of these is a five-minute configuration change. Collectively, they prevent the scenario that brings teams to community forums at 2 AM: a reconciliation job that's been running for 14 hours with no error, no output, and no indication of when — or whether — it will finish.
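The job-definition items on this checklist can be checked mechanically before a job is deployed. The following is a minimal sketch, assuming the job settings are available as a Python dict using the field names shown earlier; the function name and the sample definitions are hypothetical, and the instance-type and spark.local.dir items are left out.

```python
def check_job_settings(job):
    """Return a list of checklist items the job definition fails."""
    problems = []
    if job.get("max_concurrent_runs", 1) != 1:
        problems.append("max_concurrent_runs should be 1")
    rules = job.get("health", {}).get("rules", [])
    warn_values = [r["value"] for r in rules
                   if r.get("metric") == "RUN_DURATION_SECONDS"]
    if not warn_values:
        problems.append("no RUN_DURATION_SECONDS health rule")
    for task in job.get("tasks", []):
        timeout = task.get("timeout_seconds", 0)  # 0 or missing: no timeout
        if timeout <= 0:
            problems.append(f"task {task['task_key']} has no timeout_seconds")
        elif warn_values and min(warn_values) >= timeout:
            problems.append(
                f"task {task['task_key']} times out before the health rule fires"
            )
    return problems

risky_job = {"max_concurrent_runs": 2, "tasks": [{"task_key": "reconcile_orders"}]}
safe_job = {
    "max_concurrent_runs": 1,
    "health": {"rules": [{"metric": "RUN_DURATION_SECONDS",
                          "op": "GREATER_THAN", "value": 3600}]},
    "tasks": [{"task_key": "reconcile_orders", "timeout_seconds": 7200}],
}
print(check_job_settings(risky_job))
print(check_job_settings(safe_job))
```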