Databricks Snapshot Connectors Return Stale Data Without Telling You

Snapshots look fresh until they aren't

Every query-based connector that reads from a Databricks Delta table operates on a snapshot. That snapshot is a consistent view of the table at a specific Delta log version. The connector opens a session, resolves the latest version, reads the Parquet files referenced by that version, and returns the result set. This is the isolation guarantee Delta Lake provides, and it works well — until the read takes too long.

The problem surfaces in a specific sequence. A connector begins reading version 47 of a table. The read is slow, maybe because the table is large, the warehouse is under contention, or the connector's own pagination logic adds overhead. While the read is in progress, an upstream pipeline writes new data, advancing the table to version 52. A VACUUM operation runs on schedule and removes Parquet files that are no longer referenced by any version within the retention window. If version 47's files get vacuumed before the connector finishes reading them, the read fails with a FileNotFoundException or, in some JDBC/ODBC connector implementations, a truncated result set with no error surfaced to the caller.

The default delta.deletedFileRetentionDuration is 7 days. That sounds generous. But connector jobs that run on a schedule — say, a Power BI dataset refreshing every 30 minutes against a Databricks SQL warehouse — don't control when VACUUM runs. If an administrator or a maintenance job vacuums aggressively with a shorter retention, the safety margin disappears. The connector doesn't know its snapshot files are gone until it tries to read them.

VERSION AS OF fails silently in connector contexts

Delta Lake time travel gives you VERSION AS OF and TIMESTAMP AS OF syntax for explicit snapshot access. These are powerful for ad-hoc analysis and reproducible queries. But when a query-based connector uses time travel, the failure mode shifts from 'read might break' to 'read might succeed with outdated data and nobody notices.'

Consider a connector configured with a query like SELECT * FROM sales_facts VERSION AS OF 100. This was set up when version 100 was current. Weeks later, the table is at version 340. The connector still returns version 100, faithfully. No error. No warning. The dashboard consumers see data that's internally consistent but weeks behind reality.

This happens more often than it should, particularly in partner connector configurations where the query is embedded in a connection string or a saved query definition that doesn't get revisited. The Databricks community thread on this topic highlights exactly this confusion: users expect the connector to behave like a live query, but the snapshot semantics mean it returns exactly what was asked for, even when that's no longer useful.

The delta.logRetentionDuration default of 30 days adds another cliff. Once the log entry for version 100 expires, the query fails with an AnalysisException: Cannot time travel Delta table to version 100. At that point the connector breaks loudly, but the preceding weeks of silent staleness were the real damage.

For connectors that don't use explicit time travel but still cache query plans or connection metadata, a similar problem exists. Some JDBC drivers cache the resolved snapshot version at connection time. If the connection pool reuses a stale connection, subsequent queries read from the old snapshot without re-resolving the current version.

Snapshot connector failure sequence

Change Data Feed doesn't fix the connector gap

Delta Lake's Change Data Feed (CDF) is the recommended path for incremental processing. Enable it with ALTER TABLE SET TBLPROPERTIES (delta.enableChangeDataFeed = true), and you get row-level change tracking with _change_type, _commit_version, and _commit_timestamp metadata columns. Structured Streaming consumers can use readChangeFeed to process only new changes.

But query-based connectors — the JDBC/ODBC connections used by Power BI, Tableau, and custom ETL scripts — don't speak Structured Streaming. They issue SQL queries and read result sets. CDF is available to them only through batch queries like SELECT * FROM table_changes('sales_facts', 45, 52), which requires the connector to track its own high-water mark: the last version it successfully read.

This is where most implementations fall apart. The connector needs durable state management for that watermark. If the connector crashes after reading changes but before persisting the new watermark, it either re-reads (duplicates) or skips ahead (data loss). If the watermark points to a version that's been vacuumed, the CDF query fails with a version-not-available error.

The 2024 Stack Overflow Developer Survey shows Databricks SQL at 1.9% adoption among all respondents. That's a small but growing user base, and many of these users connect Databricks to downstream tools through exactly these query-based patterns. The connector layer is the weakest link in the pipeline, yet it's the least instrumented. Databricks surfaces query history in the SQL warehouse UI, but connector-side failures — timeouts, partial reads, stale watermarks — live outside that visibility boundary.

CDF also has a constraint that catches teams off guard: only changes made after enablement are recorded. If you enable CDF on an existing table and a connector tries to read changes from before the enablement version, the query returns nothing. The connector reports success with zero rows, and the downstream table stays empty.

Cluster auto-termination severs long-running reads

Databricks clusters and SQL warehouses have auto-stop behavior. Interactive clusters default to 120 minutes of inactivity before terminating. SQL warehouses can be configured down to 5 minutes. A query-based connector that opens a session, starts a large read, and then pauses — waiting for the client to paginate through results — may find its cluster terminated mid-transfer.

The connector's experience depends on implementation. Well-built JDBC drivers detect the broken connection and throw an IOException. Simpler HTTP-based connectors may receive a partial response and treat it as complete. The Databricks SQL connector for Python, for example, uses Arrow-based result sets fetched in chunks from cloud storage. If the warehouse shuts down after producing the result but before the client finishes downloading all chunks, the client gets a truncated dataset.

The fix seems straightforward: increase the auto-stop timeout. But that costs money. A SQL warehouse running 24/7 on medium sizing can easily exceed $2,000/month in DBU costs. Teams set aggressive auto-stop policies for cost control and then wonder why their overnight connector jobs produce incomplete data.

A more robust approach is to decouple the query execution from the result consumption. Use CREATE TABLE AS SELECT or INSERT OVERWRITE to materialize the connector's query into a staging table, then have the connector read from the staging table. The compute cluster only needs to stay alive for the write, not for the client's slow download. The staging table is a stable Delta snapshot that won't be affected by the source table's VACUUM schedule.

This pattern adds complexity — another table to manage, another job to schedule, another thing that can fail — but it eliminates the two most common connector failure modes: mid-read cluster termination and snapshot file vacuuming.

Monitoring the version gap catches staleness early

The operational signal that matters is the gap between the snapshot version a connector actually read and the current version of the source table. Delta Lake exposes this through DESCRIBE HISTORY table_name, which returns every commit version with its timestamp and operation type. The current version is the max.

If your connector job logs the version it consumed (either from the Delta log or from a query like SELECT MAX(_commit_version) FROM table_changes(...)), you can compute the lag: current version minus consumed version. A lag of zero means the connector read the latest data. A lag of 5 might be normal if the table receives frequent small writes. A lag of 200 means something broke days ago and nobody noticed.

Most teams don't instrument this. The connector job either succeeds or fails, and success is treated as 'data is current.' That assumption is the root cause of stale dashboard incidents that take hours to diagnose because every system reports green.

MetricSign monitors Databricks job runs and can detect when downstream refresh jobs complete but the data they produced hasn't actually advanced. The refresh_delayed signal fires not on job failure — those are easy — but on job success where the resulting data is older than expected. That's the gap between 'the pipeline ran' and 'the pipeline delivered current data,' which is exactly where snapshot connector staleness hides.

For teams that can't instrument the version gap directly, a simpler proxy works: track the max timestamp of a known monotonically increasing column in the destination table. If that timestamp stops advancing across consecutive connector runs, the connector is reading stale snapshots. Set an alert threshold based on your table's normal write frequency, and you'll catch the problem before a stakeholder opens a dashboard and sees last Tuesday's numbers.