Databricks is frequently the compute layer between raw source data and Power BI datasets. Databricks jobs run transformations, aggregate data, and write to Delta tables or Azure SQL tables that Power BI datasets subsequently query during refresh.
How the dependency works
The typical chain looks like:

1. Raw data lands in Azure Data Lake Storage (from ADF, Kafka, or direct upload)
2. Databricks jobs read the raw data and write transformed Delta tables
3. Power BI datasets read from those Delta tables (via Direct Lake or import mode)
4. Power BI reports visualize the data
When step 2 fails or produces incomplete output, step 3 either fails (if the Delta table is missing) or loads stale/incorrect data (if the previous Delta table version is still there).
What to monitor in Databricks
Job run status: The most basic signal — did the job complete or fail? The Databricks Jobs API returns run history with status, start time, end time, and error details for failed runs.
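As a sketch, the run history can be pulled from the Jobs 2.1 `runs/list` endpoint and each run reduced to a simple status. The `classify_run` helper and its status strings are illustrative, not part of the Databricks API; only the payload shape (`state.life_cycle_state`, `state.result_state`) comes from the API itself.

```python
import json
import urllib.request

def list_recent_runs(host, token, job_id, limit=25):
    """Fetch recent run history for a job via the Databricks Jobs 2.1 API."""
    url = f"{host}/api/2.1/jobs/runs/list?job_id={job_id}&limit={limit}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("runs", [])

def classify_run(run):
    """Reduce a Jobs API run record to a coarse status (hypothetical helper)."""
    state = run.get("state", {})
    if state.get("life_cycle_state") != "TERMINATED":
        return "running"  # still pending, running, or terminating
    return "ok" if state.get("result_state") == "SUCCESS" else "failed"
```

Error details for failed runs live in the same payload (for example `state.state_message`), so one polling loop covers both the status signal and the diagnostic context.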
Job duration baseline: A job that normally runs in 20 minutes and starts taking 90 minutes is experiencing performance degradation — likely due to data volume growth, cluster resource pressure, or an inefficient query. MetricSign uses MAD (Median Absolute Deviation) to detect when a job run is significantly slower than its historical baseline.
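A minimal version of that MAD check can be written with the standard modified z-score; the 0.6745 constant and 3.5 threshold are the conventional defaults, and the function name and signature are assumptions for illustration, not MetricSign's actual implementation.

```python
import statistics

def is_slow(baseline_minutes, latest_minutes, threshold=3.5):
    """Flag a run whose duration deviates sharply from the historical baseline,
    using the modified z-score built on MAD (Median Absolute Deviation)."""
    median = statistics.median(baseline_minutes)
    mad = statistics.median(abs(d - median) for d in baseline_minutes)
    if mad == 0:
        # Degenerate case: all baseline runs took identical time.
        return latest_minutes > median
    modified_z = 0.6745 * (latest_minutes - median) / mad
    return modified_z > threshold
```

MAD is preferred over mean/standard deviation here because one earlier outlier run (a retry, a cold cluster) barely moves the median, so the baseline stays stable.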
Notebook-level failures in multi-notebook jobs: A Databricks job can be composed of multiple notebooks running in sequence. If one notebook fails but the job is configured to continue, the job may complete with a Partially Succeeded status while producing partial output. Monitoring notebook-level task status (not just job status) catches this.
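Catching this means walking the `tasks` array inside the run payload rather than reading only the top-level state. A sketch, assuming the Jobs 2.1 multi-task run shape; the set of failure result states checked here is illustrative:

```python
def failed_tasks(run):
    """Return the task keys of tasks that did not succeed within a
    multi-task job run, even when the job as a whole 'completed'."""
    return [
        task["task_key"]
        for task in run.get("tasks", [])
        if task.get("state", {}).get("result_state") in ("FAILED", "TIMEDOUT", "CANCELED")
    ]
```

A run whose top-level status looks acceptable but whose `failed_tasks` list is non-empty is exactly the partial-output case described above.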
Connecting Databricks monitoring to Power BI
The connection between a Databricks job and a Power BI dataset is established through the Delta table they share. MetricSign matches Databricks job output table paths against Power BI datasource configurations to build the lineage link.
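Conceptually the matching is a join on normalized table paths. The structures below are illustrative stand-ins for the job-output and datasource metadata, not MetricSign's actual schema:

```python
def link_datasets(job_output_tables, dataset_sources):
    """Map each Delta table path a job writes to the Power BI datasets
    that read from it. `dataset_sources` maps dataset name -> list of
    source table paths (hypothetical shapes for illustration)."""
    def norm(path):
        # Normalize trivial differences: trailing slashes, letter case.
        return path.rstrip("/").lower()

    datasets_by_table = {}
    for dataset, sources in dataset_sources.items():
        for source in sources:
            datasets_by_table.setdefault(norm(source), []).append(dataset)

    return {table: datasets_by_table.get(norm(table), [])
            for table in job_output_tables}
```

Path normalization matters in practice: the job config and the Power BI datasource often record the same table with cosmetic differences, and an exact string match would miss the link.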
When a Databricks job fails or runs significantly slower than its baseline, the linked Power BI datasets are identified and included in the incident context, so the alert says not just "job X failed" but "job X failed, and datasets Y and Z are scheduled to refresh from its output in 2 hours."
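Composing that enriched alert is straightforward once the lineage link exists. A sketch with hypothetical field names (none of these come from MetricSign's actual API):

```python
from datetime import datetime, timedelta, timezone

def incident_message(job_name, datasets, next_refresh_at, now=None):
    """Build an alert string that names the downstream Power BI datasets
    and how soon they will refresh from the failed job's output."""
    now = now or datetime.now(timezone.utc)
    hours = max(0, round((next_refresh_at - now).total_seconds() / 3600))
    names = " and ".join(datasets)
    return (f"Job {job_name} failed; datasets {names} are scheduled to "
            f"refresh from its output in {hours} hours.")
```

The refresh countdown is what makes the alert actionable: it tells the on-call engineer how long they have to rerun the job before stale data reaches reports.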