
Data Observability for the Microsoft Stack: Power BI, ADF, Databricks, dbt, and Fabric

Five failure layers, no single native tool that covers them, and a correlation problem that makes every incident look like three.


Data observability looks different on the Microsoft stack

For dbt+Snowflake teams, data observability is a solved category. Monte Carlo, Bigeye, and Elementary cover freshness, volume, schema, and lineage inside the warehouse. The boundary of the problem is roughly the boundary of the warehouse.

Microsoft stack teams do not have that luxury. A typical pipeline lands raw data in ADLS through an ADF copy activity, transforms it in a Databricks notebook or dbt project running on Databricks SQL, materializes a star schema, and serves it through a Power BI semantic model — increasingly inside Microsoft Fabric. Every one of those handoffs has its own failure mode, its own log surface, and its own native alerting that does not talk to the next one.

The result: when a Friday-morning refresh shows stale numbers in the executive dashboard, the on-call BI developer has five places to check and no way to know which one broke first.

The five failure layers, and what each one breaks like

1. Source data. Upstream systems — Dynamics 365, SAP, Salesforce, on-prem SQL — change schema without notice, drop volume after a deployment, or land late because an SFTP partner missed a window. Symptom: a downstream COPY INTO succeeds but writes 12 rows instead of 1.2M, or a MERGE fails with Cannot resolve column 'CustomerStatus'. Native tooling: nothing built-in. You either write row-count and schema checks in ADF Lookup activities or rely on dbt source freshness tests downstream — meaning you detect it after it has already propagated.

2. ADF pipelines. Activity failures (ErrorCode=2200, ErrorCode=2100), mapping data flow errors, self-hosted IR gateway timeouts (ErrorCode=9013), and integration runtime quota exhaustion. Symptom: pipeline state Failed in the monitoring tab, often with a 4-deep nested error chain. Native tooling: ADF Monitoring, Azure Monitor diagnostic logs to Log Analytics, action groups for email/webhook alerts. Useful, but each pipeline alerts in isolation — you get one email per failed activity, not one incident per logical pipeline.

3. Databricks jobs. OOM in a worker (Driver/Executor lost), SparkOutOfMemoryError, slow jobs that exceed SLA, autoscaling failures, cluster startup failures from INVALID_PARAMETER_VALUE on instance pools. Symptom: a job that ran in 14 minutes yesterday runs 47 minutes today, or fails on the third attempt. Native tooling: job-level email alerts, system tables (system.lakeflow.jobs), webhook destinations to Slack or PagerDuty. Granular, but blind to whether the job's output is correct.

4. dbt transformations. Model build failures, test failures (unique, not_null, custom singular tests), upstream source freshness errors, and snapshot drift. Symptom: dbt run exits non-zero, or a dbt test reports 14,302 failing rows in not_null_orders_customer_id. Native tooling: dbt Cloud has alerts and the Discovery API; dbt Core users have run_results.json and whatever they wire to it. The blind spot: dbt knows nothing about whether the Power BI dataset that consumes its mart actually refreshed.

5. Power BI semantic layer. Dataset refresh failures, schema drift between the model and the source view (The 'X' column does not exist), gateway dropouts (PBIEgwService stops without a status change), capacity throttling on Premium/Fabric F-SKU, and — the silent killer — refresh_delayed: a refresh that succeeds but runs 6 hours late because the upstream pipeline finished late. Native tooling: Power BI Service refresh history, the Activity Log, the Admin REST API, and the Fabric Capacity Metrics app. None of these surface delay against an expected SLA out of the box.

[Figure: The 5 failure layers of the Microsoft data stack]

What Monte Carlo and Bigeye don't see

Generic data observability tools were built for the warehouse-first world. They connect to Snowflake, BigQuery, Redshift, and increasingly Databricks, and they do an excellent job of detecting freshness anomalies, volume drops, and schema changes inside those tables.

What they do not see on the Microsoft stack:

ADF pipeline state. Monte Carlo has no integration with the ADF REST API, no awareness of activity-level failures, no view into self-hosted IR health. A pipeline that has not run for 8 hours because the gateway dropped is invisible.

Power BI refresh state. Neither Monte Carlo nor Bigeye queries the Power BI Admin API. A failed Refresh-PowerBIDataset does not register. A dataset that imports from a Databricks table the tool does monitor will show green at the warehouse layer while the dashboard shows yesterday's numbers.

Fabric pipeline runs. Fabric Data Pipelines and notebook runs are a separate surface from ADF, with their own monitoring API. Coverage is effectively zero in generic tools as of mid-2026.

The practical consequence: a dbt model that fails its unique test, breaking a downstream Power BI mart, generates one alert from dbt Cloud, one from Power BI's refresh failure email, and zero from your observability tool — because at the warehouse layer the table still exists and still has rows.

The cross-stack correlation problem

Here is a real failure pattern. At 02:14 UTC, an upstream Dynamics 365 export drops the OpportunityStageId column after a Microsoft monthly release. At 03:00, the ADF copy activity into bronze succeeds because the schema is read dynamically. At 03:45, the Databricks silver job runs MERGE and fails with AnalysisException: cannot resolve OpportunityStageId. The dbt run scheduled at 04:30 fails because its source model is stale. The Power BI dataset scheduled at 06:00 fails to refresh because its DirectQuery view references a column that no longer exists.

What the on-call engineer sees at 06:05:

  • A Databricks job-failure email from 03:46
  • A dbt Cloud failure notification from 04:32
  • A Power BI refresh failure email from 06:01
  • A Slack message from the CFO at 06:07 asking why the pipeline dashboard is empty

Four notifications, one root cause, no linkage. The engineer spends 40 minutes establishing that these are the same incident before they can start fixing it. Lineage information exists — Purview has some of it, dbt has the rest — but no system stitches the runtime events together with the lineage graph and presents one incident.

This is the unsolved problem. It is not a data quality problem. It is not a warehouse observability problem. It is a cross-system event correlation problem with a dependency graph attached.

How to instrument each layer

Start with native tooling at each layer, then layer correlation on top.

Source data. Add validate_schema Lookup activities in ADF that compare incoming column lists against an expected manifest, and fail loud. For volume, write a row-count delta query into a control table after every load and alert on |delta| > 25% week-over-week. dbt source freshness with loaded_at_field covers the late-arrival case if you actually configure warn_after and error_after.
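A minimal sketch of that week-over-week volume check, assuming a control table ops.etl_load_control(table_name, load_date, row_count) populated after every load; the table name, connection details, and webhook are illustrative, not anything ADF or Databricks provides out of the box:

```python
# Week-over-week volume check: compare today's row count for a table against
# the count from 7 days ago and alert when |delta| exceeds 25%.
# Control table ops.etl_load_control(table_name, load_date, row_count),
# connection details, and the webhook URL are all placeholders.
import requests
from databricks import sql  # pip install databricks-sql-connector

ALERT_WEBHOOK = "https://hooks.example.com/data-incidents"  # placeholder
THRESHOLD = 0.25

QUERY = """
    SELECT load_date, row_count
    FROM ops.etl_load_control
    WHERE table_name = :table_name
      AND load_date IN (current_date(), date_sub(current_date(), 7))
    ORDER BY load_date
"""

def check_volume(table_name: str) -> None:
    with sql.connect(server_hostname="<workspace-host>",        # placeholder
                     http_path="<sql-warehouse-http-path>",     # placeholder
                     access_token="<token>") as conn:           # placeholder
        with conn.cursor() as cur:
            # Named parameters require a recent databricks-sql-connector.
            cur.execute(QUERY, {"table_name": table_name})
            rows = cur.fetchall()

    if len(rows) < 2:
        return  # no comparable load a week ago

    prev_count, latest_count = rows[0][1], rows[-1][1]
    if prev_count == 0:
        return  # avoid divide-by-zero on an empty baseline

    delta = (latest_count - prev_count) / prev_count
    if abs(delta) > THRESHOLD:
        requests.post(ALERT_WEBHOOK, json={
            "source": "volume_check",
            "table": table_name,
            "row_count": latest_count,
            "baseline": prev_count,
            "delta_pct": round(delta * 100, 1),
        }, timeout=10)
```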

ADF. Route diagnostic logs to Log Analytics. Write a KQL alert on ADFActivityRun | where Status == 'Failed' grouped by pipeline name to collapse N activity failures into one pipeline incident. Wire action groups to a webhook, not just email.
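A sketch of that grouped query run from Python with the azure-monitor-query SDK; the workspace ID and webhook are placeholders, and the column names assume the resource-specific ADFActivityRun diagnostic schema, so verify them against your own workspace:

```python
# Collapse per-activity ADF failures into one record per pipeline run,
# then post each as a single incident instead of N activity emails.
# Assumes ADF diagnostic logs are routed to Log Analytics; the workspace
# ID and webhook are placeholders.
from datetime import timedelta

import requests
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-id>"               # placeholder
ALERT_WEBHOOK = "https://hooks.example.com/data-incidents"  # placeholder

# One row per (pipeline, run) instead of one row per failed activity.
KQL = """
ADFActivityRun
| where Status == 'Failed'
| summarize FailedActivities = dcount(ActivityName),
            FirstFailure = min(TimeGenerated),
            Errors = make_set(ErrorMessage, 5)
  by PipelineName, PipelineRunId
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(WORKSPACE_ID, KQL, timespan=timedelta(hours=1))

for table in response.tables:
    for row in table.rows:
        record = dict(zip(table.columns, row))
        requests.post(ALERT_WEBHOOK, json={
            "source": "adf",
            "pipeline": record["PipelineName"],
            "run_id": record["PipelineRunId"],
            "failed_activities": record["FailedActivities"],
            "first_failure": str(record["FirstFailure"]),
            "errors": record.get("Errors"),
        }, timeout=10)
```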

Databricks. Use job-level webhook notifications, not email. Query system.lakeflow.job_run_timeline for runs exceeding their median duration by 2x — this catches the slow-degradation case before the SLA-breach case. Set max_retries=2 with exponential backoff on transient cluster errors.
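A sketch of the 2x-median check. It assumes the documented system.lakeflow.job_run_timeline layout, where a single run can span several timeline rows, and computes durations and the median client-side; connection details and the webhook are placeholders:

```python
# Flag Databricks job runs whose duration exceeds 2x the median of that
# job's own recent history, using system.lakeflow.job_run_timeline.
# Connection details and the webhook are placeholders; column names
# follow the documented system-table schema.
from collections import defaultdict
from statistics import median

import requests
from databricks import sql  # pip install databricks-sql-connector

ALERT_WEBHOOK = "https://hooks.example.com/data-incidents"  # placeholder

QUERY = """
    SELECT job_id, run_id,
           MIN(period_start_time) AS run_start,
           unix_timestamp(MAX(period_end_time)) -
           unix_timestamp(MIN(period_start_time)) AS duration_s
    FROM system.lakeflow.job_run_timeline
    WHERE period_start_time >= current_timestamp() - INTERVAL 30 DAYS
    GROUP BY job_id, run_id
    ORDER BY job_id, run_start
"""

with sql.connect(server_hostname="<workspace-host>",        # placeholder
                 http_path="<sql-warehouse-http-path>",      # placeholder
                 access_token="<token>") as conn:             # placeholder
    with conn.cursor() as cur:
        cur.execute(QUERY)
        rows = cur.fetchall()

runs_by_job = defaultdict(list)  # job_id -> chronological (run_id, duration_s)
for job_id, run_id, _run_start, duration_s in rows:
    runs_by_job[job_id].append((run_id, duration_s))

for job_id, runs in runs_by_job.items():
    if len(runs) < 5:
        continue  # too little history to judge
    *history, (latest_run, latest_s) = runs
    baseline = median(d for _, d in history)
    if latest_s and latest_s > 2 * baseline:
        requests.post(ALERT_WEBHOOK, json={
            "source": "databricks",
            "job_id": job_id,
            "run_id": latest_run,
            "duration_s": latest_s,
            "median_s": baseline,
        }, timeout=10)
```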

dbt. Wire run_results.json and manifest.json into your incident system after every run. Treat state:modified test failures differently from new test failures — the latter are usually the noise. For dbt Core, the artifacts are in target/; for dbt Cloud, use the Discovery API.
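A sketch of that wiring for dbt Core, reading target/run_results.json after a run and forwarding only the failures; the webhook is a placeholder:

```python
# Parse dbt's run_results.json after a run and turn failed models and
# tests into a single incident payload. Assumes dbt Core writing to
# target/; the webhook is a placeholder.
import json
from pathlib import Path

import requests

ALERT_WEBHOOK = "https://hooks.example.com/data-incidents"  # placeholder
ARTIFACT = Path("target/run_results.json")

results = json.loads(ARTIFACT.read_text())

failures = [
    {
        "unique_id": r["unique_id"],        # e.g. model.analytics.fct_orders
        "status": r["status"],              # error for models, fail for tests
        "failures": r.get("failures"),      # failing row count on tests
        "message": r.get("message"),
        "execution_time_s": round(r.get("execution_time", 0), 1),
    }
    for r in results["results"]
    if r["status"] in ("error", "fail")
]

if failures:
    requests.post(ALERT_WEBHOOK, json={
        "source": "dbt",
        "invocation_id": results["metadata"]["invocation_id"],
        "generated_at": results["metadata"]["generated_at"],
        "failures": failures,
    }, timeout=10)
```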

Power BI. Poll the Admin REST API /admin/datasets/{id}/refreshes every 5 minutes. Alert on status != Completed and on endTime - scheduledStartTime > expected_duration — the second condition is what catches refresh_delayed, the case where the refresh technically succeeds but runs late because something upstream blocked it. Capacity throttling shows up in the Fabric Capacity Metrics app under CU Usage and Throttling.
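A sketch of that polling loop covering both conditions. Dataset IDs, expected windows, token acquisition, and the webhook are placeholders, and the endpoint path follows the reference above; swap in the non-admin datasets/{id}/refreshes call if you poll per workspace:

```python
# Poll Power BI refresh history and flag both hard failures and
# refreshes that finished later than their expected window, the
# refresh_delayed case. Dataset IDs, schedules, the AAD token, and
# the webhook are placeholders.
from datetime import datetime, timedelta, timezone

import requests

API = "https://api.powerbi.com/v1.0/myorg"
TOKEN = "<aad-access-token>"                                # placeholder
ALERT_WEBHOOK = "https://hooks.example.com/data-incidents"  # placeholder

# dataset_id -> (scheduled start "HH:MM" UTC, expected duration)
WATCHED = {
    "<dataset-guid>": ("06:00", timedelta(minutes=45)),    # placeholder
}

def parse(ts: str) -> datetime:
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

for dataset_id, (sched, expected) in WATCHED.items():
    resp = requests.get(
        f"{API}/admin/datasets/{dataset_id}/refreshes?$top=1",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    latest = resp.json()["value"][0]

    today = datetime.now(timezone.utc).date()
    hour, minute = map(int, sched.split(":"))
    scheduled_start = datetime(today.year, today.month, today.day,
                               hour, minute, tzinfo=timezone.utc)

    failed = latest["status"] != "Completed"
    delayed = ("endTime" in latest and
               parse(latest["endTime"]) > scheduled_start + expected)

    if failed or delayed:
        requests.post(ALERT_WEBHOOK, json={
            "source": "powerbi",
            "dataset_id": dataset_id,
            "status": latest["status"],
            "end_time": latest.get("endTime"),
            "signal": "refresh_failed" if failed else "refresh_delayed",
        }, timeout=10)
```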

Gaps remain after all of this. The biggest one: nothing in the native toolset connects a dbt test failure to the Power BI dataset that depends on the failed model. You will write that linkage yourself, or you will adopt a tool that does.
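One way to write that linkage yourself: a hand-maintained map from dbt model to the Power BI datasets that consume it, joined against the failures in run_results.json. The model and dataset names below are illustrative; dbt exposures or a Purview lineage export could populate the same map:

```python
# Hand-rolled linkage from failed dbt models to the Power BI datasets
# that consume them. The mapping is maintained by hand here; dbt
# exposures or Purview lineage could feed the same structure.
import json
from pathlib import Path

# dbt model unique_id -> downstream Power BI datasets (illustrative values)
MODEL_TO_DATASETS = {
    "model.analytics.fct_orders": ["Sales Executive Dashboard"],
    "model.analytics.dim_customer": ["Sales Executive Dashboard",
                                     "Customer 360"],
}

results = json.loads(Path("target/run_results.json").read_text())

# Test unique_ids (test.analytics.not_null_...) map back to models via
# depends_on in manifest.json; a fuller version would join against the
# manifest. Here we match model unique_ids directly.
failed_models = {
    r["unique_id"]
    for r in results["results"]
    if r["status"] in ("error", "fail")
}

impacted = {
    dataset
    for model in failed_models
    for dataset in MODEL_TO_DATASETS.get(model, [])
}

if impacted:
    print("dbt failures impact these Power BI datasets:")
    for dataset in sorted(impacted):
        print(f"  - {dataset}")
```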

Where MetricSign fits

MetricSign was built for this specific correlation problem on the Microsoft stack. It connects to ADF, Databricks, dbt (Core and Cloud), Fabric Data Pipelines, and the Power BI Admin API, then correlates events across them into a single incident with a timeline and a lineage graph.

The scenario from earlier — Dynamics schema change cascading to a failed Power BI refresh — surfaces as one incident with the four failed runs ordered by timestamp, the lineage path from source to dataset highlighted, and the originating event (the schema change) flagged as root cause. The refresh_delayed signal is first-class: a Power BI refresh that succeeded but ran 4 hours past its expected window is treated as an incident, not a green checkmark. That is what generic observability tools, built for warehouse-first stacks, do not do.
