MetricSign

Data Observability for the Microsoft Stack: Power BI, ADF, Databricks, dbt, and Fabric

Five failure layers, no single native tool that covers them, and a correlation problem that makes every incident look like three.


Data observability looks different on the Microsoft stack

For dbt+Snowflake teams, data observability is a solved category. Monte Carlo, Bigeye, and Elementary cover freshness, volume, schema, and lineage inside the warehouse. The boundary of the problem is roughly the boundary of the warehouse.

Microsoft stack teams do not have that luxury. Most data observability platforms built for warehouse-first stacks leave the ADF, Databricks, and Power BI layers uncovered — and those are exactly the layers where Microsoft stack incidents originate.

A typical pipeline lands raw data in ADLS through an ADF copy activity, transforms it in a Databricks notebook or dbt project running on Databricks SQL, materializes a star schema, and serves it through a Power BI semantic model — increasingly inside Microsoft Fabric. Every one of those handoffs has its own failure mode, its own log surface, and its own native alerting that does not talk to the next one.

The result: when a Friday-morning refresh shows stale numbers in the executive dashboard, the on-call BI developer has five places to check and no way to know which one broke first.

See also: Best data observability platforms in 2026

The five failure layers, and what each one breaks like

1. Source data. Upstream systems — Dynamics 365, SAP, Salesforce, on-prem SQL — change schema without notice, drop volume after a deployment, or land late because an SFTP partner missed a window. The symptom: a downstream COPY INTO succeeds but writes 12 rows instead of 1.2M, or a MERGE fails with Cannot resolve column 'CustomerStatus'. Native tooling: nothing built-in. You either write row-count and schema checks in ADF Lookup activities or rely on dbt source freshness tests downstream — meaning you detect it after it has already propagated.

2. ADF pipelines. Activity failures (ErrorCode=2200, ErrorCode=2100), mapping data flow errors, self-hosted IR gateway timeouts (ErrorCode=9013), and integration runtime quota exhaustion. Symptom: pipeline state Failed in the monitoring tab, often with a 4-deep nested error chain. Native tooling: ADF Monitoring, Azure Monitor diagnostic logs to Log Analytics, action groups for email/webhook alerts. Useful, but each pipeline alerts in isolation — you get one email per failed activity, not one incident per logical pipeline.

3. Databricks jobs. OOM in a worker (Driver/Executor lost), SparkOutOfMemoryError, slow jobs that exceed SLA, autoscaling failures, cluster startup failures from INVALID_PARAMETER_VALUE on instance pools. Symptom: a job that ran in 14 minutes yesterday runs 47 minutes today, or fails on the third attempt. Native tooling: job-level email alerts, system tables (system.lakeflow.jobs), webhook destinations to Slack or PagerDuty. Granular, but blind to whether the job's output is correct.

4. dbt transformations. Model build failures, test failures (unique, not_null, custom singular tests), upstream source freshness errors, and snapshot drift. Symptom: dbt run exits non-zero, or a dbt test reports 14,302 failing rows in not_null_orders_customer_id. Native tooling: dbt Cloud has alerts and the Discovery API; dbt Core users have run_results.json and whatever they wire to it. The blind spot: dbt knows nothing about whether the Power BI dataset that consumes its mart actually refreshed.

5. Power BI semantic layer. Dataset refresh failures, schema drift between the model and the source view (The 'X' column does not exist), gateway dropouts (PBIEgwService stops without a status change), capacity throttling on Premium/Fabric F-SKU, and — the silent killer — refresh_delayed: a refresh that succeeds but ran 6 hours late because the upstream pipeline finished late. Native tooling: Power BI Service refresh history, Activity Log, the Admin REST API, Fabric Capacity Metrics app. None of these surface delay against expected SLA out of the box.

[Figure: The 5 failure layers of the Microsoft data stack]

What Monte Carlo and Bigeye don't see

Generic data observability tools were built for the warehouse-first world. They connect to Snowflake, BigQuery, Redshift, and increasingly Databricks, and they do an excellent job of detecting freshness anomalies, volume drops, and schema changes inside those tables.

What they do not see on the Microsoft stack:

ADF pipeline state. Monte Carlo has no integration with the ADF REST API, no awareness of activity-level failures, no view into self-hosted IR health. A pipeline that has not run for 8 hours because the gateway dropped is invisible.

Power BI refresh state. Neither Monte Carlo nor Bigeye queries the Power BI Admin API. A failed Refresh-PowerBIDataset does not register. A dataset that imports from a Databricks table the tool does monitor will show green at the warehouse layer while the dashboard shows yesterday's numbers.

Fabric pipeline runs. Fabric Data Pipelines and notebook runs are a separate surface from ADF, with their own monitoring API. Coverage is effectively zero in generic tools as of mid-2026.

The practical consequence: a dbt model that fails its unique test, breaking a downstream Power BI mart, generates one alert from dbt Cloud, one from Power BI's refresh failure email, and zero from your observability tool — because at the warehouse layer the table still exists and still has rows.

The cross-stack correlation problem

Here is a real failure pattern. At 02:14 UTC, an upstream Dynamics 365 export drops the OpportunityStageId column after a Microsoft monthly release. At 03:00, the ADF copy activity into bronze succeeds because the schema is read dynamically. At 03:45, the Databricks silver job runs MERGE and fails with AnalysisException: cannot resolve OpportunityStageId. The dbt run scheduled at 04:30 fails because its source model is stale. The Power BI dataset scheduled at 06:00 fails to refresh because its source view references a column that no longer exists.

What the on-call engineer sees by 06:10:

  • A Databricks job-failure email from 03:46
  • A dbt Cloud failure notification from 04:32
  • A Power BI refresh failure email from 06:01
  • A Slack message from the CFO at 06:07 asking why the sales pipeline dashboard is empty

Four notifications, one root cause, no linkage. The engineer spends 40 minutes establishing that these are the same incident before they can start fixing it. Lineage information exists — Purview has some of it, dbt has the rest — but no system stitches the runtime events together with the lineage graph and presents one incident.

This is the unsolved problem. It is not a data quality problem. It is not a warehouse observability problem. It is a cross-system event correlation problem with a dependency graph attached.
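A minimal sketch of that correlation step, in Python. It assumes failure events have already been collected from each system and that lineage is available as a plain upstream map. The node names, the event shape, and the `ancestors` and `correlate` functions are hypothetical illustrations, not the API of any tool named in this article. Each failure is attached to its earliest-failing upstream ancestor, so one root yields one incident.

```python
from collections import defaultdict


def ancestors(node, upstream_of):
    """Return every node reachable by walking upstream from `node`."""
    seen = set()
    stack = list(upstream_of.get(node, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(upstream_of.get(current, []))
    return seen


def correlate(events, upstream_of):
    """Group failure events under the earliest-failing upstream node."""
    ordered = sorted(events, key=lambda e: e["time"])
    first_failure = {}
    for event in ordered:
        first_failure.setdefault(event["node"], event["time"])

    incidents = defaultdict(list)
    for event in ordered:
        failed_upstream = ancestors(event["node"], upstream_of) & set(first_failure)
        candidates = failed_upstream | {event["node"]}
        root = min(candidates, key=lambda n: first_failure[n])
        incidents[root].append(event)
    return dict(incidents)


# Hypothetical lineage for the scenario above: node -> direct upstream nodes.
LINEAGE = {
    "databricks:silver_opportunities": ["adf:copy_to_bronze"],
    "dbt:mart_pipeline": ["databricks:silver_opportunities"],
    "powerbi:sales_pipeline": ["dbt:mart_pipeline"],
}

# Failure events in arrival order; times are HH:MM strings for brevity.
EVENTS = [
    {"node": "powerbi:sales_pipeline", "time": "06:01"},
    {"node": "dbt:mart_pipeline", "time": "04:32"},
    {"node": "databricks:silver_opportunities", "time": "03:46"},
]

INCIDENTS = correlate(EVENTS, LINEAGE)
```

With this input the three failure notifications collapse into a single incident keyed on the Databricks job. The sketch only ranks failed runs against each other; tracing back to the Dynamics schema change would require the source layer to emit an event as well.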

How to instrument each layer

Start with native tooling at each layer, then layer correlation on top.

Source data. Add validate_schema Lookup activities in ADF that compare incoming column lists against an expected manifest, and fail loudly on a mismatch. For volume, write a row-count delta query into a control table after every load and alert on |delta| > 25% week-over-week. dbt source freshness with loaded_at_field covers the late-arrival case, but only if you actually configure warn_after and error_after.
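The two source checks can be sketched in a few lines of Python. This is an illustration under assumptions, not an ADF feature: `diff_schema` and `volume_breach` are hypothetical helper names, the expected manifest is assumed to be a plain list of column names, and the baseline row count is assumed to come from your own control table.

```python
from dataclasses import dataclass


@dataclass
class SchemaDiff:
    missing: list      # expected columns absent from the incoming data
    unexpected: list   # incoming columns not listed in the manifest


def diff_schema(expected, incoming):
    """Compare column names case-insensitively, preserving input order."""
    incoming_lower = {c.lower() for c in incoming}
    expected_lower = {c.lower() for c in expected}
    return SchemaDiff(
        missing=[c for c in expected if c.lower() not in incoming_lower],
        unexpected=[c for c in incoming if c.lower() not in expected_lower],
    )


def volume_breach(current_rows, baseline_rows, threshold=0.25):
    """True when |delta| relative to the baseline exceeds the threshold."""
    if baseline_rows == 0:
        # No baseline to divide by: any rows at all count as a change.
        return current_rows != 0
    return abs(current_rows - baseline_rows) / baseline_rows > threshold


# Hypothetical manifest and incoming column list.
DIFF = diff_schema(
    expected=["OpportunityId", "OpportunityStageId"],
    incoming=["opportunityid", "Amount"],
)
```

A non-empty `DIFF.missing` is the condition to fail the pipeline on. The 12-rows-instead-of-1.2M case from earlier is caught by `volume_breach(12, 1_200_000)`.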

ADF. Route diagnostic logs to Log Analytics. Write a KQL alert on ADFActivityRun | where Status == 'Failed' grouped by pipeline name to collapse N activity failures into one pipeline incident. Wire action groups to a webhook, not just email.
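If the grouping happens after the logs leave Log Analytics, for example in the webhook receiver, the same collapse can be sketched in Python. The row shape is an assumption: dictionaries carrying PipelineName, PipelineRunId, ActivityName, Status, and an ISO-formatted Start, which you should check against your own ADFActivityRun export. `collapse_activity_failures` is a hypothetical name.

```python
from collections import defaultdict


def collapse_activity_failures(rows):
    """Turn N failed activity rows into one incident per pipeline run."""
    grouped = defaultdict(list)
    for row in rows:
        if row["Status"] == "Failed":
            grouped[(row["PipelineName"], row["PipelineRunId"])].append(row)

    incidents = []
    for (pipeline, run_id), failures in grouped.items():
        # ISO timestamps sort correctly as strings.
        failures.sort(key=lambda r: r["Start"])
        incidents.append({
            "pipeline": pipeline,
            "run_id": run_id,
            "first_failed_activity": failures[0]["ActivityName"],
            "failed_activities": [f["ActivityName"] for f in failures],
        })
    return incidents


# Hypothetical rows: two failures and one success in the same pipeline run.
ROWS = [
    {"PipelineName": "pl_load_crm", "PipelineRunId": "run-1",
     "ActivityName": "MergeSilver", "Status": "Failed", "Start": "2026-01-09T03:10:00Z"},
    {"PipelineName": "pl_load_crm", "PipelineRunId": "run-1",
     "ActivityName": "CopyBronze", "Status": "Failed", "Start": "2026-01-09T03:00:00Z"},
    {"PipelineName": "pl_load_crm", "PipelineRunId": "run-1",
     "ActivityName": "LookupManifest", "Status": "Succeeded", "Start": "2026-01-09T02:59:00Z"},
]

ADF_INCIDENTS = collapse_activity_failures(ROWS)
```

Keying on the pipeline run rather than the pipeline name keeps two separate failed runs of the same pipeline from merging into one incident.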

Databricks. Use job-level webhook notifications, not email. Query system.lakeflow.job_run_timeline for runs exceeding their median duration by 2x; this catches the slow-degradation case before the SLA-breach case. Set max_retries=2 with a minimum retry interval (min_retry_interval_millis) on tasks prone to transient cluster errors.
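The 2x-median rule itself is simple enough to sketch. The snippet below assumes you have already pulled recent run durations for one job into a list of seconds; how you query them is left out. `is_slow_run` and the `min_history` guard are hypothetical additions for illustration.

```python
from statistics import median


def is_slow_run(history_seconds, latest_seconds, factor=2.0, min_history=5):
    """Flag a run that took more than `factor` times the median duration."""
    if len(history_seconds) < min_history:
        # Too few runs for a stable median; stay quiet rather than guess.
        return False
    return latest_seconds > factor * median(history_seconds)


# A job that usually runs about 14 minutes and took 47 minutes today.
HISTORY = [840, 850, 830, 860, 845]
SLOW = is_slow_run(HISTORY, 47 * 60)
```

The median is used rather than the mean so that a single earlier outlier does not raise the baseline and hide the next slow run.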

dbt. Wire run_results.json and manifest.json into your incident system after every run. Treat state:modified test failures differently from new test failures — the latter are usually the noise. For dbt Core, the artifacts are in target/; for dbt Cloud, use the Discovery API.
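For dbt Core, extracting the failures from run_results.json might look like the sketch below. It relies on the artifact's `results` list with `unique_id`, `status`, `failures`, and `message` fields; verify those against the artifact schema version your dbt release writes. `failed_nodes` and `load_run_results` are hypothetical helper names, and what you do with the output (post to an incident system, a webhook) is up to you.

```python
import json
from pathlib import Path

# Statuses treated as failures: "error" for builds, "fail" for tests.
FAILING_STATUSES = {"error", "fail"}


def load_run_results(target_dir="target"):
    """Read run_results.json from a dbt Core target directory."""
    return json.loads(Path(target_dir, "run_results.json").read_text())


def failed_nodes(run_results):
    """Return one compact record per failed model or test."""
    failures = []
    for result in run_results.get("results", []):
        if result.get("status") in FAILING_STATUSES:
            failures.append({
                "unique_id": result["unique_id"],
                "status": result["status"],
                "failing_rows": result.get("failures"),
                "message": result.get("message"),
            })
    return failures


# Hypothetical artifact content mirroring the example earlier in the article.
SAMPLE = {
    "results": [
        {"unique_id": "model.shop.orders", "status": "success"},
        {"unique_id": "test.shop.not_null_orders_customer_id",
         "status": "fail", "failures": 14302,
         "message": "Got 14302 results, configured to fail if != 0"},
    ]
}

DBT_FAILURES = failed_nodes(SAMPLE)
```

In a CI step you would call `failed_nodes(load_run_results())` after `dbt build` and forward the result even when dbt exits non-zero.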

Power BI. Poll the Power BI REST API refresh history endpoint, /groups/{groupId}/datasets/{datasetId}/refreshes, every 5 minutes. Alert on status != Completed and on endTime - scheduledStartTime > expected_duration, where scheduledStartTime comes from the dataset's refresh schedule rather than from the refresh history entry itself. The second condition is what catches refresh_delayed, the case where the refresh technically succeeds but runs late because something upstream blocked it. Capacity throttling shows up in the Fabric Capacity Metrics app under CU Usage and Throttling.
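The two alert conditions can be expressed as one classification function. This sketch assumes a refresh history entry has already been fetched and parsed into a dictionary with `status` and `endTime`, and that the scheduled start and expected duration come from your own configuration. `classify_refresh` is a hypothetical name, and treating the status `Unknown` as "still running" is an assumption to confirm against the API documentation.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def classify_refresh(entry, scheduled_start, expected_duration):
    """Return 'in_progress', 'failed', 'refresh_delayed', or 'ok'."""
    status = entry.get("status")
    if status == "Unknown":
        # Assumption: an unfinished refresh reports Unknown and has no endTime.
        return "in_progress"
    if status != "Completed":
        return "failed"
    finished = parse_ts(entry["endTime"])
    if finished - scheduled_start > expected_duration:
        return "refresh_delayed"
    return "ok"


# A refresh scheduled for 06:00 UTC with a typical 25-minute runtime.
SCHEDULED = datetime(2026, 1, 9, 6, 0, tzinfo=timezone.utc)
EXPECTED = timedelta(minutes=25)

LATE = classify_refresh(
    {"status": "Completed", "endTime": "2026-01-09T10:30:00Z"},
    SCHEDULED,
    EXPECTED,
)
```

In practice `expected_duration` should include some slack on top of the typical runtime, otherwise a refresh that finishes a minute over will page someone.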

Gaps remain after all of this. The biggest one: nothing in the native toolset connects a dbt test failure to the Power BI dataset that depends on the failed model. You will write that linkage yourself, or you will adopt a tool that does.
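One way to write that linkage yourself is to declare each Power BI dataset as a dbt exposure and walk manifest.json from the failed node to the exposures that depend on it. The sketch below assumes that convention is in place; it reads the manifest's `parent_map`, `child_map`, and `exposures` sections, and `affected_exposures` plus every identifier in the sample manifest are hypothetical.

```python
from collections import deque


def affected_exposures(manifest, failed_id):
    """List exposures downstream of a failed model or test."""
    if failed_id.startswith("test."):
        # A failed test implicates the model(s) it tests, so start there.
        start = list(manifest.get("parent_map", {}).get(failed_id, []))
    else:
        start = [failed_id]

    child_map = manifest.get("child_map", {})
    impacted = set()
    queue = deque(start)
    while queue:
        node = queue.popleft()
        if node in impacted:
            continue
        impacted.add(node)
        queue.extend(child_map.get(node, []))

    return sorted(
        exposure_id
        for exposure_id, exposure in manifest.get("exposures", {}).items()
        if impacted.intersection(exposure.get("depends_on", {}).get("nodes", []))
    )


# Hypothetical manifest fragment: stg_orders -> mart_orders -> Power BI dataset.
MANIFEST = {
    "parent_map": {"test.shop.not_null_stg_orders_customer_id": ["model.shop.stg_orders"]},
    "child_map": {
        "model.shop.stg_orders": ["model.shop.mart_orders"],
        "model.shop.mart_orders": [],
    },
    "exposures": {
        "exposure.shop.powerbi_sales": {"depends_on": {"nodes": ["model.shop.mart_orders"]}},
        "exposure.shop.powerbi_hr": {"depends_on": {"nodes": ["model.shop.mart_people"]}},
    },
}

HIT = affected_exposures(MANIFEST, "test.shop.not_null_stg_orders_customer_id")
```

The exposure IDs returned are what you would map to Power BI dataset IDs when deciding which refreshes to flag or hold.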

Where MetricSign fits

MetricSign was built for this specific correlation problem on the Microsoft stack. It connects to ADF, Databricks, dbt (Core and Cloud), Fabric Data Pipelines, and the Power BI Admin API, then correlates events across them into a single incident with a timeline and a lineage graph.

The scenario from earlier, a Dynamics schema change cascading to a failed Power BI refresh, surfaces as one incident with the three failed runs ordered by timestamp, the lineage path from source to dataset highlighted, and the originating event (the schema change) flagged as root cause. The refresh_delayed signal is first-class: a Power BI refresh that succeeded but ran 4 hours past its expected window is treated as an incident, not a green checkmark. That is what generic observability tools, built for warehouse-first stacks, do not do.

Frequently asked questions

Is Azure Monitor enough for data observability on the Microsoft stack?
Azure Monitor is excellent at the infrastructure and ADF activity layer — Log Analytics with KQL alerts on ADFActivityRun is the right primitive for ADF failures. It does not cover dbt run results, Power BI dataset refresh state, or Fabric pipeline runs in any first-class way. You can ingest custom logs from those systems, but you still write the correlation logic yourself. Azure Monitor is necessary but not sufficient.
How is data observability different from data quality?
Data quality asks: are the values in this column correct? It is typically implemented as tests (dbt tests, Great Expectations, ADF data flow assertions). Data observability asks: did the system that produces this column run, run on time, and produce the expected volume and schema? A row-level uniqueness test is data quality. Detecting that the Power BI refresh ran 4 hours late because the Databricks job was queued behind a capacity throttle is observability. Most production incidents are observability problems, not quality problems.
Does MetricSign work with dbt Core as well as dbt Cloud?
Yes. For dbt Cloud, MetricSign uses the Discovery API and the Admin API for run metadata. For dbt Core, it ingests run_results.json and manifest.json artifacts via a post-hook or CI step — same model, just a different transport. Test failures, model build failures, and source freshness errors surface identically in the incident timeline regardless of which runtime produced them.
What about Microsoft Purview for lineage: does that solve this?
Purview gives you static lineage: it knows that a Power BI dataset depends on a Databricks table that depends on an ADLS file. What it does not give you is runtime event correlation — it will not tell you that today's dataset refresh failed because today's Databricks job had an OOM. You need both: the lineage graph from Purview (or dbt's manifest) and the runtime events from each system, joined together. That join is the observability product.
How do you detect a Power BI refresh that ran but ran late?
Poll the Power BI REST API refresh history endpoint, /groups/{groupId}/datasets/{datasetId}/refreshes, and compare endTime against the scheduled start time plus an expected duration. If a refresh scheduled for 06:00 with a typical 25-minute runtime completes at 10:30, status will be 'Completed' but the refresh is effectively stale from a stakeholder's perspective. Most monitoring setups only alert on status != Completed and miss this case entirely. MetricSign exposes it as a refresh_delayed signal.
We're migrating from ADF to Fabric Data Pipelines. Does the observability story change?
The failure modes are largely the same (activity failures, gateway issues, capacity throttling) but the monitoring API is different: Fabric exposes its own pipeline run endpoint under the workspace, separate from the ADF REST API. Most generic observability tools have no Fabric integration as of mid-2026. If you are mid-migration and running both surfaces, you need a tool that polls both APIs and treats the underlying pipeline as one logical entity.
