MetricSign

End-to-End Data Lineage: From ADF to Power BI

Without a map of your data chain, every investigation starts from scratch.

The Diagnosis Problem

Your Power BI dashboard is showing wrong numbers. Where do you start? Without lineage metadata, the answer is: anywhere. You check the dataset refresh history in Power BI Service. You look at ADF pipeline runs in Azure Monitor. You query the staging database to see if the data looks reasonable. You read through Power Query steps looking for something obvious.

This process — investigating a failure without a map of the data flow — takes anywhere from 30 minutes to several hours depending on the complexity of the pipeline and the experience of the engineer doing the diagnosis. For a simple pipeline it might be quick. For a chain that includes ADF, Databricks, dbt, Azure SQL, and Power BI, it's a detective investigation that rarely finishes before someone is impatiently asking for an ETA.

Data lineage solves this. Lineage is a map of how data flows through your systems: which pipeline produced which table, which transformation read from that table and wrote to another, which Power BI dataset pulls from that output, and which reports are built on that dataset. With lineage, a failure at any point in the chain tells you exactly what's affected downstream — and exactly what to investigate upstream.

The Chain: From ADF to Power BI

A typical enterprise data chain looks something like this:

  1. Source systems: SQL Server databases, SAP exports, REST APIs, SFTP files, Dataverse
  2. Azure Data Factory: orchestrates movement and initial transformation from sources to staging
  3. Staging layer: Azure SQL Database, Azure Data Lake, or Synapse Analytics — the landing zone
  4. Transformation layer: Databricks notebooks, dbt models, or Synapse SQL transformations that build analytical models
  5. Serving layer: Azure SQL or Synapse Analytics with clean, query-optimized tables
  6. Power BI datasets: semantic models that pull from the serving layer
  7. Power BI reports: visualizations built on those datasets

At each step, something can go wrong. The ADF pipeline can fail or copy empty data. The Databricks job can error. The dbt model can produce incorrect output. The Power BI refresh can fail or load stale data. Any of these failures has a downstream effect on every subsequent layer.

Without lineage, you're checking each step independently in separate monitoring consoles. With lineage, you see the chain in one view: the ADF pipeline at step 2 failed — datasets X, Y, and Z at step 6 are affected — reports A, B, and C at step 7 are currently serving stale data.
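That single-view chain can be represented with a very small data model: one node per asset (pipeline, table, dataset, report) and edges pointing downstream. A minimal sketch in Python — all asset names here are hypothetical examples, not real pipelines or reports:

```python
from dataclasses import dataclass, field

@dataclass
class LineageNode:
    """One asset in the chain; edges point downstream."""
    name: str
    kind: str                    # "adf_pipeline", "table", "dataset", "report", ...
    downstream: list = field(default_factory=list)  # names of assets fed by this one

# A toy five-node slice of the seven-layer chain described above.
graph = {
    "pl_copy_sales":  LineageNode("pl_copy_sales",  "adf_pipeline", ["stg.sales"]),
    "stg.sales":      LineageNode("stg.sales",      "table",        ["dbo.fact_sales"]),
    "dbo.fact_sales": LineageNode("dbo.fact_sales", "table",        ["Sales Overview"]),
    "Sales Overview": LineageNode("Sales Overview", "dataset",      ["Weekly Sales Report"]),
    "Weekly Sales Report": LineageNode("Weekly Sales Report", "report", []),
}
```

A failure at `pl_copy_sales` can then be followed edge by edge to the affected report, instead of being reconstructed from four separate monitoring consoles.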

Where Lineage Data Comes From

Lineage isn't a single data source — it's assembled from multiple signals.

Pipeline output tables: ADF pipelines write to specific tables. Capturing the pipeline name and the destination table on each run gives you a direct link between pipeline runs and database state.

dbt manifest: The manifest file documents the full DAG of dbt models — which models depend on which sources, which sources connect to which databases. This is one of the richest lineage sources available, and it's already generated as part of every dbt build.
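Extracting that DAG is a few lines of JSON parsing. In the manifest schema dbt documents, each node lists its parents under `depends_on.nodes`; the path `target/manifest.json` is dbt's default output location:

```python
import json

def manifest_edges(manifest):
    """(parent, child) dependency pairs from an in-memory dbt manifest dict."""
    return [
        (parent, child)
        for child, node in manifest.get("nodes", {}).items()
        for parent in node.get("depends_on", {}).get("nodes", [])
    ]

def dbt_edges(manifest_path="target/manifest.json"):
    """Load the manifest dbt writes on every build and extract the model DAG."""
    with open(manifest_path) as f:
        return manifest_edges(json.load(f))
```

Running this after each `dbt build` keeps the transformation-layer slice of the lineage graph current without any extra instrumentation.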

Power BI datasource metadata: The Power BI REST API returns the data source for each dataset — the server, database, and table or view name. This provides the link between the serving layer and the Power BI model.
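A sketch of that call, using the `GET /datasets/{id}/datasources` endpoint of the Power BI REST API. The token acquisition is out of scope here (it needs an Azure AD token with dataset read permission), and `dataset_id` is a placeholder:

```python
import json
import urllib.request

PBI_API = "https://api.powerbi.com/v1.0/myorg"

def parse_datasources(payload):
    """Pull (server, database) pairs out of a datasources response body."""
    return [
        (d.get("connectionDetails", {}).get("server"),
         d.get("connectionDetails", {}).get("database"))
        for d in payload.get("value", [])
    ]

def dataset_sources(dataset_id, token):
    """Fetch the data sources behind one Power BI dataset."""
    req = urllib.request.Request(
        f"{PBI_API}/datasets/{dataset_id}/datasources",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return parse_datasources(json.load(resp))
```

The `(server, database)` pairs are what get matched against the serving-layer tables in the rest of the graph.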

Databricks job metadata: Databricks jobs read from and write to Delta tables. The job run API provides enough metadata to trace which job processed which data and when.
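As a sketch, the Jobs API (`runs/get` in version 2.1) returns per-task state for a run, which is exactly the "which notebook broke" signal from the scenario later in this article. Host, run ID, and token are placeholders for your workspace:

```python
import json
import urllib.request

def failed_tasks(run_payload):
    """From a runs/get response, list task keys that did not succeed."""
    return [
        t["task_key"]
        for t in run_payload.get("tasks", [])
        if t.get("state", {}).get("result_state") != "SUCCESS"
    ]

def get_run(host, run_id, token):
    """Fetch one job run's metadata from a Databricks workspace."""
    req = urllib.request.Request(
        f"{host}/api/2.1/jobs/runs/get?run_id={run_id}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```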

Assembling these signals into a coherent lineage graph requires matching on common identifiers — table names, database names, server hostnames. The matching is imperfect in practice because naming conventions aren't always consistent. But even 70% coverage is dramatically more useful than no lineage at all. You're not aiming for perfect documentation; you're aiming for actionable investigation shortcuts.
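A minimal sketch of that matching step. The normalization rules here (lowercasing, stripping brackets and quotes) are illustrative — every environment needs its own — and unmatched names simply produce no edge, which is where the partial coverage comes from:

```python
def norm(identifier):
    """Normalize a table identifier so 'DBO.Fact_Sales' and '[dbo].[fact_sales]' match."""
    s = identifier.lower().strip()
    for ch in '[]"`':
        s = s.replace(ch, "")
    return s

def link(pipeline_outputs, dataset_sources):
    """Join pipeline destination tables to dataset source tables on normalized name.

    pipeline_outputs: (pipeline_name, destination_table) pairs
    dataset_sources:  (dataset_name, source_table) pairs
    Returns (pipeline, dataset) edges for every name match.
    """
    by_table = {}
    for pipeline, table in pipeline_outputs:
        by_table.setdefault(norm(table), []).append(pipeline)
    return [
        (pipeline, dataset)
        for dataset, table in dataset_sources
        for pipeline in by_table.get(norm(table), [])
    ]
```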

Lineage-Aware Monitoring in Practice

Here's the same scenario — wrong numbers in a Power BI dashboard — handled with and without lineage.

Without lineage: You open Power BI Service, check the refresh history (succeeded). You check ADF Monitor (no failures visible in the pipeline summary). You query the staging table (data looks present). You check the Databricks job (completed with 2 warnings). After 90 minutes, you find that one notebook in the Databricks job errored silently, producing partial output. The partial output was loaded into Power BI.

With lineage: The monitoring system shows: Sales Overview dataset → sales_reporting dbt model → databricks_daily_transform → failed notebook compute_margins. You see the impact immediately: three datasets depend on this output, eight reports are currently serving potentially incorrect data. The on-call engineer receives an alert with this context already assembled. Investigation time: under 10 minutes.

The difference isn't just speed — it's confidence. Without lineage, you might miss an affected dataset and tell stakeholders the problem is fixed when it isn't. With lineage, impact assessment is systematic and complete.

Upstream vs. Downstream: Two Directions of Lineage

Lineage works in both directions, and each direction serves a different purpose.

Upstream traversal (root cause): This dataset is showing wrong data — which pipeline produced it? Which source system fed that pipeline? Did the pipeline run on schedule? Did it load the expected rows? Upstream traversal is the investigation direction: you start at the broken thing and trace backward to the cause.

Downstream traversal (impact): This ADF pipeline failed — which datasets depend on its output? Which reports are built on those datasets? Which teams use those reports and need to be notified? Downstream traversal is the communication direction: you start at the failure and trace forward to understand blast radius.
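Both directions are the same graph traversal with the edges flipped. A sketch over a flat list of (parent, child) edges, with hypothetical asset names:

```python
from collections import deque

def traverse(edges, start, direction="downstream"):
    """BFS over lineage edges.

    'downstream' walks toward affected assets (impact / blast radius);
    'upstream' walks toward root causes. Returns every reachable asset.
    """
    nbrs = {}
    for parent, child in edges:
        a, b = (parent, child) if direction == "downstream" else (child, parent)
        nbrs.setdefault(a, set()).add(b)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in nbrs.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

One edge list, two questions answered: `traverse(edges, "pl_copy_sales")` gives the blast radius of a pipeline failure, and `traverse(edges, "Weekly Report", "upstream")` gives the suspects behind a broken report.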

For teams running a large Power BI environment — dozens of workspaces, hundreds of datasets, thousands of reports — downstream impact analysis is the single highest-value capability that lineage provides. You can instantly answer "this ADF pipeline failed — which reports are affected?" rather than manually checking every dataset's data source configuration.

Proactive monitoring becomes possible once you have this map. When an ADF pipeline fails at 02:30 and three Power BI datasets are scheduled to refresh at 05:00, a lineage-aware system can alert the on-call engineer at 02:30 with a list of at-risk datasets — before those refreshes run and surface stale data to early-morning users.
