A renamed column and 14 broken reports
Last Tuesday, the CFO opened the Power BI report for the monthly close review. 'Sales by Region' showed a blank table. The data engineer checked Power BI Service: dataset refresh successful, no errors logged. Checked Azure Data Factory: pipeline ran, 147,000 rows copied. Checked Microsoft Fabric: data landed correctly.
Three tools. Three green statuses. One blank report.
The actual problem: a source analyst had renamed a column in the SQL Server staging table three days earlier. The ADF copy activity still ran. It wrote null to the column that fed 14 downstream reports. Without a lineage tool, the data engineer had no way to trace which reports depended on that column, or which column in the source table had changed.
Data lineage tools map how data moves through your pipeline, from source systems and ingestion jobs through transformation layers to the dashboards and reports your business runs on. When a schema change breaks something downstream, lineage tells you which reports are affected and which upstream job caused it.
This guide covers the four main categories of lineage tools, what each is built for, and how to choose one if you're running a Microsoft stack: Power BI, ADF, and Fabric, often with dbt as the transformation layer.
Table-level vs. column-level lineage: the distinction that matters
Most lineage tools start at the table level. They track which pipeline reads from table A and writes to table B, and which dataset reads from table B to produce report C. That is enough to understand dependency chains: 'if source table X changes, report Y might be affected.'
Column-level lineage goes deeper. It tracks which specific columns flow from source to destination, including through each transformation step. When the CFO's report shows blank data because a region_name column was renamed, table-level lineage tells you that the report depends on the pipeline. Column-level lineage tells you specifically that region_name fed the visual that broke, and that the same column appears in 13 other reports.
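The difference can be sketched in a few lines of Python. This is a minimal illustration, not how any particular lineage tool stores its graph; the table, column, and report names are hypothetical.

```python
# Hypothetical lineage records; every name here is illustrative only.
# Table-level: which reports read from which table.
TABLE_LINEAGE = {
    "staging.sales": {"Sales by Region", "Regional Forecast", "Margin Overview"},
}

# Column-level: which reports read which column of that table.
COLUMN_LINEAGE = {
    ("staging.sales", "region_name"): {"Sales by Region", "Regional Forecast"},
    ("staging.sales", "unit_cost"): {"Margin Overview"},
}


def reports_for_table(table: str) -> set[str]:
    """Table-level answer: every report that touches the table."""
    return TABLE_LINEAGE.get(table, set())


def reports_for_column(table: str, column: str) -> set[str]:
    """Column-level answer: only the reports that read this column."""
    return COLUMN_LINEAGE.get((table, column), set())


if __name__ == "__main__":
    print(sorted(reports_for_table("staging.sales")))
    print(sorted(reports_for_column("staging.sales", "region_name")))
```

The table-level lookup flags all three reports as possibly affected. The column-level lookup narrows that to the two that actually read region_name.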
The distinction matters most during schema drift: the gradual, often unannounced changes to source table structures that come from SaaS API updates, operational database schema changes, or upstream team decisions that don't propagate as breaking errors. Table-level lineage detects that a dependency exists. Column-level lineage detects that the specific field your transformation depends on has changed.
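A rough way to picture drift detection is a diff between two snapshots of a table's column names, checked against the columns a downstream transformation needs. The sketch below assumes you already have those snapshots as sets; the function names and sample columns are hypothetical.

```python
def diff_columns(previous: set[str], current: set[str]) -> dict[str, set[str]]:
    """Compare two column-name snapshots of the same table."""
    return {
        "removed": previous - current,
        "added": current - previous,
    }


def broken_dependencies(removed: set[str], required: set[str]) -> set[str]:
    """Columns a downstream transformation needs that no longer exist upstream."""
    return removed & required


if __name__ == "__main__":
    # Hypothetical snapshots taken before and after a rename.
    before = {"order_id", "region_name", "amount"}
    after = {"order_id", "region", "amount"}
    drift = diff_columns(before, after)
    print(drift)
    print(broken_dependencies(drift["removed"], {"region_name", "amount"}))
```

Note that a rename shows up as one removed column plus one added column. A name-based diff cannot tell a rename from an unrelated drop and add, so the pairing is left to a person or to a heuristic.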
For Microsoft stack teams running dbt transformations on top of Fabric lakehouses, column-level lineage also helps with impact analysis before you ship: rename a dbt model column, and you can see exactly which Power BI measures and calculated columns reference it.
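As a rough illustration of that kind of impact check, the sketch below scans DAX measure expressions for references to a given table and column. It assumes the measure definitions have already been exported into a dictionary and that the table name in the semantic model matches the dbt model name; both are assumptions, and the regular expression is a naive text match, not a DAX parser (it ignores escaped brackets, for instance).

```python
import re

# Matches Table[Column] and 'Table Name'[Column]. Naive by design.
COLUMN_REF = re.compile(r"(?:'([^']+)'|\b(\w+))\[([^\]]+)\]")


def measures_referencing(measures: dict[str, str], table: str, column: str) -> list[str]:
    """Return the names of measures whose expression mentions table[column]."""
    hits = []
    for name, expression in measures.items():
        for quoted, bare, col in COLUMN_REF.findall(expression):
            # DAX identifiers are case-insensitive, so compare in lowercase.
            if (quoted or bare).lower() == table.lower() and col.lower() == column.lower():
                hits.append(name)
                break
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical measures; fct_sales is an illustrative model name.
    measures = {
        "Total Sales": "SUM(fct_sales[amount])",
        "Regions Sold": "DISTINCTCOUNT(fct_sales[region_name])",
        "West Sales": "CALCULATE(SUM(fct_sales[amount]), 'fct_sales'[region_name] = \"West\")",
    }
    print(measures_referencing(measures, "fct_sales", "region_name"))
```

A lineage tool with real column-level support would resolve these references from model metadata rather than text, but the question it answers is the same.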
What Power BI's native lineage view can and cannot do
Power BI Service has included a lineage view since 2020. In the workspace view, you can see a graph showing data sources, dataflows, datasets, and reports in a dependency chain. For a single workspace with straightforward data sources, this covers the basics: report A depends on dataset B, dataset B connects to an Azure SQL database.
The limits appear as soon as your stack crosses tool boundaries.
Power BI's lineage view does not show upstream pipeline history. If the dataset connects to a Fabric lakehouse populated by an ADF pipeline, the lineage view shows 'Lakehouse' as the source, not the ADF pipeline run that filled it or the dbt model that shaped the data before it arrived.
It does not show column-level dependencies. You know report A uses dataset B. You do not know which specific columns from dataset B power which visuals in report A.
Cross-workspace dependencies are also invisible. In organizations running development, test, and production workspaces (or when different teams own datasets in separate workspaces), the native lineage view shows only the workspace you're currently in.
For teams with a simple stack (a few datasets directly connected to Azure SQL or SharePoint), the native view covers the basics. For teams whose Power BI reports sit at the end of a chain involving ADF, dbt, and Fabric, the native view shows the last meter of a much longer path.
End-to-end lineage in the Microsoft stack: where the gaps are
A typical Microsoft data stack connects Azure Data Factory or Fabric Data Pipelines to source systems, moves data into a Fabric lakehouse or Synapse warehouse, transforms it with dbt, and surfaces it through Power BI datasets and reports. Four layers. Four distinct APIs. Four independent logging systems.
The gaps emerge at the seams. ADF logs which tables it copied and how many rows, not which columns, and not which downstream dbt models depend on that output. dbt Cloud tracks which models ran and which tests failed, but its lineage is internal to the dbt project: it is not connected to the ADF pipeline upstream or the Power BI dataset downstream. Power BI Service tracks dataset refreshes and report views, but not what changed in the Fabric lakehouse that the dataset reads from.
Microsoft Purview closes some of these gaps. It scans ADF, Power BI, Fabric, and Azure SQL sources to build a unified catalog with lineage metadata. For governance purposes (data ownership, classification, and compliance), Purview is the native Microsoft answer.
Purview does have operational limits. Its scan-based architecture means lineage metadata updates on a scheduled scan cycle, not in real time. If an ADF pipeline fails at 03:00 and the next Purview scan runs at 06:00, the lineage graph continues to reflect the state from the previous scan until then. That makes Purview better suited to governance queries than to real-time incident detection, such as change impact alerting before the 07:00 Power BI refresh window.
Four categories of data lineage tools
The lineage tool market splits into four categories that serve different use cases. Picking the right one depends more on what you need lineage for than on feature comparison scores.
Microsoft Purview: governance lineage
Purview is the right choice when your driver is governance, compliance, or catalog management rather than operational monitoring. It integrates natively with Azure services, supports data classification and sensitivity labels, and connects to Microsoft Information Protection. For teams that need to prove to auditors which systems contain PII, Purview is built for that problem. Its lineage capabilities work well as a reference layer (a map of what depends on what), queried by data stewards and compliance teams, not by on-call engineers responding to a 03:00 pipeline failure.
Collibra and Alation: enterprise data governance platforms
Both platforms build lineage as part of a broader data catalog and governance framework. They are designed for large organizations with dedicated governance teams, compliance requirements, and the engineering bandwidth to implement a full catalog. Lineage is one feature among many: data glossary, quality rules, stewardship workflows, and policy management. If your organization runs data governance at scale, with budget and team capacity allocated specifically to it, either platform can serve. If your team of 4–8 data engineers needs operational visibility into why a report broke this morning, the governance-first scope and enterprise price point are more than you need.
OpenLineage and DataHub: open source options
OpenLineage is an open standard, hosted by the LF AI & Data Foundation (part of the Linux Foundation), for emitting lineage events from data pipelines. DataHub is an open source data catalog that can consume those events. OpenLineage integrations exist for dbt, Airflow, and Spark. The main requirement is instrumentation: every pipeline that should emit lineage events needs an OpenLineage client configured. For teams running Airflow with dedicated platform engineering resources, this approach gives full control over the lineage graph at no license cost. For teams running primarily managed Microsoft services (ADF, Fabric Data Pipelines, Power BI) where you cannot instrument the tool's internals, the open source path requires custom connectors that most teams do not want to build and maintain.
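To give a sense of what a custom connector would have to produce, here is a minimal sketch that builds an OpenLineage-style run event as a plain dictionary, using only the Python standard library. The field names follow the OpenLineage RunEvent structure; the namespaces, job name, producer URI, and the spec version in the schema URL are assumptions to check against the spec version you target.

```python
import json
import uuid
from datetime import datetime, timezone

# Assumption: verify the version segment against the OpenLineage spec you use.
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"
# Hypothetical identifiers for this sketch.
PRODUCER = "https://example.com/adf-lineage-connector"
JOB_NAMESPACE = "adf://example-factory"


def build_run_event(event_type, job_name, inputs, outputs, run_id=None):
    """Build one run event; inputs and outputs are (namespace, name) pairs."""
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": JOB_NAMESPACE, "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
    }


if __name__ == "__main__":
    event = build_run_event(
        "COMPLETE",
        "copy_sales_staging",
        inputs=[("sqlserver://staging-host", "dbo.sales")],
        outputs=[("fabric://example-lakehouse", "sales")],
    )
    print(json.dumps(event, indent=2))
```

A connector for a managed service would build events like this from the service's run history and send them to a lineage backend. That polling, mapping, and delivery logic is the part you would have to write and maintain yourself.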
MetricSign: monitoring-first lineage for the Microsoft stack
MetricSign builds lineage from activity logs rather than schema scans or code instrumentation. It reads from Power BI's activity log, ADF's pipeline run API, Fabric's job history, dbt Cloud's run API, and Databricks' job logs, and constructs a real-time lineage graph from those operational events. The primary use case is operational: when an ADF pipeline fails, MetricSign identifies which Power BI datasets depend on it and sends an alert before the next refresh runs. More detail in the section below.
| Tool | Best for | Microsoft native | Column-level | Real-time | Pricing |
|---|---|---|---|---|---|
| Microsoft Purview | Governance and compliance | Yes | Partial | No (scan cycle) | Azure consumption |
| Collibra / Alation | Enterprise data catalog | No | Yes | No | Enterprise (contact) |
| OpenLineage / DataHub | Platform teams with Airflow | No | Yes (with config) | Yes | Open source |
| MetricSign | Operational monitoring, Microsoft stack | Yes (via APIs) | Partial | Yes | Free tier + usage |
How to choose: three scenarios
Three paths cover most teams evaluating lineage tooling for a Microsoft stack.
Scenario 1: You need to know why a report broke and what else is affected
This is an operational use case. The right tool connects to your Power BI activity log, your ADF pipeline run history, and your dbt model runs, and correlates them into a single incident view. Purview can answer this question in principle, but requires a completed scan cycle to reflect current state. A monitoring-first tool answers it within minutes of the pipeline failure.
Scenario 2: Governance and compliance are driving the requirement
If your organization needs to classify data assets, track lineage for regulatory purposes, assign data stewardship, or build a business glossary, Purview is the first option to evaluate, especially if you're invested in the Azure ecosystem. For larger organizations with dedicated governance teams, Collibra or Alation add catalog depth and stewardship workflows that Purview does not include.
Scenario 3: Your team has platform engineering resources and wants maximum control
If you're running Airflow, have a data platform team that writes infrastructure code, and want a lineage graph you fully own, OpenLineage with DataHub is a viable path. Budget the instrumentation work at several weeks per pipeline type you want to emit lineage from, plus ongoing maintenance as pipeline frameworks release new versions.
One honest observation: most data teams running a Microsoft stack (4 to 10 engineers, Power BI as the consumption layer, no dedicated platform engineering function) are served better by Scenario 1. A governance platform like Collibra is designed for an organizational context where catalog management is a full-time function.
MetricSign: lineage built from your stack's own activity logs
MetricSign builds a live lineage graph by reading the APIs your existing Microsoft tools already expose. It connects to Power BI Service to read dataset and report dependencies, to Azure Data Factory for pipeline run history and output tables, to Fabric for job history, to dbt Cloud for model run results, and to Databricks for job logs.
From those feeds, it constructs a graph: ADF pipeline X writes to Fabric lakehouse table Y, dbt model Z reads from table Y and builds the table that semantic model A loads, Power BI report B reads from semantic model A. When ADF pipeline X fails at 03:47, MetricSign traverses the graph, identifies semantic model A and Power BI report B as downstream dependents, and sends an alert before the 07:00 Power BI refresh starts.
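The traversal in that example can be sketched as a breadth-first walk over a dependency graph. This is a generic illustration built from the names in the paragraph above, not MetricSign's implementation; the date and the alert rule are assumptions.

```python
from collections import deque
from datetime import datetime

# Edges from the example: each asset maps to the assets that read from it.
DOWNSTREAM = {
    "ADF pipeline X": ["lakehouse table Y"],
    "lakehouse table Y": ["dbt model Z"],
    "dbt model Z": ["semantic model A"],
    "semantic model A": ["Power BI report B"],
}


def downstream_of(asset: str) -> list[str]:
    """Breadth-first walk returning every asset that depends on `asset`."""
    seen, order, queue = {asset}, [], deque([asset])
    while queue:
        for child in DOWNSTREAM.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def alert_needed(failed_at: datetime, next_refresh: datetime) -> bool:
    """Assumed rule: alert if the failure precedes the next scheduled refresh."""
    return failed_at < next_refresh


if __name__ == "__main__":
    failed_at = datetime(2024, 1, 9, 3, 47)      # hypothetical date
    next_refresh = datetime(2024, 1, 9, 7, 0)
    if alert_needed(failed_at, next_refresh):
        print("Affected:", downstream_of("ADF pipeline X"))
```

The walk visits each asset once, so shared dependencies (two reports on one semantic model, for example) are not reported twice.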
This differs from Purview's scan-based lineage in one specific way: MetricSign operates on pipeline run events, not metadata snapshots. The lineage graph updates within minutes of a pipeline completing. For impact analysis before a schema change, a query on 'which reports depend on this dbt model column' returns a current answer, not the state from the last scan cycle.
MetricSign is free to start. It connects to Power BI and ADF in under 15 minutes without changes to your existing pipelines. The use case it's built for: something breaks in your Microsoft stack at 03:00, and you need to know which reports are affected before the business day starts at 07:00.