MetricSign

Data Pipelines Need Lineage, Not Just Data Monitoring Software

Data monitoring software tells you what broke. Lineage tells you why — and what it's taking down with it.

Monitoring is a smoke detector. Lineage is the building map.

Most teams adopt data monitoring software, see incidents drop in the first month, and then plateau. The reason is one we keep running into: monitoring tells you that something is wrong, but not what depends on it. That requires lineage.

A smoke detector is indispensable. It tells you something is burning. But it doesn't tell you where the fire started, whether the kitchen is safe to enter, or which exits are blocked. When the alarm fires at 03:00, you have urgency without direction.

Data monitoring is the smoke detector. It tells you that something is wrong — a refresh failed, a volume dropped, a watermark went stale. What it cannot tell you is which upstream component caused it, which other components depend on the one that failed, or how many downstream reports are currently serving wrong data.

Data lineage is the building map. It shows the structure of your data pipeline: which systems feed which tables, which tables feed which transformations, which transformations feed which reports. When the alarm fires, the map tells you where to go.

Monitoring and lineage are not alternatives. Monitoring without lineage produces alarms with no direction — the engineer investigates by trying every path. Lineage without monitoring provides a map but no early warning — you learn about problems when users find them, and then you know where to look.

Seven steps to find one problem: the investigation without lineage

The typical investigation without lineage follows a predictable, slow pattern. It starts with a trigger: an alert fires or, as in this example, a user reports wrong data.

  1. Engineer checks Power BI Service refresh logs: refresh succeeded
  2. Engineer checks ADF pipeline runs: pipeline succeeded
  3. Engineer queries the database directly: data appears present
  4. Engineer checks dbt Cloud job: succeeded with two warnings
  5. Engineer reads the warnings: one model failed due to null values in an upstream table
  6. Engineer confirms this is the root cause
  7. Engineer manually checks which other datasets depend on the same model — one by one

That last step is where most of the time goes. Without lineage, figuring out downstream impact means opening each dataset in Power BI Service, navigating to its datasource configuration, and checking whether it reads from the affected table. For 50 datasets, that's 45–60 minutes. For 200, it can't realistically be finished before the business day starts.

Finding the root cause here took six steps. The impact assessment took as long as all six combined.

A useful lineage map answers three specific questions reliably

A data lineage map doesn't need to be a perfectly documented graph database with every edge tracked. It needs to answer three questions reliably: what does a failing component affect downstream, what produced the data that's now wrong, and are there assets currently at risk that haven't failed yet?

Forward traversal starts from the failing component — a dbt model, an ADF pipeline, a database table — and walks forward. Which Power BI datasets read from it? Which reports are built on those datasets? This is what you need in the first five minutes of an incident, before you touch anything.
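A minimal sketch of forward traversal, assuming the lineage map is held as a plain dictionary from each component to the components that read from it. The asset names and the `downstream` helper are hypothetical and only illustrate the idea.

```python
from collections import deque

# Hypothetical lineage edges: component -> components that read from it.
LINEAGE = {
    "table:order_line_items": ["dbt:compute_margins"],
    "dbt:compute_margins": ["dataset:Sales Overview", "dataset:Revenue by Region"],
    "dataset:Sales Overview": ["report:Sales Dashboard"],
    "dataset:Revenue by Region": ["report:Regional Review"],
}


def downstream(graph, start):
    """Return every asset reachable from `start`, in breadth-first order."""
    seen = {start}
    queue = deque([start])
    affected = []
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:  # guard against shared dependencies and cycles
                seen.add(child)
                queue.append(child)
                affected.append(child)
    return affected


impacted = downstream(LINEAGE, "dbt:compute_margins")
print(impacted)
```

Breadth-first order lists the directly dependent datasets before the reports built on them, which matches the order in which you would want to triage.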

Backward traversal runs the other direction. You start from the broken report or the anomalous dataset and work upstream. Which pipeline loads its data source? Did that pipeline run on schedule and at full volume? What's the likely cause?
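Backward traversal can be sketched by inverting the same kind of edge list and walking from the broken report toward its sources. The edges, the `LAST_RUN` statuses, and both helper functions below are assumptions made for illustration, not the output of any particular tool.

```python
# Hypothetical lineage edges: component -> components that read from it.
EDGES = {
    "pipeline:load_orders": ["table:order_line_items"],
    "table:order_line_items": ["dbt:compute_margins"],
    "dbt:compute_margins": ["dataset:Sales Overview"],
    "dataset:Sales Overview": ["report:Sales Dashboard"],
}

# Hypothetical last-run status per component, as a monitoring tool might record it.
LAST_RUN = {
    "pipeline:load_orders": "succeeded",
    "dbt:compute_margins": "failed",
}


def invert(edges):
    """Build a child -> parents mapping from a parent -> children mapping."""
    parents = {}
    for parent, children in edges.items():
        for child in children:
            parents.setdefault(child, []).append(parent)
    return parents


def upstream(edges, start):
    """Return all ancestors of `start`, nearest first."""
    parents = invert(edges)
    seen, frontier, ancestors = {start}, [start], []
    while frontier:
        next_frontier = []
        for node in frontier:
            for parent in parents.get(node, []):
                if parent not in seen:
                    seen.add(parent)
                    ancestors.append(parent)
                    next_frontier.append(parent)
        frontier = next_frontier
    return ancestors


ancestors = upstream(EDGES, "report:Sales Dashboard")
suspects = [a for a in ancestors if LAST_RUN.get(a) == "failed"]
print(suspects)
```

Joining the ancestor list with run status turns "what feeds this report?" into a short list of likely causes.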

The third one is the one most teams skip: proactive risk identification. Are there datasets currently scheduled to refresh that depend on a component that just failed? Before they run and load stale data, can they be paused or the upstream issue fixed first?
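As a rough sketch of that check, assume you already know each component's direct dependents and each dataset's next scheduled refresh. The dictionaries, the dates, and the `at_risk` helper are all hypothetical.

```python
from datetime import datetime

# Hypothetical direct dependencies: upstream component -> datasets reading its output.
DEPENDENTS = {
    "dbt:daily_sales": ["Sales Overview", "Revenue by Region", "Monthly Actuals"],
    "dbt:daily_hr": ["Headcount"],
}

# Hypothetical next scheduled refresh per dataset.
NEXT_REFRESH = {
    "Sales Overview": datetime(2024, 1, 15, 5, 0),
    "Revenue by Region": datetime(2024, 1, 15, 5, 0),
    "Monthly Actuals": datetime(2024, 1, 15, 5, 0),
    "Headcount": datetime(2024, 1, 15, 6, 0),
}


def at_risk(failed_component, failed_at):
    """Datasets that depend on the failed component and have not refreshed yet."""
    return [
        name
        for name in DEPENDENTS.get(failed_component, [])
        if name in NEXT_REFRESH and NEXT_REFRESH[name] > failed_at
    ]


pending = at_risk("dbt:daily_sales", datetime(2024, 1, 15, 2, 30))
print(pending)
```

Anything in `pending` is a candidate for pausing until the upstream issue is fixed.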

In practice, a lineage map assembled from ADF pipeline logs, dbt manifests, and Power BI datasource metadata handles all three. The map is never complete — 80% coverage is realistic — but 80% eliminates most of the manual investigation.
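The dbt part of that assembly might look like the sketch below. It relies on the `nodes` and `depends_on.nodes` fields of dbt's `manifest.json` artifact; the project name `shop` and the inline excerpt are made up, and in practice you would load `target/manifest.json` from disk instead. ADF and Power BI edges would need their own extraction and are not shown.

```python
import json

# A trimmed, hypothetical excerpt shaped like dbt's target/manifest.json.
MANIFEST_JSON = """
{
  "nodes": {
    "model.shop.compute_margins": {
      "resource_type": "model",
      "depends_on": {"nodes": ["source.shop.raw.order_line_items"]}
    },
    "model.shop.daily_sales": {
      "resource_type": "model",
      "depends_on": {"nodes": ["model.shop.compute_margins"]}
    }
  }
}
"""


def edges_from_manifest(manifest):
    """Yield (upstream, downstream) pairs for every model in the manifest."""
    for node_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue  # this sketch ignores tests, seeds, and snapshots
        for parent in node.get("depends_on", {}).get("nodes", []):
            yield parent, node_id


manifest = json.loads(MANIFEST_JSON)
edges = sorted(edges_from_manifest(manifest))
for upstream_id, downstream_id in edges:
    print(f"{upstream_id} -> {downstream_id}")
```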

The real value of lineage is proactive, not investigative

The investigative value of lineage is real: cutting a 90-minute root cause hunt to a 10-minute lookup is significant. But the deeper value is that lineage makes proactive response possible.

Without lineage: a dataset shows wrong data at 08:30. An engineer investigates for 90 minutes, finds the root cause (a dbt job failed at 02:30), and restarts the pipeline. The dataset is corrected by 11:00. Several stakeholders have already made decisions on the wrong data.

With lineage: the dbt job fails at 02:30. The monitoring system knows which Power BI datasets depend on its output and that those datasets are scheduled to refresh at 05:00. The engineer gets an alert at 02:30: "dbt job daily_sales failed. Three downstream datasets — Sales Overview, Revenue by Region, Monthly Actuals — are scheduled to refresh at 05:00. Root cause: model compute_margins failed due to nulls in order_line_items." The engineer fixes the dbt model before 05:00. No user sees stale data.
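To make the shape of such an alert concrete, here is a small sketch that renders the message from facts the lineage lookup has already resolved. The `build_alert` function and its parameters are hypothetical; the values come from the example above.

```python
def build_alert(job, failed_model, cause, datasets, refresh_time):
    """Render a lineage-aware alert from already-resolved lineage facts."""
    count = len(datasets)
    noun = "dataset" if count == 1 else "datasets"
    verb = "is" if count == 1 else "are"
    names = ", ".join(datasets)
    return (
        f"dbt job {job} failed. "
        f"{count} downstream {noun} ({names}) {verb} scheduled to refresh "
        f"at {refresh_time}. "
        f"Root cause: model {failed_model} failed due to {cause}."
    )


message = build_alert(
    job="daily_sales",
    failed_model="compute_margins",
    cause="nulls in order_line_items",
    datasets=["Sales Overview", "Revenue by Region", "Monthly Actuals"],
    refresh_time="05:00",
)
print(message)
```

The point of the sketch is that every field in the message except the failure itself comes from lineage, not from monitoring.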

The difference isn't faster investigation. The incident never became a user-visible problem. That only happens when monitoring and lineage work together.

Start with your five most-viewed reports and trace each back

You don't need to document your entire data pipeline before lineage becomes useful. The highest-value lineage to build first is the chain behind your most business-critical assets.

For a Power BI environment, the fastest path to useful lineage is:

  1. Identify your five most-viewed reports (Power BI usage metrics in the admin portal show view count by report)
  2. For each report, identify the datasets it reads from
  3. For each dataset, identify its data source — the database server, table name, and connection string
  4. Find which pipeline loads that table and on what schedule
  5. Capture whether that pipeline depends on any upstream transformation jobs (dbt, Databricks, Synapse)

Documenting these five chains — even in a structured spreadsheet — gives you the most important 20% of your lineage coverage immediately. With those chains explicit, you can configure targeted monitoring: alerts when the pipeline feeding your most-viewed reports fails, volume checks on the tables they read from, and watermark checks on their primary timestamp columns.
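If the chains live in a spreadsheet, a short script can turn them into a list of monitoring targets. This is a sketch under assumptions: the column names, the sample rows, and the `monitoring_targets` helper are invented for illustration, and the CSV text stands in for a real spreadsheet export.

```python
import csv
import io

# Hypothetical spreadsheet export: one row per report-to-source chain.
CHAINS_CSV = """report,dataset,source_table,pipeline,upstream_job
Sales Dashboard,Sales Overview,dbo.fact_sales,adf_load_sales,dbt:daily_sales
Regional Review,Revenue by Region,dbo.fact_sales,adf_load_sales,dbt:daily_sales
Finance Pack,Monthly Actuals,dbo.fact_ledger,adf_load_ledger,
"""


def monitoring_targets(csv_text):
    """Collect the distinct pipelines, tables, and jobs behind the listed reports."""
    pipelines, tables, jobs = set(), set(), set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        pipelines.add(row["pipeline"])
        tables.add(row["source_table"])
        if row["upstream_job"]:  # blank means no upstream transformation job
            jobs.add(row["upstream_job"])
    return {
        "pipeline_failure_alerts": sorted(pipelines),
        "volume_and_watermark_checks": sorted(tables),
        "job_failure_alerts": sorted(jobs),
    }


targets = monitoring_targets(CHAINS_CSV)
print(targets)
```

Deduplicating matters here: several top reports often share one table or pipeline, so the list of things to monitor is usually shorter than the list of chains.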

Automatic lineage — updated continuously as new pipelines run, new datasets are created, and datasource configurations change — requires tooling that parses pipeline metadata, dbt artifacts, and Power BI API responses on an ongoing basis. But the manual starting point has value from day one.

Frequently asked questions

What is the difference between data monitoring and data lineage?
Monitoring detects when something goes wrong — a failed refresh, a volume drop, a watermark anomaly. Lineage shows what depends on what, so when monitoring fires you know where to look and what else is affected. Monitoring without lineage gives you the alarm; lineage gives you the map.
Why is impact assessment slow without data lineage?
Without lineage, the engineer has to check each potential downstream dependency manually — opening each dataset's datasource configuration to see if it reads from the affected table. For environments with dozens of datasets, this takes as long as the root cause investigation itself.
How does data lineage enable proactive alerting?
A lineage-aware monitoring system knows which downstream assets depend on each upstream component. When a pipeline or transformation fails, it can identify which downstream Power BI datasets are scheduled to refresh and send a warning before they load stale data.
How complete does a lineage map need to be to be useful?
It doesn't need to be complete. A lineage map with 80% coverage — the most important chains documented, even if not every edge is captured — eliminates most of the manual investigation work. Start with your highest-traffic assets and build outward.
How do you build initial data lineage coverage for Power BI?
Start with your five most-viewed reports. Trace each back: report → dataset → data source → pipeline → upstream transformation jobs. Documenting these five chains covers the most business-critical paths and gives you a foundation for targeted monitoring immediately.
