MetricSign
Data Observability · 7 min read

Why Silent Data Failures Cost More Than Outages

A failed refresh announces itself. Wrong data loaded silently does not.

A hard failure is honest. A silent failure is not.

When a Power BI refresh fails with an error, the consequences are bounded. The API returns a failed status, the alert fires within minutes, and an engineer starts investigating. The data is known to be unavailable. Stakeholders can be told. The damage is contained by the speed of detection.

Silent failures are different in kind, not just severity. The pipeline runs. The refresh completes. Power BI reports success. Nobody is alerted because there is nothing to alert on — the system believes everything is fine. The first signal is a stakeholder opening a report and acting on numbers that are wrong. By that point the bad data has been in production for hours or days, and the question is no longer how to fix the pipeline. It's how many decisions were made on incorrect information.

Hard failures attract engineering attention immediately. Silent failures attract it only after the business consequence is visible. That asymmetry explains why silent failures consistently produce more damage despite being less dramatic.

[Figure: two timelines from Day 0 to Day 21. Hard failure: refresh error, immediate detection, alert fired, resolved. Silent failure: data wrong, refresh reports success, no alert; a user reports "numbers look wrong" in week 2, after 14 days undetected.]
Hard failures alert immediately. Silent failures accumulate for days before anyone notices.

The ADF pipeline that ran fine and loaded nothing

Here is a scenario that plays out regularly in Azure-heavy data stacks. An ADF pipeline runs on its usual schedule, completes without errors, and marks the run Succeeded. Power BI refreshes the downstream dataset and shows the data it loaded. Everything in the monitoring dashboard is green.

What actually happened: the source query had a date filter that became more restrictive after a configuration change. The pipeline ran but copied zero rows. ADF's execution model considers this a success — the pipeline did exactly what it was configured to do, it just produced no output. Power BI's import-mode dataset is now holding the data from the previous run. The watermark hasn't moved. Nobody knows.

The first person to notice is a business analyst who tries to reconcile yesterday's numbers and can't find the transactions from the previous afternoon. She escalates to the data team. The engineer on call logs into ADF, sees the green status, and starts from scratch trying to find where the data went. Without row count validation on the pipeline output, the investigation takes two to three hours. With it, the anomaly would have been visible within fifteen minutes of the pipeline completing.

The cost accumulates in three places most teams don't measure

Most engineers estimate the cost of a data incident as hours to fix. That's the visible part.

A sales forecast built on incomplete data produces resource allocations that don't match reality. An operations team schedules staff based on demand numbers that are two days old. Those costs are real — but they land in an operations budget, not a data budget. The analysis that identified the wrong decision came days later. The connection to the pipeline failure is indirect. Nobody files a ticket.

Every time a stakeholder finds a data problem before the data team does, something shifts. The implicit contract — that the data team monitors its own systems — has been broken visibly. Stakeholders start building their own checks: double-verifying reports against source systems, keeping parallel spreadsheets, delaying decisions until they can confirm the numbers. That overhead persists after the pipeline is fixed. The trust event is now part of their mental model of the data team.

And then there's investigation time. A silent failure that's been in production for twelve hours before detection requires tracing through every layer of the stack to figure out where it started, what it touched, and when it began. Without lineage metadata, that typically takes two to four hours for a senior engineer. The cost doesn't show up anywhere — but it comes directly out of development capacity.

The failure patterns data engineers recognize — and users find first

Silent failures cluster around a handful of patterns that repeat across data stacks.

The incrementally growing null column is one of the most common. A source system begins returning null for a field that previously had values. This is not a schema error, just missing data. A SUM in Power BI skips blank values, so the affected rows contribute nothing to the total. A revenue column starts showing lower totals, subtly, over several days. Nobody sees a spike. The trend is gradual enough to miss in daily reviews. By the time a business user asks why the numbers are down, the problem has been in the data for a week.
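One way to catch this pattern is to track the share of nulls in a column and compare it with a fixed baseline instead of with yesterday's load. The following is a minimal sketch, not a prescribed implementation: the function names and the 5-point `max_increase` threshold are assumptions made for illustration, and the column values are assumed to have been read into plain Python lists already.

```python
from typing import Optional, Sequence


def null_fraction(values: Sequence[Optional[float]]) -> float:
    """Share of entries that are None; an empty column counts as 0.0."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def null_rate_jumped(
    baseline: Sequence[Optional[float]],
    current: Sequence[Optional[float]],
    max_increase: float = 0.05,  # assumed tolerance, tune per column
) -> bool:
    """True when the null share rose by more than max_increase over the baseline."""
    return null_fraction(current) - null_fraction(baseline) > max_increase
```

Comparing against an older baseline, such as the same column a month ago, matters here: a day-over-day comparison can miss exactly the gradual drift described above.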

The partition boundary edge case appears in large fact tables that load incrementally. A date partition that should include data from 23:45 to midnight on a given day gets cut short by a timing issue. The daily total is slightly low. It happens once and isn't caught. It happens again three weeks later. The audit trail is now inconsistent and the data team spends two days reconstructing what should have been a ten-minute check.
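The ten-minute check could be a per-day reconciliation of loaded rows against source rows. The sketch below assumes both sets of daily counts have already been queried into dictionaries keyed by date; `short_days` is a hypothetical helper name, and how the counts are obtained is left out.

```python
from datetime import date
from typing import Dict, List, Tuple


def short_days(
    source_counts: Dict[date, int],
    loaded_counts: Dict[date, int],
) -> List[Tuple[date, int]]:
    """Return (day, missing_rows) for every day with fewer loaded rows than source rows."""
    gaps = []
    for day in sorted(source_counts):
        # A day absent from the loaded side counts as zero rows loaded.
        missing = source_counts[day] - loaded_counts.get(day, 0)
        if missing > 0:
            gaps.append((day, missing))
    return gaps
```

Run daily, a check like this would flag the first short partition on the day it happens, before a second occurrence makes the audit trail inconsistent.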

The incremental refresh filter mismatch is specific to Power BI incremental refresh policies. A filter key column — usually a date column used to determine which rows are "new" — gets renamed in the source. The incremental refresh policy stops loading new rows. The dataset keeps serving the data it already has, refreshing successfully every day, loading nothing new.

Detection requires monitoring the data, not just the pipeline

The detection gap for silent failures is structural. Pipeline APIs report execution status. Refresh APIs report load completion. Neither is designed to report data correctness. Closing the gap requires adding checks at layers that neither API monitors.

At the pipeline layer, row count validation after each copy activity is the most direct check. ADF supports this natively through activity output properties: the copy activity output reports the number of rows read and rows copied (rowsRead and rowsCopied), and both are available in the run history. A pipeline that wrote zero rows where it normally writes fifty thousand is detectable immediately.
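As a rough sketch, the row count validation can be a small function applied to the copy activity's output once it has been retrieved. The code assumes the output is available as a dictionary containing the `rowsRead` and `rowsCopied` fields that tabular copies report; `check_copy_rows` and the `expected_min_rows` threshold are illustrative names, and fetching the run history is out of scope.

```python
from typing import Any, Mapping


def check_copy_rows(output: Mapping[str, Any], expected_min_rows: int) -> list[str]:
    """Return human-readable problems found in a copy activity's output."""
    problems: list[str] = []
    rows_read = output.get("rowsRead")
    rows_copied = output.get("rowsCopied")

    if rows_copied is None:
        # Non-tabular copies may not report row counts at all.
        problems.append("rowsCopied missing from activity output")
        return problems

    if rows_copied == 0:
        problems.append("copy succeeded but wrote zero rows")
    elif rows_copied < expected_min_rows:
        problems.append(
            f"copied {rows_copied} rows, expected at least {expected_min_rows}"
        )

    if rows_read is not None and rows_read != rows_copied:
        problems.append(f"read {rows_read} rows but copied {rows_copied}")
    return problems
```

For the scenario above, `check_copy_rows({"rowsRead": 0, "rowsCopied": 0}, 1000)` returns a single zero-row warning even though the run itself is marked Succeeded.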

At the dataset layer, a watermark check on the primary timestamp column catches stale data that the pipeline layer won't surface. If the maximum transaction date in the loaded dataset is more than one refresh cycle behind the current time, something stopped the data from flowing.
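The watermark rule reduces to one comparison once the maximum timestamp has been read from the dataset. The sketch below covers only that comparison and assumes timezone-aware datetimes; `watermark_lag` and `is_stale` are hypothetical names, and retrieving the maximum timestamp (for example with a DAX query) is not shown.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def watermark_lag(max_loaded: datetime, now: Optional[datetime] = None) -> timedelta:
    """Time elapsed between the newest loaded timestamp and now (UTC by default)."""
    # Both datetimes must be timezone-aware; mixing naive and aware raises TypeError.
    now = now or datetime.now(timezone.utc)
    return now - max_loaded


def is_stale(
    max_loaded: datetime,
    refresh_interval: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """True when the newest loaded row is more than one refresh cycle old."""
    return watermark_lag(max_loaded, now) > refresh_interval
```

With a daily refresh, a dataset whose newest transaction is 39 hours old would be flagged, while one that is 9 hours behind would not.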

At the volume layer, comparing the current row count against the day-of-week baseline catches the class of failures where data is genuinely absent rather than just old. The watermark check and the volume check both run in seconds against the Power BI REST API and require no changes to the pipeline or data model.
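A day-of-week baseline can be sketched as the median row count over earlier dates that fall on the same weekday. The code below is one possible shape under stated assumptions: the history is a date-to-count mapping collected elsewhere, and the `min_ratio` of 0.5 is an arbitrary illustrative threshold, not a recommended value.

```python
from datetime import date
from statistics import median
from typing import Mapping


def volume_anomaly(
    history: Mapping[date, int],
    today: date,
    today_count: int,
    min_ratio: float = 0.5,  # assumed threshold: flag below half the usual volume
) -> bool:
    """True when today's row count is below min_ratio of the same-weekday median."""
    same_weekday = [
        count
        for day, count in history.items()
        if day < today and day.weekday() == today.weekday()
    ]
    if not same_weekday:
        return False  # no baseline yet, so nothing to compare against
    return today_count < min_ratio * median(same_weekday)
```

The median is used here instead of the mean so that one earlier bad load does not drag the baseline down with it.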

Frequently asked questions

Why do silent data failures cost more than outages?
Hard failures alert immediately, contain damage, and get fixed fast. Silent failures accumulate undetected — decisions are made on wrong data, stakeholder trust erodes, and when the problem is finally found the investigation has to reconstruct hours or days of history. The total cost is higher because none of it is visible until after the damage is done.
Can an ADF pipeline succeed but load no data?
Yes. ADF's execution model marks a pipeline as Succeeded when the activities complete without errors. A copy activity that runs with a date filter that matches zero rows writes zero rows and still reports success. The only way to detect this is to validate the row count written against the expected volume.
What are the business costs of bad data in Power BI reports?
Three categories: wrong decisions made on incorrect data (with costs that land in operational budgets, not data budgets), eroded stakeholder trust that produces lasting verification overhead, and investigation time that comes directly out of engineering development capacity.
How do you detect that a Power BI refresh loaded wrong data?
Two complementary checks: a watermark check that reads the max timestamp column in the loaded dataset (stale data detection), and a volume check that compares row count against the day-of-week historical baseline (missing data detection). Both run against the Power BI REST API.
What is a zero-row copy in Azure Data Factory?
A zero-row copy is a copy activity that executes successfully but transfers no data — typically because the source query returned zero rows due to a filter, partition boundary issue, or source table truncation. ADF marks the activity and pipeline as Succeeded regardless of row count.
