Step 1 — Detect early, at every layer
Data pipeline management is two jobs in one: keeping the pipelines healthy on a normal day, and responding fast when they break on the worst day. Most teams over-invest in the first and under-invest in the second. This is the playbook for the second.
Detecting a failure at 03:00 versus 08:30 — when the first user opens a report — is the difference between a quiet fix and an incident with stakeholder fallout.
This is where data observability platforms earn their place: the best ones surface incidents at every layer of the pipeline before a user ever opens a report.
ADF and Fabric pipeline failure notifications via Azure Monitor catch hard failures. Power BI refresh failure notifications catch when the load itself fails. Neither catches silent failures — the refresh that succeeds but loads wrong data.
A volume check closes that gap. Compare row count after each refresh against the day-of-week baseline: if the ADF pipeline ran successfully and copied zero rows, that shows up immediately. A watermark check — reading the max timestamp column in the loaded dataset — catches stale data from a source that stopped updating. With both in place, the first time you hear about a data problem is from your monitoring system, not from a user.
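A minimal sketch of both checks, assuming you can already obtain the row count and the max timestamp of the loaded dataset by your own means. The function names, the 50% tolerance, and the 26-hour lag limit are illustrative assumptions, not values from any particular platform.

```python
from datetime import datetime, timedelta
from statistics import median


def volume_ok(row_count, baseline_counts, tolerance=0.5):
    """True if row_count is not too far below the median of past loads.

    baseline_counts: row counts from earlier loads on the same weekday.
    tolerance: allowed fractional drop below the median (assumed value).
    """
    if not baseline_counts:
        return True  # no history yet, nothing to compare against
    expected = median(baseline_counts)
    return row_count >= expected * (1 - tolerance)


def watermark_ok(max_timestamp, now, max_lag=timedelta(hours=26)):
    """True if the newest loaded row is recent enough (max_lag is assumed)."""
    return now - max_timestamp <= max_lag


# Hypothetical history for the same weekday
same_weekday_counts = [47_800, 48_300, 48_100, 47_950]

print(volume_ok(0, same_weekday_counts))       # False: run succeeded, zero rows
print(volume_ok(47_500, same_weekday_counts))  # True
print(watermark_ok(datetime(2024, 1, 8, 6, 4), datetime(2024, 1, 10, 3, 0)))  # False
```

The volume check only guards against drops. A symmetric upper bound would be a reasonable addition if duplicate loads are a concern.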
One more thing: configure alerts to include diagnostic context, not just the failure event. "Sales Overview refresh failed" is almost useless at 03:00. "Sales Overview refresh failed — last successful refresh 06:04 yesterday, upstream ADF pipeline status: succeeded, row count: 0 (expected: ~48,000)" is something you can act on immediately.
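One way to assemble such a message is sketched below. The `RefreshAlert` type and its field names are hypothetical; the values would come from whatever your monitoring already collects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RefreshAlert:
    # All fields are illustrative; populate them from your own monitoring data.
    dataset: str
    last_success: str
    upstream_status: str
    row_count: Optional[int] = None
    expected_rows: Optional[int] = None


def format_alert(alert: RefreshAlert) -> str:
    """Build a single-line alert that carries the diagnostic context."""
    parts = [
        f"{alert.dataset} refresh failed",
        f"last successful refresh {alert.last_success}",
        f"upstream ADF pipeline status: {alert.upstream_status}",
    ]
    if alert.row_count is not None:
        detail = f"row count: {alert.row_count:,}"
        if alert.expected_rows is not None:
            detail += f" (expected: ~{alert.expected_rows:,})"
        parts.append(detail)
    return "; ".join(parts)


print(format_alert(
    RefreshAlert("Sales Overview", "06:04 yesterday", "succeeded", 0, 48_000)
))
```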
Step 2 — Diagnose fast with a fixed checklist
Once you know there's a problem, you need to find the root cause fast. A standard diagnostic checklist eliminates the cognitive overhead of deciding where to look at 03:00 with incomplete information.
Run these checks in order:
- Pipeline layer: Did the ADF or Fabric pipeline run on schedule? If yes, did it succeed? If it failed, what was the specific error? (Azure Monitor → Pipeline runs → specific run → activity details)
- Volume layer: What was the row count written by the last pipeline run? Is it consistent with the expected volume for this time of day?
- Source layer: Is the source system reachable? Run a connection test from ADF or query the source directly to confirm data is present and recent.
- Gateway layer: If any pipeline uses the on-premises data gateway, is the gateway service running? Check the gateway health page or query the service status on the gateway machine.
- Refresh layer: Check the Power BI refresh history for the affected dataset. When was the last successful refresh? Was it the most recent scheduled run or an earlier one?
This sequence starts with the pipeline run and its output, then works outward to the source, the gateway, and the Power BI refresh, stopping as soon as the failure is found. Most data pipeline incidents are identified at the first or second check. With the checklist prepared in advance, all five checks take about five minutes to run.
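The checklist can be sketched as an ordered list of checks that stops at the first finding. The check functions here are stubs standing in for real lookups; their names and return values are assumptions for illustration.

```python
from typing import Callable, List, Optional, Tuple

# A check returns None when the layer looks healthy, or a short finding.
Check = Tuple[str, Callable[[], Optional[str]]]


def run_checklist(checks: List[Check]) -> Optional[Tuple[str, str]]:
    """Run checks in order and return (layer, finding) for the first failure."""
    for layer, check in checks:
        finding = check()
        if finding is not None:
            return layer, finding  # stop here; later checks are skipped
    return None


# Stub checks; replace each with a real lookup for that layer.
def pipeline_check() -> Optional[str]:
    return None


def volume_check() -> Optional[str]:
    return "0 rows written (expected ~48,000)"


def source_check() -> Optional[str]:
    return None


checks = [
    ("pipeline", pipeline_check),
    ("volume", volume_check),
    ("source", source_check),
]
print(run_checklist(checks))
```

Keeping the order in data rather than in someone's head is the point: whoever is on call at 03:00 runs the same sequence every time.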
Step 3 — Assess downstream impact before you touch anything
Before you fix anything, establish who's affected. This shapes the urgency of your response, determines who needs to be notified, and prevents the most common communication error in data incidents: starting the fix before anyone knows what the blast radius is.
Impact assessment answers four questions:
- Which reports are currently serving stale or incorrect data?
- Which of those reports have been opened since the failure occurred?
- Which business users or teams depend on those reports for time-sensitive decisions?
- Is there an alternative data source or workaround they can use while the incident is resolved?
For the first question — which reports are affected — lineage metadata is the fastest tool. If you have a map of which Power BI datasets read from the failing pipeline's output tables, and which reports are built on those datasets, the impact scope is a lookup. Without lineage, you check each dataset's datasource configuration manually.
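With lineage held as two simple mappings, the lookup is a few lines. This is a sketch under that assumption; the table, dataset, and report names are hypothetical.

```python
from typing import Dict, Iterable, List


def affected_reports(
    failed_tables: Iterable[str],
    dataset_tables: Dict[str, List[str]],
    report_dataset: Dict[str, str],
) -> List[str]:
    """Return reports built on any dataset that reads a failed table."""
    failed = set(failed_tables)
    hit_datasets = {
        dataset
        for dataset, tables in dataset_tables.items()
        if failed & set(tables)
    }
    return sorted(
        report for report, dataset in report_dataset.items()
        if dataset in hit_datasets
    )


# Hypothetical lineage: dataset -> source tables, report -> dataset
dataset_tables = {
    "Sales": ["dbo.daily_sales", "dbo.dim_store"],
    "Finance": ["dbo.ledger"],
}
report_dataset = {
    "Sales Overview": "Sales",
    "Store Ranking": "Sales",
    "P&L": "Finance",
}

print(affected_reports(["dbo.daily_sales"], dataset_tables, report_dataset))
# ['Sales Overview', 'Store Ranking']
```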
For the second question, whether reports have been opened since the failure, Power BI usage metrics show the most recent access time per report. A report last opened two hours ago, when the data was already wrong, may already have influenced a decision that needs follow-up.
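Once the last-access times are exported, the filter itself is trivial. The sketch below assumes a plain mapping of report name to last-opened timestamp; how you export it is up to you, and the sample values are made up.

```python
from datetime import datetime
from typing import Dict, List


def opened_since(failure_time: datetime, last_opened: Dict[str, datetime]) -> List[str]:
    """Return reports whose most recent access is at or after the failure."""
    return sorted(
        report for report, opened_at in last_opened.items()
        if opened_at >= failure_time
    )


# Hypothetical export of last-access times
last_opened = {
    "Sales Overview": datetime(2024, 1, 10, 7, 50),
    "Store Ranking": datetime(2024, 1, 9, 16, 20),
}

print(opened_since(datetime(2024, 1, 10, 2, 30), last_opened))
# ['Sales Overview']
```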
Notify affected stakeholders with a short, factual message: what is known, what is being investigated, and when the next update will arrive. Do not estimate resolution time before root cause is confirmed — a wrong ETA is worse than no ETA.
Step 4 — Fix, verify end-to-end, and document before closing
The remediation path depends on root cause, but the verification pattern after the fix is consistent regardless of what broke.
After any fix, verify end-to-end before marking the incident resolved. A pipeline that has been restarted needs its row count validated — confirm the expected number of rows were written, not just that the run completed. A Power BI dataset that has been manually refreshed needs its watermark checked — confirm the maximum timestamp column reflects data from the expected time range, not from a prior load. A gateway that has been restarted needs a connection test — confirm at least one on-premises-connected dataset refreshes successfully before assuming the gateway is healthy.
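A small gate like the following can make "verified end-to-end" explicit before a ticket is closed. It is a sketch only: the class name and check labels are hypothetical, and the pass/fail values would come from your own row count, watermark, and gateway tests.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PostFixVerification:
    """Collects named check results; the incident closes only if all pass."""
    results: Dict[str, bool] = field(default_factory=dict)

    def record(self, name: str, passed: bool) -> None:
        self.results[name] = passed

    def failed(self) -> List[str]:
        return [name for name, ok in self.results.items() if not ok]

    def can_close(self) -> bool:
        # An empty verification does not count as a pass.
        return bool(self.results) and not self.failed()


v = PostFixVerification()
v.record("row count", 48_120 >= 40_000)     # rows written vs. assumed threshold
v.record("watermark", True)                 # max timestamp in expected range
v.record("gateway connection test", False)  # test refresh did not succeed

print(v.can_close(), v.failed())
# False ['gateway connection test']
```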
Visual verification is the final step: open the affected report and confirm the numbers look reasonable compared to prior periods. This step catches the class of issue where a fix resolved the pipeline error but didn't address the root cause — for example, restarting a failed pipeline that was failing because of bad source data. The pipeline succeeds on retry, the row count matches, and the data is still wrong.
Document the incident before closing the ticket: timestamp of detection, root cause, time-to-detect, time-to-resolve, and one action item. The action item should be specific enough to prevent the same failure from reaching users again — not "improve monitoring" but "add row count alert for daily_sales ADF pipeline with threshold below 40,000 rows."
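The record can be as simple as a small structure that derives the two durations from the timestamps. This sketch assumes time-to-resolve is measured from detection rather than from the failure itself; the field names and sample values are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentRecord:
    failed_at: datetime
    detected_at: datetime
    resolved_at: datetime
    root_cause: str
    action_item: str

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.failed_at

    @property
    def time_to_resolve(self) -> timedelta:
        # Assumption: measured from detection, not from the failure.
        return self.resolved_at - self.detected_at


record = IncidentRecord(
    failed_at=datetime(2024, 1, 10, 2, 30),
    detected_at=datetime(2024, 1, 10, 8, 45),
    resolved_at=datetime(2024, 1, 10, 10, 15),
    root_cause="source extract returned zero rows",
    action_item=(
        "add row count alert for daily_sales ADF pipeline "
        "with threshold below 40,000 rows"
    ),
)

print(record.time_to_detect)   # 6:15:00
print(record.time_to_resolve)  # 1:30:00
```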
Step 5 — Run the incident review before the memory fades
An incident review isn't a blame session. It's a 30-minute analysis of what happened and what specific change will close the gap that let this failure reach users.
Start with the timeline: when the failure occurred, when it was detected, when root cause was identified, and when it was resolved. The gap between failure and detection is the most important number, and it is what the action item should target.
Then ask why the detection gap existed. If a pipeline failed at 02:30 and wasn't detected until 08:45, was there no alert configured? Did an alert fire but go to a channel nobody checks? Did the alert fire but lack enough context to be actionable? The specific mechanism determines the fix.
What would have caught it faster? If a volume check had been on that pipeline, would it have fired within fifteen minutes of the failure? If a watermark check had been on the downstream dataset, would it have surfaced the stale data before anyone opened a report?
Then commit to one change. One specific, completable thing with an owner and a due date — not a list of potential improvements. "Add row count alert for daily_sales ADF pipeline, threshold below 40,000 rows, alert to #data-on-call by Friday." That's an action item. "Improve monitoring" is not.
