
How to Set Up Incident Response for Data Pipeline Failures

What do you do when it's 03:00 and your most important dataset just failed to refresh?

Step 1 — Detection: The First Alarm

Detection determines how quickly you know about a failure. The difference between detecting a problem at 03:00 and at 08:30 — when the first user opens a report — is the difference between a silent fix and a visible incident with stakeholder impact.

Good detection requires automated alerts from multiple layers:

Pipeline layer: ADF and Fabric Pipeline failure notifications via Azure Monitor webhook triggers. These fire within minutes of a failure.

Dataset layer: Power BI refresh failure notifications via the Power BI Service built-in alert system. These cover failures that the pipeline layer misses — for example, when a pipeline succeeds but the Power BI model refresh fails due to memory limits or a query timeout.

Volume layer: Automated row count checks that run after each refresh. These are the only way to catch silent failures — refreshes that succeed but load wrong data.
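A post-refresh volume check like this can be sketched as a small comparison against the previous run. The 50% drop threshold and the alert message wording are illustrative choices, not a built-in Power BI or Azure feature:

```python
# Hypothetical post-refresh volume check: compare the latest row count
# against the previous run and flag large drops as silent-failure suspects.
# Threshold and message format are illustrative assumptions.

def check_row_volume(current_rows: int, previous_rows: int,
                     drop_threshold: float = 0.5):
    """Return an alert message if the load looks wrong, else None."""
    if current_rows == 0:
        return "ALERT: refresh succeeded but loaded 0 rows"
    if previous_rows > 0:
        drop = 1 - current_rows / previous_rows
        if drop >= drop_threshold:
            return (f"ALERT: refresh succeeded but row count dropped "
                    f"{drop:.0%} compared to last run")
    return None  # volume looks normal
```

A run like `check_row_volume(3500, 10000)` produces the "dropped 65%" style of alert described below, which carries the problem category in the message itself.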

For on-call purposes, alerts need to be actionable without context-switching. A notification that says "Sales Overview dataset — refresh succeeded but row count dropped 65% compared to last run" is far more actionable than "refresh failed" — it tells you the problem category immediately.

Route critical alerts — datasets backing board reports, financial closes, operational dashboards — to a paging system with wake-up urgency. Route non-critical alerts (minor volume dips, performance degradation) to a team channel for morning review. Not every data problem warrants waking someone up at 03:00.
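The routing rule reduces to a lookup against a criticality list. The dataset names and channel targets below are hypothetical stand-ins for your paging and chat integrations:

```python
# Sketch of severity-based alert routing. Dataset names and destinations
# are illustrative; wire the two branches to your real pager and chat tools.

CRITICAL_DATASETS = {"Sales Overview", "Financial Close", "Ops Dashboard"}

def route_alert(dataset: str, message: str) -> str:
    """Critical datasets page the on-call; everything else waits for morning."""
    if dataset in CRITICAL_DATASETS:
        return f"PAGE on-call: [{dataset}] {message}"
    return f"POST #data-alerts: [{dataset}] {message}"
```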

Step 2 — Diagnose: Five Checks in Order

Once you're awake at 03:00 and you know there's a problem, you need to diagnose it fast. A standard diagnostic checklist eliminates the cognitive overhead of deciding where to look first.

Run these five checks in order:

  1. Check the pipeline layer: Did the ADF or Fabric pipeline run? Did it succeed? If it failed, what was the specific error? (Azure Monitor → Pipeline runs → specific run → activity-level error details.)
  2. Check the volume: How many rows did the pipeline copy compared to the previous run? A zero-row copy is a different problem from a partial load. Zero rows usually means the source didn't produce data. A partial load suggests the pipeline was interrupted.
  3. Check the source: Is the source system accessible? Can you query the source table directly and get the expected rows? This rules out source-side failures — a database that's down, an API that's returning 503s, an SFTP server that didn't produce the export file.
  4. Check the gateway: If the pipeline uses an on-premises data gateway, is the gateway online? Are there other datasets that failed at the same time? Multiple simultaneous failures pointing at gateway-connected datasets indicate a gateway problem, not a pipeline problem.
  5. Check the Power BI refresh: Did the dataset refresh run after the pipeline completed? Did it succeed? If both the pipeline and the refresh succeeded but data is wrong, the problem is in the data content itself — a silent failure requiring volume and watermark investigation.
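The five checks can be sketched as an ordered probe sequence that stops at the first layer that explains the failure. The probes below are stubs standing in for real Azure Monitor, source, gateway, and refresh queries:

```python
# Minimal sketch of the checklist as ordered probes. Each probe returns a
# diagnosis string when it finds the problem, or None to pass to the next
# check. Probe bodies here are stubs; real ones would query each layer.

from typing import Callable, Optional

def diagnose(checks: list) -> str:
    """Run (name, probe) pairs in order; return the first diagnosis found."""
    for name, probe in checks:
        result = probe()
        if result is not None:
            return f"{name}: {result}"
    # All layers passed: the data content itself is suspect.
    return "all checks passed: investigate data content (silent failure)"

# Example run with stubbed probes:
checks = [
    ("pipeline", lambda: None),             # pipeline run succeeded
    ("volume",   lambda: "zero-row copy"),  # <- problem found here
    ("source",   lambda: None),
    ("gateway",  lambda: None),
    ("refresh",  lambda: None),
]
print(diagnose(checks))  # volume: zero-row copy
```

Stopping at the first failing layer matters: a gateway outage will also fail the refresh, and checking in order keeps you from chasing the downstream symptom.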

Document this checklist and share it with everyone on the on-call rotation. The first five minutes of a 03:00 incident should be structured investigation, not freestyle debugging.

Step 3 — Impact Assessment: Who's Affected

Before you fix anything, establish who's affected. This shapes the urgency of your response and determines who needs to be notified and when.

Impact assessment answers four questions:

Which reports are affected? Which Power BI reports are built on the failed dataset? Are there downstream datasets that import from this one? Without lineage tooling, this requires manually checking each report's data source configuration — time-consuming in large environments.

Who uses those reports? What business processes depend on this data? Is this a board-level report, an operational dashboard used hourly, or a self-service analysis that's accessed irregularly?

What's the time sensitivity? Is this report needed for a 08:00 leadership meeting? Does it drive a daily business process that starts at 09:00? Or can it wait until the next scheduled refresh cycle without any user impact?

What exactly is wrong with the data? Is the data completely absent (no rows loaded, users see no data)? Wrong (silent failure, users see incorrect numbers)? Or delayed (refresh running late, users will have correct data once it completes)?

The answers to these questions determine your communication priority. If the affected data feeds a board meeting in 4 hours and the fix will take 90 minutes, the right executive assistant needs to know now so the meeting can be rescheduled or the data limitation acknowledged in advance. If the affected data is a rarely-accessed analyst tool, a Slack message in the morning is sufficient.
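One way to sketch that triage as a decision rule. The audience labels, message strings, and the rule that executives always hear about it immediately are illustrative assumptions drawn from the examples above:

```python
# Hedged sketch: map impact-assessment answers to a communication decision.
# Audience labels and rules are illustrative, not a standard taxonomy.

def communication_priority(hours_until_needed: float,
                           estimated_fix_hours: float,
                           audience: str) -> str:
    """audience is one of 'executive', 'operational', 'self-service'."""
    fix_in_time = estimated_fix_hours < hours_until_needed
    if audience == "executive":
        # Executives hear about it now either way; only the message differs.
        return ("heads-up now: fix will land in time" if fix_in_time
                else "escalate now: data will not be ready")
    if audience == "operational":
        return ("note in team channel" if fix_in_time
                else "notify process owners now")
    return "morning summary message"  # rarely-accessed self-service content
```

The board-meeting example above (needed in 4 hours, 90-minute fix) lands on the immediate heads-up branch even though the fix will complete in time.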

Step 4 — Fix and Verify

The remediation path depends entirely on the root cause. Common patterns and their fixes:

Pipeline failure: Rerun the pipeline. Before you do, confirm that the root cause is resolved — if the pipeline failed because the source system was unavailable, rerunning it immediately won't help. Check source availability and expected data first.

Zero-row copy: The pipeline ran but loaded no data. Wait for the source system to produce its export, then rerun. If the source has an SLA for delivery and has missed it, escalate to the source system owner before rerunning.

Power BI refresh failure: Trigger a manual refresh from Power BI Service or via the REST API. Monitor the refresh progress rather than waiting for a notification — watching it live gives you faster feedback if it fails again.

Gateway failure: Restart the gateway service. Confirm whether the gateway is configured for automatic restarts. If the gateway is down due to a Windows update or a certificate expiry, the fix is different from a crashed service and takes longer.
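These patterns lend themselves to a simple playbook lookup keyed by the diagnosed root cause. The keys and playbook wording below are illustrative paraphrases of the patterns above:

```python
# Hypothetical playbook lookup: diagnosed root cause -> first remediation
# step. Keys and wording are illustrative; extend with your own patterns.

PLAYBOOK = {
    "pipeline_failure": "confirm root cause is resolved, then rerun the pipeline",
    "zero_row_copy":    "wait for source export (escalate to source owner if SLA missed), then rerun",
    "refresh_failure":  "trigger a manual refresh and monitor it live",
    "gateway_down":     "restart gateway service; check for Windows update or certificate expiry",
}

def remediation(root_cause: str) -> str:
    """Return the playbook entry, or an escalation default for unknown patterns."""
    return PLAYBOOK.get(root_cause, "unknown pattern: escalate to pipeline owner")
```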

Verification is as important as the fix. After remediation:

  1. Confirm the row count on the destination table matches expectations
  2. Confirm the watermark (max date field) has advanced to the expected value
  3. Trigger a Power BI refresh and confirm it completes with the expected row count
  4. Open the affected report and visually confirm the data looks correct

Do not declare an incident closed until you've verified end-to-end. A pipeline that ran successfully but loaded to the wrong table, or a refresh that completed but is still serving cached data, will produce another incident within hours.
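The first three verification steps can be sketched as a single pass/fail gate (the visual check in step 4 stays manual). The expected values would come from your own run history; the 5% tolerance is an illustrative assumption:

```python
# Sketch of end-to-end verification as one gate. Expected values and the
# tolerance are illustrative; the visual report check remains manual.

from datetime import date

def verify_recovery(dest_rows: int, expected_rows: int,
                    watermark: date, expected_watermark: date,
                    refresh_succeeded: bool,
                    tolerance: float = 0.05) -> list:
    """Return the list of failed checks; an empty list means safe to close."""
    failures = []
    if abs(dest_rows - expected_rows) > expected_rows * tolerance:
        failures.append(f"row count {dest_rows} outside tolerance of {expected_rows}")
    if watermark < expected_watermark:
        failures.append(f"watermark {watermark} has not advanced to {expected_watermark}")
    if not refresh_succeeded:
        failures.append("Power BI refresh did not complete")
    return failures
```

Only an empty failure list, plus the manual look at the report, justifies closing the incident.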

Step 5 — Postmortem: Closing the Loop

A postmortem isn't a blame session. It's a structured analysis of what happened, why existing detection or prevention mechanisms didn't catch it earlier, and what specific changes will reduce the likelihood of recurrence.

For data pipeline incidents, the most useful postmortem questions are:

Detection gap: How long did the problem exist before anyone knew about it? If the answer is "until a user reported it at 09:30", you have a monitoring gap. What specific alert would have caught this at 03:00 instead?

Blast radius accuracy: How many reports were affected? Were all affected reports identified during impact assessment, or were some discovered later when users complained? If any were missed, the lineage documentation is incomplete.

Recovery time analysis: How long did it take from alert to resolution? Where did the investigation slow down? Which diagnostic step would have been faster with better tooling or documentation?

Preventability: Could this failure have been prevented? A credential expiry is preventable with proactive monitoring. A source system outage is not preventable on your end — but the downstream impact can be reduced with better fallback handling or earlier communication to stakeholders.

For each action item, assign a specific owner and a deadline. "Improve monitoring" without an owner and a specific change is not an action item. The most common high-value postmortem actions are: adding a volume check that was missing, documenting a lineage link that wasn't tracked, and adding an escalation path for a stakeholder group that was notified too late.
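The owner-and-deadline rule can be enforced mechanically when action items are tracked as structured records. The field names and the vague-phrase list below are illustrative:

```python
# Minimal sketch enforcing the rule that a postmortem action item needs a
# specific change, an owner, and a deadline. Field names are illustrative.

from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ActionItem:
    change: str                    # e.g. "add volume check to Sales pipeline"
    owner: Optional[str] = None
    due: Optional[date] = None

def is_actionable(item: ActionItem) -> bool:
    """Reject ownerless, deadline-less, or vague items like 'improve monitoring'."""
    vague = item.change.strip().lower() in {"improve monitoring", "fix alerting"}
    return item.owner is not None and item.due is not None and not vague
```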

Run postmortems on every incident that required waking someone up, broke a stakeholder SLA, or caused decisions to be made on bad data. For minor incidents, batch them monthly and look for patterns rather than individual root causes.
