MetricSign

ADF pipeline failure monitoring: where native alerts stop working

Native Azure Monitor catches pipeline failures. It misses the Copy activity that succeeded with the wrong schema — and that's the one your stakeholders will call about.


A Copy activity fails silently at 3am and Power BI refreshes anyway

The source team renamed customer_id to cust_id in the Salesforce export. The ADF Copy activity reads from a stored procedure that still references the old column. The activity returns Succeeded — zero rows copied, no exception thrown, because the column mapping was set to 'auto' and the schema drift policy silently dropped the unmapped field. The pipeline reports Succeeded. The downstream Power BI dataset refreshes against the empty staging table. At 9am, the CFO opens a dashboard showing yesterday's revenue at $0.

This is the gap that native Azure Monitor alerting cannot close — and exactly the kind of failure mode that dedicated data observability platforms are built to catch.

ADF surfaces monitoring in four places: the Azure Monitor metrics blade, the pipeline run history in the studio, alert rules wired through Action Groups, and Log Analytics with diagnostic logs. Most teams configure two of them — pipeline run history (it's on by default) and a single Action Group that emails the team distribution list when a pipeline fails. That covers the case where the pipeline throws. It does not cover the case above.

See also: Best data observability platforms in 2026

What native ADF alerting actually gives you

The supported path for pipeline-level failure alerts is five steps. In your Data Factory, open Diagnostic Settings and route the PipelineRuns and ActivityRuns categories to a Log Analytics workspace. Wait for logs to flow (2-5 minutes after the next run). In Azure Monitor, create a scheduled query rule against that workspace. Attach an Action Group with email, Teams webhook, or an ITSM connector. The base KQL for pipeline failures is:

```kusto
ADFPipelineRun
| where TimeGenerated > ago(15m)
| where Status == "Failed"
| project TimeGenerated, PipelineName, RunId, Status, FailureType, Message=ErrorMessage
```

End-to-end latency: 3-10 minutes from failure to email. The rule fires once per evaluation window. Coverage is pipeline-level Status only — anything that bubbles up as a pipeline failure will be caught. What it misses: activity-level failures inside a succeeded pipeline, schema drift that produces wrong output, downstream impact on Power BI or Databricks, and the question of whether the ten alerts you just got at 3am are one incident or ten.

Figure: Native ADF alert path. (1) ADF pipeline fails, (2) Azure Monitor ingests diagnostic logs, (3) alert rule evaluates KQL, (4) Action Group triggers, (5) email / Teams / webhook.

The activity-level gap

A pipeline's Status is computed from the orchestration outcome, not the data outcome. A Copy activity with enableSkipIncompatibleRow: true will skip rows and report Succeeded. A Lookup activity with a malformed query that returns an empty result set is a successful execution. A ForEach over an empty array completes in milliseconds with Status Succeeded.
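The skipped-row case can be surfaced from the activity output. What follows is a minimal sketch, not a definitive rule: it assumes the ActivityRuns category is already routed to the workspace and that the Copy output JSON carries a rowsSkipped field when fault tolerance is enabled. The CopyOutput column name is only illustrative.

```kusto
// Sketch: succeeded Copy activities that skipped incompatible rows.
// Assumption: the Copy output JSON includes rowsSkipped when
// enableSkipIncompatibleRow is on; verify against your own logs.
ADFActivityRun
| where TimeGenerated > ago(15m)
| where ActivityType == "Copy" and Status == "Succeeded"
| extend CopyOutput = parse_json(Output)
| extend rowsRead = tolong(CopyOutput.rowsRead),
         rowsCopied = tolong(CopyOutput.rowsCopied),
         rowsSkipped = tolong(CopyOutput.rowsSkipped)
// A missing rowsSkipped field yields null, which this filter drops
| where rowsSkipped > 0
| project TimeGenerated, PipelineName, ActivityName, rowsRead, rowsCopied, rowsSkipped
```

Whether any skipped row should alert, or only a count above some tolerance, is a per-pipeline decision.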

The schema-mismatch case is the one that bites hardest. ADF's tabular translator with mapComplexValuesToString and auto-mapping will tolerate a renamed column by dropping it from the projection. The Copy activity reports rowsCopied: 487293 and Status Succeeded — the row count is real, the columns are wrong. Pipeline-level monitoring will never see this. The only signal is the row count itself, parsed from the activity Output JSON, compared against an expected baseline.

KQL for activity-level detection

Activity-level alerting requires querying ADFActivityRun directly and parsing the Output column. Here's a working query that catches both explicit failures and zero-row Copy activities:

```kusto
ADFActivityRun
| where TimeGenerated > ago(15m)
| where ActivityType in ("Copy", "Lookup", "ExecuteDataFlow")
| extend Output = parse_json(Output)
// rowsRead and rowsCopied are numbers; errors is an array in the Copy output, so count its entries
| extend rowsCopied = tolong(Output.rowsCopied),
         rowsRead = tolong(Output.rowsRead),
         errors = array_length(Output.errors)
| where Status == "Failed"
    or (ActivityType == "Copy" and Status == "Succeeded" and (rowsCopied == 0 or errors > 0))
// parse_json is a no-op if Error is already dynamic and required if it is a string
| project TimeGenerated, PipelineName, ActivityName, ActivityType, Status,
          rowsRead, rowsCopied, errors, ErrorMessage=tostring(parse_json(Error).message)
```

This works. It also requires you to know which activities have a meaningful zero-row baseline (a delta load may legitimately copy zero rows on a quiet day), which expected row count to compare against, and which pipelines need this treatment. Budget 15-30 minutes per pipeline to write, test, and tune. Multiply by your portfolio. The cost is not the KQL — it's the maintenance when activity names change, when new pipelines ship without alerts, and when the same failure triggers five rules across a fan-out.
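One way to avoid hard-coding an expected row count is to compare each Copy activity against its own recent history. The sketch below does that under stated assumptions: the 14-day lookback, the 50% drop ratio, and the minimum of five prior runs are illustrative values, not recommendations, and the let names are hypothetical.

```kusto
// Sketch: flag Copy activities whose latest row count falls well below
// their own recent average. All thresholds here are assumptions to tune.
let lookback = 14d;
let window = 15m;
let dropRatio = 0.5;
let copies = ADFActivityRun
    | where TimeGenerated > ago(lookback)
    | where ActivityType == "Copy" and Status == "Succeeded"
    | extend rowsCopied = tolong(parse_json(Output).rowsCopied)
    | where isnotnull(rowsCopied);
// Baseline: average rows per activity, excluding the window being evaluated
let baseline = copies
    | where TimeGenerated <= ago(window)
    | summarize baselineRows = avg(rowsCopied), runs = count() by PipelineName, ActivityName
    | where runs >= 5;
copies
| where TimeGenerated > ago(window)
| join kind=inner baseline on PipelineName, ActivityName
| where rowsCopied < baselineRows * dropRatio
| project TimeGenerated, PipelineName, ActivityName, rowsCopied, baselineRows, runs
```

Scheduled query rules cap how far back a query may look, so a long baseline like this one may need to be shortened or run interactively. A delta load with naturally spiky volume will also need a looser ratio than a full load.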

Where dedicated monitoring earns its keep

Three problems show up once your portfolio crosses roughly five pipelines or your stack mixes ADF with Databricks, dbt, or Power BI.

Cross-pipeline correlation. The ADF Copy activity failure at 03:14 is the same incident as the dbt staging_orders model failure at 03:22 and the Power BI Sales Executive dataset refresh_delayed signal at 06:00. Native tooling treats these as three alerts in three inboxes. They are one incident, with a clear causal chain.

Deduplication. A pipeline that fans out to 12 datasets via ForEach generates 12 activity failures on the same root cause. Action Groups will email you 12 times, or once per rule, depending on how you wrote the threshold — neither is the right answer. The right answer groups by root cause.
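A rough approximation of this grouping is possible in KQL by collapsing failures per pipeline run and error code. This is a sketch, assuming the PipelineRunId and ErrorCode columns are populated in ADFActivityRun for your factory. An error code is only a proxy for root cause, so two unrelated failures sharing a code will still be merged.

```kusto
// Sketch: one row per pipeline run and error code instead of one per activity.
// Assumption: ErrorCode and PipelineRunId are populated for failed activities.
ADFActivityRun
| where TimeGenerated > ago(15m)
| where Status == "Failed"
| summarize failedActivities = count(),
            activities = make_set(ActivityName, 20),
            firstFailure = min(TimeGenerated),
            sampleError = take_any(ErrorMessage)
    by PipelineName, PipelineRunId, ErrorCode
| order by firstFailure asc
```

The parent ForEach activity is itself recorded as failed, so failedActivities can be one higher than the number of failed iterations.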

Root cause surfacing. ADF's ErrorMessage column gives you Operation on target Copy data1 failed: ErrorCode=2200, ... followed by 2KB of stack trace. The actionable detail — column 'customer_id' not found in source — is in there, but you have to read for it. Dedicated tooling parses these, surfaces the column name, and links it to the schema change that introduced it.
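For a single known message shape, the extraction can be sketched with a regular expression. The pattern below matches only the wording quoted above. Real messages vary by connector, so treat both the regex and the missingColumn name as assumptions to adapt.

```kusto
// Sketch: pull the missing column name out of the error text.
// The regex is an assumption based on the example message in this article.
ADFActivityRun
| where TimeGenerated > ago(15m)
| where Status == "Failed"
| extend missingColumn = extract(@"column '([^']+)' not found", 1, ErrorMessage)
| where isnotempty(missingColumn)
| project TimeGenerated, PipelineName, ActivityName, missingColumn
```

Each additional error shape needs its own pattern, which is where maintaining this by hand starts to cost more than the query itself.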

This is where MetricSign fits: it reads ADF activity logs through the same diagnostic stream you already configured, groups failures into incidents across ADF, Databricks, dbt, and Power BI, and surfaces the root cause hint instead of the stack trace.

This cross-stack approach is what a data observability platform is designed to deliver — not per-tool alerting, but a correlated view across every layer of the pipeline.

What to actually build

For one to four pipelines with no downstream BI dependencies: configure Diagnostic Settings, write the pipeline-level alert rule above, attach an Action Group, and stop. The native path is sufficient. Spend the time on data quality tests inside the pipelines instead.

For portfolios with mixed orchestration (ADF triggering Databricks, dbt running on a schedule, Power BI refreshing on top), or for pipelines under an SLA where stakeholders will notice within an hour: the per-pipeline KQL approach hits a maintenance wall around five to ten pipelines. The math flips — engineering hours spent on alert plumbing exceed the cost of a tool that does correlation, dedup, and root cause parsing out of the box. Decide based on portfolio size and stack diversity, not on whether the native path 'works'. It works. It just doesn't scale to the failure mode that wakes you up.

Frequently asked questions

How long does Azure Monitor take to fire an ADF alert?
Diagnostic logs land in Log Analytics within 2-5 minutes of pipeline completion. Alert rules evaluate on a schedule (1, 5, or 15 minutes), so end-to-end detection is typically 3-10 minutes after the failure. Activity Log alerts on the Microsoft.DataFactory provider can fire in under a minute but only cover service-level events, not pipeline run outcomes.
Can I monitor ADF pipeline activities, not just pipelines?
Yes, but only via Log Analytics. Enable the ActivityRuns diagnostic category in Diagnostic Settings, then query the ADFActivityRun table. The PipelineRuns table only reports pipeline-level Status, so a pipeline that completes with Status == 'Succeeded' will not surface activity errors unless you query ADFActivityRun directly.
What's the difference between ADF diagnostic logs and pipeline run history?
Pipeline run history is the view in ADF Studio. The service retains it for 45 days, and although it can be read through the REST API or SDK, it cannot be queried with KQL. Diagnostic logs are a separate stream you opt into via Diagnostic Settings, sent to Log Analytics, Storage, or Event Hubs. Only diagnostic logs support KQL queries, KQL-based alert rules, and retention beyond 45 days.
Do I need an Action Group for every alert?
One Action Group can serve many alert rules. Define it once with your email distribution list, Teams webhook, or ITSM connector, then attach it to each scheduled query rule. Splitting Action Groups by severity (Sev2 to on-call, Sev3 to channel) is more useful than splitting by pipeline.
Will Azure Monitor catch a Copy activity that succeeds but writes zero rows?
Not by default. The activity reports Succeeded because it executed without exception. You need a custom KQL query against ADFActivityRun that parses the Output JSON for rowsCopied or rowsRead, then alerts when the value falls below an expected threshold. This is per-activity work and does not generalize across a portfolio.
Can I alert on ADF failures in Microsoft Teams without a webhook?
Action Groups have no dedicated Teams action. The common routes are a Webhook action pointed at a Teams incoming webhook, or a Logic App action that posts to the channel. The raw Azure Monitor alert payload is not in the card format Teams expects, so in practice a Logic App or similar relay usually reshapes it before posting.
