Microsoft Fabric SLA Monitoring: Why Your Alerting Architecture Breaks Before Your Pipeline Does

Fabric gives you three layers of pipeline alerting — activity-level, item-level, workspace-level — and none of them natively answers "did the file arrive on time?"

Fabric's alerting covers what happened, not what didn't

A recurring question on the Fabric community forums captures a gap that most teams discover only after going to production: how do you get alerted when a file that should have arrived between 4 PM and 8 PM never shows up? Fabric's alerting primitives — scheduled pipeline failure notifications, Data Activator rules on job events, workspace-level KQL queries — all react to events that did occur. A pipeline fails, and you get an email. A pipeline succeeds in 90 minutes instead of 30, and an operations agent flags the anomaly. But if Event Grid never fires because the upstream system never dropped a file into ADLS, nothing in Fabric's native stack notices the absence.

This is the fundamental architectural mismatch. SLA monitoring is about the non-event: the file that didn't land, the pipeline that didn't trigger, the refresh that didn't start. Fabric's monitoring Eventhouse writes to the ItemJobEventLogs table, which records JobStatus values like Failed, Completed, and In progress. There is no row for a job that was expected but never began.
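To make the gap concrete, here is a minimal sketch (the pipeline name is hypothetical): a status query against ItemJobEventLogs simply returns zero rows when the pipeline never started, so there is nothing for a failure rule to match.

```kusto
// Today's runs of one pipeline, grouped by status. If the pipeline was
// expected but never triggered, this returns no rows at all -- there is
// no JobStatus value representing "expected but absent".
ItemJobEventLogs
| where Timestamp >= startofday(now())
| where ItemName == 'ingest_sales_daily'   // hypothetical pipeline name
| summarize Runs = count() by JobStatus
```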

The simplest failure notification — configuring email alerts under Home > Schedule > Failure notifications on a pipeline — only fires on explicit failures of scheduled runs. If your pipeline is event-triggered via Eventstream and the event never arrives, the pipeline never runs, and no failure is recorded. You learn about the problem the next morning when a stakeholder opens a Power BI report and sees yesterday's numbers.

Three alerting layers and the seams between them

Fabric offers alerting at three levels, each with distinct mechanics and blind spots.

Activity-level alerts use Outlook or Teams activities wired after specific pipeline activities. They fire inline during execution, so they are reliable for signaling completion of a critical stage. The limitation is scope: you must wire them into every pipeline manually, and they cannot detect a pipeline that never started.

Item-level Data Activator rules react to Fabric workspace item events — job succeeded, job failed, item created, item deleted. You create an Activator item, select Get Data > Job Events, pick a pipeline, and define a rule with an action (email, Teams message, or launching another Fabric item). This works well for individual pipelines, but the scope is strictly per-item: the Fabric documentation confirms that Data Activator events are scoped at the item level, so each pipeline must be selected individually. There is no wildcard or workspace-wide subscription through this path.

Workspace-level monitoring closes the scope gap. When you enable workspace monitoring, Fabric writes execution logs into an Eventhouse KQL database. You write a KQL query against ItemJobEventLogs, attach it to an Activator rule, and get alerts across every pipeline in the workspace from a single rule. The reference query filters where JobType == 'Pipeline' and JobStatus == 'Failed' within a rolling time window (SecondsAgo <= 540). This is powerful, but the polling interval matters: if Activator's polling interval is longer than your query's lookback window, events slip through between polls; if the window is much wider than the interval, the same failure matches repeated polls and fires duplicate alerts. Microsoft's own documentation warns that if Activator is paused or disabled, expected alerts will be missed.
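As a sketch, that workspace-wide rule query looks roughly like the following; the SecondsAgo derivation is an assumption about how the reference query computes its rolling window:

```kusto
// All pipeline failures logged in the last 9 minutes, across every
// pipeline in the workspace, from a single query.
ItemJobEventLogs
| extend SecondsAgo = datetime_diff('second', now(), Timestamp)
| where JobType == 'Pipeline' and JobStatus == 'Failed'
| where SecondsAgo <= 540
| project ItemName, JobStatus, Timestamp
```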

The seam between these layers is where SLA monitoring lives — and it is not covered by any of them out of the box.

SLA breach detection chain in Microsoft Fabric: ADLS file expected within SLA window → Event Grid fires a blob-created event (or doesn't) → Eventstream routes the event to the pipeline trigger → pipeline starts and logs to ItemJobEventLogs → KQL query leftanti-joins against FileArrivalLog → on a missing match, Activator polls the KQL queryset and fires the alert (email / Teams / pipeline) → breach logged to SLABreachLog for deduplication.

Building the missing-file clock with Event Grid and KQL

To detect a file that should have arrived but didn't, you need to build a timer outside Fabric's job event system. The architecture that the Fabric community converges on uses three components: Event Grid for file arrival detection, an Eventhouse table for logging arrivals, and a KQL query that checks for gaps.

Start by routing ADLS blob-created events through Event Grid into a Fabric Eventstream. Configure the Eventstream to write each event into a custom KQL table — call it FileArrivalLog — with columns for FileName, ContainerPath, ArrivalTimestamp, and EventId. This gives you a ledger of what actually arrived.
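A minimal sketch of that table definition, with column types assumed; the Eventstream destination is then mapped to write one row per blob-created event:

```kusto
// Hypothetical schema for the arrival ledger populated by Eventstream.
.create table FileArrivalLog (
    FileName: string,
    ContainerPath: string,
    ArrivalTimestamp: datetime,
    EventId: string
)
```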

Next, create a reference table — ExpectedFileSchedule — that defines your SLA contracts: which files are expected, in which container paths, and within what time windows. A row might specify that sales_daily_extract.parquet is expected in /landing/sales/ between 16:00 and 20:00 UTC every weekday.
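A sketch of the contract table, again with assumed names and types. A recurring window such as 16:00-20:00 on weekdays can be stored either as concrete datetimes that a small scheduled job rolls forward each day, or as time-of-day values the detection query converts; this sketch uses the former so it lines up with the query below:

```kusto
// Hypothetical SLA contract table: one row per expected file.
.create table ExpectedFileSchedule (
    FileName: string,
    ContainerPath: string,
    ExpectedAfter: datetime,   // start of today's SLA window
    ExpectedBy: datetime       // today's deadline
)

// Example row for the sales extract contract described above.
.ingest inline into table ExpectedFileSchedule <|
sales_daily_extract.parquet,/landing/sales/,2025-06-02T16:00:00Z,2025-06-02T20:00:00Z
```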

The detection query joins these two tables:

```kusto
let SLAWindow = ago(4h);
ExpectedFileSchedule
| where ExpectedBy <= now() and ExpectedAfter >= SLAWindow
| join kind=leftanti (
    FileArrivalLog
    | where ArrivalTimestamp between (SLAWindow .. now())
) on FileName, ContainerPath
| project FileName, ContainerPath, ExpectedBy, AlertTime=now()
```

This returns files that were expected within the SLA window but have no matching arrival record. Attach this query to an Activator rule polling every 10 minutes, and you have missing-file detection. The critical detail is the leftanti join — it produces rows only for the absence of a match, which is exactly the non-event you need to detect.

To avoid duplicate alerts (the original poster specifically asked for a single alert per SLA breach, not repeated notifications), add an SLABreachLog table. After each alert fires, write the breach to this table and filter against it with a second leftanti join in your detection query, as sketched below.
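A sketch of the extended query, assuming SLABreachLog mirrors the join keys and records a BreachTime column (both the table and its columns are hypothetical names from this article's design, not built-in Fabric objects):

```kusto
// Detection query extended with a second leftanti join: a breach that
// was already written to SLABreachLog is filtered out, so each SLA
// violation alerts exactly once.
let SLAWindow = ago(4h);
ExpectedFileSchedule
| where ExpectedBy <= now() and ExpectedAfter >= SLAWindow
| join kind=leftanti (
    FileArrivalLog
    | where ArrivalTimestamp between (SLAWindow .. now())
) on FileName, ContainerPath
| join kind=leftanti (
    SLABreachLog
    | where BreachTime >= SLAWindow   // hypothetical column
) on FileName, ContainerPath
| project FileName, ContainerPath, ExpectedBy, AlertTime=now()
```

The alert action then has to append the fired rows back into SLABreachLog, for example by having the Activator rule launch a small pipeline that writes them, which closes the deduplication loop.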

Silent trigger failures need a second watchdog

Missing files are one failure mode. A subtler one: the file arrives on time, Event Grid fires, but the Eventstream-to-pipeline trigger silently fails. This can happen when Eventstream experiences transient ingestion delays, when the pipeline's trigger condition doesn't match the event schema, or when capacity throttling pauses job execution.

Detecting this requires correlating two event streams: file arrivals in your FileArrivalLog and pipeline executions in ItemJobEventLogs. The query pattern is similar — a time-bounded leftanti join — but now you are joining on a shared key between the file event and the pipeline run. If your pipelines are named consistently (e.g., ingest_sales_daily), you can join on a derived key from FileName to ItemName.

```kusto
FileArrivalLog
| where ArrivalTimestamp between (ago(30m) .. ago(5m))
| extend ExpectedPipeline = strcat('ingest_', extract('(.+)_extract', 1, FileName))
| join kind=leftanti (
    ItemJobEventLogs
    | where JobType == 'Pipeline'
    | where Timestamp >= ago(30m)
    | project ItemName, Timestamp
) on $left.ExpectedPipeline == $right.ItemName
| project FileName, ArrivalTimestamp, ExpectedPipeline, AlertTime=now()
```

The ago(5m) buffer gives the pipeline time to start before flagging it as missing. Tune this based on your observed trigger latency. In workspaces with high concurrency, Fabric queues jobs, and the gap between event arrival and pipeline start can stretch to several minutes — the monitoring Eventhouse's JobStartTime vs. Timestamp columns reveal this drift over time.
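A sketch for quantifying that drift, assuming the JobStartTime and Timestamp columns behave as described:

```kusto
// Daily p50/p95 gap between the logged event time and the actual job
// start, to size the buffer that trigger-failure detection needs.
ItemJobEventLogs
| where JobType == 'Pipeline' and isnotnull(JobStartTime)
| extend QueueDelaySeconds = datetime_diff('second', JobStartTime, Timestamp)
| summarize p50 = percentile(QueueDelaySeconds, 50),
            p95 = percentile(QueueDelaySeconds, 95)
    by bin(Timestamp, 1d)
```

If the p95 approaches your detection buffer, widen the buffer before tightening anything else.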

The operations agent (currently in preview) can partially address this with natural-language playbooks, but it monitors ItemJobEventLogs only. It cannot see your custom FileArrivalLog table, so it cannot detect the correlation failure between file arrival and pipeline trigger. The watchdog query above fills that gap.

Polling intervals create an alerting uncertainty window

Every KQL-based Activator rule introduces latency. The chain is: event occurs → Eventhouse ingests the log row → Activator polls the KQL query → condition evaluates true → action dispatches (email, Teams, pipeline). Each link adds seconds to minutes.

Eventhouse ingestion latency for ItemJobEventLogs is typically under 60 seconds but can spike during capacity pressure. Activator's polling interval is configurable but defaults to 1 minute for KQL Queryset triggers. The KQL query's time window must be wider than the polling interval to avoid missing events between polls — Microsoft's documentation explicitly warns about this. If your query filters SecondsAgo <= 540 (9 minutes) and Activator polls every 60 seconds, you have comfortable overlap. But if you tighten the window to 120 seconds to reduce duplicates, a single slow poll means a missed alert.

For SLA monitoring specifically, this uncertainty window is less critical because SLA breaches are measured in hours, not seconds. A 5-minute detection delay on a file that was due by 8 PM is acceptable. Where it matters is in the trigger-failure scenario: if you are checking whether a pipeline started within 5 minutes of file arrival, and your alerting chain itself takes 3 minutes end-to-end, your effective detection buffer shrinks to 2 minutes.

The practical recommendation: set your Activator polling interval to 1 minute, your KQL query window to 10 minutes, and your trigger-failure detection buffer to 15 minutes. This gives you triple overlap — enough to absorb ingestion delays, Activator hiccups, and capacity-related job queuing. Log the detection timestamps to your breach table so you can measure actual alert latency over time and tighten the windows as you gain confidence in the system's behavior.
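Since the detection queries above already project both the SLA deadline and the alert time, the breach log doubles as the measurement source; a sketch, reusing the same hypothetical column names:

```kusto
// Weekly average and worst-case detection latency, measured from the
// SLA deadline to the moment the alert row was written.
SLABreachLog
| extend LatencyMinutes = datetime_diff('minute', AlertTime, ExpectedBy)
| summarize AvgLatency = avg(LatencyMinutes), MaxLatency = max(LatencyMinutes)
    by bin(AlertTime, 7d)
```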

MetricSign closes the non-event gap without custom KQL

The architecture described above works. It also requires you to maintain two custom KQL tables, a reference schedule, three detection queries, a deduplication log, and tuned Activator polling intervals — all of which are themselves unmonitored unless you build yet another layer.

MetricSign approaches this differently. Rather than reacting to execution events, it tracks expected pipeline completion times and dataset refresh windows, then emits a refresh_delayed signal when the expected event doesn't occur. The detection is clock-based, not event-based, which means it catches silent trigger failures, missing file arrivals, and paused capacities through a single mechanism. When a Fabric pipeline that normally completes by 6:30 PM hasn't produced a completion event by 7:00 PM, MetricSign surfaces the delay with root cause context — was it the pipeline that didn't start, or the upstream file that never arrived? — without requiring you to build and maintain the correlation queries yourself.

This matters most for teams running dozens of pipelines across multiple workspaces where the per-item scoping of Data Activator rules becomes unmanageable, and where the custom KQL approach demands ongoing tuning as pipeline counts and SLA contracts evolve. The monitoring infrastructure should not itself become a source of operational risk.
