Fabric ran green. The numbers are still wrong.
It is Wednesday, 06:47. Your Fabric capacity is humming at 84% CU usage. The Lakehouse pipeline that loads sales data from a Snowflake source completed at 06:00. The Power BI Direct Lake report on top refreshes a few minutes later. By 08:30 the CFO opens the report on her phone and the year-to-date figure looks low by €1.4M.
The pipeline did not fail. The capacity did not throttle. Workspace Monitoring shows zero errors. The report is still wrong.
What happened: the upstream Snowflake table dropped a partition during a Tuesday-night reorg, the Lakehouse copy step pulled in 320,000 fewer rows than its 30-day baseline, and Direct Lake served the smaller dataset to Power BI without complaint. Every native Fabric surface reported success.
This is the gap that the phrase 'Microsoft Fabric monitoring' tries to cover, and it is the gap most teams running production Fabric workloads hit within a quarter. The native tools do their job. They tell you what ran and how much capacity it used. They do not tell you whether the data is correct, whether Direct Lake is serving stale numbers, or whether the failure that just landed in your Snowflake job will reach Power BI in 90 minutes.
What Microsoft Fabric monitoring covers out of the box
Fabric ships three distinct monitoring surfaces. Each one has a clear job. None of them was designed to talk to the others.
The Monitoring Hub — real-time activity tracking
The Monitoring Hub is the operational view. Open it and you see every recently run item across the workspaces you have access to: pipelines, dataflows, semantic model refreshes, notebooks, Spark jobs. You can filter by status, item type, capacity, and time range, and click into any individual run to see its log.
It is good for two things: triaging what is running right now, and looking up why a specific item failed yesterday. It is not good for trend analysis, alerting, or anything that needs to combine data across items. Microsoft documents the Hub as a place to 'view and track Fabric activity', which is an honest description: it shows you activity but does not interpret it.
Workspace Monitoring — logs and metrics at workspace level
Workspace Monitoring is the deeper layer. When you enable it on a workspace, Fabric writes operational logs and metrics into a managed Eventhouse, queryable via KQL. You get item-level run history, query performance, and (depending on the item type) per-row diagnostic data, retained for 30 days.
This is where teams who are already comfortable with Kusto get a lot of value. You can write a KQL query that joins refresh failures with capacity events, alert on it through an Activator item (formerly called a Reflex), and route the alert into Teams. The catch is that you need somebody on the team who actually writes KQL, owns the alert definitions, and keeps them current as item types and log schemas change. For a smaller data team, that is one more system to maintain.
The Fabric Capacity Metrics App
The Capacity Metrics App is a Power BI app that reads telemetry from your Fabric capacity and shows CU consumption, throttling events, and item-level breakdowns over time. It is the only place where you can see how a specific Spark job or semantic model refresh contributed to a throttling event last Tuesday at 14:00. We will come back to its limits in a dedicated section.
How Microsoft Fabric capacity works under the hood
Most native monitoring questions in Fabric eventually become capacity questions. If you do not know how CU smoothing and bursting work, capacity alerts will either fire constantly or never fire at all.
SKUs, CU bursting and throttling explained
A Fabric capacity is sized in Capacity Units (CU). The SKU you buy — F2, F4, F8, all the way up to F2048 — sets a per-second CU budget. Items consume CU when they run. A heavy Spark notebook might burn through several minutes of your capacity budget in 30 seconds; a Direct Lake query barely registers.
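To make the budget arithmetic concrete, here is a minimal sketch. It assumes only that the number in an F-SKU name is its CU count (F64 means 64 CU, so 64 CU-seconds of budget per second). The function names and the job figures are hypothetical and chosen for illustration.

```python
def cu_per_second(sku: str) -> int:
    """Return the CU count encoded in an F-SKU name, e.g. 'F64' -> 64."""
    if not sku.upper().startswith("F") or not sku[1:].isdigit():
        raise ValueError(f"not an F-SKU name: {sku!r}")
    return int(sku[1:])


def budget_seconds_used(sku: str, cu_seconds_consumed: float) -> float:
    """How many seconds of steady-state budget a given consumption equals."""
    return cu_seconds_consumed / cu_per_second(sku)


# Hypothetical job: 9,600 CU-seconds consumed during a 30-second run on an F64.
used = budget_seconds_used("F64", 9_600)
print(f"{used:.0f} seconds of budget consumed in a 30-second run")
```

In this made-up case the job spends 150 seconds of budget, two and a half minutes, in half a minute of wall-clock time. That is the situation bursting and smoothing exist to absorb.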
Fabric does not stop you the moment you exceed your per-second budget. It uses two mechanisms to absorb spikes:
- Bursting lets a single operation use more than the per-second budget, so a 90-second Spark job can complete even if it temporarily uses 4× your steady-state CU.
- Smoothing spreads the cost of background operations (refreshes, scheduled pipelines) over a 24-hour window, so a heavy nightly load does not throttle interactive users at 09:00.
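A rough sketch of what smoothing does to a background job's cost. It assumes the cost is spread evenly over the 24-hour window and that usage is evaluated in 30-second timepoints. The real accounting has more detail than this, and the job size below is invented.

```python
TIMEPOINT_SECONDS = 30  # assumption: capacity usage is evaluated per 30-second timepoint


def smoothed_cu_seconds_per_timepoint(total_cu_seconds: float,
                                      window_hours: float = 24.0) -> float:
    """Spread a background job's total cost evenly across the smoothing window."""
    timepoints = int(window_hours * 3600 / TIMEPOINT_SECONDS)
    return total_cu_seconds / timepoints


def share_of_budget(sku_cu: int, total_cu_seconds: float,
                    window_hours: float = 24.0) -> float:
    """Fraction of each timepoint's budget that the smoothed job occupies."""
    per_timepoint = smoothed_cu_seconds_per_timepoint(total_cu_seconds, window_hours)
    return per_timepoint / (sku_cu * TIMEPOINT_SECONDS)


# Hypothetical nightly load: 691,200 CU-seconds on a 64-CU capacity.
per_tp = smoothed_cu_seconds_per_timepoint(691_200)
print(per_tp, share_of_budget(64, 691_200))
```

Under these assumptions the load occupies 240 CU-seconds of every timepoint, or 12.5% of the budget all day, instead of saturating the capacity while it runs.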
When smoothed consumption stays above 100% of the SKU long enough, throttling kicks in. Microsoft documents three throttling stages: interactive delay (new interactive operations are delayed by 20 seconds at submission), interactive rejection (new interactive operations are refused), and background rejection (scheduled jobs such as refreshes are refused as well). The order matters because users feel interactive delay first: the dashboard gets slow before refreshes start failing.
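The stages can be expressed as a small lookup. The thresholds below reflect Microsoft's throttling documentation as understood at the time of writing: up to 10 minutes of future capacity consumed is absorbed as overage protection, up to 60 minutes triggers interactive delay, up to 24 hours triggers interactive rejection, and beyond that background rejection applies. Treat the numbers as assumptions and check them against the current documentation. The function name is hypothetical.

```python
def throttling_stage(future_minutes_consumed: float) -> str:
    """Map minutes of future capacity already consumed to a throttling stage."""
    if future_minutes_consumed <= 10:
        return "overage protection"      # no user-visible throttling yet
    if future_minutes_consumed <= 60:
        return "interactive delay"       # interactive operations delayed at submission
    if future_minutes_consumed <= 24 * 60:
        return "interactive rejection"   # interactive operations refused
    return "background rejection"        # scheduled jobs refused as well


for minutes in (5, 30, 240, 1500):
    print(minutes, "->", throttling_stage(minutes))
```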
Fabric vs Synapse: what changed for monitoring?
If you are migrating from Synapse, the mental model is different. Synapse dedicated SQL pools billed by provisioned DWU, serverless SQL pools billed by the data each query processed, and Azure Monitor with diagnostic settings handled most of the observability layer. Fabric runs every workload's compute against a single shared capacity, so there is no DWU dial to turn and no per-query cost view in the Azure portal.
The monitoring consequence is that the patterns you knew from Synapse — set up Log Analytics, write a KQL alert on long-running queries, page on-call — translate to Fabric only if you enable Workspace Monitoring and rebuild the alerts there. Azure Monitor does not see inside a Fabric capacity. For migration teams this is the single biggest gap to plan for.
Fabric capacity metrics: what gets tracked (and what doesn't)
The Capacity Metrics App is the most-cited native tool for Fabric monitoring. It is also the source of more frustrated Reddit threads than any other piece of the platform. Both reputations are earned.
The Fabric Capacity Metrics App: setup, limits and workarounds
Setup is straightforward. Install the app from AppSource, point it at a capacity ID, and grant the necessary permissions. After about 30 minutes you have a Power BI report with three pages: a multi-metric overview, a timepoint detail view that shows what was running during a throttling event, and a compute-by-item breakdown.
What it does well: it is the only built-in surface that connects a specific item run to a specific throttling stage. If your capacity hit interactive rejection at 14:07, the timepoint view tells you which Spark notebook and which dataset refresh were drawing CU at that moment.
What it does not do:
- No alerting. The app is a report, not an alert engine. You can pin a tile and visit it twice a day, but there is no built-in path from 'capacity at 95% for 10 minutes' to a Teams message.
- No trend tracking beyond 14 days. The detail view caps at 14 days of history; the multi-metric overview goes further but with less granularity. Long-term capacity planning requires exporting the data or piping the underlying logs into Log Analytics.
- No cross-item lineage. The app shows you that a Spark notebook ran for 4 minutes and consumed 320 CU-seconds. It does not show you which downstream semantic models or Direct Lake reports depend on that notebook's output.
- Roughly 30 minutes of lag. Telemetry takes time to land in the app's underlying dataset. For incident response, 30 minutes is the difference between catching a problem and reading about it in Slack.
Blind spots — what the native tools do not tell you
Three gaps recur across the three tools above:
- Data correctness. No native surface flags zero-row loads, schema drift, or volume anomalies. A Lakehouse copy that loads 0 rows is a successful run.
- Cross-stack lineage. If your data starts in Snowflake or Azure Data Factory and ends in a Direct Lake report, no Fabric tool sees the upstream half of the journey.
- Proactive routing. Native tools wait for you to look at them. There is no built-in concept of 'page the on-call data engineer when this specific dataset has not refreshed by 06:30'.
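The 'has not refreshed by 06:30' rule in the last bullet is simple to state in code. This is a minimal sketch, not a Fabric feature. It assumes you can already obtain the last successful refresh time from somewhere (for example, a query against Workspace Monitoring) and that all timestamps share one timezone. The function name is hypothetical.

```python
from datetime import datetime, time
from typing import Optional


def missed_deadline(last_refresh: Optional[datetime], now: datetime,
                    deadline: time) -> bool:
    """True if today's deadline has passed and no refresh has landed today."""
    if now.time() < deadline:
        return False  # too early to judge today's run
    return last_refresh is None or last_refresh.date() < now.date()


# Hypothetical check at 06:47 on a Wednesday; last refresh was Tuesday 06:00.
now = datetime(2025, 1, 8, 6, 47)
print(missed_deadline(datetime(2025, 1, 7, 6, 0), now, time(6, 30)))
```

A scheduler would run this shortly after the deadline and hand a positive result to whatever does the paging.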
Building a monitoring strategy that actually works
A working Fabric monitoring strategy assumes the native tools are inputs, not the answer. The job is to combine them with three things they do not provide on their own: alerting, cross-stack lineage, and freshness checks.
Combining native metrics with external observability
A pattern that works for most mid-market data teams running Fabric:
- Use the Monitoring Hub for live triage during incidents. Keep it open in a tab, do not try to make it your dashboard.
- Enable Workspace Monitoring on every production workspace and pipe the logs into a Log Analytics workspace if you need retention beyond 30 days. Microsoft documents the diagnostic settings path; it is one config change per workspace.
- Use the Capacity Metrics App for weekly capacity reviews and post-incident analysis. Do not rely on it for real-time alerting.
- Layer an external observability tool (or a homegrown KQL alerting setup if you have a Kusto specialist) on top, with two specific jobs: detect anomalies the native tools cannot see, and route alerts into the channel your team actually reads.
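As an illustration of the anomaly detection in the last item, here is a minimal volume check against a 30-day baseline. It uses a plain z-score test, which is one simple choice among several and is not taken from any Fabric tool. The row counts are invented, and the function name and default threshold are assumptions.

```python
from statistics import mean, stdev


def volume_anomaly(history: list[int], today: int,
                   z_threshold: float = 3.0) -> tuple[bool, float]:
    """Compare today's row count with a baseline of at least two prior loads.

    Returns (is_anomaly, deviation_from_baseline_mean).
    """
    baseline = mean(history)
    deviation = today - baseline
    if today == 0:
        return True, deviation  # a zero-row load is always worth flagging
    spread = stdev(history)
    if spread == 0:
        return deviation != 0, deviation
    return abs(deviation / spread) > z_threshold, deviation


# Hypothetical 30-day history around 1.5M rows, then a load 320,000 rows short.
history = [1_500_000 + (i % 5 - 2) * 4_000 for i in range(30)]
flagged, deviation = volume_anomaly(history, 1_180_000)
print(flagged, f"{deviation:,.0f}")
```

The check runs after the load step and before anything downstream reads the table, so a pipeline that "succeeded" with far fewer rows than usual is caught as a data problem rather than passed along as a success.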
Alerting on capacity overages before users notice
The window between 'capacity hits 95% for 10 minutes' and 'users complain that Power BI is slow' is where alerting earns its keep. A workable threshold rule looks like this:
- Warn at 85% smoothed CU sustained over 10 minutes (gives you time to investigate).
- Page at 95% smoothed CU sustained over 5 minutes (interactive delay is imminent).
- Always include the top 3 items by CU draw in the alert payload, so the responder does not have to open the Capacity Metrics App from scratch.
This is not a feature any native tool ships with. It is a rule you build, either in Workspace Monitoring's KQL alerting layer or in an external tool that reads the same telemetry.
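The rule above can be sketched in a few lines. This version assumes you already have a time-ordered series of smoothed CU utilization samples and a per-item CU breakdown from some telemetry source. The `Sample` type, the function names, and most of the item names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    minute: float  # minutes since the start of the series
    pct: float     # smoothed CU utilization as a percentage of the SKU


def sustained(samples: list[Sample], threshold: float, window_minutes: float) -> bool:
    """True if every sample in the trailing window is at or above the threshold."""
    if not samples:
        return False
    window_start = samples[-1].minute - window_minutes
    if samples[0].minute > window_start:
        return False  # not enough history to cover the whole window
    return all(s.pct >= threshold for s in samples if s.minute >= window_start)


def evaluate(samples: list[Sample], item_cu: dict[str, float]) -> Optional[dict]:
    """Return an alert payload, or None when neither rule matches."""
    if sustained(samples, 95, 5):
        severity = "page"
    elif sustained(samples, 85, 10):
        severity = "warn"
    else:
        return None
    top_items = sorted(item_cu.items(), key=lambda kv: kv[1], reverse=True)[:3]
    return {"severity": severity, "current_pct": samples[-1].pct, "top_items": top_items}


# Hypothetical data: one sample per minute, above 95% for the last 5 minutes.
samples = [Sample(m, 90.0 if m < 5 else 96.0) for m in range(11)]
items = {"nb-sales-aggregation": 480.0, "sm-finance-refresh": 310.0,
         "pl-sales-load": 120.0, "df-customers": 40.0}
alert = evaluate(samples, items)
print(alert["severity"], [name for name, _ in alert["top_items"]])
```

Requiring the samples to span the whole window keeps a single spike from paging anyone, and checking the page rule first means a capacity that satisfies both rules produces one alert rather than two.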
How MetricSign closes the monitoring gap
MetricSign sits above Fabric and the rest of your stack, reads the telemetry from each layer, and connects technical events to business impact. For Fabric specifically, it watches three things the native tools do not surface together.
Capacity events with downstream context. When a capacity hits the interactive rejection threshold, MetricSign cross-references which Direct Lake reports are served by that capacity and which executives have those reports pinned. The alert reads: 'F32 capacity at 96% for 7 minutes — 4 reports affected, including the CFO weekly dashboard. Top draw: notebook nb-sales-aggregation, 480 CU-seconds in the last 5 minutes.'
Volume and freshness anomalies on Lakehouse loads. MetricSign tracks a 30-day rolling baseline for row counts and load timestamps on Lakehouse and Warehouse tables. When the Tuesday-night Snowflake reorg drops 320,000 rows from a feed, MetricSign flags the deviation before the 06:00 Direct Lake refresh runs. The alert names the source table, the variance, and the downstream reports.
Cross-stack lineage from source to report. If the failure starts in Azure Data Factory or Snowflake or Databricks, MetricSign traces it forward through dbt models and Lakehouse copies into the semantic model and the Direct Lake report. One alert names the root cause and the affected reports, instead of three separate notifications from three tools that do not know about each other.
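The underlying idea, stripped of any product specifics, is a forward walk over a dependency graph. The sketch below is a generic breadth-first traversal and is not MetricSign's implementation. The graph, the node names, and the `report.` prefix convention are all hypothetical.

```python
from collections import deque


def downstream(graph: dict[str, list[str]], root: str) -> list[str]:
    """Return every node reachable from root, in breadth-first order."""
    seen = {root}
    order: list[str] = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hypothetical lineage: source table -> Lakehouse copy -> semantic model -> reports.
lineage = {
    "snowflake.sales_raw": ["lakehouse.sales_copy"],
    "lakehouse.sales_copy": ["semantic_model.sales"],
    "semantic_model.sales": ["report.cfo_weekly", "report.regional_sales"],
}
affected = downstream(lineage, "snowflake.sales_raw")
print([n for n in affected if n.startswith("report.")])
```

Starting the walk at the failing source yields the list of affected reports, which is what lets one alert name both the root cause and its downstream impact.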
Alerts route to Teams, Slack, Telegram, or PagerDuty. Setup connects to a Fabric tenant in roughly 15 minutes via a service principal — no agents, no pipeline rewrites. See end-to-end data lineage from ADF to Power BI for how the lineage layer works in practice.
Where this approach falls short
Two honest limitations.
Workspace Monitoring's KQL endpoint is the deepest native option Fabric ships. If you have a Kusto specialist on the team, a security model that keeps log data inside Fabric, and the patience to build alerting on top of KQL, you can get most of the way without an external tool. MetricSign helps when you do not have that person, when lineage needs to cross out of Fabric (into Snowflake, ADF, or Databricks), or when alert routing into Slack/Teams/Telegram has to happen before the next refresh window.
MetricSign also does not see inside Spark notebook code. If a notebook silently writes wrong values because of a bug in the transformation logic, no monitoring tool — native or external — catches that without explicit data quality tests in the pipeline. Volume and freshness anomaly detection catches a different class of problem.
Fabric monitoring is not one product missing one feature. It is three native surfaces that each cover a slice and assume you will stitch them together with KQL and manual triage. For some teams that stitching is the right answer. For most data teams running mixed-stack pipelines into Fabric, the missing piece is the connector between technical events inside Fabric and business impact two layers downstream.
