What a data monitoring system actually is
Cut through the definitions and a data monitoring system comes down to this: the combined tools, telemetry, alert rules, and human processes that let your team detect data problems before the people downstream do.
The word "system" matters. A single dashboard isn't a system. A Slack alert without an owner isn't a system. A monitoring tool plugged in but never tuned isn't a system. A system implies parts that fit together: detection, routing, response, and learning.
Most teams have one or two parts and call it a system. The result: lots of noise, slow detection, and incidents that still reach the CFO before the data team.
According to a Wakefield Research / Monte Carlo industry survey, organisations average 70 data incidents per 1,000 tables per year, and data engineers spend 40% of their time dealing with quality issues. (Monte Carlo, 2024)
The four layers of a working data monitoring system
A data monitoring system that earns its keep covers four layers. Skip one and the whole thing leaks.
| Layer | Question it answers | Typical signals |
|---|---|---|
| 1. Pipeline health | Did the job run? | Job status, run duration, retry count |
| 2. Data health | Is the data correct? | Freshness, row count, schema, distribution |
| 3. Business impact | Does anyone care? | Downstream lineage, dataset criticality, SLA tier |
| 4. Resolution | Who fixes it, how, by when? | Alert routing, runbook links, MTTR tracking |
Layer 1 — Pipeline health is what most teams build first. Airflow, ADF, Databricks Jobs all expose run status. Set up a Slack webhook, you have layer 1.
Layer 2 — Data health is where most teams stall. dbt tests cover known rules. Layer 2 needs more: detection of schema drift, freshness lag, volume anomalies. This is where data observability tooling lives.
Layer 3 — Business impact is the layer almost everyone skips. Without it, every alert looks the same. The dataset that drives the CFO's daily revenue dashboard fires the same noise level as a backfill job nobody depends on. Result: alert fatigue.
Layer 4 — Resolution is the layer that turns alerts into fixes. An alert with no owner and no runbook is just decoration. Track median time to detect, time to resolve, and incidents that reach business users.
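The layer 2 signals named above (freshness lag, volume anomalies, schema drift) can each be expressed as a small check. The following is a minimal Python sketch, not any particular tool's implementation: the function names, the 50% volume tolerance, and the column-to-type dictionaries are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone
from statistics import median


def is_stale(last_loaded_at: datetime, now: datetime, max_lag: timedelta) -> bool:
    """Freshness: True when the last successful load is older than max_lag."""
    return now - last_loaded_at > max_lag


def volume_is_anomalous(history: list[int], today: int, tolerance: float = 0.5) -> bool:
    """Volume: True when today's row count deviates from the median of
    recent runs by more than `tolerance` (a fraction; 0.5 is an assumed default)."""
    baseline = median(history)
    if baseline == 0:
        return today != 0
    return abs(today - baseline) / baseline > tolerance


def schema_drift(expected: dict[str, str], actual: dict[str, str]) -> dict[str, list[str]]:
    """Schema: compare column-name -> type mappings and report the differences."""
    shared = set(expected) & set(actual)
    return {
        "added": sorted(set(actual) - set(expected)),
        "removed": sorted(set(expected) - set(actual)),
        "retyped": sorted(c for c in shared if expected[c] != actual[c]),
    }


# Hypothetical example values for one dataset.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
stale = is_stale(now - timedelta(hours=3), now, max_lag=timedelta(hours=2))
volume_alert = volume_is_anomalous([1000, 1020, 980], today=300)
drift = schema_drift(
    {"id": "int", "amount": "float"},
    {"id": "int", "amount": "str", "currency": "str"},
)
```

None of these checks needs a hand-written business rule per dataset, which is what separates them from a dbt test. A production version would likely replace the fixed tolerance with a baseline learned per dataset.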
Why most data monitoring systems leak
We've seen the same pattern across dozens of data teams. The leak almost always happens in one of three places.
Leak 1: Building layer 2 without layer 3
A team adopts a data observability tool and onboards 500 datasets at once. Day 2: the team has 500 alerts. Day 3: they mute the channel. The tool itself wasn't wrong. The team never defined which datasets are business-critical, so every alert weighs the same.
Leak 2: Treating data tests as monitoring
dbt tests are excellent for catching known issues. They are not monitoring. A test only catches what you wrote a rule for. A monitoring system needs to detect the unknown unknowns: the kinds of failure you didn't think to write a rule for.
Leak 3: Stopping at detection
A Slack alert that nobody owns gets read once and forgotten. Without a runbook, a severity level, and routing rules, layer 4 is missing, and the system is just a notifier, not a monitor.
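Leaks 1 and 3 both come down to alerts that carry no tier and no owner. As a rough sketch of what routing rules can look like, the Python below sends an alert to a different destination depending on its dataset tier, and refuses to route one that has no owner or runbook. The `Alert` fields, the destination names, and the example URL are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    dataset: str
    message: str
    tier: int                          # 1, 2, or 3, assigned by people, not inferred
    owner: Optional[str] = None
    runbook_url: Optional[str] = None


def route(alert: Alert) -> str:
    """Return the destination for an alert (destination names are assumptions)."""
    # An alert nobody owns is a notification, not monitoring: send it back
    # to be fixed instead of letting it add noise to a channel.
    if alert.owner is None or alert.runbook_url is None:
        return "fix-alert-definition"
    if alert.tier == 1:
        return "page-on-call"
    if alert.tier == 2:
        return "team-channel"
    return "daily-digest"


revenue_alert = Alert(
    dataset="daily_revenue",
    message="freshness lag over 2 hours",
    tier=1,
    owner="finance-data",
    runbook_url="https://example.com/runbooks/daily_revenue",
)
destination = route(revenue_alert)
```

The design choice worth noting is that tier, not the check that fired, decides how loud the alert is. A backfill job and the revenue dataset can fail the same freshness check and still land in very different places.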
"You don't have a monitoring system; you have a notification system." — paraphrased from an SRE workshop discussion at SREcon 2023.
How to build one that works (in this order)
If you're starting from scratch or rebuilding an existing system, this is the order that produces results fastest. Getting the sequence wrong is the most common reason monitoring systems fail early.
Week 1: Inventory and tier your datasets
Make a list of every production dataset, table, and dashboard. Rate each one tier 1 (the CFO sees it), tier 2 (an operational team uses it), or tier 3 (everything else). This is layer 3, done before layer 1 or 2 even exists. Without it, you'll repeat Leak 1 above.
Weeks 2-3: Pipeline health (layer 1)
Wire up basic run-status alerts for tier 1 and tier 2 pipelines. Native tool dashboards plus a Slack channel are enough at this stage. Don't try to be clever yet.
Weeks 4-6: Data health on tier 1 only (layer 2)
Introduce data quality checks (dbt tests, Great Expectations) and freshness/volume monitoring on tier 1 datasets. Resist the urge to onboard everything. As a rule of thumb, the 20 datasets that drive board-visible numbers cover 80% of business risk.
Weeks 7-8: Routing and runbooks (layer 4)
For every alert that exists, add an owner, a severity, and a one-paragraph runbook. Alerts without those three are downgraded or removed.
Week 9 onwards: Expand
Now add tier 2 datasets to layer 2. Tune thresholds based on real incidents. Track three numbers monthly: median time to detect, median time to resolve, and the number of incidents that reached a business user before your team.
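The rollout order above follows mechanically from the week 1 inventory. As an illustration only, the Python sketch below turns a hand-labelled inventory into the list of datasets each phase should cover. The `audience` labels, dataset names, and phase keys are assumptions made for the example; the tier definitions are the ones given in week 1.

```python
from dataclasses import dataclass

# Assumed labels, filled in by people who know who consumes each dataset.
TIER_BY_AUDIENCE = {"executive": 1, "operational": 2}


@dataclass(frozen=True)
class Dataset:
    name: str
    audience: str  # "executive", "operational", or anything else


def tier_of(ds: Dataset) -> int:
    """Tier 1 if executives see it, tier 2 if an ops team uses it, else tier 3."""
    return TIER_BY_AUDIENCE.get(ds.audience, 3)


def rollout_plan(inventory: list[Dataset]) -> dict[str, list[str]]:
    """Map each rollout phase to the datasets it should cover."""
    tiers = {ds.name: tier_of(ds) for ds in inventory}
    return {
        "weeks_2_3_pipeline_alerts": sorted(n for n, t in tiers.items() if t <= 2),
        "weeks_4_6_data_health": sorted(n for n, t in tiers.items() if t == 1),
        "week_9_onwards_data_health": sorted(n for n, t in tiers.items() if t == 2),
    }


inventory = [
    Dataset("daily_revenue", "executive"),
    Dataset("warehouse_stock", "operational"),
    Dataset("adhoc_backfill", "unknown"),
]
plan = rollout_plan(inventory)
```

Tier 3 datasets appear in no phase at all, which is the point: they stay unmonitored at layer 2 until the higher tiers are quiet and owned.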
What to put in (and leave out of) your stack
Tool choice depends on your stack shape. Here's a starting kit by stack type.
| Stack shape | Layer 1 (pipeline) | Layer 2 (data) | Layer 3-4 (impact + resolution) |
|---|---|---|---|
| Modern data stack (Snowflake + dbt + Looker) | Native dbt Cloud + Snowflake alerts | Elementary or Soda | Slack + custom routing |
| Microsoft (Power BI + ADF + Fabric) | Azure Monitor + native alerts | MetricSign | MetricSign routing + PagerDuty |
| Hybrid (Databricks + dbt + multiple BI) | Native + custom | Soda or MetricSign | Slack + ownership matrix |
| Small / single-tool | Native alerts only | dbt tests | Single Slack channel + on-call rotation |
What to leave out: anything that promises to do all four layers without configuration. Layer 3 (business impact) cannot be inferred from the data — it needs human input on which datasets matter to whom.
Metrics that tell you the system is working
Three numbers, tracked monthly, will tell you whether your data monitoring system is earning its keep.
Median time to detect (MTTD)
From "thing broke" to "first alert fired". Track it per severity. A healthy MTTD on tier 1 datasets is under 10 minutes. Anything over an hour means your monitoring is reactive, not proactive.
Median time to resolve (MTTR)
From "first alert fired" to "incident closed". This depends mostly on layer 4 quality (runbooks, ownership). A healthy MTTR on tier 1 is under 2 hours.
Incidents-reaching-business count
The single most important number. How many incidents per month does a business user notice before your team does? If this isn't trending toward zero, the system is leaking somewhere.
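All three numbers can be computed from a plain incident log. The following Python sketch assumes each incident records four timestamps; the `Incident` fields and the sample data are hypothetical, and the definitions follow the ones given above (MTTD from breakage to first alert, MTTR from first alert to close).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Incident:
    broke_at: datetime
    first_alert_at: datetime
    closed_at: datetime
    business_noticed_at: Optional[datetime] = None  # None if no user reported it


def _minutes(delta: timedelta) -> float:
    return delta.total_seconds() / 60


def monthly_metrics(incidents: list[Incident]) -> dict[str, float]:
    """Median time to detect, median time to resolve, and the count of
    incidents a business user noticed before the first alert fired."""
    if not incidents:
        raise ValueError("no incidents recorded for this period")
    return {
        "mttd_minutes": median(_minutes(i.first_alert_at - i.broke_at) for i in incidents),
        "mttr_minutes": median(_minutes(i.closed_at - i.first_alert_at) for i in incidents),
        "reached_business": sum(
            1
            for i in incidents
            if i.business_noticed_at is not None
            and i.business_noticed_at < i.first_alert_at
        ),
    }


day = datetime(2024, 3, 1)
incidents = [
    Incident(day.replace(hour=9), day.replace(hour=9, minute=5), day.replace(hour=10, minute=5)),
    Incident(
        day.replace(hour=9),
        day.replace(hour=9, minute=15),
        day.replace(hour=12, minute=15),
        business_noticed_at=day.replace(hour=9, minute=10),
    ),
    Incident(day.replace(hour=9), day.replace(hour=9, minute=10), day.replace(hour=9, minute=40)),
]
metrics = monthly_metrics(incidents)
```

Using the median instead of the mean keeps one multi-day outage from hiding an otherwise healthy month. To track per severity or per tier, filter the incident list before calling `monthly_metrics`.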
According to IDC research, data engineers spend up to 30% of their time on incident triage and root cause work. (IDC, 2023) The point of a data monitoring system isn't to find more incidents. It's to push that number down.
Where MetricSign fits
If your stack is built around the Microsoft data ecosystem (Power BI, ADF, Fabric) alongside tools such as Databricks, dbt, and Snowflake, MetricSign is the layer 2 and layer 3 piece of your data monitoring system.
It watches dataset freshness, volume, schema, and distribution across the full stack. It models lineage between tools so a failure in ADF surfaces as an alert on the Power BI dashboard that depends on it. And it lets you tier datasets so tier-1 incidents page on-call while tier-3 incidents go to a digest.
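To illustrate the general idea of lineage-based alerting, and not MetricSign's actual API or internals, the Python sketch below walks a dependency graph from a failed asset to everything downstream of it. The asset names and the `DOWNSTREAM` mapping are invented for the example.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets that read from it.
DOWNSTREAM = {
    "adf.copy_sales": ["snowflake.sales_raw"],
    "snowflake.sales_raw": ["dbt.sales_daily"],
    "dbt.sales_daily": ["powerbi.revenue_dashboard", "powerbi.ops_report"],
}


def impacted_assets(failed: str, downstream: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every asset downstream of `failed`,
    nearest dependents first."""
    seen = {failed}
    impacted = []
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child not in seen:  # guards against cycles and duplicate paths
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


affected = impacted_assets("adf.copy_sales", DOWNSTREAM)
```

With a list like `affected` in hand, an alert on the ADF pipeline can be re-addressed to the owners of the dashboards it feeds, and its severity can be taken from the highest tier among them.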
Layer 1 stays in your existing tooling (Azure Monitor, native dashboards). Layer 4 connects to your existing on-call process (Slack, PagerDuty, Teams). MetricSign covers the layers most teams build last and worst.
