Data Observability · 9 min read

Data Monitoring System: What It Is, What It Isn't, and How to Build One That Works

Most data monitoring systems are a Slack channel, a few cron jobs, and hope. The teams that ship reliable data build all four layers described below, in a deliberate order.

What a data monitoring system actually is

Cut through the definitions and a data monitoring system comes down to this: the combined tools, telemetry, alert rules, and human processes that let your team detect data problems before the people downstream do.

The word "system" matters. A single dashboard isn't a system. A Slack alert without an owner isn't a system. A monitoring tool plugged in but never tuned isn't a system. A system implies parts that fit together: detection, routing, response, and learning.

Most teams have one or two parts and call it a system. The result: lots of noise, slow detection, and incidents that still reach the CFO before the data team.

According to a Wakefield Research / Monte Carlo industry survey, organisations average 70 data incidents per 1,000 tables per year, and data engineers spend 40% of their time dealing with quality issues. (Monte Carlo, 2024)

The four layers of a working data monitoring system

A data monitoring system that earns its keep covers four layers. Skip one and the whole thing leaks.

Layer | Question it answers | Typical signals
1. Pipeline health | Did the job run? | Job status, run duration, retry count
2. Data health | Is the data correct? | Freshness, row count, schema, distribution
3. Business impact | Does anyone care? | Downstream lineage, dataset criticality, SLA tier
4. Resolution | Who fixes it, how, by when? | Alert routing, runbook links, MTTR tracking

Layer 1, pipeline health, is what most teams build first. Airflow, ADF, and Databricks Jobs all expose run status. Set up a Slack webhook and you have layer 1.

Layer 2, data health, is where most teams stall. dbt tests cover known rules. Layer 2 needs more: detection of schema drift, freshness lag, and volume anomalies. This is where data observability tooling lives.

Layer 3, business impact, is the layer almost everyone skips. Without it, every alert looks the same. The dataset that drives the CFO's daily revenue dashboard fires at the same noise level as a backfill job nobody depends on. Result: alert fatigue.

Layer 4, resolution, is the layer that turns alerts into fixes. An alert with no owner and no runbook is just decoration. Track median time to detect, median time to resolve, and the number of incidents that reach business users.
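To make the layer 3 point concrete, here is a minimal sketch of how the same layer 2 signal can end up in different places once a dataset has a tier. The `Alert` class, the `ROUTES` mapping, the channel names, and the choice to treat untiered datasets as tier 3 are all assumptions made for illustration, not part of any particular tool.

```python
from dataclasses import dataclass

# Hypothetical tier-to-destination mapping. Tiers follow the article's
# convention (1 = most critical); the destination names are placeholders.
ROUTES = {1: "page-on-call", 2: "team-channel", 3: "daily-digest"}


@dataclass(frozen=True)
class Alert:
    dataset: str
    signal: str  # e.g. "freshness", "volume", "schema"


def route(alert: Alert, tiers: dict[str, int]) -> str:
    """Pick a destination from the dataset's tier, not from the signal."""
    # Assumption: a dataset nobody has tiered yet is treated as tier 3.
    tier = tiers.get(alert.dataset, 3)
    return ROUTES[tier]


tiers = {"finance.daily_revenue": 1, "ops.backfill_2019": 3}

# Same signal, different business impact, different destination.
print(route(Alert("finance.daily_revenue", "freshness"), tiers))
print(route(Alert("ops.backfill_2019", "freshness"), tiers))
```

The tier lookup is the only thing separating a page from a digest entry here, which is why it has to come from people who know who depends on what.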

Why most data monitoring systems leak

We've seen the same pattern across dozens of data teams. The leak almost always happens in one of three places.

Leak 1: Building layer 2 without layer 3

A team adopts a data observability tool and onboards 500 datasets at once. Day 2: the team has 500 alerts. Day 3: they mute the channel. The tool itself wasn't wrong. The team didn't define which datasets are business-critical, so every alert weighs the same.

Leak 2: Treating data tests as monitoring

dbt tests are excellent for catching known issues. They are not monitoring. A test only catches what you wrote a rule for. A monitoring system needs to detect the unknown unknowns: the kinds of failure you didn't think to write a rule for.
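The gap between a test and a monitor can be sketched in a few lines. Below, a fixed rule ("the table is not empty") passes on a day when half the rows are missing, while a simple baseline comparison flags it. The z-score approach, the threshold of 3, and the sample row counts are illustrative assumptions, not what any specific observability tool does under the hood.

```python
from statistics import mean, stdev


def static_rule_ok(row_count: int) -> bool:
    """A test-style rule: passes whenever the table is not empty."""
    return row_count > 0


def volume_is_anomalous(history: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """Flag today's row count if it sits far outside the recent baseline."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is unusual.
        return today != mu
    return abs(today - mu) / sigma > z_threshold


# Made-up daily row counts for the past week.
history = [10_120, 9_980, 10_050, 10_200, 9_900, 10_075, 10_010]
today = 4_800  # roughly half the usual volume

print(static_rule_ok(today))                # the written rule still passes
print(volume_is_anomalous(history, today))  # the baseline check does not
```

Nobody wrote a rule saying "row count should be around 10,000". The baseline learns it from history, which is the point of layer 2 going beyond tests.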

Leak 3: Stopping at detection

A Slack alert that nobody owns gets read once and forgotten. Without a runbook, a severity level, and routing rules, layer 4 is missing, and the system is just a notifier, not a monitor.

"You don't have a monitoring system; you have a notification system." — paraphrased from an SRE workshop discussion at SREcon 2023.

How to build one that works (in this order)

If you're starting from scratch or rebuilding an existing system, this is the order that produces results fastest. Getting the sequence wrong is the most common reason monitoring systems fail early.

Week 1: Inventory and tier your datasets

Make a list of every production dataset, table, and dashboard. Assign each one a tier: tier 1 (the CFO sees it), tier 2 (an operational team uses it), or tier 3 (everything else). This is layer 3, done before layer 1 or 2 even exists. Without it, you'll repeat Leak 1 above.

Weeks 2-3: Pipeline health (layer 1)

Wire up basic run-status alerts for tier 1 and tier 2 pipelines. Native tool dashboards plus a Slack channel are enough at this stage. Don't try to be clever yet.

Weeks 4-6: Data health on tier 1 only (layer 2)

Introduce data quality checks (dbt tests, Great Expectations) and freshness/volume monitoring on tier 1 datasets. Resist the urge to onboard everything. As a rule of thumb, the 20 or so datasets that drive board-visible numbers account for around 80% of business risk.
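A freshness check scoped to tier 1 could look roughly like this. The function name `stale_tier1`, the two-hour lag limit, and the dataset names are assumptions for the sketch. In practice the last-load timestamps would come from your warehouse metadata, which is outside what this example covers.

```python
from datetime import datetime, timedelta, timezone


def stale_tier1(
    last_loaded: dict[str, datetime],
    tiers: dict[str, int],
    max_lag: timedelta,
    now: datetime,
) -> list[str]:
    """Return tier 1 datasets whose last load is older than max_lag."""
    return sorted(
        name
        for name, loaded_at in last_loaded.items()
        if tiers.get(name) == 1 and now - loaded_at > max_lag
    )


# Fixed "now" so the example is reproducible.
now = datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc)
last_loaded = {
    "finance.daily_revenue": now - timedelta(hours=7),
    "finance.bookings": now - timedelta(minutes=30),
    "ops.backfill_2019": now - timedelta(days=12),
}
tiers = {"finance.daily_revenue": 1, "finance.bookings": 1, "ops.backfill_2019": 3}

stale = stale_tier1(last_loaded, tiers, max_lag=timedelta(hours=2), now=now)
print(stale)
```

The backfill table is twelve days old and still stays quiet, because it is tier 3. That is the "resist the urge to onboard everything" rule expressed as a filter.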

Weeks 7-8: Routing and runbooks (layer 4)

For every alert that exists, add an owner, a severity, and a one-paragraph runbook. Alerts without all three are downgraded or removed.
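One way to enforce the owner, severity, and runbook rule is a small audit over your alert definitions. The `AlertRule` shape and the field values below are hypothetical. Real alert configuration will live in whatever format your tooling uses, so treat this as a sketch of the check rather than a drop-in script.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    name: str
    owner: Optional[str] = None
    severity: Optional[str] = None
    runbook: Optional[str] = None


def missing_fields(rule: AlertRule) -> list[str]:
    """List which of owner, severity, and runbook are absent or blank."""
    required = {"owner": rule.owner, "severity": rule.severity, "runbook": rule.runbook}
    return [field for field, value in required.items() if not (value and value.strip())]


def audit(rules: list[AlertRule]) -> dict[str, list[str]]:
    """Map each incomplete rule to its missing fields; complete rules are omitted."""
    report = {}
    for rule in rules:
        gaps = missing_fields(rule)
        if gaps:
            report[rule.name] = gaps
    return report


rules = [
    AlertRule(
        "revenue_freshness",
        owner="finance-data",
        severity="sev1",
        runbook="Re-run the load, then notify finance.",
    ),
    AlertRule("orders_volume", owner="ops-data"),
]

# Anything that shows up here is a candidate for downgrade or removal.
print(audit(rules))
```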

Week 9 onwards: Expand

Now add tier 2 datasets to layer 2. Tune thresholds based on real incidents. Track three numbers monthly: median time to detect, median time to resolve, and the number of incidents that reached a business user before your team.

What to put in (and leave out of) your stack

Tool choice depends on your stack shape. Here's a starting kit by stack type.

Stack shape | Layer 1 (pipeline) | Layer 2 (data) | Layer 3-4 (impact + resolution)
Modern data stack (Snowflake + dbt + Looker) | Native dbt Cloud + Snowflake alerts | Elementary or Soda | Slack + custom routing
Microsoft (Power BI + ADF + Fabric) | Azure Monitor + native alerts | MetricSign | MetricSign routing + PagerDuty
Hybrid (Databricks + dbt + multiple BI) | Native + custom | Soda or MetricSign | Slack + ownership matrix
Small / single-tool | Native alerts only | dbt tests | Single Slack channel + on-call rotation

What to leave out: anything that promises to do all four layers without configuration. Layer 3 (business impact) cannot be inferred from the data — it needs human input on which datasets matter to whom.

Metrics that tell you the system is working

Three numbers, tracked monthly, will tell you whether your data monitoring system is earning its keep.

Median time to detect (MTTD)

From "thing broke" to "first alert fired". Track per severity. A healthy MTTD on tier 1 datasets is under 10 minutes. Anything over an hour means your monitoring is reactive, not proactive.

Median time to resolve (MTTR)

From "first alert fired" to "incident closed". This depends mostly on layer 4 quality (runbooks, ownership). Healthy MTTR on tier 1 is under 2 hours.

Incidents-reaching-business count

The single most important number. How many incidents per month does a business user notice before your team does? If this isn't trending toward zero, the system is leaking somewhere.
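If you log incidents with three timestamps and one flag, all three numbers fall out of a few lines of code. The `Incident` record and its field names are assumptions for this sketch, the sample incidents are invented, and the function expects at least one incident in the list.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class Incident:
    broke_at: datetime
    alerted_at: datetime
    closed_at: datetime
    business_noticed_first: bool


def minutes_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def monthly_metrics(incidents: list[Incident]) -> dict[str, float]:
    """Compute MTTD, MTTR (both medians, in minutes) and the business-first count."""
    detect = [minutes_between(i.broke_at, i.alerted_at) for i in incidents]
    resolve = [minutes_between(i.alerted_at, i.closed_at) for i in incidents]
    return {
        "mttd_minutes": median(detect),
        "mttr_minutes": median(resolve),
        "reached_business_first": sum(i.business_noticed_first for i in incidents),
    }


day = datetime(2024, 1, 15)
incidents = [
    Incident(day.replace(hour=8), day.replace(hour=8, minute=6), day.replace(hour=9), False),
    Incident(day.replace(hour=10), day.replace(hour=10, minute=20), day.replace(hour=13, minute=20), True),
    Incident(day.replace(hour=14), day.replace(hour=14, minute=4), day.replace(hour=15, minute=34), False),
]

print(monthly_metrics(incidents))
```

Using the median rather than the mean keeps one marathon incident from hiding an otherwise healthy month. Run the same function per tier to compare against the tier 1 targets above.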

According to IDC research, data engineers spend up to 30% of their time on incident triage and root cause work. (IDC, 2023) The point of a data monitoring system isn't to find more incidents. It's to push that number down.

Where MetricSign fits

If your stack is built around the Microsoft data ecosystem (Power BI, ADF, Databricks, dbt, Fabric, Snowflake), MetricSign is the layer 2 and layer 3 piece of your data monitoring system.

It watches dataset freshness, volume, schema, and distribution across the full stack. It models lineage between tools so a failure in ADF surfaces as an alert on the Power BI dashboard that depends on it. And it lets you tier datasets so tier-1 incidents page on-call while tier-3 incidents go to a digest.
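The lineage idea can be illustrated with a plain graph walk: given a failed upstream job, list everything downstream that depends on it. This is a generic breadth-first traversal over a hand-written, hypothetical lineage map, not a description of how MetricSign models lineage internally.

```python
from collections import deque

# Hypothetical lineage edges: upstream asset -> assets that read from it.
LINEAGE = {
    "adf.load_sales": ["warehouse.sales_fact"],
    "warehouse.sales_fact": ["dbt.revenue_daily"],
    "dbt.revenue_daily": ["powerbi.cfo_revenue_dashboard", "powerbi.sales_ops_report"],
}


def downstream(asset: str, lineage: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every asset that depends on `asset`."""
    seen = set()
    order = []
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:  # guards against diamonds and cycles
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# A failed ADF pipeline, traced through to the dashboards that depend on it.
impacted = downstream("adf.load_sales", LINEAGE)
print(impacted)
```

Combine the impacted list with dataset tiers and you know whether a failed pipeline is a page or a digest entry before anyone opens the dashboard.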

Layer 1 stays in your existing tooling (Azure Monitor, native dashboards). Layer 4 connects to your existing on-call process (Slack, PagerDuty, Teams). MetricSign covers the layers most teams build last and worst.

Frequently asked questions

What is a data monitoring system?
A data monitoring system is the combined set of tools, telemetry, alert rules, and human processes that lets a team detect data problems before downstream users do. It typically covers four layers: pipeline health, data health, business impact context, and resolution flow. A single tool or dashboard alone is not a system — the system is the way the parts work together.
How is data monitoring different from data observability?
Data observability is the broader concept: the property of being able to understand the health of your data without having to ask. Data monitoring is the practical implementation: the tools and processes that operationalise observability. You can think of monitoring as 'observability in production', with alerts, dashboards, and runbooks attached.
What are the layers of a data monitoring system?
Four layers: (1) pipeline health — did the job run? (2) data health — is the data correct? (3) business impact — does this matter to anyone, and to whom? (4) resolution — who owns this, what's the runbook, what's the SLA? Most teams build 1 and 2 and skip 3 and 4, which is why their alerts get ignored.
What metrics tell me my data monitoring system is working?
Three: median time to detect (MTTD), median time to resolve (MTTR), and the count of incidents that reached a business user before the data team noticed. Track them monthly per dataset tier. The third metric is the most important — it directly measures whether your monitoring system is preventing the failures that matter.
Should I build my own data monitoring system or buy one?
Layer 1 (pipeline health) usually comes with your existing tools — keep it native. Layer 4 (resolution) is process and lives in Slack/PagerDuty/Teams. Layers 2 and 3 are where buying a tool typically wins, because the engineering work to build them properly takes months and the maintenance burden grows with every new pipeline. Build the layers you control, buy the layers that scale.
