Data Reliability Monitoring for Modern Data Stacks

From ADF pipeline run to Power BI report refresh, MetricSign tracks every layer of your stack and surfaces reliability failures before your users file a ticket.

MetricSign monitors data reliability across the full stack — from ingestion to BI report — detecting freshness failures, volume drops, and schema drift before they reach end users.

MetricSign vs Manual monitoring

| Feature | MetricSign | Manual monitoring |
| --- | --- | --- |
| Freshness alerting on tables | ✓ Configurable thresholds per source | |
| Zero-row / volume anomaly detection | ✓ Historical baseline | ✕ Requires manual check |
| Schema drift detection with diff | ✓ Column-level diff | |
| Power BI dataset refresh monitoring | ✓ Native API integration | ✕ Manual portal check |
| Tableau extract refresh monitoring | ✓ Native REST API | ✕ Manual portal check |
| ADF pipeline run tracking | ✓ Activity-level detail | ✕ Azure portal only |
| dbt test result integration | ✓ Job run results | ✕ CI logs only |
| Lineage-aware blast radius on failure | ✓ Full dependency chain | |
| Slack / Teams alerting | ✓ | ✕ Manual only |
| Pricing model | ~ €299/mo or €2,990/yr (all connectors, unlimited workspaces & users) | ~ Free tool, high staff time cost |
| Setup time for full stack coverage | ~ Hours | ✕ Ongoing manual effort |

Legend: ✓ Supported; ~ Partial / limited; ✕ Not supported

What Data Reliability Actually Means

Data reliability is not a single metric — it is the combination of properties that determine whether the data in your BI tools and downstream systems can be trusted at any given moment. A dashboard can be technically accessible while showing numbers that are three days stale, missing an entire region, or silently computed on a broken schema. All of those are reliability failures.

In practice, data reliability breaks down into four dimensions that each fail independently and require separate detection logic:

Freshness: Was the data updated when it was supposed to be? A report that should reflect yesterday's sales but shows data from four days ago is a freshness failure, even if every row is technically correct.
Volume: Did the expected number of records arrive? A dbt model that loads zero rows instead of 200,000 does not register as a pipeline error. It produces no error code and silently delivers nothing.
Schema: Did the structure of the data change? A source column renamed upstream from customer_id to cust_id will break downstream joins, and in many ETL configurations it does so without raising an exception.
Completeness: Are all required fields populated? A pipeline that passes validation but leaves nullable columns empty where values are expected produces reports with silent gaps.

Each of these failure modes has a different root cause, a different detection method, and a different urgency level. Manual monitoring — checking dashboards or reading pipeline logs — does not give you systematic coverage across all four. By the time a user notices stale numbers, the freshness breach may already be six hours old.

How Data Becomes Unreliable in Practice

Reliability problems rarely announce themselves. They surface as confused Slack messages from analysts, as revenue figures that do not match last week's, or as a Power BI report that simply shows no data for an entire product line. Here are the patterns MetricSign is built to catch:

Pipeline failure with no downstream alert

An ADF pipeline fails at 03:00 with ErrorCode=UserErrorDataLakeFilesNotFound. The pipeline itself is marked as failed in the Azure portal, but there is no alert configured for downstream consumers. The Snowflake table is not updated. The dbt models run on schedule against stale source data. Power BI refreshes on schedule against stale dbt output. At 09:00 the team opens a dashboard showing yesterday's numbers, not today's.

Zero-row load

INSERT INTO analytics.orders
SELECT * FROM staging.orders
WHERE load_date = CURRENT_DATE;
-- Result: 0 rows affected

No error. No exception. The table exists, the columns are correct, the refresh completes. The metric just shows zero sales for the day.
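A minimal sketch of why this case needs a check on the data rather than on the exit status, using Python's built-in sqlite3 module as a stand-in for the warehouse. The table names `staging_orders` and `analytics_orders`, the helper names, and the `min_expected` threshold are illustrative assumptions, not part of any MetricSign interface.

```python
import sqlite3

def load_orders(conn: sqlite3.Connection, load_date: str) -> int:
    """Copy one day's rows from staging to analytics and return the row count."""
    cur = conn.execute(
        "INSERT INTO analytics_orders "
        "SELECT * FROM staging_orders WHERE load_date = ?",
        (load_date,),
    )
    conn.commit()
    return cur.rowcount  # rows inserted by the statement

def check_load(rows_loaded: int, min_expected: int = 1) -> str:
    """Classify a load that finished without raising an exception."""
    return "ok" if rows_loaded >= min_expected else "volume_alert"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_orders (order_id INTEGER, load_date TEXT)")
conn.execute("CREATE TABLE analytics_orders (order_id INTEGER, load_date TEXT)")
conn.execute("INSERT INTO staging_orders VALUES (1, '2024-01-01')")

# Staging holds no rows for 2024-01-02, so the statement succeeds with 0 rows.
empty_load = load_orders(conn, "2024-01-02")
normal_load = load_orders(conn, "2024-01-01")
print(check_load(empty_load), check_load(normal_load))
```

Both loads complete without an exception; only the row count distinguishes them.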

Schema drift

A source system team renames a column as part of a refactor. The staging table still loads; the column is simply absent. The dbt model that references it either silently stops matching rows on a LEFT JOIN or, if the column was in a SELECT *, produces NULL values across an entire dimension.

Delayed refresh

A Power BI dataset is configured to refresh at 06:00. The underlying Snowflake query now takes 4 hours instead of 45 minutes due to a missing clustering key on a new partition. The refresh does not fail; it finishes at 10:00. Business users working at 07:00 still see the previous day's refresh, and by the time the new one lands that data is about 28 hours old.
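The staleness arithmetic in this scenario can be sketched as follows. It assumes the previous refresh captured data as of 06:00 the day before; the dates and the helper name `data_age_hours` are illustrative.

```python
from datetime import datetime, timedelta

def data_age_hours(data_as_of: datetime, viewed_at: datetime) -> float:
    """Hours between the moment the data was captured and the moment it is viewed."""
    return (viewed_at - data_as_of) / timedelta(hours=1)

# Assumption: yesterday's refresh captured data as of 06:00.
previous_snapshot = datetime(2024, 1, 1, 6, 0)
late_refresh_done = datetime(2024, 1, 2, 10, 0)

age_at_0700 = data_age_hours(previous_snapshot, datetime(2024, 1, 2, 7, 0))
age_before_landing = data_age_hours(previous_snapshot, late_refresh_done)
print(age_at_0700, age_before_landing)
```

Under that assumption the data is 25 hours old at 07:00 and reaches 28 hours just before the late refresh lands at 10:00, even though no run ever reported a failure.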

Each of these scenarios requires monitoring at a different layer. No single tool covers all of them out of the box.

Full-Stack Coverage: ADF → dbt → Snowflake → Power BI

MetricSign connects to every layer of a modern data stack and monitors reliability signals at each one. The goal is a single view of pipeline health rather than four separate dashboards in four separate tools.

Azure Data Factory

MetricSign pulls activity run status, duration trends, and error codes from ADF. It detects not just failed runs but also runs that complete with warnings, runs that take longer than a configurable baseline, and pipelines that have not started within their expected window.
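As an illustration only, the checks described here (failed runs, runs slower than a baseline, and runs that never started in their window) could be expressed as below. The `PipelineRun` type, the `slow_factor` of 1.5, and the function names are hypothetical and do not describe MetricSign's internals or the ADF API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class PipelineRun:
    status: str               # e.g. "Succeeded" or "Failed"
    duration_minutes: float

def classify_run(run: PipelineRun, baseline_minutes: float,
                 slow_factor: float = 1.5) -> str:
    """Flag failed runs and runs that exceed the baseline by slow_factor."""
    if run.status == "Failed":
        return "failed"
    if run.duration_minutes > baseline_minutes * slow_factor:
        return "slow"
    return "ok"

def missed_start(last_start: Optional[datetime], window_start: datetime,
                 window_end: datetime, now: datetime) -> bool:
    """True once the expected window has closed with no run started inside it."""
    if now < window_end:
        return False
    return last_start is None or not (window_start <= last_start <= window_end)

print(classify_run(PipelineRun("Succeeded", 95), baseline_minutes=45))
print(missed_start(None, datetime(2024, 1, 2, 2, 0),
                   datetime(2024, 1, 2, 3, 0), datetime(2024, 1, 2, 3, 30)))
```

The first call reports a run that succeeded but took more than 1.5 times its baseline; the second reports a pipeline that never started in its 02:00 to 03:00 window.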

dbt

For dbt projects, MetricSign reads job run results including test outcomes. A dbt not_null test failure on a critical column is surfaced as a reliability incident, not just a CI log entry. Model execution times are tracked over time so gradual performance degradation is visible before it causes timeout failures.

Snowflake and other warehouses

At the warehouse layer, MetricSign tracks row counts, schema versions, and last-modified timestamps on monitored tables. A table that has not been written to within its expected window triggers a freshness alert. A schema change — column added, column removed, type changed — triggers a schema drift alert.

Tableau and Power BI

At the BI layer, MetricSign monitors dataset refresh status, refresh duration, and report-level data timestamps. A Power BI dataset stuck in a "Refresh in progress" state for three hours is flagged. A Tableau extract that last completed successfully two days ago is flagged.
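A rough sketch of a stuck-refresh check over refresh history entries. The entry shape (`status`, `startTime`, `endTime`, with an in-progress refresh reported as status "Unknown" and no end time) is an assumption modeled on the Power BI REST API refresh history response and should be verified against the API documentation; the function names and the three-hour limit are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

def parse_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' as UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def stuck_refreshes(history: List[Dict[str, str]], now: datetime,
                    max_running: timedelta = timedelta(hours=3)) -> List[Dict[str, str]]:
    """Return entries that are still running after max_running has elapsed."""
    stuck = []
    for entry in history:
        # Assumed convention: no endTime and status "Unknown" means in progress.
        in_progress = entry.get("endTime") is None and entry.get("status") == "Unknown"
        if in_progress and now - parse_utc(entry["startTime"]) >= max_running:
            stuck.append(entry)
    return stuck

history = [
    {"status": "Completed", "startTime": "2024-01-01T06:00:00Z",
     "endTime": "2024-01-01T06:45:00Z"},
    {"status": "Unknown", "startTime": "2024-01-02T06:00:00Z"},
]
now = datetime(2024, 1, 2, 9, 30, tzinfo=timezone.utc)
flagged = stuck_refreshes(history, now)
print(len(flagged))
```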

The result is a reliability chain: you can trace a late Power BI refresh back through the dbt job that fed it, back through the ADF pipeline that populated the source, and see exactly where in the chain the delay was introduced.

Detecting Reliability Issues Before Users Do

The operational value of reliability monitoring is time-to-detection. Every minute between a reliability failure and the moment the engineering team knows about it is a minute when business decisions may be based on bad data.

MetricSign uses several detection mechanisms that go beyond simple pass/fail status checks:

Freshness thresholds

For each monitored table or dataset, you configure an expected update frequency. If a table that refreshes hourly has not been written to in 90 minutes, an alert fires — before the two-hour business reporting window starts.
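A threshold check of this kind reduces to comparing the time since the last write with a per-table limit. The sketch below assumes hypothetical table names and thresholds; it is not MetricSign's configuration format.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Hypothetical per-table thresholds; names and values are illustrative.
MAX_AGE: Dict[str, timedelta] = {
    "analytics.orders": timedelta(minutes=90),    # table expected to refresh hourly
    "analytics.customers": timedelta(hours=26),   # table expected to refresh daily
}

def stale_tables(last_written: Dict[str, datetime], now: datetime) -> List[str]:
    """Return the tables whose last write is older than their threshold."""
    return sorted(
        table for table, written_at in last_written.items()
        if now - written_at > MAX_AGE[table]
    )

now = datetime(2024, 1, 2, 9, 0)
last_written = {
    "analytics.orders": datetime(2024, 1, 2, 7, 15),    # 105 minutes ago
    "analytics.customers": datetime(2024, 1, 2, 6, 0),  # 3 hours ago
}
stale = stale_tables(last_written, now)
print(stale)
```

Only the hourly table is reported, because 105 minutes exceeds its 90-minute limit while the daily table is well within its own.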

Volume anomaly detection

MetricSign tracks historical row counts for tables and models. A load that delivers 0 rows when the 30-day average is 180,000 rows is flagged immediately. The same applies to partial loads — if a table usually receives 50,000 rows per partition but receives 3,000, that is a volume anomaly even if no pipeline error occurred.
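One simple way to express this comparison is a ratio against the historical mean. The `min_ratio` of 0.5 and the status labels are assumptions chosen for illustration; the detection logic MetricSign actually uses is not specified here and may be more sophisticated.

```python
from statistics import mean
from typing import Sequence

def volume_status(history: Sequence[int], latest: int, min_ratio: float = 0.5) -> str:
    """Compare the latest row count with the mean of past loads."""
    if not history:
        return "no_baseline"
    baseline = mean(history)
    if latest == 0 and baseline > 0:
        return "zero_rows"
    if latest < baseline * min_ratio:
        return "low_volume"
    return "ok"

print(volume_status([180_000] * 30, 0))       # zero-row load against a 180,000 average
print(volume_status([50_000] * 30, 3_000))    # partial load
print(volume_status([50_000] * 30, 48_500))   # normal variation
```

A fixed ratio is easy to reason about but tends to misfire on tables with large day-to-day swings, which is why a longer observed history helps.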

Schema fingerprinting

Each time a monitored table is scanned, MetricSign computes a schema fingerprint covering column names, types, and nullable flags. A change in the fingerprint triggers a schema drift alert that includes a diff:

Schema change detected: analytics.fact_orders
  - customer_id (INTEGER, NOT NULL)  → removed
  + cust_id (INTEGER, NOT NULL)      → added
  ~ order_value (FLOAT) → order_value (NUMERIC(18,2))
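A minimal sketch of the fingerprint-and-diff idea, assuming a schema is represented as a mapping from column name to type and nullable flag. The SHA-256 hash, the `Schema` type, and the nullable flag on order_value are illustrative choices, not a description of MetricSign's implementation.

```python
import hashlib
import json
from typing import Dict, List, Tuple

Schema = Dict[str, Tuple[str, bool]]  # column name -> (type, nullable)

def fingerprint(schema: Schema) -> str:
    """Hash column names, types, and nullable flags in a stable order."""
    canonical = json.dumps(
        sorted((name, typ, nullable) for name, (typ, nullable) in schema.items())
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def schema_diff(old: Schema, new: Schema) -> List[str]:
    """List removed, added, and changed columns between two schemas."""
    lines = [f"- {name} removed" for name in sorted(old.keys() - new.keys())]
    lines += [f"+ {name} added" for name in sorted(new.keys() - old.keys())]
    for name in sorted(old.keys() & new.keys()):
        if old[name] != new[name]:
            lines.append(f"~ {name} {old[name][0]} -> {new[name][0]}")
    return lines

before: Schema = {"customer_id": ("INTEGER", False), "order_value": ("FLOAT", True)}
after: Schema = {"cust_id": ("INTEGER", False), "order_value": ("NUMERIC(18,2)", True)}

# The cheap fingerprint comparison gates the more detailed diff.
if fingerprint(before) != fingerprint(after):
    print("\n".join(schema_diff(before, after)))
```

Sorting the columns before hashing keeps the fingerprint stable when the warehouse returns columns in a different order.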

Lineage-aware incident routing

When a source table fails a freshness check, MetricSign resolves the downstream impact: which dbt models depend on it, which datasets are built from those models, which reports are served from those datasets. The incident is routed to the team responsible for the source, with context on what is affected downstream — so the on-call engineer knows the blast radius without manual investigation.
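Resolving downstream impact is a graph traversal. The sketch below assumes lineage is available as a mapping from each asset to its direct dependents; the asset names and the `blast_radius` function are hypothetical.

```python
from collections import deque
from typing import Dict, List

def blast_radius(downstream: Dict[str, List[str]], failed: str) -> List[str]:
    """Breadth-first walk returning every asset that depends on the failed one."""
    seen = {failed}
    affected: List[str] = []
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected

# Hypothetical lineage: source table -> dbt models -> dataset -> reports.
lineage = {
    "raw.orders": ["stg_orders"],
    "stg_orders": ["fct_orders"],
    "fct_orders": ["Sales dataset"],
    "Sales dataset": ["Sales report", "Executive report"],
}
impact = blast_radius(lineage, "raw.orders")
print(impact)
```

The breadth-first order lists the nearest dependents first, which matches the order in which an on-call engineer would usually want to check them.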

Alerts are delivered via email, Slack, or Teams, with direct links to the affected pipeline run or dataset.

Why Traditional Monitoring Falls Short

Most data teams rely on a combination of Azure Monitor alerts, dbt test failures in CI, and manual dashboard checks to maintain reliability. This approach has structural gaps.

Azure Monitor covers infrastructure and pipeline execution status, but it does not understand data semantics. It can tell you that an ADF pipeline succeeded — it cannot tell you that the pipeline wrote zero rows, or that the Snowflake table it feeds has not changed in 18 hours.

dbt tests run on schedule and catch structural problems at the model layer. But they only run when the dbt job runs, they require explicit test definitions for every condition, and a passing dbt test suite does not mean the data is fresh — it means the data that arrived passed your declared tests.

Manual checks, where an analyst looks at a dashboard and notices something is wrong, are the last line of defense, not a monitoring strategy. By the time a user notices, decisions may already have been made on the bad data.

Monte Carlo is a dedicated observability platform with strong ML-based anomaly detection, but it is built for data warehouse-centric stacks. It does not have native integration with Power BI or Tableau refresh status, and its pricing model is designed for large data platform teams rather than mid-market organizations running mixed Microsoft and cloud stacks.

MetricSign is designed specifically for teams running Microsoft-centric stacks — ADF, Fabric, Power BI — combined with dbt, Snowflake, or Tableau. The integrations are native, not webhook-based, and the reliability model covers the full chain from ingestion to report.

Frequently asked questions

What is data reliability and how is it different from data quality?

Data quality typically refers to the accuracy and consistency of data values — whether records are correct, deduplicated, and conform to business rules. Data reliability is a broader operational concept: it describes whether data is available, fresh, complete, and structurally intact when downstream consumers need it. A pipeline can deliver high-quality data that arrives four hours late, making it unreliable for time-sensitive reporting even though the data itself is accurate. MetricSign monitors reliability — freshness, volume, schema, and completeness — as a separate layer from business-logic data quality checks.

How does MetricSign detect a zero-row load if no pipeline error is raised?

MetricSign tracks row counts on monitored tables by querying metadata or running lightweight count queries after each detected write event. It maintains a rolling baseline of expected row counts per load window. When a load completes but the row count is zero — or significantly below the baseline — MetricSign flags it as a volume anomaly and sends an alert. This works independently of whether the upstream pipeline reported success or failure, because the check is on the data state, not the pipeline exit code.

Does MetricSign work with both Power BI Premium and Power BI Pro workspaces?

MetricSign integrates with Power BI via the Power BI REST API, which supports both Premium and Pro workspaces for dataset refresh history and status. Some metadata — such as detailed execution timing and capacity utilization — is only available in Premium or Fabric workspaces. For Pro workspaces, MetricSign monitors refresh success or failure status and last refresh timestamp. Capacity-level metrics require Premium Per User or Premium Per Capacity licensing on the Power BI side.

Can MetricSign monitor dbt Cloud and self-hosted dbt Core projects?

For dbt Cloud, MetricSign connects via the dbt Cloud Admin API and pulls job run results, model execution times, and test outcomes directly. For self-hosted dbt Core projects running via Airflow, Azure DevOps, or GitHub Actions, MetricSign can ingest run artifacts — specifically `run_results.json` — via a webhook or a configured artifact path. This covers both managed and self-hosted dbt deployments within a single reliability view.
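To make the artifact path concrete, here is a small sketch that reads failing tests out of a `run_results.json` payload. The sample is heavily trimmed and uses a made-up project name (`shop`); real artifacts carry more fields and longer `unique_id` values, and the `failing_tests` helper is illustrative rather than part of dbt or MetricSign.

```python
import json
from typing import List

# Trimmed, hypothetical sample of a run_results.json artifact.
SAMPLE = """
{"results": [
  {"unique_id": "model.shop.fct_orders", "status": "success", "execution_time": 12.4},
  {"unique_id": "test.shop.not_null_fct_orders_customer_id", "status": "fail", "execution_time": 0.8},
  {"unique_id": "test.shop.unique_fct_orders_order_id", "status": "pass", "execution_time": 0.6}
]}
"""

def failing_tests(run_results: dict) -> List[str]:
    """Return the unique_id of every test node that failed or errored."""
    return [
        result["unique_id"]
        for result in run_results.get("results", [])
        if result["unique_id"].startswith("test.")
        and result.get("status") in ("fail", "error")
    ]

failures = failing_tests(json.loads(SAMPLE))
print(failures)
```

The same loop could collect `execution_time` per model to track run duration over time.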

How long does it take to set up full-stack reliability monitoring?

Initial setup for a standard stack (one Azure Data Factory instance, one Snowflake account, one dbt Cloud project, and one Power BI workspace) typically takes a few hours. Each connector requires OAuth authorization or a service principal with read-only access. Once connected, MetricSign begins collecting baseline metrics immediately and can trigger alerts based on configured thresholds from day one. An accurate volume anomaly baseline needs approximately seven to fourteen days of observed load history to reduce false positives on high-variance tables.