Your Actual-vs-Budget Variance Visual Only Lies When the Refresh Fails Silently
Custom variance visuals like PBIGenie's Hammerhead make actual-vs-budget comparisons readable. They don't make the underlying data trustworthy.
In-depth articles on data observability, lineage, and incident response — written for data engineers who manage Power BI, ADF, Databricks, Fabric, and dbt.
Copilot writes a DAX query that times out your dataset refresh. The error log says timeout. It doesn't say why the query existed in the first place.
Synced tables, scale-to-zero session drops, and metrics that report zero when data still exists — Lakebase introduces failure modes that don't map to your existing Databricks monitoring.
A Databricks job fails at 3am. The cluster terminated. The driver log rolled over. The downstream dbt model ran anyway — on yesterday's data. Here is how to build the audit trail Databricks does not give you by default.
Query-based connectors in Databricks rely on Delta Lake snapshots that can silently age out, leaving downstream consumers reading data that looks current but isn't.
You set an alert on your Power BI revenue card. Three weeks later, the pipeline breaks, the card shows yesterday's number, and nobody gets notified.
Your Fabric capacity hit 100% utilization at 06:12 this morning. The Capacity Metrics App won't show it for another 15 minutes. By then, interactive queries are already delayed.
Your Lakehouse copy ran green. Capacity sits at 84%. Direct Lake served the report on time. The numbers are still wrong by €1.4M.
Vendors call almost anything an observability tool. These are the five capabilities that decide whether one will save your team or just add another dashboard to ignore.
Azure Monitor is excellent at one thing: telling you when CPU goes up. The problems that actually wake data teams at night live in the gaps between what it watches and what your business sees.
Most data monitoring systems are a Slack channel, a few cron jobs, and hope. The teams that ship reliable data are the ones who build the four layers below — in this order.
A data quality monitoring tool tells you when a column violates a rule you wrote. It is the cheapest, fastest improvement most data teams can make. It is also where most teams stop, and that is where the trouble starts.
Most comparisons miss the question that matters: does the platform actually cover your stack?
Power BI says the refresh succeeded. ADF reports the pipeline ran. Databricks shows all jobs completed. Your users are looking at yesterday's numbers.
Your dbt job finished. Your ADF pipeline ran. Your Power BI dashboard shows last week's numbers. Nobody got an alert.
Fabric gives you three layers of pipeline alerting — activity-level, item-level, workspace-level — and none of them natively answers "did the file arrive on time?"
Five failure layers, no single native tool that covers them, and a correlation problem that makes every incident look like three.
Your refresh says succeeded. Your users see wrong data. These are the four signals a data observability tool watches that most Power BI monitoring setups miss.
A failed refresh announces itself. Wrong data loaded silently does not.
A practical checklist for teams that want to catch data issues before their users do — without committing to a full data observability tool on day one.
Power BI says 'refresh succeeded.' The report shows blank data. Somewhere between your ADF pipeline and the Fabric lakehouse, a column was renamed. You have no way to trace which of your 32 datasets depend on that column.
Most lineage tools show you what happened. Compile-time lineage shows you what will break.
Rocky, a Rust-based warehouse control plane, computes column-level lineage during compilation rather than after execution. The difference determines whether you find a broken join before or after your stakeholders do.
Without a map of your data chain, every investigation starts from scratch.
Data monitoring software tells you what broke. Lineage tells you why — and what it's taking down with it.
During migration, you're not monitoring one environment — you're monitoring two. Most data monitoring software is built to watch one stack, not two stacks running side by side.
Three generations of ETL tooling, one data stack — maintaining visibility when the tools keep changing.
Your R code runs clean. The cell completes. The plot area is blank. Databricks doesn't tell you why — because from the runtime's perspective, nothing went wrong.
Mixing DirectQuery with imported SharePoint lists sounds pragmatic. The storage engine disagrees.
The dataset refreshed at 06:02. The audit log says succeeded. The board meeting starts at 09:00. The Admin Portal has nothing to tell you about the ADF pipeline that wrote zero rows at 03:44.
Six Spark properties stand between your Databricks cluster and an Iceberg table registered in AWS Glue. Get one wrong and you'll see TABLE_OR_VIEW_NOT_FOUND — with no hint about which property caused it.
Your vendor's consultant just overwrote a production notebook at 4pm on a Friday. Here's how folder permissions, service principals, and Git folders prevent that from happening again.
The Compute tab vanishes silently when entitlements are wrong. Three settings control whether your users can see it, and none of them produce an error message.
A UNION ALL in the USING clause looks correct until two source tables contribute a row for the same key. Delta rejects the ambiguity outright.
The split-and-getItem pattern works perfectly on sample data. Production strings have trailing spaces, embedded delimiters, and missing fields that turn your columns into nulls without warning.
UNION ALL your sources into MERGE and Spark will punish you with an ambiguous match error — unless you deduplicate first.
Native notifications miss the failures that actually hurt. Here's how the major Power BI monitoring tools compare on detection, correlation, and time-to-deploy.
Native Azure Monitor catches pipeline failures. It misses the Copy activity that succeeded with the wrong schema — and that's the one your stakeholders will call about.
The runtime gap between PySpark and Scala is not what most benchmarks measure. The real cost lives in serialization boundaries, the executor process model, and where your UDFs run.
The tutorial shows a green checkmark. Production shows a half-loaded Lakehouse table and a stakeholder asking why yesterday's revenue is missing.
Power BI has built-in refresh failure notifications. They're not enough for most production environments.
If manual refresh works and scheduled refresh fails, the problem is not the data source. It is the environment the scheduled run uses.
A gateway that goes offline at 02:00 and recovers by 09:00 can silently fail dozens of scheduled refreshes while everyone sleeps.
What do you do when it's 3am and your most important dataset just failed to refresh? A data pipeline management playbook for the moment monitoring fires its first alert.
Reconciliation workloads compare two large datasets row by row. When that comparison never converges, your cluster burns compute until someone notices — or the budget runs out.
Your dbt run finished at 04:12. Three models failed. The error log says 'current transaction is aborted'. Downstream, Power BI already refreshed on yesterday's data.
Your scheduled refresh failed with an AADSTS code. The dashboard still shows yesterday's numbers. Here is how to read the code and find the right fix without trawling the full Microsoft reference.
Your ADF pipeline failed at 03:42 with a UserError code that means nothing on its own. The Power BI refresh that depends on it is two hours away. Here is how to read the error class and jump to the fix.
Your DAX query returns 11,000 rows in DAX Studio and 6,000 through VBA. No error. No warning. Just missing data your stakeholders will find before you do.
Your DAX query returns 11,000 rows in DAX Studio but 6,000 through VBA. The query isn't wrong. The ADODB plumbing is.
The setup wizard looks simple. Four steps, a few stored procedures, done. But the database setup step fails without telling you which prerequisite it actually checked and rejected.
Your scheduled refresh failed at 06:00. The error message contains an AADSTS code. Here's what each one means.
A DM_GWPipeline error means the gateway is part of the problem. Here's how to find out which part.
The connection test passes. The pipeline run fails with 403. They are not the same thing.
DRIVER_NOT_RESPONDING is a symptom. The cause is almost always memory pressure or GC pause. Here is how to find it and fix it.
The model works locally. The production deployment fails. The difference is almost always permissions, credentials, or SQL dialect.
MetricSign monitors your Power BI datasets, ADF pipelines, Databricks jobs, Fabric Pipelines, and dbt models — and surfaces incidents with root cause context before your stakeholders notice.