Data Pipeline Monitoring

Post-Summit Databricks Upgrades Change How Your Jobs Fail — Your Alerts Don't Know That Yet

Data+AI Summit 2026 introduced serverless auto-optimization, Lakeflow Jobs rebranding, and new identity controls. Each one shifts how failures surface — and your existing monitoring was built for the old patterns.

Your Databricks Job Passed Every Test and Still Failed at 3am

The Databricks community celebrates impressive builds. Production clusters don't care how clever your notebook was — they care whether your spot instances survived long enough to finish the job.

Databricks Goes Dark for Hours and Nobody Gets an Error

A Databricks interruption that fixes itself in three hours still breaks every pipeline scheduled in that window. The failure isn't the outage — it's that nothing told you it happened.

Databricks Context Engineering Pipelines Fail Differently Than ETL — and Most Teams Find Out Too Late

Databricks now certifies context engineers. The pipelines they build need monitoring that doesn't exist in most teams' playbooks.

Azure Monitoring Tools for Data Pipelines: What They Cover and Where They Stop

Azure gives you four distinct monitoring tools across its data platform. None of them talk to each other — and the failures that matter most happen exactly at those seams.

Azure Monitor Alerts for Data Pipelines: What They Catch and What They Miss for Power BI and ADF

An Azure Monitor alert fired. Your ADF pipeline failed. But which Power BI datasets are now showing stale data — and did anyone get notified before the 07:00 refresh?

dbt Source Freshness: How to Monitor It and Alert on Violations

Learn how to configure dbt source freshness, understand its limits, and link violations to downstream Power BI and Table…

Best Data Observability Tools in 2026: An Honest Comparison

An honest comparison of the best data observability tools in 2026: Monte Carlo, Great Expectations, Elementary, Soda, Me…

Monte Carlo Data Observability: How It Works and Where It Stops

Learn how Monte Carlo data observability works (ML anomaly detection, lineage, circuit breakers) and where it stops at…

Data Observability vs Data Quality: What's the Difference and Why You Need Both

Your dbt tests all passed. Your Power BI dashboard is showing yesterday's numbers. This is what data quality monitoring cannot catch — and why data observability exists.

Data Observability · 6 min read

Data Freshness Monitoring: What It Is and Why Stale Data Is Worse Than Missing Data

The pipeline ran. The refresh succeeded. The CEO is looking at last week's numbers. Data freshness monitoring would have caught this before the Monday standup.

Databricks and Power BI Monitoring: The Failures That Stay Hidden Between the Two

The Databricks job ran. Power BI refreshed. The sales dashboard is showing yesterday's empty rows. The failure happened in between, and neither tool noticed.

dbt Monitoring: What 'Job Succeeded' Doesn't Tell You

Every model in your dbt job ran successfully. The Power BI dashboard is serving last week's numbers. The failure happened between 'model succeeded' and 'data correct'.

Data Quality Monitoring for Snowflake: What Queries Succeeding Doesn't Guarantee

The Snowflake query ran. The ADF pipeline completed. The Power BI dataset refreshed. The numbers are from two days ago. Every tool reported success.

Data Observability · 6 min read

Databricks Power BI Connector: What the Connection Doesn't Monitor

The Databricks Power BI connector is connected. Refreshes are running. But which Databricks job last wrote to that Delta table, and was the output correct?

dbt Snowflake Monitoring: How Failures Travel from Transformation to Warehouse

The dbt job finished. The Snowflake table refreshed. The Power BI report is showing wrong numbers. The failure crossed two tool boundaries without triggering a single alert.

Snowflake and Power BI: From Connection to Monitoring

Getting Snowflake connected to Power BI takes an afternoon. Knowing whether the data that arrived is correct, fresh, and complete — that takes monitoring.

Power BI's PDF Connector Parses Your File Once — Then Breaks on Every Structural Change

The PDF connector in Power Query works fine during development. It fails during scheduled refresh — when nobody is watching the column mappings dissolve.

June 1, 2026

Your Actual-vs-Budget Variance Visual Only Lies When the Refresh Fails Silently

Custom variance visuals like PBIGenie's Hammerhead make actual-vs-budget comparisons readable. They don't make the underlying data trustworthy.

May 25, 2026

AI Agents Generate Queries Your Pipeline Monitoring Was Never Built to Trace

Copilot writes a DAX query that times out your dataset refresh. The error log says timeout. It doesn't say why the query existed in the first place.

Databricks Lakebase Adds a New Failure Surface Your Pipeline Monitoring Doesn't Cover

Synced tables, scale-to-zero session drops, and metrics that report zero when data still exists — Lakebase introduces failure modes that don't map to your existing Databricks monitoring.

Databricks Job Failures Leave No Breadcrumbs Unless You Build the Trail Yourself

A Databricks job fails at 3am. The cluster terminated. The driver log rolled over. The downstream dbt model ran anyway — on yesterday's data. Here is how to build the audit trail Databricks does not give you by default.

Databricks Snapshot Connectors Return Stale Data Without Telling You

Query-based connectors in Databricks rely on Delta Lake snapshots that can silently age out, leaving downstream consumers reading data that looks current but isn't.

Data Observability · 9 min read

Data Observability · 10 min read

Power BI Alerts: What Native Alerting Can and Can't Do

You set an alert on your Power BI revenue card. Three weeks later, the pipeline breaks, the card shows yesterday's number, and nobody gets notified.

May 10, 2026

Data Observability · 11 min read

Fabric Capacity Metrics Explained: What to Monitor Before You Get Throttled

Your Fabric capacity hit 100% utilization at 06:12 this morning. The Capacity Metrics App won't show it for another 15 minutes. By then, interactive queries are already delayed.

May 10, 2026

Data Observability · 11 min read

Microsoft Fabric Monitoring: What Native Tools Miss and How to Fill the Gaps

Your Lakehouse copy ran green. Capacity sits at 84%. Direct Lake served the report on time. The numbers are still wrong by €1.4M.

May 9, 2026

Data Observability Tool: 5 Capabilities That Separate Hype from Help

Vendors call almost anything an observability tool. These are the five capabilities that decide whether one will save your team or just add another dashboard to ignore.

Data Observability · 8 min read

Azure Monitor Alerts: What It Catches, What It Misses, and What to Do Next

Azure Monitor is excellent at one thing: telling you when CPU goes up. The problems that actually wake data teams at night live in the gaps between what it watches and what your business sees.

Data Observability · 9 min read

Data Monitoring System: What It Is, What It Isn't, and How to Build One That Works

Most data monitoring systems are a Slack channel, a few cron jobs, and hope. The teams that ship reliable data are the ones who build the four layers below — in this order.

Data Observability · 8 min read

Data Quality Monitoring Tools: What They Catch, What They Miss, and How to Choose One

A data quality monitoring tool tells you when a column violates a rule you wrote. It is the cheapest, fastest improvement most data teams can make. It is also where most teams stop, and that is where the trouble starts.

Data Observability · 13 min read

Best Data Observability Tools and Platforms in 2026 (Compared)

Most comparisons miss the question that matters: does the platform actually cover your stack?

May 6, 2026

Data Observability Platform for the Microsoft Data Stack

Power BI says the refresh succeeded. ADF reports the pipeline ran. Databricks shows all jobs completed. Your users are looking at yesterday's numbers.

May 6, 2026

What Is a Data Observability Platform? (And Why Your Modern Data Stack Needs One)

Your dbt job finished. Your ADF pipeline ran. Your Power BI dashboard shows last week's numbers. Nobody got an alert.

May 5, 2026

Data Observability · 10 min read

Microsoft Fabric SLA Monitoring: Why Your Alerting Architecture Breaks Before Your Pipeline Does

Fabric gives you three layers of pipeline alerting — activity-level, item-level, workspace-level — and none of them natively answers "did the file arrive on time?"

Data Observability · 14 min read

Data Observability for the Microsoft Stack: Power BI, ADF, Databricks, dbt, and Fabric

Five failure layers, no single native tool that covers them, and a correlation problem that makes every incident look like three.

Power BI Monitoring Beyond Refreshes: What a Data Observability Tool Actually Watches

Your refresh says succeeded. Your users see wrong data. These are the four signals a data observability tool watches that most Power BI monitoring setups miss.

April 11, 2026

Why Silent Data Failures Cost More Than Outages

A failed refresh announces itself. Wrong data loaded silently does not.

April 10, 2026

5 Data Observability Practices for Power BI Teams (Without a Heavy Tool)

A practical checklist for teams that want to catch data issues before their users do — without committing to a full data observability tool on day one.

April 9, 2026

Data Lineage

Data Lineage · 12 min read

Data Lineage Tools: A Practical Guide for Microsoft Stack Teams

Power BI says 'refresh succeeded.' The report shows blank data. Somewhere between your ADF pipeline and the Fabric lakehouse, a column was renamed. You have no way to trace which of your 32 datasets depend on that column.

May 12, 2026

Data Lineage · 9 min read

Column Lineage at Compile Time Changes What You Can Catch Before Production

Most lineage tools show you what happened. Compile-time lineage shows you what will break.

Data Lineage · 8 min read

Column Lineage at Compile Time Catches What Post-Hoc Graph Crawls Miss

Rocky, a Rust-based warehouse control plane, computes column-level lineage during compilation rather than after execution. The difference determines whether you find a broken join before or after your stakeholders do.

Data Lineage · 8 min read

End-to-End Data Lineage: From ADF to Power BI

Without a map of your data chain, every investigation starts from scratch.

April 8, 2026

Data Lineage · 7 min read

Data Pipelines Need Lineage, Not Just Data Monitoring Software

Data monitoring software tells you what broke. Lineage tells you why — and what it's taking down with it.

April 7, 2026

Cloud Migration

Cloud Migration · 8 min read

Monitoring During Cloud Migration: Why Single-Environment Data Monitoring Software Falls Short

During migration, you're not monitoring one environment — you're monitoring two. Most data monitoring software is built to watch one stack, not two stacks running side by side.

April 6, 2026

Cloud Migration · 8 min read

From SSIS to ADF to Fabric: Keeping Oversight

Three generations of ETL tooling, one data stack — maintaining visibility when the tools keep changing.

April 5, 2026

Best Practices

Your Databricks Training Workspace Works Fine — Production Fails on Day One

Six configuration gaps between Databricks training workspaces and production that cause job failures the moment you deploy real pipelines.

DISTINCT on 100M Rows Forces a Full Shuffle — and No Spark Config Can Eliminate It

Global deduplication requires every row to find every other matching row. That means a full shuffle, no matter how many Spark configs you toggle. Here's what to do instead.

SharePoint List Mirroring in Fabric Silently Drops Your Person Columns

Fabric mirroring treats SharePoint User/Group fields as unsupported nested objects and skips them without warning. Your "Created By" and "Modified By" columns arrive as nulls — or don't arrive at all.

Azure Functions as Fabric REST API Middleware: Authentication, Polling, and the Errors Nobody Warns You About

Microsoft Fabric exposes a capable REST API for job scheduling, item management, and workspace automation. Azure Functions is the obvious glue layer. But token acquisition, long-running operation polling, and consumption plan timeouts create failure modes that surface only in production.

Power BI Missing Refresh Detection: Catch Silent Failures Before Users Do

A failed refresh produces an error. A missing refresh produces silence. Power BI's built-in alerting catches the first. It does not catch the second.
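The difference between "failed" and "missing" comes down to one extra comparison: how long ago the last refresh happened versus how often it should happen. A minimal sketch, assuming you already collect the last refresh time and status for each dataset; the function name `refresh_status`, the 30-minute grace period, and the status strings (modelled on the values Power BI's refresh history reports) are all illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def refresh_status(last_refresh, last_status, expected_every, now,
                   grace=timedelta(minutes=30)):
    """Classify a dataset as 'ok', 'failed', or 'missing'.

    last_refresh:   datetime of the most recent refresh attempt, or None if none was seen
    last_status:    status of that attempt; 'Completed'/'Failed' are assumed values
    expected_every: how often the schedule should produce a refresh
    grace:          hypothetical slack before an overdue refresh counts as missing
    """
    if last_refresh is None:
        return "missing"  # nothing has ever run: no error is ever raised
    if last_status == "Failed":
        return "failed"   # the case built-in alerting already covers
    if now - last_refresh > expected_every + grace:
        return "missing"  # the schedule silently skipped: no error, just silence
    return "ok"


now = datetime(2026, 5, 10, 9, 0, tzinfo=timezone.utc)
print(refresh_status(now - timedelta(hours=27), "Completed", timedelta(hours=24), now))
```

The check runs on a schedule of its own, so it does not depend on the refresh it is watching ever firing.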

Power BI Usage Analytics: What Native Metrics Miss and How to Fill the Gap

30 days of usage data is not enough to identify abandoned reports, justify licences, or spot seasonal trends. Here is what to do about it.

Best Practices · 11 min read

Data Pipeline Monitoring Tools: A Practical Comparison for 2026

A pipeline failure at 2am that nobody catches until 9am costs the same whether the break is in dbt, Snowflake, ADF, or Power BI. Most monitoring tools watch only one of those layers.

Best Practices · 12 min read

Best Power BI Monitoring Tools in 2026: An Objective Comparison

Not every Power BI monitoring problem needs the same tool. Here is an honest breakdown of what each option actually covers — including where each one falls short.

Databricks R Plots Vanish Without an Error — The Graphics Device Fails Silently

Your R code runs clean. The cell completes. The plot area is blank. Databricks doesn't tell you why — because from the runtime's perspective, nothing went wrong.

May 25, 2026

Your Composite Model Is Slower Than DirectQuery Alone — Here's Why

Mixing DirectQuery with imported SharePoint lists sounds pragmatic. The storage engine disagrees.

Power BI Admin Portal: What It Shows, What It Hides, and When You Need More

The dataset refreshed at 06:02. The audit log says succeeded. The board meeting starts at 09:00. The Admin Portal has nothing to tell you about the ADF pipeline that wrote zero rows at 03:44.

May 12, 2026

Databricks Cannot Find Your Iceberg Table in Glue — The Catalog Configuration That Fails Silently

Six Spark properties stand between your Databricks cluster and an Iceberg table registered in AWS Glue. Get one wrong and you'll see TABLE_OR_VIEW_NOT_FOUND — with no hint about which property caused it.

Databricks Vendor Access: How to Block Direct Workspace Changes Without Breaking Delivery

Your vendor's consultant just overwrote a production notebook at 4pm on a Friday. Here's how folder permissions, service principals, and Git folders prevent that from happening again.

Your Databricks Compute Tab Is Missing Because of Entitlements, Not a Bug

The Compute tab vanishes silently when entitlements are wrong. Three settings control whether your users can see it, and none of them produce an error message.

Delta MERGE From Multiple Source Tables Fails Because UNION ALL Isn't Enough

A UNION ALL in the USING clause looks correct until two source tables contribute a row for the same key. Delta rejects the ambiguity outright.

Best Practices · 7 min read

PySpark split() Silently Drops Data When Your Delimiter Assumption Is Wrong

The split-and-getItem pattern works perfectly on sample data. Production strings have trailing spaces, embedded delimiters, and missing fields that turn your columns into nulls without warning.
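The failure pattern can be reproduced without Spark. This plain-Python sketch (not PySpark) mimics split-then-index: an out-of-range index yields `None`, much as `getItem` yields null when Spark's ANSI mode is off. It then contrasts that with a stricter parse that strips whitespace and counts rejected rows instead of hiding them. The function names and the three-field layout are hypothetical.

```python
def naive_parse(line, delimiter=",", n_fields=3):
    """Mimic split + getItem: missing fields silently become None."""
    parts = line.split(delimiter)
    return [parts[i] if i < len(parts) else None for i in range(n_fields)]


def strict_parse(line, delimiter=",", n_fields=3):
    """Strip whitespace and refuse rows with the wrong number of fields."""
    parts = [p.strip() for p in line.rstrip("\r\n").split(delimiter)]
    if len(parts) != n_fields:
        raise ValueError(f"expected {n_fields} fields, got {len(parts)}: {line!r}")
    return parts


def parse_all(lines, delimiter=",", n_fields=3):
    """Return (parsed rows, rejected raw lines) so data loss is countable."""
    good, rejected = [], []
    for line in lines:
        try:
            good.append(strict_parse(line, delimiter, n_fields))
        except ValueError:
            rejected.append(line)
    return good, rejected


rows, rejects = parse_all(["a, b ,c", "a,b", "a,b,c,d"])
print(len(rows), len(rejects))
```

A non-zero reject count is a signal you can alert on; a column full of nulls usually is not.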

Delta MERGE from Multiple Source Tables Fails When You Skip Deduplication

UNION ALL your sources into MERGE and Spark will punish you with an ambiguous match error — unless you deduplicate first.
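The deduplication step before MERGE can be sketched in plain Python; in PySpark the same idea is usually a window with `row_number()` partitioned by the key and ordered by a timestamp. A minimal sketch, assuming every source row carries a key and an `updated_at` value and that the most recent row should win. The field names and the tie-break rule (later source wins) are assumptions, not the article's method.

```python
from itertools import chain


def dedupe_for_merge(*sources, key="id", order_by="updated_at"):
    """Union several lists of row dicts and keep exactly one row per key.

    The row with the greatest `order_by` wins; on a tie, the row that comes
    later in the argument order replaces the earlier one.
    """
    latest = {}
    for row in chain(*sources):
        k = row[key]
        if k not in latest or row[order_by] >= latest[k][order_by]:
            latest[k] = row
    return list(latest.values())


crm = [{"id": 1, "updated_at": 1, "v": "a"}, {"id": 2, "updated_at": 5, "v": "b"}]
erp = [{"id": 1, "updated_at": 3, "v": "c"}]
print(dedupe_for_merge(crm, erp))
```

With one row per key, no target row can be matched by more than one source row, which is the ambiguity Delta refuses to resolve on its own.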

Best Practices · 14 min read

Power BI Monitoring Tools Compared: The 2026 Buyer's Guide

Native notifications miss the failures that actually hurt. Here's how the major Power BI monitoring tools compare on detection, correlation, and time-to-deploy.

ADF Pipeline Failure Monitoring: Where Native Alerts Stop Working

Native Azure Monitor catches pipeline failures. It misses the Copy activity that succeeded with the wrong schema — and that's the one your stakeholders will call about.

Spark Performance: Scala vs Python Where It Actually Matters

The runtime gap between PySpark and Scala is not what most benchmarks measure. The real cost lives in serialization boundaries, executor process model, and where your UDFs run.

April 26, 2026

Microsoft Fabric Copy Job: Failure Modes Beginners Hit in Production

The tutorial shows a green checkmark. Production shows a half-loaded Lakehouse table and a stakeholder asking why yesterday's revenue is missing.

April 26, 2026

How to Get Notified When a Power BI Dataset Refresh Fails

Power BI has built-in refresh failure notifications. They're not enough for most production environments.

Power BI Scheduled Refresh Fails But Manual Refresh Works: Root Causes and Fixes

If manual refresh works and scheduled refresh fails, the problem is not the data source. It is the environment the scheduled run uses.

Power BI On-Premises Gateway Offline: Causes, Diagnostics, and Fixes

A gateway that goes offline at 02:00 and recovers by 09:00 can silently fail dozens of scheduled refreshes while everyone sleeps.

Incident Response for Data Pipeline Failures: A Data Pipeline Management Playbook

What do you do when it's 3am and your most important dataset just failed to refresh? A data pipeline management playbook for the moment monitoring fires its first alert.

April 4, 2026

Troubleshooting

Troubleshooting · 8 min read

Why Databricks Notebooks Reject raw_input — and How to Authenticate APIs Without a Terminal

StdinNotImplementedError kills OAuth flows in Databricks because notebooks have no stdin. The fix requires restructuring how you acquire tokens — not patching the prompt.

Troubleshooting · 8 min read

Why the Databricks "Create" Button Does Nothing (And the Five Permission Layers Behind It)

A greyed-out button with no error message sends you hunting through five layers of Databricks permissions. Here's how to isolate the cause in minutes.

Troubleshooting · 7 min read

Why "Keyword not supported: variables" Breaks Dataflow Gen2 — Even After You Remove the Gateway

A gateway connection you already deleted can poison your Dataflow Gen2 metadata permanently. The fix is not where you'd expect.

Troubleshooting · 8 min read

Your Databricks Reconciliation Job Runs Forever Because It Has No Reason to Stop

Reconciliation workloads compare two large datasets row by row. When that comparison never converges, your cluster burns compute until someone notices — or the budget runs out.

dbt Production Errors: A Reference Index of Run Failures

Your dbt run finished at 04:12. Three models failed. The error log says 'current transaction is aborted'. Downstream, Power BI already refreshed on yesterday's data.

May 10, 2026

Troubleshooting · 11 min read

Power BI Authentication Errors: A Reference Index of AADSTS Codes

Your scheduled refresh failed with an AADSTS code. The dashboard still shows yesterday's numbers. Here is how to read the code and find the right fix without trawling the full Microsoft reference.

May 9, 2026

Troubleshooting · 12 min read

Azure Data Factory Pipeline Errors: A Reference Index of Common Failures

Your ADF pipeline failed at 03:42 with a UserError code that means nothing on its own. The Power BI refresh that depends on it is two hours away. Here is how to read the error class and jump to the fix.

May 9, 2026

VBA + ADODB Queries Against Power BI Lose Rows Without Telling You

Your DAX query returns 11,000 rows in DAX Studio and 6,000 through VBA. No error. No warning. Just missing data your stakeholders will find before you do.

VBA Queries Against Power BI XMLA Endpoints Silently Drop Rows

Your DAX query returns 11,000 rows in DAX Studio but 6,000 through VBA. The query isn't wrong. The ADODB plumbing is.

Lakeflow Connect SQL Server: Why the Database Setup Step Keeps Failing

The setup wizard looks simple. Four steps, a few stored procedures, done. But the database setup step fails without telling you which prerequisite it actually checked and rejected.

April 27, 2026

Troubleshooting · 10 min read

AADSTS Errors in Power BI Scheduled Refresh: Causes and Fixes

Your scheduled refresh failed at 06:00. The error message contains an AADSTS code. Here's what each one means.

Power BI Gateway Errors: DM_GWPipeline Codes Explained

A DM_GWPipeline error means the gateway is part of the problem. Here's how to find out which part.

ADF Pipeline Permission Errors: Access Denied, 401, and 403 Fixes

The connection test passes. The pipeline run fails with 403. They are not the same thing.

Databricks Job Failures: OOM, Data Skew, and DRIVER_NOT_RESPONDING

DRIVER_NOT_RESPONDING is a symptom. The cause is almost always memory pressure or GC pause. Here is how to find it and fix it.

dbt Run Failures in Production: Permission Errors, SQL Failures, and Incremental Drift

The model works locally. The production deployment fails. The difference is almost always permissions, credentials, or SQL dialect.