Data Pipeline Monitoring: When AI Agents Aren't Traced

Your error log doesn't know an AI wrote the query

A Power BI dataset refresh fails at 3:14 AM with error code DMTS_DM_Error_1030 — timeout during query execution. You open the refresh history. You see the duration spiked. You check the data source. It's fine.

What you don't see: a Copilot-generated DAX measure was added to the model two hours earlier by a business analyst who asked Copilot to "show revenue by region adjusted for currency." Copilot produced a SUMMARIZECOLUMNS expression with a nested CALCULATETABLE that fans out across a 40-million-row fact table without any filter context. The measure compiled. It even returned results in Desktop. But at refresh time, when the Analysis Services engine evaluates all measures for metadata, the query plan explodes.

The refresh log records the timeout. It records the dataset name. It does not record that the measure was AI-generated, because Power BI doesn't distinguish between human-authored and Copilot-authored DAX. To the engine, a measure is a measure.

This is the core problem. Every monitoring tool in the Microsoft data stack (Activity Events API, XMLA endpoint diagnostics, Azure Monitor) treats query execution as a black box with an author field that says "user@domain.com." Whether that user typed the DAX by hand or clicked Accept on a Copilot suggestion is invisible to telemetry. The same gap exists in Databricks, where AI-assisted notebook cells generate SQL that runs against Unity Catalog tables under the notebook owner's identity, and in Azure Data Factory, where Copilot can now suggest pipeline expressions that become part of your deployed data flows.

The failure modes are identical to those of human-authored code. The observability gap is that you can't filter for agent-generated failures, can't trend them, and can't correlate them to the moment an agent produced the code.

Agent-generated SQL breaks lineage because it never existed in source control

Data lineage tools parse your repository. They trace a column from a dbt model back through a staging layer to a source table. This works because the SQL is static — it's in a .sql file, committed to Git, and the lineage graph is built at compile time.

Agent-generated SQL doesn't have a file. When a Databricks AI function generates a query inside a notebook, or when ADF Copilot suggests a derived column expression, that code exists only at runtime. It's not in your repo. It's not in your dbt manifest. It's in an execution log, if you're lucky, and in ephemeral memory if you're not.

This breaks lineage in two ways. First, downstream consumers can't trace their inputs. If an AI-generated transformation reshapes a table that feeds a Power BI dataset, the lineage graph shows a gap — data arrives at the semantic model from a source that has no documented transformation logic. Second, impact analysis fails. When you need to answer "what breaks if I rename this column," your lineage tool can't account for agent-generated queries that reference that column dynamically.

Databricks does record query text: the system.query.history system table stores statement text for queries run on SQL warehouses, and verbose audit logs in system.access.audit capture notebook command text. But neither carries a flag distinguishing AI-generated queries from human ones. You'd need to parse the query text for patterns, such as unusual aliasing or formatting quirks that LLMs tend to produce, which is fragile and unreliable.

The practical consequence: your lineage documentation drifts from reality every time an agent writes a query that isn't captured in your transformation layer. Teams that rely on lineage for compliance or debugging are working with an incomplete map, and they may not realize it until a column change cascades into a failure they can't explain.

How agent-generated queries create untracked failures

Non-deterministic queries make historical comparison useless

Traditional pipeline monitoring relies on baselines. A dataset refresh that normally takes 8 minutes but suddenly takes 25 minutes triggers an alert. This works when the workload is deterministic — same model, same queries, same data volume, predictable growth.

AI agents break this assumption. A Copilot-generated measure might produce different query plans depending on how the user phrased their request. Two analysts asking for "sales by quarter" and "quarterly sales breakdown" might get functionally similar but structurally different DAX. Each produces a different memory footprint, a different number of storage engine queries, and a different execution time.

In Databricks, AI-assisted code generation in notebooks means the same notebook can produce different SQL across runs if a cell was regenerated in between. A data engineer might accept an assistant suggestion on Monday that uses a window function, then regenerate the cell on Wednesday and get a CTE-based approach. Both are correct. Both have different performance characteristics. Your monitoring sees two different execution profiles for the same notebook and has no way to attribute the variance to a code change that happened outside version control.

This makes anomaly detection unreliable. If your baseline is built from a mix of human-authored and agent-authored query executions, the variance in the baseline itself is wider than it should be. Real anomalies — a data source going slow, a table growing unexpectedly — get masked by the noise of agent-generated query variance.
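To make the masking effect concrete, here is a minimal sketch with made-up refresh durations. It assumes each run can be tagged with a fingerprint of the model definition that produced it (the "model-v1" and "model-v2" labels are hypothetical), and it uses a plain z-score as a stand-in for whatever anomaly detector you actually run.

```python
from statistics import mean, stdev

def z_score(history, observed):
    """Standard score of `observed` against a list of past durations."""
    return (observed - mean(history)) / stdev(history)

# Hypothetical refresh durations in minutes. The fingerprint identifies
# which version of the model definition produced each run.
runs = [
    ("model-v1", 8.0), ("model-v1", 8.4), ("model-v1", 7.8), ("model-v1", 8.2),
    ("model-v2", 14.5), ("model-v2", 15.2), ("model-v2", 14.8), ("model-v2", 15.5),
]

# A new model-v1 run that is genuinely slow (for example, a sluggish source).
observed_fingerprint, observed_minutes = "model-v1", 11.0

# Baseline built from every run, regardless of which definition produced it.
mixed_history = [minutes for _, minutes in runs]
# Baseline built only from runs that share the observed run's fingerprint.
segmented_history = [m for fp, m in runs if fp == observed_fingerprint]

mixed_z = z_score(mixed_history, observed_minutes)
segmented_z = z_score(segmented_history, observed_minutes)

print(f"mixed baseline z-score:     {mixed_z:.2f}")
print(f"segmented baseline z-score: {segmented_z:.2f}")
```

Against the mixed baseline the slow run sits close to the mean and would pass a typical three-sigma threshold. Against the segmented baseline it stands out clearly. The numbers are illustrative only; the point is that the baseline needs to be keyed on what code actually ran.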

The fix isn't to ban AI agents from your data stack. That ship has sailed — Microsoft reported that 70% of Fortune 500 companies were using Copilot features in Fabric as of early 2026. The fix is to instrument your monitoring to detect when execution patterns diverge from committed code, so you can at least isolate which failures correlate with uncommitted, potentially agent-generated, changes.

Refresh failures cluster around Copilot adoption waves

Organizations that rolled out Power BI Copilot access in phases saw a pattern: refresh failure rates spiked 2-3 weeks after each wave of user enablement. Not immediately — because Copilot suggestions start in Desktop, and it takes time for those measures and visuals to get published to the Service and hit the refresh cycle.

The failure signatures are consistent. DMTS_DM_Error_1030 (timeout) and DMTS_DM_Error_1033 (memory exceeded) appear more frequently. The datasets involved tend to be larger semantic models where Copilot-generated measures interact with complex relationships. The errors are legitimate — the queries genuinely exceed resource limits — but the root cause is a measure that no one on the data team wrote or reviewed.

ADF pipelines show a similar lag pattern. When Copilot is enabled for pipeline authoring, expression errors in derived column transformations increase within weeks. The errors are often type mismatches — Copilot generates an expression that assumes a string column is numeric, or produces a date format that doesn't match the source. These failures surface at pipeline runtime as DFExecutorUserError with an inner message about type conversion, which looks identical to a human-authored expression error.

The correlation is hard to prove with native tools alone because you'd need to cross-reference the timestamp of each Copilot interaction with subsequent pipeline or refresh failures. The Activity Events API records some Copilot usage events, but joining that data with refresh failure logs requires custom engineering — building a pipeline to monitor your pipelines, which is exactly the kind of recursive complexity that burns engineering time.

MetricSign detects these failure clusters by grouping refresh errors across datasets and correlating timing patterns with root cause context, so you can identify whether a spike in timeouts maps to a model change window rather than a data source issue.

What you can instrument today without new tooling

You can't wait for Microsoft or Databricks to add an "AI-generated" flag to every query log. But you can build guardrails with existing capabilities.

In Power BI, use XMLA endpoint read access to periodically snapshot your semantic model metadata. Script it with the Tabular Object Model (TOM) — enumerate all measures and calculated columns, hash their expressions, and store the hashes with timestamps. When a refresh fails, compare the current model state against the last known-good snapshot. If a measure expression changed since the last successful refresh, you've found your candidate. The script is straightforward in C# or PowerShell using the Microsoft.AnalysisServices.Tabular namespace, and it runs in under a second for most models.
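The hashing and comparison step might look like the following sketch. It is written in Python rather than C# or PowerShell, and it assumes the measure names and expressions have already been pulled from the model (via TOM or any other route) into a plain dictionary. The function names and sample measures are hypothetical.

```python
import hashlib

def fingerprint(expression: str) -> str:
    """Hash an expression after collapsing runs of whitespace, so that
    re-indenting a measure does not register as a change."""
    normalized = " ".join(expression.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def snapshot(measures: dict[str, str]) -> dict[str, str]:
    """Map each measure name to the hash of its expression."""
    return {name: fingerprint(expr) for name, expr in measures.items()}

def diff_snapshots(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Report which measures were added, removed, or changed between snapshots."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
    }

# Snapshot taken after the last successful refresh (hypothetical measures).
last_good = snapshot({
    "Total Sales": "SUM(Sales[Amount])",
    "Order Count": "COUNTROWS(Sales)",
})

# Snapshot taken after the failed refresh.
current = snapshot({
    "Total Sales": "SUMX(Sales, Sales[Amount] * Sales[Rate])",
    "Order Count": "  COUNTROWS(Sales)\n",  # same expression, different whitespace
    "Revenue Adjusted": "CALCULATE([Total Sales], ALL(Region))",
})

changes = diff_snapshots(last_good, current)
print(changes)
```

One caveat with this normalization: collapsing whitespace also collapses it inside string literals, so a change that only alters spacing within a quoted string would go unnoticed. In practice you would persist each snapshot with a timestamp so the diff can be run against the last known-good state.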

In Databricks, enable verbose audit logs and query system.access.audit for notebook command events, which include the command text. Build a scheduled job that compares the SQL text of each query against the committed notebook source in your Git repo. Any query that doesn't match a committed cell was either dynamically generated or modified in an interactive session. Both cases warrant review if they precede a job failure.
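A rough sketch of that comparison is below. It assumes you have already exported the executed query text from the audit data and extracted the SQL cells from the committed notebook; both arrive here as plain lists of strings. The normalization is deliberately crude and the function names are hypothetical.

```python
import re

def normalize_sql(sql: str) -> str:
    """Crude normalization: drop comments, collapse whitespace, lowercase,
    and strip trailing semicolons. Note that lowercasing also affects
    string literals, which a stricter comparison would preserve."""
    sql = re.sub(r"--[^\n]*", " ", sql)                     # line comments
    sql = re.sub(r"/\*.*?\*/", " ", sql, flags=re.DOTALL)   # block comments
    return " ".join(sql.lower().split()).rstrip("; ")

def uncommitted_queries(executed: list[str], committed_cells: list[str]) -> list[str]:
    """Return executed queries that match no committed cell after normalization."""
    committed = {normalize_sql(cell) for cell in committed_cells}
    return [q for q in executed if normalize_sql(q) not in committed]

# SQL cells as they exist in the Git repo (hypothetical).
committed_cells = [
    "SELECT region, SUM(amount) FROM sales GROUP BY region;",
]

# Query text recovered from the audit data (hypothetical).
executed = [
    "select region, sum(amount)\nfrom sales -- nightly\ngroup by region",
    "SELECT * FROM sales s CROSS JOIN fx_rates f",
]

flagged = uncommitted_queries(executed, committed_cells)
print(flagged)
```

Exact matching after normalization will produce false positives for parameterized or templated queries, so treat the output as a review queue rather than a verdict.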

In ADF, enable diagnostic logging to Log Analytics and write KQL queries that filter for pipeline failures where the failing activity was recently modified. The ActivityUpdated timestamp in the pipeline metadata, combined with the failure timestamp, gives you a window to investigate.
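The windowing logic itself is simple once both timestamps are in hand. The sketch below is in Python rather than KQL and assumes you have exported failure times from Log Analytics and last-modified times from wherever your factory records them. The pipeline names, the data shapes, and the 48-hour window are all illustrative assumptions.

```python
from datetime import datetime, timedelta

def failures_after_recent_change(failures, last_modified, window=timedelta(hours=48)):
    """Return failures that occurred within `window` after the pipeline was modified.

    failures:      list of (pipeline_name, failed_at) tuples
    last_modified: dict mapping pipeline_name to its last modification time
    """
    suspects = []
    for name, failed_at in failures:
        changed_at = last_modified.get(name)
        if changed_at is None:
            continue  # no modification record for this pipeline
        elapsed = failed_at - changed_at
        # Keep only failures that happened after the change and inside the window.
        if timedelta(0) <= elapsed <= window:
            suspects.append((name, changed_at, failed_at))
    return suspects

# Hypothetical exported data.
failures = [
    ("load_orders", datetime(2026, 1, 12, 3, 14)),
    ("load_customers", datetime(2026, 1, 12, 4, 0)),
]
last_modified = {
    "load_orders": datetime(2026, 1, 11, 16, 30),    # edited the previous afternoon
    "load_customers": datetime(2025, 11, 2, 9, 0),   # untouched for months
}

suspects = failures_after_recent_change(failures, last_modified)
for name, changed_at, failed_at in suspects:
    print(f"{name}: modified {changed_at}, failed {failed_at}")
```

A failure that lands inside the window is a candidate for review, not proof that the change caused it. Widening or narrowing the window trades missed correlations against noise.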

None of these approaches are perfect. They're workarounds for an observability gap that the platforms haven't closed yet. But they convert a class of invisible failures — agent-generated code that breaks something downstream — into something you can at least detect and investigate before the next stakeholder email lands in your inbox.
