MetricSign
Start free
error-reference9 min·

dbt Production Errors: A Reference Index of Run Failures

Your dbt run finished at 04:12. Three models failed. The error log says 'current transaction is aborted'. Downstream, Power BI already refreshed on yesterday's data.

Lees dit artikel in het Nederlands →

dbt Production Errors: A Reference Index of Run Failures

Three models failed at 04:12. Nobody noticed until the board meeting.

It's 04:12. Your nightly dbt run in production just failed. Three models in the marts directory threw database errors, and two more were skipped because they depend on the failed ones. Nobody on the data team is awake.

At 06:00, the Power BI scheduled refresh runs anyway. It pulls from the tables that dbt was supposed to update. Two of those tables still have yesterday's data. One has data from three days ago because the model failed last night and the night before.

At 09:14, the VP of Sales opens the pipeline dashboard. The Q2 numbers dropped 30% overnight. The data engineer starts debugging at 09:20, finds the dbt error in the run logs, and spends 45 minutes tracing which downstream datasets were affected.

This post is the reference you open at 09:20. It covers the error categories you'll hit in production dbt runs, how to limit damage with --fail-fast, where dbt Core and dbt Cloud differ in error handling, and what to do when dbt Cloud itself is the problem.

Most dbt production errors fall into three categories

Knowing which category you're in determines your debugging path.

Compilation errors

dbt compiles your SQL models into executable queries before sending them to the database. Compilation errors mean dbt never ran the query. Common causes:

A Jinja macro references a variable that doesn't exist in the production environment. Your dev profile sets target_schema to dev_analytics; the prod profile uses analytics. If a macro hardcodes the dev schema name, it works locally and fails in CI.

A ref() points to a model that was renamed or deleted. In a project with 200+ models, renaming stg_orders to stg_raw_orders breaks every model that references the old name. dbt catches this at compile time.

YAML configuration errors: a misplaced indent in schema.yml or a config() block with an invalid materialization. These are caught before any SQL runs.

Compilation errors are the easiest to fix because dbt gives you the exact file, line number, and error message. The challenge in production is that they block the entire run, or everything downstream if you use --fail-fast.

Database errors

The query compiled. dbt sent it to the warehouse. The warehouse rejected it. This is where sql state: 25p02 lives.

25p02 is PostgreSQL's error code for "current transaction is aborted, commands ignored until end of transaction block." In dbt context, this happens when a previous statement in the same transaction already failed (permission error, missing schema, syntax error), and PostgreSQL refuses every subsequent statement until the transaction is rolled back.

The fix: find the original error that caused the transaction to abort. The 25p02 message is a symptom. Check the logs for the first failure in that model's run. Common root causes are permission errors on production schemas, missing tables that exist in dev but not in prod, and column type mismatches after a source schema change.

Other common database errors:

  • Timeout: Your model works in dev with 100K rows. Production has 50M rows. Snowflake returns a TIMEOUT error; Databricks' Spark jobs may produce an OOM.
  • Permission denied: The service account running your production dbt job doesn't have the same grants as your personal dev account. A GRANT SELECT on a new source table is easy to forget.
  • Concurrency limits: Running 30 models in parallel can exceed your warehouse's concurrent query limit, especially on smaller Snowflake warehouses or shared Databricks clusters.

Dependency and freshness errors

dbt source freshness checks validate that source tables have recent data before models run. A freshness error means the source data is older than the configured threshold:

``yaml sources: - name: raw_sales loaded_at_field: _loaded_at freshness: warn_after: {count: 12, period: hour} error_after: {count: 24, period: hour} ``

If raw_sales was last loaded 26 hours ago, dbt throws an error before running any model that depends on it. This is one of dbt's most useful production features: it prevents you from building marts on stale data. But it also means your dbt run can fail because of an upstream problem that has nothing to do with dbt itself.

fail-fast exists because a full run after first failure is wasted compute

The default behavior of dbt run is to continue executing all remaining models after one fails. In development, this makes sense. You want to see all errors at once. In production, it wastes compute and can produce corrupted intermediate tables.

dbt run --fail-fast stops execution on the first model error. If stg_orders fails, dbt does not attempt int_orders, fct_orders, or any other model downstream of it. Models that don't depend on the failed one still run.

When to use fail-fast:

  • Nightly production runs where consistent stale data is better than a mix of fresh and stale tables. If one model in a dependency chain fails, models that ran after it may have consumed incomplete upstream data.
  • CI/CD runs in staging. Fail-fast in CI lets you catch the first error without waiting for the full build.

When not to use fail-fast:

  • Independent model groups. If your project has Sales, Marketing, and Finance models with no shared dependencies, fail-fast stops all three when only Sales has an error. Run each domain separately: dbt run --fail-fast --select tag:sales.

Configuration in dbt_project.yml (dbt 1.6+):

``yaml flags: fail_fast: true ``

Or per invocation:

``bash dbt run --fail-fast --select tag:critical dbt run --fail-fast --exclude tag:experimental ``

The flag applies to dbt run, dbt seed, and dbt snapshot. It does not apply to dbt test, which always executes all selected tests regardless of individual failures.

Core and Cloud surface errors differently, and that changes how fast you respond

dbt Core and dbt Cloud use the same compilation engine. The errors themselves are identical. The difference is in how you see them, how fast you see them, and what happens automatically when they occur.

Logging and error surfacing

dbt Core logs to target/run_results.json and logs/dbt.log. You read them by opening the file, piping through jq, or checking whatever log aggregator your team uses (Datadog, CloudWatch, Grafana). If your orchestrator (Airflow, Dagster, Prefect) captures stdout, errors show up there too. There is no built-in UI.

dbt Cloud shows errors in a run history UI with syntax-highlighted SQL, the compiled query, the raw database error, and per-model timing. For teams running 50+ models, clicking through the UI is faster than reading a 2,000-line log file.

Alerting and notifications

dbt Core has no built-in alerting. If your job fails, your orchestrator sends an email or Slack message, but only if you configured it. Airflow uses on_failure_callback; Dagster has @op(on_failure=...). If you run dbt Core via cron, you probably have no alerting unless you built it yourself.

dbt Cloud sends email notifications on job failure by default. It supports webhooks that can trigger a Slack message, a PagerDuty incident, or a custom function. The webhook payload includes job ID, run ID, and status.

Retry behavior

dbt Core does not retry failed models. If stg_orders fails, it stays failed until the next scheduled run. You can add retry logic in your orchestrator (Airflow's retries parameter), but dbt itself runs once and exits.

dbt Cloud job definitions include a configurable retry with delay. This helps with transient errors: brief warehouse unavailability, temporary network issues, permissions during a deployment window. For persistent errors (schema mismatches, syntax errors, missing tables), retries just waste time.

Sometimes the problem is dbt Cloud itself, not your project

When your dbt Cloud job fails and the error isn't in your SQL or configuration, check whether the platform itself has an issue.

status.getdbt.com shows current service status for the API, IDE, scheduler, and metadata API individually. An outage on the scheduler means your jobs won't trigger even if the rest of the platform is operational.

For programmatic checks, the dbt Cloud Administrative API exposes a /api/v2/accounts/{account_id}/runs/ endpoint. If your calls return HTTP 503 or timeout consistently, the platform has an issue.

Setting up external monitoring for dbt Cloud outages:

  1. Subscribe to status.getdbt.com via email or RSS. The page runs on Statuspage.io and sends incident notifications automatically.
  2. Add a health check in your own monitoring. Call the dbt Cloud API every 5 minutes. Alert if it returns non-200 for 3 consecutive checks.
  3. Configure an orchestrator fallback. If you run Airflow alongside dbt Cloud, configure it to trigger a dbt Core run when dbt Cloud jobs miss their expected completion window.

Check the status page history to understand the frequency and duration of past incidents for your region. For teams where a 30-minute delay is acceptable, dbt Cloud's built-in retry handles most outages. For teams with strict SLA windows (your Power BI refresh runs at 06:00 and the board opens the dashboard at 07:00), an external monitoring layer is worth the setup.

The debugging workflow that works in production

When a production dbt run fails, the debugging path follows a consistent sequence.

1. Read the error message, starting from the first failure. In a multi-model run, the first error is the root cause. Everything after it may be a cascading failure, including 25p02 errors. In dbt Cloud, sort the run results by execution order. In dbt Core, check target/run_results.json and look for the first entry with status: "error".

2. Compare compiled SQL to the source. dbt generates compiled SQL in target/compiled/. Open the failing model's compiled file and run the query directly in your warehouse client (Snowflake's worksheet, Databricks SQL editor, psql). If it fails there too, the problem is in the SQL or the database state. If it succeeds, the problem is in how dbt executes it: transaction handling, permissions, or connection settings.

3. Check environment differences. The most common reason a model works in dev and fails in production:

  • Different schemas or databases (target config in profiles.yml)
  • Different permissions (your personal account vs. the service account)
  • Different data volumes (a query that runs in 10 seconds on 1M rows times out on 100M rows)
  • Different warehouse sizing (Snowflake XS vs. L warehouse, Databricks cluster auto-scaling config)

4. Check source freshness. If the model ran but the output looks wrong, the input data may be stale. Run dbt source freshness against the sources your model depends on.

5. Check the warehouse query history. Snowflake's QUERY_HISTORY view and Databricks' Spark UI show execution time, rows scanned, and spill-to-disk metrics. A query that processed 10M rows last week and 500M rows today (because someone removed a WHERE clause upstream) will look normal in dbt's logs but obvious in the warehouse query profiler.

Monitoring dbt runs across your stack

dbt's built-in monitoring tells you that a model failed. It does not tell you which Power BI datasets depend on that model, whether the downstream refresh has already run, or who needs to know that the Sales dashboard is showing stale data.

MetricSign connects to dbt Cloud (via webhook) and dbt Core (via run_results.json polling) alongside your warehouse and BI layer. When a dbt model fails in production, MetricSign reads the error, identifies which downstream datasets are affected through the lineage graph, and routes an alert to the right channel before the next Power BI refresh runs.

A typical sequence: your nightly dbt run fails at 04:12 on fct_revenue. MetricSign detects the failure, checks the lineage, and finds that fct_revenue feeds the "Revenue by Region" Power BI dataset. That dataset refreshes at 06:00. MetricSign sends a Telegram alert to the on-call data engineer at 04:15, 105 minutes before the refresh would load stale data into the board dashboard.

MetricSign does not replace dbt's error handling. It extends it by connecting dbt errors to their downstream impact in the BI layer. If your dbt models feed only a warehouse and your consumers query it directly, dbt Cloud's native alerting may be sufficient. MetricSign adds value when a BI tool like Power BI refreshes on its own schedule, independent of dbt, because neither tool knows about the other's state.

The gap in dbt production monitoring is not the detection of errors. dbt already does that. The gap is the window between detection and downstream impact: the time where an error sits in a log file while the BI refresh runs on schedule, unaware that its source data is wrong.

Frequently asked questions

Why does my dbt job fail in production but work in development?+
The most common causes are environment differences: different schemas or databases in profiles.yml, different permissions between your personal account and the production service account, and different data volumes that cause queries to timeout in production. Run the compiled SQL from target/compiled/ directly in your production warehouse to isolate the issue.
How do I use --fail-fast to limit blast radius?+
Add --fail-fast to your dbt run command or set fail_fast: true in the flags section of dbt_project.yml (dbt 1.6+). This stops execution on the first model error and skips all downstream models. It applies to dbt run, seed, and snapshot, but not to dbt test.
What does sql state 25p02 mean in dbt?+
PostgreSQL error 25p02 means 'current transaction is aborted, commands ignored until end of transaction block.' A previous statement in the same transaction already failed, and PostgreSQL is rejecting everything that follows. Find the first error in the model's run log. That's the root cause. The 25p02 is a cascading symptom.
How do I check if dbt Cloud is down?+
Visit status.getdbt.com for real-time service status. The page shows separate components: API, IDE, scheduler, and metadata API. Subscribe via email or RSS for automatic notifications. For programmatic checks, monitor the dbt Cloud API and alert on repeated 503 responses.
What are the differences between dbt Core and dbt Cloud for error handling?+
The errors themselves are identical. The differences are in surfacing and response: Core logs to files and relies on your orchestrator for alerting. Cloud provides a UI with run history, sends email notifications by default, supports webhooks for Slack or PagerDuty integration, and offers configurable retries on failure.

Related integrations

Related articles