MetricSign
Start free
High severityexecutionDatabricks

Databricks Error:
MLFLOW_RUN_FAILED

What does this error mean?

An MLflow experiment run was terminated with a FAILED status, meaning the training or evaluation job did not complete successfully.

Common causes

  • 1A Python exception in the training script terminated the run before it could log final metrics
  • 2The cluster ran out of memory during model training on large datasets
  • 3A dependency package version conflict caused an import error at startup
  • 4The MLflow artifact storage location (S3/ADLS/GCS) is inaccessible or has insufficient permissions

How to fix it

  1. 1Step 1: Open the MLflow experiment in the Databricks UI, click the failed run, and check the 'System Metrics' and 'Tags' tabs for the error message.
  2. 2Step 2: Review the cluster driver logs for the full Python stack trace.
  3. 3Step 3: If memory-related, increase the cluster size or reduce the batch size / dataset sample.
  4. 4Step 4: If artifact storage fails, verify the storage account permissions for the MLflow artifact URI.
  5. 5Step 5: Re-run the experiment after fixing the root cause — MLflow run IDs are immutable, so a new run is always created.

Frequently asked questions

Can I resume a failed MLflow run?

No — MLflow runs are immutable once ended. You must start a new run, optionally using mlflow.start_run(run_name=...) with the same parameters.

How do I set up alerts for failed MLflow runs?

Use Databricks Jobs to wrap the MLflow training script and configure email or webhook notifications on job failure in the Jobs UI.

Source · docs.databricks.com/aws/en/mlflow/index.html

Other execution errors