High severity · execution · Databricks

Databricks Error:
MLFLOW_RUN_FAILED

What does this error mean?

An MLflow experiment run was terminated with a FAILED status, meaning the training or evaluation job did not complete successfully.

Common causes

1. A Python exception in the training script terminated the run before it could log final metrics.
2. The cluster ran out of memory during model training on large datasets.
3. A dependency package version conflict caused an import error at startup.
4. The MLflow artifact storage location (S3/ADLS/GCS) is inaccessible or has insufficient permissions.
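
As a rough triage aid, the four causes above can be matched against patterns that often appear in a driver log or stack trace. This is an illustrative heuristic only, not a Databricks or MLflow feature: the function name `classify_failure`, the cause labels, and the pattern lists are all assumptions and would need tuning for a real workload.

```python
import re

# Hypothetical pattern table: each cause from the list above is paired with
# regexes that often (not always) show up in logs for that kind of failure.
# Specific causes are checked first; a generic traceback is the fallback.
CAUSE_PATTERNS = [
    ("out_of_memory", [r"MemoryError", r"OutOfMemoryError", r"CUDA out of memory"]),
    ("dependency_conflict", [r"ModuleNotFoundError", r"ImportError", r"VersionConflict"]),
    ("artifact_storage", [r"AccessDenied", r"\b403\b", r"Forbidden"]),
]


def classify_failure(log_text: str) -> str:
    """Return a coarse cause label for a failed run's log text."""
    for cause, patterns in CAUSE_PATTERNS:
        if any(re.search(pattern, log_text) for pattern in patterns):
            return cause
    # No specific pattern matched, but a Python traceback is present.
    if "Traceback (most recent call last)" in log_text:
        return "python_exception"
    return "unknown"
```

A label from this helper is only a starting point; the full stack trace in the driver logs remains the authoritative evidence.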

How to fix it

1. Open the MLflow experiment in the Databricks UI, click the failed run, and check the 'System Metrics' and 'Tags' tabs for the error message.
2. Review the cluster driver logs for the full Python stack trace.
3. If the failure is memory-related, increase the cluster size or reduce the batch size or dataset sample.
4. If artifact storage fails, verify the storage account permissions for the MLflow artifact URI.
5. Re-run the experiment after fixing the root cause. A fresh mlflow.start_run() call creates a new run with its own run ID, and the failed run stays in the experiment for reference.
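
The first step can also be done programmatically. The sketch below lists recent FAILED runs in an experiment together with their user-set tags. It assumes `client` behaves like `mlflow.tracking.MlflowClient`; the helper name `summarize_failed_runs` and the output shape are hypothetical, and the client is passed in as an argument so the function can be tried against a stub.

```python
def summarize_failed_runs(client, experiment_id, max_results=20):
    """Return run ID, run name, and user tags for recent FAILED runs.

    `client` is assumed to expose MlflowClient.search_runs().
    """
    runs = client.search_runs(
        experiment_ids=[experiment_id],
        filter_string="attributes.status = 'FAILED'",
        max_results=max_results,
        order_by=["attributes.start_time DESC"],  # newest failures first
    )
    summary = []
    for run in runs:
        tags = dict(run.data.tags)
        summary.append({
            "run_id": run.info.run_id,
            # MLflow stores the display name in the reserved mlflow.runName tag.
            "run_name": tags.get("mlflow.runName"),
            # Drop MLflow's reserved tags to leave only user-set ones.
            "tags": {k: v for k, v in tags.items() if not k.startswith("mlflow.")},
        })
    return summary


# Example usage (requires the mlflow package and a reachable tracking server):
#   from mlflow.tracking import MlflowClient
#   for row in summarize_failed_runs(MlflowClient(), experiment_id="<experiment id>"):
#       print(row)
```

Whether a tag actually contains the error message depends on what the training script logged before it failed; if nothing useful appears, the driver logs from step 2 are the next place to look.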

Frequently asked questions

Can I resume a failed MLflow run?

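Not in the sense of continuing training where it stopped: MLflow does not restore training state. Passing an existing ID to mlflow.start_run(run_id=...) reopens that run so further metrics and tags can be logged against it, but a parameter that was already logged cannot be changed to a different value. For a clean retry, start a new run, optionally using mlflow.start_run(run_name=...) with the same parameters.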
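
A minimal sketch of starting a new run that reuses the failed run's parameters follows. It assumes `client` behaves like `mlflow.tracking.MlflowClient`; the helper name `start_retry_run` and the `retry_of` tag are hypothetical conventions, not part of MLflow.

```python
def start_retry_run(client, failed_run_id, run_name=None):
    """Create a new run in the same experiment and copy the old params.

    Returns the new run ID. The caller is responsible for training,
    logging metrics, and ending the run (e.g. client.set_terminated()).
    """
    failed = client.get_run(failed_run_id)

    tags = {"retry_of": failed_run_id}  # hypothetical tag linking back to the failed run
    if run_name:
        # mlflow.runName is the reserved tag MLflow uses for a run's display name.
        tags["mlflow.runName"] = run_name

    new_run = client.create_run(failed.info.experiment_id, tags=tags)

    # Copy every parameter from the failed run onto the new one.
    for key, value in failed.data.params.items():
        client.log_param(new_run.info.run_id, key, value)

    return new_run.info.run_id


# Example usage (requires the mlflow package):
#   from mlflow.tracking import MlflowClient
#   client = MlflowClient()
#   new_id = start_retry_run(client, "<failed run id>", run_name="retry")
#   ... train and log metrics against new_id ...
#   client.set_terminated(new_id)
```

Copying parameters this way keeps the retry comparable with the failed run in the experiment view, while the new run ID keeps the two attempts distinct.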

How do I set up alerts for failed MLflow runs?

Use Databricks Jobs to wrap the MLflow training script and configure email or webhook notifications on job failure in the Jobs UI.
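
The same notifications can be expressed as job settings rather than configured by hand in the UI. The sketch below builds a settings payload of the kind accepted by the Databricks Jobs API 2.1 job-creation endpoint. The helper name `build_job_settings`, the task key, and all argument values are hypothetical, and the field names should be checked against the current Jobs API reference before use.

```python
import json


def build_job_settings(job_name, notebook_path, cluster_id,
                       alert_emails, webhook_ids=()):
    """Build a job settings dict with failure notifications.

    Assumes the training script is a notebook run on an existing cluster.
    """
    settings = {
        "name": job_name,
        "tasks": [{
            "task_key": "train",  # hypothetical task key
            "notebook_task": {"notebook_path": notebook_path},
            "existing_cluster_id": cluster_id,
        }],
        # Email these addresses when a job run fails.
        "email_notifications": {"on_failure": list(alert_emails)},
    }
    if webhook_ids:
        # Each ID refers to a notification destination already set up
        # in the workspace.
        settings["webhook_notifications"] = {
            "on_failure": [{"id": wid} for wid in webhook_ids]
        }
    return settings


# Placeholder values for illustration only.
payload = build_job_settings(
    job_name="mlflow-training",
    notebook_path="/Repos/example/train",
    cluster_id="<cluster id>",
    alert_emails=["ml-alerts@example.com"],
)
print(json.dumps(payload, indent=2))
```

The resulting JSON would be sent as the body of a job-creation request; because the MLflow run is started inside the job's task, a training failure surfaces as a job failure and triggers the notification.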

Source · docs.databricks.com/aws/en/mlflow/index.html