Severity: High | Category: execution

MLflow Run Error:
MLFLOW_RUN_FAILED

What does this error mean?

An MLflow experiment run was terminated with a FAILED status, meaning the training or evaluation job did not complete successfully.
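As a sketch of how a run ends up FAILED: MLflow's `start_run()` context manager ends the run with status `FAILED` when an unhandled exception escapes the `with` block, and `FINISHED` otherwise. The helper names below (`run_status_for`, `train_with_tracking`, `train_fn`) are illustrative, not part of MLflow.

```python
def run_status_for(exc):
    """Mirror the context manager's behavior: an unhandled exception
    ends the run as FAILED, a clean exit ends it as FINISHED."""
    return "FAILED" if exc is not None else "FINISHED"

def train_with_tracking(train_fn):
    # Assumes the `mlflow` package is installed; imported locally so the
    # pure helper above stays usable without it.
    import mlflow

    # If train_fn raises, mlflow ends this run with status=FAILED and
    # re-raises the exception; no final metrics are logged.
    with mlflow.start_run():
        train_fn()
```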

Common causes

  • A Python exception in the training script terminated the run before it could log final metrics
  • The cluster ran out of memory during model training on large datasets
  • A dependency package version conflict caused an import error at startup
  • The MLflow artifact storage location (S3/ADLS/GCS) is inaccessible or has insufficient permissions
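For the artifact-storage cause, a quick way to tell whether the artifact store itself is reachable is to list the run's artifacts: `MlflowClient.list_artifacts` raises if the backing storage denies access. This is a minimal sketch assuming `mlflow` is installed; `check_artifact_store` and `artifact_probe_message` are hypothetical helper names.

```python
def artifact_probe_message(uri, ok):
    """Format a one-line verdict for an artifact-store probe."""
    return f"artifact store {uri}: {'reachable' if ok else 'ACCESS FAILED'}"

def check_artifact_store(run_id):
    # Assumes the `mlflow` package is installed and a tracking URI is configured.
    import mlflow

    client = mlflow.tracking.MlflowClient()
    uri = client.get_run(run_id).info.artifact_uri
    try:
        # Raises (e.g. permission errors from S3/ADLS/GCS) if the
        # artifact location is inaccessible.
        client.list_artifacts(run_id)
        return artifact_probe_message(uri, True)
    except Exception:
        return artifact_probe_message(uri, False)
```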

How to fix it

  1. Open the MLflow experiment in the Databricks UI, click the failed run, and check the 'System Metrics' and 'Tags' tabs for the error message.
  2. Review the cluster driver logs for the full Python stack trace.
  3. If the failure is memory-related, increase the cluster size or reduce the batch size / dataset sample.
  4. If artifact storage fails, verify the storage account permissions for the MLflow artifact URI.
  5. Re-run the experiment after fixing the root cause; MLflow run IDs are immutable, so a new run is always created.
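To find which runs in an experiment need this treatment, `mlflow.search_runs` accepts a filter on run status. A minimal sketch, assuming `mlflow` is installed; `failed_run_filter` and `list_failed_runs` are illustrative helper names, while the `attributes.status` / `attributes.start_time` filter keys come from MLflow's search syntax.

```python
def failed_run_filter(since_ms=None):
    """Build an MLflow search filter string for failed runs,
    optionally restricted to runs started at or after `since_ms`."""
    clauses = ["attributes.status = 'FAILED'"]
    if since_ms is not None:
        clauses.append(f"attributes.start_time >= {since_ms}")
    return " AND ".join(clauses)

def list_failed_runs(experiment_id):
    # Assumes the `mlflow` package is installed and a tracking URI is
    # configured; returns a pandas DataFrame of matching runs.
    import mlflow

    return mlflow.search_runs(
        experiment_ids=[experiment_id],
        filter_string=failed_run_filter(),
    )
```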

Frequently asked questions

Can I resume a failed MLflow run?

No — MLflow runs are immutable once ended. You must start a new run, optionally using mlflow.start_run(run_name=...) with the same parameters.
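A sketch of that retry pattern: read the failed run's logged parameters with `mlflow.get_run`, then start a fresh run and log the same parameters before training again. Assumes `mlflow` is installed; `retry_params`, `rerun`, and `train_fn` are hypothetical names for illustration.

```python
def retry_params(params):
    """Copy the old run's params for the new attempt; ended runs are
    immutable, so a fresh run is the only option."""
    return dict(params)

def rerun(failed_run_id, run_name, train_fn):
    # Assumes the `mlflow` package is installed and a tracking URI is configured.
    import mlflow

    old = mlflow.get_run(failed_run_id)          # read-only view of the failed run
    params = retry_params(old.data.params)       # params were logged before the crash
    with mlflow.start_run(run_name=run_name):    # brand-new run, new run ID
        mlflow.log_params(params)
        train_fn(params)
```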

How do I set up alerts for failed MLflow runs?

Use Databricks Jobs to wrap the MLflow training script and configure email or webhook notifications on job failure in the Jobs UI.
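As a sketch of what that job configuration looks like, the fragment below uses the `email_notifications` and `webhook_notifications` fields from the Databricks Jobs API; the job name, email address, and webhook destination ID are hypothetical placeholders.

```python
# Hypothetical Databricks Jobs API settings fragment: notify on failure
# of the job that wraps the MLflow training script.
job_settings = {
    "name": "mlflow-training",  # placeholder job name
    "email_notifications": {
        "on_failure": ["ml-team@example.com"],  # placeholder address
    },
    "webhook_notifications": {
        # Destination IDs are created under Workspace settings > Notifications.
        "on_failure": [{"id": "my-webhook-destination-id"}],  # placeholder ID
    },
}
```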
