Severity: Critical · Category: Resource · Platform: Databricks

Databricks Error: CLUSTER_TERMINATED_UNEXPECTEDLY

What does this error mean?

A running Databricks cluster stopped without a user-initiated action, typically due to an out-of-memory condition, a driver JVM crash, or an underlying cloud infrastructure failure.

Common causes

1. The driver or worker ran out of memory (OOM) and the JVM was killed by the OS.
2. An Azure/AWS spot instance was preempted by the cloud provider.
3. A runaway shuffle or broadcast join caused disk spill beyond available temp space.
4. A cluster init script raised an unhandled exception that terminated the process.
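
For the memory and shuffle causes above, the usual levers are the number of shuffle partitions and the broadcast-join threshold. The following is a minimal sketch, assuming a target of roughly 128 MiB per shuffle partition (a common rule of thumb, not a value taken from the Databricks documentation). `suggest_spark_conf` is a hypothetical helper; `spark.sql.shuffle.partitions` and `spark.sql.autoBroadcastJoinThreshold` are standard Spark SQL settings.

```python
MIB = 1024 * 1024


def suggest_spark_conf(shuffle_bytes: int,
                       target_partition_bytes: int = 128 * MIB,
                       disable_auto_broadcast: bool = False) -> dict:
    """Return Spark conf overrides that keep shuffle partitions near a target size.

    Hypothetical helper: the 128 MiB default is an assumed rule of thumb.
    """
    if shuffle_bytes < 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be non-negative, target must be positive")
    # Integer ceiling division, with at least one partition.
    partitions = max(1, -(-shuffle_bytes // target_partition_bytes))
    conf = {"spark.sql.shuffle.partitions": str(partitions)}
    if disable_auto_broadcast:
        # A threshold of -1 turns off automatic broadcast joins in Spark SQL.
        conf["spark.sql.autoBroadcastJoinThreshold"] = "-1"
    return conf
```

In a notebook, each returned pair could be applied with `spark.conf.set(key, value)` before the heavy stage runs. Smaller partitions lower the per-task memory footprint at the cost of more tasks.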

How to fix it

1. Open the cluster's Event Log and check the Termination Reason field for the root cause code.
2. If the cause is OOM, increase the driver or worker memory, reduce partition size, or add more workers.
3. If the cause is spot preemption, switch to on-demand instances for production jobs or enable Spot Fallback.
4. Review Ganglia or the Spark UI for memory and GC pressure in the period before the termination.
5. Enable cluster log delivery to DBFS or S3 so that logs remain available after the cluster is gone.
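
The first, third, and fifth steps above can also be done through the Clusters REST API rather than the UI. The sketch below assumes an AWS workspace and a personal access token; `get_termination_reason` and `harden_cluster_spec` are hypothetical helper names, and the log destination is a placeholder. The endpoint `GET /api/2.0/clusters/get` and the fields `termination_reason`, `aws_attributes.availability`, `first_on_demand`, and `cluster_log_conf` come from the Clusters API; on Azure the equivalent settings live under `azure_attributes` instead.

```python
import copy
import json
import urllib.parse
import urllib.request


def get_termination_reason(host: str, token: str, cluster_id: str) -> dict:
    """Fetch the termination_reason object for a cluster (empty if absent)."""
    query = urllib.parse.urlencode({"cluster_id": cluster_id})
    url = f"{host.rstrip('/')}/api/2.0/clusters/get?{query}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        info = json.load(resp)
    return info.get("termination_reason", {})


def harden_cluster_spec(spec: dict, log_destination: str) -> dict:
    """Return a copy of a cluster spec with spot fallback and log delivery set."""
    hardened = copy.deepcopy(spec)  # leave the caller's spec untouched
    aws = hardened.setdefault("aws_attributes", {})
    # Fall back to on-demand instances when spot capacity is unavailable.
    aws["availability"] = "SPOT_WITH_FALLBACK"
    # Assumption: keep the first node (the driver) on an on-demand instance.
    aws.setdefault("first_on_demand", 1)
    # Deliver driver and executor logs somewhere that outlives the cluster.
    hardened["cluster_log_conf"] = {"dbfs": {"destination": log_destination}}
    return hardened
```

The dictionary returned by `get_termination_reason` would typically be inspected for its `code`, `type`, and `parameters` keys. The output of `harden_cluster_spec` is meant to be passed to whichever cluster create or edit call you already use.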

Frequently asked questions

How is this different from COMMUNICATION_LOST?

COMMUNICATION_LOST means Databricks lost contact with the driver but the underlying VM may still be running (e.g., network partition). CLUSTER_TERMINATED_UNEXPECTEDLY means the cluster process itself has definitively stopped.

Will Databricks automatically retry a job that fails this way?

Yes, if you configure a retry policy on the job. Set the maximum retries and minimum retry interval in the job settings, and ensure your job is idempotent before enabling retries.
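
As an illustration of the retry policy described above, the sketch below adds retry fields to a task definition in the shape used by the Jobs API (`max_retries`, `min_retry_interval_millis`, `retry_on_timeout`). `with_retry_policy` is a hypothetical helper, and the default values are assumptions chosen for the example, not recommendations from the source.

```python
def with_retry_policy(task: dict,
                      max_retries: int = 2,
                      min_interval_seconds: int = 300,
                      retry_on_timeout: bool = False) -> dict:
    """Return a copy of a job task definition with retry settings applied."""
    if max_retries < 0 or min_interval_seconds < 0:
        # This helper only accepts non-negative values, for simplicity.
        raise ValueError("max_retries and min_interval_seconds must be >= 0")
    return {
        **task,
        "max_retries": max_retries,
        # The Jobs API expresses the interval in milliseconds.
        "min_retry_interval_millis": min_interval_seconds * 1000,
        "retry_on_timeout": retry_on_timeout,
    }
```

Because a retried run repeats the whole task, this only helps when the task can safely run twice, for example by overwriting a target partition rather than appending to it.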

Source · docs.databricks.com/aws/en/clusters/cluster-error-codes.html