High severity · cluster · Databricks

Databricks Error:
COMMUNICATION_LOST

What does this error mean?

The Databricks control plane lost communication with the cluster. The cluster was running but became unreachable. This can be caused by network issues, a driver VM crash, or the Linux OOM killer terminating a process on the driver node.

Common causes

1. The Linux OOM (out-of-memory) killer terminated a process on the driver node under memory pressure
2. A network partition between the cluster and the Databricks control plane
3. The driver VM crashed due to hardware failure on the underlying host
4. A kernel panic or OS-level crash on the driver node
5. A spot instance was terminated without a graceful shutdown signal

How to fix it

1. Check driver logs for OOM killer messages. Look for 'Out of memory: Kill process' in syslog; newer kernels log 'Out of memory: Killed process' instead.
2. Review memory usage metrics for the cluster. If driver memory usage was consistently high, increase the driver node size.
3. Check cloud provider health for the availability zone to rule out infrastructure issues.
4. Give the driver more memory headroom if the workload runs large Spark collect() operations or broadcasts large DataFrames. Where possible, avoid pulling large results onto the driver at all.
5. Consider using a dedicated driver node type with more memory for memory-intensive workloads.
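
Step 1 can be partly automated. The following is a minimal sketch, assuming you have already exported or downloaded the driver's system log as lines of text. The function names `find_oom_kills` and `scan_file` are made up for this example, and the regular expression covers only the two kernel message variants mentioned above.

```python
import re

# Matches both kernel message variants:
#   "Out of memory: Kill process 1234 (java) ..."    (older kernels)
#   "Out of memory: Killed process 1234 (java) ..."  (newer kernels)
OOM_PATTERN = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines):
    """Return a (pid, process_name, line) tuple for every OOM-killer line."""
    hits = []
    for line in log_lines:
        match = OOM_PATTERN.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2), line.rstrip("\n")))
    return hits


def scan_file(path):
    """Scan one log file. The path is whatever location you saved the log to."""
    with open(path, errors="replace") as handle:
        return find_oom_kills(handle)
```

Calling `scan_file` on a saved syslog returns one entry per kill. An empty result does not rule out an OOM kill, because the log may have been lost along with the node.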

Frequently asked questions

How do I know if it was an OOM kill?

Check the cluster's memory usage metrics (Ganglia, where available) for the period just before the failure, and search the driver system logs for 'oom' or 'Out of memory'.
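
The metrics half of that check can be sketched as follows, assuming you can export driver memory samples as `(timestamp_seconds, used_fraction)` pairs. That sample format, the 300-second window, and the 0.9 threshold are all assumptions chosen for illustration, not values taken from Databricks. A high peak just before the failure is suggestive, not proof, so confirm it against the logs.

```python
def peak_memory_before(samples, failure_time, window_seconds=300):
    """Return the peak used-memory fraction in the window before the failure.

    samples: iterable of (timestamp_seconds, used_fraction) pairs (assumed format).
    Returns None when no sample falls inside the window.
    """
    start = failure_time - window_seconds
    in_window = [used for ts, used in samples if start <= ts <= failure_time]
    return max(in_window) if in_window else None


def looks_memory_bound(samples, failure_time, threshold=0.9, window_seconds=300):
    """True when peak usage in the window reached the (assumed) threshold."""
    peak = peak_memory_before(samples, failure_time, window_seconds)
    return peak is not None and peak >= threshold
```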

Can I prevent this with auto-scaling?

No. Auto-scaling adds executor nodes but does not help with driver memory issues. The driver always runs on a single node, so a memory-bound driver needs a larger driver node type.
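
To make the distinction concrete, here is a minimal sketch of a cluster specification fragment built as a Python dictionary. It uses the `node_type_id`, `driver_node_type_id`, and `autoscale` fields of the Databricks Clusters API. The helper name `build_cluster_spec` and the instance types are illustrative assumptions, and a real create request needs further fields, such as the runtime version, that are omitted here.

```python
def build_cluster_spec(worker_type, driver_type, min_workers, max_workers):
    """Build a spec fragment whose driver node type differs from the workers'."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        # Worker nodes: autoscaling varies how many of these run.
        "node_type_id": worker_type,
        # Single driver node: its size is fixed regardless of autoscaling.
        "driver_node_type_id": driver_type,
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }


# Illustrative instance types only: a memory-optimized driver, general-purpose workers.
spec = build_cluster_spec("m5.xlarge", "r5.2xlarge", 2, 8)
```

Raising `max_workers` in this fragment changes only the worker count. Driver memory changes only when `driver_node_type_id` does.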

Source · docs.databricks.com/aws/en/clusters/cluster-error-codes.html