MetricSign
Start free
High severitysqlDatabricks

Databricks Error:
OUT_OF_MEMORY

What does this error mean?

The Databricks cluster ran out of memory while executing a query. The query was consuming more memory than the cluster has available, causing the executor or driver to be killed.

Common causes

  • 1A cartesian join (cross join) producing an unexpectedly large intermediate result
  • 2A shuffle operation creating too many small partitions or one very large partition (data skew)
  • 3Collecting a large dataset to the driver with collect() or toPandas()
  • 4Broadcasting a large table that exceeds the broadcast threshold
  • 5Processing significantly more data than usual due to upstream data growth

How to fix it

  1. 1Identify the query step that is consuming the most memory using the Spark UI (look for the stage with the highest memory spill).
  2. 2Check for data skew — if one partition is much larger than others, use salting or repartition.
  3. 3Replace collect() with distributed aggregations and only pull summary results to the driver.
  4. 4Increase the number of shuffle partitions (spark.sql.shuffle.partitions) to reduce per-partition size.
  5. 5Increase the cluster size or switch to a memory-optimized instance type.
  6. 6Enable Spark memory spill to disk as a safety valve (at the cost of performance).

Example log output

org.apache.spark.SparkException: Job aborted due to stage failure: Task 14 in stage 3.0 failed 4 times. Most recent failure: Lost task 14.3 in stage 3.0 (TID 612, 10.0.1.8, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 28.2 GB of 28 GB physical memory used.
Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2844)
SQLSTATE: 53000

Frequently asked questions

Can I fix OOM without increasing cluster size?

Often yes — query optimization (fixing skew, avoiding cartesian joins, reducing collect() usage) can dramatically reduce memory usage without adding nodes.

What is data skew and how do I detect it?

Data skew occurs when one partition has much more data than others. You can detect it in the Spark UI by looking for a stage where one task takes much longer than the others.

Source · docs.databricks.com/aws/en/clusters/cluster-error-codes.html

Other sql errors