The benchmark you read is probably wrong
Most blog posts comparing Spark performance for Scala versus Python end with a single number — Scala is 10x faster, or 2x faster, or roughly equivalent. The number is usually correct for the workload the author tested and useless for yours.
The accepted Stack Overflow answer with 429 votes makes the right distinction. Spark exposes three execution paths and language choice affects each one differently. The RDD API runs Python code in a separate worker process and pays a per-row serialization cost. The DataFrame API compiles to a Catalyst logical plan, then a Tungsten physical plan, with neither containing a single line of your application's Python or Scala. Spark SQL is identical to DataFrame in this respect. UDFs cut across all three and reintroduce the language boundary wherever you use them.
The Stack Overflow 2024 developer survey shows Python at 51% adoption and Scala at 2.6%. The hiring market reflects that — most data teams write PySpark not because they benchmarked it but because their engineers already know Python. The interesting question is not which language is faster in the abstract. It is: given that you wrote the job in PySpark, where will the JVM-Python boundary actually cost you, and which of those costs are worth fixing?
The rest of this post answers that. We will walk through the four places the boundary shows up — RDD operations, Python UDFs, the executor process model, and driver-side collection — and what each one looks like in a Databricks job log.
DataFrames erase the gap, until you write a UDF
Run df.groupBy("customer_id").agg(sum("amount")) in PySpark and in Scala, then inspect the physical plan: df.explain(true) in Scala, df.explain(True) in Python. The output is identical: HashAggregate with partial and final stages, an Exchange hashpartitioning shuffle, and a Project. Catalyst does not care which language built the DataFrame. PySpark crosses the py4j boundary exactly once, at plan-construction time; after that the JVM executes the entire query, exactly as it does for Scala.
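You can verify this yourself from a PySpark shell. A minimal sketch with invented toy data; any grouped aggregation shows the same plan shape:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as spark_sum

spark = SparkSession.builder.appName("plan-check").getOrCreate()

# Toy data; any grouped aggregation produces the same plan shape.
df = spark.createDataFrame(
    [("c1", 10.0), ("c2", 5.0), ("c1", 7.5)],
    ["customer_id", "amount"],
)

agg = df.groupBy("customer_id").agg(spark_sum("amount").alias("total"))

# Prints parsed, analyzed, optimized, and physical plans. Look for the
# partial/final HashAggregate pair around an Exchange hashpartitioning;
# the Scala version of this query prints the same physical plan.
agg.explain(True)
```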
This is why DataFrame-heavy PySpark jobs benchmark within 5% of their Scala equivalents. The Python process on the driver builds a logical plan, ships it to the JVM, and waits. Executors run JVM bytecode generated by Tungsten's whole-stage code generation. There is no Python in the hot path.
The moment you call df.withColumn("score", my_python_udf(col("features"))), the picture changes. Spark cannot push your function into Catalyst. It inserts a BatchEvalPython operator into the plan, which serializes each row with pickle, sends it through a socket to a forked Python worker, deserializes the result, and reinserts it into the columnar batch. On a billion-row DataFrame this dwarfs the actual computation. A simple int-to-int UDF can take 40 minutes where the equivalent Scala UDF takes 90 seconds.
Pandas UDFs, registered with @pandas_udf, change the protocol. Rows are batched into Arrow record batches (typically 10,000 rows per batch, controlled by spark.sql.execution.arrow.maxRecordsPerBatch) and transferred in columnar form. The same workload drops from 40 minutes to roughly four. Still slower than Scala, but in the same order of magnitude. If you are writing PySpark UDFs in 2026 without Arrow enabled, that is the first thing to check: spark.sql.execution.arrow.pyspark.enabled = true.
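To see the two operators side by side, compare the registration styles directly. A sketch with made-up names (double_slow, double_fast) and an arbitrary row count:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf, udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "features")

# Plain Python UDF: planned as BatchEvalPython, rows pickled one at a time.
@udf(returnType=LongType())
def double_slow(x):
    return x * 2

# Pandas UDF: planned as ArrowEvalPython, Arrow batches cross the boundary.
@pandas_udf(LongType())
def double_fast(s: pd.Series) -> pd.Series:
    return s * 2

df.withColumn("score", double_slow(col("features"))).explain()  # BatchEvalPython
df.withColumn("score", double_fast(col("features"))).explain()  # ArrowEvalPython
```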
The executor process model multiplies memory pressure
Scala Spark runs N task threads inside a single JVM per executor. PySpark runs N JVM task threads, and each one forks a Python worker process the first time it needs to evaluate Python code. On an executor with 16 cores you can end up with 16 Python processes plus the JVM, each holding the interpreter, every imported module, and a private copy of every broadcast variable.
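You can watch the fork happen from a local shell. A small sketch assuming a local[4] master; worker processes are pooled and reused, so expect at most one Python PID per concurrent task slot:

```python
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()
sc = spark.sparkContext

# Each partition reports the PID of the Python worker that evaluated it.
# Workers are pooled and reused, so expect at most 4 distinct PIDs here:
# one forked Python process per concurrent task slot, plus the JVM.
pids = sc.parallelize(range(8), 8).mapPartitions(lambda it: [os.getpid()]).collect()
print(sorted(set(pids)))
```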
This surfaces in three concrete ways. First, broadcast variables behave worse in PySpark. A 500MB broadcast deserialized once per executor JVM in Scala is deserialized once per Python worker in PySpark, so it costs 500MB × N. On a worker node with 64GB and 8 executors of 16 cores each, that is 64GB of Python copies before any actual data lands. Second, you hit the Linux OOM killer rather than a clean Spark OOM. When the kernel kills a Python worker, the executor reports ExecutorLostFailure or PythonException with no Java stack trace pointing at the cause. Check dmesg or the kernel log on the worker; you will see oom-killer entries naming the python3 PID.
Third, container memory accounting in YARN or Kubernetes counts Python memory against the executor's overall limit. The setting that controls headroom is spark.executor.memoryOverhead, which defaults to max(384MB, 0.10 × executorMemory). For PySpark workloads with non-trivial UDFs or broadcasts, raise it to 25-30%. The symptom of getting this wrong is the executor being killed by YARN with exit code 143 and the message "Container killed by YARN for exceeding memory limits."
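In configuration terms, the headroom looks something like the sketch below. The numbers are illustrative only, and these are launch-time settings, shown as builder calls for brevity but normally set on spark-submit or in the cluster config:

```python
from pyspark.sql import SparkSession

# Illustrative numbers, not a recommendation; tune against your Executors tab.
spark = (
    SparkSession.builder
    .appName("pyspark-memory-headroom")
    .config("spark.executor.memory", "16g")
    # Default overhead would be max(384MB, 10%) = 1.6g; raise toward 25-30%
    # for UDF- or broadcast-heavy PySpark jobs.
    .config("spark.executor.memoryOverhead", "4g")
    # Optional hard cap per Python worker (Spark 2.4+). When unset, Python
    # memory shares the overhead space above.
    .config("spark.executor.pyspark.memory", "2g")
    .getOrCreate()
)
```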
None of this happens in Scala because there is one process. The thread-based model does not isolate failures as cleanly — a memory leak in one task can take down the whole executor — but the steady-state footprint is dramatically lower.
Where Scala still wins outright: RDDs and driver-side work
If your codebase still uses sc.textFile(...).map(...).reduceByKey(...), language choice matters more than anywhere else in Spark. RDD operations have no Catalyst optimizer. Every map function is your code, executed for every record. In PySpark that means pickle-and-pipe per record, with no Arrow fallback because RDDs predate Arrow integration.
The fix is usually not to rewrite in Scala. It is to move to DataFrames. A textFile + map + reduceByKey pipeline almost always has an equivalent expressed as spark.read.text() with select, groupBy, and agg, and that equivalent runs at JVM speed regardless of which language wrote it. The original Stack Overflow answer is from 2015. The reason its RDD warnings have aged well is that a lot of legacy ETL code never migrated.
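A sketch of that migration, with a made-up path and "key,value" line layout; the point is the shape of the rewrite, not the schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split, sum as spark_sum

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Legacy RDD pipeline: each lambda is pickled Python, run per record in a
# forked worker. ("/data/events.txt" and its "key,value" layout are made up.)
rdd_totals = (
    sc.textFile("/data/events.txt")
    .map(lambda line: line.split(","))
    .map(lambda parts: (parts[0], int(parts[1])))
    .reduceByKey(lambda a, b: a + b)
)

# DataFrame equivalent: same result, but the whole pipeline compiles through
# Catalyst to Tungsten-generated bytecode with no Python in the hot path.
df_totals = (
    spark.read.text("/data/events.txt")
    .select(split(col("value"), ",").alias("p"))
    .select(col("p")[0].alias("key"), col("p")[1].cast("long").alias("val"))
    .groupBy("key")
    .agg(spark_sum("val").alias("total"))
)
```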
The other place Scala wins is driver-side aggregation. A common pattern: collect a few hundred thousand rows to the driver, post-process them, write a config file. In Scala this runs in the JVM that is already there. In PySpark the rows get pickled across the py4j boundary into the Python process, where you then loop over them. For 100K rows of a wide schema this can take 30 seconds versus near-instant in Scala. The fix is usually to push the post-processing back into Spark — use approxQuantile, percentile_approx, or a window function instead of collect-and-iterate.
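For the quantile case the push-down is a one-liner. A sketch assuming df is the DataFrame you were about to collect and amount is a numeric column:

```python
# approxQuantile runs as a JVM aggregation and ships three floats back to
# the driver, instead of pickling 100K rows across py4j to loop over them.
p50, p95, p99 = df.approxQuantile("amount", [0.5, 0.95, 0.99], 0.01)
```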
The survey numbers reinforce this pragmatically. With 2.6% Scala adoption, hiring a Scala team to chase RDD performance is rarely the right move. Hiring a PySpark engineer who knows when to reach for a Pandas UDF is.
How to actually measure it on your cluster
Stop trusting external benchmarks. The variables that matter — your data shape, your UDFs, your cluster topology, your Spark version — are not in anyone else's test. Use the Spark UI on Databricks to find the real cost.
Open the SQL/DataFrame tab and click into a query. The plan tree shows operator-level timing. BatchEvalPython and ArrowEvalPython operators expose a "time" metric — that is wall-clock time spent in the Python boundary for that stage. If it is more than 20% of total query time, your UDFs are the bottleneck. If it is under 5%, language choice is not your problem and rewriting in Scala will not help.
For RDD jobs, the Stages tab shows task duration distribution. Compare the median task time to the 95th percentile. A wide spread on PySpark RDD jobs often means a few executors are paging Python workers in and out of memory. Combine with the Executors tab — look at "Peak JVM Memory" against "Peak Other Memory." Other memory is largely Python. If it exceeds memoryOverhead, you are about to be killed.
For driver-side cost, look at the gap between the last stage's completion time and the next stage starting. Long gaps with low driver CPU mean py4j marshaling. Long gaps with high driver CPU mean your Python post-processing is the bottleneck.
This is the kind of regression you want to catch before stakeholders do. A PySpark job that ran in 12 minutes last week and 47 minutes today usually has not changed language — it has gained a UDF, lost Arrow, or hit a broadcast threshold. Tracking duration and stage metrics across runs turns a half-day investigation into a five-minute diff.
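One way to build that diff without leaving Python is the Spark monitoring REST API, which serves the same data the UI renders. A sketch assuming a live driver UI on localhost:4040; a history server works the same way with its own host and a historical application ID:

```python
import requests

# The Spark monitoring REST API exposes the numbers behind the Spark UI.
BASE = "http://localhost:4040/api/v1"

app_id = requests.get(f"{BASE}/applications").json()[0]["id"]
stages = requests.get(f"{BASE}/applications/{app_id}/stages").json()

# Snapshot {stage name: executor run time in ms}; persist one per run and
# diff two snapshots to see which stage gained the 35 minutes.
snapshot = {
    s["name"]: s["executorRunTime"]
    for s in stages
    if s["status"] == "COMPLETE"
}
print(sorted(snapshot.items(), key=lambda kv: -kv[1])[:5])
```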