Databricks Error: INTERNAL_ERROR
What does this error mean?
INTERNAL_ERROR means the Databricks control plane or the underlying cloud infrastructure failed during job execution. The cause is not your Spark code, notebook logic, or SQL query. The Jobs API returns result_state: FAILED with error_code: INTERNAL_ERROR when the platform cannot complete the run because of an infrastructure-level fault. In a data pipeline (ADF, Airflow, or dbt Cloud triggering Databricks jobs), this surfaces as an unexpected task failure with no user-facing stack trace. The cluster event log typically contains a more specific termination code (COMMUNICATION_LOST, CLOUD_PROVIDER_SHUTDOWN, etc.) that reveals the actual root cause. Engineers most often see this error during peak hours or when spot/preemptible instances are reclaimed.
Common causes
1. Transient Databricks control plane issue: the Jobs API or cluster manager was temporarily unavailable, so the run aborted without executing user code.
2. The cloud provider reclaimed the underlying VM. On AWS this means a spot instance interruption; on Azure, a Spot VM eviction; on GCP, a preemptible VM shutdown. The cluster terminates with CLOUD_PROVIDER_SHUTDOWN or SPOT_INSTANCE_TERMINATION.
3. Network connectivity loss between the Spark driver and the Databricks control plane. The cluster event log shows termination code COMMUNICATION_LOST. The loss typically lasts 5-15 minutes and coincides with cloud provider networking events.
4. The cluster auto-scaled and the new nodes failed to join within the timeout (typically 10 minutes). This cascades into an INTERNAL_ERROR when the driver cannot redistribute tasks.
5. A Databricks regional outage or degraded service in your workspace region. Concurrent failures across multiple unrelated jobs are the telltale sign.
6. Instance pool exhaustion: the job requested a cluster from a pool with no available instances, and the pool could not provision new VMs within the cloud provider quota.
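When triaging many failed runs, it can help to translate the termination codes named above into the cause they point to. The following is a minimal sketch: the dictionary covers only the three codes mentioned in this article, and both the mapping text and the function name `describe_termination` are illustrative assumptions, not part of any Databricks API.

```python
# Termination codes named in this article, mapped to the cause they suggest.
# The mapping is an illustrative assumption, not an official Databricks table.
CAUSE_BY_TERMINATION_CODE = {
    "COMMUNICATION_LOST": "network loss between driver and control plane",
    "CLOUD_PROVIDER_SHUTDOWN": "cloud provider reclaimed the VM",
    "SPOT_INSTANCE_TERMINATION": "cloud provider reclaimed the VM",
}

UNKNOWN_CAUSE = "unrecognized code: check the status page and the event log"


def describe_termination(code: str) -> str:
    """Return a short cause description for a cluster termination code."""
    normalized = code.strip().upper()
    return CAUSE_BY_TERMINATION_CODE.get(normalized, UNKNOWN_CAUSE)


print(describe_termination("COMMUNICATION_LOST"))
# prints: network loss between driver and control plane
```

Codes outside the table fall through to the generic hint, which keeps the helper safe to call on any value found in the event log.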
How to fix it
1. Check the Databricks status page (https://status.databricks.com) for active or recent incidents in your workspace region. Cross-reference the timestamp of your failed run with any listed incidents.
2. Open the failed run in the Databricks UI: Jobs → select job → click the failed run. Under 'Cluster', click the cluster link and go to the 'Event Log' tab. Look for the termination code (the `code` value in the termination reason, as in the example log output below). This is the actual root cause (e.g. COMMUNICATION_LOST, SPOT_INSTANCE_TERMINATION, CLOUD_PROVIDER_SHUTDOWN).
3. Retry the run via the UI ('Repair Run') or start a new run from the CLI: `databricks jobs run-now --job-id <JOB_ID>`. Repair Run re-executes only the failed tasks, whereas `run-now` starts a fresh run of the whole job. INTERNAL_ERROR resolves on retry in roughly 80% of cases because the underlying infrastructure fault is transient.
4. If spot/preemptible instances caused the failure, switch the job cluster to on-demand instances for critical pipelines: in the job cluster config, set `aws_attributes.availability` to `ON_DEMAND` (AWS) or `azure_attributes.availability` to `ON_DEMAND_AZURE` (Azure).
5. For recurring INTERNAL_ERROR on auto-scaling clusters, set a fixed cluster size or reduce max_workers to avoid hitting cloud provider quota limits. Check your cloud provider's VM quota: AWS → EC2 → Limits, Azure → Subscription → Usage + quotas, GCP → IAM & Admin → Quotas.
6. If orchestrating via ADF or Airflow, add a retry policy to the Databricks activity/operator. In ADF: set the retry count to 2 with a 5-minute retry interval on the Databricks Notebook activity. In Airflow: set `retries=2, retry_delay=timedelta(minutes=5)` on the DatabricksRunNowOperator.
7. Contact Databricks support with the run_id, cluster_id, and termination code if the error persists after 3+ retries. Use `databricks runs get --run-id <RUN_ID> | jq '.state'` to capture the full state object for the support ticket. This is the legacy CLI syntax; in newer Databricks CLI versions the equivalent command is `databricks jobs get-run <RUN_ID>`.
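The retry policy from step 6 and the escalation threshold from step 7 can be sketched in plain Python for orchestrators that lack built-in retries. This is only an illustration under stated assumptions: `trigger_run` is a hypothetical callable that starts a run, waits for it, and returns an outcome string, and `run_with_retries` is not part of any Databricks or Airflow API.

```python
import time
from typing import Callable, Tuple


def run_with_retries(
    trigger_run: Callable[[], str],
    retries: int = 2,
    delay_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, int]:
    """Re-trigger a run while it ends in INTERNAL_ERROR.

    trigger_run is a hypothetical callable (an assumption for this sketch)
    that returns an outcome string such as "SUCCESS" or "INTERNAL_ERROR".
    Returns the final outcome and the number of attempts made.
    """
    attempts = 0
    while True:
        attempts += 1
        outcome = trigger_run()
        # Stop on any outcome other than INTERNAL_ERROR (including ordinary
        # failures, which a retry will not fix) or when retries are used up.
        if outcome != "INTERNAL_ERROR" or attempts > retries:
            return outcome, attempts
        # Wait before the next attempt, mirroring the 5-minute interval.
        sleep(delay_seconds)
```

With the defaults, a run is attempted at most three times (one initial attempt plus two retries). If the final outcome is still INTERNAL_ERROR, that matches the point at which step 7 suggests opening a support ticket. The `sleep` parameter exists so the wait can be replaced with a no-op when testing.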
Example log output
2026-05-11T08:22:14.003Z [ERROR] Run 748291 of job 5023 terminated with result_state=FAILED, error_code=INTERNAL_ERROR
2026-05-11T08:22:14.005Z [ERROR] Cluster 0511-081903-abc123 terminated. Reason: {"code":"COMMUNICATION_LOST","parameters":{"databricks_error_message":"The cluster lost contact with the Databricks control plane for more than 10 minutes."}}
2026-05-11T08:22:14.008Z [WARN] No retry attempted: max_retries=0 for task 'load_facts_daily'
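To pull the termination code out of log lines shaped like the example above, a small parser is enough. The sketch below assumes the line ends with `Reason: ` followed by a JSON object, as in the sample; real log formats may differ, and the function name `extract_termination_code` is illustrative.

```python
import json
import re
from typing import Optional

# Matches the JSON object that follows "Reason:" at the end of a log line.
# The line layout is assumed from the sample output in this article.
REASON_PATTERN = re.compile(r"Reason:\s*(\{.*\})\s*$")


def extract_termination_code(log_line: str) -> Optional[str]:
    """Return the termination code from a cluster termination log line,
    or None if the line has no parseable reason object."""
    match = REASON_PATTERN.search(log_line)
    if match is None:
        return None
    try:
        reason = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    return reason.get("code")


sample = (
    '2026-05-11T08:22:14.005Z [ERROR] Cluster 0511-081903-abc123 terminated. '
    'Reason: {"code":"COMMUNICATION_LOST","parameters":{"databricks_error_message":'
    '"The cluster lost contact with the Databricks control plane for more than 10 minutes."}}'
)
print(extract_termination_code(sample))
# prints: COMMUNICATION_LOST
```

Lines without a reason object, such as the first and third lines of the sample, return None.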