MetricSign
High severity · job · Databricks

Databricks Error: INTERNAL_ERROR

What does this error mean?

INTERNAL_ERROR means the Databricks control plane or underlying cloud infrastructure failed during job execution — your Spark code, notebook logic, or SQL query is not the cause. The Jobs API returns result_state: FAILED with error_code: INTERNAL_ERROR when the platform cannot complete the run due to an infrastructure-level fault. In a data pipeline context (ADF, Airflow, dbt Cloud triggering Databricks jobs), this surfaces as an unexpected task failure with no user-facing stack trace. The cluster event log typically contains a more specific termination code (COMMUNICATION_LOST, CLOUD_PROVIDER_SHUTDOWN, etc.) that reveals the actual root cause. Engineers usually see this during peak hours or when spot/preemptible instances are reclaimed.
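
As a minimal sketch of how an orchestrator might tell a platform fault apart from an ordinary failure, the function below inspects a run state dictionary. The `result_state` and `error_code` keys follow the field names used in the paragraph above; the exact payload shape is an assumption and may differ between Jobs API versions, and `is_platform_fault` is a hypothetical helper, not part of any Databricks SDK.

```python
def is_platform_fault(state: dict) -> bool:
    """Return True when a run state looks like a platform-side failure.

    Assumes the dict carries `result_state` and `error_code` keys as
    described in the text; real payloads may nest these differently.
    """
    return (
        state.get("result_state") == "FAILED"
        and state.get("error_code") == "INTERNAL_ERROR"
    )


# A platform fault versus a failure that carries no INTERNAL_ERROR code.
print(is_platform_fault({"result_state": "FAILED", "error_code": "INTERNAL_ERROR"}))
print(is_platform_fault({"result_state": "FAILED"}))
```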

Common causes

  • Transient Databricks control plane issue: the Jobs API or cluster manager was temporarily unavailable, causing the run to abort without executing user code.
  • Cloud provider reclaimed the underlying VM. On AWS this means a spot instance interruption; on Azure, a spot VM eviction; on GCP, a preemptible VM shutdown. The cluster terminates with CLOUD_PROVIDER_SHUTDOWN or SPOT_INSTANCE_TERMINATION.
  • Network connectivity loss between the Spark driver and the Databricks control plane. The cluster event log shows termination code COMMUNICATION_LOST; such losses typically last 5-15 minutes during cloud provider networking events.
  • The cluster auto-scaled and the new nodes failed to join within the timeout (typically 10 minutes). This cascades into an INTERNAL_ERROR when the driver cannot redistribute tasks.
  • A Databricks regional outage or degraded service in your workspace region. Concurrent failures across multiple unrelated jobs are the telltale sign.
  • Instance pool exhaustion: the job requested a cluster from a pool with no available instances, and the pool could not provision new VMs within the cloud provider quota.
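
The termination codes named in the causes above can be turned into a small triage lookup. This is an illustrative sketch: the mapping covers only the three codes this article mentions, not the full set Databricks can emit, and `explain_termination` is a hypothetical helper name.

```python
# Codes named in this article, mapped to the cause each one points at.
# Illustrative only; Databricks defines many more termination codes.
TERMINATION_CAUSES = {
    "CLOUD_PROVIDER_SHUTDOWN": "cloud provider reclaimed or shut down the VM",
    "SPOT_INSTANCE_TERMINATION": "spot/preemptible instance was reclaimed",
    "COMMUNICATION_LOST": "cluster lost contact with the control plane",
}


def explain_termination(code: str) -> str:
    """Translate a cluster termination code into a likely cause."""
    return TERMINATION_CAUSES.get(
        code, "unrecognized code; check the status page and event log"
    )


print(explain_termination("COMMUNICATION_LOST"))
```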

How to fix it

  1. Check the Databricks status page (https://status.databricks.com) for active or recent incidents in your workspace region. Cross-reference the timestamp of your failed run with any listed incidents.
  2. Open the failed run in the Databricks UI: Jobs → select job → click the failed run. Under 'Cluster', click the cluster link and go to the 'Event Log' tab. Look for the termination_code field, which gives the actual root cause (e.g. COMMUNICATION_LOST, SPOT_INSTANCE_TERMINATION, CLOUD_PROVIDER_SHUTDOWN).
  3. Retry the run via the UI ('Repair Run') or CLI: `databricks jobs run-now --job-id <JOB_ID>`. 'Repair Run' re-runs only the unsuccessful tasks and their dependents within the existing run, whereas `run-now` starts a fresh run of the whole job. INTERNAL_ERROR resolves on retry in roughly 80% of cases because the underlying infrastructure fault is transient.
  4. If spot/preemptible instances caused the failure, switch the job cluster to on-demand instances for critical pipelines: in the job cluster config, set `aws_attributes.availability` to `ON_DEMAND` (AWS) or `azure_attributes.availability` to `ON_DEMAND_AZURE` (Azure).
  5. For recurring INTERNAL_ERROR on auto-scaling clusters, set a fixed cluster size or reduce max_workers to avoid hitting cloud provider quota limits. Check your cloud provider's VM quota: AWS → EC2 → Limits, Azure → Subscription → Usage + quotas, GCP → IAM & Admin → Quotas.
  6. If orchestrating via ADF or Airflow, add a retry policy to the Databricks activity/operator. In ADF: set retry count to 2 with 5-minute intervals on the Databricks Notebook activity. In Airflow: set `retries=2, retry_delay=timedelta(minutes=5)` on the DatabricksRunNowOperator.
  7. Contact Databricks support with the run_id, cluster_id, and termination_code if the error persists after 3+ retries. Use `databricks runs get --run-id <RUN_ID> | jq '.state'` to capture the full state object for the support ticket.
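
Steps 4 and 5 both amount to editing the job cluster spec. The sketch below shows one way to apply them to a spec held as a Python dict before submitting it. The `availability`, `autoscale`, and `num_workers` keys follow the cluster config fields named above; `harden_cluster_spec` and its parameters are hypothetical names, and GCP is left out because the article does not give its setting.

```python
import copy
from typing import Optional

# Attribute block and on-demand value per cloud, as given in step 4.
ON_DEMAND = {
    "aws": ("aws_attributes", "ON_DEMAND"),
    "azure": ("azure_attributes", "ON_DEMAND_AZURE"),
}


def harden_cluster_spec(
    spec: dict, cloud: str, fixed_workers: Optional[int] = None
) -> dict:
    """Return a copy of a job cluster spec that avoids spot capacity.

    When fixed_workers is given, autoscaling is replaced by a fixed size.
    """
    hardened = copy.deepcopy(spec)  # leave the caller's spec untouched
    attr_key, availability = ON_DEMAND[cloud]
    hardened.setdefault(attr_key, {})["availability"] = availability
    if fixed_workers is not None:
        hardened.pop("autoscale", None)
        hardened["num_workers"] = fixed_workers
    return hardened


spec = {
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "aws_attributes": {"availability": "SPOT_WITH_FALLBACK"},
}
print(harden_cluster_spec(spec, "aws", fixed_workers=4))
```

Returning a copy rather than mutating in place keeps the original spec available for non-critical jobs that can still tolerate spot capacity.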

Example log output

2026-05-11T08:22:14.003Z [ERROR] Run 748291 of job 5023 terminated with result_state=FAILED, error_code=INTERNAL_ERROR
2026-05-11T08:22:14.005Z [ERROR] Cluster 0511-081903-abc123 terminated. Reason: {"code":"COMMUNICATION_LOST","parameters":{"databricks_error_message":"The cluster lost contact with the Databricks control plane for more than 10 minutes."}}
2026-05-11T08:22:14.008Z [WARN] No retry attempted: max_retries=0 for task 'load_facts_daily'
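
For alerting, the termination code can be pulled out of a line like the second one above. This is a minimal sketch that assumes the `Reason: {json}` layout shown in the sample; real log formats may differ, and `extract_termination_code` is a hypothetical helper.

```python
import json
from typing import Optional

REASON_MARKER = "Reason: "


def extract_termination_code(log_line: str) -> Optional[str]:
    """Return the `code` from a 'Reason: {...}' JSON payload, if present."""
    _, marker, payload = log_line.partition(REASON_MARKER)
    if not marker:
        return None  # line carries no termination reason
    try:
        reason = json.loads(payload)
    except json.JSONDecodeError:
        return None  # payload is truncated or not JSON
    return reason.get("code") if isinstance(reason, dict) else None


line = (
    "2026-05-11T08:22:14.005Z [ERROR] Cluster 0511-081903-abc123 terminated. "
    'Reason: {"code":"COMMUNICATION_LOST","parameters":{"databricks_error_message":'
    '"The cluster lost contact with the Databricks control plane for more than 10 minutes."}}'
)
print(extract_termination_code(line))
```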

Frequently asked questions

Should I report INTERNAL_ERROR to Databricks support?

Only if it recurs on the same job after 3+ retries, or if it hits multiple jobs simultaneously. Include the run_id, cluster_id, and the cluster termination code from the event log. Single occurrences are almost always transient and not worth a ticket.

Is INTERNAL_ERROR always a Databricks bug?

No. The majority of INTERNAL_ERROR occurrences trace back to cloud provider events: spot instance reclamation, VM host maintenance, or transient networking faults. Databricks wraps these as INTERNAL_ERROR because the failure happened outside user code. The cluster event log termination_code reveals the actual cause.

How do I automatically retry jobs that fail with INTERNAL_ERROR?

In the Databricks job definition, set max_retries to 2 or 3 under the task settings. If you orchestrate via Airflow or ADF, configure retries on the operator/activity level instead, so the orchestrator handles backoff and logging. Avoid infinite retries — if a regional outage is ongoing, retries will keep failing and consume cluster hours.
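
When neither the job definition nor an orchestrator handles retries, the same bounded policy can be sketched in plain Python. Here `trigger_run` is a hypothetical callable that starts a run, waits for it to finish, and returns its error code (an empty string on success); the 5-minute default delay mirrors the interval suggested for ADF and Airflow.

```python
import time
from typing import Callable


def run_with_retries(
    trigger_run: Callable[[], str],
    max_retries: int = 2,
    delay_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Re-trigger a run only while it fails with INTERNAL_ERROR.

    Stops after max_retries extra attempts so an ongoing outage cannot
    cause endless retries. Other error codes are returned immediately.
    """
    attempts = 0
    while True:
        error_code = trigger_run()
        if error_code != "INTERNAL_ERROR" or attempts >= max_retries:
            return error_code
        attempts += 1
        sleep(delay_seconds)


# Simulated run: two platform faults, then success. Sleep is stubbed out.
outcomes = iter(["INTERNAL_ERROR", "INTERNAL_ERROR", ""])
result = run_with_retries(lambda: next(outcomes), sleep=lambda _seconds: None)
print(repr(result))
```

Retrying only on INTERNAL_ERROR keeps genuine code or data failures from being masked by repeated attempts.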

Can INTERNAL_ERROR corrupt my data or leave partial writes?

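If your job uses Delta Lake, each individual write is safe: Delta commits are atomic, so a failed write leaves the table in its last consistent state. A job that performs several writes can still stop partway, with earlier commits applied and later ones missing. If you write to non-Delta targets (CSV, Parquet overwrites, external databases), a partial write is possible. To confirm whether the last transaction committed, inspect _delta_log for Delta tables, or compare the target row count against the expected value for other targets.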
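
For a Delta table, one way to see the last committed transaction is to read the newest commit file in `_delta_log`. The sketch below assumes the table directory is reachable through the local filesystem (for example a mounted path); on object storage the listing step would need the provider's client instead. `last_delta_commit` is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import Optional


def last_delta_commit(table_path: str) -> Optional[dict]:
    """Return version, timestamp and operation of the newest Delta commit."""
    log_dir = Path(table_path) / "_delta_log"
    if not log_dir.is_dir():
        return None
    # Commit files are zero-padded version numbers with a .json suffix,
    # so lexicographic order matches version order.
    commits = sorted(p for p in log_dir.glob("*.json") if p.stem.isdigit())
    if not commits:
        return None
    latest = commits[-1]
    result = {"version": int(latest.stem), "timestamp": None, "operation": None}
    # Each line of a commit file is one JSON action.
    for raw in latest.read_text().splitlines():
        if not raw.strip():
            continue
        action = json.loads(raw)
        if "commitInfo" in action:
            info = action["commitInfo"]
            result["timestamp"] = info.get("timestamp")
            result["operation"] = info.get("operation")
            break
    return result
```

Comparing the returned timestamp with the failed run's start time indicates whether that run committed anything before it was terminated.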

Source · docs.databricks.com/api/workspace/jobs/getrun
