High severity · job · Databricks

Databricks Error: INTERNAL_ERROR

Impact

When INTERNAL_ERROR hits multiple unrelated jobs within the same hour, your entire pipeline is likely stalled by a regional Databricks or cloud provider incident. Downstream tables go stale, scheduled refreshes pile up, and SLA clocks start ticking.

INTERNAL_ERROR aborts the entire job run, so all downstream tasks in a multi-task job are skipped. In orchestrated pipelines (ADF, Airflow, dbt Cloud), the Databricks step fails and blocks dependent activities. Delta tables targeted by the failed job retain their last committed state, but downstream models and dashboards that depend on fresh data go stale. Without retry policies, a single INTERNAL_ERROR during a nightly load can delay morning dashboards by hours. If the job writes to a staging table that a downstream MERGE depends on, the merge runs against outdated data or skips entirely.
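
The "multiple unrelated jobs within the same hour" signal described above can be turned into a simple check. The sketch below is only an illustration: the function name, the input shape (pairs of job ID and failure time), and the threshold of three distinct jobs are all assumptions, not anything Databricks defines.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def looks_like_regional_incident(
    failures: Iterable[Tuple[str, datetime]],
    window: timedelta = timedelta(hours=1),
    min_jobs: int = 3,  # hypothetical threshold; tune for your workspace
) -> bool:
    """Return True if at least `min_jobs` distinct jobs failed within one `window`.

    `failures` holds (job_id, failure_time) pairs for runs that ended
    with INTERNAL_ERROR.
    """
    events = sorted(failures, key=lambda item: item[1])
    start = 0
    for end in range(len(events)):
        # Shrink the window from the left until it spans at most `window`.
        while events[end][1] - events[start][1] > window:
            start += 1
        distinct_jobs = {job_id for job_id, _ in events[start : end + 1]}
        if len(distinct_jobs) >= min_jobs:
            return True
    return False
```

Counting distinct job IDs rather than raw failures keeps a single flapping job from looking like an outage.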

What does this error mean?

INTERNAL_ERROR means the Databricks control plane or the underlying cloud infrastructure failed during job execution; your Spark code, notebook logic, or SQL query is usually not the cause. In the Jobs API, the run's state object reports INTERNAL_ERROR (as the life_cycle_state, typically alongside result_state: FAILED) when the platform cannot complete the run because of an infrastructure-level fault. In a data pipeline context (ADF, Airflow, or dbt Cloud triggering Databricks jobs), this surfaces as an unexpected task failure with no user-facing stack trace. The cluster event log typically contains a more specific termination code (COMMUNICATION_LOST, CLOUD_PROVIDER_SHUTDOWN, etc.) that reveals the actual root cause. Engineers usually see this during peak hours or when spot/preemptible instances are reclaimed.
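
A minimal sketch of how a monitoring script might recognize this condition in a run payload already fetched from the Jobs API. It assumes the payload exposes `state.life_cycle_state` and, on newer API versions, `status.termination_details.code`; verify both field names against the API version your workspace uses. The function name is hypothetical.

```python
from typing import Any, Mapping


def is_internal_error(run: Mapping[str, Any]) -> bool:
    """Return True if a Jobs API run payload looks like an INTERNAL_ERROR run.

    Assumed field names: `state.life_cycle_state` and
    `status.termination_details.code`. Missing fields are treated as
    "not an internal error" rather than raising.
    """
    state = run.get("state") or {}
    status = run.get("status") or {}
    termination = status.get("termination_details") or {}
    return (
        state.get("life_cycle_state") == "INTERNAL_ERROR"
        or termination.get("code") == "INTERNAL_ERROR"
    )
```

Because the function only inspects a dictionary, it works the same on output from the CLI, the REST API, or an SDK call converted to a dict.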

Common causes

1. Transient Databricks control plane issue: the Jobs API or cluster manager was temporarily unavailable, causing the run to abort without executing user code.
2. Cloud provider reclaimed the underlying VM. On AWS this means a spot instance interruption; on Azure a low-priority VM eviction; on GCP a preemptible VM shutdown. The cluster terminates with CLOUD_PROVIDER_SHUTDOWN or SPOT_INSTANCE_TERMINATION.
3. Network connectivity loss between the Spark driver and the Databricks control plane. The loss typically lasts 5-15 minutes during cloud provider networking events, and the cluster event log shows termination code COMMUNICATION_LOST.
4. The cluster auto-scaled and the new nodes failed to join within the timeout (typically 10 minutes). This cascades into an INTERNAL_ERROR when the driver cannot redistribute tasks.
5. A Databricks regional outage or degraded service in your workspace region. Concurrent failures across multiple unrelated jobs are the telltale sign.
6. Instance pool exhaustion: the job requested a cluster from a pool with no available instances, and the pool could not provision new VMs within the cloud provider quota.
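
For triage scripts, the termination codes named in the list above can be mapped back to the cause they point at. This is a sketch only: the lookup covers just the three codes mentioned in this article, and the names `CAUSE_BY_TERMINATION_CODE` and `likely_cause` are hypothetical.

```python
from typing import Dict

# Only the termination codes named in this article; real event logs may
# contain other codes, which fall through to the "unclassified" branch.
CAUSE_BY_TERMINATION_CODE: Dict[str, str] = {
    "CLOUD_PROVIDER_SHUTDOWN": "cloud provider reclaimed the VM",
    "SPOT_INSTANCE_TERMINATION": "cloud provider reclaimed the VM",
    "COMMUNICATION_LOST": "network loss between driver and control plane",
}


def likely_cause(termination_code: str) -> str:
    """Translate a cluster termination code into a short cause description."""
    normalized = termination_code.strip().upper()
    return CAUSE_BY_TERMINATION_CODE.get(
        normalized, "unclassified: check the status page and event log"
    )
```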

How to fix it

1. Check the Databricks status page (https://status.databricks.com) for active or recent incidents in your workspace region. Cross-reference the timestamp of your failed run with any listed incidents.
2. Open the failed run in the Databricks UI: Jobs → select job → click the failed run. Under 'Cluster', click the cluster link and go to the 'Event Log' tab. Look for the termination_code field, which gives the actual root cause (e.g. COMMUNICATION_LOST, SPOT_INSTANCE_TERMINATION, CLOUD_PROVIDER_SHUTDOWN).
3. Retry the run. In the UI, 'Repair Run' re-runs only the failed and skipped tasks; from the CLI, `databricks jobs run-now --job-id <JOB_ID>` starts a fresh run of the whole job (newer CLI versions take the ID as a positional argument: `databricks jobs run-now <JOB_ID>`). INTERNAL_ERROR resolves on retry in roughly 80% of cases because the underlying infrastructure fault is transient.
4. If spot/preemptible instances caused the failure, switch the job cluster to on-demand instances for critical pipelines: in the job cluster config, set `aws_attributes.availability` to `ON_DEMAND` (AWS), `azure_attributes.availability` to `ON_DEMAND_AZURE` (Azure), or `gcp_attributes.availability` to `ON_DEMAND_GCP` (GCP).
5. For recurring INTERNAL_ERROR on auto-scaling clusters, set a fixed cluster size or reduce max_workers to avoid hitting cloud provider quota limits. Check your cloud provider's VM quota: AWS → EC2 → Limits, Azure → Subscription → Usage + quotas, GCP → IAM & Admin → Quotas.
6. If orchestrating via ADF or Airflow, add a retry policy to the Databricks activity/operator. In ADF: set retry count to 2 with 5-minute intervals on the Databricks Notebook activity. In Airflow: `retries=2, retry_delay=timedelta(minutes=5)` on the DatabricksRunNowOperator.
7. Contact Databricks support with the run_id, cluster_id, and termination_code if the error persists after 3+ retries. Use `databricks runs get --run-id <RUN_ID> | jq '.state'` to capture the full state object for the support ticket (newer CLI versions: `databricks jobs get-run <RUN_ID> | jq '.state'`).
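
The on-demand switch in step 4 can be applied programmatically before submitting a job definition. The sketch below patches a cluster spec held as a plain dictionary; the helper name `force_on_demand` and the `cloud` argument are hypothetical, and only the AWS and Azure attribute names and values quoted in step 4 are used.

```python
import copy
from typing import Any, Dict, Tuple

# Attribute block and on-demand value per cloud, as quoted in step 4.
_ON_DEMAND: Dict[str, Tuple[str, str]] = {
    "aws": ("aws_attributes", "ON_DEMAND"),
    "azure": ("azure_attributes", "ON_DEMAND_AZURE"),
}


def force_on_demand(new_cluster: Dict[str, Any], cloud: str) -> Dict[str, Any]:
    """Return a copy of a job cluster spec with availability set to on-demand.

    The input dictionary is left untouched so the original spec can still
    be used for non-critical jobs.
    """
    if cloud not in _ON_DEMAND:
        raise ValueError(f"unsupported cloud: {cloud!r}")
    block, value = _ON_DEMAND[cloud]
    patched = copy.deepcopy(new_cluster)
    # Create the attribute block if the spec did not define one.
    patched.setdefault(block, {})["availability"] = value
    return patched
```

Returning a copy rather than mutating in place makes it easy to keep a cheaper spot-based spec for non-critical pipelines alongside the on-demand one.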

Example log output

2026-05-11T08:22:14.003Z [ERROR] Run 748291 of job 5023 terminated with result_state=FAILED, error_code=INTERNAL_ERROR
2026-05-11T08:22:14.005Z [ERROR] Cluster 0511-081903-abc123 terminated. Reason: {"code":"COMMUNICATION_LOST","parameters":{"databricks_error_message":"The cluster lost contact with the Databricks control plane for more than 10 minutes."}}
2026-05-11T08:22:14.008Z [WARN] No retry attempted: max_retries=0 for task 'load_facts_daily'
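
If logs like these are collected centrally, the termination code can be pulled out automatically. The sketch below assumes the exact line layout shown in the example above ("Cluster <id> terminated. Reason: <JSON>"); other log formats would need a different pattern. The function name is hypothetical.

```python
import json
import re
from typing import Optional, Tuple

# Matches the cluster termination line from the example log output.
_TERMINATION_LINE = re.compile(
    r"Cluster (?P<cluster_id>\S+) terminated\. Reason: (?P<reason>\{.*\})\s*$"
)


def extract_termination(line: str) -> Optional[Tuple[str, str]]:
    """Return (cluster_id, termination_code) from a log line, or None.

    None is returned when the line is not a termination line, when the
    reason is not valid JSON, or when it carries no string "code" field.
    """
    match = _TERMINATION_LINE.search(line)
    if match is None:
        return None
    try:
        reason = json.loads(match.group("reason"))
    except json.JSONDecodeError:
        return None
    code = reason.get("code")
    if not isinstance(code, str):
        return None
    return match.group("cluster_id"), code
```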

Frequently asked questions

Should I report INTERNAL_ERROR to Databricks support?

Only if it recurs on the same job after 3+ retries, or if it hits multiple jobs simultaneously. Include the run_id, cluster_id, and the cluster termination code from the event log. Single occurrences are almost always transient and not worth a ticket.

Is INTERNAL_ERROR always a Databricks bug?

No. The majority of INTERNAL_ERROR occurrences trace back to cloud provider events: spot instance reclamation, VM host maintenance, or transient networking faults. Databricks wraps these as INTERNAL_ERROR because the failure happened outside user code. The cluster event log termination_code reveals the actual cause.

How do I automatically retry jobs that fail with INTERNAL_ERROR?

In the Databricks job definition, set max_retries to 2 or 3 under the task settings. If you orchestrate via Airflow or ADF, configure retries on the operator/activity level instead, so the orchestrator handles backoff and logging. Avoid infinite retries — if a regional outage is ongoing, retries will keep failing and consume cluster hours.
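
As a sketch of the job-definition approach, the helper below adds a bounded retry policy to every task in a job settings dictionary. It assumes the task-level fields `max_retries` and `min_retry_interval_millis` from the Jobs API; the helper name and the 5-minute default interval are illustrative choices, not recommendations from Databricks.

```python
import copy
from typing import Any, Dict


def with_task_retries(
    job_settings: Dict[str, Any],
    max_retries: int = 2,
    min_retry_interval_millis: int = 5 * 60 * 1000,  # 5 minutes, illustrative
) -> Dict[str, Any]:
    """Return a copy of job settings with a bounded retry policy on each task.

    Tasks that already define a retry field keep their own value.
    Negative values are rejected so the result can never mean
    "retry forever".
    """
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative (no infinite retries)")
    patched = copy.deepcopy(job_settings)
    for task in patched.get("tasks", []):
        task.setdefault("max_retries", max_retries)
        task.setdefault("min_retry_interval_millis", min_retry_interval_millis)
    return patched
```

Using `setdefault` means tasks with a deliberately different policy are not overwritten.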

Can INTERNAL_ERROR corrupt my data or leave partial writes?

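If your job writes to Delta Lake, not within a single table: each Delta transaction is atomic, so a failed write leaves that table in its last committed state. A job that writes to several tables can still stop after some commits and before others, leaving the tables out of step with each other. If you write to non-Delta targets (CSV, Parquet overwrites, external databases), a partial write is possible. Check the target table row count or _delta_log to confirm whether the last transaction committed.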
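
One way to inspect `_delta_log` is to read the newest commit file and look at its `commitInfo` action. The sketch below assumes the table directory is reachable as a local or mounted filesystem path and that commit files follow the usual zero-padded `<version>.json` naming; on cloud storage you would need a storage client instead, and running `DESCRIBE HISTORY` on the table gives the same information without touching files. The function name is hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def latest_commit(table_path: str) -> Optional[Dict[str, Any]]:
    """Return the version and commitInfo of the newest Delta commit, or None.

    Commit files are newline-delimited JSON; each line is one action.
    """
    log_dir = Path(table_path) / "_delta_log"
    if not log_dir.is_dir():
        return None
    # Zero-padded numeric names sort lexicographically in version order.
    commits = sorted(p for p in log_dir.glob("*.json") if p.stem.isdigit())
    if not commits:
        return None
    newest = commits[-1]
    commit_info: Dict[str, Any] = {}
    for raw in newest.read_text().splitlines():
        if not raw.strip():
            continue
        action = json.loads(raw)
        if "commitInfo" in action:
            commit_info = action["commitInfo"]
    return {"version": int(newest.stem), "commit_info": commit_info}
```

Comparing the returned commit timestamp with the failed run's start time shows whether the run committed anything before it was aborted.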

Source · docs.databricks.com/api/workspace/jobs/getrun