Low severity · infrastructure · Databricks

Databricks Error: SERVICE_UNAVAILABLE

Impact

A single 503 is self-healing with retry logic. Escalate when 503s persist for more than 15–30 minutes: every scheduled job in the affected region will miss its trigger window, downstream tables go stale, and any SLA on a morning refresh will likely breach.

When the trigger call fails with a 503, no Databricks run is created. The orchestrating layer (ADF, Airflow, dbt Cloud) marks the step as failed and, depending on retry configuration, either retries or halts the DAG. Downstream tables that depend on this job will not be refreshed for the affected schedule window. If the job runs hourly and the outage lasts 45 minutes, at most one scheduled trigger falls inside the outage, but consumers can still be reading data roughly two hours old by the time the next scheduled run completes. Any SLA dashboard or alerting that reads from those tables will reflect stale data until the next successful run completes.
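To estimate how many scheduled triggers an outage swallows, the sketch below counts the trigger times that land inside an outage window. It assumes a fixed-interval schedule and that no retry succeeds during the outage; the function name `missed_triggers` and the dates are illustrative only.

```python
from datetime import datetime, timedelta


def missed_triggers(first_trigger, interval, outage_start, outage_end):
    """Return scheduled trigger times that fall inside [outage_start, outage_end)."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    missed = []
    trigger = first_trigger
    while trigger < outage_end:
        if trigger >= outage_start:
            missed.append(trigger)
        trigger += interval
    return missed


# Illustrative numbers only: a 15-minute schedule and a 45-minute outage.
day = datetime(2024, 1, 1)
missed = missed_triggers(
    first_trigger=day,
    interval=timedelta(minutes=15),
    outage_start=day + timedelta(minutes=10),
    outage_end=day + timedelta(minutes=55),
)
print(len(missed))  # 3 (the 00:15, 00:30 and 00:45 triggers)
```

The same function can be run against your real schedule and the incident window from the status page to decide which runs need manual re-triggering.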

What does this error mean?

The Databricks REST API returned HTTP 503 Service Unavailable, meaning the control plane rejected the request before any work was executed. In a data pipeline this surfaces as a failed job trigger, a stalled orchestration step, or a dropped API call from tools like dbt, Airflow, or Azure Data Factory. The engineer sees no run record in the Jobs UI because the call never reached the execution layer. The cause is almost never the request itself: it is typically a regional outage, control-plane overload, or rolling maintenance, and occasionally a proxy or load balancer between the caller and Databricks. The request must be retried; the payload itself is valid and does not need to change.

Common causes

1. Regional control-plane outage: Databricks operates separate control planes per cloud region (e.g. us-east-1, westeurope). A zone-level failure or deployment rollout can make the REST API unavailable for minutes to tens of minutes for all workspaces in that region, while other regions are unaffected.
2. Request shedding under heavy load: when the control plane is saturated (for example during a large batch of concurrent job launches or a spike in Repos sync traffic), it begins returning 503 to shed load rather than queuing indefinitely. This is distinct from rate-limiting (429) and does not carry a Retry-After header.
3. Intermediate network layer failure: a load balancer, WAF, or corporate egress proxy between the caller and *.azuredatabricks.net or *.cloud.databricks.com can itself return a 503 before the request reaches Databricks. This is common in locked-down enterprise networks with SSL inspection appliances.
4. Scheduled maintenance window: Databricks occasionally performs planned maintenance on workspace infrastructure. During these windows specific API endpoints (especially Clusters and Jobs) may return 503 for a bounded period. Databricks publishes these on status.databricks.com with advance notice.
5. Workspace migration or upgrade: when a workspace is being migrated to a new control-plane version or underlying cloud infrastructure, API calls may fail with 503 for the duration of the migration. The workspace UI shows a maintenance banner in this state.
6. Databricks Workflows or Jobs API internal dependency failure: the Jobs API depends on internal services (metadata store, scheduler). If one of those internal services is degraded, only Jobs API endpoints return 503 while other endpoints (e.g. DBFS, SQL Warehouses) remain healthy.

How to fix it

1. Check status.databricks.com and filter by your cloud provider and region. If an incident is active, note the estimated resolution time and hold retries until the incident is resolved; retrying into an active outage wastes quota and increases load.
2. Confirm the error is truly a platform 503 and not a proxy or network 503. Run `curl -v -H 'Authorization: Bearer <token>' https://<workspace-host>/api/2.1/clusters/list` from the same network as your pipeline. If the response comes from a proxy (check the Server header), investigate your egress layer.
3. Implement exponential back-off with jitter if you are making direct HTTP calls. Start at 1s, double on each attempt, cap at 60s, add ±20% jitter, and stop after 5 retries. Example Python snippet, with `attempt` counted from 0: `wait = min(60, (2 ** attempt)) * (0.8 + 0.4 * random.random())`.
4. If you use the Databricks Python SDK, rely on its built-in retry instead of writing your own. The SDK retries 503s automatically with back-off; the total retry window is controlled by `retry_timeout_seconds` in the SDK config, for example `WorkspaceClient(config=Config(retry_timeout_seconds=300))` with `Config` imported from `databricks.sdk.core`.
5. For Airflow or ADF pipelines, configure the operator/activity retry policy. In Airflow set `retries=3` and `retry_delay=timedelta(minutes=2)` on the DatabricksRunNowOperator. In ADF set the activity retry count to 3 and retry interval to 120 seconds.
6. If the 503 persists beyond 30 minutes and no incident is posted on the status page, open a Databricks support ticket. Include the workspace ID, the UTC timestamp range, the full HTTP response including headers, and the request ID from the `x-databricks-request-id` response header if available.
7. After recovery, validate that all jobs that were mid-trigger actually failed cleanly: check the Jobs UI for runs in SKIPPED or INTERNAL_ERROR state that may need manual re-triggering to prevent gaps in downstream tables.
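For direct HTTP callers, the back-off step above can be sketched as follows. This is a minimal illustration, not a Databricks-provided helper: `backoff_delay` and `call_with_retries` are hypothetical names, and `send` stands in for whatever function issues your request and returns the HTTP status code. The `rng` and `sleep` parameters exist only so the logic can be tested without waiting.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, jitter=0.2, rng=random.random):
    """Delay in seconds before the retry that follows attempt number `attempt` (0-based)."""
    nominal = min(cap, base * (2 ** attempt))
    # rng() is in [0, 1), so the factor spans [1 - jitter, 1 + jitter).
    return nominal * (1 - jitter + 2 * jitter * rng())


def call_with_retries(send, max_retries=5, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or retries run out.

    `send` is a hypothetical zero-argument callable returning an HTTP status code.
    Only 503 and 429 are treated as retryable.
    """
    for attempt in range(max_retries + 1):
        status = send()
        if status not in (429, 503) or attempt == max_retries:
            return status
        sleep(backoff_delay(attempt))
```

With the defaults, the nominal waits are 1, 2, 4, 8 and 16 seconds, so five retries add about half a minute before the caller gives up and surfaces the 503 to the orchestrator.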

Example log output

HTTPError: 503 Server Error: Service Unavailable for url: https://adb-1234567890.12.azuredatabricks.net/api/2.1/jobs/run-now
Response body: {"error_code":"SERVICE_UNAVAILABLE","message":"Service temporarily unavailable. Please retry your request."}
Request ID: 01234567-abcd-ef01-2345-6789abcdef01
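The response shape above also gives a rough way to tell a platform 503 from one generated by a proxy or load balancer. The sketch below is a heuristic only: it assumes a genuine Databricks 503 carries either an `x-databricks-request-id` header or a JSON body with `error_code` set to `SERVICE_UNAVAILABLE`, as in the sample, and that a proxy error page has neither. The function name is hypothetical.

```python
import json


def looks_like_databricks_503(status_code, headers, body_text):
    """Heuristic: True if a 503 response appears to come from Databricks itself."""
    if status_code != 503:
        return False
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    if "x-databricks-request-id" in lowered:
        return True
    try:
        payload = json.loads(body_text)
    except (ValueError, TypeError):
        # HTML error pages from proxies usually do not parse as JSON.
        return False
    return isinstance(payload, dict) and payload.get("error_code") == "SERVICE_UNAVAILABLE"
```

A response that fails this check is a hint to inspect the egress layer before opening a support ticket; it is not proof either way.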

Frequently asked questions

Databricks service unavailable fix — what is the fastest recovery path?

Check status.databricks.com first. If an incident is active, wait for resolution — retrying into an outage does not help and may consume API quota. If no incident is listed, retry immediately with back-off; the 503 is likely transient load shedding and typically resolves within 1–3 retries.

Databricks service unavailable retry — how many retries and with what interval?

The Databricks Python SDK defaults to a total retry window of 300 seconds, backing off between attempts. For custom HTTP clients, use 5 retries, starting at 1s, doubling each time, capped at 60s, with ±20% jitter. Retry only 503 and 429; do not retry other 4xx errors.

How do I distinguish a 503 from a Databricks job failure in my alerting?

A job failure produces a run record with a terminal state (FAILED, INTERNAL_ERROR) and a specific error message in run_state.state_message. A 503 occurs before any run is created — there is no run_id. In your orchestrator logs you will see an HTTP exception, not a Databricks job state transition.
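In alerting code, that distinction can be expressed as a small classifier. This is a sketch under stated assumptions: the `event` dictionary and its keys (`run_id`, `http_status`, `result_state`) are a hypothetical shape for whatever your orchestrator logs, not a Databricks API object.

```python
def classify_failure(event):
    """Label an orchestrator failure event as a trigger failure or a job failure."""
    if event.get("run_id") is None:
        # No run was created, so the failure happened at the API call itself.
        if event.get("http_status") == 503:
            return "trigger_unavailable"
        return "trigger_failed"
    if event.get("result_state") in ("FAILED", "INTERNAL_ERROR"):
        return "job_failed"
    return "other"
```

Routing `trigger_unavailable` to a platform-health channel and `job_failed` to the owning team keeps outage noise separate from genuine job bugs.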

Can a 503 from Databricks cause silent data loss in a pipeline?

Yes, if your orchestrator does not retry and does not alert on trigger failures. An ADF pipeline activity that fails with a 503 and has no retry policy will mark the pipeline run as Failed, but if downstream activities have dependency conditions set to 'Succeeded only', they are skipped silently. Always set retry policies on Databricks activities and alert on pipeline run failures, not just job failures.

Source · docs.databricks.com/aws/en/error-messages/error-classes.html