Databricks Error:
SERVICE_UNAVAILABLE
What does this error mean?
The Databricks REST API returned HTTP 503 Service Unavailable: the request was rejected before any work was executed. In a data pipeline this surfaces as a failed job trigger, a stalled orchestration step, or a dropped API call from tools like dbt, Airflow, or Azure Data Factory. The Jobs UI typically shows no run record, because the call never reached the execution layer. The cause lies outside the request itself: a regional outage, control-plane overload, rolling maintenance, or an intermediary such as a proxy or load balancer answering in place of Databricks. The payload is valid and does not need to change; the request only needs to be retried once the service is reachable again.
Common causes
- 1. Regional control-plane outage: Databricks operates separate control planes per cloud region (e.g. us-east-1, westeurope). A zone-level failure or deployment rollout can make the REST API unavailable for minutes to tens of minutes for all workspaces in that region, while other regions are unaffected.
- 2. Request shedding under heavy load: when the control plane is saturated (for example during a large batch of concurrent job launches or a spike in Repos sync traffic), it begins returning 503 to shed load rather than queuing indefinitely. This is distinct from rate-limiting (429), and the response may not carry a Retry-After header.
- 3. Intermediate network layer failure: a load balancer, WAF, or corporate egress proxy between the caller and *.azuredatabricks.net or *.cloud.databricks.com can itself return a 503 before the request reaches Databricks. This is common in locked-down enterprise networks with SSL inspection appliances.
- 4. Scheduled maintenance window: Databricks occasionally performs planned maintenance on workspace infrastructure. During these windows specific API endpoints (especially Clusters and Jobs) may return 503 for a bounded period. Databricks publishes these on status.databricks.com with advance notice.
- 5. Workspace migration or upgrade: when a workspace is being migrated to a new control-plane version or underlying cloud infrastructure, API calls may fail with 503 for the duration of the migration. The workspace UI shows a maintenance banner in this state.
- 6. Databricks Workflows or Jobs API internal dependency failure: the Jobs API depends on internal services (metadata store, scheduler). If one of those internal services is degraded, only Jobs API endpoints return 503 while other endpoints (e.g. DBFS, SQL Warehouses) remain healthy.
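The intermediate network layer cause can often be told apart from the others by looking at the response body. The sketch below is a heuristic only: it assumes that a 503 coming from Databricks carries a JSON body with `error_code` set to `SERVICE_UNAVAILABLE` (the shape shown in the example log output), whereas proxies and load balancers usually answer with HTML or plain text. The function name `looks_like_databricks_503` is hypothetical.

```python
import json


def looks_like_databricks_503(status_code: int, body: str) -> bool:
    """Heuristic: True if a 503 body matches the Databricks JSON error shape.

    Assumption: a platform-originated 503 returns JSON containing
    "error_code": "SERVICE_UNAVAILABLE"; anything else is treated as
    coming from an intermediary (proxy, WAF, load balancer).
    """
    if status_code != 503:
        return False
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        # Not JSON at all: typical of proxy or load balancer error pages.
        return False
    return isinstance(payload, dict) and payload.get("error_code") == "SERVICE_UNAVAILABLE"
```

A `False` result for a 503 points at the egress path rather than the platform, so the network team is the better first contact than Databricks support.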
How to fix it
- Step 1: Check status.databricks.com and filter by your cloud provider and region. If an incident is active, note the estimated resolution time and hold retries until the incident is resolved; retrying into an active outage wastes quota and increases load.
- Step 2: Confirm the error is truly a platform 503 and not a proxy or network 503. Run `curl -v -H 'Authorization: Bearer <token>' https://<workspace-host>/api/2.1/clusters/list` from the same network as your pipeline. If the response comes from a proxy (check the Server header), investigate your egress layer.
- Step 3: Implement exponential back-off with jitter if you are making direct HTTP calls. Start at 1s, double on each attempt, cap the base delay at 60s, add ±20% jitter, and stop after 5 retries. Example Python snippet: `wait = min(60, (2 ** attempt)) * (0.8 + 0.4 * random.random())`.
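- Step 4: If you use the Databricks Python SDK, rely on its built-in retry: the SDK retries 503s automatically with back-off. The total retry budget is controlled by `retry_timeout_seconds`, which is set on the SDK config rather than passed directly to the client: `WorkspaceClient(config=Config(retry_timeout_seconds=300))`, with `Config` imported from `databricks.sdk.core`.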
- Step 5: For Airflow or ADF pipelines, configure the operator/activity retry policy. In Airflow set `retries=3` and `retry_delay=timedelta(minutes=2)` on the DatabricksRunNowOperator. In ADF set the activity retry count to 3 and retry interval to 120 seconds.
- Step 6: If the 503 persists beyond 30 minutes and no incident is posted on the status page, open a Databricks support ticket. Include the workspace ID, the UTC timestamp range, the full HTTP response including headers, and the request ID from the `x-databricks-request-id` response header if available.
- Step 7: After recovery, validate that all jobs that were mid-trigger actually failed cleanly. Check the Jobs UI for runs in SKIPPED or INTERNAL_ERROR state that may need manual re-triggering to prevent gaps in downstream tables.
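The back-off policy from Step 3 can be sketched as a small retry wrapper. This is a minimal illustration using only the standard library, not a Databricks API: `ServiceUnavailable`, `backoff_delay`, and `call_with_retry` are hypothetical names, and the caller is assumed to raise `ServiceUnavailable` whenever its HTTP client sees a 503.

```python
import random
import time


class ServiceUnavailable(Exception):
    """Hypothetical exception the caller raises on an HTTP 503 response."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0, jitter: float = 0.2) -> float:
    """Delay before retry `attempt` (0-based): base * 2**attempt, capped, then +/-20% jitter."""
    capped = min(cap, base * (2 ** attempt))
    return capped * random.uniform(1 - jitter, 1 + jitter)


def call_with_retry(request, max_retries: int = 5, sleep=time.sleep):
    """Call `request()`; on ServiceUnavailable retry up to `max_retries` times, then re-raise."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except ServiceUnavailable:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the error to the orchestrator
            sleep(backoff_delay(attempt))
```

Only the 503 case is retried here; other errors propagate immediately, which keeps a malformed request from being retried as if it were a transient outage. The `sleep` parameter is injectable so the wrapper can be exercised in tests without waiting.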
Example log output
HTTPError: 503 Server Error: Service Unavailable for url: https://adb-1234567890.12.azuredatabricks.net/api/2.1/jobs/run-now
Response body: {"error_code":"SERVICE_UNAVAILABLE","message":"Service temporarily unavailable. Please retry your request."}
Request ID: 01234567-abcd-ef01-2345-6789abcdef01
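
When collecting details for a support ticket, the fields in a log like the one above can be pulled out programmatically. The sketch below assumes the log has exactly the three-line shape shown here (`for url:`, `Response body:`, `Request ID:`); other clients format their errors differently, and `parse_503_log` is a hypothetical helper name.

```python
import json
import re


def parse_503_log(log_text: str) -> dict:
    """Extract URL, error_code, message and request ID from a log shaped like the sample."""
    fields = {}
    url = re.search(r"for url: (\S+)", log_text)
    if url:
        fields["url"] = url.group(1)
    body = re.search(r"Response body: (\{.*\})", log_text)
    if body:
        try:
            payload = json.loads(body.group(1))
            fields["error_code"] = payload.get("error_code")
            fields["message"] = payload.get("message")
        except ValueError:
            pass  # body was not valid JSON; leave those fields out
    request_id = re.search(r"Request ID: (\S+)", log_text)
    if request_id:
        fields["request_id"] = request_id.group(1)
    return fields
```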