Severity: High (resource error)

Databricks Model Serving Error:
MODEL_SERVING_RATE_LIMIT_EXCEEDED

What does this error mean?

Too many requests were sent to a Databricks Model Serving endpoint in a given time window, exceeding either the endpoint's provisioned concurrency or the workspace-level parallel request quota.

Common causes

  • A batch inference job sends too many concurrent requests without throttling logic
  • A traffic spike from multiple upstream jobs all calling the same endpoint at schedule time
  • Provisioned concurrency is set too low relative to actual peak traffic
  • The workspace has hit its maximum parallel request limit and requires a quota increase
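The first cause above, unthrottled batch calls, can be mitigated client-side by capping the number of in-flight requests below the endpoint's provisioned concurrency. A minimal sketch, assuming a `send_request` callable that posts one record to the endpoint (the callable and the concurrency cap here are illustrative placeholders, not a Databricks API):

```python
import concurrent.futures

def score_batch(records, send_request, max_in_flight=4):
    """Score records while capping simultaneous in-flight requests.

    `send_request` is a placeholder for whatever function posts one
    record to the serving endpoint and returns its prediction.
    """
    results = [None] * len(records)
    # The thread pool's max_workers acts as the throttle: at most
    # max_in_flight requests are ever in flight at once.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        futures = {pool.submit(send_request, rec): i
                   for i, rec in enumerate(records)}
        for fut in concurrent.futures.as_completed(futures):
            results[futures[fut]] = fut.result()
    return results
```

Keeping `max_in_flight` at or below the endpoint's provisioned concurrency leaves headroom for other callers sharing the same endpoint.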

How to fix it

  1. Check the serving endpoint's Traffic and Metrics tab in the Databricks UI to confirm the concurrency spike timing.
  2. Enable autoscaling on the endpoint to let it scale up provisioned concurrency automatically during spikes.
  3. Add retry logic with exponential backoff in the calling application to handle transient 429 responses gracefully.
  4. If the workspace parallel request limit is hit (not just endpoint concurrency), contact Databricks support to increase the quota.
  5. For batch scoring, switch to a Databricks job with parallel tasks instead of calling the REST endpoint serially.
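The retry logic from step 3 can be sketched as exponential backoff with jitter. Names like `RateLimitError` and `send_request` below are illustrative assumptions; map them onto however your HTTP client surfaces a 429 response:

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder raised by the caller when the endpoint returns HTTP 429."""

def call_with_backoff(send_request, max_retries=5, base_delay=1.0, max_delay=30.0):
    """Retry send_request() on rate-limit errors with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return send_request()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the 429 to the caller
            # Double the wait on each attempt, capped at max_delay, with
            # random jitter so parallel workers do not retry in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))
```

The jitter matters: without it, many workers that were throttled together will retry together and trip the limit again.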

Frequently asked questions

What is the difference between the endpoint concurrency limit and the workspace parallel request limit?

Endpoint concurrency is the number of simultaneous in-flight requests the specific endpoint can handle; it can be increased by raising provisioned concurrency or enabling autoscaling. The workspace limit is an account-level cap on all concurrent requests across all endpoints; only Databricks support can raise it.

Is autoscaling on model serving endpoints always recommended?

For production endpoints with variable traffic, yes. Autoscaling scales concurrency up during peaks and back down during idle periods. For strictly cost-sensitive or latency-sensitive workloads, fixed provisioned concurrency with capacity planning may be preferable.
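As a rough illustration, autoscaling behavior is controlled per served entity in the endpoint's configuration payload. The entity name, version, and workload size below are placeholders, and the exact schema should be checked against the Databricks serving API documentation:

```json
{
  "served_entities": [
    {
      "entity_name": "my_catalog.my_schema.my_model",
      "entity_version": "3",
      "workload_size": "Small",
      "scale_to_zero_enabled": true
    }
  ]
}
```

`scale_to_zero_enabled` trades cold-start latency for cost during idle periods, which is the same trade-off described above.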
