name: BentoML Rate Limits
description: >
  BentoCloud does not publish fixed platform-level API rate limits. Instead,
  concurrency and throughput are governed by per-deployment configuration.
  Each BentoML service deployment defines its own concurrency ceiling and
  scaling bounds. BentoCloud autoscales replicas to meet demand within the
  configured min/max replica range. An optional external request queue can
  buffer excess traffic to prevent overload. Specific platform quotas (API
  management calls, organization-level limits) are not publicly documented
  and may vary by plan tier; contact BentoML sales for enterprise quota
  details.
specificationVersion: "0.1"
url: https://docs.bentoml.com/en/latest/scale-with-bentocloud/scaling/autoscaling.html
limits:
  - name: Service Concurrency
    description: >
      Maximum number of simultaneous requests each replica handles before the
      autoscaler adds additional replicas. Configured per deployment via the
      `traffic.concurrency` setting in the @bentoml.service decorator.
    scope: per-replica
    configurable: true
    default: No hard default; recommended to set explicitly per workload
    unit: concurrent requests per replica
    reference: https://docs.bentoml.com/en/latest/scale-with-bentocloud/scaling/autoscaling.html
  - name: Minimum Replicas
    description: >
      Minimum number of running replicas for a deployment. Set to 0 to enable
      scale-to-zero. Configured via `scaling_min` in deployment settings or
      `--scaling-min` CLI flag.
    scope: per-deployment
    configurable: true
    default: 0 (scale-to-zero enabled by default)
    unit: replicas
    reference: https://docs.bentoml.com/en/latest/scale-with-bentocloud/deployment/configure-deployments.html
  - name: Maximum Replicas
    description: >
      Maximum number of replicas the autoscaler can provision for a
      deployment. Acts as a cost and resource ceiling. Configured via
      `scaling_max` in deployment settings or `--scaling-max` CLI flag.
    scope: per-deployment
    configurable: true
    default: Plan-dependent; contact BentoML for per-plan replica caps
    unit: replicas
    reference: https://docs.bentoml.com/en/latest/scale-with-bentocloud/deployment/configure-deployments.html
  - name: Request Timeout
    description: >
      Per-request timeout for inference endpoints. Configured via
      `traffic.timeout` in the @bentoml.service decorator.
    scope: per-service
    configurable: true
    default: 10 seconds (example default; may vary)
    unit: seconds
    reference: https://docs.bentoml.com/en/latest/build-with-bentoml/services.html
  - name: External Queue Buffering
    description: >
      Optional request queue that buffers excess traffic when concurrency is
      saturated, preventing request rejection at the cost of increased
      latency. Enabled via `traffic.external_queue: true`.
    scope: per-deployment
    configurable: true
    default: disabled
    reference: https://docs.bentoml.com/en/latest/scale-with-bentocloud/scaling/autoscaling.html
  - name: Autoscaler Stabilization Window
    description: >
      Configurable delay windows (0-3600 seconds) for scale-up and scale-down
      decisions, preventing reactive scaling to brief traffic spikes.
    scope: per-deployment
    configurable: true
    default: Platform default; configurable per deployment
    unit: seconds (0-3600)
    reference: https://docs.bentoml.com/en/latest/scale-with-bentocloud/scaling/autoscaling.html
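
# Worked example (a rough sketch, not taken from the BentoML documentation).
# It illustrates how the per-replica concurrency and the replica bounds listed
# in this file combine into an approximate throughput envelope. All numbers are
# hypothetical, chosen only for the arithmetic; they are not BentoCloud
# defaults. The sketch assumes the autoscaler keeps each replica close to its
# configured concurrency.
#
#   traffic.concurrency = 32     # hypothetical per-replica setting
#   scaling_min         = 0      # hypothetical; scale-to-zero allowed
#   scaling_max         = 4      # hypothetical replica ceiling
#
#   in-flight requests at full scale ~ 32 * 4 = 128
#   in-flight requests at minimum    ~ 32 * 0 = 0   (replicas start on demand)
#
# Under the same assumption, traffic beyond roughly 128 simultaneous requests
# can no longer be absorbed by adding replicas. If `traffic.external_queue` is
# enabled, the excess waits in the queue and latency rises; requests that wait
# longer than `traffic.timeout` can be expected to fail.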