--- title: Serverless Inference sha256: 84e33ccfdf6c109e052eceb1573f775b9f18c8ed4ec24aa3291bc0c4b8dca3f5 type: raw-article source: newsletter source_url: https://try.digitalocean.com/serverless-inference/ fetcher: jina ingested: 2026-05-15 --- Markdown Content: ![Image 1](https://try.digitalocean.com/serverless-inference/) ## The fastest way to run AI on DigitalOcean. Serverless inference provides a single OpenAI- and Anthropic-compatible API for all your workloads. No GPU provisioning, no infrastructure to manage — just tokens, latency, and output. It runs next to your databases, storage, networking, and agents on the AI-Native Cloud with no egress between layers. [Get started →](https://cloud.digitalocean.com/registrations/new)[Talk to sales](https://www.digitalocean.com/company/contact/sales) ![Image 2](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) 55+ curated models ![Image 3](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) OpenAI- and Anthropic-compatible ![Image 4](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) VPC + zero data retention by default Production AI runs on DigitalOcean ![Image 5](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/096391bd-images.svg) ![Image 6](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/f1a22120-images.svg) ![Image 7](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/62da3bf9-images.svg) ![Image 8](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/abbcd076-images.svg) ![Image 9](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/ab1bacac-images.svg) ![Image 10](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/dfdafca8-images.svg) ### Inference should feel simple. Most platforms make it anything but. #### Once AI reaches production traffic, teams stop worrying about models and start managing infrastructure, routing logic, scaling behavior, and vendor complexity. You pay for GPUs, not requests. Inference rarely runs at steady state. Traffic spikes, then disappears, but capacity stays provisioned. Even with autoscaling, you’re still paying for idle infrastructure built for peak demand. ![Image 11](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/b52ec30c-red-x-comparison.svg) Shipping one model call turns into an entire platform. What starts as a simple API request quickly expands into routing, retries, caching, observability, rate limits, and cost controls. By production, you’ve built an internal inference stack. ![Image 12](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/b52ec30c-red-x-comparison.svg) Model APIs look interchangeable until you try to switch them. Every provider has different schemas, latency profiles, safety layers, and pricing behavior. Switching isn’t swapping endpoints — it’s reworking routing, evaluation, and application logic. ![Image 13](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/b52ec30c-red-x-comparison.svg) ### Pay-per-token. Off-peak when you can wait. Pay per token. No GPU contracts. No minimums. Forecasting your inference cost should look like forecasting your AWS bill. Batch at ~50% of real-time. Off-peak dynamic pricing on Mini Max M2.5 and Kimi K2.5 today, expanding. ![Image 14](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/8fa6dcbc-i1.svg) ![Image 15](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) $1M+ customer ARR up 179% YoY in Q1 2026. ![Image 16](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) >80% of AI customer ARR now from inference + core cloud, not bare metal. ![Image 17](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) Scale-to-zero on Serverless. Reserved capacity on Dedicated when you graduate. PREDICTABLE AI ECONOMICS If it can’t take real traffic, it doesn’t count. Independently ranked, custom-kernel optimized, 55+ models behind one API. VPC, zero data retention, platform guardrails, and built-in observability ship as defaults — not enterprise add-ons. ![Image 18](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/cd62ce54-i2.svg) ![Image 19](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) #1 by Artificial Analysis on output speed for DeepSeek V3.2 and Qwen 3.5 397B. ![Image 20](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) 230 tok/sec on DeepSeek V3.2 — 3.9× faster than AWS Bedrock. ![Image 21](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) 180M+ patient interactions — Hippocratic AI clinical calls/day at 400ms in production. PRODUCTION-GRADE BY DEFAULT Bring your model. Keep your stack open. Open-weight out of the box: DeepSeek, Qwen, Llama, Mixtral, Phi, gpt-oss. LoRA on Serverless lands Q2; full BYOM on Dedicated today. No proprietary lock-in. ![Image 22](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/30c86cac-i3.svg) ![Image 23](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) Five integrated layers: compute, network, storage, data, AI — open at every one. ![Image 24](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) Messages API for Claude Code-compatible agentic workflows. ![Image 25](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) Drop-in OpenAI and Anthropic schemas. Migrate behind a feature flag, not a rewrite. OPEN AT EVERY LAYER Image, video, speech, vision-language. Same API, same bill. Stable Diffusion 3.5 for image. Wan 2.2 for video. Qwen3 TTS for speech. Nemotron and Kimi for vision-language. Plus the lifecycle around them routing, evals, observability — that wrappers don’t have. ![Image 26](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/d366c3f8-i4.svg) ![Image 27](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) Among inference-only competitors, only Together ships full image/video/audio. Fireworks has no video. Baseten, Groq, DeepInfra have no multimodal. ![Image 28](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) Platform content guardrails on image and video by default — not opt-in. ![Image 29](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) Native multimodal generation, not a stitched chain of vendor APIs. EVERY MODALITY, ONE PLATFORM [Get started →](https://cloud.digitalocean.com/registrations/new)[Talk to sales](https://www.digitalocean.com/company/contact/sales) ## From real-time agents to trillion-token workloads, leaders in AI run on DigitalOcean. ### DigitalOcean makes production inference simple. Nothing to provision. Nothing to size. Inference runs only when you call it. Capacity scales automatically based on demand, so you don’t manage GPUs or plan for peak traffic. ![Image 30](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) Scale-to-zero by default ![Image 31](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) Automatic handling of traffic spikes ![Image 32](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) Pay only for active inference 01 You see exactly how inference behaves. Serverless Inference includes observability and control primitives so you can understand and manage production workloads without adding external tooling. ![Image 33](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) Metrics for latency, tokens, errors, and spend ![Image 34](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) Request-level visibility across workloads ![Image 35](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) Built-in controls for rate limits and usage tracking 02 Models are interchangeable. Serverless Inference provides a unified API so teams can switch or experiment with models without changing application logic or rewriting integrations. ![Image 36](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) OpenAI- and Anthropic-compatible API ![Image 37](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) Consistent request and response format ![Image 38](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adc00d9c-do-yellow-check.svg) Swap models without code changes 03 NO CARD · FREE UNTIL YOU MAKE A CALL · CANCEL ANY TIME One rate card. 55+ models. No commits. Pay per token, billed by the second of generation. Off-peak dynamic serverless inference pricing on MiniMax M2.5 (Public Preview) and Kimi K2.5 today, expanding across the catalog soon. [See full price list →](https://docs.digitalocean.com/products/inference/details/pricing/) ![Image 39](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/77f687f3-pricing-updated-may-2026_10cw06x0cw06w000000028.png) ### Most inference stacks fall into one of four patterns. DigitalOcean brings them together. #### Teams today typically build on top of cloud platforms, inference APIs, GPU infrastructure, or direct model endpoints. Each approach solves part of the problem, but leaves gaps when moving to production scale. Cloud platforms with broad capability, more coordination required Large cloud providers offer extensive infrastructure and model access within a unified environment. Teams often benefit from breadth and enterprise features, but deployments can involve multiple services, configuration layers, and procurement steps. VS. HYPERSCALERS Lightweight inference access without full infrastructure ownership Some solutions provide streamlined access to models through a simple API layer. They reduce operational overhead but typically rely on external systems for storage, deployment, and production orchestration. VS. WRAPPERS Compute-first infrastructure for custom inference stacks GPU-focused providers offer raw compute resources for teams that want full control over their inference stack. This approach enables flexibility but requires assembling and maintaining the surrounding system components. VS. NEOCLOUDS Fastest path to model access with additional production layers needed Direct model endpoints provide immediate access to frontier models and are often used for initial development. Production applications usually require additional layers for routing, scaling, and operational management. VS. DIRECT APIS ![Image 40](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/adcf33b2-comparison-table-may-2026_10u00e6000000000000028.png) ### Numbers from teams already running on DigitalOcean. “In healthcare AI, a node going down isn’t just an SLA issue — it impacts patient experience. We’ve pressed DigitalOcean hard on reliability, access to the newest hardware, and the ability to scale efficiently. They’ve delivered.” Debajyoti Datta Co-Founder, Hippocratic AI ![Image 41](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/ab1bacac-images.svg) ![Image 42](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/26d0ee95-1677269077236_101o01o01o01n00000001o.jpg) “Serverless Inference is fantastic because we can make as many calls as we need without worrying about provisioning infrastructure. It just scales automatically.” Carlo Ruiz Infrastructure Engineer, Traversal ![Image 43](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/62da3bf9-images.svg) ![Image 44](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/352a84c6-carlo-ruiz-traversal_101o01n00000000000001o.jpeg) ### Three steps and you’re making API calls. Create a key. Sign up with email or GitHub. Open the console and generate an API key in one click. Keys are scoped and can be rotated at any time. 01 ![Image 45](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/2b3d7f8f-step-1_107s024000000000000028.png) Watch tokens, not nodes. Track tokens, latency, errors, and spend directly in the console. Set usage limits and alerts in a few clicks. 03 ![Image 46](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/99690edf-step-3_107s023000000000000028.png) Point your code. Update your OpenAI- or Anthropic-compatible SDK to point to DigitalOcean. Swap the model name to a supported Serverless Inference model and start making requests. 02 ![Image 47](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/fd13f242-step-2_107s02k000000000000028.png) [Get started →](https://cloud.digitalocean.com/registrations/new)[Talk to sales](https://www.digitalocean.com/company/contact/sales) ### A few things teams typically want to know. Is there a free tier? Yes. You can sign up without a credit card and start testing the API immediately. Pay-per-token pricing applies once you exceed the included usage. Learn how to get started [here](https://docs.digitalocean.com/products/inference/details/pricing/#serverless-inference-pricing). Q · 01 How do I migrate from OpenAI or Anthropic? Update your base URL to the DigitalOcean endpoint and select a supported model. The OpenAI- and Anthropic-compatible SDKs continue to work, along with popular frameworks like LangChain and LlamaIndex. Most teams validate with a small workload before fully switching. Q · 02 What happens when a frontier provider goes down? Inference Router (Public Preview) can route requests across models based on policy, including fallback behavior when a provider is unavailable. This can help maintain continuity when individual model APIs experience disruptions. Q · 03 When should I use Serverless vs. Dedicated? Serverless is the starting point for most workloads — it’s serverless, usage-based, and scales automatically. Dedicated Inference is designed for sustained, high-throughput production workloads. Both use the same API and billing system. Q · 04 Can I bring my own model? Dedicated Inference supports custom models today. Open-weight models such as DeepSeek, Qwen, and Llama are available out of the box on Serverless Inference, with additional customization options expanding over time. Q · 05 ![Image 48](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/ca12a731-do-symbol-dark-theme-light.svg) © 2026 DigitalOcean, LLC. The problem Our solution Pricing HOW THE MARKET BREAKS DOWN In production Up and running in minutes Have questions? Production AI runs on DigitalOcean ## From real-time agents to trillion-token workloads, leaders in AI run on DigitalOcean. ![Image 49](https://try.digitalocean.com/serverless-inference/) ![Image 50](https://try.digitalocean.com/serverless-inference/) ![Image 51](https://try.digitalocean.com/serverless-inference/) ![Image 52](https://try.digitalocean.com/serverless-inference/) ![Image 53](https://try.digitalocean.com/serverless-inference/) ![Image 54](https://try.digitalocean.com/serverless-inference/) ![Image 55](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/eb0782d1-hero-visual-3x_10eo0ca000000000000028.png) ![Image 56](https://try.digitalocean.com/serverless-inference/) ![Image 57](https://try.digitalocean.com/serverless-inference/) ### Stop stitching infra together. Run AI on the AI-Native Cloud. #### Serverless Inference gives you access to text, image, and speech models through a single API with built-in scaling, observability, and usage-based pricing. [Get started →](https://cloud.digitalocean.com/registrations/new)[Talk to sales](https://www.digitalocean.com/company/contact/sales) ![Image 58](https://d9hhrg4mnvzow.cloudfront.net/try.digitalocean.com/serverless-inference/3b4cd778-do-dark-theme-logo.svg) Serverless Inference · Generally Available