LMSYS Blog

Win on TCO: How AMD Instinct™ MI355X Achieves Cost-Competitive Distributed Inference Through SGLang with MoRI

Thu, 28 May 2026 00:00:00 +0000

The SGLang and AMD team has worked closely to unlock competitive Total Cost of Ownership (TCO) for large-scale DeepSeek-R1 disaggregated inference on AMD Instinct™ MI355X GPUs. Building on [SGLang](https://github.com/sgl-project/sglang)'s serving framework and AMD's [MoRI](https://github.com/ROCm/mori) communication library, we demonstrate that AMD achieves competitive — and at key operating points, superior — TCO compared to NVIDIA B200 running Dynamo + TRT-LLM. These results are validated by [InferenceX](https://github.com/SemiAnalysisAI/InferenceX), SemiAnalysis's open-source continuous benchmark platform that tests across hundreds of GPUs with a [live dashboard](https://inferencex.com).

Updating 1T parameters in seconds — P2P weight transfer in Large Scale Distributed RL

Wed, 29 Apr 2026 00:00:00 +0000

We introduced a **RDMA-based, Peer to Peer weight update** mechanism for RL workloads in SGLang as a supplement to traditional NCCL broadcast methods, compatible with all major open source models. By utilizing a source-side **CPU engine replica** and **P2P RDMA transfers** via Mooncake TransferEngine, we speed up weight transfer times for 1T-parameter Kimi-K2 7 times (53 seconds -> 7.2 seconds), at the cost of one additional inference engine replica (32G) per training rank on CPU memory. These optimizations minimize network redundancy and allow inference servers to resume rollout significantly faster.

DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles

Sat, 25 Apr 2026 00:00:00 +0000

We are thrilled to announce Day-0 support for **DeepSeek-V4** across both inference and RL training. **SGLang** and **Miles** form the first open-source stack to serve and train DeepSeek-V4 on launch day — with systems purpose-built for its hybrid sparse-attention architecture, manifold-constrained hyper-connections (mHC), and FP4 expert weights.

HiSparse: Turbocharging Sparse Attention with Hierarchical Memory

Fri, 10 Apr 2026 00:00:00 +0000

Self-attention has become a major bottleneck in scaling LLMs to long contexts because of its quadratic compute and memory/IO cost. This has driven growing interest in efficient attention mechanisms. Among them, **sparse attention** is especially promising: by attending to only a selected subset of KV caches, it retains strong modeling capability while avoiding the sharp increase in compute and I/O costs that regular attention faces as context grows.

Highlights of SGLang at NVIDIA GTC 2026

Tue, 31 Mar 2026 00:00:00 +0000

SGLang came to NVIDIA GTC 2026 with panels, a happy hour, a 200-person meetup, and a hands-on training lab. Three days, five events, one packed week at the center of the LLM ecosystem and left with a lot to share. If you missed it, here's the full recap.

Elastic EP in SGLang: Achieving Partial Failure Tolerance for DeepSeek MoE Deployments

Wed, 25 Mar 2026 00:00:00 +0000

To serve massive Mixture-of-Experts (MoE) models efficiently, deploying a "wide" Expert Parallelism (EP) strategy—often spanning 32 GPUs or more per inference instance—is not just an option; it is a necessity. We need wide EP for two critical reasons:

ROCm Support for Miles: Large-Scale RL Post-Training on AMD Instinct™ GPUs

Tue, 17 Mar 2026 00:00:00 +0000

Reinforcement learning (RL) has rapidly become a core stage of modern foundation-model development. While large-scale pretraining remains essential, today's most capable models rely heavily on post-training techniques to improve reasoning, tool use, and multi-turn interaction. These workflows depend on scalable reinforcement learning infrastructure capable of running across multi-node GPU clusters.

SGLang Adds Day-0 Support for NVIDIA Nemotron 3 Super for building High-Efficiency Multi-Agent Systems

Wed, 11 Mar 2026 00:00:00 +0000

We are excited to announce that SGLang supports NVIDIA Nemotron 3 Super on Day 0.

Unlocking 25x Inference Performance with SGLang on NVIDIA GB300 NVL72

Fri, 20 Feb 2026 00:00:00 +0000

The SGLang team has worked closely with NVIDIA across [multiple GPU generations](https://lmsys.org/blog/2025-05-05-large-scale-ep/) to unlock step-function gains in inference performance for large-scale deployments of Mixture of Expert (MoE) reasoning models. Building on [prior results](https://lmsys.org/blog/2025-10-14-sa-inference-max/) that delivered 4x speedups on Blackwell B200 vs.Hopper H200 in SemiAnalysis InferenceMAXv1, we are now extending this momentum to Blackwell Ultra. With GB300 NVL72, SGLang achieves up to 25x performance gain on the latest InferenceXv2 benchmark compared to H200. Additionally, we increased SGLang's InferenceXv2 performance on GB200 NVL72 by up to 8x in less than 4 months. These performance gains are a result of the close collaboration between SGLang developers and NVIDIA engineering teams and translate directly into lower latency, higher throughput, and significantly reduced cost per token for large-scale Mixture of Experts (MoE) reasoning model deployments.

Deploying DeepSeek on GB300 NVL72: Big Wins in Long-Context Inference

Thu, 19 Feb 2026 00:00:00 +0000

As the latest addition to the Blackwell family, the **GB300 NVL72** is the most powerful platform for long-context LLM inference. In this blog post, we share our latest progress on optimizing DeepSeek R1-NVFP4 for 128K/8K ISL/OSL (Input Sequence Length/Output Sequence Length) long-context serving using prefill–decode disaggregation (PD), chunked pipeline parallelism (PP) for prefill, wide expert parallelism (Wide-EP) for decode, multi-token prediction (MTP), overlap scheduling, and faster attention kernels driven by 2x Special Function Unit (SFU) throughput increase in key instructions used in attention softmax.

SGLang-Diffusion: Advanced Optimizations for Production-Ready Video Generation

Mon, 16 Feb 2026 00:00:00 +0000

Following our [two-month progress update](https://lmsys.org/blog/2026-01-16-sglang-diffusion/), we're excited to share a

Unleashing Computational Power: Ultimate Latency Optimization of Qwen3 and Qwen3-VL on AMD MI300X Series

Wed, 11 Feb 2026 00:00:00 +0000

Qwen is a series of large-scale, high-performance Large Language Models (LLMs) developed by the Qwen Team of Alibaba Cloud. From the first generation to the latest third-generation flagship models, all Qwen variants have undergone dedicated training and fine-grained tuning, endowing them with strong instruction-following capabilities, efficient deployability for interactive AI applications, and robust performance in solving complex tasks. As flagship models in the Qwen3 family, Qwen3-235B and Qwen3-VL-235B have achieved comprehensive multi-dimensional improvements and have been widely deployed at scale in the Qwen APP.

Squeezing 1TB Model Rollout into a Single H200: INT4 QAT RL End-to-End Practice

Mon, 26 Jan 2026 00:00:00 +0000

> 💡 **TL;DR:**

Optimizing GLM4-MoE for Production: 65% Faster TTFT with SGLang

Wed, 21 Jan 2026 00:00:00 +0000

A suite of production-tested, high-impact optimizations has been developed by Novita AI for deploying GLM4-MOE models based on SGLANG.

SGLang-Diffusion: Two Months In

Fri, 16 Jan 2026 00:00:00 +0000

Since its release in early Nov. 2025, **SGLang-Diffusion** has gained significant attention and widespread adoption

Pipeline Parallelism in SGLang: Scaling to Million-Token Contexts and Beyond

Thu, 15 Jan 2026 00:00:00 +0000

We are excited to introduce SGLang's highly optimized Pipeline Parallelism (PP) implementation, specifically engineered to tackle the challenges of ultra-long context inference. By integrating **Chunked Pipeline Parallelism**, **Asynchronous P2P Communication**, and a simple yet effective **Dynamic Chunking mechanism**, this PP design achieves industry-leading performance while ensuring seamless compatibility with other parallel strategies, PD Disaggregation, and HiCache. In multi-node deployments, scaling to PP4 TP8 with this implementation yields a **3.31× Prefill Throughput for DeepSeek-V3.1** on an H20 cluster compared to TP8 when the chunked prefill size is set to 12K, significantly outperforming the TP32 solution (2.54×) by a **30.5% margin**. This highlights PP's inherent architectural advantage for large-scale, cross-node scaling over pure TP. Furthermore, our implementation also delivers up to a **67.9% reduction in TTFT** while maintaining an **82.8% strong scaling efficiency**, providing a highly efficient, open-source path for scaling trillion-parameter models for ultra-long context.

EPD Disaggregation: Elastic Encoder Scaling for Vision-Language Models in SGLang

Mon, 12 Jan 2026 00:00:00 +0000

> We introduce Encoder-Prefill-Decode (EPD) Disaggregation in SGLang, a novel architecture that separates vision encoding from language processing in Vision-Language Models (VLMs). This can enable:

SpecBundle & SpecForge v0.2: Production-Ready Speculative Decoding Models and Framework

Tue, 23 Dec 2025 00:00:00 +0000

The SpecForge team has collaborated with multiple industry partners - including **Ant, Meituan, Nex-AGI, and EigenAI** - to release [**SpecBundle (Phase 1)**](https://huggingface.co/collections/lmsys/specbundle), a collection of production-grade EAGLE-3 model checkpoints trained on large-scale datasets. **SpecBundle** is designed to improve the availability and real-world performance of speculative decoding, with Phase 1 focusing on instruct-tuned models.

Power Up Diffusion LLMs: Day‑0 Support for LLaDA 2.0

Fri, 19 Dec 2025 00:00:00 +0000

We are excited to introduce the design and implementation of the Diffusion Large Language Model (dLLM) framework within SGLang. By leveraging the existing Chunked-Prefill mechanism, our system achieves:

Mini-SGLang: Efficient Inference Engine in a Nutshell

Wed, 17 Dec 2025 00:00:00 +0000

We're excited to introduce **Mini-SGLang**, a lightweight yet high-performance inference framework for Large Language Models (LLMs). Derived from the [SGLang](https://github.com/sgl-project/sglang) project, Mini-SGLang is designed to demystify the complexities of modern serving systems. Despite its compact codebase, it retains the advanced features that define state-of-the-art performance, including **Radix Attention** for efficient KV cache reuse, **Chunked Prefill** for controlled memory footprint, **Overlap Scheduling** for reduced CPU overhead, and **Tensor Parallelism** for scalable distributed serving. With an OpenAI-compatible API and out-of-the-box support for models like Llama-3 and Qwen-3, Mini-SGLang serves as both a capable inference engine and a transparent reference implementation for researchers and developers.

SGLang Day-0 Support for MiMo-V2-Flash Model

Tue, 16 Dec 2025 00:00:00 +0000

[XiaomiMiMo/MiMo-V2-Flash](https://huggingface.co/XiaomiMiMo/MiMo-V2-Flash), with 309B total parameters and 15B activated parameters, is a new inference-centric model designed to maximize decoding efficiency. It is based on two key designs: **sliding window attention** and **multi-layer MTP**. MiMo-V2-Flash is explicitly co-designed for real-world serving workloads, enabling flexible tradeoffs between throughput and latency on different hardware. Combined with SGLang’s optimized Spec v2 runtime, which provides near-zero-overhead support for multi-layer MTP and efficient SWA execution, MiMo-V2-Flash delivers balanced TPOT and throughput on H200. In this blog, we will introduce the model and SGLang's efficient support.

SGLang Adds Day-0 Support for the Highly Efficient, Open Nemotron 3 Nano Hybrid MoE Model

Mon, 15 Dec 2025 00:00:00 +0000

**Jan 28th Update**: NVIDIA just released their Nemotron 3 Nano model in NVFP4 precision. This model is supported by SGLang out of the box and it uses a new method called Quantization-Aware Distillation (QAD) to maintain accuracy on NVFP4 while delivering 4x throughput on B200 compared to FP8-H100. You can download the NVFP4 checkpoints [here](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4) and run them using this [NVIDIA Brev launchable](https://brev.nvidia.com/launchable/deploy?launchableID=env-386BHXsTBKROX8F2WBCbQP6S6qt).

Let Tensors Fly — Accelerating Large Model Weight Loading with R-Fork

Wed, 10 Dec 2025 00:00:00 +0000

> We introduce **Tensor R-Fork** (stands for Tensor Remote Fork), a novel weight loading methodology that leverages **efficient inter-node device-to-device interconnect** to load tensors from a running SGLang instance to a new instance with **zero-copy**.

Boost SGLang Inference: Native NVIDIA Model Optimizer Integration for Seamless Quantization and Deployment

Tue, 02 Dec 2025 00:00:00 +0000

(Updated on Dec 2)

From research to production: Accelerate OSS LLM with EAGLE-3 on Vertex

Mon, 01 Dec 2025 00:00:00 +0000

**TL;DR:** Speculative decoding boosts LLM inference, but traditional methods require a separate, inefficient draft model. Vertex AI utilizes EAGLE-3, adding a small draft head (2-5% of the target model) to internal layers, simplifying training and achieving ~2x-3x decoding speedup. **This post outlines our pipeline for data cleaning, embeddings, training, and serving EAGLE-3 with SGLang on Vertex AI at scale.**

Unified FP8: Moving Beyond Mixed Precision for Stable and Accelerated MoE RL

Tue, 25 Nov 2025 00:00:00 +0000

> TL;DR: We have implemented fully FP8-based sampling and training in RL. Experiments show that for MoE models, the larger the model, the more severe the train–inference discrepancy becomes when using BF16 training with FP8 rollout. In contrast, using unified FP8 for both training and rollout effectively eliminates train–inference inconsistency caused by quantization error, improving both the speed and stability of RL training.

LMSYS Fellowship Program

Sun, 23 Nov 2025 00:00:00 +0000

We are thrilled to announce the launch of the LMSYS Fellowship Program!

Introducing Miles — RL Framework To Fire Up Large-Scale MoE Training

Wed, 19 Nov 2025 00:00:00 +0000

> *A journey of a thousand miles is made one small step at a time.*

🚀 AutoRound Meets SGLang: Enabling Quantized Model Inference with AutoRound

Fri, 14 Nov 2025 00:00:00 +0000

We are thrilled to announce an official collaboration between [**SGLang**](https://github.com/sgl-project/sglang) and [**AutoRound**](https://github.com/intel/auto-round), enabling low-bit quantization for efficient LLM inference.

SGLang Diffusion: Accelerating Video and Image Generation

Fri, 07 Nov 2025 00:00:00 +0000

We are excited to introduce SGLang Diffusion, which brings SGLang's state-of-the-art performance to accelerate image and video generation for diffusion models.

"No Free Lunch": Deconstruct Efficient Attention with MiniMax M2

Tue, 04 Nov 2025 00:00:00 +0000

We are excited to announce day-one support for the new flagship model, MiniMax M2, on SGLang. The MiniMax M2 redefines efficiency for agents: it is a compact, fast, and cost-effective Mixture of Experts (MoE) model (230 billion total parameters, 10 billion active) built for elite performance in coding and agentic tasks, all while maintaining powerful general intelligence. With only 10B activated parameters, M2 delivers the sophisticated, end-to-end tool-use performance expected from leading models, but in a streamlined form factor that makes deployment and scaling easier than ever.

Optimizing GPT-OSS on NVIDIA DGX Spark: Getting the Most Out of Your Spark

Mon, 03 Nov 2025 00:00:00 +0000

We’ve got some exciting updates about the **NVIDIA DGX Spark**\! In the week following the official launch, we collaborated closely with NVIDIA and successfully brought **GPT-OSS 20B** and **GPT-OSS 120B** support to **SGLang** on the DGX Spark. The results are impressive: around **70 tokens/s** on GPT-OSS 20B and **50 tokens/s** on GPT-OSS 120B, which is state-of-the-art so far, and makes running a **local coding agent** on the DGX Spark fully viable.

SGLang-Jax: An Open-Source Solution for Native TPU Inference

Wed, 29 Oct 2025 00:00:00 +0000

We're excited to introduce SGLang-Jax, a state-of-the-art open-source inference engine built entirely on Jax and XLA.

Accelerating Hybrid Inference in SGLang with KTransformers CPU Kernels

Wed, 22 Oct 2025 00:00:00 +0000

Modern Mixture-of-Experts (MoE) language models such as **DeepSeek-V3** contain hundreds of billions of parameters, but only a small subset of experts are activated per token.

SGLang and NVIDIA Accelerating SemiAnalysis InferenceMAX and GB200 Together

Tue, 14 Oct 2025 00:00:00 +0000

The SGLang and NVIDIA teams have a strong track record of collaboration, consistently delivering inference optimizations and system-level improvements to ensure exceptional performance of the SGLang framework. Most recently, this collaboration has been centered on the **NVIDIA Blackwell architecture**, NVIDIA’s latest data center GPU. By leveraging key Blackwell features like **FP8 attention**, **NVFP4 MoE**, and **PD-Disaggregated Expert Parallelism** architecture, SGLang achieved [breakthrough performance](https://lmsys.org/blog/2025-09-25-gb200-part-2/) at high throughput. On an NVIDIA GB200 NVL72 system, SGLang served the DeepSeek R1 models at an incredible **26k input and 13k output tokens per second per GPU** for prefill and decode, respectively. This milestone represents a new level of cost and power efficiency at scale.

NVIDIA DGX Spark In-Depth Review: A New Standard for Local AI Inference

Mon, 13 Oct 2025 00:00:00 +0000

Thanks to NVIDIA’s early access program, we are thrilled to get our hands on the NVIDIA DGX™ Spark. It’s quite an unconventional system, as NVIDIA rarely releases compact, all-in-one machines that bring supercomputing-class performance to a desktop workstation form factor.

SGLang Day 0 Support for DeepSeek-V3.2 with Sparse Attention

Mon, 29 Sep 2025 00:00:00 +0000

We are excited to announce that **SGLang supports DeepSeek-V3.2 on Day 0**! According to the DeepSeek [tech report](https://github.com/deepseek-ai/DeepSeek-V3.2-Exp/blob/main/DeepSeek_V3_2.pdf), it equips DeepSeek-V3.1-Terminus with [DeepSeek Sparse Attention (DSA)](https://arxiv.org/pdf/2502.11089) through continued training. With DSA, a fine-grained sparse attention mechanism powered by a lightning indexer, DeepSeek-V3.2 achieves significant efficiency improvements in both training and inference, especially in long-context scenarios. For more details about upcoming features, please check our [Roadmap](https://github.com/sgl-project/sglang/issues/11060).

PD-Multiplexing: Unlocking High-Goodput LLM Serving with GreenContext

Sun, 28 Sep 2025 00:00:00 +0000

This post highlights our initial efforts to support **a new serving paradigm, PD-Multiplexing, in** **SGLang.** It is designed to deliver higher goodput in LLM serving. PD-Multiplexing leverages [**GreenContext**](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__GREEN__CONTEXTS.html), a new NVIDIA GPU capability that allows lightweight and fine-grained partitioning of GPU resources across tasks within the same process. We envision this paradigm as a promising new approach to LLM service deployment, delivering stronger SLO guarantees and higher goodput for Model-as-a-Service (MaaS).

Together with SGLang: Best Practices for Serving DeepSeek-R1 on H20-96G

Fri, 26 Sep 2025 00:00:00 +0000

Operationalizing scaled Mixture-of-Experts (MoE) models such as DeepSeek-R1 requires a careful balance of latency, throughput, and cost. The challenge is especially acute on hardware with asymmetric performance profiles—for example, the H20 GPU, which offers high memory bandwidth but comparatively low compute throughput. Our goal was to design a serving stack that meets the stringent SLAs typically achieved on high-end GPUs while leveraging the H20’s cost advantages.

Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part II): 3.8x Prefill, 4.8x Decode Throughput

Thu, 25 Sep 2025 00:00:00 +0000

The GB200 NVL72 is one of the most powerful hardware for deep learning. In this blog post, we share our progress after our [previous blog post](https://lmsys.org/blog/2025-06-16-gb200-part-1/) to optimize the inference performance of DeepSeek V3/R1 with FP8 attention, NVFP4 MoE, large-scale expert parallelism, prefill-decode disaggregation, and various other optimizations. When using FP8 attention and NVFP4 MoE, SGLang achieved 26,156 input and 13,386 output tokens per second per GPU for prefill and decode, respectively, on DeepSeek V3/R1 for 2000-token input sequences, which is a 3.8x and 4.8x speedup compared to [H100 settings](https://lmsys.org/blog/2025-05-05-large-scale-ep/). Even with traditional BF16 attention and FP8 MoE, SGLang still achieves 18,471 input and 9,087 output tokens per second. Reproduction instructions can be found [here](https://github.com/sgl-project/sglang/issues/10903).

Optimizing FP4 Mixed-Precision Inference on AMD GPUs

Sun, 21 Sep 2025 00:00:00 +0000

As frontier large language models (LLMs) continue scaling to unprecedented sizes, they demand increasingly more compute power and memory bandwidth from GPUs. Both GPU manufacturers and model developers are shifting toward low-precision floating-point formats. FP4 (4-bit floating point) quantization has emerged as a particularly compelling solution—for instance, FP4-quantized [Llama 3.3 70B](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP4) models achieve a 3.5x reduction in model size while maintaining minimal quality degradation on benchmarks like [MMLU](https://arxiv.org/abs/2009.03300).

SGLang HiCache: Fast Hierarchical KV Caching with Your Favorite Storage Backends

Wed, 10 Sep 2025 00:00:00 +0000

In a coding agent scenario using Qwen3-Coder-480B, the observed dialogues often stretched past 25K tokens around 8 turns per session. Without full KV cache retention, nearly every request required costly re-computation. By **integrating SGLang HiCache with DeepSeek 3FS KVStore** for large-scale historical KV caching, the session’s **average TTFT dropped by 56%, inference throughput doubled, and the cache hit rate jumped from 40% to 80%.”**

LongCat-Flash: Deploying Meituan's Agentic Model with SGLang

Mon, 01 Sep 2025 00:00:00 +0000

LongCat-Flash, Meituan's open-source Agentic Mixture-of-Experts (MoE) model is now available from huggingface [LongCat-Flash-Chat](https://huggingface.co/meituan-longcat/LongCat-Flash-Chat). Released by Meituan LongCat Team, it features:

Fine-tune and deploy gpt-oss MXFP4: ModelOpt + SGLang

Thu, 28 Aug 2025 00:00:00 +0000

(Updated on Aug 29)

SGLang for gpt-oss: From Day 0 Support to Enhanced Performance

Wed, 27 Aug 2025 00:00:00 +0000

We are excited to announce a major update for SGLang, focusing on deep performance optimizations and new features for the recently released openai/gpt-oss-120b model. **While we had support from day zero, we took the last few weeks to enhance our engine to ensure you get the best possible performance.**

GLM-4.5 Meets SGLang: Reasoning, Coding, and Agentic Abilities

Thu, 31 Jul 2025 00:00:00 +0000

Today, we are excited to introduce our latest flagship models [GLM-4.5](https://huggingface.co/zai-org/GLM-4.5) and [GLM-4.5-Air](https://huggingface.co/zai-org/GLM-4.5-Air), along with their FP8 variants. All models are now available with day-one support on SGLang.

SpecForge: Accelerating Speculative Decoding Training for SGLang

Fri, 25 Jul 2025 00:00:00 +0000

Speculative decoding is a powerful technique for accelerating Large Language Model (LLM) inference. In this blog post, we are excited to announce the open-sourcing of **[SpecForge](https://github.com/sgl-project/SpecForge)**, our new training framework for Eagle3-based speculative decoding. SpecForge is designed for ease of use and is tightly integrated with the **[SGLang](https://github.com/sgl-project/sglang)** inference engine, enabling a seamless transition from training to deployment.

Deploying Kimi K2 with PD Disaggregation and Large-Scale Expert Parallelism on 128 H200 GPUs

Sun, 20 Jul 2025 00:00:00 +0000

**Kimi K2 is currently the most advanced open-source Mixture-of-Experts (MoE) model available.**

Accelerating SGLang with Multiple Token Prediction

Thu, 17 Jul 2025 00:00:00 +0000

SGLang now supports smooth combination of these advanced features: **Multiple Token Prediction (MTP)**, **Large-Scale Expert Parallelism (EP)**, and **Prefill-Decode disaggregation**. This integration delivers **up to 60% higher output throughput** through a new decoding paradigm, better parallelism, and more efficient resource utilization without sacrificing generation quality. If you are serving models, e.g., DeepSeek V3, SGLang now supports MTP as a plug-and-play feature, unlocking immediate performance gains. You can find instruction for reproduction [here](https://github.com/sgl-project/sglang/issues/7998).

How to support new VLMs into SGLang: A Case Study with NVILA

Wed, 16 Jul 2025 00:00:00 +0000

The world of LLMs is evolving at a remarkable pace, with Visual Language Models (VLMs) at the forefront of this revolution. These models power applications that can understand and reason about both images and text. There are [tons of new VLM models](https://huggingface.co/models?pipeline_tag=image-text-to-text&sort=trending) emerging daily, and we want to integrate them into [SGLang](https://github.com/sgl-project/sglang) to leverage its high-speed throughput. Today, we’ll provide a step-by-step walkthrough for integrating new VLMs into the SGLang ecosystem, using the recent [NVILA model](https://arxiv.org/abs/2412.04468) as a real-world case study.

Cost Effective Deployment of DeepSeek R1 with Intel® Xeon® 6 CPU on SGLang

Mon, 14 Jul 2025 00:00:00 +0000

The impressive performance of DeepSeek R1 marked a rise of giant Mixture of Experts (MoE) models in Large Language Models (LLM). However, its massive model size and unique architecture have posed new challenges on deployment. The significant memory requirements will normally require 8x or even 16x high-end AI accelerators to deploy.

slime: An SGLang-Native Post-Training Framework for RL Scaling

Wed, 09 Jul 2025 00:00:00 +0000

We believe in RL. We believe RL is the final piece toward AGI.

OME: Revolutionizing LLM Infrastructure with Model-Driven Architecture

Tue, 08 Jul 2025 00:00:00 +0000

In any large organization deploying LLMs, two distinct teams emerge with conflicting needs:

Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part I): 2.7x Higher Decoding Throughput

Mon, 16 Jun 2025 00:00:00 +0000

The GB200 NVL72 is the world's most advanced hardware for AI training and inference. In this blog post, we're excited to share early results from running DeepSeek 671B with prefill-decode disaggregation and large-scale expert parallelism on the GB200 NVL72. By leveraging Blackwell-specific features to enhance existing components, **SGLang achieved 7,583 tokens per second per GPU for decoding on the GB200 NVL72—a 2.7x speedup compared to the H100 per GPU** ([link](https://lmsys.org/blog/2025-05-05-large-scale-ep/)) for 2,000-token input lengths. Performance is expected to improve further with ongoing optimizations. You can find reproduction instructions [here](https://github.com/sgl-project/sglang/issues/7227).

Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs

Mon, 05 May 2025 00:00:00 +0000

DeepSeek is a popular open-source large language model (LLM) praised for its strong performance. However, its large size and unique architecture, which uses Multi-head Latent Attention (MLA) and Mixture of Experts (MoE), require an advanced system for efficient serving at scale. In this blog, we explain how we match DeepSeek's inference system performance with SGLang.

SGLang v0.4: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, Faster Structured Outputs

Wed, 04 Dec 2024 00:00:00 +0000

We’re excited to release [SGLang v0.4](https://github.com/sgl-project/sglang), featuring significant performance improvements and new features:

Announcing a New Site for Chatbot Arena

Fri, 20 Sep 2024 00:00:00 +0000

We’re excited to share that Chatbot Arena now has its own dedicated website: [lmarena.ai](https://lmarena.ai) and [blog](https://blog.lmarena.ai)!

RedTeam Arena: An Open-Source, Community-driven Jailbreaking Platform

Fri, 13 Sep 2024 00:00:00 +0000

We are excited to launch [RedTeam Arena](https://redarena.ai), a community-driven redteaming platform, built in collaboration with [Pliny](https://x.com/elder_plinius) and the [BASI](https://discord.gg/Y6GxC59G) community!

SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision

Wed, 04 Sep 2024 00:00:00 +0000

We're excited to announce the release of [SGLang v0.3](https://github.com/sgl-project/sglang/tree/main), which brings significant performance enhancements and expanded support for novel model architectures. Here are the key updates:

Does style matter? Disentangling style and substance in Chatbot Arena

Thu, 29 Aug 2024 00:00:00 +0000

Why is GPT-4o-mini so good? Why does Claude rank so low, when anecdotal experience suggests otherwise?

Achieving Faster Open-Source Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM)

Thu, 25 Jul 2024 00:00:00 +0000

At LMSYS.org, we've been running the [Chatbot Arena](https://chat.lmsys.org/) platform for over a year, serving millions of users. We know firsthand how crucial efficient serving is for AI products and research. Through our operational experiences and in-depth research, we've continuously enhanced the underlying serving systems, spanning from the high-level multi-model serving framework, [FastChat](https://github.com/lm-sys/FastChat/tree/main), to the efficient serving engine, [SGLang Runtime (SRT)](https://github.com/sgl-project/sglang).

RouteLLM: An Open-Source Framework for Cost-Effective LLM Routing

Mon, 01 Jul 2024 00:00:00 +0000

LLMs have demonstrated remarkable capabilities across a range of tasks, but there exists wide variation in their costs and capabilities, as seen from the plot of performance against cost in Figure 1. Very broadly, more capable models tend to be more expensive than less capable models. This leads to a dilemma when deploying LLMs in the real-world - routing all queries to the largest, most capable model leads to the highest-quality responses but can be expensive, while routing queries to smaller models can save costs but may result in lower-quality responses.

The Multimodal Arena is Here!

Thu, 27 Jun 2024 00:00:00 +0000

We added image support to [Chatbot Arena](https://lmarena.ai/)! You can now chat with your favorite vision-language models from OpenAI, Anthropic, Google, and most other major LLM providers to help discover how these models stack up against eachother.

Introducing Hard Prompts Category in Chatbot Arena

Mon, 20 May 2024 00:00:00 +0000

Introducing **Hard Prompts**, a new and challenging category in the Chatbot Arena [Leaderboard](https://leaderboard.lmsys.org).

What’s up with Llama 3? Arena data analysis

Wed, 08 May 2024 00:00:00 +0000

On April 18th, Meta released Llama 3, their newest open-weight large language model. Since then, Llama 3-70B has quickly risen to the top of the English [Chatbot Arena leaderboard](https://leaderboard.lmsys.org) with over 50,000 battles. This remarkable achievement by Meta is excellent news for the open-source community. In this blog post, we aim to provide more insight into why users rank Llama 3-70b on par with top-ranked models like GPT-4-Turbo, Gemini 1.5 Pro, and Claude 3 Opus.

LMSYS Kaggle Competition – Predicting Human Preference with $100,000 in Prizes

Thu, 02 May 2024 00:00:00 +0000

LMSYS and Kaggle are launching a human preference prediction competition! You are challenged to predict which responses users will prefer in head-to-head battles between Large Language Models (LLMs). You'll work with a dataset from the [Chatbot Arena](https://lmarena.ai), containing conversations and user preferences across various LLMs. By developing a model that accurately predicts human preferences, you'll contribute to improving chatbot performance and alignment with user expectations. The training dataset includes over 55,000 real-world user and LLM conversations and user preferences, with personally identifiable information removed. Your solution submission will be tested on a hidden test set of 25,000 samples.

From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline

Fri, 19 Apr 2024 00:00:00 +0000

Building an affordable and reliable benchmark for LLM chatbots has become a critical challenge. A high-quality benchmark should 1) robustly separate model capability, 2) reflect human preference in real-world use cases, and 3) frequently update to avoid over-fitting or test set leakage.

LMSYS Chatbot Arena: Live and Community-Driven LLM Evaluation

Fri, 01 Mar 2024 00:00:00 +0000

Chatbot Arena ([lmarena.ai](https://lmarena.ai)) is an open-source project developed by members from [LMSYS](https://lmarena.ai/?about) and UC Berkeley SkyLab. Our mission is to advance LLM development and understanding through live, open, and community-driven evaluations. We maintain the open evaluation platform for any user to rate LLMs via pairwise comparisons under real-world use cases and publish [leaderboard](https://lmarena.ai/?leaderboard) periodically.

Fast JSON Decoding for Local LLMs with Compressed Finite State Machine

Mon, 05 Feb 2024 00:00:00 +0000

Constraining an LLM to consistently generate valid JSON or YAML that adheres to a specific schema is a critical feature for many applications.

Fast and Expressive LLM Inference with RadixAttention and SGLang

Wed, 17 Jan 2024 00:00:00 +0000

Large Language Models (LLMs) are increasingly utilized for complex tasks that require multiple chained generation calls, advanced prompting techniques, control flow, and interaction with external environments. However, there is a notable deficiency in efficient systems for programming and executing these applications.

Chatbot Arena: New models & Elo system update

Thu, 07 Dec 2023 00:00:00 +0000

Welcome to our latest update on the Chatbot Arena, our open evaluation platform to test the most advanced LLMs. We're excited to share that over **130,000** votes that are now collected to rank the most capable 40+ models! In this blog post, we'll cover the results of several new models:

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Tue, 21 Nov 2023 00:00:00 +0000

**TL;DR:** We introduce **lookahead decoding**, a new, exact, and parallel decoding algorithm to accelerate LLM inference.

Recipe for Serving Thousands of Concurrent LoRA Adapters

Wed, 15 Nov 2023 00:00:00 +0000

In this blog post, we introduce [S-LoRA](https://arxiv.org/abs/2311.03285) ([code](https://github.com/S-LoRA/S-LoRA)), a system designed for the scalable serving of many LoRA adapters. S-LoRA adopts the idea of

Catch me if you can! How to beat GPT-4 with a 13B model

Tue, 14 Nov 2023 00:00:00 +0000

Announcing Llama-rephraser: 13B models reaching GPT-4 performance in major benchmarks (MMLU/GSK-8K/HumanEval)!

ToxicChat: A Benchmark for Content Moderation in Real-world User-AI Interactions

Mon, 30 Oct 2023 00:00:00 +0000

In this blogpost, we introduce ToxicChat, a benchmark consisting of 10K high-quality data for content moderation in real-world user-AI interactions. Evaluation results show that fine-tuning on this benchmark notably improves a baseline model’s ability to detect toxic queries in user-AI interactions.

Chatbot Arena Conversation Dataset Release

Thu, 20 Jul 2023 00:00:00 +0000

Since its launch three months ago, [Chatbot Arena](https://lmsys.org/blog/2023-05-03-arena/) has become a widely cited LLM evaluation platform that emphasizes large-scale, community-based, and interactive human evaluation. In that short time span, we collected around 53K votes from 19K unique IP addresses for 22 models.

How Long Can Open-Source LLMs Truly Promise on Context Length?

Thu, 29 Jun 2023 00:00:00 +0000

In this blogpost, we introduce our latest series of chatbot models, LongChat-7B and LongChat-13B, featuring a new level of extended context length up to 16K tokens.

Chatbot Arena Leaderboard Week 8: Introducing MT-Bench and Vicuna-33B

Thu, 22 Jun 2023 00:00:00 +0000

In this blog post, we share the latest update on Chatbot Arena leaderboard, which now includes more open models and three metrics:

Building a Truly \"Open\" OpenAI API Server with Open Models Locally

Fri, 09 Jun 2023 00:00:00 +0000

Many applications have been built on closed-source OpenAI APIs, but now you can effortlessly port them to use open-source alternatives without modifying the code. [FastChat](https://github.com/lm-sys/FastChat)'s OpenAI-compatible API server enables this seamless transition.

Chatbot Arena Leaderboard Updates (Week 4)

Thu, 25 May 2023 00:00:00 +0000

In this update, we are excited to welcome the following models joining the [Chatbot Arena](https://lmsys.org/blog/2023-05-03-arena/):

Chatbot Arena Leaderboard Updates (Week 2)

Wed, 10 May 2023 00:00:00 +0000

We release an updated leaderboard with more models and new data we collected last week, after the announcement of the anonymous [Chatbot Arena](https://lmsys.org/blog/2023-05-03-arena/). We are actively iterating on the design of the arena and leaderboard scores.

Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings

Wed, 03 May 2023 00:00:00 +0000

We present Chatbot Arena, a benchmark platform for large language models (LLMs) that features anonymous, randomized battles in a crowdsourced manner. In this blog post, we are releasing our initial results and a leaderboard based on the Elo rating system, which is a widely-used rating system in chess and other competitive games. We invite the entire community to join this effort by contributing new models and evaluating them by asking questions and voting for your favorite answer.

Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality

Thu, 30 Mar 2023 00:00:00 +0000

We introduce Vicuna-13B, an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. Preliminary evaluation using GPT-4 as a judge shows Vicuna-13B achieves more than 90%* quality of OpenAI ChatGPT and Google Bard while outperforming other models like LLaMA and Stanford Alpaca in more than 90%^* of cases. The cost of training Vicuna-13B is around $300. The [code](https://github.com/lm-sys/FastChat) and [weights](https://github.com/lm-sys/FastChat#vicuna-weights), along with an online [demo](https://chat.lmsys.org), are publicly available for non-commercial use.