LMSYS Blog https://lmsys.org/blog/ Research blog from LMSYS http://www.rssboard.org/rss-specification python-feedgen en Mon, 01 Jun 2026 04:51:51 +0000 Win on TCO: How AMD Instinct™ MI355X Achieves Cost-Competitive Distributed Inference Through SGLang with MoRI https://lmsys.org/blog/2026-05-28-mori/ The SGLang and AMD team has worked closely to unlock competitive Total Cost of Ownership (TCO) for large-scale DeepSeek-R1 disaggregated inference on AMD Instinct™ MI355X GPUs. Building on [SGLang](https://github.com/sgl-project/sglang)'s serving framework and AMD's [MoRI](https://github.com/ROCm/mori) communication library, we demonstrate that AMD achieves competitive — and at key operating points, superior — TCO compared to NVIDIA B200 running Dynamo + TRT-LLM. These results are validated by [InferenceX](https://github.com/SemiAnalysisAI/InferenceX), SemiAnalysis's open-source continuous benchmark platform that tests across hundreds of GPUs with a [live dashboard](https://inferencex.com). https://lmsys.org/blog/2026-05-28-mori/ Thu, 28 May 2026 00:00:00 +0000 Updating 1T parameters in seconds — P2P weight transfer in Large Scale Distributed RL https://lmsys.org/blog/2026-04-29-p2p-update/ We introduced an **RDMA-based, peer-to-peer weight update** mechanism for RL workloads in SGLang as a supplement to traditional NCCL broadcast methods, compatible with all major open source models. By utilizing a source-side **CPU engine replica** and **P2P RDMA transfers** via Mooncake TransferEngine, we speed up weight transfer for the 1T-parameter Kimi-K2 by about 7x (53 seconds -> 7.2 seconds), at the cost of one additional inference engine replica (32G) per training rank in CPU memory. These optimizations minimize network redundancy and allow inference servers to resume rollout significantly faster.
https://lmsys.org/blog/2026-04-29-p2p-update/ Wed, 29 Apr 2026 00:00:00 +0000 DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles https://lmsys.org/blog/2026-04-25-deepseek-v4/ We are thrilled to announce Day-0 support for **DeepSeek-V4** across both inference and RL training. **SGLang** and **Miles** form the first open-source stack to serve and train DeepSeek-V4 on launch day — with systems purpose-built for its hybrid sparse-attention architecture, manifold-constrained hyper-connections (mHC), and FP4 expert weights. https://lmsys.org/blog/2026-04-25-deepseek-v4/ Sat, 25 Apr 2026 00:00:00 +0000 HiSparse: Turbocharging Sparse Attention with Hierarchical Memory https://lmsys.org/blog/2026-04-10-sglang-hisparse/ Self-attention has become a major bottleneck in scaling LLMs to long contexts because of its quadratic compute and memory/IO cost. This has driven growing interest in efficient attention mechanisms. Among them, **sparse attention** is especially promising: by attending to only a selected subset of KV caches, it retains strong modeling capability while avoiding the sharp increase in compute and I/O costs that regular attention faces as context grows. https://lmsys.org/blog/2026-04-10-sglang-hisparse/ Fri, 10 Apr 2026 00:00:00 +0000 Highlights of SGLang at NVIDIA GTC 2026 https://lmsys.org/blog/2026-03-25-gtc2026/ SGLang came to NVIDIA GTC 2026 with panels, a happy hour, a 200-person meetup, and a hands-on training lab. It was three days, five events, and one packed week at the center of the LLM ecosystem, and we left with a lot to share. If you missed it, here's the full recap.
https://lmsys.org/blog/2026-03-25-gtc2026/ Tue, 31 Mar 2026 00:00:00 +0000 Elastic EP in SGLang: Achieving Partial Failure Tolerance for DeepSeek MoE Deployments https://lmsys.org/blog/2026-03-25-eep-partial-failure-tolerance/ To serve massive Mixture-of-Experts (MoE) models efficiently, deploying a "wide" Expert Parallelism (EP) strategy—often spanning 32 GPUs or more per inference instance—is not just an option; it is a necessity. We need wide EP for two critical reasons: https://lmsys.org/blog/2026-03-25-eep-partial-failure-tolerance/ Wed, 25 Mar 2026 00:00:00 +0000 ROCm Support for Miles: Large-Scale RL Post-Training on AMD Instinct™ GPUs https://lmsys.org/blog/2026-03-17-rocm-miles-rl-amd/ Reinforcement learning (RL) has rapidly become a core stage of modern foundation-model development. While large-scale pretraining remains essential, today's most capable models rely heavily on post-training techniques to improve reasoning, tool use, and multi-turn interaction. These workflows depend on scalable reinforcement learning infrastructure capable of running across multi-node GPU clusters. https://lmsys.org/blog/2026-03-17-rocm-miles-rl-amd/ Tue, 17 Mar 2026 00:00:00 +0000 SGLang Adds Day-0 Support for NVIDIA Nemotron 3 Super for building High-Efficiency Multi-Agent Systems https://lmsys.org/blog/2026-03-11-run-nvidia-nemotron-3-super/ We are excited to announce that SGLang supports NVIDIA Nemotron 3 Super on Day 0. https://lmsys.org/blog/2026-03-11-run-nvidia-nemotron-3-super/ Wed, 11 Mar 2026 00:00:00 +0000 Unlocking 25x Inference Performance with SGLang on NVIDIA GB300 NVL72 https://lmsys.org/blog/2026-02-20-gb300-inferencex/ The SGLang team has worked closely with NVIDIA across [multiple GPU generations](https://lmsys.org/blog/2025-05-05-large-scale-ep/) to unlock step-function gains in inference performance for large-scale deployments of Mixture of Experts (MoE) reasoning models.
Building on [prior results](https://lmsys.org/blog/2025-10-14-sa-inference-max/) that delivered 4x speedups on Blackwell B200 vs. Hopper H200 in SemiAnalysis InferenceMAXv1, we are now extending this momentum to Blackwell Ultra. With GB300 NVL72, SGLang achieves up to 25x performance gain on the latest InferenceXv2 benchmark compared to H200. Additionally, we increased SGLang's InferenceXv2 performance on GB200 NVL72 by up to 8x in less than 4 months. These performance gains are a result of the close collaboration between SGLang developers and NVIDIA engineering teams and translate directly into lower latency, higher throughput, and significantly reduced cost per token for large-scale Mixture of Experts (MoE) reasoning model deployments. https://lmsys.org/blog/2026-02-20-gb300-inferencex/ Fri, 20 Feb 2026 00:00:00 +0000 Deploying DeepSeek on GB300 NVL72: Big Wins in Long-Context Inference https://lmsys.org/blog/2026-02-19-gb300-longctx/ As the latest addition to the Blackwell family, the **GB300 NVL72** is the most powerful platform for long-context LLM inference. In this blog post, we share our latest progress on optimizing DeepSeek R1-NVFP4 for 128K/8K ISL/OSL (Input Sequence Length/Output Sequence Length) long-context serving using prefill–decode disaggregation (PD), chunked pipeline parallelism (PP) for prefill, wide expert parallelism (Wide-EP) for decode, multi-token prediction (MTP), overlap scheduling, and faster attention kernels driven by a 2x Special Function Unit (SFU) throughput increase in key instructions used in attention softmax.
https://lmsys.org/blog/2026-02-19-gb300-longctx/ Thu, 19 Feb 2026 00:00:00 +0000 SGLang-Diffusion: Advanced Optimizations for Production-Ready Video Generation https://lmsys.org/blog/2026-02-16-sglang-diffusion-advanced-optimizations/ Following our [two-month progress update](https://lmsys.org/blog/2026-01-16-sglang-diffusion/), we're excited to share a https://lmsys.org/blog/2026-02-16-sglang-diffusion-advanced-optimizations/ Mon, 16 Feb 2026 00:00:00 +0000 Unleashing Computational Power: Ultimate Latency Optimization of Qwen3 and Qwen3-VL on AMD MI300X Series https://lmsys.org/blog/2026-02-11-Qwen-latency/ Qwen is a series of large-scale, high-performance Large Language Models (LLMs) developed by the Qwen Team of Alibaba Cloud. From the first generation to the latest third-generation flagship models, all Qwen variants have undergone dedicated training and fine-grained tuning, endowing them with strong instruction-following capabilities, efficient deployability for interactive AI applications, and robust performance in solving complex tasks. As flagship models in the Qwen3 family, Qwen3-235B and Qwen3-VL-235B have achieved comprehensive multi-dimensional improvements and have been widely deployed at scale in the Qwen APP. https://lmsys.org/blog/2026-02-11-Qwen-latency/ Wed, 11 Feb 2026 00:00:00 +0000 Squeezing 1TB Model Rollout into a Single H200: INT4 QAT RL End-to-End Practice https://lmsys.org/blog/2026-01-26-int4-qat/ > 💡 **TL;DR:** https://lmsys.org/blog/2026-01-26-int4-qat/ Mon, 26 Jan 2026 00:00:00 +0000 Optimizing GLM4-MoE for Production: 65% Faster TTFT with SGLang https://lmsys.org/blog/2026-01-21-novita-glm4/ Novita AI has developed a suite of production-tested, high-impact optimizations for deploying GLM4-MoE models on SGLang. https://lmsys.org/blog/2026-01-21-novita-glm4/ Wed, 21 Jan 2026 00:00:00 +0000 SGLang-Diffusion: Two Months In https://lmsys.org/blog/2026-01-16-sglang-diffusion/ Since its release in early Nov.
2025, **SGLang-Diffusion** has gained significant attention and widespread adoption https://lmsys.org/blog/2026-01-16-sglang-diffusion/ Fri, 16 Jan 2026 00:00:00 +0000 Pipeline Parallelism in SGLang: Scaling to Million-Token Contexts and Beyond https://lmsys.org/blog/2026-01-15-chunked-pipeline/ We are excited to introduce SGLang's highly optimized Pipeline Parallelism (PP) implementation, specifically engineered to tackle the challenges of ultra-long context inference. By integrating **Chunked Pipeline Parallelism**, **Asynchronous P2P Communication**, and a simple yet effective **Dynamic Chunking mechanism**, this PP design achieves industry-leading performance while ensuring seamless compatibility with other parallel strategies, PD Disaggregation, and HiCache. In multi-node deployments, scaling to PP4 TP8 with this implementation yields a **3.31× Prefill Throughput for DeepSeek-V3.1** on an H20 cluster compared to TP8 when the chunked prefill size is set to 12K, significantly outperforming the TP32 solution (2.54×) by a **30.5% margin**. This highlights PP's inherent architectural advantage for large-scale, cross-node scaling over pure TP. Furthermore, our implementation also delivers up to a **67.9% reduction in TTFT** while maintaining an **82.8% strong scaling efficiency**, providing a highly efficient, open-source path for scaling trillion-parameter models for ultra-long context. https://lmsys.org/blog/2026-01-15-chunked-pipeline/ Thu, 15 Jan 2026 00:00:00 +0000 EPD Disaggregation: Elastic Encoder Scaling for Vision-Language Models in SGLang https://lmsys.org/blog/2026-01-12-epd/ > We introduce Encoder-Prefill-Decode (EPD) Disaggregation in SGLang, a novel architecture that separates vision encoding from language processing in Vision-Language Models (VLMs). 
This can enable: https://lmsys.org/blog/2026-01-12-epd/ Mon, 12 Jan 2026 00:00:00 +0000 SpecBundle & SpecForge v0.2: Production-Ready Speculative Decoding Models and Framework https://lmsys.org/blog/2025-12-23-spec-bundle-phase-1/ The SpecForge team has collaborated with multiple industry partners - including **Ant, Meituan, Nex-AGI, and EigenAI** - to release [**SpecBundle (Phase 1)**](https://huggingface.co/collections/lmsys/specbundle), a collection of production-grade EAGLE-3 model checkpoints trained on large-scale datasets. **SpecBundle** is designed to improve the availability and real-world performance of speculative decoding, with Phase 1 focusing on instruct-tuned models. https://lmsys.org/blog/2025-12-23-spec-bundle-phase-1/ Tue, 23 Dec 2025 00:00:00 +0000 Power Up Diffusion LLMs: Day‑0 Support for LLaDA 2.0 https://lmsys.org/blog/2025-12-19-diffusion-llm/ We are excited to introduce the design and implementation of the Diffusion Large Language Model (dLLM) framework within SGLang. By leveraging the existing Chunked-Prefill mechanism, our system achieves: https://lmsys.org/blog/2025-12-19-diffusion-llm/ Fri, 19 Dec 2025 00:00:00 +0000 Mini-SGLang: Efficient Inference Engine in a Nutshell https://lmsys.org/blog/2025-12-17-minisgl/ We're excited to introduce **Mini-SGLang**, a lightweight yet high-performance inference framework for Large Language Models (LLMs). Derived from the [SGLang](https://github.com/sgl-project/sglang) project, Mini-SGLang is designed to demystify the complexities of modern serving systems. Despite its compact codebase, it retains the advanced features that define state-of-the-art performance, including **Radix Attention** for efficient KV cache reuse, **Chunked Prefill** for controlled memory footprint, **Overlap Scheduling** for reduced CPU overhead, and **Tensor Parallelism** for scalable distributed serving. 
With an OpenAI-compatible API and out-of-the-box support for models like Llama-3 and Qwen-3, Mini-SGLang serves as both a capable inference engine and a transparent reference implementation for researchers and developers. https://lmsys.org/blog/2025-12-17-minisgl/ Wed, 17 Dec 2025 00:00:00 +0000 SGLang Day-0 Support for MiMo-V2-Flash Model https://lmsys.org/blog/2025-12-16-mimo-v2-flash/ [XiaomiMiMo/MiMo-V2-Flash](https://huggingface.co/XiaomiMiMo/MiMo-V2-Flash), with 309B total parameters and 15B activated parameters, is a new inference-centric model designed to maximize decoding efficiency. It is based on two key designs: **sliding window attention** and **multi-layer MTP**. MiMo-V2-Flash is explicitly co-designed for real-world serving workloads, enabling flexible tradeoffs between throughput and latency on different hardware. Combined with SGLang’s optimized Spec v2 runtime, which provides near-zero-overhead support for multi-layer MTP and efficient SWA execution, MiMo-V2-Flash delivers balanced TPOT and throughput on H200. In this blog, we will introduce the model and SGLang's efficient support. https://lmsys.org/blog/2025-12-16-mimo-v2-flash/ Tue, 16 Dec 2025 00:00:00 +0000 SGLang Adds Day-0 Support for the Highly Efficient, Open Nemotron 3 Nano Hybrid MoE Model https://lmsys.org/blog/2025-12-15-run-nvidia-nemotron-3-nano/ **Jan 28th Update**: NVIDIA just released their Nemotron 3 Nano model in NVFP4 precision. This model is supported by SGLang out of the box and it uses a new method called Quantization-Aware Distillation (QAD) to maintain accuracy on NVFP4 while delivering 4x throughput on B200 compared to FP8-H100. You can download the NVFP4 checkpoints [here](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4) and run them using this [NVIDIA Brev launchable](https://brev.nvidia.com/launchable/deploy?launchableID=env-386BHXsTBKROX8F2WBCbQP6S6qt). 
https://lmsys.org/blog/2025-12-15-run-nvidia-nemotron-3-nano/ Mon, 15 Dec 2025 00:00:00 +0000 Let Tensors Fly — Accelerating Large Model Weight Loading with R-Fork https://lmsys.org/blog/2025-12-10-rfork/ > We introduce **Tensor R-Fork** (stands for Tensor Remote Fork), a novel weight loading methodology that leverages **efficient inter-node device-to-device interconnect** to load tensors from a running SGLang instance to a new instance with **zero-copy**. https://lmsys.org/blog/2025-12-10-rfork/ Wed, 10 Dec 2025 00:00:00 +0000 Boost SGLang Inference: Native NVIDIA Model Optimizer Integration for Seamless Quantization and Deployment https://lmsys.org/blog/2025-12-02-modelopt-quantization/ (Updated on Dec 2) https://lmsys.org/blog/2025-12-02-modelopt-quantization/ Tue, 02 Dec 2025 00:00:00 +0000 From research to production: Accelerate OSS LLM with EAGLE-3 on Vertex https://lmsys.org/blog/2025-12-01-eagle3-vertex/ **TL;DR:** Speculative decoding boosts LLM inference, but traditional methods require a separate, inefficient draft model. Vertex AI utilizes EAGLE-3, adding a small draft head (2-5% of the target model) to internal layers, simplifying training and achieving ~2x-3x decoding speedup. **This post outlines our pipeline for data cleaning, embeddings, training, and serving EAGLE-3 with SGLang on Vertex AI at scale.** https://lmsys.org/blog/2025-12-01-eagle3-vertex/ Mon, 01 Dec 2025 00:00:00 +0000 Unified FP8: Moving Beyond Mixed Precision for Stable and Accelerated MoE RL https://lmsys.org/blog/2025-11-25-fp8-rl/ > TL;DR: We have implemented fully FP8-based sampling and training in RL. Experiments show that for MoE models, the larger the model, the more severe the train–inference discrepancy becomes when using BF16 training with FP8 rollout. In contrast, using unified FP8 for both training and rollout effectively eliminates train–inference inconsistency caused by quantization error, improving both the speed and stability of RL training. 
https://lmsys.org/blog/2025-11-25-fp8-rl/ Tue, 25 Nov 2025 00:00:00 +0000 LMSYS Fellowship Program https://lmsys.org/blog/2025-11-23-fellowship-apply/ We are thrilled to announce the launch of the LMSYS Fellowship Program! https://lmsys.org/blog/2025-11-23-fellowship-apply/ Sun, 23 Nov 2025 00:00:00 +0000 Introducing Miles — RL Framework To Fire Up Large-Scale MoE Training https://lmsys.org/blog/2025-11-19-miles/ > *A journey of a thousand miles is made one small step at a time.* https://lmsys.org/blog/2025-11-19-miles/ Wed, 19 Nov 2025 00:00:00 +0000 🚀 AutoRound Meets SGLang: Enabling Quantized Model Inference with AutoRound https://lmsys.org/blog/2025-11-13-AutoRound/ We are thrilled to announce an official collaboration between [**SGLang**](https://github.com/sgl-project/sglang) and [**AutoRound**](https://github.com/intel/auto-round), enabling low-bit quantization for efficient LLM inference. https://lmsys.org/blog/2025-11-13-AutoRound/ Fri, 14 Nov 2025 00:00:00 +0000 SGLang Diffusion: Accelerating Video and Image Generation https://lmsys.org/blog/2025-11-07-sglang-diffusion/ We are excited to introduce SGLang Diffusion, which brings SGLang's state-of-the-art performance to accelerate image and video generation for diffusion models. https://lmsys.org/blog/2025-11-07-sglang-diffusion/ Fri, 07 Nov 2025 00:00:00 +0000 "No Free Lunch": Deconstruct Efficient Attention with MiniMax M2 https://lmsys.org/blog/2025-11-04-miminmax-m2/ We are excited to announce day-one support for the new flagship model, MiniMax M2, on SGLang. The MiniMax M2 redefines efficiency for agents: it is a compact, fast, and cost-effective Mixture of Experts (MoE) model (230 billion total parameters, 10 billion active) built for elite performance in coding and agentic tasks, all while maintaining powerful general intelligence. 
With only 10B activated parameters, M2 delivers the sophisticated, end-to-end tool-use performance expected from leading models, but in a streamlined form factor that makes deployment and scaling easier than ever. https://lmsys.org/blog/2025-11-04-miminmax-m2/ Tue, 04 Nov 2025 00:00:00 +0000 Optimizing GPT-OSS on NVIDIA DGX Spark: Getting the Most Out of Your Spark https://lmsys.org/blog/2025-11-03-gpt-oss-on-nvidia-dgx-spark/ We’ve got some exciting updates about the **NVIDIA DGX Spark**! In the week following the official launch, we collaborated closely with NVIDIA and successfully brought **GPT-OSS 20B** and **GPT-OSS 120B** support to **SGLang** on the DGX Spark. The results are impressive: around **70 tokens/s** on GPT-OSS 20B and **50 tokens/s** on GPT-OSS 120B, which is state-of-the-art so far, and makes running a **local coding agent** on the DGX Spark fully viable. https://lmsys.org/blog/2025-11-03-gpt-oss-on-nvidia-dgx-spark/ Mon, 03 Nov 2025 00:00:00 +0000 SGLang-Jax: An Open-Source Solution for Native TPU Inference https://lmsys.org/blog/2025-10-29-sglang-jax/ We're excited to introduce SGLang-Jax, a state-of-the-art open-source inference engine built entirely on Jax and XLA. https://lmsys.org/blog/2025-10-29-sglang-jax/ Wed, 29 Oct 2025 00:00:00 +0000 Accelerating Hybrid Inference in SGLang with KTransformers CPU Kernels https://lmsys.org/blog/2025-10-22-KTransformers/ Modern Mixture-of-Experts (MoE) language models such as **DeepSeek-V3** contain hundreds of billions of parameters, but only a small subset of experts are activated per token.
https://lmsys.org/blog/2025-10-22-KTransformers/ Wed, 22 Oct 2025 00:00:00 +0000 SGLang and NVIDIA Accelerating SemiAnalysis InferenceMAX and GB200 Together https://lmsys.org/blog/2025-10-14-sa-inference-max/ The SGLang and NVIDIA teams have a strong track record of collaboration, consistently delivering inference optimizations and system-level improvements to ensure exceptional performance of the SGLang framework. Most recently, this collaboration has been centered on the **NVIDIA Blackwell architecture**, NVIDIA’s latest data center GPU. By leveraging key Blackwell features like **FP8 attention**, **NVFP4 MoE**, and **PD-Disaggregated Expert Parallelism** architecture, SGLang achieved [breakthrough performance](https://lmsys.org/blog/2025-09-25-gb200-part-2/) at high throughput. On an NVIDIA GB200 NVL72 system, SGLang served the DeepSeek R1 models at an incredible **26k input and 13k output tokens per second per GPU** for prefill and decode, respectively. This milestone represents a new level of cost and power efficiency at scale. https://lmsys.org/blog/2025-10-14-sa-inference-max/ Tue, 14 Oct 2025 00:00:00 +0000 NVIDIA DGX Spark In-Depth Review: A New Standard for Local AI Inference https://lmsys.org/blog/2025-10-13-nvidia-dgx-spark/ Thanks to NVIDIA’s early access program, we are thrilled to get our hands on the NVIDIA DGX™ Spark. It’s quite an unconventional system, as NVIDIA rarely releases compact, all-in-one machines that bring supercomputing-class performance to a desktop workstation form factor. https://lmsys.org/blog/2025-10-13-nvidia-dgx-spark/ Mon, 13 Oct 2025 00:00:00 +0000 SGLang Day 0 Support for DeepSeek-V3.2 with Sparse Attention https://lmsys.org/blog/2025-09-29-deepseek-V32/ We are excited to announce that **SGLang supports DeepSeek-V3.2 on Day 0**! 
According to the DeepSeek [tech report](https://github.com/deepseek-ai/DeepSeek-V3.2-Exp/blob/main/DeepSeek_V3_2.pdf), it equips DeepSeek-V3.1-Terminus with [DeepSeek Sparse Attention (DSA)](https://arxiv.org/pdf/2502.11089) through continued training. With DSA, a fine-grained sparse attention mechanism powered by a lightning indexer, DeepSeek-V3.2 achieves significant efficiency improvements in both training and inference, especially in long-context scenarios. For more details about upcoming features, please check our [Roadmap](https://github.com/sgl-project/sglang/issues/11060). https://lmsys.org/blog/2025-09-29-deepseek-V32/ Mon, 29 Sep 2025 00:00:00 +0000 PD-Multiplexing: Unlocking High-Goodput LLM Serving with GreenContext https://lmsys.org/blog/2025-09-28-pdmux/ This post highlights our initial efforts to support **a new serving paradigm, PD-Multiplexing, in** **SGLang.** It is designed to deliver higher goodput in LLM serving. PD-Multiplexing leverages [**GreenContext**](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__GREEN__CONTEXTS.html), a new NVIDIA GPU capability that allows lightweight and fine-grained partitioning of GPU resources across tasks within the same process. We envision this paradigm as a promising new approach to LLM service deployment, delivering stronger SLO guarantees and higher goodput for Model-as-a-Service (MaaS). https://lmsys.org/blog/2025-09-28-pdmux/ Sun, 28 Sep 2025 00:00:00 +0000 Together with SGLang: Best Practices for Serving DeepSeek-R1 on H20-96G https://lmsys.org/blog/2025-09-26-sglang-ant-group/ Operationalizing scaled Mixture-of-Experts (MoE) models such as DeepSeek-R1 requires a careful balance of latency, throughput, and cost. The challenge is especially acute on hardware with asymmetric performance profiles—for example, the H20 GPU, which offers high memory bandwidth but comparatively low compute throughput. 
Our goal was to design a serving stack that meets the stringent SLAs typically achieved on high-end GPUs while leveraging the H20’s cost advantages. https://lmsys.org/blog/2025-09-26-sglang-ant-group/ Fri, 26 Sep 2025 00:00:00 +0000 Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part II): 3.8x Prefill, 4.8x Decode Throughput https://lmsys.org/blog/2025-09-25-gb200-part-2/ The GB200 NVL72 is one of the most powerful hardware platforms for deep learning. In this blog post, we share our progress after our [previous blog post](https://lmsys.org/blog/2025-06-16-gb200-part-1/) to optimize the inference performance of DeepSeek V3/R1 with FP8 attention, NVFP4 MoE, large-scale expert parallelism, prefill-decode disaggregation, and various other optimizations. When using FP8 attention and NVFP4 MoE, SGLang achieved 26,156 input and 13,386 output tokens per second per GPU for prefill and decode, respectively, on DeepSeek V3/R1 for 2000-token input sequences, which is a 3.8x and 4.8x speedup compared to [H100 settings](https://lmsys.org/blog/2025-05-05-large-scale-ep/). Even with traditional BF16 attention and FP8 MoE, SGLang still achieves 18,471 input and 9,087 output tokens per second. Reproduction instructions can be found [here](https://github.com/sgl-project/sglang/issues/10903). https://lmsys.org/blog/2025-09-25-gb200-part-2/ Thu, 25 Sep 2025 00:00:00 +0000 Optimizing FP4 Mixed-Precision Inference on AMD GPUs https://lmsys.org/blog/2025-09-21-petit-amdgpu/ As frontier large language models (LLMs) continue scaling to unprecedented sizes, they demand increasingly more compute power and memory bandwidth from GPUs. Both GPU manufacturers and model developers are shifting toward low-precision floating-point formats.
FP4 (4-bit floating point) quantization has emerged as a particularly compelling solution—for instance, FP4-quantized [Llama 3.3 70B](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP4) models achieve a 3.5x reduction in model size while maintaining minimal quality degradation on benchmarks like [MMLU](https://arxiv.org/abs/2009.03300). https://lmsys.org/blog/2025-09-21-petit-amdgpu/ Sun, 21 Sep 2025 00:00:00 +0000 SGLang HiCache: Fast Hierarchical KV Caching with Your Favorite Storage Backends https://lmsys.org/blog/2025-09-10-sglang-hicache/ In a coding agent scenario using Qwen3-Coder-480B, the observed dialogues often stretched past 25K tokens, with around 8 turns per session. Without full KV cache retention, nearly every request required costly re-computation. By **integrating SGLang HiCache with DeepSeek 3FS KVStore** for large-scale historical KV caching, the session’s **average TTFT dropped by 56%, inference throughput doubled, and the cache hit rate jumped from 40% to 80%.** https://lmsys.org/blog/2025-09-10-sglang-hicache/ Wed, 10 Sep 2025 00:00:00 +0000 LongCat-Flash: Deploying Meituan's Agentic Model with SGLang https://lmsys.org/blog/2025-09-01-sglang-longcat-flash/ LongCat-Flash, Meituan's open-source Agentic Mixture-of-Experts (MoE) model, is now available on Hugging Face as [LongCat-Flash-Chat](https://huggingface.co/meituan-longcat/LongCat-Flash-Chat). Released by Meituan LongCat Team, it features: https://lmsys.org/blog/2025-09-01-sglang-longcat-flash/ Mon, 01 Sep 2025 00:00:00 +0000 Fine-tune and deploy gpt-oss MXFP4: ModelOpt + SGLang https://lmsys.org/blog/2025-08-28-gpt-oss-qat/ (Updated on Aug 29) https://lmsys.org/blog/2025-08-28-gpt-oss-qat/ Thu, 28 Aug 2025 00:00:00 +0000 SGLang for gpt-oss: From Day 0 Support to Enhanced Performance https://lmsys.org/blog/2025-08-27-gpt-oss/ We are excited to announce a major update for SGLang, focusing on deep performance optimizations and new features for the recently released openai/gpt-oss-120b model.
**While we had support from day zero, we took the last few weeks to enhance our engine to ensure you get the best possible performance.** https://lmsys.org/blog/2025-08-27-gpt-oss/ Wed, 27 Aug 2025 00:00:00 +0000 GLM-4.5 Meets SGLang: Reasoning, Coding, and Agentic Abilities https://lmsys.org/blog/2025-07-31-glm4-5/ Today, we are excited to introduce our latest flagship models [GLM-4.5](https://huggingface.co/zai-org/GLM-4.5) and [GLM-4.5-Air](https://huggingface.co/zai-org/GLM-4.5-Air), along with their FP8 variants. All models are now available with day-one support on SGLang. https://lmsys.org/blog/2025-07-31-glm4-5/ Thu, 31 Jul 2025 00:00:00 +0000 SpecForge: Accelerating Speculative Decoding Training for SGLang https://lmsys.org/blog/2025-07-25-spec-forge/ Speculative decoding is a powerful technique for accelerating Large Language Model (LLM) inference. In this blog post, we are excited to announce the open-sourcing of **[SpecForge](https://github.com/sgl-project/SpecForge)**, our new training framework for Eagle3-based speculative decoding. SpecForge is designed for ease of use and is tightly integrated with the **[SGLang](https://github.com/sgl-project/sglang)** inference engine, enabling a seamless transition from training to deployment. https://lmsys.org/blog/2025-07-25-spec-forge/ Fri, 25 Jul 2025 00:00:00 +0000 Deploying Kimi K2 with PD Disaggregation and Large-Scale Expert Parallelism on 128 H200 GPUs https://lmsys.org/blog/2025-07-20-k2-large-scale-ep/ **Kimi K2 is currently the most advanced open-source Mixture-of-Experts (MoE) model available.** https://lmsys.org/blog/2025-07-20-k2-large-scale-ep/ Sun, 20 Jul 2025 00:00:00 +0000 Accelerating SGLang with Multiple Token Prediction https://lmsys.org/blog/2025-07-17-mtp/ SGLang now supports smooth combination of these advanced features: **Multiple Token Prediction (MTP)**, **Large-Scale Expert Parallelism (EP)**, and **Prefill-Decode disaggregation**. 
This integration delivers **up to 60% higher output throughput** through a new decoding paradigm, better parallelism, and more efficient resource utilization without sacrificing generation quality. If you are serving models such as DeepSeek V3, SGLang now supports MTP as a plug-and-play feature, unlocking immediate performance gains. You can find reproduction instructions [here](https://github.com/sgl-project/sglang/issues/7998). https://lmsys.org/blog/2025-07-17-mtp/ Thu, 17 Jul 2025 00:00:00 +0000 How to support new VLMs into SGLang: A Case Study with NVILA https://lmsys.org/blog/2025-07-16-nvila/ The world of LLMs is evolving at a remarkable pace, with Visual Language Models (VLMs) at the forefront of this revolution. These models power applications that can understand and reason about both images and text. There are [tons of new VLMs](https://huggingface.co/models?pipeline_tag=image-text-to-text&sort=trending) emerging daily, and we want to integrate them into [SGLang](https://github.com/sgl-project/sglang) to leverage its high-speed throughput. Today, we’ll provide a step-by-step walkthrough for integrating new VLMs into the SGLang ecosystem, using the recent [NVILA model](https://arxiv.org/abs/2412.04468) as a real-world case study. https://lmsys.org/blog/2025-07-16-nvila/ Wed, 16 Jul 2025 00:00:00 +0000 Cost Effective Deployment of DeepSeek R1 with Intel® Xeon® 6 CPU on SGLang https://lmsys.org/blog/2025-07-14-intel-xeon-optimization/ The impressive performance of DeepSeek R1 marked the rise of giant Mixture of Experts (MoE) models among Large Language Models (LLMs). However, its massive model size and unique architecture have posed new deployment challenges. Its significant memory requirements normally call for 8 or even 16 high-end AI accelerators.
https://lmsys.org/blog/2025-07-14-intel-xeon-optimization/ Mon, 14 Jul 2025 00:00:00 +0000 slime: An SGLang-Native Post-Training Framework for RL Scaling https://lmsys.org/blog/2025-07-09-slime/ We believe in RL. We believe RL is the final piece toward AGI. https://lmsys.org/blog/2025-07-09-slime/ Wed, 09 Jul 2025 00:00:00 +0000 OME: Revolutionizing LLM Infrastructure with Model-Driven Architecture https://lmsys.org/blog/2025-07-08-ome/ In any large organization deploying LLMs, two distinct teams emerge with conflicting needs: https://lmsys.org/blog/2025-07-08-ome/ Tue, 08 Jul 2025 00:00:00 +0000 Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part I): 2.7x Higher Decoding Throughput https://lmsys.org/blog/2025-06-16-gb200-part-1/ The GB200 NVL72 is the world's most advanced hardware for AI training and inference. In this blog post, we're excited to share early results from running DeepSeek 671B with prefill-decode disaggregation and large-scale expert parallelism on the GB200 NVL72. By leveraging Blackwell-specific features to enhance existing components, **SGLang achieved 7,583 tokens per second per GPU for decoding on the GB200 NVL72—a 2.7x speedup compared to the H100 per GPU** ([link](https://lmsys.org/blog/2025-05-05-large-scale-ep/)) for 2,000-token input lengths. Performance is expected to improve further with ongoing optimizations. You can find reproduction instructions [here](https://github.com/sgl-project/sglang/issues/7227). https://lmsys.org/blog/2025-06-16-gb200-part-1/ Mon, 16 Jun 2025 00:00:00 +0000 Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs https://lmsys.org/blog/2025-05-05-large-scale-ep/ DeepSeek is a popular open-source large language model (LLM) praised for its strong performance. However, its large size and unique architecture, which uses Multi-head Latent Attention (MLA) and Mixture of Experts (MoE), require an advanced system for efficient serving at scale. 
In this blog, we explain how we match DeepSeek's inference system performance with SGLang. https://lmsys.org/blog/2025-05-05-large-scale-ep/ Mon, 05 May 2025 00:00:00 +0000 SGLang v0.4: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, Faster Structured Outputs https://lmsys.org/blog/2024-12-04-sglang-v0-4/ We’re excited to release [SGLang v0.4](https://github.com/sgl-project/sglang), featuring significant performance improvements and new features: https://lmsys.org/blog/2024-12-04-sglang-v0-4/ Wed, 04 Dec 2024 00:00:00 +0000 Announcing a New Site for Chatbot Arena https://lmsys.org/blog/2024-09-20-arena-new-site/ We’re excited to share that Chatbot Arena now has its own dedicated website: [lmarena.ai](https://lmarena.ai) and [blog](https://blog.lmarena.ai)! https://lmsys.org/blog/2024-09-20-arena-new-site/ Fri, 20 Sep 2024 00:00:00 +0000 RedTeam Arena: An Open-Source, Community-driven Jailbreaking Platform https://lmsys.org/blog/2024-09-13-redteam-arena/ We are excited to launch [RedTeam Arena](https://redarena.ai), a community-driven redteaming platform, built in collaboration with [Pliny](https://x.com/elder_plinius) and the [BASI](https://discord.gg/Y6GxC59G) community! https://lmsys.org/blog/2024-09-13-redteam-arena/ Fri, 13 Sep 2024 00:00:00 +0000 SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision https://lmsys.org/blog/2024-09-04-sglang-v0-3/ We're excited to announce the release of [SGLang v0.3](https://github.com/sgl-project/sglang/tree/main), which brings significant performance enhancements and expanded support for novel model architectures. Here are the key updates: https://lmsys.org/blog/2024-09-04-sglang-v0-3/ Wed, 04 Sep 2024 00:00:00 +0000 Does style matter? Disentangling style and substance in Chatbot Arena https://lmsys.org/blog/2024-08-28-style-control/ Why is GPT-4o-mini so good? Why does Claude rank so low, when anecdotal experience suggests otherwise? 
https://lmsys.org/blog/2024-08-28-style-control/ Thu, 29 Aug 2024 00:00:00 +0000 Achieving Faster Open-Source Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) https://lmsys.org/blog/2024-07-25-sglang-llama3/ At LMSYS.org, we've been running the [Chatbot Arena](https://chat.lmsys.org/) platform for over a year, serving millions of users. We know firsthand how crucial efficient serving is for AI products and research. Through our operational experiences and in-depth research, we've continuously enhanced the underlying serving systems, spanning from the high-level multi-model serving framework, [FastChat](https://github.com/lm-sys/FastChat/tree/main), to the efficient serving engine, [SGLang Runtime (SRT)](https://github.com/sgl-project/sglang). https://lmsys.org/blog/2024-07-25-sglang-llama3/ Thu, 25 Jul 2024 00:00:00 +0000 RouteLLM: An Open-Source Framework for Cost-Effective LLM Routing https://lmsys.org/blog/2024-07-01-routellm/ LLMs have demonstrated remarkable capabilities across a range of tasks, but there exists wide variation in their costs and capabilities, as seen from the plot of performance against cost in Figure 1. Very broadly, more capable models tend to be more expensive than less capable models. This leads to a dilemma when deploying LLMs in the real world: routing all queries to the largest, most capable model leads to the highest-quality responses but can be expensive, while routing queries to smaller models can save costs but may result in lower-quality responses. https://lmsys.org/blog/2024-07-01-routellm/ Mon, 01 Jul 2024 00:00:00 +0000 The Multimodal Arena is Here! https://lmsys.org/blog/2024-06-27-multimodal/ We added image support to [Chatbot Arena](https://lmarena.ai/)! You can now chat with your favorite vision-language models from OpenAI, Anthropic, Google, and most other major LLM providers to help discover how these models stack up against each other.
https://lmsys.org/blog/2024-06-27-multimodal/ Thu, 27 Jun 2024 00:00:00 +0000 Introducing Hard Prompts Category in Chatbot Arena https://lmsys.org/blog/2024-05-17-category-hard/ Introducing **Hard Prompts**, a new and challenging category in the Chatbot Arena [Leaderboard](https://leaderboard.lmsys.org). https://lmsys.org/blog/2024-05-17-category-hard/ Mon, 20 May 2024 00:00:00 +0000 What’s up with Llama 3? Arena data analysis https://lmsys.org/blog/2024-05-08-llama3/ On April 18th, Meta released Llama 3, their newest open-weight large language model. Since then, Llama 3-70B has quickly risen to the top of the English [Chatbot Arena leaderboard](https://leaderboard.lmsys.org) with over 50,000 battles. This remarkable achievement by Meta is excellent news for the open-source community. In this blog post, we aim to provide more insight into why users rank Llama 3-70B on par with top-ranked models like GPT-4-Turbo, Gemini 1.5 Pro, and Claude 3 Opus. https://lmsys.org/blog/2024-05-08-llama3/ Wed, 08 May 2024 00:00:00 +0000 LMSYS Kaggle Competition – Predicting Human Preference with $100,000 in Prizes https://lmsys.org/blog/2024-05-02-kaggle-competition/ LMSYS and Kaggle are launching a human preference prediction competition! You are challenged to predict which responses users will prefer in head-to-head battles between Large Language Models (LLMs). You'll work with a dataset from the [Chatbot Arena](https://lmarena.ai), containing conversations and user preferences across various LLMs. By developing a model that accurately predicts human preferences, you'll contribute to improving chatbot performance and alignment with user expectations. The training dataset includes over 55,000 real-world user and LLM conversations and user preferences, with personally identifiable information removed. Your solution submission will be tested on a hidden test set of 25,000 samples.
https://lmsys.org/blog/2024-05-02-kaggle-competition/ Thu, 02 May 2024 00:00:00 +0000 From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline https://lmsys.org/blog/2024-04-19-arena-hard/ Building an affordable and reliable benchmark for LLM chatbots has become a critical challenge. A high-quality benchmark should 1) robustly separate model capability, 2) reflect human preference in real-world use cases, and 3) frequently update to avoid over-fitting or test set leakage. https://lmsys.org/blog/2024-04-19-arena-hard/ Fri, 19 Apr 2024 00:00:00 +0000 LMSYS Chatbot Arena: Live and Community-Driven LLM Evaluation https://lmsys.org/blog/2024-03-01-policy/ Chatbot Arena ([lmarena.ai](https://lmarena.ai)) is an open-source project developed by members from [LMSYS](https://lmarena.ai/?about) and UC Berkeley SkyLab. Our mission is to advance LLM development and understanding through live, open, and community-driven evaluations. We maintain the open evaluation platform for any user to rate LLMs via pairwise comparisons under real-world use cases and publish a [leaderboard](https://lmarena.ai/?leaderboard) periodically. https://lmsys.org/blog/2024-03-01-policy/ Fri, 01 Mar 2024 00:00:00 +0000 Fast JSON Decoding for Local LLMs with Compressed Finite State Machine https://lmsys.org/blog/2024-02-05-compressed-fsm/ Constraining an LLM to consistently generate valid JSON or YAML that adheres to a specific schema is a critical feature for many applications. https://lmsys.org/blog/2024-02-05-compressed-fsm/ Mon, 05 Feb 2024 00:00:00 +0000 Fast and Expressive LLM Inference with RadixAttention and SGLang https://lmsys.org/blog/2024-01-17-sglang/ Large Language Models (LLMs) are increasingly utilized for complex tasks that require multiple chained generation calls, advanced prompting techniques, control flow, and interaction with external environments. However, there is a notable deficiency in efficient systems for programming and executing these applications.
https://lmsys.org/blog/2024-01-17-sglang/ Wed, 17 Jan 2024 00:00:00 +0000 Chatbot Arena: New models & Elo system update https://lmsys.org/blog/2023-12-07-leaderboard/ Welcome to our latest update on the Chatbot Arena, our open evaluation platform to test the most advanced LLMs. We're excited to share that over **130,000** votes have now been collected to rank the most capable 40+ models! In this blog post, we'll cover the results of several new models: https://lmsys.org/blog/2023-12-07-leaderboard/ Thu, 07 Dec 2023 00:00:00 +0000 Break the Sequential Dependency of LLM Inference Using Lookahead Decoding https://lmsys.org/blog/2023-11-21-lookahead-decoding/ **TL;DR:** We introduce **lookahead decoding**, a new, exact, and parallel decoding algorithm to accelerate LLM inference. https://lmsys.org/blog/2023-11-21-lookahead-decoding/ Tue, 21 Nov 2023 00:00:00 +0000 Recipe for Serving Thousands of Concurrent LoRA Adapters https://lmsys.org/blog/2023-11-15-slora/ In this blog post, we introduce [S-LoRA](https://arxiv.org/abs/2311.03285) ([code](https://github.com/S-LoRA/S-LoRA)), a system designed for the scalable serving of many LoRA adapters. S-LoRA adopts the idea of https://lmsys.org/blog/2023-11-15-slora/ Wed, 15 Nov 2023 00:00:00 +0000 Catch me if you can! How to beat GPT-4 with a 13B model https://lmsys.org/blog/2023-11-14-llm-decontaminator/ Announcing Llama-rephraser: 13B models reaching GPT-4 performance in major benchmarks (MMLU/GSM-8K/HumanEval)! https://lmsys.org/blog/2023-11-14-llm-decontaminator/ Tue, 14 Nov 2023 00:00:00 +0000 ToxicChat: A Benchmark for Content Moderation in Real-world User-AI Interactions https://lmsys.org/blog/2023-10-30-toxicchat/ In this blogpost, we introduce ToxicChat, a benchmark consisting of 10K high-quality data points for content moderation in real-world user-AI interactions. Evaluation results show that fine-tuning on this benchmark notably improves a baseline model’s ability to detect toxic queries in user-AI interactions.
https://lmsys.org/blog/2023-10-30-toxicchat/ Mon, 30 Oct 2023 00:00:00 +0000 Chatbot Arena Conversation Dataset Release https://lmsys.org/blog/2023-07-20-dataset/ Since its launch three months ago, [Chatbot Arena](https://lmsys.org/blog/2023-05-03-arena/) has become a widely cited LLM evaluation platform that emphasizes large-scale, community-based, and interactive human evaluation. In that short time span, we collected around 53K votes from 19K unique IP addresses for 22 models. https://lmsys.org/blog/2023-07-20-dataset/ Thu, 20 Jul 2023 00:00:00 +0000 How Long Can Open-Source LLMs Truly Promise on Context Length? https://lmsys.org/blog/2023-06-29-longchat/ In this blogpost, we introduce our latest series of chatbot models, LongChat-7B and LongChat-13B, featuring a new level of extended context length up to 16K tokens. https://lmsys.org/blog/2023-06-29-longchat/ Thu, 29 Jun 2023 00:00:00 +0000 Chatbot Arena Leaderboard Week 8: Introducing MT-Bench and Vicuna-33B https://lmsys.org/blog/2023-06-22-leaderboard/ In this blog post, we share the latest update on Chatbot Arena leaderboard, which now includes more open models and three metrics: https://lmsys.org/blog/2023-06-22-leaderboard/ Thu, 22 Jun 2023 00:00:00 +0000 Building a Truly "Open" OpenAI API Server with Open Models Locally https://lmsys.org/blog/2023-06-09-api-server/ Many applications have been built on closed-source OpenAI APIs, but now you can effortlessly port them to use open-source alternatives without modifying the code. [FastChat](https://github.com/lm-sys/FastChat)'s OpenAI-compatible API server enables this seamless transition.
https://lmsys.org/blog/2023-06-09-api-server/ Fri, 09 Jun 2023 00:00:00 +0000 Chatbot Arena Leaderboard Updates (Week 4) https://lmsys.org/blog/2023-05-25-leaderboard/ In this update, we are excited to welcome the following models joining the [Chatbot Arena](https://lmsys.org/blog/2023-05-03-arena/): https://lmsys.org/blog/2023-05-25-leaderboard/ Thu, 25 May 2023 00:00:00 +0000 Chatbot Arena Leaderboard Updates (Week 2) https://lmsys.org/blog/2023-05-10-leaderboard/ We release an updated leaderboard with more models and new data we collected last week, after the announcement of the anonymous [Chatbot Arena](https://lmsys.org/blog/2023-05-03-arena/). We are actively iterating on the design of the arena and leaderboard scores. https://lmsys.org/blog/2023-05-10-leaderboard/ Wed, 10 May 2023 00:00:00 +0000 Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings https://lmsys.org/blog/2023-05-03-arena/ We present Chatbot Arena, a benchmark platform for large language models (LLMs) that features anonymous, randomized battles in a crowdsourced manner. In this blog post, we are releasing our initial results and a leaderboard based on the Elo rating system, which is a widely used rating system in chess and other competitive games. We invite the entire community to join this effort by contributing new models and evaluating them by asking questions and voting for your favorite answer. https://lmsys.org/blog/2023-05-03-arena/ Wed, 03 May 2023 00:00:00 +0000 Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality https://lmsys.org/blog/2023-03-30-vicuna/ We introduce Vicuna-13B, an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. Preliminary evaluation using GPT-4 as a judge shows Vicuna-13B achieves more than 90%* quality of OpenAI ChatGPT and Google Bard while outperforming other models like LLaMA and Stanford Alpaca in more than 90%* of cases.
The cost of training Vicuna-13B is around $300. The [code](https://github.com/lm-sys/FastChat) and [weights](https://github.com/lm-sys/FastChat#vicuna-weights), along with an online [demo](https://chat.lmsys.org), are publicly available for non-commercial use. https://lmsys.org/blog/2023-03-30-vicuna/ Thu, 30 Mar 2023 00:00:00 +0000