---
name: sglang
description: Fast structured generation and serving for LLMs with RadixAttention prefix caching. Use for JSON/regex outputs, constrained decoding, agentic workflows with tool calls, or when you need up to 5× faster inference than vLLM on prefix-heavy workloads. Powers 300,000+ GPUs at xAI, AMD, NVIDIA, and LinkedIn.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Inference Serving, SGLang, Structured Generation, RadixAttention, Prefix Caching, Constrained Decoding, Agents, JSON Output, Fast Inference, Production Scale]
dependencies: [sglang, torch, transformers]
---

# SGLang

High-performance serving framework for LLMs and VLMs with RadixAttention for automatic prefix caching.

## When to use SGLang

**Use SGLang when:**
- Need structured outputs (JSON, regex, grammar)
- Building agents with repeated prefixes (system prompts, tools)
- Agentic workflows with function calling
- Multi-turn conversations with shared context
- Need faster JSON decoding (3× vs standard)

**Use vLLM instead when:**
- Simple text generation without structure
- Don't need prefix caching
- Want a mature, widely tested production system

**Use TensorRT-LLM instead when:**
- Lowest single-request latency is the priority (little or no batching)
- NVIDIA-only deployment
- Need FP8/INT4 quantization on H100

## Quick start

### Installation

```bash
# pip install (recommended)
pip install "sglang[all]"

# With FlashInfer (faster attention kernels, CUDA 11.8/12.1)
pip install "sglang[all]" flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

# From source
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```

### Launch server

```bash
# Basic server (Llama 3-8B); RadixAttention prefix caching is enabled by default
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3-8B-Instruct \
    --port 30000

# Turn prefix caching off (e.g. for benchmarking)
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3-8B-Instruct \
    --port 30000 \
    --disable-radix-cache

# Multi-GPU (tensor parallelism)
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3-70B-Instruct \
    --tp 4 \
    --port 30000
```

### Basic inference

```python
import sglang as sgl

# Point the frontend at the local SGLang server
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

# Simple generation
@sgl.function
def simple_gen(s, question):
    s += "Q: " + question + "\n"
    s += "A:" + sgl.gen("answer", max_tokens=100)

# Run
state = simple_gen.run(question="What is the capital of France?")
print(state["answer"])
# Output: "The capital of France is Paris."
```

### Structured JSON output

```python
import sglang as sgl

@sgl.function
def extract_person(s, text):
    s += f"Extract person information from: {text}\n"
    s += "Output JSON:\n"

    # Constrained JSON generation
    s += sgl.gen(
        "json_output",
        max_tokens=200,
        regex=r'\{"name": "[^"]+", "age": \d+, "occupation": "[^"]+"\}'
    )

# Run
state = extract_person.run(
    text="John Smith is a 35-year-old software engineer."
)
print(state["json_output"])
# Output: {"name": "John Smith", "age": 35, "occupation": "software engineer"}
```

## RadixAttention (Key Innovation)

**What it does**: Automatically caches and reuses common prefixes across requests.

**Performance**:
- **5× faster** for agentic workloads with shared system prompts
- **10× faster** for few-shot prompting with repeated examples
- **Zero configuration** - works automatically

**How it works** (sketched in code after the example below):
1. Builds a radix tree over all processed token sequences
2. Automatically detects shared prefixes across requests
3. Reuses the KV cache for matching prefixes
4. Only computes the new tokens

**Example** (Agent with system prompt):

```
Request 1: [SYSTEM_PROMPT] + "What's the weather?"
→ Computes the full prompt (1000 tokens)

Request 2: [SAME_SYSTEM_PROMPT] + "Book a flight"
→ Reuses the system-prompt KV cache (~998 shared tokens)
→ Only computes the few new tokens
→ ~5× faster
```
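The steps above can be pictured with a plain token trie. The snippet below is a toy illustration of prefix matching only, not SGLang's actual radix-tree or KV-cache code; the token IDs and lengths are made up.

```python
# Toy illustration of prefix matching (NOT SGLang's implementation):
# a plain token trie that reports how many leading tokens a new request
# shares with earlier requests, i.e. how much KV cache could be reused.
class TrieNode:
    def __init__(self):
        self.children = {}  # token id -> TrieNode


class PrefixIndex:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())

    def match_prefix(self, tokens):
        """Return how many leading tokens are already indexed."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched += 1
        return matched


system_prompt = list(range(1000))          # pretend 1000-token system prompt
request_1 = system_prompt + [2001, 2002]   # "What's the weather?"
request_2 = system_prompt + [3001, 3002]   # "Book a flight"

index = PrefixIndex()
index.insert(request_1)
print(index.match_prefix(request_2))  # 1000 -> only the 2 new tokens need compute
```

The real system attaches KV-cache entries to the tree nodes, compresses token runs into radix-tree edges, and evicts cold branches when GPU memory runs low; none of that is shown here.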
## Structured generation patterns

### JSON with schema

```python
@sgl.function
def structured_extraction(s, article):
    s += f"Article: {article}\n\n"
    s += "Extract key information as JSON:\n"

    # JSON schema constraint
    schema = {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "author": {"type": "string"},
            "summary": {"type": "string"},
            "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]}
        },
        "required": ["title", "author", "summary", "sentiment"]
    }

    s += sgl.gen("info", max_tokens=300, json_schema=schema)

state = structured_extraction.run(article="...")
print(state["info"])
# Output: valid JSON matching the schema
```

### Regex-constrained generation

```python
@sgl.function
def extract_email(s, text):
    s += f"Extract email from: {text}\n"
    s += "Email: "

    # Email regex pattern
    s += sgl.gen(
        "email",
        max_tokens=50,
        regex=r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    )

state = extract_email.run(text="Contact john.doe@example.com for details")
print(state["email"])
# Output: "john.doe@example.com"
```

### Grammar-based generation

```python
@sgl.function
def generate_code(s, description):
    s += f"Generate Python code for: {description}\n"
    s += "```python\n"

    # EBNF grammar for a minimal Python function definition
    python_grammar = """
    ?start: function_def
    function_def: "def" NAME "(" [parameters] "):" suite
    parameters: parameter ("," parameter)*
    parameter: NAME
    suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT
    """

    s += sgl.gen("code", max_tokens=200, grammar=python_grammar)
    s += "\n```"
```

## Agent workflows with function calling

```python
import sglang as sgl

# Define tools
tools = [
    {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            }
        }
    },
    {
        "name": "book_flight",
        "description": "Book a flight",
        "parameters": {
            "type": "object",
            "properties": {
                "from": {"type": "string"},
                "to": {"type": "string"},
                "date": {"type": "string"}
            }
        }
    }
]

@sgl.function
def agent_workflow(s, user_query, tools):
    # System prompt (cached with RadixAttention)
    s += "You are a helpful assistant with access to tools.\n"
    s += f"Available tools: {tools}\n\n"

    # User query
    s += f"User: {user_query}\n"
    s += "Assistant: "

    # Generate with function calling
    s += sgl.gen(
        "response",
        max_tokens=200,
        tools=tools,  # SGLang handles the tool-call format
        stop=["User:", "\n\n"]
    )

# Multiple queries reuse the system prompt
state1 = agent_workflow.run(
    user_query="What's the weather in NYC?",
    tools=tools
)  # First call: computes the full system prompt

state2 = agent_workflow.run(
    user_query="Book a flight to LA",
    tools=tools
)  # Second call: reuses the system prompt (~5× faster)
```

## Performance benchmarks

### RadixAttention speedup

**Few-shot prompting** (10 examples in the prompt):
- vLLM: 2.5 sec/request
- SGLang: **0.25 sec/request** (10× faster)
- Throughput: 4× higher

**Agent workflows** (1000-token system prompt):
- vLLM: 1.8 sec/request
- SGLang: **0.35 sec/request** (5× faster)

**JSON decoding**:
- Standard: 45 tok/s
- SGLang: **135 tok/s** (3× faster)

### Throughput (Llama 3-8B, A100)

| Workload | vLLM | SGLang | Speedup |
|----------|------|--------|---------|
| Simple generation | 2500 tok/s | 2800 tok/s | 1.12× |
| Few-shot (10 examples) | 500 tok/s | 5000 tok/s | 10× |
| Agent (tool calls) | 800 tok/s | 4000 tok/s | 5× |
| JSON output | 600 tok/s | 2400 tok/s | 4× |
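These numbers depend heavily on model, hardware, and prompt shape, so they are worth reproducing on your own deployment. A rough way to see the prefix-cache effect is to time two requests that share a long system prompt through the OpenAI-compatible endpoint (covered in detail below); the server address, model name, and prompt here are placeholders.

```python
import time
from openai import OpenAI

# Assumes a local server: python -m sglang.launch_server --model-path <model> --port 30000
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Long shared system prompt so the cached prefix dominates the request
system_prompt = "You are a meticulous assistant. " + "Follow the rules carefully. " * 200


def timed_request(user_msg):
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="default",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_msg},
        ],
        max_tokens=32,
    )
    return time.perf_counter() - start, resp.choices[0].message.content


t1, _ = timed_request("What's the weather in NYC?")  # computes the full prefix
t2, _ = timed_request("Book a flight to LA.")        # should reuse the cached prefix
print(f"first: {t1:.2f}s  second: {t2:.2f}s")
```

Prefill time should drop for the second request in rough proportion to the shared prefix; decoding the new completion tokens costs the same either way.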
## Multi-turn conversations

```python
@sgl.function
def multi_turn_chat(s, history, new_message):
    # System prompt (always cached)
    s += "You are a helpful AI assistant.\n\n"

    # Conversation history (cached as it grows)
    for msg in history:
        s += f"{msg['role']}: {msg['content']}\n"

    # New user message (only the new part is computed)
    s += f"User: {new_message}\n"
    s += "Assistant: "
    s += sgl.gen("response", max_tokens=200)

# Turn 1
history = []
state = multi_turn_chat.run(history=history, new_message="Hi there!")
history.append({"role": "User", "content": "Hi there!"})
history.append({"role": "Assistant", "content": state["response"]})

# Turn 2 (reuses Turn 1 KV cache)
state = multi_turn_chat.run(history=history, new_message="What's 2+2?")
# Only computes the new message (much faster)
history.append({"role": "User", "content": "What's 2+2?"})
history.append({"role": "Assistant", "content": state["response"]})

# Turn 3 (reuses Turn 1 + Turn 2 KV cache)
state = multi_turn_chat.run(history=history, new_message="Tell me a joke")
# Progressively faster as the history grows
```

## Advanced features

### Speculative decoding

```bash
# Launch with a draft model (2-3× faster decoding)
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3-70B-Instruct \
    --speculative-model meta-llama/Meta-Llama-3-8B-Instruct \
    --speculative-num-steps 5
```

### Multi-modal (vision models)

```python
@sgl.function
def describe_image(s, image_path):
    s += sgl.image(image_path)
    s += "Describe this image in detail: "
    s += sgl.gen("description", max_tokens=200)

state = describe_image.run(image_path="photo.jpg")
print(state["description"])
```

### Batching and parallel requests

```python
# Automatic batching (continuous batching): run the same program
# over a list of argument sets
states = simple_gen.run_batch(
    [
        {"question": "What is AI?"},
        {"question": "What is ML?"},
        {"question": "What is DL?"},
    ]
)
# All 3 processed in a single batch (efficient)
```

## OpenAI-compatible API

```bash
# Start the server with the OpenAI-compatible API
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3-8B-Instruct \
    --port 30000

# Use with any OpenAI-style HTTP client
curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "default",
        "messages": [
            {"role": "system", "content": "You are helpful"},
            {"role": "user", "content": "Hello"}
        ],
        "temperature": 0.7,
        "max_tokens": 100
    }'
```

```python
# Works with the OpenAI Python SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello"}]
)
```

## Supported models

**Text models**:
- Llama 2, Llama 3, Llama 3.1, Llama 3.2
- Mistral, Mixtral
- Qwen, Qwen2, QwQ
- DeepSeek-V2, DeepSeek-V3
- Gemma, Phi-3

**Vision models**:
- LLaVA, LLaVA-OneVision
- Phi-3-Vision
- Qwen2-VL

**100+ models** from HuggingFace

## Hardware support

- **NVIDIA**: A100, H100, L4, T4 (CUDA 11.8+)
- **AMD**: MI300, MI250 (ROCm 6.0+)
- **Intel**: Xeon with GPU (coming soon)
- **Apple**: M1/M2/M3 via MPS (experimental)

## References

- **[Structured Generation Guide](references/structured-generation.md)** - JSON schemas, regex, grammars, validation
- **[RadixAttention Deep Dive](references/radix-attention.md)** - How it works, optimization, benchmarks
- **[Production Deployment](references/deployment.md)** - Multi-GPU, monitoring, autoscaling

## Resources

- **GitHub**: https://github.com/sgl-project/sglang
- **Docs**: https://sgl-project.github.io/
- **Paper**: RadixAttention (arXiv:2312.07104)
- **Discord**: https://discord.gg/sglang