--- name: slime-user description: Guide for using SLIME (LLM post-training framework for RL Scaling). Use when working with SLIME for reinforcement learning training of language models, including setup, configuration, training execution, multi-turn interactions, custom reward models, tool calling scenarios, or troubleshooting SLIME workflows. Covers GRPO, GSPO, PPO, Reinforce++, multi-agent RL, VLM training, FSDP/Megatron backends, SGLang integration, dynamic sampling, and custom generation functions. --- # SLIME User Guide SLIME is an LLM post-training framework for RL Scaling developed by THUDM. It supports various RL algorithms (GRPO, GSPO, PPO, Reinforce++), multiple training backends (Megatron, FSDP), and advanced features like multi-turn interactions, tool calling, and dynamic sampling. ## Quick Start Workflow ### For First-Time Users 1. **Environment Setup** - Use Docker: `docker pull slimerl/slime:latest` - Or build from source: See `docs/en/get_started/quick_start.md` - Hardware: Supports H100/H200, B200 series 2. **Download Model and Data** ```bash hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /root/dapo-math-17k ``` 3. **Convert Weights** (Megatron backend only) ```bash source scripts/models/qwen3-4B.sh PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \ ${MODEL_ARGS[@]} \ --hf-checkpoint /root/Qwen3-4B \ --save /root/Qwen3-4B_torch_dist ``` 4. **Run Training** ```bash bash scripts/run-qwen3-4B.sh ``` ### For Experienced Users When user needs specific functionality: - **Multi-turn/tool calling**: Read [references/examples_reference.md](references/examples_reference.md) Search-R1 section - **Custom reward models**: See custom RM pattern in examples reference - **FSDP instead of Megatron**: Use `--train-backend fsdp`, skip weight conversion - **Large-scale training**: See multi-node examples (GLM-4.5, DeepSeek-R1) - **Source code exploration**: Check [references/source_code_reference.md](references/source_code_reference.md) ## Documentation Navigation SLIME has extensive documentation. Use this guide to find what you need quickly. ### Essential Documentation (Read These First) 1. **Quick Start Guide**: `docs/en/get_started/quick_start.md` - Setup and first training run 2. **Usage Guide**: `docs/en/get_started/usage.md` - Comprehensive parameter reference 3. **Example Docs**: `docs/en/examples/qwen3-4B.md` or `docs/en/examples/glm4-9B.md` For detailed navigation of all documentation, see [references/doc_navigation.md](references/doc_navigation.md). ### Common Tasks → Documentation Mapping | Task | Documentation | |------|---------------| | First-time setup | `docs/en/get_started/quick_start.md` | | Understanding parameters | `docs/en/get_started/usage.md` | | Basic training (8 GPUs) | `docs/en/examples/qwen3-4B.md` | | Multi-turn tool use | `examples/search-r1/` | | Custom generation logic | `docs/en/get_started/customization.md` | | Multi-node training | `docs/en/examples/glm4.5-355B-A32B.md` | | FSDP backend | `docs/en/get_started/usage.md` (FSDP section) | | VLM training | `examples/geo3k_vlm/` | | Troubleshooting | `docs/en/get_started/qa.md` | ## Core Concepts ### Training Loop SLIME uses a "Rollout → Train" loop: 1. **Rollout**: Generate responses using SGLang inference 2. **Reward**: Compute rewards using reward model 3. **Train**: Update model weights using Megatron/FSDP 4. Repeat for `--num-rollout` iterations ### Key Constraint ``` rollout-batch-size × n-samples-per-prompt = global-batch-size × num-steps-per-rollout ``` ### Resource Allocation Modes **Colocated** (training and inference share GPUs): ```bash --actor-num-nodes 1 \ --actor-num-gpus-per-node 8 \ --colocate \ --sglang-mem-fraction-static 0.7 ``` **Disaggregated** (separate GPUs for training/inference): ```bash --actor-num-nodes 1 \ --actor-num-gpus-per-node 4 \ --rollout-num-gpus 4 ``` ## Parameter Quick Reference ### Essential Parameters **Model Loading**: - `--hf-checkpoint`: HuggingFace model path (for SGLang and FSDP) - `--ref-load`: Megatron reference model checkpoint - `--load`: Megatron actor checkpoint (resume training) - `--save`: Save path for checkpoints **Data**: - `--prompt-data`: JSONL dataset path - `--input-key`: Field name for prompts (default: "prompt") - `--label-key`: Field name for labels (default: "label") - `--metadata-key`: Field name for metadata (default: "metadata") - `--apply-chat-template`: Apply tokenizer chat template **Rollout**: - `--rollout-batch-size`: Prompts per rollout - `--n-samples-per-prompt`: Responses per prompt - `--rollout-max-response-len`: Max response length - `--rollout-temperature`: Sampling temperature **Training**: - `--num-rollout`: Total training iterations - `--num-steps-per-rollout`: Optimizer steps per rollout (default: 1) - `--global-batch-size`: Samples per optimizer step - `--advantage-estimator`: RL algorithm (grpo, gspo, ppo, reinforce_plus_plus) **Reward Model**: - `--rm-type`: Built-in RM type (e.g., "deepscaler") - `--custom-rm-path`: Custom RM function path **Backends**: - `--train-backend`: Training backend (megatron or fsdp) - `--rollout-num-gpus-per-engine`: GPUs per SGLang engine (like tp_size) For complete parameter reference, see `docs/en/get_started/usage.md`. ## Common Workflows ### 1. Standard Single-Turn Training Use example scripts as templates: - `scripts/run-qwen3-4B.sh`: Basic 8xH100 setup - `scripts/run-glm4-9B.sh`: With dynamic sampling Key sections in script: ```bash # Load model config source scripts/models/qwen3-4B.sh # Configure checkpoints CKPT_ARGS=(--hf-checkpoint /root/Qwen3-4B ...) # Configure rollout ROLLOUT_ARGS=( --rollout-batch-size 32 --n-samples-per-prompt 8 --rm-type deepscaler ) # Configure algorithm GRPO_ARGS=(--advantage-estimator grpo ...) # Run training ray job submit ... -- python3 train.py \ ${MODEL_ARGS[@]} ${CKPT_ARGS[@]} ${ROLLOUT_ARGS[@]} ... ``` ### 2. Multi-Turn Tool Calling For multi-turn scenarios (like Search-R1): 1. **Prepare Data** with metadata: ```json { "question": "User query", "final_answer": "Expected answer", "metadata": "{\"session_id\": \"123\", \"tool_code\": \"...\"}" } ``` 2. **Implement Custom Generation Function**: ```python async def generate(args, sample: Sample, sampling_params) -> Sample: for turn in range(max_turns): # Generate action model_output = await call_sglang(...) sample.loss_mask += [1] * len(model_tokens) # Train on actions # Execute tool tool_output = await execute_tool(...) sample.loss_mask += [0] * len(tool_tokens) # Mask tool outputs if action == "answer": break sample.tokens = prompt_tokens + response_tokens sample.response_length = len(response_tokens) return sample ``` 3. **Configure Custom Functions**: ```bash --custom-generate-function-path my_module.generate \ --custom-rm-path my_module.reward_func \ --metadata-key metadata ``` See `examples/search-r1/` for complete example. ### 3. Dynamic Sampling (DAPO-style) Filter low-quality samples during generation: ```bash ROLLOUT_ARGS+=( --over-sampling-batch-size 64 \ --rollout-batch-size 32 \ --dynamic-sampling-filter-path \ slime.rollout.filter_hub.dynamic_sampling_filters.check_reward_nonzero_std ) ``` How it works: - Samples 64 prompts (over-sampling) - Filters groups based on reward diversity - Keeps only 32 prompts × 8 samples that pass filter - Automatically resamples if too many filtered out ### 4. FSDP Backend (No Weight Conversion) ```bash --train-backend fsdp \ --hf-checkpoint /root/Qwen3-4B \ --gradient-checkpointing \ --context-parallel-size 2 ``` Benefits: - No HF → Megatron weight conversion needed - Directly load HuggingFace checkpoints - Simpler setup for supported models See `examples/geo3k_vlm/` and `docs/en/get_started/usage.md` FSDP section. ### 5. Multi-Node Training 1. Start Ray cluster: ```bash # Head node ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 # Worker nodes ray start --address=${MASTER_ADDR}:6379 --num-gpus 8 ``` 2. Submit job: ```bash ray job submit --address="http://127.0.0.1:8265" \ --runtime-env-json='{"env_vars": {"PYTHONPATH": "/root/Megatron-LM/"}}' \ -- python3 train.py \ --actor-num-nodes 8 \ --actor-num-gpus-per-node 8 \ ... ``` See `docs/en/examples/glm4.5-355B-A32B.md` for large-scale example. ## Customization Guide ### Custom Reward Model Implement async function: ```python async def my_reward_func(args, sample: Sample, **kwargs) -> float: # Access sample fields prompt = sample.prompt response = sample.response label = sample.label # Compute reward reward = compute_score(response, label) return float(reward) ``` Use with: `--custom-rm-path module.path:my_reward_func` ### Custom Generation Function Implement async function: ```python async def my_generate(args, sample: Sample, sampling_params) -> Sample: # Load tokenizer from slime.utils.processing_utils import load_tokenizer tokenizer = load_tokenizer(args.hf_checkpoint, trust_remote_code=True) # Generate response (call SGLang API or custom logic) from slime.utils.http_utils import post output = await post( f"http://{args.sglang_router_ip}:{args.sglang_router_port}/generate", {"text": sample.prompt, "sampling_params": sampling_params} ) # Set sample fields prompt_tokens = tokenizer(sample.prompt, add_special_tokens=False)["input_ids"] response_tokens = tokenizer(output["text"], add_special_tokens=False)["input_ids"] sample.tokens = prompt_tokens + response_tokens sample.response_length = len(response_tokens) sample.response = output["text"] sample.truncated = output["meta_info"]["finish_reason"]["type"] == "length" return sample ``` Use with: `--custom-generate-function-path module.path:my_generate` ### Custom Dynamic Filter Implement filter function: ```python def my_filter(args, samples: list[Sample], **kwargs) -> bool: # Return True to keep samples, False to discard return all(sample.reward > 0.5 for sample in samples) ``` Use with: `--dynamic-sampling-filter-path module.path:my_filter` ## Examples Reference For detailed examples and patterns, see [references/examples_reference.md](references/examples_reference.md). Quick finder: - **Basic math training**: `scripts/run-qwen3-4B.sh` - **Multi-turn tool use**: `examples/search-r1/` - **Vision-language RL**: `examples/geo3k_vlm/` - **Large-scale MOE**: `docs/en/examples/glm4.5-355B-A32B.md` - **Custom generation**: `examples/search-r1/search_r1_logic.py` - **FSDP backend**: `examples/geo3k_vlm/` ## Source Code Reference For source code exploration, see [references/source_code_reference.md](references/source_code_reference.md). Key files: - **Arguments**: `slime/utils/arguments.py` - **Rollout**: `slime/rollout/sglang_rollout.py` - **Sample type**: `slime/utils/types.py` - **Reward models**: `slime/rollout/rm_hub/` - **Conversion tools**: `tools/convert_hf_to_torch_dist.py` ## Troubleshooting ### Common Issues **OOM during colocated training**: - Reduce `--sglang-mem-fraction-static` (try 0.7 or 0.6) - Reduce `--max-tokens-per-gpu` - Enable gradient checkpointing: `--recompute-granularity full` **Mismatched batch sizes**: - Ensure: `rollout-batch-size × n-samples-per-prompt = global-batch-size × num-steps-per-rollout` **Weight conversion errors**: - Check model config matches exactly (e.g., `--rotary-base`) - Use FSDP backend to skip conversion: `--train-backend fsdp` **Multi-node communication issues**: - Set environment variables: `GLOO_SOCKET_IFNAME`, `NCCL_SOCKET_IFNAME` - See `docs/en/get_started/quick_start.md` multi-node section **SGLang concurrency issues**: - Limit concurrency: `--sglang-server-concurrency 160` - Increase CUDA graphs: `--sglang-cuda-graph-bs 1 2 4 8 $(seq 16 8 256)` For more troubleshooting, see `docs/en/get_started/qa.md`. ## Additional Resources ### Reference Files - **Doc Navigation**: [references/doc_navigation.md](references/doc_navigation.md) - Find documentation quickly - **Examples Reference**: [references/examples_reference.md](references/examples_reference.md) - Example scripts and patterns - **Source Code Reference**: [references/source_code_reference.md](references/source_code_reference.md) - Code structure and key functions ### External Links - **GitHub Repository**: https://github.com/THUDM/slime - **Docker Image**: `slimerl/slime:latest` - **Megatron-LM**: https://github.com/NVIDIA/Megatron-LM - **SGLang**: https://github.com/sgl-project/sglang