--- title: gRPC Workers --- # gRPC Workers When workers connect via gRPC instead of HTTP, SMG becomes a full OpenAI-compatible server — handling tokenization, chat templates, reasoning extraction, and tool calling at the gateway level. Workers run raw inference only.
#### Before you begin - Completed the [Getting Started](index.md) guide - A gRPC-capable inference worker (vLLM with gRPC entrypoint) - Access to the model weights or a HuggingFace model path (for tokenizer loading)
--- ## What gRPC Mode Enables | Capability | HTTP Mode (worker handles) | gRPC Mode (gateway handles) | |------------|---------------------------|----------------------------| | Chat templates | Worker | Gateway | | Tokenization | Worker | Gateway (with caching) | | Load balancing | Request-level | Token-aware | | Reasoning extraction | Worker | Gateway | | Tool call parsing | Worker | Gateway | | MCP tool execution (Responses API) | N/A | Gateway | In HTTP mode, SMG is a smart proxy — routing and failover only. In gRPC mode, SMG takes over the full request processing pipeline. --- ## Start a gRPC Worker === "SGLang" ```bash python -m sglang.launch_server \ --model-path meta-llama/Llama-3.1-8B-Instruct \ --host 0.0.0.0 \ --port 50051 \ --grpc-mode ``` === "vLLM" ```bash python -m vllm.entrypoints.grpc_server \ --model meta-llama/Llama-3.1-8B-Instruct \ --host 0.0.0.0 \ --port 50051 ``` === "TensorRT-LLM" ```bash python -m tensorrt_llm.commands.serve serve \ meta-llama/Llama-3.1-8B-Instruct \ --grpc \ --host 0.0.0.0 \ --port 50051 \ --backend pytorch ``` --- ## Connect SMG Point SMG at the gRPC worker using `grpc://` URLs and provide `--model-path` so the gateway can load the tokenizer: ```bash smg \ --worker-urls grpc://localhost:50051 \ --model-path meta-llama/Llama-3.1-8B-Instruct \ --host 0.0.0.0 \ --port 30000 ``` !!! note "`--model-path` is required" The gateway needs the tokenizer to apply chat templates, count tokens for load balancing, and parse tool calls. This can be a HuggingFace model ID or a local path. The API is still OpenAI-compatible — clients send the same requests as with HTTP workers: ```bash curl http://localhost:30000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [ {"role": "user", "content": "Hello!"} ] }' ``` --- ## Multiple gRPC Workers ```bash smg \ --worker-urls grpc://worker1:50051 grpc://worker2:50052 \ --model-path meta-llama/Llama-3.1-8B-Instruct \ --policy round_robin ``` --- ## Reasoning Extraction For thinking models (DeepSeek-R1, Qwen3, etc.), SMG can extract chain-of-thought content into a separate field: ```bash smg \ --worker-urls grpc://worker:50051 \ --model-path deepseek-ai/DeepSeek-R1 \ --reasoning-parser deepseek_r1 ``` The parser is auto-detected from the model name by default. Override with `--reasoning-parser` if needed. Request with `separate_reasoning: true`: ```bash curl http://localhost:30000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "deepseek-ai/DeepSeek-R1", "messages": [{"role": "user", "content": "What is 25 * 37?"}], "separate_reasoning": true }' ``` Response includes both fields: ```json { "choices": [{ "message": { "role": "assistant", "content": "925", "reasoning_content": "Let me calculate 25 * 37 step by step..." } }] } ``` ### Supported Reasoning Parsers Auto-detected from the model name. Override with `--reasoning-parser` if needed. | Parser | Models | |--------|--------| | `deepseek_r1` | DeepSeek-R1 | | `qwen3` | Qwen3 | | `qwen3_thinking` | Qwen3-Thinking | | `kimi` | Kimi | | `glm45` | GLM-4.5, GLM-4.7 | | `step3` | Step-3 | | `minimax` | MiniMax, MiniMax-M2 | | `cohere_cmd` | Command-R, Command-A, C4AI | --- ## Tool Calling In gRPC mode, SMG parses function calls from model output: ```bash smg \ --worker-urls grpc://worker:50051 \ --model-path meta-llama/Llama-3.2-70B-Instruct \ --tool-call-parser llama ``` For MCP tool execution in Responses API, see the dedicated guide: ```bash # See: # Getting Started → MCP in Responses API # /v1/responses + --mcp-config-path ``` ### Supported Tool Call Parsers Auto-detected from the model name. Override with `--tool-call-parser` if needed. | Parser | Models | |--------|--------| | `json` | GPT-4/4o, Claude, Gemini, Gemma, Llama (generic) | | `llama` | Llama 3.2 | | `pythonic` | Llama 4, DeepSeek (generic) | | `deepseek` | DeepSeek-V3 | | `mistral` | Mistral, Mixtral | | `qwen` | Qwen | | `qwen_xml` | Qwen3-Coder, Qwen3.5+ | | `glm45_moe` | GLM-4.5, GLM-4.6 | | `glm47_moe` | GLM-4.7 | | `step3` | Step-3 | | `kimik2` | Kimi-K2 | | `minimax_m2` | MiniMax | | `cohere` | Command-R, Command-A, C4AI | --- ## HTTP vs gRPC: When to Use Which | Use Case | Recommended Mode | |----------|-----------------| | Workers already run OpenAI servers (SGLang, vLLM HTTP) | HTTP | | You need gateway-level tool parsing or Responses MCP | gRPC | | You want token-aware load balancing | gRPC | | You use thinking models and want reasoning extraction | gRPC | | Simplest possible setup | HTTP | --- ## Next Steps - [gRPC Pipeline Concepts](../concepts/architecture/grpc-pipeline.md) — Full pipeline architecture, all supported parsers - [Tokenizer Caching](../concepts/performance/tokenizer-caching.md) — Two-level cache for reduced CPU overhead - [MCP in Responses API](mcp.md) — Configure Model Context Protocol servers for `/v1/responses` - [PD Disaggregation](pd-disaggregation.md) — Separate prefill and decode with gRPC workers