---
title: gRPC Pipeline
---
# gRPC Pipeline
When workers communicate via gRPC, SMG acts as a complete OpenAI-compatible server: its request processing pipeline handles chat templating, tokenization, reasoning extraction, tool call parsing, and MCP execution.
---
## Overview
### :material-comment-processing: Chat Templates
Apply model-specific chat templates with full Jinja2 support for all major model families.
### :material-memory: Tokenization Caching
Two-level tokenization cache reduces CPU overhead by 60-90% for repeated content.
### :material-brain: Reasoning Extraction
Extract chain-of-thought content from thinking models (DeepSeek-R1, Qwen3, etc.).
### :material-function: Tool Call Parsing
Parse function calls and execute MCP tools with automatic result injection.
---
## Pipeline Architecture

### :material-lightning-bolt: gRPC Mode
**Gateway = Full Server**
SMG handles tokenization, chat templates, tool parsing, MCP loops, and detokenization. Workers run raw inference.
### :material-swap-horizontal: HTTP Mode
**Gateway = Smart Proxy**
SMG handles routing, load balancing, and failover. Workers run full OpenAI-compatible servers.
### Responsibility Comparison
| Capability | gRPC Mode (Gateway) | HTTP Mode (Worker) |
|------------|--------------------|--------------------|
| Chat template | Gateway | Worker |
| Tokenization | Gateway (cached) | Worker |
| Load balancing | Token-aware | Request count |
| Reasoning extraction | Gateway | Worker |
| Tool call parsing | Gateway | Worker |
| MCP execution | Gateway | N/A |
---
## Reasoning Parsers
Reasoning parsers extract chain-of-thought content from model outputs. They are essential for models that emit thinking tokens before their final response.
### Configuration
| Option | Default |
|--------|---------|
| `--reasoning-parser` | Auto-detected from model name |
### Supported Parsers
**DeepSeek-R1**
- Pattern: `*deepseek-r1*`
- Initial state: In reasoning
- Tokens: `</think>` to exit
```bash
smg --reasoning-parser deepseek_r1
```
**Qwen3**
- Pattern: `*qwen3*`
- Initial state: Not in reasoning
- Tokens: `<think>` / `</think>`
```bash
smg --reasoning-parser qwen3
```
**Kimi**
- Pattern: `*kimi*`
- Initial state: Not in reasoning
- Tokens: Unicode markers
```bash
smg --reasoning-parser kimi
```
**GLM-4.5**
- Pattern: `*glm45*`, `*glm47*`
- Initial state: Not in reasoning
- Tokens: `<think>` / `</think>`
```bash
smg --reasoning-parser glm45
```
### Complete Parser Reference
| Parser | Model Pattern | Initial State | Tokens |
|--------|--------------|---------------|--------|
| `deepseek_r1` | `*deepseek-r1*` | In reasoning | `</think>` |
| `qwen3` | `*qwen3*` | Not in reasoning | `<think>` / `</think>` |
| `qwen3_thinking` | `*qwen-thinking*` | In reasoning | `<think>` / `</think>` |
| `kimi` | `*kimi*` | Not in reasoning | Unicode markers |
| `glm45` | `*glm45*`, `*glm47*` | Not in reasoning | `<think>` / `</think>` |
| `step3` | `*step3*` | In reasoning | `<think>` / `</think>` |
| `minimax` | `*minimax*`, `*mm-m2*` | In reasoning | `<think>` appended |
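
Auto-detection from the model name can be pictured as a glob match against the patterns in the table. The following is a minimal sketch, not SMG's actual implementation: the function name `detect_reasoning_parser`, the case-insensitive comparison, and the first-match-wins ordering are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Patterns copied from the parser reference table.
# The ordering (first match wins) is an assumption of this sketch.
REASONING_PARSER_PATTERNS = [
    ("deepseek_r1", ["*deepseek-r1*"]),
    ("qwen3_thinking", ["*qwen-thinking*"]),
    ("qwen3", ["*qwen3*"]),
    ("kimi", ["*kimi*"]),
    ("glm45", ["*glm45*", "*glm47*"]),
    ("step3", ["*step3*"]),
    ("minimax", ["*minimax*", "*mm-m2*"]),
]


def detect_reasoning_parser(model_name: str) -> Optional[str]:
    """Return the first parser whose glob pattern matches the model name."""
    # Lowercasing makes the match case-insensitive (an assumption).
    name = model_name.lower()
    for parser, patterns in REASONING_PARSER_PATTERNS:
        if any(fnmatchcase(name, pattern) for pattern in patterns):
            return parser
    return None  # no reasoning parser applies


print(detect_reasoning_parser("deepseek-ai/DeepSeek-R1"))
print(detect_reasoning_parser("Qwen/Qwen3-32B"))
```

If auto-detection picks the wrong parser for a custom model name, passing `--reasoning-parser` explicitly sidesteps the pattern match.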
### Output Format
When `separate_reasoning: true` is set in the request:
```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The answer is 42.",
      "reasoning_content": "Let me think step by step..."
    }
  }]
}
```
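
To make the "initial state" column concrete, here is a rough, non-streaming sketch of how a raw completion might be split into `reasoning_content` and `content`. It assumes `<think>` / `</think>` delimiters and a single reasoning block; the function `split_reasoning` and its parameters are hypothetical and do not reflect SMG's internal API.

```python
from typing import Optional, Tuple


def split_reasoning(
    text: str,
    start_in_reasoning: bool,
    open_tag: str = "<think>",    # assumed delimiter
    close_tag: str = "</think>",  # assumed delimiter
) -> Tuple[Optional[str], str]:
    """Split raw model output into (reasoning_content, content)."""
    prefix = ""
    if start_in_reasoning:
        # The model is treated as already thinking; drop a redundant open tag.
        body = text.lstrip()
        if body.startswith(open_tag):
            body = body[len(open_tag):]
    else:
        # Reasoning only begins once the open tag appears.
        prefix, found, body = text.partition(open_tag)
        if not found:
            return None, text
    # Everything up to the close tag is reasoning; the rest is the answer.
    # If the close tag is missing, the whole body counts as reasoning.
    reasoning, _, answer = body.partition(close_tag)
    return reasoning.strip(), (prefix + answer).strip()


raw = "Let me think step by step...</think>The answer is 42."
reasoning, content = split_reasoning(raw, start_in_reasoning=True)
message = {"role": "assistant", "content": content, "reasoning_content": reasoning}
print(message)
```

A parser that starts "in reasoning" treats the output as thinking text until the closing token appears, which is why a mismatched parser can put the whole answer in the wrong field.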
---
## Tool Call Parsers
Tool call parsers extract function calls from model output and validate arguments against schemas.
### Configuration
| Option | Default |
|--------|---------|
| `--tool-call-parser` | Auto-detected from model name |
### Supported Parsers
**Llama**
Native Llama 3.2 function calling format.
```json
<|python_tag|>{"name": "get_weather", "parameters": {"location": "NYC"}}
```
**DeepSeek**
DeepSeek V3 tool format.
```xml
get_weather(location="NYC")
```
**Qwen**
Qwen model JSON tool calling format.
```json
{"name": "get_weather", "arguments": {"location": "NYC"}}
```
**Qwen XML**
Qwen3-Coder / Qwen3.5+ XML format with parameter tags.
```xml
<tool_call>
<function=get_weather>
<parameter=location>NYC</parameter>
</function>
</tool_call>
```
### Complete Parser Reference
| Parser | Model Pattern | Format |
|--------|--------------|--------|
| `passthrough` | Default fallback | No parsing (returns text unchanged) |
| `json` | `gpt-*`, `claude-*`, `gemini-*` | Standard JSON function calls |
| `mistral` | `mistral-*`, `mixtral-*` | Mistral-specific format |
| `qwen` | `qwen*`, `Qwen*` | JSON tool calls |
| `qwen_xml` | `Qwen3-Coder*`, `Qwen3.5*` | XML with parameter tags |
| `pythonic` | `llama-4*`, `deepseek-*` | Python-style function syntax |
| `llama` | `llama-3.2*` | Python tag with JSON |
| `deepseek` | `deepseek-v3*` | XML with function syntax |
| `glm45_moe` | `glm-4.5*`, `glm-4.6*` | GLM 4.5/4.6 MoE format |
| `glm47_moe` | `glm-4.7*` | GLM 4.7 MoE format |
| `step3` | `step3*`, `Step-3*` | Step-3 model format |
| `kimik2` | `kimi-k2*`, `Kimi-K2*` | Kimi K2 model format |
| `minimax_m2` | `minimax*`, `MiniMax*` | MiniMax M2 model format |
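
As an illustration of what a tool call parser does, the sketch below handles the two JSON-based formats shown earlier: the Llama form prefixed with `<|python_tag|>` and the plain Qwen JSON form. The function `parse_tool_call` and the normalized output shape are assumptions for this example, not SMG's parser interface.

```python
import json
from typing import Optional

PYTHON_TAG = "<|python_tag|>"


def parse_tool_call(text: str) -> Optional[dict]:
    """Return {"name", "arguments"} for a JSON tool call, or None for plain text."""
    payload = text.strip()
    # Llama 3.2 prefixes the JSON object with a special tag.
    if payload.startswith(PYTHON_TAG):
        payload = payload[len(PYTHON_TAG):]
    try:
        obj = json.loads(payload)
    except json.JSONDecodeError:
        return None  # not a tool call; treat as ordinary text
    if not isinstance(obj, dict) or "name" not in obj:
        return None
    # Llama uses "parameters", Qwen uses "arguments"; normalize to one key.
    arguments = obj.get("arguments", obj.get("parameters", {}))
    return {"name": obj["name"], "arguments": arguments}


llama_output = '<|python_tag|>{"name": "get_weather", "parameters": {"location": "NYC"}}'
qwen_output = '{"name": "get_weather", "arguments": {"location": "NYC"}}'
print(parse_tool_call(llama_output))
print(parse_tool_call(qwen_output))
```

Returning `None` for unparseable text mirrors the `passthrough` behavior of leaving the output unchanged.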
### Tool Execution Flow
1. **Parse**: Extract tool calls from model output
2. **Validate**: Check arguments against tool schema
3. **Execute**: Run MCP tools or return to client
4. **Inject**: Add tool results back to conversation
5. **Continue**: Resume generation if needed
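
The five steps above can be sketched as a small loop. This is a simplified illustration under stated assumptions: `generate`, `parse`, and the `tools` registry are hypothetical stand-ins for the worker call, the configured tool call parser, and the MCP tool set, and the validation is far simpler than full JSON Schema checking.

```python
import json
from typing import Callable, Dict, List, Optional


def validate_args(args: dict, schema: dict) -> bool:
    """Minimal check: all required keys are present and no unknown keys appear."""
    props = schema.get("properties", {})
    required = schema.get("required", [])
    return all(k in args for k in required) and all(k in props for k in args)


def run_tool_loop(
    messages: List[dict],
    generate: Callable[[List[dict]], str],    # hypothetical: calls the worker
    parse: Callable[[str], Optional[dict]],   # hypothetical: tool call parser
    tools: Dict[str, dict],                   # hypothetical: name -> {"schema", "fn"}
    max_rounds: int = 4,
) -> dict:
    for _ in range(max_rounds):
        output = generate(messages)
        call = parse(output)                                  # 1. Parse
        if call is None:
            return {"type": "text", "content": output}
        tool = tools.get(call["name"])
        if tool is None:
            # Not a registered tool: hand the call back to the client.
            return {"type": "tool_call", "call": call}
        if validate_args(call["arguments"], tool["schema"]):  # 2. Validate
            result = tool["fn"](**call["arguments"])          # 3. Execute
        else:
            result = {"error": "arguments do not match tool schema"}
        # 4. Inject the call and its result back into the conversation.
        messages.append({"role": "assistant", "content": output})
        messages.append(
            {"role": "tool", "name": call["name"], "content": json.dumps(result)}
        )
        # 5. Continue: the next iteration resumes generation.
    raise RuntimeError("tool loop exceeded max_rounds")
```

The `max_rounds` cap is a design choice of this sketch: it bounds the loop in case the model keeps requesting tools indefinitely.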
---
## Configuration
### Parser CLI Options
| Option | Default | Description |
|--------|---------|-------------|
| `--reasoning-parser` | Auto | Reasoning parser type to use |
| `--tool-call-parser` | Auto | Tool call parser type to use |
| `--mcp-config-path` | None | Path to MCP server configuration file |
### MCP Integration
When MCP is configured, tool calls can be executed automatically:
```bash
smg \
  --mcp-config-path /path/to/mcp.json \
  --tool-call-parser llama
```
See the [MCP Guide](../extensibility/mcp.md) for detailed configuration.
---
## Recommended Configurations
### :material-brain: Thinking Model
DeepSeek-R1 with reasoning extraction.
```bash
smg \
  --model-path deepseek-ai/DeepSeek-R1 \
  --reasoning-parser deepseek_r1 \
  --worker-urls grpc://worker1:50051
```
### :material-function: Tool Calling Model
Llama with MCP tool execution.
```bash
smg \
  --model-path meta-llama/Llama-3.2-3B-Instruct \
  --tool-call-parser llama \
  --mcp-config-path /config/mcp.json
```
### :material-all-inclusive: Full Pipeline
Complete configuration with all features.
```bash
smg \
  --model-path Qwen/Qwen3-32B \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen \
  --mcp-config-path /config/mcp.json \
  --tokenizer-cache-enable-l0 \
  --tokenizer-cache-enable-l1 \
  --worker-urls grpc://worker:50051
```
---
## Monitoring
### Pipeline Metrics
| Metric | Description |
|--------|-------------|
| `smg_router_stage_duration_seconds` | Time spent in each pipeline stage |
| `smg_mcp_tool_calls_total` | MCP tool invocations |
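
One way to use the stage metric is to compute a mean duration per stage from a Prometheus text scrape. The sketch below assumes the metric is exposed as a histogram (with `_sum` and `_count` series) and carries a `stage` label; neither detail is confirmed by this page, and the sample values and stage names are invented for illustration.

```python
import re
from collections import defaultdict
from typing import Dict

# Matches lines such as: metric_name{label="value"} 1.5
SAMPLE_LINE = re.compile(r"^(\w+)\{([^}]*)\}\s+(\S+)$")


def mean_stage_durations(
    text: str,
    metric: str = "smg_router_stage_duration_seconds",
    label: str = "stage",  # assumed label name
) -> Dict[str, float]:
    """Compute sum / count per label value from Prometheus text format."""
    sums = defaultdict(float)
    counts = defaultdict(float)
    label_re = re.compile(r'(?:^|,)\s*' + re.escape(label) + r'="([^"]*)"')
    for line in text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if not match:
            continue  # skips blank lines and "# HELP" / "# TYPE" comments
        name, labels, value = match.groups()
        found = label_re.search(labels)
        if not found:
            continue
        if name == metric + "_sum":
            sums[found.group(1)] += float(value)
        elif name == metric + "_count":
            counts[found.group(1)] += float(value)
    return {s: sums[s] / counts[s] for s in sums if counts.get(s)}


# Invented sample scrape; real label names and values may differ.
sample = """
# TYPE smg_router_stage_duration_seconds histogram
smg_router_stage_duration_seconds_sum{stage="tokenize"} 2.0
smg_router_stage_duration_seconds_count{stage="tokenize"} 4
smg_router_stage_duration_seconds_sum{stage="tool_parse"} 3.0
smg_router_stage_duration_seconds_count{stage="tool_parse"} 12
"""
print(mean_stage_durations(sample))
```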
### Debug Logging
```bash
# Enable pipeline debug logging
RUST_LOG=smg::pipeline=debug smg ...
# Enable parser debug logging
RUST_LOG=smg::parsers=debug smg ...
```
---
## Troubleshooting
| Symptom | Cause | Solution |
|---------|-------|----------|
| Reasoning not extracted | Wrong parser | Check that `--reasoning-parser` matches the model family |
| Tool calls not parsed | Format mismatch | Check that `--tool-call-parser` matches the model's tool call format |
| MCP tools time out | Slow tool execution | Check MCP server configuration |
| Empty `reasoning_content` | `separate_reasoning` not set, or the model produced no thinking tokens | Set `separate_reasoning: true` in the request |
---
## What's Next?
### :material-memory: Tokenizer Caching
Learn about two-level tokenizer caching for performance.
[Tokenizer Caching →](../performance/tokenizer-caching.md)
### :material-puzzle: MCP Integration
Configure Model Context Protocol servers for tool execution.
[MCP →](../extensibility/mcp.md)
### :material-cached: Cache-Aware Routing
Maximize KV cache hits with prefix-based routing.
[Cache-Aware Routing →](../routing/cache-aware.md)