---
title: Getting Started
---

# Getting Started

Shepherd Model Gateway (SMG) routes and manages LLM traffic across workers. This page gives you a fast path to a working gateway, then points you to feature-specific setup guides.

## Install

=== "pip (recommended)"

    Pre-built wheels are available for Linux (x86_64, aarch64, musllinux), macOS (Intel and Apple Silicon), and Windows (x86_64), with Python 3.9–3.14.

    ```bash
    pip install smg
    ```

    This installs both commands:

    - `smg serve` (Python orchestration command that starts workers and the gateway together)
    - `smg launch` (Rust CLI command that starts the gateway only)

=== "Cargo (crates.io)"

    ```bash
    cargo install smg
    ```

=== "Docker"

    **SMG only** (gateway/router, no inference engine):

    Multi-architecture images are available for x86_64 and ARM64.

    ```bash
    docker pull lightseekorg/smg:latest
    ```

    Available tags: `latest` (stable), `v1.4.x` (specific version), `nightly` (development, from `ghcr.io/lightseekorg/smg:nightly`).

    **SMG + Engine** (all-in-one, ready to serve models):

    Engine images bundle SMG with a specific inference engine (x86_64/CUDA only). Use these when you want a single container that can both route and serve.

    ```bash
    # SGLang
    docker pull ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10

    # vLLM
    docker pull ghcr.io/lightseekorg/smg:1.4.1-vllm-v0.19.0

    # TensorRT-LLM
    docker pull ghcr.io/lightseekorg/smg:1.4.1-trtllm-1.3.0rc10
    ```

    Tag format: `{smg_version}-{engine}-{engine_version}`. Browse all tags at [ghcr.io/lightseekorg/smg](https://github.com/lightseekorg/smg/pkgs/container/smg).
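
    To pin or audit engine images in a script, a tag can be split back into its three parts. The sketch below uses only POSIX parameter expansion and assumes, as in the tags shown above, that neither the SMG version nor the engine name contains a hyphen. `parse_engine_tag` is a hypothetical helper, not part of SMG.

    ```shell
    # Split "{smg_version}-{engine}-{engine_version}" into three variables.
    # Assumption: only the engine version may contain further hyphens.
    parse_engine_tag() {
      tag=$1
      smg_version=${tag%%-*}      # text before the first hyphen
      rest=${tag#*-}              # text after the first hyphen
      engine=${rest%%-*}          # text before the second hyphen
      engine_version=${rest#*-}   # everything after the second hyphen
    }

    parse_engine_tag "1.4.1-sglang-v0.5.10"
    echo "smg=$smg_version engine=$engine engine_version=$engine_version"
    ```

    For `1.4.1-sglang-v0.5.10` this prints `smg=1.4.1 engine=sglang engine_version=v0.5.10`.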

=== "From Source"

    ```bash
    # Install Rust
    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
    source "$HOME/.cargo/env"

    # Clone and build
    git clone https://github.com/lightseekorg/smg.git
    cd smg
    cargo build --release
    ```

    The binary is available at `./target/release/smg`.

## Step 1: Start SMG

Choose one of these startup paths.

### Option A: All-in-one with `smg serve`

`smg serve` launches backend worker process(es) and then starts SMG with generated worker URLs.

=== "SGLang"

    ```bash
    smg serve \
      --backend sglang \
      --model-path meta-llama/Llama-3.1-8B-Instruct \
      --data-parallel-size 2 \
      --connection-mode grpc \
      --host 0.0.0.0 \
      --port 30000
    ```

=== "vLLM"

    ```bash
    smg serve \
      --backend vllm \
      --model meta-llama/Llama-3.1-8B-Instruct \
      --data-parallel-size 2 \
      --host 0.0.0.0 \
      --port 30000
    ```

=== "TensorRT-LLM (gRPC)"

    ```bash
    smg serve \
      --backend trtllm \
      --model meta-llama/Llama-3.1-8B-Instruct \
      --data-parallel-size 2 \
      --host 0.0.0.0 \
      --port 30000
    ```

This starts the number of worker replicas set by `--data-parallel-size` (two in the examples above), waits until they are ready, and then starts the gateway.

| Option | Default | Description |
|--------|---------|-------------|
| `--backend` | `sglang` | Inference backend: `sglang`, `vllm`, or `trtllm` |
| `--connection-mode` | `grpc` | Worker connection mode: `grpc` or `http` (TensorRT-LLM only supports gRPC) |
| `--data-parallel-size` | `1` | Number of worker replicas (one per GPU) |
| `--worker-base-port` | `31000` | Base port for worker processes |
| `--host` | `127.0.0.1` | Router host |
| `--port` | `8080` | Router port |

### Option B: Launch gateway only with `smg launch`

Use this when workers are already running or managed by another platform.

For gRPC workers:

```bash
smg launch \
  --worker-urls grpc://localhost:50051 \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --policy round_robin \
  --host 0.0.0.0 \
  --port 30000
```

For HTTP workers:

```bash
smg launch \
  --worker-urls http://localhost:8000 \
  --policy round_robin \
  --host 0.0.0.0 \
  --port 30000
```

## Step 2: Verify Core Endpoints

Health:

```bash
curl http://localhost:30000/health
curl http://localhost:30000/readiness
```
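
Because `smg serve` brings up the gateway only after its workers are ready, scripts usually need to wait before sending traffic. The following is a minimal sketch of a polling loop. `wait_until` is a hypothetical helper, the one-second interval is an arbitrary choice, and the usage line assumes that `/readiness` answers with a non-2xx status (or refuses the connection) until the gateway can serve requests.

```shell
# Retry a command once per second until it succeeds or the timeout expires.
# Usage: wait_until <timeout_seconds> <command> [args...]
wait_until() {
  timeout=$1
  shift
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s: $*" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
}

# Example (requires a running gateway on port 30000):
#   wait_until 120 curl -fsS http://localhost:30000/readiness
```

`curl -f` makes HTTP error statuses count as failures, so the loop keeps polling until the endpoint returns a success code.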

OpenAI-compatible chat completions:

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
```

Responses API:

```bash
curl http://localhost:30000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "input": "Say hello in one sentence."
  }'
```

## Step 3: Choose Your Setup Track

### Core Deployment

- [Multiple Workers](multiple-workers.md)
- [gRPC Workers](grpc-workers.md)
- [PD Disaggregation](pd-disaggregation.md)
- [Service Discovery](service-discovery.md)

### Operations and Security

- [Monitoring](monitoring.md)
- [Logging](logging.md)
- [TLS](tls.md)
- [Control Plane Auth](control-plane-auth.md)
- [Control Plane Operations](control-plane-operations.md)

### Reliability and Data

- [Reliability Controls](reliability-controls.md)
- [Data Connections](data-connections.md)
- [Tokenization and Parsing APIs](tokenization-and-parsing.md)

### Advanced Features

- [Load Balancing](load-balancing.md)
- [Tokenizer Caching](tokenizer-caching.md)
- [MCP in Responses API](mcp.md)

---

## Worker Startup Recipes (Standalone)

Use these when workers are not started via `smg serve`.

=== "SGLang (gRPC)"

    ```bash
    python -m sglang.launch_server \
      --model-path meta-llama/Llama-3.1-8B-Instruct \
      --host 0.0.0.0 \
      --port 50051 \
      --grpc-mode
    ```

=== "SGLang (HTTP)"

    ```bash
    python -m sglang.launch_server \
      --model-path meta-llama/Llama-3.1-8B-Instruct \
      --host 0.0.0.0 \
      --port 8000
    ```

=== "vLLM (gRPC)"

    ```bash
    python -m vllm.entrypoints.grpc_server \
      --model meta-llama/Llama-3.1-8B-Instruct \
      --host 0.0.0.0 \
      --port 50051 \
      --tensor-parallel-size 1
    ```

=== "TensorRT-LLM (gRPC)"

    ```bash
    python -m tensorrt_llm.commands.serve \
      meta-llama/Llama-3.1-8B-Instruct \
      --grpc \
      --host 0.0.0.0 \
      --port 50051 \
      --backend pytorch \
      --tp_size 1
    ```

### PD Disaggregation Workers

For prefill-decode disaggregation, start separate prefill and decode workers:

=== "SGLang PD (gRPC)"

    ```bash
    # Prefill worker
    python -m sglang.launch_server \
      --model-path meta-llama/Llama-3.1-8B-Instruct \
      --host 0.0.0.0 \
      --port 50051 \
      --grpc-mode \
      --disaggregation-mode prefill \
      --disaggregation-bootstrap-port 8998

    # Decode worker
    python -m sglang.launch_server \
      --model-path meta-llama/Llama-3.1-8B-Instruct \
      --host 0.0.0.0 \
      --port 50052 \
      --grpc-mode \
      --disaggregation-mode decode \
      --disaggregation-bootstrap-port 8999
    ```

    Start SMG, passing the prefill worker's bootstrap port (`8998`) after its URL so the gateway can coordinate the SGLang workers:

    ```bash
    smg launch \
      --pd-disaggregation \
      --prefill grpc://localhost:50051 8998 \
      --decode grpc://localhost:50052 \
      --model-path meta-llama/Llama-3.1-8B-Instruct \
      --host 0.0.0.0 \
      --port 30000
    ```

=== "SGLang PD (HTTP)"

    ```bash
    # Prefill worker
    python -m sglang.launch_server \
      --model-path meta-llama/Llama-3.1-8B-Instruct \
      --host 0.0.0.0 \
      --port 8000 \
      --disaggregation-mode prefill \
      --disaggregation-bootstrap-port 8998

    # Decode worker
    python -m sglang.launch_server \
      --model-path meta-llama/Llama-3.1-8B-Instruct \
      --host 0.0.0.0 \
      --port 8001 \
      --disaggregation-mode decode \
      --disaggregation-bootstrap-port 8999
    ```

    Start SMG, passing the prefill worker's bootstrap port (`8998`) after its URL so the gateway can coordinate the SGLang workers:

    ```bash
    smg launch \
      --pd-disaggregation \
      --prefill http://localhost:8000 8998 \
      --decode http://localhost:8001 \
      --host 0.0.0.0 \
      --port 30000
    ```

=== "vLLM PD (gRPC + NIXL)"

    vLLM uses NIXL for KV cache transfer between prefill and decode workers:

    ```bash
    # Prefill worker
    VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
    python -m vllm.entrypoints.grpc_server \
      --model meta-llama/Llama-3.1-8B-Instruct \
      --host 0.0.0.0 \
      --port 50051 \
      --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'

    # Decode worker
    VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
    python -m vllm.entrypoints.grpc_server \
      --model meta-llama/Llama-3.1-8B-Instruct \
      --host 0.0.0.0 \
      --port 50052 \
      --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
    ```

    Start SMG. No bootstrap port is needed because NIXL handles the KV transfer:

    ```bash
    smg launch \
      --pd-disaggregation \
      --prefill grpc://localhost:50051 \
      --decode grpc://localhost:50052 \
      --model-path meta-llama/Llama-3.1-8B-Instruct \
      --host 0.0.0.0 \
      --port 30000
    ```

See [PD Disaggregation](pd-disaggregation.md) for full details, including the Mooncake backend and scaling.

## Send a Request

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 50
  }'
```

Example response (the `id`, the wording, and the token counts will vary):

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 8,
    "total_tokens": 22
  }
}
```
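
In scripts, the fields of interest are usually the assistant text and the token counts. The sketch below extracts both with `jq`, which is assumed to be installed. `extract_reply`, `usage_summary`, and the file names in the usage comment are hypothetical.

```shell
# Print the assistant text of the first choice from a chat completion on stdin.
extract_reply() {
  jq -r '.choices[0].message.content'
}

# Print "prompt + completion = total" token counts from a response on stdin.
usage_summary() {
  jq -r '.usage | "\(.prompt_tokens) + \(.completion_tokens) = \(.total_tokens)"'
}

# Example (requires a running gateway; request.json holds the request body):
#   curl -s http://localhost:30000/v1/chat/completions \
#     -H "Content-Type: application/json" -d @request.json > response.json
#   extract_reply < response.json
#   usage_summary < response.json
```

Applied to the example response above, `extract_reply` prints `The capital of France is Paris.` and `usage_summary` prints `14 + 8 = 22`.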

## Verify Health

```bash
# Gateway health
curl http://localhost:30000/health

# Worker status
curl http://localhost:30000/workers
```

## Deploy with Docker

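For local deployment, run SMG in a container and point it at a worker running on the host. `host.docker.internal` resolves automatically on Docker Desktop; on Linux, add `--add-host=host.docker.internal:host-gateway` to `docker run`.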

```bash
docker pull lightseekorg/smg:latest

docker run -d \
  --name smg \
  -p 30000:30000 \
  -p 29000:29000 \
  lightseekorg/smg:latest \
  --worker-urls http://host.docker.internal:8000 \
  --policy cache_aware \
  --prometheus-port 29000 \
  --host 0.0.0.0 \
  --port 30000
```

Verify:

```bash
docker ps | grep smg
curl http://localhost:30000/health
```

### All-in-one with engine images

Engine images include both SMG and an inference engine. Use `serve` to launch workers and the gateway together:

```bash
docker run -d --gpus all \
  --name smg \
  -p 30000:30000 \
  -v /path/to/models:/models \
  ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10 \
  serve \
  --backend sglang \
  --model-path /models/meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000
```

Verify:

```bash
curl http://localhost:30000/health
curl http://localhost:30000/v1/models
```

## Deploy to Kubernetes (Quick Start)

Run SMG in-cluster and use service discovery to pick up worker pods automatically.

Start SMG with service discovery:

```bash
smg \
  --service-discovery \
  --selector app=sglang-worker \
  --service-discovery-namespace inference \
  --service-discovery-port 8000 \
  --policy cache_aware \
  --host 0.0.0.0 \
  --port 30000
```

Required RBAC permissions:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: smg-discovery
  namespace: inference
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
```
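
A Role grants nothing until it is bound to the identity the gateway pod runs as. The sketch below creates a ServiceAccount and a RoleBinding with the standard `kubectl create` subcommands. The account name `smg` is a placeholder, `bind_discovery_role` is a hypothetical helper, and the `KUBECTL` override exists only so the commands can be previewed with `KUBECTL=echo` before touching a cluster.

```shell
# Bind the smg-discovery Role to a ServiceAccount in the given namespace.
# Usage: bind_discovery_role <namespace> <service_account>
bind_discovery_role() {
  ns=$1
  sa=$2
  kubectl_cmd=${KUBECTL:-kubectl}   # set KUBECTL=echo to preview the commands
  "$kubectl_cmd" create serviceaccount "$sa" -n "$ns" &&
  "$kubectl_cmd" create rolebinding smg-discovery \
    --role=smg-discovery \
    --serviceaccount="$ns:$sa" \
    -n "$ns"
}

# Example (placeholder account name "smg"):
#   bind_discovery_role inference smg
#   kubectl auth can-i list pods -n inference \
#     --as=system:serviceaccount:inference:smg
```

The gateway pod then needs `serviceAccountName` in its pod spec set to the same account for the permissions to apply.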

Verify:

```bash
kubectl get pods -n inference -l app=sglang-worker
# Run inside the SMG pod, or forward the gateway port to localhost:30000 first
curl http://localhost:30000/workers
```

## Navigate by Category

### Core Setup

- [Multiple Workers](multiple-workers.md) — connect local or external worker endpoints
- [gRPC Workers](grpc-workers.md) — gateway-side tokenization, parsing, and tool handling
- [PD Disaggregation](pd-disaggregation.md) — split prefill and decode paths
- [Service Discovery](service-discovery.md) — Kubernetes pod-based worker registration

### Operations

- [Monitoring](monitoring.md) — Prometheus metrics, tracing, and alerts
- [Logging](logging.md) — structured logs and aggregation patterns
- [TLS](tls.md) — HTTPS gateway configuration
- [Control Plane Auth](control-plane-auth.md) — secure worker/tokenizer/WASM management endpoints

### Reliability and Data

- [Reliability Controls](reliability-controls.md) — concurrency limits, retries, and circuit breakers
- [Data Connections](data-connections.md) — history backend setup for Postgres, Redis, and Oracle
- [Tokenization and Parsing APIs](tokenization-and-parsing.md) — tokenize, detokenize, and parser endpoints

### Advanced Features

- [Load Balancing](load-balancing.md) — policy selection and tuning
- [Tokenizer Caching](tokenizer-caching.md) — L0/L1 cache setup for gRPC mode
- [MCP in Responses API](mcp.md) — configure and execute MCP tools through `/v1/responses`

## Troubleshooting

??? question "Gateway starts but can't connect to worker"

    **Symptoms:** Gateway logs show connection errors.

    **Solutions:**

    1. Verify the worker is running (HTTP worker example): `curl http://localhost:8000/health`
    2. Check network connectivity between the gateway and the worker
    3. If using Docker, make sure the gateway container can reach the worker (use `--network host` or put both on the same Docker network)

??? question "Request times out"

    **Symptoms:** Requests hang or return 504 errors.

    **Solutions:**

    1. Check worker health: `curl http://localhost:30000/workers`
    2. Increase timeout: `--request-timeout-secs 120`
    3. Check worker logs for errors

??? question "Model not found error"

    **Symptoms:** `model not found` in response.

    **Solutions:**

    1. Set the `model` field in the request to the ID of the model loaded on the worker
    2. Check available models: `curl http://localhost:30000/v1/models`
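
    The check in step 2 can be scripted. The sketch below assumes that `/v1/models` returns the OpenAI-style shape `{"data": [{"id": "..."}]}` and that `jq` is installed; `model_is_listed` is a hypothetical helper.

    ```shell
    # Succeed if the model ID in $1 appears in a /v1/models response on stdin.
    # Assumes the OpenAI-style shape {"data": [{"id": "..."}]}.
    model_is_listed() {
      jq -e --arg id "$1" 'any(.data[]?; .id == $id)' >/dev/null
    }

    # Example (requires a running gateway):
    #   curl -s http://localhost:30000/v1/models \
    #     | model_is_listed "meta-llama/Llama-3.1-8B-Instruct" \
    #     || echo "model ID is not listed by the gateway" >&2
    ```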