# Admin API Reference SMG provides administrative endpoints for managing tokenizers, workers, cache, and cluster operations. !!! tip "Related Documentation" For health checks, worker status, and monitoring endpoints, see [Gateway Extensions](extensions.md). --- ## Tokenizer Management Manage tokenizers for text processing and tokenization. !!! note "Authentication Required" These endpoints require admin authentication via API key or control plane credentials. ### Add Tokenizer ``` POST /v1/tokenizers ``` Adds a new tokenizer from a local path or HuggingFace model ID. **Request Body:** ```json { "name": "llama3-tokenizer", "source": "meta-llama/Meta-Llama-3-8B", "chat_template_path": "/path/to/template.jinja" } ``` | Field | Type | Required | Description | |-------|------|----------|-------------| | `name` | string | Yes | Unique tokenizer identifier | | `source` | string | Yes | HuggingFace model ID or local path | | `chat_template_path` | string | No | Path to custom Jinja2 chat template | **Response:** `202 Accepted` ```json { "id": "550e8400-e29b-41d4-a716-446655440000", "status": "pending", "message": "Tokenizer 'llama3-tokenizer' registration job submitted. Loading from: meta-llama/Meta-Llama-3-8B" } ``` --- ### List Tokenizers ``` GET /v1/tokenizers ``` Returns all registered tokenizers. **Response:** `200 OK` ```json { "tokenizers": [ { "id": "550e8400-e29b-41d4-a716-446655440000", "name": "llama3-tokenizer", "source": "meta-llama/Meta-Llama-3-8B", "vocab_size": 128256 } ] } ``` --- ### Get Tokenizer ``` GET /v1/tokenizers/{tokenizer_id} ``` Returns details for a specific tokenizer. **Response:** `200 OK` ```json { "id": "550e8400-e29b-41d4-a716-446655440000", "name": "llama3-tokenizer", "source": "meta-llama/Meta-Llama-3-8B", "vocab_size": 128256 } ``` **Response:** `404 Not Found` ```json { "error": { "message": "Tokenizer 'llama3-tokenizer' not found", "type": "tokenizer_not_found" } } ``` --- ### Get Tokenizer Status ``` GET /v1/tokenizers/{tokenizer_id}/status ``` Returns the loading status of a tokenizer. **Response:** `200 OK` ```json { "id": "550e8400-e29b-41d4-a716-446655440000", "status": "completed", "message": "Tokenizer 'llama3-tokenizer' is loaded and ready", "vocab_size": 128256 } ``` | Status | Description | |--------|-------------| | `pending` | Tokenizer loading queued | | `processing` | Tokenizer currently loading | | `completed` | Tokenizer ready for use | | `failed` | Loading failed (see message) | --- ### Remove Tokenizer ``` DELETE /v1/tokenizers/{tokenizer_id} ``` Removes a tokenizer. **Response:** `200 OK` ```json { "success": true, "message": "Tokenizer 'llama3-tokenizer' removed successfully" } ``` --- ## Worker Management Manage backend inference workers. !!! tip For listing workers and viewing metrics, see [Gateway Extensions](extensions.md#worker-management). ### Create Worker ``` POST /workers ``` Registers a new backend worker. **Request Body:** ```json { "url": "http://gpu1:8000", "models": [{ "id": "llama3-70b" }], "api_key": "worker-secret-key" } ``` | Field | Type | Required | Description | |-------|------|----------|-------------| | `url` | string | Yes | Worker base URL | | `worker_type` | string | No | `regular`, `prefill`, or `decode` (default: `regular`) | | `connection_mode` | string | No | `http` or `grpc` (default: `http`) | | `runtime_type` | string | No | `sglang`, `vllm`, `trtllm`, `mlx`, `external`, or `unspecified` (default: `unspecified`, which triggers auto-detection) | | `models` | array | No | Model cards served by this worker (empty = wildcard) | | `api_key` | string | No | API key for worker authentication | | `priority` | integer | No | Routing priority (higher = preferred, default: 50) | **Response:** `202 Accepted` ```json { "status": "accepted", "worker_id": "worker-abc123", "url": "http://gpu1:8000", "location": "/workers/worker-abc123", "message": "Worker addition queued for background processing" } ``` --- ### Update Worker (partial) ``` PATCH /workers/{worker_id} ``` Partially updates worker configuration. Only the fields you include are changed. **Request Body:** ```json { "priority": 75, "api_key": "new-api-key" } ``` | Field | Type | Description | |-------|------|-------------| | `priority` | integer | New routing priority | | `cost` | number | New cost factor | | `labels` | object | Updated labels | | `api_key` | string | New API key (for key rotation) | | `health` | object | Partial health-check overrides (`timeout_secs`, `check_interval_secs`, `success_threshold`, `failure_threshold`, `disable_health_check`) | **Response:** `202 Accepted` ```json { "status": "accepted", "worker_id": "worker-abc123", "message": "Worker update queued for background processing" } ``` --- ### Replace Worker (full) ``` PUT /workers/{worker_id} ``` Re-runs the full worker registration workflow (model discovery and all). The request body must be a complete `WorkerSpec` whose `url` matches the existing worker's URL — URL changes are not supported via `PUT`; use `DELETE` + `POST` instead. **Response:** `202 Accepted` with the same shape as `PATCH`. --- ### Delete Worker ``` DELETE /workers/{worker_id} ``` Removes a worker from the pool. **Response:** `202 Accepted` ```json { "status": "accepted", "worker_id": "worker-abc123", "message": "Worker removal queued for background processing" } ``` --- ## Cache Management Manage the routing cache and load information. ### Flush Cache ``` POST /flush_cache ``` Flushes the KV cache on all HTTP workers. gRPC workers are skipped. The response status is `200 OK` on full success and `206 Partial Content` when some workers fail. **Response:** `200 OK` ```json { "status": "success", "message": "Successfully flushed cache on all 3 HTTP workers", "workers_flushed": 3, "total_http_workers": 3, "total_workers": 3 } ``` On partial failure, the response additionally includes `successful` (list of worker URLs) and `failed` (list of `{worker, error}` entries), and `status` becomes `"partial_success"`. --- ### Get Loads ``` GET /get_loads ``` Returns the current load distribution across workers. The gateway fans out to every registered worker (HTTP and gRPC) and returns whatever each backend reports. The `load` field is the total number of KV-cache tokens in use across all data-parallel ranks for that worker; `-1` indicates the worker failed to respond. **Response:** `200 OK` ```json { "workers": [ { "worker": "http://gpu1:8000", "load": 1234, "details": { "timestamp": "2024-01-15T12:00:00Z", "dp_rank_count": 1, "loads": [ { "dp_rank": 0, "num_running_reqs": 5, "num_waiting_reqs": 2, "num_total_reqs": 7, "num_used_tokens": 1234, "max_total_num_tokens": 16384, "token_usage": 0.075, "gen_throughput": 45.2, "cache_hit_rate": 0.82, "utilization": 0.31, "max_running_requests": 256 } ] } } ] } ``` --- ## Model Information Query model and server information. ### List Models ``` GET /v1/models ``` Returns available models (proxied to workers). **Response:** `200 OK` ```json { "object": "list", "data": [ { "id": "llama3-70b", "object": "model", "created": 1700000000, "owned_by": "meta" } ] } ``` --- ### Get Model Info ``` GET /get_model_info ``` Returns detailed model information (proxied to workers). **Response:** `200 OK` ```json { "model_name": "llama3-70b", "max_tokens": 8192, "vocab_size": 128256 } ``` --- ### Get Server Info ``` GET /get_server_info ``` Returns server information (proxied to workers). **Response:** `200 OK` ```json { "version": "0.1.0", "backend": "vllm", "gpu_count": 8 } ``` --- ## WASM Module Management Manage WebAssembly plugins. Modules are registered from files accessible to the gateway process; the request body contains descriptors with paths, not binary payloads. ### Add WASM Module ``` POST /wasm ``` Registers one or more WASM modules. **Request Body:** JSON `WasmModuleAddRequest` ```json { "modules": [ { "name": "custom-middleware", "file_path": "/etc/smg/wasm/custom-middleware.wasm", "module_type": "Middleware", "attach_points": [ {"Middleware": "OnRequest"}, {"Middleware": "OnResponse"} ] } ] } ``` The only supported `module_type` today is `Middleware`. Valid `Middleware` attach points are `OnRequest`, `OnResponse`, and `OnError`. **Response:** `200 OK` on full success, `400 Bad Request` if any module failed to register. The response body echoes every requested module with an `add_result` field indicating success (carrying the assigned UUID) or failure (carrying the error message). ```json { "modules": [ { "name": "custom-middleware", "file_path": "/etc/smg/wasm/custom-middleware.wasm", "module_type": "Middleware", "attach_points": [ {"Middleware": "OnRequest"}, {"Middleware": "OnResponse"} ], "add_result": { "Success": "550e8400-e29b-41d4-a716-446655440000" } } ] } ``` --- ### List WASM Modules ``` GET /wasm ``` Returns all registered WASM modules together with aggregate execution metrics. **Response:** `200 OK` ```json { "modules": [ { "module_uuid": "550e8400-e29b-41d4-a716-446655440000", "module_meta": { "name": "custom-middleware", "file_path": "/etc/smg/wasm/custom-middleware.wasm", "sha256_hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855", "size_bytes": 65536, "created_at": "2024-01-15T12:00:00.000000000Z", "last_accessed_at": "2024-01-15T12:05:00.000000000Z", "access_count": 42, "attach_points": [ {"Middleware": "OnRequest"} ] } } ], "metrics": { "total_executions": 42, "successful_executions": 42, "failed_executions": 0, "total_execution_time_ms": 125, "max_execution_time_ms": 8, "average_execution_time_ms": 2.97 } } ``` --- ### Remove WASM Module ``` DELETE /wasm/{module_uuid} ``` Removes a WASM module. The body is a plain text status message, not JSON. **Response:** `200 OK` ``` Module removed successfully ``` On failure returns `400 Bad Request` with the error text as the body. --- ## Error Responses All endpoints return errors in a consistent format: ```json { "error": { "message": "Detailed error description", "type": "error_type" } } ``` | HTTP Status | Error Type | Description | |-------------|------------|-------------| | `400` | `bad_request` | Invalid request format or parameters | | `401` | `unauthorized` | Missing or invalid authentication | | `403` | `forbidden` | Insufficient permissions | | `404` | `not_found` | Resource not found | | `409` | `conflict` | Resource already exists | | `503` | `service_unavailable` | No healthy workers available | --- ## Authentication Admin endpoints require authentication via one of: 1. **API Key**: Pass via `Authorization: Bearer ` header 2. **Control Plane Key**: For cluster management operations Public endpoints (health checks, model info) do not require authentication.