# API Reference

OlliteRT exposes an OpenAI-compatible HTTP API on your local network. Default port is `8000` (configurable in Settings).

## Table of Contents

- [Endpoints](#endpoints)
- [Authentication](#authentication)
- [Chat Completions](#chat-completions--post-v1chatcompletions)
- [Text Completions](#text-completions--post-v1completions)
- [Responses API](#responses-api--post-v1responses)
- [Anthropic Messages](#anthropic-messages--post-v1messages)
- [Anthropic Token Counter](#anthropic-token-counter--post-v1messagescount_tokens)
- [Audio Transcriptions](#audio-transcriptions--post-v1audiotranscriptions)
- [Models](#models--get-v1models)
- [Model Detail](#model-detail--get-v1modelsid)
- [Health](#health--get-health)
- [Error Responses](#error-responses)
- [Server Info](#server-info--get--or-get-v1)
- [Prometheus Metrics](#prometheus-metrics--get-metrics)

---

## Endpoints

| Method | Endpoint | Description |
|:-------|:---------|:------------|
| `POST` | `/v1/chat/completions` | OpenAI Chat Completions API (streaming + non-streaming) |
| `POST` | `/v1/completions` | OpenAI Text Completions API |
| `POST` | `/v1/responses` | OpenAI Responses API |
| `POST` | `/v1/messages` | Anthropic Messages API (streaming + non-streaming) |
| `POST` | `/v1/messages/count_tokens` | Anthropic input-token estimator |
| `POST` | `/v1/audio/transcriptions` | Audio transcription |
| `GET` | `/v1/models` | List available models |
| `GET` | `/v1/models/{id}` | Get detail for a specific model |
| `GET` | `/` or `/v1` | Server info (version, status, endpoints) |
| `GET` | `/health` | Health check (add `?metrics=true` for detailed JSON stats) |
| `GET` | `/metrics` | Prometheus metrics (exposition format) |
| `GET` | `/ping` | Simple liveness check — returns `{"status":"ok"}` |

## Authentication

Bearer token authentication is **optional** and disabled by default. When disabled, all endpoints are open — no API key or header is needed.
To enable authentication, go to Settings → Server Configuration and toggle **Require Bearer Token**. When enabled, include the token in the `Authorization` header:

```
Authorization: Bearer your-token
```

Anthropic SDK clients (Claude Code, the official Python/TypeScript SDKs) send credentials in `x-api-key` instead. OlliteRT accepts either header — `x-api-key` carries the raw token with no `Bearer` prefix:

```
x-api-key: your-token
```

In every example below the literal string `your-token` is purely a placeholder — when auth is disabled (the default) OlliteRT ignores the header value entirely, so any non-empty string works. When auth is enabled, the value must match the token configured in **Settings → Server Configuration**. The phone never relays credentials to the real OpenAI or Anthropic APIs.

See the [Security Guide](../SECURITY.md) for details on network exposure and credential storage.

> [!TIP]
> All inference endpoints accept the same core parameters (`temperature`, `top_p`, `top_k`, `max_tokens`, `stream`). The parameter tables below document each endpoint's full set.

## Chat Completions — `POST /v1/chat/completions`

### Request Body

| Parameter | Type | Required | Description |
|:----------|:-----|:--------:|:------------|
| `model` | string | Yes | Model name (e.g. `Gemma-4-E2B-it`) |
| `messages` | array | Yes | Array of message objects (`role` + `content`) |
| `stream` | boolean | No | Enable SSE streaming (default: `false`) |
| `stream_options` | object | No | Streaming options. Set `{"include_usage": true}` to receive a usage chunk before `[DONE]` |
| `temperature` | number | No | Sampling temperature (0.0 - 2.0) |
| `top_p` | number | No | Nucleus sampling threshold |
| `top_k` | integer | No | Top-k sampling |
| `max_tokens` | integer | No | Maximum tokens to generate |
| `max_completion_tokens` | integer | No | Alias for `max_tokens` |
| `stop` | string or array | No | Stop sequence(s) |
| `tools` | array | No | Tool/function definitions for [tool calling](../TROUBLESHOOTING.md#tool-calling-experimental) |
| `tool_choice` | string or object | No | Tool selection strategy (`auto`, `none`, or specific tool) |
| `response_format` | object | No | Response format (`{"type": "json_object"}` for JSON mode) |

### Message Object

| Field | Type | Description |
|:------|:-----|:------------|
| `role` | string | `system`, `user`, `assistant`, or `tool` |
| `content` | string or array | Text content, or array of content parts for multimodal |
| `tool_call_id` | string | Required for `role: "tool"` — references the tool call being responded to |
| `name` | string | Function name (for tool messages) |

### Multimodal Content

For vision and audio input, use content parts:

**Image:**

```json
{
  "role": "user",
  "content": [
    {"type": "text", "text": "What's in this image?"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
  ]
}
```

**Audio:**

```json
{
  "role": "user",
  "content": [
    {"type": "input_audio", "input_audio": {"data": "", "format": "wav"}}
  ]
}
```

Supported audio formats: `wav`, `mp3`, `ogg`, `flac`. Audio must be mono — stereo is automatically downmixed.

> [!TIP]
> For dedicated audio transcription, use the [`/v1/audio/transcriptions`](#audio-transcriptions--post-v1audiotranscriptions) endpoint instead.
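
As a rough illustration of the image content-part shape above, the Python sketch below builds a user message from raw image bytes. The helper name `image_message` and its `mime` parameter are hypothetical, introduced only for this example; only the JSON shape is taken from this reference.

```python
import base64


def image_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build a user message with one text part and one inline image part.

    Hypothetical helper: it only assembles the JSON structure shown above.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # The image travels inline as a base64 data URL.
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }
```

The returned dict can be placed directly in the `messages` array of a `/v1/chat/completions` request body.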
### Response

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "Gemma-4-E2B-it",
  "system_fingerprint": null,
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 8,
    "total_tokens": 18
  }
}
```

`finish_reason` values: `"stop"` (natural end or stop sequence), `"length"` (output truncated by `max_tokens`), `"tool_calls"` (model invoked a tool).

> **Note:** The `system_fingerprint` field is always `null`. The LiteRT runtime does not expose a tokenizer or model configuration hash, so there is no meaningful fingerprint to generate. Clients that check this field should treat `null` as "unknown configuration."

### Streaming Response

When `stream: true`, the response is sent as Server-Sent Events:

```
data: {"id":"chatcmpl-...","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-...","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-...","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}

data: {"id":"chatcmpl-...","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```

When `stream_options: {"include_usage": true}` is set, a usage chunk is emitted before `[DONE]`:

```
data: {"id":"chatcmpl-...","choices":[],"usage":{"prompt_tokens":10,"completion_tokens":8,"total_tokens":18}}

data: [DONE]
```

Without `stream_options` (the default), no usage chunk is emitted — the stream ends with the `finish_reason` chunk followed by `[DONE]`.
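
The chunk sequence above can be consumed with a few lines of Python. This is a minimal sketch, assuming the SSE lines have already been read from the HTTP response (for example by iterating over the response body line by line); the function name `collect_chat_stream` is hypothetical and not part of OlliteRT.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_chat_stream(lines: Iterable[str]) -> Tuple[str, Optional[str], Optional[dict]]:
    """Return (text, finish_reason, usage) from Chat Completions SSE lines."""
    text_parts = []
    finish_reason = None
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        if chunk.get("usage"):
            # Only present when stream_options.include_usage is true.
            usage = chunk["usage"]
        for choice in chunk.get("choices", []):
            text_parts.append(choice.get("delta", {}).get("content") or "")
            finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(text_parts), finish_reason, usage
```

Because the usage chunk carries an empty `choices` array, the loop handles it without a special case; `usage` simply stays `None` when the option is not set.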
## Text Completions — `POST /v1/completions`

| Parameter | Type | Required | Description |
|:----------|:-----|:--------:|:------------|
| `model` | string | Yes | Model name |
| `prompt` | string | Yes | Text prompt |
| `stream` | boolean | No | Enable SSE streaming |
| `temperature` | number | No | Sampling temperature |
| `max_tokens` | integer | No | Maximum tokens to generate |

## Responses API — `POST /v1/responses`

Alternative API format. Accepts either a `messages` (array) or an `input` (string) field.

| Parameter | Type | Required | Description |
|:----------|:-----|:--------:|:------------|
| `model` | string | Yes | Model name |
| `input` | string or array | Yes | Input text or messages array |
| `stream` | boolean | No | Enable SSE streaming |
| `tools` | array | No | Tool definitions |
| `tool_choice` | string or object | No | Tool selection strategy (`auto`, `none`, or specific tool) |
| `temperature` | number | No | Sampling temperature |
| `top_p` | number | No | Nucleus sampling threshold |
| `top_k` | integer | No | Top-k sampling |
| `max_output_tokens` | integer | No | Maximum tokens to generate |

### Streaming Response

When `stream: true`, the Responses API uses typed Server-Sent Events with an `event:` prefix (unlike Chat Completions, which uses `data:`-only lines).
Each SSE frame has the format:

```
event:
data:
```

The full event sequence for a text response:

```
event: response.created
data: {"type":"response.created","response":{"id":"resp-...","object":"response","created_at":1234567890,"status":"in_progress","model":"Gemma-4-E2B-it","output":[]}}

event: response.in_progress
data: {"type":"response.in_progress","response":{"id":"resp-...","object":"response","created_at":1234567890,"status":"in_progress","model":"Gemma-4-E2B-it","output":[]}}

event: response.output_item.added
data: {"type":"response.output_item.added","item":{"id":"msg-...","type":"message","status":"in_progress","content":[],"role":"assistant"},"output_index":0,"sequence_number":0}

event: response.content_part.added
data: {"type":"response.content_part.added","content_index":0,"item_id":"msg-...","output_index":0,"part":{"type":"output_text","annotations":[],"logprobs":[],"text":""}}

event: response.output_text.delta
data: {"type":"response.output_text.delta","content_index":0,"delta":"Hello","item_id":"msg-...","output_index":0}

event: response.output_text.delta
data: {"type":"response.output_text.delta","content_index":0,"delta":"!","item_id":"msg-...","output_index":0}

event: response.output_text.done
data: {"type":"response.output_text.done","content_index":0,"item_id":"msg-...","output_index":0,"text":"Hello!"}

event: response.content_part.done
data: {"type":"response.content_part.done","content_index":0,"item_id":"msg-...","output_index":0,"part":{"type":"output_text","annotations":[],"logprobs":[],"text":"Hello!"}}

event: response.output_item.done
data: {"type":"response.output_item.done","item":{"id":"msg-...","type":"message","status":"completed","content":[{"type":"output_text","annotations":[],"logprobs":[],"text":"Hello!"}],"role":"assistant"},"output_index":0}

event: response.completed
data: {"type":"response.completed","response":{"id":"resp-...","object":"response","created_at":1234567890,"status":"completed","model":"Gemma-4-E2B-it","output":[...],"usage":{"input_tokens":10,"output_tokens":2,"total_tokens":12}}}

data: [DONE]
```

The final `data: [DONE]` line has no `event:` prefix — it signals the end of the stream (same as Chat Completions).

## Anthropic Messages — `POST /v1/messages`

Anthropic-compatible Messages API. Lets Claude Code and the official Anthropic SDKs (Python, TypeScript) target the phone directly with no proxy. The handler translates the Anthropic request into the internal chat-completion pipeline and re-shapes the response into Anthropic's `content`-block format.

> [!WARNING]
> **Experimental.** Wire-level support for the Messages API is implemented and stable, but on-device models in the Gemma-4-E2B / 3n class do not have the context budget or instruction-following headroom to drive Claude Code (large system prompt, dense tool surface) reliably. Expect long prefill, frequent tool-call mistakes, and the LiteRT-LM #2418 parse failures noted below. Use the OpenAI-compatible endpoints for production workflows; treat this surface as a smoke test for the Anthropic API.

### Request Body

| Parameter | Type | Required | Description |
|:----------|:-----|:--------:|:------------|
| `model` | string | Yes | Model name (e.g. `Gemma-4-E2B-it`) |
| `messages` | array | Yes | Array of message objects (`role` + `content`) |
| `max_tokens` | integer | Yes | Maximum tokens to generate |
| `system` | string or array | No | System prompt — string for the simple form, or an array of `{type:"text", text:"..."}` blocks |
| `stream` | boolean | No | Enable SSE streaming (default: `false`) |
| `temperature` | number | No | Sampling temperature |
| `top_p` | number | No | Nucleus sampling threshold |
| `top_k` | integer | No | Top-k sampling |
| `stop_sequences` | array | No | Stop strings |
| `tools` | array | No | Tool definitions in Anthropic shape (`{name, description, input_schema}`) |
| `tool_choice` | object | No | `{type:"auto"}`, `{type:"any"}`, `{type:"none"}`, or `{type:"tool", name:"..."}` |
| `thinking` | object | No | `{type:"enabled"}` / `{type:"disabled"}` — per-request override of the model's persisted thinking setting (only applied when the model supports thinking) |

The following Anthropic features are accepted on the wire but silently dropped because LiteRT-LM has no equivalent: `metadata`, `service_tier`, `cache_control`, `parallel_tool_calls`, echoed `thinking` blocks. URL-sourced images, document blocks, and `computer_*` / `text_editor_*` / `bash_*` tool types return HTTP 400.

### Response (non-streaming)

```json
{
  "id": "msg_...",
  "type": "message",
  "role": "assistant",
  "model": "Gemma-4-E2B-it",
  "content": [
    {"type": "text", "text": "Hello!"}
  ],
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {"input_tokens": 12, "output_tokens": 4}
}
```

`stop_reason` is one of `end_turn`, `max_tokens`, `stop_sequence`, or `tool_use`. When `stop_reason` is `stop_sequence`, the `stop_sequence` field echoes the matched string. Tool calls produce `{type:"tool_use", id, name, input}` content blocks.
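
To illustrate how a client might read the `content` array described above, here is a small Python sketch that separates text blocks from `tool_use` blocks. The function name `split_content_blocks` is hypothetical; the block shapes are the ones documented in this section.

```python
from typing import List, Tuple


def split_content_blocks(message: dict) -> Tuple[str, List[dict]]:
    """Return (joined text, tool_use blocks) from a Messages response body."""
    texts = []
    tool_calls = []
    for block in message.get("content", []):
        kind = block.get("type")
        if kind == "text":
            texts.append(block.get("text", ""))
        elif kind == "tool_use":
            # Keep only the documented fields: id, name, input.
            tool_calls.append(
                {"id": block["id"], "name": block["name"], "input": block["input"]}
            )
    return "".join(texts), tool_calls
```

A caller would typically check `stop_reason == "tool_use"` first and then dispatch each returned tool call by `name`.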
### Streaming

When `stream: true`, the response is a Server-Sent Events stream that follows Anthropic's documented event sequence:

```
event: message_start
data: {"type":"message_start","message":{...}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn","stop_sequence":null},"usage":{...}}

event: message_stop
data: {"type":"message_stop"}
```

OlliteRT also emits `event: ping` events every 10 s while the model is still in prefill so SDK clients don't time out on long on-device prefill (Gemma-4-E2B routinely takes 30–60 s to first token). Errors mid-stream surface as `event: error` with `{"type":"error","error":{"type","message"}}`.

### Known Issues

> [!WARNING]
> **Gemma 4 native tool calling is unreliable.** When a tool argument is a string containing quoted content (Bash command, Edit `old_string`, WebFetch URL, JSON-in-a-string), Gemma-4 emits its trained `<|"|>` quote delimiter for the inner quotes. LiteRT-LM 0.11.0 / 0.12.0's ANTLR function-call parser does not understand this token and raises `INVALID_ARGUMENT`, which surfaces as a 500 to the client. Affects every Anthropic tool-using client (notably Claude Code, which always sends Bash / Edit / Read tool definitions). Tracking upstream: . Workaround: turn off **Settings → Schema Injection** so tool calls go through the text-mode parser instead.
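
The typed event stream described in the Streaming subsection can be read with a small frame parser. The Python sketch below is an illustration under stated assumptions, not OlliteRT's own client code: the function names are hypothetical, the input is assumed to be SSE lines already read from the response, and `ping` payloads are never parsed because their exact body is not documented here.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_sse_frames(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, raw_data) for each event:/data: pair."""
    event = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:") and event is not None:
            yield event, line[len("data:"):].strip()
            event = None


def collect_anthropic_stream(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Return (text, stop_reason); raise RuntimeError on a mid-stream error."""
    text_parts = []
    stop_reason = None
    for event, payload in iter_sse_frames(lines):
        if event == "error":
            err = json.loads(payload).get("error", {})
            raise RuntimeError(f"{err.get('type')}: {err.get('message')}")
        if event == "content_block_delta":
            delta = json.loads(payload).get("delta", {})
            if delta.get("type") == "text_delta":
                text_parts.append(delta.get("text", ""))
        elif event == "message_delta":
            stop_reason = json.loads(payload).get("delta", {}).get("stop_reason")
        elif event == "message_stop":
            break
        # ping and all other events are ignored without parsing their payload
    return "".join(text_parts), stop_reason
```

Ignoring `ping` frames in the collector means the keep-alive traffic during long prefill needs no special handling beyond a generous read timeout on the HTTP client.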
### Example (curl, non-streaming)

```bash
curl http://PHONE_IP:8000/v1/messages \
  -H "x-api-key: your-token" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Gemma-4-E2B-it",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Say hello"}]
  }'
```

### Example (Claude Code)

```bash
ANTHROPIC_BASE_URL=http://PHONE_IP:8000 \
ANTHROPIC_AUTH_TOKEN=your-token \
claude
```

Claude Code sends `ANTHROPIC_AUTH_TOKEN` as an `Authorization: Bearer` header (`ANTHROPIC_API_KEY` is the variable that maps to `x-api-key`); OlliteRT accepts either. The `/v1` segment is appended automatically.

## Anthropic Token Counter — `POST /v1/messages/count_tokens`

Estimates the input-token count for a Messages-shaped request without running inference. Works even when no model is loaded.

The body accepts the same fields as `/v1/messages`; `max_tokens` is optional here. The response is:

```json
{"input_tokens": 1042}
```

Counts are estimated as `chars / 4` (the same heuristic OlliteRT uses across the request log). This is not a tokenizer-exact count — there is no public LiteRT tokenizer API — but it tracks within ±20% of the runtime count for English chat traffic.

## Audio Transcriptions — `POST /v1/audio/transcriptions`

Accepts an audio file via multipart/form-data and returns a text transcription. Requires a model with audio capability (e.g. Gemma 4, Gemma 3n).

### Request Body (multipart/form-data)

| Field | Type | Required | Description |
|:------|:-----|:--------:|:------------|
| `file` | file | Yes | Audio file to transcribe (max 25 MB) |
| `model` | string | No | Model name (ignored — uses the currently loaded model) |
| `language` | string | No | Language hint (e.g. `en`, `de`, `ja`) |
| `prompt` | string | No | Context hint to guide transcription |
| `temperature` | number | No | Sampling temperature override |
| `response_format` | string | No | `json` (default), `text`, or `verbose_json` |

Supported audio formats: **WAV**, **MP3**, **OGG** (Vorbis), **FLAC**.
Stereo WAV (16-bit PCM) is automatically downmixed to mono; other formats should be mono before sending.

### Response Formats

**`json`** (default) — `Content-Type: application/json`

```json
{"text": "The transcribed text from the audio file."}
```

**`text`** — `Content-Type: text/plain`

```
The transcribed text from the audio file.
```

**`verbose_json`** — `Content-Type: application/json`

```json
{
  "task": "transcribe",
  "language": "en",
  "duration": 3.456,
  "text": "The transcribed text from the audio file.",
  "segments": [{
    "id": 0,
    "seek": 0,
    "start": 0.0,
    "end": 3.456,
    "text": "The transcribed text from the audio file."
  }]
}
```

> `duration` reflects LLM inference time, not audio length. The model returns raw text without word-level timing, so the output contains a single segment spanning the full duration.

> **Note:** `srt` and `vtt` formats are not supported — the LiteRT runtime does not provide word-level timing data required for subtitle generation. Requesting these formats returns HTTP 400.

### Example (curl)

```bash
curl http://PHONE_IP:8000/v1/audio/transcriptions \
  -H "Authorization: Bearer your-token" \
  -F file=@recording.wav \
  -F response_format=json
```

## Models — `GET /v1/models`

Returns a list of available models with their capabilities and update status.

```json
{
  "object": "list",
  "data": [{
    "id": "Gemma-4-E2B-it",
    "object": "model",
    "created": 1234567890,
    "owned_by": "ollitert",
    "capabilities": {
      "image": true,
      "audio": true,
      "thinking": true,
      "speculative_decoding": true
    },
    "update_available": false
  }]
}
```

| Field | Type | Description |
|:------|:-----|:------------|
| `id` | string | Model name |
| `object` | string | Always `"model"` |
| `created` | integer | Unix timestamp |
| `owned_by` | string | Always `"ollitert"` |
| `capabilities` | object | `image`, `audio`, `thinking`, `speculative_decoding` booleans. `thinking` indicates the model supports chain-of-thought AND it is currently enabled in settings (not just model capability). `speculative_decoding` indicates MTP is supported AND enabled. |
| `update_available` | boolean | `true` if a newer version of this model is available in the allowlist |

## Model Detail — `GET /v1/models/{id}`

Returns detail for a specific model by name. The model ID is case-insensitive. Returns `404` unless the model is loaded or idle-unloaded by keep-alive.

The response has the same shape as a single entry from the `/v1/models` list.

## Health — `GET /health`

Returns server health status. Also available at `/v1/health`.

### Base Response

```json
{
  "status": "ok",
  "model": "Gemma-4-E2B-it",
  "uptime_seconds": 3600,
  "update_available": false
}
```

| Field | Type | Description |
|:------|:-----|:------------|
| `status` | string | `ok`, `idle` (keep-alive unloaded), `loading`, `stopped`, `error` |
| `model` | string | Currently loaded (or idle-unloaded) model name. Omitted if no model. |
| `uptime_seconds` | integer | Seconds since server entered RUNNING state. Omitted if not running. |
| `update_available` | boolean | `true` if a newer OlliteRT version exists |

### Extended Response — `GET /health?metrics=true`

Appends server info and a `metrics` object to the base response:

| Field | Type | Description |
|:------|:-----|:------------|
| `version` | string | OlliteRT version string |
| `thinking_enabled` | boolean | Whether chain-of-thought mode is active |
| `speculative_decoding_enabled` | boolean | Whether speculative decoding (MTP) is active |
| `accelerator` | string | `gpu`, `cpu`, or `gpu,cpu` |
| `is_idle_unloaded` | boolean | `true` if model was unloaded by keep-alive timeout |
| `metrics.requests_total` | integer | Total requests processed |
| `metrics.errors_total` | integer | Total request errors |
| `metrics.prompt_tokens_total` | integer | Total prompt tokens (estimated) |
| `metrics.generation_tokens_total` | integer | Total generated tokens (estimated) |
| `metrics.requests_text` | integer | Total text-only requests |
| `metrics.requests_image` | integer | Total image multimodal requests |
| `metrics.requests_audio` | integer | Total audio multimodal requests |
| `metrics.ttfb_last_ms` | number | Last request time to first token (ms) |
| `metrics.ttfb_avg_ms` | number | Average time to first token (ms) |
| `metrics.decode_tokens_per_second` | number | Last request decode throughput (tokens/s) |
| `metrics.decode_tokens_per_second_peak` | number | Peak decode throughput since start |
| `metrics.prefill_tokens_per_second` | number | Last request prefill throughput (tokens/s) |
| `metrics.inter_token_latency_ms` | number | Last inter-token latency (ms) |
| `metrics.request_latency_last_ms` | number | Last request total latency (ms) |
| `metrics.request_latency_avg_ms` | number | Average request latency (ms) |
| `metrics.request_latency_peak_ms` | number | Peak request latency (ms) |
| `metrics.context_utilization_percent` | number | Last request context window usage (%) |
| `metrics.model_load_time_seconds` | number | Model load/warmup time (seconds) |
| `metrics.is_inferring` | boolean | `true` if a request is currently being processed |

## Server Info — `GET /` or `GET /v1`

Returns server identity, version, status, update availability, and the full list of supported endpoints. Does not require authentication.

```json
{
  "name": "OlliteRT",
  "version": "1.2.0",
  "build": 42,
  "git_hash": "abc1234",
  "status": "running",
  "model": "Gemma-4-E2B-it",
  "uptime_seconds": 3600,
  "update_available": false,
  "allowlist_content_version": 3,
  "allowlist_source": "asset",
  "model_update_available": false,
  "compatibility": "openai",
  "endpoints": ["/v1/models", "/v1/completions", "/v1/chat/completions", "..."]
}
```

| Field | Type | Description |
|:------|:-----|:------------|
| `name` | string | Always `"OlliteRT"` |
| `version` | string | App version (e.g. `"1.2.0"`) |
| `build` | integer | Version code |
| `git_hash` | string | Build git commit hash |
| `status` | string | `running`, `idle` (keep-alive unloaded), `loading`, `stopped`, `error` |
| `model` | string | Currently loaded model name (omitted if none) |
| `uptime_seconds` | integer | Seconds since RUNNING state (omitted if not running) |
| `update_available` | boolean | `true` if a newer OlliteRT version exists |
| `latest_version` | string | Newest available version (only present when `update_available` is `true`) |
| `release_url` | string | GitHub release URL (only present when `update_available` is `true`) |
| `allowlist_content_version` | integer | Version number of the model allowlist currently cached |
| `allowlist_source` | string | Source of the active allowlist: `"asset"`, `"external:"`, `"empty"`, or `"error"` |
| `model_update_available` | boolean | `true` if the currently loaded model has a newer version in the allowlist |
| `compatibility` | string | Always `"openai"` |
| `endpoints` | array | List of supported endpoint paths |

## Error Responses

> [!NOTE]
> All errors follow the standard OpenAI error format, so existing client
libraries handle them correctly.

```json
{
  "error": {
    "message": "Model is not loaded",
    "type": "server_error",
    "param": null,
    "code": null
  }
}
```

| Status | When |
|:-------|:-----|
| `400` | Malformed request, missing required fields |
| `401` | Missing or invalid bearer token |
| `404` | Not Found — model or endpoint doesn't exist |
| `405` | Method Not Allowed — wrong HTTP method for endpoint |
| `413` | Payload Too Large — request body exceeds size limit |
| `500` | Internal server error |
| `503` | Model not loaded or server not ready |

See [Troubleshooting → Connection Issues](../TROUBLESHOOTING.md#connection-issues) for detailed explanations of each error code.

## Prometheus Metrics — `GET /metrics`

Returns server metrics in [Prometheus exposition format](https://prometheus.io/docs/instrumenting/exposition_formats/) (`text/plain; version=0.0.4`). Includes 10 counters and 19 gauges covering throughput, latency, token counts, memory, and more.

For the full list of metrics and Grafana setup, see the [Prometheus Integration Guide](../integrations/PROMETHEUS.md).
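
For quick inspection without a Prometheus server, the exposition text returned by `/metrics` can be parsed by hand. The Python sketch below is a simplified reader, assuming one sample per line with no trailing timestamp; the function name `parse_exposition` is hypothetical, and this reference does not list OlliteRT's actual metric names, so none are assumed.

```python
from typing import Dict


def parse_exposition(text: str) -> Dict[str, float]:
    """Map each sample (metric name plus any label set) to its float value.

    Simplified: assumes samples carry no trailing timestamp.
    """
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        name, _, value = line.rpartition(" ")
        if not name:
            continue  # not a "name value" pair
        samples[name.strip()] = float(value)
    return samples
```

Splitting on the last space keeps label values that contain spaces intact, at the cost of not supporting the optional timestamp field.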