# Home Assistant Integration

OlliteRT provides a built-in REST API for monitoring and controlling the server directly from Home Assistant: no HACS, no custom components, just `configuration.yaml`.

## Table of Contents

- [Overview](#overview)
- [REST Sensors](#rest-sensors)
- [REST Commands (Server Control)](#rest-commands-server-control)
- [Automation Examples](#automation-examples)
- [Auto-Generated Config](#auto-generated-config)

---

## Overview

Two ways to use OlliteRT with Home Assistant:

1. **As an LLM brain**: use OlliteRT as a conversation agent for voice assistants and automations via [Local OpenAI LLM](https://github.com/skye-harris/hass_local_openai_llm) (recommended), [Home LLM](https://github.com/acon96/home-llm/), or [Extended OpenAI Conversation](https://github.com/jekalmin/extended_openai_conversation). See the [Client Setup Guide](../CLIENT_SETUP.md#home-assistant) for setup.
2. **As a monitored server**: poll OlliteRT's health endpoint for status, metrics, and control via REST sensors and commands (documented below).

## REST Sensors

OlliteRT's `GET /health?metrics=true` endpoint returns a JSON payload with server status and performance metrics, a good fit for HA's REST sensor platform.

### Available Sensor Data

| Field | Path | Description |
|:------|:-----|:------------|
| Status | `status` | `ok`, `idle` (keep-alive unloaded), `loading`, `stopped` |
| Model | `model` | Currently loaded (or idle-unloaded) model name |
| Uptime | `uptime_seconds` | Seconds since server entered RUNNING state |
| Update Available | `update_available` | `true` if a newer OlliteRT version exists (app update, not model update) |
| Version | `version` | OlliteRT version string (e.g. `"1.2.0"`) |
| Thinking | `thinking_enabled` | Whether chain-of-thought mode is active |
| Speculative Decoding | `speculative_decoding_enabled` | Whether MTP is active |
| Accelerator | `accelerator` | `gpu`, `cpu`, or `gpu,cpu` |
| Idle Unloaded | `is_idle_unloaded` | `true` if model was unloaded by keep-alive |
| Requests | `metrics.requests_total` | Total requests processed |
| Errors | `metrics.errors_total` | Total request errors |
| TTFB (last) | `metrics.ttfb_last_ms` | Last request time to first token (ms) |
| TTFB (avg) | `metrics.ttfb_avg_ms` | Average time to first token (ms) |
| Decode Speed | `metrics.decode_tokens_per_second` | Last request decode throughput (tokens/s) |
| Peak Decode Speed | `metrics.decode_tokens_per_second_peak` | Peak decode throughput since start |
| Prefill Speed | `metrics.prefill_tokens_per_second` | Last request prefill throughput (tokens/s) |
| Inter-Token Latency | `metrics.inter_token_latency_ms` | Last inter-token latency (ms) |
| Request Latency | `metrics.request_latency_last_ms` | Last request total latency (ms) |
| Avg Latency | `metrics.request_latency_avg_ms` | Average request latency (ms) |
| Peak Latency | `metrics.request_latency_peak_ms` | Peak request latency (ms) |
| Context Usage | `metrics.context_utilization_percent` | Last request context window usage (%) |
| Model Load Time | `metrics.model_load_time_seconds` | Model load/warmup time (seconds) |
| Is Inferring | `metrics.is_inferring` | `true` if a request is currently being processed |
| Text Requests | `metrics.requests_text` | Total text-only requests |
| Image Requests | `metrics.requests_image` | Total image multimodal requests |
| Audio Requests | `metrics.requests_audio` | Total audio multimodal requests |
| Prompt Tokens | `metrics.prompt_tokens_total` | Total prompt tokens (estimated) |
| Generation Tokens | `metrics.generation_tokens_total` | Total generated tokens (estimated) |

### Example `configuration.yaml`

```yaml
rest:
  - resource: "http://PHONE_IP:8000/health?metrics=true"
    scan_interval: 30
    # Uncomment if bearer auth is enabled:
    # headers:
    #   Authorization: "Bearer your-token"
    sensor:
      - name: "OlliteRT Status"
        value_template: "{{ value_json.status }}"
      - name: "OlliteRT Model"
        value_template: "{{ value_json.model | default('none') }}"
      - name: "OlliteRT Uptime"
        value_template: "{{ value_json.uptime_seconds | default(0) }}"
        unit_of_measurement: "s"
      - name: "OlliteRT Thinking"
        value_template: "{{ value_json.thinking_enabled | default(false) }}"
      - name: "OlliteRT Speculative Decoding"
        value_template: "{{ value_json.speculative_decoding_enabled | default(false) }}"
      - name: "OlliteRT Accelerator"
        value_template: "{{ value_json.accelerator | default('unknown') }}"
      - name: "OlliteRT Idle"
        value_template: "{{ value_json.is_idle_unloaded | default(false) }}"
      - name: "OlliteRT Requests"
        value_template: "{{ value_json.metrics.requests_total | default(0) }}"
      - name: "OlliteRT Errors"
        value_template: "{{ value_json.metrics.errors_total | default(0) }}"
      - name: "OlliteRT TTFB"
        value_template: "{{ value_json.metrics.ttfb_avg_ms | default(0) }}"
        unit_of_measurement: "ms"
      - name: "OlliteRT Decode Speed"
        value_template: "{{ value_json.metrics.decode_tokens_per_second | default(0) | round(1) }}"
        unit_of_measurement: "t/s"
      - name: "OlliteRT Context Usage"
        value_template: "{{ value_json.metrics.context_utilization_percent | default(0) | round(1) }}"
        unit_of_measurement: "%"
```

## REST Commands (Server Control)

Control OlliteRT remotely from HA automations and scripts.

### Available Commands

| Endpoint | Method | Description | Payload |
|:---------|:-------|:------------|:--------|
| `/v1/server/stop` | `POST` | Stop the server | None |
| `/v1/server/reload` | `POST` | Reload the current model | None |
| `/v1/server/thinking` | `POST` | Toggle thinking mode | `{"enabled": true}` or `{"enabled": false}` |
| `/v1/server/config` | `POST` | Update inference and behavior settings | Any subset of fields (see below) |

#### `/v1/server/config` Fields

Send an empty body to read the current config, or any subset of fields to update:

| Field | Type | Description |
|:------|:-----|:------------|
| `temperature` | number | Sampling temperature (0.0 - 2.0) |
| `max_tokens` | integer | Maximum tokens to generate |
| `top_k` | integer | Top-k sampling |
| `top_p` | number | Nucleus sampling threshold |
| `thinking_enabled` | boolean | Enable chain-of-thought mode |
| `auto_truncate_history` | boolean | Auto-drop older messages when context is full |
| `auto_trim_prompts` | boolean | Hard-cut prompts as last resort when context overflows |
| `warmup_enabled` | boolean | Run warmup inference on model load |
| `keep_alive_enabled` | boolean | Enable idle timeout auto-unload |
| `keep_alive_minutes` | integer | Idle timeout duration in minutes (1 - 7200) |
| `custom_prompts_enabled` | boolean | Enable custom system prompts |
| `system_prompt` | string | Per-model system instruction text |

> [!IMPORTANT]
> When bearer authentication is enabled, all server control endpoints require the bearer token. Without auth, anyone on your network can control the server. See the [Security Guide](../SECURITY.md) for details.
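
The partial-update behaviour of `/v1/server/config` can also be exercised from a script outside Home Assistant. The following is a minimal sketch, not part of OlliteRT: `build_config_request` is a hypothetical helper that checks the two ranges listed in the field table and builds a `POST` request containing only the fields you pass. The host, port, and token are placeholders, and because the response format is not documented here, the helper stops at constructing the request.

```python
import json
import urllib.request

# Ranges taken from the field table above; other fields pass through unchecked.
RANGES = {"temperature": (0.0, 2.0), "keep_alive_minutes": (1, 7200)}


def build_config_request(base_url, fields, token=None):
    """Build (but do not send) a POST /v1/server/config request.

    Hypothetical helper for illustration only.
    An empty `fields` dict yields an empty body, which reads the current config.
    """
    for name, (low, high) in RANGES.items():
        if name in fields and not (low <= fields[name] <= high):
            raise ValueError(f"{name} must be between {low} and {high}")

    body = json.dumps(fields).encode("utf-8") if fields else b""
    headers = {"Content-Type": "application/json"}
    if token:  # only needed when bearer auth is enabled
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        f"{base_url}/v1/server/config", data=body, headers=headers, method="POST"
    )


# PHONE_IP is a placeholder, as in the YAML examples.
req = build_config_request(
    "http://PHONE_IP:8000", {"keep_alive_enabled": True, "keep_alive_minutes": 120}
)
# To send it: urllib.request.urlopen(req, timeout=10)
```

Validating on the client side keeps obviously out-of-range values from reaching the phone; how the server itself responds to invalid values is not covered by this page.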

### Example `configuration.yaml`

```yaml
rest_command:
  ollitert_stop:
    url: "http://PHONE_IP:8000/v1/server/stop"
    method: POST
    # headers:
    #   Authorization: "Bearer your-token"
    content_type: "application/json"
  ollitert_reload:
    url: "http://PHONE_IP:8000/v1/server/reload"
    method: POST
    # headers:
    #   Authorization: "Bearer your-token"
    content_type: "application/json"
  ollitert_thinking:
    url: "http://PHONE_IP:8000/v1/server/thinking"
    method: POST
    # headers:
    #   Authorization: "Bearer your-token"
    content_type: "application/json"
    # "| lower" renders a YAML boolean as JSON true/false instead of True/False
    payload: '{"enabled": {{ enabled | lower }}}'
  ollitert_config:
    url: "http://PHONE_IP:8000/v1/server/config"
    method: POST
    # headers:
    #   Authorization: "Bearer your-token"
    content_type: "application/json"
    payload: '{{ payload }}'
```

### Automation Examples

> [!TIP]
> Instead of automating temperature changes, you can enable **Ignore Client Sampler Parameters** in OlliteRT Settings → Model Behaviour to always use your own per-model inference settings regardless of what HA sends.

**Lower temperature when switching to a tool-calling automation** (see [Troubleshooting → Tool Calling](../TROUBLESHOOTING.md#tool-calling-experimental) for more tips):

```yaml
automation:
  - alias: "OlliteRT: Set low temperature for HA tools"
    trigger:
      - platform: state
        entity_id: input_boolean.ha_tool_mode
        to: "on"
    action:
      - service: rest_command.ollitert_config
        data:
          payload: '{"temperature": 0.78}'
```

**Extend keep-alive timeout when expecting frequent queries:**

```yaml
automation:
  - alias: "OlliteRT: Extend keep-alive during active hours"
    trigger:
      - platform: time
        at: "08:00:00"
    action:
      - service: rest_command.ollitert_config
        data:
          payload: '{"keep_alive_enabled": true, "keep_alive_minutes": 120}'
```

**Set a system prompt for a specific automation context:**

```yaml
automation:
  - alias: "OlliteRT: Configure for home control"
    trigger:
      - platform: state
        entity_id: input_boolean.home_assistant_mode
        to: "on"
    action:
      - service: rest_command.ollitert_config
        data:
          payload: '{"custom_prompts_enabled": true, "system_prompt": "You are a smart home assistant. Answer concisely."}'
```

## Auto-Generated Config

> [!TIP]
> OlliteRT's Settings screen includes a **"Copy Configuration"** button that generates a complete `configuration.yaml` snippet pre-filled with your device's current IP address, port, and bearer token. No manual editing is needed: just paste it into your `configuration.yaml`.

## Anthropic Messages API

OlliteRT speaks the Anthropic Messages API on `/v1/messages` and `/v1/messages/count_tokens` in addition to the OpenAI-compatible endpoints. Anthropic SDKs and Claude Code can target the phone directly with no proxy.

### Authentication

Both `Authorization: Bearer your-token` and `x-api-key: your-token` are accepted. The `x-api-key` header carries the raw token with no `Bearer` prefix; this is the form the official Anthropic SDKs use.

### Curl example: non-streaming

```bash
curl -s http://PHONE_IP:8000/v1/messages \
  -H "x-api-key: your-token" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "local",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Say hi"}]
  }'
```

### Curl example: streaming

```bash
curl -N http://PHONE_IP:8000/v1/messages \
  -H "x-api-key: your-token" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "local",
    "max_tokens": 256,
    "stream": true,
    "messages": [{"role": "user", "content": "Stream a haiku"}]
  }'
```

### Curl example: count_tokens

```bash
curl -s http://PHONE_IP:8000/v1/messages/count_tokens \
  -H "x-api-key: your-token" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "How many tokens is this?"}]
  }'
```

### Claude Code

```bash
export ANTHROPIC_BASE_URL="http://PHONE_IP:8000"
export ANTHROPIC_AUTH_TOKEN="your-token"
claude
```

Claude Code sends `ANTHROPIC_AUTH_TOKEN` as an `Authorization: Bearer` header; set `ANTHROPIC_API_KEY` instead if you want the token sent as `x-api-key`. OlliteRT accepts either form. The `/v1` segment is appended automatically.
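
A client that consumes the streaming endpoint without an SDK has to reassemble the reply from server-sent events. The sketch below assumes OlliteRT emits the standard Anthropic streaming events (`content_block_delta` carrying a `text_delta`), which this page implies but does not spell out. `collect_text` is a hypothetical helper, and the sample lines are hand-written in that format rather than captured from a server.

```python
import json


def collect_text(sse_lines):
    """Join the text deltas found in an iterable of SSE lines.

    Hypothetical helper; assumes Anthropic-style streaming events.
    """
    parts = []
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip "event:" lines and blank separators
        event = json.loads(line[len("data:"):])
        if event.get("type") != "content_block_delta":
            continue  # message_start, ping, message_stop, and so on
        delta = event.get("delta", {})
        if delta.get("type") == "text_delta":
            parts.append(delta.get("text", ""))
    return "".join(parts)


# Hand-written sample in the Anthropic event format (not real server output).
sample = [
    "event: message_start",
    'data: {"type": "message_start", "message": {"role": "assistant"}}',
    "",
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hel"}}',
    "",
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "lo"}}',
    "",
    "event: message_stop",
    'data: {"type": "message_stop"}',
]
print(collect_text(sample))  # prints Hello
```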

### Limitations

- Prompt caching (`cache_control`) is accepted but discarded: there is no KV reuse beyond what LiteRT does internally.
- URL image sources, `document` blocks, and computer-use tool types are rejected with a 400.
- When tools are present and schema injection is OFF, output is buffered and arrives at end-of-stream (a constraint shared with `/v1/chat/completions`). With schema injection ON, progressive streaming works.
- The Batches API and Files API are not implemented.
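
Since URL image sources are rejected, images have to travel inline in the request body. The sketch below assumes OlliteRT accepts the standard Anthropic base64 image block; this page only lists what is rejected, so treat the accepted shape as an assumption. `image_block` is a hypothetical helper and the bytes are dummy data.

```python
import base64


def image_block(data, media_type="image/png"):
    """Wrap raw image bytes as an Anthropic-style base64 content block.

    Hypothetical helper; the block shape follows the Anthropic Messages API.
    """
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(data).decode("ascii"),
        },
    }


# Dummy bytes stand in for a real file, e.g. open("photo.png", "rb").read().
message = {
    "role": "user",
    "content": [
        image_block(b"\x89PNG dummy bytes"),
        {"type": "text", "text": "Describe this image."},
    ],
}
```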