---
name: parlor-on-device-ai
description: On-device, real-time multimodal AI voice and vision assistant powered by Gemma 4 E2B and Kokoro TTS, running entirely locally via FastAPI WebSocket server.
triggers:
  - "set up parlor on-device AI"
  - "run local voice AI with camera"
  - "configure parlor multimodal assistant"
  - "use Gemma 4 with Kokoro TTS locally"
  - "build real-time voice assistant on device"
  - "parlor websocket voice vision server"
  - "on-device speech and vision AI"
  - "run parlor with Apple Silicon"
---

# Parlor On-Device AI

> Skill by [ara.so](https://ara.so), Daily 2026 Skills collection.

Parlor is a real-time, on-device multimodal AI assistant. It combines Gemma 4 E2B (via LiteRT-LM) for speech and vision understanding with Kokoro TTS for voice output. Everything runs locally: no API keys, no cloud calls, no cost per request.

## Architecture

```
Browser (mic + camera)
    │
    │  WebSocket (audio PCM + JPEG frames)
    ▼
FastAPI server
    ├── Gemma 4 E2B via LiteRT-LM (GPU)  →  understands speech + vision
    └── Kokoro TTS (MLX on Mac, ONNX on Linux)  →  speaks back
    │
    │  WebSocket (streamed audio chunks)
    ▼
Browser (playback + transcript)
```

Key features:

- **Silero VAD** in browser: hands-free, no push-to-talk
- **Barge-in**: interrupt AI mid-sentence by speaking
- **Sentence-level TTS streaming**: audio starts before full response is ready
- **Platform-aware TTS**: MLX backend on Apple Silicon, ONNX on Linux

## Requirements

- Python 3.12+
- macOS with Apple Silicon **or** Linux with a supported GPU
- ~3 GB free RAM
- [`uv`](https://github.com/astral-sh/uv) package manager

## Installation

```bash
git clone https://github.com/fikrikarim/parlor.git
cd parlor

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

cd src
uv sync
uv run server.py
```

Open [http://localhost:8000](http://localhost:8000), grant camera and microphone permissions, and start talking.

Models download automatically on first run (~2.6 GB for Gemma 4 E2B, plus TTS models).
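
Before running `uv sync`, it can help to confirm that the machine meets the requirements listed above. The following is a minimal sketch, not part of Parlor itself: the function name `preflight_problems` is hypothetical, and it only checks the Python version, the platform, and whether `uv` is on `PATH`. It does not verify free RAM or whether a Linux GPU is actually supported.

```python
import platform
import shutil
import sys
from typing import List, Optional, Tuple


def preflight_problems(
    version: Tuple[int, int] = sys.version_info[:2],
    system: str = platform.system(),
    machine: str = platform.machine(),
    uv_path: Optional[str] = shutil.which("uv"),
) -> List[str]:
    """Return human-readable problems; an empty list means the basics look fine."""
    problems = []
    # The requirements section asks for Python 3.12 or newer.
    if version < (3, 12):
        problems.append(f"Python 3.12+ required, found {version[0]}.{version[1]}")
    # macOS is only supported on Apple Silicon; otherwise Linux is expected.
    if system == "Darwin":
        if machine != "arm64":
            problems.append("macOS detected, but not Apple Silicon (arm64)")
    elif system != "Linux":
        problems.append(f"Unsupported platform: {system}")
    # uv is needed for `uv sync` and `uv run server.py`.
    if uv_path is None:
        problems.append("uv not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in preflight_problems():
        print("WARNING:", problem)
```

The parameters default to values detected on the current machine, so calling `preflight_problems()` with no arguments checks the local environment; passing explicit values makes the logic easy to exercise without changing machines.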

## Configuration

Set environment variables before running:

```bash
# Use a pre-downloaded model instead of auto-downloading
export MODEL_PATH=/path/to/gemma-4-E2B-it.litertlm

# Change server port (default: 8000)
export PORT=9000

uv run server.py
```

| Variable     | Default                         | Description                          |
|--------------|---------------------------------|--------------------------------------|
| `MODEL_PATH` | auto-download from Hugging Face | Path to local `.litertlm` model file |
| `PORT`       | `8000`                          | Server port                          |

## Project Structure

```
src/
├── server.py              # FastAPI WebSocket server + Gemma 4 inference
├── tts.py                 # Platform-aware TTS (MLX on Mac, ONNX on Linux)
├── index.html             # Frontend UI (VAD, camera, audio playback)
├── pyproject.toml         # Dependencies
└── benchmarks/
    ├── bench.py           # End-to-end WebSocket benchmark
    └── benchmark_tts.py   # TTS backend comparison
```

## Key Components

### server.py: FastAPI WebSocket Server

The server exchanges data with the browser over WebSocket in both directions: audio and video frames come in from the browser, and synthesized audio streams back.

```python
# Simplified pattern from server.py
from fastapi import FastAPI, WebSocket

app = FastAPI()


@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    async for data in websocket.iter_bytes():
        # data contains PCM audio + optional JPEG frame
        # run_gemma_inference and run_tts are defined elsewhere in server.py
        response_text = await run_gemma_inference(data)
        audio_chunks = await run_tts(response_text)
        for chunk in audio_chunks:
            await websocket.send_bytes(chunk)
```

### tts.py: Platform-Aware TTS

Kokoro TTS selects its backend based on the platform:

```python
# tts.py uses platform detection
import platform


def get_tts_backend():
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        # Apple Silicon: use MLX backend for GPU acceleration
        from kokoro_mlx import KokoroMLX
        return KokoroMLX()
    else:
        # Linux: use ONNX backend
        from kokoro import KokoroPipeline
        return KokoroPipeline(lang_code='a')


tts = get_tts_backend()


# Sentence-level streaming: yields audio as each sentence is ready
async def synthesize_streaming(text: str):
    # split_sentences is a helper defined elsewhere in tts.py
    for sentence in split_sentences(text):
        audio = tts.synthesize(sentence)
        yield audio
```

### Gemma 4 E2B Inference via LiteRT-LM

```python
# LiteRT-LM inference pattern
import os

from litert_lm import LiteRTLM

model_path = os.environ.get("MODEL_PATH")

# Auto-downloads if MODEL_PATH not set
model = LiteRTLM.from_pretrained(
    "google/gemma-4-E2B-it",
    local_path=model_path,
)


async def run_gemma_inference(audio_pcm: bytes, image_jpeg: bytes | None = None):
    inputs = {"audio": audio_pcm}
    if image_jpeg:
        inputs["image"] = image_jpeg

    response = ""
    async for token in model.generate_stream(**inputs):
        response += token
    return response
```

## Running Benchmarks

```bash
cd src

# End-to-end WebSocket latency benchmark
uv run benchmarks/bench.py

# Compare TTS backends (MLX vs ONNX)
uv run benchmarks/benchmark_tts.py
```

## Performance Reference (Apple M3 Pro)

| Stage                            | Time          |
|----------------------------------|---------------|
| Speech + vision understanding    | ~1.8–2.2s     |
| Response generation (~25 tokens) | ~0.3s         |
| Text-to-speech (1–3 sentences)   | ~0.3–0.7s     |
| **Total end-to-end**             | **~2.5–3.0s** |

Decode speed: ~83 tokens/sec on GPU.

## Common Patterns

### Extending the System Prompt

Modify the prompt in `server.py` to change the AI's persona or task:

```python
SYSTEM_PROMPT = """You are a helpful language tutor.
Respond conversationally in 1-3 sentences.
If the user makes a grammar mistake, gently correct them.
You can see through the user's camera and discuss what you observe."""
```

### Adding a New Language for TTS

Kokoro supports multiple language codes. Set `lang_code` in `tts.py`:

```python
# Language codes: 'a' = American English, 'b' = British English
# 'e' = Spanish, 'f' = French, 'z' = Chinese, 'j' = Japanese
pipeline = KokoroPipeline(lang_code='e')  # Spanish
```

### Customizing VAD Sensitivity (index.html)

The Silero VAD threshold can be tuned in the frontend:

```javascript
// In index.html: lower positiveSpeechThreshold = more sensitive
const vad = await MicVAD.new({
  positiveSpeechThreshold: 0.6,   // default ~0.8, lower = triggers more easily
  negativeSpeechThreshold: 0.35,  // how quickly it stops detecting speech
  minSpeechFrames: 3,
  onSpeechStart: () => { /* UI feedback */ },
  onSpeechEnd: (audio) => sendAudioToServer(audio),
});
```

### Sending Frames Programmatically (WebSocket Client Example)

```python
import base64
import json

import websockets


async def send_audio_frame(audio_pcm_bytes: bytes, jpeg_bytes: bytes | None = None):
    uri = "ws://localhost:8000/ws"
    async with websockets.connect(uri) as ws:
        # Payload shape is illustrative; match whatever server.py actually parses
        payload = {
            "audio": base64.b64encode(audio_pcm_bytes).decode(),
        }
        if jpeg_bytes:
            payload["image"] = base64.b64encode(jpeg_bytes).decode()

        await ws.send(json.dumps(payload))

        # Receive streamed audio response
        async for message in ws:
            audio_chunk = message  # raw PCM bytes
            # play or save audio_chunk
```

## Troubleshooting

### Model download fails

```bash
# Pre-download manually via huggingface_hub
uv run python -c "
from huggingface_hub import hf_hub_download
path = hf_hub_download('google/gemma-4-E2B-it', 'gemma-4-E2B-it.litertlm')
print(path)
"
export MODEL_PATH=/path/shown/above
uv run server.py
```

### Microphone/camera not working in browser

- Open the page via `http://localhost` rather than a LAN IP address. Browsers expose camera and microphone APIs only in secure contexts, so plain HTTP on a non-localhost address is blocked.
- Check browser permissions: address bar → lock icon → reset permissions

### TTS not loading on Linux

```bash
# Ensure ONNX runtime is installed
uv add onnxruntime

# Or for GPU:
uv add onnxruntime-gpu
```

### High latency or slow inference

- Verify the GPU is being used: check for Metal (Mac) or CUDA (Linux) in the startup logs
- Close other GPU-heavy applications
- On Linux, confirm the CUDA drivers match the installed `onnxruntime-gpu` version

### Port already in use

```bash
export PORT=8080
uv run server.py

# Or kill the existing process:
lsof -ti:8000 | xargs kill
```

### `uv sync` fails: Python version mismatch

```bash
# Parlor requires Python 3.12+
python3 --version

# Install 3.12 via pyenv or system package manager, then:
uv python pin 3.12
uv sync
```

## Dependencies (pyproject.toml)

Key packages installed by `uv sync`:

- `litert-lm`: Google AI Edge inference runtime for Gemma
- `fastapi` + `uvicorn`: async web/WebSocket server
- `kokoro`: Kokoro TTS ONNX backend
- `kokoro-mlx`: Kokoro TTS MLX backend (Mac only)
- `silero-vad`: voice activity detection (browser-side via CDN)
- `huggingface-hub`: model auto-download
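
To confirm that `uv sync` installed the packages listed above, a quick importability check can be run inside the project environment. This is a hedged sketch rather than anything shipped with Parlor: `missing_modules` and `PARLOR_MODULES` are hypothetical names, and the module names are assumptions taken from the imports shown elsewhere in this document, so they may not match the real import names. `kokoro_mlx` is left out because it is Mac-only.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Assumed top-level module names, based on the imports used in this document.
PARLOR_MODULES = ["litert_lm", "fastapi", "uvicorn", "kokoro", "huggingface_hub"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    # find_spec returns None when no importable module matches the name.
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(PARLOR_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```

Saved under a name of your choosing (for example a hypothetical `check_deps.py` in `src/`), it can be run with `uv run python check_deps.py` so that it sees the same environment as `server.py`.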