# Telemetry and diagnostics

Oppi records diagnostics to answer one question: **does the app feel fast, reliable, and safe while someone is supervising a Pi session from Apple clients?**

This page is the public source of truth for what telemetry exists, how it is gated, where it is stored, and which metrics matter for user experience.

## Scope

This page covers:

- Apple client diagnostics: MetricKit payloads, chat/runtime metrics, device resource samples, and redacted client logs.
- Server diagnostics: operational metrics, resource samples, and review commands.
- Privacy gates and storage paths.
- The small set of experience metrics that belong on release and quality dashboards.

This page does not cover:

- Pi’s own anonymous install/update telemetry. See Pi documentation for that behavior.
- Provider billing or model-side observability outside Oppi.
- Raw session content review. Oppi can inspect Pi session files locally, but telemetry uploads must not contain prompt text, assistant output, tool arguments, or dictation transcript content.

## Quick review commands

From `server/`:

```bash
npm run telemetry:review -- --days 1 --wide
npm run telemetry:client-logs -- --days 1 --limit 30
npm run telemetry:client-logs -- --hours 3 --limit 30
npm run telemetry:metrickit -- --days 14 --limit 50
npm run telemetry:server-log -- --days 1 --limit 30
npm run diagnostics:review -- --days 1
```

For a release gate:

```bash
npm run telemetry:review:gate
```

For local Grafana:

```bash
npm run telemetry:grafana:up
# Open http://localhost:13001, default login admin/admin
```

## Privacy model

Oppi diagnostics are designed for self-hosted debugging.

- Public iOS release builds set `OPPI_TELEMETRY_MODE=public`.
- Public iOS release builds do not upload diagnostics unless the user enables **Settings → Diagnostics → Send Diagnostics to Server**.
- iOS builds do not link an external crash-reporting SDK; diagnostics upload only to the configured Oppi server.
- Internal/debug builds set `OPPI_TELEMETRY_MODE=internal` and upload diagnostics automatically to the configured Oppi server.
- The server also enforces `OPPI_TELEMETRY_MODE`: when it is set to `public`, `release`, `prod`, `off`, or an equivalent value, the server rejects telemetry uploads with HTTP 403.

Telemetry must not include:

- prompt text
- assistant output
- tool arguments
- command output content
- dictation transcript text
- raw URLs, LAN IPs, local file paths, tokens, secrets, or credentials

Allowed diagnostic data:

- low-cardinality IDs such as session ID, workspace ID, app instance ID, and boot ID
- low-cardinality tags such as `transport=lan`, `status=ok`, `error_kind=network`, or `tool=bash`
- numeric timings, counts, ratios, byte counts, and resource samples
- sanitized MetricKit summaries and crash diagnostics
- redacted client logs

## Upload channels

| Channel | Endpoint | Stored at | Purpose |
|---|---|---|---|
| Chat metrics | `POST /telemetry/chat-metrics` | `/diagnostics/telemetry/chat-metrics-YYYY-MM-DD.jsonl` | Client UX, rendering, queueing, dictation, and device metrics. |
| MetricKit | `POST /telemetry/metrickit` | `/diagnostics/telemetry/metrickit-YYYY-MM-DD.jsonl` | Apple crash, hang, CPU, disk, and battery diagnostics. |
| Client logs | `POST /telemetry/client-logs` | `/diagnostics/telemetry/client-logs-YYYY-MM-DD.jsonl` | Redacted warning/error events and selected high-value info logs. |
| Server resource metrics | local JSONL writer | `/diagnostics/telemetry/server-metrics-YYYY-MM-DD.jsonl` | Server CPU, memory, event loop, sessions, and WebSocket counts. |
| Server ops metrics | local JSONL writer | `/diagnostics/telemetry/server-ops-metrics-YYYY-MM-DD.jsonl` | Server WebSocket, session, turn, extension UI, dictation, retry, and compaction metrics. |
| Server log | local JSONL/text log | `/server.log` | Structured server events and warnings. |

Retention defaults:

| Data | Default retention | Environment override |
|---|---:|---|
| MetricKit | 14 days | `OPPI_METRICKIT_RETENTION_DAYS` |
| Chat metrics | 14 days | `OPPI_CHAT_METRICS_RETENTION_DAYS` |
| Client logs | 14 days | `OPPI_CHAT_METRICS_RETENTION_DAYS` |
| Server resource metrics | 30 days | `OPPI_SERVER_METRICS_RETENTION_DAYS` |
| Server ops metrics | 30 days | `OPPI_SERVER_OPS_METRICS_RETENTION_DAYS` |

## Metric taxonomy

Use this split when reading dashboards or telemetry reviews:

| Category | Meaning | Examples | How to read it |
|---|---|---|---|
| UX responsiveness | The user is waiting for the app, stream, or media to become usable. | `chat.ttft_ms`, `chat.session_load_ms`, `chat.ws_wait_for_connected_ms`, `chat.media_playback_start_ms` | High values are user-visible latency. These belong on the front page and can have SLOs. |
| Reliability counters | A user action, stream, or render path failed or recovered. | `chat.message_queue_stale_drop`, `chat.app_event_stream_reconnect`, `chat.media_playback_error`, client logs | Trend toward zero; drill into logs and tags. |
| Agent workload and progress | The agent is actively doing work. | `server.turn_duration_ms`, `server.turn_tool_calls`, `server.turn_input_tokens`, `chat.session_files_changed` | Long values are not automatically bad. Correlate with progress, tokens, tools, file changes, errors, and TTFT before calling it a stall. |
| Resource health | Local client/server pressure that can make UX worse. | `device.memory_mb`, `server.heap_mb`, `server.event_loop_lag_ms` | Diagnose capacity or leaks; do not confuse with agent productivity. |
| Drill-down internals | Mechanical sub-steps used to explain a front-page metric. | `chat.queue_sync_ms`, `server.session_subscribe_ms`, `chat.render_strategy_ms` | Keep available, but do not let them define product health by themselves. |

`server.turn_duration_ms` is workload telemetry. It measures the full wall-clock duration of an agent turn.
A long turn can mean the agent is handling a large task, running tools, editing files, waiting on tests, or processing a large context. Treat it as a problem only when it combines with missing progress signals, high first-token latency, stuck tool calls, errors, blocked asks, or disconnected clients.

## Experience metrics that belong on the front page

The front page should focus on metrics that map directly to user experience. Low-level counters stay available for drill-down, but they should not define the product health story.

### App and session responsiveness

| Metric | Why it matters |
|---|---|
| `chat.app_launch_ms` | Time until the app presents useful content. |
| `chat.workspace_load_ms` | Time until the workspace screen is usable. |
| `chat.session_load_ms` | Time from selecting a session to chat content visible. |
| `chat.session_switch_ms` | Session row tap-to-content latency. |
| `chat.ttft_ms` | User-perceived time to first assistant response token. |
| `chat.fresh_content_lag_ms` | Delay between new stream content and visible timeline freshness. |

### Connection and queue reliability

| Metric or source | Why it matters |
|---|---|
| `chat.ws_wait_for_connected_ms` | Client wait before commands can use the focused session stream. |
| `server.ws_handshake_ms` | Server-side WebSocket upgrade latency. |
| `server.session_subscribe_ms` | Server session subscribe/catch-up path latency. |
| `chat.queue_sync_ms` | Time to refresh queued steer/follow-up state. |
| `chat.message_queue_ack_ms` | Time from queue command send to server acknowledgement. |
| `chat.message_queue_stale_drop` | User input was dropped because the client had stale session state. |
| `chat.app_event_stream_connect_ms` | Time until global app updates, attention cards, and session list events are live. |
| `chat.app_event_stream_reconnect` | Global app-event stream reconnect attempts and exhaustion. |
| `chat.app_event_stream_decode_error` | Global app-event stream payload/schema failures. |
| `server.ws_ping_timeout` | Dead connection detection. |
| Client logs: `WebSocket`, `AppEventStream`, `Network` | Reconnect storms, HTTP 1011s, POSIX disconnects, endpoint changes, and app-event stream failures. |

### Attention and media interactions

| Metric | Why it matters |
|---|---|
| `chat.ask_response_ms` | How long the visible ask card blocked the agent before answer or ignore. |
| `chat.media_playback_start_ms` | Time from media preview/player setup to playable video or audio. |
| `chat.media_playback_error` | User-visible media preview/player failures by media kind and phase. |
| Client logs: `MediaPlayback` | Playback/source failures with privacy-safe kind, source, phase, mode, and error class. |

### Timeline rendering and scrolling

| Metric | Why it matters |
|---|---|
| `chat.timeline_apply_ms` | Snapshot/reducer apply time during streaming. |
| `chat.timeline_layout_ms` | UIKit layout cost during streaming. |
| `chat.cell_configure_ms` | Row rendering cost, especially large tool output rows. |
| `chat.markdown_streaming_ms` | Markdown streaming parse/build/apply cost. |
| `chat.jank_pct` | Percentage of render cycles over frame budget. |
| `chat.timeline_hitch` | Count of detected frame-budget hitches. |

### Dictation and voice input

| Metric | Why it matters |
|---|---|
| `chat.dictation_setup_ms` | Time from starting dictation to ready state. |
| `chat.dictation_first_result_ms` | Time until the user sees first transcript feedback. |
| `chat.dictation_finalize_ms` | Stop-to-final-result latency. |
| `chat.dictation_preview_final_delta` | How much final text changed from the preview. |
| `server.dictation_stt_ms` | Backend STT latency. |
| `server.dictation_stt_audio_ratio` | STT real-time factor. |
| `chat.dictation_error` and `server.dictation_error` | User-visible dictation failures. |

### Resource health and crash diagnostics

| Metric or source | Why it matters |
|---|---|
| `device.cpu_pct` | Client CPU usage during real interaction. |
| `device.memory_mb` | Client memory footprint. |
| `device.memory_available_mb` | Headroom before jetsam; low is bad. |
| `device.thermal_state` | Thermal pressure that can make the app feel slow. |
| `server.cpu_total` | Server CPU saturation. |
| `server.event_loop_lag_ms` | Server event-loop delay during sampler intervals. |
| `server.rss_mb` and `server.heap_mb` | Server memory pressure. |
| `server.sessions_total` and `server.ws_connections` | Local server concurrency pressure. |
| MetricKit diagnostics | Crashes, hangs, CPU exceptions, and disk-write exceptions. |
| Client logs | Redacted diagnostic context for what happened before failure. |

## Informational metrics policy

Informational metrics are useful for debugging, but they must not drown out the experience story.

A metric belongs on the front page only when it has all of these:

1. A user-visible question it answers.
2. A unit and owner.
3. A bounded tag set.
4. A release threshold, trend, or explicit investigation use.
5. Enough context to debug, usually `sessionId`, `workspaceId`, and low-cardinality tags.

Keep informational metrics when they are low-volume or needed for drill-down. Aggregate, sample, demote, or remove them when they are high-volume and do not affect a release decision.

Server operational metrics can be aggregated before storage. For sum-aggregated counters, `value` is the event sum for the roughly one-minute flush bucket; dashboards should use `SUM(value)`, not row count, when showing throughput. For max-aggregated gauges such as fanout and ring utilization, `value` is the peak seen in the flush bucket.

Current handling guidance:

| Pattern | Action |
|---|---|
| Per-message counters such as `server.ws_message_sent` | Keep for drill-down, but sum-aggregate by path/type/status instead of showing raw sample volume. |
| Ephemeral UI counters such as `chat.tool_update_count` | Keep only if they explain a UX metric; otherwise sample or summarize by session. |
| Server HTTP timings such as `server.http_request_ms` | Promote only route groups that affect visible UX; fast successful health/stats/capability/navigation/telemetry-upload routes are threshold-gated. |
| Server drill-down gauges such as `server.broadcast_fanout` and `server.event_ring_utilization` | Max-aggregate by bounded tags; use peaks for diagnostics instead of raw sample counts. |
| Server resource snapshots such as active-session peak | Prefer the structured server resource sample over duplicating the same value into server-ops metrics. |
| Paired metrics such as coalescer events/bytes | Aggregate over a larger window and drop tiny partial drain windows; keep them as drill-down, not front-page rows. |
| Render drill-down metrics such as `chat.render_strategy_ms` | Keep signposts for every operation, but upload only non-trivial samples. |
| Routine command metrics such as successful `get_queue` command send/resolve/roundtrip | Keep errors and slow samples; rely on queue UX metrics for the normal success path. |
| Server token and cost counters such as `server.turn_input_tokens` and `server.turn_cost` | Sum-aggregate before storage; dashboards should use `SUM(value)`. |
| Client session-size snapshots such as `chat.session_input_tokens` | Emit only stable non-empty snapshots; dashboards should treat them as latest/max capacity diagnostics, not per-update events. |
| Rare error counters | Prefer a single error metric with a `reason` tag over many near-zero standalone metrics. |
| Stale or redundant metrics | Stop emitting them and keep import compatibility only for archived dashboards. |

## How Oppi uses Pi observability

Pi persists sessions as JSONL and emits structured `AgentSessionEvent` values for lifecycle, streaming, tool execution, retry, compaction, and queue state.
Oppi uses those primitives as the raw truth, then derives user-facing diagnostics:

- server turn duration and server-side time to first token
- token and cost snapshots
- tool-call counts and mutating-file stats
- retries, compactions, and compaction duration
- ask and extension UI round-trip timing
- session summaries and catch-up events for clients

Use Pi session files for forensic replay. Use Oppi metrics to answer whether the app felt good and where the interaction became slow, unreliable, or unsafe.

## Local dashboards and importer

The optional Grafana stack imports JSONL files into SQLite and serves prebuilt dashboards.

```bash
cd server
npm run telemetry:grafana:up
```

Importer notes:

- reads JSONL from `${OPPI_DATA_DIR:-~/.config/oppi}/diagnostics/telemetry/*.jsonl`
- writes SQLite into a Docker-managed volume for Grafana; Grafana opens it read-write so SQLite WAL-mode read queries can create sidecar shared-memory files
- can also run manually with `npm run telemetry:import`
- normalizes append-only daily JSONL files incrementally
- flattens common server-op tags for split-stream panels

See `server/README.md` for the full dashboard runbook.

## Tag vocabulary

Use the same tag names across clients, server metrics, logs, and dashboards when possible.

| Tag | Use |
|---|---|
| `status` | Mechanical request outcome: `ok`, `error`, `cancelled`, or `timeout`. |
| `result` | Domain result, such as catch-up result: `applied`, `no_gap`, `ring_miss`, `fetch_failed`. |
| `reason` | Why an event happened, such as `capabilityRefreshFailed` or `idle_timeout`. |
| `transport` | User-selected or active transport path for metric samples: `lan`, `paired`, or `unknown`. |
| `streamRole` | WebSocket role in client logs, such as `focused_session` or another low-cardinality stream name. |
| `error_kind` | Coarse error class for metrics: `network`, `timeout`, `decode`, `cancelled`, `not_connected`, or `other`. |

Prefer logs over metrics for raw platform error details such as `NSURLErrorDomain`, HTTP status, or WebSocket close code. Use metrics for bounded counts, durations, and ratios.

## Adding or changing a metric

Before adding a metric, answer these questions in the code review:

1. What user experience question does this metric answer?
2. Is this a front-page metric, drill-down metric, or temporary investigation metric?
3. What is the unit?
4. What tags are allowed, and are they bounded?
5. Does the sample include session/workspace context when that is useful?
6. What is the removal or promotion trigger if this is temporary?

Implementation rules:

- Add chat/client metrics to `server/src/types/telemetry.ts` first.
- Mirror chat/client metric names in `clients/apple/Oppi/Core/Services/MetricKitModels.swift`.
- Add server operational metrics to `server/src/server-metric-registry.ts` first.
- Update tests when a metric contract changes.
- Prefer fewer, better-shaped metrics over broad logging-by-metric.
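
The review questions above can also be expressed as data. The following is a minimal sketch and not Oppi's actual registry: `MetricDefinition`, `MetricSample`, `validateDefinition`, `validateSample`, and the example entry are hypothetical names introduced for illustration, and the real shapes in `server/src/types/telemetry.ts` and `server/src/server-metric-registry.ts` may differ. It assumes a definition carries the question, unit, tier, and bounded tag vocabulary, so a check can flag definitions or samples that drift outside them.

```typescript
// Hypothetical shapes for illustration only; not the actual Oppi registry types.
type MetricTier = "front_page" | "drill_down" | "temporary";

interface MetricDefinition {
  name: string;
  unit: "ms" | "count" | "ratio" | "bytes" | "pct" | "mb";
  tier: MetricTier;
  question: string; // the user experience question this metric answers
  allowedTags: Record<string, readonly string[]>; // bounded tag vocabulary
  removalTrigger?: string; // expected for temporary investigation metrics
}

interface MetricSample {
  metric: string;
  value: number;
  tags: Record<string, string>;
  sessionId?: string;
  workspaceId?: string;
}

// Example entry. The metric name and tag values appear on this page;
// attaching these particular tags to this metric is an assumption.
const wsWait: MetricDefinition = {
  name: "chat.ws_wait_for_connected_ms",
  unit: "ms",
  tier: "front_page",
  question: "How long does the client wait before commands can use the focused session stream?",
  allowedTags: {
    transport: ["lan", "paired", "unknown"],
    status: ["ok", "error", "cancelled", "timeout"],
  },
};

// Checks a definition against the review questions.
function validateDefinition(def: MetricDefinition): string[] {
  const problems: string[] = [];
  if (def.question.trim() === "") {
    problems.push("missing user experience question");
  }
  if (def.tier === "temporary" && !def.removalTrigger) {
    problems.push("temporary metric needs a removal or promotion trigger");
  }
  for (const tag of Object.keys(def.allowedTags)) {
    if (def.allowedTags[tag].length === 0) {
      problems.push(`tag "${tag}" has no bounded value list`);
    }
  }
  return problems;
}

// Checks a sample against its definition. Offending tag values are not
// echoed back, so a raw URL or IP cannot leak through the error message.
function validateSample(def: MetricDefinition, sample: MetricSample): string[] {
  const problems: string[] = [];
  if (sample.metric !== def.name) {
    problems.push("sample does not match definition name");
  }
  if (!Number.isFinite(sample.value)) {
    problems.push("value must be a finite number");
  }
  for (const tag of Object.keys(sample.tags)) {
    const allowed = def.allowedTags[tag];
    if (allowed === undefined) {
      problems.push(`unknown tag "${tag}"`);
    } else if (allowed.indexOf(sample.tags[tag]) === -1) {
      problems.push(`tag "${tag}" has a value outside its bounded set`);
    }
  }
  return problems;
}
```

A check like this could run in unit tests whenever a metric contract changes. Both functions return a list of problems instead of throwing, so a review tool could report every violation at once.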