# Observability

Ryco has one server-side observability model:

- pretty logs go to stdout for humans
- completed spans go to a local NDJSON trace file
- traces and metrics can also be exported over OTLP to a real backend like Grafana LGTM

The local trace file is the persisted source of truth. There is no separate persisted server log file anymore.

## Where To Find Things

### Logs

Logs are human-facing only:

- destination: stdout
- format: `Logger.consolePretty()`
- persistence: none

If you want a log message to show up in the trace file, emit it inside an active span with `Effect.log...`. `Logger.tracerLogger` will attach it as a span event.

### Traces

Completed spans are written as NDJSON records to `serverTracePath` (by default, `~/.ryco/userdata/logs/server.trace.ndjson`; dev mode uses `$RYCO_HOME/dev/logs/server.trace.ndjson`).

Important fields in each record:

- `name`: span name
- `traceId`, `spanId`, `parentSpanId`: correlation
- `durationMs`: elapsed time
- `attributes`: structured context
- `events`: embedded logs and custom events
- `exit`: `Success`, `Failure`, or `Interrupted`

The schema lives in `apps/server/src/observability/TraceRecord.ts`.

### Metrics

Metrics are not written to a local file.

- local persistence: none
- remote export: OTLP only, when configured
- current definitions: `apps/server/src/observability/Metrics.ts`
- in-app snapshot: `server.getDiagnosticsMetrics` RPC exposes rolling-window stats for the Diagnostics panel (turn quiescence average, checkpoint duration p95, websocket reconnect count). These counters live in process memory only and reset whenever the server restarts.

If OTLP is not configured, metrics still exist in-process, but you will not have a local artifact to inspect outside the Diagnostics panel.

### Related Artifacts

Provider event NDJSON files still exist for provider runtime streams. Those are separate from the main server trace file.

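As a rough illustration of the trace record fields listed under Traces, the sketch below parses NDJSON text and groups non-successful spans by `traceId`. The `TraceRecordSketch` type and both helper functions are hypothetical names introduced here, and the sample records are made up; the authoritative schema is the one in `apps/server/src/observability/TraceRecord.ts`, and any detail of `exit` beyond its `_tag` is an assumption.

```typescript
// Approximate shape of one line in server.trace.ndjson, based only on the
// fields listed in this document. The real schema lives in TraceRecord.ts.
interface TraceRecordSketch {
  name: string;
  traceId: string;
  spanId: string;
  parentSpanId?: string | null;
  durationMs: number;
  attributes: Record<string, unknown>;
  events: Array<{ name: string; attributes?: Record<string, unknown> }>;
  exit: { _tag: "Success" | "Failure" | "Interrupted" };
}

// Parse NDJSON text: one JSON document per non-empty line.
function parseTraceNdjson(text: string): TraceRecordSketch[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as TraceRecordSketch);
}

// Group every span whose exit is not Success by its traceId.
function failedSpansByTrace(
  records: TraceRecordSketch[],
): Map<string, TraceRecordSketch[]> {
  const byTrace = new Map<string, TraceRecordSketch[]>();
  for (const record of records) {
    if (record.exit._tag === "Success") continue;
    const bucket = byTrace.get(record.traceId) ?? [];
    bucket.push(record);
    byTrace.set(record.traceId, bucket);
  }
  return byTrace;
}

// Made-up sample data, for illustration only.
const sample = [
  { name: "sql.execute", traceId: "t1", spanId: "s1", durationMs: 12, attributes: {}, events: [], exit: { _tag: "Success" } },
  { name: "git.runCommand", traceId: "t2", spanId: "s2", durationMs: 1500, attributes: {}, events: [], exit: { _tag: "Failure" } },
  { name: "provider.sendTurn", traceId: "t2", spanId: "s3", parentSpanId: "s2", durationMs: 40, attributes: {}, events: [], exit: { _tag: "Interrupted" } },
]
  .map((record) => JSON.stringify(record))
  .join("\n");

const failures = failedSpansByTrace(parseTraceNdjson(sample));
failures.forEach((spans, traceId) => {
  const summary = spans.map((s) => `${s.name} (${s.exit._tag})`).join(", ");
  console.log(`${traceId}: ${summary}`);
});
```

To run this against a real file, the sample string could be replaced with the contents of `server.trace.ndjson` read from disk.
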
## Run The Server In Instrumented Mode

There are two useful modes:

- local-only: stdout + local `server.trace.ndjson`
- full local observability: stdout + local trace file + OTLP export to Grafana/Tempo/Prometheus

The local trace file is always on. OTLP export is opt-in.

### Option 1: Local Traces Only

You do not need any extra env vars. Just run the app normally and inspect `server.trace.ndjson`.

Examples:

```bash
npx ryco-cli
```

```bash
bun run dev
```

```bash
bun run dev:desktop
```

### Option 2: Run With A Local LGTM Stack

#### 1. Start Grafana LGTM

```bash
docker run --name lgtm \
  -p 3000:3000 \
  -p 4317:4317 \
  -p 4318:4318 \
  --rm -ti \
  grafana/otel-lgtm
```

Then open `http://localhost:3000`.

Default Grafana login:

- username: `admin`
- password: `admin`

#### 2. Export OTLP env vars

```bash
export RYCO_OTLP_TRACES_URL=http://localhost:4318/v1/traces
export RYCO_OTLP_METRICS_URL=http://localhost:4318/v1/metrics
export RYCO_OTLP_SERVICE_NAME=ryco-local
```

Optional:

```bash
export RYCO_TRACE_MIN_LEVEL=Info
export RYCO_TRACE_TIMING_ENABLED=true
```

#### 3. Launch the app from that same shell

CLI:

```bash
npx ryco-cli
```

Monorepo web/server dev:

```bash
bun run dev
```

Monorepo desktop dev:

```bash
bun run dev:desktop
```

Packaged desktop app:

Launch the actual app executable from the same shell so the desktop app and embedded backend inherit `RYCO_OTLP_*`.

macOS app bundle example:

```bash
RYCO_OTLP_TRACES_URL=http://localhost:4318/v1/traces \
RYCO_OTLP_METRICS_URL=http://localhost:4318/v1/metrics \
RYCO_OTLP_SERVICE_NAME=ryco-desktop \
"/Applications/Ryco.app/Contents/MacOS/Ryco"
```

Direct binary example:

```bash
RYCO_OTLP_TRACES_URL=http://localhost:4318/v1/traces \
RYCO_OTLP_METRICS_URL=http://localhost:4318/v1/metrics \
RYCO_OTLP_SERVICE_NAME=ryco-desktop \
./path/to/your/desktop-app-binary
```

Do not rely on launching from Finder, Spotlight, the dock, or the Start menu after setting shell env vars. Those launches usually will not pick them up.

#### 4. Fully restart after changing env

The backend reads observability config at process start. If you change OTLP env vars, stop the app completely and start it again.

## How To Use Traces And Metrics To Debug The Server

### Start With The Local Trace File

The trace file is the fastest way to inspect raw span data.

Tail it:

```bash
tail -f "$RYCO_HOME/userdata/logs/server.trace.ndjson"
```

In monorepo dev, use:

```bash
tail -f ./dev/logs/server.trace.ndjson
```

Show failed spans:

```bash
jq -c 'select(.exit._tag != "Success") | {
  name,
  durationMs,
  exit,
  attributes
}' "$RYCO_HOME/userdata/logs/server.trace.ndjson"
```

Show slow spans:

```bash
jq -c 'select(.durationMs > 1000) | {
  name,
  durationMs,
  traceId,
  spanId
}' "$RYCO_HOME/userdata/logs/server.trace.ndjson"
```

Inspect embedded log events:

```bash
jq -c 'select(any(.events[]?; .attributes["effect.logLevel"] != null)) | {
  name,
  durationMs,
  events: [
    .events[]
    | select(.attributes["effect.logLevel"] != null)
    | {
        message: .name,
        level: .attributes["effect.logLevel"]
      }
  ]
}' "$RYCO_HOME/userdata/logs/server.trace.ndjson"
```

Follow one trace:

```bash
jq -r 'select(.traceId == "TRACE_ID_HERE") | [
  .name,
  .spanId,
  (.parentSpanId // "-"),
  .durationMs
] | @tsv' "$RYCO_HOME/userdata/logs/server.trace.ndjson"
```

Filter orchestration commands:

```bash
jq -c 'select(.attributes["orchestration.command_type"] != null) | {
  name,
  durationMs,
  commandType: .attributes["orchestration.command_type"],
  aggregateKind: .attributes["orchestration.aggregate_kind"]
}' "$RYCO_HOME/userdata/logs/server.trace.ndjson"
```

Filter git activity:

```bash
jq -c 'select(.attributes["git.operation"] != null) | {
  name,
  durationMs,
  operation: .attributes["git.operation"],
  cwd: .attributes["git.cwd"],
  hookEvents: [
    .events[]
    | select(.name == "git.hook.started" or .name == "git.hook.finished")
  ]
}' "$RYCO_HOME/userdata/logs/server.trace.ndjson"
```

### Use Tempo When You Need A Real Trace Viewer

Tempo is better than raw NDJSON when you want to:

- search across many traces
- inspect parent/child relationships visually
- compare many slow traces
- drill into one failing request without hand-joining by `traceId`

Recommended flow in Grafana:

1. Open `Explore`.
2. Pick the `Tempo` data source.
3. Set the time range to something recent like `Last 15 minutes`.
4. Start broad. Do not begin with a very narrow query.
5. Look for spans from your configured service name, then narrow by span name or attributes.

Good first searches:

- service name such as `ryco-local`, `ryco-dev`, or `ryco-desktop`
- span names like `sql.execute`, `git.runCommand`, `provider.sendTurn`
- orchestration spans with attributes like `orchestration.command_type`

Once you know traces are arriving, narrower TraceQL queries like `name = "sql.execute"` become useful.

### Use Metrics To See Systemic Problems

Traces are best for one request. Metrics are best for trends.

Good metric families to watch:

- `ryco_rpc_request_duration`
- `ryco_orchestration_command_duration`
- `ryco_orchestration_command_ack_duration`
- `ryco_provider_turn_duration`
- `ryco_git_command_duration`
- `ryco_db_query_duration`

Counters tell you volume and failure rate:

- `ryco_rpc_requests_total`
- `ryco_orchestration_commands_total`
- `ryco_provider_turns_total`
- `ryco_git_commands_total`
- `ryco_db_queries_total`

Use metrics when the question is:

- "is this always slow?"
- "did this get worse after a change?"
- "which command type is failing most often?"

Use traces when the question is:

- "what happened in this specific request?"
- "which child span caused this one slow interaction?"
- "what logs were emitted inside the failing flow?"

### What The New Ack Metric Means

`ryco_orchestration_command_ack_duration` measures:

- start: command dispatch enters the orchestration engine
- end: the first committed domain event for that command is published by the server

That is a server-side acknowledgment metric.
It does not measure:

- websocket transit to the browser
- client receipt
- React render time

If you need those later, add client-side instrumentation or a dedicated server fanout metric.

## Common Workflows

### "Why did this request fail?"

1. Start with the local NDJSON file.
2. Find spans where `exit._tag != "Success"`.
3. Group by `traceId`.
4. Inspect sibling spans and span events.
5. If needed, move to Tempo for the full trace tree.

### "Why is the UI feeling slow?"

1. Search for slow top-level spans in the trace file or Tempo.
2. Check child spans for sqlite, git, provider, or terminal work.
3. Look at the matching duration metrics to see whether the slowness is systemic.

### "Did this command take too long to acknowledge?"

1. Check `ryco_orchestration_command_ack_duration` by `commandType`.
2. If it is high, inspect the corresponding orchestration trace.
3. Look at child spans for projection, sqlite, provider, or git work.

### "Are git hooks causing latency?"

1. Filter `git.operation` spans.
2. Inspect `git.hook.started` and `git.hook.finished` events.
3. Compare hook timing to the enclosing git span duration.

### "Why do I have spans locally but nothing in Grafana?"

Usually one of these is true:

- `RYCO_OTLP_TRACES_URL` was not set
- the app was launched from a different environment than the one where you exported the vars
- the app was not fully restarted after changing env
- Grafana is looking at the wrong time range or service name

If the local NDJSON file is updating, local tracing is working. The problem is almost always OTLP export configuration or process startup.

## How To Think About Adding Tracing To Future Code

### Prefer Boundaries Over Tiny Helpers

Good span boundaries:

- RPC methods
- orchestration command handling
- provider adapter calls
- external process calls
- persistence writes
- queue handoffs

Avoid tracing every tiny helper. Most helpers should inherit the active span rather than create a new one.

### Reuse `Effect.fn(...)` Where It Already Exists

The codebase already uses `Effect.fn("name")` heavily. That should usually be your first tracing boundary.

For ad hoc work:

```ts
import { Effect } from "effect";

// `doWork` stands in for whatever effect you want to trace.
const runThing = Effect.gen(function* () {
  yield* Effect.annotateCurrentSpan({
    "thing.id": "abc123",
    "thing.kind": "example",
  });
  yield* Effect.logInfo("starting thing");
  return yield* doWork();
}).pipe(Effect.withSpan("thing.run"));
```

### Put High-Cardinality Detail On Spans

Use span annotations for IDs, paths, and other detailed context:

```ts
// Inside an Effect.gen generator:
yield* Effect.annotateCurrentSpan({
  "provider.thread_id": input.threadId,
  "provider.request_id": input.requestId,
  "git.cwd": input.cwd,
});
```

### Keep Metric Labels Low Cardinality

Good metric labels:

- operation kind
- method name
- provider kind
- aggregate kind
- outcome

Bad metric labels:

- raw thread IDs
- command IDs
- file paths
- cwd
- full prompts
- full model strings when a normalized family label would do

Detailed context belongs on spans, not metrics.

### Use Logs As Span Events

Logs inside a span become part of the trace story:

```ts
// Inside an Effect.gen generator:
yield* Effect.logInfo("starting provider turn");
yield* Effect.logDebug("waiting for approval response");
```

Those messages show up as span events because `Logger.tracerLogger` is installed.

### Use The Pipeable Metrics API

`withMetrics(...)` is the default way to attach a counter and timer to an effect:

```ts
import { someCounter, someDuration, withMetrics } from "../observability/Metrics.ts";

const program = doWork().pipe(
  withMetrics({
    counter: someCounter,
    timer: someDuration,
    attributes: {
      operation: "work",
    },
  }),
);
```

## Detailed API Reference

### Runtime Wiring

The server observability layer is assembled in `apps/server/src/observability/Layers/Observability.ts`.
It provides:

- pretty stdout logger
- `Logger.tracerLogger`
- local NDJSON tracer
- optional OTLP trace exporter
- optional OTLP metrics exporter
- Effect trace-level and timing refs

### Env Vars

Local trace file:

- `RYCO_TRACE_FILE`: override trace file path
- `RYCO_TRACE_MAX_BYTES`: per-file rotation size in bytes, default `10485760` (10 MiB)
- `RYCO_TRACE_MAX_FILES`: rotated file count, default `10`
- `RYCO_TRACE_BATCH_WINDOW_MS`: flush window in milliseconds, default `200`
- `RYCO_TRACE_MIN_LEVEL`: minimum trace level, default `Info`
- `RYCO_TRACE_TIMING_ENABLED`: enable timing metadata, default `true`

OTLP export:

- `RYCO_OTLP_TRACES_URL`: OTLP trace endpoint
- `RYCO_OTLP_METRICS_URL`: OTLP metric endpoint
- `RYCO_OTLP_EXPORT_INTERVAL_MS`: export interval in milliseconds, default `10000` (10 seconds)
- `RYCO_OTLP_SERVICE_NAME`: service name, default `ryco-server`

If the OTLP URLs are unset, local tracing still works and metrics stay in-process only.

### What Is Instrumented Today

Current high-value span and metric boundaries include:

- Effect RPC websocket request spans from `effect/rpc`
- RPC request metrics in `apps/server/src/observability/RpcInstrumentation.ts`
- startup phases
- orchestration command processing
- orchestration command acknowledgment latency
- provider session and turn operations
- git command execution and git hook events
- terminal session lifecycle
- sqlite query execution

### Client Perf Profiling

The web app supports opt-in interaction profiling via `VITE_RYCO_PERF_PROFILE=1`.

Instrumented interactions:

- thread tab switches (`ryco:tab-switch:*` measures)
- sidebar project expand (`ryco:sidebar-expand:*` measures)
- component render durations (`ryco:render:*` measures)

Soft budgets live in `apps/web/src/perf/budgets.ts`. See `apps/web/src/perf/README.md` for usage and inspection commands.

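Measures with names like `ryco:tab-switch:*` can be recorded and read back through the standard User Timing API. The sketch below is not the app's actual profiling code (that lives under `apps/web/src/perf/`); `measureInteraction`, the `thread-42` suffix, and the 100 ms budget are hypothetical and only illustrate the general pattern of recording a named measure and comparing it against a soft budget.

```typescript
// Hypothetical helper: wraps a synchronous interaction in a User Timing measure.
function measureInteraction<T>(measureName: string, run: () => T): T {
  const startMark = `${measureName}:start`;
  performance.mark(startMark);
  try {
    return run();
  } finally {
    // Measure from the start mark to "now", then drop the temporary mark.
    performance.measure(measureName, startMark);
    performance.clearMarks(startMark);
  }
}

// Simulate a tab switch with some throwaway work.
measureInteraction("ryco:tab-switch:thread-42", () => {
  let total = 0;
  for (let i = 0; i < 1000000; i++) total += i;
  return total;
});

// Read back every measure recorded under the tab-switch prefix.
const tabSwitchMeasures = performance
  .getEntriesByType("measure")
  .filter((entry) => entry.name.startsWith("ryco:tab-switch:"));

const softBudgetMs = 100; // hypothetical budget, not taken from budgets.ts
tabSwitchMeasures.forEach((entry) => {
  const status = entry.duration > softBudgetMs ? "over budget" : "within budget";
  console.log(`${entry.name}: ${entry.duration.toFixed(1)}ms (${status})`);
});
```

In a browser, the same `getEntriesByType("measure")` call can be run from the devtools console to inspect whatever measures the page has recorded.
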
### Current Constraints

- logs outside spans are not persisted
- metrics are not snapshotted locally to disk; use the Diagnostics panel or OTLP export for inspection
- the old `serverLogPath` still exists in config for compatibility, but the trace file is the persisted artifact that matters