# Dashboards

`mcp-ts-core-dashboard.json` (this directory) is an example Grafana dashboard for the framework's OTel metrics. It targets a Prometheus-compatible backend and uses labels the framework emits.

For the underlying metric catalog, span list, and env config, see [`observability.md`](./observability.md).

---

## Import quickstart (Grafana)

1. Wire OTel export to a Prometheus-compatible store (Prometheus, VictoriaMetrics, Mimir, Thanos, Cortex, …). On the MCP server, set `OTEL_ENABLED=true` and the OTLP endpoints — see [`observability.md`](./observability.md) for the env vars.
2. In Grafana, go to **Dashboards** → **New** → **Import**, then paste the JSON or upload the file.
3. Pick the Prometheus data source on import. Save.
4. Open the dashboard. Adjust the `Service` template variable (default `.+`) if you want to scope to one server. The regex automatically tolerates an `@scope/` prefix — entering `mcp-ts-core` matches both `mcp-ts-core` and `@cyanheads/mcp-ts-core`.

Panels populate within ~30s of metrics flowing (the framework pushes via `PeriodicExportingMetricReader` every 15s).
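To illustrate the scope-tolerant matching described in step 4, here is a minimal sketch of how such a pattern could be built. The helper name `scopeTolerantPattern` and the exact expression are assumptions for illustration only; the dashboard JSON holds the real template-variable query and may differ.

```typescript
// Hypothetical helper, not taken from the dashboard JSON: wraps a user-supplied
// service pattern so that an optional npm-style "@scope/" prefix is accepted.
function scopeTolerantPattern(service: string): RegExp {
  // PromQL regex matchers are fully anchored, so ^...$ mimics that behavior here.
  return new RegExp(`^(?:@[^/]+/)?(?:${service})$`);
}

const matcher = scopeTolerantPattern("mcp-ts-core");
console.log(matcher.test("mcp-ts-core"));            // true
console.log(matcher.test("@cyanheads/mcp-ts-core")); // true
console.log(matcher.test("other-server"));           // false
```

The default value `.+` keeps matching every service under this scheme, because the scope prefix is optional and `.+` already accepts any non-empty name.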

---

## Naming convention assumed

The dashboard targets the **OTel Collector → Prometheus exporter** translation rules with `add_metric_suffixes: true` (default).

| OTel metric | Prometheus name |
|:------------|:----------------|
| Counter `mcp.tool.calls` | `mcp_tool_calls_total` |
| Histogram `mcp.tool.duration` (`ms`) | `mcp_tool_duration_milliseconds_{bucket,sum,count}` |
| Histogram `mcp.session.duration` (`s`) | `mcp_session_duration_seconds_{bucket,sum,count}` |
| Histogram `mcp.tool.input_bytes` | `mcp_tool_input_bytes_{bucket,sum,count}` |
| UpDownCounter `mcp.requests.active` | `mcp_requests_active` |
| Observable gauge `mcp.sessions.active` | `mcp_sessions_active` |
| Observable gauge `process.event_loop.utilization` (unit `1`) | `process_event_loop_utilization_ratio` |
| Resource attribute `service.name` | label `service_name` |
| Attribute key `mcp.tool.name` | label `mcp_tool_name` |

Dots become underscores everywhere. Counters get `_total`. Metrics with physical units (`s`, `ms`, `bytes`) get the unit appended, unless the name already ends in it (as with `mcp.tool.input_bytes`); annotation-only units (`{calls}`, `{requests}`) add nothing. A dimensionless ratio (unit `1`) gets `_ratio`.
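The rules above can be sketched as a small translation function, which helps when predicting the Prometheus name of a metric that is not in the table. This is a simplified illustration, not the Collector's implementation: the function names, the unit map (limited to units this page mentions), and the choice to apply `_ratio` only to gauges are assumptions.

```typescript
type InstrumentKind = "counter" | "updowncounter" | "histogram" | "gauge";

// Assumption: only the units mentioned on this page are mapped.
const UNIT_SUFFIX = new Map<string, string>([
  ["s", "seconds"],
  ["ms", "milliseconds"],
  ["By", "bytes"],
  ["bytes", "bytes"],
]);

function toPrometheusName(name: string, kind: InstrumentKind, unit = ""): string {
  // Dots become underscores.
  let out = name.replace(/\./g, "_");

  // Physical units are appended; annotation units such as {calls} are not in
  // the map and therefore add nothing. Unit "1" becomes "ratio" (assumed here
  // to apply to gauges only, matching the one example in the table).
  const suffix =
    unit === "1" ? (kind === "gauge" ? "ratio" : undefined) : UNIT_SUFFIX.get(unit);

  // Skip the suffix when the name already ends in it (mcp.tool.input_bytes).
  if (suffix !== undefined && !out.endsWith(`_${suffix}`)) out += `_${suffix}`;

  // Monotonic counters get _total; histogram _bucket/_sum/_count series are
  // added by the exporter on top of the base name returned here.
  if (kind === "counter" && !out.endsWith("_total")) out += "_total";
  return out;
}

// Attribute and resource keys follow the dots-to-underscores rule only.
function toPrometheusLabel(key: string): string {
  return key.replace(/\./g, "_");
}

console.log(toPrometheusName("mcp.tool.duration", "histogram", "ms"));
// mcp_tool_duration_milliseconds
console.log(toPrometheusLabel("mcp.tool.name"));
// mcp_tool_name
```

With `add_metric_suffixes: false`, the unit and `_total` steps would not apply, so the dashboard queries would need the shorter names.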

---

## Row guide

| Row | What it tells you |
|:----|:------------------|
| **Live activity** | At-a-glance health: tool calls/min, in-flight requests, active sessions, tool error %. Plus tool call rate and tool errors-by-category over time. The error % stat is thresholded green/yellow/red at 0/1/5 %. |
| **Tools** | Per-tool call rate, p95 duration, and p95 input/output payload size. Spot the tools that dominate traffic, the slow ones, and the chatty ones. |
| **Resources** | Per-resource read rate, p95 duration, error rate, and p95 output bytes. Same shape as Tools but keyed on `mcp_resource_name`. |
| **Prompts** | Generation rate, p95 duration, error rate by category, and p95 message count. Useful for prompts that fan out — high message count is often the explanation for slow responses. |
| **Storage / LLM / Speech / Graph** | Op rate and p95 latency for each service. LLM also shows tokens/sec by type (`input`/`output`) — multiply by your provider's $/Mtoken to see live cost. |
| **HTTP server & client** | Server: req rate by method, p50/p95/p99 latency, status code distribution. Client: p95 duration by upstream `server_address`. |
| **Sessions / Auth / Tasks** | Session events (`created`/`terminated`/`rejected`/`stale_cleanup`), heartbeat failures by transport, auth attempts and p95 by outcome (`success`/`failure`/`missing`), task transitions, active tasks (in-memory store only). |
| **Errors & rate limits** | Classified errors keyed on JSON-RPC code (`-32603`, `-32001`, …), and rate-limit rejections by key. |
| **Per-request leak detection** | Per-request `McpServer` and transport instances are created on every HTTP request and reclaimed by GC. The dashboard plots **created vs finalized** rate (should match under steady-state) and the **cumulative gap** (should be flat — a growing line is a leak). The `HTTP close failures` panel surfaces `surface`/`trigger` combinations where the per-request close threw. |
| **Process health** | RSS / heap used / heap total, event loop p99 delay (thresholded 50/200 ms), event loop utilization (thresholded 0.7/0.9), and process uptime. |
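The cost estimate mentioned in the LLM row is simple arithmetic: token rate per type times the price per million tokens. A minimal sketch follows; the interface names, the sample rates, and the prices are hypothetical placeholders, not values from the framework or any provider.

```typescript
// Tokens per second by type, as read off the LLM tokens/sec panel.
interface TokenRates {
  input: number;
  output: number;
}

// Prices in USD per million tokens (hypothetical; use your provider's rates).
interface PricePerMillionTokens {
  input: number;
  output: number;
}

function costPerHourUsd(rates: TokenRates, price: PricePerMillionTokens): number {
  // (tokens/sec) * (USD / 1e6 tokens) gives USD/sec for each token type.
  const usdPerSecond =
    (rates.input * price.input + rates.output * price.output) / 1e6;
  return usdPerSecond * 3600;
}

// Example with made-up numbers: 2000 input tok/s at $3/Mtoken and
// 500 output tok/s at $15/Mtoken is roughly $48.60 per hour.
const hourly = costPerHourUsd({ input: 2000, output: 500 }, { input: 3, output: 15 });
console.log(hourly.toFixed(2));
```

Pricing input and output tokens separately matters because providers typically charge different rates for each, which is why the panel splits the series by type.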

---

## Vendor-agnostic query recipes

The dashboard is Prometheus-flavored, but the underlying OTel metrics work in any backend. The translation pattern is the same everywhere: pick the backend's equivalent rate or histogram primitive and use the same attribute keys, adjusted for each vendor's label-naming rules.

### Datadog

Datadog ingests OTel metrics natively or via the OTel Collector's `datadog` exporter. Attribute keys land as Datadog tags.

| Prometheus query | Datadog equivalent |
|:-----------------|:-------------------|
| `sum by (mcp_tool_name) (rate(mcp_tool_calls_total[5m]))` | `sum:mcp.tool.calls{*} by {mcp_tool_name}.as_rate()` |
| `histogram_quantile(0.95, sum by (mcp_tool_name, le) (rate(mcp_tool_duration_milliseconds_bucket[5m])))` | `p95:mcp.tool.duration{*} by {mcp_tool_name}` (Datadog auto-aggregates histograms when ingested as distribution metrics) |
| `sum(mcp_requests_active)` | `sum:mcp.requests.active{*}` |
| `sum(rate(mcp_tool_errors_total[5m])) / sum(rate(mcp_tool_calls_total[5m]))` | `sum:mcp.tool.errors{*}.as_rate() / sum:mcp.tool.calls{*}.as_rate()` |

Service filter: `service:mcp-ts-core` (Datadog populates `service` from `service.name`).

### New Relic (NRQL)

OTel metrics flow through New Relic's OTLP endpoint. They land in the `Metric` event with original names preserved (no Prometheus suffixing).

| Prometheus query | NRQL equivalent |
|:-----------------|:----------------|
| `sum by (mcp_tool_name) (rate(mcp_tool_calls_total[5m]))` | `SELECT rate(sum(mcp.tool.calls), 1 second) FROM Metric FACET mcp.tool.name SINCE 1 hour ago TIMESERIES` |
| `histogram_quantile(0.95, sum by (mcp_tool_name, le) (rate(mcp_tool_duration_milliseconds_bucket[5m])))` | `SELECT percentile(mcp.tool.duration, 95) FROM Metric FACET mcp.tool.name SINCE 1 hour ago TIMESERIES` |
| `sum(mcp_requests_active)` | `SELECT latest(mcp.requests.active) FROM Metric SINCE 5 minutes ago` |
| Cumulative leak gap | `SELECT latest(mcp.http.per_request.created) - latest(mcp.http.per_request.finalized) FROM Metric FACET kind TIMESERIES` |

Service filter: `WHERE service.name = 'mcp-ts-core'`.

### Honeycomb

Honeycomb is span-first. Metric values land as fields on the synthetic span produced by the OTel Collector's `honeycomb` exporter — but the more productive path is to query the **trace data** directly, since the framework already emits per-call spans.

| Goal | Honeycomb query |
|:-----|:----------------|
| Tool call rate by name | `VISUALIZE: COUNT, GROUP BY: name, FILTER: name starts-with "tool_execution:"` |
| Tool p95 duration by name | `VISUALIZE: P95(duration_ms), GROUP BY: name, FILTER: name starts-with "tool_execution:"` |
| Tool error rate | `VISUALIZE: COUNT, GROUP BY: mcp.tool.error_code, FILTER: name starts-with "tool_execution:" AND mcp.tool.success = false` |
| Resource throughput | `VISUALIZE: COUNT, GROUP BY: mcp.resource.name, FILTER: name starts-with "resource_read:"` |
| LLM token cost | `VISUALIZE: SUM(gen_ai.usage.total_tokens), GROUP BY: gen_ai.request.model, FILTER: name = "gen_ai.chat_completion"` |

Service filter: `FILTER: service.name = "mcp-ts-core"`. Honeycomb's BubbleUp feature makes outlier triage on these spans particularly effective: point it at the slowest 1% of tool calls and let it surface the differentiating attributes.

---

## Adapting

| Symptom | Fix |
|:--------|:----|
| HTTP server panels empty | Older `@opentelemetry/instrumentation-http` emits `http.server.duration` (ms) instead of `http.server.request.duration` (s). Replace `http_server_request_duration_seconds` with `http_server_duration_milliseconds` in the JSON. |
| Dashboard empty on Cloudflare Workers | `NodeSDK` doesn't run in V8 isolates and the framework no-ops metric emission unless a Worker-compatible exporter is wired. See [`observability.md`](./observability.md) for runtime caveats. |

For other mismatches (different service label, `add_metric_suffixes: false`, custom attribute filters), inspect one query against your actual metric names and find/replace from there.
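The find/replace step can be scripted rather than done by hand. The sketch below works on the dashboard JSON as a string; `renameMetrics` is a hypothetical helper, and the plain substring replacement is a deliberate simplification that also rewrites the `_bucket`, `_sum`, and `_count` variants of each name.

```typescript
// Hypothetical helper: rewrites metric names inside dashboard JSON text.
// Renames are applied in order, so avoid pairs where one target contains
// another source name.
function renameMetrics(dashboardJson: string, renames: Record<string, string>): string {
  let out = dashboardJson;
  for (const from of Object.keys(renames)) {
    const to = renames[from];
    // Substring replace, so suffixed series such as *_bucket are covered too.
    out = out.split(from).join(to);
  }
  // Fail fast if the edit somehow produced invalid JSON.
  JSON.parse(out);
  return out;
}

// Usage with an inline sample; in practice, read mcp-ts-core-dashboard.json
// from disk and write the result back out.
const sample = JSON.stringify({
  panels: [
    { targets: [{ expr: "sum(rate(http_server_request_duration_seconds_count[5m]))" }] },
  ],
});

const patched = renameMetrics(sample, {
  http_server_request_duration_seconds: "http_server_duration_milliseconds",
});
console.log(patched);
```

A rename like this only changes the metric name. Panels whose units or thresholds assume seconds may still need adjusting by hand once the underlying series is in milliseconds.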