---
name: llm-obs-experiment-analyzer
description: Analyze LLM experiment results. Handles single or comparative experiments, exploratory or Q&A modes. Use when user says "analyze experiment", "compare experiments", "analyze against baseline", or provides one or two experiment IDs for analysis.
---

## Backend

**Detection** — At the start of every invocation, before taking any action, determine which backend to use:

1. If the user passed `--backend pup` anywhere in their invocation → use **pup mode** immediately, regardless of whether MCP tools are present. Skip the remaining steps.
2. Check whether MCP tools are present in your active tool list. The canonical signal is whether `mcp__datadog-llmo-mcp__get_llmobs_experiment_summary` appears in your available tools.
3. If MCP tools are present → use **MCP mode** throughout. Call MCP tools exactly as named in this skill's workflow sections.
4. If MCP tools are absent → check whether `pup` is executable: run `pup --version` via Bash. A JSON response containing `"version"` confirms pup is available.
5. If pup responds → use **pup mode** throughout. Translate every MCP tool call to its pup equivalent using the Tool Reference appendix at the bottom of this file.
6. If neither is available → stop and tell the user:
   > "Neither the Datadog MCP server nor the pup CLI is available. Connect the MCP server (`claude mcp add --scope user --transport http datadog-llmo-mcp 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=llmobs'`) or install pup."

`--backend pup` is accepted anywhere in the invocation arguments and is stripped before passing remaining args to the skill logic.

**pup invocation rules:**
- Invoke via Bash: `pup llm-obs [flags]`
- pup always outputs JSON. Parse directly — no content-block unwrapping (unlike MCP results, which may wrap JSON in `[{"type": "text", "text": ""}]`).
- If pup returns an auth error, tell the user to run `pup auth login` and stop.
- Parallelization: issue multiple Bash tool calls in a single message (one pup command per call).

**Invocation ID:** At the very start of each invocation, before any MCP tool call, generate an 8-character hex invocation ID (e.g., `3a9f1c2b`). Keep it constant for the entire invocation.

**Intent tagging:** On every MCP tool call, prefix `telemetry.intent` with `skill:llm-obs-experiment-analyzer[] — ` followed by a description of why the tool is being called. On the **first MCP tool call only**, use `skill:llm-obs-experiment-analyzer:start[] — ` instead (note the `:start` suffix).

Example first call: `skill:llm-obs-experiment-analyzer:start[3a9f1c2b] — Phase 1: get experiment summary to orient analysis`

# Unified Experiment Analyzer

Analyzes one or two LLM experiments. Supports four modes based on inputs:

| Inputs | Mode |
|--------|------|
| 2 IDs, no question | Comparative Exploratory |
| 2 IDs + question | Comparative Q&A |
| 1 ID, no question | Single Exploratory |
| 1 ID + question | Single Q&A |

## Usage

```
/llm-obs-experiment-analyzer [experiment_id_2] [question text] [--output agent|file|notebook]
```

Arguments: $ARGUMENTS

## Available Tools

| Tool | Purpose |
|------|---------|
| `mcp__datadog-llmo-mcp__get_llmobs_experiment_summary` | Get total events, error count, metrics stats, available dimensions |
| `mcp__datadog-llmo-mcp__list_llmobs_experiment_events` | Query events with filters, sorting, pagination |
| `mcp__datadog-llmo-mcp__get_llmobs_experiment_event` | Get full event details (input, output, expected_output, metrics) |
| `mcp__datadog-llmo-mcp__get_llmobs_experiment_metric_values` | Get metric stats overall and segmented by dimension. Use `segment_by_dimension` (not `segment_dimension`) to segment; optionally `segment_dimension_value` to filter to a specific value. |
| `mcp__datadog-llmo-mcp__get_llmobs_experiment_dimension_values` | List unique values for a dimension with counts |
| `mcp__datadog-mcp-core__create_datadog_notebook` | Export report as a Datadog notebook |

---

## Phase 0 — Mode & Output Resolution

Parse $ARGUMENTS:
1. Extract one or two UUID-format strings as experiment IDs (first = baseline/primary, second = candidate).
2. Extract `--output agent|file|notebook` flag if present.
3. The remaining text (after IDs and flags) is the question, if any.

**Mode determination:**
- 2 IDs + question → Comparative Q&A
- 2 IDs, no question → Comparative Exploratory
- 1 ID + question → Single Q&A
- 1 ID, no question → Single Exploratory

**Output mode determination:**

If `--output` was provided in arguments, use that mode and skip asking. Otherwise, ask two **separate sequential** `AskUserQuestion` calls before proceeding — never combined into a single call:

1. **Analysis type**: If no question text was provided in the arguments, ask whether the user wants exploratory analysis or has a specific question. Skip this call only if the user's intent is already clear from context (e.g. they typed a question alongside the IDs).
2. **Output destination**: If `--output` was not specified, ask where to deliver the report (chat, file, or Datadog notebook). Always ask this as its own standalone call.

**Output modes:**

1. **Agent (default):** Display the full report in the conversation.
2. **File:** Before starting, propose a path: `evals/reports/YYYY-MM-DD--analysis.md`. Present it to the user and let them confirm or adjust. Then proceed.
3. **Notebook:** Use `mcp__datadog-mcp-core__create_datadog_notebook` at the end. In pup mode, use `pup notebooks create --title "TITLE" --file /tmp/nb_cells.json` instead (see Tool Reference).
   If neither MCP nor pup is available, output these setup instructions instead of failing:
   ```
   To enable Datadog notebook export, add the MCP server:
     claude mcp add --transport http datadog-mcp https://mcp.datadoghq.com/api/unstable/mcp-server
   See: https://docs.datadoghq.com/bits_ai/mcp_server/setup/
   ```
   Then ask: "Would you like to fall back to file or agent output instead?"

   See Phase 5 for full notebook call details.

After resolving mode and output, proceed to Phase 1. There will be one additional `AskUserQuestion` interaction at Phase 1.5 before the deep analysis begins.

---

## Phase 1 — Orient

**Comparative:** Call `get_llmobs_experiment_summary` for both experiments. Produce a side-by-side comparison:
- Scale: total samples and error count for each
- Metrics: which metrics exist in each; which are shared
- Dimensions: which dimensions exist in each; which are shared
- Immediate red flags (errors present, missing metrics, sparse data)
- Obvious improvements or regressions visible at the summary level

When `error_count > 0`, call `get_llmobs_experiment_dimension_values` for `error_type` and report the breakdown by exception class (e.g. "2 errors: `asyncio.exceptions.cancellederror`"). Errors mean the executor threw an unhandled exception — no eval scores were produced for those samples. Do not report a percentage; report the count and type(s).

**Single:** Call `get_llmobs_experiment_summary` for the experiment. Determine:
- Total samples, and error count (with `error_type` breakdown if non-zero)
- Available metrics grouped by `metric_type` as returned by the summary (`score`, `boolean`, `categorical`). Do not infer semantic groupings or categories from label name patterns or prefixes — the label string is not a reliable signal for what a metric measures.
- Classify each metric using the statistics already returned by the summary (mean, min, max). Do not infer metric meaning from label names or prefixes.
  Use the classifications defined in Phase 1.5 when referencing metrics throughout the report.
- Available dimensions for segmentation
- Any immediate red flags

---

## Phase 1.5 — Metrics Selection

After completing Phase 1, run the following three steps before any `AskUserQuestion`.

**Step 1 — Classify every metric** using summary statistics only (no additional tool calls):

| Class | Condition | Meaning |
|---|---|---|
| `always_zero` | `max == 0` | Feature disabled or not implemented — no signal |
| `perfect` | `min == 1` | Always passes — no diagnostic signal |
| `saturated` | `mean ≥ 0.99` and `min < 1` | Rarely fails — low diagnostic value |
| `struggling` | `mean < 0.70` | Meaningful failure rate — highest diagnostic value |
| `interesting` | `0.70 ≤ mean < 0.99` and `min < max` | Partial failures — moderate diagnostic value |

**Step 2 — Print the full metric table to chat** before asking any question. This gives the user complete visibility — never truncated by option limits. Format:

```
Found N metrics. Full breakdown:
| Metric | Mean | Class |
|--------|------|-------|
|