---
name: diagnosing-experiment-results
description: "Diagnoses bias, anomalies, and strange-looking results on a specific PostHog experiment. Covers empty / 0-exposure experiments, sample ratio mismatch, identity fragmentation, multi-variant exposure, uneven-split exclusion bias, significance traps (peeking, A/A, Bayesian vs Frequentist), PostHog-vs-SQL discrepancies, and surprises after mid-run edits. Symptom-driven dispatch to the right diagnostic.\nTRIGGER when: user asks 'is my experiment biased?' or 'why 0 exposures?', references the bias banner, says a variant looks strange / wrong / off, sees significance flipping, notices PostHog numbers disagreeing with their SQL, sees an A/A test showing significance, or reports surprises after mid-run edits.\nDO NOT TRIGGER when: creating a new experiment (use creating-experiments), only configuring rollout (use configuring-experiment-rollout) or metrics (use configuring-experiment-analytics), or only asking lifecycle questions (use managing-experiment-lifecycle)."
---

# Diagnosing experiment results

This skill answers: **My PostHog experiment results look wrong, biased, or empty — what's going on?**

Match the user's complaint in the dispatch table, then read the matching reference file for the diagnostic.

Each diagnostic in the reference files is tagged `[HIGH]`, `[MEDIUM]`, or `[LOW]` based on how strongly it's verified — `[HIGH]` is verified directly in PostHog code, `[MEDIUM]` is partially or team-source verified, `[LOW]` describes SDK/external behavior that wasn't verified here. Treat `[LOW]` items as hypotheses to test, not facts to assert.

## Step 1 — Resolve the experiment

If the user refers to an experiment by name or description, load the `finding-experiments` skill first to resolve it to a concrete ID.

Call `experiment-get` and pull these fields.
They are inputs for almost every diagnostic:

- `parameters.feature_flag_variants[].rollout_percentage` — the variant split
- `parameters.rollout_percentage` — the overall rollout (% of users entering the experiment)
- `exposure_criteria.multiple_variant_handling` — defaults to `"exclude"` if absent
- `exposure_criteria.exposure_event` — `null` means default `$feature_flag_called`
- `exposure_criteria.filterTestAccounts` — defaults to `true`
- `feature_flag.active`, status (`draft` / `running` / `paused` / `stopped`), `start_date`, `end_date`
- `feature_flag.filters.groups[].variant` — any non-null value is a forced-variant override on the matched cohort (release-condition assignment, not randomized). Surfaces A7 by default.
- `stats_config` — Bayesian (default) or Frequentist

## Step 1.5 — Pull a diagnostic snapshot (verify before asking)

Before asking the user clarifying questions, pull the diagnostic snapshot in [references/diagnostic-snapshot.md](references/diagnostic-snapshot.md). Most diagnostics in this skill can be confirmed or ruled out from that data without an interview.

## Step 2 — Match symptom to diagnostic

| User says... | Diagnostic group |
| --- | --- |
| "Smaller variant looks biased" / banner says bias | A — bias & skew |
| "Variant ratio doesn't match my split" / SRM warning | A — bias & skew |
| "Why isn't it 50/50?" / "users in both groups" | A — bias & skew |
| "Users in both control and test" / high `$multiple` % | A — bias & skew |
| Multi-variant exposure on a server-rendered app | A — bias & skew |
| Banner about feature-flag/experiment state mismatch | A — bias & skew |
| "Migrating distinct_id" / "switching from anonymous to user_id" mid-run | A — bias & skew |
| Metric count is much smaller than exposures (e.g. 10× or 100× gap) | A — bias & skew (route here before D) |
| "Experiment shows 0 / not enough data" / empty | B — empty experiment |
| "Variant always undefined / false" | B — empty experiment |
| "$feature_flag_called fires but no exposures show up" | B — empty experiment |
| "Experiment says running but exposures haven't moved in weeks/months" | B — empty experiment |
| "Significance keeps flipping as we run longer" | C — interpretation traps |
| "Significance was declared, then it wasn't significant anymore" | C — interpretation traps |
| "30/16 split at 46 exposures, is this broken?" | C — interpretation traps |
| "A/A test is showing significant results" | C — interpretation traps |
| "Many metrics — some significant, some not" | C — interpretation traps |
| "Bayesian says 96% chance to win — should we ship?" | C — interpretation traps |
| "Confidence intervals overlap — does that mean not significant?" | C — interpretation traps |
| "An external tool (significance calculator or AI agent) disagrees with PostHog" | C — interpretation traps |
| "Should I ship? Primary is up but a secondary is down" | C — interpretation traps |
| "PostHog numbers ≠ my SQL count" | D — numbers vs SQL |
| "Funnel says X% but my raw event count says Y" | D — numbers vs SQL |
| "Sum of revenue looks wrong" / "breakdown shows 'none'" | D — numbers vs SQL |
| "Recordings panel doesn't match the stats" | D — numbers vs SQL |
| "I applied a filter but the user count didn't change" | D — numbers vs SQL |
| "I want to slice results by current person properties (as of now, not as of exposure)" | D — numbers vs SQL |
| "Changed split / rollout / metric / criteria mid-run, now odd" | E — mid-run changes |
| "Ended/shipped — flag now flipped to 0/100 unexpectedly" | E — mid-run changes |
| "Long-term metric moves opposite from primary" | E — mid-run changes |
| "Retention metric counts users I didn't expect" | E — mid-run changes |
| "Can't convert the feature flag back to a simple (boolean) flag after the experiment ends" | E — mid-run changes |
| "How do I restart an experiment with new variants?" | E — mid-run changes |
| Metric line is rendered but the result block is empty / no chance-to-win or significance | E — mid-run changes (E13 legacy methodology) |

If the symptom is unclear, ask one clarifying question before picking. Most diagnostics have different fixes — do not guess.

## Step 3 — Surface every diagnostic the evidence supports

After matching the symptom in Step 2 and reading the relevant reference file(s), list each diagnostic that applies before recommending an action.

Surface co-occurring mechanisms independently — even when one is more salient, don't collapse them into a single "wait" or "fix" recommendation. Different mechanisms have different fixes: a _systematic_ bias (e.g. uneven-split + Exclude) doesn't resolve by waiting; a _statistical_ pattern (e.g. small-sample variance) does. Bundling them leaves the bias in place after the user follows the bundled advice.
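
One rough way to tell a statistical pattern from a systematic one is a goodness-of-fit check of observed exposures against the configured split. The sketch below is illustrative only and is not PostHog's own SRM check: the function name and the two-variant restriction are assumptions made for brevity, and the second set of counts is hypothetical, chosen to show the same ratio at 100× the volume.

```python
import math


def two_variant_ratio_p_value(observed, expected_split):
    """Chi-square goodness-of-fit p-value for exactly two variants (1 d.o.f.).

    observed: exposure counts per variant, e.g. [30, 16]
    expected_split: configured split in any consistent unit, e.g. [50, 50]
    """
    if len(observed) != 2 or len(expected_split) != 2:
        raise ValueError("this sketch only handles two variants")
    total = sum(observed)
    split_total = sum(expected_split)
    chi2 = 0.0
    for count, share in zip(observed, expected_split):
        expected = total * share / split_total
        chi2 += (count - expected) ** 2 / expected
    # Survival function of a chi-square with 1 d.o.f.: P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2))


# The "30/16 split at 46 exposures" symptom from the dispatch table.
p_small = two_variant_ratio_p_value([30, 16], [50, 50])
# Hypothetical counts: the same ratio at 100x the volume.
p_large = two_variant_ratio_p_value([3000, 1600], [50, 50])

print(f"46 exposures:   p = {p_small:.3f}")
print(f"4600 exposures: p = {p_large:.3g}")
```

At 46 exposures the imbalance is only weak evidence against a 50/50 split, which fits small-sample variance. If the same ratio persists at thousands of exposures, the p-value collapses and waiting will not fix it, which points to a systematic mechanism instead.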

Only list mechanisms that have a path to verification in the project state — config (from `experiment-get`), snapshot data, activity log, or repo source. Config-derived mechanisms count: an 80/20 split with default `multiple_variant_handling="exclude"` is visible in `experiment-get` and is therefore enumerable. Naming a mechanism with no source (e.g. SRM when the snapshot shows a clean variant ratio) is not.

## Diagnostic groups

### A — Bias & skew

Variants don't look balanced, one variant looks biased, the in-app warning banner appeared, or users are showing up under multiple variants. Covers the uneven-split + Exclude interaction, SRM, identity fragmentation, bootstrap × `/decide` mismatch, and flag/experiment state inconsistency.

→ See [references/bias-and-skew.md](references/bias-and-skew.md)

### B — Empty experiment / 0 exposures / "not enough data"

A frequent pain point. Covers SDK call (wrong evaluation method, `identify()` timing, dedup), exposure capture (custom event missing variant property, required properties, ad-blockers), and exposure-criteria match (test-account filter, eligibility ordering, events firing before exposure).

→ See [references/empty-experiment.md](references/empty-experiment.md)

### C — Significance / interpretation traps

Significance flipping, A/A test showing significance, Bayesian vs Frequentist confusion, multiple comparisons, low-volume variance, peeking / early stopping. Includes the legacy stats issue (A/A tests historically over-fired before the new Bayesian module) and how the win-probability methodology changed in Jan 2025 (single test vs control, not control vs all variants).

→ See [references/interpretation.md](references/interpretation.md)

### D — Numbers don't match (PostHog vs the user's SQL / raw count)

The experiment page applies an exposure scope, `$multiple` exclusion, test-account filter, and date range that ad-hoc SQL almost never replicates.
Covers funnel attribution (only first→last step counts for stats), breakdowns (read from the exposure event, not the metric event), the "sum of revenue" mean-of-per-user confusion, and the recordings-panel-vs-stats divergence.

→ See [references/numbers-vs-sql.md](references/numbers-vs-sql.md)

### E — Surprises after mid-run changes (incl. lifecycle and retention quirks)

Increasing rollout is safe; decreasing it calls for caution; changing the variant split is an anti-pattern; adding metrics mid-run is p-hacking; ship-variant can rewrite the flag in surprising ways; reset clears results, not the flag. Also covers retention-metric quirks (first-event-must-be-after-exposure design), "matured users" filtering, and long-term vs short-term metric divergence.

→ See [references/mid-run-changes.md](references/mid-run-changes.md)

## Step 4 — Calibrate recommendations to experiment state

Surface diagnostics first (Step 3). Then recommend — but scope what you recommend to what the experiment's current state permits.

- **Draft** — config changes are free; recommend and apply.
- **Running** — every change has a tradeoff. Explain the mid-run impact (anti-pattern? safe? user-visible?) before recommending. See `configuring-experiment-rollout` and its reference file `references/changing-distribution-after-launch.md` for the mid-run rules.
- **Stopped / archived** — the experiment AND its feature flag represent the documented outcome of the run. Recommendations are scoped to (a) interpretation of the existing data, (b) what to do for the _next_ experiment, or (c) explaining what happened.

On a stopped or archived experiment, don't preemptively offer reversal of a state mutation (ship-variant flag rewrite, manual flag edit, reset, archive). If the user asks "why did X happen?", explain X — don't append a "here's how to undo it" coda. That pattern assumes intent the user didn't signal.
Conditional offers like _"if this wasn't intended, you could…"_ or _"want me to revert it?"_ count as preemptive too — only the user explicitly naming the reversal action ("how do I undo this?", "can I roll back ship-variant?", "how do I get the 50/50 split back?") is a request to surface reversal mechanics.

Use consistent terminology: variant _split_ (between variants) is distinct from _rollout_ (overall % entering); the `$feature_flag_called` exposure event is distinct from a _custom exposure event_; the _Exclude_ / _First seen_ options control multivariate handling, not exposure.
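
The split-versus-rollout distinction can be made concrete with a small calculation over the `parameters` fields listed in Step 1. This is a minimal sketch under stated assumptions: the function name is hypothetical, the `key` field on each variant is assumed (it is not named in this skill), and a missing overall rollout is treated as 100%.

```python
from typing import Any


def traffic_share_by_variant(experiment: dict[str, Any]) -> dict[str, float]:
    """Return each variant's share of ALL users, as a percentage.

    share = overall rollout % * variant split % / 100
    """
    params = experiment.get("parameters") or {}
    # Assumption: a missing overall rollout means everyone enters the experiment.
    rollout = params.get("rollout_percentage")
    if rollout is None:
        rollout = 100
    shares: dict[str, float] = {}
    for variant in params.get("feature_flag_variants") or []:
        # Assumption: each variant carries a "key" naming it.
        shares[variant["key"]] = rollout * variant["rollout_percentage"] / 100
    return shares


# Hypothetical payload: 50% rollout with an 80/20 variant split.
example = {
    "parameters": {
        "rollout_percentage": 50,
        "feature_flag_variants": [
            {"key": "control", "rollout_percentage": 80},
            {"key": "test", "rollout_percentage": 20},
        ],
    }
}

print(traffic_share_by_variant(example))  # {'control': 40.0, 'test': 10.0}
```

Here the split stays 80/20 while only half of all users enter, so the variants see 40% and 10% of total traffic. Quoting both numbers separately avoids conflating a rollout change with a split change.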