Getting started
litmus runs your system prompt against the model you actually ship on, scores every output with an LLM judge, and keeps each attempt as a version you can compare. Everything runs locally in your browser with your own API keys.
The progress rail at the top of the panel tracks where you are
Paste the system prompt you want to test, or click ⤓ Grab from this tab to pull one from the page. Pick the target model — the model you actually ship on.
litmus scores your prompt on language, intent, format, and tone for the chosen model, and lists concrete rewrite suggestions. This is read-only feedback — nothing is changed yet.
litmus finds the quality dimensions your prompt's output should be judged on and generates a rigorous LLM-as-judge rubric for each. You can edit a rubric, regenerate one, add your own dimension, or check coverage.
litmus generates a set of test cases — typical, edge, and adversarial inputs that match your prompt's real input contract. Remove any you don't want, regenerate the set, or add more (see tips below). The estimated cost updates live.
Each case is sent to your target model, then the judge scores the output against the rubric. Speed (time-to-first-byte, tokens/sec) is measured as it runs.
An overall score, pass/fail counts, a speed strip, and a per-case table. Failing cases sort to the top. Hover any case to read the judge's full rationale.
Results — failing cases first, hover a row for the full rationale
From the failing cases, litmus proposes ranked, concrete edits to your prompt. Click Apply fixes & re-run → to auto-apply every suggestion to your system prompt — you land back on Capture with the revised prompt, ready to review and run again.
Every run is saved as a version. Compare the baseline against the latest by dimension, and export the whole history as Markdown or JSON.
If your model calls tools/functions, you don't need the rubric flow. On the Capture step, under "What are you testing?", choose Tool & agent instead of Output quality. The button changes to Set up tool & agent tests → and takes you straight to the Cases step — skipping Analyze and Build eval prompt, since tool and agent tests are scored deterministically, with no LLM judge.
Checks that, for a given message, the model calls the right tool with valid arguments and avoids the ones it shouldn't.
name, optional description, and a JSON-Schema parameters object. litmus validates it live.✦ Generate tool tests to have litmus propose cases from your catalog, or add one manually: a user message, the expected tool, any forbidden tools, and optional required argument values.Tests a model that uses tools across several turns to finish a task. In the 🤖 Agent scenarios panel, paste a scenario as JSON: a goal, the tools it may call (each with scripted results — you can inject a failure to test recovery), a maxSteps cap, and optional successContains keywords the final answer must include. litmus runs the model in a loop — tool call → mocked result → continue — and scores whether it reached the goal using only the tools you defined.
Tools are mocked. litmus never executes a real tool; it returns the responses you scripted, so runs are deterministic and side-effect-free. Both tool tests and agent scenarios work with OpenAI, Anthropic, and Google targets (litmus normalizes each provider's tool-call format). Richer per-dimension trajectory scoring (efficiency, recovery) is planned; today's agent verdict is goal-reached plus correct tool selection.
+10 cases to append ten fresh cases without losing the ones you have. litmus is told to avoid duplicating existing cases.Back button, so you can revisit earlier steps without losing your place.The first generation produces 12 cases, spread across typical, edge, and adversarial inputs. Use +10 cases to add more, or ↻ Regenerate to replace the set.
Keys are stored only in your browser's local extension storage. Your prompt and cases are sent only to the model provider you choose (OpenAI, Anthropic, or Google) to run the test. litmus has no backend and collects no analytics. See the Privacy Policy.
A model judging its own output tends to be lenient with itself. Picking a different judge model reduces this self-preference bias and gives more trustworthy scores.
It's an estimate of the total API spend for a run, based on case count and average token sizes. If it exceeds your spend cap (set in Settings), the Run button is disabled until you raise the cap or trim cases.
Deterministically, not by the LLM judge. A tool test passes only if the model called the expected tool, the arguments parsed as JSON and matched the tool's schema (required fields, types), any required values matched, and no forbidden tool was called. Because it's deterministic, a tool test's pass/fail doesn't drift run-to-run the way judged scores can.