litmus · Help

Getting started

Test a system prompt in 8 steps

litmus runs your system prompt against the model you actually ship on, scores every output with an LLM judge, and keeps each attempt as a version you can compare. Everything runs locally in your browser with your own API keys.

How it works at a glance

The progress rail at the top of the panel tracks where you are

Before you start

Open the ⚙ Settings panel (top-right gear).
Add at least one API key for OpenAI, Anthropic, or Google. This is required: nothing in litmus works until a key is set. Keys are stored only in this browser and are sent only to the provider they belong to.
Optionally pick a judge model different from your target to reduce self-preference bias.

The eight steps

Capture

Paste the system prompt you want to test, or click ⤓ Grab from this tab to pull one from the page. Pick the target model — the model you actually ship on.

Analyze

litmus scores your prompt on language, intent, format, and tone for the chosen model, and lists concrete rewrite suggestions. This is read-only feedback — nothing is changed yet.

Build eval prompt

litmus finds the quality dimensions your prompt's output should be judged on and generates a rigorous LLM-as-judge rubric for each. You can edit a rubric, regenerate one, add your own dimension, or check coverage.

Cases

litmus generates a set of test cases — typical, edge, and adversarial inputs that match your prompt's real input contract. Remove any you don't want, regenerate the set, or add more (see tips below). The estimated cost updates live.

Run

Each case is sent to your target model, then the judge scores the output against the rubric. Speed (time-to-first-byte, tokens/sec) is measured as it runs.
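As a rough illustration of the two speed figures, here is a minimal sketch that times a stream of text chunks. It is not litmus's own code: `measure_stream` is a hypothetical helper, tokens are approximated by splitting on whitespace, and tokens/sec is taken over the whole request (start to last chunk), which is one assumption among several reasonable definitions.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a stream of text chunks and report simple speed metrics.

    Hypothetical helper: the clock is injectable so the timing can be tested.
    """
    start = clock()
    first_at: Optional[float] = None
    tokens = 0
    for chunk in chunks:
        if first_at is None:
            first_at = clock()  # moment the first chunk arrived
        # Assumption: whitespace-separated words stand in for tokens.
        tokens += len(chunk.split())
    end = clock()

    elapsed = end - start
    return {
        "ttfb_s": None if first_at is None else first_at - start,
        "tokens": tokens,
        "tokens_per_s": tokens / elapsed if elapsed > 0 else 0.0,
    }


# Usage with an in-memory stand-in for a provider's streaming response.
metrics = measure_stream(["Hello world", "this is", "a test"])
```

With a real provider, the same function would wrap the streaming iterator returned by that provider's SDK.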

Results

The Results step shows an overall score, pass/fail counts, a speed strip, and a per-case table. Failing cases sort to the top. Click a case row to expand the judge's full rationale.

Results — failing cases first, click a row to expand its full rationale

Fixes

From the failing cases, litmus proposes ranked, concrete edits to your prompt. Click Apply fixes & review → to apply every suggestion to your system prompt via an LLM rewrite. You land back on Capture with the revised prompt, where you can review it and re-run when ready; litmus does not re-run automatically.

Versions

Every run is saved as a version. Compare the baseline against the latest by dimension, and export the whole history as Markdown or JSON.

Testing tool & agent behavior (Beta)

If your model calls tools/functions, you don't need the rubric flow. On the Capture step, under "What are you testing?", choose Tool & agent instead of Output quality. The button changes to Set up tool & agent tests → and takes you straight to the Cases step — skipping Analyze and Build eval prompt, since tool and agent tests are scored deterministically, with no LLM judge.

🔧 Tool tests — single-step tool calls

Checks that, for a given message, the model calls the right tool with valid arguments and avoids the ones it shouldn't.

Define your tools. In the 🔧 Tool tests panel, paste a JSON array of tool definitions — each with a name, optional description, and a JSON-Schema parameters object. litmus validates it live.
Generate tests, or add your own. Click ✦ Generate tool tests to have litmus propose cases from your catalog, or add one manually: a user message, the expected tool, any forbidden tools, and optional required argument values.
Run. Each test is scored pass/fail by a deterministic checker — right tool called, arguments parsed and matched the schema, no forbidden tool used — with the reason shown in the results.
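The steps above can be sketched as follows: a small tool catalog and one hand-written test, built in Python and serialized to the JSON you would paste. The `name`, `description`, and `parameters` fields come from the description above. The tool names, the test-case keys (`message`, `expectedTool`, `forbiddenTools`, `requiredArgs`), and the `check_catalog` helper are hypothetical, so check the panel for the exact shape litmus expects.

```python
import json

# A catalog of two tools. "description" is optional, so the second omits it.
tools = [
    {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
    {
        "name": "send_email",
        "parameters": {
            "type": "object",
            "properties": {
                "to": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["to", "body"],
        },
    },
]

# One manual test. Key names are assumptions, not litmus's documented format.
test_case = {
    "message": "What's the weather in Oslo right now?",
    "expectedTool": "get_weather",
    "forbiddenTools": ["send_email"],
    "requiredArgs": {"city": "Oslo"},
}


def check_catalog(catalog: list) -> list:
    """Return a list of problems found in a catalog (illustrative checks only)."""
    problems = []
    for i, tool in enumerate(catalog):
        if not isinstance(tool.get("name"), str) or not tool["name"]:
            problems.append(f"tool {i}: missing name")
        params = tool.get("parameters")
        if not isinstance(params, dict) or params.get("type") != "object":
            problems.append(f"tool {i}: parameters must be an object schema")
    return problems


catalog_json = json.dumps(tools, indent=2)  # the text to paste into the panel
```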

🤖 Agent scenarios — multi-step tasks

Tests a model that uses tools across several turns to finish a task. In the 🤖 Agent scenarios panel, paste a scenario as JSON: a goal, the tools it may call (each with scripted results — you can inject a failure to test recovery), a maxSteps cap, and optional successContains keywords the final answer must include. litmus runs the model in a loop — tool call → mocked result → continue — and scores whether it reached the goal using only the tools you defined.
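To make the loop concrete, here is a minimal sketch of a scenario and a runner that feeds scripted results back to a model. `maxSteps` and `successContains` are named in the description above; the remaining scenario keys, the `run_scenario` function, and the stand-in `demo_model` are hypothetical and only illustrate the tool call, mocked result, continue cycle.

```python
scenario = {
    "goal": "Refund order 1042 and confirm it to the customer.",
    "tools": {
        # Scripted results are returned in order; the first lookup fails
        # on purpose to test recovery.
        "lookup_order": {"results": [{"error": "timeout"},
                                     {"id": 1042, "status": "paid"}]},
        "issue_refund": {"results": [{"ok": True}]},
    },
    "maxSteps": 6,
    "successContains": ["refund", "1042"],
}


def run_scenario(scn: dict, model_step) -> dict:
    """Run the loop. model_step(history) returns either
    {"tool": name, "args": {...}} or {"final": text}."""
    history = []
    used = {name: 0 for name in scn["tools"]}
    for _ in range(scn["maxSteps"]):
        action = model_step(history)
        if "final" in action:
            text = action["final"].lower()
            ok = all(k.lower() in text for k in scn.get("successContains", []))
            reason = "goal reached" if ok else "final answer missing a keyword"
            return {"passed": ok, "reason": reason, "steps": len(history)}
        name = action["tool"]
        if name not in scn["tools"]:
            return {"passed": False, "reason": f"undefined tool: {name}",
                    "steps": len(history)}
        results = scn["tools"][name]["results"]
        # Once the script runs out, keep repeating the last scripted result.
        result = results[min(used[name], len(results) - 1)]
        used[name] += 1
        history.append({"tool": name, "args": action.get("args", {}),
                        "result": result})
    return {"passed": False, "reason": "maxSteps reached", "steps": len(history)}


def demo_model(history: list) -> dict:
    """Stand-in for a real model: retry the lookup, refund, then answer."""
    looked_up = any(h["tool"] == "lookup_order" and "error" not in h["result"]
                    for h in history)
    refunded = any(h["tool"] == "issue_refund" for h in history)
    if not looked_up:
        return {"tool": "lookup_order", "args": {"id": 1042}}
    if not refunded:
        return {"tool": "issue_refund", "args": {"id": 1042}}
    return {"final": "Refund issued for order 1042."}


result = run_scenario(scenario, demo_model)
```

Because every result is scripted, the same model behavior always yields the same verdict.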

Tools are mocked. litmus never executes a real tool; it returns the responses you scripted, so runs are deterministic and side-effect-free. Both tool tests and agent scenarios work with OpenAI, Anthropic, and Google targets (litmus normalizes each provider's tool-call format). Richer per-dimension trajectory scoring (efficiency, recovery) is planned; today's agent verdict is goal-reached plus correct tool selection.

Testing a live MCP server (v1.2)

If you're building an MCP (Model Context Protocol) server, litmus can connect to it directly and exercise it end to end — no system prompt or rubric needed. On the Capture step, under "What are you testing?", pick the MCP server chip. You'll enter the server's URL, connect, and then inspect and test what it exposes.

🔌 Connect

Enter the server's endpoint and connect. litmus first asks your browser for host permission for that server's origin — this is granted per origin and stored locally, so you approve each new server once. An auth header is optional: add one (for example a bearer token) if your server requires it, or leave it blank for open servers.

🧭 Inspect

Once connected, litmus lists the server's tools, resources, and prompts so you can see exactly what it advertises, with each tool's input schema.
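For orientation, MCP is built on JSON-RPC 2.0, and the protocol defines `tools/list`, `resources/list`, and `prompts/list` methods for this kind of discovery. The sketch below only builds the request bodies and headers and sends nothing. How litmus issues them internally is not documented here, the helper names are hypothetical, and a real session would first complete the protocol's `initialize` handshake.

```python
import json
from itertools import count
from typing import Optional

_next_id = count(1)  # JSON-RPC requests carry a unique id


def jsonrpc_request(method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request body for an MCP method."""
    body = {"jsonrpc": "2.0", "id": next(_next_id), "method": method}
    if params is not None:
        body["params"] = params
    return body


def build_headers(auth_token: Optional[str] = None) -> dict:
    """HTTP headers for the request; the auth header is optional."""
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json, text/event-stream",
    }
    if auth_token:
        # Bearer token is one example; use whatever scheme your server expects.
        headers["Authorization"] = f"Bearer {auth_token}"
    return headers


# The three discovery calls an inspect pass would need.
inspect_requests = [
    jsonrpc_request(method)
    for method in ("tools/list", "resources/list", "prompts/list")
]
payloads = [json.dumps(req) for req in inspect_requests]
```

Each tool in a `tools/list` response carries an `inputSchema`, which is the schema shown next to the tool in the inspect view.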

✓ Conformance check

Runs a structured pass over the server — checking that it advertises and responds correctly over the protocol, that tool schemas are well-formed, and that listed capabilities behave as declared.

🛡 Security scan

An adversarial pass that sends crafted, hostile payloads to the live server — probing for things like prompt-injection openings, argument-injection, and missing input validation — and reports findings by severity.

⚠ These are real calls with real side effects. Unlike Agent scenarios (where tools are mocked), MCP mode talks to your live server: every tool call actually runs and can read, write, or delete real data. The security scan deliberately sends adversarial payloads to that live server. Only point litmus at MCP servers you own or control, and prefer a test/staging instance over production.

Tips & shortcuts

＋ Need more coverage? On the Cases step, click +10 cases to append ten fresh cases without losing the ones you have. litmus is told to avoid duplicating existing cases.

⚡ No wasted calls. The eval prompt and generated cases are cached for your session, so stepping back and forth never re-runs the model. Editing your prompt (including applying fixes) automatically refreshes them.
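One way to picture this behavior is a cache keyed on the prompt text, so an unchanged prompt reuses earlier work and an edited prompt misses the cache. This is a sketch under that assumption, not litmus's actual implementation; `SessionCache` and `fake_generate` are hypothetical names.

```python
import hashlib
from typing import Callable, Dict


class SessionCache:
    """In-memory cache that lives for one session (illustrative only)."""

    def __init__(self) -> None:
        self._entries: Dict[str, object] = {}

    @staticmethod
    def _key(system_prompt: str) -> str:
        # Any edit to the prompt changes the hash, which forces a rebuild.
        return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()

    def get_or_build(self, system_prompt: str,
                     build: Callable[[str], object]) -> object:
        key = self._key(system_prompt)
        if key not in self._entries:
            self._entries[key] = build(system_prompt)  # the only model call
        return self._entries[key]


calls = []


def fake_generate(prompt: str) -> dict:
    """Stand-in for the model call that generates cases."""
    calls.append(prompt)
    return {"cases": [f"case for: {prompt}"]}


cache = SessionCache()
cache.get_or_build("You are a support bot.", fake_generate)        # builds
cache.get_or_build("You are a support bot.", fake_generate)        # cache hit
cache.get_or_build("You are a terse support bot.", fake_generate)  # edited, rebuilds
```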

← Go back anytime. Every step has a Back button, so you can revisit earlier steps without losing your place.

FAQ

How many test cases are generated?

The first generation produces 12 cases, spread across typical, edge, and adversarial inputs. Use +10 cases to add more, or ↻ Regenerate to replace the set.

Where do my API keys and data go?

Keys are stored only in your browser's local extension storage. Your prompt and cases are sent only to the model provider you choose (OpenAI, Anthropic, or Google) to run the test. litmus has no backend and collects no analytics. See the Privacy Policy.

Why use a separate judge model?

A model judging its own output tends to be lenient with itself. Picking a different judge model reduces this self-preference bias and gives more trustworthy scores.

What does the estimated cost mean?

It's an estimate of the total API spend for a run, based on case count and average token sizes. If it exceeds your spend cap (set in Settings), the Run button is disabled until you raise the cap or trim cases.
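A worked example may help. The sketch below assumes each case costs one target call plus one judge call, and that prices are quoted per million tokens. The token counts and prices are made-up placeholders, and `CallProfile`, `estimate_run_cost`, and `run_allowed` are hypothetical names rather than litmus's own formula.

```python
from dataclasses import dataclass


@dataclass
class CallProfile:
    """Average size and price of one API call (all values are placeholders)."""
    avg_input_tokens: int
    avg_output_tokens: int
    input_price_per_mtok: float   # USD per million input tokens
    output_price_per_mtok: float  # USD per million output tokens

    def cost_per_call(self) -> float:
        return (self.avg_input_tokens * self.input_price_per_mtok
                + self.avg_output_tokens * self.output_price_per_mtok) / 1_000_000


def estimate_run_cost(case_count: int, target: CallProfile,
                      judge: CallProfile) -> float:
    # Assumption: one target call and one judge call per case.
    return case_count * (target.cost_per_call() + judge.cost_per_call())


def run_allowed(estimate: float, spend_cap: float) -> bool:
    # The Run button is disabled only when the estimate exceeds the cap.
    return estimate <= spend_cap


target = CallProfile(800, 400, 3.0, 15.0)  # 0.0084 USD per call
judge = CallProfile(1500, 200, 1.0, 5.0)   # 0.0025 USD per call
estimate = estimate_run_cost(12, target, judge)  # 12 * 0.0109 = about 0.13 USD
```

With a spend cap of 0.10 USD this run would be blocked. Trimming to 9 cases (about 0.098 USD) or raising the cap would enable it again.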

How are tool tests scored?

Deterministically, not by the LLM judge. A tool test passes only if the model called the expected tool, the arguments parsed as JSON and matched the tool's schema (required fields, types), any required values matched, and no forbidden tool was called. Because it's deterministic, a tool test's pass/fail doesn't drift run-to-run the way judged scores can.
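The rules above can be sketched as a small pure function. This is an illustration of the checks as described, not litmus's checker: the test-case keys, the shape of a recorded call, and the simplified schema check (required fields and primitive types only) are all assumptions.

```python
import json

_PY_TYPES = {"string": str, "number": (int, float), "integer": int,
             "boolean": bool, "object": dict, "array": list}


def _type_ok(value, type_name: str) -> bool:
    # In Python, bool is a subclass of int, so reject it for numeric types.
    if type_name in ("integer", "number") and isinstance(value, bool):
        return False
    expected = _PY_TYPES.get(type_name)
    return expected is None or isinstance(value, expected)


def check_tool_test(test: dict, calls: list, tools: list):
    """Return (passed, reason). Each call is {"name": str, "arguments": str}."""
    called = [c["name"] for c in calls]
    for name in test.get("forbiddenTools", []):
        if name in called:
            return False, f"forbidden tool called: {name}"
    match = next((c for c in calls if c["name"] == test["expectedTool"]), None)
    if match is None:
        return False, f"expected tool not called: {test['expectedTool']}"
    try:
        args = json.loads(match["arguments"])
    except (TypeError, ValueError):
        return False, "arguments are not valid JSON"
    if not isinstance(args, dict):
        return False, "arguments are not a JSON object"
    schema = next((t.get("parameters", {}) for t in tools
                   if t["name"] == test["expectedTool"]), {})
    for field in schema.get("required", []):
        if field not in args:
            return False, f"missing required field: {field}"
    for field, value in args.items():
        spec = schema.get("properties", {}).get(field, {})
        if "type" in spec and not _type_ok(value, spec["type"]):
            return False, f"wrong type for field: {field}"
    for field, want in test.get("requiredArgs", {}).items():
        if args.get(field) != want:
            return False, f"required value mismatch: {field}"
    return True, "ok"


tools = [{"name": "get_weather",
          "parameters": {"type": "object",
                         "properties": {"city": {"type": "string"}},
                         "required": ["city"]}}]
test = {"expectedTool": "get_weather", "forbiddenTools": ["send_email"],
        "requiredArgs": {"city": "Oslo"}}
verdict = check_tool_test(
    test, [{"name": "get_weather", "arguments": '{"city": "Oslo"}'}], tools)
```

The function has no randomness and no model in the loop, so the same recorded calls always produce the same verdict.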

Does testing an MCP server make real calls?

Yes. Unlike Agent scenarios, where tools are mocked, MCP mode connects to your live server and every tool call runs for real, with real side effects. The security scan goes further and sends adversarial payloads to that live server. Only test servers you own or control, and prefer a staging instance. Connecting requires host permission for the server's origin (granted per origin, stored in this browser); an auth header is optional.