The judge scores each output. The default, "same as target", always works; pick a different model to reduce self-preference bias.
Custom model ID (advanced)
Use this for a model not in the list, or one your key has special access to.
Pass threshold (0–10)
Spend cap (USD)
Samples per case (variance)
Run each case N times to surface run-to-run variance. 1 is fastest; 3–5 reveals how noisy a score is (costs N× more).
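As a rough illustration of what N samples buy you, the sketch below summarizes the judge scores for one case and estimates the cost multiplier. The function names, the use of the mean for the pass decision, and the per-run cost figure are assumptions made for this example, not litmus's actual implementation.

```python
from statistics import mean, stdev


def summarize_runs(scores, threshold):
    """Summarize the judge scores (0-10) from N runs of one eval case.

    Assumption: the case passes when the mean score meets the threshold.
    """
    if not scores:
        raise ValueError("need at least one score")
    avg = mean(scores)
    # A sample standard deviation needs two or more runs;
    # a single run carries no information about spread.
    spread = stdev(scores) if len(scores) > 1 else None
    return {
        "runs": len(scores),
        "mean": avg,
        "stdev": spread,
        "min": min(scores),
        "max": max(scores),
        "passed": avg >= threshold,
    }


def estimated_cost(cost_per_run_usd, cases, samples_per_case):
    """Cost grows linearly with the number of samples per case."""
    return cost_per_run_usd * cases * samples_per_case


summary = summarize_runs([6, 8, 7], threshold=7)
total = estimated_cost(0.5, cases=4, samples_per_case=3)
```

With one sample, `stdev` is `None`, which is why a single run cannot tell a noisy score from a stable one.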
What are we testing?
Grab your system prompt
Grab the system prompt from whatever console you're on, or paste it.
Load a version from this session
What are you testing?
Target model
AI
Output type
Your API key (stored only in this browser)
Prompt analysis
The judges that score outputs
Evaluation prompts
litmus found these dimensions and wrote a rubric for each. Edit any, or add your own.
Dimensions
Rubric
Eval cases · remove any you don't want
What we'll test against
🔧 Tool tests Beta
Test whether the model calls the right tool with valid arguments. Define the tools available to it, then add a case that asserts the expected call.
Tool definitions · JSON array
Or add one manually
Expected tool
Forbidden tools · comma-separated
Required args · JSON (optional)
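The fields above (tool definitions, expected tool, forbidden tools, required args) could be checked along these lines. This is a minimal sketch: the tool definition shape, the `get_weather` and `delete_user` tool names, and the `check_tool_call` helper are all hypothetical, and the check only does exact matching on required arguments rather than full schema validation.

```python
import json

# Hypothetical tool definitions, written as a JSON array.
TOOLS = json.loads("""
[
  {"name": "get_weather",
   "description": "Look up current weather for a city",
   "parameters": {"type": "object",
                  "properties": {"city": {"type": "string"}},
                  "required": ["city"]}},
  {"name": "delete_user",
   "description": "Delete a user account",
   "parameters": {"type": "object", "properties": {}}}
]
""")


def check_tool_call(calls, expected_tool, forbidden=(), required_args=None):
    """Return (passed, reasons) for a list of model tool calls.

    Each call is assumed to look like {"name": str, "arguments": dict}.
    """
    reasons = []
    for call in calls:
        if call["name"] in forbidden:
            reasons.append(f"forbidden tool called: {call['name']}")
    matching = [c for c in calls if c["name"] == expected_tool]
    if not matching:
        reasons.append(f"expected tool not called: {expected_tool}")
    elif required_args:
        # At least one matching call must carry every required key/value.
        satisfied = any(
            all(c["arguments"].get(k) == v for k, v in required_args.items())
            for c in matching
        )
        if not satisfied:
            reasons.append("required args missing or different")
    return (not reasons, reasons)


calls = [{"name": "get_weather", "arguments": {"city": "Paris", "unit": "c"}}]
ok, why = check_tool_call(calls, "get_weather", ["delete_user"], {"city": "Paris"})
```

Extra arguments such as `unit` are tolerated here; only the keys listed in the required args are compared.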
🤖 Agent scenarios Beta
Test a multi-step agent: give it a goal and mock tools with scripted results (you can inject a failure to test recovery). litmus runs the model in a loop, feeding tool results back, and scores whether it reaches the goal. Tools are mocked; nothing real is executed.
Scenario · JSON
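To make the loop concrete, here is a sketch of a scenario with scripted tool results (including an injected failure) and a driver that replays them. The scenario fields (`goal`, `tools`, `max_steps`), the `run_scenario` driver, and the `scripted_policy` stand-in for the model are assumptions for illustration; litmus's real scenario schema may differ.

```python
import json

# Hypothetical scenario: the first lookup fails, the retry succeeds.
SCENARIO = json.loads("""
{
  "goal": "Refund order 123",
  "tools": {
    "lookup_order": [{"error": "timeout"},
                     {"result": {"id": 123, "status": "paid"}}],
    "issue_refund": [{"result": {"ok": true}}]
  },
  "max_steps": 6
}
""")


def run_scenario(scenario, next_action):
    """Drive a mocked tool loop; nothing real is executed.

    next_action(history) returns {"tool": name, "args": {...}}
    or {"final": text} when the agent decides it is done.
    """
    # Each tool replays its scripted results in order; the last one repeats.
    cursors = {name: 0 for name in scenario["tools"]}
    history = []
    for _ in range(scenario["max_steps"]):
        action = next_action(history)
        if "final" in action:
            return {"finished": True, "final": action["final"], "history": history}
        name = action["tool"]
        script = scenario["tools"].get(name)
        if script is None:
            outcome = {"error": f"unknown tool: {name}"}
        else:
            outcome = script[min(cursors[name], len(script) - 1)]
            cursors[name] += 1
        history.append({"tool": name, "args": action.get("args", {}), "outcome": outcome})
    return {"finished": False, "final": None, "history": history}


def scripted_policy(history):
    """Stand-in for the model: retry after an error, then refund, then stop."""
    if not history or "error" in history[-1]["outcome"]:
        return {"tool": "lookup_order", "args": {"id": 123}}
    if history[-1]["tool"] == "lookup_order":
        return {"tool": "issue_refund", "args": {"id": 123}}
    return {"final": "Refund issued for order 123."}


result = run_scenario(SCENARIO, scripted_policy)
```

The returned history is what a judge would then score against the goal; the `max_steps` cap keeps a model that never finishes from looping indefinitely.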
Estimated cost
Running · on your key
Generating & judging
Results
How this version scored
◷ Scores vary from run to run because the target model is non-deterministic. The same prompt can score differently on different runs, so a change in score doesn't always reflect a change in the prompt; re-run a version to see its spread.