litmus

What are we testing?

Grab your system prompt

Grab the system prompt from whatever console you're on, or paste it.

What are you testing?
Target model
AI
Output type

Prompt analysis


The judges that score outputs

Evaluation prompts

litmus found these dimensions and wrote a rubric for each. Edit any, or add your own.

Dimensions
Rubric

Eval cases · remove any you don't want

What we'll test against

🔧 Tool tests Beta

Test whether the model calls the right tool with valid arguments. Define the tools available to it, then add a case that asserts the expected call.

Tool definitions · JSON array
Or add one manually
Expected tool
Forbidden · comma-sep
Required args · JSON (optional)
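
As a rough illustration of what a tool test asserts, here is a minimal Python sketch. Everything in it is an assumption made for the example: the tool names, the JSON-schema-style definition layout, the shape of a recorded call (`name` plus `arguments`), and the `check_tool_calls` helper are hypothetical, and litmus's own format may differ.

```python
import json

# Hypothetical tool definitions as a JSON array (assumed layout, not litmus's schema).
TOOLS_JSON = """
[
  {"name": "get_weather",
   "description": "Look up current weather for a city",
   "parameters": {"type": "object",
                  "properties": {"city": {"type": "string"}},
                  "required": ["city"]}},
  {"name": "send_email",
   "description": "Send an email",
   "parameters": {"type": "object",
                  "properties": {"to": {"type": "string"},
                                 "body": {"type": "string"}},
                  "required": ["to", "body"]}}
]
"""

def check_tool_calls(calls, expected_tool, forbidden=(), required_args=None):
    """Return a list of failure messages; an empty list means the case passes."""
    failures = []
    names = [call["name"] for call in calls]
    # The expected tool must appear among the calls the model made.
    if expected_tool not in names:
        failures.append(f"expected tool {expected_tool!r} was not called")
    # No forbidden tool may appear.
    for name in names:
        if name in forbidden:
            failures.append(f"forbidden tool {name!r} was called")
    # Each required argument must match on calls to the expected tool.
    for call in calls:
        if call["name"] != expected_tool:
            continue
        for key, want in (required_args or {}).items():
            got = call.get("arguments", {}).get(key)
            if got != want:
                failures.append(f"arg {key!r}: expected {want!r}, got {got!r}")
    return failures

tools = json.loads(TOOLS_JSON)
passing = check_tool_calls(
    [{"name": "get_weather", "arguments": {"city": "Paris"}}],
    expected_tool="get_weather",
    forbidden=["send_email"],
    required_args={"city": "Paris"},
)
```

The three parameters mirror the Expected tool, Forbidden, and Required args fields above; returning messages rather than a boolean keeps each failure readable.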
🤖 Agent scenarios Beta

Test a multi-step agent: give it a goal and mock tools (with scripted results — you can inject a failure to test recovery). litmus runs the model in a loop, feeding tool results back, and scores whether it reaches the goal. Tools are mocked — nothing real is executed.

Scenario · JSON
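
The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scenario keys (`goal`, `goal_check`, `tools`), the action format, and the `run_scenario` and `retrying_model` functions are all hypothetical, a scripted function stands in for the real model, and a substring check stands in for whatever scoring litmus actually applies.

```python
def run_scenario(scenario, model_step, max_steps=8):
    """Run a model policy against mocked tools; nothing real is executed."""
    # Copy the scripted results so each tool hands them out in order.
    scripts = {name: list(results) for name, results in scenario["tools"].items()}
    transcript = []
    for _ in range(max_steps):
        action = model_step(scenario["goal"], transcript)
        if action["type"] == "final":
            # Stand-in scoring: does the final answer contain the expected text?
            reached = scenario["goal_check"] in action["text"]
            return {"reached_goal": reached, "transcript": transcript}
        queue = scripts.get(action["tool"], [])
        result = queue.pop(0) if queue else {"error": "no scripted result"}
        # Feed the mocked result back by recording it for the next step.
        transcript.append({"tool": action["tool"], "result": result})
    return {"reached_goal": False, "transcript": transcript}

def retrying_model(goal, transcript):
    """Hypothetical stand-in for the model: retry the lookup until it succeeds."""
    if transcript and "error" not in transcript[-1]["result"]:
        status = transcript[-1]["result"]["status"]
        return {"type": "final", "text": f"Order status: {status}"}
    return {"type": "call", "tool": "lookup_order"}

# The first scripted result is an injected failure, to test recovery.
SCENARIO = {
    "goal": "Report the status of order 123",
    "goal_check": "shipped",
    "tools": {"lookup_order": [{"error": "timeout"}, {"status": "shipped"}]},
}

outcome = run_scenario(SCENARIO, retrying_model)
```

Here the stand-in model hits the injected timeout, retries, and then reports the status, so the run reaches its goal after two mocked tool calls.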
Estimated cost

Running · on your key

Generating & judging

Results

How this version scored

◷ Scores vary run-to-run — the target model is non-deterministic. A different score doesn't always mean a different prompt; re-run a version to see its spread.
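
One way to reason about run-to-run spread, sketched in Python. The helper names and the threshold rule are assumptions made for this example, not something litmus computes; the rule is a crude heuristic rather than a statistical test.

```python
from statistics import mean, stdev

def spread(scores):
    """Summarize repeated runs of the same prompt version."""
    return {
        "mean": mean(scores),
        # stdev needs at least two runs; report 0.0 for a single run.
        "stdev": stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }

def likely_real_change(old_runs, new_runs):
    """Crude heuristic: the means differ by more than the larger spread."""
    old, new = spread(old_runs), spread(new_runs)
    return abs(old["mean"] - new["mean"]) > max(old["stdev"], new["stdev"])
```

With only a handful of runs this stays a rough guide, but it captures the point above: a score difference smaller than the spread of a single version is not evidence that the prompt changed anything.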

Speed · measured this run
Cases

From the failing cases

Fix these, in order

Every pass is kept

Versions

litmus 0.1.0 · local-first · BYOK