--- name: metrillm description: Benchmark local LLM models — measures performance (tok/s, TTFT, memory), quality (reasoning, math, coding, instruction following, structured output, multilingual), and computes a hardware fitness verdict. Works with Ollama. argument-hint: "[model-name]" author: MetriLLM source: https://github.com/MetriLLM/metrillm license: MIT allowed-tools: Bash, Read --- # MetriLLM — Benchmark Local LLM Models Benchmark any local LLM model directly from your AI coding assistant. Get a clear verdict on whether a model fits your hardware. ## Setup 1. Install and start [Ollama](https://ollama.com) 2. Pull a model: `ollama pull llama3.2:3b` ## Usage ### List available models ```bash ollama list ``` ### Run a full benchmark ```bash npx metrillm bench --model $ARGUMENTS --json ``` This measures: - **Performance**: tokens/second, time to first token, memory usage - **Quality**: reasoning, math, coding, instruction following, structured output, multilingual - **Fitness verdict**: EXCELLENT / GOOD / MARGINAL / NOT RECOMMENDED A full benchmark takes 1-5 minutes depending on model size. ### Performance-only benchmark (faster) ```bash npx metrillm bench --model $ARGUMENTS --perf-only --json ``` Takes about 30 seconds. Skips quality evaluation. ### View previous results ```bash ls ~/.metrillm/results/ ``` Read any JSON file to see full benchmark details. ### Share to public leaderboard ```bash npx metrillm bench --model $ARGUMENTS --share ``` ## Interpreting Results | Verdict | Score | Meaning | |---|---|---| | EXCELLENT | >= 80 | Fast and accurate — great fit | | GOOD | >= 60 | Solid — suitable for most tasks | | MARGINAL | >= 40 | Usable but with tradeoffs | | NOT RECOMMENDED | < 40 | Too slow or inaccurate | Key metrics to highlight: - `tokensPerSecond` > 30 = good for interactive use - `ttft` < 500ms = responsive - `memoryUsedGB` vs available RAM = will it fit? ## Tips - Use `--perf-only` for quick tests - Smaller models (1-3B) benchmark in ~30s, larger (7B+) in 2-5 min - Close GPU-intensive apps before benchmarking - Thinking models (Qwen3, etc.) generate many tokens and take longer