# UPskill
Generate and evaluate agent skills based on traces with agents. Create skills with teacher models (expensive/slow) that student models (cheap/fast) can use to perform harder tasks reliably.
> [!TIP]
>
> UPskill v2 - recommended default config file now runs evaluations on Hugging Face Jobs. Make sure
> to set your `HF_TOKEN` and use `--artifact-repo ` for job creation and result capture
## Quick Start
Install upskill:
```bash
uv pip install upskill
# or just use uv
uvx upskill
```
Create a new skill
```bash
upskill generate "write good git commit messages"
# or based on previous agent traces
upskill generate "document the pattern" --from ./trace.md
# Skills are saved to ./skills/{skill-name}/ by default
```
Generate a skill with a teaching model and evaluate it on a student model.
```bash
upskill generate "write good git commit messages" --model sonnet --eval-model haiku
```
Benchmark a set of models against a skill.
```bash
upskill eval ./skills/git-commit-messages/ -m haiku -m sonnet
# logs pretty printed to the terminal
```
View the results later.
```bash
upskill runs --skill git-commit-messages
```
## Development checks
This repo uses a CI flow inspired by `fast-agent` with separate format, lint, typecheck, and test
stages.
Install dev dependencies:
```bash
uv sync --extra dev
```
Run the quality gates locally:
```bash
uv run scripts/format.py
uv run scripts/lint.py
uv run scripts/typecheck.py
uv run scripts/cpd.py --check
uv run --extra dev pytest -v
```
Or use the helper script to run the whole sequence:
```bash
uv run scripts/check.py
```
Add `--sync` to include `uv sync --extra dev`, or `--skip-tests` for a faster static-only pass.
To auto-format before re-running checks:
```bash
uv run --extra dev scripts/format.py --write
```
Current enforced standards:
- `ruff format --check` for formatting
- `ruff check` for style, imports, modernization, bugbear, simplify, and import-hygiene rules
- cyclomatic complexity via Ruff `C90` with `max-complexity = 15`
- `ty check` across `src`, `tests`, and `scripts`
- `pmd cpd` via `scripts/cpd.py --check` to flag duplicated code in `src/`
- `pytest` for the test suite
CI enforcement lives in `.github/workflows/ci.yml` and runs on pushes and pull requests targeting
`main`.
## Model Handling Overview
upskill uses distinct phases with explicit model roles:
- **Skill generation**: create/refine `SKILL.md`
- **Test generation**: create synthetic evaluation cases
- **Evaluation**: run tests against evaluator model(s)
- **Benchmark**: repeated evaluation across multiple runs/models
Model flags by command:
| Command | Flag | Meaning |
|---|---|---|
| `generate` | `--model` | Skill generation/refinement model |
| `generate` | `--test-gen-model` | Test generation model override |
| `generate` | `--eval-model` | Optional extra cross-model eval pass |
| `eval` | `-m/--model` | Evaluation model(s) (repeatable) |
| `eval` | `--test-gen-model` | Test generation model override (when tests are generated) |
| `benchmark` | `-m/--model` | Evaluation model(s) to benchmark |
| `benchmark` | `--test-gen-model` | Test generation model override (when tests are generated) |
| `runs` / `plot` | `-m/--model` | Historical results filter only |
`upskill eval` enters **benchmark mode** whenever you pass multiple `-m` values or `--runs > 1`.
In benchmark mode, baseline comparison is always off; `--no-baseline` is redundant.
## Commands
### `upskill generate`
Generate a skill from a task description with automatic evaluation and refinement.
```bash
upskill generate TASK [OPTIONS]
```
**Arguments:**
- `TASK` - Description of what the skill should teach
**Options:**
- `-e, --example` - Input -> output example (can be repeated)
- `-f, --from PATH` - Improve from existing skill dir or agent trace file (auto-detected)
- `-m, --model MODEL` - Skill generation model (e.g., 'sonnet', 'haiku', 'anthropic.claude-sonnet-4-20250514')
- `--test-gen-model MODEL` - Override test generation model for this run
- `-o, --output PATH` - Output directory for skill
- `--no-eval` - Skip evaluation and refinement
- `--eval-model MODEL` - Different model to evaluate skill on
- `--executor [local|jobs]` - Execution backend for evaluation/refinement; overrides config
- `--artifact-repo TEXT` - Dataset repo for remote fast-agent job artifacts (required with `--executor jobs`)
- `--max-parallel N` - Max concurrent evaluation executions; overrides config
- `--runs-dir PATH` - Directory for run logs (default: ./runs)
- `--log-runs / --no-log-runs` - Log run data (default: enabled)
**Examples:**
```bash
# Basic usage
upskill generate "parse JSON Schema files"
# Make and evaluate skills for less powerful models
upskill generate "write git commits" --model sonnet --eval-model haiku
# Remote execution on Hugging Face Jobs
upskill generate "parse invoices" --executor jobs --artifact-repo /upskill-tests
# Improve an existing skill (auto-detected as directory)
upskill generate "add more error handling examples" --from ./skills/api-errors/
# Generate from an agent trace file (auto-detected as file)
upskill generate "document the pattern" --from ./trace.json
# Skip evaluation during generation (evaluate separately with upskill eval)
upskill generate "parse YAML" --no-eval
```
**Output:**
```
Generating skill with sonnet...
Generating test cases...
Evaluating on sonnet... (attempt 1)
60% -> 100% (+40%) OK
git-commit-messages
Write clear, conventional commit messages that follow best practices.
SKILL.md ~450 tokens
baseline ████████████░░░░░░░░ 60%
with skill ████████████████████ 100% (+40%)
tokens: 1200 → 800 (-33%)
Saved to ./skills/git-commit-messages
```
### `upskill eval`
Evaluate an existing skill against test cases. Supports single-model evaluation with baseline comparison, or multi-model benchmarking.
```bash
upskill eval SKILL_PATH [OPTIONS]
```
**Arguments:**
- `SKILL_PATH` - Path to skill directory containing SKILL.md
**Options:**
- `-t, --tests PATH` - Test cases JSON file
- `-m, --model MODEL` - Model(s) to evaluate against (repeatable for multi-model benchmarking)
- `--test-gen-model MODEL` - Override test generation model when tests must be generated
- `--runs N` - Number of runs per model; overrides config
- `--no-baseline` - Skip baseline comparison (simple eval mode only; ignored in benchmark mode)
- `-v, --verbose` - Show per-test results
- `--executor [local|jobs]` - Execution backend for evaluation; overrides config
- `--max-parallel N` - Max concurrent evaluation executions; overrides config
- `--log-runs / --no-log-runs` - Log run data (default: enabled)
- `--runs-dir PATH` - Directory for run logs
**Examples:**
```bash
# Basic evaluation with baseline comparison
upskill eval ./skills/my-skill/
# With verbose output
upskill eval ./skills/my-skill/ -v
# Custom test cases
upskill eval ./skills/my-skill/ --tests ./tests.json
# Evaluate on specific model
upskill eval ./skills/my-skill/ -m haiku
# Multi-model benchmarking (compare models)
upskill eval ./skills/my-skill/ -m haiku -m sonnet
# Multiple runs per model for statistical significance
upskill eval ./skills/my-skill/ -m haiku -m sonnet --runs 5
# Evaluate a local model configured in fast-agent
upskill eval ./skills/my-skill/ -m generic.my-model
# Skip baseline (just test with skill)
upskill eval ./skills/my-skill/ --no-baseline
# Benchmark mode is triggered by multiple models OR --runs > 1
upskill eval ./skills/my-skill/ -m haiku --runs 5
# Disable run logging
upskill eval ./skills/my-skill/ --no-log-runs
```
**Benchmark output:**
```
Evaluating my-skill across 2 model(s)
3 test case(s), 5 run(s) per model
haiku
Pass rate: 4/5 (80%) Avg assertions: 2.8/3
sonnet
Pass rate: 5/5 (100%) Avg assertions: 3.0/3
┏━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Model ┃ Pass Rate ┃ Avg Assertions ┃ Avg Tokens ┃
┡━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ haiku │ 4/5 │ 2.8/3 │ 1250 │
│ sonnet │ 5/5 │ 3.0/3 │ 1890 │
└────────┴───────────┴────────────────┴────────────┘
```
**Test cases JSON format:**
```json
[
{"input": "Write a commit for adding login", "expected": {"contains": ["feat", "login"]}},
{"input": "Fix the null pointer bug", "expected": {"contains": ["fix", "bug"]}}
]
```
### `upskill list`
List all generated skills in a tree view.
```bash
upskill list [OPTIONS]
```
**Options:**
- `-d, --dir PATH` - Skills directory to list
- `-v, --verbose` - Show skill contents preview
**Examples:**
```bash
# List skills in default directory
upskill list
# List from custom directory
upskill list -d ./my-skills/
# Show preview of skill contents
upskill list -v
```
**Output:**
```
./skills
├── git-commit-messages
│ ├── Write clear, conventional commit messages...
│ └── files
│ └── SKILL.md
├── api-error-handling
│ ├── Handle API errors gracefully with proper logging...
│ └── files
│ ├── SKILL.md
│ └── references/error-codes.md
└── yaml-parsing
├── Parse YAML files safely with schema validation...
└── files
├── SKILL.md
└── scripts/validate.py
```
### `upskill runs`
View run results as a plot, or export to CSV. By default, shows a visual comparison of baseline vs with-skill performance.
```bash
upskill runs [OPTIONS]
```
**Options:**
- `-d, --dir PATH` - Runs directory
- `-s, --skill TEXT` - Filter by skill name(s) (repeatable)
- `-m, --model TEXT` - Filter historical run data by model(s) (repeatable)
- `--metric [success|tokens]` - Metric to display (default: success)
- `--csv PATH` - Export to CSV instead of plot
**Examples:**
```bash
# View results plot (default)
upskill runs
# Filter by skill and models
upskill runs -s my-skill -m haiku -m sonnet
# Show token usage instead of success rate
upskill runs --metric tokens
# Export to CSV
upskill runs --csv ./results.csv
# Custom runs directory
upskill runs -d ./my-runs/
```
**Plot output:**
```
skill: git-commit-messages
haiku
baseline ████████████░░░░░░░░ 60%
with skill ████████████████░░░░ 80% (+20%)
sonnet
baseline ████████████░░░░░░░░ 60%
with skill ████████████████████ 100% (+40%)
```
**Matrix view (multiple skills and models):**
```
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ skill ┃ haiku ┃ sonnet ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ git-commit-messages │ 60%→80% │ 60%→100% │
│ api-error-handling │ 40%→70% │ 50%→90% │
│ yaml-parsing │ 70%→90% │ 80%→100% │
└─────────────────────┴──────────────┴──────────────┘
```
## Skill Output Format
Skills are saved in a standard directory format:
```
./skills/{skill-name}/
├── SKILL.md # Main skill instructions
├── references/ # Supporting documents (optional)
└── scripts/ # Executable scripts (optional)
```
**Example SKILL.md:**
```markdown
# git-commit-messages
Write clear, conventional commit messages that follow best practices.
## Instructions
This skill teaches how to write effective git commit messages
following the Conventional Commits specification.
## Format
Commit messages should follow this structure:
():