---
name: eval-recipes-runner
version: 1.0.0
description: |
  Run Microsoft's eval-recipes benchmarks to validate amplihack improvements
  against baseline agents. Auto-activates when testing improvements, running
  evals, or benchmarking changes.
---

# eval-recipes Runner Skill

## Purpose

Run Microsoft's eval-recipes benchmarks to validate amplihack improvements against baseline agents.

## When to Use

- User asks to "test with eval-recipes"
- User says "run the evals" or "benchmark this change"
- User wants to validate improvements against codex/claude_code
- Testing a PR branch to prove it improves scores

## Capabilities

I can run eval-recipes benchmarks to:

1. Test specific amplihack branches
2. Compare against baseline agents (codex, claude_code)
3. Run specific tasks (linkedin_drafting, email_drafting, etc.)
4. Compare before/after scores for PRs
5. Generate reports with score improvements

## How It Works

### Setup (One-Time)

```bash
# Clone eval-recipes from Microsoft
git clone https://github.com/microsoft/eval-recipes.git ~/eval-recipes

# Copy our agent configs (run from the amplihack repo root)
cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/

# Install dependencies
cd ~/eval-recipes
uv sync
```

### Running Benchmarks

**Test a specific branch:**

```bash
# Update install.dockerfile to use the specific branch, then run the benchmark
cd ~/eval-recipes
uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3
```

**Compare before/after:**

```bash
# Test baseline (main)
uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting

# Test PR branch (edit install.dockerfile to checkout the PR branch)
uv run eval_recipes/main.py --agent amplihack_pr1443 --task linkedin_drafting

# Compare scores
```

### Available Tasks

Common tasks from eval-recipes:

- `linkedin_drafting` - Create tool for LinkedIn posts (scored 6.5/100 before PR #1443)
- `email_drafting` - Create CLI tool for emails (scored 26/100 before)
- `arxiv_paper_summarizer` - Research tool
- `github_docs_extractor` - Documentation tool
- Many more in `~/eval-recipes/data/tasks/`

### Typical Workflow

When the user says "test this change with eval-recipes":

1. **Identify the branch/PR** to test
2. **Update the agent config** to use that branch:

   ```dockerfile
   # In .claude/agents/eval-recipes/amplihack/install.dockerfile
   RUN git clone https://github.com/rysweet/...git /tmp/amplihack && \
       cd /tmp/amplihack && \
       git checkout BRANCH_NAME && \
       pip install -e .
   ```

3. **Copy to eval-recipes:**

   ```bash
   cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/
   ```

4. **Run the benchmark:**

   ```bash
   cd ~/eval-recipes
   uv run eval_recipes/main.py --agent amplihack --task TASK_NAME --trials 3
   ```

5. **Report scores** and compare with the baseline (steps 2-4 can be scripted; see the sketch after this list)
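Steps 2-4 above are mechanical, so they can be wrapped in one helper. The sketch below is ours, not part of eval-recipes: it assumes `install.dockerfile` contains a `git checkout <branch>` line in the form shown in step 2, that eval-recipes lives at `~/eval-recipes`, and that the script is run from the amplihack repo root. The `run_branch.sh` name is hypothetical.

```bash
#!/usr/bin/env bash
# run_branch.sh -- sketch: point the amplihack agent at a branch and benchmark it.
# Assumes the install.dockerfile layout shown in step 2 and the default
# ~/eval-recipes checkout location; run from the amplihack repo root.
set -euo pipefail

branch="$1"                       # e.g. feat/issue-1435-task-classification
task="${2:-linkedin_drafting}"
dockerfile=".claude/agents/eval-recipes/amplihack/install.dockerfile"

# Swap the checked-out branch in the dockerfile (keeps a .bak backup).
sed -i.bak -E "s|(git checkout )[^[:space:]]+|\1${branch}|" "$dockerfile"

# Sync agent configs into eval-recipes and run the benchmark.
cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/
cd ~/eval-recipes
uv run eval_recipes/main.py --agent amplihack --task "$task" --trials 3
```

Example: `./run_branch.sh feat/issue-1435-task-classification linkedin_drafting`. The `sed` only rewrites the branch token, so the trailing `&& \` continuation in the dockerfile is preserved.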
### Expected Scores

**Baseline (main branch):**

- Overall: 40.6/100
- LinkedIn: 6.5/100
- Email: 26/100

**With PR #1443 (task classification):**

- Expected: 55-60/100 (+15-20 points)
- LinkedIn: 30-40/100 (creates an actual tool)
- Email: 45/100 (consistent execution)

## Example Usage

**User says:** "Test PR #1443 with eval-recipes on the LinkedIn task"

**I do:**

1. Update install.dockerfile to checkout `feat/issue-1435-task-classification`
2. Copy to eval-recipes: `cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/`
3. Run: `cd ~/eval-recipes && uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3`
4. Report results: "Score: 35.2/100 (up from 6.5 baseline)"

## Prerequisites

- eval-recipes cloned to `~/eval-recipes`
- API key in environment: `export ANTHROPIC_API_KEY=sk-ant-...`
- Docker installed (for containerized runs)
- uv installed: `curl -LsSf https://astral.sh/uv/install.sh | sh`

## Notes

- Benchmarks take 2-15 minutes per task depending on complexity
- Multiple trials (3-5) give more reliable averages
- Docker builds can be cached for speed
- Results are saved to `.benchmark_results/` in the eval-recipes repo

## Automation

For fully autonomous testing:

```bash
# Test suite for a PR
tasks="linkedin_drafting email_drafting arxiv_paper_summarizer"
for task in $tasks; do
  uv run eval_recipes/main.py --agent amplihack --task $task --trials 3
done

# Compare results
cat .benchmark_results/*/amplihack/*/score.txt
```
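The `cat` at the end dumps raw per-trial scores; to turn them into a before/after summary, a small loop over agents works. This is a sketch that assumes each `score.txt` holds a single numeric score and that results follow the `.benchmark_results/<run>/<agent>/<task>/score.txt` layout implied above; verify the glob against the actual directory tree before relying on it.

```bash
# Sketch: average per-trial scores for each agent under comparison.
# Assumes one numeric score per score.txt and the results layout above.
cd ~/eval-recipes
for agent in amplihack amplihack_pr1443; do
  avg=$(cat .benchmark_results/*/"$agent"/*/score.txt 2>/dev/null |
    awk '{sum += $1; n++} END {if (n) printf "%.1f\n", sum / n; else print "n/a"}')
  echo "$agent: $avg"
done
```

Printing an average per agent makes the "+X points" claim in the final report directly reproducible from the raw trial data.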