---
namespace: aiwg
name: eval-workflow
platforms: [all]
description: Run evaluation tests against a multi-agent workflow to assess orchestration quality and failure archetype resistance
---

# Workflow Evaluation

Run automated evaluation tests against a multi-agent workflow.

## Research Foundation

- **REF-001**: BP-9 - Continuous evaluation of agent performance
- **REF-002**: KAMI benchmark methodology for real agentic task evaluation

## Usage

```bash
/eval-workflow flow-security-review-cycle
/eval-workflow flow-inception-to-elaboration --scenario distractor-test
/eval-workflow flow-deploy-to-production --verbose --strict
```

## Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| workflow-name | Yes | Workflow (flow command) to evaluate |

## Options

| Option | Default | Description |
|--------|---------|-------------|
| --scenario | all | Specific scenario to run |
| --verbose | false | Show detailed test output |
| --output | stdout | Output file for results |
| --strict | false | Fail on any test failure |
| --timeout | 300 | Maximum seconds per scenario |

## What Gets Evaluated

### Orchestration Quality

- **Agent coordination**: Parallel agents launched correctly in single message
- **Handoff fidelity**: Artifacts pass correctly between phases
- **Gate enforcement**: Phase gates checked before transition

### Archetype Resistance

- `grounding-test` — Archetype 1: Premature action without reading state
- `distractor-test` — Archetype 3: Context pollution from irrelevant artifacts
- `recovery-test` — Archetype 4: Fragile execution when subagent fails

### Output Validation

- Required artifacts created in correct `.aiwg/` paths
- Document structure matches templates
- Traceability links intact

## Process

1. **Load Workflow**: Read flow command definition
2. **Select Scenarios**: Based on --scenario flag or all applicable
3. **Setup Workspace**: Create isolated `.aiwg/working/` test space
4. **Execute Flow**: Run workflow against each scenario
5. **Validate Outputs**: Check artifact presence, structure, and content
6. **Generate Report**: Output results with pass/fail per assertion
7. **Cleanup**: Remove test workspace

## Output Format

```json
{
  "workflow": "flow-security-review-cycle",
  "timestamp": "2026-04-01T10:30:00Z",
  "scenarios": {
    "grounding-test": {
      "passed": true,
      "score": 1.0,
      "assertions": [
        {"name": "threat-model-created", "passed": true},
        {"name": "security-gate-run", "passed": true}
      ],
      "duration_ms": 45000
    },
    "distractor-test": {
      "passed": false,
      "score": 0.7,
      "assertions": [
        {"name": "correct-assets-only", "passed": false, "evidence": "Distractor file referenced in output"}
      ],
      "duration_ms": 38000
    }
  },
  "summary": {
    "passed": 4,
    "failed": 1,
    "total": 5,
    "score": 0.80
  }
}
```

## Examples

```bash
# Full evaluation of a workflow
/eval-workflow flow-security-review-cycle

# Single scenario with verbose output
/eval-workflow flow-inception-to-elaboration --scenario grounding-test --verbose

# Strict mode with output saved
/eval-workflow flow-deploy-to-production --strict --output .aiwg/reports/deploy-eval.json
```

## Success Criteria

| Metric | Target |
|--------|--------|
| Artifact creation | 100% |
| Grounding compliance | >90% |
| Distractor resistance | >80% |
| Recovery success | ≥80% |
| Overall | ≥85% |

## Related Commands

- `/eval-agent` - Test individual agents
- `/eval-report` - Generate aggregate quality report
- `aiwg lint agents` - Static validation

Evaluate workflow: $ARGUMENTS

## References

- @$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md — aiwg-evals addon overview
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/subagent-scoping.md — Parallel agent coordination rules evaluated in workflows
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md — Concrete success thresholds and test criteria
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC flow commands available for workflow evaluation
- @$AIWG_ROOT/docs/cli-reference.md — CLI reference for aiwg lint and eval commands