---
name: mcp-evaluation-skill
version: 1.0.0
category: mcp-development
description: Comprehensive evaluation creation for MCP servers - question generation, answer verification, and XML formatting for agent usability testing
triggers:
- "create evaluations"
- "evaluation questions"
- "test MCP server"
- "verify MCP tools"
- "agent usability"
dependencies:
- mcp-builder-skill
author: Engineering Standards Committee
last_updated: 2025-12-29
---
# MCP Evaluation Skill
## Description
This skill provides a systematic approach to creating evaluation suites for MCP (Model Context Protocol) servers. Evaluations test whether AI agents can use the server's tools to answer realistic, complex questions, which is the most direct measure of MCP server quality.
**Core Capabilities:**
- Question generation methodology (simple → moderate → complex)
- Answer verification through manual solving
- XML format specification for evaluation frameworks
- Complexity distribution optimization (2-6-2 pattern)
- Independence and stability validation
- Real-world use case identification
---
## When to Use This Skill
**Use this skill when you need to:**
- Create evaluation suites for new MCP servers
- Validate MCP tool usability by AI agents
- Test complex multi-tool workflows
- Verify that agents can discover and use tools correctly
- Generate realistic questions based on actual data
- Ensure stable, verifiable answers
**Trigger Phrases:**
- "Create 10 evaluation questions for this MCP server"
- "Generate evaluation suite"
- "Test if agents can use these tools"
- "Verify MCP server with evaluations"
- "Create XML evaluation file"
**Don't use this skill for:**
- Unit testing (use validator-role-skill instead)
- Integration testing (different testing methodology)
- Manual QA testing (evaluations are for automated agent testing)
- API documentation (use scribe-role-skill)
---
## Prerequisites
### Knowledge Requirements
1. **MCP Protocol Understanding**
- Tool, resource, and prompt concepts
- Input schemas (Pydantic/Zod)
- Response format best practices
- Agent-centric design principles
2. **Evaluation Theory**
- Independence (no question dependencies)
- Read-only operations (non-destructive)
- Verifiability (string comparison)
- Stability (answer doesn't change over time)
- Complexity levels (simple, moderate, complex)
3. **Domain Knowledge**
- Understanding of target API/service
- Realistic use cases humans care about
- Data relationships and patterns
- Edge cases worth testing
### Environment Setup
```bash
# Ensure MCP server is running
npm run build
node dist/index.js &
# Or use evaluation harness (recommended)
# Harness manages server lifecycle automatically
```
### Project Context
- **Phase 4 of MCP Development**: Evaluations come after implementation (Phases 1-3)
- **MCP Server Running**: Must have working MCP server to explore data
- **Tool Documentation**: Understand what each tool does
- **Read-Only Access**: Evaluation questions must not modify data
---
## Workflow
### Phase 1: Tool Inspection and Understanding
#### 1.1 List All Available Tools
**Objective**: Understand the complete capability surface of the MCP server
```bash
# If using the MCP Inspector in CLI mode
npx @modelcontextprotocol/inspector --cli node dist/index.js --method tools/list
# Manual inspection via code
grep -r "@tool" src/mcp/tools/
```
**Document Each Tool:**
| Tool Name | Purpose | Input Parameters | Output | Complexity |
|-----------|---------|------------------|--------|------------|
| `list_miners` | Get all registered miners | `{ limit?, offset?, tenantId? }` | `{ miners: [...] }` | Simple |
| `get_miner_status` | Get detailed miner status | `{ minerId }` | `{ status, hashrate, temp }` | Simple |
| `update_firmware` | Update miner firmware | `{ minerId, version }` | `{ jobId, status }` | Complex |
| `get_fleet_summary` | Aggregated fleet metrics | `{ tenantId? }` | `{ total, online, hashrate }` | Moderate |
**Key Insights to Capture:**
- Which tools return lists vs single items?
- Which tools require IDs from other tools? (workflow chaining)
- Which tools have optional parameters?
- Which tools enable complex multi-step questions?
#### 1.2 Understand Tool Relationships
**Pattern: Map Tool Dependencies**
```
list_miners → get_miner_status (requires minerId from list)
↓
update_firmware (requires minerId)
↓
check_job_status (requires jobId from update)
```
**Workflow Chains to Test:**
1. **Discovery → Detail**: list_miners → get_miner_status
2. **Discovery → Action**: list_miners → update_firmware → check_job_status
3. **Aggregation → Filter**: get_fleet_summary → list_miners (with filters)
4. **Multi-Resource**: get_miner_status + get_pool_config + get_firmware_version
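
The dependency map above can also be kept as data, which makes it easy to list the chains that are safe to use in evaluation questions. The following is a minimal sketch: the `ToolSpec` shape, the `readOnly` flag, and the `requires`/`produces` fields are assumptions introduced for illustration, not part of any MCP SDK.

```typescript
// Hypothetical catalog shape; the entries mirror the example tools above.
interface ToolSpec {
  name: string;
  readOnly: boolean;   // false for tools that modify state
  requires: string[];  // IDs the tool needs as input
  produces: string[];  // IDs the tool's output can supply
}

const catalog: ToolSpec[] = [
  { name: "list_miners", readOnly: true, requires: [], produces: ["minerId"] },
  { name: "get_miner_status", readOnly: true, requires: ["minerId"], produces: [] },
  { name: "update_firmware", readOnly: false, requires: ["minerId"], produces: ["jobId"] },
  { name: "check_job_status", readOnly: true, requires: ["jobId"], produces: [] },
];

// Tools whose output supplies at least one ID that `tool` requires.
function upstreamOf(tool: ToolSpec, all: ToolSpec[]): ToolSpec[] {
  return all.filter(
    (t) => t.name !== tool.name && t.produces.some((id) => tool.requires.indexOf(id) !== -1)
  );
}

// Two-step chains in which both steps are read-only.
function readOnlyChains(all: ToolSpec[]): string[] {
  const chains: string[] = [];
  for (const tool of all) {
    if (!tool.readOnly) continue;
    for (const up of upstreamOf(tool, all)) {
      if (up.readOnly) chains.push(`${up.name} -> ${tool.name}`);
    }
  }
  return chains;
}

// Prints the single safe chain: list_miners -> get_miner_status
console.log(readOnlyChains(catalog));
```

`check_job_status` is read-only, but in this catalog it is only reachable through `update_firmware`, so no evaluation-safe chain leads to it.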
---
### Phase 2: Content Exploration (Read-Only)
#### 2.1 Use READ-ONLY Tools to Explore Data
**Critical Rule**: Never use destructive operations during exploration
**Exploration Strategy:**
```typescript
// Example: Explore miner fleet
const miners = await mcpServer.callTool("list_miners", { limit: 100 });
// Identify interesting miners: highest hashrate, highest temp, offline, etc.
const detailedStatus = await mcpServer.callTool("get_miner_status", {
  minerId: miners.miners[0].id
});
// Understand status structure: what fields exist? What values?
const fleetSummary = await mcpServer.callTool("get_fleet_summary", {});
// Understand aggregated metrics: total miners, online count, average hashrate
```
**Data Patterns to Identify:**
1. **Uniqueness**: Which fields uniquely identify entities?
- Example: `minerId`, `serialNumber`, `ipAddress`
2. **Relationships**: How do entities relate?
- Example: Miners → Pools, Miners → Firmware Versions
3. **Ranges**: What are typical value ranges?
- Example: Temperature (40-80°C), Hashrate (90-100 TH/s)
4. **Edge Cases**: Interesting outliers to test
- Example: Offline miners, miners with errors, miners updating firmware
5. **Aggregations**: What can be calculated?
- Example: Total hashrate, average temperature, count by status
#### 2.2 Document Data Characteristics
**Data Classification Matrix:**
| Data Type | Change Frequency | Uniqueness | Suitable for Evaluation? |
|-----------|------------------|------------|--------------------------|
| Miner ID | Never | Unique | ✅ Yes (stable reference) |
| Hashrate | Every 1-5s | Non-unique | ❌ No (too volatile) |
| Firmware version | Rarely | Non-unique | ✅ Yes (stable) |
| Temperature | Every 1-5s | Non-unique | ❌ No (too volatile) |
| Pool URL | Rarely | Non-unique | ✅ Yes (stable) |
| Error messages | Varies | Non-unique | ⚠️ Maybe (if persistent) |
**Stable vs Volatile Data:**
- **Stable**: Suitable for evaluation answers (firmware versions, pool URLs, miner counts)
- **Volatile**: Unsuitable (hashrate, temperature, current status)
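
One way to apply the classification matrix is to record each field's change frequency and derive its suitability from that. This is a rough sketch; the type names and the camelCase field names are assumptions made for the example.

```typescript
type ChangeFrequency = "never" | "rarely" | "varies" | "seconds";
type Suitability = "yes" | "maybe" | "no";

interface FieldProfile {
  field: string;
  changeFrequency: ChangeFrequency;
}

// Mapping taken from the matrix: stable data is usable,
// data that changes every few seconds is not.
const SUITABILITY: Record<ChangeFrequency, Suitability> = {
  never: "yes",
  rarely: "yes",
  varies: "maybe",
  seconds: "no",
};

function suitabilityOf(profile: FieldProfile): Suitability {
  return SUITABILITY[profile.changeFrequency];
}

// Fields that are safe to build evaluation answers on.
function stableFields(profiles: FieldProfile[]): string[] {
  return profiles.filter((p) => suitabilityOf(p) === "yes").map((p) => p.field);
}

const profiles: FieldProfile[] = [
  { field: "minerId", changeFrequency: "never" },
  { field: "hashrate", changeFrequency: "seconds" },
  { field: "firmwareVersion", changeFrequency: "rarely" },
  { field: "temperature", changeFrequency: "seconds" },
  { field: "poolUrl", changeFrequency: "rarely" },
  { field: "errorMessages", changeFrequency: "varies" },
];

// Prints minerId, firmwareVersion, and poolUrl
console.log(stableFields(profiles));
```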
---
### Phase 3: Question Generation
#### 3.1 Complexity Distribution (2-6-2 Pattern)
**Target Distribution for 10 Questions:**
- **2 Simple** (1-2 tool calls, straightforward lookup)
- **6 Moderate** (2-4 tool calls, some reasoning/filtering)
- **2 Complex** (4+ tool calls, deep exploration, multi-step workflows)
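
A draft suite can be checked against this split before any answers are verified. The sketch below assumes each question has already been tagged with a complexity level; the function names are illustrative.

```typescript
type Complexity = "simple" | "moderate" | "complex";

// Target counts for a 10-question suite, as listed above.
const TARGET: Record<Complexity, number> = { simple: 2, moderate: 6, complex: 2 };
const LEVELS: Complexity[] = ["simple", "moderate", "complex"];

function countByComplexity(levels: Complexity[]): Record<Complexity, number> {
  const counts: Record<Complexity, number> = { simple: 0, moderate: 0, complex: 0 };
  for (const level of levels) counts[level] += 1;
  return counts;
}

// One message per level whose count differs from the target; empty when the split matches.
function distributionGaps(levels: Complexity[]): string[] {
  const counts = countByComplexity(levels);
  return LEVELS.filter((level) => counts[level] !== TARGET[level]).map(
    (level) => `${level}: expected ${TARGET[level]}, found ${counts[level]}`
  );
}

const draft: Complexity[] = [
  "simple", "simple", "simple",
  "moderate", "moderate", "moderate", "moderate", "moderate",
  "complex", "complex",
];

// Reports one simple question too many and one moderate question too few.
console.log(distributionGaps(draft));
```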
#### 3.2 Simple Questions (Single Tool or Straightforward Workflow)
**Characteristics:**
- 1-2 tool calls
- Obvious solution path
- Direct lookup or simple filter
- Answer is immediate from tool output
**Examples:**
1. **Simple Discovery**
```xml
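<qa_pair>
  <question>How many miners are currently registered in the fleet?</question>
  <answer>127</answer>
</qa_pair>
```
2. **Simple Detail Lookup**
```xml
<qa_pair>
  <question>What firmware version is miner-abc-123 running?</question>
  <answer>2.5.1</answer>
</qa_pair>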
```
#### 3.3 Moderate Questions (Multi-Tool, Filtering, Reasoning)
**Characteristics:**
- 2-4 tool calls
- Requires filtering or sorting
- Some logic to combine results
- May need to identify "best" or "worst"
**Examples:**
1. **Find by Characteristic**
```xml
<qa_pair>
  <question>Which miner in the fleet has the highest hashrate? What is its IP address?</question>
  <answer>192.168.1.157</answer>
</qa_pair>
```
2. **Aggregation with Filter**
```xml
<qa_pair>
  <question>How many miners are currently offline in tenant 'prod-west'?</question>
  <answer>3</answer>
</qa_pair>
```
3. **Cross-Resource Query**
```xml
<qa_pair>
  <question>Which pool URL is configured for the miner with serial number SN-7891? Include the pool priority.</question>
  <answer>stratum+tcp://pool.example.com:3333 (priority: 0)</answer>
</qa_pair>
```
#### 3.4 Complex Questions (Deep Exploration, Multi-Step)
**Characteristics:**
- 4+ tool calls
- Requires exploring multiple layers
- Chained dependencies (output of one tool feeds next)
- Combines data from multiple sources
- May require finding relationships or patterns
**Examples:**
1. **Deep Workflow Exploration**
```xml
<qa_pair>
  <question>Find the miner with the oldest firmware version in the fleet. What is its pool URL?</question>
  <answer>stratum+tcp://old-pool.example.com:3333</answer>
</qa_pair>
```
2. **Multi-Condition Search**
```xml
<qa_pair>
  <question>Among miners running firmware 2.5.x, which one has been online the longest? What is its uptime in hours?</question>
  <answer>1847</answer>
</qa_pair>
```
3. **Pattern Discovery**
```xml
<qa_pair>
  <question>Which firmware version is most commonly deployed across all miners in the 'prod' tenant? How many miners use it?</question>
  <answer>2.5.1 (94 miners)</answer>
</qa_pair>
```
#### 3.5 Question Quality Checklist
For each generated question, verify:
- [ ] **Independent**: Doesn't depend on answers from other questions
- [ ] **Read-Only**: Only uses non-destructive tools
- [ ] **Verifiable**: Has single, clear answer (string comparison)
- [ ] **Stable**: Answer won't change over time (no volatile data)
- [ ] **Realistic**: Based on actual use case humans care about
- [ ] **Answerable**: Agent can solve with available tools
- [ ] **Clear**: Unambiguous what's being asked
- [ ] **Complete**: Includes all context needed
**Red Flags (Avoid These):**
- ❌ "What is the current temperature of miner-123?" (too volatile)
- ❌ "Update firmware and tell me the result" (destructive)
- ❌ "Solve question 3 first, then answer this" (dependent)
- ❌ "Approximately how many miners..." (vague, not verifiable)
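
A crude keyword check can catch many of these red flags before manual review. The sketch below is only a heuristic: the rule list and patterns are assumptions chosen for this example, and it will produce false positives (for instance, any question containing "currently"), so treat a hit as a prompt to look closer rather than a verdict.

```typescript
interface RedFlagRule {
  label: string;
  pattern: RegExp;
}

// Illustrative patterns only; extend them for your own domain.
const RULES: RedFlagRule[] = [
  { label: "volatile", pattern: /\b(current(ly)?|right now|temperature|hashrate)\b/i },
  { label: "destructive", pattern: /\b(update|delete|restart|reboot|reset)\b/i },
  { label: "dependent", pattern: /\b(question \d+|previous (question|answer))\b/i },
  { label: "vague", pattern: /\b(approximately|roughly|around)\b/i },
];

// Labels of every rule whose pattern matches the question text.
function findRedFlags(question: string): string[] {
  return RULES.filter((rule) => rule.pattern.test(question)).map((rule) => rule.label);
}

const candidates: string[] = [
  "What is the current temperature of miner-123?",
  "Update firmware and tell me the result",
  "Solve question 3 first, then answer this",
  "Approximately how many miners...",
  "What firmware version is miner-abc-123 running?",
];

for (const question of candidates) {
  const flags = findRedFlags(question);
  console.log(`${flags.length === 0 ? "ok" : flags.join(", ")}: ${question}`);
}
```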
---
### Phase 4: Answer Verification
#### 4.1 Manually Solve Each Question
**Critical Rule**: You must solve every question yourself to verify the answer
**Verification Process:**
```typescript
// For each question, document solving process:
// Question: "How many miners are in tenant 'prod-west'?"
// Step 1: Call list_miners
const miners = await mcpServer.callTool("list_miners", {
  tenantId: "prod-west"
});
// Result: { miners: [...], total: 47 }
// Step 2: Verify count
console.log(`Total miners: ${miners.total}`);
// Output: Total miners: 47
// Step 3: Document answer
// Answer: 47
// Step 4: Verify stability
// - Tenant membership rarely changes ✅
// - Answer won't be volatile ✅
// - Answer is deterministic ✅
```
#### 4.2 Answer Format Guidelines
**String Comparison Requirements:**
| Answer Type | Format | Example |
|-------------|--------|---------|
| Number | Plain number | `47` (not "47 miners") |
| String | Exact string | `prod-west` (not "Tenant: prod-west") |
| IP Address | Standard notation | `192.168.1.100` |
| URL | Full URL | `stratum+tcp://pool.example.com:3333` |
| Version | Semantic version | `2.5.1` (not "v2.5.1") |
| Boolean | `true` or `false` | `true` (lowercase) |
| List | Comma-separated | `miner-1,miner-2,miner-3` (no spaces) |
**Multiple-Part Answers:**
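If a question asks for multiple pieces of information, give them in a single answer, in the order asked, separated by a comma and a space:
```xml
<qa_pair>
  <question>What is the IP address and pool URL for miner-abc-123?</question>
  <answer>192.168.1.100, stratum+tcp://pool.example.com:3333</answer>
</qa_pair>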
```
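
Because answers are compared as strings, small formatting differences cause false failures. A light normalization step applied to both the expected and the actual answer can absorb the most common ones. This is a sketch under the format rules in the table above; `normalizeAnswer` and `answersMatch` are hypothetical helpers, not part of any evaluation framework.

```typescript
// Normalize an answer to the formats in the table: lowercase booleans,
// versions without a leading "v", and no whitespace around commas.
function normalizeAnswer(raw: string): string {
  const text = raw.trim();
  if (/^(true|false)$/i.test(text)) return text.toLowerCase();
  if (/^v\d+(\.\d+)+$/i.test(text)) return text.slice(1);
  return text.replace(/\s*,\s*/g, ",");
}

// Exact comparison after normalizing both sides.
function answersMatch(expected: string, actual: string): boolean {
  return normalizeAnswer(expected) === normalizeAnswer(actual);
}

console.log(normalizeAnswer(" v2.5.1 "));          // 2.5.1
console.log(normalizeAnswer("TRUE"));              // true
console.log(normalizeAnswer("miner-1, miner-2"));  // miner-1,miner-2
console.log(answersMatch("47", "47 miners"));      // false
```

The normalizer deliberately does not strip units or surrounding words, so an answer such as "47 miners" still fails against `47`.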
#### 4.3 Stability Verification
**Check Answer Stability:**
1. **Re-run verification** after 1 hour - answer should be same
2. **Identify dependencies** - what would cause answer to change?
3. **Avoid time-sensitive data** - current status, real-time metrics
4. **Use historical or configuration data** - firmware versions, pool URLs, miner IDs
**Stable vs Unstable Examples:**
| Question | Stability | Reason |
|----------|-----------|--------|
| "How many miners are registered?" | ✅ Stable | Rarely changes |
| "What is miner-123's hashrate?" | ❌ Unstable | Changes every second |
| "Which firmware version is on miner-abc?" | ✅ Stable | Only changes on update |
| "How many miners are currently online?" | ❌ Unstable | Changes frequently |
| "What pool URL is miner-xyz using?" | ✅ Stable | Configuration data |
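
The re-run check in step 1 amounts to comparing two sets of manually obtained answers taken some time apart. A minimal sketch, assuming answers are recorded per question ID (the `VerificationRun` shape, the `draft-hashrate` ID, and the second hashrate reading are made up for illustration):

```typescript
interface VerificationRun {
  takenAt: string;                  // ISO 8601 timestamp of the run
  answers: Record<string, string>;  // question ID -> answer obtained by manual solving
}

// IDs whose answer differs between the two runs; these questions are not stable.
function unstableQuestionIds(first: VerificationRun, second: VerificationRun): string[] {
  return Object.keys(first.answers).filter((id) => first.answers[id] !== second.answers[id]);
}

const firstRun: VerificationRun = {
  takenAt: "2025-12-29T10:00:00Z",
  answers: { "eval-001": "127", "eval-002": "2.5.1", "draft-hashrate": "95.7" },
};

const secondRun: VerificationRun = {
  takenAt: "2025-12-29T11:00:00Z",
  answers: { "eval-001": "127", "eval-002": "2.5.1", "draft-hashrate": "93.2" },
};

// Prints only the hashrate question, which should be dropped or rewritten.
console.log(unstableQuestionIds(firstRun, secondRun));
```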
---
### Phase 5: XML Output Generation
#### 5.1 XML Format Specification
**Complete Evaluation File Structure:**
```xml
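<!-- Element names follow the fields described in 5.2 and 5.3; align them with your harness's schema. -->
<evaluation>
  <metadata>
    <name>Braiins OS MCP Server Evaluation</name>
    <version>1.0</version>
    <created>2025-12-29</created>
    <author>Engineering Team</author>
    <description>Comprehensive evaluation suite testing agent usability of Braiins OS MCP server</description>
  </metadata>
  <qa_pairs>
    <qa_pair>
      <id>eval-001</id>
      <complexity>simple</complexity>
      <question>How many miners are currently registered in the fleet?</question>
      <answer>127</answer>
      <tools_required>list_miners</tools_required>
      <expected_tool_calls>1</expected_tool_calls>
    </qa_pair>
    <qa_pair>
      <id>eval-002</id>
      <complexity>simple</complexity>
      <question>What firmware version is miner-abc-123 running?</question>
      <answer>2.5.1</answer>
      <tools_required>get_miner_status</tools_required>
      <expected_tool_calls>1</expected_tool_calls>
    </qa_pair>
    <qa_pair>
      <id>eval-003</id>
      <complexity>moderate</complexity>
      <question>Which miner in the fleet has the highest hashrate? What is its IP address?</question>
      <answer>192.168.1.157</answer>
      <tools_required>list_miners, get_miner_status</tools_required>
      <expected_tool_calls>3-5</expected_tool_calls>
    </qa_pair>
    <!-- eval-004 through eval-008 omitted here -->
    <qa_pair>
      <id>eval-009</id>
      <complexity>complex</complexity>
      <question>Find the miner with the oldest firmware version in the fleet. What is its pool URL?</question>
      <answer>stratum+tcp://old-pool.example.com:3333</answer>
      <tools_required>list_miners, get_miner_status, get_pool_config</tools_required>
      <expected_tool_calls>5+</expected_tool_calls>
    </qa_pair>
    <qa_pair>
      <id>eval-010</id>
      <complexity>complex</complexity>
      <question>Which firmware version is most commonly deployed across all miners in the 'prod' tenant? How many miners use it?</question>
      <answer>2.5.1 (94 miners)</answer>
      <tools_required>list_miners, get_miner_status</tools_required>
      <expected_tool_calls>5+</expected_tool_calls>
    </qa_pair>
  </qa_pairs>
  <statistics>
    <total_questions>10</total_questions>
    <simple>2</simple>
    <moderate>6</moderate>
    <complex>2</complex>
    <tools_covered>4</tools_covered>
    <avg_tools_per_question>2.3</avg_tools_per_question>
  </statistics>
</evaluation>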
```
#### 5.2 Metadata Best Practices
- **Name**: Descriptive name of MCP server being evaluated
- **Version**: Evaluation suite version (bump when questions change)
- **Created**: ISO 8601 date (YYYY-MM-DD)
- **Author**: Team or individual who created evaluations
- **Description**: Brief explanation of what's being tested
#### 5.3 QA Pair Best Practices
**Required Fields:**
- **ID**: Unique identifier (eval-001, eval-002, ...)
- **Complexity**: simple | moderate | complex
- **Question**: Clear, unambiguous question text
- **Answer**: Verified answer (string comparison format)
**Optional but Recommended Fields:**
- **Tools**: Comma-separated tool names needed
- **Expected tool calls**: How many tool calls expected (for performance testing)
- **Rationale**: Why this question is valuable (internal documentation)
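
When the XML file is generated from structured data rather than written by hand, question and answer text must be escaped so that characters such as `&`, `<`, and quotes do not break the file. The following is a minimal sketch; the element names (`qa_pair`, `id`, `complexity`, `question`, `answer`) are assumptions for illustration and should be replaced with whatever your evaluation harness expects.

```typescript
interface QaPair {
  id: string;
  complexity: "simple" | "moderate" | "complex";
  question: string;
  answer: string;
}

// Escape the five characters that are special in XML text and attributes.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Serialize one QA pair using the assumed element names.
function toQaPairXml(pair: QaPair): string {
  return [
    "<qa_pair>",
    `  <id>${escapeXml(pair.id)}</id>`,
    `  <complexity>${pair.complexity}</complexity>`,
    `  <question>${escapeXml(pair.question)}</question>`,
    `  <answer>${escapeXml(pair.answer)}</answer>`,
    "</qa_pair>",
  ].join("\n");
}

const example: QaPair = {
  id: "eval-004",
  complexity: "moderate",
  question: "How many miners are in tenant 'prod-west'?",
  answer: "47",
};

console.log(toQaPairXml(example));
```

A parser on the harness side will decode the entities again, so the agent still sees the original question text.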
---
## Examples
### Example 1: Complete Evaluation Creation Process
**Target**: Braiins OS MCP Server with 4 tools
**Step 1: Tool Inspection**
```typescript
// Available tools:
// 1. list_miners({ limit?, offset?, tenantId? })
// 2. get_miner_status({ minerId })
// 3. get_fleet_summary({ tenantId? })
// 4. get_pool_config({ minerId })
```
**Step 2: Data Exploration**
```typescript
// Discover data patterns
const miners = await callTool("list_miners", { limit: 100 });
// Found: 127 miners total, IDs like "miner-abc-123"
const status = await callTool("get_miner_status", {
  minerId: miners.miners[0].id
});
// Found: firmware version (stable), hashrate (volatile), temperature (volatile)
const summary = await callTool("get_fleet_summary", {});
// Found: total count, online count, total hashrate
```
**Step 3: Generate 10 Questions**
```xml
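<qa_pair>
  <question>How many miners are registered?</question>
  <answer>127</answer>
</qa_pair>
<qa_pair>
  <question>What is miner-abc-123's firmware version?</question>
  <answer>2.5.1</answer>
</qa_pair>
<qa_pair>
  <question>How many miners in tenant 'prod-west' are online?</question>
  <answer>44</answer>
</qa_pair>
<qa_pair>
  <question>Which miner has the oldest firmware? What is its pool URL?</question>
  <answer>stratum+tcp://old-pool.example.com:3333</answer>
</qa_pair>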
```
**Step 4: Verify All Answers**
```typescript
// Manually solve each question and verify answer stability
// Document solving process for future reference
```
### Example 2: Question Evolution (Bad → Good)
**❌ Bad Question (Volatile Answer):**
```xml
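<qa_pair>
  <question>What is the current hashrate of miner-abc-123?</question>
  <answer>95.7</answer>
</qa_pair>
```
**✅ Good Question (Stable Answer):**
```xml
<qa_pair>
  <question>What firmware version is miner-abc-123 running?</question>
  <answer>2.5.1</answer>
</qa_pair>
```
**❌ Bad Question (Dependent):**
```xml
<qa_pair>
  <question>Using the miner ID from question 3, what is its temperature?</question>
</qa_pair>
```
**✅ Good Question (Independent):**
```xml
<qa_pair>
  <question>What is the pool URL for miner-abc-123?</question>
  <answer>stratum+tcp://pool.example.com:3333</answer>
</qa_pair>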
```
---
## Quality Standards
### Evaluation Quality Checklist
- [ ] **Coverage**
- [ ] Tests all major tools at least once
- [ ] Tests common workflows (list → detail)
- [ ] Tests edge cases (empty results, errors)
- [ ] Tests aggregation and filtering
- [ ] **Complexity Distribution**
- [ ] 2 simple questions (20%)
- [ ] 6 moderate questions (60%)
- [ ] 2 complex questions (20%)
- [ ] Total: 10 questions
- [ ] **Question Quality**
- [ ] All questions are independent
- [ ] All questions use read-only tools
- [ ] All questions have verifiable answers
- [ ] All questions have stable answers
- [ ] All questions are realistic use cases
- [ ] **Answer Quality**
- [ ] All answers manually verified
- [ ] All answers use string comparison format
- [ ] All answers are stable (re-verified after 1 hour)
- [ ] All answers are unambiguous
- [ ] **XML Format**
- [ ] Valid XML structure
- [ ] Metadata complete
- [ ] Statistics calculated
- [ ] Consistent formatting
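
The coverage item "tests all major tools at least once" is easy to check mechanically once each question lists the tools it is expected to use. A small sketch, assuming a hypothetical `QuestionPlan` record per question and the four read-only tools from Example 1:

```typescript
interface QuestionPlan {
  id: string;
  tools: string[]; // tools the question is expected to exercise
}

// Read-only tools that no question in the suite exercises.
function uncoveredTools(readOnlyTools: string[], plans: QuestionPlan[]): string[] {
  const used: string[] = [];
  for (const plan of plans) {
    for (const tool of plan.tools) {
      if (used.indexOf(tool) === -1) used.push(tool);
    }
  }
  return readOnlyTools.filter((tool) => used.indexOf(tool) === -1);
}

const readOnlyTools = ["list_miners", "get_miner_status", "get_fleet_summary", "get_pool_config"];

const plans: QuestionPlan[] = [
  { id: "eval-001", tools: ["list_miners"] },
  { id: "eval-002", tools: ["get_miner_status"] },
  { id: "eval-003", tools: ["list_miners", "get_miner_status"] },
];

// Prints get_fleet_summary and get_pool_config, which still need questions.
console.log(uncoveredTools(readOnlyTools, plans));
```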
### Performance Targets
**Agent Success Rates:**
- **Simple questions**: 95%+ success rate
- **Moderate questions**: 80%+ success rate
- **Complex questions**: 60%+ success rate
- **Overall**: 75%+ success rate
**Tool Call Efficiency:**
- **Simple**: 1-2 tool calls on average
- **Moderate**: 3-4 tool calls on average
- **Complex**: 5-7 tool calls on average
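
The success-rate targets can be checked directly against harness results. The sketch below assumes each result records the question's complexity and whether the agent's answer matched; the `QuestionResult` shape and helper names are illustrative.

```typescript
type Complexity = "simple" | "moderate" | "complex";

interface QuestionResult {
  id: string;
  complexity: Complexity;
  passed: boolean;
}

// Minimum success rates from the targets above.
const TARGETS: Record<Complexity, number> = { simple: 0.95, moderate: 0.8, complex: 0.6 };
const OVERALL_TARGET = 0.75;

function passRate(results: QuestionResult[]): number {
  if (results.length === 0) return 0;
  return results.filter((r) => r.passed).length / results.length;
}

// Levels (plus "overall") whose success rate falls below its target.
function missedTargets(results: QuestionResult[]): string[] {
  const missed: string[] = [];
  const levels: Complexity[] = ["simple", "moderate", "complex"];
  for (const level of levels) {
    const subset = results.filter((r) => r.complexity === level);
    if (subset.length > 0 && passRate(subset) < TARGETS[level]) missed.push(level);
  }
  if (results.length > 0 && passRate(results) < OVERALL_TARGET) missed.push("overall");
  return missed;
}

// Build sample results: `passed` passing questions followed by `failed` failing ones.
function makeResults(level: Complexity, passed: number, failed: number): QuestionResult[] {
  const out: QuestionResult[] = [];
  for (let i = 0; i < passed + failed; i++) {
    out.push({ id: `${level}-${i + 1}`, complexity: level, passed: i < passed });
  }
  return out;
}

const sample = makeResults("simple", 2, 0).concat(
  makeResults("moderate", 5, 1),
  makeResults("complex", 1, 1)
);

// Complex questions pass 1 of 2 (50%), below the 60% target; everything else meets its target.
console.log(missedTargets(sample));
```

With only 10 questions, a single failure moves a level's rate by many percentage points, so it helps to run the suite several times and look at the aggregate before drawing conclusions.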
---
## Common Pitfalls
### ❌ Pitfall 1: Volatile Data in Answers
**Problem**: Using real-time metrics that change constantly
```xml
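<qa_pair>
  <question>What is miner-123's current temperature?</question>
  <answer>65°C</answer>
</qa_pair>
```
**Solution**: Use stable configuration or historical data
```xml
<qa_pair>
  <question>What firmware version is miner-123 running?</question>
  <answer>2.5.1</answer>
</qa_pair>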
```
### ❌ Pitfall 2: Dependent Questions
**Problem**: Questions that rely on previous answers
```xml
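<qa_pair>
  <question>What is the pool URL for the miner from question 5?</question>
</qa_pair>
```
**Solution**: Make every question self-contained
```xml
<qa_pair>
  <question>What is the pool URL for miner-abc-123?</question>
  <answer>stratum+tcp://pool.example.com:3333</answer>
</qa_pair>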
```
### ❌ Pitfall 3: Ambiguous Answers
**Problem**: Multiple valid interpretations
```xml
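<qa_pair>
  <question>How many miners are offline?</question>
  <answer>3 miners are offline</answer>
</qa_pair>
```
**Solution**: Specify exact format in question or normalize answer
```xml
<qa_pair>
  <question>How many miners are offline?</question>
  <answer>3</answer>
</qa_pair>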
```
---
## Integration with Evaluation Harness
### Running Evaluations
**Evaluation Harness Setup:**
```bash
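# Create evaluation harness script
# Assumes MCPClient wraps an agent connected to the server and exposes ask(),
# and that ./eval-parser exports parseEvaluation; adapt both to your setup.
cat > run-evaluation.ts <<'EOF'
import { MCPClient } from '@modelcontextprotocol/client';
import { parseEvaluation } from './eval-parser';

async function runEvaluation(evalPath: string): Promise<void> {
  const client = new MCPClient('./dist/index.js');
  const evaluation = parseEvaluation(evalPath);
  let passed = 0;
  let failed = 0;

  for (const qa of evaluation.questions) {
    try {
      // Ask the agent and compare its answer to the verified one (exact match)
      const answer = await client.ask(qa.question);
      if (answer === qa.answer) {
        passed++;
        console.log(`✅ ${qa.id}: PASS`);
      } else {
        failed++;
        console.log(`❌ ${qa.id}: FAIL (expected: ${qa.answer}, got: ${answer})`);
      }
    } catch (error) {
      // A thrown error counts as a failed question
      failed++;
      const message = error instanceof Error ? error.message : String(error);
      console.log(`❌ ${qa.id}: ERROR - ${message}`);
    }
  }

  // Guard against an empty suite to avoid printing NaN%
  const total = passed + failed;
  const rate = total === 0 ? '0.0' : ((passed / total) * 100).toFixed(1);
  console.log(`\nResults: ${passed}/${total} passed (${rate}%)`);
}

runEvaluation('./evaluations/braiins-os.xml').catch((error) => {
  console.error(error);
  process.exit(1);
});
EOF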
```
**Usage:**
```bash
npm run build
# Assumes package.json defines an "evaluate" script that runs run-evaluation.ts
npm run evaluate
```
---
## References
- **MCP Evaluation Guide**: See mcp-builder-skill reference/evaluation.md
- **Question Generation Theory**: See mcp-builder-skill Phase 4
- **Agent-Centric Design**: MCP Best Practices (modelcontextprotocol.io)
- **Braiins OS API**: See braiins-os skill for domain knowledge
---
**Version History:**
- 1.0.0 (2025-12-29): Initial release - Question generation, answer verification, XML formatting