---
name: add-golden
description: Curate and add documents to the golden dataset with multi-agent validation. Use when adding test data, creating golden datasets, saving examples.
context: fork
version: 2.0.0
author: OrchestKit
tags: [curation, golden-dataset, evaluation, testing, quality-scoring, bias-detection]
user-invocable: true
allowedTools: [Read, Write, Edit, Grep, Glob, Task, TaskCreate, TaskUpdate, mcp__memory__search_nodes]
skills: [golden-dataset-validation, llm-evaluation, test-data-management]
---

# Add to Golden Dataset

Multi-agent curation workflow with quality score explanations, bias detection, and version tracking.

## Quick Start

```bash
/add-golden https://example.com/article
/add-golden https://arxiv.org/abs/2312.xxxxx
```

---

## Task Management (CC 2.1.16)

```python
# Create main curation task ({url} is a placeholder for the document URL)
TaskCreate(
    subject="Add to golden dataset: {url}",
    description="Multi-agent curation with quality explanation",
    activeForm="Curating document"
)

# Create subtasks for the 9-phase process as (subject, activeForm) pairs
phases = [
    ("Fetch content", "Fetching content"),
    ("Run quality analysis", "Running quality analysis"),
    ("Explain scores", "Explaining scores"),
    ("Check bias", "Checking bias"),
    ("Check diversity", "Checking diversity"),
    ("Validate", "Validating"),
    ("Get approval", "Getting approval"),
    ("Write to dataset", "Writing to dataset"),
    ("Update version", "Updating version"),
]
for subject, active_form in phases:
    TaskCreate(subject=subject, activeForm=active_form)
```

---

## Workflow Overview

| Phase | Activities | Output |
|-------|------------|--------|
| **1. Input Collection** | Get URL, detect content type | Document metadata |
| **2. Fetch and Extract** | Parse document structure | Structured content |
| **3. Quality Analysis** | 4 parallel agents evaluate | Raw scores |
| **4. Quality Explanation** | Explain WHY each score | Score rationale |
| **5. Bias Detection** | Check for bias in content | Bias report |
| **6. Diversity Check** | Assess dataset balance | Diversity metrics |
| **7. Validation** | Schema, duplicates, gates | Validation status |
| **8. Silver-to-Gold** | Promote or mark as silver | Classification |
| **9. Version Tracking** | Track changes, rollback | Version entry |

---

## Phase 1-2: Input and Extraction

Detect content type: article, tutorial, documentation, research_paper.

Extract: title, sections, code blocks, key terms, metadata (author, date).

---

## Phase 3: Parallel Quality Analysis (4 Agents)

Launch ALL agents in ONE message with `run_in_background=True`.

| Agent | Focus | Output |
|-------|-------|--------|
| code-quality-reviewer | Accuracy, coherence, depth, relevance | Quality scores |
| workflow-architect | Keyword directness, paraphrase, reasoning | Difficulty level |
| data-pipeline-engineer | Primary/secondary domains, skill level | Tags |
| test-generator | Direct, paraphrased, multi-hop queries | Test queries |

See [Quality Scoring](references/quality-scoring.md) for detailed criteria.

---

## Phase 4: Quality Explanation

Each dimension gets a WHY explanation:

```markdown
### Accuracy: [N.NN]/1.0
**Why this score:**
- [Specific reason with evidence]

**What would improve it:**
- [Specific improvement]
```

---

## Phase 5: Bias Detection

See [Bias Detection Guide](references/bias-detection-guide.md) for patterns.

Check for:
- Technology bias (favors specific tools)
- Recency bias (ignores LTS versions)
- Complexity bias (assumed knowledge)
- Vendor bias (promotes products)
- Geographic/cultural bias

| Bias Score | Action |
|------------|--------|
| 0-2 | Proceed normally |
| 3-5 | Add disclaimer |
| 6-8 | Require user review |
| 9-10 | Recommend against |

---

## Phase 6: Diversity Dashboard

Track dataset balance across:
- Domain distribution (AI/ML, Backend, Frontend, DevOps, Security)
- Difficulty distribution (trivial, easy, medium, hard, adversarial)

**Impact assessment:** Does the new document improve or worsen diversity?
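
One way to make the impact assessment concrete is to compare a balance metric before and after the candidate document is added. The sketch below is an illustration only: normalized Shannon entropy is an assumed choice of metric (this skill does not prescribe one), and the names `balance`, `diversity_impact`, and `DOMAINS`, along with the sample counts, are hypothetical.

```python
import math
from collections import Counter

# Hypothetical category list, taken from the Phase 6 domain distribution.
DOMAINS = ["AI/ML", "Backend", "Frontend", "DevOps", "Security"]


def balance(counts: Counter, categories: list[str]) -> float:
    """Normalized Shannon entropy: 1.0 is perfectly even, 0.0 is a single category."""
    total = sum(counts[c] for c in categories)
    if total == 0 or len(categories) < 2:
        return 0.0
    entropy = 0.0
    for c in categories:
        p = counts[c] / total
        if p > 0:
            entropy -= p * math.log(p)
    return entropy / math.log(len(categories))


def diversity_impact(counts: Counter, new_domain: str,
                     categories: list[str] = DOMAINS) -> float:
    """Change in balance if one document in new_domain were added (positive = improves)."""
    after = counts.copy()
    after[new_domain] += 1
    return balance(after, categories) - balance(counts, categories)


# Illustrative counts only, not real dataset statistics.
current = Counter({"AI/ML": 40, "Backend": 25, "Frontend": 20, "DevOps": 10, "Security": 5})
print(f"Add Security doc: {diversity_impact(current, 'Security'):+.4f}")
print(f"Add AI/ML doc:    {diversity_impact(current, 'AI/ML'):+.4f}")
```

A positive result means the document moves the dataset toward an even spread; the same function can be reused for the difficulty distribution by passing a different category list.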

---

## Phase 7: Validation

- URL validation (no placeholders)
- Schema validation (required fields)
- Duplicate check (>80% similarity)
- Quality gates (min sections, content length)

---

## Phase 8: Silver-to-Gold Workflow

See [Silver-Gold Promotion](references/silver-gold-promotion.md) for criteria.

| Status | Criteria | Action |
|--------|----------|--------|
| **GOLD** | Score >= 0.75, no bias | Add to main dataset |
| **SILVER** | Score 0.55-0.74 | Add to silver, track |
| **REJECT** | Score < 0.55 | Do not add |

**Promotion criteria:** 7+ days in silver, quality >= 0.75, no negative feedback.

---

## Phase 9: Version Tracking

```json
{
  "version": "1.2.3",
  "change_type": "ADD|UPDATE|REMOVE|PROMOTE",
  "document_id": "doc-123",
  "quality_score": 0.82,
  "rollback_available": true
}
```

| Update Type | Version Bump |
|-------------|--------------|
| Add/Update document | Patch (0.0.X) |
| Remove document | Minor (0.X.0) |
| Schema change | Major (X.0.0) |

---

## Quality Scoring

| Dimension | Weight |
|-----------|--------|
| Accuracy | 0.25 |
| Coherence | 0.20 |
| Depth | 0.25 |
| Relevance | 0.30 |

**Formula:** `quality_score = accuracy*0.25 + coherence*0.20 + depth*0.25 + relevance*0.30`

---

## Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Score explanation | Required | Transparency, actionable feedback |
| Bias detection | Dedicated agent | Prevent dataset contamination |
| Two-tier system | Silver + Gold | Allow docs time to mature |
| Version tracking | Semantic versioning | Clear history, safe rollbacks |

---

## Related Skills

- `golden-dataset-validation` - Validate existing datasets
- `llm-evaluation` - LLM output evaluation patterns
- `test-data-management` - Test data strategies

---

**Version:** 2.0.0 (January 2026)
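
As a worked example of the Quality Scoring formula and the Phase 8 thresholds described above, the following sketch computes a weighted score and maps it to a status. The function names are hypothetical, and reading "no bias" as a bias score of 0-2 is an assumption; the source does not say how the bias score feeds the GOLD criterion.

```python
# Weights copied from the Quality Scoring table.
WEIGHTS = {"accuracy": 0.25, "coherence": 0.20, "depth": 0.25, "relevance": 0.30}


def quality_score(scores: dict[str, float]) -> float:
    """Weighted sum of the four dimension scores, each expected in [0, 1]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(scores[dim] * weight for dim, weight in WEIGHTS.items())


def classify(score: float, bias_score: int) -> str:
    """Map a quality score to GOLD / SILVER / REJECT using the Phase 8 thresholds.

    Assumption: "no bias" means a bias score of 0-2 ("Proceed normally").
    A document scoring >= 0.75 with a higher bias score is held in silver.
    """
    if score < 0.55:
        return "REJECT"
    if score >= 0.75 and bias_score <= 2:
        return "GOLD"
    return "SILVER"


# Illustrative dimension scores, not taken from a real document.
example = {"accuracy": 0.90, "coherence": 0.80, "depth": 0.70, "relevance": 0.85}
score = quality_score(example)
print(f"{score:.2f} -> {classify(score, bias_score=1)}")
```

Keeping the weights in one dictionary means the formula and the table cannot drift apart silently, and a missing dimension fails loudly instead of scoring as zero.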