---
name: golden-dataset-curation
description: Use when creating or improving golden datasets for AI evaluation. Defines quality criteria, curation workflows, and multi-agent analysis patterns for test data.
context: fork
agent: data-pipeline-engineer
version: 1.0.0
author: OrchestKit AI Agent Hub
tags: [golden-dataset, curation, quality, multi-agent, langfuse, 2025]
user-invocable: false
---

# Golden Dataset Curation

**Curate high-quality documents for the golden dataset with multi-agent validation**

## Overview

This skill provides patterns and workflows for **adding new documents** to the golden dataset with thorough quality analysis. It complements `golden-dataset-management` which handles backup/restore.

**When to use this skill:**
- Adding new documents to the golden dataset
- Classifying content types and difficulty levels
- Generating test queries for new documents
- Running multi-agent quality analysis

---

## Content Types

| Type | Description | Quality Focus |
|------|-------------|---------------|
| `article` | Technical articles, blog posts | Depth, accuracy, actionability |
| `tutorial` | Step-by-step guides | Completeness, clarity, code quality |
| `research_paper` | Academic papers, whitepapers | Rigor, citations, methodology |
| `documentation` | API docs, reference materials | Accuracy, completeness, examples |
| `video_transcript` | Transcribed video content | Structure, coherence, key points |
| `code_repository` | README, code analysis | Code quality, documentation |

---

## Difficulty Levels

| Level | Semantic Complexity | Expected Score | Characteristics |
|-------|---------------------|----------------|-----------------|
| **trivial** | Direct keyword match | >0.85 | Technical terms, exact phrases |
| **easy** | Common synonyms | >0.70 | Well-known concepts, slight variations |
| **medium** | Paraphrased intent | >0.55 | Conceptual queries, multi-topic |
| **hard** | Multi-hop reasoning | >0.40 | Cross-domain, comparative analysis |
| **adversarial** | Edge cases | Graceful degradation | Robustness tests, off-domain |

---

## Quality Dimensions

| Dimension | Weight | Perfect | Acceptable | Failing |
|-----------|--------|---------|------------|---------|
| **Accuracy** | 0.25 | 0.95-1.0 | 0.70-0.94 | <0.70 |
| **Coherence** | 0.20 | 0.90-1.0 | 0.60-0.89 | <0.60 |
| **Depth** | 0.25 | 0.90-1.0 | 0.55-0.89 | <0.55 |
| **Relevance** | 0.30 | 0.95-1.0 | 0.70-0.94 | <0.70 |

**Evaluation focuses:**
- **Accuracy:** Technical correctness, code validity, up-to-date info
- **Coherence:** Logical structure, clear flow, consistent terminology
- **Depth:** Comprehensive coverage, edge cases, appropriate detail
- **Relevance:** Alignment with AI/ML, backend, frontend, DevOps domains

---

## Multi-Agent Pipeline

```
INPUT: URL/Content
         |
         v
+------------------+
|   FETCH AGENT    |  Extract structure, detect type
+--------+---------+
         |
         v
+-----------------------------------------------+
|           PARALLEL ANALYSIS AGENTS            |
|  Quality  |  Difficulty  |  Domain  | Query Gen |
+-----------------------------------------------+
         |
         v
+------------------+
|    CONSENSUS     |  Weighted score + confidence
|    AGGREGATOR    |  -> include/review/exclude
+--------+---------+
         |
         v
+------------------+
|  USER APPROVAL   |  Show scores, confirm
+--------+---------+
         |
         v
OUTPUT: Curated document entry
```

### Decision Thresholds

| Quality Score | Confidence | Decision |
|---------------|------------|----------|
| >= 0.75 | >= 0.70 | **include** |
| >= 0.55 | any | **review** |
| < 0.55 | any | **exclude** |

---

## Quality Thresholds

```yaml
# Recommended thresholds for golden dataset inclusion
minimum_quality_score: 0.70
minimum_confidence: 0.65
required_tags: 2      # At least 2 domain tags
required_queries: 3   # At least 3 test queries
```

---

## Coverage Balance Guidelines

Maintain balanced coverage across:
- **Content types:** Don't over-index on articles
- **Difficulty levels:** Need trivial AND hard queries
- **Domains:** Spread across AI/ML, backend, frontend, etc.

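
The balance guidelines above can be checked mechanically before a new document is added. The following is a minimal sketch, assuming each curated entry is a dict with `content_type` and `difficulty` keys. The `coverage_report` helper and its `max_share` cutoff of 0.5 are hypothetical illustrations, not part of this skill.

```python
from collections import Counter

# Difficulty levels defined in this skill.
DIFFICULTY_LEVELS = ["trivial", "easy", "medium", "hard", "adversarial"]


def coverage_report(entries, field, expected=None, max_share=0.5):
    """Summarize how entries are distributed over one field.

    `max_share` is an illustrative cutoff (assumption), not a documented value.
    """
    counts = Counter(entry[field] for entry in entries)
    total = sum(counts.values())
    # Categories that hold more than `max_share` of all entries.
    over_indexed = sorted(
        name for name, n in counts.items() if total and n / total > max_share
    )
    # Expected categories that have no entries at all.
    missing = [name for name in (expected or []) if counts[name] == 0]
    return {"counts": dict(counts), "over_indexed": over_indexed, "missing": missing}


# Hypothetical sample entries for illustration only.
entries = [
    {"content_type": "article", "difficulty": "easy"},
    {"content_type": "article", "difficulty": "trivial"},
    {"content_type": "article", "difficulty": "medium"},
    {"content_type": "tutorial", "difficulty": "easy"},
]

type_report = coverage_report(entries, "content_type")
difficulty_report = coverage_report(entries, "difficulty", expected=DIFFICULTY_LEVELS)
print(type_report["over_indexed"])
print(difficulty_report["missing"])
```

In this sample, articles make up three of four entries and are flagged as over-indexed, while the `hard` and `adversarial` levels are reported as missing. A real check would likely tune the cutoff per field.
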
### Duplicate Prevention Checklist

Before adding:
1. Check URL against existing `source_url_map.json`
2. Run semantic similarity against existing document embeddings
3. Warn if >80% similar to existing document

### Provenance Tracking

Always record:
- Source URL (canonical)
- Curation date
- Agent scores (for audit trail)
- Langfuse trace ID

---

## Langfuse Integration

### Trace Structure

```python
from langfuse import Langfuse

# Assumes the Langfuse Python SDK v2 client API (`trace`, `score`, `event`);
# credentials are read from the LANGFUSE_* environment variables.
langfuse = Langfuse()

# `url` and `doc_id` identify the document being curated.
trace = langfuse.trace(
    name="golden-dataset-curation",
    metadata={"source_url": url, "document_id": doc_id},
)

# Log individual dimension scores
trace.score(name="accuracy", value=0.85)
trace.score(name="coherence", value=0.90)
trace.score(name="depth", value=0.78)
trace.score(name="relevance", value=0.92)

# Final aggregated score (weighted sum of the four dimensions, rounded)
trace.score(name="quality_total", value=0.86)
trace.event(name="curation_decision", metadata={"decision": "include"})
```

### Managed Prompts

| Prompt Name | Purpose |
|-------------|---------|
| `golden-content-classifier` | Classify content_type |
| `golden-difficulty-classifier` | Assign difficulty |
| `golden-domain-tagger` | Extract tags |
| `golden-query-generator` | Generate test queries |

---

## References

For detailed implementation patterns, see:

- `references/selection-criteria.md` - Content type classification, difficulty stratification, quality evaluation dimensions, and best practices
- `references/annotation-patterns.md` - Multi-agent pipeline architecture, agent specifications, consensus aggregation logic, and Langfuse integration

---

## Related Skills

- `golden-dataset-management` - Backup/restore operations
- `golden-dataset-validation` - Validation rules and checks
- `langfuse-observability` - Tracing patterns
- `pgvector-search` - Duplicate detection

---

**Version:** 1.0.0 (December 2025)
**Issue:** #599

## Capability Details

### content-classification

**Keywords:** content type, classification, document type, golden dataset

**Solves:**
- Classify document content types for golden dataset
- Categorize entries by domain and purpose
- Identify content requiring special handling

### difficulty-stratification

**Keywords:** difficulty, stratification, complexity level, challenge rating

**Solves:**
- Assign difficulty levels to golden dataset entries
- Ensure balanced difficulty distribution
- Identify edge cases and challenging examples

### quality-evaluation

**Keywords:** quality, evaluation, quality dimensions, quality criteria

**Solves:**
- Evaluate entry quality against defined criteria
- Score entries on multiple quality dimensions
- Identify entries needing improvement

### multi-agent-analysis

**Keywords:** multi-agent, parallel analysis, consensus, agent evaluation

**Solves:**
- Run parallel agent evaluations on entries
- Aggregate consensus from multiple analysts
- Resolve disagreements in classifications