---
name: Ground Truth Management
description: Comprehensive guide to creating, managing, and maintaining ground truth datasets for AI evaluation, including annotation, quality control, and versioning
---

# Ground Truth Management

## What is Ground Truth?

**Definition:** Correct answers for evaluation - human-verified data that serves as the gold standard for measuring AI performance.

### Example

```
Question: "What is the capital of France?"
Ground Truth: "Paris"

AI Answer: "Paris" → Correct ✓
AI Answer: "Lyon" → Incorrect ✗
```

---

## Why Ground Truth Matters

### Measure Accuracy Objectively

```
Without ground truth: "This answer seems good" (subjective)
With ground truth: "Accuracy: 85%" (objective)
```

### Train and Validate Models

```
Training: Learn from ground truth examples
Validation: Measure performance on ground truth test set
```

### Regression Testing

```
Before change: Accuracy 90%
After change: Accuracy 85%
→ Regression detected!
```

### Benchmarking

```
Model A: 90% accuracy on ground truth
Model B: 85% accuracy on ground truth
→ Model A is better
```

---

## Types of Ground Truth

### Exact Match: Single Correct Answer

```json
{
  "question": "What is 2+2?",
  "answer": "4"
}
```

### Multiple Acceptable Answers

```json
{
  "question": "What is the capital of France?",
  "acceptable_answers": ["Paris", "paris", "PARIS", "The capital is Paris"]
}
```

### Rubric-Based: Quality Scale

```json
{
  "question": "Summarize this article",
  "rubric": {
    "1": "Poor summary, missing key points",
    "3": "Adequate summary, covers main points",
    "5": "Excellent summary, concise and comprehensive"
  }
}
```

### Human Preference: Comparison Rankings

```json
{
  "question": "Which answer is better?",
  "answer_a": "Paris is the capital of France.",
  "answer_b": "The capital of France is Paris, a city of 2.1 million people.",
  "preference": "B",
  "reasoning": "More informative"
}
```

---

## Creating Ground Truth

### Manual Annotation (Humans Label)

```
Process:
1. Collect examples (questions, documents, images)
2. Human annotators label each
3. Quality control (review annotations)
4. Store in dataset
```

### Expert Review (For Specialized Domains)

```
Medical: Doctors annotate
Legal: Lawyers annotate
Technical: Engineers annotate

Higher quality but more expensive
```

### Crowdsourcing (Amazon MTurk)

```
Pros:
- Fast (many workers)
- Cheap ($0.10-1.00 per annotation)

Cons:
- Variable quality
- Need quality control
```

### Synthetic Generation (For Some Tasks)

```
LLM-generated questions + answers
Careful validation needed
Good for scale, risky for quality
Use for augmentation, not sole source
```

---

## Ground Truth Dataset Structure

### Input (Question, Document, Image)

```json
{
  "input": {
    "type": "question",
    "text": "What is the capital of France?"
  }
}
```

### Expected Output (Answer, Label, Summary)

```json
{
  "expected_output": {
    "type": "answer",
    "text": "Paris",
    "acceptable_variants": ["paris", "PARIS"]
  }
}
```

### Metadata (Difficulty, Category, Source)

```json
{
  "metadata": {
    "difficulty": "easy",
    "category": "geography",
    "source": "wikipedia",
    "language": "en"
  }
}
```

### Annotation Info (Who, When, Confidence)

```json
{
  "annotation": {
    "annotator_id": "annotator_123",
    "timestamp": "2024-01-15T10:00:00Z",
    "confidence": 0.95,
    "time_spent_seconds": 30
  }
}
```

**Complete Example:**

```json
{
  "id": "example_001",
  "input": {
    "type": "question",
    "text": "What is the capital of France?"
  },
  "expected_output": {
    "type": "answer",
    "text": "Paris",
    "acceptable_variants": ["paris", "PARIS", "The capital is Paris"]
  },
  "metadata": {
    "difficulty": "easy",
    "category": "geography",
    "source": "wikipedia",
    "language": "en"
  },
  "annotation": {
    "annotator_id": "annotator_123",
    "timestamp": "2024-01-15T10:00:00Z",
    "confidence": 0.95
  }
}
```
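Before a record like this enters the dataset, it helps to check it against the schema programmatically. A minimal sketch, assuming records are plain dicts with the fields shown above (the `validate_record` helper and its specific checks are illustrative, not part of any library):

```python
REQUIRED_TOP_LEVEL = ["id", "input", "expected_output", "metadata", "annotation"]

def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one ground truth record (empty = valid)."""
    problems = []
    for field in REQUIRED_TOP_LEVEL:
        if field not in record:
            problems.append(f"missing field: {field}")
    # Spot-check a few nested fields from the schema above
    if "input" in record and not record["input"].get("text"):
        problems.append("input.text is empty")
    if "expected_output" in record and not record["expected_output"].get("text"):
        problems.append("expected_output.text is empty")
    confidence = record.get("annotation", {}).get("confidence")
    if confidence is not None and not 0.0 <= confidence <= 1.0:
        problems.append("annotation.confidence outside [0, 1]")
    return problems
```

Running a check like this over every line of a JSONL file before committing catches missing fields and out-of-range confidence values early.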
}, "expected_output": { "type": "answer", "text": "Paris", "acceptable_variants": ["paris", "PARIS", "The capital is Paris"] }, "metadata": { "difficulty": "easy", "category": "geography", "source": "wikipedia", "language": "en" }, "annotation": { "annotator_id": "annotator_123", "timestamp": "2024-01-15T10:00:00Z", "confidence": 0.95 } } ``` --- ## Annotation Guidelines ### Clear Instructions ```markdown # Annotation Guidelines ## Task Label whether the answer is correct. ## Instructions 1. Read the question carefully 2. Read the answer 3. Determine if answer is factually correct 4. Mark as "Correct" or "Incorrect" ## Examples Question: "What is 2+2?" Answer: "4" Label: Correct Question: "What is 2+2?" Answer: "5" Label: Incorrect ``` ### Examples (Good and Bad) ```markdown ## Good Example Question: "What is the capital of France?" Answer: "Paris" Label: Correct Reasoning: Factually accurate and directly answers question ## Bad Example Question: "What is the capital of France?" Answer: "France is a country in Europe" Label: Incorrect Reasoning: Doesn't answer the question ``` ### Edge Case Handling ```markdown ## Edge Cases ### Partially Correct Question: "What are the capitals of France and Germany?" Answer: "Paris" Label: Partially Correct (missing Germany) ### Ambiguous Question Question: "What is the best programming language?" Label: N/A - Subjective question, no single correct answer ### No Answer in Context Question: "What is the population of Paris?" Context: "Paris is the capital of France." Label: "Cannot be determined from context" ``` ### Consistency Checks ```markdown ## Consistency Rules 1. Same question → Same answer 2. Synonyms are acceptable ("car" = "automobile") 3. Case-insensitive ("Paris" = "paris") 4. Extra details are OK ("Paris" vs "Paris, France") ``` --- ## Quality Control ### Multiple Annotators Per Example ``` Each example labeled by 3 annotators Majority vote determines final label Catches individual annotator errors ``` ### Inter-Annotator Agreement (IAA) ``` Measure: Do annotators agree? 
---

## Quality Control

### Multiple Annotators Per Example

```
Each example labeled by 3 annotators
Majority vote determines final label
Catches individual annotator errors
```

### Inter-Annotator Agreement (IAA)

```
Measure: Do annotators agree?
Metric: Cohen's Kappa (κ)
Target: κ > 0.7 (good agreement)
```

### Gold Standard Subset (Known Answers)

```
10% of examples have known correct labels
Mix into annotation tasks
Measure annotator accuracy on gold standard
Remove low-quality annotators
```
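To act on the gold standard subset, score each annotator against the known labels and flag anyone below a threshold. A minimal sketch, assuming annotations arrive as `(annotator_id, example_id, label)` tuples and `gold_labels` maps example ids to known answers (all names are illustrative):

```python
from collections import defaultdict

def annotator_accuracy(annotations, gold_labels):
    """annotations: iterable of (annotator_id, example_id, label) tuples.
    gold_labels: dict mapping example_id -> known correct label.
    Returns each annotator's accuracy on the gold subset."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for annotator_id, example_id, label in annotations:
        if example_id in gold_labels:  # only score items with known answers
            total[annotator_id] += 1
            correct[annotator_id] += int(label == gold_labels[example_id])
    return {a: correct[a] / total[a] for a in total}

# Flag annotators below 80% accuracy on gold items for retraining or removal
scores = annotator_accuracy(
    [("ann_1", "ex_1", "Paris"), ("ann_1", "ex_2", "5"), ("ann_2", "ex_1", "Paris")],
    {"ex_1": "Paris", "ex_2": "4"},
)
low_quality = [a for a, acc in scores.items() if acc < 0.8]
```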
### Spot Checks by Experts

```
Expert reviews 10% of annotations
Validates quality
Identifies systematic errors
```

---

## Inter-Annotator Agreement

### Kappa Score (Cohen's κ)

```python
from sklearn.metrics import cohen_kappa_score

annotator1 = [1, 0, 1, 1, 0]  # Labels from annotator 1
annotator2 = [1, 0, 1, 0, 0]  # Labels from annotator 2

kappa = cohen_kappa_score(annotator1, annotator2)
print(f"Kappa: {kappa:.2f}")

# Interpretation:
# κ < 0.4: Poor agreement
# κ 0.4-0.6: Moderate agreement
# κ 0.6-0.8: Good agreement
# κ > 0.8: Excellent agreement
```

### Fleiss' κ (Multiple Annotators)

```python
from statsmodels.stats.inter_rater import fleiss_kappa

# 3 annotators, 5 examples
# Each row: [count_label_0, count_label_1]
data = [
    [0, 3],  # Example 1: All 3 annotators chose label 1
    [1, 2],  # Example 2: 1 chose 0, 2 chose 1
    [3, 0],  # Example 3: All 3 chose label 0
    [2, 1],  # Example 4: 2 chose 0, 1 chose 1
    [0, 3],  # Example 5: All 3 chose label 1
]

kappa = fleiss_kappa(data)
print(f"Fleiss' Kappa: {kappa:.2f}")
```

### Percentage Agreement

```python
def percentage_agreement(annotator1, annotator2):
    agreements = sum(a == b for a, b in zip(annotator1, annotator2))
    total = len(annotator1)
    return agreements / total

agreement = percentage_agreement(annotator1, annotator2)
print(f"Agreement: {agreement:.1%}")
```

### Target: >0.7 (Good Agreement)

```
If κ < 0.7:
1. Review annotation guidelines (unclear?)
2. Provide more examples
3. Train annotators
4. Simplify task
```

---

## Resolving Disagreements

### Majority Vote

```python
from collections import Counter

def majority_vote(labels):
    counts = Counter(labels)
    majority_label = counts.most_common(1)[0][0]
    return majority_label

# 3 annotators
labels = [1, 1, 0]  # Two say 1, one says 0
final_label = majority_vote(labels)  # 1
```

### Expert Adjudication

```
If no majority (e.g., 1, 0, 2):
→ Expert reviews and decides
```

### Discussion and Consensus

```
Annotators discuss disagreement
Reach consensus
Update guidelines if needed
```

### Update Guidelines

```
If systematic disagreements:
→ Guidelines unclear
→ Update and re-annotate
```

---

## Ground Truth for Different Tasks

### Classification: Category Labels

```json
{
  "text": "This product is amazing!",
  "label": "positive"
}
```

### Q&A: Correct Answers + Acceptable Variants

```json
{
  "question": "What is the capital of France?",
  "answer": "Paris",
  "acceptable_variants": ["paris", "PARIS", "The capital is Paris"]
}
```

### Summarization: Reference Summaries

```json
{
  "document": "Long article text...",
  "reference_summary": "Concise summary of key points"
}
```

### RAG: Question + Context + Answer

```json
{
  "question": "What is the capital of France?",
  "context": "Paris is the capital and largest city of France.",
  "answer": "Paris",
  "relevant_chunks": ["Paris is the capital and largest city of France."]
}
```

### Generation: Multiple Acceptable Outputs

```json
{
  "prompt": "Write a haiku about spring",
  "acceptable_outputs": [
    "Cherry blossoms bloom\nGentle breeze carries petals\nSpring has arrived now",
    "Flowers start to bloom\nBirds sing in the morning light\nSpring is here at last"
  ]
}
```

---

## Dataset Size

### Evaluation Set: 100-1000 Examples (Representative)

```
Purpose: Quick evaluation during development
Size: 100-1000 examples
Quality: High (manually curated)
Coverage: Representative of production
```

### Test Set: 500-5000 Examples (Comprehensive)

```
Purpose: Final evaluation before deployment
Size: 500-5000 examples
Quality: High (gold standard)
Coverage: Comprehensive (all categories, edge cases)
```

### Quality > Quantity

```
Better: 100 high-quality examples
Worse: 1000 low-quality examples
```

### Cover Edge Cases

```
Include:
- Common cases (80%)
- Edge cases (15%)
- Adversarial cases (5%)
```

---

## Dataset Maintenance

### Version Control (Like Code)

```bash
# Git for dataset versioning
git init
git add dataset.jsonl
git commit -m "Initial dataset v1.0"

# Tag versions
git tag v1.0

# Update dataset
git add dataset.jsonl
git commit -m "Added 100 new examples"
git tag v1.1
```

### Regular Updates (New Examples)

```
Monthly: Add 50-100 new examples
Quarterly: Major update (500+ examples)
```

### Remove Outdated Examples

```
Examples that are:
- No longer relevant
- Incorrect (facts changed)
- Duplicates
```

### Track Changes (Changelog)

```markdown
# Dataset Changelog

## v1.2 (2024-02-01)
- Added 100 new examples (geography category)
- Removed 20 outdated examples
- Fixed 5 incorrect labels

## v1.1 (2024-01-01)
- Added 50 new examples (science category)
- Updated annotation guidelines

## v1.0 (2023-12-01)
- Initial release (500 examples)
```

---

## Stratified Sampling

### Balance by Difficulty

```
Easy: 40%
Medium: 40%
Hard: 20%
```

### Balance by Category

```
Geography: 25%
Science: 25%
History: 25%
Math: 25%
```

### Include Edge Cases

```
Common cases: 80%
Edge cases: 15%
Adversarial: 5%
```

### Representative of Production

```
Sample from actual production queries
Ensures dataset matches real usage
```
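These targets can be enforced when drawing an evaluation set from a larger pool. A minimal sketch, assuming each example carries the `metadata.category` field from the dataset structure above and the caller supplies per-category quotas (function and variable names are illustrative):

```python
import random
from collections import defaultdict

def stratified_sample(examples, quotas, seed=42):
    """examples: list of records with metadata.category.
    quotas: dict mapping category -> number of examples to draw.
    Returns a sample matching the requested per-category counts."""
    by_category = defaultdict(list)
    for example in examples:
        by_category[example["metadata"]["category"]].append(example)

    rng = random.Random(seed)  # fixed seed so the eval set is reproducible
    sample = []
    for category, count in quotas.items():
        pool = by_category[category]
        if len(pool) < count:
            raise ValueError(f"only {len(pool)} examples available for {category!r}")
        sample.extend(rng.sample(pool, count))
    return sample

# 25% per category for a 100-example evaluation set:
# eval_set = stratified_sample(all_examples, {"geography": 25, "science": 25, "history": 25, "math": 25})
```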
""" response = llm.generate(prompt) qa_pairs = parse_qa_pairs(response) return qa_pairs ``` ### Careful Validation Needed ``` LLM-generated data can have: - Hallucinations - Incorrect facts - Biased questions → Always validate with humans ``` ### Good for Scale, Risky for Quality ``` Pros: Can generate 1000s quickly Cons: Quality varies, needs validation ``` ### Use for Augmentation, Not Sole Source ``` Strategy: - 80% human-annotated (high quality) - 20% synthetic (validated) ``` --- ## Domain-Specific Ground Truth ### Medical: Expert Annotations ``` Annotators: Licensed doctors Cost: $50-100 per hour Quality: Very high Use case: Medical diagnosis, treatment recommendations ``` ### Legal: Lawyer Review ``` Annotators: Licensed lawyers Cost: $100-300 per hour Quality: Very high Use case: Legal document analysis, case law ``` ### Technical: Engineer Verification ``` Annotators: Senior engineers Cost: $50-150 per hour Quality: High Use case: Code review, technical Q&A ``` --- ## Ground Truth Storage ### JSON/JSONL Files ```jsonl {"id": "1", "question": "What is 2+2?", "answer": "4"} {"id": "2", "question": "Capital of France?", "answer": "Paris"} ``` ### Database (PostgreSQL, MongoDB) ```sql CREATE TABLE ground_truth ( id UUID PRIMARY KEY, question TEXT NOT NULL, answer TEXT NOT NULL, category VARCHAR(50), difficulty VARCHAR(20), created_at TIMESTAMP DEFAULT NOW() ); ``` ### Version Control (Git) ```bash git add dataset/ git commit -m "Update ground truth dataset" git push ``` ### Cloud Storage (S3 + Versioning) ```bash # Upload to S3 with versioning aws s3 cp dataset.jsonl s3://my-bucket/ground-truth/v1.0/dataset.jsonl aws s3api put-bucket-versioning --bucket my-bucket --versioning-configuration Status=Enabled ``` --- ## Ground Truth for RAG **Structure:** ```json { "question": "What is the capital of France?", "expected_answer": "Paris", "relevant_document_chunks": [ "Paris is the capital and largest city of France." 
], "evaluation_criteria": { "faithfulness": "Answer must be grounded in context", "relevance": "Answer must directly address question", "completeness": "Answer should mention Paris" } } ``` --- ## Evaluation with Ground Truth ### Exact Match Accuracy ```python def exact_match(predicted, ground_truth): return predicted.strip().lower() == ground_truth.strip().lower() accuracy = sum(exact_match(p, gt) for p, gt in zip(predicted, ground_truth)) / len(predicted) ``` ### F1 Score (For Overlapping Spans) ```python def f1_score(predicted, ground_truth): pred_tokens = set(predicted.lower().split()) gt_tokens = set(ground_truth.lower().split()) common = pred_tokens & gt_tokens if len(pred_tokens) == 0 or len(gt_tokens) == 0: return 0 precision = len(common) / len(pred_tokens) recall = len(common) / len(gt_tokens) if precision + recall == 0: return 0 f1 = 2 * (precision * recall) / (precision + recall) return f1 ``` ### BLEU/ROUGE (For Generation) ```python from nltk.translate.bleu_score import sentence_bleu reference = [["Paris", "is", "the", "capital"]] candidate = ["Paris", "is", "the", "capital"] bleu = sentence_bleu(reference, candidate) ``` ### Semantic Similarity (Embedding Distance) ```python from sentence_transformers import SentenceTransformer from sklearn.metrics.pairwise import cosine_similarity model = SentenceTransformer('all-MiniLM-L6-v2') emb1 = model.encode("Paris is the capital of France") emb2 = model.encode("The capital of France is Paris") similarity = cosine_similarity([emb1], [emb2])[0][0] ``` --- ## Continuous Ground Truth ### Production Feedback (User Thumbs Up/Down) ```python # Log user feedback feedback = { "question": "What is the capital of France?", "answer": "Paris", "user_feedback": "thumbs_up", "timestamp": "2024-01-15T10:00:00Z" } # Add to ground truth if positive if feedback["user_feedback"] == "thumbs_up": add_to_ground_truth(feedback["question"], feedback["answer"]) ``` ### Human Review of Flagged Outputs ``` User flags answer as incorrect → Human reviews → If incorrect, add correct answer to ground truth → If correct, keep as is ``` ### Incrementally Add to Dataset ``` Monthly: Review 100 flagged examples Add 50 to ground truth Update dataset version ``` --- ## Tools ### Annotation: Label Studio, Prodigy, CVAT **Label Studio:** ```bash pip install label-studio label-studio start # Open http://localhost:8080 ``` **Prodigy:** ```bash pip install prodigy prodigy textcat.manual dataset_name source.jsonl --label positive,negative ``` ### Management: DVC (Data Version Control) ```bash pip install dvc dvc init dvc add dataset.jsonl git add dataset.jsonl.dvc .gitignore git commit -m "Add dataset" dvc push ``` ### Storage: S3, GCS, Local Files See "Ground Truth Storage" section --- ## Summary **Ground Truth:** Correct answers for evaluation **Why:** - Measure accuracy objectively - Train/validate models - Regression testing - Benchmarking **Types:** - Exact match - Multiple acceptable answers - Rubric-based - Human preference **Creating:** - Manual annotation - Expert review - Crowdsourcing - Synthetic (with validation) **Quality Control:** - Multiple annotators - Inter-annotator agreement (κ > 0.7) - Gold standard subset - Expert spot checks **Dataset Size:** - Eval: 100-1000 (representative) - Test: 500-5000 (comprehensive) - Quality > quantity **Maintenance:** - Version control (Git) - Regular updates - Remove outdated - Changelog **Tools:** - Annotation: Label Studio, Prodigy - Management: DVC - Storage: S3, GCS, Git