---
name: prediction-tracking
description: Track and evaluate AI predictions over time to assess accuracy. Use when reviewing past predictions to determine if they came true, failed, or remain uncertain.
---

# Prediction Tracking Skill

Track predictions made by AI researchers and critics, and evaluate their accuracy over time.

## Prediction Recording

When recording a new prediction, capture:

### Required Fields

- **text**: The prediction as stated
- **author**: Who made it
- **madeAt**: When it was made
- **timeframe**: When they expect it to happen
- **topic**: What area of AI it concerns
- **confidence**: How confident they seemed

### Optional Fields

- **sourceUrl**: Where the prediction was made
- **targetDate**: Specific date, if mentioned
- **conditions**: Any caveats or conditions
- **metrics**: How to measure success
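For example, a recorded prediction might look like the following. This is a minimal sketch: the author, URL, and all values are invented for illustration, and the flat JSON shape mirrors the output formats below rather than a mandated storage schema.

```json
{
  "text": "Frontier models will reliably pass the bar exam within two years",
  "author": "Example Researcher",
  "madeAt": "2023-01-15",
  "timeframe": "within two years",
  "topic": "reasoning",
  "confidence": "high",
  "sourceUrl": "https://example.com/interview",
  "targetDate": "2025-01-15",
  "conditions": "Assumes no major regulatory pause on model releases",
  "metrics": "Pass rate at or above the median human test-taker on a full simulated exam"
}
```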
## Evaluation Status

When evaluating predictions, assign one of:

### `verified`

Clearly came true as stated.

- The predicted capability/event occurred
- Within the stated timeframe
- Substantially as described

### `falsified`

Clearly did not come true.

- Timeframe passed without occurrence
- Contradictory evidence emerged
- Author retracted or modified the claim

### `partially-verified`

Partially accurate.

- Some aspects came true, others didn't
- Capability exists but is weaker than claimed
- Timeframe was off but direction was correct

### `too-early`

Not enough time has passed.

- Still within the stated timeframe
- No definitive evidence either way

### `unfalsifiable`

Cannot be objectively assessed, even in principle.

- Too vague to measure
- No clear success criteria
- Goalposts have moved

### `ambiguous`

The wording admits multiple readings, so no single verdict can be assigned.

- Multiple interpretations possible
- Success criteria unclear

## Evaluation Process

For each prediction being evaluated:

### 1. Restate the prediction

What exactly was claimed?

### 2. Identify the timeframe

Has enough time passed to evaluate?

### 3. Gather evidence

What has happened since?

- Relevant releases or announcements
- Benchmark results
- Real-world deployments
- Counter-evidence

### 4. Assess status

Which evaluation status applies?

### 5. Score accuracy

If verifiable, rate 0.0-1.0:

- 1.0: Exactly as predicted
- 0.7-0.9: Substantially correct
- 0.4-0.6: Partially correct
- 0.1-0.3: Mostly wrong
- 0.0: Completely wrong

### 6. Note lessons

What does this tell us about:

- The author's forecasting ability
- The topic's predictability
- Common prediction pitfalls

## Output Format

For evaluation:

```json
{
  "evaluations": [
    {
      "predictionId": "id",
      "status": "verified",
      "accuracyScore": 0.85,
      "evidence": "Description of evidence",
      "notes": "Additional context",
      "evaluatedAt": "timestamp"
    }
  ]
}
```

For accuracy statistics:

```json
{
  "author": "Author name",
  "totalPredictions": 15,
  "verified": 5,
  "falsified": 3,
  "partiallyVerified": 2,
  "pending": 4,
  "unfalsifiable": 1,
  "averageAccuracy": 0.62,
  "topicBreakdown": {
    "reasoning": { "predictions": 5, "accuracy": 0.7 },
    "agents": { "predictions": 3, "accuracy": 0.4 }
  },
  "calibration": "Assessment of how well-calibrated they are"
}
```

## Calibration Assessment

Evaluate whether predictors are well-calibrated:

### Well-Calibrated

- High-confidence predictions usually come true
- Low-confidence predictions have mixed results
- Acknowledges uncertainty appropriately

### Overconfident

- High-confidence predictions often fail
- Rarely expresses uncertainty
- Doesn't update on evidence

### Underconfident

- Low-confidence predictions often come true
- Hedges even on likely outcomes
- Too conservative

### Inconsistent

- Confidence doesn't correlate with accuracy
- Random relationship between stated confidence and actual accuracy

## Tracking Notable Predictors

Keep running assessments of key voices:

| Predictor | Total | Accuracy | Calibration | Notes |
|-----------|-------|----------|-------------|-------|
| Sam Altman | 20 | 55% | Overconfident | Timeline optimism |
| Gary Marcus | 15 | 70% | Well-calibrated | Conservative |
| Dario Amodei | 12 | 65% | Slightly overconfident | Safety-focused |

## Red Flags

Watch for prediction patterns that suggest bias:

- Always bullish regardless of topic
- Never acknowledges failed predictions
- Moves goalposts when wrong
- Predictions align suspiciously with financial interests
- Vague enough to claim credit for anything
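## Worked Example

As a sketch of the full loop, evaluating the sample prediction recorded above might produce an entry like this. The `predictionId` and evidence text are invented for the example, and the score lands in the 0.4-0.6 "partially correct" band to match the `partially-verified` status.

```json
{
  "evaluations": [
    {
      "predictionId": "pred-001",
      "status": "partially-verified",
      "accuracyScore": 0.6,
      "evidence": "Models passed simulated bar exams within the window, but 'reliably' across repeated sittings was not demonstrated",
      "notes": "Direction correct; the 'reliably' qualifier keeps this short of full verification",
      "evaluatedAt": "2025-02-01"
    }
  ]
}
```

Assuming `averageAccuracy` in the statistics format is the mean `accuracyScore` over scoreable (non-pending, non-unfalsifiable) predictions, this 0.6 entry would then be folded into the author's running totals and topic breakdown.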