---
name: Eval Gap Finder
description: Find AILANG vs Python eval gaps and improve prompts/language. Use when user says 'find eval gaps', 'analyze benchmark failures', 'close Python-AILANG gap', or after running evals.
---

# Eval Gap Finder

Automates the process of finding and closing the gap between Python and AILANG benchmark success rates. Identifies language limitations, prompt gaps, and missing stdlib functions.

## Quick Start

**Most common usage:**

```bash
# User says: "Find eval gaps" or "Analyze benchmark failures"
# This skill will:
# 1. Run evals with dev models (gemini-3-flash, claude-haiku-4-5)
# 2. Compare Python vs AILANG success rates
# 3. Identify benchmarks where Python passes but AILANG fails
# 4. Analyze error patterns and categorize them
# 5. Check if gaps are documented in prompt
# 6. Test proposed examples and add to prompt
# 7. Create design docs for language limitations
```

## When to Use This Skill

Invoke this skill when:

- User asks to "find eval gaps" or "close the Python-AILANG gap"
- User wants to analyze benchmark failures
- After running evals and seeing a lower AILANG success rate
- User says "why is AILANG failing?" or "improve AILANG benchmarks"
- User wants to identify language limitations

## Available Scripts

### `scripts/run_gap_analysis.sh [eval_dir]`

Run the full gap analysis on eval results.

```bash
.claude/skills/eval-gap-finder/scripts/run_gap_analysis.sh eval_results/v0.6.5
```

### `scripts/identify_python_only.sh <eval_dir>`

List benchmarks where Python passes but AILANG fails.

```bash
.claude/skills/eval-gap-finder/scripts/identify_python_only.sh eval_results/v0.6.5
```

### `scripts/categorize_errors.sh <eval_dir>`

Categorize AILANG failures by error type.

```bash
.claude/skills/eval-gap-finder/scripts/categorize_errors.sh eval_results/v0.6.5
```

### `scripts/test_example.sh <file.ail>`

Test whether an AILANG code example compiles and runs correctly.

```bash
.claude/skills/eval-gap-finder/scripts/test_example.sh /tmp/test.ail
```

## Workflow

### 1. Run Evals with Dev Models

```bash
ailang eval-suite --models gemini-3-flash,claude-haiku-4-5 --output eval_results/gap-analysis
```

Strategy: run with cheap/fast models first. If they succeed, larger models should too.

### 2. Generate Summary and Identify Gaps

```bash
ailang eval-summary eval_results/gap-analysis
.claude/skills/eval-gap-finder/scripts/identify_python_only.sh eval_results/gap-analysis
```

Key metrics:

- AILANG success rate (target: >70%)
- Python success rate (baseline)
- Gap (python_only benchmarks)

### 3. Analyze Error Patterns

For each Python-only pass, categorize the error (a rough triage sketch follows the table):

| Category | Pattern | Fix Approach |
|----------|---------|--------------|
| WRONG_LANG | Model wrote Python syntax | Stronger "NOT Python" in prompt |
| PAR_001 | Parse errors (syntax) | Add more examples to prompt |
| Type errors | Type unification failures | May be language limitation |
| Logic errors | Compiles but wrong output | Better examples or algorithm |
| EOF errors | Incomplete code generation | Model limitation, not prompt |
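If `categorize_errors.sh` is unavailable, a rough manual pass can approximate the triage. This is a minimal sketch, not the script's actual logic: it assumes each result is a per-benchmark JSON file with an `error` field holding the compiler/runtime output (the glob, field name, and layout are assumptions about the eval output, not documented behavior):

```bash
#!/usr/bin/env bash
# Rough manual triage of AILANG failures by error signature.
# ASSUMPTION: eval results are per-benchmark JSON files with an "error"
# field; adjust the glob and jq filter to match the real eval layout.
EVAL_DIR="${1:-eval_results/gap-analysis}"

for f in "$EVAL_DIR"/*.json; do
  err=$(jq -r '.error // empty' "$f" 2>/dev/null)
  case "$err" in
    "")                          ;;                        # passed, or no error recorded
    *"def "*|*"import "*)        echo "WRONG_LANG  $f" ;;  # Python syntax leaked in
    *PAR_001*|*PAR_UNEXPECTED*)  echo "PARSE       $f" ;;
    *TC_*)                       echo "TYPECHECK   $f" ;;
    *EOF*)                       echo "EOF         $f" ;;
    *)                           echo "OTHER       $f" ;;
  esac
done | sort
```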
### 4. Check Prompt Coverage

For each gap, check whether the pattern is documented:

```bash
grep -n "pattern" prompts/v0.6.5.md
```

If not documented, add:

- A working example to the Quick Reference section
- An entry in the "What AILANG Does NOT Have" table (if a limitation)
- A new section if the pattern is complex

### 5. Test Examples Before Adding

**CRITICAL**: Always test examples before adding them to the prompt!

```bash
cat > /tmp/test.ail << 'EOF'
module benchmark/solution

-- Your example code here
EOF
ailang run --caps IO --entry main /tmp/test.ail
```

If the example fails, it reveals a language gap: create a design doc instead.

### 6. Create Design Docs for Language Gaps

If testing reveals a language limitation:

1. Create a design doc: `design_docs/planned/vX_Y_Z/m-<topic>.md`
2. Document:
   - Minimal reproduction
   - Error message
   - Workaround (for prompt)
   - Proposed fix
3. Add the workaround to the prompt with a note

### 7. Track Improvement

After updates, re-run the evals:

```bash
ailang eval-suite --models gemini-3-flash,claude-haiku-4-5 --output eval_results/gap-analysis-v2
```

Compare:

- Success rate improvement
- Which benchmarks were fixed
- Any regressions

## Error Categories Reference

| Error | Meaning | Fix |
|-------|---------|-----|
| WRONG_LANG | Wrote Python instead | Prompt emphasis |
| PAR_001 | Parser error | Syntax examples |
| PAR_UNEXPECTED_TOKEN | Wrong token | Syntax examples |
| TC_* | Type check error | Type examples or design doc |
| "undefined variable" | Missing import/letrec | Document pattern |
| EOF errors | Incomplete code | Model limitation |
| logic_error | Wrong output | Algorithm examples |

## Resources

### Gap Analysis Template

See [`resources/gap_analysis_template.md`](resources/gap_analysis_template.md) for the structured analysis format.

### Common Patterns

See [`resources/common_patterns.md`](resources/common_patterns.md) for frequently encountered gaps.

## Progressive Disclosure

This skill loads information progressively:

1. **Always loaded**: This SKILL.md file (workflow overview)
2. **Execute as needed**: Scripts in the `scripts/` directory
3. **Load on demand**: Resources for templates and patterns

## Notes

- Always test examples before adding them to the prompt
- Prefer fixing the language over prompt workarounds
- Track improvements with before/after eval runs
- Create design docs for language limitations
- Update the prompt hash in versions.json after changes (see the sketch below)
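For that last note, here is a minimal sketch of refreshing the prompt hash. It assumes versions.json maps a prompt version to a SHA-256 digest under a `hash` key; the schema, key names, and version string here are assumptions, not a documented ailang format, so check the real file before using it:

```bash
#!/usr/bin/env bash
# Recompute the prompt hash and write it back into versions.json.
# ASSUMPTION: versions.json looks like {"v0.6.5": {"hash": "..."}};
# verify the real schema first. On macOS, use `shasum -a 256` instead.
PROMPT=prompts/v0.6.5.md
HASH=$(sha256sum "$PROMPT" | cut -d' ' -f1)
jq --arg h "$HASH" '."v0.6.5".hash = $h' versions.json > versions.json.tmp \
  && mv versions.json.tmp versions.json
echo "updated prompt hash: $HASH"
```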