--- name: la-bench-workflow description: Orchestrate the complete LA-Bench experimental procedure generation workflow from JSONL input to validated output. This skill should be used when processing LA-Bench format experimental data to generate and validate detailed experimental procedures. It coordinates parsing, reference fetching, procedure generation, and quality validation with 10-point scoring. --- # LA-Bench Workflow ## Overview Orchestrate the complete LA-Bench experimental procedure generation and validation pipeline. This skill coordinates four specialized components: 1. **la-bench-parser**: Extract experimental data from JSONL 2. **web-reference-fetcher**: Fetch reference materials 3. **procedure-generator**: Generate detailed experimental procedures 4. **procedure-checker**: Validate and score generated procedures (10-point scale) All intermediate and final outputs are saved to standardized directory structure: `workdir/{filename}_{entry_id}/` ## When to Use This Skill Use this skill when: - Processing LA-Bench JSONL files (public_test.jsonl, private_test_input.jsonl, etc.) - Generating experimental procedures from LA-Bench format data - Validating generated procedures against LA-Bench quality standards - Batch processing multiple LA-Bench entries - Testing or evaluating procedure generation capabilities ## Workflow ### Step 1: Parse LA-Bench Entry Extract experimental data from JSONL file using la-bench-parser skill. **Input:** - JSONL file path (e.g., `data/public_test.jsonl`) - Entry ID (e.g., `public_test_1`) **Execution:** ```bash # Extract filename without extension filename=$(basename "data/public_test.jsonl" .jsonl) # "public_test" # Create working directory mkdir -p workdir/${filename}_${entry_id} # Parse entry python .claude/skills/la-bench-parser/scripts/parse_labench.py \ data/public_test.jsonl ${entry_id} \ > workdir/${filename}_${entry_id}/input.json ``` **Output:** - `workdir/{filename}_{entry_id}/input.json` **Content:** - instruction - mandatory_objects - source_protocol_steps - expected_final_states - references - measurement.specific_criteria (if available) ### Step 2: Fetch Reference Materials (Optional) Fetch reference URLs using web-reference-fetcher skill if references exist in input.json. **Input:** - Reference URLs from input.json **Execution:** ```bash mkdir -p workdir/${filename}_${entry_id}/references # For each reference URL (ref_1, ref_2, ...) python3 .claude/skills/web-reference-fetcher/scripts/fetch_url.py \ "${reference_url}" \ --output workdir/${filename}_${entry_id}/references/ref_${i}.md ``` **Output:** - `workdir/{filename}_{entry_id}/references/ref_1.md` - `workdir/{filename}_{entry_id}/references/ref_2.md` - ... (one file per reference) **Note:** If references cannot be fetched or URLs are inaccessible, proceed to Step 3 without reference materials. ### Step 3: Generate Experimental Procedure Generate detailed procedure using procedure-generator skill. **Input:** - `workdir/{filename}_{entry_id}/input.json` - `workdir/{filename}_{entry_id}/references/ref_*.md` (if available) **Process:** 1. Read input.json 2. Read all reference files (if exist) 3. Invoke experiment-procedure-generator subagent with detailed prompt 4. Save generated procedure 5. Verify output (≤50 steps, ≤10 sentences/step) **Output:** - `workdir/{filename}_{entry_id}/procedure.json` **Format:** ```json [ {"id": 1, "text": "First step with quantitative details..."}, {"id": 2, "text": "Second step with quantitative details..."}, ... ] ``` ### Step 4: Validate, Score, and Iteratively Improve Procedure Validate procedure using procedure-checker skill with 10-point scoring system. **Automatically improve procedure until score ≥ 8/10 or maximum iterations reached.** **Target Quality**: Score ≥ 8/10 points (Good or Excellent) **Input:** - `workdir/{filename}_{entry_id}/input.json` - `workdir/{filename}_{entry_id}/procedure.json` **Iterative Improvement Process:** #### Iteration Loop (Max 3 attempts) For each iteration (attempt 1, 2, 3): 1. **Formal Validation (Script-based)** ```bash # Create wrapped version python3 -c " import json with open('workdir/${filename}_${entry_id}/procedure.json', 'r') as f: steps = json.load(f) wrapped = {'procedure_steps': steps} with open('workdir/${filename}_${entry_id}/procedure_wrapped.json', 'w') as f: json.dump(wrapped, f, ensure_ascii=False, indent=2) " # Run validation python .claude/skills/procedure-checker/scripts/validate_procedure.py \ workdir/${filename}_${entry_id}/procedure_wrapped.json ``` 2. **Semantic Evaluation (Subagent-based)** - Invoke procedure-semantic-checker subagent - Evaluate with 10-point scoring system: - Common Criteria (5 points) - Individual Criteria (5 points) - Generate comprehensive review 3. **Combine Results** - Merge formal validation + semantic evaluation - Save review to `workdir/{filename}_{entry_id}/review_v{iteration}.md` 4. **Check Score** - Extract total score from review - If **score ≥ 8**: ✅ **Success** → Proceed to Step 5 - If **score < 8** AND **iteration < 3**: 🔄 **Improve** → Continue to sub-step 5 - If **score < 8** AND **iteration = 3**: ⚠️ **Max iterations reached** → Proceed to Step 5 with warning 5. **Generate Improvement Instructions (if score < 8)** Extract specific issues from review and create targeted improvement prompt: **Analysis:** - Read `review_v{iteration}.md` - Extract "Critical Issues" section - Extract "Must-Fix Recommendations" - Identify specific scoring deficiencies: - Which common criteria failed? (0-5 points) - Which individual criteria failed? (0-5 points) - What deductions were applied? - Was excessive safety penalty triggered? **Improvement Prompt Template:** ``` The previous procedure (v{iteration}) scored {score}/10 points. Regenerate the procedure addressing these specific issues: ## Failed Common Criteria: {list of failed criteria with specific examples from review} ## Failed Individual Criteria: {list of failed criteria with specific examples} ## Critical Issues to Fix: {numbered list from review} ## Must-Fix Recommendations: {numbered list from review} ## Specific Corrections Required: {detailed corrections extracted from review} Maintain all strengths from the previous version while addressing these issues. ``` 6. **Regenerate Procedure** - Invoke procedure-generator skill again - Pass improvement instructions to experiment-procedure-generator subagent - Save to `workdir/{filename}_{entry_id}/procedure.json` (overwrite) - Archive previous version to `workdir/{filename}_{entry_id}/procedure_v{iteration}.json` 7. **Loop Back to Step 4.1** (next iteration) #### Outputs **Per Iteration:** - `workdir/{filename}_{entry_id}/procedure_v{N}.json` (archived versions) - `workdir/{filename}_{entry_id}/review_v{N}.md` (archived reviews) **Final:** - `workdir/{filename}_{entry_id}/procedure.json` (best version) - `workdir/{filename}_{entry_id}/review.md` (final review, symlink or copy of best) - `workdir/{filename}_{entry_id}/procedure_wrapped.json` (temporary) **Review Format:** - Scoring Results (10-point breakdown) - Formal Validation Results - Semantic Quality Assessment - Identified Strengths - Issues and Recommendations - Overall Assessment - Conclusion - Iteration History (if multiple attempts) ### Step 5: Report Results Report completion with comprehensive summary including iteration history: ``` Successfully processed: {entry_id} Directory: workdir/{filename}_{entry_id}/ ├── input.json ✅ ├── references/ │ ├── ref_1.md ✅ (if applicable) │ └── ref_2.md ✅ (if applicable) ├── procedure.json ✅ (final version) ├── procedure_v1.json ✅ (if improved) ├── procedure_v2.json ✅ (if improved) ├── review.md ✅ (final review) ├── review_v1.md ✅ (if improved) └── review_v2.md ✅ (if improved) Iteration History: ┌─────────────┬────────┬──────────┬──────────────┬────────────┐ │ Iteration │ Score │ Common │ Individual │ Status │ ├─────────────┼────────┼──────────┼──────────────┼────────────┤ │ v1 (Initial)│ 6/10 │ 4/5 │ 2/5 │ 🔄 Improve │ │ v2 │ 7/10 │ 5/5 │ 2/5 │ 🔄 Improve │ │ v3 (Final) │ 8/10 │ 5/5 │ 3/5 │ ✅ Accept │ └─────────────┴────────┴──────────┴──────────────┴────────────┘ Final Procedure Quality: - Total Score: 8/10 points (⬆️ +2 from initial) - Common Criteria: 5/5 points (✅ Perfect) - Individual Criteria: 3/5 points (Improved from 2/5) - Rating: ⭐⭐⭐⭐ (Good) - Recommendation: ✅ ACCEPT - Iterations Required: 3 attempts Improvements Made: 1. v1→v2: Fixed calculation errors, improved parameter reflection (+1pt) 2. v2→v3: Enhanced individual criteria compliance (+1pt) Next Steps: Procedure is ready for use. Review review.md for detailed feedback. ``` **Success Scenarios:** 1. **Immediate Success (Score ≥ 8 on first attempt)** ``` Iteration History: ✅ Single attempt (v1: 9/10) Status: Excellent quality achieved immediately ``` 2. **Improvement Success (Score ≥ 8 after iterations)** ``` Iteration History: 🔄 3 attempts (v1: 6→v2: 7→v3: 8) Status: Target quality achieved through iterative improvement ``` 3. **Partial Success (Score < 8 after max iterations)** ``` Iteration History: ⚠️ 3 attempts (v1: 5→v2: 6→v3: 7) Status: Max iterations reached. Best score: 7/10 (Needs manual review) Warning: Target quality (8/10) not achieved. Manual refinement recommended. ``` ## Batch Processing To process multiple entries from the same JSONL file: **Example:** ```bash # Process public_test_1, public_test_2, public_test_3 for entry_id in public_test_1 public_test_2 public_test_3; do echo "Processing ${entry_id}..." # Run la-bench-workflow skill for each entry done ``` **Directory Structure:** ``` workdir/ ├── public_test_public_test_1/ │ ├── input.json │ ├── references/ │ ├── procedure.json │ └── review.md ├── public_test_public_test_2/ │ └── ... └── public_test_public_test_3/ └── ... ``` ## Error Handling ### Common Issues **Issue**: JSONL file not found - **Fix**: Verify file path and ensure it exists **Issue**: Entry ID not found in JSONL - **Fix**: Check entry ID spelling and verify it exists in the file **Issue**: Reference fetch fails - **Fix**: Proceed without references or use cached references if available **Issue**: Procedure generation fails formal validation - **Fix**: Review errors and regenerate with adjusted parameters **Issue**: Low quality score (<5/10) - **Fix**: Review semantic evaluation feedback and regenerate with improvements ## Quality Standards ### LA-Bench 10-Point Scoring System **Common Criteria (5 points):** 1. Parameter Reflection (+1pt): All instruction parameters correctly reflected 2. Object Completeness (+1pt): All mandatory objects used correctly 3. Logical Structure Reflection (+1pt): Source protocol structure maintained 4. Expected Outcome Achievement (+1pt): Final states achievable 5. Appropriate Supplementation (+1pt): Missing details appropriately filled **Deductions:** - Unnatural Japanese/Hallucination (-1pt) - Calculation Errors (-1pt) - Procedural Contradictions (-1pt) **Special Restriction:** - Excessive safety → Common criteria capped at 2pts **Individual Criteria (5 points):** - Based on measurement.specific_criteria (if provided) - General quality assessment (if not provided) **Quality Thresholds:** - 9-10 pts: Excellent (⭐⭐⭐⭐⭐) - Ready for use - 7-8 pts: Good (⭐⭐⭐⭐) - Minor improvements recommended - 5-6 pts: Needs Improvement (⭐⭐⭐) - Significant revisions needed - 3-4 pts: Insufficient (⭐⭐) - Major issues - 0-2 pts: Unacceptable (⭐) - Complete regeneration required ## Best Practices ### General Workflow 1. **Always verify input.json**: Ensure all required fields exist before proceeding 2. **Check reference availability**: Some entries may not have fetchable references 3. **Review formal validation first**: Fix constraint violations before semantic evaluation 4. **Preserve intermediate files**: Keep all outputs for debugging and analysis ### Iterative Improvement 5. **Trust the iteration process**: The workflow automatically iterates up to 3 times to achieve target quality (≥8/10) 6. **Analyze iteration history**: Compare review_v1.md, review_v2.md, review.md to understand improvement trajectory 7. **Learn from failures**: If max iterations reached without target, review all versions to identify persistent issues 8. **Use version control**: Archived versions (procedure_v1.json, procedure_v2.json) allow rollback if needed ### Scoring Strategy 9. **Prioritize common criteria**: These 5 points are achievable through careful attention to input data 10. **Address deductions early**: Fix calculation errors, hallucinations, and contradictions in early iterations 11. **Understand individual criteria**: Read measurement.specific_criteria carefully if provided 12. **Avoid excessive safety**: Don't add unnecessary steps or extreme safety margins (triggers 2pt cap) ### Debugging Low Scores **If stuck at low scores (<6/10) after iterations:** - Check if instruction parameters are fully reflected - Verify all mandatory_objects are used - Confirm source_protocol_steps logic is preserved - Validate all calculations manually - Review for hallucinations or invented information **If stuck at medium scores (6-7/10):** - Focus on individual criteria improvements - Enhance cost efficiency (reduce reagent waste) - Improve work efficiency (optimize step order) - Add precision measures (quality controls) ### Batch Processing Best Practices 13. **Process in parallel if possible**: Independent entries can be processed simultaneously 14. **Monitor iteration counts**: High iteration rates may indicate dataset complexity 15. **Compare across entries**: Identify common failure patterns for systematic improvements ## Integration with Other Skills This skill orchestrates: - **la-bench-parser** (Step 1): JSONL parsing - **web-reference-fetcher** (Step 2): Reference retrieval - **procedure-generator** (Step 3): Procedure generation via experiment-procedure-generator subagent - **procedure-checker** (Step 4): Validation via validate_procedure.py + procedure-semantic-checker subagent See individual skill documentation for detailed specifications. ## Directory Naming Convention All outputs use the standardized pattern: `workdir/{filename}_{entry_id}/` **Examples:** | JSONL File | Entry ID | Directory | |------------|----------|-----------| | data/public_test.jsonl | public_test_1 | workdir/public_test_public_test_1/ | | data/private_test_input.jsonl | private_test_5 | workdir/private_test_input_private_test_5/ | | data/validation.jsonl | exp_042 | workdir/validation_exp_042/ | **Filename extraction:** ```python from pathlib import Path filename = Path("data/public_test.jsonl").stem # "public_test" entry_id = "public_test_1" dir_name = f"{filename}_{entry_id}" # "public_test_public_test_1" ``` ## Example Usage ### Example 1: Immediate Success (Score ≥ 8 on first attempt) ``` User: "Process public_test_1 from data/public_test.jsonl" Assistant workflow: 1. Parse: Extract entry to workdir/public_test_public_test_1/input.json ✅ 2. Fetch: Download 2 references to workdir/public_test_public_test_1/references/ ✅ 3. Generate: Create procedure.json (40 steps) ✅ 4. Validate (Iteration 1): - Formal: ✅ PASSED (40/50 steps, all ≤10 sentences) - Semantic: 9/10 points (Common: 5/5, Individual: 4/5) - Status: ✅ TARGET ACHIEVED (≥8/10) 5. Report: Procedure rated "Excellent" (⭐⭐⭐⭐⭐) Result: Single iteration achieved excellent quality Next: Review workdir/public_test_public_test_1/review.md for detailed feedback ``` ### Example 2: Iterative Improvement (Score improves to ≥ 8) ``` User: "Process public_test_5 from data/public_test.jsonl" Assistant workflow: 1. Parse: Extract entry ✅ 2. Fetch: Download references ✅ 3. Generate: Create initial procedure ✅ 4. Validate and Improve: Iteration 1: - Formal: ✅ PASSED - Semantic: 6/10 points (Common: 4/5, Individual: 2/5) - Issues: Missing parameter specifications, incomplete object usage - Status: 🔄 IMPROVE (score < 8) Iteration 2: - Regenerate with fixes for missing parameters - Formal: ✅ PASSED - Semantic: 7/10 points (Common: 5/5, Individual: 2/5) - Issues: Individual criteria still weak (cost efficiency) - Status: 🔄 IMPROVE (score < 8) Iteration 3: - Regenerate with cost efficiency improvements - Formal: ✅ PASSED - Semantic: 8/10 points (Common: 5/5, Individual: 3/5) - Status: ✅ TARGET ACHIEVED 5. Report: Final score 8/10 after 3 iterations (⬆️ +2 from initial) Result: Iterative improvement achieved target quality Files: procedure_v1.json, procedure_v2.json, procedure.json (final) Reviews: review_v1.md, review_v2.md, review.md (final) ``` ### Example 3: Partial Success (Max iterations without reaching target) ``` User: "Process complex_experiment_1 from data/validation.jsonl" Assistant workflow: 1-3. [Parse, Fetch, Generate] ✅ 4. Validate and Improve: Iteration 1: 5/10 (Critical issues with quantitative accuracy) Iteration 2: 6/10 (Improved accuracy, but logical structure issues) Iteration 3: 7/10 (Further improvements, but individual criteria still weak) Status: ⚠️ MAX ITERATIONS REACHED 5. Report: - Best Score: 7/10 (Good, but below target) - Recommendation: ⚠️ REVISE with manual review - Warning: Automatic improvement could not achieve 8/10 - Action: Manual refinement recommended Result: Best effort achieved 7/10 Recommendation: Review all iteration histories to identify persistent issues ``` ## Important Context This workflow is conducted for **academic research purposes** to evaluate and improve experimental planning quality and safety. All generated procedures should prioritize: - Scientific accuracy - Reproducibility - Safety considerations - Cost and time efficiency - Adherence to research ethics guidelines