--- name: html-structure-validate description: Validate HTML5 structure and basic syntax. BLOCKING quality gate - stops pipeline if validation fails. Ensures deterministic output quality. --- # HTML Structure Validate Skill ## Purpose This skill is a **BLOCKING quality gate** that ensures generated HTML meets minimum structural requirements. It is the **first deterministic validation** of probabilistic AI-generated output. The skill checks: - **HTML5 compliance** - Proper DOCTYPE, tags - **Tag closure** - All tags properly closed - **Required elements** - Meta tags, stylesheet links - **Well-formedness** - Valid structure If validation fails, the pipeline **STOPS** and triggers a hook to notify the user. This enforces the principle: **Python validates, ensuring deterministic quality**. ## What to Do 1. **Load HTML file to validate** - Read `04_page_XX.html` generated by AI skill - Verify file exists and is readable - Confirm file is text (not binary) 2. **Run validation checks** - Check HTML5 structure compliance - Verify tag closure - Validate head section - Check required CSS link - Validate page container structure 3. **Generate validation report** - Document all checks performed - List any errors found - Note warnings (non-blocking) - Record informational findings 4. **Save validation report** as JSON - Save to: `output/chapter_XX/page_artifacts/page_YY/06_validation_structure.json` - Include timestamp - Include all check results 5. **Exit with appropriate code** - Return 0 if VALID (continue pipeline) - Return 1 if INVALID (STOP pipeline, trigger hook) ## Input Parameters ``` html_file: - Path to 04_page_XX.html output_dir: - Directory for validation report strict_mode: - If true, warnings also fail (default: false) page_number: - Page number (for reporting) chapter: - Chapter number (for reporting) ``` ## Validation Checks ### Check 1: DOCTYPE Declaration **Requirement**: File must start with proper DOCTYPE ```html ``` **Check**: - [ ] File contains `` (case-insensitive) - [ ] DOCTYPE appears before any tags - [ ] DOCTYPE is on first line or near beginning **Error if**: Missing or incorrect DOCTYPE ### Check 2: HTML Tags **Requirement**: Proper `` opening and closing tags ```html ... ``` **Checks**: - [ ] `` tag present - [ ] `` closing tag present - [ ] Tags are properly paired - [ ] No unclosed `` tags **Error if**: Missing either tag or improperly paired ### Check 3: Head Section **Requirement**: Complete `` section with metadata ```html ... ``` **Checks**: - [ ] `` and `` tags present - [ ] `` present - [ ] `` present (warning if missing) - [ ] `` tag with content present - [ ] CSS `<link>` tag present with href attribute **Error if**: Missing charset, title, or CSS link **Warning if**: Missing viewport meta tag ### Check 4: Body Section **Requirement**: Proper `<body>` tags with content ```html <body> <div class="page-container"> <main class="page-content"> ... </main> </div> </body> ``` **Checks**: - [ ] `<body>` and `</body>` tags present - [ ] `<div class="page-container">` present - [ ] `<main class="page-content">` present inside container - [ ] Body contains substantial content (> 100 bytes) **Error if**: Missing tags or required container divs ### Check 5: Tag Closure Validation **Requirement**: All tags must be properly closed **Checks for**: - Unmatched opening tags (e.g., `<p>` without `</p>`) - Improper nesting (e.g., `<p><h2>text</h2></p>`) - Self-closing tags used correctly (e.g., `<br/>`, `<img/>`) - Comment blocks properly formatted (`<!-- -->`) **Validation method**: - Parse HTML into tree structure - Verify all nodes properly matched - Check nesting doesn't violate HTML5 rules **Error if**: Any unmatched or improperly nested tags ### Check 6: Heading Tags (h1-h6) **Requirement**: Valid heading hierarchy ```html <h1>Chapter Title</h1> <h2>Section Heading</h2> <h3>Subsection</h3> ``` **Checks**: - [ ] All heading tags properly closed - [ ] First heading should be h1 (warning if not) - [ ] Heading levels don't skip dramatically (h1 → h4 is suspicious) - [ ] All headings have text content (not empty) **Error if**: Heading tags improperly closed **Warning if**: Suspicious hierarchy ### Check 7: Content Structure **Requirement**: Meaningful content in page container **Checks**: - [ ] `<main class="page-content">` contains elements - [ ] Content includes headings or paragraphs - [ ] No completely empty content area - [ ] Text nodes or elements present (> 100 words total) **Error if**: No content or empty structure ### Check 8: List Integrity **Requirement**: All lists properly structured **Checks** for each `<ul>` or `<ol>`: - [ ] List opening and closing tags matched - [ ] List contains `<li>` elements - [ ] All `<li>` tags properly closed - [ ] `<li>` count matches opening/closing pairs - [ ] No nested `<ul>` or `<ol>` improperly closed **Error if**: Empty lists or unmatched `<li>` tags ### Check 9: Image and Link Tags **Requirement**: Self-closing tags properly formatted **Checks**: - [ ] All `<img>` tags have `src` and `alt` attributes - [ ] All `<a>` tags have valid `href` attributes - [ ] Image paths don't have obvious errors (no broken syntax) - [ ] Self-closing tags use proper syntax **Warning if**: Images missing alt text or links missing href ### Check 10: Table Tags (if present) **Requirement**: Proper table structure **Checks**: - [ ] `<table>`, `<tr>`, `<td>`, `<th>` tags properly nested - [ ] All rows have consistent column counts - [ ] Table headers and body properly structured **Error if**: Malformed table structure ## Validation Report Format ### Output: `06_validation_structure.json` ```json { "page": 16, "book_page": 17, "chapter": 2, "validation_type": "structure", "validation_timestamp": "2025-11-08T14:34:00Z", "overall_status": "PASS", "error_count": 0, "warning_count": 1, "checks_performed": [ { "check_name": "DOCTYPE Declaration", "status": "PASS", "details": "Valid HTML5 DOCTYPE found" }, { "check_name": "HTML Tags", "status": "PASS", "details": "Proper <html> opening and closing tags" }, { "check_name": "Head Section", "status": "PASS", "details": "All required meta tags and title present" }, { "check_name": "Body Section", "status": "PASS", "details": "Body and content structure valid" }, { "check_name": "Tag Closure", "status": "PASS", "details": "All tags properly matched and closed" }, { "check_name": "Heading Hierarchy", "status": "PASS", "details": "4 headings found, proper h1-h4 hierarchy" }, { "check_name": "Content Structure", "status": "PASS", "details": "Main content area contains 245 words across 3 paragraphs" }, { "check_name": "List Integrity", "status": "PASS", "details": "1 list with 3 items, all properly formed" }, { "check_name": "Image Tags", "status": "PASS", "details": "No images on this page" }, { "check_name": "Table Tags", "status": "PASS", "details": "No tables on this page" } ], "errors": [], "warnings": [ { "check": "Heading Hierarchy", "message": "First heading is h2, typically should be h1 for page opening", "severity": "LOW" } ], "summary": { "total_checks": 10, "passed": 9, "failed": 0, "warnings": 1, "html_valid": true, "tags_matched": true, "content_substantial": true } } ``` ## Validation Rules ### PASS Criteria - DOCTYPE present and valid - All required tags (`html`, `head`, `body`, `main`, `div.page-container`) present - All tags properly closed and matched - Title tag with content - CSS stylesheet link present - Content structure valid - No structural errors ### FAIL Criteria (BLOCKS PIPELINE) - Missing DOCTYPE - Missing required tags - Unmatched or improperly nested tags - Missing title or CSS link - Empty content - Malformed lists or tables ### WARNING (Logged but doesn't block) - Missing viewport meta tag - First heading is not h1 - Large heading jumps (h1 → h4) - Missing alt text on images - Missing href on links ## Implementation: Using Python Script This validation is performed by existing `validate_html.py` tool, run in **structure validation mode**: ```bash cd Calypso/tools # Validate single page HTML python3 validate_html.py \ ../output/chapter_02/page_artifacts/page_16/04_page_16.html \ --output-json ../output/chapter_02/page_artifacts/page_16/06_validation_structure.json \ --strict-structure # Exit code: # 0 = VALID (continue to next skill) # 1 = INVALID (STOP pipeline) ``` ## Hook Integration When validation **FAILS**: ```bash # Trigger hook: .claude/hooks/validate-structure.sh # Receives: # - Page number # - HTML file path # - Validation report path # - Error details # Hook behavior: # - Log failure with details # - Save error report # - Notify user # - STOP pipeline (no further processing) ``` ## Error Recovery **If validation fails**: 1. User reviews validation report 2. User identifies issue in AI-generated HTML 3. Options: - Fix HTML manually and re-validate - Re-run AI generation with improved prompt - Review source extraction data for errors - Proceed with caution (expert override) ## Quality Metrics Validation provides metrics: - Percentage of checks passing - Error severity levels - Content size (word count, element count) - Structure complexity These metrics feed into final quality reports. ## Success Criteria ✓ Validation completes successfully ✓ All structural checks pass (0 errors) ✓ Validation report saved in JSON format ✓ Exit code 0 returned (or 1 if invalid) ✓ Clear error messages if validation fails ## Next Steps After PASS If validation passes: 1. All pages of chapter processed through this gate 2. **Skill 4** (consolidate pages) merges individual page HTMLs 3. **Quality Gate 2** (semantic validate) checks semantic structure 4. Continue through validation pipeline ## Next Steps After FAIL If validation fails: 1. **PIPELINE STOPS** 2. Hook `validate-structure.sh` triggered 3. User receives error report with details 4. User must fix issues and retry ## Design Notes - This is the **first deterministic quality gate** - Uses proven `validate_html.py` tool - Catches structural issues before semantic analysis - Provides clear, actionable error messages - Essential for ensuring pipeline reliability ## Testing To test structure validation: ```bash # Test with known-good HTML python3 validate_html.py ../output/chapter_01/chapter_01.html # Should show: ✓ VALID # Test with invalid HTML (if needed) python3 validate_html.py broken_html.html # Should show: ✗ INVALID with specific errors ```