---
name: ai-chapter-consolidate
description: Use AI to merge individual page HTML files into a unified chapter document. Creates continuous document format for improved reading experience and semantic consistency.
---

# AI Chapter Consolidate Skill

## Purpose

This skill uses AI to **intelligently merge individual page HTML files** into a single, continuous chapter document. Rather than simply concatenating pages, the AI:

- Removes duplicate headers/footers from continuation pages
- Ensures consistent heading hierarchy across pages
- Maintains semantic structure throughout
- Preserves all content without loss or repetition
- Creates smooth content flow (no page breaks)

The result is a **unified chapter document** in the continuous format (single `page-container`, single `page-content`).

## What to Do

1. **Collect all page HTML files for the chapter**
   - Gather the `04_page_XX.html` files for all pages in the chapter
   - Verify all files exist and are valid
   - Sort by page number (ascending)

2. **Extract content from each page**
   - Load each HTML file
   - Extract main content from the `page-content` element
   - Preserve semantic classes and structure

3. **Prepare consolidation inputs for AI**
   - Page 1: Full content including chapter header
   - Pages 2+: Extract content sections, remove chapter header/nav
   - Preserve all text and structure
   - Note any special sections (exhibits, tables, etc.)

4. **Invoke AI consolidation**
   - Send all page contents to Claude
   - Request merging into a single continuous document
   - Specify structural requirements
   - Request heading hierarchy normalization

5. **Process AI output**
   - Extract consolidated HTML from the response
   - Verify structure integrity
   - Ensure all pages are represented
   - Check heading hierarchy

6. **Save consolidated document**
   - Save to: `output/chapter_XX/chapter_artifacts/chapter_XX.html`
   - Create metadata/log file
   - Calculate statistics

## Input Files

**Per-page HTML files** (validated by previous gate):

- `output/chapter_XX/page_artifacts/page_16/04_page_16.html` (Chapter opening)
- `output/chapter_XX/page_artifacts/page_17/04_page_17.html` (Continuation)
- `output/chapter_XX/page_artifacts/page_18/04_page_18.html` (Continuation)
- ... (all pages in chapter)

**Chapter metadata** (from analysis):

- Page range (first and last page of chapter)
- Chapter number
- Chapter title
- Expected page count

## AI Consolidation Prompt

The prompt sent to Claude:

````
You are merging individual page HTML documents into a single, continuous chapter.

INPUT PAGES:

Page 1 (Opening - include chapter header):
[HTML content from page 1]

Page 2 (Continuation):
[HTML content from page 2]

Page 3 (Continuation):
[HTML content from page 3]

... (all pages)

TASK:
Merge these pages into a single HTML document that reads as one continuous chapter.

REQUIREMENTS:

1. Structure:
   - Create a single page-container element wrapping everything
   - Create a single page-content element for all content
   - Remove page-break indicators or comments
   - Create a truly continuous document (no paginated elements)

2. Chapter Header:
   - Keep chapter header from Page 1 (chapter number, title)
   - Remove chapter headers/titles from continuation pages
   - Keep section navigation if present on Page 1
   - Remove duplicate navigation from other pages

3. Content Preservation:
   - Include ALL text content from all pages
   - Preserve exact wording (no paraphrasing)
   - Maintain all lists, paragraphs, tables
   - Include all semantic classes
   - Keep all HTML structure

4. Heading Hierarchy:
   - Normalize heading levels across merged pages
   - Page 1 h1 = Chapter title (stays as h1)
   - First section in each page = h2 (main sections)
   - Sub-sections = h3 or h4 as needed
   - Ensure no hierarchy jumps (e.g., h1 → h3 without h2)
   - Number consecutive headings logically

5. Content Flow:
   - Remove page-specific headers/footers
   - Merge seamlessly so content flows naturally
   - No artificial breaks or transitions
   - Paragraphs continue logically
   - Lists maintain coherence

6. Exhibits and Images:
   - Preserve all tables and figures
   - Keep exhibit titles and captions
   - Include all images with proper paths
   - Maintain table of contents if present

7. CSS Classes:
   - Preserve all semantic classes (section-heading, paragraph, etc.)
   - Keep consistent class usage throughout
   - Ensure classes match chapter opening page style
   - Do not add or remove classes

8. Metadata:
   - Include title tag: "Chapter N: Title - Pages X-Y"
   - Keep meta charset and viewport
   - Link the stylesheet

OUTPUT:
Return ONLY a single, valid HTML5 document:

```html
Chapter [N]: [Title] - Pages [X-Y]
```

VALIDATION:
- Single HTML5 document
- All pages represented
- No page breaks or transitions
- Proper heading hierarchy
- All text preserved
````

## Page Content Extraction Logic

Before sending pages to the AI, extract content selectively:

### Page 1 (Opening):

- **Include**: Entire page HTML content
- **Reason**: Contains chapter header, navigation, first section
- **Preserve**: All elements (header, nav, dividers, content)

### Pages 2-N (Continuation):

- **Extract**: Only content after chapter header
- **Skip**: Chapter number, chapter title, section navigation
- **Preserve**: Section headings, paragraphs, lists, exhibits
- **Include**: All semantic content sections

### Example extraction:

```html
2

Rights in Real Estate

REAL PROPERTY RIGHTS

...

Physical characteristics.

...

    ...

Interdependence.

...

```

## Output File

### Consolidated Chapter HTML

**Path**: `output/chapter_XX/chapter_artifacts/chapter_XX.html`

**Structure**:

```
Chapter 2: Rights in Real Estate - Pages 16-29
...

REAL PROPERTY RIGHTS

...

Physical characteristics.

...

    ...

Interdependence.

...

REGULATIONS AND LICENSING

...

```

### Consolidation Log

**Path**: `output/chapter_XX/chapter_artifacts/consolidation_log.json`

```json
{
  "chapter": 2,
  "title": "Rights in Real Estate",
  "book_pages": "16-29",
  "pdf_indices": "15-28",
  "consolidated_at": "2025-11-08T14:35:00Z",
  "pages_merged": 14,
  "pages_included": [
    {
      "page": 16,
      "book_page": 17,
      "status": "opening_chapter",
      "content_type": "header_navigation_content"
    },
    {
      "page": 17,
      "book_page": 18,
      "status": "continuation",
      "content_type": "subsections_paragraphs"
    },
    {
      "page": 18,
      "book_page": 19,
      "status": "continuation",
      "content_type": "subsections_paragraphs_list"
    }
    // ... all pages
  ],
  "content_statistics": {
    "total_headings": {
      "h1": 1,
      "h2": 4,
      "h3": 0,
      "h4": 12
    },
    "total_paragraphs": 156,
    "total_lists": 12,
    "total_list_items": 42,
    "total_tables": 3,
    "total_images": 5,
    "total_words": 12547
  },
  "ai_model": "claude-3-5-sonnet-20241022",
  "consolidation_notes": "Successfully merged 14 pages into continuous format"
}
```

## Implementation

Execute consolidation via the Python wrapper:

```bash
cd Calypso/tools

# Run consolidation
python3 consolidate_chapter.py \
  --chapter 2 \
  --pages 15-28 \
  --output "../output" \
  --mapping "../analysis/page_mapping.json"

# Or invoke directly via Claude API:
# The orchestrator sends the AI prompt with all page contents
```

## Quality Checks

Before passing to the next gate:

1. **File created**
   - [ ] `chapter_XX.html` exists
   - [ ] File is valid HTML (parseable)
   - [ ] File size reasonable (> 50KB typical)

2. **Structure validated**
   - [ ] Single `page-container`
   - [ ] Single `page-content`
   - [ ] All tags properly closed
   - [ ] No duplicate content

3. **Content completeness**
   - [ ] All pages represented
   - [ ] No missing sections
   - [ ] Paragraph/heading counts reasonable
   - [ ] All text content present

4. **Heading hierarchy**
   - [ ] Starts with h1 (chapter title)
   - [ ] h1 count = 1
   - [ ] h2 = major sections
   - [ ] h3/h4 = subsections
   - [ ] No hierarchy jumps

5. **Metadata logged**
   - [ ] Consolidation timestamp recorded
   - [ ] Pages merged count documented
   - [ ] Content statistics calculated
   - [ ] Log file saved

## Success Criteria

- ✓ All pages merged into single document
- ✓ Chapter header preserved from page 1
- ✓ Duplicate headers removed from continuation pages
- ✓ Content flows naturally (continuous format)
- ✓ Heading hierarchy is correct
- ✓ All text content preserved
- ✓ Semantic classes maintained
- ✓ Ready for semantic validation

## Error Handling

**If page HTML is incomplete**:

- Note in consolidation log
- Include whatever content is available
- Proceed to validation (validation will catch issues)

**If heading hierarchy is ambiguous**:

- AI makes best judgment
- Semantic validation gate will refine if needed
- Document decision in log

**If content appears duplicated**:

- AI deduplicates automatically
- Verify word count is reasonable
- Log any unusual content patterns

## Next Steps

Once consolidation completes:

1. **Quality Gate 2** (semantic-validate) checks semantic structure
2. **Skill 5** (quality-report-generate) generates final report
3. **Quality Gate 3** (visual-accuracy-check) validates appearance

## Design Notes

- This skill is **AI-powered** (consolidation is probabilistic, so output can vary between runs)
- Relies on the AI's understanding of document structure
- Produces continuous format (no page breaks)
- Merges intelligently (not just concatenation)
- Output will be refined by validation gates

## Testing

To test consolidation on Chapter 2:

```bash
# Input: 14 individual page HTML files (pages 16-29)
# Process: AI merges into single continuous chapter
# Output: chapter_02.html (single, unified document)

# Verify:
# - File size is roughly the sum of all pages (slightly smaller, since duplicate headers are removed)
# - Content flows logically
# - Heading hierarchy makes sense
# - No duplicate sections
```
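
The heading-hierarchy items from the Quality Checks list (starts with h1, exactly one h1, no hierarchy jumps) can be checked mechanically. The following is a minimal sketch using only the Python standard library. The names `HeadingCollector` and `check_heading_hierarchy` are hypothetical and are not part of `consolidate_chapter.py`, and the sketch treats any heading that skips a level downward as a jump.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect heading levels (1-6) in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so "H2" arrives as "h2".
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def check_heading_hierarchy(html_text):
    """Return a list of problems; an empty list means all checks passed.

    Hypothetical helper (assumption): mirrors the "Heading hierarchy"
    checklist, not the actual validation gate.
    """
    collector = HeadingCollector()
    collector.feed(html_text)
    levels = collector.levels

    problems = []
    if not levels or levels[0] != 1:
        problems.append("document does not start with h1")
    if levels.count(1) != 1:
        problems.append(f"expected exactly one h1, found {levels.count(1)}")
    # A jump is any heading more than one level deeper than the previous one.
    for prev, cur in zip(levels, levels[1:]):
        if cur > prev + 1:
            problems.append(f"hierarchy jump: h{prev} -> h{cur}")
    return problems


if __name__ == "__main__":
    sample = "<h1>Chapter</h1><h2>Section</h2><h4>Too deep</h4>"
    for problem in check_heading_hierarchy(sample):
        print(problem)
```

Returning a list of problems rather than raising on the first failure lets the caller record every finding in the consolidation log before the next gate runs.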