--- name: scrape-webpage description: Scrape webpage content, extract metadata, download images, and prepare for import/migration to AEM Edge Delivery Services. Returns analysis JSON with paths, metadata, cleaned HTML, and local images. --- # Scrape Webpage Extract content, metadata, and images from a webpage for import/migration. ## When to Use This Skill Use this skill when: - Starting a page import and need to extract content from source URL - Need webpage analysis with local image downloads - Want metadata extraction (Open Graph, JSON-LD, etc.) **Invoked by:** page-import skill (Step 1) ## Prerequisites Before using this skill, ensure: - ✅ Node.js is available - ✅ npm playwright is installed (`npm install playwright`) - ✅ Chromium browser is installed (`npx playwright install chromium`) - ✅ Sharp image library is installed (`cd .claude/skills/scrape-webpage/scripts && npm install`) ## Related Skills - **page-import** - Orchestrator that invokes this skill - **identify-page-structure** - Uses this skill's output (screenshot, HTML, metadata) - **generate-import-html** - Uses image mapping and paths from this skill ## Scraping Workflow ### Step 1: Run Analysis Script **Command:** ```bash node .claude/skills/scrape-webpage/scripts/analyze-webpage.js "https://example.com/page" --output ./import-work ``` **What the script does:** 1. Sets up network interception to capture all images 2. Loads page in headless Chromium 3. Scrolls through entire page to trigger lazy-loaded images 4. Downloads all images locally (converts WebP/AVIF/SVG to PNG) 5. Captures full-page screenshot for visual reference 6. Extracts metadata (title, description, Open Graph, JSON-LD, canonical) 7. **Fixes images in DOM** (background-image→img, picture elements, srcset→src, relative→absolute, inline SVG→img) 8. Extracts cleaned HTML (removes scripts/styles) 9. Replaces image URLs in HTML with local paths (./images/...) 10. Generates document paths (sanitized, lowercase, no .html extension) 11. Saves complete analysis with image mapping to metadata.json **For detailed explanation:** See `resources/web-page-analysis.md` --- ### Step 2: Verify Output **Output files:** - `./import-work/metadata.json` - Complete analysis with paths and image mapping - `./import-work/screenshot.png` - Visual reference for layout comparison - `./import-work/cleaned.html` - Main content HTML with local image paths - `./import-work/images/` - All downloaded images (WebP/AVIF/SVG converted to PNG) **Verify files exist:** ```bash ls -lh ./import-work/metadata.json ./import-work/screenshot.png ./import-work/cleaned.html ls -lh ./import-work/images/ | head -5 ``` --- ### Step 3: Review Metadata JSON **Output JSON structure:** ```json { "url": "https://example.com/page", "timestamp": "2025-01-12T10:30:00.000Z", "paths": { "documentPath": "/us/en/about", "htmlFilePath": "us/en/about.plain.html", "mdFilePath": "us/en/about.md", "dirPath": "us/en", "filename": "about" }, "screenshot": "./import-work/screenshot.png", "html": { "filePath": "./import-work/cleaned.html", "size": 45230 }, "metadata": { "title": "Page Title", "description": "Page description", "og:image": "https://example.com/image.jpg", "canonical": "https://example.com/page" }, "images": { "count": 15, "mapping": { "https://example.com/hero.jpg": "./images/a1b2c3d4e5f6.jpg", "https://example.com/logo.webp": "./images/f6e5d4c3b2a1.png" }, "stats": { "total": 15, "converted": 3, "skipped": 12, "failed": 0 } } } ``` **Key fields:** - `paths.documentPath` - Used for browser preview URL - `paths.htmlFilePath` - Where to save final HTML file - `images.mapping` - Original URLs → local paths - `metadata` - Extracted page metadata --- ## Output This skill provides: - ✅ metadata.json with paths, metadata, image mapping - ✅ screenshot.png for visual reference - ✅ cleaned.html with local image references - ✅ images/ folder with all downloaded images **Next step:** Pass these outputs to identify-page-structure skill --- ## Troubleshooting **Browser not installed:** ```bash npx playwright install chromium ``` **Sharp not installed:** ```bash cd .claude/skills/scrape-webpage/scripts && npm install ``` **Image download failures:** - Check images.stats.failed count in metadata.json - Some images may require authentication or be blocked by CORS - Failed images will be noted but won't stop the scraping process **Lazy-loaded images not captured:** - Script scrolls through page to trigger lazy loading - Some advanced lazy-loading may need customization in scripts/analyze-webpage.js