--- name: web-content-fetcher description: Expert guidance for fetching and parsing web content from URLs, handling various content sizes, JavaScript-rendered sites, and working around tool limitations. Use when fetching articles, documentation, social media posts, or any web content for translation, analysis, or archiving. allowed-tools: Bash(curl *), Bash(mkdir *), Bash(node *), Read, Grep, Task --- # Web Content Fetcher Expert knowledge for fetching and parsing web content, handling size limitations, JavaScript-rendered sites, and extracting clean article content from HTML. ## Overview This skill provides battle-tested strategies for fetching web content in Claude Code, addressing critical challenges: 1. **WebFetch tool limitation**: ~50KB max content size 2. **Read tool limitation**: 256KB max file size per call 3. **JavaScript-rendered content**: Twitter/X, React SPAs require special handling **Four-tier approach**: - **Tier 1**: WebFetch for small content (< 50KB) - **Tier 2**: curl + Task agent for any size static content ⭐ **DEFAULT** - **Tier 3**: Scripts for edge cases (EUC-JP encoding, complex HTML) - **Tier 4**: Playwright for JavaScript-rendered sites (Twitter/X, React SPAs) ## Quick Tool Limitations | Tool | Max Size | Best For | Limitation | | -------------- | --------- | ------------------------------- | -------------------------------- | | **WebFetch** | ~50KB | Small articles, first attempt | Prompt length constraints | | **Read** | 256KB | Reading fetched files | Large files need pagination | | **curl** | Unlimited | Any size content, raw downloads | No parsing | | **Task agent** | Unlimited | Extraction from any size HTML | Handles pagination automatically | ## Tiered Fetching Strategy ### Tier 1: Small Content - WebFetch Direct **Use when**: Simple articles, API responses, first attempt on unknown content **Quick example**: ``` WebFetch tool with url and extraction prompt ``` **If fails**: "Prompt too long" error → Switch to Tier 2 ### Tier 2: Any Size Content - curl + Task Agent ⭐ RECOMMENDED **Use when**: News articles, blog posts, WebFetch fails, any content of any size **Quick workflow**: ```bash # 1. Create directory mkdir -p output/tasks/YYYYMMDD_taskname/original # 2. Fetch HTML curl -s "URL" > output/tasks/YYYYMMDD_taskname/original/raw_html.html # 3. Extract with Task agent # Prompt: "Read HTML at [path], extract article content, save to fetched_content.md" ``` **Why default**: Handles any size, AI-powered extraction, no script setup needed. **For detailed workflow**: See [references/workflows.md](references/workflows.md) ### Tier 3: Script-Based Extraction - Edge Cases Only **Use when**: Task agent struggles with complex HTML or special encoding requirements (EUC-JP) **Available scripts**: - `scripts/extract_article.js` - Standard HTML with Mozilla Readability - `scripts/extract_eucjp.js` - Japanese EUC-JP encoded sites (4gamer.net) **Quick example**: ```bash node .claude/skills/web-content-fetcher/scripts/extract_article.js raw_html.html > fetched_content.md ``` **Note**: Try Task agent (Tier 2) first. Scripts only when Task agent explicitly fails. **For script details**: See [references/advanced-fetching.md](references/advanced-fetching.md) ### Tier 4: JavaScript-Rendered Content - Playwright **Use when**: Twitter/X, React SPAs, dynamic sites, static fetch returns empty/error **Quick example**: ```bash node .claude/skills/web-content-fetcher/scripts/fetch_js_content.js \ "https://x.com/user/status/123456789" \ --output fetched_content.md ``` **Common sites**: Twitter/X, Threads, Instagram, React/Vue/Angular SPAs, AJAX-loaded content **Image Extraction**: - **Automatically extracts images if exist** (Twitter/X, Threads, blogs, etc.) - Captures image URLs with alt text from main content area - Filters out small images (icons/logos) automatically - Output includes structured Media section with URLs for downloading **Setup required** (one-time): ```bash cd .claude/skills/web-content-fetcher pnpm add -D playwright --save-catalog-name=dev pnpm exec playwright install chromium ``` **For Playwright details**: See [references/advanced-fetching.md](references/advanced-fetching.md) ## Decision Tree Use this flowchart to choose the right approach: ``` Need to fetch web content? │ ├─ JavaScript-rendered site (Twitter/X, React SPA, dynamic content)? │ └─ Use Playwright script (Tier 4) │ └─ See: references/advanced-fetching.md │ ├─ Size unknown or small expected content? │ └─ Try WebFetch (Tier 1) │ ├─ Success? → Done ✓ │ └─ "Prompt too long" error? → Use Tier 2 │ ├─ Any other case (medium/large content, WebFetch failed)? │ └─ Use curl + Task agent (Tier 2) ⭐ DEFAULT │ ├─ Success? → Done ✓ │ └─ Task agent struggles? → Use Tier 3 │ └─ See: references/advanced-fetching.md │ └─ Edge cases (Task agent fails, special encoding)? └─ Use curl + script (Tier 3) └─ See: references/advanced-fetching.md ``` ## Standard Workflow Quick Reference **Most common pattern** (works for 90% of cases): ```bash # 1. Setup mkdir -p output/tasks/20260111_taskname/original # 2. Fetch curl -s "URL" > output/tasks/20260111_taskname/original/raw_html.html # 3. Extract (Task agent prompt) "Read HTML file at [path], extract main article content, remove navigation/ads/sidebars/comments/footer, save clean markdown to: [path]/fetched_content.md" # 4. Use Read: output/tasks/20260111_taskname/original/fetched_content.md ``` **For detailed workflows and patterns**: See [references/workflows.md](references/workflows.md) ## Common Use Cases **Translation with images**: 1. Fetch content (Tier 2 or Tier 4 for JS sites) 2. Extract to `original/fetched_content.md` with readable format (includes Media section with image URLs) 3. Download images using URLs from Media section 4. Pass to `/translate` command **Analysis**: 1. Fetch content (Tier 2 or Tier 4) 2. Extract to `original/fetched_content.md` with readable format 3. Read and analyze **Social media posts** (Twitter/X, Threads, Instagram): 1. Use Playwright (Tier 4) directly 2. Automatically extracts text, images, and metadata 3. Outputs to `fetched_content.md` with Media section in a readable format 4. Download images from extracted URLs 5. Use for translation/analysis **For all patterns**: See [references/workflows.md](references/workflows.md) ## When Things Go Wrong **Common issues and quick fixes**: | Issue | Quick Fix | Details | | -------------------------- | ----------------------------- | --------------------------------------------------- | | WebFetch "Prompt too long" | Switch to Tier 2 | [troubleshooting.md](references/troubleshooting.md) | | Read tool file too large | Use Task agent | [troubleshooting.md](references/troubleshooting.md) | | Garbled Japanese text | EUC-JP encoding issue | [troubleshooting.md](references/troubleshooting.md) | | JavaScript required | Use Playwright (Tier 4) | [troubleshooting.md](references/troubleshooting.md) | | Anti-bot protection | Add user agent | [troubleshooting.md](references/troubleshooting.md) | | Authentication required | Use curl with headers/cookies | [troubleshooting.md](references/troubleshooting.md) | **For complete troubleshooting guide**: See [references/troubleshooting.md](references/troubleshooting.md) ## Best Practices 1. **Always use task directories**: `output/tasks/YYYYMMDD_taskname/` 2. **Default to Tier 2**: curl + Task agent works for nearly all cases 3. **Keep raw HTML**: Save to `original/raw_html.html` for reference (except Tier 4) 4. **Use Playwright for JS sites**: Twitter/X, React SPAs, dynamic content 5. **Clean content format**: Always save extracted content as markdown 6. **Descriptive naming**: Use date prefix (`YYYYMMDD_`) and descriptive task names 7. **Try simple first**: WebFetch → curl + Task agent → Scripts (only if needed) 8. **Task agent for extraction**: Let AI handle pagination and parsing complexity ## Reference Files This skill uses progressive disclosure for efficiency. Core information is in this file. Detailed guides are in reference files: - **[references/workflows.md](references/workflows.md)** - Standard workflows, common patterns, directory structure, content extraction priorities - **[references/advanced-fetching.md](references/advanced-fetching.md)** - Tier 3 script-based extraction, Tier 4 Playwright details, setup instructions - **[references/troubleshooting.md](references/troubleshooting.md)** - Common issues, solutions, quick reference commands Read reference files when you need detailed guidance for specific scenarios. ## Quick Start Examples **Simple article**: ```bash mkdir -p output/tasks/20260111_article/original curl -s "https://example.com/article" > output/tasks/20260111_article/original/raw_html.html # Task agent: Extract to fetched_content.md ``` **Twitter/X post**: ```bash mkdir -p output/tasks/20260111_twitter/original node .claude/skills/web-content-fetcher/scripts/fetch_js_content.js \ "https://x.com/user/status/123" \ --output output/tasks/20260111_twitter/original/fetched_content.md ``` **Threads post**: ```bash mkdir -p output/tasks/20260111_threads/original node .claude/skills/web-content-fetcher/scripts/fetch_js_content.js \ "https://www.threads.com/@user/post/ABC123" \ --output output/tasks/20260111_threads/original/fetched_content.md ``` **Output format with images** (works for all sites): ```markdown # Title Main content text here... ### Media 1. Image description or alt text - URL: https://example.com/image1.jpg 2. Another image - URL: https://example.com/image2.jpg ``` **Note**: - Images are automatically extracted from the main content area - Small images (< 100x100px) like icons/logos are filtered out - Image formats: `.jpg`, `.png`, `.webp`, `.gif`, `.svg` - all preserved in URLs - When downloading, respect the original file extension **Japanese site (potential encoding issue)**: ```bash mkdir -p output/tasks/20260111_japanese/original curl -s "https://4gamer.net/..." > output/tasks/20260111_japanese/original/raw_html.html # Task agent with encoding detection, or use extract_eucjp.js if needed ```