---
name: crawl-wechat
description: >
  Crawl and extract WeChat public account (微信公众号) articles into structured
  data and clean markdown. Use this skill whenever the user wants to scrape,
  crawl, read, extract, or fetch content from a WeChat article URL
  (mp.weixin.qq.com). Also trigger when the user mentions 微信公众号文章
  (WeChat public account articles), 抓取微信文章 (scrape WeChat articles), or
  爬取公众号 (crawl public accounts), or provides a WeChat article link and
  wants its content extracted. This skill handles the tricky parts: spoofing
  the WeChat in-app browser user-agent, waiting for dynamic content to load,
  fixing lazy-loaded images, extracting structured metadata (title, author,
  publish time), generating clean markdown with inline images, and downloading
  images locally to bypass hotlink protection.
---

# Crawl WeChat Articles

This skill extracts content from WeChat public account articles using the `crawl4ai` library. WeChat articles require special handling because they check the User-Agent header, render content dynamically, and lazy-load images.

## When to use

- User provides a `mp.weixin.qq.com/s/...` URL and wants its content
- User asks to scrape/crawl/extract/read a WeChat (微信) article
- User wants to batch-process multiple WeChat article links
- User needs the article in markdown or structured format

## Setup (run once before first use)

Before running the script, ensure dependencies are installed:

```bash
pip install crawl4ai aiohttp && crawl4ai-setup
```

If `crawl4ai` is already importable and the browser backend is ready, skip this step. If the script fails with `ModuleNotFoundError` or browser-related errors, run the commands above to fix it.

## How it works

Run the bundled script to crawl a WeChat article:

```bash
python /scripts/crawl_wechat.py <url> [--download-images] [--save-html] [--save-markdown] [--output-dir DIR]
```

The script prints a JSON summary to stdout and optionally saves the full HTML and/or markdown to files.

### Key technical details

1. **User-Agent spoofing**: The script sets `MicroMessenger/8.0.43` in the UA string so WeChat serves the full article instead of a "please open in WeChat" block.
2. **Dynamic wait**: Uses `wait_for="css:#js_content"` so that scraping starts only after the article body element is present in the DOM.
3. **Lazy-image fix**: WeChat uses `data-src` for lazy-loaded images. The script injects JS to copy `data-src` → `src` so the markdown generator can pick up real image URLs.
4. **Structured extraction**: Uses `JsonCssExtractionStrategy` with a schema targeting WeChat's DOM structure (`#activity-name` for title, `#js_name` for author, `#publish_time` for date, `#js_content` for body).
5. **Clean markdown with images**: Uses `DefaultMarkdownGenerator` to produce readable markdown. SVG placeholder images and data-URI artifacts are removed, so only real article images remain inline with the text.
6. **Image hotlink protection**: WeChat images on `mmbiz.qpic.cn` block requests with non-QQ referrers. Use `--download-images` to download all images locally with the correct Referer header; remote URLs are then replaced automatically with local paths in both the HTML and markdown output.
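
Several of the details above can be illustrated with plain Python, without launching a browser. The following is a minimal sketch, not the bundled script's actual code: the names `WECHAT_UA`, `LAZY_IMAGE_JS`, `WECHAT_SCHEMA`, `clean_markdown`, and `image_request_headers` are hypothetical, the full User-Agent string and the Referer value `https://mp.weixin.qq.com/` are assumptions (the text above only specifies the `MicroMessenger/8.0.43` token and a QQ-owned referrer), and the regex-based cleanup is a simplification of whatever the script really does.

```python
import re

# Assumed UA string; only the MicroMessenger/8.0.43 token comes from this document.
WECHAT_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 "
    "MicroMessenger/8.0.43"
)

# Hypothetical JS for the lazy-image fix: copy data-src into src on every image.
LAZY_IMAGE_JS = """
document.querySelectorAll('img[data-src]').forEach(function (img) {
    img.setAttribute('src', img.getAttribute('data-src'));
});
"""

# Schema in the dict form that JsonCssExtractionStrategy accepts.
# Selectors are the ones listed above; "baseSelector" is an assumption.
WECHAT_SCHEMA = {
    "name": "wechat_article",
    "baseSelector": "body",
    "fields": [
        {"name": "title", "selector": "#activity-name", "type": "text"},
        {"name": "author", "selector": "#js_name", "type": "text"},
        {"name": "publish_time", "selector": "#publish_time", "type": "text"},
        {"name": "html", "selector": "#js_content", "type": "html"},
    ],
}

# Markdown images whose target is a data: URI (SVG placeholders and similar).
# Simplification: assumes the data URI contains no unescaped ")".
_DATA_URI_IMAGE = re.compile(r"!\[[^\]]*\]\(data:[^)]*\)")


def clean_markdown(markdown: str) -> str:
    """Drop data-URI placeholder images and collapse the blank lines left behind."""
    cleaned = _DATA_URI_IMAGE.sub("", markdown)
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip()


def image_request_headers(referer: str = "https://mp.weixin.qq.com/") -> dict:
    """Headers for fetching mmbiz.qpic.cn images; the Referer value is assumed."""
    return {"Referer": referer, "User-Agent": WECHAT_UA}
```

With `crawl4ai`, values like these would typically be supplied through `BrowserConfig` (`user_agent`) and `CrawlerRunConfig` (`js_code`, `wait_for`, `extraction_strategy`); check the installed version's documentation before relying on that wiring.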

## Extracted fields

| Field          | Description                    |
|----------------|--------------------------------|
| `title`        | Article title                  |
| `author`       | Public account name            |
| `publish_time` | Publication timestamp          |
| `account_desc` | Account description/bio        |
| `markdown`     | Clean markdown of article body |
| `html`         | Raw HTML of article body       |
| `url`          | Final URL after any redirects  |

## Example usage

Single article with images downloaded locally:

```bash
python /scripts/crawl_wechat.py "https://mp.weixin.qq.com/s/xxx" --download-images --save-markdown --output-dir ./output
```

For programmatic use in Python:

```python
import asyncio

from crawl_wechat import crawl_wechat_article

article = asyncio.run(crawl_wechat_article(
    "https://mp.weixin.qq.com/s/...",
    images_dir="./output/images",  # download images locally
))
print(article["title"])
print(article["markdown"])  # images reference local paths
```

## Limitations

- Requires a valid, non-expired WeChat article URL; it cannot search or list articles from an account
- High-frequency crawling may trigger WeChat's anti-bot measures (CAPTCHAs, IP blocks)
- Some temporary share links expire after a period
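
Because high-frequency crawling can trigger anti-bot measures, batch jobs may benefit from spacing requests out. The sketch below is one possible approach and not part of the bundled script: `crawl_batch` and its `delay_seconds` parameter are hypothetical, the 5-second default is an arbitrary assumption rather than a known safe rate, and `fetch` stands in for any coroutine that takes a URL and returns a dict, such as `crawl_wechat_article`.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def crawl_batch(
    urls: List[str],
    fetch: Callable[[str], Awaitable[dict]],
    delay_seconds: float = 5.0,  # assumed pause; tune to what the target tolerates
) -> Dict[str, dict]:
    """Crawl URLs sequentially, sleeping between requests to stay low-frequency."""
    results: Dict[str, dict] = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first.
            await asyncio.sleep(delay_seconds)
        try:
            results[url] = await fetch(url)
        except Exception as exc:
            # Record the failure (e.g. an expired link) and continue with the rest.
            results[url] = {"error": str(exc)}
    return results
```

A call such as `asyncio.run(crawl_batch(urls, crawl_wechat_article))` would return one entry per URL, with failed links recorded under an `"error"` key instead of aborting the whole batch.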