---
name: baoyu-url-to-markdown
description: Fetch any URL and convert to markdown using Chrome CDP. Saves the rendered HTML snapshot alongside the markdown, uses an upgraded Defuddle pipeline with better web-component handling and YouTube transcript extraction, and automatically falls back to the pre-Defuddle HTML-to-Markdown pipeline when needed. If local browser capture fails entirely, it can fall back to the hosted defuddle.md API. Supports two modes - auto-capture on page load, or wait for user signal (for pages requiring login). Use when user wants to save a webpage as markdown.
version: 1.59.0
metadata:
  openclaw:
    homepage: https://github.com/JimLiu/baoyu-skills#baoyu-url-to-markdown
    requires:
      anyBins:
        - bun
        - npx
---

# URL to Markdown

Fetches any URL via Chrome CDP, saves the rendered HTML snapshot, and converts it to clean markdown.

## Script Directory

**Important**: All scripts are located in the `scripts/` subdirectory of this skill.

**Agent Execution Instructions**:

1. Determine this SKILL.md file's directory path as `{baseDir}`
2. Script path = `{baseDir}/scripts/<script>.ts`
3. Resolve `${BUN_X}` runtime: if `bun` installed → `bun`; if `npx` available → `npx -y bun`; else suggest installing bun
4. Replace all `{baseDir}` and `${BUN_X}` in this document with actual values

**Script Reference**:

| Script | Purpose |
|--------|---------|
| `scripts/main.ts` | CLI entry point for URL fetching |
| `scripts/html-to-markdown.ts` | Markdown conversion entry point and converter selection |
| `scripts/parsers/index.ts` | Unified parser entry: dispatches URL-specific rules before generic converters |
| `scripts/parsers/types.ts` | Unified parser interface shared by all rule files |
| `scripts/parsers/rules/*.ts` | One file per URL rule, for example X status and X article |
| `scripts/defuddle-converter.ts` | Defuddle-based conversion |
| `scripts/legacy-converter.ts` | Pre-Defuddle legacy extraction and markdown conversion |
| `scripts/markdown-conversion-shared.ts` | Shared metadata parsing and markdown document helpers |

## Preferences (EXTEND.md)

Check EXTEND.md existence (priority order):

```bash
# macOS, Linux, WSL, Git Bash
test -f .baoyu-skills/baoyu-url-to-markdown/EXTEND.md && echo "project"
test -f "${XDG_CONFIG_HOME:-$HOME/.config}/baoyu-skills/baoyu-url-to-markdown/EXTEND.md" && echo "xdg"
test -f "$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md" && echo "user"
```

```powershell
# PowerShell (Windows)
if (Test-Path .baoyu-skills/baoyu-url-to-markdown/EXTEND.md) { "project" }
$xdg = if ($env:XDG_CONFIG_HOME) { $env:XDG_CONFIG_HOME } else { "$HOME/.config" }
if (Test-Path "$xdg/baoyu-skills/baoyu-url-to-markdown/EXTEND.md") { "xdg" }
if (Test-Path "$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md") { "user" }
```

| Path | Location |
|------|----------|
| `.baoyu-skills/baoyu-url-to-markdown/EXTEND.md` | Project directory |
| `${XDG_CONFIG_HOME:-$HOME/.config}/baoyu-skills/baoyu-url-to-markdown/EXTEND.md` | XDG config |
| `$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md` | User home |

| Result | Action |
|--------|--------|
| Found | Read, parse, apply settings |
| Not found | **MUST** run first-time setup (see below) — do NOT silently create defaults |

**EXTEND.md Supports**: Download media by default | Default output directory | Default capture mode | Timeout settings

### First-Time Setup (BLOCKING)

**CRITICAL**: When EXTEND.md is not found, you **MUST use `AskUserQuestion`** to ask the user for their preferences before creating EXTEND.md. **NEVER** create EXTEND.md with defaults without asking. This is a **BLOCKING** operation — do NOT proceed with any conversion until setup is complete.

Use `AskUserQuestion` with ALL questions in ONE call:

**Question 1** — header: "Media", question: "How to handle images and videos in pages?"

- "Ask each time (Recommended)" — After saving markdown, ask whether to download media
- "Always download" — Always download media to local imgs/ and videos/ directories
- "Never download" — Keep original remote URLs in markdown

**Question 2** — header: "Output", question: "Default output directory?"

- "url-to-markdown (Recommended)" — Save to ./url-to-markdown/{domain}/{slug}.md
- (User may choose "Other" to type a custom path)

**Question 3** — header: "Save", question: "Where to save preferences?"

- "User (Recommended)" — ~/.baoyu-skills/ (all projects)
- "Project" — .baoyu-skills/ (this project only)

After user answers, create EXTEND.md at the chosen location, confirm "Preferences saved to [path]", then continue.
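When the user has answered, the file can be written directly. A minimal sketch for a project-level save (key names come from the Supported Keys section below; the chosen values are illustrative only):

```bash
# Sketch: persist project-level preferences chosen during first-time setup.
# Key names follow the Supported Keys table; these values are illustrative.
mkdir -p .baoyu-skills/baoyu-url-to-markdown
cat > .baoyu-skills/baoyu-url-to-markdown/EXTEND.md <<'EOF'
download_media: 1
default_output_dir: ./url-to-markdown/
EOF
```

For a user-level save, write the same content under `$HOME/.baoyu-skills/baoyu-url-to-markdown/` instead.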
Full reference: [references/config/first-time-setup.md](references/config/first-time-setup.md)

### Supported Keys

| Key | Default | Values | Description |
|-----|---------|--------|-------------|
| `download_media` | `ask` | `ask` / `1` / `0` | `ask` = prompt each time, `1` = always download, `0` = never |
| `default_output_dir` | empty | path or empty | Default output directory (empty = `./url-to-markdown/`) |

**EXTEND.md → CLI mapping**:

| EXTEND.md key | CLI argument | Notes |
|---------------|--------------|-------|
| `download_media: 1` | `--download-media` | |
| `default_output_dir: ./posts/` | `--output-dir ./posts/` | Directory path. Do NOT pass to `-o` (which expects a file path) |

**Value priority**:

1. CLI arguments (`--download-media`, `-o`, `--output-dir`)
2. EXTEND.md
3. Skill defaults

## Features

- Chrome CDP for full JavaScript rendering
- Browser strategy fallback: default headless first, then visible Chrome on technical failure
- URL-specific parser layer for sites that need custom HTML rules before generic extraction
- Two capture modes: auto or wait-for-user
- Save rendered HTML as a sibling `-captured.html` file
- Clean markdown output with metadata
- Upgraded Defuddle-first markdown conversion with automatic fallback to the pre-Defuddle extractor from git history
- X/Twitter pages can use HTML-specific parsing for Tweets and Articles, which improves title/body/media extraction on `x.com` / `twitter.com`
- `archive.ph` / related archive mirrors can restore the original URL from `input[name=q]` and prefer `#CONTENT` before falling back to the page body
- Materializes shadow DOM content before conversion so web-component pages survive serialization better
- YouTube pages can include transcript/caption text in the markdown when YouTube exposes a caption track
- If local browser capture fails completely, can fall back to `defuddle.md` and still save markdown
- Handles login-required pages via wait mode
- Download images and videos to local directories

## Usage

```bash
# Auto mode (default) - capture when page loads
${BUN_X} {baseDir}/scripts/main.ts <url>

# Force headless only
${BUN_X} {baseDir}/scripts/main.ts <url> --browser headless

# Force visible browser
${BUN_X} {baseDir}/scripts/main.ts <url> --browser headed

# Wait mode - wait for user signal before capture
${BUN_X} {baseDir}/scripts/main.ts <url> --wait

# Save to specific file
${BUN_X} {baseDir}/scripts/main.ts <url> -o output.md

# Save to a custom output directory (auto-generates filename)
${BUN_X} {baseDir}/scripts/main.ts <url> --output-dir ./posts/

# Download images and videos to local directories
${BUN_X} {baseDir}/scripts/main.ts <url> --download-media
```

## Options

| Option | Description |
|--------|-------------|
| `<url>` | URL to fetch |
| `-o <file>` | Output file path — must be a **file** path, not directory (default: auto-generated) |
| `--output-dir <dir>` | Base output directory — auto-generates `{dir}/{domain}/{slug}.md` (default: `./url-to-markdown/`) |
| `--wait` | Wait for user signal before capturing |
| `--browser <strategy>` | Browser strategy: `auto` (default), `headless`, or `headed` |
| `--headless` | Shortcut for `--browser headless` |
| `--headed` | Shortcut for `--browser headed` |
| `--timeout <ms>` | Page load timeout in milliseconds (default: 30000) |
| `--download-media` | Download image/video assets to local `imgs/` and `videos/`, and rewrite markdown links to local relative paths |

## Capture Modes

| Mode | Behavior | Use When |
|------|----------|----------|
| Auto (default) | Try headless first, then retry in visible Chrome if needed | Public pages, static content, unknown pages |
| Wait (`--wait`) | User signals when ready | Login-required, lazy loading, paywalls |

**Wait mode workflow**:

1. Run with `--wait` → script outputs "Press Enter when ready"
2. Ask user to confirm page is ready
3. Send newline to stdin to trigger capture

**Default browser fallback**:

1. Auto mode starts with headless Chrome and captures on network idle
2. If headless capture fails technically, retry with visible Chrome
3. If a shared Chrome session for this profile already exists, reuse it instead of launching a new browser
4. The script does not hard-code login or paywall detection; the agent must inspect the captured markdown or HTML and decide whether to rerun with `--browser headed --wait`

## Agent Quality Gate

**CRITICAL**: The agent must treat headless capture as provisional. Some sites render differently in headless mode and can silently return an error shell, partially hydrated page, or low-quality extraction **without** causing the CLI to fail. After every run that used `--browser auto` or `--browser headless`, the agent **MUST** inspect the saved markdown first, and inspect the saved `-captured.html` when the markdown looks suspicious.

### Quality checks the agent must perform

1. Confirm the markdown title matches the target page, not a generic site shell
2. Confirm the body contains the expected article or page content, not just navigation, footer, or a generic error
3. Watch for obvious failure signs such as:
   - `Application error`
   - `This page could not be found`
   - login, signup, subscribe, or verification shells
   - extremely short markdown for a page that should be long-form
   - raw framework payloads or mostly boilerplate content
4. If the result is low quality, incomplete, or clearly wrong, do **not** accept the run as successful just because the CLI exited with code 0

### Recovery workflow the agent must follow

1. First run with default `auto` unless there is already a clear reason to use wait mode
2. Review markdown quality immediately after the run
3. If the content is low quality, rerun locally with visible Chrome:
   - `--browser headed` for ordinary rendering issues
   - `--browser headed --wait` when the page may need login, anti-bot interaction, cookie acceptance, or extra hydration time
4. If `--wait` is used, tell the user exactly what to do:
   - if login is required, ask them to sign in
   - if the page needs time to hydrate, ask them to wait until the full content is visible
   - once ready, ask them to press Enter so capture can continue
5. Only fall back to hosted `defuddle.md` after the local browser strategies have failed or are clearly lower fidelity

## Output Format

Each run saves two files side by side:

- Markdown: YAML front matter with `url`, `title`, `description`, `author`, `published`, optional `coverImage`, and `captured_at`, followed by converted markdown content
- HTML snapshot: `*-captured.html`, containing the rendered page HTML captured from Chrome

When Defuddle or page metadata provides a language hint, the markdown front matter also includes `language`.

The HTML snapshot is saved before any markdown media localization, so it stays a faithful capture of the page DOM used for conversion. If the hosted `defuddle.md` API fallback is used, markdown is still saved, but there is no local `-captured.html` snapshot for that run.

## Output Directory

Default: `url-to-markdown/{domain}/{slug}.md`

With `--output-dir ./posts/`: `./posts/{domain}/{slug}.md`

HTML snapshot path uses the same basename:

- `url-to-markdown/{domain}/{slug}-captured.html`
- `./posts/{domain}/{slug}-captured.html`

Filename rules:

- `{slug}`: From page title or URL path (kebab-case, 2-6 words)
- Conflict resolution: Append timestamp `-YYYYMMDD-HHMMSS.md`

When `--download-media` is enabled:

- Images are saved to `imgs/` next to the markdown file
- Videos are saved to `videos/` next to the markdown file
- Markdown media links are rewritten to local relative paths

## Conversion Fallback

Conversion order:

1. Try the URL-specific parser layer first when a site rule matches
2. If no specialized parser matches, try Defuddle
3. For rich pages such as YouTube, prefer Defuddle's extractor-specific output (including transcripts when available) instead of replacing it with the legacy pipeline
4. If Defuddle throws, cannot load, returns obviously incomplete markdown, or captures lower-quality content than the legacy pipeline, automatically fall back to the pre-Defuddle extractor
5. If the agent determines the captured result is a login screen, verification screen, or paywall shell, rerun locally with `--browser headed --wait` and ask the user to complete access before capture
6. If the entire local browser capture flow still fails before markdown can be produced, try the hosted `https://defuddle.md/` API and save its markdown output directly
7. The legacy fallback path uses the older Readability/selector/Next.js-data based HTML-to-Markdown implementation recovered from git history

CLI output will show:

- `Converter: parser:...` when a URL-specific parser succeeded
- `Converter: defuddle` when Defuddle succeeds
- `Converter: legacy:...` plus `Fallback used: ...` when fallback was needed
- `Converter: defuddle-api` when local browser capture failed and the hosted API was used instead

## Media Download Workflow

Based on `download_media` setting in EXTEND.md:

| Setting | Behavior |
|---------|----------|
| `1` (always) | Run script with `--download-media` flag |
| `0` (never) | Run script without `--download-media` flag |
| `ask` (default) | Follow the ask-each-time flow below |

### Ask-Each-Time Flow

1. Run script **without** `--download-media` → markdown saved
2. Check saved markdown for remote media URLs (`https://` in image/video links)
3. **If no remote media found** → done, no prompt needed
4. **If remote media found** → use `AskUserQuestion`:
   - header: "Media", question: "Download N images/videos to local files?"
   - "Yes" — Download to local directories
   - "No" — Keep remote URLs
5. If user confirms → run script **again** with `--download-media` (overwrites markdown with localized links)

## Environment Variables

| Variable | Description |
|----------|-------------|
| `URL_CHROME_PATH` | Custom Chrome executable path |
| `URL_DATA_DIR` | Custom data directory |
| `URL_CHROME_PROFILE_DIR` | Custom Chrome profile directory |

**Troubleshooting**: Chrome not found → set `URL_CHROME_PATH`. Timeout → increase `--timeout`. Complex pages → try `--wait` mode. If markdown quality is poor, inspect the saved `-captured.html` and check whether the run logged a legacy fallback.

### YouTube Notes

- The upgraded Defuddle path uses async extractors, so YouTube pages can include transcript text directly in the markdown body.
- Transcript availability depends on YouTube exposing a caption track. Videos with captions disabled, restricted playback, or blocked regional access may still produce description-only output.
- If the page needs time to finish loading descriptions, chapters, or player metadata, prefer `--wait` and capture after the watch page is fully hydrated.

### Hosted API Fallback

- The hosted fallback endpoint is `https://defuddle.md/`. In shell form: `curl https://defuddle.md/stephango.com`
- Use it only when the local Chrome/CDP capture path fails outright. The local path still has higher fidelity because it can save the captured HTML and handle authenticated pages.
- The hosted API already returns Markdown with YAML frontmatter, so save that response as-is and then apply the normal media-localization step if requested.

## Extension Support

Custom configurations via EXTEND.md. See **Preferences** section for paths and supported options.
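As a concrete aid for step 2 of the Ask-Each-Time Flow above, remote media links can be counted with a simple pattern match. This is only a sketch: the sample file is hypothetical, and the regex covers markdown image syntax with absolute http(s) URLs, not every way a page can embed video:

```bash
# Sketch: count remote media links in saved markdown (illustrative sample file).
# The regex matches markdown image links whose target is an absolute http(s) URL.
cat > /tmp/sample.md <<'EOF'
![cover](https://example.com/cover.png)
![diagram](imgs/diagram.png)
EOF
count=$(grep -cE '!\[[^]]*\]\(https?://' /tmp/sample.md)
echo "remote media links: $count"   # prints "remote media links: 1" for this sample
```

If the count is nonzero, follow step 4 of the flow and ask before rerunning with `--download-media`.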