--- name: get-web-page-raw description: Capture a web page as raw material (extracted text/Markdown) with metadata, storing it in the raw/ directory for later distillation. --- # Get Web Page (Raw) ## When to use Use this skill when: - The user provides a URL to capture as source material - You need to store a web page for later distillation - You're building up raw materials for the distillation pipeline **Keywords:** capture, web page, URL, fetch, raw ## Inputs Required: - `url` (string): The URL of the web page to capture Optional: - `title_hint` (string): A human-provided hint for the title if extraction fails ## Outputs This skill produces: 1. A new file in `raw/` with the naming pattern: `YYYYMMDD-HHMMSSZ--.md` 2. Metadata returned to the agent: - `raw_path`: full path to the created file - `title`: extracted or inferred title - `canonical_url`: canonical URL if found, otherwise the input URL - `captured_at`: ISO timestamp (UTC) ## Procedure ### 1. Generate timestamp and slug - Get current UTC timestamp: `date -u +"%Y%m%d-%H%M%SZ"` - Create a slug from the page title or URL path: - Use the main topic/title in lowercase - Replace spaces and special chars with hyphens - Keep it concise (typically 3-8 words) - Example: `patterns-for-ai-assisted-software-development` ### 2. Fetch the web page content **Primary method: fetch_webpage tool** - Call `fetch_webpage` with the URL and a descriptive query about the content - This extracts main content, removes navigation/ads, and converts to clean Markdown - Captures metadata like title and author when available **Alternative methods (when needed):** - **Playwright MCP:** For complex interactive pages, JavaScript-heavy sites, or when you need precise browser control - **curl + HTML parsing:** For simple pages when fetch_webpage isn't available, though requires manual Markdown conversion - **Browser tools:** Built-in browser navigation and snapshot capabilities **Note:** Most public articles and blog posts work well with fetch_webpage ### 3. Build front matter Create YAML front matter with required and available fields: ```yaml --- title: "" source_url: "" captured_at: "" capture_type: web_page capture_tool: fetch_webpage # or browser if used raw_format: markdown status: captured author: "" # optional published_at: "" # optional tags: ["keyword1", "keyword2"] # optional, 3-5 relevant tags --- ``` **Tips:** - Use the page's actual title, not a description - Convert published_at to simple date format (YYYY-MM-DD) - Tags should be broad topics, not exhaustive keywords ### 4. Write the file - Filename: `--.md` (e.g., `20260102-095107Z--patterns-for-ai-assisted-software-development.md`) - Full path: `raw/` - Content: front matter + blank line + extracted content - Use `create_file` tool with the full content ### 5. Confirm to user Provide a brief confirmation: - Link to the created file using relative path - One-sentence summary of what the article covers (2-3 key topics) ## Examples ### Example 1: Dev.to article capture **Input:** ``` url: "https://dev.to/javatarz/patterns-for-ai-assisted-software-development-4ga2" ``` **Actions:** 1. Generate timestamp: `20260102-095107Z` 2. Fetch with `fetch_webpage` tool 3. Extract title: "Patterns for AI assisted software development" 4. Create slug: `patterns-for-ai-assisted-software-development` 5. Write to: `raw/20260102-095107Z--patterns-for-ai-assisted-software-development.md` **Output:** ```yaml --- title: "Patterns for AI assisted software development" source_url: "https://dev.to/javatarz/patterns-for-ai-assisted-software-development-4ga2" captured_at: "2026-01-02T09:51:07Z" capture_type: web_page capture_tool: fetch_webpage raw_format: markdown status: captured author: "Karun Japhet" published_at: "2025-12-27" tags: ["ai", "programming", "productivity", "patterns"] --- ``` ## Edge cases **fetch_webpage fails:** - Check if authentication is required - Try browser-based extraction if needed - Inform user if the page cannot be accessed **Missing metadata:** - If no title is extracted, use the URL path or ask user for a title hint - If no author/date visible, omit those fields - Minimal front matter is acceptable **Paywalled or restricted content:** - Inform user about the restriction - Capture what's visible if useful (preview text) - Consider asking if they have alternative access ## References - Parent workflow: `docs/distillation/distillation-pipeline.md` - Front matter schema: see "Raw file front matter" in distillation-pipeline.md - Next step after capture: use `make-distilled` skill to process this raw file