--- name: web-research description: "Use when extracting article text from news URLs or downloading video or audio with metadata" allowed-tools: [Bash(python3*), Bash(yt-dlp*), Read] version: 1.0.0 author: ykotik license: MIT --- # Web Research ## When to Use - Extracting clean article text and metadata from a news/blog URL - Downloading video or audio from supported sites with full metadata - Getting video/audio metadata without downloading the media ## Tools | Tool | Purpose | Structured output | |------|---------|-------------------| | **newspaper4k** | Extract article text, title, authors, date from URLs | Python object (access `.title`, `.text`, `.authors`) | | **yt-dlp** | Download video/audio from 1000+ sites | `--dump-json` for metadata JSON | ## Patterns ### Extract article text from a URL ```bash python3 -c " from newspaper import Article a = Article('https://example.com/news/article') a.download() a.parse() import json print(json.dumps({'title': a.title, 'authors': a.authors, 'date': str(a.publish_date), 'text': a.text[:2000]}, indent=2)) " ``` ### Extract multiple articles ```bash for url in "https://example.com/article1" "https://example.com/article2"; do python3 -c " from newspaper import Article import json a = Article('$url') a.download() a.parse() print(json.dumps({'url': '$url', 'title': a.title, 'text': a.text[:1000]})) " done ``` ### Get video metadata without downloading ```bash yt-dlp --dump-json "https://www.youtube.com/watch?v=VIDEO_ID" | jq '{title, duration, view_count, upload_date}' ``` ### Download audio only (best quality) ```bash yt-dlp -x --audio-format mp3 "https://www.youtube.com/watch?v=VIDEO_ID" ``` ### Download video (best quality, specific format) ```bash yt-dlp -f "bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]" "https://www.youtube.com/watch?v=VIDEO_ID" ``` ### Download with metadata and subtitles ```bash yt-dlp --write-info-json --write-subs --sub-langs en "https://www.youtube.com/watch?v=VIDEO_ID" ``` ### List available formats for a video ```bash yt-dlp -F "https://www.youtube.com/watch?v=VIDEO_ID" ``` ## Pipelines ### Get video metadata → query with DuckDB ```bash yt-dlp --dump-json "https://www.youtube.com/@channel/videos" --flat-playlist | head -20 > videos.jsonl duckdb -c "SELECT title, view_count, duration FROM read_json_auto('videos.jsonl') ORDER BY view_count DESC LIMIT 10" ``` Each stage: yt-dlp dumps playlist metadata as JSONL, DuckDB queries for top videos by views. ## Prefer Over - Prefer **newspaper4k** over curl + HTML parsing for article extraction — handles boilerplate removal, metadata extraction automatically - Prefer **yt-dlp** over browser downloads — supports 1000+ sites, can extract metadata without downloading ## Do NOT Use When - Need to interact with a web page (click, scroll, fill forms) — use web-crawling skill (Playwright) - Real-time search with the WebSearch tool available — prefer the built-in tool for conversational search - Downloading copyrighted content without authorization - Target site requires authentication — use web-crawling skill with login automation