---
name: article-extractor
description: Extract clean article content from URLs (blog posts, articles, tutorials) and save as readable text. Use when user wants to download, extract, or save an article/blog post from a URL without ads, navigation, or clutter.
allowed-tools:
  - Bash
  - Write
---

# Article Extractor

This skill extracts the main content from web articles and blog posts, removing navigation, ads, newsletter signups, and other clutter. It saves the result as clean, readable text.

## When to Use This Skill

Activate when the user:
- Provides an article/blog URL and wants the text content
- Asks to "download this article"
- Wants to "extract the content from [URL]"
- Asks to "save this blog post as text"
- Needs clean article text without distractions

## How It Works

### Priority Order:

1. **Check if tools are installed** (reader or trafilatura)
2. **Download and extract article** using best available tool
3. **Clean up the content** (remove extra whitespace, format properly)
4. **Save to file** with article title as filename
5. **Confirm location** and show preview

## Installation Check

Check for article extraction tools in this order:

### Option 1: reader (Recommended - Mozilla's Readability)

```bash
command -v reader
```

If not installed:
```bash
npm install -g @mozilla/readability-cli
# or
npm install -g reader-cli
```

### Option 2: trafilatura (Python-based, very good)

```bash
command -v trafilatura
```

If not installed:
```bash
pip3 install trafilatura
```

### Option 3: Fallback (curl + simple parsing)

If neither tool is available, fall back to curl plus basic text extraction. This is less reliable, but it works without extra dependencies.

## Extraction Methods

### Method 1: Using reader (Best for most articles)

```bash
# Extract article
reader "URL" > article.txt
```

**Pros:**
- Based on Mozilla's Readability algorithm
- Excellent at removing clutter
- Preserves article structure

### Method 2: Using trafilatura (Best for blogs/news)

```bash
# Extract article
trafilatura --URL "URL" --output-format txt > article.txt

# Or with more options
trafilatura --URL "URL" --output-format txt --no-comments --no-tables > article.txt
```

**Pros:**
- Very accurate extraction
- Good with various site structures
- Handles multiple languages

**Options:**
- `--no-comments`: Skip comment sections
- `--no-tables`: Skip data tables
- `--precision`: Favor precision over recall
- `--recall`: Extract more content (may include some noise)

### Method 3: Fallback (curl + basic parsing)

```bash
# Download and extract basic content
curl -s "URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside'}

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in {'p', 'article', 'main', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}:
            self.in_content = True

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only after content has started and outside skipped elements
        if self.in_content and self.skip_depth == 0 and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > article.txt
```

**Note:** This is less reliable but works without dependencies.

## Getting Article Title

Extract the title to use as the filename:

### Using reader:
```bash
# reader outputs markdown with title at top
TITLE=$(reader "URL" | head -n 1 | sed 's/^# //')
```

### Using trafilatura:
```bash
# Get metadata including title
TITLE=$(trafilatura --URL "URL" --json | python3 -c "import json, sys; print(json.load(sys.stdin)['title'])")
```

### Using curl (fallback):
```bash
# grep -P requires GNU grep
TITLE=$(curl -s "URL" | grep -oP '<title>\K[^<]+' | sed 's/ - .*//' | sed 's/ | .*//')
```

## Filename Creation

Clean the title so it is safe to use as a filename:

```bash
# Get title
TITLE="Article Title from Website"

# Clean for filesystem (replace or remove special chars, limit length)
# Note: tr needs -d to delete characters; an empty replacement set is an error
FILENAME=$(echo "$TITLE" | tr '/' '-' | tr ':' '-' | tr '|' '-' | tr -d '?"<>' | cut -c 1-100 | sed 's/ *$//')

# Add extension
FILENAME="${FILENAME}.txt"
```

## Complete Workflow

```bash
ARTICLE_URL="https://example.com/article"

# Check for tools
if command -v reader &> /dev/null; then
    TOOL="reader"
    echo "Using reader (Mozilla Readability)"
elif command -v trafilatura &> /dev/null; then
    TOOL="trafilatura"
    echo "Using trafilatura"
else
    TOOL="fallback"
    echo "Using fallback method (may be less accurate)"
fi

# Extract article
case $TOOL in
    reader)
        # Get content
        reader "$ARTICLE_URL" > temp_article.txt

        # Get title (first line after # in markdown)
        TITLE=$(head -n 1 temp_article.txt | sed 's/^# //')
        ;;

    trafilatura)
        # Get title from metadata
        METADATA=$(trafilatura --URL "$ARTICLE_URL" --json)
        TITLE=$(echo "$METADATA" | python3 -c "import json, sys; print(json.load(sys.stdin).get('title', 'Article'))")

        # Get clean content
        trafilatura --URL "$ARTICLE_URL" --output-format txt --no-comments > temp_article.txt
        ;;

    fallback)
        # Get title
        TITLE=$(curl -s "$ARTICLE_URL" | grep -oP '<title>\K[^<]+' | head -n 1)
        TITLE=${TITLE%% - *}  # Remove site name
        TITLE=${TITLE%% | *}  # Remove site name (alternate)

        # Get content (basic extraction)
        curl -s "$ARTICLE_URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside', 'form'}

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            if tag in {'p', 'article', 'main'}:
                self.in_content = True
            if tag in {'h1', 'h2', 'h3'}:
                self.content.append('\n')

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only after content has started and outside skipped elements
        if self.in_content and self.skip_depth == 0 and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > temp_article.txt
        ;;
esac

# Clean filename (tr -d deletes characters; an empty replacement set is an error)
FILENAME=$(echo "$TITLE" | tr '/' '-' | tr ':' '-' | tr '|' '-' | tr -d '?"<>' | cut -c 1-80 | sed 's/ *$//' | sed 's/^ *//')
# Fall back to a generic name if no title was found
FILENAME="${FILENAME:-Article}.txt"

# Move to final filename
mv temp_article.txt "$FILENAME"

# Show result
echo "✓ Extracted article: $TITLE"
echo "✓ Saved to: $FILENAME"
echo ""
echo "Preview (first 10 lines):"
head -n 10 "$FILENAME"
```

## Error Handling

### Common Issues

**1. Tool not installed**
- Try alternate tool (reader → trafilatura → fallback)
- Offer to install: "Install reader with: npm install -g reader-cli"

**2. Paywall or login required**
- Extraction tools may fail
- Inform user: "This article requires authentication. Cannot extract."

**3. Invalid URL**
- Check URL format
- Try with and without redirects

**4. No content extracted**
- Site may use heavy JavaScript
- Try fallback method
- Inform user if extraction fails

**5. Special characters in title**
- Clean title for filesystem
- Remove: `/`, `:`, `?`, `"`, `<`, `>`, `|`
- Replace with `-` or remove

## Output Format

### Saved File Contains:
- Article title (if available)
- Author (if available from tool)
- Main article text
- Section headings
- No navigation, ads, or clutter

### What Gets Removed:
- Navigation menus
- Ads and promotional content
- Newsletter signup forms
- Related articles sidebars
- Comment sections (optional)
- Social media buttons
- Cookie notices

## Tips for Best Results

**1. Use reader for most articles**
- Best all-around tool
- Based on Firefox Reader View
- Works on most news sites and blogs

**2. Use trafilatura for:**
- Academic articles
- News sites
- Blogs with complex layouts
- Non-English content

**3. Fallback method limitations:**
- May include some noise
- Less accurate paragraph detection
- Better than nothing for simple sites

**4. Check extraction quality:**
- Always show preview to user
- Ask if it looks correct
- Offer to try different tool if needed

## Example Usage

**Simple extraction:**
```bash
# User: "Extract https://example.com/article"
reader "https://example.com/article" > temp.txt
TITLE=$(head -n 1 temp.txt | sed 's/^# //')
FILENAME="$(echo "$TITLE" | tr '/' '-').txt"
mv temp.txt "$FILENAME"
echo "✓ Saved to: $FILENAME"
```

**With error handling:**
```bash
if ! reader "$URL" > temp.txt 2>/dev/null; then
    if command -v trafilatura &> /dev/null; then
        trafilatura --URL "$URL" --output-format txt > temp.txt
    else
        echo "Error: Could not extract article. Install reader or trafilatura."
        exit 1
    fi
fi
```

## Best Practices

- ✅ Always show preview after extraction (first 10 lines)
- ✅ Verify extraction succeeded before saving
- ✅ Clean filename for filesystem compatibility
- ✅ Try fallback method if primary fails
- ✅ Inform user which tool was used
- ✅ Keep filename length reasonable (< 100 chars)

## After Extraction

Display to user:
1. "✓ Extracted: [Article Title]"
2. "✓ Saved to: [filename]"
3. Show preview (first 10-15 lines)
4. File size and location

Ask if needed:
- "Would you like me to also create a Ship-Learn-Next plan from this?" (if using ship-learn-next skill)
- "Should I extract another article?"
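
The filename cleaning described under Filename Creation can also be wrapped in a small reusable function so that every extraction path cleans titles the same way. The following is a minimal sketch, not part of reader or trafilatura: `clean_filename` is a hypothetical helper name, and the 80-character limit and the `article` fallback name are assumptions chosen to match the workflow in this document.

```bash
# Hypothetical helper: turn an article title into a filesystem-safe .txt name.
# Uses only sed, cut, and printf, so it should also run under plain POSIX sh.
clean_filename() {
    cleaned=$(printf '%s\n' "$1" |
        sed -e 's#[/:|]#-#g' -e 's#[?"<>]##g' |   # replace / : | with -, drop ? " < >
        cut -c 1-80 |                               # keep at most 80 characters (assumed limit)
        sed -e 's#^ *##' -e 's# *$##')              # trim leading and trailing spaces
    # Fall back to a generic name (an assumption) if nothing usable is left.
    printf '%s.txt\n' "${cleaned:-article}"
}

clean_filename 'What/Why: A "Guide"?'   # prints: What-Why- A Guide.txt
```

Calling it as `FILENAME=$(clean_filename "$TITLE")` right before the `mv` step would keep the sanitizing rules in one place; the limit and fallback name can be adjusted to taste.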