---
name: article-extractor
description: Extract clean article content from URLs, removing ads, navigation, and clutter. Save as readable text files for research, archiving, or offline reading.
original_source: https://github.com/michalparkola/tapestry-skills-for-claude-code/tree/main/article-extractor
author: michalparkola
license: MIT
---

# Article Extractor Skill

This skill extracts clean article content from web URLs, strips ads, navigation, sidebars, and other clutter, and saves the result as a readable text file.

## When to Use This Skill

- Downloading article text from a URL
- Saving blog posts as clean text
- Removing distractions from web articles
- Archiving content for offline reading
- Extracting content for research
- Creating a local reading library

## How to Use

### Basic Extraction
```
Extract the article from https://example.com/article
```

### Save to Specific Location
```
Extract this article and save to ~/reading/
https://example.com/interesting-post
```

### Multiple Articles
```
Extract these articles:
- https://example.com/post-1
- https://example.com/post-2
- https://example.com/post-3
```

## Extraction Methods

The skill tries the following tools in priority order, moving to the next one when a tool is not installed:

### 1. Reader (Mozilla Readability)
- Uses the algorithm behind Firefox Reader View
- Excellent at removing clutter
- Preserves article structure

### 2. Trafilatura (Python)
- Accurate main-text extraction
- Works well for blogs and news sites
- Useful options: `--no-comments` (skip comment sections), `--precision` (favor less noise over more text)

### 3. Fallback (curl + parsing)
- Nothing to install beyond curl
- Basic HTML parsing
- Less reliable than the tools above, but usable when neither is installed
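
The priority order above can be sketched as a small shell helper that returns the first tool found on the `PATH`. This is a minimal illustration, not the skill's actual implementation: `first_available` is a hypothetical helper, and the command names `reader` and `trafilatura` are assumptions about what the installed tools expose.

```bash
# Print the first command from the argument list that is installed.
# Returns non-zero if none of them is found.
first_available() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      return 0
    fi
  done
  return 1
}

# Assumed command names; adjust them to match your installation.
extractor=$(first_available reader trafilatura curl) || extractor=""
echo "Selected extractor: ${extractor:-none found}"
```

Listing the tools in one place keeps the priority order easy to change without touching the lookup logic.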

## What Gets Preserved

- Article text and paragraphs
- Section headings
- Author information
- Publication date
- Article structure

## What Gets Removed

- Navigation bars
- Advertisements
- Newsletter signup forms
- Sidebars
- Comments sections
- Social sharing buttons
- Cookie notices
- Related article widgets

## Filename Generation

Filenames are generated from the article title:
1. Start from the article title (cleaned)
2. Remove or replace special characters (/, :, ?, ", <, >, |)
3. Limit the length to 80-100 characters
4. Add the `.txt` extension

**Example:**
```
"How to Build a Great Product: A Guide"
  → "How to Build a Great Product - A Guide.txt"
```
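
The naming rules above can be approximated with a short shell function. This is a sketch under stated assumptions: `sanitize_filename` is a hypothetical helper, a colon is turned into ` - ` to match the example, the remaining special characters are dropped, and the cap is set to 100, the upper end of the stated range.

```bash
# Turn an article title into a safe filename.
# Assumptions: ":" becomes " - "; / ? " < > | are dropped; cap is 100.
sanitize_filename() {
  printf '%s\n' "$1" \
    | sed -e 's/ *: */ - /g' -e 's#[/?"<>|]##g' \
    | cut -c1-100 \
    | sed -e 's/ *$//' -e 's/$/.txt/'
}

sanitize_filename "How to Build a Great Product: A Guide"
# prints: How to Build a Great Product - A Guide.txt
```

Note that some `cut` implementations count bytes rather than characters, so titles with multibyte characters may be truncated slightly earlier than expected.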

## Output Format

Each extracted article is saved in this layout:
```
Title: [Article Title]
Author: [Author Name]
Date: [Publication Date]
Source: [Original URL]

---

[Clean article content...]
```
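
As an illustration of the layout above, the following sketch assembles the metadata header and body with `printf`. The helper name `format_article` and the sample values are placeholders, not part of the skill.

```bash
# Emit the metadata header, a "---" separator, and the article body.
# Arguments (hypothetical helper): title, author, date, source URL, body.
format_article() {
  printf 'Title: %s\nAuthor: %s\nDate: %s\nSource: %s\n\n---\n\n%s\n' \
    "$1" "$2" "$3" "$4" "$5"
}

# Placeholder values for demonstration only.
format_article "Example Post" "Jane Doe" "2024-01-15" \
  "https://example.com/article" "First paragraph of the article."
```

Redirecting the output to a file (for example `> "Example Post.txt"`) would produce the saved text file.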

## Error Handling

The skill handles:
- **Paywalled content**: Extracts the available preview
- **Missing tools**: Falls back to alternatives
- **Invalid URLs**: Provides a clear error message
- **Failed extraction**: Suggests copying the text manually
- **Filename issues**: Auto-sanitizes problematic characters
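
For the invalid-URL case, a shallow syntactic check before any download might look like the sketch below. `is_valid_url` is a hypothetical helper; it only confirms an `http` or `https` scheme followed by a host, and does not contact the server.

```bash
# Return 0 if the argument looks like an http(s) URL with a host, else 1.
is_valid_url() {
  case "$1" in
    *" "*) return 1 ;;                  # spaces are not valid in a URL
    http:///*|https:///*) return 1 ;;   # scheme present but host missing
    http://?*|https://?*) return 0 ;;   # scheme plus at least one more character
    *) return 1 ;;
  esac
}

url="example.com/article"
if ! is_valid_url "$url"; then
  echo "Error: '$url' is not a valid http(s) URL" >&2
fi
```

Rejecting malformed input early lets the error message name the actual problem instead of surfacing a confusing failure from the extraction tool.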

## Advanced Options

### With Metadata Only
```
Extract just the title and author from this URL
```

### Specific Format
```
Extract this article as markdown
```

### Research Mode
```
Extract and summarize the key points from this article
```

## Best Practices

1. **Check Output**: Always verify extraction quality
2. **Save Originals**: Keep the source URL for reference
3. **Organize Files**: Use meaningful folder structures
4. **Batch Processing**: Extract multiple related articles together
5. **Respect Copyright**: Use for personal research only

## Dependencies

For best results, install:
```bash
# Mozilla Readability
npm install -g @nicolo-ribaudo/readability-cli

# Or Trafilatura (Python)
pip install trafilatura
```

If neither tool is installed, the skill falls back to curl with basic HTML parsing.