---
name: scrape-posts
description: Scrape new articles from Milan Jovanovic's blog (November 2025+). Optimized - pre-filters from listing page, only scrapes new articles.
argument-hint: "[--force] [--since YYYY-MM-DD] [--limit N] [--dry-run]"
allowed-tools: Read, Bash, Skill, mcp__firecrawl__firecrawl_scrape, mcp__firecrawl__firecrawl_map, mcp__firecrawl__firecrawl_search
---

# Scrape Milan Jovanovic Blog Posts

Scrape new articles from Milan Jovanovic's .NET blog with **optimized pre-filtering**. The command parses publication dates from the listing page first, so only articles that actually need it are scraped individually.

## Arguments

- `--force`: Re-scrape all articles, including already indexed ones (content hashes are compared so unchanged articles are not rewritten)
- `--since YYYY-MM-DD`: Custom date filter (default: 2025-11-01)
- `--limit N`: Limit number of articles (for testing)
- `--dry-run`: Preview what would be scraped without saving

## Optimized Workflow

### Step 1: Invoke Skill

Invoke the `milan-jovanovic:milan-jovanovic-blog` skill to load context and access scripts.

### Step 2: Pre-Filter from Listing Page (OPTIMIZATION)

**Key efficiency optimization:** Parse dates from the listing page BEFORE scraping individual articles.

1. Scrape the blog listing page using `firecrawl_scrape`:

   ```text
   URL: https://www.milanjovanovic.tech/blog
   Format: markdown
   ```

2. Save the listing content to a temp file (e.g., `.claude/temp/milan-listing.md`)

3. Run pre-filter script to identify articles needing scraping:

   ```bash
   # Normal mode - only new articles
   python scripts/core/check_new_articles.py .claude/temp/milan-listing.md --json --since 2025-11-01

   # Force mode - include existing for re-check
   python scripts/core/check_new_articles.py .claude/temp/milan-listing.md --json --force --since 2025-11-01
   ```

4. Parse JSON output to get `to_scrape` list. If empty, skip to Step 5 (no scraping needed).
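Handling the pre-filter output might look like the sketch below. Only `to_scrape`, `in_index`, and `content_hash` are named in this document; the `url` field, the sample data, and the helper `split_to_scrape` are hypothetical, and the real output of `check_new_articles.py` may be shaped differently.

```python
import json


def split_to_scrape(raw_json: str):
    """Split pre-filter output into new articles and force-mode re-checks."""
    data = json.loads(raw_json)
    to_scrape = data.get("to_scrape", [])
    # in_index: false -> new article; in_index: true -> force-mode re-check
    new = [a for a in to_scrape if not a.get("in_index")]
    recheck = [a for a in to_scrape if a.get("in_index")]
    return new, recheck


# Hypothetical sample; field names other than to_scrape, in_index,
# and content_hash are assumptions made for illustration.
sample = json.dumps({
    "to_scrape": [
        {"url": "https://example.com/blog/a", "in_index": False},
        {"url": "https://example.com/blog/b", "in_index": True,
         "content_hash": "abc123"},
    ]
})

new_articles, recheck_articles = split_to_scrape(sample)
print(len(new_articles), len(recheck_articles))
```

An empty `to_scrape` list yields two empty lists, which corresponds to the "skip to Step 5" case.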

### Step 3: Scrape Only Needed Articles

For each article in `to_scrape`:

1. **For articles with `in_index: false`** (new):
   - Scrape full article with `firecrawl_scrape`
   - Extract publication date from metadata
   - Clean promotional content
   - Save to `canonical/milanjovanovic-tech/blog/{slug}.md`

2. **For articles with `in_index: true`** (force mode re-check):
   - Scrape full article with `firecrawl_scrape`
   - Clean promotional content
   - Generate content hash
   - Compare to `content_hash` from pre-filter output
   - If unchanged, skip writing (log as "skipped - unchanged")
   - If changed, save updated content
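The force-mode comparison above can be sketched as follows. This assumes the hash is SHA-256 over the cleaned content's UTF-8 bytes; the algorithm actually used by the index is not stated here, so treat `content_hash` and `should_write` as hypothetical helpers rather than the scripts' own functions.

```python
import hashlib
from typing import Optional


def content_hash(cleaned: str) -> str:
    # Assumption: SHA-256 of the UTF-8 bytes. The real index may use a
    # different algorithm or normalize the content before hashing.
    return hashlib.sha256(cleaned.encode("utf-8")).hexdigest()


def should_write(cleaned: str, indexed_hash: Optional[str]) -> bool:
    """Return True when the article is new or its content has changed."""
    if indexed_hash is None:
        return True  # no hash recorded, so nothing to compare against
    return content_hash(cleaned) != indexed_hash


old = content_hash("# Post\n\nOriginal body\n")
print(should_write("# Post\n\nOriginal body\n", old))  # unchanged content
print(should_write("# Post\n\nEdited body\n", old))    # changed content
```

Hashing the content after promotional cleanup, rather than the raw scrape, keeps rotating sponsor blocks from registering as article changes.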

### Step 4: Update Index

After scraping completes:

```bash
python scripts/management/refresh_index.py
```

### Step 5: Report Statistics

Report:

- Articles found on listing page
- Articles needing scraping (new + force re-check)
- Articles skipped (already indexed, not in force mode)
- Articles skipped (unchanged content hash, force mode)
- Articles filtered (before cutoff date)
- Any errors

## Content Cleanup Patterns

The scraper removes these promotional patterns:

**Footer patterns (stop processing):**

- "Whenever you're ready, there are X ways I can help you"
- "Become a Better .NET Software Engineer"
- "Hi, I'm Milan"

**Sponsor patterns (remove section):**

- AuthKit/WorkOS mentions
- "Sponsor this newsletter" links
- Incident response sponsor content

**Inline patterns (remove):**

- Reading time ("5 min read")
- "Manage read history" links
- Empty image placeholders
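A minimal sketch of the footer and inline rules, for illustration only. The regular expressions are assumptions derived from the phrases listed above, not the patterns used by the actual scripts, and sponsor-section removal is left out because the section boundaries are not described here.

```python
import re

# Footer markers: everything from the first match onward is dropped.
FOOTER_MARKERS = [
    re.compile(r"Whenever you['’]re ready, there are \w+ ways I can help you"),
    re.compile(r"Become a Better \.NET Software Engineer"),
    re.compile(r"Hi, I['’]m Milan"),
]
# Inline patterns (assumed shapes): a line holding only a reading time,
# and markdown images with an empty target.
READING_TIME = re.compile(r"^\s*\d+\s+min read\s*$")
EMPTY_IMAGE = re.compile(r"!\[[^\]]*\]\(\s*\)")


def clean_article(markdown: str) -> str:
    kept = []
    for line in markdown.splitlines():
        if any(p.search(line) for p in FOOTER_MARKERS):
            break  # footer reached: stop processing
        if READING_TIME.match(line):
            continue  # drop reading-time lines
        kept.append(EMPTY_IMAGE.sub("", line))
    return "\n".join(kept).strip() + "\n"


sample = "# Title\n5 min read\nBody text\n![]()\nHi, I'm Milan\nfooter\n"
print(clean_article(sample))
```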

## Efficiency Gains

| Scenario | Without Optimization | With Optimization |
|----------|----------------------|-------------------|
| No new articles | 10+ firecrawl requests | 1-2 requests |
| 1 new article | 10+ firecrawl requests | 2-3 requests |
| Force (unchanged) | 10+ requests | 10+ requests but skips writes |

**Why this matters:** Firecrawl has API costs and rate limits. Pre-filtering cuts requests by roughly 80-90% when the listing has few or no new articles. Force mode still scrapes every article and only saves on writes.

## Example Usage

```text
/milan-jovanovic:scrape-posts
/milan-jovanovic:scrape-posts --limit 3 --dry-run
/milan-jovanovic:scrape-posts --force
/milan-jovanovic:scrape-posts --since 2025-12-01
```

## Troubleshooting

### Firecrawl Not Available

If firecrawl MCP is not connected, the command will fail. Ensure the firecrawl MCP server is configured and running.

### Date Parsing Issues

If a listing page date can't be parsed, the script reports the article under the `no_date` category. These articles are skipped unless you provide a specific URL.
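To show how an article can end up in `no_date`, here is a rough sketch of date bucketing. The accepted date formats, the inclusive cutoff, the bucket names other than `no_date`, and the helpers `parse_listing_date` and `bucket` are all assumptions; check the script itself for the real patterns.

```python
from datetime import date, datetime
from typing import Optional

# Assumed formats; the listing page may use something else entirely.
DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y")


def parse_listing_date(text: str) -> Optional[date]:
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # try the next format
    return None


def bucket(text: str, since: date) -> str:
    parsed = parse_listing_date(text)
    if parsed is None:
        return "no_date"
    # Assumption: the --since cutoff is inclusive.
    return "candidate" if parsed >= since else "before_cutoff"


cutoff = date(2025, 11, 1)
for raw in ("November 5, 2025", "2025-10-31", "last week"):
    print(raw, "->", bucket(raw, cutoff))
```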

### Pre-Filter Shows 0 Articles

If `check_new_articles.py` shows 0 articles to scrape:

- All articles are already indexed (use `--force` to re-check)
- All articles are before the cutoff date (adjust `--since`)
- Listing page format changed (check regex patterns in script)