---
name: blog-scraper
description: Fetch and compress blog articles from tech-lab.sios.jp into the doc/ directory with token usage statistics and OGP metadata
---

# Blog Scraper Skill

## Overview

This skill fetches blog articles from `tech-lab.sios.jp/archives/*`, compresses the HTML content by removing unnecessary attributes and whitespace, and saves the result to the `doc/` directory with metadata.

## When to Use

- User requests to fetch a specific blog article
- User wants to update existing cached articles
- User needs to scrape multiple articles for analysis or documentation

## Usage

### Single Article

```bash
URL=https://tech-lab.sios.jp/archives/[article-id] npm run scraper
```

Example:
```bash
URL=https://tech-lab.sios.jp/archives/48397 npm run scraper
```

### Multiple Articles

For multiple articles, run the command sequentially for each URL.

## Output

The scraper will:

1. **Fetch and parse** the HTML from the specified URL
2. **Extract content** using the CSS selector `section.entry-content`
3. **Compress** by removing:
   - Scripts, styles, and noscript tags
   - Class, ID, and style attributes
   - Whitespace between tags
4. **Preserve**:
   - Image alt text as `[画像: alt]`
   - Image src URLs
   - Link href attributes
5. **Add metadata** as HTML comment:
   - OGP title
   - Source URL
   - OGP image URL
   - Extraction timestamp
6. **Save** to `docs/data/tech-lab-sios-jp-archives-[id].html`
7. **Report** compression statistics:
   - Token count reduction (estimated for Claude)
   - Compression ratio percentages
   - File size

## Cache Behavior

- If the target HTML file already exists in `docs/data/`, the scraper **skips fetching** and reports the existing file size
- To re-fetch, delete the existing HTML file first

## Token Estimation

The scraper estimates Claude token usage for Japanese content:
- Hiragana/Katakana: ~1.5 chars/token
- Kanji: ~1 char/token
- ASCII: ~4 chars/token
- Other: ~2 chars/token

Typical compression achieves 60-85% token reduction.

## Implementation Details

See `application/tools/scraper.ts` for the TypeScript implementation using:
- `node-fetch` for HTTP requests
- `cheerio` for HTML parsing
- OGP metadata extraction
- Custom token estimation for Japanese text

## Permissions Required

This skill requires the following permissions in `.claude/settings.local.json`:

```json
{
  "permissions": {
    "allow": [
      "Bash(npm run scraper:*)",
      "Bash(URL=:*)"
    ]
  }
}
```

**Note:** The `Bash(URL=:*)` permission uses prefix matching to allow any URL environment variable pattern. This is a broad permission - consider restricting to specific domains if needed for security.