---
name: "scrapling"
description: "CLI-first web scraping & content extraction with optional MCP server. Use when you have target URLs and need clean, selector-based outputs (html/md/txt)."
---

# Scrapling Skill (VCO)

Scrapling is a Python-based web scraping / extraction toolkit that exposes:
- a **CLI** (`scrapling ...`) for fetching + extracting content into files
- an **optional MCP server** (`scrapling mcp`) so an agent can call structured scraping tools

This skill is **CLI-first**. Prefer it when you already have URLs and need reliable, repeatable extraction (CSS selector → file).

## When to use

Use `scrapling` when you need:

- Extract **specific parts** of a web page (CSS selector / XPath) into `.txt` / `.md` / `.html`
- Run **repeatable scraping jobs** (batch URLs with a small wrapper script)
- Reduce token usage by extracting only the relevant DOM region before passing to the LLM
- Provide a local MCP endpoint for scraping tools (agent → MCP → scrapling)

## Boundaries (vs Playwright / Search)

### vs `playwright`
- `scrapling`: best for “get URL → extract selector → write file” workflows; simpler, faster iteration
- `playwright`: best for interactive UI flows (login, multi-step navigation, downloads, complex JS actions, stateful sessions)

If you must *navigate* or *click through* a UI, use `playwright`.
If you can directly fetch the target page and just need extraction, use `scrapling`.

### vs search tools
- Search tools are for discovering sources/URLs (query → result list → choose URLs).
- `scrapling` is for acquisition + extraction once you already know the URL(s).

A common pipeline:
1) Search → find candidate URLs
2) Scrapling → extract focused content from chosen URLs
3) LLM → summarize / transform / analyze extracted outputs

## Prerequisite check (required)

1) Python version (Scrapling requires Python >= 3.10):
```powershell
python --version
```

2) Scrapling CLI availability:
```powershell
scrapling --help
```

## Installation (recommended)

Scrapling’s CLI and MCP features are enabled via extras.

Recommended (CLI + MCP + fetchers):
```powershell
python -m pip install "scrapling[ai]"
```

If you only want CLI fetch/extract without MCP:
```powershell
python -m pip install "scrapling[fetchers]"
```

If you use browser-based fetchers, you may need browser binaries:
```powershell
# Option A: via Scrapling helper (after install)
scrapling install

# Option B: directly via Playwright
python -m playwright install
```

## Wrapper script (Windows convenience)

This skill ships a thin PowerShell wrapper:
- `C:/Users/羽裳/.codex/skills/scrapling/scripts/scrapling.ps1`

It checks whether `scrapling` exists and prints install hints if missing.

## Common CLI patterns

### 1) Extract full page body (to Markdown)
```powershell
scrapling extract get "https://example.com" out.md
```

### 2) Extract a specific element (CSS selector) to text
```powershell
scrapling extract get "https://example.com" out.txt --css-selector "main article"
```

### 3) Extract HTML for downstream parsing
```powershell
scrapling extract get "https://example.com" out.html --css-selector "#content"
```

### 4) Use browser-backed fetcher mode (when simple GET is blocked / dynamic)
```powershell
scrapling extract fetch "https://example.com" out.md --css-selector "main"
```

Tip: keep outputs in files and only feed the smallest relevant snippet to the LLM.

## MCP server relationship (optional)

Scrapling can run as an MCP server. This is useful when:
- the agent needs tool-style scraping calls
- you want scraping results to be structured and deterministic

Start MCP server (stdio transport by default):
```powershell
scrapling mcp
```

Optional: run MCP server with HTTP transport:
```powershell
scrapling mcp --http --host 127.0.0.1 --port 8765
```

### Example MCP server config snippet

```json
{
  "servers": {
    "scrapling": {
      "mode": "stdio",
      "command": "scrapling",
      "args": ["mcp"],
      "required": false,
      "note": "Requires: python -m pip install \"scrapling[ai]\""
    }
  }
}
```

## Safety & ops notes

- Prefer selector-based extraction to minimize data volume.
- Treat scraping as an external dependency: handle timeouts, retries, and failures explicitly.
- For aggressive bot protection, consider switching fetchers or using `playwright`.