# Scrapling MCP Server - Project Documentation ## Table of Contents 1. [Project Overview](#1-project-overview) 2. [Project Scope](#2-project-scope) 3. [Architecture & Structure](#3-architecture--structure) 4. [Key Components](#4-key-components) 5. [MCP Tools Reference](#5-mcp-tools-reference) 6. [Stealth Levels](#6-stealth-levels) 7. [Configuration](#7-configuration) 8. [Security Considerations](#8-security-considerations) 9. [Development Guidelines](#9-development-guidelines) 10. [Usage Examples](#10-usage-examples) --- ## 1. Project Overview ### What is the Scrapling MCP Server? The Scrapling MCP Server is a Model Context Protocol (MCP) server that provides web scraping capabilities through an integrated stealth-aware scraping engine. Built on top of the [FastMCP](https://github.com/jlowin/fastmcp) framework and leveraging the [scrapling](https://github.com/D4Vinci/Scrapling) library, this server exposes powerful web scraping tools that AI agents and applications can invoke through a standardized MCP interface. The project bridges the gap between AI agents that need to fetch web content and the complex reality of modern web scraping—including anti-bot protections, JavaScript rendering requirements, Cloudflare challenges, and the need for stealthy request patterns. ### Scrapling Library The server leverages [Scrapling](https://github.com/D4Vinci/Scrapling), an adaptive web scraping framework with **9.1k GitHub stars** that provides multiple fetcher types: | Fetcher | Use Case | |---------|----------| | `Fetcher` | Fast HTTP requests with TLS fingerprinting and HTTP/3 support | | `DynamicFetcher` | Full browser automation using Playwright | | `StealthyFetcher` | Advanced anti-bot bypass using Camoufox (modified Firefox) | | `AsyncStealthySession` | Concurrent stealth browsing with tab pooling | ### Purpose and Goals The primary purpose of this MCP server is to enable AI agents to: - **Fetch web content reliably** from websites with varying levels of anti-bot protection - **Render JavaScript** when necessary to access dynamically loaded content - **Bypass common anti-bot measures** through configurable stealth settings - **Handle session-based scraping** for websites requiring authentication or stateful interactions - **Extract structured data** using CSS selectors from scraped pages The project aims to provide a balance between: - **Ease of use** - Simple API for common scraping tasks - **Flexibility** - Extensive configuration options for advanced use cases - **Reliability** - Built-in retry logic and error handling - **Security** - URL validation and safe defaults ### Key Features and Capabilities | Feature | Description | |---------|-------------| | **JavaScript Rendering** | Full browser-based rendering for dynamic content | | **Stealth Modes** | Multiple pre-configured stealth levels (Minimal, Standard, Maximum) | | **Cloudflare Support** | Automatic Cloudflare challenge detection and solving | | **Session Management** | Persistent sessions for stateful scraping | | **Proxy Rotation** | Support for proxy lists with automatic rotation | | **Retry Logic** | Exponential backoff with configurable retry attempts | | **CSS Extraction** | Structured data extraction using CSS selectors | | **URL Validation** | Built-in SSRF protection and security checks | | **MCP Integration** | Native MCP protocol support for AI agent integration | | **Spider Framework** | Scrapy-like API with async callbacks, concurrent crawling, and pause/resume support | | **Adaptive Parsing** | Smart element tracking 
that survives website design changes |
| **Camoufox Integration** | Modified Firefox browser with stealth patches for maximum anti-detection |

---

## 2. Project Scope

### What the Project Does

The Scrapling MCP Server provides a collection of MCP tools that allow AI agents to:

1. **Simple Scraping** - Fetch a URL and retrieve HTML content
2. **Stealth Scraping** - Fetch URLs with configurable anti-detection measures
3. **Session-Based Scraping** - Maintain cookies and state across multiple requests
4. **Structured Extraction** - Extract specific data using CSS selectors
5. **Batch Scraping** - Process multiple URLs in sequence

### Target Use Cases

The server is designed for the following use cases:

- **AI Agent Web Research** - Enabling AI agents to gather information from the web
- **Data Collection** - Automated gathering of publicly available web data
- **Content Aggregation** - Building datasets from multiple web sources
- **Monitoring & Alerting** - Watching web pages for changes
- **API Alternative** - Accessing websites that lack public APIs

### Supported Scraping Modes

#### Simple Mode
- Basic HTTP requests without stealth features
- Fastest performance
- Suitable for well-behaved websites without anti-bot protection
- No JavaScript rendering

#### Stealth Mode
- Configurable anti-detection features
- User-Agent randomization
- Human-like behavior simulation
- Headless browser automation via Camoufox (a stealth-patched Firefox)

#### Session-Based Mode
- Persistent cookie storage
- State maintenance across requests
- Authentication handling
- Ideal for authenticated scraping

### What's In Scope

- **HTTP/HTTPS scraping** with JavaScript rendering support
- **Stealth configuration** with multiple preset levels
- **Session management** for stateful interactions
- **Error handling** with automatic retry logic
- **URL validation** for security
- **MCP protocol** integration

### What's Out of Scope

- **Credential management** - While sessions and authenticated scraping are supported, storing and managing credentials is outside scope
- **CAPTCHA solving** - No built-in CAPTCHA solving capabilities (Cloudflare challenges only)
- **Distributed scraping** - Single-instance operation
- **Data storage** - The server fetches and returns data but doesn't persist it
- **Legal compliance** - Users are responsible for ensuring their scraping activities are legal

---

## 3.
Architecture & Structure ### High-Level Architecture ``` ┌─────────────────────────────────────────────────────────────────────────┐ │ AI Agent / Client │ │ (Claude, GPT, or other MCP clients) │ └─────────────────────────────────┬───────────────────────────────────────┘ │ MCP Protocol (JSON-RPC) ▼ ┌─────────────────────────────────────────────────────────────────────────┐ │ MCP Server (FastMCP) │ │ ┌─────────────────────────────────────────────────────────────────┐ │ │ │ MCP Tools Layer │ │ │ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────────┐ │ │ │ │ │ scrape │ │ stealth │ │ session │ │ extract │ │ │ │ │ │ simple │ │ scrape │ │ scrape │ │ structured │ │ │ │ │ └───────────┘ └───────────┘ └───────────┘ └───────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────────────────────────────────────────────────────┐ │ │ │ Core Logic Layer │ │ │ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │ │ │ │ │ Stealth │ │ Session │ │ Retry & Error │ │ │ │ │ │ Config │ │ Management │ │ Handling │ │ │ │ │ └─────────────┘ └─────────────┘ └─────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────┘ │ │ │ │ └──────────────────────────────────┼──────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────────────┐ │ Scrapling Integration Layer │ │ ┌─────────────────────────────────────────────────────────────────┐ │ │ │ AsyncStealthySession (scrapling library) │ │ │ │ ┌──────────────┐ ┌──────────────┐ ┌────────────────────────┐ │ │ │ │ │ Browser │ │ HTTP │ │ Anti-Detection │ │ │ │ │ │ Pool │ │ Client │ │ │ │ Features │ │ │ └──────────────┘ └──────────────┘ └────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────┘ │ │ │ │ └──────────────────────────────────┼──────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────────────┐ │ Target Website │ │ (Any HTTP/HTTPS accessible URL) │ └─────────────────────────────────────────────────────────────────────────┘ ``` ### Directory Structure ``` mcp-scraper/ ├── .env.example # Example environment configuration ├── .gitignore # Git ignore patterns ├── pyproject.toml # Python project configuration ├── README.md # Project README ├── src/ │ └── mcp_scraper/ │ ├── __init__.py # Package initialization │ ├── config.py # Configuration classes and settings │ └── stealth.py # Stealth utilities and scraping logic └── tests/ # Test directory (to be added) ``` ### Key Components and Their Responsibilities #### MCP Server (`__init__.py`) - **Responsibility**: Package initialization and exports - **Exports**: Settings, StealthConfig, version information #### Configuration Module (`config.py`) - **Responsibility**: Define settings and configuration classes - **Components**: - `StealthConfig` dataclass - Detailed stealth configuration options - `Settings` class - Environment-based settings using Pydantic - `StealthProfiles` class - Pre-configured stealth profiles #### Stealth Module (`stealth.py`) - **Responsibility**: Core scraping logic, session management, and utilities - **Components**: - `StealthConfig` class - Stealth configuration with all options - `StealthLevel` enum - Preset stealth levels (MINIMAL, STANDARD, MAXIMUM) - `scrape_with_retry()` - Main scraping function with retry logic - `get_session()` - Session management - `validate_url()` - URL security validation - `format_response()` - Response formatting 
utility

### Data Flow

1. **Request Receipt**: Client sends MCP request with URL and optional parameters
2. **URL Validation**: System validates URL for security (SSRF protection)
3. **Configuration**: Stealth settings are applied based on parameters
4. **Session Management**: Get or create stealth session
5. **Scraping**: Execute HTTP request through scrapling engine
6. **Error Handling**: On a failed attempt, apply retry logic and return to step 5
7. **Response Processing**: Format response with requested data
8. **Return**: Send formatted response back to client

---

## 4. Key Components

### MCP Server (FastMCP)

The MCP server is built using FastMCP, a modern framework for creating MCP servers in Python. FastMCP provides:

- **Simple tool definition** using decorators
- **Automatic type conversion** between Python and JSON
- **Built-in error handling** for tool execution
- **Async support** for concurrent operations

The server exposes scraping functionality as MCP tools that clients can invoke.

### Scrapling Integration

The server integrates with the [scrapling](https://github.com/D4Vinci/Scrapling) library, which provides:

- **AsyncStealthySession**: An async session with built-in anti-detection features and tab pooling
- **StealthyFetcher**: Advanced anti-bot bypass using Camoufox (modified Firefox)
- **Page object**: Unified interface for accessing page content
- **Browser automation**: Headless browser with stealth features
- **JavaScript rendering**: Full DOM rendering for dynamic content
- **Spider Framework**: Scrapy-like API with concurrent crawling, pause/resume, and streaming mode
- **Adaptive Parsing**: Smart element tracking that survives website design changes

### Stealth Configuration

The stealth system provides multiple configuration options:

| Option | Description | Default |
|--------|-------------|---------|
| `headless` | Run browser in headless mode | `True` |
| `solve_cloudflare` | Attempt Cloudflare challenges | `False` |
| `humanize` | Human-like behavior simulation | `True` |
| `humanize_duration` | Maximum cursor movement duration in seconds | `1.5` |
| `geoip` | GeoIP-based routing | `False` |
| `os_randomize` | Randomize OS fingerprint | `True` |
| `block_webrtc` | Block WebRTC to prevent IP leaks | `True` |
| `allow_webgl` | Allow WebGL fingerprinting | `True` |
| `google_search` | Simulate Chrome browser | `True` |
| `block_images` | Block image loading | `False` |
| `block_ads` | Block advertisements | `True` |
| `disable_resources` | Disable CSS/JS resources | `False` |
| `network_idle` | Wait for network inactivity before returning | `False` |
| `load_dom` | Wait for DOMContentLoaded event | `False` |
| `wait_selector` | Wait for specific element to appear | `None` |
| `wait_selector_state` | Element state to wait for (visible/hidden/attached) | `None` |
| `timeout` | Request timeout in milliseconds | `30000` |
| `proxy` | Proxy URL for requests | `None` |

### Session Management

The session management system (sketched below):

- **Global session cache**: Maintains a single session instance
- **Config-aware**: Recreates session when configuration changes
- **Proper cleanup**: Ensures resources are released on close
- **Cookie persistence**: Maintains cookies across requests
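The cache-and-recreate behavior can be pictured with a short sketch. This is illustrative only: `_create_session` is a hypothetical factory standing in for however `stealth.py` actually builds an `AsyncStealthySession`, and the config comparison relies on `StealthConfig` being a dataclass (so `==` compares fields):

```python
from typing import Optional

# Module-level cache (hypothetical names; the real stealth.py may differ)
_session: Optional["AsyncStealthySession"] = None
_session_config: Optional["StealthConfig"] = None

async def get_session(config):
    """Return a shared AsyncStealthySession, recreating it when the config changes."""
    global _session, _session_config
    if _session is not None and _session_config == config:
        return _session           # reuse: keeps cookies and a stable fingerprint
    if _session is not None:
        await close_session()     # config changed: dispose of the old browser first
    _session = await _create_session(config)  # hypothetical factory wrapping scrapling
    _session_config = config
    return _session

async def close_session():
    """Release browser resources held by the cached session."""
    global _session, _session_config
    if _session is not None:
        await _session.close()    # assumes the session exposes an async close()
        _session = None
        _session_config = None
```

Reusing one session is also what makes cookie persistence work: as long as the config is unchanged, every request goes through the same browser instance.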
### Error Handling & Retry Logic

The retry system implements:

1. **Exponential backoff**: Delay increases exponentially between retries
2. **Proxy rotation**: Automatic proxy switching on block detection
3. **Cloudflare handling**: Detection and optional solving of challenges
4. **Block detection**: Identifies when requests are blocked
5. **Custom exceptions**: Specific error types for different failure modes

**Exception Hierarchy**:

```
ScrapeError (base)
├── CloudflareError - Cloudflare protection detected
├── BlockedError - Request blocked by anti-bot
└── TimeoutError - Request timed out
```
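The hierarchy and the backoff behavior above can be sketched together. `_fetch_page` is a hypothetical single-attempt helper, not part of the documented API, and the real `scrape_with_retry` may differ in detail:

```python
import asyncio

class ScrapeError(Exception):
    """Base class for all scraping failures."""

class CloudflareError(ScrapeError):
    """Cloudflare protection detected."""

class BlockedError(ScrapeError):
    """Request blocked by anti-bot measures."""

class TimeoutError(ScrapeError):  # intentionally shadows the builtin, per the hierarchy above
    """Request timed out."""

async def scrape_with_retry(url, config, max_retries=3, backoff_factor=2.0, proxy_list=None):
    """Scrape with exponential backoff, rotating proxies between attempts."""
    last_error = None
    for attempt in range(max_retries + 1):
        if proxy_list:
            config.proxy = proxy_list[attempt % len(proxy_list)]  # rotate on each attempt
        try:
            return await _fetch_page(url, config)  # hypothetical single-attempt fetch
        except (CloudflareError, BlockedError, TimeoutError) as exc:
            last_error = exc
            if attempt < max_retries:
                # 1s, 2s, 4s, ... for backoff_factor=2.0
                await asyncio.sleep(backoff_factor ** attempt)
    raise last_error
```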
|
| `stealth_level` | string | No | Stealth level (default: "standard") |

**Selector Syntax**:

The `selectors` parameter supports CSS selector syntax with the following extensions (a parsing sketch follows the examples below):

| Syntax | Description | Example |
|--------|-------------|---------|
| `selector` | Extract text content | `"title": "h1"` |
| `selector::html` | Extract HTML content | `"content": "div::html"` |
| `selector::text` | Extract text via the `::text` pseudo-element | `"text": "p::text"` |
| `selector::attr(name)` | Extract attribute value | `"link": "a::attr(href)"` |
| `selector@attr` | Extract attribute (alternative syntax) | `"image": "img@src"` |
| `selector@attr1@attr2` | Extract multiple attributes | `"data": "img@src@alt"` |

**Example with dict input**:

```json
{
  "url": "https://example.com/blog",
  "selectors": {
    "title": "h1.article-title",
    "content": "div.article-content",
    "author": "span.author-name",
    "date": "time.publish-date",
    "link": "a.read-more::attr(href)",
    "image": "img.featured@src@alt"
  }
}
```

**Example with JSON string input**:

```json
{
  "url": "https://example.com/blog",
  "selectors": "{\"title\": \"h1.article-title\", \"content\": \"div.article-content\", \"link\": \"a.read-more::attr(href)\"}"
}
```
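One plausible way to interpret this syntax is a small parser that splits each spec into a CSS selector plus an extraction mode. This is an illustrative sketch, not necessarily how the server implements it:

```python
import json
import re

def parse_selector(spec: str):
    """Split one selector spec into (css, extraction) per the documented syntax."""
    # `a.read-more::attr(href)` -> single-attribute extraction
    m = re.match(r"^(.*)::attr\(([\w-]+)\)$", spec)
    if m:
        return m.group(1), ("attrs", [m.group(2)])
    # `img@src@alt` -> one or more attributes (alternative syntax)
    if "@" in spec:
        css, *attrs = spec.split("@")
        return css, ("attrs", attrs)
    if spec.endswith("::html"):
        return spec[: -len("::html")], ("html", None)
    if spec.endswith("::text"):
        return spec[: -len("::text")], ("text", None)
    return spec, ("text", None)  # bare selector: default to text content

def normalize_selectors(selectors):
    """Accept either a dict or its JSON-string form, as the tool does."""
    if isinstance(selectors, str):
        selectors = json.loads(selectors)
    return {name: parse_selector(spec) for name, spec in selectors.items()}
```

For example, `normalize_selectors({"image": "img.featured@src@alt"})` yields `{"image": ("img.featured", ("attrs", ["src", "alt"]))}`.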
#### Tool: `scrape_batch`

Scrape multiple URLs in sequence.

**When to use**:
- Processing multiple pages
- Building site-wide datasets
- Bulk data collection

**Parameters**:

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `urls` | array | Yes | List of URLs to scrape |
| `stealth_level` | string | No | Stealth level (default: "standard") |
| `delay` | float | No | Delay between requests in seconds (default: 1.0) |

**Example**:

```json
{
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
  ],
  "stealth_level": "minimal",
  "delay": 2.0
}
```

---

## 6. Stealth Levels

### Overview

The server provides three pre-configured stealth levels, each balancing speed, anonymity, and success rate differently.

### Minimal Stealth

**Profile**: `StealthLevel.MINIMAL` or `get_minimal_stealth()`

| Setting | Value |
|---------|-------|
| Headless | Yes |
| Humanize | No |
| Humanize Duration | N/A |
| Cloudflare solving | No |
| OS randomization | No |
| WebRTC blocking | No |
| Chrome simulation | No |
| Image blocking | Yes |
| Resource disabling | Yes |
| Ad blocking | Yes |
| Network Idle | No |
| Load DOM | No |
| Timeout | 15s |

**When to use**:
- Simple websites without anti-bot protection
- High-speed scraping where stealth is not critical
- Testing and development
- Static content and APIs

**Performance**: Fastest - suitable for high-volume scraping of cooperative sites

### Standard Stealth

**Profile**: `StealthLevel.STANDARD` or `get_standard_stealth()`

| Setting | Value |
|---------|-------|
| Headless | Yes |
| Humanize | Yes |
| Humanize Duration | 1.5s |
| Cloudflare solving | No |
| OS randomization | Yes |
| WebRTC blocking | Yes |
| Chrome simulation | Yes |
| Image blocking | No |
| Resource disabling | No |
| Ad blocking | Yes |
| Network Idle | Yes |
| Load DOM | Yes |
| Timeout | 30s |

**When to use**:
- Most web scraping tasks
- Sites with basic anti-bot protection
- General-purpose scraping
- Balance of speed and anonymity required

**Performance**: Moderate - suitable for most common scraping scenarios

### Maximum Stealth

**Profile**: `StealthLevel.MAXIMUM` or `get_maximum_stealth()`

| Setting | Value |
|---------|-------|
| Headless | Yes |
| Humanize | Yes |
| Humanize Duration | 1.5s |
| Cloudflare solving | Yes |
| OS randomization | Yes |
| WebRTC blocking | Yes |
| Chrome simulation | Yes |
| Image blocking | No |
| Resource disabling | No |
| Ad blocking | Yes |
| GeoIP routing | Yes |
| Network Idle | Yes |
| Load DOM | Yes |
| Wait Selector | body |
| Wait Selector State | visible |
| Timeout | 60s |

**When to use**:
- Heavily protected websites
- Cloudflare-protected sites
- Rate-limited endpoints
- Maximum anonymity required
- Challenging anti-bot systems

**Performance**: Slowest - but highest success rate on protected sites

### Configuration Options

You can also create custom stealth configurations:

```python
from mcp_scraper.stealth import StealthConfig

custom_config = StealthConfig(
    headless=True,
    solve_cloudflare=True,
    humanize=True,
    humanize_duration=1.5,
    geoip=False,
    os_randomize=True,
    block_webrtc=True,
    allow_webgl=True,
    google_search=True,
    block_images=False,
    block_ads=True,
    disable_resources=False,
    network_idle=True,
    load_dom=True,
    wait_selector="body",
    wait_selector_state="visible",
    timeout=45000,
    proxy="http://proxy:8080"
)
```
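To make a preset easy to select from tool parameters such as `stealth_level`, a small dispatcher can map level names onto the profile helpers. A minimal sketch of what `get_stealth_config()` could look like, assuming the `StealthLevel` enum uses the lowercase string values the MCP tools accept:

```python
from mcp_scraper.stealth import (
    StealthConfig,
    StealthLevel,
    get_minimal_stealth,
    get_standard_stealth,
    get_maximum_stealth,
)

_PROFILES = {
    StealthLevel.MINIMAL: get_minimal_stealth,
    StealthLevel.STANDARD: get_standard_stealth,
    StealthLevel.MAXIMUM: get_maximum_stealth,
}

def get_stealth_config(level: str | StealthLevel = StealthLevel.STANDARD) -> StealthConfig:
    """Resolve a level name (e.g. "maximum" from an MCP call) to a preset config."""
    if isinstance(level, str):
        level = StealthLevel(level.lower())  # assumes enum values "minimal"/"standard"/"maximum"
    return _PROFILES[level]()

config = get_stealth_config("maximum")
```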
Configuration

### Environment Variables

Create a `.env` file based on `.env.example`:

```bash
# Proxy URL for requests (optional)
# Format: http://user:pass@host:port or socks5://host:port
PROXY_URL=

# Default timeout for requests in seconds (1-300)
DEFAULT_TIMEOUT=30

# Logging level: DEBUG, INFO, WARNING, ERROR, CRITICAL
LOG_LEVEL=INFO

# Maximum number of retries for failed requests (0-10)
MAX_RETRIES=3
```
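These variables feed the Pydantic-based `Settings` class in `config.py`. The sketch below shows one way that class could look with `pydantic-settings`, matching the variables and ranges above; the field details are assumptions, not the project's actual definition:

```python
from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    """Settings read from the environment and .env; pydantic-settings matches
    field names to variables like PROXY_URL case-insensitively."""

    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    proxy_url: str | None = None                      # PROXY_URL
    default_timeout: int = Field(30, ge=1, le=300)    # DEFAULT_TIMEOUT (seconds)
    log_level: str = "INFO"                           # LOG_LEVEL
    max_retries: int = Field(3, ge=0, le=10)          # MAX_RETRIES

settings = Settings()  # raises a validation error if a variable is out of range
```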
### Configuration Files

#### pyproject.toml

The project uses `pyproject.toml` for Python package configuration:

```toml
[project]
name = "mcp-scraper"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "scrapling[all]",
    "fastmcp>=2.0",
    "httpx>=0.25",
    "pydantic>=2.0",
    "pydantic-settings>=2.0",
    "python-dotenv>=1.0",
    "loguru>=0.7",
]
```

### Proxy Setup

#### Single Proxy

Set via environment variable:

```bash
PROXY_URL=http://proxy.example.com:8080
```

Or programmatically:

```python
config = StealthConfig(proxy="http://proxy.example.com:8080")
```

#### Proxy Rotation

For proxy rotation, pass a list of proxies to the scraping function:

```python
proxy_list = [
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://proxy3:8080",
]

page = await scrape_with_retry(
    url="https://example.com",
    proxy_list=proxy_list,
    max_retries=3
)
```

#### Supported Proxy Formats

| Protocol | Format |
|----------|--------|
| HTTP | `http://host:port` |
| HTTPS | `https://host:port` |
| SOCKS5 | `socks5://host:port` |
| With auth | `http://user:pass@host:port` |

### Timeout Settings

#### Request Timeout

Set per-request or globally:

```python
# Per-request (in milliseconds)
page = await session.fetch(url, timeout=60000)

# Global (via Settings)
# DEFAULT_TIMEOUT=60 in .env (in seconds)
```

**Recommended values**:

| Scenario | Timeout |
|----------|---------|
| Simple static pages | 15-30s (15000-30000ms) |
| Standard scraping | 30-45s (30000-45000ms) |
| Complex JavaScript | 45-60s (45000-60000ms) |
| Slow/blocked sites | 60-120s (60000-120000ms) |

---

## 8. Security Considerations

### URL Validation

The server implements robust URL validation to prevent Server-Side Request Forgery (SSRF) attacks:

**Allowed**:
- `http://` and `https://` protocols only
- Public IP addresses
- Public domain names

**Blocked**:
- `file://`, `ftp://`, and other protocols
- Private IP addresses (10.x.x.x, 172.16-31.x.x, 192.168.x.x)
- Localhost variants (localhost, 127.0.0.1, ::1)
- Internal hostnames (*.local, *.internal, *.corp)
- Link-local addresses (169.254.x.x)

The validation function `validate_url()` is called automatically before any scraping operation.

### Proxy Security

When using proxies:

1. **Use trusted proxies** - Avoid free/public proxy lists
2. **Encrypt credentials** - Don't hardcode proxy credentials
3. **Validate proxy URLs** - Ensure proxy URLs are well-formed
4. **Rotate responsibly** - Don't abuse proxy rotation

### Rate Limiting

To avoid overwhelming target sites:

1. **Use appropriate delays** - Add randomized delays between requests
2. **Implement backoff** - Use exponential backoff on failures
3. **Respect robots.txt** - Check and follow site policies
4. **Monitor responses** - Watch for rate limit indicators

### Legal Compliance

**Users are responsible for**:
- Ensuring their scraping activities are legal in their jurisdiction
- Respecting website Terms of Service
- Complying with robots.txt directives
- Not bypassing authentication mechanisms they don't have access to
- Handling personal data appropriately

**Best practices**:
- Only scrape publicly available data
- Identify your scraper in the User-Agent when appropriate
- Cache responses to minimize repeated requests
- Consider using official APIs when available

---

## 9. Development Guidelines

### Best Practices for Undetectable Scraping

Follow these best practices to maximize scraping success while minimizing detection:

1. **Always use sessions**: Reuse browser instances to maintain consistent fingerprints
2. **Enable geoip with proxies**: Match browser locale to proxy location for better anonymity
3. **Use solve_cloudflare sparingly**: Only when needed - it increases detection surface and slows down requests
4. **Implement exponential backoff**: Slow down after failures; only ramp the request rate back up after sustained success
5. **Rotate user agents**: Even with Camoufox, periodic rotation helps avoid pattern detection
6. **Monitor for blocks**: Track 403/429 responses and adjust strategy accordingly
7. **Enable network_idle and load_dom**: Wait for the page to fully load before extracting data
8. **Use wait_selector for dynamic content**: Wait for specific elements to appear before extraction

#### Recommended Configuration Patterns

```python
# Pattern 1: Simple static pages
simple_config = StealthConfig(
    headless=True,
    disable_resources=True,
    timeout=10000
)

# Pattern 2: Protected sites (Cloudflare)
protected_config = StealthConfig(
    headless=True,
    solve_cloudflare=True,
    humanize=True,
    geoip=True,
    os_randomize=True,
    timeout=60000,
    google_search=True,
    network_idle=True,
    load_dom=True
)

# Pattern 3: High-anonymity scraping
anonymous_config = StealthConfig(
    headless=True,
    block_webrtc=True,
    block_images=True,
    disable_resources=True,
    os_randomize=True,
    geoip=True,
    solve_cloudflare=True,
    humanize=True,
    humanize_duration=1.5,
    proxy=rotation.next()  # `rotation` is a placeholder for your rotating-proxy helper
)

# Pattern 4: Debugging (visible browser)
debug_config = StealthConfig(
    headless=False,   # Visible browser
    timeout=120000    # Long timeout for manual intervention
)
```

### How to Extend the Server

#### Adding New Tools

To add a new MCP tool, follow this pattern:

```python
from fastmcp import FastMCP
from mcp_scraper.stealth import (
    format_response,
    get_standard_stealth,
    scrape_with_retry,
    validate_url,
)

mcp = FastMCP("My Scraper")

@mcp.tool()
async def scrape_with_custom_option(
    url: str,
    custom_option: bool = False
) -> dict:
    """Description of what this tool does.

    Args:
        url: The URL to scrape
        custom_option: Description of custom option

    Returns:
        Dictionary with scraping results
    """
    # Validate URL
    if not validate_url(url):
        raise ValueError(f"Invalid URL: {url}")

    # Get stealth config
    config = get_standard_stealth()

    # Apply custom options
    if custom_option:
        pass  # Custom logic goes here

    # Scrape
    page = await scrape_with_retry(url, config)

    # Format and return
    return format_response(page, url)
```

#### Adding New Stealth Profiles

Add new preset configurations in `config.py`:

```python
@staticmethod
def custom_profile() -> StealthConfig:
    """Custom profile description.

    Suitable for: Your specific use case
    """
    return StealthConfig(
        # Custom settings
        humanize=True,
        # ...
other options ) ``` #### Adding Error Types Extend the exception hierarchy in `stealth.py`: ```python class RateLimitError(ScrapeError): """Exception raised when rate limited.""" pass ``` ### Testing Approach **Unit Tests**: - Test URL validation - Test configuration classes - Test response formatting **Integration Tests**: - Test scraping with mock servers - Test retry logic - Test error handling **Example test structure**: ``` tests/ ├── __init__.py ├── test_config.py ├── test_stealth.py ├── test_validation.py └── test_integration.py ``` ### Code Style The project follows: - **Black** for code formatting (100 character line length) - **Ruff** for linting - **MyPy** for type checking - **PEP 8** naming conventions **Key style rules**: - Use type hints for all function parameters and return values - Use dataclasses for configuration objects - Use async/await for I/O operations - Use Loguru for logging - Document all public functions with docstrings **Pre-commit hooks**: ```bash pip install pre-commit pre-commit install ``` --- ## 10. Usage Examples ### Basic Scraping Example Simple scraping of a static webpage: ```python from mcp_scraper.stealth import scrape_with_retry, format_response async def basic_example(): url = "https://example.com" # Simple scrape page = await scrape_with_retry(url) # Format response result = format_response(page, url) print(f"Title: {result.get('title')}") print(f"Text content: {result.get('text')[:500]}") print(f"Status: {result.get('status')}") ``` ### Stealth Scraping Example Scraping a protected website with maximum stealth: ```python from mcp_scraper.stealth import ( scrape_with_retry, get_maximum_stealth, format_response ) async def stealth_example(): url = "https://protected-site.com/data" # Use maximum stealth config = get_maximum_stealth() try: page = await scrape_with_retry( url, config=config, max_retries=3 ) result = format_response(page, url) print(f"Success! 
Content length: {len(result.get('html', ''))}") except Exception as e: print(f"Scraping failed: {e}") ``` ### Batch Scraping Example Processing multiple URLs: ```python from mcp_scraper.stealth import scrape_with_retry, format_response, validate_url import asyncio async def batch_example(): urls = [ "https://example.com/page1", "https://example.com/page2", "https://example.com/page3", ] results = [] delay = 1.0 # Delay between requests for url in urls: # Validate first if not validate_url(url): print(f"Skipping invalid URL: {url}") continue try: page = await scrape_with_retry(url) result = format_response(page, url) results.append(result) print(f"Scraped: {url}") except Exception as e: print(f"Failed to scrape {url}: {e}") # Delay between requests await asyncio.sleep(delay) print(f"Successfully scraped {len(results)}/{len(urls)} URLs") return results ``` ### Structured Extraction Example Extracting specific data using CSS selectors: ```python from mcp_scraper.stealth import scrape_with_retry, format_response async def structured_example(): url = "https://example.com/blogposts" # Define selectors for data extraction selectors = { "titles": "h2.post-title", "authors": "span.author", "dates": "time.published", "summaries": "p.summary", "links": "a.read-more@href" } # Scrape with selectors page = await scrape_with_retry(url, selectors=selectors) # Get formatted response with extracted data result = format_response(page, url, selectors=selectors) # Access extracted data extracted = result.get("selectors", {}) for i, title in enumerate(extracted.get("titles", [])): print(f"Post {i+1}: {title}") print(f" Author: {extracted.get('authors', [None])[i]}") print(f" Date: {extracted.get('dates', [None])[i]}") ``` ### Custom Configuration Example Using custom stealth settings: ```python from mcp_scraper.stealth import StealthConfig, scrape_with_retry async def custom_config_example(): # Create custom stealth configuration config = StealthConfig( headless=True, solve_cloudflare=True, # Attempt Cloudflare challenges humanize=True, humanize_duration=1.5, geoip=False, os_randomize=True, block_webrtc=True, allow_webgl=True, google_search=True, block_images=True, # Reduce bandwidth block_ads=True, disable_resources=False, network_idle=True, load_dom=True, timeout=45000, proxy="http://my-proxy:8080" # Use specific proxy ) url = "https://cloudflare-protected-site.com" page = await scrape_with_retry(url, config=config, max_retries=5) print(f"Success! Content: {page.text[:200]}") ``` ### Proxy Rotation Example Using multiple proxies with automatic rotation: ```python from mcp_scraper.stealth import scrape_with_retry, get_standard_stealth async def proxy_rotation_example(): # List of proxy servers proxy_list = [ "http://proxy1:8080", "http://proxy2:8080", "http://proxy3:8080", ] config = get_standard_stealth() try: page = await scrape_with_retry( url="https://example.com", config=config, max_retries=3, proxy_list=proxy_list ) print(f"Success with proxy rotation!") except Exception as e: print(f"All proxies failed: {e}") ``` --- ## Appendix: API Reference ### Core Classes #### `StealthConfig` Configuration class for stealth web scraping. 
**Attributes**: - `headless` (bool): Run browser in headless mode - `solve_cloudflare` (bool): Attempt Cloudflare challenges - `humanize` (bool): Add human-like behavior - `humanize_duration` (float): Maximum cursor movement duration - `geoip` (bool): Use geoIP-based routing - `os_randomize` (bool): Randomize OS fingerprint - `block_webrtc` (bool): Block WebRTC - `allow_webgl` (bool): Allow WebGL - `google_search` (bool): Simulate Chrome - `block_images` (bool): Block images - `block_ads` (bool): Block advertisements - `disable_resources` (bool): Disable CSS/JS - `network_idle` (bool): Wait for network inactivity - `load_dom` (bool): Wait for DOMContentLoaded - `wait_selector` (str): Wait for specific element - `wait_selector_state` (str): Element state to wait for - `timeout` (int): Request timeout in milliseconds - `proxy` (str): Proxy URL #### `StealthLevel` Enum for preset stealth levels. **Values**: - `MINIMAL`: Fast, minimal protection - `STANDARD`: Balanced protection - `MAXIMUM`: Highest protection ### Core Functions #### `scrape_with_retry(url, config, max_retries, backoff_factor, proxy_list, selectors)` Scrape a URL with retry logic. **Parameters**: - `url` (str): URL to scrape - `config` (StealthConfig): Stealth configuration - `max_retries` (int): Maximum retry attempts - `backoff_factor` (float): Exponential backoff multiplier - `proxy_list` (list): List of proxy URLs - `selectors` (dict): CSS selectors for extraction **Returns**: Page object **Raises**: `ScrapeError`, `CloudflareError`, `BlockedError`, `TimeoutError` #### `validate_url(url)` Validate URL for security. **Parameters**: - `url` (str): URL to validate **Returns**: bool - True if URL is safe #### `format_response(page, url, selectors)` Format scraping response. **Parameters**: - `page` (Page): Scraped page object - `url` (str): Original URL - `selectors` (dict): Optional CSS selectors **Returns**: dict with response data #### `get_element_text(element)` Extract text content from a scraping element with fallbacks. **Parameters**: - `element` (Any): A page element object from scrapling **Returns**: str - The text content of the element **Description**: Checks for `.text` property first, then `.inner_text`, and falls back to `str()`. #### `get_element_html(element)` Extract HTML content from a scraping element with fallbacks. **Parameters**: - `element` (Any): A page element object from scrapling **Returns**: str - The HTML content of the element **Description**: Checks for `.html` property first, then `.innerHTML`. #### `get_element_attribute(element, attribute)` Extract an attribute value from a scraping element with fallbacks. **Parameters**: - `element` (Any): A page element object from scrapling - `attribute` (str): The name of the attribute to retrieve **Returns**: str | None - The attribute value, or None if not found **Description**: Checks for `.get_attribute()` method first, then direct property access. 
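To tie the reference back to Section 8, here is a minimal sketch of the kind of checks `validate_url()` performs. The exact implementation in `stealth.py` may differ; this version simply enforces the documented allow/block lists:

```python
import ipaddress
import socket
from urllib.parse import urlparse

BLOCKED_SUFFIXES = (".local", ".internal", ".corp")

def validate_url(url: str) -> bool:
    """Return True only for public http(s) URLs (SSRF guard)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # rejects file://, ftp://, and other protocols
    host = parsed.hostname or ""
    if host == "localhost" or host.endswith(BLOCKED_SUFFIXES):
        return False
    try:
        # Resolve the hostname so DNS entries pointing at private ranges can't slip through
        addr = ipaddress.ip_address(socket.gethostbyname(host))
    except (socket.gaierror, ValueError):
        return False
    # Rejects 10/8, 172.16/12, 192.168/16, 127/8, 169.254/16, and similar ranges
    return not (addr.is_private or addr.is_loopback or addr.is_link_local or addr.is_reserved)
```

Resolving the hostname before checking the address matters: a public-looking domain can resolve to a private IP, so validating only the URL string would leave an SSRF hole.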
---

## Additional Resources

- [FastMCP Documentation](https://github.com/jlowin/fastmcp)
- [Scrapling Library](https://github.com/D4Vinci/Scrapling)
- [MCP Specification](https://spec.modelcontextprotocol.io/)

---

*This documentation was generated for the Scrapling MCP Server project.*