---
name: website-to-vite-scraper
description: Multi-provider website scraper that converts any website (including CSR/SPA) to deployable static sites. Uses Playwright, Apify RAG Browser, Crawl4AI, and Firecrawl for comprehensive scraping. Triggers on requests to clone, reverse-engineer, or convert websites.
version: "2.0"
---

# Website-to-Vite Scraper V2

Multi-provider website scraper with AI-powered extraction for any website type.

## Scraping Methods

| Method | Best For | Anti-Bot | JS Rendering | Cost |
|--------|----------|----------|--------------|------|
| **Playwright** | General sites, Next.js/React apps | ❌ | ✅ Full | FREE |
| **Apify RAG Browser** | LLM/RAG-optimized content | ✅ | ✅ Adaptive | Credits |
| **Crawl4AI** | AI training data, clean extraction | ✅ | ✅ | Credits |
| **Firecrawl** | Protected sites, anti-bot bypass | ✅✅ | ✅ | $16/mo |

## Quick Start

### GitHub Actions (Recommended)

```bash
# Go to: Actions → Website Scraper V2 → Run workflow
# Options:
#   - URL: https://www.reventure.app/
#   - Project name: reventure-clone
#   - Method: all (tries all providers)
#   - Deploy: true
```

### API MEGA LIBRARY Integration

The following API credentials from our library enhance this scraper:

| Environment Variable | Purpose | Status |
|-----|---------|--------|
| `APIFY_API_TOKEN` | RAG Browser, Crawl4AI, Web Scraper | ✅ Configured |
| `FIRECRAWL_API_KEY` | Anti-bot bypass, stealth mode | ✅ Configured |
| `BROWSERLESS_API_KEY` | Alternative headless browser | 🔄 Available |
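
Before a run, it can help to confirm which credentials are actually set. The following is a minimal sketch, assuming the keys are exposed as environment variables under the names in the table; `missing_credentials` is a hypothetical helper and not part of the scraper, and treating the first two keys as required is an assumption.

```python
import os
from typing import Mapping, Optional

# Names come from the table above; which ones count as required is an
# assumption made for this sketch.
REQUIRED_KEYS = ("APIFY_API_TOKEN", "FIRECRAWL_API_KEY")


def missing_credentials(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
```

Calling `missing_credentials()` with no argument checks the current process environment; passing a dict makes the check easy to test.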

### MCP Server Integration

Connect Claude Desktop/Cursor to Apify MCP for AI-powered scraping:

```json
{
  "mcpServers": {
    "apify": {
      "command": "npx",
      "args": ["@apify/actors-mcp-server"],
      "env": {
        "APIFY_TOKEN": "your-apify-api-token"
      }
    }
  }
}
```

Alternatively, use the hosted server: `https://mcp.apify.com?token=YOUR_TOKEN`

## Apify Actors Used

### apify/rag-web-browser
- **Purpose:** LLM-optimized web content extraction
- **Output:** Markdown, HTML, text
- **Features:**
  - Playwright adaptive (handles JS)
  - Clean content extraction
  - Link following
  - Metadata extraction

### raizen/ai-web-scraper (Crawl4AI)
- **Purpose:** AI training data collection
- **Output:** Cleaned markdown, structured links
- **Features:**
  - Excludes boilerplate (headers, footers, nav)
  - Word count thresholding
  - External link filtering

### Firecrawl (separate API, not an Apify actor)
- **Purpose:** Anti-bot protected sites
- **Output:** Markdown, HTML, screenshots
- **Features:**
  - Anti-detection technology
  - JavaScript rendering
  - Main content extraction
  - 5-second wait for dynamic content
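
The features above map fairly directly onto a single scrape request. The sketch below only builds the request and does not send it. The endpoint URL and the field names (`formats`, `onlyMainContent`, `waitFor`) are assumptions based on Firecrawl's hosted scrape API and should be checked against the current Firecrawl documentation; `build_firecrawl_request` is a hypothetical helper.

```python
import json
import urllib.request

# Assumed endpoint for Firecrawl's hosted scrape API; verify before use.
FIRECRAWL_SCRAPE_URL = "https://api.firecrawl.dev/v1/scrape"


def build_firecrawl_request(target_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a scrape request for a single page."""
    payload = {
        "url": target_url,
        "formats": ["markdown", "html", "screenshot"],  # outputs listed above
        "onlyMainContent": True,  # main content extraction
        "waitFor": 5000,  # milliseconds: the 5-second wait for dynamic content
    }
    return urllib.request.Request(
        FIRECRAWL_SCRAPE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The request can then be sent with `urllib.request.urlopen(req)`, with the key read from `FIRECRAWL_API_KEY`.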

## Output Structure

```
project-name/
├── dist/
│   ├── index.html      # Best merged HTML
│   ├── screenshot.png  # Full page capture
│   ├── meta.json       # Scrape metadata
│   └── assets/
│       ├── images/     # Downloaded images
│       ├── css/        # Stylesheets
│       └── js/         # Scripts
└── results/
    ├── playwright/     # Raw Playwright output
    ├── apify-rag/      # RAG Browser output
    ├── crawl4ai/       # Crawl4AI output
    └── firecrawl/      # Firecrawl output
```
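
To illustrate how downloaded files could be sorted into the `assets/` folders, here is a minimal sketch. The extension-to-folder mapping is an assumption (the tree only names the three folders), and `local_asset_path` is a hypothetical helper, not the scraper's own code.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Assumed mapping from file extension to asset folder.
ASSET_DIRS = {
    ".png": "images", ".jpg": "images", ".jpeg": "images",
    ".gif": "images", ".svg": "images", ".webp": "images",
    ".css": "css",
    ".js": "js", ".mjs": "js",
}


def local_asset_path(asset_url: str) -> Optional[str]:
    """Map an asset URL to a path relative to dist/, or None if the type is unknown."""
    # urlparse drops the query string, so "logo.png?v=2" still maps cleanly.
    name = PurePosixPath(urlparse(asset_url).path).name
    folder = ASSET_DIRS.get(PurePosixPath(name).suffix.lower())
    if not name or folder is None:
        return None
    return f"assets/{folder}/{name}"
```

Returning a path relative to `dist/` means the same string can be written into `index.html` as the rewritten `src` or `href`.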

## Handling CSR/SPA Sites

Sites built with frameworks such as Next.js, React, or Vue that render client-side require JavaScript execution:

1. **Playwright** waits for the `networkidle` state, then an additional 5 seconds
2. **Apify RAG** uses adaptive crawler (Playwright when needed)
3. **Firecrawl** has built-in JS rendering
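
The Playwright wait strategy can be sketched as follows, using Playwright's Python sync API. This is an illustration under stated assumptions rather than the workflow's actual script: `render_page` is a hypothetical name, and the scraper itself may use a different language binding or browser options.

```python
EXTRA_WAIT_MS = 5_000  # the additional 5-second pause described above


def render_page(url: str, extra_wait_ms: int = EXTRA_WAIT_MS) -> str:
    """Return the rendered HTML of a client-side page."""
    # Imported lazily so this module loads even where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait until network activity settles, then give late scripts
            # a little more time to finish rendering.
            page.goto(url, wait_until="networkidle")
            page.wait_for_timeout(extra_wait_ms)
            return page.content()
        finally:
            browser.close()
```

Running it requires `pip install playwright` followed by `playwright install chromium`.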

For `__NEXT_DATA__` extraction (Next.js sites):
- The Playwright scraper automatically extracts the payload and saves it to `next_data.json`
- The saved JSON can be parsed to reconstruct static pages
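
Next.js pages that ship this payload embed it in a `<script id="__NEXT_DATA__" type="application/json">` tag. A minimal sketch of pulling it out of already-rendered HTML with only the standard library is shown below; `extract_next_data` is a hypothetical helper, not the scraper's own function.

```python
import json
from html.parser import HTMLParser
from typing import Any, Optional


class _NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__">."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_next_data(html: str) -> Optional[dict[str, Any]]:
    """Return the parsed __NEXT_DATA__ payload, or None if the page has none."""
    parser = _NextDataParser()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks).strip()
    return json.loads(raw) if raw else None
```

The result can be written out with `json.dump` to produce a file like `next_data.json`; page content typically sits under `props.pageProps`.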

## Workflow Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `url` | string | required | Website URL to scrape |
| `project_name` | string | required | Output folder/Cloudflare project name |
| `scrape_method` | choice | `playwright` | Scraping provider to use; `all` tries every provider |
| `extract_assets` | boolean | true | Download images/CSS/JS |
| `deploy_cloudflare` | boolean | true | Deploy to Cloudflare Pages |
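
For scripted runs, the parameters above can be modeled and validated before dispatching the workflow. This is a sketch under assumptions: the method names other than `playwright` and `all` are inferred from the provider names in this document, and `WorkflowInputs` is a hypothetical class. Sending every value as a string is also an assumption about how the dispatch call expects inputs.

```python
from dataclasses import dataclass

# Assumed option strings; only "playwright" and "all" appear in this document.
SCRAPE_METHODS = ("playwright", "apify-rag", "crawl4ai", "firecrawl", "all")


@dataclass(frozen=True)
class WorkflowInputs:
    url: str
    project_name: str
    scrape_method: str = "playwright"
    extract_assets: bool = True
    deploy_cloudflare: bool = True

    def __post_init__(self) -> None:
        if not self.url.startswith(("http://", "https://")):
            raise ValueError(f"url must be absolute: {self.url!r}")
        if not self.project_name:
            raise ValueError("project_name is required")
        if self.scrape_method not in SCRAPE_METHODS:
            raise ValueError(f"unknown scrape_method: {self.scrape_method!r}")

    def to_dispatch_inputs(self) -> dict[str, str]:
        """Serialize the inputs as strings for a workflow dispatch call."""
        return {
            "url": self.url,
            "project_name": self.project_name,
            "scrape_method": self.scrape_method,
            "extract_assets": str(self.extract_assets).lower(),
            "deploy_cloudflare": str(self.deploy_cloudflare).lower(),
        }
```

For example, `WorkflowInputs(url="https://www.reventure.app/", project_name="reventure-clone").to_dispatch_inputs()` yields the defaults from the table with booleans rendered as `"true"`.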

## Cost Optimization

| Scenario | Recommended Method |
|----------|-------------------|
| Simple static site | Playwright (FREE) |
| JS-heavy SPA | Playwright → Apify RAG fallback |
| Protected site (Cloudflare) | Firecrawl |
| AI/RAG pipeline | Apify RAG or Crawl4AI |
| Maximum coverage | `all` method |
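
The table can be read as an ordered fallback chain per scenario. The sketch below shows one way to express that, assuming each provider is wrapped in a callable that takes a URL and returns HTML. The scenario keys and `scrape_with_fallback` are hypothetical, and the `all` method is not modeled here because it runs every provider rather than stopping at the first success.

```python
from typing import Callable, Optional

# Scenario keys are hypothetical labels; provider order follows the table above.
PROVIDER_ORDER = {
    "static": ["playwright"],
    "spa": ["playwright", "apify-rag"],
    "protected": ["firecrawl"],
    "rag": ["apify-rag", "crawl4ai"],
}

Scraper = Callable[[str], Optional[str]]


def scrape_with_fallback(
    url: str, scenario: str, scrapers: dict[str, Scraper]
) -> tuple[str, str]:
    """Try providers in order and return (provider, html) from the first success."""
    errors = []
    for name in PROVIDER_ORDER[scenario]:
        try:
            html = scrapers[name](url)
        except Exception as exc:  # a failing or missing provider should not stop the chain
            errors.append(f"{name}: {exc!r}")
            continue
        if html and html.strip():
            return name, html
        errors.append(f"{name}: empty result")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Starting with the free provider and only moving to credit-based ones on failure keeps the paid calls to the cases that need them.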

## Security Assessment

Per API_MEGA_LIBRARY guidelines:

| API | Security Score | Recommendation |
|-----|----------------|----------------|
| Apify | 85/100 | ✅ ADOPT |
| Firecrawl | 82/100 | ✅ ADOPT |
| Playwright | 90/100 | ✅ ADOPT (local) |

## Troubleshooting

### Site returns blank page
1. Try `scrape_method: all` to use multiple providers
2. Increase wait time in Playwright
3. If the site blocks datacenter IPs, use Firecrawl
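
A blank result is usually an unrendered app shell: markup and scripts but almost no visible text. The heuristic below is one way to detect that before deciding to retry with another provider. The 50-character threshold and the name `looks_blank` are assumptions for illustration.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Accumulates text that is not inside <script>, <style>, or <noscript>."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def looks_blank(html: str, min_chars: int = 50) -> bool:
    """Return True when the page has fewer than min_chars of visible text."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.parts).split())  # collapse whitespace
    return len(text) < min_chars
```

If `looks_blank(html)` is true after a Playwright run, a longer wait or a different provider is the next thing to try.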

### Assets not downloading
1. Check whether the site blocks direct asset requests
2. Use the relative paths from the original HTML
3. Check for CORS restrictions
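
When debugging asset downloads, it helps to see exactly which URLs the page's relative references resolve to. The sketch below collects image, script, and stylesheet references and resolves each against the page URL with `urljoin`. It is an illustration only: `collect_asset_urls` is a hypothetical helper, and the set of tags it inspects is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag -> attribute holding the asset reference (assumed coverage).
ASSET_ATTRS = {"img": "src", "script": "src", "link": "href"}


class _AssetCollector(HTMLParser):
    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr = ASSET_ATTRS.get(tag)
        if attr is None:
            return
        attr_map = dict(attrs)
        # Only stylesheet links are assets; skip canonical, icon hints, etc.
        if tag == "link" and "stylesheet" not in (attr_map.get("rel") or "").lower().split():
            return
        value = attr_map.get(attr)
        if value and not value.startswith("data:"):  # inline data URIs need no download
            self.urls.append(urljoin(self.base_url, value))


def collect_asset_urls(html: str, base_url: str) -> list[str]:
    """Return absolute URLs for the image, script, and stylesheet references in html."""
    collector = _AssetCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.urls
```

Comparing this list with what actually landed in `dist/assets/` shows which requests were blocked.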

### Cloudflare protection detected
1. Use Firecrawl, which has anti-bot bypass
2. Alternatively, use Apify with residential proxies

## Related Skills

- `auction-results` - Uses similar scraping for auction data
- `bcpao-scraper` - BCPAO property data extraction
- `youtube-transcript` - Video content extraction

## Changelog

### V2.0 (Dec 2025)
- Added multi-provider support (Playwright, Apify, Firecrawl)
- MCP server integration
- Automatic provider fallback
- Asset downloading
- Cloudflare Pages deployment

### V1.0 (Dec 2025)
- Initial Playwright-only scraper
- Basic HTML/CSS/JS extraction