---
name: firecrawl-scraping
description: Web page and website scraping with Firecrawl API. Use this skill when scraping web articles, blog posts, documentation pages, paywalled content, or JavaScript-heavy sites. Triggers on requests to scrape websites, extract article content, convert pages to markdown, or handle anti-bot protection.
---

# Firecrawl Scraping

## Overview

Scrape individual web pages and convert them to clean, LLM-ready markdown. Handles JavaScript rendering, anti-bot protection, and dynamic content.

## Quick Decision Tree

```
What are you scraping?
│
├── Single page (article, blog, docs)
│   └── references/single-page.md
│   └── Script: scripts/firecrawl_scrape.py
│
└── Entire website (multiple pages, crawling)
    └── references/website-crawler.md
    └── (Use Apify Website Content Crawler for multi-page)
```

## Environment Setup

```bash
# Required in .env
FIRECRAWL_API_KEY=fc-your-api-key-here
```

Get your API key: https://firecrawl.dev/app/api-keys

## Common Usage

### Simple Scrape
```bash
python scripts/firecrawl_scrape.py "https://example.com/article"
```

### With Options
```bash
python scripts/firecrawl_scrape.py "https://wsj.com/article" \
  --proxy stealth \
  --format markdown summary \
  --timeout 60000
```

## Proxy Modes

| Mode | Use Case |
|------|----------|
| `basic` | Standard sites, fastest |
| `stealth` | Anti-bot protection, premium content (WSJ, NYT) |
| `auto` | Let Firecrawl decide (recommended) |

## Output Formats

- `markdown` - Clean markdown content (default)
- `html` - Raw HTML
- `summary` - AI-generated summary
- `screenshot` - Page screenshot
- `links` - All links on page

## Cost

~1 credit per page. Stealth proxy may use additional credits.

## Security Notes

### Credential Handling
- Store `FIRECRAWL_API_KEY` in `.env` file (never commit to git)
- API keys can be regenerated at https://firecrawl.dev/app/api-keys
- Never log or print API keys in script output
- Use environment variables, not hardcoded values

### Data Privacy
- Only scrapes publicly accessible web pages
- Scraped content is processed by Firecrawl servers temporarily
- Markdown output stored locally in `.tmp/` directory
- Screenshots (if requested) are stored locally
- No persistent data retention by Firecrawl after request

### Access Scopes
- API key provides full access to scraping features
- No granular permission scopes available
- Monitor usage via Firecrawl dashboard

### Compliance Considerations
- **Robots.txt**: Firecrawl respects robots.txt by default
- **Public Content Only**: Only scrape publicly accessible pages
- **Terms of Service**: Respect target site ToS
- **Rate Limiting**: Built-in rate limiting prevents abuse
- **Stealth Proxy**: Use stealth mode only when necessary (paywalled news, not auth bypass)
- **GDPR**: Scraped content may contain PII - handle accordingly
- **Copyright**: Respect intellectual property rights of scraped content

## Troubleshooting

### Common Issues

#### Issue: Credits exhausted
**Symptoms:** API returns "insufficient credits" or quota exceeded error
**Cause:** Account credits depleted
**Solution:**
- Check credit balance at https://firecrawl.dev/app
- Upgrade plan or purchase additional credits
- Reduce scraping frequency
- Use `basic` proxy mode to conserve credits

#### Issue: Page not rendering correctly
**Symptoms:** Empty content or partial HTML returned
**Cause:** JavaScript-heavy page not fully loading
**Solution:**
- Enable JavaScript rendering with `--js-render` flag
- Increase timeout with `--timeout 60000` (60 seconds)
- Try `stealth` proxy mode for protected sites
- Wait for specific elements with `--wait-for` selector

#### Issue: 403 Forbidden error
**Symptoms:** Script returns 403 status code
**Cause:** Site blocking automated access
**Solution:**
- Enable `stealth` proxy mode
- Add delay between requests
- Try at different times (some sites rate limit by time)
- Check if site requires login (not supported)

#### Issue: Empty markdown output
**Symptoms:** Scrape succeeds but markdown is empty or malformed
**Cause:** Dynamic content loaded after page load, or unusual page structure
**Solution:**
- Increase wait time for JavaScript to execute
- Use `--wait-for` to wait for specific content
- Try `html` format to see raw content
- Check if content is in an iframe (not always supported)

#### Issue: Timeout errors
**Symptoms:** Request times out before completion
**Cause:** Slow page load or large page content
**Solution:**
- Increase timeout value (up to 120000ms)
- Use `basic` proxy for faster response
- Target specific page sections if possible
- Check if site is experiencing issues

## Resources

- **references/single-page.md** - Single page scraping details
- **references/website-crawler.md** - Multi-page website crawling

## Integration Patterns

### Scrape and Analyze
**Skills:** firecrawl-scraping → parallel-research
**Use case:** Scrape competitor pages, then analyze content strategy
**Flow:**
1. Scrape competitor website pages with Firecrawl
2. Convert to clean markdown
3. Use parallel-research to analyze positioning, messaging, features

### Scrape and Document
**Skills:** firecrawl-scraping → content-generation
**Use case:** Create summary documents from web research
**Flow:**
1. Scrape multiple article pages on a topic
2. Combine markdown content
3. Generate summary document via content-generation

### Scrape and Enrich CRM
**Skills:** firecrawl-scraping → attio-crm
**Use case:** Enrich company records with website data
**Flow:**
1. Scrape company website (about page, team page, product pages)
2. Extract key information (funding, team size, products)
3. Update company record in Attio CRM with enriched data