# Web Scraping

Scrape websites and index content into the knowledge base.

## Quick Start

1. **Dashboard → Scrape tab → Enter URL**
2. Configure scope and depth
3. Click "Start Scrape"
4. Content auto-indexes when complete

## Configuration

| Setting | Values | Default | Description |
|---------|--------|---------|-------------|
| URL | string | - | Starting URL to scrape |
| Max Pages | 1-1000 | 100 | Limit total pages |
| Max Depth | 0-10 | 3 | Link depth from start URL (0 = start page only) |
| Scope | SUBPAGES, HOSTNAME, DOMAIN | HOSTNAME | How far to crawl |
| Include Patterns | glob patterns | - | Only scrape matching URLs |
| Exclude Patterns | glob patterns | - | Skip matching URLs |
| Scrape Mode | AUTO, FAST, FULL | AUTO | How to fetch pages |
| Cookies | string | - | For authenticated sites |
| Force Rescrape | boolean | false | Re-scrape even if unchanged |

**Scope values:**

- `SUBPAGES` - Only pages under the starting path
- `HOSTNAME` - All pages on same hostname
- `DOMAIN` - Include subdomains

**Scrape Mode values:**

- `AUTO` - Try fast mode, fall back to full for SPAs
- `FAST` - HTTP only, faster but may miss JavaScript content
- `FULL` - Uses headless browser, handles all JavaScript

## GraphQL API

Start a scrape job programmatically:

```graphql
mutation StartScrape($input: StartScrapeInput!) {
  startScrape(input: $input) {
    jobId
    baseUrl
    status
  }
}
```

Variables:

```json
{
  "input": {
    "url": "https://docs.example.com",
    "maxPages": 100,
    "maxDepth": 3,
    "scope": "HOSTNAME",
    "includePatterns": ["/docs/*", "/api/*"],
    "excludePatterns": ["/blog/*", "/changelog/*"],
    "scrapeMode": "AUTO",
    "cookies": "session=abc123; auth=xyz789",
    "forceRescrape": false
  }
}
```

Check job status:

```graphql
query GetScrapeJob($jobId: ID!) {
  getScrapeJob(jobId: $jobId) {
    job {
      jobId
      status
      totalUrls
      processedCount
      failedCount
    }
  }
}
```

List jobs:

```graphql
query ListScrapeJobs($limit: Int) {
  listScrapeJobs(limit: $limit) {
    items {
      jobId
      baseUrl
      status
      processedCount
      totalUrls
    }
  }
}
```

Cancel a job:

```graphql
mutation CancelScrape($jobId: ID!) {
  cancelScrape(jobId: $jobId) {
    jobId
    status
  }
}
```

### Authentication

Include your API key in the request headers:

```
x-api-key: da2-xxxxxxxxxxxx
```

Get your API key from **Dashboard → Settings → API Key**.

## How It Works

```text
Start URL → Discovery Queue → Process Queue → S3 → Knowledge Base
```

1. **ScrapeStart** - Creates job, queues initial URL
2. **ScrapeDiscover** - Finds links, respects scope/depth, queues new URLs
3. **ScrapeProcess** - Fetches content, converts to markdown, saves to S3
4. **ProcessDocument** - Standard pipeline indexes the markdown

## Deduplication

Content is hashed using SHA-256. Re-scraping skips unchanged pages (hash match) unless "Force Rescrape" is enabled.

## Real-time Updates

Progress publishes via GraphQL subscriptions. The UI updates automatically as pages process.

## Troubleshooting

### Scrape stuck at 0%

- Check ScrapeDiscover Lambda logs
- Verify URL is accessible

### Pages missing

- Check scope setting (subpages is restrictive)
- Increase max depth
- Some SPAs need "full" mode

### Content garbled

- Try "full" mode for JavaScript-heavy sites
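
## Example: Calling the API from Python

The `StartScrape` mutation and the `x-api-key` header from the GraphQL API section can be combined into a small client. This is a minimal sketch, not the project's own client: it assumes the API accepts the usual GraphQL-over-HTTP convention of a JSON `POST` body with `query` and `variables` keys, and `GRAPHQL_URL`, `build_request`, `execute`, and `start_scrape_request` are hypothetical names introduced here for illustration. The real endpoint URL is deployment-specific and is not given in this document.

```python
import json
import urllib.request

# Hypothetical placeholder: substitute the GraphQL endpoint of your deployment.
GRAPHQL_URL = "https://example.invalid/graphql"

# Same mutation as in the GraphQL API section.
START_SCRAPE = """
mutation StartScrape($input: StartScrapeInput!) {
  startScrape(input: $input) { jobId baseUrl status }
}
"""


def build_request(query, variables, api_key, url=GRAPHQL_URL):
    """Build a POST request with a GraphQL body and the x-api-key header."""
    # Assumption: the server expects a JSON body with "query" and "variables".
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def start_scrape_request(url, api_key, **options):
    """Wrap the start URL and optional settings (maxPages, scope, ...) as input."""
    variables = {"input": {"url": url, **options}}
    return build_request(START_SCRAPE, variables, api_key)


def execute(request, timeout=30):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

To start a job, build a request with `start_scrape_request("https://docs.example.com", api_key, maxPages=100, scope="HOSTNAME")` and pass it to `execute`. Keeping request construction separate from sending makes the payload easy to inspect before any network call is made, and the same `build_request` helper can carry the status, list, and cancel operations.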