# Getting Started with CrawlX This guide will help you get up and running with CrawlX quickly. ## Installation ```bash npm install crawlx ``` ## Quick Start ### Basic Crawling The simplest way to use CrawlX is with the `quickCrawl` function: ```typescript import { quickCrawl } from 'crawlx'; const result = await quickCrawl('https://example.com', { title: 'title', description: 'meta[name="description"]@content' }); console.log(result.parsed); // Output: { title: "Example Domain", description: "..." } ``` ### Creating a Crawler Instance For more control, create a crawler instance: ```typescript import { CrawlX } from 'crawlx'; const crawler = new CrawlX({ concurrency: 5, timeout: 10000, userAgent: 'MyBot/1.0' }); const result = await crawler.crawl('https://example.com'); console.log(result.response.statusCode); // 200 await crawler.destroy(); // Clean up resources ``` ## Data Extraction CrawlX uses CSS selectors for data extraction: ### Simple Selectors ```typescript const parseRule = { title: 'title', // Text content links: '[a@href]', // Attribute values images: ['img@src'], // Arrays price: '.price | trim | number' // With filters }; const result = await crawler.crawl('https://shop.example.com', { parse: parseRule }); ``` ### Nested Data Structures ```typescript const parseRule = { products: { _scope: '.product', // Scope to product elements name: '.name', price: '.price | trim | number', image: 'img@src', details: { _scope: '.details', description: '.desc', specs: ['.spec'] } } }; ``` ### Custom Functions ```typescript const parseRule = { title: 'title', url: () => window.location.href, timestamp: () => new Date().toISOString(), productCount: ($) => $('.product').length }; ``` ## Factory Functions CrawlX provides factory functions for common use cases: ### Lightweight Crawler ```typescript import { createLightweightCrawler } from 'crawlx'; const crawler = createLightweightCrawler({ concurrency: 2, timeout: 5000 }); ``` ### Scraper ```typescript import { createScraper } from 'crawlx'; const scraper = createScraper(); const result = await scraper.crawl('https://example.com', { parse: { title: 'title', content: '.content' } }); ``` ### Spider (Link Following) ```typescript import { createSpider } from 'crawlx'; const spider = createSpider({ plugins: { follow: { maxDepth: 3, sameDomainOnly: true } } }); const results = await spider.crawlMany(['https://example.com'], { parse: { title: 'title' }, follow: '[a@href]' // Follow all links }); ``` ## Configuration ### Basic Configuration ```typescript const crawler = new CrawlX({ mode: 'high-performance', concurrency: 10, timeout: 30000, maxRetries: 3, userAgent: 'MyBot/1.0', headers: { 'Accept': 'text/html,application/xhtml+xml' } }); ``` ### Plugin Configuration ```typescript const crawler = new CrawlX({ plugins: { delay: { enabled: true, defaultDelay: 1000, randomDelay: true }, rateLimit: { enabled: true, globalLimit: { requests: 100, window: 60000 } }, retry: { enabled: true, maxRetries: 3, exponentialBackoff: true } } }); ``` ### Environment Variables ```bash CRAWLX_MODE=high-performance CRAWLX_CONCURRENCY=10 CRAWLX_TIMEOUT=30000 CRAWLX_PLUGINS_DELAY_ENABLED=true CRAWLX_PLUGINS_DELAY_DEFAULT_DELAY=1000 ``` ### Configuration Presets ```typescript import { ConfigPresets } from 'crawlx'; // Development preset const devCrawler = ConfigPresets.development(); // Production preset const prodCrawler = ConfigPresets.production(); // Testing preset const testCrawler = ConfigPresets.testing(); ``` ## Event Handling ```typescript const crawler = new CrawlX(); crawler.on('task-start', (task) => { console.log(`Starting: ${task.url}`); }); crawler.on('task-complete', (result) => { console.log(`Completed: ${result.response.url}`); }); crawler.on('data-extracted', (data, url) => { console.log(`Data from ${url}:`, data); }); crawler.on('task-error', (error, task) => { console.error(`Failed ${task.url}:`, error.message); }); ``` ## Error Handling ```typescript import { CrawlXError, NetworkError, TimeoutError } from 'crawlx'; try { const result = await crawler.crawl('https://example.com'); } catch (error) { if (error instanceof NetworkError) { console.log('Network error:', error.statusCode); } else if (error instanceof TimeoutError) { console.log('Timeout after:', error.timeout); } else if (error instanceof CrawlXError) { console.log('CrawlX error:', error.code, error.context); } } ``` ## Custom Plugins ```typescript class CustomPlugin { name = 'custom'; version = '1.0.0'; priority = 100; async onTaskComplete(result) { // Add custom processing result.customData = { processedAt: new Date().toISOString(), urlLength: result.response.url.length }; return result; } } const crawler = new CrawlX(); crawler.addPlugin(new CustomPlugin()); ``` ## Best Practices ### 1. Resource Management Always clean up resources: ```typescript const crawler = new CrawlX(); try { const result = await crawler.crawl('https://example.com'); // Process result } finally { await crawler.destroy(); } ``` ### 2. Error Handling Handle errors gracefully: ```typescript const crawler = new CrawlX({ maxRetries: 3, timeout: 10000 }); crawler.on('task-error', (error, task) => { console.error(`Failed to crawl ${task.url}:`, error.message); }); ``` ### 3. Rate Limiting Be respectful to target websites: ```typescript const crawler = new CrawlX({ plugins: { delay: { enabled: true, defaultDelay: 1000 // 1 second between requests }, rateLimit: { enabled: true, perDomainLimit: { requests: 10, window: 60000 } } } }); ``` ### 4. Memory Management For large-scale crawling: ```typescript const crawler = new CrawlX({ mode: 'high-performance', concurrency: 20, scheduler: { maxQueueSize: 1000, resourceLimits: { maxMemoryUsage: 1073741824 // 1GB } } }); ``` ## Next Steps - Read the [API Documentation](../api/README.md) - Explore [Advanced Examples](./examples.md) - Learn about [Plugin Development](./plugins.md) - Check out [Performance Tuning](./performance.md)