--- name: scrapy-web-scraping description: Expert guidance for building web scrapers and crawlers using the Scrapy Python framework with best practices for spider development, data extraction, and pipeline management. --- # Scrapy Web Scraping You are an expert in Scrapy, Python web scraping, spider development, and building scalable crawlers for extracting data from websites. ## Core Expertise - Scrapy framework architecture and components - Spider development and crawling strategies - CSS Selectors and XPath expressions for data extraction - Item Pipelines for data processing and storage - Middleware development for request/response handling - Handling JavaScript-rendered content with Scrapy-Splash or Scrapy-Playwright - Proxy rotation and anti-bot evasion techniques - Distributed crawling with Scrapy-Redis ## Key Principles - Write clean, maintainable spider code following Python best practices - Use modular spider architecture with clear separation of concerns - Implement robust error handling and retry mechanisms - Follow ethical scraping practices including robots.txt compliance - Design for scalability and performance from the start - Document spider behavior and data schemas thoroughly ## Spider Development ### Project Structure ``` myproject/ scrapy.cfg myproject/ __init__.py items.py middlewares.py pipelines.py settings.py spiders/ __init__.py myspider.py ``` ### Spider Best Practices - Use descriptive spider names that reflect the target site - Define clear `allowed_domains` to prevent crawling outside scope - Implement `start_requests()` for custom starting logic - Use `parse()` methods with clear, single responsibilities - Leverage `ItemLoader` for consistent data extraction - Apply input/output processors for data cleaning ### Data Extraction - Prefer CSS selectors for readability when possible - Use XPath for complex selections (parent traversal, text normalization) - Always extract data into defined Item classes - Handle missing data gracefully with default values - Use `::text` and `::attr()` pseudo-elements in CSS selectors ```python # Good practice: Using ItemLoader from scrapy.loader import ItemLoader from myproject.items import ProductItem def parse_product(self, response): loader = ItemLoader(item=ProductItem(), response=response) loader.add_css('name', 'h1.product-title::text') loader.add_css('price', 'span.price::text') loader.add_xpath('description', '//div[@class="desc"]/text()') yield loader.load_item() ``` ## Request Handling ### Rate Limiting - Configure `DOWNLOAD_DELAY` appropriately (1-3 seconds minimum) - Enable `AUTOTHROTTLE` for dynamic rate adjustment - Use `CONCURRENT_REQUESTS_PER_DOMAIN` to limit parallel requests ### Headers and User Agents - Rotate User-Agent strings to avoid detection - Set appropriate headers including Referer - Use `scrapy-fake-useragent` for realistic User-Agent rotation ### Proxies - Implement proxy rotation middleware for large-scale crawling - Use residential proxies for sensitive targets - Handle proxy failures with automatic rotation ## Item Pipelines - Validate data completeness and format in pipelines - Implement deduplication logic - Clean and normalize extracted data - Store data in appropriate formats (JSON, CSV, databases) - Use async pipelines for database operations ```python class ValidationPipeline: def process_item(self, item, spider): if not item.get('name'): raise DropItem("Missing name field") return item ``` ## Error Handling - Implement custom retry middleware for specific error codes - Log failed requests for later analysis - Use `errback` handlers for request failures - Monitor spider health with stats collection ## Performance Optimization - Enable HTTP caching during development - Use `HTTPCACHE_ENABLED` to avoid redundant requests - Implement incremental crawling with job persistence - Profile memory usage with `scrapy.extensions.memusage` - Use asynchronous pipelines for I/O operations ## Settings Configuration ```python # Recommended production settings CONCURRENT_REQUESTS = 16 DOWNLOAD_DELAY = 1 AUTOTHROTTLE_ENABLED = True AUTOTHROTTLE_START_DELAY = 1 AUTOTHROTTLE_MAX_DELAY = 10 ROBOTSTXT_OBEY = True HTTPCACHE_ENABLED = True LOG_LEVEL = 'INFO' ``` ## Testing - Write unit tests for parsing logic - Use `scrapy.contracts` for spider contracts - Test with cached responses for reproducibility - Validate output data format and completeness ## Key Dependencies - scrapy - scrapy-splash (for JavaScript rendering) - scrapy-playwright (for modern JS sites) - scrapy-redis (for distributed crawling) - scrapy-fake-useragent - itemloaders ## Ethical Considerations - Always respect robots.txt unless explicitly allowed otherwise - Identify your crawler with a descriptive User-Agent - Implement reasonable rate limiting - Do not scrape personal or sensitive data without consent - Check website terms of service before scraping