---
name: spider
description: Web crawling and scraping with analysis. Use for crawling websites, security scanning, and extracting information from web pages.
---

# Web Spider

Crawl and analyze websites.

## Prerequisites

```bash
# curl for fetching
curl --version

# Gemini for analysis
pip install google-generativeai
export GEMINI_API_KEY=your_api_key

# Optional: better HTML parsing
npm install -g puppeteer
pip install beautifulsoup4
```

## Basic Crawling

### Fetch Page

```bash
# Simple fetch
curl -s "https://example.com"

# With headers
curl -s -H "User-Agent: Mozilla/5.0" "https://example.com"

# Follow redirects
curl -sL "https://example.com"

# Save to file
curl -s "https://example.com" -o page.html

# Get headers only
curl -sI "https://example.com"

# Get headers and body
curl -sD - "https://example.com"
```

### Extract Links

```bash
# Extract all links from a page
curl -s "https://example.com" | grep -oE 'href="[^"]*"' | sed 's/href="//;s/"$//'

# Filter to a specific domain
curl -s "https://example.com" | grep -oE 'href="https?://example\.com[^"]*"'

# Unique links
curl -s "https://example.com" | grep -oE 'href="[^"]*"' | sort -u
```

### Quick Site Scan

```bash
#!/bin/bash
URL=$1

echo "=== Scanning: $URL ==="
echo ""
echo "### Response Info ###"
curl -sI "$URL" | head -20

echo ""
echo "### Links Found ###"
curl -s "$URL" | grep -oE 'href="[^"]*"' | sort -u | head -20

echo ""
echo "### Scripts ###"
curl -s "$URL" | grep -oE 'src="[^"]*\.js[^"]*"' | sort -u

echo ""
echo "### Meta Tags ###"
curl -s "$URL" | grep -oE '<meta[^>]*>' | head -10
```

## AI-Powered Analysis

### Page Analysis

```bash
CONTENT=$(curl -s "https://example.com")

gemini -m pro -o text -e "" "Analyze this webpage:

$CONTENT

Provide:
1. What is this page about?
2. Key information/content
3. Navigation structure
4. Technical observations (framework, libraries)
5.
SEO observations"
```

### Security Scan

```bash
URL="https://example.com"

# Gather information
HEADERS=$(curl -sI "$URL")
CONTENT=$(curl -s "$URL" | head -1000)

gemini -m pro -o text -e "" "Security scan this website:

URL: $URL

HEADERS:
$HEADERS

CONTENT SAMPLE:
$CONTENT

Check for:
1. Missing security headers (CSP, HSTS, X-Frame-Options)
2. Exposed sensitive information
3. Potential vulnerabilities in scripts/forms
4. Cookie security settings
5. HTTPS configuration"
```

### Extract Structured Data

```bash
CONTENT=$(curl -s "https://example.com/products")

gemini -m pro -o text -e "" "Extract structured data from this page:

$CONTENT

Extract into JSON format:
- Product names
- Prices
- Descriptions
- Any available metadata"
```

## Multi-Page Crawling

### Crawl and Analyze

```bash
#!/bin/bash
BASE_URL=$1
MAX_PAGES=${2:-10}

# Get initial links
LINKS=$(curl -s "$BASE_URL" | grep -oE "href=\"$BASE_URL[^\"]*\"" | sed 's/href="//;s/"$//' | sort -u | head -$MAX_PAGES)

echo "Found $(echo "$LINKS" | wc -l) pages"

for link in $LINKS; do
  echo ""
  echo "=== $link ==="
  TITLE=$(curl -s "$link" | grep -oE '