---
name: apify-scrapers
description: Social media and web scraping using Apify actors. Use this skill when scraping Twitter/X tweets, Reddit posts, LinkedIn posts, Instagram profiles/posts/reels, Facebook pages/posts/groups, TikTok videos, YouTube content, Google Maps businesses/reviews, contact enrichment (emails/phones from websites), or when auto-detecting URL type to scrape. Triggers on requests to scrape social media, get trending posts, extract business info, find contact details, or extract content from social URLs.
---

# Apify Scrapers

## Overview

Scrape content from major social platforms using Apify actors. Each platform has optimized settings for cost and quality.

## Quick Decision Tree

```
What do you want to scrape?
│
├── Social Media Posts
│   ├── Twitter/X → references/twitter.md
│   │   └── Script: scripts/scrape_twitter_ai_trends.py
│   │
│   ├── Reddit → references/reddit.md
│   │   └── Script: scripts/scrape_reddit_ai_tech.py
│   │
│   ├── LinkedIn → references/linkedin.md
│   │   └── Script: scripts/scrape_linkedin_posts.py
│   │
│   ├── Instagram → references/instagram.md
│   │   └── Script: scripts/scrape_instagram.py
│   │       └── Modes: profile, posts, hashtag, reels, comments
│   │
│   ├── Facebook → references/facebook.md
│   │   └── Script: scripts/scrape_facebook.py
│   │       └── Modes: page, posts, reviews, groups, marketplace
│   │
│   ├── TikTok → references/multi-platform.md
│   │   └── Script: scripts/scrape_multi_platform.py
│   │
│   └── YouTube → references/multi-platform.md
│       └── Script: scripts/scrape_multi_platform.py
│
├── Business/Places
│   ├── Google Maps businesses → references/google-maps.md
│   │   └── Script: scripts/scrape_google_maps.py
│   │       └── Modes: search, place, reviews
│   │
│   └── Contact info from websites → references/contact-enrichment.md
│       └── Script: scripts/scrape_contact_info.py
│           └── Extract: emails, phone numbers, social profiles
│
├── Auto-detect URL type → references/url-detect.md
│   └── Script: scripts/scrape_content_by_url.py
│
├── Trend Analysis (NEW)
│   └── Enriched trend analysis → workflows/trend-analysis.md
│       └── Script: scripts/analyze_trends.py
│           └── Features: velocity scoring, lifecycle staging, opportunity scoring
│
└── Workflows (multi-step)
    ├── Lead generation → workflows/lead-generation.md
    ├── Influencer discovery → workflows/influencer-discovery.md
    ├── Competitor analysis → workflows/competitor-intel.md
    ├── Trend analysis → workflows/trend-analysis.md
    └── Competitor Ads Intelligence (NEW) → workflows/competitor-ads.md
        └── Script: scripts/scrape_competitor_ads.py
            ├── Platforms: Facebook Ads Library, Google Ads Transparency
            └── Features: Spend estimates, creative analysis, benchmarking
```

## Environment Setup

```bash
# Required in .env
APIFY_TOKEN=apify_api_xxxxx
```

Get your API token: https://console.apify.com/account/integrations

## Common Usage Patterns

### Scrape Twitter Trends

```bash
python scripts/scrape_twitter_ai_trends.py --query "AI agents" --max-tweets 50
```

### Scrape Reddit Discussions

```bash
python scripts/scrape_reddit_ai_tech.py --subreddits "MachineLearning,LocalLLaMA" --max-posts 100
```

### Scrape LinkedIn Author

```bash
python scripts/scrape_linkedin_posts.py author "https://linkedin.com/in/username" --max-posts 30
```

### Auto-detect and Scrape URL

```bash
python scripts/scrape_content_by_url.py "https://x.com/user/status/123456"
```

### Scrape Instagram Profile

```bash
python scripts/scrape_instagram.py profile "https://instagram.com/username" --max-posts 20
```

### Scrape Instagram Hashtag

```bash
python scripts/scrape_instagram.py hashtag "#artificialintelligence" --max-posts 50
```

### Scrape Instagram Reels

```bash
python scripts/scrape_instagram.py reels "https://instagram.com/username" --max-reels 30
```

### Scrape Facebook Page

```bash
python scripts/scrape_facebook.py page "https://facebook.com/pagename" --max-posts 50
```

### Scrape Facebook Reviews

```bash
python scripts/scrape_facebook.py reviews "https://facebook.com/pagename" --max-reviews 100
```

### Scrape Facebook Marketplace

```bash
python scripts/scrape_facebook.py marketplace "laptops in san francisco" --max-items 30
```

### Scrape Google Maps Businesses

```bash
python scripts/scrape_google_maps.py search "AI consulting firms in New York" --max-results 50
```

### Scrape Google Maps Reviews

```bash
python scripts/scrape_google_maps.py reviews "ChIJN1t_tDeuEmsRUsoyG83frY4" --max-reviews 100
```

### Extract Contact Info from Websites

```bash
python scripts/scrape_contact_info.py "https://example.com" --depth 2
```

### Bulk Contact Enrichment

```bash
python scripts/scrape_contact_info.py --urls-file companies.txt --output contacts.json
```

### Scrape Competitor Ads (Single Competitor)

```bash
python scripts/scrape_competitor_ads.py "Nike" --platforms facebook google --country US --days 30
```

### Compare Multiple Competitors' Ads

```bash
python scripts/scrape_competitor_ads.py "Nike" "Adidas" "Puma" --compare --output comparison.json
```

### Discover Advertisers by Keyword

```bash
python scripts/scrape_competitor_ads.py --search "running shoes" --country US --max-ads 200
```

### Filter Competitor Ads by Media Type

```bash
python scripts/scrape_competitor_ads.py "Netflix" "Disney+" --platforms facebook --media-types video --days 7
```

### Analyze Trends (NEW)

```bash
# Analyze specific topic with enrichments
python scripts/analyze_trends.py "artificial intelligence" --sources google instagram tiktok --days 90

# Discover trending topics in category
python scripts/analyze_trends.py --category technology --discover --top 50

# Compare multiple trends
python scripts/analyze_trends.py "AI" "blockchain" "metaverse" --compare

# Export HTML trend report
python scripts/analyze_trends.py "sustainable fashion" --format html --output trend_report.html
```

## Cost Estimates

| Platform | Actor | Approximate Cost (per item unless noted) |
|----------|-------|------------------------------------------|
| Twitter | kaitoeasyapi/twitter-x-data-tweet-scraper | ~$0.00025 |
| Reddit | trudax/reddit-scraper | ~$0.001-0.005 |
| LinkedIn | harvestapi/linkedin-post-search | ~$0.01-0.05 |
| YouTube | streamers/youtube-scraper | ~$0.01-0.05 |
| TikTok | clockworks/tiktok-scraper | ~$0.005 |
| Instagram (profile) | apify/instagram-profile-scraper | ~$0.005 |
| Instagram (posts) | apify/instagram-post-scraper | ~$0.002-0.005 |
| Instagram (hashtag) | apify/instagram-hashtag-scraper | ~$0.002-0.005 |
| Instagram (reels) | apify/instagram-reel-scraper | ~$0.005-0.01 |
| Instagram (comments) | apify/instagram-comment-scraper | ~$0.001-0.003 |
| Facebook (page) | apify/facebook-pages-scraper | ~$0.005-0.01 |
| Facebook (posts) | apify/facebook-posts-scraper | ~$0.003-0.005 |
| Facebook (reviews) | apify/facebook-reviews-scraper | ~$0.002-0.005 |
| Facebook (groups) | apify/facebook-groups-scraper | ~$0.005-0.01 |
| Facebook (marketplace) | apify/facebook-marketplace-scraper | ~$0.005-0.01 |
| Google Maps (search) | compass/crawler-google-places | ~$0.01-0.02 |
| Google Maps (place) | compass/google-maps-business-scraper | ~$0.01 |
| Google Maps (reviews) | compass/google-maps-reviews-scraper | ~$0.003-0.005 |
| Contact Enrichment | lukaskrivka/contact-info-scraper | ~$0.01-0.03 |
| Google Trends | apify/google-trends-scraper | ~$0.01 |
| Trend Analysis (multi) | Multiple actors | ~$0.50-1.50/run |
| Facebook Ads Library | apify/facebook-ads-scraper | ~$0.75/1K ads |
| Facebook Ads (alt) | curious_coder/facebook-ads-library-scraper | ~$0.50/1K ads |
| Google Ads Transparency | lexis-solutions/google-ads-scraper | ~$1.00/1K ads |
| Google Ads (alt) | xtech/google-ad-transparency-scraper | ~$0.80/1K ads |

## Output Location

All scraped data is saved to `.tmp/` with timestamped filenames:
- `.tmp/twitter_ai_trends_YYYYMMDD.json`
- `.tmp/reddit_ai_tech_YYYYMMDD.json`
- `.tmp/linkedin_posts_YYYYMMDD_HHMMSS.json`

## Security Notes

### Credential Handling
- Store `APIFY_TOKEN` in a `.env` file (never commit it to git)
- Rotate API tokens periodically via Apify Console
- Never log or print API tokens in script output
- Use environment variables, not hardcoded values

### Data Privacy
- Scraped data contains only publicly available content
- Social media posts may include PII (names, handles, profile info)
- Data is stored locally in the `.tmp/` directory
- Apify also stores each run's dataset in your account until it expires under your plan's data retention period; delete it in Apify Console if you need it removed sooner
- Consider data minimization: only scrape what you need

### Access Scopes
- A standard Apify API token grants full account access; Apify Console also supports tokens with limited permissions, so prefer a scoped token where the scripts allow it
- Use separate Apify accounts for different projects if needed
- Monitor usage via the Apify Console dashboard

### Compliance Considerations
- **Terms of Service**: Respect each platform's ToS (Twitter, Reddit, LinkedIn)
- **Rate Limiting**: Actors have built-in rate limiting to avoid bans
- **Robots.txt**: Some actors may bypass robots.txt; use them responsibly
- **GDPR**: Scraped PII may be subject to GDPR if it concerns EU residents
- **Ethical Use**: Only scrape public data; never bypass authentication
- **Proxy Ethics**: Residential proxies should be used ethically

## Troubleshooting

### Common Issues

#### Issue: Actor run failed

**Symptoms:** Script terminates with "Actor run failed" or a timeout error

**Cause:** Invalid actor ID, insufficient proxy credits, or an actor configuration issue

**Solution:**
- Verify the actor ID is correct in the script
- Check Apify Console for actor run logs
- Ensure proxy settings match actor requirements
- Try running with default proxy settings first

#### Issue: Empty results returned

**Symptoms:** Script completes but returns 0 items

**Cause:** Content blocked by the platform, an invalid query, or the proxy being detected

**Solution:**
- Try a different proxy type (residential vs datacenter)
- Simplify the search query
- Reduce the number of results requested
- Check if the platform is blocking scraping attempts

#### Issue: Rate limited by platform

**Symptoms:** Script fails with 429 errors or "rate limited" messages

**Cause:** Too many requests in a short time period

**Solution:**
- Add delays between requests (actor settings)
- Reduce concurrent requests
- Use proxy rotation
- Wait and retry after a cooldown period

#### Issue: Invalid API token

**Symptoms:** Authentication error or "invalid token" message

**Cause:** Token expired, revoked, or incorrectly set

**Solution:**
- Regenerate the API token in Apify Console
- Verify the token is correctly set in the `.env` file
- Check for leading/trailing whitespace in the token
- Ensure the `APIFY_TOKEN` environment variable is loaded

#### Issue: Proxy connection errors

**Symptoms:** Connection timeout or proxy errors

**Cause:** Proxy pool exhausted or geo-restriction issues

**Solution:**
- Switch proxy type (basic, residential, or datacenter)
- Verify proxy credit balance in Apify Console
- Try a different proxy country/region
- Disable the proxy to test if that's the root cause

## Resources

### Platform References
- **references/twitter.md** - Twitter/X scraping details
- **references/reddit.md** - Reddit scraping with subreddit targeting
- **references/linkedin.md** - LinkedIn post scraping (author or search mode)
- **references/instagram.md** - Instagram profile, posts, hashtag, reels, and comments scraping
- **references/facebook.md** - Facebook page, posts, reviews, groups, and marketplace scraping
- **references/multi-platform.md** - TikTok and YouTube scraping
- **references/url-detect.md** - Auto-detect URL type and scrape

### Business/Places References
- **references/google-maps.md** - Google Maps business search, place details, and reviews
- **references/contact-enrichment.md** - Extract emails, phone numbers, and social profiles from websites

### Workflow References
- **workflows/lead-generation.md** - Multi-step lead generation workflow
- **workflows/influencer-discovery.md** - Find and analyze influencers across platforms
- **workflows/competitor-intel.md** - Competitive intelligence gathering workflow
- **workflows/trend-analysis.md** - Enriched multi-platform trend analysis with scoring

## Integration Patterns

### Scrape and Enrich

**Skills:** apify-scrapers → parallel-research

**Use case:** Scrape social media posts, then enrich with deep research

**Flow:**
1. Scrape Twitter/Reddit for mentions of a topic
2. Extract company names or URLs from posts
3. Use parallel-research to get detailed info on each company

### Scrape and Summarize

**Skills:** apify-scrapers → content-generation

**Use case:** Create newsletter content from social media trends

**Flow:**
1. Scrape trending AI posts from Twitter
2. Pass the scraped data to content-generation to summarize
3. Generate a formatted newsletter section

### Scrape and Archive

**Skills:** apify-scrapers → google-workspace

**Use case:** Save scraped data to Google Drive for team access

**Flow:**
1. Scrape LinkedIn posts from target accounts
2. Format data as CSV or JSON
3. Upload to Google Drive client folder via google-workspace

### Trend Analysis + Content Strategy

**Skills:** apify-scrapers (trend-analysis) → content-generation

**Use case:** Identify trending topics and create content strategy

**Flow:**
1. Run trend analysis: `python scripts/analyze_trends.py "AI productivity" --sources all`
2. Review lifecycle stage and opportunity score
3. Use content-generation to create content for high-opportunity trends
4. Focus on emerging trends with high velocity scores

### Competitive Trend Monitoring

**Skills:** apify-scrapers (trend-analysis) → parallel-research

**Use case:** Monitor competitor visibility in trending topics

**Flow:**
1. Analyze industry trends: `python scripts/analyze_trends.py --category "your-industry" --discover`
2. Compare your brand vs competitors in those trends
3. Use parallel-research for deep dive on gaps
4. Generate competitive intelligence report
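
### Example: Preparing Scraper Output for a Downstream Skill

Each integration flow above starts from a file the scrapers wrote to `.tmp/` and hands it to another skill. The sketch below illustrates one way to do the "Format data as CSV or JSON" step from Scrape and Archive: pick the newest output file for a scraper and convert it to CSV. It rests on assumptions not confirmed by this document: that each output file holds a JSON array of flat objects, and that the filename prefix (`linkedin_posts` here, taken from the Output Location examples) identifies the scraper. The helper names `latest_output` and `items_to_csv` are hypothetical and are not part of the bundled scripts.

```python
import csv
import io
import json
from pathlib import Path
from typing import Optional


def latest_output(prefix: str, tmp_dir: str = ".tmp") -> Optional[Path]:
    """Return the newest `<prefix>_*.json` file in tmp_dir, or None if there is none.

    Timestamps such as YYYYMMDD or YYYYMMDD_HHMMSS sort chronologically as
    plain strings, so the last name in sorted order is the most recent file.
    """
    candidates = sorted(Path(tmp_dir).glob(f"{prefix}_*.json"))
    return candidates[-1] if candidates else None


def items_to_csv(items: list) -> str:
    """Serialize a list of flat dicts to CSV text.

    The header is the union of all keys in first-seen order; keys missing
    from a row are written as empty cells.
    """
    fieldnames = []
    for item in items:
        for key in item:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


if __name__ == "__main__":
    # Assumed layout: .tmp/linkedin_posts_YYYYMMDD_HHMMSS.json holding a JSON array.
    path = latest_output("linkedin_posts")
    if path is None:
        print("No scraper output found in .tmp/")
    else:
        items = json.loads(path.read_text(encoding="utf-8"))
        csv_path = path.with_suffix(".csv")
        csv_path.write_text(items_to_csv(items), encoding="utf-8")
        print(f"Wrote {csv_path}")
```

If the real output nests fields (for example an author object inside each post), flatten those records before calling `items_to_csv`, since `csv.DictWriter` writes nested values as their string representation.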