--- name: gallery-scraper description: Bulk download images from login-protected gallery websites using browser automation. Use when asked to scrape, download, or save images from gallery pages that require authentication, extract full-size images from thumbnails, or batch download from multi-page galleries. --- # Gallery Scraper Bulk download images from authenticated gallery websites via browser relay. ## Prerequisites - User must have Chrome with OpenClaw Browser Relay extension - User must be logged into the target site - User must attach the browser tab (click relay toolbar button, badge ON) ## Workflow ### 1. Attach Browser Tab Ask user to: 1. Log into the gallery site in Chrome 2. Navigate to the target gallery/profile page 3. Click the OpenClaw Browser Relay toolbar button (badge shows ON) ### 2. Discover Image URL Pattern Most gallery sites store full-size URLs in data attributes. Common patterns: ```javascript // Extract via browser evaluate () => { // Try common patterns const patterns = [ 'img[data-max]', // data-max attribute 'img[data-src]', // lazy-load pattern 'img[data-full]', // full-size pattern 'a[data-lightbox] img', // lightbox galleries '.gallery-item img' // generic gallery ]; for (const sel of patterns) { const imgs = document.querySelectorAll(sel); if (imgs.length > 0) { return { selector: sel, count: imgs.length, sample: imgs[0].outerHTML.substring(0, 200) }; } } return null; } ``` ### 3. Extract Full-Size URLs Once pattern identified, extract all URLs: ```javascript // For data-max pattern (common) () => Array.from(document.querySelectorAll('img[data-max]')) .map(img => img.dataset.max) // For thumbnail→full conversion (replace path segment) () => Array.from(document.querySelectorAll('.gallery img')) .map(img => img.src.replace('/thumb/', '/full/')) ``` ### 4. Handle Pagination Check for multiple pages: ```javascript () => { const pagination = document.querySelectorAll('.pagination a, [class*="page"] a'); return Array.from(pagination).map(a => ({text: a.textContent, href: a.href})); } ``` Navigate to each page and collect URLs. ### 4b. Batch scrape multiple galleries (iframe trick) When you need multiple galleries quickly and can’t automate CDP, you can load each gallery in a hidden iframe and extract `data-max` URLs: ```javascript async () => { const urls = [ 'https://site.example/galleries/view/123', 'https://site.example/galleries/view/456' ]; const results = []; for (const url of urls) { const iframe = document.createElement('iframe'); iframe.style.position = 'fixed'; iframe.style.left = '-9999px'; iframe.style.width = '800px'; iframe.style.height = '600px'; iframe.src = url; document.body.appendChild(iframe); await new Promise((resolve, reject) => { const t = setTimeout(() => reject(new Error('timeout load')), 20000); iframe.onload = () => { clearTimeout(t); resolve(); }; }); const doc = iframe.contentDocument; const start = Date.now(); let imgs = []; while (Date.now() - start < 20000) { imgs = Array.from(doc.querySelectorAll('img[data-max]')).map(i => i.dataset.max); if (imgs.length) break; await new Promise(r => setTimeout(r, 500)); } results.push({ id: url.split('/').pop(), urls: imgs }); iframe.remove(); } return results; } ``` ### 5. Check CDN Access Test if CDN requires authentication or just Referer: ```bash # Test direct access curl -I "CDN_URL" 2>/dev/null | head -3 # Test with Referer curl -I -H "Referer: https://SITE_DOMAIN/" "CDN_URL" 2>/dev/null | head -3 ``` ### 6. Bulk Download Save URLs to file, then parallel download: ```bash # Create output directory mkdir -p ~/Downloads/gallery_name # Download with Referer header (parallel) cd ~/Downloads/gallery_name while IFS= read -r url; do filename=$(basename "$url") curl -s -H "Referer: https://SITE_DOMAIN/" -o "$filename" "$url" & [ $(jobs -r | wc -l) -ge 8 ] && wait -n done < urls.txt wait ``` **Python ThreadPool fallback (avoids shell quoting + wait -n issues):** ```python import os import requests from concurrent.futures import ThreadPoolExecutor outdir = os.path.expanduser('~/Downloads/gallery_name') os.makedirs(outdir, exist_ok=True) headers = {'Referer': 'https://SITE_DOMAIN/', 'User-Agent': 'Mozilla/5.0'} with open('urls.txt') as f: urls = [line.strip() for line in f if line.strip()] def download(url): filename = os.path.join(outdir, os.path.basename(url)) if os.path.exists(filename) and os.path.getsize(filename) > 0: return r = requests.get(url, headers=headers, timeout=60) r.raise_for_status() with open(filename, 'wb') as f: f.write(r.content) with ThreadPoolExecutor(max_workers=8) as ex: for url in urls: ex.submit(download, url) ``` ## Handling Lock Buttons Some galleries have "lock" buttons to reveal hidden content. Look for: ```javascript // Find lock/unlock buttons () => { const locks = document.querySelectorAll( '[class*="lock"], [class*="unlock"], ' + 'button[title*="lock"], .premium-unlock' ); return Array.from(locks).map(el => ({ tag: el.tagName, class: el.className, text: el.innerText?.substring(0, 30) })); } ``` Click each lock button before extracting URLs. ## Output Organization Optionally organize by gallery: ```bash # Extract gallery ID from URL gallery_id=$(echo "$url" | grep -oE 'gallery/[0-9]+' | cut -d/ -f2) mkdir -p "gallery_${gallery_id}" ``` ## Troubleshooting - **403 Forbidden**: Add Referer header or extract cookies from browser - **Rate limited**: Reduce parallel downloads, add delays - **Missing images**: Check for JavaScript-loaded content, may need scroll injection - **Login required for CDN**: Extract session cookies via `document.cookie`