--- name: page-monitoring description: Web page monitoring, change detection, and availability tracking. Use when tracking content changes, detecting when pages go down, monitoring for updates, preserving content before deletion, or generating feeds for pages without RSS. Covers Visualping, ChangeTower, Distill.io, and self-hosted monitoring solutions. --- # Page monitoring methodology Patterns for tracking web page changes, detecting content removal, and preserving important pages before they disappear. ## Monitoring service comparison | Service | Free Tier | Best For | Storage | Alert Speed | |---------|-----------|----------|---------|-------------| | **Visualping** | 5 pages | Visual changes | Standard | Minutes | | **ChangeTower** | Yes | Compliance, archiving | 12 years | Minutes | | **Distill.io** | 25 pages | Element-level tracking | 12 months | Seconds | | **Wachete** | Limited | Login-protected pages | 12 months | Minutes | | **UptimeRobot** | 50 monitors | Uptime only | 2 months | Minutes | ## Quick-start: Monitor a page ### Distill.io element monitoring ```javascript // Distill.io allows CSS/XPath selectors for precise monitoring // Example selectors for common use cases: // Monitor news article headlines const newsSelector = '.article-headline, h1.title, .story-title'; // Monitor price changes const priceSelector = '.price, .product-price, [data-price]'; // Monitor stock/availability const availabilitySelector = '.in-stock, .availability, .stock-status'; // Monitor specific paragraph or section const sectionSelector = '#main-content p:first-child'; // Monitor table data const tableSelector = 'table.data-table tbody tr'; ``` ### Python monitoring script ```python import requests import hashlib import json import smtplib from email.mime.text import MIMEText from datetime import datetime from pathlib import Path from typing import Optional from bs4 import BeautifulSoup class PageMonitor: """Simple page change monitor with local storage.""" def __init__(self, storage_dir: Path): self.storage_dir = storage_dir self.storage_dir.mkdir(parents=True, exist_ok=True) self.state_file = storage_dir / 'monitor_state.json' self.state = self._load_state() def _load_state(self) -> dict: if self.state_file.exists(): return json.loads(self.state_file.read_text()) return {'pages': {}} def _save_state(self): self.state_file.write_text(json.dumps(self.state, indent=2)) def _get_page_hash(self, url: str, selector: str = None) -> tuple[str, str]: """Get content hash and content for a page or element.""" response = requests.get(url, timeout=30, headers={ 'User-Agent': 'Mozilla/5.0 (PageMonitor/1.0)' }) response.raise_for_status() if selector: soup = BeautifulSoup(response.text, 'html.parser') element = soup.select_one(selector) content = element.get_text(strip=True) if element else '' else: content = response.text content_hash = hashlib.sha256(content.encode()).hexdigest() return content_hash, content def add_page(self, url: str, name: str, selector: str = None): """Add a page to monitor.""" content_hash, content = self._get_page_hash(url, selector) self.state['pages'][url] = { 'name': name, 'selector': selector, 'last_hash': content_hash, 'last_check': datetime.now().isoformat(), 'last_content': content[:1000], # Store preview 'change_count': 0 } self._save_state() print(f"Added: {name} ({url})") def check_page(self, url: str) -> Optional[dict]: """Check single page for changes.""" if url not in self.state['pages']: return None page = self.state['pages'][url] selector = page.get('selector') try: new_hash, new_content = self._get_page_hash(url, selector) except Exception as e: return { 'url': url, 'name': page['name'], 'status': 'error', 'error': str(e) } changed = new_hash != page['last_hash'] result = { 'url': url, 'name': page['name'], 'status': 'changed' if changed else 'unchanged', 'previous_content': page['last_content'], 'new_content': new_content[:1000] if changed else None } if changed: page['last_hash'] = new_hash page['last_content'] = new_content[:1000] page['change_count'] += 1 # Archive the change archive_file = self.storage_dir / f"{hashlib.md5(url.encode()).hexdigest()}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.txt" archive_file.write_text(new_content) page['last_check'] = datetime.now().isoformat() self._save_state() return result def check_all(self) -> list[dict]: """Check all monitored pages.""" results = [] for url in self.state['pages']: result = self.check_page(url) if result: results.append(result) return results # Usage monitor = PageMonitor(Path('./page_monitor_data')) # Add pages to monitor monitor.add_page( 'https://example.com/important-page', 'Important Page', selector='.main-content' # Optional: monitor specific element ) # Check for changes results = monitor.check_all() for result in results: if result['status'] == 'changed': print(f"CHANGED: {result['name']}") print(f" Previous: {result['previous_content'][:100]}...") print(f" New: {result['new_content'][:100]}...") ``` ## Uptime monitoring ### UptimeRobot API integration ```python import requests from typing import List, Optional class UptimeRobotClient: """UptimeRobot API client for monitoring page availability.""" def __init__(self, api_key: str): self.api_key = api_key self.base_url = "https://api.uptimerobot.com/v2" def _request(self, endpoint: str, params: dict = None) -> dict: data = {'api_key': self.api_key} if params: data.update(params) response = requests.post(f"{self.base_url}/{endpoint}", data=data) return response.json() def get_monitors(self) -> List[dict]: """Get all monitors.""" result = self._request('getMonitors') return result.get('monitors', []) def create_monitor(self, friendly_name: str, url: str, monitor_type: int = 1) -> dict: """Create a new monitor. Types: 1=HTTP(s), 2=Keyword, 3=Ping, 4=Port """ return self._request('newMonitor', { 'friendly_name': friendly_name, 'url': url, 'type': monitor_type }) def get_monitor_uptime(self, monitor_id: int, custom_uptime_ratios: str = "7-30-90") -> dict: """Get uptime statistics for a monitor.""" return self._request('getMonitors', { 'monitors': monitor_id, 'custom_uptime_ratios': custom_uptime_ratios }) def pause_monitor(self, monitor_id: int) -> dict: """Pause a monitor.""" return self._request('editMonitor', { 'id': monitor_id, 'status': 0 }) def resume_monitor(self, monitor_id: int) -> dict: """Resume a monitor.""" return self._request('editMonitor', { 'id': monitor_id, 'status': 1 }) # Usage client = UptimeRobotClient('your-api-key') # Create monitors for important pages client.create_monitor('News Homepage', 'https://example-news.com') client.create_monitor('API Status', 'https://api.example.com/health') # Check all monitors for monitor in client.get_monitors(): status = 'UP' if monitor['status'] == 2 else 'DOWN' print(f"{monitor['friendly_name']}: {status}") ``` ## RSS feed generation ### Generate RSS from pages without feeds ```python import requests from bs4 import BeautifulSoup from feedgen.feed import FeedGenerator from datetime import datetime import hashlib class RSSGenerator: """Generate RSS feeds from web pages.""" def __init__(self, feed_id: str, title: str, link: str): self.fg = FeedGenerator() self.fg.id(feed_id) self.fg.title(title) self.fg.link(href=link) self.fg.description(f'Auto-generated feed for {title}') def add_from_page(self, url: str, item_selector: str, title_selector: str, link_selector: str, description_selector: str = None): """Parse a page and add items to feed. Args: url: Page URL to parse item_selector: CSS selector for each item container title_selector: CSS selector for title (relative to item) link_selector: CSS selector for link (relative to item) description_selector: Optional CSS selector for description """ response = requests.get(url, timeout=30) soup = BeautifulSoup(response.text, 'html.parser') items = soup.select(item_selector) for item in items[:20]: # Limit to 20 items title_elem = item.select_one(title_selector) link_elem = item.select_one(link_selector) if not title_elem or not link_elem: continue title = title_elem.get_text(strip=True) link = link_elem.get('href', '') # Make absolute URL if relative if link.startswith('/'): from urllib.parse import urljoin link = urljoin(url, link) fe = self.fg.add_entry() fe.id(hashlib.md5(link.encode()).hexdigest()) fe.title(title) fe.link(href=link) if description_selector: desc_elem = item.select_one(description_selector) if desc_elem: fe.description(desc_elem.get_text(strip=True)) fe.published(datetime.now()) def generate_rss(self) -> str: """Generate RSS XML string.""" return self.fg.rss_str(pretty=True).decode() def save_rss(self, filepath: str): """Save RSS feed to file.""" self.fg.rss_file(filepath) # Example: Generate feed for a news site without RSS rss = RSSGenerator( 'https://example.com/news', 'Example News Feed', 'https://example.com/news' ) rss.add_from_page( 'https://example.com/news', item_selector='.news-item', title_selector='h2 a', link_selector='h2 a', description_selector='.summary' ) # Save the feed rss.save_rss('example_feed.xml') ``` ### Using RSS-Bridge (self-hosted) ```bash # RSS-Bridge generates feeds for sites without them # Supports Twitter, Instagram, YouTube, and many others # Docker installation docker pull rssbridge/rss-bridge docker run -d -p 3000:80 rssbridge/rss-bridge # Access at http://localhost:3000 # Select a bridge, enter parameters, get RSS feed URL ``` ## Social media monitoring ### Twitter/X archiving with Twarc ```python # Twarc requires Twitter API credentials # Installation # pip install twarc # Configure # twarc2 configure import subprocess import json from pathlib import Path class TwitterArchiver: """Archive Twitter searches and timelines.""" def __init__(self, output_dir: Path): self.output_dir = output_dir self.output_dir.mkdir(parents=True, exist_ok=True) def search(self, query: str, max_results: int = 100) -> Path: """Search tweets and save to file.""" output_file = self.output_dir / f"search_{query.replace(' ', '_')}.jsonl" subprocess.run([ 'twarc2', 'search', '--max-results', str(max_results), query, str(output_file) ], check=True) return output_file def get_timeline(self, username: str, max_results: int = 100) -> Path: """Get user timeline.""" output_file = self.output_dir / f"timeline_{username}.jsonl" subprocess.run([ 'twarc2', 'timeline', '--max-results', str(max_results), username, str(output_file) ], check=True) return output_file def parse_archive(self, filepath: Path) -> list[dict]: """Parse archived tweets.""" tweets = [] with open(filepath) as f: for line in f: data = json.loads(line) if 'data' in data: tweets.extend(data['data']) return tweets ``` ## Webhook notifications ### Send alerts on changes ```python import requests from typing import Optional class AlertManager: """Send alerts when monitored pages change.""" def __init__(self, slack_webhook: str = None, discord_webhook: str = None, email_config: dict = None): self.slack_webhook = slack_webhook self.discord_webhook = discord_webhook self.email_config = email_config def send_slack(self, message: str, channel: str = None): """Send Slack notification.""" if not self.slack_webhook: return payload = {'text': message} if channel: payload['channel'] = channel requests.post(self.slack_webhook, json=payload) def send_discord(self, message: str): """Send Discord notification.""" if not self.discord_webhook: return requests.post(self.discord_webhook, json={'content': message}) def send_email(self, subject: str, body: str, to: str): """Send email notification.""" if not self.email_config: return import smtplib from email.mime.text import MIMEText msg = MIMEText(body) msg['Subject'] = subject msg['From'] = self.email_config['from'] msg['To'] = to with smtplib.SMTP(self.email_config['smtp_host'], self.email_config['smtp_port']) as server: server.starttls() server.login(self.email_config['username'], self.email_config['password']) server.send_message(msg) def alert_change(self, page_name: str, url: str, old_content: str, new_content: str): """Send change alert to all configured channels.""" message = f""" Page Changed: {page_name} URL: {url} Time: {datetime.now().isoformat()} Previous content (preview): {old_content[:200]}... New content (preview): {new_content[:200]}... """ if self.slack_webhook: self.send_slack(message) if self.discord_webhook: self.send_discord(message) ``` ## Scheduled monitoring with cron ### Cron setup for continuous monitoring ```bash # Edit crontab crontab -e # Check pages every 15 minutes */15 * * * * /usr/bin/python3 /path/to/monitor_script.py >> /var/log/monitor.log 2>&1 # Check critical pages every 5 minutes */5 * * * * /usr/bin/python3 /path/to/critical_monitor.py >> /var/log/critical.log 2>&1 # Daily summary report at 8 AM 0 8 * * * /usr/bin/python3 /path/to/daily_report.py ``` ### Monitoring script template ```python #!/usr/bin/env python3 """Page monitoring script for cron execution.""" import sys from pathlib import Path from datetime import datetime # Add project to path sys.path.insert(0, str(Path(__file__).parent)) from monitor import PageMonitor from alerts import AlertManager def main(): # Initialize monitor = PageMonitor(Path('./data')) alerts = AlertManager( slack_webhook='https://hooks.slack.com/services/...', discord_webhook='https://discord.com/api/webhooks/...' ) # Check all pages results = monitor.check_all() # Process results changes = [r for r in results if r['status'] == 'changed'] errors = [r for r in results if r['status'] == 'error'] # Alert on changes for change in changes: alerts.alert_change( change['name'], change['url'], change['previous_content'], change['new_content'] ) print(f"[{datetime.now()}] CHANGE: {change['name']}") # Alert on errors for error in errors: alerts.send_slack(f"Monitor error for {error['name']}: {error['error']}") print(f"[{datetime.now()}] ERROR: {error['name']} - {error['error']}") # Summary print(f"[{datetime.now()}] Checked {len(results)} pages, " f"{len(changes)} changes, {len(errors)} errors") if __name__ == '__main__': main() ``` ## Archive on change ### Automatic archiving when changes detected ```python from multiarchiver import MultiArchiver class ArchivingMonitor(PageMonitor): """Page monitor that archives content when changes detected.""" def __init__(self, storage_dir: Path): super().__init__(storage_dir) self.archiver = MultiArchiver() def check_page(self, url: str) -> dict: """Check page and archive if changed.""" result = super().check_page(url) if result and result['status'] == 'changed': # Archive to multiple services archive_results = self.archiver.archive_url(url) successful_archives = [ r.archived_url for r in archive_results if r.success ] result['archives'] = successful_archives # Log archive URLs print(f"Archived {url} to:") for archive_url in successful_archives: print(f" - {archive_url}") return result ``` ## Monitoring strategy by use case ### News monitoring ```markdown ## News/Current Events Monitoring ### Pages to monitor: - Breaking news sections - Press release pages - Government announcement pages - Company newsrooms ### Monitoring frequency: - Breaking news: Every 5 minutes - Press releases: Every 15-30 minutes - General news: Every hour ### Archive strategy: - Archive immediately on detection - Use both Wayback Machine and Archive.today - Save local copy with timestamp ``` ### Research monitoring ```markdown ## Academic/Research Monitoring ### Pages to monitor: - Preprint servers (arXiv, SSRN) - Journal table of contents - Conference proceedings - Researcher profiles ### Monitoring frequency: - Daily for active topics - Weekly for general monitoring ### Tools recommended: - Google Scholar alerts (free, built-in) - Semantic Scholar alerts - RSS feeds where available - Custom monitors for specific pages ``` ### Competitive intelligence ```markdown ## Competitor Monitoring ### Pages to monitor: - Pricing pages - Product pages - Job postings - Press releases - Executive bios ### Monitoring frequency: - Pricing: Daily - Products: Daily - Jobs: Weekly - Press: Daily ### Legal considerations: - Don't violate terms of service - Don't circumvent access controls - Public pages only - Don't scrape at high frequency ``` ## Best practices ### Monitoring checklist ```markdown ## Before monitoring a page: - [ ] Is the page publicly accessible? - [ ] Are you respecting robots.txt? - [ ] Is monitoring frequency reasonable? - [ ] Do you have a legitimate purpose? - [ ] Are you storing data securely? - [ ] Do you have alerts configured? - [ ] Is archiving set up for important pages? ## Maintenance: - [ ] Review monitors monthly - [ ] Remove stale monitors - [ ] Update selectors if pages change - [ ] Check alert delivery - [ ] Verify archives are working ``` ### Rate limiting ```python import time from functools import wraps def rate_limit(min_interval: float = 1.0): """Decorator to rate limit function calls.""" last_call = [0.0] def decorator(func): @wraps(func) def wrapper(*args, **kwargs): elapsed = time.time() - last_call[0] if elapsed < min_interval: time.sleep(min_interval - elapsed) last_call[0] = time.time() return func(*args, **kwargs) return wrapper return decorator # Usage @rate_limit(min_interval=2.0) # Max once per 2 seconds def check_page(url: str): return requests.get(url) ```