---
name: self-learning-web-adapter
description: Learn reusable per-site web scraping adapters and turn them into lightweight commands for repeated extraction. Use when Claude needs to scrape the same domain more than once, reduce token and HTML parsing cost, validate DOM drift, retrain site-specific rules, or export a learned website as a CLI-style tool instead of re-exploring raw pages every time.
---

# Self Learning Web Adapter

## Overview

Use the bundled CLI to learn a domain-specific adapter from a few sample URLs, run it on future pages, detect drift, retrain when extraction quality drops, and export a command wrapper for agent reuse.

Prefer this skill over ad hoc scraping when the user is repeatedly reading the same site, wants a reusable extraction flow, or asks about token/cost savings from structured extraction.

## Workflow

1. Identify 3 or more representative URLs from the same host.
2. Learn an adapter with `scripts/web_adapter_cli.py learn`.
3. Validate on a different page with `run` or `check`.
4. Retrain with `retrain` if `needs_retrain` becomes `true`.
5. Export a reusable command with `export-command` if the site should behave like a tool.

## Core Commands

- Learn a host-specific adapter:

```bash
python3 scripts/web_adapter_cli.py learn
```

- Run the learned adapter on a new page:

```bash
python3 scripts/web_adapter_cli.py run
```

- Check drift and retraining need:

```bash
python3 scripts/web_adapter_cli.py check
```

- Retrain from stored or replacement samples:

```bash
python3 scripts/web_adapter_cli.py retrain [url1 url2 url3]
```

- Export a web2cli-style command and inspect the registry:

```bash
python3 scripts/web_adapter_cli.py export-command
python3 scripts/web_adapter_cli.py commands
```

## Operating Rules

- Train only on URLs from the same host.
- Use at least one holdout URL for validation.
- Prefer structured outputs over summarizing raw HTML.
- Treat `needs_retrain: true` as a signal to refresh the adapter before trusting extraction.
- Export commands only after the adapter works on at least one unseen page.

## Outputs

- Adapter JSON is stored in `adapter_registry/.json`.
- Exported commands are stored in `web2cli_commands/`.
- The command registry is stored in `web2cli_commands/index.json`.

## References

- Read `references/adapter-format.md` when editing or inspecting the adapter schema.
- Read `references/token-savings.md` when the user asks for token reduction, ROI, or cost justification.
- Read `scripts/web_adapter_cli.py` only when the workflow itself needs changes.
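
## Example: Pre-checking Sample URLs

The first two operating rules (same-host training and at least one holdout URL) can be checked before calling `learn`. The following is a minimal sketch, not part of the bundled CLI: the helper name `split_samples`, the `min_train` parameter, and the choice to reserve the last URL as the holdout are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def split_samples(urls, min_train=3):
    """Hypothetical helper: check the sample set, then reserve a holdout.

    Returns (host, training_urls, holdout_url). The last URL is used as
    the holdout; that ordering is an assumption of this sketch.
    """
    hosts = {urlsplit(u).hostname for u in urls}
    if None in hosts:
        raise ValueError("every sample must be an absolute URL with a host")
    if len(hosts) != 1:
        raise ValueError(f"samples span multiple hosts: {sorted(hosts)}")
    if len(urls) < min_train + 1:
        raise ValueError(
            f"need at least {min_train} training URLs plus one holdout"
        )
    *training, holdout = urls
    return hosts.pop(), training, holdout
```

The training URLs would then go to `learn`, and the holdout URL to `run` or `check`, so the adapter is validated on a page it has not seen.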