---
name: ai-tech-fulltext-fetch
description: Fetch and persist article full text for RSS entries already stored in SQLite by ai-tech-rss-fetch. Use when backfilling or incrementally syncing body text from entries.url or entries.canonical_url into a companion table for downstream indexing, retrieval, or summarization.
---

# AI Tech Fulltext Fetch

## Core Goal

- Reuse the same SQLite database populated by `ai-tech-rss-fetch`.
- Fetch article body text from each RSS entry URL.
- Persist extraction status and text in a companion table (`entry_content`).
- Support incremental runs and safe retries without creating duplicate fulltext rows.

## Triggering Conditions

- Receive a request to fetch article body/full text for entries already in `ai_rss.db`.
- Receive a request to build a second-stage pipeline after RSS metadata sync.
- Need a stable, resumable queue over existing `entries` rows.
- Need URL-based fulltext persistence before chunking, indexing, or summarization.

## Workflow

1. Ensure the metadata table exists first.
   - Run `ai-tech-rss-fetch` and populate `entries` in SQLite before using this skill.
   - This skill requires the `entries` table to exist.
   - In multi-agent runtimes, pin the DB to the same absolute path used by `ai-tech-rss-fetch`:

   ```bash
   export AI_RSS_DB_PATH="/absolute/path/to/workspace-rss-bot/ai_rss.db"
   ```

2. Initialize the fulltext table.

   ```bash
   python3 scripts/fulltext_fetch.py init-db --db "$AI_RSS_DB_PATH"
   ```

3. Run an incremental fulltext sync.
   - By default, the sync fetches rows that are missing full text or are currently failed.

   ```bash
   python3 scripts/fulltext_fetch.py sync \
     --db "$AI_RSS_DB_PATH" \
     --limit 50 \
     --timeout 20 \
     --min-chars 300
   ```

4. Fetch one entry on demand.

   ```bash
   python3 scripts/fulltext_fetch.py fetch-entry \
     --db "$AI_RSS_DB_PATH" \
     --entry-id 1234
   ```

5. Inspect extracted content state.

   ```bash
   python3 scripts/fulltext_fetch.py list-content \
     --db "$AI_RSS_DB_PATH" \
     --status ready \
     --limit 100
   ```

## Data Contract

- Reads from the existing `entries` table:
  - `id`, `canonical_url`, `url`, `title`.
- Writes to the `entry_content` table:
  - `entry_id` (unique, one row per entry)
  - `source_url`, `final_url`, `http_status`
  - `extractor` (`trafilatura`, `html-parser`, or `none`)
  - `content_text`, `content_hash`, `content_length`
  - `status` (`ready` or `failed`)
  - `retry_count`, `last_error`, timestamps.

## Extraction and Update Rules

- URL source priority: `canonical_url` first, falling back to `url`.
- Attempt `trafilatura` extraction when the dependency is available; otherwise fall back to the built-in HTML parser.
- Upsert by `entry_id` (see the sketch after this list):
  - Success: write/update the full text and reset `retry_count` to `0`.
  - Failure with existing `ready` content: keep the old text, keep status `ready`, record `last_error`.
  - Failure without ready content: set status to `failed`, increment `retry_count`, set `next_retry_at`.
- Failed retries are capped by `--max-retries` (default `3`) and paced by `--retry-backoff-minutes`.
- `--force` allows refetching already-`ready` rows.
- `--refetch-days N` allows refreshing rows older than `N` days.
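
The upsert-by-`entry_id` semantics can be illustrated with a minimal SQLite sketch. Column names follow the data contract above, but the types, defaults, and the `created_at`/`updated_at` timestamp names are assumptions; `init-db` in `scripts/fulltext_fetch.py` and `references/schema.md` remain the authoritative schema sources.

```bash
# Hedged sketch only: the real DDL is created by `init-db`, and the
# types/defaults/timestamp names below are assumptions, not the shipped schema.
sqlite3 "$AI_RSS_DB_PATH" <<'SQL'
CREATE TABLE IF NOT EXISTS entry_content (
  entry_id       INTEGER PRIMARY KEY REFERENCES entries(id),  -- one row per entry
  source_url     TEXT NOT NULL,  -- URL chosen for the fetch (canonical_url, else url)
  final_url      TEXT,           -- URL after redirects
  http_status    INTEGER,
  extractor      TEXT,           -- 'trafilatura', 'html-parser', or 'none'
  content_text   TEXT,
  content_hash   TEXT,
  content_length INTEGER,
  status         TEXT NOT NULL,  -- 'ready' or 'failed'
  retry_count    INTEGER NOT NULL DEFAULT 0,
  last_error     TEXT,
  next_retry_at  TEXT,
  created_at     TEXT DEFAULT (datetime('now')),
  updated_at     TEXT DEFAULT (datetime('now'))
);

-- Success path: insert or update in place, never a duplicate row,
-- and reset the retry state (placeholder values shown).
INSERT INTO entry_content
  (entry_id, source_url, final_url, http_status, extractor,
   content_text, content_hash, content_length, status, retry_count)
VALUES
  (1234, 'https://example.com/post', 'https://example.com/post', 200,
   'trafilatura', '...extracted body...', '...sha256...', 4321, 'ready', 0)
ON CONFLICT(entry_id) DO UPDATE SET
  source_url     = excluded.source_url,
  final_url      = excluded.final_url,
  http_status    = excluded.http_status,
  extractor      = excluded.extractor,
  content_text   = excluded.content_text,
  content_hash   = excluded.content_hash,
  content_length = excluded.content_length,
  status         = 'ready',
  retry_count    = 0,
  last_error     = NULL,
  updated_at     = datetime('now');
SQL
```

The `ON CONFLICT(entry_id)` clause is what makes retries safe here: re-running a fetch updates the single existing row instead of inserting a duplicate.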
## Configurable Parameters

- `--db`
- `AI_RSS_DB_PATH` (recommended absolute path in multi-agent runtimes)
- `--limit`
- `--force`
- `--only-failed`
- `--refetch-days`
- `--oldest-first`
- `--timeout`
- `--max-bytes`
- `--min-chars`
- `--max-retries`
- `--retry-backoff-minutes`
- `--user-agent`
- `--disable-trafilatura`
- `--fail-on-errors`

## Error Handling

- Missing `entries` table: return an actionable error and stop.
- Network/HTTP/parse errors: store the failure state and continue processing other entries.
- Non-text content types (PDF/image/audio/video/zip): mark that entry as failed.
- Extraction too short (`--min-chars`): treat as a failure to avoid persisting low-quality body text.

## References

- `references/schema.md`
- `references/fetch-rules.md`

## Assets

- `assets/config.example.json`

## Scripts

- `scripts/fulltext_fetch.py`
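
For downstream indexing, retrieval, or summarization, ready bodies can be pulled with a plain join. A minimal sketch, assuming only the tables and columns named in the data contract; the ordering and `LIMIT` are illustrative:

```bash
# Select successfully extracted bodies for downstream chunking/summarization.
sqlite3 "$AI_RSS_DB_PATH" <<'SQL'
SELECT e.id, e.title, c.final_url, c.content_length, c.content_text
FROM entries AS e
JOIN entry_content AS c ON c.entry_id = e.id
WHERE c.status = 'ready'
ORDER BY c.content_length DESC
LIMIT 20;
SQL
```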