---
name: sustainability-rss-fetch
description: Ingest all sustainability journal RSS entries into a dedicated RSS SQLite database first, keyed by DOI, then mark relevance and prune non-relevant rows to DOI-only. Use when building a DOI-first ingestion pipeline with mandatory full ingestion before topic filtering.
---

# Sustainability RSS Fetch

## Core Goal

- Ingest all RSS/Atom items into SQLite before topic filtering.
- Use `doi` as the primary key in `entries`.
- Keep RSS metadata isolated in its own DB file.
- After semantic screening, keep relevant rows and prune non-relevant rows to DOI-only.

## Triggering Conditions

- Receive a request to import sustainability feeds and persist all fetched records first.
- Receive a request to do prompt-based topic screening after DB ingestion.
- Receive a request to convert irrelevant rows into lightweight DOI-only records.
- Need stable DOI-keyed storage for downstream API/fulltext/summarization.

## Mandatory Workflow

1. Prepare the runtime and the RSS metadata DB path.

   ```bash
   python3 -m pip install feedparser
   export SUSTAIN_RSS_DB_PATH="/absolute/path/to/workspace-rss-bot/sustainability_rss.db"
   python3 scripts/rss_subscribe.py init-db --db "$SUSTAIN_RSS_DB_PATH"
   ```

2. Collect the RSS window and ingest all fetched items first.

   ```bash
   python3 scripts/rss_subscribe.py collect-window \
     --db "$SUSTAIN_RSS_DB_PATH" \
     --opml assets/journal.opml \
     --start 2026-02-01 \
     --end 2026-02-10 \
     --max-items-per-feed 150 \
     --topic-prompt "Select articles related to sustainability topics: life cycle assessment, material flow analysis, green supply chains, green electricity, green design, pollution and carbon reduction" \
     --output /tmp/sustainability-candidates.json \
     --pretty
   ```

3. Screen candidates in agent context (semantic, not regex-only).
   - Use `topic_prompt` plus user instructions.
   - Produce the selected `candidate_id` list.

4. Mark selected rows as relevant and prune unselected rows.

   ```bash
   python3 scripts/rss_subscribe.py insert-selected \
     --db "$SUSTAIN_RSS_DB_PATH" \
     --candidates /tmp/sustainability-candidates.json \
     --selected-ids 3,7,12,21
   ```

   Result:
   - Selected candidates: `is_relevant=1`, metadata kept.
   - Unselected candidates: metadata fields cleared, DOI-only row kept (`is_relevant=0`).

## Optional Maintenance Sync

```bash
python3 scripts/rss_subscribe.py sync --db "$SUSTAIN_RSS_DB_PATH" --max-feeds 20 --max-items-per-feed 100
```

## Source Management

```bash
python3 scripts/rss_subscribe.py add-feed --db "$SUSTAIN_RSS_DB_PATH" --url "https://example.com/feed.xml"
python3 scripts/rss_subscribe.py import-opml --db "$SUSTAIN_RSS_DB_PATH" --opml assets/journal.opml
```

## Query Data

```bash
python3 scripts/rss_subscribe.py list-feeds --db "$SUSTAIN_RSS_DB_PATH" --limit 50
python3 scripts/rss_subscribe.py list-entries --db "$SUSTAIN_RSS_DB_PATH" --limit 100
```

## Data Contract

- `feeds` table: subscription and fetch state.
- `entries` table (`doi` PK):
  - metadata fields (`title/url/summary/categories/...`)
  - `doi_is_surrogate` (set when no DOI is present in the source item)
  - `is_relevant` (`1` relevant, `0` pruned non-relevant, `NULL` not yet labeled)
- Non-relevant rows are pruned to a DOI-only payload for storage efficiency (see the sketch at the end of this document).

## Configurable Parameters

- `--db`
- `SUSTAIN_RSS_DB_PATH`
- `--opml`
- `--feed-url`
- `--use-subscribed-feeds`
- `--topic-prompt`
- `--start/--end`
- `--max-feeds`
- `--max-items-per-feed`
- `--user-agent`
- `--cleanup-ttl-days`

## Error and Boundary Handling

- Feed/network failure: continue with the other feeds and record the error in feed state.
- Missing `feedparser`: return install guidance.
- Missing DOI in an RSS item: create a deterministic surrogate DOI key to preserve the full-ingestion guarantee (see the sketch below).
- Invalid selected IDs: fail fast before any label/prune write.
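A minimal sketch of one way to derive such a surrogate key, assuming the entry's link (or, failing that, its feed-level `id` or title) is used as the stable input. The function name and prefix scheme here are hypothetical; the actual derivation lives in `scripts/rss_subscribe.py` and may differ:

```python
import hashlib

def surrogate_doi(feed_url: str, entry_link: str | None, entry_id: str | None, title: str) -> str:
    """Derive a deterministic surrogate key for entries without a real DOI.

    Hypothetical scheme: hash the most stable identifying fields so the
    same item always maps to the same primary key across fetches.
    """
    basis = entry_link or entry_id or f"{feed_url}|{title}"
    digest = hashlib.sha256(basis.encode("utf-8")).hexdigest()[:16]
    # The prefix distinguishes surrogate keys from genuine DOIs; rows
    # created this way would also set doi_is_surrogate = 1.
    return f"surrogate:{digest}"
```

Because the key depends only on fields already present in the feed item, re-fetching the same window is idempotent: the same item upserts onto the same row instead of creating duplicates.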
## References

- `references/input-model.md`
- `references/output-rules.md`
- `references/time-range-rules.md`

## Assets

- `assets/journal.opml`
- `assets/config.example.json`

## Scripts

- `scripts/rss_subscribe.py`
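For orientation, a minimal sketch of the `entries` contract and the label-and-prune step that `insert-selected` performs. The column set and SQL below are assumptions based on the Data Contract section above; the authoritative schema and logic live in `scripts/rss_subscribe.py`:

```python
import sqlite3

# Assumed minimal schema matching the data contract above; the real
# entries table may carry more columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS entries (
    doi              TEXT PRIMARY KEY,
    title            TEXT,
    url              TEXT,
    summary          TEXT,
    categories       TEXT,
    doi_is_surrogate INTEGER NOT NULL DEFAULT 0,
    is_relevant      INTEGER  -- 1 relevant, 0 pruned, NULL unlabeled
);
"""

def label_and_prune(db_path: str, candidate_dois: list[str], selected_dois: set[str]) -> None:
    """Mark selected rows relevant; strip unselected rows to DOI-only."""
    with sqlite3.connect(db_path) as con:
        con.executescript(SCHEMA)
        for doi in candidate_dois:
            if doi in selected_dois:
                con.execute("UPDATE entries SET is_relevant = 1 WHERE doi = ?", (doi,))
            else:
                # DOI-only row: clear metadata, keep the key and the label.
                con.execute(
                    "UPDATE entries SET title = NULL, url = NULL, summary = NULL,"
                    " categories = NULL, is_relevant = 0 WHERE doi = ?",
                    (doi,),
                )
```

Keeping the pruned row (rather than deleting it) preserves the DOI key, so a later re-fetch of the same window recognizes the item as already screened instead of re-ingesting its metadata.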