---
name: sustainability-fulltext-fetch
description: Fetch and persist content for DOI-keyed sustainability RSS entries into a separate fulltext SQLite DB, using OpenAlex/Semantic Scholar API metadata first and webpage fulltext extraction as fallback. Use when building resilient DOI-first content enrichment after relevance labeling.
---

# Sustainability Fulltext Fetch

## Core Goal
- Read relevant DOI entries from the RSS metadata DB.
- Write fetched content into a separate fulltext DB.
- Process only relevant entries (`is_relevant=1`).
- Prefer API metadata retrieval by DOI (OpenAlex first, Semantic Scholar fallback).
- Fall back to webpage fulltext extraction when API metadata is unavailable.
- Persist one content row per DOI in `entry_content`.

## Triggering Conditions
- Receive a request to enrich relevant DOI records with abstract/fulltext content.
- Receive a request to replace webpage-first crawling with API-first enrichment.
- Need retry-safe incremental updates without duplicate rows.

## Workflow
1. Ensure upstream DOI/relevance data exists, then initialize the fulltext DB.

```bash
export SUSTAIN_RSS_DB_PATH="/absolute/path/to/workspace-rss-bot/sustainability_rss.db"
export SUSTAIN_FULLTEXT_DB_PATH="/absolute/path/to/workspace-rss-bot/sustainability_fulltext.db"
python3 scripts/fulltext_fetch.py init-db --content-db "$SUSTAIN_FULLTEXT_DB_PATH"
```

2. Run incremental sync (API first, webpage fallback).

```bash
python3 scripts/fulltext_fetch.py sync \
  --rss-db "$SUSTAIN_RSS_DB_PATH" \
  --content-db "$SUSTAIN_FULLTEXT_DB_PATH" \
  --limit 50 \
  --openalex-email "you@example.com" \
  --api-min-chars 80 \
  --min-chars 300
```

3. Fetch one DOI on demand.

```bash
python3 scripts/fulltext_fetch.py fetch-entry \
  --rss-db "$SUSTAIN_RSS_DB_PATH" \
  --content-db "$SUSTAIN_FULLTEXT_DB_PATH" \
  --doi "10.1038/nature12373"
```

4. Inspect stored content state.

```bash
python3 scripts/fulltext_fetch.py list-content \
  --rss-db "$SUSTAIN_RSS_DB_PATH" \
  --content-db "$SUSTAIN_FULLTEXT_DB_PATH" \
  --status ready \
  --limit 100
```

## Data Contract
- Reads from RSS DB `entries`:
  - `doi`, `doi_is_surrogate`, `is_relevant`, `canonical_url`, `url`, `title`.
- Writes to fulltext DB `entry_content` (primary key `doi`):
  - source URL/status/extractor
  - `content_kind` (`abstract` or `fulltext`)
  - `content_text`, `content_hash`, `content_length`
  - retry fields and timestamps.
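
The `content_hash` and `content_length` fields can be derived from `content_text` alone. The following is a minimal sketch, assuming a SHA-256 digest over the stripped UTF-8 text and a length counted in characters; the actual algorithm and normalization used by `scripts/fulltext_fetch.py` are not stated here, and `content_fingerprint` is a hypothetical helper name.

```python
import hashlib


def content_fingerprint(text: str) -> tuple[str, int]:
    """Return (content_hash, content_length) for extracted text.

    Assumption: SHA-256 over whitespace-stripped UTF-8 text. The real
    script may hash or normalize differently.
    """
    normalized = text.strip()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return digest, len(normalized)
```

A stable hash of this kind lets a re-fetch detect unchanged content without comparing full text.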

## Extraction Priority
1. API metadata path:
   - OpenAlex by DOI.
   - Semantic Scholar fallback by DOI.
   - If the returned text meets `--api-min-chars`, persist as `content_kind=abstract`.

2. Webpage fallback path:
   - Try `canonical_url` first, then `url`.
   - Extract with `trafilatura` when available, else the built-in HTML parser.
   - Persist as `content_kind=fulltext`.
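
The priority order above can be sketched as a small selection function. This is an illustration only, not the script's implementation: `pick_content` and its injected fetcher callables are hypothetical, network error handling is omitted, and the default thresholds simply mirror the values in the example `sync` command.

```python
from typing import Callable, Iterable, Optional, Tuple

# A fetcher takes a DOI or URL and returns extracted text, or None.
Fetcher = Callable[[str], Optional[str]]


def pick_content(
    doi: str,
    api_fetchers: Iterable[Fetcher],
    page_fetcher: Fetcher,
    urls: Iterable[Optional[str]],
    api_min_chars: int = 80,
    min_chars: int = 300,
) -> Optional[Tuple[str, str]]:
    """Return (content_kind, text) from the first acceptable source."""
    # 1. API metadata path: try each API in order (e.g. OpenAlex, then S2).
    for fetch in api_fetchers:
        text = fetch(doi)
        if text and len(text.strip()) >= api_min_chars:
            return "abstract", text.strip()
    # 2. Webpage fallback path: canonical_url first, then url.
    for url in urls:
        if not url:
            continue
        text = page_fetcher(url)
        if text and len(text.strip()) >= min_chars:
            return "fulltext", text.strip()
    # Nothing met its threshold; the caller records a failure.
    return None
```

Passing the fetchers in as callables keeps the ordering logic testable without network access.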

## Update Semantics
- Upsert key: `doi`.
- Success: status `ready`, reset retry counters.
- Failure with existing ready row: keep old content, record latest error.
- Failure without ready row: set `status=failed`, increment retry state.
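
These rules map naturally onto SQLite's `INSERT ... ON CONFLICT DO UPDATE`. The sketch below assumes a simplified `entry_content` table; the column names `last_error` and `retry_count` are hypothetical stand-ins (see `references/schema.md` for the real schema), and leaving the retry counter untouched when a ready row fails is likewise an assumption.

```python
import sqlite3

# Hypothetical, reduced schema for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS entry_content (
    doi TEXT PRIMARY KEY,
    status TEXT NOT NULL,
    content_text TEXT,
    last_error TEXT,
    retry_count INTEGER NOT NULL DEFAULT 0
)
"""


def record_success(conn: sqlite3.Connection, doi: str, text: str) -> None:
    # Success: store content, mark ready, reset retry state.
    conn.execute(
        """
        INSERT INTO entry_content (doi, status, content_text, last_error, retry_count)
        VALUES (?, 'ready', ?, NULL, 0)
        ON CONFLICT(doi) DO UPDATE SET
            status = 'ready',
            content_text = excluded.content_text,
            last_error = NULL,
            retry_count = 0
        """,
        (doi, text),
    )


def record_failure(conn: sqlite3.Connection, doi: str, error: str) -> None:
    # Failure: always record the latest error. A ready row keeps its
    # status and content; any other row becomes failed and counts a retry.
    conn.execute(
        """
        INSERT INTO entry_content (doi, status, last_error, retry_count)
        VALUES (?, 'failed', ?, 1)
        ON CONFLICT(doi) DO UPDATE SET
            last_error = excluded.last_error,
            status = CASE WHEN entry_content.status = 'ready'
                          THEN 'ready' ELSE 'failed' END,
            retry_count = CASE WHEN entry_content.status = 'ready'
                               THEN entry_content.retry_count
                               ELSE entry_content.retry_count + 1 END
        """,
        (doi, error),
    )
```

Because `doi` is the primary key, repeated runs update the same row instead of inserting duplicates. Upsert syntax requires SQLite 3.24 or newer.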

## Configurable Parameters
- `--rss-db`
- `--content-db`
- `SUSTAIN_RSS_DB_PATH`
- `SUSTAIN_FULLTEXT_DB_PATH`
- `--limit`
- `--force`
- `--only-failed`
- `--refetch-days`
- `--timeout`
- `--max-bytes`
- `--min-chars`
- `--openalex-email` / `OPENALEX_EMAIL`
- `--s2-api-key` / `S2_API_KEY`
- `--api-timeout`
- `--api-min-chars`
- `--disable-api-metadata`
- `--max-retries`
- `--retry-backoff-minutes`
- `--user-agent`
- `--disable-trafilatura`
- `--fail-on-errors`

## Error Handling
- Missing DOI-keyed `entries` table: stop with actionable message.
- RSS DB and fulltext DB path collision: fail fast and require separate files.
- API/network/HTTP failures: record failures and continue queue.
- Webpage non-text content: mark failed for that DOI.
- Short extraction: fail when the text is below the length threshold, to avoid storing low-quality content.
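
The path-collision check can be approximated by comparing resolved paths before opening either database. This is a sketch under assumptions: `ensure_distinct_dbs` is a hypothetical helper, and the script's own check and error message may differ.

```python
import os


def ensure_distinct_dbs(rss_db: str, content_db: str) -> None:
    """Raise ValueError if both arguments resolve to the same file."""
    # realpath collapses "..", "." and symlinks so aliases compare equal.
    rss_real = os.path.realpath(os.path.expanduser(rss_db))
    content_real = os.path.realpath(os.path.expanduser(content_db))
    if rss_real == content_real:
        raise ValueError(
            f"RSS DB and fulltext DB resolve to the same file: {rss_real}. "
            "Use separate files for --rss-db and --content-db."
        )
```

This comparison does not account for case-insensitive file systems or hard links, so it catches the common misconfiguration rather than every alias.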

## References
- `references/schema.md`
- `references/fetch-rules.md`

## Assets
- `assets/config.example.json`

## Scripts
- `scripts/fulltext_fetch.py`