# AI pipeline
Reva's intelligence pipeline is local-first and keyless by default. Native parsers, bundled OCR, deterministic extraction, learned schema mapping, reconciliation, review, and export all run inside the .NET host. Optional Docling and optional LLM-assisted extraction are additive paths, not the core.
## Pipeline overview
```mermaid
flowchart LR
Upload["Upload
file or .eml/.msg"] --> Store["Store + SHA-256"]
Store --> Route["ParserRouter
content-aware routing"]
Route --> Parse["Native parser or PaddleOCR"]
Parse --> Classify["Document classifier"]
Classify --> Extract["Reinsurance field extractor
computed confidence"]
Extract --> Map["Schema mapping
learned → alias → fuzzy → unmapped"]
Map --> Rec["Reconciliation
stated vs computed totals"]
Rec --> Review["Split-view review
source citations + exceptions"]
Review --> Export["CSV / Excel / JSON
templates + preview"]
Review --> Learn["Analyst mapping corrections"]
Learn --> Map
```
## Intake and parser routing
`DocumentWorkflow` stores the upload, hashes it, and passes the file through `ParserRouter`. Routing is based on sniffed content and parser capability, not extension alone.
| Input | Runtime parser |
|:---|:---|
| TXT / Markdown / CSV | Built-in parser with encoding detection. |
| DOCX / PPTX | `DocumentFormat.OpenXml`. |
| XLSX | `ClosedXML`. |
| EML | `MimeKit`, including body and recursive attachments. |
| MSG | `MSGReader`. |
| Digital PDF | `PdfPig`. |
| Images / scanned PDFs | `Sdcb.PaddleOCR` with bundled PP-OCR V5 models. |
| Unknown / binary | Best-effort visible-text fallback, low confidence, never a workflow error. |
| Optional richer path | Python + Docling only when explicitly installed and enabled. |
## OCR and geometry
The OCR path runs entirely on the machine:
- No Python, cloud account, or OCR API key is required.
- Scanned PDFs are rasterized page by page and fed to the same PaddleOCR engine used for images.
- The parser captures per-line text, confidence, normalized bounding boxes, and polygons.
- Review overlays use the normalized geometry to highlight source regions as the user hovers or focuses fields.
## Extraction and confidence
The classifier and field extractor target technical accounts, bordereaux, and statements of account. Confidence is computed from how the value was located and blended with a domain validation check. Reva does not assign fixed confidence constants to make fields look better.
When an analyst corrects a field, the field becomes **Reviewed**. That is separate from machine confidence and keeps audit semantics honest.
## Schema mapping
Schema mapping turns sender-specific headers into Reva's canonical reinsurance fields.
```mermaid
flowchart TB
Header["Source header + value"] --> Learned{"Learned sender/domain override?"}
Learned -->|yes| UseLearned["source = learned
high precedence"]
Learned -->|no| Alias{"Static reinsurance alias?"}
Alias -->|yes| UseAlias["source = alias"]
Alias -->|no| Fuzzy{"Bounded fuzzy match?"}
Fuzzy -->|yes| UseFuzzy["source = fuzzy
confidence bounded"]
Fuzzy -->|no| Unmapped["source = unmapped
low-confidence review row"]
UseLearned --> Record["Persist source header, canonical target, normalized value, confidence, source"]
UseAlias --> Record
UseFuzzy --> Record
Unmapped --> Record
Correction["Analyst correction"] --> Learn["Persist sender-specific EF rule"] --> Learned
```
Every mapping records source header, canonical target, normalized value, confidence, and source (`learned`, `alias`, `fuzzy`, or `unmapped`). Analyst corrections are persisted as sender-specific EF overrides and take precedence on the next document from that sender or email domain.
## Reconciliation
Reva reconciles documents that state headline figures and also carry line items. The engine compares detected stated values to computed expected values:
| Check | Detected | Expected |
|:---|:---|:---|
| Money fields | Stated premium, claims, commission, or balance. | Sum of the corresponding line-item column, with configurable tolerance. |
| Cession rate | Stated rate. | Line-item cession rate or computed share. |
| Line of business | Stated class/line. | Line-item class/line text, compared by token agreement. |
Each disagreement becomes a field-level exception with:
- field name;
- **Detected** value;
- **Expected** computed value;
- agreement score in `[0,1]`;
- reconciliation flag;
- tolerance used for money comparisons.
Nothing is hardcoded: figures and scores are derived from the document content.
## Native assistant chat
The assistant is native .NET and keeps the frontend transport unchanged:
- `POST /api/agent` streams AI-SDK UI-message-stream SSE.
- `@ai-sdk/react` `useChat` uses `DefaultChatTransport` against the same-origin endpoint.
- `Microsoft.Extensions.AI` and `Microsoft.Extensions.AI.OpenAI` talk to the local Ollama OpenAI-compatible endpoint at `http://localhost:11434/v1`.
- The default local model is `qwen3-vl:8b`.
- A dummy local key keeps the OpenAI-compatible client keyless.
- `GET /api/agent/status` reports whether Ollama is running and whether the model is present.
- The host best-effort auto-starts `ollama serve` when Ollama is installed.
A `FunctionInvokingChatClient` runs a bounded automatic tool loop over the real workflow:
| Tool | Purpose |
|:---|:---|
| `list_documents` | Summarize the current work queue. |
| `get_document` | Read a document's extracted fields and exceptions by id. |
| `reconcile` | Explain reconciliation checks and exceptions. |
| `explain_field` | Explain where a field value came from, including citations when available. |
If the local model is absent, chat degrades to a clear local-model-unavailable message. Ingestion, extraction, reconciliation, review, and export continue working.
## Export
Export supports CSV, Excel, and JSON. Templates have full CRUD, duplication, live preview, a default-template setting, and built-in shapes for canonical exports and Lloyd's CRS-oriented output.