# Architecture

## System Design

```
Upload → OCR → Bedrock KB (embeddings + indexing)
                            ↓
        UI/Chat ←→ Query Bedrock KB
```

**Principles:**
- Serverless (auto-scaling, no servers)
- Cost-optimized (S3 vectors ~$1/mo vs OpenSearch $50+/mo)
- Error handling (DLQ, 3x retry), CloudWatch metrics

## Components

| Component | Purpose |
|-----------|---------|
| DetectFileType Lambda | Detect file type, count pages, and route to appropriate processor |
| ProcessDocument Lambda | OCR extraction (Textract/Bedrock) for PDF/images |
| ProcessMedia Lambda | Video/audio transcription via AWS Transcribe, 30s segmentation |
| ProcessText Lambda | Text extraction for HTML, CSV, JSON, XML, EML, EPUB, DOCX, XLSX |
| EnqueueBatches Lambda | Queue batch jobs to SQS (internal) |
| BatchProcessor Lambda | Process 10-page batches (max 10 concurrent, internal) |
| CombinePages Lambda | Merge partial outputs into final document (internal) |
| ProcessZip Lambda | Handle ZIP batch uploads (internal) |
| IngestToKB Lambda | Trigger Bedrock KB ingestion (Nova Multimodal embeddings) |
| IngestMedia Lambda | Ingest transcribed media segments to KB |
| QueryKB Lambda | Query documents, chat with sources |
| SearchKB Lambda | Direct KB search (no chat context) |
| ProcessImage Lambda | Image ingestion with captions |
| Scrape Lambdas | Web scraping pipeline (start/discover/process/status) |
| ReindexKB Lambda | Orchestrate KB reindexing with new metadata settings |
| MetadataAnalyzer Lambda | Sample KB vectors and generate filter examples |
| SyncCoordinator Lambda | Coordinate KB sync operations (internal) |
| SyncStatusChecker Lambda | Check KB sync completion status (internal) |
| BudgetSync Lambda | Sync AWS Budget data (internal) |
| StartCodeBuild Lambda | Trigger web component builds (internal) |
| ConfigurationResolver Lambda | Resolve DynamoDB configuration |
| AppSyncResolvers Lambda | GraphQL resolver implementations |
| ApiKeyResolver Lambda | API key validation and management |
| AdminUserProvisioner Lambda | Idempotent Cognito admin user provisioning (CloudFormation custom resource) |
| InitialSync Lambda | Trigger initial KB ingestion on stack creation (CloudFormation custom resource) |
| DlqReplay Lambda | Move messages from DLQ back to source queue (manual trigger) |
| QueueProcessor Lambda | Process SQS queue messages (internal) |
| MoveVideo Lambda | Move video files between S3 locations (internal) |
| Step Functions | Orchestrate document/scrape/reindex workflows |
| Bedrock KB | Vector storage & retrieval (S3 backend) |
| S3 | File storage (input/, output/, images/) |
| DynamoDB | Document tracking, config, conversations, scrape jobs |
| AppSync | GraphQL API with subscriptions |
| React UI | Web dashboard (Cloudscape) |
| ragstack-chat | AI chat web component |

## Design Decisions

### S3 Vectors Cost-Performance Trade-Off

RAGStack uses S3 Vectors for ~90% cost savings over traditional vector databases ($46/month vs $660+ for billion vectors). This trade-off introduces:

**Quantization Impact:** 4-bit compression creates ~10% relevancy drop on filtered queries due to quantization noise amplification in smaller candidate pools.

**Solution:** An adaptive boost computes the exact multiplier needed from the score gap between filtered and unfiltered results, capped by `multislice_filtered_boost` (default 1.25). See [METADATA_FILTERING.md](./METADATA_FILTERING.md) for technical details.

## Data Flow

### Document Processing

Documents are automatically routed to the appropriate processor based on file type detection:

```
Upload → DetectFileType → Route by Type:
         │
         ├── Text files (HTML, TXT, CSV, JSON, XML, EML, EPUB, DOCX, XLSX)
         │   └── ProcessText → IngestToKB → Bedrock KB
         │
         ├── OCR files (PDF, images)
         │   └── ProcessDocument → IngestToKB → Bedrock KB
         │
         ├── Media files (MP4, WebM, MP3, WAV, M4A, OGG, FLAC)
         │   └── ProcessMedia → AWS Transcribe → 30s segments → IngestToKB → Bedrock KB
         │
         └── Passthrough (Markdown)
             └── ProcessDocument → IngestToKB → Bedrock KB
```

**Supported File Types:**

| Category | Types | Processing |
|----------|-------|------------|
| **Text** | HTML, TXT, CSV, JSON, XML, EML, EPUB, DOCX, XLSX | Direct text extraction with smart analysis |
| **OCR** | PDF, JPG, PNG, TIFF, GIF, BMP, WebP, AVIF | Textract or Bedrock vision OCR (WebP/AVIF require Bedrock) |
| **Media** | MP4, WebM, MP3, WAV, M4A, OGG, FLAC | AWS Transcribe speech-to-text, 30s segments with timestamps |
| **Passthrough** | Markdown (.md) | Copy directly to output |

**Text Processing:** Content sniffing detects actual file type regardless of extension. Structured formats (CSV, JSON, XML) get smart extraction with schema analysis.

**Large PDFs (>20 pages):**
1. **Upload:** User → S3 input/ → EventBridge → Step Functions
2. **Page Info:** DetectFileType counts pages, creates 10-page batches
3. **Queue:** EnqueueBatches → SQS batch queue
4. **Process:** BatchProcessor Lambda (max 10 concurrent) → partial files
5. **Combine:** Last batch triggers CombinePages → merged output
6. **Indexing:** IngestToKB → Bedrock KB

**95% threshold:** Ingestion proceeds if ≥95% of pages processed successfully. Failed batches retry 3x before DLQ.

### Web Scraping
1. **Start:** User → AppSync → ScrapeStart Lambda → SQS discovery queue
2. **Discover:** ScrapeDiscover finds links → SQS processing queue
3. **Process:** ScrapeProcess fetches content → S3 input/ (.scraped.md)
4. **Index:** Step Functions → ProcessDocument → IngestToKB

### Image Processing
1. **Upload:** User → S3 images/ → EventBridge → ProcessImage
2. **Indexing:** ProcessImage ingests image + caption to Bedrock KB
3. **Cross-modal:** Both visual and text vectors share image_id

### Media Processing (Video/Audio)
1. **Upload:** User → S3 input/ → EventBridge → DetectFileType
2. **Transcribe:** ProcessMedia → AWS Transcribe batch job → transcript with timestamps
3. **Segment:** Transcript split into 30-second chunks
4. **Metadata:** Each segment tagged with `timestamp_start`, `timestamp_end`, `speaker` (if diarization enabled)
5. **Indexing:** Segments ingested to Bedrock KB with timestamp metadata
6. **Query:** Sources include timestamp ranges, URLs with `#t=start,end` fragment for HTML5 playback

**Speaker diarization:** When enabled, Transcribe identifies up to 10 speakers. Each segment tracks the primary speaker for filtering.

**Source format:** Chat responses show timestamps like "1:30-2:00" with clickable links that open the media at that position.

### Knowledge Base Reindex
1. **Trigger:** User → AppSync → startReindex mutation → Step Functions
2. **Init:** Create new S3 Vectors bucket + Knowledge Base
3. **Process:** Map state iterates documents, regenerates metadata, ingests to new KB
4. **Finalize:** Update SSM parameter to new KB ID
5. **Cleanup:** Delete old KB and S3 Vectors bucket

**Note:** Reindex regenerates metadata only - does NOT re-run OCR/text extraction.

### Chat Query
1. **Query:** User → AppSync → QueryKB Lambda
2. **Quota Check:** Atomic DynamoDB transaction (global + per-user limits)
3. **History:** Load last 5 conversation turns for context
4. **Retrieve:** bedrock_agent.retrieve() → top 5 KB results
5. **Generate:** bedrock_runtime.converse() → answer with citations
6. **Sources:** KB URIs resolved to original files via tracking table
7. **Store:** Save turn to conversation history (14-day TTL)

**Media sources:** Results from video/audio include `timestampStart`, `timestampEnd` (seconds), `timestampDisplay` ("1:30-2:00"), and `segmentUrl` with `#t=start,end` fragment for direct playback positioning.

### Real-time Updates
All state changes publish via GraphQL subscriptions:
- `onDocumentUpdate` - Document processing progress
- `onImageUpdate` - Image processing progress
- `onScrapeUpdate` - Web scraping progress
- `onReindexUpdate` - Knowledge Base reindex progress

UI subscribes on load, updates automatically without polling.

## Architecture Decisions

**Why SAM?** Local testing, simpler Lambda packaging

**Why S3 vectors?** ~$1/month vs $50+/month for OpenSearch

**Why DynamoDB config?** Changes apply immediately, no redeployment

**Why shared library?** `lib/ragstack_common/` eliminates duplication

**Error handling:** Lambda retry → Bedrock retry → DLQ

## Security

- HTTPS/TLS everywhere
- S3 SSE, DynamoDB encryption
- Cognito auth + optional MFA
- API key for programmatic access (all operations)
- API key regeneration (manual, via Settings UI)
- Least-privilege IAM
- Public S3 blocked

## API Access

All operations support both API key and Cognito authentication:

| Operation | Endpoint | Auth |
|-----------|----------|------|
| Search KB | `searchKnowledgeBase` | API key / Cognito |
| Chat | `queryKnowledgeBase` | API key / Cognito |
| Upload docs | `createUploadUrl` | API key / Cognito |
| Upload images | `createImageUploadUrl`, `submitImage` | API key / Cognito |
| Scrape | `startScrape`, `getScrapeJob` | API key / Cognito |

**In-app documentation:** Each UI tab includes an expandable section with GraphQL queries and code examples.

## Performance

**10-page PDF:**
- Upload: <5 sec
- OCR: 2-15 min
- KB Sync: 1-10 min

**Optimization:**
- Text-native PDFs: 50% faster (skip OCR)
- Smaller docs: scales linearly

## Cost

1000 docs/month (5 pages each):
- Textract + Haiku: **$7-10/month**
- Bedrock OCR + Haiku: **$25-75/month**

See [Configuration](CONFIGURATION.md)

## Stack

- **Infrastructure:** SAM, Lambda, Step Functions
- **Storage:** S3, DynamoDB, Bedrock KB
- **APIs:** AppSync, Bedrock, Textract, Transcribe
- **Frontend:** React 19, Vite, Cloudscape
- **Chat:** ragstack-chat web component