# GBrain Infrastructure Layer

The shared foundation that all skills, recipes, and integrations build on.

## Data Pipeline

```
INPUT (markdown files, git repo)
  ↓
FILE RESOLUTION (local → .redirect → .supabase → error)
  ↓
MARKDOWN PARSER (gray-matter frontmatter + body)
  → compiled_truth + timeline separation
  ↓
CONTENT HASH (SHA-256 idempotency check — skip if unchanged)
  ↓
CHUNKING (3 strategies, configurable)
  ├── Recursive: 300-word chunks, 50-word overlap, 5-level delimiter hierarchy
  ├── Semantic: embed sentences, cosine similarity, Savitzky-Golay smoothing
  └── LLM-guided: Claude Haiku identifies topic shifts in 128-word candidates
  ↓
EMBEDDING (OpenAI text-embedding-3-large, 1536 dimensions)
  → batch 100, exponential backoff, non-fatal if fails
  ↓
DATABASE TRANSACTION (atomic: page + chunks + tags + version)
  ↓
SEARCH (hybrid, available immediately)
```

## Search Architecture

GBrain uses Reciprocal Rank Fusion (RRF) to merge vector and keyword search:

```
User Query
  ↓
EXPANSION (optional: Claude Haiku generates 2 alternative phrasings)
  ↓
  ├── VECTOR SEARCH (pgvector HNSW, cosine distance)
  │     → 2x limit results per query variant
  │
  └── KEYWORD SEARCH (PostgreSQL tsvector, ts_rank)
        → 2x limit results
  ↓
RRF MERGE (score = Σ(1/(60 + rank)), balances both fairly)
  ↓
4-LAYER DEDUP
  ├── Best 3 chunks per page (source dedup)
  ├── Jaccard similarity > 0.85 (text dedup)
  ├── No type exceeds 60% (diversity)
  └── Max 2 chunks per page (page cap)
  ↓
TOP N RESULTS (default 20)
```

## Key Components

| File | Purpose |
|------|---------|
| `src/core/engine.ts` | Pluggable engine interface (BrainEngine) |
| `src/core/postgres-engine.ts` | Postgres + pgvector implementation |
| `src/core/import-file.ts` | importFromFile + importFromContent pipeline |
| `src/core/sync.ts` | Git-based incremental change detection |
| `src/core/markdown.ts` | YAML frontmatter + compiled_truth/timeline parsing |
| `src/core/embedding.ts` | OpenAI embedding with batch, retry, backoff |
| `src/core/chunkers/recursive.ts` | Base chunker (300w, 5-level delimiters) |
| `src/core/chunkers/semantic.ts` | Embedding-based topic boundary detection |
| `src/core/chunkers/llm.ts` | Claude Haiku guided chunking |
| `src/core/search/hybrid.ts` | RRF merge of vector + keyword |
| `src/core/search/dedup.ts` | 4-layer result deduplication |
| `src/core/search/expansion.ts` | Multi-query expansion via Claude Haiku |
| `src/core/storage.ts` | Pluggable storage (S3, Supabase, local) |
| `src/core/operations.ts` | Contract-first operation definitions (31 ops) |
| `src/schema.sql` | Full DDL (10 tables, RLS, tsvector, HNSW) |

## Schema Overview

10 tables in Postgres:

- **pages** — slug (unique), type, title, compiled_truth, timeline, frontmatter (JSONB)
- **content_chunks** — pgvector 1536-dim embedding, chunk_source (compiled_truth|timeline)
- **links** — typed edges (knows, works_at, invested_in, founded, etc.)
- **tags** — many-to-many page tagging
- **timeline_entries** — structured events (date, source, summary, detail)
- **page_versions** — snapshot history for diff/revert
- **raw_data** — sidecar JSON from external APIs (preserves provenance)
- **files** — binary attachments in storage backend
- **ingest_log** — audit trail of import operations
- **config** — brain-level settings (version, embedding model, chunk strategy)

Full-text search uses weighted tsvector: title (A), compiled_truth (B), timeline (C).
Vector search uses HNSW index with cosine distance on content_chunks.embedding.

## The Thin Harness Principle

GBrain is the deterministic layer. Skills and recipes are the latent space layer.

See [Thin Harness, Fat Skills](../ethos/THIN_HARNESS_FAT_SKILLS.md) for the full
architecture philosophy.

- **GBrain CLI** = thin harness (same input → same output)
- **Skills** (ingest, query, maintain, enrich, briefing, migrate, setup) = fat skills
- **Recipes** (voice-to-brain, email-to-brain) = fat skills that install infrastructure

The agent reads the skill/recipe and uses GBrain's deterministic tools to do the work.