# Storage Tiering: db-tracked vs db-only directories

## Overview

GBrain supports storage tiering to separate version-controlled content from bulk machine-generated data. This prevents git repositories from becoming bloated with large amounts of automatically generated content while still preserving it in the database.

> Note on naming: prior to v0.22.11 the keys were `git_tracked` / `supabase_only`. The canonical names are now `db_tracked` / `db_only`, which are engine-agnostic and work on both PGLite and Postgres. The deprecated keys still load with a once-per-process warning. Once that path lands, `gbrain doctor --fix` will rename them automatically.

## Configuration

Add a `storage` section to your `gbrain.yml` file in the brain repository root:

```yaml
storage:
  # Directories that are version-controlled (human-edited, committed to git).
  db_tracked:
    - people/
    - companies/
    - deals/
    - concepts/
    - yc/
    - ideas/
    - projects/

  # Directories persisted via the brain database only (bulk machine-generated
  # content). Written to disk as a local cache but not committed to git;
  # `gbrain sync` auto-manages .gitignore for these paths. `gbrain export
  # --restore-only` repopulates missing files from the database.
  db_only:
    - media/x/
    - media/articles/
    - meetings/transcripts/
```

Path requirements:

- Each directory should end with `/` (the canonical form). The validator adds a missing trailing slash automatically and prints a one-time info note showing what changed.
- A directory cannot appear in both tiers. That is a tier-overlap error, and `loadStorageConfig` throws `StorageConfigError`. Edit `gbrain.yml` to remove the overlap and try again.

## Behavior changes

### 1. `gbrain sync` — automatic .gitignore management

When storage configuration is present, `gbrain sync` automatically manages `.gitignore` entries on every successful sync:

- Adds missing `db_only` directory patterns to `.gitignore`.
- Idempotent — re-running adds no duplicate entries.
- Stable comment header so the managed block is grep-able.
- Skipped on `--dry-run` (don't mutate disk in preview mode).
- Skipped on `blocked_by_failures` status (sync state is inconsistent).
- Skipped when the repo is a git submodule (`.git` is a file, not a directory), because submodule `.gitignore` changes don't survive parent updates. A warning explains the skip.
- Skipped entirely when `GBRAIN_NO_GITIGNORE=1` is set (escape hatch for shared-repo setups where a maintainer wants gbrain to leave .gitignore alone).
- Failures (write permission denied, etc.) are caught and logged; they never crash sync.
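
The skip rules above can be read as a single decision function. The following is a minimal sketch, not gbrain's actual code: `SyncContext`, `gitignoreSkipReason`, and the order in which the checks run are all assumptions made for illustration.

```typescript
// Hypothetical model of the inputs the skip rules depend on.
interface SyncContext {
  hasStorageConfig: boolean; // a `storage` section exists in gbrain.yml
  dryRun: boolean;           // sync was invoked with --dry-run
  status: string;            // sync result, e.g. "blocked_by_failures"
  dotGitIsFile: boolean;     // `.git` is a file, i.e. the repo is a submodule
  env: Record<string, string | undefined>;
}

// Returns the reason .gitignore management is skipped, or null to proceed.
// The check order is an assumption; the documented rules do not specify one.
function gitignoreSkipReason(ctx: SyncContext): string | null {
  if (!ctx.hasStorageConfig) return "no storage configuration";
  if (ctx.env["GBRAIN_NO_GITIGNORE"] === "1") return "GBRAIN_NO_GITIGNORE=1";
  if (ctx.dryRun) return "--dry-run";
  if (ctx.status === "blocked_by_failures") return "blocked_by_failures";
  if (ctx.dotGitIsFile) return "git submodule";
  return null;
}
```

A caller would update `.gitignore` only when the function returns `null`, and would wrap the write in a try/catch so that a permission error is logged rather than propagated.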

Example `.gitignore` addition:

```gitignore
# Auto-managed by gbrain (db_only directories)
media/x/
media/articles/
meetings/transcripts/
```
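
To illustrate what "idempotent" and "stable comment header" mean in practice, here is a hedged sketch of a merge over the `.gitignore` text. The function name `mergeGitignore` is hypothetical, and the strategy (append a block if the header is absent, otherwise extend the existing block) is an assumption about one reasonable way to do it, not a description of gbrain's implementation.

```typescript
// Header text taken from the example above.
const MANAGED_HEADER = "# Auto-managed by gbrain (db_only directories)";

// Returns the new .gitignore content. Calling it again with the same
// directories returns the input unchanged.
function mergeGitignore(existing: string, dbOnlyDirs: string[]): string {
  const lines = existing === "" ? [] : existing.replace(/\n$/, "").split("\n");
  const present = new Set(lines.map((line) => line.trim()));
  const missing = dbOnlyDirs.filter((dir) => !present.has(dir));
  if (missing.length === 0) return existing; // nothing to add

  const headerIndex = lines.indexOf(MANAGED_HEADER);
  if (headerIndex === -1) {
    // No managed block yet: append one, separated by a blank line.
    if (lines.length > 0 && lines[lines.length - 1] !== "") lines.push("");
    lines.push(MANAGED_HEADER, ...missing);
  } else {
    // Extend the existing block: insert before the first blank line after it.
    let end = headerIndex + 1;
    while (end < lines.length && lines[end].trim() !== "") end++;
    lines.splice(end, 0, ...missing);
  }
  return lines.join("\n") + "\n";
}
```

Because entries are compared against every existing line, a pattern the user already added by hand is not duplicated inside the managed block.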

### 2. `gbrain export --restore-only` — repopulate missing db_only files

```bash
# Restore only missing db_only files from the database.
gbrain export --restore-only --repo /path/to/brain

# Filter by page type.
gbrain export --restore-only --type media --repo /path/to/brain

# Filter by slug prefix.
gbrain export --restore-only --slug-prefix media/x/ --repo /path/to/brain

# Combine filters.
gbrain export --restore-only --type media --slug-prefix media/x/ --repo /path/to/brain
```

The `--restore-only` flag:

- Resolves repoPath via the chain `--repo` → typed `sources.getDefault()` → hard error.
  Never falls through to the current directory.
- Only exports pages that match `db_only` patterns AND are missing from disk.
- Ideal for container restart recovery and fresh clones.
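
The selection and path-resolution rules above can be sketched as two small functions. Everything here is illustrative: `Page`, `RestoreFilters`, `selectPagesToRestore`, and `resolveRepoPath` are hypothetical names, the default source is modeled as a plain parameter standing in for `sources.getDefault()`, and matching `db_only` patterns by slug prefix is an assumption.

```typescript
interface Page {
  slug: string; // e.g. "media/x/tweet-1234567890"
  type: string; // e.g. "media"
}

interface RestoreFilters {
  type?: string;       // mirrors --type
  slugPrefix?: string; // mirrors --slug-prefix
}

// --repo wins, then the default source; otherwise fail hard.
// The current directory is never used as a fallback.
function resolveRepoPath(repoFlag?: string, defaultSource?: string): string {
  if (repoFlag) return repoFlag;
  if (defaultSource) return defaultSource;
  throw new Error("No repo path: pass --repo or configure a default source");
}

// Keeps pages that are in a db_only directory, pass the optional filters,
// and are missing from disk according to the supplied predicate.
function selectPagesToRestore(
  pages: Page[],
  dbOnlyDirs: string[],
  existsOnDisk: (slug: string) => boolean,
  filters: RestoreFilters = {}
): Page[] {
  return pages.filter((page) => {
    if (!dbOnlyDirs.some((dir) => page.slug.startsWith(dir))) return false;
    if (filters.type !== undefined && page.type !== filters.type) return false;
    if (filters.slugPrefix !== undefined && !page.slug.startsWith(filters.slugPrefix)) return false;
    return !existsOnDisk(page.slug);
  });
}
```

Passing `existsOnDisk` as a predicate keeps the selection logic independent of the filesystem, which makes it easy to exercise in tests.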

### 3. `gbrain storage status` — storage-tier health dashboard

```bash
# Human-readable status.
gbrain storage status --repo /path/to/brain

# JSON output for scripts and orchestrators.
gbrain storage status --repo /path/to/brain --json
```

Output includes:

- Total page counts by storage tier.
- Disk usage breakdown by tier.
- Missing files that need restoration (top 10 shown; full list in `--json`).
- Configuration validation warnings.
- Current tier directory listing.

Example output:

```text
Storage Status
==============

Repository: /data/brain
Total pages: 15,243

Storage Tiers:
-------------
DB tracked:     2,156 pages
DB only:        12,887 pages
Unspecified:    200 pages

Disk Usage:
-----------
DB tracked:     45.2 MB
DB only:        2.1 GB

Missing Files (need restore):
-----------------------------
  media/x/tweet-1234567890
  media/x/tweet-0987654321
  ... and 47 more

Use: gbrain export --restore-only --repo "/data/brain"

Configuration:
--------------
DB tracked directories:
  - people/
  - companies/
  - deals/

DB-only directories:
  - media/x/
  - media/articles/
  - meetings/transcripts/
```
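
The per-tier page counts in the report imply that every page is assigned to exactly one of three buckets. A minimal sketch of that classification follows, assuming prefix matching on the page slug; `classifyTier` and `countByTier` are hypothetical names, and checking `db_only` before `db_tracked` is an arbitrary choice made for this example.

```typescript
type Tier = "db_tracked" | "db_only" | "unspecified";

// Assigns a slug to a tier by directory prefix (assumed matching rule).
function classifyTier(slug: string, dbTracked: string[], dbOnly: string[]): Tier {
  if (dbOnly.some((dir) => slug.startsWith(dir))) return "db_only";
  if (dbTracked.some((dir) => slug.startsWith(dir))) return "db_tracked";
  return "unspecified"; // page lives outside every configured directory
}

// Tallies pages per tier, as in the "Storage Tiers" section of the report.
function countByTier(slugs: string[], dbTracked: string[], dbOnly: string[]): Record<Tier, number> {
  const counts: Record<Tier, number> = { db_tracked: 0, db_only: 0, unspecified: 0 };
  for (const slug of slugs) {
    counts[classifyTier(slug, dbTracked, dbOnly)] += 1;
  }
  return counts;
}
```

The three counts always sum to the total page count, which matches the example output (2,156 + 12,887 + 200 = 15,243).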

## Validation

`loadStorageConfig` runs `normalizeAndValidateStorageConfig` after parsing:

- Auto-fixes (non-fatal; a one-time info note shows what changed):
  - Missing trailing `/` is added: `'media/x'` → `'media/x/'`.
- Throws `StorageConfigError` (the caller sees a clean exit code 1 with an actionable message):
  - Same directory in both `db_tracked` and `db_only` (ambiguous routing).
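
As a rough illustration of the two rules above, the sketch below normalizes trailing slashes and then rejects overlap. It is not the real `normalizeAndValidateStorageConfig`: the function name `normalizeAndValidateSketch`, the `StorageConfig` shape, and the local stand-in for `StorageConfigError` are assumptions made so the example is self-contained.

```typescript
interface StorageConfig {
  db_tracked: string[];
  db_only: string[];
}

// Local stand-in for gbrain's error type, defined here for illustration only.
class StorageConfigError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "StorageConfigError";
  }
}

function normalizeAndValidateSketch(raw: StorageConfig): { config: StorageConfig; notes: string[] } {
  const notes: string[] = []; // one entry per auto-fixed path
  const addSlash = (dir: string): string => {
    if (dir.endsWith("/")) return dir;
    notes.push(`'${dir}' -> '${dir}/'`);
    return dir + "/";
  };

  // Normalize first, so 'media/x' and 'media/x/' count as the same directory.
  const config: StorageConfig = {
    db_tracked: raw.db_tracked.map(addSlash),
    db_only: raw.db_only.map(addSlash),
  };

  const tracked = new Set(config.db_tracked);
  const overlap = config.db_only.filter((dir) => tracked.has(dir));
  if (overlap.length > 0) {
    throw new StorageConfigError(
      `Directories listed in both db_tracked and db_only: ${overlap.join(", ")}`
    );
  }
  return { config, notes };
}
```

Normalizing before the overlap check matters: without it, `media/x` in one tier and `media/x/` in the other would slip through as distinct strings.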

## Use cases

### Brain repository scaling

Suited to brain repositories that grow past 50K-200K+ files, where:

- Core knowledge (people, companies, deals) remains git-tracked.
- Bulk data (tweets, articles, transcripts) moves to db_only.
- Development stays fast with smaller git repos.
- Full data remains available via the database.

### Container-based deployments

Essential for ephemeral container environments:

- Git repo contains only essential files.
- Container restarts don't lose db_only data.
- `gbrain export --restore-only` quickly restores bulk files when needed.
- Local disk acts as a cache layer.

### Multi-environment consistency

Enables consistent data access across environments:

- Development: small git clone, restore bulk data on demand.
- Production: full dataset via the database, selective local caching.
- CI/CD: fast tests with git-tracked data only.

## Migration strategy

1. **Assess current repository**: use `gbrain storage status` to understand current distribution.
2. **Plan directory structure**: identify which directories should be db_tracked vs db_only.
3. **Create `gbrain.yml`**: add storage configuration to the repository root.
4. **Test with dry-run**: `gbrain sync --dry-run` to verify behavior; `.gitignore` is NOT touched on dry-run.
5. **Run a real sync**: `gbrain sync` updates `.gitignore` automatically on success.
6. **Verify restore**: test `gbrain export --restore-only --repo .` against a small db_only directory.

## Best practices

- **Directory naming**: end storage paths with `/` (canonical form). The validator normalizes if you forget.
- **Start small**: begin with clearly machine-generated directories in `db_only`.
- **Address validation errors**: tier overlap is an error, not a warning. Fix it before sync.
- **Test restore**: regularly test `--restore-only` in staging environments.
- **Document decisions**: comment your `gbrain.yml` to explain tier choices.

## PGLite engine note

On the PGLite engine (gbrain's local-only embedded Postgres), the "DB" your db_only pages live in is the same local file gbrain uses for everything else. The `.gitignore` housekeeping still helps (it keeps bulk content out of git history), but nothing is actually offloaded to a separate database. A once-per-process soft warning says so when the PGLite engine is detected. To get full tiering, migrate to Postgres with `gbrain migrate --to supabase`.

## Compatibility

- **Backward compatible**: systems without `gbrain.yml` work unchanged.
- **Progressive enhancement**: add configuration when needed.
- **Database unchanged**: all data remains in the brain database (PGLite or Postgres) regardless of tier.
- **Existing workflows**: all existing `sync` and `export` behavior preserved.
- **Deprecated keys**: `git_tracked` / `supabase_only` still load with a once-per-process warning.
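
For readers curious how deprecated keys can keep loading while warning only once, here is a small sketch. It is an assumption about one possible approach, not gbrain's code: `renameDeprecatedKeys` and the module-level `warnedAboutDeprecatedKeys` flag are hypothetical, and the case where both an old and a new key are present is deliberately not handled.

```typescript
// Old key -> canonical key, as documented in the naming note.
const DEPRECATED_KEYS = new Map<string, string>([
  ["git_tracked", "db_tracked"],
  ["supabase_only", "db_only"],
]);

// Module-level flag: stays true for the life of the process.
let warnedAboutDeprecatedKeys = false;

function renameDeprecatedKeys(
  storage: Record<string, string[]>,
  warn: (message: string) => void
): Record<string, string[]> {
  const deprecated = Object.keys(storage).filter((key) => DEPRECATED_KEYS.has(key));
  if (deprecated.length > 0 && !warnedAboutDeprecatedKeys) {
    warn(`gbrain.yml: deprecated storage keys in use: ${deprecated.join(", ")}`);
    warnedAboutDeprecatedKeys = true;
  }

  const result: Record<string, string[]> = {};
  for (const key of Object.keys(storage)) {
    // Map old names to canonical ones; leave every other key untouched.
    result[DEPRECATED_KEYS.get(key) ?? key] = storage[key];
  }
  return result;
}
```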