# @memberjunction/db-auto-doc
AI-powered database documentation generator for SQL Server, MySQL, and PostgreSQL. Analyzes database structure using LLMs to generate intelligent descriptions, discovers missing relationships, generates reference SQL queries, and saves documentation as database metadata.
## Overview
```mermaid
graph TD
A["CLI / Programmatic API"] --> B["AnalysisOrchestrator"]
B --> C["Database Layer
(Schema Introspection)"]
B --> D["Analysis Engine
(LLM Processing)"]
B --> E["State Manager
(Progress Tracking)"]
B --> F["Guardrails Manager
(Resource Limits)"]
C --> G["SQL Server"]
C --> H["PostgreSQL"]
C --> I["MySQL"]
D --> J["Description Generator"]
D --> K["Relationship Discovery"]
D --> L["Sample Query Generator"]
J --> M["Output Generators"]
M --> N["SQL Extended Properties"]
M --> O["Markdown / HTML / CSV"]
M --> P["Mermaid ERD Diagrams"]
style A fill:#2d6a9f,stroke:#1a4971,color:#fff
style B fill:#7c5295,stroke:#563a6b,color:#fff
style C fill:#2d8659,stroke:#1a5c3a,color:#fff
style D fill:#7c5295,stroke:#563a6b,color:#fff
style E fill:#b8762f,stroke:#8a5722,color:#fff
style F fill:#b8762f,stroke:#8a5722,color:#fff
style M fill:#2d6a9f,stroke:#1a4971,color:#fff
```
## Features
### Core Capabilities
- **AI-Powered Analysis** -- Uses OpenAI, Anthropic, Google, Groq, and other LLM providers to generate intelligent descriptions
- **Iterative Refinement** -- Multi-pass analysis with backpropagation for accuracy
- **Topological Processing** -- Analyzes tables in dependency order for better context
- **Data-Driven** -- Leverages cardinality, statistics, and sample data for insights
- **Convergence Detection** -- Automatically knows when analysis is complete
- **State Tracking** -- Full audit trail of all iterations and reasoning
- **Standalone** -- Works with any database, no MemberJunction required
### Multi-Database Support
- **SQL Server** -- Full support with extended properties
- **PostgreSQL** -- Complete implementation with COMMENT syntax
- **MySQL** -- Full support with column/table comments
- **Unified Interface** -- Single configuration approach across all databases
### Advanced Features
- **Relationship Discovery** -- Automatically detect missing primary and foreign keys using statistical analysis and LLM validation
- **Organic Key Detection** -- Optional pass that finds cross-table *shared-identity* relationships (e.g. the same email or phone stored in different formats across systems) by clustering columns in business-meaning space, and emits MemberJunction PR #2193 organic-key metadata with per-column normalization
- **Sample Query Generation** -- Generate reference SQL queries for AI agents with alignment tracking
- **Ground Truth System** -- Provide authoritative descriptions for schemas, tables, and columns that AI analysis must respect
- **Incremental State Saves** -- State persisted to disk after every phase (introspection, sampling, discovery, per-table analysis)
- **User-Approved Protection** -- Mark tables/columns as approved to prevent AI from overwriting descriptions
- **Enhanced Resume** -- Resume with `--reanalyze-below-confidence` to target low-confidence tables while preserving ground truth
- **Existing DB Descriptions** -- Automatically loads MS_Description (SQL Server), COMMENT (PostgreSQL/MySQL) as initial descriptions
- **CodeGen Integration** -- Emit `additionalSchemaInfo.json` for MemberJunction CodeGen soft FK/PK support
- **Granular Guardrails** -- Multi-level resource controls (run, phase, iteration limits)
- **Resume Capability** -- Pause and resume analysis from checkpoint state files
- **Programmatic API** -- Use as a library in your own applications
- **Extensible** -- Custom database drivers and analysis plugins
### Output Formats
- **SQL Scripts** -- Database-specific metadata scripts (extended properties, comments)
- **Markdown Documentation** -- Human-readable docs with ERD diagrams
- **HTML Documentation** -- Interactive, searchable documentation with embedded CSS/JS
- **CSV Exports** -- Spreadsheet-ready table and column data
- **Mermaid Diagrams** -- Standalone ERD files (.mmd and .html)
- **Analysis Reports** -- Detailed metrics and quality assessments
- **Additional Schema Info** -- CodeGen-compatible JSON for soft FK/PK metadata
## Installation
### Global Installation (Recommended for DBAs)
```bash
npm install -g @memberjunction/db-auto-doc
```
### Within MemberJunction Project
```bash
npm install @memberjunction/db-auto-doc
```
### As a Library Dependency
```bash
npm install @memberjunction/db-auto-doc --save
```
## Benchmark Results
DBAutoDoc has been extensively benchmarked across multiple databases and LLM providers. Full details in the [research paper](research/v1/paper.md).
### AdventureWorks2022 (71 tables, all constraints stripped)
| Model Configuration | PK F1 | FK F1 | Overall |
|---------------------|-------|-------|---------|
| Gemini 3 Flash / 3.1 Pro | 95.0% | 94.2% | 96.1% (A+) |
| Claude Sonnet 4.6 / Opus 4.6 | 95.0% | 93.0% | 96.1% (A+) |
| GPT-5.4-mini / GPT-5.4 | 89.4% | 77.9% | 87.9% (B+) |
### Cross-Database Results (Gemini Flash)
| Database | Tables | PK F1 | FK F1 | Descriptions |
|----------|--------|-------|-------|-------------|
| AdventureWorks | 71 | 95.0% | 94.2% | 99% |
| Chinook | 11 | 95.2% | 95.2% | 100% |
| LousyDB (dark DB) | 20 | 97.6% | 77.2% | 100% |
| Northwind | 13 | 72.7% | 75.0% | 100% |
### Key Discovery Pipeline
DBAutoDoc uses a 4-phase pipeline for relationship discovery:
1. **Statistical Discovery** — PK/FK candidates via uniqueness analysis, value overlap, cardinality ratios. Six deterministic gates filter false positives with zero correct-key loss.
2. **LLM Iterative Analysis** — The LLM creates new FK/PK proposals based on semantic understanding (89% precision on LLM-created FKs). Cross-table statistics provided in prompt context.
3. **Ground Truth Locking** — High-confidence keys (≥90%) become immutable to protect correct discoveries.
4. **Two-Pass Pruning** — A stronger model evaluates remaining candidates per-table, then holistically reviews all proposals.
### Research Paper
See [research/v1/](research/v1/) for the full paper, benchmark results, and comparison scripts:
- **[Paper](research/v1/paper.md)** — "DBAutoDoc: Automated Discovery and Documentation of Undocumented Database Schemas via Statistical Analysis and Iterative LLM Refinement"
- **[Results](research/v1/results/)** — Full exports (HTML, Markdown, ERD, SQL, CSV) for all public benchmarks
## Quick Start
### 1. Initialize
```bash
db-auto-doc init
```
This interactive wizard will:
- Configure database connection
- Set up AI provider (Gemini default, plus OpenAI, Anthropic, Groq, and more)
- Configure guardrails and resource limits
- Optionally add seed context for better analysis
- Create `config.json`
### 2. Analyze
```bash
db-auto-doc analyze
```
This will:
- Introspect your database structure
- Analyze data (cardinality, statistics, patterns)
- Optionally discover missing primary and foreign keys
- Build dependency graph
- Run iterative AI analysis with backpropagation
- Perform sanity checks
- Save state to `db-doc-state.json`
### 3. Generate Sample Queries (Optional)
Generate reference SQL queries for AI agent training:
```bash
# During analysis (if enabled in config)
db-auto-doc analyze # Automatically generates queries
# Or generate separately from existing state
db-auto-doc generate-queries --from-state ./output/run-1/state.json
# With custom settings
db-auto-doc generate-queries --from-state ./output/run-1/state.json \
--queries-per-table 10 \
--max-execution-time 60000 \
--output-dir ./queries
```
This generates:
- **sample-queries.json**: Full query specifications with SQL, metadata, and alignment info
- **sample-queries-summary.json**: Execution statistics, token usage, and cost breakdown
**Configuration Options:**
```json
{
"analysis": {
"sampleQueryGeneration": {
"enabled": true, // Enable sample query generation
"queriesPerTable": 5, // Number of queries per table
"maxTables": 10, // Max tables to process (0 = all tables)
"tokenBudget": 100000, // Token limit (0 = unlimited)
"maxExecutionTime": 30000, // Query validation timeout (ms)
"includeMultiQueryPatterns": true, // Generate related query patterns
"validateAlignment": true, // Validate alignment between queries
"maxRowsInSample": 10, // Sample result rows to capture
"enableQueryFix": true, // Auto-fix failed queries (default: true)
"maxFixAttempts": 3, // Max fix attempts per query (default: 3)
"enableQueryRefinement": true, // LLM-based result analysis (default: false)
"maxRefinementAttempts": 1 // Max refinement iterations (default: 1)
}
}
}
```
**Query Fix & Refinement:**
DBAutoDoc includes two quality control mechanisms to ensure high-quality queries:
**1. Query Fix (Error Recovery)**
- **Purpose**: Automatically fix queries that fail validation (syntax errors, wrong columns, etc.)
- **When**: Runs immediately when a query fails to execute
- **How**: Passes SQL, error message, and schema context to LLM for correction
- **Settings**:
- `enableQueryFix: true` (default: true) - Enable automatic fixes
- `maxFixAttempts: 3` (default: 3) - Maximum retry attempts per query
- **Success Rate**: ~96% of queries validate successfully after fix attempts
**2. Query Refinement (Quality Improvement)**
- **Purpose**: Improve working queries by analyzing actual results
- **When**: Runs after a query successfully validates
- **How**: LLM reviews sample results and suggests improvements (filters, joins, aggregations)
- **Settings**:
- `enableQueryRefinement: false` (default: false) - Enable refinement analysis
- `maxRefinementAttempts: 1` (default: 1) - Maximum refinement iterations
- **Use Cases**: Adding appropriate filters, improving join logic, optimizing aggregations
- **Note**: Increases token usage and generation time but significantly improves query quality
**Processing Flow:**
```
Generate SQL
→ Validate (execute against DB)
→ If Failed: Fix (up to maxFixAttempts) → Re-validate
→ If Passed & Refinement Enabled: Refine → Re-validate → Repeat (up to maxRefinementAttempts)
→ Done
```
**Example Configuration:**
```json
{
"sampleQueryGeneration": {
"enableQueryFix": true, // Fix broken queries
"maxFixAttempts": 3, // Try up to 3 times
"enableQueryRefinement": true, // Improve working queries
"maxRefinementAttempts": 2 // Up to 2 refinement passes
}
}
```
**Key Configuration Settings:**
- **`maxTables`**: Controls table selection
- `10` (default) - Generate queries for top 10 most important tables
- `0` - Generate queries for **all tables** with data
- Custom value - Generate queries for top N tables
- **`tokenBudget`**: Controls LLM token usage and cost
- `100000` (default) - Limit to 100K tokens (~$0.50-1.00 with Gemini Flash)
- `0` - **Unlimited** token budget (useful with `maxTables: 0`)
- Custom value - Set specific token limit for cost control
**Example Configurations:**
*Cost-conscious (default):*
```json
{
"maxTables": 10,
"tokenBudget": 100000
}
```
*Medium coverage (~25 tables):*
```json
{
"maxTables": 25,
"tokenBudget": 500000
}
```
*Complete coverage (all tables):*
```json
{
"maxTables": 0,
"tokenBudget": 0
}
```
**Model Recommendations (based on benchmark results):**
- ✅ **Gemini 3 Flash** — Best overall: 96.1% (A+) on AdventureWorks, 1M context window, lowest cost (~$0.50/100 tables). Recommended default.
- ✅ **Claude Sonnet 4.6** — Highest token efficiency: 96.1% (A+) with only 471K tokens, 100% description coverage. Best quality-per-token.
- ⚠️ **GPT-5.4-mini** — Good but limited: 87.9% (B+), 272K context window causes FK misses on large tables.
- 💡 **For pruning**: Use a stronger model (Gemini 3.1 Pro, Claude Opus 4.6, or GPT-5.4) via `modelOverrides` config for the precision-critical pruning pass.
### 4. Export
```bash
db-auto-doc export --sql --markdown --html --csv --mermaid --schema-info
```
This generates:
- **SQL Script**: Database-specific metadata statements
- **Markdown Documentation**: Human-readable docs with ERD links
- **HTML Documentation**: Interactive searchable documentation
- **CSV Files**: tables.csv and columns.csv for spreadsheet analysis
- **Mermaid Diagrams**: erd.mmd and erd.html for visualization
- **Additional Schema Info**: `additionalSchemaInfo.json` for CodeGen soft FK/PK support
Optionally apply directly to database:
```bash
db-auto-doc export --sql --apply
```
Export only AI-discovered relationships (exclude hard DB constraints):
```bash
db-auto-doc export --schema-info --schema-info-discovered-only --confidence-threshold 70
```
### 5. Export Sample Queries to Metadata (Optional)
Transform generated sample queries into MemberJunction metadata format for syncing to the database:
```bash
# Basic export
db-auto-doc export-sample-queries \
--input ./output/sample-queries.json \
--output ./metadata/queries/.queries.json
# Export with separate SQL files (uses @file: references)
db-auto-doc export-sample-queries \
--input ./output/sample-queries.json \
--output ./metadata/queries/.queries.json \
--separate-sql-files
# Set category and filter by quality
db-auto-doc export-sample-queries \
--input ./output/sample-queries.json \
--output ./metadata/queries/.queries.json \
--category "Database Documentation" \
--status Approved \
--min-confidence 0.8 \
--validated-only
```
**Key Flags:**
- `--input, -i`: Path to sample-queries.json from generate-queries
- `--output, -o`: Output path for .queries.json metadata file
- `--separate-sql-files`: Write SQL to separate files with `@file:` references
- `--sql-dir`: Directory for SQL files (default: "SQL")
- `--category`: Query category for `@lookup:Query Categories.Name=...`
- `--status`: Status to assign (Approved/Pending/Rejected/Expired)
- `--min-confidence`: Minimum confidence threshold (0-1)
- `--validated-only`: Only export successfully validated queries
- `--append`: Append to existing metadata file
**After Export:**
1. Review the generated metadata file
2. Ensure the Query Category exists in the database
3. Run: `npx mj-sync push ./metadata/queries/`
This integrates DBAutoDoc-generated queries with MemberJunction's metadata system for use by AI agents like Skip.
### 6. Check Status
```bash
db-auto-doc status
```
Shows:
- Analysis progress and phase completion
- Convergence status
- Low-confidence tables and columns
- Token usage, cost, and duration
- Guardrail status and warnings
### 7. Resume Analysis
```bash
db-auto-doc analyze --resume ./db-doc-state.json
```
Resume a previous analysis from a checkpoint state file, useful for:
- Continuing after hitting guardrail limits
- Recovering from interruptions
- Incremental database updates
## How It Works
### Topological Analysis
DBAutoDoc processes tables in dependency order:
```
Level 0: Users, Products, Categories (no dependencies)
↓
Level 1: Orders (depends on Users), ProductCategories (Products + Categories)
↓
Level 2: OrderItems (depends on Orders + Products)
↓
Level 3: Shipments (depends on OrderItems)
```
Processing in this order allows child tables to benefit from parent table context.
### Relationship Discovery
For legacy databases missing primary/foreign key constraints, DBAutoDoc can:
- **Detect Primary Keys** using statistical analysis (uniqueness, nullability, cardinality)
- **Find Foreign Keys** using value overlap analysis and naming patterns
- **LLM Validation** to verify discovered relationships make business sense
- **Backpropagation** to refine parent table analysis based on child relationships
Triggered automatically when:
- Tables lack primary key constraints
- Insufficient foreign key relationships detected (below threshold)
### Organic Key Detection (Optional)
Foreign keys require **shared exact values**. Organic keys (MemberJunction [PR #2193](https://github.com/MemberJunction/MJ/pull/2193)) capture the weaker-but-broader condition of **shared identity**: two columns refer to the same real-world thing once each is canonicalized through its own transformation. That joins far more cross-system data than exact-match FKs -- e.g. the same phone number stored as `+1 (555) 123-4567`, `5551234567`, and `555.123.4567` across HubSpot, Zendesk, and QuickBooks.
This pass is **off by default** and runs after PK/FK detection. Enable it with `organicKeyDetection.enabled: true` (it reuses your existing `ai` provider for both the LLM and embeddings). The pipeline:
1. **Prefilter** -- drop columns that can't carry identity (booleans, enums, measures, audit columns, free text, non-value-matchable types).
2. **Normalize to business space** -- an LLM rewrites each column's description into a simplified, business-focused form and tags a *concept name* + a *normalization strategy*. Describing the same thing the same way pulls semantically-equal columns together in embedding space.
3. **Embed & cluster** -- normalized descriptions are embedded (via MemberJunction's `BaseEmbeddings`) and grouped with agglomerative clustering using an auto-calibrated distance threshold.
4. **Concept-name split** -- clusters that merged distinct concepts (e.g. `product_id` vs `product_category_id`) are split apart using the LLM concept tags.
5. **Per-column normalization** -- each emitted organic key carries *its own* normalization function for its column, so differently-formatted columns each canonicalize to a shared form. One function per organic key, not one per cluster.
6. **Transitive bridges** -- the FK graph is walked to find multi-hop paths between clustered tables; bridge-view SQL is generated and the transitive spoke fields are populated so PR #2193's runtime can surface multi-hop matches.
Results are written into `additionalSchemaInfo.json` (see [CodeGen Integration](#codegen-integration-additional-schema-info)) as `OrganicKeys` entries, which CodeGen upserts into `EntityOrganicKey` / `EntityOrganicKeyRelatedEntity` metadata. At runtime, `EntityInfo.BuildOrganicKeyViewParams` applies each side's own normalization expression when matching records.
> **Note:** Embeddings route through MemberJunction's `BaseEmbeddings` drivers (OpenAI, Mistral, Azure, Bedrock, Ollama, local), defaulting to `OpenAIEmbedding`. The detection pass therefore requires an embedding provider with a valid key configured.
### Sample Query Generation
DBAutoDoc can generate reference SQL queries for AI agents, solving the **query alignment problem** where multi-query patterns (summary + detail) have inconsistent filtering logic:
**The Problem:**
```sql
-- Summary query
SELECT COUNT(*) FROM Registrations -- All registrations
-- Detail query
SELECT * FROM Registrations WHERE Status='Attended' -- Only attended
-- Result: Numbers don't match! Bad UX.
```
**The Solution:**
DBAutoDoc generates "gold standard" reference queries with:
- **Explicit Filtering Rules** - Documents filter logic for consistency
- **Alignment Tracking** - Links related queries via `relatedQueryIds`
- **Query Patterns** - Summary+Detail, Multi-Entity Drilldown, Time Series, etc.
- **Validation** - Executes queries and validates results
- **Few-Shot Training** - Use as examples for AI agent prompting
**Two-Prompt Architecture:**
1. **Planning Phase** - AI designs what queries to create (lightweight, ~4K tokens)
2. **Generation Phase** - AI generates SQL for each query individually (~3K tokens each)
This approach prevents JSON truncation issues while maintaining alignment context between related queries.
**Use Cases:**
- Training AI agents like Skip to generate consistent multi-query patterns
- Creating reference examples for few-shot prompting
- Documenting common query patterns for your database
- Validating that related queries use consistent filtering logic
### Backpropagation
After analyzing child tables, DBAutoDoc can detect insights about parent tables and trigger re-analysis:
```
Level 0: "Persons" → Initially thinks: "General contact information"
↓
Level 1: "Students" table reveals Persons.Type has values: Student, Teacher, Staff
↓
BACKPROPAGATE to Level 0: "Persons" → Revise to: "School personnel with role-based typing"
```
### Convergence
Analysis stops when:
1. **No changes** in last N iterations (stability window)
2. **All tables** meet confidence threshold
3. **Max iterations** reached
4. **Guardrail limits** exceeded (tokens, cost, duration)
### Ground Truth System
Provide authoritative, user-supplied documentation that AI analysis must respect. Ground truth descriptions are injected into prompts as `AUTHORITATIVE` context and are never overwritten by AI analysis or backpropagation.
**Configuration:**
```json
{
"seedContext": {
"overallPurpose": "E-commerce platform for retail",
"businessDomains": ["Sales", "Inventory", "Users"],
"industryContext": "Retail"
},
"groundTruth": {
"databaseDescription": "Primary transactional database for online store",
"schemas": {
"dbo": { "description": "Core application schema", "businessDomain": "Core" },
"hr": { "description": "Human resources schema", "businessDomain": "HR" }
},
"tables": {
"dbo.Users": {
"description": "Registered user accounts with authentication data",
"notes": "Primary user table - synced from identity provider",
"businessDomain": "Identity",
"columns": {
"Email": { "description": "Primary email for authentication", "notes": "Must be unique" },
"Status": { "description": "Account status: Active, Suspended, Deleted" }
}
}
}
}
}
```
**How it works:**
- Tables/columns with ground truth are marked `userApproved: true` in state
- AI prompts receive ground truth as `AUTHORITATIVE` context to guide analysis
- Backpropagation skips user-approved tables/columns entirely
- Ground truth tables are never cleared by `--reanalyze-below-confidence`
- `businessDomain` falls back from table → schema if not specified at table level
### Incremental State Saves
State is persisted to disk at every phase boundary, not just at the end. If analysis is interrupted, you can resume from the last checkpoint:
- After database introspection (schema structure captured)
- After loading existing database descriptions
- After data sampling (statistics and cardinality)
- After each table's analysis iteration
- After sanity checks and backpropagation
### User-Approved Description Protection
Tables and columns marked `userApproved: true` (via ground truth or manual approval) are protected from modification:
- **Analysis Engine**: Skips user-approved tables during iterative analysis
- **Column Updates**: Skips user-approved columns when applying AI descriptions
- **Backpropagation**: Skips user-approved tables during re-analysis
- **Reanalysis**: Ground truth items are never cleared by `--reanalyze-below-confidence`
### Enhanced Resume
Resume analysis with fine-grained control:
```bash
# Resume from a previous run's state file
db-auto-doc analyze --resume ./output/run-1/state.json
# Resume and re-analyze tables with confidence below 70%
db-auto-doc analyze --resume ./state.json --reanalyze-below 0.7
# Resume with a different iteration limit
db-auto-doc analyze --resume ./state.json --max-iterations 20
# Resume with additional iterations (e.g., ran 5 originally, now want 3 more)
db-auto-doc analyze --resume ./state.json --max-iterations 3
```
The `--reanalyze-below` flag clears `userApproved` on non-ground-truth tables whose latest confidence is below the threshold, allowing them to be re-analyzed while protecting authoritative descriptions.
### Pruning-Only Mode
Run just the FK pruning pass on an existing state file, skipping discovery and analysis iterations. Useful when you want to apply a stronger model to clean up FK false positives without re-running the full analysis:
```bash
# Run only the FK pruning pass on an existing state
db-auto-doc analyze --resume ./output/run-1/state.json --pruning-only
# Combine with a config that specifies a stronger pruning model
db-auto-doc analyze --resume ./output/run-1/state.json --pruning-only --config ./config-with-pro-pruning.json
```
Requires `--resume` pointing to a state file that has already completed discovery and at least one analysis iteration. The pruning pass uses the `ai.modelOverrides.fkPruning` config to select a potentially stronger model (e.g., Gemini Pro, Claude Opus) for the precision-critical FK filtering.
### CLI Flags Reference
| Flag | Short | Description |
|------|-------|-------------|
| `--config` | `-c` | Path to config file (default: `./config.json`) |
| `--resume` | `-r` | Resume from an existing state file |
| `--max-iterations` | `-n` | Override max iterations from config |
| `--reanalyze-below` | | Re-analyze tables with confidence below threshold (0-1) |
| `--pruning-only` | | Skip discovery/iterations, run only PK/FK pruning (requires `--resume`) |
### Standalone Pruning Command
The `prune` command runs PK/FK pruning on an existing state file without re-running analysis. It provides interactive confirmation before applying changes.
```bash
# Interactive mode (shows proposals, asks for confirmation)
db-auto-doc prune --state ./output/run-1/state.json --config ./config.json
# Silent mode (applies all pruning automatically)
db-auto-doc prune --state ./output/run-1/state.json --config ./config.json --silent
# PK-only or FK-only pruning
db-auto-doc prune --state ./output/run-1/state.json --config ./config.json --pk-only
db-auto-doc prune --state ./output/run-1/state.json --config ./config.json --fk-only
```
| Flag | Description |
|------|-------------|
| `--state` | Path to existing state.json file (required) |
| `--config` | Path to config file with AI settings (required) |
| `--silent` | Skip interactive confirmation |
| `--pk-only` | Only prune primary keys |
| `--fk-only` | Only prune foreign keys |
### CodeGen Integration (Additional Schema Info)
DBAutoDoc automatically emits `additionalSchemaInfo.json` in every analysis run folder. This file is compatible with MemberJunction CodeGen's soft FK/PK system, allowing AI-discovered relationships to be fed back into the metadata layer.
**Output format** (matches CodeGen's `additionalSchemaInfo.json`):
```json
{
"Schemas": [
{
"name": "CRM",
"entityNamePrefix": "CRM: ",
"entityNameSuffix": "",
"description": "Customer relationship management tables"
},
{
"name": "ACCOUNTING",
"entityNamePrefix": "Accounting: ",
"entityNameSuffix": "",
"description": "Financial and accounting tables"
}
],
"CRM": [
{
"TableName": "CUSTOMER",
"PrimaryKey": [
{ "FieldName": "CUSTOMER_ID", "Description": "AI-discovered primary key (confidence: 95%)" }
]
},
{
"TableName": "ORDER",
"ForeignKeys": [
{
"FieldName": "CUSTOMER_ID",
"SchemaName": "CRM",
"RelatedTable": "CUSTOMER",
"RelatedField": "CUSTOMER_ID",
"Description": "AI-discovered relationship (confidence: 88%)"
}
]
}
]
}
```
#### Schemas Section (Entity Name Prefixes)
The top-level `Schemas` array provides entity naming recommendations for CodeGen. For multi-schema databases, each schema gets a prefix to prevent entity name collisions when CodeGen creates MemberJunction entities.
**Prefix generation rules:**
- **Single-schema databases**: No prefix (empty string) — collisions are impossible
- **Known acronyms** (CRM, AI, HR, ERP, ETL, API, etc.): Kept uppercase (e.g., `"CRM: "`)
- **Compound underscore names**: Normalized with title-case (e.g., `AI_COMMERCE_CONTEXT` → `"AI Commerce Context: "`)
- **Regular names**: Title-cased (e.g., `ACCOUNTING` → `"Accounting: "`)
CodeGen reads the `Schemas` array and applies prefixes to `SchemaInfo` records, which are then prepended to entity display names during entity creation. This ensures entities like `CRM.CUSTOMER` and `SALES.CUSTOMER` become `"CRM: Customer"` and `"Sales: Customer"` instead of colliding.
#### PK/FK Data
**Data sources:**
1. Hard FK/PK from database introspection (always included unless `--schema-info-discovered-only`)
2. AI-discovered PK/FK candidates from the relationship discovery phase
**Export options:**
- `--schema-info` -- Generate the file
- `--schema-info-discovered-only` -- Only include AI-discovered keys (exclude hard DB constraints)
- `--schema-info-confirmed-only` -- Only include confirmed candidates (exclude unvalidated)
- `--confidence-threshold N` -- Minimum confidence score for discovered keys
**Usage with CodeGen:**
1. Run DBAutoDoc analysis on a legacy database
2. Copy `additionalSchemaInfo.json` to your MJ project
3. Set `codeGen.additionalSchemaInfo` path in `mj.config.cjs`
4. Run CodeGen to apply soft FK/PK metadata and schema prefixes to entity records
### Granular Guardrails
Multi-level resource controls ensure analysis stays within bounds:
**Run-Level Limits**:
- `maxTokensPerRun`: Total token budget for entire analysis
- `maxDurationSeconds`: Maximum wall-clock time
- `maxCostDollars`: Maximum AI cost
**Phase-Level Limits**:
- `maxTokensPerPhase.discovery`: Budget for relationship discovery
- `maxTokensPerPhase.analysis`: Budget for description generation
- `maxTokensPerPhase.sanityChecks`: Budget for validation
**Iteration-Level Limits**:
- `maxTokensPerIteration`: Per-iteration token cap
- `maxIterationDurationSeconds`: Per-iteration time limit
**Warning Thresholds**:
- Configurable percentage-based warnings (default 80-85%)
- Early notification before hitting hard limits
### Data Analysis
For each column, DBAutoDoc collects:
- **Cardinality**: Distinct value counts
- **Statistics**: Min, max, average, standard deviation
- **Patterns**: Common prefixes, format detection
- **Value Distribution**: Actual enum values if low cardinality
- **Sample Data**: Stratified sampling across value ranges
This rich context enables AI to make accurate inferences.
## Configuration
### SQL Server Configuration
```json
{
"version": "1.0.0",
"database": {
"provider": "sqlserver",
"host": "localhost",
"database": "MyDatabase",
"user": "sa",
"password": "YourPassword",
"encrypt": true,
"trustServerCertificate": false
},
"ai": {
"provider": "openai",
"model": "gpt-5.4-mini",
"apiKey": "sk-...",
"temperature": 0.1,
"maxTokens": 8000,
"effortLevel": 50
},
"analysis": {
"cardinalityThreshold": 20,
"sampleSize": 10,
"includeStatistics": true,
"includePatternAnalysis": true,
"convergence": {
"maxIterations": 50,
"stabilityWindow": 2,
"confidenceThreshold": 0.85
},
"backpropagation": {
"enabled": true,
"maxDepth": 3
},
"sanityChecks": {
"dependencyLevel": true,
"schemaLevel": true,
"crossSchema": true
},
"sampleQueryGeneration": {
"enabled": true,
"queriesPerTable": 5,
"maxExecutionTime": 30000,
"includeMultiQueryPatterns": true,
"validateAlignment": true,
"tokenBudget": 100000,
"maxRowsInSample": 10,
"enableQueryFix": true,
"maxFixAttempts": 3,
"enableQueryRefinement": true,
"maxRefinementAttempts": 1
},
"guardrails": {
"enabled": true,
"stopOnExceeded": true,
"maxTokensPerRun": 250000,
"maxDurationSeconds": 3600,
"maxCostDollars": 50,
"maxTokensPerPhase": {
"discovery": 100000,
"analysis": 150000,
"sanityChecks": 50000
},
"maxTokensPerIteration": 50000,
"maxIterationDurationSeconds": 600,
"warnThresholds": {
"tokenPercentage": 80,
"durationPercentage": 80,
"costPercentage": 80,
"iterationTokenPercentage": 85,
"phaseTokenPercentage": 85
}
},
"relationshipDiscovery": {
"enabled": true,
"triggers": {
"runOnMissingPKs": true,
"runOnInsufficientFKs": true,
"fkDeficitThreshold": 0.4
},
"tokenBudget": {
"ratioOfTotal": 0.4
},
"confidence": {
"primaryKeyMinimum": 0.7,
"foreignKeyMinimum": 0.6,
"llmValidationThreshold": 0.8
},
"sampling": {
"maxRowsPerTable": 1000,
"valueOverlapSampleSize": 100,
"statisticalSignificance": 100,
"compositeKeyMaxColumns": 3
},
"patterns": {
"primaryKeyNames": ["^id$", ".*_id$", "^pk_.*", ".*_key$"],
"foreignKeyNames": [".*_id$", ".*_fk$", "^fk_.*"]
},
"llmValidation": {
"enabled": true,
"batchSize": 10
},
"backpropagation": {
"enabled": true,
"maxIterations": 5
}
},
"organicKeyDetection": {
"enabled": false,
"clusteringSensitivity": "balanced",
"minClusterSize": 2,
"minDistinctTables": 2,
"sampleValueCount": 5,
"refinementConcurrency": 4,
"maxRefinementRetries": 2,
"embedding": {
"provider": "openai",
"model": "text-embedding-3-small"
}
}
},
"output": {
"stateFile": "./db-doc-state.json",
"outputDir": "./output",
"sqlFile": "./output/add-descriptions.sql",
"markdownFile": "./output/database-documentation.md"
},
"schemas": {
"exclude": ["sys", "INFORMATION_SCHEMA"]
},
"tables": {
"exclude": ["sysdiagrams", "__MigrationHistory"]
}
}
```
### PostgreSQL Configuration
```json
{
"version": "1.0.0",
"database": {
"provider": "postgresql",
"host": "localhost",
"port": 5432,
"database": "mydatabase",
"user": "postgres",
"password": "YourPassword",
"ssl": false
},
"ai": {
"provider": "openai",
"model": "gpt-5.4-mini",
"apiKey": "sk-...",
"temperature": 0.1,
"maxTokens": 8000
},
"analysis": {
"cardinalityThreshold": 20,
"sampleSize": 10,
"includeStatistics": true,
"guardrails": {
"enabled": true,
"maxTokensPerRun": 250000
}
},
"output": {
"stateFile": "./db-doc-state.json",
"outputDir": "./output",
"sqlFile": "./output/add-descriptions.sql",
"markdownFile": "./output/database-documentation.md"
},
"schemas": {
"exclude": ["pg_catalog", "information_schema"]
}
}
```
### MySQL Configuration
```json
{
"version": "1.0.0",
"database": {
"provider": "mysql",
"host": "localhost",
"port": 3306,
"database": "mydatabase",
"user": "root",
"password": "YourPassword"
},
"ai": {
"provider": "openai",
"model": "gpt-5.4-mini",
"apiKey": "sk-...",
"temperature": 0.1,
"maxTokens": 8000
},
"analysis": {
"cardinalityThreshold": 20,
"sampleSize": 10,
"includeStatistics": true,
"guardrails": {
"enabled": true,
"maxTokensPerRun": 250000
}
},
"output": {
"stateFile": "./db-doc-state.json",
"outputDir": "./output",
"sqlFile": "./output/add-descriptions.sql",
"markdownFile": "./output/database-documentation.md"
},
"schemas": {
"exclude": ["mysql", "information_schema", "performance_schema", "sys"]
}
}
```
### Retry and Rate Limiting
```json
{
"ai": {
"retry": {
"maxRetries": 5,
"initialDelayMs": 30000,
"maxDelayMs": 480000,
"backoffMultiplier": 2
},
"rateLimits": {
"requestsPerMinute": 60,
"maxParallelRequests": 1
}
}
}
```
Handles 429 (rate limit) and transient network errors with exponential backoff. Configure based on your API provider's limits.
### Multi-Model Configuration
Use different models for different pipeline phases. A cheaper/faster model for bulk analysis, a stronger model for precision-critical pruning:
```json
{
"ai": {
"provider": "gemini",
"model": "gemini-3-flash-preview",
"modelOverrides": {
"fkPruning": {
"model": "gemini-3.1-pro-preview",
"temperature": 0.05,
"maxTokens": 16000
},
"pkPruning": {
"model": "gemini-3.1-pro-preview",
"temperature": 0.05
}
}
}
}
```
## Supported AI Providers
DBAutoDoc integrates with [MemberJunction's AI provider system](../../packages/AI/). Supported providers:
| Config Provider | Driver Class | Description |
|-----------------|--------------|-------------|
| `gemini` (default) | [GeminiLLM](../../packages/AI/Providers/Gemini/) | Google Gemini |
| `openai` | [OpenAILLM](../../packages/AI/Providers/OpenAI/) | OpenAI |
| `anthropic` | [AnthropicLLM](../../packages/AI/Providers/Anthropic/) | Anthropic Claude |
| `groq` | [GroqLLM](../../packages/AI/Providers/Groq/) | Groq |
| `mistral` | [MistralLLM](../../packages/AI/Providers/Mistral/) | Mistral AI |
| `vertex` | VertexLLM | Google Vertex AI |
| `azure` | AzureLLM | Azure OpenAI |
| `cerebras` | [CerebrasLLM](../../packages/AI/Providers/Cerebras/) | Cerebras |
| `openrouter` | [OpenRouterLLM](../../packages/AI/Providers/OpenRouter/) | OpenRouter (multi-model) |
| `xai` | [xAILLM](../../packages/AI/Providers/xAI/) | xAI (Grok) |
| `bedrock` | [BedrockLLM](../../packages/AI/Providers/Bedrock/) | AWS Bedrock |
### Gemini (Default)
```json
{
"provider": "gemini",
"model": "gemini-3-flash-preview",
"apiKey": "..."
}
```
### OpenAI
```json
{
"provider": "openai",
"model": "gpt-5.4-mini",
"apiKey": "sk-..."
}
```
### Anthropic
```json
{
"provider": "anthropic",
"model": "claude-sonnet-4-6",
"apiKey": "sk-ant-..."
}
```
### Groq
```json
{
"provider": "groq",
"model": "llama-4-scout-17b-16e-instruct",
"apiKey": "gsk_..."
}
```
## State File
The `db-doc-state.json` file tracks:
- All schemas, tables, and columns
- **Description iterations** with reasoning and confidence
- **Analysis runs** with metrics (tokens, cost, duration)
- **Processing logs** for debugging
- **Relationship discovery results** (primary keys, foreign keys)
- **Guardrail metrics** (phase and iteration budgets)
### Iteration Tracking
Each description has an iteration history:
```json
{
"descriptionIterations": [
{
"description": "Initial hypothesis...",
"reasoning": "Based on column names...",
"generatedAt": "2024-01-15T10:00:00Z",
"modelUsed": "gpt-4",
"confidence": 0.75,
"triggeredBy": "initial"
},
{
"description": "Revised hypothesis...",
"reasoning": "Child table analysis revealed...",
"generatedAt": "2024-01-15T10:05:00Z",
"modelUsed": "gpt-4",
"confidence": 0.92,
"triggeredBy": "backpropagation",
"changedFrom": "Initial hypothesis..."
}
]
}
```
## Programmatic Usage
DBAutoDoc can be used as a library with a comprehensive programmatic API:
### Simple API (Recommended)
```typescript
import { DBAutoDocAPI } from '@memberjunction/db-auto-doc';
const api = new DBAutoDocAPI();
// Analyze database
const result = await api.analyze({
database: {
provider: 'sqlserver',
host: 'localhost',
database: 'MyDB',
user: 'sa',
password: 'password'
},
ai: {
provider: 'openai',
model: 'gpt-4-turbo-preview',
apiKey: 'sk-...'
},
analysis: {
convergence: { maxIterations: 10 },
guardrails: { maxTokensPerRun: 100000 }
},
output: {
outputDir: './output'
},
onProgress: (message, data) => {
console.log(message, data);
}
});
// Resume from state file
const resumed = await api.resume('./db-doc-state.json', {
analysis: {
convergence: { maxIterations: 20 }
}
});
// Export documentation
const exported = await api.export('./db-doc-state.json', {
formats: ['sql', 'markdown', 'html', 'csv', 'mermaid'],
outputDir: './docs',
applyToDatabase: true
});
// Get analysis status
const status = await api.getStatus('./db-doc-state.json');
console.log('Progress:', status.progress);
console.log('Tokens used:', status.metrics.totalTokens);
console.log('Estimated cost:', status.metrics.estimatedCost);
```
### Advanced API (Full Control)
```typescript
import {
ConfigLoader,
DatabaseConnection,
Introspector,
TopologicalSorter,
StateManager,
PromptEngine,
AnalysisEngine,
GuardrailsManager,
SQLGenerator,
MarkdownGenerator,
HTMLGenerator,
CSVGenerator,
MermaidGenerator,
AdditionalSchemaInfoGenerator
} from '@memberjunction/db-auto-doc';
// Load config
const config = await ConfigLoader.load('./config.json');
// Connect to database
const db = new DatabaseConnection(config.database);
await db.connect();
// Introspect
const driver = db.getDriver();
const introspector = new Introspector(driver);
const schemas = await introspector.getSchemas(config.schemas, config.tables);
// Initialize analysis components
const promptEngine = new PromptEngine(config.ai, './prompts');
await promptEngine.initialize();
const stateManager = new StateManager(config.output.stateFile);
const state = stateManager.createInitialState(config.database.database, config.database.server);
state.schemas = schemas;
const guardrails = new GuardrailsManager(config.analysis.guardrails);
const iterationTracker = new IterationTracker();
// Run analysis
const analysisEngine = new AnalysisEngine(config, promptEngine, stateManager, iterationTracker);
// ... custom analysis workflow
// Generate outputs
const sqlGen = new SQLGenerator();
const sql = sqlGen.generate(state, { approvedOnly: false });
const mdGen = new MarkdownGenerator();
const markdown = mdGen.generate(state);
const htmlGen = new HTMLGenerator();
const html = htmlGen.generate(state, { confidenceThreshold: 0.7 });
const csvGen = new CSVGenerator();
const { tables, columns } = csvGen.generate(state);
const mermaidGen = new MermaidGenerator();
const erdDiagram = mermaidGen.generate(state);
const erdHtml = mermaidGen.generateHtml(state);
const schemaInfoGen = new AdditionalSchemaInfoGenerator();
const schemaInfo = schemaInfoGen.generate(state, {
discoveredOnly: true, // Only AI-discovered keys
confidenceThreshold: 70, // Minimum confidence
confirmedOnly: true // Only LLM-validated candidates
});
```
## Cost Estimation
Typical costs (will vary by database size and complexity):
| Database Size | Tables | Iterations | Tokens | Cost (Gemini Flash) | Cost (Sonnet 4.6) |
|---------------|--------|------------|--------|--------------|-------------|
| Small | 10-20 | 2-3 | ~50K | $0.50 | $0.02 |
| Medium | 50-100 | 3-5 | ~200K | $2.00 | $0.08 |
| Large | 200+ | 5-8 | ~500K | $5.00 | $0.20 |
| Enterprise | 500+ | 8-15 | ~1.5M | $15.00 | $0.60 |
**With Relationship Discovery**: Add 25-40% to token/cost estimates for databases with missing constraints.
**With Sample Query Generation** (5 queries/table, Gemini Flash):
| Database Size | Tables | Additional Tokens | Additional Cost |
|---------------|--------|-------------------|-----------------|
| Small | 10-20 | ~100K | $0.50-1.00 |
| Medium | 50-100 | ~500K | $2.50-5.00 |
| Large | 200+ | ~2M | $10-20 |
Note: Sample query generation uses ~6× more API calls than description generation (planning + individual SQL generation for each query), adding ~50% to total token usage.
**Guardrails** help control costs by setting hard limits on token usage and runtime.
## Best Practices
1. **Start with guardrails** - Set reasonable token/cost limits to avoid surprises
2. **Add seed context** - Helps AI understand database purpose and domain
3. **Review low-confidence items** - Focus manual effort where AI is uncertain
4. **Use backpropagation** - Improves accuracy significantly
5. **Enable relationship discovery** - For legacy databases missing constraints
6. **Filter exports** - Use `--confidence-threshold` to only apply high-confidence descriptions
7. **Iterate** - Run analysis multiple times if first pass isn't satisfactory
8. **Resume from checkpoints** - Save costs by continuing previous runs
9. **Use appropriate models** - Balance cost vs. quality (see benchmark results)
10. **Export multiple formats** - HTML for browsing, CSV for analysis, SQL for database
### Sample Query Generation Best Practices
**Configuration:**
1. **Use Gemini 3 Flash or Claude Sonnet 4.6** - Best balance of quality, speed, and cost (see benchmark results)
2. **Set token budget** - Prevents runaway costs (default: 100K tokens)
3. **Start with 5 queries/table** - Good balance of coverage and cost
4. **Enable query fix** (`enableQueryFix: true`, default) - Auto-fixes broken queries (up to 3 attempts)
5. **Enable query refinement** (`enableQueryRefinement: true`, optional) - LLM improves working queries
6. **Set max refinement attempts** (`maxRefinementAttempts: 2`) - More iterations = better quality but higher cost
**Quality Control:**
7. **Enable alignment validation** - Ensures related queries use consistent filtering logic
8. **Validate execution** - Set `maxExecutionTime` to test queries actually run (default: 30s)
9. **Review refinement results** - Check `wasRefined` flag and `refinementHistory` in output
10. **Compare fix vs refinement** - Fix errors are in `fixHistory`, improvements in `refinementHistory`
**Usage:**
11. **Generate separately** - Use `generate-queries` command on existing state to avoid re-running full analysis
12. **Export to metadata** - Use `export-sample-queries` to sync queries to MemberJunction
13. **Use for few-shot prompting** - Include in AI agent system prompts as examples
14. **Focus on complex tables** - Skip simple lookup tables to save costs
15. **Document patterns** - Use generated queries to document common query patterns for your domain
**Understanding Results:**
- `validated: true` = Query executes successfully
- `fixAttempts: 0` = Query worked on first try
- `fixAttempts: 2, validated: true` = Query fixed after 2 attempts
- `wasRefined: true` = Query was improved after initial success
- `refinementAttempts: 2` = Query went through 2 refinement passes
## Troubleshooting
### "Connection failed"
- Check server, database, user, password in config
- Verify database server is running and accessible
- Check firewall rules and network connectivity
- For PostgreSQL: verify SSL settings
- For MySQL: check port and authentication method
### "Analysis not converging"
- Increase `maxIterations` in config
- Lower `confidenceThreshold`
- Add more seed context
- Check warnings in state file for specific issues
- Review guardrail limits (may be hitting token budget)
### "High token usage"
- Enable guardrails with appropriate limits
- Reduce `maxTokens` per prompt
- Filter schemas/tables to focus on subset
- Use cheaper model (Gemini Flash is already very cost-effective)
- Disable relationship discovery if not needed
### "Guardrail limits exceeded"
- Review metrics in state file
- Adjust limits upward if budget allows
- Use `--resume` to continue from checkpoint
- Focus on specific schemas/tables
- Reduce iteration count
### "Relationship discovery not finding keys"
- Check confidence thresholds (may be too high)
- Review statistical significance settings
- Enable LLM validation for better accuracy
- Check naming patterns configuration
- Verify sample size is adequate
## Documentation
Comprehensive documentation is available in the `docs/` folder:
- **[USER_GUIDE.md](./docs/USER_GUIDE.md)** - Complete user documentation
- **[ARCHITECTURE.md](./docs/ARCHITECTURE.md)** - Technical architecture and design
- **[API_USAGE.md](./docs/API_USAGE.md)** - Programmatic API examples
- **[GUARDRAILS.md](./docs/GUARDRAILS.md)** - Guardrails system documentation
- **[CHANGES.md](./docs/CHANGES.md)** - Recent changes and enhancements
## Architecture
DBAutoDoc uses a sophisticated multi-phase architecture:
1. **Discovery Phase** - Introspection and optional relationship discovery
2. **Analysis Phase** - Iterative LLM-based description generation
3. **Sanity Check Phase** - Validation and quality assurance
4. **Export Phase** - Multi-format documentation generation
See [ARCHITECTURE.md](./docs/ARCHITECTURE.md) for comprehensive architecture documentation, including:
- Phase flow diagrams
- Extension points for customization
- Database driver development guide
- LLM intelligence strategy
## Contributing
DBAutoDoc is part of the MemberJunction project. Contributions welcome!
## License
MIT
## Demo Databases
### LousyDB - Legacy Database Demo
Located in `/Demos/LousyDB/`, this demo showcases **Relationship Discovery** capabilities on a realistic legacy database:
- ❌ **Zero metadata** - No PK or FK constraints defined
- 🔤 **Cryptic naming** - Short abbreviations (`cst`, `ord`, `pmt`)
- 🔡 **Single-char codes** - Undocumented status values (`A`, `T`, `P`)
- 💔 **Data quality issues** - Orphaned records, NULLs, duplicates
- 📊 **20 tables** across 2 schemas with 1000+ rows
Perfect for testing DBAutoDoc's ability to **reverse-engineer** poorly-documented databases.
See `/Demos/LousyDB/README.md` for details and testing instructions.
## Links
- **GitHub**: https://github.com/MemberJunction/MJ
- **Documentation**: https://docs.memberjunction.org
- **Support**: https://github.com/MemberJunction/MJ/issues