# GraphAnything

Turn anything into a navigable knowledge graph. Markdown vaults, OpenAPI
specs, contracts, meeting notes, chat transcripts — all converge on the
same `{nodes, edges}` schema with full provenance, versioning, federation,
and quality reporting.

LLM-driven extraction goes through GraphAnything's built-in
**OpenAI-compatible** client, so you can point it at any
`chat.completions`-shaped endpoint:

- Local **vLLM serve**, **llama.cpp**, **Ollama**, **LM Studio**
- **OpenAI** itself, or any commercial OpenAI-compatible host

[中文版 / Chinese → README.zh.md](README.zh.md)

## At a glance

| | |
|---|---|
| Schema presets | 10 (chat-log / codebase / contracts / db-schema / fstree / meeting / obsidian-vault / openapi / papers / pr-review) |
| Extractors | 8 (markdown / json-yaml / openapi / fstree / chatlog / llm-entity / vlm (stub) / noop) |
| Render formats | 9 (mermaid / html / svg / cypher / graphml / ascii / json / canvas / timeline) |
| MCP tools | 17 |
| CLI sub-commands | 19 |
| LLM backend | any OpenAI-compatible endpoint (HTTP) |
| External API | **optional** — read-only / rule-based paths need no LLM |

## Three drivers, one core

```
              ┌──────────────────────────────────────────┐
              │   GraphAnything (core)                   │
              │   Session state machine + 8 extractors   │
              │   + 10 schema presets + 9 viz formats    │
              │   + temporal + federate + ask + quality. │
              └──────────────────────────────────────────┘
                  ▲              ▲              ▲
                  │              │              │
        ┌─────────┴────┐  ┌──────┴────┐  ┌──────┴──────────┐
        │  CLI / REPL  │  │  Skill    │  │  llm_client.py  │
        │ graphanything│  │ /graphany.│  │ (OpenAI-compat) │
        └──────────────┘  └───────────┘  └─────────────────┘
```

## Install

```bash
cd GraphAnything
pip install -e .
```

This registers the `graphanything` console script. As a fallback,
`python -m GraphAnything.cli ...` always works.

Optional extras:

```bash
pip install -e ".[mcp]"     # MCP stdio server (Claude Code / Cursor / Gemini CLI)
pip install -e ".[neo4j]"   # direct push to a running Neo4j instance
pip install -e ".[svg]"     # SVG renderer (matplotlib)
pip install -e ".[repl]"    # nicer REPL (history + completion)
pip install -e ".[all]"     # everything above
```

For LLM-gated commands (`refine --llm`, `sample --extractor llm-entity`,
`ask --llm`, `eval --llm`), point the client at any OpenAI-compatible
chat-completions endpoint:

```bash
# vLLM serve / llama.cpp / Ollama / LM Studio / OpenAI / …
export GA_API_BASE=http://localhost:8000/v1     # default
export GA_MODEL=Qwen3-32B-Instruct              # required
export GA_API_KEY=local                         # optional; many local servers
                                                # accept any string

# Legacy upstream env vars are also honoured:
#   OPENAI_API_BASE / OPENAI_API_KEY / OPENAI_MODEL
#   API_BASE        / API_KEY        / SUMMARY_MODEL_NAME
```

Rule-based extractors (`markdown`, `json-yaml`, `openapi`, `fstree`,
`chatlog`) and **all read-only commands** (`render`, `explain`,
`versions`, `diff`, plus `ask`, `federate`, and `eval` when run without
`--llm`) need **no LLM at all**.

## CLI quickstart

```bash
# One-shot end-to-end
graphanything new ./vault/ --preset obsidian-vault --auto

# Step-by-step (recommended for non-trivial corpora)
graphanything new ./contracts --preset contracts
graphanything sample --n 5
graphanything review --merge ABC_Corp,abc_corp,ABC公司
graphanything refine "add GoverningLaw entity"
graphanything run

# Query
graphanything ask "all clauses with amount > 100k"
graphanything explain ep_api_get_user

# Versioning (incremental re-extract on changed files only)
graphanything update                # re-hash all inputs; redo only changed
graphanything versions              # list snapshots written so far
graphanything diff 1 2              # what changed between v1 and v2

# Federation
graphanything federate g1.json g2.json --out universe.json --fuzzy

# Quality
graphanything eval --out-dir graphanything-out/quality --llm --judge-n 20

# Render (9 formats)
graphanything render --fmt mermaid                       # for Claude / chat
graphanything render --fmt cypher  --out g.cypher        # → Neo4j
graphanything render --fmt graphml --out g.graphml       # → Gephi
graphanything render --fmt html    --out g.html          # standalone, force-directed
graphanything render --fmt json    --out g.json          # NetworkX JSON
graphanything render --fmt timeline --out timeline.html  # X = year, Y = community
graphanything render --fmt canvas  --out g.canvas        # Obsidian Canvas
graphanything render --fmt ascii                         # prints to the terminal
graphanything render --fmt svg     --out g.svg
graphanything render --fmt mermaid --budget-tokens 4000  # PageRank-prune to fit
```
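
The `--budget-tokens` flag is described as PageRank-pruning the graph to fit. One way such pruning could work is sketched below in plain Python. The function names, the damping factor, and the flat `tokens_per_item` cost estimate are assumptions made for illustration, not GraphAnything's actual implementation.

```python
def pagerank(nodes, edges, damping=0.85, iters=50):
    """Power-iteration PageRank over directed (src, tgt) pairs."""
    n = len(nodes)
    if n == 0:
        return {}
    rank = {v: 1.0 / n for v in nodes}
    out = {v: [] for v in nodes}
    for s, t in edges:
        out[s].append(t)
    for _ in range(iters):
        # Rank held by nodes without out-edges is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not out[v])
        nxt = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for s in nodes:
            if out[s]:
                share = damping * rank[s] / len(out[s])
                for t in out[s]:
                    nxt[t] += share
        rank = nxt
    return rank


def prune_to_budget(nodes, edges, budget_tokens, tokens_per_item=12):
    """Keep the highest-ranked nodes (and the edges among them) that fit.

    tokens_per_item is a hypothetical flat cost per rendered node or edge.
    """
    rank = pagerank(nodes, edges)
    ordered = sorted(nodes, key=lambda v: (-rank[v], v))  # ties broken by id
    keep, kept_edges = set(), []
    for v in ordered:
        trial = keep | {v}
        trial_edges = [(s, t) for s, t in edges if s in trial and t in trial]
        if (len(trial) + len(trial_edges)) * tokens_per_item > budget_tokens:
            break
        keep, kept_edges = trial, trial_edges
    return [v for v in nodes if v in keep], kept_edges
```

In this sketch a hub that many nodes point to ranks highest and survives pruning first; low-ranked leaves are dropped once the estimated size crosses the budget.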

## All 19 CLI sub-commands

| Sub-command | Purpose |
|---|---|
| `new <inputs>` | Open a session; `--preset NAME`, `--extractor NAME`, `--auto`, budget caps |
| `propose [--n N] [--llm]` | Suggest an initial schema (rule-derived if possible, else generic / LLM) |
| `refine "<instruction>" [--llm]` | Edit the schema (regex first, LLM fallback) |
| `sample [--n N] [--extractor NAME]` | Extract from N inputs into `pending`, propose merges |
| `review` | `--accept-all` / `--accept ID...` / `--reject ID... [--reason ...]` / `--merge a,b[,c]` |
| `run [--out DIR] [--extractor NAME]` | Lock schema → run all inputs → write `graph.json` + snapshot |
| `update [--out DIR] [--extractor NAME]` | Re-extract only inputs whose `source_hash` changed; new snapshot |
| `versions [--out-root R]` | List snapshots written by `run` / `update` |
| `diff <v_old> <v_new>` | Diff two snapshots (added / removed / modified nodes & edges) |
| `ask "<question>" [--llm]` | NL query → graph traversal (regex first, LLM fallback) |
| `explain <id\|label\|"src → rel → tgt">` | Provenance for a node or edge |
| `render --fmt FMT [--out PATH] [--graph G] [--budget-tokens N]` | 9 formats (see below) |
| `federate g1 g2... --out U [--fuzzy] [--fuzzy-threshold T] [--llm]` | Merge multiple graphs into one universe |
| `eval [--out-dir D] [--llm] [--judge-n N] [--graph G]` | Coverage / dedup / per-extractor / sampled LLM-judge |
| `presets` | List the 10 built-in schema presets |
| `extractors` | List the 8 registered extractors |
| `sessions` | List sessions in `graphanything-out/sessions/` |
| `use <session_id>` | Switch the active session pointer |
| `repl [<inputs>]` | Interactive shell (history + completion via `prompt_toolkit` if installed) |

Top-level flags (apply to every sub-command):

- `--sessions-dir PATH` — where session JSONs live (default `graphanything-out/sessions/`).
- `--session ID` — override the "current session" pointer for one call.

`new` and `repl` accept budget soft caps:

- `--max-tokens N` — total token ceiling
- `--max-dollars D` — total $ ceiling
- `--max-api-calls N` — total LLM-call ceiling

`run` / `sample` stop early when any cap is exceeded; remaining inputs are
listed in the result `notes`.
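
As a rough illustration of soft caps, the sketch below checks the budget between inputs and records the unprocessed inputs in `notes`. `Budget`, `Usage`, `run_with_caps`, and the `cost_of` callback are hypothetical names for this example; GraphAnything's internal bookkeeping may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Budget:
    """Hypothetical caps mirroring --max-tokens / --max-dollars / --max-api-calls."""
    max_tokens: Optional[int] = None
    max_dollars: Optional[float] = None
    max_api_calls: Optional[int] = None


@dataclass
class Usage:
    tokens: int = 0
    dollars: float = 0.0
    api_calls: int = 0


def exceeded(budget: Budget, usage: Usage) -> List[str]:
    """Return the names of every cap that has been exceeded (empty if none)."""
    checks = [
        ("max_tokens", budget.max_tokens, usage.tokens),
        ("max_dollars", budget.max_dollars, usage.dollars),
        ("max_api_calls", budget.max_api_calls, usage.api_calls),
    ]
    return [name for name, cap, used in checks if cap is not None and used > cap]


def run_with_caps(inputs, budget, cost_of):
    """Process inputs in order and stop early once any cap is exceeded.

    cost_of(path) -> (tokens, dollars, api_calls) stands in for real extraction.
    """
    usage, done, notes = Usage(), [], []
    for i, path in enumerate(inputs):
        hit = exceeded(budget, usage)
        if hit:
            notes.append(
                f"stopped early ({', '.join(hit)}); remaining: {list(inputs[i:])}"
            )
            break
        tokens, dollars, calls = cost_of(path)
        usage.tokens += tokens
        usage.dollars += dollars
        usage.api_calls += calls
        done.append(path)
    return done, notes
```

Because the check runs before each input, the cap is "soft": the input that pushes usage over the limit still completes, and only the following ones are skipped.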

## Skill / MCP server (Claude Code, Cursor, Gemini, …)

The same Session core, accessed through **17 MCP tools**:

| Tool | Purpose |
|---|---|
| `graphanything_open_session` | Start a session over inputs |
| `graphanything_list_presets` | 10 built-in schema templates |
| `graphanything_list_extractors` | 8 extractors (rule + LLM + VLM stub) |
| `graphanything_propose_schema` | Suggest a starting schema |
| `graphanything_refine_schema` | Edit schema (regex + LLM fallback) |
| `graphanything_sample` | Extract from N inputs into `pending` |
| `graphanything_review` | Apply `accept_all` / `accept` / `reject` / `merge` / `rule` actions |
| `graphanything_run` | Full extraction → `graph.json` + snapshot |
| `graphanything_status` | Counts + cost + schema |
| `graphanything_ask` | Natural-language query |
| `graphanything_explain` | Provenance for one node / edge |
| `graphanything_update` | Incremental re-extract on changed files |
| `graphanything_versions` | List graph snapshots |
| `graphanything_diff` | Diff two snapshots |
| `graphanything_federate` | Combine multiple graphs |
| `graphanything_eval` | Coverage / dedup / quality report |
| `graphanything_render` | Mermaid / HTML / SVG / Cypher / GraphML / ASCII / JSON / Canvas / Timeline |

Start the server:

```bash
python -m GraphAnything.serve
```

Wire into Claude Code / Cursor / Gemini CLI by registering the same
process in your MCP config (`~/.claude.json` / `.mcp.json` / equivalent):

```json
{
  "mcpServers": {
    "graphanything": {
      "command": "python",
      "args": ["-m", "GraphAnything.serve"],
      "env": {
        "GA_API_BASE": "http://localhost:8000/v1",
        "GA_MODEL": "Qwen3-32B-Instruct",
        "GA_API_KEY": "local"
      }
    }
  }
}
```

## 10 schema presets

`graphanything presets` lists them; `graphanything new --preset NAME`
applies one. Drop your own YAML in `GraphAnything/schemas/<name>.yaml`
to register a new one.

| Preset | Domain |
|---|---|
| `chat-log` | Slack / Claude Code `.jsonl` / Discord → user / message / tool |
| `codebase` | Source repo → module / file / class / function / import / call |
| `contracts` | Legal contracts → party / clause / date / amount / governing law |
| `db-schema` | DDL / migrations / ORM → table / column / FK / index |
| `fstree` | Plain filesystem → directory / file / symlink |
| `meeting` | Meeting notes → person / topic / decision / action item |
| `obsidian-vault` | Obsidian / Notion vault → note / tag / wikilink / backlink |
| `openapi` | OpenAPI 2.x/3.x spec → endpoint / schema / ref / security |
| `papers` | Generic LLM-driven paper extraction |
| `pr-review` | GitHub PR trail → file / function / reviewer / concern |

## 8 built-in extractors

`graphanything extractors` lists them. A suffix-based dispatcher picks one
unless `--extractor NAME` overrides it.

| Extractor | LLM? | Handles | Notes |
|---|---|---|---|
| `markdown` | ❌ | `.md`, `.markdown` | Note / Heading / Tag / WikiLink |
| `json-yaml` | ❌ | `.json`, `.ndjson`, `.yaml`, `.yml`, `.toml` | Generic config tree + `$ref` |
| `openapi` | ❌ | `.yaml`, `.yml`, `.json` | API / Endpoint / Schema / Parameter (force via `--extractor openapi`) |
| `fstree` | ❌ | directories | Directory / File / Symlink |
| `chatlog` | ❌ | `.jsonl`, `.txt`, `.log` | Channel / User / Message / Tool / ToolCall |
| `llm-entity` | ✅ | `*` (any text) | Generic entity / relation, with `evidence_span` + `rationale` |
| `vlm` | ✅ | `.pdf`, `.png`, `.jpg`, `.jpeg` | **Stub** — install a plugin to enable |
| `noop` | ❌ | `*` | Empty graph; for tests |
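
To make the dispatch rule concrete, here is a minimal sketch of suffix-based selection with an explicit override. The `REGISTRY` list and `pick_extractor` are hypothetical; the suffixes come from the table above, while the tie-breaking rule (first registered wins, so `.yaml` goes to `json-yaml` unless `openapi` is forced) and the `None` result for unknown suffixes are assumptions.

```python
from pathlib import Path
from typing import Optional

# Hypothetical registry: (name, handled suffixes) in registration order.
REGISTRY = [
    ("markdown", (".md", ".markdown")),
    ("json-yaml", (".json", ".ndjson", ".yaml", ".yml", ".toml")),
    ("openapi", (".yaml", ".yml", ".json")),
    ("chatlog", (".jsonl", ".txt", ".log")),
]


def pick_extractor(path: str, override: Optional[str] = None) -> Optional[str]:
    """Return the extractor name for a path, or None if nothing matches."""
    if override is not None:
        return override          # an explicit --extractor NAME always wins
    p = Path(path)
    if p.is_dir():
        return "fstree"          # directories go to the filesystem extractor
    suffix = p.suffix.lower()
    for name, handles in REGISTRY:
        if suffix in handles:
            return name          # first registered match wins (assumption)
    return None


print(pick_extractor("notes/README.MD"))               # markdown
print(pick_extractor("api.yaml"))                      # json-yaml
print(pick_extractor("api.yaml", override="openapi"))  # openapi
```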

Adding a new extractor (Python plugin):

```python
from GraphAnything import register_extractor

def extract_my_format(path, **_):
    return {
        "nodes": [{"id": "x", "label": "X", "file_type": "document",
                   "source_file": str(path)}],
        "edges": [],
    }

register_extractor(
    "my-format", extract_my_format,
    version="0.1.0", handles=(".myext",),
    description="My custom format extractor",
)
```

`run_extractor()` automatically stamps provenance (`extractor_id`,
`extractor_version`, `extraction_time`, `source_hash`).
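
For orientation, the stamping step might look roughly like the sketch below. `stamp_provenance` is a hypothetical helper, and the choice of SHA-256, ISO-8601 UTC timestamps, and per-node / per-edge placement are assumptions; only the four field names come from the text above.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def stamp_provenance(graph: dict, path: Path,
                     extractor_id: str, extractor_version: str) -> dict:
    """Attach the four provenance fields to every node and edge in place."""
    stamp = {
        "extractor_id": extractor_id,
        "extractor_version": extractor_version,
        "extraction_time": datetime.now(timezone.utc).isoformat(),
        # Hash of the raw input bytes (SHA-256 is an assumption).
        "source_hash": hashlib.sha256(path.read_bytes()).hexdigest(),
    }
    for item in graph.get("nodes", []) + graph.get("edges", []):
        for key, value in stamp.items():
            item.setdefault(key, value)  # never overwrite an existing value
    return graph
```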

To replace the VLM stub with a real model:

```python
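from GraphAnything import register_extractor

# `my_real_impl` is your own callable with the same shape as an extractor
# function: it takes a path and returns {"nodes": [...], "edges": [...]}.
register_extractor(
    "vlm", my_real_impl, version="1.0.0",
    handles=(".pdf", ".png", ".jpg", ".jpeg"),
    needs_llm=True, overwrite=True,
)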
```

## Versioning, federation, quality

**Incremental updates.** `graphanything update` rehashes every input;
unchanged files keep their previous nodes / edges verbatim, changed
files are re-extracted, the result is normalised and snapshotted as the
next `versions/v<N>.json`. `diff` then works between any two versions.
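
The change-detection step can be pictured as a hash comparison against the previous snapshot. This is a minimal sketch, assuming SHA-256 over file bytes and a plain `{path: hash}` mapping; `hash_file` and `plan_update` are illustrative names, not part of the GraphAnything API.

```python
import hashlib
from pathlib import Path


def hash_file(path: Path) -> str:
    """Content hash of one input file (SHA-256 is an assumption)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def plan_update(paths, previous_hashes):
    """Split inputs into (changed, unchanged) and return the fresh hash map.

    previous_hashes maps path -> hash as recorded by the last run or update.
    New files count as changed because they have no previous hash.
    """
    current = {str(p): hash_file(Path(p)) for p in paths}
    changed = [p for p, h in current.items() if previous_hashes.get(p) != h]
    unchanged = [p for p in current if p not in changed]
    return changed, unchanged, current
```

Only the `changed` list would need re-extraction; nodes and edges from `unchanged` files can be carried over as they are.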

**Federation.** `graphanything federate g1 g2 ... --out universe.json`
merges several graphs into one universe. Same-label entities of the
same type collapse exactly; with `--fuzzy --fuzzy-threshold 0.7` it also
proposes `same_as` edges via Jaccard token overlap, optionally with
`--llm` tie-breaking on borderline pairs.
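
Jaccard token overlap is the size of the shared token set divided by the size of the combined token set. The sketch below shows the idea on labels alone; the tokeniser (lowercase alphanumeric runs), the `>=` comparison, and the helper names are assumptions, and it ignores entity types for brevity.

```python
import re
from itertools import combinations


def tokens(label: str) -> set:
    """Lowercase alphanumeric tokens; '_' and punctuation act as separators."""
    return set(re.findall(r"[a-z0-9]+", label.lower()))


def jaccard(a: str, b: str) -> float:
    """|A ∩ B| / |A ∪ B| over the two labels' token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def propose_same_as(labels, threshold=0.7):
    """Return (a, b, score) for every label pair at or above the threshold."""
    proposals = []
    for a, b in combinations(labels, 2):
        score = jaccard(a, b)
        if score >= threshold:
            proposals.append((a, b, round(score, 3)))
    return proposals


print(propose_same_as(["ABC Corp", "abc_corp", "Acme Inc"]))
# [('ABC Corp', 'abc_corp', 1.0)]
```

With a threshold of 0.7, "ABC Corp Ltd" and "ABC Corp" score 2/3 and would not be linked, which is the kind of borderline pair an LLM tie-breaker could be asked about.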

**Quality eval.** `graphanything eval --out-dir graphanything-out/quality`
writes `QUALITY_REPORT.md`: coverage by node type, dedup density,
per-extractor stats, and (with `--llm --judge-n 20`) an LLM verdict on
20 sampled edges against their `evidence_span`.

## REPL mode

```bash
graphanything repl ./contracts --preset contracts
```

Inside the shell, CLI sub-commands become REPL commands, joined by a few
REPL-only helpers: `schema`, `propose [N]`, `refine "..."`, `sample [N]`,
`review accept-all|accept ID...|reject ID...|merge a,b[,c]`, `run [DIR]`,
`render FMT [PATH]`, `explain TARGET`, `status`, `cost`, `llm on|off`,
`presets`, `extractors`, `help`, `quit`.

If `prompt_toolkit` is installed (`pip install -e ".[repl]"`), you get
history + completion; otherwise the REPL falls back to bare `input()`.

## Session state on disk

```
graphanything-out/
├── sessions/
│   ├── .current               ← active session pointer
│   ├── sess_a1b2c3d4.json     ← one file per session
│   └── sess_...json
├── graph.json                 ← latest run/update output
├── versions/
│   ├── v1.json
│   ├── v2.json
│   └── manifest.json          ← schema_version + source_hashes per snapshot
└── quality/
    └── QUALITY_REPORT.md
```

The `Session` JSON is the source of truth. It holds the `accepted` /
`pending` / `rejected` graph fragments, the schema (with version), the
feedback log, the running cost log, the normalize rules, and the last
source hashes used for incremental updates. Delete the file and the
session is gone; copy it elsewhere and it relocates intact.
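
Since a session is a single JSON file, it can be inspected with nothing but the standard library. The sketch below counts nodes and edges per bucket; it assumes each of `accepted` / `pending` / `rejected` is an object with `nodes` and `edges` lists, and `summarize_session` is a hypothetical helper rather than part of the package.

```python
import json
from pathlib import Path


def summarize_session(path):
    """Return {bucket: (node_count, edge_count)} for one session file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    summary = {}
    for bucket in ("accepted", "pending", "rejected"):
        fragment = data.get(bucket) or {}   # tolerate a missing bucket
        summary[bucket] = (
            len(fragment.get("nodes", [])),
            len(fragment.get("edges", [])),
        )
    return summary
```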

## Configuration reference (env vars)

GraphAnything reads env vars only on demand, and every setting except the
model name has a sensible default. Names are listed in priority order:
the first one that is set wins.

| Setting | Env vars | Default |
|---|---|---|
| Chat-completions URL | `GA_API_BASE` / `OPENAI_API_BASE` / `OPENAI_BASE_URL` / `API_BASE` | `http://localhost:8000/v1` |
| Model name | `GA_MODEL` / `OPENAI_MODEL` / `SUMMARY_MODEL_NAME` | (required for LLM ops) |
| Bearer token | `GA_API_KEY` / `OPENAI_API_KEY` / `API_KEY` | empty (many local servers accept any string) |
| HTTP timeout (s) | `GA_HTTP_TIMEOUT` | `600` |
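
The "first one set wins" rule from the table can be sketched as follows. The `SETTINGS` mapping and `resolve` function are illustrative only; treating an empty string as unset is an assumption.

```python
import os

# Env var names in priority order, plus the default from the table above.
SETTINGS = {
    "api_base": (
        ("GA_API_BASE", "OPENAI_API_BASE", "OPENAI_BASE_URL", "API_BASE"),
        "http://localhost:8000/v1",
    ),
    "model": (("GA_MODEL", "OPENAI_MODEL", "SUMMARY_MODEL_NAME"), None),
    "api_key": (("GA_API_KEY", "OPENAI_API_KEY", "API_KEY"), ""),
    "timeout": (("GA_HTTP_TIMEOUT",), "600"),
}


def resolve(setting, env=None):
    """Return the first non-empty env value for a setting, else its default."""
    env = os.environ if env is None else env
    names, default = SETTINGS[setting]
    for name in names:
        value = env.get(name)
        if value:                # empty strings are treated as unset
            return value
    return default


# GA_MODEL outranks the legacy name even when both are set.
print(resolve("model", {"SUMMARY_MODEL_NAME": "old", "GA_MODEL": "new"}))  # new
```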

Each LLM call sends a standard `chat.completions` POST to
`{API_BASE}/chat/completions` with `model`, `messages`, `temperature`,
`max_tokens`, and (for `chat_json`) `response_format: {type: "json_object"}`.
If the server rejects `response_format`, the call retries without it and
GraphAnything regex-extracts JSON from the answer (so reasoning models
emitting `<think>...</think>` blocks also work).
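
The request shape and the fallback parsing can be illustrated with a small sketch. `build_payload` and `extract_json` are hypothetical helpers, and the default `temperature` and `max_tokens` values are placeholders. The parser here uses a regex to strip `<think>` blocks and to locate candidate `{` positions, then lets `json.JSONDecoder.raw_decode` do the parsing, which is one possible approach and not necessarily the one GraphAnything uses.

```python
import json
import re


def build_payload(model, messages, temperature=0.0, max_tokens=1024, json_mode=True):
    """Body for POST {API_BASE}/chat/completions."""
    payload = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    if json_mode:
        # Dropped on retry if the server rejects it.
        payload["response_format"] = {"type": "json_object"}
    return payload


def extract_json(answer: str):
    """Return the first JSON object found outside <think> blocks, or None."""
    text = re.sub(r"<think>.*?</think>", "", answer, flags=re.DOTALL)
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", text):
        try:
            obj, _end = decoder.raw_decode(text, match.start())
            return obj
        except json.JSONDecodeError:
            continue             # not valid JSON from this brace; try the next
    return None


reply = '<think>maybe {"draft": 1}</think> Result: {"nodes": [], "edges": []}'
print(extract_json(reply))  # {'nodes': [], 'edges': []}
```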

## Programmatic use

```python
from GraphAnything import open_session
from GraphAnything.llm_client import make_client

llm = make_client()                          # reads env vars

sess = open_session(["./vault/"], preset="obsidian-vault")
sess.propose(auto_accept=True, llm=llm)
sess.run(llm=llm, out_dir="graphanything-out")

print(sess.accepted["nodes"][:3])
```

`make_client()` accepts overrides:

```python
llm = make_client(
    api_base="http://h-100:8000/v1",
    model="Qwen3-32B-Instruct",
    api_key="local",
    timeout=900,
)
```

## License

MIT.