# GraphAnything

Turn anything into a navigable knowledge graph. Markdown vaults, OpenAPI specs, contracts, meeting notes, and chat transcripts all converge on the same `{nodes, edges}` schema with full provenance, versioning, federation, and quality reporting.

LLM-driven extraction goes through GraphAnything's built-in **OpenAI-compatible** client, so you can point it at any `chat.completions`-shaped endpoint:

- Local **vLLM serve**, **llama.cpp**, **Ollama**, **LM Studio**
- **OpenAI** itself, or any commercial OpenAI-compatible host

[中文版 / Chinese → README.zh.md](README.zh.md)

## At a glance

| | |
|---|---|
| Schema presets | 10 (chat-log / codebase / contracts / db-schema / fstree / meeting / obsidian-vault / openapi / papers / pr-review) |
| Extractors | 8 (markdown / json-yaml / openapi / fstree / chatlog / llm-entity / vlm-stub / noop) |
| Render formats | 9 (mermaid / html / svg / cypher / graphml / ascii / json / canvas / timeline) |
| MCP tools | 17 |
| CLI sub-commands | 19 |
| LLM backend | any OpenAI-compatible endpoint (HTTP) |
| External API | **optional**: read-only / rule-based paths need no LLM |

## Three drivers, one core

```
    ┌──────────────────────────────────────────┐
    │           GraphAnything (core)           │
    │  Session state machine + 8 extractors    │
    │  + 10 schema presets + 9 viz formats     │
    │  + temporal + federate + ask + quality.  │
    └──────────────────────────────────────────┘
          ▲              ▲              ▲
          │              │              │
┌─────────┴────┐  ┌──────┴────┐  ┌──────┴──────────┐
│  CLI / REPL  │  │   Skill   │  │  llm_client.py  │
│ graphanything│  │ /graphany.│  │ (OpenAI-compat) │
└──────────────┘  └───────────┘  └─────────────────┘
```

## Install

```bash
cd GraphAnything
pip install -e .
```

This registers the `graphanything` console-script. As a fallback, `python -m GraphAnything.cli ...` always works.

Optional extras:

```bash
pip install -e ".[mcp]"      # MCP stdio server (Claude Code / Cursor / Gemini CLI)
pip install -e ".[neo4j]"    # direct push to a running Neo4j instance
pip install -e ".[svg]"      # SVG renderer (matplotlib)
pip install -e ".[repl]"     # nicer REPL (history + completion)
pip install -e ".[all]"      # everything above
```

For LLM-gated commands (`refine --llm`, `sample --extractor llm-entity`, `ask --llm`, `eval --llm`), point the client at any OpenAI-compatible chat-completions endpoint:

```bash
# vLLM serve / llama.cpp / Ollama / LM Studio / OpenAI / …
export GA_API_BASE=http://localhost:8000/v1   # default
export GA_MODEL=Qwen3-32B-Instruct            # required
export GA_API_KEY=local                       # optional; many local servers
                                              # accept any string

# Legacy upstream env vars are also honoured:
#   OPENAI_API_BASE / OPENAI_API_KEY / OPENAI_MODEL
#   API_BASE        / API_KEY        / SUMMARY_MODEL_NAME
```

Rule-based extractors (`markdown`, `json-yaml`, `openapi`, `fstree`, `chatlog`) and **all read-only commands** (`render`, `explain`, `ask` without `--llm`, `versions`, `diff`, `federate`, `eval` without `--llm`) need **no LLM at all**.

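
The fallback from `GA_*` to the legacy variable names can be pictured as a first-match lookup. The sketch below is only an illustration of that idea: `ENV_PRIORITY`, `resolve_setting`, and `resolve_all` are hypothetical names, not part of GraphAnything's API, and treating an empty string as "unset" is an assumption. The variable names and defaults are the ones listed in the configuration reference.

```python
import os

# Lookup order per setting, copied from the configuration reference table.
ENV_PRIORITY = {
    "api_base": ("GA_API_BASE", "OPENAI_API_BASE", "OPENAI_BASE_URL", "API_BASE"),
    "model": ("GA_MODEL", "OPENAI_MODEL", "SUMMARY_MODEL_NAME"),
    "api_key": ("GA_API_KEY", "OPENAI_API_KEY", "API_KEY"),
}
DEFAULTS = {"api_base": "http://localhost:8000/v1", "model": None, "api_key": ""}


def resolve_setting(name, env=None):
    """Return the first non-empty variable for `name`, else its default.

    Hypothetical helper: treating "" as unset is an assumption.
    """
    env = os.environ if env is None else env
    for var in ENV_PRIORITY[name]:
        value = env.get(var)
        if value:
            return value
    return DEFAULTS[name]


def resolve_all(env=None):
    """Resolve every setting at once into a plain dict."""
    return {name: resolve_setting(name, env) for name in ENV_PRIORITY}
```

Calling `resolve_all()` with no argument reads the real environment; passing a dict makes the lookup easy to check in isolation. A `model` of `None` corresponds to the "required for LLM ops" case.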
## CLI quickstart

```bash
# One-shot end-to-end
graphanything new ./vault/ --preset obsidian-vault --auto

# Step-by-step (recommended for non-trivial corpora)
graphanything new ./contracts --preset contracts
graphanything sample --n 5
graphanything review --merge ABC_Corp,abc_corp,ABC公司
graphanything refine "add GoverningLaw entity"
graphanything run

# Query
graphanything ask "all clauses with amount > 100k"
graphanything explain ep_api_get_user

# Versioning (incremental re-extract on changed files only)
graphanything update                 # re-hash all inputs; redo only changed
graphanything versions               # list snapshots written so far
graphanything diff 1 2               # what changed between v1 and v2

# Federation
graphanything federate g1.json g2.json --out universe.json --fuzzy

# Quality
graphanything eval --out-dir graphanything-out/quality --llm --judge-n 20

# Render (9 formats)
graphanything render --fmt mermaid                        # for Claude / chat
graphanything render --fmt cypher --out g.cypher          # → Neo4j
graphanything render --fmt graphml --out g.graphml        # → Gephi
graphanything render --fmt html --out g.html              # standalone, force-directed
graphanything render --fmt json --out g.json              # NetworkX JSON
graphanything render --fmt timeline --out timeline.html   # X = year, Y = community
graphanything render --fmt canvas --out g.canvas          # Obsidian Canvas
graphanything render --fmt ascii                          # piped into terminal
graphanything render --fmt svg --out g.svg
graphanything render --fmt mermaid --budget-tokens 4000   # PageRank-prune to fit
```

## All 19 CLI sub-commands

| Sub-command | Purpose |
|---|---|
| `new ` | Open a session; `--preset NAME`, `--extractor NAME`, `--auto`, budget caps |
| `propose [--n N] [--llm]` | Suggest an initial schema (rule-derived if possible, else generic / LLM) |
| `refine "" [--llm]` | Edit the schema (regex first, LLM fallback) |
| `sample [--n N] [--extractor NAME]` | Extract from N inputs into `pending`, propose merges |
| `review` | `--accept-all` / `--accept ID...` / `--reject ID... [--reason ...]` / `--merge a,b[,c]` |
| `run [--out DIR] [--extractor NAME]` | Lock schema → run all inputs → write `graph.json` + snapshot |
| `update [--out DIR] [--extractor NAME]` | Re-extract only inputs whose `source_hash` changed; new snapshot |
| `versions [--out-root R]` | List snapshots written by `run` / `update` |
| `diff ` | Diff two snapshots (added / removed / modified nodes & edges) |
| `ask "" [--llm]` | NL query → graph traversal (regex first, LLM fallback) |
| `explain ` | Provenance for a node or edge |
| `render --fmt FMT [--out PATH] [--graph G] [--budget-tokens N]` | 9 formats (see below) |
| `federate g1 g2... --out U [--fuzzy] [--fuzzy-threshold T] [--llm]` | Merge multiple graphs into one universe |
| `eval [--out-dir D] [--llm] [--judge-n N] [--graph G]` | Coverage / dedup / per-extractor / sampled LLM-judge |
| `presets` | List the 10 built-in schema presets |
| `extractors` | List the 8 registered extractors |
| `sessions` | List sessions in `graphanything-out/sessions/` |
| `use ` | Switch the active session pointer |
| `repl []` | Interactive shell (history + completion via `prompt_toolkit` if installed) |

Top-level flags (apply to every sub-command):

- `--sessions-dir PATH`: where session JSONs live (default `graphanything-out/sessions/`).
- `--session ID`: override the "current session" pointer for one call.

`new` and `repl` accept budget soft caps:

- `--max-tokens N`: total token ceiling
- `--max-dollars D`: total $ ceiling
- `--max-api-calls N`: total LLM-call ceiling

`run` / `sample` stop early when any cap is exceeded; remaining inputs are listed in the result `notes`.

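
The early-stop behaviour of the soft caps can be sketched roughly as follows. This is an illustration under stated assumptions, not GraphAnything's implementation: the `Budget` class, its field names, `run_with_budget`, and the shape of the returned dict are all hypothetical, and the sketch treats *reaching* a cap as exhausting it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Budget:
    """Soft caps mirroring --max-tokens / --max-dollars / --max-api-calls.

    Hypothetical structure; None means "no cap".
    """
    max_tokens: Optional[int] = None
    max_dollars: Optional[float] = None
    max_api_calls: Optional[int] = None
    tokens: int = 0
    dollars: float = 0.0
    api_calls: int = 0

    def charge(self, tokens, dollars):
        """Record the cost of one LLM call."""
        self.tokens += tokens
        self.dollars += dollars
        self.api_calls += 1

    def exhausted(self):
        """True once any configured cap has been reached."""
        checks = (
            (self.max_tokens, self.tokens),
            (self.max_dollars, self.dollars),
            (self.max_api_calls, self.api_calls),
        )
        return any(cap is not None and used >= cap for cap, used in checks)


def run_with_budget(inputs, extract, budget):
    """Process inputs in order; stop early and note what was skipped."""
    processed, notes = [], []
    for i, item in enumerate(inputs):
        if budget.exhausted():
            notes.append("budget exhausted; skipped: " + ", ".join(inputs[i:]))
            break
        tokens, dollars = extract(item)  # extract returns (tokens, dollars) spent
        budget.charge(tokens, dollars)
        processed.append(item)
    return {"processed": processed, "notes": notes}
```

Checking the budget before each input, rather than inside the extractor, keeps the caps soft: a call that is already in flight finishes, and only the remaining inputs are skipped.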
## Skill / MCP server (Claude Code, Cursor, Gemini, …)

The same Session core, accessed through **17 MCP tools**:

| Tool | Purpose |
|---|---|
| `graphanything_open_session` | Start a session over inputs |
| `graphanything_list_presets` | 10 built-in schema templates |
| `graphanything_list_extractors` | 8 extractors (rule + LLM + VLM stub) |
| `graphanything_propose_schema` | Suggest a starting schema |
| `graphanything_refine_schema` | Edit schema (regex + LLM fallback) |
| `graphanything_sample` | Extract from N inputs into `pending` |
| `graphanything_review` | Apply `accept_all` / `accept` / `reject` / `merge` / `rule` actions |
| `graphanything_run` | Full extraction → `graph.json` + snapshot |
| `graphanything_status` | Counts + cost + schema |
| `graphanything_ask` | Natural-language query |
| `graphanything_explain` | Provenance for one node / edge |
| `graphanything_update` | Incremental re-extract on changed files |
| `graphanything_versions` | List graph snapshots |
| `graphanything_diff` | Diff two snapshots |
| `graphanything_federate` | Combine multiple graphs |
| `graphanything_eval` | Coverage / dedup / quality report |
| `graphanything_render` | Mermaid / HTML / SVG / Cypher / GraphML / ASCII / JSON / Canvas / Timeline |

Start the server:

```bash
python -m GraphAnything.serve
```

Wire into Claude Code / Cursor / Gemini CLI by registering the same process in your MCP config (`~/.claude.json` / `.mcp.json` / equivalent):

```json
{
  "mcpServers": {
    "graphanything": {
      "command": "python",
      "args": ["-m", "GraphAnything.serve"],
      "env": {
        "GA_API_BASE": "http://localhost:8000/v1",
        "GA_MODEL": "Qwen3-32B-Instruct",
        "GA_API_KEY": "local"
      }
    }
  }
}
```

## 10 schema presets

`graphanything presets` lists them; `graphanything new --preset NAME` applies one. Drop your own YAML in `GraphAnything/schemas/.yaml` to register a new one.

| Preset | Domain |
|---|---|
| `chat-log` | Slack / Claude Code `.jsonl` / Discord → user / message / tool |
| `codebase` | Source repo → module / file / class / function / import / call |
| `contracts` | Legal contracts → party / clause / date / amount / governing law |
| `db-schema` | DDL / migrations / ORM → table / column / FK / index |
| `fstree` | Plain filesystem → directory / file / symlink |
| `meeting` | Meeting notes → person / topic / decision / action item |
| `obsidian-vault` | Obsidian / Notion vault → note / tag / wikilink / backlink |
| `openapi` | OpenAPI 2.x/3.x spec → endpoint / schema / ref / security |
| `papers` | Generic LLM-driven paper extraction |
| `pr-review` | GitHub PR trail → file / function / reviewer / concern |

## 8 built-in extractors

`graphanything extractors` lists them. A suffix-based dispatcher picks one unless `--extractor NAME` overrides it.

| Extractor | LLM? | Handles | Notes |
|---|---|---|---|
| `markdown` | ❌ | `.md`, `.markdown` | Note / Heading / Tag / WikiLink |
| `json-yaml` | ❌ | `.json`, `.ndjson`, `.yaml`, `.yml`, `.toml` | Generic config tree + `$ref` |
| `openapi` | ❌ | `.yaml`, `.yml`, `.json` | API / Endpoint / Schema / Parameter (force via `--extractor openapi`) |
| `fstree` | ❌ | directories | Directory / File / Symlink |
| `chatlog` | ❌ | `.jsonl`, `.txt`, `.log` | Channel / User / Message / Tool / ToolCall |
| `llm-entity` | ✅ | `*` (any text) | Generic entity / relation, with `evidence_span` + `rationale` |
| `vlm` | ✅ | `.pdf`, `.png`, `.jpg`, `.jpeg` | **Stub**; install a plugin to enable |
| `noop` | ❌ | `*` | Empty graph; for tests |

Adding a new extractor (Python plugin):

```python
from GraphAnything import register_extractor


def extract_my_format(path, **_):
    # An extractor takes an input path and returns a {nodes, edges} dict.
    return {
        "nodes": [{"id": "x", "label": "X",
                   "file_type": "document", "source_file": str(path)}],
        "edges": [],
    }


# Register under a name, with the file suffixes the dispatcher should route here.
register_extractor(
    "my-format",
    extract_my_format,
    version="0.1.0",
    handles=(".myext",),
    description="My custom format extractor",
)
```

`run_extractor()` automatically stamps provenance (`extractor_id`, `extractor_version`, `extraction_time`, `source_hash`).

To replace the VLM stub with a real model:

```python
from GraphAnything import register_extractor

# my_real_impl is your own extractor callable (same shape as extract_my_format above).
register_extractor(
    "vlm",
    my_real_impl,
    version="1.0.0",
    handles=(".pdf", ".png", ".jpg"),
    needs_llm=True,
    overwrite=True,
)
```

## Versioning, federation, quality

**Incremental updates.** `graphanything update` rehashes every input; unchanged files keep their previous nodes / edges verbatim, changed files are re-extracted, and the result is normalised and snapshotted as the next `versions/v.json`. `diff` then works between any two versions.

**Federation.** `graphanything federate g1 g2 ... --out universe.json` merges several graphs into one universe. Same-label entities of the same type collapse exactly; with `--fuzzy --fuzzy-threshold 0.7` it also proposes `same_as` edges via Jaccard token overlap, optionally with `--llm` tie-breaking on borderline pairs.

**Quality eval.** `graphanything eval --out-dir graphanything-out/quality` writes `QUALITY_REPORT.md`: coverage by node type, dedup density, per-extractor stats, and (with `--llm --judge-n 20`) an LLM verdict on 20 sampled edges against their `evidence_span`.

## REPL mode

```bash
graphanything repl ./contracts --preset contracts
```

Inside, every CLI sub-command is also a REPL command (`schema`, `propose [N]`, `refine "..."`, `sample [N]`, `review accept-all|accept ID...|reject ID...|merge a,b[,c]`, `run [DIR]`, `render FMT [PATH]`, `explain TARGET`, `status`, `cost`, `llm on|off`, `presets`, `extractors`, `help`, `quit`).

If `prompt_toolkit` is installed (`pip install -e ".[repl]"`), you get history + completion; otherwise the REPL falls back to bare `input()`.

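
Returning briefly to federation: the `--fuzzy` step proposes `same_as` edges via Jaccard token overlap. The sketch below shows one plausible reading of that idea and is not the project's actual code. The tokenisation rule, the `type` node field, the edge field names, and the functions `tokens`, `jaccard`, and `propose_same_as` are all assumptions made for illustration.

```python
import re
from itertools import combinations


def tokens(label):
    """Lowercase alphanumeric tokens of a label (assumed tokenisation)."""
    return set(re.findall(r"[a-z0-9]+", label.lower()))


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over the two labels' token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def propose_same_as(nodes, threshold=0.7):
    """Propose same_as edges between same-type nodes with similar labels.

    Identical labels are skipped because exact matches collapse elsewhere.
    Node and edge field names here are hypothetical.
    """
    edges = []
    for left, right in combinations(nodes, 2):
        if left["type"] != right["type"] or left["label"] == right["label"]:
            continue
        score = jaccard(left["label"], right["label"])
        if score >= threshold:
            edges.append({
                "source": left["id"],
                "target": right["id"],
                "relation": "same_as",
                "score": round(score, 3),
            })
    return edges
```

With this tokenisation, `ABC Corp` and `abc_corp` share both tokens and score 1.0, while `ABC Corp Ltd` against `ABC Corp` scores 2/3 and falls just under a 0.7 threshold. Pairs like that are the borderline cases where an `--llm` tie-break would be useful.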
## Session state on disk

```
graphanything-out/
├── sessions/
│   ├── .current                ← active session pointer
│   ├── sess_a1b2c3d4.json      ← one file per session
│   └── sess_...json
├── graph.json                  ← latest run/update output
├── versions/
│   ├── v1.json
│   ├── v2.json
│   └── manifest.json           ← schema_version + source_hashes per snapshot
└── quality/
    └── QUALITY_REPORT.md
```

The `Session` JSON is the source of truth: `accepted` / `pending` / `rejected` graph fragments, schema (with version), feedback log, running cost log, normalize rules, and the last source hashes for incremental update. Delete the file → the session is gone; copy it elsewhere → it relocates intact.

## Configuration reference (env vars)

GraphAnything reads env vars only on demand; every setting except the model name has a sensible default. Names are listed in priority order; the first one that is set wins.

| Setting | Env vars | Default |
|---|---|---|
| Chat-completions URL | `GA_API_BASE` / `OPENAI_API_BASE` / `OPENAI_BASE_URL` / `API_BASE` | `http://localhost:8000/v1` |
| Model name | `GA_MODEL` / `OPENAI_MODEL` / `SUMMARY_MODEL_NAME` | (required for LLM ops) |
| Bearer token | `GA_API_KEY` / `OPENAI_API_KEY` / `API_KEY` | empty (many local servers accept any string) |
| HTTP timeout (s) | `GA_HTTP_TIMEOUT` | `600` |

Each LLM call sends a standard `chat.completions` POST to `{API_BASE}/chat/completions` with `model`, `messages`, `temperature`, `max_tokens`, and (for `chat_json`) `response_format: {type: "json_object"}`. If the server rejects `response_format`, the call retries without it and GraphAnything regex-extracts JSON from the answer (so reasoning models emitting `...` blocks also work).

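
The JSON-recovery fallback described above could look something like the sketch below. It is a guess at the general technique, not the code in `llm_client.py`. `extract_json_object` is a hypothetical name, and the `<think>` tag used for reasoning blocks is an assumption, so adjust it to whatever your model emits. The sketch uses a regex only to locate candidate `{` positions and lets the standard `json` decoder do the parsing.

```python
import json
import re

# Assumption: reasoning text is wrapped in <think>...</think>.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def extract_json_object(text):
    """Return the first JSON object embedded in `text`, or None.

    Strips assumed reasoning blocks first, then tries to decode a JSON
    object starting at each "{" until one parses.
    """
    cleaned = THINK_BLOCK.sub("", text)
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", cleaned):
        try:
            # raw_decode parses one value at the given index and ignores
            # any trailing prose after it.
            obj, _end = decoder.raw_decode(cleaned, match.start())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    return None
```

Decoding from each candidate brace, rather than matching braces with a single regex, avoids miscounting braces that appear inside JSON strings.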
## Programmatic use

```python
from GraphAnything import open_session
from GraphAnything.llm_client import make_client

llm = make_client()  # reads env vars
sess = open_session(["./vault/"], preset="obsidian-vault")
sess.propose(auto_accept=True, llm=llm)
sess.run(llm=llm, out_dir="graphanything-out")
print(sess.accepted["nodes"][:3])
```

`make_client()` accepts overrides:

```python
llm = make_client(
    api_base="http://h-100:8000/v1",
    model="Qwen3-32B-Instruct",
    api_key="local",
    timeout=900,
)
```

## License

MIT.