MothRAG logo

# MothRag > **Deterministic, agentic-style multi-hop โ€” research-SOTA parity without the graph you'd rebuild every day.** On commodity LLM APIs alone. No GPU, no fine-tuning, a proof tree per answer. **Author:** Julian Geymonat ยท **License:** Apache 2.0 ยท **Status:** v0.5.0a alpha ๐Ÿ“„ Paper: *(arXiv link pending)* ยท ๐Ÿ”— Demo ยท ๐ŸŒ mothrag.com ## What it's for MothRag is for **answering questions over a corpus that changes** โ€” where the answer also needs **multi-hop reasoning** (it's spread across several documents and has to be chained). Think live stats and standings, prices and filings, support tickets, news, or internal docs that get updated every day. The systems that chase this kind of multi-hop accuracy lean on heavy indexing โ€” building a knowledge graph (GraphRAG, HippoRAG) or training a retriever. MothRag **outscores the graph-based ones on every benchmark** (HotpotQA, 2WikiMultiHop, MuSiQue) โ€” and without their cost: the moment your data changes, each of them has to push the new data **through a generative-LLM indexing pass** before it's queryable, and some recompute global structures on top. On data that moves daily, you pay that bill every time. MothRag does the reasoning as **query-time orchestration over a plain dense index**. An update is **embed + append** โ€” one embedding call, no LLM extraction, no graph rebuild, no retraining. So it stays current on data that moves under it. โ†’ See it: a runnable [World Cup freshness demo](https://huggingface.co/spaces/JUBOX99/mothrag-demo) โ€” the answer follows the standings as new results are appended, with no re-index. Sourced cost comparison vs GraphRAG / HippoRAG2 / RAPTOR / LightRAG in [`GRAPH_COMPARISON.md`](#). ## Three things you get at once 1. **Freshness โ€” no rebuild.** Updates are embedding-only (`ingest()` embeds the changed chunks and appends). No graph reconstruction, no retraining. The public API is graph-free. 2. **No GPU, commodity APIs, cheap.** Reader = Llama-3.3-70B over an API (Groq), embedder = Gemini. **~$0.032/query** measured (full config); **~$0.018/query** on the economy tier. 3. **Deterministic + auditable.** Same input โ†’ same output, **zero run-to-run variance**, and every answer ships an inspectable proof tree. --- Every component โ€” reader, embedder, retrieval judges โ€” sits behind a commodity pay-per-call API. No local GPU, no constrained decoding, no non-commercially-licensed model. Deployment is a package install plus API keys. ## You don't trade accuracy for any of it ## Results (paper, n=1000 per dataset, Llama-3.3-70B reader, single uniform configuration) ![Against the most popular RAG systems, MOTHRAG (commodity APIs, no GPU) outscores RAPTOR, GraphRAG and HippoRAG 2 on HotpotQA, 2WikiMultiHopQA and MuSiQue](assets/benchmark_popular.png) *Against the most popular RAG systems, MOTHRAG outscores every one on every benchmark, on commodity APIs alone.* ![F1 on HotpotQA / 2WikiMultiHopQA / MuSiQue โ€” MOTHRAG (commodity APIs, no GPU) vs. HippoRAG 2, CoRAG, and NeocorRAG](assets/benchmark_comparison.png) *And against the GPU-bound research frontier (HippoRAG 2, CoRAG, NeocorRAG), MOTHRAG reaches the same tier without the GPU or the non-commercial models.* | System | Deployment profile | HotpotQA | 2WikiMultiHopQA | MuSiQue | AVG | |---|---|---|---|---|---| | HippoRAG 2 (as published) | offline OpenIE graph + NV-Embed-v2 peak | 75.5 | 71.0 | 48.6 | 65.0 | | CoRAG (training-based, as reproduced) | trained chain-retrieval | 75.1 | 75.1 | 52.9 | 67.7 | | NeocorRAG (as published) | GPU-bound constrained decoding + NV-Embed-v2 | **78.3** | 76.1 | **52.6** | **69.0** | | **MOTHRAG (ours)** | **commodity APIs only** | 78.1 | **76.3** | 50.5 | 68.3 | F1, competitor numbers as published in the cited sources (same reader class). MOTHRAG attains the **highest average F1 among commercially-deployable frameworks** โ€” within 0.7 points of the GPU-bound research state of the art (parity on HotpotQA at โˆ’0.2, an edge on 2WikiMultiHopQA at +0.2, an honest gap on MuSiQue at โˆ’2.1). **Measured cost:** $0.032/query (reader + retrieval-judge, measured over 3,000 queries). A documented **economy tier** (one-flag retrieval-judge swap) runs at โ‰ˆ$0.018/query (โˆ’44%) at statistical parity on HotpotQA and 2WikiMultiHopQA, with a measured trade-off only on MuSiQue (โˆ’2.12). All SOTA-parity claims attach to the full configuration. Answers are **proof-tree-structured**: each output carries inspectable reasoning steps over the assembled evidence, with a ฮณ-cap fallback when the grounding check cannot be satisfied within budget. ## Why deterministic? Naive single-shot RAG is fading; the field is moving to *agentic* retrieval โ€” planning, multi-hop iteration, reflection. MOTHRAG takes those mechanisms โ€” query **decomposition**, **grounding-driven iteration**, **multi-hop** evidence chaining โ€” but runs them through **deterministic** orchestration: same inputs โ†’ same answer, with an inspectable audit trail, instead of a flaky free-form agent loop. We tested the alternative: letting an LLM route the pipeline. Every model we tried (Llama-3.3-70B, Claude Sonnet, Gemini Flash, Claude Haiku) did *worse* than the deterministic router. Determinism won on **accuracy and reproducibility** โ€” which is exactly what you want when you put multi-hop retrieval into production or evaluate it cleanly. ## Install ```bash pip install mothrag # Recommended baseline (Gemini embeddings + Groq Llama-3.3-70B reader): pip install 'mothrag[gemini,openai]' # Full production stack: pip install 'mothrag[prod]' ``` | Extra | Pulls | Used by | |---|---|---| | `gemini` | `google-genai` | `GeminiEmbedder`, `GeminiReader` | | `openai` | `openai` | `OpenAIReader`, `GroqReader` (Groq's OpenAI-compatible API) | | `sentence-transformers` | `sentence-transformers` | local embedding fallback | | `retrieval` | `scikit-learn`, `networkx`, `rank-bm25` | classic-RAG features | | `faiss` | `faiss-cpu` | vector store for 100kโ€“10M chunk corpora | | `prod` | bundles the above + loaders | full stack | ## Quickstart ```python from mothrag import MothRAG # Works out of the box (degrades gracefully without keys); # set GROQ_API_KEY + GEMINI_API_KEY for production quality. m = MothRAG.from_documents([ "Paris is the capital of France.", "The Eiffel Tower is in Paris.", ]) result = m.query("In which country is the Eiffel Tower?") print(result.answer) # the answer print(result.arm_used) # which reasoning arm won arbitration print(result.confidence) # arbitration confidence ``` API keys via environment (see `.env.example`): `GROQ_API_KEY` (reader), `GEMINI_API_KEY` (embedder + grounding judge), `ANTHROPIC_API_KEY` (premium retrieval judge; optional โ€” the economy tier uses Gemini). ### Keeping the index current When a fact changes, replace it in place; when it is retracted, drop it. Both are incremental: one embedding pass over the changed document, no graph rebuild and no retraining. ```python from mothrag import MothRAG, Document m = MothRAG() m.ingest([Document(text="The price is $10.", metadata={"source": "price"})]) m.update("price", "The price is $20.") # supersede in place m.delete("price") # retract entirely ``` `update`/`delete` key on a document's `source` id and require `retrieval="dense"` (the default); a clear error is raised on append-only vector stores. ## How it works Three separable stages; every model invocation is a commodity API call: 1. **Bridge retrieval substrate** โ€” multi-query ANN fusion re-ranked by a tripartite LLM judge conditioned on retrieved bridge evidence, reshaping every retrieval (primary, sub-question, iterative). A post-retrieval **ChainFilter** re-scores the ranking by chain density over OpenIE triples, gated by input features only. 2. **Four-arm ensemble pool** โ€” direct read, decomposition, iterative refinement (ฮณ-driven re-retrieval), and **Pool-Duplicate Dispatch (PDD)**: a deterministic copy of the iterative arm's candidate that double-weights the grounding-checked voice in arbitration at zero extra inference. Pool cardinality is fixed at N=4 (five-arm pools regressed consistently). 3. **Deterministic arbitration** โ€” fixed weights over grounding status (ฮณ, 1.0), cross-arm agreement (0.5), and faithfulness (0.3). No learned components anywhere; all gates condition on input features of the question, never on dataset identity. ## Reproducing the paper The paper numbers come from the evaluation configuration (`scripts/route_prospective.py` with the full flag set), not from the high-level quickstart API. See **[`paper/REPRODUCE.md`](paper/REPRODUCE.md)** for the verbatim CLI, required inputs, and expected outputs. The per-query outputs behind every table in the paper are released in **[`paper/results/`](paper/results/)** (six JSONs: three datasets ร— premium/economy tiers, n=1000 each). ## Citing See [`CITATION.cff`](CITATION.cff). Cite the Zenodo preprint: ```bibtex @misc{geymonat2026mothrag, title = {MOTHRAG: Training-Free Multi-Hop Question Answering at Research-SOTA Parity on Commodity LLM APIs}, author = {Geymonat, Julian}, year = {2026}, doi = {10.5281/zenodo.20668567}, url = {https://doi.org/10.5281/zenodo.20668567} } ``` ## License Apache-2.0. ยฉ 2026 Julian Geymonat. Research supported by ItalySoft srl.