## Godot Docs Search: Indexing and Querying This document explains how the Godot documentation corpus is indexed into the BigQuery vector store and how the editor queries it via the `search_across_godot_docs` tool. ### What gets indexed - Tutorials and guides from the official `godot-docs` repository (RST files) - Class reference from the Godot engine repository (XML files) Each document is parsed to plain text, chunked, embedded with OpenAI `text-embedding-3-small`, and upserted into the vector table compatible with the backend `CloudVectorManager`. ### Corpus identifiers - `user_id`: `public_docs` - `project_id`: `godot_docs_latest` These can be overridden by CLI flags. ### Requirements - Environment variables: - `GCP_PROJECT_ID`: Your Google Cloud project with BigQuery enabled - `OPENAI_API_KEY`: API key for embeddings - Optional environment overrides: - `EMBED_DATASET` (default `godot_embeddings`) - `EMBED_TABLE` (default `embeddings`) - `DOCS_BRANCH` (default `latest`, falls back to `master`, etc.) - `GODOT_ENGINE_BRANCH` (default `master`) ### One-liners (recommended) Run from anywhere; the script resolves paths itself. 1) Quick smoke test to JSONL (does not touch BigQuery): ```bash backend/scripts/index_godot_docs.sh \ --max-files-rst 25 --max-files-xml 10 --limit-chunks 100 \ --out-jsonl build_temp/godot_docs.jsonl ``` 2) Full run into BigQuery (creates dataset/table if missing): ```bash GCP_PROJECT_ID=your-gcp-project OPENAI_API_KEY=sk-... \ backend/scripts/index_godot_docs.sh --force ``` 3) Pin branches: ```bash backend/scripts/index_godot_docs.sh --branch 4.2 --engine-branch 4.2 ``` For a usage banner: ```bash backend/scripts/index_godot_docs.sh --usage ``` ### Under the hood - Script: `backend/scripts/index_godot_docs.py` - Downloads `godot-docs` (RST) and the engine repo (XML) with robust branch fallbacks - Converts RST to plain text, parses XML for classes, methods, signals, and properties - Chunks by characters with overlap (defaults tunable via env) - Embeds in batches with retry and backoff - Inserts to BigQuery in batches (or writes JSONL shards if `--out-jsonl` is used) - Shell wrapper: `backend/scripts/index_godot_docs.sh` - Reads `GCP_PROJECT_ID` and `OPENAI_API_KEY` from environment or `backend/.env` - Executes the Python module from repo root for stable imports - Forwards all CLI flags to the Python script ### Querying from the editor The editor tool `search_across_godot_docs` calls the backend endpoint: - Python handler: `backend/app.py` → `search_across_godot_docs_internal` - HTTP endpoint (optional direct access): `POST /search_docs` with JSON: ```json { "query": "How do I connect a signal between nodes in Godot 4?", "max_results": 5 } ``` Results are shown in the AI Chat dock as expandable cards with snippet and full content. The backend relies on the shared corpus identifiers (`public_docs/godot_docs_latest`) to fetch from BigQuery. ### Tuning performance - Embedding: - `EMBED_BATCH_SIZE` (default 128) - `EMBED_MAX_PARALLEL` (default 8) - `DOCS_CHUNK_MAX_CHARS` (default 2000) - `DOCS_CHUNK_OVERLAP` (default 200) - JSONL sharding: `DOCS_JSONL_SHARD_SIZE` (default 10000) ### Troubleshooting - 404 on docs download: script automatically falls back through branches (`master`, `stable`, `4.3`, `4.2`). - BigQuery permission errors: ensure your `gcloud auth application-default login` is configured or ADC is available to the environment running the script. - Empty snippets in results: the backend now prefers `content` and falls back to `content_preview` or nested `chunk.content`. ### Example cURL to test the backend endpoint directly ```bash curl -sS -X POST http://127.0.0.1:8001/search_docs \ -H "Content-Type: application/json" \ -H "X-Machine-ID: local-dev" \ --data-raw '{"query":"How do I connect a signal between nodes in Godot 4?","max_results":5}' | jq ``` This should return a JSON with `success: true` and a `results` array containing titles, snippets, full content, similarity, and file paths.