# Kodit

A code and document intelligence server that indexes Git repositories and provides search through MCP and REST APIs.

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg?style=for-the-badge)](./LICENSE) [![Discussions](https://img.shields.io/badge/Discussions-181717?style=for-the-badge&logo=github&logoColor=white)](https://github.com/helixml/kodit/discussions)

AI coding assistants work better when they have access to real examples from your codebase. Kodit indexes your repositories, splits source files into searchable snippets, and serves them to any MCP-compatible assistant. When your assistant needs to write new code, it queries Kodit first and gets back relevant, up-to-date examples drawn from your own projects.

Kodit also handles documents. PDFs, Word files, PowerPoint decks, and spreadsheets are rasterized and indexed so you can search across both code and documentation in one place.

**What you get:**

- **Multiple search strategies** including BM25 keyword search, semantic vector search, regex grep, and visual document search, each exposed as a separate MCP tool so your assistant picks the right approach for each query
- **MCP server** that works with Claude Code, Cursor, Cline, Kilo Code, and any other MCP-compatible assistant
- **REST API** for programmatic access to search, repositories, enrichments, and indexing status
- **AI enrichments** (optional) including architecture docs, API docs, database schema detection, cookbook examples, and commit summaries, all generated by an LLM
- **Document intelligence** with visual search across PDF pages, Office documents, and images using multimodal embeddings
- **No external dependencies required** for basic operation, with a built-in embedding model and SQLite storage

## Quickstart

### Docker (recommended)

```sh
docker run -p 8080:8080 registry.helix.ml/helix/kodit:latest
```

This starts Kodit with SQLite storage and a built-in embedding model. No API keys needed.

### Pre-built binaries

Download a binary from the [releases page](https://github.com/helixml/kodit/releases), then:

```sh
chmod +x kodit
./kodit serve
```

### Verify it works

Open the interactive API docs at [http://localhost:8080/docs](http://localhost:8080/docs).
Or index a small repository and run a search:

```sh
# Index a repository
curl http://localhost:8080/api/v1/repositories \
  -X POST -H "Content-Type: application/json" \
  -d '{
    "data": {
      "type": "repository",
      "attributes": {
        "remote_uri": "https://gist.github.com/philwinder/7aa38185e20433c04c533f2b28f4e217.git"
      }
    }
  }'

# Check indexing progress
curl http://localhost:8080/api/v1/repositories/1/status

# Search (once indexing is complete)
curl http://localhost:8080/api/v1/search \
  -X POST -H "Content-Type: application/json" \
  -d '{
    "data": {
      "type": "search",
      "attributes": {
        "keywords": ["orders"],
        "text": "code to get all orders"
      }
    }
  }'
```

## Connecting to AI Assistants

Kodit exposes an MCP endpoint at `/mcp`. Connect your assistant to start using Kodit as a code search tool.

### Claude Code

```sh
claude mcp add --transport http kodit http://localhost:8080/mcp
```

### Cursor

Add to `~/.cursor/mcp.json`:

```json
{
  "mcpServers": {
    "kodit": {
      "url": "http://localhost:8080/mcp"
    }
  }
}
```

### Cline

Add to the MCP Servers configuration (Remote Servers tab):

```json
{
  "mcpServers": {
    "kodit": {
      "autoApprove": [],
      "disabled": false,
      "timeout": 60,
      "type": "streamableHttp",
      "url": "http://localhost:8080/mcp"
    }
  }
}
```

### Kilo Code

Add to the MCP configuration (Edit Project/Global MCP):

```json
{
  "mcpServers": {
    "kodit": {
      "type": "streamable-http",
      "url": "http://localhost:8080/mcp",
      "alwaysAllow": [],
      "disabled": false
    }
  }
}
```

Replace `http://localhost:8080` with your server URL if running remotely.

### Encouraging assistants to use Kodit

Some assistants may not call Kodit tools automatically. Add this to your project rules or system prompt to enforce usage:

```
For every request that involves writing or modifying code, the assistant's first
action must be to call the kodit search MCP tools. Only produce or edit code after
the tool call returns results.
```

In Cursor, save this as `.cursor/rules/kodit.mdc` with `alwaysApply: true` frontmatter.

## MCP Tools

Kodit exposes these tools to connected AI assistants:

| Tool | Description |
|------|-------------|
| `kodit_repositories` | List all indexed repositories |
| `kodit_semantic_search` | Semantic similarity search across code |
| `kodit_keyword_search` | BM25 keyword search |
| `kodit_visual_search` | Search document page images |
| `kodit_grep` | Regex pattern matching |
| `kodit_ls` | List files by glob pattern |
| `kodit_read_resource` | Read file content by URI |
| `kodit_architecture_docs` | Architecture documentation for a repo |
| `kodit_api_docs` | Public API documentation |
| `kodit_database_schema` | Database schema documentation |
| `kodit_cookbook` | Usage examples and patterns |
| `kodit_commit_description` | Commit description |
| `kodit_wiki` | Wiki table of contents |
| `kodit_wiki_page` | Read a specific wiki page |
| `kodit_version` | Server version |

The enrichment tools (`kodit_architecture_docs`, `kodit_api_docs`, `kodit_database_schema`, `kodit_cookbook`, `kodit_wiki`, `kodit_commit_description`) require an LLM provider to be configured. See Enrichment Providers under Configuration Reference.

## Go Library

Kodit can be embedded directly as a Go library. This is how [Helix](https://helix.ml) integrates Kodit into its platform.

```go
import "github.com/helixml/kodit"

// ctx is a context.Context, for example context.Background().
client, err := kodit.New(
	kodit.WithSQLite(".kodit/data.db"),
)
if err != nil {
	log.Fatal(err)
}
defer client.Close()

// Index a repository
_, _, err = client.Repositories.Add(ctx, &service.RepositoryAddParams{
	URL: "https://github.com/kubernetes/kubernetes",
})
if err != nil {
	log.Fatal(err)
}

// Search
results, err := client.Search.Query(ctx, "create a deployment",
	service.WithLimit(10),
)
if err != nil {
	log.Fatal(err)
}

for _, result := range results.Enrichments() {
	fmt.Println(result.Subtype(), result.Content())
}
```

### Library options

| Option | Description |
|--------|-------------|
| `WithSQLite(path)` | Use SQLite for storage |
| `WithPostgresVectorchord(dsn)` | Use PostgreSQL with VectorChord |
| `WithOpenAI(apiKey)` | OpenAI for embeddings and text |
| `WithAnthropic(apiKey)` | Anthropic Claude for text (needs separate embedding provider) |
| `WithTextProvider(p)` | Custom text generation provider |
| `WithEmbeddingProvider(p)` | Custom embedding provider |
| `WithRAGPipeline()` | Skip LLM enrichments, index and search only |
| `WithFullPipeline()` | Require all enrichments (errors without a text provider) |
| `WithDataDir(dir)` | Data directory (default: `~/.kodit`) |
| `WithCloneDir(dir)` | Repository clone directory |
| `WithAPIKeys(keys...)` | API keys for HTTP authentication |
| `WithWorkerCount(n)` | Number of background workers (default: 1) |
| `WithPeriodicSyncConfig(cfg)` | Automatic repository sync settings |

### Search options

| Option | Description |
|--------|-------------|
| `WithSemanticWeight(w)` | Weight for semantic vs keyword search (0.0 to 1.0) |
| `WithLimit(n)` | Maximum number of results |
| `WithOffset(n)` | Offset for pagination |
| `WithLanguages(langs...)` | Filter by programming languages |
| `WithRepositories(ids...)` | Filter by repository IDs |
| `WithMinScore(score)` | Minimum score threshold |
| `WithEnrichmentTypes(types...)` | Filter results to specific enrichment types |
| `WithSnippets(include)` | Include code snippets in results |
| `WithDocuments(include)` | Include enrichment documents in results |

### Go HTTP client

A generated HTTP client is available for calling a remote Kodit server from Go:

```sh
go get github.com/helixml/kodit/clients/go
```

```go
import koditclient "github.com/helixml/kodit/clients/go"

client, err := koditclient.NewClient("https://kodit.example.com")

// List repositories
repos, err := client.GetRepositories(ctx, nil)

// Search (distinct variable names avoid redeclaring with :=)
text := "create a deployment"
found, err := client.PostSearch(ctx, koditclient.PostSearchJSONRequestBody{
	Data: &koditclient.DtoSearchData{
		Attributes: &koditclient.DtoSearchAttributes{
			Text: &text,
		},
	},
})
```

Types are auto-generated from the OpenAPI spec. See the interactive API docs at `/docs` for the full endpoint list.

## Production Deployment

For production use, deploy with PostgreSQL (VectorChord) for scalable vector search and a dedicated LLM provider for enrichments.

### Docker Compose

Save this as `docker-compose.yaml`:

```yaml
services:
  kodit:
    image: registry.helix.ml/helix/kodit:latest
    ports:
      - "8080:8080"
    command: ["serve"]
    restart: unless-stopped
    depends_on:
      - vectorchord
    environment:
      DATA_DIR: /data
      DB_URL: postgresql://postgres:mysecretpassword@vectorchord:5432/kodit

      # Enrichment LLM (optional, enables AI-generated docs)
      ENRICHMENT_ENDPOINT_BASE_URL: http://ollama:11434
      ENRICHMENT_ENDPOINT_MODEL: ollama/qwen3:1.7b

      # External embedding provider (optional, replaces built-in model)
      # EMBEDDING_ENDPOINT_API_KEY: sk-proj-xxxx
      # EMBEDDING_ENDPOINT_MODEL: openai/text-embedding-3-small

      LOG_LEVEL: INFO
      API_KEYS: ${KODIT_API_KEYS:-}
    volumes:
      - kodit-data:/data

  vectorchord:
    image: tensorchord/vchord-suite:pg17-20250601
    environment:
      POSTGRES_DB: kodit
      POSTGRES_PASSWORD: mysecretpassword
    volumes:
      - vectorchord-data:/var/lib/postgresql/data
    restart: unless-stopped

volumes:
  kodit-data:
  vectorchord-data:
```

### Kubernetes

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vectorchord
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vectorchord
  template:
    metadata:
      labels:
        app: vectorchord
    spec:
      containers:
        - name: vectorchord
          image: tensorchord/vchord-suite:pg17-20250601
          env:
            - name: POSTGRES_DB
              value: kodit
            - name: POSTGRES_PASSWORD
              value: mysecretpassword
          ports:
            - containerPort: 5432
---
apiVersion: v1
kind: Service
metadata:
  name: vectorchord
spec:
  selector:
    app: vectorchord
  ports:
    - port: 5432
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kodit
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kodit
  template:
    metadata:
      labels:
        app: kodit
    spec:
      containers:
        - name: kodit
          image: registry.helix.ml/helix/kodit:latest # pin to a specific version
          args: ["serve"]
          env: [] # see Configuration Reference for environment variables
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: kodit
spec:
  type: LoadBalancer
  selector:
    app: kodit
  ports:
    - port: 8080
```

### Authentication

Set the `API_KEYS` environment variable to a comma-separated list of keys. Write endpoints (creating repositories, triggering syncs) require a valid key in the `Authorization: Bearer ` header. Search endpoints are open by default.

## Configuration Reference

Configuration is done through environment variables.
You can also use a `.env` file:

```sh
kodit serve --env-file .env
```

### Server

| Variable | Default | Description |
|----------|---------|-------------|
| `HOST` | `0.0.0.0` | Listen address |
| `PORT` | `8080` | Listen port |
| `DATA_DIR` | `~/.kodit` | Data directory for models, clones, and database |
| `DB_URL` | (empty) | PostgreSQL connection string (uses SQLite if empty) |
| `LOG_LEVEL` | `INFO` | Logging verbosity: `DEBUG`, `INFO`, `WARN`, `ERROR` |
| `LOG_FORMAT` | `pretty` | Log format: `pretty` or `json` |
| `API_KEYS` | (empty) | Comma-separated API keys for write endpoints |
| `WORKER_COUNT` | `1` | Number of background workers |
| `SEARCH_LIMIT` | `10` | Default search result limit |
| `DISABLE_TELEMETRY` | `false` | Disable anonymous usage telemetry |
| `HTTP_CACHE_DIR` | (empty) | Directory for caching HTTP POST responses to disk; avoids repeated API calls during development |
| `REPORTING_LOG_TIME_INTERVAL` | `5` | Progress reporting interval in seconds |

### Embedding Provider

These configure an external embedding model. If unset, Kodit uses its built-in model.


| Variable | Default | Description |
|----------|---------|-------------|
| `EMBEDDING_ENDPOINT_BASE_URL` | (empty) | Base URL of embedding service |
| `EMBEDDING_ENDPOINT_MODEL` | (empty) | Model identifier |
| `EMBEDDING_ENDPOINT_API_KEY` | (empty) | API key |
| `EMBEDDING_ENDPOINT_MAX_TOKENS` | `0` | Max tokens per request (0 = provider default) |
| `EMBEDDING_ENDPOINT_MAX_BATCH_CHARS` | `16000` | Max total characters per embedding batch |
| `EMBEDDING_ENDPOINT_MAX_BATCH_SIZE` | `1` | Max items per batch |
| `EMBEDDING_ENDPOINT_TIMEOUT` | `60` | Request timeout in seconds |
| `EMBEDDING_ENDPOINT_NUM_PARALLEL_TASKS` | `1` | Concurrent embedding requests |
| `EMBEDDING_ENDPOINT_EXTRA_PARAMS` | (empty) | JSON-encoded extra parameters for the embedding provider |
| `EMBEDDING_ENDPOINT_QUERY_INSTRUCTION` | (empty) | Instruction prepended to queries for asymmetric retrieval |
| `EMBEDDING_ENDPOINT_DOCUMENT_INSTRUCTION` | (empty) | Instruction prepended to documents for asymmetric retrieval |
| `EMBEDDING_ENDPOINT_SOCKET_PATH` | (empty) | Unix socket path for local provider (alternative to BASE_URL) |
| `EMBEDDING_ENDPOINT_MAX_RETRIES` | `5` | Maximum retry attempts on request failure |
| `EMBEDDING_ENDPOINT_INITIAL_DELAY` | `2.0` | Initial retry delay in seconds |
| `EMBEDDING_ENDPOINT_BACKOFF_FACTOR` | `2.0` | Retry backoff multiplier |

### Vision Embedding Provider

These configure a remote service for image and text vision embeddings. If unset, Kodit uses its built-in SigLIP2 model.


| Variable | Default | Description |
|----------|---------|-------------|
| `VISION_EMBEDDING_ENDPOINT_BASE_URL` | (empty) | Base URL of vision embedding service |
| `VISION_EMBEDDING_ENDPOINT_MODEL` | (empty) | Model identifier |
| `VISION_EMBEDDING_ENDPOINT_API_KEY` | (empty) | API key |
| `VISION_EMBEDDING_ENDPOINT_MAX_TOKENS` | `0` | Max tokens per request (0 = provider default) |
| `VISION_EMBEDDING_ENDPOINT_MAX_BATCH_CHARS` | `16000` | Max total characters per embedding batch |
| `VISION_EMBEDDING_ENDPOINT_MAX_BATCH_SIZE` | `1` | Max items per batch |
| `VISION_EMBEDDING_ENDPOINT_TIMEOUT` | `60` | Request timeout in seconds |
| `VISION_EMBEDDING_ENDPOINT_NUM_PARALLEL_TASKS` | `1` | Concurrent vision embedding requests |
| `VISION_EMBEDDING_ENDPOINT_EXTRA_PARAMS` | (empty) | JSON-encoded extra parameters for the vision embedding provider |
| `VISION_EMBEDDING_ENDPOINT_QUERY_INSTRUCTION` | (empty) | Instruction prepended to queries for asymmetric retrieval |
| `VISION_EMBEDDING_ENDPOINT_DOCUMENT_INSTRUCTION` | (empty) | Instruction prepended to documents for asymmetric retrieval |
| `VISION_EMBEDDING_ENDPOINT_SOCKET_PATH` | (empty) | Unix socket path for local provider (alternative to BASE_URL) |
| `VISION_EMBEDDING_ENDPOINT_MAX_RETRIES` | `5` | Maximum retry attempts on request failure |
| `VISION_EMBEDDING_ENDPOINT_INITIAL_DELAY` | `2.0` | Initial retry delay in seconds |
| `VISION_EMBEDDING_ENDPOINT_BACKOFF_FACTOR` | `2.0` | Retry backoff multiplier |

### Enrichment Providers

These configure an LLM for generating architecture docs, API docs, database schemas, cookbooks, commit summaries, and wiki pages. Without this, Kodit indexes and searches code but does not generate any AI documentation.


| Variable | Default | Description |
|----------|---------|-------------|
| `ENRICHMENT_ENDPOINT_BASE_URL` | (empty) | Base URL of LLM service |
| `ENRICHMENT_ENDPOINT_MODEL` | (empty) | Model identifier |
| `ENRICHMENT_ENDPOINT_API_KEY` | (empty) | API key |
| `ENRICHMENT_ENDPOINT_NUM_PARALLEL_TASKS` | `1` | Concurrent enrichment requests |
| `ENRICHMENT_ENDPOINT_TIMEOUT` | `60` | Request timeout in seconds |
| `ENRICHMENT_ENDPOINT_EXTRA_PARAMS` | (empty) | JSON-encoded extra parameters for the LLM |
| `ENRICHMENT_ENDPOINT_MAX_TOKENS` | `0` | Max tokens per response (0 = provider default) |
| `ENRICHMENT_ENDPOINT_SOCKET_PATH` | (empty) | Unix socket path for local provider (alternative to BASE_URL) |
| `ENRICHMENT_ENDPOINT_MAX_RETRIES` | `5` | Maximum retry attempts on request failure |
| `ENRICHMENT_ENDPOINT_INITIAL_DELAY` | `2.0` | Initial retry delay in seconds |
| `ENRICHMENT_ENDPOINT_BACKOFF_FACTOR` | `2.0` | Retry backoff multiplier |
| `ENRICHMENT_ENDPOINT_MAX_BATCH_CHARS` | `16000` | Max total characters per batch |
| `ENRICHMENT_ENDPOINT_MAX_BATCH_SIZE` | `1` | Max items per batch |
| `ENRICHMENT_ENDPOINT_QUERY_INSTRUCTION` | (empty) | Instruction prepended to queries for asymmetric retrieval |
| `ENRICHMENT_ENDPOINT_DOCUMENT_INSTRUCTION` | (empty) | Instruction prepended to documents for asymmetric retrieval |

Enrichment is typically the slowest part of indexing because each enrichment requires a round-trip to the LLM provider. Increase `NUM_PARALLEL_TASKS` to speed things up, but respect your provider's rate limits. Start low and increase over time.

Provider examples:

```sh
# OpenAI
ENRICHMENT_ENDPOINT_BASE_URL=https://api.openai.com/v1
ENRICHMENT_ENDPOINT_MODEL=gpt-4o-mini
ENRICHMENT_ENDPOINT_API_KEY=sk-proj-xxxx

# Ollama (local)
ENRICHMENT_ENDPOINT_BASE_URL=http://localhost:11434
ENRICHMENT_ENDPOINT_MODEL=ollama/qwen3:1.7b

# Helix (private cloud)
ENRICHMENT_ENDPOINT_BASE_URL=https://app.helix.ml/v1
ENRICHMENT_ENDPOINT_MODEL=Qwen/Qwen3-8B
ENRICHMENT_ENDPOINT_API_KEY=your-helix-key
```

### Periodic Sync

| Variable | Default | Description |
|----------|---------|-------------|
| `PERIODIC_SYNC_ENABLED` | `true` | Auto-sync repositories on an interval |
| `PERIODIC_SYNC_INTERVAL_SECONDS` | `1800` | Sync interval (default: 30 minutes) |
| `PERIODIC_SYNC_RETRY_ATTEMPTS` | `3` | Retry count on sync failure |

### Chunking

| Variable | Default | Description |
|----------|---------|-------------|
| `CHUNK_SIZE` | `1500` | Characters per chunk |
| `CHUNK_OVERLAP` | `200` | Overlap between adjacent chunks |
| `CHUNK_MIN_SIZE` | `50` | Minimum chunk size |

## REST API

The full API is documented interactively at `/docs` on a running Kodit instance. The OpenAPI 3.0 specification is available at `/docs/openapi.json`.
Key endpoints:

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/repositories` | Add a repository for indexing |
| `GET` | `/api/v1/repositories` | List indexed repositories |
| `GET` | `/api/v1/repositories/{id}/status` | Indexing progress |
| `POST` | `/api/v1/repositories/{id}/sync` | Trigger a sync |
| `DELETE` | `/api/v1/repositories/{id}` | Remove a repository |
| `POST` | `/api/v1/search` | Combined search (keyword + semantic) |
| `GET` | `/api/v1/search/semantic` | Semantic search only |
| `GET` | `/api/v1/search/keyword` | Keyword search only |
| `GET` | `/api/v1/search/visual` | Visual search on document pages |
| `GET` | `/api/v1/search/grep` | Regex pattern search |
| `GET` | `/api/v1/search/ls` | List files by glob |

All write endpoints require an `Authorization: Bearer ` header when `API_KEYS` is set.

## How Indexing Works

When you add a repository, Kodit runs a pipeline:

1. **Clone** the Git repository to local storage
2. **Scan** commits, branches, and tags to extract metadata
3. **Extract snippets** by splitting source files into overlapping text chunks
4. **Build search indexes** with BM25 (keyword) and vector embeddings (semantic)
5. **Generate enrichments** (if an LLM provider is configured): architecture docs, API docs, database schemas, cookbook examples, commit summaries, and wiki pages

Kodit tracks which files have changed between syncs and only reprocesses modified content. Repositories sync automatically on a configurable interval (default: every 30 minutes).

### Supported sources

Kodit indexes any Git repository accessible via HTTPS, SSH, or the Git protocol. This includes GitHub, GitLab, Bitbucket, Azure DevOps, and self-hosted servers.

### Private repositories

Private repositories are supported through personal access tokens or SSH keys:

```sh
# HTTPS with token
https://username:token@github.com/username/repo.git

# SSH (ensure your SSH key is configured)
git@github.com:username/repo.git
```

### Privacy

Kodit respects `.gitignore` and `.noindex` files. Files matching these patterns are excluded from indexing.

## Storage Backends

### SQLite (default)

No configuration needed. Kodit creates a SQLite database in the data directory with FTS5 for keyword search and in-process vector storage. Good for single-user and small-team deployments.

### PostgreSQL with VectorChord

For larger deployments, use PostgreSQL with the [VectorChord](https://github.com/tensorchord/VectorChord) extension. This provides scalable vector search and concurrent access. Set the `DB_URL` environment variable to your connection string.

The recommended Docker image is `tensorchord/vchord-suite:pg17-20250601`, which bundles PostgreSQL 17 with VectorChord, vchord_bm25, and pg_tokenizer.

## Building from Source

```sh
git clone https://github.com/helixml/kodit.git
cd kodit
make tools           # Install development tools
make download-model  # Download the built-in embedding model
make build           # Build the binary
./bin/kodit version
./bin/kodit serve
```

Run the tests:

```sh
make test                         # All tests
make test PKG=./internal/foo/...  # Specific package
make check                        # Format, vet, lint, and test
```

## Troubleshooting

**MCP connection error after restart:** If you see `No valid session ID provided` after restarting the Kodit server, reload the MCP client in your assistant. MCP sessions do not survive server restarts.

**No search results:** Check that indexing has completed by calling `GET /api/v1/repositories/{id}/status`. If status shows errors, check the server logs with `LOG_LEVEL=DEBUG`.

**Enrichments not generating:** Enrichments require an LLM provider. Check that `ENRICHMENT_ENDPOINT_BASE_URL` and `ENRICHMENT_ENDPOINT_MODEL` are set.
Without these, Kodit indexes and searches code but does not generate AI documentation.

## Telemetry

Kodit collects limited anonymous telemetry (usage metadata only, no user data) to guide development. Disable it with:

```sh
DISABLE_TELEMETRY=true
```

## Commercial Support

[Helix](https://helix.ml) provides a managed platform built on Kodit with additional features including a management UI, repository browsing, team collaboration, and hosted infrastructure.

For commercial support or enterprise integration, contact [founders@helix.ml](mailto:founders@helix.ml).

## Contributing

See [CONTRIBUTING.md](.github/CONTRIBUTING.md) for guidelines.

## License

[Apache 2.0](./LICENSE)