# Xberg
Extract text, metadata, transcripts, and code intelligence from 96 file formats and 306 programming languages at native speeds without needing a GPU.
## What and Why?
Xberg is a document-intelligence framework with a Rust core and native bindings for 16 languages. It turns documents, images, audio, and source code into clean, structured text — extracting tables, metadata, transcripts, and code intelligence from 96 file formats and 306 programming languages.
Modern AI and RAG pipelines need fast, reliable extraction without a GPU or a stack of heavyweight dependencies. Xberg delivers that from a single Rust core: SIMD-accelerated parsing, pure-Rust PDF, streaming for multi-GB files, and consistent output across every binding. Run it as a library, CLI, REST API, or MCP server.
OCR (Tesseract, PaddleOCR, EasyOCR, and VLM across 143 vision providers), Whisper audio/video transcription, chunking, language detection, embeddings, and structured LLM extraction are all built in.
### Features
| Feature | Description |
| ------- | ----------- |
| **96 file formats** | PDF, Office, images, HTML/XML, email, archives, and academic formats across 8 categories |
| **306 languages** | Code intelligence — functions, classes, imports, symbols, docstrings — via tree-sitter |
| **Polyglot** | Native bindings for Rust, Python, Node.js, WebAssembly, Ruby, Go, Java, Kotlin, C#, PHP, Elixir, R, Dart, Swift, Zig, and C |
| **OCR** | Tesseract (incl. WASM), PaddleOCR, EasyOCR, and VLM OCR across 143 vision providers — extensible via plugins |
| **Transcription** | Whisper ONNX transcripts for MP3, M4A, WAV, WebM, and MP4 audio tracks |
| **LLM intelligence** | Structured JSON extraction, embeddings, and VLM OCR through [liter-llm](https://github.com/xberg-io/liter-llm), including local engines |
| **Deployment** | Use as a library, CLI tool, REST API server, or MCP server |
| **High performance** | Rust core with pure-Rust PDF, SIMD optimizations, full parallelism, and streaming for multi-GB files |
| **Token-efficient output** | TOON wire format uses ~30–50% fewer tokens than JSON for LLM/RAG pipelines |
| **Extensible** | Plugin system for custom OCR backends, validators, post-processors, extractors, and renderers |
### Supported Formats
96 file formats across 8 categories — Office documents, images (OCR-enabled), web and structured data, email, archives, academic, and audio/video — plus code intelligence for 306 programming languages. See the [format reference](https://docs.xberg.io/reference/formats/) for the complete list.
⭐ Star this repo to show your support — it helps others discover Xberg.
## Quick Start
### Language Packages
Python
```sh
pip install xberg
```
```sh
uv add xberg
```
See [Python README](https://github.com/xberg-io/xberg/tree/main/packages/python) for full documentation.
Node.js
```sh
npm install @xberg/node
```
See [Node.js README](https://github.com/xberg-io/xberg/tree/main/crates/xberg-node) for full documentation.
Rust
```sh
cargo add xberg
```
See [Rust README](https://github.com/xberg-io/xberg/tree/main/crates/xberg) for full documentation.
Go
```sh
go get github.com/xberg-io/xberg
```
See [Go README](https://github.com/xberg-io/xberg/tree/main/packages/go) for full documentation.
Java
Available on Maven Central as `io.xberg:xberg`. See [Java README](https://github.com/xberg-io/xberg/tree/main/packages/java) for the dependency snippet and current version.
C#
```sh
dotnet add package Xberg
```
See [C# README](https://github.com/xberg-io/xberg/tree/main/packages/csharp) for full documentation.
Ruby
```sh
gem install xberg
```
See [Ruby README](https://github.com/xberg-io/xberg/tree/main/packages/ruby) for full documentation.
PHP
```sh
composer require xberg-io/xberg
```
See [PHP README](https://github.com/xberg-io/xberg/tree/main/packages/php) for full documentation.
Elixir
Add `{:xberg, "~> 5.0"}` to your `mix.exs` dependencies. See [Elixir README](https://github.com/xberg-io/xberg/tree/main/packages/elixir) for full documentation.
WebAssembly
```sh
npm install @xberg/wasm
```
See [WebAssembly README](https://github.com/xberg-io/xberg/tree/main/crates/xberg-wasm) for full documentation.
R
Install from r-universe. See [R README](https://github.com/xberg-io/xberg/tree/main/packages/r) for full documentation.
Kotlin (Android)
Available on Maven Central as `io.xberg:xberg-android`. See [Kotlin README](https://github.com/xberg-io/xberg/tree/main/packages/kotlin-android) for the dependency snippet and current version.
Swift
Add via Swift Package Manager. See [Swift README](https://github.com/xberg-io/xberg/tree/main/packages/swift) for full documentation.
Dart / Flutter
```sh
dart pub add xberg
```
See [Dart README](https://github.com/xberg-io/xberg/tree/main/packages/dart) for full documentation.
Zig
Add via `zig fetch`. See [Zig README](https://github.com/xberg-io/xberg/tree/main/packages/zig) for full documentation.
C/C++ (FFI)
Build from source as part of this workspace. See [C (FFI) README](https://github.com/xberg-io/xberg/tree/main/crates/xberg-ffi) for full documentation.
CLI
```sh
brew install xberg-io/tap/xberg
```
See [CLI usage](https://docs.xberg.io/cli/usage/) for full documentation.
Docker
```sh
docker pull ghcr.io/xberg-io/xberg:latest
```
See [Docker guide](https://docs.xberg.io/guides/docker/) for API, CLI, and MCP server modes.
MCP Server
Run Xberg as a [Model Context Protocol](https://modelcontextprotocol.io/) server. The prebuilt
binaries (Homebrew, `install.sh`, Docker) include it; from source, enable the `mcp` feature.
```sh
# Prebuilt (Homebrew / install.sh / Docker) — MCP is included
brew install xberg-io/tap/xberg
xberg mcp # stdio (default)
# From source — enable the mcp feature
cargo install xberg-cli --features mcp
xberg mcp
# HTTP transport instead of stdio
xberg mcp --transport http --host 127.0.0.1 --port 8001
```
Add it to an MCP client (Claude Desktop `claude_desktop_config.json`, Cursor `.cursor/mcp.json`):
```json
{
"mcpServers": {
"xberg": { "command": "xberg", "args": ["mcp"] }
}
}
```
See the [MCP integration guide](https://docs.xberg.io/guides/mcp-integration/) for tools,
resources, prompts, HTTP transport, and configuration.
### AI Coding Assistants
Install the Xberg plugin from the [`xberg-io/plugins`](https://github.com/xberg-io/plugins) marketplace. It ships the Xberg agent skills (extraction APIs, OCR backends, configuration, language conventions) and works with every major coding agent — expand your harness below.
Claude Code
```text
/plugin marketplace add xberg-io/plugins
/plugin install xberg@xberg
```
Codex CLI
```text
/plugins add https://github.com/xberg-io/plugins
```
Then search for `xberg` and select **Install Plugin**.
Cursor
Settings → Plugins → Add from URL → `https://github.com/xberg-io/plugins`, then select **xberg**.
Gemini CLI
```text
gemini extensions install https://github.com/xberg-io/plugins
```
Factory Droid
```text
droid plugin marketplace add https://github.com/xberg-io/plugins
droid plugin install xberg@xberg
```
GitHub Copilot CLI
```text
copilot plugin marketplace add https://github.com/xberg-io/plugins
copilot plugin install xberg@xberg
```
opencode
Add the package to `opencode.json`:
```json
{
"$schema": "https://opencode.ai/config.json",
"plugin": ["@xberg/opencode-xberg"]
}
```
## Documentation
Full guides, API references for every binding, and the complete format and configuration reference live at **[xberg.io](https://xberg.io/)**. Try it in the browser with the [live demo](https://docs.xberg.io/demo.html).
## Contributing
Contributions are welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
Join our [Discord community](https://discord.gg/xt9WY3GnKR) for questions and discussion.
## Part of Xberg.dev
- [Xberg Enterprise](https://github.com/xberg-io/xberg-enterprise) — managed extraction API with SDKs, dashboards, and observability.
- [crawlberg](https://github.com/xberg-io/crawlberg) — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
- [html-to-markdown](https://github.com/xberg-io/html-to-markdown) — fast, lossless HTML→Markdown engine.
- [liter-llm](https://github.com/xberg-io/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
- [tree-sitter-language-pack](https://github.com/xberg-io/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
- [alef](https://github.com/xberg-io/alef) — the polyglot binding generator that produces every per-language binding across the 5 polyglot repos.
## License
MIT License (MIT) — see [LICENSE](LICENSE) for details. See [the MIT License](https://www.opensource.org/licenses/MIT) for the full text.