Datus — Open-Source Data Engineering Agent

--- ## What is Datus? **Datus** is an open-source data engineering agent that builds **evolvable context** for your data system — turning natural language into accurate SQL through domain-aware reasoning, semantic search, and continuous learning. Data engineering is shifting from "building tables and pipelines" to "delivering scoped, domain-aware agents for analysts and business users." Datus makes that shift concrete. ![Datus Architecture](docs/assets/datus_architecture.svg) ## Key Features ### Build Evolvable Context, Not Static Pipelines Traditional data engineering ends at data delivery. Datus goes further — it builds a **living knowledge base** that captures schema metadata, reference SQL, semantic models, metrics, and domain knowledge into a unified context layer. This context is what makes LLM-generated SQL accurate and trustworthy, and it improves with every interaction through a continuous learning loop. → [Contextual Data Engineering](https://docs.datus.ai/getting_started/contextual_data_engineering/) ### From Exploration to Domain-Specific Agents Datus provides a complete journey for data engineers: start with a **Claude-Code-like CLI** to explore your data interactively, use [Plan Mode](https://docs.datus.ai/cli/plan_mode/) to review before executing, and build up context over time. When a domain matures, package it into a **Subagent** — a scoped chatbot with curated context, tools, and business rules — and deliver it to analysts via web, API, or MCP. → [Subagent docs](https://docs.datus.ai/subagent/introduction/) ### Metrics and Semantic Layer Go beyond raw SQL with pluggable **semantic adapters**. Define business metrics in YAML via [MetricFlow](https://docs.datus.ai/metricflow/introduction/) integration, and let Datus generate SQL from metric queries — bridging the gap between business language and database dialect. Use [Dashboard Copilot](https://docs.datus.ai/getting_started/dashboard_copilot/) to turn existing BI dashboards into conversational analytics. → [Semantic Adapters docs](https://docs.datus.ai/adapters/semantic_adapters/) ### Measure and Improve Built-in evaluation framework supporting **BIRD** and **Spider 2.0-Snow** datasets. Benchmark your agent's SQL accuracy, compare configurations, and track improvements as context evolves. → [Benchmark docs](https://docs.datus.ai/benchmark/benchmark_manual/) ### Open Platform - **10+ LLM providers** (OpenAI, Claude, Gemini, DeepSeek, Qwen, Kimi, OpenRouter, and more) with per-node model assignment — mix models within a single workflow - **11 databases** — Built-in SQLite & DuckDB, plus pluggable adapters for PostgreSQL, MySQL, Snowflake, StarRocks, ClickHouse, and more - **MCP Protocol** — Both an MCP server (exposing Datus tools to Claude Desktop, Cursor, etc.) and an MCP client (consuming external tools via `.mcp` in the CLI). → [MCP docs](https://docs.datus.ai/integration/mcp/) - **Skills** — Extend Datus with [agentskills.io](https://agentskills.io)-style packaged tools, configurable permissions, and marketplace support. → [Skills docs](https://docs.datus.ai/integration/skills/) ## Getting Started ### Install **Requirements:** Linux or macOS. Python 3.12 is installed automatically when you use the one-liner. #### One-liner (Linux / macOS) Stable install from PyPI: ```bash curl -fsSL https://raw.githubusercontent.com/datus-ai/datus-agent/main/install.sh | sh ``` This creates a dedicated venv at `~/.datus/venv`, installs `datus-agent` from PyPI into it, and drops `datus`, `datus-cli`, `datus-api`, `datus-mcp`, `datus-agent`, `datus-gateway`, and `datus-pip` shims into `~/.local/bin`. Open a new shell (or `source ~/.zshrc`) to pick up PATH, then run `datus` to launch the REPL — use `/model` to configure an LLM, `/datasource` to add a datasource, and (optionally) `/init` to generate `AGENTS.md` for the current project. To install additional Python packages into the global venv later, use `datus-pip install ` (it is a shim for `~/.datus/venv/bin/pip`). Dev install from GitHub source (picks up unreleased changes): ```bash curl -fsSL https://raw.githubusercontent.com/datus-ai/datus-agent/main/install-dev.sh | sh # or pin to a branch / tag / commit curl -fsSL https://raw.githubusercontent.com/datus-ai/datus-agent/main/install-dev.sh | DATUS_REF=feature/foo sh ``` Pin a PyPI version (stable installer only): ```bash curl -fsSL https://raw.githubusercontent.com/datus-ai/datus-agent/main/install.sh | DATUS_VERSION=0.2.6 sh ``` Other variables supported by both installers: `DATUS_HOME` (default `~/.datus`), `DATUS_BIN_DIR` (default `~/.local/bin`), `DATUS_FORCE=1` to recreate the venv, `DATUS_NO_MODIFY_PATH=1` to skip shell rc edits. #### Manual install ```bash pip install datus-agent datus ``` After the REPL starts, run `/model` to configure an LLM, `/datasource` to add a datasource, and (optionally) `/init` to generate `AGENTS.md` for the current project. For detailed guidance, see the [Quickstart Guide](https://docs.datus.ai/getting_started/Quickstart/). ### Four Ways to Use Datus | Interface | Command | Use Case | |-----------|---------|----------| | **CLI** (Interactive REPL) | `datus-cli --datasource demo` | Data engineers exploring data, building context, creating subagents | | **Web Chatbot** (Streamlit) | `datus-cli --web --datasource demo` | Analysts chatting with subagents via browser (`http://localhost:8501`) | | **API Server** (FastAPI) | `datus-api --datasource demo` | Applications consuming data services via REST (`http://localhost:8000`) | | **MCP Server** | `datus-mcp --datasource demo` | MCP-compatible clients (Claude Desktop, Cursor, etc.) | > **Tip:** Use `datus-cli --print --datasource demo` for JSON streaming to stdout — useful for piping into other tools. ## Architecture ### Workflow Engine Datus uses a configurable **node-based workflow engine**. Each workflow is a plan of nodes executed in sequence, parallel, or as sub-workflows: ```yaml workflow: plan: planA planA: - schema_linking # Find relevant tables - parallel: # Run in parallel - generate_sql # SQL generation - reasoning # Chain-of-thought reasoning - selection # Pick the best result - execute_sql # Run the query - output # Format and return ``` ### Node Types | Category | Nodes | |----------|-------| | **Core** | `schema_linking`, `generate_sql`, `execute_sql`, `reasoning`, `reflect`, `output` | | **Agentic** | `chat`, `explore`, `gen_semantic_model`, `gen_metrics`, `gen_ext_knowledge`, `gen_sql_summary`, `gen_skill`, `gen_table`, `compare` | | **Control Flow** | `parallel`, `selection`, `subworkflow` | | **Utility** | `date_parser`, `doc_search`, `fix` | ### RAG Knowledge Base The knowledge base is powered by **LanceDB** and organizes context into multiple layers: - **Schema Metadata** — Table and column descriptions, relationships - **Reference SQL** — Curated query examples with summaries - **Reference Templates** — Parameterized Jinja2 SQL templates for stable, reusable queries - **Semantic Models** — Business logic and metric definitions - **Metrics** — Executable business metrics via semantic layer integration - **External Knowledge** — Domain rules and concepts beyond raw schema - **Platform Docs** — Ingested from GitHub repos, websites, or local files Build the knowledge base with: ```bash datus-agent bootstrap-kb --datasource demo --components metadata,reference_sql,ext_knowledge ``` ## Configuration Datus is configured via `agent.yml`. Launch `datus` and use `/model` plus `/datasource` to populate it interactively, or copy [`conf/agent.yml.example`](conf/agent.yml.example) and edit it by hand. | Section | Purpose | |---------|---------| | `agent.models` | LLM provider definitions (API keys, model IDs, base URLs) | | `agent.nodes` | Per-node model assignment and tuning parameters | | `agent.services.datasources` | Database connections (SQLite, DuckDB, Snowflake, etc.) | | `agent.storage` | Embedding models, vector DB, and RAG configuration | | `agent.workflow` | Execution plans with sequential, parallel, and sub-workflow steps | | `agent.agentic_nodes` | Configuration for agentic nodes (semantic model gen, metrics gen) | | `agent.document` | Platform documentation sources (GitHub repos, websites, local files) | API keys are injected via environment variables using `${ENV_VAR}` syntax. ## Supported LLM Providers | Provider | Type | Notes | |----------|------|-------| | OpenAI | `openai` | GPT-4o, GPT-4, etc. | | Anthropic Claude | `claude` | Direct API | | Google Gemini | `gemini` | Gemini 2.0+ | | DeepSeek | `deepseek` | DeepSeek-Chat, DeepSeek-Coder | | Alibaba Qwen | `qwen` | Qwen series | | Moonshot Kimi | `kimi` | Kimi models | | MiniMax | `minimax` | MiniMax models | | GLM (Zhipu) | `glm` | GLM-4 series | | OpenAI Codex | `codex` | OAuth-based Codex models (gpt-5.3-codex, o3-codex) | | OpenRouter | `openrouter` | 300+ models via a single API key | **Embedding models:** OpenAI, Sentence-Transformers, FastEmbed, Hugging Face. Per-node model assignment lets you use different providers for different workflow steps (e.g., a cheaper model for schema linking, a stronger model for SQL generation). ## Supported Databases | Database | Type | Package | |----------|------|---------| | SQLite | `sqlite` | Built-in | | DuckDB | `duckdb` | Built-in | | PostgreSQL | `postgresql` | [`datus-postgresql`](https://github.com/Datus-ai/Datus-adapters) | | MySQL | `mysql` | [`datus-mysql`](https://github.com/Datus-ai/Datus-adapters) | | Snowflake | `snowflake` | [`datus-snowflake`](https://github.com/Datus-ai/Datus-adapters) | | StarRocks | `starrocks` | [`datus-starrocks`](https://github.com/Datus-ai/Datus-adapters) | | ClickHouse | `clickhouse` | [`datus-clickhouse`](https://github.com/Datus-ai/Datus-adapters) | | ClickZetta | `clickzetta` | [`datus-clickzetta`](https://github.com/Datus-ai/Datus-adapters) | | Hive | `hive` | [`datus-hive`](https://github.com/Datus-ai/Datus-adapters) | | Spark | `spark` | [`datus-spark`](https://github.com/Datus-ai/Datus-adapters) | | Trino | `trino` | [`datus-trino`](https://github.com/Datus-ai/Datus-adapters) | See [Database Adapters documentation](https://docs.datus.ai/adapters/db_adapters/) for details. ## How It Works ![How It Works](docs/assets/how_it_works.svg) **Explore** — Chat with your database, test queries, and ground prompts with `@table` or `@file` references. ```bash datus-cli --datasource demo /Check the top 10 banks by assets lost @table duckdb-demo.main.bank_failures ``` **Build Context** — Generate semantic models, import SQL history, define metrics. Each piece becomes reusable context for future queries. ```bash /gen_semantic_model xxx # Generate semantic model from tables /gen_sql_summary # Index SQL history for retrieval ``` **Create a Subagent** — Package mature context into a scoped, domain-aware chatbot with curated tools and business rules. ```bash .subagent add mychatbot # Create a new subagent ``` **Deliver** — Serve the subagent to analysts via web (`localhost:8501/?subagent=mychatbot`), REST API, or MCP — with feedback collection (upvotes, issue reports) built in. **Measure** — Run benchmarks against BIRD or Spider 2.0-Snow to track SQL accuracy as context evolves. **Iterate** — Analyst feedback loops back: engineers fix SQL, add rules, refine semantic models, and extend with Skills or MCP tools. The agent gets more accurate over time. → [End-to-end tutorial](https://docs.datus.ai/getting_started/contextual_data_engineering/#part-2--hands-on-tutorial-california-schools) · [CLI docs](https://docs.datus.ai/cli/introduction/) · [Knowledge Base docs](https://docs.datus.ai/knowledge_base/introduction/) · [Subagent docs](https://docs.datus.ai/subagent/introduction/) ## Development ```bash uv sync # Install dependencies uv run pytest tests/unit_tests/ -q # Run CI tests (no external deps) uv run ruff format . && uv run ruff check --fix . # Lint & format ``` Enable `--save_llm_trace` on CLI commands or set `save_llm_trace: true` per model in `agent.yml` to persist LLM inputs/outputs for debugging. → [LLM Trace docs](https://docs.datus.ai/training/llm_trace_usage/) See [CLAUDE.md](CLAUDE.md) for full development conventions, architecture patterns, and testing rules. ## License [Apache 2.0](LICENSE)