# SkillOpt: Executive Strategy for Self-Evolving Agent Skills *Train agent skills like you train neural networks — with epochs, (mini-)batchsize, learning rates, and validation gates — but without touching model weights.* [](https://microsoft.github.io/SkillOpt/) [](https://arxiv.org/abs/2605.23904) [](https://youtu.be/JUBMDTCiM0M) [](https://pypi.org/project/skillopt/) [](https://www.python.org/) [](LICENSE) --- ## News 🔥🔥🔥 - **[2026-06-03]** 🎉 **[gbrain](https://github.com/garrytan/gbrain), [gbrain-evals](https://github.com/garrytan/gbrain-evals/blob/main/docs/benchmarks/2026-06-03-skillopt.md), and [darwin-skill](https://github.com/alchaincyf/darwin-skill) have all integrated SkillOpt.** - **[2026-06-02]** 🎉 **SkillOpt [v0.1.0](https://github.com/microsoft/SkillOpt/releases/tag/v0.1.0) is now available on [PyPI](https://pypi.org/project/skillopt/)!** Install with `pip install skillopt`. This initial release includes the full training loop (rollout → reflect → aggregate → select → update → evaluate), multi-backend support (OpenAI / Azure / Claude / Qwen / MiniMax), six built-in benchmarks, and WebUI dashboard. --- ## Overview Modern agent skills are usually hand-crafted, generated one-shot by a strong LLM, or evolved through loosely controlled self-revision — none of which behaves like a deep-learning optimizer for the skill itself, and none of which reliably improves over its starting point under feedback. **SkillOpt treats the skill document as the trainable state of a frozen agent**, and trains it with the discipline that makes weight-space optimization reproducible. A separate optimizer model turns scored rollouts into bounded add / delete / replace edits on a single skill document; a candidate edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, a rejected-edit buffer, and an epoch-wise slow / meta update make skill training stable while adding **zero inference-time model calls** at deployment. The deployed artifact is a compact `best_skill.md` (typically 300–2,000 tokens) that runs against the unchanged target model. Across **six benchmarks, seven target models, and three execution harnesses** (direct chat, Codex CLI, Claude Code CLI), SkillOpt is best or tied-best on **all 52 evaluated (model, benchmark, harness) cells** and on GPT-5.5 lifts the average no-skill accuracy by **+23.5 points in direct chat, +24.8 inside the Codex agentic loop, and +19.1 inside Claude Code**. Optimized skill artifacts transfer across model scales, between Codex and Claude Code harnesses, and to nearby benchmarks without further optimization. For the full method, ablations, and per-cell results see the [paper](https://arxiv.org/abs/2605.23904); for a visual walkthrough of the loop see the [project page](https://microsoft.github.io/SkillOpt/); for deeper API / backend / benchmark docs see [`docs/`](docs/). ## 🎬 Demo Video https://github.com/user-attachments/assets/eb12d3bc-371c-467f-904d-91b61f339ed7
▶ Watch the full demo on YouTube
--- ## Install ### Requirements - Python 3.10+ ### Option A: Install from PyPI ```bash pip install skillopt # With optional extras: pip install skillopt[alfworld] # ALFWorld benchmark pip install skillopt[webui] # Gradio monitoring dashboard pip install skillopt[claude] # Claude model backend ``` ### Option B: Install from source (for development) ```bash git clone https://github.com/microsoft/SkillOpt.git cd SkillOpt pip install -e . # For the ALFWorld benchmark (optional): pip install -e ".[alfworld]" alfworld-download ``` ### Configure API Credentials ```bash cp .env.example .env # Edit .env with your API credentials, then: source .env ``` #### Azure OpenAI *(recommended)* ```bash export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/" # Option 1: API key auth export AZURE_OPENAI_API_KEY="your-key" # Option 2: Azure CLI auth (no API key needed) export AZURE_OPENAI_AUTH_MODE="azure_cli" ``` > **Note:** `AZURE_OPENAI_ENDPOINT` is required for all three modes (`api_key`, `azure_cli`, `openai_compatible`). Without it, all LLM calls will fail. #### OpenAI-compatible endpoints ```bash export AZURE_OPENAI_ENDPOINT="https://api.openai.com/v1" export AZURE_OPENAI_API_KEY="sk-..." export AZURE_OPENAI_AUTH_MODE="openai_compatible" ``` This routes all calls through the plain OpenAI Python client (no Azure auth, no `api-version` header). > **Note:** SkillOpt reuses the `AZURE_OPENAI_*` env var names even in this mode — there is no separate `OPENAI_API_KEY` knob. #### Anthropic Claude ```bash export ANTHROPIC_API_KEY="sk-ant-..." ``` #### Qwen *(local vLLM)* ```bash export QWEN_CHAT_BASE_URL="http://localhost:8000/v1" export QWEN_CHAT_MODEL="Qwen/Qwen3.5-4B" ``` `qwen_chat` can also be used as the optimizer backend. When optimizer and target should point to different local vLLM services, use the role-specific settings: ```bash python scripts/train.py \ --config configs/searchqa/default.yaml \ --optimizer_backend qwen_chat \ --target_backend qwen_chat \ --optimizer_model Qwen/Qwen3.5-4B \ --target_model Qwen/Qwen3.5-4B \ --optimizer_qwen_chat_base_url http://localhost:8001/v1 \ --target_qwen_chat_base_url http://localhost:8000/v1 ``` #### MiniMax ```bash export MINIMAX_BASE_URL="https://api.minimax.io/v1" export MINIMAX_API_KEY="..." export MINIMAX_MODEL="MiniMax-M2.7" ``` --- ## Quick Start ### Training ```bash # Minimal example — train on SearchQA: python scripts/train.py \ --config configs/searchqa/default.yaml \ --split_dir /path/to/your/searchqa_split \ --azure_openai_endpoint https://your-resource.openai.azure.com/ \ --optimizer_model gpt-5.5 \ --target_model gpt-5.5 # Train on LiveMathematicianBench: python scripts/train.py \ --config configs/livemathematicianbench/default.yaml \ --split_dir /path/to/your/livemath_split \ --azure_openai_endpoint https://your-resource.openai.azure.com/ \ --optimizer_model gpt-5.5 \ --target_model gpt-5.5 # Train on ALFWorld: python scripts/train.py \ --config configs/alfworld/default.yaml \ --split_dir data/alfworld_path_split \ --azure_openai_endpoint https://your-resource.openai.azure.com/ \ --optimizer_model gpt-5.5 \ --target_model gpt-5.5 ``` Key CLI arguments: | Argument | Description | Example | |---|---|---| | `--config` | Benchmark config YAML | `configs/searchqa/default.yaml` | | `--split_dir` | Path to data split directory | `/path/to/split` | | `--azure_openai_endpoint` | Azure OpenAI endpoint URL | `https://your-resource.openai.azure.com/` | | `--optimizer_model` | Optimizer model deployment name | `gpt-5.5` | | `--target_model` | Target model deployment name | `gpt-5.5` | | `--num_epochs` | Number of training epochs | `4` | | `--batch_size` | Batch size per step | `40` | | `--workers` | Parallel rollout workers | `8` | | `--out_root` | Output directory | `outputs/my_run` | ### Eval Only Evaluate a trained skill on specific data splits without training: ```bash # Evaluate the packaged GPT-5.5 SearchQA skill on the test split: python scripts/eval_only.py \ --config configs/searchqa/default.yaml \ --skill ckpt/searchqa/gpt5.5_skill.md \ --split valid_unseen \ --split_dir /path/to/searchqa_split \ --azure_openai_endpoint https://your-resource.openai.azure.com/ # Evaluate on all splits (train + val + test): python scripts/eval_only.py \ --config configs/searchqa/default.yaml \ --skill ckpt/searchqa/gpt5.5_skill.md \ --split all \ --split_dir /path/to/searchqa_split \ --azure_openai_endpoint https://your-resource.openai.azure.com/ ``` To evaluate a skill produced by your own training run, replace `--skill` with that run's best-skill path, for example `outputs/my_run/best_skill.md`. | Split | Description | |---|---| | `valid_unseen` | Test set | | `valid_seen` | Validation set | | `train` | Training set | | `all` | All splits combined (default) | ### Output Structure Each training run writes to a structured output directory: ``` outputs/