# Walkthrough: building a custom task

A detailed, step-by-step guide for preparing your own dataset and running SIA against it. For the short version, see the [Bring your own task](../README.md#3-bring-your-own-task) section of the README.

## Step 1: Set up the task directory

Create the layout SIA expects:

```bash
mkdir -p my-tasks/gpqa/{data/public,data/private,reference}
```

### Add your dataset and task description

Place dataset files in the appropriate folders:

```bash
# Public inputs: the agent is allowed to see these
cp questions.json my-tasks/gpqa/data/public/

# Private answers / ground truths: held out from the agent
cp answers.json my-tasks/gpqa/data/private/
```

> **Note:** The LLM is **not** told about `data/private/` during evaluation. This prevents the agent from cheating and ensures fair scoring.

Write the task description in `my-tasks/gpqa/data/public/task.md`. SIA's meta-agent reads this file to understand what to build.

### Copy the reference agent template

From a clone of this repo:

```bash
cp sia/tasks/_shared/reference_target_agent.py my-tasks/gpqa/reference/
```

### (Optional) Add sample task descriptions

Create `my-tasks/gpqa/reference/SAMPLE_TASK_DESCRIPTIONS.md` with examples of similar tasks. This helps the meta-agent generalize and reduces overfitting to the exact phrasing of `task.md`.

## Step 2: Run the orchestrator

External custom task:

```bash
sia run --task_dir ./my-tasks/gpqa --max_gen 5 --run_id 1
```

Bundled task (for comparison):

```bash
sia run --task gpqa --max_gen 5 --run_id 1
```

With a meta agent on OpenHands + Gemini (author `./profiles/gemini-meta.json` with `"agent_impl": "openhands"`, `"model": "gemini/gemini-3.1-pro-preview"`, `"provider_id": "gemini"`):

```bash
sia run \
  --task_dir ./my-tasks/gpqa \
  --max_gen 5 \
  --run_id 1 \
  --meta-agent-profile gemini-meta
```

See [configuration.md](configuration.md) for the full profile/provider schema and more examples.
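The profile file mentioned in Step 2 can be authored by hand, but a small script makes it reproducible. The sketch below is only an illustration: `write_profile` is a hypothetical helper (not part of SIA), and the dictionary contains only the three keys named in this walkthrough. Whether the schema requires further keys is not covered here, so treat [configuration.md](configuration.md) as authoritative.

```python
import json
from pathlib import Path

# Only these three keys are named in this walkthrough. Assumption: the real
# schema may require more; configuration.md is the authoritative reference.
PROFILE = {
    "agent_impl": "openhands",
    "model": "gemini/gemini-3.1-pro-preview",
    "provider_id": "gemini",
}


def write_profile(path, profile=PROFILE):
    """Hypothetical helper: write a profile dict as indented JSON.

    Creates the parent directory if it does not exist and returns the
    path that was written.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(profile, indent=2) + "\n", encoding="utf-8")
    return path
```

Calling `write_profile("profiles/gemini-meta.json")` from the directory where you invoke `sia run` would produce the file referenced above as `./profiles/gemini-meta.json`.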
## Step 3: Analyze results

```bash
# View execution logs for a generation
cat runs/run_1/gen_1/agent_execution.json

# View improvements the feedback agent proposed
cat runs/run_1/gen_2/improvement.md

# Diff successive agent versions
diff runs/run_1/gen_1/target_agent.py runs/run_1/gen_2/target_agent.py
```

Or browse it all in the web dashboard:

```bash
sia web
# → http://127.0.0.1:8000
```

The dashboard also auto-starts during `sia run`, so you can watch generations land live (disable with `--no-web`).

## Task directory requirements

Every task directory, bundled or custom, must look like this:

```
{task-id}/
├── data/
│   ├── public/
│   │   ├── task.md                  # Task description (orchestrator reads this)
│   │   ├── train.csv
│   │   ├── test.csv
│   │   └── sample_submission.csv
│   └── private/
│       └── ...                      # Held-out evaluation data
└── reference/
    ├── SAMPLE_TASK_DESCRIPTIONS.md  # Similar tasks (for meta-agent context)
    └── reference_target_agent.py    # Template agent structure
```

## Preparing an MLE-Bench task

The `prepare_mlebench_dataset.py` script automates the steps above for any MLE-Bench competition. First install the extras (mle-bench is not on PyPI):

```bash
pip install 'sia-agent[mlebench]'
pip install git+https://github.com/openai/mle-bench
export KAGGLE_USERNAME="..." KAGGLE_KEY="..."   # mle-bench downloads via the Kaggle API
export GEMINI_API_KEY="..."                     # optional; required only without --skip-gemini
```

Kaggle credentials come from your account's API token (Kaggle → Account → Create New Token); the downloaded `kaggle.json` can also live at `~/.kaggle/kaggle.json` instead of env vars. Accept the competition's rules on Kaggle first or `mlebench prepare` will fail to download it.

Then run:

```bash
python -m sia.prepare_mlebench_dataset -c "spaceship-titanic"
```

This will:

1. Run `mlebench prepare -c "spaceship-titanic"`
2. Copy public and private datasets from `~/.cache/mle-bench/data/prepared/`
3. Rename `description.md` → `task.md` in `data/public/`
4. Use Gemini to generate similar task descriptions (optional)
5. Create `SAMPLE_TASK_DESCRIPTIONS.md` in `reference/`
6. Copy `reference_target_agent.py` from `_shared/` into `reference/`

**Options:**

- `--skip-gemini`: Skip the Gemini API call for similar tasks
- `--tasks-dir PATH`: Custom tasks directory (default: `./tasks`)
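Whether a task directory was built by hand or by the preparation script, it can be useful to confirm it matches the layout under "Task directory requirements" before starting a run. The following is a minimal sketch, not part of SIA: `check_task_dir` is a hypothetical helper. It assumes that only `task.md` and `reference_target_agent.py` are strictly required, since the walkthrough treats `SAMPLE_TASK_DESCRIPTIONS.md` as optional and dataset file names vary by task.

```python
from pathlib import Path

# Directories taken from the layout tree, relative to the task root.
REQUIRED_DIRS = ("data/public", "data/private", "reference")

# Assumption: these two files are the minimum; dataset files differ per task
# and SAMPLE_TASK_DESCRIPTIONS.md is described as optional.
REQUIRED_FILES = (
    "data/public/task.md",
    "reference/reference_target_agent.py",
)


def check_task_dir(task_dir):
    """Hypothetical helper: return a list of layout problems.

    An empty list means every directory and file checked here exists.
    """
    root = Path(task_dir)
    problems = []
    for rel in REQUIRED_DIRS:
        if not (root / rel).is_dir():
            problems.append(f"missing directory: {rel}/")
    for rel in REQUIRED_FILES:
        if not (root / rel).is_file():
            problems.append(f"missing file: {rel}")
    return problems
```

For example, `check_task_dir("my-tasks/gpqa")` would return an empty list once Step 1 has been completed, and a list of the missing paths otherwise.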