--- name: data-science-agent-loop description: >- Scaffolds short-lived R or Python scripts that call Ollama Cloud /api/chat with tool definitions, batch rows or records per request, optionally parallelize chunk calls (e.g. furrr), and apply tool results on the main process with JSONL audits—using a folder-local .env. Use when the user wants an agent loop for data cleaning, enrichment, geospatial routing, dataset compilation, batched tool calls, parallel Ollama chunks, or teaching patterns aligned with 10_data_management/fixer (not full Plumber/FastAPI agents unless requested). --- # Data science agent loop (short-lived scripts) Teaches a **repeatable pattern** for **single-folder**, **disposable** data-science loops: one main script (optional `functions.R` or `helpers.py`), **Ollama Cloud** tool calling, **batching**, optional **parallel chunk HTTP**, and **`.env` next to the script**. ## When to use this skill - **Use:** One runnable script (plus optional shared helpers) that loads data, batches prompts, gets **`tool_calls`**, dispatches tools, writes **CSV/GeoJSON + JSONL audit**. - **Do not default to:** Multi-turn “assistant products” ([`10_data_management/agentr/`](../../../10_data_management/agentr), [`10_data_management/agentpy/`](../../../10_data_management/agentpy)) unless the user wants that scope (Plumber, guardrails, many tools, long loops). ## Mandatory guided workflow (run before writing code) Walk the user through these **five choices**; do not skip silently. | Step | What to clarify | |------|-----------------| | **(1) Data / task** | Input paths, output paths, **stable row/entity keys**, immutable raw vs **working copy**, and **audit** strategy (append **JSONL** per tool application). | | **(2) Draft prompts** | Review **system + user** text: column semantics, **anti-hallucination** rules, **missing-value** conventions (e.g. empty string vs literal `NA` text), and whether the model should reply **tool-first / minimal prose**. Tighten vague instructions. | | **(3) Ollama Cloud model** | Offer **clear options** (see below); record choice in **`.env`** as `OLLAMA_MODEL`. If **`tool_calls` are empty or wrong**, try a **different tool-capable** model or smaller batches ([fixer README](../../../10_data_management/fixer/README.md)). | | **(4) Batch size** | Rows (or records) **per** `/api/chat`. Larger batches → fewer requests but **more context** and risk of **dropped / malformed** tool calls. Start conservative; tune via env (e.g. `ROWS_PER_BATCH`). | | **(5) Tools** | **Minimal** tool set; strict JSON schemas; decide what the **model** chooses vs what **code** computes (e.g. **sf** distances). Use **optimistic concurrency** where useful (`expected_old_value` pattern in [`fixer_csv.R`](../../../10_data_management/fixer/fixer_csv.R)). | ### Ollama Cloud model options (typical) Present **tradeoffs**, then let the user pick (store in `.env`). 1. **`nemotron-3-nano:30b-cloud`** — Default in [`fixer/.env.example`](../../../10_data_management/fixer/.env.example); **faster / cheaper**; verify tool quality on the task. 2. **`gpt-oss:120b`** — Stronger reasoning; used in fixer smoke path ([`testme.R`](../../../10_data_management/fixer/testme.R)); **slower / heavier**. 3. **Another Cloud tag** — Any model the user’s account exposes that supports **tools**; confirm on [Ollama](https://ollama.com) if unsure. 4. **Fallback strategy** — If tools never fire: **smaller batch**, **stronger model**, or **simpler tool schemas** (see troubleshooting in [fixer README](../../../10_data_management/fixer/README.md)). ## Course repo examples (canonical) **Shared helpers:** [`10_data_management/fixer/functions.R`](../../../10_data_management/fixer/functions.R) — **`ollama_chat_once`**, **`parse_function_arguments`**, **`split_df_into_row_chunks`**, **`truncate_tool_output`**. All four fixer drivers **`source()`** this file. **Run order and artifacts:** See [`fixer/README.md`](../../../10_data_management/fixer/README.md) (steps 2–5: CSV → parcels → POIs → spatial context). | Script | What it does | Primary tools (examples) | Main inputs → outputs | |--------|----------------|---------------------------|------------------------| | [`fixer_csv.R`](../../../10_data_management/fixer/fixer_csv.R) | Batched tabular repair; **`DATA_QUALITY_BLURB`** in-script (no per-row notes column); **`readr::format_csv`** chunks | **`set_cell`**, **`write_checkpoint`** | `data/messy_inventory_raw.csv` → `output/messy_inventory_working.csv`, `output/fix_audit.jsonl` | | [`fixer_parcels.R`](../../../10_data_management/fixer/fixer_parcels.R) | Non-overlapping **polygon** parcels (**`wkt`**, WGS84); maps via **sf** + **ggplot2** | **`record_parcel_zoning`** | `data/parcels_zoning_raw.csv` → `output/parcels_enriched.csv`, `output/parcels_enrich_audit.jsonl`, `map_parcels_*.png` | | [`fixer_pois.R`](../../../10_data_management/fixer/fixer_pois.R) | **Point** POIs (**`x`**, **`y`**, WGS84); maps | **`record_poi_category`** | `data/pois_messy_raw.csv` → `output/pois_enriched.csv`, `output/pois_enrich_audit.jsonl`, `map_pois_*.png` | | [`fixer_spatial_context.R`](../../../10_data_management/fixer/fixer_spatial_context.R) | **After** parcels + POIs: LLM **routes** which spatial tools to call; **no** geometry in the model — **sf** computes distances/counts (metric CRS **EPSG:32617** in-repo) | **`nearest_poi`**, **`count_pois_within`**, **`record_context_note`** | Default `output/parcels_enriched.csv` + `output/pois_enriched.csv` → `output/parcels_context_enriched.csv`, `output/context_routing_audit.jsonl`, `output/map_parcels_context_transport.png`. Override inputs with **`FIXER_CONTEXT_PARCELS`**, **`FIXER_CONTEXT_POIS`**. Demo grid: **24** parcels / **24** POIs → **3** chunk requests when **`ROWS_PER_BATCH=10`**. | **Extended** multi-turn agents (heavier): [`agentr/R/loop.R`](../../../10_data_management/agentr/R/loop.R), [`agentpy/app/loop.py`](../../../10_data_management/agentpy/app/loop.py). **Env template:** [`fixer/.env.example`](../../../10_data_management/fixer/.env.example) — **`OLLAMA_API_KEY`**, **`OLLAMA_HOST`**, **`OLLAMA_MODEL`**; optional **`ROWS_PER_BATCH`**, **`FIXER_CHUNK_WORKERS`**, **`FIXER_MAX_OUTPUT_TOKENS`** (digits only; omit if Cloud returns HTTP 400 on `num_predict`). ## Related project skills (Cursor) When students or the agent **author or extend** fixer-style R in this repo: - **[tidyverse-elegant-r](../tidyverse-elegant-r/SKILL.md)** + **[tidyverse_elegant.mdc](../../rules/tidyverse_elegant.mdc)** — `=`, native `|>`, explicit `library()`, **httr2**, dplyr 1.1+ (**`join_by`**, **`.by`**), vectorized table logic. - **[console-message](../console-message/SKILL.md)** — progress UX for scripts: section dividers, paths, row counts, previews (aligns with the **`cat()`**-style sections in the fixer drivers). ## Architecture (fixer-style) ```mermaid flowchart LR subgraph config [Folder config] env[.env OLLAMA_* BATCH_*] end subgraph prep [Prepare] load[Load table or paths] chunk[Split into batches] end subgraph parallel [Parallel optional] map[Map chunks to API] end subgraph api [Ollama Cloud] chat["/api/chat tools"] end subgraph main [Main process] apply[Dispatch tools mutate state] audit[Append JSONL audit] write[Write outputs] end env --> load load --> chunk chunk --> map map --> chat chat --> apply apply --> audit apply --> write ``` **Rule:** Workers return **parsed `tool_calls` (or errors)**; **mutate shared tables / files on the main process** so state stays coherent (see `call_chunk_ollama` + sequential dispatch in [`fixer_csv.R`](../../../10_data_management/fixer/fixer_csv.R)). ## Standard project layout - `my_task.R` or `run_loop.py` — entrypoint. - Optional `functions.R` / `helpers.py` — HTTP, parsing, chunking (no secrets in code). - `data/` — immutable inputs; `output/` — working CSV + audits. - **`.env` in the same folder** as the script(s) the student runs; commit **`.env.example`** only (no keys). ## Implementation patterns ### R (this repo) - Load env: `if (file.exists(".env")) readRenviron(".env")` (path relative to the task folder). - HTTP: **`httr2`** — `request()` → `req_body_json()` → `req_perform()`; `resp_body_json(..., simplifyVector = FALSE)` so **`message$tool_calls`** stays structured. - Optional parallelism: **`future`** + **`furrr::future_map`** over chunk indices; **`plan(multisession, workers = …)`**; cap workers if Cloud returns **429/500** (sequential fallback like `FIXER_CHUNK_WORKERS=1` in [fixer README](../../../10_data_management/fixer/README.md)). - **R style:** Follow **[tidyverse-elegant-r](../tidyverse-elegant-r/SKILL.md)** and **[tidyverse_elegant.mdc](../../rules/tidyverse_elegant.mdc)** (`=`, native `|>`, explicit `library()`, vectorized table logic, `join_by` / `.by` where appropriate). For polished **`cat()`** progress (dividers, paths, `nrow`), see **[console-message](../console-message/SKILL.md)**. ### Python - Mirror with **`httpx`** (see [`agentpy/app/`](../../../10_data_management/agentpy/app/)): POST JSON to `{OLLAMA_HOST}/api/chat`, `stream: false`, `tools: [...]`. - Batch with **`concurrent.futures.ThreadPoolExecutor`** or async patterns; keep **one writer** for DataFrame / files. - Prefer **pandas** or **Polars** chains for table logic; avoid row-at-a-time Python loops for expressible column work. ### Tool schemas (portable gotchas) - Ollama expects OpenAI-style **`tools`** with `type: "function"` and **`function.parameters`** as a JSON Schema object. - In R, **empty** `properties` must serialize as **`{}`**, not `[]` — use `properties = structure(list(), names = character(0))` (see [fixer README](../../../10_data_management/fixer/README.md) troubleshooting). ## Safety and operations - **Never log** raw `OLLAMA_API_KEY`; mask in console (e.g. first/last chars only), as in fixer scripts. - Redact **Bearer** tokens and obvious secret patterns in any file logs the student adds. - **HTTP 400** with `num_predict` / options: omit `options.num_predict` unless needed (fixer uses optional `FIXER_MAX_OUTPUT_TOKENS`). - **Rate limits:** reduce parallel workers or batch size. ## More use cases (beyond fixer) - **Survey / form coding** — Open text → **closed codes** via tools that only allow a controlled set. - **Entity resolution** — Model proposes **`merge_group_id`** / **`canonical_name`** against stable keys; code merges. - **Schema mapping** — Messy headers → target field names + types; JSONL audit per mapping decision. - **Unit / currency normalization** — Mixed strings → canonical units in structured columns. - **Time range extraction** — Free text → ISO intervals or explicit “unknown”. - **Multi-table stitching** — Model suggests join keys; code performs joins and checks row counts. - **Literature / metadata tagging** — Title/abstract batches → controlled vocabulary via tools. - **Quality gates** — Model **flags** bad rows; second pass or export **quarantine** CSV. - **Raster / vector pairing** — Model **selects** which zonal stat / buffer tool to call; **`terra`** / **`sf`** compute values. ## Additional resources - **[reference.md](reference.md)** — `.env` snippet and minimal **R / Python** tool-definition skeletons. - **Hands-on:** [`ACTIVITY_fixer_csv.md`](../../../10_data_management/ACTIVITY_fixer_csv.md), [`ACTIVITY_fixer_spatial.md`](../../../10_data_management/ACTIVITY_fixer_spatial.md) (both link back to these skills). - **R style:** [tidyverse-elegant-r](../tidyverse-elegant-r/SKILL.md), [tidyverse_elegant.mdc](../../rules/tidyverse_elegant.mdc). **Console UX:** [console-message](../console-message/SKILL.md).