---
name: bulk-inference
description: "Runs bulk VLM inference via vLLM, OpenAI, or Gemini. Async parallel with resume and JSONL append. Use for 'run inference', 'bulk inference', '추론 실행'."
model: sonnet
---

# Bulk Inference

## Purpose

Execute bulk VLM inference across multiple providers (vLLM local, OpenAI, Gemini) using [scripts/inference_runner.py](scripts/inference_runner.py). Handles JSONL input/output, resume from interruption, and concurrent async requests.

## Prerequisites

- Input JSONL file with at minimum: an image path field, a question/prompt field, and one or more ID fields.
- For `vllm_local`: running vLLM server(s) — use `/vllm-serve` first.
- For `openai`: `OPENAI_API_KEY` env var set.
- For `gemini`: `GOOGLE_API_KEY` env var set.

## Process

1. **Gather parameters from user**:
   - `--provider`: `vllm_local`, `openai`, or `gemini`
   - `--endpoints`: server URLs (`vllm_local`) or API base URL
   - `--model-id`: HF model name or API model ID
   - `--input`: path to input JSONL
   - `--output`: path for output JSONL
   - `--n-concurrent`: requests per endpoint (`vllm_local`) or total (API), default 6
   - `--max-tokens`: default 100
   - `--temperature`: default 0.0
   - Optional: `--api-key-env`, `--reasoning-effort`, `--thinking-budget`, `--rate-limit-delay`
   - Optional: `--image-field`, `--question-field`, `--id-fields`, `--prompt-template`
2. **Validate inputs** — Confirm the input JSONL exists and is readable. Check provider-specific requirements (API keys, server health). A sketch of these checks appears under the examples at the end of this file.
3. **Run inference**:

   ```bash
   python scripts/inference_runner.py \
     --provider {provider} \
     --endpoints {urls} \
     --model-id {model_id} \
     --input {input_jsonl} \
     --output {output_jsonl} \
     --n-concurrent {n} \
     --max-tokens {max_tokens} \
     --temperature {temp} \
     [--api-key-env {env_var}] \
     [--reasoning-effort {effort}] \
     [--thinking-budget {budget}] \
     [--rate-limit-delay {delay}] \
     [--no-resume] \
     [--image-field {field}] \
     [--question-field {field}] \
     [--id-fields {f1},{f2}] \
     [--prompt-template "Answer the question..."]
   ```

4. **Monitor output** — The script prints a tqdm progress bar and a final summary with total, success, errors, and throughput.
5. **Report results** — After completion, report: output file path, total processed, success rate, error count.

## Input JSONL Format

Each line is a JSON object. Required fields are configurable via `--image-field`, `--question-field`, `--id-fields`. Defaults:

- `image_path` — path to image file
- `question_string` — prompt/question text
- `triplet_id`, `condition` — composite ID for resume

A worked example of building such a file appears under the examples below.

## Output JSONL Format

Each output line preserves ALL original input fields plus:

```json
{"...original fields...", "model": "...", "raw_response": "...", "parsed_answer": "...", "error": null}
```

A snippet for summarizing these records appears under the examples below.

## Rules

- Resume is ON by default — interrupted runs continue from where they stopped (the skip mechanism is sketched in the examples below).
- Never modify the input JSONL file.
- Append mode: output JSONL is opened in append mode, one line per completed item.
- All errors are captured per-item; the runner does not abort on individual failures.
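
## Example: Pre-flight Checks (Sketch)

A minimal sketch of the validation in step 2, assuming the standard `GET /health` route of vLLM's OpenAI-compatible server; the function names here (`check_input`, `check_api_key`, `check_vllm`) are illustrative and not part of `scripts/inference_runner.py`.

```python
import os
import urllib.request


def check_input(path: str) -> None:
    """Raise if the input JSONL is missing or unreadable."""
    with open(path, encoding="utf-8") as f:
        f.readline()


def check_api_key(env_var: str) -> None:
    """Raise if the provider's API key env var (e.g. OPENAI_API_KEY) is unset."""
    if not os.environ.get(env_var):
        raise RuntimeError(f"{env_var} is not set")


def check_vllm(endpoint: str, timeout: float = 5.0) -> None:
    """Ping a vLLM server's /health route (assumes a default deployment)."""
    with urllib.request.urlopen(f"{endpoint.rstrip('/')}/health", timeout=timeout) as resp:
        if resp.status != 200:
            raise RuntimeError(f"{endpoint} unhealthy: HTTP {resp.status}")
```

Running these before a long job means a dead endpoint or missing key fails in seconds rather than minutes into the run.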
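
## Example: Building an Input JSONL

A minimal sketch of a valid input file using the documented default field names; the paths, questions, and IDs are invented for illustration.

```python
import json

# Two rows with the default fields: image_path, question_string,
# and the composite ID pair triplet_id + condition used for resume.
rows = [
    {
        "image_path": "data/images/0001.png",
        "question_string": "What color is the car?",
        "triplet_id": "0001",
        "condition": "baseline",
    },
    {
        "image_path": "data/images/0002.png",
        "question_string": "How many people are visible?",
        "triplet_id": "0002",
        "condition": "baseline",
    },
]

with open("input.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

If your data uses different field names, pass `--image-field`, `--question-field`, and `--id-fields` rather than renaming columns.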
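
## Example: Summarizing an Output JSONL

A sketch of the step 5 report, derived only from the documented output schema (`error` is `null` on success); the file name `output.jsonl` is a placeholder.

```python
import json

total = errors = 0
with open("output.jsonl", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        record = json.loads(line)
        total += 1
        if record.get("error") is not None:
            errors += 1

success = total - errors
rate = 100.0 * success / total if total else 0.0
print(f"total={total} success={success} ({rate:.1f}%) errors={errors}")
```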
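
## Example: How Resume Skipping Works (Sketch)

An illustration of the resume mechanism implied by `--id-fields`: collect the composite IDs already written to the output, then skip matching input rows. This sketches the idea only; it is not the actual logic inside `scripts/inference_runner.py`.

```python
import json
import os

ID_FIELDS = ("triplet_id", "condition")  # the documented defaults


def completed_ids(output_path: str) -> set:
    """Composite IDs of items already appended to the output JSONL."""
    done = set()
    if os.path.exists(output_path):
        with open(output_path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    rec = json.loads(line)
                    done.add(tuple(rec.get(k) for k in ID_FIELDS))
    return done


def pending_rows(input_path: str, output_path: str):
    """Yield input rows whose composite ID is not yet in the output."""
    done = completed_ids(output_path)
    with open(input_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            row = json.loads(line)
            if tuple(row.get(k) for k in ID_FIELDS) not in done:
                yield row
```

Because the output is append-only and keyed this way, an interrupted run can be restarted with the same command and only unfinished items are sent.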