# Reproducing the Training Run

This document is a **complete, self-contained recipe** for reproducing the embedding fine-tuning training run from scratch on a fresh machine: environment, dataset downloads, model serving, the exact training command, and what to expect.

This is a LoRA fine-tune of `Qwen/Qwen3-VL-Embedding-2B` for visual document retrieval, with **ViT LoRA + text warmup + hard negatives**. On the `miniv8` test set (400 SimpleQA questions, 7426 candidate tiles) it reaches a peak **QA score ≈ 0.785** (vs. ~0.715–0.730 for the untrained base model).

Original W&B run:

---

## Released model (best checkpoint)

The trained LoRA adapters are published at [`Chrisyichuan/wiki-screenshot-embedding-lora`](https://huggingface.co/Chrisyichuan/wiki-screenshot-embedding-lora). You don't need to retrain to use the model: load the adapter on top of the base embedding model.

- **Base model:** `Qwen/Qwen3-VL-Embedding-2B`
- **Best checkpoint:** [`lora_vit/ckpt200`](https://huggingface.co/Chrisyichuan/wiki-screenshot-embedding-lora/tree/main/lora_vit/ckpt200), the ViT-LoRA run (`--lora-vit`) at step 200, our best overall checkpoint.

Each checkpoint folder is a standard PEFT adapter (`adapter_config.json` + `adapter_model.safetensors`, ~102 MB). Load it with PEFT (see the model card's [Usage](https://huggingface.co/Chrisyichuan/wiki-screenshot-embedding-lora#usage) section):

```python
from peft import PeftModel
from transformers import AutoModel

base = AutoModel.from_pretrained("Qwen/Qwen3-VL-Embedding-2B")

# Best checkpoint: ViT-LoRA, step 200
model = PeftModel.from_pretrained(
    base,
    "Chrisyichuan/wiki-screenshot-embedding-lora",
    subfolder="lora_vit/ckpt200",
)
```

The repo also ships other checkpoints (`lora_vit/ckpt{100,150,200,250,300}`, plus alternative `dora_ls005/*` and `hyper3/*` configs); point `subfolder` at any of them.

---

## ⚠️ Before you start: two API keys

You need **two API keys** to fully reproduce this run. Get them ready first:

1. **OpenAI API key** (`OPENAI_API_KEY`): **required for evaluation.** During each eval step, the model retrieves images and a vLLM "reader" answers each test question; those answers are then **graded against the test set's gold answers by an OpenAI model** (`gpt-4.1-2025-04-14`). This grade *is* the headline **QA score**. Without a working key the QA score is silently **0** (the grader swallows errors), so the run looks broken even though training is fine.

   > Some keys are region-locked. If you get a 401 saying "make your request to
   > us.api.openai.com", set `OPENAI_BASE_URL=https://us.api.openai.com/v1`.

2. **W&B API key** (`WANDB_API_KEY`): **required to log the curve online** and reproduce the dashboard above. Get it at . If you don't care about online logging, run with `WANDB_MODE=offline` instead (metrics still land in local `eval_step*.jsonl`).

The OpenAI key is first used when training reaches the first eval step, and the W&B key at launch, but set both **before** you start so a multi-hour run isn't wasted.

---

## 0. What you need (and when)

| Resource | Needed for | When |
|---|---|---|
| 1× GPU (≥40 GB, e.g. H100/A100) for **training** | the fine-tune | whole run |
| 1× GPU for **vLLM** (the QA "reader", `Qwen3-VL-4B-Instruct`) | QA eval at each `--test-eval-steps` | from first eval step |
| **OpenAI API key** (`gpt-4.1` grader) | grading reader answers in QA eval | from first eval step |
| **W&B API key** *(optional)* | online loss/metric curves | start of run (else use offline) |
| ~**95 GB free disk** for images, ideally **fast/local** storage | dataset images | whole run |
| ~200 GB scratch during download+extract | tar shards + extracted images | setup only |

> ⚠️ **The OpenAI key and the vLLM endpoint are only used during *evaluation*.**
> If neither is available, training still runs, but the QA score will be 0/blank.
> The grader silently returns 0 on any error (including a bad key / wrong base URL),
> so **verify the key works before launching** (see §5).


> ⚠️ **HF token (optional but recommended):** unauthenticated HF downloads are
> rate-limited and slow. `export HF_TOKEN=hf_...` before downloading the ~93 GB
> image dataset for higher throughput.

---

## 1. Environment

Pinned versions are **mandatory**; mismatches cause silent numerical drift:

| Package | Version |
|---|---|
| PyTorch | 2.9.1+cu129 |
| cuDNN | 9.20.0.48 |
| transformers | 4.57.1 |

These are locked in `pyproject.toml` + `uv.lock`. Install with [uv](https://docs.astral.sh/uv/):

```bash
cd train
uv sync   # creates .venv with the exact locked versions
```

Always run training/eval via `uv run` so the locked env is used.

---

## 2. Download the datasets

Three datasets are required. Pick a data root on a large disk:

```bash
export DATA_ROOT=/big/disk/visrag/data
mkdir -p "$DATA_ROOT"
```

### 2a. Training data: `screenshot-training-natural-filtered-v2` (~93.5 GB)

104K train / 5.8K eval / 5.8K test query–image pairs with 2 hard negatives each, plus 1000 tar-sharded image archives.

```bash
hf download Chrisyichuan/screenshot-training-natural-filtered-v2 \
  --repo-type dataset \
  --local-dir "$DATA_ROOT/screenshot-training-natural-filtered-v2"
```

This gives `train_hn.jsonl`, `eval_hn.jsonl`, `test_hn.jsonl` at the root and `image_shards/shard_000.tar … shard_999.tar`.

> 💡 **Cleaner alternative:**
> [`Chrisyichuan/screenshot-training-natural-filtered-4o-40k`](https://huggingface.co/datasets/Chrisyichuan/screenshot-training-natural-filtered-4o-40k)
> is a smaller (~40K) variant whose hard negatives were filtered with a stronger
> model, giving a cleaner hard-negative signal. Feel free to try it in place of
> `screenshot-training-natural-filtered-v2`.

### 2b. Test set: `test_miniv8` (~2 GB, lives in the `screenshot-training` repo)

400 SimpleQA questions + 7426 candidate tiles, used for retrieval (R@1/R@3) and QA-score eval.

```bash
hf download Chrisyichuan/screenshot-training \
  --repo-type dataset --include "test_miniv8/*" \
  --local-dir "$DATA_ROOT/screenshot-training"
```

### 2c. Text-warmup data: `text-qa-pair` (~1.8 GB, text only)

Text query→passage pairs with hard negatives, used for the 50-step text warmup. Already in the `chunk_*/filtered_hn.jsonl` layout the trainer expects.

```bash
hf download Chrisyichuan/text-qa-pair \
  --repo-type dataset \
  --local-dir "$DATA_ROOT/text-qa-pair"
```

---

## 3. Extract images

The JSONL rows reference images by relative path `images/shard_XXX/...`, resolved **relative to the JSONL file's directory**. So the images must end up in an `images/` directory next to the JSONL files.

```bash
# Training images (1000 shards → images/). SLOW on networked filesystems;
# extract to fast/local storage. ~200K small PNGs.
python "$DATA_ROOT/screenshot-training-natural-filtered-v2/extract_hf_image_shards.py" \
  --dataset-dir "$DATA_ROOT/screenshot-training-natural-filtered-v2"

# Test tiles
cd "$DATA_ROOT/screenshot-training/test_miniv8"
mkdir -p tiles && tar xf tiles.tar -C tiles
```

> **Performance note:** extracting/reading hundreds of thousands of tiny PNGs over
> NFS is extremely slow. Extract `images/` onto local SSD or RAM-disk (`/dev/shm`)
> if available, and `ln -s` it back into the dataset dir so the relative paths
> resolve.

---

## 4. Serve the vLLM reader

The QA eval retrieves images, then asks `Qwen3-VL-4B-Instruct` to answer each question from the retrieved image (the "reader"). Serve it on a **separate GPU** from training. Use the pinned serving env in `../serving/vllm/`:

```bash
cd ../serving/vllm
uv sync
# Replace <gpu-id> with the index of the GPU reserved for the reader.
# Leaving CUDA_VISIBLE_DEVICES empty hides every GPU from the process.
CUDA_VISIBLE_DEVICES=<gpu-id> uv run vllm serve Qwen/Qwen3-VL-4B-Instruct \
  --dtype auto --port 8200 --max-model-len 65536 \
  --gpu-memory-utilization 0.8 --api-key dummy

# verify (the server was started with --api-key, so send the same key):
curl -s -H "Authorization: Bearer dummy" http://localhost:8200/v1/models
```

---

## 5. API keys

```bash
# Grader (QA scoring). The grader uses gpt-4.1-2025-04-14.
export OPENAI_API_KEY=sk-...

# Use the host your key requires. Some keys are region-locked and return 401 on the
# default host with "make your request to us.api.openai.com"; in that case use:
export OPENAI_BASE_URL=https://us.api.openai.com/v1   # or https://api.openai.com/v1, or your gateway

# Optional: online W&B curves matching the original run
export WANDB_API_KEY=...   # else: export WANDB_MODE=offline
```

Sanity-check the grader before a long run:

```bash
uv run python - <<'PY'
import os, openai
c = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"],
                  base_url=os.environ.get("OPENAI_BASE_URL"))
print(c.chat.completions.create(model="gpt-4.1-2025-04-14",
      messages=[{"role":"user","content":"reply CORRECT"}]).choices[0].message.content)
PY
```

---

## 6. Run training

The exact training command (adjust the paths to your `$DATA_ROOT`, and set `$OUTPUT_DIR` to the directory where checkpoints and eval files should be written):

```bash
cd train
# Replace <gpu-id> with the index of the training GPU (not the vLLM one).
CUDA_VISIBLE_DEVICES=<gpu-id> uv run python train_contrastors.py \
  --data-split-dir "$DATA_ROOT/screenshot-training-natural-filtered-v2" \
  --text-warmup-steps 50 \
  --text-data-dir "$DATA_ROOT/text-qa-pair" \
  --test-data "$DATA_ROOT/screenshot-training/test_miniv8/test_miniv8.json" \
  --max-steps 350 \
  --batch-size 64 \
  --grad-cache-chunk 4 \
  --num-hard-negatives 2 \
  --lr 7e-6 \
  --warmup-steps 20 \
  --scheduler cosine \
  --test-batch-size 16 \
  --eval-steps 25 \
  --test-eval-steps 50 \
  --save-steps 50 \
  --max-num-visual-tokens 4096 \
  --lora-vit \
  --simpleqa-max-examples 1000 \
  --vllm-url http://localhost:8200/v1 \
  --vllm-model Qwen/Qwen3-VL-4B-Instruct \
  --wandb-run-name v8r \
  --output-dir "$OUTPUT_DIR/v8_r_warmup50_lr7e6_lora_vit_350"
```

What the flags mean (key ones):

- `--lora-vit`: apply LoRA to the ViT vision encoder too (the single biggest win).
- `--text-warmup-steps 50` + `--text-data-dir`: 50 steps of text-only contrastive warmup before image training (hard switch).
- `--num-hard-negatives 2`: the dataset has exactly 2 mined hard negatives per row.
- `--batch-size 64 --grad-cache-chunk 4`: GradCache keeps memory ∝ chunk, not batch.
- `--test-eval-steps 50`: full retrieval + QA eval every 50 steps (needs vLLM + grader).

**Sanity checks in the startup logs** (confirm your setup is correct before waiting hours):

- `trainable params: 25,427,968 || all params: 2,152,960,000 || trainable%: 1.1811`: this exact count means `--lora-vit` is applied (LLM + ViT + merger). Without `--lora-vit` it's ~12.8M.
- `Loaded 104033 valid pairs … train_hn.jsonl` / `Loaded 5779 … eval_hn.jsonl` (test split = 5781): confirms the training data resolved.
- `Loaded 14952 text pairs …`: confirms the text-warmup data resolved.
- `Loaded test 'miniv8': 400 questions, 7426 tiles`: confirms the test set + tiles resolved.

> **`tiles_dir` gotcha:** the trainer reads `test_miniv8.json`'s `tiles_dir` field
> **as-is, relative to the current working directory** (not relative to the JSON
> file). The shipped value is `"test_miniv8/tiles"`. Either run training from the
> directory that contains `test_miniv8/tiles`, or edit the JSON to an **absolute**
> tiles path. A wrong `tiles_dir` yields `0 tiles` and a meaningless eval.
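
The `tiles_dir` gotcha can be checked before launching. The sketch below is not part of the training code: it assumes `tiles_dir` is a top-level key of the test JSON, and the helper names `resolve_tiles_dir` and `make_tiles_dir_absolute` are hypothetical. It resolves the field against a working directory the way the gotcha describes, counts the files found there, and can optionally rewrite the field to an absolute path.

```python
import json
from pathlib import Path


def resolve_tiles_dir(test_json, cwd=None):
    """Return (resolved_dir, n_files) with tiles_dir resolved against cwd."""
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    with open(test_json) as fh:
        meta = json.load(fh)
    # Assumption: tiles_dir is a top-level key of the test JSON.
    tiles_dir = Path(meta["tiles_dir"])
    resolved = tiles_dir if tiles_dir.is_absolute() else cwd / tiles_dir
    if not resolved.is_dir():
        return resolved, 0
    # Count every regular file below the directory, whatever its extension.
    n_files = sum(1 for p in resolved.rglob("*") if p.is_file())
    return resolved, n_files


def make_tiles_dir_absolute(test_json, tiles_path):
    """Rewrite tiles_dir in place to an absolute path, keeping other fields."""
    with open(test_json) as fh:
        meta = json.load(fh)
    meta["tiles_dir"] = str(Path(tiles_path).resolve())
    with open(test_json, "w") as fh:
        json.dump(meta, fh)
    return meta["tiles_dir"]
```

Calling `resolve_tiles_dir(".../test_miniv8/test_miniv8.json")` from the directory you plan to launch training in should report a non-zero file count (7426 tiles for `miniv8`, if the tiles are stored one file per tile). A count of 0 suggests the working directory or the JSON needs adjusting.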

**Eval cache and timing details**

**Step-0 eval is slow, then partially cached:** the first eval embeds all 7426 doc tiles. The dominant cost is **CPU-side preprocessing** (PIL `Image.open` plus the Qwen3VL processor's resize / normalize / tokenize), which is single-threaded per batch and starves the GPU (you'll see GPU util mostly 0% with brief spikes). Cold cost: ~10–15 min on a dedicated GPU; longer on a shared one.

What's actually cached: the **preprocessed batch tensors** (`pixel_values`, `image_grid_thw`, `input_ids`, `attention_mask`), saved to `.tile_cache_n{N}_px{max_pixels}_bs{batch_size}.pt` next to the tiles. **This file is huge**, with `pixel_values` as the dominant payload. At `max-num-visual-tokens=1024` (max_pixels ≈ 1M pixels) the miniv8 cache is **~157 GB**; at `4096` visual tokens (max_pixels ≈ 4M pixels) it scales roughly linearly to **~600 GB**. The `torch.save` itself takes ~15–20 min at ~150 MB/s sustained write. Make sure the tiles directory lives on a volume with several hundred GB free, not on a small `$HOME` partition.

Embeddings are **not** cached: the LoRA weights change each eval, so every eval still does a fresh GPU forward over all 7426 tiles. The cache key includes `batch_size` and `max_pixels`, so changing either invalidates it.

**Measured eval breakdown on a dedicated H100, `max-num-visual-tokens=1024`, bs=16** (so cache is "only" 157 GB; 4096 visual tokens scales ~4× across the board):

| Phase | Cold (step 0) | Warm (cache hit) |
|---|---|---|
| query embed (400) | 22 s | 1 s |
| doc embed (7426 tiles) | **46 min** (preprocess + fwd + `torch.save` 157 GB) | **27 min** (`torch.load` 157 GB ≈ 18 min + GPU fwd ≈ 9 min) |
| grader (400 SimpleQA) | 2 min | 2 min |
| **total** | **49 min** | **29 min** |

Big takeaway: even with the cache hit, each eval is ~half an hour because `torch.load`ing a 157 GB pickle is itself ~18 minutes (read from NVMe at ~145 MB/s sustained, much slower than raw NVMe throughput because of pickle deserialization). At `4096` visual tokens, expect roughly 4×: `torch.load` alone takes ~70 min per eval. Budget accordingly when picking `--test-eval-steps`.
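
To make the budgeting concrete, the measured totals above can be turned into a rough estimate of total eval time for a run. This is back-of-the-envelope arithmetic, not trainer code: the function name `eval_budget_minutes` is hypothetical, and it assumes one eval at step 0 plus one at every multiple of `--test-eval-steps`, with the first eval cold and the rest warm. The `token_scale` factor stands in for the "roughly 4×" scaling at 4096 visual tokens.

```python
def eval_budget_minutes(max_steps, test_eval_steps, cold_min=49.0, warm_min=29.0,
                        token_scale=1.0, eval_at_step0=True):
    """Rough total eval wall-clock in minutes: one cold eval plus warm evals.

    cold_min / warm_min default to the measured H100 totals at 1024 visual
    tokens; token_scale=4.0 approximates 4096 visual tokens.
    """
    if test_eval_steps <= 0:
        raise ValueError("test_eval_steps must be positive")
    # Assumption: an eval at every multiple of test_eval_steps, plus step 0.
    n_evals = max_steps // test_eval_steps + (1 if eval_at_step0 else 0)
    if n_evals == 0:
        return 0, 0.0
    total = (cold_min + (n_evals - 1) * warm_min) * token_scale
    return n_evals, total


if __name__ == "__main__":
    for scale in (1.0, 4.0):
        n, minutes = eval_budget_minutes(350, 50, token_scale=scale)
        print(f"token_scale={scale}: {n} evals, ~{minutes / 60:.1f} h of eval time")
```

Under these assumptions, `--max-steps 350 --test-eval-steps 50` gives 8 evals: about 252 min at 1024 visual tokens and about 1008 min with the 4× factor. Raising `--test-eval-steps` to 100 halves the number of evals, at the cost of a coarser QA curve.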

---

## 7. What to expect

- ~350 steps, single GPU, ≈ a few seconds/step plus eval overhead.
- QA score (primary metric) climbs in a staircase and **peaks around step 150–250 at ≈ 0.785**, then may decay slightly (overfitting), so checkpoint at the peak.
- Per-eval results are written to `eval_step*.jsonl` (one file per eval step) in the output dir; QA score = fraction of rows with `correct: true`. Quick peak extraction:

  ```python
  import json, glob

  # Replace OUTPUT_DIR with the directory passed to --output-dir.
  peak = 0
  for f in sorted(glob.glob("OUTPUT_DIR/eval_step*.jsonl")):
      with open(f) as fh:
          rows = [json.loads(l) for l in fh]
      qa = sum(r.get("correct", False) for r in rows) / len(rows)
      step = int(f.split("eval_step")[1].split(".")[0])
      peak = max(peak, qa)
      print(step, round(qa, 4))
  print("peak", round(peak, 4))
  ```

- Retrieval R@1/R@3 are logged too; note R@1 is **not** monotone with QA: query embeddings can get more useful for QA even as exact-tile match rate dips.

---

## 8. Results for reference

The screenshots below are the **ideal loss / metric curves** from the run used while writing the paper. Use these as the visual reference for a healthy run: train loss should trend downward, eval loss should steadily improve, and `test/qa_score`, `test/recall@1`, and `test/recall@3` should climb in the same stair-step pattern.

![paper reference test metrics](./docs/img/v8r_reference_paper_metrics.png)

![paper reference training curves](./docs/img/v8r_reference_paper_train_curves.png)

![paper reference eval curves](./docs/img/v8r_reference_paper_eval_curves.png)

For the open-source release run, the 2× H100 loss curve is available in W&B: . If you cannot access the run, email `yichuan_wang@berkeley.edu`.

---

## 9. Troubleshooting

- **QA score is 0 / blank** → grader not reachable. Check `OPENAI_API_KEY`, `OPENAI_BASE_URL`, and that vLLM answers `curl .../v1/models`. The grader swallows errors and returns 0, so a silent 0 almost always means a key/endpoint problem.
- **`Image.open` errors / missing files** → images not fully extracted, or `images/` is not next to the JSONL. Verify a path: `ls "$DATA_ROOT/.../images/shard_812/..."`.
- **Slow startup / step-0 eval hangs** → CPU-bound tile preprocessing on first eval; with many parallel runs it can thrash. Run one at a time, or warm the tile cache first.
- **vLLM eval queue stalls** → one vLLM instance shared across many runs bottlenecks evals. Use a dedicated instance per run or stagger eval schedules.

## Reproducing the ablations

To reproduce the stairstep ablation (base → in-batch → hard negatives → text warmup → LoRA-ViT), see [`recipes/v8s_ablation.sh`](./recipes/v8s_ablation.sh): one launch command per run, each adding a single knob. Results are summarized in [`docs/v8_ablation_results.md`](./docs/v8_ablation_results.md).

For maintainer notes on training internals, hard-negative filtering, dataset packaging, and tests, see [`docs/training_dev_notes.md`](./docs/training_dev_notes.md).
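
As a supplement to the `Image.open` / missing-files item in §9, image references can be checked in bulk instead of one `ls` at a time. The sketch below is an assumption-laden helper, not part of the repo: `find_missing_images` is a hypothetical name, and because the JSONL field names are not documented here, it treats any string value starting with `images/` as an image path and resolves it relative to the JSONL file's directory.

```python
import json
from pathlib import Path


def iter_image_refs(obj):
    """Yield every string value that looks like a relative image path."""
    if isinstance(obj, str):
        if obj.startswith("images/"):
            yield obj
    elif isinstance(obj, dict):
        for value in obj.values():
            yield from iter_image_refs(value)
    elif isinstance(obj, list):
        for value in obj:
            yield from iter_image_refs(value)


def find_missing_images(jsonl_path, max_rows=None):
    """Return referenced image paths that do not exist next to the JSONL."""
    jsonl_path = Path(jsonl_path)
    base = jsonl_path.parent  # paths resolve relative to the JSONL's directory
    missing = []
    with open(jsonl_path) as fh:
        for i, line in enumerate(fh):
            if max_rows is not None and i >= max_rows:
                break
            if not line.strip():
                continue
            for rel in iter_image_refs(json.loads(line)):
                if not (base / rel).is_file():
                    missing.append(rel)
    return missing
```

Running it on `train_hn.jsonl` with a small `max_rows` first gives a quick signal. An empty list means every sampled reference resolved, while a long list usually points at incomplete extraction or a misplaced `images/` directory.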