--- layout: '@/layouts/Doc.astro' title: "🧊 Cooling Down Checkpoints: Best Practices for Model Evaluation" date: 2025-11-12 description: 'Best practices for cooling down model checkpoints before evaluation to improve validation loss comparisons.' --- Sam Foreman 2025-11-12 - [📉 Simple Experiment to Compare Validation Loss](#chart_with_downwards_trend-simple-experiment-to-compare-validation-loss) - [☃️ Cooling Down](#snowman_with_snow-cooling-down) - [♻️ Convert to Universal (Optional)](#recycle-convert-to-universal-optional) - [📄 W&B Report](#page_facing_up-wb-report) ## 📉 Simple Experiment to Compare Validation Loss
Cool Down Comparison
Cool Down Comparison
## ☃️ Cooling Down - 256 Nodes of Aurora: - Cooled down over last 10%: - W&B Run: [volcanic-blaze-4312](https://wandb.ai/aurora_gpt/AuroraGPT/runs/7bjj8vgu/overview?nw=nwuserforemans) - Explicit command: ``` bash ROPE_THETA=50000 \ GRAD_ACC_STEPS=2 \ MICRO_BATCH=1 \ USE_ACTIVATION_CHECKPOINTING=0 \ ZERO_STAGE=0 \ TRAIN_TOKENS=4673780159710 \ OPT=sophiag \ DATA_FILE_LIST=ALCF/data-lists/aurora/olmo-mix-1124.txt \ LR_DECAY_STYLE=constant \ LOAD=cooldown-checkpoints/sophiag-global-step-73500/global_step73500 \ bash train_alcf.sh \ --no-load-lr-state \ --lr_constant_plus_cooldown \ --lr_constant_plus_cooldown_frac 0.10 ``` ## ♻️ Convert to Universal (Optional) ```bash TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1 python3 ALCF/ds_to_universal.py \ --input_folder test_rollback/global_step136000 \ --output_folder test_rollback/global_step136000_universal ``` ## 📄 W&B Report
<iframe loading="lazy" src="https://api.wandb.ai/links/aurora_gpt/dek99dmd" align="center" frameborder="0" webkitallowfullscreen allowfullscreen style="border:none;height:1024px;width:100%"> </iframe>