---
layout: '@/layouts/Doc.astro'
title: Training Foundation Models on Supercomputers
date: '2025-10-24'
location: 'UIUC'
---

Sam Foreman

2025-10-24

- [🧑🏻‍💻 About Me](#adultcomputer-about-me)
- [Argonne Leadership Computing Facility (ALCF)](#argonne-leadership-computing-facility-alcf)
- [🏗️ Aurora](#building_construction-aurora)
- [🤖 ALCF AI Testbed](#robot-alcf-ai-testbed)
- [🌌 AuroraGPT (2024–)](#milky_way-auroragpt-2024)
- [🧪 AuroraGPT: Open Science Foundation Model](#test_tube-auroragpt-open-science-foundation-model)
- [🧰 AuroraGPT: Toolbox](#toolbox-auroragpt-toolbox)
- [👥 Team Leads](#busts_in_silhouette-team-leads)
- [🤝 Teams](#handshake-teams)
- [🏋️ Challenges: In Practice](#weight_lifting-challenges-in-practice)
- [💾 AuroraGPT: Training](#floppy_disk-auroragpt-training)
- [🍹 AuroraGPT: Blending Data, Efficiently](#tropical_drink-auroragpt-blending-data-efficiently)
- [📉 Training AuroraGPT-7B on 2T Tokens](#chart_with_downwards_trend-training-auroragpt-7b-on-2t-tokens)
- [📉 Training AuroraGPT-2B on 7T Tokens](#chart_with_downwards_trend-training-auroragpt-2b-on-7t-tokens)
- [✨ Features](#sparkles-features)
- [✨ Features (even more!)](#sparkles-features-even-more)
- [🧬 MProt-DPO](#dna-mprot-dpo)
- [🧬 Scaling Results (2024)](#dna-scaling-results-2024)
- [🧬 MProt-DPO: Scaling Results](#dna-mprot-dpo-scaling-results)
- [🚂 Loooooooooong Sequence Lengths](#steam_locomotive-loooooooooong-sequence-lengths)
- [🌎 AERIS (2025)](#earth_americas-aeris-2025)
- [👀 High-Level Overview of AERIS](#eyes-high-level-overview-of-aeris)
- [➕ Contributions](#heavy_plus_sign-contributions)
- [⚠️ Issues with the Deterministic Approach](#warning-issues-with-the-deterministic-approach)
- [🎲 Transitioning to a Probabilistic Model](#game_die-transitioning-to-a-probabilistic-model)
- [🌀 Sequence-Window-Pipeline Parallelism `SWiPe`](#cyclone-sequence-window-pipeline-parallelism-swipe)
- [🚀 AERIS: Scaling Results](#rocket-aeris-scaling-results)
- [🌪️ Hurricane Laura](#tornado-hurricane-laura)
- [📓 References](#notebook-references)
- [❤️ Acknowledgements](#heart-acknowledgements)
- [Extras](#extras)

## 🧑🏻‍💻 About Me
- 🏡 [samforeman.me](https://samforeman.me)
- UIUC (2015):
  - Engineering Physics + Applied Mathematics
- University of Iowa (2015–2019):
  - PhD. Physics[^1]
- ANL (2019–2022): Postdoctoral Researcher
- ANL (2022–Present): Assistant Computational Scientist
  - Member of the [AI/ML Group](https://www.alcf.anl.gov/about/people/group/506) at ALCF

Current Research:

- [AuroraGPT](https://auroragpt.anl.gov): Foundation Models for Science
- [AERIS](https://arxiv.org/abs/2509.13523): Argonne’s Earth System Model
  - Finalist for the [2025 ACM Gordon Bell Prize in Climate Modeling](https://awards.acm.org/bell-climate)
- [MProt-DPO](https://www.researchgate.net/publication/387390653_MProt-DPO_Breaking_the_ExaFLOPS_Barrier_for_Multimodal_Protein_Design_Workflows_with_Direct_Preference_Optimization): Multimodal Protein Design
  - Finalist for the [ACM Gordon Bell Prize 2024](https://sc24.supercomputing.org/2024/10/presenting-the-finalists-for-the-2024-gordon-bell-prize/)
- [GenSLMs](https://www.biorxiv.org/content/10.1101/2022.10.10.511571v2): Genome Scale Language Models.
  - Winner of the [ACM Gordon Bell Special Prize for HPC-Based COVID-19 Research](https://www.acm.org/media-center/2022/november/gordon-bell-special-prize-covid-research-2022)

## Argonne Leadership Computing Facility (ALCF)

> The ALCF enables breakthroughs in science and engineering by providing
> supercomputing resources and expertise to the research community.
> –[_alcf.anl.gov_](https://alcf.anl.gov)

Figure: Reverse diffusion process; forward diffusion process ($\pi \rightarrow \mathcal{N}$).

### 🌀 Sequence-Window-Pipeline Parallelism `SWiPe`

- `SWiPe` is a **novel parallelism strategy** for Swin-based Transformers
- Hybrid 3D Parallelism strategy, combining:
  - Sequence parallelism (`SP`)
  - Window parallelism (`WP`)
  - Pipeline parallelism (`PP`)

Figure 17
Figure 18: `SWiPe` Communication Patterns
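
To make the idea of a hybrid 3D layout concrete, here is a minimal sketch of how a flat set of ranks could be arranged on a `PP × WP × SP` grid and split into communication groups. It is an illustration only, not the AERIS implementation: the rank ordering (`SP` varying fastest), the group sizes, and the names `Coord`, `rank_to_coord`, and `build_groups` are all assumptions made for this example.

```python
from itertools import product
from typing import NamedTuple


class Coord(NamedTuple):
    pp: int  # pipeline-parallel (stage) index
    wp: int  # window-parallel index
    sp: int  # sequence-parallel index


def rank_to_coord(rank: int, pp_size: int, wp_size: int, sp_size: int) -> Coord:
    """Map a flat rank onto the 3D grid, with SP varying fastest (assumed ordering)."""
    world_size = pp_size * wp_size * sp_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside a world of size {world_size}")
    pp, rest = divmod(rank, wp_size * sp_size)
    wp, sp = divmod(rest, sp_size)
    return Coord(pp, wp, sp)


def build_groups(pp_size: int, wp_size: int, sp_size: int) -> dict[str, list[list[int]]]:
    """List the ranks in every SP, WP, and PP group of the grid."""

    def rank(pp: int, wp: int, sp: int) -> int:
        # Inverse of rank_to_coord.
        return (pp * wp_size + wp) * sp_size + sp

    pps, wps, sps = range(pp_size), range(wp_size), range(sp_size)
    return {
        # Ranks that differ only in their SP index share an SP group, and so on.
        "sp": [[rank(p, w, s) for s in sps] for p, w in product(pps, wps)],
        "wp": [[rank(p, w, s) for w in wps] for p, s in product(pps, sps)],
        "pp": [[rank(p, w, s) for p in pps] for w, s in product(wps, sps)],
    }


# Toy configuration: 2 pipeline stages x 2 window groups x 2 sequence shards = 8 ranks.
groups = build_groups(pp_size=2, wp_size=2, sp_size=2)
for name, rank_lists in groups.items():
    print(name, rank_lists)
```

In a real training job, each rank list would typically be handed to the communication library to create a process group; that step is omitted here so the sketch runs without any distributed backend.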
### 🚀 AERIS: Scaling Results
Figure 19: AERIS: Scaling Results

- **10 EFLOPs** (sustained) @ **120,960 GPUs**
- See Hatanpää et al. (2025) for additional details
  - [arXiv:2509.13523](https://arxiv.org/abs/2509.13523)

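
As a rough sanity check, dividing the sustained throughput quoted above by the GPU count gives an average per-GPU rate. The sketch below assumes the rounded 10 EFLOPs figure is exact and that work is spread evenly across devices, so treat the result as a back-of-the-envelope estimate rather than a measured number.

```python
# Back-of-the-envelope estimate from the two figures quoted above.
sustained_flops = 10e18  # 10 EFLOPs sustained (rounded value; assumed exact here)
num_gpus = 120_960       # GPU count quoted above

# Average throughput per GPU, converted to TFLOPs (1 TFLOP = 1e12 FLOPs).
per_gpu_tflops = sustained_flops / num_gpus / 1e12
print(f"Average sustained throughput: {per_gpu_tflops:.1f} TFLOPs per GPU")
```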
### 🌪️ Hurricane Laura
Figure 20: Hurricane Laura tracks (top) and intensity (bottom). Forecasts initialized 7 (a), 5 (b), and 3 (c) days prior to 2020-08-28T00Z.
## 📓 References
Dharuman, Gautham, Kyle Hippe, Alexander Brace, et al. 2024. “MProt-DPO: Breaking the ExaFLOPS Barrier for Multimodal Protein Design Workflows with Direct Preference Optimization.” _Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis_ (Atlanta, GA, USA), SC ’24. [https://doi.org/10.1109/SC41406.2024.00013](https://doi.org/10.1109/SC41406.2024.00013).
Hatanpää, Väinö, Eugene Ku, Jason Stock, et al. 2025. _AERIS: Argonne Earth Systems Model for Reliable and Skillful Predictions_. [https://arxiv.org/abs/2509.13523](https://arxiv.org/abs/2509.13523).
Price, Ilan, Alvaro Sanchez-Gonzalez, Ferran Alet, et al. 2024. _GenCast: Diffusion-Based Ensemble Forecasting for Medium-Range Weather_. [https://arxiv.org/abs/2312.15796](https://arxiv.org/abs/2312.15796).
Song, Shuaiwen Leon, Bonnie Kruft, Minjia Zhang, et al. 2023. _DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery Through Sophisticated AI System Technologies_. [https://arxiv.org/abs/2310.04610](https://arxiv.org/abs/2310.04610).
## ❤️ Acknowledgements

> This research used resources of the Argonne Leadership Computing
> Facility, which is a DOE Office of Science User Facility supported
> under Contract DE-AC02-06CH11357.

## Extras

[^1]: [A Machine Learning Approach to Lattice Gauge Theory](https://www.researchgate.net/publication/337499051_Learning_better_physics_a_machine_learning_approach_to_lattice_gauge_theory)

[^2]: 🏆 [Aurora Supercomputer Ranks Fastest for AI](https://www.intel.com/content/www/us/en/newsroom/news/intel-powered-aurora-supercomputer-breaks-exascale-barrier.html)

[^3]: Each node has 6 Intel Data Center GPU Max 1550 (code-named “Ponte Vecchio”) tiles, with 2 XPUs per tile.

[^4]: Co-led by: Venkat Vishwanath, Sam Foreman

[^5]: Implemented by Marieme Ngom

[^6]: Relative to PDE-based models, e.g.: [GFS](https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs)

[^7]: [GenCast: A Generative Model for Medium-Range Global Weather Forecasting](https://arxiv.org/html/2312.15796v1) (Price et al. 2024)

[^8]: Demonstrated on up to 120,960 GPUs on Aurora and 8,064 GPUs on LUMI.