---
layout: '@/layouts/Doc.astro'
title: 'AuroraGPT: Training Foundation Models on Supercomputers'
date: '2025-12-16'
location: 'ANL'
---

Sam Foreman
2025-12-16

- [🧰 AuroraGPT: Toolbox](#toolbox-auroragpt-toolbox)
- [👥 Team Leads](#busts_in_silhouette-team-leads)
- [🤝 Teams](#handshake-teams)
- [🏋️ Challenges](#weight_lifting-challenges)
- [💾 AuroraGPT: Training](#floppy_disk-auroragpt-training)
- [🍹 AuroraGPT: Blending Data, Efficiently](#tropical_drink-auroragpt-blending-data-efficiently)
- [📉 Training AuroraGPT-7B on 2T Tokens](#chart_with_downwards_trend-training-auroragpt-7b-on-2t-tokens)
- [📉 Training AuroraGPT-2B on 7T Tokens](#chart_with_downwards_trend-training-auroragpt-2b-on-7t-tokens)
- [✨ Features](#sparkles-features)
- [✨ Features (even more!)](#sparkles-features-even-more)
- [🧬 MProt-DPO](#dna-mprot-dpo)
- [🧬 Scaling Results (2024)](#dna-scaling-results-2024)
- [🧬 MProt-DPO: Scaling Results](#dna-mprot-dpo-scaling-results)
- [🚂 Loooooooooong Sequence Lengths](#steam_locomotive-loooooooooong-sequence-lengths)
- [🌎 AERIS (2025)](#earth_americas-aeris-2025)
- [👀 High-Level Overview of AERIS](#eyes-high-level-overview-of-aeris)
- [➕ Contributions](#heavy_plus_sign-contributions)
- [⚠️ Issues with the Deterministic Approach](#warning-issues-with-the-deterministic-approach)
- [🎲 Transitioning to a Probabilistic Model](#game_die-transitioning-to-a-probabilistic-model)
- [🌀 Sequence-Window-Pipeline Parallelism `SWiPe`](#cyclone-sequence-window-pipeline-parallelism-swipe)
- [🚀 AERIS: Scaling Results](#rocket-aeris-scaling-results)
- [🌪️ Hurricane Laura](#tornado-hurricane-laura)
- [📓 References](#notebook-references)
- [❤️ Acknowledgements](#heart-acknowledgements)
- [Extras](#extras)

## 🧰 AuroraGPT: Toolbox

- **Datasets and data pipelines** (how do we deal with scientific data?)
- **Software infrastructure and workflows** (scalable, robust, extensible)
- **Evaluation of state-of-the-art LLM Models** (how do they perform on scientific tasks?)
> [!TIP] 🍋 ezpz
>
> [saforem2/ezpz](https://github.com/saforem2/ezpz)
>
> Write once, run anywhere

> [!NOTE] 🚂 Training
>
> [argonne-lcf/Megatron-DeepSpeed](https://github.com/argonne-lcf/Megatron-DeepSpeed)
>
> For the largest of large language models

> [!IMPORTANT] 🏃‍♂️ Running
>
> [argonne-lcf/inference-endpoints](https://github.com/argonne-lcf/inference-endpoints)
>
> Inference endpoints for LLMs, hosted @ ALCF
## 👥 Team Leads
- **Planning**: Rick Stevens, Ian Foster, Rinku Gupta, Mike Papka, Arvind Ramanathan, Fangfang Xia
- **Data**: Ian Foster, Robert Underwood
- **Training**: Venkat Vishwanath, Sam Foreman
- **Evaluation**: Franck Cappello, Sandeep Madireddy, Bo Li
- **Post**: Eliu Huerta, Azton Wells
- **Inference**: Rajeev Thakur
- **Comms**: Charlie Catlett, David Martin
- **Distribution**: Brad Ullrich
## 🤝 Teams
- **Planning**
- **Data Prep**
  - Accumulate 20+ T tokens of high-quality scientific text and structured data
- **Models / Training** [^1]
  - Train (entirely from scratch) a series of models on publicly available data
- **Evaluation**
  - Skills, trustworthiness, safety, robustness, privacy, machine ethics
- **Post-Training**
  - Fine-tuning, alignment
- **Inference**
  - Model serving, API development / public-facing web services
- **Distribution**
  - Licensing, generating and distributing artifacts for public consumption
- **Communication**
## 🏋️ Challenges

This is _incredibly_ difficult in practice, due in part to:

- Brand new hardware, architecture, software
- Lack of native support in existing frameworks (though getting better!)
- General system stability
  +10k Nodes $\left(\times \frac{12\,\,\mathrm{XPU}}{1\,\,\mathrm{Node}}\right)\Rightarrow$ +**100k** XPUs
  - network performance
  - file system stability (impacted by _other users_!)
  - _many_ unexpected difficulties occur at increasingly large scales
- Combinatorial explosion of possible configurations and experiments
  - \{hyperparameters, architectures, tokenizers, learning rates, …\}

## 💾 AuroraGPT: Training

- To train a fixed model on trillions of tokens requires:
  1. **Aggregating** data from multiple different _corpora_ (e.g. ArXiv, Reddit, StackExchange, GitHub, Wikipedia, etc.)
  2. **Sampling** _each training batch_ according to a fixed distribution across corpora
  3. **Building** indices that map batches of tokens into these files (indexing)
The original implementation was _slow_:

- Designed to run _serially_ on a **single device**
- **Major bottleneck** when debugging data pipeline at scale
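The sampling and indexing steps above can be pictured as building a blended index: a list of `(corpus, sample)` pairs whose running mix tracks the target distribution across corpora. The following is a minimal serial sketch, assuming a greedy "largest deficit" interleaving. The function name `build_blend_index` and the example weights are hypothetical, and this is not the AuroraGPT implementation.

```python
def build_blend_index(weights, num_samples):
    """Interleave corpora so the running mix tracks the target weights.

    Returns a list of (corpus_id, sample_id_within_corpus) pairs.
    Hypothetical helper for illustration only.
    """
    total = sum(weights)
    probs = [w / total for w in weights]  # normalize to a distribution
    counts = [0] * len(probs)             # samples drawn so far, per corpus
    index = []
    for i in range(num_samples):
        # How far each corpus lags behind its target share after i + 1 draws.
        deficits = [p * (i + 1) - c for p, c in zip(probs, counts)]
        corpus = max(range(len(probs)), key=deficits.__getitem__)
        index.append((corpus, counts[corpus]))
        counts[corpus] += 1
    return index


if __name__ == "__main__":
    # Example: three corpora with a 50/30/20 target mix (made-up weights).
    for corpus_id, sample_id in build_blend_index([0.5, 0.3, 0.2], 10):
        print(corpus_id, sample_id)
```

The loop is inherently sequential in this form, which is one way to see why a serial, single-device index build becomes a bottleneck at trillions of tokens.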
## 🍹 AuroraGPT: Blending Data, Efficiently
- 🐢 Original implementation:
  - **Slow** (serial, single device)
  - ~1 hr / 2T tokens
- 🐇 New implementation:
  - **Fast!** (distributed, asynchronous)
  - ~**2 min** / 2T tokens (**30x** faster!!)
Figure 1: Time spent preparing 2T tokens
## 📉 Training AuroraGPT-7B on 2T Tokens

## 📉 Training AuroraGPT-2B on 7T Tokens
Reverse Diffusion Process

Forward Diffusion Process ($\pi \rightarrow \mathcal{N}$)
## 🌀 Sequence-Window-Pipeline Parallelism `SWiPe`
- `SWiPe` is a **novel parallelism strategy** for Swin-based Transformers
- Hybrid 3D Parallelism strategy, combining:
  - Sequence parallelism (`SP`)
  - Window parallelism (`WP`)
  - Pipeline parallelism (`PP`)
Figure 10
Figure 11: `SWiPe` Communication Patterns
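To make the idea of a hybrid 3D strategy concrete, the sketch below shows one generic way to lay out ranks on a `PP × WP × SP` grid and to find the ranks that share a sequence-parallel group. This is an illustration only. The axis ordering (`SP` fastest-varying, `PP` slowest) and the helper names `rank_to_coord` and `sp_group` are assumptions, not the layout used by `SWiPe` or AERIS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridCoord:
    pp: int  # pipeline-parallel stage
    wp: int  # window-parallel index
    sp: int  # sequence-parallel index


def rank_to_coord(rank, pp_size, wp_size, sp_size):
    """Map a flat rank to (pp, wp, sp); SP is assumed to vary fastest."""
    world_size = pp_size * wp_size * sp_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside world of size {world_size}")
    pp, rem = divmod(rank, wp_size * sp_size)
    wp, sp = divmod(rem, sp_size)
    return GridCoord(pp=pp, wp=wp, sp=sp)


def sp_group(rank, pp_size, wp_size, sp_size):
    """Ranks sharing this rank's (pp, wp) coordinates, i.e. its SP group."""
    c = rank_to_coord(rank, pp_size, wp_size, sp_size)
    base = (c.pp * wp_size + c.wp) * sp_size
    return list(range(base, base + sp_size))


if __name__ == "__main__":
    # Example: 2 pipeline stages x 3 window groups x 4 sequence shards.
    print(rank_to_coord(13, 2, 3, 4))
    print(sp_group(13, 2, 3, 4))
```

Placing the most communication-heavy axis on adjacent ranks is a common design choice in hybrid schemes. Which axis that is for `SWiPe` is not stated here.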
## 🚀 AERIS: Scaling Results
Figure 12: AERIS: Scaling Results
- **10 EFLOPS** (sustained) @ **120,960 GPUs**
- See Hatanpää et al. (2025) for additional details
  - [arXiv:2509.13523](https://arxiv.org/abs/2509.13523)
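As a rough sanity check on these figures, the arithmetic below uses only numbers quoted in this document. It treats 10 EFLOPS as an approximate sustained rate and assumes each GPU counted here corresponds to one of the 12 XPUs per node quoted under Challenges:

$$
\frac{120{,}960\ \text{GPUs}}{12\ \text{GPUs/node}} = 10{,}080\ \text{nodes},
\qquad
\frac{10 \times 10^{18}\ \text{FLOP/s}}{120{,}960\ \text{GPUs}} \approx 8.3 \times 10^{13}\ \text{FLOP/s} \approx 83\ \text{TFLOP/s per GPU}
$$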
## 🌪️ Hurricane Laura
Figure 13: Hurricane Laura tracks (top) and intensity (bottom). Forecasts initialized (a) 7, (b) 5, and (c) 3 days prior to 2020-08-28T00Z.
## 📓 References
Dharuman, Gautham, Kyle Hippe, Alexander Brace, et al. 2024. “MProt-DPO: Breaking the ExaFLOPS Barrier for Multimodal Protein Design Workflows with Direct Preference Optimization.” _Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis_ (Atlanta, GA, USA), SC ’24. [https://doi.org/10.1109/SC41406.2024.00013](https://doi.org/10.1109/SC41406.2024.00013).
Hatanpää, Väinö, Eugene Ku, Jason Stock, et al. 2025. _AERIS: Argonne Earth Systems Model for Reliable and Skillful Predictions_. [https://arxiv.org/abs/2509.13523](https://arxiv.org/abs/2509.13523).
Price, Ilan, Alvaro Sanchez-Gonzalez, Ferran Alet, et al. 2024. _GenCast: Diffusion-Based Ensemble Forecasting for Medium-Range Weather_. [https://arxiv.org/abs/2312.15796](https://arxiv.org/abs/2312.15796).
Song, Shuaiwen Leon, Bonnie Kruft, Minjia Zhang, et al. 2023. _DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery Through Sophisticated AI System Technologies_. [https://arxiv.org/abs/2310.04610](https://arxiv.org/abs/2310.04610).
## ❤️ Acknowledgements

> This research used resources of the Argonne Leadership Computing
> Facility, which is a DOE Office of Science User Facility supported
> under Contract DE-AC02-06CH11357.

## Extras

[^1]: Co-led by: Venkat Vishwanath, Sam Foreman

[^2]: Implemented by Marieme Ngom

[^3]: Relative to PDE-based models, e.g.: [GFS](https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs)

[^4]: [GenCast: A Generative Model for Medium-Range Global Weather Forecasting](https://arxiv.org/html/2312.15796v1) (Price et al. 2024)

[^5]: Demonstrated on up to 120,960 GPUs on Aurora and 8,064 GPUs on LUMI.