--- title: "Scientific Data Science" center: true theme: black transition: slide margin: 0.04 revealOptions: transition: 'slide' css: - css/custom.css --- # Scientific Data Science #### An Emerging Symbiosis Sam Foreman May, 2022 ::: block [](https://github.com/saforem2) [](https://samforeman.me) [](https://twitter.com/saforem2) ::: --- # Outline 1. [Issues with current scientific workflows](#Science%20Workflows) - [How ML can help](#Possibilities) 2. [Current Workflows](#Current%20Workflows) 3. [Advanced Workflows](#Advanced%20Workflows) - [MLOps](#MLOps), [Distributed Training](#Distributed%20Training), etc. 4. [Illustrative Example](#Identifying%20Phase%20Transitions%201%20https%20doi%20org%2010%201038%20nphys4035) - [Current Research / Future Plans](#Anomalous%20Magnetic%20Moment%20of%20the%20Muon) --- # Science Workflows ::: block - Increasingly dependent on computing and data analysis / statistics - **Issues**: - Science research is manually guided - This is slow! - Depends on good ideas and large effort - Publications are static - Not all research code is public - Reproducibility? - Not always clear how to proceed - Error prone and labor-intensive ::: ![](assets/science-workflow.svg) --- # šŸ¤” Current Issues - Research publications are static - Mistakes are inevitable - Reproducibility issues are common - Versioning - Which code version made which reports? - Jupyter Notebooks - Better - Still must be manually ran, cell by cell - Useful as a "playground" for: - testing ideas, - debugging code - making / fine-tuning plots - Can we do better? note: - FAIR - v1, v2.4, ..., v3-final-1, ... - Prone to human error - Dependencies evolve and become incompatible - Depends on users making - May not reflect most up-to-date changes --- # Possibilities - Better data $\longrightarrow$ better science - Dynamic resource management - Specialized hardware - AI Accelerators - **Use ML to guide experiments** - Real-time data analysis - Predictive modeling - Advanced simulations - Heterogeneous architectures - Specialized hardware - AI Testbeds note: - Good idea - Seems promising - ML Still difficult - Barrier for entry - Difficulty getting started - Hard to ensure results - No working theory - Error prone - Unpredictable --- # ML Workflow ![](assets/frontend-workflow.svg) ::: block **Idea**: Automate this as much as possible and remove aspects prone to human error ::: note: --- # Current Workflows - Try new architecture, play with hyperparameters, repeat - Largely still manually guided - ML Engineering - Large "research institutes" - DeepMind, FAIR / Meta, Nvidia, etc - Hard to compete - Access to large scale systems - Iterative refinement of huge models - Operate in parallel over grad students / postdocs ![](assets/deepmind-workflow.svg) --- # Advanced Workflows - Real-time data analysis - Used to guide / plan future experiments - Dynamic allocation of resources - Improved efficiency, less time spent idle - **More science per watts** [`[1]`](https://publications.anl.gov/anlpubs/2009/12/65724.pdf) - Possible solutions? - Scaling up - New parallelism techniques? - Hyperparameter optimization, [DeepHyper](https://deephyper.org) - ML Ops tools / libraries - šŸ¤— [huggingface](https://huggingface.co) - [Weights & Biases]((https://wandb.ai) - [openai](https://openai.com) 1. 
1. [Argonne Leadership Computing Facility • 2008 annual report](https://publications.anl.gov/anlpubs/2009/12/65724.pdf)

note:
- APS Upgrade
- Physics + ML
    - Geometric Deep Learning
        - Heavily influenced by symmetry considerations
        - Should be amenable to geometric analysis
    - Relatively easy to prototype / try new models
        - esp. on toy problems
    - Typically the refinement stage is most involved

---

::: block

> Polaris is well equipped to help move the ALCF into the exascale era of computational science by accelerating the application of AI capabilities to the growing data and simulation demands of our users. **Polaris will also provide a broader opportunity to help prototype and test the integration of HPC with _real-time experiments and sensor networks_.**
>
> — Michael E. Papka, ALCF director [`[1]`](https://www.alcf.anl.gov/news/argonne-national-laboratory-and-hewlett-packard-enterprise-prepare-exascale-era-new-testbed)

:::

::: block

1. [Argonne National Laboratory and Hewlett Packard Enterprise prepare for exascale era with new testbed supercomputer • Argonne Leadership Computing Facility](https://www.alcf.anl.gov/news/argonne-national-laboratory-and-hewlett-packard-enterprise-prepare-exascale-era-new-testbed)

:::

---

# Alternatives?

- Dynamic reports:
    - Grow and change over time (**alive**)
    - Present results **with context** seamlessly
    - Central hub for team
    - Arbitrarily customizable
    - Live-updating
- Experiment tracking:
    - [TensorBoard](https://tensorboard.org)
    - [Weights and Biases](https://wandb.ai)
    - [Neptune](https://neptune.ai/product)
    - [Comet](https://www.comet.ml/site/data-scientists/)

note:
- Empirical DL research + scientific research could be more effective with reports that are alive, with researchers adding to them
- Make sense of the countless recipes researchers have and move towards systematizing these ideas
- For the DataScience team @ ALCF:
    - Centralized hub for:
        - Storing and versioning models from multiple projects
        - Displaying real-time machine performance
    - Share ideas / models with team
        - Borrowing / fine-tuning

---

# 📊 ML Ops

**Goal**: Allow researchers to focus on their science / model development without all the boilerplate.

---

### MLOps

---

### Distributed Training

![](assets/avgGrads.svg)

---

# Distributed Training

![](assets/data-parallel.svg)

![](assets/model-parallel.svg)

_A minimal data-parallel sketch appears in the backups._

---

# Ongoing work

---

# Standard Model

- Electricity & Magnetism, Strong and Weak Interactions, elementary particles
- (Lattice) QCD:
    - Theory of the **strong** interaction between quarks and gluons
    - ❌ Analytically intractable
    - ✅ Discretize space-time onto a lattice

![](assets/nucleus.svg)

![](assets/feynman.svg)

note:
1. Background / Interesting Work
    - Why it's interesting
    - What I've learned
    - What aspects are relevant outside of domain expertise
2. Ongoing Work
    - Lattice QCD
        - Common history with HPC
        - Mutually beneficial
3. Plans for future research

---
::: block

#### Anomalous Magnetic Moment of the Muon[^quanta]

:::

$$a_{\mu} = \frac{(g_{\mu} - 2)}{2}$$

#### New physics?

::: block

[‘Last Hope’ Experiment Finds Evidence for Unknown Particles • Quanta Magazine](https://www.quantamagazine.org/last-hope-experiment-finds-evidence-for-unknown-particles-20210407/)

:::
note:
- muon g-2

---

### Muon $g-2$ from Lattice QCD

- From BNL [`[1]`](https://arxiv.org/abs/2002.12347):

::: block

$$a_{\mu}^{\mathrm{exp}} = 11659209.1(5.4)(3.3)\times 10^{-10}$$

:::

- Hunt for Beyond Standard Model (BSM) physics
- Upcoming experiments at Fermilab and J-PARC aim to reduce the uncertainty by a factor of four

::: block

1. Leading hadronic contribution to the muon magnetic moment from lattice QCD • [arXiv:2002.12347](https://arxiv.org/abs/2002.12347)

:::

note:
- Lattice QCD historically has been in lock-step with developments in HPC

---

# Contributions

- Calculate using first principles from Lattice QCD (LQCD)
- LQCD _may_ be able to resolve the current tension between the Standard Model's predictions and experiment

![](./assets/feynman.svg)

::: block

**One Photon Correction**

$\mu$ emits and reabsorbs a virtual photon (largest effect)

:::

::: block

**Hadronic Vacuum Polarization**

Virtual photon splits into (anti-)hadron pair (quarks, hard to calculate)

:::

note:
- Hadronic Vacuum Polarization (HVP) Contribution
    - The blobs (quark loops) represent all possible intermediate hadronic states ($\rho$, $\pi\pi$, $\ldots$)
    - Not calculable in perturbation theory
    - Can be calculated from first principles using **lattice QCD**:

$$a_{\mu}(\mathrm{HVP}) = \left(\frac{\alpha}{\pi}\right)^{2}\int_{0}^{\infty} dq^{2}\, f(q^{2})\, \hat{\Pi}(q^{2})$$

---

## Lattice QCD

::: block

1. **Gauge Field Generation**: Use Markov Chain Monte Carlo (MCMC) methods to sample _independent_ gauge field (gluon) configurations.
2. **Propagator calculations**: Compute how quarks propagate in these fields (_quark propagators_).
3. **Contractions**: Combine quark propagators into correlation functions and observables.

:::

note:
- Calculations in Lattice QCD proceed in 3 steps
- Non-perturbative approach to solving the QCD theory of the strong interaction between quarks and gluons

---

# More statistics!

- Lattice QCD _may_ help resolve this tension
    - Currently limited by computing power
- New algorithms + ML seem promising...
    - See [`[1]`](https://arxiv.org/abs/2202.05838) for a broad overview of prospects

1. [Applications of Machine Learning to Lattice Quantum Field Theory • arXiv:2202.05838](https://arxiv.org/abs/2202.05838)

note:
- LQCD may help resolve the existing tension between the Standard Model predictions and experiments
- For many key applications, the necessary LQCD calculations are limited by available computing resources

---

# BMW Collaboration

![](assets/bmw_supercomputer.jpg)

::: block

The JUWELS supercomputer at the Jülich Research Center in Germany, used to calculate the anomalous magnetic moment of the muon

:::

---

## LQCD @ ALCF (2008)

> The **Blue Gene/P** at the ALCF has tremendously accelerated the generation of the gauge configurations—in many cases, by a factor of 5 to 10 over what has been possible with other machines.
> Significant progress has been made in simulations with two different implementations of the quarks—domain wall and staggered. [`[1]`](https://publications.anl.gov/anlpubs/2009/12/65724.pdf)

1. [Argonne Leadership Computing Facility • 2008 annual report](https://publications.anl.gov/anlpubs/2009/12/65724.pdf)

---

# `l2hmc-qcd`

::: block

PyTorch • Hydra • TensorFlow
:::

---

#### Markov Chain Monte Carlo (MCMC)

$$x \sim \mathcal{N}(0, \mathbb{1})$$

![](assets/samples/mh1d.svg)

---

**Markov Chain Monte Carlo** (MCMC)
[`[slide w/ src]`](#Metropolis-Hastings%20in%20Python)

![](./assets/samples/samples-100.svg)
![](./assets/samples/samples-1000.svg)
![](./assets/samples/samples-10000.svg)
![](./assets/samples/samples-1e06.svg)
---

### Hamiltonian Monte Carlo (HMC)

$$\begin{align} \dot{x} &= +\partial_{v} H\\\\ \dot{v} &= - \partial_{x} H \end{align}$$

![](assets/hmc1.svg)

---

### HMC

[](https://chi-feng.github.io/mcmc-demo/app.html)

---

#### Integrated Autocorrelation time: $\textcolor{#FF2052}{\tau_{\mathrm{int}}}$

We can measure the performance by comparing $\tau_{\mathrm{int}}^{Q}$ for the **trained model** to **HMC**.

::: block

**Note**: Lower is better

:::

![|650](./assets/autocorr_new.svg)

![charge_histories](./assets/charge_histories.svg)

---

### Integrated Autocorrelation Time

![](./assets/tint1.svg)

::: block

Comparison of $\tau_{\mathrm{int}}^{Q}$ for **trained models** vs. **HMC** with different trajectory lengths, $N_{\mathrm{LF}}$, at $\beta = 4, 5, 6, 7$

:::

---

# Interpretation

![](assets/ridgeplots.svg)

::: block

Deviation in $x_{P}$

Topological charge mixing

Artificial influx of energy

:::

Illustration of how different observables evolve over a single L2HMC trajectory.

---

# `l2hmc-qcd`

- Source code publicly available
    - Both `pytorch` and `tensorflow` implementations with support for distributed training, automatic checkpointing, etc.
- Generic interface, easily extensible
- Work in progress scaling up to 2D, 4D $SU(3)$

---

# References

[^melko]: Carrasquilla, J., Melko, R. [Machine learning phases of matter](https://doi.org/10.1038/nphys4035). _Nature Phys_ **13,** 431–434 (2017).
[^alcf]: [Argonne National Laboratory and Hewlett Packard Enterprise prepare for exascale era with new testbed supercomputer • Argonne Leadership Computing Facility](https://www.alcf.anl.gov/news/argonne-national-laboratory-and-hewlett-packard-enterprise-prepare-exascale-era-new-testbed)
[^quanta]: [‘Last Hope’ Experiment Finds Evidence for Unknown Particles • Quanta Magazine](https://www.quantamagazine.org/last-hope-experiment-finds-evidence-for-unknown-particles-20210407/)
[^hmc]: [Interactive MCMC demo](https://chi-feng.github.io/mcmc-demo/app.html)

---

# Thank you!

Feel free to reach out!

[foremans@anl.gov](mailto:foremans@anl.gov)

::: block

[](https://github.com/saforem2)
[](https://samforeman.me)
[](https://twitter.com/saforem2)

:::

---

# BACKUPS

---

# Metropolis-Hastings in Python

```python
import numpy as np


def prob(x: float) -> float:
    # standard normal density; the 1 / sqrt(2 * pi) normalization
    # cancels in the MH ratio anyway
    denom = np.sqrt(2 * np.pi)
    return np.exp(-0.5 * (x ** 2)) / denom


def metropolis_hastings(steps: int = 1000):
    x = 0.                          # initialize config
    samples = np.zeros(steps)
    for n in range(steps):
        xp = x + np.random.randn()  # generate proposal
        if np.random.rand() < (prob(xp) / prob(x)):
            x = xp                  # accept if xp more likely
        samples[n] = x              # collect our samples
    return samples
```

### $1\times10^{6}$ samples in $< 4s$!

```python
>>> %timeit metropolis_hastings(int(1e6))
3.85 s ± 173 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

---

# Motivation

- For independent samples:

$$\langle \mathcal{O}\rangle \propto\int\left[\mathcal{D}x\right]\mathcal{O}(x)e^{-S(x)} \simeq\frac{1}{N}\sum_{n=1}^{N}\mathcal{O}(x_{n})$$

$$\Rightarrow \sigma^{2}=\frac{1}{N}\text{Var}\left[\mathcal{O}(x)\right]$$

- Accounting for autocorrelations:

$$\sigma^{2}=\frac{\textcolor{#0091ea}{\tau_{\mathrm{int}}^{\mathcal{O}}}}{N}\text{Var}\left[\mathcal{O}(x)\right]$$

- $\tau_{\mathrm{int}}^{\mathcal{O}}$ is known to scale exponentially as we approach the physical lattice spacing.

---
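#### Estimating $\tau_{\mathrm{int}}^{\mathcal{O}}$ in code

A minimal `numpy` sketch of the windowed estimator implied above, using the convention $\tau_{\mathrm{int}} = 1 + 2\sum_{t}\rho(t)$ so that $\sigma^{2} = \frac{\tau_{\mathrm{int}}}{N}\text{Var}\left[\mathcal{O}(x)\right]$. The helper names and the fixed summation window are illustrative assumptions, not code from `l2hmc-qcd`:

```python
import numpy as np


def autocorr(obs: np.ndarray, t: int) -> float:
    """Normalized autocorrelation rho(t) of a 1D chain of measurements."""
    x = obs - obs.mean()
    return float(np.dot(x[: x.size - t], x[t:]) / np.dot(x, x))


def tau_int(obs: np.ndarray, window: int = 100) -> float:
    """Windowed estimate of the integrated autocorrelation time,
    tau_int = 1 + 2 * sum_t rho(t)."""
    return 1.0 + 2.0 * sum(autocorr(obs, t) for t in range(1, window))
```

e.g. `tau_int(metropolis_hastings(int(1e5)))` reuses the sampler from the Metropolis-Hastings slide; a production analysis would choose the window self-consistently rather than fixing it.

---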
## Algorithm

1. `input`: $\textcolor{#AE81FF}{x}$
2. Resample $\textcolor{#FD971F}{\mathbf{v}} \sim \mathcal{N}(0, \mathbb{1})$
3. Construct $\textcolor{#FF5252}{\xi} = (\textcolor{#AE81FF}{x}, \textcolor{#FD971F}{\mathbf{v}})$
4. Generate proposal $\xi^{\ast}$ by passing initial $\xi$ through $N_{\mathrm{LF}}$ **leapfrog layers**:

::: block

$\textcolor{#FF5252}{\xi}\hspace{1pt}\xrightarrow[\mathrm{LFL}]{}\xi_{1} \longrightarrow\cdots \longrightarrow \xi_{N_{\mathrm{LF}}} = \textcolor{#0091ea}{\xi^{\ast}}$

:::

5. Compute the **Metropolis-Hastings** (MH) acceptance (with Jacobian $\mathcal{J}$):

::: block

$A(\textcolor{#0091ea}{\xi^{\ast}}|\textcolor{#ff5252}{\xi})=\mathrm{min}\left[1, \frac{p(\textcolor{#0091ea}{\xi^{\ast}})}{p(\textcolor{#ff5252}{\xi})}\mathcal{J}\left(\textcolor{#0091ea}{\xi^{\ast}},\textcolor{#ff5252}{\xi}\right)\right]$

:::

6. `if training`:
    1. Evaluate the **loss function** $\mathcal{L}\gets \mathcal{L}_{\theta}(\textcolor{#0091ea}{\xi^{\ast}}, \textcolor{#ff5252}{\xi})$ and backprop
    2. Evaluate the MH criteria and assign the next state in the chain by

::: block

$\mathbf{x}_{i+1}\gets \begin{cases} \textcolor{#0091ea}{\mathbf{x}^{\ast}} \small{\text{ w/ prob }} A(\textcolor{#0091ea}{\xi^{\ast}}|\textcolor{#ff5252}{\xi}) \hspace{26pt}✅ \\\\ \textcolor{#ff5252}{\mathbf{x}} \hspace{14px}\small{\text{ w/ prob }} 1 - A(\textcolor{#0091ea}{\xi^{\ast}}|\textcolor{#ff5252}{\xi}) \hspace{11pt}❌ \end{cases}$

:::

---
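#### MH step with Jacobian (sketch)

The acceptance and loss in steps 5 and 6 translate directly into code. A minimal sketch in the spirit of the Metropolis-Hastings backup slide; `log_prob`, `log_jac`, and `dq` are assumed stand-ins for quantities the leapfrog layers would provide, not the actual `l2hmc-qcd` interface:

```python
import numpy as np


def mh_step(xi, xi_star, log_prob, log_jac: float, dq: float):
    """A(xi*|xi) = min[1, p(xi*)/p(xi) * J(xi*, xi)], plus the training loss."""
    acc = min(1.0, float(np.exp(log_prob(xi_star) - log_prob(xi) + log_jac)))
    loss = -(dq ** 2) * acc  # per-sample term of L = E[-dQ^2 * A]
    xi_next = xi_star if np.random.rand() < acc else xi  # accept / reject
    return xi_next, acc, loss
```

---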
### Lattice Gauge Theory

##### Link variables

$U_{\mu}(x) = e^{i x_{\mu}(n)}\in U(1)$,
with $x_{\mu}(n)\in[-\pi,\pi]$

##### Wilson Action

$S_{\beta}(x) = \beta\sum_{P}\left(1 - \cos \textcolor{#0091Ea}{x_{P}}\right)$

$x_{P}= x_{\mu}(n) + x_{\nu}(n+\hat{\mu})-x_{\mu}(n+\hat{\nu})-x_{\nu}(n)$
![|700](assets/plaq.drawio.svg)

::: block
Topological Charge

**Continuous:** $Q_{\mathbb{R}} = \frac{1}{2\pi}\sum_{P} \sin x_{P}\in\mathbb{R}$

**Discrete:** $Q_{\mathbb{Z}} = \frac{1}{2\pi}\sum_{P} \left\lfloor x_{P}\right\rfloor\in\mathbb{Z}$

with $\left\lfloor x_{P}\right\rfloor = x_{P}-2\pi\left\lfloor\frac{x_{P}+\pi}{2\pi}\right\rfloor$

:::

---
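#### Lattice observables in code

The definitions above map almost line-for-line onto `numpy`. A minimal sketch for a periodic 2D $U(1)$ lattice; the `x[mu, nt, nx]` array layout and function names are assumptions for illustration:

```python
import numpy as np


def plaquettes(x: np.ndarray) -> np.ndarray:
    """x_P = x_mu(n) + x_nu(n + mu) - x_mu(n + nu) - x_nu(n), with mu=0, nu=1."""
    return (x[0]
            + np.roll(x[1], -1, axis=0)  # x_nu(n + mu-hat)
            - np.roll(x[0], -1, axis=1)  # x_mu(n + nu-hat)
            - x[1])


def wilson_action(x: np.ndarray, beta: float) -> float:
    """S_beta(x) = beta * sum_P (1 - cos(x_P))."""
    return float(beta * np.sum(1.0 - np.cos(plaquettes(x))))


def topological_charges(x: np.ndarray) -> tuple:
    """Continuous Q_R and discrete Q_Z from the plaquette angles."""
    xp = plaquettes(x)
    proj = xp - 2 * np.pi * np.floor((xp + np.pi) / (2 * np.pi))  # floor(x_P)
    return (float(np.sum(np.sin(xp)) / (2 * np.pi)),
            float(np.sum(proj) / (2 * np.pi)))
```

---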
# Networks

- Stack gauge links as `x.shape = [Nb, 2, Nt, Nx, 2]`

::: block

$$x_{\mu}(n) \gets \left[\cos(x), \sin(x)\right]$$

:::

- $x$ network:
    - $\Lambda^{\pm}$: $(x, v) \rightarrow \left[S_{x}, T_{x}, Q_{x}\right]$
    - $S_{x}, T_{x}, Q_{x}$ used to update $x$ in the L2HMC update
- $v$ network:
    - $\Gamma^{\pm}$: $(x, \partial_{x}U) \rightarrow \left[S_{v}, T_{v}, Q_{v}\right]$
    - $S_{v}, T_{v}, Q_{v}$ used to update $v$ in the L2HMC update

---

# $x$ Networks $\textcolor{#42A5F5}{\Lambda^{\pm}}$:

::: block

`input`: $(\textcolor{#42A5F5}{x}, \textcolor{#FD971F}{v})$

:::

::: block

$$\begin{align} h_{1} &= \sigma\left(W_{x}\, \textcolor{#42A5F5}{x} + W_{v}\, \textcolor{#FD971F}{v} + b\right) \\\\ h_{2} &= \sigma\left(W_{1} h_{1} + b_{1}\right) \\\\ \quad\vdots& \\\\ h_{n} &= \sigma\left(W_{n} h_{n-1} + b_{n}\right) \\\\ \textcolor{#00CF53}{S_{x}} &= \textcolor{#ff5252}{\lambda_{S}}\tanh\left(W_{S} h_{n} + b_{S} \right) \\\\ \textcolor{#00CF53}{Q_{x}} &= \textcolor{#ff5252}{\lambda_{Q}}\tanh\left(W_{Q} h_{n} + b_{Q} \right) \\\\ \textcolor{#00CF53}{T_{x}} &= W_{T} h_{n} + b_{T} \end{align}$$

:::

::: block

`output`: $\textcolor{#00CF53}{S_{x}, T_{x}, Q_{x}}$

:::

**Note**: $\textcolor{#ff5252}{\lambda_{S}}$, $\textcolor{#ff5252}{\lambda_{Q}}$ are trainable parameters

---

## Loss Function

::: block

- Maximize the _expected squared charge difference_:

::: block

$$\mathcal{L}(\theta) = \textcolor{#228BE6}{\mathbb{E}_{p(\xi)}}\left[-\textcolor{#FA5252}{{\delta Q}}^{2}_{\textcolor{#FA5252}{\mathbb{R}}}(\xi', \xi)\cdot A(\xi'|\xi)\right]$$

:::

- $\textcolor{#FA5252}{\delta Q_{\mathbb{R}}}$ is the tunneling rate:

::: block

$$\textcolor{#FA5252}{\delta Q_{\mathbb{R}}}(\xi',\xi)=\left|Q_{\mathbb{R}}(x') - Q_{\mathbb{R}}(x)\right|$$

:::

- $A(\xi'|\xi)$ is the probability of accepting the proposal $\xi'$:

::: block

$$A(\xi'|\xi) = \min\left(1, \frac{p(\xi')}{p(\xi)}\left|\frac{\partial \xi'}{\partial \xi^{T}}\right|\right)$$

:::

:::

---

# Interpretation

![](assets/Hf_ridgeplot.svg)

---

# Interpretation

![](./assets/plaqsf_ridgeplot.svg)

---

## Plaquette analysis: $x_{P}$

![](assets/plaqsf_vs_lf_step1.svg)

---

#### Identifying Phase Transitions [`[1]`](https://doi.org/10.1038/nphys4035)

![](assets/ising/configs.svg)

![](assets/ising/network-dropout.svg)

1. Carrasquilla, J., Melko, R. [Machine learning phases of matter](https://doi.org/10.1038/nphys4035). _Nature Phys_ **13,** 431–434 (2017).

---

### Identifying Phase Transitions[^melko]

![](assets/ising/summary.svg)

---

### Identifying Phase Transitions[^melko]

![](assets/classification-net.jpg)

---

### Identifying Phase Transitions[^melko]

![](assets/ising/pca.svg)

---

### Renormalizing Images [`[1]`](https://arxiv.org/abs/1807.10250), [`[2]`](https://arxiv.org/abs/1410.3831)

![](assets/cifar10/frog.svg)
![](assets/cifar10/frog-gray.svg)
![](assets/cifar10/frog-025.svg)
![](assets/cifar10/frog-005.svg)
![](assets/cifar10/frog-075.svg)

1. Foreman, S., Giedt, J., Meurice, Y., & Unmuth-Yockey, J. (2018). Examples of renormalization group transformations for image sets. _Physical Review E_, _98_(5), 052129. [`arXiv:1807.10250`](https://arxiv.org/abs/1807.10250)
2. Mehta, P., & Schwab, D.J. (2014). An exact mapping between the Variational Renormalization Group and Deep Learning. [`arXiv:1410.3831`](https://arxiv.org/abs/1410.3831)

note:
- Hinton's "deep belief networks" work, in a particular case, exactly like the renormalization group and seem to be closely related to the "information bottleneck"

---

# Tensor Networks

![](assets/renormalization/tm1.svg)

![](assets/renormalization/nb-insert.svg)

---

### `l2hmc`: LeapfrogLayer

![](assets/leapfrog-layer-excalidraw.svg)

![|750](assets/update_steps.svg)

![](assets/network_functions.svg)

---

## Algorithm

1. `input:` $x$ (lattice configuration)
2. $v\sim\mathcal{N}(0, \mathbb{1})$
3. $(x'',v'') \gets \texttt{TransitionKernel}(x, v)$:
    1. $v' = \textcolor{#ff5252}{\Gamma^{\pm}}\left[x, \partial_{x}U\right]$
    2. Update $x$ in two parts:

$$\begin{align} x' &= m_{A}\odot x + m_{B}\odot\textcolor{#0091ea}{\Lambda^{\pm}}\left[x, v'\right]\\\\ x'' &= m_{B}\odot x' + m_{A}\odot \textcolor{#0091ea}{\Lambda^{\pm}}\left[x', v'\right] \end{align}$$

    3. $v'' = \textcolor{#ff5252}{\Gamma^{\pm}}\left[x'', \partial_{x}U''\right]$

---

# Detail

::: block

1. **Update** $v$:

$$v' = v \cdot \exp\left[\varepsilon \textcolor{#ff5252}{S_{v}}\left(x, \partial_{x}U\right)\right] - \frac{\varepsilon}{2} \bigg[\partial_{x} U\cdot \exp\left[\varepsilon \textcolor{#ff5252}{Q_{v}}\left(x, \partial_{x}U\right)\right] + \textcolor{#ff5252}{T_{v}}\left(x, \partial_{x}U\right)\bigg]$$

2. **Update** $x$ (in two parts):

    Update the first half of $x$:

$$x' = x_{A} + \bigg\{x_{B}\cdot\exp\left[\varepsilon \textcolor{#00CCFF}{S_{x}}(x_{B}, v')\right] + \varepsilon\cdot\big[v'\cdot \exp\left[\varepsilon \textcolor{#00CCFF}{Q_{x}}\left(x_{B}, v'\right)\right] + \textcolor{#00CCFF}{T_{x}}\left(x_{B}, v'\right)\big]\bigg\}$$

    Update the second half of $x$:

$$x'' = x'_{B} + \bigg\{x'_{A}\cdot\exp\left[\varepsilon \textcolor{#00CCFF}{S_{x}}(x'_{A}, v')\right] + \varepsilon\cdot\big[v'\cdot \exp\left[\varepsilon \textcolor{#00CCFF}{Q_{x}}\left(x'_{A}, v'\right)\right] + \textcolor{#00CCFF}{T_{x}}\left(x'_{A}, v'\right)\big]\bigg\}$$

3. **Update** $v$:

$$v'' = v' \cdot \exp\left[\varepsilon \textcolor{#ff5252}{S_{v}}\left(x'', \partial_{x}U''\right)\right] - \frac{\varepsilon}{2} \bigg[\partial_{x} U''\cdot \exp\left[\varepsilon \textcolor{#ff5252}{Q_{v}}\left(x'', \partial_{x}U''\right)\right] + \textcolor{#ff5252}{T_{v}}\left(x'', \partial_{x}U''\right)\bigg]$$

:::

---
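#### LeapfrogLayer updates in code

Read as code, each $v$-update above is an elementwise exponential scaling of $v$ plus a gated force term. A hedged sketch; the `v_net` callable and its output ordering are assumptions about the interface, not the actual `l2hmc-qcd` API:

```python
import numpy as np


def update_v(x, v, dUdx, v_net, eps: float):
    """v' = v * exp(eps * S_v) - (eps / 2) * [dU/dx * exp(eps * Q_v) + T_v]."""
    s, q, t = v_net(x, dUdx)  # assumed: v-network returns (S_v, Q_v, T_v)
    vp = v * np.exp(eps * s) - 0.5 * eps * (dUdx * np.exp(eps * q) + t)
    logdet = eps * np.sum(s)  # log-Jacobian of v -> v', accumulated for MH
    return vp, logdet
```

---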
# Gauge Equivariance

::: block

- Transform untraced plaquettes (matrix valued) $P'_{\mu\nu}(x)\gets P_{\mu\nu}(x)$

::: block

$$P_{\mu\nu}(x) \equiv U_{\mu}(x)\,U_{\nu}(x+\hat{\mu})\,U^{\dagger}_{\mu}(x+\hat{\nu})\,U^{\dagger}_{\nu}(x)$$

:::

- Change back to links and update the gauge configuration:

::: block

$$U'_{\mu}(x) = P'_{\mu\nu}(x)\,P^{\dagger}_{\mu\nu}(x)\,U_{\mu}(x)$$

:::

:::

---

## Topological Charge $Q$

**Field Theoretic Definition**:

$$Q = \int d^{4}x\, q(x)$$

where

$$q(x) = \frac{1}{32\pi^{2}} \varepsilon_{\mu\nu\rho\sigma} \mathrm{Tr}\left[F_{\mu\nu} F_{\rho\sigma}\right]$$

---

# Topological Charge

::: block

- Discretize $Q$ on the lattice as:

::: block

$$Q = a^{4} \sum_{x} q_{L}(x)$$

:::

- Discretize $q_{L}$ using $C_{\mu\nu}^{\mathrm{plaq}}(x)$:

::: block

$$q_{L}^{\mathrm{plaq}} = \frac{1}{32\pi^{2}} \varepsilon_{\mu\nu\rho\sigma}\mathrm{Tr}\left(C_{\mu\nu}^{\mathrm{plaq}}\, C_{\rho\sigma}^{\mathrm{plaq}}\right)$$

:::

- Where $C_{\mu\nu}^{\mathrm{plaq}}(x)$ is the imaginary part of the plaquette

:::

![](./assets/qplaq.svg)

---

#### Center for Complex Systems Research @ UIUC

::: block

- [Alfred Hübler](https://en.wikipedia.org/wiki/Alfred_H%C3%BCbler) [`[1]`](https://nosh.northwestern.edu/conferences/schedule.pdf)
    - Neural nets in 2006! [`[2]`](https://meetings.aps.org/Meeting/MAR06/Event/42791)
    - Energy conversion, storage
        - Resulted in a publication [`[3]`](https://doi.org/10.1063/1.5009698)
        - And a patent! [`[4]`](https://experts.illinois.edu/en/publications/energy-storage-in-quantum-resonators)

:::

::: block

1. [Understanding Complex Systems Workshop 2004](https://nosh.northwestern.edu/conferences/schedule.pdf)
2. [2006 APS March Meeting • Scaling Properties of Topological Neural Nets](https://meetings.aps.org/Meeting/MAR06/Event/42791)
3. A. Hübler, **S. Foreman**, J. Liu, & L. Wortsmann, "Large Energy Density in Three-Plate Nanocapacitors due to Coulomb Blockade," _J. Appl. Phys._ (2018)
4. Liu, Jiaqi; Hubler, Alfred W; **Foreman, Samuel A**; Ott, Katharina. [**Energy storage in quantum resonators**](https://experts.illinois.edu/en/publications/energy-storage-in-quantum-resonators), U.S. Patent No. _9741492_, Aug 22, 2017.

:::

note:
- Heavily influenced my trajectory

---

## [](https://www.github.com/quda/quda) QUDA

[![DOI](https://zenodo.org/badge/1300564.svg)](https://zenodo.org/badge/latestdoi/1300564)

Multi-GPU Support[`[1]`](https://arxiv.org/abs/1109.2935), with:
- adaptive multigrid[`[4]`](https://arxiv.org/abs/1612.07873)
- block CG[`[3]`](https://arxiv.org/abs/1710.09745)
- the Möbius MSPCG solver[`[2]`](https://arxiv.org/abs/2104.05615)

1. R. Babich, M. A. Clark, B. Joo, G. Shi, R. C. Brower, and S. Gottlieb, "Scaling lattice QCD beyond 100 GPUs," International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2011. [arXiv:1109.2935 [hep-lat]](https://arxiv.org/abs/1109.2935)
2. Jiqun Tu, M. A. Clark, Chulwoo Jung, Robert Mawhinney, "Solving DWF Dirac Equation Using Multi-splitting Preconditioned Conjugate Gradient with Tensor Cores on NVIDIA GPUs," published in the Platform of Advanced Scientific Computing (PASC21). [arXiv:2104.05615 [hep-lat]](https://arxiv.org/abs/2104.05615)
3. M. A. Clark, A. Strelchenko, A. Vaquero, M. Wagner, and E. Weinberg, "Pushing Memory Bandwidth Limitations Through Efficient Implementations of Block-Krylov Space Solvers on GPUs," Comput. Phys. Commun. 233 (2018), 29-40. [arXiv:1710.09745 [hep-lat]](https://arxiv.org/abs/1710.09745)
4. M. A. Clark, B. Joo, A. Strelchenko, M. Cheng, A. Gambhir, and R. Brower, "Accelerating Lattice QCD Multigrid on GPUs Using Fine-Grained Parallelization," International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2016. [arXiv:1612.07873 [hep-lat]](https://arxiv.org/abs/1612.07873)
5. M. A. Clark, R. Babich, K. Barros, R. Brower, and C. Rebbi, "Solving Lattice QCD systems of equations using mixed precision solvers on GPUs," Comput. Phys. Commun. 181, 1517 (2010). [arXiv:0911.3191 [hep-lat]](https://arxiv.org/abs/0911.3191)

---

::: block

### $SU(3)$: Plaquettes

:::

![](./assets/plaquetteUxy.svg)

![](./assets/plaquetteUtx.svg)

---

## Non-Compact Projection[^1]

Project $x \in[-\pi, \pi]$ onto $\mathbb{R}$ using a transformation $z = g(x)$:

$$z = \tan\left(\frac{x}{2}\right)$$

Perform the update in $\mathbb{R}$:

$$z' = m^{t}\odot z + \bar{m}^{t}\odot \left[\alpha z + \beta\right]$$

Project back to $[-\pi, \pi]$ using $x = g^{-1}(z)$:

$$x = 2\tan^{-1}(z)$$

[^1]: [arXiv:2002.02428](https://arxiv.org/abs/2002.02428)

---

## Non-Compact Projection

[arXiv:2002.02428](https://arxiv.org/abs/2002.02428)

Combine into a single update:

$$x' = \textcolor{#228BE6}{m^{t}}\odot x + \textcolor{#FA5252}{\bar{m}^{t}}\odot\left[2\tan^{-1}\left(\alpha\tan\left(\frac{x}{2}\right)\right)+\beta\right]$$

With corresponding Jacobian:

$$\frac{\partial x'}{\partial x} = \frac{\exp(\varepsilon s_{x})}{\cos^{2}(x/2)+\exp(2\varepsilon s_{x})\sin^{2}(x/2)}$$

---

### HMC: Leapfrog Integrator

- Hamiltonian: $H(x, v) = S(x) + \frac{1}{2} v^{2}$
- Target distribution: $\textcolor{#42a5f5}{p(x)\propto e^{-S(x)}}$, $\,p(v) \propto e^{-\frac{1}{2}v^{2}}$ $\Longrightarrow$

$$p(x, v) \propto \textcolor{#42A5F5}{e^{-S(x)}} \cdot e^{-\frac{v^{2}}{2}} = e^{-H(x, v)}$$

- Hamilton's equations:

$$\dot{x} = \partial_{v} H,\quad \dot{v} = - \partial_{x} H$$

- Leapfrog integrator (step size $\varepsilon$):

::: block

1. Half-step $\textcolor{#26A69A}{v}$ update: $\,\tilde{v} = v - \frac{\varepsilon}{2} \partial_{x} U$
2. Full-step $\,\textcolor{#AB47BC}{x}$ update: $x' = x + \varepsilon \tilde{v}$
3. Half-step $\textcolor{#26A69A}{v}$ update: $v' = \tilde{v} - \frac{\varepsilon}{2} \partial_{x} U$

:::

---
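# HMC in Python

Putting the leapfrog slide into code: a minimal `numpy` sketch for a single degree of freedom, using the toy action $S(x) = x^{2}/2$ (the action and the function names are illustrative assumptions):

```python
import numpy as np


def grad_S(x: float) -> float:
    """Force dS/dx for the toy action S(x) = x**2 / 2."""
    return x


def leapfrog(x: float, v: float, eps: float, n_steps: int):
    """Integrate Hamilton's equations for H = S(x) + v**2 / 2."""
    v -= 0.5 * eps * grad_S(x)      # 1. half-step v update
    for _ in range(n_steps - 1):
        x += eps * v                # 2. full-step x update
        v -= eps * grad_S(x)
    x += eps * v
    v -= 0.5 * eps * grad_S(x)      # 3. half-step v update
    return x, v


def hmc_step(x: float, eps: float = 0.1, n_steps: int = 10) -> float:
    v = np.random.randn()           # resample v ~ N(0, 1)
    x1, v1 = leapfrog(x, v, eps, n_steps)
    dH = (0.5 * x1**2 + 0.5 * v1**2) - (0.5 * x**2 + 0.5 * v**2)
    return x1 if np.random.rand() < np.exp(-dH) else x  # MH accept / reject
```

---

# Data-Parallel Training in Python

For the distributed-training slides: a minimal PyTorch `DistributedDataParallel` skeleton, in which each rank computes gradients on its own data shard and `backward()` averages them across ranks. A generic sketch (launcher, model, and data loading omitted), not the `l2hmc-qcd` training loop:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def train(model: torch.nn.Module, loader, epochs: int = 1):
    dist.init_process_group("nccl")  # one process per GPU, e.g. via torchrun
    rank = dist.get_rank()           # assumes rank == local GPU index
    model = DDP(model.to(rank), device_ids=[rank])
    opt = torch.optim.Adam(model.parameters())
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x.to(rank)), y.to(rank))
            loss.backward()          # grads all-reduced (averaged) across ranks
            opt.step()
```

---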