{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# *Data Analysis and Machine Learning Applications for Physicists*" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "*Material for a* [*University of Illinois*](http://illinois.edu) *course offered by the* [*Physics Department*](https://physics.illinois.edu). *This content is maintained on* [*GitHub*](https://github.com/illinois-mla) *and is distributed under a* [*BSD3 license*](https://opensource.org/licenses/BSD-3-Clause).\n", "\n", "[Table of contents](Contents.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Welcome!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "![Einstein-as-a-data-scientist](./img/Intro/pynstein.jpg)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**ACTIVITY:** Discuss these questions:\n", "1. What is *Data Science*? *Machine Learning / Artificial Intelligence*? *Statistics / Data Analysis*? How are these related?\n", "2. What distinguishes *Data*, *Models*, and *Parameters* (and their estimation) from one another?"
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![Data-models-statistics triangle](./img/Intro/MLS-triangle.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Further reading:\n", "- [Data mining and statistics: what's the connection?](http://statweb.stanford.edu/~jhf/ftp/dm-stat.pdf)\n", "- [The rise of the \"data engineer\"](https://medium.com/@maximebeauchemin/the-rise-of-the-data-engineer-91be18f1e603)\n", "- [Humorous contrasts between ML and Stats](http://statweb.stanford.edu/~tibs/stat315a/glossary.pdf)\n", "  - Python $\\leftrightarrow$ R\n", "  - conference talk $\\leftrightarrow$ journal article" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![AI-Circle](./img/Intro/AI-Circle.jpg)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### How will this course be different from a CS class?\n", "\n", "Physics and astronomy students have different preparation:\n", "- Strong background in, and experience with, the mathematical tools (linear algebra, multivariate calculus) needed for a rigorous discussion of statistics.\n", "- Weak / varied background in traditional CS core topics such as fundamental algorithms, databases, etc.\n", "\n", "Physics and astronomy research also has different needs:\n", "- Our data and models are often fundamentally different from those in typical CS contexts.\n", "- We ask different types of questions about our data, sometimes requiring new methods.\n", "- We have different priorities for judging a \"good\" method: interpretability, error estimates, etc."
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### What is Data?\n", "\n", "Data are a finite set of measurements:\n", "- Usually viewed as a 2D table, e.g., a spreadsheet, [FITS table](http://docs.astropy.org/en/stable/io/fits/usage/table.html), [ROOT tree](https://root.cern.ch/root/html/guides/users-guide/Trees.html#trees), or [Pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).\n", "- **columns = features**: numeric / categorical?\n", "- **rows = samples**: ordered? independent?\n", "- measurement errors?\n", "- binned / un-binned?\n", "- similarity measure on samples?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**ACTIVITY:** Pick one of these ML problems and describe the rows (samples) and columns (features) of the data you might use to solve the problem.\n", "1. Learn a fast approximation to a slow exact calculation.\n", "2. Learn to identify Higgs particle decays from LHC event data.\n", "3. Learn to estimate the distance to a quasar using optical images." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### What is a Model?\n", "\n", "Models specify the probabilities of possible measurements:\n", "- Explicit: probability density function.\n", "- Implicit: algorithm to generate random outcomes (forward / generative model).\n", "- Usually wrong (except \"Toy MC\").\n", "- Observables (latent variables):\n", "  - integrability: required to calculate normalized probabilities.\n", "- Parameters (and hyper-parameters):\n", "  - differentiability: required to find most probable (uphill) direction.\n", "- Variance-bias tradeoffs (regularization)."
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### What is special about ML in physics and astronomy?\n", "\n", "- We are data **producers**, not (only) data consumers:\n", "  - Experiment / survey design.\n", "  - Optimization of statistical errors.\n", "  - Control of systematic errors.\n", "- Our data represent measurements of physical processes:\n", "  - Measurements often reduce to counting photons, etc., with random errors that are known a priori.\n", "  - Dimensions and units are important.\n", "- Our models are usually traceable to an underlying physical theory:\n", "  - Models constrained by theory and previous observations.\n", "  - Parameter values often intrinsically interesting.\n", "- A parameter error estimate is just as important as its value:\n", "  - Prefer methods that handle input data errors (weights) and provide output parameter error estimates.\n", "- In some experiments and scientific domains, the data sets are *huge* --> \"Big Data\".\n", "  - See one of my [recent talks](https://absuploads.aps.org/presentation.cfm?pid=14316)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "**Postprocessing for html export of notebook**" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "Python postamble (do not edit):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install jupyter_contrib_nbextensions >/dev/null\n", "!jupyter nbconvert *.ipynb --to html_embed" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "Please see the full instructions at\n", "\n", "https://illinois-mla.github.io/syllabus/assets/html-embed-export-tutorial.pdf\n", "\n", "for what to do after this cell executes to obtain a PDF of your notebook."
] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 4 }