{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Machine Learning and Statistics for Physicists" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Material for a [UC Irvine](https://uci.edu/) course offered by the [Department of Physics and Astronomy](https://www.physics.uci.edu/).\n", "\n", "Content is maintained on [GitHub](https://github.com/dkirkby/MachineLearningStatistics) and distributed under a [BSD3 license](https://opensource.org/licenses/BSD-3-Clause).\n", "\n", "##### ► [View table of contents](Contents.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "This notebook can optionally be viewed as a [slide presentation](https://medium.com/learning-machine-learning/present-your-data-science-projects-with-jupyter-slides-75f20735eb0f). Click [here](https://nbviewer.jupyter.org/format/slides/github/dkirkby/MachineLearningStatistics/blob/master/notebooks/Intro.ipynb#/) to view the slides online or, to present the slides locally, use:\n", "```\n", "jupyter nbconvert Intro.ipynb --to slides --post serve\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**ACTIVITY:** Discuss these questions:\n", "1. What is the relationship between *machine learning* and *statistics*?\n", "2. Does your research focus more on *data* or *models*?\n", "3. What is a *data scientist*?\n", "4. What is \"deep\" about *deep learning*?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### What is \"Machine Learning\"?\n", "\n", "Using **machines** to **learn** how to explain data with models." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### What is \"Machine Learning\"?\n", "\n", "Using **machines** to **learn** how to explain data with models.\n", "\n", "The \"machines\" responsible for most of the progress in ML are:\n", " - software algorithms\n", " - hardware architectures\n", " - human ingenuity\n", " \n", "The \"learning\" consists of passively identifying statistical correlations, which is very different from how we learn, through active experimentation and by identifying causal relationships." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### What is \"Machine Learning\"?\n", "\n", "Using machines to learn how to explain **data** with **models**.\n", "\n", "![MLS-triangle1](img/Intro/MLS-triangle1.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### What is \"Machine Learning\"?\n", "\n", "Machine learning uses models to learn from data.\n", "\n", "![MLS-triangle2](img/Intro/MLS-triangle2.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Further reading:\n", "- [Data mining and statistics: what's the connection?](http://statweb.stanford.edu/~jhf/ftp/dm-stat.pdf)\n", "- [The rise of the \"data engineer\"](https://medium.com/@maximebeauchemin/the-rise-of-the-data-engineer-91be18f1e603)\n", "- [Humorous contrasts between ML and Stats](http://statweb.stanford.edu/~tibs/stat315a/glossary.pdf)\n", "  - Python $\\leftrightarrow$ R\n", "  - conference talk $\\leftrightarrow$ journal article" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## What is Data?\n", "\n", "Data is (are?) 
a finite set of measurements:\n", "- Usually viewed as a 2D table, e.g., a spreadsheet, [FITS table](http://docs.astropy.org/en/stable/io/fits/usage/table.html), [Pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)...\n", "- **columns = features**\n", "- **rows = samples** (observations)\n", "- Richer data structures (images, [ROOT trees](https://root.cern.ch/root/html/guides/users-guide/Trees.html#trees), etc.) must be flattened.\n", "\n", "![data-table](img/Intro/data-table.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## What is Data?\n", "\n", "Data is (are?) a finite set of measurements:\n", "- Usually viewed as a 2D table, e.g., a spreadsheet, [FITS table](http://docs.astropy.org/en/stable/io/fits/usage/table.html), [Pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)...\n", "- **columns = features**\n", "- **rows = samples** (observations)\n", "- Richer data structures (images, [ROOT trees](https://root.cern.ch/root/html/guides/users-guide/Trees.html#trees), etc.) must be flattened.\n", "\n", "Questions to ask about your data:\n", "- Are my features categorical / discrete / continuous?\n", "- Is the ordering of my samples significant?\n", "- Are my samples statistically independent? Are they drawn from the same distribution?\n", "- What are my measurement uncertainties?\n", "- Is my data binned / un-binned?\n", "- Is there a natural similarity / distance measure on my samples (rows)?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**ACTIVITY:** Pick one of these ML problems and describe the rows (samples) and columns (features) of the data you might use to solve the problem.\n", "1. Learn a fast approximation to a slow exact calculation.\n", "2. Learn to identify Higgs particle decays from LHC event data.\n", "3. 
Learn to estimate the distance to a quasar using optical images." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## What is a Model?\n", "\n", "Two important types of models: generative, probabilistic.\n", "\n", "All ML algorithms use a model to explain your data.\n", "\n", "Models have parameters.\n", "\n", "![models1](img/Intro/models1.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## What is a Model?\n", "\n", "Two important types of models: generative, probabilistic.\n", "\n", "Models can explain data **and parameters**.\n", "\n", "Models have parameters **and hyper-parameters.**\n", "\n", "![models2](img/Intro/models2.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## What is Learning?\n", "\n", "Three broad types of learning:\n", " - **Unsupervised: learn to predict new data.**\n", "    - Given data: what patterns are present? (learn a model).\n", "    - Given data and model: how likely is new data to be from the same model? (generate new data).\n", " - **Supervised: learn to predict specific features of new data.**\n", "    - Classification: predict discrete features (learn a conditional model).\n", "    - Regression: predict continuous features (learn a conditional model).\n", " - **Inference: explain observed data.**\n", "    - Assuming a model: what parameters (with what uncertainties) best describe my data? (learn a model).\n", "    - Given competing models: which best describes my data? 
(model selection).\n", " \n", "(Also: reinforcement learning.)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## What is special about ML in Physics and Astronomy?\n", "\n", "Scientific applications of ML benefit greatly from advances in industry, but we work in a different context:\n", "- **We are data producers, not data consumers:**\n", "  - Experiment / survey design.\n", "  - Optimization of statistical errors.\n", "  - Control of systematic errors.\n", "- **Our data measures physical processes:**\n", "  - Measurements often reduce to counting photons, etc., with random errors that are known a priori.\n", "  - Dimensions and units are important.\n", "- **Our models are usually traceable to an underlying physical theory:**\n", "  - Models constrained by theory and previous observations.\n", "  - Parameter values often intrinsically interesting.\n", "- **A parameter's uncertainty estimate is just as important as its value:**\n", "  - Prefer methods that handle input data uncertainties (weights) and provide output parameter uncertainty estimates." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## How will this course be different from a CS class?\n", "\n", "Physics and astronomy students have different preparation:\n", "- Strong background and experience with the mathematical tools (linear algebra, multivariate calculus) needed for a rigorous discussion of statistics.\n", "- Weak / varied background in traditional CS core topics (fundamental algorithms, databases, etc.).\n", "\n", "Physics and astronomy research also has different needs:\n", "- Our data and models are often fundamentally different from those in typical CS contexts.\n", "- We ask different types of questions about our data, sometimes requiring new methods.\n", "- We have different priorities for judging a \"good\" method: interpretability, error estimates, etc." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Topics Overview\n", "\n", "![outline](img/Intro/outline.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Exercise\n", "\n", "One of the first tasks when applying machine learning to a new problem is to establish some baselines for the expected performance:\n", " - How well does the simplest possible (non-ML) approach work?\n", " - What is the current \"state of the art\"?\n", " - If applicable: what is the \"human performance\" level?\n", " \n", "Let's get a \"human performance\" baseline for the following supervised-learning problem:\n", "**How many \"sources\" are present in an image?**\n", "\n", "You are now the machines:\n", " - I will show you 36 training images so you can **learn** how to perform this task.\n", " - Next, you must **classify** 12 test images. Enter your responses at https://goo.gl/qi2CEV." ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 2 }