{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning and Statistics for Physicists" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Material for a [UC Irvine](https://uci.edu/) course offered by the [Department of Physics and Astronomy](https://www.physics.uci.edu/).\n", "\n", "Content is maintained on [github](github.com/dkirkby/MachineLearningStatistics) and distributed under a [BSD3 license](https://opensource.org/licenses/BSD-3-Clause).\n", "\n", "[Table of contents](Contents.ipynb)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns; sns.set()\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning in a Probabilistic Context" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Learning a Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have focused recently on two questions:\n", " - **Assuming a model, which parameters best explain my data?**\n", " - **Given competing models, what are the relative odds that they explain my data?**\n", "\n", "In the framework of Bayesian inference, we answer the first question by estimating the posterior\n", "$$\n", "P(\\Theta_M\\mid D, M) = \\frac{P(D\\mid \\Theta_M, M)\\, P(\\Theta_M, M)}{P(D\\mid M)}\n", "$$\n", "for the assumed model $M$ and its parameters $\\Theta_M$. We answer the second question by estimating the posterior ratio:\n", "$$\n", "\\frac{P(M_1\\mid D)}{P(M_2\\mid D)} = \\frac{P(D\\mid M_1)\\, P(M_1)}{P(D\\mid M_2)\\, P(M_2)} \\; .\n", "$$\n", "In either case, the fundamental object is the joint probability,\n", "$$\n", "P(D, \\Theta_M, M),\n", "$$\n", "or, with competing models, $M_1$ and $M_2$, the pair of joint probabilities,\n", "$$\n", "P(D, \\Theta_{M_1}, M_1) \\quad , \\quad\n", "P(D, \\Theta_{M_2}, M_2) \\; .\n", "$$\n", "These are the fundamental objects since any conditional or marginalized probability can be derived from them. Note that the observed random variables $D$ are given, but the unobserved (latent) random variables $\\Theta_M$ require that we make a choice of model(s)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Unsupervised Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The questions above concern models and their parameters. However, we can also ask questions about \"new\" data, where \"new\" means either not yet observed or else already observed but omitted from our analysis. For example:\n", " - **Given observed data $D$, how likely is unobserved data $D'$?**\n", "\n", "This is the fundamental question of **unsupervised learning**, and can be framed in probabilistic language starting from the joint probability\n", "$$\n", "P(D, D', \\Theta_M, M)\n", "$$\n", "assuming the model $M$ with parameters $\\Theta_M$. To answer this question, we must estimate:\n", "$$\n", "\\begin{aligned}\n", "P(D'\\mid D, M) &= \\int d\\Theta_M\\, P(D',\\Theta_M\\mid D, M) \\\\\n", "&= \\int d\\Theta_M\\, P(D'\\mid D,\\Theta_M, M)\\,P(\\Theta_M\\mid D, M)\\\\\n", "&= \\int d\\Theta_M\\, \\frac{P(D',D\\mid \\Theta_M, M)}{P(D\\mid \\Theta_M, M)}\\, P(\\Theta_M\\mid D, M) \\;. 
\\\\\n", "\\end{aligned}\n", "$$\n", "If we assume that $D$ and $D'$ are statistically independent datasets, then\n", "$$\n", "P(D',D\\mid \\Theta_M, M) = P(D'\\mid\\Theta_M, M)\\, P(D\\mid \\Theta_M, M) \\; ,\n", "$$\n", "and we can simplify\n", "$$\n", "P(D'\\mid D, M) = \\int d\\Theta_M\\, P(D'\\mid\\Theta_M, M)\\, P(\\Theta_M\\mid D, M) \\; .\n", "$$\n", "Note that in order to evaluate the RHS, we must have already learned the model $M$ and determined the posterior $P(\\Theta_M\\mid D, M)$ of its parameters." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Supervised Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we split the features of our data $D$ into two categories, $X$ and $Y$, we can ask a new question about unobserved data $D'$ that splits into $X'$ and $Y'$:\n", " - **Given observed data $(X,Y)$ from $D$ and $X'$ from $D'$, how likely is the remaining unobserved data $Y'$?**\n", " \n", "This is the fundamental question of **supervised learning**, and the relevant joint probability is now\n", "$$\n", "P(D, D', \\Theta_M, M) = P(X, Y, X', Y', \\Theta_M, M)\n", "$$\n", "for the assumed model $M$ with parameters $\\Theta_M$. To answer this question, we can estimate:\n", "$$\n", "\\begin{aligned}\n", "P(Y'\\mid X, Y, X', M) &= \\int d\\Theta_M\\, P(Y',\\Theta_M\\mid X, Y, X', M) \\\\\n", "&= \\int d\\Theta_M\\, P(Y'\\mid X, Y, X',\\Theta_M, M)\\,P(\\Theta_M\\mid X, Y, X', M) \\\\\n", "&= \\int d\\Theta_M\\, \\frac{P(X, Y, X', Y'\\mid \\Theta_M, M)}{P(X, Y, X'\\mid \\Theta_M, M)}\\,\n", "P(\\Theta_M\\mid X, Y, X', M) \\; .\n", "\\end{aligned}\n", "$$\n", "If we again assume that $D$ and $D'$ are statistically independent, then\n", "$$\n", "P(X,Y,X',Y'\\mid\\Theta_M,M) = P(X',Y'\\mid\\Theta_M,M)\\, P(X,Y\\mid\\Theta_M,M) \\; ,\n", "$$\n", "and (after integrating out $X'$),\n", "$$\n", "P(X,Y,Y'\\mid\\Theta_M,M) = P(Y'\\mid\\Theta_M,M)\\, P(X,Y\\mid\\Theta_M,M) \\; .\n", "$$\n", "We can then simplify:\n", "$$\n", "P(Y'\\mid X, Y, X', M) = \\int d\\Theta_M\\,\n", "\\frac{P(X', Y'\\mid \\Theta_M, M)}{P(X'\\mid \\Theta_M, M)}\\, P(\\Theta_M\\mid X, Y, X', M) \\; .\n", "$$\n", "Note that this formulation of the problem has $P(\\Theta_M\\mid X, Y, X', M)$ on the RHS, which indicates that we must re-learn the model $M$ each time we are given new data $X'$.\n", "\n", "An alternative formulation reveals that, while valid, this is not necessary: start from the unsupervised result above, with $D=(X,Y)$ and $D'=(X',Y')$,\n", "$$\n", "P(X',Y'\\mid X,Y, M) = \\int d\\Theta_M\\, P(X',Y'\\mid\\Theta_M, M)\\, P(\\Theta_M\\mid X,Y, M)\n", "$$\n", "then integrate out $X'$,\n", "$$\n", "\\begin{aligned}\n", "P(X'\\mid X,Y, M) &= \\int d\\Theta_M\\, \\left[ \\int dX'\\, P(X',Y'\\mid\\Theta_M, M)\\right] \n", "P(\\Theta_M\\mid X,Y, M) \\\\\n", "&= \\int d\\Theta_M\\, P(Y'\\mid\\Theta_M, M)\\,P(\\Theta_M\\mid X,Y, M) \\; .\n", "\\end{aligned}\n", "$$\n", "We can now answer our original question using\n", "$$\n", "\\begin{aligned}\n", "P(Y'\\mid X,Y,X',M) &= \\frac{P(P(X',Y'\\mid X,Y, M)}{P(X'\\mid X,Y, M)} \\\\\n", "&= \\frac{\\int d\\Theta_M\\, P(X',Y'\\mid\\Theta_M, M)\\, P(\\Theta_M\\mid X,Y, M)}\n", "{\\int d\\Theta_M\\, P(Y'\\mid\\Theta_M, M)\\,P(\\Theta_M\\mid X,Y, M)} \\; .\n", "\\end{aligned}\n", "$$\n", "Note how this formulation allows us to learn the model $M$ once with the original data $D=(X,Y)$ but requires two separate marginalizations (integrals) over the model parameters $\\Theta_M$." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Terminology" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data $D$ used to learn the model used in unsupervised or supervised learning is referred as the **training data**. In supervised learning, the features appearing in $Y$ and $Y$ are referred to as the **target features**.\n", "\n", "We use different terminology (and approaches to modeling) for supervised learning depending on the type of target features we wish to learn:\n", " - **regression:** predict continuous-value target features.\n", " - **classification:** predict discrete-valued target features.\n", "\n", "Note that the target features might be a mix of continuous and discrete features, so this terminology is incomplete.\n", "\n", "Most of the high-profile machine learning applications from google, facebook, etc, involve classification rather than regression, so proportionally more effort has gone into developing and optimizing classification algorithms. However, most scientific applications are more naturally expressed as regression problems: this presents both a challenge and an opportunity to the scientific ML community! Also note that a regression problem can always be converted into a classification problem by binning the output (assuming you don't need infinite accuracy), which can be surprisingly effective.\n", "\n", "All unsupervised and supervised learning algorithms involve priors, $P(\\Theta_M\\mid M)$ but they are not always stated explicitly. Sometimes priors are expressed implicitly via **regularization conditions**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model Selection Revisited" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When learning a model $M$, our quantitative measure of how well it explains the data $D$ is the evidence $P(D\\mid M)$.\n", "\n", "Since unsupervised and supervised learning requires first learning the model, we could use the same measure, but a different measure is also useful: how well does $M$ explain unobserved data $D'$? In other words, how well does the learned model generalize and offer predictive power? In order to answer this question, in practice, you must hold back some of your observed data when learning the model and can then measure how well the model \"postdicts\" the held back data. The held back data is referred to as the **test sample** and this process is known as **cross validation**.\n", "\n", "In cases where the model and its parameters have some physical reality (for example, projectile motion modeled with Newtonian physics and parameterized by $g$, ...), these measures are essentially the same since the model is dictated by some independent reality and not chosen specifically to explain the observed data $D$.\n", "\n", "However, when predicting future data is the main goal and there is no first-principles model available, the model and its parameters are essentially unconstrained and these two measures can easily diverge. 
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an extreme example, suppose the observed data $D$ consists of $N$ samples $x_i$ of a single feature $x$. Then a model $M$ with the likelihood ($\\delta_D$ is the Dirac delta function)\n", "$$\n", "P(x\\mid \\Theta_M, M) = \\frac{1}{N}\\, \\sum_{i=1}^N \\delta_D(x - x_i)\n", "$$\n", "trivially explains the data perfectly with the parameters\n", "$$\n", "\\Theta_M = \\{ x_1, x_2, \\ldots, x_N \\} \\; .\n", "$$\n", "This purely empirical approach to model building is an extreme case of over-fitting and offers no generalization power. (Note that the likelihood above is a kernel density estimate with a Dirac delta function kernel.)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }