{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning and Statistics for Physicists" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A significant part of this material is taken from David Kirkby's material for a [UC Irvine](https://uci.edu/) course offered by the [Department of Physics and Astronomy](https://www.physics.uci.edu/).\n", "\n", "Content is maintained on [GitHub](https://github.com/dkirkby/MachineLearningStatistics) and distributed under a [BSD3 license](https://opensource.org/licenses/BSD-3-Clause).\n", "\n", "[Table of contents](Contents.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Start from the OUTLINE OF THE ML PIPELINE" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "a_data = pd.read_hdf('data/cluster_a_data.hf5')\n", "b_data = pd.read_hdf('data/cluster_b_data.hf5')\n", "c_data = pd.read_hdf('data/cluster_c_data.hf5')\n", "d_data = pd.read_hdf('data/cluster_d_data.hf5')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "cluster_3d = pd.read_hdf('data/cluster_3d_data.hf5')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "cosmo_data = pd.read_hdf('data/cosmo_data.hf5')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SciKit Learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This will be our first time using the [SciKit Learn package](http://scikit-learn.org/stable/). We don't include it in our standard preamble since it contains many modules (sub-packages). Instead, we import each module as we need it. 
The ones we need now are:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from sklearn import preprocessing, cluster" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Find Structure in Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The type of structure we can look for is \"clusters\" of \"nearby\" samples, but the definition of these terms requires some care.\n", "\n", "### Distance between samples\n", "\n", "In the simplest case, all features $x_{ji}$ have the same (possibly dimensionless) units, and the natural (squared Euclidean) distance between samples (rows) $j$ and $k$ is:\n", "$$\n", "d(j, k) = \\sum_{\\text{features}\\,i} (x_{ji} - x_{ki})^2 \\; .\n", "$$\n", "However, what if some columns have different units? For example, what is the distance between:\n", "$$\n", "\\left( 1.2, 0.4~\\text{cm}, 5.2~\\text{kPa}\\right)\n", "$$\n", "and\n", "$$\n", "\\left( 0.7, 0.5~\\text{cm}, 4.9~\\text{kPa}\\right)\n", "$$\n", "ML algorithms are generally unit-agnostic, so they will happily combine features with different units, but that may not be what you really want.\n", "\n", "One reasonable solution is to normalize each feature with the [sphering transformation](https://en.wikipedia.org/wiki/Whitening_transformation):\n", "$$\n", "x \\rightarrow (x - \\mu) / \\sigma\n", "$$\n", "where $\\mu$ and $\\sigma$ are the mean and standard deviation of that feature's original values.\n", "\n", "The [sklearn.preprocessing module](http://scikit-learn.org/stable/modules/preprocessing.html) automates this process with:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | omega_b | omega_cdm | ln10^{10}A_s | H0 |
|---|---|---|---|---|
| 0 | 0.021984 | 0.126410 | 2.754078 | 64.660790 |
| 1 | 0.021120 | 0.107570 | 2.880657 | 65.231402 |
| 2 | 0.021920 | 0.120965 | 3.039052 | 69.714897 |
| 3 | 0.021304 | 0.131144 | 2.772624 | 69.520171 |
| 4 | 0.021985 | 0.121561 | 2.849463 | 63.284940 |
| ... | ... | ... | ... | ... |
| 9995 | 0.021406 | 0.110055 | 3.069914 | 67.993275 |
| 9996 | 0.022045 | 0.107412 | 3.013025 | 63.083467 |
| 9997 | 0.021588 | 0.108479 | 3.315846 | 65.571640 |
| 9998 | 0.022582 | 0.124919 | 2.999254 | 65.667611 |
| 9999 | 0.020763 | 0.125550 | 3.171801 | 73.810086 |
50000 rows × 4 columns

|  | omega_b | omega_cdm | ln10^{10}A_s | H0 |
|---|---|---|---|---|
| count | 5.000000e+04 | 5.000000e+04 | 5.000000e+04 | 5.000000e+04 |
| mean | -1.101341e-17 | 4.192202e-17 | 1.326299e-15 | 1.858922e-15 |
| std | 1.000010e+00 | 1.000010e+00 | 1.000010e+00 | 1.000010e+00 |
| min | -1.735789e+00 | -1.726691e+00 | -1.734287e+00 | -1.725280e+00 |
| 25% | -8.584267e-01 | -8.697915e-01 | -8.654500e-01 | -8.737388e-01 |
| 50% | -1.958299e-03 | -2.131438e-03 | -3.995583e-03 | 5.644049e-03 |
| 75% | 8.617981e-01 | 8.679544e-01 | 8.704566e-01 | 8.736686e-01 |
| max | 1.731191e+00 | 1.732135e+00 | 1.726426e+00 | 1.717778e+00 |
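
The summary above shows every column with a mean of essentially zero and a standard deviation of essentially one, which is what the normalization $x \rightarrow (x - \mu) / \sigma$ should produce. As a minimal sketch of that transformation done by hand with NumPy, the block below uses a made-up toy array `X` (hypothetical values, not the course's `cosmo_data`). It also illustrates one plausible reason the `std` row reads 1.00001 rather than exactly 1: if the scaling divides by the population standard deviation (`ddof=0`, the convention scikit-learn documents for its standard scaling) while the summary reports the sample standard deviation (`ddof=1`, the pandas default), the two differ by a factor $\sqrt{n/(n-1)}$, which is about 1.00001 for $n = 50000$.

```python
import numpy as np

# Hypothetical toy data (not the course's cosmo_data): 1000 samples of
# three features whose typical scales differ by orders of magnitude.
rng = np.random.default_rng(seed=123)
n_samples = 1000
X = rng.normal(loc=[1.0, 0.5, 5000.0], scale=[0.3, 0.05, 200.0],
               size=(n_samples, 3))

# Per-feature normalization x -> (x - mu) / sigma, using the population
# standard deviation (ddof=0) of each column.
mu = X.mean(axis=0)
sigma = X.std(axis=0, ddof=0)
X_normed = (X - mu) / sigma

# Each column now has mean ~0 and a population std of 1.
print(X_normed.mean(axis=0))
print(X_normed.std(axis=0, ddof=0))

# The sample std (ddof=1) is larger by a factor sqrt(n / (n - 1)).
print(X_normed.std(axis=0, ddof=1), np.sqrt(n_samples / (n_samples - 1)))
```

After this rescaling, a difference of one unit means "one standard deviation" in every feature, so the summed squared differences no longer depend on the original units.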