{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning and Statistics for Physicists" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Material for a [UC Irvine](https://uci.edu/) course offered by the [Department of Physics and Astronomy](https://www.physics.uci.edu/).\n", "\n", "Content is maintained on [github](https://github.com/dkirkby/MachineLearningStatistics) and distributed under a [BSD3 license](https://opensource.org/licenses/BSD-3-Clause).\n", "\n", "[Table of contents](Contents.ipynb)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns; sns.set()\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use the sklearn [decomposition module](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition) below:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from sklearn import decomposition" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from mls import locate_data" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "line_data = pd.read_csv(locate_data('line_data.csv'))\n", "pong_data = pd.read_hdf(locate_data('pong_data.hf5'))\n", "cluster_3d = pd.read_hdf(locate_data('cluster_3d_data.hf5'))\n", "cosmo_targets = pd.read_hdf(locate_data('cosmo_targets.hf5'))\n", "spectra_data = pd.read_hdf(locate_data('spectra_data.hf5'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Dimensionality" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We call the number of features (columns) in a dataset its \"dimensionality\". 
In order to learn how different features are related, we need enough samples to get a complete picture.\n", "\n", "For example, imagine splitting each feature at its median value. At a minimum, we would then like to have at least one sample in each of the resulting $2^D$ bins, where the dimensionality $D$ is the number of features (columns). This is a very low bar: it requires only 8 samples with D=3, but more than $2^{30} > 1$ billion samples with D=30.\n", "\n", "To get a feel for how well sampled your dataset is, estimate how many bins you could split each feature (axis) into while still ending up with about 1 sample per bin, assuming that features are uncorrelated: this is $N^{1/D}$ for $N$ samples. A value < 2 fails our minimal test above and anything < 5 is a potential red flag." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
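The median-split test described above can be sketched as follows. This is a minimal illustration on a small random dataset (the data, column names, and random seed here are hypothetical, not one of the course datasets):

```python
import numpy as np
import pandas as pd

# Hypothetical demo: N=8 samples of D=3 uncorrelated features.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame(rng.normal(size=(8, 3)), columns=['x0', 'x1', 'x2'])

# Split each feature at its median: each sample maps to one of the
# 2**D bins, indexed by which side of each median it falls on.
above_median = (df > df.median()).to_numpy()
occupied = {tuple(row) for row in above_median}
print(f'{len(occupied)} of {2 ** df.shape[1]} bins are occupied')
```

With only 8 samples and $2^3 = 8$ bins, some bins will usually collide and stay empty, which is exactly why 8 samples is such a low bar for D=3.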
| Dataset | N | D | N**(1/D) |
|---|---|---|---|
| line_data | 2000 | 3 | 12.599 |
| cluster_3d | 500 | 3 | 7.937 |
| cosmo_targets | 50000 | 6 | 6.070 |
| pong_data | 1000 | 20 | 1.413 |
| spectra_data | 200 | 500 | 1.011 |
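The summary table above can be reproduced directly from the dataset shapes. The sketch below hard-codes the (N, D) shapes of the DataFrames loaded earlier, so it runs standalone without `locate_data`:

```python
import pandas as pd

# (N, D) = (rows, columns) for each DataFrame loaded above.
shapes = {
    'line_data': (2000, 3),
    'cluster_3d': (500, 3),
    'cosmo_targets': (50000, 6),
    'pong_data': (1000, 20),
    'spectra_data': (200, 500),
}

# N**(1/D) estimates how many bins per axis leave ~1 sample per bin.
summary = pd.DataFrame(
    [(name, N, D, N ** (1 / D)) for name, (N, D) in shapes.items()],
    columns=['Dataset', 'N', 'D', 'N**(1/D)'])
print(summary.round({'N**(1/D)': 3}))
```

Note how pong_data and spectra_data fall below the red-flag threshold of 5 despite having hundreds of samples, because their dimensionality is so high.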