{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Analysis and Machine Learning Applications for Physicists" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Material for a* [*University of Illinois*](http://illinois.edu) *course offered by the* [*Physics Department*](https://physics.illinois.edu). *This content is maintained on* [*GitHub*](https://github.com/illinois-mla) *and is distributed under a* [*BSD3 license*](https://opensource.org/licenses/BSD-3-Clause).\n", "\n", "[Table of contents](Contents.ipynb)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns; sns.set()\n", "import numpy as np\n", "import pandas as pd\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use the sklearn [decomposition module](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition) below:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "from sklearn import decomposition" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a module that includes matrix decomposition algorithms, including among others PCA, NMF or ICA. 
Most of the algorithms of this module can be regarded as dimensionality reduction techniques.\n", "\n", "We will also use the [wpca package](https://github.com/jakevdp/wpca):" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "import wpca" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Data" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "from mls import locate_data" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "line_data = pd.read_csv(locate_data('line_data.csv'))\n", "pong_data = pd.read_hdf(locate_data('pong_data.hf5'))\n", "cluster_3d = pd.read_hdf(locate_data('cluster_3d_data.hf5'))\n", "cosmo_targets = pd.read_hdf(locate_data('cosmo_targets.hf5'))\n", "spectra_data = pd.read_hdf(locate_data('spectra_data.hf5'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Dimensionality" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We call the number of features (columns) in a dataset its \"dimensionality\". In order to learn how different features are related, we need enough samples to get a complete picture.\n", "\n", "For example, imagine splitting each feature at its median value then, at a minimum, we would like to have at least one sample in each of the resulting $2^D$ bins (D = dimensionality = # of features = # of columns; $r^D$ is the volume of a D-dimensional hypercube with edge length r, with r=2 in our case). This is a very low bar and only requires 8 samples with D=3, but requires $2^{30} > 1$ billion samples with D=30.\n", "\n", "To get a feel for how well sampled your dataset is, estimate how many bins you could split each feature (axis) into and end up with 1 sample per bin (assuming that features are uncorrelated). A value < 2 fails our minimal test above and anything < 5 is a potential red flag." 
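,
"\n",
"As a quick sketch (not part of the course code; the helper name `sample_density` is hypothetical), the $N^{1/D}$ estimate could be tabulated for the DataFrames loaded above like this:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"def sample_density(datasets):\n",
"    # Tabulate N (samples), D (features) and N**(1/D): the number of\n",
"    # bins per feature that would leave about one sample per bin.\n",
"    rows = [(name, len(df), df.shape[1], len(df) ** (1.0 / df.shape[1]))\n",
"            for name, df in datasets.items()]\n",
"    return pd.DataFrame(rows, columns=['Dataset', 'N', 'D', 'N**(1/D)'])\n",
"```"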
] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | Dataset | \n", "N | \n", "D | \n", "N**(1/D) | \n", "
---|---|---|---|---|
0 | \n", "line_data | \n", "2000 | \n", "3 | \n", "12.599 | \n", "
1 | \n", "cluster_3d | \n", "500 | \n", "3 | \n", "7.937 | \n", "
2 | \n", "cosmo_targets | \n", "50000 | \n", "6 | \n", "6.070 | \n", "
3 | \n", "pong_data | \n", "1000 | \n", "20 | \n", "1.413 | \n", "
4 | \n", "spectra_data | \n", "200 | \n", "500 | \n", "1.011 | \n", "