{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Analysis and Machine Learning Applications for Physicists" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Material for a* [*University of Illinois*](http://illinois.edu) *course offered by the* [*Physics Department*](https://physics.illinois.edu). *This content is maintained on* [*GitHub*](https://github.com/illinois-mla) *and is distributed under a* [*BSD3 license*](https://opensource.org/licenses/BSD-3-Clause).\n", "\n", "[Table of contents](Contents.ipynb)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns; sns.set()\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Locate Course Data Files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "During the [initial setup of your environment](Setup.ipynb), you installed the data for this course with the `mls` package, which also provides a function to locate it:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from mls import locate_data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'/root/miniconda/envs/DAMLA/lib/python3.6/site-packages/mls/data/pong_data.hf5'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "locate_data('pong_data.hf5')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data files are stored in the industry standard [binary HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) and [text CSV](https://en.wikipedia.org/wiki/Comma-separated_values) formats, with extensions `.hf5` and `.csv`, respectively. HDF5 is more efficient for larger files but requires specialized software to read. 
CSV files are just plain text:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x,y,dy\n", "0.3929383711957233,0.08540861657452603,0.3831920560881885\n", "-0.42772133009924107,-0.5198803411067978,0.38522044793317467\n", "-0.5462970928715938,-0.8124804852644906,\n", "0.10262953816578246,0.10527828529558633,0.38556680974439583\n" ] } ], "source": [ "with open(locate_data('line_data.csv')) as f:\n", "    # Print the first 5 lines of the file.\n", "    for lineno in range(5):\n", "        print(f.readline(), end='')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first line specifies the name of each column (\"feature\") in the data file. Subsequent lines are the rows (\"samples\") of the data file, with values for each column separated by commas. Note that values might be missing (for example, at the end of the third row)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read Files with Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use the [Pandas package](https://pandas.pydata.org/) to read data files into DataFrame objects in memory. This will only be a quick introduction. For a deeper dive, start with [Data Manipulation with Pandas](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html) in the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html)."
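, "\n", "\n", "Before handing a file to Pandas, it can help to see what reading a CSV file involves. The following is only a minimal sketch using the standard library `csv` module, not how Pandas is implemented: the inline `text` mimics the first lines of `line_data.csv` (values rounded to 3 decimals), and the helper `to_float` is a hypothetical name introduced here. It maps an empty field to NaN, the same placeholder Pandas uses for missing values.

```python
import csv
import io
import math

# First lines of line_data.csv as printed above, rounded to 3 decimals.
text = '''x,y,dy
0.393,0.085,0.383
-0.428,-0.520,0.385
-0.546,-0.812,
'''

def to_float(field):
    # An empty field means the value is missing; use NaN as a placeholder.
    return float(field) if field.strip() else math.nan

# DictReader takes the column names from the header line.
reader = csv.DictReader(io.StringIO(text))
rows = [{name: to_float(value) for name, value in row.items()} for row in reader]

print(reader.fieldnames)           # column names from the header line
print(len(rows))                   # number of samples read
print(math.isnan(rows[2]['dy']))   # the third row has no dy value
```

Pandas does the equivalent work in a single `read_csv` call and returns a DataFrame rather than a list of dictionaries."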
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "pong_data = pd.read_hdf(locate_data('pong_data.hf5'))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "line_data = pd.read_csv(locate_data('line_data.csv'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can think of a DataFrame as an enhanced 2D numpy array, with most of the same capabilities:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2000, 3)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "line_data.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Individual columns also behave like enhanced 1D numpy arrays:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2000,)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "line_data['y'].shape" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2000,)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "line_data['x'].shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a first look at some unknown data, start with some basic [summary statistics](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.aggregate.html):" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " x y dy\n", "count 2000.000000 2000.000000 1850.000000\n", "mean -0.000509 -0.086233 0.479347\n", "std 0.585281 0.782878 0.228198\n", "min -0.999836 -2.390646 0.151793\n", "25% -0.513685 -0.648045 0.302540\n", "50% -0.006021 -0.068052 0.431361\n", "75% 0.501449 0.473741 0.610809\n", "max 0.999289 2.365710 1.506188" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "line_data.describe()" ] }, { "cell_type": "markdown", "metadata": { "solution2": "hidden", "solution2_first": true }, "source": [ "Jot down a few things you notice about this data from this summary." ] }, { "cell_type": "markdown", "metadata": { "solution2": "hidden" }, "source": [ "- The values of x and y are symmetric about zero.\n", "- The values of x look uniformly distributed on \\[-1, +1], judging by the percentiles.\n", "- The value of dy is always > 0, as you might expect if it represents the \"error on y\".\n", "- The dy column is missing 150 entries." ] }, { "cell_type": "markdown", "metadata": { "solution2": "hidden", "solution2_first": true }, "source": [ "Summarize `pong_data` the same way. Does anything stick out?" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": true, "solution2": "hidden" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " x0 x1 x2 x3 x4 \\\n", "count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 \n", "mean 0.049004 0.132093 0.212905 0.291504 0.367950 \n", "std 0.062998 0.067380 0.075805 0.086806 0.099285 \n", "min -0.161553 -0.089041 -0.018516 0.050077 0.116790 \n", "25% -0.001755 0.079435 0.157023 0.229517 0.293469 \n", "50% 0.076534 0.148675 0.205846 0.270214 0.338380 \n", "75% 0.100177 0.187800 0.286463 0.383127 0.475724 \n", "max 0.151118 0.261095 0.370325 0.476563 0.579891 \n", "\n", " x5 x6 x7 x8 x9 \\\n", "count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 \n", "mean 0.442301 0.514615 0.584949 0.653355 0.719888 \n", "std 0.112547 0.126175 0.139919 0.153624 0.167196 \n", "min 0.181677 0.244785 0.306165 0.365863 0.415850 \n", "25% 0.353604 0.414068 0.473338 0.532280 0.590583 \n", "50% 0.406922 0.476322 0.542847 0.608249 0.673589 \n", "75% 0.565217 0.651398 0.734418 0.816378 0.896600 \n", "max 0.684321 0.787124 0.887111 0.984358 1.078941 \n", "\n", " y0 y1 y2 y3 y4 \\\n", "count 1000.0 1000.000000 1000.000000 1000.000000 1000.000000 \n", "mean 0.0 0.125206 0.217122 0.276658 0.304702 \n", "std 0.0 0.010876 0.021454 0.031742 0.041748 \n", "min 0.0 0.093722 0.155016 0.184769 0.183846 \n", "25% 0.0 0.115816 0.198597 0.249250 0.268654 \n", "50% 0.0 0.127098 0.220852 0.282177 0.311961 \n", "75% 0.0 0.132847 0.232193 0.298956 0.334029 \n", "max 0.0 0.144799 0.255769 0.333838 0.379908 \n", "\n", " y5 y6 y7 y8 y9 \n", "count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 \n", "mean 0.302116 0.269740 0.208390 0.118860 0.001921 \n", "std 0.051481 0.060946 0.070153 0.079107 0.087815 \n", "min 0.153088 0.093310 0.005310 -0.110141 -0.252291 \n", "25% 0.257665 0.217116 0.147817 0.050555 -0.073903 \n", "50% 0.311068 0.280338 0.220589 0.132616 0.017191 \n", "75% 0.338281 0.312554 0.257672 0.174431 0.063610 \n", "max 0.394854 0.379530 0.334764 0.261364 0.160113 " ] }, "execution_count": 11, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "pong_data.describe()" ] }, { "cell_type": "markdown", "metadata": { "solution2": "hidden" }, "source": [ "Some things that stick out from this summary are:\n", "- Mean, median values in the xn columns are increasing left to right.\n", "- Column y0 is always zero, so not very informative.\n", "- Mean, median values in the yn columns increase from y0 to y4 then decrease through y9." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Add your solution here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Work with Subsets of Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A subset is specified by limiting the rows and/or columns. We have already seen how to pick out a single column, e.g. with `line_data['x']`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also pick out specific rows (for details on why we use `iloc` see [here](https://jakevdp.github.io/PythonDataScienceHandbook/03.02-data-indexing-and-selection.html#Indexers:-loc,-iloc,-and-ix)):" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " x y dy\n", "0 0.392938 0.085409 0.383192\n", "1 -0.427721 -0.519880 0.385220\n", "2 -0.546297 -0.812480 NaN\n", "3 0.102630 0.105278 0.385567" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "line_data.iloc[:4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note how the missing value in the CSV file is represented as NaN = \"not a number\". This is generally how Pandas handles any [data that is missing / invalid or otherwise not available (NA)](https://pandas.pydata.org/pandas-docs/stable/missing_data.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We may not want to use any rows with missing data. Select the subset of useful data with:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "line_data_valid = line_data.dropna()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " x y dy\n", "0 0.392938 0.085409 0.383192\n", "1 -0.427721 -0.519880 0.385220\n", "3 0.102630 0.105278 0.385567\n", "4 0.438938 0.582137 0.509960" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "line_data_valid[:4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also select rows using any logical test on its column values. For example, to select all rows with dy > 0.5 and y < 0:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " x y dy\n", "13 -0.880644 -1.482074 0.698284\n", "16 -0.635017 -1.192232 0.619905\n", "30 -0.815790 -0.172324 0.643215\n", "35 -0.375478 -1.320013 0.574198" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xpos = line_data[(line_data['dy'] > 0.5) & (line_data['y'] < 0)]\n", "xpos[:4]" ] }, { "cell_type": "markdown", "metadata": { "solution2": "hidden", "solution2_first": true }, "source": [ "Use `describe` to compare the summary statistics for rows with x < 0 and x >= 0. Do they make sense?" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "solution2": "hidden" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " x y dy\n", "count 1006.000000 1006.000000 938.000000\n", "mean -0.507065 -0.689012 0.472889\n", "std 0.288074 0.498581 0.227474\n", "min -0.999836 -2.390646 0.159862\n", "25% -0.758180 -1.005357 0.294420\n", "50% -0.511167 -0.643512 0.419482\n", "75% -0.264287 -0.338449 0.611192\n", "max -0.000128 0.757903 1.506188" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "line_data[line_data['x'] < 0].describe()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "solution2": "hidden" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " x y dy\n", "count 994.000000 994.000000 912.000000\n", "mean 0.512162 0.523822 0.485989\n", "std 0.287312 0.491520 0.228875\n", "min 0.001123 -1.154558 0.151793\n", "25% 0.266587 0.163363 0.312799\n", "50% 0.502736 0.471419 0.436676\n", "75% 0.761346 0.821626 0.607731\n", "max 0.999289 2.365710 1.378183" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "line_data[line_data['x'] >= 0].describe()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# Add your solution here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extend Data with New Columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can easily add new columns derived from existing columns, for example:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "line_data['yprediction'] = 1.2 * line_data['x'] - 0.1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The new column is only in memory, and not automatically written back to the original file." ] }, { "cell_type": "markdown", "metadata": { "solution2": "hidden", "solution2_first": true }, "source": [ "**EXERCISE:** Add a new column for the \"pull\", defined as:\n", "$$\n", "y_{pull} \\equiv \\frac{y - y_{prediction}}{\\delta y} \\; .\n", "$$\n", "What would you expect the mean and standard deviation (std) of this new column to be if the prediction is accuracte? What do the actual mean, std values indicate?" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "solution2": "hidden" }, "outputs": [], "source": [ "line_data['ypull'] = (line_data['y'] - line_data['yprediction']) / line_data['dy']" ] }, { "cell_type": "markdown", "metadata": { "solution2": "hidden" }, "source": [ "The mean should be close to zero if the prediction is unbiased. The RMS should be close to one if the prediction is unbiased and the errors are accurate. 
The actual values indicate that the prediction is unbiased, but the errors are overestimated." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "solution2": "hidden" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " x y dy yprediction ypull\n", "count 2000.000000 2000.000000 1850.000000 2000.000000 1850.000000\n", "mean -0.000509 -0.086233 0.479347 -0.100611 0.036367\n", "std 0.585281 0.782878 0.228198 0.702338 0.661659\n", "min -0.999836 -2.390646 0.151793 -1.299803 -2.162585\n", "25% -0.513685 -0.648045 0.302540 -0.716422 -0.429185\n", "50% -0.006021 -0.068052 0.431361 -0.107225 0.033875\n", "75% 0.501449 0.473741 0.610809 0.501739 0.484257\n", "max 0.999289 2.365710 1.506188 1.099146 2.033837" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "line_data.describe()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# Add your solution here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Combine Data from Different Sources" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most of the data files for this course are in data/targets pairs (for reasons that will be clear soon)." ] }, { "cell_type": "markdown", "metadata": { "solution2": "hidden", "solution2_first": true }, "source": [ "Verify that the files `pong_data.hf5` and `pong_targets.hf5` have the same number of rows but different column names." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": { "solution2": "hidden" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#rows: 1000, 1000.\n", "data columns: ['x0' 'x1' 'x2' 'x3' 'x4' 'x5' 'x6' 'x7' 'x8' 'x9' 'y0' 'y1' 'y2' 'y3'\n", " 'y4' 'y5' 'y6' 'y7' 'y8' 'y9'].\n", "targets columns: ['th0' 'hit' 'grp'].\n" ] } ], "source": [ "pong_data = pd.read_hdf(locate_data('pong_data.hf5'))\n", "pong_targets = pd.read_hdf(locate_data('pong_targets.hf5'))\n", "\n", "print('#rows: {}, {}.'.format(len(pong_data), len(pong_targets)))\n", "assert len(pong_data) == len(pong_targets)\n", "\n", "print('data columns: {}.'.format(pong_data.columns.values))\n", "print('targets columns: {}.'.format(pong_targets.columns.values))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# Add your solution here..." ] }, { "cell_type": "markdown", "metadata": { "solution2": "hidden", "solution2_first": true }, "source": [ "Use `pd.concat` to combine the (different) columns, matching row by row. Verify that your combined data has the expected number of rows and column names." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "solution2": "hidden" }, "outputs": [], "source": [ "pong_both = pd.concat([pong_data, pong_targets], axis='columns')" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "solution2": "hidden" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#rows: 1000\n", "columns: ['x0' 'x1' 'x2' 'x3' 'x4' 'x5' 'x6' 'x7' 'x8' 'x9' 'y0' 'y1' 'y2' 'y3'\n", " 'y4' 'y5' 'y6' 'y7' 'y8' 'y9' 'th0' 'hit' 'grp'].\n" ] } ], "source": [ "print('#rows: {}'.format(len(pong_both)))\n", "print('columns: {}.'.format(pong_both.columns.values))" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "# Add your solution here..." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare Data from an External Source" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, here is an example of taking data from an external source and adapting it to the standard format we are using. The data is from the [2014 ATLAS Higgs Challenge](https://www.kaggle.com/c/higgs-boson), which is now documented and archived [here](http://opendata.cern.ch/record/328). More details about the challenge are in [this writeup](http://opendata.cern.ch/record/329/files/atlas-higgs-challenge-2014.pdf)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**EXERCISE:**\n", "\n", "1. Download the compressed CSV file (~62MB) `atlas-higgs-challenge-2014-v2.csv.gz` using the link at the bottom of [this page](http://opendata.cern.ch/record/328).\n", "2. Move the file to the directory containing this notebook. You do not need to uncompress (gunzip) the file.\n", "3. Skim the description of the columns [here](http://opendata.cern.ch/record/328). The details are not important, but the main points are that:\n", "    - There are two types of input \"features\": 17 primary + 13 derived.\n", "    - The goal is to predict the \"Label\" from the input features.\n", "4. Examine the function defined below and determine what it does. Look up the documentation of any functions you are unfamiliar with.\n", "5. Run the function below, which should create two new files in your course data directory:\n", "    - `higgs_data.hf5`: Input data with 30 columns, ~100MB size.\n", "    - `higgs_targets.hf5`: Output targets with 1 column, ~8.8MB size." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "def prepare_higgs(filename='atlas-higgs-challenge-2014-v2.csv.gz'):\n", "    # Read the input file, uncompressing on the fly.\n", "    df = pd.read_csv(filename, index_col='EventId', na_values='-999.0')\n", "    # Prepare and save the data output file.\n", "    higgs_data = df.drop(columns=['Label', 'KaggleSet', 'KaggleWeight']).astype('float32')\n", "    higgs_data.to_hdf(locate_data('higgs_data.hf5', check_exists=False), key='data', mode='w')\n", "    # Prepare and save the targets output file.\n", "    higgs_targets = df[['Label']]\n", "    higgs_targets.to_hdf(locate_data('higgs_targets.hf5', check_exists=False), key='targets', mode='w')" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "prepare_higgs()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check that `locate_data` can find the new files:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'/root/miniconda/envs/DAMLA/lib/python3.6/site-packages/mls/data/higgs_data.hf5'" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "locate_data('higgs_data.hf5')" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'/root/miniconda/envs/DAMLA/lib/python3.6/site-packages/mls/data/higgs_targets.hf5'" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "locate_data('higgs_targets.hf5')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can now safely delete the downloaded CSV file. Uncomment and run the line below if you would like to do this directly from your notebook (this is an example of a [shell command](https://jakevdp.github.io/PythonDataScienceHandbook/01.05-ipython-and-shell-commands.html))." 
] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "#!rm atlas-higgs-challenge-2014-v2.csv.gz" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }