{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "Case Study: Air Quality Dataset

\n", "
This notebook uses a dataset from the UC Irvine website. We will describe the dataset further as we explore with it using *pandas*. \n", "\n", "## Download the Dataset\n", "\n", "Please note that **you will need to download the dataset** from the UC Irvine Machine Learning Dataset Repository. \n", "\n", "You can find the data at http://archive.ics.uci.edu/ml/machine-learning-databases/00360/. Please unzip the file at a filepath of your choice. \n", "\n", "Here are instructions on how to unzip a file in Windows: https://support.microsoft.com/en-us/help/14200/windows-compress-uncompress-zip-files. \n", "For Macs, simply double-click on the file. \n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "Use Pandas to Read the Dataset
\n", "

\n", "
\n", "In this notebook, we will be using a CSV file:\n", "* **AirQualityUCI.csv :** \n", "\n", "The dataset contains over 9000 measurements of air quality taken over a year. \n", "\n", "The following are the measures taken: \n", "Date\tTime\tCO(GT)\tPT08.S1(CO)\tNMHC(GT)\tC6H6(GT)\tPT08.S2(NMHC)\tNOx(GT)\tPT08.S3(NOx)\tNO2(GT)\tPT08.S4(NO2)\tPT08.S5(O3)\tT\tRH\tAH\n", "\n", "\n", "\n", "Using the *read_csv* function in pandas, we will transfer this information into our code. " ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# import pandas and load data\n", "import pandas as pd\n", "\n", "filepath = './AirQualityUCI.csv'\n", "airquality = pd.read_csv(filepath,sep=',',delimiter=';')\n", "\n", "airquality = pd.read_csv('AirQualityUCI.csv',sep=',',delimiter=';')" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateTimeCO(GT)PT08.S1(CO)NMHC(GT)C6H6(GT)PT08.S2(NMHC)NOx(GT)PT08.S3(NOx)NO2(GT)PT08.S4(NO2)PT08.S5(O3)TRHAHUnnamed: 15Unnamed: 16
010/03/200418.00.002,61360.0150.011,91046.0166.01056.0113.01692.01268.013,648,90,7578NaNNaN
110/03/200419.00.0021292.0112.09,4955.0103.01174.092.01559.0972.013,347,70,7255NaNNaN
210/03/200420.00.002,21402.088.09,0939.0131.01140.0114.01555.01074.011,954,00,7502NaNNaN
310/03/200421.00.002,21376.080.09,2948.0172.01092.0122.01584.01203.011,060,00,7867NaNNaN
410/03/200422.00.001,61272.051.06,5836.0131.01205.0116.01490.01110.011,259,60,7888NaNNaN
\n", "
" ], "text/plain": [ " Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) \\\n", "0 10/03/2004 18.00.00 2,6 1360.0 150.0 11,9 1046.0 \n", "1 10/03/2004 19.00.00 2 1292.0 112.0 9,4 955.0 \n", "2 10/03/2004 20.00.00 2,2 1402.0 88.0 9,0 939.0 \n", "3 10/03/2004 21.00.00 2,2 1376.0 80.0 9,2 948.0 \n", "4 10/03/2004 22.00.00 1,6 1272.0 51.0 6,5 836.0 \n", "\n", " NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH \\\n", "0 166.0 1056.0 113.0 1692.0 1268.0 13,6 48,9 \n", "1 103.0 1174.0 92.0 1559.0 972.0 13,3 47,7 \n", "2 131.0 1140.0 114.0 1555.0 1074.0 11,9 54,0 \n", "3 172.0 1092.0 122.0 1584.0 1203.0 11,0 60,0 \n", "4 131.0 1205.0 116.0 1490.0 1110.0 11,2 59,6 \n", "\n", " AH Unnamed: 15 Unnamed: 16 \n", "0 0,7578 NaN NaN \n", "1 0,7255 NaN NaN \n", "2 0,7502 NaN NaN \n", "3 0,7867 NaN NaN \n", "4 0,7888 NaN NaN " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now that we have the dataset we will start to get a feeling for its layout\n", "airquality.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our dataset is loaded and looks ok, but it looks like there's some cleaning that needs to be done. Notice how the last two columns are unnamed and how the C6H6(GT) and T columns have values with commas. In this case, the commas are not delimiters but are instead subsitutes for decimal points. However, the following code block shows why Python is really useful. " ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateTimeCO(GT)PT08.S1(CO)NMHC(GT)C6H6(GT)PT08.S2(NMHC)NOx(GT)PT08.S3(NOx)NO2(GT)PT08.S4(NO2)PT08.S5(O3)TRHAHUnnamed: 15Unnamed: 16
010/03/200418.00.002.06.01360.0150.011.09.01046.0166.01056.0113.01692.01268.013.06.048.0
110/03/200419.00.002.01292.0112.09.04.0955.0103.01174.092.01559.0972.013.03.047.07.0
210/03/200420.00.002.02.01402.088.09.00.0939.0131.01140.0114.01555.01074.011.09.054.0
310/03/200421.00.002.02.01376.080.09.02.0948.0172.01092.0122.01584.01203.011.00.060.0
410/03/200422.00.001.06.01272.051.06.05.0836.0131.01205.0116.01490.01110.011.02.059.0
\n", "
" ], "text/plain": [ " Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) \\\n", "0 10/03/2004 18.00.00 2.0 6.0 1360.0 150.0 \n", "1 10/03/2004 19.00.00 2.0 1292.0 112.0 9.0 \n", "2 10/03/2004 20.00.00 2.0 2.0 1402.0 88.0 \n", "3 10/03/2004 21.00.00 2.0 2.0 1376.0 80.0 \n", "4 10/03/2004 22.00.00 1.0 6.0 1272.0 51.0 \n", "\n", " PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) \\\n", "0 11.0 9.0 1046.0 166.0 1056.0 113.0 \n", "1 4.0 955.0 103.0 1174.0 92.0 1559.0 \n", "2 9.0 0.0 939.0 131.0 1140.0 114.0 \n", "3 9.0 2.0 948.0 172.0 1092.0 122.0 \n", "4 6.0 5.0 836.0 131.0 1205.0 116.0 \n", "\n", " T RH AH Unnamed: 15 Unnamed: 16 \n", "0 1692.0 1268.0 13.0 6.0 48.0 \n", "1 972.0 13.0 3.0 47.0 7.0 \n", "2 1555.0 1074.0 11.0 9.0 54.0 \n", "3 1584.0 1203.0 11.0 0.0 60.0 \n", "4 1490.0 1110.0 11.0 2.0 59.0 " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airquality_comma = pd.read_csv(filepath, delimiter=';|,',engine='python',index_col=False)\n", "airquality_comma.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've made our data a lot more usable in just one line of code. Some pointers:\n", "- the delimiter parameter in read_csv is capable of taking a \"regex\", which allows you to, in this case, insert a boolean expression for which characters are delimiters (the semicolon OR the comma) \n", "- the 'python' engine parameter enables us to use a **regular expression** delimiter\n", "- index_col is set to False because our first line of column names does not terminate in a delimiter. \n", "\n", "We're not going to use airquality_comma because we know that the comma is not a delimiter, but we've figured out how to manipulate delimiters somewhat in the event that it was. " ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": true }, "outputs": [], "source": [ "del (airquality_comma)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You might still be wondering how to fix the issue with the commas for decimals. It turns out it's just one more parameter in read_csv. We'll also delete the last two columns pandas has interpreted, because they contain no information. " ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateTimeCO(GT)PT08.S1(CO)NMHC(GT)C6H6(GT)PT08.S2(NMHC)NOx(GT)PT08.S3(NOx)NO2(GT)PT08.S4(NO2)PT08.S5(O3)TRHAH
010/03/200418.00.002.61360.0150.011.91046.0166.01056.0113.01692.01268.013.648.90.7578
110/03/200419.00.002.01292.0112.09.4955.0103.01174.092.01559.0972.013.347.70.7255
210/03/200420.00.002.21402.088.09.0939.0131.01140.0114.01555.01074.011.954.00.7502
310/03/200421.00.002.21376.080.09.2948.0172.01092.0122.01584.01203.011.060.00.7867
410/03/200422.00.001.61272.051.06.5836.0131.01205.0116.01490.01110.011.259.60.7888
\n", "
" ], "text/plain": [ " Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) \\\n", "0 10/03/2004 18.00.00 2.6 1360.0 150.0 11.9 \n", "1 10/03/2004 19.00.00 2.0 1292.0 112.0 9.4 \n", "2 10/03/2004 20.00.00 2.2 1402.0 88.0 9.0 \n", "3 10/03/2004 21.00.00 2.2 1376.0 80.0 9.2 \n", "4 10/03/2004 22.00.00 1.6 1272.0 51.0 6.5 \n", "\n", " PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) \\\n", "0 1046.0 166.0 1056.0 113.0 1692.0 1268.0 \n", "1 955.0 103.0 1174.0 92.0 1559.0 972.0 \n", "2 939.0 131.0 1140.0 114.0 1555.0 1074.0 \n", "3 948.0 172.0 1092.0 122.0 1584.0 1203.0 \n", "4 836.0 131.0 1205.0 116.0 1490.0 1110.0 \n", "\n", " T RH AH \n", "0 13.6 48.9 0.7578 \n", "1 13.3 47.7 0.7255 \n", "2 11.9 54.0 0.7502 \n", "3 11.0 60.0 0.7867 \n", "4 11.2 59.6 0.7888 " ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airquality = pd.read_csv(filepath,sep=',',delimiter=';',decimal=',')\n", "del (airquality['Unnamed: 15'])\n", "airquality.pop('Unnamed: 16') #There are a couple ways to get rid of the columns\n", "airquality.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Data Structures

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Series

\n", "\n", "Next, let's observe how pandas interprets our data as Series and DataFrame objects, and how we can leverage that to get information about the data. " ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Extract 0th row: notice that it is in fact a Series\n", "\n", "row_0 = airquality.iloc[0]\n", "type(row_0)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Date 10/03/2004\n", "Time 18.00.00\n", "CO(GT) 2.6\n", "PT08.S1(CO) 1360\n", "NMHC(GT) 150\n", "C6H6(GT) 11.9\n", "PT08.S2(NMHC) 1046\n", "NOx(GT) 166\n", "PT08.S3(NOx) 1056\n", "NO2(GT) 113\n", "PT08.S4(NO2) 1692\n", "PT08.S5(O3) 1268\n", "T 13.6\n", "RH 48.9\n", "AH 0.7578\n", "Name: 0, dtype: object\n" ] } ], "source": [ "print(row_0)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Date', 'Time', 'CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)',\n", " 'PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)',\n", " 'PT08.S5(O3)', 'T', 'RH', 'AH'],\n", " dtype='object')" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "row_0.index" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'18.00.00'" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# value of 'Time' in row_0\n", "row_0['Time']" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# is 'RH' in row_0?\n", "'RH' in row_0" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# how can we reference row_0?\n", "row_0.name" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'first_row'" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# let's change the name of row_0\n", "row_0 = row_0.rename('first_row')\n", "row_0.name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

DataFrames

" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateTimeCO(GT)PT08.S1(CO)NMHC(GT)C6H6(GT)PT08.S2(NMHC)NOx(GT)PT08.S3(NOx)NO2(GT)PT08.S4(NO2)PT08.S5(O3)TRHAH
010/03/200418.00.002.61360.0150.011.91046.0166.01056.0113.01692.01268.013.648.90.7578
110/03/200419.00.002.01292.0112.09.4955.0103.01174.092.01559.0972.013.347.70.7255
210/03/200420.00.002.21402.088.09.0939.0131.01140.0114.01555.01074.011.954.00.7502
310/03/200421.00.002.21376.080.09.2948.0172.01092.0122.01584.01203.011.060.00.7867
410/03/200422.00.001.61272.051.06.5836.0131.01205.0116.01490.01110.011.259.60.7888
\n", "
" ], "text/plain": [ " Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) \\\n", "0 10/03/2004 18.00.00 2.6 1360.0 150.0 11.9 \n", "1 10/03/2004 19.00.00 2.0 1292.0 112.0 9.4 \n", "2 10/03/2004 20.00.00 2.2 1402.0 88.0 9.0 \n", "3 10/03/2004 21.00.00 2.2 1376.0 80.0 9.2 \n", "4 10/03/2004 22.00.00 1.6 1272.0 51.0 6.5 \n", "\n", " PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) \\\n", "0 1046.0 166.0 1056.0 113.0 1692.0 1268.0 \n", "1 955.0 103.0 1174.0 92.0 1559.0 972.0 \n", "2 939.0 131.0 1140.0 114.0 1555.0 1074.0 \n", "3 948.0 172.0 1092.0 122.0 1584.0 1203.0 \n", "4 836.0 131.0 1205.0 116.0 1490.0 1110.0 \n", "\n", " T RH AH \n", "0 13.6 48.9 0.7578 \n", "1 13.3 47.7 0.7255 \n", "2 11.9 54.0 0.7502 \n", "3 11.0 60.0 0.7867 \n", "4 11.2 59.6 0.7888 " ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's take a look at our dataset again\n", "airquality.head()" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=9471, step=1)" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airquality.index" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Date', 'Time', 'CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)',\n", " 'PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)',\n", " 'PT08.S5(O3)', 'T', 'RH', 'AH'],\n", " dtype='object')" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airquality.columns" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateTimeCO(GT)PT08.S1(CO)NMHC(GT)C6H6(GT)PT08.S2(NMHC)NOx(GT)PT08.S3(NOx)NO2(GT)PT08.S4(NO2)PT08.S5(O3)TRHAH
010/03/200418.00.002.61360.0150.011.91046.0166.01056.0113.01692.01268.013.648.90.7578
1111/03/200405.00.000.71066.08.01.1512.016.01918.028.01182.0422.011.056.20.7366
200002/06/200402.00.00-200.0887.0-200.04.9758.043.01094.038.01622.0804.019.169.01.5043
\n", "
" ], "text/plain": [ " Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) \\\n", "0 10/03/2004 18.00.00 2.6 1360.0 150.0 11.9 \n", "11 11/03/2004 05.00.00 0.7 1066.0 8.0 1.1 \n", "2000 02/06/2004 02.00.00 -200.0 887.0 -200.0 4.9 \n", "\n", " PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) \\\n", "0 1046.0 166.0 1056.0 113.0 1692.0 \n", "11 512.0 16.0 1918.0 28.0 1182.0 \n", "2000 758.0 43.0 1094.0 38.0 1622.0 \n", "\n", " PT08.S5(O3) T RH AH \n", "0 1268.0 13.6 48.9 0.7578 \n", "11 422.0 11.0 56.2 0.7366 \n", "2000 804.0 19.1 69.0 1.5043 " ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Extract row 0, 11, 2000 from DataFrame\n", "\n", "airquality.iloc[ [0,11,2000] ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Descriptive Statistics

\n", "\n", "Pandas also provides some basic quantitative functions to understand our data. " ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 9357.000000\n", "mean 58.148873\n", "std 126.940455\n", "min -200.000000\n", "25% 53.000000\n", "50% 96.000000\n", "75% 133.000000\n", "max 340.000000\n", "Name: NO2(GT), dtype: float64" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airquality['NO2(GT)'].describe()" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CO(GT)PT08.S1(CO)NMHC(GT)C6H6(GT)PT08.S2(NMHC)NOx(GT)PT08.S3(NOx)NO2(GT)PT08.S4(NO2)PT08.S5(O3)TRHAH
count9357.0000009357.0000009357.0000009357.0000009357.0000009357.0000009357.0000009357.0000009357.0000009357.0000009357.0000009357.0000009357.000000
mean-34.2075241048.990061-159.0900931.865683894.595276168.616971794.99016858.1488731391.479641975.0720329.77830539.485380-6.837604
std77.657170329.832710139.78909341.380206342.333252257.433866321.993552126.940455467.210125456.93818443.20362351.21614538.976670
min-200.000000-200.000000-200.000000-200.000000-200.000000-200.000000-200.000000-200.000000-200.000000-200.000000-200.000000-200.000000-200.000000
25%0.600000921.000000-200.0000004.000000711.00000050.000000637.00000053.0000001185.000000700.00000010.90000034.1000000.692300
50%1.5000001053.000000-200.0000007.900000895.000000141.000000794.00000096.0000001446.000000942.00000017.20000048.6000000.976800
75%2.6000001221.000000-200.00000013.6000001105.000000284.000000960.000000133.0000001662.0000001255.00000024.10000061.9000001.296200
max11.9000002040.0000001189.00000063.7000002214.0000001479.0000002683.000000340.0000002775.0000002523.00000044.60000088.7000002.231000
\n", "
" ], "text/plain": [ " CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) \\\n", "count 9357.000000 9357.000000 9357.000000 9357.000000 9357.000000 \n", "mean -34.207524 1048.990061 -159.090093 1.865683 894.595276 \n", "std 77.657170 329.832710 139.789093 41.380206 342.333252 \n", "min -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 \n", "25% 0.600000 921.000000 -200.000000 4.000000 711.000000 \n", "50% 1.500000 1053.000000 -200.000000 7.900000 895.000000 \n", "75% 2.600000 1221.000000 -200.000000 13.600000 1105.000000 \n", "max 11.900000 2040.000000 1189.000000 63.700000 2214.000000 \n", "\n", " NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) \\\n", "count 9357.000000 9357.000000 9357.000000 9357.000000 9357.000000 \n", "mean 168.616971 794.990168 58.148873 1391.479641 975.072032 \n", "std 257.433866 321.993552 126.940455 467.210125 456.938184 \n", "min -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 \n", "25% 50.000000 637.000000 53.000000 1185.000000 700.000000 \n", "50% 141.000000 794.000000 96.000000 1446.000000 942.000000 \n", "75% 284.000000 960.000000 133.000000 1662.000000 1255.000000 \n", "max 1479.000000 2683.000000 340.000000 2775.000000 2523.000000 \n", "\n", " T RH AH \n", "count 9357.000000 9357.000000 9357.000000 \n", "mean 9.778305 39.485380 -6.837604 \n", "std 43.203623 51.216145 38.976670 \n", "min -200.000000 -200.000000 -200.000000 \n", "25% 10.900000 34.100000 0.692300 \n", "50% 17.200000 48.600000 0.976800 \n", "75% 24.100000 61.900000 1.296200 \n", "max 44.600000 88.700000 2.231000 " ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airquality.describe()" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-159.09009297851875" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airquality['NMHC(GT)'].mean()" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CO(GT) -34.207524\n", "PT08.S1(CO) 1048.990061\n", "NMHC(GT) -159.090093\n", "C6H6(GT) 1.865683\n", "PT08.S2(NMHC) 894.595276\n", "NOx(GT) 168.616971\n", "PT08.S3(NOx) 794.990168\n", "NO2(GT) 58.148873\n", "PT08.S4(NO2) 1391.479641\n", "PT08.S5(O3) 975.072032\n", "T 9.778305\n", "RH 39.485380\n", "AH -6.837604\n", "dtype: float64" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airquality.mean()" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-200.0" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airquality['C6H6(GT)'].min()" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1479.0" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airquality['NOx(GT)'].max()" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "257.43386629107437" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airquality['NOx(GT)'].std()" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 -200.0\n", "dtype: float64" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airquality['T'].mode()" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CO(GT)PT08.S1(CO)NMHC(GT)C6H6(GT)PT08.S2(NMHC)NOx(GT)PT08.S3(NOx)NO2(GT)PT08.S4(NO2)PT08.S5(O3)TRHAH
CO(GT)1.0000000.0414110.128351-0.0313780.0299260.526451-0.0899810.671127-0.0737240.080310-0.068939-0.048227-0.045892
PT08.S1(CO)0.0414111.0000000.1700070.8526870.9331020.2779930.0870190.1540300.8451490.8924340.7548440.7453750.764903
NMHC(GT)0.1283510.1700071.0000000.0373230.110104-0.0044270.0488210.1033070.1626800.101185-0.0000090.0082840.012500
C6H6(GT)-0.0313780.8526870.0373231.0000000.767433-0.0011740.512193-0.0109920.7746730.6413340.9713750.9250620.984555
PT08.S2(NMHC)0.0299260.9331020.1101040.7674331.0000000.331272-0.0736670.1764880.8747820.9099050.6690250.5858030.646572
NOx(GT)0.5264510.277993-0.004427-0.0011740.3312721.000000-0.4360840.8171390.0355460.461889-0.138452-0.053009-0.095847
PT08.S3(NOx)-0.0899810.0870190.0488210.512193-0.073667-0.4360841.000000-0.2562320.122734-0.2088650.5881110.5735490.621618
NO2(GT)0.6711270.1540300.103307-0.0109920.1764880.817139-0.2562321.000000-0.0221740.253439-0.084104-0.081305-0.060440
PT08.S4(NO2)-0.0737240.8451490.1626800.7746730.8747820.0355460.122734-0.0221741.0000000.7236900.7550600.6407070.691913
PT08.S5(O3)0.0803100.8924340.1011850.6413340.9099050.461889-0.2088650.2534390.7236901.0000000.5037000.5249550.519467
T-0.0689390.754844-0.0000090.9713750.669025-0.1384520.588111-0.0841040.7550600.5037001.0000000.8859110.981001
RH-0.0482270.7453750.0082840.9250620.585803-0.0530090.573549-0.0813050.6407070.5249550.8859111.0000000.943995
AH-0.0458920.7649030.0125000.9845550.646572-0.0958470.621618-0.0604400.6919130.5194670.9810010.9439951.000000
\n", "
" ], "text/plain": [ " CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) \\\n", "CO(GT) 1.000000 0.041411 0.128351 -0.031378 0.029926 \n", "PT08.S1(CO) 0.041411 1.000000 0.170007 0.852687 0.933102 \n", "NMHC(GT) 0.128351 0.170007 1.000000 0.037323 0.110104 \n", "C6H6(GT) -0.031378 0.852687 0.037323 1.000000 0.767433 \n", "PT08.S2(NMHC) 0.029926 0.933102 0.110104 0.767433 1.000000 \n", "NOx(GT) 0.526451 0.277993 -0.004427 -0.001174 0.331272 \n", "PT08.S3(NOx) -0.089981 0.087019 0.048821 0.512193 -0.073667 \n", "NO2(GT) 0.671127 0.154030 0.103307 -0.010992 0.176488 \n", "PT08.S4(NO2) -0.073724 0.845149 0.162680 0.774673 0.874782 \n", "PT08.S5(O3) 0.080310 0.892434 0.101185 0.641334 0.909905 \n", "T -0.068939 0.754844 -0.000009 0.971375 0.669025 \n", "RH -0.048227 0.745375 0.008284 0.925062 0.585803 \n", "AH -0.045892 0.764903 0.012500 0.984555 0.646572 \n", "\n", " NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) \\\n", "CO(GT) 0.526451 -0.089981 0.671127 -0.073724 0.080310 \n", "PT08.S1(CO) 0.277993 0.087019 0.154030 0.845149 0.892434 \n", "NMHC(GT) -0.004427 0.048821 0.103307 0.162680 0.101185 \n", "C6H6(GT) -0.001174 0.512193 -0.010992 0.774673 0.641334 \n", "PT08.S2(NMHC) 0.331272 -0.073667 0.176488 0.874782 0.909905 \n", "NOx(GT) 1.000000 -0.436084 0.817139 0.035546 0.461889 \n", "PT08.S3(NOx) -0.436084 1.000000 -0.256232 0.122734 -0.208865 \n", "NO2(GT) 0.817139 -0.256232 1.000000 -0.022174 0.253439 \n", "PT08.S4(NO2) 0.035546 0.122734 -0.022174 1.000000 0.723690 \n", "PT08.S5(O3) 0.461889 -0.208865 0.253439 0.723690 1.000000 \n", "T -0.138452 0.588111 -0.084104 0.755060 0.503700 \n", "RH -0.053009 0.573549 -0.081305 0.640707 0.524955 \n", "AH -0.095847 0.621618 -0.060440 0.691913 0.519467 \n", "\n", " T RH AH \n", "CO(GT) -0.068939 -0.048227 -0.045892 \n", "PT08.S1(CO) 0.754844 0.745375 0.764903 \n", "NMHC(GT) -0.000009 0.008284 0.012500 \n", "C6H6(GT) 0.971375 0.925062 0.984555 \n", "PT08.S2(NMHC) 0.669025 0.585803 0.646572 \n", "NOx(GT) -0.138452 -0.053009 -0.095847 \n", "PT08.S3(NOx) 0.588111 0.573549 0.621618 \n", "NO2(GT) -0.084104 -0.081305 -0.060440 \n", "PT08.S4(NO2) 0.755060 0.640707 0.691913 \n", "PT08.S5(O3) 0.503700 0.524955 0.519467 \n", "T 1.000000 0.885911 0.981001 \n", "RH 0.885911 1.000000 0.943995 \n", "AH 0.981001 0.943995 1.000000 " ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airquality.corr()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also filter information conditionally. " ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 False\n", "6 False\n", "7 False\n", "8 False\n", "9 False\n", "10 False\n", "11 False\n", "12 False\n", "13 False\n", "14 False\n", "15 False\n", "16 False\n", "17 False\n", "18 False\n", "19 False\n", "20 False\n", "21 False\n", "22 False\n", "23 False\n", "24 True\n", "25 True\n", "26 True\n", "27 True\n", "28 False\n", "29 False\n", " ... \n", "9441 False\n", "9442 False\n", "9443 False\n", "9444 False\n", "9445 False\n", "9446 False\n", "9447 False\n", "9448 False\n", "9449 False\n", "9450 False\n", "9451 False\n", "9452 False\n", "9453 False\n", "9454 False\n", "9455 False\n", "9456 False\n", "9457 False\n", "9458 False\n", "9459 False\n", "9460 False\n", "9461 False\n", "9462 False\n", "9463 False\n", "9464 False\n", "9465 False\n", "9466 False\n", "9467 False\n", "9468 False\n", "9469 False\n", "9470 False\n", "Name: CO(GT), Length: 9471, dtype: bool\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "filter_1 = airquality['CO(GT)'] > 3.0\n", "print(filter_1)\n", "filter_1.any()" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "filter_2 = airquality['CO(GT)'] > 3.0\n", "filter_2.all()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Handling Missing Data

\n", "\n", "If you looked at the numbers, you may have noticed that a lot of them were -200.0. That doesn't seem like a reasonable pollutant concentration, so perhaps the creators of the dataset intended that to be a NULL value. Assuming this is the case, let's fill all Null values to be -200.0. " ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(9471, 15)" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airquality.shape" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Date True\n", "Time True\n", "CO(GT) True\n", "PT08.S1(CO) True\n", "NMHC(GT) True\n", "C6H6(GT) True\n", "PT08.S2(NMHC) True\n", "NOx(GT) True\n", "PT08.S3(NOx) True\n", "NO2(GT) True\n", "PT08.S4(NO2) True\n", "PT08.S5(O3) True\n", "T True\n", "RH True\n", "AH True\n", "dtype: bool" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Check if there are Null values in each row\n", "airquality.isnull().any()" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": true }, "outputs": [], "source": [ "airquality_filled = airquality.fillna(value=-200.0)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Date False\n", "Time False\n", "CO(GT) False\n", "PT08.S1(CO) False\n", "NMHC(GT) False\n", "C6H6(GT) False\n", "PT08.S2(NMHC) False\n", "NOx(GT) False\n", "PT08.S3(NOx) False\n", "NO2(GT) False\n", "PT08.S4(NO2) False\n", "PT08.S5(O3) False\n", "T False\n", "RH False\n", "AH False\n", "dtype: bool" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airquality_filled.isnull().any()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The NULL values are gone, and now our data is consistent. We can use rows which have some NULL columns if we want by using airquality_filled, where the NULL values are replaced with the value -200.0. However, we can get rid all samples with NULL columns in the original DataFrame if we want to use only complete observations. " ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": true }, "outputs": [], "source": [ "airquality = airquality.dropna()" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Date False\n", "Time False\n", "CO(GT) False\n", "PT08.S1(CO) False\n", "NMHC(GT) False\n", "C6H6(GT) False\n", "PT08.S2(NMHC) False\n", "NOx(GT) False\n", "PT08.S3(NOx) False\n", "NO2(GT) False\n", "PT08.S4(NO2) False\n", "PT08.S5(O3) False\n", "T False\n", "RH False\n", "AH False\n", "dtype: bool" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airquality.isnull().any()" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(9357, 15)" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "airquality.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thats nice! No NULL rows! Any observation with a NULL value has been removed. Because of this, the number of rows as been reduced." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }