{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <span style=\"color:#ffa500\">8 | WORKING WITH DATA</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p xmlns:cc=\"http://creativecommons.org/ns#\" xmlns:dct=\"http://purl.org/dc/terms/\"><span property=\"dct:title\">This chapter of an Introduction to Health Data Science</span> by <span property=\"cc:attributionName\">Dr JH Klopper</span> is licensed under <a href=\"http://creativecommons.org/licenses/by-nc-nd/4.0/?ref=chooser-v1\" target=\"_blank\" rel=\"license noopener noreferrer\" style=\"display:inline-block;\">Attribution-NonCommercial-NoDerivatives 4.0 International<img style=\"height:22px!important;margin-left:3px;vertical-align:text-bottom;\" src=\"https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1\"><img style=\"height:22px!important;margin-left:3px;vertical-align:text-bottom;\" src=\"https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1\"><img style=\"height:22px!important;margin-left:3px;vertical-align:text-bottom;\" src=\"https://mirrors.creativecommons.org/presskit/icons/nc.svg?ref=chooser-v1\"><img style=\"height:22px!important;margin-left:3px;vertical-align:text-bottom;\" src=\"https://mirrors.creativecommons.org/presskit/icons/nd.svg?ref=chooser-v1\"></a></p>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ## <span style=\"color:#0096FF\">Introduction</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data science, by its name and nature, requires access to data. Images, sound files, text, and much more, can be represented as data. This lecture explores the use of Python for tabular data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Tabular data exists in rows and columns, either extracted from an image, a database, or similar structures, and represented as an array. An array is a set of values in rows and columns (and along even more dimensions as in the case of color images, where rows and column are stacked _on top of_ each other representing red, green, and blue channels). There is a fantastic Python package for importing such tabular data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The __pandas__ package has much to do with the success of Python as a programming language for statistical analysis and data science. It is an enormous package and is used to import data, to manipulate data, to do calculations with data, and even create graphs and plots using the data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook provides a glimpse into the usefulness of the pandas package by importing data captured in a spreadsheet file and extracting information from it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ## <span style=\"color:#0096FF\">Packages used in this notebook</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is useful to import all packages at the start of a notebook. This allows for keeping track of what extensions to the Python language are used in the notebook. Namespace abbreviations are used to make it easier to refer to the packages later in the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd  # Package to work with data\n",
    "\n",
    "import numpy as np  # Numerical analysis package"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ## <span style=\"color:#0096FF\">Importing data</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data used in this notebook is a spreadsheet file in comma separated values (CSV) format called `data.csv`. The file is imported using the pandas `read_csv` function. Since it is not a Python function, the package in which the function is hosted, has to be referenced (depending on the way the package was imported). This is done by preceding the function with the pandas namespace abbreviation, `pd`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `data.csv` file is in the same folder (directory) as the notebook. If it was in a different folder, the path to the file would have to be specified."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the data.csv file as assign it to the variable df\n",
    "df = pd.read_csv('data.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The file is also available on GitHub as `PUBH1142Chap0801.csv` and is imported below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the file https://raw.githubusercontent.com/juanklopper/TutorialData/main/PUBH1142Chap0801.csv from GitHub and assign it to the variable df\n",
    "df = pd.read_csv('https://raw.githubusercontent.com/juanklopper/TutorialData/main/PUBH1142Chap0801.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `type` function used below, shows that the object assigned to the `df` computer variable is a dataframe object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.frame.DataFrame"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Type of df\n",
    "type(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The methods and attributes of the `df` object can be printed using the `dir` function. The methods are the functions that can be applied to the object. The attributes are the properties of the object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Age',\n",
       " 'CholesterolAfter',\n",
       " 'CholesterolBefore',\n",
       " 'DOB',\n",
       " 'Delta',\n",
       " 'Group',\n",
       " 'HR',\n",
       " 'Name',\n",
       " 'Smoke',\n",
       " 'Survey',\n",
       " 'T',\n",
       " 'TAG',\n",
       " 'Vocation',\n",
       " '_AXIS_LEN',\n",
       " '_AXIS_ORDERS',\n",
       " '_AXIS_TO_AXIS_NUMBER',\n",
       " '_HANDLED_TYPES',\n",
       " '__abs__',\n",
       " '__add__',\n",
       " '__and__']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print the first 20 methods and attributes of df suing the dir function\n",
    "dir(df)[:20]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are many methods and attributes. The first few listed are the column headers (the statistical variable names) in the spreadsheet file."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:#00FF00\">Task</span>\n",
    "\n",
    "Read about methods and attributes of a [pandas dataframe object](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One such method is the `head` method. By default it returns the first five rows of a dataframe object. An integer value can be passed as argument for a different number of rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>DOB</th>\n",
       "      <th>Age</th>\n",
       "      <th>Vocation</th>\n",
       "      <th>Smoke</th>\n",
       "      <th>HR</th>\n",
       "      <th>sBP</th>\n",
       "      <th>CholesterolBefore</th>\n",
       "      <th>TAG</th>\n",
       "      <th>Survey</th>\n",
       "      <th>CholesterolAfter</th>\n",
       "      <th>Delta</th>\n",
       "      <th>Group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Dylan Patton</td>\n",
       "      <td>1981-10-07</td>\n",
       "      <td>43</td>\n",
       "      <td>Energy manager</td>\n",
       "      <td>0</td>\n",
       "      <td>47</td>\n",
       "      <td>145</td>\n",
       "      <td>1.2</td>\n",
       "      <td>1.2</td>\n",
       "      <td>1</td>\n",
       "      <td>0.7</td>\n",
       "      <td>0.5</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Sandra Howard</td>\n",
       "      <td>1993-01-27</td>\n",
       "      <td>53</td>\n",
       "      <td>Tax adviser</td>\n",
       "      <td>0</td>\n",
       "      <td>51</td>\n",
       "      <td>115</td>\n",
       "      <td>1.2</td>\n",
       "      <td>0.6</td>\n",
       "      <td>3</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.2</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Samantha Williams</td>\n",
       "      <td>1973-12-21</td>\n",
       "      <td>33</td>\n",
       "      <td>IT consultant</td>\n",
       "      <td>0</td>\n",
       "      <td>54</td>\n",
       "      <td>120</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1.3</td>\n",
       "      <td>3</td>\n",
       "      <td>1.7</td>\n",
       "      <td>0.3</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Ashley Hensley</td>\n",
       "      <td>1981-12-01</td>\n",
       "      <td>43</td>\n",
       "      <td>Nurse, children's</td>\n",
       "      <td>0</td>\n",
       "      <td>54</td>\n",
       "      <td>103</td>\n",
       "      <td>2.1</td>\n",
       "      <td>1.6</td>\n",
       "      <td>4</td>\n",
       "      <td>2.1</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Robert Wilson</td>\n",
       "      <td>1964-06-23</td>\n",
       "      <td>46</td>\n",
       "      <td>Clinical embryologist</td>\n",
       "      <td>0</td>\n",
       "      <td>61</td>\n",
       "      <td>138</td>\n",
       "      <td>2.8</td>\n",
       "      <td>2.1</td>\n",
       "      <td>5</td>\n",
       "      <td>2.8</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                Name         DOB  Age               Vocation  Smoke  HR  sBP  \\\n",
       "0       Dylan Patton  1981-10-07   43         Energy manager      0  47  145   \n",
       "1      Sandra Howard  1993-01-27   53            Tax adviser      0  51  115   \n",
       "2  Samantha Williams  1973-12-21   33          IT consultant      0  54  120   \n",
       "3     Ashley Hensley  1981-12-01   43      Nurse, children's      0  54  103   \n",
       "4      Robert Wilson  1964-06-23   46  Clinical embryologist      0  61  138   \n",
       "\n",
       "   CholesterolBefore  TAG  Survey  CholesterolAfter  Delta   Group  \n",
       "0                1.2  1.2       1               0.7    0.5  Active  \n",
       "1                1.2  0.6       3               1.0    0.2  Active  \n",
       "2                2.0  1.3       3               1.7    0.3  Active  \n",
       "3                2.1  1.6       4               2.1    0.0  Active  \n",
       "4                2.8  2.1       5               2.8    0.0  Active  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Use the head method to print the first 5 rows of df\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `shape` attribute shows use the number of rows and columns, returned as a tuple. Note that unlike a method (which is like a Python function), an attribute has no parentheses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(200, 13)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print the shape of df\n",
    "df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are $200$ observations (rows) and $13$ statistical variables (columns) in this tidy data set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `columns` attribute lists all the column header names, called labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['Name', 'DOB', 'Age', 'Vocation', 'Smoke', 'HR', 'sBP',\n",
       "       'CholesterolBefore', 'TAG', 'Survey', 'CholesterolAfter', 'Delta',\n",
       "       'Group'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print the columns of df\n",
    "df.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The majority of dataframe objects will have two axes (rows and columns). We can verify this using the `ndim` attribute."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print the dimensions of df\n",
    "df.ndim"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `size` attribute gives us the total number of data point values (the product of the number of rows and columns)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2600"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print the size of df\n",
    "df.size"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last attribute discussed here is the `dtype` attribute. It returns the Python data type of the values in each of the columns. This is a very important step. Pandas does its best to interpret the data type. Depending on how the spreadsheet was created and how data was entered, it is not always possible to correctly interpret the type. In this case, the data type may have to be changed manually. Remember that the analysis of data is based on the data type of the variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Name                  object\n",
       "DOB                   object\n",
       "Age                    int64\n",
       "Vocation              object\n",
       "Smoke                  int64\n",
       "HR                     int64\n",
       "sBP                    int64\n",
       "CholesterolBefore    float64\n",
       "TAG                  float64\n",
       "Survey                 int64\n",
       "CholesterolAfter     float64\n",
       "Delta                float64\n",
       "Group                 object\n",
       "dtype: object"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print the type of df\n",
    "df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Categorical variables are denoted as an `object` type. Numerical variables can be either integer or floating point numbers (numbers with decimal places). These are `int64` and `float64` (denoting 64-bit precision) respectively."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <span style=\"color:#0096FF\">Selecting rows and columns</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To analyse data, only certain values may need to be _extracted_. This is a very useful skill."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pandas refers to a single column in a dataframe object as a series object. The `Age` column is extracted below and saved as a series object, assigned to the variable `age_column`. The notation uses square brackets, with the column name represented as a string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assign the Age column in df to the variable age_column\n",
    "age_column = df['Age']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `age_column` variable now holds a pandas series object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.series.Series"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print the type of age_column\n",
    "type(age_column)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since there are no illegal characters in the column name such as spaces, the dot notation can be used.The `age_column computer` variable is overwritten below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assign the Age column in df to the variable age_column and use dot notation\n",
    "age_column = df.Age"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first five rows of the series object is displayed below using the `head`` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    43\n",
       "1    53\n",
       "2    33\n",
       "3    43\n",
       "4    46\n",
       "Name: Age, dtype: int64"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print the first 5 rows of age_column\n",
    "age_column.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The index column above indicates that the values are not a simple Python list object."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:#00FF00\">Task</span>\n",
    "\n",
    "Select the `Vocation` column and assign it to the variable `vocation`. Confirm that the object assigned to vocation is a pandas series object. The `tail` function (with no arguments), prints the last five observations to the screen. Use the `tail` function to print the last five observations in `vocation` to the screen."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:#00FF00\">Solution</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.series.Series"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vocation = df.Vocation\n",
    "type(vocation)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "195            Sales professional, IT\n",
       "196                         Paramedic\n",
       "197    Chartered certified accountant\n",
       "198                            Dancer\n",
       "199                Theme park manager\n",
       "Name: Vocation, dtype: object"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vocation.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At times it may be more useful to work with a numpy array, rather than a pandas series object. The `to_numpy` method is used to extract the age values as a numpy array. It is used below and the numpy array is assigned to the variable `age`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert the age_column to a numpy array and assign it to the variable age\n",
    "age = age_column.to_numpy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `type` function shows that `age` contains a numpy array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "numpy.ndarray"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print the type of age\n",
    "type(age)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pandas series objects and numpy arrays have many methods and attributes. The `mean`, `min`, and `max` methods are used below to calculate the avergage, minimum, and maximum age values respectively."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "53.07"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Calculate the average age in age_column\n",
    "age_column.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "30"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Calculate the minimum age in age_column\n",
    "age_column.min()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "75"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Calculate the maximum age in age_column\n",
    "age_column.max()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Inidividual rows (subjects) can be specified by making use of the `iloc` attribute (or property, which is the term used by pandas) for a dataframe object. The `iloc` property stands for integer location. Therefor, integers are used to specify the row and column numbers. The first row of data is extracted below.\n",
    "0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Name                   Dylan Patton\n",
       "DOB                      1981-10-07\n",
       "Age                              43\n",
       "Vocation             Energy manager\n",
       "Smoke                             0\n",
       "HR                               47\n",
       "sBP                             145\n",
       "CholesterolBefore               1.2\n",
       "TAG                             1.2\n",
       "Survey                            1\n",
       "CholesterolAfter                0.7\n",
       "Delta                           0.5\n",
       "Group                        Active\n",
       "Name: 0, dtype: object"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display the first row of data in df using the iloc method\n",
    "df.iloc[0] # Note the use of square brackets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A list of values can be passed to select multiple rows. Rows $2$, $5$, and $7$ are selected below. Note that these rows correspond to index values $1$, $4$, and $6$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>DOB</th>\n",
       "      <th>Age</th>\n",
       "      <th>Vocation</th>\n",
       "      <th>Smoke</th>\n",
       "      <th>HR</th>\n",
       "      <th>sBP</th>\n",
       "      <th>CholesterolBefore</th>\n",
       "      <th>TAG</th>\n",
       "      <th>Survey</th>\n",
       "      <th>CholesterolAfter</th>\n",
       "      <th>Delta</th>\n",
       "      <th>Group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Sandra Howard</td>\n",
       "      <td>1993-01-27</td>\n",
       "      <td>53</td>\n",
       "      <td>Tax adviser</td>\n",
       "      <td>0</td>\n",
       "      <td>51</td>\n",
       "      <td>115</td>\n",
       "      <td>1.2</td>\n",
       "      <td>0.6</td>\n",
       "      <td>3</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.2</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Robert Wilson</td>\n",
       "      <td>1964-06-23</td>\n",
       "      <td>46</td>\n",
       "      <td>Clinical embryologist</td>\n",
       "      <td>0</td>\n",
       "      <td>61</td>\n",
       "      <td>138</td>\n",
       "      <td>2.8</td>\n",
       "      <td>2.1</td>\n",
       "      <td>5</td>\n",
       "      <td>2.8</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Frank Zimmerman</td>\n",
       "      <td>1981-03-04</td>\n",
       "      <td>54</td>\n",
       "      <td>Police officer</td>\n",
       "      <td>0</td>\n",
       "      <td>60</td>\n",
       "      <td>129</td>\n",
       "      <td>2.9</td>\n",
       "      <td>2.4</td>\n",
       "      <td>1</td>\n",
       "      <td>2.6</td>\n",
       "      <td>0.3</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              Name         DOB  Age               Vocation  Smoke  HR  sBP  \\\n",
       "1    Sandra Howard  1993-01-27   53            Tax adviser      0  51  115   \n",
       "4    Robert Wilson  1964-06-23   46  Clinical embryologist      0  61  138   \n",
       "6  Frank Zimmerman  1981-03-04   54         Police officer      0  60  129   \n",
       "\n",
       "   CholesterolBefore  TAG  Survey  CholesterolAfter  Delta   Group  \n",
       "1                1.2  0.6       3               1.0    0.2  Active  \n",
       "4                2.8  2.1       5               2.8    0.0  Active  \n",
       "6                2.9  2.4       1               2.6    0.3  Active  "
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Select rows 2, 5, and 7 using the iloc method\n",
    "df.iloc[[1, 4, 6]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A range object can be used to select a contiguous number of rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>DOB</th>\n",
       "      <th>Age</th>\n",
       "      <th>Vocation</th>\n",
       "      <th>Smoke</th>\n",
       "      <th>HR</th>\n",
       "      <th>sBP</th>\n",
       "      <th>CholesterolBefore</th>\n",
       "      <th>TAG</th>\n",
       "      <th>Survey</th>\n",
       "      <th>CholesterolAfter</th>\n",
       "      <th>Delta</th>\n",
       "      <th>Group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Dylan Patton</td>\n",
       "      <td>1981-10-07</td>\n",
       "      <td>43</td>\n",
       "      <td>Energy manager</td>\n",
       "      <td>0</td>\n",
       "      <td>47</td>\n",
       "      <td>145</td>\n",
       "      <td>1.2</td>\n",
       "      <td>1.2</td>\n",
       "      <td>1</td>\n",
       "      <td>0.7</td>\n",
       "      <td>0.5</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Sandra Howard</td>\n",
       "      <td>1993-01-27</td>\n",
       "      <td>53</td>\n",
       "      <td>Tax adviser</td>\n",
       "      <td>0</td>\n",
       "      <td>51</td>\n",
       "      <td>115</td>\n",
       "      <td>1.2</td>\n",
       "      <td>0.6</td>\n",
       "      <td>3</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.2</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            Name         DOB  Age        Vocation  Smoke  HR  sBP  \\\n",
       "0   Dylan Patton  1981-10-07   43  Energy manager      0  47  145   \n",
       "1  Sandra Howard  1993-01-27   53     Tax adviser      0  51  115   \n",
       "\n",
       "   CholesterolBefore  TAG  Survey  CholesterolAfter  Delta   Group  \n",
       "0                1.2  1.2       1               0.7    0.5  Active  \n",
       "1                1.2  0.6       3               1.0    0.2  Active  "
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Select the first two rows using the iloc method\n",
    "df.iloc[0:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As with Python list objects, shorthand notation can be used fro contiguous rows. The first two rows are selected below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>DOB</th>\n",
       "      <th>Age</th>\n",
       "      <th>Vocation</th>\n",
       "      <th>Smoke</th>\n",
       "      <th>HR</th>\n",
       "      <th>sBP</th>\n",
       "      <th>CholesterolBefore</th>\n",
       "      <th>TAG</th>\n",
       "      <th>Survey</th>\n",
       "      <th>CholesterolAfter</th>\n",
       "      <th>Delta</th>\n",
       "      <th>Group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Dylan Patton</td>\n",
       "      <td>1981-10-07</td>\n",
       "      <td>43</td>\n",
       "      <td>Energy manager</td>\n",
       "      <td>0</td>\n",
       "      <td>47</td>\n",
       "      <td>145</td>\n",
       "      <td>1.2</td>\n",
       "      <td>1.2</td>\n",
       "      <td>1</td>\n",
       "      <td>0.7</td>\n",
       "      <td>0.5</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Sandra Howard</td>\n",
       "      <td>1993-01-27</td>\n",
       "      <td>53</td>\n",
       "      <td>Tax adviser</td>\n",
       "      <td>0</td>\n",
       "      <td>51</td>\n",
       "      <td>115</td>\n",
       "      <td>1.2</td>\n",
       "      <td>0.6</td>\n",
       "      <td>3</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.2</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            Name         DOB  Age        Vocation  Smoke  HR  sBP  \\\n",
       "0   Dylan Patton  1981-10-07   43  Energy manager      0  47  145   \n",
       "1  Sandra Howard  1993-01-27   53     Tax adviser      0  51  115   \n",
       "\n",
       "   CholesterolBefore  TAG  Survey  CholesterolAfter  Delta   Group  \n",
       "0                1.2  1.2       1               0.7    0.5  Active  \n",
       "1                1.2  0.6       3               1.0    0.2  Active  "
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Select the first two rows using the iloc method\n",
    "df.iloc[:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Negative index notation similarly follows the Python list convention. The last three rows are selected below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>DOB</th>\n",
       "      <th>Age</th>\n",
       "      <th>Vocation</th>\n",
       "      <th>Smoke</th>\n",
       "      <th>HR</th>\n",
       "      <th>sBP</th>\n",
       "      <th>CholesterolBefore</th>\n",
       "      <th>TAG</th>\n",
       "      <th>Survey</th>\n",
       "      <th>CholesterolAfter</th>\n",
       "      <th>Delta</th>\n",
       "      <th>Group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>197</th>\n",
       "      <td>Charles Smith</td>\n",
       "      <td>1959-01-30</td>\n",
       "      <td>61</td>\n",
       "      <td>Chartered certified accountant</td>\n",
       "      <td>0</td>\n",
       "      <td>99</td>\n",
       "      <td>212</td>\n",
       "      <td>10.1</td>\n",
       "      <td>5.6</td>\n",
       "      <td>4</td>\n",
       "      <td>9.7</td>\n",
       "      <td>0.4</td>\n",
       "      <td>Control</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>198</th>\n",
       "      <td>Barry Porter</td>\n",
       "      <td>1979-05-30</td>\n",
       "      <td>65</td>\n",
       "      <td>Dancer</td>\n",
       "      <td>1</td>\n",
       "      <td>98</td>\n",
       "      <td>200</td>\n",
       "      <td>10.1</td>\n",
       "      <td>5.3</td>\n",
       "      <td>3</td>\n",
       "      <td>10.0</td>\n",
       "      <td>0.1</td>\n",
       "      <td>Control</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>199</th>\n",
       "      <td>Julie Barrett</td>\n",
       "      <td>1972-07-27</td>\n",
       "      <td>66</td>\n",
       "      <td>Theme park manager</td>\n",
       "      <td>1</td>\n",
       "      <td>102</td>\n",
       "      <td>208</td>\n",
       "      <td>11.1</td>\n",
       "      <td>5.7</td>\n",
       "      <td>2</td>\n",
       "      <td>10.7</td>\n",
       "      <td>0.4</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              Name         DOB  Age                        Vocation  Smoke  \\\n",
       "197  Charles Smith  1959-01-30   61  Chartered certified accountant      0   \n",
       "198   Barry Porter  1979-05-30   65                          Dancer      1   \n",
       "199  Julie Barrett  1972-07-27   66              Theme park manager      1   \n",
       "\n",
       "      HR  sBP  CholesterolBefore  TAG  Survey  CholesterolAfter  Delta  \\\n",
       "197   99  212               10.1  5.6       4               9.7    0.4   \n",
       "198   98  200               10.1  5.3       3              10.0    0.1   \n",
       "199  102  208               11.1  5.7       2              10.7    0.4   \n",
       "\n",
       "       Group  \n",
       "197  Control  \n",
       "198  Control  \n",
       "199   Active  "
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Select the last three rows\n",
    "df.iloc[-3:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Columns can also be index. The `iloc` method can take two arguments in the form `row,column`. The code below extracts the first five rows and columns $1$ and $3$. Note that the index values have to be used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Age</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Dylan Patton</td>\n",
       "      <td>43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Sandra Howard</td>\n",
       "      <td>53</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Samantha Williams</td>\n",
       "      <td>33</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Ashley Hensley</td>\n",
       "      <td>43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Robert Wilson</td>\n",
       "      <td>46</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                Name  Age\n",
       "0       Dylan Patton   43\n",
       "1      Sandra Howard   53\n",
       "2  Samantha Williams   33\n",
       "3     Ashley Hensley   43\n",
       "4      Robert Wilson   46"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Extract the first 5 rows for the first and third columns using the iloc method\n",
    "df.iloc[:5, [0, 2]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `loc` property can be used in a similar fashion. The column names (as a list or a slice) are specified. If the index values were not integers, but strings, those names could also be used. Remember that the row and column names are referred to as labels. Below, the same labels are extracted as above. Note, though, that the row range is now set to `0:4`. When extracting rows and column, ALWAYS use the `row, column` notation. Since two columns are required, they are passed as a Python list object (in square brackets) after the comma. Each column name is passed as a string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Age</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Dylan Patton</td>\n",
       "      <td>43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Sandra Howard</td>\n",
       "      <td>53</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Samantha Williams</td>\n",
       "      <td>33</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Ashley Hensley</td>\n",
       "      <td>43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Robert Wilson</td>\n",
       "      <td>46</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                Name  Age\n",
       "0       Dylan Patton   43\n",
       "1      Sandra Howard   53\n",
       "2  Samantha Williams   33\n",
       "3     Ashley Hensley   43\n",
       "4      Robert Wilson   46"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Extract the first five rows for the Name and Age colunms using the loc method\n",
    "df.loc[:4, ['Name', 'Age']] # Note the difference in row index compared to the iloc method"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:#00FF00\">Task</span>\n",
    "\n",
    "Print the `TAG` and the `Smoke` variables to the screen, but only for observations (rows) $10$ through $15$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:#00FF00\">Solution</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>TAG</th>\n",
       "      <th>Smoke</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>1.7</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>2.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>2.3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>2.1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>2.4</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>2.6</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    TAG  Smoke\n",
       "10  1.7      0\n",
       "11  2.0      0\n",
       "12  2.3      0\n",
       "13  2.1      0\n",
       "14  2.4      0\n",
       "15  2.6      2"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.loc[10:15, ['TAG', 'Smoke']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `iat` indexing extracts a single cell by using its row and column index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Sandra Howard'"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Extract the value in row 2, column 1 using the iat method\n",
    "df.iat[1, 0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style=\"color:#0096FF\">Filtering data</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Filtering data is one of the most useful tasks in data science. This section explores the filtering data by extracting numpy array objects based on criteria (conditions). Conditional operators include `>`, `<`, `<=`, `>=`, `==`, and `!=`. These operators can be used to compare values in a numpy array to a single value or to another numpy array."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Finding unique values in a column</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remember that a sample space of a variable are all the possible values that a variable can take. This is particularly useful when looking at categorical variables. The `unique` method is used to find all the sample space elements in a column. Below, the unique values in the `Smoke` column are filtered."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 2, 1])"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print the unique values in the Smoke column\n",
    "df.Smoke.unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are three unique elements in the sample space of this column. This method is great for surprises that might be hidden in a dataframe such as one or more strings in a numerical data column. A common example would be the `Age` column that has one or two strings such as `thirty-two` in it, instead of the number $32$. Strings in a numerical data column will prevent calculations on that column and such errors in the data must be corrected."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Ages of all non-smokers</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `Smoke` column contain information about the smoking habits of the respondents in the data set. The sample space contains three integers, $0$ for not smoking, $1$ for smoking, and $2$ for previous smoking."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Creating an array that contains the ages of only the patients who do not smoke in our dataframe is created below. To do this, indexing is used directly. A conditional is used to include only $0$ subjects (`df.Smoke == 0`). The column is then referenced, which is `Age`, followed by the `to_numpy` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([43, 53, 33, 43, 46, 48, 54, 58, 44, 31, 45, 35, 49, 56, 57, 35, 50,\n",
       "       49, 63, 45, 51, 40, 47, 41, 47, 38, 54, 30, 46, 64, 40, 45, 65, 55,\n",
       "       53, 54, 72, 32, 38, 59, 53, 42, 38, 51, 37, 36, 48, 49, 62, 39, 74,\n",
       "       42, 72, 61, 33, 30, 44, 71, 49, 75, 43, 55, 38, 36, 46, 60, 57, 69,\n",
       "       56, 66, 60, 42, 32, 31, 56, 35, 63, 54, 68, 72, 40, 54, 62, 74, 62,\n",
       "       41, 61, 61])"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Select the ages of those who do not smoke and assign it to the variable age_nonsmokers (as a numpy array)\n",
    "age_nonsmokers = df.loc[df.Smoke == 0]['Age'].to_numpy()\n",
    "age_nonsmokers # Print the numpy array to the screen"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The code, above may seem a bit difficult at first. It does read rather like an English language sentence, though. Take the dataframe object. Filter the rows in column `Smoke` that are $0$. For all of these rows return the `Age` values as a numpy array."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an alternative, the `loc` indexing can be used, passing a row and a column specification as arguments. The row interrogates the `Smoke` column and filters only those with a $0$ entry. The column is then specified to the the `Age` column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([43, 53, 33, 43, 46, 48, 54, 58, 44, 31, 45, 35, 49, 56, 57, 35, 50,\n",
       "       49, 63, 45, 51, 40, 47, 41, 47, 38, 54, 30, 46, 64, 40, 45, 65, 55,\n",
       "       53, 54, 72, 32, 38, 59, 53, 42, 38, 51, 37, 36, 48, 49, 62, 39, 74,\n",
       "       42, 72, 61, 33, 30, 44, 71, 49, 75, 43, 55, 38, 36, 46, 60, 57, 69,\n",
       "       56, 66, 60, 42, 32, 31, 56, 35, 63, 54, 68, 72, 40, 54, 62, 74, 62,\n",
       "       41, 61, 61])"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Use the loc method to select the ages of those who do not smoke and assign it to the variable age_nonsmokers (as a numpy array)\n",
    "age_nonsmokers = df.loc[df.Smoke == 0, 'Age'].to_numpy()\n",
    "age_nonsmokers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The different ways to interact with pandas adds to its power. Without the `to_numpy` method, the code above returns a pandas series object. Every data scientist has their own preferences and certain tasks require specific approaches."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0      43\n",
       "1      53\n",
       "2      33\n",
       "3      43\n",
       "4      46\n",
       "       ..\n",
       "180    74\n",
       "183    62\n",
       "188    41\n",
       "193    61\n",
       "197    61\n",
       "Name: Age, Length: 88, dtype: int64"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Use the loc method to select the ages of those who do not smoke and assign it to the variable age_nonsmokers\n",
    "age_nonsmokers = df.loc[df.Smoke == 0, 'Age']\n",
    "age_nonsmokers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Statistical analysis can be done on either the series or numpy objects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "50.09090909090909"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Calculate the average age of those who do not smoke\n",
    "age_nonsmokers.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:#00FF00\">Task</span>\n",
    "\n",
    "The `to_list` method creates a Python list object. Use the `to_list` method to create a Python list object of the `Age` values for those that have a value in the `HR` (heart rate) column that is more than $72$. Print only the first $10$ values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:#00FF00\">Solution</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[57, 52, 49, 70, 45, 53, 50, 33, 30, 44]"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.loc[df.HR > 72, 'Age'].to_list()[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Non-smoker ages where survey question choice is 3</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Filtering by two criteria (two columns), `Age` and `Survey` is shown below. The filtering can either refer to __and__ or __or__. In the first, all the criteria are required to be met and in the second, only one of the criteria need be met (return a `True` value)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The symbol for __and__ is `&` and for __or__ is `|`. Below, `&` is used since both criteria must be met. Each filter is created in a set of parentheses. The code uses the `row, column` notation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1      53\n",
       "2      33\n",
       "11     35\n",
       "14     57\n",
       "27     47\n",
       "32     54\n",
       "38     45\n",
       "45     54\n",
       "50     53\n",
       "112    36\n",
       "130    68\n",
       "183    62\n",
       "Name: Age, dtype: int64"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Select the non-smokers who choice 3 in the Survey column and return their ages (assigned to the variable non_smokers_choice_3)\n",
    "non_smokers_choice_3 = df.loc[(df.Smoke == 0) & (df.Survey == 3), 'Age']\n",
    "non_smokers_choice_3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In English the code reads: _Take the `df` dataframe object and look down the rows of the `Smoke` and `Survey` columns. Return only the rows where `Smoke` is $0$ AND `Survey` is $3$. Then return the `Age` column values for all these rows fulfilling both criteria_."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Never smoked or satisfaction score is greater than 3</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now filtering by __or__ is shown. The symbol for __or__ is `|`. Below, `|` is used since only one of the criteria must be met. Each filter is created in a set of parentheses. The code uses the `row, column` notation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It may be useful to create variable that holds the criteria first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Filter by Smoke == 0 or Survey > 3 and assign the filter to the variable crit\n",
    "crit = (df.Smoke == 0) | (df.Survey > 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `crit` variable can now be used in the filtering."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    43\n",
       "1    53\n",
       "2    33\n",
       "3    43\n",
       "4    46\n",
       "Name: Age, dtype: int64"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Use the crit criterium and assign to the variable nonsmoker_or_survey_gt_3\n",
    "nonsmoker_or_survey_gt_3 = df.loc[crit, 'Age']\n",
    "nonsmoker_or_survey_gt_3[:5] # Print the first 5 rows"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:#00FF00\">Task</span>\n",
    "\n",
    "Create a variable named selection that will select for all the observations who never smoked, as a heart rate, `HR`, less than $80$, and a systolic blood pressure, `sBP`, of less than $120$. Use this selection to create a numpy $n$-dimensional array of the `Age` and triacylglycerol, `TAG`, variable values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:#00FF00\">Solution</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<bound method DataFrame.to_numpy of     Age  TAG\n",
       "1    53  0.6\n",
       "3    43  1.6\n",
       "8    44  2.4\n",
       "11   35  2.0\n",
       "21   45  2.1\n",
       "41   55  3.1>"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "selection = (df.Smoke == 0) & (df.HR < 80) & (df.sBP < 120)\n",
    "df.loc[selection, ['Age', 'TAG']].to_numpy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Create a new dataframe object that only contains participants younger than 50</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Instead of just an array of values, a new dataframe object can also be created using filtering. (Because it is a part of an existing dataframe object, some Data Scientist refer to it as a sub-dataframe object.) It includes all the columns (variables), but only for patients up to and including $49$ years of age. This is very simple to achieve."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>DOB</th>\n",
       "      <th>Age</th>\n",
       "      <th>Vocation</th>\n",
       "      <th>Smoke</th>\n",
       "      <th>HR</th>\n",
       "      <th>sBP</th>\n",
       "      <th>CholesterolBefore</th>\n",
       "      <th>TAG</th>\n",
       "      <th>Survey</th>\n",
       "      <th>CholesterolAfter</th>\n",
       "      <th>Delta</th>\n",
       "      <th>Group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Dylan Patton</td>\n",
       "      <td>1981-10-07</td>\n",
       "      <td>43</td>\n",
       "      <td>Energy manager</td>\n",
       "      <td>0</td>\n",
       "      <td>47</td>\n",
       "      <td>145</td>\n",
       "      <td>1.2</td>\n",
       "      <td>1.2</td>\n",
       "      <td>1</td>\n",
       "      <td>0.7</td>\n",
       "      <td>0.5</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Samantha Williams</td>\n",
       "      <td>1973-12-21</td>\n",
       "      <td>33</td>\n",
       "      <td>IT consultant</td>\n",
       "      <td>0</td>\n",
       "      <td>54</td>\n",
       "      <td>120</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1.3</td>\n",
       "      <td>3</td>\n",
       "      <td>1.7</td>\n",
       "      <td>0.3</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Ashley Hensley</td>\n",
       "      <td>1981-12-01</td>\n",
       "      <td>43</td>\n",
       "      <td>Nurse, children's</td>\n",
       "      <td>0</td>\n",
       "      <td>54</td>\n",
       "      <td>103</td>\n",
       "      <td>2.1</td>\n",
       "      <td>1.6</td>\n",
       "      <td>4</td>\n",
       "      <td>2.1</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Robert Wilson</td>\n",
       "      <td>1964-06-23</td>\n",
       "      <td>46</td>\n",
       "      <td>Clinical embryologist</td>\n",
       "      <td>0</td>\n",
       "      <td>61</td>\n",
       "      <td>138</td>\n",
       "      <td>2.8</td>\n",
       "      <td>2.1</td>\n",
       "      <td>5</td>\n",
       "      <td>2.8</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Leslie Diaz</td>\n",
       "      <td>1994-08-25</td>\n",
       "      <td>48</td>\n",
       "      <td>Politician's assistant</td>\n",
       "      <td>0</td>\n",
       "      <td>59</td>\n",
       "      <td>122</td>\n",
       "      <td>2.8</td>\n",
       "      <td>1.4</td>\n",
       "      <td>4</td>\n",
       "      <td>2.6</td>\n",
       "      <td>0.2</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                Name         DOB  Age                Vocation  Smoke  HR  sBP  \\\n",
       "0       Dylan Patton  1981-10-07   43          Energy manager      0  47  145   \n",
       "2  Samantha Williams  1973-12-21   33           IT consultant      0  54  120   \n",
       "3     Ashley Hensley  1981-12-01   43       Nurse, children's      0  54  103   \n",
       "4      Robert Wilson  1964-06-23   46   Clinical embryologist      0  61  138   \n",
       "5        Leslie Diaz  1994-08-25   48  Politician's assistant      0  59  122   \n",
       "\n",
       "   CholesterolBefore  TAG  Survey  CholesterolAfter  Delta   Group  \n",
       "0                1.2  1.2       1               0.7    0.5  Active  \n",
       "2                2.0  1.3       3               1.7    0.3  Active  \n",
       "3                2.1  1.6       4               2.1    0.0  Active  \n",
       "4                2.8  2.1       5               2.8    0.0  Active  \n",
       "5                2.8  1.4       4               2.6    0.2  Active  "
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Create a new dataframe object that only contains participants younger than 50 and assign it to the variable df_younger_than_50\n",
    "df_younger_than_50 = df.loc[df.Age < 50]\n",
    "df_younger_than_50[:5] # Display the first five rows"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since this new dataframe object has only data filetered by a crtiterium, it is easier to calculate any required statistics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "69.45348837209302"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Mean HR\n",
    "df_younger_than_50.HR.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Create a new dataframe for participants with a restricted list of job titles</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By making use of the `isin` method, certain values in the sample space of a variable can be specified. The code below filters for `IT consultant`, `Energy manager`, and `Clinical embryologist` in the `Vocation` column. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To achive this, a Python list object with the sample space elements, is created first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a Python list object with the string element 'IT consultant', 'Energy manager', 'Clinical embryologist' and assign it to the variable jobs\n",
    "jobs = ['IT consultant', 'Energy manager', 'Clinical embryologist']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A criterium is created using the `jobs` list object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a criterium\n",
    "crit = df.Vocation.isin(jobs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The criterium is used to filter the dataframe object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>DOB</th>\n",
       "      <th>Age</th>\n",
       "      <th>Vocation</th>\n",
       "      <th>Smoke</th>\n",
       "      <th>HR</th>\n",
       "      <th>sBP</th>\n",
       "      <th>CholesterolBefore</th>\n",
       "      <th>TAG</th>\n",
       "      <th>Survey</th>\n",
       "      <th>CholesterolAfter</th>\n",
       "      <th>Delta</th>\n",
       "      <th>Group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Dylan Patton</td>\n",
       "      <td>1981-10-07</td>\n",
       "      <td>43</td>\n",
       "      <td>Energy manager</td>\n",
       "      <td>0</td>\n",
       "      <td>47</td>\n",
       "      <td>145</td>\n",
       "      <td>1.2</td>\n",
       "      <td>1.2</td>\n",
       "      <td>1</td>\n",
       "      <td>0.7</td>\n",
       "      <td>0.5</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Samantha Williams</td>\n",
       "      <td>1973-12-21</td>\n",
       "      <td>33</td>\n",
       "      <td>IT consultant</td>\n",
       "      <td>0</td>\n",
       "      <td>54</td>\n",
       "      <td>120</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1.3</td>\n",
       "      <td>3</td>\n",
       "      <td>1.7</td>\n",
       "      <td>0.3</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Robert Wilson</td>\n",
       "      <td>1964-06-23</td>\n",
       "      <td>46</td>\n",
       "      <td>Clinical embryologist</td>\n",
       "      <td>0</td>\n",
       "      <td>61</td>\n",
       "      <td>138</td>\n",
       "      <td>2.8</td>\n",
       "      <td>2.1</td>\n",
       "      <td>5</td>\n",
       "      <td>2.8</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>188</th>\n",
       "      <td>Joan Chavez</td>\n",
       "      <td>1999-10-07</td>\n",
       "      <td>41</td>\n",
       "      <td>Energy manager</td>\n",
       "      <td>0</td>\n",
       "      <td>93</td>\n",
       "      <td>182</td>\n",
       "      <td>9.1</td>\n",
       "      <td>5.0</td>\n",
       "      <td>2</td>\n",
       "      <td>8.8</td>\n",
       "      <td>0.3</td>\n",
       "      <td>Control</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                  Name         DOB  Age               Vocation  Smoke  HR  \\\n",
       "0         Dylan Patton  1981-10-07   43         Energy manager      0  47   \n",
       "2    Samantha Williams  1973-12-21   33          IT consultant      0  54   \n",
       "4        Robert Wilson  1964-06-23   46  Clinical embryologist      0  61   \n",
       "188        Joan Chavez  1999-10-07   41         Energy manager      0  93   \n",
       "\n",
       "     sBP  CholesterolBefore  TAG  Survey  CholesterolAfter  Delta    Group  \n",
       "0    145                1.2  1.2       1               0.7    0.5   Active  \n",
       "2    120                2.0  1.3       3               1.7    0.3   Active  \n",
       "4    138                2.8  2.1       5               2.8    0.0   Active  \n",
       "188  182                9.1  5.0       2               8.8    0.3  Control  "
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Filter for job in the Vocation column and assign it to the variable df_jobs\n",
    "df_jobs = df.loc[crit]\n",
    "df_jobs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `str.contains` method can be used to filter for a string in a column. The code below filters for all the participants who have `manager` in their job title."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Build a criterium with the str.contains method for the word manager\n",
    "crit = df.Vocation.str.contains('manager', na=False) # Flag for missing values (see later)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>DOB</th>\n",
       "      <th>Age</th>\n",
       "      <th>Vocation</th>\n",
       "      <th>Smoke</th>\n",
       "      <th>HR</th>\n",
       "      <th>sBP</th>\n",
       "      <th>CholesterolBefore</th>\n",
       "      <th>TAG</th>\n",
       "      <th>Survey</th>\n",
       "      <th>CholesterolAfter</th>\n",
       "      <th>Delta</th>\n",
       "      <th>Group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Dylan Patton</td>\n",
       "      <td>1981-10-07</td>\n",
       "      <td>43</td>\n",
       "      <td>Energy manager</td>\n",
       "      <td>0</td>\n",
       "      <td>47</td>\n",
       "      <td>145</td>\n",
       "      <td>1.2</td>\n",
       "      <td>1.2</td>\n",
       "      <td>1</td>\n",
       "      <td>0.7</td>\n",
       "      <td>0.5</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <td>Mr. Tyler Strickland DDS</td>\n",
       "      <td>1940-08-27</td>\n",
       "      <td>46</td>\n",
       "      <td>Tourist information centre manager</td>\n",
       "      <td>0</td>\n",
       "      <td>62</td>\n",
       "      <td>136</td>\n",
       "      <td>4.1</td>\n",
       "      <td>2.3</td>\n",
       "      <td>2</td>\n",
       "      <td>3.8</td>\n",
       "      <td>0.3</td>\n",
       "      <td>Control</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>54</th>\n",
       "      <td>Stephanie Jacobs</td>\n",
       "      <td>1977-06-18</td>\n",
       "      <td>38</td>\n",
       "      <td>Estate manager/land agent</td>\n",
       "      <td>0</td>\n",
       "      <td>69</td>\n",
       "      <td>139</td>\n",
       "      <td>4.4</td>\n",
       "      <td>2.7</td>\n",
       "      <td>2</td>\n",
       "      <td>4.0</td>\n",
       "      <td>0.4</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>55</th>\n",
       "      <td>Juan Johnson</td>\n",
       "      <td>1956-12-09</td>\n",
       "      <td>51</td>\n",
       "      <td>Logistics and distribution manager</td>\n",
       "      <td>0</td>\n",
       "      <td>65</td>\n",
       "      <td>141</td>\n",
       "      <td>4.5</td>\n",
       "      <td>2.9</td>\n",
       "      <td>5</td>\n",
       "      <td>4.0</td>\n",
       "      <td>0.5</td>\n",
       "      <td>Control</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>71</th>\n",
       "      <td>Tina Martinez</td>\n",
       "      <td>1941-05-31</td>\n",
       "      <td>74</td>\n",
       "      <td>Passenger transport manager</td>\n",
       "      <td>1</td>\n",
       "      <td>69</td>\n",
       "      <td>146</td>\n",
       "      <td>4.8</td>\n",
       "      <td>3.1</td>\n",
       "      <td>4</td>\n",
       "      <td>4.5</td>\n",
       "      <td>0.3</td>\n",
       "      <td>Active</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                        Name         DOB  Age  \\\n",
       "0               Dylan Patton  1981-10-07   43   \n",
       "34  Mr. Tyler Strickland DDS  1940-08-27   46   \n",
       "54          Stephanie Jacobs  1977-06-18   38   \n",
       "55              Juan Johnson  1956-12-09   51   \n",
       "71             Tina Martinez  1941-05-31   74   \n",
       "\n",
       "                              Vocation  Smoke  HR  sBP  CholesterolBefore  \\\n",
       "0                       Energy manager      0  47  145                1.2   \n",
       "34  Tourist information centre manager      0  62  136                4.1   \n",
       "54           Estate manager/land agent      0  69  139                4.4   \n",
       "55  Logistics and distribution manager      0  65  141                4.5   \n",
       "71         Passenger transport manager      1  69  146                4.8   \n",
       "\n",
       "    TAG  Survey  CholesterolAfter  Delta    Group  \n",
       "0   1.2       1               0.7    0.5   Active  \n",
       "34  2.3       2               3.8    0.3  Control  \n",
       "54  2.7       2               4.0    0.4   Active  \n",
       "55  2.9       5               4.0    0.5  Control  \n",
       "71  3.1       4               4.5    0.3   Active  "
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Create a new dataframe object with the criterium and assign it to the variable df_manager\n",
    "df_manager = df.loc[crit]\n",
    "df_manager[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <span style=\"color:#0096FF\">Updating or changing the values in a dataframe</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another valueble skill is to be able to change actual data in a dataframe object. Fortunately, datadrame objects can be manipulated in many ways."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Renaming columns</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The names of individual columns can be replaced with the `rename` method using a dictionary. Below, the `Name` column is renamed to `Participant`. To make the changes permanent, the `inplace` argument value to `True`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['Patient', 'DOB', 'Age', 'Vocation', 'Smoke', 'HR', 'sBP',\n",
       "       'CholesterolBefore', 'TAG', 'Survey', 'CholesterolAfter', 'Delta',\n",
       "       'Group'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Replace the Name column with the Patient\n",
    "df.rename(columns={'Name': 'Patient'}, inplace=True)\n",
    "df.columns # Call the column names"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Deleting columns</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `drop` method is used to delete a column. The `axis` argument is set to $1$ to indicate that the column is to be deleted. The `inplace` argument value is set to `True` to make the changes permanent."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `Delta` column is deleted from the dataframe object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Delete the Delta column\n",
    "df.drop(columns='Delta', inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Add a value of 2 to the age of each participant</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In specific types of research, personal data are obfuscated to protect the privacy of the people in the dataset. In a simple case, $2$ is added the age of every participant when capturing the data. In reality, they are all $2$ years younger."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    45\n",
       "1    55\n",
       "2    35\n",
       "3    45\n",
       "4    48\n",
       "Name: Age, dtype: int64"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Add 2 to each value in the Age column\n",
    "df.Age = df.Age + 2 # Note that the Age column is reassigned\n",
    "df.Age[:5] # Display the first five rows"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are more than one way to achieve a similar goal. One way is to create a function and then use the `apply` method to apply the function to the values in a column."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "User-defined functions are created using the `def` and `return` keywords. The former tells Python to create a new function. After a space follows the name is given to the function. A set of parentheses follow that contains a placeholder for the argument (here called a parameter). In its simplest form, the `return` keyword follows. As the name indicates this section returns a value for the function. Below, it is clear from the code that $x$ is the parameter and will take an argument value. The function then adds $2$ to the argument value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a user-defined function named add2 that adds 2 to a value\n",
    "def add2(x):\n",
    "    return x + 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `apply` method is now used and the `add2` function is used. The value in each row is now increased by $2$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    47\n",
       "1    57\n",
       "2    37\n",
       "3    47\n",
       "4    50\n",
       "Name: Age, dtype: int64"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Apply the add2 function to the Age column\n",
    "df.Age = df.Age.apply(add2)\n",
    "df.Age[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:#00FF00\">Task</span>\n",
    "\n",
    "Create a new variable named `AgeHR` that is the product of the values in the variables `Age` and `HR`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:#00FF00\">Solution</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['AgeHR'] = df.Age * df.HR"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Changing the values in a column</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Dictionaries can be used to change the values in a column. The keys are the original values and the values are the new values. The `replace` method is used to change the values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The code below returns the sample space (unique) elements in the `Group` column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['Active', 'Control'], dtype=object)"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Return the unique values in the Group column\n",
    "df.Group.unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The values are `Active` and `Control`. These are replaced with `1` and `0` respectively."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use the replace method to replace the values in the Group column with 1 for Active and 0 for Control\n",
    "df.Group.replace({'Active': 1, 'Control': 0}, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first five observations in the `Group` column are printed to the screen to confirm the changes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1\n",
       "1    1\n",
       "2    1\n",
       "3    1\n",
       "4    1\n",
       "Name: Group, dtype: int64"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print the first 5 rows of the Group column\n",
    "df.Group[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Splitting the values in a column into two columns</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Adding columns is an often used technique . It is as simple as stating the new name in square brackets as a string and then adding a list of values. The list of values that are added must be the same length (number of rows) as the dataframe."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The code below creates two new columns, `FirstName` and `LastName`. The `split` method is used to split the values in the `Patient` column. The `expand` argument is set to `True` to create two new columns. The `str` method is used to indicate that the `split` method is to be applied to a string, in this case a space."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split the values in the Patient column on the empty space between the first and last names and assign it to the variable split_participant\n",
    "split_participant = df.Patient.str.split(' ', expand=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Indexing is used to extract the two lists of values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create new columns in df named FirstName and LastName\n",
    "df['FirstName'] = split_participant[0]\n",
    "df['LastName'] = split_participant[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first five rows of the `FirstName`, `LastName`, and the original `Patient` columns are printed to the screen."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>FirstName</th>\n",
       "      <th>LastName</th>\n",
       "      <th>Patient</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Dylan</td>\n",
       "      <td>Patton</td>\n",
       "      <td>Dylan Patton</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Sandra</td>\n",
       "      <td>Howard</td>\n",
       "      <td>Sandra Howard</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Samantha</td>\n",
       "      <td>Williams</td>\n",
       "      <td>Samantha Williams</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Ashley</td>\n",
       "      <td>Hensley</td>\n",
       "      <td>Ashley Hensley</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Robert</td>\n",
       "      <td>Wilson</td>\n",
       "      <td>Robert Wilson</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  FirstName  LastName            Patient\n",
       "0     Dylan    Patton       Dylan Patton\n",
       "1    Sandra    Howard      Sandra Howard\n",
       "2  Samantha  Williams  Samantha Williams\n",
       "3    Ashley   Hensley     Ashley Hensley\n",
       "4    Robert    Wilson      Robert Wilson"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print first 5 rows of FirstName and LastName and Patient\n",
    "df[['FirstName', 'LastName', 'Patient']][:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Categorize a numerical variable</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Intervals, called bins, can be created to change a continuous numerical variable into a categorical variable. The pandas `cut` function is used to create the bins. The `bins` argument is used to specify the intervals. The `labels` argument is used to specify the labels for each bins. The `right` argument is used to specify whether the right or left side of the interval is closed. The `include_lowest` argument is used to specify whether the lowest value is included in the first bin."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The code below creates three bins for the `CholesterolBefore` column. The minimum and maximum values are calculated first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.2"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Minimum CholesterolBefore value\n",
    "df.CholesterolBefore.min()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "11.1"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Maximum CholesterolBefore value\n",
    "df.CholesterolBefore.max()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Three bins are created with four values used in the `cut` function to create the bins $\\text{low}=[0,2.5)$, $\\text{normal}=[2.5,5.0)$, and $\\text{high}=[5.0,20)$. The intervals safely contain the range of values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    low\n",
       "1    low\n",
       "2    low\n",
       "3    low\n",
       "4    low\n",
       "Name: CholesterolBeforeLevel, dtype: category\n",
       "Categories (3, object): ['low' < 'normal' < 'high']"
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Create a new column called CholesterolBeforeLevel using the cut function with bins=[0,5,10,20], which is right False and has the labels ['low', 'normal', 'high']\n",
    "df['CholesterolBeforeLevel'] = pd.cut(df.CholesterolBefore, bins=[0, 5, 10, 20], right=False, labels=['low', 'normal', 'high'])\n",
    "df.CholesterolBeforeLevel[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:#00FF00\">Task</span>\n",
    "\n",
    "Create a new variable in the dataframe called `HRLevel` from the `HR` column with intervals $(0,65], (65,78], (78,1000)$ with labels `low`, `normal`, and `high` respectively."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:#00FF00\">Solution</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    low\n",
       "1    low\n",
       "2    low\n",
       "3    low\n",
       "4    low\n",
       "Name: HRLevel, dtype: category\n",
       "Categories (3, object): ['low' < 'normal' < 'high']"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['HRLevel'] = pd.cut(df.HR, bins = [0, 65, 78, 1000], right=True, labels=['low', 'normal', 'high'])\n",
    "df.HRLevel[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <span style=\"color:#0096FF\">Missing data</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">The numpy nan value</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is very often that datasets contain missing data. The numpy library has a specific entity called a `nan` value. This stands for _not a number_. The `nan` value is used to indicate that a value is missing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "nan"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print the numpy nan value\n",
    "np.nan"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A list object or an array can contain a `nan` value. Computation is not possible since the missing value is unknown."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[1, 2, 3, nan]"
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Create a list with three integers and a nan value and assign it to the variable my_list\n",
    "my_list = [1, 2, 3, np.nan]\n",
    "my_list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "nan"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Calculate the mean of my_list\n",
    "np.mean(my_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The solution is `nan` since it is impossible to know what the mean of the four values are."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">A dataframe object with missing data</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `MissingData.csv` file contains missing data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the MissingData.csv file and assign it to the variable df_missing\n",
    "df_missing = pd.read_csv('https://raw.githubusercontent.com/juanklopper/TutorialData/main/MissingData.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Printing the first $10$ rows to the screen shows some of the missing values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>salary</th>\n",
       "      <th>previouscompany</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>57.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>56.0</td>\n",
       "      <td>50927.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>46.0</td>\n",
       "      <td>75500.0</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>84417.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>60.0</td>\n",
       "      <td>63002.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>54.0</td>\n",
       "      <td>54652.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>NaN</td>\n",
       "      <td>65739.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>64.0</td>\n",
       "      <td>89397.0</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>60.0</td>\n",
       "      <td>77797.0</td>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>61.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    age   salary  previouscompany\n",
       "0  57.0      NaN              1.0\n",
       "1  56.0  50927.0              NaN\n",
       "2  46.0  75500.0              3.0\n",
       "3   NaN  84417.0              NaN\n",
       "4  60.0  63002.0              1.0\n",
       "5  54.0  54652.0              NaN\n",
       "6   NaN  65739.0              1.0\n",
       "7  64.0  89397.0              3.0\n",
       "8  60.0  77797.0              4.0\n",
       "9  61.0      NaN              1.0"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print the first 10 rows of df_missing\n",
    "df_missing[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Deleting missing data</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `dropna` method deletes all rows that contain missing data for any of the columns in that row. All the data for that row (observation) are lost. The removal is not permanent. Below, the dataframe with dropped rows is reassigned to a new variable, leaving the oriiginal intact."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Delete rows with missing values and assign to a new variable complete_data_df\n",
    "complete_data_df = df_missing.dropna()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `axis` argument is set to $0$ to indicate that rows are to be dropped. This is the default. If the value is set to $1$, then any column with a missing value is dropped."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `isna` method can be used to return `True` and `False` values along a column. This is done for the `age` column below. `True` indicates that a value is missing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     False\n",
       "1     False\n",
       "2     False\n",
       "3      True\n",
       "4     False\n",
       "5     False\n",
       "6      True\n",
       "7     False\n",
       "8     False\n",
       "9     False\n",
       "10    False\n",
       "11     True\n",
       "12    False\n",
       "13    False\n",
       "14    False\n",
       "15     True\n",
       "16    False\n",
       "17    False\n",
       "18    False\n",
       "19    False\n",
       "20    False\n",
       "21    False\n",
       "22    False\n",
       "23    False\n",
       "24    False\n",
       "25    False\n",
       "26    False\n",
       "27    False\n",
       "28    False\n",
       "29    False\n",
       "30    False\n",
       "31    False\n",
       "Name: age, dtype: bool"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Use .isna() for the age column\n",
    "df_missing.age.isna()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since `True` is stored as $1$, the `sum` method can be used to count all the rows (observations) with missing data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Sum over all the missing data in the age column\n",
    "df_missing.age.isna().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are $4$ missing values in the `age` column."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Replacing missing values</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The process of creating values to fill in missing data is called __data imputation__ and is a seperate and complicated subject. The pandas library provides a `fillna` method for filling in the missing data with simple calculations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `fillna` method takes a `method` argument. This can be set to `ffil` and `bfill`. As the argument values names imply, the value in the previous or subsequent row will be used to _fill in_ the missing value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/var/folders/f9/9znwn5gx653_4p99htssbq400000gn/T/ipykernel_1250/170363983.py:2: FutureWarning: Series.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.\n",
      "  df_missing.age.fillna(method='ffill')[:5]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0    57.0\n",
       "1    56.0\n",
       "2    46.0\n",
       "3    46.0\n",
       "4    60.0\n",
       "Name: age, dtype: float64"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Use forward fill to impute the missing values in the age column\n",
    "df_missing.age.fillna(method='ffill')[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The missing value for index $3$ has been replaced by the $46.0$ value at index $2$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Values for missing data can also be used. The median of the `age` column is used to replace the missing values. Note that the `median` method used for a pandas series object ignores the missing data unlike a Python list object or a numpy array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    57.0\n",
       "1    56.0\n",
       "2    46.0\n",
       "3    57.0\n",
       "4    60.0\n",
       "Name: age, dtype: float64"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Fill the missing age values with the median age\n",
    "df_missing.age.fillna(df_missing.age.median())[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Default missing values</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is common to use default values when data is not available at the time of capture. The `DefaultMissingData.csv` file uses the values $999$, `Nil` and `Missing` to indicate missing data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>salary</th>\n",
       "      <th>previous_company</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>57</td>\n",
       "      <td>Nil</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>56</td>\n",
       "      <td>50927</td>\n",
       "      <td>Missing</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>46</td>\n",
       "      <td>75500</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>999</td>\n",
       "      <td>84417</td>\n",
       "      <td>Missing</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>60</td>\n",
       "      <td>63002</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   age salary previous_company\n",
       "0   57    Nil                1\n",
       "1   56  50927          Missing\n",
       "2   46  75500                3\n",
       "3  999  84417          Missing\n",
       "4   60  63002                1"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Import the DefaultMissingData.scv file and assign ot to the variable default_missing_df\n",
    "default_missing_df = pd.read_csv('https://raw.githubusercontent.com/juanklopper/TutorialData/main/DefaultMissingData.csv')\n",
    "default_missing_df[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `na_values` argument can take a Python list object that contains all the default values that indicate missing values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>salary</th>\n",
       "      <th>previous_company</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>57.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>56.0</td>\n",
       "      <td>50927.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>46.0</td>\n",
       "      <td>75500.0</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>84417.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>60.0</td>\n",
       "      <td>63002.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    age   salary  previous_company\n",
       "0  57.0      NaN               1.0\n",
       "1  56.0  50927.0               NaN\n",
       "2  46.0  75500.0               3.0\n",
       "3   NaN  84417.0               NaN\n",
       "4  60.0  63002.0               1.0"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Reimport the csv file and use the na_values argument to specify the missing values 999, Nil, and Missing\n",
    "default_missing_df = pd.read_csv('DefaultMissingData.csv', na_values=[999, 'Nil', 'Missing'])\n",
    "default_missing_df[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The missing values are now correctly indicates as `nan` values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <span style=\"color:#0096FF\">Quiz questions</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color:#FFD700\">Questions</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. How do you import the pandas library in Python?\n",
    "\n",
    "2. How do you read a CSV file named `data.csv` into a pandas DataFrame?\n",
    "\n",
    "3. How do you display the first $5$ rows of a DataFrame named `df`?\n",
    "\n",
    "4. How do you find the number of rows and columns in a DataFrame `df`?\n",
    "\n",
    "5. How do you print the names of all the columns in a DataFrame `df`?\n",
    "\n",
    "6. How do you get the summary information of the DataFrame `df` including data types and non-null values?\n",
    "\n",
    "7. How do you select only the rows in DataFrame `df` where the column `A` is greater than $5$?\n",
    "\n",
    "8. How do you select the columns `A` and `B` from the DataFrame `df`?\n",
    "\n",
    "9. How do you rename the column `A` to `Alpha` in DataFrame `df`?\n",
    "\n",
    "10. How do you replace all occurrences of the value $999$ in DataFrame `df` with `nan`?\n",
    "\n",
    "11. How do you drop all rows in DataFrame `df` that contain missing values?\n",
    "\n",
    "12. How do you fill all missing values in DataFrame `df` with $0$?\n",
    "\n",
    "13. How do you calculate the mean of column `A` in DataFrame `df`?\n",
    "\n",
    "14. How do you create a new column `C` in DataFrame `df` that is the sum of columns `A` and `B`?\n",
    "\n",
    "15. How do you save the DataFrame `df` to a CSV file named `output.csv`? (Find the solution by searching the pandas documentation.)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "datascience",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.1"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}