{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 8 | WORKING WITH DATA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

This chapter of an Introduction to Health Data Science by Dr JH Klopper is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data science, by its name and nature, requires access to data. Images, sound files, text, and much more, can be represented as data. This lecture explores the use of Python for tabular data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tabular data exists in rows and columns, either extracted from an image, a database, or similar structures, and represented as an array. An array is a set of values in rows and columns (and along even more dimensions as in the case of color images, where rows and column are stacked _on top of_ each other representing red, green, and blue channels). There is a fantastic Python package for importing such tabular data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The __pandas__ package has much to do with the success of Python as a programming language for statistical analysis and data science. It is an enormous package and is used to import data, to manipulate data, to do calculations with data, and even create graphs and plots using the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook provides a glimpse into the usefulness of the pandas package by importing data captured in a spreadsheet file and extracting information from it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ## Packages used in this notebook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is useful to import all packages at the start of a notebook. This allows for keeping track of what extensions to the Python language are used in the notebook. Namespace abbreviations are used to make it easier to refer to the packages later in the notebook." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd # Package to work with data\n", "\n", "import numpy as np # Numerical analysis package" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ## Importing data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data used in this notebook is a spreadsheet file in comma separated values (CSV) format called `data.csv`. The file is imported using the pandas `read_csv` function. Since it is not a Python function, the package in which the function is hosted, has to be referenced (depending on the way the package was imported). This is done by preceding the function with the pandas namespace abbreviation, `pd`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `data.csv` file is in the same folder (directory) as the notebook. If it was in a different folder, the path to the file would have to be specified." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Import the data.csv file as assign it to the variable df\n", "df = pd.read_csv('data.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The file is also available on GitHub as `PUBH1142Chap0801.csv` and is imported below." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Import the file https://raw.githubusercontent.com/juanklopper/TutorialData/main/PUBH1142Chap0801.csv from GitHub and assign it to the variable df\n", "df = pd.read_csv('https://raw.githubusercontent.com/juanklopper/TutorialData/main/PUBH1142Chap0801.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `type` function used below, shows that the object assigned to the `df` computer variable is a dataframe object." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Type of df\n", "type(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The methods and attributes of the `df` object can be printed using the `dir` function. The methods are the functions that can be applied to the object. The attributes are the properties of the object." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Age',\n", " 'CholesterolAfter',\n", " 'CholesterolBefore',\n", " 'DOB',\n", " 'Delta',\n", " 'Group',\n", " 'HR',\n", " 'Name',\n", " 'Smoke',\n", " 'Survey',\n", " 'T',\n", " 'TAG',\n", " 'Vocation',\n", " '_AXIS_LEN',\n", " '_AXIS_ORDERS',\n", " '_AXIS_TO_AXIS_NUMBER',\n", " '_HANDLED_TYPES',\n", " '__abs__',\n", " '__add__',\n", " '__and__']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the first 20 methods and attributes of df suing the dir function\n", "dir(df)[:20]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many methods and attributes. The first few listed are the column headers (the statistical variable names) in the spreadsheet file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Task\n", "\n", "Read about methods and attributes of a [pandas dataframe object](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One such method is the `head` method. By default it returns the first five rows of a dataframe object. An integer value can be passed as argument for a different number of rows." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameDOBAgeVocationSmokeHRsBPCholesterolBeforeTAGSurveyCholesterolAfterDeltaGroup
0Dylan Patton1981-10-0743Energy manager0471451.21.210.70.5Active
1Sandra Howard1993-01-2753Tax adviser0511151.20.631.00.2Active
2Samantha Williams1973-12-2133IT consultant0541202.01.331.70.3Active
3Ashley Hensley1981-12-0143Nurse, children's0541032.11.642.10.0Active
4Robert Wilson1964-06-2346Clinical embryologist0611382.82.152.80.0Active
\n", "
" ], "text/plain": [ " Name DOB Age Vocation Smoke HR sBP \\\n", "0 Dylan Patton 1981-10-07 43 Energy manager 0 47 145 \n", "1 Sandra Howard 1993-01-27 53 Tax adviser 0 51 115 \n", "2 Samantha Williams 1973-12-21 33 IT consultant 0 54 120 \n", "3 Ashley Hensley 1981-12-01 43 Nurse, children's 0 54 103 \n", "4 Robert Wilson 1964-06-23 46 Clinical embryologist 0 61 138 \n", "\n", " CholesterolBefore TAG Survey CholesterolAfter Delta Group \n", "0 1.2 1.2 1 0.7 0.5 Active \n", "1 1.2 0.6 3 1.0 0.2 Active \n", "2 2.0 1.3 3 1.7 0.3 Active \n", "3 2.1 1.6 4 2.1 0.0 Active \n", "4 2.8 2.1 5 2.8 0.0 Active " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use the head method to print the first 5 rows of df\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `shape` attribute shows use the number of rows and columns, returned as a tuple. Note that unlike a method (which is like a Python function), an attribute has no parentheses." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(200, 13)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the shape of df\n", "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are $200$ observations (rows) and $13$ statistical variables (columns) in this tidy data set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `columns` attribute lists all the column header names, called labels." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Name', 'DOB', 'Age', 'Vocation', 'Smoke', 'HR', 'sBP',\n", " 'CholesterolBefore', 'TAG', 'Survey', 'CholesterolAfter', 'Delta',\n", " 'Group'],\n", " dtype='object')" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the columns of df\n", "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The majority of dataframe objects will have two axes (rows and columns). We can verify this using the `ndim` attribute." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the dimensions of df\n", "df.ndim" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `size` attribute gives us the total number of data point values (the product of the number of rows and columns)." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2600" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the size of df\n", "df.size" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last attribute discussed here is the `dtype` attribute. It returns the Python data type of the values in each of the columns. This is a very important step. Pandas does its best to interpret the data type. Depending on how the spreadsheet was created and how data was entered, it is not always possible to correctly interpret the type. In this case, the data type may have to be changed manually. Remember that the analysis of data is based on the data type of the variable." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Name object\n", "DOB object\n", "Age int64\n", "Vocation object\n", "Smoke int64\n", "HR int64\n", "sBP int64\n", "CholesterolBefore float64\n", "TAG float64\n", "Survey int64\n", "CholesterolAfter float64\n", "Delta float64\n", "Group object\n", "dtype: object" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the type of df\n", "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Categorical variables are denoted as an `object` type. Numerical variables can be either integer or floating point numbers (numbers with decimal places). These are `int64` and `float64` (denoting 64-bit precision) respectively." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Selecting rows and columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To analyse data, only certain values may need to be _extracted_. This is a very useful skill." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas refers to a single column in a dataframe object as a series object. The `Age` column is extracted below and saved as a series object, assigned to the variable `age_column`. The notation uses square brackets, with the column name represented as a string." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Assign the Age column in df to the variable age_column\n", "age_column = df['Age']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `age_column` variable now holds a pandas series object." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the type of age_column\n", "type(age_column)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since there are no illegal characters in the column name such as spaces, the dot notation can be used.The `age_column computer` variable is overwritten below." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# Assign the Age column in df to the variable age_column and use dot notation\n", "age_column = df.Age" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first five rows of the series object is displayed below using the `head`` method." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 43\n", "1 53\n", "2 33\n", "3 43\n", "4 46\n", "Name: Age, dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the first 5 rows of age_column\n", "age_column.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The index column above indicates that the values are not a simple Python list object." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Task\n", "\n", "Select the `Vocation` column and assign it to the variable `vocation`. Confirm that the object assigned to vocation is a pandas series object. The `tail` function (with no arguments), prints the last five observations to the screen. Use the `tail` function to print the last five observations in `vocation` to the screen." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Solution" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocation = df.Vocation\n", "type(vocation)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "195 Sales professional, IT\n", "196 Paramedic\n", "197 Chartered certified accountant\n", "198 Dancer\n", "199 Theme park manager\n", "Name: Vocation, dtype: object" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocation.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At times it may be more useful to work with a numpy array, rather than a pandas series object. The `to_numpy` method is used to extract the age values as a numpy array. It is used below and the numpy array is assigned to the variable `age`." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# Convert the age_column to a numpy array and assign it to the variable age\n", "age = age_column.to_numpy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `type` function shows that `age` contains a numpy array." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "numpy.ndarray" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the type of age\n", "type(age)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas series objects and numpy arrays have many methods and attributes. The `mean`, `min`, and `max` methods are used below to calculate the avergage, minimum, and maximum age values respectively." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "53.07" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calculate the average age in age_column\n", "age_column.mean()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "30" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calculate the minimum age in age_column\n", "age_column.min()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "75" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calculate the maximum age in age_column\n", "age_column.max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Inidividual rows (subjects) can be specified by making use of the `iloc` attribute (or property, which is the term used by pandas) for a dataframe object. The `iloc` property stands for integer location. Therefor, integers are used to specify the row and column numbers. The first row of data is extracted below.\n", "0." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Name Dylan Patton\n", "DOB 1981-10-07\n", "Age 43\n", "Vocation Energy manager\n", "Smoke 0\n", "HR 47\n", "sBP 145\n", "CholesterolBefore 1.2\n", "TAG 1.2\n", "Survey 1\n", "CholesterolAfter 0.7\n", "Delta 0.5\n", "Group Active\n", "Name: 0, dtype: object" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display the first row of data in df using the iloc method\n", "df.iloc[0] # Note the use of square brackets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A list of values can be passed to select multiple rows. Rows $2$, $5$, and $7$ are selected below. Note that these rows correspond to index values $1$, $4$, and $6$." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameDOBAgeVocationSmokeHRsBPCholesterolBeforeTAGSurveyCholesterolAfterDeltaGroup
1Sandra Howard1993-01-2753Tax adviser0511151.20.631.00.2Active
4Robert Wilson1964-06-2346Clinical embryologist0611382.82.152.80.0Active
6Frank Zimmerman1981-03-0454Police officer0601292.92.412.60.3Active
\n", "
" ], "text/plain": [ " Name DOB Age Vocation Smoke HR sBP \\\n", "1 Sandra Howard 1993-01-27 53 Tax adviser 0 51 115 \n", "4 Robert Wilson 1964-06-23 46 Clinical embryologist 0 61 138 \n", "6 Frank Zimmerman 1981-03-04 54 Police officer 0 60 129 \n", "\n", " CholesterolBefore TAG Survey CholesterolAfter Delta Group \n", "1 1.2 0.6 3 1.0 0.2 Active \n", "4 2.8 2.1 5 2.8 0.0 Active \n", "6 2.9 2.4 1 2.6 0.3 Active " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select rows 2, 5, and 7 using the iloc method\n", "df.iloc[[1, 4, 6]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A range object can be used to select a contiguous number of rows." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameDOBAgeVocationSmokeHRsBPCholesterolBeforeTAGSurveyCholesterolAfterDeltaGroup
0Dylan Patton1981-10-0743Energy manager0471451.21.210.70.5Active
1Sandra Howard1993-01-2753Tax adviser0511151.20.631.00.2Active
\n", "
" ], "text/plain": [ " Name DOB Age Vocation Smoke HR sBP \\\n", "0 Dylan Patton 1981-10-07 43 Energy manager 0 47 145 \n", "1 Sandra Howard 1993-01-27 53 Tax adviser 0 51 115 \n", "\n", " CholesterolBefore TAG Survey CholesterolAfter Delta Group \n", "0 1.2 1.2 1 0.7 0.5 Active \n", "1 1.2 0.6 3 1.0 0.2 Active " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select the first two rows using the iloc method\n", "df.iloc[0:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As with Python list objects, shorthand notation can be used fro contiguous rows. The first two rows are selected below." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameDOBAgeVocationSmokeHRsBPCholesterolBeforeTAGSurveyCholesterolAfterDeltaGroup
0Dylan Patton1981-10-0743Energy manager0471451.21.210.70.5Active
1Sandra Howard1993-01-2753Tax adviser0511151.20.631.00.2Active
\n", "
" ], "text/plain": [ " Name DOB Age Vocation Smoke HR sBP \\\n", "0 Dylan Patton 1981-10-07 43 Energy manager 0 47 145 \n", "1 Sandra Howard 1993-01-27 53 Tax adviser 0 51 115 \n", "\n", " CholesterolBefore TAG Survey CholesterolAfter Delta Group \n", "0 1.2 1.2 1 0.7 0.5 Active \n", "1 1.2 0.6 3 1.0 0.2 Active " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select the first two rows using the iloc method\n", "df.iloc[:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Negative index notation similarly follows the Python list convention. The last three rows are selected below." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameDOBAgeVocationSmokeHRsBPCholesterolBeforeTAGSurveyCholesterolAfterDeltaGroup
197Charles Smith1959-01-3061Chartered certified accountant09921210.15.649.70.4Control
198Barry Porter1979-05-3065Dancer19820010.15.3310.00.1Control
199Julie Barrett1972-07-2766Theme park manager110220811.15.7210.70.4Active
\n", "
" ], "text/plain": [ " Name DOB Age Vocation Smoke \\\n", "197 Charles Smith 1959-01-30 61 Chartered certified accountant 0 \n", "198 Barry Porter 1979-05-30 65 Dancer 1 \n", "199 Julie Barrett 1972-07-27 66 Theme park manager 1 \n", "\n", " HR sBP CholesterolBefore TAG Survey CholesterolAfter Delta \\\n", "197 99 212 10.1 5.6 4 9.7 0.4 \n", "198 98 200 10.1 5.3 3 10.0 0.1 \n", "199 102 208 11.1 5.7 2 10.7 0.4 \n", "\n", " Group \n", "197 Control \n", "198 Control \n", "199 Active " ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select the last three rows\n", "df.iloc[-3:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Columns can also be index. The `iloc` method can take two arguments in the form `row,column`. The code below extracts the first five rows and columns $1$ and $3$. Note that the index values have to be used." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameAge
0Dylan Patton43
1Sandra Howard53
2Samantha Williams33
3Ashley Hensley43
4Robert Wilson46
\n", "
" ], "text/plain": [ " Name Age\n", "0 Dylan Patton 43\n", "1 Sandra Howard 53\n", "2 Samantha Williams 33\n", "3 Ashley Hensley 43\n", "4 Robert Wilson 46" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Extract the first 5 rows for the first and third columns using the iloc method\n", "df.iloc[:5, [0, 2]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `loc` property can be used in a similar fashion. The column names (as a list or a slice) are specified. If the index values were not integers, but strings, those names could also be used. Remember that the row and column names are referred to as labels. Below, the same labels are extracted as above. Note, though, that the row range is now set to `0:4`. When extracting rows and column, ALWAYS use the `row, column` notation. Since two columns are required, they are passed as a Python list object (in square brackets) after the comma. Each column name is passed as a string." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameAge
0Dylan Patton43
1Sandra Howard53
2Samantha Williams33
3Ashley Hensley43
4Robert Wilson46
\n", "
" ], "text/plain": [ " Name Age\n", "0 Dylan Patton 43\n", "1 Sandra Howard 53\n", "2 Samantha Williams 33\n", "3 Ashley Hensley 43\n", "4 Robert Wilson 46" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Extract the first five rows for the Name and Age colunms using the loc method\n", "df.loc[:4, ['Name', 'Age']] # Note the difference in row index compared to the iloc method" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Task\n", "\n", "Print the `TAG` and the `Smoke` variables to the screen, but only for observations (rows) $10$ through $15$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Solution" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TAGSmoke
101.70
112.00
122.30
132.10
142.40
152.62
\n", "
" ], "text/plain": [ " TAG Smoke\n", "10 1.7 0\n", "11 2.0 0\n", "12 2.3 0\n", "13 2.1 0\n", "14 2.4 0\n", "15 2.6 2" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[10:15, ['TAG', 'Smoke']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `iat` indexing extracts a single cell by using its row and column index." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Sandra Howard'" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Extract the value in row 2, column 1 using the iat method\n", "df.iat[1, 0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Filtering data is one of the most useful tasks in data science. This section explores the filtering data by extracting numpy array objects based on criteria (conditions). Conditional operators include `>`, `<`, `<=`, `>=`, `==`, and `!=`. These operators can be used to compare values in a numpy array to a single value or to another numpy array." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Finding unique values in a column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember that a sample space of a variable are all the possible values that a variable can take. This is particularly useful when looking at categorical variables. The `unique` method is used to find all the sample space elements in a column. Below, the unique values in the `Smoke` column are filtered." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 2, 1])" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the unique values in the Smoke column\n", "df.Smoke.unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are three unique elements in the sample space of this column. This method is great for surprises that might be hidden in a dataframe such as one or more strings in a numerical data column. A common example would be the `Age` column that has one or two strings such as `thirty-two` in it, instead of the number $32$. Strings in a numerical data column will prevent calculations on that column and such errors in the data must be corrected." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ages of all non-smokers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Smoke` column contain information about the smoking habits of the respondents in the data set. The sample space contains three integers, $0$ for not smoking, $1$ for smoking, and $2$ for previous smoking." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating an array that contains the ages of only the patients who do not smoke in our dataframe is created below. To do this, indexing is used directly. A conditional is used to include only $0$ subjects (`df.Smoke == 0`). The column is then referenced, which is `Age`, followed by the `to_numpy` method." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([43, 53, 33, 43, 46, 48, 54, 58, 44, 31, 45, 35, 49, 56, 57, 35, 50,\n", " 49, 63, 45, 51, 40, 47, 41, 47, 38, 54, 30, 46, 64, 40, 45, 65, 55,\n", " 53, 54, 72, 32, 38, 59, 53, 42, 38, 51, 37, 36, 48, 49, 62, 39, 74,\n", " 42, 72, 61, 33, 30, 44, 71, 49, 75, 43, 55, 38, 36, 46, 60, 57, 69,\n", " 56, 66, 60, 42, 32, 31, 56, 35, 63, 54, 68, 72, 40, 54, 62, 74, 62,\n", " 41, 61, 61])" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select the ages of those who do not smoke and assign it to the variable age_nonsmokers (as a numpy array)\n", "age_nonsmokers = df.loc[df.Smoke == 0]['Age'].to_numpy()\n", "age_nonsmokers # Print the numpy array to the screen" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code, above may seem a bit difficult at first. It does read rather like an English language sentence, though. Take the dataframe object. Filter the rows in column `Smoke` that are $0$. For all of these rows return the `Age` values as a numpy array." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an alternative, the `loc` indexing can be used, passing a row and a column specification as arguments. The row interrogates the `Smoke` column and filters only those with a $0$ entry. The column is then specified to the the `Age` column." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([43, 53, 33, 43, 46, 48, 54, 58, 44, 31, 45, 35, 49, 56, 57, 35, 50,\n", " 49, 63, 45, 51, 40, 47, 41, 47, 38, 54, 30, 46, 64, 40, 45, 65, 55,\n", " 53, 54, 72, 32, 38, 59, 53, 42, 38, 51, 37, 36, 48, 49, 62, 39, 74,\n", " 42, 72, 61, 33, 30, 44, 71, 49, 75, 43, 55, 38, 36, 46, 60, 57, 69,\n", " 56, 66, 60, 42, 32, 31, 56, 35, 63, 54, 68, 72, 40, 54, 62, 74, 62,\n", " 41, 61, 61])" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use the loc method to select the ages of those who do not smoke and assign it to the variable age_nonsmokers (as a numpy array)\n", "age_nonsmokers = df.loc[df.Smoke == 0, 'Age'].to_numpy()\n", "age_nonsmokers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The different ways to interact with pandas adds to its power. Without the `to_numpy` method, the code above returns a pandas series object. Every data scientist has their own preferences and certain tasks require specific approaches." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 43\n", "1 53\n", "2 33\n", "3 43\n", "4 46\n", " ..\n", "180 74\n", "183 62\n", "188 41\n", "193 61\n", "197 61\n", "Name: Age, Length: 88, dtype: int64" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use the loc method to select the ages of those who do not smoke and assign it to the variable age_nonsmokers\n", "age_nonsmokers = df.loc[df.Smoke == 0, 'Age']\n", "age_nonsmokers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Statistical analysis can be done on either the series or numpy objects." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "50.09090909090909" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calculate the average age of those who do not smoke\n", "age_nonsmokers.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Task\n", "\n", "The `to_list` method creates a Python list object. Use the `to_list` method to create a Python list object of the `Age` values for those that have a value in the `HR` (heart rate) column that is more than $72$. Print only the first $10$ values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Solution" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[57, 52, 49, 70, 45, 53, 50, 33, 30, 44]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[df.HR > 72, 'Age'].to_list()[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Non-smoker ages where survey question choice is 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Filtering by two criteria (two columns), `Age` and `Survey` is shown below. The filtering can either refer to __and__ or __or__. In the first, all the criteria are required to be met and in the second, only one of the criteria need be met (return a `True` value)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The symbol for __and__ is `&` and for __or__ is `|`. Below, `&` is used since both criteria must be met. Each filter is created in a set of parentheses. The code uses the `row, column` notation." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 53\n", "2 33\n", "11 35\n", "14 57\n", "27 47\n", "32 54\n", "38 45\n", "45 54\n", "50 53\n", "112 36\n", "130 68\n", "183 62\n", "Name: Age, dtype: int64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select the non-smokers who choice 3 in the Survey column and return their ages (assigned to the variable non_smokers_choice_3)\n", "non_smokers_choice_3 = df.loc[(df.Smoke == 0) & (df.Survey == 3), 'Age']\n", "non_smokers_choice_3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In English the code reads: _Take the `df` dataframe object and look down the rows of the `Smoke` and `Survey` columns. Return only the rows where `Smoke` is $0$ AND `Survey` is $3$. Then return the `Age` column values for all these rows fulfilling both criteria_." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Never smoked or satisfaction score is greater than 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now filtering by __or__ is shown. The symbol for __or__ is `|`. Below, `|` is used since only one of the criteria must be met. Each filter is created in a set of parentheses. The code uses the `row, column` notation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It may be useful to create variable that holds the criteria first." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "# Filter by Smoke == 0 or Survey > 3 and assign the filter to the variable crit\n", "crit = (df.Smoke == 0) | (df.Survey > 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `crit` variable can now be used in the filtering." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 43\n", "1 53\n", "2 33\n", "3 43\n", "4 46\n", "Name: Age, dtype: int64" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use the crit criterium and assign to the variable nonsmoker_or_survey_gt_3\n", "nonsmoker_or_survey_gt_3 = df.loc[crit, 'Age']\n", "nonsmoker_or_survey_gt_3[:5] # Print the first 5 rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Task\n", "\n", "Create a variable named selection that will select for all the observations who never smoked, as a heart rate, `HR`, less than $80$, and a systolic blood pressure, `sBP`, of less than $120$. Use this selection to create a numpy $n$-dimensional array of the `Age` and triacylglycerol, `TAG`, variable values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Solution" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "selection = (df.Smoke == 0) & (df.HR < 80) & (df.sBP < 120)\n", "df.loc[selection, ['Age', 'TAG']].to_numpy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a new dataframe object that only contains participants younger than 50" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of just an array of values, a new dataframe object can also be created using filtering. (Because it is a part of an existing dataframe object, some Data Scientist refer to it as a sub-dataframe object.) It includes all the columns (variables), but only for patients up to and including $49$ years of age. This is very simple to achieve." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameDOBAgeVocationSmokeHRsBPCholesterolBeforeTAGSurveyCholesterolAfterDeltaGroup
0Dylan Patton1981-10-0743Energy manager0471451.21.210.70.5Active
2Samantha Williams1973-12-2133IT consultant0541202.01.331.70.3Active
3Ashley Hensley1981-12-0143Nurse, children's0541032.11.642.10.0Active
4Robert Wilson1964-06-2346Clinical embryologist0611382.82.152.80.0Active
5Leslie Diaz1994-08-2548Politician's assistant0591222.81.442.60.2Active
\n", "
" ], "text/plain": [ " Name DOB Age Vocation Smoke HR sBP \\\n", "0 Dylan Patton 1981-10-07 43 Energy manager 0 47 145 \n", "2 Samantha Williams 1973-12-21 33 IT consultant 0 54 120 \n", "3 Ashley Hensley 1981-12-01 43 Nurse, children's 0 54 103 \n", "4 Robert Wilson 1964-06-23 46 Clinical embryologist 0 61 138 \n", "5 Leslie Diaz 1994-08-25 48 Politician's assistant 0 59 122 \n", "\n", " CholesterolBefore TAG Survey CholesterolAfter Delta Group \n", "0 1.2 1.2 1 0.7 0.5 Active \n", "2 2.0 1.3 3 1.7 0.3 Active \n", "3 2.1 1.6 4 2.1 0.0 Active \n", "4 2.8 2.1 5 2.8 0.0 Active \n", "5 2.8 1.4 4 2.6 0.2 Active " ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a new dataframe object that only contains participants younger than 50 and assign it to the variable df_younger_than_50\n", "df_younger_than_50 = df.loc[df.Age < 50]\n", "df_younger_than_50[:5] # Display the first five rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since this new dataframe object has only data filetered by a crtiterium, it is easier to calculate any required statistics." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "69.45348837209302" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Mean HR\n", "df_younger_than_50.HR.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a new dataframe for participants with a restricted list of job titles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By making use of the `isin` method, certain values in the sample space of a variable can be specified. The code below filters for `IT consultant`, `Energy manager`, and `Clinical embryologist` in the `Vocation` column. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To achive this, a Python list object with the sample space elements, is created first." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "# Create a Python list object with the string element 'IT consultant', 'Energy manager', 'Clinical embryologist' and assign it to the variable jobs\n", "jobs = ['IT consultant', 'Energy manager', 'Clinical embryologist']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A criterium is created using the `jobs` list object." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "# Create a criterium\n", "crit = df.Vocation.isin(jobs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The criterium is used to filter the dataframe object." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameDOBAgeVocationSmokeHRsBPCholesterolBeforeTAGSurveyCholesterolAfterDeltaGroup
0Dylan Patton1981-10-0743Energy manager0471451.21.210.70.5Active
2Samantha Williams1973-12-2133IT consultant0541202.01.331.70.3Active
4Robert Wilson1964-06-2346Clinical embryologist0611382.82.152.80.0Active
188Joan Chavez1999-10-0741Energy manager0931829.15.028.80.3Control
\n", "
" ], "text/plain": [ " Name DOB Age Vocation Smoke HR \\\n", "0 Dylan Patton 1981-10-07 43 Energy manager 0 47 \n", "2 Samantha Williams 1973-12-21 33 IT consultant 0 54 \n", "4 Robert Wilson 1964-06-23 46 Clinical embryologist 0 61 \n", "188 Joan Chavez 1999-10-07 41 Energy manager 0 93 \n", "\n", " sBP CholesterolBefore TAG Survey CholesterolAfter Delta Group \n", "0 145 1.2 1.2 1 0.7 0.5 Active \n", "2 120 2.0 1.3 3 1.7 0.3 Active \n", "4 138 2.8 2.1 5 2.8 0.0 Active \n", "188 182 9.1 5.0 2 8.8 0.3 Control " ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Filter for job in the Vocation column and assign it to the variable df_jobs\n", "df_jobs = df.loc[crit]\n", "df_jobs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `str.contains` method can be used to filter for a string in a column. The code below filters for all the participants who have `manager` in their job title." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "# Build a criterium with the str.contains method for the word manager\n", "crit = df.Vocation.str.contains('manager', na=False) # Flag for missing values (see later)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameDOBAgeVocationSmokeHRsBPCholesterolBeforeTAGSurveyCholesterolAfterDeltaGroup
0Dylan Patton1981-10-0743Energy manager0471451.21.210.70.5Active
34Mr. Tyler Strickland DDS1940-08-2746Tourist information centre manager0621364.12.323.80.3Control
54Stephanie Jacobs1977-06-1838Estate manager/land agent0691394.42.724.00.4Active
55Juan Johnson1956-12-0951Logistics and distribution manager0651414.52.954.00.5Control
71Tina Martinez1941-05-3174Passenger transport manager1691464.83.144.50.3Active
\n", "
" ], "text/plain": [ " Name DOB Age \\\n", "0 Dylan Patton 1981-10-07 43 \n", "34 Mr. Tyler Strickland DDS 1940-08-27 46 \n", "54 Stephanie Jacobs 1977-06-18 38 \n", "55 Juan Johnson 1956-12-09 51 \n", "71 Tina Martinez 1941-05-31 74 \n", "\n", " Vocation Smoke HR sBP CholesterolBefore \\\n", "0 Energy manager 0 47 145 1.2 \n", "34 Tourist information centre manager 0 62 136 4.1 \n", "54 Estate manager/land agent 0 69 139 4.4 \n", "55 Logistics and distribution manager 0 65 141 4.5 \n", "71 Passenger transport manager 1 69 146 4.8 \n", "\n", " TAG Survey CholesterolAfter Delta Group \n", "0 1.2 1 0.7 0.5 Active \n", "34 2.3 2 3.8 0.3 Control \n", "54 2.7 2 4.0 0.4 Active \n", "55 2.9 5 4.0 0.5 Control \n", "71 3.1 4 4.5 0.3 Active " ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a new dataframe object with the criterium and assign it to the variable df_manager\n", "df_manager = df.loc[crit]\n", "df_manager[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Updating or changing the values in a dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another valueble skill is to be able to change actual data in a dataframe object. Fortunately, datadrame objects can be manipulated in many ways." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Renaming columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The names of individual columns can be replaced with the `rename` method using a dictionary. Below, the `Name` column is renamed to `Participant`. To make the changes permanent, the `inplace` argument value to `True`." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Patient', 'DOB', 'Age', 'Vocation', 'Smoke', 'HR', 'sBP',\n", " 'CholesterolBefore', 'TAG', 'Survey', 'CholesterolAfter', 'Delta',\n", " 'Group'],\n", " dtype='object')" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Replace the Name column with the Patient\n", "df.rename(columns={'Name': 'Patient'}, inplace=True)\n", "df.columns # Call the column names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deleting columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `drop` method is used to delete a column. The `axis` argument is set to $1$ to indicate that the column is to be deleted. The `inplace` argument value is set to `True` to make the changes permanent." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Delta` column is deleted from the dataframe object." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "# Delete the Delta column\n", "df.drop(columns='Delta', inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add a value of 2 to the age of each participant" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In specific types of research, personal data are obfuscated to protect the privacy of the people in the dataset. In a simple case, $2$ is added the age of every participant when capturing the data. In reality, they are all $2$ years younger." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 45\n", "1 55\n", "2 35\n", "3 45\n", "4 48\n", "Name: Age, dtype: int64" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Add 2 to each value in the Age column\n", "df.Age = df.Age + 2 # Note that the Age column is reassigned\n", "df.Age[:5] # Display the first five rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are more than one way to achieve a similar goal. One way is to create a function and then use the `apply` method to apply the function to the values in a column." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "User-defined functions are created using the `def` and `return` keywords. The former tells Python to create a new function. After a space follows the name is given to the function. A set of parentheses follow that contains a placeholder for the argument (here called a parameter). In its simplest form, the `return` keyword follows. As the name indicates this section returns a value for the function. Below, it is clear from the code that $x$ is the parameter and will take an argument value. The function then adds $2$ to the argument value." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "# Create a user-defined function named add2 that adds 2 to a value\n", "def add2(x):\n", " return x + 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `apply` method is now used and the `add2` function is used. The value in each row is now increased by $2$." ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 47\n", "1 57\n", "2 37\n", "3 47\n", "4 50\n", "Name: Age, dtype: int64" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Apply the add2 function to the Age column\n", "df.Age = df.Age.apply(add2)\n", "df.Age[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Task\n", "\n", "Create a new variable named `AgeHR` that is the product of the values in the variables `Age` and `HR`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Solution" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "df['AgeHR'] = df.Age * df.HR" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Changing the values in a column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dictionaries can be used to change the values in a column. The keys are the original values and the values are the new values. The `replace` method is used to change the values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code below returns the sample space (unique) elements in the `Group` column." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Active', 'Control'], dtype=object)" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Return the unique values in the Group column\n", "df.Group.unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The values are `Active` and `Control`. These are replaced with `1` and `0` respectively." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "# Use the replace method to replace the values in the Group column with 1 for Active and 0 for Control\n", "df.Group.replace({'Active': 1, 'Control': 0}, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first five observations in the `Group` column are printed to the screen to confirm the changes." ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 1\n", "2 1\n", "3 1\n", "4 1\n", "Name: Group, dtype: int64" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the first 5 rows of the Group column\n", "df.Group[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Splitting the values in a column into two columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Adding columns is an often used technique . It is as simple as stating the new name in square brackets as a string and then adding a list of values. The list of values that are added must be the same length (number of rows) as the dataframe." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code below creates two new columns, `FirstName` and `LastName`. The `split` method is used to split the values in the `Patient` column. The `expand` argument is set to `True` to create two new columns. The `str` method is used to indicate that the `split` method is to be applied to a string, in this case a space." ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "# Split the values in the Patient column on the empty space between the first and last names and assign it to the variable split_participant\n", "split_participant = df.Patient.str.split(' ', expand=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Indexing is used to extract the two lists of values." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "# Create new columns in df named FirstName and LastName\n", "df['FirstName'] = split_participant[0]\n", "df['LastName'] = split_participant[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first five rows of the `FirstName`, `LastName`, and the original `Patient` columns are printed to the screen." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FirstNameLastNamePatient
0DylanPattonDylan Patton
1SandraHowardSandra Howard
2SamanthaWilliamsSamantha Williams
3AshleyHensleyAshley Hensley
4RobertWilsonRobert Wilson
\n", "
" ], "text/plain": [ " FirstName LastName Patient\n", "0 Dylan Patton Dylan Patton\n", "1 Sandra Howard Sandra Howard\n", "2 Samantha Williams Samantha Williams\n", "3 Ashley Hensley Ashley Hensley\n", "4 Robert Wilson Robert Wilson" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print first 5 rows of FirstName and LastName and Patient\n", "df[['FirstName', 'LastName', 'Patient']][:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Categorize a numerical variable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Intervals, called bins, can be created to change a continuous numerical variable into a categorical variable. The pandas `cut` function is used to create the bins. The `bins` argument is used to specify the intervals. The `labels` argument is used to specify the labels for each bins. The `right` argument is used to specify whether the right or left side of the interval is closed. The `include_lowest` argument is used to specify whether the lowest value is included in the first bin." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code below creates three bins for the `CholesterolBefore` column. The minimum and maximum values are calculated first." ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.2" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Minimum CholesterolBefore value\n", "df.CholesterolBefore.min()" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "11.1" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Maximum CholesterolBefore value\n", "df.CholesterolBefore.max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Three bins are created with four values used in the `cut` function to create the bins $\\text{low}=[0,2.5)$, $\\text{normal}=[2.5,5.0)$, and $\\text{high}=[5.0,20)$. The intervals safely contain the range of values." ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 low\n", "1 low\n", "2 low\n", "3 low\n", "4 low\n", "Name: CholesterolBeforeLevel, dtype: category\n", "Categories (3, object): ['low' < 'normal' < 'high']" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a new column called CholesterolBeforeLevel using the cut function with bins=[0,5,10,20], which is right False and has the labels ['low', 'normal', 'high']\n", "df['CholesterolBeforeLevel'] = pd.cut(df.CholesterolBefore, bins=[0, 5, 10, 20], right=False, labels=['low', 'normal', 'high'])\n", "df.CholesterolBeforeLevel[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Task\n", "\n", "Create a new variable in the dataframe called `HRLevel` from the `HR` column with intervals $(0,65], (65,78], (78,1000)$ with labels `low`, `normal`, and `high` respectively." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Solution" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 low\n", "1 low\n", "2 low\n", "3 low\n", "4 low\n", "Name: HRLevel, dtype: category\n", "Categories (3, object): ['low' < 'normal' < 'high']" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['HRLevel'] = pd.cut(df.HR, bins = [0, 65, 78, 1000], right=True, labels=['low', 'normal', 'high'])\n", "df.HRLevel[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Missing data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The numpy nan value" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is very often that datasets contain missing data. The numpy library has a specific entity called a `nan` value. This stands for _not a number_. The `nan` value is used to indicate that a value is missing." ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "nan" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the numpy nan value\n", "np.nan" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A list object or an array can contain a `nan` value. Computation is not possible since the missing value is unknown." ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 2, 3, nan]" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a list with three integers and a nan value and assign it to the variable my_list\n", "my_list = [1, 2, 3, np.nan]\n", "my_list" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "nan" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calculate the mean of my_list\n", "np.mean(my_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The solution is `nan` since it is impossible to know what the mean of the four values are." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A dataframe object with missing data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `MissingData.csv` file contains missing data." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "# Read the MissingData.csv file and assign it to the variable df_missing\n", "df_missing = pd.read_csv('https://raw.githubusercontent.com/juanklopper/TutorialData/main/MissingData.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Printing the first $10$ rows to the screen shows some of the missing values." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesalarypreviouscompany
057.0NaN1.0
156.050927.0NaN
246.075500.03.0
3NaN84417.0NaN
460.063002.01.0
554.054652.0NaN
6NaN65739.01.0
764.089397.03.0
860.077797.04.0
961.0NaN1.0
\n", "
" ], "text/plain": [ " age salary previouscompany\n", "0 57.0 NaN 1.0\n", "1 56.0 50927.0 NaN\n", "2 46.0 75500.0 3.0\n", "3 NaN 84417.0 NaN\n", "4 60.0 63002.0 1.0\n", "5 54.0 54652.0 NaN\n", "6 NaN 65739.0 1.0\n", "7 64.0 89397.0 3.0\n", "8 60.0 77797.0 4.0\n", "9 61.0 NaN 1.0" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the first 10 rows of df_missing\n", "df_missing[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deleting missing data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `dropna` method deletes all rows that contain missing data for any of the columns in that row. All the data for that row (observation) are lost. The removal is not permanent. Below, the dataframe with dropped rows is reassigned to a new variable, leaving the oriiginal intact." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "# Delete rows with missing values and assign to a new variable complete_data_df\n", "complete_data_df = df_missing.dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `axis` argument is set to $0$ to indicate that rows are to be dropped. This is the default. If the value is set to $1$, then any column with a missing value is dropped." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `isna` method can be used to return `True` and `False` values along a column. This is done for the `age` column below. `True` indicates that a value is missing." ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 True\n", "4 False\n", "5 False\n", "6 True\n", "7 False\n", "8 False\n", "9 False\n", "10 False\n", "11 True\n", "12 False\n", "13 False\n", "14 False\n", "15 True\n", "16 False\n", "17 False\n", "18 False\n", "19 False\n", "20 False\n", "21 False\n", "22 False\n", "23 False\n", "24 False\n", "25 False\n", "26 False\n", "27 False\n", "28 False\n", "29 False\n", "30 False\n", "31 False\n", "Name: age, dtype: bool" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use .isna() for the age column\n", "df_missing.age.isna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since `True` is stored as $1$, the `sum` method can be used to count all the rows (observations) with missing data." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Sum over all the missing data in the age column\n", "df_missing.age.isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are $4$ missing values in the `age` column." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Replacing missing values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The process of creating values to fill in missing data is called __data imputation__ and is a seperate and complicated subject. The pandas library provides a `fillna` method for filling in the missing data with simple calculations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `fillna` method takes a `method` argument. This can be set to `ffil` and `bfill`. As the argument values names imply, the value in the previous or subsequent row will be used to _fill in_ the missing value." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/f9/9znwn5gx653_4p99htssbq400000gn/T/ipykernel_1250/170363983.py:2: FutureWarning: Series.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.\n", " df_missing.age.fillna(method='ffill')[:5]\n" ] }, { "data": { "text/plain": [ "0 57.0\n", "1 56.0\n", "2 46.0\n", "3 46.0\n", "4 60.0\n", "Name: age, dtype: float64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use forward fill to impute the missing values in the age column\n", "df_missing.age.fillna(method='ffill')[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The missing value for index $3$ has been replaced by the $46.0$ value at index $2$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Values for missing data can also be used. The median of the `age` column is used to replace the missing values. Note that the `median` method used for a pandas series object ignores the missing data unlike a Python list object or a numpy array." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 57.0\n", "1 56.0\n", "2 46.0\n", "3 57.0\n", "4 60.0\n", "Name: age, dtype: float64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fill the missing age values with the median age\n", "df_missing.age.fillna(df_missing.age.median())[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Default missing values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is common to use default values when data is not available at the time of capture. The `DefaultMissingData.csv` file uses the values $999$, `Nil` and `Missing` to indicate missing data." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesalaryprevious_company
057Nil1
15650927Missing
246755003
399984417Missing
460630021
\n", "
" ], "text/plain": [ " age salary previous_company\n", "0 57 Nil 1\n", "1 56 50927 Missing\n", "2 46 75500 3\n", "3 999 84417 Missing\n", "4 60 63002 1" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import the DefaultMissingData.scv file and assign ot to the variable default_missing_df\n", "default_missing_df = pd.read_csv('https://raw.githubusercontent.com/juanklopper/TutorialData/main/DefaultMissingData.csv')\n", "default_missing_df[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `na_values` argument can take a Python list object that contains all the default values that indicate missing values." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesalaryprevious_company
057.0NaN1.0
156.050927.0NaN
246.075500.03.0
3NaN84417.0NaN
460.063002.01.0
\n", "
" ], "text/plain": [ " age salary previous_company\n", "0 57.0 NaN 1.0\n", "1 56.0 50927.0 NaN\n", "2 46.0 75500.0 3.0\n", "3 NaN 84417.0 NaN\n", "4 60.0 63002.0 1.0" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Reimport the csv file and use the na_values argument to specify the missing values 999, Nil, and Missing\n", "default_missing_df = pd.read_csv('DefaultMissingData.csv', na_values=[999, 'Nil', 'Missing'])\n", "default_missing_df[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The missing values are now correctly indicates as `nan` values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Quiz questions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Questions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. How do you import the pandas library in Python?\n", "\n", "2. How do you read a CSV file named `data.csv` into a pandas DataFrame?\n", "\n", "3. How do you display the first $5$ rows of a DataFrame named `df`?\n", "\n", "4. How do you find the number of rows and columns in a DataFrame `df`?\n", "\n", "5. How do you print the names of all the columns in a DataFrame `df`?\n", "\n", "6. How do you get the summary information of the DataFrame `df` including data types and non-null values?\n", "\n", "7. How do you select only the rows in DataFrame `df` where the column `A` is greater than $5$?\n", "\n", "8. How do you select the columns `A` and `B` from the DataFrame `df`?\n", "\n", "9. How do you rename the column `A` to `Alpha` in DataFrame `df`?\n", "\n", "10. How do you replace all occurrences of the value $999$ in DataFrame `df` with `nan`?\n", "\n", "11. How do you drop all rows in DataFrame `df` that contain missing values?\n", "\n", "12. How do you fill all missing values in DataFrame `df` with $0$?\n", "\n", "13. How do you calculate the mean of column `A` in DataFrame `df`?\n", "\n", "14. How do you create a new column `C` in DataFrame `df` that is the sum of columns `A` and `B`?\n", "\n", "15. How do you save the DataFrame `df` to a CSV file named `output.csv`? (Find the solution by searching the pandas documentation.)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# " ] } ], "metadata": { "kernelspec": { "display_name": "datascience", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.1" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }