{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## AI for Medicine Course 3 Week 1 lecture notebook\n", "## Pandas for a Medical Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Welcome to this lecture notebook! In this week's graded assignment, you will be using pandas quite often to work with dataframes.\n", "- To get you ready for assignment, you'll familiarize yourself with some objects in the `pandas` library, along with their data types.\n", "- Then you'll see how you can leverage pandas to get information from a dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import Library" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# import libraries\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load your data and check its shape\n", "\n", "pandas.read_csv takes in a file name, assuming that the file is formatted as comma separated values (csv).\n", "- You can choose one of the columns to be the row 'index', which is an ID associated with each row." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Data has 50 rows and 5 columns.\n", "\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sexageobstructoutcomeTRTMT
105701True
216800False
307200True
406611True
516901False
\n", "
" ], "text/plain": [ " sex age obstruct outcome TRTMT\n", "1 0 57 0 1 True\n", "2 1 68 0 0 False\n", "3 0 72 0 0 True\n", "4 0 66 1 1 True\n", "5 1 69 0 1 False" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read the csv data, setting the 0th column as the row index\n", "data = pd.read_csv(\"dummy_data.csv\", index_col=0)\n", "\n", "# Display the data's number of rows and columns\n", "print(f\"Data has {data.shape[0]} rows and {data.shape[1]} columns.\\n\")\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is a description of all the fields:\n", "- `sex (binary): 1 if Male, 0 otherwise`\n", "- `age (int): age of patient at start of the study`\n", "- `obstruct (binary): obstruction of colon by tumor`\n", "- `outcome (binary): 1 if died within 5 years`\n", "- `TRTMT (binary): patient was treated`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Introducing the DataFrame" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "# show the data type of the dataframe\n", "print(type(data))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see that your data is of type `DataFrame`. A `DataFrame` is a two-dimensional, labeled data structure with columns that can be of different data types. Dataframes are a great way to organize your data, and are the most common object in `pandas`. If you are unfamiliar with them, check the official [documentation](https://pandas.pydata.org/pandas-docs/stable/index.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In case you're only interested in a single column (or feature) of the data, access that single column by using the \".\" dot notation, in which you specify the dataframe followed by a dot and the name of the column you are interested in, like this:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 True\n", "2 False\n", "3 True\n", "4 True\n", "5 False\n", "Name: TRTMT, dtype: bool" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.TRTMT.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the `head()` method. This method prints only the first five rows, so the output of the cell can be quickly and easily read. Try removing it and see what happens." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Introducing the Series" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "print(type(data.TRTMT))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each column of a DataFrame is of type `Series`, which are one-dimensional, labeled arrays that can contain any data type, plus its index. Series are similar to lists in Python, with one important difference: each Series can only contain one type of data. \n", "\n", "Many of the methods and operations supported by DataFrames are also supported by Series. When in doubt, always check the documentation!\n", "\n", "There are several ways of accessing a single column of a DataFrame. The methods you're about to see all do the same thing.\n", "\n", "- Dot notation is simple to type, but doesn't work when the column name has a space. See some [examples of where dot notation will not work](https://www.dataschool.io/pandas-dot-notation-vs-brackets/).\n", "- Bracket notation always works." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n", "True\n" ] } ], "source": [ "# Use dot notation to access the TRTMT column\n", "data.TRTMT\n", "\n", "# Use .loc to get all rows using \":\", for column TRTMT\n", "data.loc[:,\"TRTMT\"]\n", "\n", "# Use bracket notation to get the TRTMT column\n", "data[\"TRTMT\"]\n", "\n", "print(data.TRTMT.equals(data.loc[:,\"TRTMT\"]))\n", "print(data.TRTMT.equals(data[\"TRTMT\"]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Slicing the DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most of the time you'll want a subset (or a slice) of the DataFrame that meets some criteria. For example, if you wanted to analyze all of the features for patients who are 50 years or younger, you can slice the DataFrame like this: " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sexageobstructoutcomeTRTMT
614301True
1514610False
1903411True
2405000True
3203310True
3304901False
3414700False
4203910False
4514000True
6714900True
7004000False
\n", "
" ], "text/plain": [ " sex age obstruct outcome TRTMT\n", "6 1 43 0 1 True\n", "15 1 46 1 0 False\n", "19 0 34 1 1 True\n", "24 0 50 0 0 True\n", "32 0 33 1 0 True\n", "33 0 49 0 1 False\n", "34 1 47 0 0 False\n", "42 0 39 1 0 False\n", "45 1 40 0 0 True\n", "67 1 49 0 0 True\n", "70 0 40 0 0 False" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[data.age <= 50]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if you wanted to filter a DataFrame based on multiple conditions? \n", "- To do this, use the \"&\" as the 'and' operator. Don't use `and`.\n", "- You can use '|' as the 'or' operator. Don't use `or`.\n", "\n", "```CPP\n", "# Trying to combine two conditions using `and` won't work\n", "data[(data.age <= 50) and (data.TRTMT == True)]\n", "\n", "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().\n", "```\n", "\n", "- Don't forget the parentheses around each condition!\n", "- Without parentheses, this won't work.\n", "\n", "```CPP\n", "# Trying to combine two conditions without parentheses results in an error\n", "data[ data.age <= 50 & data.TRTMT == True]\n", "\n", "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().\n", "```" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sexageobstructoutcomeTRTMT
614301True
1903411True
2405000True
3203310True
4514000True
6714900True
\n", "
" ], "text/plain": [ " sex age obstruct outcome TRTMT\n", "6 1 43 0 1 True\n", "19 0 34 1 1 True\n", "24 0 50 0 0 True\n", "32 0 33 1 0 True\n", "45 1 40 0 0 True\n", "67 1 49 0 0 True" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get patients aged 50 or less who received treatment\n", "data[(data.age <= 50) & (data.TRTMT == True)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When slicing a DataFrame the resulting type will be a DataFrame as well:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(data[(data.age <= 50) & (data.TRTMT == True)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### More Advanced Operations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's dive into some useful properties of DataFrames and Series that allow for more advanced calculations." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "len: 11\n", "shape (rows, cols) (11, 5)\n", "size: 55\n" ] } ], "source": [ "# Applying len() to the df yields the number of rows\n", "print(f\"len: {len(data[(data.age <= 50)])}\")\n", "\n", "# Accessing the 'shape' attribute of the df yields a tuple of the form (rows, cols)\n", "print(f\"shape (rows, cols) {data[(data.age <= 50)].shape}\")\n", "\n", "# Accessing the 'size' attribute of the df yields the number of elements in the df:\n", "print(f\"size: {data[(data.age <= 50)].size}\")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "50\n", "(50,)\n", "50\n" ] } ], "source": [ "# Applying len() to the df yields the number of rows\n", "print(f\"{len(data.TRTMT)}\")\n", "\n", "# Accessing the 'shape' attribute of the df yields a tuple of the form (rows, cols)\n", "print(f\"{data.TRTMT.shape}\")\n", "\n", "# Accessing the 'size' attribute of the df yields the number of elements in the df:\n", "print(f\"{data.TRTMT.size}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using what you've seen so far, can you calculate the proportion of the patients who are male?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prop_male_patients = data[(data.sex == 1)].size / data.size\n", "print(f\"Your answer: {prop_male_patients:.2f}, Expected: {21/50}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### mean( ) Method" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One handy hack you can use when dealing with binary data is to use the `mean()` method of a Series to calculate the proportion of occurrences that are equal to 1. \n", "\n", "Note this should also work with bool data since Python treats booleans as numbers when applying math operations. \n", "- `True` is treated as the number `1`\n", "- `False` is treated as the number `0`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Calculate the proportion of the `sex` column that is `True` (1).\n", "data.sex.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Updating Values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far you've only accessed values of a DataFrame or Series. Sometimes you may need to update these values. \n", "\n", "Let's look at the original DataFrame one more time:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# View dataframe\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's say you detected an error in the data, where the second patient was actually treated. \n", "- To update the data, you can use .loc[row, col] and specify the row and column you want to update.\n", "- Notice that because the dataframe's index is defined, the first row is at index 1 and not 0.\n", "- If the index was not set, then indexing would start at 0.\n", "\n", "\n", "```CPP\n", "# Try to access patient 0, and note the error message\n", "data.loc[0,'TRTMT']\n", "\n", "KeyError: 0\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.loc[2,'TRTMT']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.loc[2, \"TRTMT\"] = True\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, you've found out that there was another issue with the data that needs to be corrected. This study only includes females, so the `sex` column should be set to 0 for all patients.\n", "\n", "You can update the whole column (or Series) using .loc[row, col] once again, but this time using \":\" for rows." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.loc[:, \"sex\"] = 0\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can access a range of rows by specifying the `start:end`, where the `end` index is included.\n", "- Note that the range is inclusive of the end (other functions in Python exclude the end of the range from the output)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Access patients at index 3 to 4, including 4.\n", "data.loc[3:4,:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Congratulations, you have completed this lecture notebook!\n", "\n", "Welcome to the wonderful world of Pandas! You will be using these pandas functions in this week's graded assignment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 4 }