{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Statistical Analysis with Python\n", "\n", "In the first notebook, we used `numpy`, `scipy`, and `matplotlib` to visualize the output of mathematical functions. For the next couple of weeks, we will explore the use of python tools to analyze experimental data. Specifically, this week we will focus on visualizing experimental data and performing some basic statistical analysis. Python's statistics capabilities are not quite as fully-featured as the `R` language (though if you need it, the [`rpy2` package](https://pypi.org/project/rpy2/) provides a python interface to `R`), but are extensive enough for most scientific applications.\n", "\n", "This notebook uses data from the 2017-2018 [National Health and Nutrition Examination Survey](https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2017), and assumes that the following data files are available in the current working directory (by default, the same directory as the notebook file).\n", "\n", "- Standard Biochemistry Profile [BIOPRO_J.XPT](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/BIOPRO_J.XPT)\n", "- Demographic Variables and Sample Weights [DEMO_J.XPT](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.XPT)\n", "- Body Measures [BMX_J.XPT](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/BMX_J.XPT)\n", "- Dietary Interview, Total Nutrient Intakes, First Day [DR1TOT_J.XPT](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DR1TOT_J.XPT)\n", "- Blood Pressure [BPX_J](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/BPX_J.XPT)\n", "\n", "## Introduction to `pandas`\n", "\n", "The [`pandas` package](https://pandas.pydata.org/), which is part of the SciPy ecosystem, is one of the most widely-used packages when working with most types of experimental data. The package provides two useful data structures:\n", "\n", "- [`pandas.Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) is essentially a 1D `numpy.ndarray` with extra features: it allows each entry to have an arbitrary label rather then just an integer index starting from 0. Like `numpy.ndarray` objects, a `pandas.Series` object has a fixed length, and all of the entries must be of the same type. These could be floats, ints, strings, or other types. In nearly all cases, a function that works on a `numpy.ndarray` will also work exactly the same on a `pandas.Series`.\n", "- [`pandas.Dataframe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) represents a 2D table of data, where each row and column may be labeled. Essentially, it is a table in which each column is a `pandas.Series` object. The columns do not all need to be the same length or of the same type, and the number of columns can be changed at any time. The data in the table are aligned by the labels for each `pandas.Series` object. This is a very powerful feature that is discussed in a bit more detail below.\n", "\n", "It is strongly recommended that you read over the `pandas` [Intro to Data Structures](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html) article and [10 Minutes to `pandas`](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) to get a sense for what these structures are capable of. In this notebook we will only have the chance to highlight a few features. 
Before we dive into real data, let's take a look at a simple example to illustrate how these data structures work.\n", "\n", "Our example will be a simple gradebook for a class of 5 students with a few assignments. We'll first create a `pandas.Series` object containing the students' names, using their student ID numbers as the \"index.\"\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "students = pd.Series(['Alice', 'Bob', 'Charlie', 'Diane', 'Elaine'],index=[421,398,990,102,556],name=\"Student Name\")\n", "\n", "students" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data contained in the array are the students' names. We also told `pandas` to use student ID numbers as the index for each row: Alice has a student ID number of 421, and so on. Finally, we told `pandas` the name of the data array. When we inspect the `Series` object, we see that each student is listed with their corresponding student ID number. We can use that ID number as an index to access the student." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(students[421],students[556])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll now initialize the gradebook, giving it the `students` object we created before. When we inspect the gradebook, `pandas` outputs a nicely formatted table with the column heading corresponding to the name we gave the `Series` object. We can use that column heading to access data contained in the column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gradebook = pd.DataFrame(students)\n", "\n", "print(gradebook['Student Name'][990])\n", "\n", "gradebook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, `pandas.Dataframe` provides another syntax for accessing elements: the [`at`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html), [`loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html), [`iat`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iat.html#pandas.DataFrame.iat), and [`iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) functions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gradebook.iat[1,0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gradebook.iloc[1:4]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gradebook.at[102,'Student Name']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gradebook.loc[:,'Student Name']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many ways to add new columns to a `Dataframe`, and we'll look at a couple different ones here. Let's assume the students turn in their first assignment, and you have graded it. 
One way is to create a `pandas.Series` object exactly as before and use the [`Dataframe.join`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html) function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hw1 = pd.Series([100,95,70,89,65],index=[421,398,990,102,556],name='HW1')\n", "gradebook.join(hw1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that by default, `pandas` functions **do not modify** the dataframe(s) they're called on. So even though we joined `hw1` to `gradebook`, `gradebook` itself remains unchanged:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gradebook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to keep the merged version, simply assign the result to a new variable (or overwrite `gradebook` if desired as we will do below after a couple more examples). Here we'll show a different method for adding the column: using a python dictionary whose keys are the student ID numbers and whose values are the grades. Notice that the order of the grades in the dictionary doesn't matter; each grade is associated with the student ID number, not its position in the dictionary. The dictionary is passed to the `pandas.Series` constructor, and then we use [`Dataframe.assign`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html) to add the column and its column heading." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hw1_dict = {\n", " 421: 100,\n", " 102: 89,\n", " 398: 95,\n", " 990: 70,\n", " 556: 65\n", "}\n", "gradebook.assign(HW1=pd.Series(hw1_dict))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like the `join` function, this method does not modify the gradebook. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gradebook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's say that we have the following grades for HW2 and HW3: only 3 students turned in HW2, and only 4 turned in HW 3. We can add all three dictionaries with a single `assign` call. Here we overwrite the original gradebook with the modified version created by the `assign` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gradebook = gradebook.assign(HW1 = pd.Series(hw1_dict),\n", " HW2 = pd.Series( {421:100,102:88,556:95} ),\n", " HW3 = pd.Series( {421:99,990:45,102:76,556:90}))\n", "\n", "gradebook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In `pandas`, missing data are treated as `NaN` (not a number) by default. Many `pandas` functions also allow you to specify a default value that replaces `NaN`, so in a gradebook you might consider setting the default grade as 0. We won't do that here, because the way `pandas` treats `NaN` values is convenient.\n", "\n", "We can get a summary of the data in the `gradebook` by using the [`Dataframe.describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) function. By default, `pandas` functions **ignore** `NaN` values. You can see this in the `count` row." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gradebook.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's say we want to add columns for the total homework points and the homework percentage. 
We can do that directly from `DataFrame.assign` because we can reference existing columns. The operation uses broadcasting very much like a `numpy.ndarray` to perform a calculation on all rows of the column. Both `pandas.Series` and `pandas.DataFrame` also have methods like `add`, `sub`, `mul`, `div`, `pow`, `sum`, and so on, which are better to use than `+`, `-`, `*`, etc., because they allow you to control explicitly how `NaN` values are treated if you wish. So to calculate the sum of the three homework assignments, we can use the `loc` command to select the three columns we want, and call the [`sum`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html) function. The `axis=1` argument tells the sum operation to add across the columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gradebook.assign(HWTotal = gradebook.loc[:,['HW1','HW2','HW3']].sum(axis=1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following does not do what we want: when using the `+` operator, any `NaN` values will cause the result to be `NaN`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gradebook.assign(HWTotal = gradebook.HW1 + gradebook.HW2 + gradebook.HW3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's add columns for HWTotal and HWPct." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hwtotal = gradebook.loc[:,['HW1','HW2','HW3']].sum(axis=1)\n", "hwpct = hwtotal/3.\n", "gradebook = gradebook.assign(HWTotal = hwtotal,HWPct = hwpct)\n", "gradebook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If Bob now hands in HW3 late and gets 50 points, we can modify the gradebook:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gradebook.at[398,'HW3'] = 50\n", "gradebook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This highlights one key difference between `DataFrame` objects and spreadsheets: the `HWTotal` and `HWPct` columns were not modified when we changed the data in the table. All of the columns are independent of one another. A `DataFrame` is designed to store values, not formulas. So if you need to update other columns, you must recalculate them. In practice, `DataFrame` objects usually don't contain columns that depend on other columns like this. Normally only the raw data (homework scores, exam scores, etc.) would be stored in a `DataFrame`, and then calculations would be performed as needed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a `DataFrame` from real data and data inspection\n", "\n", "The `pandas` package comes with a number of utilities for loading dataframes from common file formats, including csv, xlsx, json, hdf5, and others. The data from the NHANES survey are written in the SAS Transport File Format (xpt), and `pandas` has a built-in routine called [`pandas.read_sas`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sas.html) which can handle this format. Let's begin with [BIOPRO_J.XPT](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/BIOPRO_J.XPT)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bio = pd.read_sas('BIOPRO_J.XPT',index='SEQN') # Use the 'SEQN' field as the index... 
more on this below\n", "#print([len(bio[col]) for col in bio.columns]) #print number of entries in each column\n", "bio.head(10) # print the first 10 rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The meanings of the data columns are given in the [BIOPRO_J documentation](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/BIOPRO_J.htm), where we can see that `SEQN` is the identifier associated with a person in the study. This is why the code above uses that field as the index for the rows. When we look at the other data sets, the same `SEQN` value corresponds to the same study subject, so we can track that individual's data across the different datasets. All of the other column headings and their meanings are explained at that link as well. We can of course use the `DataFrame.describe` function to get basic statistics. Since this dataset has so many columns, it can be convenient to use the [`DataFrame.iloc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) function to select a subset of the columns to display at once.\n", "\n", "Recall that `pandas` by default ignores `NaN` values when performing calculations. When we describe the table, we can see that although there are 6401 rows in each column, there are only ~5900 values in each that are not `NaN`, though the exact count varies by column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bio.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bio.iloc[:,0:10].describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bio.iloc[:,10:20].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When exploring a dataset like this, it is useful to visualize the data in different ways. First, let's work on visualizing the distribution of data in a particular column. To start, we can use the standard `Axes.plot` function from `matplotlib`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from matplotlib import pyplot as plt\n", "\n", "fig, ax = plt.subplots()\n", "\n", "ax.plot(bio.LBXSATSI)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This unfortunately is not a very useful way to visualize the data (though note that when we chose the column to plot, `matplotlib` automatically used the `SEQN` index on the x axis). When you want to see how the values of a dataset are distributed, it's best to use a histogram. Fortunately, histograms are very easy to make using [`Axes.hist`](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.hist.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig,ax = plt.subplots()\n", "n,bins,patches = ax.hist(bio.LBXSATSI,bins=50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This shows the number of data points that fall within a certain range of values. However, one downside to histograms is that the appearance can be rather sensitive to the chosen bin width. The plots below show what happens with 10 bins or 150 bins, each of which produces a different bin width. We're using the [`Series.dropna`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dropna.html) function to exclude any `NaN` values from the dataset."
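] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick check on what `dropna` is doing, we can count the missing values before excluding them. This is just a small sketch; the exact counts depend on the data file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compare the column length before and after dropping missing values;\n", "# the difference should equal the number of NaN entries reported by isna().\n", "col = bio.LBXSATSI\n", "print(len(col), col.isna().sum(), len(col.dropna()))"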
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig,ax = plt.subplots()\n", "n,bins,patches = ax.hist(bio.LBXSATSI.dropna(),bins=10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig,ax = plt.subplots()\n", "n,bins,patches = ax.hist(bio.LBXSATSI.dropna(),bins=150)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A related way to visualize the data distribution is through [kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation), an implementation of which using a Gaussian kernel is available as [`scipy.stats.gaussian_kde`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html). To use this, we feed the `gaussian_kde` function the data we want to model, then create a grid of points where we want to evaluate the density. Below, we create a grid of points that runs from the minimum data value to the maximum, and give it 500 points to evaluate. The result gives a plot that has a smooth shape that follows the contour of the histograms we observed above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import scipy.stats as spstats\n", "\n", "d = bio.LBXSATSI.dropna()\n", "kde = spstats.gaussian_kde(d)\n", "points = np.linspace(d.min(),d.max(),500,endpoint=True)\n", "\n", "fig,ax = plt.subplots()\n", "ax.plot(points,kde.evaluate(points))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The y values here are no longer the number of data points, but rather the [probability density](https://en.wikipedia.org/wiki/Probability_density_function) of finding a certain value in the dataset. The integral under the probability density curve is 1, so there is 100% probability of finding a value somewhere between the minimum value and the maximum value in the dataset, which makes sense.\n", "\n", "We can actually use both kernel density estimates and histograms at the same time. The `Axes.hist` function takes an optional `density` argument. If set to `True`, the heights of the bars in the histogram are rescaled so that the sum of the areas of all the bars adds up to 1. This means we can plot the histogram and the kernel density estimate together." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig,ax = plt.subplots()\n", "ax.plot(points,kde.evaluate(points))\n", "n,bins,edges = ax.hist(d,bins=50,density=True,alpha=0.35)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This will be useful enough that we can write a function to generate a plot like this. Feel free to customize the appearance as you prefer." 
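] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One knob worth knowing about before we wrap this up into a function: `gaussian_kde` accepts an optional `bw_method` argument (a rule name such as `'scott'` or `'silverman'`, or a scalar factor) that controls the kernel bandwidth, and the smoothness of the curve depends on it in much the same way a histogram depends on its bin width. The cell below is just a sketch comparing a few arbitrary choices, reusing `d` and `points` from above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: compare a few bandwidth choices for the same data.\n", "# 'scott' is the default rule; the scalar factors are arbitrary, chosen for illustration.\n", "fig, ax = plt.subplots()\n", "for bw in ['scott', 0.1, 0.5]:\n", "    kde_bw = spstats.gaussian_kde(d, bw_method=bw)\n", "    ax.plot(points, kde_bw.evaluate(points), label='bw_method={}'.format(bw))\n", "ax.legend()"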
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def make_hist_kde(ax : plt.Axes, d : pd.Series, c : str='#004400', left : bool = False):\n", " #customize ticks appearance\n", " ax.tick_params(axis='both',which='both',direction='in',bottom=True,top=True,left=True,right=True)\n", " ax.minorticks_on()\n", " \n", " #show grid\n", " ax.grid(True,which='minor',linestyle=':',alpha=0.6)\n", " ax.grid(True,which='major',alpha=0.6)\n", " \n", " #make kde\n", " kde = spstats.gaussian_kde(d)\n", " points = np.linspace(d.min(),d.max(),len(d)//150,endpoint=True)\n", " if left:\n", " ax.plot(kde.evaluate(points),points,color=c)\n", " ax.hist(d,bins=len(d)//150,density=True,alpha=0.15,color=c,ec=c+'22',orientation='horizontal')\n", " ax.set_ylabel(d.name)\n", " else:\n", " ax.plot(points,kde.evaluate(points),color=c)\n", " ax.hist(d,bins=len(d)//150,density=True,alpha=0.15,color=c,ec=c+'22')\n", " ax.set_xlabel(d.name)\n", " \n", "fig,ax = plt.subplots()\n", "make_hist_kde(ax,bio.LBXSATSI.dropna())\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(3,3,figsize=(8,8))\n", "datastart = 16\n", "for i,ax in enumerate(axes.flatten()):\n", " make_hist_kde(ax,bio.iloc[:,datastart+i].dropna())\n", " \n", "fig.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Scatter plots](https://en.wikipedia.org/wiki/Scatter_plot) are also frequently used to look for correlations in data. The [`Axes.scatter`](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.scatter.html) function in `matplotlib` generates this type of plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "\n", "ax.scatter(bio.LBXSIR,bio.LBXSLDSI)\n", "ax.set_xlabel(bio.LBXSIR.name)\n", "ax.set_ylabel(bio.LBXSLDSI.name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often scatter plots are combined with histograms for each axis so that the distribution for each individual data set is easier to visualize. We can do this by making a 2x2 grid of plots, placing our histograms with kernel density estimates on the upper left `[0,0]` and lower right `[1,1]` plots, and the scatter plot on the lower left `[1,0]` plot. To adjust the relative plot sizes, we can use the `gridspec_kw` argument to [`pyplot.subplots`](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplots.html), which allows us to control the [`Gridspec`](https://matplotlib.org/api/_as_gen/matplotlib.gridspec.GridSpec.html) of the figure layout. We also use [`Figure.subplots_adjust`](https://matplotlib.org/api/_as_gen/matplotlib.figure.Figure.html#matplotlib.figure.Figure.subplots_adjust) to control the width and height spacing between the subplots." 
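] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Building this layout by hand, as we do in the next cell, gives complete control over the appearance. As an aside, if the optional [`seaborn`](https://seaborn.pydata.org/) package happens to be installed, its `jointplot` function produces a similar scatter-plus-marginal-histograms figure in one call. The sketch below assumes `seaborn` is available; it is not needed for anything later in the notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional alternative using seaborn; skip this cell if seaborn is not installed.\n", "import seaborn as sns\n", "sns.jointplot(x='LBXSIR', y='LBXSLDSI', data=bio)"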
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig,axes = plt.subplots(2,2, figsize=(6,6),\n", " gridspec_kw={'width_ratios' : [4,1],\n", " 'height_ratios' : [1,4]})\n", "\n", "fig.subplots_adjust(hspace=0.05,wspace=0.05)\n", "\n", "axes[0,1].axis('off')\n", "\n", "c = '#004400'\n", "\n", "make_hist_kde(axes[0,0],bio.LBXSIR.dropna(),c=c)\n", "make_hist_kde(axes[1,1],bio.LBXSLDSI.dropna(),c=c,left=True)\n", "\n", "for ax in [axes[0,0],axes[1,1]]:\n", " ax.tick_params(axis='both',which='both',bottom=False,left=False,right=False,top=False)\n", " ax.set_xticks([])\n", " ax.set_yticks([])\n", " ax.set_xlabel('')\n", " ax.set_ylabel('')\n", " for dir in ['top','bottom','right','left']:\n", " ax.spines[dir].set_visible(False)\n", "\n", "\n", "axes[1,0].scatter(bio.LBXSIR,bio.LBXSLDSI,c=c,s=2)\n", "axes[1,0].set_xlabel(bio.LBXSIR.name)\n", "axes[1,0].set_ylabel(bio.LBXSLDSI.name)\n", "axes[1,0].spines['top'].set_visible(True)\n", "axes[1,0].spines['right'].set_visible(True)\n", "axes[1,0].tick_params(axis='both',which='both',direction='in',bottom=True,top=True,left=True,right=True)\n", "axes[1,0].minorticks_on()\n", "\n", "#show grid\n", "axes[1,0].grid(True,which='minor',linestyle=':',alpha=0.6)\n", "axes[1,0].grid(True,which='major',alpha=0.6)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Built-in `pandas` plots\n", "\n", "Alternatively, `pandas` has convenience functions for generating many of the `matplotlib` plot types. For the most part, these are just wrappers around standard `matplotlib` plots, so all arguments that you could use in `matplotlib` for that plot type are available still, and there are extra parameters that `pandas` provides for making plot generation easier. For instance, we could make histograms with the [`Dataframe.hist`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html) function. An example is shown in the next cell. Notice that the `matplotlib.Axes` objects that are generated by the `pandas` plotting routines are returned, so you can grab them and use them to customize the plot appearance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bio.hist(['LBDSGLSI','LBDSIRSI'],bins=50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is also a built-in method for generating kernel density plots, and multiple types of graphs can be plotted at the same time. The plotting methods take an `ax` argument that specifies which `matplotlib.Axes` object the data should be plotted on." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax = bio['LBDSIRSI'].plot.kde(c='#004400')\n", "bio['LBDSIRSI'].plot.hist(ax=ax,bins=100,density=True,alpha=0.15,\n", " color='#004400',ec='#00440022')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Merging dataframes and selecting subframes\n", "\n", "The NHANES data track individuals by `SEQN` across the various files. With `pandas` it is easy for us to load in these multiple files and make sure that all the data are associated with the correct individual. First, let's load in the datasets into their own dataframes, and take a look at them." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bio = pd.read_sas('BIOPRO_J.XPT',index='SEQN')\n", "demo = pd.read_sas('DEMO_J.XPT',index='SEQN')\n", "bmx = pd.read_sas('BMX_J.XPT',index='SEQN')\n", "diet = pd.read_sas('DR1TOT_J.XPT',index='SEQN')\n", "bp = pd.read_sas('BPX_J.XPT',index='SEQN')\n", "\n", "print(bio.columns,demo.columns,bmx.columns,diet.columns)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "demo.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bmx.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One thing we can notice right away is that the different files do not have the exact same individuals. The `demo` table has at least 9254 entries not counting any potential `NaN` values, while the `bmx` table only has 8704. This means that there are at least 550 entries in `demo` that do not appear in `bmx` (and it could also be the case that some entries in `bmx` are not in `demo`). Meanwhile, we already know that the `bio` table has only 6401 entries.\n", "\n", "When combining these dataframes together, it is important to make sure that all the data are associated with the correct `SEQN`, and that we handle cases when the columns have different numbers of rows. The [`Dataframe.merge`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) command makes this easy. If you have ever used an SQL-style database, this operation behaves like a database `JOIN`. Essentially the merge command, as we are using it, creates a single table that has all of the columns in the two tables that we merged.\n", "\n", "We will first merge the `bio` data with the `demo` data. When we call `bio.merge(demo)`, `bio` is considered the \"left\" table and demo is considered the \"right\" table. The `on` argument tells what column is used as a reference key for determining which data values go together; the column must appear in both dataframes. We choose `SEQN` because that is the identifier shared between the tables. The `how` argument determines what rows are kept in the merge. The options are:\n", "\n", "- 'left' keeps only the rows that appear in the left dataframe. Any rows in the right dataframe that don't match one of the entries in the `on` column are thrown out.\n", "- 'right' is the opposite of left: it keeps only rows that appear in the right dataframe\n", "- 'outer' keeps all rows, and fills in any missing column data with `NaN`. So if a row in the left table doesn't have a match in the right table, that row will be added to the right table and filled with `NaN`.\n", "- 'inner' keeps only those rows that appear in both tables, and any rows that appear in only one table are discarded.\n", "\n", "By default, the order of the keys in the left dataframe is preserved. Let's compare an inner join with an outer join. The inner join keeps only rows that appear in both dataframes. Since the `bio` frame only has 6401 entries, the maximum possible number of entries in the merged dataframe is 6401, if every `SEQN` value in `bio` has a match in `demo`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = bio.merge(demo,how='inner',on='SEQN')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The merged dataframe has 6401 rows, so this does mean that every row in `bio` had a match in `demo`. 
The inner merge is therefore equivalent to a left merge, as we can show." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = bio.merge(demo,how='left',on='SEQN')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If instead we do an outer merge, we get a table that has a row for every entry in either dataframe, in this case yielding a total of 9254 rows." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = bio.merge(demo,how='outer',on='SEQN')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You'll notice that the entries at the bottom of the table contain `NaN` for the 'LBX...' values. Because the order of keys in the left table is preserved, all the rows that only appeared in the right table are included at the end of the merged table. Since these didn't have any corresponding entries in the left table, the columns from the left table are filled with `NaN`. We can visualize the order of the data by just plotting the `SEQN` value in the merged table. The first portion of the line shows the entries that were originally in the left table; the rest are the entries that didn't have matches in the left table." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = bio.merge(demo,how='outer',on='SEQN')\n", "fig,ax = plt.subplots()\n", "ax.plot(df.index)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can merge together the other three dataframes to have one master frame with all of the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = df.merge(bmx,how='outer',on='SEQN')\n", "df = df.merge(diet,how='outer',on='SEQN')\n", "df = df.merge(bp,how='outer',on='SEQN')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With this master dataset, we can compare data across different categories. For instance, we can make a chart of height vs. age. From consulting the NHANES documentation ([DEMO_J](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.htm), [BMX_J](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/BMX_J.htm)), we can see that the key for age in years is `RIDAGEYR` and the key for height in cm is `BMXHT`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.plot.scatter(x='RIDAGEYR',y='BMXHT',s=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often with a dataset like this you would like to select a subset to work with. For instance, let's say we want to look at the heights of adults between the ages of 18 and 65. Using [boolean indexing](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#boolean-indexing) we can create subtables where one or more conditions are true. The resulting table can be used for further visualization or analysis."
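] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's also worth noting that `pandas` offers equivalent shorthands for this kind of range selection: `Series.between` expresses both bounds in one call, and `DataFrame.query` takes the condition as a string. The sketch below should select the same rows as the boolean-indexed expression in the next cell; we'll stick with explicit boolean indexing in the rest of the notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Two equivalent ways to select subjects aged 18-65 (compare the row counts).\n", "adults_between = df[df.RIDAGEYR.between(18, 65)]\n", "adults_query = df.query('18 <= RIDAGEYR <= 65')\n", "print(len(adults_between), len(adults_query))"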
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[df.RIDAGEYR.ge(18) & df.RIDAGEYR.le(65)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig,ax = plt.subplots()\n", "make_hist_kde(ax, df[df.RIDAGEYR.ge(18) & df.RIDAGEYR.le(65)].BMXHT.dropna())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "htage = df[df.RIDAGEYR.ge(18) & df.RIDAGEYR.le(79)].loc[:,['BMXHT','RIDAGEYR']]\n", "htage.plot.scatter(x='RIDAGEYR',y='BMXHT',s=2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "htage.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before moving on, it is important to understand how the boolean indexing above works so that you know how to make selections correctly. When setting a condition, `pandas` returns a boolean series containing `True` and `False` values to indicate the result of the test. Rows with `True` are retained when using the condition as an index, and rows with `False` are dropped.\n", "\n", "We can look at the result of the first condition:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "df.RIDAGEYR.ge(18)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This series tells whether the subject's age is >= 18. The second condition is age <= 65. To show both tests at the same time, we can create a new dataframe that merges both of these boolean series together:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(df.RIDAGEYR.ge(18)).merge(df.RIDAGEYR.le(65),on='SEQN',how='outer')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our goal was to get only the rows where both tests are true. We did this with the bitwise \"and\" operator `&`. When applied to `numpy.ndarray` or `pandas.Series` objects, the operator does an element-by-element logical \"AND\" operation, which returns true only if both arguments are true. It's this result that we used to select the final rows that appeared in the final dataframe." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.RIDAGEYR.ge(18) & df.RIDAGEYR.le(65)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Categorical data\n", "\n", "When data in a column can only be one of a limited number of values, it is often useful to represent them as categorical data rather than numeric or string data. In our current dataset, much of the demographic data, including gender and race, are coded as numbers that represent categories. For instance, [gender (`RIAGENDR`)](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.htm#RIAGENDR) is coded as 1 for male and 2 for female, and in the current dataset those are the only two categories. The column [`RIDRETH3`](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.htm#RIDRETH3) codes for race/ethnicity using values 1-7. Since the numbers themselves don't have any meaning, it is more useful to tell `pandas` that the numbers represent categories of data. This will allow for more meaningful grouping of data, and enables some plotting methods to automatically generate graphs or perform analysis broken down by category if desired. In `pandas`, this is done using the [`pandas.Categorical`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.html) datatype. 
A more complete [overview](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html) of categorical data is also available in the `pandas` User Guide.\n", "\n", "We'll set up categorical data for the `RIAGENDR` and `RIDRETH3` columns. First, we will tell `pandas` that we want to convert the `RIAGENDR` column to categorical data using the [`Series.astype`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.astype.html) function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.RIAGENDR.astype('category')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This returns a new version of the column in which `pandas` recognizes that there are 2 categories: 1.0 and 2.0. Note, however, that it does not modify the original dataframe:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.RIAGENDR.dtype" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we want to tell `pandas` what the categories mean. We can use [`Series.cat.rename_categories`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.cat.rename_categories.html) to accomplish this, just passing the new category names." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gendercat = df.RIAGENDR.astype('category')\n", "gendercat = gendercat.cat.rename_categories(['Male','Female'])\n", "gendercat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've configured the categories the way we like, we can reassign the `RIAGENDR` column with the categorical series." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['RIAGENDR'] = gendercat\n", "df.RIAGENDR" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's do the same with the `RIDRETH3` column." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "eth = df.RIDRETH3.astype('category')\n", "eth = eth.cat.rename_categories(['Mexican American','Other Hispanic',\n", " 'Non-Hispanic White','Non-Hispanic Black',\n", " 'Non-Hispanic Asian','Other'])\n", "df['RIDRETH3'] = eth\n", "df.RIDRETH3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the category data in place, we can easily generate plots of height broken down by category. First up, some [box plots](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html), which show the median, the 25th and 75th percentiles (the box), the data range (by default, the whiskers extend to the furthest points within 1.5*(Q3-Q1) of the box), and any outlying points. The `by` argument breaks down the data by value, which is especially convenient for categorical columns. Here we can plot the heights of people ages 18-65 broken down by gender or race/ethnicity."
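] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick numerical companion to the plots that follow, the same categorical breakdown works for summary statistics through `groupby`, which is discussed a bit more below. This is just a sketch using the columns we have already set up." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Summary statistics of adult height, grouped by the categorical gender column.\n", "adults = df[df.RIDAGEYR.ge(18) & df.RIDAGEYR.le(65)]\n", "adults.groupby('RIAGENDR').BMXHT.describe()"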
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[df.RIDAGEYR.ge(18) & df.RIDAGEYR.le(65)].boxplot(column='BMXHT',by='RIAGENDR')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[df.RIDAGEYR.ge(18) & df.RIDAGEYR.le(65)].boxplot(column='BMXHT',by='RIDRETH3',figsize=(15,6))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `by` argument to a histogram plot by default generates a separate plot for each category, but using the [`DataFrame.groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) function we can make a view of the dataframe broken down by category. We then pull out the `BMXHT` column and create the histogram, which shows the two categories plotted on the same axis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[df.RIDAGEYR.ge(18) & df.RIDAGEYR.le(65)].groupby('RIAGENDR').BMXHT.plot.hist(bins=50,alpha=0.5)\n", "plt.legend()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[df.RIDAGEYR.ge(18) & df.RIDAGEYR.le(65)].groupby('RIDRETH3').BMXHT.plot.hist(bins=50,alpha=0.3,figsize=(8,8))\n", "plt.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `groupby` function can also take multiple columns, creating further breakdowns in the data. The plot below is hard to read, and would need further tweaking to be presentable, but hopefully this gives you an idea of the capabilities." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[df.RIDAGEYR.ge(18) & df.RIDAGEYR.le(65)].groupby(['RIDRETH3','RIAGENDR']).BMXHT.plot.hist(bins=50,alpha=0.3,figsize=(12,8))\n", "plt.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading/writing DataFrames\n", "\n", "Since we have done the work to load in multiple data files and set up categories, we should save the dataframe we're working with to disk. In the next notebook, we'll pick up where we left off by reading in this file.\n", "\n", "There are many ways to export dataframes. If you want to open them in Excel later, exporting as csv with [`DataFrame.to_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html) is one option, or you can write directly to Excel using [`DataFrame.to_excel`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html). Each file format has advantages and limitations; for instance, if we write to a csv file and read it back in, the columns we set as categorical will not be loaded in as categorical (the csv file doesn't have a way to record the type of the column, so when it is read, `pandas` just has to read the data in as strings or numbers). Perhaps the file format that provides the best mix of high efficiency, fast access, full-featured storage, and ease of loading is HDF5, which can be written with [`DataFrame.to_hdf`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_hdf.html) and read with [`pandas.read_hdf`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_hdf.html).\n", "\n", "To store the `df` dataframe to a file, we call `to_hdf`, giving it a filename, a key, and telling it to use the \"table\" format. HDF5 files can store multiple objects, and the key is used to identify the object in the file. 
If we wished, we could store several dataframes in the same file; we would just need to add `append=True` to the arguments for subsequent `to_hdf` calls and give each object its own key.\n", "\n", "When we later read the data, we just need to tell `pandas.read_hdf` the filename and the key we want to read, as shown below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.to_hdf('nhanes.hd5',key='df',format='table')\n", "df2 = pd.read_hdf('nhanes.hd5','df')\n", "df2.RIAGENDR" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }