{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# B13: Tidy data and split-apply-combine\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Colab setup ------------------\n", "import os, sys\n", "if \"google.colab\" in sys.modules:\n", " data_path = \"https://biocircuits.github.io/chapters/data/\"\n", "else:\n", " data_path = \"data/\"\n", "# ------------------------------\n", "\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "In the last lesson, we learned about Pandas and dipped our toe in to see its power. In this lesson, we will continue to harness the power of Pandas to pull out subsets of data we are interested in." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tidy data\n", "\n", "[Hadley Wickham](https://en.wikipedia.org/wiki/Hadley_Wickham) wrote a [great article](http://dx.doi.org/10.18637/jss.v059.i10) in favor of \"tidy data.\" Tidy data frames follow the rules:\n", "\n", "1. Each variable is a column.\n", "2. Each observation is a row.\n", "3. Each type of observation has its own separate data frame.\n", "\n", "This is less pretty to visualize as a table, but we rarely look at data in tables. Indeed, the representation of data which is convenient for visualization is different from that which is convenient for analysis. A tidy data frame is almost always **much** easier to work with than non-tidy formats.\n", "\n", "You may raise some objections about tidy data. Here are a few, and my responses.\n", "\n", "**Objection:** Looking at a table of tidy data is ugly. It is not intuitively organized. I would almost never display a tidy data table in a publication.\n", "\n", "**Response:** Correct! Having tabular data in a format that is easy to read as a human studying a table is a very different thing than having it in a format that is easy to explore and work with using a computer. As [Daniel Chen](https://chendaniely.github.io) put it, \"There are data formats that are better for reporting and data formats that are better for analysis.\" We are using the tidy data frames for analysis, not reporting (though we will see in the coming lessons that having the data in a tidy format makes making plots much easier, and plots are a key medium for reporting.)\n", "\n", "
\n", "\n", "**Objection:** Isn't it better to sometimes have data arranged in other ways? Say in a matrix?\n", "\n", "**Response:** This is certainly true for things like images, or raster-style data in general. It makes more sense to organize an image in a 2D matrix than to have it organized as a data frame with three columns (row in image, column in image, intensity of pixel), where each row corresponds to a single pixel. For an image, indexing it by row and column is always unambiguous, `my_image[i, j]` means the pixel at row `i` and column `j`.\n", "\n", "For other data, though, the matrix layout suffers from the fact that there may be more than one way to construct a matrix. If you know a data frame is tidy, you already know its structure. You need only to ask what the columns are, and then you immediately know how to access data using Boolean indexing. In other formats, you might have to read and write extensive comments to understand the structure of the data. Of course, you *can* read and write comments, but it opens the door for the possibility of misinterpretation or mistakes.\n", "\n", "
\n", "\n", "**Objection:** But what about time series? Clearly, that can be in matrix format. One column is time, and then subsequent columns are observations made at that time.\n", "\n", "**Response:** Yes, that is true. But then the matrix-style described could be considered tidy, since each row is a single observation (time point) that has many facets.\n", "\n", "
\n", "\n", "**Objection:** Isn't this an inefficient use of memory? There tend to be lots of repeated entries in tidy data frames.\n", "\n", "**Response:** Yes, there are more efficient ways of storing and accessing data. But for data sets that are not \"big data,\" this is seldom a real issue. The extra expense in memory, as well as the extra expense in access, are small prices to pay for the simplicity and speed of the human user in accessing the data.\n", "\n", "
\n", "\n", "**Objection:** Once it's tidy, we pretty much have to use Boolean indexing to get what we want, and that can be slower than other methods of accessing data. What about performance?\n", "\n", "**Response:** See the previous response. Speed of access really only becomes a problem with big, high-throughput data sets. In those cases, there are often many things you need to be clever about beyond organization of your data.\n", "\n", "
\n", "\n", "**Conclusion:** I really think that tidying a data set allows for fluid exploration. We will focus on tidy data sets going forward. The techniques for bringing untidy data into tidy format use many of Pandas's functions, but are largely beyond the scope of this bootcamp. You will explore that a little bit in the exercises, but for most of the bootcamp, our data sets are already tidy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The data set\n", "\n", "We will use the same data set as in the [previous section](b12_intro_to_pandas.ipynb)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
strainCFPYFP
0m2224381409
1m2223161391
2m2225211511
3m2226461460
4m2228301638
\n", "
" ], "text/plain": [ " strain CFP YFP\n", "0 m22 2438 1409\n", "1 m22 2316 1391\n", "2 m22 2521 1511\n", "3 m22 2646 1460\n", "4 m22 2830 1638" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(os.path.join(data_path, \"elowitz_et_al_2002_fig_3a.csv\"))\n", "\n", "# Take a look\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This data set is in tidy format. Each row represents a measurement of a single cell. Each cell is from a specific strain and has an associated CFP and YFP fluorescent intensity. We already saw the power of having the data in this format when we did Boolean indexing in the last lesson. Now, we will see how this format allows use to easily do an operation we do again and again with data sets, **split-apply-combine**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Split-apply-combine\n", "\n", "Let's say we want to compute the median CFP intensity separately for the m22 and d22 strains. Ignoring for the second the mechanics of how we would do this with Python, let's think about it in English. What do we need to do?\n", "\n", "1. **Split** the data set up according to the `'strain'` field, i.e., split it up so we have a separate data set for the two classes.\n", "2. **Apply** a median function to the activity in these split data sets.\n", "3. **Combine** the results of these averages on the split data set into a new, summary data set that contains the two classes and medians for each.\n", "\n", "We see that the strategy we want is a **split-apply-combine** strategy. This idea was put forward by Hadley Wickham in [this paper](http://dx.doi.org/10.18637/jss.v040.i01). It turns out that this is a strategy we want to use *very* often. Split the data in terms of some criterion. Apply some function to the split-up data. Combine the results into a new data frame.\n", "\n", "Note that if the data are tidy, this procedure makes a lot of sense. Choose the column you want to use to split by. All rows with like entries in the splitting column are then grouped into a new data set. You can then apply any function you want into these new data sets. You can then combine the results into a new data frame.\n", "\n", "Pandas's split-apply-combine operations are achieved using the `groupby()` method. You can think of `groupby()` as the splitting part. You can then apply functions to the resulting `DataFrameGroupBy` object. The [Pandas documentation on split-apply-combine](http://pandas.pydata.org/pandas-docs/stable/groupby.html) is excellent and worth reading through. It is extensive though, so don't let yourself get intimidated by it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Aggregation: Median percent correct\n", "\n", "Let's go ahead and do our first split-apply-combine on this tidy data set. First, we will split the data set up by strain." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped = df.groupby(\"strain\")\n", "\n", "# Take a look\n", "grouped" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is not much to see in the `DataFrameGroupBy` object that resulted. But there is a lot we can do with this object. Typing `grouped.` and hitting tab will show you the many possibilities. For most of these possibilities, the apply and combine steps happen together and a new data frame is returned. The `grouped.median()` method is exactly what we want." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CFPYFP
strain
d222646.01869.5
m222597.51414.0
\n", "
" ], "text/plain": [ " CFP YFP\n", "strain \n", "d22 2646.0 1869.5\n", "m22 2597.5 1414.0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_median = grouped.median()\n", "\n", "# Take a look\n", "df_median" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The outputted data frame has the medians of all quantities, including the CFP intensity that we wanted. Note that this data frame has `strain` as the name of the row index. If we want to instead keep `strain` (which, remember, is what we used to split up the data set before we computed the summary statistics) as a column, we can use the `reset_index()` method." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
strainCFPYFP
0d222646.01869.5
1m222597.51414.0
\n", "
" ], "text/plain": [ " strain CFP YFP\n", "0 d22 2646.0 1869.5\n", "1 m22 2597.5 1414.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_median.reset_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note, though, that this was not done in-place. `df_median` still has an index labeled `'strain'`. If you want to update your data frame, you have to explicitly do so with an assignment operator." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "df_median = df_median.reset_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This type of operation is called an **aggregation**. That is, we split the data set up into groups, and then computed a summary statistic for each group, in this case the median. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transformation\n", "\n", "Instead of summarizing data in a group with single summary statistics by aggregation, we can also do a **transformation** in which each row gets a new entry within a given group. As a simple example, we could generate a column that gives the rank of each cell in terms of CFP intensity for each strain. That is, we perform a rank ordering within the m22 strain and within the d22 strain." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 53.5\n", "1 24.0\n", "2 85.0\n", "3 145.5\n", "4 219.0\n", " ... \n", "529 27.0\n", "530 25.0\n", "531 41.0\n", "532 183.0\n", "533 10.0\n", "Name: CFP, Length: 534, dtype: float64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped['CFP'].rank()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gave us a column of ranks with the indexing of the original data frame preserved. We can put this column into the data frame." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
strainCFPYFPrank grouped by strain
0m222438140953.0
1m222316139124.0
2m222521151185.0
3m2226461460145.0
4m2228301638219.0
\n", "
" ], "text/plain": [ " strain CFP YFP rank grouped by strain\n", "0 m22 2438 1409 53.0\n", "1 m22 2316 1391 24.0\n", "2 m22 2521 1511 85.0\n", "3 m22 2646 1460 145.0\n", "4 m22 2830 1638 219.0" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['rank grouped by strain'] = grouped['CFP'].rank(method='first')\n", "\n", "# Take a look\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To verify that this worked correctly, and also to show some nice sorting properties of data frames, we will sort the data frame by `strain` and then by `CFP` and make sure the ranks worked accordingly." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
strainCFPYFPrank grouped by strain
420d22186815641.0
290d22188218332.0
377d22188317073.0
291d22189014744.0
287d22190815065.0
...............
42m2229331479246.0
91m2229471486247.0
205m2229511514248.0
201m2229631509249.0
189m2230231568250.0
\n", "

534 rows × 4 columns

\n", "
" ], "text/plain": [ " strain CFP YFP rank grouped by strain\n", "420 d22 1868 1564 1.0\n", "290 d22 1882 1833 2.0\n", "377 d22 1883 1707 3.0\n", "291 d22 1890 1474 4.0\n", "287 d22 1908 1506 5.0\n", ".. ... ... ... ...\n", "42 m22 2933 1479 246.0\n", "91 m22 2947 1486 247.0\n", "205 m22 2951 1514 248.0\n", "201 m22 2963 1509 249.0\n", "189 m22 3023 1568 250.0\n", "\n", "[534 rows x 4 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_sorted = df.sort_values(by=[\"strain\", \"CFP\"])\n", "\n", "# Take a look\n", "df_sorted" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Indeed it worked!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregating and transforming with custom functions\n", "\n", "Let's say we want to compute the coefficient of variation (CoV, the standard deviation divided by the mean) of data in columns of groups in the data frame. There is no built-in function to do this. We have to write our own function to compute the CoV and then use it with the `agg()` method of a `DataFrameGroupBy` object. In the function below, the values of each column are denoted by `data`.\n", "\n", "To compute the coefficient of variation, we will use one more Numpy function beyond `np.mean()` that you have already seen, `np.std()`." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def coeff_of_var(data):\n", " \"\"\"Compute coefficient of variation from an array of data.\"\"\"\n", " return np.std(data) / np.mean(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can apply it as an aggregating function, omitting the `'strain'` column because it is categorical and we cannot compute aggregate statistics." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CFPYFP
strain
d220.1161010.113380
m220.0737680.080851
\n", "
" ], "text/plain": [ " CFP YFP\n", "strain \n", "d22 0.116101 0.113380\n", "m22 0.073768 0.080851" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cols = df.columns[(df.columns != \"strain\") & (df.columns != \"rank grouped by strain\")]\n", "grouped[cols].agg(coeff_of_var)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looping over a GroupBy object\n", "\n", "While the `GroupBy` methods we have learned so far (like `transform()` and `agg()`) are useful and lead to concise code, we sometimes want to loop over the groups of a `GroupBy` object. This often comes up in plotting applications. As an example, I will compute the mean CFP fluorescent intensity for each strain." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "d22 : 2646.0\n", "m22 : 2597.5\n" ] } ], "source": [ "for strain, group in df.groupby(\"strain\"):\n", " print(strain, ': ', group[\"CFP\"].median())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By using the `GroupBy` object as an iterator, it yields the name of the group (which I assigned as `strain`) and the corresponding sub-data frame (which I assigned `group`)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Normalizing the intensities\n", "\n", "With all of these skills in place, let us now do a useful calculation and normalize the fluorescent intensities for each fluorophore in each strain." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
strainCFPYFPrank grouped by strainnorm CFPnorm YFP
0m222438140953.00.9403761.000676
1m222316139124.00.8933190.987892
2m222521151185.00.9723901.073117
3m2226461460145.01.0206051.036896
4m2228301638219.01.0915771.163313
\n", "
" ], "text/plain": [ " strain CFP YFP rank grouped by strain norm CFP norm YFP\n", "0 m22 2438 1409 53.0 0.940376 1.000676\n", "1 m22 2316 1391 24.0 0.893319 0.987892\n", "2 m22 2521 1511 85.0 0.972390 1.073117\n", "3 m22 2646 1460 145.0 1.020605 1.036896\n", "4 m22 2830 1638 219.0 1.091577 1.163313" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# group_keys = False necessary keep index of original df\n", "df[[\"norm CFP\", \"norm YFP\"]] = df.groupby(\"strain\", group_keys=False)[\n", " [\"CFP\", \"YFP\"]\n", "].apply(lambda x: x / x.mean())\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing environment" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python implementation: CPython\n", "Python version : 3.9.16\n", "IPython version : 8.10.0\n", "\n", "numpy : 1.23.5\n", "pandas : 1.5.3\n", "jupyterlab: 3.5.3\n", "\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -v -p numpy,pandas,jupyterlab" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.9" } }, "nbformat": 4, "nbformat_minor": 4 }