{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data wrangling and visualization with pandas, seaborn, and matplotlib" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lesson preamble\n", "\n", "### Learning Objectives\n", "\n", "- Understand the split-apply-combine concept for data analysis.\n", " - Use `groupby()`, `mean()`, `agg()` and `size()` to apply this technique.\n", "- Produce scatter plots, line plots, and histograms using `seaborn` and `matplotlib`.\n", "- Set universal plot settings.\n", "- Understand and apply grids for faceting in `seaborn`.\n", "\n", "### Lesson outline\n", "\n", "- Split-apply-combine techniques in `pandas`\n", " - Using `mean()` to summarize categorical data (20 min)\n", " - Using `size()` to summarize categorical data (10 min)\n", "- Data visualization with `matplotlib` and `seaborn` (10 min)\n", " - Visualizing one quantitative variable with multiple categorical variables (50 min)\n", " - Visualizing the relationship of two quantitative variable with multiple categorical variables (40min)\n", "- Split-apply-combine... plot! (20 min)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
record_idmonthdayyearplot_idspecies_idsexhindfoot_lengthweightgenusspeciestaxaplot_type
3478126966102519977PLM20.016.0PeromyscusleucopusRodentRodent Exclosure
3478227185112219977PLF21.022.0PeromyscusleucopusRodentRodent Exclosure
34783277925219987PLF20.08.0PeromyscusleucopusRodentRodent Exclosure
3478428806112119987PXNaNNaNNaNChaetodipussp.RodentRodent Exclosure
34785309867120007PXNaNNaNNaNChaetodipussp.RodentRodent Exclosure
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
genussexweight
0AmmospermophilusMNaN
1BaiomysF9.161290
2BaiomysM7.357143
3ChaetodipusF23.763824
4ChaetodipusM24.712219
5DipodomysF55.244360
6DipodomysM56.243034
7NeotomaF154.282209
8NeotomaM165.652893
9OnychomysF26.780959
10OnychomysM26.246466
11PerognathusF8.574803
12PerognathusM8.204182
13PeromyscusF22.491649
14PeromyscusM20.644279
15ReithrodontomysF11.220080
16ReithrodontomysM10.159941
17SigmodonF71.696000
18SigmodonM61.336842
19SpermophilusF57.000000
20SpermophilusM130.000000
\n", "
" ], "text/plain": [ " genus sex weight\n", "0 Ammospermophilus M NaN\n", "1 Baiomys F 9.161290\n", "2 Baiomys M 7.357143\n", "3 Chaetodipus F 23.763824\n", "4 Chaetodipus M 24.712219\n", "5 Dipodomys F 55.244360\n", "6 Dipodomys M 56.243034\n", "7 Neotoma F 154.282209\n", "8 Neotoma M 165.652893\n", "9 Onychomys F 26.780959\n", "10 Onychomys M 26.246466\n", "11 Perognathus F 8.574803\n", "12 Perognathus M 8.204182\n", "13 Peromyscus F 22.491649\n", "14 Peromyscus M 20.644279\n", "15 Reithrodontomys F 11.220080\n", "16 Reithrodontomys M 10.159941\n", "17 Sigmodon F 71.696000\n", "18 Sigmodon M 61.336842\n", "19 Spermophilus F 57.000000\n", "20 Spermophilus M 130.000000" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "grouped_surveys['weight'].agg(np.mean).reset_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This general approach is more flexible and powerful since multiple aggregation functions can be applied in the same line of code by passing them as a list to `agg()`. For instance, the standard deviation and mean could be computed in the same call." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
meanstd
genussex
AmmospermophilusMNaNNaN
BaiomysF9.1612902.237510
M7.3571430.841897
ChaetodipusF23.7638247.973696
M24.71221910.303329
DipodomysF55.24436029.657217
M56.24303429.008498
NeotomaF154.28220939.186546
M165.65289348.991563
OnychomysF26.7809596.269802
M26.2464666.360828
PerognathusF8.5748034.123303
M8.2041823.238490
PeromyscusF22.4916494.850259
M20.6442793.935623
ReithrodontomysF11.2200802.604365
M10.1599411.760459
SigmodonF71.69600028.241820
M61.33684220.418291
SpermophilusF57.000000NaN
M130.000000NaN
\n", "
" ], "text/plain": [ " mean std\n", "genus sex \n", "Ammospermophilus M NaN NaN\n", "Baiomys F 9.161290 2.237510\n", " M 7.357143 0.841897\n", "Chaetodipus F 23.763824 7.973696\n", " M 24.712219 10.303329\n", "Dipodomys F 55.244360 29.657217\n", " M 56.243034 29.008498\n", "Neotoma F 154.282209 39.186546\n", " M 165.652893 48.991563\n", "Onychomys F 26.780959 6.269802\n", " M 26.246466 6.360828\n", "Perognathus F 8.574803 4.123303\n", " M 8.204182 3.238490\n", "Peromyscus F 22.491649 4.850259\n", " M 20.644279 3.935623\n", "Reithrodontomys F 11.220080 2.604365\n", " M 10.159941 1.760459\n", "Sigmodon F 71.696000 28.241820\n", " M 61.336842 20.418291\n", "Spermophilus F 57.000000 NaN\n", " M 130.000000 NaN" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Multiple aggregation functions are passed as a list, hence the square brackets\n", "grouped_surveys['weight'].agg ([np.mean, np.std])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Any function can be passed like this, including custom personal functions. For many common aggregation functions, `pandas` allows to pass a string with the function name as a convenience." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
meanmediancount
genussex
AmmospermophilusMNaNNaN0
BaiomysF9.1612909.031
M7.3571437.014
ChaetodipusF23.76382423.03201
M24.71221921.02627
DipodomysF55.24436045.06826
M56.24303447.08649
NeotomaF154.282209160.0652
M165.652893170.0484
OnychomysF26.78095926.01502
M26.24646625.01627
PerognathusF8.5748038.0762
M8.2041828.0813
PeromyscusF22.49164923.0958
M20.64427921.01206
ReithrodontomysF11.22008011.01245
M10.15994110.01363
SigmodonF71.69600070.0125
M61.33684258.095
SpermophilusF57.00000057.01
M130.000000130.01
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
speciestaxagenus
merriamiRodentDipodomys10596
penicillatusRodentChaetodipus3123
ordiiRodentDipodomys3027
baileyiRodentChaetodipus2891
megalotisRodentReithrodontomys2609
\n", "
" ], "text/plain": [ " 0\n", "species taxa genus \n", "merriami Rodent Dipodomys 10596\n", "penicillatus Rodent Chaetodipus 3123\n", "ordii Rodent Dipodomys 3027\n", "baileyi Rodent Chaetodipus 2891\n", "megalotis Rodent Reithrodontomys 2609" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(surveys\n", " .groupby(['species', 'taxa', 'genus'])\n", " .size()\n", " .nlargest()\n", " .to_frame()\n", ") " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reason that \"species\", \"taxa\", and \"genus\" are displayed in bold font is that `groupby()` makes these columns into the index (the row names) of the data frame. Indexes can be powerful when working with very large datasets (e.g. matching on indices is faster than matching on values in columns). However, when having multiple index levels like above, it can be less intuitive to work with than when working with columns, so it is often a good idea to reset the data frames index, unless there is a clear advantage of keeping it for downstream analyisis.\n", "\n", "To reset the index, the `reset_index()` method can be used instead of `to_frame()`." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciestaxagenus0
0merriamiRodentDipodomys10596
1penicillatusRodentChaetodipus3123
2ordiiRodentDipodomys3027
3baileyiRodentChaetodipus2891
4megalotisRodentReithrodontomys2609
\n", "
" ], "text/plain": [ " species taxa genus 0\n", "0 merriami Rodent Dipodomys 10596\n", "1 penicillatus Rodent Chaetodipus 3123\n", "2 ordii Rodent Dipodomys 3027\n", "3 baileyi Rodent Chaetodipus 2891\n", "4 megalotis Rodent Reithrodontomys 2609" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(surveys\n", " .groupby(['species', 'taxa', 'genus'])\n", " .size()\n", " .nlargest()\n", " .reset_index()\n", ") " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When the series was changed into a data frame, the values were put into a column. Columns needs a name and by default this is just the lowest unique number among the column names, in this case `0`. The `rename()` can be used to change the name of the `0` column to something more meaningful." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciestaxagenussize
0merriamiRodentDipodomys10596
1penicillatusRodentChaetodipus3123
2ordiiRodentDipodomys3027
3baileyiRodentChaetodipus2891
4megalotisRodentReithrodontomys2609
\n", "
" ], "text/plain": [ " species taxa genus size\n", "0 merriami Rodent Dipodomys 10596\n", "1 penicillatus Rodent Chaetodipus 3123\n", "2 ordii Rodent Dipodomys 3027\n", "3 baileyi Rodent Chaetodipus 2891\n", "4 megalotis Rodent Reithrodontomys 2609" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(surveys\n", " .groupby(['species', 'taxa', 'genus'])\n", " .size()\n", " .nlargest()\n", " .reset_index()\n", " .rename(columns={0: 'size'})\n", ") " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Any column can be renamed this way" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciesTaxaGenus0
0merriamiRodentDipodomys10596
1penicillatusRodentChaetodipus3123
2ordiiRodentDipodomys3027
3baileyiRodentChaetodipus2891
4megalotisRodentReithrodontomys2609
\n", "