{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook is part of my [Python data science curriculum](http://www.terran.us/articles/python_curriculum.html). It demonstrates some Pandas functions which I thought were not adequately explained in the Jake VanderPlas book." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import scipy.stats as stats\n", "from plotnine.data import diamonds" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# This is a standard Python demo dataset. You can also load it from your Python packages\n", "# dir with pd.read_csv if you don't want to import seaborn.\n", "import seaborn as sns\n", "tips = sns.load_dataset('tips')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Loading Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For no apparent reason, the VanderPlas book doesn't document `pd.read_csv`! This is definitely functionality that you need. If you have the Wes McKinney book available, he has a description that you can read in Chapter 6; otherwise just read the online docs." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "pd.read_csv?" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "pd.read_excel?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For DB connections: \n", "http://pandas.pydata.org/pandas-docs/stable/io.html#sql-queries \n", "https://www.sqlalchemy.org/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Summary Tools" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Be aware that describe() ignores all non-numeric columns by default, which might not be what you wanted:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsize
count244.000000244.000000244.000000
mean19.7859432.9982792.569672
std8.9024121.3836380.951100
min3.0700001.0000001.000000
25%13.3475002.0000002.000000
50%17.7950002.9000002.000000
75%24.1275003.5625003.000000
max50.81000010.0000006.000000
\n", "
" ], "text/plain": [ " total_bill tip size\n", "count 244.000000 244.000000 244.000000\n", "mean 19.785943 2.998279 2.569672\n", "std 8.902412 1.383638 0.951100\n", "min 3.070000 1.000000 1.000000\n", "25% 13.347500 2.000000 2.000000\n", "50% 17.795000 2.900000 2.000000\n", "75% 24.127500 3.562500 3.000000\n", "max 50.810000 10.000000 6.000000" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can force it to include them with include='all', but they compute different statistics, so the result is ugly:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsexsmokerdaytimesize
count244.000000244.000000244244244244244.000000
uniqueNaNNaN2242NaN
topNaNNaNMaleNoSatDinnerNaN
freqNaNNaN15715187176NaN
mean19.7859432.998279NaNNaNNaNNaN2.569672
std8.9024121.383638NaNNaNNaNNaN0.951100
min3.0700001.000000NaNNaNNaNNaN1.000000
25%13.3475002.000000NaNNaNNaNNaN2.000000
50%17.7950002.900000NaNNaNNaNNaN2.000000
75%24.1275003.562500NaNNaNNaNNaN3.000000
max50.81000010.000000NaNNaNNaNNaN6.000000
\n", "
" ], "text/plain": [ " total_bill tip sex smoker day time size\n", "count 244.000000 244.000000 244 244 244 244 244.000000\n", "unique NaN NaN 2 2 4 2 NaN\n", "top NaN NaN Male No Sat Dinner NaN\n", "freq NaN NaN 157 151 87 176 NaN\n", "mean 19.785943 2.998279 NaN NaN NaN NaN 2.569672\n", "std 8.902412 1.383638 NaN NaN NaN NaN 0.951100\n", "min 3.070000 1.000000 NaN NaN NaN NaN 1.000000\n", "25% 13.347500 2.000000 NaN NaN NaN NaN 2.000000\n", "50% 17.795000 2.900000 NaN NaN NaN NaN 2.000000\n", "75% 24.127500 3.562500 NaN NaN NaN NaN 3.000000\n", "max 50.810000 10.000000 NaN NaN NaN NaN 6.000000" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.describe(include='all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You probably want to do the numeric and string types separately. Note that values which look like strings might be `np.object` or `pd.Categorical`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sexsmokerdaytime
count244244244244
unique2242
topMaleNoSatDinner
freq15715187176
\n", "
" ], "text/plain": [ " sex smoker day time\n", "count 244 244 244 244\n", "unique 2 2 4 2\n", "top Male No Sat Dinner\n", "freq 157 151 87 176" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.describe(include=pd.Categorical)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`.value_counts()` and `.unique()` work on any type of column, but only one column at a time, not an entire dataframe." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Male 157\n", "Female 87\n", "Name: sex, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips['sex'].value_counts()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Female, Male]\n", "Categories (2, object): [Female, Male]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips['sex'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These specialized functions are quite a bit faster than the general approach with `groupby`:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "905 µs ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n", "1.4 ms ± 7.71 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n", "11.7 ms ± 573 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], "source": [ "%timeit -n100 diamonds.color.unique()\n", "%timeit -n100 diamonds.color.value_counts()\n", "%timeit -n100 diamonds.groupby('color').count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Stacking, unstacking, melting, and pivoting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stacking and unstacking" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To understand stacking and unstacking, let's start by creating a multindex on the rows." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsize
sexsmoker
MaleYes22.2845003.0511672.500000
No19.7912373.1134022.711340
FemaleYes17.9778792.9315152.242424
No18.1051852.7735192.592593
\n", "
" ], "text/plain": [ " total_bill tip size\n", "sex smoker \n", "Male Yes 22.284500 3.051167 2.500000\n", " No 19.791237 3.113402 2.711340\n", "Female Yes 17.977879 2.931515 2.242424\n", " No 18.105185 2.773519 2.592593" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tss = tips.groupby(['sex','smoker']).aggregate('mean')\n", "tss" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`unstack` will them move the INNER level of the index from the ROWS to the COLUMNS:\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsize
smokerYesNoYesNoYesNo
sex
Male22.28450019.7912373.0511673.1134022.5000002.711340
Female17.97787918.1051852.9315152.7735192.2424242.592593
\n", "
" ], "text/plain": [ " total_bill tip size \n", "smoker Yes No Yes No Yes No\n", "sex \n", "Male 22.284500 19.791237 3.051167 3.113402 2.500000 2.711340\n", "Female 17.977879 18.105185 2.931515 2.773519 2.242424 2.592593" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tss.unstack()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`stack` moves the innermost level from the columns to the rows:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsize
sexsmoker
MaleYes22.2845003.0511672.500000
No19.7912373.1134022.711340
FemaleYes17.9778792.9315152.242424
No18.1051852.7735192.592593
\n", "
" ], "text/plain": [ " total_bill tip size\n", "sex smoker \n", "Male Yes 22.284500 3.051167 2.500000\n", " No 19.791237 3.113402 2.711340\n", "Female Yes 17.977879 2.931515 2.242424\n", " No 18.105185 2.773519 2.592593" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tss.unstack().stack()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you remove the _last_ (only remaining) level from either the rows or the columns, you then get a one-dimensional Series instead of a DataFrame:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " smoker sex \n", "total_bill Yes Male 22.284500\n", " Female 17.977879\n", " No Male 19.791237\n", " Female 18.105185\n", "tip Yes Male 3.051167\n", " Female 2.931515\n", " No Male 3.113402\n", " Female 2.773519\n", "size Yes Male 2.500000\n", " Female 2.242424\n", " No Male 2.711340\n", " Female 2.592593\n", "dtype: float64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tss.unstack().unstack()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you have a series, it can only be `unstack`ed, not `stack`ed. You can use `level=` to control which part of the index gets turned back into columns." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sexMaleFemale
smoker
total_billYes22.28450017.977879
No19.79123718.105185
tipYes3.0511672.931515
No3.1134022.773519
sizeYes2.5000002.242424
No2.7113402.592593
\n", "
" ], "text/plain": [ "sex Male Female\n", " smoker \n", "total_bill Yes 22.284500 17.977879\n", " No 19.791237 18.105185\n", "tip Yes 3.051167 2.931515\n", " No 3.113402 2.773519\n", "size Yes 2.500000 2.242424\n", " No 2.711340 2.592593" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tss.unstack().unstack().unstack()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
smokerYesNo
sex
total_billMale22.28450019.791237
Female17.97787918.105185
tipMale3.0511673.113402
Female2.9315152.773519
sizeMale2.5000002.711340
Female2.2424242.592593
\n", "
" ], "text/plain": [ "smoker Yes No\n", " sex \n", "total_bill Male 22.284500 19.791237\n", " Female 17.977879 18.105185\n", "tip Male 3.051167 3.113402\n", " Female 2.931515 2.773519\n", "size Male 2.500000 2.711340\n", " Female 2.242424 2.592593" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tss.unstack().unstack().unstack(level=1)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsize
smokersex
YesMale22.2845003.0511672.500000
Female17.9778792.9315152.242424
NoMale19.7912373.1134022.711340
Female18.1051852.7735192.592593
\n", "
" ], "text/plain": [ " total_bill tip size\n", "smoker sex \n", "Yes Male 22.284500 3.051167 2.500000\n", " Female 17.977879 2.931515 2.242424\n", "No Male 19.791237 3.113402 2.711340\n", " Female 18.105185 2.773519 2.592593" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tss.unstack().unstack().unstack(level=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Melting and pivoting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`melt` is very similar to `stack`, except that it applies to all columns and not just the innermost level, and it the converts them into a normal column instead of an index level." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsexsmokerdaytimesize
016.991.01FemaleNoSunDinner2
110.341.66MaleNoSunDinner3
\n", "
" ], "text/plain": [ " total_bill tip sex smoker day time size\n", "0 16.99 1.01 Female No Sun Dinner 2\n", "1 10.34 1.66 Male No Sun Dinner 3" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.head(2)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 total_bill 16.99\n", " tip 1.01\n", " sex Female\n", " smoker No\n", " day Sun\n", " time Dinner\n", " size 2\n", "1 total_bill 10.34\n", " tip 1.66\n", " sex Male\n", " smoker No\n", " day Sun\n", " time Dinner\n", " size 3\n", "dtype: object" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.head(2).stack()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
indexvariablevalue
00total_bill16.99
11total_bill10.34
20tip1.01
31tip1.66
40sexFemale
51sexMale
60smokerNo
71smokerNo
80daySun
91daySun
100timeDinner
111timeDinner
120size2
131size3
\n", "
" ], "text/plain": [ " index variable value\n", "0 0 total_bill 16.99\n", "1 1 total_bill 10.34\n", "2 0 tip 1.01\n", "3 1 tip 1.66\n", "4 0 sex Female\n", "5 1 sex Male\n", "6 0 smoker No\n", "7 1 smoker No\n", "8 0 day Sun\n", "9 1 day Sun\n", "10 0 time Dinner\n", "11 1 time Dinner\n", "12 0 size 2\n", "13 1 size 3" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Note that I have to move the row index into a column, which is called \"index\", to\n", "# preserve the association of the data in the original rows through the melt.\n", "tips_melted=tips.reset_index().head(2).melt(id_vars='index')\n", "tips_melted" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is no corresponding `cast` like R has. Instead, use `.pivot`" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
value
variabledaysexsizesmokertimetiptotal_bill
index
0SunFemale2NoDinner1.0116.99
1SunMale3NoDinner1.6610.34
\n", "
" ], "text/plain": [ " value \n", "variable day sex size smoker time tip total_bill\n", "index \n", "0 Sun Female 2 No Dinner 1.01 16.99\n", "1 Sun Male 3 No Dinner 1.66 10.34" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips_melted.pivot(index='index',columns='variable')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The distinction between `.pivot` and `.pivot_table` is that the latter does aggregation:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
timeDinnerLunch
sex
Female6.55.17
Male10.06.70
\n", "
" ], "text/plain": [ "time Dinner Lunch\n", "sex \n", "Female 6.5 5.17\n", "Male 10.0 6.70" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.pivot_table(values='tip',index='sex',columns='time',aggfunc='max')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the gotcha that the arguments are not in the same order if you specify them positionally:\n", "\n", "tips.pivot_table(__values__=None, __index__=None, __columns__=None, ... \n", "tips.pivot(__index__=None, __columns__=None, __values__=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Aggregate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Full Aggregate Syntax" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The full syntax of arguments to `aggregate()` is fairly complex. You can have:\n", "\n", "- A dict where the keys are columns in your source data, and the values are:\n", " - An array of functions to apply, where each element is:\n", " - A 2-tuple, where the first element is a string to call the output and the second element is the function" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tiptime
mean50pctpct_dinner
sexsmoker
MaleYes3.0511673.0078.333333
No3.1134022.7479.381443
FemaleYes2.9315152.8869.696970
No2.7735192.6853.703704
\n", "
" ], "text/plain": [ " tip time\n", " mean 50pct pct_dinner\n", "sex smoker \n", "Male Yes 3.051167 3.00 78.333333\n", " No 3.113402 2.74 79.381443\n", "Female Yes 2.931515 2.88 69.696970\n", " No 2.773519 2.68 53.703704" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.groupby(['sex','smoker']).aggregate(\n", " {'tip':[('mean',np.mean),('50pct',np.median)],\n", " 'time':[('pct_dinner', lambda x: 100*np.mean(x=='Dinner'))]\n", "})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But note that if you specify a tuple for one function, you had better specify it for all, or you get bad column names for the ones you didn't specify:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tip
<function mean at 0x7fdb78089ae8>50pct
sexsmoker
MaleYes3.0511673.00
No3.1134022.74
FemaleYes2.9315152.88
No2.7735192.68
\n", "
" ], "text/plain": [ " tip \n", " 50pct\n", "sex smoker \n", "Male Yes 3.051167 3.00\n", " No 3.113402 2.74\n", "Female Yes 2.931515 2.88\n", " No 2.773519 2.68" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.groupby(['sex','smoker']).aggregate({'tip':[np.mean,('50pct',np.median)]})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whereas if you don't specify _any_ names, you get sane defaults. I dunno." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tip
meanmedian
sexsmoker
MaleYes3.0511673.00
No3.1134022.74
FemaleYes2.9315152.88
No2.7735192.68
\n", "
" ], "text/plain": [ " tip \n", " mean median\n", "sex smoker \n", "Male Yes 3.051167 3.00\n", " No 3.113402 2.74\n", "Female Yes 2.931515 2.88\n", " No 2.773519 2.68" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.groupby(['sex','smoker']).aggregate({'tip':[np.mean,np.median]})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multi-level Column Names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we aggregate multiple columns with multiple functions, we get hierarchical column names:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MultiIndex(levels=[['tip', 'total_bill'], ['mean', 'median']],\n", " labels=[[0, 0, 1, 1], [0, 1, 0, 1]])" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm = tips.groupby(['sex','smoker']).aggregate({'tip':[np.mean,np.median],\n", " 'total_bill':[np.mean,np.median]})\n", "\n", "# See that we have a MultiIndex:\n", "tm.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to flip which is the first and which is the second level of the index, we can do it with `.swaplevel`:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tiptotal_bill
meanmedianmeanmedian
sexsmoker
MaleYes3.0511673.0022.28450020.39
No3.1134022.7419.79123718.24
FemaleYes2.9315152.8817.97787916.27
No2.7735192.6818.10518516.69
\n", "
" ], "text/plain": [ " tip total_bill \n", " mean median mean median\n", "sex smoker \n", "Male Yes 3.051167 3.00 22.284500 20.39\n", " No 3.113402 2.74 19.791237 18.24\n", "Female Yes 2.931515 2.88 17.977879 16.27\n", " No 2.773519 2.68 18.105185 16.69" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
meanmedianmeanmedian
tiptiptotal_billtotal_bill
sexsmoker
MaleYes3.0511673.0022.28450020.39
No3.1134022.7419.79123718.24
FemaleYes2.9315152.8817.97787916.27
No2.7735192.6818.10518516.69
\n", "
" ], "text/plain": [ " mean median mean median\n", " tip tip total_bill total_bill\n", "sex smoker \n", "Male Yes 3.051167 3.00 22.284500 20.39\n", " No 3.113402 2.74 19.791237 18.24\n", "Female Yes 2.931515 2.88 17.977879 16.27\n", " No 2.773519 2.68 18.105185 16.69" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm.swaplevel(axis=1)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
meanmedian
tiptotal_billtiptotal_bill
sexsmoker
MaleYes3.05116722.2845003.0020.39
No3.11340219.7912372.7418.24
FemaleYes2.93151517.9778792.8816.27
No2.77351918.1051852.6816.69
\n", "
" ], "text/plain": [ " mean median \n", " tip total_bill tip total_bill\n", "sex smoker \n", "Male Yes 3.051167 22.284500 3.00 20.39\n", " No 3.113402 19.791237 2.74 18.24\n", "Female Yes 2.931515 17.977879 2.88 16.27\n", " No 2.773519 18.105185 2.68 16.69" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# If we then want the columns sorted by the new index, we can do that explicitly:\n", "tm.swaplevel(axis=1).sort_index(axis=1)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tiptotal_bill
meanmedianmeanmedian
smokersex
YesMale3.0511673.0022.28450020.39
Female2.9315152.8817.97787916.27
NoMale3.1134022.7419.79123718.24
Female2.7735192.6818.10518516.69
\n", "
" ], "text/plain": [ " tip total_bill \n", " mean median mean median\n", "smoker sex \n", "Yes Male 3.051167 3.00 22.284500 20.39\n", " Female 2.931515 2.88 17.977879 16.27\n", "No Male 3.113402 2.74 19.791237 18.24\n", " Female 2.773519 2.68 18.105185 16.69" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The same thing works on the rows:\n", "tm.swaplevel(axis=0).sort_index(axis=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we had more than two levels, we could specify which two we wanted to swap with additional arguments." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some tools (including Altair) can't use data with hierarchical column names at all, so they have to be flattened. There's no built-in function for doing this, but the following idiom seems standard:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tip.meantip.mediantotal_bill.meantotal_bill.median
sexsmoker
MaleYes3.0511673.0022.28450020.39
No3.1134022.7419.79123718.24
FemaleYes2.9315152.8817.97787916.27
No2.7735192.6818.10518516.69
\n", "
" ], "text/plain": [ " tip.mean tip.median total_bill.mean total_bill.median\n", "sex smoker \n", "Male Yes 3.051167 3.00 22.284500 20.39\n", " No 3.113402 2.74 19.791237 18.24\n", "Female Yes 2.931515 2.88 17.977879 16.27\n", " No 2.773519 2.68 18.105185 16.69" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm.columns = [c[0] + \".\" + c[1] for c in tm.columns]\n", "tm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set Membership" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is an `.isin` function for quickly checking set membership." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsexsmokerdaytimesizeweekend
23929.035.92MaleNoSatDinner3True
24027.182.00FemaleYesSatDinner2True
24122.672.00MaleYesSatDinner2True
24217.821.75MaleNoSatDinner2True
24318.783.00FemaleNoThurDinner2False
\n", "
" ], "text/plain": [ " total_bill tip sex smoker day time size weekend\n", "239 29.03 5.92 Male No Sat Dinner 3 True\n", "240 27.18 2.00 Female Yes Sat Dinner 2 True\n", "241 22.67 2.00 Male Yes Sat Dinner 2 True\n", "242 17.82 1.75 Male No Sat Dinner 2 True\n", "243 18.78 3.00 Female No Thur Dinner 2 False" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips['weekend'] = tips.day.isin(['Sat','Sun'])\n", "tips.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Performance of `.isin` is good compared to the alternatives:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.32 ms ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n", "3.92 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], "source": [ "%timeit -n100 diamonds.color.isin(['D','E','F'])\n", "%timeit -n100 diamonds.eval('color in [\"D\",\"E\",\"F\"]')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Method Chaining Helpers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to make it easier to create chains of manipulation functions, there is an `assign()` which creates new columns. It and the array-index filtering both take lambdas, which let you refer to an intermediate result that doesn't have a name." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsexsmokerdaytimesizeweekendtip_pct
1727.255.15MaleYesSunDinner2True71.034483
\n", "
" ], "text/plain": [ " total_bill tip sex smoker day time size weekend tip_pct\n", "172 7.25 5.15 Male Yes Sun Dinner 2 True 71.034483" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.assign(tip_pct=lambda x: 100*x.tip/x.total_bill) \\\n", " [lambda x: x.tip_pct > 70]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is also a \"rename\" for changing column names without needing to assign to the `.columns` or `.index` property of a named variable." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsexsmokerdayofweektimesizeweekendtip_pct
1727.255.15MaleYesSunDinner2True71.034483
\n", "
" ], "text/plain": [ " total_bill tip sex smoker dayofweek time size weekend tip_pct\n", "172 7.25 5.15 Male Yes Sun Dinner 2 True 71.034483" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.assign(tip_pct=lambda x: 100*x.tip/x.total_bill) \\\n", " [lambda x: x.tip_pct > 70].rename({'day':'dayofweek'},axis='columns')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Sorting and Ranking" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## sort_values and sort_index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a sort_values which goes with sort_index():" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tipsexsmokerdaytimesizeweekend
total_bill
3.071.00FemaleYesSatDinner1True
5.751.00FemaleYesFriDinner2False
7.251.00FemaleNoSatDinner1True
7.255.15MaleYesSunDinner2True
7.512.00MaleNoThurLunch2False
\n", "
" ], "text/plain": [ " tip sex smoker day time size weekend\n", "total_bill \n", "3.07 1.00 Female Yes Sat Dinner 1 True\n", "5.75 1.00 Female Yes Fri Dinner 2 False\n", "7.25 1.00 Female No Sat Dinner 1 True\n", "7.25 5.15 Male Yes Sun Dinner 2 True\n", "7.51 2.00 Male No Thur Lunch 2 False" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.set_index('total_bill').sort_index().head()" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tipsexsmokerdaytimesizeweekend
total_bill
3.071.00FemaleYesSatDinner1True
12.601.00MaleYesSatDinner2True
5.751.00FemaleYesFriDinner2False
7.251.00FemaleNoSatDinner1True
16.991.01FemaleNoSunDinner2True
\n", "
" ], "text/plain": [ " tip sex smoker day time size weekend\n", "total_bill \n", "3.07 1.00 Female Yes Sat Dinner 1 True\n", "12.60 1.00 Male Yes Sat Dinner 2 True\n", "5.75 1.00 Female Yes Fri Dinner 2 False\n", "7.25 1.00 Female No Sat Dinner 1 True\n", "16.99 1.01 Female No Sun Dinner 2 True" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.set_index('total_bill').sort_values('tip').head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ranking\n", "\n", "is in the `.rank()` member function. The usual options for method (min, max, dense, etc) are available as an argument to `rank()`.\n", "\n", "Note that `pct` actually gives numbers between 0 and 1, not 0 and 100. Pandas is very sloppy generally about the meaning of \"percent\"." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsexsmokerdaytimesizeweekendtip_ranktip_pct
016.991.01FemaleNoSunDinner2True5.00.020492
110.341.66MaleNoSunDinner3True33.00.135246
221.013.50MaleNoSunDinner3True177.00.725410
323.683.31MaleNoSunDinner2True165.00.676230
424.593.61FemaleNoSunDinner4True185.00.758197
\n", "
" ], "text/plain": [ " total_bill tip sex smoker day time size weekend tip_rank \\\n", "0 16.99 1.01 Female No Sun Dinner 2 True 5.0 \n", "1 10.34 1.66 Male No Sun Dinner 3 True 33.0 \n", "2 21.01 3.50 Male No Sun Dinner 3 True 177.0 \n", "3 23.68 3.31 Male No Sun Dinner 2 True 165.0 \n", "4 24.59 3.61 Female No Sun Dinner 4 True 185.0 \n", "\n", " tip_pct \n", "0 0.020492 \n", "1 0.135246 \n", "2 0.725410 \n", "3 0.676230 \n", "4 0.758197 " ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.assign(tip_rank=tips.tip.rank(), tip_pct = tips.tip.rank(pct=True)).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Replacing Values" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
012.0
13NaN
256.0
\n", "
" ], "text/plain": [ " a b\n", "0 1 2.0\n", "1 3 NaN\n", "2 5 6.0" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Note that column b gets promoted from integer to float because NaN cannot be stored in an integer type in Numpy\n", "d = pd.DataFrame([{'a':1, 'b':2}, {'a':3, 'b':np.NaN}, {'a':5, 'b': 6}])\n", "d" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can replace individual values with `map`, which takes a dict or a lambda. It operates on only one column at a time." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
0199.0
13NaN
25NaN
\n", "
" ], "text/plain": [ " a b\n", "0 1 99.0\n", "1 3 NaN\n", "2 5 NaN" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d.assign(b=d.b.map({2:99}))" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
0199.0
13NaN
256.0
\n", "
" ], "text/plain": [ " a b\n", "0 1 99.0\n", "1 3 NaN\n", "2 5 6.0" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d.assign(b=d.b.map(lambda x: 99 if x==2 else x))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can run a map on all columns with `applymap`:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
0199.0
13NaN
256.0
\n", "
" ], "text/plain": [ " a b\n", "0 1 99.0\n", "1 3 NaN\n", "2 5 6.0" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d.applymap(lambda x: 99 if x==2 else x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can fill NAs with `fillna`, which optionally takes column-specific defaults:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
012.0
13-99.0
256.0
\n", "
" ], "text/plain": [ " a b\n", "0 1 2.0\n", "1 3 -99.0\n", "2 5 6.0" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d.fillna({'b':-99})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That makes it especially convenient to do something like this:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
012.0
134.0
256.0
\n", "
" ], "text/plain": [ " a b\n", "0 1 2.0\n", "1 3 4.0\n", "2 5 6.0" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d.fillna(d.mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`.combine_first` is like a version of coalesce which works at a full column or dataframe level." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
012.0
133.0
256.0
\n", "
" ], "text/plain": [ " a b\n", "0 1 2.0\n", "1 3 3.0\n", "2 5 6.0" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This fills in the value of column a into column b where there is a missing value:\n", "d.assign(b=d.b.combine_first(d.a))" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
0-99-98
1-99-98
2-99-98
\n", "
" ], "text/plain": [ " a b\n", "0 -99 -98\n", "1 -99 -98\n", "2 -99 -98" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "e=pd.DataFrame([{'a':-99, 'b':-98}]*3)\n", "e" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
012.0
13-98.0
256.0
\n", "
" ], "text/plain": [ " a b\n", "0 1 2.0\n", "1 3 -98.0\n", "2 5 6.0" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This does the same thing at the full dataframe level instead of a single column:\n", "d.combine_first(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Categories\n", "\n", "Just like `.str` exposes special functions for strings, `.cat` exposes special functions for categorical variables. \n", "\n", "Let's make some categorical variables with cut (there is variant qcut, which bins by equal quantiles instead of equal width)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
price
carat
(0.0, 0.5]839.718149
(0.5, 1.0]2811.342683
(1.0, 1.5]6513.526534
(1.5, 2.0]11321.774838
(2.0, 2.5]14918.141237
(2.5, 3.0]15472.904255
(3.0, 3.5]14822.000000
(3.5, 4.0]15636.500000
(4.0, 4.5]16576.500000
\n", "
" ], "text/plain": [ " price\n", "carat \n", "(0.0, 0.5] 839.718149\n", "(0.5, 1.0] 2811.342683\n", "(1.0, 1.5] 6513.526534\n", "(1.5, 2.0] 11321.774838\n", "(2.0, 2.5] 14918.141237\n", "(2.5, 3.0] 15472.904255\n", "(3.0, 3.5] 14822.000000\n", "(3.5, 4.0] 15636.500000\n", "(4.0, 4.5] 16576.500000" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dc = diamonds.groupby(pd.cut(diamonds['carat'],np.arange(0,5,.5))).\\\n", " aggregate({'price':'mean'})\n", "dc" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CategoricalDtype(categories=[(0.0, 0.5], (0.5, 1.0], (1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5]]\n", " ordered=True)" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We can move the categorical index back into a column and see that it has type Categorical\n", "dc=dc.reset_index()\n", "dc.carat.dtype" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 2\n", "3 3\n", "4 4\n", "5 5\n", "6 6\n", "7 7\n", "8 8\n", "dtype: int8" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Try tab-completing on dc.carat.cat.\n", "\n", "# This gives us the integer values\n", "dc.carat.cat.codes" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "IntervalIndex([(0.0, 0.5], (0.5, 1.0], (1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5]]\n", " closed='right',\n", " dtype='interval[float64]')" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This gives us the labels\n", "dc.carat.cat.categories" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This is a metadata flag indicating whether the category order is semantically meaningful\n", "dc.carat.cat.ordered" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__There is a function to reorder categories, but it appears to have some bugs. Look at these examples:__" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 (0.0, 0.5]\n", "1 (1.0, 1.5]\n", "2 (0.5, 1.0]\n", "3 (0.5, 1.0]\n", "4 (1.5, 2.0]\n", "5 (2.0, 2.5]\n", "6 (3.0, 3.5]\n", "7 (2.5, 3.0]\n", "8 (2.5, 3.0]\n", "Name: carat, dtype: category\n", "Categories (9, interval[float64]): [(0.0, 0.5] < (1.0, 1.5] < (0.5, 1.0] < (1.5, 2.0] ... (3.0, 3.5] < (2.5, 3.0] < (3.5, 4.0] < (4.0, 4.5]]" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# You can reorder categories:\n", "dc.carat.cat.reorder_categories(dc.carat.cat.categories[[0,2,1,3,4,6,5,7,8]])" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 NaN\n", "1 NaN\n", "2 NaN\n", "3 NaN\n", "4 NaN\n", "5 NaN\n", "6 NaN\n", "7 NaN\n", "8 NaN\n", "Name: carat, dtype: category\n", "Categories (9, interval[float64]): [(4.0, 4.5] < (3.5, 4.0] < (3.0, 3.5] < (2.5, 3.0] ... (1.5, 2.0] < (1.0, 1.5] < (0.5, 1.0] < (0.0, 0.5]]" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dc.carat.cat.reorder_categories(dc.carat.cat.categories[::-1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I submitted a bug:\n", "https://github.com/pandas-dev/pandas/issues/23452" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# What does \"percent\" mean?\n", "\n", "Python libraries are disappointingly sloppy about using the word \"percent\" correctly. \"Cent\" is 100 and percents are supposed to be on a scale of 100, but often the word is used very shoddily on the scale of 1 instead. Here's an example:" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
x
0NaN
10.0
21.0
3-0.5
\n", "
" ], "text/plain": [ " x\n", "0 NaN\n", "1 0.0\n", "2 1.0\n", "3 -0.5" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame({'x':[1,1,2,1]}).pct_change()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This should be a 100% increase from 1 to 2, and then a 50% decrease from 2 back to 1, but it's actually 1.0 and -0.5. It's _not a percent_." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same thing is true in `stats`, for example with the \"percent point function\" which actually goes from 0 to 1, not 0 to 100:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([-1.96, 1.96])" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats.norm.ppf([0.025,0.975]).round(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Please do not follow these bad examples. The word \"percent\" does have a meaning." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# ToDo" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "These are some things I intend to write about but haven't gotten to yet:" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ ".corr, .cov, .corrwith \n", ".duplicated, .drop_duplicates \n", ".sample(replace=), .take \n", ".get_dummies" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "255px" }, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }