{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Merge and pivot" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import addutils.toc ; addutils.toc.js(ipy_notebook=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial we are going to see how to combine `pandas` data structures together." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "from IPython.display import (display, HTML)\n", "from addutils import side_by_side2, css_notebook\n", "css_notebook()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1 Concat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pandas.concat` concatenates two (or more) pandas objects along a particular axis (default is the vertical one)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcd
0-1-243
134-4-3
22-532
32-532
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fedcba
00.50.51.53.53.54.5
14.53.51.52.50.51.5
20.50.53.52.53.54.5
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = np.array([[-1, -2, 4, 3],[3,4, -4, -3], [2, -5, 3, 2], [2, -5, 3, 2]])\n", "df1 = pd.DataFrame(data, columns = ['a','b','c','d'])\n", "df2 = pd.DataFrame(np.random.randint(5, size=(3,6)) + 0.5, columns = list('fedcba'))\n", "HTML(side_by_side2(df1, df2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When gluing together multiple DataFrames, you have a choice of how to handle the other axes (other than the one being concatenated). This can be done by defining `join`:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcdef
0-1.0-2.04.03.0NaNNaN
13.04.0-4.0-3.0NaNNaN
22.0-5.03.02.0NaNNaN
32.0-5.03.02.0NaNNaN
04.53.53.51.50.50.5
11.50.52.51.53.54.5
24.53.52.53.50.50.5
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dcba
03.04.0-2.0-1.0
1-3.0-4.04.03.0
22.03.0-5.02.0
32.03.0-5.02.0
01.53.53.54.5
11.52.50.51.5
23.52.53.54.5
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(side_by_side2(pd.concat([df1, df2]), pd.concat([df1, df2], join='inner')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is possible to create a MultiIndex DataFrame by setting the `keys` parameter." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcdef
First0-1.0-2.04.03.0NaNNaN
13.04.0-4.0-3.0NaNNaN
22.0-5.03.02.0NaNNaN
32.0-5.03.02.0NaNNaN
Second04.53.53.51.50.50.5
11.50.52.51.53.54.5
24.53.52.53.50.50.5
\n", "
" ], "text/plain": [ " a b c d e f\n", "First 0 -1.0 -2.0 4.0 3.0 NaN NaN\n", " 1 3.0 4.0 -4.0 -3.0 NaN NaN\n", " 2 2.0 -5.0 3.0 2.0 NaN NaN\n", " 3 2.0 -5.0 3.0 2.0 NaN NaN\n", "Second 0 4.5 3.5 3.5 1.5 0.5 0.5\n", " 1 1.5 0.5 2.5 1.5 3.5 4.5\n", " 2 4.5 3.5 2.5 3.5 0.5 0.5" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df1, df2], keys = ('First', 'Second'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that `concat` allows overlapping indexes. To avoid overlapping indexes, set the parameter `ignore_index` to `True`." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcdef
0-1.0-2.04.03.0NaNNaN
13.04.0-4.0-3.0NaNNaN
22.0-5.03.02.0NaNNaN
32.0-5.03.02.0NaNNaN
44.53.53.51.50.50.5
51.50.52.51.53.54.5
64.53.52.53.50.50.5
\n", "
" ], "text/plain": [ " a b c d e f\n", "0 -1.0 -2.0 4.0 3.0 NaN NaN\n", "1 3.0 4.0 -4.0 -3.0 NaN NaN\n", "2 2.0 -5.0 3.0 2.0 NaN NaN\n", "3 2.0 -5.0 3.0 2.0 NaN NaN\n", "4 4.5 3.5 3.5 1.5 0.5 0.5\n", "5 1.5 0.5 2.5 1.5 3.5 4.5\n", "6 4.5 3.5 2.5 3.5 0.5 0.5" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df1, df2], ignore_index = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And if we want to concatenate the DataFrames on a different axis:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ONETWO
abcdfedcba
0-1-2430.50.51.53.53.54.5
134-4-34.53.51.52.50.51.5
22-5320.50.53.52.53.54.5
32-532NaNNaNNaNNaNNaNNaN
\n", "
" ], "text/plain": [ " ONE TWO \n", " a b c d f e d c b a\n", "0 -1 -2 4 3 0.5 0.5 1.5 3.5 3.5 4.5\n", "1 3 4 -4 -3 4.5 3.5 1.5 2.5 0.5 1.5\n", "2 2 -5 3 2 0.5 0.5 3.5 2.5 3.5 4.5\n", "3 2 -5 3 2 NaN NaN NaN NaN NaN NaN" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "concatenatedaxis = pd.concat([df1, df2], axis = 1, keys = ['ONE','TWO'])\n", "concatenatedaxis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2 Append" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`append` is a shortcut for `concat` along `axis=0`:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "d1 = pd.DataFrame(np.random.randint(10,99,(2,2)), index=[1001, 1002], columns=list('AB'))\n", "d2 = pd.DataFrame(np.random.randint(10,99,(2,3)), index=[1002, 2001], columns=list('BCD'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`append` allows overlapping indexes. Here we can see the two cases with `ignore_index` set respectively to `False` or `True`:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
10013213
10025631
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BCD
1002121545
2001182841
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
10013213NaNNaN
10025631NaNNaN
1002NaN121545
2001NaN182841
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
03213NaNNaN
15631NaNNaN
2NaN121545
3NaN182841
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(side_by_side2(d1, d2, d1.append(d2), d1.append(d2, ignore_index=True)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3 Join" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`join` is equivalent to the homonymous SQL command. It is more advanced than `append` because it has the parameter `how`, which specifies how to merge the indexes:\n", "\n", "* `left` keeps the left indexes.\n", "* `right` keeps the right indexes.\n", "* `inner` intersects the indexes.\n", "* `outer` takes the union of the indexes." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
10013213
10025631
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BCD
1002121545
2001182841
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(side_by_side2(d1, d2))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABB_rCD
10013213NaNNaNNaN
10025631121545
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABB_rCD
10025631121545
2001NaNNaN182841
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABD
1002563145
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABD
10013213NaN
1002563145
2001NaNNaN41
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d6 = d1.join(d2, rsuffix='_r', how='left')\n", "d7 = d1.join(d2, rsuffix='_r', how='right')\n", "d8 = d1.join(d2.D, how='inner')\n", "d9 = d1.join(d2.D, how='outer')\n", "HTML(side_by_side2(d6, d7, d8, d9))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "pandas provides a single function, `merge`, as the entry point for all standard database join operations between DataFrame objects:\n", "\n", " pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,\n", " left_index=False, right_index=False, sort=True,\n", " suffixes=('_x', '_y'), copy=True)\n", " \n", "* `on`: Columns to join on. Must be found in both DataFrames.\n", "* `left_on`: Columns from the left DataFrame to use as keys.\n", "* `right_on`: same for the right DataFrame.\n", "* `left_index`: If True, use the index (row labels) from the left DataFrame as its join key(s).\n", "* `right_index`: same for the right DataFrame.\n", "* `how`: One of 'left', 'right', 'outer', 'inner'. Defaults to 'inner'.\n", "* `suffixes`: A tuple of string suffixes to apply to overlapping columns. Defaults to ('_x', '_y')." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keylval
0foo1
1foo2
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keyrval
0foo4
1bar5
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})\n", "right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})\n", "HTML(side_by_side2(left, right))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keylvalrval
0foo14
1foo24
2barNaN5
\n", "
" ], "text/plain": [ " key lval rval\n", "0 foo 1 4\n", "1 foo 2 4\n", "2 bar NaN 5" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(left, right, on='key', how='right')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A practical example: revenues is a little dataset that we want to join with the 'cities' database to find the geographical location of the offices:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CityNameRevenuesState
0New YorkRoy1250NY
1friscoJohnn840CA
2houstonJim349tx
3taftPaul1100OK
4veniceRoss900IL
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CityStatelatlon
0ROBERTSID43.698922-112.173195
1HODGENVILLEKY37.559321-85.707267
2WILLIAMSPORTTN35.722628-87.212698
3MAUMEEOH41.571251-83.685036
4BERNENY42.602236-74.154615
5MONCLOVAOH41.577251-83.772378
6MEADOWBROOKWV39.346214-80.313875
7RUTHNV39.342631-114.885945
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "revenues = pd.read_csv('example_data/p05_d2.txt', index_col=0)\n", "cities = pd.read_csv('example_data/cities.csv.bz2', compression='bz2',\n", " names=['City', 'State', 'lat', 'lon'])\n", "HTML(side_by_side2(revenues, cities.head(8)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The 'cities' database contains 30101 records:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "30101" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cities['City'].count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we map a normalization function to standardize the 'City' and 'State' names. **`map`** applies the function passed as a parameter to each element of a `Series` (here, one column at a time):" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameRevenuesCityState
0Roy1250NEW YORKNY
1Johnn840FRISCOCA
2Jim349HOUSTONTX
3Paul1100TAFTOK
4Ross900VENICEIL
\n", "
" ], "text/plain": [ " Name Revenues City State\n", "0 Roy 1250 NEW YORK NY\n", "1 Johnn 840 FRISCO CA\n", "2 Jim 349 HOUSTON TX\n", "3 Paul 1100 TAFT OK\n", "4 Ross 900 VENICE IL" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "revenues = revenues.reindex(columns=['Name','Revenues','City','State'])\n", "revenues['City'] = revenues['City'].map(lambda str: str.upper().strip())\n", "revenues['State'] = revenues['State'].map(lambda str: str.upper().strip())\n", "display(revenues)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameRevenuesCityState
0Roy1250NEW YORKNY
1Johnn840FRISCOCA
2Jim349HOUSTONTX
3Paul1100TAFTOK
4Ross900VENICEIL
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameRevenuesCityStatelatlon
0Roy1250NEW YORKNY40.757929-73.985506
3Paul1100TAFTOK35.763648-95.544501
4Ross900VENICEIL38.670250-90.168859
1Johnn840FRISCOCANaNNaN
2Jim349HOUSTONTX29.759956-95.362534
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "leftjoin = pd.merge(revenues, cities, left_on=['City', 'State'],\n", " right_on=['City', 'State'], how='left')\n", "leftjoin = leftjoin.sort_values(by='Revenues', ascending=False)\n", "HTML(side_by_side2(revenues, leftjoin))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5 Pivoting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pivot_table` is useful to analyze data in the so-called \"stacked\" or \"record\" format: each field of each record is stored in a column, and similar fields can appear multiple times:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0datevariablevalue
002000-01-03 00:00:00A0.397018
112000-01-04 00:00:00A0.621109
222000-01-05 00:00:00A-2.508284
332000-01-03 00:00:00B0.219796
442000-01-04 00:00:00B1.371027
552000-01-05 00:00:00B-1.005504
662000-01-03 00:00:00C-0.391149
772000-01-04 00:00:00C1.209346
882000-01-05 00:00:00C-0.059617
992000-01-03 00:00:00D-0.464558
10102000-01-04 00:00:00D-0.173605
11112000-01-05 00:00:00D-1.649632
\n", "
" ], "text/plain": [ " Unnamed: 0 date variable value\n", "0 0 2000-01-03 00:00:00 A 0.397018\n", "1 1 2000-01-04 00:00:00 A 0.621109\n", "2 2 2000-01-05 00:00:00 A -2.508284\n", "3 3 2000-01-03 00:00:00 B 0.219796\n", "4 4 2000-01-04 00:00:00 B 1.371027\n", "5 5 2000-01-05 00:00:00 B -1.005504\n", "6 6 2000-01-03 00:00:00 C -0.391149\n", "7 7 2000-01-04 00:00:00 C 1.209346\n", "8 8 2000-01-05 00:00:00 C -0.059617\n", "9 9 2000-01-03 00:00:00 D -0.464558\n", "10 10 2000-01-04 00:00:00 D -0.173605\n", "11 11 2000-01-05 00:00:00 D -1.649632" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv('example_data/pd06_pivot_Example.txt')\n", "data" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
variableABCD
date
2000-01-03 00:00:000.3970180.219796-0.391149-0.464558
2000-01-04 00:00:000.6211091.3710271.209346-0.173605
2000-01-05 00:00:00-2.508284-1.005504-0.059617-1.649632
\n", "
" ], "text/plain": [ "variable A B C D\n", "date \n", "2000-01-03 00:00:00 0.397018 0.219796 -0.391149 -0.464558\n", "2000-01-04 00:00:00 0.621109 1.371027 1.209346 -0.173605\n", "2000-01-05 00:00:00 -2.508284 -1.005504 -0.059617 -1.649632" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pivot = data.pivot(index='date', columns='variable', values='value')\n", "pivot.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A practical example: apply pivoting to a dataset containing the statistics of the US crimes State by State, downloaded from the Stanford University \"Human-Computer Interaction\" group (http://hci.stanford.edu/jheer/workshop/data/):" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StateType of CrimeCrimeYearCount
0AlabamaViolent CrimeMurder and nonnegligent Manslaughter1960406
1000ArkansasViolent CrimeMurder and nonnegligent Manslaughter1994294
2000ConnecticutViolent CrimeForcible rape1982692
3000FloridaViolent CrimeRobbery197012636
4000IdahoViolent CrimeRobbery2004241
5000IowaViolent CrimeAggravated assault19926131
6000LouisianaProperty CrimeBurglary198063997
7000MassachusettsProperty CrimeLarceny-theft196866823
\n", "
" ], "text/plain": [ " State Type of Crime Crime \\\n", "0 Alabama Violent Crime Murder and nonnegligent Manslaughter \n", "1000 Arkansas Violent Crime Murder and nonnegligent Manslaughter \n", "2000 Connecticut Violent Crime Forcible rape \n", "3000 Florida Violent Crime Robbery \n", "4000 Idaho Violent Crime Robbery \n", "5000 Iowa Violent Crime Aggravated assault \n", "6000 Louisiana Property Crime Burglary \n", "7000 Massachusetts Property Crime Larceny-theft \n", "\n", " Year Count \n", "0 1960 406 \n", "1000 1994 294 \n", "2000 1982 692 \n", "3000 1970 12636 \n", "4000 2004 241 \n", "5000 1992 6131 \n", "6000 1980 63997 \n", "7000 1968 66823 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data2 = pd.read_csv('example_data/CrimeStatebyState.csv.bz2', compression='bz2')\n", "data2.loc[0:7000:1000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Definition:\n", " pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',\n", " fill_value=None, margins=False, dropna=True)\n", "\n", "* `values`: Columns to aggregate\n", "* `aggfunc`: Default is `np.mean`. This can be a list of functions as well\n", "* `margins`: Boolean, add subtotals and grand totals\n", "* `dropna`: Exclude columns whose entries are all NaN" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StateUtahVermontVirginiaWashingtonWest VirginiaWisconsinWyomingAll
Year
199712344716658261022332469448391901332006813195074
199811562418552248576333799461301850931831512485714
199910599916735231886302509491611730621658311634378
20009995818185214348300932470671721241628511608070
20019630716978228445308492461201794101739211849006
200210312915600229039309931453201769871785811878954
200310597314510220939312814469971697881796211826538
200410340915272220976322167514361582581805211679474
200510115814956221044329406526531606461724211556854
All343258271671692649001094231717094757399321726727486181458
\n", "
" ], "text/plain": [ "State Utah Vermont Virginia Washington West Virginia Wisconsin \\\n", "Year \n", "1997 123447 16658 261022 332469 44839 190133 \n", "1998 115624 18552 248576 333799 46130 185093 \n", "1999 105999 16735 231886 302509 49161 173062 \n", "2000 99958 18185 214348 300932 47067 172124 \n", "2001 96307 16978 228445 308492 46120 179410 \n", "2002 103129 15600 229039 309931 45320 176987 \n", "2003 105973 14510 220939 312814 46997 169788 \n", "2004 103409 15272 220976 322167 51436 158258 \n", "2005 101158 14956 221044 329406 52653 160646 \n", "All 3432582 716716 9264900 10942317 1709475 7399321 \n", "\n", "State Wyoming All \n", "Year \n", "1997 20068 13195074 \n", "1998 18315 12485714 \n", "1999 16583 11634378 \n", "2000 16285 11608070 \n", "2001 17392 11849006 \n", "2002 17858 11878954 \n", "2003 17962 11826538 \n", "2004 18052 11679474 \n", "2005 17242 11556854 \n", "All 726727 486181458 " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.set_option(\"display.colheader_justify\",\"right\")\n", "pivot2 = pd.pivot_table(data2, values='Count',\n", " index=['Year'],\n", " columns=['State'],\n", " margins=True, aggfunc=np.sum)\n", "pivot2.iloc[-10:,-8:]" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StateAlabamaAlaskaArizonaArkansasCaliforniaColoradoConnecticutDelaware
YearType of Crime
2001Violent Crime195823735286751219021285515492114924868
2002Property Crime1804002411831829610117111764841800549342626967
Violent Crime199313627301711150120838815882108074836
2003Property Crime182241243863143359908412150861797069298127943
Violent Crime193313877286381244920555115757110455525
2004Property Crime1823402217229120311091112271941803229394227256
Violent Crime193244159289521381418917517121101135105
2005Property Crime1773932397528734511277512005311884498979426245
Violent Crime19678419430478146591901781849896355332
All674554910195649436942367092566614776735932256496691359238
\n", "
" ], "text/plain": [ "State Alabama Alaska Arizona Arkansas California \\\n", "Year Type of Crime \n", "2001 Violent Crime 19582 3735 28675 12190 212855 \n", "2002 Property Crime 180400 24118 318296 101171 1176484 \n", " Violent Crime 19931 3627 30171 11501 208388 \n", "2003 Property Crime 182241 24386 314335 99084 1215086 \n", " Violent Crime 19331 3877 28638 12449 205551 \n", "2004 Property Crime 182340 22172 291203 110911 1227194 \n", " Violent Crime 19324 4159 28952 13814 189175 \n", "2005 Property Crime 177393 23975 287345 112775 1200531 \n", " Violent Crime 19678 4194 30478 14659 190178 \n", "All 6745549 1019564 9436942 3670925 66614776 \n", "\n", "State Colorado Connecticut Delaware \n", "Year Type of Crime \n", "2001 Violent Crime 15492 11492 4868 \n", "2002 Property Crime 180054 93426 26967 \n", " Violent Crime 15882 10807 4836 \n", "2003 Property Crime 179706 92981 27943 \n", " Violent Crime 15757 11045 5525 \n", "2004 Property Crime 180322 93942 27256 \n", " Violent Crime 17121 10113 5105 \n", "2005 Property Crime 188449 89794 26245 \n", " Violent Crime 18498 9635 5332 \n", "All 7359322 5649669 1359238 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pivot3 = pd.pivot_table(data2, values='Count',\n", " index=['Year', 'Type of Crime'],\n", " columns=['State'],\n", " margins=True, aggfunc=np.sum)\n", "pivot3.iloc[-10:,:8]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to select crimes for a given year and/or location:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
State | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware
Year
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware]\n", "Index: []" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pivot2.ix[2004:2005,:8]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6 Stack and Unstack" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`stack` “pivot” a level of the column labels to an inner-most row level. In the following case, since after stacking there are no column labels remaining, `stack` returns a Series instead of a Dataframe:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
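The `.ix` indexer used above was deprecated in pandas 0.20 and removed in 1.0; `.loc` performs the same label-based slicing, with both endpoints included. A minimal sketch on hypothetical data (not `pivot2` itself):

```python
import pandas as pd

# Hypothetical pivot with years as the row index
pivot = pd.DataFrame({'Alabama': [1, 2, 3], 'Alaska': [4, 5, 6]},
                     index=pd.Index([2003, 2004, 2005], name='Year'))

# Label-based slicing: .loc includes both endpoints
subset = pivot.loc[2004:2005, ['Alabama', 'Alaska']]
print(subset)
```

Positional slicing (the other half of what `.ix` used to guess at) is done explicitly with `.iloc`.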
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
State | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware
Year | Type of Crime
1960 | Property Crime | 33823 | 3494 | 36539 | 16548 | 508511 | 35695 | 28393 |  9267
     | Violent Crime  |  6097 |  236 |  2704 |  1924 |  37558 |  2408 |   928 |   375
1961 | Property Crime | 32541 | 3605 | 41304 | 16836 | 521008 | 40156 | 31658 |  9343
     | Violent Crime  |  5564 |  208 |  2288 |  1809 |  38154 |  2659 |   878 |   318
1962 | Property Crime | 35829 | 3849 | 45694 | 18199 | 564451 | 47872 | 33798 | 10157
\n", "
" ], "text/plain": [ "State Alabama Alaska Arizona Arkansas California Colorado \\\n", "Year Type of Crime \n", "1960 Property Crime 33823 3494 36539 16548 508511 35695 \n", " Violent Crime 6097 236 2704 1924 37558 2408 \n", "1961 Property Crime 32541 3605 41304 16836 521008 40156 \n", " Violent Crime 5564 208 2288 1809 38154 2659 \n", "1962 Property Crime 35829 3849 45694 18199 564451 47872 \n", "\n", "State Connecticut Delaware \n", "Year Type of Crime \n", "1960 Property Crime 28393 9267 \n", " Violent Crime 928 375 \n", "1961 Property Crime 31658 9343 \n", " Violent Crime 878 318 \n", "1962 Property Crime 33798 10157 " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pivot3.ix[:5,:8]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Year Type of Crime State \n", "1960 Property Crime Alabama 33823\n", " Alaska 3494\n", " Arizona 36539\n", " Arkansas 16548\n", " California 508511\n", " Colorado 35695\n", " Connecticut 28393\n", " Delaware 9267\n", " District of Columbia 16495\n", " Florida 122858\n", "dtype: float64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stk = pivot3.stack()\n", "stk[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`unstack` reverse the operation by pivoting a row index on columns:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
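The `stack`/`unstack` round trip can be seen on a tiny frame (hypothetical data): stacking a frame with a single level of column labels yields a Series, and unstacking that Series restores the frame:

```python
import pandas as pd

# Hypothetical frame with one level of column labels
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])

# stack() pivots the column labels to the inner-most row level;
# no column labels remain, so the result is a Series
s = df.stack()
print(s)

# unstack() pivots the inner-most row level back to columns,
# recovering the original frame
assert df.equals(s.unstack())
```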
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
State | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware
Year | Type of Crime
1960 | Property Crime | 33823 | 3494 | 36539 | 16548 | 508511 | 35695 | 28393 |  9267
     | Violent Crime  |  6097 |  236 |  2704 |  1924 |  37558 |  2408 |   928 |   375
1961 | Property Crime | 32541 | 3605 | 41304 | 16836 | 521008 | 40156 | 31658 |  9343
     | Violent Crime  |  5564 |  208 |  2288 |  1809 |  38154 |  2659 |   878 |   318
1962 | Property Crime | 35829 | 3849 | 45694 | 18199 | 564451 | 47872 | 33798 | 10157
\n", "
" ], "text/plain": [ "State Alabama Alaska Arizona Arkansas California Colorado \\\n", "Year Type of Crime \n", "1960 Property Crime 33823 3494 36539 16548 508511 35695 \n", " Violent Crime 6097 236 2704 1924 37558 2408 \n", "1961 Property Crime 32541 3605 41304 16836 521008 40156 \n", " Violent Crime 5564 208 2288 1809 38154 2659 \n", "1962 Property Crime 35829 3849 45694 18199 564451 47872 \n", "\n", "State Connecticut Delaware \n", "Year Type of Crime \n", "1960 Property Crime 28393 9267 \n", " Violent Crime 928 375 \n", "1961 Property Crime 31658 9343 \n", " Violent Crime 878 318 \n", "1962 Property Crime 33798 10157 " ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from warnings import filterwarnings\n", "filterwarnings(action='ignore', category=FutureWarning)\n", "unstk = stk.unstack(-1)\n", "unstk.ix[:5,:8]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "Visit [www.add-for.com]() for more tutorials and updates.\n", "\n", "This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }
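`unstack` also accepts a level number or name, so any row level can be pivoted to columns, not only the inner-most one (as `stk.unstack(-1)` did above). A small sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical Series with a two-level row index
idx = pd.MultiIndex.from_product(
    [[2004, 2005], ['Property Crime', 'Violent Crime']],
    names=['Year', 'Type of Crime'])
s = pd.Series([100, 10, 200, 20], index=idx)

# unstack(-1) pivots the inner-most level to columns (the default)...
by_type = s.unstack(-1)

# ...while unstack('Year') pivots the outer level instead
by_year = s.unstack('Year')
print(by_type)
print(by_year)
```

Passing the level name (`'Year'`) rather than its number makes the intent explicit and survives reordering of index levels.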