{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tidy Data In Python\n", "\n", "A Python exercise motivated by this nice article from Hadley Wickham: [Tidy Data](http://vita.had.co.nz/papers/tidy-data.pdf). reading this paper is a nice prerequisite for this notebook.\n", "\n", "The code in this notebook was originally developed and commented by [Jean-François Puget](https://www.ibm.com/developerworks/community/blogs/jfp?lang=en) in a blog post on \"[Tidy Data In Python](https://www.ibm.com/developerworks/community/blogs/jfp/entry/Tidy_Data_In_Python?lang=en)\".\n", "\n", "Let's start.\n", "\n", "We only need two Python packages here." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's show readers which versions we are using." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'0.18.1'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.__version__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "A messy data set. *Messy* is used as defined in Hadley Wickham's paper: any data set that is not tidy. Messy data sets are often convenient for showing them to human as they are compact. This form is often used in publications." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FirstLastTreatment ATreatment B
0JohnSmithNaN2
1JaneDoe16.011
2MaryJohnson3.01
\n", "
" ], "text/plain": [ " First Last Treatment A Treatment B\n", "0 John Smith NaN 2\n", "1 Jane Doe 16.0 11\n", "2 Mary Johnson 3.0 1" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messy = pd.DataFrame({'First' : ['John', 'Jane', 'Mary'], \n", " 'Last' : ['Smith', 'Doe', 'Johnson'], \n", " 'Treatment A' : [np.nan, 16, 3], \n", " 'Treatment B' : [2, 11, 1]})\n", "messy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "People may prefer the transpose view of that data set" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
FirstJohnJaneMary
LastSmithDoeJohnson
Treatment ANaN163
Treatment B2111
\n", "
" ], "text/plain": [ " 0 1 2\n", "First John Jane Mary\n", "Last Smith Doe Johnson\n", "Treatment A NaN 16 3\n", "Treatment B 2 11 1" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messy.T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Messy data sets aren't that easy to process by statistical or machine learning packages. These often assume that examples are provided as rows in a matrix whose columns are example featureThis is precisely what a tidy data set is. Applying the `melt()` function to it creates a tidy version of it. We sort the result by name to make it easier to read." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FirstLastvariablevalue
0JohnSmithTreatment ANaN
1JaneDoeTreatment A16.0
2MaryJohnsonTreatment A3.0
3JohnSmithTreatment B2.0
4JaneDoeTreatment B11.0
5MaryJohnsonTreatment B1.0
\n", "
" ], "text/plain": [ " First Last variable value\n", "0 John Smith Treatment A NaN\n", "1 Jane Doe Treatment A 16.0\n", "2 Mary Johnson Treatment A 3.0\n", "3 John Smith Treatment B 2.0\n", "4 Jane Doe Treatment B 11.0\n", "5 Mary Johnson Treatment B 1.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tidy = pd.melt(messy, id_vars=['First','Last'])\n", "tidy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The values are fine but column names aren't really meaningful. Fortinately, the `melt()` function has arguments for renaming them." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FirstLasttreatmentresult
0JohnSmithTreatment ANaN
1JaneDoeTreatment A16.0
2MaryJohnsonTreatment A3.0
3JohnSmithTreatment B2.0
4JaneDoeTreatment B11.0
5MaryJohnsonTreatment B1.0
\n", "
" ], "text/plain": [ " First Last treatment result\n", "0 John Smith Treatment A NaN\n", "1 Jane Doe Treatment A 16.0\n", "2 Mary Johnson Treatment A 3.0\n", "3 John Smith Treatment B 2.0\n", "4 Jane Doe Treatment B 11.0\n", "5 Mary Johnson Treatment B 1.0" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tidy = pd.melt(messy, id_vars=['First','Last'], var_name='treatment', value_name='result')\n", "tidy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A simple melt example" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcrow
0147A
1258B
2369C
\n", "
" ], "text/plain": [ " a b c row\n", "0 1 4 7 A\n", "1 2 5 8 B\n", "2 3 6 9 C" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messy = pd.DataFrame({'row' : ['A', 'B', 'C'], \n", " 'a' : [1, 2, 3],\n", " 'b' : [4, 5, 6],\n", " 'c' : [7, 8, 9]})\n", "messy" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rowvariablevalue
0Aa1
1Ba2
2Ca3
3Ab4
4Bb5
5Cb6
6Ac7
7Bc8
8Cc9
\n", "
" ], "text/plain": [ " row variable value\n", "0 A a 1\n", "1 B a 2\n", "2 C a 3\n", "3 A b 4\n", "4 B b 5\n", "5 C b 6\n", "6 A c 7\n", "7 B c 8\n", "8 C c 9" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.melt(messy, id_vars='row')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rowdimensionlength
0Aa1
1Ba2
2Ca3
3Ab4
4Bb5
5Cb6
6Ac7
7Bc8
8Cc9
\n", "
" ], "text/plain": [ " row dimension length\n", "0 A a 1\n", "1 B a 2\n", "2 C a 3\n", "3 A b 4\n", "4 B b 5\n", "5 C b 6\n", "6 A c 7\n", "7 B c 8\n", "8 C c 9" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tidy = pd.melt(messy, id_vars='row', var_name='dimension', value_name='length')\n", "tidy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pivot is almost the inverse of melt" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dimensionabc
row
A147
B258
C369
\n", "
" ], "text/plain": [ "dimension a b c\n", "row \n", "A 1 4 7\n", "B 2 5 8\n", "C 3 6 9" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messy1 = tidy.pivot(index='row',columns='dimension',values='length')\n", "messy1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is almost the same as the orginal dataframe, except that row is used as index. We can move it back to a row easily." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dimensionrowabc
0A147
1B258
2C369
\n", "
" ], "text/plain": [ "dimension row a b c\n", "0 A 1 4 7\n", "1 B 2 5 8\n", "2 C 3 6 9" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messy1.reset_index(inplace=True)\n", "messy1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Last step is to remove the name for the set of columns." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rowabc
0A147
1B258
2C369
\n", "
" ], "text/plain": [ " row a b c\n", "0 A 1 4 7\n", "1 B 2 5 8\n", "2 C 3 6 9" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messy1.columns.name = ''\n", "messy1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the same as the original dataframe, up to column reordering." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Column headers are values, not variable names\n", "\n", "This is the first issue with messy data in Hadley's paper. Let's first create the dataframe used as an example. For practical reasons, it was simpler to first construct the transpose of it, then process it to get the data set used in the article." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
religion<$10k$10-20k$20-30k$30-40k$40-50k$50-75k
0Agnostic2734608176137
1Atheist122737523570
2Buddhist272130343358
3Catholic4186177326706381116
4Don't know/refused151415111035
5Evangelical Prot57586910649828811486
6Hindu19791134
7Historically Black Prot228244236238197223
8Jehovah's Witness202724242130
9Jewish191925253095
\n", "
" ], "text/plain": [ " religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k\n", "0 Agnostic 27 34 60 81 76 137\n", "1 Atheist 12 27 37 52 35 70\n", "2 Buddhist 27 21 30 34 33 58\n", "3 Catholic 418 617 732 670 638 1116\n", "4 Don't know/refused 15 14 15 11 10 35\n", "5 Evangelical Prot 575 869 1064 982 881 1486\n", "6 Hindu 1 9 7 9 11 34\n", "7 Historically Black Prot 228 244 236 238 197 223\n", "8 Jehovah's Witness 20 27 24 24 21 30\n", "9 Jewish 19 19 25 25 30 95" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messy = pd.DataFrame({'Agnostic' : [27, 34, 60, 81, 76, 137],\n", " 'Atheist' : [12, 27, 37, 52, 35, 70],\n", " 'Buddhist' : [27, 21, 30, 34, 33, 58],\n", " 'Catholic' : [418, 617, 732, 670, 638, 1116],\n", " \"Don't know/refused\" : [15, 14, 15, 11, 10, 35],\n", " 'Evangelical Prot' : [575, 869, 1064, 982, 881, 1486],\n", " 'Hindu' : [1, 9, 7, 9, 11, 34],\n", " 'Historically Black Prot' : [228, 244, 236, 238, 197, 223],\n", " \"Jehovah's Witness\" : [20, 27, 24, 24, 21, 30],\n", " 'Jewish' : [19, 19, 25, 25, 30, 95],\n", " })\n", " \n", "def transpose(df, columns):\n", " df = df.T.copy()\n", " df.reset_index(inplace=True)\n", " df.columns = columns\n", " return df\n", "\n", "messy = transpose(messy, ['religion', '<$10k', '$10-20k', '$20-30k', '$30-40k', '$40-50k', '$50-75k'])\n", "\n", "messy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, the `melt()` function is our friend. We sort the result by religion to make it easier to read." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
religionincomefreq
0Agnostic<$10k27
30Agnostic$30-40k81
40Agnostic$40-50k76
50Agnostic$50-75k137
10Agnostic$10-20k34
\n", "
" ], "text/plain": [ " religion income freq\n", "0 Agnostic <$10k 27\n", "30 Agnostic $30-40k 81\n", "40 Agnostic $40-50k 76\n", "50 Agnostic $50-75k 137\n", "10 Agnostic $10-20k 34" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tidy = pd.melt(messy, id_vars = ['religion'], var_name='income', value_name='freq')\n", "tidy.sort_values(by=['religion'], inplace=True)\n", "tidy.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Variables are stored in both rows and columns\n", "\n", "This example is a little trickier. We first read the input data as a data frame. This data is available at https://github.com/hadley/tidy-data/blob/master/data/tb.csv\n", "\n", "I've cloned it therefore it is in my local `data` directory.\n", "\n", "Reading it is easy. We remove the `new_sp_` prefix appearing in most columns, and we rename a couple of columns as well." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyeartotalm04m514m014m1524m2534m3544m4554...f04f514f014f1524f2534f3544f4554f5564f65fu
0AD1989NaNNaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
1AD1990NaNNaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2AD1991NaNNaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
3AD1992NaNNaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
4AD199315.0NaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "

5 rows × 23 columns

\n", "
" ], "text/plain": [ " country year total m04 m514 m014 m1524 m2534 m3544 m4554 ... f04 \\\n", "0 AD 1989 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN \n", "1 AD 1990 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN \n", "2 AD 1991 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN \n", "3 AD 1992 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN \n", "4 AD 1993 15.0 NaN NaN NaN NaN NaN NaN NaN ... NaN \n", "\n", " f514 f014 f1524 f2534 f3544 f4554 f5564 f65 fu \n", "0 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "1 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "2 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "3 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "4 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "\n", "[5 rows x 23 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url = \"https://raw.githubusercontent.com/hadley/tidy-data/master/data/tb.csv\"\n", "tb = pd.read_csv(url)\n", "tb.columns = tb.columns.str.replace('new_sp_','')\n", "tb.rename(columns = {'new_sp' : 'total', 'iso2' : 'country'}, \n", " inplace=True)\n", "tb.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use year 2000, and drop few columns, to stay in sync with Wickham's article." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyearm014m1524m2534m3544m4554m5564m65muf014f1524f2534f3544f4554f5564f65fu
10AD20000.00.01.00.00.00.00.0NaNNaNNaNNaNNaNNaNNaNNaNNaN
36AE20002.04.04.06.05.012.010.0NaN3.016.01.03.00.00.04.0NaN
60AF200052.0228.0183.0149.0129.094.080.0NaN93.0414.0565.0339.0205.099.036.0NaN
87AG20000.00.00.00.00.00.01.0NaN1.01.01.00.00.00.00.0NaN
136AL20002.019.021.014.024.019.016.0NaN3.011.010.08.08.05.011.0NaN
165AM20002.0152.0130.0131.063.026.021.0NaN1.024.027.024.08.08.04.0NaN
178AN20000.00.01.02.00.00.00.0NaN0.00.01.00.00.01.00.0NaN
207AO2000186.0999.01003.0912.0482.0312.0194.0NaN247.01142.01091.0844.0417.0200.0120.0NaN
236AR200097.0278.0594.0402.0419.0368.0330.0NaN121.0544.0479.0262.0230.0179.0216.0NaN
265AS2000NaNNaNNaNNaN1.01.0NaNNaNNaNNaNNaNNaN1.0NaNNaNNaN
\n", "
" ], "text/plain": [ " country year m014 m1524 m2534 m3544 m4554 m5564 m65 mu \\\n", "10 AD 2000 0.0 0.0 1.0 0.0 0.0 0.0 0.0 NaN \n", "36 AE 2000 2.0 4.0 4.0 6.0 5.0 12.0 10.0 NaN \n", "60 AF 2000 52.0 228.0 183.0 149.0 129.0 94.0 80.0 NaN \n", "87 AG 2000 0.0 0.0 0.0 0.0 0.0 0.0 1.0 NaN \n", "136 AL 2000 2.0 19.0 21.0 14.0 24.0 19.0 16.0 NaN \n", "165 AM 2000 2.0 152.0 130.0 131.0 63.0 26.0 21.0 NaN \n", "178 AN 2000 0.0 0.0 1.0 2.0 0.0 0.0 0.0 NaN \n", "207 AO 2000 186.0 999.0 1003.0 912.0 482.0 312.0 194.0 NaN \n", "236 AR 2000 97.0 278.0 594.0 402.0 419.0 368.0 330.0 NaN \n", "265 AS 2000 NaN NaN NaN NaN 1.0 1.0 NaN NaN \n", "\n", " f014 f1524 f2534 f3544 f4554 f5564 f65 fu \n", "10 NaN NaN NaN NaN NaN NaN NaN NaN \n", "36 3.0 16.0 1.0 3.0 0.0 0.0 4.0 NaN \n", "60 93.0 414.0 565.0 339.0 205.0 99.0 36.0 NaN \n", "87 1.0 1.0 1.0 0.0 0.0 0.0 0.0 NaN \n", "136 3.0 11.0 10.0 8.0 8.0 5.0 11.0 NaN \n", "165 1.0 24.0 27.0 24.0 8.0 8.0 4.0 NaN \n", "178 0.0 0.0 1.0 0.0 0.0 1.0 0.0 NaN \n", "207 247.0 1142.0 1091.0 844.0 417.0 200.0 120.0 NaN \n", "236 121.0 544.0 479.0 262.0 230.0 179.0 216.0 NaN \n", "265 NaN NaN NaN NaN 1.0 NaN NaN NaN " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messy = tb[tb['year'] == 2000].copy()\n", "messy.drop(['total','m04','m514','f04','f514'], axis=1, inplace=True)\n", "messy.head(10)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyearm014m1524m2534m3544m4554m5564m65muf014
10AD20000.00.01.00.00.00.00.0NaNNaN
36AE20002.04.04.06.05.012.010.0NaN3.0
60AF200052.0228.0183.0149.0129.094.080.0NaN93.0
87AG20000.00.00.00.00.00.01.0NaN1.0
136AL20002.019.021.014.024.019.016.0NaN3.0
165AM20002.0152.0130.0131.063.026.021.0NaN1.0
178AN20000.00.01.02.00.00.00.0NaN0.0
207AO2000186.0999.01003.0912.0482.0312.0194.0NaN247.0
236AR200097.0278.0594.0402.0419.0368.0330.0NaN121.0
265AS2000NaNNaNNaNNaN1.01.0NaNNaNNaN
\n", "
" ], "text/plain": [ " country year m014 m1524 m2534 m3544 m4554 m5564 m65 mu f014\n", "10 AD 2000 0.0 0.0 1.0 0.0 0.0 0.0 0.0 NaN NaN\n", "36 AE 2000 2.0 4.0 4.0 6.0 5.0 12.0 10.0 NaN 3.0\n", "60 AF 2000 52.0 228.0 183.0 149.0 129.0 94.0 80.0 NaN 93.0\n", "87 AG 2000 0.0 0.0 0.0 0.0 0.0 0.0 1.0 NaN 1.0\n", "136 AL 2000 2.0 19.0 21.0 14.0 24.0 19.0 16.0 NaN 3.0\n", "165 AM 2000 2.0 152.0 130.0 131.0 63.0 26.0 21.0 NaN 1.0\n", "178 AN 2000 0.0 0.0 1.0 2.0 0.0 0.0 0.0 NaN 0.0\n", "207 AO 2000 186.0 999.0 1003.0 912.0 482.0 312.0 194.0 NaN 247.0\n", "236 AR 2000 97.0 278.0 594.0 402.0 419.0 368.0 330.0 NaN 121.0\n", "265 AS 2000 NaN NaN NaN NaN 1.0 1.0 NaN NaN NaN" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "messy.iloc[:,:11].head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `melt()` function is useful, but is not enough. Let's use it still." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyearvariablecases
0AD2000m0140.0
201AD2000m15240.0
402AD2000m25341.0
603AD2000m35440.0
804AD2000m45540.0
1005AD2000m55640.0
1206AD2000m650.0
1407AD2000muNaN
1608AD2000f014NaN
1809AD2000f1524NaN
\n", "
" ], "text/plain": [ " country year variable cases\n", "0 AD 2000 m014 0.0\n", "201 AD 2000 m1524 0.0\n", "402 AD 2000 m2534 1.0\n", "603 AD 2000 m3544 0.0\n", "804 AD 2000 m4554 0.0\n", "1005 AD 2000 m5564 0.0\n", "1206 AD 2000 m65 0.0\n", "1407 AD 2000 mu NaN\n", "1608 AD 2000 f014 NaN\n", "1809 AD 2000 f1524 NaN" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "molten = pd.melt(messy, id_vars=['country', 'year'], value_name='cases')\n", "molten.sort_values(by=['year', 'country'], inplace=True)\n", "molten.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What isn't really nice is the encoding of sex and age ranges as a string in the `variable` column. Let's process the dataset to create two additional columns, one for the sex, and one for the age range. We then remove the `variable` column. The tidy form also makes it easy to remove the values where the age is `u`." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyearsexagecases
0AD2000m0-140.0
201AD2000m15-240.0
402AD2000m25-341.0
603AD2000m35-440.0
804AD2000m45-540.0
1005AD2000m55-640.0
1206AD2000m65+0.0
1608AD2000f0-14NaN
1809AD2000f15-24NaN
2010AD2000f25-34NaN
\n", "
" ], "text/plain": [ " country year sex age cases\n", "0 AD 2000 m 0-14 0.0\n", "201 AD 2000 m 15-24 0.0\n", "402 AD 2000 m 25-34 1.0\n", "603 AD 2000 m 35-44 0.0\n", "804 AD 2000 m 45-54 0.0\n", "1005 AD 2000 m 55-64 0.0\n", "1206 AD 2000 m 65+ 0.0\n", "1608 AD 2000 f 0-14 NaN\n", "1809 AD 2000 f 15-24 NaN\n", "2010 AD 2000 f 25-34 NaN" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def parse_age(s):\n", " s = s[1:]\n", " if s == '65':\n", " return '65+'\n", " else:\n", " return s[:-2]+'-'+s[-2:]\n", "\n", "tidy = molten[molten['variable'] != 'mu'].copy()\n", "tidy['sex'] = tidy['variable'].apply(lambda s: s[:1])\n", "tidy['age'] = tidy['variable'].apply(parse_age)\n", "tidy = tidy[['country', 'year', 'sex', 'age', 'cases']]\n", "tidy.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Variables are stored in both rows and columns\n", "\n", "This example is really tricky. Let's first create the dataframe. This time, I create it using an array instead of a dictionary, just for the fun of doing something a bit different." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idyearmonthelementd1d2d3d4d5d6d7d8
0MX1700420101tmaxNaNNaNNaNNaNNaNNaNNaNNaN
1MX1700420101tminNaNNaNNaNNaNNaNNaNNaNNaN
2MX1700420102tmaxNaN27.324.1NaNNaNNaNNaNNaN
3MX1700420102tminNaN14.414.4NaNNaNNaNNaNNaN
4MX1700420103tmaxNaNNaNNaNNaN32.1NaNNaNNaN
5MX1700420103tminNaNNaNNaNNaN14.2NaNNaNNaN
6MX1700420104tmaxNaNNaNNaNNaNNaNNaNNaNNaN
7MX1700420104tminNaNNaNNaNNaNNaNNaNNaNNaN
8MX1700420105tmaxNaNNaNNaNNaNNaNNaNNaNNaN
9MX1700420105tminNaNNaNNaNNaNNaNNaNNaNNaN
\n", "
" ], "text/plain": [ " id year month element d1 d2 d3 d4 d5 d6 d7 d8\n", "0 MX17004 2010 1 tmax NaN NaN NaN NaN NaN NaN NaN NaN\n", "1 MX17004 2010 1 tmin NaN NaN NaN NaN NaN NaN NaN NaN\n", "2 MX17004 2010 2 tmax NaN 27.3 24.1 NaN NaN NaN NaN NaN\n", "3 MX17004 2010 2 tmin NaN 14.4 14.4 NaN NaN NaN NaN NaN\n", "4 MX17004 2010 3 tmax NaN NaN NaN NaN 32.1 NaN NaN NaN\n", "5 MX17004 2010 3 tmin NaN NaN NaN NaN 14.2 NaN NaN NaN\n", "6 MX17004 2010 4 tmax NaN NaN NaN NaN NaN NaN NaN NaN\n", "7 MX17004 2010 4 tmin NaN NaN NaN NaN NaN NaN NaN NaN\n", "8 MX17004 2010 5 tmax NaN NaN NaN NaN NaN NaN NaN NaN\n", "9 MX17004 2010 5 tmin NaN NaN NaN NaN NaN NaN NaN NaN" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns = ['id', 'year', 'month', 'element', 'd1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8']\n", "data = [['MX17004', 2010, 1, 'tmax', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, ],\n", " ['MX17004', 2010, 1, 'tmin', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, ],\n", " ['MX17004', 2010, 2, 'tmax', np.nan, 27.3, 24.1, np.nan, np.nan, np.nan, np.nan, np.nan, ],\n", " ['MX17004', 2010, 2, 'tmin', np.nan, 14.4, 14.4, np.nan, np.nan, np.nan, np.nan, np.nan, ],\n", " ['MX17004', 2010, 3, 'tmax', np.nan, np.nan, np.nan, np.nan, 32.1, np.nan, np.nan, np.nan, ],\n", " ['MX17004', 2010, 3, 'tmin', np.nan, np.nan, np.nan, np.nan, 14.2, np.nan, np.nan, np.nan, ],\n", " ['MX17004', 2010, 4, 'tmax', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, ],\n", " ['MX17004', 2010, 4, 'tmin', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, ],\n", " ['MX17004', 2010, 5, 'tmax', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, ],\n", " ['MX17004', 2010, 5, 'tmin', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,]\n", " ]\n", "messy = pd.DataFrame(data=data, columns=columns); messy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most of the values are not relevant. However, filtering the NaN values is imposible here. We need to melt the dataframe first." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idyearmonthelementdayvalue
0MX1700420102tmaxd227.3
1MX1700420102tmind214.4
2MX1700420102tmaxd324.1
3MX1700420102tmind314.4
4MX1700420103tmaxd532.1
5MX1700420103tmind514.2
\n", "
" ], "text/plain": [ " id year month element day value\n", "0 MX17004 2010 2 tmax d2 27.3\n", "1 MX17004 2010 2 tmin d2 14.4\n", "2 MX17004 2010 2 tmax d3 24.1\n", "3 MX17004 2010 2 tmin d3 14.4\n", "4 MX17004 2010 3 tmax d5 32.1\n", "5 MX17004 2010 3 tmin d5 14.2" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "molten = pd.melt(messy, \n", " id_vars=['id', 'year','month','element',],\n", " var_name='day');\n", "molten.dropna(inplace=True)\n", "molten = molten.reset_index(drop=True)\n", "molten" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataframe is not in tidy form yet. First, the column `element` contains variable names. Second, the columns `year, month, day` represent one variable: the date. Let's fix the latter problem first." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idelementvaluedate
0MX17004tmax27.32010-02-02
1MX17004tmin14.42010-02-02
2MX17004tmax24.12010-02-03
3MX17004tmin14.42010-02-03
4MX17004tmax32.12010-03-05
5MX17004tmin14.22010-03-05
\n", "
" ], "text/plain": [ " id element value date\n", "0 MX17004 tmax 27.3 2010-02-02\n", "1 MX17004 tmin 14.4 2010-02-02\n", "2 MX17004 tmax 24.1 2010-02-03\n", "3 MX17004 tmin 14.4 2010-02-03\n", "4 MX17004 tmax 32.1 2010-03-05\n", "5 MX17004 tmin 14.2 2010-03-05" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def f(row): \n", " return \"%d-%02d-%02d\" % (row['year'], row['month'], int(row['day'][1:]))\n", " \n", "molten['date'] = molten.apply(f,axis=1)\n", "molten = molten[['id', 'element','value','date']]\n", "molten" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we need to pivot the element column. " ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
elementtmaxtmin
date
2010-02-0227.314.4
2010-02-0324.114.4
2010-03-0532.114.2
\n", "
" ], "text/plain": [ "element tmax tmin\n", "date \n", "2010-02-02 27.3 14.4\n", "2010-02-03 24.1 14.4\n", "2010-03-05 32.1 14.2" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tidy = molten.pivot(index='date',columns='element',values='value')\n", "tidy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wait a minute. \n", "\n", "Where is the id?\n", "\n", "One way to keep it, is to move the id to an index with the `groupby()` function, and apply `pivot()` inside each group. " ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
elementtmaxtmin
iddate
MX170042010-02-0227.314.4
2010-02-0324.114.4
2010-03-0532.114.2
\n", "
" ], "text/plain": [ "element tmax tmin\n", "id date \n", "MX17004 2010-02-02 27.3 14.4\n", " 2010-02-03 24.1 14.4\n", " 2010-03-05 32.1 14.2" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tidy = molten.groupby('id').apply(pd.DataFrame.pivot,\n", " index='date',\n", " columns='element',\n", " values='value')\n", "tidy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are almost there. We simply have to move id back as a column with the `reset_index()`." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
elementiddatetmaxtmin
0MX170042010-02-0227.314.4
1MX170042010-02-0324.114.4
2MX170042010-03-0532.114.2
\n", "
" ], "text/plain": [ "element id date tmax tmin\n", "0 MX17004 2010-02-02 27.3 14.4\n", "1 MX17004 2010-02-03 24.1 14.4\n", "2 MX17004 2010-03-05 32.1 14.2" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tidy.reset_index(inplace=True)\n", "tidy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get rid of the `element` name." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddatetmaxtmin
0MX170042010-02-0227.314.4
1MX170042010-02-0324.114.4
2MX170042010-03-0532.114.2
\n", "
" ], "text/plain": [ " id date tmax tmin\n", "0 MX17004 2010-02-02 27.3 14.4\n", "1 MX17004 2010-02-03 24.1 14.4\n", "2 MX17004 2010-03-05 32.1 14.2" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tidy.columns.name = ''\n", "tidy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Et Voilà!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multiple types in one table\n", "\n", "This example is used to illustrate two of the above problems. \n", "\n", "Let's create it. It is an excerpt from the Billboard top hits for 2000." ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearartisttracktimedate enteredwk1wk2wk3
020002,PacBaby Don't Cry4:222000-02-26878272
120002Ge+herThe Hardest Part Of ...3:152000-09-02918792
220003 Doors DownKryptonite3:532000-04-08817068
3200098^0Give Me Just One Nig...3:242000-08-19513934
42000A*TeensDancing Queen3:442000-07-08979796
52000AaliyahI Don't Wanna4:152000-01-29846251
62000AaliyahTry Again4:032000-03-18595338
72000Adams,YolandaOpen My Heart5:302000-08-26767674
\n", "
" ], "text/plain": [ " year artist track time date entered wk1 wk2 \\\n", "0 2000 2,Pac Baby Don't Cry 4:22 2000-02-26 87 82 \n", "1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 91 87 \n", "2 2000 3 Doors Down Kryptonite 3:53 2000-04-08 81 70 \n", "3 2000 98^0 Give Me Just One Nig... 3:24 2000-08-19 51 39 \n", "4 2000 A*Teens Dancing Queen 3:44 2000-07-08 97 97 \n", "5 2000 Aaliyah I Don't Wanna 4:15 2000-01-29 84 62 \n", "6 2000 Aaliyah Try Again 4:03 2000-03-18 59 53 \n", "7 2000 Adams,Yolanda Open My Heart 5:30 2000-08-26 76 76 \n", "\n", " wk3 \n", "0 72 \n", "1 92 \n", "2 68 \n", "3 34 \n", "4 96 \n", "5 51 \n", "6 38 \n", "7 74 " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns = ['year','artist','track','time','date entered','wk1','wk2','wk3',]\n", "\n", "data = [[2000,\"2,Pac\",\"Baby Don't Cry\",\"4:22\",\"2000-02-26\",87,82,72,],\n", " [2000,\"2Ge+her\",\"The Hardest Part Of ...\",\"3:15\",\"2000-09-02\",91,87,92,],\n", " [2000,\"3 Doors Down\",\"Kryptonite\",\"3:53\",\"2000-04-08\",81,70,68,],\n", " [2000,\"98^0\",\"Give Me Just One Nig...\",\"3:24\",\"2000-08-19\",51,39,34,],\n", " [2000,\"A*Teens\",\"Dancing Queen\",\"3:44\",\"2000-07-08\",97,97,96,],\n", " [2000,\"Aaliyah\",\"I Don't Wanna\",\"4:15\",\"2000-01-29\",84,62,51,],\n", " [2000,\"Aaliyah\",\"Try Again\",\"4:03\",\"2000-03-18\",59,53,38,],\n", " [2000,\"Adams,Yolanda\",\"Open My Heart\",\"5:30\",\"2000-08-26\",76,76,74]\n", " ]\n", "\n", "messy = pd.DataFrame(data=data, columns=columns)\n", "messy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset is messy because there are several observations per row, in the columns wk1, wk2, wk3. We can get one observation per row by metling the dataset." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearartisttracktimedate enteredweekrank
52000AaliyahI Don't Wanna4:152000-01-29wk184
132000AaliyahI Don't Wanna4:152000-01-29wk262
212000AaliyahI Don't Wanna4:152000-01-29wk351
020002,PacBaby Don't Cry4:222000-02-26wk187
820002,PacBaby Don't Cry4:222000-02-26wk282
\n", "
" ], "text/plain": [ " year artist track time date entered week rank\n", "5 2000 Aaliyah I Don't Wanna 4:15 2000-01-29 wk1 84\n", "13 2000 Aaliyah I Don't Wanna 4:15 2000-01-29 wk2 62\n", "21 2000 Aaliyah I Don't Wanna 4:15 2000-01-29 wk3 51\n", "0 2000 2,Pac Baby Don't Cry 4:22 2000-02-26 wk1 87\n", "8 2000 2,Pac Baby Don't Cry 4:22 2000-02-26 wk2 82" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "molten = pd.melt(messy, \n", " id_vars=['year','artist','track','time','date entered'],\n", " var_name = 'week',\n", " value_name = 'rank',\n", " )\n", "molten.sort_values(by=['date entered','week'], inplace=True)\n", "molten.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can clean the dataset further, first by turning week into number" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearartisttracktimedate enteredweekrank
52000AaliyahI Don't Wanna4:152000-01-29184
132000AaliyahI Don't Wanna4:152000-01-29262
212000AaliyahI Don't Wanna4:152000-01-29351
020002,PacBaby Don't Cry4:222000-02-26187
820002,PacBaby Don't Cry4:222000-02-26282
\n", "
" ], "text/plain": [ " year artist track time date entered week rank\n", "5 2000 Aaliyah I Don't Wanna 4:15 2000-01-29 1 84\n", "13 2000 Aaliyah I Don't Wanna 4:15 2000-01-29 2 62\n", "21 2000 Aaliyah I Don't Wanna 4:15 2000-01-29 3 51\n", "0 2000 2,Pac Baby Don't Cry 4:22 2000-02-26 1 87\n", "8 2000 2,Pac Baby Don't Cry 4:22 2000-02-26 2 82" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "molten['week'] = molten['week'].apply(lambda s: int(s[2:]))\n", "molten.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Second, we need the starting date of the week for each observation, instead of the date the track entered." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearartisttracktimeweekrankdate
52000AaliyahI Don't Wanna4:151842000-01-29
132000AaliyahI Don't Wanna4:152622000-02-05
212000AaliyahI Don't Wanna4:153512000-02-12
020002,PacBaby Don't Cry4:221872000-02-26
820002,PacBaby Don't Cry4:222822000-03-04
\n", "
" ], "text/plain": [ " year artist track time week rank date\n", "5 2000 Aaliyah I Don't Wanna 4:15 1 84 2000-01-29\n", "13 2000 Aaliyah I Don't Wanna 4:15 2 62 2000-02-05\n", "21 2000 Aaliyah I Don't Wanna 4:15 3 51 2000-02-12\n", "0 2000 2,Pac Baby Don't Cry 4:22 1 87 2000-02-26\n", "8 2000 2,Pac Baby Don't Cry 4:22 2 82 2000-03-04" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from datetime import datetime, timedelta\n", "\n", "def increment_date(row):\n", " date = datetime.strptime(row['date entered'], \"%Y-%m-%d\")\n", " return date + timedelta(7) * (row['week'] - 1)\n", "\n", "molten['date'] = molten.apply(increment_date, axis=1)\n", "molten.drop('date entered', axis=1, inplace=True)\n", "molten.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Last, this dataset is denormalized. This is fine for most statistical and machine learning packages, but we might want to normalize it. It means that we should group information that is repeated every week for a track in a separate table. This information appears in columns `year ,artist, track, time`. " ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idyearartisttracktime
0020002,PacBaby Don't Cry4:22
1120002Ge+herThe Hardest Part Of ...3:15
2220003 Doors DownKryptonite3:53
33200098^0Give Me Just One Nig...3:24
442000A*TeensDancing Queen3:44
552000AaliyahI Don't Wanna4:15
662000AaliyahTry Again4:03
772000Adams,YolandaOpen My Heart5:30
\n", "
" ], "text/plain": [ " id year artist track time\n", "0 0 2000 2,Pac Baby Don't Cry 4:22\n", "1 1 2000 2Ge+her The Hardest Part Of ... 3:15\n", "2 2 2000 3 Doors Down Kryptonite 3:53\n", "3 3 2000 98^0 Give Me Just One Nig... 3:24\n", "4 4 2000 A*Teens Dancing Queen 3:44\n", "5 5 2000 Aaliyah I Don't Wanna 4:15\n", "6 6 2000 Aaliyah Try Again 4:03\n", "7 7 2000 Adams,Yolanda Open My Heart 5:30" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tidy_track = molten[['year','artist','track','time']]\\\n", " .groupby(['year','artist','track'])\\\n", " .first()\n", "tidy_track.reset_index(inplace=True)\n", "tidy_track.reset_index(inplace=True)\n", "tidy_track.rename(columns = {'index':'id'}, inplace=True)\n", "tidy_track" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddaterank
052000-01-2984
152000-02-0562
252000-02-1251
302000-02-2687
402000-03-0482
\n", "
" ], "text/plain": [ " id date rank\n", "0 5 2000-01-29 84\n", "1 5 2000-02-05 62\n", "2 5 2000-02-12 51\n", "3 0 2000-02-26 87\n", "4 0 2000-03-04 82" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tidy_rank = pd.merge(molten, tidy_track, on='track')\n", "tidy_rank = tidy_rank[['id', 'date', 'rank']]\n", "tidy_rank.head()" ] } ], "metadata": { "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" }, "toc": { "navigate_menu": false, "number_sections": false, "sideBar": true, "threshold": 6, "toc_cell": false, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 0 }