{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tidy Data In Python\n",
"\n",
"A Python exercise motivated by this nice article from Hadley Wickham: [Tidy Data](http://vita.had.co.nz/papers/tidy-data.pdf). reading this paper is a nice prerequisite for this notebook.\n",
"\n",
"The code in this notebook was originally developed and commented by [Jean-François Puget](https://www.ibm.com/developerworks/community/blogs/jfp?lang=en) in a blog post on \"[Tidy Data In Python](https://www.ibm.com/developerworks/community/blogs/jfp/entry/Tidy_Data_In_Python?lang=en)\".\n",
"\n",
"Let's start.\n",
"\n",
"We only need two Python packages here."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's show readers which versions we are using."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'0.18.1'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.__version__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"\n",
"A messy data set. *Messy* is used as defined in Hadley Wickham's paper: any data set that is not tidy. Messy data sets are often convenient for showing them to human as they are compact. This form is often used in publications."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" First | \n",
" Last | \n",
" Treatment A | \n",
" Treatment B | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" John | \n",
" Smith | \n",
" NaN | \n",
" 2 | \n",
"
\n",
" \n",
" 1 | \n",
" Jane | \n",
" Doe | \n",
" 16.0 | \n",
" 11 | \n",
"
\n",
" \n",
" 2 | \n",
" Mary | \n",
" Johnson | \n",
" 3.0 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" First Last Treatment A Treatment B\n",
"0 John Smith NaN 2\n",
"1 Jane Doe 16.0 11\n",
"2 Mary Johnson 3.0 1"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"messy = pd.DataFrame({'First' : ['John', 'Jane', 'Mary'], \n",
" 'Last' : ['Smith', 'Doe', 'Johnson'], \n",
" 'Treatment A' : [np.nan, 16, 3], \n",
" 'Treatment B' : [2, 11, 1]})\n",
"messy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"People may prefer the transpose view of that data set"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" First | \n",
" John | \n",
" Jane | \n",
" Mary | \n",
"
\n",
" \n",
" Last | \n",
" Smith | \n",
" Doe | \n",
" Johnson | \n",
"
\n",
" \n",
" Treatment A | \n",
" NaN | \n",
" 16 | \n",
" 3 | \n",
"
\n",
" \n",
" Treatment B | \n",
" 2 | \n",
" 11 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"First John Jane Mary\n",
"Last Smith Doe Johnson\n",
"Treatment A NaN 16 3\n",
"Treatment B 2 11 1"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"messy.T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Messy data sets aren't that easy to process by statistical or machine learning packages. These often assume that examples are provided as rows in a matrix whose columns are example featureThis is precisely what a tidy data set is. Applying the `melt()` function to it creates a tidy version of it. We sort the result by name to make it easier to read."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" First | \n",
" Last | \n",
" variable | \n",
" value | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" John | \n",
" Smith | \n",
" Treatment A | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" Jane | \n",
" Doe | \n",
" Treatment A | \n",
" 16.0 | \n",
"
\n",
" \n",
" 2 | \n",
" Mary | \n",
" Johnson | \n",
" Treatment A | \n",
" 3.0 | \n",
"
\n",
" \n",
" 3 | \n",
" John | \n",
" Smith | \n",
" Treatment B | \n",
" 2.0 | \n",
"
\n",
" \n",
" 4 | \n",
" Jane | \n",
" Doe | \n",
" Treatment B | \n",
" 11.0 | \n",
"
\n",
" \n",
" 5 | \n",
" Mary | \n",
" Johnson | \n",
" Treatment B | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" First Last variable value\n",
"0 John Smith Treatment A NaN\n",
"1 Jane Doe Treatment A 16.0\n",
"2 Mary Johnson Treatment A 3.0\n",
"3 John Smith Treatment B 2.0\n",
"4 Jane Doe Treatment B 11.0\n",
"5 Mary Johnson Treatment B 1.0"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tidy = pd.melt(messy, id_vars=['First','Last'])\n",
"tidy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The values are fine but column names aren't really meaningful. Fortinately, the `melt()` function has arguments for renaming them."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" First | \n",
" Last | \n",
" treatment | \n",
" result | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" John | \n",
" Smith | \n",
" Treatment A | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" Jane | \n",
" Doe | \n",
" Treatment A | \n",
" 16.0 | \n",
"
\n",
" \n",
" 2 | \n",
" Mary | \n",
" Johnson | \n",
" Treatment A | \n",
" 3.0 | \n",
"
\n",
" \n",
" 3 | \n",
" John | \n",
" Smith | \n",
" Treatment B | \n",
" 2.0 | \n",
"
\n",
" \n",
" 4 | \n",
" Jane | \n",
" Doe | \n",
" Treatment B | \n",
" 11.0 | \n",
"
\n",
" \n",
" 5 | \n",
" Mary | \n",
" Johnson | \n",
" Treatment B | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" First Last treatment result\n",
"0 John Smith Treatment A NaN\n",
"1 Jane Doe Treatment A 16.0\n",
"2 Mary Johnson Treatment A 3.0\n",
"3 John Smith Treatment B 2.0\n",
"4 Jane Doe Treatment B 11.0\n",
"5 Mary Johnson Treatment B 1.0"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tidy = pd.melt(messy, id_vars=['First','Last'], var_name='treatment', value_name='result')\n",
"tidy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A simple melt example"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
" row | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 4 | \n",
" 7 | \n",
" A | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 5 | \n",
" 8 | \n",
" B | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 6 | \n",
" 9 | \n",
" C | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" a b c row\n",
"0 1 4 7 A\n",
"1 2 5 8 B\n",
"2 3 6 9 C"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"messy = pd.DataFrame({'row' : ['A', 'B', 'C'], \n",
" 'a' : [1, 2, 3],\n",
" 'b' : [4, 5, 6],\n",
" 'c' : [7, 8, 9]})\n",
"messy"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" row | \n",
" variable | \n",
" value | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" A | \n",
" a | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" B | \n",
" a | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" C | \n",
" a | \n",
" 3 | \n",
"
\n",
" \n",
" 3 | \n",
" A | \n",
" b | \n",
" 4 | \n",
"
\n",
" \n",
" 4 | \n",
" B | \n",
" b | \n",
" 5 | \n",
"
\n",
" \n",
" 5 | \n",
" C | \n",
" b | \n",
" 6 | \n",
"
\n",
" \n",
" 6 | \n",
" A | \n",
" c | \n",
" 7 | \n",
"
\n",
" \n",
" 7 | \n",
" B | \n",
" c | \n",
" 8 | \n",
"
\n",
" \n",
" 8 | \n",
" C | \n",
" c | \n",
" 9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" row variable value\n",
"0 A a 1\n",
"1 B a 2\n",
"2 C a 3\n",
"3 A b 4\n",
"4 B b 5\n",
"5 C b 6\n",
"6 A c 7\n",
"7 B c 8\n",
"8 C c 9"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.melt(messy, id_vars='row')"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" row | \n",
" dimension | \n",
" length | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" A | \n",
" a | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" B | \n",
" a | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" C | \n",
" a | \n",
" 3 | \n",
"
\n",
" \n",
" 3 | \n",
" A | \n",
" b | \n",
" 4 | \n",
"
\n",
" \n",
" 4 | \n",
" B | \n",
" b | \n",
" 5 | \n",
"
\n",
" \n",
" 5 | \n",
" C | \n",
" b | \n",
" 6 | \n",
"
\n",
" \n",
" 6 | \n",
" A | \n",
" c | \n",
" 7 | \n",
"
\n",
" \n",
" 7 | \n",
" B | \n",
" c | \n",
" 8 | \n",
"
\n",
" \n",
" 8 | \n",
" C | \n",
" c | \n",
" 9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" row dimension length\n",
"0 A a 1\n",
"1 B a 2\n",
"2 C a 3\n",
"3 A b 4\n",
"4 B b 5\n",
"5 C b 6\n",
"6 A c 7\n",
"7 B c 8\n",
"8 C c 9"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tidy = pd.melt(messy, id_vars='row', var_name='dimension', value_name='length')\n",
"tidy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pivot is almost the inverse of melt"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" dimension | \n",
" a | \n",
" b | \n",
" c | \n",
"
\n",
" \n",
" row | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" A | \n",
" 1 | \n",
" 4 | \n",
" 7 | \n",
"
\n",
" \n",
" B | \n",
" 2 | \n",
" 5 | \n",
" 8 | \n",
"
\n",
" \n",
" C | \n",
" 3 | \n",
" 6 | \n",
" 9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"dimension a b c\n",
"row \n",
"A 1 4 7\n",
"B 2 5 8\n",
"C 3 6 9"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"messy1 = tidy.pivot(index='row',columns='dimension',values='length')\n",
"messy1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is almost the same as the orginal dataframe, except that row is used as index. We can move it back to a row easily."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" dimension | \n",
" row | \n",
" a | \n",
" b | \n",
" c | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" A | \n",
" 1 | \n",
" 4 | \n",
" 7 | \n",
"
\n",
" \n",
" 1 | \n",
" B | \n",
" 2 | \n",
" 5 | \n",
" 8 | \n",
"
\n",
" \n",
" 2 | \n",
" C | \n",
" 3 | \n",
" 6 | \n",
" 9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"dimension row a b c\n",
"0 A 1 4 7\n",
"1 B 2 5 8\n",
"2 C 3 6 9"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"messy1.reset_index(inplace=True)\n",
"messy1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Last step is to remove the name for the set of columns."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" row | \n",
" a | \n",
" b | \n",
" c | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" A | \n",
" 1 | \n",
" 4 | \n",
" 7 | \n",
"
\n",
" \n",
" 1 | \n",
" B | \n",
" 2 | \n",
" 5 | \n",
" 8 | \n",
"
\n",
" \n",
" 2 | \n",
" C | \n",
" 3 | \n",
" 6 | \n",
" 9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" row a b c\n",
"0 A 1 4 7\n",
"1 B 2 5 8\n",
"2 C 3 6 9"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"messy1.columns.name = ''\n",
"messy1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is the same as the original dataframe, up to column reordering."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Column headers are values, not variable names\n",
"\n",
"This is the first issue with messy data in Hadley's paper. Let's first create the dataframe used as an example. For practical reasons, it was simpler to first construct the transpose of it, then process it to get the data set used in the article."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" religion | \n",
" <$10k | \n",
" $10-20k | \n",
" $20-30k | \n",
" $30-40k | \n",
" $40-50k | \n",
" $50-75k | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Agnostic | \n",
" 27 | \n",
" 34 | \n",
" 60 | \n",
" 81 | \n",
" 76 | \n",
" 137 | \n",
"
\n",
" \n",
" 1 | \n",
" Atheist | \n",
" 12 | \n",
" 27 | \n",
" 37 | \n",
" 52 | \n",
" 35 | \n",
" 70 | \n",
"
\n",
" \n",
" 2 | \n",
" Buddhist | \n",
" 27 | \n",
" 21 | \n",
" 30 | \n",
" 34 | \n",
" 33 | \n",
" 58 | \n",
"
\n",
" \n",
" 3 | \n",
" Catholic | \n",
" 418 | \n",
" 617 | \n",
" 732 | \n",
" 670 | \n",
" 638 | \n",
" 1116 | \n",
"
\n",
" \n",
" 4 | \n",
" Don't know/refused | \n",
" 15 | \n",
" 14 | \n",
" 15 | \n",
" 11 | \n",
" 10 | \n",
" 35 | \n",
"
\n",
" \n",
" 5 | \n",
" Evangelical Prot | \n",
" 575 | \n",
" 869 | \n",
" 1064 | \n",
" 982 | \n",
" 881 | \n",
" 1486 | \n",
"
\n",
" \n",
" 6 | \n",
" Hindu | \n",
" 1 | \n",
" 9 | \n",
" 7 | \n",
" 9 | \n",
" 11 | \n",
" 34 | \n",
"
\n",
" \n",
" 7 | \n",
" Historically Black Prot | \n",
" 228 | \n",
" 244 | \n",
" 236 | \n",
" 238 | \n",
" 197 | \n",
" 223 | \n",
"
\n",
" \n",
" 8 | \n",
" Jehovah's Witness | \n",
" 20 | \n",
" 27 | \n",
" 24 | \n",
" 24 | \n",
" 21 | \n",
" 30 | \n",
"
\n",
" \n",
" 9 | \n",
" Jewish | \n",
" 19 | \n",
" 19 | \n",
" 25 | \n",
" 25 | \n",
" 30 | \n",
" 95 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k\n",
"0 Agnostic 27 34 60 81 76 137\n",
"1 Atheist 12 27 37 52 35 70\n",
"2 Buddhist 27 21 30 34 33 58\n",
"3 Catholic 418 617 732 670 638 1116\n",
"4 Don't know/refused 15 14 15 11 10 35\n",
"5 Evangelical Prot 575 869 1064 982 881 1486\n",
"6 Hindu 1 9 7 9 11 34\n",
"7 Historically Black Prot 228 244 236 238 197 223\n",
"8 Jehovah's Witness 20 27 24 24 21 30\n",
"9 Jewish 19 19 25 25 30 95"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"messy = pd.DataFrame({'Agnostic' : [27, 34, 60, 81, 76, 137],\n",
" 'Atheist' : [12, 27, 37, 52, 35, 70],\n",
" 'Buddhist' : [27, 21, 30, 34, 33, 58],\n",
" 'Catholic' : [418, 617, 732, 670, 638, 1116],\n",
" \"Don't know/refused\" : [15, 14, 15, 11, 10, 35],\n",
" 'Evangelical Prot' : [575, 869, 1064, 982, 881, 1486],\n",
" 'Hindu' : [1, 9, 7, 9, 11, 34],\n",
" 'Historically Black Prot' : [228, 244, 236, 238, 197, 223],\n",
" \"Jehovah's Witness\" : [20, 27, 24, 24, 21, 30],\n",
" 'Jewish' : [19, 19, 25, 25, 30, 95],\n",
" })\n",
" \n",
"def transpose(df, columns):\n",
" df = df.T.copy()\n",
" df.reset_index(inplace=True)\n",
" df.columns = columns\n",
" return df\n",
"\n",
"messy = transpose(messy, ['religion', '<$10k', '$10-20k', '$20-30k', '$30-40k', '$40-50k', '$50-75k'])\n",
"\n",
"messy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, the `melt()` function is our friend. We sort the result by religion to make it easier to read."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" religion | \n",
" income | \n",
" freq | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Agnostic | \n",
" <$10k | \n",
" 27 | \n",
"
\n",
" \n",
" 30 | \n",
" Agnostic | \n",
" $30-40k | \n",
" 81 | \n",
"
\n",
" \n",
" 40 | \n",
" Agnostic | \n",
" $40-50k | \n",
" 76 | \n",
"
\n",
" \n",
" 50 | \n",
" Agnostic | \n",
" $50-75k | \n",
" 137 | \n",
"
\n",
" \n",
" 10 | \n",
" Agnostic | \n",
" $10-20k | \n",
" 34 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" religion income freq\n",
"0 Agnostic <$10k 27\n",
"30 Agnostic $30-40k 81\n",
"40 Agnostic $40-50k 76\n",
"50 Agnostic $50-75k 137\n",
"10 Agnostic $10-20k 34"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tidy = pd.melt(messy, id_vars = ['religion'], var_name='income', value_name='freq')\n",
"tidy.sort_values(by=['religion'], inplace=True)\n",
"tidy.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Variables are stored in both rows and columns\n",
"\n",
"This example is a little trickier. We first read the input data as a data frame. This data is available at https://github.com/hadley/tidy-data/blob/master/data/tb.csv\n",
"\n",
"I've cloned it therefore it is in my local `data` directory.\n",
"\n",
"Reading it is easy. We remove the `new_sp_` prefix appearing in most columns, and we rename a couple of columns as well."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" year | \n",
" total | \n",
" m04 | \n",
" m514 | \n",
" m014 | \n",
" m1524 | \n",
" m2534 | \n",
" m3544 | \n",
" m4554 | \n",
" ... | \n",
" f04 | \n",
" f514 | \n",
" f014 | \n",
" f1524 | \n",
" f2534 | \n",
" f3544 | \n",
" f4554 | \n",
" f5564 | \n",
" f65 | \n",
" fu | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" AD | \n",
" 1989 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" AD | \n",
" 1990 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" AD | \n",
" 1991 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" AD | \n",
" 1992 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" AD | \n",
" 1993 | \n",
" 15.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 23 columns
\n",
"
"
],
"text/plain": [
" country year total m04 m514 m014 m1524 m2534 m3544 m4554 ... f04 \\\n",
"0 AD 1989 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN \n",
"1 AD 1990 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN \n",
"2 AD 1991 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN \n",
"3 AD 1992 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN \n",
"4 AD 1993 15.0 NaN NaN NaN NaN NaN NaN NaN ... NaN \n",
"\n",
" f514 f014 f1524 f2534 f3544 f4554 f5564 f65 fu \n",
"0 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"1 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"2 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"3 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"4 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"\n",
"[5 rows x 23 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url = \"https://raw.githubusercontent.com/hadley/tidy-data/master/data/tb.csv\"\n",
"tb = pd.read_csv(url)\n",
"tb.columns = tb.columns.str.replace('new_sp_','')\n",
"tb.rename(columns = {'new_sp' : 'total', 'iso2' : 'country'}, \n",
" inplace=True)\n",
"tb.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's use year 2000, and drop few columns, to stay in sync with Wickham's article."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" year | \n",
" m014 | \n",
" m1524 | \n",
" m2534 | \n",
" m3544 | \n",
" m4554 | \n",
" m5564 | \n",
" m65 | \n",
" mu | \n",
" f014 | \n",
" f1524 | \n",
" f2534 | \n",
" f3544 | \n",
" f4554 | \n",
" f5564 | \n",
" f65 | \n",
" fu | \n",
"
\n",
" \n",
" \n",
" \n",
" 10 | \n",
" AD | \n",
" 2000 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 36 | \n",
" AE | \n",
" 2000 | \n",
" 2.0 | \n",
" 4.0 | \n",
" 4.0 | \n",
" 6.0 | \n",
" 5.0 | \n",
" 12.0 | \n",
" 10.0 | \n",
" NaN | \n",
" 3.0 | \n",
" 16.0 | \n",
" 1.0 | \n",
" 3.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 4.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 60 | \n",
" AF | \n",
" 2000 | \n",
" 52.0 | \n",
" 228.0 | \n",
" 183.0 | \n",
" 149.0 | \n",
" 129.0 | \n",
" 94.0 | \n",
" 80.0 | \n",
" NaN | \n",
" 93.0 | \n",
" 414.0 | \n",
" 565.0 | \n",
" 339.0 | \n",
" 205.0 | \n",
" 99.0 | \n",
" 36.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 87 | \n",
" AG | \n",
" 2000 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" NaN | \n",
" 1.0 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 136 | \n",
" AL | \n",
" 2000 | \n",
" 2.0 | \n",
" 19.0 | \n",
" 21.0 | \n",
" 14.0 | \n",
" 24.0 | \n",
" 19.0 | \n",
" 16.0 | \n",
" NaN | \n",
" 3.0 | \n",
" 11.0 | \n",
" 10.0 | \n",
" 8.0 | \n",
" 8.0 | \n",
" 5.0 | \n",
" 11.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 165 | \n",
" AM | \n",
" 2000 | \n",
" 2.0 | \n",
" 152.0 | \n",
" 130.0 | \n",
" 131.0 | \n",
" 63.0 | \n",
" 26.0 | \n",
" 21.0 | \n",
" NaN | \n",
" 1.0 | \n",
" 24.0 | \n",
" 27.0 | \n",
" 24.0 | \n",
" 8.0 | \n",
" 8.0 | \n",
" 4.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 178 | \n",
" AN | \n",
" 2000 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 2.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" NaN | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 207 | \n",
" AO | \n",
" 2000 | \n",
" 186.0 | \n",
" 999.0 | \n",
" 1003.0 | \n",
" 912.0 | \n",
" 482.0 | \n",
" 312.0 | \n",
" 194.0 | \n",
" NaN | \n",
" 247.0 | \n",
" 1142.0 | \n",
" 1091.0 | \n",
" 844.0 | \n",
" 417.0 | \n",
" 200.0 | \n",
" 120.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 236 | \n",
" AR | \n",
" 2000 | \n",
" 97.0 | \n",
" 278.0 | \n",
" 594.0 | \n",
" 402.0 | \n",
" 419.0 | \n",
" 368.0 | \n",
" 330.0 | \n",
" NaN | \n",
" 121.0 | \n",
" 544.0 | \n",
" 479.0 | \n",
" 262.0 | \n",
" 230.0 | \n",
" 179.0 | \n",
" 216.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 265 | \n",
" AS | \n",
" 2000 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 1.0 | \n",
" 1.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 1.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country year m014 m1524 m2534 m3544 m4554 m5564 m65 mu \\\n",
"10 AD 2000 0.0 0.0 1.0 0.0 0.0 0.0 0.0 NaN \n",
"36 AE 2000 2.0 4.0 4.0 6.0 5.0 12.0 10.0 NaN \n",
"60 AF 2000 52.0 228.0 183.0 149.0 129.0 94.0 80.0 NaN \n",
"87 AG 2000 0.0 0.0 0.0 0.0 0.0 0.0 1.0 NaN \n",
"136 AL 2000 2.0 19.0 21.0 14.0 24.0 19.0 16.0 NaN \n",
"165 AM 2000 2.0 152.0 130.0 131.0 63.0 26.0 21.0 NaN \n",
"178 AN 2000 0.0 0.0 1.0 2.0 0.0 0.0 0.0 NaN \n",
"207 AO 2000 186.0 999.0 1003.0 912.0 482.0 312.0 194.0 NaN \n",
"236 AR 2000 97.0 278.0 594.0 402.0 419.0 368.0 330.0 NaN \n",
"265 AS 2000 NaN NaN NaN NaN 1.0 1.0 NaN NaN \n",
"\n",
" f014 f1524 f2534 f3544 f4554 f5564 f65 fu \n",
"10 NaN NaN NaN NaN NaN NaN NaN NaN \n",
"36 3.0 16.0 1.0 3.0 0.0 0.0 4.0 NaN \n",
"60 93.0 414.0 565.0 339.0 205.0 99.0 36.0 NaN \n",
"87 1.0 1.0 1.0 0.0 0.0 0.0 0.0 NaN \n",
"136 3.0 11.0 10.0 8.0 8.0 5.0 11.0 NaN \n",
"165 1.0 24.0 27.0 24.0 8.0 8.0 4.0 NaN \n",
"178 0.0 0.0 1.0 0.0 0.0 1.0 0.0 NaN \n",
"207 247.0 1142.0 1091.0 844.0 417.0 200.0 120.0 NaN \n",
"236 121.0 544.0 479.0 262.0 230.0 179.0 216.0 NaN \n",
"265 NaN NaN NaN NaN 1.0 NaN NaN NaN "
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"messy = tb[tb['year'] == 2000].copy()\n",
"messy.drop(['total','m04','m514','f04','f514'], axis=1, inplace=True)\n",
"messy.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" year | \n",
" m014 | \n",
" m1524 | \n",
" m2534 | \n",
" m3544 | \n",
" m4554 | \n",
" m5564 | \n",
" m65 | \n",
" mu | \n",
" f014 | \n",
"
\n",
" \n",
" \n",
" \n",
" 10 | \n",
" AD | \n",
" 2000 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 36 | \n",
" AE | \n",
" 2000 | \n",
" 2.0 | \n",
" 4.0 | \n",
" 4.0 | \n",
" 6.0 | \n",
" 5.0 | \n",
" 12.0 | \n",
" 10.0 | \n",
" NaN | \n",
" 3.0 | \n",
"
\n",
" \n",
" 60 | \n",
" AF | \n",
" 2000 | \n",
" 52.0 | \n",
" 228.0 | \n",
" 183.0 | \n",
" 149.0 | \n",
" 129.0 | \n",
" 94.0 | \n",
" 80.0 | \n",
" NaN | \n",
" 93.0 | \n",
"
\n",
" \n",
" 87 | \n",
" AG | \n",
" 2000 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" NaN | \n",
" 1.0 | \n",
"
\n",
" \n",
" 136 | \n",
" AL | \n",
" 2000 | \n",
" 2.0 | \n",
" 19.0 | \n",
" 21.0 | \n",
" 14.0 | \n",
" 24.0 | \n",
" 19.0 | \n",
" 16.0 | \n",
" NaN | \n",
" 3.0 | \n",
"
\n",
" \n",
" 165 | \n",
" AM | \n",
" 2000 | \n",
" 2.0 | \n",
" 152.0 | \n",
" 130.0 | \n",
" 131.0 | \n",
" 63.0 | \n",
" 26.0 | \n",
" 21.0 | \n",
" NaN | \n",
" 1.0 | \n",
"
\n",
" \n",
" 178 | \n",
" AN | \n",
" 2000 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 2.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" NaN | \n",
" 0.0 | \n",
"
\n",
" \n",
" 207 | \n",
" AO | \n",
" 2000 | \n",
" 186.0 | \n",
" 999.0 | \n",
" 1003.0 | \n",
" 912.0 | \n",
" 482.0 | \n",
" 312.0 | \n",
" 194.0 | \n",
" NaN | \n",
" 247.0 | \n",
"
\n",
" \n",
" 236 | \n",
" AR | \n",
" 2000 | \n",
" 97.0 | \n",
" 278.0 | \n",
" 594.0 | \n",
" 402.0 | \n",
" 419.0 | \n",
" 368.0 | \n",
" 330.0 | \n",
" NaN | \n",
" 121.0 | \n",
"
\n",
" \n",
" 265 | \n",
" AS | \n",
" 2000 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 1.0 | \n",
" 1.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country year m014 m1524 m2534 m3544 m4554 m5564 m65 mu f014\n",
"10 AD 2000 0.0 0.0 1.0 0.0 0.0 0.0 0.0 NaN NaN\n",
"36 AE 2000 2.0 4.0 4.0 6.0 5.0 12.0 10.0 NaN 3.0\n",
"60 AF 2000 52.0 228.0 183.0 149.0 129.0 94.0 80.0 NaN 93.0\n",
"87 AG 2000 0.0 0.0 0.0 0.0 0.0 0.0 1.0 NaN 1.0\n",
"136 AL 2000 2.0 19.0 21.0 14.0 24.0 19.0 16.0 NaN 3.0\n",
"165 AM 2000 2.0 152.0 130.0 131.0 63.0 26.0 21.0 NaN 1.0\n",
"178 AN 2000 0.0 0.0 1.0 2.0 0.0 0.0 0.0 NaN 0.0\n",
"207 AO 2000 186.0 999.0 1003.0 912.0 482.0 312.0 194.0 NaN 247.0\n",
"236 AR 2000 97.0 278.0 594.0 402.0 419.0 368.0 330.0 NaN 121.0\n",
"265 AS 2000 NaN NaN NaN NaN 1.0 1.0 NaN NaN NaN"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"messy.iloc[:,:11].head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `melt()` function is useful, but is not enough. Let's use it still."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" year | \n",
" variable | \n",
" cases | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" AD | \n",
" 2000 | \n",
" m014 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 201 | \n",
" AD | \n",
" 2000 | \n",
" m1524 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 402 | \n",
" AD | \n",
" 2000 | \n",
" m2534 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 603 | \n",
" AD | \n",
" 2000 | \n",
" m3544 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 804 | \n",
" AD | \n",
" 2000 | \n",
" m4554 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 1005 | \n",
" AD | \n",
" 2000 | \n",
" m5564 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 1206 | \n",
" AD | \n",
" 2000 | \n",
" m65 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 1407 | \n",
" AD | \n",
" 2000 | \n",
" mu | \n",
" NaN | \n",
"
\n",
" \n",
" 1608 | \n",
" AD | \n",
" 2000 | \n",
" f014 | \n",
" NaN | \n",
"
\n",
" \n",
" 1809 | \n",
" AD | \n",
" 2000 | \n",
" f1524 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country year variable cases\n",
"0 AD 2000 m014 0.0\n",
"201 AD 2000 m1524 0.0\n",
"402 AD 2000 m2534 1.0\n",
"603 AD 2000 m3544 0.0\n",
"804 AD 2000 m4554 0.0\n",
"1005 AD 2000 m5564 0.0\n",
"1206 AD 2000 m65 0.0\n",
"1407 AD 2000 mu NaN\n",
"1608 AD 2000 f014 NaN\n",
"1809 AD 2000 f1524 NaN"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"molten = pd.melt(messy, id_vars=['country', 'year'], value_name='cases')\n",
"molten.sort_values(by=['year', 'country'], inplace=True)\n",
"molten.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What isn't really nice is the encoding of sex and age ranges as a string in the `variable` column. Let's process the dataset to create two additional columns, one for the sex, and one for the age range. We then remove the `variable` column. The tidy form also makes it easy to remove the values where the age is `u`."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" year | \n",
" sex | \n",
" age | \n",
" cases | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" AD | \n",
" 2000 | \n",
" m | \n",
" 0-14 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 201 | \n",
" AD | \n",
" 2000 | \n",
" m | \n",
" 15-24 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 402 | \n",
" AD | \n",
" 2000 | \n",
" m | \n",
" 25-34 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 603 | \n",
" AD | \n",
" 2000 | \n",
" m | \n",
" 35-44 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 804 | \n",
" AD | \n",
" 2000 | \n",
" m | \n",
" 45-54 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 1005 | \n",
" AD | \n",
" 2000 | \n",
" m | \n",
" 55-64 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 1206 | \n",
" AD | \n",
" 2000 | \n",
" m | \n",
" 65+ | \n",
" 0.0 | \n",
"
\n",
" \n",
" 1608 | \n",
" AD | \n",
" 2000 | \n",
" f | \n",
" 0-14 | \n",
" NaN | \n",
"
\n",
" \n",
" 1809 | \n",
" AD | \n",
" 2000 | \n",
" f | \n",
" 15-24 | \n",
" NaN | \n",
"
\n",
" \n",
" 2010 | \n",
" AD | \n",
" 2000 | \n",
" f | \n",
" 25-34 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country year sex age cases\n",
"0 AD 2000 m 0-14 0.0\n",
"201 AD 2000 m 15-24 0.0\n",
"402 AD 2000 m 25-34 1.0\n",
"603 AD 2000 m 35-44 0.0\n",
"804 AD 2000 m 45-54 0.0\n",
"1005 AD 2000 m 55-64 0.0\n",
"1206 AD 2000 m 65+ 0.0\n",
"1608 AD 2000 f 0-14 NaN\n",
"1809 AD 2000 f 15-24 NaN\n",
"2010 AD 2000 f 25-34 NaN"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def parse_age(s):\n",
" s = s[1:]\n",
" if s == '65':\n",
" return '65+'\n",
" else:\n",
" return s[:-2]+'-'+s[-2:]\n",
"\n",
"tidy = molten[molten['variable'] != 'mu'].copy()\n",
"tidy['sex'] = tidy['variable'].apply(lambda s: s[:1])\n",
"tidy['age'] = tidy['variable'].apply(parse_age)\n",
"tidy = tidy[['country', 'year', 'sex', 'age', 'cases']]\n",
"tidy.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Variables are stored in both rows and columns\n",
"\n",
"This example is really tricky. Let's first create the dataframe. This time, I create it using an array instead of a dictionary, just for the fun of doing something a bit different."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" year | \n",
" month | \n",
" element | \n",
" d1 | \n",
" d2 | \n",
" d3 | \n",
" d4 | \n",
" d5 | \n",
" d6 | \n",
" d7 | \n",
" d8 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" MX17004 | \n",
" 2010 | \n",
" 1 | \n",
" tmax | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" MX17004 | \n",
" 2010 | \n",
" 1 | \n",
" tmin | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" MX17004 | \n",
" 2010 | \n",
" 2 | \n",
" tmax | \n",
" NaN | \n",
" 27.3 | \n",
" 24.1 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" MX17004 | \n",
" 2010 | \n",
" 2 | \n",
" tmin | \n",
" NaN | \n",
" 14.4 | \n",
" 14.4 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" MX17004 | \n",
" 2010 | \n",
" 3 | \n",
" tmax | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 32.1 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 5 | \n",
" MX17004 | \n",
" 2010 | \n",
" 3 | \n",
" tmin | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 14.2 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 6 | \n",
" MX17004 | \n",
" 2010 | \n",
" 4 | \n",
" tmax | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 7 | \n",
" MX17004 | \n",
" 2010 | \n",
" 4 | \n",
" tmin | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 8 | \n",
" MX17004 | \n",
" 2010 | \n",
" 5 | \n",
" tmax | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 9 | \n",
" MX17004 | \n",
" 2010 | \n",
" 5 | \n",
" tmin | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id year month element d1 d2 d3 d4 d5 d6 d7 d8\n",
"0 MX17004 2010 1 tmax NaN NaN NaN NaN NaN NaN NaN NaN\n",
"1 MX17004 2010 1 tmin NaN NaN NaN NaN NaN NaN NaN NaN\n",
"2 MX17004 2010 2 tmax NaN 27.3 24.1 NaN NaN NaN NaN NaN\n",
"3 MX17004 2010 2 tmin NaN 14.4 14.4 NaN NaN NaN NaN NaN\n",
"4 MX17004 2010 3 tmax NaN NaN NaN NaN 32.1 NaN NaN NaN\n",
"5 MX17004 2010 3 tmin NaN NaN NaN NaN 14.2 NaN NaN NaN\n",
"6 MX17004 2010 4 tmax NaN NaN NaN NaN NaN NaN NaN NaN\n",
"7 MX17004 2010 4 tmin NaN NaN NaN NaN NaN NaN NaN NaN\n",
"8 MX17004 2010 5 tmax NaN NaN NaN NaN NaN NaN NaN NaN\n",
"9 MX17004 2010 5 tmin NaN NaN NaN NaN NaN NaN NaN NaN"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"columns = ['id', 'year', 'month', 'element', 'd1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8']\n",
"data = [['MX17004', 2010, 1, 'tmax', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, ],\n",
" ['MX17004', 2010, 1, 'tmin', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, ],\n",
" ['MX17004', 2010, 2, 'tmax', np.nan, 27.3, 24.1, np.nan, np.nan, np.nan, np.nan, np.nan, ],\n",
" ['MX17004', 2010, 2, 'tmin', np.nan, 14.4, 14.4, np.nan, np.nan, np.nan, np.nan, np.nan, ],\n",
" ['MX17004', 2010, 3, 'tmax', np.nan, np.nan, np.nan, np.nan, 32.1, np.nan, np.nan, np.nan, ],\n",
" ['MX17004', 2010, 3, 'tmin', np.nan, np.nan, np.nan, np.nan, 14.2, np.nan, np.nan, np.nan, ],\n",
" ['MX17004', 2010, 4, 'tmax', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, ],\n",
" ['MX17004', 2010, 4, 'tmin', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, ],\n",
" ['MX17004', 2010, 5, 'tmax', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, ],\n",
" ['MX17004', 2010, 5, 'tmin', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,]\n",
" ]\n",
"messy = pd.DataFrame(data=data, columns=columns); messy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most of the values are not relevant. However, filtering out the NaN values is impossible here: dropping a row would also discard the few real measurements it contains. We need to melt the dataframe first."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" year | \n",
" month | \n",
" element | \n",
" day | \n",
" value | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" MX17004 | \n",
" 2010 | \n",
" 2 | \n",
" tmax | \n",
" d2 | \n",
" 27.3 | \n",
"
\n",
" \n",
" 1 | \n",
" MX17004 | \n",
" 2010 | \n",
" 2 | \n",
" tmin | \n",
" d2 | \n",
" 14.4 | \n",
"
\n",
" \n",
" 2 | \n",
" MX17004 | \n",
" 2010 | \n",
" 2 | \n",
" tmax | \n",
" d3 | \n",
" 24.1 | \n",
"
\n",
" \n",
" 3 | \n",
" MX17004 | \n",
" 2010 | \n",
" 2 | \n",
" tmin | \n",
" d3 | \n",
" 14.4 | \n",
"
\n",
" \n",
" 4 | \n",
" MX17004 | \n",
" 2010 | \n",
" 3 | \n",
" tmax | \n",
" d5 | \n",
" 32.1 | \n",
"
\n",
" \n",
" 5 | \n",
" MX17004 | \n",
" 2010 | \n",
" 3 | \n",
" tmin | \n",
" d5 | \n",
" 14.2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id year month element day value\n",
"0 MX17004 2010 2 tmax d2 27.3\n",
"1 MX17004 2010 2 tmin d2 14.4\n",
"2 MX17004 2010 2 tmax d3 24.1\n",
"3 MX17004 2010 2 tmin d3 14.4\n",
"4 MX17004 2010 3 tmax d5 32.1\n",
"5 MX17004 2010 3 tmin d5 14.2"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"molten = pd.melt(messy, \n",
" id_vars=['id', 'year','month','element',],\n",
" var_name='day');\n",
"molten.dropna(inplace=True)\n",
"molten = molten.reset_index(drop=True)\n",
"molten"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataframe is not in tidy form yet. First, the column `element` contains variable names. Second, the columns `year, month, day` represent one variable: the date. Let's fix the latter problem first."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" element | \n",
" value | \n",
" date | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" MX17004 | \n",
" tmax | \n",
" 27.3 | \n",
" 2010-02-02 | \n",
"
\n",
" \n",
" 1 | \n",
" MX17004 | \n",
" tmin | \n",
" 14.4 | \n",
" 2010-02-02 | \n",
"
\n",
" \n",
" 2 | \n",
" MX17004 | \n",
" tmax | \n",
" 24.1 | \n",
" 2010-02-03 | \n",
"
\n",
" \n",
" 3 | \n",
" MX17004 | \n",
" tmin | \n",
" 14.4 | \n",
" 2010-02-03 | \n",
"
\n",
" \n",
" 4 | \n",
" MX17004 | \n",
" tmax | \n",
" 32.1 | \n",
" 2010-03-05 | \n",
"
\n",
" \n",
" 5 | \n",
" MX17004 | \n",
" tmin | \n",
" 14.2 | \n",
" 2010-03-05 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id element value date\n",
"0 MX17004 tmax 27.3 2010-02-02\n",
"1 MX17004 tmin 14.4 2010-02-02\n",
"2 MX17004 tmax 24.1 2010-02-03\n",
"3 MX17004 tmin 14.4 2010-02-03\n",
"4 MX17004 tmax 32.1 2010-03-05\n",
"5 MX17004 tmin 14.2 2010-03-05"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def f(row):\n",
"    # build an ISO date string from year, month, and the day number in the 'd...' label\n",
"    return \"%d-%02d-%02d\" % (row['year'], row['month'], int(row['day'][1:]))\n",
" \n",
"molten['date'] = molten.apply(f,axis=1)\n",
"molten = molten[['id', 'element','value','date']]\n",
"molten"
]
},
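{
"cell_type": "markdown",
"metadata": {},
"source": [
"The row-wise `apply` above is easy to read but slow on large frames. A vectorized alternative, sketched here against the pre-projection `molten` (which still has `year`, `month` and `day` columns), parses the day number and lets `pd.to_datetime` assemble real dates from the parts:\n",
"\n",
"```python\n",
"day = molten['day'].str[1:].astype(int)\n",
"molten['date'] = pd.to_datetime(dict(year=molten['year'], month=molten['month'], day=day))\n",
"```"
]
},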
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we need to pivot the `element` column."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" element | \n",
" tmax | \n",
" tmin | \n",
"
\n",
" \n",
" date | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2010-02-02 | \n",
" 27.3 | \n",
" 14.4 | \n",
"
\n",
" \n",
" 2010-02-03 | \n",
" 24.1 | \n",
" 14.4 | \n",
"
\n",
" \n",
" 2010-03-05 | \n",
" 32.1 | \n",
" 14.2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"element tmax tmin\n",
"date \n",
"2010-02-02 27.3 14.4\n",
"2010-02-03 24.1 14.4\n",
"2010-03-05 32.1 14.2"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tidy = molten.pivot(index='date',columns='element',values='value')\n",
"tidy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wait a minute. \n",
"\n",
"Where is the id?\n",
"\n",
"One way to keep it is to move `id` into the index with `groupby()`, then apply `pivot()` inside each group."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" element | \n",
" tmax | \n",
" tmin | \n",
"
\n",
" \n",
" id | \n",
" date | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" MX17004 | \n",
" 2010-02-02 | \n",
" 27.3 | \n",
" 14.4 | \n",
"
\n",
" \n",
" 2010-02-03 | \n",
" 24.1 | \n",
" 14.4 | \n",
"
\n",
" \n",
" 2010-03-05 | \n",
" 32.1 | \n",
" 14.2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"element tmax tmin\n",
"id date \n",
"MX17004 2010-02-02 27.3 14.4\n",
" 2010-02-03 24.1 14.4\n",
" 2010-03-05 32.1 14.2"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tidy = molten.groupby('id').apply(pd.DataFrame.pivot,\n",
" index='date',\n",
" columns='element',\n",
" values='value')\n",
"tidy"
]
},
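{
"cell_type": "markdown",
"metadata": {},
"source": [
"An alternative that keeps `id` without the `groupby`/`apply` detour is `pivot_table`, which accepts a multi-column index directly. One caveat: `pivot_table` silently aggregates duplicate entries (with the mean by default), whereas `pivot` raises an error on them, so this sketch assumes each `(id, date, element)` triple is unique:\n",
"\n",
"```python\n",
"tidy = molten.pivot_table(index=['id', 'date'], columns='element', values='value')\n",
"```"
]
},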
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are almost there. We simply have to move `id` back to a column with `reset_index()`."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" element | \n",
" id | \n",
" date | \n",
" tmax | \n",
" tmin | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" MX17004 | \n",
" 2010-02-02 | \n",
" 27.3 | \n",
" 14.4 | \n",
"
\n",
" \n",
" 1 | \n",
" MX17004 | \n",
" 2010-02-03 | \n",
" 24.1 | \n",
" 14.4 | \n",
"
\n",
" \n",
" 2 | \n",
" MX17004 | \n",
" 2010-03-05 | \n",
" 32.1 | \n",
" 14.2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"element id date tmax tmin\n",
"0 MX17004 2010-02-02 27.3 14.4\n",
"1 MX17004 2010-02-03 24.1 14.4\n",
"2 MX17004 2010-03-05 32.1 14.2"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tidy.reset_index(inplace=True)\n",
"tidy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we get rid of the leftover `element` name on the columns."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" date | \n",
" tmax | \n",
" tmin | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" MX17004 | \n",
" 2010-02-02 | \n",
" 27.3 | \n",
" 14.4 | \n",
"
\n",
" \n",
" 1 | \n",
" MX17004 | \n",
" 2010-02-03 | \n",
" 24.1 | \n",
" 14.4 | \n",
"
\n",
" \n",
" 2 | \n",
" MX17004 | \n",
" 2010-03-05 | \n",
" 32.1 | \n",
" 14.2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id date tmax tmin\n",
"0 MX17004 2010-02-02 27.3 14.4\n",
"1 MX17004 2010-02-03 24.1 14.4\n",
"2 MX17004 2010-03-05 32.1 14.2"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tidy.columns.name = ''\n",
"tidy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Et Voilà!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multiple types in one table\n",
"\n",
"This example is used to illustrate two of the above problems. \n",
"\n",
"Let's create it. It is an excerpt from the Billboard top hits for 2000."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" artist | \n",
" track | \n",
" time | \n",
" date entered | \n",
" wk1 | \n",
" wk2 | \n",
" wk3 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2000 | \n",
" 2,Pac | \n",
" Baby Don't Cry | \n",
" 4:22 | \n",
" 2000-02-26 | \n",
" 87 | \n",
" 82 | \n",
" 72 | \n",
"
\n",
" \n",
" 1 | \n",
" 2000 | \n",
" 2Ge+her | \n",
" The Hardest Part Of ... | \n",
" 3:15 | \n",
" 2000-09-02 | \n",
" 91 | \n",
" 87 | \n",
" 92 | \n",
"
\n",
" \n",
" 2 | \n",
" 2000 | \n",
" 3 Doors Down | \n",
" Kryptonite | \n",
" 3:53 | \n",
" 2000-04-08 | \n",
" 81 | \n",
" 70 | \n",
" 68 | \n",
"
\n",
" \n",
" 3 | \n",
" 2000 | \n",
" 98^0 | \n",
" Give Me Just One Nig... | \n",
" 3:24 | \n",
" 2000-08-19 | \n",
" 51 | \n",
" 39 | \n",
" 34 | \n",
"
\n",
" \n",
" 4 | \n",
" 2000 | \n",
" A*Teens | \n",
" Dancing Queen | \n",
" 3:44 | \n",
" 2000-07-08 | \n",
" 97 | \n",
" 97 | \n",
" 96 | \n",
"
\n",
" \n",
" 5 | \n",
" 2000 | \n",
" Aaliyah | \n",
" I Don't Wanna | \n",
" 4:15 | \n",
" 2000-01-29 | \n",
" 84 | \n",
" 62 | \n",
" 51 | \n",
"
\n",
" \n",
" 6 | \n",
" 2000 | \n",
" Aaliyah | \n",
" Try Again | \n",
" 4:03 | \n",
" 2000-03-18 | \n",
" 59 | \n",
" 53 | \n",
" 38 | \n",
"
\n",
" \n",
" 7 | \n",
" 2000 | \n",
" Adams,Yolanda | \n",
" Open My Heart | \n",
" 5:30 | \n",
" 2000-08-26 | \n",
" 76 | \n",
" 76 | \n",
" 74 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year artist track time date entered wk1 wk2 \\\n",
"0 2000 2,Pac Baby Don't Cry 4:22 2000-02-26 87 82 \n",
"1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 91 87 \n",
"2 2000 3 Doors Down Kryptonite 3:53 2000-04-08 81 70 \n",
"3 2000 98^0 Give Me Just One Nig... 3:24 2000-08-19 51 39 \n",
"4 2000 A*Teens Dancing Queen 3:44 2000-07-08 97 97 \n",
"5 2000 Aaliyah I Don't Wanna 4:15 2000-01-29 84 62 \n",
"6 2000 Aaliyah Try Again 4:03 2000-03-18 59 53 \n",
"7 2000 Adams,Yolanda Open My Heart 5:30 2000-08-26 76 76 \n",
"\n",
" wk3 \n",
"0 72 \n",
"1 92 \n",
"2 68 \n",
"3 34 \n",
"4 96 \n",
"5 51 \n",
"6 38 \n",
"7 74 "
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"columns = ['year','artist','track','time','date entered','wk1','wk2','wk3',]\n",
"\n",
"data = [[2000,\"2,Pac\",\"Baby Don't Cry\",\"4:22\",\"2000-02-26\",87,82,72,],\n",
" [2000,\"2Ge+her\",\"The Hardest Part Of ...\",\"3:15\",\"2000-09-02\",91,87,92,],\n",
" [2000,\"3 Doors Down\",\"Kryptonite\",\"3:53\",\"2000-04-08\",81,70,68,],\n",
" [2000,\"98^0\",\"Give Me Just One Nig...\",\"3:24\",\"2000-08-19\",51,39,34,],\n",
" [2000,\"A*Teens\",\"Dancing Queen\",\"3:44\",\"2000-07-08\",97,97,96,],\n",
" [2000,\"Aaliyah\",\"I Don't Wanna\",\"4:15\",\"2000-01-29\",84,62,51,],\n",
" [2000,\"Aaliyah\",\"Try Again\",\"4:03\",\"2000-03-18\",59,53,38,],\n",
" [2000,\"Adams,Yolanda\",\"Open My Heart\",\"5:30\",\"2000-08-26\",76,76,74]\n",
" ]\n",
"\n",
"messy = pd.DataFrame(data=data, columns=columns)\n",
"messy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataset is messy because there are several observations per row, in the columns `wk1`, `wk2`, `wk3`. We can get one observation per row by melting the dataset."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" artist | \n",
" track | \n",
" time | \n",
" date entered | \n",
" week | \n",
" rank | \n",
"
\n",
" \n",
" \n",
" \n",
" 5 | \n",
" 2000 | \n",
" Aaliyah | \n",
" I Don't Wanna | \n",
" 4:15 | \n",
" 2000-01-29 | \n",
" wk1 | \n",
" 84 | \n",
"
\n",
" \n",
" 13 | \n",
" 2000 | \n",
" Aaliyah | \n",
" I Don't Wanna | \n",
" 4:15 | \n",
" 2000-01-29 | \n",
" wk2 | \n",
" 62 | \n",
"
\n",
" \n",
" 21 | \n",
" 2000 | \n",
" Aaliyah | \n",
" I Don't Wanna | \n",
" 4:15 | \n",
" 2000-01-29 | \n",
" wk3 | \n",
" 51 | \n",
"
\n",
" \n",
" 0 | \n",
" 2000 | \n",
" 2,Pac | \n",
" Baby Don't Cry | \n",
" 4:22 | \n",
" 2000-02-26 | \n",
" wk1 | \n",
" 87 | \n",
"
\n",
" \n",
" 8 | \n",
" 2000 | \n",
" 2,Pac | \n",
" Baby Don't Cry | \n",
" 4:22 | \n",
" 2000-02-26 | \n",
" wk2 | \n",
" 82 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year artist track time date entered week rank\n",
"5 2000 Aaliyah I Don't Wanna 4:15 2000-01-29 wk1 84\n",
"13 2000 Aaliyah I Don't Wanna 4:15 2000-01-29 wk2 62\n",
"21 2000 Aaliyah I Don't Wanna 4:15 2000-01-29 wk3 51\n",
"0 2000 2,Pac Baby Don't Cry 4:22 2000-02-26 wk1 87\n",
"8 2000 2,Pac Baby Don't Cry 4:22 2000-02-26 wk2 82"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"molten = pd.melt(messy, \n",
" id_vars=['year','artist','track','time','date entered'],\n",
" var_name = 'week',\n",
" value_name = 'rank',\n",
" )\n",
"molten.sort_values(by=['date entered','week'], inplace=True)\n",
"molten.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can clean the dataset further, first by turning `week` into a number."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" artist | \n",
" track | \n",
" time | \n",
" date entered | \n",
" week | \n",
" rank | \n",
"
\n",
" \n",
" \n",
" \n",
" 5 | \n",
" 2000 | \n",
" Aaliyah | \n",
" I Don't Wanna | \n",
" 4:15 | \n",
" 2000-01-29 | \n",
" 1 | \n",
" 84 | \n",
"
\n",
" \n",
" 13 | \n",
" 2000 | \n",
" Aaliyah | \n",
" I Don't Wanna | \n",
" 4:15 | \n",
" 2000-01-29 | \n",
" 2 | \n",
" 62 | \n",
"
\n",
" \n",
" 21 | \n",
" 2000 | \n",
" Aaliyah | \n",
" I Don't Wanna | \n",
" 4:15 | \n",
" 2000-01-29 | \n",
" 3 | \n",
" 51 | \n",
"
\n",
" \n",
" 0 | \n",
" 2000 | \n",
" 2,Pac | \n",
" Baby Don't Cry | \n",
" 4:22 | \n",
" 2000-02-26 | \n",
" 1 | \n",
" 87 | \n",
"
\n",
" \n",
" 8 | \n",
" 2000 | \n",
" 2,Pac | \n",
" Baby Don't Cry | \n",
" 4:22 | \n",
" 2000-02-26 | \n",
" 2 | \n",
" 82 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year artist track time date entered week rank\n",
"5 2000 Aaliyah I Don't Wanna 4:15 2000-01-29 1 84\n",
"13 2000 Aaliyah I Don't Wanna 4:15 2000-01-29 2 62\n",
"21 2000 Aaliyah I Don't Wanna 4:15 2000-01-29 3 51\n",
"0 2000 2,Pac Baby Don't Cry 4:22 2000-02-26 1 87\n",
"8 2000 2,Pac Baby Don't Cry 4:22 2000-02-26 2 82"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"molten['week'] = molten['week'].apply(lambda s: int(s[2:]))\n",
"molten.head()"
]
},
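{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same conversion can be written without `apply`, slicing off the `wk` prefix with the string accessor and casting; a sketch, equivalent to the cell above:\n",
"\n",
"```python\n",
"molten['week'] = molten['week'].str[2:].astype(int)\n",
"```"
]
},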
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Second, we need the starting date of the week for each observation, instead of the date the track entered."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" artist | \n",
" track | \n",
" time | \n",
" week | \n",
" rank | \n",
" date | \n",
"
\n",
" \n",
" \n",
" \n",
" 5 | \n",
" 2000 | \n",
" Aaliyah | \n",
" I Don't Wanna | \n",
" 4:15 | \n",
" 1 | \n",
" 84 | \n",
" 2000-01-29 | \n",
"
\n",
" \n",
" 13 | \n",
" 2000 | \n",
" Aaliyah | \n",
" I Don't Wanna | \n",
" 4:15 | \n",
" 2 | \n",
" 62 | \n",
" 2000-02-05 | \n",
"
\n",
" \n",
" 21 | \n",
" 2000 | \n",
" Aaliyah | \n",
" I Don't Wanna | \n",
" 4:15 | \n",
" 3 | \n",
" 51 | \n",
" 2000-02-12 | \n",
"
\n",
" \n",
" 0 | \n",
" 2000 | \n",
" 2,Pac | \n",
" Baby Don't Cry | \n",
" 4:22 | \n",
" 1 | \n",
" 87 | \n",
" 2000-02-26 | \n",
"
\n",
" \n",
" 8 | \n",
" 2000 | \n",
" 2,Pac | \n",
" Baby Don't Cry | \n",
" 4:22 | \n",
" 2 | \n",
" 82 | \n",
" 2000-03-04 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year artist track time week rank date\n",
"5 2000 Aaliyah I Don't Wanna 4:15 1 84 2000-01-29\n",
"13 2000 Aaliyah I Don't Wanna 4:15 2 62 2000-02-05\n",
"21 2000 Aaliyah I Don't Wanna 4:15 3 51 2000-02-12\n",
"0 2000 2,Pac Baby Don't Cry 4:22 1 87 2000-02-26\n",
"8 2000 2,Pac Baby Don't Cry 4:22 2 82 2000-03-04"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from datetime import datetime, timedelta\n",
"\n",
"def increment_date(row):\n",
"    # week n starts (n - 1) weeks after the date the track entered the chart\n",
"    date = datetime.strptime(row['date entered'], \"%Y-%m-%d\")\n",
"    return date + timedelta(7) * (row['week'] - 1)\n",
"\n",
"molten['date'] = molten.apply(increment_date, axis=1)\n",
"molten.drop('date entered', axis=1, inplace=True)\n",
"molten.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Last, this dataset is denormalized. That is fine for most statistical and machine learning packages, but we might want to normalize it: move the information that is repeated every week for a track into a separate table. This information appears in the columns `year, artist, track, time`."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" year | \n",
" artist | \n",
" track | \n",
" time | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 2000 | \n",
" 2,Pac | \n",
" Baby Don't Cry | \n",
" 4:22 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 2000 | \n",
" 2Ge+her | \n",
" The Hardest Part Of ... | \n",
" 3:15 | \n",
"
\n",
" \n",
" 2 | \n",
" 2 | \n",
" 2000 | \n",
" 3 Doors Down | \n",
" Kryptonite | \n",
" 3:53 | \n",
"
\n",
" \n",
" 3 | \n",
" 3 | \n",
" 2000 | \n",
" 98^0 | \n",
" Give Me Just One Nig... | \n",
" 3:24 | \n",
"
\n",
" \n",
" 4 | \n",
" 4 | \n",
" 2000 | \n",
" A*Teens | \n",
" Dancing Queen | \n",
" 3:44 | \n",
"
\n",
" \n",
" 5 | \n",
" 5 | \n",
" 2000 | \n",
" Aaliyah | \n",
" I Don't Wanna | \n",
" 4:15 | \n",
"
\n",
" \n",
" 6 | \n",
" 6 | \n",
" 2000 | \n",
" Aaliyah | \n",
" Try Again | \n",
" 4:03 | \n",
"
\n",
" \n",
" 7 | \n",
" 7 | \n",
" 2000 | \n",
" Adams,Yolanda | \n",
" Open My Heart | \n",
" 5:30 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id year artist track time\n",
"0 0 2000 2,Pac Baby Don't Cry 4:22\n",
"1 1 2000 2Ge+her The Hardest Part Of ... 3:15\n",
"2 2 2000 3 Doors Down Kryptonite 3:53\n",
"3 3 2000 98^0 Give Me Just One Nig... 3:24\n",
"4 4 2000 A*Teens Dancing Queen 3:44\n",
"5 5 2000 Aaliyah I Don't Wanna 4:15\n",
"6 6 2000 Aaliyah Try Again 4:03\n",
"7 7 2000 Adams,Yolanda Open My Heart 5:30"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tidy_track = molten[['year','artist','track','time']]\\\n",
" .groupby(['year','artist','track'])\\\n",
" .first()\n",
"tidy_track.reset_index(inplace=True)\n",
"tidy_track.reset_index(inplace=True)\n",
"tidy_track.rename(columns = {'index':'id'}, inplace=True)\n",
"tidy_track"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" date | \n",
" rank | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 5 | \n",
" 2000-01-29 | \n",
" 84 | \n",
"
\n",
" \n",
" 1 | \n",
" 5 | \n",
" 2000-02-05 | \n",
" 62 | \n",
"
\n",
" \n",
" 2 | \n",
" 5 | \n",
" 2000-02-12 | \n",
" 51 | \n",
"
\n",
" \n",
" 3 | \n",
" 0 | \n",
" 2000-02-26 | \n",
" 87 | \n",
"
\n",
" \n",
" 4 | \n",
" 0 | \n",
" 2000-03-04 | \n",
" 82 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id date rank\n",
"0 5 2000-01-29 84\n",
"1 5 2000-02-05 62\n",
"2 5 2000-02-12 51\n",
"3 0 2000-02-26 87\n",
"4 0 2000-03-04 82"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tidy_rank = pd.merge(molten, tidy_track, on='track')\n",
"tidy_rank = tidy_rank[['id', 'date', 'rank']]\n",
"tidy_rank.head()"
]
}
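,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One caveat worth noting: joining on `track` alone would conflate two different songs that happen to share a title. Merging on the full key used to build `tidy_track` is safer; a sketch:\n",
"\n",
"```python\n",
"tidy_rank = pd.merge(molten, tidy_track, on=['year', 'artist', 'track'])\n",
"tidy_rank = tidy_rank[['id', 'date', 'rank']]\n",
"```"
]
}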
],
"metadata": {
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
},
"toc": {
"navigate_menu": false,
"number_sections": false,
"sideBar": true,
"threshold": 6,
"toc_cell": false,
"toc_section_display": "block",
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 0
}