By default, `pivot_table` takes the **mean** of all values that would end up into one cell. However, you can also specify other aggregation functions using the `aggfunc` keyword.
Make a table of the median Fare payed by aged/underaged vs Sex.
\n",
"
\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": true
},
"outputs": [],
"source": [
"# %load snippets/06 - Reshaping data22.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"# %load snippets/06 - Reshaping data23.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Melt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `melt` function performs the inverse operation of a `pivot`. This can be used to make your frame longer, i.e. to make a *tidy* version of your data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pivoted = df.pivot_table(index='Sex', columns='Pclass', values='Fare').reset_index()\n",
"pivoted.columns.name = None"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pivoted"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Assume we have a DataFrame like the above. The observations (the average Fare people payed) are spread over different columns. In a tidy dataset, each observation is stored in one row. To obtain this, we can use the `melt` function:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.melt(pivoted)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see above, the `melt` function puts all column labels in one column, and all values in a second column.\n",
"\n",
"In this case, this is not fully what we want. We would like to keep the 'Sex' column separately:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.melt(pivoted, id_vars=['Sex']) #, var_name='Pclass', value_name='Fare')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reshaping with `stack` and `unstack`"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The docs say:\n",
"\n",
"> Pivot a level of the (possibly hierarchical) column labels, returning a\n",
"DataFrame (or Series in the case of an object with a single level of\n",
"column labels) having a hierarchical index with a new inner-most level\n",
"of row labels.\n",
"\n",
"Indeed... \n",
"\n",
"\n",
"Before we speak about `hierarchical index`, first check it in practice on the following dummy example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"df = pd.DataFrame({'A':['one', 'one', 'two', 'two'], \n",
" 'B':['a', 'b', 'a', 'b'], \n",
" 'C':range(4)})\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To use `stack`/`unstack`, we need the values we want to shift from rows to columns or the other way around as the index:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"df = df.set_index(['A', 'B']) # Indeed, you can combine two indices\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"result = df['C'].unstack()\n",
"result"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"df = result.stack().reset_index(name='C')\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"run_control": {
"frozen": false,
"read_only": false
}
},
"source": [
"
\n",
"\n",
"REMEMBER:\n",
"\n",
"
\n",
"
**stack**: make your data *longer* and *smaller*
\n",
"
**unstack**: make your data *shorter* and *wider*
\n",
"
\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mimick pivot table "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To better understand and reason about pivot tables, we can express this method as a combination of more basic steps. In short, the pivot is a convenient way of expressing the combination of a `groupby` and `stack/unstack`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"df = pd.read_csv(\"data/titanic.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
},
"scrolled": true
},
"outputs": [],
"source": [
"df.pivot_table(index='Pclass', columns='Sex', \n",
" values='Survived', aggfunc='mean')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"EXERCISE:\n",
"\n",
"
\n",
"
Get the same result as above based on a combination of `groupby` and `unstack`
\n",
"
First use `groupby` to calculate the survival ratio for all groups
\n",
"
Then, use `unstack` to reshape the output of the groupby operation
\n",
"
\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"# %load snippets/06 - Reshaping data37.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mimick melt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Like the pivot table above, we can now also obtain the result of `melt` with stack/unstack.\n",
"\n",
"Let's use the same `pivoted` frame as above, and look at the final melt result:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pivoted = df.pivot_table(index='Sex', columns='Pclass', values='Fare').reset_index()\n",
"pivoted.columns.name = None\n",
"pivoted"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.melt(pivoted, id_vars=['Sex'], var_name='Pclass', value_name='Fare')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"EXERCISE:\n",
"\n",
"
\n",
"
Get the same result as above using `stack`/`unstack` (combined with `set_index` / `reset_index`)
\n",
"
Tip: set those columns as the index that you do not want to stack
\n",
"
\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/06 - Reshaping data40.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/06 - Reshaping data41.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/06 - Reshaping data42.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/06 - Reshaping data43.py"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# Exercises: use the reshaping methods with the movie data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These exercises are based on the [PyCon tutorial of Brandon Rhodes](https://github.com/brandon-rhodes/pycon-pandas-tutorial/) (so credit to him!) and the datasets he prepared for that. You can download these data from here: [`titles.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKajNMa1pfSzN6Q3M) and [`cast.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKal9UYTJSR2ZhSW8) and put them in the `/data` folder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"cast = pd.read_csv('data/cast.csv')\n",
"cast.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"titles = pd.read_csv('data/titles.csv')\n",
"titles.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"EXERCISE:\n",
"\n",
"
\n",
"
Plot the number of actor roles each year and the number of actress roles each year over the whole period of available movie data.
Define a year as a \"Superman year\" when films of that year feature more Superman characters than Batman characters. How many years in film history have been Superman years?