{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook is part of my [Python data science curriculum](http://www.terran.us/articles/python_curriculum.html). It demonstrates some Pandas functions which I thought were not adequately explained in the Jake VanderPlas book." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import scipy.stats as stats\n", "from plotnine.data import diamonds" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# This is a standard Python demo dataset. You can also load it from your Python packages\n", "# dir with pd.read_csv if you don't want to import seaborn.\n", "import seaborn as sns\n", "tips = sns.load_dataset('tips')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Loading Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For no apparent reason, the VanderPlas book doesn't document `pd.read_csv`! This is definitely functionality that you need. If you have the Wes McKinney book available, he has a description that you can read in Chapter 6; otherwise just read the online docs." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "pd.read_csv?" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "pd.read_excel?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For DB connections: \n", "http://pandas.pydata.org/pandas-docs/stable/io.html#sql-queries \n", "https://www.sqlalchemy.org/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Summary Tools" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Be aware that describe() ignores all non-numeric columns by default, which might not be what you wanted:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>total_bill</th>\n", " <th>tip</th>\n", " <th>size</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>count</th>\n", " <td>244.000000</td>\n", " <td>244.000000</td>\n", " <td>244.000000</td>\n", " </tr>\n", " <tr>\n", " <th>mean</th>\n", " <td>19.785943</td>\n", " <td>2.998279</td>\n", " <td>2.569672</td>\n", " </tr>\n", " <tr>\n", " <th>std</th>\n", " <td>8.902412</td>\n", " <td>1.383638</td>\n", " <td>0.951100</td>\n", " </tr>\n", " <tr>\n", " <th>min</th>\n", " <td>3.070000</td>\n", " <td>1.000000</td>\n", " <td>1.000000</td>\n", " </tr>\n", " <tr>\n", " <th>25%</th>\n", " <td>13.347500</td>\n", " <td>2.000000</td>\n", " <td>2.000000</td>\n", " </tr>\n", " <tr>\n", " <th>50%</th>\n", " <td>17.795000</td>\n", " <td>2.900000</td>\n", " <td>2.000000</td>\n", " </tr>\n", " <tr>\n", " <th>75%</th>\n", " <td>24.127500</td>\n", " <td>3.562500</td>\n", " <td>3.000000</td>\n", " </tr>\n", " <tr>\n", " <th>max</th>\n", " <td>50.810000</td>\n", " <td>10.000000</td>\n", " <td>6.000000</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " total_bill tip size\n", "count 244.000000 244.000000 244.000000\n", "mean 19.785943 2.998279 2.569672\n", "std 8.902412 1.383638 0.951100\n", "min 3.070000 1.000000 1.000000\n", "25% 13.347500 2.000000 2.000000\n", "50% 17.795000 2.900000 2.000000\n", "75% 24.127500 3.562500 3.000000\n", "max 50.810000 10.000000 6.000000" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can force it to include them with include='all', but they compute different statistics, so the result is ugly:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>total_bill</th>\n", " <th>tip</th>\n", " <th>sex</th>\n", " <th>smoker</th>\n", " <th>day</th>\n", " <th>time</th>\n", " <th>size</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>count</th>\n", " <td>244.000000</td>\n", " <td>244.000000</td>\n", " <td>244</td>\n", " <td>244</td>\n", " <td>244</td>\n", " <td>244</td>\n", " <td>244.000000</td>\n", " </tr>\n", " <tr>\n", " <th>unique</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>2</td>\n", " <td>2</td>\n", " <td>4</td>\n", " <td>2</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>top</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>Male</td>\n", " <td>No</td>\n", " <td>Sat</td>\n", " <td>Dinner</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>freq</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>157</td>\n", " <td>151</td>\n", " <td>87</td>\n", " <td>176</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>mean</th>\n", " <td>19.785943</td>\n", " <td>2.998279</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>2.569672</td>\n", " </tr>\n", " <tr>\n", " <th>std</th>\n", " <td>8.902412</td>\n", " <td>1.383638</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>0.951100</td>\n", " </tr>\n", " <tr>\n", " <th>min</th>\n", " <td>3.070000</td>\n", " <td>1.000000</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>1.000000</td>\n", " </tr>\n", " <tr>\n", " <th>25%</th>\n", " <td>13.347500</td>\n", " <td>2.000000</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>2.000000</td>\n", " </tr>\n", " <tr>\n", " <th>50%</th>\n", " <td>17.795000</td>\n", " <td>2.900000</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>2.000000</td>\n", " </tr>\n", " <tr>\n", " <th>75%</th>\n", " <td>24.127500</td>\n", " <td>3.562500</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>3.000000</td>\n", " </tr>\n", " <tr>\n", " <th>max</th>\n", " <td>50.810000</td>\n", " <td>10.000000</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>6.000000</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " total_bill tip sex smoker day time size\n", "count 244.000000 244.000000 244 244 244 244 244.000000\n", "unique NaN NaN 2 2 4 2 NaN\n", "top NaN NaN Male No Sat Dinner NaN\n", "freq NaN NaN 157 151 87 176 NaN\n", "mean 19.785943 2.998279 NaN NaN NaN NaN 2.569672\n", "std 8.902412 1.383638 NaN NaN NaN NaN 0.951100\n", "min 3.070000 1.000000 NaN NaN NaN NaN 1.000000\n", "25% 13.347500 2.000000 NaN NaN NaN NaN 2.000000\n", "50% 17.795000 2.900000 NaN NaN NaN NaN 2.000000\n", "75% 24.127500 3.562500 NaN NaN NaN NaN 3.000000\n", "max 50.810000 10.000000 NaN NaN NaN NaN 6.000000" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.describe(include='all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You probably want to do the numeric and string types separately. Note that values which look like strings might be `np.object` or `pd.Categorical`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>sex</th>\n", " <th>smoker</th>\n", " <th>day</th>\n", " <th>time</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>count</th>\n", " <td>244</td>\n", " <td>244</td>\n", " <td>244</td>\n", " <td>244</td>\n", " </tr>\n", " <tr>\n", " <th>unique</th>\n", " <td>2</td>\n", " <td>2</td>\n", " <td>4</td>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>top</th>\n", " <td>Male</td>\n", " <td>No</td>\n", " <td>Sat</td>\n", " <td>Dinner</td>\n", " </tr>\n", " <tr>\n", " <th>freq</th>\n", " <td>157</td>\n", " <td>151</td>\n", " <td>87</td>\n", " <td>176</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " sex smoker day time\n", "count 244 244 244 244\n", "unique 2 2 4 2\n", "top Male No Sat Dinner\n", "freq 157 151 87 176" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.describe(include=pd.Categorical)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`.value_counts()` and `.unique()` work on any type of column, but only one column at a time, not an entire dataframe." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Male 157\n", "Female 87\n", "Name: sex, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips['sex'].value_counts()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Female, Male]\n", "Categories (2, object): [Female, Male]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips['sex'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These specialized functions are quite a bit faster than the general approach with `groupby`:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "905 µs ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n", "1.4 ms ± 7.71 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n", "11.7 ms ± 573 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], "source": [ "%timeit -n100 diamonds.color.unique()\n", "%timeit -n100 diamonds.color.value_counts()\n", "%timeit -n100 diamonds.groupby('color').count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Stacking, unstacking, melting, and pivoting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stacking and unstacking" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To understand stacking and unstacking, let's start by creating a multindex on the rows." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th></th>\n", " <th>total_bill</th>\n", " <th>tip</th>\n", " <th>size</th>\n", " </tr>\n", " <tr>\n", " <th>sex</th>\n", " <th>smoker</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">Male</th>\n", " <th>Yes</th>\n", " <td>22.284500</td>\n", " <td>3.051167</td>\n", " <td>2.500000</td>\n", " </tr>\n", " <tr>\n", " <th>No</th>\n", " <td>19.791237</td>\n", " <td>3.113402</td>\n", " <td>2.711340</td>\n", " </tr>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">Female</th>\n", " <th>Yes</th>\n", " <td>17.977879</td>\n", " <td>2.931515</td>\n", " <td>2.242424</td>\n", " </tr>\n", " <tr>\n", " <th>No</th>\n", " <td>18.105185</td>\n", " <td>2.773519</td>\n", " <td>2.592593</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " total_bill tip size\n", "sex smoker \n", "Male Yes 22.284500 3.051167 2.500000\n", " No 19.791237 3.113402 2.711340\n", "Female Yes 17.977879 2.931515 2.242424\n", " No 18.105185 2.773519 2.592593" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tss = tips.groupby(['sex','smoker']).aggregate('mean')\n", "tss" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`unstack` will them move the INNER level of the index from the ROWS to the COLUMNS:\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead tr th {\n", " text-align: left;\n", " }\n", "\n", " .dataframe thead tr:last-of-type th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr>\n", " <th></th>\n", " <th colspan=\"2\" halign=\"left\">total_bill</th>\n", " <th colspan=\"2\" halign=\"left\">tip</th>\n", " <th colspan=\"2\" halign=\"left\">size</th>\n", " </tr>\n", " <tr>\n", " <th>smoker</th>\n", " <th>Yes</th>\n", " <th>No</th>\n", " <th>Yes</th>\n", " <th>No</th>\n", " <th>Yes</th>\n", " <th>No</th>\n", " </tr>\n", " <tr>\n", " <th>sex</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Male</th>\n", " <td>22.284500</td>\n", " <td>19.791237</td>\n", " <td>3.051167</td>\n", " <td>3.113402</td>\n", " <td>2.500000</td>\n", " <td>2.711340</td>\n", " </tr>\n", " <tr>\n", " <th>Female</th>\n", " <td>17.977879</td>\n", " <td>18.105185</td>\n", " <td>2.931515</td>\n", " <td>2.773519</td>\n", " <td>2.242424</td>\n", " <td>2.592593</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " total_bill tip size \n", "smoker Yes No Yes No Yes No\n", "sex \n", "Male 22.284500 19.791237 3.051167 3.113402 2.500000 2.711340\n", "Female 17.977879 18.105185 2.931515 2.773519 2.242424 2.592593" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tss.unstack()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`stack` moves the innermost level from the columns to the rows:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th></th>\n", " <th>total_bill</th>\n", " <th>tip</th>\n", " <th>size</th>\n", " </tr>\n", " <tr>\n", " <th>sex</th>\n", " <th>smoker</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">Male</th>\n", " <th>Yes</th>\n", " <td>22.284500</td>\n", " <td>3.051167</td>\n", " <td>2.500000</td>\n", " </tr>\n", " <tr>\n", " <th>No</th>\n", " <td>19.791237</td>\n", " <td>3.113402</td>\n", " <td>2.711340</td>\n", " </tr>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">Female</th>\n", " <th>Yes</th>\n", " <td>17.977879</td>\n", " <td>2.931515</td>\n", " <td>2.242424</td>\n", " </tr>\n", " <tr>\n", " <th>No</th>\n", " <td>18.105185</td>\n", " <td>2.773519</td>\n", " <td>2.592593</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " total_bill tip size\n", "sex smoker \n", "Male Yes 22.284500 3.051167 2.500000\n", " No 19.791237 3.113402 2.711340\n", "Female Yes 17.977879 2.931515 2.242424\n", " No 18.105185 2.773519 2.592593" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tss.unstack().stack()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you remove the _last_ (only remaining) level from either the rows or the columns, you then get a one-dimensional Series instead of a DataFrame:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " smoker sex \n", "total_bill Yes Male 22.284500\n", " Female 17.977879\n", " No Male 19.791237\n", " Female 18.105185\n", "tip Yes Male 3.051167\n", " Female 2.931515\n", " No Male 3.113402\n", " Female 2.773519\n", "size Yes Male 2.500000\n", " Female 2.242424\n", " No Male 2.711340\n", " Female 2.592593\n", "dtype: float64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tss.unstack().unstack()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you have a series, it can only be `unstack`ed, not `stack`ed. You can use `level=` to control which part of the index gets turned back into columns." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>sex</th>\n", " <th>Male</th>\n", " <th>Female</th>\n", " </tr>\n", " <tr>\n", " <th></th>\n", " <th>smoker</th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">total_bill</th>\n", " <th>Yes</th>\n", " <td>22.284500</td>\n", " <td>17.977879</td>\n", " </tr>\n", " <tr>\n", " <th>No</th>\n", " <td>19.791237</td>\n", " <td>18.105185</td>\n", " </tr>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">tip</th>\n", " <th>Yes</th>\n", " <td>3.051167</td>\n", " <td>2.931515</td>\n", " </tr>\n", " <tr>\n", " <th>No</th>\n", " <td>3.113402</td>\n", " <td>2.773519</td>\n", " </tr>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">size</th>\n", " <th>Yes</th>\n", " <td>2.500000</td>\n", " <td>2.242424</td>\n", " </tr>\n", " <tr>\n", " <th>No</th>\n", " <td>2.711340</td>\n", " <td>2.592593</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "sex Male Female\n", " smoker \n", "total_bill Yes 22.284500 17.977879\n", " No 19.791237 18.105185\n", "tip Yes 3.051167 2.931515\n", " No 3.113402 2.773519\n", "size Yes 2.500000 2.242424\n", " No 2.711340 2.592593" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tss.unstack().unstack().unstack()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>smoker</th>\n", " <th>Yes</th>\n", " <th>No</th>\n", " </tr>\n", " <tr>\n", " <th></th>\n", " <th>sex</th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">total_bill</th>\n", " <th>Male</th>\n", " <td>22.284500</td>\n", " <td>19.791237</td>\n", " </tr>\n", " <tr>\n", " <th>Female</th>\n", " <td>17.977879</td>\n", " <td>18.105185</td>\n", " </tr>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">tip</th>\n", " <th>Male</th>\n", " <td>3.051167</td>\n", " <td>3.113402</td>\n", " </tr>\n", " <tr>\n", " <th>Female</th>\n", " <td>2.931515</td>\n", " <td>2.773519</td>\n", " </tr>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">size</th>\n", " <th>Male</th>\n", " <td>2.500000</td>\n", " <td>2.711340</td>\n", " </tr>\n", " <tr>\n", " <th>Female</th>\n", " <td>2.242424</td>\n", " <td>2.592593</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "smoker Yes No\n", " sex \n", "total_bill Male 22.284500 19.791237\n", " Female 17.977879 18.105185\n", "tip Male 3.051167 3.113402\n", " Female 2.931515 2.773519\n", "size Male 2.500000 2.711340\n", " Female 2.242424 2.592593" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tss.unstack().unstack().unstack(level=1)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th></th>\n", " <th>total_bill</th>\n", " <th>tip</th>\n", " <th>size</th>\n", " </tr>\n", " <tr>\n", " <th>smoker</th>\n", " <th>sex</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">Yes</th>\n", " <th>Male</th>\n", " <td>22.284500</td>\n", " <td>3.051167</td>\n", " <td>2.500000</td>\n", " </tr>\n", " <tr>\n", " <th>Female</th>\n", " <td>17.977879</td>\n", " <td>2.931515</td>\n", " <td>2.242424</td>\n", " </tr>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">No</th>\n", " <th>Male</th>\n", " <td>19.791237</td>\n", " <td>3.113402</td>\n", " <td>2.711340</td>\n", " </tr>\n", " <tr>\n", " <th>Female</th>\n", " <td>18.105185</td>\n", " <td>2.773519</td>\n", " <td>2.592593</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " total_bill tip size\n", "smoker sex \n", "Yes Male 22.284500 3.051167 2.500000\n", " Female 17.977879 2.931515 2.242424\n", "No Male 19.791237 3.113402 2.711340\n", " Female 18.105185 2.773519 2.592593" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tss.unstack().unstack().unstack(level=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Melting and pivoting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`melt` is very similar to `stack`, except that it applies to all columns and not just the innermost level, and it the converts them into a normal column instead of an index level." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>total_bill</th>\n", " <th>tip</th>\n", " <th>sex</th>\n", " <th>smoker</th>\n", " <th>day</th>\n", " <th>time</th>\n", " <th>size</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>16.99</td>\n", " <td>1.01</td>\n", " <td>Female</td>\n", " <td>No</td>\n", " <td>Sun</td>\n", " <td>Dinner</td>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>10.34</td>\n", " <td>1.66</td>\n", " <td>Male</td>\n", " <td>No</td>\n", " <td>Sun</td>\n", " <td>Dinner</td>\n", " <td>3</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " total_bill tip sex smoker day time size\n", "0 16.99 1.01 Female No Sun Dinner 2\n", "1 10.34 1.66 Male No Sun Dinner 3" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.head(2)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 total_bill 16.99\n", " tip 1.01\n", " sex Female\n", " smoker No\n", " day Sun\n", " time Dinner\n", " size 2\n", "1 total_bill 10.34\n", " tip 1.66\n", " sex Male\n", " smoker No\n", " day Sun\n", " time Dinner\n", " size 3\n", "dtype: object" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.head(2).stack()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>index</th>\n", " <th>variable</th>\n", " <th>value</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>total_bill</td>\n", " <td>16.99</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>total_bill</td>\n", " <td>10.34</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0</td>\n", " <td>tip</td>\n", " <td>1.01</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1</td>\n", " <td>tip</td>\n", " <td>1.66</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0</td>\n", " <td>sex</td>\n", " <td>Female</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>1</td>\n", " <td>sex</td>\n", " <td>Male</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>0</td>\n", " <td>smoker</td>\n", " <td>No</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>1</td>\n", " <td>smoker</td>\n", " <td>No</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>0</td>\n", " <td>day</td>\n", " <td>Sun</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>1</td>\n", " <td>day</td>\n", " <td>Sun</td>\n", " </tr>\n", " <tr>\n", " <th>10</th>\n", " <td>0</td>\n", " <td>time</td>\n", " <td>Dinner</td>\n", " </tr>\n", " <tr>\n", " <th>11</th>\n", " <td>1</td>\n", " <td>time</td>\n", " <td>Dinner</td>\n", " </tr>\n", " <tr>\n", " <th>12</th>\n", " <td>0</td>\n", " <td>size</td>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>13</th>\n", " <td>1</td>\n", " <td>size</td>\n", " <td>3</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " index variable value\n", "0 0 total_bill 16.99\n", "1 1 total_bill 10.34\n", "2 0 tip 1.01\n", "3 1 tip 1.66\n", "4 0 sex Female\n", "5 1 sex Male\n", "6 0 smoker No\n", "7 1 smoker No\n", "8 0 day Sun\n", "9 1 day Sun\n", "10 0 time Dinner\n", "11 1 time Dinner\n", "12 0 size 2\n", "13 1 size 3" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Note that I have to move the row index into a column, which is called \"index\", to\n", "# preserve the association of the data in the original rows through the melt.\n", "tips_melted=tips.reset_index().head(2).melt(id_vars='index')\n", "tips_melted" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is no corresponding `cast` like R has. Instead, use `.pivot`" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead tr th {\n", " text-align: left;\n", " }\n", "\n", " .dataframe thead tr:last-of-type th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr>\n", " <th></th>\n", " <th colspan=\"7\" halign=\"left\">value</th>\n", " </tr>\n", " <tr>\n", " <th>variable</th>\n", " <th>day</th>\n", " <th>sex</th>\n", " <th>size</th>\n", " <th>smoker</th>\n", " <th>time</th>\n", " <th>tip</th>\n", " <th>total_bill</th>\n", " </tr>\n", " <tr>\n", " <th>index</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>Sun</td>\n", " <td>Female</td>\n", " <td>2</td>\n", " <td>No</td>\n", " <td>Dinner</td>\n", " <td>1.01</td>\n", " <td>16.99</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Sun</td>\n", " <td>Male</td>\n", " <td>3</td>\n", " <td>No</td>\n", " <td>Dinner</td>\n", " <td>1.66</td>\n", " <td>10.34</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " value \n", "variable day sex size smoker time tip total_bill\n", "index \n", "0 Sun Female 2 No Dinner 1.01 16.99\n", "1 Sun Male 3 No Dinner 1.66 10.34" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips_melted.pivot(index='index',columns='variable')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The distinction between `.pivot` and `.pivot_table` is that the latter does aggregation:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th>time</th>\n", " <th>Dinner</th>\n", " <th>Lunch</th>\n", " </tr>\n", " <tr>\n", " <th>sex</th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Female</th>\n", " <td>6.5</td>\n", " <td>5.17</td>\n", " </tr>\n", " <tr>\n", " <th>Male</th>\n", " <td>10.0</td>\n", " <td>6.70</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "time Dinner Lunch\n", "sex \n", "Female 6.5 5.17\n", "Male 10.0 6.70" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.pivot_table(values='tip',index='sex',columns='time',aggfunc='max')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the gotcha that the arguments are not in the same order if you specify them positionally:\n", "\n", "tips.pivot_table(__values__=None, __index__=None, __columns__=None, ... \n", "tips.pivot(__index__=None, __columns__=None, __values__=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Aggregate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Full Aggregate Syntax" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The full syntax of arguments to `aggregate()` is fairly complex. You can have:\n", "\n", "- A dict where the keys are columns in your source data, and the values are:\n", " - An array of functions to apply, where each element is:\n", " - A 2-tuple, where the first element is a string to call the output and the second element is the function" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead tr th {\n", " text-align: left;\n", " }\n", "\n", " .dataframe thead tr:last-of-type th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr>\n", " <th></th>\n", " <th></th>\n", " <th colspan=\"2\" halign=\"left\">tip</th>\n", " <th>time</th>\n", " </tr>\n", " <tr>\n", " <th></th>\n", " <th></th>\n", " <th>mean</th>\n", " <th>50pct</th>\n", " <th>pct_dinner</th>\n", " </tr>\n", " <tr>\n", " <th>sex</th>\n", " <th>smoker</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">Male</th>\n", " <th>Yes</th>\n", " <td>3.051167</td>\n", " <td>3.00</td>\n", " <td>78.333333</td>\n", " </tr>\n", " <tr>\n", " <th>No</th>\n", " <td>3.113402</td>\n", " <td>2.74</td>\n", " <td>79.381443</td>\n", " </tr>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">Female</th>\n", " <th>Yes</th>\n", " <td>2.931515</td>\n", " <td>2.88</td>\n", " <td>69.696970</td>\n", " </tr>\n", " <tr>\n", " <th>No</th>\n", " <td>2.773519</td>\n", " <td>2.68</td>\n", " <td>53.703704</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " tip time\n", " mean 50pct pct_dinner\n", "sex smoker \n", "Male Yes 3.051167 3.00 78.333333\n", " No 3.113402 2.74 79.381443\n", "Female Yes 2.931515 2.88 69.696970\n", " No 2.773519 2.68 53.703704" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.groupby(['sex','smoker']).aggregate(\n", " {'tip':[('mean',np.mean),('50pct',np.median)],\n", " 'time':[('pct_dinner', lambda x: 100*np.mean(x=='Dinner'))]\n", "})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But note that if you specify a tuple for one function, you had better specify it for all, or you get bad column names for the ones you didn't specify:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead tr th {\n", " text-align: left;\n", " }\n", "\n", " .dataframe thead tr:last-of-type th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr>\n", " <th></th>\n", " <th></th>\n", " <th colspan=\"2\" halign=\"left\">tip</th>\n", " </tr>\n", " <tr>\n", " <th></th>\n", " <th></th>\n", " <th><function mean at 0x7fdb78089ae8></th>\n", " <th>50pct</th>\n", " </tr>\n", " <tr>\n", " <th>sex</th>\n", " <th>smoker</th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">Male</th>\n", " <th>Yes</th>\n", " <td>3.051167</td>\n", " <td>3.00</td>\n", " </tr>\n", " <tr>\n", " <th>No</th>\n", " <td>3.113402</td>\n", " <td>2.74</td>\n", " </tr>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">Female</th>\n", " <th>Yes</th>\n", " <td>2.931515</td>\n", " <td>2.88</td>\n", " </tr>\n", " <tr>\n", " <th>No</th>\n", " <td>2.773519</td>\n", " <td>2.68</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " tip \n", " <function mean at 0x7fdb78089ae8> 50pct\n", "sex smoker \n", "Male Yes 3.051167 3.00\n", " No 3.113402 2.74\n", "Female Yes 2.931515 2.88\n", " No 2.773519 2.68" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.groupby(['sex','smoker']).aggregate({'tip':[np.mean,('50pct',np.median)]})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whereas if you don't specify _any_ names, you get sane defaults. I dunno." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead tr th {\n", " text-align: left;\n", " }\n", "\n", " .dataframe thead tr:last-of-type th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr>\n", " <th></th>\n", " <th></th>\n", " <th colspan=\"2\" halign=\"left\">tip</th>\n", " </tr>\n", " <tr>\n", " <th></th>\n", " <th></th>\n", " <th>mean</th>\n", " <th>median</th>\n", " </tr>\n", " <tr>\n", " <th>sex</th>\n", " <th>smoker</th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">Male</th>\n", " <th>Yes</th>\n", " <td>3.051167</td>\n", " <td>3.00</td>\n", " </tr>\n", " <tr>\n", " <th>No</th>\n", " <td>3.113402</td>\n", " <td>2.74</td>\n", " </tr>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">Female</th>\n", " <th>Yes</th>\n", " <td>2.931515</td>\n", " <td>2.88</td>\n", " </tr>\n", " <tr>\n", " <th>No</th>\n", " <td>2.773519</td>\n", " <td>2.68</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " tip \n", " mean median\n", "sex smoker \n", "Male Yes 3.051167 3.00\n", " No 3.113402 2.74\n", "Female Yes 2.931515 2.88\n", " No 2.773519 2.68" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.groupby(['sex','smoker']).aggregate({'tip':[np.mean,np.median]})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multi-level Column Names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we aggregate multiple columns with multiple functions, we get hierarchical column names:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MultiIndex(levels=[['tip', 'total_bill'], ['mean', 'median']],\n", " labels=[[0, 0, 1, 1], [0, 1, 0, 1]])" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm = tips.groupby(['sex','smoker']).aggregate({'tip':[np.mean,np.median],\n", " 'total_bill':[np.mean,np.median]})\n", "\n", "# See that we have a MultiIndex:\n", "tm.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to flip which is the first and which is the second level of the index, we can do it with `.swaplevel`:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead tr th {\n", " text-align: left;\n", " }\n", "\n", " .dataframe thead tr:last-of-type th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr>\n", " <th></th>\n", " <th></th>\n", " <th colspan=\"2\" halign=\"left\">tip</th>\n", " <th colspan=\"2\" halign=\"left\">total_bill</th>\n", " </tr>\n", " <tr>\n", " <th></th>\n", " <th></th>\n", " <th>mean</th>\n", " <th>median</th>\n", " <th>mean</th>\n", " <th>median</th>\n", " </tr>\n", " <tr>\n", " <th>sex</th>\n", " <th>smoker</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">Male</th>\n", " <th>Yes</th>\n", " <td>3.051167</td>\n", " <td>3.00</td>\n", " <td>22.284500</td>\n", " <td>20.39</td>\n", " </tr>\n", " <tr>\n", " <th>No</th>\n", " <td>3.113402</td>\n", " <td>2.74</td>\n", " <td>19.791237</td>\n", " <td>18.24</td>\n", " </tr>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">Female</th>\n", " <th>Yes</th>\n", " <td>2.931515</td>\n", " <td>2.88</td>\n", " <td>17.977879</td>\n", " <td>16.27</td>\n", " </tr>\n", " <tr>\n", " <th>No</th>\n", " <td>2.773519</td>\n", " <td>2.68</td>\n", " <td>18.105185</td>\n", " <td>16.69</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " tip total_bill \n", " mean median mean median\n", "sex smoker \n", "Male Yes 3.051167 3.00 22.284500 20.39\n", " No 3.113402 2.74 19.791237 18.24\n", "Female Yes 2.931515 2.88 17.977879 16.27\n", " No 2.773519 2.68 18.105185 16.69" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead tr th {\n", " text-align: left;\n", " }\n", "\n", " .dataframe thead tr:last-of-type th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr>\n", " <th></th>\n", " <th></th>\n", " <th>mean</th>\n", " <th>median</th>\n", " <th>mean</th>\n", " <th>median</th>\n", " </tr>\n", " <tr>\n", " <th></th>\n", " <th></th>\n", " <th>tip</th>\n", " <th>tip</th>\n", " <th>total_bill</th>\n", " <th>total_bill</th>\n", " </tr>\n", " <tr>\n", " <th>sex</th>\n", " <th>smoker</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">Male</th>\n", " <th>Yes</th>\n", " <td>3.051167</td>\n", " <td>3.00</td>\n", " <td>22.284500</td>\n", " <td>20.39</td>\n", " </tr>\n", " <tr>\n", " <th>No</th>\n", " <td>3.113402</td>\n", " <td>2.74</td>\n", " <td>19.791237</td>\n", " <td>18.24</td>\n", " </tr>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">Female</th>\n", " <th>Yes</th>\n", " <td>2.931515</td>\n", " <td>2.88</td>\n", " <td>17.977879</td>\n", " <td>16.27</td>\n", " </tr>\n", " <tr>\n", " <th>No</th>\n", " <td>2.773519</td>\n", " <td>2.68</td>\n", " <td>18.105185</td>\n", " <td>16.69</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " mean median mean median\n", " tip tip total_bill total_bill\n", "sex smoker \n", "Male Yes 3.051167 3.00 22.284500 20.39\n", " No 3.113402 2.74 19.791237 18.24\n", "Female Yes 2.931515 2.88 17.977879 16.27\n", " No 2.773519 2.68 18.105185 16.69" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm.swaplevel(axis=1)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead tr th {\n", " text-align: left;\n", " }\n", "\n", " .dataframe thead tr:last-of-type th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr>\n", " <th></th>\n", " <th></th>\n", " <th colspan=\"2\" halign=\"left\">mean</th>\n", " <th colspan=\"2\" halign=\"left\">median</th>\n", " </tr>\n", " <tr>\n", " <th></th>\n", " <th></th>\n", " <th>tip</th>\n", " <th>total_bill</th>\n", " <th>tip</th>\n", " <th>total_bill</th>\n", " </tr>\n", " <tr>\n", " <th>sex</th>\n", " <th>smoker</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">Male</th>\n", " <th>Yes</th>\n", " <td>3.051167</td>\n", " <td>22.284500</td>\n", " <td>3.00</td>\n", " <td>20.39</td>\n", " </tr>\n", " <tr>\n", " <th>No</th>\n", " <td>3.113402</td>\n", " <td>19.791237</td>\n", " <td>2.74</td>\n", " <td>18.24</td>\n", " </tr>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">Female</th>\n", " <th>Yes</th>\n", " <td>2.931515</td>\n", " <td>17.977879</td>\n", " <td>2.88</td>\n", " <td>16.27</td>\n", " </tr>\n", " <tr>\n", " <th>No</th>\n", " <td>2.773519</td>\n", " <td>18.105185</td>\n", " <td>2.68</td>\n", " <td>16.69</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " mean median \n", " tip total_bill tip total_bill\n", "sex smoker \n", "Male Yes 3.051167 22.284500 3.00 20.39\n", " No 3.113402 19.791237 2.74 18.24\n", "Female Yes 2.931515 17.977879 2.88 16.27\n", " No 2.773519 18.105185 2.68 16.69" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# If we then want the columns sorted by the new index, we can do that explicitly:\n", "tm.swaplevel(axis=1).sort_index(axis=1)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead tr th {\n", " text-align: left;\n", " }\n", "\n", " .dataframe thead tr:last-of-type th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr>\n", " <th></th>\n", " <th></th>\n", " <th colspan=\"2\" halign=\"left\">tip</th>\n", " <th colspan=\"2\" halign=\"left\">total_bill</th>\n", " </tr>\n", " <tr>\n", " <th></th>\n", " <th></th>\n", " <th>mean</th>\n", " <th>median</th>\n", " <th>mean</th>\n", " <th>median</th>\n", " </tr>\n", " <tr>\n", " <th>smoker</th>\n", " <th>sex</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">Yes</th>\n", " <th>Male</th>\n", " <td>3.051167</td>\n", " <td>3.00</td>\n", " <td>22.284500</td>\n", " <td>20.39</td>\n", " </tr>\n", " <tr>\n", " <th>Female</th>\n", " <td>2.931515</td>\n", " <td>2.88</td>\n", " <td>17.977879</td>\n", " <td>16.27</td>\n", " </tr>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">No</th>\n", " <th>Male</th>\n", " <td>3.113402</td>\n", " <td>2.74</td>\n", " <td>19.791237</td>\n", " <td>18.24</td>\n", " </tr>\n", " <tr>\n", " <th>Female</th>\n", " <td>2.773519</td>\n", " <td>2.68</td>\n", " <td>18.105185</td>\n", " <td>16.69</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " tip total_bill \n", " mean median mean median\n", "smoker sex \n", "Yes Male 3.051167 3.00 22.284500 20.39\n", " Female 2.931515 2.88 17.977879 16.27\n", "No Male 3.113402 2.74 19.791237 18.24\n", " Female 2.773519 2.68 18.105185 16.69" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The same thing works on the rows:\n", "tm.swaplevel(axis=0).sort_index(axis=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we had more than two levels, we could specify which two we wanted to swap with additional arguments." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some tools (including Altair) can't use data with hierarchical column names at all, so they have to be flattened. There's no built-in function for doing this, but the following idiom seems standard:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th></th>\n", " <th>tip.mean</th>\n", " <th>tip.median</th>\n", " <th>total_bill.mean</th>\n", " <th>total_bill.median</th>\n", " </tr>\n", " <tr>\n", " <th>sex</th>\n", " <th>smoker</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">Male</th>\n", " <th>Yes</th>\n", " <td>3.051167</td>\n", " <td>3.00</td>\n", " <td>22.284500</td>\n", " <td>20.39</td>\n", " </tr>\n", " <tr>\n", " <th>No</th>\n", " <td>3.113402</td>\n", " <td>2.74</td>\n", " <td>19.791237</td>\n", " <td>18.24</td>\n", " </tr>\n", " <tr>\n", " <th rowspan=\"2\" valign=\"top\">Female</th>\n", " <th>Yes</th>\n", " <td>2.931515</td>\n", " <td>2.88</td>\n", " <td>17.977879</td>\n", " <td>16.27</td>\n", " </tr>\n", " <tr>\n", " <th>No</th>\n", " <td>2.773519</td>\n", " <td>2.68</td>\n", " <td>18.105185</td>\n", " <td>16.69</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " tip.mean tip.median total_bill.mean total_bill.median\n", "sex smoker \n", "Male Yes 3.051167 3.00 22.284500 20.39\n", " No 3.113402 2.74 19.791237 18.24\n", "Female Yes 2.931515 2.88 17.977879 16.27\n", " No 2.773519 2.68 18.105185 16.69" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm.columns = [c[0] + \".\" + c[1] for c in tm.columns]\n", "tm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set Membership" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is an `.isin` function for quickly checking set membership." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>total_bill</th>\n", " <th>tip</th>\n", " <th>sex</th>\n", " <th>smoker</th>\n", " <th>day</th>\n", " <th>time</th>\n", " <th>size</th>\n", " <th>weekend</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>239</th>\n", " <td>29.03</td>\n", " <td>5.92</td>\n", " <td>Male</td>\n", " <td>No</td>\n", " <td>Sat</td>\n", " <td>Dinner</td>\n", " <td>3</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>240</th>\n", " <td>27.18</td>\n", " <td>2.00</td>\n", " <td>Female</td>\n", " <td>Yes</td>\n", " <td>Sat</td>\n", " <td>Dinner</td>\n", " <td>2</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>241</th>\n", " <td>22.67</td>\n", " <td>2.00</td>\n", " <td>Male</td>\n", " <td>Yes</td>\n", " <td>Sat</td>\n", " <td>Dinner</td>\n", " <td>2</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>242</th>\n", " <td>17.82</td>\n", " <td>1.75</td>\n", " <td>Male</td>\n", " <td>No</td>\n", " <td>Sat</td>\n", " <td>Dinner</td>\n", " <td>2</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>243</th>\n", " <td>18.78</td>\n", " <td>3.00</td>\n", " <td>Female</td>\n", " <td>No</td>\n", " <td>Thur</td>\n", " <td>Dinner</td>\n", " <td>2</td>\n", " <td>False</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " total_bill tip sex smoker day time size weekend\n", "239 29.03 5.92 Male No Sat Dinner 3 True\n", "240 27.18 2.00 Female Yes Sat Dinner 2 True\n", "241 22.67 2.00 Male Yes Sat Dinner 2 True\n", "242 17.82 1.75 Male No Sat Dinner 2 True\n", "243 18.78 3.00 Female No Thur Dinner 2 False" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips['weekend'] = tips.day.isin(['Sat','Sun'])\n", "tips.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Performance of `.isin` is good compared to the alternatives:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.32 ms ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n", "3.92 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], "source": [ "%timeit -n100 diamonds.color.isin(['D','E','F'])\n", "%timeit -n100 diamonds.eval('color in [\"D\",\"E\",\"F\"]')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Method Chaining Helpers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to make it easier to create chains of manipulation functions, there is an `assign()` which creates new columns. It and the array-index filtering both take lambdas, which let you refer to an intermediate result that doesn't have a name." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>total_bill</th>\n", " <th>tip</th>\n", " <th>sex</th>\n", " <th>smoker</th>\n", " <th>day</th>\n", " <th>time</th>\n", " <th>size</th>\n", " <th>weekend</th>\n", " <th>tip_pct</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>172</th>\n", " <td>7.25</td>\n", " <td>5.15</td>\n", " <td>Male</td>\n", " <td>Yes</td>\n", " <td>Sun</td>\n", " <td>Dinner</td>\n", " <td>2</td>\n", " <td>True</td>\n", " <td>71.034483</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " total_bill tip sex smoker day time size weekend tip_pct\n", "172 7.25 5.15 Male Yes Sun Dinner 2 True 71.034483" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.assign(tip_pct=lambda x: 100*x.tip/x.total_bill) \\\n", " [lambda x: x.tip_pct > 70]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is also a \"rename\" for changing column names without needing to assign to the `.columns` or `.index` property of a named variable." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>total_bill</th>\n", " <th>tip</th>\n", " <th>sex</th>\n", " <th>smoker</th>\n", " <th>dayofweek</th>\n", " <th>time</th>\n", " <th>size</th>\n", " <th>weekend</th>\n", " <th>tip_pct</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>172</th>\n", " <td>7.25</td>\n", " <td>5.15</td>\n", " <td>Male</td>\n", " <td>Yes</td>\n", " <td>Sun</td>\n", " <td>Dinner</td>\n", " <td>2</td>\n", " <td>True</td>\n", " <td>71.034483</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " total_bill tip sex smoker dayofweek time size weekend tip_pct\n", "172 7.25 5.15 Male Yes Sun Dinner 2 True 71.034483" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.assign(tip_pct=lambda x: 100*x.tip/x.total_bill) \\\n", " [lambda x: x.tip_pct > 70].rename({'day':'dayofweek'},axis='columns')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Sorting and Ranking" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## sort_values and sort_index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a sort_values which goes with sort_index():" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>tip</th>\n", " <th>sex</th>\n", " <th>smoker</th>\n", " <th>day</th>\n", " <th>time</th>\n", " <th>size</th>\n", " <th>weekend</th>\n", " </tr>\n", " <tr>\n", " <th>total_bill</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>3.07</th>\n", " <td>1.00</td>\n", " <td>Female</td>\n", " <td>Yes</td>\n", " <td>Sat</td>\n", " <td>Dinner</td>\n", " <td>1</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>5.75</th>\n", " <td>1.00</td>\n", " <td>Female</td>\n", " <td>Yes</td>\n", " <td>Fri</td>\n", " <td>Dinner</td>\n", " <td>2</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>7.25</th>\n", " <td>1.00</td>\n", " <td>Female</td>\n", " <td>No</td>\n", " <td>Sat</td>\n", " <td>Dinner</td>\n", " <td>1</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>7.25</th>\n", " <td>5.15</td>\n", " <td>Male</td>\n", " <td>Yes</td>\n", " <td>Sun</td>\n", " <td>Dinner</td>\n", " <td>2</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>7.51</th>\n", " <td>2.00</td>\n", " <td>Male</td>\n", " <td>No</td>\n", " <td>Thur</td>\n", " <td>Lunch</td>\n", " <td>2</td>\n", " <td>False</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " tip sex smoker day time size weekend\n", "total_bill \n", "3.07 1.00 Female Yes Sat Dinner 1 True\n", "5.75 1.00 Female Yes Fri Dinner 2 False\n", "7.25 1.00 Female No Sat Dinner 1 True\n", "7.25 5.15 Male Yes Sun Dinner 2 True\n", "7.51 2.00 Male No Thur Lunch 2 False" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.set_index('total_bill').sort_index().head()" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>tip</th>\n", " <th>sex</th>\n", " <th>smoker</th>\n", " <th>day</th>\n", " <th>time</th>\n", " <th>size</th>\n", " <th>weekend</th>\n", " </tr>\n", " <tr>\n", " <th>total_bill</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>3.07</th>\n", " <td>1.00</td>\n", " <td>Female</td>\n", " <td>Yes</td>\n", " <td>Sat</td>\n", " <td>Dinner</td>\n", " <td>1</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>12.60</th>\n", " <td>1.00</td>\n", " <td>Male</td>\n", " <td>Yes</td>\n", " <td>Sat</td>\n", " <td>Dinner</td>\n", " <td>2</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>5.75</th>\n", " <td>1.00</td>\n", " <td>Female</td>\n", " <td>Yes</td>\n", " <td>Fri</td>\n", " <td>Dinner</td>\n", " <td>2</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>7.25</th>\n", " <td>1.00</td>\n", " <td>Female</td>\n", " <td>No</td>\n", " <td>Sat</td>\n", " <td>Dinner</td>\n", " <td>1</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>16.99</th>\n", " <td>1.01</td>\n", " <td>Female</td>\n", " <td>No</td>\n", " <td>Sun</td>\n", " <td>Dinner</td>\n", " <td>2</td>\n", " <td>True</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " tip sex smoker day time size weekend\n", "total_bill \n", "3.07 1.00 Female Yes Sat Dinner 1 True\n", "12.60 1.00 Male Yes Sat Dinner 2 True\n", "5.75 1.00 Female Yes Fri Dinner 2 False\n", "7.25 1.00 Female No Sat Dinner 1 True\n", "16.99 1.01 Female No Sun Dinner 2 True" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.set_index('total_bill').sort_values('tip').head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ranking\n", "\n", "is in the `.rank()` member function. The usual options for method (min, max, dense, etc) are available as an argument to `rank()`.\n", "\n", "Note that `pct` actually gives numbers between 0 and 1, not 0 and 100. Pandas is very sloppy generally about the meaning of \"percent\"." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>total_bill</th>\n", " <th>tip</th>\n", " <th>sex</th>\n", " <th>smoker</th>\n", " <th>day</th>\n", " <th>time</th>\n", " <th>size</th>\n", " <th>weekend</th>\n", " <th>tip_rank</th>\n", " <th>tip_pct</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>16.99</td>\n", " <td>1.01</td>\n", " <td>Female</td>\n", " <td>No</td>\n", " <td>Sun</td>\n", " <td>Dinner</td>\n", " <td>2</td>\n", " <td>True</td>\n", " <td>5.0</td>\n", " <td>0.020492</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>10.34</td>\n", " <td>1.66</td>\n", " <td>Male</td>\n", " <td>No</td>\n", " <td>Sun</td>\n", " <td>Dinner</td>\n", " <td>3</td>\n", " <td>True</td>\n", " <td>33.0</td>\n", " <td>0.135246</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>21.01</td>\n", " <td>3.50</td>\n", " <td>Male</td>\n", " <td>No</td>\n", " <td>Sun</td>\n", " <td>Dinner</td>\n", " <td>3</td>\n", " <td>True</td>\n", " <td>177.0</td>\n", " <td>0.725410</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>23.68</td>\n", " <td>3.31</td>\n", " <td>Male</td>\n", " <td>No</td>\n", " <td>Sun</td>\n", " <td>Dinner</td>\n", " <td>2</td>\n", " <td>True</td>\n", " <td>165.0</td>\n", " <td>0.676230</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>24.59</td>\n", " <td>3.61</td>\n", " <td>Female</td>\n", " <td>No</td>\n", " <td>Sun</td>\n", " <td>Dinner</td>\n", " <td>4</td>\n", " <td>True</td>\n", " <td>185.0</td>\n", " <td>0.758197</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " total_bill tip sex smoker day time size weekend tip_rank \\\n", "0 16.99 1.01 Female No Sun Dinner 2 True 5.0 \n", "1 10.34 1.66 Male No Sun Dinner 3 True 33.0 \n", "2 21.01 3.50 Male No Sun Dinner 3 True 177.0 \n", "3 23.68 3.31 Male No Sun Dinner 2 True 165.0 \n", "4 24.59 3.61 Female No Sun Dinner 4 True 185.0 \n", "\n", " tip_pct \n", "0 0.020492 \n", "1 0.135246 \n", "2 0.725410 \n", "3 0.676230 \n", "4 0.758197 " ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.assign(tip_rank=tips.tip.rank(), tip_pct = tips.tip.rank(pct=True)).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Replacing Values" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>2.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>3</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>5</td>\n", " <td>6.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 1 2.0\n", "1 3 NaN\n", "2 5 6.0" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Note that column b gets promoted from integer to float because NaN cannot be stored in an integer type in Numpy\n", "d = pd.DataFrame([{'a':1, 'b':2}, {'a':3, 'b':np.NaN}, {'a':5, 'b': 6}])\n", "d" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can replace individual values with `map`, which takes a dict or a lambda. It operates on only one column at a time." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>99.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>3</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>5</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 1 99.0\n", "1 3 NaN\n", "2 5 NaN" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d.assign(b=d.b.map({2:99}))" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>99.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>3</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>5</td>\n", " <td>6.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 1 99.0\n", "1 3 NaN\n", "2 5 6.0" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d.assign(b=d.b.map(lambda x: 99 if x==2 else x))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can run a map on all columns with `applymap`:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>99.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>3</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>5</td>\n", " <td>6.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 1 99.0\n", "1 3 NaN\n", "2 5 6.0" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d.applymap(lambda x: 99 if x==2 else x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can fill NAs with `fillna`, which optionally takes column-specific defaults:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>2.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>3</td>\n", " <td>-99.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>5</td>\n", " <td>6.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 1 2.0\n", "1 3 -99.0\n", "2 5 6.0" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d.fillna({'b':-99})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That makes it especially convenient to do something like this:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>2.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>3</td>\n", " <td>4.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>5</td>\n", " <td>6.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 1 2.0\n", "1 3 4.0\n", "2 5 6.0" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d.fillna(d.mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`.combine_first` is like a version of coalesce which works at a full column or dataframe level." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>2.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>3</td>\n", " <td>3.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>5</td>\n", " <td>6.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 1 2.0\n", "1 3 3.0\n", "2 5 6.0" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This fills in the value of column a into column b where there is a missing value:\n", "d.assign(b=d.b.combine_first(d.a))" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>-99</td>\n", " <td>-98</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>-99</td>\n", " <td>-98</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>-99</td>\n", " <td>-98</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 -99 -98\n", "1 -99 -98\n", "2 -99 -98" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "e=pd.DataFrame([{'a':-99, 'b':-98}]*3)\n", "e" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>2.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>3</td>\n", " <td>-98.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>5</td>\n", " <td>6.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 1 2.0\n", "1 3 -98.0\n", "2 5 6.0" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This does the same thing at the full dataframe level instead of a single column:\n", "d.combine_first(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Categories\n", "\n", "Just like `.str` exposes special functions for strings, `.cat` exposes special functions for categorical variables. \n", "\n", "Let's make some categorical variables with cut (there is variant qcut, which bins by equal quantiles instead of equal width)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>price</th>\n", " </tr>\n", " <tr>\n", " <th>carat</th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>(0.0, 0.5]</th>\n", " <td>839.718149</td>\n", " </tr>\n", " <tr>\n", " <th>(0.5, 1.0]</th>\n", " <td>2811.342683</td>\n", " </tr>\n", " <tr>\n", " <th>(1.0, 1.5]</th>\n", " <td>6513.526534</td>\n", " </tr>\n", " <tr>\n", " <th>(1.5, 2.0]</th>\n", " <td>11321.774838</td>\n", " </tr>\n", " <tr>\n", " <th>(2.0, 2.5]</th>\n", " <td>14918.141237</td>\n", " </tr>\n", " <tr>\n", " <th>(2.5, 3.0]</th>\n", " <td>15472.904255</td>\n", " </tr>\n", " <tr>\n", " <th>(3.0, 3.5]</th>\n", " <td>14822.000000</td>\n", " </tr>\n", " <tr>\n", " <th>(3.5, 4.0]</th>\n", " <td>15636.500000</td>\n", " </tr>\n", " <tr>\n", " <th>(4.0, 4.5]</th>\n", " <td>16576.500000</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " price\n", "carat \n", "(0.0, 0.5] 839.718149\n", "(0.5, 1.0] 2811.342683\n", "(1.0, 1.5] 6513.526534\n", "(1.5, 2.0] 11321.774838\n", "(2.0, 2.5] 14918.141237\n", "(2.5, 3.0] 15472.904255\n", "(3.0, 3.5] 14822.000000\n", "(3.5, 4.0] 15636.500000\n", "(4.0, 4.5] 16576.500000" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dc = diamonds.groupby(pd.cut(diamonds['carat'],np.arange(0,5,.5))).\\\n", " aggregate({'price':'mean'})\n", "dc" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CategoricalDtype(categories=[(0.0, 0.5], (0.5, 1.0], (1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5]]\n", " ordered=True)" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We can move the categorical index back into a column and see that it has type Categorical\n", "dc=dc.reset_index()\n", "dc.carat.dtype" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 2\n", "3 3\n", "4 4\n", "5 5\n", "6 6\n", "7 7\n", "8 8\n", "dtype: int8" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Try tab-completing on dc.carat.cat.\n", "\n", "# This gives us the integer values\n", "dc.carat.cat.codes" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "IntervalIndex([(0.0, 0.5], (0.5, 1.0], (1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5]]\n", " closed='right',\n", " dtype='interval[float64]')" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This gives us the labels\n", "dc.carat.cat.categories" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This is a metadata flag indicating whether the category order is semantically meaningful\n", "dc.carat.cat.ordered" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__There is a function to reorder categories, but it appears to have some bugs. Look at these examples:__" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 (0.0, 0.5]\n", "1 (1.0, 1.5]\n", "2 (0.5, 1.0]\n", "3 (0.5, 1.0]\n", "4 (1.5, 2.0]\n", "5 (2.0, 2.5]\n", "6 (3.0, 3.5]\n", "7 (2.5, 3.0]\n", "8 (2.5, 3.0]\n", "Name: carat, dtype: category\n", "Categories (9, interval[float64]): [(0.0, 0.5] < (1.0, 1.5] < (0.5, 1.0] < (1.5, 2.0] ... (3.0, 3.5] < (2.5, 3.0] < (3.5, 4.0] < (4.0, 4.5]]" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# You can reorder categories:\n", "dc.carat.cat.reorder_categories(dc.carat.cat.categories[[0,2,1,3,4,6,5,7,8]])" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 NaN\n", "1 NaN\n", "2 NaN\n", "3 NaN\n", "4 NaN\n", "5 NaN\n", "6 NaN\n", "7 NaN\n", "8 NaN\n", "Name: carat, dtype: category\n", "Categories (9, interval[float64]): [(4.0, 4.5] < (3.5, 4.0] < (3.0, 3.5] < (2.5, 3.0] ... (1.5, 2.0] < (1.0, 1.5] < (0.5, 1.0] < (0.0, 0.5]]" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dc.carat.cat.reorder_categories(dc.carat.cat.categories[::-1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I submitted a bug:\n", "https://github.com/pandas-dev/pandas/issues/23452" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# What does \"percent\" mean?\n", "\n", "Python libraries are disappointingly sloppy about using the word \"percent\" correctly. \"Cent\" is 100 and percents are supposed to be on a scale of 100, but often the word is used very shoddily on the scale of 1 instead. Here's an example:" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>x</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>-0.5</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " x\n", "0 NaN\n", "1 0.0\n", "2 1.0\n", "3 -0.5" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame({'x':[1,1,2,1]}).pct_change()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This should be a 100% increase from 1 to 2, and then a 50% decrease from 2 back to 1, but it's actually 1.0 and -0.5. It's _not a percent_." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same thing is true in `stats`, for example with the \"percent point function\" which actually goes from 0 to 1, not 0 to 100:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([-1.96, 1.96])" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats.norm.ppf([0.025,0.975]).round(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Please do not follow these bad examples. The word \"percent\" does have a meaning." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# ToDo" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "These are some things I intend to write about but haven't gotten to yet:" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ ".corr, .cov, .corrwith \n", ".duplicated, .drop_duplicates \n", ".sample(replace=), .take \n", ".get_dummies" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "255px" }, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }