{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Grouping\n", "\n", "A common need is bound to arise where you will need to look at an aggregate view of a `DataFrame` by a certain value. This is where grouping comes in.\n", "\n", "CashBox has asked that we produce a list of the top 10 users who have been on the receiving side of transactions the most. They would like to see the user's first and last name, their email, and the total number of transactions where the user was the receiver.\n", "\n", "We can acheive this by grouping our data. We need to take a look at the **`transactions`** `DataFrame` and group it by the **`receiver`** field, which is the username.\n", "\n", "Grouping on a value returns a new type of object called the [`GroupBy`](https://pandas.pydata.org/pandas-docs/stable/api.html#groupby).\n", "\n", "Let's first explore how to create one of these, and then how to wield it's power." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(475, 998)" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Setup\n", "import os\n", "\n", "import pandas as pd\n", "\n", "pd.options.display.max_rows = 10\n", "users = pd.read_csv(os.path.join('data', 'users.csv'), index_col=0)\n", "transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)\n", "# Sanity check\n", "(len(users), len(transactions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's remind ourselves about the types of data we have in the **`transactions`** `DataFrame`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sender object\n", "receiver object\n", "amount float64\n", "sent_date object\n", "dtype: object" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transactions.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Grouping by a specific column is pretty straight forward. We want to group by the receiver, so we use the [`DataFrame.groupby`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) method." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.groupby.groupby.DataFrameGroupBy" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_by_receiver = transactions.groupby('receiver')\n", "\n", "# Let's see what type of object we got back\n", "type(grouped_by_receiver)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We received a [`DataFrameGroupBy`](https://pandas.pydata.org/pandas-docs/stable/api.html#groupby) object. There are quite a few methods here.\n", "\n", "Let's take a look first at [`GroupBy.size`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.size.html). This will return a `Series` of how many members are in each of the groups. In our case this is the number of transactions that each user received." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "receiver\n", "aaron 6\n", "acook 1\n", "adam.saunders 2\n", "adrian 3\n", "adrian.blair 7\n", " ..\n", "wilson 2\n", "wking 2\n", "wright3590 4\n", "young 2\n", "zachary.neal 4\n", "Length: 410, dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Returns a Series of total number of rows\n", "grouped_by_receiver.size()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, we can use the `DataFrameGroupBy.count` method to see counts of how many non missing data points we have across each column in our group across the columns of our `DataFrame`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
senderamountsent_date
receiver
aaron666
acook111
adam.saunders222
adrian333
adrian.blair777
............
wilson222
wking222
wright3590444
young222
zachary.neal444
\n", "

410 rows × 3 columns

\n", "
" ], "text/plain": [ " sender amount sent_date\n", "receiver \n", "aaron 6 6 6\n", "acook 1 1 1\n", "adam.saunders 2 2 2\n", "adrian 3 3 3\n", "adrian.blair 7 7 7\n", "... ... ... ...\n", "wilson 2 2 2\n", "wking 2 2 2\n", "wright3590 4 4 4\n", "young 2 2 2\n", "zachary.neal 4 4 4\n", "\n", "[410 rows x 3 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_by_receiver.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `GroupBy` object provides aggregate functions that makes getting calculations quick and seamless. For instance, if we use the `GroupBy.sum` method we can see each numeric column summed up for each grouping. In our case there is only one numeric column **`amount`**." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
amount
receiver
aaron366.15
acook94.65
adam.saunders101.15
adrian124.36
adrian.blair462.88
......
wilson44.39
wking74.07
wright3590195.45
young83.57
zachary.neal186.01
\n", "

410 rows × 1 columns

\n", "
" ], "text/plain": [ " amount\n", "receiver \n", "aaron 366.15\n", "acook 94.65\n", "adam.saunders 101.15\n", "adrian 124.36\n", "adrian.blair 462.88\n", "... ...\n", "wilson 44.39\n", "wking 74.07\n", "wright3590 195.45\n", "young 83.57\n", "zachary.neal 186.01\n", "\n", "[410 rows x 1 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_by_receiver.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now where were we? Oh right, we're trying to figure out the people who was received the most transactions. So why don't we use this group to create a new column on our **`users`** `DataFrame`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "65" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a new column in users called transaction count, and set the values to the size of the matching group\n", "users['transaction_count'] = grouped_by_receiver.size()\n", "\n", "# Not every user has made a transaction, let's see what kind of missing data we are dealing with\n", "len(users[users.transaction_count.isna()])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we don't have a transaction record for everyone, not every user will be in our grouping. So when we created the new column, we ended up adding some `np.nan`. Let's fix that." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
first_namelast_nameemailemail_verifiedsignup_datereferral_countbalancetransaction_count
aaronAaronDavisaaron6348@gmail.comTrue2018-08-31618.146.0
acookAnthonyCookcook@gmail.comTrue2018-05-12255.451.0
adam.saundersAdamSaundersadam@gmail.comFalse2018-05-29372.122.0
adrianAdrianFangadrian.fang@teamtreehouse.comTrue2018-04-28330.013.0
adrian.blairAdrianBlairadrian9335@gmail.comTrue2018-06-16725.857.0
...........................
wilsonRobertWilsonrobert@yahoo.comFalse2018-05-16559.752.0
wkingWandaKingwanda.king@holt.comTrue2018-06-01267.082.0
wright3590JacquelineWrightjacqueline.wright@gonzalez.comTrue2018-02-08618.484.0
youngJessicaYoungjessica4028@yahoo.comTrue2018-07-17475.392.0
zachary.nealZacharyNealzneal@gmail.comTrue2018-07-26139.904.0
\n", "

475 rows × 8 columns

\n", "
" ], "text/plain": [ " first_name last_name email \\\n", "aaron Aaron Davis aaron6348@gmail.com \n", "acook Anthony Cook cook@gmail.com \n", "adam.saunders Adam Saunders adam@gmail.com \n", "adrian Adrian Fang adrian.fang@teamtreehouse.com \n", "adrian.blair Adrian Blair adrian9335@gmail.com \n", "... ... ... ... \n", "wilson Robert Wilson robert@yahoo.com \n", "wking Wanda King wanda.king@holt.com \n", "wright3590 Jacqueline Wright jacqueline.wright@gonzalez.com \n", "young Jessica Young jessica4028@yahoo.com \n", "zachary.neal Zachary Neal zneal@gmail.com \n", "\n", " email_verified signup_date referral_count balance \\\n", "aaron True 2018-08-31 6 18.14 \n", "acook True 2018-05-12 2 55.45 \n", "adam.saunders False 2018-05-29 3 72.12 \n", "adrian True 2018-04-28 3 30.01 \n", "adrian.blair True 2018-06-16 7 25.85 \n", "... ... ... ... ... \n", "wilson False 2018-05-16 5 59.75 \n", "wking True 2018-06-01 2 67.08 \n", "wright3590 True 2018-02-08 6 18.48 \n", "young True 2018-07-17 4 75.39 \n", "zachary.neal True 2018-07-26 1 39.90 \n", "\n", " transaction_count \n", "aaron 6.0 \n", "acook 1.0 \n", "adam.saunders 2.0 \n", "adrian 3.0 \n", "adrian.blair 7.0 \n", "... ... \n", "wilson 2.0 \n", "wking 2.0 \n", "wright3590 4.0 \n", "young 2.0 \n", "zachary.neal 4.0 \n", "\n", "[475 rows x 8 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Set all missing data to 0, since in reality, there have been 0 received transactions for this user\n", "users.transaction_count.fillna(0, inplace=True)\n", "users" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check it out! There's our column, but it's a floating point number, we don't need that. Let's convert it!" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Convert from the default type of float64 to int64 (no precision needed)\n", "users.transaction_count = users.transaction_count.astype('int64')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally we want to get the user with the highest transaction count, so let's sort by that descending." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
first_namelast_nameemailtransaction_count
scott3928ScottNaNscott@yahoo.com9
sfinleySamuelFinleysamuel@gmail.com8
adrian.blairAdrianBlairadrian9335@gmail.com7
hdeleonHannahDeleonhannah@yahoo.com7
miranda6426MirandaRogersmiranda.rogers@gmail.com7
aaronAaronDavisaaron6348@gmail.com6
coreyCoreyFullerfuller8100@yahoo.com6
heatherHeatherRayhray@yahoo.com6
jennifer.hebertJenniferHebertjennifer.hebert@yahoo.com6
edwardsMichaelEdwardsedwards5456@gmail.com6
\n", "
" ], "text/plain": [ " first_name last_name email \\\n", "scott3928 Scott NaN scott@yahoo.com \n", "sfinley Samuel Finley samuel@gmail.com \n", "adrian.blair Adrian Blair adrian9335@gmail.com \n", "hdeleon Hannah Deleon hannah@yahoo.com \n", "miranda6426 Miranda Rogers miranda.rogers@gmail.com \n", "aaron Aaron Davis aaron6348@gmail.com \n", "corey Corey Fuller fuller8100@yahoo.com \n", "heather Heather Ray hray@yahoo.com \n", "jennifer.hebert Jennifer Hebert jennifer.hebert@yahoo.com \n", "edwards Michael Edwards edwards5456@gmail.com \n", "\n", " transaction_count \n", "scott3928 9 \n", "sfinley 8 \n", "adrian.blair 7 \n", "hdeleon 7 \n", "miranda6426 7 \n", "aaron 6 \n", "corey 6 \n", "heather 6 \n", "jennifer.hebert 6 \n", "edwards 6 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Sort our values by the new field descending (so the largest comes first), and then by first name ascending\n", "users.sort_values(\n", " ['transaction_count', 'first_name'],\n", " ascending=[False, True],\n", " inplace=True\n", ")\n", "# Take a look at our top 10 receivers, showing only the columns we want\n", "users.loc[:, ['first_name', 'last_name', 'email', 'transaction_count']].head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here they are, the Top 10 Receivers! Nice work putting all those skills together!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }