{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Grouping\n", "\n", "A common need is bound to arise where you will need to look at an aggregate view of a `DataFrame` by a certain value. This is where grouping comes in.\n", "\n", "CashBox has asked that we produce a list of the top 10 users who have been on the receiving side of transactions the most. They would like to see the user's first and last name, their email, and the total number of transactions where the user was the receiver.\n", "\n", "We can acheive this by grouping our data. We need to take a look at the **`transactions`** `DataFrame` and group it by the **`receiver`** field, which is the username.\n", "\n", "Grouping on a value returns a new type of object called the [`GroupBy`](https://pandas.pydata.org/pandas-docs/stable/api.html#groupby).\n", "\n", "Let's first explore how to create one of these, and then how to wield it's power." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(475, 998)" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Setup\n", "import os\n", "\n", "import pandas as pd\n", "\n", "pd.options.display.max_rows = 10\n", "users = pd.read_csv(os.path.join('data', 'users.csv'), index_col=0)\n", "transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)\n", "# Sanity check\n", "(len(users), len(transactions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's remind ourselves about the types of data we have in the **`transactions`** `DataFrame`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sender object\n", "receiver object\n", "amount float64\n", "sent_date object\n", "dtype: object" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transactions.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Grouping by a specific column is pretty straight forward. We want to group by the receiver, so we use the [`DataFrame.groupby`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) method." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.groupby.groupby.DataFrameGroupBy" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_by_receiver = transactions.groupby('receiver')\n", "\n", "# Let's see what type of object we got back\n", "type(grouped_by_receiver)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We received a [`DataFrameGroupBy`](https://pandas.pydata.org/pandas-docs/stable/api.html#groupby) object. There are quite a few methods here.\n", "\n", "Let's take a look first at [`GroupBy.size`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.size.html). This will return a `Series` of how many members are in each of the groups. In our case this is the number of transactions that each user received." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "receiver\n", "aaron 6\n", "acook 1\n", "adam.saunders 2\n", "adrian 3\n", "adrian.blair 7\n", " ..\n", "wilson 2\n", "wking 2\n", "wright3590 4\n", "young 2\n", "zachary.neal 4\n", "Length: 410, dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Returns a Series of total number of rows\n", "grouped_by_receiver.size()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, we can use the `DataFrameGroupBy.count` method to see counts of how many non missing data points we have across each column in our group across the columns of our `DataFrame`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | sender | \n", "amount | \n", "sent_date | \n", "
---|---|---|---|
receiver | \n", "\n", " | \n", " | \n", " |
aaron | \n", "6 | \n", "6 | \n", "6 | \n", "
acook | \n", "1 | \n", "1 | \n", "1 | \n", "
adam.saunders | \n", "2 | \n", "2 | \n", "2 | \n", "
adrian | \n", "3 | \n", "3 | \n", "3 | \n", "
adrian.blair | \n", "7 | \n", "7 | \n", "7 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "
wilson | \n", "2 | \n", "2 | \n", "2 | \n", "
wking | \n", "2 | \n", "2 | \n", "2 | \n", "
wright3590 | \n", "4 | \n", "4 | \n", "4 | \n", "
young | \n", "2 | \n", "2 | \n", "2 | \n", "
zachary.neal | \n", "4 | \n", "4 | \n", "4 | \n", "
410 rows × 3 columns
\n", "\n", " | amount | \n", "
---|---|
receiver | \n", "\n", " |
aaron | \n", "366.15 | \n", "
acook | \n", "94.65 | \n", "
adam.saunders | \n", "101.15 | \n", "
adrian | \n", "124.36 | \n", "
adrian.blair | \n", "462.88 | \n", "
... | \n", "... | \n", "
wilson | \n", "44.39 | \n", "
wking | \n", "74.07 | \n", "
wright3590 | \n", "195.45 | \n", "
young | \n", "83.57 | \n", "
zachary.neal | \n", "186.01 | \n", "
410 rows × 1 columns
\n", "\n", " | first_name | \n", "last_name | \n", "email_verified | \n", "signup_date | \n", "referral_count | \n", "balance | \n", "transaction_count | \n", "|
---|---|---|---|---|---|---|---|---|
aaron | \n", "Aaron | \n", "Davis | \n", "aaron6348@gmail.com | \n", "True | \n", "2018-08-31 | \n", "6 | \n", "18.14 | \n", "6.0 | \n", "
acook | \n", "Anthony | \n", "Cook | \n", "cook@gmail.com | \n", "True | \n", "2018-05-12 | \n", "2 | \n", "55.45 | \n", "1.0 | \n", "
adam.saunders | \n", "Adam | \n", "Saunders | \n", "adam@gmail.com | \n", "False | \n", "2018-05-29 | \n", "3 | \n", "72.12 | \n", "2.0 | \n", "
adrian | \n", "Adrian | \n", "Yang | \n", "adrian.yang@teamtreehouse.com | \n", "True | \n", "2018-04-28 | \n", "3 | \n", "30.01 | \n", "3.0 | \n", "
adrian.blair | \n", "Adrian | \n", "Blair | \n", "adrian9335@gmail.com | \n", "True | \n", "2018-06-16 | \n", "7 | \n", "25.85 | \n", "7.0 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
wilson | \n", "Robert | \n", "Wilson | \n", "robert@yahoo.com | \n", "False | \n", "2018-05-16 | \n", "5 | \n", "59.75 | \n", "2.0 | \n", "
wking | \n", "Wanda | \n", "King | \n", "wanda.king@holt.com | \n", "True | \n", "2018-06-01 | \n", "2 | \n", "67.08 | \n", "2.0 | \n", "
wright3590 | \n", "Jacqueline | \n", "Wright | \n", "jacqueline.wright@gonzalez.com | \n", "True | \n", "2018-02-08 | \n", "6 | \n", "18.48 | \n", "4.0 | \n", "
young | \n", "Jessica | \n", "Young | \n", "jessica4028@yahoo.com | \n", "True | \n", "2018-07-17 | \n", "4 | \n", "75.39 | \n", "2.0 | \n", "
zachary.neal | \n", "Zachary | \n", "Neal | \n", "zneal@gmail.com | \n", "True | \n", "2018-07-26 | \n", "1 | \n", "39.90 | \n", "4.0 | \n", "
475 rows × 8 columns
\n", "\n", " | first_name | \n", "last_name | \n", "transaction_count | \n", "|
---|---|---|---|---|
scott3928 | \n", "Scott | \n", "NaN | \n", "scott@yahoo.com | \n", "9 | \n", "
sfinley | \n", "Samuel | \n", "Finley | \n", "samuel@gmail.com | \n", "8 | \n", "
adrian.blair | \n", "Adrian | \n", "Blair | \n", "adrian9335@gmail.com | \n", "7 | \n", "
hdeleon | \n", "Hannah | \n", "Deleon | \n", "hannah@yahoo.com | \n", "7 | \n", "
miranda6426 | \n", "Miranda | \n", "Rogers | \n", "miranda.rogers@gmail.com | \n", "7 | \n", "
aaron | \n", "Aaron | \n", "Davis | \n", "aaron6348@gmail.com | \n", "6 | \n", "
corey | \n", "Corey | \n", "Fuller | \n", "fuller8100@yahoo.com | \n", "6 | \n", "
heather | \n", "Heather | \n", "Ray | \n", "hray@yahoo.com | \n", "6 | \n", "
jennifer.hebert | \n", "Jennifer | \n", "Hebert | \n", "jennifer.hebert@yahoo.com | \n", "6 | \n", "
edwards | \n", "Michael | \n", "Edwards | \n", "edwards5456@gmail.com | \n", "6 | \n", "