{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Selecting Data\n",
"\n",
"A common need is to grab a subset of records that meet a certain criteria. You can do this by indexing the `DataFrame` much like you've seen done with a `NumPy.ndarray`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"475"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import os\n",
"import pandas as pd\n",
"\n",
"users = pd.read_csv(os.path.join('data', 'users.csv'), index_col=0)\n",
"# Pop out a quick sanity check\n",
"len(users)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"CashBox uses a referral system, everyone you refer will earn you $5 credit. Let's see if we can find everyone who **has not** yet taken advantage of that deal. The number of referrals a user has made is defined in the **`referral_count`** column."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"aaron False\n",
"acook False\n",
"adam.saunders False\n",
"adrian False\n",
"adrian.blair False\n",
"Name: referral_count, dtype: bool"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This vectorized comparison returns a new `Series`, which we are naming so we can use it later\n",
"no_referrals_index = users['referral_count'] < 1\n",
"# See how the boolean `Series` returned includes all rows from the `DataFrame`.\n",
"# The value is the result of each comparison\n",
"no_referrals_index.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the boolean `Series` we just created, **`no_referrals_index`**, we can retrieve all rows where that comparison was True."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" first_name | \n",
" last_name | \n",
" email | \n",
" email_verified | \n",
" signup_date | \n",
" referral_count | \n",
" balance | \n",
"
\n",
" \n",
" \n",
" \n",
" alan9443 | \n",
" Alan | \n",
" Pope | \n",
" pope@hotmail.com | \n",
" True | \n",
" 2018-04-17 | \n",
" 0 | \n",
" 56.09 | \n",
"
\n",
" \n",
" andrew.alvarez | \n",
" Andrew | \n",
" Alvarez | \n",
" aalvarez@hotmail.com | \n",
" False | \n",
" 2018-08-01 | \n",
" 0 | \n",
" 81.66 | \n",
"
\n",
" \n",
" boyer7005 | \n",
" Sara | \n",
" Boyer | \n",
" boyer8636@gmail.com | \n",
" True | \n",
" 2018-07-31 | \n",
" 0 | \n",
" 91.41 | \n",
"
\n",
" \n",
" brandon.gilbert | \n",
" Brandon | \n",
" Gilbert | \n",
" brandon.gilbert@hotmail.com | \n",
" True | \n",
" 2018-04-28 | \n",
" 0 | \n",
" 10.17 | \n",
"
\n",
" \n",
" brooke2027 | \n",
" Brooke | \n",
" NaN | \n",
" brooke6938@gmail.com | \n",
" False | \n",
" 2018-05-23 | \n",
" 0 | \n",
" 7.22 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" first_name last_name email \\\n",
"alan9443 Alan Pope pope@hotmail.com \n",
"andrew.alvarez Andrew Alvarez aalvarez@hotmail.com \n",
"boyer7005 Sara Boyer boyer8636@gmail.com \n",
"brandon.gilbert Brandon Gilbert brandon.gilbert@hotmail.com \n",
"brooke2027 Brooke NaN brooke6938@gmail.com \n",
"\n",
" email_verified signup_date referral_count balance \n",
"alan9443 True 2018-04-17 0 56.09 \n",
"andrew.alvarez False 2018-08-01 0 81.66 \n",
"boyer7005 True 2018-07-31 0 91.41 \n",
"brandon.gilbert True 2018-04-28 0 10.17 \n",
"brooke2027 False 2018-05-23 0 7.22 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"users[no_referrals_index].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Inversed index\n",
"A handy shortcut is to prefix the index with a `~` (tilde). This returns the inverse of the boolean `Series`. While I wish that the `~` was called \"the opposite day\" operator, it is in fact called `bitwise not` operator."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"aaron True\n",
"acook True\n",
"adam.saunders True\n",
"adrian True\n",
"adrian.blair True\n",
"Name: referral_count, dtype: bool"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Careful, double negative here. We don't need no education.\n",
"~no_referrals_index.head()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" first_name | \n",
" last_name | \n",
" email | \n",
" email_verified | \n",
" signup_date | \n",
" referral_count | \n",
" balance | \n",
"
\n",
" \n",
" \n",
" \n",
" aaron | \n",
" Aaron | \n",
" Davis | \n",
" aaron6348@gmail.com | \n",
" True | \n",
" 2018-08-31 | \n",
" 6 | \n",
" 18.14 | \n",
"
\n",
" \n",
" acook | \n",
" Anthony | \n",
" Cook | \n",
" cook@gmail.com | \n",
" True | \n",
" 2018-05-12 | \n",
" 2 | \n",
" 55.45 | \n",
"
\n",
" \n",
" adam.saunders | \n",
" Adam | \n",
" Saunders | \n",
" adam@gmail.com | \n",
" False | \n",
" 2018-05-29 | \n",
" 3 | \n",
" 72.12 | \n",
"
\n",
" \n",
" adrian | \n",
" Adrian | \n",
" Yang | \n",
" adrian.yang@teamtreehouse.com | \n",
" True | \n",
" 2018-04-28 | \n",
" 3 | \n",
" 30.01 | \n",
"
\n",
" \n",
" adrian.blair | \n",
" Adrian | \n",
" Blair | \n",
" adrian9335@gmail.com | \n",
" True | \n",
" 2018-06-16 | \n",
" 7 | \n",
" 25.85 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" first_name last_name email \\\n",
"aaron Aaron Davis aaron6348@gmail.com \n",
"acook Anthony Cook cook@gmail.com \n",
"adam.saunders Adam Saunders adam@gmail.com \n",
"adrian Adrian Yang adrian.yang@teamtreehouse.com \n",
"adrian.blair Adrian Blair adrian9335@gmail.com \n",
"\n",
" email_verified signup_date referral_count balance \n",
"aaron True 2018-08-31 6 18.14 \n",
"acook True 2018-05-12 2 55.45 \n",
"adam.saunders False 2018-05-29 3 72.12 \n",
"adrian True 2018-04-28 3 30.01 \n",
"adrian.blair True 2018-06-16 7 25.85 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Use the inverse of the index to find where referral values DO NOT equal zero\n",
"users[~no_referrals_index].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## In `loc`\n",
"Boolean `Series` as an index may also be used as an index the `DataFrame.loc` object. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" balance | \n",
" email | \n",
"
\n",
" \n",
" \n",
" \n",
" alan9443 | \n",
" 56.09 | \n",
" pope@hotmail.com | \n",
"
\n",
" \n",
" andrew.alvarez | \n",
" 81.66 | \n",
" aalvarez@hotmail.com | \n",
"
\n",
" \n",
" boyer7005 | \n",
" 91.41 | \n",
" boyer8636@gmail.com | \n",
"
\n",
" \n",
" brandon.gilbert | \n",
" 10.17 | \n",
" brandon.gilbert@hotmail.com | \n",
"
\n",
" \n",
" brooke2027 | \n",
" 7.22 | \n",
" brooke6938@gmail.com | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" balance email\n",
"alan9443 56.09 pope@hotmail.com\n",
"andrew.alvarez 81.66 aalvarez@hotmail.com\n",
"boyer7005 91.41 boyer8636@gmail.com\n",
"brandon.gilbert 10.17 brandon.gilbert@hotmail.com\n",
"brooke2027 7.22 brooke6938@gmail.com"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Select rows where there are no referrals, and select only the following ordered columns\n",
"users.loc[no_referrals_index, ['balance', 'email']].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is also possible to do the comparison inline, without storing the index in a variable."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" first_name | \n",
" last_name | \n",
" email | \n",
" email_verified | \n",
" signup_date | \n",
" referral_count | \n",
" balance | \n",
"
\n",
" \n",
" \n",
" \n",
" alan9443 | \n",
" Alan | \n",
" Pope | \n",
" pope@hotmail.com | \n",
" True | \n",
" 2018-04-17 | \n",
" 0 | \n",
" 56.09 | \n",
"
\n",
" \n",
" andrew.alvarez | \n",
" Andrew | \n",
" Alvarez | \n",
" aalvarez@hotmail.com | \n",
" False | \n",
" 2018-08-01 | \n",
" 0 | \n",
" 81.66 | \n",
"
\n",
" \n",
" boyer7005 | \n",
" Sara | \n",
" Boyer | \n",
" boyer8636@gmail.com | \n",
" True | \n",
" 2018-07-31 | \n",
" 0 | \n",
" 91.41 | \n",
"
\n",
" \n",
" brandon.gilbert | \n",
" Brandon | \n",
" Gilbert | \n",
" brandon.gilbert@hotmail.com | \n",
" True | \n",
" 2018-04-28 | \n",
" 0 | \n",
" 10.17 | \n",
"
\n",
" \n",
" brooke2027 | \n",
" Brooke | \n",
" NaN | \n",
" brooke6938@gmail.com | \n",
" False | \n",
" 2018-05-23 | \n",
" 0 | \n",
" 7.22 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" first_name last_name email \\\n",
"alan9443 Alan Pope pope@hotmail.com \n",
"andrew.alvarez Andrew Alvarez aalvarez@hotmail.com \n",
"boyer7005 Sara Boyer boyer8636@gmail.com \n",
"brandon.gilbert Brandon Gilbert brandon.gilbert@hotmail.com \n",
"brooke2027 Brooke NaN brooke6938@gmail.com \n",
"\n",
" email_verified signup_date referral_count balance \n",
"alan9443 True 2018-04-17 0 56.09 \n",
"andrew.alvarez False 2018-08-01 0 81.66 \n",
"boyer7005 True 2018-07-31 0 91.41 \n",
"brandon.gilbert True 2018-04-28 0 10.17 \n",
"brooke2027 False 2018-05-23 0 7.22 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"users[users['referral_count'] == 0].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just like a NumPy `ndarray`, it's possible for a boolean `Series` to be compared to another boolean `Series` using bitwise operators.\n",
"\n",
"Don't forget to surround your expressions with parenthesis to control the order of operations."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" first_name | \n",
" last_name | \n",
" email | \n",
" email_verified | \n",
" signup_date | \n",
" referral_count | \n",
" balance | \n",
"
\n",
" \n",
" \n",
" \n",
" alan9443 | \n",
" Alan | \n",
" Pope | \n",
" pope@hotmail.com | \n",
" True | \n",
" 2018-04-17 | \n",
" 0 | \n",
" 56.09 | \n",
"
\n",
" \n",
" boyer7005 | \n",
" Sara | \n",
" Boyer | \n",
" boyer8636@gmail.com | \n",
" True | \n",
" 2018-07-31 | \n",
" 0 | \n",
" 91.41 | \n",
"
\n",
" \n",
" brandon.gilbert | \n",
" Brandon | \n",
" Gilbert | \n",
" brandon.gilbert@hotmail.com | \n",
" True | \n",
" 2018-04-28 | \n",
" 0 | \n",
" 10.17 | \n",
"
\n",
" \n",
" bryant | \n",
" Darlene | \n",
" Bryant | \n",
" dbryant@yahoo.com | \n",
" True | \n",
" 2018-07-19 | \n",
" 0 | \n",
" 36.91 | \n",
"
\n",
" \n",
" calvin.perez | \n",
" Calvin | \n",
" Perez | \n",
" cperez@gmail.com | \n",
" True | \n",
" 2018-02-17 | \n",
" 0 | \n",
" 13.01 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" first_name last_name email \\\n",
"alan9443 Alan Pope pope@hotmail.com \n",
"boyer7005 Sara Boyer boyer8636@gmail.com \n",
"brandon.gilbert Brandon Gilbert brandon.gilbert@hotmail.com \n",
"bryant Darlene Bryant dbryant@yahoo.com \n",
"calvin.perez Calvin Perez cperez@gmail.com \n",
"\n",
" email_verified signup_date referral_count balance \n",
"alan9443 True 2018-04-17 0 56.09 \n",
"boyer7005 True 2018-07-31 0 91.41 \n",
"brandon.gilbert True 2018-04-28 0 10.17 \n",
"bryant True 2018-07-19 0 36.91 \n",
"calvin.perez True 2018-02-17 0 13.01 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Select all users where they haven't made a referral AND their email has been verified\n",
"users[(users['referral_count'] == 0) & (users['email_verified'] == True)].head()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}