{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Handling Duplicated and Missing Data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Wow! $10,496.47 has passed through the request system in 214 transactions!!!'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Setup\n",
"from datetime import datetime\n",
"import os\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"from utils import render\n",
"\n",
"users = pd.read_csv(os.path.join('data', 'users.csv'), index_col=0)\n",
"transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)\n",
"requests = pd.read_csv(os.path.join('data', 'requests.csv'), index_col=0)\n",
"\n",
"# Perform the merge from the previous notebook \n",
"# (s2n06-combining-dataframes.ipynb)\n",
"successful_requests = requests.merge(\n",
" transactions,\n",
" left_on=['from_user', 'to_user', 'amount'], \n",
" right_on=['receiver', 'sender', 'amount']\n",
")\n",
"\n",
"# Statement from previous notebook\n",
"\"Wow! ${:,.2f} has passed through the request system in {} transactions!!!\".format(\n",
" successful_requests.amount.sum(),\n",
" len(successful_requests),\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Duplicated Data\n",
"\n",
"We realized in our the previous notebook (s2n6-combining-dataframes.ipynb) that the **`requests`** `DataFrame` had duplicates. Unfortunately this means that our **`successful_requests`** also contains duplicates because we merged those same values with a transaction, even though in actuality, only one of those duplicated requests should be deemed \"successful\".\n",
"\n",
"We should correct our `DataFrame` by removing the duplicate requests, keeping only the last one, as that is really the one that triggered the actual transaction. The great news is that there is a method named [`drop_duplicates`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html) that does just that. \n",
"\n",
"Like `duplicated` there is a `keep` parameter that works similarly, you tell it which of the duplicates to keep. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Wow! $9,316.12 has passed through the request system in 191 transactions!!!'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Let's get our records sorted chronologically\n",
"successful_requests.sort_values('request_date', inplace=True) \n",
"\n",
"# And then we'll drop dupes keeping only the last one. \n",
"# Note the use of the inplace keyword\n",
"successful_requests.drop_duplicates(('from_user', 'to_user', 'amount'), keep='last', inplace=True)\n",
"\n",
"# Statement from previous notebook\n",
"\"Wow! ${:,.2f} has passed through the request system in {} transactions!!!\".format(\n",
" successful_requests.amount.sum(),\n",
" len(successful_requests),\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Yikes, that little bit of difference can be pretty important, especially if that data is being used to make a decision. Always make sure to check for duplicates in situations where you are merging one or more `DataFrame`s together."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Locating records not found in another DataFrame using [`isin`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html)\n",
"\n",
"Now that we have seen how successful the requests, have been, it would be nice to locate all the users who **have not** yet made a request to another user.\n",
"\n",
"Basically we would to ask if each user from the **`users`** `DataFrame` if they are in the **`requests`** `DataFrame` as a **`from_user`**.\n",
"\n",
"The method `isin` is available for [`Series`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html), [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html), and even an [`Index`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.isin.html). The method takes an iterable and returns a boolean index of whether or not the value is contained in the owning `Series` or `DataFrame`.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" first_name | \n",
" last_name | \n",
" email | \n",
" email_verified | \n",
" signup_date | \n",
" referral_count | \n",
" balance | \n",
"
\n",
" \n",
" \n",
" \n",
" aaron | \n",
" Aaron | \n",
" Davis | \n",
" aaron6348@gmail.com | \n",
" True | \n",
" 2018-08-31 | \n",
" 6 | \n",
" 18.14 | \n",
"
\n",
" \n",
" acook | \n",
" Anthony | \n",
" Cook | \n",
" cook@gmail.com | \n",
" True | \n",
" 2018-05-12 | \n",
" 2 | \n",
" 55.45 | \n",
"
\n",
" \n",
" adam.saunders | \n",
" Adam | \n",
" Saunders | \n",
" adam@gmail.com | \n",
" False | \n",
" 2018-05-29 | \n",
" 3 | \n",
" 72.12 | \n",
"
\n",
" \n",
" adrian | \n",
" Adrian | \n",
" Yang | \n",
" adrian.yang@teamtreehouse.com | \n",
" True | \n",
" 2018-04-28 | \n",
" 3 | \n",
" 30.01 | \n",
"
\n",
" \n",
" adrian.blair | \n",
" Adrian | \n",
" Blair | \n",
" adrian9335@gmail.com | \n",
" True | \n",
" 2018-06-16 | \n",
" 7 | \n",
" 25.85 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" first_name last_name email \\\n",
"aaron Aaron Davis aaron6348@gmail.com \n",
"acook Anthony Cook cook@gmail.com \n",
"adam.saunders Adam Saunders adam@gmail.com \n",
"adrian Adrian Yang adrian.yang@teamtreehouse.com \n",
"adrian.blair Adrian Blair adrian9335@gmail.com \n",
"\n",
" email_verified signup_date referral_count balance \n",
"aaron True 2018-08-31 6 18.14 \n",
"acook True 2018-05-12 2 55.45 \n",
"adam.saunders False 2018-05-29 3 72.12 \n",
"adrian True 2018-04-28 3 30.01 \n",
"adrian.blair True 2018-06-16 7 25.85 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a boolean array where we check to see if the label (username) \n",
"# is in the `from_user` Series.\n",
"made_request_index = users.index.isin(requests.from_user)\n",
"# This will get us a list of all users who **have** made a request\n",
"users[made_request_index].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because we have a boolean index, we can negate it, using the `~` bitwise operator, to get to what we were looking for, users who have not yet made a request."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" first_name | \n",
" last_name | \n",
" email | \n",
" email_verified | \n",
" signup_date | \n",
" referral_count | \n",
" balance | \n",
"
\n",
" \n",
" \n",
" \n",
" alan9443 | \n",
" Alan | \n",
" Pope | \n",
" pope@hotmail.com | \n",
" True | \n",
" 2018-04-17 | \n",
" 0 | \n",
" 56.09 | \n",
"
\n",
" \n",
" alvarado | \n",
" Denise | \n",
" Alvarado | \n",
" alvarado@hotmail.com | \n",
" True | \n",
" 2018-09-07 | \n",
" 6 | \n",
" 26.72 | \n",
"
\n",
" \n",
" amiller | \n",
" Anne | \n",
" Miller | \n",
" miller@hotmail.com | \n",
" False | \n",
" 2018-06-02 | \n",
" 5 | \n",
" 86.28 | \n",
"
\n",
" \n",
" andersen | \n",
" Mark | \n",
" Andersen | \n",
" mark.andersen@yahoo.com | \n",
" True | \n",
" 2018-08-21 | \n",
" 3 | \n",
" 75.69 | \n",
"
\n",
" \n",
" andrade | \n",
" Melissa | \n",
" Andrade | \n",
" mandrade@yahoo.com | \n",
" True | \n",
" 2018-01-06 | \n",
" 3 | \n",
" 83.22 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" first_name last_name email email_verified \\\n",
"alan9443 Alan Pope pope@hotmail.com True \n",
"alvarado Denise Alvarado alvarado@hotmail.com True \n",
"amiller Anne Miller miller@hotmail.com False \n",
"andersen Mark Andersen mark.andersen@yahoo.com True \n",
"andrade Melissa Andrade mandrade@yahoo.com True \n",
"\n",
" signup_date referral_count balance \n",
"alan9443 2018-04-17 0 56.09 \n",
"alvarado 2018-09-07 6 26.72 \n",
"amiller 2018-06-02 5 86.28 \n",
"andersen 2018-08-21 3 75.69 \n",
"andrade 2018-01-06 3 83.22 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"users[~made_request_index].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Locating Missing Data\n",
"\n",
"As I was looking at these people who hadn't made requests I noticed that a few of them had a NaN (Not A Number) for a **`last_name`**.\n",
"\n",
"We can get a quick overview of how many blank values we have by using the [`DataFrame.count`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html)\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"first_name 475\n",
"last_name 430\n",
"email 475\n",
"email_verified 475\n",
"signup_date 475\n",
"referral_count 475\n",
"balance 475\n",
"dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"users.count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[`Series.isna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isna.html) is a method that allows you to check for missing values, such as `None` or `np.nan`. It returns a boolean array which can be used as an index."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" first_name | \n",
" last_name | \n",
" email | \n",
" email_verified | \n",
" signup_date | \n",
" referral_count | \n",
" balance | \n",
"
\n",
" \n",
" \n",
" \n",
" anthony3761 | \n",
" Anthony | \n",
" NaN | \n",
" anthony9581@gmail.com | \n",
" True | \n",
" 2018-08-16 | \n",
" 6 | \n",
" 59.81 | \n",
"
\n",
" \n",
" bradley | \n",
" Bradley | \n",
" NaN | \n",
" bradley3941@henderson.com | \n",
" True | \n",
" 2018-06-24 | \n",
" 4 | \n",
" 22.24 | \n",
"
\n",
" \n",
" bradley7808 | \n",
" Bradley | \n",
" NaN | \n",
" bradley8794@yahoo.com | \n",
" True | \n",
" 2018-06-15 | \n",
" 1 | \n",
" 76.49 | \n",
"
\n",
" \n",
" brenda | \n",
" Brenda | \n",
" NaN | \n",
" brenda@hotmail.com | \n",
" True | \n",
" 2018-07-28 | \n",
" 7 | \n",
" 6.18 | \n",
"
\n",
" \n",
" brooke2027 | \n",
" Brooke | \n",
" NaN | \n",
" brooke6938@gmail.com | \n",
" False | \n",
" 2018-05-23 | \n",
" 0 | \n",
" 7.22 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" first_name last_name email email_verified \\\n",
"anthony3761 Anthony NaN anthony9581@gmail.com True \n",
"bradley Bradley NaN bradley3941@henderson.com True \n",
"bradley7808 Bradley NaN bradley8794@yahoo.com True \n",
"brenda Brenda NaN brenda@hotmail.com True \n",
"brooke2027 Brooke NaN brooke6938@gmail.com False \n",
"\n",
" signup_date referral_count balance \n",
"anthony3761 2018-08-16 6 59.81 \n",
"bradley 2018-06-24 4 22.24 \n",
"bradley7808 2018-06-15 1 76.49 \n",
"brenda 2018-07-28 7 6.18 \n",
"brooke2027 2018-05-23 0 7.22 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Gather all users where the last name is missing\n",
"users[users.last_name.isna()].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filling Missing Values\n",
"\n",
"You can also set all unknown values to a specific value using the [`fillna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html) method. By default it will return a new `DataFrame`, use the `inplace` parameter to update the existing."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" first_name | \n",
" last_name | \n",
" email | \n",
" email_verified | \n",
" signup_date | \n",
" referral_count | \n",
" balance | \n",
"
\n",
" \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: [first_name, last_name, email, email_verified, signup_date, referral_count, balance]\n",
"Index: []"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Make a copy of the DataFrame with \"Unknown\" as the last name where it is missing\n",
"users_with_unknown = users.fillna('Unknown')\n",
"\n",
"# Make sure we got 'em all\n",
"users_with_unknown[users_with_unknown.last_name.isna()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dropping Rows with Missing Data\n",
"\n",
"Another option is to just drop rows that contain missing data. That is the [`DataFrame.dropna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) method. Use `inplace` parameter to update the existing `DataFrame`.\n",
"\n",
"In this situation it's a little drastic, but you can imagine where missing data, makes what you are trying to accomplish invalid"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(475, 430)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"users_with_last_names = users.dropna()\n",
"\n",
"# Row counts of the original \n",
"(len(users), len(users_with_last_names))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}