{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Handling Duplicated and Missing Data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Wow! $10,496.47 has passed through the request system in 214 transactions!!!'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Setup\n", "from datetime import datetime\n", "import os\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from utils import render\n", "\n", "users = pd.read_csv(os.path.join('data', 'users.csv'), index_col=0)\n", "transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)\n", "requests = pd.read_csv(os.path.join('data', 'requests.csv'), index_col=0)\n", "\n", "# Perform the merge from the previous notebook \n", "# (s2n06-combining-dataframes.ipynb)\n", "successful_requests = requests.merge(\n", " transactions,\n", " left_on=['from_user', 'to_user', 'amount'], \n", " right_on=['receiver', 'sender', 'amount']\n", ")\n", "\n", "# Statement from previous notebook\n", "\"Wow! ${:,.2f} has passed through the request system in {} transactions!!!\".format(\n", " successful_requests.amount.sum(),\n", " len(successful_requests),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Duplicated Data\n", "\n", "We realized in our the previous notebook (s2n6-combining-dataframes.ipynb) that the **`requests`** `DataFrame` had duplicates. Unfortunately this means that our **`successful_requests`** also contains duplicates because we merged those same values with a transaction, even though in actuality, only one of those duplicated requests should be deemed \"successful\".\n", "\n", "We should correct our `DataFrame` by removing the duplicate requests, keeping only the last one, as that is really the one that triggered the actual transaction. The great news is that there is a method named [`drop_duplicates`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html) that does just that. \n", "\n", "Like `duplicated` there is a `keep` parameter that works similarly, you tell it which of the duplicates to keep. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Wow! $9,316.12 has passed through the request system in 191 transactions!!!'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's get our records sorted chronologically\n", "successful_requests.sort_values('request_date', inplace=True) \n", "\n", "# And then we'll drop dupes keeping only the last one. \n", "# Note the use of the inplace keyword\n", "successful_requests.drop_duplicates(('from_user', 'to_user', 'amount'), keep='last', inplace=True)\n", "\n", "# Statement from previous notebook\n", "\"Wow! ${:,.2f} has passed through the request system in {} transactions!!!\".format(\n", " successful_requests.amount.sum(),\n", " len(successful_requests),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yikes, that little bit of difference can be pretty important, especially if that data is being used to make a decision. Always make sure to check for duplicates in situations where you are merging one or more `DataFrame`s together." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Locating records not found in another DataFrame using [`isin`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html)\n", "\n", "Now that we have seen how successful the requests, have been, it would be nice to locate all the users who **have not** yet made a request to another user.\n", "\n", "Basically we would to ask if each user from the **`users`** `DataFrame` if they are in the **`requests`** `DataFrame` as a **`from_user`**.\n", "\n", "The method `isin` is available for [`Series`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html), [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html), and even an [`Index`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.isin.html). The method takes an iterable and returns a boolean index of whether or not the value is contained in the owning `Series` or `DataFrame`.\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
first_namelast_nameemailemail_verifiedsignup_datereferral_countbalance
aaronAaronDavisaaron6348@gmail.comTrue2018-08-31618.14
acookAnthonyCookcook@gmail.comTrue2018-05-12255.45
adam.saundersAdamSaundersadam@gmail.comFalse2018-05-29372.12
adrianAdrianYangadrian.yang@teamtreehouse.comTrue2018-04-28330.01
adrian.blairAdrianBlairadrian9335@gmail.comTrue2018-06-16725.85
\n", "
" ], "text/plain": [ " first_name last_name email \\\n", "aaron Aaron Davis aaron6348@gmail.com \n", "acook Anthony Cook cook@gmail.com \n", "adam.saunders Adam Saunders adam@gmail.com \n", "adrian Adrian Yang adrian.yang@teamtreehouse.com \n", "adrian.blair Adrian Blair adrian9335@gmail.com \n", "\n", " email_verified signup_date referral_count balance \n", "aaron True 2018-08-31 6 18.14 \n", "acook True 2018-05-12 2 55.45 \n", "adam.saunders False 2018-05-29 3 72.12 \n", "adrian True 2018-04-28 3 30.01 \n", "adrian.blair True 2018-06-16 7 25.85 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a boolean array where we check to see if the label (username) \n", "# is in the `from_user` Series.\n", "made_request_index = users.index.isin(requests.from_user)\n", "# This will get us a list of all users who **have** made a request\n", "users[made_request_index].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because we have a boolean index, we can negate it, using the `~` bitwise operator, to get to what we were looking for, users who have not yet made a request." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
first_namelast_nameemailemail_verifiedsignup_datereferral_countbalance
alan9443AlanPopepope@hotmail.comTrue2018-04-17056.09
alvaradoDeniseAlvaradoalvarado@hotmail.comTrue2018-09-07626.72
amillerAnneMillermiller@hotmail.comFalse2018-06-02586.28
andersenMarkAndersenmark.andersen@yahoo.comTrue2018-08-21375.69
andradeMelissaAndrademandrade@yahoo.comTrue2018-01-06383.22
\n", "
" ], "text/plain": [ " first_name last_name email email_verified \\\n", "alan9443 Alan Pope pope@hotmail.com True \n", "alvarado Denise Alvarado alvarado@hotmail.com True \n", "amiller Anne Miller miller@hotmail.com False \n", "andersen Mark Andersen mark.andersen@yahoo.com True \n", "andrade Melissa Andrade mandrade@yahoo.com True \n", "\n", " signup_date referral_count balance \n", "alan9443 2018-04-17 0 56.09 \n", "alvarado 2018-09-07 6 26.72 \n", "amiller 2018-06-02 5 86.28 \n", "andersen 2018-08-21 3 75.69 \n", "andrade 2018-01-06 3 83.22 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "users[~made_request_index].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Locating Missing Data\n", "\n", "As I was looking at these people who hadn't made requests I noticed that a few of them had a NaN (Not A Number) for a **`last_name`**.\n", "\n", "We can get a quick overview of how many blank values we have by using the [`DataFrame.count`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html)\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "first_name 475\n", "last_name 430\n", "email 475\n", "email_verified 475\n", "signup_date 475\n", "referral_count 475\n", "balance 475\n", "dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "users.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[`Series.isna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isna.html) is a method that allows you to check for missing values, such as `None` or `np.nan`. It returns a boolean array which can be used as an index." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
first_namelast_nameemailemail_verifiedsignup_datereferral_countbalance
anthony3761AnthonyNaNanthony9581@gmail.comTrue2018-08-16659.81
bradleyBradleyNaNbradley3941@henderson.comTrue2018-06-24422.24
bradley7808BradleyNaNbradley8794@yahoo.comTrue2018-06-15176.49
brendaBrendaNaNbrenda@hotmail.comTrue2018-07-2876.18
brooke2027BrookeNaNbrooke6938@gmail.comFalse2018-05-2307.22
\n", "
" ], "text/plain": [ " first_name last_name email email_verified \\\n", "anthony3761 Anthony NaN anthony9581@gmail.com True \n", "bradley Bradley NaN bradley3941@henderson.com True \n", "bradley7808 Bradley NaN bradley8794@yahoo.com True \n", "brenda Brenda NaN brenda@hotmail.com True \n", "brooke2027 Brooke NaN brooke6938@gmail.com False \n", "\n", " signup_date referral_count balance \n", "anthony3761 2018-08-16 6 59.81 \n", "bradley 2018-06-24 4 22.24 \n", "bradley7808 2018-06-15 1 76.49 \n", "brenda 2018-07-28 7 6.18 \n", "brooke2027 2018-05-23 0 7.22 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Gather all users where the last name is missing\n", "users[users.last_name.isna()].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filling Missing Values\n", "\n", "You can also set all unknown values to a specific value using the [`fillna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html) method. By default it will return a new `DataFrame`, use the `inplace` parameter to update the existing." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
first_namelast_nameemailemail_verifiedsignup_datereferral_countbalance
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [first_name, last_name, email, email_verified, signup_date, referral_count, balance]\n", "Index: []" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Make a copy of the DataFrame with \"Unknown\" as the last name where it is missing\n", "users_with_unknown = users.fillna('Unknown')\n", "\n", "# Make sure we got 'em all\n", "users_with_unknown[users_with_unknown.last_name.isna()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dropping Rows with Missing Data\n", "\n", "Another option is to just drop rows that contain missing data. That is the [`DataFrame.dropna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) method. Use `inplace` parameter to update the existing `DataFrame`.\n", "\n", "In this situation it's a little drastic, but you can imagine where missing data, makes what you are trying to accomplish invalid" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(475, 430)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "users_with_last_names = users.dropna()\n", "\n", "# Row counts of the original \n", "(len(users), len(users_with_last_names))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }