{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Handling Duplicated and Missing Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Wow! $10,496.47 has passed through the request system in 214 transactions!!!'"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Setup\n",
    "from datetime import datetime\n",
    "import os\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "from utils import render\n",
    "\n",
    "users = pd.read_csv(os.path.join('data', 'users.csv'), index_col=0)\n",
    "transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)\n",
    "requests = pd.read_csv(os.path.join('data', 'requests.csv'), index_col=0)\n",
    "\n",
    "# Perform the merge from the previous notebook \n",
    "#  (s2n06-combining-dataframes.ipynb)\n",
    "successful_requests = requests.merge(\n",
    "    transactions,\n",
    "    left_on=['from_user', 'to_user', 'amount'], \n",
    "    right_on=['receiver', 'sender', 'amount']\n",
    ")\n",
    "\n",
    "# Statement from previous notebook\n",
    "\"Wow! ${:,.2f} has passed through the request system in {} transactions!!!\".format(\n",
    "    successful_requests.amount.sum(),\n",
    "    len(successful_requests),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Duplicated Data\n",
    "\n",
    "We realized in our the previous notebook (s2n6-combining-dataframes.ipynb) that the **`requests`** `DataFrame` had duplicates. Unfortunately this means that our **`successful_requests`** also contains duplicates because we merged those same values with a transaction, even though in actuality, only one of those duplicated requests should be deemed \"successful\".\n",
    "\n",
    "We should correct our `DataFrame` by removing the duplicate requests, keeping only the last one, as that is really the one that triggered the actual transaction. The great news is that there is a method named [`drop_duplicates`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html) that does just that. \n",
    "\n",
    "Like `duplicated` there is a `keep` parameter that works similarly, you tell it which of the duplicates to keep. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Wow! $9,316.12 has passed through the request system in 191 transactions!!!'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Let's get our records sorted chronologically\n",
    "successful_requests.sort_values('request_date', inplace=True) \n",
    "\n",
    "# And then we'll drop dupes keeping only the last one. \n",
    "#  Note the use of the inplace keyword\n",
    "successful_requests.drop_duplicates(('from_user', 'to_user', 'amount'), keep='last', inplace=True)\n",
    "\n",
    "# Statement from previous notebook\n",
    "\"Wow! ${:,.2f} has passed through the request system in {} transactions!!!\".format(\n",
    "    successful_requests.amount.sum(),\n",
    "    len(successful_requests),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Yikes, that little bit of difference can be pretty important, especially if that data is being used to make a decision. Always make sure to check for duplicates in situations where you are merging one or more `DataFrame`s together."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Locating records not found in another DataFrame using [`isin`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html)\n",
    "\n",
    "Now that we have seen how successful the requests, have been, it would be nice to locate all the users who **have not** yet made a request to another user.\n",
    "\n",
    "Basically we would to ask if each user from the **`users`** `DataFrame` if they are in the **`requests`** `DataFrame` as a **`from_user`**.\n",
    "\n",
    "The method `isin` is available for [`Series`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html), [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html), and even an [`Index`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.isin.html).  The method takes an iterable and returns a boolean index of whether or not the value is contained in the owning `Series` or `DataFrame`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>email</th>\n",
       "      <th>email_verified</th>\n",
       "      <th>signup_date</th>\n",
       "      <th>referral_count</th>\n",
       "      <th>balance</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>aaron</th>\n",
       "      <td>Aaron</td>\n",
       "      <td>Davis</td>\n",
       "      <td>aaron6348@gmail.com</td>\n",
       "      <td>True</td>\n",
       "      <td>2018-08-31</td>\n",
       "      <td>6</td>\n",
       "      <td>18.14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>acook</th>\n",
       "      <td>Anthony</td>\n",
       "      <td>Cook</td>\n",
       "      <td>cook@gmail.com</td>\n",
       "      <td>True</td>\n",
       "      <td>2018-05-12</td>\n",
       "      <td>2</td>\n",
       "      <td>55.45</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>adam.saunders</th>\n",
       "      <td>Adam</td>\n",
       "      <td>Saunders</td>\n",
       "      <td>adam@gmail.com</td>\n",
       "      <td>False</td>\n",
       "      <td>2018-05-29</td>\n",
       "      <td>3</td>\n",
       "      <td>72.12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>adrian</th>\n",
       "      <td>Adrian</td>\n",
       "      <td>Yang</td>\n",
       "      <td>adrian.yang@teamtreehouse.com</td>\n",
       "      <td>True</td>\n",
       "      <td>2018-04-28</td>\n",
       "      <td>3</td>\n",
       "      <td>30.01</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>adrian.blair</th>\n",
       "      <td>Adrian</td>\n",
       "      <td>Blair</td>\n",
       "      <td>adrian9335@gmail.com</td>\n",
       "      <td>True</td>\n",
       "      <td>2018-06-16</td>\n",
       "      <td>7</td>\n",
       "      <td>25.85</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              first_name last_name                          email  \\\n",
       "aaron              Aaron     Davis            aaron6348@gmail.com   \n",
       "acook            Anthony      Cook                 cook@gmail.com   \n",
       "adam.saunders       Adam  Saunders                 adam@gmail.com   \n",
       "adrian            Adrian      Yang  adrian.yang@teamtreehouse.com   \n",
       "adrian.blair      Adrian     Blair           adrian9335@gmail.com   \n",
       "\n",
       "               email_verified signup_date  referral_count  balance  \n",
       "aaron                    True  2018-08-31               6    18.14  \n",
       "acook                    True  2018-05-12               2    55.45  \n",
       "adam.saunders           False  2018-05-29               3    72.12  \n",
       "adrian                   True  2018-04-28               3    30.01  \n",
       "adrian.blair             True  2018-06-16               7    25.85  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Create a boolean array where we check to see if the label (username) \n",
    "#   is in the `from_user` Series.\n",
    "made_request_index = users.index.isin(requests.from_user)\n",
    "# This will get us a list of all users who **have** made a request\n",
    "users[made_request_index].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because we have a boolean index, we can negate it, using the `~` bitwise operator, to get to what we were looking for, users who have not yet made a request."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>email</th>\n",
       "      <th>email_verified</th>\n",
       "      <th>signup_date</th>\n",
       "      <th>referral_count</th>\n",
       "      <th>balance</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>alan9443</th>\n",
       "      <td>Alan</td>\n",
       "      <td>Pope</td>\n",
       "      <td>pope@hotmail.com</td>\n",
       "      <td>True</td>\n",
       "      <td>2018-04-17</td>\n",
       "      <td>0</td>\n",
       "      <td>56.09</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>alvarado</th>\n",
       "      <td>Denise</td>\n",
       "      <td>Alvarado</td>\n",
       "      <td>alvarado@hotmail.com</td>\n",
       "      <td>True</td>\n",
       "      <td>2018-09-07</td>\n",
       "      <td>6</td>\n",
       "      <td>26.72</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>amiller</th>\n",
       "      <td>Anne</td>\n",
       "      <td>Miller</td>\n",
       "      <td>miller@hotmail.com</td>\n",
       "      <td>False</td>\n",
       "      <td>2018-06-02</td>\n",
       "      <td>5</td>\n",
       "      <td>86.28</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>andersen</th>\n",
       "      <td>Mark</td>\n",
       "      <td>Andersen</td>\n",
       "      <td>mark.andersen@yahoo.com</td>\n",
       "      <td>True</td>\n",
       "      <td>2018-08-21</td>\n",
       "      <td>3</td>\n",
       "      <td>75.69</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>andrade</th>\n",
       "      <td>Melissa</td>\n",
       "      <td>Andrade</td>\n",
       "      <td>mandrade@yahoo.com</td>\n",
       "      <td>True</td>\n",
       "      <td>2018-01-06</td>\n",
       "      <td>3</td>\n",
       "      <td>83.22</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         first_name last_name                    email  email_verified  \\\n",
       "alan9443       Alan      Pope         pope@hotmail.com            True   \n",
       "alvarado     Denise  Alvarado     alvarado@hotmail.com            True   \n",
       "amiller        Anne    Miller       miller@hotmail.com           False   \n",
       "andersen       Mark  Andersen  mark.andersen@yahoo.com            True   \n",
       "andrade     Melissa   Andrade       mandrade@yahoo.com            True   \n",
       "\n",
       "         signup_date  referral_count  balance  \n",
       "alan9443  2018-04-17               0    56.09  \n",
       "alvarado  2018-09-07               6    26.72  \n",
       "amiller   2018-06-02               5    86.28  \n",
       "andersen  2018-08-21               3    75.69  \n",
       "andrade   2018-01-06               3    83.22  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "users[~made_request_index].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Locating Missing Data\n",
    "\n",
    "As I was looking at these people who hadn't made requests I noticed that a few of them had a NaN (Not A Number) for a **`last_name`**.\n",
    "\n",
    "We can get a quick overview of how many blank values we have by using the [`DataFrame.count`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "first_name        475\n",
       "last_name         430\n",
       "email             475\n",
       "email_verified    475\n",
       "signup_date       475\n",
       "referral_count    475\n",
       "balance           475\n",
       "dtype: int64"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "users.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[`Series.isna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isna.html) is a method that allows you to check for missing values, such as `None` or `np.nan`.  It returns a boolean array which can be used as an index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>email</th>\n",
       "      <th>email_verified</th>\n",
       "      <th>signup_date</th>\n",
       "      <th>referral_count</th>\n",
       "      <th>balance</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>anthony3761</th>\n",
       "      <td>Anthony</td>\n",
       "      <td>NaN</td>\n",
       "      <td>anthony9581@gmail.com</td>\n",
       "      <td>True</td>\n",
       "      <td>2018-08-16</td>\n",
       "      <td>6</td>\n",
       "      <td>59.81</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>bradley</th>\n",
       "      <td>Bradley</td>\n",
       "      <td>NaN</td>\n",
       "      <td>bradley3941@henderson.com</td>\n",
       "      <td>True</td>\n",
       "      <td>2018-06-24</td>\n",
       "      <td>4</td>\n",
       "      <td>22.24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>bradley7808</th>\n",
       "      <td>Bradley</td>\n",
       "      <td>NaN</td>\n",
       "      <td>bradley8794@yahoo.com</td>\n",
       "      <td>True</td>\n",
       "      <td>2018-06-15</td>\n",
       "      <td>1</td>\n",
       "      <td>76.49</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>brenda</th>\n",
       "      <td>Brenda</td>\n",
       "      <td>NaN</td>\n",
       "      <td>brenda@hotmail.com</td>\n",
       "      <td>True</td>\n",
       "      <td>2018-07-28</td>\n",
       "      <td>7</td>\n",
       "      <td>6.18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>brooke2027</th>\n",
       "      <td>Brooke</td>\n",
       "      <td>NaN</td>\n",
       "      <td>brooke6938@gmail.com</td>\n",
       "      <td>False</td>\n",
       "      <td>2018-05-23</td>\n",
       "      <td>0</td>\n",
       "      <td>7.22</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            first_name last_name                      email  email_verified  \\\n",
       "anthony3761    Anthony       NaN      anthony9581@gmail.com            True   \n",
       "bradley        Bradley       NaN  bradley3941@henderson.com            True   \n",
       "bradley7808    Bradley       NaN      bradley8794@yahoo.com            True   \n",
       "brenda          Brenda       NaN         brenda@hotmail.com            True   \n",
       "brooke2027      Brooke       NaN       brooke6938@gmail.com           False   \n",
       "\n",
       "            signup_date  referral_count  balance  \n",
       "anthony3761  2018-08-16               6    59.81  \n",
       "bradley      2018-06-24               4    22.24  \n",
       "bradley7808  2018-06-15               1    76.49  \n",
       "brenda       2018-07-28               7     6.18  \n",
       "brooke2027   2018-05-23               0     7.22  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Gather all users where the last name is missing\n",
    "users[users.last_name.isna()].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Filling Missing Values\n",
    "\n",
    "You can also set all unknown values to a specific value using the [`fillna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html) method. By default it will return a new `DataFrame`, use the `inplace` parameter to update the existing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>email</th>\n",
       "      <th>email_verified</th>\n",
       "      <th>signup_date</th>\n",
       "      <th>referral_count</th>\n",
       "      <th>balance</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [first_name, last_name, email, email_verified, signup_date, referral_count, balance]\n",
       "Index: []"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Make a copy of the DataFrame with \"Unknown\" as the last name where it is missing\n",
    "users_with_unknown = users.fillna('Unknown')\n",
    "\n",
    "# Make sure we got 'em all\n",
    "users_with_unknown[users_with_unknown.last_name.isna()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dropping Rows with Missing Data\n",
    "\n",
    "Another option is to just drop rows that contain missing data.  That is the [`DataFrame.dropna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) method.  Use `inplace` parameter to update the existing `DataFrame`.\n",
    "\n",
    "In this situation it's a little drastic, but you can imagine where missing data, makes what you are trying to accomplish invalid"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(475, 430)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "users_with_last_names = users.dropna()\n",
    "\n",
    "# Row counts of the original \n",
    "(len(users), len(users_with_last_names))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}