{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Manipulating Text\n",
    "\n",
    "Oftentimes, there will be something a bit off with the string data in your dataset. You may want to replace some characters, change the case, or strip the whitespace. You know, anything you normally need to do with strings.\n",
    "\n",
    "Now this might lead you to want to loop through each row and manipulate the data, but before you do that, step back and lean into **vectorization**.  \n",
    "\n",
    "A `Series` provides a way to use vectorized string methods in a property named [`str`](https://pandas.pydata.org/pandas-docs/stable/api.html#string-handling) and the vectorized methods are then available.\n",
    "\n",
    "Let's take a look at some examples that require us to use these methods."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Setup\n",
    "import os\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "from utils import make_chaos\n",
    "\n",
    "pd.options.display.max_rows = 10\n",
    "transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)\n",
    "# Pay no attention to the person behind the curtain\n",
    "make_chaos(transactions, 42, ['sender'], lambda val: '$' + val)\n",
    "make_chaos(transactions, 88, ['receiver'], lambda val: val.upper())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Replacing Text\n",
    "\n",
    "When CashBox first got started, usernames were allowed to start with a dollar sign. As time progressed, they changed their mind. They made a mass update to the system. However, someone on the Customer Support team reported that there are some records in the **`transactions`** `DataFrame` still showing some senders whose user name still had the $ prefix.\n",
    "\n",
    "In order to get ahold of those rows where the sender starts with a $, we can use the [`Series.str.startswith`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.startswith.html#pandas.Series.str.startswith) method.  This will return a boolean `Series` which we can use as an index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sender</th>\n",
       "      <th>receiver</th>\n",
       "      <th>amount</th>\n",
       "      <th>sent_date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>59</th>\n",
       "      <td>$porter</td>\n",
       "      <td>gail7896</td>\n",
       "      <td>75.16</td>\n",
       "      <td>2018-05-14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>70</th>\n",
       "      <td>$emily.lewis</td>\n",
       "      <td>kevin</td>\n",
       "      <td>5.49</td>\n",
       "      <td>2018-05-21</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>158</th>\n",
       "      <td>$robinson</td>\n",
       "      <td>rodriguez</td>\n",
       "      <td>8.91</td>\n",
       "      <td>2018-06-25</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>168</th>\n",
       "      <td>$nancy</td>\n",
       "      <td>margaret265</td>\n",
       "      <td>84.15</td>\n",
       "      <td>2018-06-26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>198</th>\n",
       "      <td>$acook</td>\n",
       "      <td>adam.saunders</td>\n",
       "      <td>9.31</td>\n",
       "      <td>2018-07-04</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>877</th>\n",
       "      <td>$april9082</td>\n",
       "      <td>jacob.davis</td>\n",
       "      <td>50.37</td>\n",
       "      <td>2018-09-21</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>889</th>\n",
       "      <td>$victor</td>\n",
       "      <td>anthony1788</td>\n",
       "      <td>39.06</td>\n",
       "      <td>2018-09-21</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>900</th>\n",
       "      <td>$andersen</td>\n",
       "      <td>corey.ingram</td>\n",
       "      <td>4.81</td>\n",
       "      <td>2018-09-22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>927</th>\n",
       "      <td>$janet.williams</td>\n",
       "      <td>bsmith</td>\n",
       "      <td>50.15</td>\n",
       "      <td>2018-09-23</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>934</th>\n",
       "      <td>$robert8280</td>\n",
       "      <td>roger</td>\n",
       "      <td>98.35</td>\n",
       "      <td>2018-09-24</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>42 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "              sender       receiver  amount   sent_date\n",
       "59           $porter       gail7896   75.16  2018-05-14\n",
       "70      $emily.lewis          kevin    5.49  2018-05-21\n",
       "158        $robinson      rodriguez    8.91  2018-06-25\n",
       "168           $nancy    margaret265   84.15  2018-06-26\n",
       "198           $acook  adam.saunders    9.31  2018-07-04\n",
       "..               ...            ...     ...         ...\n",
       "877       $april9082    jacob.davis   50.37  2018-09-21\n",
       "889          $victor    anthony1788   39.06  2018-09-21\n",
       "900        $andersen   corey.ingram    4.81  2018-09-22\n",
       "927  $janet.williams         bsmith   50.15  2018-09-23\n",
       "934      $robert8280          roger   98.35  2018-09-24\n",
       "\n",
       "[42 rows x 4 columns]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "transactions[transactions.sender.str.startswith('$')]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now just go ahead and replace all `$` with an empty string, essentially removing all `$` from every sender by using the [`Series.str.replace`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html) method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Replace all \"$\" in the sender field with an empty string\n",
    "transactions.sender = transactions.sender.str.replace('$', '') \n",
    "# Verify we got them all\n",
    "len(transactions[transactions.sender.str.startswith('$')])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Changing Case\n",
    "\n",
    "When you want to select or merge by specific values, the case, you know upper case or lower case, matters.  \n",
    "\n",
    "Our CashBox customer support representative also raised another issue for us to take a look at. All usernames should be lowercased, but they have reported that they noticed the **`receiver`** column has some uppercased values.\n",
    "\n",
    "We can get a handle on those by using the [`Series.str.isupper`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.isupper.html) method which will return a boolean `Series` that we can use for an index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sender</th>\n",
       "      <th>receiver</th>\n",
       "      <th>amount</th>\n",
       "      <th>sent_date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>rose.eaton</td>\n",
       "      <td>EMILY.LEWIS</td>\n",
       "      <td>62.67</td>\n",
       "      <td>2018-02-15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>francis.hernandez</td>\n",
       "      <td>LMOORE</td>\n",
       "      <td>91.46</td>\n",
       "      <td>2018-03-14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>palmer</td>\n",
       "      <td>CHAD.CHEN</td>\n",
       "      <td>36.27</td>\n",
       "      <td>2018-04-07</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>elang</td>\n",
       "      <td>DONNA1922</td>\n",
       "      <td>26.07</td>\n",
       "      <td>2018-04-23</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <td>payne</td>\n",
       "      <td>GRIFFIN4992</td>\n",
       "      <td>85.21</td>\n",
       "      <td>2018-04-26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>963</th>\n",
       "      <td>stanley7729</td>\n",
       "      <td>JOSEPH.LOPEZ</td>\n",
       "      <td>50.84</td>\n",
       "      <td>2018-09-25</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>977</th>\n",
       "      <td>martha6969</td>\n",
       "      <td>PATRICIA</td>\n",
       "      <td>87.33</td>\n",
       "      <td>2018-09-25</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>987</th>\n",
       "      <td>alvarado</td>\n",
       "      <td>PAMELA</td>\n",
       "      <td>48.74</td>\n",
       "      <td>2018-09-25</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>990</th>\n",
       "      <td>robert</td>\n",
       "      <td>HEATHER.WADE</td>\n",
       "      <td>86.44</td>\n",
       "      <td>2018-09-25</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>992</th>\n",
       "      <td>pamela</td>\n",
       "      <td>CALEB</td>\n",
       "      <td>25.01</td>\n",
       "      <td>2018-09-25</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>88 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                sender      receiver  amount   sent_date\n",
       "2           rose.eaton   EMILY.LEWIS   62.67  2018-02-15\n",
       "5    francis.hernandez        LMOORE   91.46  2018-03-14\n",
       "14              palmer     CHAD.CHEN   36.27  2018-04-07\n",
       "28               elang     DONNA1922   26.07  2018-04-23\n",
       "34               payne   GRIFFIN4992   85.21  2018-04-26\n",
       "..                 ...           ...     ...         ...\n",
       "963        stanley7729  JOSEPH.LOPEZ   50.84  2018-09-25\n",
       "977         martha6969      PATRICIA   87.33  2018-09-25\n",
       "987           alvarado        PAMELA   48.74  2018-09-25\n",
       "990             robert  HEATHER.WADE   86.44  2018-09-25\n",
       "992             pamela         CALEB   25.01  2018-09-25\n",
       "\n",
       "[88 rows x 4 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "transactions[transactions.receiver.str.isupper()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So let's select the rows we want from **`transactions`** and then update the **`receiver`** column to the matching lowercased value.  We can use the [`Series.str.lower`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.lower.html) vectorized method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Update the receiver column of the specific rows that are uppercased.\n",
    "transactions.loc[transactions.receiver.str.isupper(), 'receiver'] = transactions.receiver.str.lower()\n",
    "# Verify that we got them\n",
    "len(transactions[transactions.receiver.str.isupper()])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Learn More\n",
    "As you work on cleaning up datasets, you'll end up in this space quite a bit. Make sure to check out the [documentation on String handling](https://pandas.pydata.org/pandas-docs/stable/api.html#string-handling) so you know what super powers you have on your side."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}