{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Manipulating Text\n", "\n", "Oftentimes, there will be something a bit off with the string data in your dataset. You may want to replace some characters, change the case, or strip the whitespace. You know, anything you normally need to do with strings.\n", "\n", "Now this might lead you to want to loop through each row and manipulate the data, but before you do that, step back and lean into **vectorization**. \n", "\n", "A `Series` provides a way to use vectorized string methods in a property named [`str`](https://pandas.pydata.org/pandas-docs/stable/api.html#string-handling) and the vectorized methods are then available.\n", "\n", "Let's take a look at some examples that require us to use these methods." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Setup\n", "import os\n", "\n", "import pandas as pd\n", "\n", "from utils import make_chaos\n", "\n", "pd.options.display.max_rows = 10\n", "transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)\n", "# Pay no attention to the person behind the curtain\n", "make_chaos(transactions, 42, ['sender'], lambda val: '$' + val)\n", "make_chaos(transactions, 88, ['receiver'], lambda val: val.upper())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Replacing Text\n", "\n", "When CashBox first got started, usernames were allowed to start with a dollar sign. As time progressed, they changed their mind. They made a mass update to the system. However, someone on the Customer Support team reported that there are some records in the **`transactions`** `DataFrame` still showing some senders whose user name still had the $ prefix.\n", "\n", "In order to get ahold of those rows where the sender starts with a $, we can use the [`Series.str.startswith`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.startswith.html#pandas.Series.str.startswith) method. This will return a boolean `Series` which we can use as an index." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | sender | \n", "receiver | \n", "amount | \n", "sent_date | \n", "
---|---|---|---|---|
59 | \n", "$porter | \n", "gail7896 | \n", "75.16 | \n", "2018-05-14 | \n", "
70 | \n", "$emily.lewis | \n", "kevin | \n", "5.49 | \n", "2018-05-21 | \n", "
158 | \n", "$robinson | \n", "rodriguez | \n", "8.91 | \n", "2018-06-25 | \n", "
168 | \n", "$nancy | \n", "margaret265 | \n", "84.15 | \n", "2018-06-26 | \n", "
198 | \n", "$acook | \n", "adam.saunders | \n", "9.31 | \n", "2018-07-04 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
877 | \n", "$april9082 | \n", "jacob.davis | \n", "50.37 | \n", "2018-09-21 | \n", "
889 | \n", "$victor | \n", "anthony1788 | \n", "39.06 | \n", "2018-09-21 | \n", "
900 | \n", "$andersen | \n", "corey.ingram | \n", "4.81 | \n", "2018-09-22 | \n", "
927 | \n", "$janet.williams | \n", "bsmith | \n", "50.15 | \n", "2018-09-23 | \n", "
934 | \n", "$robert8280 | \n", "roger | \n", "98.35 | \n", "2018-09-24 | \n", "
42 rows × 4 columns
\n", "\n", " | sender | \n", "receiver | \n", "amount | \n", "sent_date | \n", "
---|---|---|---|---|
2 | \n", "rose.eaton | \n", "EMILY.LEWIS | \n", "62.67 | \n", "2018-02-15 | \n", "
5 | \n", "francis.hernandez | \n", "LMOORE | \n", "91.46 | \n", "2018-03-14 | \n", "
14 | \n", "palmer | \n", "CHAD.CHEN | \n", "36.27 | \n", "2018-04-07 | \n", "
28 | \n", "elang | \n", "DONNA1922 | \n", "26.07 | \n", "2018-04-23 | \n", "
34 | \n", "payne | \n", "GRIFFIN4992 | \n", "85.21 | \n", "2018-04-26 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
963 | \n", "stanley7729 | \n", "JOSEPH.LOPEZ | \n", "50.84 | \n", "2018-09-25 | \n", "
977 | \n", "martha6969 | \n", "PATRICIA | \n", "87.33 | \n", "2018-09-25 | \n", "
987 | \n", "alvarado | \n", "PAMELA | \n", "48.74 | \n", "2018-09-25 | \n", "
990 | \n", "robert | \n", "HEATHER.WADE | \n", "86.44 | \n", "2018-09-25 | \n", "
992 | \n", "pamela | \n", "CALEB | \n", "25.01 | \n", "2018-09-25 | \n", "
88 rows × 4 columns
\n", "