{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Combining DataFrames\n", "\n", "CashBox has provided us with several separate CSV files. Let's take a look at two files `transactions.csv` and `requests.csv`. Requests are made in the application when one user requests cash from another. Requests are not required for a transaction to occur. \n", "\n", "Let's see if we can't see get a feeling on how many successful requests and payments have been made. In order to do this we will need to combine the two `DataFrame`s." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((998, 4), (313, 4))" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Setup\n", "from datetime import datetime\n", "import os\n", "\n", "import pandas as pd\n", "\n", "from utils import render\n", "\n", "transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)\n", "requests = pd.read_csv(os.path.join('data', 'requests.csv'), index_col=0)\n", "\n", "# Pop out a quick sanity check\n", "(transactions.shape, requests.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's explore our data real quick, by taking a look at the first couple of rows in each `DataFrame`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
senderreceiveramountsent_date
0steinsmoyer49.032018-01-24
1holden4580joshua.henry34.642018-02-06
\n", "
" ], "text/plain": [ " sender receiver amount sent_date\n", "0 stein smoyer 49.03 2018-01-24\n", "1 holden4580 joshua.henry 34.64 2018-02-06" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transactions.head(2)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
from_userto_useramountrequest_date
0chad.chenpaula798078.612018-02-12
1kallenlmoore1.942018-02-23
\n", "
" ], "text/plain": [ " from_user to_user amount request_date\n", "0 chad.chen paula7980 78.61 2018-02-12\n", "1 kallen lmoore 1.94 2018-02-23" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "requests.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I'd like to see all the requests that have a matching transaction based on the users and the amount involved.\n", "\n", "In order to do this we will merge both of our datasets together. \n", "\n", "We'll create a new dataset by using the [`DataFrame.merge`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) method." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
from_userto_useramountrequest_datesenderreceiversent_date
0chad.chenpaula798078.612018-02-12paula7980chad.chen2018-07-15
1kallenlmoore1.942018-02-23lmoorekallen2018-03-05
2gregory.blackwellrodriguez576830.572018-03-04rodriguez5768gregory.blackwell2018-03-17
3kristina.millerjohn.hardy77.052018-03-12john.hardykristina.miller2018-04-25
4lacey8987mcguire54.092018-03-13mcguirelacey89872018-06-28
\n", "
" ], "text/plain": [ " from_user to_user amount request_date sender \\\n", "0 chad.chen paula7980 78.61 2018-02-12 paula7980 \n", "1 kallen lmoore 1.94 2018-02-23 lmoore \n", "2 gregory.blackwell rodriguez5768 30.57 2018-03-04 rodriguez5768 \n", "3 kristina.miller john.hardy 77.05 2018-03-12 john.hardy \n", "4 lacey8987 mcguire 54.09 2018-03-13 mcguire \n", "\n", " receiver sent_date \n", "0 chad.chen 2018-07-15 \n", "1 kallen 2018-03-05 \n", "2 gregory.blackwell 2018-03-17 \n", "3 kristina.miller 2018-04-25 \n", "4 lacey8987 2018-06-28 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Since we are calling merge on the `requests` DataFrame it is considered the left side\n", "successful_requests = requests.merge(\n", " # And transactions is the right side\n", " transactions, \n", " # So now we line up columns that will make the join make sense.\n", " left_on=['from_user', 'to_user', 'amount'], \n", " right_on=['receiver', 'sender', 'amount']\n", ")\n", "# Let's take a look and see how we did\n", "successful_requests.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that since the **`amount`** is the same on both sides of the merge that there is only one column represented." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gather Insights\n", "\n", "So looking at this data merged together, I'd like to see the time difference between when the request was made, and when the money was actually received.\n", "\n", "Good news for us, pandas has very powerful date/time functionality, but in order to get there we're going to need to convert our columns. As you can see, the CSV import did not recognize our date field. **`sent_date`** and **`request_date`** are just plain old objects." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "from_user object\n", "to_user object\n", "amount float64\n", "request_date object\n", "sender object\n", "receiver object\n", "sent_date object\n", "dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "successful_requests.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can convert columns (which remember are a `Series`) on CSV import or just when you need them by using the `pandas.to_datetime` method. There are actually quite a few conversion methods." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "from_user object\n", "to_user object\n", "amount float64\n", "request_date datetime64[ns]\n", "sender object\n", "receiver object\n", "sent_date datetime64[ns]\n", "dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "successful_requests['request_date'] = pd.to_datetime(successful_requests['request_date'])\n", "successful_requests['sent_date'] = pd.to_datetime(successful_requests['sent_date'])\n", "# And now we can see they are converted\n", "successful_requests.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have dates we can subtract them (vectoization ftw!) to create a timedelta. Let's create a new column called **`time_passed`** that stores the result." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "successful_requests['time_passed'] = successful_requests.sent_date - successful_requests.request_date" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's take a look at the top 5 longest request to successful transactions by sorting and limiting, to get a vibe." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
from_userto_useramountrequest_datesenderreceiversent_datetime_passed
0chad.chenpaula798078.612018-02-12paula7980chad.chen2018-07-15153 days
33sthompsonandrade14.072018-05-09andradesthompson2018-09-21135 days
4lacey8987mcguire54.092018-03-13mcguirelacey89872018-06-28107 days
53marcus.berrymelissa.mendoza71.482018-05-31melissa.mendozamarcus.berry2018-09-0698 days
39bishopmassey210218.272018-05-16massey2102bishop2018-08-1591 days
\n", "
" ], "text/plain": [ " from_user to_user amount request_date sender \\\n", "0 chad.chen paula7980 78.61 2018-02-12 paula7980 \n", "33 sthompson andrade 14.07 2018-05-09 andrade \n", "4 lacey8987 mcguire 54.09 2018-03-13 mcguire \n", "53 marcus.berry melissa.mendoza 71.48 2018-05-31 melissa.mendoza \n", "39 bishop massey2102 18.27 2018-05-16 massey2102 \n", "\n", " receiver sent_date time_passed \n", "0 chad.chen 2018-07-15 153 days \n", "33 sthompson 2018-09-21 135 days \n", "4 lacey8987 2018-06-28 107 days \n", "53 marcus.berry 2018-09-06 98 days \n", "39 bishop 2018-08-15 91 days " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "successful_requests.sort_values(by='time_passed', ascending=False).head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hmm, that makes wonder how much money passed through the request and transaction system. Let's see." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Wow! $10,496.47 has passed through the request system in 214 transactions!!!'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"Wow! ${:,.2f} has passed through the request system in {} transactions!!!\".format(\n", " successful_requests.amount.sum(),\n", " len(successful_requests),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Further research\n", "I saw something a little strange as I was looking through those **`successful_requests`**, it noticed a couple of what seemed like duplicated requests. I called my contact at CashBox and asked about possible duplicate requests. Sure enough, the application allows you to send multiple requests for the same amount. \n", "\n", "So this means there are probably duplicates in our **`successful_requests`** `DataFrame` because there are duplicates in the **`requests`**. There is most likely only one transaction that fulfills the request, but there could be multiple requests that match. Our merge brought that duplication across as well.\n", "\n", "Let's explore the possible duplicates in the **`requests`** `DataFrame`. There is a method [`DataFrame.duplicated`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html) that will return a boolean `Series` which we can use as an index. A `keep` parameter is available which is used to choose which of the duplicated rows to mark as a duplicate. You can mark the first, last or all of them." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
from_userto_useramountrequest_date
58austin486shelly11.242018-05-29
59austin486shelly11.242018-05-29
8cjimenezsarah.evans48.142018-03-21
26cjimenezsarah.evans48.142018-04-27
218clark8139moore14.542018-08-31
240clark8139moore14.542018-09-10
195diane4652dean23656.822018-08-21
224diane4652dean23656.822018-09-05
143donna1922danderson79.222018-07-23
157donna1922danderson79.222018-07-31
116edwards8658johnston228147.292018-07-10
127edwards8658johnston228147.292018-07-16
289heathersmercer98.132018-09-22
290heathersmercer98.132018-09-22
32hoover4302dennis87.772018-05-04
33hoover4302dennis87.772018-05-05
160jcolemanfuentes40.482018-08-03
216jcolemanfuentes40.482018-08-29
82jonesjones579695.882018-06-21
95jonesjones579695.882018-06-27
84kimberlywbrown9.022018-06-24
101kimberlywbrown9.022018-06-28
60kirkjason36.532018-05-30
85kirkjason36.532018-06-24
165lawrence1685joshua969099.962018-08-07
168lawrence1685joshua969099.962018-08-08
114lspencersarah44.892018-07-09
194lspencersarah44.892018-08-21
44matthew7940sherry611358.722018-05-13
108matthew7940sherry611358.722018-07-04
201michael7792jennifer90.562018-08-23
241michael7792jennifer90.562018-09-10
75nicholas.traviscorey73.252018-06-15
109nicholas.traviscorey73.252018-07-04
307patriciamartha696987.332018-09-25
312patriciamartha696987.332018-09-25
10paula7980mackenzie56.002018-03-29
15paula7980mackenzie56.002018-04-15
68paula7980mackenzie56.002018-06-02
221phillips5654miguel.gamble11.352018-09-01
260phillips5654miguel.gamble11.352018-09-15
139rebecca848mpaul1.342018-07-21
229rebecca848mpaul1.342018-09-06
115rodriguez9475miller855225.892018-07-09
121rodriguez9475miller855225.892018-07-13
\n", "
" ], "text/plain": [ " from_user to_user amount request_date\n", "58 austin486 shelly 11.24 2018-05-29\n", "59 austin486 shelly 11.24 2018-05-29\n", "8 cjimenez sarah.evans 48.14 2018-03-21\n", "26 cjimenez sarah.evans 48.14 2018-04-27\n", "218 clark8139 moore 14.54 2018-08-31\n", "240 clark8139 moore 14.54 2018-09-10\n", "195 diane4652 dean2365 6.82 2018-08-21\n", "224 diane4652 dean2365 6.82 2018-09-05\n", "143 donna1922 danderson 79.22 2018-07-23\n", "157 donna1922 danderson 79.22 2018-07-31\n", "116 edwards8658 johnston2281 47.29 2018-07-10\n", "127 edwards8658 johnston2281 47.29 2018-07-16\n", "289 heather smercer 98.13 2018-09-22\n", "290 heather smercer 98.13 2018-09-22\n", "32 hoover4302 dennis 87.77 2018-05-04\n", "33 hoover4302 dennis 87.77 2018-05-05\n", "160 jcoleman fuentes 40.48 2018-08-03\n", "216 jcoleman fuentes 40.48 2018-08-29\n", "82 jones jones5796 95.88 2018-06-21\n", "95 jones jones5796 95.88 2018-06-27\n", "84 kimberly wbrown 9.02 2018-06-24\n", "101 kimberly wbrown 9.02 2018-06-28\n", "60 kirk jason 36.53 2018-05-30\n", "85 kirk jason 36.53 2018-06-24\n", "165 lawrence1685 joshua9690 99.96 2018-08-07\n", "168 lawrence1685 joshua9690 99.96 2018-08-08\n", "114 lspencer sarah 44.89 2018-07-09\n", "194 lspencer sarah 44.89 2018-08-21\n", "44 matthew7940 sherry6113 58.72 2018-05-13\n", "108 matthew7940 sherry6113 58.72 2018-07-04\n", "201 michael7792 jennifer 90.56 2018-08-23\n", "241 michael7792 jennifer 90.56 2018-09-10\n", "75 nicholas.travis corey 73.25 2018-06-15\n", "109 nicholas.travis corey 73.25 2018-07-04\n", "307 patricia martha6969 87.33 2018-09-25\n", "312 patricia martha6969 87.33 2018-09-25\n", "10 paula7980 mackenzie 56.00 2018-03-29\n", "15 paula7980 mackenzie 56.00 2018-04-15\n", "68 paula7980 mackenzie 56.00 2018-06-02\n", "221 phillips5654 miguel.gamble 11.35 2018-09-01\n", "260 phillips5654 miguel.gamble 11.35 2018-09-15\n", "139 rebecca848 mpaul 1.34 2018-07-21\n", "229 rebecca848 mpaul 1.34 2018-09-06\n", "115 rodriguez9475 miller8552 25.89 2018-07-09\n", "121 rodriguez9475 miller8552 25.89 2018-07-13" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a boolean Series of records that are duplicated. \n", "# Note that `keep=False` marks all that are duplicated\n", "dupes = requests[requests.duplicated(('from_user', 'to_user', 'amount'), keep=False)]\n", "# Order by requester and the date of request. \n", "# Note that `request_date` in this case is a string, but this string date format sorts properly still.\n", "dupes.sort_values(['from_user', 'request_date'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We should get those duplicates out of our successful requests. Let's take a look at some more tools that will help us do cleanup like this one." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }