{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Practice Optimizing Dataframes and Processing in Chunks\n",
    "\n",
    "In this project, we'll practice working with chunked dataframes and optimizing a dataframe's memory usage. We'll be working with financial lending data from [Lending Club], a marketplace for personal loans that matches borrowers with investors. You can read more about the marketplace on its [website].\n",
    "\n",
    "The Lending Club's website lists approved loans. Qualified investors can view the borrower's credit score, the purpose of the loan, and other details in the loan applications. Once a lender is ready to back a loan, it selects the amount of money it wants to fund. When the loan amount the borrower requested is fully funded, the borrower receives the money, minus the [origination fee] that Lending Club charges.\n",
    "\n",
    "We'll be working with a dataset of loans approved from `2007-2011`, which can be downloaded from [Lending Club's website]. The `desc` column has been removed to improve system performance.\n",
    "\n",
    "If we read in the entire data set, it will consume about 67 megabytes of memory. For learning purposes, let's imagine that we only have 10 megabytes of memory available throughout this project, so we can practice the concepts we learned in the last two missions.\n",
    "\n",
    "[Lending Club]: https://www.lendingclub.com/\n",
    "[website]: https://www.lendingclub.com/public/how-peer-lending-works.action\n",
    "[origination fee]: https://help.lendingclub.com/hc/en-us/articles/214501207-What-is-the-origination-fee-\n",
    "[Lending Club's website]: https://www.lendingclub.com/info/download-data.action"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import pprint as pp\n",
    "import matplotlib.pyplot as plt\n",
    "pd.options.display.max_columns = 99"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "loans_2007.csv\r\n",
      "project02_practice_optimizing_dataframes_and_processing_in_chunks.ipynb\r\n"
     ]
    }
   ],
   "source": [
    "!ls"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Introduction\n",
    "\n",
    "- Read in the first five lines from `loans_2007.csv` and look for any data quality issues.\n",
    "- Read in the first 1000 rows from the data set, and calculate the total memory usage for these rows. Increase or decrease the number of rows to converge on a memory usage under `five megabytes` (to stay on the conservative side)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>member_id</th>\n",
       "      <th>loan_amnt</th>\n",
       "      <th>funded_amnt</th>\n",
       "      <th>funded_amnt_inv</th>\n",
       "      <th>term</th>\n",
       "      <th>int_rate</th>\n",
       "      <th>installment</th>\n",
       "      <th>grade</th>\n",
       "      <th>sub_grade</th>\n",
       "      <th>emp_title</th>\n",
       "      <th>emp_length</th>\n",
       "      <th>home_ownership</th>\n",
       "      <th>annual_inc</th>\n",
       "      <th>verification_status</th>\n",
       "      <th>issue_d</th>\n",
       "      <th>loan_status</th>\n",
       "      <th>pymnt_plan</th>\n",
       "      <th>purpose</th>\n",
       "      <th>title</th>\n",
       "      <th>zip_code</th>\n",
       "      <th>addr_state</th>\n",
       "      <th>dti</th>\n",
       "      <th>delinq_2yrs</th>\n",
       "      <th>earliest_cr_line</th>\n",
       "      <th>inq_last_6mths</th>\n",
       "      <th>open_acc</th>\n",
       "      <th>pub_rec</th>\n",
       "      <th>revol_bal</th>\n",
       "      <th>revol_util</th>\n",
       "      <th>total_acc</th>\n",
       "      <th>initial_list_status</th>\n",
       "      <th>out_prncp</th>\n",
       "      <th>out_prncp_inv</th>\n",
       "      <th>total_pymnt</th>\n",
       "      <th>total_pymnt_inv</th>\n",
       "      <th>total_rec_prncp</th>\n",
       "      <th>total_rec_int</th>\n",
       "      <th>total_rec_late_fee</th>\n",
       "      <th>recoveries</th>\n",
       "      <th>collection_recovery_fee</th>\n",
       "      <th>last_pymnt_d</th>\n",
       "      <th>last_pymnt_amnt</th>\n",
       "      <th>last_credit_pull_d</th>\n",
       "      <th>collections_12_mths_ex_med</th>\n",
       "      <th>policy_code</th>\n",
       "      <th>application_type</th>\n",
       "      <th>acc_now_delinq</th>\n",
       "      <th>chargeoff_within_12_mths</th>\n",
       "      <th>delinq_amnt</th>\n",
       "      <th>pub_rec_bankruptcies</th>\n",
       "      <th>tax_liens</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1077501</td>\n",
       "      <td>1296599.0</td>\n",
       "      <td>5000.0</td>\n",
       "      <td>5000.0</td>\n",
       "      <td>4975.0</td>\n",
       "      <td>36 months</td>\n",
       "      <td>10.65%</td>\n",
       "      <td>162.87</td>\n",
       "      <td>B</td>\n",
       "      <td>B2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>10+ years</td>\n",
       "      <td>RENT</td>\n",
       "      <td>24000.0</td>\n",
       "      <td>Verified</td>\n",
       "      <td>Dec-2011</td>\n",
       "      <td>Fully Paid</td>\n",
       "      <td>n</td>\n",
       "      <td>credit_card</td>\n",
       "      <td>Computer</td>\n",
       "      <td>860xx</td>\n",
       "      <td>AZ</td>\n",
       "      <td>27.65</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Jan-1985</td>\n",
       "      <td>1.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>13648.0</td>\n",
       "      <td>83.7%</td>\n",
       "      <td>9.0</td>\n",
       "      <td>f</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "      <td>5863.155187</td>\n",
       "      <td>5833.84</td>\n",
       "      <td>5000.00</td>\n",
       "      <td>863.16</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "      <td>Jan-2015</td>\n",
       "      <td>171.62</td>\n",
       "      <td>Jun-2016</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>INDIVIDUAL</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1077430</td>\n",
       "      <td>1314167.0</td>\n",
       "      <td>2500.0</td>\n",
       "      <td>2500.0</td>\n",
       "      <td>2500.0</td>\n",
       "      <td>60 months</td>\n",
       "      <td>15.27%</td>\n",
       "      <td>59.83</td>\n",
       "      <td>C</td>\n",
       "      <td>C4</td>\n",
       "      <td>Ryder</td>\n",
       "      <td>&lt; 1 year</td>\n",
       "      <td>RENT</td>\n",
       "      <td>30000.0</td>\n",
       "      <td>Source Verified</td>\n",
       "      <td>Dec-2011</td>\n",
       "      <td>Charged Off</td>\n",
       "      <td>n</td>\n",
       "      <td>car</td>\n",
       "      <td>bike</td>\n",
       "      <td>309xx</td>\n",
       "      <td>GA</td>\n",
       "      <td>1.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Apr-1999</td>\n",
       "      <td>5.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1687.0</td>\n",
       "      <td>9.4%</td>\n",
       "      <td>4.0</td>\n",
       "      <td>f</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "      <td>1008.710000</td>\n",
       "      <td>1008.71</td>\n",
       "      <td>456.46</td>\n",
       "      <td>435.17</td>\n",
       "      <td>0.00</td>\n",
       "      <td>117.08</td>\n",
       "      <td>1.11</td>\n",
       "      <td>Apr-2013</td>\n",
       "      <td>119.66</td>\n",
       "      <td>Sep-2013</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>INDIVIDUAL</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1077175</td>\n",
       "      <td>1313524.0</td>\n",
       "      <td>2400.0</td>\n",
       "      <td>2400.0</td>\n",
       "      <td>2400.0</td>\n",
       "      <td>36 months</td>\n",
       "      <td>15.96%</td>\n",
       "      <td>84.33</td>\n",
       "      <td>C</td>\n",
       "      <td>C5</td>\n",
       "      <td>NaN</td>\n",
       "      <td>10+ years</td>\n",
       "      <td>RENT</td>\n",
       "      <td>12252.0</td>\n",
       "      <td>Not Verified</td>\n",
       "      <td>Dec-2011</td>\n",
       "      <td>Fully Paid</td>\n",
       "      <td>n</td>\n",
       "      <td>small_business</td>\n",
       "      <td>real estate business</td>\n",
       "      <td>606xx</td>\n",
       "      <td>IL</td>\n",
       "      <td>8.72</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Nov-2001</td>\n",
       "      <td>2.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2956.0</td>\n",
       "      <td>98.5%</td>\n",
       "      <td>10.0</td>\n",
       "      <td>f</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "      <td>3005.666844</td>\n",
       "      <td>3005.67</td>\n",
       "      <td>2400.00</td>\n",
       "      <td>605.67</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "      <td>Jun-2014</td>\n",
       "      <td>649.91</td>\n",
       "      <td>Jun-2016</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>INDIVIDUAL</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1076863</td>\n",
       "      <td>1277178.0</td>\n",
       "      <td>10000.0</td>\n",
       "      <td>10000.0</td>\n",
       "      <td>10000.0</td>\n",
       "      <td>36 months</td>\n",
       "      <td>13.49%</td>\n",
       "      <td>339.31</td>\n",
       "      <td>C</td>\n",
       "      <td>C1</td>\n",
       "      <td>AIR RESOURCES BOARD</td>\n",
       "      <td>10+ years</td>\n",
       "      <td>RENT</td>\n",
       "      <td>49200.0</td>\n",
       "      <td>Source Verified</td>\n",
       "      <td>Dec-2011</td>\n",
       "      <td>Fully Paid</td>\n",
       "      <td>n</td>\n",
       "      <td>other</td>\n",
       "      <td>personel</td>\n",
       "      <td>917xx</td>\n",
       "      <td>CA</td>\n",
       "      <td>20.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Feb-1996</td>\n",
       "      <td>1.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>5598.0</td>\n",
       "      <td>21%</td>\n",
       "      <td>37.0</td>\n",
       "      <td>f</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "      <td>12231.890000</td>\n",
       "      <td>12231.89</td>\n",
       "      <td>10000.00</td>\n",
       "      <td>2214.92</td>\n",
       "      <td>16.97</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "      <td>Jan-2015</td>\n",
       "      <td>357.48</td>\n",
       "      <td>Apr-2016</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>INDIVIDUAL</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1075358</td>\n",
       "      <td>1311748.0</td>\n",
       "      <td>3000.0</td>\n",
       "      <td>3000.0</td>\n",
       "      <td>3000.0</td>\n",
       "      <td>60 months</td>\n",
       "      <td>12.69%</td>\n",
       "      <td>67.79</td>\n",
       "      <td>B</td>\n",
       "      <td>B5</td>\n",
       "      <td>University Medical Group</td>\n",
       "      <td>1 year</td>\n",
       "      <td>RENT</td>\n",
       "      <td>80000.0</td>\n",
       "      <td>Source Verified</td>\n",
       "      <td>Dec-2011</td>\n",
       "      <td>Current</td>\n",
       "      <td>n</td>\n",
       "      <td>other</td>\n",
       "      <td>Personal</td>\n",
       "      <td>972xx</td>\n",
       "      <td>OR</td>\n",
       "      <td>17.94</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Jan-1996</td>\n",
       "      <td>0.0</td>\n",
       "      <td>15.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>27783.0</td>\n",
       "      <td>53.9%</td>\n",
       "      <td>38.0</td>\n",
       "      <td>f</td>\n",
       "      <td>461.73</td>\n",
       "      <td>461.73</td>\n",
       "      <td>3581.120000</td>\n",
       "      <td>3581.12</td>\n",
       "      <td>2538.27</td>\n",
       "      <td>1042.85</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "      <td>Jun-2016</td>\n",
       "      <td>67.79</td>\n",
       "      <td>Jun-2016</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>INDIVIDUAL</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        id  member_id  loan_amnt  funded_amnt  funded_amnt_inv        term  \\\n",
       "0  1077501  1296599.0     5000.0       5000.0           4975.0   36 months   \n",
       "1  1077430  1314167.0     2500.0       2500.0           2500.0   60 months   \n",
       "2  1077175  1313524.0     2400.0       2400.0           2400.0   36 months   \n",
       "3  1076863  1277178.0    10000.0      10000.0          10000.0   36 months   \n",
       "4  1075358  1311748.0     3000.0       3000.0           3000.0   60 months   \n",
       "\n",
       "  int_rate  installment grade sub_grade                 emp_title emp_length  \\\n",
       "0   10.65%       162.87     B        B2                       NaN  10+ years   \n",
       "1   15.27%        59.83     C        C4                     Ryder   < 1 year   \n",
       "2   15.96%        84.33     C        C5                       NaN  10+ years   \n",
       "3   13.49%       339.31     C        C1       AIR RESOURCES BOARD  10+ years   \n",
       "4   12.69%        67.79     B        B5  University Medical Group     1 year   \n",
       "\n",
       "  home_ownership  annual_inc verification_status   issue_d  loan_status  \\\n",
       "0           RENT     24000.0            Verified  Dec-2011   Fully Paid   \n",
       "1           RENT     30000.0     Source Verified  Dec-2011  Charged Off   \n",
       "2           RENT     12252.0        Not Verified  Dec-2011   Fully Paid   \n",
       "3           RENT     49200.0     Source Verified  Dec-2011   Fully Paid   \n",
       "4           RENT     80000.0     Source Verified  Dec-2011      Current   \n",
       "\n",
       "  pymnt_plan         purpose                 title zip_code addr_state    dti  \\\n",
       "0          n     credit_card              Computer    860xx         AZ  27.65   \n",
       "1          n             car                  bike    309xx         GA   1.00   \n",
       "2          n  small_business  real estate business    606xx         IL   8.72   \n",
       "3          n           other              personel    917xx         CA  20.00   \n",
       "4          n           other              Personal    972xx         OR  17.94   \n",
       "\n",
       "   delinq_2yrs earliest_cr_line  inq_last_6mths  open_acc  pub_rec  revol_bal  \\\n",
       "0          0.0         Jan-1985             1.0       3.0      0.0    13648.0   \n",
       "1          0.0         Apr-1999             5.0       3.0      0.0     1687.0   \n",
       "2          0.0         Nov-2001             2.0       2.0      0.0     2956.0   \n",
       "3          0.0         Feb-1996             1.0      10.0      0.0     5598.0   \n",
       "4          0.0         Jan-1996             0.0      15.0      0.0    27783.0   \n",
       "\n",
       "  revol_util  total_acc initial_list_status  out_prncp  out_prncp_inv  \\\n",
       "0      83.7%        9.0                   f       0.00           0.00   \n",
       "1       9.4%        4.0                   f       0.00           0.00   \n",
       "2      98.5%       10.0                   f       0.00           0.00   \n",
       "3        21%       37.0                   f       0.00           0.00   \n",
       "4      53.9%       38.0                   f     461.73         461.73   \n",
       "\n",
       "    total_pymnt  total_pymnt_inv  total_rec_prncp  total_rec_int  \\\n",
       "0   5863.155187          5833.84          5000.00         863.16   \n",
       "1   1008.710000          1008.71           456.46         435.17   \n",
       "2   3005.666844          3005.67          2400.00         605.67   \n",
       "3  12231.890000         12231.89         10000.00        2214.92   \n",
       "4   3581.120000          3581.12          2538.27        1042.85   \n",
       "\n",
       "   total_rec_late_fee  recoveries  collection_recovery_fee last_pymnt_d  \\\n",
       "0                0.00        0.00                     0.00     Jan-2015   \n",
       "1                0.00      117.08                     1.11     Apr-2013   \n",
       "2                0.00        0.00                     0.00     Jun-2014   \n",
       "3               16.97        0.00                     0.00     Jan-2015   \n",
       "4                0.00        0.00                     0.00     Jun-2016   \n",
       "\n",
       "   last_pymnt_amnt last_credit_pull_d  collections_12_mths_ex_med  \\\n",
       "0           171.62           Jun-2016                         0.0   \n",
       "1           119.66           Sep-2013                         0.0   \n",
       "2           649.91           Jun-2016                         0.0   \n",
       "3           357.48           Apr-2016                         0.0   \n",
       "4            67.79           Jun-2016                         0.0   \n",
       "\n",
       "   policy_code application_type  acc_now_delinq  chargeoff_within_12_mths  \\\n",
       "0          1.0       INDIVIDUAL             0.0                       0.0   \n",
       "1          1.0       INDIVIDUAL             0.0                       0.0   \n",
       "2          1.0       INDIVIDUAL             0.0                       0.0   \n",
       "3          1.0       INDIVIDUAL             0.0                       0.0   \n",
       "4          1.0       INDIVIDUAL             0.0                       0.0   \n",
       "\n",
       "   delinq_amnt  pub_rec_bankruptcies  tax_liens  \n",
       "0          0.0                   0.0        0.0  \n",
       "1          0.0                   0.0        0.0  \n",
       "2          0.0                   0.0        0.0  \n",
       "3          0.0                   0.0        0.0  \n",
       "4          0.0                   0.0        0.0  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "first_five = pd.read_csv('loans_2007.csv', nrows=5)\n",
    "first_five.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Observations**:\n",
    "\n",
    "- `emp_title` has some missing values\n",
    "- `emp_length` could be cleaned up into a numeric column instead of a string\n",
    "- Lots of columns can be converted to categories or boolean types"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD8CAYAAABn919SAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAD7lJREFUeJzt3X+MJ3V9x/Hnq3dQrdiA3qIXTjxqsUobOe16paWtiNoiNAUqTaVGSUtyttVGU21Bk/qjrSkmVYxN1ZxCvTZWtIiFINoShBKjont6nIeHBSm1JxduqSJiG+rBu3/sEDfn7n1n9/v97nf30+cj+WZn5vv57rx2cvvaufnOdyZVhSRp7fuRSQeQJI2GhS5JjbDQJakRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqxPqVXNmGDRtq8+bNK7lKSVrzdu7ceV9VTQ0at6KFvnnzZmZmZlZylZK05iX5jz7jPOQiSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNWNFPikrSJG2++BMTW/fdl5w19nW4hy5JjbDQJakRFrokNcJCl6RGWOiS1IjehZ5kXZIvJ7m2mz8hyS1J7kjykSRHji+mJGmQpeyhvwbYO2/+7cClVXUi8G3gwlEGkyQtTa9CT7IJOAv4QDcf4HTgym7IDuCccQSUJPXTdw/9XcCfAI90808E7q+qg938PuC4EWeTJC3BwEJP8mvAgaraOX/xAkNrkddvSzKTZGZ2dnaZMSVJg/TZQz8V+PUkdwNXMHeo5V3A0UkevXTAJuCehV5cVdurarqqpqemBt60WpK0TAMLvareUFWbqmoz8FLg01X1MuBG4Lxu2AXA1WNLKUkaaJjz0C8C/ijJncwdU79sNJEkScuxpKstVtVNwE3d9F3A1tFHkiQth58UlaRGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqREWuiQ1os9Noh+T5AtJbk1yW5K3dss/mOTfk+zqHlvGH1eStJg+dyx6CDi9qh5McgTwmSSf7J7746q6cnzxJEl9DSz0qirgwW72iO5R4wwlSVq6XsfQk6xLsgs4AFxfVbd0T70tye4klyb50UVeuy3JTJKZ2dnZEcWWJB2qV6FX1cNVtQXYBGxN8jPAG4BnAM8FngBctMhrt1fVdFVNT01NjSi2JOlQSzrLparuB24Czqiq/TXnIeBvga1jyCdJ6qnPWS5TSY7uph8LvBC4PcnGblmAc4A94wwqSTq8Pme5bAR2JFnH3B+Aj1bVtUk+nWQKCLAL+L0x5pQkDdDnLJfdwLMXWH76WBJJkpbFT4pKUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUiD53LHpMki8kuTXJbUne2i0/IcktSe5I8pEkR44/riRpMX320B8CTq+qk4EtwBlJTgHeDlxaVScC3wYuHF9MSdIgAwu9uxH0g93sEd2jgNOBK7vlO5i7r6gkaUJ6HUNPsi7JLuAAcD3wdeD+qjrYDdkHHDeeiJKkPnoVelU9XFVbgE3AVuCZCw1b6LVJtiWZSTIzOzu7/KSSpMNa0lkuVXU/cBNwCnB0kkdvMr0JuGeR12yvqumqmp6amhomqyTpMPqc5TKV5Ohu+rHAC4G9wI3Aed2wC4CrxxVSkjTY+sFD2AjsSLKOuT8AH62qa5N8FbgiyV8AXwYuG2NOSdIAAwu9qnYDz15g+V3MHU+XJK0CflJUkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktSIPrege0qSG5PsTXJbktd0y9+S5JtJdnWPM8cfV5K0mD63oDsIvK6qvpTk8cDOJNd3z11aVX81vniSpL763IJuP7C/m/5ukr3AceMOJklamiUdQ0+ymbn7i97SLXp1kt1JLk9yzIizSZKWoHehJzkK+Bjw2qp6AHgv8DRgC3N78O9Y5HXbkswkmZmdnR1BZEnSQnoVepIjmCvzD1XVVQBVdW9VPVxVjwDvB7Yu9Nqq2l5V01U1PTU1NarckqRD9DnLJcBlwN6qeue85RvnDTsX2DP6eJKkvvqc5XIq8HLgK0l2dcveCJyfZAtQwN3AK8eSUJLUS5+zXD4DZIGnrht9HEnScvlJUUlqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSI/rcgu4pSW5MsjfJbUle0y1/QpLrk9zRfT1m/HElSYvps4d+EHhdVT0TOAV4VZKTgIuBG6rqROCGbl6SNCEDC72q9lfVl7rp7wJ7geOAs4Ed3bAdwDnjCilJGmxJx9CTbAaeDdwCPKmq9sNc6QPHLvKabUlmkszMzs4Ol1aStKjehZ7kKOBjwGur6oG+r6uq7VU1XVXTU1NTy8koSeqhV6EnOYK5Mv9QVV3VLb43ycbu+Y3AgfFElCT10ecslwCXAXur6p3znroGuKCbvgC4evTxJEl9re8x5lTg5cBXkuzqlr0RuAT4aJILgW8AvzmeiJKkPgYWelV9BsgiT79gtHEkScvlJ0UlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqRJ87Fl2e5ECSPfOWvSXJN5Ps6h5njjemJGmQPnvoHwTOWGD5pVW1pXtcN9pYkqSlGljoVXUz8K0VyCJJGsIwx9BfnWR3d0jmmJElkiQty3IL/b3A04AtwH7gHYsNTLItyUySmdnZ2WWuTpI0yLIKvaruraqHq+oR4P3A1sOM3V5V01U1PTU1tdyckqQBllXoSTbOmz0X2LPYWEnSylg/aECSDwOnARuS7APeDJyWZAtQwN3AK8eYUZLUw8BCr6rzF1h82RiySJKG4CdFJakRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNGFjoSS5PciDJnnnLnpDk+iR3dF+PGW9MSdIgffbQPwiccciyi4EbqupE4IZuXpI0QQMLvapuBr51yOKzgR3d9A7gnBHnkiQt0XKPoT+pqvYDdF+PXWxgkm1JZpLMzM7OLnN1kqRBxv6maFVtr6rpqpqempoa9+ok6f+t5Rb6vUk2AnRfD4wukiRpOZZb6NcAF3TTFwBXjyaOJGm5+py2+GHgc8BPJdmX5ELgEuBFSe4AXtTNS5ImaP2gAVV1/iJPvWDEWSRJQ/CTopLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDVi4PXQDyfJ3cB3gYeBg1U1PYpQkqSlG6rQO8+vqvtG8H0kSUPwkIskNWLYQi/gX5LsTLJtFIEkScsz7CGXU6vqniTHAtcnub2qbp4/oCv6bQDHH3/8kKuTJC1mqD30qrqn+3oA+DiwdYEx26tquqqmp6amhlmdJOkwll3oSR6X5PGPTgO/AuwZVTBJ0tIMc8jlScDHkzz6ff6hqj41klSSpCVbdqFX1V3AySPMIkkagqctSlIjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1Ihh7ym6YjZf/ImJrfvuS86a2Lolqa+h9tCTnJHka0nuTHLxqEJJkpZumHuKrgP+BngxcBJwfpKTRhVMkrQ0w+yhbwXurKq7qup/gSuAs0cTS5K0VMMU+nHAf86b39ctkyRNwDBvimaBZfVDg5JtwLZu9sEkXxtinYezAbhvHN84bx/ptxtbzhFaCxnBnKO0FjLC2si5YMYhe+SpfQYNU+j7gKfMm98E3HPooKraDmwfYj29JJmpqulxr2dYayHnWsgI5hyltZAR1kbOSWYc5pDLF4ETk5yQ5EjgpcA1o4klSVqqZe+hV9XBJK8G/hlYB1xeVbeNLJkkaUmG+mBRVV0HXDeiLMMa+2GdEVkLOddCRjDnKK2FjLA2ck4sY6p+6H1MSdIa5LVcJKkRa67QB11uIMlTk9yQZHeSm5JsmkDGy5McSLJnkeeT5N3dz7A7yXNWOmOXY1DOZyT5XJKHkrx+pfN1GQZlfFm3DXcn+WySk1c6Y5djUM6zu4y7kswk+cXVlnHeuOcmeTjJeSuV7ZD1D9qWpyX5TrctdyV500pn7HIM3J5d1l1Jbkvyr2MPVVVr5sHcm69fB34COBK4FTjpkDH/CFzQTZ8O/P0Ecv4y8BxgzyLPnwl8krlz+U8BbpnQ9hyU81jgucDbgNev0oy/ABzTTb94FW/Lo/jBIc5nAbevtozdmHXAp5l7b+y8VbotTwOunUS2JeY8GvgqcHw3f+y4M621PfQ+lxs4Cbihm75xgefHrqpuBr51mCFnA39Xcz4PHJ1k48qk+4FBOavqQFV9Efj+yqX6oQyDMn62qr7dzX6euc9DrLgeOR+s7rcaeBwLfAhv3Hr8uwT4Q+BjwIHxJ1pYz5wT1yPnbwNXVdU3uvFj36ZrrdD7XG7gVuAl3fS5wOOTPHEFsi2Fl00YjwuZ+5/PqpTk3CS3A58AfnfSeQ6V5DjmfmfeN+ksPfx8kluTfDLJT086zCKeDhzTHfrdmeQV417hWiv0PpcbeD3wvCRfBp4HfBM4OO5gS9TrsgnqL8nzmSv0iyadZTFV9fGqegZwDvDnk86zgHcBF1XVw5MOMsCXgKdW1cnAXwP/NOE8i1kP/CxwFvCrwJ8mefq4V7iWDLzcQFXdA/wGQJKjgJdU1XdWLGE/vS6boH6SPAv4APDiqvqvSecZpKpuTvK0JBuqajVdl2QauCIJzF2P5MwkB6tqVRVmVT0wb/q6JO9ZhdsS5n7P76uq7wHfS3IzcDLwb+Na4VrbQx94uYEkG5I8+nO9Abh8hTP2cQ3wiu5sl1OA71TV/kmHWouSHA9cBby8qsb2izKsJD+Zrim7s5qOBFbVH5+qOqGqNlfVZuBK4A9WW5kDJHnyvG25lbkeW1XbsnM18EtJ1if5MeDngL3jXOGa2kOvRS43kOTPgJmquoa5d8D/MkkBNwOvWumcST7c5diQZB/wZuCI7md4H3NnEJwJ3An8N/A7K52xT84kTwZmgB8HHknyWubOKnpgkW+54hmBNwFPBN7T/Y4frAlcGKlHzpcw90f8+8D/AL81703S1ZJxVeiR8zzg95McZG5bvnSlt2WfnFW1N8mngN3AI8AHquqwp4wOnWkC20GSNAZr7ZCLJGkRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY34P+th6qVakV8FAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "thousand_chunks = pd.read_csv('loans_2007.csv', chunksize=1000)\n",
    "\n",
    "memory_footprints = []\n",
    "\n",
    "for chunk in thousand_chunks:\n",
    "    chunk_memory_mb = chunk.memory_usage(deep=True).sum() / 1048576\n",
    "    memory_footprints.append(chunk_memory_mb)\n",
    "\n",
    "plt.hist(memory_footprints)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Chunks of 1000 rows is too small $=>$ Each chunk is no larger than 1.7 MB. Let's increase the chunk size to 3000:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD8CAYAAABn919SAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAADfZJREFUeJzt3X+MZeVdx/H3pwy15YdS3WmLLHTQNCRIquCEUEmwgdKsQMBE/lgSKlTMxp+l/giCTST6F0ZTq9ZIVkCopbSGgiItFSwlxISuzi5Qli5tSV3pFnSHEqG1Rlz79Y+56DrM7Nx7z5l7Z5++X8lk7j3nmft88rD3w5lz77mTqkKSdPh7zbQDSJL6YaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGjEzyck2bdpUc3Nzk5xSkg57O3fufL6qZtcaN9FCn5ubY2FhYZJTStJhL8k/DzPOUy6S1AgLXZIaYaFLUiMsdElqhIUuSY1Ys9CT3JJkf5LdK+z79SSVZNP6xJMkDWuYI/RbgS3LNyY5ETgfeKbnTJKkMaxZ6FX1MPDCCrv+ALgG8G/YSdIGMNY59CQXA1+rqsd7ziNJGtPIV4omOQp4P/CuIcdvA7YBnHTSSaNOJ0m9mbv2k1Obe+8NF677HOMcof8gcDLweJK9wGZgV5I3rzS4qrZX1XxVzc/OrvlRBJKkMY18hF5VTwBvfOX+oNTnq+r5HnNJkkY0zNsW7wAeAU5Jsi/JVesfS5I0qjWP0KvqsjX2z/WWRpI0Nq8UlaRGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWrEmoWe5JYk+5PsPmjb7yV5Ksnnk9yd5Lj1jSlJWsswR+i3AluWbXsAOK2q3gZ8Cbiu51ySpBGtWehV9TDwwrJt91fVgcHdzwGb1yGbJGkEfZxD/xngvtV2JtmWZCHJwuLiYg/TSZJW0qnQk7wfOADcvtqYqtpeVfNVNT87O9tlOknSIcyM+4NJrgAuAs6rquovkiRpHGMVepItwG8AP15V3+o3kiRpHMO8bfEO4BHglCT7klwFfAg4FnggyWNJblznnJKkNax5hF5Vl62w+eZ1yCJJ6sArRSWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1Ig1Cz3JLUn2J9l90LbvTfJAki8Pvr9hfWNKktYyzBH6rcCWZduuBT5TVW8FPjO4L0maojULvaoeBl5YtvkS4LbB7duAn+w5lyRpROOeQ39TVT0HMPj+xv4iSZLGse4viibZlmQhycLi4uJ6TydJ37HGLfR/TXI8wOD7/tUGVtX2qpqvqvnZ2dkxp5MkrWXcQr8HuGJw+wrgr/uJI0ka1zBvW7wDeAQ4Jcm+JFcBNwDnJ/kycP7gviRpimbWGlBVl62y67yes0iSOvBKUUlqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGdCr0JL+S5Mkku5PckeR1fQWTJI1m7EJPcgLwXmC+qk4DjgC29hVMkjSarqdcZoDXJ5kBjgKe7R5JkjSOsQu9qr4G/D7wDPAc8GJV3b98XJJtSRaSLCwuLo6fVJJ0SF1OubwBuAQ4Gfh+4Ogkly8fV1Xbq2q+quZnZ2fHTypJOqQup1zeCfxTVS1W1X8BdwE/1k8sSdKouhT6M8BZSY5KEuA8YE8/sSRJo+pyDn0HcCewC3hi8Fjbe8olSRrRTJcfrqrrget7yiJJ6sArRSWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1IhOhZ7kuCR3JnkqyZ4kb+8rmCRpNDMdf/4PgU9X1aVJXgsc1UMmSdIYxi70JN8NnANcCVBVLwMv9xNLkjSqLqdcfgBYBP48yaNJbkpydE+5JEkj6lLoM8AZwJ9W1enAvwPXLh+UZFuShSQLi4uLHaaTJB1Kl0LfB+yrqh2D+3eyVPD/T1Vtr6r5qpqfnZ3tMJ0k6VDGLvSq+hfgq0lOGWw6D/hCL6kkSSPr+i6XXwZuH7zD5SvAe7pHkiSNo1OhV9VjwHxPWSRJHXilqCQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjOhd6kiOSPJrk3j4CSZLG08cR+tXAnh4eR5LUQadCT7IZuBC4qZ84kqRxdT1C/yBwDfDtHrJIkjoYu9CTXATsr6qda4zblmQhycLi4uK400mS1tDlCP1s4OIke4GPAecm+cjyQVW1varmq2p+dna2w3SSpEMZu9Cr6rqq2lxVc8BW4MGqury3ZJKkkfg+dElqxEwfD1JVDwEP9fFYkqTxeIQuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1IixCz3JiUk+m2RPkieTXN1nMEnSaGY6/OwB4NeqaleSY4GdSR6oqi/0lE2SNIKxj9Cr6rmq2jW4/Q1gD3BCX8EkSaPp5Rx6kjngdGDHCvu2JVlIsrC4uNjHdJKkFXQu9CTHAJ8A3ldVLy3fX1Xbq2q+quZnZ2e7TidJWkWnQk9yJEtlfntV3dVPJEnSOLq8yyXAzcCeqvpAf5EkSePocoR+NvBu4Nwkjw2+LugplyRpRGO/bbGq/h5Ij1kkSR14pagkNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDWiyx+4mKi5az85tbn33nDh1OaWpGF5hC5JjbDQJakRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpEZ0KPcmWJF9M8nSSa/sKJUka3diFnuQI4E+AnwBOBS5LcmpfwSRJo+lyhH4m8HRVfaWqXgY+BlzSTyxJ0qi6FPoJwFcPur9vsE2SNAVdPj43K2yrVw1KtgHbBne/meSLHeZcyybg+b4fNL/b+SHWJVcPzDUac41mI+aaWqY1emStXG8ZZo4uhb4POPGg+5uBZ5cPqqrtwPYO8wwtyUJVzU9irlGYazTmGo25hrcRM0F/ubqccvlH4K1JTk7yWmArcE/XQJKk8Yx9hF5VB5L8EvC3wBHALVX1ZG/JJEkj6fQn6KrqU8CnesrSh4mc2hmDuUZjrtGYa3gbMRP0lCtVr3odU5J0GPLSf0lqxGFX6EluSbI/ye5V9ifJHw0+juDzSc7YILnekeTFJI8Nvn5rQrlOTPLZJHuSPJnk6hXGTHzNhsw18TVL8rok/5Dk8UGu315hzHcl+fhgvXYkmdsAma5MsnjQWv3semZaNvcRSR5Ncu8K+ya6ViPkmsp6Jdmb5InBnAsr7O/2XKyqw+oLOAc4A9i9yv4LgPtYep/8WcCODZLrHcC9U1iv44EzBrePBb4EnDrtNRsy18TXbLAGxwxuHwnsAM5aNuYXgBsHt7cCH98Ama4EPjTpf1+DuX8V+OhK/60mvVYj5JrKegF7gU2H2N/puXjYHaFX1cPAC4cYcgnw4VryOeC4JMdvgFxTUVXPVdWuwe1vAHt49RW9E1+zIXNN3GANvjm4e+Tga/kLTZcAtw1u3wmcl2SlC+0mmWkqkmwGLgRuWmXIRNdqhFwbVafn4mFX6EPYyB9J8PbBr833JfmhSU8++HX3dJaO8A421TU7RC6YwpoNflV/DNgPPFBVq65XVR0AXgS+b8qZAH5q8Gv6nUlOXGH/evggcA3w7VX2T3ythswF01mvAu5PsjNLV9Ev1+m52GKhD/WRBFOwC3hLVf0w8MfAX01y8iTHAJ8A3ldVLy3fvcKPTGTN1sg1lTWrqv+uqh9h6ernM5OctmzIxNdriEx/A8xV1duAv+P/jorXTZKLgP1VtfNQw1bYtq5rNWSuia/XwNlVdQZLn1L7i0nOWba/03q1WOhDfSTBpFXVS6/82lxL798/MsmmScyd5EiWSvP2qrprhSFTWbO1ck1zzQZz/hvwELBl2a7/Xa8kM8D3MKHTbatlqqqvV9V/Du7+GfCjE4hzNnBxkr0sfdrquUk+smzMNNZqzVxTWi+q6tnB9/3A3Sx9au3BOj0XWyz0e4CfHrxafBbwYlU9N+1QSd78yrnDJGeytPZfn8C8AW4G9lTVB1YZNvE1GybXNNYsyWyS4wa3Xw+8E3hq2bB7gCsGty8FHqzBK1rTyrTsPOvFLL0msa6q6rqq2lxVcyy94PlgVV2+bNhE12rYXNNYryRHJzn2ldvAu4Dl74rr9FzsdKXoNCS5g6V3P2xKsg+4nqUXiaiqG1m6cvUC4GngW8B7NkiuS4GfT3IA+A9g63r/wx44G3g38MTgHCzAbwInHZRtGms2TK5prNnxwG1Z+gMurwH+sqruTfI7wEJV3cPS/4j+IsnTLB1tbt0Amd6b5GLgwCDTleucaVVTXqthc01jvd4E3D04RpkBPlpVn07yc9DPc9ErRSWpES2ecpGk70gWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjfgfxIqyuYfC0GwAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "chunks = pd.read_csv('loans_2007.csv', chunksize=3000)\n",
    "\n",
    "memory_footprints = []\n",
    "\n",
    "for chunk in chunks:\n",
    "    chunk_memory_mb = chunk.memory_usage(deep=True).sum() / 1048576\n",
    "    memory_footprints.append(chunk_memory_mb)\n",
    "\n",
    "plt.hist(memory_footprints)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Chunks of 3000 rows gets us closer to the 5 MB limit per chunk we set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Exploring the Data in Chunks\n",
    "\n",
    "Let's familiarize ourselves with the columns to see which ones we can optimize. Let's try to understand the column types better while using dataframe chunks.\n",
    "\n",
    "For each chunk:\n",
    "\n",
    "    - How many columns have a numeric type? How many have a string type?\n",
    "    - How many unique values are there in each string column? How many of the string columns contain values that are less than 50% unique?\n",
    "    - Which float columns have no missing values and could be candidates for conversion to the integer type?\n",
    "\n",
    "Calculate the total memory usage across all of the chunks."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How many columns have a numeric type? How many have a string type?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Chunk 1:\n",
      "float64    30\n",
      "object     21\n",
      "int64       1\n",
      "dtype: int64\n",
      "\n",
      "Chunk 2:\n",
      "float64    30\n",
      "object     21\n",
      "int64       1\n",
      "dtype: int64\n",
      "\n",
      "Chunk 3:\n",
      "float64    30\n",
      "object     21\n",
      "int64       1\n",
      "dtype: int64\n",
      "\n",
      "Chunk 4:\n",
      "float64    30\n",
      "object     21\n",
      "int64       1\n",
      "dtype: int64\n",
      "\n",
      "Chunk 5:\n",
      "float64    30\n",
      "object     21\n",
      "int64       1\n",
      "dtype: int64\n",
      "\n",
      "Chunk 6:\n",
      "float64    30\n",
      "object     21\n",
      "int64       1\n",
      "dtype: int64\n",
      "\n",
      "Chunk 7:\n",
      "float64    30\n",
      "object     21\n",
      "int64       1\n",
      "dtype: int64\n",
      "\n",
      "Chunk 8:\n",
      "float64    30\n",
      "object     21\n",
      "int64       1\n",
      "dtype: int64\n",
      "\n",
      "Chunk 9:\n",
      "float64    30\n",
      "object     21\n",
      "int64       1\n",
      "dtype: int64\n",
      "\n",
      "Chunk 10:\n",
      "float64    30\n",
      "object     21\n",
      "int64       1\n",
      "dtype: int64\n",
      "\n",
      "Chunk 11:\n",
      "float64    30\n",
      "object     21\n",
      "int64       1\n",
      "dtype: int64\n",
      "\n",
      "Chunk 12:\n",
      "float64    30\n",
      "object     21\n",
      "int64       1\n",
      "dtype: int64\n",
      "\n",
      "Chunk 13:\n",
      "float64    30\n",
      "object     21\n",
      "int64       1\n",
      "dtype: int64\n",
      "\n",
      "Chunk 14:\n",
      "float64    30\n",
      "object     22\n",
      "dtype: int64\n",
      "\n",
      "Chunk 15:\n",
      "float64    30\n",
      "object     22\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "chunks = pd.read_csv('loans_2007.csv', chunksize=3000)\n",
    "\n",
    "count = 0\n",
    "total_memory_mb = 0\n",
    "for chunk in chunks:\n",
    "    count += 1\n",
    "    \n",
    "    print(f'\\nChunk {count}:')\n",
    "    print(chunk.dtypes.value_counts())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "overall obj cols: ['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']\n",
      "\n",
      "chunk 14 obj cols: ['id', 'term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']\n",
      "\n",
      "overall obj cols: ['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']\n",
      "\n",
      "chunk 15 obj cols: ['id', 'term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Are string columns consistent across chunks?\n",
    "count = 0\n",
    "obj_cols = []\n",
    "chunk = pd.read_csv('loans_2007.csv', chunksize=3000)\n",
    "\n",
    "for chunk in chunk:\n",
    "    count += 1\n",
    "    chunk_obj_cols = chunk.select_dtypes(include=['object']).columns.tolist()\n",
    "    \n",
    "    if len(obj_cols) > 0:\n",
    "        is_same = obj_cols == chunk_obj_cols\n",
    "        \n",
    "        if not is_same:\n",
    "            print(f\"overall obj cols: {obj_cols}\\n\")\n",
    "            print(f\"chunk {count} obj cols: {chunk_obj_cols}\\n\")    \n",
    "    \n",
    "    else:\n",
    "        obj_cols = chunk_obj_cols"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Observations**:\n",
    "\n",
    "- There are 31 numeric columns and 21 string columns\n",
    "- In the last 2 chunks, the `id` column is being cast as an object instead of an `int64` type. Since the `id` column isn't useful for analysis, visualization or predictive modeling, we can ignore this column."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How many unique values are there in each string column? How many of the string columns contain values that are less than 50% unique?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total Num Rows: 42538\n"
     ]
    }
   ],
   "source": [
    "## Create dictionary (key: column, value: list of Series objects representing each chunk's value counts)\n",
    "# vc - value counts\n",
    "chunks = pd.read_csv('loans_2007.csv', chunksize=3000)\n",
    "\n",
    "count = 0\n",
    "total_num_rows = 0\n",
    "str_cols_vc = {}\n",
    "\n",
    "for chunk in chunks:\n",
    "    total_num_rows += chunk.shape[0]\n",
    "    str_cols = chunk.select_dtypes(include=['object'])\n",
    "    \n",
    "    for col in str_cols.columns:\n",
    "        current_col_vc = chunk[col].value_counts()\n",
    "        \n",
    "        if col in str_cols_vc:\n",
    "            str_cols_vc[col].append(current_col_vc)\n",
    "        else:\n",
    "            str_cols_vc[col] = [current_col_vc]\n",
    "\n",
    "print(f'Total Num Rows: {total_num_rows}')\n",
    "## Combine the value count results\n",
    "combined_vcs = {}\n",
    "\n",
    "for col, vc_list in str_cols_vc.items():\n",
    "    combined_vc = pd.concat(vc_list)\n",
    "    final_vc = combined_vc.groupby(combined_vc.index).sum()\n",
    "    combined_vcs[col] = final_vc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Unique Values per string column (22 columns):\n",
      "term                : 2         0.00\n",
      "int_rate            : 394       0.93\n",
      "grade               : 7         0.02\n",
      "sub_grade           : 35        0.08\n",
      "emp_title           : 30658     72.07\n",
      "emp_length          : 11        0.03\n",
      "home_ownership      : 5         0.01\n",
      "verification_status : 3         0.01\n",
      "issue_d             : 55        0.13\n",
      "loan_status         : 9         0.02\n",
      "pymnt_plan          : 2         0.00\n",
      "purpose             : 14        0.03\n",
      "title               : 21264     49.99\n",
      "zip_code            : 837       1.97\n",
      "addr_state          : 50        0.12\n",
      "earliest_cr_line    : 530       1.25\n",
      "revol_util          : 1119      2.63\n",
      "initial_list_status : 1         0.00\n",
      "last_pymnt_d        : 103       0.24\n",
      "last_credit_pull_d  : 108       0.25\n",
      "application_type    : 1         0.00\n",
      "id                  : 3538      8.32\n"
     ]
    }
   ],
   "source": [
    "print(f'Unique Values per string column ({len(combined_vcs)} columns):')\n",
    "for col, col_vc in combined_vcs.items():\n",
    "    unique_values_percentage = (len(col_vc) / total_num_rows) * 100\n",
    "    print(f'{col:<20}: {col_vc.shape[0]:<10}{unique_values_percentage:0.2f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Observations**:\n",
    "\n",
    "- Each string column has varying number of unique values (1 to over 30,000).\n",
    "- 2 columns in particular (`emp_title` and `title`) contain 50% or more unique values. The remaining string columns have no more than 1 - 10% unique values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Which float columns have no missing values and could be candidates for conversion to the integer type?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Chunk 1:\n",
      "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n",
      "0/30 columns:[]\n",
      "\n",
      "Chunk 2:\n",
      "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n",
      "0/30 columns:[]\n",
      "\n",
      "Chunk 3:\n",
      "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n",
      "0/30 columns:[]\n",
      "\n",
      "Chunk 4:\n",
      "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n",
      "0/30 columns:[]\n",
      "\n",
      "Chunk 5:\n",
      "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n",
      "0/30 columns:[]\n",
      "\n",
      "Chunk 6:\n",
      "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n",
      "0/30 columns:[]\n",
      "\n",
      "Chunk 7:\n",
      "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n",
      "0/30 columns:[]\n",
      "\n",
      "Chunk 8:\n",
      "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n",
      "0/30 columns:[]\n",
      "\n",
      "Chunk 9:\n",
      "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n",
      "0/30 columns:[]\n",
      "\n",
      "Chunk 10:\n",
      "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n",
      "0/30 columns:[]\n",
      "\n",
      "Chunk 11:\n",
      "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n",
      "0/30 columns:[]\n",
      "\n",
      "Chunk 12:\n",
      "29/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']\n",
      "1/30 columns:['pub_rec_bankruptcies']\n",
      "member_id                     0\n",
      "loan_amnt                     0\n",
      "funded_amnt                   0\n",
      "funded_amnt_inv               0\n",
      "installment                   0\n",
      "annual_inc                    0\n",
      "dti                           0\n",
      "delinq_2yrs                   0\n",
      "inq_last_6mths                0\n",
      "open_acc                      0\n",
      "pub_rec                       0\n",
      "revol_bal                     0\n",
      "total_acc                     0\n",
      "out_prncp                     0\n",
      "out_prncp_inv                 0\n",
      "total_pymnt                   0\n",
      "total_pymnt_inv               0\n",
      "total_rec_prncp               0\n",
      "total_rec_int                 0\n",
      "total_rec_late_fee            0\n",
      "recoveries                    0\n",
      "collection_recovery_fee       0\n",
      "last_pymnt_amnt               0\n",
      "collections_12_mths_ex_med    0\n",
      "policy_code                   0\n",
      "acc_now_delinq                0\n",
      "chargeoff_within_12_mths      0\n",
      "delinq_amnt                   0\n",
      "pub_rec_bankruptcies          3\n",
      "tax_liens                     0\n",
      "dtype: int64\n",
      "\n",
      "Chunk 13:\n",
      "29/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']\n",
      "1/30 columns:['pub_rec_bankruptcies']\n",
      "member_id                     0\n",
      "loan_amnt                     0\n",
      "funded_amnt                   0\n",
      "funded_amnt_inv               0\n",
      "installment                   0\n",
      "annual_inc                    0\n",
      "dti                           0\n",
      "delinq_2yrs                   0\n",
      "inq_last_6mths                0\n",
      "open_acc                      0\n",
      "pub_rec                       0\n",
      "revol_bal                     0\n",
      "total_acc                     0\n",
      "out_prncp                     0\n",
      "out_prncp_inv                 0\n",
      "total_pymnt                   0\n",
      "total_pymnt_inv               0\n",
      "total_rec_prncp               0\n",
      "total_rec_int                 0\n",
      "total_rec_late_fee            0\n",
      "recoveries                    0\n",
      "collection_recovery_fee       0\n",
      "last_pymnt_amnt               0\n",
      "collections_12_mths_ex_med    0\n",
      "policy_code                   0\n",
      "acc_now_delinq                0\n",
      "chargeoff_within_12_mths      0\n",
      "delinq_amnt                   0\n",
      "pub_rec_bankruptcies          4\n",
      "tax_liens                     0\n",
      "dtype: int64\n",
      "\n",
      "Chunk 14:\n",
      "0/30 columns:[]\n",
      "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n",
      "member_id                       1\n",
      "loan_amnt                       1\n",
      "funded_amnt                     1\n",
      "funded_amnt_inv                 1\n",
      "installment                     1\n",
      "annual_inc                      1\n",
      "dti                             1\n",
      "delinq_2yrs                     1\n",
      "inq_last_6mths                  1\n",
      "open_acc                        1\n",
      "pub_rec                         1\n",
      "revol_bal                       1\n",
      "total_acc                       1\n",
      "out_prncp                       1\n",
      "out_prncp_inv                   1\n",
      "total_pymnt                     1\n",
      "total_pymnt_inv                 1\n",
      "total_rec_prncp                 1\n",
      "total_rec_int                   1\n",
      "total_rec_late_fee              1\n",
      "recoveries                      1\n",
      "collection_recovery_fee         1\n",
      "last_pymnt_amnt                 1\n",
      "collections_12_mths_ex_med     57\n",
      "policy_code                     1\n",
      "acc_now_delinq                  1\n",
      "chargeoff_within_12_mths       57\n",
      "delinq_amnt                     1\n",
      "pub_rec_bankruptcies          849\n",
      "tax_liens                      40\n",
      "dtype: int64\n",
      "\n",
      "Chunk 15:\n",
      "0/30 columns:[]\n",
      "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n",
      "member_id                       2\n",
      "loan_amnt                       2\n",
      "funded_amnt                     2\n",
      "funded_amnt_inv                 2\n",
      "installment                     2\n",
      "annual_inc                      6\n",
      "dti                             2\n",
      "delinq_2yrs                    31\n",
      "inq_last_6mths                 31\n",
      "open_acc                       31\n",
      "pub_rec                        31\n",
      "revol_bal                       2\n",
      "total_acc                      31\n",
      "out_prncp                       2\n",
      "out_prncp_inv                   2\n",
      "total_pymnt                     2\n",
      "total_pymnt_inv                 2\n",
      "total_rec_prncp                 2\n",
      "total_rec_int                   2\n",
      "total_rec_late_fee              2\n",
      "recoveries                      2\n",
      "collection_recovery_fee         2\n",
      "last_pymnt_amnt                 2\n",
      "collections_12_mths_ex_med     91\n",
      "policy_code                     2\n",
      "acc_now_delinq                 31\n",
      "chargeoff_within_12_mths       91\n",
      "delinq_amnt                    31\n",
      "pub_rec_bankruptcies          512\n",
      "tax_liens                      68\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "chunks = pd.read_csv('loans_2007.csv', chunksize=3000)\n",
    "count = 0\n",
    "max_count = 2\n",
    "\n",
    "for chunk in chunks:\n",
    "    count += 1\n",
    "    print(f'\\nChunk {count}:')\n",
    "    \n",
    "    float_cols_with_null_check = chunk.select_dtypes(include=['float']).isnull().sum()\n",
    "    \n",
    "    #print(float_cols_with_null_check.shape[0])\n",
    "    #float_cols_no_missing_val = float_cols_with_null_check.index.tolist()\n",
    "    \n",
    "    cols_no_missing_val = list(float_cols_with_null_check[float_cols_with_null_check == 0].index)\n",
    "    \n",
    "    cols_with_missing_val = list(float_cols_with_null_check[float_cols_with_null_check > 0].index)\n",
    "    \n",
    "    print(f\"{len(cols_no_missing_val)}/{len(float_cols_with_null_check)} columns:\"\\\n",
    "         f\"{cols_no_missing_val}\")\n",
    "    \n",
    "    print(f\"{len(cols_with_missing_val)}/{len(float_cols_with_null_check)} columns:\"\\\n",
    "         f\"{cols_with_missing_val}\")\n",
    "    \n",
    "    # the last several chunks have float columns with missing values\n",
    "    if count > 11:\n",
    "        print(float_cols_with_null_check)\n",
    "    #if count >= max_count:\n",
    "    #    break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Observations**:\n",
    "\n",
    "There are no float columns with no missing values. Therefore, we can't convert any of the float columns to an integer type."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Calculate the total memory usage across all of the chunks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Chunk 1:\tMemory Usage: 4.65 MB\n",
      "Chunk 2:\tMemory Usage: 4.64 MB\n",
      "Chunk 3:\tMemory Usage: 4.65 MB\n",
      "Chunk 4:\tMemory Usage: 4.65 MB\n",
      "Chunk 5:\tMemory Usage: 4.64 MB\n",
      "Chunk 6:\tMemory Usage: 4.65 MB\n",
      "Chunk 7:\tMemory Usage: 4.64 MB\n",
      "Chunk 8:\tMemory Usage: 4.65 MB\n",
      "Chunk 9:\tMemory Usage: 4.65 MB\n",
      "Chunk 10:\tMemory Usage: 4.65 MB\n",
      "Chunk 11:\tMemory Usage: 4.66 MB\n",
      "Chunk 12:\tMemory Usage: 4.66 MB\n",
      "Chunk 13:\tMemory Usage: 4.66 MB\n",
      "Chunk 14:\tMemory Usage: 4.90 MB\n",
      "Chunk 15:\tMemory Usage: 0.88 MB\n",
      "There are 15 chunks! Total memory: 66.22 MB\n"
     ]
    }
   ],
   "source": [
    "chunks = pd.read_csv('loans_2007.csv', chunksize=3000)\n",
    "\n",
    "count = 0\n",
    "total_memory_mb = 0\n",
    "for chunk in chunks:\n",
    "    count += 1\n",
    "    print(f\"Chunk {count}:\", end='\\t')\n",
    "    \n",
    "    # chunk memory\n",
    "    chunk_memory_mb = chunk.memory_usage(deep=True).sum() / 1048576\n",
    "    print(f\"Memory Usage: {chunk_memory_mb:0.2f} MB\")\n",
    "    total_memory_mb += chunk_memory_mb\n",
    "\n",
    "print(f\"There are {count} chunks! Total memory: {total_memory_mb:0.2f} MB\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Optimizing String Columns\n",
    "\n",
    "We can achieve the greatest memory improvements by converting the string columns to a numeric type. Let's convert all of the columns where the values are less than 50% unique to the category type, and the columns that contain numeric values to the float type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 145,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Chunk 1:\n",
      "term                : 0.19 (object) vs. 0.00 (int8)\n",
      "int_rate            : 0.18 (object) vs. 0.01 (float32)\n",
      "grade               : 0.19 (object) vs. 0.00 (category)\n",
      "sub_grade           : 0.17 (object) vs. 0.01 (category)\n",
      "emp_title           : 0.21 (object) vs. 0.21 (object)\n",
      "emp_length          : 0.18 (object) vs. 0.18 (object)\n",
      "home_ownership      : 0.18 (object) vs. 0.00 (category)\n",
      "verification_status : 0.20 (object) vs. 0.00 (category)\n",
      "issue_d             : 0.19 (object) vs. 0.02 (datetime64[ns])\n",
      "loan_status         : 0.19 (object) vs. 0.00 (category)\n",
      "pymnt_plan          : 0.19 (object) vs. 0.00 (bool)\n",
      "purpose             : 0.20 (object) vs. 0.00 (category)\n",
      "title               : 0.21 (object) vs. 0.21 (object)\n",
      "zip_code            : 0.18 (object) vs. 0.18 (object)\n",
      "addr_state          : 0.17 (object) vs. 0.17 (object)\n",
      "earliest_cr_line    : 0.19 (object) vs. 0.02 (datetime64[ns])\n",
      "revol_util          : 0.18 (object) vs. 0.01 (float32)\n",
      "initial_list_status : 0.19 (object) vs. 0.00 (category)\n",
      "last_pymnt_d        : 0.19 (object) vs. 0.02 (datetime64[ns])\n",
      "last_credit_pull_d  : 0.19 (object) vs. 0.02 (datetime64[ns])\n",
      "application_type    : 0.19 (object) vs. 0.00 (category)\n",
      "Memory: 3.94 mb vs 1.10 mb\n",
      "\n",
      "Memory:4.65 mb vs 1.81 mb\n"
     ]
    }
   ],
   "source": [
    "chunks = pd.read_csv('loans_2007.csv', chunksize=3000)\n",
    "count = 0\n",
    "total_initial_memory_mb = 0\n",
    "total_final_memory_mb = 0\n",
    "B_PER_MB = 1048576\n",
    "\n",
    "for chunk in chunks:\n",
    "    count += 1\n",
    "    print(f'\\nChunk {count}:')\n",
    "    obj_cols = chunk.select_dtypes(include=['object'])\n",
    "    \n",
    "    total_initial_memory_mb += (chunk.memory_usage(deep=True).sum() / 1048576)\n",
    "    \n",
    "    #print('Initial Memory Usage:')\n",
    "    memory_comparison = []\n",
    "    initial_memory_mb = 0\n",
    "    for col in obj_cols.columns:\n",
    "        chunk_memory_mb = chunk[col].memory_usage(deep=True) / 1048576\n",
    "        initial_memory_mb += chunk_memory_mb\n",
    "        memory_comparison.append(f'{col:<20}: {chunk_memory_mb:0.2f} ({chunk[col].dtype})')\n",
    "    \n",
    "    # convert columns to category type\n",
    "    chunk['grade'] = chunk['grade'].astype('category')\n",
    "    chunk['sub_grade'] = chunk['sub_grade'].astype('category')\n",
    "    chunk['home_ownership'] = chunk['home_ownership'].astype('category')\n",
    "    chunk['verification_status'] = chunk['verification_status'].astype('category')\n",
    "    chunk['application_type'] = chunk['application_type'].astype('category')\n",
    "    chunk['loan_status'] = chunk['loan_status'].astype('category')\n",
    "    chunk['initial_list_status'] = chunk['initial_list_status'].astype('category')\n",
    "    chunk['purpose'] = chunk['purpose'].astype('category')\n",
    "    \n",
    "    # convert columns to numeric type\n",
    "    chunk['term'] = chunk['term'].str.replace('months', '').str.replace(' ', '')\n",
    "    chunk['term'] = pd.to_numeric(chunk['term'], downcast='integer')\n",
    "    \n",
    "    chunk['int_rate'] = chunk['int_rate'].str.replace(' ', '').str.replace('%', '')\n",
    "    chunk['int_rate'] = pd.to_numeric(chunk['int_rate'], downcast='float')\n",
    "    \n",
    "    chunk['revol_util'] = chunk['revol_util'].str.replace('%', '')\n",
    "    chunk['revol_util'] = pd.to_numeric(chunk['revol_util'], downcast='float')\n",
    "    \n",
    "    # convert columns to datetime type\n",
    "    chunk['issue_d'] = pd.to_datetime(chunk['issue_d'])\n",
    "    chunk['earliest_cr_line'] = pd.to_datetime(chunk['earliest_cr_line'])\n",
    "    chunk['last_pymnt_d'] = pd.to_datetime(chunk['last_pymnt_d'])\n",
    "    chunk['last_credit_pull_d'] = pd.to_datetime(chunk['last_credit_pull_d'])\n",
    "    \n",
    "    # convert columns to boolean type - 2 valid approaches!\n",
    "    \n",
    "    chunk['pymnt_plan'] = chunk['pymnt_plan'].astype('bool')\n",
    "    \n",
    "    #d = {'n': False, 'y': True}\n",
    "    #chunk['pymnt_plan'] = chunk['pymnt_plan'].map(d)\n",
    "    \n",
    "    \n",
    "    # cycle through column unique values\n",
    "    #for col in obj_cols.columns:\n",
    "    #    print('\\n', col)\n",
    "    #    unique_values = chunk[col].unique()\n",
    "    #    print(f'{len(unique_values)} values: {unique_values[:5]}')\n",
    "    \n",
    "    #print('\\nMemory Comparison:')\n",
    "    total_memory_mb = 0\n",
    "    for i, col in enumerate(obj_cols.columns):\n",
    "        chunk_memory_mb = chunk[col].memory_usage(deep=True) / 1048576\n",
    "        total_memory_mb += chunk_memory_mb\n",
    "        previous_info = memory_comparison[i]\n",
    "        print(f'{previous_info:<10} vs. {chunk_memory_mb:0.2f} ({chunk[col].dtype})')\n",
    "    \n",
    "    print(f'Memory: {initial_memory_mb:0.2f} mb vs {total_memory_mb:0.2f} mb')\n",
    "    \n",
    "    total_final_memory_mb += (chunk.memory_usage(deep=True).sum() / 1048576)\n",
    "    break\n",
    "\n",
    "print(f'\\nMemory:{total_initial_memory_mb:0.2f} mb vs {total_final_memory_mb:0.2f} mb')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 144,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "60.963455149501655"
      ]
     },
     "execution_count": 144,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(66.22 - 25.85) / 66.22 * 100"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "=> a 61% improvement in memory usage!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Optimizing Numeric Columns\n",
    "\n",
    "We found out earlier that all float columns have some missing values. So we can't convert any of them to an integer column. We can downcast each float column to a more efficient subtype."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 158,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Chunk 1:\n",
      "Memory: 0.69 mb vs 0.35 mb\n",
      "\n",
      "Chunk 2:\n",
      "Memory: 0.69 mb vs 0.35 mb\n",
      "\n",
      "Chunk 3:\n",
      "Memory: 0.69 mb vs 0.35 mb\n",
      "\n",
      "Chunk 4:\n",
      "Memory: 0.69 mb vs 0.35 mb\n",
      "\n",
      "Chunk 5:\n",
      "Memory: 0.69 mb vs 0.35 mb\n",
      "\n",
      "Chunk 6:\n",
      "Memory: 0.69 mb vs 0.35 mb\n",
      "\n",
      "Chunk 7:\n",
      "Memory: 0.69 mb vs 0.35 mb\n",
      "\n",
      "Chunk 8:\n",
      "Memory: 0.69 mb vs 0.35 mb\n",
      "\n",
      "Chunk 9:\n",
      "Memory: 0.69 mb vs 0.35 mb\n",
      "\n",
      "Chunk 10:\n",
      "Memory: 0.69 mb vs 0.35 mb\n",
      "\n",
      "Chunk 11:\n",
      "Memory: 0.69 mb vs 0.35 mb\n",
      "\n",
      "Chunk 12:\n",
      "Memory: 0.69 mb vs 0.35 mb\n",
      "\n",
      "Chunk 13:\n",
      "Memory: 0.69 mb vs 0.35 mb\n",
      "\n",
      "Chunk 14:\n",
      "Memory: 0.69 mb vs 0.35 mb\n",
      "\n",
      "Chunk 15:\n",
      "Memory: 0.13 mb vs 0.06 mb\n",
      "\n",
      "Memory:66.22 mb vs 61.35 mb\n"
     ]
    }
   ],
   "source": [
    "chunks = pd.read_csv('loans_2007.csv', chunksize=3000)\n",
    "count = 0\n",
    "total_initial_memory_mb = 0\n",
    "total_final_memory_mb = 0\n",
    "B_PER_MB = 1048576\n",
    "\n",
    "for chunk in chunks:\n",
    "    count += 1\n",
    "    print(f'\\nChunk {count}:')\n",
    "    float_cols = chunk.select_dtypes(include=['float'])\n",
    "    \n",
    "    total_initial_memory_mb += (chunk.memory_usage(deep=True).sum() / 1048576)\n",
    "    \n",
    "    #print('Initial Memory Usage:')\n",
    "    memory_comparison = []\n",
    "    initial_memory_mb = 0\n",
    "    for col in float_cols.columns:\n",
    "        chunk_memory_mb = chunk[col].memory_usage(deep=True) / 1048576\n",
    "        initial_memory_mb += chunk_memory_mb\n",
    "        memory_comparison.append(f'{col:<30}: {chunk_memory_mb:0.2f} ({chunk[col].dtype})')\n",
    "    \n",
    "    for col in float_cols.columns:\n",
    "        chunk[col] = pd.to_numeric(chunk[col], downcast='float')\n",
    "    \n",
    "    # cycle through column unique values\n",
    "    #for col in float_cols.columns:\n",
    "    #    print('\\n', col)\n",
    "    #    unique_values = chunk[col].unique()\n",
    "    #    print(f'{len(unique_values)} values: {unique_values[:5]}')\n",
    "        \n",
    "    \n",
    "    #print('\\nMemory Comparison:')\n",
    "    total_memory_mb = 0\n",
    "    for i, col in enumerate(float_cols.columns):\n",
    "        chunk_memory_mb = chunk[col].memory_usage(deep=True) / 1048576\n",
    "        total_memory_mb += chunk_memory_mb\n",
    "        previous_info = memory_comparison[i]\n",
    "        #print(f'{previous_info:<20} vs. {chunk_memory_mb:0.2f} ({chunk[col].dtype})')\n",
    "    \n",
    "    print(f'Memory: {initial_memory_mb:0.2f} mb vs {total_memory_mb:0.2f} mb')\n",
    "    \n",
    "    total_final_memory_mb += (chunk.memory_usage(deep=True).sum() / 1048576)\n",
    "    \n",
    "    #break\n",
    "\n",
    "print(f'\\nMemory:{total_initial_memory_mb:0.2f} mb vs {total_final_memory_mb:0.2f} mb')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 160,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7.354273633343397"
      ]
     },
     "execution_count": 160,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(66.22 - 61.35) / 66.22 * 100"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "=> So we save 7% by converting our float columns to more space efficient subtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Conclusion\n",
    "\n",
    "By optimizing our strings and float columns, we were able to reduce our memory usage by **68%**. This is quite significant reduction, especially in cases when we are dealing with very large datasets."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Next Steps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's an idea for some next steps:\n",
    "\n",
    "- Create a function that automates as much of the work we just did as possible, so that we could use it on other Lending Club data sets. This function should:\n",
    "    - Determine the optimal chunk size based on the memory constraints you provide\n",
    "    - Determine which string columns can be converted to numeric ones by removing the % character\n",
    "    - Determine which numeric columns can be converted to more space efficient representations"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}