{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Practice Optimizing Dataframes and Processing in Chunks\n", "\n", "In this project, we'll practice working with chunked dataframes and optimizing a dataframe's memory usage. We'll be working with financial lending data from [Lending Club], a marketplace for personal loans that matches borrowers with investors. You can read more about the marketplace on its [website].\n", "\n", "The Lending Club's website lists approved loans. Qualified investors can view the borrower's credit score, the purpose of the loan, and other details in the loan applications. Once a lender is ready to back a loan, it selects the amount of money it wants to fund. When the loan amount the borrower requested is fully funded, the borrower receives the money, minus the [origination fee] that Lending Club charges.\n", "\n", "We'll be working with a dataset of loans approved from `2007-2011`, which can be downloaded from [Lending Club's website]. The `desc` column has been removed to improve system performance.\n", "\n", "If we read in the entire data set, it will consume about 67 megabytes of memory. For learning purposes, let's imagine that we only have 10 megabytes of memory available throughout this project, so we can practice the concepts we learned in the last two missions.\n", "\n", "[Lending Club]: https://www.lendingclub.com/\n", "[website]: https://www.lendingclub.com/public/how-peer-lending-works.action\n", "[origination fee]: https://help.lendingclub.com/hc/en-us/articles/214501207-What-is-the-origination-fee-\n", "[Lending Club's website]: https://www.lendingclub.com/info/download-data.action" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import pprint as pp\n", "import matplotlib.pyplot as plt\n", "pd.options.display.max_columns = 99" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "loans_2007.csv\r\n", "project02_practice_optimizing_dataframes_and_processing_in_chunks.ipynb\r\n" ] } ], "source": [ "!ls" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Introduction\n", "\n", "- Read in the first five lines from `loans_2007.csv` and look for any data quality issues.\n", "- Read in the first 1000 rows from the data set, and calculate the total memory usage for these rows. Increase or decrease the number of rows to converge on a memory usage under `five megabytes` (to stay on the conservative side)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idmember_idloan_amntfunded_amntfunded_amnt_invtermint_rateinstallmentgradesub_gradeemp_titleemp_lengthhome_ownershipannual_incverification_statusissue_dloan_statuspymnt_planpurposetitlezip_codeaddr_statedtidelinq_2yrsearliest_cr_lineinq_last_6mthsopen_accpub_recrevol_balrevol_utiltotal_accinitial_list_statusout_prncpout_prncp_invtotal_pymnttotal_pymnt_invtotal_rec_prncptotal_rec_inttotal_rec_late_feerecoveriescollection_recovery_feelast_pymnt_dlast_pymnt_amntlast_credit_pull_dcollections_12_mths_ex_medpolicy_codeapplication_typeacc_now_delinqchargeoff_within_12_mthsdelinq_amntpub_rec_bankruptciestax_liens
010775011296599.05000.05000.04975.036 months10.65%162.87BB2NaN10+ yearsRENT24000.0VerifiedDec-2011Fully Paidncredit_cardComputer860xxAZ27.650.0Jan-19851.03.00.013648.083.7%9.0f0.000.005863.1551875833.845000.00863.160.000.000.00Jan-2015171.62Jun-20160.01.0INDIVIDUAL0.00.00.00.00.0
110774301314167.02500.02500.02500.060 months15.27%59.83CC4Ryder< 1 yearRENT30000.0Source VerifiedDec-2011Charged Offncarbike309xxGA1.000.0Apr-19995.03.00.01687.09.4%4.0f0.000.001008.7100001008.71456.46435.170.00117.081.11Apr-2013119.66Sep-20130.01.0INDIVIDUAL0.00.00.00.00.0
210771751313524.02400.02400.02400.036 months15.96%84.33CC5NaN10+ yearsRENT12252.0Not VerifiedDec-2011Fully Paidnsmall_businessreal estate business606xxIL8.720.0Nov-20012.02.00.02956.098.5%10.0f0.000.003005.6668443005.672400.00605.670.000.000.00Jun-2014649.91Jun-20160.01.0INDIVIDUAL0.00.00.00.00.0
310768631277178.010000.010000.010000.036 months13.49%339.31CC1AIR RESOURCES BOARD10+ yearsRENT49200.0Source VerifiedDec-2011Fully Paidnotherpersonel917xxCA20.000.0Feb-19961.010.00.05598.021%37.0f0.000.0012231.89000012231.8910000.002214.9216.970.000.00Jan-2015357.48Apr-20160.01.0INDIVIDUAL0.00.00.00.00.0
410753581311748.03000.03000.03000.060 months12.69%67.79BB5University Medical Group1 yearRENT80000.0Source VerifiedDec-2011CurrentnotherPersonal972xxOR17.940.0Jan-19960.015.00.027783.053.9%38.0f461.73461.733581.1200003581.122538.271042.850.000.000.00Jun-201667.79Jun-20160.01.0INDIVIDUAL0.00.00.00.00.0
\n", "
" ], "text/plain": [ " id member_id loan_amnt funded_amnt funded_amnt_inv term \\\n", "0 1077501 1296599.0 5000.0 5000.0 4975.0 36 months \n", "1 1077430 1314167.0 2500.0 2500.0 2500.0 60 months \n", "2 1077175 1313524.0 2400.0 2400.0 2400.0 36 months \n", "3 1076863 1277178.0 10000.0 10000.0 10000.0 36 months \n", "4 1075358 1311748.0 3000.0 3000.0 3000.0 60 months \n", "\n", " int_rate installment grade sub_grade emp_title emp_length \\\n", "0 10.65% 162.87 B B2 NaN 10+ years \n", "1 15.27% 59.83 C C4 Ryder < 1 year \n", "2 15.96% 84.33 C C5 NaN 10+ years \n", "3 13.49% 339.31 C C1 AIR RESOURCES BOARD 10+ years \n", "4 12.69% 67.79 B B5 University Medical Group 1 year \n", "\n", " home_ownership annual_inc verification_status issue_d loan_status \\\n", "0 RENT 24000.0 Verified Dec-2011 Fully Paid \n", "1 RENT 30000.0 Source Verified Dec-2011 Charged Off \n", "2 RENT 12252.0 Not Verified Dec-2011 Fully Paid \n", "3 RENT 49200.0 Source Verified Dec-2011 Fully Paid \n", "4 RENT 80000.0 Source Verified Dec-2011 Current \n", "\n", " pymnt_plan purpose title zip_code addr_state dti \\\n", "0 n credit_card Computer 860xx AZ 27.65 \n", "1 n car bike 309xx GA 1.00 \n", "2 n small_business real estate business 606xx IL 8.72 \n", "3 n other personel 917xx CA 20.00 \n", "4 n other Personal 972xx OR 17.94 \n", "\n", " delinq_2yrs earliest_cr_line inq_last_6mths open_acc pub_rec revol_bal \\\n", "0 0.0 Jan-1985 1.0 3.0 0.0 13648.0 \n", "1 0.0 Apr-1999 5.0 3.0 0.0 1687.0 \n", "2 0.0 Nov-2001 2.0 2.0 0.0 2956.0 \n", "3 0.0 Feb-1996 1.0 10.0 0.0 5598.0 \n", "4 0.0 Jan-1996 0.0 15.0 0.0 27783.0 \n", "\n", " revol_util total_acc initial_list_status out_prncp out_prncp_inv \\\n", "0 83.7% 9.0 f 0.00 0.00 \n", "1 9.4% 4.0 f 0.00 0.00 \n", "2 98.5% 10.0 f 0.00 0.00 \n", "3 21% 37.0 f 0.00 0.00 \n", "4 53.9% 38.0 f 461.73 461.73 \n", "\n", " total_pymnt total_pymnt_inv total_rec_prncp total_rec_int \\\n", "0 5863.155187 5833.84 5000.00 863.16 \n", "1 1008.710000 1008.71 456.46 435.17 \n", "2 3005.666844 3005.67 2400.00 605.67 \n", "3 12231.890000 12231.89 10000.00 2214.92 \n", "4 3581.120000 3581.12 2538.27 1042.85 \n", "\n", " total_rec_late_fee recoveries collection_recovery_fee last_pymnt_d \\\n", "0 0.00 0.00 0.00 Jan-2015 \n", "1 0.00 117.08 1.11 Apr-2013 \n", "2 0.00 0.00 0.00 Jun-2014 \n", "3 16.97 0.00 0.00 Jan-2015 \n", "4 0.00 0.00 0.00 Jun-2016 \n", "\n", " last_pymnt_amnt last_credit_pull_d collections_12_mths_ex_med \\\n", "0 171.62 Jun-2016 0.0 \n", "1 119.66 Sep-2013 0.0 \n", "2 649.91 Jun-2016 0.0 \n", "3 357.48 Apr-2016 0.0 \n", "4 67.79 Jun-2016 0.0 \n", "\n", " policy_code application_type acc_now_delinq chargeoff_within_12_mths \\\n", "0 1.0 INDIVIDUAL 0.0 0.0 \n", "1 1.0 INDIVIDUAL 0.0 0.0 \n", "2 1.0 INDIVIDUAL 0.0 0.0 \n", "3 1.0 INDIVIDUAL 0.0 0.0 \n", "4 1.0 INDIVIDUAL 0.0 0.0 \n", "\n", " delinq_amnt pub_rec_bankruptcies tax_liens \n", "0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_five = pd.read_csv('loans_2007.csv', nrows=5)\n", "first_five.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Observations**:\n", "\n", "- `emp_title` has some missing values\n", "- `emp_length` could be cleaned up into a numeric column instead of a string\n", "- Lots of columns can be converted to categories or boolean types" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD8CAYAAABn919SAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAD7lJREFUeJzt3X+MJ3V9x/Hnq3dQrdiA3qIXTjxqsUobOe16paWtiNoiNAUqTaVGSUtyttVGU21Bk/qjrSkmVYxN1ZxCvTZWtIiFINoShBKjont6nIeHBSm1JxduqSJiG+rBu3/sEDfn7n1n9/v97nf30+cj+WZn5vv57rx2cvvaufnOdyZVhSRp7fuRSQeQJI2GhS5JjbDQJakRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqxPqVXNmGDRtq8+bNK7lKSVrzdu7ceV9VTQ0at6KFvnnzZmZmZlZylZK05iX5jz7jPOQiSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNWNFPikrSJG2++BMTW/fdl5w19nW4hy5JjbDQJakRFrokNcJCl6RGWOiS1IjehZ5kXZIvJ7m2mz8hyS1J7kjykSRHji+mJGmQpeyhvwbYO2/+7cClVXUi8G3gwlEGkyQtTa9CT7IJOAv4QDcf4HTgym7IDuCccQSUJPXTdw/9XcCfAI90808E7q+qg938PuC4EWeTJC3BwEJP8mvAgaraOX/xAkNrkddvSzKTZGZ2dnaZMSVJg/TZQz8V+PUkdwNXMHeo5V3A0UkevXTAJuCehV5cVdurarqqpqemBt60WpK0TAMLvareUFWbqmoz8FLg01X1MuBG4Lxu2AXA1WNLKUkaaJjz0C8C/ijJncwdU79sNJEkScuxpKstVtVNwE3d9F3A1tFHkiQth58UlaRGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqREWuiQ1os9Noh+T5AtJbk1yW5K3dss/mOTfk+zqHlvGH1eStJg+dyx6CDi9qh5McgTwmSSf7J7746q6cnzxJEl9DSz0qirgwW72iO5R4wwlSVq6XsfQk6xLsgs4AFxfVbd0T70tye4klyb50UVeuy3JTJKZ2dnZEcWWJB2qV6FX1cNVtQXYBGxN8jPAG4BnAM8FngBctMhrt1fVdFVNT01NjSi2JOlQSzrLparuB24Czqiq/TXnIeBvga1jyCdJ6qnPWS5TSY7uph8LvBC4PcnGblmAc4A94wwqSTq8Pme5bAR2JFnH3B+Aj1bVtUk+nWQKCLAL+L0x5pQkDdDnLJfdwLMXWH76WBJJkpbFT4pKUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUiD53LHpMki8kuTXJbUne2i0/IcktSe5I8pEkR44/riRpMX320B8CTq+qk4EtwBlJTgHeDlxaVScC3wYuHF9MSdIgAwu9uxH0g93sEd2jgNOBK7vlO5i7r6gkaUJ6HUNPsi7JLuAAcD3wdeD+qjrYDdkHHDeeiJKkPnoVelU9XFVbgE3AVuCZCw1b6LVJtiWZSTIzOzu7/KSSpMNa0lkuVXU/cBNwCnB0kkdvMr0JuGeR12yvqumqmp6amhomqyTpMPqc5TKV5Ohu+rHAC4G9wI3Aed2wC4CrxxVSkjTY+sFD2AjsSLKOuT8AH62qa5N8FbgiyV8AXwYuG2NOSdIAAwu9qnYDz15g+V3MHU+XJK0CflJUkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktSIPrege0qSG5PsTXJbktd0y9+S5JtJdnWPM8cfV5K0mD63oDsIvK6qvpTk8cDOJNd3z11aVX81vniSpL763IJuP7C/m/5ukr3AceMOJklamiUdQ0+ymbn7i97SLXp1kt1JLk9yzIizSZKWoHehJzkK+Bjw2qp6AHgv8DRgC3N78O9Y5HXbkswkmZmdnR1BZEnSQnoVepIjmCvzD1XVVQBVdW9VPVxVjwDvB7Yu9Nqq2l5V01U1PTU1NarckqRD9DnLJcBlwN6qeue85RvnDTsX2DP6eJKkvvqc5XIq8HLgK0l2dcveCJyfZAtQwN3AK8eSUJLUS5+zXD4DZIGnrht9HEnScvlJUUlqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSI/rcgu4pSW5MsjfJbUle0y1/QpLrk9zRfT1m/HElSYvps4d+EHhdVT0TOAV4VZKTgIuBG6rqROCGbl6SNCEDC72q9lfVl7rp7wJ7geOAs4Ed3bAdwDnjCilJGmxJx9CTbAaeDdwCPKmq9sNc6QPHLvKabUlmkszMzs4Ol1aStKjehZ7kKOBjwGur6oG+r6uq7VU1XVXTU1NTy8koSeqhV6EnOYK5Mv9QVV3VLb43ycbu+Y3AgfFElCT10ecslwCXAXur6p3znroGuKCbvgC4evTxJEl9re8x5lTg5cBXkuzqlr0RuAT4aJILgW8AvzmeiJKkPgYWelV9BsgiT79gtHEkScvlJ0UlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqRJ87Fl2e5ECSPfOWvSXJN5Ps6h5njjemJGmQPnvoHwTOWGD5pVW1pXtcN9pYkqSlGljoVXUz8K0VyCJJGsIwx9BfnWR3d0jmmJElkiQty3IL/b3A04AtwH7gHYsNTLItyUySmdnZ2WWuTpI0yLIKvaruraqHq+oR4P3A1sOM3V5V01U1PTU1tdyckqQBllXoSTbOmz0X2LPYWEnSylg/aECSDwOnARuS7APeDJyWZAtQwN3AK8eYUZLUw8BCr6rzF1h82RiySJKG4CdFJakRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNGFjoSS5PciDJnnnLnpDk+iR3dF+PGW9MSdIgffbQPwiccciyi4EbqupE4IZuXpI0QQMLvapuBr51yOKzgR3d9A7gnBHnkiQt0XKPoT+pqvYDdF+PXWxgkm1JZpLMzM7OLnN1kqRBxv6maFVtr6rpqpqempoa9+ok6f+t5Rb6vUk2AnRfD4wukiRpOZZb6NcAF3TTFwBXjyaOJGm5+py2+GHgc8BPJdmX5ELgEuBFSe4AXtTNS5ImaP2gAVV1/iJPvWDEWSRJQ/CTopLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDVi4PXQDyfJ3cB3gYeBg1U1PYpQkqSlG6rQO8+vqvtG8H0kSUPwkIskNWLYQi/gX5LsTLJtFIEkScsz7CGXU6vqniTHAtcnub2qbp4/oCv6bQDHH3/8kKuTJC1mqD30qrqn+3oA+DiwdYEx26tquqqmp6amhlmdJOkwll3oSR6X5PGPTgO/AuwZVTBJ0tIMc8jlScDHkzz6ff6hqj41klSSpCVbdqFX1V3AySPMIkkagqctSlIjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1Ihh7ym6YjZf/ImJrfvuS86a2Lolqa+h9tCTnJHka0nuTHLxqEJJkpZumHuKrgP+BngxcBJwfpKTRhVMkrQ0w+yhbwXurKq7qup/gSuAs0cTS5K0VMMU+nHAf86b39ctkyRNwDBvimaBZfVDg5JtwLZu9sEkXxtinYezAbhvHN84bx/ptxtbzhFaCxnBnKO0FjLC2si5YMYhe+SpfQYNU+j7gKfMm98E3HPooKraDmwfYj29JJmpqulxr2dYayHnWsgI5hyltZAR1kbOSWYc5pDLF4ETk5yQ5EjgpcA1o4klSVqqZe+hV9XBJK8G/hlYB1xeVbeNLJkkaUmG+mBRVV0HXDeiLMMa+2GdEVkLOddCRjDnKK2FjLA2ck4sY6p+6H1MSdIa5LVcJKkRa67QB11uIMlTk9yQZHeSm5JsmkDGy5McSLJnkeeT5N3dz7A7yXNWOmOXY1DOZyT5XJKHkrx+pfN1GQZlfFm3DXcn+WySk1c6Y5djUM6zu4y7kswk+cXVlnHeuOcmeTjJeSuV7ZD1D9qWpyX5TrctdyV500pn7HIM3J5d1l1Jbkvyr2MPVVVr5sHcm69fB34COBK4FTjpkDH/CFzQTZ8O/P0Ecv4y8BxgzyLPnwl8krlz+U8BbpnQ9hyU81jgucDbgNev0oy/ABzTTb94FW/Lo/jBIc5nAbevtozdmHXAp5l7b+y8VbotTwOunUS2JeY8GvgqcHw3f+y4M621PfQ+lxs4Cbihm75xgefHrqpuBr51mCFnA39Xcz4PHJ1k48qk+4FBOavqQFV9Efj+yqX6oQyDMn62qr7dzX6euc9DrLgeOR+s7rcaeBwLfAhv3Hr8uwT4Q+BjwIHxJ1pYz5wT1yPnbwNXVdU3uvFj36ZrrdD7XG7gVuAl3fS5wOOTPHEFsi2Fl00YjwuZ+5/PqpTk3CS3A58AfnfSeQ6V5DjmfmfeN+ksPfx8kluTfDLJT086zCKeDhzTHfrdmeQV417hWiv0PpcbeD3wvCRfBp4HfBM4OO5gS9TrsgnqL8nzmSv0iyadZTFV9fGqegZwDvDnk86zgHcBF1XVw5MOMsCXgKdW1cnAXwP/NOE8i1kP/CxwFvCrwJ8mefq4V7iWDLzcQFXdA/wGQJKjgJdU1XdWLGE/vS6boH6SPAv4APDiqvqvSecZpKpuTvK0JBuqajVdl2QauCIJzF2P5MwkB6tqVRVmVT0wb/q6JO9ZhdsS5n7P76uq7wHfS3IzcDLwb+Na4VrbQx94uYEkG5I8+nO9Abh8hTP2cQ3wiu5sl1OA71TV/kmHWouSHA9cBby8qsb2izKsJD+Zrim7s5qOBFbVH5+qOqGqNlfVZuBK4A9WW5kDJHnyvG25lbkeW1XbsnM18EtJ1if5MeDngL3jXOGa2kOvRS43kOTPgJmquoa5d8D/MkkBNwOvWumcST7c5diQZB/wZuCI7md4H3NnEJwJ3An8N/A7K52xT84kTwZmgB8HHknyWubOKnpgkW+54hmBNwFPBN7T/Y4frAlcGKlHzpcw90f8+8D/AL81703S1ZJxVeiR8zzg95McZG5bvnSlt2WfnFW1N8mngN3AI8AHquqwp4wOnWkC20GSNAZr7ZCLJGkRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY34P+th6qVakV8FAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "thousand_chunks = pd.read_csv('loans_2007.csv', chunksize=1000)\n", "\n", "memory_footprints = []\n", "\n", "for chunk in thousand_chunks:\n", " chunk_memory_mb = chunk.memory_usage(deep=True).sum() / 1048576\n", " memory_footprints.append(chunk_memory_mb)\n", "\n", "plt.hist(memory_footprints)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Chunks of 1000 rows is too small $=>$ Each chunk is no larger than 1.7 MB. Let's increase the chunk size to 3000:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD8CAYAAABn919SAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAADfZJREFUeJzt3X+MZeVdx/H3pwy15YdS3WmLLHTQNCRIquCEUEmwgdKsQMBE/lgSKlTMxp+l/giCTST6F0ZTq9ZIVkCopbSGgiItFSwlxISuzi5Qli5tSV3pFnSHEqG1Rlz79Y+56DrM7Nx7z5l7Z5++X8lk7j3nmft88rD3w5lz77mTqkKSdPh7zbQDSJL6YaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGjEzyck2bdpUc3Nzk5xSkg57O3fufL6qZtcaN9FCn5ubY2FhYZJTStJhL8k/DzPOUy6S1AgLXZIaYaFLUiMsdElqhIUuSY1Ys9CT3JJkf5LdK+z79SSVZNP6xJMkDWuYI/RbgS3LNyY5ETgfeKbnTJKkMaxZ6FX1MPDCCrv+ALgG8G/YSdIGMNY59CQXA1+rqsd7ziNJGtPIV4omOQp4P/CuIcdvA7YBnHTSSaNOJ0m9mbv2k1Obe+8NF677HOMcof8gcDLweJK9wGZgV5I3rzS4qrZX1XxVzc/OrvlRBJKkMY18hF5VTwBvfOX+oNTnq+r5HnNJkkY0zNsW7wAeAU5Jsi/JVesfS5I0qjWP0KvqsjX2z/WWRpI0Nq8UlaRGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWrEmoWe5JYk+5PsPmjb7yV5Ksnnk9yd5Lj1jSlJWsswR+i3AluWbXsAOK2q3gZ8Cbiu51ySpBGtWehV9TDwwrJt91fVgcHdzwGb1yGbJGkEfZxD/xngvtV2JtmWZCHJwuLiYg/TSZJW0qnQk7wfOADcvtqYqtpeVfNVNT87O9tlOknSIcyM+4NJrgAuAs6rquovkiRpHGMVepItwG8AP15V3+o3kiRpHMO8bfEO4BHglCT7klwFfAg4FnggyWNJblznnJKkNax5hF5Vl62w+eZ1yCJJ6sArRSWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1Ig1Cz3JLUn2J9l90LbvTfJAki8Pvr9hfWNKktYyzBH6rcCWZduuBT5TVW8FPjO4L0maojULvaoeBl5YtvkS4LbB7duAn+w5lyRpROOeQ39TVT0HMPj+xv4iSZLGse4viibZlmQhycLi4uJ6TydJ37HGLfR/TXI8wOD7/tUGVtX2qpqvqvnZ2dkxp5MkrWXcQr8HuGJw+wrgr/uJI0ka1zBvW7wDeAQ4Jcm+JFcBNwDnJ/kycP7gviRpimbWGlBVl62y67yes0iSOvBKUUlqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGdCr0JL+S5Mkku5PckeR1fQWTJI1m7EJPcgLwXmC+qk4DjgC29hVMkjSarqdcZoDXJ5kBjgKe7R5JkjSOsQu9qr4G/D7wDPAc8GJV3b98XJJtSRaSLCwuLo6fVJJ0SF1OubwBuAQ4Gfh+4Ogkly8fV1Xbq2q+quZnZ2fHTypJOqQup1zeCfxTVS1W1X8BdwE/1k8sSdKouhT6M8BZSY5KEuA8YE8/sSRJo+pyDn0HcCewC3hi8Fjbe8olSRrRTJcfrqrrget7yiJJ6sArRSWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1IhOhZ7kuCR3JnkqyZ4kb+8rmCRpNDMdf/4PgU9X1aVJXgsc1UMmSdIYxi70JN8NnANcCVBVLwMv9xNLkjSqLqdcfgBYBP48yaNJbkpydE+5JEkj6lLoM8AZwJ9W1enAvwPXLh+UZFuShSQLi4uLHaaTJB1Kl0LfB+yrqh2D+3eyVPD/T1Vtr6r5qpqfnZ3tMJ0k6VDGLvSq+hfgq0lOGWw6D/hCL6kkSSPr+i6XXwZuH7zD5SvAe7pHkiSNo1OhV9VjwHxPWSRJHXilqCQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjOhd6kiOSPJrk3j4CSZLG08cR+tXAnh4eR5LUQadCT7IZuBC4qZ84kqRxdT1C/yBwDfDtHrJIkjoYu9CTXATsr6qda4zblmQhycLi4uK400mS1tDlCP1s4OIke4GPAecm+cjyQVW1varmq2p+dna2w3SSpEMZu9Cr6rqq2lxVc8BW4MGqury3ZJKkkfg+dElqxEwfD1JVDwEP9fFYkqTxeIQuSY2w0CWpERa6JDXCQpekRljoktQIC12SGmGhS1IjLHRJaoSFLkmNsNAlqREWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjbDQJakRFrokNcJCl6RGWOiS1IixCz3JiUk+m2RPkieTXN1nMEnSaGY6/OwB4NeqaleSY4GdSR6oqi/0lE2SNIKxj9Cr6rmq2jW4/Q1gD3BCX8EkSaPp5Rx6kjngdGDHCvu2JVlIsrC4uNjHdJKkFXQu9CTHAJ8A3ldVLy3fX1Xbq2q+quZnZ2e7TidJWkWnQk9yJEtlfntV3dVPJEnSOLq8yyXAzcCeqvpAf5EkSePocoR+NvBu4Nwkjw2+LugplyRpRGO/bbGq/h5Ij1kkSR14pagkNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpERa6JDWiyx+4mKi5az85tbn33nDh1OaWpGF5hC5JjbDQJakRFrokNcJCl6RGWOiS1AgLXZIaYaFLUiMsdElqhIUuSY2w0CWpEZ0KPcmWJF9M8nSSa/sKJUka3diFnuQI4E+AnwBOBS5LcmpfwSRJo+lyhH4m8HRVfaWqXgY+BlzSTyxJ0qi6FPoJwFcPur9vsE2SNAVdPj43K2yrVw1KtgHbBne/meSLHeZcyybg+b4fNL/b+SHWJVcPzDUac41mI+aaWqY1emStXG8ZZo4uhb4POPGg+5uBZ5cPqqrtwPYO8wwtyUJVzU9irlGYazTmGo25hrcRM0F/ubqccvlH4K1JTk7yWmArcE/XQJKk8Yx9hF5VB5L8EvC3wBHALVX1ZG/JJEkj6fQn6KrqU8CnesrSh4mc2hmDuUZjrtGYa3gbMRP0lCtVr3odU5J0GPLSf0lqxGFX6EluSbI/ye5V9ifJHw0+juDzSc7YILnekeTFJI8Nvn5rQrlOTPLZJHuSPJnk6hXGTHzNhsw18TVL8rok/5Dk8UGu315hzHcl+fhgvXYkmdsAma5MsnjQWv3semZaNvcRSR5Ncu8K+ya6ViPkmsp6Jdmb5InBnAsr7O/2XKyqw+oLOAc4A9i9yv4LgPtYep/8WcCODZLrHcC9U1iv44EzBrePBb4EnDrtNRsy18TXbLAGxwxuHwnsAM5aNuYXgBsHt7cCH98Ama4EPjTpf1+DuX8V+OhK/60mvVYj5JrKegF7gU2H2N/puXjYHaFX1cPAC4cYcgnw4VryOeC4JMdvgFxTUVXPVdWuwe1vAHt49RW9E1+zIXNN3GANvjm4e+Tga/kLTZcAtw1u3wmcl2SlC+0mmWkqkmwGLgRuWmXIRNdqhFwbVafn4mFX6EPYyB9J8PbBr833JfmhSU8++HX3dJaO8A421TU7RC6YwpoNflV/DNgPPFBVq65XVR0AXgS+b8qZAH5q8Gv6nUlOXGH/evggcA3w7VX2T3ythswF01mvAu5PsjNLV9Ev1+m52GKhD/WRBFOwC3hLVf0w8MfAX01y8iTHAJ8A3ldVLy3fvcKPTGTN1sg1lTWrqv+uqh9h6ernM5OctmzIxNdriEx/A8xV1duAv+P/jorXTZKLgP1VtfNQw1bYtq5rNWSuia/XwNlVdQZLn1L7i0nOWba/03q1WOhDfSTBpFXVS6/82lxL798/MsmmScyd5EiWSvP2qrprhSFTWbO1ck1zzQZz/hvwELBl2a7/Xa8kM8D3MKHTbatlqqqvV9V/Du7+GfCjE4hzNnBxkr0sfdrquUk+smzMNNZqzVxTWi+q6tnB9/3A3Sx9au3BOj0XWyz0e4CfHrxafBbwYlU9N+1QSd78yrnDJGeytPZfn8C8AW4G9lTVB1YZNvE1GybXNNYsyWyS4wa3Xw+8E3hq2bB7gCsGty8FHqzBK1rTyrTsPOvFLL0msa6q6rqq2lxVcyy94PlgVV2+bNhE12rYXNNYryRHJzn2ldvAu4Dl74rr9FzsdKXoNCS5g6V3P2xKsg+4nqUXiaiqG1m6cvUC4GngW8B7NkiuS4GfT3IA+A9g63r/wx44G3g38MTgHCzAbwInHZRtGms2TK5prNnxwG1Z+gMurwH+sqruTfI7wEJV3cPS/4j+IsnTLB1tbt0Amd6b5GLgwCDTleucaVVTXqthc01jvd4E3D04RpkBPlpVn07yc9DPc9ErRSWpES2ecpGk70gWuiQ1wkKXpEZY6JLUCAtdkhphoUtSIyx0SWqEhS5JjfgfxIqyuYfC0GwAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "chunks = pd.read_csv('loans_2007.csv', chunksize=3000)\n", "\n", "memory_footprints = []\n", "\n", "for chunk in chunks:\n", " chunk_memory_mb = chunk.memory_usage(deep=True).sum() / 1048576\n", " memory_footprints.append(chunk_memory_mb)\n", "\n", "plt.hist(memory_footprints)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Chunks of 3000 rows gets us closer to the 5 MB limit per chunk we set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Exploring the Data in Chunks\n", "\n", "Let's familiarize ourselves with the columns to see which ones we can optimize. Let's try to understand the column types better while using dataframe chunks.\n", "\n", "For each chunk:\n", "\n", " - How many columns have a numeric type? How many have a string type?\n", " - How many unique values are there in each string column? How many of the string columns contain values that are less than 50% unique?\n", " - Which float columns have no missing values and could be candidates for conversion to the integer type?\n", "\n", "Calculate the total memory usage across all of the chunks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How many columns have a numeric type? How many have a string type?" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Chunk 1:\n", "float64 30\n", "object 21\n", "int64 1\n", "dtype: int64\n", "\n", "Chunk 2:\n", "float64 30\n", "object 21\n", "int64 1\n", "dtype: int64\n", "\n", "Chunk 3:\n", "float64 30\n", "object 21\n", "int64 1\n", "dtype: int64\n", "\n", "Chunk 4:\n", "float64 30\n", "object 21\n", "int64 1\n", "dtype: int64\n", "\n", "Chunk 5:\n", "float64 30\n", "object 21\n", "int64 1\n", "dtype: int64\n", "\n", "Chunk 6:\n", "float64 30\n", "object 21\n", "int64 1\n", "dtype: int64\n", "\n", "Chunk 7:\n", "float64 30\n", "object 21\n", "int64 1\n", "dtype: int64\n", "\n", "Chunk 8:\n", "float64 30\n", "object 21\n", "int64 1\n", "dtype: int64\n", "\n", "Chunk 9:\n", "float64 30\n", "object 21\n", "int64 1\n", "dtype: int64\n", "\n", "Chunk 10:\n", "float64 30\n", "object 21\n", "int64 1\n", "dtype: int64\n", "\n", "Chunk 11:\n", "float64 30\n", "object 21\n", "int64 1\n", "dtype: int64\n", "\n", "Chunk 12:\n", "float64 30\n", "object 21\n", "int64 1\n", "dtype: int64\n", "\n", "Chunk 13:\n", "float64 30\n", "object 21\n", "int64 1\n", "dtype: int64\n", "\n", "Chunk 14:\n", "float64 30\n", "object 22\n", "dtype: int64\n", "\n", "Chunk 15:\n", "float64 30\n", "object 22\n", "dtype: int64\n" ] } ], "source": [ "chunks = pd.read_csv('loans_2007.csv', chunksize=3000)\n", "\n", "count = 0\n", "total_memory_mb = 0\n", "for chunk in chunks:\n", " count += 1\n", " \n", " print(f'\\nChunk {count}:')\n", " print(chunk.dtypes.value_counts())" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "overall obj cols: ['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']\n", "\n", "chunk 14 obj cols: ['id', 'term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']\n", "\n", "overall obj cols: ['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']\n", "\n", "chunk 15 obj cols: ['id', 'term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']\n", "\n" ] } ], "source": [ "# Are string columns consistent across chunks?\n", "count = 0\n", "obj_cols = []\n", "chunk = pd.read_csv('loans_2007.csv', chunksize=3000)\n", "\n", "for chunk in chunk:\n", " count += 1\n", " chunk_obj_cols = chunk.select_dtypes(include=['object']).columns.tolist()\n", " \n", " if len(obj_cols) > 0:\n", " is_same = obj_cols == chunk_obj_cols\n", " \n", " if not is_same:\n", " print(f\"overall obj cols: {obj_cols}\\n\")\n", " print(f\"chunk {count} obj cols: {chunk_obj_cols}\\n\") \n", " \n", " else:\n", " obj_cols = chunk_obj_cols" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Observations**:\n", "\n", "- There are 31 numeric columns and 21 string columns\n", "- In the last 2 chunks, the `id` column is being cast as an object instead of an `int64` type. Since the `id` column isn't useful for analysis, visualization or predictive modeling, we can ignore this column." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How many unique values are there in each string column? How many of the string columns contain values that are less than 50% unique?" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total Num Rows: 42538\n" ] } ], "source": [ "## Create dictionary (key: column, value: list of Series objects representing each chunk's value counts)\n", "# vc - value counts\n", "chunks = pd.read_csv('loans_2007.csv', chunksize=3000)\n", "\n", "count = 0\n", "total_num_rows = 0\n", "str_cols_vc = {}\n", "\n", "for chunk in chunks:\n", " total_num_rows += chunk.shape[0]\n", " str_cols = chunk.select_dtypes(include=['object'])\n", " \n", " for col in str_cols.columns:\n", " current_col_vc = chunk[col].value_counts()\n", " \n", " if col in str_cols_vc:\n", " str_cols_vc[col].append(current_col_vc)\n", " else:\n", " str_cols_vc[col] = [current_col_vc]\n", "\n", "print(f'Total Num Rows: {total_num_rows}')\n", "## Combine the value count results\n", "combined_vcs = {}\n", "\n", "for col, vc_list in str_cols_vc.items():\n", " combined_vc = pd.concat(vc_list)\n", " final_vc = combined_vc.groupby(combined_vc.index).sum()\n", " combined_vcs[col] = final_vc" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Unique Values per string column (22 columns):\n", "term : 2 0.00\n", "int_rate : 394 0.93\n", "grade : 7 0.02\n", "sub_grade : 35 0.08\n", "emp_title : 30658 72.07\n", "emp_length : 11 0.03\n", "home_ownership : 5 0.01\n", "verification_status : 3 0.01\n", "issue_d : 55 0.13\n", "loan_status : 9 0.02\n", "pymnt_plan : 2 0.00\n", "purpose : 14 0.03\n", "title : 21264 49.99\n", "zip_code : 837 1.97\n", "addr_state : 50 0.12\n", "earliest_cr_line : 530 1.25\n", "revol_util : 1119 2.63\n", "initial_list_status : 1 0.00\n", "last_pymnt_d : 103 0.24\n", "last_credit_pull_d : 108 0.25\n", "application_type : 1 0.00\n", "id : 3538 8.32\n" ] } ], "source": [ "print(f'Unique Values per string column ({len(combined_vcs)} columns):')\n", "for col, col_vc in combined_vcs.items():\n", " unique_values_percentage = (len(col_vc) / total_num_rows) * 100\n", " print(f'{col:<20}: {col_vc.shape[0]:<10}{unique_values_percentage:0.2f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Observations**:\n", "\n", "- Each string column has varying number of unique values (1 to over 30,000).\n", "- 2 columns in particular (`emp_title` and `title`) contain 50% or more unique values. The remaining string columns have no more than 1 - 10% unique values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Which float columns have no missing values and could be candidates for conversion to the integer type?" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Chunk 1:\n", "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n", "0/30 columns:[]\n", "\n", "Chunk 2:\n", "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n", "0/30 columns:[]\n", "\n", "Chunk 3:\n", "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n", "0/30 columns:[]\n", "\n", "Chunk 4:\n", "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n", "0/30 columns:[]\n", "\n", "Chunk 5:\n", "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n", "0/30 columns:[]\n", "\n", "Chunk 6:\n", "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n", "0/30 columns:[]\n", "\n", "Chunk 7:\n", "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n", "0/30 columns:[]\n", "\n", "Chunk 8:\n", "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n", "0/30 columns:[]\n", "\n", "Chunk 9:\n", "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n", "0/30 columns:[]\n", "\n", "Chunk 10:\n", "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n", "0/30 columns:[]\n", "\n", "Chunk 11:\n", "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n", "0/30 columns:[]\n", "\n", "Chunk 12:\n", "29/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']\n", "1/30 columns:['pub_rec_bankruptcies']\n", "member_id 0\n", "loan_amnt 0\n", "funded_amnt 0\n", "funded_amnt_inv 0\n", "installment 0\n", "annual_inc 0\n", "dti 0\n", "delinq_2yrs 0\n", "inq_last_6mths 0\n", "open_acc 0\n", "pub_rec 0\n", "revol_bal 0\n", "total_acc 0\n", "out_prncp 0\n", "out_prncp_inv 0\n", "total_pymnt 0\n", "total_pymnt_inv 0\n", "total_rec_prncp 0\n", "total_rec_int 0\n", "total_rec_late_fee 0\n", "recoveries 0\n", "collection_recovery_fee 0\n", "last_pymnt_amnt 0\n", "collections_12_mths_ex_med 0\n", "policy_code 0\n", "acc_now_delinq 0\n", "chargeoff_within_12_mths 0\n", "delinq_amnt 0\n", "pub_rec_bankruptcies 3\n", "tax_liens 0\n", "dtype: int64\n", "\n", "Chunk 13:\n", "29/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']\n", "1/30 columns:['pub_rec_bankruptcies']\n", "member_id 0\n", "loan_amnt 0\n", "funded_amnt 0\n", "funded_amnt_inv 0\n", "installment 0\n", "annual_inc 0\n", "dti 0\n", "delinq_2yrs 0\n", "inq_last_6mths 0\n", "open_acc 0\n", "pub_rec 0\n", "revol_bal 0\n", "total_acc 0\n", "out_prncp 0\n", "out_prncp_inv 0\n", "total_pymnt 0\n", "total_pymnt_inv 0\n", "total_rec_prncp 0\n", "total_rec_int 0\n", "total_rec_late_fee 0\n", "recoveries 0\n", "collection_recovery_fee 0\n", "last_pymnt_amnt 0\n", "collections_12_mths_ex_med 0\n", "policy_code 0\n", "acc_now_delinq 0\n", "chargeoff_within_12_mths 0\n", "delinq_amnt 0\n", "pub_rec_bankruptcies 4\n", "tax_liens 0\n", "dtype: int64\n", "\n", "Chunk 14:\n", "0/30 columns:[]\n", "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n", "member_id 1\n", "loan_amnt 1\n", "funded_amnt 1\n", "funded_amnt_inv 1\n", "installment 1\n", "annual_inc 1\n", "dti 1\n", "delinq_2yrs 1\n", "inq_last_6mths 1\n", "open_acc 1\n", "pub_rec 1\n", "revol_bal 1\n", "total_acc 1\n", "out_prncp 1\n", "out_prncp_inv 1\n", "total_pymnt 1\n", "total_pymnt_inv 1\n", "total_rec_prncp 1\n", "total_rec_int 1\n", "total_rec_late_fee 1\n", "recoveries 1\n", "collection_recovery_fee 1\n", "last_pymnt_amnt 1\n", "collections_12_mths_ex_med 57\n", "policy_code 1\n", "acc_now_delinq 1\n", "chargeoff_within_12_mths 57\n", "delinq_amnt 1\n", "pub_rec_bankruptcies 849\n", "tax_liens 40\n", "dtype: int64\n", "\n", "Chunk 15:\n", "0/30 columns:[]\n", "30/30 columns:['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']\n", "member_id 2\n", "loan_amnt 2\n", "funded_amnt 2\n", "funded_amnt_inv 2\n", "installment 2\n", "annual_inc 6\n", "dti 2\n", "delinq_2yrs 31\n", "inq_last_6mths 31\n", "open_acc 31\n", "pub_rec 31\n", "revol_bal 2\n", "total_acc 31\n", "out_prncp 2\n", "out_prncp_inv 2\n", "total_pymnt 2\n", "total_pymnt_inv 2\n", "total_rec_prncp 2\n", "total_rec_int 2\n", "total_rec_late_fee 2\n", "recoveries 2\n", "collection_recovery_fee 2\n", "last_pymnt_amnt 2\n", "collections_12_mths_ex_med 91\n", "policy_code 2\n", "acc_now_delinq 31\n", "chargeoff_within_12_mths 91\n", "delinq_amnt 31\n", "pub_rec_bankruptcies 512\n", "tax_liens 68\n", "dtype: int64\n" ] } ], "source": [ "chunks = pd.read_csv('loans_2007.csv', chunksize=3000)\n", "count = 0\n", "max_count = 2\n", "\n", "for chunk in chunks:\n", " count += 1\n", " print(f'\\nChunk {count}:')\n", " \n", " float_cols_with_null_check = chunk.select_dtypes(include=['float']).isnull().sum()\n", " \n", " #print(float_cols_with_null_check.shape[0])\n", " #float_cols_no_missing_val = float_cols_with_null_check.index.tolist()\n", " \n", " cols_no_missing_val = list(float_cols_with_null_check[float_cols_with_null_check == 0].index)\n", " \n", " cols_with_missing_val = list(float_cols_with_null_check[float_cols_with_null_check > 0].index)\n", " \n", " print(f\"{len(cols_no_missing_val)}/{len(float_cols_with_null_check)} columns:\"\\\n", " f\"{cols_no_missing_val}\")\n", " \n", " print(f\"{len(cols_with_missing_val)}/{len(float_cols_with_null_check)} columns:\"\\\n", " f\"{cols_with_missing_val}\")\n", " \n", " # the last several chunks have float columns with missing values\n", " if count > 11:\n", " print(float_cols_with_null_check)\n", " #if count >= max_count:\n", " # break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Observations**:\n", "\n", "There are no float columns with no missing values. Therefore, we can't convert any of the float columns to an integer type." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Calculate the total memory usage across all of the chunks." ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Chunk 1:\tMemory Usage: 4.65 MB\n", "Chunk 2:\tMemory Usage: 4.64 MB\n", "Chunk 3:\tMemory Usage: 4.65 MB\n", "Chunk 4:\tMemory Usage: 4.65 MB\n", "Chunk 5:\tMemory Usage: 4.64 MB\n", "Chunk 6:\tMemory Usage: 4.65 MB\n", "Chunk 7:\tMemory Usage: 4.64 MB\n", "Chunk 8:\tMemory Usage: 4.65 MB\n", "Chunk 9:\tMemory Usage: 4.65 MB\n", "Chunk 10:\tMemory Usage: 4.65 MB\n", "Chunk 11:\tMemory Usage: 4.66 MB\n", "Chunk 12:\tMemory Usage: 4.66 MB\n", "Chunk 13:\tMemory Usage: 4.66 MB\n", "Chunk 14:\tMemory Usage: 4.90 MB\n", "Chunk 15:\tMemory Usage: 0.88 MB\n", "There are 15 chunks! Total memory: 66.22 MB\n" ] } ], "source": [ "chunks = pd.read_csv('loans_2007.csv', chunksize=3000)\n", "\n", "count = 0\n", "total_memory_mb = 0\n", "for chunk in chunks:\n", " count += 1\n", " print(f\"Chunk {count}:\", end='\\t')\n", " \n", " # chunk memory\n", " chunk_memory_mb = chunk.memory_usage(deep=True).sum() / 1048576\n", " print(f\"Memory Usage: {chunk_memory_mb:0.2f} MB\")\n", " total_memory_mb += chunk_memory_mb\n", "\n", "print(f\"There are {count} chunks! Total memory: {total_memory_mb:0.2f} MB\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Optimizing String Columns\n", "\n", "We can achieve the greatest memory improvements by converting the string columns to a numeric type. Let's convert all of the columns where the values are less than 50% unique to the category type, and the columns that contain numeric values to the float type." ] }, { "cell_type": "code", "execution_count": 145, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Chunk 1:\n", "term : 0.19 (object) vs. 0.00 (int8)\n", "int_rate : 0.18 (object) vs. 0.01 (float32)\n", "grade : 0.19 (object) vs. 0.00 (category)\n", "sub_grade : 0.17 (object) vs. 0.01 (category)\n", "emp_title : 0.21 (object) vs. 0.21 (object)\n", "emp_length : 0.18 (object) vs. 0.18 (object)\n", "home_ownership : 0.18 (object) vs. 0.00 (category)\n", "verification_status : 0.20 (object) vs. 0.00 (category)\n", "issue_d : 0.19 (object) vs. 0.02 (datetime64[ns])\n", "loan_status : 0.19 (object) vs. 0.00 (category)\n", "pymnt_plan : 0.19 (object) vs. 0.00 (bool)\n", "purpose : 0.20 (object) vs. 0.00 (category)\n", "title : 0.21 (object) vs. 0.21 (object)\n", "zip_code : 0.18 (object) vs. 0.18 (object)\n", "addr_state : 0.17 (object) vs. 0.17 (object)\n", "earliest_cr_line : 0.19 (object) vs. 0.02 (datetime64[ns])\n", "revol_util : 0.18 (object) vs. 0.01 (float32)\n", "initial_list_status : 0.19 (object) vs. 0.00 (category)\n", "last_pymnt_d : 0.19 (object) vs. 0.02 (datetime64[ns])\n", "last_credit_pull_d : 0.19 (object) vs. 0.02 (datetime64[ns])\n", "application_type : 0.19 (object) vs. 0.00 (category)\n", "Memory: 3.94 mb vs 1.10 mb\n", "\n", "Memory:4.65 mb vs 1.81 mb\n" ] } ], "source": [ "chunks = pd.read_csv('loans_2007.csv', chunksize=3000)\n", "count = 0\n", "total_initial_memory_mb = 0\n", "total_final_memory_mb = 0\n", "B_PER_MB = 1048576\n", "\n", "for chunk in chunks:\n", " count += 1\n", " print(f'\\nChunk {count}:')\n", " obj_cols = chunk.select_dtypes(include=['object'])\n", " \n", " total_initial_memory_mb += (chunk.memory_usage(deep=True).sum() / 1048576)\n", " \n", " #print('Initial Memory Usage:')\n", " memory_comparison = []\n", " initial_memory_mb = 0\n", " for col in obj_cols.columns:\n", " chunk_memory_mb = chunk[col].memory_usage(deep=True) / 1048576\n", " initial_memory_mb += chunk_memory_mb\n", " memory_comparison.append(f'{col:<20}: {chunk_memory_mb:0.2f} ({chunk[col].dtype})')\n", " \n", " # convert columns to category type\n", " chunk['grade'] = chunk['grade'].astype('category')\n", " chunk['sub_grade'] = chunk['sub_grade'].astype('category')\n", " chunk['home_ownership'] = chunk['home_ownership'].astype('category')\n", " chunk['verification_status'] = chunk['verification_status'].astype('category')\n", " chunk['application_type'] = chunk['application_type'].astype('category')\n", " chunk['loan_status'] = chunk['loan_status'].astype('category')\n", " chunk['initial_list_status'] = chunk['initial_list_status'].astype('category')\n", " chunk['purpose'] = chunk['purpose'].astype('category')\n", " \n", " # convert columns to numeric type\n", " chunk['term'] = chunk['term'].str.replace('months', '').str.replace(' ', '')\n", " chunk['term'] = pd.to_numeric(chunk['term'], downcast='integer')\n", " \n", " chunk['int_rate'] = chunk['int_rate'].str.replace(' ', '').str.replace('%', '')\n", " chunk['int_rate'] = pd.to_numeric(chunk['int_rate'], downcast='float')\n", " \n", " chunk['revol_util'] = chunk['revol_util'].str.replace('%', '')\n", " chunk['revol_util'] = pd.to_numeric(chunk['revol_util'], downcast='float')\n", " \n", " # convert columns to datetime type\n", " chunk['issue_d'] = pd.to_datetime(chunk['issue_d'])\n", " chunk['earliest_cr_line'] = pd.to_datetime(chunk['earliest_cr_line'])\n", " chunk['last_pymnt_d'] = pd.to_datetime(chunk['last_pymnt_d'])\n", " chunk['last_credit_pull_d'] = pd.to_datetime(chunk['last_credit_pull_d'])\n", " \n", " # convert columns to boolean type - 2 valid approaches!\n", " \n", " chunk['pymnt_plan'] = chunk['pymnt_plan'].astype('bool')\n", " \n", " #d = {'n': False, 'y': True}\n", " #chunk['pymnt_plan'] = chunk['pymnt_plan'].map(d)\n", " \n", " \n", " # cycle through column unique values\n", " #for col in obj_cols.columns:\n", " # print('\\n', col)\n", " # unique_values = chunk[col].unique()\n", " # print(f'{len(unique_values)} values: {unique_values[:5]}')\n", " \n", " #print('\\nMemory Comparison:')\n", " total_memory_mb = 0\n", " for i, col in enumerate(obj_cols.columns):\n", " chunk_memory_mb = chunk[col].memory_usage(deep=True) / 1048576\n", " total_memory_mb += chunk_memory_mb\n", " previous_info = memory_comparison[i]\n", " print(f'{previous_info:<10} vs. {chunk_memory_mb:0.2f} ({chunk[col].dtype})')\n", " \n", " print(f'Memory: {initial_memory_mb:0.2f} mb vs {total_memory_mb:0.2f} mb')\n", " \n", " total_final_memory_mb += (chunk.memory_usage(deep=True).sum() / 1048576)\n", " break\n", "\n", "print(f'\\nMemory:{total_initial_memory_mb:0.2f} mb vs {total_final_memory_mb:0.2f} mb')" ] }, { "cell_type": "code", "execution_count": 144, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "60.963455149501655" ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(66.22 - 25.85) / 66.22 * 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "=> a 61% improvement in memory usage!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Optimizing Numeric Columns\n", "\n", "We found out earlier that all float columns have some missing values. So we can't convert any of them to an integer column. We can downcast each float column to a more efficient subtype." ] }, { "cell_type": "code", "execution_count": 158, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Chunk 1:\n", "Memory: 0.69 mb vs 0.35 mb\n", "\n", "Chunk 2:\n", "Memory: 0.69 mb vs 0.35 mb\n", "\n", "Chunk 3:\n", "Memory: 0.69 mb vs 0.35 mb\n", "\n", "Chunk 4:\n", "Memory: 0.69 mb vs 0.35 mb\n", "\n", "Chunk 5:\n", "Memory: 0.69 mb vs 0.35 mb\n", "\n", "Chunk 6:\n", "Memory: 0.69 mb vs 0.35 mb\n", "\n", "Chunk 7:\n", "Memory: 0.69 mb vs 0.35 mb\n", "\n", "Chunk 8:\n", "Memory: 0.69 mb vs 0.35 mb\n", "\n", "Chunk 9:\n", "Memory: 0.69 mb vs 0.35 mb\n", "\n", "Chunk 10:\n", "Memory: 0.69 mb vs 0.35 mb\n", "\n", "Chunk 11:\n", "Memory: 0.69 mb vs 0.35 mb\n", "\n", "Chunk 12:\n", "Memory: 0.69 mb vs 0.35 mb\n", "\n", "Chunk 13:\n", "Memory: 0.69 mb vs 0.35 mb\n", "\n", "Chunk 14:\n", "Memory: 0.69 mb vs 0.35 mb\n", "\n", "Chunk 15:\n", "Memory: 0.13 mb vs 0.06 mb\n", "\n", "Memory:66.22 mb vs 61.35 mb\n" ] } ], "source": [ "chunks = pd.read_csv('loans_2007.csv', chunksize=3000)\n", "count = 0\n", "total_initial_memory_mb = 0\n", "total_final_memory_mb = 0\n", "B_PER_MB = 1048576\n", "\n", "for chunk in chunks:\n", " count += 1\n", " print(f'\\nChunk {count}:')\n", " float_cols = chunk.select_dtypes(include=['float'])\n", " \n", " total_initial_memory_mb += (chunk.memory_usage(deep=True).sum() / 1048576)\n", " \n", " #print('Initial Memory Usage:')\n", " memory_comparison = []\n", " initial_memory_mb = 0\n", " for col in float_cols.columns:\n", " chunk_memory_mb = chunk[col].memory_usage(deep=True) / 1048576\n", " initial_memory_mb += chunk_memory_mb\n", " memory_comparison.append(f'{col:<30}: {chunk_memory_mb:0.2f} ({chunk[col].dtype})')\n", " \n", " for col in float_cols.columns:\n", " chunk[col] = pd.to_numeric(chunk[col], downcast='float')\n", " \n", " # cycle through column unique values\n", " #for col in float_cols.columns:\n", " # print('\\n', col)\n", " # unique_values = chunk[col].unique()\n", " # print(f'{len(unique_values)} values: {unique_values[:5]}')\n", " \n", " \n", " #print('\\nMemory Comparison:')\n", " total_memory_mb = 0\n", " for i, col in enumerate(float_cols.columns):\n", " chunk_memory_mb = chunk[col].memory_usage(deep=True) / 1048576\n", " total_memory_mb += chunk_memory_mb\n", " previous_info = memory_comparison[i]\n", " #print(f'{previous_info:<20} vs. {chunk_memory_mb:0.2f} ({chunk[col].dtype})')\n", " \n", " print(f'Memory: {initial_memory_mb:0.2f} mb vs {total_memory_mb:0.2f} mb')\n", " \n", " total_final_memory_mb += (chunk.memory_usage(deep=True).sum() / 1048576)\n", " \n", " #break\n", "\n", "print(f'\\nMemory:{total_initial_memory_mb:0.2f} mb vs {total_final_memory_mb:0.2f} mb')" ] }, { "cell_type": "code", "execution_count": 160, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7.354273633343397" ] }, "execution_count": 160, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(66.22 - 61.35) / 66.22 * 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "=> So we save 7% by converting our float columns to more space efficient subtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conclusion\n", "\n", "By optimizing our strings and float columns, we were able to reduce our memory usage by **68%**. This is quite significant reduction, especially in cases when we are dealing with very large datasets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Next Steps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's an idea for some next steps:\n", "\n", "- Create a function that automates as much of the work we just did as possible, so that we could use it on other Lending Club data sets. This function should:\n", " - Determine the optimal chunk size based on the memory constraints you provide\n", " - Determine which string columns can be converted to numeric ones by removing the % character\n", " - Determine which numeric columns can be converted to more space efficient representations" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }