{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Practice Optimizing Dataframes and Processing in Chunks\n", "\n", "In this project, we'll practice working with chunked dataframes and optimizing a dataframe's memory usage. We'll be working with financial lending data from [Lending Club], a marketplace for personal loans that matches borrowers with investors. You can read more about the marketplace on its [website].\n", "\n", "The Lending Club's website lists approved loans. Qualified investors can view the borrower's credit score, the purpose of the loan, and other details in the loan applications. Once a lender is ready to back a loan, it selects the amount of money it wants to fund. When the loan amount the borrower requested is fully funded, the borrower receives the money, minus the [origination fee] that Lending Club charges.\n", "\n", "We'll be working with a dataset of loans approved from `2007-2011`, which can be downloaded from [Lending Club's website]. The `desc` column has been removed to improve system performance.\n", "\n", "If we read in the entire data set, it will consume about 67 megabytes of memory. For learning purposes, let's imagine that we only have 10 megabytes of memory available throughout this project, so we can practice the concepts we learned in the last two missions.\n", "\n", "[Lending Club]: https://www.lendingclub.com/\n", "[website]: https://www.lendingclub.com/public/how-peer-lending-works.action\n", "[origination fee]: https://help.lendingclub.com/hc/en-us/articles/214501207-What-is-the-origination-fee-\n", "[Lending Club's website]: https://www.lendingclub.com/info/download-data.action" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import pprint as pp\n", "import matplotlib.pyplot as plt\n", "pd.options.display.max_columns = 99" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "loans_2007.csv\r\n", "project02_practice_optimizing_dataframes_and_processing_in_chunks.ipynb\r\n" ] } ], "source": [ "!ls" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Introduction\n", "\n", "- Read in the first five lines from `loans_2007.csv` and look for any data quality issues.\n", "- Read in the first 1000 rows from the data set, and calculate the total memory usage for these rows. Increase or decrease the number of rows to converge on a memory usage under `five megabytes` (to stay on the conservative side)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | id | \n", "member_id | \n", "loan_amnt | \n", "funded_amnt | \n", "funded_amnt_inv | \n", "term | \n", "int_rate | \n", "installment | \n", "grade | \n", "sub_grade | \n", "emp_title | \n", "emp_length | \n", "home_ownership | \n", "annual_inc | \n", "verification_status | \n", "issue_d | \n", "loan_status | \n", "pymnt_plan | \n", "purpose | \n", "title | \n", "zip_code | \n", "addr_state | \n", "dti | \n", "delinq_2yrs | \n", "earliest_cr_line | \n", "inq_last_6mths | \n", "open_acc | \n", "pub_rec | \n", "revol_bal | \n", "revol_util | \n", "total_acc | \n", "initial_list_status | \n", "out_prncp | \n", "out_prncp_inv | \n", "total_pymnt | \n", "total_pymnt_inv | \n", "total_rec_prncp | \n", "total_rec_int | \n", "total_rec_late_fee | \n", "recoveries | \n", "collection_recovery_fee | \n", "last_pymnt_d | \n", "last_pymnt_amnt | \n", "last_credit_pull_d | \n", "collections_12_mths_ex_med | \n", "policy_code | \n", "application_type | \n", "acc_now_delinq | \n", "chargeoff_within_12_mths | \n", "delinq_amnt | \n", "pub_rec_bankruptcies | \n", "tax_liens | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "1077501 | \n", "1296599.0 | \n", "5000.0 | \n", "5000.0 | \n", "4975.0 | \n", "36 months | \n", "10.65% | \n", "162.87 | \n", "B | \n", "B2 | \n", "NaN | \n", "10+ years | \n", "RENT | \n", "24000.0 | \n", "Verified | \n", "Dec-2011 | \n", "Fully Paid | \n", "n | \n", "credit_card | \n", "Computer | \n", "860xx | \n", "AZ | \n", "27.65 | \n", "0.0 | \n", "Jan-1985 | \n", "1.0 | \n", "3.0 | \n", "0.0 | \n", "13648.0 | \n", "83.7% | \n", "9.0 | \n", "f | \n", "0.00 | \n", "0.00 | \n", "5863.155187 | \n", "5833.84 | \n", "5000.00 | \n", "863.16 | \n", "0.00 | \n", "0.00 | \n", "0.00 | \n", "Jan-2015 | \n", "171.62 | \n", "Jun-2016 | \n", "0.0 | \n", "1.0 | \n", "INDIVIDUAL | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
1 | \n", "1077430 | \n", "1314167.0 | \n", "2500.0 | \n", "2500.0 | \n", "2500.0 | \n", "60 months | \n", "15.27% | \n", "59.83 | \n", "C | \n", "C4 | \n", "Ryder | \n", "< 1 year | \n", "RENT | \n", "30000.0 | \n", "Source Verified | \n", "Dec-2011 | \n", "Charged Off | \n", "n | \n", "car | \n", "bike | \n", "309xx | \n", "GA | \n", "1.00 | \n", "0.0 | \n", "Apr-1999 | \n", "5.0 | \n", "3.0 | \n", "0.0 | \n", "1687.0 | \n", "9.4% | \n", "4.0 | \n", "f | \n", "0.00 | \n", "0.00 | \n", "1008.710000 | \n", "1008.71 | \n", "456.46 | \n", "435.17 | \n", "0.00 | \n", "117.08 | \n", "1.11 | \n", "Apr-2013 | \n", "119.66 | \n", "Sep-2013 | \n", "0.0 | \n", "1.0 | \n", "INDIVIDUAL | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
2 | \n", "1077175 | \n", "1313524.0 | \n", "2400.0 | \n", "2400.0 | \n", "2400.0 | \n", "36 months | \n", "15.96% | \n", "84.33 | \n", "C | \n", "C5 | \n", "NaN | \n", "10+ years | \n", "RENT | \n", "12252.0 | \n", "Not Verified | \n", "Dec-2011 | \n", "Fully Paid | \n", "n | \n", "small_business | \n", "real estate business | \n", "606xx | \n", "IL | \n", "8.72 | \n", "0.0 | \n", "Nov-2001 | \n", "2.0 | \n", "2.0 | \n", "0.0 | \n", "2956.0 | \n", "98.5% | \n", "10.0 | \n", "f | \n", "0.00 | \n", "0.00 | \n", "3005.666844 | \n", "3005.67 | \n", "2400.00 | \n", "605.67 | \n", "0.00 | \n", "0.00 | \n", "0.00 | \n", "Jun-2014 | \n", "649.91 | \n", "Jun-2016 | \n", "0.0 | \n", "1.0 | \n", "INDIVIDUAL | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
3 | \n", "1076863 | \n", "1277178.0 | \n", "10000.0 | \n", "10000.0 | \n", "10000.0 | \n", "36 months | \n", "13.49% | \n", "339.31 | \n", "C | \n", "C1 | \n", "AIR RESOURCES BOARD | \n", "10+ years | \n", "RENT | \n", "49200.0 | \n", "Source Verified | \n", "Dec-2011 | \n", "Fully Paid | \n", "n | \n", "other | \n", "personel | \n", "917xx | \n", "CA | \n", "20.00 | \n", "0.0 | \n", "Feb-1996 | \n", "1.0 | \n", "10.0 | \n", "0.0 | \n", "5598.0 | \n", "21% | \n", "37.0 | \n", "f | \n", "0.00 | \n", "0.00 | \n", "12231.890000 | \n", "12231.89 | \n", "10000.00 | \n", "2214.92 | \n", "16.97 | \n", "0.00 | \n", "0.00 | \n", "Jan-2015 | \n", "357.48 | \n", "Apr-2016 | \n", "0.0 | \n", "1.0 | \n", "INDIVIDUAL | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
4 | \n", "1075358 | \n", "1311748.0 | \n", "3000.0 | \n", "3000.0 | \n", "3000.0 | \n", "60 months | \n", "12.69% | \n", "67.79 | \n", "B | \n", "B5 | \n", "University Medical Group | \n", "1 year | \n", "RENT | \n", "80000.0 | \n", "Source Verified | \n", "Dec-2011 | \n", "Current | \n", "n | \n", "other | \n", "Personal | \n", "972xx | \n", "OR | \n", "17.94 | \n", "0.0 | \n", "Jan-1996 | \n", "0.0 | \n", "15.0 | \n", "0.0 | \n", "27783.0 | \n", "53.9% | \n", "38.0 | \n", "f | \n", "461.73 | \n", "461.73 | \n", "3581.120000 | \n", "3581.12 | \n", "2538.27 | \n", "1042.85 | \n", "0.00 | \n", "0.00 | \n", "0.00 | \n", "Jun-2016 | \n", "67.79 | \n", "Jun-2016 | \n", "0.0 | \n", "1.0 | \n", "INDIVIDUAL | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "