{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Prep\n", "\n", "For this vignette, we'll be using acquisition loan data from [Fannie Mae's public datasets](https://loanperformancedata.fanniemae.com/lppub/index.html#Portfolio). There is sadly not yet a Python script to facilitate the reading in of all the data—which are stored as pipe-delimited .txt files, released quarterly on ~1 year lag—so I had to infer some of my data cleaning steps from the [R code that Fannie made available](https://loanperformancedata.fanniemae.com/lppub-docs/FNMA_SF_Loan_Performance_r_Primary.zip) to download and clean the data.\n", "\n", "If you are not immediately familiar with these data (or mortgage finance-related data in general), I highly recommend referencing the 1st table—\"Acquisition File Layout\"—in [this dictionnary file here](https://loanperformancedata.fanniemae.com/lppub-docs/FNMA_SF_Loan_Performance_File_layout.pdf), along with the [corresponding glossary here](https://loanperformancedata.fanniemae.com/lppub-docs/FNMA_SF_Loan_Performance_Glossary.pdf), which provides an explanation of the meaning of the values for each variable.\n", "\n", "For our purposes here, I will only be using acquisition data from the 1st and 4th quarter of 2017; thus, I will need to read in 2 .txt files. 
The code below generalizes to any number of text files; you need only point the path_to_data variable at the directory that stores your .txt files.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# basic packages\n", "import numpy as np\n", "import pandas as pd\n", "import datetime\n", "\n", "# for data importing\n", "import os\n", "import csv\n", "\n", "# for data cleaning\n", "from janitor import clean_names, remove_empty" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'This notebook was last executed on 2019-08-05 22:09'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# store the datetime of the most recent run of this notebook as a simple log\n", "most_recent_run_datetime = datetime.datetime.now().strftime(\"%Y-%m-%d %H:%M\")\n", "f\"This notebook was last executed on {most_recent_run_datetime}\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading in the Data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# location of my .txt data files\n", "path_to_data = '../data/'\n", "\n", "# instantiate an empty list to store the file paths to read in\n", "filelist = []\n", "\n", "# loop over each filename in the specified directory\n", "for filename in os.listdir(path_to_data):\n", " # check that the path corresponds to a file,\n", " # that the file in question is a .txt file,\n", " # and that we haven't already added that file to the list\n", " if os.path.isfile(path_to_data + filename) \\\n", " and filename.endswith(\".txt\") \\\n", " and path_to_data + filename not in filelist:\n", " # append the file path to our list of files\n", " filelist.append(path_to_data + filename)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { 
"text/plain": [ "['../data/Acquisition_2017Q4.txt', '../data/Acquisition_2017Q1.txt']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# take a look at the files we'll be reading in\n", "filelist" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# instantiate an empty df that we'll use to store all our data\n", "acq_df = pd.DataFrame()\n", "\n", "# list of variable names for all the fields in the .txt files (adapted from the aforementioned docs)\n", "acq_var_names = ['LOAN_ID', 'ORIG_CHN', 'Seller.Name', 'ORIG_RT', 'ORIG_AMT', 'ORIG_TRM', 'ORIG_DTE','FRST_DTE',\n", " 'OLTV', 'OCLTV', 'NUM_BO', 'DTI', 'CSCORE_B', 'FTHB_FLG', 'PURPOSE', 'PROP_TYP', 'NUM_UNIT', \n", " 'OCC_STAT', 'STATE', 'ZIP_3', 'MI_PCT', 'Product.Type', 'CSCORE_C', 'MI_TYPE', 'RELOCATION_FLG']\n", "\n", "# loop over the .txt files, read them in, and append them to make our master acquisitions df\n", "for f in filelist:\n", " # specify that our delimiter is a pipe, ignore the header, and use pre-specified variable names\n", " temp_df = pd.read_csv(filepath_or_buffer = f, sep=\"|\", header=None, names=acq_var_names)\n", " # ensure that concatenation is row-wise and ignore the index values as they don't convey meaning here\n", " acq_df = pd.concat(objs=[acq_df, temp_df], axis=0, ignore_index=True)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
LOAN_IDORIG_CHNSeller.NameORIG_RTORIG_AMTORIG_TRMORIG_DTEFRST_DTEOLTVOCLTV...PROP_TYPNUM_UNITOCC_STATSTATEZIP_3MI_PCTProduct.TypeCSCORE_CMI_TYPERELOCATION_FLG
0100004724719RGUILD MORTGAGE COMPANY4.37522800036009/201711/20179595...SF1PHI96730.0FRMNaN1.0N
1100005682269RWELLS FARGO BANK, N.A.3.3757500012010/201712/20177979...SF1PMS394NaNFRM765.0NaNN
2100005754525CWELLS FARGO BANK, N.A.4.7507900036011/201701/20188080...SF1IMO653NaNFRM799.0NaNN
3100006973104CSUNTRUST BANK4.25029500036011/201701/20189191...SF1POH45230.0FRM789.01.0N
4100007420970RFREEDOM MORTGAGE CORP.4.50017800036009/201711/20176363...SF1PCA924NaNFRMNaNNaNN
\n", "

5 rows × 25 columns

\n", "
" ], "text/plain": [ " LOAN_ID ORIG_CHN Seller.Name ORIG_RT ORIG_AMT ORIG_TRM \\\n", "0 100004724719 R GUILD MORTGAGE COMPANY 4.375 228000 360 \n", "1 100005682269 R WELLS FARGO BANK, N.A. 3.375 75000 120 \n", "2 100005754525 C WELLS FARGO BANK, N.A. 4.750 79000 360 \n", "3 100006973104 C SUNTRUST BANK 4.250 295000 360 \n", "4 100007420970 R FREEDOM MORTGAGE CORP. 4.500 178000 360 \n", "\n", " ORIG_DTE FRST_DTE OLTV OCLTV ... PROP_TYP NUM_UNIT OCC_STAT STATE \\\n", "0 09/2017 11/2017 95 95 ... SF 1 P HI \n", "1 10/2017 12/2017 79 79 ... SF 1 P MS \n", "2 11/2017 01/2018 80 80 ... SF 1 I MO \n", "3 11/2017 01/2018 91 91 ... SF 1 P OH \n", "4 09/2017 11/2017 63 63 ... SF 1 P CA \n", "\n", " ZIP_3 MI_PCT Product.Type CSCORE_C MI_TYPE RELOCATION_FLG \n", "0 967 30.0 FRM NaN 1.0 N \n", "1 394 NaN FRM 765.0 NaN N \n", "2 653 NaN FRM 799.0 NaN N \n", "3 452 30.0 FRM 789.0 1.0 N \n", "4 924 NaN FRM NaN NaN N \n", "\n", "[5 rows x 25 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# taking a look at the data structure we have so far\n", "acq_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleaning the Data\n", "\n", "For the most part, our data here are pretty clean. In the section below, we'll just make a few convenience changes, execute a couple checks, and create a few new variables." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# use pyjanitor package to take care of basic data cleaning\n", "acq_df = (\n", " acq_df\n", " # clean the column names, remove any leading/trailing underscores\n", " .clean_names(strip_underscores=True)\n", " # remove any rows that are entirely NA\n", " .remove_empty()\n", ")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# create a few new fields as recommended by the aformentioned docs\n", "\n", "# find minimum credit score of borrower and co-borrower\n", "acq_df['cscore_min'] = (\n", " acq_df[['cscore_b','cscore_c']].min(axis=1)\n", ")\n", "\n", "# find origination value = origination amount / origination loan-to-value ratio\n", "acq_df['orig_val'] = (\n", " acq_df['orig_amt'] / (acq_df['oltv']/100)\n", ")\n", "\n", "# check if the ocltv is null; if it is, set it to the oltv\n", "acq_df['ocltv'] = (\n", " np.where(acq_df['ocltv'].isnull(), acq_df['oltv'], acq_df['ocltv'])\n", ")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
loan_idorig_chnseller_nameorig_rtorig_amtorig_trmorig_dtefrst_dteoltvocltv...occ_statstatezip_3mi_pctproduct_typecscore_cmi_typerelocation_flgcscore_minorig_val
0100004724719RGUILD MORTGAGE COMPANY4.37522800036009/201711/20179595...PHI96730.0FRMNaN1.0N693.0240000.000000
1100005682269RWELLS FARGO BANK, N.A.3.3757500012010/201712/20177979...PMS394NaNFRM765.0NaNN737.094936.708861
2100005754525CWELLS FARGO BANK, N.A.4.7507900036011/201701/20188080...IMO653NaNFRM799.0NaNN799.098750.000000
3100006973104CSUNTRUST BANK4.25029500036011/201701/20189191...POH45230.0FRM789.01.0N760.0324175.824176
4100007420970RFREEDOM MORTGAGE CORP.4.50017800036009/201711/20176363...PCA924NaNFRMNaNNaNN675.0282539.682540
\n", "

5 rows × 27 columns

\n", "
" ], "text/plain": [ " loan_id orig_chn seller_name orig_rt orig_amt orig_trm \\\n", "0 100004724719 R GUILD MORTGAGE COMPANY 4.375 228000 360 \n", "1 100005682269 R WELLS FARGO BANK, N.A. 3.375 75000 120 \n", "2 100005754525 C WELLS FARGO BANK, N.A. 4.750 79000 360 \n", "3 100006973104 C SUNTRUST BANK 4.250 295000 360 \n", "4 100007420970 R FREEDOM MORTGAGE CORP. 4.500 178000 360 \n", "\n", " orig_dte frst_dte oltv ocltv ... occ_stat state zip_3 mi_pct \\\n", "0 09/2017 11/2017 95 95 ... P HI 967 30.0 \n", "1 10/2017 12/2017 79 79 ... P MS 394 NaN \n", "2 11/2017 01/2018 80 80 ... I MO 653 NaN \n", "3 11/2017 01/2018 91 91 ... P OH 452 30.0 \n", "4 09/2017 11/2017 63 63 ... P CA 924 NaN \n", "\n", " product_type cscore_c mi_type relocation_flg cscore_min orig_val \n", "0 FRM NaN 1.0 N 693.0 240000.000000 \n", "1 FRM 765.0 NaN N 737.0 94936.708861 \n", "2 FRM 799.0 NaN N 799.0 98750.000000 \n", "3 FRM 789.0 1.0 N 760.0 324175.824176 \n", "4 FRM NaN NaN N 675.0 282539.682540 \n", "\n", "[5 rows x 27 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# inspect our final cleaned data\n", "acq_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Final Data Trimming\n", "\n", "For the sake of size / efficiency, I am going to create a dataset composed of only loans originated in two months: January 2017 and December 2017. 
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'The final dataset filtered to just Jan2017 and Dec2017 originations has 158168 rows and 27 columns.'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# filter original acq_df to just loans with origination dates in jan or dec 2017\n", "jan_and_dec_17_acqs = acq_df.loc[\n", " (acq_df['orig_dte'] == '01/2017') | (acq_df['orig_dte'] == '12/2017')\n", "]\n", "\n", "# inspect the features of the resulting dataset\n", "row_count, column_count = jan_and_dec_17_acqs.shape\n", "f\"The final dataset filtered to just Jan2017 and Dec2017 originations has {row_count} rows and {column_count} columns.\"" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# lastly, we'll save out this dataset for use elsewhere\n", "jan_and_dec_17_acqs.to_csv(path_or_buf='../data/jan_and_dec_17_acqs.csv', index=False)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 4 }