{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Prep\n", "\n", "For this vignette, we'll be using acquisition loan data from [Fannie Mae's public datasets](https://loanperformancedata.fanniemae.com/lppub/index.html#Portfolio). There is sadly not yet a Python script to facilitate the reading in of all the data—which are stored as pipe-delimited .txt files, released quarterly on ~1 year lag—so I had to infer some of my data cleaning steps from the [R code that Fannie made available](https://loanperformancedata.fanniemae.com/lppub-docs/FNMA_SF_Loan_Performance_r_Primary.zip) to download and clean the data.\n", "\n", "If you are not immediately familiar with these data (or mortgage finance-related data in general), I highly recommend referencing the 1st table—\"Acquisition File Layout\"—in [this dictionnary file here](https://loanperformancedata.fanniemae.com/lppub-docs/FNMA_SF_Loan_Performance_File_layout.pdf), along with the [corresponding glossary here](https://loanperformancedata.fanniemae.com/lppub-docs/FNMA_SF_Loan_Performance_Glossary.pdf), which provides an explanation of the meaning of the values for each variable.\n", "\n", "For our purposes here, I will only be using acquisition data from the 1st and 4th quarter of 2017; thus, I will need to read in 2 .txt files. The code below is generalizable up to however many text files you want to use; you need only change the path_to_data variable to point at whatever path stores your .txt files.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# basic packages\n", "import numpy as np\n", "import pandas as pd\n", "import datetime\n", "\n", "# for data importing\n", "import os\n", "import csv\n", "\n", "# for data cleaning\n", "from janitor import clean_names, remove_empty" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'This notebook was last executed on 2019-08-05 22:09'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# store the datetime of the most recent running of this notebook as a form of a log\n", "most_recent_run_datetime = datetime.datetime.now().strftime(\"%Y-%m-%d %H:%M\")\n", "f\"This notebook was last executed on {most_recent_run_datetime}\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading in the Data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# location of my .txt data files\n", "path_to_data = '../data/'\n", "\n", "# instantiate an empty list to store the file names to read in\n", "filelist = []\n", "\n", "# loop over each filename in the specified directory\n", "for filename in os.listdir(path_to_data):\n", " # check if the file path corresponds to a file\n", " # ensure the file in question is a .txt file\n", " # ensure we haven't already added that file to the list\n", " if os.path.isfile(path_to_data + filename) \\\n", " and filename.endswith(\".txt\") \\\n", " and filename not in filelist:\n", " # append the file to our list of files\n", " filelist.append(path_to_data+filename)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['../data/Acquisition_2017Q4.txt', '../data/Acquisition_2017Q1.txt']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# take a look at the files we'll be reading in\n", "filelist" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# instantiate an empty df that we'll use to store all our data\n", "acq_df = pd.DataFrame()\n", "\n", "# list of variable names for all the fields in the .txt files (adapted from the aforementioned docs)\n", "acq_var_names = ['LOAN_ID', 'ORIG_CHN', 'Seller.Name', 'ORIG_RT', 'ORIG_AMT', 'ORIG_TRM', 'ORIG_DTE','FRST_DTE',\n", " 'OLTV', 'OCLTV', 'NUM_BO', 'DTI', 'CSCORE_B', 'FTHB_FLG', 'PURPOSE', 'PROP_TYP', 'NUM_UNIT', \n", " 'OCC_STAT', 'STATE', 'ZIP_3', 'MI_PCT', 'Product.Type', 'CSCORE_C', 'MI_TYPE', 'RELOCATION_FLG']\n", "\n", "# loop over the .txt files, read them in, and append them to make our master acquisitions df\n", "for f in filelist:\n", " # specify that our delimiter is a pipe, ignore the header, and use pre-specified variable names\n", " temp_df = pd.read_csv(filepath_or_buffer = f, sep=\"|\", header=None, names=acq_var_names)\n", " # ensure that concatenation is row-wise and ignore the index values as they don't convey meaning here\n", " acq_df = pd.concat(objs=[acq_df, temp_df], axis=0, ignore_index=True)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | LOAN_ID | \n", "ORIG_CHN | \n", "Seller.Name | \n", "ORIG_RT | \n", "ORIG_AMT | \n", "ORIG_TRM | \n", "ORIG_DTE | \n", "FRST_DTE | \n", "OLTV | \n", "OCLTV | \n", "... | \n", "PROP_TYP | \n", "NUM_UNIT | \n", "OCC_STAT | \n", "STATE | \n", "ZIP_3 | \n", "MI_PCT | \n", "Product.Type | \n", "CSCORE_C | \n", "MI_TYPE | \n", "RELOCATION_FLG | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "100004724719 | \n", "R | \n", "GUILD MORTGAGE COMPANY | \n", "4.375 | \n", "228000 | \n", "360 | \n", "09/2017 | \n", "11/2017 | \n", "95 | \n", "95 | \n", "... | \n", "SF | \n", "1 | \n", "P | \n", "HI | \n", "967 | \n", "30.0 | \n", "FRM | \n", "NaN | \n", "1.0 | \n", "N | \n", "
1 | \n", "100005682269 | \n", "R | \n", "WELLS FARGO BANK, N.A. | \n", "3.375 | \n", "75000 | \n", "120 | \n", "10/2017 | \n", "12/2017 | \n", "79 | \n", "79 | \n", "... | \n", "SF | \n", "1 | \n", "P | \n", "MS | \n", "394 | \n", "NaN | \n", "FRM | \n", "765.0 | \n", "NaN | \n", "N | \n", "
2 | \n", "100005754525 | \n", "C | \n", "WELLS FARGO BANK, N.A. | \n", "4.750 | \n", "79000 | \n", "360 | \n", "11/2017 | \n", "01/2018 | \n", "80 | \n", "80 | \n", "... | \n", "SF | \n", "1 | \n", "I | \n", "MO | \n", "653 | \n", "NaN | \n", "FRM | \n", "799.0 | \n", "NaN | \n", "N | \n", "
3 | \n", "100006973104 | \n", "C | \n", "SUNTRUST BANK | \n", "4.250 | \n", "295000 | \n", "360 | \n", "11/2017 | \n", "01/2018 | \n", "91 | \n", "91 | \n", "... | \n", "SF | \n", "1 | \n", "P | \n", "OH | \n", "452 | \n", "30.0 | \n", "FRM | \n", "789.0 | \n", "1.0 | \n", "N | \n", "
4 | \n", "100007420970 | \n", "R | \n", "FREEDOM MORTGAGE CORP. | \n", "4.500 | \n", "178000 | \n", "360 | \n", "09/2017 | \n", "11/2017 | \n", "63 | \n", "63 | \n", "... | \n", "SF | \n", "1 | \n", "P | \n", "CA | \n", "924 | \n", "NaN | \n", "FRM | \n", "NaN | \n", "NaN | \n", "N | \n", "
5 rows × 25 columns
\n", "\n", " | loan_id | \n", "orig_chn | \n", "seller_name | \n", "orig_rt | \n", "orig_amt | \n", "orig_trm | \n", "orig_dte | \n", "frst_dte | \n", "oltv | \n", "ocltv | \n", "... | \n", "occ_stat | \n", "state | \n", "zip_3 | \n", "mi_pct | \n", "product_type | \n", "cscore_c | \n", "mi_type | \n", "relocation_flg | \n", "cscore_min | \n", "orig_val | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "100004724719 | \n", "R | \n", "GUILD MORTGAGE COMPANY | \n", "4.375 | \n", "228000 | \n", "360 | \n", "09/2017 | \n", "11/2017 | \n", "95 | \n", "95 | \n", "... | \n", "P | \n", "HI | \n", "967 | \n", "30.0 | \n", "FRM | \n", "NaN | \n", "1.0 | \n", "N | \n", "693.0 | \n", "240000.000000 | \n", "
1 | \n", "100005682269 | \n", "R | \n", "WELLS FARGO BANK, N.A. | \n", "3.375 | \n", "75000 | \n", "120 | \n", "10/2017 | \n", "12/2017 | \n", "79 | \n", "79 | \n", "... | \n", "P | \n", "MS | \n", "394 | \n", "NaN | \n", "FRM | \n", "765.0 | \n", "NaN | \n", "N | \n", "737.0 | \n", "94936.708861 | \n", "
2 | \n", "100005754525 | \n", "C | \n", "WELLS FARGO BANK, N.A. | \n", "4.750 | \n", "79000 | \n", "360 | \n", "11/2017 | \n", "01/2018 | \n", "80 | \n", "80 | \n", "... | \n", "I | \n", "MO | \n", "653 | \n", "NaN | \n", "FRM | \n", "799.0 | \n", "NaN | \n", "N | \n", "799.0 | \n", "98750.000000 | \n", "
3 | \n", "100006973104 | \n", "C | \n", "SUNTRUST BANK | \n", "4.250 | \n", "295000 | \n", "360 | \n", "11/2017 | \n", "01/2018 | \n", "91 | \n", "91 | \n", "... | \n", "P | \n", "OH | \n", "452 | \n", "30.0 | \n", "FRM | \n", "789.0 | \n", "1.0 | \n", "N | \n", "760.0 | \n", "324175.824176 | \n", "
4 | \n", "100007420970 | \n", "R | \n", "FREEDOM MORTGAGE CORP. | \n", "4.500 | \n", "178000 | \n", "360 | \n", "09/2017 | \n", "11/2017 | \n", "63 | \n", "63 | \n", "... | \n", "P | \n", "CA | \n", "924 | \n", "NaN | \n", "FRM | \n", "NaN | \n", "NaN | \n", "N | \n", "675.0 | \n", "282539.682540 | \n", "
5 rows × 27 columns
\n", "