{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Project: Artificial Financial Advisor on Peer-to-Peer lending -- Part 2\n",
    "\n",
    "--by Lu Tang\n",
    "\n",
    "## Part 2 Choosing the best model for investors\n",
    "\n",
    "### Table of Contents\n",
    "<ul>\n",
    "<li><a href=\"#review\">1. Review Part 1 and clarify our goal in modelling</a></li>\n",
    "<li><a href=\"#metrics\">2. Defining Error Metrics</a></li>\n",
    "<li><a href=\"#preprocessing\">3. Preprocessing the data</a></li>\n",
    "<li><a href=\"#model\">4. Fit the data in different models and analyze the result</a></li>  \n",
    "<li><a href=\"#summary\">5. Summary</a></li>    \n",
    "</ul> "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Import general library and make settings**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Figure size 432x288 with 0 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# import library\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "import seaborn as sns\n",
    "\n",
    "from sklearn.metrics import confusion_matrix\n",
    "from sklearn.metrics import classification_report\n",
    "from sklearn.metrics import precision_recall_curve\n",
    "\n",
    "# some other library are imported along with the code\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "sns.despine(left=True, bottom=True)\n",
    "\n",
    "# to avoid warnings\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "# Pandas options\n",
    "pd.set_option('display.max_colwidth', 1000, 'display.max_rows', None, 'display.max_columns', None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Read the data we have cleaned in Part 1 *loans_2007_2019_cleaned.csv*, this is our target data split as train data for features column**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>loan_amnt</th>\n",
       "      <th>term</th>\n",
       "      <th>int_rate</th>\n",
       "      <th>installment</th>\n",
       "      <th>grade</th>\n",
       "      <th>emp_length</th>\n",
       "      <th>home_ownership</th>\n",
       "      <th>annual_inc</th>\n",
       "      <th>verification_status</th>\n",
       "      <th>purpose</th>\n",
       "      <th>addr_state</th>\n",
       "      <th>dti</th>\n",
       "      <th>delinq_2yrs</th>\n",
       "      <th>inq_last_6mths</th>\n",
       "      <th>open_acc</th>\n",
       "      <th>pub_rec</th>\n",
       "      <th>revol_bal</th>\n",
       "      <th>revol_util</th>\n",
       "      <th>total_acc</th>\n",
       "      <th>initial_list_status</th>\n",
       "      <th>tot_cur_bal</th>\n",
       "      <th>total_rev_hi_lim</th>\n",
       "      <th>acc_open_past_24mths</th>\n",
       "      <th>avg_cur_bal</th>\n",
       "      <th>bc_open_to_buy</th>\n",
       "      <th>bc_util</th>\n",
       "      <th>mo_sin_old_il_acct</th>\n",
       "      <th>mo_sin_old_rev_tl_op</th>\n",
       "      <th>mo_sin_rcnt_rev_tl_op</th>\n",
       "      <th>mo_sin_rcnt_tl</th>\n",
       "      <th>mort_acc</th>\n",
       "      <th>mths_since_recent_bc</th>\n",
       "      <th>mths_since_recent_inq</th>\n",
       "      <th>num_accts_ever_120_pd</th>\n",
       "      <th>num_actv_bc_tl</th>\n",
       "      <th>num_actv_rev_tl</th>\n",
       "      <th>num_bc_sats</th>\n",
       "      <th>num_bc_tl</th>\n",
       "      <th>num_il_tl</th>\n",
       "      <th>num_op_rev_tl</th>\n",
       "      <th>num_rev_accts</th>\n",
       "      <th>num_rev_tl_bal_gt_0</th>\n",
       "      <th>num_sats</th>\n",
       "      <th>num_tl_op_past_12m</th>\n",
       "      <th>pct_tl_nvr_dlq</th>\n",
       "      <th>percent_bc_gt_75</th>\n",
       "      <th>pub_rec_bankruptcies</th>\n",
       "      <th>tot_hi_cred_lim</th>\n",
       "      <th>total_bal_ex_mort</th>\n",
       "      <th>total_bc_limit</th>\n",
       "      <th>total_il_high_credit_limit</th>\n",
       "      <th>loan_status</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>16000.0</td>\n",
       "      <td>36 months</td>\n",
       "      <td>11.14</td>\n",
       "      <td>524.89</td>\n",
       "      <td>B</td>\n",
       "      <td>8.0</td>\n",
       "      <td>RENT</td>\n",
       "      <td>65000.0</td>\n",
       "      <td>Not Verified</td>\n",
       "      <td>credit_card</td>\n",
       "      <td>VA</td>\n",
       "      <td>15.76</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>9.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>13484.0</td>\n",
       "      <td>37.6</td>\n",
       "      <td>14.0</td>\n",
       "      <td>f</td>\n",
       "      <td>39639.0</td>\n",
       "      <td>35900.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>4404.0</td>\n",
       "      <td>20256.0</td>\n",
       "      <td>38.1</td>\n",
       "      <td>130.0</td>\n",
       "      <td>155.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>9.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>100.0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>63549.0</td>\n",
       "      <td>39639.0</td>\n",
       "      <td>32700.0</td>\n",
       "      <td>25728.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   loan_amnt        term  int_rate  installment grade  emp_length  \\\n",
       "0    16000.0   36 months     11.14       524.89     B         8.0   \n",
       "\n",
       "  home_ownership  annual_inc verification_status      purpose addr_state  \\\n",
       "0           RENT     65000.0        Not Verified  credit_card         VA   \n",
       "\n",
       "     dti  delinq_2yrs  inq_last_6mths  open_acc  pub_rec  revol_bal  \\\n",
       "0  15.76          0.0             1.0       9.0      0.0    13484.0   \n",
       "\n",
       "   revol_util  total_acc initial_list_status  tot_cur_bal  total_rev_hi_lim  \\\n",
       "0        37.6       14.0                   f      39639.0           35900.0   \n",
       "\n",
       "   acc_open_past_24mths  avg_cur_bal  bc_open_to_buy  bc_util  \\\n",
       "0                   3.0       4404.0         20256.0     38.1   \n",
       "\n",
       "   mo_sin_old_il_acct  mo_sin_old_rev_tl_op  mo_sin_rcnt_rev_tl_op  \\\n",
       "0               130.0                 155.0                    7.0   \n",
       "\n",
       "   mo_sin_rcnt_tl  mort_acc  mths_since_recent_bc  mths_since_recent_inq  \\\n",
       "0             3.0       0.0                   7.0                    3.0   \n",
       "\n",
       "   num_accts_ever_120_pd  num_actv_bc_tl  num_actv_rev_tl  num_bc_sats  \\\n",
       "0                    0.0             3.0              4.0          5.0   \n",
       "\n",
       "   num_bc_tl  num_il_tl  num_op_rev_tl  num_rev_accts  num_rev_tl_bal_gt_0  \\\n",
       "0        6.0        5.0            6.0            8.0                  4.0   \n",
       "\n",
       "   num_sats  num_tl_op_past_12m  pct_tl_nvr_dlq  percent_bc_gt_75  \\\n",
       "0       9.0                 2.0           100.0              20.0   \n",
       "\n",
       "   pub_rec_bankruptcies  tot_hi_cred_lim  total_bal_ex_mort  total_bc_limit  \\\n",
       "0                   0.0          63549.0            39639.0         32700.0   \n",
       "\n",
       "   total_il_high_credit_limit  loan_status  \n",
       "0                     25728.0            1  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The total number of rows and columns in our cleaned 2007 to 2019 loan data is:  (1023834, 52)\n"
     ]
    }
   ],
   "source": [
    "# load the data we saved from the first notebook part 1 \n",
    "loans_2007_2019 = pd.read_csv('loans_2007_2019_cleaned.csv')\n",
    "\n",
    "# view 1th rows\n",
    "display(loans_2007_2019.head(1))\n",
    "\n",
    "print('The total number of rows and columns in our cleaned 2007 to 2019 loan data is: ',loans_2007_2019.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='#review'></a>\n",
    "## 1. Review Part 1 and clarify our goal in modelling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From part 1, we found the Charged Off rate for all the loans in average is 19.57%. This means **if an investor randomly choose loans to invest, he/she could have about 19% of Charged-Off rate**. \n",
    "\n",
    "I digged deeper and found **if a super conservative investor randomly choose ONLY A loans to invest, he/she could have about 5.9% of Charged-Off rate**.\n",
    "\n",
    "The following table is the one I saved from Part 1 after analyzing the loan grade and returns. Let's do further analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Loan Grade</th>\n",
       "      <th>Average Interest Rate</th>\n",
       "      <th>Unnamed: 2</th>\n",
       "      <th>Averafe Charged-Off Rate</th>\n",
       "      <th>loan_grade</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A</td>\n",
       "      <td>7.10%</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5.95%</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>B</td>\n",
       "      <td>10.67%</td>\n",
       "      <td>NaN</td>\n",
       "      <td>13.28%</td>\n",
       "      <td>B</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>C</td>\n",
       "      <td>14.04%</td>\n",
       "      <td>NaN</td>\n",
       "      <td>22.14%</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>D</td>\n",
       "      <td>17.81%</td>\n",
       "      <td>NaN</td>\n",
       "      <td>29.85%</td>\n",
       "      <td>D</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>E</td>\n",
       "      <td>21.27%</td>\n",
       "      <td>NaN</td>\n",
       "      <td>37.87%</td>\n",
       "      <td>E</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>F</td>\n",
       "      <td>25.03%</td>\n",
       "      <td>NaN</td>\n",
       "      <td>44.78%</td>\n",
       "      <td>F</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>G</td>\n",
       "      <td>27.82%</td>\n",
       "      <td>NaN</td>\n",
       "      <td>49.48%</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Loan Grade Average Interest Rate  Unnamed: 2 Averafe Charged-Off Rate  \\\n",
       "0          A                 7.10%         NaN                    5.95%   \n",
       "1          B                10.67%         NaN                   13.28%   \n",
       "2          C                14.04%         NaN                   22.14%   \n",
       "3          D                17.81%         NaN                   29.85%   \n",
       "4          E                21.27%         NaN                   37.87%   \n",
       "5          F                25.03%         NaN                   44.78%   \n",
       "6          G                27.82%         NaN                   49.48%   \n",
       "\n",
       "  loan_grade  \n",
       "0          A  \n",
       "1          B  \n",
       "2          C  \n",
       "3          D  \n",
       "4          E  \n",
       "5          F  \n",
       "6          G  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# read investment data made from part 1.\n",
    "investment = pd.read_csv('investment.csv')\n",
    "# add loan grade column\n",
    "investment['loan_grade']=['A','B','C','D','E','F','G'] # since I saved the data as index=False, so i need to add it back.\n",
    "# view the table\n",
    "investment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: From part 1, we have found the proportion of the loans between different grade are not the same, and I also mentioned that 'Charged-Off' rate does NOT mean the when the loans are charged off, investor will loss all the principal and interest, because loan payment schedul is a mortgage-like arrangement and a loan maybe charged off during the loan term. This can explain why the charged-off rate is even higher than interest rate except for A loans.\n",
    "\n",
    "From first row, we can see if an investor put 25 dollar per loan and invest in 1000 loan, and ONLY choose A loans to invest, the average expected return is **7.1%**, but if he/she lose all the money for the loans that are charged-off, the net return is: 0.071004-0.059498 = **1.5%**. We can also calculate as following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.011505999999999997"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# calculate the worst case for expected return \n",
    "((0.071004-0.059498)*25*1000)/(25*1000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Our goal in buidling machine learning model**\n",
    "\n",
    "1. Our model should perform better than if an investor randomly pick up all the loans to invest, which is **to reduce the risk at least lower than 19.57%**.  \n",
    "\n",
    "2. Our model should perform better than if a conservative investor randomly choose only A loans to invest. **Our model that have expected return higher than 7.1%**, and **for the worst case, still have expected return higher than 1.15%**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='metrics'></a>\n",
    "## 2. Define error metric"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before diving in and selecting an algorithm to apply to the data, we should select an error metric, and it is not always an accuracy score or AUC-ROC, it depends on the problems we are trying to solve with our model.\n",
    "\n",
    "We're using a machine learning model to predict whether or not we should fund a loan on the Lending Club platform. Our objective in this is to make money -- we want to fund enough loans that are paid off on time to offset our losses from loans that aren't paid off. An error metric will help us determine if our algorithm will make us money or lose us money.\n",
    "\n",
    "In this case, we're primarily concerned with **false positives** and **false negatives**. Since we're viewing this problem from the standpoint of a **conservative investor**, we need to treat false positives differently than false negatives. A conservative investor would want to **minimize risk**, and **avoid false positives as much as possible**. They'd be more okay with funding a safer loan (false positives) than missing out on opportunities (false negatives).\n",
    "\n",
    "From part 1, we can see there is a **significant class imbalance** in the loan_status column. This causes a major issue when we use accuracy as a metric. This is because due to the class imbalance, a classifier can predict 1 for every row, and still have high accuracy. Read more on [beyond-accuracy-precision-and-recall](https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c)\n",
    "\n",
    "This means that we should **optimize** for:\n",
    "- **Low fpr (false positive rate)** \n",
    "\n",
    "fpr = fp / (fp + tn), i.e. \"among the loans that ACTUALLLY 'Charged-Off', what percentage the model is wrong.\" \n",
    "\n",
    "- **High Precision (similar with Low fpr)**\n",
    "\n",
    "precision = tp/(tp+fp), i.e.\"among the loans that the model PREDICTED as 'will Fully-Paid', what percentage the model is right.\n",
    "\n",
    "- **High tpr (true positive rate/Recall/Sensitivity) (this comes with a cost when we optimize the fpr and Precision)**\n",
    "\n",
    "tpr = tp / (tp + fn), i.e.\"among the loans that ACTUALLLY 'Fully-Paid', what percentage of loans that should be funded that I would fund\". \n",
    "\n",
    "Generally, if we want to reduce false positive rate, true positive rate will go down. This is because if we want to reduce the risk of false positives, we wouldn't think about funding riskier loans in the first place. In other words, our model will be very 'strict' on picking up the loans that it thinks will much more likely to pay off; it basically discard a lot of bad and good loans, and select only those very good ones."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I wrote my own **model metric function** to print out only the metrics we need. (The classification report and other popular metric library is not useful here, and only makes readers more confusing). \n",
    "\n",
    "I also wrote a function for **precision-recall curve**. I choose precision-recall curve to visualize the trade-off between the precision and recall **because this is imbalanced dataset and one class is more important than another one**. Although ROC-AUC is popular, it is not useful to solve our problem. (Read more [roc-and-precision-recall-with-imbalanced-datasets](https://classeval.wordpress.com/simulation-analysis/roc-and-precision-recall-with-imbalanced-datasets/). Also [here](https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/) And [Interpreting ROC Curves, Precision-Recall Curves, and AUCs](https://www.datascienceblog.net/post/machine-learning/interpreting-roc-curves-auc/)). Also for f1_score, I did not use F1_score, because if I have a classifier which F1 score is low, we can’t tell whether it has problems with false positives or false negatives. In this case, the best way to “debug” such a classifier is to use confusion matrix to diagnose the problem and then look at the problematic cases in the validation or test dataset. (Read more [here](https://www.mikulskibartosz.name/f1-score-explained/) and read [here](https://stats.stackexchange.com/questions/49226/how-to-interpret-f-measure-values)). (At the first I actually calculate F1_score and AUC for ROC, it's high but it does not give more informative insights.)\n",
    "\n",
    "Lastly, I wrote my own **investment portfolio return function** , i.e. if an investor invested on the loans selected selected  by my model, how much investment return (%) he/she can expect. If we only choose model based on those metric score or a confusion metrix, it is still very confusing, because for our purpose, it does not mean a model with lowesr false positive rate is the best model. With this function, we can choose a model that perform better than if the investor randomly pick up loans to invest without my model. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**1. Define my metric function**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def my_metric(model, X_test, y_test):\n",
    "    \n",
    "    # getting predicted values\n",
    "    y_predict = model.predict(X_test)\n",
    "    \n",
    "    # False positives\n",
    "    fp_filter = (y_predict == 1) & (y_test == 0)\n",
    "    fp = len(y_predict[fp_filter])\n",
    "    \n",
    "    # True positives.\n",
    "    tp_filter = (y_predict == 1) & (y_test == 1)\n",
    "    tp = len(y_predict[tp_filter])\n",
    "    \n",
    "    # False negatives.\n",
    "    fn_filter = (y_predict == 0) & (y_test == 1)\n",
    "    fn = len(y_predict[fn_filter])\n",
    "    \n",
    "    # True negatives\n",
    "    tn_filter = (y_predict == 0) & (y_test == 0)\n",
    "    tn = len(y_predict[tn_filter])\n",
    "    \n",
    "    # Rates\n",
    "    tpr = tp  / (tp + fn)\n",
    "    fpr = fp  / (fp + tn)\n",
    "    precision = tp/(tp+fp)\n",
    "    \n",
    "    print('Confusion matrix: \\n', confusion_matrix(y_test, y_predict))\n",
    "    # print(classification_report(y_test,y_predict)) # we don't need this\n",
    "    print('The True Positive Rate (Recall/Sensitivity)is tpr = tp/(tp+fn): ', tpr)\n",
    "    print('The False Positive Rate (1-specification)is fpr = fp/(fp+tn): ', fpr)\n",
    "    print('The Precision (tp/(tp+fp)) is: ', precision)\n",
    "    print('\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**2. Define precision-recall curves**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def precision_recall_curves(model, X_test, y_test):\n",
    "    \n",
    "    # plot Precision Recall Curve\n",
    "    plt.figure(figsize=(8,6))\n",
    "    \n",
    "    # predict probabilities\n",
    "    probs = model.predict_proba(X_test)\n",
    "    # keep probabilities for the positive outcome only\n",
    "    probs = probs[:, 1]\n",
    "    \n",
    "    # get predicted class value\n",
    "    y_predict = model.predict(X_test)\n",
    "    \n",
    "    # calculate precision-recall curve\n",
    "    precision, recall, thresholds = precision_recall_curve(y_test, probs)\n",
    "    \n",
    "    # plot no skill\n",
    "    # I set 0.8 because our imbalanced data is roughly 80% as majority class\n",
    "    plt.plot([0, 1], [0.8, 0.8], linestyle='--') \n",
    "    # plot the precision-recall curve for the model\n",
    "    plt.plot(recall, precision, color='darkorange', marker='.')\n",
    "    plt.xlabel('Recall (sensitivity, or True Positive Rate)')\n",
    "    plt.ylabel('Precision')\n",
    "    plt.title('Precision Recall Curve')\n",
    "    \n",
    "    return plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**3. Define investment return with my model**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def investment_return_with_my_model(model, X_test, y_test):\n",
    "    \n",
    "    # get predicted class value\n",
    "    y_predict = model.predict(X_test)\n",
    "    # reshape\n",
    "    y_predict = y_predict.reshape((y_predict.shape[0], 1))\n",
    "    # put it into a dataframe\n",
    "    y_predict = pd.DataFrame(y_predict, index=range(0,len(y_predict)), columns=['loan_status_predicted'])\n",
    "    \n",
    "    # Join X_test and y_test using 'join' since they have the same index\n",
    "    loans_test = X_test.join(y_test)\n",
    "    \n",
    "    # Then join the test dataframe with y_predict; Since it is different index, I created a 'Join' column and then use 'merge'\n",
    "    loans_test['Join']=list(range(0,len(y_predict)))\n",
    "    y_predict['Join']=list(range(0,len(y_predict)))\n",
    "    # Merge test data with predicted data\n",
    "    loans_test_with_predict = pd.merge(loans_test, y_predict, on='Join')\n",
    "    \n",
    "    # filter the rows that are predicted as 1\n",
    "    predict_should_invest = loans_test_with_predict[loans_test_with_predict['loan_status_predicted']==1]\n",
    "    \n",
    "    # Among the loans predicted as 1, filter the rows thar are actually as 1\n",
    "    actual_should_invest = predict_should_invest[predict_should_invest['loan_status']==1]\n",
    "    \n",
    "    # calculate the mistake rate, this is the same as fpr\n",
    "    mistake_rate = (predict_should_invest.shape[0]-actual_should_invest.shape[0])/predict_should_invest.shape[0]\n",
    "    \n",
    "    # add a new column, indicating for each loan we invest 25 dollars\n",
    "    predict_should_invest['invest_amount'] = 25\n",
    "    \n",
    "    # add a new column, calculating interest earned from this loan by multiplying interest rate per loan with the amount invested per loan\n",
    "    predict_should_invest['interest_earned']=(predict_should_invest['int_rate']/100)*predict_should_invest['invest_amount']\n",
    "    \n",
    "    # sum the interest earned for all the loans we invested\n",
    "    total_interest = predict_should_invest['interest_earned'].sum()\n",
    "    \n",
    "    # calculate our investment return\n",
    "    investment_return = total_interest/(25*len(predict_should_invest))\n",
    "    print('The investment return with this model is: ', '{:.2%}'.format(investment_return))\n",
    "    \n",
    "    # calculate the return for the worse case, meaning if we lost all the money from the loans that are charged-off\n",
    "    #(including full amount of principal and interest, which is less likely), how much return we will get.\n",
    "    investment_return_with_maxlosss = (total_interest-(mistake_rate*(25*len(predict_should_invest))))/(25*len(predict_should_invest))\n",
    "    print('The investment return with this model for the worse case is: ', '{:.2%}'.format(investment_return_with_maxlosss))\n",
    "    print('\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='preprocessing'></a>\n",
    "## 3. Preprocessing the data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Encode object columns as dummy variables**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>term</th>\n",
       "      <th>grade</th>\n",
       "      <th>home_ownership</th>\n",
       "      <th>verification_status</th>\n",
       "      <th>purpose</th>\n",
       "      <th>addr_state</th>\n",
       "      <th>initial_list_status</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>36 months</td>\n",
       "      <td>B</td>\n",
       "      <td>RENT</td>\n",
       "      <td>Not Verified</td>\n",
       "      <td>credit_card</td>\n",
       "      <td>VA</td>\n",
       "      <td>f</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         term grade home_ownership verification_status      purpose  \\\n",
       "0   36 months     B           RENT        Not Verified  credit_card   \n",
       "\n",
       "  addr_state initial_list_status  \n",
       "0         VA                   f  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# select object columns\n",
    "object_columns = loans_2007_2019.select_dtypes(include=[\"object\"])\n",
    "\n",
    "object_columns.head(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>loan_amnt</th>\n",
       "      <th>int_rate</th>\n",
       "      <th>installment</th>\n",
       "      <th>emp_length</th>\n",
       "      <th>annual_inc</th>\n",
       "      <th>dti</th>\n",
       "      <th>delinq_2yrs</th>\n",
       "      <th>inq_last_6mths</th>\n",
       "      <th>open_acc</th>\n",
       "      <th>pub_rec</th>\n",
       "      <th>revol_bal</th>\n",
       "      <th>revol_util</th>\n",
       "      <th>total_acc</th>\n",
       "      <th>tot_cur_bal</th>\n",
       "      <th>total_rev_hi_lim</th>\n",
       "      <th>acc_open_past_24mths</th>\n",
       "      <th>avg_cur_bal</th>\n",
       "      <th>bc_open_to_buy</th>\n",
       "      <th>bc_util</th>\n",
       "      <th>mo_sin_old_il_acct</th>\n",
       "      <th>mo_sin_old_rev_tl_op</th>\n",
       "      <th>mo_sin_rcnt_rev_tl_op</th>\n",
       "      <th>mo_sin_rcnt_tl</th>\n",
       "      <th>mort_acc</th>\n",
       "      <th>mths_since_recent_bc</th>\n",
       "      <th>mths_since_recent_inq</th>\n",
       "      <th>num_accts_ever_120_pd</th>\n",
       "      <th>num_actv_bc_tl</th>\n",
       "      <th>num_actv_rev_tl</th>\n",
       "      <th>num_bc_sats</th>\n",
       "      <th>num_bc_tl</th>\n",
       "      <th>num_il_tl</th>\n",
       "      <th>num_op_rev_tl</th>\n",
       "      <th>num_rev_accts</th>\n",
       "      <th>num_rev_tl_bal_gt_0</th>\n",
       "      <th>num_sats</th>\n",
       "      <th>num_tl_op_past_12m</th>\n",
       "      <th>pct_tl_nvr_dlq</th>\n",
       "      <th>percent_bc_gt_75</th>\n",
       "      <th>pub_rec_bankruptcies</th>\n",
       "      <th>tot_hi_cred_lim</th>\n",
       "      <th>total_bal_ex_mort</th>\n",
       "      <th>total_bc_limit</th>\n",
       "      <th>total_il_high_credit_limit</th>\n",
       "      <th>loan_status</th>\n",
       "      <th>term_ 60 months</th>\n",
       "      <th>grade_B</th>\n",
       "      <th>grade_C</th>\n",
       "      <th>grade_D</th>\n",
       "      <th>grade_E</th>\n",
       "      <th>grade_F</th>\n",
       "      <th>grade_G</th>\n",
       "      <th>home_ownership_MORTGAGE</th>\n",
       "      <th>home_ownership_NONE</th>\n",
       "      <th>home_ownership_OTHER</th>\n",
       "      <th>home_ownership_OWN</th>\n",
       "      <th>home_ownership_RENT</th>\n",
       "      <th>verification_status_Source Verified</th>\n",
       "      <th>verification_status_Verified</th>\n",
       "      <th>purpose_credit_card</th>\n",
       "      <th>purpose_debt_consolidation</th>\n",
       "      <th>purpose_educational</th>\n",
       "      <th>purpose_home_improvement</th>\n",
       "      <th>purpose_house</th>\n",
       "      <th>purpose_major_purchase</th>\n",
       "      <th>purpose_medical</th>\n",
       "      <th>purpose_moving</th>\n",
       "      <th>purpose_other</th>\n",
       "      <th>purpose_renewable_energy</th>\n",
       "      <th>purpose_small_business</th>\n",
       "      <th>purpose_vacation</th>\n",
       "      <th>purpose_wedding</th>\n",
       "      <th>addr_state_AL</th>\n",
       "      <th>addr_state_AR</th>\n",
       "      <th>addr_state_AZ</th>\n",
       "      <th>addr_state_CA</th>\n",
       "      <th>addr_state_CO</th>\n",
       "      <th>addr_state_CT</th>\n",
       "      <th>addr_state_DC</th>\n",
       "      <th>addr_state_DE</th>\n",
       "      <th>addr_state_FL</th>\n",
       "      <th>addr_state_GA</th>\n",
       "      <th>addr_state_HI</th>\n",
       "      <th>addr_state_IA</th>\n",
       "      <th>addr_state_ID</th>\n",
       "      <th>addr_state_IL</th>\n",
       "      <th>addr_state_IN</th>\n",
       "      <th>addr_state_KS</th>\n",
       "      <th>addr_state_KY</th>\n",
       "      <th>addr_state_LA</th>\n",
       "      <th>addr_state_MA</th>\n",
       "      <th>addr_state_MD</th>\n",
       "      <th>addr_state_ME</th>\n",
       "      <th>addr_state_MI</th>\n",
       "      <th>addr_state_MN</th>\n",
       "      <th>addr_state_MO</th>\n",
       "      <th>addr_state_MS</th>\n",
       "      <th>addr_state_MT</th>\n",
       "      <th>addr_state_NC</th>\n",
       "      <th>addr_state_ND</th>\n",
       "      <th>addr_state_NE</th>\n",
       "      <th>addr_state_NH</th>\n",
       "      <th>addr_state_NJ</th>\n",
       "      <th>addr_state_NM</th>\n",
       "      <th>addr_state_NV</th>\n",
       "      <th>addr_state_NY</th>\n",
       "      <th>addr_state_OH</th>\n",
       "      <th>addr_state_OK</th>\n",
       "      <th>addr_state_OR</th>\n",
       "      <th>addr_state_PA</th>\n",
       "      <th>addr_state_RI</th>\n",
       "      <th>addr_state_SC</th>\n",
       "      <th>addr_state_SD</th>\n",
       "      <th>addr_state_TN</th>\n",
       "      <th>addr_state_TX</th>\n",
       "      <th>addr_state_UT</th>\n",
       "      <th>addr_state_VA</th>\n",
       "      <th>addr_state_VT</th>\n",
       "      <th>addr_state_WA</th>\n",
       "      <th>addr_state_WI</th>\n",
       "      <th>addr_state_WV</th>\n",
       "      <th>addr_state_WY</th>\n",
       "      <th>initial_list_status_w</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>16000.0</td>\n",
       "      <td>11.14</td>\n",
       "      <td>524.89</td>\n",
       "      <td>8.0</td>\n",
       "      <td>65000.0</td>\n",
       "      <td>15.76</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>9.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>13484.0</td>\n",
       "      <td>37.6</td>\n",
       "      <td>14.0</td>\n",
       "      <td>39639.0</td>\n",
       "      <td>35900.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>4404.0</td>\n",
       "      <td>20256.0</td>\n",
       "      <td>38.1</td>\n",
       "      <td>130.0</td>\n",
       "      <td>155.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>9.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>100.0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>63549.0</td>\n",
       "      <td>39639.0</td>\n",
       "      <td>32700.0</td>\n",
       "      <td>25728.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   loan_amnt  int_rate  installment  emp_length  annual_inc    dti  \\\n",
       "0    16000.0     11.14       524.89         8.0     65000.0  15.76   \n",
       "\n",
       "   delinq_2yrs  inq_last_6mths  open_acc  pub_rec  revol_bal  revol_util  \\\n",
       "0          0.0             1.0       9.0      0.0    13484.0        37.6   \n",
       "\n",
       "   total_acc  tot_cur_bal  total_rev_hi_lim  acc_open_past_24mths  \\\n",
       "0       14.0      39639.0           35900.0                   3.0   \n",
       "\n",
       "   avg_cur_bal  bc_open_to_buy  bc_util  mo_sin_old_il_acct  \\\n",
       "0       4404.0         20256.0     38.1               130.0   \n",
       "\n",
       "   mo_sin_old_rev_tl_op  mo_sin_rcnt_rev_tl_op  mo_sin_rcnt_tl  mort_acc  \\\n",
       "0                 155.0                    7.0             3.0       0.0   \n",
       "\n",
       "   mths_since_recent_bc  mths_since_recent_inq  num_accts_ever_120_pd  \\\n",
       "0                   7.0                    3.0                    0.0   \n",
       "\n",
       "   num_actv_bc_tl  num_actv_rev_tl  num_bc_sats  num_bc_tl  num_il_tl  \\\n",
       "0             3.0              4.0          5.0        6.0        5.0   \n",
       "\n",
       "   num_op_rev_tl  num_rev_accts  num_rev_tl_bal_gt_0  num_sats  \\\n",
       "0            6.0            8.0                  4.0       9.0   \n",
       "\n",
       "   num_tl_op_past_12m  pct_tl_nvr_dlq  percent_bc_gt_75  pub_rec_bankruptcies  \\\n",
       "0                 2.0           100.0              20.0                   0.0   \n",
       "\n",
       "   tot_hi_cred_lim  total_bal_ex_mort  total_bc_limit  \\\n",
       "0          63549.0            39639.0         32700.0   \n",
       "\n",
       "   total_il_high_credit_limit  loan_status  term_ 60 months  grade_B  grade_C  \\\n",
       "0                     25728.0            1                0        1        0   \n",
       "\n",
       "   grade_D  grade_E  grade_F  grade_G  home_ownership_MORTGAGE  \\\n",
       "0        0        0        0        0                        0   \n",
       "\n",
       "   home_ownership_NONE  home_ownership_OTHER  home_ownership_OWN  \\\n",
       "0                    0                     0                   0   \n",
       "\n",
       "   home_ownership_RENT  verification_status_Source Verified  \\\n",
       "0                    1                                    0   \n",
       "\n",
       "   verification_status_Verified  purpose_credit_card  \\\n",
       "0                             0                    1   \n",
       "\n",
       "   purpose_debt_consolidation  purpose_educational  purpose_home_improvement  \\\n",
       "0                           0                    0                         0   \n",
       "\n",
       "   purpose_house  purpose_major_purchase  purpose_medical  purpose_moving  \\\n",
       "0              0                       0                0               0   \n",
       "\n",
       "   purpose_other  purpose_renewable_energy  purpose_small_business  \\\n",
       "0              0                         0                       0   \n",
       "\n",
       "   purpose_vacation  purpose_wedding  addr_state_AL  addr_state_AR  \\\n",
       "0                 0                0              0              0   \n",
       "\n",
       "   addr_state_AZ  addr_state_CA  addr_state_CO  addr_state_CT  addr_state_DC  \\\n",
       "0              0              0              0              0              0   \n",
       "\n",
       "   addr_state_DE  addr_state_FL  addr_state_GA  addr_state_HI  addr_state_IA  \\\n",
       "0              0              0              0              0              0   \n",
       "\n",
       "   addr_state_ID  addr_state_IL  addr_state_IN  addr_state_KS  addr_state_KY  \\\n",
       "0              0              0              0              0              0   \n",
       "\n",
       "   addr_state_LA  addr_state_MA  addr_state_MD  addr_state_ME  addr_state_MI  \\\n",
       "0              0              0              0              0              0   \n",
       "\n",
       "   addr_state_MN  addr_state_MO  addr_state_MS  addr_state_MT  addr_state_NC  \\\n",
       "0              0              0              0              0              0   \n",
       "\n",
       "   addr_state_ND  addr_state_NE  addr_state_NH  addr_state_NJ  addr_state_NM  \\\n",
       "0              0              0              0              0              0   \n",
       "\n",
       "   addr_state_NV  addr_state_NY  addr_state_OH  addr_state_OK  addr_state_OR  \\\n",
       "0              0              0              0              0              0   \n",
       "\n",
       "   addr_state_PA  addr_state_RI  addr_state_SC  addr_state_SD  addr_state_TN  \\\n",
       "0              0              0              0              0              0   \n",
       "\n",
       "   addr_state_TX  addr_state_UT  addr_state_VA  addr_state_VT  addr_state_WA  \\\n",
       "0              0              0              1              0              0   \n",
       "\n",
       "   addr_state_WI  addr_state_WV  addr_state_WY  initial_list_status_w  \n",
       "0              0              0              0                      0  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The number of rows and columns in our machine learning model is :  (1023834, 123)\n"
     ]
    }
   ],
   "source": [
    "# Encode object columns to integer values and return a Dataframe containing the dummy columns.\n",
    "dummy_df = pd.get_dummies(loans_2007_2019[object_columns.columns], drop_first=True)\n",
    "\n",
    "# combine dummy column dataframe with original dataframe as column\n",
    "loans_2007_2019 = pd.concat([loans_2007_2019, dummy_df], axis=1)\n",
    "\n",
    "# drop original object columns\n",
    "loans_2007_2019 = loans_2007_2019.drop(object_columns.columns, axis=1)\n",
    "\n",
    "# review the result with 3 rows\n",
    "display(loans_2007_2019.head(1))\n",
    "\n",
    "print('The number of rows and columns in our machine learning model is : ', loans_2007_2019.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Not scale on our data*\n",
    "\n",
    "Scale data is not very neccessary for this project, I have tested with the same machines learning model with standard scaled data, the results are not too much different. However, scaled data makes interpretation more difficult, I need the original test data to with interest column to calculate interest on investment."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Train Test split**\n",
    "\n",
    "We take 70% of the data to use as trainning data that will be fit to a model, the rest 30% of the data will be used as test data to get predicted value; we then compare our predicted value with actual value to calculate our error metric fpr and tpr."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "features = loans_2007_2019.drop('loan_status', axis=1)\n",
    "\n",
    "target = loans_2007_2019['loan_status']\n",
    "\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='model'></a>\n",
    "## 4. Fit the data in different models and analyze the results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.1 Try Logistic Regression with default papameters\n",
    "\n",
    "A good first algorithm to apply to binary classification problems is logistic regression, for the following reasons:\n",
    "\n",
    "- it's quick to train and we can iterate more quickly,\n",
    "- it's less prone to overfitting than more complex models like decision trees,\n",
    "- it's easy to interpret."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Confusion matrix: \n",
      " [[   325  58998]\n",
      " [   307 247521]]\n",
      "The True Positive Rate (Recall/Sensitivity)is tpr = tp/(tp+fn):  0.9987612376325516\n",
      "The False Positive Rate (1-specification)is fpr = fp/(fp+tn):  0.9945215177924245\n",
      "The Precision (tp/(tp+fp)) is:  0.8075225353077623\n",
      "\n",
      "\n",
      "The investment return with this model is:  13.19%\n",
      "The investment return with this model for the worse case is:  -6.06%\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# creat our model\n",
    "lr_mymodel_default = LogisticRegression().fit(X_train, y_train)\n",
    "# use my function\n",
    "my_metric(lr_mymodel_default, X_test, y_test)\n",
    "investment_return_with_my_model(lr_mymodel_default, X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Analyze the result**\n",
    "\n",
    "Our tpr and fpr are all very high, this means our model predict 99% of the loans that are Fully-paid that are actually Fully-paid (which is good), but also predict 99% of the loans as Fully-paid when it is asctuall Charged-off. If we check what is our prediction, we will see it's almost all as 1.\n",
    "\n",
    "Precision is 80%, This seems high, but our target columns is a highly imbalanced class, 80% of the values are 1 and 20% of the values are 0. So 80% of precision does not mean much.\n",
    "\n",
    "Lastly, if we use this model, we could lost 6.06% of our money.\n",
    "\n",
    "**My stragegy to fix the problem** is to tell the classifier to penalize certain rows more by setting the class_weight parameter to balanced. This tells scikit-learn to penalize the misclassification of the minority class during the training process. The penalty means that the logistic regression classifier pays more attention to correctly classifying rows where loan_status is 0. This lowers accuracy when loan_status is 1, but raises accuracy when loan_status is 0. By **setting the class_weight parameter to balanced**, the penalty is set to be inversely proportional to the class frequencies. This would mean that for the classifier, correctly classifying a row where loan_status is 0 is 4 times more important than correctly classifying a row where loan_status is 1. (4 times is not default value, it is based on the proportionate of our loan status data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.2 Try Logistic Regression with class_weight = 'balanced'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Confusion matrix: \n",
      " [[ 38777  20546]\n",
      " [ 87184 160644]]\n",
      "The True Positive Rate (Recall/Sensitivity)is tpr = tp/(tp+fn):  0.6482076278709428\n",
      "The False Positive Rate (1-specification)is fpr = fp/(fp+tn):  0.3463412167287561\n",
      "The Precision (tp/(tp+fp)) is:  0.8866052210386887\n",
      "\n",
      "\n",
      "The investment return with this model is:  10.48%\n",
      "The investment return with this model for the worse case is:  -0.86%\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# creat model, set model parameter with balanced\n",
    "lr_mymodel_balanced = LogisticRegression(class_weight='balanced').fit(X_train, y_train)\n",
    "# use my function\n",
    "my_metric(lr_mymodel_balanced, X_test, y_test)\n",
    "investment_return_with_my_model(lr_mymodel_balanced, X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Analyze the result**\n",
    "\n",
    "We significantly improved false positive rate by balancing the classes, which also reduced true positive rate. Our true positive rate is now around 65%, and our false positive rate is around 35%. From a conservative investor's standpoint, it's reassuring that the false positive rate is lower because it means that we'll be able to do a better job at avoiding bad loans than if we funded everything. However, investor can still lose 0.86% of investment."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Using precision_recall_curves to visulize the trade off**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAfUAAAGDCAYAAAAyM4nNAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3Xm4JHV97/H3hwEEREAdNAIDI5sKiAojytUYjMagN0JUIhg3jEpiVIJLXO7NVYNJjCaoiUsM7jsgiWQ0KFECogYigyAwIDoiOOMYnREBFZRlvvePqgM9Z87M6bP0OX2q36/n6efU1lXfru6ZT/+qfl2VqkKSJC18W813AZIkaXYY6pIkdYShLklSRxjqkiR1hKEuSVJHGOqSJHWEoS4NiSQrkxwxyTJ7JvlFkkVzVNZAJTk+ydd6xivJvvNZk7SQGerSJJJcl+TWNkx/nOTDSXac7e1U1YFVdf4ky/ygqnasqjtne/ttoP6yfZ0/TPL2YfvykOR3k1yQ5OdJ1iX5SpKj5rsuaVgY6lJ/nlpVOwKHAI8E/mL8Amks9H9TD2tf528BxwJ/NM/13CXJMcBngI8BewD3B94APHUa6+rCeyVtwg+1NAVV9UPgC8BBAEnOT/LXSb4O3ALsnWTnJB9M8qO2xftXvS3eJC9OcnXb2rwqySHt9OuSPLEdPizJiiQ3t0cH3t5OX9q2qLdux3dLsjzJDUlWJXlxz3belOSMJB9rt7UyybI+X+cq4OvAw3vWN93X9bok3+uZ/rSp7vckAd4OvLmqPlBVN1XVhqr6SlW9uOf1fqLnOeP31fj36v8kWTFuO69IsrwdvkeSv0/yg/Y9eF+S7adauzSXDHVpCpIsAZ4CXNoz+bnACcC9gOuBjwJ3APsCjwCeBLyoff4fAG8CngfsBBwF/HSCTf0D8A9VtROwD3DGZkr6NLAG2A04BvibJE/omX8UcBqwC7AceHefr/PBwG8Cq3omT/d1fa9d187AXwKfSPKAfuro8SBgCXDmFJ83Xu979S7gQUn265n/h8Cn2uG3AvvTfLHZF9id5siANLQMdak/ZyW5Efga8BXgb3rmfaSqVlbVHcB9gCcDJ1XVL6vqJ8A7gOPaZV8EvK2qLq7Gqqq6foLt3Q7sm2RxVf2iqi4av0D7BeOxwGur6ldVdRnwAZrgGvO1qjq7PQf/ceBhk7zObyb5JXA1cD7w3nZb95/u66qqz1TV2rZlfTrwXeCwSeoY777t3x9N8Xnj3fVeVdVNwL8BzwJow/3BwPL2yMCLgVdU1Q1V9XOa9/y4za1YGgaGutSf36+qXapqr6r606q6tWfe6p7hvYBtgB8lubH9IvDPwP3a+UtoWq6TeSFNK/HbSS5O8nsTLLMbMBY4Y66naVGO+Z+e4VuA7cYOR2/GIcCONOfTHwXcc6avK8nzklzW87yDgMVbqGEiY63+qbbwx1s9bvxTtKFO00o/q6puAXYFdgAu6an7i+10aWgZ6tLM9d7qcDXwa2Bx+yVgl6raqaoO7Jm/z6QrrPpuVT2LJjTfCpyZ5J7jFlsL3CfJvXqm7Qn8cLovpN12VdUZwIXcfbh5Wq8ryV7A+4GXAfetql2AK4FMsaxr2m08YwvL/JImiMf8xgTLjL8t5X8Ai5M8nCbcxw69rwduBQ7seb07t50IpaFlqEuzqKp+RBMUpyTZKclWSfZJ8lvtIh8AXp3k0LYH9r5t8G0kyXOS7FpVG4Ab28kb/YytqlYD/wW8Jcl2SQ6maeF/cpZezt8CJyT5jRm8rnvSBOm69nW9gLaT4VRUc4/oVwL/L8kLemp4bJJT28UuAx6X5rf8OwOv72O9d9Ccp/87mlMnX2qnb6D5MvKOJPdra989ye9OtXZpLhnq0ux7HrAtcBXwM5rQeAA055eBv6ZpEf4cOIsmTMY7EliZ5Bc0neaOq6pfTbDcs4ClNK32zwJvrKovzcaLqKoraPoP/Pl0X1dVXQWcQtPq/zHwUJpe9dOp50zu/pnd2nZ9f0VzXpz2dZ8OXA5cAny+z1V/Cngi8Jk25Me8lqaj4EVJbga+TNNhTxpaab4AS5Kkhc6WuiRJHWGoS5LUEYa6JEkdYahLktQRhrokSR2xpStLDaXFixfX0qVL57sMSZLmxCWXXLK+qvq6muGCC/WlS5eyYsWKyReUJKkDkkx0f4gJefhdkqSOMNQlSeoIQ12SpI4w1CVJ6ghDXZKkjjDUJUnqCENdkqSOMNQlSeoIQ12SpI4YWKgn+VCSnyS5cjPzk+Qfk6xKcnmSQwZViyRJo2CQLfWPAEduYf6Tgf3axwnAPw2wloldfiqc+bvNX0mSFriBXfu9qi5IsnQLixwNfKyqCrgoyS5JHlBVPxpUTRu5/FT40h83w9f/R/P34BPmZNOSJA3CfJ5T3x1Y3TO+pp22iSQnJFmRZMW6detmZ+vf+Zctj0uStMDMZ6hngmk10YJVdWpVLauqZbvu2tfd5ya3/zO2PC5J0gIzn6G+BljSM74HsHbOtt57qP2gF3roXZK04M1nqC8Hntf2gn80cNOcnU8fb8/Hz8tmJUmaTQPrKJfk08ARwOIka4A3AtsAVNX7gLOBpwCrgFuAFwyqFkmSRsEge78/a5L5Bbx0UNuXJGnUeEU5gJqwf54kSQuKoS5JUkcY6pIkdYShDmzm5/GSJC0ohjrAF54PHz5gvquQJGlGDHUACm642mCXJC1ohnqvG66e7wokSZo2Q328U9I8JElaYAZ28ZkFb3ywP+TZ8JRPzE8tkiT1wVDv19WfbB5jXmWPeUnScPHw+3R5iF6SNGQM9Znw/LskaYh4+H02jA92D81LkubB6Ib65acObt29IW/AS5LmyOiG+hUf3PL88WE83cPstuIlSXNkdEP9jts2P2+i4B2bNtNz6Ia8JGlARjfUN/x6es+brXAfs7n1GPaSpCka3VDffldggsvC9humvcsNogf8ROs06CVJWzC6ob7dfWZvXbN1/n0yBr0kaQtGN9QHqTdo331f+PUNg9uWPe0lSS1DfdBe9tNNp81VS96Ql6SRMrqhPp8XgpvPw/Wbq0GStOCNbqivWznfFdxtSwHr+XlJUp9GN9Rv/v58V9CfQfey72XLXpIWtNENdSYKqUVzXsWUTBSsc3VDGX9PL0lDb3RDfat7wJ23bDwtC/COa3N1fn5zDHtJGhqjG+p1x6bT0oE70W4uTOc77A15SRq4EQ71CUJmmx3mvo65Mt9h7/l6SRq40Q31rbaBO2/feNrBJ8xPLfNpPnreT7YdQ16SpmWEQ30ruLN3fBt43FvnrZyhNF+te1v1kjQtoxvq2+wIt//i7vHt7jt/tSw083ko39/XS9JmjW6oZ+stj2vq5qsn/kTbuddecMJ1c7N9SRoSo5tkv1q/5XHN3Hyer//59bbqJY2c0Q31rbaBO3+18bjmzjC16g16SR0xuqG+zY5w+883Htf8mc9WvRfQkdQRoxvqi7bd8riGx3xdHtde+JIWmNEN9Xvcuznv2juuhWMYL6Zj0EuaZ6Mb6rf8eMvjWpi86Y2kETa6oX7bz7c8ru6Yz6Df3LYMekkDMLqhPv7mLV24mYv65+F7SR00uqHOhknGNZLshS9pARvdUB9/mVh/0qbJDFMvfINe0gRGN9R3+A245X82HpemaqJwveC1cPHbBrtdg17SBEY31HsDfaJxaboe99aJ7/jn4XtJAza6od576H2icWm2efhe0oCNbqh7Tl3DYJjuWW/QSwve6Ia659Q1zOajVT/R+h/5molPJUgaSqMb6p5T10IzH0F/8dsm7vRnq14aSqMb6rVhy+PSQuDhe0k9RjfUl/4OXP3JjcelrhiWw/ebq0XSQAw01JMcCfwDsAj4QFX97bj5ewEfAnYFbgCeU1VrBlnTXba515bHpa6x973UeQML9SSLgPcAvwOsAS5OsryqrupZ7O+Bj1XVR5P8NvAW4LmDqknSOAa91CmDbKkfBqyqqmsBkpwGHA30hvoBwCva4fOAswZYz8YOfB5c/r5mOIuacUmep5cWsEGG+u7A6p7xNcCjxi3zLeAZNIfonwbcK8l9q+qnA6yrR4DyDm1SP4blPL1BL23WIEN9on/t4/81vhp4d5LjgQuAHwJ3bLKi5ATgBIA999xzdqpbff7d5dSdzfhuh8/OuqVRMR+t+vHrNuSluwwy1NcAS3rG9wDW9i5QVWuBpwMk2RF4RlXdNH5FVXUqcCrAsmXLZudf8JIjgK2ADbDVNu24pFkxl616W/PSXQYZ6hcD+yV5IE0L/DjgD3sXSLIYuKGqNgCvp+kJPzd2OxwWHwjrr4Aj3mkrXRo0g14auIGFelXdkeRlwDk0P2n7UFWtTHIysKKqlgNHAG9JUjSH3186qHomtO1Ozd/FB83pZiW1DHppVg30d+pVdTZw9rhpb+gZPhM4c5A1bNFtNzd/118Jezx23sqQ1GM+g96Q1wI3uleUW3shrF/ZDJ9/EtzvYR6Cl4bV+LA15KUJjW6orz4faK/3vuF2e79LC8lcteY9ZK8FZnRD3d7vUrfMV9Ab8hoioxvq9n6Xum8ugt7WvIbI6IY62PtdGkVzcX7eoNc8Ge1QlyQ74alDDHVJ6jVf5+azDbzyttnfjkaKoS5Jk5mL1nzdbmteMzbaoe7FZyRNhz3tNaRGN9S9+Iyk2WRPew2B0Q11Lz4jadDmo6e9IT/SRjfUvfiMpLk2H615Q36kjG6oe/EZScNg0K15Q36kjG6ogxefkTR85jLkDfjOGe1Ql6RhN8hD9rbiO8dQl6SFZlCteUN+wTPUJWmhM+TVMtQlqWsM+ZE12qHuFeUkjYLe8J3NjneG/NAZ3VD3inKSRtEge9fbs37ejW6oe0U5SfJQfceMbqh7RTlJ2tRchLwBPzCjG+peUU6SJjeIkLcVPzCjG+rgFeUkaaoGHfIG/IyMdqhLkmZmtnvWG/AzYqhLkmbHbLfiDfgpM9QlSYMxm614A74vhrokafBmsxVvwG+WoS5Jmnuz1Yofe67hDhjqkqT5NhsBb+sdMNQlScNkNgN+BMPdUJckDaeZBvwItt5HO9S9S5skLQyzFfAdD/fRDXXv0iZJC9NYMNt638Tohrp3aZOkhc3W+yZGN9S9S5skdcdYMJ+6FH5+/dSe26HW++iGundpk6TuOeG6u4dHsPW+1XwXMK+8S5skdderavrhfEpm717yc2h0W+qSpNEwk3PvC6zlPtotdUnSaJlu632BtNxtqUuSRs90W+9D3nK3pS5JGm3Tab0PaavdUJckCaYe7qcE3n3fwdUzDYa6JEm9phLuv75hqFrthrokSROZSrgPSbAb6pIkbUm/4T4EPeQNdUmS+rEAWu2GuiRJ/ZpKq30eGOqSJE3VkAa7oS5J0nQM4QVoDHVJkqZrJjeNGYCBhnqSI5Nck2RVktdNMH/PJOcluTTJ5UmeMsh6JEkaiCEJ9oGFepJFwHuAJwMHAM9KcsC4xf4COKOqHgEcB7x3UPVM6Labm7/rr5zTzUqSOmh8sM9D0A/yhi6HAauq6lqAJKcBRwNX9SxTQHtTc3YG1g6wno2tvRDWrwTgtnNP5OT/2orvLnooAL938AN47uFLufW2Ozn+w9/Y5KnHHLoHf7BsCTf88jZe8olLNpn/nEfvxVMfthtrb7yVV5x+2SbzX/ybe/PEA+7P99b9gv/zr1dsMv/lv70fj91vMSvX3sTJn7tqk/mvOfJBHLrXfbjk+ht42xev2WT+G556AAfutjNf++563vWf391k/t88/aHss+uOfPmqH/P+r167yfx3HPtwdttlez73rbV84qLrN5n/T885lPvcc1s+s2I1Z16yZpP5H3nBYWy/7SI+fuF1fP7yH20y//Q/PhyAUy/4Hude/ZON5m23zSI++keHAfCP536Xr69av9H8e++wLe977qEAvPWL3+ab1/9so/kP2Hk73nncIwD4y8+t5Kq1N280f+9d78lbnn4wAK//18u5dt0vN5p/wG478canHgjASaddyo9u+tVG8w/Z69689sgHA/AnH7+En91y20bzH7PvYk58wn4APP9D3+BXt9+50fwnPOR+nPC4fQA49p8vZDw/e372wM/egv7s/a/v89zDl24yf64M8vD77sDqnvE17bRebwKek2QNcDbw8olWlOSEJCuSrFi3bt3sVLf6fGADAIu4kwPvvHR21itJ0jxJ1WAODyT5A+B3q+pF7fhzgcOq6uU9y7yyreGUJIcDHwQOqqoNm1vvsmXLasWKFTMvcO2F8OnHAhtg0T3gmefBbofPfL2SJM2iJJdU1bJ+lh1kS30NsKRnfA82Pbz+QuAMgKq6ENgOWDzAmu622+GwuDnUxRHvNNAlSQveIEP9YmC/JA9Msi1NR7jl45b5AfAEgCQPoQn1WTq+3odt29P5iw+as01KkjQoAwv1qroDeBlwDnA1TS/3lUlOTnJUu9irgBcn+RbwaeD4GtT5AEmSOm6Qvd+pqrNpOsD1TntDz/BVwGMGWYMkSaPCK8pJktQRhrokSR1hqEuS1BGGuiRJHdF3R7kkuwN79T6nqi4YRFGSJGnq+gr1JG8FjqW5bvvYBYULMNQlSRoS/bbUfx94UFX9epDFSJKk6ev3nPq1wDaDLESSJM1Mvy31W4DLkpwL3NVar6oTB1KVJEmasn5DfTmbXrddkiQNkb5Cvao+2t6UZf920jVVdfvgypIkSVPVb+/3I4CPAtcBAZYkeb4/aZMkaXj0e/j9FOBJVXUNQJL9ae6qduigCpMkSVPTb+/3bcYCHaCqvoO94SVJGir9ttRXJPkg8PF2/NnAJYMpSZIkTUe/of4S4KXAiTTn1C8A3juooiRJ0tT12/v918Db24ckSRpCWwz1JGdU1TOTXEFzrfeNVNXBA6tMkiRNyWQt9T9r//7eoAuRJEkzs8Xe71X1o3ZwPbC6qq4H7gE8DFg74NokSdIU9PuTtguA7dp7qp8LvAD4yKCKkiRJU9dvqKeqbgGeDryrqp4GHDC4siRJ0lT1HepJDqf5ffq/t9P6/Tnc8Lrt5ubv+ivntw5JkmZBv6F+EvB64LNVtTLJ3sB5gytrDqy9ENavbIbPP6kZlyRpAev3d+pfAb7SM34tzYVoFq7V5wMbmuENtzfjux0+jwVJkjQzk/1O/Z1VdVKSzzHx79SPGlhlg7bkCJoDFRtgq23acUmSFq7JWupj13r/+0EXMud2OxwWHwjrr4Aj3mkrXZK04G0x1Ktq7KYtK4Bbq2oDQJJFNL9XX9i23an5u/ig+a1DkqRZ0G9HuXOBHXrGtwe+PPvlSJKk6eo31Lerql+MjbTDO2xheUmSNMf6DfVfJjlkbCTJocCtgylJkiRNR78XkDkJ+EySseu9PwA4djAlSZKk6ej3d+oXJ3kw8CAgwLer6vaBViZJkqakr8PvSXYAXgv8WVVdASxN4u1YJUkaIv2eU/8wcBsw9mPuNcBfDaQiSZI0Lf2G+j5V9TbgdoCqupXmMLwkSRoS/Yb6bUm2p71UbJJ9gF8PrCpJkjRl/fZ+fyPwRWBJkk8CjwGOH1RRkiRp6iYN9SQBvg08HXg0zWH3P6uq9QOuTZIkTcGkoV5VleSsqjoU+Pc5qEmSJE1Dv+fUL0ryyIFWIkmSZqTfc+qPB/4kyXXAL2kOwVdVHTyowiRJ0tT0G+pPHmgVkiRpxrYY6km2A/4E2Be4AvhgVd0xF4VJkqSpmeyc+keBZTSB/mTglIFXJEmSpmWyw+8HVNVDAZJ8EPjG4EuSJEnTMVlL/a47sXnYXZKk4TZZS/1hSW5uhwNs346P9X7faaDVSZKkvm0x1Ktq0VwVIkmSZqbfi89IkqQhN9BQT3JkkmuSrEryugnmvyPJZe3jO0luHGQ9kiR1Wb8Xn5myJIuA9wC/A6wBLk6yvKquGlumql7Rs/zLgUcMqh5JkrpukC31w4BVVXVtVd0GnAYcvYXlnwV8eoD1SJLUaYMM9d2B1T3ja9ppm0iyF/BA4D83M/+EJCuSrFi3bt2sFypJUhcMMtQzwbTazLLHAWdW1Z0TzayqU6tqWVUt23XXXWetQEmSumSQob4GWNIzvgewdjPLHoeH3iVJmpFBhvrFwH5JHphkW5rgXj5+oSQPAu4NXDjAWiRJ6ryBhXp7WdmXAecAVwNnVNXKJCcnOapn0WcBp1XV5g7NS5KkPgzsJ20AVXU2cPa4aW8YN/6mQdYgSdKo8IpykiR1hKEuSVJHGOqSJHWEoS5JUkcY6pIkdYShLklSRxjqkiR1hKEuSVJHGOqSJHWEoS5JUkcY6pIkdYShLklSR4x2qN92c/N3/ZXzW4ckSbNgdEN97YWwfmUzfP5JzbgkSQvY6Ib66vOBDc3whtvbcUmSFq7RDfUlR3DXy99qm3ZckqSFa3RDfbfDYfGBzfAR72zGJUlawEY31AG23an5u/ig+a1DkqRZMNqhLklShxjqkiR1hKEuSVJHGOqSJHWEoS5JUkcY6pIkdYShLklSRxjqkiR1hKEuSVJHGOqSJHWEoS5JUkcY6pIkdYShLklSRxjqkiR1hKEuSVJHGOqSJHWEoS5JUkcY6pIkdYShLklSRxjqkiR1hKEuSVJHGOqSJHWEoS5JUkcY6pIkdYShLklSRxjqkiR1hKEuSVJHGOqSJHWEoS5JUkcY6pIkdYShLklSRxjqkiR1xEBDPcmRSa5JsirJ6zazzDOTXJVkZZJPDbIeSZK6bOtBrTjJIuA9wO8Aa4CLkyyvqqt6ltkPeD3wmKr6WZL7DaoeSZK6bpAt9cOAVVV1bVXdBpwGHD1umRcD76mqnwFU1U8GWI8kSZ02yFDfHVjdM76mndZrf2D/JF9PclGSIwdYjyRJnTaww+9AJphWE2x/P+AIYA/gq0kOqqobN1pRcgJwAsCee+45+5VKktQBg2yprwGW9IzvAaydYJl/q6rbq+r7wDU0Ib+Rqjq1qpZV1bJdd911YAVLkrSQDTLULwb2S/LAJNsCxwHLxy1zFvB4gCSLaQ7HXzvAmiRJ6qyBhXpV3QG8DDgHuBo4o6pWJjk5yVHtYucAP01yFXAe8OdV9dNB1SRJUpcN8pw6VXU2cPa4aW/oGS7gle1DkiTNgFeUkySpIwx1SZI6wlCXJKkjDHVJkjrCUJckqSMMdUmSOsJQlySpIwx1SZI6wlCXJKkjDHVJkjrCUJckqSMMdUmSOsJQlySpIwx1SZI6wlCXJKkjDHVJkjrCUJckqSMMdUmSOsJQlySpIwx1SZI6wlCXJKkjDHVJkjrCUJckqSMMdUmSOsJQlySpIwx1SZI6wlCXJKkjDHVJkjrCUJckqSMMdUmSOsJQlySpIwx1SZI6wlCXJKkjDHVJkjrCUJckqSMMdUmSOsJQlySpIwx1SZI6wlCXJKkjDHVJkjrCUJckqSMMdUmSOsJQlySpIwx1SZI6wlCXJKkjDHVJkjrCUJckqSMMdUmSOsJQlySpIwx1SZI6wlCXJKkjBhrqSY5Mck2SVUleN8H845OsS3JZ+3jRIOuRJKnLBhbqSRYB7wGeDBwAPCvJARMsenpVPbx9fGBQ9Uzotpubv+uvnNPNSpI0CINsqR8GrKqqa6vqNuA04OgBbm9q1l4IP13ZDJ9/UjMuSdICNshQ3x1Y3TO+pp023jOSXJ7kzCRLJlpRkhOSrEiyYt26dbNT3erzgTTDG+5oxyVJWrgGGeqZYFqNG/8csLSqDga+DHx0ohVV1alVtayqlu26666zU92SI2DRtpBFzd8lR8zOeiVJmidbD3Dda4DelvcewNreBarqpz2j7wfeOsB6Nrbb4fAH5zYt9CVHNOOSJC1ggwz1i4H9kjwQ+CFwHPCHvQskeUBV/agdPQq4eoD1bGq3ww1zSVJnDCzUq+qOJC8DzgEWAR+qqpVJTgZWVNVy4MQkRwF3ADcAxw+qHkmSui5V409zD7dly5bVihUr5rsMSZLmRJJLqmpZP8t6RTlJkjrCUJckqSMMdUmSOsJQlySpIwx1SZI6wlCXJKkjDHVJkjrCUJckqSMMdUmSOmLBXVEuyTrg+llc5WJg/Syub1S5H2fOfThz7sOZcx/O3Gzvw72qqq9blC64UJ9tSVb0e/k9bZ77cebchzPnPpw59+HMzec+9PC7JEkdYahLktQRhjqcOt8FdIT7cebchzPnPpw59+HMzds+HPlz6pIkdYUtdUmSOmJkQj3JkUmuSbIqyesmmH+PJKe38/87ydK5r3K49bEPX5nkqiSXJzk3yV7zUecwm2wf9ix3TJJKYi/kCfSzH5M8s/08rkzyqbmucdj18e95zyTnJbm0/Tf9lPmoc1gl+VCSnyS5cjPzk+Qf2/17eZJD5qSwqur8A1gEfA/YG9gW+BZwwLhl/hR4Xzt8HHD6fNc9TI8+9+HjgR3a4Ze4D6e+D9vl7gVcAFwELJvvuoft0edncT/gUuDe7fj95rvuYXr0uQ9PBV7SDh8AXDffdQ/TA3gccAhw5WbmPwX4AhDg0cB/z0Vdo9JSPwxYVVXXVtVtwGnA0eOWORr4aDt8JvCEJJnDGofdpPuwqs6rqlva0YuAPea4xmHXz+cQ4M3A24BfzWVxC0g/+/HFwHuq6mcAVfWTOa5x2PWzDwvYqR3eGVg7h/UNvaq6ALhhC4scDXysGhcBuyR5wKDrGpVQ3x1Y3TO+pp024TJVdQdwE3DfOaluYehnH/Z6Ic23VN1t0n2Y5BHAkqr6/FwWtsD081ncH9g/ydeTXJTkyDmrbmHoZx++CXhOkjXA2cDL56a0zpjq/5mzYutBb2BITNTiHt/tv59lRlnf+yfJc4BlwG8NtKKFZ4v7MMlWwDuA4+eqoAWqn8/i1jSH4I+gOWL01SQHVdWNA65toehnHz4L+EhVnZLkcODj7T7cMPjyOmFeMmVUWuprgCU943uw6aGku5ZJsjXN4aYtHVoZNf3sQ5I8Efi/wFFV9es5qm2hmGwf3gs4CDg/yXU05+GW21luE/3+e/63qrq9qr4PXEMT8mr0sw9fCJwBUFUXAtvRXNNc/enr/8zZNiqhfjGwX5IHJtmWpiPc8nHLLAee3w4fA/xntb0dBPSxD9tDx/9ME+iew9zUFvdhVd1UVYuramlVLaXpl3BUVa1PndtwAAAIpklEQVSYn3KHVj//ns+i6bhJksU0h+OvndMqh1s/+/AHwBMAkjyEJtTXzWmVC9ty4HltL/hHAzdV1Y8GvdGROPxeVXckeRlwDk2vzw9V1cokJwMrqmo58EGaw0uraFrox81fxcOnz334d8COwGfaPoY/qKqj5q3oIdPnPtQk+tyP5wBPSnIVcCfw51X10/mrerj0uQ9fBbw/yStoDhsfb0Pnbkk+TXN6Z3Hb7+CNwDYAVfU+mn4ITwFWAbcAL5iTunyPJEnqhlE5/C5JUucZ6pIkdYShLklSRxjqkiR1hKEuSVJHGOoaWknuTHJZkiuTfC7JLrO8/uOTvLsdflOSV29muZOSPG82t72Z7ZzcXrxnbJs79Mw7e0uvP8luSc5shx8+l3fUSnLf9n26LMn/JPlhz/i2s7ytJya5qV331Un+7zTWsSjJV9vhvZMc1zPvUUneMct1fjvJ3/bxnEP6uZxt+/5+YKY1qpsMdQ2zW6vq4VV1EM21A1461wW0Vxf8I2Dgt+6sqjdU1Zfb0ZOAHXrmPWVLlzitqrVVdUw7+nCa38cOTLtfxrb90/Z9ejjwPuAdY+PtzULGnpP2UrgzdV67rUcCL0zysKk8uarurKrfbEf3pueaFFX131X1ilmosbfOQ4BnJHnUJMsfAkwa6lV1GbBPkoFfR1wLj6GuheJCem6GkOTPk1zc3qf4L3umP6+d9q0kH2+nPTXJf6e5L/SXk9x/Ctv9beCb7U1+SHJi7r5n/GnttHumubfyxe02jm6nH5/kX5N8Mcl3k7ytnb4oyUfaIxBXtBf3oJ12TJITgd2A85Kc1867LsniJG9N8qc9r/dNSV6VZGm7vm2Bk4Fj21bise22d22X3yrN/Z03e7nPJNsl+XBb26VJxq7MdnySzyT5HPAf/ey8JPu2db0P+CawJMmNPfOPG2t1Jrl/u79WJPlGmqtwbVZV/aJd5z5Jtk/y0bbmbyZ5XLvOh7bvy2Xte7Z3kq17avhb4PHt/BPbFvZZ7Xt0fZKd2vUkybXtezDVOm+hubXp7u26Hp3kwnbffj3Jfkm2B94APLut5ZgkO7afiW+0yz61Z7WfB47t5z3QiJmL+7v68DGdB/CL9u8i4DPAke34k2ju9RyaL6afp7m38YE01/he3C53n/bvvbn7QksvAk5ph48H3t0Ovwl49QQ1/CXw8p7xtcA92uFd2r9/AzxnbBrwHeCe7fqvpbmPwHbA9TTXgj4U+FLPOsfW8xHgmHb4urHX0TsOPAL4Ss/0q4A9gaW093XufV3t+BuBk3r23b9Mst9fBXy4HX4wzeVCt2vXu2Zsv27muRvtR2BfYAPwyHZ8a+DGnvnHAR9oh08HHt0O3/V6xq3/icBZ7fCubW0PAl4LvL+dfmC7r7cF/gk4tp1+j/Z13FVD7/omWP97gOe2w48BvjjNOu9D8+Xjfu34zsCidvhI4PSez+Y7e9bxNuC4ns/wd4Dt2vHfAj473/9GfQzfYyQuE6sFa/skl9H8x3kJ8KV2+pPax6Xt+I40N+t4GHBmVa0HqKqxG/LsAZye5l7G2wLfn0INDwCu7hm/HPhkkrNori8+Vs9Rufuc/HY0QQtwblXdBJDmkqV7ASuBvZO8C/h3+mz1tq/p0iT3S7IbTaj9rKp+kGTpFp72IeDfgHfSnEr48CSbeSzwrnZ7305yPc2106H5MjLVGx19r6ou7mO5JwIPSu66udW9k2xfVbeOW+7xSS6l+bLw5qq6JsljaS5TTDWXO11L84Xiv4C/SLIX8K9VtSo9pw4mcTrwGuDjNF8+Tp9GnZfTfDF6c919P4RdgI8l2WeS7T8JeHKS17XjY5+r7wA/oTmaI23Ew+8aZrdWc05yL5owHjunHuAtdfd5232r6oPt9Imue/wumpbrQ4E/pvnPse8axi3/v2lacIcCl7QBEeAZPfXsWVVjXwR671R3J7B1Vf2M5gvI+e1rmmqnpzNpbjp0LHDaZAtX1Wrgx0l+G3gUk9/nfqJbRo75Zb9FbuY5G8atv3ffBjisZz/uPkFQQnOu+hFVdWhVvX9LNVfVx4Gn0bwPXxo7LN+nrwIHJLkvcBTw2WnUeTBwMHBikoe20/8aOKeaviK/z+Y/jwF+f9zn6jvtvO1oPpvSRgx1Db22pXsi8Ook29DchOKPkuwIkGT3JPcDzgWe2f4nTJL7tKvYGfhhO/x8puZqmhbf2P3Ol1TVeTQtuF1ojhKcA7w8bdMtzd3qNqs9n71VVf0L8P9oOkiN93OaW7FO5DSaluMxNAHfz3M/AHwCOKOq7mzreFqSt0zw/AuAZ7fL7E/TOrxmS6+pX9Xci/tn7XnkrWgCd8yX6ekMmeThU1h1b80PoTnCsirJ3lW1qqr+geaoyMHjnrfZ/VxVxd1HOL5Vd3dUnFKdVfVtmkPpr2kn9X4ej99CLefQfO7HttP7udofuHJL29VoMtS1IFTVpTSdjY6rqv+g6Y1+YZIraILtXlW1kqYV9JUk3wLe3j79TTR3jvsqsH6Km/4Czfl6aM7tf6Ld5qU0vbxvBN5Mc3emy5Nc2Y5vye4090y/jOY8+usnWOZU4AtpO8r1al/nvYAf1sS3cjyPpoV5WZKxzlTLab6A9B563we4eYLnvxdY1L7O02nuzvXrCZabrtcCX6T5EramZ/pLgce0HdquAl48hXW+i+Z0zRXAJ4HnVdPz/g+TrGz39d40X2x6XUrzWr+VpoPieKcDz+HuQ+/TrfO9wBOS7Am8Ffi7JF8ft8x/Ag9rO8UdQ9OfY4e2899Kms/xmMfTfEmRNuJd2qRJJPks8Jqq+u581zJdSZbRfAn5zZ5pnwBeUVXeI3sBaXvKnwc8ZuyoizTGUJcmkeRBwP2r6oL5rmU62o5WLwGeXVVfm+96NDML/fOowTLUJUnqCM+pS5LUEYa6JEkdYahLktQRhrokSR1hqEuS1BGGuiRJHfH/AV/p5q3OVAPEAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 576x432 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "precision_recall_curves(lr_mymodel_balanced, X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When recall is close to 1.0, precision is about 80%, this is similar in the logistic regression without balanced class in our first model. When recall is roughly 0.65, precision is about 0.89 as shown in balanced logistic regression.\n",
    "\n",
    "**As recall reduces, precision can increase. But since there is 'missing good investment opportunity' vs 'being safer', to which point we shoud choose a model that can give best investment return, this is where my investmet portfolio function can help.** "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.3  Try Random Forest Classifier with class_weight as balanced\n",
    "\n",
    "Random forests are able to work with nonlinear data, and learn complex conditionals. Logistic regressions are only able to work with linear data. Training a random forest algorithm may enable us to get more accuracy due to columns that correlate nonlinearly with loan_status."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Confusion matrix: \n",
      " [[  9346  49977]\n",
      " [ 14450 233378]]\n",
      "The True Positive Rate (Recall/Sensitivity)is tpr = tp/(tp+fn):  0.9416934325419243\n",
      "The False Positive Rate (1-specification)is fpr = fp/(fp+tn):  0.8424557085784603\n",
      "The Precision (tp/(tp+fp)) is:  0.8236240758059677\n",
      "\n",
      "\n",
      "The investment return with this model is:  12.72%\n",
      "The investment return with this model for the worse case is:  -4.91%\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# create model, with balanced \n",
    "rf_balanced = RandomForestClassifier(class_weight=\"balanced\", random_state=1).fit(X_train, y_train)\n",
    "# use my function\n",
    "my_metric(rf_balanced, X_test, y_test)\n",
    "investment_return_with_my_model(rf_balanced, X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Analysis the result**\n",
    "\n",
    "Unfortunately, using a random forest classifier didn't improve our false positive rate. The model is likely weighting too heavily on the 1 class, and still mostly predicting 1s. Compared with Logistic Regression Balanced class, the false positive rate with Random Forest is still very high."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "**My strategy**\n",
    "\n",
    "Choose logistic regression and try to lower the false positive rate further by **assigning a harsher penalty for misclassifying the negative class**. While setting class_weight to balanced will automatically set a penalty based on the number of 1s and 0s in the column, we can also set a manual penalty. (The default penalty scikit-learn imposed for misclassifying a 0 would have been around 4, since there are 4 times as many 1s as 0s)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.4 Logistic Regression with class_weight with different penalty in a for loop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Confusion matrix: \n",
      " [[ 48188  11135]\n",
      " [132224 115604]]\n",
      "The True Positive Rate (Recall/Sensitivity)is tpr = tp/(tp+fn):  0.4664686798908921\n",
      "The False Positive Rate (1-specification)is fpr = fp/(fp+tn):  0.18770122886570134\n",
      "The Precision (tp/(tp+fp)) is:  0.912142276647283\n",
      "\n",
      "\n",
      "The investment return with this model is:  9.42%\n",
      "The investment return with this model for the worse case is:  0.63%\n",
      "\n",
      "\n",
      "Confusion matrix: \n",
      " [[ 50930   8393]\n",
      " [148812  99016]]\n",
      "The True Positive Rate (Recall/Sensitivity)is tpr = tp/(tp+fn):  0.399535161482964\n",
      "The False Positive Rate (1-specification)is fpr = fp/(fp+tn):  0.1414796959020953\n",
      "The Precision (tp/(tp+fp)) is:  0.9218594344980402\n",
      "\n",
      "\n",
      "The investment return with this model is:  8.93%\n",
      "The investment return with this model for the worse case is:  1.12%\n",
      "\n",
      "\n",
      "Confusion matrix: \n",
      " [[ 53789   5534]\n",
      " [171675  76153]]\n",
      "The True Positive Rate (Recall/Sensitivity)is tpr = tp/(tp+fn):  0.3072816630889165\n",
      "The False Positive Rate (1-specification)is fpr = fp/(fp+tn):  0.09328590934376212\n",
      "The Precision (tp/(tp+fp)) is:  0.932253602164359\n",
      "\n",
      "\n",
      "The investment return with this model is:  8.61%\n",
      "The investment return with this model for the worse case is:  1.84%\n",
      "\n",
      "\n",
      "Confusion matrix: \n",
      " [[ 54774   4549]\n",
      " [180205  67623]]\n",
      "The True Positive Rate (Recall/Sensitivity)is tpr = tp/(tp+fn):  0.2728626305340801\n",
      "The False Positive Rate (1-specification)is fpr = fp/(fp+tn):  0.07668189403772567\n",
      "The Precision (tp/(tp+fp)) is:  0.9369700160727151\n",
      "\n",
      "\n",
      "The investment return with this model is:  8.30%\n",
      "The investment return with this model for the worse case is:  2.00%\n",
      "\n",
      "\n",
      "Confusion matrix: \n",
      " [[ 56074   3249]\n",
      " [193410  54418]]\n",
      "The True Positive Rate (Recall/Sensitivity)is tpr = tp/(tp+fn):  0.21957970850751327\n",
      "The False Positive Rate (1-specification)is fpr = fp/(fp+tn):  0.05476796520742377\n",
      "The Precision (tp/(tp+fp)) is:  0.9436592852064439\n",
      "\n",
      "\n",
      "The investment return with this model is:  8.04%\n",
      "The investment return with this model for the worse case is:  2.41%\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# set model parameter with different penalty\n",
    "for i in [6, 7, 8, 9, 10]:   \n",
    "    # creat our model\n",
    "    penalty = {0: i, 1: 1} \n",
    "    lrf_mymodel_penalty = LogisticRegression(class_weight=penalty).fit(X_train, y_train)\n",
    "    # use my function\n",
    "    my_metric(lrf_mymodel_penalty, X_test, y_test)  \n",
    "    investment_return_with_my_model(lrf_mymodel_penalty, X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='#summary'></a>\n",
    "# Summary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Different model returns different results, which one to choose depends on the investor. As explained by the precision recall curve, there is a trade off between 'missng good investment opportunity' vs 'being safer'. For this project, we **consider from a more conservative perspective and help investor reduce risk and still get decent return**. **Remember our goal for is: Choose a model that has lower than 19.57% charged off rate, and have expected return higher than 7.1% (i.e. average interest for only A loans), and for the worst case, still have expected return higher than 1.15%**.\n",
    "\n",
    "Keeping this in mind, **I would recommend the Logistic Regression model with class_weight set as 10, because in this model:**\n",
    "\n",
    "- **False posistive rate is: 5.4% (In plain English, this is the Charged-off rate, similar as default rate, the percentage of the loans invested that not pay you back). This is much lower than 19% loss if the investor randomly choose loans.**\n",
    "- **Investing in this portfolio will have expected return is the range of of 8.04%, and the expected return for the worse case is: 2.41%.**\n",
    "\n",
    "This gives 1% higher return and TWICE safer than if the investor only randomly choose A loans. Well, 1% higher maybe not too significant, but we all know the bank gives us interest rate like 0.1%; If there is a way we can invest more safely and get higher return, it is something worth doing. And if you invest a lot, 1% increase can give you a lot more interest. \n",
    "\n",
    "**If you are willing to take higher risk**, I would recomend the the Logistic Regression model with class_weight set as 6, and **your expected return will be in the range of 0.63% to 9.42%**. Not too bad, at least for the worst case, you still keep all of your principle safe with 0.63% of return."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Other things to consider:\n",
    "\n",
    "1. I did not use other popular advanced model like SVM, Gridient Boosting and Neural Network for this data. First reason is: with Logistic Regression, I already got good results, even better than Random forest, and Logistic Regression is easier to interpret and quicker to interpret. Second reason is: I tried with other model, it never runs even if I took only a sample, the running time is very slow. \n",
    "2. I think Artificial Financial Advisor can be a trend for the future, it doesn't mean to replace finance manager, but another options. I plan to do more project about data science application on finance. Thank you for reading."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}