{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Predicting Uncertainty:\n", "\n", "\n", "**Prediction Intervals from Gradient Boosting Quantile Regression**\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![image](https://i.imgur.com/3Tilb52.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "> What is the difference between actuaries and fortune tellers?\n", ">\n", "> Answer: Fortune tellers give you point estimates, but actuaries give you *prediction intervals*. Both try to predict the future.\n", "\n", "Models continue to become more accurate, but for certain applications, the uncertainty of an estimate is as important as the prediction itself. By using prediction intervals, which create a range of values instead of a single number, this uncertainty can be measured using quantile regression. In insurance and finance where model risk translates to financial gain or loss, predictive analytics practitioners can reduce losses and increase profits by hedging against this risk. \n", "\n", "Using gradient boosting, a widely-used machine learning method, two use cases are demonstrated on open-source and reproducible data:\n", "\n", "1. Add value to actuarial predictive models by creating prediction intervals;\n", "2. Audit data quality in a fully-scalable manner using outlier detection." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "---\n", "\n", "Predictive analytics is the science of using statistics to predict the future. This processes often involves identifying a business problem, collecting relevant data, and then making predictions that help to improve or solve the problem. As advances in the field of machine learning produce better algorithms, the relative size of errors continues to decrease. But even the best models are random to some degree. \n", "\n", "In real-world applications, a prediction about a future event will never completely match reality. 
Even with the most robust validation, we can never know for certain how well a given model will perform on unseen data. Data collected outside of the laboratory is often sparse for certain segments of the population. In health care, models are built based on thousands of diagnosis codes, and each patient may only have a few dozen codes in common with other patients. Credit unions and banks create models from purchase histories which contain varying line item keywords. Websites use clickstream data which contains millions of user inputs. In all of these examples, the data quality can vary dramatically from case to case. Intuitively, predictions based on low-quality data will be less reliable than those based on richer data.\n", "\n", "For financial applications where the prediction is a dollar amount, model risk translates into financial risk. In [Modern Portfolio Theory (MPT)](https://www.investopedia.com/terms/m/modernportfoliotheory.asp), investment risk is measured by the standard deviation of returns: the tendency of the asset's return to fall above or below its average. Portfolio managers need to know the risk level as well as the average return in order to maintain profitability.\n", "\n", "In insurance, actuaries are the original scientists of risk. Traditionally, actuaries use parametric models which allow for not only the average to be estimated, but also other [risk measures](https://www.casact.org/library/studynotes/hardy4.pdf). The Value at Risk (VaR) is the name given to the $p$th quantile of the loss distribution. Usually this is the 90th or 99th percentile. Two entities with the same average but different VaR are valued very differently.\n", "\n", "The Actuarial Standards of Practice (ASOPs) describe the procedures an actuary should follow when performing actuarial services. ASOP No. 
23 speaks to this issue of data quality and how the uncertainty behind results should be acknowledged.\n", "\n", ">Appropriate data that are accurate and complete may not be available. The actuary should use available data that, in the actuary’s professional judgment, allow the actuary to perform the desired analysis. *However, if significant data limitations are known to the actuary, the actuary should disclose those limitations and their implications in accordance with section 4.1(b).* \n", "> - [ASOP No. 23 § 3.1 (emphasis added)](http://www.actuarialstandardsboard.org/asops/data-quality/).\n", "\n", "Much research has been done on machine learning models which produce more accurate averages, but less research focuses on measuring the uncertainty of those predictions. One of the most widely-applicable algorithms, the gradient boosting machine (GBM), has a method of estimating prediction error using quantile regression. This allows for predictions of not only the average, but also the range or uncertainty of the prediction. This helps to solve the problem of model risk: if an observation is predicted based on sparse data, then the prediction interval for this record will be wide; if the prediction is based on strong evidence, or in actuarial terminology has \"higher credibility\", then the prediction interval will be narrower.\n", "\n", "### Quantile Regression\n", "\n", "Most models estimate averages, but quantile regression can estimate the median, 75th percentile, 90th percentile, or any other quantile. Quantile regression has been around since the 1800s for linear models but is not very well known within the machine learning community. 
The reasons for this are twofold: there has only been [limited research](https://journals.sagepub.com/doi/abs/10.1177/1471082X18759142?journalCode=smja) on practical applications outside of academia, and machine learning competitions, such as those on the popular website [kaggle.com](https://www.kaggle.com/competitions), have largely focused on model accuracy. \n", "\n", "In [supervised learning](https://en.wikipedia.org/wiki/Supervised_learning), the goal is to model the target conditioned on data. The process of creating a model from data is called *training*. During training, many models are built and compared, but only the best is kept. One of the choices that a modeler makes when fitting a model is what defines a \"best\" model. This is controlled by the loss function. When the value of the loss function is high, the model is considered \"worse\". When the loss is low, the model is considered \"better\". \n", "\n", "**Theorem 1: Using Mean Squared Error as a loss function results in predicting the mean.**\n", "\n", "This is the default setting in most machine learning software. The proof can be shown rather easily. This mathematical convenience is another reason why MSE is so popular. Before modern computers, models were limited to loss functions which could be optimized analytically.\n", "\n", "**Theorem 2: Any quantile, such as the median, can be predicted by modifying the loss function.**\n", "\n", "The math behind this is more challenging, but the payoff is greater because predicting quantiles tells more about the distribution than merely the average. \n", "\n", "***Theorem 1 Proof:***\n", "\n", "Let $Y$ be the target value, $\\hat{Y}$ the predicted value, and $n$ the number of observations. The Mean Squared Error (MSE) is the mean of the squares of the residuals. 
To optimize the model, find the minimum value of the loss function with respect to $u$.\n", "\n", "$$\\text{Loss} = \\text{MSE} = \\frac{1}{n}\\sum_{1}^{n} (Y_i - u)^2$$\n", "\n", "Take the derivative and set the result equal to zero.\n", "\n", "$$\\frac{d}{du} \\frac{1}{n}\\sum_{1}^{n} (Y_i - u)^2 = \\frac{-2}{n}\\sum_{1}^{n} (Y_i - u) = 0$$\n", "\n", "Because the right hand side is 0, we can remove $\\frac{-2}{n}$. Solve for $u$ to find that the minimum value is at the average.\n", "\n", "$$\\sum_{1}^{n} (Y_i - u) = 0 \\Rightarrow u = \\frac{1}{n}\\sum_{1}^{n} Y_i \\space \\blacksquare$$\n", "\n", "***Theorem 2 Proof:***\n", "\n", "To make inferences about the distribution of $Y$, we first need to define the distribution function $F_Y(y) = Pr(Y \\leq y)$. For $\\tau \\in(0,1)$, the $\\tau$th quantile of $Y$ is the number $y_{\\tau}$ such that $F(y_{\\tau}) = \\tau$.\n", "\n", "Define the *loss function* as $\\rho_{\\tau}(y)=y(\\tau-\\mathbb{I}_{(y<0)})$, where $\\mathbb{I}$ is an indicator function. A specific quantile can be found by minimizing the expected loss of $Y-u$ with respect to $u$.\n", "$$\\text{Loss} = E(\\rho_{\\tau}(Y-u)) = (\\tau-1)\\int_{-\\infty}^{u}(y-u)dF_{Y}(y)+\\tau\\int_{u}^{\\infty}(y-u)dF_{Y}(y)$$\n", "\n", "Just as before, we take the derivative with respect to $u$ and set it equal to zero.\n", "\n", "$$0=(1-\\tau)\\int_{-\\infty}^{u}dF_{Y}(y)-\\tau\\int_{u}^{\\infty}dF_{Y}(y)$$\n", "\n", "This equation reduces to\n", "$0=F_{Y}(u)-\\tau$\n", "and then to\n", "$F_{Y}(u)=\\tau$. \n", "\n", "Hence, $u$ is the $\\tau$th quantile of the random variable $Y$. Because the median is just the 50th percentile, setting $\\tau = 0.5$ results in estimating the median. This implies that any quantile can be predicted by setting the corresponding $\\tau$ in the loss function. $\\space \\blacksquare$\n", "\n", "**Intuition:**\n", "\n", "A residual is the error between the prediction and target value, $y_i - \\hat{y_i}$. 
For MSE and MAE, positive and negative residuals are penalized equally. This is because both loss functions are *even*. An even function returns the same output for a negative or positive input. \n", "\n", "$$\\text{MSE}(Y - \\hat{Y}) = \\text{MSE}(\\hat{Y} - Y) = \\frac{1}{n}\\sum_{1}^{n} (Y_i - \\hat{Y_i})^2$$\n", "\n", "$$\\text{MAE}(Y - \\hat{Y}) = \\text{MAE}(\\hat{Y} - Y) = \\frac{1}{n}\\sum_{1}^{n} |Y_i - \\hat{Y_i}|$$\n", "\n", "Intuitively, this means that the average (or median) residual should be zero. For quantile loss, positive residuals are penalized differently than negative residuals, which causes the model to aim *above* the center (for quantiles above 0.50) and *below* the center (for quantiles below 0.50).\n", "\n", "Say that $Y$ is uniform on the integers 1,..., 10. Then $dF_Y(y)$ just turns into $\\frac{1}{10}$ since all $Y_i$'s have the same probability, and the integrals turn into summations.\n", "\n", "Then the quantile loss is\n", "\n", "$$\\frac{1}{10}\\left[(\\tau - 1)\\sum_{y_i < u}(y_i - u) + \\tau \\sum_{y_i > u}(y_i - u)\\right]$$\n", "\n", "Ignoring the $\\frac{1}{10}$, this is just\n", "\n", "$$\\tau \\sum\\text{(Positive Residuals)} + (\\tau - 1) \\sum\\text{(Negative Residuals)}$$\n", "\n", "When $\\tau = 0.5$ the positive and negative residuals are weighted equally. This is just the MAE, scaled by one half.\n", "\n", "Say that $\\tau = 0.10$ and that the model's \"guess\" for the 10th percentile is too low. Then most of the residuals will be positive because the guess is too low, and each positive residual is only multiplied by $0.10$.\n", "\n", "$$0.1 \\sum\\text{(Positive Residuals)} + 0$$\n", "\n", "This will result in a small value, or a low loss penalty. This tells the model that a low guess is in the right direction.\n", "\n", "Conversely, say that the model's \"guess\" for the 10th percentile is too high. Then the residuals will often be negative, each multiplied by $(\\tau - 1) = -0.9$, and so the loss will be *high*. 
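This intuition can be checked numerically. The sketch below uses NumPy on the uniform 1,..., 10 example; the `quantile_loss` helper is written here for illustration and is not a library function.

```python
import numpy as np

def quantile_loss(y, u, tau):
    """Average quantile (pinball) loss of a guess u for the tau-th quantile."""
    residuals = y - u
    # Positive residuals are weighted by tau; negative residuals by (tau - 1).
    return np.mean(np.where(residuals > 0, tau * residuals, (tau - 1) * residuals))

y = np.arange(1, 11)  # Y is uniform on the integers 1..10

# For tau = 0.10, a guess that is too low is penalized far less than one too high.
low = quantile_loss(y, 2, tau=0.10)   # guess below the 10th percentile -> 0.45
high = quantile_loss(y, 9, tau=0.10)  # guess far above it -> 3.25
print(low, high)
```

The low guess incurs roughly one seventh of the penalty of the high guess, so the optimizer is pushed downward toward the 10th percentile.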
This tells the model that a high guess is in the wrong direction." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Gradient Boosting\n", "\n", "Gradient Boosting is a technique which forms an ensemble of weak learners to create a single prediction. These weak learners are often decision trees, which means that they can apply to either regression or classification problems. Each tree is built to correct the error of the previous tree. This stage-wise approach allows for hundreds or thousands of decision trees to be ensembled together. A detailed description of the GBM algorithm can be found in chapter 8.2 of [*An Introduction to Statistical Learning*](http://faculty.marshall.usc.edu/gareth-james/ISL/).\n", "\n", "GBMs are an exceptionally versatile method, but they do not apply to all predictive modeling applications. Consideration should be given to the advantages and disadvantages of the modeling method to determine if it will provide the best solution to the predictive analytics problem.\n", "\n", "**Advantages:**\n", "- High prediction accuracy;\n", "- Has been empirically shown to succeed on a wide class of problems;\n", "- Performs variable selection;\n", "- Handles non-linearities, interaction effects, is resilient to outliers, and accommodates missing values;\n", "- Deals with class imbalance directly by weighting observations.\n", "\n", "**Disadvantages:**\n", "- Requires a large sample size;\n", "- Can take longer to train than other methods;\n", "- Does not detect linear combinations of features;\n", "- Can overfit if not tuned correctly.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Application 1: Adding Prediction Intervals to Risk Scores\n", "\n", "-----------\n", "A common use for predictive models is in creating risk scores. A risk score is a value on a scale such as 1 to 10 (or some other upper bound) which is intended to measure a person's tendency to be expensive or not. 
In health insurance, this is the cost of the person relative to the average. A risk score of 1.5 means that on average, the person costs 50% more than the average person in their population. \n", "\n", "In the table below, three members are represented for the sake of example. Member B appears inexpensive at times, when their score is 1.1, but expensive at others, when their score is 10.1. Members A and C have relatively lower variance in their scores. For an insurance practitioner only looking at a single number, at a single point in time, based on a single data sample, all three risk scores appear to have the same precision, but in truth this is not the case.\n", "\n", "![Risk Scores](https://media.giphy.com/media/ME38o1vgzyCIqFY0J5/giphy.gif \"Risk Score Example\")\n", "\n", "The risk score is a random variable. If the data were reshuffled and the model refit, the risk score would change. A common means of interpreting the average is in terms of frequency. For example, if there were 100 members with the same characteristics as A, their total health care cost added together and divided by 100 would be about 1.5 times the overall average of A, B, C, and all other members.\n", "\n", "Using quantile regression, GBMs can produce a range of values rather than a single score.\n", "\n", "\n", "### Data\n", "\n", "The focus of this paper is on the methodology and reproducibility rather than the result itself. The data consists of 1,338 records of patient health care costs. This data is [publicly available](https://www.kaggle.com/mirichoi0218/insurance) for download and was also cleaned and republished as a companion to the textbook *Machine Learning with R* by Brett Lantz. There are seven columns: six relate the patient's demographics to the seventh, their annual health care cost. 
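The interval construction described above can be sketched with scikit-learn's `GradientBoostingRegressor`, which supports a quantile loss. This is a minimal illustration on synthetic data rather than the insurance dataset; the single feature, noise level, and hyperparameters are placeholders chosen for the sketch.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(18, 64, size=(500, 1))           # synthetic "age" feature
y = 200 * X.ravel() + rng.normal(0, 2000, 500)   # noisy synthetic charges

# Fit one model per quantile: lower bound, median, and upper bound.
models = {}
for alpha in (0.05, 0.50, 0.95):
    gbm = GradientBoostingRegressor(loss="quantile", alpha=alpha,
                                    n_estimators=200, max_depth=2,
                                    random_state=0)
    models[alpha] = gbm.fit(X, y)

x_new = np.array([[40.0]])
lo, mid, hi = (models[a].predict(x_new)[0] for a in (0.05, 0.50, 0.95))
print(f"90% prediction interval: [{lo:.0f}, {hi:.0f}], median: {mid:.0f}")
```

Fitting a separate model per quantile is the standard pattern with this estimator; the 0.05 and 0.95 models together give a 90% prediction interval around the 0.50 (median) prediction.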
\n", "\n", "**age**: age of primary beneficiary\n", "\n", "**sex:** insurance contractor gender, female, male\n", "\n", "**bmi:** Body mass index, providing an understanding of body, weights that are relatively high or low relative to height, objective index of body weight (kg / m ^ 2) using the ratio of height to weight, ideally 18.5 to 24.9\n", "\n", "**children:** Number of children covered by health insurance / Number of dependents\n", "\n", "**smoker:** Smoking\n", "\n", "**region:** the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.\n", "\n", "**charges:** Annual health insurance claims\n", "\n" ] }, { "cell_type": "code", "execution_count": 124, "metadata": { "code_folding": [ 15 ] }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import sklearn as sk\n", "from sklearn import preprocessing, model_selection, ensemble, linear_model\n", "import matplotlib.pyplot as plt\n", "import matplotlib.style\n", "import matplotlib as mpl\n", "import seaborn as sns\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "%matplotlib inline\n", "from matplotlib.pylab import rcParams\n", "rcParams['figure.figsize'] = 12, 4\n", "\n", "data = pd.read_csv(\"insurance.csv\")" ] }, { "cell_type": "code", "execution_count": 125, "metadata": { "hideCode": true }, "outputs": [], "source": [ "def style_table(table):\n", " return(table.style.set_table_styles(\n", " [{'selector': 'tr:nth-of-type(odd)',\n", " 'props': [('background', '#eee')]}, \n", " {'selector': 'tr:nth-of-type(even)',\n", " 'props': [('background', 'white')]},\n", " {'selector': 'th',\n", " 'props': [('background', '#606060'), \n", " ('color', 'white'),\n", " ('font-family', 'verdana')]},\n", " {'selector': 'td',\n", " 'props': [('font-family', 'verdana')]},\n", " {'selector': 'th',\n", " 'props':[('max-width', '100px')]}\n", " ]\n", " ).hide_index())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The top five records 
are shown below." ] }, { "cell_type": "code", "execution_count": 126, "metadata": { "hideCode": false }, "outputs": [ { "data": { "text/html": [ " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesexbmichildrensmokerregioncharges
19female27.90yessouthwest16884.9
18male33.771nosoutheast1725.55
28male333nosoutheast4449.46
33male22.7050nonorthwest21984.5
32male28.880nonorthwest3866.86