Seminar Applied Predictive Analytics - Topic 2. Interpretability

Group Members:

Submitted on July 22, 2022.

Table of Contents

1. [Import Python Packages](#Python_Packages)
2. [Introduction](#Introduction)
3. [Literature Review](#Literature_Review)
   3.1 [The Importance of Interpretability](#The_Importance_of_Interpretability)
   3.2 [Dimensions of Interpretability](#Dimensions_of_Interpretability)
   3.3 [A Definition of Interpretability](#Definition_of_Interpretability)
   3.4 [Methods to Achieve Interpretability](#Methods_Interpretability)
   3.5 [Literature Table](#LitTab)
4. [Research Question](#ResearchQuestion)
5. [Hypotheses Development](#Hypotheses)
6. [Data](#Data)
   6.2 [Load Data](#LoadData)
   6.3 [Summary Statistics of Numerical Features](#summary_num)
   6.4 [Summary Statistics of Categorical Features](#summary_cat)
7. [Methodology](#Methodology)
   7.1 [Logistic Regression](#logit)
   7.2 [XGBoost](#XGB)
   7.3 [Local, Model-Agnostic Post-Hoc Methods](#LMAPH)
8. [Data Handling and Preprocessing](#Preprocessing)
   8.1 [Dealing with Missings](#Missing)
   8.2 [Adapting Data Formats](#Formatting)
   8.3 [Handling Category Mismatches](#Mismatches)
   8.4 [Dummy Encoding](#Dummies)
   8.5 [Summary](#Summary1)
   8.6 [Dataset Split and Scaling](#Split)
9. [Results](#Results)
   10.1 [Stage II](#StageII)
   10.2 [Stage III](#StageIII)
   10.3 [Summary](#Summary2)
11. [A Case for Cost-Sensitive Learning](#CostAnalysis)
12. [Limitations and Future Research](#Limitations)
13. [References](#References)
14. [Appendix](#Appendix)

Import Python Packages

Introduction

The widespread adoption of computational algorithms across industries over the recent decade reflects a transformation of business processes towards a more computation-driven approach (Visani et al. 2022). Simple methods, most prominently Linear Regression and Generalized Linear Models (William H. Greene 2003), have been complemented, with the advent of powerful computing tools, by more sophisticated machine learning techniques. In particular, machine learning models can perform intelligent tasks usually carried out by humans. Combined with the vast availability of data sources and increased computational power, machine learning techniques have reduced the time required to achieve more accurate results. Despite these advantages, machine learning models display weaknesses especially when it comes to interpretability, i.e., “the ability to explain or to present the results, in understandable terms, to a human” (Hall and Gill 2019). This issue is mainly caused by the large model structures and the huge number of iterations involved in machine learning algorithms, combined with potentially many mathematical operations, which hide the logic underlying these models and make them hard to grasp for humans. Consequently, a substantial number of techniques have been proposed in the recent literature to make these increasingly complex models more understandable for humans (Visani et al. 2022). Given the importance of interpretability with a local focus, i.e., explaining a single prediction, this paper investigates various aspects of interpretability and the challenges involved in implementing the corresponding methods within the credit risk industry. In particular, three different techniques are employed: LIME (Local Interpretable Model-agnostic Explanations), SHAP (Shapley Additive exPlanations), and the less common approach Anchor.

Literature Review

The Importance of Interpretability

Article 8 of the European Union's (EU) Charter of Fundamental Rights stipulates the right to the "Protection of personal data" (European Union 2012). Building on this fundamental right, the EU passed Regulation (EU) 2016/679, better known as the General Data Protection Regulation (GDPR), in 2016. The regulation, which has been legally binding since 2018, aims to "strengthen individuals' fundamental rights in the digital age”. At the same time, according to the regulation, model performance should not be the sole criterion for the suitability of a machine learning algorithm. (European Commission 2022) Recital 71 expands on this point and specifically states that the "data subject [author's note: respective individual] should have the right not to be subject to a decision, which may include a measure, evaluating personal aspects relating to him or her which is based solely on automated processing and which produces legal effects concerning him or her or similarly significantly affects him or her, such as automatic refusal of an online credit application or e-recruiting practices without any human intervention. [...] In any case, such processing should be subject to suitable safeguards, which should include specific information to the data subject and the right to obtain human intervention, to express his or her point of view, to obtain an explanation of the decision reached after such assessment and to challenge the decision." (European Parliament and the Council 2016) Article 22 establishes the right "not to be subject to a decision based solely on automated processing [...] [and] at least the right to obtain human intervention on the part of the controller, to express his or her point of view and to contest the decision." (European Parliament and the Council 2016) In addition, Article 14 requires that the individual be provided with "meaningful information about the logic involved, as well as the significance" in order to "ensure fair and transparent processing" of personal data. (European Parliament and the Council 2016) This contradicts the approach of many common supervised machine learning algorithms, which rely purely on statistical associations rather than causalities or explanatory rules to produce out-of-sample predictions (Goodman and Flaxman 2017).

Similarly, the Consumer Financial Protection Bureau (2022) clarified that the rights set out in the Equal Credit Opportunity Act (Regulation B) also apply to decisions based on “complex algorithms”. The Equal Credit Opportunity Act (Regulation B) prohibits adverse actions, such as the denial of a credit application, on discriminatory grounds and grants credit applicants against whom an adverse action has been taken the right to obtain a statement rationalizing this adverse decision. (Bureau of Consumer Financial Protection 2011) The Consumer Financial Protection Bureau (2022) unequivocally states that the law does "not permit creditors to use complex algorithms when doing so means they cannot provide the specific and accurate reasons for adverse actions.” (Consumer Financial Protection Bureau 2022) The development of interpretable algorithms is therefore not only desirable but a legal necessity.

Interpretability in Credit Risk

It is evident that a bank's objective is to maximize its franchise value. The minimization of its credit risk plays a key role in this regard. The bank will hence try to reduce credit defaults as far as possible by screening credit applicants. Banks therefore benefit from models that predict the probability of default of a credit applicant based on observable factors. For this reason, machine learning approaches are in principle highly interesting for banks. At the same time, model performance is not the sole criterion for the suitability of a machine learning algorithm for use in the banking sector, since the bank must also satisfy its stakeholders. The most important stakeholders in this context are a) the bank's credit applicants and b) the supervisory authority that regulates banks.

Image 1: A Bank's Primary Stakeholders; own illustration.

The considerations on the customer side have already been addressed in the section on legal implications: Credit applicants have a right to know the reasons why their credit application was (not) granted. At the same time, the topic is gaining relevance from a regulatory standpoint: For example, the Deutsche Bundesbank and BaFin (2021), the two central supervisory authorities of German banks, warned in a consultation paper of increased model risk due to the black-box nature of many machine learning techniques and called for a focus on interpretability. (Deutsche Bundesbank and BaFin 2021) Responses to the consultation paper indicated that respondents in the banking and insurance industry believe that the main benefit of interpretable models is to facilitate model selection by allowing the validity of the model to be verified by a human with domain knowledge. At the same time, the consultation highlights that the implementation of interpretable models is considered cost-intensive and therefore only worth undertaking if the effort is counterbalanced by a significant increase in performance. (Deutsche Bundesbank and BaFin 2022)

Dimensions of Interpretability

Lipton (2016) clarifies that no universal definition of the term "interpretability" exists. Rather, machine learners pursue different underlying goals when searching for interpretable algorithms. His taxonomy identifies five dimensions that users wish to address through the application of interpretable models (see Image 2):

  1. Trust
  2. Causality
  3. Transferability
  4. Informativeness
  5. Fairness

Trust is partially covered by model performance: The higher a model’s predictive accuracy, the fewer false predictions it makes and the more a user can rely on the model’s predictions. However, there is more to trust than just accuracy. First, there is a psychological component: Humans may tend to have more faith in a model they are able to grasp intellectually. Second, an algorithm may be perceived as more trustworthy if it makes wrong judgements only in cases in which a human decision maker would have misjudged as well, i.e., cases in which human intervention would also have failed to yield an improvement. (Lipton 2016) The more severe the consequences of a model decision, the more important trust becomes. (Ribeiro et al. 2016)

Another component of interpretability is the ability to establish causal relationships between the features and the target. In standard supervised learning techniques this is problematic, as these models capture statistical associations rather than causal effects. Thus, inducing causal relationships in a machine learning context is a research branch in its own right.

In addition, one would ideally want to train a machine learning model that can be applied to new data without significant performance loss. Data scientists usually try to achieve this by partitioning a given dataset into a training and a test dataset. Within this context, a model is deemed transferable if it is able to predict the test data given what it has learned about the data’s underlying structure through studying the training dataset. In contrast to this rather narrow definition of transferability, a human decision maker typically performs better at generalizing previously obtained knowledge and applying it to entirely new contexts. (Lipton 2016) Here, a similar reasoning applies as in the case of the dimension "Trust": A model appears interpretable to humans if it is evident how the model applies known principles to new situations and if it fails to do so only in scenarios in which a human would fall short as well.

The fourth dimension is the desire for a model to be as informative as possible to benefit decision makers to the maximum degree possible.

Lastly, ethical considerations constitute grounds for the use of interpretable models. Users wish to ensure that model-based decisions are made in a manner that the user, supervisor or society deems ethical. (Lipton 2016)

Image 2: The Five Dimensions of Interpretability; own illustration based on Lipton (2016).

A Definition of Interpretability

Given the research question of this paper, the question arises which dimensions of Lipton’s (2016) interpretability taxonomy are of particular relevance in the area of credit risk and how the term should be defined in our given context. Considering the main stakeholders of a bank, we find the dimensions “Trust” and “Informativeness” to be of particular relevance. Trust in the credit risk context is strongly related to credit risk reduction but goes beyond it. In addition to good predictive power (and thus a low credit default probability), there must also be a sense of security among stakeholders that the model derives its predictions from meaningful relationships. In addition, models must be as informative as possible in order to plausibly communicate decisions to stakeholders. To illustrate this point, imagine a situation in which a credit applicant has been classified as not creditworthy on the basis of a machine learning algorithm. Naturally, the credit applicant will expect a plausible explanation for this. An interpretable machine learning technique should provide credit officers, who have to advocate model decisions in front of credit applicants, with all the necessary information. Our definition of "interpretability" is thus strongly geared to the needs of the end user. For the purposes of this paper, we adopt the definition of “interpretability” by Ribeiro et al. (2016), who classify an explanation as interpretable if it generates a “qualitative understanding [of the relationship] between the input variables and the response”. They aim at a model that can support human decision makers by making predictions and providing rationales for them. The human, who acts as the final authority in the decision-making process, then evaluates the model's predictions and justifications, and uses his or her domain knowledge to decide whether the decision is conclusive. (Ribeiro et al. 2016)

Methods to Achieve Interpretability

Having outlined the importance of interpretable machine learning models in general and in credit risk in particular, one must ask in which way interpretability can be achieved. Image 3 gives a high-level overview. Most fundamentally, machine learning models can be divided into intrinsically interpretable models (so-called white-box models) and black-box models.

Intrinsically interpretable models are, as the name suggests, interpretable in themselves and thus do not require further adjustments to deliver interpretable explanations. For a model to be intrinsically interpretable, it therefore needs to be sufficiently simple for a human to be able to grasp it. For example, linear regression, logistic regression and a shallow decision tree can be considered intrinsically interpretable models.

A model that is not inherently interpretable requires so-called post-hoc methods to render its results explainable ex post. These post-hoc methods fall into one of two categories: They are either model-specific or model-agnostic. As the name suggests, model-specific interpretability methods are tailored towards specific model classes. Model-agnostic methods, on the other hand, are suited for any class of model. A further distinction can be made with regard to the scope of the interpretation offered by the post-hoc method: Does the explanation cover the entirety of the model’s decision behavior and thus explain according to which principles the model arrives at its conclusions? Or does the explanation concern a single observation and thus explain the conclusion drawn by the model for a particular case? The latter of these options is called “local interpretability” while the former is referred to as “global interpretability”. (Molnar 2022)

Image 3: Overview of Interpretability Methods; own illustration.

Literature Table

Research Question

This paper aims to answer the question whether it is possible to develop a model framework applicable within credit risk that achieves a high predictive performance as measured by the AUC whilst maintaining or even improving interpretability. A related question hence is whether there is a trade-off between performance (AUC) and interpretability.

Hypotheses Development

Previous literature has highlighted the trade-off between simple, intrinsically interpretable models with limited accuracy and less transparent black-box models which exhibit a high predictive power. (Bussmann et al. 2021; Emad Azhar Ali et al. 2021) We expect that such a trade-off between interpretability and performance is also visible in our dataset when comparing a white-box model to a black-box model, as formulated in Hypotheses I and II.

I) The black-box model outperforms the white-box model (measured by the AUC).

II) The white-box model is better explainable than the black-box model.

We suspect that the trade-off formulated in hypotheses I and II can be relaxed by employing post-hoc methods, leading us to hypothesis III:

III) Through the usage of post-hoc methods, it is possible to achieve a higher performance than the white-box model (measured by the AUC) while at the same time attaining a higher degree of interpretability than the black-box model.

Data

The analysis is performed using a credit risk dataset obtained from Freddie Mac (2021). Freddie Mac is a U.S. mortgage finance company that was created by the United States Congress in 1970 to “ensure a reliable and affordable supply of mortgage funds throughout the country”. (Federal Housing Finance Agency 2022) Instead of lending to individuals directly, its business consists of purchasing loans on the secondary housing market, bundling them and selling them in the form of Mortgage-Backed Securities (MBS). (Freddie Mac 2022a) It separates its business operations into three major divisions: the Single-Family Division, the Multifamily Division and the Capital Markets Division. The dataset used in this paper is the so-called Single Family Loan-Level Dataset (Origination Data File). As the name suggests, it covers loans purchased by Freddie Mac’s Single-Family Division. (Freddie Mac 2021)

The dataset includes loans originated between 1999 and the Origination Cutoff Date. (Freddie Mac 2022c) The loans chosen for inclusion were selected at random from Freddie Mac’s loan portfolio, with every loan having the same probability of inclusion. (Freddie Mac 2022b) Approximately 150,000 observations with a first payment date from 2013 to 2016 are used as training data to build the prediction models, while the test data consists of approximately 50,000 observations from 2017 to 2019.

In total, the dataset includes 31 features. These are described in the following table.

The target feature, i.e. whether a loan defaulted or not, is not given in the original dataset. In constructing target values for the training dataset, the definition of default of the Basel Committee on Banking Supervision (2004) was used:

“A default is considered to have occurred with regard to a particular obligor when […] [t]he obligor is past due more than 90 days on any material credit obligation to the banking group.[…] Overdrafts will be considered as being past due once the customer has breached an advised limit or been advised of a limit smaller than current outstandings.“

Using this definition, the overall default ratio (defaulted loans over non-defaulted loans) is 0.009103, i.e. approximately 0.9%. In the following, summary statistics are presented for the numerical features. In addition to the mean and standard deviation, the mean difference (mean of non-defaulted loans minus mean of defaulted loans) is calculated. Features for which the mean difference is negative (positive) are highlighted in red (green).

Categorical features, on the other hand, are treated one by one. For each category of a feature, the absolute and relative frequencies in the test and training dataset are reported. Then, for the training dataset, the proportion of defaulted and non-defaulted loans within each category is reported (for the test dataset, this is not possible, as no target values are available). Finally, the ratio of defaulted to non-defaulted loans within each category is calculated and its difference to the overall default ratio is reported. Negative differences are highlighted in green, positive differences in red. In this way, the reader is given a quick overview of the data and can identify potential correlations between feature characteristics and the target.
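The following sketch illustrates, under simplifying assumptions, how such summary statistics could be computed with pandas. The DataFrame name `train`, the target column `default` and the lists `numerical_features` and `categorical_features` are hypothetical placeholders for the objects created in the load-data step.

```python
import pandas as pd

# Hypothetical objects: 'train' is the training DataFrame with a binary 'default' column,
# 'numerical_features' and 'categorical_features' are lists of column names.
counts = train["default"].value_counts()
overall_ratio = counts[1] / counts[0]          # defaulted over non-defaulted, ~0.0091

# Numerical features: mean per class and mean difference (non-defaulted minus defaulted)
num_summary = train.groupby("default")[numerical_features].mean().T
num_summary.columns = ["mean_non_default", "mean_default"]
num_summary["mean_difference"] = num_summary["mean_non_default"] - num_summary["mean_default"]

# Categorical features: default ratio per category and its difference to the overall ratio
for col in categorical_features:
    grouped = train.groupby(col)["default"].agg(["sum", "count"])
    ratio = grouped["sum"] / (grouped["count"] - grouped["sum"])   # defaulted / non-defaulted
    print((ratio - overall_ratio).sort_values(ascending=False).head())
```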

As an example, consider the feature ‘servicer_name’. Here, for ‘servicer_name == SPECIALIZED LOAN SERVICING LLC’, the default ratio is equal to approximately 15.2% and thus 16.7 times higher than the average default ratio. This is one example of the interesting prima facie evidence that becomes visible through this preliminary analysis.

For further information regarding the data collection and the characteristics of the dataset, the interested reader is referred to the data provider Freddie Mac (2021).

Load Data

Summary Statistics of Numerical Features

Summary Statistics of Categorical Features

Methodology

Image 4: Illustration of Multi-Stage Model Design; own illustration.

To answer our research question and test our hypotheses, we develop a multi-stage model as illustrated in Image 4. In the initial modelling stage, the data is cleaned and features are engineered. In Stage II, a white-box and a black-box model are compared along their predictive performance (as measured by the AUC) and their interpretability to test Hypotheses I and II. As our baseline model, we choose Logistic Regression, which is seen as inherently interpretable thanks to its simplicity (Bussmann et al. 2019; Dumitrescu et al. 2022; Li et al. 2020). It is frequently employed as a benchmark both in the machine learning research community (Emad Azhar Ali et al. 2021; Altinbas and Akkaya 2017) and in financial institutions to evaluate credit risk in a way that is easy to communicate to regulators (Dumitrescu et al. 2022; Lipton 2016; Liu et al. 2022, p. 5326), as pioneered by Steenackers and Goovaerts (1989). This simplicity comes with shortcomings, such as strong assumptions (a linear relationship between the features and the log odds of the target) and biased results in the presence of outliers. Machine learning algorithms are seen as a potential remedy. (Emad Azhar Ali et al. 2021; Xia et al. 2020) Therefore, we choose extreme gradient boosting (XGBoost) as our competing model, which we suspect to have a higher predictive power but lower inherent interpretability (as formulated in Hypotheses I and II).

To enhance the interpretability of the XGBoost machine, we employ three different model-agnostic, post-hoc interpretability methods in Stage III:

  1. Anchor
  2. LIME
  3. SHAP

The following section gives a brief overview of the theoretical background of the methods employed.

Logistic Regression

Logistic Regression is one of the most popular models for the classification of binary target variables. As in any model within the class of binary response models, logit ensures that the conditional probability $P_i$ that $y_i=1$ given the information set $\Omega_i$ lies within the 0-1 interval. In general, this is achieved by specifying $P_i \equiv E(y_i | \Omega_i) = F(X_i\beta)$, where $X_i \beta$ serves as an index function, mapping a feature vector $X_i$ and the parameter vector $\beta$ to a scalar index. $F(\cdot)$ is a transformation function that ensures that its input values $X_i \beta$ are mapped into the interval between 0 and 1. In the case of logit, this transformation function is the logistic function $\Lambda (x) = \frac{1}{1+e^{-x}}$. The conditional probability of $y_i=1$ is then given by $P_i = \Lambda(X_i \beta) = \frac{\exp(X_i \beta)}{1+ \exp(X_i \beta)} = \frac{1}{1+\exp(-X_i \beta)}$. Usually, logit models (as well as other forms of binary response models) are estimated using maximum likelihood estimation. (Davidson and MacKinnon 2004)

As the question of default versus non-default of a given loan constitutes a binary classification problem, it is well suited for the application of logit. The downside of logit, however, is that, contrary to linear regression, it is not possible to directly interpret the coefficients $\beta$ as marginal effects. Instead, one may interpret the change in the log odds (“How much do the log odds of the target change with a one-unit change in a feature?”). The odds, or more technically the odds of success, are defined as the probability of success over the probability of failure. In the credit risk application, this translates to the probability of defaulting over the probability of not defaulting. The log odds are simply the logarithm of the odds. For standardized data, the intuitive interpretation of single coefficients is no longer given. One may therefore question the interpretability of this type of model outcome, as it requires statistical knowledge regarding odds ratios. Nevertheless, as mentioned above, logit is frequently used in machine learning and in credit risk applications. Therefore, it constitutes a natural benchmark against which alternative modelling approaches can be evaluated.
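To make the transformation and the odds interpretation concrete, the following minimal NumPy sketch applies the logistic function to arbitrary index values and converts a log-odds coefficient into an odds ratio; all names and numbers are purely illustrative.

```python
import numpy as np

def logistic(index):
    """Logistic transformation Lambda(x) = 1 / (1 + exp(-x)); maps any real index into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-index))

# Probabilities implied by three illustrative index values X_i @ beta
print(logistic(np.array([-3.0, 0.0, 1.5])))   # -> approx. [0.047, 0.5, 0.818]

# Interpreting a coefficient: a one-unit increase in feature k changes the log odds
# by beta_k, i.e. it multiplies the odds P / (1 - P) by exp(beta_k).
beta_k = 0.4
print(np.exp(beta_k))                          # -> approx. 1.49, a ~49% increase in the odds
```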

XGBoost

Extreme gradient boosting (XGBoost) belongs to the class of ensemble methods. These methods are characterized by the fact that they build powerful predictive models by using base learners as their foundation. In terms of application, the gradient boosting algorithm can be applied to both regression and classification problems. With regard to the inner flow structure, the base learners used are combined by means of boosting. This process is controlled by a learning rate set by the user. In the further explanation of the algorithm, reference is made to the application to a classification problem, as this is what is addressed within the project. The starting point of the algorithm is an initial prediction. This initial prediction is calculated as the log odds of the positive class over all individuals of the target variable. Due to the classification setting, this value must then be converted into a probability, which is achieved by means of the logistic function. This conversion represents the first difference compared to the application to regression problems.

After a matching format has been achieved, pseudo-residuals are computed as the difference between the observed and predicted values. These residuals form the target values of the base learner to be fitted. This base learner determines the corresponding output values on the basis of the feature values of the different individuals. Similar to the previous step, these values must also be converted into probabilities. This step represents the second difference compared to the application to a regression problem.

After ensuring consistent formats, the initial predictions can be updated. This is achieved by combining the initial predictions with the predictions of the first base learner: the predicted value of the base learner is scaled by the previously determined learning rate and added to the initial prediction. This process is then reinitiated with the updated prediction values and repeated until the specified number of base learners is reached, which leads to the termination of the algorithm. (Friedman 2001)
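The stripped-down sketch below follows the steps described above (initial log odds, conversion via the logistic function, pseudo-residuals, learning-rate update) for a binary target. It is a toy illustration of gradient boosting rather than the actual XGBoost implementation, which additionally uses second-order gradient information, regularization and further refinements.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(raw_score):
    return 1.0 / (1.0 + np.exp(-raw_score))

def toy_gradient_boosting(X, y, n_learners=100, learning_rate=0.1, max_depth=3):
    """Toy boosting loop for binary classification; X is a 2D array, y a 0/1 array."""
    # Initial prediction: log odds of the positive class over all observations
    p0 = y.mean()
    raw_score = np.full(len(y), np.log(p0 / (1.0 - p0)))
    trees = []
    for _ in range(n_learners):
        prob = sigmoid(raw_score)                        # convert log odds to probabilities
        residuals = y - prob                             # pseudo-residuals
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        raw_score += learning_rate * tree.predict(X)     # scaled update of the prediction
        trees.append(tree)
    return trees, sigmoid(raw_score)                     # final in-sample probabilities
```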

Inspired by Baesens et al. (2003), we also experimented with neural networks. However, we were unable to achieve a higher AUC than the one obtained with XGBoost (maximum AUC using neural networks: 0.65). We rationalize this by noting that applying neural networks to tabular data remains challenging (Borisov et al. 2021). This is in line with the results of Lundberg et al. (2020), who state that “[w]hile deep learning models are more appropriate in fields like image recognition, speech recognition, and natural language processing, tree-based models consistently outperform standard deep models on tabular-style datasets where features are individually meaningful and do not have strong multi-scale temporal or spatial structures”.

Lundberg et al. (2020) add that the low bias of tree-based models increases their inherent interpretability. They challenge the view that linear models are generally more interpretable, as this viewpoint neglects the potential model mismatch that arises when a linear model oversimplifies the true feature-target relationship. (Lundberg et al. 2020) These considerations support our choice of XGBoost for the research question at hand.

Local, Model-Agnostic Post-Hoc Methods

ANCHOR

Among the model-agnostic methods, the Anchor method belongs to the post-hoc methods and can de facto be attached to any black-box model in order to make its predictions interpretable. For a given instance, a rule is constructed over the features passed to the model that fixes a subset of feature values such that the model's prediction remains (almost) unchanged when the remaining feature values are changed. The resulting rules have an IF-THEN form and, because they generalize, can be reused on other instances of the dataset, which leads to the notion of coverage.

The anchor, as the object of this method, can be defined formally as follows: $𝔼_{\mathcal{D}_x(z|A)}[1_{\hat{f}(x)=\hat{f}(z)}]\ge\tau, \quad A(x) = 1$, where $x$ is the instance to be explained, $A$ is the rule (anchor) with $A(x)=1$ indicating that it applies to $x$, $\hat{f}$ is the black-box model, $\mathcal{D}_x(z|A)$ is the distribution of perturbed neighbors $z$ of $x$ to which $A$ applies, and $\tau$ is the desired precision level.

Informally, the previous rule and its notation can be understood as follows: Given a data point $x$ to be explained, a rule $A$ is to be found that applies to $x$ and is simultaneously applicable to a set of neighbors, for which the precision must be at least $\tau$. The evaluation of the rule's accuracy is carried out over the neighborhood distribution $\mathcal{D}_x(z|A)$, whereby the agreement of the learning model's outputs is measured by the indicator $1_{\hat{f}(x)=\hat{f}(z)}$. While this definition is formally complete, efficiency problems arise in practical applications with larger datasets, because evaluating the dataset over $\mathcal{D}_x(z|A)$ exactly would be computationally too costly. For this reason, Ribeiro et al. (2018) introduced a further parameter, which is integrated into their probabilistic definition in the following form: $P(\text{prec}(A) \ge \tau) \ge 1 - \delta$. By introducing this parameter and defining the precision as $\text{prec}(A) = 𝔼_{\mathcal{D}_x(z|A)}[1_{\hat{f}(x)=\hat{f}(z)}]$, statistical confidence with respect to the precision can be achieved, which means the defined anchors satisfy the precision constraint with high probability.

Under this definition, it is possible that more than one anchor satisfying the described constraint is identified. To resolve this, the notion of coverage is used, which is defined as $cov(A)=𝔼_{\mathcal{D}(z)}[A(z)]$. The final goal of the anchor search is then to maximize the coverage: among the pool of identified anchors, the one that maximizes the coverage value is chosen. (Molnar 2022; Ribeiro et al. 2018)
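As a rough illustration, the following sketch shows how an anchor explanation could be obtained with the alibi library (the implementation referenced in Stage III). The object names (`model`, `X_train`, `X_test`, `feature_names`) are placeholders for this project's pipeline, and attribute names of the returned explanation may differ between alibi versions.

```python
import numpy as np
from alibi.explainers import AnchorTabular

# 'model' is assumed to be a fitted binary classifier; X_train/X_test are numeric feature matrices.
predict_fn = lambda x: model.predict(x)

explainer = AnchorTabular(predict_fn, feature_names=feature_names)
explainer.fit(np.asarray(X_train), disc_perc=(25, 50, 75))   # discretize numerical features

explanation = explainer.explain(np.asarray(X_test)[0], threshold=0.95)  # tau = 0.95
print("Anchor:   ", explanation.anchor)      # list of IF-THEN predicates
print("Precision:", explanation.precision)
print("Coverage: ", explanation.coverage)
```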

LIME

Ribeiro et al. (2016) introduce Local Interpretable Model-agnostic Explanations (LIME) to explain any black-box machine learning model by a "locally faithful" interpretable model. The algorithm therefore belongs to the group of model-agnostic methods. Within the implementation, an interpretable representation of the data is used, regardless of the actual features of the original model. Considering the ultimate goal of explaining the model in a way that is understandable to humans, the fidelity-interpretability trade-off emerges as a challenge: the desired interpretability method should provide understandable explanations while using a surrogate model that is adequately faithful to the black-box model in its predictions. To this end, LIME uses an objective function that incorporates both local fidelity and interpretability without any assumption about the black-box model. Local fidelity is measured by a loss function, while interpretability is controlled by a complexity measure $Ω(g)$ in the following objective function: $$ξ(X_i) = argmin_{g\in G}\mathcal{L} \left( f, g, \pi_{X_i} \right) + Ω(g)$$ where the locally weighted square loss is used as the loss function $$ \mathcal{L} \left( f, g, \pi_{X_i} \right) = \sum_{j=1}^{J} \pi_{X_i}\left( Z_j \right) \left( f\left( Z_j \right)- g \left( Z_j \right) \right)^2 $$ with $ \pi_{X_i}\left( Z_j \right)=\exp \left( - \frac{D\left(X_i,Z_j\right)}{\sigma^2}\right)$ as an exponential kernel defined on a distance function $D\left(X_i,Z_j\right)$ (e.g. cosine distance for text, $L2$ distance for images) with width $σ$. In principle, minimizing the loss function while keeping $Ω(g)$ low enough ensures an interpretable yet locally faithful model.

Image 5 illustrates the intuition of LIME for local model training using tabular data. In this example, a classification prediction by a Random Forest is explained using a linear classifier.

Image 5: The intuition of LIME in local model training using tabular data (Molnar 2022)

A) Random forest predictions given features x1 and x2. Predicted classes: 1 (dark) or 0 (light).

B) Instance of interest (big dot) and data sampled from a normal distribution (small dots).

C) Assign higher weight to points near the instance of interest.

D) Signs of the grid show the classifications of the locally learned model from the weighted samples. The white line marks the decision boundary (P(class=1) = 0.5).

Accordingly, LIME explains the instance of interest in a 5-stage process:

  1. Perturbing the dataset: New points are produced, generated from a multivariate distribution of the features in the dataset. The features are considered to follow a normal distribution with parameters inferred from the dataset. For the purpose of data generation, features are considered to be independent.
  2. Predicting the target for these new samples using the black-box model: A new dataset is achieved with the respective predictions.
  3. Weighting the new samples according to their proximity to the instance of interest: To ensure locality, a Gaussian kernel is used based on the distance between the instance of interest and its neighbors.
  4. Training a weighted, interpretable model on the perturbed dataset: Prior to model fitting, a feature selection step is usually performed with the LASSO technique, followed by a rescaling step. Subsequently, a Ridge Regression, i.e., a Linear Regression combined with a penalty on the L2 norm of the coefficients (E. Hoerl, Robert W. Kennard 1970), is used to predict the target while avoiding potential overfitting issues.
  5. Explaining the prediction by interpreting the local model: Lastly, the results are interpreted in a human-understandable manner. (Molnar 2022; Visani et al. 2022)
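The five steps above are what the `lime` package performs internally; a minimal sketch of how it could be called for this application is shown below, with `model`, `X_train` and `X_test` as hypothetical placeholders for the fitted XGBoost classifier and the preprocessed feature matrices.

```python
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    training_data=np.asarray(X_train),
    feature_names=list(X_train.columns),
    class_names=["non-default", "default"],
    mode="classification",
    discretize_continuous=True,
)

# Explain one observation of interest with the 10 most important features
exp = explainer.explain_instance(
    data_row=np.asarray(X_test.iloc[0]),
    predict_fn=model.predict_proba,
    num_features=10,
)
print(exp.as_list())   # (feature condition, local weight) pairs of the surrogate model
```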

Stability of LIME

Regardless of its advantages, LIME may suffer from a lack of stability, i.e. repeated applications of the method under the same conditions may yield different results. Instability is particularly an issue when it comes to interpretability, as it makes the interpretations less reliable, which, consequently, diminishes the trust in the method. The issue is, however, frequently overlooked.

The origin of LIME's stability problem lies in the sensitivity of the method to the dimensionality of the dataset. With a huge number of variables, the weighting kernel used in LIME renders the local explanation unable to discriminate between relevant and irrelevant features, which negatively affects the data point weighting, namely step 3 of the explanation process. During the process, the kernel function is applied before variable selection, which makes it unable to distinguish between near and distant points, considering all of them to be at approximately the same distance. This leads to a loss of locality and consequently a declining performance of the algorithm. Therefore, recognizing the issue, as well as having a tool for spotting it, becomes an important task when employing LIME. To this end, Visani et al. (2022) introduce two complementary measures, the Variable Stability Index (VSI) and the Coefficient Stability Index (CSI): "By construction, VSI measures the concordance of the variables retrieved, whereas CSI tests the similarity among coefficients for the same variable, in repeated LIME calls. Both of them range from 0 to 100". While both of these indices point to stability, each value checks a particular stability aspect. A high VSI assures the practitioner that the variables in the LIME explanations are almost always the same, while a low VSI value shows that the results are prone to change over repeated calls for the same decision, making the explanation unreliable. A high CSI, on the other hand, guarantees trustworthy LIME coefficients, while low values alert to the risk that the magnitude, and consequently the sign, of a feature's contribution changes between calls. Considering that a LIME coefficient represents the impact of the feature on the particular machine learning decision, obtaining different values leads to completely different explanations. It should be mentioned that while high values for both indices ensure stability, a high value for only one of the two does not. More importantly, having these two measures can better guide the practitioner in finding the source of instability. (Molnar 2022)
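A very simple, hand-rolled check in the spirit of these indices (but not the actual VSI/CSI computation of Visani et al. 2022) is to call LIME repeatedly for the same observation and record how often each variable is retrieved. The sketch below reuses the hypothetical `explainer`, `model` and `X_test` objects from the previous snippet.

```python
import numpy as np
from collections import Counter

def naive_variable_stability(explainer, data_row, predict_fn, n_calls=20, num_features=10):
    """Share of repeated LIME calls in which each feature is retrieved (1.0 = always)."""
    counts = Counter()
    for _ in range(n_calls):
        exp = explainer.explain_instance(data_row, predict_fn, num_features=num_features)
        counts.update(feature for feature, _ in exp.as_list())
    return {feature: n / n_calls for feature, n in counts.items()}

# Features with shares well below 1.0 hint at the kind of instability discussed above
print(naive_variable_stability(explainer, np.asarray(X_test.iloc[0]), model.predict_proba))
```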

SHAP

SHAP (Shapley Additive exPlanations), introduced by Lundberg and Lee (2017), is a game-theoretic approach to explain the output of any machine learning model and therefore also belongs to the model-agnostic methods. The goal is to explain the prediction for an observation $X_i$ by computing the contribution of each feature to the prediction.

SHAP is based on Shapley values, a cooperative game theory method, introduced by Shapley (1953). The basic concept of Shapley values is to assign payouts to players depending on their contribution to the total payout. Players cooperate in a coalition and receive a certain profit from this cooperation. Transferring this concept to machine learning algorithms, the “game” is the prediction task for a single observation of the dataset. The “gain” is the actual prediction made for this observation minus the average prediction over all instances. The “players” represent the specific feature values of the observation that cooperate to receive the “gain” or to predict a certain value. Hence the Shapley values represent a fair distribution of the prediction among the features.

Young (1985) demonstrated that Shapley values are the only set of values that satisfy three desired properties for local explanation models:

  1. Local accuracy (known as ‘efficiency’ in game theory) states that when approximating the original model $f$ for a specific input $x$, the explanation’s attribution values should sum up to the output $f(x)$.
  2. Consistency (known as ‘monotonicity’ in game theory) states that if a model changes so that some feature’s contribution increases or stays the same regardless of the other inputs, that input’s attribution should not decrease.
  3. Missingness (known as ‘null effects’ and ‘symmetry’ in game theory) states that features missing in the original input should have no impact.

Lundberg and Lee (2017) specified the Shapley values as an additive feature attribution method and defined SHAP as $f(z) = g(z’)=\phi_0 + \sum_{i=1}^{M}\phi_i z_i’$, where $f(\cdot)$ is the original model, $g(\cdot)$ is the explanation model, $z’ \in \{0,1\}^M$ is the coalition vector, $M$ is the maximum coalition size and $\phi_i \in \mathbb{R}$ is the feature attribution (Shapley value) for feature $i$. Image 6 illustrates how the expected model prediction $E\left[ f(z) \right]$, i.e., the prediction that would be made if no features were known, is moved towards the current output $f(x)$ by conditioning on one feature after another, with $\phi_i$ denoting each feature's contribution, for one specific feature ordering. When using non-linear models or non-independent features, the order in which the features are added to the explanation model matters. The SHAP values are therefore calculated by averaging $\phi_i$ over all possible orderings.

Image 6: Attribution of SHAP value to prediction (Lundberg and Lee 2017)

SHAP values represent a local interpretation of the feature importance. When using local interpretability models, stability of and trust in single explanations is not always given. To improve the trustworthiness and the interpretability of the model, Lundberg et al. (2020) introduce an extension of SHAP values, the TreeExplainer. TreeExplainer is an explanation method specifically for tree-based models, which enables the tractable computation of optimal local explanations. The authors call it “fast local explanations with guaranteed consistency”. To compute the impact of a specific feature subset during the Shapley value calculation, TreeExplainer uses interventional expectations over a user-supplied background dataset (feature perturbation = “interventional”). Alternatively, it can avoid the need for a user-supplied background dataset by relying only on the path coverage information stored in the model (feature perturbation = “tree_path_dependent”). Lundberg et al. (2020) show that by efficiently and exactly computing the Shapley values, and therefore guaranteeing that explanations are always consistent and locally accurate, the results improve over previous local explanation methods in several ways.

TreeExplainer also extends local explanations to measuring interaction effects, thereby providing a richer type of local explanation. These interaction values use the ‘Shapley interaction index’ from game theory to capture local interaction effects. They are computed by generalizing the original Shapley value properties and allocating credit not just among each individual player of a game but also among all pairs of players. By additionally considering interaction effects for individual model predictions, TreeExplainer can uncover important patterns that might otherwise be missed.

Another key extension is the possibility to calculate Shapley values not only for one specific observation but for the whole dataset. By combining many local explanations, a global structure is constructed while retaining local faithfulness to the original model, yielding a rich summary of both the entire model and individual features.
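As a hedged sketch of how these pieces fit together with the `shap` package, the snippet below instantiates a TreeExplainer for a fitted tree-based model with both feature-perturbation options and computes attributions and interaction values; `model`, `X_train` and `X_test` are placeholders for this project's pipeline.

```python
import shap

# Interventional expectations over a background dataset (here: the training data)
explainer = shap.TreeExplainer(model, data=X_train, feature_perturbation="interventional")
shap_values = explainer.shap_values(X_test)        # one row of attributions per observation
base_value = explainer.expected_value              # E[f(z)] over the background data

# Alternative that needs no background data: rely on the coverage stored in the tree paths
path_explainer = shap.TreeExplainer(model, feature_perturbation="tree_path_dependent")

# Interaction values (Shapley interaction index) for a subset of observations
interactions = path_explainer.shap_interaction_values(X_test.iloc[:100])
```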

Data Handling and Preprocessing

As described in the Data Section, the Single Family Loan-Level Dataset contains 31 features, 14 of which are numerical. Two features are dates, and the remaining 15 features are categorical.

As a first step, missing values are imputed or dropped (see Section “Dealing with Missing Data”). Next, the data formats of variables are inspected and adjusted where appropriate.

These steps are first conducted for the training dataset. Afterwards, the same steps are taken for the test dataset for consistency reasons. In addition, for categorical features, the possibility of a category mismatch, i.e. that the test data includes categories unseen in the training dataset, is taken into account. Next, categorical features in both test and training data are dummy encoded and the training data is split into features and target values.

Lastly, the numerical features of both datasets are standardized.

Dealing with Missings

In the original version of the Single Family Loan-Level dataset, missing values are encoded using numerical codes. For example, for the feature ‘ltv’ (Original Loan-to-Value), “not available” is encoded as 999. Evidently, this is suboptimal when performing analyses, as any algorithm would interpret this as a regular numerical value equal to 999 instead of as missing. Therefore, encoding missing values as system missings constitutes the initial step of our data preprocessing. For categorical features, a separate category for missing values is added. After having implemented these changes, it becomes apparent that merely five numerical features contain missing values, three of which only contain a negligible number (<= 10) of missings. Observations with missings in one or more of these three variables (‘fico’, ‘cltv’ and ‘ltv’) are dropped. The remaining two numerical features with missing values, ‘dti’ and ‘cd_msa’, on the other hand, contain a significant number of missing values (26,153 and 20,971, respectively). With respect to ‘cd_msa’, an identifier for the Metropolitan Area a creditor lives in, we choose to disregard this variable, as its meaning practically equals that of the Zip code, which has no missing values. For ‘dti’, on the other hand, an observation may be missing for one of two reasons: Either its ‘dti’ ratio exceeds 65%, or it belongs to the HARP dataset. HARP is short for Home Affordable Refinance Program and “was created by the Federal Housing Finance Agency specifically to help homeowners who are current on their mortgage payments, but who are underwater on their mortgages. That is, they owe almost as much or more than the current value of their homes.” For a mortgage to be eligible for the HARP relief program, it had to be owned by Fannie Mae or Freddie Mac (Federal Housing Finance Agency 2013), who had been placed in federal conservatorship during the financial crisis (Lockhart 9/7/2008). Upon further inspection of the dataset, it becomes clear that merely 5 individuals for whom ‘dti’ is missing are not part of the HARP dataset and thus have a ‘dti’ ratio exceeding 65%. The vast majority of observations with a missing ‘dti’ ratio hence received emergency relief via the HARP program. As a result, we deem it unreasonable to impute the median ‘dti’ value for those missing values. Unfortunately, it is also impossible to predict those missing values in a sensible way, as the ‘dti’ ratio is missing for the entire subpopulation of HARP program receivers by design. Consequently, we decided to impute missing ‘dti’ values with the cut-off value of 65%, as we deem this the most prudent decision given the facts illustrated above.
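A condensed pandas sketch of these steps is given below; the sentinel codes other than the 999 mentioned for 'ltv' are assumptions that would need to be checked against the Freddie Mac file layout, and `train` and `categorical_features` are placeholders.

```python
import numpy as np

# Sentinel codes standing in for "not available"; only 999 for 'ltv' is stated in the text,
# the remaining codes are assumptions to be verified against the data documentation.
sentinel_codes = {"ltv": 999, "cltv": 999, "dti": 999, "fico": 9999, "mi_pct": 999}
for col, code in sentinel_codes.items():
    train[col] = train[col].replace(code, np.nan)

# Separate 'Missing' category for categorical features
for col in categorical_features:
    train[col] = train[col].fillna("Missing")

# Drop the few observations with missing 'fico', 'cltv' or 'ltv'; impute 'dti' at the cut-off
train = train.dropna(subset=["fico", "cltv", "ltv"])
train["dti"] = train["dti"].fillna(65)
# 'cd_msa' is disregarded (dropped together with 'zipcode' in the next subsection)
```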

Training Data

dti = Original Debt-to-Income (DTI) Ratio

Test Data

dti = Original Debt-to-Income (DTI) Ratio

Adapting Data Formats

After having dealt with missing values, data formats are adapted where needed. As an initial step, the feature ‘id’ is used to index the data. The variables 'mi_pct', 'cltv', 'dti', 'ltv' and 'int_rt' are expressed in percentage terms and hence converted into float. The date variables 'dt_first_pi' and 'dt_matr' are converted into pandas datetime variables, and categorical variables are encoded as such. The same applies to Boolean features.

Feature ‘pre_relief_prog’ originally had the following format: PYYQnXXXXXXX, where ‘P’ is the type of product, ‘YY’ is the year, ‘Qn’ the quarter and the final digits are randomly assigned. We choose to extract the product type only and encode it as a categorical feature.

We drop features which only take on one unique value, as they do not add information while increasing dimensionality. In addition, ‘ltv’ is dropped due to its high correlation with ‘cltv’, which is plausible given that ‘cltv’ is the Original Combined Loan-to-Value and ‘ltv’ the Original Loan-to-Value.

Finally, we drop the feature ‘zipcode' in addition to ‘cd_msa'. While there are no missing values in ‘zipcode’, it is very granular, such that each individual value of ‘zipcode’ is only populated by a small number of observations. In the interest of reduced dimensionality, we thus only use the less granular geographical indicator of state (‘st’). In addition, we aggregate states containing fewer than 1,500 observations (< ~1% of observations) into a new category 'Other'. We follow the same approach for the variables 'seller_name' (new category ‘Other sellers’) and ‘servicer_name’ (new category ‘Other servicers’).
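A sketch of these transformations in pandas follows; the date format, the rare-category threshold for the seller and servicer names, and some column-handling details are assumptions rather than documented facts.

```python
import pandas as pd

train = train.set_index("id")

# Percentage-based features as floats
for col in ["mi_pct", "cltv", "dti", "ltv", "int_rt"]:
    train[col] = train[col].astype(float)

# Dates as datetimes (assuming the raw files encode them as YYYYMM)
for col in ["dt_first_pi", "dt_matr"]:
    train[col] = pd.to_datetime(train[col], format="%Y%m")

# 'pre_relief_prog' has the form PYYQnXXXXXXX; keep only the product type P
train["pre_relief_prog"] = train["pre_relief_prog"].astype(str).str[0]

# Drop constant features, 'ltv', 'zipcode' and 'cd_msa'
constant_cols = [c for c in train.columns if train[c].nunique() == 1]
train = train.drop(columns=constant_cols + ["ltv", "zipcode", "cd_msa"])

# Aggregate rare categories into 'Other ...' buckets (1,500 threshold assumed for all three)
for col, other in [("st", "Other"), ("seller_name", "Other sellers"), ("servicer_name", "Other servicers")]:
    counts = train[col].value_counts()
    rare = counts[counts < 1500].index
    train[col] = train[col].where(~train[col].isin(rare), other)
```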

Training Data

Test Data

Handling Category Mismatches

To ensure that the algorithm is capable of handling unknown categories, we compare the test and training data and encode any new categories as “Other”. In our opinion, this is the most reasonable modelling choice, as the algorithm is unable to interpret new categories in a sensible way if it has not been trained on them.

Afterwards, we dummy encode all categorical variables.
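The sketch below shows one way to implement both steps with pandas; `categorical_features` and the target column name 'default' are placeholders, and in the full pipeline the target column is kept out of the alignment.

```python
import pandas as pd

# Map categories unseen during training to 'Other'
for col in categorical_features:
    known = set(train[col].unique())
    test[col] = test[col].where(test[col].isin(known), "Other")

# Dummy encoding; align the test columns to the training columns so both datasets
# share exactly the same feature space (dummies unseen in the test data are filled with 0)
train_enc = pd.get_dummies(train, columns=categorical_features)
test_enc = pd.get_dummies(test, columns=categorical_features)
test_enc = test_enc.reindex(columns=train_enc.columns.drop("default"), fill_value=0)
```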

Dummy Encoding of Categorical Features

Summary

Dataset Split and Scaling

As the final preprocessing steps, the data is split into a target- and a feature-dataset and the numerical features of both datasets are standardized. An important observation is that the data is – as one would expect and as has been mentioned previously – highly imbalanced: Only approximately 0.9% of observations default on their loan.
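A minimal sketch of this final step, assuming the encoded frames from the previous snippets and a hypothetical list `numerical_features` of the remaining numerical columns:

```python
from sklearn.preprocessing import StandardScaler

# Split the training data into features and target (hypothetical column name 'default')
y_train = train_enc["default"]
X_train = train_enc.drop(columns=["default"])
X_test = test_enc.copy()

# Standardize the numerical features; the scaler is fit on the training data only
scaler = StandardScaler()
X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])
X_test[numerical_features] = scaler.transform(X_test[numerical_features])

print(f"Share of defaults in the training data: {y_train.mean():.3%}")   # roughly 0.9%
```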

Results

Stage II

Logistic Regression

To build our base model, we fit a logistic regression from the package sklearn (David Cournapeau 2022) with ridge (L2) regularization and a maximum of 500 iterations on the scaled training data. Afterwards, we predict the probability of defaulting for the test data observations. On Kaggle, the simple logit achieves an AUC value of 0.71312 on the public leaderboard, which represents the baseline to beat with the XGBoost model. To interpret logit, we can look at the coefficient output of the model. The values shown below represent the average impact of a change in the features on the log odds. Only the sign of the coefficients can be interpreted directly. For example, the negative coefficient for the feature ‘fico’ means that, for a ceteris paribus (c.p.) higher value of ‘fico’, the probability of defaulting decreases, which is consistent with our findings in the Data Analysis part.

To evaluate the degree of the feature influence, we look at the example of the dummy variable ‘servicer_name_Other_servicers’. If this dummy goes from 0 to 1, the change in the log odds is equal to 2.648731. To interpret this, we calculate the corresponding odds ratio as $\exp(2.648731) = 14.13609$. This means that having this feature attribution ($‘\text{servicer_name_Other_servicers}’ = 1$), on average and c.p., multiplies the odds of defaulting by 14.13609 compared to observations without this feature attribution ($’\text{servicer_name_Other_servicers}’ = 0$).
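A condensed sketch of this stage (model fit, test predictions and the coefficient-to-odds-ratio conversion) is shown below; `X_train`, `y_train` and `X_test` are the hypothetical preprocessed objects from the preceding snippets, and the dummy column name is the one produced by the dummy encoding above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Ridge (L2) regularized logit with up to 500 iterations
logit = LogisticRegression(penalty="l2", max_iter=500)
logit.fit(X_train, y_train)

# Predicted default probabilities for the test observations (column 1 = class 'default')
pred_default_logit = logit.predict_proba(X_test)[:, 1]

# Coefficients are log-odds changes; exp() turns them into odds ratios
coefficients = pd.Series(logit.coef_[0], index=X_train.columns)
odds_ratios = np.exp(coefficients)
print(odds_ratios["servicer_name_Other_servicers"])   # ~14.14 for the coefficient above
```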

XGBoost

As a more advanced technique, we choose XGBoost (xgboost developers 2021). Within the practical application of our project, six hyperparameters were specified in addition to the number of base learners and the learning rate. In order to achieve the maximum possible predictive power, a grid search was used to find the best possible combination of hyperparameter values. The concrete values can be seen in the code. Compared to the previously used logit model, the AUC value increased to 0.73865 on Kaggle. The significance of this increase is discussed in the Summary. In addition to the performance gain, however, a decrease in interpretability could be observed due to the complexity of the model, which is addressed and potentially resolved in the next section. (Friedman 2001)
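A sketch of such a tuning setup is given below; the parameter grid shown is purely illustrative and does not reproduce the values actually used (those are documented in the code/appendix).

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Illustrative grid only; the hyperparameters and values actually tuned differ
param_grid = {
    "n_estimators": [200, 500],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
    "subsample": [0.8, 1.0],
}

xgb = XGBClassifier(objective="binary:logistic", eval_metric="auc")
search = GridSearchCV(xgb, param_grid, scoring="roc_auc", cv=3, n_jobs=-1)
search.fit(X_train, y_train)

model = search.best_estimator_                       # tuned black-box model for Stage III
pred_default_xgb = model.predict_proba(X_test)[:, 1]
```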

Stage III

ANCHOR

Within the practical application, the Anchor method (Robert Samoilescu 2018) delivered unsatisfying results. It was noticed that the method could not provide an explanation for observations with a predicted probability lower than 0.5. For observations with a predicted value higher than 0.5, the method could provide an explanation, but the coverage value was below zero and the anchors contained, at best, between 10 and 20 relevant features. Occasionally, identified anchors contained the entire feature set. The poor quality of the results can be traced back to the high dimensionality of the feature set and the extremely imbalanced ratio between non-defaulting and defaulting observations (99.1 : 0.9). This is consistent with the probabilistic parameter given above in the formal anchor definition.

LIME

As the Anchor method did not provide useful interpretations for our model, we apply LIME instead. In order to investigate potential stability issues in our application, the LIME version of Visani et al. (2022) is used, as this implementation includes a function that computes the previously described stability indices, VSI and CSI.

Below is the LIME result for our XGBoost model using the 10 most important features. LIME predicts a probability of default of 76% for this specific applicant, compared to a predicted value of 75.6% by XGBoost. The fact that LIME and the original model yield almost the same probability confirms that the loss minimization in LIME worked well. The results show that the highest contribution towards a higher probability of default comes from Wells Fargo Bank being the seller of the mortgage (5%), followed by the relatively low fico score, as a measure of the applicant's creditworthiness, with a contribution of 3%. On the positive side, the most impactful feature is the fact that U.S. Bank is not the seller of the mortgage.

Although the outcome of LIME is somewhat informative in explaining the high probability of default for this specific applicant (e.g. low creditworthiness being an indicator of the applicant's weak financial standing and thus contributing to the high risk of default), the almost negligibly small magnitudes of all 10 variables could be a sign of instability. Following the hints pointing to this issue, the stability indices were checked for 10, 20, 50 and 100 LIME iterations on the same observation. The graph below represents the outcome of this experiment.

The relatively steady and high values for CSI indicate that the sign, and thus the direction of contribution (positive/negative), of the LIME explanations is reliable, while the lower but steady VSI values require more caution: The variables in our results, and consequently the whole explanation, are relatively less reliable in terms of which features are the important ones in describing the probability of default. More precisely, the magnitude, and therefore the order, of the most important features could fluctuate between different LIME calls for the same observation, which makes the interpretation less reliable. This issue is relevant in our case due to the high dimensionality of the dataset, which can cause the problems mentioned in the LIME limitations and thus partly explains the not-so-informative LIME results. The problem becomes even more concerning when we repeat the stability check for the same number of iterations, as the value of each index then fluctuates as well. The stability concerns regarding LIME lead us to try another method, SHAP, with the aim of improving interpretability.

SHAP

Lastly, we use the presented TreeExplainer (Scott Lundberg 2017). First, we define the explainer and calculate the SHAP values for the whole test data. As the feature perturbation method, we use the default value “interventional”. To achieve stable results, we use the whole training data as background data. We then inspect the output of three different observations of the test data: one for which XGBoost predicted a high probability of defaulting (f(x) = 0.7579817), a second with a slightly higher probability of defaulting compared to the base value (f(x) = 0.08390787) and one with a lower probability of defaulting (f(x) = 0.000073524774). The base value represents the average model prediction for the training data with E[f(X)] = 0.012. The figure below shows the waterfall plots with the 15 most important features pushing the base value towards the final prediction for the three observations. The different local feature contributions show similar results regarding the most important features. The level of the contribution depends on the specific feature values and feature combinations. The SHAP results for the most important features are consistent with our intuition after the Data Analysis part. For example, a high ‘fico’ score (e.g., 809 for the low-probability prediction) influences the base value negatively and therefore decreases the probability of defaulting, while a low ‘fico’ score (e.g., 620 for the high-probability prediction) influences the base value positively and therefore increases the probability of defaulting.

To gain additional insights into the results, we plot the summary plot, which constructs the global structure and therefore adds a rich summary of the entire model to the individual feature importances. The most important feature over all observations is the ‘fico’ value, followed by the dummy feature ‘servicer_name_Other_servicers’, and the feature ‘cnt_borr’. Comparing the global with the three local plots, we can see many similarities in the global feature importance and the feature importance for these three observations.

Stability of SHAP

To reduce computational complexity, the SHAP package in Python includes various explainers for different ML models. Many of those, like the TreeExplainer when using the feature perturbation “interventional” (the default value for this function), require background data, which serves as a prior expectation for the instances to be explained. The official SHAP documentation suggests 100 to 1000 randomly drawn samples from the training data as an adequate background dataset, while other studies employ different sample sizes. The question now is what the effect of different background dataset sizes on the SHAP TreeExplainer is, and whether the results differ when using the feature perturbation approach “tree_path_dependent”, which does not require any background data. Instead, it just follows the trees and uses the number of training examples that went down each leaf to represent the background distribution.

To answer these questions, we simulate 100 iterations of SHAP using different background sample sizes (50, 500, 1500, 15000, 150000) to test the stability of SHAP as the sample size changes. We observe that the order of feature importance fluctuates across different background draws, especially for small sample sizes. The results suggest that users should take into account how the background data affects SHAP results; stability improves as the background sample size increases. When the whole training dataset is used as background data, the SHAP results are stable.
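A condensed sketch of this simulation is shown below; the full implementation is in the Appendix, and all names are placeholders.

```python
import numpy as np
import pandas as pd
import shap

# Sketch of the stability simulation (placeholder names `xgb_model`, `X_train`,
# `X_test` as above). For each background sample size, redraw the background
# data 100 times, recompute the SHAP values for the test set and store the
# resulting global feature importances and their ranking.
sample_sizes = [50, 500, 1500, 15000, 150000]
n_iterations = 100
importances = {m: [] for m in sample_sizes}
rankings = {m: [] for m in sample_sizes}

for m in sample_sizes:
    for i in range(n_iterations):
        background = X_train.sample(n=m, random_state=i)
        explainer = shap.TreeExplainer(
            xgb_model, data=background, feature_perturbation="interventional"
        )
        shap_values = explainer.shap_values(X_test)
        # Global importance: mean absolute SHAP value per feature
        importance = pd.Series(
            np.abs(shap_values).mean(axis=0), index=X_test.columns
        )
        importances[m].append(importance)
        rankings[m].append(importance.rank(ascending=False, method="first"))
```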

The code and further explanations can be found in the Appendix.

Summary

With respect to the research question and hypotheses stated previously, we can thus summarize that we indeed observe a trade-off between performance (AUC) and interpretability when comparing Logistic Regression as an intrinsically interpretable method with XGBoost as a black-box model. One might object that the AUC improvement appears relatively small. However, this has to be put into perspective, as even a small change in predictive performance may translate into a substantial monetary value for the bank; we expand on this point in the following section. Overall, the results of Stage II confirm hypotheses I and II. In Stage III, we are able to construct a model framework applicable within credit risk that achieves a high predictive performance as measured by the AUC while maintaining or even improving interpretability, thereby relaxing said trade-off and confirming hypothesis III. To this end, we use XGBoost as a black-box model and add model-agnostic, local post-hoc methods. Within our framework, SHAP delivers the most interpretable and stable results among the three methods considered.

A Case for Cost-Sensitive Learning

Even a small increase in predictive performance may translate into a significant monetary value for financial institutions within credit risk (Emad Azhar Ali et al. 2021; Hayashi 2016; Altinbas and Akkaya 2017). This is because the cost of a type I error, i.e. the cost associated with granting a credit to a defaulting customer, differs substantially from the cost of a type II error, i.e. the profit lost by rejecting a non-defaulting customer. As quantifying the exact cost difference requires a large amount of information (such as regulatory capital requirements or the exposure at default), the literature relies on “rules of thumb” regarding appropriate ratios of misclassification costs (Dumitrescu et al. 2022). However, the magnitudes of the ratios used in the literature vary widely: West (2000), for example, proposes a ratio of 1:5, while Kao et al. (2012) assume as much as 1:20. For this reason, we quantify the costs once with a ratio of 1:5 (as the lower limit of the expected costs), once with 1:10 (as the middle of the range) and once with 1:20 (as the upper limit of the expected costs). We make no claim to completeness or precision in this section; rather, we aim to provide an order of magnitude that illustrates how the increase in AUC achieved by using XGBoost instead of Logistic Regression translates into costs.

For this purpose, we split the original training data into training and validation data (70:30), which allows us to calculate the costs of false negative (FN) and false positive (FP) predictions on the validation data using the rules of thumb. With the ratio of 1:5, for example, one FP prediction (giving a loan to a person who will default) costs the bank the same amount as five FN predictions (not giving a loan to a person who would not have defaulted).

To determine the threshold above which an observation is predicted as defaulting ($p \geq \tau$) and below which it is predicted as non-defaulting ($p < \tau$), we follow Charles Elkan (2001) and compute the cost-minimal threshold as $p(b \mid \boldsymbol{x}) \geq \tau^{*} = \frac{C(b,G)}{C(b,G) + C(g,B)}$, where $C(b,G)$ denotes the cost of classifying a good (non-defaulting) applicant as bad and $C(g,B)$ the cost of classifying a bad (defaulting) applicant as good. We then calculate the thresholds implied by the three cost ratios. We do not tune the models empirically using a cost matrix, as this would go beyond the scope of this paper.
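For illustration, plugging the three cost ratios into this formula, with $C(b,G)$ normalized to 1, yields the following thresholds (the variable names are ours):

```python
# Cost-minimal thresholds implied by Elkan (2001) for the three cost ratios,
# normalizing C(b,G) (rejecting a non-defaulter) to 1 and setting C(g,B)
# (accepting a defaulter) to 5, 10 and 20.
thresholds = {}
for label, c_gB in {"1:5": 5.0, "1:10": 10.0, "1:20": 20.0}.items():
    c_bG = 1.0
    thresholds[label] = c_bG / (c_bG + c_gB)
    print(f"ratio {label}: tau* = {thresholds[label]:.4f}")
# ratio 1:5  -> tau* = 0.1667
# ratio 1:10 -> tau* = 0.0909
# ratio 1:20 -> tau* = 0.0476
```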

To obtain representative results, we simulate 100 iterations of splitting the original training data into new training and validation data (70:30) and, in each iteration, fit a new Logistic Regression as well as a new XGBoost model with the same parameters as above on the new training data. Using these models, we predict the probabilities for the validation observations in each round and calculate the confusion matrix as well as the AUC for each iteration.

Afterwards, we calculate the average AUC as well as the average TP, TN, FP and FN counts, visualize them, and compute the costs using the three ratios and the corresponding theoretically optimal thresholds.
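The sketch below outlines one way to implement this simulation; `X`, `y`, `logit_params` and `xgb_params` are placeholders for the original training data and the parameter settings used earlier, and `thresholds` refers to the dictionary from the previous sketch.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

results = []
for i in range(100):
    # New 70:30 split of the original training data in every iteration
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=i
    )
    for name, model in [
        ("logit", LogisticRegression(**logit_params)),
        ("xgboost", XGBClassifier(**xgb_params)),
    ]:
        model.fit(X_tr, y_tr)
        p_val = model.predict_proba(X_val)[:, 1]
        auc = roc_auc_score(y_val, p_val)
        for label, tau in thresholds.items():
            pred = (p_val >= tau).astype(int)
            tn, fp, fn, tp = confusion_matrix(y_val, pred).ravel()
            results.append({"iteration": i, "model": name, "ratio": label,
                            "auc": auc, "tn": tn, "fp": fp, "fn": fn, "tp": tp})

results = pd.DataFrame(results)
# Average AUC and confusion-matrix entries per model and cost ratio
summary = results.groupby(["model", "ratio"]).mean(numeric_only=True)
```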

The average AUC values differ less in this simulation than in the main analysis, with Logistic Regression achieving 0.8535 and XGBoost 0.8611. The average values of the confusion matrix are shown in the graph below. The results are not consistent: while the AUC of XGBoost is higher than that of Logistic Regression in every iteration and on average, the corresponding costs are not lower for XGBoost under every cost structure.

These results suggest that, when building credit risk models on imbalanced data, a cost-sensitive algorithm should be used rather than focusing solely on improving the AUC. This point should be addressed by future research. In the following section, we point out further limitations and thus potential directions for future investigation.

Conclusion and Future Research

While the credit risk model framework presented above is successful in relaxing the initial performance-interpretability trade-off, it still has several limitations which should be addressed by future research: Firstly, the model does not include any macroeconomic or systemic explanatory variables, which might be relevant and might help to increase predictive performance (Uddin et al. 2020; Altinbas and Akkaya 2017; Hu et al. 2021; Xia et al. 2020). Secondly, the predictive power of credit risk models may be adversely affected by the widely accepted fact that credit applicants can actively influence the parameters on the basis of which banks make their predictions in order to appear more creditworthy. For instance, as explained by Lipton (2016), credit seekers may seemingly “improve” their debt ratio by “simply […] requesting periodic increases to credit lines while keeping spending patterns constant”. Providers of credit ratings such as FICO even openly advise credit applicants on how to improve their score (Lipton 2016). Moreover, as demonstrated above, there is serious concern about model stability. Given that financial intermediaries play a significant role in our financial system, this is particularly worrying. In the case of Freddie Mac, this became painfully clear during the financial crisis of 2007/2008, when Freddie Mac (together with Fannie Mae) was taken under conservatorship by the Federal Housing Finance Agency (Lockhart 9/7/2008) to prevent the two institutions, which together held or guaranteed about $5.2 trillion of home mortgage debt, from failing after having accumulated substantial losses (Frame et al. 2015). In addition, the data’s high imbalance and dimensionality are likely to have complicated the analysis and exacerbated the stability concerns. This is far from a purely theoretical issue; it should rather be seen as a real-life problem when dealing with credit risk data. Further research should hence focus on finding ways to improve stability, especially in the presence of highly imbalanced and high-dimensional data. In conclusion, this analysis shows the potential of machine learning in the credit risk domain, but at the same time indicates that the application of these methods in the real world should be evaluated with caution.

References

💡 Altinbas, Hazar; Akkaya, Goktug Cenk (2017): Improving the performance of statistical learning methods with a combined meta-heuristic for consumer credit risk assessment. In Risk Manag 19 (4), pp. 255–280. DOI: 10.1057/s41283-017-0021-0.

💡 Baesens, Bart; Setiono, Rudy; Mues, Christophe; Vanthienen, Jan (2003): Using Neural Network Rule Extraction and Decision Tables for Credit-Risk Evaluation. In Management Science 49 (3), pp. 312–329. DOI: 10.1287/mnsc.49.3.312.12739.

💡 Basel Committee on Banking Supervision (2004): International Convergence of Capital Measurement and Capital Standards. A Revised Framework.

💡 Borisov, Vadim; Leemann, Tobias; Seßler, Kathrin; Haug, Johannes; Pawelczyk, Martin; Kasneci, Gjergji (2021): Deep Neural Networks and Tabular Data: A Survey. Available online at http://arxiv.org/pdf/2110.01889v3.

💡 Bureau of Consumer Financial Protection (2011): Equal Credit Opportunity Act (Regulation B), Part 1002. Source: 12 U.S.C. 5512, 5581; 15 U.S.C. 1691b. In : 76 FR 79445.

💡 Bussmann, Niklas; Giudici, Paolo; Marinelli, Dimitri; Papenbrock, Jochen (2019): Explainable AI in Credit Risk Management. In SSRN Journal. DOI: 10.2139/ssrn.3506274.

💡 Bussmann, Niklas; Giudici, Paolo; Marinelli, Dimitri; Papenbrock, Jochen (2021): Explainable Machine Learning in Credit Risk Management. In Comput Econ 57 (1), pp. 203–216. DOI: 10.1007/s10614-020-10042-0.

💡 Charles Elkan (2001): The Foundations of Cost-Sensitive Learning. Available online at https://www.researchgate.net/publication/2365611_The_Foundations_of_Cost-Sensitive_Learning.

💡 Consumer Financial Protection Bureau (2022): CFPB Circular 2022-03. Available online at https://files.consumerfinance.gov/f/documents/cfpb_2022-03_circular_2022-05.pdf, checked on 7/2/2022.

💡 David Cournapeau (2022): scikit-learn. Machine Learning in Python. With assistance of Jérémie du Boisberranger, Joris Van den Bossche, Loïc Estève, Thomas J. Fan, Alexandre Gramfort, Olivier Grisel, Yaroslav Halchenko, Nicolas Hug. Available online at https://scikit-learn.org/stable/, updated on May 2022.

💡 Davidson, Russell; MacKinnon, James G. (2004): Econometric theory and methods. New York, NY: Oxford Univ. Press.

💡 Demajo, Lara Marie; Vella, Vince; Dingli, Alexiei (2021): An Explanation Framework for Interpretable Credit Scoring. In IJAIA 12 (1), pp. 19–38. DOI: 10.5121/ijaia.2021.12102.

💡 Deutsche Bundesbank; BaFin (2021): Consultation paper: Machine learning in risk models – Characteristics and supervisory priorities.

💡 Deutsche Bundesbank; BaFin (2022): Machine learning in risk models – Characteristics and supervisory priorities.

💡 Dumitrescu, Elena; Hué, Sullivan; Hurlin, Christophe; Tokpavi, Sessi (2022): Machine learning for credit scoring: Improving logistic regression with non-linear decision-tree effects. In European Journal of Operational Research 297 (3), pp. 1178–1192. DOI: 10.1016/j.ejor.2021.06.053.

💡 Hoerl, Arthur E.; Kennard, Robert W. (1970): Ridge Regression: Biased Estimation for Nonorthogonal Problems. In Technometrics 12 (1), pp. 55–67. Available online at https://doi.org/10.1080/.

💡 Emad Azhar Ali, Syed; Sajjad Hussain Rizvi, Syed; Lai, Fong-Woon; Faizan Ali, Rao; Ali Jan, Ahmad (2021): Predicting Delinquency on Mortgage Loans: An Exhaustive Parametric Comparison of Machine Learning Techniques. In Int J Ind Eng Manag Volume 12 (Issue 1), pp. 1–13. DOI: 10.24867/IJIEM-2021-1-272.

💡 European Commission (2022): Data protection in the EU. Available online at https://ec.europa.eu/info/law/law-topic/data-protection/data-protection-eu_en, updated on 6/7/2022, checked on 6/21/2022.

💡 European Parliament and the Council (2016): REGULATION (EU) 2016/ 679 - on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/ 46/ EC (General Data Protection Regulation).

💡 European Union (2012): Charter of Fundamental Rights of the European Union, revised C 326/02. Available online at https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:12012P/TXT, checked on 6/21/2022.

💡 Federal Housing Finance Agency (2013): HARP: Your best route to a better mortgage. Available online at https://web.archive.org/web/20130924120848/https://www.harp.gov/, updated on 10/4/2013, checked on 7/13/2022.

💡 Federal Housing Finance Agency (2022): About Fannie Mae & Freddie Mac | Federal Housing Finance Agency. Available online at https://www.fhfa.gov/about-fannie-mae-freddie-mac, updated on 7/3/2022, checked on 7/3/2022.

💡 Frame, W. Scott; Fuster, Andreas; Tracy, Joseph; Vickery, James (2015): The Rescue of Fannie Mae and Freddie Mac. In Journal of Economic Perspectives 29 (2), pp. 25–52. DOI: 10.1257/jep.29.2.25.

💡 Freddie Mac (2021): Single Family Loan-Level Dataset. Available online at https://www.freddiemac.com/research/datasets/sf-loanlevel-dataset, updated on 7/2/2022, checked on 7/3/2022.

💡 Freddie Mac (2022a): About Freddie Mac. Available online at https://www.freddiemac.com/about, updated on 7/2/2022, checked on 7/3/2022.

💡 Freddie Mac (2022b): Single Family Loan-Level Dataset Frequently Asked Questions (FAQs).

💡 Freddie Mac (2022c): Single-Family Loan-Level Dataset General User Guide.

💡 Friedman, Jerome H. (2001): Greedy Function Approximation: A Gradient Boosting Machine. 1999 Reitz Lecture. In The Annals of Statistics 29 (5), pp. 1189–1232.

💡 Goodman, Bryce; Flaxman, Seth (2017): European Union regulations on algorithmic decision-making and a "right to explanation". In AIMag 38 (3), pp. 50–57. DOI: 10.1609/aimag.v38i3.2741.

💡 Hall, Patrick; Gill, Navdeep (2019): An Introduction to Machine Learning Interpretability. An Applied Perspective on Fairness, Accountability, Transparency, and Explainable AI. 2nd: O'Reilly Media, Inc. Available online at https://www.oreilly.com/library/view/an-introduction-to/9781098115487/.

💡 Hayashi, Yoichi (2016): Application of a rule extraction algorithm family based on the Re-RX algorithm to financial credit risk assessment from a Pareto optimal perspective. In Operations Research Perspectives 3, pp. 32–42. DOI: 10.1016/j.orp.2016.08.001.

💡 Hu, Linwei; Chen, Jie; Vaughan, Joel; Aramideh, Soroush; Yang, Hanyu; Wang, Kelly et al. (2021): Supervised Machine Learning Techniques: An Overview with Applications to Banking. In International Statistical Review 89 (3), pp. 573–604. DOI: 10.1111/insr.12448.

💡 Kao, Ling-Jing; Chiu, Chih-Chou; Chiu, Fon-Yu (2012): A Bayesian latent variable model with classification and regression tree approach for behavior and credit scoring. In Knowledge-Based Systems 36, pp. 245–252. DOI: 10.1016/j.knosys.2012.07.004.

💡 Li, Wei; Ding, Shuai; Wang, Hao; Chen, Yi; Yang, Shanlin (2020): Heterogeneous ensemble learning with feature engineering for default prediction in peer-to-peer lending in China. In World Wide Web 23 (1), pp. 23–45. DOI: 10.1007/s11280-019-00676-y.

💡 Lipton, Zachary C. (2016): The Mythos of Model Interpretability. Available online at http://arxiv.org/pdf/1606.03490v3.

💡 Liu, Wan’an; Fan, Hong; Xia, Min (2022): Multi-grained and multi-layered gradient boosting decision tree for credit scoring. In Appl Intell 52 (5), pp. 5325–5341. DOI: 10.1007/s10489-021-02715-6.

💡 Lockhart, James B. (9/7/2008): Statement of FHFA Director James B. Lockhart. Federal Housing Finance Agency.

💡 Lundberg, Scott; Lee, Su-In (2017): A Unified Approach to Interpreting Model Predictions. Available online at http://arxiv.org/pdf/1705.07874v2.

💡 Lundberg, Scott M.; Erion, Gabriel; Chen, Hugh; DeGrave, Alex; Prutkin, Jordan M.; Nair, Bala et al. (2020): From local explanations to global understanding with explainable AI for trees. In Nature Machine Intelligence 2, pp. 56–67. Available online at https://www.nature.com/articles/s42256-019-0138-9.

💡 Molnar, Christoph (2022): Interpretable Machine Learning. Available online at https://christophm.github.io/interpretable-ml-book/.

💡 Ribeiro, Marco; Singh, Sameer; Guestrin, Carlos (2016): “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pp. 97–101. DOI: 10.18653/v1/N16-3020.

💡 Ribeiro, Marco Tulio; Singh, Sameer; Guestrin, Carlos (2018): Anchors: High Precision Model-Agnostic Explanations. In AAAI.

💡 Robert Samoilescu (2018): Alibi. Available online at https://github.com/SeldonIO/alibi, updated on 7/20/2022.

💡 Scott Lundberg (2017): SHAP (0.41.0). Available online at https://github.com/slundberg/shap, updated on 1/16/2022.

💡 Shapley, L. S. (1953): 17. A Value for n-Person Games. In Harold William Kuhn, Albert William Tucker (Eds.): Contributions to the Theory of Games (AM-28), Volume II: Princeton University Press, pp. 307–318.

💡 Steenackers, A.; Goovaerts, M. J. (1989): A credit scoring model for personal loans. In Insurance: Mathematics and Economics 8 (1), pp. 31–34. DOI: 10.1016/0167-6687(89)90044-9.

💡 Uddin, Mohammad S.; Chi, Guotai; Al Janabi, Mazin A. M.; Habib, Tabassum (2020): Leveraging random forest in micro‐enterprises credit risk modelling for accuracy and interpretability. In Int. J Fin Econ, Article ijfe.2346. DOI: 10.1002/ijfe.2346.

💡 Visani, Giorgio; Bagli, Enrico; Chesani, Federico; Poluzzi, Alessandro; Capuzzo, Davide (2022): Statistical stability indices for LIME: obtaining reliable explanations for Machine Learning models. In Journal of the Operational Research Society 73 (1), pp. 91–101. DOI: 10.1080/01605682.2020.1865846.

💡 West, David (2000): Neural network credit scoring models. In Computers & Operations Research 27 (11-12), pp. 1131–1152. DOI: 10.1016/S0305-0548(99)00149-5.

💡 William H. Greene (2003): Econometric Analysis. 5th Edition: Pearson Education, Inc.

💡 xgboost developers (2021): XGBoost. With assistance of Andrew Ziem, Philip Hyunsu Cho, Jiaming Yuan. Available online at https://xgboost.readthedocs.io/en/stable/python/python_intro.html, updated on 12/16/2021.

💡 Xia, Yufei; He, Lingyun; Li, Yinguo; Fu, Yating; Xu, Yixin (2020): A dynamic credit scoring model based on survival gradient boosting decision tree approach. In Technological and Economic Development of Economy 27 (1), pp. 96–119. DOI: 10.3846/tede.2020.13997.

💡 Young, H. P. (1985): Monotonic solutions of cooperative games. In Int J Game Theory 14 (2), pp. 65–72. DOI: 10.1007/BF01769885.

💡 Yuan, Han; Liu, Mingxuan; Kang, Lican; Miao, Chenkui; Wu, Ying (2022): An empirical study of the effect of background data size on the stability of SHapley Additive exPlanations (SHAP) for deep learning models.

Appendix

Stability of SHAP

Code

❌ Disclaimer: the following section has a very long runtime (> 20 hours) and is therefore entirely commented out. The results of the analysis can be seen in the Results section below.

Results

The first table in this section shows the global feature-importance ranking for 100 simulations using different background data sizes. The rows represent the ranking from the highest global feature importance (ranking = 1) to the lowest (ranking = 83). The columns represent the different approaches. For each sample size, we simulate 100 random draws from the training data with otherwise identical parameters to check whether we obtain consistent results. The table lists all features that were assigned the respective ranking in at least one iteration.

When using no background data (the tree-path-dependent approach explained earlier) or when using the whole training dataset, we obtain the same ranking in all 100 simulations, i.e. there is no variation in the ranking. When using smaller background samples randomly drawn from the training data, we see some variation for the most and least important features (roughly three distinct features per position for rankings 1–5 and 79–83), while the variation in the middle of the ranking is much larger, with many more distinct features per ranking position.
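For illustration, the per-position variability behind such a table could be tallied as sketched below, reusing the placeholder `rankings` dictionary from the stability sketch in the main text.

```python
import pandas as pd

# For every background sample size, count how many distinct features ever
# occupied each ranking position across the 100 runs (sketch; `rankings` as
# defined in the stability simulation above).
unique_features_per_rank = {}
for m, runs in rankings.items():
    ranks_df = pd.concat(runs, axis=1)  # rows = features, columns = runs
    unique_features_per_rank[m] = pd.Series(
        {r: int(ranks_df.eq(r).any(axis=1).sum())
         for r in range(1, len(ranks_df) + 1)}
    )
```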

The results suggest that users should take into account how background data affects SHAP results and should therefore test the stability of their chosen approach before using it to interpret their model.

The second table shows the variance of the SHAP values for each feature over the 100 simulations. For small background sample sizes (m = 50, m = 500) we observe higher variances than for larger background data sizes. In our case, beyond roughly 1,500 background observations the variance does not decrease significantly anymore.
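One way to compute such a variance table is sketched below, reusing the placeholder `importances` dictionary; the notebook's own implementation may differ.

```python
# Variance of each feature's mean absolute SHAP value across the 100 runs,
# with one column per background sample size (sketch; `importances` as above).
variance_table = pd.DataFrame(
    {m: pd.DataFrame(runs).var(axis=0) for m, runs in importances.items()}
)
```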

As described above, when using the whole training data as background data, or using the tree path dependent approach with no background data, SHAP TreeExplainer gives stable results for each iteration.