{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "c831b6f8-79ef-4d1b-b963-47422c056731", "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# Install the necessary dependencies\n", "\n", "import os\n", "import sys\n", "!{sys.executable} -m pip install --quiet seaborn pandas scikit-learn numpy matplotlib jupyterlab_myst ipython" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "---\n", "license:\n", " code: MIT\n", " content: CC-BY-4.0\n", "github: https://github.com/ocademy-ai/machine-learning\n", "venue: By Ocademy\n", "open_access: true\n", "bibliography:\n", " - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib\n", "---" ] }, { "cell_type": "markdown", "id": "bfee6a78", "metadata": {}, "source": [ "# Feature importance\n", "\n", "It's quite often that you want to make out the exact reasons of the algorithm outputting a particular answer. Or at the very least to find out which input features contributed most to the result. With Random Forest, you can obtain such information quite easily." ] }, { "cell_type": "markdown", "id": "6eb70ed9", "metadata": {}, "source": [ "## Intuition\n", "\n", "From the picture below, it is intuitively clear that, in our credit scoring problem, *Age* is much more important than *Income*. This can be formally explained using the concept of *information gain*." ] }, { "cell_type": "markdown", "id": "e5a58c79", "metadata": {}, "source": [ ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-advanced/ensemble-learning/feature-importance/Example_of_credit_scoring_problem.png\n", "---\n", "name: 'example of credit scoring problem' \n", "width: 90%\n", "---\n", "Example of credit scoring problem\n", ":::" ] }, { "cell_type": "markdown", "id": "a5fcfd33", "metadata": {}, "source": [ "In the case of many decision trees or a random forest, the closer the mean position of a feature over all the trees to the root, the more significant it is for a given classification or regression problem. Gains in the splitting criterion, such as the *Gini impurity*, obtained at each optimal split in every tree is a measure of importance that is directly associated with the splitting feature. The value of this score is distinct for each feature and accumulates over all the trees.\n", "\n", "Let's go a little deeper into the details. \n", "\n", "There exist a lot of methods to assess feature importances. Leo Breinman in his works suggested to evaluate the importance of a variable by measuring decrease of accuracy of the forest when the variable is randomly permuted or decrease of impurity of a nodes where the given variable is used for splitting. The former method is often called **permutation importance**. The latter method is used in `sklearn`." ] }, { "cell_type": "markdown", "id": "ba051632", "metadata": {}, "source": [ "### Permutation importance\n", "\n", "Inspired by this [article](https://www.researchgate.net/publication/5231126_Conditional_Variable_Importance_for_Random_Forests).\n", "The average reduction in accuracy caused by a variable is determined during the calculation of the out-of-bag error. The greater the reduction in accuracy due to an exclusion or permutation of the variable, the higher its *importance score*. 
For this reason, variables with a greater average reduction in accuracy are generally more significant for classification.\n", "\n", "The rationale for calculating permutation importance is the following: by randomly permuting the predictor variable $X_j$, its original association with the response $Y$ is broken. When the permuted variable $X_j$, together with all the other non-permuted variables, is used to predict the response for the out-of-bag observations, the prediction *accuracy* decreases substantially if the original $X_j$ was associated with the response. Thus, as a measure of variable importance, the difference in prediction accuracy before and after permuting is used." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More formally: denote $\\overline{\\mathfrak{B}}^{(t)}$ as the out-of-bag sample for a tree $t$, for $t\\in\\{1, ..., N\\}$ where $N$ is the number of trees in the ensemble. Then the permutation importance of variable $X_j$ in tree $t$ is\n", "\n", "$${PI}^{(t)}\\left(X_j\\right)=\\frac{\\sum_{i\\in\\overline{\\mathfrak{B}}^{(t)}}I\\left(y_i=\\hat{y}_i^{(t)}\\right)}{\\left|\\overline{\\mathfrak{B}}^{(t)}\\right|}-\\frac{\\sum_{i\\in\\overline{\\mathfrak{B}}^{(t)}}I\\left(y_i=\\hat{y}_{i,\\pi_j}^{(t)}\\right)}{\\left|\\overline{\\mathfrak{B}}^{(t)}\\right|}$$ \n", "\n", "where $\\hat{y}_i^{(t)}=f^{(t)}(\\mathbf{x}_i)$ is the predicted class for observation $i$ before permuting, and $\\hat{y}_{i,\\pi_j}^{(t)}=f^{(t)}(\\mathbf{x}_{i,\\pi_j})$ is the predicted class for observation $i$ after permuting $X_j$, i.e. $\\mathbf{x}_{i,\\pi_j}=\\left(x_{i,1}, ..., x_{i,j-1},x_{\\pi_j(i),j},x_{i,j+1},...,x_{i,p}\\right)$.\n", "\n", "Note that by definition ${PI}^{(t)}\\left(X_j\\right)=0$ if variable $X_j$ is not in tree $t$.\n", "\n", "Now, we can give the feature importance calculation for ensembles:\n", "* not normalized:\n", "\n", "$${PI}\\left(X_j\\right)=\\frac{\\sum_{t=1}^N {PI}^{(t)}(X_j)}{N}$$\n", "* normalized by the standard deviation of the differences:\n", "\n", "$$z_j=\\frac{{PI}\\left(X_j\\right)}{\\frac{\\hat{\\sigma}}{\\sqrt{N}}}$$" ] }, { "cell_type": "markdown", "id": "bf1e314f", "metadata": {}, "source": [ "## Illustrating permutation importance\n", "\n", "Let's assume that we have a toy dataset with 10 instances. The target variable can be either **'N'** or **'P'**.\n", "\n", "$$\\begin{array}{c|cccccccccc}\n", " \\text{Instances}, i & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\\\\n", " \\hline\n", " y_i & N & P & P & N & N & P & N & N & N & P \\\\\n", " \\end{array}$$\n", "\n", "We build an ensemble of 5 trees $t$, for $t\\in\\{1, ..., 5\\}$. For each tree we get the out-of-bag sample (denoted $\\overline{\\mathfrak{B}}^{(t)}$ above). 
For example, for the first tree the out-of-bag sample consists of instances 2, 4, 5, and 6.\n", "\n", "$$\\begin{array}{c|c|cccccccccc}\n", " \\text{Tree 1} & \\text{Bootstrap-sample 1} & 10 & 9 & 7 & 8 & 1 & 3 & 9 & 10 & 10 & 7\\\\\n", " \\hline\n", " \\text{Tree 2} & \\text{Bootstrap-sample 2} & 4 & 8 & 5 & 8 & 3 & 9 & 2 & 6 & 1 & 6\\\\\n", " \\hline\n", " \\text{Tree 3} & \\text{Bootstrap-sample 3} & 6 & 2 & 6 & 10 & 2 & 10 & 3 & 6 & 5 & 1\\\\\n", " \\hline\n", " \\text{Tree 4} & \\text{Bootstrap-sample 4} & 6 & 7 & 8 & 10 & 6 & 10 & 9 & 10 & 8 & 2\\\\\n", " \\hline\n", " \\text{Tree 5} & \\text{Bootstrap-sample 5} & 5 & 8 & 1 & 8 & 5 & 7 & 10 & 1 & 10 & 9\\\\\n", " \\end{array}$$\n", "\n", "Thus, the out-of-bag samples for each tree $t$ are\n", "\n", "$$\\begin{array}{c|cccc}\n", " \\text{Tree}, t & \\overline{\\mathfrak{B}}^{(t)} \\\\\n", " \\hline\n", " \\text{Tree 1} & 2 & 4 & 5 & 6\\\\\n", " \\hline\n", " \\text{Tree 2} & 7 & 10\\\\\n", " \\hline\n", " \\text{Tree 3} & 4 & 7 & 8 & 9\\\\\n", " \\hline\n", " \\text{Tree 4} & 1 & 3 & 4 & 5\\\\\n", " \\hline\n", " \\text{Tree 5} & 2 & 3 & 4 & 6\\\\\n", " \\hline\n", " \\end{array}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose that we have four features $X_j$, $j\\in\\{1, 2, 3, 4\\}$, and we'd like to compute the _permutation importance_ of $X_2$. First, for each out-of-bag sample we compute the _accuracy_ of the model before and after permutation of the values of $X_2$.\n", "\n", "For instance, before permutation for $\\overline{\\mathfrak{B}}^{(1)}$ we have\n", "\n", "$$\\begin{array}{c|cccc|cc|c}\n", " & X_1 & \\color{red}{X_2} & X_3 & X_4 & y_i & \\hat{y}_i & I\\left(y_i=\\hat{y}_i\\right)\\\\\n", " \\hline\n", " \\textbf{2} & 1 & \\color{red}2 & 11 & 101 & \\textbf{P} & \\textbf{P} & 1\\\\\n", " \\hline\n", " \\textbf{4} & 2 & \\color{red}3 & 12 & 102 & \\textbf{N} & \\textbf{P} & 0\\\\\n", " \\hline\n", " \\textbf{5} & 3 & \\color{red}5 & 13 & 103 & \\textbf{N} & \\textbf{N} & 1\\\\\n", " \\hline\n", " \\textbf{6} & 4 & \\color{red}7 & 14 & 104 & \\textbf{P} & \\textbf{P} & 1\\\\\n", " \\end{array}$$\n", "\n", "Thus, the accuracy before permutation is $3/4=0.75$.\n", "\n", "After permutation for $\\overline{\\mathfrak{B}}^{(1)}$ we have\n", "\n", "$$\\begin{array}{c|cccc|cc|c}\n", " & X_1 & \\color{red}{X_2} & X_3 & X_4 & y_i & \\hat{y}_i & I\\left(y_i=\\hat{y}_i\\right)\\\\\n", " \\hline\n", " \\textbf{2} & 1 & \\color{red}5 & 11 & 101 & \\textbf{P} & \\textbf{N} & 0\\\\\n", " \\hline\n", " \\textbf{4} & 2 & \\color{red}7 & 12 & 102 & \\textbf{N} & \\textbf{P} & 0\\\\\n", " \\hline\n", " \\textbf{5} & 3 & \\color{red}2 & 13 & 103 & \\textbf{N} & \\textbf{N} & 1\\\\\n", " \\hline\n", " \\textbf{6} & 4 & \\color{red}3 & 14 & 104 & \\textbf{P} & \\textbf{P} & 1\\\\\n", " \\end{array}$$\n", "\n", "The accuracy after permutation is $2/4=0.50$.\n", "\n", "The difference between the two accuracies, $0.75-0.50=0.25$, is the permutation importance ${PI}^{(1)}\\left(X_2\\right)$ of $X_2$ in the first tree.\n", "\n", "These steps are repeated for each out-of-bag sample $\\overline{\\mathfrak{B}}^{(t)}$. To get the not normalized _permutation importance_, we sum all the computed differences and divide by the number of trees. Normalization is done by dividing the _not normalized permutation importance_ by the standard error $\\hat{\\sigma}/\\sqrt{N}$."
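, "\n", "\n", "To make the walk-through concrete, the cell below gives a minimal NumPy sketch of the per-tree computation for $X_2$ on $\\overline{\\mathfrak{B}}^{(1)}$. The `predict` function is a hypothetical stand-in for the fitted tree $f^{(1)}$, contrived only so that it reproduces the predictions shown in the two tables above." ] }, { "cell_type": "code", "execution_count": null, "id": "7f3a9b2e", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Out-of-bag sample for tree 1 (instances 2, 4, 5, 6 from the tables above);\n", "# columns are X_1, X_2, X_3, X_4\n", "X_oob = np.array([[1.0, 2.0, 11.0, 101.0],\n", "                  [2.0, 3.0, 12.0, 102.0],\n", "                  [3.0, 5.0, 13.0, 103.0],\n", "                  [4.0, 7.0, 14.0, 104.0]])\n", "y_oob = np.array([\"P\", \"N\", \"N\", \"P\"])\n", "\n", "def predict(X):\n", "    # Hypothetical stand-in for the fitted tree f^(1): a contrived rule\n", "    # that reproduces the predictions shown in both tables above\n", "    return np.where((X[:, 0] == 3.0) | (X[:, 1] == 5.0), \"N\", \"P\")\n", "\n", "acc_before = np.mean(predict(X_oob) == y_oob)  # 0.75\n", "\n", "# Permute the values of X_2 (column index 1); in general one would draw a\n", "# random permutation, here we reuse the permutation from the table\n", "X_perm = X_oob.copy()\n", "X_perm[:, 1] = np.array([5.0, 7.0, 2.0, 3.0])\n", "acc_after = np.mean(predict(X_perm) == y_oob)  # 0.50\n", "\n", "# Per-tree permutation importance PI^(1)(X_2); averaging such differences\n", "# over all N trees gives PI(X_2)\n", "print(acc_before - acc_after)  # 0.25"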
] }, { "cell_type": "markdown", "id": "05c4f130", "metadata": {}, "source": [ "## Sklearn Random Forest Feature Importance\n", "\n", "Inspired by [this](https://medium.com/@srnghn/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3) article.\n", "Sklearn library uses another approach to determine feature importance. The rationale for that method is that the more gain in information the node (with splitting feature $X_j$) provides, the higher its importance.\n", "\n", "The average reduction in the Gini impurity – or MSE for regression – represents the contribution of each feature to the homogeneity of nodes and leaves in the resulting Random Forest model. Each time a selected feature is used for splitting, the Gini impurity of the child nodes is calculated and compared with that of the original node.\n", "\n", "Gini impurity is a score of homogeneity with the range from 0 (homogeneous) to 1 (heterogeneous). The changes in the value of the splitting criterion are accumulated for each feature and normalized at the end of the calculation. A higher reduction in the Gini impurity signals that splitting results by this feature results in nodes with higher purity." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The algorithm of obtaining feature importance may be represented with the following sequence of steps:\n", "\n", "1. For each tree $t$ in ensemble $t\\in\\{1,...,N\\}$:\n", "\n", " 1.1. for each node $i$ calculate the reduction in impurity (or MSE, or entropy) as ${RI}_i^{(t)}=w_i^{(t)}\\cdot I_i^{(t)} - w_{LEFT_i}^{(t)}\\cdot I_{LEFT_i}^{(t)}-w_{RIGHT_i}^{(t)}\\cdot I_{RIGHT_i}^{(t)}$, where:\n", " - $w_i^{(t)}$, $w_{LEFT_i}^{(t)}$, and $w_{RIGHT_i}^{(t)}$ are respectively weighted number of samples reaching node $i$ in tree $t$, and its left $LEFT_i$ and right $RIGHT_i$ children\n", " - $I_i^{(t)}$, $I_{LEFT_i}^{(t)}$, $I_{RIGHT_i}^{(t)}$ are impurity of the nodes. For leaves ${RI}_i^{(t)}$ is equal to 0.\n", "\n", " 1.2. for each feature $j$ calculate its importance in that particular tree as\n", " \n", "$${FI}_j^{(t)}=\\frac{\\sum_{i:\\text{node }i\\text{ splits on feature } j}{RI}_i^{(t)}}{\\sum_{i\\in\\text{all nodes}}{RI}_i^{(t)}}$$\n", "\n", " That means that in numerator we sum the reduction in impurity only in those nodes where feature $j$ is situated.\n", "2. Calculate the average feature importances over all trees in ensemble:\n", "\n", "$${FI}_j=\\frac{\\sum_{t=1}^N {FI}_j^{(t)}}{N}$$\n", "\n", "Those are pretty confusing formulas so let's demonstrate each step with the Iris Dataset." ] }, { "cell_type": "code", "execution_count": 2, "id": "0c59dcfd", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns" ] }, { "cell_type": "code", "execution_count": 3, "id": "cae7631a", "metadata": {}, "outputs": [], "source": [ "iris_data = pd.read_csv(\"https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/iris.csv\")" ] }, { "cell_type": "code", "execution_count": 4, "id": "60a56379", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [ { "data": { "text/html": [ "
|   | SepalLength | SepalWidth | PetalLength | PetalWidth |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
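
A quick sketch of how these impurity-based importances can be read off in practice: fit a `RandomForestClassifier` on the four feature columns and inspect `feature_importances_`, which sklearn normalizes to sum to 1 across features. The label column name `Name` is an assumption about this CSV and may differ.

```python
from sklearn.ensemble import RandomForestClassifier

# Feature matrix and target; "Name" is the assumed label column of iris.csv
X = iris_data[["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]]
y = iris_data["Name"]

rf = RandomForestClassifier(n_estimators=100, random_state=17)
rf.fit(X, y)

# Mean decrease in impurity per feature, normalized to sum to 1
for feature, score in zip(X.columns, rf.feature_importances_):
    print(f"{feature}: {score:.3f}")
```

For the permutation-based method described earlier, `sklearn.inspection.permutation_importance` offers a ready-made implementation.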