{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Permutation Importance vs Random Forest Feature Importance (MDI)\n\nIn this example, we will compare the impurity-based feature importance of\n:class:`~sklearn.ensemble.RandomForestClassifier` with the\npermutation importance on the titanic dataset using\n:func:`~sklearn.inspection.permutation_importance`. We will show that the\nimpurity-based feature importance can inflate the importance of numerical\nfeatures.\n\nFurthermore, the impurity-based feature importance of random forests suffers\nfrom being computed on statistics derived from the training dataset: the\nimportances can be high even for features that are not predictive of the target\nvariable, as long as the model has the capacity to use them to overfit.\n\nThis example shows how to use Permutation Importances as an alternative that\ncan mitigate those limitations.\n\n.. rubric:: References\n\n* :doi:`L. Breiman, \"Random Forests\", Machine Learning, 45(1), 5-32,\n 2001. <10.1023/A:1010933404324>`\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Loading and Feature Engineering\nLet's use pandas to load a copy of the titanic dataset. The following shows\nhow to apply separate preprocessing on numerical and categorical features.\n\nWe further include two random variables that are not correlated in any way\nwith the target variable (``survived``):\n\n- ``random_num`` is a high cardinality numerical variable (as many unique\n values as records).\n- ``random_cat`` is a low cardinality categorical variable (3 possible\n values).\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nfrom sklearn.datasets import fetch_openml\nfrom sklearn.model_selection import train_test_split\n\nX, y = fetch_openml(\"titanic\", version=1, as_frame=True, return_X_y=True)\nrng = np.random.RandomState(seed=42)\nX[\"random_cat\"] = rng.randint(3, size=X.shape[0])\nX[\"random_num\"] = rng.randn(X.shape[0])\n\ncategorical_columns = [\"pclass\", \"sex\", \"embarked\", \"random_cat\"]\nnumerical_columns = [\"age\", \"sibsp\", \"parch\", \"fare\", \"random_num\"]\n\nX = X[categorical_columns + numerical_columns]\nX_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We define a predictive model based on a random forest. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "We define a predictive model based on a random forest. Before fitting it, we\napply the following preprocessing steps:\n\n- use :class:`~sklearn.preprocessing.OrdinalEncoder` to encode the\n categorical features;\n- use :class:`~sklearn.impute.SimpleImputer` to fill missing values for\n numerical features using a mean strategy.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.compose import ColumnTransformer\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import OrdinalEncoder\n\ncategorical_encoder = OrdinalEncoder(\n handle_unknown=\"use_encoded_value\", unknown_value=-1, encoded_missing_value=-1\n)\nnumerical_pipe = SimpleImputer(strategy=\"mean\")\n\npreprocessing = ColumnTransformer(\n [\n (\"cat\", categorical_encoder, categorical_columns),\n (\"num\", numerical_pipe, numerical_columns),\n ],\n verbose_feature_names_out=False,\n)\n\nrf = Pipeline(\n [\n (\"preprocess\", preprocessing),\n (\"classifier\", RandomForestClassifier(random_state=42)),\n ]\n)\nrf.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Accuracy of the Model\nPrior to inspecting the feature importances, it is important to check that\nthe model's predictive performance is high enough. Indeed, there would be\nlittle interest in inspecting the important features of a non-predictive\nmodel.\n\nHere one can observe that the train accuracy is very high (the forest model\nhas enough capacity to completely memorize the training set) but it can still\ngeneralize well enough to the test set thanks to the built-in bagging of\nrandom forests.\n\nIt might be possible to trade some accuracy on the training set for a\nslightly better accuracy on the test set by limiting the capacity of the\ntrees (for instance by setting ``min_samples_leaf=5`` or\n``min_samples_leaf=10``) so as to limit overfitting while not introducing too\nmuch underfitting, as illustrated in the sketch further below.\n\nHowever, let's keep our high capacity random forest model for now so as to\nillustrate some pitfalls with feature importance on variables with many\nunique values.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(f\"RF train accuracy: {rf.score(X_train, y_train):.3f}\")\nprint(f\"RF test accuracy: {rf.score(X_test, y_test):.3f}\")" ] },
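{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick, illustrative check of this trade-off, we can refit clones of the\npipeline with a few values of ``min_samples_leaf`` (the exact values tried\nhere are only examples) and compare train and test accuracies. Cloning keeps\n``rf`` itself untouched for the rest of the example.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.base import clone\n\n# Refit clones of the pipeline with increasing leaf sizes to illustrate the\n# capacity/overfitting trade-off; rf itself is left untouched.\nfor leaf_size in [1, 5, 10]:\n model = clone(rf).set_params(classifier__min_samples_leaf=leaf_size)\n model.fit(X_train, y_train)\n print(\n  f\"min_samples_leaf={leaf_size:2d}: \"\n  f\"train={model.score(X_train, y_train):.3f}, \"\n  f\"test={model.score(X_test, y_test):.3f}\"\n )" ] },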
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Tree's Feature Importance from Mean Decrease in Impurity (MDI)\nThe impurity-based feature importance ranks the numerical features as the\nmost important features. As a result, the non-predictive ``random_num``\nvariable is ranked as one of the most important features!\n\nThis problem stems from two limitations of impurity-based feature\nimportances:\n\n- impurity-based importances are biased towards high cardinality features;\n- impurity-based importances are computed on training set statistics and\n therefore do not reflect the ability of a feature to be useful for making\n predictions that generalize to the test set (when the model has enough\n capacity).\n\nThe bias towards high cardinality features explains why `random_num` has a\nmuch larger importance than `random_cat`, while we would expect both random\nfeatures to have a null importance.\n\nThe fact that we use training set statistics explains why both the\n`random_num` and `random_cat` features have a non-null importance.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n\nfeature_names = rf[:-1].get_feature_names_out()\n\nmdi_importances = pd.Series(\n rf[-1].feature_importances_, index=feature_names\n).sort_values(ascending=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ax = mdi_importances.plot.barh()\nax.set_title(\"Random Forest Feature Importances (MDI)\")\nax.figure.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an alternative, the permutation importances of ``rf`` are computed on a\nheld-out test set. This shows that the low cardinality categorical features,\n`sex` and `pclass`, are the most important features. Indeed, permuting the\nvalues of these features leads to the largest decrease in the accuracy score\nof the model on the test set.\n\nAlso note that both random features have very low importances (close to 0) as\nexpected.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.inspection import permutation_importance\n\nresult = permutation_importance(\n rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=2\n)\n\nsorted_importances_idx = result.importances_mean.argsort()\nimportances = pd.DataFrame(\n result.importances[sorted_importances_idx].T,\n columns=X.columns[sorted_importances_idx],\n)\nax = importances.plot.box(vert=False, whis=10)\nax.set_title(\"Permutation Importances (test set)\")\nax.axvline(x=0, color=\"k\", linestyle=\"--\")\nax.set_xlabel(\"Decrease in accuracy score\")\nax.figure.tight_layout()" ] },
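{ "cell_type": "markdown", "metadata": {}, "source": [ "To make this computation more concrete: for each feature,\n:func:`~sklearn.inspection.permutation_importance` shuffles that column,\nrescores the model, and averages the drop in the score over repetitions. The\n``manual_permutation_importance`` helper below is a minimal illustrative\nsketch of this idea, not the actual implementation.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Minimal sketch of the permutation importance idea: shuffle one column,\n# rescore the fitted model, and average the accuracy drop over repetitions.\ndef manual_permutation_importance(model, X, y, column, n_repeats=10, seed=42):\n rng = np.random.RandomState(seed)\n baseline = model.score(X, y)\n drops = []\n for _ in range(n_repeats):\n  X_permuted = X.copy()\n  X_permuted[column] = rng.permutation(X_permuted[column].values)\n  drops.append(baseline - model.score(X_permuted, y))\n return np.mean(drops)\n\n\nfor column in [\"sex\", \"random_num\"]:\n drop = manual_permutation_importance(rf, X_test, y_test, column)\n print(f\"{column}: {drop:.3f}\")" ] },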
{ "cell_type": "markdown", "metadata": {}, "source": [ "It is also possible to compute the permutation importances on the training\nset. This reveals that `random_num` and `random_cat` get a significantly\nhigher importance ranking than when computed on the test set. The difference\nbetween those two plots confirms that the RF model has enough capacity to use\nthe random numerical and categorical features to overfit.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "result = permutation_importance(\n rf, X_train, y_train, n_repeats=10, random_state=42, n_jobs=2\n)\n\nsorted_importances_idx = result.importances_mean.argsort()\nimportances = pd.DataFrame(\n result.importances[sorted_importances_idx].T,\n columns=X.columns[sorted_importances_idx],\n)\nax = importances.plot.box(vert=False, whis=10)\nax.set_title(\"Permutation Importances (train set)\")\nax.axvline(x=0, color=\"k\", linestyle=\"--\")\nax.set_xlabel(\"Decrease in accuracy score\")\nax.figure.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now repeat the experiment while limiting the capacity of the trees to\noverfit, by setting `min_samples_leaf` to 20 data points.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "rf.set_params(classifier__min_samples_leaf=20).fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the accuracy scores on the training and test sets, we observe that\nthe two metrics are now very similar. Therefore, our model is no longer\noverfitting. We can then check the permutation importances with this new\nmodel.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(f\"RF train accuracy: {rf.score(X_train, y_train):.3f}\")\nprint(f\"RF test accuracy: {rf.score(X_test, y_test):.3f}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "train_result = permutation_importance(\n rf, X_train, y_train, n_repeats=10, random_state=42, n_jobs=2\n)\ntest_result = permutation_importance(\n rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=2\n)\nsorted_importances_idx = train_result.importances_mean.argsort()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "train_importances = pd.DataFrame(\n train_result.importances[sorted_importances_idx].T,\n columns=X.columns[sorted_importances_idx],\n)\ntest_importances = pd.DataFrame(\n test_result.importances[sorted_importances_idx].T,\n columns=X.columns[sorted_importances_idx],\n)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for name, importances in zip([\"train\", \"test\"], [train_importances, test_importances]):\n ax = importances.plot.box(vert=False, whis=10)\n ax.set_title(f\"Permutation Importances ({name} set)\")\n ax.set_xlabel(\"Decrease in accuracy score\")\n ax.axvline(x=0, color=\"k\", linestyle=\"--\")\n ax.figure.tight_layout()" ] },
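{ "cell_type": "markdown", "metadata": {}, "source": [ "As a small numerical complement to these plots, we can print the mean and\nstandard deviation of the test-set permutation importances of the two random\nfeatures; with the capacity-limited forest they should be close to zero (the\nexact values depend on the random state).\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Print mean +/- std of the test-set importances of the random features;\n# the rows of the result follow the column order of X.\nfor column in [\"random_num\", \"random_cat\"]:\n idx = list(X.columns).index(column)\n mean = test_result.importances_mean[idx]\n std = test_result.importances_std[idx]\n print(f\"{column}: {mean:.3f} +/- {std:.3f}\")" ] },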
"pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }