{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Categorical Feature Support in Gradient Boosting\n\n.. currentmodule:: sklearn\n\nIn this example, we will compare the training times and prediction\nperformances of :class:`~ensemble.HistGradientBoostingRegressor` with\ndifferent encoding strategies for categorical features. In\nparticular, we will evaluate:\n\n- dropping the categorical features\n- using a :class:`~preprocessing.OneHotEncoder`\n- using an :class:`~preprocessing.OrdinalEncoder` and treat categories as\n ordered, equidistant quantities\n- using an :class:`~preprocessing.OrdinalEncoder` and rely on the `native\n category support ` of the\n :class:`~ensemble.HistGradientBoostingRegressor` estimator.\n\nWe will work with the Ames Iowa Housing dataset which consists of numerical\nand categorical features, where the houses' sales prices is the target.\n\nSee `sphx_glr_auto_examples_ensemble_plot_hgbt_regression.py` for an\nexample showcasing some other features of\n:class:`~ensemble.HistGradientBoostingRegressor`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Ames Housing dataset\nFirst, we load the Ames Housing data as a pandas dataframe. The features\nare either categorical or numerical:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import fetch_openml\n\nX, y = fetch_openml(data_id=42165, as_frame=True, return_X_y=True)\n\n# Select only a subset of features of X to make the example faster to run\ncategorical_columns_subset = [\n \"BldgType\",\n \"GarageFinish\",\n \"LotConfig\",\n \"Functional\",\n \"MasVnrType\",\n \"HouseStyle\",\n \"FireplaceQu\",\n \"ExterCond\",\n \"ExterQual\",\n \"PoolQC\",\n]\n\nnumerical_columns_subset = [\n \"3SsnPorch\",\n \"Fireplaces\",\n \"BsmtHalfBath\",\n \"HalfBath\",\n \"GarageCars\",\n \"TotRmsAbvGrd\",\n \"BsmtFinSF1\",\n \"BsmtFinSF2\",\n \"GrLivArea\",\n \"ScreenPorch\",\n]\n\nX = X[categorical_columns_subset + numerical_columns_subset]\nX[categorical_columns_subset] = X[categorical_columns_subset].astype(\"category\")\n\ncategorical_columns = X.select_dtypes(include=\"category\").columns\nn_categorical_features = len(categorical_columns)\nn_numerical_features = X.select_dtypes(include=\"number\").shape[1]\n\nprint(f\"Number of samples: {X.shape[0]}\")\nprint(f\"Number of features: {X.shape[1]}\")\nprint(f\"Number of categorical features: {n_categorical_features}\")\nprint(f\"Number of numerical features: {n_numerical_features}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gradient boosting estimator with dropped categorical features\nAs a baseline, we create an estimator where the categorical features are\ndropped:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.compose import make_column_selector, make_column_transformer\nfrom sklearn.ensemble import HistGradientBoostingRegressor\nfrom sklearn.pipeline import make_pipeline\n\ndropper = make_column_transformer(\n (\"drop\", make_column_selector(dtype_include=\"category\")), remainder=\"passthrough\"\n)\nhist_dropped = make_pipeline(dropper, HistGradientBoostingRegressor(random_state=42))" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "## Gradient boosting estimator with one-hot encoding\nNext, we create a pipeline that will one-hot encode the categorical features\nand let the rest of the numerical data to passthrough:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.preprocessing import OneHotEncoder\n\none_hot_encoder = make_column_transformer(\n (\n OneHotEncoder(sparse_output=False, handle_unknown=\"ignore\"),\n make_column_selector(dtype_include=\"category\"),\n ),\n remainder=\"passthrough\",\n)\n\nhist_one_hot = make_pipeline(\n one_hot_encoder, HistGradientBoostingRegressor(random_state=42)\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gradient boosting estimator with ordinal encoding\nNext, we create a pipeline that will treat categorical features as if they\nwere ordered quantities, i.e. the categories will be encoded as 0, 1, 2,\netc., and treated as continuous features.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nfrom sklearn.preprocessing import OrdinalEncoder\n\nordinal_encoder = make_column_transformer(\n (\n OrdinalEncoder(handle_unknown=\"use_encoded_value\", unknown_value=np.nan),\n make_column_selector(dtype_include=\"category\"),\n ),\n remainder=\"passthrough\",\n # Use short feature names to make it easier to specify the categorical\n # variables in the HistGradientBoostingRegressor in the next step\n # of the pipeline.\n verbose_feature_names_out=False,\n)\n\nhist_ordinal = make_pipeline(\n ordinal_encoder, HistGradientBoostingRegressor(random_state=42)\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gradient boosting estimator with native categorical support\nWe now create a :class:`~ensemble.HistGradientBoostingRegressor` estimator\nthat will natively handle categorical features. This estimator will not treat\ncategorical features as ordered quantities. We set\n`categorical_features=\"from_dtype\"` such that features with categorical dtype\nare considered categorical features.\n\nThe main difference between this estimator and the previous one is that in\nthis one, we let the :class:`~ensemble.HistGradientBoostingRegressor` detect\nwhich features are categorical from the DataFrame columns' dtypes.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "hist_native = HistGradientBoostingRegressor(\n random_state=42, categorical_features=\"from_dtype\"\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model comparison\nFinally, we evaluate the models using cross validation. 
Here we compare the models' performance in terms of\n:func:`~metrics.mean_absolute_percentage_error` and fit times.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nfrom sklearn.model_selection import cross_validate\n\nscoring = \"neg_mean_absolute_percentage_error\"\nn_cv_folds = 3\n\ndropped_result = cross_validate(hist_dropped, X, y, cv=n_cv_folds, scoring=scoring)\none_hot_result = cross_validate(hist_one_hot, X, y, cv=n_cv_folds, scoring=scoring)\nordinal_result = cross_validate(hist_ordinal, X, y, cv=n_cv_folds, scoring=scoring)\nnative_result = cross_validate(hist_native, X, y, cv=n_cv_folds, scoring=scoring)\n\n\ndef plot_results(figure_title):\n    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 8))\n\n    plot_info = [\n        (\"fit_time\", \"Fit times (s)\", ax1, None),\n        (\"test_score\", \"Mean Absolute Percentage Error\", ax2, None),\n    ]\n\n    x, width = np.arange(4), 0.9\n    for key, title, ax, y_limit in plot_info:\n        items = [\n            dropped_result[key],\n            one_hot_result[key],\n            ordinal_result[key],\n            native_result[key],\n        ]\n\n        # These hold fit times for the first panel and MAPE values for the\n        # second; np.abs undoes the sign of the \"neg_\" scoring.\n        mape_cv_mean = [np.mean(np.abs(item)) for item in items]\n        mape_cv_std = [np.std(item) for item in items]\n\n        ax.bar(\n            x=x,\n            height=mape_cv_mean,\n            width=width,\n            yerr=mape_cv_std,\n            color=[\"C0\", \"C1\", \"C2\", \"C3\"],\n        )\n        ax.set(\n            xlabel=\"Model\",\n            title=title,\n            xticks=x,\n            xticklabels=[\"Dropped\", \"One Hot\", \"Ordinal\", \"Native\"],\n            ylim=y_limit,\n        )\n    fig.suptitle(figure_title)\n\n\nplot_results(\"Gradient Boosting on Ames Housing\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the model with one-hot-encoded data is by far the slowest. This\nis to be expected, since one-hot encoding creates one additional feature per\ncategory value (for each categorical feature), and thus more split points\nneed to be considered during fitting. In theory, we expect the native\nhandling of categorical features to be slightly slower than treating\ncategories as ordered quantities ('Ordinal'), since native handling requires\n`sorting categories `. Fitting times should, however, be close when the\nnumber of categories is small, and this may not always be reflected in\npractice.\n\nIn terms of prediction performance, dropping the categorical features leads\nto the poorest results. The three models that use categorical features have\ncomparable error rates, with a slight edge for the native handling.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Limiting the number of splits\nIn general, one can expect poorer predictions from one-hot-encoded data,\nespecially when the tree depth or the number of nodes is limited: with\none-hot-encoded data, one needs more split points, i.e.
more depth, in order\nto recover an equivalent split that could be obtained with a single split\npoint using native handling.\n\nThis is also true when categories are treated as ordinal quantities: if\ncategories are `A..F` and the best split is `ACF - BDE`, the one-hot-encoded\nmodel will need 3 split points (one per category in the left node), and the\nordinal non-native model will need 4 splits: 1 split to isolate `A`, 1 split\nto isolate `F`, and 2 splits to isolate `C` from `BCDE`.\n\nHow strongly the models' performances differ in practice will depend on the\ndataset and on the flexibility of the trees.\n\nTo see this, let us re-run the same analysis with under-fitting models where\nwe artificially limit the total number of splits by both limiting the number\nof trees and the depth of each tree.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for pipe in (hist_dropped, hist_one_hot, hist_ordinal, hist_native):\n    if pipe is hist_native:\n        # The native model does not use a pipeline, so we can set the\n        # parameters directly.\n        pipe.set_params(max_depth=3, max_iter=15)\n    else:\n        pipe.set_params(\n            histgradientboostingregressor__max_depth=3,\n            histgradientboostingregressor__max_iter=15,\n        )\n\ndropped_result = cross_validate(hist_dropped, X, y, cv=n_cv_folds, scoring=scoring)\none_hot_result = cross_validate(hist_one_hot, X, y, cv=n_cv_folds, scoring=scoring)\nordinal_result = cross_validate(hist_ordinal, X, y, cv=n_cv_folds, scoring=scoring)\nnative_result = cross_validate(hist_native, X, y, cv=n_cv_folds, scoring=scoring)\n\nplot_results(\"Gradient Boosting on Ames Housing (few and small trees)\")\n\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results for these under-fitting models confirm our previous intuition:\nthe native category handling strategy performs best when the splitting\nbudget is constrained. The two other strategies (one-hot encoding and\ntreating categories as ordinal values) lead to error values comparable to\nthose of the baseline model that simply drops the categorical features\naltogether.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }