{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Target Encoder's Internal Cross fitting\n\n.. currentmodule:: sklearn.preprocessing\n\nThe :class:`TargetEncoder` replaces each category of a categorical feature with\nthe shrunk mean of the target variable for that category. This method is useful\nin cases where there is a strong relationship between the categorical feature\nand the target. To prevent overfitting, :meth:`TargetEncoder.fit_transform` uses\nan internal :term:`cross fitting` scheme to encode the training data to be used\nby a downstream model. This scheme involves splitting the data into *k* folds\nand encoding each fold using the encodings learnt using the other *k-1* folds.\nIn this example, we demonstrate the importance of the cross\nfitting procedure to prevent overfitting.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create Synthetic Dataset\nFor this example, we build a dataset with three categorical features:\n\n* an informative feature with medium cardinality (\"informative\")\n* an uninformative feature with medium cardinality (\"shuffled\")\n* an uninformative feature with high cardinality (\"near_unique\")\n\nFirst, we generate the informative feature:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nfrom sklearn.preprocessing import KBinsDiscretizer\n\nn_samples = 50_000\n\nrng = np.random.RandomState(42)\ny = rng.randn(n_samples)\nnoise = 0.5 * rng.randn(n_samples)\nn_categories = 100\n\nkbins = KBinsDiscretizer(\n n_bins=n_categories,\n encode=\"ordinal\",\n strategy=\"uniform\",\n random_state=rng,\n subsample=None,\n)\nX_informative = kbins.fit_transform((y + noise).reshape(-1, 1))\n\n# Remove the linear relationship between y and the bin index by permuting the\n# values of X_informative:\npermuted_categories = rng.permutation(n_categories)\nX_informative = permuted_categories[X_informative.astype(np.int32)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The uninformative feature with medium cardinality is generated by permuting the\ninformative feature and removing the relationship with the target:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_shuffled = rng.permutation(X_informative)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The uninformative feature with high cardinality is generated so that it is\nindependent of the target variable. We will show that target encoding without\n:term:`cross fitting` will cause catastrophic overfitting for the downstream\nregressor. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Next, the uninformative feature with high cardinality is generated so that it\nis independent of the target variable. We will show that target encoding\nwithout :term:`cross fitting` causes catastrophic overfitting for the\ndownstream regressor. High cardinality features like this one are essentially\nunique identifiers for the samples and should generally be removed from\nmachine learning datasets. In this example, we keep one to show how\n:class:`TargetEncoder`'s default :term:`cross fitting` behavior mitigates the\noverfitting issue automatically.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_near_unique_categories = rng.choice(\n    int(0.9 * n_samples), size=n_samples, replace=True\n).reshape(-1, 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we assemble the dataset and perform a train-test split:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n\nfrom sklearn.model_selection import train_test_split\n\nX = pd.DataFrame(\n    np.concatenate(\n        [X_informative, X_shuffled, X_near_unique_categories],\n        axis=1,\n    ),\n    columns=[\"informative\", \"shuffled\", \"near_unique\"],\n)\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training a Ridge Regressor\nIn this section, we train a ridge regressor on the dataset with and without\nencoding and explore the influence of the target encoder with and without the\ninternal :term:`cross fitting`. First, we see that a Ridge model trained on\nthe raw features has low performance. This is because we permuted the order of\nthe informative feature, meaning `X_informative` is not informative in its raw\nform:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import sklearn\nfrom sklearn.linear_model import Ridge\n\n# Configure transformers to always output DataFrames\nsklearn.set_config(transform_output=\"pandas\")\n\nridge = Ridge(alpha=1e-6, solver=\"lsqr\", fit_intercept=False)\n\nraw_model = ridge.fit(X_train, y_train)\nprint(\"Raw Model score on training set: \", raw_model.score(X_train, y_train))\nprint(\"Raw Model score on test set: \", raw_model.score(X_test, y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we create a pipeline with the target encoder and ridge model. The\npipeline uses :meth:`TargetEncoder.fit_transform`, which applies\n:term:`cross fitting`. Conceptually, the procedure resembles the following\nsketch, where each fold is encoded with an encoder fitted on the remaining\nfolds (a simplified illustration; the actual implementation also selects the\nsplitter based on the target type and handles the shrinkage details\ninternally):\n\n" ] },
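{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Illustrative sketch of the internal cross fitting; in practice,\n# TargetEncoder.fit_transform does all of this for us.\nfrom sklearn.model_selection import KFold\nfrom sklearn.preprocessing import TargetEncoder\n\nX_train_cf_sketch = np.empty(X_train.shape, dtype=float)\ncv = KFold(n_splits=5, shuffle=True, random_state=0)\nfor train_idx, encode_idx in cv.split(X_train):\n    # Learn encodings on k-1 folds and encode the held-out fold with them:\n    fold_encoder = TargetEncoder(random_state=0)\n    fold_encoder.fit(X_train.iloc[train_idx], y_train[train_idx])\n    X_train_cf_sketch[encode_idx] = fold_encoder.transform(X_train.iloc[encode_idx])" ] },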
{ "cell_type": "markdown", "metadata": {}, "source": [ "In practice, fitting the pipeline triggers this cross fitted encoding for us.\nWe see that the model fits the data well and generalizes to the test set:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import TargetEncoder\n\nmodel_with_cf = make_pipeline(TargetEncoder(random_state=0), ridge)\nmodel_with_cf.fit(X_train, y_train)\nprint(\"Model with CF on train set: \", model_with_cf.score(X_train, y_train))\nprint(\"Model with CF on test set: \", model_with_cf.score(X_test, y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The coefficients of the linear model show that most of the weight is on the\nfeature at column index 0, which is the informative feature:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\nimport pandas as pd\n\nplt.rcParams[\"figure.constrained_layout.use\"] = True\n\ncoefs_cf = pd.Series(\n    model_with_cf[-1].coef_, index=model_with_cf[-1].feature_names_in_\n).sort_values()\nax = coefs_cf.plot(kind=\"barh\")\n_ = ax.set(\n    title=\"Target encoded with cross fitting\",\n    xlabel=\"Ridge coefficient\",\n    ylabel=\"Feature\",\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While :meth:`TargetEncoder.fit_transform` uses an internal\n:term:`cross fitting` scheme to learn encodings for the training set,\n:meth:`TargetEncoder.transform` itself does not: it applies encodings learnt\nfrom the complete training set. Thus, we can use :meth:`TargetEncoder.fit`\nfollowed by :meth:`TargetEncoder.transform` to disable the\n:term:`cross fitting`. This encoding is then passed to the ridge model:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "target_encoder = TargetEncoder(random_state=0)\ntarget_encoder.fit(X_train, y_train)\nX_train_no_cf_encoding = target_encoder.transform(X_train)\nX_test_no_cf_encoding = target_encoder.transform(X_test)\n\nmodel_no_cf = ridge.fit(X_train_no_cf_encoding, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We evaluate the model that did not use :term:`cross fitting` when encoding and\nsee that it overfits:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(\n    \"Model without CF on training set: \",\n    model_no_cf.score(X_train_no_cf_encoding, y_train),\n)\nprint(\n    \"Model without CF on test set: \",\n    model_no_cf.score(X_test_no_cf_encoding, y_test),\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ridge model overfits because, without :term:`cross fitting`, it assigns\nfar more weight to the uninformative, extremely high cardinality\n(\"near_unique\") and medium cardinality (\"shuffled\") features than it did when\n:term:`cross fitting` was used to encode the features:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "coefs_no_cf = pd.Series(\n    model_no_cf.coef_, index=model_no_cf.feature_names_in_\n).sort_values()\nax = coefs_no_cf.plot(kind=\"barh\")\n_ = ax.set(\n    title=\"Target encoded without cross fitting\",\n    xlabel=\"Ridge coefficient\",\n    ylabel=\"Feature\",\n)" ] },
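{ "cell_type": "markdown", "metadata": {}, "source": [ "To see the leakage directly (an illustrative check built from the objects\ndefined above), we can compare how strongly the encoded \"near_unique\" column\ncorrelates with the training target with and without :term:`cross fitting`.\nWithout it, the encoding of the near-unique identifiers leaks `y_train`:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_train_cf_encoding = TargetEncoder(random_state=0).fit_transform(X_train, y_train)\n\nfor name, encoded in [\n    (\"with CF\", X_train_cf_encoding),\n    (\"without CF\", X_train_no_cf_encoding),\n]:\n    corr = np.corrcoef(encoded[\"near_unique\"], y_train)[0, 1]\n    print(f\"corr(near_unique encoding, y_train) {name}: {corr:.3f}\")" ] },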
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\nThis example demonstrates the importance of :class:`TargetEncoder`'s internal\n:term:`cross fitting`. It is important to use\n:meth:`TargetEncoder.fit_transform` to encode training data before passing it\nto a machine learning model. When a :class:`TargetEncoder` is part of a\n:class:`~sklearn.pipeline.Pipeline` and the pipeline is fitted, the pipeline\nwill correctly call :meth:`TargetEncoder.fit_transform` and use\n:term:`cross fitting` when encoding the training data.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }