{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Introducing the `set_output` API\n\n.. currentmodule:: sklearn\n\nThis example will demonstrate the `set_output` API to configure transformers to\noutput pandas DataFrames. `set_output` can be configured per estimator by calling\nthe `set_output` method or globally by setting `set_config(transform_output=\"pandas\")`.\nFor details, see\n[SLEP018](https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep018/proposal.html)_.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we load the iris dataset as a DataFrame to demonstrate the `set_output` API.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import load_iris\nfrom sklearn.model_selection import train_test_split\n\nX, y = load_iris(as_frame=True, return_X_y=True)\nX_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)\nX_train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To configure an estimator such as :class:`preprocessing.StandardScaler` to return\nDataFrames, call `set_output`. This feature requires pandas to be installed.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.preprocessing import StandardScaler\n\nscaler = StandardScaler().set_output(transform=\"pandas\")\n\nscaler.fit(X_train)\nX_test_scaled = scaler.transform(X_test)\nX_test_scaled.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`set_output` can be called after `fit` to configure `transform` after the fact.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "scaler2 = StandardScaler()\n\nscaler2.fit(X_train)\nX_test_np = scaler2.transform(X_test)\nprint(f\"Default output type: {type(X_test_np).__name__}\")\n\nscaler2.set_output(transform=\"pandas\")\nX_test_df = scaler2.transform(X_test)\nprint(f\"Configured pandas output type: {type(X_test_df).__name__}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a :class:`pipeline.Pipeline`, `set_output` configures all steps to output\nDataFrames.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.feature_selection import SelectPercentile\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.pipeline import make_pipeline\n\nclf = make_pipeline(\n StandardScaler(), SelectPercentile(percentile=75), LogisticRegression()\n)\nclf.set_output(transform=\"pandas\")\nclf.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each transformer in the pipeline is configured to return DataFrames. This\nmeans that the final logistic regression step contains the feature names of the input.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "clf[-1].feature_names_in_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
<div class=\"alert alert-info\"><h4>Note</h4><p>If one uses the method `set_params`, the transformer will be\n replaced by a new one with the default output format.</p></div>
\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "clf.set_params(standardscaler=StandardScaler())\nclf.fit(X_train, y_train)\nclf[-1].feature_names_in_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To keep the intended behavior, use `set_output` on the new transformer\nbeforehand\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "scaler = StandardScaler().set_output(transform=\"pandas\")\nclf.set_params(standardscaler=scaler)\nclf.fit(X_train, y_train)\nclf[-1].feature_names_in_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we load the titanic dataset to demonstrate `set_output` with\n:class:`compose.ColumnTransformer` and heterogeneous data.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import fetch_openml\n\nX, y = fetch_openml(\"titanic\", version=1, as_frame=True, return_X_y=True)\nX_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `set_output` API can be configured globally by using :func:`set_config` and\nsetting `transform_output` to `\"pandas\"`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn import set_config\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.preprocessing import OneHotEncoder, StandardScaler\n\nset_config(transform_output=\"pandas\")\n\nnum_pipe = make_pipeline(SimpleImputer(), StandardScaler())\nnum_cols = [\"age\", \"fare\"]\nct = ColumnTransformer(\n (\n (\"numerical\", num_pipe, num_cols),\n (\n \"categorical\",\n OneHotEncoder(\n sparse_output=False, drop=\"if_binary\", handle_unknown=\"ignore\"\n ),\n [\"embarked\", \"sex\", \"pclass\"],\n ),\n ),\n verbose_feature_names_out=False,\n)\nclf = make_pipeline(ct, SelectPercentile(percentile=50), LogisticRegression())\nclf.fit(X_train, y_train)\nclf.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the global configuration, all transformers output DataFrames. This allows us to\neasily plot the logistic regression coefficients with the corresponding feature names.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n\nlog_reg = clf[-1]\ncoef = pd.Series(log_reg.coef_.ravel(), index=log_reg.feature_names_in_)\n_ = coef.sort_values().plot.barh()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to demonstrate the :func:`config_context` functionality below, let\nus first reset `transform_output` to its default value.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "set_config(transform_output=\"default\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When configuring the output type with :func:`config_context` the\nconfiguration at the time when `transform` or `fit_transform` are\ncalled is what counts. 
Setting these only when you construct or fit\nthe transformer has no effect.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn import config_context\n\nscaler = StandardScaler()\nscaler.fit(X_train[num_cols])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "with config_context(transform_output=\"pandas\"):\n # the output of transform will be a Pandas DataFrame\n X_test_scaled = scaler.transform(X_test[num_cols])\nX_test_scaled.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "outside of the context manager, the output will be a NumPy array\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_test_scaled = scaler.transform(X_test[num_cols])\nX_test_scaled[:5]" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }