{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Introducing the `set_output` API\n\nThis example demonstrates how the `set_output` API configures transformers to\noutput pandas DataFrames. The output format can be set per estimator by calling\nthe `set_output` method, or globally with `set_config(transform_output=\"pandas\")`.\nFor details, see\n[SLEP018](https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep018/proposal.html).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we load the iris dataset as a DataFrame to demonstrate the `set_output` API.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import load_iris\nfrom sklearn.model_selection import train_test_split\n\nX, y = load_iris(as_frame=True, return_X_y=True)\nX_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)\nX_train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To configure an estimator such as `sklearn.preprocessing.StandardScaler` to return\nDataFrames, call `set_output`. 
This feature requires pandas to be installed.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.preprocessing import StandardScaler\n\nscaler = StandardScaler().set_output(transform=\"pandas\")\n\nscaler.fit(X_train)\nX_test_scaled = scaler.transform(X_test)\nX_test_scaled.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`set_output` can also be called after `fit`; it then applies to subsequent `transform` calls.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "scaler2 = StandardScaler()\n\nscaler2.fit(X_train)\nX_test_np = scaler2.transform(X_test)\nprint(f\"Default output type: {type(X_test_np).__name__}\")\n\nscaler2.set_output(transform=\"pandas\")\nX_test_df = scaler2.transform(X_test)\nprint(f\"Configured pandas output type: {type(X_test_df).__name__}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a `sklearn.pipeline.Pipeline`, `set_output` configures every transformer step to output\nDataFrames.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.feature_selection import SelectPercentile\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.pipeline import make_pipeline\n\nclf = make_pipeline(\n    StandardScaler(), SelectPercentile(percentile=75), LogisticRegression()\n)\nclf.set_output(transform=\"pandas\")\nclf.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each transformer in the pipeline is configured to return DataFrames. As a result,\nthe final logistic regression step is fitted on a DataFrame and records the input\nfeature names in `feature_names_in_`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "clf[-1].feature_names_in_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
If a transformer is replaced with `set_params`, the new transformer\nuses the default output format.