{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Scaling data with `scikit-learn`\n", "\n", "Many machine learning techniques require standardized data. In this notebook, we discuss typical standardization schemes offered by `scikit-learn`'s `preprocessing` module: " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "import io\n", "import pandas\n", "from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us assume we have the following data, read using the `pandas.read_csv` function:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " age weight height\n", "0 23 70 180\n", "1 22 65 160\n", "2 31 80 190\n", "3 26 80 175\n", "4 22 65 170\n" ] } ], "source": [ "file_content = io.StringIO(\"\"\"age;weight;height\n", "23;70;180\n", "22;65;160\n", "31;80;190\n", "26;80;175\n", "22;65;170\n", "\"\"\")\n", "\n", "df = pandas.read_csv(file_content, sep=\";\")\n", "print(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All three variables have rather different means and standard deviations, which can be problematic for some machine learning tools:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Means\n", "age 24.8\n", "weight 72.0\n", "height 175.0\n", "dtype: float64\n", "\n", "Standard deviations\n", "age 3.834058\n", "weight 7.582875\n", "height 11.180340\n", "dtype: float64\n" ] } ], "source": [ "print(\"Means\")\n", "print(df.mean(axis=0))\n", "print(\"\\nStandard deviations\")\n", "print(df.std(axis=0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this 
case, `scikit-learn` offers scaler objects that rescale data on a per-variable basis. In this tutorial, we will present the following scalers:\n", "* [`StandardScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) rescales the data to have zero mean and unit variance (see the documentation if you want only unit-variance scaling or only mean centering);\n", "* [`MinMaxScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler) rescales the data to lie in the [0,1] interval (see the documentation if you want to change the interval boundaries);\n", "* [`MaxAbsScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler) divides each variable by its maximum absolute value, so that the data lies in the [-1,1] interval." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scaled data\n", "[[-0.52489066 -0.29488391 0.5 ]\n", " [-0.81649658 -1.03209369 -1.5 ]\n", " [ 1.80795671 1.17953565 1.5 ]\n", " [ 0.34992711 1.17953565 0. ]\n", " [-0.81649658 -1.03209369 -0.5 ]]\n", "\n", "Means\n", "[ -1.99840144e-16 0.00000000e+00 0.00000000e+00]\n", "\n", "Standard deviations\n", "[ 1. 1. 
1.]\n" ] } ], "source": [ "scaler = StandardScaler()\n", "scaler.fit(df)\n", "df_scaled = scaler.transform(df)\n", "\n", "print(\"Scaled data\")\n", "print(df_scaled)\n", "print(\"\\nMeans\")\n", "print(df_scaled.mean(axis=0))\n", "print(\"\\nStandard deviations\")\n", "print(df_scaled.std(axis=0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that `pandas` dataframes are turned into `numpy` arrays after scaling (by default, `scikit-learn` transformers return `numpy` arrays).\n", "\n", "Once the scaler has been fitted to the data, the `transform` method turns unscaled data into its scaled equivalent, while `inverse_transform` transforms scaled data back to its unscaled representation:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 23. 70. 180.]\n", " [ 22. 65. 160.]\n", " [ 31. 80. 190.]\n", " [ 26. 80. 175.]\n", " [ 22. 65. 170.]]\n" ] } ], "source": [ "print(scaler.inverse_transform(df_scaled))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Other scalers can be used in a similar manner:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scaled data\n", "[[ 0.11111111 0.33333333 0.66666667]\n", " [ 0. 0. 0. ]\n", " [ 1. 1. 1. ]\n", " [ 0.44444444 1. 0.5 ]\n", " [ 0. 0. 0.33333333]]\n", "\n", "Minimum values\n", "[ 0. 0. 0.]\n", "\n", "Maximum values\n", "[ 1. 1. 1.]\n", "\n", "Inverse transforms\n", "[[ 23. 70. 180.]\n", " [ 22. 65. 160.]\n", " [ 31. 80. 190.]\n", " [ 26. 80. 175.]\n", " [ 22. 65. 
170.]]\n" ] } ], "source": [ "scaler = MinMaxScaler()\n", "scaler.fit(df)\n", "df_scaled = scaler.transform(df)\n", "\n", "print(\"Scaled data\")\n", "print(df_scaled)\n", "print(\"\\nMinimum values\")\n", "print(df_scaled.min(axis=0))\n", "print(\"\\nMaximum values\")\n", "print(df_scaled.max(axis=0))\n", "print(\"\\nInverse transforms\")\n", "print(scaler.inverse_transform(df_scaled))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scaled data\n", "[[ 0.74193548 0.875 0.94736842]\n", " [ 0.70967742 0.8125 0.84210526]\n", " [ 1. 1. 1. ]\n", " [ 0.83870968 1. 0.92105263]\n", " [ 0.70967742 0.8125 0.89473684]]\n", "\n", "Minimum values\n", "[ 0.70967742 0.8125 0.84210526]\n", "\n", "Maximum values\n", "[ 1. 1. 1.]\n", "\n", "Inverse transforms\n", "[[ 23. 70. 180.]\n", " [ 22. 65. 160.]\n", " [ 31. 80. 190.]\n", " [ 26. 80. 175.]\n", " [ 22. 65. 170.]]\n" ] } ], "source": [ "scaler = MaxAbsScaler()\n", "scaler.fit(df)\n", "df_scaled = scaler.transform(df)\n", "\n", "print(\"Scaled data\")\n", "print(df_scaled)\n", "print(\"\\nMinimum values\")\n", "print(df_scaled.min(axis=0))\n", "print(\"\\nMaximum values\")\n", "print(df_scaled.max(axis=0))\n", "print(\"\\nInverse transforms\")\n", "print(scaler.inverse_transform(df_scaled))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 4 }