{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Scaling data with `scikit-learn`\n",
    "\n",
    "Many machine learning techniques are sensitive to the scale of their input variables and therefore require standardized data. In this notebook, we discuss typical standardization schemes offered by `scikit-learn`'s `preprocessing` module: <http://scikit-learn.org/stable/modules/preprocessing.html>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "import io\n",
    "import pandas\n",
    "from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us assume we have the following data, read using the `pandas.read_csv` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   age  weight  height\n",
      "0   23      70     180\n",
      "1   22      65     160\n",
      "2   31      80     190\n",
      "3   26      80     175\n",
      "4   22      65     170\n"
     ]
    }
   ],
   "source": [
    "file_content = io.StringIO(\"\"\"age;weight;height\n",
    "23;70;180\n",
    "22;65;160\n",
    "31;80;190\n",
    "26;80;175\n",
    "22;65;170\n",
    "\"\"\")\n",
    "\n",
    "df = pandas.read_csv(file_content, sep=\";\")\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The three variables have rather different means and variances, which can be problematic for some machine learning tools:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Means\n",
      "age        24.8\n",
      "weight     72.0\n",
      "height    175.0\n",
      "dtype: float64\n",
      "\n",
      "Standard deviations\n",
      "age        3.834058\n",
      "weight     7.582875\n",
      "height    11.180340\n",
      "dtype: float64\n"
     ]
    }
   ],
   "source": [
    "print(\"Means\")\n",
    "print(df.mean(axis=0))\n",
    "print(\"\\nStandard deviations\")\n",
    "print(df.std(axis=0))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this purpose, `scikit-learn` offers scaler objects that rescale data on a per-variable basis. In this tutorial, we will present the following scalers:\n",
    "* [`StandardScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) rescales the data to have zero mean and unit variance (see the `with_mean` and `with_std` parameters in the doc if you want to do either unit variance normalization only or zero mean normalization only);\n",
    "* [`MinMaxScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler) rescales the data to lie in the [0,1] interval (see the `feature_range` parameter in the doc if you want to change the interval boundaries);\n",
    "* [`MaxAbsScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler) divides each variable by its maximum absolute value, so that the data lies in the [-1,1] interval (the data is not shifted, so non-negative data ends up in [0,1])."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Scaled data\n",
      "[[-0.52489066 -0.29488391  0.5       ]\n",
      " [-0.81649658 -1.03209369 -1.5       ]\n",
      " [ 1.80795671  1.17953565  1.5       ]\n",
      " [ 0.34992711  1.17953565  0.        ]\n",
      " [-0.81649658 -1.03209369 -0.5       ]]\n",
      "\n",
      "Means\n",
      "[ -1.99840144e-16   0.00000000e+00   0.00000000e+00]\n",
      "\n",
      "Standard deviations\n",
      "[ 1.  1.  1.]\n"
     ]
    }
   ],
   "source": [
    "scaler = StandardScaler()\n",
    "scaler.fit(df)\n",
    "df_scaled = scaler.transform(df)\n",
    "\n",
    "print(\"Scaled data\")\n",
    "print(df_scaled)\n",
    "print(\"\\nMeans\")\n",
    "print(df_scaled.mean(axis=0))\n",
    "print(\"\\nStandard deviations\")\n",
    "print(df_scaled.std(axis=0))"
   ]
  },
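  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A note on the numbers above: `DataFrame.std` defaults to the sample standard deviation (`ddof=1`), whereas `StandardScaler` divides by the population standard deviation (`ddof=0`). As a quick check, sketched here with the illustrative names `check_scaler` and `manual` (they are not part of the original example), the scaled values can be reproduced by hand:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy\n",
    "\n",
    "# Illustrative check with a separate scaler, leaving `scaler` and `df_scaled` untouched\n",
    "check_scaler = StandardScaler().fit(df)\n",
    "\n",
    "# Manual z-score using the population standard deviation (ddof=0)\n",
    "manual = (df - df.mean(axis=0)) / df.std(axis=0, ddof=0)\n",
    "\n",
    "# Per-variable standard deviations stored by the fitted scaler\n",
    "print(check_scaler.scale_)\n",
    "\n",
    "# The two results should agree up to floating-point error\n",
    "print(numpy.allclose(check_scaler.transform(df), manual.values))"
   ]
  },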
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the `pandas` dataframe is turned into a `numpy` array by scaling (by default, `scikit-learn` scalers return `numpy` arrays).\n",
    "\n",
    "Once the scaler has been fitted to the data, the `transform` method turns unscaled data into its scaled equivalent, while `inverse_transform` transforms scaled data back to its unscaled representation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[  23.   70.  180.]\n",
      " [  22.   65.  160.]\n",
      " [  31.   80.  190.]\n",
      " [  26.   80.  175.]\n",
      " [  22.   65.  170.]]\n"
     ]
    }
   ],
   "source": [
    "print(scaler.inverse_transform(df_scaled))"
   ]
  },
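  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the scaler stores the statistics learned by `fit`, it can also be applied to observations it has not seen. The sketch below assumes a hypothetical new individual (`df_new` and its values are made up for illustration) and wraps the result in a `pandas.DataFrame` to keep the column labels:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical new observation with the same columns as `df`\n",
    "df_new = pandas.DataFrame({'age': [28], 'weight': [75], 'height': [182]})\n",
    "\n",
    "# Reuse the statistics learned from `df`; fit is not called again on the new data\n",
    "new_scaled = pandas.DataFrame(scaler.transform(df_new), columns=df_new.columns)\n",
    "print(new_scaled)"
   ]
  },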
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Other scalers can be used in a similar manner:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Scaled data\n",
      "[[ 0.11111111  0.33333333  0.66666667]\n",
      " [ 0.          0.          0.        ]\n",
      " [ 1.          1.          1.        ]\n",
      " [ 0.44444444  1.          0.5       ]\n",
      " [ 0.          0.          0.33333333]]\n",
      "\n",
      "Minimum values\n",
      "[ 0.  0.  0.]\n",
      "\n",
      "Maximum values\n",
      "[ 1.  1.  1.]\n",
      "\n",
      "Inverse transforms\n",
      "[[  23.   70.  180.]\n",
      " [  22.   65.  160.]\n",
      " [  31.   80.  190.]\n",
      " [  26.   80.  175.]\n",
      " [  22.   65.  170.]]\n"
     ]
    }
   ],
   "source": [
    "scaler = MinMaxScaler()\n",
    "scaler.fit(df)\n",
    "df_scaled = scaler.transform(df)\n",
    "\n",
    "print(\"Scaled data\")\n",
    "print(df_scaled)\n",
    "print(\"\\nMinimum values\")\n",
    "print(df_scaled.min(axis=0))\n",
    "print(\"\\nMaximum values\")\n",
    "print(df_scaled.max(axis=0))\n",
    "print(\"\\nInverse transforms\")\n",
    "print(scaler.inverse_transform(df_scaled))"
   ]
  },
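  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The interval boundaries of `MinMaxScaler` can be changed through its `feature_range` parameter. As a sketch, the cell below maps each variable to the [-1,1] interval; the names `sym_scaler` and `df_sym` are illustrative and chosen so that `scaler` and `df_scaled` are left untouched:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Rescale every variable to [-1, 1] instead of the default [0, 1]\n",
    "sym_scaler = MinMaxScaler(feature_range=(-1, 1))\n",
    "df_sym = sym_scaler.fit_transform(df)\n",
    "\n",
    "print(df_sym)\n",
    "\n",
    "# Per-variable minimum and maximum of the rescaled data\n",
    "print(df_sym.min(axis=0))\n",
    "print(df_sym.max(axis=0))"
   ]
  },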
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Scaled data\n",
      "[[ 0.74193548  0.875       0.94736842]\n",
      " [ 0.70967742  0.8125      0.84210526]\n",
      " [ 1.          1.          1.        ]\n",
      " [ 0.83870968  1.          0.92105263]\n",
      " [ 0.70967742  0.8125      0.89473684]]\n",
      "\n",
      "Minimum values\n",
      "[ 0.70967742  0.8125      0.84210526]\n",
      "\n",
      "Maximum values\n",
      "[ 1.  1.  1.]\n",
      "\n",
      "Inverse transforms\n",
      "[[  23.   70.  180.]\n",
      " [  22.   65.  160.]\n",
      " [  31.   80.  190.]\n",
      " [  26.   80.  175.]\n",
      " [  22.   65.  170.]]\n"
     ]
    }
   ],
   "source": [
    "scaler = MaxAbsScaler()\n",
    "scaler.fit(df)\n",
    "df_scaled = scaler.transform(df)\n",
    "\n",
    "print(\"Scaled data\")\n",
    "print(df_scaled)\n",
    "print(\"\\nMinimum values\")\n",
    "print(df_scaled.min(axis=0))\n",
    "print(\"\\nMaximum values\")\n",
    "print(df_scaled.max(axis=0))\n",
    "print(\"\\nInverse transforms\")\n",
    "print(scaler.inverse_transform(df_scaled))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}