{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Introduction\n", "\n", "We have deployed our model, and botanists all over the world are successfully using it to determine the class of Iris flowers!\n", "\n", "Enjoying our success and the data science we have been doing has led us to spend more time learning about LinearSVC models.\n", "\n", "One thing we noticed in the [LinearSVC documentation](https://scikit-learn.org/stable/modules/svm.html#svm-classification) was the following:\n", "\n", "> Support Vector Machine algorithms are not scale invariant, so it is highly recommended to scale your data.\n", "\n", "Oh. What does 'scale your data' mean?\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "### Scaling Data\n", "\n", "There are many ways to scale data (see the [Appendix](#Appendix)). A good starting point for learning about scalers is StandardScaler:\n", "\n", "> Standardize features by removing the mean and scaling to unit variance.\n", ">\n", "> *Source: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html*\n", "\n", "We will try the StandardScaler below:\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# ensure everyone running this code gets the same result\n", "import numpy as np\n", "np.random.seed(100)\n", "\n", "from sklearn import datasets\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import StandardScaler\n", "iris = datasets.load_iris()\n", "X = iris.data\n", "y = iris.target\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)\n", "\n", "# We 'fit' the StandardScaler on the training data only\n", "std_slc = StandardScaler()\n", "std_slc.fit(X_train)\n", "\n", "# We can then use the fitted scaler to transform our training and test datasets\n", "X_train_std = std_slc.transform(X_train)\n", "X_test_std = std_slc.transform(X_test)" ] }, { 
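"cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "To make 'removing the mean and scaling to unit variance' concrete, here is a minimal sketch (an addition to the original notebook, not the scikit-learn implementation) that reproduces the transform by hand for the Sepal Width column. It assumes the mean and variance that are printed further down in this notebook, and it assumes 100 training rows (150 samples with `test_size=0.33`). The helper `standardize` is a hypothetical name used only for illustration.

```python
import math

# Figures copied from the mean/variance output printed later in this notebook
# for the Sepal Width training column.
n_train = 100                     # assumption: 150 rows, test_size=0.33 -> 100 training rows
mean = 3.0509999999999993         # 'Unscaled Mean'
sample_var = 0.17646363636363638  # 'Unscaled Variance' (pandas default, divides by n - 1)

# StandardScaler divides by the population standard deviation (divides by n),
# so convert the sample variance before taking the square root.
population_var = sample_var * (n_train - 1) / n_train
population_std = math.sqrt(population_var)


def standardize(x):
    # z = (x - mean) / standard deviation
    return (x - mean) / population_std


# Sepal Width of the first five training rows, as shown in the unscaled table.
first_rows = [3.3, 3.2, 2.8, 3.8, 3.0]
scaled = [round(standardize(x), 6) for x in first_rows]
print(scaled)
```

If the assumptions hold, the printed values should agree closely with the Sepal Width column of the scaled table shown below in this notebook." ] }, {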
"cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "Next we can print out the datasets. We wrap the datasets with pandas dataframes to give us easier to read tables." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "X_train_df = pd.DataFrame(X_train, columns=[\"Sepal Length\", \"Sepal Width\", \"Petal Length\", \"Petal Width\"])\n", "X_train_std_df = pd.DataFrame(X_train_std, columns=[\"Sepal Length\", \"Sepal Width\", \"Petal Length\", \"Petal Width\"])" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "Let's look at the first few records of the unscaled and the scaled dataset" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Sepal Length</th><th>Sepal Width</th><th>Petal Length</th><th>Petal Width</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>5.1</td><td>3.3</td><td>1.7</td><td>0.5</td></tr>\n",
"    <tr><th>1</th><td>5.0</td><td>3.2</td><td>1.2</td><td>0.2</td></tr>\n",
"    <tr><th>2</th><td>6.5</td><td>2.8</td><td>4.6</td><td>1.5</td></tr>\n",
"    <tr><th>3</th><td>7.9</td><td>3.8</td><td>6.4</td><td>2.0</td></tr>\n",
"    <tr><th>4</th><td>6.1</td><td>3.0</td><td>4.9</td><td>1.8</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Sepal Length Sepal Width Petal Length Petal Width\n", "0 5.1 3.3 1.7 0.5\n", "1 5.0 3.2 1.2 0.2\n", "2 6.5 2.8 4.6 1.5\n", "3 7.9 3.8 6.4 2.0\n", "4 6.1 3.0 4.9 1.8" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Sepal Length</th><th>Sepal Width</th><th>Petal Length</th><th>Petal Width</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>-0.916069</td><td>0.595736</td><td>-1.251745</td><td>-0.998077</td></tr>\n",
"    <tr><th>1</th><td>-1.040535</td><td>0.356485</td><td>-1.546968</td><td>-1.413942</td></tr>\n",
"    <tr><th>2</th><td>0.826454</td><td>-0.600521</td><td>0.460548</td><td>0.388141</td></tr>\n",
"    <tr><th>3</th><td>2.568977</td><td>1.791994</td><td>1.523351</td><td>1.081250</td></tr>\n",
"    <tr><th>4</th><td>0.328590</td><td>-0.122018</td><td>0.637682</td><td>0.804006</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Sepal Length Sepal Width Petal Length Petal Width\n", "0 -0.916069 0.595736 -1.251745 -0.998077\n", "1 -1.040535 0.356485 -1.546968 -1.413942\n", "2 0.826454 -0.600521 0.460548 0.388141\n", "3 2.568977 1.791994 1.523351 1.081250\n", "4 0.328590 -0.122018 0.637682 0.804006" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(X_train_df.head())\n", "display(X_train_std_df.head())" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "We won't describe in detail the scaling process because that is covered in the StandardScaler documentation.\n", "\n", "However, let's revist the description of the StandardScaler:\n", "\n", "> Standardize features by removing the mean and scaling to unit variance\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "We can print out the mean and variance as follows (note that we are looking at Sepal Width):" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "---\n", "Unscaled Mean: 3.0509999999999993\n", "Scaled Mean: 6.483702463810914e-16\n", "\n", "Unscaled Variance: 0.17646363636363638\n", "Scaled Variance: 1.0101010101010097\n", "---\n" ] } ], "source": [ "print(\"---\")\n", "print(\"Unscaled Mean: \" + str(X_train_df['Sepal Width'].mean()))\n", "print(\"Scaled Mean: \" + str(X_train_std_df['Sepal Width'].mean()))\n", "print(\"\")\n", "print(\"Unscaled Variance: \" + str(X_train_df['Sepal Width'].var()))\n", "print(\"Scaled Variance: \" + str(X_train_std_df['Sepal Width'].var()))\n", "print(\"---\")" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "You can see that the scaled mean is approximately zero and the variance is approximately one.\n", "\n", "This can be viewed graphically:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "metadata": {}, "output_type": 
"display_data" }, { "data": { "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX4AAAD4CAYAAADrRI2NAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8/fFQqAAAACXBIWXMAAAsTAAALEwEAmpwYAAAeF0lEQVR4nO3deXRV9b338feXgIQUZAyKIgYtxKIkAROGBAUVAYWrCFJEi6BVHECRR63YAdFVnsXt460KtQXuI0WsVORinVBbBi0tASFAFBCQgrk2wgKMZfIy833+yCEPQ4YTyD4ncX9ea2Xl7H328OWQ9cnO7/zOd5u7IyIi4VEr3gWIiEhsKfhFREJGwS8iEjIKfhGRkFHwi4iETO14FxCNZs2aeUpKSrzLEBGpUVauXPm1uyefur5GBH9KSgp5eXnxLkNEpEYxs/8ubb2GekREQkbBLyISMgp+EZGQqRFj/CLfdYcPH6awsJADBw7EuxSpgRITE2nZsiV16tSJansFv0g1UFhYSIMGDUhJScHM4l2O1CDuTlFREYWFhbRu3TqqfTTUI1INHDhwgKZNmyr0pdLMjKZNm1bqr0UFv0g1odCXM1XZnx0Fv4hIyGiMX6QaShk7r0qPVzCxb/nPFxTQr18/1q5dW7Ju/Pjx1K9fn8cee6xKaznVjBkzyMvL4ze/+U3U+xz/UGezZs1OWj99+nSee+45zIxjx44xYcIEbr755iqrtbTXCeCWW25h2LBh9O/fH4DU1FSGDh3Kz3/+cwAGDhzIHXfcwY4dO0hKSuLOO+8s87j5+fls3bqVG2+8EQjm/0HBL8Ea3zDAY+8O7thS4xQWFjJhwgRWrVpFw4YN2bdvHzt37ozJubOzs8nNzaV///4UFRVRv359li5dWvL80qVLefHFFzn//PMrPFZ+fj55eXklwR8EDfWISIV69OjBE088QadOnWjbti1/+9vfAFi3bh2dOnUiIyODtLQ0Nm3aBMDMmTNJS0sjPT2doUOHAvDOO+/QuXNnOnToQM+ePdm+fftp59m5cycDBw4kKyuLrKwslixZAkBRURG9evWiQ4cO3HfffZR258AdO3bQoEED6tevD0D9+vVLZrls3ryZPn36cOWVV3LVVVexYcMGAIYPH87999/PVVddRdu2bXn33XeB4ivwq666io4dO9KxY0dyc3PLfX1ycnJKtsnNzaVfv37s3LkTd+eLL76gXr16nH/++YwfP55nn30WgJUrV5Kenk7Xrl158cUXATh06BDjxo1j9uzZZGRkMHv2bAA+++wzevTowSWXXMKkSZOi+j8rj4JfRKJy5MgRli9fzvPPP8/TTz8NwJQpUxg9enTJVWrLli1Zt24dEyZMYNGiRXzyySe88MILAHTr1o1ly5axevVqbrvtNn71q1+ddo7Ro0czZswYVqxYwdy5c7nnnnsAePrpp+nWrRurV6/mpptu4ssvvzxt3/T0dM477zxat27NXXfdxTvvvFPy3IgRI5g8eTIrV67k2Wef5cEHHyx5rqCggL/+9a/MmzeP+++/nwMHDtC8eXPmz5/PqlWrmD17Ng8//HC5r82VV17J2rVrOXToELm5uXTt2pXU1FTWr19Pbm4uOTk5p+1z1113MWnSpJP+MjjnnHN45plnGDx4MPn5+QwePBiADRs28Oc//5nly5fz9NNPc/jw4XLrqYiGekSkzFkhJ64fMGAAUBxyBQUFAHTt2pUJEyZQWFjIgAEDaNOmDYsWLeLWW28tGX9v0qQJUDwUM3jwYLZt28ahQ4dKnXO+YMECPvvss5LlPXv2sHfvXhYvXswbb7wBQN++fWncuPFp+yYkJPDBBx+wYsUKFi5cyJgxY1i5ciWPPfYYubm
5DBo0qGTbgwcPljz+4Q9/SK1atWjTpg2XXHIJGzZsoHXr1owaNYr8/HwSEhL4/PPPy3396taty+WXX86qVatYtmwZP/nJT9iyZQu5ubmsXr2a7Ozsk7bfvXs3u3btonv37gAMHTqU999/v8zj9+3bl7p161K3bl2aN2/O9u3badmyZbk1lUdX/CJC06ZN+de//nXSum+++eakN0/r1q0LFAfskSNHALj99tt5++23qVevHr1792bRokW4e6m/SB566CFGjRrFmjVrmDp1aqnzzo8dO8bSpUvJz88nPz+fr776igYNGgDRTVk0Mzp16sSTTz7Ja6+9xty5czl27BiNGjUqOWZ+fj7r168/aZ9Tj/Hcc89x3nnn8cknn5CXl8ehQ4cqPHd2djaLFy9m7969NG7cmC5dupCbm1vqFX9Zr1FZjr/2cPLrf6YU/CJC/fr1adGiBQsXLgSKQ/+DDz6gW7du5e63ZcsWLrnkEh5++GFuuukmPv30U6677jpef/11ioqKSo4FxVe5F154IQAvv/xyqcfr1avXSbN78vPzAbj66qt59dVXAXj//fdP+yUFsHXrVlatWnXSvhdffDHnnnsurVu3Zs6cOUBx6H7yyScl282ZM4djx46xefNmtmzZQmpqKrt376ZFixbUqlWLV155haNHj5b7OkDxOP/UqVNJT08HIC0tjWXLlvHll19y+eWXn7Rto0aNaNiwIX//+98BSv5tAA0aNGDv3r0Vnu9saKhHpBqqaPplEGbOnMnIkSN59NFHAXjqqae49NJLy91n9uzZ/OEPf6BOnTqcf/75jBs3jiZNmvCzn/2M7t27k5CQQIcOHZgxYwbjx49n0KBBXHjhhXTp0oUvvvjitONNmjSJkSNHkpaWxpEjR7j66quZMmUKTz31FEOGDKFjx450796dVq1anbbv4cOHeeyxx9i6dSuJiYkkJyczZcoUoDhYH3jgAX75y19y+PBhbrvttpKATk1NpXv37mzfvp0pU6aQmJjIgw8+yMCBA5kzZw7XXHMN3/ve9yp8/bKzs9myZQtPPvkkALVr16Z58+ZcdNFF1Kp1+jX273//e+6++26SkpLo3bt3yfprrrmGiRMnkpGRUXKsqmalvTte3WRmZrpuxFJDaTpnVNavX88PfvCDeJcROsOHD6dfv37ceuut8S7lrJX2M2RmK90989RtNdQjIhIyGuoRkdCaMWNGvEuIC13xi4iEjIJfRCRkFPwiIiGj4BcRCRm9uStSHVX1NNgopr5OmDCBWbNmkZCQQK1atZg6dSqdO3eu1GnKaltcnrKmVC5btozRo0dz8OBBDh48yODBgxk/fnyl6qlIae2dX3jhBb744guef/55AO677z42b97MggULAJg8eTKbNm3izjvvZObMmaU2TTt+3Nq1azNr1qyS3kAfffQRzz77bEkzuHgJLPjNLBFYDNSNnOe/3P0pM2sCzAZSgALgh+5++sfwRCRmli5dyrvvvsuqVauoW7cuX3/9dVRtCoI0bNgwXn/9ddLT0zl69CgbN26MyXmzs7NP+iRtfn4+x44d4+jRoyQkJJS0X87MzCQz87Qp8ifZtWsXv/3tb09qClcdBDnUcxC41t3TgQygj5l1AcYCC929DbAwsiwicbRt2zaaNWtW0hOmWbNmXHDBBQCsWLGC7Oxs0tPT6dSpE3v37o2qbfHRo0d5/PHHycrKIi0tjalTpwLFLRNGjRpFu3bt6Nu3Lzt27Ci1ph07dtCiRQuguD9Nu3btAPj222+5++67ycrKokOHDrz11ltA8dTMm2++mT59+pCamlrSQRSgf//+XHnllVx++eVMmzat3NeiQ4cOfP755+zfv5/du3eTlJRERkYGa9asAYrbLmdnZ/PRRx/Rr18/oOy20WPHjmXz5s1kZGTw+OOPA7Bv3z5uvfVWLrvsMu64445SW0wHLbArfi/+1+yLLNaJfDlwM9Ajsv5l4CPgiaDqEJGK9erVi2eeeYa2bdv
Ss2dPBg8eTPfu3Tl06BCDBw9m9uzZZGVlsWfPHurVq1fStjgxMZFNmzYxZMgQTv10/UsvvUTDhg1ZsWIFBw8eJCcnh169erF69Wo2btzImjVr2L59O+3atePuu+8+raYxY8aQmppKjx496NOnD8OGDSMxMZEJEyZw7bXXMn36dHbt2kWnTp3o2bMnAMuXL2ft2rUkJSWRlZVF3759yczMZPr06TRp0oT9+/eTlZXFwIEDadq0aamvRe3atcnIyGDFihXs37+fzp0706ZNG3Jzc2nevDnuzkUXXcTmzZtL9jneNnrcuHHMmzev5JfLxIkTS+6qBcVDPatXr2bdunVccMEF5OTksGTJkgp7IlW1QN/cNbMEM8sHdgDz3f1j4Dx33wYQ+d48yBpEpGL169dn5cqVTJs2jeTkZAYPHsyMGTPYuHEjLVq0ICsrC4Bzzz2X2rVrc/jwYe69917at2/PoEGDTmqlfNxf/vIXZs6cSUZGBp07d6aoqIhNmzaxePFihgwZQkJCAhdccAHXXnttqTWNGzeOvLw8evXqxaxZs+jTp0/JcY/3sunRowcHDhwo6c9//fXX07RpU+rVq8eAAQNKmqBNmjSJ9PR0unTpwj//+c+SG8aU5fiNVY731u/atSu5ubksWbLktBbLAIsXL+ZHP/oRUHbb6OM6depEy5YtqVWrFhkZGSUtrmMp0Dd33f0okGFmjYA/mdkV0e5rZiOAEUCpDZlEpGolJCTQo0cPevToQfv27Xn55Zfp2LFjqe2DT2xbfOzYMRITE0/bxt2ZPHnySQ3IAN57772oWxJfeumlPPDAA9x7770kJydTVFSEuzN37lxSU1NP2vbjjz8utcXyRx99xIIFC1i6dClJSUklvyzKk52dXdI6euTIkSQnJ/PZZ5+RnJxc6k1Vjp8rGlXdYvlMxGQ6p7vvonhIpw+w3cxaAES+lzrA5+7T3D3T3TOTk5NjUaZIaG3cuPGkq+DjLY0vu+wytm7dyooVKwDYu3cvR44ciaptce/evfnd735Xcreozz//nG+//Zarr76a1157jaNHj7Jt2zY+/PDDUmuaN29eyfj3pk2bSEhIoFGjRvTu3ZvJkyeXPLd69eqSfebPn88333zD/v37efPNN8nJyWH37t00btyYpKQkNmzYwLJlyyp8PbKzs1m2bBk7d+6kefPmmBnJycm89dZbpV7xl9U2OhYtls9EkLN6koHD7r7LzOoBPYF/B94GhgETI9/fCqoGkRorxp1H9+3bx0MPPcSuXbuoXbs23//+95k2bRrnnHMOs2fP5qGHHmL//v3Uq1ePBQsWRNW2+J577qGgoICOHTvi7iQnJ/Pmm29yyy23sGjRItq3b0/btm1L7kJ1qldeeYUxY8aQlJRE7dq1efXVV0lISOAXv/gFjzzyCGlpabg7KSkpJdMju3XrxtChQ/nHP/7B7bffTmZmJu3bt2fKlCmkpaWRmppKly5dKnw9GjduTHJy8kl99Lt27cqSJUtK2jmfqKy20U2bNiUnJ4crrriCG264gb59Y99uuzSBtWU2szSK37xNoPgvi9fd/Rkzawq8DrQCvgQGufs35R1LbZlrMLVljoraMp+9GTNmkJeXd9KNXMKkMm2Zg5zV8ynQoZT1RcB1QZ1XRETKp0/uish3wvDhwxk+fHi8y6gR1KtHpJqoCXfDk+qpsj87Cn6RaiAxMbFkqqJIZbg7RUVFpU6pLYuGekSqgZYtW1JYWMjOnTvjXYrUQImJibRs2TLq7RX8ItVAnTp1aN26dbzLkJDQUI+ISMgo+EVEQkbBLyISMgp+EZGQUfCLiISMgl9EJGQU/CIiIaPgFxEJGQW/iEjIKPhFREJGwS8iEjIKfhGRkFHwi4iEjIJfRCRkFPwiIiGj4BcRCRkFv4hIyCj4RURCJrDgN7OLzOxDM1tvZuvMbHRk/Xgz+8rM8iNfNwZVg4iInC7Ie+4eAR5191Vm1gBYaWbzI8895+7PBnhuERE
pQ2DB7+7bgG2Rx3vNbD1wYVDnExGR6MRkjN/MUoAOwMeRVaPM7FMzm25mjcvYZ4SZ5ZlZ3s6dO2NRpohIKAQe/GZWH5gLPOLue4DfAZcCGRT/RfAfpe3n7tPcPdPdM5OTk4MuU0QkNAINfjOrQ3Hov+rubwC4+3Z3P+rux4D/BDoFWYOIiJwsyFk9BrwErHf3X5+wvsUJm90CrA2qBhEROV2Qs3pygKHAGjPLj6z7KTDEzDIABwqA+wKsQUREThHkrJ6/A1bKU+8FdU4REalYkFf8IiJxkzJ2XlTbFUzsG3Al1Y9aNoiIhIyCX0QkZBT8IiIho+AXEQkZBb+ISMhoVo+I1CjRztaRsumKX0QkZBT8IiIho+AXEQkZBb+ISMgo+EVEQkbBLyISMgp+EZGQUfCLiISMgl9EJGQU/CIiIaPgFxEJGQW/iEjIRBX8ZnZF0IWIiEhsRHvFP8XMlpvZg2bWKMiCREQkWFEFv7t3A+4ALgLyzGyWmV0faGUiIhKIqMf43X0T8HPgCaA7MMnMNpjZgNK2N7OLzOxDM1tvZuvMbHRkfRMzm29mmyLfG1fFP0RERKIT7Rh/mpk9B6wHrgX+zd1/EHn8XBm7HQEejWzXBRhpZu2AscBCd28DLIwsi4hIjER7xf8bYBWQ7u4j3X0VgLtvpfivgNO4+7YTtttL8S+NC4GbgZcjm70M9D/j6kVEpNKivfXijcB+dz8KYGa1gER3/x93f6Winc0sBegAfAyc5+7boPiXg5k1L2OfEcAIgFatWkVZpoiIVCTaK/4FQL0TlpMi6ypkZvWBucAj7r4n2sLcfZq7Z7p7ZnJycrS7iYhIBaIN/kR333d8IfI4qaKdzKwOxaH/qru/EVm93cxaRJ5vAeyoXMkiInI2og3+b82s4/EFM7sS2F/eDmZmwEvAenf/9QlPvQ0MizweBrwVfbkiInK2oh3jfwSYY2ZbI8stgMEV7JMDDAXWmFl+ZN1PgYnA62b2Y+BLYFBlChYRkbMTVfC7+wozuwxIBQzY4O6HK9jn75FtS3NdpaoUEZEqE+0VP0AWkBLZp4OZ4e4zA6lKREQCE1Xwm9krwKVAPnA0stoBBb+ISA0T7RV/JtDO3T3IYkREJHjRzupZC5wfZCEiIhIb0V7xNwM+M7PlwMHjK939pkCqEhGRwEQb/OODLEJERGIn2umcfzWzi4E27r7AzJKAhGBLExGRIETblvle4L+AqZFVFwJvBlSTiIgEKNo3d0dS/EncPVByU5ZSu2qKiEj1Fm3wH3T3Q8cXzKw2xfP4RUSkhok2+P9qZj8F6kXutTsHeCe4skREJCjRBv9YYCewBrgPeI8y7rwlIiLVW7Szeo4B/xn5EhGRGizaXj1fUMqYvrtfUuUVSWyNbxjvCkQASBk7L94lhEZlevUcl0hxD/0mVV+OiIgELaoxfncvOuHrK3d/Hrg22NJERCQI0Q71dDxhsRbFfwE0CKQiEREJVLRDPf9xwuMjQAHwwyqvRkREAhftrJ5rgi5ERERiI9qhnv9V3vPu/uuqKUdERIJWmVk9WcDbkeV/AxYD/wyiKBERCU5lbsTS0d33ApjZeGCOu98TVGEiIhKMaFs2tAIOnbB8CEip8mpERCRw0Qb/K8ByMxtvZk8BHwMzy9vBzKab2Q4zW3vCuvFm9pWZ5Ue+bjzz0kVE5ExE+wGuCcBdwL+AXcBd7v6/K9htBtCnlPXPuXtG5Ou9StQqIiJVINorfoAkYI+7vwAUmlnr8jZ298XAN2dTnIiIVL1ob734FPAE8GRkVR3gD2d4zlFm9mlkKKhxOeccYWZ5Zpa3c+fOMzyViIicKtor/luAm4BvAdx9K2fWsuF3wKVABrCNkz8RfBJ3n+bume6emZycfAanEhGR0kQb/Ifc3Ym0Zjaz753Jydx9u7sfPaG/f6czOY6IiJy5aIP/dTObCjQys3uBBZzBTVnMrMUJi7cAa8vaVkREglHhB7jMzIDZwGXAHiAVGOf
u8yvY749AD6CZmRUCTwE9zCyD4r8cCii+jaOIiMRQhcHv7m5mb7r7lUC5YX/KfkNKWf1SZYoTEZGqF23LhmVmluXuKwKtRqQygr5t5PjdwR5fJE6iDf5rgPvNrIDimT1G8R8DaUEVJiIiwSg3+M2slbt/CdwQo3pERCRgFV3xv0lxV87/NrO57j4wBjWJiEiAKprOaSc8viTIQkREJDYqCn4v47GIiNRQFQ31pJvZHoqv/OtFHsP/f3P33ECrExGRKldu8Lt7QqwKERGR2KhMW2YREfkOUPCLiISMgl9EJGQU/CIiIaPgFxEJGQW/iEjIKPhFREJGwS8iEjIKfhGRkFHwi4iEjIJfRCRkFPwiIiGj4BcRCRkFv4hIyAQW/GY23cx2mNnaE9Y1MbP5ZrYp8r1xUOcXEZHSBXnFPwPoc8q6scBCd28DLIwsi4hIDAUW/O6+GPjmlNU3Ay9HHr8M9A/q/CIiUrqKbr1Y1c5z920A7r7NzJqXtaGZjQBGALRq1SpG5YmcYHzDgI+/O9jjS1RSxs6LaruCiX0DriR2qu2bu+4+zd0z3T0zOTk53uWIiHxnxDr4t5tZC4DI9x0xPr+ISOjFOvjfBoZFHg8D3orx+UVEQi/I6Zx/BJYCqWZWaGY/BiYC15vZJuD6yLKIiMRQYG/uuvuQMp66LqhziohIxartm7siIhIMBb+ISMgo+EVEQkbBLyISMgp+EZGQUfCLiISMgl9EJGQU/CIiIaPgFxEJGQW/iEjIKPhFREJGwS8iEjIKfhGRkIn1rRelsoK+/Z9IwKK9taHEjq74RURCRsEvIhIyCn4RkZBR8IuIhIyCX0QkZDSrR+S7KugZYeN3B3v8GiraWUwFE/sGXEnZdMUvIhIyCn4RkZCJy1CPmRUAe4GjwBF3z4xHHSIiYRTPMf5r3P3rOJ5fRCSUNNQjIhIy8brid+AvZubAVHefduoGZjYCGAHQqlWrGJdXCeqlIyGlHjw1V7yu+HPcvSNwAzDSzK4+dQN3n+bume6emZycHPsKRUS+o+IS/O6+NfJ9B/AnoFM86hARCaOYB7+Zfc/MGhx/DPQC1sa6DhGRsIrHGP95wJ/M7Pj5Z7n7B3GoQ0QklGIe/O6+BUiP9XlFRKSYpnOKiISMmrSJxIumAtco36Xpq7riFxEJGQW/iEjIKPhFREJGwS8iEjIKfhGRkFHwi4iEjIJfRCRkFPwiIiGj4BcRCRkFv4hIyCj4RURC5rvfq0f9UEQCUZB4e6DHTzkwK9Djx1u0vX8KJvat8nPril9EJGQU/CIiIaPgFxEJGQW/iEjIKPhFREJGwS8iEjIKfhGRkFHwi4iEjIJfRCRk4hL8ZtbHzDaa2T/MbGw8ahARCauYB7+ZJQAvAjcA7YAhZtYu1nWIiIRVPK74OwH/cPct7n4IeA24OQ51iIiEUjyatF0I/POE5UKg86kbmdkIYERkcZ+ZbYxBbZXVDPg63kWcgZpYt2qOnWpSd7/KbFxNaq6UqGq2fz+rc1xc2sp4BL+Vss5PW+E+DZgWfDlnzszy3D0z3nVUVk2sWzXHTk2sWzVXTjyGegqBi05YbglsjUMdIiKhFI/gXwG0MbPWZnYOcBvwdhzqEBEJpZgP9bj7ETMbBfwZSACmu/u6WNdRRar1UFQ5amLdqjl2amLdqrkSzP204XUREfkO0yd3RURCRsEvIhIyCv6zZGb/x8w2mNmnZvYnM2sU75qiYWaDzGydmR0zs2o9Da6mtfgws+lmtsPM1sa7lmiZ2UVm9qGZrY/8XIyOd03RMLNEM1tuZp9E6n463jVFy8wSzGy1mb0b63Mr+M/efOAKd08DPgeejHM90VoLDAAWx7uQ8tTQFh8zgD7xLqKSjgCPuvsPgC7AyBrwOgMcBK5193QgA+hjZl3iW1LURgPr43FiBf9Zcve/uPuRyOIyij+XUO25+3p3r46fhj5VjWvx4e6LgW/iXUdluPs2d18VebyX4kC6ML5VVcyL7Ys
s1ol8VfsZK2bWEugL/N94nF/BX7XuBt6PdxHfMaW1+Kj2gVSTmVkK0AH4OM6lRCUyZJIP7ADmu3tNqPt54CfAsXicPB4tG2ocM1sAnF/KUz9z97ci2/yM4j+XX41lbeWJpu4aIKoWH1I1zKw+MBd4xN33xLueaLj7USAj8v7an8zsCnevtu+vmFk/YIe7rzSzHvGoQcEfBXfvWd7zZjaM4o5S13k1+mBERXXXEGrxESNmVofi0H/V3d+Idz2V5e67zOwjit9fqbbBD+QAN5nZjUAicK6Z/cHdfxSrAjTUc5bMrA/wBHCTu/9PvOv5DlKLjxgwMwNeAta7+6/jXU+0zCz5+Ew6M6sH9AQ2xLWoCrj7k+7e0t1TKP55XhTL0AcFf1X4DdAAmG9m+WY2Jd4FRcPMbjGzQqArMM/M/hzvmkoTeeP8eIuP9cDr1b3Fh5n9EVgKpJpZoZn9ON41RSEHGApcG/k5zo9ckVZ3LYAPzexTii8S5rt7zKdH1jRq2SAiEjK64hcRCRkFv4hIyCj4RURCRsEvIhIyCn4RkZBR8IuIhIyCX0QkZP4fjT2bMZ+QdgAAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "display(X_train_df['Sepal Width'].rename(\"Unscaled Sepal Width\").plot.hist(legend=True))\n", "display(X_train_std_df['Sepal Width'].rename(\"Scaled Sepal Width\").plot.hist(legend=True))" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "**Exercise 12.01:**\n", " \n", "Explore the mean and variance for the other features." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "### Anti-pattern: Inconsistent Preprocessing\n", "\n", "\n", "> If ... data transforms are used when training a model, they also must be used on subsequent datasets, whether it’s test data or data in a production system. Otherwise, the feature space will change, and the model will not be able to perform effectively.\n", ">\n", "> https://scikit-learn.org/stable/common_pitfalls.html#inconsistent-preprocessing\n", "\n", "To summarise the above; if we transform our training data then we must also transform the test data and the data being used to make inferences in production.\n", "\n", "To understand why this is important, let's create a new Iris classifier, but this time using a [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/tree.html#tree-classification). 
Note that we are NOT scaling the training data.\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "Tree\n", "\n", "\n", "\n", "0\n", "\n", "petal length (cm) ≤ 2.45\n", "gini = 0.667\n", "samples = 150\n", "value = [50, 50, 50]\n", "class = setosa\n", "\n", "\n", "\n", "1\n", "\n", "gini = 0.0\n", "samples = 50\n", "value = [50, 0, 0]\n", "class = setosa\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "True\n", "\n", "\n", "\n", "2\n", "\n", "petal width (cm) ≤ 1.75\n", "gini = 0.5\n", "samples = 100\n", "value = [0, 50, 50]\n", "class = versicolor\n", "\n", "\n", "\n", "0->2\n", "\n", "\n", "False\n", "\n", "\n", "\n", "3\n", "\n", "gini = 0.168\n", "samples = 54\n", "value = [0, 49, 5]\n", "class = versicolor\n", "\n", "\n", "\n", "2->3\n", "\n", "\n", "\n", "\n", "\n", "4\n", "\n", "gini = 0.043\n", "samples = 46\n", "value = [0, 1, 45]\n", "class = virginica\n", "\n", "\n", "\n", "2->4\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# ensure everyone running this code gets the same result\n", "import numpy as np\n", "np.random.seed(100)\n", "\n", "# train the decision tree classifier\n", "from sklearn.datasets import load_iris\n", "from sklearn import tree\n", "iris = load_iris()\n", "X, y = iris.data, iris.target\n", "clf = tree.DecisionTreeClassifier(max_depth=2)\n", "clf = clf.fit(X, y)\n", "\n", "# display the decision tree\n", "import graphviz \n", "dot_data = tree.export_graphviz(clf, out_file=None, \n", " feature_names=iris.feature_names, \n", " class_names=iris.target_names, \n", " filled=True, rounded=True, \n", " special_characters=True) \n", "graph = graphviz.Source(dot_data) \n", "graph " ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "Here we have made a decision tree with a depth of two `clf = 
tree.DecisionTreeClassifier(max_depth=2)`. \n", "\n", "If we read from the top, we can see that if the `Petal length <= 2.45` the predicted class is `Setosa`.\n", "\n", "However, when we standardized the training data we 'shifted' the values. See the [box plots](https://en.wikipedia.org/wiki/Box_plot), below, comparing the Unscaled Petal Length values to the Scaled Petal Length values." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 1.0, 'Unscaled')" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWoAAAEICAYAAAB25L6yAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8/fFQqAAAACXBIWXMAAAsTAAALEwEAmpwYAAAOl0lEQVR4nO3db4xldX3H8fcHWBCXFUgWpyrCqLQgoRHMYEtoyQCWAFqJ1aBWTSSmW2OCmmrr2tpSMG2pNm3tE+MUqLRSCwWxLSsrtHIVREFA1JXFB/wrizSAIOxYwp/ttw/uGRjWWebOMmfmh/N+JZOdOffcc753c3nv4Td35qaqkCS1a5flHkCS9OwMtSQ1zlBLUuMMtSQ1zlBLUuMMtSQ1zlBL20kymWTLUt9X2hFDrWWXpJIctN22P03y+eWaSWqJoZakxhlqNW9mOSHJh5Pcl+TeJKfNuv3kJLck2ZrkniQfmXXbKUluTvJIktuSnNhtPy3J5u4+tyf53Wc5/0uTXJLk/iR3JPnArNv2TPK5JA8luQU4sqe/Bq1guy33ANKIfgHYG3gZ8BvAxUm+VFUPAecCp1bV1Un2BV4BkOR1wD8CbwX+C3gJsKY73n3AG4HbgWOAy5N8u6pumn3SJLsA/wH8G/AOYH/gP5P8sKq+ApwBvKr7WA1c3tPj1wrmFbWeL54AzqqqJ6rqy8A0cPCs2w5N8qKqemhWbN8LnFdVV1bV/1XVPVV1K0BVbaiq22roa8AVwK/Pcd4jgf2q6qyqeryqbgf+Hnh7d/upwJ9V1YNVdTfwd308eK1shlot2Aas2m7bKoYBnvHjqnpy1tf/C+zVff4W4GTgriRfS3JUt/3lwG1znTDJSUm+leTBJD/p7r92jl0PBF6a5CczH8AfAmPd7S8F7p61/107fpjSzjHUasF/A+PbbXsFI0avqr5dVacALwa+BFzU3XQ3wyWJZ0iyB3AJ8FfAWFXtA3wZyByHvxu4o6r2mfWxpqpO7m6/l+E/CDMOGGVmaSEMtVpwIfDxJPsn2SXJ64HfBC6e745Jdk/yziR7V9UTwCMMr9BhuHZ9WpLju+O+LMkhwO7AHsD9wJNJTgJO2MEprgceSfLR7huHuyY5LMnMNw0vAj6WZN8k+wOn7+TfgbRDhlotOAu4FrgGeAj4JPDOqto04v3fDdyZ5BHgfcC7AKrqeuA04G+Ah4GvAQdW1VbgAwwj+xDw28C/z3XgqtrG8B+Nw4E7gAeAcxh+YxPgTIZX/ncwXOf+pxFnlkYW3zhAktrmFbUkNc5QS1LjDLUkNc5QS1LjevkR8rVr19b4+Hgfh5aek5/+9KesXr16uceQfsaNN974QFXtN9dtvYR6fHycG264oY
9DS8/JYDBgcnJyuceQfkaSHf6Al0sfktQ4Qy1JjTPUktQ4Qy1JjTPUktQ4Qy1JjTPUktQ4Qy1JjfPNbfW8lcz1hiz98NcBazl5Ra3nrapa8MeBH71sp+4nLSdDLUmNM9SS1DhDLUmNM9SS1DhDLUmNM9SS1DhDLUmNM9SS1DhDLUmNM9SS1DhDLUmNM9SS1DhDLUmNM9SS1DhDLUmNM9SS1DhDLUmNM9SS1DhDLUmNGynUSfZJcnGSW5NsTnJU34NJkoZGfRfyTwMbq+qtSXYHXtjjTJKkWeYNdZIXAccA7wGoqseBx/sdS5I0Y5Slj1cC9wP/kOQ7Sc5JsrrnuSRJnVGWPnYDXgucXlXXJfk0sB7449k7JVkHrAMYGxtjMBgs8qjS4vC5qeebUUK9BdhSVdd1X1/MMNTPUFVTwBTAxMRETU5OLtaM0uLZuAGfm3q+mXfpo6r+B7g7ycHdpuOBW3qdSpL0lFFf9XE6cEH3io/bgdP6G0mSNNtIoa6qm4GJfkeRJM3Fn0yUpMYZaklqnKGWpMYZaklqnKGWpMYZaklqnKGWpMYZaklqnKGWpMYZaklqnKGWpMYZaklqnKGWpMYZaklqnKGWpMYZaklqnKGWpMYZaklqnKGWpMYZaklqnKGWpMYZaklqnKGWpMYZaklqnKGWpMYZaklq3G6j7JTkTmArsA14sqom+hxKK9NrzryChx99ovfzjK/f0Ovx995zFd8944Rez6GVZaRQd46tqgd6m0Qr3sOPPsGdZ7+h13MMBgMmJyd7PUff/xBo5XHpQ5IaN+oVdQFXJCngs1U1tf0OSdYB6wDGxsYYDAaLNqRWjr6fN9PT00vy3PT5r8U0aqiPrqofJXkxcGWSW6vq67N36OI9BTAxMVF9/++lfg5t3ND7ssRSLH0sxePQyjLS0kdV/aj78z7gUuB1fQ4lSXravKFOsjrJmpnPgROATX0PJkkaGmXpYwy4NMnM/v9cVRt7nUqS9JR5Q11VtwOvWYJZJElz8OV5ktQ4Qy1JjTPUktQ4Qy1JjTPUktQ4Qy1JjTPUktQ4Qy1JjTPUktQ4Qy1JjTPUktQ4Qy1JjTPUktS4hby5rdSrNa9ezy+fv77/E53f7+HXvBqg3zfp1cpiqNWMrZvP9l3IpTm49CFJjTPUktQ4Qy1JjTPUktQ4Qy1JjTPUktQ4Qy1JjTPUktQ4Qy1JjTPUktQ4Qy1JjRs51El2TfKdJJf1OZAk6ZkWckX9QWBzX4NIkuY2UqiT7M/w9zae0+84kqTtjfprTv8W+ANgzY52SLIOWAcwNjbGYDB4rrNpBer7eTM9Pb0kz02f/1pM84Y6yRuB+6rqxiSTO9qvqqaAKYCJiYnq+3f+6ufQxg29/67opfh91EvxOLSyjLL0cTTwpiR3Av8CHJfk871OJUl6yryhrqqPVdX+VTUOvB34alW9q/fJJEmAr6OWpOYt6D0Tq2oADHqZRJI0J6+oJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJalx84Y6yQuSXJ/ku0l+kOTMpRhMkjS02wj7PAYcV1XTSVYB1yS5vKq+1fNskiRGCHVVFTDdfbmq+6g+h5IkPW2kNeokuya5GbgPuLKqrut1KknSU0ZZ+qCqtgGHJ9kHuDTJYVW1afY+SdYB6wDGxsYYDAaLPKpWgr6fN9PT00vy3PT5r8U0UqhnVNVPkgyAE4FN2902BUwBTExM1OTk5CKNqBVj4wb6ft4MBoPez7EUj0Mryyiv+tivu5ImyZ7A64Fbe55LktQZ5Yr6JcD5SXZlGPaLquqyfseSJM0Y5VUf3wOOWIJZJElz8CcTJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJa
lxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGjdvqJO8PMlVSTYn+UGSDy7FYJKkod1G2OdJ4MNVdVOSNcCNSa6sqlt6nk2SxAhX1FV1b1Xd1H2+FdgMvKzvwSRJQ6NcUT8lyThwBHDdHLetA9YBjI2NMRgMFmE8rTR9P2+mp6eX5Lnp81+LaeRQJ9kLuAT4UFU9sv3tVTUFTAFMTEzU5OTkYs2olWLjBvp+3gwGg97PsRSPQyvLSK/6SLKKYaQvqKov9juSJGm2UV71EeBcYHNV/XX/I0mSZhvlivpo4N3AcUlu7j5O7nkuSVJn3jXqqroGyBLMIkmagz+ZKEmNM9SS1DhDLUmNM9SS1DhDLUmNM9SS1LgF/a4PqW/j6zf0f5KN/Z5j7z1X9Xp8rTyGWs248+w39H6O8fUbluQ80mJy6UOSGmeoJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJalxhlqSGmeoJalx84Y6yXlJ7kuyaSkGkiQ90yhX1J8DTux5DknSDswb6qr6OvDgEswiSZrDor0LeZJ1wDqAsbExBoPBYh1amtOxxx67U/fLXy78PlddddVOnUtaDIsW6qqaAqYAJiYmanJycrEOLc2pqhZ8n8FggM9NPd/4qg9JapyhlqTGjfLyvC8A3wQOTrIlyXv7H0uSNGPeNeqqesdSDCJJmptLH5LUOEMtSY0z1JLUOEMtSY3LzvzQwLwHTe4H7lr0A0vP3VrggeUeQprDgVW131w39BJqqVVJbqiqieWeQ1oIlz4kqXGGWpIaZ6i10kwt9wDSQrlGLUmN84pakhpnqCWpcYZayyLJtiQ3J9mU5F+TvPBZ9j08yckjHHMyyWWjbl8sSfZJ8v6lOp9WHkOt5fJoVR1eVYcBjwPve5Z9DwfmDfUy2gd4/3w7STvLUKsFVwMHJVmd5Lwk307ynSSnJNkdOAt4W3cF/rYkr0tybbfPtUkO3pmTJjkhyTeT3NRd1e/Vbb8zyZnd9u8nOaTbvl+SK7vtn01yV5K1wNnAq7r5PtUdfq8kFye5NckFSfLc/5q0UhlqLaskuwEnAd8H/gj4alUdCRwLfApYBfwJcGF3BX4hcCtwTFUd0d325ztx3rXAx4HXV9VrgRuA35u1ywPd9s8AH+m2ndHN91rgUuCAbvt64LZuvt/vth0BfAg4FHglcPRCZ5RmLNqb20oLtGeSm7vPrwbOBa4F3pRkJowv4OkYzrY3cH6SXwSKYcwX6lcZRvQb3cXu7gzfyWjGF7s/bwR+q/v814A3A1TVxiQPPcvxr6+qLQDd4xwHrtmJOSVDrWXzaFUdPntDtzzwlqr64Xbbf2W7+34CuKqq3pxkHBjsxPkDXPks72D0WPfnNp7+72QhyxePzfp89jGkBXPpQy35CnD6zHpukiO67VuBNbP22xu4p/v8PTt5rm8BRyc5qDvXC5P80jz3uQY4tdv/BGDfHcwnLSpDrZZ8guEyxveSbOq+BrgKOHTmm4nAJ4G/SPINYNcRj3189+bMW5JsAQ5iGPkvJPkew3AfMs8xzgROSHITw3X1e4GtVfVjhksom2Z9M1FaNP4IuTSiJHsA26rqySRHAZ/ZfvlG6oPrZtLoDgAuSrILw9d+/84yz6MVwitqSWqca9SS1DhDLUmNM9SS1DhDLUmNM9SS1Lj/B5/5scu0wWhCAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "X_train_df.boxplot(column='Petal Length').axes.set_title('Unscaled')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 1.0, 'Scaled')" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXwAAAEICAYAAABcVE8dAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8/fFQqAAAACXBIWXMAAAsTAAALEwEAmpwYAAARSElEQVR4nO3df6zddX3H8efLUpwigq54QUDqtE6ZRmTXItEt16hE0MjwJ2QZ6jI7f7BkmSxrwuKPmW2oWbIYndhtCGabvzZRYpsCOs74PUAGWH7NSkroykRQoUWmFt/743w779rT9t57flzaz/ORnPT7/Xw/5/v+nOTb1/nezznf70lVIUna/z1hsQcgSZoMA1+SGmHgS1IjDHxJaoSBL0mNMPAlqREGvjSEJO9IctWknysthIGvpiR5RZJrkjyU5AdJrk7y0sUelzQJByz2AKRJSfJU4OvAe4AvAQcCvwH8ZDHHJU2KZ/hqyfMAqurzVfVYVT1aVZdW1a0ASd6V5I4kW5PcnuT4rn11ku/Oaj9tdwWSPD/JZd1fD3cleeusbb+c5OIkDye5HnjOmF+v9P8Y+GrJfwKPJbkwyclJnrZjQ5K3AB8CzgSeCrwBeLDb/F36fwkcAnwY+IckR+y88yQHAZcB/wQ8AzgD+Jskv9Z1+RTwP8ARwO92D2liDHw1o6oeBl4BFPC3wPe7M+4p4PeAj1XVDdW3saru6Z735araUlU/r6ovAt8BVg4o8XpgU1V9tqq2V9VNwL8Ab06yBHgT8IGqeqSqNgAXjv1FS7MY+GpKVd1RVe+oqqOAFwLPBP4aOJr+mfwukpyZ5OYkP0ryo+55ywZ0PQY4YUe/ru9vA4cDh9H/zOzeWf3vGc2rkubGD23VrKq6M8kFwO/TD+Jd5tSTHEP/r4FXAddW1WNJbgYyYJf3Av9WVa8ZsJ8lwHb6byx3ds3PGsHLkObMM3w1o/tA9f1JjurWj6Y/z34d8HfA2Ul+PX3P7cL+IPpTQN/vnvNO+mf4g3wdeF6S30mytHu8NMkLquox4CvAh5I8OcmxwNvH+oKlnRj4aslW4ATg35M8Qj/oNwDvr6ovA39O/wPXrcBXgadX1e3AXwHXAt8DXgRcPWjnVbUVOAk4HdgC/DfwUeCJXZezgKd07RcAnx31C5T2JP4AiiS1wTN8SWqEgS9JjTDwJakRBr4kNeJx/T38ZcuW1fLlyxd7GNIuHnnkEQ466KDFHoa0i29961sPVNVhg7Y9rgN/+fLl3HjjjYs9DGkXvV6PmZmZxR6GtIsku72C2ykdSWqEgS9JjTDwJakRBr4kNcLAl6RGGPiS1AgDX5IaYeBLUiMe1xdeSZOQDPrxqvHwduRaTJ7hq3lVNe/HMX/y9QU9T1pMBr4kNcLAl6RGGPiS1AgDX5IaYeBLUiMMfElqhIEvSY0YSeAnOT/J/Uk27Gb7TJKHktzcPT4wirqSpLkb1ZW2FwCfBD63hz5XVtXrR1RPkjRPIznDr6orgB+MYl+SpPGY5L10TkxyC7AFOLuqbhvUKckqYBXA1NQUvV5vciOU5sFjU/uaSQX+TcAxVbUtySnAV4EVgzpW1RpgDcD0
9HTNzMxMaIjSPKxfi8em9jUT+ZZOVT1cVdu65XXA0iTLJlFbktQ3kcBPcni6e9AmWdnVfXAStSVJfSOZ0knyeWAGWJZkM/BBYClAVZ0HvBl4T5LtwKPA6eW9YiVpokYS+FV1xl62f5L+1zYlSYvEK20lqREGviQ1wsCXpEYY+JLUCANfkhph4EtSIwx8SWqEgS9JjTDwJakRBr4kNcLAl6RGGPiS1AgDX5IaYeBLUiMMfElqhIEvSY0w8CWpEQa+JDXCwJekRhj4ktSIkQR+kvOT3J9kw262J8knkmxMcmuS40dRV5I0d6M6w78AeO0etp8MrOgeq4BPj6iuJGmORhL4VXUF8IM9dDkV+Fz1XQccmuSIUdSWJM3NAROqcyRw76z1zV3bfTt3TLKK/l8BTE1N0ev1JjE+ad48NrWvmVTgZ0BbDepYVWuANQDT09M1MzMzxmFJC7R+LR6b2tdM6ls6m4GjZ60fBWyZUG1JEpML/IuBM7tv67wMeKiqdpnOkSSNz0imdJJ8HpgBliXZDHwQWApQVecB64BTgI3Aj4F3jqKuJGnuRhL4VXXGXrYX8L5R1JIkLYxX2kpSIwx8SWqEgS9JjTDwJakRBr4kNcLAl6RGGPiS1AgDX5IaYeBLUiMmdbdMaWJe/OFLeejRn429zvLVa8e6/0OetJRbPnjSWGuoLQa+9jsPPfozNp37urHW6PV6Y7898rjfUNQep3QkqREGviQ1wsCXpEYY+JLUCANfkhph4EtSIwx8SWqEgS9JjTDwJakRBr4kNWIkgZ/ktUnuSrIxyeoB22eSPJTk5u7xgVHUlSTN3dD30kmyBPgU8BpgM3BDkour6vadul5ZVa8ftp4kaWFGcYa/EthYVXdX1U+BLwCnjmC/kqQRGsXdMo8E7p21vhk4YUC/E5PcAmwBzq6q2wbtLMkqYBXA1NQUvV5vBENUa8Z93Gzbtm0ix6bHv0ZpFIGfAW210/pNwDFVtS3JKcBXgRWDdlZVa4A1ANPT0zXuW9BqP7R+7dhvXTyJ2yNP4nWoLaOY0tkMHD1r/Sj6Z/H/p6oerqpt3fI6YGmSZSOoLUmao1EE/g3AiiTPTnIgcDpw8ewOSQ5Pkm55ZVf3wRHUliTN0dBTOlW1PclZwCXAEuD8qrotybu77ecBbwbek2Q78ChwelXtPO0jSRqjkfzEYTdNs26ntvNmLX8S+OQoakmSFsYrbSWpEf6IufY7B79gNS+6cJcLvkfvwvHu/uAXAIz3x9jVFgNf+52td5zLpnPHG5ST+Frm8tVrx7p/tccpHUlqhIEvSY0w8CWpEQa+JDXCwJekRhj4ktQIA1+SGmHgS1IjDHxJaoSBL0mNMPAlqREGviQ1wsCXpEYY+JLUCANfkhph4EtSIwx8SWrESAI/yWuT3JVkY5JdflsufZ/ott+a5PhR1JUkzd3QgZ9kCfAp4GTgWOCMJMfu1O1kYEX3WAV8eti6kqT5GcUZ/kpgY1XdXVU/Bb4AnLpTn1OBz1XfdcChSY4YQW1J0hyN4kfMjwTunbW+GThhDn2OBO7beWdJVtH/K4CpqSl6vd4IhqjWjPu42bZt20SOTY9/jdIoAj8D2moBffqNVWuANQDT09M1MzMz1ODUoPVrGfdx0+v1xl5jEq9DbRnFlM5m4OhZ60cBWxbQR5I0RqMI/BuAFUmeneRA4HTg4p36XAyc2X1b52XAQ1W1y3SOJGl8hp7SqartSc4CLgGWAOdX1W1J3t1tPw9YB5wCbAR+DLxz2LqSpPkZxRw+VbWOfqjPbjtv1nIB7xtFLUnSwnilrSQ1wsCXpEYY+JLUCANfkhph4EtSIwx8SWqEgS9JjTDwJakRBr4kNcLAl6RGGPiS1AgDX5IaYeBLUiMMfElqhIEvSY0w8CWpEQa+JDXCwJekRhj4ktQIA1+SGjHUj5gneTrwRWA5sAl4a1X9cEC/TcBW4DFge1VND1NXkjR/w57hrwa+WVUrgG9267vzyqo6zrCX
pMUxbOCfClzYLV8I/NaQ+5MkjclQUzrAVFXdB1BV9yV5xm76FXBpkgI+U1VrdrfDJKuAVQBTU1P0er0hh6gWjfu42bZt20SOTY9/jdJeAz/JN4DDB2w6Zx51Xl5VW7o3hMuS3FlVVwzq2L0ZrAGYnp6umZmZeZSRgPVrGfdx0+v1xl5jEq9Dbdlr4FfVq3e3Lcn3khzRnd0fAdy/m31s6f69P8lFwEpgYOBLksZj2Dn8i4G3d8tvB762c4ckByU5eMcycBKwYci6kqR5GjbwzwVek+Q7wGu6dZI8M8m6rs8UcFWSW4DrgbVVtX7IupKkeRrqQ9uqehB41YD2LcAp3fLdwIuHqSNJGp5X2kpSIwx8SWqEgS9JjTDwJakRBr4kNcLAl6RGGPiS1AgDX5IaYeBLUiMMfElqhIEvSY0w8CWpEQa+JDXCwJekRhj4ktQIA1+SGmHgS1IjDHxJaoSBL0mNMPAlqREGviQ1YqjAT/KWJLcl+XmS6T30e22Su5JsTLJ6mJqSpIUZ9gx/A/BG4IrddUiyBPgUcDJwLHBGkmOHrCtJmqcDhnlyVd0BkGRP3VYCG6vq7q7vF4BTgduHqS1Jmp+hAn+OjgTunbW+GThhd52TrAJWAUxNTdHr9cY6OO2fxn3cbNu2bSLHpse/RmmvgZ/kG8DhAzadU1Vfm0ONQaf/tbvOVbUGWAMwPT1dMzMzcyghzbJ+LeM+bnq93thrTOJ1qC17DfyqevWQNTYDR89aPwrYMuQ+JUnzNImvZd4ArEjy7CQHAqcDF0+griRplmG/lnlaks3AicDaJJd07c9Msg6gqrYDZwGXAHcAX6qq24YbtiRpvob9ls5FwEUD2rcAp8xaXwesG6aWJGk4XmkrSY0w8CWpEQa+JDXCwJekRhj4ktQIA1+SGmHgS1IjDHxJaoSBL0mNMPAlqRGTuB++NHHLV68df5H1461xyJOWjnX/ao+Br/3OpnNfN/Yay1evnUgdaZSc0pGkRhj4ktQIA1+SGmHgS1IjDHxJaoSBL0mNMPAlqREGviQ1YqjAT/KWJLcl+XmS6T3025Tk20luTnLjMDUlSQsz7JW2G4A3Ap+ZQ99XVtUDQ9aTJC3QUIFfVXcAJBnNaCRJYzOpe+kUcGmSAj5TVWt21zHJKmAVwNTUFL1ebzIjlObJY1P7mr0GfpJvAIcP2HROVX1tjnVeXlVbkjwDuCzJnVV1xaCO3ZvBGoDp6emamZmZYwlpgtavxWNT+5q9Bn5VvXrYIlW1pfv3/iQXASuBgYEvSRqPsX8tM8lBSQ7esQycRP/DXknSBA37tczTkmwGTgTWJrmka39mknVdtyngqiS3ANcDa6tq/TB1JUnzN+y3dC4CLhrQvgU4pVu+G3jxMHUkScPzSltJaoSBL0mNMPAlqREGviQ1YlJX2kqPWwu9NUg+Ov/nVNWCakmj4Bm+mldV835cfvnlC3qetJgMfElqhIEvSY0w8CWpEQa+JDXCwJekRhj4ktQIA1+SGmHgS1Ij8ni+GCTJ94F7Fnsc0gDLgAcWexDSAMdU1WGDNjyuA196vEpyY1VNL/Y4pPlwSkeSGmHgS1IjDHxpYdYs9gCk+XIOX5Ia4Rm+JDXCwJekRhj42qcleSzJzUk2JPlykifvoe9xSU6Zwz5nknx9ru2jkuTQJO+dVD21x8DXvu7Rqjquql4I/BR49x76HgfsNfAX0aHAe/fWSVooA1/7kyuB5yY5KMn5SW5I8h9JTk1yIPBnwNu6vwjelmRlkmu6Ptck+dWFFE1yUpJrk9zU/ZXxlK59U5IPd+3fTvL8rv2wJJd17Z9Jck+SZcC5wHO68X282/1TkvxzkjuT/GMW+gO8Ega+9hNJDgBOBr4NnAP8a1W9FHgl8HFgKfAB4IvdXwRfBO4EfrOqXtJt+4sF1F0G/Cnw6qo6HrgR+KNZXR7o2j8NnN21fbAb3/HARcCzuvbVwHe78f1x1/YS4A+BY4FfAV4+3zFKOxyw2AOQ
hvSkJDd3y1cCfw9cA7whyY6A/SV+EaqzHQJcmGQFUPTfFObrZfTD+Oru5PtA4NpZ27/S/fst4I3d8iuA0wCqan2SH+5h/9dX1WaA7nUuB65awDglA1/7vEer6rjZDd20x5uq6q6d2k/Y6bkfAS6vqtOSLAd6C6gf4LKqOmM323/S/fsYv/j/Np9pmZ/MWp69D2nenNLR/ugS4A92zHcneUnXvhU4eFa/Q4D/6pbfscBa1wEvT/LcrtaTkzxvL8+5Cnhr1/8k4Gm7GZ80Uga+9kcfoT89c2uSDd06wOXAsTs+tAU+BvxlkquBJXPc96uSbN7xAJ5L/83i80lupf8G8Py97OPDwElJbqL/ucN9wNaqepD+1NCGWR/aSiPjrRWkCUvyROCxqtqe5ETg0ztPS0nj4HygNHnPAr6U5An0rx141yKPR43wDF+SGuEcviQ1wsCXpEYY+JLUCANfkhph4EtSI/4XtHEBVxHAh1wAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "X_train_std_df.boxplot(column='Petal Length').axes.set_title('Scaled')" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "Note that the scaled training data has been shifted compared to the unscaled data.\n", "\n", "So if we don't 'shift' the prediction data, the decision tree rules will be invalid.\n", "\n", "The first rule in the decision tree, based on unscaled data is `Petal length <= 2.45`. However, the scaled Petal length values from the training data (as per the box chart) is approximately `-1.75` to `+1.75` **so no records would trigger the first rule**.\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "**Exercise 12.02**\n", "\n", "Experiment with the depth parameter on the DecisionTreeClassifier. \n", "\n", "Generally increasing the depth will create a more accurate model, but risks '[Overfitting](https://en.wikipedia.org/wiki/Overfitting)' the model to the training data." 
] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "### Scaled Inference Example\n", "\n", "In this example, we fit a scaler on the training data, train a LinearSVC model on the scaled data, and save both the model and the scaler to disk." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['iris_data_scaler.joblib']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# set the random state to make the examples repeatable\n", "import numpy as np\n", "np.random.seed(1)\n", "\n", "from sklearn import datasets\n", "from sklearn.svm import LinearSVC\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import StandardScaler\n", "iris = datasets.load_iris()\n", "X = iris.data\n", "y = iris.target\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)\n", "\n", "# We 'fit' the StandardScaler on the training data only\n", "std_slc = StandardScaler()\n", "std_slc.fit(X_train)\n", "\n", "# Apply the same fitted scaler to both the training and test datasets\n", "X_train_std = std_slc.transform(X_train)\n", "X_test_std = std_slc.transform(X_test)\n", "\n", "classifier = LinearSVC()\n", "model = classifier.fit(X_train_std, y_train)\n", "\n", "# Testing the model is left as an exercise for the reader\n", "\n", "# Save the scaler as well as the model: new data must be scaled the same way before prediction\n", "from joblib import dump\n", "dump(model, 'iris_model_scaled.joblib')\n", "dump(std_slc, 'iris_data_scaler.joblib')" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "Now let's make a prediction with the scaled model, using the same feature data we used with the unscaled model in the [Previous](./11_model_predictions.ipynb) notebook." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scaled input features: [[ 0.11352823 -0.07994018 0.75731121 0.76873449]]\n", "Predicted class: [2]\n" ] } ], "source": [ "# set the random state to make the examples repeatable\n", "import numpy as np\n", "np.random.seed(1)\n", "\n", "from joblib import load\n", "\n", "# Load the scaler and the model saved in the previous cell\n", "scaler_file = load('iris_data_scaler.joblib')\n", "model_file = load('iris_model_scaled.joblib')\n", "\n", "new_sepal_length = 5.9\n", "new_sepal_width = 3.0\n", "new_petal_length = 5.1\n", "new_petal_width = 1.8\n", "\n", "# Scale the input features\n", "X_new = scaler_file.transform([\n", "    [ new_sepal_length, new_sepal_width, new_petal_length, new_petal_width ]\n", "])\n", "print(\"Scaled input features: \" + str(X_new))\n", "\n", "prediction = model_file.predict(X_new)\n", "print(\"Predicted class: \" + str(prediction))" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "**Exercise 12.03**\n", "\n", "Compare the results of the scaled model prediction to the unscaled model prediction in the [Previous](./11_model_predictions.ipynb) notebook. The predicted class should be the same." 
] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "### Other Dataset Transformations\n", "\n", "All of the Scikit-learn transformations are documented [here](https://scikit-learn.org/stable/data_transforms.html).\n", "\n", "Some of the transformations are quite advanced, but it is recommended to read through the documentation and try the examples on these topics:\n", " \n", "- [Preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html)\n", "- [Imputation of missing values](https://scikit-learn.org/stable/modules/impute.html)\n", "- [Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)\n", "- [Feature Extraction](https://scikit-learn.org/stable/modules/feature_extraction.html)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "### Navigation\n", "\n", "[Previous](./11_model_predictions.ipynb) | [Home](./00-README-FIRST.ipynb) | [Next](./13_exploratory_data_analysis.ipynb) notebook\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "
\n", "\n", "### Appendix\n", "\n", "Other Scikit-learn scalers:\n", "\n", "- [sklearn.preprocessing.MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)\n", "- [sklearn.preprocessing.MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html)\n", "- [sklearn.preprocessing.StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)\n", "- [sklearn.preprocessing.RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html)\n", "- [sklearn.preprocessing.Normalizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html)\n", "- [sklearn.preprocessing.QuantileTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html)\n", "- [sklearn.preprocessing.PowerTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html)\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "**Solution for Exercise 12.01**\n", "\n", "There are a few solutions, here we augment the [describe method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) with a variance statistic:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Sepal Length Sepal Width Petal Length Petal Width\n", "count 100.000000 100.000000 100.000000 100.000000\n", "mean 5.836000 3.051000 3.820000 1.220000\n", "std 0.807480 0.420076 1.702167 0.725022\n", "min 4.300000 2.000000 1.000000 0.100000\n", "25% 5.100000 2.800000 1.675000 0.300000\n", "50% 5.800000 3.000000 4.450000 1.400000\n", "75% 6.400000 3.300000 5.100000 1.800000\n", "max 7.900000 4.200000 6.700000 2.500000\n", "variance 0.652024 0.176464 2.897374 0.525657" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "summary_stats = X_train_df.describe()\n", "\n", "# add 'variance' because describe doesn't show it\n", "summary_stats.loc['variance'] = summary_stats.loc['std']**2\n", "summary_stats" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Sepal Length Sepal Width Petal Length Petal Width\n", "count 1.000000e+02 1.000000e+02 1.000000e+02 1.000000e+02\n", "mean 7.149836e-16 6.483702e-16 8.770762e-16 -2.664535e-16\n", "std 1.005038e+00 1.005038e+00 1.005038e+00 1.005038e+00\n", "min -1.911797e+00 -2.514534e+00 -1.665058e+00 -1.552564e+00\n", "25% -9.160693e-01 -6.005214e-01 -1.266507e+00 -1.275320e+00\n", "50% -4.480774e-02 -1.220183e-01 3.719809e-01 2.495191e-01\n", "75% 7.019879e-01 5.957364e-01 7.557708e-01 8.040061e-01\n", "max 2.568977e+00 2.749001e+00 1.700484e+00 1.774358e+00\n", "variance 1.010101e+00 1.010101e+00 1.010101e+00 1.010101e+00" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "summary_stats = X_train_std_df.describe()\n", "summary_stats.loc['variance'] = summary_stats.loc['std']**2\n", "summary_stats" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "**Solution for Exercise 12.02**\n", "\n", "No solution required - just experiment and observe :)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 5 }