{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Standardizing Data\n",
"> This chapter is all about standardizing data. Often a model will make some assumptions about the distribution or scale of your features. Standardization is a way to make your data fit these assumptions and improve the algorithm's performance. This is the Summary of lecture \"Preprocessing for Machine Learning in Python\", via datacamp.\n",
"\n",
"- toc: true \n",
"- badges: true\n",
"- comments: true\n",
"- author: Chanseok Kang\n",
"- categories: [Python, Datacamp, Machine_Learning]\n",
"- image: "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Standardizing Data\n",
"- Standardization\n",
" - Preprocessing method used to transform continuous data to make it look normally distributed\n",
" - Scikit-learn models assume normally distributed data\n",
" - Log normalization\n",
" - feature Scaling\n",
"- When to standardize: models\n",
" - Model in linear space\n",
" - Dataset features have high variance\n",
" - Dataset features are continuous and on different scales\n",
" - Linearity assumptions"
]
},
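{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration (a sketch added for this summary, not part of the original exercises; the numbers are rough values taken from the wine data), a feature on a much larger scale dominates the Euclidean distance that a k-nearest neighbors model relies on:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Two wine-like points as (Total phenols, Proline); Proline is ~1000x larger in scale\n",
"a = np.array([2.80, 1065.0])\n",
"b = np.array([2.65, 1050.0])\n",
"\n",
"# Raw distance: Proline's scale swamps the Total phenols difference\n",
"print(np.linalg.norm(a - b))\n",
"\n",
"# After dividing by rough per-feature standard deviations (approximations from\n",
"# the describe() output later in this notebook), both features contribute\n",
"stds = np.array([0.63, 314.9])\n",
"print(np.linalg.norm((a - b) / stds))"
]
},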
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Modeling without normalizing\n",
"Let's take a look at what might happen to your model's accuracy if you try to model data without doing some sort of standardization first. Here we have a subset of the wine dataset. One of the columns, `Proline`, has an extremely high variance compared to the other columns. This is an example of where a technique like log normalization would come in handy, which you'll learn about in the next section.\n",
"\n",
"The scikit-learn model training process should be familiar to you at this point, so we won't go too in-depth with it. You already have a k-nearest neighbors model available (`knn`) as well as the `X` and `y` sets you need to fit and score on."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Type | \n",
" Alcohol | \n",
" Malic acid | \n",
" Ash | \n",
" Alcalinity of ash | \n",
" Magnesium | \n",
" Total phenols | \n",
" Flavanoids | \n",
" Nonflavanoid phenols | \n",
" Proanthocyanins | \n",
" Color intensity | \n",
" Hue | \n",
" OD280/OD315 of diluted wines | \n",
" Proline | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 14.23 | \n",
" 1.71 | \n",
" 2.43 | \n",
" 15.6 | \n",
" 127 | \n",
" 2.80 | \n",
" 3.06 | \n",
" 0.28 | \n",
" 2.29 | \n",
" 5.64 | \n",
" 1.04 | \n",
" 3.92 | \n",
" 1065 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 13.20 | \n",
" 1.78 | \n",
" 2.14 | \n",
" 11.2 | \n",
" 100 | \n",
" 2.65 | \n",
" 2.76 | \n",
" 0.26 | \n",
" 1.28 | \n",
" 4.38 | \n",
" 1.05 | \n",
" 3.40 | \n",
" 1050 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 13.16 | \n",
" 2.36 | \n",
" 2.67 | \n",
" 18.6 | \n",
" 101 | \n",
" 2.80 | \n",
" 3.24 | \n",
" 0.30 | \n",
" 2.81 | \n",
" 5.68 | \n",
" 1.03 | \n",
" 3.17 | \n",
" 1185 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 14.37 | \n",
" 1.95 | \n",
" 2.50 | \n",
" 16.8 | \n",
" 113 | \n",
" 3.85 | \n",
" 3.49 | \n",
" 0.24 | \n",
" 2.18 | \n",
" 7.80 | \n",
" 0.86 | \n",
" 3.45 | \n",
" 1480 | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" 13.24 | \n",
" 2.59 | \n",
" 2.87 | \n",
" 21.0 | \n",
" 118 | \n",
" 2.80 | \n",
" 2.69 | \n",
" 0.39 | \n",
" 1.82 | \n",
" 4.32 | \n",
" 1.04 | \n",
" 2.93 | \n",
" 735 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Type Alcohol Malic acid Ash Alcalinity of ash Magnesium \\\n",
"0 1 14.23 1.71 2.43 15.6 127 \n",
"1 1 13.20 1.78 2.14 11.2 100 \n",
"2 1 13.16 2.36 2.67 18.6 101 \n",
"3 1 14.37 1.95 2.50 16.8 113 \n",
"4 1 13.24 2.59 2.87 21.0 118 \n",
"\n",
" Total phenols Flavanoids Nonflavanoid phenols Proanthocyanins \\\n",
"0 2.80 3.06 0.28 2.29 \n",
"1 2.65 2.76 0.26 1.28 \n",
"2 2.80 3.24 0.30 2.81 \n",
"3 3.85 3.49 0.24 2.18 \n",
"4 2.80 2.69 0.39 1.82 \n",
"\n",
" Color intensity Hue OD280/OD315 of diluted wines Proline \n",
"0 5.64 1.04 3.92 1065 \n",
"1 4.38 1.05 3.40 1050 \n",
"2 5.68 1.03 3.17 1185 \n",
"3 7.80 0.86 3.45 1480 \n",
"4 4.32 1.04 2.93 735 "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wine = pd.read_csv('./dataset/wine_types.csv')\n",
"wine.head()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"X = wine[['Proline', 'Total phenols', 'Hue', 'Nonflavanoid phenols']]\n",
"y = wine['Type'] "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.6888888888888889\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"knn = KNeighborsClassifier()\n",
"\n",
"# Split the dataset and labels into training and test sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
"\n",
"# Fit the k-nearest neighbors model to the training data\n",
"knn.fit(X_train, y_train)\n",
"\n",
"# SCore the model on the test data\n",
"print(knn.score(X_test, y_test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Log normalization\n",
"- Applies log transformation\n",
"- Natural log using the constant $e$ (2.718)\n",
"- Captures relative changes, the magnitude of change, and keeps everything in the positive space"
]
},
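{
"cell_type": "markdown",
"metadata": {},
"source": [
"A short demo (added for this summary) of why the natural log captures relative change: equal ratios map to equal differences, which is what tames a long right tail:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Equal ratios become equal differences on the log scale\n",
"print(np.log(200) - np.log(100))    # log(2), ~0.693\n",
"print(np.log(2000) - np.log(1000))  # also log(2)\n",
"\n",
"# A long right tail gets compressed, shrinking the variance\n",
"skewed = np.array([10.0, 12.0, 15.0, 20.0, 1000.0])\n",
"print(skewed.var(), np.log(skewed).var())"
]
},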
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Checking the variance\n",
"Check the variance of the columns in the `wine` dataset."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Type | \n",
" Alcohol | \n",
" Malic acid | \n",
" Ash | \n",
" Alcalinity of ash | \n",
" Magnesium | \n",
" Total phenols | \n",
" Flavanoids | \n",
" Nonflavanoid phenols | \n",
" Proanthocyanins | \n",
" Color intensity | \n",
" Hue | \n",
" OD280/OD315 of diluted wines | \n",
" Proline | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 178.000000 | \n",
" 178.000000 | \n",
" 178.000000 | \n",
" 178.000000 | \n",
" 178.000000 | \n",
" 178.000000 | \n",
" 178.000000 | \n",
" 178.000000 | \n",
" 178.000000 | \n",
" 178.000000 | \n",
" 178.000000 | \n",
" 178.000000 | \n",
" 178.000000 | \n",
" 178.000000 | \n",
"
\n",
" \n",
" mean | \n",
" 1.938202 | \n",
" 13.000618 | \n",
" 2.336348 | \n",
" 2.366517 | \n",
" 19.494944 | \n",
" 99.741573 | \n",
" 2.295112 | \n",
" 2.029270 | \n",
" 0.361854 | \n",
" 1.590899 | \n",
" 5.058090 | \n",
" 0.957449 | \n",
" 2.611685 | \n",
" 746.893258 | \n",
"
\n",
" \n",
" std | \n",
" 0.775035 | \n",
" 0.811827 | \n",
" 1.117146 | \n",
" 0.274344 | \n",
" 3.339564 | \n",
" 14.282484 | \n",
" 0.625851 | \n",
" 0.998859 | \n",
" 0.124453 | \n",
" 0.572359 | \n",
" 2.318286 | \n",
" 0.228572 | \n",
" 0.709990 | \n",
" 314.907474 | \n",
"
\n",
" \n",
" min | \n",
" 1.000000 | \n",
" 11.030000 | \n",
" 0.740000 | \n",
" 1.360000 | \n",
" 10.600000 | \n",
" 70.000000 | \n",
" 0.980000 | \n",
" 0.340000 | \n",
" 0.130000 | \n",
" 0.410000 | \n",
" 1.280000 | \n",
" 0.480000 | \n",
" 1.270000 | \n",
" 278.000000 | \n",
"
\n",
" \n",
" 25% | \n",
" 1.000000 | \n",
" 12.362500 | \n",
" 1.602500 | \n",
" 2.210000 | \n",
" 17.200000 | \n",
" 88.000000 | \n",
" 1.742500 | \n",
" 1.205000 | \n",
" 0.270000 | \n",
" 1.250000 | \n",
" 3.220000 | \n",
" 0.782500 | \n",
" 1.937500 | \n",
" 500.500000 | \n",
"
\n",
" \n",
" 50% | \n",
" 2.000000 | \n",
" 13.050000 | \n",
" 1.865000 | \n",
" 2.360000 | \n",
" 19.500000 | \n",
" 98.000000 | \n",
" 2.355000 | \n",
" 2.135000 | \n",
" 0.340000 | \n",
" 1.555000 | \n",
" 4.690000 | \n",
" 0.965000 | \n",
" 2.780000 | \n",
" 673.500000 | \n",
"
\n",
" \n",
" 75% | \n",
" 3.000000 | \n",
" 13.677500 | \n",
" 3.082500 | \n",
" 2.557500 | \n",
" 21.500000 | \n",
" 107.000000 | \n",
" 2.800000 | \n",
" 2.875000 | \n",
" 0.437500 | \n",
" 1.950000 | \n",
" 6.200000 | \n",
" 1.120000 | \n",
" 3.170000 | \n",
" 985.000000 | \n",
"
\n",
" \n",
" max | \n",
" 3.000000 | \n",
" 14.830000 | \n",
" 5.800000 | \n",
" 3.230000 | \n",
" 30.000000 | \n",
" 162.000000 | \n",
" 3.880000 | \n",
" 5.080000 | \n",
" 0.660000 | \n",
" 3.580000 | \n",
" 13.000000 | \n",
" 1.710000 | \n",
" 4.000000 | \n",
" 1680.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Type Alcohol Malic acid Ash Alcalinity of ash \\\n",
"count 178.000000 178.000000 178.000000 178.000000 178.000000 \n",
"mean 1.938202 13.000618 2.336348 2.366517 19.494944 \n",
"std 0.775035 0.811827 1.117146 0.274344 3.339564 \n",
"min 1.000000 11.030000 0.740000 1.360000 10.600000 \n",
"25% 1.000000 12.362500 1.602500 2.210000 17.200000 \n",
"50% 2.000000 13.050000 1.865000 2.360000 19.500000 \n",
"75% 3.000000 13.677500 3.082500 2.557500 21.500000 \n",
"max 3.000000 14.830000 5.800000 3.230000 30.000000 \n",
"\n",
" Magnesium Total phenols Flavanoids Nonflavanoid phenols \\\n",
"count 178.000000 178.000000 178.000000 178.000000 \n",
"mean 99.741573 2.295112 2.029270 0.361854 \n",
"std 14.282484 0.625851 0.998859 0.124453 \n",
"min 70.000000 0.980000 0.340000 0.130000 \n",
"25% 88.000000 1.742500 1.205000 0.270000 \n",
"50% 98.000000 2.355000 2.135000 0.340000 \n",
"75% 107.000000 2.800000 2.875000 0.437500 \n",
"max 162.000000 3.880000 5.080000 0.660000 \n",
"\n",
" Proanthocyanins Color intensity Hue \\\n",
"count 178.000000 178.000000 178.000000 \n",
"mean 1.590899 5.058090 0.957449 \n",
"std 0.572359 2.318286 0.228572 \n",
"min 0.410000 1.280000 0.480000 \n",
"25% 1.250000 3.220000 0.782500 \n",
"50% 1.555000 4.690000 0.965000 \n",
"75% 1.950000 6.200000 1.120000 \n",
"max 3.580000 13.000000 1.710000 \n",
"\n",
" OD280/OD315 of diluted wines Proline \n",
"count 178.000000 178.000000 \n",
"mean 2.611685 746.893258 \n",
"std 0.709990 314.907474 \n",
"min 1.270000 278.000000 \n",
"25% 1.937500 500.500000 \n",
"50% 2.780000 673.500000 \n",
"75% 3.170000 985.000000 \n",
"max 4.000000 1680.000000 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wine.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `Proline` column has an extremely high variance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Log normalization in Python\n",
"Now that we know that the `Proline` column in our wine dataset has a large amount of variance, let's log normalize it."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"99166.71735542428\n",
"0.17231366191842018\n"
]
}
],
"source": [
"# Print out the variance of the Proline column\n",
"print(wine['Proline'].var())\n",
"\n",
"# Apply the log normalization function to the Proline column\n",
"wine['Proline_log'] = np.log(wine['Proline'])\n",
"\n",
"# Check the variance of the normalized Proline column\n",
"print(wine['Proline_log'].var())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scaling data for feature comparison\n",
"- Features on different scales\n",
"- Model with linear characteristics\n",
"- Center features around 0 and transform to unit variance(1)\n",
"- Transforms to approximately normal distribution"
]
},
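{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make \"center around 0, unit variance\" concrete, here is a manual z-score version of what `StandardScaler` computes (a sketch added for this summary; the exercises below use scikit-learn directly):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Manual standardization: z = (x - mean) / std\n",
"proline = wine['Proline']\n",
"proline_z = (proline - proline.mean()) / proline.std()\n",
"\n",
"# Mean ~0 and standard deviation ~1 after scaling\n",
"# (note: pandas uses the sample std, ddof=1, while StandardScaler divides by the\n",
"#  population std, ddof=0, so the two outputs differ by a tiny constant factor)\n",
"print(proline_z.mean(), proline_z.std())"
]
},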
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Scaling data - investigating columns\n",
"We want to use the `Ash`, `Alcalinity of ash`, and `Magnesium` columns in the `wine` dataset to train a linear model, but it's possible that these columns are all measured in different ways, which would bias a linear model. Using `describe()` to return descriptive statistics about this dataset, which of the following statements are true about the scale of data in these columns?\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Ash | \n",
" Alcalinity of ash | \n",
" Magnesium | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 178.000000 | \n",
" 178.000000 | \n",
" 178.000000 | \n",
"
\n",
" \n",
" mean | \n",
" 2.366517 | \n",
" 19.494944 | \n",
" 99.741573 | \n",
"
\n",
" \n",
" std | \n",
" 0.274344 | \n",
" 3.339564 | \n",
" 14.282484 | \n",
"
\n",
" \n",
" min | \n",
" 1.360000 | \n",
" 10.600000 | \n",
" 70.000000 | \n",
"
\n",
" \n",
" 25% | \n",
" 2.210000 | \n",
" 17.200000 | \n",
" 88.000000 | \n",
"
\n",
" \n",
" 50% | \n",
" 2.360000 | \n",
" 19.500000 | \n",
" 98.000000 | \n",
"
\n",
" \n",
" 75% | \n",
" 2.557500 | \n",
" 21.500000 | \n",
" 107.000000 | \n",
"
\n",
" \n",
" max | \n",
" 3.230000 | \n",
" 30.000000 | \n",
" 162.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Ash Alcalinity of ash Magnesium\n",
"count 178.000000 178.000000 178.000000\n",
"mean 2.366517 19.494944 99.741573\n",
"std 0.274344 3.339564 14.282484\n",
"min 1.360000 10.600000 70.000000\n",
"25% 2.210000 17.200000 88.000000\n",
"50% 2.360000 19.500000 98.000000\n",
"75% 2.557500 21.500000 107.000000\n",
"max 3.230000 30.000000 162.000000"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wine[['Ash', 'Alcalinity of ash', 'Magnesium']].describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Scaling data - standardizing columns\n",
"Since we know that the `Ash`, `Alcalinity of ash`, and `Magnesium` columns in the `wine` dataset are all on different scales, let's standardize them in a way that allows for use in a linear model.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Ash Alcalinity of ash Magnesium\n",
"0 2.43 15.6 127\n",
"1 2.14 11.2 100\n",
"2 2.67 18.6 101\n",
"[[ 0.23205254 -1.16959318 1.91390522]\n",
" [-0.82799632 -2.49084714 0.01814502]\n",
" [ 1.10933436 -0.2687382 0.08835836]]\n"
]
}
],
"source": [
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Create the scaler\n",
"ss = StandardScaler()\n",
"\n",
"# Take a subset of the DataFrame you want to scale\n",
"wine_subset = wine[['Ash', 'Alcalinity of ash', 'Magnesium']]\n",
"\n",
"print(wine_subset.iloc[:3])\n",
"\n",
"# Apply the scaler to the DataFrame subset\n",
"wine_subset_scaled = ss.fit_transform(wine_subset)\n",
"\n",
"print(wine_subset_scaled[:3])"
]
},
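{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check (added for this summary), each scaled column should now have a mean of roughly 0 and a standard deviation of roughly 1:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Column-wise mean and standard deviation of the scaled array\n",
"print(wine_subset_scaled.mean(axis=0).round(10))\n",
"print(wine_subset_scaled.std(axis=0))"
]
},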
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Standardized data and modeling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### KNN on non-scaled data\n",
"Let's first take a look at the accuracy of a K-nearest neighbors model on the `wine` dataset without standardizing the data. The `knn` model as well as the `X` and `y` data and labels sets have been created already. Most of this process of creating models in scikit-learn should look familiar to you.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"wine = pd.read_csv('./dataset/wine_types.csv')\n",
"\n",
"X = wine.drop('Type', axis=1)\n",
"y = wine['Type'] \n",
"\n",
"knn = KNeighborsClassifier()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.7555555555555555\n"
]
}
],
"source": [
"# Split the dataset and labels into training and test sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
"\n",
"# Fit the k-nearest neighbors model to the training data\n",
"knn.fit(X_train, y_train)\n",
"\n",
"# Score the model on the test data\n",
"print(knn.score(X_test, y_test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### KNN on scaled data\n",
"The accuracy score on the unscaled wine dataset was decent, but we can likely do better if we scale the dataset. The process is mostly the same as the previous exercise, with the added step of scaling the data. \n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.9555555555555556\n"
]
}
],
"source": [
"knn = KNeighborsClassifier()\n",
"\n",
"# Create the scaling method\n",
"ss = StandardScaler()\n",
"\n",
"# Apply the scaling method to the dataset used for modeling\n",
"X_scaled = ss.fit_transform(X)\n",
"X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)\n",
"\n",
"# Fit the k-nearest neighbors model to the training data.\n",
"knn.fit(X_train, y_train)\n",
"\n",
"# Score the model on the test data\n",
"print(knn.score(X_test, y_test))"
]
}
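,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One caveat worth noting (added for this summary): fitting the scaler on all of `X` before splitting leaks test-set statistics into the training data. A minimal leakage-free sketch using scikit-learn's `make_pipeline` follows; the `random_state` is an assumption for reproducibility, not part of the original exercise:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.pipeline import make_pipeline\n",
"\n",
"# The pipeline fits the scaler on the training data only, then applies it at predict time\n",
"pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())\n",
"\n",
"# random_state=42 is an arbitrary choice for reproducibility\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)\n",
"pipe.fit(X_train, y_train)\n",
"print(pipe.score(X_test, y_test))"
]
}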
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}