{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Standardizing Data\n", "> This chapter is all about standardizing data. Often a model will make some assumptions about the distribution or scale of your features. Standardization is a way to make your data fit these assumptions and improve the algorithm's performance. This is a summary of the lecture \"Preprocessing for Machine Learning in Python\" from DataCamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Standardizing Data\n", "- Standardization\n", " - Preprocessing method used to transform continuous data to make it look normally distributed\n", " - Many scikit-learn models assume normally distributed data\n", " - Log normalization\n", " - Feature scaling\n", "- When to standardize: models\n", " - Model in linear space\n", " - Dataset features have high variance\n", " - Dataset features are continuous and on different scales\n", " - Linearity assumptions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Modeling without normalizing\n", "Let's take a look at what might happen to your model's accuracy if you try to model data without doing some sort of standardization first. Here we have a subset of the wine dataset. One of the columns, `Proline`, has an extremely high variance compared to the other columns. This is an example of where a technique like log normalization would come in handy, which you'll learn about in the next section.\n", "\n", "The scikit-learn model training process should be familiar to you at this point, so we won't go too in-depth with it. 
You already have a k-nearest neighbors model available (`knn`) as well as the `X` and `y` sets you need to fit and score on." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Type Alcohol Malic acid Ash Alcalinity of ash Magnesium \\\n", "0 1 14.23 1.71 2.43 15.6 127 \n", "1 1 13.20 1.78 2.14 11.2 100 \n", "2 1 13.16 2.36 2.67 18.6 101 \n", "3 1 14.37 1.95 2.50 16.8 113 \n", "4 1 13.24 2.59 2.87 21.0 118 \n", "\n", " Total phenols Flavanoids Nonflavanoid phenols Proanthocyanins \\\n", "0 2.80 3.06 0.28 2.29 \n", "1 2.65 2.76 0.26 1.28 \n", "2 2.80 3.24 0.30 2.81 \n", "3 3.85 3.49 0.24 2.18 \n", "4 2.80 2.69 0.39 1.82 \n", "\n", " Color intensity Hue OD280/OD315 of diluted wines Proline \n", "0 5.64 1.04 3.92 1065 \n", "1 4.38 1.05 3.40 1050 \n", "2 5.68 1.03 3.17 1185 \n", "3 7.80 0.86 3.45 1480 \n", "4 4.32 1.04 2.93 735 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wine = pd.read_csv('./dataset/wine_types.csv')\n", "wine.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "X = wine[['Proline', 'Total phenols', 'Hue', 'Nonflavanoid phenols']]\n", "y = wine['Type'] " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.6888888888888889\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "knn = KNeighborsClassifier()\n", "\n", "# Split the dataset and labels into training and test sets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)\n", "\n", "# Fit the k-nearest neighbors model to the training data\n", "knn.fit(X_train, y_train)\n", "\n", "# Score the model on the test data\n", "print(knn.score(X_test, y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Log normalization\n", "- Applies log transformation\n", "- Natural log using the constant $e$ (approximately 2.718)\n", "- Captures relative changes, the magnitude of change, and keeps everything in the positive space" ] }, { "cell_type": "markdown", "metadata": {}, "source": 
[ "### Checking the variance\n", "Check the variance of the columns in the `wine` dataset." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Type Alcohol Malic acid Ash Alcalinity of ash \\\n", "count 178.000000 178.000000 178.000000 178.000000 178.000000 \n", "mean 1.938202 13.000618 2.336348 2.366517 19.494944 \n", "std 0.775035 0.811827 1.117146 0.274344 3.339564 \n", "min 1.000000 11.030000 0.740000 1.360000 10.600000 \n", "25% 1.000000 12.362500 1.602500 2.210000 17.200000 \n", "50% 2.000000 13.050000 1.865000 2.360000 19.500000 \n", "75% 3.000000 13.677500 3.082500 2.557500 21.500000 \n", "max 3.000000 14.830000 5.800000 3.230000 30.000000 \n", "\n", " Magnesium Total phenols Flavanoids Nonflavanoid phenols \\\n", "count 178.000000 178.000000 178.000000 178.000000 \n", "mean 99.741573 2.295112 2.029270 0.361854 \n", "std 14.282484 0.625851 0.998859 0.124453 \n", "min 70.000000 0.980000 0.340000 0.130000 \n", "25% 88.000000 1.742500 1.205000 0.270000 \n", "50% 98.000000 2.355000 2.135000 0.340000 \n", "75% 107.000000 2.800000 2.875000 0.437500 \n", "max 162.000000 3.880000 5.080000 0.660000 \n", "\n", " Proanthocyanins Color intensity Hue \\\n", "count 178.000000 178.000000 178.000000 \n", "mean 1.590899 5.058090 0.957449 \n", "std 0.572359 2.318286 0.228572 \n", "min 0.410000 1.280000 0.480000 \n", "25% 1.250000 3.220000 0.782500 \n", "50% 1.555000 4.690000 0.965000 \n", "75% 1.950000 6.200000 1.120000 \n", "max 3.580000 13.000000 1.710000 \n", "\n", " OD280/OD315 of diluted wines Proline \n", "count 178.000000 178.000000 \n", "mean 2.611685 746.893258 \n", "std 0.709990 314.907474 \n", "min 1.270000 278.000000 \n", "25% 1.937500 500.500000 \n", "50% 2.780000 673.500000 \n", "75% 3.170000 985.000000 \n", "max 4.000000 1680.000000 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wine.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Proline` column has an extremely high variance." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Log normalization in Python\n", "Now that we know that the `Proline` column in our wine dataset has a large amount of variance, let's log normalize it." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "99166.71735542428\n", "0.17231366191842018\n" ] } ], "source": [ "# Print out the variance of the Proline column\n", "print(wine['Proline'].var())\n", "\n", "# Apply the log normalization function to the Proline column\n", "wine['Proline_log'] = np.log(wine['Proline'])\n", "\n", "# Check the variance of the normalized Proline column\n", "print(wine['Proline_log'].var())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scaling data for feature comparison\n", "- Features on different scales\n", "- Model with linear characteristics\n", "- Center features around 0 and transform to unit variance (1)\n", "- Changes only the location and scale of each feature, not the shape of its distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scaling data - investigating columns\n", "We want to use the `Ash`, `Alcalinity of ash`, and `Magnesium` columns in the `wine` dataset to train a linear model, but it's possible that these columns are all measured in different ways, which would bias a linear model. Use `describe()` to return descriptive statistics about these columns and compare the scale of the data in each.\n", "\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Ash Alcalinity of ash Magnesium\n", "count 178.000000 178.000000 178.000000\n", "mean 2.366517 19.494944 99.741573\n", "std 0.274344 3.339564 14.282484\n", "min 1.360000 10.600000 70.000000\n", "25% 2.210000 17.200000 88.000000\n", "50% 2.360000 19.500000 98.000000\n", "75% 2.557500 21.500000 107.000000\n", "max 3.230000 30.000000 162.000000" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wine[['Ash', 'Alcalinity of ash', 'Magnesium']].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scaling data - standardizing columns\n", "Since we know that the `Ash`, `Alcalinity of ash`, and `Magnesium` columns in the `wine` dataset are all on different scales, let's standardize them in a way that allows for use in a linear model.\n", "\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Ash Alcalinity of ash Magnesium\n", "0 2.43 15.6 127\n", "1 2.14 11.2 100\n", "2 2.67 18.6 101\n", "[[ 0.23205254 -1.16959318 1.91390522]\n", " [-0.82799632 -2.49084714 0.01814502]\n", " [ 1.10933436 -0.2687382 0.08835836]]\n" ] } ], "source": [ "from sklearn.preprocessing import StandardScaler\n", "\n", "# Create the scaler\n", "ss = StandardScaler()\n", "\n", "# Take a subset of the DataFrame you want to scale\n", "wine_subset = wine[['Ash', 'Alcalinity of ash', 'Magnesium']]\n", "\n", "print(wine_subset.iloc[:3])\n", "\n", "# Apply the scaler to the DataFrame subset\n", "wine_subset_scaled = ss.fit_transform(wine_subset)\n", "\n", "print(wine_subset_scaled[:3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Standardized data and modeling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### KNN on non-scaled data\n", "Let's first take a look at the accuracy of a K-nearest neighbors model on the `wine` dataset without standardizing the data. 
The `knn` model and the `X` (features) and `y` (labels) sets are created in the cell below. Most of this process of creating models in scikit-learn should look familiar to you.\n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "wine = pd.read_csv('./dataset/wine_types.csv')\n", "\n", "X = wine.drop('Type', axis=1)\n", "y = wine['Type'] \n", "\n", "knn = KNeighborsClassifier()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.7555555555555555\n" ] } ], "source": [ "# Split the dataset and labels into training and test sets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)\n", "\n", "# Fit the k-nearest neighbors model to the training data\n", "knn.fit(X_train, y_train)\n", "\n", "# Score the model on the test data\n", "print(knn.score(X_test, y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### KNN on scaled data\n", "The accuracy score on the unscaled wine dataset was decent (about 0.76 in the run above), but we can likely do better if we scale the dataset. The process is mostly the same as the previous exercise, with the added step of scaling the data. 
\n", "\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9555555555555556\n" ] } ], "source": [ "knn = KNeighborsClassifier()\n", "\n", "# Create the scaling method\n", "ss = StandardScaler()\n", "\n", "# Apply the scaling method to the dataset used for modeling\n", "# Note: fitting the scaler on all of X before splitting lets test-set statistics\n", "# influence training; fitting on the training split only would avoid this leakage\n", "X_scaled = ss.fit_transform(X)\n", "\n", "# Split the scaled data and labels into training and test sets\n", "X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)\n", "\n", "# Fit the k-nearest neighbors model to the training data.\n", "knn.fit(X_train, y_train)\n", "\n", "# Score the model on the test data\n", "print(knn.score(X_test, y_test))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }