{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Scaling\n", "\n", "Machine learning models map features into an n-dimensional space. Suppose there are two variables (x, y) mapped onto a 2D coordinate system. If one variable, say y, takes very large values and the other, x, takes very small ones, the Euclidean distance between points is dominated by y and the contribution of x is effectively ignored. We lose valuable information this way, and feature scaling is used to solve the problem. \n", "\n", "\n", "Additional reasons for transformation:\n", "\n", "1. To more closely approximate a theoretical distribution that has nice statistical properties. \n", "2. To spread out data more evenly.\n", "3. To make the data distribution more symmetric.\n", "4. To make relationships between variables more linear. \n", "5. To make the variance of the data more constant (homoscedasticity). \n", "\n", "#### There are 4 commonly used ways to transform features. \n", "1. __Min Max Scaling__: \n", "Scales the input to have a minimum of 0 and a maximum of 1, that is, into the range [0, 1]. This is useful when the features have to be on the same positive scale. However, it is sensitive to outliers: a single extreme value becomes the new 0 or 1 and compresses the remaining values into a narrow part of the range. \n", "$$X_{norm} = \\frac{X - X_{min}}{X_{max} - X_{min}}$$\n", "\n", "2. __Standardization__:\n", "Scales the input to have a mean of 0 and a variance of 1, where $\\mu$ is the mean and $\\sigma$ the standard deviation of the feature. \n", "$$X_{stand} = \\frac{X - \\mu}{\\sigma}$$\n", "\n", "3. __Normalizing__: \n", "Scales each sample (row) to have a norm of 1. For instance, for 3D data each sample will lie on a unit sphere. \n", "\n", "4. __Log Transformation__:\n", "Takes the log of the data, which compresses large values and reduces right skew. The log is only defined for positive values, so it is usually applied to the raw data rather than after the transformations above, which can produce zeros or negative values. \n", "\n", "Scaling inputs to unit norms is a common operation for text classification or clustering. 
For instance, the dot product of two l2-normalized TF-IDF vectors is the cosine similarity of the vectors and is the base similarity metric for the Vector Space Model commonly used by the Information Retrieval community.\n", "\n", "As a rule of thumb, Standardization is a reasonable default for most applications. Min Max Scaling is often used for Neural Networks. Normalizing is often used when clustering, e.g. with KMeans. " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Country Age Salary Purchased\n", "0 France 44.0 72000.0 No\n", "1 Spain 27.0 48000.0 Yes\n", "2 Germany 30.0 54000.0 No\n", "3 Spain 38.0 61000.0 No\n", "5 France 35.0 58000.0 Yes\n", "7 France 48.0 79000.0 Yes\n", "8 Germany 50.0 83000.0 No\n", "9 France 37.0 67000.0 Yes\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "from sklearn.preprocessing import StandardScaler, Normalizer, MinMaxScaler\n", "\n", "# Load the dataset and drop rows with missing values\n", "df = pd.read_csv('Data.csv').dropna()\n", "print(df)\n", "# Keep only the numeric features that will be scaled\n", "X = df[[\"Age\", \"Salary\"]].values.astype(np.float64)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Standardization\n", "[[ 0.69985807 0.58989097]\n", " [-1.51364653 -1.50749915]\n", " [-1.12302807 -0.98315162]\n", " [-0.08137885 -0.37141284]\n", " [-0.47199731 -0.6335866 ]\n", " [ 1.22068269 1.20162976]\n", " [ 1.48109499 1.55119478]\n", " [-0.211585 0.1529347 ]]\n", "Normalizing\n", "[[ 6.11110997e-04 9.99999813e-01]\n", " [ 5.62499911e-04 9.99999842e-01]\n", " [ 5.55555470e-04 9.99999846e-01]\n", " [ 6.22950699e-04 9.99999806e-01]\n", " [ 6.03448166e-04 9.99999818e-01]\n", " [ 6.07594825e-04 9.99999815e-01]\n", " [ 6.02409529e-04 9.99999819e-01]\n", " [ 5.52238722e-04 9.99999848e-01]]\n", "MinMax Scaling\n", "[[ 0.73913043 0.68571429]\n", " [ 0. 0. 
]\n", " [ 0.13043478 0.17142857]\n", " [ 0.47826087 0.37142857]\n", " [ 0.34782609 0.28571429]\n", " [ 0.91304348 0.88571429]\n", " [ 1. 1. ]\n", " [ 0.43478261 0.54285714]]\n" ] } ], "source": [ "# Create one instance of each scaler with default settings\n", "standard_scaler = StandardScaler()\n", "normalizer = Normalizer()\n", "min_max_scaler = MinMaxScaler()\n", "\n", "# Rescale each column to zero mean and unit variance\n", "print(\"Standardization\")\n", "print(standard_scaler.fit_transform(X))\n", "\n", "# Rescale each row (sample) to unit l2 norm\n", "print(\"Normalizing\")\n", "print(normalizer.fit_transform(X))\n", "\n", "# Rescale each column to the range [0, 1]\n", "print(\"MinMax Scaling\")\n", "print(min_max_scaler.fit_transform(X))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }