{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data scale normalization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Normalization is a common technique in machine learning that rescales features of different magnitudes to a common range between 0 and 1.\n", "\n", "Here we demonstrate how to do this with pandas and Altair.\n", "\n", "\n", "Original inspiration: [Jason Brownlee: Machine Learning Algorithms from Scratch](https://machinelearningmastery.com/machine-learning-algorithms-from-scratch/)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2020-09-21T19:29:49.156605Z", "start_time": "2020-09-21T19:29:48.763216Z" } }, "outputs": [ { "data": { "text/plain": [ "RendererRegistry(active='default', registered=['colab', 'default', 'html', 'json', 'jupyterlab', 'kaggle', 'mimetype', 'notebook', 'nteract', 'png', 'svg', 'zeppelin'])" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import altair as alt\n", "\n", "# alt.renderers.enable('default')\n", "alt.renderers" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2020-09-21T19:29:49.162070Z", "start_time": "2020-09-21T19:29:49.158130Z" } }, "outputs": [], "source": [ "from vega_datasets import data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the Gapminder health and income dataset." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2020-09-21T19:29:49.307685Z", "start_time": "2020-09-21T19:29:49.163734Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
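<table><thead><tr><th></th><th>country</th><th>income</th><th>health</th><th>population</th></tr></thead><tbody>
<tr><th>0</th><td>Afghanistan</td><td>1925</td><td>57.63</td><td>32526562</td></tr>
<tr><th>1</th><td>Albania</td><td>10620</td><td>76.00</td><td>2896679</td></tr>
<tr><th>2</th><td>Algeria</td><td>13434</td><td>76.50</td><td>39666519</td></tr>
<tr><th>3</th><td>Andorra</td><td>46577</td><td>84.10</td><td>70473</td></tr>
<tr><th>4</th><td>Angola</td><td>7615</td><td>61.00</td><td>25021974</td></tr></tbody></table>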
\n", "
" ], "text/plain": [ " country income health population\n", "0 Afghanistan 1925 57.63 32526562\n", "1 Albania 10620 76.00 2896679\n", "2 Algeria 13434 76.50 39666519\n", "3 Andorra 46577 84.10 70473\n", "4 Angola 7615 61.00 25021974" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "health_income = data('gapminder-health-income')\n", "health_income.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2020-09-21T19:29:49.340639Z", "start_time": "2020-09-21T19:29:49.310210Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "income_domain = [health_income['income'].min(), health_income['income'].max()]\n", "health_domain = [health_income['health'].min(), health_income['health'].max()]\n", "\n", "alt.Chart(health_income).mark_point().encode(\n", " alt.X('income:Q', scale=alt.Scale(domain=income_domain)),\n", " alt.Y('health:Q', scale=alt.Scale(domain=health_domain)),\n", " alt.Size('population:Q'),\n", " alt.Tooltip('country:N')\n", ").properties(height=600, width=800)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The process:\n", "\n", "1. Subtract the column minimum from each value;\n", "2. Compute the value range, that is, the difference between the largest and smallest values;\n", "3. Divide the shifted values by the range.\n", "\n", "$ \\text{scaled value} = \\frac{\\text{value} - \\text{min}}{\\text{max} - \\text{min}} $\n", "\n", "The first step ensures that the smallest value becomes 0. Dividing the shifted values by the range 'compresses' them so that the new maximum becomes 1." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2020-09-21T19:29:49.344228Z", "start_time": "2020-09-21T19:29:49.342047Z" } }, "outputs": [], "source": [ "quantitative_columns = ['income', 'health', 'population']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The original minimum and maximum values" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2020-09-21T19:29:49.357354Z", "start_time": "2020-09-21T19:29:49.345747Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
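<table><thead><tr><th></th><th>country</th><th>income</th><th>health</th><th>population</th></tr></thead><tbody>
<tr><th>32</th><td>Central African Republic</td><td>599</td><td>53.8</td><td>4900274</td></tr>
<tr><th>93</th><td>Lesotho</td><td>2598</td><td>48.5</td><td>2135022</td></tr>
<tr><th>105</th><td>Marshall Islands</td><td>3661</td><td>65.1</td><td>52993</td></tr></tbody></table>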
\n", "
" ], "text/plain": [ " country income health population\n", "32 Central African Republic 599 53.8 4900274\n", "93 Lesotho 2598 48.5 2135022\n", "105 Marshall Islands 3661 65.1 52993" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "health_income.loc[health_income[quantitative_columns].idxmin(), :]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2020-09-21T19:29:49.369172Z", "start_time": "2020-09-21T19:29:49.359331Z" } }, "outputs": [ { "data": { "text/plain": [ "income 599.0\n", "health 48.5\n", "population 52993.0\n", "dtype: float64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "minimums = health_income[quantitative_columns].min()\n", "minimums" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2020-09-21T19:29:49.381879Z", "start_time": "2020-09-21T19:29:49.371370Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
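<table><thead><tr><th></th><th>country</th><th>income</th><th>health</th><th>population</th></tr></thead><tbody>
<tr><th>134</th><td>Qatar</td><td>132877</td><td>82.0</td><td>2235355</td></tr>
<tr><th>3</th><td>Andorra</td><td>46577</td><td>84.1</td><td>70473</td></tr>
<tr><th>35</th><td>China</td><td>13334</td><td>76.9</td><td>1376048943</td></tr></tbody></table>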
\n", "
" ], "text/plain": [ " country income health population\n", "134 Qatar 132877 82.0 2235355\n", "3 Andorra 46577 84.1 70473\n", "35 China 13334 76.9 1376048943" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "health_income.loc[health_income[quantitative_columns].idxmax(), :]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2020-09-21T19:29:49.391669Z", "start_time": "2020-09-21T19:29:49.383878Z" } }, "outputs": [ { "data": { "text/plain": [ "income 1.328770e+05\n", "health 8.410000e+01\n", "population 1.376049e+09\n", "dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "maximums = health_income[quantitative_columns].max()\n", "maximums" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Difference of values from the column minimum" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2020-09-21T19:29:49.405142Z", "start_time": "2020-09-21T19:29:49.393652Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
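<table><thead><tr><th></th><th>income</th><th>health</th><th>population</th></tr></thead><tbody>
<tr><th>0</th><td>1326.0</td><td>9.13</td><td>32473569.0</td></tr>
<tr><th>1</th><td>10021.0</td><td>27.50</td><td>2843686.0</td></tr>
<tr><th>2</th><td>12835.0</td><td>28.00</td><td>39613526.0</td></tr>
<tr><th>3</th><td>45978.0</td><td>35.60</td><td>17480.0</td></tr>
<tr><th>4</th><td>7016.0</td><td>12.50</td><td>24968981.0</td></tr>
<tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>
<tr><th>182</th><td>5024.0</td><td>28.00</td><td>93394608.0</td></tr>
<tr><th>183</th><td>3720.0</td><td>26.70</td><td>4615473.0</td></tr>
<tr><th>184</th><td>3288.0</td><td>19.10</td><td>26779222.0</td></tr>
<tr><th>185</th><td>3435.0</td><td>10.46</td><td>16158774.0</td></tr>
<tr><th>186</th><td>1202.0</td><td>11.51</td><td>15549758.0</td></tr></tbody></table>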
\n", "

187 rows × 3 columns

\n", "
" ], "text/plain": [ " income health population\n", "0 1326.0 9.13 32473569.0\n", "1 10021.0 27.50 2843686.0\n", "2 12835.0 28.00 39613526.0\n", "3 45978.0 35.60 17480.0\n", "4 7016.0 12.50 24968981.0\n", ".. ... ... ...\n", "182 5024.0 28.00 93394608.0\n", "183 3720.0 26.70 4615473.0\n", "184 3288.0 19.10 26779222.0\n", "185 3435.0 10.46 16158774.0\n", "186 1202.0 11.51 15549758.0\n", "\n", "[187 rows x 3 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "health_income[quantitative_columns] - minimums" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Value ranges: the difference between the maximum and the minimum" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2020-09-21T19:29:49.410965Z", "start_time": "2020-09-21T19:29:49.406603Z" } }, "outputs": [ { "data": { "text/plain": [ "income 1.322780e+05\n", "health 3.560000e+01\n", "population 1.375996e+09\n", "dtype: float64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "maximums - minimums" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Putting the steps together, let's normalize the dataset." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2020-09-21T19:29:49.415358Z", "start_time": "2020-09-21T19:29:49.412365Z" } }, "outputs": [], "source": [ "def normalize_dataset(dataset, quantitative_columns):\n", "    # Work on a copy so the caller's DataFrame is left untouched\n", "    dataset = dataset.copy()\n", "\n", "    minimums = dataset[quantitative_columns].min()\n", "    maximums = dataset[quantitative_columns].max()\n", "\n", "    # Column-wise (value - min) / (max - min); a constant column (max == min) would yield NaN\n", "    dataset[quantitative_columns] = (dataset[quantitative_columns] - minimums) / (maximums - minimums)\n", "\n", "    return dataset\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2020-09-21T19:29:49.433712Z", "start_time": "2020-09-21T19:29:49.416906Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
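<table><thead><tr><th></th><th>country</th><th>income</th><th>health</th><th>population</th></tr></thead><tbody>
<tr><th>0</th><td>Afghanistan</td><td>0.010024</td><td>0.256461</td><td>0.023600</td></tr>
<tr><th>1</th><td>Albania</td><td>0.075757</td><td>0.772472</td><td>0.002067</td></tr>
<tr><th>2</th><td>Algeria</td><td>0.097030</td><td>0.786517</td><td>0.028789</td></tr>
<tr><th>3</th><td>Andorra</td><td>0.347586</td><td>1.000000</td><td>0.000013</td></tr>
<tr><th>4</th><td>Angola</td><td>0.053040</td><td>0.351124</td><td>0.018146</td></tr>
<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td></tr>
<tr><th>182</th><td>Vietnam</td><td>0.037981</td><td>0.786517</td><td>0.067874</td></tr>
<tr><th>183</th><td>West Bank and Gaza</td><td>0.028123</td><td>0.750000</td><td>0.003354</td></tr>
<tr><th>184</th><td>Yemen</td><td>0.024857</td><td>0.536517</td><td>0.019462</td></tr>
<tr><th>185</th><td>Zambia</td><td>0.025968</td><td>0.293820</td><td>0.011743</td></tr>
<tr><th>186</th><td>Zimbabwe</td><td>0.009087</td><td>0.323315</td><td>0.011301</td></tr></tbody></table>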
\n", "

187 rows × 4 columns

\n", "
" ], "text/plain": [ " country income health population\n", "0 Afghanistan 0.010024 0.256461 0.023600\n", "1 Albania 0.075757 0.772472 0.002067\n", "2 Algeria 0.097030 0.786517 0.028789\n", "3 Andorra 0.347586 1.000000 0.000013\n", "4 Angola 0.053040 0.351124 0.018146\n", ".. ... ... ... ...\n", "182 Vietnam 0.037981 0.786517 0.067874\n", "183 West Bank and Gaza 0.028123 0.750000 0.003354\n", "184 Yemen 0.024857 0.536517 0.019462\n", "185 Zambia 0.025968 0.293820 0.011743\n", "186 Zimbabwe 0.009087 0.323315 0.011301\n", "\n", "[187 rows x 4 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalized_health_income = normalize_dataset(health_income, quantitative_columns)\n", "normalized_health_income" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The new minimum and maximum values" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2020-09-21T19:29:49.445088Z", "start_time": "2020-09-21T19:29:49.435633Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
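<table><thead><tr><th></th><th>country</th><th>income</th><th>health</th><th>population</th></tr></thead><tbody>
<tr><th>32</th><td>Central African Republic</td><td>0.000000</td><td>0.148876</td><td>0.003523</td></tr>
<tr><th>93</th><td>Lesotho</td><td>0.015112</td><td>0.000000</td><td>0.001513</td></tr>
<tr><th>105</th><td>Marshall Islands</td><td>0.023148</td><td>0.466292</td><td>0.000000</td></tr></tbody></table>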
\n", "
" ], "text/plain": [ " country income health population\n", "32 Central African Republic 0.000000 0.148876 0.003523\n", "93 Lesotho 0.015112 0.000000 0.001513\n", "105 Marshall Islands 0.023148 0.466292 0.000000" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalized_health_income.loc[normalized_health_income[quantitative_columns].idxmin(), :]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2020-09-21T19:29:49.460769Z", "start_time": "2020-09-21T19:29:49.446582Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
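<table><thead><tr><th></th><th>country</th><th>income</th><th>health</th><th>population</th></tr></thead><tbody>
<tr><th>134</th><td>Qatar</td><td>1.000000</td><td>0.941011</td><td>0.001586</td></tr>
<tr><th>3</th><td>Andorra</td><td>0.347586</td><td>1.000000</td><td>0.000013</td></tr>
<tr><th>35</th><td>China</td><td>0.096275</td><td>0.797753</td><td>1.000000</td></tr></tbody></table>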
\n", "
" ], "text/plain": [ " country income health population\n", "134 Qatar 1.000000 0.941011 0.001586\n", "3 Andorra 0.347586 1.000000 0.000013\n", "35 China 0.096275 0.797753 1.000000" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalized_health_income.loc[normalized_health_income[quantitative_columns].idxmax(), :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting the normalized data gives the same picture as before, but with the `income`, `health`, and `population` scales all normalized to the \\[0, 1\\] range." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Normalized data" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2020-09-21T19:29:49.491216Z", "start_time": "2020-09-21T19:29:49.463107Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(normalized_health_income).mark_point().encode(\n", " alt.X('income:Q',),\n", " alt.Y('health:Q'),\n", " alt.Size('population:Q'),\n", " alt.Tooltip('country:N')\n", ").properties(height=600, width=800)" ] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }