{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data scale normalization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Normalization is a common technique used in machine learning to render the scales of different magnitudes to a common range between 0 and 1.\n",
"\n",
"Here we demonstrate how this is done with pandas and altair.\n",
"\n",
"\n",
"Original inspiration: (Jason Brownlee: Machine Learning Algorithms from Scratch)[https://machinelearningmastery.com/machine-learning-algorithms-from-scratch/]"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-21T19:29:49.156605Z",
"start_time": "2020-09-21T19:29:48.763216Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"RendererRegistry(active='default', registered=['colab', 'default', 'html', 'json', 'jupyterlab', 'kaggle', 'mimetype', 'notebook', 'nteract', 'png', 'svg', 'zeppelin'])"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import altair as alt\n",
"\n",
"# alt.renderers.enable('default')\n",
"alt.renderers"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-21T19:29:49.162070Z",
"start_time": "2020-09-21T19:29:49.158130Z"
}
},
"outputs": [],
"source": [
"from vega_datasets import data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use the Gapminder health and income dataset"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-21T19:29:49.307685Z",
"start_time": "2020-09-21T19:29:49.163734Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" income | \n",
" health | \n",
" population | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Afghanistan | \n",
" 1925 | \n",
" 57.63 | \n",
" 32526562 | \n",
"
\n",
" \n",
" 1 | \n",
" Albania | \n",
" 10620 | \n",
" 76.00 | \n",
" 2896679 | \n",
"
\n",
" \n",
" 2 | \n",
" Algeria | \n",
" 13434 | \n",
" 76.50 | \n",
" 39666519 | \n",
"
\n",
" \n",
" 3 | \n",
" Andorra | \n",
" 46577 | \n",
" 84.10 | \n",
" 70473 | \n",
"
\n",
" \n",
" 4 | \n",
" Angola | \n",
" 7615 | \n",
" 61.00 | \n",
" 25021974 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country income health population\n",
"0 Afghanistan 1925 57.63 32526562\n",
"1 Albania 10620 76.00 2896679\n",
"2 Algeria 13434 76.50 39666519\n",
"3 Andorra 46577 84.10 70473\n",
"4 Angola 7615 61.00 25021974"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"health_income = data('gapminder-health-income')\n",
"health_income.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-21T19:29:49.340639Z",
"start_time": "2020-09-21T19:29:49.310210Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
""
],
"text/plain": [
"alt.Chart(...)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"income_domain = [health_income['income'].min(), health_income['income'].max()]\n",
"health_domain = [health_income['health'].min(), health_income['health'].max()]\n",
"\n",
"alt.Chart(health_income).mark_point().encode(\n",
" alt.X('income:Q', scale=alt.Scale(domain=income_domain)),\n",
" alt.Y('health:Q', scale=alt.Scale(domain=health_domain)),\n",
" alt.Size('population:Q'),\n",
" alt.Tooltip('country:N')\n",
").properties(height=600, width=800)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The process:\n",
"\n",
"1. Take the values' difference from the smallest one;\n",
"2. Take the value range, that is, the difference between the largest and smallest values;\n",
"3. Divide the reduced values with the range.\n",
"\n",
"$ \\text {scaled value} = \\frac{value - min} {max - min} $\n",
"\n",
"The first step ensures that the smallest value will become 0. Dividing the reduced values by the range 'compresses' the values so the new maximum becomes 1."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-21T19:29:49.344228Z",
"start_time": "2020-09-21T19:29:49.342047Z"
}
},
"outputs": [],
"source": [
"quantitative_columns = ['income', 'health', 'population']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The original minimum and maximum values"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-21T19:29:49.357354Z",
"start_time": "2020-09-21T19:29:49.345747Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" income | \n",
" health | \n",
" population | \n",
"
\n",
" \n",
" \n",
" \n",
" 32 | \n",
" Central African Republic | \n",
" 599 | \n",
" 53.8 | \n",
" 4900274 | \n",
"
\n",
" \n",
" 93 | \n",
" Lesotho | \n",
" 2598 | \n",
" 48.5 | \n",
" 2135022 | \n",
"
\n",
" \n",
" 105 | \n",
" Marshall Islands | \n",
" 3661 | \n",
" 65.1 | \n",
" 52993 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country income health population\n",
"32 Central African Republic 599 53.8 4900274\n",
"93 Lesotho 2598 48.5 2135022\n",
"105 Marshall Islands 3661 65.1 52993"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"health_income.loc[health_income[quantitative_columns].idxmin(), :]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-21T19:29:49.369172Z",
"start_time": "2020-09-21T19:29:49.359331Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"income 599.0\n",
"health 48.5\n",
"population 52993.0\n",
"dtype: float64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"minimums = health_income[quantitative_columns].min()\n",
"minimums"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-21T19:29:49.381879Z",
"start_time": "2020-09-21T19:29:49.371370Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" income | \n",
" health | \n",
" population | \n",
"
\n",
" \n",
" \n",
" \n",
" 134 | \n",
" Qatar | \n",
" 132877 | \n",
" 82.0 | \n",
" 2235355 | \n",
"
\n",
" \n",
" 3 | \n",
" Andorra | \n",
" 46577 | \n",
" 84.1 | \n",
" 70473 | \n",
"
\n",
" \n",
" 35 | \n",
" China | \n",
" 13334 | \n",
" 76.9 | \n",
" 1376048943 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country income health population\n",
"134 Qatar 132877 82.0 2235355\n",
"3 Andorra 46577 84.1 70473\n",
"35 China 13334 76.9 1376048943"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"health_income.loc[health_income[quantitative_columns].idxmax(), :]"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-21T19:29:49.391669Z",
"start_time": "2020-09-21T19:29:49.383878Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"income 1.328770e+05\n",
"health 8.410000e+01\n",
"population 1.376049e+09\n",
"dtype: float64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"maximums = health_income[quantitative_columns].max()\n",
"maximums"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Difference of values from the column minimum"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-21T19:29:49.405142Z",
"start_time": "2020-09-21T19:29:49.393652Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" income | \n",
" health | \n",
" population | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1326.0 | \n",
" 9.13 | \n",
" 32473569.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 10021.0 | \n",
" 27.50 | \n",
" 2843686.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 12835.0 | \n",
" 28.00 | \n",
" 39613526.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 45978.0 | \n",
" 35.60 | \n",
" 17480.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 7016.0 | \n",
" 12.50 | \n",
" 24968981.0 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 182 | \n",
" 5024.0 | \n",
" 28.00 | \n",
" 93394608.0 | \n",
"
\n",
" \n",
" 183 | \n",
" 3720.0 | \n",
" 26.70 | \n",
" 4615473.0 | \n",
"
\n",
" \n",
" 184 | \n",
" 3288.0 | \n",
" 19.10 | \n",
" 26779222.0 | \n",
"
\n",
" \n",
" 185 | \n",
" 3435.0 | \n",
" 10.46 | \n",
" 16158774.0 | \n",
"
\n",
" \n",
" 186 | \n",
" 1202.0 | \n",
" 11.51 | \n",
" 15549758.0 | \n",
"
\n",
" \n",
"
\n",
"
187 rows × 3 columns
\n",
"
"
],
"text/plain": [
" income health population\n",
"0 1326.0 9.13 32473569.0\n",
"1 10021.0 27.50 2843686.0\n",
"2 12835.0 28.00 39613526.0\n",
"3 45978.0 35.60 17480.0\n",
"4 7016.0 12.50 24968981.0\n",
".. ... ... ...\n",
"182 5024.0 28.00 93394608.0\n",
"183 3720.0 26.70 4615473.0\n",
"184 3288.0 19.10 26779222.0\n",
"185 3435.0 10.46 16158774.0\n",
"186 1202.0 11.51 15549758.0\n",
"\n",
"[187 rows x 3 columns]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"health_income[quantitative_columns] - minimums"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Value ranges: the difference between the maximum and the minimum"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-21T19:29:49.410965Z",
"start_time": "2020-09-21T19:29:49.406603Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"income 1.322780e+05\n",
"health 3.560000e+01\n",
"population 1.375996e+09\n",
"dtype: float64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"maximums - minimums"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's normalize the dataset"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-21T19:29:49.415358Z",
"start_time": "2020-09-21T19:29:49.412365Z"
}
},
"outputs": [],
"source": [
"def normalize_dataset(dataset, quantitative_columns):\n",
" dataset = dataset.copy()\n",
" \n",
" minimums = dataset[quantitative_columns].min()\n",
" maximums = dataset[quantitative_columns].max()\n",
"\n",
" dataset[quantitative_columns] = (dataset[quantitative_columns] - minimums) / (maximums - minimums)\n",
" \n",
" return dataset\n"
]
},
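{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, the same min-max steps can be tried on a tiny hand-made table where the expected result is easy to verify by eye. This is only a sketch: the `toy` DataFrame and the `min_max_scale` helper are hypothetical names introduced for illustration and are not part of the Gapminder data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data, not taken from the Gapminder dataset\n",
"toy = pd.DataFrame({\n",
"    'name': ['a', 'b', 'c'],\n",
"    'x': [10.0, 20.0, 30.0],\n",
"    'y': [1.0, 3.0, 2.0],\n",
"})\n",
"\n",
"\n",
"def min_max_scale(frame, columns):\n",
"    # Hypothetical helper: returns a copy with `columns` rescaled to [0, 1]\n",
"    scaled = frame.copy()\n",
"    minimums = scaled[columns].min()\n",
"    value_ranges = scaled[columns].max() - minimums\n",
"    scaled[columns] = (scaled[columns] - minimums) / value_ranges\n",
"    return scaled\n",
"\n",
"\n",
"# x has minimum 10 and range 20, so 10, 20, 30 map to 0.0, 0.5, 1.0\n",
"scaled_toy = min_max_scale(toy, ['x', 'y'])\n",
"scaled_toy"
]
},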
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-21T19:29:49.433712Z",
"start_time": "2020-09-21T19:29:49.416906Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" income | \n",
" health | \n",
" population | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Afghanistan | \n",
" 0.010024 | \n",
" 0.256461 | \n",
" 0.023600 | \n",
"
\n",
" \n",
" 1 | \n",
" Albania | \n",
" 0.075757 | \n",
" 0.772472 | \n",
" 0.002067 | \n",
"
\n",
" \n",
" 2 | \n",
" Algeria | \n",
" 0.097030 | \n",
" 0.786517 | \n",
" 0.028789 | \n",
"
\n",
" \n",
" 3 | \n",
" Andorra | \n",
" 0.347586 | \n",
" 1.000000 | \n",
" 0.000013 | \n",
"
\n",
" \n",
" 4 | \n",
" Angola | \n",
" 0.053040 | \n",
" 0.351124 | \n",
" 0.018146 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 182 | \n",
" Vietnam | \n",
" 0.037981 | \n",
" 0.786517 | \n",
" 0.067874 | \n",
"
\n",
" \n",
" 183 | \n",
" West Bank and Gaza | \n",
" 0.028123 | \n",
" 0.750000 | \n",
" 0.003354 | \n",
"
\n",
" \n",
" 184 | \n",
" Yemen | \n",
" 0.024857 | \n",
" 0.536517 | \n",
" 0.019462 | \n",
"
\n",
" \n",
" 185 | \n",
" Zambia | \n",
" 0.025968 | \n",
" 0.293820 | \n",
" 0.011743 | \n",
"
\n",
" \n",
" 186 | \n",
" Zimbabwe | \n",
" 0.009087 | \n",
" 0.323315 | \n",
" 0.011301 | \n",
"
\n",
" \n",
"
\n",
"
187 rows × 4 columns
\n",
"
"
],
"text/plain": [
" country income health population\n",
"0 Afghanistan 0.010024 0.256461 0.023600\n",
"1 Albania 0.075757 0.772472 0.002067\n",
"2 Algeria 0.097030 0.786517 0.028789\n",
"3 Andorra 0.347586 1.000000 0.000013\n",
"4 Angola 0.053040 0.351124 0.018146\n",
".. ... ... ... ...\n",
"182 Vietnam 0.037981 0.786517 0.067874\n",
"183 West Bank and Gaza 0.028123 0.750000 0.003354\n",
"184 Yemen 0.024857 0.536517 0.019462\n",
"185 Zambia 0.025968 0.293820 0.011743\n",
"186 Zimbabwe 0.009087 0.323315 0.011301\n",
"\n",
"[187 rows x 4 columns]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"normalized_health_income = normalize_dataset(health_income, quantitative_columns)\n",
"normalized_health_income"
]
},
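{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hand check of one entry in the table above, take Afghanistan's `income` of 1925 together with the income minimum (599) and maximum (132877) found earlier. This is a worked example of the formula, not an additional computation step:\n",
"\n",
"$ \\frac{1925 - 599}{132877 - 599} = \\frac{1326}{132278} \\approx 0.010024 $\n",
"\n",
"This agrees with the normalized `income` value in row 0."
]
},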
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The new minimum and maximum values"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-21T19:29:49.445088Z",
"start_time": "2020-09-21T19:29:49.435633Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" income | \n",
" health | \n",
" population | \n",
"
\n",
" \n",
" \n",
" \n",
" 32 | \n",
" Central African Republic | \n",
" 0.000000 | \n",
" 0.148876 | \n",
" 0.003523 | \n",
"
\n",
" \n",
" 93 | \n",
" Lesotho | \n",
" 0.015112 | \n",
" 0.000000 | \n",
" 0.001513 | \n",
"
\n",
" \n",
" 105 | \n",
" Marshall Islands | \n",
" 0.023148 | \n",
" 0.466292 | \n",
" 0.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country income health population\n",
"32 Central African Republic 0.000000 0.148876 0.003523\n",
"93 Lesotho 0.015112 0.000000 0.001513\n",
"105 Marshall Islands 0.023148 0.466292 0.000000"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"normalized_health_income.loc[normalized_health_income[quantitative_columns].idxmin(), :]"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-21T19:29:49.460769Z",
"start_time": "2020-09-21T19:29:49.446582Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" income | \n",
" health | \n",
" population | \n",
"
\n",
" \n",
" \n",
" \n",
" 134 | \n",
" Qatar | \n",
" 1.000000 | \n",
" 0.941011 | \n",
" 0.001586 | \n",
"
\n",
" \n",
" 3 | \n",
" Andorra | \n",
" 0.347586 | \n",
" 1.000000 | \n",
" 0.000013 | \n",
"
\n",
" \n",
" 35 | \n",
" China | \n",
" 0.096275 | \n",
" 0.797753 | \n",
" 1.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country income health population\n",
"134 Qatar 1.000000 0.941011 0.001586\n",
"3 Andorra 0.347586 1.000000 0.000013\n",
"35 China 0.096275 0.797753 1.000000"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"normalized_health_income.loc[normalized_health_income[quantitative_columns].idxmax(), :]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plotting the normalized data, we got the same results, but with the `income`, `health`, and `population` scales all normalized to the \\[0, 1\\] range."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Maximum values"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"ExecuteTime": {
"end_time": "2020-09-21T19:29:49.491216Z",
"start_time": "2020-09-21T19:29:49.463107Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
""
],
"text/plain": [
"alt.Chart(...)"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"alt.Chart(normalized_health_income).mark_point().encode(\n",
" alt.X('income:Q',),\n",
" alt.Y('health:Q'),\n",
" alt.Size('population:Q'),\n",
" alt.Tooltip('country:N')\n",
").properties(height=600, width=800)"
]
}
],
"metadata": {
"hide_input": false,
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}