{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} Information\n", "__Section__: Attributes selection \n", "__Goal__: Understand which attributes can be useful, useless or even prejudicial for prediction. \n", "__Time needed__: 20 min \n", "__Prerequisites__: AIS data, basics about machine learning\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Attributes selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this section, we will see how the selection of the predictive attributes can lead to very different results in the prediction.\n", "\n", "Again, you are asked to build a model to predict the width of a ship. You can use any numerical attribute in the dataset. The static dataset is used here." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "%run 1-functions.ipynb # this line runs the other functions we will use later on the page\n", "\n", "import pandas as pd\n", "\n", "static_data = pd.read_csv('./static_data.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The attributes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's have a look at the list of the numerical attributes we can use in the dataset." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{toggle} Advanced level\n", "We use the function [select_dtypes()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html) and the type ``np.number`` from the [numpy](https://docs.scipy.org/doc/numpy/reference/) library, which allows us to select all columns that are numerical.\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/plain": [ "Index(['TripID', 'MMSI', 'MeanSOG', 'VesselType', 'Length', 'Width', 'Draft',\n", " 'Cargo', 'DepLat', 'DepLon', 'ArrLat', 'ArrLon'],\n", " dtype='object')" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "static_data.select_dtypes([np.number]).columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we will build a model to predict the ``Width`` attribute, from a combination of these attributes. Change the value of the variable ``x`` as much as possible, adding more or less of the numerical attributes, and compare the results in the prediction.\n", "\n", "Try to find the combination of attributes that gives the best performance, and the one that gives the worst performance." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [ "hide-input", "hide-output" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MAE: 4.929193421052632\n" ] } ], "source": [ "from sklearn.metrics import mean_absolute_error\n", "\n", "x = ['Length', 'TripID', 'MMSI']\n", "y = ['Width']\n", "\n", "predictions, ytest = knn_regression(static_data, x, y)\n", "print('MAE: ' + str(mean_absolute_error(predictions, ytest)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Beginner version:__\n", "Use the following widget and add or remove attributes to the predictive model. 
The first attribute is required; the others are optional. Try to find out which model gives the best prediction, and which one gives the worst." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "8938a035fbb949748119f6d7ca0778ec", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(Dropdown(description='x = [Att 1:', options=('TripID', 'MMSI', 'MeanSOG', 'VesselType', …" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# For beginner version: cell to hide\n", "\n", "import numpy as np\n", "from sklearn.metrics import mean_absolute_error\n", "import ipywidgets as widgets\n", "from ipywidgets import interact\n", "\n", "# The empty string lets the optional dropdowns stay unselected\n", "attributes = []\n", "attributes.append('')\n", "for att in static_data.select_dtypes([np.number]).columns:\n", "    attributes.append(att)\n", "\n", "def mae_pred(att1, att2, att3, att4, att5):\n", "    \n", "    # Keep only the attributes actually selected in the dropdowns\n", "    x = []\n", "    for att in [att1, att2, att3, att4, att5]:\n", "        if att != '':\n", "            x.append(att)\n", "    y = ['Width']\n", "\n", "    predictions, ytest = knn_regression(static_data, x, y)\n", "    print('MAE: ' + str(mean_absolute_error(predictions, ytest)))\n", "\n", "interact(mae_pred,\n", " att1 = widgets.Dropdown(options = static_data.select_dtypes([np.number]).columns,\n", " value = static_data.select_dtypes([np.number]).columns[0],\n", " description = 'x = [Att 1:',\n", " disabled = False,),\n", " att2 = widgets.Dropdown(options = attributes,\n", " value = attributes[0],\n", " description = '+ Att 2 (opt):',\n", " disabled = False,),\n", " att3 = widgets.Dropdown(options = attributes,\n", " value = attributes[0],\n", " description = '+ Att 3 (opt):',\n", " disabled = False,),\n", " att4 = widgets.Dropdown(options = attributes,\n", " value = attributes[0],\n", " 
description = '+ Att 4 (opt):',\n", " disabled = False,),\n", " att5 = widgets.Dropdown(options = attributes,\n", " value = attributes[0],\n", " description = '+ Att 5 (opt)]:',\n", " disabled = False,))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compare the performances" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can plot the predictions against the true values for the chosen ``x``, to better gauge the performance of the model. The closer the points lie to the black diagonal, the better the predictions:" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": { "tags": [ "hide-input", "hide-output" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MAE: 4.929193421052632\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "from sklearn.metrics import mean_absolute_error\n", "\n", "x = ['Length', 'TripID', 'MMSI']\n", "y = ['Width']\n", "\n", "predictions, ytest = knn_regression(static_data, x, y)\n", "print('MAE: ' + str(mean_absolute_error(predictions, ytest)))\n", "\n", "plt.figure(figsize = (12, 8))\n", "pred = []\n", "for element in predictions:\n", "    pred.append(element[0])\n", "plt.plot(pred, ytest, 'x')\n", " \n", "# Reference line: a perfect prediction lies on the diagonal\n", "x = np.linspace(0, 50, 50)\n", "plt.plot(x, x, color = 'black')\n", " \n", "plt.xlabel('Prediction')\n", "plt.ylabel('True label')\n", "plt.show()" ] },
{ "cell_type": "code", "execution_count": 15, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "6293fc0554d84887ba7bdc2701c9c9bd", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(Dropdown(description='x = [Att 1:', options=('TripID', 'MMSI', 'MeanSOG', 'VesselType', …" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# For beginner version: cell to hide\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "from sklearn.metrics import mean_absolute_error\n", "import ipywidgets as widgets\n", "from ipywidgets import interact\n", "\n", "attributes = []\n", "attributes.append('')\n", "for att in static_data.select_dtypes([np.number]).columns:\n", "    attributes.append(att)\n", "\n", "def plot_pred(att1, att2, att3, att4, att5):\n", "    \n", "    plt.figure(figsize = (12, 8))\n", "    \n", "    # Keep only the attributes actually selected in the dropdowns\n", "    x = []\n", "    title = 'Prediction of Width from '\n", "    for att in [att1, att2, att3, att4, att5]:\n", "        if att != '':\n", "            x.append(att)\n", "            title = title + str(att) + ' '\n", "    \n", "    y = ['Width']\n", "\n", "    predictions, ytest = knn_regression(static_data, x, y)\n", "    print('MAE: ' + str(mean_absolute_error(predictions, ytest)))\n", "    \n", "    pred = []\n", "    for element in predictions:\n", "        pred.append(element[0])\n", "    plt.plot(pred, ytest, 'x')\n", "    \n", "    # Reference line: a perfect prediction lies on the diagonal\n", "    x = np.linspace(0, 50, 50)\n", "    plt.plot(x, x, color = 'black')\n", "    \n", "    plt.xlabel('Prediction')\n", "    plt.ylabel('True label')\n", "    plt.title(title)\n", "\n", "interact(plot_pred,\n", " att1 = widgets.Dropdown(options = static_data.select_dtypes([np.number]).columns,\n", " value = static_data.select_dtypes([np.number]).columns[0],\n", " description = 'x = [Att 1:',\n", " disabled = False,),\n", " att2 = widgets.Dropdown(options = attributes,\n", " value = attributes[0],\n", " description = '+ Att 2 (opt):',\n", " disabled = False,),\n", " att3 = widgets.Dropdown(options = attributes,\n", " value = attributes[0],\n", " description = '+ Att 3 (opt):',\n", " disabled = False,),\n", " att4 = widgets.Dropdown(options = attributes,\n", " value = attributes[0],\n", " description = '+ Att 4 (opt):',\n", " disabled = False,),\n", " att5 = widgets.Dropdown(options = attributes,\n", " value = attributes[0],\n", " description = '+ Att 5 (opt)]:',\n", " disabled = False,))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Compare the following predictive models:\n", "+ only ``Length``\n", "+ ``Length`` and ``MMSI``\n", "+ only ``MMSI``\n", "+ only ``TripID``\n", "\n", "As you can imagine, the attribute ``TripID``, which was artificially created for organisational purposes only, is meaningless for predicting the width of a ship. It is not surprising that it gives the worst performance.\n", "The attribute ``Length`` is the best attribute to predict the width, which makes sense: by construction, the length and the width of a ship have to be more or less proportional. 
However, we see that adding ``MMSI`` to ``Length`` pollutes the model and degrades its performance.\n", "\n", "In fact, when we then remove the ``Length`` attribute, the prediction doesn't change: in the combined model, ``MMSI`` completely dominates ``Length``. This is mainly due to the model used: KNN relies on distances between instances, so it is sensitive to the scale of the attributes, and ``MMSI`` values are very large numbers compared to ``Length`` values. The choice of the model, however, is outside the scope of this course." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generalization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, it is important to carefully consider which attributes are chosen for the prediction: more attributes mean more information, but not necessarily a better result, as the added information can be useless or even prejudicial to the prediction.\n", "\n", "Attributes such as ``TripID``, ``MMSI`` or ``VesselType`` represent codes: they are not continuous variables and, as a consequence, should not be used as numerical attributes. If we used them as numerical attributes, we would give a meaning to the magnitude of their values (a lower value would mean something different from a higher value), when in the real world they are just codes: two close values of these attributes don't necessarily have close meanings. This is unlike continuous variables, where two instances with close values for length are likely to have close values for width." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Quiz" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import IFrame\n", "IFrame(\"https://h5p.org/h5p/embed/761741\", \"694\", \"600\")" ] } ], "metadata": { "celltoolbar": "Edit Metadata", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }