{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# \ud83d\udcc3 Solution for Exercise M4.01\n", "\n", "The aim of this exercise is two-fold:\n", "\n", "* understand the parametrization of a linear model;\n", "* quantify the fitting accuracy of a set of such models.\n", "\n", "We will reuse part of the code of the course to:\n", "\n", "* load data;\n", "* create the function representing a linear model.\n", "\n", "## Prerequisites\n", "\n", "### Data loading" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Note

\n", "

If you want a deeper overview regarding this dataset, you can refer to the\n", "Appendix - Datasets description section at the end of this MOOC.

\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "penguins = pd.read_csv(\"../datasets/penguins_regression.csv\")\n", "feature_name = \"Flipper Length (mm)\"\n", "target_name = \"Body Mass (g)\"\n", "data, target = penguins[[feature_name]], penguins[target_name]" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "### Model definition" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def linear_model_flipper_mass(\n", " flipper_length, weight_flipper_length, intercept_body_mass\n", "):\n", " \"\"\"Linear model of the form y = a * x + b\"\"\"\n", " body_mass = weight_flipper_length * flipper_length + intercept_body_mass\n", " return body_mass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Main exercise\n", "\n", "Define a vector `weights = [...]` and a vector `intercepts = [...]` of the\n", "same length. Each pair of entries `(weights[i], intercepts[i])` tags a\n", "different model. Use these vectors along with the vector\n", "`flipper_length_range` to plot several linear models that could possibly fit\n", "our data. Use the above helper function to visualize both the models and the\n", "real samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "flipper_length_range = np.linspace(data.min(), data.max(), num=300)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "weights = [-40, 45, 90]\n", "intercepts = [15000, -5000, -14000]\n", "\n", "ax = sns.scatterplot(\n", " data=penguins, x=feature_name, y=target_name, color=\"black\", alpha=0.5\n", ")\n", "\n", "label = \"{0:.2f} (g / mm) * flipper length + {1:.2f} (g)\"\n", "for weight, intercept in zip(weights, intercepts):\n", " predicted_body_mass = linear_model_flipper_mass(\n", " flipper_length_range, weight, intercept\n", " )\n", "\n", " ax.plot(\n", " flipper_length_range,\n", " predicted_body_mass,\n", " label=label.format(weight, intercept),\n", " )\n", "_ = ax.legend(loc=\"center left\", bbox_to_anchor=(-0.25, 1.25), ncol=1)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In the previous question, you were asked to create several linear models. The\n", "visualization allowed you to qualitatively assess if a model was better than\n", "another.\n", "\n", "Now, you should come up with a quantitative measure which indicates the\n", "goodness of fit of each linear model and allows you to select the best model.\n", "Define a function `goodness_fit_measure(true_values, predictions)` that takes\n", "as inputs the true target values and the predictions and returns a single\n", "scalar as output." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "def goodness_fit_measure(true_values, predictions):\n", " # we compute the error between the true values and the predictions of our\n", " # model\n", " errors = np.ravel(true_values) - np.ravel(predictions)\n", " # We have several possible strategies to reduce all errors to a single value.\n", " # Computing the mean error (sum divided by the number of element) might seem\n", " # like a good solution. However, we have negative errors that will misleadingly\n", " # reduce the mean error. Therefore, we can either square each\n", " # error or take the absolute value: these metrics are known as mean\n", " # squared error (MSE) and mean absolute error (MAE). Let's use the MAE here\n", " # as an example.\n", " return np.mean(np.abs(errors))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can now copy and paste the code below to show the goodness of fit for each\n", "model.\n", "\n", "```python\n", "for model_idx, (weight, intercept) in enumerate(zip(weights, intercepts)):\n", " target_predicted = linear_model_flipper_mass(data, weight, intercept)\n", " print(f\"Model #{model_idx}:\")\n", " print(f\"{weight:.2f} (g / mm) * flipper length + {intercept:.2f} (g)\")\n", " print(f\"Error: {goodness_fit_measure(target, target_predicted):.3f}\\n\")\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "for model_idx, (weight, intercept) in enumerate(zip(weights, intercepts)):\n", " target_predicted = linear_model_flipper_mass(data, weight, intercept)\n", " print(f\"Model #{model_idx}:\")\n", " print(f\"{weight:.2f} (g / mm) * flipper length + {intercept:.2f} (g)\")\n", " print(f\"Error: {goodness_fit_measure(target, target_predicted):.3f}\\n\")" ] } ], "metadata": { "jupytext": { "main_language": "python" }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }