{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# \ud83d\udcdd Exercise M4.02\n", "\n", "In the previous notebook, we showed that we can add new features based on the\n", "original feature `x` to make the model more expressive, for instance `x ** 2` or\n", "`x ** 3`. In that case we only used a single feature in `data`.\n", "\n", "The aim of this notebook is to train a linear regression algorithm on a\n", "dataset with more than a single feature. In such a \"multi-dimensional\" feature\n", "space we can derive new features of the form `x1 * x2`, `x2 * x3`, etc.\n", "Products of features are usually called \"non-linear\" or \"multiplicative\"\n", "interactions between features.\n", "\n", "Feature engineering can be an important step of a model pipeline as long as\n", "the new features are expected to be predictive. For instance, think of a\n", "classification model to decide if a patient has risk of developing a heart\n", "disease. This would depend on the patient's Body Mass Index which is defined\n", "as `weight / height ** 2`.\n", "\n", "We load the dataset penguins dataset. We first use a set of 3 numerical\n", "features to predict the target, i.e. the body mass of the penguin." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Note

\n", "

If you want a deeper overview regarding this dataset, you can refer to the\n", "Appendix - Datasets description section at the end of this MOOC.

\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "penguins = pd.read_csv(\"../datasets/penguins.csv\")\n", "\n", "columns = [\"Flipper Length (mm)\", \"Culmen Length (mm)\", \"Culmen Depth (mm)\"]\n", "target_name = \"Body Mass (g)\"\n", "\n", "# Remove lines with missing values for the columns of interest\n", "penguins_non_missing = penguins[columns + [target_name]].dropna()\n", "\n", "data = penguins_non_missing[columns]\n", "target = penguins_non_missing[target_name]\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now it is your turn to train a linear regression model on this dataset. First,\n", "create a linear regression model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Execute a cross-validation with 10 folds and use the mean absolute error (MAE)\n", "as metric." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the mean and std of the MAE in grams (g). Remember you have to revert\n", "the sign introduced when metrics start with `neg_`, such as in\n", "`\"neg_mean_absolute_error\"`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now create a pipeline using `make_pipeline` consisting of a\n", "`PolynomialFeatures` and a linear regression. Set `degree=2` and\n", "`interaction_only=True` to the feature engineering step. Remember not to\n", "include a \"bias\" feature (that is a constant-valued feature) to avoid\n", "introducing a redundancy with the intercept of the subsequent linear\n", "regression model.\n", "\n", "You may want to use the `.set_output(transform=\"pandas\")` method of the\n", "pipeline to answer the next question." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Transform the first 5 rows of the dataset and look at the column names. How\n", "many features are generated at the output of the `PolynomialFeatures` step in\n", "the previous pipeline?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check that the values for the new interaction features are correct for a few\n", "of them." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the same cross-validation strategy as done previously to estimate the mean\n", "and std of the MAE in grams (g) for such a pipeline. Compare with the results\n", "without feature engineering." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Now let's try to build an alternative pipeline with an adjustable number of\n", "intermediate features while keeping a similar predictive power. To do so, try\n", "using the `Nystroem` transformer instead of `PolynomialFeatures`. Set the\n", "kernel parameter to `\"poly\"` and `degree` to 2. Adjust the number of\n", "components to be as small as possible while keeping a good cross-validation\n", "performance.\n", "\n", "Hint: Use a `ValidationCurveDisplay` with `param_range = np.array([5, 10, 50,\n", "100])` to find the optimal `n_components`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do the mean and std of the MAE for the Nystroem pipeline with optimal\n", "`n_components` compare to the other previous models?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write your code here." ] } ], "metadata": { "jupytext": { "main_language": "python" }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }