{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# \ud83d\udcc3 Solution for Exercise M4.02\n", "\n", "In the previous notebook, we showed that we can add new features based on the\n", "original feature `x` to make the model more expressive, for instance `x ** 2` or\n", "`x ** 3`. In that case we only used a single feature in `data`.\n", "\n", "The aim of this notebook is to train a linear regression algorithm on a\n", "dataset with more than a single feature. In such a \"multi-dimensional\" feature\n", "space we can derive new features of the form `x1 * x2`, `x2 * x3`, etc.\n", "Products of features are usually called \"non-linear\" or \"multiplicative\"\n", "interactions between features.\n", "\n", "Feature engineering can be an important step of a model pipeline as long as\n", "the new features are expected to be predictive. For instance, think of a\n", "classification model to decide if a patient is at risk of developing heart\n", "disease. This risk would depend on the patient's Body Mass Index, which is defined\n", "as `weight / height ** 2`.\n", "\n", "We load the penguins dataset. We first use a set of 3 numerical\n", "features to predict the target, i.e. the body mass of the penguin." ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"<div class=\"admonition note alert alert-info\">\n", "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n", "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n", "Appendix - Datasets description section at the end of this MOOC.</p>\n", "</div>" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "penguins = pd.read_csv(\"../datasets/penguins.csv\")\n", "\n", "columns = [\"Flipper Length (mm)\", \"Culmen Length (mm)\", \"Culmen Depth (mm)\"]\n", "target_name = \"Body Mass (g)\"\n", "\n", "# Remove rows with missing values for the columns of interest\n", "penguins_non_missing = penguins[columns + [target_name]].dropna()\n", "\n", "data = penguins_non_missing[columns]\n", "target = penguins_non_missing[target_name]\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now it is your turn to train a linear regression model on this dataset. First,\n", "create a linear regression model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "from sklearn.linear_model import LinearRegression\n", "\n", "linear_regression = LinearRegression()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Execute a cross-validation with 10 folds and use the mean absolute error (MAE)\n", "as the metric." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "from sklearn.model_selection import cross_validate\n", "\n", "cv_results = cross_validate(\n", " linear_regression,\n", " data,\n", " target,\n", " cv=10,\n", " scoring=\"neg_mean_absolute_error\",\n", " n_jobs=2,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the mean and std of the MAE in grams (g). Remember you have to revert\n", "the sign introduced when metrics start with `neg_`, such as in\n", "`\"neg_mean_absolute_error\"`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "print(\n", " \"Mean absolute error on testing set with original features: \"\n", " f\"{-cv_results['test_score'].mean():.3f} \u00b1 \"\n", " f\"{cv_results['test_score'].std():.3f} g\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now create a pipeline using `make_pipeline` consisting of a\n", "`PolynomialFeatures` and a linear regression. Set `degree=2` and\n", "`interaction_only=True` in the feature engineering step. Remember not to\n", "include a \"bias\" feature (that is, a constant-valued feature) to avoid\n", "introducing a redundancy with the intercept of the subsequent linear\n", "regression model.\n", "\n", "You may want to use the `.set_output(transform=\"pandas\")` method of the\n", "pipeline to answer the next question." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "from sklearn.preprocessing import PolynomialFeatures\n", "from sklearn.pipeline import make_pipeline\n", "\n", "poly_features = PolynomialFeatures(\n", " degree=2, include_bias=False, interaction_only=True\n", ")\n", "linear_regression_interactions = make_pipeline(\n", " poly_features, linear_regression\n", ").set_output(transform=\"pandas\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Transform the first 5 rows of the dataset and look at the column names. How\n", "many features are generated at the output of the `PolynomialFeatures` step in\n", "the previous pipeline?"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "linear_regression_interactions.fit(data, target)\n", "linear_regression_interactions[0].transform(data[:5])" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "solution" ] }, "source": [ "We observe that 3 new features are generated, corresponding to the products of\n", "the different pairs of the 3 original features. Together with the original\n", "features, the `PolynomialFeatures` step therefore outputs 6 intermediate\n", "features in total. In general, given `p` original features, there are\n", "`p * (p - 1) / 2` such pairwise interactions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check that the values for the new interaction features are correct for a few\n", "of them." ] }, { "cell_type": "markdown", "metadata": { "tags": [ "solution" ] }, "source": [ "Let's now check that the value in the 1st row and the 5th column (3384.7) is\n", "the product of the values at the first and third columns (respectively 181.0\n", "and 18.7) of the same row:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "flipper_length_first_sample = 181.0\n", "culmen_depth_first_sample = 18.7\n", "flipper_length_first_sample * culmen_depth_first_sample" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the same cross-validation strategy as done previously to estimate the mean\n", "and std of the MAE in grams (g) for such a pipeline. Compare with the results\n", "without feature engineering." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "cv_results = cross_validate(\n", " linear_regression_interactions,\n", " data,\n", " target,\n", " cv=10,\n", " scoring=\"neg_mean_absolute_error\",\n", " n_jobs=2,\n", ")\n", "print(\n", " \"Mean absolute error on testing set with interactions: \"\n", " f\"{-cv_results['test_score'].mean():.3f} \u00b1 \"\n", " f\"{cv_results['test_score'].std():.3f} g\"\n", ")" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "solution" ] }, "source": [ "We observe that the MAE is lower and less spread with the enriched features.\n", "In this case the additional \"interaction\" features are indeed predictive.\n", "Later in this module we will see what happens when the enriched features are\n", "non-predictive and how to deal with this case." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Now let's try to build an alternative pipeline with an adjustable number of\n", "intermediate features while keeping a similar predictive power. To do so, try\n", "using the `Nystroem` transformer instead of `PolynomialFeatures`. Set the\n", "`kernel` parameter to `\"poly\"` and `degree` to 2. Adjust the number of\n", "components to be as small as possible while keeping a good cross-validation\n", "performance.\n", "\n", "Hint: Use a `ValidationCurveDisplay` with `param_range = np.array([5, 10, 50,\n", "100])` to find the optimal `n_components`."
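, "\n", "\n", "As an optional sanity check (a minimal sketch, assuming the `data` defined\n", "above), you can verify that `Nystroem` outputs exactly `n_components` new\n", "features, independently of the number of original columns:\n", "\n", "```python\n", "# Optional check: the Nystroem transformer maps the 3 original columns to\n", "# exactly n_components approximate kernel features.\n", "from sklearn.kernel_approximation import Nystroem\n", "\n", "nystroem = Nystroem(kernel=\"poly\", degree=2, n_components=5, random_state=0)\n", "print(nystroem.fit_transform(data).shape)  # (n_samples, 5)\n", "```"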
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "import numpy as np\n", "\n", "from sklearn.kernel_approximation import Nystroem\n", "from sklearn.model_selection import ValidationCurveDisplay\n", "\n", "nystroem_regression = make_pipeline(\n", " Nystroem(kernel=\"poly\", degree=2, random_state=0),\n", " linear_regression,\n", ")\n", "\n", "param_range = np.array([5, 10, 50, 100])\n", "disp = ValidationCurveDisplay.from_estimator(\n", " nystroem_regression,\n", " data,\n", " target,\n", " param_name=\"nystroem__n_components\",\n", " param_range=param_range,\n", " cv=10,\n", " scoring=\"neg_mean_absolute_error\",\n", " negate_score=True,\n", " std_display_style=\"errorbar\",\n", " n_jobs=2,\n", ")\n", "\n", "_ = disp.ax_.set(\n", " xlabel=\"Number of components\",\n", " ylabel=\"Mean absolute error (g)\",\n", " title=\"Validation curve for Nystroem regression\",\n", ")" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "solution" ] }, "source": [ "In the validation curve above we can observe that a small number of components\n", "leads to an underfitting model, whereas a large number of components leads to\n", "an overfitting model. The optimal number of Nystr\u00f6m components is around 10\n", "for this dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do the mean and std of the MAE for the Nystroem pipeline with optimal\n", "`n_components` compare to the previous models?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "nystroem_regression.set_params(nystroem__n_components=10)\n", "cv_results = cross_validate(\n", " nystroem_regression,\n", " data,\n", " target,\n", " cv=10,\n", " scoring=\"neg_mean_absolute_error\",\n", " n_jobs=2,\n", ")\n", "print(\n", " \"Mean absolute error on testing set with Nystroem: \"\n", " f\"{-cv_results['test_score'].mean():.3f} \u00b1 \"\n", " f\"{cv_results['test_score'].std():.3f} g\"\n", ")" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "solution" ] }, "source": [ "In this case we have a model with 10 features instead of 6, yet it has\n", "approximately the same prediction error as the model with interactions.\n", "\n", "Notice that if we had `p = 100` original features (instead of 3), the\n", "`PolynomialFeatures` transformer would have generated `100 * (100 - 1) / 2 =\n", "4950` additional interaction features (so we would have 5050 features in\n", "total). The resulting pipeline would have been much slower to train and\n", "predict and would have had a much larger memory footprint. Furthermore, the\n", "large number of interaction features would probably have resulted in an\n", "overfitting model.\n", "\n", "On the other hand, the `Nystroem` transformer generates a user-adjustable\n", "number of features (`n_components`). Furthermore, the optimal number of\n", "components is usually much smaller than the number of pairwise interactions.\n", "So the `Nystroem` transformer can be more scalable when the number of\n", "original features is too large for `PolynomialFeatures` to be used.\n", "\n", "The main downside of the `Nystroem` transformer is that it is not possible to\n", "easily interpret the meaning of the generated features and therefore the\n", "meaning of the learned coefficients for the downstream linear model." ] } ], "metadata": { "jupytext": { "main_language": "python" }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }