{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# \ud83d\udcdd Exercise M4.02\n",
    "\n",
    "In the previous notebook, we showed that we can add new features based on the\n",
    "original feature `x` to make the model more expressive, for instance `x ** 2` or\n",
    "`x ** 3`. In that case we only used a single feature in `data`.\n",
    "\n",
    "The aim of this notebook is to train a linear regression algorithm on a\n",
    "dataset with more than a single feature. In such a \"multi-dimensional\" feature\n",
    "space we can derive new features of the form `x1 * x2`, `x2 * x3`, etc.\n",
    "Products of features are usually called \"non-linear\" or \"multiplicative\"\n",
    "interactions between features.\n",
    "\n",
    "Feature engineering can be an important step of a model pipeline as long as\n",
    "the new features are expected to be predictive. For instance, think of a\n",
    "classification model to decide if a patient has risk of developing a heart\n",
    "disease. This would depend on the patient's Body Mass Index which is defined\n",
    "as `weight / height ** 2`.\n",
    "\n",
    "We load the dataset penguins dataset. We first use a set of 3 numerical\n",
    "features to predict the target, i.e. the body mass of the penguin."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "penguins = pd.read_csv(\"../datasets/penguins.csv\")\n",
    "\n",
    "columns = [\"Flipper Length (mm)\", \"Culmen Length (mm)\", \"Culmen Depth (mm)\"]\n",
    "target_name = \"Body Mass (g)\"\n",
    "\n",
    "# Remove lines with missing values for the columns of interest\n",
    "penguins_non_missing = penguins[columns + [target_name]].dropna()\n",
    "\n",
    "data = penguins_non_missing[columns]\n",
    "target = penguins_non_missing[target_name]\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now it is your turn to train a linear regression model on this dataset. First,\n",
    "create a linear regression model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Execute a cross-validation with 10 folds and use the mean absolute error (MAE)\n",
    "as metric."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute the mean and std of the MAE in grams (g). Remember you have to revert\n",
    "the sign introduced when metrics start with `neg_`, such as in\n",
    "`\"neg_mean_absolute_error\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now create a pipeline using `make_pipeline` consisting of a\n",
    "`PolynomialFeatures` and a linear regression. Set `degree=2` and\n",
    "`interaction_only=True` to the feature engineering step. Remember not to\n",
    "include a \"bias\" feature (that is a constant-valued feature) to avoid\n",
    "introducing a redundancy with the intercept of the subsequent linear\n",
    "regression model.\n",
    "\n",
    "You may want to use the `.set_output(transform=\"pandas\")` method of the\n",
    "pipeline to answer the next question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Transform the first 5 rows of the dataset and look at the column names. How\n",
    "many features are generated at the output of the `PolynomialFeatures` step in\n",
    "the previous pipeline?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check that the values for the new interaction features are correct for a few\n",
    "of them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the same cross-validation strategy as done previously to estimate the mean\n",
    "and std of the MAE in grams (g) for such a pipeline. Compare with the results\n",
    "without feature engineering."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Now let's try to build an alternative pipeline with an adjustable number of\n",
    "intermediate features while keeping a similar predictive power. To do so, try\n",
    "using the `Nystroem` transformer instead of `PolynomialFeatures`. Set the\n",
    "kernel parameter to `\"poly\"` and `degree` to 2. Adjust the number of\n",
    "components to be as small as possible while keeping a good cross-validation\n",
    "performance.\n",
    "\n",
    "Hint: Use a `ValidationCurveDisplay` with `param_range = np.array([5, 10, 50,\n",
    "100])` to find the optimal `n_components`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How do the mean and std of the MAE for the Nystroem pipeline with optimal\n",
    "`n_components` compare to the other previous models?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}