{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Non-linear feature engineering for Linear Regression\n", "\n", "In this notebook, we show that even if linear models are not natively adapted\n", "to express a `target` that is not a linear function of the `data`, it is still\n", "possible to make linear models more expressive by engineering additional\n", "features.\n", "\n", "A machine learning pipeline that combines a non-linear feature engineering\n", "step followed by a linear regression step can therefore be considered a\n", "non-linear regression model as a whole.\n", "\n", "In this occasion we are not loading a dataset, but creating our own custom\n", "data consisting of a single feature. The target is built as a cubic polynomial\n", "on said feature. To make things a bit more challenging, we add some random\n", "fluctuations to the target." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "rng = np.random.RandomState(0)\n", "\n", "n_sample = 100\n", "data_max, data_min = 1.4, -1.4\n", "len_data = data_max - data_min\n", "# sort the data to make plotting easier later\n", "data = np.sort(rng.rand(n_sample) * len_data - len_data / 2)\n", "noise = rng.randn(n_sample) * 0.3\n", "target = data**3 - 0.5 * data**2 + noise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Tip

\n", "

np.random.RandomState allows to create a random number generator which can\n", "be later used to get deterministic results.

\n", "
\n", "\n", "To ease the plotting, we create a pandas dataframe containing the data and\n", "target:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "full_data = pd.DataFrame({\"input_feature\": data, \"target\": target})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "\n", "_ = sns.scatterplot(\n", " data=full_data, x=\"input_feature\", y=\"target\", color=\"black\", alpha=0.5\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Warning

\n", "

In scikit-learn, by convention data (also called X in the scikit-learn\n", "documentation) should be a 2D matrix of shape (n_samples, n_features).\n", "If data is a 1D vector, you need to reshape it into a matrix with a\n", "single column if the vector represents a feature or a single row if the\n", "vector represents a sample.

\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# X should be 2D for sklearn: (n_samples, n_features)\n", "data = data.reshape((-1, 1))\n", "data.shape" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "To avoid writing the same code in multiple places we define a helper function\n", "that fits, scores and plots the different regression models." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def fit_score_plot_regression(model, title=None):\n", " model.fit(data, target)\n", " target_predicted = model.predict(data)\n", " mse = mean_squared_error(target, target_predicted)\n", " ax = sns.scatterplot(\n", " data=full_data, x=\"input_feature\", y=\"target\", color=\"black\", alpha=0.5\n", " )\n", " ax.plot(data, target_predicted)\n", " if title is not None:\n", " _ = ax.set_title(title + f\" (MSE = {mse:.2f})\")\n", " else:\n", " _ = ax.set_title(f\"Mean squared error = {mse:.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now observe the limitations of fitting a linear regression model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "from sklearn.metrics import mean_squared_error\n", "\n", "linear_regression = LinearRegression()\n", "linear_regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fit_score_plot_regression(linear_regression, title=\"Simple linear regression\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here the coefficient and intercept learnt by `LinearRegression` define the\n", "best \"straight line\" that fits the data. We can inspect the coefficients using\n", "the attributes of the model learnt as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\n", " f\"weight: {linear_regression.coef_[0]:.2f}, \"\n", " f\"intercept: {linear_regression.intercept_:.2f}\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the learnt model cannot handle the non-linear relationship between\n", "`data` and `target` because linear models assume a linear relationship.\n", "Indeed, there are 3 possibilities to solve this issue:\n", "\n", "1. choose a model that can natively deal with non-linearity,\n", "2. engineer a richer set of features by including expert knowledge which can\n", " be directly used by a simple linear model, or\n", "3. use a \"kernel\" to have a locally-based decision function instead of a\n", " global linear decision function.\n", "\n", "Let's illustrate quickly the first point by using a decision tree regressor\n", "which can natively handle non-linearity." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeRegressor\n", "\n", "tree = DecisionTreeRegressor(max_depth=3).fit(data, target)\n", "tree" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fit_score_plot_regression(tree, title=\"Decision tree regression\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of having a model which can natively deal with non-linearity, we could\n", "also modify our data: we could create new features, derived from the original\n", "features, using some expert knowledge. 
In this example, we thus know that there is a cubic and a squared relationship\n", "between `data` and `target`.\n", "\n", "We can create two new features (`data ** 2` and `data ** 3`) using this\n", "information as follows. This kind of transformation is called a polynomial\n", "feature expansion:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# stack the original feature with its square and its cube, column-wise\n", "data_expanded = np.concatenate([data, data**2, data**3], axis=1)\n", "data_expanded.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of manually creating such polynomial features, one could directly use\n", "[sklearn.preprocessing.PolynomialFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import PolynomialFeatures\n", "\n", "polynomial_expansion = PolynomialFeatures(degree=3, include_bias=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous cell we had to set `include_bias=False` as otherwise we would\n", "create a constant feature perfectly correlated to the `intercept_` introduced\n", "by the `LinearRegression`. We can verify that this procedure is equivalent to\n", "creating the features by hand up to numerical error by computing the maximum\n", "of the absolute values of the differences between the features generated by\n", "both methods and checking that it is close to zero:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.abs(polynomial_expansion.fit_transform(data) - data_expanded).max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To demonstrate the use of the `PolynomialFeatures` class, we use a\n", "scikit-learn pipeline which first transforms the features and then fits the\n", "regression model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import make_pipeline\n", "\n", "polynomial_regression = make_pipeline(\n", "    PolynomialFeatures(degree=3, include_bias=False),\n", "    LinearRegression(),\n", ")\n", "polynomial_regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fit_score_plot_regression(polynomial_regression, title=\"Polynomial regression\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that even with a linear model, we can overcome the linearity\n", "limitation of the model by including non-linear components in the design of\n", "additional features. Here, we created new features by knowing the way the\n", "target was generated.\n", "\n", "The last possibility to make a linear model more expressive is to use a\n", "\"kernel\". Instead of learning one weight per feature as we previously did, a\n", "weight is assigned to each sample. However, not all samples are used: some\n", "redundant data points of the training set are assigned a weight of 0 so\n", "that they do not influence the model's prediction function. This is the\n", "main intuition of the support vector machine algorithm.\n", "\n", "The mathematical definition of \"kernels\" and \"support vector machines\" is\n", "beyond the scope of this course.\n", "
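To build some intuition nonetheless, here is a minimal sketch of what a\n", "kernelized prediction function computes (all names are hypothetical; this is\n", "not scikit-learn's actual implementation):\n", "\n", "```python\n", "import numpy as np\n", "\n", "\n", "def poly_kernel(a, b, degree=3, gamma=1.0, coef0=1.0):\n", "    # a polynomial similarity between two samples\n", "    return (gamma * np.dot(a, b) + coef0) ** degree\n", "\n", "\n", "def kernel_predict(x_new, support_samples, dual_weights, intercept):\n", "    # weighted sum of similarities to the training samples kept as supports;\n", "    # samples with a zero dual weight drop out of the sum entirely\n", "    return intercept + sum(\n", "        w * poly_kernel(x_new, x_s)\n", "        for w, x_s in zip(dual_weights, support_samples)\n", "    )\n", "```\n", "\n", "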
We encourage interested readers with\n", "mathematical training to have a look at the scikit-learn [documentation on\n", "SVMs](https://scikit-learn.org/stable/modules/svm.html) for more details.\n", "\n", "For the rest of us, let us just develop some intuitions on the relative\n", "expressive power of support vector machines with linear and non-linear kernels\n", "by fitting them on the same dataset.\n", "\n", "First, consider a support vector machine with a linear kernel:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.svm import SVR\n", "\n", "svr = SVR(kernel=\"linear\")\n", "svr" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fit_score_plot_regression(svr, title=\"Linear support vector machine\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The predictions of our SVR with a linear kernel are all aligned on a straight\n", "line. `SVR(kernel=\"linear\")` is indeed yet another example of a linear model.\n", "\n", "The estimator can also be configured to use a non-linear kernel. Then, it can\n", "learn a prediction function that computes non-linear relations between the\n", "samples for which we want to make a prediction and selected samples from the\n", "training set.\n", "\n", "The result is another kind of non-linear regression model with similar\n", "expressivity to our previous polynomial regression pipeline:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "svr = SVR(kernel=\"poly\", degree=3)\n", "svr" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fit_score_plot_regression(svr, title=\"Polynomial support vector machine\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Kernel methods such as SVR are very efficient for small to medium-sized\n", "datasets.\n", "\n", "For larger datasets with `n_samples >> 10_000`, it is often computationally\n", "more efficient to perform explicit feature expansion using\n", "`PolynomialFeatures` or other non-linear transformers from scikit-learn, such\n", "as\n", "[KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html)\n", "or\n", "[SplineTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.SplineTransformer.html).\n", "\n", "Here again we refer the interested reader to the documentation to get a proper\n", "definition of those methods.\n", "
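In a nutshell,\n", "`KBinsDiscretizer` replaces the feature by indicator variables that encode\n", "bin membership, while `SplineTransformer` encodes it with smooth,\n", "piecewise-polynomial basis functions.\n", "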
The following just gives an intuitive overview of\n", "the predictions we would get using them on our toy dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import KBinsDiscretizer\n", "\n", "binned_regression = make_pipeline(\n", "    KBinsDiscretizer(n_bins=8),\n", "    LinearRegression(),\n", ")\n", "binned_regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fit_score_plot_regression(binned_regression, title=\"Binned regression\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import SplineTransformer\n", "\n", "spline_regression = make_pipeline(\n", "    SplineTransformer(degree=3, include_bias=False),\n", "    LinearRegression(),\n", ")\n", "spline_regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fit_score_plot_regression(spline_regression, title=\"Spline regression\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Nystroem` is a nice alternative to `PolynomialFeatures` that makes it\n", "possible to keep the memory usage of the transformed dataset under control.\n", "However, interpreting the meaning of the intermediate features can be\n", "challenging." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.kernel_approximation import Nystroem\n", "\n", "nystroem_regression = make_pipeline(\n", "    Nystroem(kernel=\"poly\", degree=3, n_components=5, random_state=0),\n", "    LinearRegression(),\n", ")\n", "nystroem_regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fit_score_plot_regression(\n", "    nystroem_regression, title=\"Polynomial Nystroem regression\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook Recap\n", "\n", "In this notebook, we explored several ways to expand a single numerical feature\n", "into several non-linearly derived new features. This makes our machine\n", "learning pipeline more expressive and less likely to underfit, even if the\n", "last stage of the pipeline is a simple linear regression model.\n", "\n", "For the sake of simplicity, we introduced those transformers on a toy\n", "regression problem with a single input feature. However, non-linear feature\n", "transformers such as `Nystroem` can further improve the expressiveness of\n", "machine learning pipelines to model non-linear interactions between features.\n", "We will explore this possibility in the next exercise." ] } ], "metadata": { "jupytext": { "main_language": "python" }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }