{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Non-linear feature engineering for Linear Regression\n", "\n", "In this notebook, we show that even if linear models are not natively adapted\n", "to express a `target` that is not a linear function of the `data`, it is still\n", "possible to make linear models more expressive by engineering additional\n", "features.\n", "\n", "A machine learning pipeline that combines a non-linear feature engineering\n", "step followed by a linear regression step can therefore be considered a\n", "non-linear regression model as a whole.\n", "\n", "On this occasion, we are not loading a dataset but creating our own custom\n", "data consisting of a single feature. The target is built as a cubic polynomial\n", "of that feature. To make things a bit more challenging, we add some random\n", "fluctuations to the target." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "rng = np.random.RandomState(0)\n", "\n", "n_sample = 100\n", "data_max, data_min = 1.4, -1.4\n", "len_data = data_max - data_min\n", "# sort the data to make plotting easier later\n", "data = np.sort(rng.rand(n_sample) * len_data - len_data / 2)\n", "noise = rng.randn(n_sample) * 0.3\n", "target = data**3 - 0.5 * data**2 + noise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Tip
\n", "`np.random.RandomState` lets you create a seeded random number generator that\n", "can later be used to get deterministic results.
\n", "Warning
\n", "In scikit-learn, by convention `data` (also called `X` in the scikit-learn\n", "documentation) should be a 2D matrix of shape `(n_samples, n_features)`.\n", "If `data` is a 1D vector, you need to reshape it into a matrix with a\n", "single column if the vector represents a feature, or with a single row if the\n", "vector represents a sample.
\n", "
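The two reshaping cases described in the warning can be sketched as follows. This is a minimal illustration on a small made-up vector; the names `feature`, `as_column` and `as_row` are hypothetical and are not part of the notebook.

```python
import numpy as np

# Hypothetical 1D vector holding 5 values (not the notebook's `data`).
feature = np.arange(5)

# Case 1: the vector is one feature observed on 5 samples.
# Reshape to a single column: shape (n_samples, n_features) = (5, 1).
as_column = feature.reshape(-1, 1)

# Case 2: the vector is one sample described by 5 features.
# Reshape to a single row: shape (n_samples, n_features) = (1, 5).
as_row = feature.reshape(1, -1)

print(feature.shape, as_column.shape, as_row.shape)
```

Passing `-1` asks NumPy to infer that dimension from the length of the vector, so the same call works regardless of the number of samples.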