{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "> This is one of the 100 recipes of the [IPython Cookbook](http://ipython-books.github.io/), the definitive guide to high-performance scientific computing and data science in Python.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 8.1. Getting started with scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will generate one-dimensional data with a simple model (including some noise), and we will try to fit a function to this data. With this function, we can predict values on new data points. This is a curve-fitting regression problem." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. First, let's make all the necessary imports." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import scipy.stats as st\n", "import sklearn.linear_model as lm\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. We now define the deterministic function underlying our generative model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "f = lambda x: np.exp(3 * x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. We generate the values along the curve on $[0, 2]$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x_tr = np.linspace(0., 2, 200)\n", "y_tr = f(x_tr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. Now, let's generate our data points within $[0, 1]$. We use the function $f$ and we add some Gaussian noise. In order to be able to demonstrate some effects, we use one specific set of data points generated in this fashion." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.array([0, .1, .2, .5, .8, .9, 1])\n", "# y = f(x) + np.random.randn(len(x))\n", "y = np.array([0.59837698, 2.90450025, 4.73684354, 3.87158063, 11.77734608, 15.51112358, 20.08663964])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. Let's plot our data points on $[0, 1]$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(6,3));\n", "plt.plot(x_tr[:100], y_tr[:100], '--k');\n", "plt.plot(x, y, 'ok', ms=10);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6. Now, we use scikit-learn to fit a linear model to the data. There are three steps. First, we create the model (an instance of the `LinearRegression` class). Then we fit the model to our data. Finally, we predict values from our trained model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We create the model.\n", "lr = lm.LinearRegression()\n", "# We train the model on our training dataset.\n", "lr.fit(x[:, np.newaxis], y);\n", "# Now, we predict points with our trained model.\n", "y_lr = lr.predict(x_tr[:, np.newaxis])" ] }, { "cell_type": "markdown", "metadata": { "style": "tip" }, "source": [ "We need to convert `x` and `x_tr` to column vectors, as it is a general convention in scikit-learn that observations are rows, while features are columns. Here, we have 7 observations with 1 feature." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7. We now plot the result of the trained linear model. We obtain a regression line, in green here." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(6,3));\n", "plt.plot(x_tr, y_tr, '--k');\n", "plt.plot(x_tr, y_lr, 'g');\n", "plt.plot(x, y, 'ok', ms=10);\n", "plt.xlim(0, 1);\n", "plt.ylim(y.min()-1, y.max()+1);\n", "plt.title(\"Linear regression\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "8. The linear fit is not well adapted here, since the data points are generated according to a non-linear model (an exponential curve). Therefore, we are now going to fit a non-linear model. More precisely, we will fit a polynomial function to our data points. We can still use linear regression for that, by pre-computing the exponents of our data points. This is done by generating a Vandermonde matrix, using the `np.vander` function. We will explain this trick in more detail in *How it works...*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lrp = lm.LinearRegression()\n", "plt.figure(figsize=(6,3));\n", "plt.plot(x_tr, y_tr, '--k');\n", "\n", "for deg, s in zip([2, 5], ['-', '.']):\n", " lrp.fit(np.vander(x, deg + 1), y);\n", " y_lrp = lrp.predict(np.vander(x_tr, deg + 1))\n", " plt.plot(x_tr, y_lrp, s, label='degree ' + str(deg));\n", " plt.legend(loc=2);\n", " plt.xlim(0, 1.4);\n", " plt.ylim(-10, 40);\n", " # Print the model's coefficients.\n", " print(' '.join(['%.2f' % c for c in lrp.coef_]))\n", "plt.plot(x, y, 'ok', ms=10);\n", "plt.title(\"Linear regression\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have fitted two polynomial models of degree 2 and 5. The degree 2 polynomial fits the data points less precisely than the degree 5 polynomial. However, it seems more robust: the degree 5 polynomial seems really bad at predicting values outside the data points (look for example at the portion $x \\geq 1$). This is what we call **overfitting**: by using a model too complex, we obtain a better fit on the trained dataset, but a less robust model outside this set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "9. We will now use a different learning model, called **ridge regression**. It works like linear regression, except that it prevents the polynomial's coefficients to explode (which was what happened in the overfitting example above). By adding a **regularization term** in the **loss function**, ridge regression imposes some structure on the underlying model. We will see more details in the next section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ridge regression model has a meta-parameter which represents the weight of the regularization term. We could try different values with trials and errors, using the `Ridge` class. However, scikit-learn includes another model called `RidgeCV` which includes a parameter search with cross-validation. In practice, it means that you don't have to tweak this parameter by hand: scikit-learn does it for you. Since the models of scikit-learn always follow the `fit`-`predict` API, all we have to do is replace `lm.LinearRegression` by `lm.RidgeCV` in the code above. We will give more details in the next section." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ridge = lm.RidgeCV()\n", "plt.figure(figsize=(6,3));\n", "plt.plot(x_tr, y_tr, '--k');\n", "\n", "for deg, s in zip([2, 5], ['-', '.']):\n", " ridge.fit(np.vander(x, deg + 1), y);\n", " y_ridge = ridge.predict(np.vander(x_tr, deg + 1))\n", " plt.plot(x_tr, y_ridge, s, label='degree ' + str(deg));\n", " plt.legend(loc=2);\n", " plt.xlim(0, 1.5);\n", " plt.ylim(-5, 80);\n", " # Print the model's coefficients.\n", " print(' '.join(['%.2f' % c for c in ridge.coef_]))\n", "\n", "plt.plot(x, y, 'ok', ms=10);\n", "plt.title(\"Ridge regression\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time, the degree 5 polynomial seems better than the simpler degree 2 polynomial (which now causes **underfitting**). The ridge regression reduces the overfitting issue here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).\n", "\n", "> [IPython Cookbook](http://ipython-books.github.io/), by [Cyrille Rossant](http://cyrille.rossant.net), Packt Publishing, 2014 (500 pages)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 1 }