{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "> This is one of the 100 recipes of the [IPython Cookbook](http://ipython-books.github.io/), the definitive guide to high-performance scientific computing and data science in Python.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 8.1. Getting started with scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will generate one-dimensional data with a simple model (including some noise), and we will try to fit a function to this data. With this function, we can predict values on new data points. This is a curve-fitting regression problem." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. First, let's make all the necessary imports." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import scipy.stats as st\n", "import sklearn.linear_model as lm\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. We now define the deterministic function underlying our generative model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "f = lambda x: np.exp(3 * x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. We generate the values along the curve on $[0, 2]$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x_tr = np.linspace(0., 2, 200)\n", "y_tr = f(x_tr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. Now, let's generate our data points within $[0, 1]$. We use the function $f$ and we add some Gaussian noise. In order to be able to demonstrate some effects, we use one specific set of data points generated in this fashion." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.array([0, .1, .2, .5, .8, .9, 1])\n", "# y = f(x) + np.random.randn(len(x))\n", "y = np.array([0.59837698, 2.90450025, 4.73684354, 3.87158063, 11.77734608, 15.51112358, 20.08663964])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. Let's plot our data points on $[0, 1]$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(6,3));\n", "plt.plot(x_tr[:100], y_tr[:100], '--k');\n", "plt.plot(x, y, 'ok', ms=10);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6. Now, we use scikit-learn to fit a linear model to the data. There are three steps. First, we create the model (an instance of the `LinearRegression` class). Then we fit the model to our data. Finally, we predict values from our trained model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We create the model.\n", "lr = lm.LinearRegression()\n", "# We train the model on our training dataset.\n", "lr.fit(x[:, np.newaxis], y);\n", "# Now, we predict points with our trained model.\n", "y_lr = lr.predict(x_tr[:, np.newaxis])" ] }, { "cell_type": "markdown", "metadata": { "style": "tip" }, "source": [ "We need to convert `x` and `x_tr` to column vectors, as it is a general convention in scikit-learn that observations are rows, while features are columns. Here, we have 7 observations with 1 feature." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7. We now plot the result of the trained linear model. We obtain a regression line, in green here." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(6,3));\n", "plt.plot(x_tr, y_tr, '--k');\n", "plt.plot(x_tr, y_lr, 'g');\n", "plt.plot(x, y, 'ok', ms=10);\n", "plt.xlim(0, 1);\n", "plt.ylim(y.min()-1, y.max()+1);\n", "plt.title(\"Linear regression\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "8. The linear fit is not well adapted here, since the data points are generated according to a non-linear model (an exponential curve). Therefore, we are now going to fit a non-linear model. More precisely, we will fit a polynomial function to our data points. We can still use linear regression for that, by pre-computing the exponents of our data points. This is done by generating a Vandermonde matrix, using the `np.vander` function. We will explain this trick in more detail in *How it works...*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lrp = lm.LinearRegression()\n", "plt.figure(figsize=(6,3));\n", "plt.plot(x_tr, y_tr, '--k');\n", "\n", "for deg, s in zip([2, 5], ['-', '.']):\n", " lrp.fit(np.vander(x, deg + 1), y);\n", " y_lrp = lrp.predict(np.vander(x_tr, deg + 1))\n", " plt.plot(x_tr, y_lrp, s, label='degree ' + str(deg));\n", " plt.legend(loc=2);\n", " plt.xlim(0, 1.4);\n", " plt.ylim(-10, 40);\n", " # Print the model's coefficients.\n", " print(' '.join(['%.2f' % c for c in lrp.coef_]))\n", "plt.plot(x, y, 'ok', ms=10);\n", "plt.title(\"Linear regression\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have fitted two polynomial models of degree 2 and 5. The degree 2 polynomial fits the data points less precisely than the degree 5 polynomial. However, it seems more robust: the degree 5 polynomial seems really bad at predicting values outside the data points (look for example at the portion $x \\geq 1$). This is what we call **overfitting**: by using a model too complex, we obtain a better fit on the trained dataset, but a less robust model outside this set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "9. We will now use a different learning model, called **ridge regression**. It works like linear regression, except that it prevents the polynomial's coefficients to explode (which was what happened in the overfitting example above). By adding a **regularization term** in the **loss function**, ridge regression imposes some structure on the underlying model. We will see more details in the next section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ridge regression model has a meta-parameter which represents the weight of the regularization term. We could try different values with trials and errors, using the `Ridge` class. However, scikit-learn includes another model called `RidgeCV` which includes a parameter search with cross-validation. In practice, it means that you don't have to tweak this parameter by hand: scikit-learn does it for you. Since the models of scikit-learn always follow the `fit`-`predict` API, all we have to do is replace `lm.LinearRegression` by `lm.RidgeCV` in the code above. We will give more details in the next section." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ridge = lm.RidgeCV()\n", "plt.figure(figsize=(6,3));\n", "plt.plot(x_tr, y_tr, '--k');\n", "\n", "for deg, s in zip([2, 5], ['-', '.']):\n", " ridge.fit(np.vander(x, deg + 1), y);\n", " y_ridge = ridge.predict(np.vander(x_tr, deg + 1))\n", " plt.plot(x_tr, y_ridge, s, label='degree ' + str(deg));\n", " plt.legend(loc=2);\n", " plt.xlim(0, 1.5);\n", " plt.ylim(-5, 80);\n", " # Print the model's coefficients.\n", " print(' '.join(['%.2f' % c for c in ridge.coef_]))\n", "\n", "plt.plot(x, y, 'ok', ms=10);\n", "plt.title(\"Ridge regression\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time, the degree 5 polynomial seems better than the simpler degree 2 polynomial (which now causes **underfitting**). The ridge regression reduces the overfitting issue here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).\n", "\n", "> [IPython Cookbook](http://ipython-books.github.io/), by [Cyrille Rossant](http://cyrille.rossant.net), Packt Publishing, 2014 (500 pages)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 1 }