{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Feature Engineering" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "\n", "\n", "*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*\n", "\n", "*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "\n", "< [Hyperparameters and Model Validation](05.03-Hyperparameters-and-Model-Validation.ipynb) | [Contents](Index.ipynb) | [In Depth: Naive Bayes Classification](05.05-Naive-Bayes.ipynb) >" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "So far we have assumed numerical data in a tidy, ``[n_samples, n_features]`` format; real-world data rarely comes in such a form. \n", "\n", "**Feature engineering** means taking whatever information you have about your problem and turning it into numbers that you can use to build your ``feature matrix``.\n", "\n", "In this section, we will cover a few common examples of feature engineering tasks: \n", "- features for representing *categorical data*, \n", "- features for representing *text*, \n", "- features for representing *images*, \n", "- *derived features* for increasing model complexity, and \n", "- *imputation* of missing data.\n", "\n", "Often this process is known as *vectorization*, \n", "- as it involves converting arbitrary data into well-behaved vectors."
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Categorical Features\n", "\n", "One common type of non-numerical data is *categorical* data.\n", "\n", "Consider data on housing prices: \n", "- numerical features such as \"price\" and \"rooms\"\n", "- categorical \"neighborhood\" information." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2018-05-22T07:27:30.529292Z", "start_time": "2018-05-22T07:27:30.524872Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "data = [\n", " {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},\n", " {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},\n", " {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},\n", " {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}\n", "]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "You might be tempted to encode this data with a straightforward numerical mapping:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2018-05-22T07:27:44.086718Z", "start_time": "2018-05-22T07:27:44.083165Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "{'Queen Anne': 1, 'Fremont': 2, 'Wallingford': 3};\n", "# It turns out that this is not generally a useful approach" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Models make a fundamental assumption: numerical features reflect algebraic quantities. This mapping would therefore imply:\n", "\n", "- *Queen Anne < Fremont < Wallingford*\n", "- *Wallingford - Queen Anne = Fremont*\n", "\n", "Neither statement makes much sense.\n", "\n", "**One-hot encoding** (dummy coding) effectively creates extra columns indicating the presence or absence of a category with a value of 1 or 0, respectively.\n", "- When your data comes as a list of dictionaries\n", "    - Scikit-Learn's ``DictVectorizer`` will do this for you:" ] }, { "cell_type": "code", 
"execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2018-05-20T08:44:09.111592Z", "start_time": "2018-05-20T08:44:08.270612Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 1, 0, 850000, 4],\n", " [ 1, 0, 0, 700000, 3],\n", " [ 0, 0, 1, 650000, 3],\n", " [ 1, 0, 0, 600000, 2]], dtype=int64)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.feature_extraction import DictVectorizer\n", "vec = DictVectorizer(sparse=False, dtype=int)\n", "vec.fit_transform(data)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Notice\n", "- the 'neighborhood' column has been expanded into **three** separate columns, one for each of the three neighborhood labels (why not four?)\n", "- each row has a 1 in the column associated with its neighborhood.\n", "\n", "To see the meaning of each column, you can inspect the feature names (in Scikit-Learn 1.0 and later, this method is called ``get_feature_names_out``):" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['neighborhood=Fremont',\n", " 'neighborhood=Queen Anne',\n", " 'neighborhood=Wallingford',\n", " 'price',\n", " 'rooms']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vec.get_feature_names()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "There is one clear disadvantage of this approach: \n", "- if your category has many possible values, this can *greatly* increase the size of your dataset.\n", "    - However, because the encoded data contains mostly zeros, a sparse output can be a very efficient solution:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2018-05-22T07:31:07.544822Z", "start_time": "2018-05-22T07:31:07.538636Z" }, "slideshow": { "slide_type": "fragment" } }, 
"outputs": [ { "data": { "text/plain": [ "<4x5 sparse matrix of type '<class 'numpy.int64'>'\n", "\twith 12 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vec = DictVectorizer(sparse=True, dtype=int)\n", "vec.fit_transform(data)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Many (though not yet all) of the Scikit-Learn estimators accept such sparse inputs when fitting and evaluating models. \n", "\n", "Scikit-Learn includes two additional tools to support this type of encoding:\n", "- ``sklearn.preprocessing.OneHotEncoder``\n", "- ``sklearn.feature_extraction.FeatureHasher`` " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Text Features\n", "\n", "Another common need in feature engineering is to convert text to a set of representative numerical values.\n", "\n", "Most automatic mining of social media data relies on some form of encoding the text as numbers.\n", "- One of the simplest methods of encoding data is by *word counts*: \n", "    - you take each snippet of text, count the occurrences of each word within it, and put the results in a table.\n", "\n", "For example, consider the following set of three phrases:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2018-05-20T08:50:12.288018Z", "start_time": "2018-05-20T08:50:12.285167Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "sample = ['problem of evil',\n", " 'evil queen',\n", " 'horizon problem']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "For a vectorization of this data based on word count, we could construct a column representing the word \"problem,\" the word \"evil,\" the word \"horizon,\" and so on.\n", "\n", "While doing this by hand would be possible, the tedium can be 
avoided by using Scikit-Learn's ``CountVectorizer``:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2018-05-20T08:50:42.702605Z", "start_time": "2018-05-20T08:50:42.693834Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "<3x5 sparse matrix of type '<class 'numpy.int64'>'\n", "\twith 7 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "vec = CountVectorizer()\n", "X = vec.fit_transform(sample)\n", "X" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The result is a sparse matrix recording the number of times each word appears; it is easier to inspect if we convert this to a ``DataFrame`` with labeled columns:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2018-05-20T08:59:01.576101Z", "start_time": "2018-05-20T08:59:01.122425Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
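<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n", "      <th>evil</th>\n", "      <th>horizon</th>\n", "      <th>of</th>\n", "      <th>problem</th>\n", "      <th>queen</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n", "      <td>1</td>\n", "      <td>0</td>\n", "      <td>1</td>\n", "      <td>1</td>\n", "      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n", "      <td>1</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>1</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n", "      <td>0</td>\n", "      <td>1</td>\n", "      <td>0</td>\n", "      <td>1</td>\n", "      <td>0</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>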
" ], "text/plain": [ " evil horizon of problem queen\n", "0 1 0 1 1 0\n", "1 1 0 0 0 1\n", "2 0 1 0 1 0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "pd.DataFrame(X.toarray(), columns=vec.get_feature_names())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Problem: The raw word counts put too much weight on words that appear very frequently.\n", "\n", "*term frequency-inverse document frequency* (**TF–IDF**) weights the word counts by a measure of how often they appear in the documents.\n", "\n", "The syntax for computing these features is similar to the previous example:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2018-05-20T09:00:54.938057Z", "start_time": "2018-05-20T09:00:54.919254Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/datalab/Applications/anaconda/lib/python3.5/site-packages/sklearn/feature_extraction/text.py:1015: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n", " if hasattr(X, 'dtype') and np.issubdtype(X.dtype, np.float):\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n", "      <th>evil</th>\n", "      <th>horizon</th>\n", "      <th>of</th>\n", "      <th>problem</th>\n", "      <th>queen</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n", "      <td>0.517856</td>\n", "      <td>0.000000</td>\n", "      <td>0.680919</td>\n", "      <td>0.517856</td>\n", "      <td>0.000000</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n", "      <td>0.605349</td>\n", "      <td>0.000000</td>\n", "      <td>0.000000</td>\n", "      <td>0.000000</td>\n", "      <td>0.795961</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n", "      <td>0.000000</td>\n", "      <td>0.795961</td>\n", "      <td>0.000000</td>\n", "      <td>0.605349</td>\n", "      <td>0.000000</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " evil horizon of problem queen\n", "0 0.517856 0.000000 0.680919 0.517856 0.000000\n", "1 0.605349 0.000000 0.000000 0.000000 0.795961\n", "2 0.000000 0.795961 0.000000 0.605349 0.000000" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "vec = TfidfVectorizer()\n", "X = vec.fit_transform(sample)\n", "pd.DataFrame(X.toarray(), columns=vec.get_feature_names())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "For an example of using TF-IDF in a classification problem, see [In Depth: Naive Bayes Classification](05.05-Naive-Bayes.ipynb)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Image Features\n", "\n", "The simplest approach is what we used for the digits data in [Introducing Scikit-Learn](05.02-Introducing-Scikit-Learn.ipynb): **simply using the pixel values themselves**.\n", "- But depending on the application, such approaches may not be optimal.\n", "- A comprehensive summary of feature extraction techniques for images can be found in the [Scikit-Image project](http://scikit-image.org).\n", "\n", "For one example of using Scikit-Learn and Scikit-Image together, see [Feature Engineering: Working with Images](05.14-Image-Features.ipynb)."
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Derived Features\n", "\n", "Another useful type of feature is one that is mathematically derived from some input features.\n", "\n", "We saw an example of this in [Hyperparameters and Model Validation](05.03-Hyperparameters-and-Model-Validation.ipynb) when we constructed *polynomial features* from our input data.\n", "\n", "We saw that we could convert a linear regression into a polynomial regression \n", "- not by changing the model\n", "- but by transforming the input!\n", "    - This is sometimes known as *basis function regression*, and is explored further in [In Depth: Linear Regression](05.06-Linear-Regression.ipynb).\n", "\n", "For example, this data clearly cannot be well described by a straight line:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2018-05-22T07:34:45.322010Z", "start_time": "2018-05-22T07:34:45.185490Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAW0AAAD+CAYAAADxhFR7AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAADmJJREFUeJzt3W9oXfd9x/H3N7LTaGlSFUejsdPEa0O0tHniRoX1z1i2lamDFZztQQtbR8oS0wfrSmmV1aEPRtdCqQYbyTOTlWwlMMJizFqSamG0XWALVKkSnI2pSzZnqey1SkBlzW7AUb57oCtXVmTdq+Wee+7Xfr/ARPqdwz0ff3P98fE559qRmUiSaris7QCSpP5Z2pJUiKUtSYVY2pJUiKUtSYVY2pJUiKUtSYVY2pJUiKUtSYXsGfQLXnPNNXnw4MFBv6wkXdSefPLJFzNzstd+Ay/tgwcPsrCwMOiXlaSLWkQ8389+Xh6RpEIsbUkqxNKWpEIsbUkqxNKWpEIsbUkqpOcjfxHxeeDOTUvXAb+dmY80lkqSijixuMzc/BKnVzvsnxhndmaKw4cONHa8nqWdmV8BvgIQEW8BFoG/byyRJBVxYnGZo8dP0jm7BsDyaoejx08CNFbcu7088rvA32bmq02EkaRK5uaXzhX2hs7ZNebmlxo75m5L+w+Ar21djIgjEbEQEQsrKyuDSSZJI+70amdX64PQd2lHxK3AK5n5b1u3ZeaxzJzOzOnJyZ4fnZeki8L+ifFdrQ/Cbs607wL+sqkgklTN7MwU43vHzlsb3zvG7MxUY8fs6y+MiogrgY8An2ssiSQVs3GzcaSeHun6KPCtzPxpY0kkqaDDhw40WtJb9VXamfk1trkBKUkaLj8RKUmFWNqSVIilLUmFWNqSVIilLUmFWNqSVIilLUmFWNqSVIilLUmFWNqSVIilLUmFWNqSVIilLUmFWNqSVIilLUmFWNqSVIilLUmFWNqSVIilLUmFWNqSVIilLUmFWNqSVIilLUmF9FXaEfGWiPibiFiOiOci4vKmg0mSXq/fM+37gGeA64B3A2cbSyRJuqA9vXaIiLcB7wfuyMwEXmk8lSRpW/2cab8b+E/g4YhYiog/i4jYvENEHImIhYhYWFlZaSSoJKm/0v554F3Ap4D3AB8APrJ5h8w8lpnTmTk9OTk5+JSSJKCPyyPAj4EnM/OHABHxGDDVaCpJ0rb6OdN+AnhXROyPiDcBHwIWmo0lSdpOzzPtzHw5Ij4FPAa8CXggM7/deDJJ0uv0c3mEzHwUeLThLJKkHvxEpCQVYmlLUiGWtiQVYmlLUiGWtiQVYmlLUiGWtiQVYmlLUiGWtiQVYmlLUiGWtiQVYmlLUiGWtiQVYmlLUiGWtiQVYmlLUiGWtiQVYmlLUiGWtiQVYmlLUiGWtiQVYmlLUiGWtiQVYmlLUiF7+tkpIk4Br3a/PZOZv9xYIknSBfVV2gCZeWOTQSRJvXl5RJIK6be0OxHxXEQ8EREzWzdGxJGIWIiIhZWVlQFHlCRt6Ku0M/PmzHwnMAs8GBETW7Yfy8zpzJyenJxsIqckiV1eHsnMx4FTwMEmwkiSdtaztCPiyoi4tvv1IeBa4N+bDiZJer1+nh75OeC7ETEG/AT4vcx8udlYkqTt9CztzFwBbhpCFklSDz7yJ0mFWNqSVIilLUmFWNqSVIilLUmFWNqSVIilLUmFWNqSVIilLUmFWNqSVIilLUmFWNqSVIilLUmFWNqSVIilLUmFWNqSVIilLUmFWNqSVIilLUmFWNqSVIilLUmFWNqSVIilLUmFWNqSVMiefnaKiMuBp4B/ysw7mwhyYnGZufklTq922D8xzuzMFIcPHWjiUJJUVl+lDdwDnGoqxInFZY4eP0nn7BoAy6sdjh4/CWBxS9ImPS+PRMTNwHuBh5oKMTe/dK6wN3TOrjE3v9TUISWppB1LOyICuBf4dI/9jkTEQkQsrKys7Dr
E6dXOrtYl6VLV60z7k8B3MvPZnXbKzGOZOZ2Z05OTk7sOsX9ifFfrknSp6lXaHwc+FhFPAV8Ebo+I2UGHmJ2ZYnzv2Hlr43vHmJ2ZGvShJKm0HW9EZub7N76OiDuAD2bm3KBDbNxs9OkRSdpZv0+PNO7woQOWtCT10HdpZ+YDwAONJZEk9eQnIiWpEEtbkgqxtCWpEEtbkgqxtCWpEEtbkgqxtCWpEEtbkgqxtCWpEEtbkgqxtCWpEEtbkgqxtCWpEEtbkgqxtCWpEEtbkgqxtCWpEEtbkgqxtCWpEEtbkgqxtCWpEEtbkgqxtCWpEEtbkgrZ02uHiLgMmAduABL4o8ycbzqYpHacWFxmbn6J06sd9k+MMzszxeFDB9qOpa6epc16Uf9+Zp6JiA8DX2a9xCVdZE4sLnP0+Ek6Z9cAWF7tcPT4SQCLe0T0vDyS6850v70BeLrZSJLaMje/dK6wN3TOrjE3v9RSIm3Vz5k2EXE38MfACjCzzfYjwBGA66+/fpD5JA3R6dXOrtY1fH3diMzMr2bmPuAeYD4iYsv2Y5k5nZnTk5OTTeSUNAT7J8Z3ta7h29XTI5l5HHgzsK+ZOJLaNDszxfjesfPWxveOMTsz1VIibdXP0yPvAP43M/87It4HvJKZLzYfTdKwbdxs9OmR0dXPNe0J4FsRMQb8GPhos5EktenwoQOW9AjrWdqZ+X3gpiFkkST14CciJakQS1uSCrG0JakQS1uSCrG0JakQS1uSCrG0JakQS1uSCrG0JakQS1uSCrG0JakQS1uSCrG0JakQS1uSCrG0JakQS1uSCrG0JakQS1uSCrG0JakQS1uSCrG0JakQS1uSCrG0JakQS1uSCtnTa4eIuAK4F/gV4ArgLzLzz5sOpp2dWFxmbn6J06sd9k+MMzszxeFDB9qOJalh/ZxpXwnMA78I3Ap8PiLe3mgq7ejE4jJHj59kebVDAsurHY4eP8mJxeW2o0lqWM/SzsyXMvPhXPci8AIw0Xw0Xcjc/BKds2vnrXXOrjE3v9RSIknDsqtr2hFxC+uXSJ7Zsn4kIhYiYmFlZWWQ+bSN06udXa1Lunj0XdoRcQ3wdeATmZmbt2XmscyczszpycnJQWfUFvsnxne1Luni0VdpR8RbgW8A92Tm95qNpF5mZ6YY3zt23tr43jFmZ6ZaSiRpWPp5euRq4O+AL2fmo81HUi8bT4n49Ih06YktVzpev0PEF4CjwJlNy7+Rmf+x3f7T09O5sLAwuISSdAmIiCczc7rXfj3PtDPzS8CXBpJKkvSG+IlISSrE0pakQixtSSrE0pakQixtSSrE0pakQixtSSrE0pakQixtSSrE0pakQixtSSrE0pakQixtSSrE0pakQixtSSrE0pakQixtSSrE0pakQixtSSrE0pakQixtSSrE0pakQvou7YgYj4ibmgwjSdrZnl47RMTVwF8DvwY8BNzZdChpkE4sLjM3v8Tp1Q77J8aZnZni8KEDbceS/l96ljbwGnAf8E3gl5qNIw3WicVljh4/SefsGgDLqx2OHj8JYHGrpJ6XRzLzp5n5D8CrQ8gjDdTc/NK5wt7QObvG3PxSS4mkN2YgNyIj4khELETEwsrKyiBeUhqI06udXa1Lo24gpZ2ZxzJzOjOnJycnB/GS0kDsnxjf1bo06nzkTxe12ZkpxveOnbc2vneM2ZmplhJJb0w/NyKlsjZuNvr0iC4W/TzydxWwCFwFXBERtwF3Zea3G84mDcThQwcsaV00epZ2Zv4PcOMQskiSevCatiQVYmlLUiGWtiQVYmlLUiGWtiQVEpk52BeMWAGefwMvcQ3w4oDiDJK5dmcUc41iJjDXbl2suW7IzJ4fKR94ab9REbGQmdNt59jKXLszirlGMROYa7cu9VxeHpGkQixtSSpkFEv7WNsBLsBcuzOKuUYxE5hrty7pXCN3TVuSdGGjeKYtSbqA1kt7VP+V91HNJenS1lppR8TVEXEC+BFw9zbbb4mIpyP
i+Yi4LyKGkrWPXA9ExHJEPNv9cf0QMl0REcciYqk7j89s2d7WrHrlGvqsuse9LCIei4gfdLPNbNne1rx65WplXt1jXx4R/xoR929Zb2VWfeRqbVbd45/adOzHt2xrdmaZ2coP4M3ArwN3Avdvs/0fgd8ExoDvAodHJNcDwG1DntU+4HeAYP0B/h8Bbx+BWfXKNfRZdY8bwLXdrz8MLIzIe6tXrlbm1T32nwCPbH3PtzWrPnK1Nqvu8U/tsK3RmbV2pp07/CvvETEJ/EJmPpqZa8CDrL/JW83Vlsx8KTMfznUvAi8AE9D6rC6Yq03dPGe6394APL2xreV5XTBXmyLiZuC9wENb1lub1U65RtkwZtb6Ne0LuA74r03f/xC4tqUsW50F/ioi/iUiPjvsg0fELcAVwDPdpZGY1Ta5oMVZRcTdEfES8Bngi5s2tTqvHXJBC/OKiADuBT69zebWZtUjF7T86xDoRMRzEfHElstcjc9sVEv7cuC1Td+/Bqy1lOU8mXlXZt7A+u+ed0XEh4Z17Ii4Bvg68Ins/jmMEZjVBXK1OqvM/Gpm7gPuAea7JQAtz2uHXG3N65PAdzLz2W22tTmrnXK1+t7qHv/mzHwnMAs8GBEbf8JsfGajWtpngM3/qN91rP/Re2Rk5gvAN4FbhnG8iHgr8A3gnsz83qZNrc5qh1znDHtWW459nPX7FPu6SyPx3tom1+Ztw5zXx4GPRcRTrJ/53x4Rs91tbc5qp1zntPne6h7/ceAUcLC71PzM2rqQv+mi/R1sf8PvJHAbP7uY/8ERyXVj97/7WL8U8IEhZLkaeBz4rQtsb2VWfeQa+qy6x3sH8Lbu1+8Dnh2RefXK1cq8Nh3/de/5tn8d7pCrtVkBV/KzG8qHgGXgymHNbKjD3/ITvwp4lvUnDn7S/fp24HPd7e/p/uRfAP50hHI9wvrvrEvAHw4p0xeAl7tZNn58dgRm1SvX0Ge1aR4/AJ4D/hm4dUTeW71ytTKvTfnuAO4fhVn1kau1WQGTm/4/fh/41WHOzI+xS1Iho3pNW5K0DUtbkgqxtCWpEEtbkgqxtCWpEEtbkgqxtCWpEEtbkgqxtCWpkP8DLtyWt31ukeAAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "x = np.array([1, 2, 3, 4, 5])\n", "y = np.array([4, 2, 1, 3, 7])\n", "plt.scatter(x, y);" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Still, we can fit a line to the data using ``LinearRegression`` and get the optimal result:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2018-05-22T07:34:57.810225Z", "start_time": "2018-05-22T07:34:57.672835Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAW0AAAD+CAYAAADxhFR7AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAFLJJREFUeJzt3X1sXXd9x/HPt07SOGlTJ/FDa7eOmyZxbxpNGFy6lgZKa2MmighME2PAVESbMQnYGLhbqiJNG0gTnrSp3V8RQ90Q0oQgigaieLUppTC61WnKAnHcJpA+OMF20joh7U3qh+/+8HXqOPfJzT33nN+975cU1T7nxuebX68/uT73fHLM3QUACMNlcQ8AACgeoQ0AASG0ASAghDYABITQBoCAENoAEBBCGwACQmgDQEAIbQAIyLJSf8H6+npva2sr9ZcFgIq2b9++E+7eUOhxJQ/ttrY2DQ0NlfrLAkBFM7MXinkcp0cAICCENgAEhNAGgIAQ2gAQEEIbAAJCaANAQApe8mdmfyPp3gWbrpX0EXf/QWRTAUAg9u4fVV//iI5NptVcV6vennbt6GiJ7HgFQ9vd/0HSP0iSmV0lab+k/4psIgAIxN79o9q154DSUzOSpNHJtHbtOSBJkQX3Uk+PfFzSd9x9OophACAkff0j5wN7XnpqRn39I5Edc6mh/WlJ31i80cx2mtmQmQ1NTEyUZjIASLhjk+klbS+FokPbzN4h6ay7H1q8z913u3unu3c2NBSszgNARWiuq13S9lJYyivt+yT9a1SDAEBoenvaVbu85oJttctr1NvTHtkxi/oHo8xstaQPSvpSZJMAQGDm32xM1NUjGR+V9EN3PxPZJAAQoB0dLZGG9GJFhba7f0NZ3oAEAJQXjUgACAihDQABIbQBICCENgAEhNAGgIAQ2gAQEEIbAAJCaANAQAhtAAgIoQ0AASG0ASAghDYABITQBoCAENoAEBBCGwACQmgDQEAIbQAICKENAAEhtAEgIIQ2AASE0AaAgBDaABAQQhsAAlJUaJvZVWb2H2Y2amZHzGxF1IMBAC5W7CvthyX9UtK1km6SNBXZRACAnJYVeoCZXS3pNkn3uLtLOhv5VACArIp5pX2TpN9I+q6ZjZjZP5qZLXyAme00syEzG5qYmIhkUABAcaHdKGmrpM9Jerukd0n64MIHuPtud+90986GhobSTwkAkFTE6RFJ45L2ufvLkmRmj0lqj3QqAEBWxbzSfkrSVjNrNrPLJXVJGop2LABANgVfabv7a2b
2OUmPSbpc0iPu/njkkwEALlLM6RG5+6OSHo14FgBAATQiASAghDYABITQBoCAENoAEBBCGwACQmgDQEAIbQAICKENAAEhtAEgIIQ2AASE0AaAgBDaABAQQhsAAkJoA0BACG0ACAihDQABIbQBICCENgAEhNAGgIAQ2gAQEEIbAAJCaANAQAhtAAgIoQ0AAVlWzIPM7Kik6cynx919e2QTAQByKiq0JcndN0U5CACgME6PAEBAig3ttJkdMbOnzKxn8U4z22lmQ2Y2NDExUeIRAQDzigptd0+5+w2SeiV9y8zqFu3f7e6d7t7Z0NAQxZwAAC3x9Ii7PynpqKS2KIYBAORXMLTNbLWZXZP5uEPSNZKej3owAMDFirl6ZJWkJ8ysRtIpSZ9w99eiHQsAkE3B0Hb3CUlbyjALAARremZWy2qivyCv6Ou0AQBvcnf96thpDQ6Pa2B4TG31q/XwxzoiPy6hDQBFOjc9o58fOamB4TENDo/r+KmzMpPe3rpWnRvWlmUGQhsA8jh55px+dGhcg8Pj+snzE3r9jRnVLq/Ru7fU66+6t+i9Nzaq/orLyzYPoQ0AC7i7jkyc0WMHxzU4PKZ9L74qd+nqNSv14Y4WdW1t0q0b12vl8ppY5iO0AVS9qZlZDR19VQPDYxoYHtMLJ1+XJG1rWaPP37lZ3VubdFPzGplZzJMS2gCq1Kn0lJ54bkKDw2N6/NC4Tp+d1opll+m2G9brvu0bdVeqUddcVRv3mBchtAFUjRdPvn7+1fT//uYVTc+61q9eoffddLW6Uk3avrleqy9PdiwmezoAuAQzs65nX5rUYCaonxs7I0na3HiF7nv3RnWlGvW269aq5rL4T3sUi9AGUFFef2NaTz5/QgMHx/T4yLhOnHlDNZeZ3tm2Tl++u1VdqUZtWL867jHfMkIbQPB+e+qsBg+NaeDgmH525KTemJ7VlSuX6Y72RnWlGnXHlkZdtWp53GOWBKENIDjzbcT5ksuB0VOSpNZ1q/SJWzaoK9Wom69fp+VlqJWXG6ENIAi52ogd19Xp/ve3qzvVpE2NVyTisrwoEdoAEitfG/EL3Vt0Z5nbiElAaANIjFxtxKY1l8+1EVNNuvWG+NqISUBoA4jV1Mysnj76yvl/LS/JbcQkILQBlF3WNmLNZbpt03rdu32j7rqxUc11yWsjJgGhDaAssrUR151vIzZq++aGxLcRk4AVAhCJfG3Ee7dvVPfW8NqISUBoAyiZfG3EBz9wnbpSTWqrD7eNmASENoBLUk1txCQgtAEsSa424nXravXxW1rVnWqq2DZiEhDaAAoq1EbsSjVpcxW0EZOA0AaQVa424vbN1dtGTAJCG4Ck/G3EHR0t6qaNmAhFhbaZrZD0rKT/dvd7oxhk7/5R9fWP6NhkWs11tertadeOjpYoDgUgI1cb8abmNfrcnZvVnWrSthbaiElS7CvtByQdjWqIvftHtWvPAaWnZiRJo5Np7dpzQJIIbqDEcrURb72BNmIICoa2maUk3Szp25Juj2KIvv6R84E9Lz01o77+EUIbKIFCbcTbNzfoCtqIQcj7f8nmfiZ6SNKfK09gm9lOSTslqbW1dclDHJtML2k7gPxytRE30UYMXqG/Wj8j6cfuftjMcoa2u++WtFuSOjs7falDNNfVajRLQPMjGlC8+Tbi4PCYfnSINmKlKhTan5R0pZn9kaR1klab2Yi795VyiN6e9gvOaUtS7fIa9fa0l/IwQMWhjVh98oa2u982/7GZ3SPp9lIHtvTmm41cPQLkN99GnL/agzZi9UnMOw87OloIaSAL2ohYqOjQdvdHJD0S2SQAzjt55pweH5nQwMEx2oi4QGJeaQPVjHsjoliENhCTqZlZDR19NXPaY0xHF7QRuTciciG0gTI6fXZKT4xMaCDLvRE/TRsRRSC0gYjNtxEHD43pf37NvRFxaXimACU2O+t69uVJDRzk3ogoPUIbKIF8bcQv392qrlSjNqynjYhLR2gDbxF
tRMSB0AaKlKuN2LpulT5xywZ1pRppIyJyhDaQx3wbcXB47vrpY4vaiN2pJm2ijYgyIrSBRRa2EZ98fkKvZdqI795Sr7+kjYiYEdqoevNtxIHhcQ0cHNMzL76q2QX3Ruza2qRbN9JGRDIQ2qhK0zOzejpLG3FbS+beiLQRkVCENqrGwjbij0cmdCo9RRsRwSG0UdFeeuXNeyPOtxHXr16h7q1N6ko1afvmetqICArPVlSUhW3EweFxjYz9ThJtRFQOQhvBe/2Naf30+RMayNpG3EobERWF0EaQxk6fPX8nl58ePkEbEVWD0EYQ3F0Hj5/WwMFxDR4a0/+9TBsR1YnQRmLRRgQuRmgjUV557Q396NBcSP/kOdqIwGKENmI110Z8be6yvGxtRO6NCFyA0EbZzbcRBzPXT9NGBIpHaKMsaCMCpUFoIzLZ2ojraCMCl6Tgd4yZXSapX9IGSS7p8+7eH/VgCM98G3FweEwDB2kjhmrv/lH19Y/o2GRazXW16u1p146OlrjHQkYxL3Nc0p+6+3Eze7+kr2ouxIFFbcQJnThzjjZiwPbuH9WuPQeUnpqRJI1OprVrzwFJIrgTomBou7tLOp75dIOkX0Q6ERJv7PTZ87fc+tnhEzpHG7Fi9PWPnA/seempGfX1jxDaCVHUCUUzu1/SX0uakNSTZf9OSTslqbW1tZTzIQFytRGvW1erP7mlVd2pJtqIFeLYZHpJ21F+RYW2u39N0tfM7COS+s0slXkFPr9/t6TdktTZ2ek5vgwCcm56Rk/9+pXMv5Z3cRuxK9WkzbQRK05zXa1GswQ0V/Ykx5Leunf3PWb2kKT1kk5EMxLikquNuH0zbcRq0dvTfsE5bUmqXV6j3p72GKfCQsVcPbJR0uvu/lszu1XSWXcnsCvAwjbi4PCY9r1AG7HazZ+35uqR5CrmlXadpB+aWY2kcUkfjXYkRGl6ZlZDL7yqgYMXthFvap5rI3almrSthTZiNdvR0UJIJ1gxV488I2lLGWZBRObbiIPDY3qcNiIQNOpoFSp/G7FR2zc30EYEAsR3bYXI1UbcRBsRqCiEdsDm24iDw+MaPDR+QRvxwQ+k1JVqUls9bUSgkhDagaGNCFQ3Qjvh8rURP/bOVr1vK21EoJoQ2gmUr43Y29Ou7q20EYFqRWgnxCuvvaHHD82d9qCNCCAXQjsm+dqIH+poUTdtRABZENplRBsRwKUitCOWq4146w20EQEsHaEdgVxtxK5Uk7q3Nur2zQ26gjYigLeA5CiBQm3ErlSjOlppIwK4dIT2W5Tr3og3t62ljQggMoT2EmRtI16+THfcSBsRQHkQ2nkUujdiV6pJN7et04pltBEBlAehvQhtRABJRmirwL0Ru7bovTc2quFK2ogA4leVob2wjThwcEzPvHhhG7Er1ajbbqinjQggcaomtKdnZvX00VfnLstb1Eb87J2b1U0bEUAAKjq087YRb79ed6WaaCMCCErFhXa2NuLaVctpIwKoCMGnV7424qe3X6/uVBNtRAAVI8jQpo0IoFoFE9pjp89m/u3p8QvaiO9pb1D31ia9Z0uD6latiHtMAIhUwdA2s5WSHpL0HkkrJf2zu/9T1IMVujdi99bqbiPu3T+qvv4RHZtMq7muVr097drR0RL3WAAiVswr7dWS+iX9maT1kn5lZt9x95dKPcy56Rn9/MhJDQ6PX9BGfFumjdiVatKWJtqIe/ePateeA0pPzUiSRifT2rXngCQR3ECFKxja7n5S0nczn54ws5ck1UkqeWh/6F9+pkO//Z1ql9fodtqIOfX1j5wP7HnpqRn19Y8Q2kCFW9I5bTPbprlTJL9ctH2npJ2S1Nra+paH+eydm7RqRQ1txAKOTaaXtB1A5Sg6tM2sXtI3JX3K3X3hPnffLWm3JHV2dnqW316Uu3+v+a3+1qrSXFer0SwBTVEIqHxFvYtnZmslfU/SA+7+dLQjoZDennbVLvpJpHZ5jXp72mOaCEC5FHP1yBp
J/ynpq+7+aPQjoZD589ZcPQJUH1t0puPiB5g9KGmXpOMLNr/P3X+d7fGdnZ0+NDRUugkBoAqY2T537yz0uGKuHvmKpK+UZCoAwCWpzmYKAASK0AaAgBDaABAQQhsAAkJoA0BACG0ACAihDQABIbQBICCENgAEhNAGgIAQ2gAQEEIbAAJCaANAQAhtAAgIoQ0AASG0ASAghDYABITQBoCAENoAEBBCGwACQmgDQEAIbQAISNGhbWa1ZrYlymEAAPktK/QAM1sj6d8l3Snp25LujXoooJT27h9VX/+Ijk2m1VxXq96edu3oaIl7LOAtKRjakmYlPSzp+5J+P9pxgNLau39Uu/YcUHpqRpI0OpnWrj0HJIngRpAKnh5x9zPuPihpugzzACXV1z9yPrDnpadm1Nc/EtNEwKUpyRuRZrbTzIbMbGhiYqIUXxIoiWOT6SVtB5KuJKHt7rvdvdPdOxsaGkrxJYGSaK6rXdJ2IOm45A8VrbenXbXLay7YVru8Rr097TFNBFyaYt6IBII1/2YjV4+gUhRzyd+VkvZLulLSSjO7Q9J97v54xLMBJbGjo4WQRsUoGNru/jtJm8owCwCgAM5pA0BACG0ACAihDQABIbQBICCENgAExNy9tF/QbELSC5fwJeolnSjROKXEXEuTxLmSOJPEXEtVqXNtcPeClfKSh/alMrMhd++Me47FmGtpkjhXEmeSmGupqn0uTo8AQEAIbQAISBJDe3fcA+TAXEuTxLmSOJPEXEtV1XMl7pw2ACC3JL7SBgDkEHtoJ/Uu70mdC0B1iy20zWyNme2VNCbp/iz7t5nZL8zsBTN72MzKMmsRcz1iZqNmdjjzq7UMM600s91mNpJZjy8s2h/XWhWaq+xrlTnuZWb2mJk9l5mtZ9H+uNar0FyxrFfm2CvM7KCZfX3R9ljWqoi5YlurzPGPLjj2k4v2Rbtm7h7LL0lXSLpL0r2Svp5l/08k/YGkGklPSNqRkLkekXRHmddqvaQ/lGSau4B/TNJ1CVirQnOVfa0yxzVJ12Q+fr+koYQ8twrNFct6ZY79t5J+sPg5H9daFTFXbGuVOf7RPPsiXbPYXml7nru8m1mDpOvd/VF3n5H0Lc09yWOdKy7uftLdv+tzTkh6SVKdFPta5ZwrTpl5jmc+3SDpF/P7Yl6vnHPFycxSkm6W9O1F22Nbq3xzJVk51iz2c9o5XCvpxQWfvyzpmphmWWxK0r+Z2a/M7IvlPriZbZO0UtIvM5sSsVZZ5pJiXCszu9/MTkr6gqS/W7Ar1vXKM5cUw3qZmUl6SNJfZNkd21oVmEuK+ftQUtrMjpjZU4tOc0W+ZkkN7RWSZhd8PitpJqZZLuDu97n7Bs397XmfmXWV69hmVi/pm5I+5Zmfw5SAtcoxV6xr5e5fc/f1kh6Q1J8JASnm9cozV1zr9RlJP3b3w1n2xblW+eaK9bmVOX7K3W+Q1CvpW2Y2/xNm5GuW1NA+LmnhTf2u1dyP3onh7i9J+r6kbeU4npmtlfQ9SQ+4+9MLdsW6VnnmOq/ca7Xo2Hs09z7F+symRDy3ssy1cF851+uTkv7YzJ7V3Cv/D5tZb2ZfnGuVb67z4nxuZY7/pKSjktoym6Jfs7hO5C84aX+Psr/hd0DSHXrzZP7tCZlrU+a/6zV3KuBdZZhljaQnJd2dY38sa1XEXGVfq8zxNkq6OvPxrZIOJ2S9Cs0Vy3otOP5Fz/m4vw/zzBXbWklarTffUO6QNCppdbnWrKyLv+gPfqWkw5q74uBU5uMPS/pSZv/bM3/4lyT9fYLm+oHm/mYdkfTZMs30oKTXMrPM//piAtaq0FxlX6sF6/GcpCOSfi7pHQl5bhWaK5b1WjDfPZK+noS1KmKu2NZKUsOC/4/PSHpvOdeMGjsABCSp57QBAFkQ2gAQEEIbAAJCaANAQAhtAAgIoQ0AASG0ASAghDYABITQBoCA/D9H/CTr9G3gNwAAAABJRU5ErkJggg==\n", 
"text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn.linear_model import LinearRegression\n", "X = x[:, np.newaxis]\n", "model = LinearRegression().fit(X, y)\n", "yfit = model.predict(X)\n", "plt.scatter(x, y)\n", "plt.plot(x, yfit);" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We need a more sophisticated model to describe the relationship between $x$ and $y$.\n", "- One approach to this is to transform the data, \n", " - adding extra columns of features to drive more flexibility in the model.\n", "\n", "For example, we can add polynomial features to the data this way:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2018-05-22T07:35:14.541731Z", "start_time": "2018-05-22T07:35:14.536044Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 1. 1. 1.]\n", " [ 2. 4. 8.]\n", " [ 3. 9. 27.]\n", " [ 4. 16. 64.]\n", " [ 5. 25. 125.]]\n" ] } ], "source": [ "from sklearn.preprocessing import PolynomialFeatures\n", "poly = PolynomialFeatures(degree=3, include_bias=False)\n", "X2 = poly.fit_transform(X)\n", "print(X2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The derived feature matrix has one column representing $x$, and a second column representing $x^2$, and a third column representing $x^3$.\n", "Computing a linear regression on this expanded input gives a much closer fit to our data:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2018-05-22T07:35:35.180038Z", "start_time": "2018-05-22T07:35:35.048861Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "model = LinearRegression().fit(X2, y)\n", "yfit = model.predict(X2)\n", "plt.scatter(x, y)\n", "plt.plot(x, yfit);" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "This idea of improving a model not by changing the model, but by transforming the inputs, is fundamental to many of the more powerful machine learning methods.\n", "\n", "- We explore this idea further in [In Depth: Linear Regression](05.06-Linear-Regression.ipynb) in the context of *basis function regression*.\n", "\n", "- More generally, this is one motivational path to the powerful set of techniques known as *kernel methods*, which we will explore in [In-Depth: Support Vector Machines](05.07-Support-Vector-Machines.ipynb)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Imputation of Missing Data\n", "\n", "Another common need in feature engineering is handling of missing data.\n", "\n", "- [Handling Missing Data](03.04-Missing-Values.ipynb)\n", " - ``NaN`` value is used to mark missing values.\n", " \n", "For example, we might have a dataset that looks like this:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2018-05-22T07:37:02.210360Z", "start_time": "2018-05-22T07:37:02.205046Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from numpy import nan\n", "X = np.array([[ nan, 0, 3 ],\n", " [ 3, 7, 9 ],\n", " [ 3, 5, 2 ],\n", " [ 4, nan, 6 ],\n", " [ 8, 8, 1 ]])\n", "y = np.array([14, 16, -1, 8, -5])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "When applying a typical machine learning model to such data, we will need to first replace such missing data with some appropriate fill value.\n", "\n", "This is known as *imputation* of missing values\n", "- simple method, e.g., replacing 
missing values with the mean of the column\n", "- sophisticated method, e.g., using matrix completion or a robust model to handle such data\n", "    - Such methods tend to be very application-specific, and we won't dive into them here.\n", "\n", "For a baseline imputation approach, using the mean, median, or most frequent value, Scikit-Learn provides the ``Imputer`` class (superseded by ``sklearn.impute.SimpleImputer`` in Scikit-Learn 0.20 and later):" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2018-05-22T07:37:28.585039Z", "start_time": "2018-05-22T07:37:28.578062Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "array([[4.5, 0. , 3. ],\n", " [3. , 7. , 9. ],\n", " [3. , 5. , 2. ],\n", " [4. , 5. , 6. ],\n", " [8. , 8. , 1. ]])" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.preprocessing import Imputer\n", "imp = Imputer(strategy='mean')\n", "X2 = imp.fit_transform(X)\n", "X2" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We see that in the resulting data, the two missing values have been replaced with the mean of the remaining values in the column. 
\n", "\n", "This imputed data can then be fed directly into, for example, a ``LinearRegression`` estimator:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2018-05-22T07:37:51.007780Z", "start_time": "2018-05-22T07:37:51.001380Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "array([13.14869292, 14.3784627 , -1.15539732, 10.96606197, -5.33782027])" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = LinearRegression().fit(X2, y)\n", "model.predict(X2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Feature Pipelines\n", "\n", "With any of the preceding examples, it can quickly become tedious to do the transformations by hand, especially if you wish to string together multiple steps.\n", "\n", "For example, we might want a processing pipeline that looks something like this:\n", "\n", "1. Impute missing values using the mean\n", "2. Transform features to quadratic\n", "3. Fit a linear regression\n", "\n", "To streamline this type of processing pipeline, Scikit-Learn provides a ``Pipeline`` object, which can be used as follows:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2018-05-22T07:38:44.347096Z", "start_time": "2018-05-22T07:38:44.342923Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from sklearn.pipeline import make_pipeline\n", "\n", "model = make_pipeline(Imputer(strategy='mean'),\n", " PolynomialFeatures(degree=2),\n", " LinearRegression())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "This pipeline looks and acts like a standard Scikit-Learn object, and will apply all the specified steps to any input data." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": { "ExecuteTime": { "end_time": "2018-05-22T07:38:47.965622Z", "start_time": "2018-05-22T07:38:47.958847Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[14 16 -1 8 -5]\n", "[14. 16. -1. 8. -5.]\n" ] } ], "source": [ "model.fit(X, y)  # X with missing values, from above\n", "print(y)\n", "print(model.predict(X))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "All the steps of the model are applied automatically.\n", "\n", "Notice that, for simplicity, we've applied the model to the data it was trained on;\n", "- this is why it was able to predict the result perfectly (refer back to [Hyperparameters and Model Validation](05.03-Hyperparameters-and-Model-Validation.ipynb) for further discussion of this).\n", "\n", "For some examples of Scikit-Learn pipelines in action, see the following section on naive Bayes classification, as well as [In Depth: Linear Regression](05.06-Linear-Regression.ipynb), and [In-Depth: Support Vector Machines](05.07-Support-Vector-Machines.ipynb)."
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "< [Hyperparameters and Model Validation](05.03-Hyperparameters-and-Model-Validation.ipynb) | [Contents](Index.ipynb) | [In Depth: Naive Bayes Classification](05.05-Naive-Bayes.ipynb) >" ] } ], "metadata": { "anaconda-cloud": {}, "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }