{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "

Artificial Intelligence

\n", "

Lesson 3

\n", "

Applying Regression

\n", "\n", "
\n", "\n", "
Linear Regression
\n", "\n", "
Logistic Regression
\n", "\n", "
Polynomial Regression
\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "OVERVIEW\n", "
\n", "\n", "
\n", "
This lesson takes the content from lessons 6 through 8 and guides you through a practical application of Linear, Logistic, and Polynomial Regression.\n", "
\n", "\n", "
\n", "\n", "
\n", "LINEAR REGRESSION\n", "
\n", "\n", "**Lets take a look at some data, ask some questions and use linear regression to solve said questions.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# imports\n", "import numpy as np\n", "from sklearn import linear_model\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "\n", "# this allows plots to appear directly in the notebook\n", "%matplotlib inline\n", "\n", "# read data into a DataFrame and verify contents\n", "data = pd.read_csv('./goog.csv')\n", "data.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**Now that we have our dataset, lets split the dates and prices into their own frames.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create a dataframe for the dates\n", "# select the days only from the Date column using a for loop\n", "dates = [int(i.split('-')[0]) for i in np.array(data)[:,0]]\n", "\n", "# create a dataframe for the open prices\n", "# select the data in the Open column\n", "prices = np.array(data)[:,1]\n", "\n", "# create a dataframe for the high prices\n", "# select the data from the High column\n", "high = np.array(data)[:,2]\n", "# select the data from the Prices column\n", "prices = np.array([prices]).T\n", "# select the data from the Dates column\n", "dates = np.array([dates]).T\n", "\n", "#high = np.array([high]).T\n", "#prices = np.hstack((price, high))\n", "\n", "#print(dates)\n", "#print(prices)\n", "#print(high)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**With our price and date data split, we can now create functions to simplify the process.**\n", "\n", "**The first function will be for predicting the price of a stock on day 'x'.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# define a function for predicting the price\n", "# given the dates and prices in a dataframe \n", "# and a day value represented as x\n", "def predict_price(dates, prices, x):\n", " # initialize the linear regression model\n", " linear_mod = linear_model.LinearRegression()\n", " # fit the data to the model\n", " linear_mod.fit(dates, prices)\n", " # store the result of linear prediction at value x\n", " predicted_price = linear_mod.predict(x)\n", " # return the predicted price, linear coefficient, and the intercept\n", " return predicted_price, linear_mod.coef_, linear_mod.intercept_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**Next, in order to properly show the variation in our methods, let's create a function for plotting our data points.**\n", "\n", "**The second function will be for displaying a visualized plot of our data and prediction.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# define a function for displaying a plot given\n", "# the dates and prices data as X and Y values\n", "def show_plot(dates, prices):\n", " # initialize the linear regression model\n", " linear_mod = linear_model.LinearRegression()\n", " # fit the submitted data to the model\n", " linear_mod.fit(dates, prices)\n", " # mark the scatter points using the dates and prices as X and Y values\n", " plt.scatter(dates, prices, color='lime')\n", " # plot the line of best fit using our model prediction\n", " plt.plot(dates, linear_mod.predict(dates), color='blue', linewidth=3)\n", " #display the model\n", " plt.show()\n", " return" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**With our functions in place let's test them!**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# display the result of the predict_prices function\n", "# pass the function the dates, prices, and an x value\n", "predict_price(dates, prices, 39)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**Now let's view our plot.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# display the result of the show_plot function\n", "show_plot(dates, prices)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
\n", "
\n", "\n", "
\n", "LOGISTIC REGRESSION\n", "
\n", "\n", "\n", "**For this practical example we will be using a prepared dataset provided by sklearn's 'datasets' class.**\n", "\n", "**The dataset we will be using is called *'iris'*. The goal for this example is to predict the target value by using the feature values**." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# import required libraries\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "from sklearn import linear_model, datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**This enables us to simply load a dataset by calling a simple function that returns a *'bunch'* of data - quite literally.**\n", "\n", "
\n", "\n", "A *'bunch'* is similar to a data-dictionary, as it provides attributes for our dataset, mainly:\n", "\n", ">**‘data’, the data to learn**\n", "\n", ">**‘target’, the classification labels**\n", "\n", ">‘target_names’, the meaning of the labels\n", "\n", ">‘feature_names’, the meaning of the features\n", "\n", ">‘DESCR’, the full description of the dataset\n", "\n", "**In order to build our new dataframe - we will need to extract the data and the target!**\n", "\n", "
\n", "\n", "*The code below will load the **iris** dataset's **bunch** of data so we can then store it in a dataframe.*\n", "\n", ">*iris = datasets.load_iris()*\n", "\n", "\n", "*Click the link below to learn more about this command and the other datasets: *\n", "\n", "Documentation\n", "\n", "
\n", "\n", "**With the above in mind, let's get the *iris dataset bunch* so we can turn it into a dataframe using pandas' *'DataFrame()'* function.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import some data to play with using sklearn's datasets.load_iris() function\n", "iris = datasets.load_iris()\n", "# display the chunk feature names or description\n", "print(iris.feature_names)\n", "\n", "# use pandas to combine the data with the target\n", "# define the column/feature names for our new dataframe\n", "df = pd.DataFrame(np.c_[iris.data, iris.target], columns = [\"Sepal Length\", \"Sepal Width\", \n", " \"Petal Length\", \"Petal Width\", \n", " \"Class\"])\n", "# verify successful creation of the dataframe\n", "#df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is always good practice to verify your data before continuing to the next step!\n", "\n", "
\n", "
\n", "With our dataset extracted into a dataframe, let's now select the features we want to use to plot our logistic regression.\n", "

\n", "This will extract the data for the selected features and allow us to combine it with the target values in a new dataframe.\n", "
\n", "\n", ">After we have our sub-frame of selected features and their relative target values; We can randomize the order using *shuffle* to increase the variability between results.\n", "\n", "\n", "By shuffling before splitting our data into training and testing pools, it will help us to verify the effectiveness of the logistic regression model on our data by ensuring it is always trained/tested on a different series of data.\n", "\n", "\n", "#### Target Values:\n", "- Setosa\n", "- Versicolour\n", "- Virginica" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# select the first two features from the bunch data and combine it with the target values\n", "data = np.c_[iris.data[:, :2], iris.target]\n", "\n", "# shuffle the data for increased variability when splitting into train/test\n", "\n", "\n", "#print(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**Now that we have our data together and randomized, let's assign the 'X' and 'y' data to be plotted.**\n", "\n", "**We can do this by simply selecting all rows (:), and then specifying the column number we want as our starting point (:# or #:).**\n", "\n", ">*Look to the combine function above if you're feeling lost about selecting data by row and column*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# for the X values, we want all rows and the first two columns ( :2 )\n", "#X = ?\n", "\n", "# for the y values, we want all rows and the last column ( 2: )\n", "#y = ?\n", "\n", "#print(X)\n", "#print(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**Our data is now ready to be split into training and test sets.**\n", "\n", "**However, before we split it, let's determine the point at which it will be split.**\n", "\n", ">*We determine this using a factor of the data's shape to determine a row value which represents a location 70 percent of the way through the data.*\n", "\n", "**For the training data we will select all of the data following the calculated row.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# determine the percentile split of data for training and testing\n", "# 0.7 is equal to 70/30 : test/train\n", "test_train_split = 0.7\n", "\n", "# get the training data for X using it's shape multiplied by \n", "# the split that was determined above to select the data AFTER that row\n", "X_training = X[:int(X.shape[0]*test_train_split),:]\n", "\n", "# get the training data for Y using it's shape multiplied by \n", "# the split that was determined above to select the data AFTER that row\n", "y_training = y[:int(y.shape[0]*test_train_split)]\n", "\n", "#print(int(X.shape[0]*train_test_split))\n", "#print(int(y.shape[0]*train_test_split))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**Our training data is prepared, now all we need to do is change the position of the ':' to select the data preceding the calculated row.**\n", "\n", "**This technique makes it easy to manually assign data on the fly when tinkering with what is the most effective train/test split for your data.**\n", "\n", ">*That being said, a split of 70/30 is quite common and should suffice in the majority of cases!*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# get the training data for X using it's shape multiplied by \n", "# the split that was determined above to select the data BEFORE that row\n", "#X_testing = ?\n", "\n", "# get the training data for Y using it's shape multiplied by \n", "# the split that was determined above to select the data BEFORE that row\n", "#y_testing = ?\n", "\n", "#print(X_testing)\n", "#print(y_testing)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "#### We should now take a look at our data to make sure everything looks okay" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#print(X_testing)\n", "#print(X_testing.ravel())\n", "#print(X_testing.shape)\n", "\n", "#print(y_testing)\n", "#print(y_testing.ravel())\n", "#print(y_testing.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "#### If our data looks good, it's time to initialize the logistic regression model we imported from sklearn" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Initialize the model using the LogisticRegression function\n", "logreg = linear_model.LogisticRegression()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "Then we need to fit the training data to the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We use the initialized function to then fit the data.\n", "logreg.fit(X_training, y_training.ravel())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "\n", " With the training data fit, we can now run prediction on the test data using the '*predict()*' function\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# run prediction on the test data and store it as 'Z'\n", "Z = logreg.predict(X_testing)\n", "\n", "# compare the data\n", "#print(Z)\n", "#print(y_testing.ravel())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "\n", "Now that we have our predictions, we need to define a function for determining the classification rate\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def classification_rate(y, Z):\n", " num_right = 0\n", " for i in range(len(Z)):\n", " if y[i] == Z[i]:\n", " num_right = num_right + 1\n", " return num_right/Z.shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "\n", "Let's execute the function we created above to compare the values, and then return an overall percentage of successful predictions\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "classification_rate(y_testing.ravel(), Z)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "### Well this isn't that great now is it?\n", "#### What comes next? \n", ">Well, a simple adaptation we can make is to implement ***K-fold Cross Validation*** training and see how our output changes\n", "\n", "
\n", "\n", "\n", "In order to do so we need to import the required libraries first\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn import metrics, model_selection\n", "from sklearn.linear_model import LogisticRegression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "\n", "Now that we have the libraries we need, let's implement cross validation and see the classification rate of the predictions it creates.\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Z_cross_validation = model_selection.cross_val_predict(LogisticRegression(), X, y.ravel(), cv=10)\n", "\n", "#print(model_selection.cross_val_score(LogisticRegression(), X, y.ravel()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**With our new cross validation predictions let's see what the classification rate changed too.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Execute the classification_rate function on the new predictions (Z*)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "### This is not that much better...\n", "#### There are many different classification algorithms sklearn has available to utilize in a similar way to what was demonstrated above. \n", "\n", "The following will be touched on in future lessons:\n", "1. SVM\n", "2. Naive Bayes\n", "3. Decision Trees\n", "4. Random Forests\n", "5. Neural Networks\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "POLYNOMIAL REGRESSION\n", "
\n", "\n", "#### With the variety of classification available, it is important to understand the origin of many of these advanced regression analysis techniques - polynomial regression." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "# from scipy.interpolate import *\n", "import matplotlib.pyplot as plt\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### We're going to take a look at some synthetic data for this one\n", "Hopefully this will help visualize what's going on a bit better" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##### Create a couple arrays with \n", "X = np.array([0,1,2,3,4,5])\n", "y = np.array([0,0.8,0.9,0.1,-0.8,-0.5])\n", "\n", "# print to observe\n", "print(X)\n", "print(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Let's begin fitting our data. The polyfit method uses the sum of square errors to compute the line of best fit.**\n", "\n", "*In this first piece of code, we're going to stick with a straight line*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The last parameters is a 1 for now as we'll do linear to begin with\n", "# We will use the 'polyfit()' function to determine the slope \n", "# and intercept from our data\n", "p1 = np.polyfit(X, y, 1)\n", "\n", "# This prints the slope and intercept\n", "print(p1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "Now that we have our slope and intercept, let's plot it as a first degree polynomial" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# polyval plots the data with respect to the slope, intercept, and data\n", "plt.plot(X, np.polyval(p1, X), color='blue')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "#### Now we'll move on to quadratric and cubes functions and repeat the process" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use the polyfit() function\n", "p2 = np.polyfit(X, y, 2)\n", "p3 = np.polyfit(X, y, 3)\n", "\n", "# use polyval to plot the data again\n", "plt.plot(X, np.polyval(p2, X), color='lime')\n", "plt.plot(X, np.polyval(p3, X), color='red')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "*Observe the p1 values*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p1 # y = Ax + b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "*Observe the p2 values*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p2 # y = Ax^2 + Bx + C" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "*Observe the p3 values*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p3 # y = Ax^3 + Bx^2 + Cx + D" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "\n", "Now let's see how these plots all fit together.\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plot the data points\n", "# ('yo' just means yellow circle markers)\n", "plt.plot(X,y,'yo')\n", "\n", "xp = X\n", "#xp = np.linspace(-2,6,100)\n", "\n", "# plot the data\n", "plt.plot(xp, np.polyval(p1,xp), color='blue')\n", "plt.plot(xp, np.polyval(p2,xp), color='red')\n", "plt.plot(xp, np.polyval(p3,xp), color='lime')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "Display the polynomial values for each defined point on the plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.polyval(p3, X)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }