{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Scikit-learn\n", "\n", "(If you're using the code files, please open scikit_learn_lessons.py)\n", "\n", "## A brief intro to machine learning\n", "\n", "There's a fair bit of backround knowledge that's important to know before we dive into the code. The actual code is rather simple, but I want you to understand exactly what's going on.\n", "\n", "### What is machine learning?\n", "\n", "Machine learning is the study and application of algorithms that learn from examples. It's concerned with constructing systems that learn from data, systems that can then make future predictions based on past input. It's based on the ideas of representation and generalization: the representation of relationships within the data, and the ability to generalize those relationships to new data. This means that we can create a model that will learn something from the data that we have, and then apply what it learns to data that it hasn't seen before. Machine learning provides a way to build executable data summaries; it helps us build better software by predicting more accurately on future inputs.\n", "\n", "### Why is it useful?\n", "\n", "This is an important topic because machine learning is everywhere. For example, your email spam filter is already trained to mark certain emails as spam, based on things like frequency of capital letters or number of suspicious links within an email. If a spam email does get through to your inbox and you manually mark it as spam, your spam filter learns from that input, and will mark similar emails as spam in the future. Another example is Netflix's recommender system. The more movies you rate on Netflix, the more that the recommender system learns what kind of movies you like to watch. The system will then get better at recommending to you appropriate movie choices. Machine learning is especially useful in data analysis.\n", "\n", "### Some terms\n", "\n", "- observation/instance/data point: these all mean the same thing, and that is one particular piece of the data that we can grab information about and learn relationships from.\n", "- label/class: in classification, the label/class is what we aim to classify our new data as. Ex: email as spam or not spam.\n", "- feature: features describe the data. Features of email spam could be number of capital letter or frequency of known spam words.\n", "- categorical: discrete and finite data; has categories. Ex. spam or not spam.\n", "- continuous: subset of real numbers, can take on any value between two points. Ex. temperature degrees.\n", "\n", "### Types of machine learning\n", "\n", "#### Supervised\n", "Supervised learning is machine learning that makes use of labeled data. Supervised learning algorithms can use past observations to make future predictions on both categorical and continuous data. The two main types of supervised learning are classification and regression. Classification predicts labels, such as spam or not spam. Regression predicts the relationship between continuous variables, such as the relationship between temperature and elevation.\n", "\n", "#### Unsupervised\n", "Unsupervised learning is used when the data is unlabeled. You might not know what you're looking for within your data, and unsupervised learning can help you figure it out. Clustering is an example of unsupervised learning, where data instances are grouped together in a way that observations in the same group are more similar to each other than to those in other groups. 
Another example of unsupervised learning is dimensionality reduction, which reduces the number of variables under consideration; it's used for both feature selection and feature extraction.\n", "\n", "## What is scikit-learn?\n", "Scikit-learn is an open-source machine learning library for Python. The scikit-learn project is constantly being developed and improved, and it has a very active user community. The documentation on the website is very thorough with plenty of examples, and there are a number of tutorials and guides for learning how scikit-learn works.\n", "\n", "### Why scikit-learn?\n", "You might be wondering why you'd want to use Python and scikit-learn, rather than other popular tools like MATLAB or R. Because scikit-learn is open source, it's free to use! Also, it's currently the most comprehensive machine learning tool for Python. There are also a number of Python libraries that work well with scikit-learn and extend its capabilities. \n", "\n", "## About this section\n", "Due to time constraints, we're only going to cover supervised learning. We'll talk about a few classifiers as well as linear regression. For the final lesson of this section, we'll use the three classifiers we learn about (k-nearest neighbors, decision trees, and the Naive Bayes classifier) on our census_data dataset. We'll then compare the classifiers and see which one is better for our data.\n", "\n", "## Let's start with classification\n", "Classification, again, classifies data into specific categories, and solves the task of figuring out which category new data belongs to. There are many different kinds of classifiers, and which one you want to use depends on your data. We're only going to be covering k-Nearest Neighbors (kNN), decision trees, and the Naive Bayes classifier (NB), because they're among the simplest to implement and understand.\n", "\n", "For each algorithm, I'll walk you through a simple example, so that you'll have an idea of how it works. I'll also show you how to evaluate the models we create.\n", "\n", "Something important to notice in my examples is that when we train, we use a different dataset than when we predict. This is to avoid the problem of overfitting. So, what's overfitting? Well, let's say we train our model on the entire dataset. If we want to also test with that dataset, we won't be able to get an accurate picture of how good our model is, because now it knows our entire dataset by heart. This is why we split up our sets.\n", "\n", "## k-Nearest Neighbors\n", "\n", "The k-Nearest Neighbors (kNN) algorithm finds a predetermined number of \"neighbor\" samples that are closest in distance to a starting data point, and makes predictions based on those neighbors. kNN predicts labels by looking at the labels of its nearest neighbors. The metric used to calculate the distances between points can be any distance measure, such as the Euclidean distance or the Manhattan distance.\n", "\n", "kNN is useful when your features are numeric, so that distances between data points can be meaningfully measured. Also, kNN does well when the decision boundary (the delineation between classes) is irregular and hard to identify. \n", "\n", "kNN comes with a couple of caveats. If the classes in your dataset are unevenly distributed, the most frequently occurring label will tend to dominate predictions. Also, choosing the *k* of kNN can be tricky.
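 To get a feel for why, the sketch below tries a few values of *k* on scikit-learn's built-in iris data.\n", "\n", "This is a hypothetical aside rather than part of the lesson's flow: it compares a few arbitrarily chosen values of `n_neighbors` using cross-validation (which we'll meet properly in a moment) and reports the mean accuracy for each." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# A hedged sketch: kNN accuracy for a few arbitrary values of k.\n", "from sklearn.cross_validation import cross_val_score\n", "from sklearn.datasets import load_iris\n", "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "iris = load_iris()\n", "\n", "for k in [1, 5, 15]:  # candidate values chosen purely for illustration\n", "    knn = KNeighborsClassifier(n_neighbors=k)\n", "    scores = cross_val_score(knn, iris.data, iris.target, cv=5)\n", "    print k, scores.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "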
Choosing *k* deserves its own three-hour tutorial, so we'll just go with the defaults for today.\n", "\n", "### Classifying in scikit-learn: kNN\n", "\n", "As we go through these examples, you'll notice that the fitting and predicting process is basically the same each time, which is one of the things that makes scikit-learn relatively easy to use.\n", "\n", "Let's start by reading in some data and its labels, and then split it up so we don't overfit. The default split for the `train_test_split()` function is 0.25, meaning that 75% of the data goes into the training set and 25% goes into the test set. If you want a different split, that's something that can be changed." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.cross_validation import train_test_split\n", "\n", "wine_data = pd.read_csv('../data/wine_data.csv')\n", "wine_labels = pd.read_csv('../data/wine_labels.csv', squeeze=True)\n", "\n", "wine_data_train, wine_data_test, wine_labels_train, wine_labels_test = train_test_split(wine_data, wine_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scikit-learn can actually understand `DataFrame`s, and can use them as input. Here's what one row from `wine_data_train`, which is a `DataFrame`, looks like." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "wine_data_train[:1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's compare the lengths of the original `DataFrame` and the training set." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print len(wine_data), len(wine_data_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can fit kNN to our training data. This is pretty easy. We create our estimator object and then use the `fit()` method to fit the model to our data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "knn = KNeighborsClassifier()\n", "knn.fit(wine_data_train, wine_labels_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And finally, let's use our fitted model to predict on new data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "knn.predict(wine_data_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the real labels." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "wine_labels_test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see that there are some differences between the predictions and the actual labels. Let's actually calculate how accurate our classifier is. We can do that using cross-validation. Cross-validation is a method that takes a dataset, randomly splits it into training and test sets, trains the model on the training portion, and computes how accurate its predictions are by checking them against the real test labels. It does this multiple times, splitting the dataset differently each time.
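\n", "\n", "Before we score our model, here's a quick, hypothetical aside about the split we made earlier: if you want something other than the default 25% test set, `train_test_split()` accepts a `test_size` parameter. The 0.4 below is an arbitrary value for illustration, and the `_2` names are only there to avoid clobbering our real training and test sets." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# A hedged sketch: the same split as before, but with an explicit test_size.\n", "# 0.4 (40% test / 60% train) is an arbitrary choice for illustration.\n", "wine_data_train_2, wine_data_test_2, wine_labels_train_2, wine_labels_test_2 = train_test_split(wine_data, wine_labels, test_size=0.4)\n", "\n", "print len(wine_data_train_2), len(wine_data_test_2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "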
The `cross_val_score()` function takes several parameters: the first is the model (in this case `knn`), the second is the entire dataset, the third is the entire list of labels, and, if you'd like, you can specify how many folds to cross-validate with (the `cv` parameter)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.cross_validation import cross_val_score\n", "\n", "cross_val_score(knn, wine_data, wine_labels, cv=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So our model is approximately 70% accurate. That's not so great, but you get the idea." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lesson: classification with kNN!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# We're going to be using scikit-learn's built-in datasets for this.\n", "\n", "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris()\n", "iris_data = iris.data\n", "iris_labels = iris.target\n", "\n", "# Can you split the data into training and test sets?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Now, let's use the training data and labels to train our model.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# And now, let's predict on our test set.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Let's compare the predictions to the actual labels. Output the real labels.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Let's score our model using cross-validation to see how good it is.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Decision trees\n", "\n", "Decision trees are predictive models that learn simple decision rules based on the features within a dataset. They map observations about an item to conclusions about the item's target value. Leaves represent class labels, and branches represent conjunctions of features that lead to those class labels. The decision trees used in machine learning are the same kind of decision trees used in game theory and decision analysis. \n", "\n", "Decision trees are great for heterogeneous data, having the ability to handle both categorical and continuous data; however, they are fairly simple in nature and can lack the ability to capture rich relationships within a dataset.\n", "\n", "### Classifying in scikit-learn: decision trees\n", "\n", "We're going to basically do the same thing we just did, but with a different classifier." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeClassifier\n", "\n", "tree = DecisionTreeClassifier()\n", "tree.fit(wine_data_train, wine_labels_train)\n", "tree.predict(wine_data_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the real labels." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "wine_labels_test" ] }
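, { "cell_type": "markdown", "metadata": {}, "source": [ "One nice side effect of decision trees is that you can inspect what the fitted model learned. As a hedged aside (reusing the `tree` model and the `wine_data` columns from above), the sketch below pairs each feature with its importance score; higher numbers mean the tree leaned on that feature more when splitting." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# An illustrative aside: feature importances of the fitted tree.\n", "# Higher values mean the feature mattered more to the tree's splits.\n", "pd.Series(tree.feature_importances_, index=wine_data.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And let's cross-validate again. Let's only run it four times."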
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cross_val_score(tree, wine_data, wine_labels, cv=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This model is a much better fit for our dataset, and is much more accurate than k-nearest neighbors.\n", "\n", "## Lesson: classification with decision trees!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# We're going to be using scikit-learn's built in datasets for this.\n", "\n", "from sklearn.datasets import load_digits\n", "\n", "digits = load_digits()\n", "digits_data = digits.data\n", "digits_labels = digits.target\n", "\n", "# Once again, split the data into training and test sets.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Fit the model to our data.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Predict on the test set\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Look at the test set labels\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Finally, cross-validate\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Naive Bayes\n", "\n", "The Naive Bayes classifier is a probabilistic classifier based on Bayes' Theorem, which states that the probability of *A* given the probability of *B* is equal to the probability of *B* given *A* times the probability of *A*, divided by the probability of *B*. In Naive Bayes classification, the classifier assumes that the features in your dataset are independent of each other; that is, one feature being a certain way has no effect on what values the other features take. This is a naive assumption because this doesn't always hold true in reality, but despite this naivety and oversimplified assumptions, the classifier performs decently and even quite well in certain classification situations.\n", "\n", "The Naive Bayes classifier is useful when your features are independent and your data is normally distributed. More sophisticated methods generally perform better.\n", "\n", "### Classifying in scikit-learn: Naive Bayes\n", "\n", "Just as we did the first two times, let's do it again. We're going to use the GaussianNB estimator object, because our data is for the most part normally distributed. We're also going to use the same wine training and test sets we made earlier." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.naive_bayes import GaussianNB\n", "\n", "gnb = GaussianNB()\n", "gnb.fit(wine_data_train, wine_labels_train)\n", "gnb.predict(wine_data_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Well, that was easy! Let's look at the real test labels again." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "wine_labels_test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And of course, let's cross-validate to see how well we did. Let's only run it four times." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "cross_val_score(gnb, wine_data, wine_labels, cv=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wow! This classifier does much better on this dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lesson: classification with Naive Bayes!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# We're going to be using scikit-learn's built in datasets for this.\n", "\n", "from sklearn.datasets import load_digits\n", "\n", "digits = load_digits()\n", "digits_data = digits.data\n", "digits_labels = digits.target\n", "\n", "# Once again, split the data into training and test sets.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Fit the model to our data.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Predict on the test set\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Look at the test set labels\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Finally, cross-validate\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear regression\n", "\n", "Linear regression is used when the target value is expected to be a linear combination of the input variables. The goal of linear regression, in creating a linear model, is to minimize the sum of squared residuals between the observed data and the responses predicted by linear approximation. Linear regression can be used to represent the relationship between variables like temperature and elevation, or something like housing prices and square footage.\n", "\n", "Linear regression is appropriate when your data is continuous and linear.\n", "\n", "### Linear regression in scikit-learn\n", "\n", "Let's try this on subset of our wine data, since those values are continuous other than `wine_type`. Let's see what the relationship is between magnesium and abv. First, let's subset the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "wine_data_mag = wine_data.loc[:, ['magnesium', 'color']]\n", "wine_data_abv = wine_data.loc[:, 'abv']\n", "wine_data_mag.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And, as always, let's split up the data. Our target values are going to be the continuous abv values." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "wine_mag_train, wine_mag_test, wine_abv_train, wine_abv_test = train_test_split(wine_data_mag, wine_data_abv)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we fit the model to our data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "lr = LinearRegression()\n", "lr.fit(wine_mag_train, wine_abv_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And finally, we predict." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "lr.predict(wine_mag_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's compare those predictions to the actual abv values." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "wine_abv_test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check the accuracy of our linear regression model by using the `score()` function. The `score()` function returns the R^2 coefficient, which is a measure of how far away from the actual values are predictions were. The closer to 1, the better the regression model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "lr.score(wine_mag_test, wine_abv_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So our score is rather low.\n", "\n", "## Lesson: linear regression!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# We're going to be using scikit-learn's built in datasets for this.\n", "\n", "from sklearn.datasets import load_boston\n", "\n", "boston = load_boston()\n", "boston_data = boston.data\n", "boston_target = boston.target\n", "\n", "# Once again, split the data into training and test sets.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Create a new linear regression model and fit the model to our data.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Predict on the test set.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Take a look at the actual target values.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Score the model!\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# For those using IPython Notebook/Wakari/NBViewer: Go to the [data_analysis](data_analysis.ipynb) notebook!\n", "\n", "# For those using code files, go to data_analysis.py!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.9" } }, "nbformat": 4, "nbformat_minor": 0 }