{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [
"# Classification\n",
"> A summary of the lecture \"Supervised Learning with scikit-learn\", via DataCamp\n",
"\n",
"- toc: true \n",
"- badges: true\n",
"- comments: true\n",
"- author: Chanseok Kang\n",
"- categories: [Python, Datacamp, Machine_Learning]\n",
"- image: images/digits.png" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"# plt.style.use('ggplot')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [
"## Supervised learning\n",
"- What is machine learning?\n",
"    - The art and science of:\n",
"        - Giving computers the ability to learn to make decisions from data\n",
"        - without being explicitly programmed\n",
"    - Examples:\n",
"        - Learning to predict whether an email is spam or not\n",
"        - Clustering Wikipedia entries into different categories\n",
"    - Supervised learning : Uses labeled data\n",
"    - Unsupervised learning : Uses unlabeled data \n",
"    \n",
"- Unsupervised learning\n",
"    - Uncovering hidden patterns from unlabeled data\n",
"    - Example:\n",
"        - Grouping customers into distinct categories (Clustering)\n",
"\n",
"- Reinforcement learning\n",
"    - Software agents interact with an environment\n",
"        - Learn how to optimize their behavior\n",
"        - Given a system of rewards and punishments\n",
"    - Draws inspiration from behavioral psychology\n",
"    - Applications\n",
"        - Economics\n",
"        - Genetics\n",
"        - Game playing\n",
"        \n",
"- Supervised learning\n",
"    - Predictor variables / features and a target variable\n",
"    - Automate time-consuming or expensive manual tasks\n",
"        - Doctor's diagnosis\n",
"    - Make predictions about the future\n",
"        - Will a customer click on an ad or not?\n",
"    - Need labeled data\n",
"        - Historical data with labels\n",
"        - Experiments to get labeled data\n",
"        - Crowd-sourcing labeled 
data\n",
" \n",
"- Naming Conventions\n",
"    - Features = predictor variables = independent variables\n",
"    - Target variable = dependent variable = response variable" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [
"## Exploratory data analysis\n",
"- Iris dataset\n",
"    - Features\n",
"        - Petal length\n",
"        - Petal width\n",
"        - Sepal length\n",
"        - Sepal width\n",
"    - Target variable : Species \n",
"        - Versicolor\n",
"        - Virginica\n",
"        - Setosa" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [
"### Numerical EDA\n",
"In this chapter, you'll be working with a dataset obtained from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records) consisting of votes made by US House of Representatives Congressmen. Your goal will be to predict their party affiliation ('Democrat' or 'Republican') based on how they voted on certain key issues. Here, it's worth noting that we have preprocessed this dataset to deal with missing values. This is so that your focus can be directed towards understanding how to train and evaluate supervised learning models. Once you have mastered these fundamentals, you will be introduced to preprocessing techniques in Chapter 4 and have the chance to apply them there yourself - including on this very same dataset!" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Preprocessing" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>party</th><th>infants</th><th>water</th><th>budget</th><th>physician</th><th>salvador</th><th>religious</th><th>satellite</th><th>aid</th><th>missile</th><th>immigration</th><th>synfuels</th><th>education</th><th>superfund</th><th>crime</th><th>duty_free_exports</th><th>eaa_rsa</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>republican</td><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>1</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>republican</td><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td><td>1</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>democrat</td><td>0</td><td>1</td><td>1</td><td>0</td><td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>3</th><td>democrat</td><td>0</td><td>1</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>4</th><td>democrat</td><td>1</td><td>1</td><td>1</td><td>0</td><td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>1</td><td>1</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " party infants water budget physician salvador religious \\\n", "0 republican 0 1 0 1 1 1 \n", "1 republican 0 1 0 1 1 1 \n", "2 democrat 0 1 1 0 1 1 \n", "3 democrat 0 1 1 0 0 1 \n", "4 democrat 1 1 1 0 1 1 \n", "\n", " satellite aid missile immigration synfuels education superfund \\\n", "0 0 0 0 1 0 1 1 \n", "1 0 0 0 0 0 1 1 \n", "2 0 0 0 0 1 0 1 \n", "3 0 0 0 0 1 0 1 \n", "4 0 0 0 0 1 0 1 \n", "\n", " crime duty_free_exports eaa_rsa \n", "0 1 0 1 \n", "1 1 0 0 \n", "2 1 0 0 \n", "3 0 0 1 \n", "4 1 1 1 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./dataset/house-votes-84.csv', header=None)\n", "df.columns = ['party', 'infants', 'water', 'budget', 'physician', 'salvador',\n", " 'religious', 'satellite', 'aid', 'missile', 'immigration', 'synfuels',\n", " 'education', 'superfund', 'crime', 'duty_free_exports', 'eaa_rsa']\n", "df.replace({'?':'n'}, inplace=True)\n", "df.replace({'n':0, 'y': 1}, inplace=True)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 435 entries, 0 to 434\n", "Data columns (total 17 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 party 435 non-null object\n", " 1 infants 435 non-null int64 \n", " 2 water 435 non-null int64 \n", " 3 budget 435 non-null int64 \n", " 4 physician 435 non-null int64 \n", " 5 salvador 435 non-null int64 \n", " 6 religious 435 non-null int64 \n", " 7 satellite 435 non-null int64 \n", " 8 aid 435 non-null int64 \n", " 9 missile 435 non-null int64 \n", " 10 immigration 435 non-null int64 \n", " 11 synfuels 435 non-null int64 \n", " 12 education 435 non-null int64 \n", " 13 superfund 435 non-null int64 \n", " 14 crime 435 non-null int64 \n", " 15 duty_free_exports 435 non-null int64 \n", " 16 eaa_rsa 435 non-null int64 \n", "dtypes: int64(16), object(1)\n", "memory 
usage: 57.9+ KB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
infantswaterbudgetphysiciansalvadorreligioussatelliteaidmissileimmigrationsynfuelseducationsuperfundcrimeduty_free_exportseaa_rsa
count435.000000435.000000435.000000435.000000435.000000435.000000435.000000435.000000435.000000435.000000435.000000435.000000435.000000435.000000435.000000435.000000
mean0.4298850.4482760.5816090.4068970.4873560.6252870.5494250.5563220.4758620.4965520.3448280.3931030.4804600.5701150.4000000.618391
std0.4956300.4978900.4938630.4918210.5004160.4846060.4981240.4973900.4999920.5005640.4758590.4890020.5001930.4956300.4904620.486341
min0.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000
25%0.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000
50%0.0000000.0000001.0000000.0000000.0000001.0000001.0000001.0000000.0000000.0000000.0000000.0000000.0000001.0000000.0000001.000000
75%1.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.000000
max1.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.000000
\n", "
" ], "text/plain": [ " infants water budget physician salvador religious \\\n", "count 435.000000 435.000000 435.000000 435.000000 435.000000 435.000000 \n", "mean 0.429885 0.448276 0.581609 0.406897 0.487356 0.625287 \n", "std 0.495630 0.497890 0.493863 0.491821 0.500416 0.484606 \n", "min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "50% 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 \n", "75% 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 \n", "max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 \n", "\n", " satellite aid missile immigration synfuels \\\n", "count 435.000000 435.000000 435.000000 435.000000 435.000000 \n", "mean 0.549425 0.556322 0.475862 0.496552 0.344828 \n", "std 0.498124 0.497390 0.499992 0.500564 0.475859 \n", "min 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "25% 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "50% 1.000000 1.000000 0.000000 0.000000 0.000000 \n", "75% 1.000000 1.000000 1.000000 1.000000 1.000000 \n", "max 1.000000 1.000000 1.000000 1.000000 1.000000 \n", "\n", " education superfund crime duty_free_exports eaa_rsa \n", "count 435.000000 435.000000 435.000000 435.000000 435.000000 \n", "mean 0.393103 0.480460 0.570115 0.400000 0.618391 \n", "std 0.489002 0.500193 0.495630 0.490462 0.486341 \n", "min 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "25% 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "50% 0.000000 0.000000 1.000000 0.000000 1.000000 \n", "75% 1.000000 1.000000 1.000000 1.000000 1.000000 \n", "max 1.000000 1.000000 1.000000 1.000000 1.000000 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visual EDA\n", "The Numerical EDA you did in the previous exercise gave you some very important information, such as the names and data types of the columns, and the dimensions of the 
DataFrame. Following this with some visual EDA will give you an even better understanding of the data. In the video, Hugo used the scatter_matrix() function on the Iris data for this purpose. However, you may have noticed in the previous exercise that all the features in this dataset are binary; that is, they are either 0 or 1. So a different type of plot would be more useful here, such as Seaborn's [countplot](http://seaborn.pydata.org/generated/seaborn.countplot.html)." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "([,\n", " ],\n", " )" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(5, 5))\n", "sns.countplot(x='education', hue='party', data=df, palette='RdBu')\n", "plt.xticks([0, 1], ['No', 'Yes'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In ```sns.countplot()```, we specify the x-axis data to be ```'education'```, and hue to be ```'party'```. Recall that ```'party'``` is also our target variable. So the resulting plot shows the difference in voting behavior between the two parties for the ```'education'``` bill, with each party colored differently. We manually specified the color to be ```'RdBu'```, as the Republican party has been traditionally associated with red, and the Democratic party with blue.\n", "\n", "It seems like Democrats voted resoundingly against this bill, compared to Republicans. This is the kind of information that our machine learning model will seek to learn when we try to predict party affiliation solely based on voting behavior. An expert in U.S politics may be able to predict this without machine learning, but probably not instantaneously - and certainly not if we are dealing with hundreds of samples!\n", "\n", "Explore the voting behavior further by generating countplots for the ```'satellite'``` and ```'missile'``` bills, and answer the following question: Of these two bills, for which ones do Democrats vote resoundingly in favor of, compared to Republicans? Be sure to begin your plotting statements for each figure with ```plt.figure()``` so that a new figure will be set up. Otherwise, your plots will be overlayed onto the same figure." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "([,\n", " ],\n", " )" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(5, 5))\n", "sns.countplot(x='satellite', hue='party', data=df, palette='RdBu')\n", "plt.xticks([0, 1], ['No', 'Yes'])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "([,\n", " ],\n", " )" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(5, 5))\n", "sns.countplot(x='missile', hue='party', data=df, palette='RdBu')\n", "plt.xticks([0, 1], ['No', 'Yes'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The classification challenge\n", "- k-Nearest Neighbors\n", " - Basic idea : Predict the label of a data point by looking at the 'k' closest labeled data points\n", " - Looking at the 'k' closest labeled data points\n", " - Taking a majority vote\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### k-Nearest Neighbors: Fit\n", "Having explored the Congressional voting records dataset, it is time now to build your first classifier.\n", "\n", "In the video, Hugo discussed the importance of ensuring your data adheres to the format required by the scikit-learn API. The features need to be in an array where each column is a feature and each row a different observation or data point - in this case, a Congressman's voting record. The target needs to be a single column with the same number of observations as the feature data. We have done this for you in this exercise. Notice we named the feature array X and response variable y: This is in accordance with the common scikit-learn practice.\n", "\n", "Your job is to create an instance of a k-NN classifier with 6 neighbors (by specifying the n_neighbors parameter) and then fit it to the data. 
" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['party', 'infants', 'water', 'budget', 'physician', 'salvador',\n", " 'religious', 'satellite', 'aid', 'missile', 'immigration', 'synfuels',\n", " 'education', 'superfund', 'crime', 'duty_free_exports', 'eaa_rsa'],\n", " dtype='object')" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n", " metric_params=None, n_jobs=None, n_neighbors=6, p=2,\n", " weights='uniform')" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import KNeighborsClassifier from sklearn.neighbors\n", "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "# Create arrays for the features and the response variable\n", "y = df['party'].values\n", "X = df.drop('party', axis=1).values\n", "\n", "# Create a k-NN classifier with 6 neighbors\n", "knn = KNeighborsClassifier(n_neighbors=6)\n", "\n", "# Fit the classifier to the data\n", "knn.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### k-Nearest Neighbors: Predict\n", "Having fit a k-NN classifier, you can now use it to predict the label of a new data point. However, there is no unlabeled data available since all of it was used to fit the model! You can still use the ```.predict()``` method on the ```X``` that was used to fit the model, but it is not a good indicator of the model's ability to generalize to new, unseen data.\n", "\n", "In the next video, Hugo will discuss a solution to this problem. For now, a random unlabeled data point has been generated and is available to you as ```X_new```. 
You will use your classifier to predict the label for this new data point, as well as for the training data ```X``` that the model has already seen. Using ```.predict()``` on ```X_new``` will generate 1 prediction, while using it on ```X``` will generate 435 predictions: 1 for each sample." ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [
"X_new = pd.DataFrame([0.696469, 0.286139, 0.226851, 0.551315, 0.719469, 0.423106, 0.980764, \n",
"                      0.68483, 0.480932, 0.392118, 0.343178, 0.72905, 0.438572, 0.059678,\n",
"                      0.398044, 0.737995]).transpose()" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Prediction: ['democrat']\n" ] } ], "source": [
"# Predict the labels for the training data X\n",
"y_pred = knn.predict(X)\n",
"\n",
"# Predict and print the label for the new data point X_new\n",
"new_prediction = knn.predict(X_new)\n",
"print(\"Prediction: {}\".format(new_prediction))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [
"## Measuring model performance\n",
"- In classification, accuracy is a commonly used metric\n",
"- Accuracy = Fraction of correct predictions\n",
"- Which data should be used to compute accuracy?\n",
"- How well will the model perform on new data?\n",
"\n",
"- Could compute accuracy on data used to fit classifier, but NOT indicative of ability to generalize\n",
"- Split data into training and test set\n",
"- Fit/train the classifier on the training set\n",
"- Make predictions on test set\n",
"- Compare predictions with the known labels\n",
"- Model Complexity\n",
"    - Larger k = smoother decision boundary = less complex model\n",
"    - Smaller k = more complex model = can lead to overfitting" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [
"### The digits recognition dataset\n",
"Up until now, you have been performing binary classification, since the target variable had two possible outcomes. 
Hugo, however, got to perform multi-class classification in the videos, where the target variable could take on three possible outcomes. Why does he get to have all the fun?! In the following exercises, you'll be working with a digits recognition dataset in the style of [MNIST](http://yann.lecun.com/exdb/mnist/), which has 10 classes, the digits 0 through 9! scikit-learn includes a small handwritten digits dataset of this kind (a copy of the test set of the UCI Optical Recognition of Handwritten Digits data, often loosely described as a reduced version of MNIST), and that is the one we will use in this exercise.\n",
"\n",
"Each sample in this scikit-learn dataset is an 8x8 image representing a handwritten digit. Each pixel is represented by an integer in the range 0 to 16, indicating varying levels of black. Recall that scikit-learn's built-in datasets are of type ```Bunch```, which are dictionary-like objects. Helpfully for the digits dataset, scikit-learn provides an ```'images'``` key in addition to the ```'data'``` and ```'target'``` keys that you have seen with the Iris data. Because it stores each sample as a 2D (8x8) array, this ```'images'``` key is useful for visualizing the images, as you'll see in this exercise. On the other hand, the ```'data'``` key contains the feature array - that is, each image flattened into an array of 64 pixels.\n",
"\n",
"Notice that you can access the keys of these Bunch objects in two different ways: By using the . notation, as in ```digits.images```, or the ```[]``` notation, as in ```digits['images']```." ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])\n", ".. 
_digits_dataset:\n", "\n", "Optical recognition of handwritten digits dataset\n", "--------------------------------------------------\n", "\n", "**Data Set Characteristics:**\n", "\n", " :Number of Instances: 5620\n", " :Number of Attributes: 64\n", " :Attribute Information: 8x8 image of integer pixels in the range 0..16.\n", " :Missing Attribute Values: None\n", " :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)\n", " :Date: July; 1998\n", "\n", "This is a copy of the test set of the UCI ML hand-written digits datasets\n", "https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits\n", "\n", "The data set contains images of hand-written digits: 10 classes where\n", "each class refers to a digit.\n", "\n", "Preprocessing programs made available by NIST were used to extract\n", "normalized bitmaps of handwritten digits from a preprinted form. From a\n", "total of 43 people, 30 contributed to the training set and different 13\n", "to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of\n", "4x4 and the number of on pixels are counted in each block. This generates\n", "an input matrix of 8x8 where each element is an integer in the range\n", "0..16. This reduces dimensionality and gives invariance to small\n", "distortions.\n", "\n", "For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.\n", "T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.\n", "L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,\n", "1994.\n", "\n", ".. topic:: References\n", "\n", " - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their\n", " Applications to Handwritten Digit Recognition, MSc Thesis, Institute of\n", " Graduate Studies in Science and Engineering, Bogazici University.\n", " - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.\n", " - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. 
Kai Qin.\n", " Linear dimensionalityreduction using relevance weighted LDA. School of\n", " Electrical and Electronic Engineering Nanyang Technological University.\n", " 2005.\n", " - Claudio Gentile. A New Approximate Maximal Margin Classification\n", " Algorithm. NIPS. 2000.\n", "(1797, 8, 8)\n", "(1797, 64)\n" ] }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAPUAAAD4CAYAAAA0L6C7AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAKm0lEQVR4nO3d34tc9RnH8c+nq9L6cyUbimRDR0ECUmgiS0ACYmNbkirai14koLBS8KaKoQXRXqX/gKQXRZCoEUyVNppFxGoFDa3QWpO4tsZNShq2ZKs2CSVEDTREn17sBKLduGfOnF/79P2C4O7OsN9nSN6embOz5+uIEIA8vtL2AACqRdRAMkQNJEPUQDJEDSRzUR3fdGxsLHq9Xh3fulVHjx5tdL3Tp083ul5GY2Njja63bNmyRtaZnZ3ViRMnvNBttUTd6/W0d+/eOr51q7Zs2dLoetPT042ul9Hk5GTK9SYmJi54G0+/gWSIGkiGqIFkiBpIhqiBZIgaSIaogWSIGkiGqIFkCkVte4PtQ7YP236o7qEAlLdo1LZHJP1S0kZJN0jabPuGugcDUE6RI/VaSYcj4khEnJH0rKQ76x0LQFlFol4h6fxfT5rrf+1zbN9re6/tvcePH69qPgADKhL1Qr/e9T9XK4yIxyJiIiImli9fPvxkAEopEvWcpJXnfT4u6f16xgEwrCJRvyXpetvX2r5E0iZJL9Q7FoCyFr1IQkSctX2fpFckjUh6IiIO1D4ZgFIKXfkkIl6S9FLNswCoAO8oA5IhaiAZogaSIWogGaIGkiFqIBmiBpKpZYeOJp08ebKxtaamphpbS5K2bt3a2FoZt0mS8j6uL8ORGkiGqIFkiBpIhqiBZIgaSIaogWSIGkiGqIFkiBpIhqiBZIrs0PGE7WO2321iIADDKXKk3iFpQ81zAKjIolFHxO8l/buBWQBUoLLX1Gy7A3RDZVGz7Q7QDZz9BpIhaiCZIj/SekbSHyWtsj1n+0f1jwWgrCJ7aW1uYhAA1eDpN5AMUQPJEDWQDFEDyRA1kAxRA8kQNZDMkt92Z3p6urG1mtziR5J27NjR2FqrV69ubK0mtxMaHR1tbK2u4EgNJEPUQDJEDSRD1EAyRA0kQ9RAMkQNJEPUQDJEDSRD1EAyRa5RttL267ZnbB+w/UATgwEop8h7v89K+mlE7Ld9haR9tl+NiPdqng1ACUW23fkgIvb3P/5I0oykFXUPBqCcgV5T2+5JWiPpzQVuY9sdoAMKR237cknPSdoSEae+eDvb7gDdUChq2xdrPuidEfF8vSMBGEaRs9+W9LikmYh4pP6RAAyjyJF6naS7Ja23Pd3/8/2a5wJQUpFtd96Q5AZmAVAB3lEGJEPUQDJEDSRD1EAyRA0kQ9RAMkQNJEPUQDJLfi+tzG655ZbG1mpyT7LJycnG1pqammpsra7gSA0kQ9RAMkQNJEPUQDJEDSRD1EAyRA0kQ9RAMkQNJFPkwoNftf1n2+/0t935eRODASinyNtE/yNpfUR83L9U8Bu2fxsRf6p5NgAlFLnwYEj6uP/pxf0/UedQAMorejH/EdvTko5JejUi2HYH6KhCUUfEpxGxWtK4pLW2v7nAfdh2B+iAgc5+R8RJSXskbahlGgBDK3
L2e7nt0f7HX5P0HUkH6x4MQDlFzn5fI+kp2yOa/5/AryPixXrHAlBWkbPff9H8ntQAlgDeUQYkQ9RAMkQNJEPUQDJEDSRD1EAyRA0kQ9RAMkt+250mt6aZnZ1tbC1JGh0dbXS9pvR6vcbW2rNnT2NrSc3+e7wQjtRAMkQNJEPUQDJEDSRD1EAyRA0kQ9RAMkQNJEPUQDJEDSRTOOr+Bf3fts1FB4EOG+RI/YCkmboGAVCNotvujEu6TdL2escBMKyiR+ptkh6U9NmF7sBeWkA3FNmh43ZJxyJi35fdj720gG4ocqReJ+kO27OSnpW03vbTtU4FoLRFo46IhyNiPCJ6kjZJei0i7qp9MgCl8HNqIJmBLmcUEXs0v5UtgI7iSA0kQ9RAMkQNJEPUQDJEDSRD1EAyRA0ks+S33WlS1m1wmtbk1jRsuwNgySNqIBmiBpIhaiAZogaSIWogGaIGkiFqIBmiBpIhaiCZQm8T7V9J9CNJn0o6GxETdQ4FoLxB3vv97Yg4UdskACrB028gmaJRh6Tf2d5n+96F7sC2O0A3FI16XUTcKGmjpB/bvvmLd2DbHaAbCkUdEe/3/3tM0m5Ja+scCkB5RTbIu8z2Fec+lvQ9Se/WPRiAcoqc/f66pN22z93/VxHxcq1TASht0agj4oikbzUwC4AK8CMtIBmiBpIhaiAZogaSIWogGaIGkiFqIBm23RnA5ORko+tt27atsbWa3FKo1+s1ttb/I47UQDJEDSRD1EAyRA0kQ9RAMkQNJEPUQDJEDSRD1EAyRA0kUyhq26O2d9k+aHvG9k11DwagnKLv/f6FpJcj4oe2L5F0aY0zARjColHbvlLSzZImJSkizkg6U+9YAMoq8vT7OknHJT1p+23b2/vX//4ctt0BuqFI1BdJulHSoxGxRtInkh764p3YdgfohiJRz0mai4g3+5/v0nzkADpo0agj4kNJR22v6n/pVknv1ToVgNKKnv2+X9LO/pnvI5LuqW8kAMMoFHVETEuaqHkWABXgHWVAMkQNJEPUQDJEDSRD1EAyRA0kQ9RAMkQNJMNeWgNocr8pSbr66qsbXa8pV111VWNrTU1NNbZWV3CkBpIhaiAZogaSIWogGaIGkiFqIBmiBpIhaiAZogaSWTRq26tsT5/355TtLU0MB2Bwi75NNCIOSVotSbZHJP1T0u6a5wJQ0qBPv2+V9PeI+EcdwwAY3qBRb5L0zEI3sO0O0A2Fo+5f8/sOSb9Z6Ha23QG6YZAj9UZJ+yPiX3UNA2B4g0S9WRd46g2gOwpFbftSSd+V9Hy94wAYVtFtd05LWlbzLAAqwDvKgGSIGkiGqIFkiBpIhqiBZIgaSIaogWSIGkjGEVH9N7WPSxr01zPHJJ2ofJhuyPrYeFzt+UZELPibU7VEXYbtvREx0fYcdcj62Hhc3cTTbyAZogaS6VLUj7U9QI2yPjYeVwd15jU1gGp06UgNoAJEDSTTiahtb7B9yPZh2w+1PU8VbK+0/brtGdsHbD/Q9kxVsj1i+23bL7Y9S5Vsj9reZftg/+/uprZnGlTrr6n7GwT8TfOXS5qT9JakzRHxXquDDcn2NZKuiYj9tq+QtE/SD5b64zrH9k8kTUi6MiJub3ueqth+StIfImJ7/wq6l0bEybbnGkQXjtRrJR2OiCMRcUbSs5LubHmmoUXEBxGxv//xR5JmJK1od6pq2B6XdJuk7W3PUiXbV0q6WdLjkhQRZ5Za0FI3ol4h6eh5n88pyT/+c2z3JK2R9Ga7k1Rmm6QHJX3W9iAVu07ScUlP9l9abLd9WdtDDaoLUXuBr6X5OZvtyyU9J2lLRJxqe55h2b5d0rGI2Nf2LDW4SNKNkh6NiDWSPpG05M7xdCHqOUkrz/t8XNL7Lc1SKdsXaz7onRGR5fLK6yTdYXtW8y+V1tt+ut2RKjMnaS4izj2j2qX5yJeULkT9lqTrbV/bPzGxSdILLc80NNvW/GuzmYh4pO15qhIRD0fEeET0NP
939VpE3NXyWJWIiA8lHbW9qv+lWyUtuRObha77XaeIOGv7PkmvSBqR9EREHGh5rCqsk3S3pL/anu5/7WcR8VKLM2Fx90va2T/AHJF0T8vzDKz1H2kBqFYXnn4DqBBRA8kQNZAMUQPJEDWQDFEDyRA1kMx/AUgsoPqgKYP7AAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Import necessary modules\n", "from sklearn import datasets\n", "\n", "# Load the digits dataset: digits\n", "digits = datasets.load_digits()\n", "\n", "# Print the keys and DESCR of the dataset\n", "print(digits.keys())\n", "print(digits['DESCR'])\n", "\n", "# Print the shape of the images and data keys\n", "print(digits.images.shape)\n", "print(digits.data.shape)\n", "\n", "# Display digit 1010\n", "plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation='nearest')\n", "plt.savefig('../images/digits.png')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train/Test Split + Fit/Predict/Accuracy\n", "Now that you have learned about the importance of splitting your data into training and test sets, it's time to practice doing this on the digits dataset! After creating arrays for the features and target variable, you will split them into training and test sets, fit a k-NN classifier to the training data, and then compute its accuracy using the ```.score()``` method." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9833333333333333\n" ] } ], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.model_selection import train_test_split\n", "\n", "# Create feature and target arrays\n", "X = digits.data\n", "y = digits.target\n", "\n", "# Split into training and test set\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, \n", " random_state=42, stratify=y)\n", "\n", "# Create a k-NN classifier with 7 neighbors: knn\n", "knn = KNeighborsClassifier(n_neighbors=7)\n", "\n", "# Fit the classifier to the training data\n", "knn.fit(X_train, y_train)\n", "\n", "# Print the accuracy\n", "print(knn.score(X_test, y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Overfitting and underfitting\n", "Remember the model complexity curve that Hugo showed in the video? You will now construct such a curve for the digits dataset! In this exercise, you will compute and plot the training and testing accuracy scores for a variety of different neighbor values. By observing how the accuracy scores differ for the training and testing sets with different values of k, you will develop your intuition for overfitting and underfitting." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0, 0.5, 'Accuracy')" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Setup arrays to store train and test accuracies\n", "neighbors = np.arange(1, 9)\n", "train_accuracy = np.empty(len(neighbors))\n", "test_accuracy = np.empty(len(neighbors))\n", "\n", "# Loop over different values of k\n", "for i, k in enumerate(neighbors):\n", " # Setup a k-NN Classifier with k neighbors: knn\n", " knn = KNeighborsClassifier(n_neighbors=k)\n", " \n", " # Fit the classifier to the training data\n", " knn.fit(X_train, y_train)\n", " \n", " # Compute accuracy on the training set\n", " train_accuracy[i] = knn.score(X_train, y_train)\n", " \n", " # Compute accuracy on the testing set\n", " test_accuracy[i] = knn.score(X_test, y_test)\n", " \n", "# Generate plot\n", "plt.title('k-NN: Varying Number of Neighbors')\n", "plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')\n", "plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')\n", "plt.legend()\n", "plt.xlabel('Number of Neighbors')\n", "plt.ylabel('Accuracy')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }