{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Logistic Regression with the Iris Dataset\n", "\n", "For more explanation of logistic regression, see\n", "1. [Our course notes](https://jennselby.github.io/MachineLearningCourseNotes/#binomial-logistic-regression)\n", "1. [This scikit-learn explanation](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)\n", "1. [The full scikit-learn documentation of the LogisticRegression model class](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)\n", "\n", "## Instructions\n", "0. If you haven't already, follow [the setup instructions here](https://jennselby.github.io/MachineLearningCourseNotes/#setting-up-python3) to get all necessary software installed.\n", "0. Read through the code in the following sections\n", " * [Iris Dataset](#Iris-Dataset)\n", " * [Visualization](#Visualization)\n", " * [Model Training](#Model-Training)\n", " * [Prediction](#Prediction)\n", "0. Complete at least one of the following exercises\n", " * [Exercise Option #1 - Standard Difficulty](#Exercise-Option-#1---Standard-Difficulty)\n", " * [Exercise Option #2 - Advanced Difficulty](#Exercise-Option-#2---Advanced-Difficulty)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from sklearn import linear_model # for fitting our model\n", "from sklearn.datasets import load_iris # the iris dataset is included in scikit-learn\n", "\n", "# force numpy not to use scientific notation, to make it easier to read the numbers the program prints out\n", "import numpy\n", "numpy.set_printoptions(suppress=True)\n", "\n", "# to display graphs in this notebook\n", "%matplotlib inline\n", "import matplotlib.pyplot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Iris Dataset\n", "\n", "Before you go on, make sure you understand this dataset. 
Modify the cell below to examine different parts of the dataset that are contained in the 'iris' dictionary object.\n", "\n", "What are the features? What are we trying to classify?" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris = load_iris()\n", "iris.keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also try looking at it using a [pandas dataframe](https://jennselby.github.io/MachineLearningCourseNotes/#pandas)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>sepal length (cm)</th><th>sepal width (cm)</th><th>petal length (cm)</th><th>petal width (cm)</th><th>target</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>5.1</td><td>3.5</td><td>1.4</td><td>0.2</td><td>setosa</td></tr>\n",
"    <tr><th>1</th><td>4.9</td><td>3.0</td><td>1.4</td><td>0.2</td><td>setosa</td></tr>\n",
"    <tr><th>2</th><td>4.7</td><td>3.2</td><td>1.3</td><td>0.2</td><td>setosa</td></tr>\n",
"    <tr><th>3</th><td>4.6</td><td>3.1</td><td>1.5</td><td>0.2</td><td>setosa</td></tr>\n",
"    <tr><th>4</th><td>5.0</td><td>3.6</td><td>1.4</td><td>0.2</td><td>setosa</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n", "0 5.1 3.5 1.4 0.2 \n", "1 4.9 3.0 1.4 0.2 \n", "2 4.7 3.2 1.3 0.2 \n", "3 4.6 3.1 1.5 0.2 \n", "4 5.0 3.6 1.4 0.2 \n", "\n", " target \n", "0 setosa \n", "1 setosa \n", "2 setosa \n", "3 setosa \n", "4 setosa " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas\n", "iris_df = pandas.DataFrame(iris.data)\n", "iris_df.columns = iris.feature_names\n", "iris_df['target'] = [iris.target_names[target] for target in iris.target]\n", "iris_df.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>sepal length (cm)</th><th>sepal width (cm)</th><th>petal length (cm)</th><th>petal width (cm)</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>count</th><td>150.000000</td><td>150.000000</td><td>150.000000</td><td>150.000000</td></tr>\n",
"    <tr><th>mean</th><td>5.843333</td><td>3.054000</td><td>3.758667</td><td>1.198667</td></tr>\n",
"    <tr><th>std</th><td>0.828066</td><td>0.433594</td><td>1.764420</td><td>0.763161</td></tr>\n",
"    <tr><th>min</th><td>4.300000</td><td>2.000000</td><td>1.000000</td><td>0.100000</td></tr>\n",
"    <tr><th>25%</th><td>5.100000</td><td>2.800000</td><td>1.600000</td><td>0.300000</td></tr>\n",
"    <tr><th>50%</th><td>5.800000</td><td>3.000000</td><td>4.350000</td><td>1.300000</td></tr>\n",
"    <tr><th>75%</th><td>6.400000</td><td>3.300000</td><td>5.100000</td><td>1.800000</td></tr>\n",
"    <tr><th>max</th><td>7.900000</td><td>4.400000</td><td>6.900000</td><td>2.500000</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) \\\n", "count 150.000000 150.000000 150.000000 \n", "mean 5.843333 3.054000 3.758667 \n", "std 0.828066 0.433594 1.764420 \n", "min 4.300000 2.000000 1.000000 \n", "25% 5.100000 2.800000 1.600000 \n", "50% 5.800000 3.000000 4.350000 \n", "75% 6.400000 3.300000 5.100000 \n", "max 7.900000 4.400000 6.900000 \n", "\n", " petal width (cm) \n", "count 150.000000 \n", "mean 1.198667 \n", "std 0.763161 \n", "min 0.100000 \n", "25% 0.300000 \n", "50% 1.300000 \n", "75% 1.800000 \n", "max 2.500000 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this tutorial, at least to start, we'll use only two of the four features, because two features are easier to visualize than four. The code below selects which two features to use.\n", "\n", "We'll also need to know the index in the list at which each class starts." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Use just two columns (the first and fourth in this case).\n", "x1_feature = 0\n", "x2_feature = 3\n", "iris_inputs = iris.data[:,[x1_feature,x2_feature]]\n", "\n", "# The data are in order by class. Find out where the other classes start in the list\n", "start_class_one = list(iris.target).index(1)\n", "start_class_two = list(iris.target).index(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualization\n", "\n", "Let's visualize our dataset, so that we can better understand what it looks like." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# split the two inputs into single dimensional arrays for plotting\n", "x1 = iris_inputs[:,0]\n", "x2 = iris_inputs[:,1]\n", "\n", "# create a figure and label it\n", "fig = matplotlib.pyplot.figure()\n", "fig.suptitle('Iris Data Set')\n", "matplotlib.pyplot.xlabel(iris.feature_names[x1_feature])\n", "matplotlib.pyplot.ylabel(iris.feature_names[x2_feature])\n", "\n", "# put the input data on the graph, with different colors and shapes for each type\n", "scatter_0 = matplotlib.pyplot.scatter(x1[:start_class_one], x2[:start_class_one],\n", " c=\"red\", marker=\"o\", label=iris.target_names[0])\n", "scatter_1 = matplotlib.pyplot.scatter(x1[start_class_one:start_class_two], x2[start_class_one:start_class_two],\n", " c=\"blue\", marker=\"^\", label=iris.target_names[1])\n", "scatter_2 = matplotlib.pyplot.scatter(x1[start_class_two:], x2[start_class_two:],\n", " c=\"yellow\", marker=\"*\", label=iris.target_names[2])\n", "\n", "# add a legend to explain which points are which\n", "matplotlib.pyplot.legend(handles=[scatter_0, scatter_1, scatter_2])\n", "\n", "# show the graph\n", "matplotlib.pyplot.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Training\n", "\n", "Next, we want to fit our logistic regression model to the subset of the data we're using." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Intercept: [ 0.96256986 -0.19641091 -1.7644289 ] Coefficients: [[ 0.44374849 -4.60187424]\n", " [-0.17912292 0.45576962]\n", " [-0.77517855 4.03438217]]\n" ] } ], "source": [ "model = linear_model.LogisticRegression()\n", "model.fit(iris_inputs, iris.target)\n", "\n", "print('Intercept: {0} Coefficients: {1}'.format(model.intercept_, model.coef_))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prediction\n", "\n", "Now we can make some predictions using the trained model. 
We'll pull out some examples from our training data and see what the model says about them." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Class predictions: [0 1 2]\n", "Probabilities:\n", "[[0.76937325 0.22444044 0.0061863 ]\n", " [0.14977904 0.54049543 0.30972553]\n", " [0.00030362 0.3188655 0.68083088]]\n" ] } ], "source": [ "# Use the first input from each class\n", "inputs = [iris_inputs[0], iris_inputs[start_class_one], iris_inputs[start_class_two]]\n", "\n", "print('Class predictions: {0}'.format(model.predict(inputs))) # guess which class\n", "print('Probabilities:\\n{0}'.format(model.predict_proba(inputs))) # give probability of each class" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise Option #1 - Standard Difficulty\n", "\n", "Answer the following questions. You can also refer to the graph above, if seeing the data visually helps you understand it.\n", "1. What output should we expect for the class predictions in the above cell? If the model does not give the expected output, what are some possible reasons?\n", "1. How do the probabilities output by the above cell relate to the class predictions? Why do you think the model might be more or less confident in its predictions?\n", "1. Looking at the intercept and coefficient output further above, if a coefficient is negative, what has the model learned about this feature? In other words, if you took a datapoint and you increased the value of a feature that has a negative coefficient, what would you expect to happen to the probabilities the model gives this datapoint?\n", "1. Do these two features allow you to predict the iris type well? How do you know? Explain using both the text output and the graph in the cells above.\n", "1. Try a different pair of features. Do these allow you to predict the iris type well? How do you know?"
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise Option #2 - Advanced Difficulty\n", "\n", "The plot above shows only the data, not what the model learned. Come up with some ideas for how to show the model fit and implement one of them in code. Remember, we are here to help if you are not sure how to write the code for your ideas!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 2 }