{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Logistic Regression with Synthetic Data\n", "\n", "For more explanation of logistic regression, see\n", "1. [Our course notes](https://jennselby.github.io/MachineLearningCourseNotes/#binomial-logistic-regression)\n", "1. [This scikit-learn explanation](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)\n", "1. [The full scikit-learn documentation of the LogisticRegression model class](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)\n", "\n", "## Instructions\n", "0. If you haven't already, follow [the setup instructions here](https://jennselby.github.io/MachineLearningCourseNotes/#setting-up-python3) to get all necessary software installed.\n", "0. Read through the code in the following sections:\n", " * [Data Generation](#Data-Generation)\n", " * [Visualization](#Visualization)\n", " * [Model Training](#Model-Training)\n", " * [Prediction](#Prediction)\n", "0. 
Complete at least one of the exercise options:\n", " * [Exercise Option #1 - Standard Difficulty](#Exercise-Option-#1---Standard-Difficulty)\n", " * [Exercise Option #2 - Advanced Difficulty](#Exercise-Option-#2---Advanced-Difficulty)\n", " * [Exercise Option #3 - Advanced Difficulty](#Exercise-Option-#3---Advanced-Difficulty)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy.random # for generating our dataset\n", "from sklearn import linear_model # for fitting our model\n", "\n", "# force numpy not to use scientific notation, to make it easier to read the numbers the program prints out\n", "numpy.set_printoptions(suppress=True)\n", "\n", "# to display graphs in this notebook\n", "%matplotlib inline\n", "import matplotlib.pyplot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Generation\n", "\n", "As we did in the [linear regression notebook](https://nbviewer.jupyter.org/github/jennselby/MachineLearningCourseNotes/blob/master/assets/ipynb/LinearRegression.ipynb), we will be generating some fake data.\n", "\n", "In this fake dataset, we have two types of plants.\n", "* Plant A tends to be taller (average 60cm) and thinner (average 8cm).\n", "* Plant B tends to be shorter (average 58cm) and wider (average 10cm).\n", "* The heights and widths of both plants are normally distributed (they follow a bell curve), with a standard deviation of 1cm.\n", "\n", "* Class 0 will represent Plant A and Class 1 will represent Plant B." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "NUM_INPUTS = 50 # inputs per class\n", "PLANT_A_AVG_HEIGHT = 60.0\n", "PLANT_A_AVG_WIDTH = 8.0\n", "PLANT_B_AVG_HEIGHT = 58.0\n", "PLANT_B_AVG_WIDTH = 10.0\n", "\n", "# Pick numbers randomly from a normal distribution centered on each average\n", "# (scale is not set, so the standard deviation is numpy's default of 1.0)\n", "\n", "plant_a_heights = numpy.random.normal(loc=PLANT_A_AVG_HEIGHT, size=NUM_INPUTS)\n", "plant_a_widths = numpy.random.normal(loc=PLANT_A_AVG_WIDTH, 
size=NUM_INPUTS)\n", "\n", "plant_b_heights = numpy.random.normal(loc=PLANT_B_AVG_HEIGHT, size=NUM_INPUTS)\n", "plant_b_widths = numpy.random.normal(loc=PLANT_B_AVG_WIDTH, size=NUM_INPUTS)\n", "\n", "# this creates a 2-dimensional matrix, with heights in the first column and widths in the second\n", "# the first half of rows are all plants of type a and the second half are type b\n", "plant_inputs = list(zip(numpy.append(plant_a_heights, plant_b_heights),\n", " numpy.append(plant_a_widths, plant_b_widths)))\n", "\n", "# this is a list where the first half are 0s (representing plants of type a) and the second half are 1s (type b)\n", "classes = [0]*NUM_INPUTS + [1]*NUM_INPUTS" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualization\n", "\n", "Let's visualize our dataset, so that we can better understand what it looks like." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# create a figure and label it\n", "fig = matplotlib.pyplot.figure()\n", "fig.suptitle('Plant Data Set')\n", "matplotlib.pyplot.xlabel('Height')\n", "matplotlib.pyplot.ylabel('Width')\n", "\n", "# put the generated points on the graph\n", "a_scatter = matplotlib.pyplot.scatter(plant_a_heights, plant_a_widths, c=\"red\", marker=\"o\", label='plant a')\n", "b_scatter = matplotlib.pyplot.scatter(plant_b_heights, plant_b_widths, c=\"blue\", marker=\"^\", label='plant b')\n", "\n", "# add a legend to explain which points are which\n", "matplotlib.pyplot.legend(handles=[a_scatter, b_scatter])\n", "\n", "# show the graph\n", "matplotlib.pyplot.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Training\n", "\n", "Next, we want to fit our logistic regression model to our dataset." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Intercept: [0.4923611] Coefficients: [[-0.30765052 1.95764602]]\n" ] } ], "source": [ "model = linear_model.LogisticRegression()\n", "model.fit(plant_inputs, classes)\n", "\n", "print('Intercept: {0} Coefficients: {1}'.format(model.intercept_, model.coef_))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prediction\n", "\n", "Now we can make some predictions using the trained model. Note that we are generating the new data exactly the same way that we generated the training data above." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Plant A: 59.93207269053181 7.030073699859045\n", "Plant B: 59.619487931898846 10.240893276828622\n", "Class predictions: [0 1]\n", "Probabilities:\n", "[[0.98498204 0.01501796]\n", " [0.09989079 0.90010921]]\n" ] } ], "source": [ "# Generate some new random values for two plants, one of each class\n", "new_a_height = numpy.random.normal(loc=PLANT_A_AVG_HEIGHT)\n", "new_a_width = numpy.random.normal(loc=PLANT_A_AVG_WIDTH)\n", "new_b_height = numpy.random.normal(loc=PLANT_B_AVG_HEIGHT)\n", "new_b_width = numpy.random.normal(loc=PLANT_B_AVG_WIDTH)\n", "\n", "# Pull the values into a matrix, because that is what the predict function wants\n", "inputs = [[new_a_height, new_a_width], [new_b_height, new_b_width]]\n", "\n", "# Print out the outputs for these new inputs\n", "print('Plant A: {0} {1}'.format(new_a_height, new_a_width))\n", "print('Plant B: {0} {1}'.format(new_b_height, new_b_width))\n", "print('Class predictions: {0}'.format(model.predict(inputs))) # guess which class\n", "print('Probabilities:\\n{0}'.format(model.predict_proba(inputs))) # give probability of each class" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise Option #1 - Standard Difficulty\n", "\n", "Answer the following questions. You can also use the graph below, if seeing the data visually helps you understand the data.\n", "1. What should we be expecting as the output for class predictions in the above cell? If the model is not giving the expected output, what are some of the reasons it might not be?\n", "1. How do the probabilities output by the above cell relate to the class predictions? Why do you think the model might be more or less confident in its predictions?\n", "1. If you change the averages in the data generation code (like PLANT_A_AVG_HEIGHT) and re-run the code, how do the predictions change, and why?\n", "1. 
Looking at the intercept and coefficient output further above, if a coefficient is negative, what has the model learned about this feature? In other words, if you took a datapoint and you increased the value of a feature that has a negative coefficient, what would you expect to happen to the probabilities the model gives this datapoint?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise Option #2 - Advanced Difficulty\n", "\n", "The plot above is only showing the data, and not anything about what the model learned. Come up with some ideas for how to show the model fit and implement one of them in code. Remember, we are here to help if you are not sure how to write the code for your ideas!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise Option #3 - Advanced Difficulty\n", "\n", "If you have more than two classes, you can use multinomial logistic regression or the one vs. rest technique, where you train one binomial logistic regression per class and each model decides whether a datapoint is or is not in its class. Try expanding the program with a third type of plant and implementing your own one vs. rest models. To test whether this is working, compare your output to running your expanded dataset through scikit-learn. Whether LogisticRegression does one vs. rest or multinomial regression when there are more than two classes depends on the scikit-learn version and solver, so wrap the model in sklearn.multiclass.OneVsRestClassifier to be sure you are comparing against one vs. rest." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 2 }