{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Applying Bayes' theorem to iris classification\n", "\n", "Can **Bayes' theorem** help us to solve a **classification problem**, namely predicting the species of an iris?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparing the data\n", "\n", "We'll read the iris data into a DataFrame, and **round up** all of the measurements to the next integer:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
05.13.51.40.2Iris-setosa
14.93.01.40.2Iris-setosa
24.73.21.30.2Iris-setosa
34.63.11.50.2Iris-setosa
45.03.61.40.2Iris-setosa
\n", "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width species\n", "0 5.1 3.5 1.4 0.2 Iris-setosa\n", "1 4.9 3.0 1.4 0.2 Iris-setosa\n", "2 4.7 3.2 1.3 0.2 Iris-setosa\n", "3 4.6 3.1 1.5 0.2 Iris-setosa\n", "4 5.0 3.6 1.4 0.2 Iris-setosa" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read the iris data into a DataFrame\n", "url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'\n", "col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']\n", "iris = pd.read_csv(url, header=None, names=col_names)\n", "iris.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
06421Iris-setosa
15321Iris-setosa
25421Iris-setosa
35421Iris-setosa
45421Iris-setosa
\n", "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width species\n", "0 6 4 2 1 Iris-setosa\n", "1 5 3 2 1 Iris-setosa\n", "2 5 4 2 1 Iris-setosa\n", "3 5 4 2 1 Iris-setosa\n", "4 5 4 2 1 Iris-setosa" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# apply the ceiling function to the numeric columns\n", "iris.loc[:, 'sepal_length':'petal_width'] = iris.loc[:, 'sepal_length':'petal_width'].apply(np.ceil)\n", "iris.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deciding how to make a prediction\n", "\n", "Let's say that I have an **out-of-sample iris** with the following measurements: **7, 3, 5, 2**. How might I predict the species?" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
547352Iris-versicolor
587352Iris-versicolor
637352Iris-versicolor
687352Iris-versicolor
727352Iris-versicolor
737352Iris-versicolor
747352Iris-versicolor
757352Iris-versicolor
767352Iris-versicolor
777352Iris-versicolor
877352Iris-versicolor
917352Iris-versicolor
977352Iris-versicolor
1237352Iris-virginica
1267352Iris-virginica
1277352Iris-virginica
1467352Iris-virginica
\n", "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width species\n", "54 7 3 5 2 Iris-versicolor\n", "58 7 3 5 2 Iris-versicolor\n", "63 7 3 5 2 Iris-versicolor\n", "68 7 3 5 2 Iris-versicolor\n", "72 7 3 5 2 Iris-versicolor\n", "73 7 3 5 2 Iris-versicolor\n", "74 7 3 5 2 Iris-versicolor\n", "75 7 3 5 2 Iris-versicolor\n", "76 7 3 5 2 Iris-versicolor\n", "77 7 3 5 2 Iris-versicolor\n", "87 7 3 5 2 Iris-versicolor\n", "91 7 3 5 2 Iris-versicolor\n", "97 7 3 5 2 Iris-versicolor\n", "123 7 3 5 2 Iris-virginica\n", "126 7 3 5 2 Iris-virginica\n", "127 7 3 5 2 Iris-virginica\n", "146 7 3 5 2 Iris-virginica" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# show all observations with features: 7, 3, 5, 2\n", "iris[(iris.sepal_length==7) & (iris.sepal_width==3) & (iris.petal_length==5) & (iris.petal_width==2)]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Iris-versicolor 13\n", "Iris-virginica 4\n", "dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# count the species for these observations\n", "iris[(iris.sepal_length==7) & (iris.sepal_width==3) & (iris.petal_length==5) & (iris.petal_width==2)].species.value_counts()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Iris-setosa 50\n", "Iris-versicolor 50\n", "Iris-virginica 50\n", "dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# count the species for all observations\n", "iris.species.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's frame this as a **conditional probability problem**: What is the probability of some particular species, given the measurements 7, 3, 5, and 2?\n", "\n", "$$P(species \\ | \\ 7352)$$\n", "\n", "We could calculate the conditional probability for **each of the three species**, and then predict the species with the **highest probability**:\n", "\n", "$$P(setosa \\ | \\ 7352)$$\n", "$$P(versicolor \\ | \\ 7352)$$\n", "$$P(virginica \\ | \\ 7352)$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculating the probability of each species\n", "\n", "**Bayes' theorem** gives us a way to calculate these conditional probabilities.\n", "\n", "Let's start with **versicolor**:\n", "\n", "$$P(versicolor \\ | \\ 7352) = \\frac {P(7352 \\ | \\ versicolor) \\times P(versicolor)} {P(7352)}$$\n", "\n", "We can calculate each of the terms on the right side of the equation:\n", "\n", "$$P(7352 \\ | \\ versicolor) = \\frac {13} {50} = 0.26$$\n", "\n", "$$P(versicolor) = \\frac {50} {150} = 0.33$$\n", "\n", "$$P(7352) = \\frac {17} {150} = 0.11$$\n", "\n", "Therefore, Bayes' theorem says the **probability of versicolor given these measurements** is:\n", "\n", "$$P(versicolor \\ | \\ 7352) = \\frac {0.26 \\times 0.33} {0.11} = 0.76$$\n", "\n", "Let's repeat this process for **virginica** and **setosa**:\n", "\n", "$$P(virginica \\ | \\ 7352) = \\frac {0.08 \\times 0.33} {0.11} = 0.24$$\n", "\n", "$$P(setosa \\ | \\ 7352) = \\frac {0 \\times 0.33} {0.11} = 0$$\n", "\n", "We predict that the iris is a versicolor, since that species had the **highest conditional probability**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "1. We framed a **classification problem** as three conditional probability problems.\n", "2. We used **Bayes' theorem** to calculate those conditional probabilities.\n", "3. We made a **prediction** by choosing the species with the highest conditional probability." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bonus: The intuition behind Bayes' theorem\n", "\n", "Let's make some hypothetical adjustments to the data, to demonstrate how Bayes' theorem makes intuitive sense:\n", "\n", "Pretend that **more of the existing versicolors had measurements of 7352:**\n", "\n", "- $P(7352 \\ | \\ versicolor)$ would increase, thus increasing the numerator.\n", "- It would make sense that given an iris with measurements of 7352, the probability of it being a versicolor would also increase.\n", "\n", "Pretend that **most of the existing irises were versicolor:**\n", "\n", "- $P(versicolor)$ would increase, thus increasing the numerator.\n", "- It would make sense that the probability of any iris being a versicolor (regardless of measurements) would also increase.\n", "\n", "Pretend that **17 of the setosas had measurements of 7352:**\n", "\n", "- $P(7352)$ would double, thus doubling the denominator.\n", "- It would make sense that given an iris with measurements of 7352, the probability of it being a versicolor would be cut in half." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }