{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Applying Bayes' theorem to iris classification\n",
"\n",
"Can **Bayes' theorem** help us to solve a **classification problem**, namely predicting the species of an iris?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparing the data\n",
"\n",
"We'll read the iris data into a DataFrame, and **round up** all of the measurements to the next integer:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" sepal_length | \n",
" sepal_width | \n",
" petal_length | \n",
" petal_width | \n",
" species | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 5.1 | \n",
" 3.5 | \n",
" 1.4 | \n",
" 0.2 | \n",
" Iris-setosa | \n",
"
\n",
" \n",
" 1 | \n",
" 4.9 | \n",
" 3.0 | \n",
" 1.4 | \n",
" 0.2 | \n",
" Iris-setosa | \n",
"
\n",
" \n",
" 2 | \n",
" 4.7 | \n",
" 3.2 | \n",
" 1.3 | \n",
" 0.2 | \n",
" Iris-setosa | \n",
"
\n",
" \n",
" 3 | \n",
" 4.6 | \n",
" 3.1 | \n",
" 1.5 | \n",
" 0.2 | \n",
" Iris-setosa | \n",
"
\n",
" \n",
" 4 | \n",
" 5.0 | \n",
" 3.6 | \n",
" 1.4 | \n",
" 0.2 | \n",
" Iris-setosa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sepal_length sepal_width petal_length petal_width species\n",
"0 5.1 3.5 1.4 0.2 Iris-setosa\n",
"1 4.9 3.0 1.4 0.2 Iris-setosa\n",
"2 4.7 3.2 1.3 0.2 Iris-setosa\n",
"3 4.6 3.1 1.5 0.2 Iris-setosa\n",
"4 5.0 3.6 1.4 0.2 Iris-setosa"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# read the iris data into a DataFrame\n",
"url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'\n",
"col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']\n",
"iris = pd.read_csv(url, header=None, names=col_names)\n",
"iris.head()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sepal_length | \n",
" sepal_width | \n",
" petal_length | \n",
" petal_width | \n",
" species | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 6 | \n",
" 4 | \n",
" 2 | \n",
" 1 | \n",
" Iris-setosa | \n",
"
\n",
" \n",
" 1 | \n",
" 5 | \n",
" 3 | \n",
" 2 | \n",
" 1 | \n",
" Iris-setosa | \n",
"
\n",
" \n",
" 2 | \n",
" 5 | \n",
" 4 | \n",
" 2 | \n",
" 1 | \n",
" Iris-setosa | \n",
"
\n",
" \n",
" 3 | \n",
" 5 | \n",
" 4 | \n",
" 2 | \n",
" 1 | \n",
" Iris-setosa | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" 4 | \n",
" 2 | \n",
" 1 | \n",
" Iris-setosa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sepal_length sepal_width petal_length petal_width species\n",
"0 6 4 2 1 Iris-setosa\n",
"1 5 3 2 1 Iris-setosa\n",
"2 5 4 2 1 Iris-setosa\n",
"3 5 4 2 1 Iris-setosa\n",
"4 5 4 2 1 Iris-setosa"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# apply the ceiling function to the numeric columns\n",
"iris.loc[:, 'sepal_length':'petal_width'] = iris.loc[:, 'sepal_length':'petal_width'].apply(np.ceil)\n",
"iris.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deciding how to make a prediction\n",
"\n",
"Let's say that I have an **out-of-sample iris** with the following measurements: **7, 3, 5, 2**. How might I predict the species?"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sepal_length | \n",
" sepal_width | \n",
" petal_length | \n",
" petal_width | \n",
" species | \n",
"
\n",
" \n",
" \n",
" \n",
" 54 | \n",
" 7 | \n",
" 3 | \n",
" 5 | \n",
" 2 | \n",
" Iris-versicolor | \n",
"
\n",
" \n",
" 58 | \n",
" 7 | \n",
" 3 | \n",
" 5 | \n",
" 2 | \n",
" Iris-versicolor | \n",
"
\n",
" \n",
" 63 | \n",
" 7 | \n",
" 3 | \n",
" 5 | \n",
" 2 | \n",
" Iris-versicolor | \n",
"
\n",
" \n",
" 68 | \n",
" 7 | \n",
" 3 | \n",
" 5 | \n",
" 2 | \n",
" Iris-versicolor | \n",
"
\n",
" \n",
" 72 | \n",
" 7 | \n",
" 3 | \n",
" 5 | \n",
" 2 | \n",
" Iris-versicolor | \n",
"
\n",
" \n",
" 73 | \n",
" 7 | \n",
" 3 | \n",
" 5 | \n",
" 2 | \n",
" Iris-versicolor | \n",
"
\n",
" \n",
" 74 | \n",
" 7 | \n",
" 3 | \n",
" 5 | \n",
" 2 | \n",
" Iris-versicolor | \n",
"
\n",
" \n",
" 75 | \n",
" 7 | \n",
" 3 | \n",
" 5 | \n",
" 2 | \n",
" Iris-versicolor | \n",
"
\n",
" \n",
" 76 | \n",
" 7 | \n",
" 3 | \n",
" 5 | \n",
" 2 | \n",
" Iris-versicolor | \n",
"
\n",
" \n",
" 77 | \n",
" 7 | \n",
" 3 | \n",
" 5 | \n",
" 2 | \n",
" Iris-versicolor | \n",
"
\n",
" \n",
" 87 | \n",
" 7 | \n",
" 3 | \n",
" 5 | \n",
" 2 | \n",
" Iris-versicolor | \n",
"
\n",
" \n",
" 91 | \n",
" 7 | \n",
" 3 | \n",
" 5 | \n",
" 2 | \n",
" Iris-versicolor | \n",
"
\n",
" \n",
" 97 | \n",
" 7 | \n",
" 3 | \n",
" 5 | \n",
" 2 | \n",
" Iris-versicolor | \n",
"
\n",
" \n",
" 123 | \n",
" 7 | \n",
" 3 | \n",
" 5 | \n",
" 2 | \n",
" Iris-virginica | \n",
"
\n",
" \n",
" 126 | \n",
" 7 | \n",
" 3 | \n",
" 5 | \n",
" 2 | \n",
" Iris-virginica | \n",
"
\n",
" \n",
" 127 | \n",
" 7 | \n",
" 3 | \n",
" 5 | \n",
" 2 | \n",
" Iris-virginica | \n",
"
\n",
" \n",
" 146 | \n",
" 7 | \n",
" 3 | \n",
" 5 | \n",
" 2 | \n",
" Iris-virginica | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sepal_length sepal_width petal_length petal_width species\n",
"54 7 3 5 2 Iris-versicolor\n",
"58 7 3 5 2 Iris-versicolor\n",
"63 7 3 5 2 Iris-versicolor\n",
"68 7 3 5 2 Iris-versicolor\n",
"72 7 3 5 2 Iris-versicolor\n",
"73 7 3 5 2 Iris-versicolor\n",
"74 7 3 5 2 Iris-versicolor\n",
"75 7 3 5 2 Iris-versicolor\n",
"76 7 3 5 2 Iris-versicolor\n",
"77 7 3 5 2 Iris-versicolor\n",
"87 7 3 5 2 Iris-versicolor\n",
"91 7 3 5 2 Iris-versicolor\n",
"97 7 3 5 2 Iris-versicolor\n",
"123 7 3 5 2 Iris-virginica\n",
"126 7 3 5 2 Iris-virginica\n",
"127 7 3 5 2 Iris-virginica\n",
"146 7 3 5 2 Iris-virginica"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# show all observations with features: 7, 3, 5, 2\n",
"iris[(iris.sepal_length==7) & (iris.sepal_width==3) & (iris.petal_length==5) & (iris.petal_width==2)]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Iris-versicolor 13\n",
"Iris-virginica 4\n",
"dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# count the species for these observations\n",
"iris[(iris.sepal_length==7) & (iris.sepal_width==3) & (iris.petal_length==5) & (iris.petal_width==2)].species.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Iris-setosa 50\n",
"Iris-versicolor 50\n",
"Iris-virginica 50\n",
"dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# count the species for all observations\n",
"iris.species.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's frame this as a **conditional probability problem**: What is the probability of some particular species, given the measurements 7, 3, 5, and 2?\n",
"\n",
"$$P(species \\ | \\ 7352)$$\n",
"\n",
"We could calculate the conditional probability for **each of the three species**, and then predict the species with the **highest probability**:\n",
"\n",
"$$P(setosa \\ | \\ 7352)$$\n",
"$$P(versicolor \\ | \\ 7352)$$\n",
"$$P(virginica \\ | \\ 7352)$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Calculating the probability of each species\n",
"\n",
"**Bayes' theorem** gives us a way to calculate these conditional probabilities.\n",
"\n",
"Let's start with **versicolor**:\n",
"\n",
"$$P(versicolor \\ | \\ 7352) = \\frac {P(7352 \\ | \\ versicolor) \\times P(versicolor)} {P(7352)}$$\n",
"\n",
"We can calculate each of the terms on the right side of the equation:\n",
"\n",
"$$P(7352 \\ | \\ versicolor) = \\frac {13} {50} = 0.26$$\n",
"\n",
"$$P(versicolor) = \\frac {50} {150} = 0.33$$\n",
"\n",
"$$P(7352) = \\frac {17} {150} = 0.11$$\n",
"\n",
"Therefore, Bayes' theorem says the **probability of versicolor given these measurements** is:\n",
"\n",
"$$P(versicolor \\ | \\ 7352) = \\frac {0.26 \\times 0.33} {0.11} = 0.76$$\n",
"\n",
"Let's repeat this process for **virginica** and **setosa**:\n",
"\n",
"$$P(virginica \\ | \\ 7352) = \\frac {0.08 \\times 0.33} {0.11} = 0.24$$\n",
"\n",
"$$P(setosa \\ | \\ 7352) = \\frac {0 \\times 0.33} {0.11} = 0$$\n",
"\n",
"We predict that the iris is a versicolor, since that species had the **highest conditional probability**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"1. We framed a **classification problem** as three conditional probability problems.\n",
"2. We used **Bayes' theorem** to calculate those conditional probabilities.\n",
"3. We made a **prediction** by choosing the species with the highest conditional probability."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Bonus: The intuition behind Bayes' theorem\n",
"\n",
"Let's make some hypothetical adjustments to the data, to demonstrate how Bayes' theorem makes intuitive sense:\n",
"\n",
"Pretend that **more of the existing versicolors had measurements of 7352:**\n",
"\n",
"- $P(7352 \\ | \\ versicolor)$ would increase, thus increasing the numerator.\n",
"- It would make sense that given an iris with measurements of 7352, the probability of it being a versicolor would also increase.\n",
"\n",
"Pretend that **most of the existing irises were versicolor:**\n",
"\n",
"- $P(versicolor)$ would increase, thus increasing the numerator.\n",
"- It would make sense that the probability of any iris being a versicolor (regardless of measurements) would also increase.\n",
"\n",
"Pretend that **17 of the setosas had measurements of 7352:**\n",
"\n",
"- $P(7352)$ would double, thus doubling the denominator.\n",
"- It would make sense that given an iris with measurements of 7352, the probability of it being a versicolor would be cut in half."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}