{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Applying Bayes' theorem to iris classification\n",
    "\n",
    "Can **Bayes' theorem** help us to solve a **classification problem**, namely predicting the species of an iris?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preparing the data\n",
    "\n",
    "We'll read the iris data into a DataFrame, and **round up** all of the measurements to the next integer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>Iris-setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>Iris-setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "      <td>Iris-setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "      <td>Iris-setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>Iris-setosa</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal_length  sepal_width  petal_length  petal_width      species\n",
       "0           5.1          3.5           1.4          0.2  Iris-setosa\n",
       "1           4.9          3.0           1.4          0.2  Iris-setosa\n",
       "2           4.7          3.2           1.3          0.2  Iris-setosa\n",
       "3           4.6          3.1           1.5          0.2  Iris-setosa\n",
       "4           5.0          3.6           1.4          0.2  Iris-setosa"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# read the iris data into a DataFrame\n",
    "url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'\n",
    "col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']\n",
    "iris = pd.read_csv(url, header=None, names=col_names)\n",
    "iris.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>6</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>Iris-setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>Iris-setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>Iris-setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>Iris-setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>Iris-setosa</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal_length  sepal_width  petal_length  petal_width      species\n",
       "0             6            4             2            1  Iris-setosa\n",
       "1             5            3             2            1  Iris-setosa\n",
       "2             5            4             2            1  Iris-setosa\n",
       "3             5            4             2            1  Iris-setosa\n",
       "4             5            4             2            1  Iris-setosa"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# apply the ceiling function to the numeric columns\n",
    "iris.loc[:, 'sepal_length':'petal_width'] = iris.loc[:, 'sepal_length':'petal_width'].apply(np.ceil)\n",
    "iris.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deciding how to make a prediction\n",
    "\n",
    "Let's say that I have an **out-of-sample iris** with the following measurements: **7, 3, 5, 2**. How might I predict the species?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal_length</th>\n",
       "      <th>sepal_width</th>\n",
       "      <th>petal_length</th>\n",
       "      <th>petal_width</th>\n",
       "      <th>species</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>54</th>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>Iris-versicolor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>58</th>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>Iris-versicolor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>63</th>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>Iris-versicolor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>68</th>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>Iris-versicolor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>72</th>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>Iris-versicolor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>73</th>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>Iris-versicolor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>74</th>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>Iris-versicolor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75</th>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>Iris-versicolor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>76</th>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>Iris-versicolor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>77</th>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>Iris-versicolor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87</th>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>Iris-versicolor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>91</th>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>Iris-versicolor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>97</th>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>Iris-versicolor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>123</th>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>Iris-virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>126</th>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>Iris-virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>127</th>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>Iris-virginica</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>146</th>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>Iris-virginica</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     sepal_length  sepal_width  petal_length  petal_width          species\n",
       "54              7            3             5            2  Iris-versicolor\n",
       "58              7            3             5            2  Iris-versicolor\n",
       "63              7            3             5            2  Iris-versicolor\n",
       "68              7            3             5            2  Iris-versicolor\n",
       "72              7            3             5            2  Iris-versicolor\n",
       "73              7            3             5            2  Iris-versicolor\n",
       "74              7            3             5            2  Iris-versicolor\n",
       "75              7            3             5            2  Iris-versicolor\n",
       "76              7            3             5            2  Iris-versicolor\n",
       "77              7            3             5            2  Iris-versicolor\n",
       "87              7            3             5            2  Iris-versicolor\n",
       "91              7            3             5            2  Iris-versicolor\n",
       "97              7            3             5            2  Iris-versicolor\n",
       "123             7            3             5            2   Iris-virginica\n",
       "126             7            3             5            2   Iris-virginica\n",
       "127             7            3             5            2   Iris-virginica\n",
       "146             7            3             5            2   Iris-virginica"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# show all observations with features: 7, 3, 5, 2\n",
    "iris[(iris.sepal_length==7) & (iris.sepal_width==3) & (iris.petal_length==5) & (iris.petal_width==2)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Iris-versicolor    13\n",
       "Iris-virginica      4\n",
       "dtype: int64"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# count the species for these observations\n",
    "iris[(iris.sepal_length==7) & (iris.sepal_width==3) & (iris.petal_length==5) & (iris.petal_width==2)].species.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Iris-setosa        50\n",
       "Iris-versicolor    50\n",
       "Iris-virginica     50\n",
       "dtype: int64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# count the species for all observations\n",
    "iris.species.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's frame this as a **conditional probability problem**: What is the probability of some particular species, given the measurements 7, 3, 5, and 2?\n",
    "\n",
    "$$P(species \\ | \\ 7352)$$\n",
    "\n",
    "We could calculate the conditional probability for **each of the three species**, and then predict the species with the **highest probability**:\n",
    "\n",
    "$$P(setosa \\ | \\ 7352)$$\n",
    "$$P(versicolor \\ | \\ 7352)$$\n",
    "$$P(virginica \\ | \\ 7352)$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Calculating the probability of each species\n",
    "\n",
    "**Bayes' theorem** gives us a way to calculate these conditional probabilities.\n",
    "\n",
    "Let's start with **versicolor**:\n",
    "\n",
    "$$P(versicolor \\ | \\ 7352) = \\frac {P(7352 \\ | \\ versicolor) \\times P(versicolor)} {P(7352)}$$\n",
    "\n",
    "We can calculate each of the terms on the right side of the equation:\n",
    "\n",
    "$$P(7352 \\ | \\ versicolor) = \\frac {13} {50} = 0.26$$\n",
    "\n",
    "$$P(versicolor) = \\frac {50} {150} = 0.33$$\n",
    "\n",
    "$$P(7352) = \\frac {17} {150} = 0.11$$\n",
    "\n",
    "Therefore, Bayes' theorem says the **probability of versicolor given these measurements** is:\n",
    "\n",
    "$$P(versicolor \\ | \\ 7352) = \\frac {0.26 \\times 0.33} {0.11} = 0.76$$\n",
    "\n",
    "Let's repeat this process for **virginica** and **setosa**:\n",
    "\n",
    "$$P(virginica \\ | \\ 7352) = \\frac {0.08 \\times 0.33} {0.11} = 0.24$$\n",
    "\n",
    "$$P(setosa \\ | \\ 7352) = \\frac {0 \\times 0.33} {0.11} = 0$$\n",
    "\n",
    "We predict that the iris is a versicolor, since that species had the **highest conditional probability**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "1. We framed a **classification problem** as three conditional probability problems.\n",
    "2. We used **Bayes' theorem** to calculate those conditional probabilities.\n",
    "3. We made a **prediction** by choosing the species with the highest conditional probability."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bonus: The intuition behind Bayes' theorem\n",
    "\n",
    "Let's make some hypothetical adjustments to the data, to demonstrate how Bayes' theorem makes intuitive sense:\n",
    "\n",
    "Pretend that **more of the existing versicolors had measurements of 7352:**\n",
    "\n",
    "- $P(7352 \\ | \\ versicolor)$ would increase, thus increasing the numerator.\n",
    "- It would make sense that given an iris with measurements of 7352, the probability of it being a versicolor would also increase.\n",
    "\n",
    "Pretend that **most of the existing irises were versicolor:**\n",
    "\n",
    "- $P(versicolor)$ would increase, thus increasing the numerator.\n",
    "- It would make sense that the probability of any iris being a versicolor (regardless of measurements) would also increase.\n",
    "\n",
    "Pretend that **17 of the setosas had measurements of 7352:**\n",
    "\n",
    "- $P(7352)$ would double, thus doubling the denominator.\n",
    "- It would make sense that given an iris with measurements of 7352, the probability of it being a versicolor would be cut in half."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}