{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Naive Bayes Classifiers\n",
    "\n",
    "In this lecture we will learn how to use Naive Bayes Classifier to perform a Multi Class Classification on a data set we are already familiar with: the Iris Data Set. This Lecture will consist of 7 main parts:\n",
    "\n",
    "    Part 1: Note on Notation and Math Terms\n",
    "    Part 2: Bayes' Theorem\n",
    "    Part 3: Introduction to Naive Bayes\n",
    "    Part 4: Naive Bayes Classifier Mathematics Overview\n",
    "    Part 5: Constructing a classifier from the probability model\n",
    "    Part 6: Gaussian Naive Bayes\n",
    "    Part 7: Gaussian Naive Bayes with SciKit Learn\n",
    "    \n",
    "Let's go ahead and begin!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Part 1: Note on Notation and Math Terms\n",
    "\n",
    "There are a few more advanced notations and amthematical terms used during the explanation of naive Bayes Classification.\n",
    "You should be familiar with the following:\n",
    "\n",
    "[Product of Sequence](http://en.wikipedia.org/wiki/Product_%28mathematics%29#Product_of_sequences)\n",
    "\n",
    "The product of a sequence of terms can be written with the product symbol, which derives from the capital letter Π (Pi) in the Greek alphabet. The meaning of this notation is given by:\n",
    " $$\\prod_{i=1}^4 i = 1\\cdot 2\\cdot 3\\cdot 4,  $$\n",
    "that is\n",
    " $$\\prod_{i=1}^4 i = 24. $$\n",
    " \n",
    "[Arg Max](http://en.wikipedia.org/wiki/Arg_max)\n",
    "\n",
    "In mathematics, the argument of the maximum (abbreviated arg max or argmax) is the set of points of the given argument for which the given function attains its maximum value. In contrast to global maximums, which refer to a function's largest outputs, the arg max refers to the inputs which create those maximum outputs.\n",
    "\n",
    "The arg max is defined by\n",
    "\n",
    "$$\\operatorname*{arg\\,max}_x  f(x) := \\{x \\mid \\forall y : f(y) \\le f(x)\\}$$\n",
    "\n",
    "In other words, it is the set of points x for which f(x) attains its largest value. This set may be empty, have one element, or have multiple elements. For example, if f(x) is 1−|x|, then it attains its maximum value of 1 at x = 0 and only there, so\n",
    "\n",
    "$$\\operatorname*{arg\\,max}_x (1-|x|) = \\{0\\}$$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### Part 2: Bayes' Theorem\n",
    "\n",
    "First, for a quick introduction to Bayes' Theorem, check out the Bayes' Theorem Lecture in the statistics appendix portion of this course, in order ot fully understand Naive Bayes, you'll need a complete understanding of the Bayes' Theorem.\n",
    "\n",
    "### Part 3: Introduction to Naive Bayes\n",
    "\n",
    "Naive Bayes is probably one of the practical machine learning algorithms. Despite its name, it is actually performs very well considering its classification performance. It proves to be quite robust to irrelevant features, which it  ignores. It learns and predicts very fast and it does not require lots of storage. So, why is it then called naive?\n",
    "\n",
    "The naive was added to the account for one assumption that is required for Bayes to work optimally: all features must be independent of each other. In reality, this is usually not the case, however, it still returns very good accuracy in practice even when the independent assumption does not hold.\n",
    "\n",
    "Naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and spam filtering. We will be working with the Iris Flower data set in this lecture.\n",
    "\n",
    "### Part 4: Naive Bayes Classifier Mathematics Overview\n",
    "\n",
    "Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of independence between every pair of features. Given a class variable y and a dependent feature vector x<sub>1</sub> through x<sub>n</sub>, Bayes’ theorem states the following relationship:\n",
    "\n",
    "$$P(y \\mid x_1, \\dots, x_n) = \\frac{P(y) P(x_1, \\dots x_n \\mid y)}\n",
    "                                 {P(x_1, \\dots, x_n)}$$\n",
    "                                 \n",
    "Using the naive independence assumption that\n",
    "$$P(x_i | y, x_1, \\dots, x_{i-1}, x_{i+1}, \\dots, x_n) = P(x_i | y)$$\n",
    "\n",
    "for all i, this relationship is simplified to:\n",
    "\n",
    "$$P(y \\mid x_1, \\dots, x_n) = \\frac{P(y) \\prod_{i=1}^{n} P(x_i \\mid y)}\n",
    "                                 {P(x_1, \\dots, x_n)}$$\n",
    "                               \n",
    "We now have a relationship between the target and the features using Bayes Theorem along with a Naive Assumption that all features are independent."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 5: Constructing a classifier from the probability model\n",
    "\n",
    "So far we have derived the independent feature model, the Naive Bayes probability model. The Naive Bayes classifier combines this model with a *decision rule*, this decision rule will decide which hypothesis is most probable, in our example case this will be which class of flower is most probable.\n",
    "\n",
    "Picking the hypothesis that is most probable  is known as the maximum a posteriori or MAP decision rule. The corresponding classifier, a Bayes classifier, is the function that assigns a class label to y as follows:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since P(x<sub>1</sub>, ..., x<sub>n</sub>) is constant given the input, we can use the following classification rule:\n",
    "$$P(y \\mid x_1, \\dots, x_n) \\propto P(y) \\prod_{i=1}^{n} P(x_i \\mid y)$$\n",
    "\n",
    "$$\\Downarrow$$\n",
    "\n",
    "$$\\hat{y} = \\arg\\max_y P(y) \\prod_{i=1}^{n} P(x_i \\mid y),$$\n",
    "\n",
    "and we can use Maximum A Posteriori (MAP) estimation to estimate P(y) and P(x<sub>i</sub> | y); the former is then the relative frequency of class y in the training set.\n",
    "\n",
    "There are different naive Bayes classifiers that differ mainly by the assumptions they make regarding the distribution of P(x<sub>i</sub> | y).\n",
    "\n",
    "### Part 6: Gaussian Naive Bayes\n",
    "\n",
    "When dealing with continuous data, a typical assumption is that the continuous values associated with each class are distributed according to a Gaussian distribution. Go back to the normal distribution lecture to review the formulas for the Gaussian/Normal Distribution.\n",
    "\n",
    "For example of using the Gaussian Distribution, suppose the training data contain a continuous attribute, x. We first segment the data by the class, and then compute the mean and variance of x in each class. Let  μ<sub>c</sub> be the mean of the values in x associated with class c, and let  σ<sup>2</sup><sub>c</sub> be the variance of the values in x associated with class c. Then, the probability distribution of some value given a class, p(x=v|c), can be computed by plugging v into the equation for a Normal distribution parameterized by μ<sub>c</sub> and  σ<sup>2</sup><sub>c</sub>. That is:\n",
    "\n",
    "$$p(x=v|c)=\\frac{1}{\\sqrt{2\\pi\\sigma^2_c}}\\,e^{ -\\frac{(v-\\mu_c)^2}{2\\sigma^2_c} }$$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The key to Naive Bayes is making the (rather large) assumption that the presences (or absences) of\n",
    "each data feature are independent of one another, conditional on a data having a certain label."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 7: Gaussian Naive Bayes with SciKit Learn\n",
    "We'll start by importing the usual.\n",
    "\n",
    "Quick note we will actually only use the SciKit Learn Library in this lecture:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from pandas import Series,DataFrame\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "# Gaussian Naive Bayes\n",
    "from sklearn import datasets\n",
    "from sklearn import metrics\n",
    "from sklearn.naive_bayes import GaussianNB"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have our module imports, let's go ahead and import the Iris Data Set. We have previously worked with this dataset, so go ahead and look at Lectures on MultiClass Classification for a complete breakdown on this data set!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Iris Plants Database\n",
      "\n",
      "Notes\n",
      "-----\n",
      "Data Set Characteristics:\n",
      "    :Number of Instances: 150 (50 in each of three classes)\n",
      "    :Number of Attributes: 4 numeric, predictive attributes and the class\n",
      "    :Attribute Information:\n",
      "        - sepal length in cm\n",
      "        - sepal width in cm\n",
      "        - petal length in cm\n",
      "        - petal width in cm\n",
      "        - class:\n",
      "                - Iris-Setosa\n",
      "                - Iris-Versicolour\n",
      "                - Iris-Virginica\n",
      "    :Summary Statistics:\n",
      "    ============== ==== ==== ======= ===== ====================\n",
      "                    Min  Max   Mean    SD   Class Correlation\n",
      "    ============== ==== ==== ======= ===== ====================\n",
      "    sepal length:   4.3  7.9   5.84   0.83    0.7826\n",
      "    sepal width:    2.0  4.4   3.05   0.43   -0.4194\n",
      "    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)\n",
      "    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)\n",
      "    ============== ==== ==== ======= ===== ====================\n",
      "    :Missing Attribute Values: None\n",
      "    :Class Distribution: 33.3% for each of 3 classes.\n",
      "    :Creator: R.A. Fisher\n",
      "    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n",
      "    :Date: July, 1988\n",
      "\n",
      "This is a copy of UCI ML iris datasets.\n",
      "http://archive.ics.uci.edu/ml/datasets/Iris\n",
      "\n",
      "The famous Iris database, first used by Sir R.A Fisher\n",
      "\n",
      "This is perhaps the best known database to be found in the\n",
      "pattern recognition literature.  Fisher's paper is a classic in the field and\n",
      "is referenced frequently to this day.  (See Duda & Hart, for example.)  The\n",
      "data set contains 3 classes of 50 instances each, where each class refers to a\n",
      "type of iris plant.  One class is linearly separable from the other 2; the\n",
      "latter are NOT linearly separable from each other.\n",
      "\n",
      "References\n",
      "----------\n",
      "   - Fisher,R.A. \"The use of multiple measurements in taxonomic problems\"\n",
      "     Annual Eugenics, 7, Part II, 179-188 (1936); also in \"Contributions to\n",
      "     Mathematical Statistics\" (John Wiley, NY, 1950).\n",
      "   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.\n",
      "     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.\n",
      "   - Dasarathy, B.V. (1980) \"Nosing Around the Neighborhood: A New System\n",
      "     Structure and Classification Rule for Recognition in Partially Exposed\n",
      "     Environments\".  IEEE Transactions on Pattern Analysis and Machine\n",
      "     Intelligence, Vol. PAMI-2, No. 1, 67-71.\n",
      "   - Gates, G.W. (1972) \"The Reduced Nearest Neighbor Rule\".  IEEE Transactions\n",
      "     on Information Theory, May 1972, 431-433.\n",
      "   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al\"s AUTOCLASS II\n",
      "     conceptual clustering system finds 3 classes in the data.\n",
      "   - Many, many more ...\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# load the iris datasets\n",
    "iris = datasets.load_iris()\n",
    "\n",
    "# Grab features (X) and the Target (Y)\n",
    "X = iris.data\n",
    "\n",
    "Y = iris.target\n",
    "\n",
    "# Show the Built-in Data Description\n",
    "print iris.DESCR"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we have already done a general analysis of this data in earlier lectures, let's go ahead and move on to using the Naive Bayes Method to seperate this data set into multiple classes.\n",
    "\n",
    "First we create and fit the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Fit a Naive Bayes model to the data\n",
    "model = GaussianNB()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have our model, we will continue by seperating into training and testing sets:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.cross_validation import train_test_split\n",
    "# Split the data into Trainging and Testing sets\n",
    "X_train, X_test, Y_train, Y_test = train_test_split(X, Y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we fit our model using the training data set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GaussianNB()"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Fit the training model\n",
    "model.fit(X_train,Y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we predict the outcomes from the Testing Set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Predicted outcomes\n",
    "predicted = model.predict(X_test)\n",
    "\n",
    "# Actual Expected Outvomes\n",
    "expected = Y_test"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally we can see the metrics for performance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.947368421053\n"
     ]
    }
   ],
   "source": [
    "print metrics.accuracy_score(expected, predicted)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It looks like we have about 94.7% accuracy using the Naive Bayes method!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 8: More Resources\n",
    "\n",
    "There are plenty more resources and types of Naive Bayes Classifiers, For more resources on Naive Bayes, check out the following links:\n",
    "\n",
    "1.) [SciKit Learn Documentation](http://scikit-learn.org/stable/modules/naive_bayes.html)\n",
    "\n",
    "2.) [Naive Bayes with NLTK](http://slendermeans.org/ml4h-ch3.html)\n",
    "\n",
    "3.) [Wikipedia on Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier)\n",
    "\n",
    "4.) [Andrew Ng's Class Notes](http://cs229.stanford.edu/notes/cs229-notes2.pdf)\n",
    "\n",
    "5.) [Andrew Ng's Video Lecture on Naive Bayes](https://www.youtube.com/watch?v=z5UQyCESW64)\n",
    "\n",
    "6.) [UC Berkeley Lecture by Pieter Abbeel](https://www.youtube.com/watch?v=DNvwfNEiKvw)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}