The Naive Bayes Classifier

The Naive Bayes classifier[2] is an extremely simple classifier that relies on Bayes' Rule and the assumption that the feature probabilities are conditionally independent of one another given the class. Bayes' Rule gives:

$\displaystyle P(C \vert F_1, F_2, \ldots, F_n) = \frac{P(C)\,P(F_1, F_2, \ldots, F_n \vert C)}{P(F_1, F_2, \ldots, F_n)}$

Expanding the numerator with the chain rule gives:

$\displaystyle P(C)P(F_1, F_2, \ldots, F_n \vert C)$

$\displaystyle = P(C)P(F_1 \vert C)P(F_2, F_3, \ldots, F_n \vert C, F_1)$

$\displaystyle = P(C)P(F_1 \vert C)P(F_2 \vert C, F_1)P(F_3, F_4, \ldots, F_n \vert C, F_1, F_2)$

$\displaystyle \ldots$

Then, assuming the features are conditionally independent of one another given the class, that is,

$\displaystyle P(F_i \vert C, F_j, \ldots, F_k) = P(F_i \vert C),$

each conditional factor collapses, and the posterior is proportional to

$\displaystyle P(C \vert F_1, \ldots, F_n) \propto P(C) \prod_{i=1}^{n} P(F_i \vert C)$

since the denominator $P(F_1, F_2, \ldots, F_n)$ is the same for every class.
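For example, with two classes and hypothetical values $P(C_1) = 0.6$, $P(F_1 \vert C_1) = 0.5$, $P(F_2 \vert C_1) = 0.2$ and $P(C_2) = 0.4$, $P(F_1 \vert C_2) = 0.3$, $P(F_2 \vert C_2) = 0.4$, the unnormalized scores are

$\displaystyle P(C_1)P(F_1 \vert C_1)P(F_2 \vert C_1) = 0.6 \cdot 0.5 \cdot 0.2 = 0.06$

$\displaystyle P(C_2)P(F_1 \vert C_2)P(F_2 \vert C_2) = 0.4 \cdot 0.3 \cdot 0.4 = 0.048$

so $C_1$ has the larger posterior; the shared denominator never needs to be computed.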

$ P(F_i \vert C)$ is estimated through plus-one (Laplace) smoothing on a labeled training set, that is:

$\displaystyle P(F_i \vert C) = \frac{1 + count(C, F_i)}{\sum_j \bigl(1 + count(C, F_j)\bigr)}$

where $ count(C, F_j)$ is the number of times that $ F_j$ appears over all training documents in class $ C$.
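As a concrete sketch, the following Python snippet computes these smoothed estimates from a small labeled training set. The data and names (training, counts, smoothed_likelihood) are hypothetical illustrations, not part of our experimental setup.

\begin{verbatim}
from collections import Counter, defaultdict

# Hypothetical toy training set: (class label, list of feature tokens).
training = [
    ("spam", ["buy", "cheap", "pills", "cheap"]),
    ("spam", ["cheap", "offer"]),
    ("ham",  ["meeting", "tomorrow", "offer"]),
]

# counts[c][f] = count(C=c, F=f): occurrences of feature f in class c
counts = defaultdict(Counter)
for label, features in training:
    counts[label].update(features)

vocab = {f for counter in counts.values() for f in counter}

def smoothed_likelihood(c, f):
    """Plus-one smoothed estimate of P(F_i = f | C = c)."""
    return (1 + counts[c][f]) / sum(1 + counts[c][g] for g in vocab)

print(smoothed_likelihood("spam", "cheap"))  # 4/12, vs. 1/9 for "ham"
\end{verbatim}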

The class a feature vector belongs to is given by

$\displaystyle C^* = \operatorname*{arg\,max}_C P(C \vert F_1, \ldots, F_n)$

Taking the logarithm, which preserves the arg max, and substituting the smoothed estimates gives

$\displaystyle C^* = \operatorname*{arg\,max}_C \Bigl( \lg P(C) + \sum_i F_i \bigl[ \lg \bigl(1 + count(C, F_i)\bigr) - \lg \sum_j \bigl(1 + count(C, F_j)\bigr) \bigr] \Bigr)$

where $F_i$ here denotes the number of times feature $i$ occurs in the document being classified.
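A minimal log-space implementation of this decision rule, again on hypothetical toy data, might look like the following sketch; it additionally estimates $P(C)$ from class frequencies in the training set, an assumption the derivation above does not spell out.

\begin{verbatim}
import math
from collections import Counter, defaultdict

# Same hypothetical toy data as in the previous sketch.
training = [
    ("spam", ["buy", "cheap", "pills", "cheap"]),
    ("spam", ["cheap", "offer"]),
    ("ham",  ["meeting", "tomorrow", "offer"]),
]

counts = defaultdict(Counter)   # counts[c][f] = count(C=c, F=f)
class_docs = Counter()          # number of training documents per class
for label, features in training:
    counts[label].update(features)
    class_docs[label] += 1

vocab = {f for counter in counts.values() for f in counter}
total_docs = sum(class_docs.values())

def classify(features):
    """Pick arg max_C [lg P(C) + sum_i F_i * lg P(F_i | C)] in log space,
    so long documents do not underflow to zero probability."""
    doc = Counter(f for f in features if f in vocab)  # drop unseen features
    best_class, best_score = None, -math.inf
    for c in counts:
        normalizer = sum(1 + counts[c][g] for g in vocab)  # smoothing denominator
        score = math.log2(class_docs[c] / total_docs)      # lg P(C)
        for f, n in doc.items():                           # n plays the role of F_i
            score += n * (math.log2(1 + counts[c][f]) - math.log2(normalizer))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify(["cheap", "cheap", "offer"]))  # -> 'spam'
\end{verbatim}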

While the Naive Bayes classifier seems very simple, it has high predictive power; in our tests, it performed competitively with the more sophisticated classifiers we used. It can also be implemented very efficiently: the independence assumption means that it does not fall prey to the curse of dimensionality, and its running time is linear in the size of the input.
