{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Applying Naive Bayes classification to spam filtering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's pretend we have an email with three words: \"Send money now.\" We'll use Naive Bayes to classify it as **ham or spam.**\n", "\n", "$$P(spam \\ | \\ \\text{send money now}) = \\frac {P(\\text{send money now} \\ | \\ spam) \\times P(spam)} {P(\\text{send money now})}$$\n", "\n", "By assuming that the features (the words) are **conditionally independent**, we can simplify the likelihood function:\n", "\n", "$$P(spam \\ | \\ \\text{send money now}) \\approx \\frac {P(\\text{send} \\ | \\ spam) \\times P(\\text{money} \\ | \\ spam) \\times P(\\text{now} \\ | \\ spam) \\times P(spam)} {P(\\text{send money now})}$$\n", "\n", "We can calculate all of the values in the numerator by examining a corpus of **spam email**:\n", "\n", "$$P(spam \\ | \\ \\text{send money now}) \\approx \\frac {0.2 \\times 0.1 \\times 0.1 \\times 0.9} {P(\\text{send money now})} = \\frac {0.0018} {P(\\text{send money now})}$$\n", "\n", "We would repeat this process with a corpus of **ham email**:\n", "\n", "$$P(ham \\ | \\ \\text{send money now}) \\approx \\frac {0.05 \\times 0.01 \\times 0.1 \\times 0.1} {P(\\text{send money now})} = \\frac {0.000005} {P(\\text{send money now})}$$\n", "\n", "All we care about is whether spam or ham has the **higher probability**, and so we predict that the email is **spam**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Key takeaways\n", "\n", "- The **\"naive\" assumption** of Naive Bayes (that the features are conditionally independent) is critical to making these calculations simple.\n", "- The **normalization constant** (the denominator) can be ignored since it's the same for all classes.\n", "- The **prior probability** is much less relevant once you have a lot of features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparing Naive Bayes with other models\n", "\n", "Advantages of Naive Bayes:\n", "\n", "- Model training and prediction are very fast\n", "- Somewhat interpretable\n", "- No tuning is required\n", "- Features don't need scaling\n", "- Insensitive to irrelevant features (with enough observations)\n", "- Performs better than logistic regression when the training set is very small\n", "\n", "Disadvantages of Naive Bayes:\n", "\n", "- Predicted probabilities are not well-calibrated\n", "- Correlated features can be problematic (due to the independence assumption)\n", "- Can't handle negative features (with Multinomial Naive Bayes)\n", "- Has a higher \"asymptotic error\" than logistic regression" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }