{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Comparing Multinomial and Gaussian Naive Bayes\n", "\n", "scikit-learn documentation: [MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) and [GaussianNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)\n", "\n", "Dataset: [Pima Indians Diabetes](https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes) from the UCI Machine Learning Repository" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# read the data\n", "import pandas as pd\n", "url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'\n", "col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']\n", "pima = pd.read_csv(url, header=None, names=col_names)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>pregnant</th><th>glucose</th><th>bp</th><th>skin</th><th>insulin</th><th>bmi</th><th>pedigree</th><th>age</th><th>label</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>6</td><td>148</td><td>72</td><td>35</td><td>0</td><td>33.6</td><td>0.627</td><td>50</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>85</td><td>66</td><td>29</td><td>0</td><td>26.6</td><td>0.351</td><td>31</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>8</td><td>183</td><td>64</td><td>0</td><td>0</td><td>23.3</td><td>0.672</td><td>32</td><td>1</td></tr>\n",
"    <tr><th>3</th><td>1</td><td>89</td><td>66</td><td>23</td><td>94</td><td>28.1</td><td>0.167</td><td>21</td><td>0</td></tr>\n",
"    <tr><th>4</th><td>0</td><td>137</td><td>40</td><td>35</td><td>168</td><td>43.1</td><td>2.288</td><td>33</td><td>1</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " pregnant glucose bp skin insulin bmi pedigree age label\n", "0 6 148 72 35 0 33.6 0.627 50 1\n", "1 1 85 66 29 0 26.6 0.351 31 0\n", "2 8 183 64 0 0 23.3 0.672 32 1\n", "3 1 89 66 23 94 28.1 0.167 21 0\n", "4 0 137 40 35 168 43.1 2.288 33 1" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# notice that all features are numeric measurements, which we treat as continuous\n", "pima.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# create X (feature matrix) and y (response vector)\n", "X = pima.drop('label', axis=1)\n", "y = pima.label" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# split into training and testing sets\n", "# (train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed in scikit-learn 0.20)\n", "from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# import both Multinomial and Gaussian Naive Bayes\n", "from sklearn.naive_bayes import MultinomialNB, GaussianNB\n", "from sklearn import metrics" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.541666666667\n" ] } ], "source": [ "# testing accuracy of Multinomial Naive Bayes\n", "mnb = MultinomialNB()\n", "mnb.fit(X_train, y_train)\n", "y_pred_class = mnb.predict(X_test)\n", "print(metrics.accuracy_score(y_test, y_pred_class))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.791666666667\n" ] } ], "source": [ "# testing accuracy of Gaussian Naive Bayes\n", "gnb = GaussianNB()\n", "gnb.fit(X_train, y_train)\n", "y_pred_class = gnb.predict(X_test)\n", "print(metrics.accuracy_score(y_test, y_pred_class))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"**Conclusion:** On this dataset, Gaussian Naive Bayes reaches a testing accuracy of about 0.79, compared with about 0.54 for Multinomial Naive Bayes. When applying Naive Bayes classification to a dataset with **continuous features**, Gaussian Naive Bayes is usually the better choice, because it models each feature within each class as normally distributed. Multinomial Naive Bayes is instead suited to datasets containing **discrete features** (e.g., word counts).\n", "\n", "Wikipedia has a short [description](https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Gaussian_naive_Bayes) of Gaussian Naive Bayes, as well as an excellent [example](https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification) of its usage." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }