{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Classifying News Headlines with Naive Bayes\n", "Reference: Classifying News Headlines and Explaining the Result from Kaggle[^1]\n", "\n", "[^1]: http://nbviewer.jupyter.org/github/dreamgonfly/lime-examples/blob/master/Classifying%20News%20Headlines%20and%20Explaining%20the%20Result.ipynb\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "news = pd.read_csv('uci-news-aggregator.csv').sample(frac=0.1)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "42242" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(news)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDTITLEURLPUBLISHERCATEGORYSTORYHOSTNAMETIMESTAMP
2458924590Wal-Mart takes aim at $2B used video game mark...http://www.wncn.com/story/25002497/wal-mart-ta...WNCNbdxrStbpfnH5zjIM-2HS6xpt7arezMwww.wncn.com1395317817464
114446114782You'll likely recognize this 'Fargo,' you betcha!http://www.usatoday.com/story/life/tv/2014/04/...USA TODAYedHALsQo-hgLH_0MDzDRorxcN9BVnMwww.usatoday.com1397520248385
357516357976Sentence for sex abuser Rolf Harris 'too lenient'http://www.scotsman.com/news/sentence-for-sex-...Scotsman \\(blog\\)edDGfw8WpQ5GpyLMGzdY_Xi99sjNPMwww.scotsman.com1404528720056
241734242180Google (GOOGL) Developing Tablet with Advanced...http://wallstreetpit.com/104126-google-googl-d...Wall Street PittduaEdk4hebuJCDM4aiSvC1Nv3ThrMwallstreetpit.com1400868306109
206956207392FDA : Advanced prosthetic arm is approved for ...http://canadajournal.net/health/fda-advanced-p...Canada Newsmdz-jKxOMO7HMK0M7xdtCAX5E1-UOMcanadajournal.net1399927161820
\n", "
" ], "text/plain": [ " ID TITLE \\\n", "24589 24590 Wal-Mart takes aim at $2B used video game mark... \n", "114446 114782 You'll likely recognize this 'Fargo,' you betcha! \n", "357516 357976 Sentence for sex abuser Rolf Harris 'too lenient' \n", "241734 242180 Google (GOOGL) Developing Tablet with Advanced... \n", "206956 207392 FDA : Advanced prosthetic arm is approved for ... \n", "\n", " URL PUBLISHER \\\n", "24589 http://www.wncn.com/story/25002497/wal-mart-ta... WNCN \n", "114446 http://www.usatoday.com/story/life/tv/2014/04/... USA TODAY \n", "357516 http://www.scotsman.com/news/sentence-for-sex-... Scotsman \\(blog\\) \n", "241734 http://wallstreetpit.com/104126-google-googl-d... Wall Street Pit \n", "206956 http://canadajournal.net/health/fda-advanced-p... Canada News \n", "\n", " CATEGORY STORY HOSTNAME \\\n", "24589 b dxrStbpfnH5zjIM-2HS6xpt7arezM www.wncn.com \n", "114446 e dHALsQo-hgLH_0MDzDRorxcN9BVnM www.usatoday.com \n", "357516 e dDGfw8WpQ5GpyLMGzdY_Xi99sjNPM www.scotsman.com \n", "241734 t duaEdk4hebuJCDM4aiSvC1Nv3ThrM wallstreetpit.com \n", "206956 m dz-jKxOMO7HMK0M7xdtCAX5E1-UOM canadajournal.net \n", "\n", " TIMESTAMP \n", "24589 1395317817464 \n", "114446 1397520248385 \n", "357516 1404528720056 \n", "241734 1400868306109 \n", "206956 1399927161820 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "news.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.preprocessing import LabelEncoder\n", "encoder = LabelEncoder()\n", "\n", "X = news['TITLE']\n", "y = encoder.fit_transform(news['CATEGORY'])\n", "\n", "from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "31681" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(X_train)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "10561" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(X_test)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(X_train)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "288020 Oil prices rises to highest level in 9 month h...\n", "382781 Gold gains on bargain hunting, though Yellen c...\n", "339760 LG G3 Beat vs. Galaxy S5 Mini vs. Moto X: Came...\n", "334230 Bulgarian depositors withdraw more savings as ...\n", "39177 New Motorola Phablet to Hit the Market Later T...\n", "Name: TITLE, dtype: object" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.head()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "vectorizer = CountVectorizer(min_df=3)\n", "\n", "train_vectors = vectorizer.fit_transform(X_train)\n", "test_vectors = vectorizer.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<31681x9915 sparse matrix of type ''\n", "\twith 267651 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_vectors" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Gold gains on bargain hunting, though Yellen comments still weigh'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.iloc[1]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<1x9915 sparse matrix of type ''\n", "\twith 10 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_vectors[1]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "scipy.sparse.csr.csr_matrix" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(train_vectors)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0, 0, 0, ..., 0, 0, 0]])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# one-hot vector\n", "train_vectors[1].toarray()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0, 0, 0, ..., 0, 0, 0],\n", " [0, 0, 0, ..., 0, 0, 0],\n", " [0, 0, 0, ..., 0, 0, 0],\n", " ..., \n", " [0, 0, 0, ..., 0, 0, 0],\n", " [0, 0, 0, ..., 0, 0, 0],\n", " [0, 0, 0, ..., 0, 0, 0]], dtype=int64)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_vectors.toarray()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Gaussian Naive Bayes" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GaussianNB(priors=None)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.naive_bayes import GaussianNB\n", "clf = GaussianNB()\n", "clf.fit(train_vectors.toarray(), y_train)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.82056623425811948" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pred = clf.predict(test_vectors.toarray())\n", "accuracy_score(y_test, pred, )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multinomial Naive Bayes" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.naive_bayes import MultinomialNB\n", "clf = MultinomialNB()\n", "clf.fit(train_vectors, y_train)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.89792633273364264" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pred = clf.predict(test_vectors)\n", "accuracy_score(y_test, pred, )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bernoulli Naive Bayes" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.naive_bayes import BernoulliNB\n", "clf = BernoulliNB()\n", "clf.fit(train_vectors, y_train)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.89660070069122244" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pred = clf.predict(test_vectors.toarray())\n", "accuracy_score(y_test, pred, )" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 1 }