{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Scikit-learn tutorial\n", "\n", "In this tutorial, we will walk through a complete machine learning process, starting with the data and ending with predictions and a proper evaluation. We will focus on textual data.\n", "\n", "Technically, we will use Python and, more specifically, the scikit-learn library." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Parts of this tutorial are based on http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "\n", "We work with one of the classic machine learning text datasets---the so-called 20 Newsgroups dataset. It is directly available in scikit-learn, which downloads it internally on first use.\n", "\n", "It consists of roughly 20,000 newsgroup documents, partitioned across 20 different newsgroups." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.datasets import fetch_20newsgroups" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:sklearn.datasets.twenty_newsgroups:Downloading dataset from http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz (14 MB)\n" ] } ], "source": [ "train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'sklearn.datasets.base.Bunch'>\n" ] } ], "source": [ "print type(train)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['description', 'DESCR', 'filenames', 'target_names', 'data', 'target']\n" ] } ], 
"source": [ "print train.keys()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[u\"From: lerxst@wam.umd.edu (where's my thing)\\nSubject: WHAT car is this!?\\nNntp-Posting-Host: rac3.wam.umd.edu\\nOrganization: University of Maryland, College Park\\nLines: 15\\n\\n I was wondering if anyone out there could enlighten me on this car I saw\\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\\nthe front bumper was separate from the rest of the body. This is \\nall I know. If anyone can tellme a model name, engine specs, years\\nof production, where this car is made, history, or whatever info you\\nhave on this funky looking car, please e-mail.\\n\\nThanks,\\n- IL\\n ---- brought to you by your neighborhood Lerxst ----\\n\\n\\n\\n\\n\"]\n" ] } ], "source": [ "print train.data[:1]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[7]\n" ] } ], "source": [ "print train.target[:1]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['alt.atheism']\n" ] } ], "source": [ "print train.target_names[:1]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "set(['rec.motorcycles', 'comp.sys.mac.hardware', 'talk.politics.misc', 'soc.religion.christian', 'comp.graphics', 'sci.med', 'talk.religion.misc', 'comp.windows.x', 'comp.sys.ibm.pc.hardware', 'talk.politics.guns', 'alt.atheism', 'comp.os.ms-windows.misc', 'sci.crypt', 'sci.space', 'misc.forsale', 'rec.sport.hockey', 'rec.sport.baseball', 'sci.electronics', 'rec.autos', 'talk.politics.mideast'])\n" ] } ], 
"source": [ "print set(train.target_names)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "11314\n" ] } ], "source": [ "print len(train.data)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "x_train = train.data\n", "y_train = train.target" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "test = fetch_20newsgroups(subset='test', shuffle=True, random_state=42)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "7532\n" ] } ], "source": [ "print len(test.data)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "x_test = test.data\n", "y_test = test.target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Goal\n", "\n", "Classify newsgroup postings into their respective newsgroup, based solely on their text." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature engineering\n", "\n", "As features, we use simple bag-of-words token counts: CountVectorizer learns the vocabulary from the training data and turns each document into a sparse vector of word counts." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vec = CountVectorizer()\n", "x_train_f = vec.fit_transform(x_train)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(11314, 130107)\n" ] } ], "source": [ "print x_train_f.shape" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " (0, 4605)\t1\n", " (0, 16574)\t1\n", " (0, 
18299)\t1\n" ] } ], "source": [ "print x_train_f[0,0:20000]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "x_test_f = vec.transform(x_test)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(7532, 130107)\n" ] } ], "source": [ "print x_test_f.shape" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "28856" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vec.vocabulary_.get('apple')" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Naive Bayes Classifier" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.naive_bayes import MultinomialNB" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [], "source": [ "clf = MultinomialNB().fit(x_train_f, y_train)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [], "source": [ "docs = [\"Where is the start menu?\", \"Most homeruns in a game\", \"Who was the first man on the moon?\"]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [], "source": [ "predicted = clf.predict(vec.transform(docs))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 5 9 14]\n" ] } ], "source": [ "print predicted" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "'Where is the start menu?' 
=> comp.windows.x\n", "'Most homeruns in a game' => rec.sport.baseball\n", "'Who was the first man on the moon?' => sci.space\n" ] } ], "source": [ "for doc, category in zip(docs, predicted):\n", " print('%r => %s' % (doc, train.target_names[category]))" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [], "source": [ "predicted = clf.predict(x_test_f)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.772835900159\n" ] } ], "source": [ "print np.mean(predicted==y_test)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.metrics import f1_score" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.75111275774411768" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f1_score(y_test,predicted,average=\"weighted\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pipeline\n", "\n", "A Pipeline chains the vectorizer and the classifier into a single estimator, so we can fit and predict directly on the raw text." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": true }, "outputs": [], "source": [ "clf = Pipeline([('vect', CountVectorizer()),\n", " ('clf', MultinomialNB()),\n", " ])" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n", " dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',\n", " lowercase=True, max_df=1.0, max_features=None, 
min_df=1,\n", " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", " strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n", " tokenizer=None, vocabulary=None)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.fit(x_train, y_train)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": true }, "outputs": [], "source": [ "predicted = clf.predict(x_test)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.772835900159\n" ] } ], "source": [ "print np.mean(predicted==y_test)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.metrics import classification_report" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.79 0.77 0.78 319\n", " 1 0.67 0.74 0.70 389\n", " 2 0.20 0.00 0.01 394\n", " 3 0.56 0.77 0.65 392\n", " 4 0.84 0.75 0.79 385\n", " 5 0.65 0.84 0.73 395\n", " 6 0.93 0.65 0.77 390\n", " 7 0.87 0.91 0.89 396\n", " 8 0.96 0.92 0.94 398\n", " 9 0.96 0.87 0.91 397\n", " 10 0.93 0.96 0.95 399\n", " 11 0.67 0.95 0.78 396\n", " 12 0.79 0.66 0.72 393\n", " 13 0.87 0.82 0.85 396\n", " 14 0.83 0.89 0.86 394\n", " 15 0.70 0.96 0.81 398\n", " 16 0.69 0.91 0.79 364\n", " 17 0.85 0.94 0.89 376\n", " 18 0.58 0.63 0.60 310\n", " 19 0.89 0.33 0.49 251\n", "\n", "avg / total 0.76 0.77 0.75 7532\n", "\n" ] } ], "source": [ "print classification_report(y_test,predicted)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parameter tuning" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from 
sklearn.grid_search import GridSearchCV" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": true }, "outputs": [], "source": [ "parameters = {'vect__ngram_range': [(1, 1), (1, 2)],\n", " 'clf__alpha': (0., 1., 2.),\n", "}" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": true }, "outputs": [], "source": [ "gs_clf = GridSearchCV(clf, parameters, n_jobs=-1)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python2.7/dist-packages/sklearn/naive_bayes.py:664: RuntimeWarning: divide by zero encountered in log\n", " self.feature_log_prob_ = (np.log(smoothed_fc)\n", "/usr/local/lib/python2.7/dist-packages/sklearn/naive_bayes.py:664: RuntimeWarning: divide by zero encountered in log\n", " self.feature_log_prob_ = (np.log(smoothed_fc)\n", "/usr/local/lib/python2.7/dist-packages/sklearn/naive_bayes.py:664: RuntimeWarning: divide by zero encountered in log\n", " self.feature_log_prob_ = (np.log(smoothed_fc)\n", "/usr/local/lib/python2.7/dist-packages/sklearn/naive_bayes.py:664: RuntimeWarning: divide by zero encountered in log\n", " self.feature_log_prob_ = (np.log(smoothed_fc)\n" ] }, { "data": { "text/plain": [ "GridSearchCV(cv=None, error_score='raise',\n", " estimator=Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n", " dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',\n", " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", " strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n", " tokenizer=None, vocabulary=None)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),\n", " fit_params={}, iid=True, n_jobs=-1,\n", " param_grid={'vect__ngram_range': [(1, 1), (1, 2)], 'clf__alpha': (0.0, 1.0, 2.0)},\n", " pre_dispatch='2*n_jobs', refit=True, scoring=None, 
verbose=0)" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gs_clf.fit(x_train, y_train)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": true }, "outputs": [], "source": [ "predicted = gs_clf.predict(x_test)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.79 0.77 0.78 319\n", " 1 0.67 0.74 0.70 389\n", " 2 0.20 0.00 0.01 394\n", " 3 0.56 0.77 0.65 392\n", " 4 0.84 0.75 0.79 385\n", " 5 0.65 0.84 0.73 395\n", " 6 0.93 0.65 0.77 390\n", " 7 0.87 0.91 0.89 396\n", " 8 0.96 0.92 0.94 398\n", " 9 0.96 0.87 0.91 397\n", " 10 0.93 0.96 0.95 399\n", " 11 0.67 0.95 0.78 396\n", " 12 0.79 0.66 0.72 393\n", " 13 0.87 0.82 0.85 396\n", " 14 0.83 0.89 0.86 394\n", " 15 0.70 0.96 0.81 398\n", " 16 0.69 0.91 0.79 364\n", " 17 0.85 0.94 0.89 376\n", " 18 0.58 0.63 0.60 310\n", " 19 0.89 0.33 0.49 251\n", "\n", "avg / total 0.76 0.77 0.75 7532\n", "\n" ] } ], "source": [ "print classification_report(y_test,predicted)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[mean: 0.17191, std: 0.00346, params: {'vect__ngram_range': (1, 1), 'clf__alpha': 0.0}, mean: 0.07380, std: 0.00217, params: {'vect__ngram_range': (1, 2), 'clf__alpha': 0.0}, mean: 0.82075, std: 0.00376, params: {'vect__ngram_range': (1, 1), 'clf__alpha': 1.0}, mean: 0.81625, std: 0.00692, params: {'vect__ngram_range': (1, 2), 'clf__alpha': 1.0}, mean: 0.75597, std: 0.00834, params: {'vect__ngram_range': (1, 1), 'clf__alpha': 2.0}, mean: 0.75367, std: 0.00576, params: {'vect__ngram_range': (1, 2), 'clf__alpha': 2.0}]\n" ] } ], "source": [ "print gs_clf.grid_scores_" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "name": 
"stdout", "output_type": "stream", "text": [ "Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n", " dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',\n", " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", " strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n", " tokenizer=None, vocabulary=None)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])\n" ] } ], "source": [ "print gs_clf.best_estimator_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reddit data" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train = pd.read_csv(\"reddit_train_top20\")" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": true }, "outputs": [], "source": [ "test = pd.read_csv(\"reddit_test_top20\")" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "28559" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(train)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "33640" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(test)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>title</th>\n      <th>subreddit</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>[PS4] LF5M (who has) HM gate keeper CP.</td>\n      <td>Fireteams</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>POV view in competitive</td>\n      <td>leagueoflegends</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>If you were given the chance to go back and re...</td>\n      <td>AskReddit</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>[H] FN Howl 0.04fv + MW Vulcan [W] Knife Offers</td>\n      <td>GlobalOffensiveTrade</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>STOP PRESSING THE BUTTON</td>\n      <td>thebutton</td>\n    </tr>\n  </tbody>\n</table>\n</div>
" ], "text/plain": [ " title subreddit\n", "0 [PS4] LF5M (who has) HM gate keeper CP. Fireteams\n", "1 POV view in competitive leagueoflegends\n", "2 If you were given the chance to go back and re... AskReddit\n", "3 [H] FN Howl 0.04fv + MW Vulcan [W] Knife Offers GlobalOffensiveTrade\n", "4 STOP PRESSING THE BUTTON thebutton" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }