{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Scikit-learn tutorial\n", "\n", "In this tutorial, we will walk through a complete machine learning process, starting with the data and ending with predictions and a proper evaluation. We will focus on textual data.\n", "\n", "Technically, we will use Python and, more specifically, the scikit-learn library." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Parts of this tutorial are based on http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "\n", "We work with one of the classic machine learning text datasets---the so-called 20 Newsgroups dataset. It is directly available in scikit-learn, which downloads it internally on first use.\n", "\n", "It consists of roughly 20,000 newsgroup documents, partitioned across 20 different newsgroups." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.datasets import fetch_20newsgroups" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:sklearn.datasets.twenty_newsgroups:Downloading dataset from http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz (14 MB)\n" ] } ], "source": [ "train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'sklearn.datasets.base.Bunch'>\n" ] } ], "source": [ "print type(train)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['description', 'DESCR', 'filenames', 'target_names', 'data', 'target']\n" ] } ], 
"source": [ "print train.keys()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[u\"From: lerxst@wam.umd.edu (where's my thing)\\nSubject: WHAT car is this!?\\nNntp-Posting-Host: rac3.wam.umd.edu\\nOrganization: University of Maryland, College Park\\nLines: 15\\n\\n I was wondering if anyone out there could enlighten me on this car I saw\\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\\nthe front bumper was separate from the rest of the body. This is \\nall I know. If anyone can tellme a model name, engine specs, years\\nof production, where this car is made, history, or whatever info you\\nhave on this funky looking car, please e-mail.\\n\\nThanks,\\n- IL\\n ---- brought to you by your neighborhood Lerxst ----\\n\\n\\n\\n\\n\"]\n" ] } ], "source": [ "print train.data[:1]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[7]\n" ] } ], "source": [ "print train.target[:1]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['alt.atheism']\n" ] } ], "source": [ "print train.target_names[:1]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "set(['rec.motorcycles', 'comp.sys.mac.hardware', 'talk.politics.misc', 'soc.religion.christian', 'comp.graphics', 'sci.med', 'talk.religion.misc', 'comp.windows.x', 'comp.sys.ibm.pc.hardware', 'talk.politics.guns', 'alt.atheism', 'comp.os.ms-windows.misc', 'sci.crypt', 'sci.space', 'misc.forsale', 'rec.sport.hockey', 'rec.sport.baseball', 'sci.electronics', 'rec.autos', 'talk.politics.mideast'])\n" ] } ], 
"source": [ "print set(train.target_names)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "11314\n" ] } ], "source": [ "print len(train.data)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "x_train = train.data\n", "y_train = train.target" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "test = fetch_20newsgroups(subset='test', shuffle=True, random_state=42)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "7532\n" ] } ], "source": [ "print len(test.data)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "x_test = test.data\n", "y_test = test.target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Goal\n", "\n", "Classify newsgroup postings into their respective newsgroup, based solely on their text." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature engineering\n", "\n", "As features, we use simple bag-of-words token counts: CountVectorizer learns the vocabulary from the training data and turns each document into a sparse vector of word counts." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vec = CountVectorizer()\n", "x_train_f = vec.fit_transform(x_train)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(11314, 130107)\n" ] } ], "source": [ "print x_train_f.shape" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " (0, 4605)\t1\n", " (0, 16574)\t1\n", " (0, 
18299)\t1\n" ] } ], "source": [ "print x_train_f[0,0:20000]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "x_test_f = vec.transform(x_test)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(7532, 130107)\n" ] } ], "source": [ "print x_test_f.shape" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "28856" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vec.vocabulary_.get('apple')" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Naive Bayes Classifier" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.naive_bayes import MultinomialNB" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [], "source": [ "clf = MultinomialNB().fit(x_train_f, y_train)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [], "source": [ "docs = [\"Where is the start menu?\", \"Most homeruns in a game\", \"Who was the first man on the moon?\"]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [], "source": [ "predicted = clf.predict(vec.transform(docs))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 5 9 14]\n" ] } ], "source": [ "print predicted" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "'Where is the start menu?' 
=> comp.windows.x\n", "'Most homeruns in a game' => rec.sport.baseball\n", "'Who was the first man on the moon?' => sci.space\n" ] } ], "source": [ "for doc, category in zip(docs, predicted):\n", " print('%r => %s' % (doc, train.target_names[category]))" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [], "source": [ "predicted = clf.predict(x_test_f)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.772835900159\n" ] } ], "source": [ "print np.mean(predicted==y_test)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.metrics import f1_score" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.75111275774411768" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f1_score(y_test,predicted,average=\"weighted\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pipeline\n", "\n", "A Pipeline chains the vectorizer and the classifier into a single estimator, so we can fit and predict directly on the raw text." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": true }, "outputs": [], "source": [ "clf = Pipeline([('vect', CountVectorizer()),\n", " ('clf', MultinomialNB()),\n", " ])" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n", " dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',\n", " lowercase=True, max_df=1.0, max_features=None, 
min_df=1,\n", " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", " strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n", " tokenizer=None, vocabulary=None)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.fit(x_train, y_train)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": true }, "outputs": [], "source": [ "predicted = clf.predict(x_test)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.772835900159\n" ] } ], "source": [ "print np.mean(predicted==y_test)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.metrics import classification_report" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.79 0.77 0.78 319\n", " 1 0.67 0.74 0.70 389\n", " 2 0.20 0.00 0.01 394\n", " 3 0.56 0.77 0.65 392\n", " 4 0.84 0.75 0.79 385\n", " 5 0.65 0.84 0.73 395\n", " 6 0.93 0.65 0.77 390\n", " 7 0.87 0.91 0.89 396\n", " 8 0.96 0.92 0.94 398\n", " 9 0.96 0.87 0.91 397\n", " 10 0.93 0.96 0.95 399\n", " 11 0.67 0.95 0.78 396\n", " 12 0.79 0.66 0.72 393\n", " 13 0.87 0.82 0.85 396\n", " 14 0.83 0.89 0.86 394\n", " 15 0.70 0.96 0.81 398\n", " 16 0.69 0.91 0.79 364\n", " 17 0.85 0.94 0.89 376\n", " 18 0.58 0.63 0.60 310\n", " 19 0.89 0.33 0.49 251\n", "\n", "avg / total 0.76 0.77 0.75 7532\n", "\n" ] } ], "source": [ "print classification_report(y_test,predicted)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parameter tuning" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from 
sklearn.grid_search import GridSearchCV" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": true }, "outputs": [], "source": [ "parameters = {'vect__ngram_range': [(1, 1), (1, 2)],\n", " 'clf__alpha': (0., 1., 2.),\n", "}" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": true }, "outputs": [], "source": [ "gs_clf = GridSearchCV(clf, parameters, n_jobs=-1)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python2.7/dist-packages/sklearn/naive_bayes.py:664: RuntimeWarning: divide by zero encountered in log\n", " self.feature_log_prob_ = (np.log(smoothed_fc)\n", "/usr/local/lib/python2.7/dist-packages/sklearn/naive_bayes.py:664: RuntimeWarning: divide by zero encountered in log\n", " self.feature_log_prob_ = (np.log(smoothed_fc)\n", "/usr/local/lib/python2.7/dist-packages/sklearn/naive_bayes.py:664: RuntimeWarning: divide by zero encountered in log\n", " self.feature_log_prob_ = (np.log(smoothed_fc)\n", "/usr/local/lib/python2.7/dist-packages/sklearn/naive_bayes.py:664: RuntimeWarning: divide by zero encountered in log\n", " self.feature_log_prob_ = (np.log(smoothed_fc)\n" ] }, { "data": { "text/plain": [ "GridSearchCV(cv=None, error_score='raise',\n", " estimator=Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n", " dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',\n", " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", " strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n", " tokenizer=None, vocabulary=None)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),\n", " fit_params={}, iid=True, n_jobs=-1,\n", " param_grid={'vect__ngram_range': [(1, 1), (1, 2)], 'clf__alpha': (0.0, 1.0, 2.0)},\n", " pre_dispatch='2*n_jobs', refit=True, scoring=None, 
verbose=0)" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gs_clf.fit(x_train, y_train)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": true }, "outputs": [], "source": [ "predicted = gs_clf.predict(x_test)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.79 0.77 0.78 319\n", " 1 0.67 0.74 0.70 389\n", " 2 0.20 0.00 0.01 394\n", " 3 0.56 0.77 0.65 392\n", " 4 0.84 0.75 0.79 385\n", " 5 0.65 0.84 0.73 395\n", " 6 0.93 0.65 0.77 390\n", " 7 0.87 0.91 0.89 396\n", " 8 0.96 0.92 0.94 398\n", " 9 0.96 0.87 0.91 397\n", " 10 0.93 0.96 0.95 399\n", " 11 0.67 0.95 0.78 396\n", " 12 0.79 0.66 0.72 393\n", " 13 0.87 0.82 0.85 396\n", " 14 0.83 0.89 0.86 394\n", " 15 0.70 0.96 0.81 398\n", " 16 0.69 0.91 0.79 364\n", " 17 0.85 0.94 0.89 376\n", " 18 0.58 0.63 0.60 310\n", " 19 0.89 0.33 0.49 251\n", "\n", "avg / total 0.76 0.77 0.75 7532\n", "\n" ] } ], "source": [ "print classification_report(y_test,predicted)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[mean: 0.17191, std: 0.00346, params: {'vect__ngram_range': (1, 1), 'clf__alpha': 0.0}, mean: 0.07380, std: 0.00217, params: {'vect__ngram_range': (1, 2), 'clf__alpha': 0.0}, mean: 0.82075, std: 0.00376, params: {'vect__ngram_range': (1, 1), 'clf__alpha': 1.0}, mean: 0.81625, std: 0.00692, params: {'vect__ngram_range': (1, 2), 'clf__alpha': 1.0}, mean: 0.75597, std: 0.00834, params: {'vect__ngram_range': (1, 1), 'clf__alpha': 2.0}, mean: 0.75367, std: 0.00576, params: {'vect__ngram_range': (1, 2), 'clf__alpha': 2.0}]\n" ] } ], "source": [ "print gs_clf.grid_scores_" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "name": 
"stdout", "output_type": "stream", "text": [ "Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n", " dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',\n", " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", " strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n", " tokenizer=None, vocabulary=None)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])\n" ] } ], "source": [ "print gs_clf.best_estimator_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reddit data" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train = pd.read_csv(\"reddit_train_top20\")" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": true }, "outputs": [], "source": [ "test = pd.read_csv(\"reddit_test_top20\")" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "28559" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(train)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "33640" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(test)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>title</th>\n      <th>subreddit</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>[PS4] LF5M (who has) HM gate keeper CP.</td>\n      <td>Fireteams</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>POV view in competitive</td>\n      <td>leagueoflegends</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>If you were given the chance to go back and re...</td>\n      <td>AskReddit</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>[H] FN Howl 0.04fv + MW Vulcan [W] Knife Offers</td>\n      <td>GlobalOffensiveTrade</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>STOP PRESSING THE BUTTON</td>\n      <td>thebutton</td>\n    </tr>\n  </tbody>\n</table>\n</div>
" ], "text/plain": [ " title subreddit\n", "0 [PS4] LF5M (who has) HM gate keeper CP. Fireteams\n", "1 POV view in competitive leagueoflegends\n", "2 If you were given the chance to go back and re... AskReddit\n", "3 [H] FN Howl 0.04fv + MW Vulcan [W] Knife Offers GlobalOffensiveTrade\n", "4 STOP PRESSING THE BUTTON thebutton" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }