# Scikit-learn tutorial

This tutorial walks through a complete machine learning workflow, from loading the data to making predictions and evaluating them properly. We focus on textual data throughout.

We use Python and, specifically, the <a href="http://scikit-learn.org/">scikit-learn</a> library.

Parts of this tutorial are based on http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html.

## Data

We work with one of the classic text datasets in machine learning, the <a href="http://qwone.com/~jason/20Newsgroups/">20 newsgroups dataset</a>. scikit-learn ships a loader for it that downloads the data on first use.

It consists of roughly 20,000 newsgroup documents, partitioned across 20 different newsgroups.

In [1]:
from sklearn.datasets import fetch_20newsgroups

In [2]:
train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)



In [3]:
print(type(train))

<class 'sklearn.datasets.base.Bunch'>


In [4]:
print(train.keys())

['description', 'DESCR', 'filenames', 'target_names', 'data', 'target']


In [5]:
print(train.data[:1])

[u"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"]


In [6]:
print(train.target[:1])

[7]


In [7]:
print(train.target_names[:1])

['alt.atheism']
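Note that `train.target_names[:1]` shows the first entry of the name list, not the category of the first document. To get a document's category name, index `target_names` with that document's integer target. A minimal sketch, using small hypothetical stand-ins for the real arrays (the three names and four labels below are illustrative only):

```python
import numpy as np

# Hypothetical stand-ins for train.target_names and train.target.
target_names = ['alt.atheism', 'comp.graphics', 'rec.autos']
target = np.array([2, 0, 2, 1])

# Map each integer label to its human-readable category name.
labels = [target_names[i] for i in target]
print(labels[0])
```

With the real data, the same lookup would be `train.target_names[train.target[0]]`.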


In [8]:
print(set(train.target_names))

set(['rec.motorcycles', 'comp.sys.mac.hardware', 'talk.politics.misc', 'soc.religion.christian', 'comp.graphics', 'sci.med', 'talk.religion.misc', 'comp.windows.x', 'comp.sys.ibm.pc.hardware', 'talk.politics.guns', 'alt.atheism', 'comp.os.ms-windows.misc', 'sci.crypt', 'sci.space', 'misc.forsale', 'rec.sport.hockey', 'rec.sport.baseball', 'sci.electronics', 'rec.autos', 'talk.politics.mideast'])


In [9]:
print(len(train.data))

11314


In [10]:
x_train = train.data
y_train = train.target

In [11]:
test = fetch_20newsgroups(subset='test', shuffle=True, random_state=42)

In [12]:
print(len(test.data))

7532


In [13]:
x_test = test.data
y_test = test.target

## Goal

Classify newsgroup postings into their respective categories based solely on their text.

## Feature engineering

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

In [15]:
vec = CountVectorizer()
x_train_f = vec.fit_transform(x_train)

In [16]:
print(x_train_f.shape)

(11314, 130107)


In [17]:
print(x_train_f[0, 0:20000])

  (0, 4605)	1
  (0, 16574)	1
  (0, 18299)	1
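The output above lists only the non-zero cells as `(row, column) count` pairs, because `CountVectorizer` returns a SciPy sparse matrix rather than a dense array. The following sketch illustrates this on a tiny hypothetical corpus (the documents and variable names are assumptions made for illustration, not part of the tutorial data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for x_train.
toy_docs = ["the car is red", "the red car is fast", "apple pie"]

toy_vec = CountVectorizer()
X = toy_vec.fit_transform(toy_docs)  # sparse matrix, one row per document

n_docs, n_terms = X.shape
density = X.nnz / float(n_docs * n_terms)  # fraction of non-zero cells

# Column indices and counts of the non-zero cells in the first document.
row0 = X[0]
cols = row0.nonzero()[1]
counts = {int(c): int(row0[0, c]) for c in cols}

print(X.shape, X.nnz, round(density, 3))
print(counts)
```

On the real data, with 130,107 columns and only a few hundred distinct words per posting, the matrix is far sparser than in this toy case, which is why a sparse format is used.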


In [18]:
x_test_f = vec.transform(x_test)

In [19]:
print(x_test_f.shape)

(7532, 130107)


In [20]:
vec.vocabulary_.get('apple')

28856
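The `vocabulary_` attribute maps each token to its column index, and `transform` reuses that fitted mapping, which is why the test matrix has the same number of columns as the training matrix. A small sketch on a hypothetical corpus (all documents and names below are illustrative assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for x_train.
toy_train = ["the car is red", "the red car is fast", "apple pie"]
toy_vec = CountVectorizer().fit(toy_train)

# vocabulary_ maps each token to a column index (assigned alphabetically).
apple_col = toy_vec.vocabulary_.get('apple')
missing = toy_vec.vocabulary_.get('banana')  # None: token never seen in fit

# transform() keeps the fitted columns; unseen tokens are silently dropped.
new_row = toy_vec.transform(["banana car car"]).toarray()[0]
print(apple_col, missing, new_row.tolist())
```

Tokens that appear only in the test documents therefore contribute nothing to the features.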

## Naive Bayes Classifier

In [21]:
from sklearn.naive_bayes import MultinomialNB

In [22]:
clf = MultinomialNB().fit(x_train_f, y_train)

In [23]:
docs = ["Where is the start menu?", "Most homeruns in a game", "Who was the first man on the moon?"]

In [24]:
predicted = clf.predict(vec.transform(docs))

In [25]:
print(predicted)

[ 5  9 14]


In [26]:
for doc, category in zip(docs, predicted):
    print('%r => %s' % (doc, train.target_names[category]))

'Where is the start menu?' => comp.windows.x
'Most homeruns in a game' => rec.sport.baseball
'Who was the first man on the moon?' => sci.space
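To see what the classifier does on a scale that can be checked by hand, here is a sketch that trains `MultinomialNB` on four hypothetical one-line documents and classifies two short queries. The documents, labels and names are invented for illustration; `alpha=1.0` is the default add-one smoothing.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set: 0 = windows, 1 = baseball.
toy_docs = ["windows start menu desktop", "windows file explorer menu",
            "baseball homerun pitcher game", "pitcher strikes out batter game"]
toy_labels = [0, 0, 1, 1]
toy_names = ['windows', 'baseball']

toy_vec = CountVectorizer()
toy_clf = MultinomialNB(alpha=1.0).fit(toy_vec.fit_transform(toy_docs), toy_labels)

queries = ["start menu", "homerun game"]
q = toy_vec.transform(queries)
pred = toy_clf.predict(q)
proba = toy_clf.predict_proba(q)  # one row per query, one column per class

for text, p in zip(queries, pred):
    print('%r => %s' % (text, toy_names[p]))
```

`predict_proba` exposes the per-class posterior estimates, which can help judge how confident a prediction is.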


In [27]:
predicted = clf.predict(x_test_f)

In [28]:
import numpy as np

In [29]:
print(np.mean(predicted == y_test))

0.772835900159


In [30]:
from sklearn.metrics import f1_score

In [31]:
f1_score(y_test,predicted,average="weighted")

0.75111275774411768

## Pipeline

In [32]:
from sklearn.pipeline import Pipeline

In [33]:
clf = Pipeline([('vect', CountVectorizer()),
                ('clf', MultinomialNB()),
                ])

In [34]:
clf.fit(x_train, y_train)

Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [35]:
predicted = clf.predict(x_test)

In [36]:
print(np.mean(predicted == y_test))

0.772835900159
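The accuracy is identical to the earlier result because the pipeline chains the same two steps: on `fit` it calls `fit_transform` on the vectorizer and then fits the classifier, and on `predict` it calls `transform` followed by `predict`. A sketch of that equivalence on hypothetical toy data (documents and labels invented for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 0 = windows, 1 = baseball.
toy_docs = ["windows start menu desktop", "windows file explorer menu",
            "baseball homerun pitcher game", "pitcher strikes out batter game"]
toy_labels = [0, 0, 1, 1]
new_docs = ["start menu", "homerun game"]

# Manual two-step version.
toy_vec = CountVectorizer()
nb = MultinomialNB().fit(toy_vec.fit_transform(toy_docs), toy_labels)
manual = nb.predict(toy_vec.transform(new_docs))

# Same steps wrapped in a pipeline that accepts raw text.
pipe = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])
piped = pipe.fit(toy_docs, toy_labels).predict(new_docs)

same = np.array_equal(manual, piped)
print(same)
```

Besides being shorter, the pipeline makes it harder to accidentally fit the vectorizer on test data.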


In [37]:
from sklearn.metrics import classification_report

In [38]:
print(classification_report(y_test, predicted))

             precision    recall  f1-score   support

          0       0.79      0.77      0.78       319
          1       0.67      0.74      0.70       389
          2       0.20      0.00      0.01       394
          3       0.56      0.77      0.65       392
          4       0.84      0.75      0.79       385
          5       0.65      0.84      0.73       395
          6       0.93      0.65      0.77       390
          7       0.87      0.91      0.89       396
          8       0.96      0.92      0.94       398
          9       0.96      0.87      0.91       397
         10       0.93      0.96      0.95       399
         11       0.67      0.95      0.78       396
         12       0.79      0.66      0.72       393
         13       0.87      0.82      0.85       396
         14       0.83      0.89      0.86       394
         15       0.70      0.96      0.81       398
         16       0.69      0.91      0.79       364
         17       0.85      0.94      0.89   
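The report above labels each row with the integer class index. Passing `target_names` makes the rows easier to read. A minimal sketch with hypothetical labels and a three-name subset chosen for illustration:

```python
from sklearn.metrics import classification_report

# Hypothetical subset of category names and toy labels.
names = ['alt.atheism', 'comp.graphics', 'rec.autos']
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2]

# target_names replaces the integer row labels with readable names.
report = classification_report(y_true, y_pred, target_names=names)
print(report)
```

For the newsgroup data, the analogous call would be `classification_report(y_test, predicted, target_names=test.target_names)`.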

## Parameter tuning

In [39]:
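# sklearn.grid_search was removed in scikit-learn 0.20;
# GridSearchCV now lives in sklearn.model_selection.
from sklearn.model_selection import GridSearchCV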

In [40]:
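# Keys follow the pattern <pipeline step name>__<parameter name>.
# alpha=0. disables smoothing; it is included because the recorded
# output below was produced with it.
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'clf__alpha': (0., 1., 2.),
}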

In [41]:
gs_clf = GridSearchCV(clf, parameters, n_jobs=-1)

In [42]:
gs_clf.fit(x_train, y_train)

  self.feature_log_prob_ = (np.log(smoothed_fc)


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'vect__ngram_range': [(1, 1), (1, 2)], 'clf__alpha': (0.0, 1.0, 2.0)},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [43]:
predicted = gs_clf.predict(x_test)

In [44]:
print(classification_report(y_test, predicted))

             precision    recall  f1-score   support

          0       0.79      0.77      0.78       319
          1       0.67      0.74      0.70       389
          2       0.20      0.00      0.01       394
          3       0.56      0.77      0.65       392
          4       0.84      0.75      0.79       385
          5       0.65      0.84      0.73       395
          6       0.93      0.65      0.77       390
          7       0.87      0.91      0.89       396
          8       0.96      0.92      0.94       398
          9       0.96      0.87      0.91       397
         10       0.93      0.96      0.95       399
         11       0.67      0.95      0.78       396
         12       0.79      0.66      0.72       393
         13       0.87      0.82      0.85       396
         14       0.83      0.89      0.86       394
         15       0.70      0.96      0.81       398
         16       0.69      0.91      0.79       364
         17       0.85      0.94      0.89   

In [45]:
print(gs_clf.grid_scores_)

[mean: 0.17191, std: 0.00346, params: {'vect__ngram_range': (1, 1), 'clf__alpha': 0.0}, mean: 0.07380, std: 0.00217, params: {'vect__ngram_range': (1, 2), 'clf__alpha': 0.0}, mean: 0.82075, std: 0.00376, params: {'vect__ngram_range': (1, 1), 'clf__alpha': 1.0}, mean: 0.81625, std: 0.00692, params: {'vect__ngram_range': (1, 2), 'clf__alpha': 1.0}, mean: 0.75597, std: 0.00834, params: {'vect__ngram_range': (1, 1), 'clf__alpha': 2.0}, mean: 0.75367, std: 0.00576, params: {'vect__ngram_range': (1, 2), 'clf__alpha': 2.0}]


In [46]:
print(gs_clf.best_estimator_)

Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
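The best estimator uses `ngram_range=(1, 1)` and `alpha=1.0`, the same settings as the untuned pipeline, which explains why the test report did not change. In current scikit-learn releases, `GridSearchCV` is imported from `sklearn.model_selection` and the per-setting scores are exposed through `cv_results_` instead of `grid_scores_`. The sketch below shows that newer interface on a tiny hypothetical corpus; the documents, the grid values and `cv=2` are assumptions chosen only to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 0 = windows, 1 = baseball.
toy_docs = ["windows start menu desktop", "windows file explorer menu",
            "desktop icons and start menu", "windows explorer crashed again",
            "baseball homerun pitcher game", "pitcher strikes out batter game",
            "homerun record in one game", "batter hits baseball far"]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])

# Keys are <step name>__<parameter name>; every combination is tried.
grid = {'vect__ngram_range': [(1, 1), (1, 2)],
        'clf__alpha': [0.5, 1.0]}

gs = GridSearchCV(pipe, grid, cv=2)
gs.fit(toy_docs, toy_labels)

# One entry per parameter combination, with its mean cross-validated score.
results = list(zip(gs.cv_results_['params'], gs.cv_results_['mean_test_score']))
for params, score in results:
    print(params, round(score, 3))
print(gs.best_params_)
```

Because `refit=True` is the default, the fitted search object can be used directly for prediction, as `gs_clf.predict(x_test)` does above.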


## Reddit data

In [48]:
import pandas as pd

In [49]:
train = pd.read_csv("reddit_train_top20")

In [50]:
test = pd.read_csv("reddit_test_top20")

In [51]:
len(train)

28559

In [52]:
len(test)

33640

In [53]:
train.head()

| Unnamed: 0 | title | subreddit |
|---|---|---|
| 0 | [PS4] LF5M (who has) HM gate keeper CP. | Fireteams |
| 1 | POV view in competitive | leagueoflegends |
| 2 | If you were given the chance to go back and re... | AskReddit |
| 3 | [H] FN Howl 0.04fv + MW Vulcan [W] Knife Offers | GlobalOffensiveTrade |
| 4 | STOP PRESSING THE BUTTON | thebutton |
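The `Unnamed: 0` column is the saved row index from the CSV file; passing `index_col=0` to `read_csv` turns it back into the index. One plausible next step, sketched here under the assumption that the goal is to predict `subreddit` from `title`, is to reuse the same pipeline as for the newsgroups. The in-memory CSV below is a hypothetical stand-in for the real files, and most of its rows are invented.

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical in-memory CSV mimicking the layout shown by train.head().
csv_text = u""",title,subreddit
0,STOP PRESSING THE BUTTON,thebutton
1,I pressed the button today,thebutton
2,POV view in competitive,leagueoflegends
3,Best champion for competitive ranked,leagueoflegends
"""

# index_col=0 restores the saved index instead of an "Unnamed: 0" column.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

pipe = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])
pipe.fit(df['title'], df['subreddit'])  # string labels work directly

pred = pipe.predict(["who pressed the button"])
print(pred[0])
```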
