# Building the baseline classifier

We'll now do a basic round of supervised classification using scikit-learn, starting by loading the data. This dataset already contains the final classifications, which lets us work out our accuracy rate, but we'll ignore them initially and pretend we're starting from scratch.

In [33]:
import pandas as pd

In [34]:
df = pd.read_csv('singapore-roadnames-final-classified.csv')

In [35]:
df

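| Unnamed: 0.1 | Unnamed: 0 | road_name | has_malay_road_tag | classification | comment |
|---|---|---|---|---|---|
| 0 | 0 | Abingdon | 0 | British | |
| 1 | 1 | Abu Talib | 1 | Malay | |
| 2 | 2 | Adam | 0 | British | |
| 3 | 3 | Adat | 1 | Malay | |
| 4 | 4 | Adis | 0 | Other | Indian Jewish |
| 5 | 5 | Admiralty | 0 | British | |
| 6 | 6 | Ah Hood | 0 | Chinese | |
| 7 | 7 | Ah Soo | 1 | Chinese | |
| 8 | 8 | Ahmad Ibrahim | 1 | Malay | |
| 9 | 9 | Aida | 0 | Other | |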


In this step, we'll use about 10% of the data to mimic the process I actually used.

## Step 0: putting the data together

In [36]:
# let's pick a random 10% of the rows to work with

import random
random.seed(1965)
# random.sample needs a real sequence, so convert the index to a list first
train_test_set = df.loc[random.sample(list(df.index), len(df) // 10)]

X = train_test_set['road_name']
y = train_test_set['classification']

In [37]:
list(zip(X, y))[::10]

[('Opal', 'Generic'),
 ('Club', 'Generic'),
 ('Minto', 'Other'),
 ('Woodlands', 'Generic'),
 ('Hai Sing', 'Chinese'),
 ('Batalong', 'Malay'),
 ('Hikayat', 'Malay'),
 ('Bassein', 'Other'),
 ('Mount Echo', 'Generic'),
 ('Kallang Pudding', 'Malay'),
 ('Republic', 'Generic'),
 ('Wan Tho', 'Chinese'),
 ('Rengkam', 'Malay'),
 ('Keong Saik', 'Chinese'),
 ('Sedap', 'Malay'),
 ('Stratton', 'British'),
 ('Seagull', 'Generic'),
 ('Manila', 'Other')]

You should never train and test on the same data, so we'll split this subset further into a training set and a test set. scikit-learn provides a convenient function for this, `train_test_split`.

In [38]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_true = train_test_split(X, y)
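
To make the split less of a black box, here is a minimal pure-Python sketch of a shuffle-and-hold-out split. It assumes the scikit-learn default of holding out 25% of the samples, rounded up, which is consistent with the 131/44 matrix shapes reported in Step 2. `simple_split` is a hypothetical helper written for illustration, not scikit-learn's implementation, and the sample names are placeholders.

```python
import math
import random


def simple_split(items, labels, test_fraction=0.25, seed=None):
    """Shuffle the indices, then hold out a fraction of them for testing."""
    rng = random.Random(seed)
    indices = list(range(len(items)))
    rng.shuffle(indices)
    n_test = int(math.ceil(test_fraction * len(items)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# 175 placeholder samples, the size of our 10% subset
names = ["road_%d" % i for i in range(175)]
tags = ["label_%d" % (i % 6) for i in range(175)]
X_tr, X_te, y_tr, y_te = simple_split(names, tags, seed=0)
print(len(X_tr), len(X_te))
```

Every sample ends up in exactly one of the two parts, so the model is never scored on a name it was trained on.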

## Step 1: Figure out your classification labels

This was actually one of the trickiest parts of the process. These are the labels I finally decided on:

* Malay (including Indonesian/Bugis names)
* British
* Chinese (all languages ("dialects"))
* Indian (all languages)
* Other (e.g. other European names, Jewish names, Armenian names...)
* Generic (Temple Street, Sunrise Avenue, etc)

Something to bear in mind is that some streets can be classified in multiple ways. For example, is Queen Street "British" or "Generic"? In this case I selected "British" because it was specifically named after Queen Victoria. I tried to be consistent in my criteria, but up to ~5% of the roads might be arguable. For some roads there was also too little information to go on, so I went with my gut feeling about the orthotactics of the word (its letter patterns).

In [39]:
df.classification.value_counts()

Malay      614
British    518
Generic    255
Chinese    217
Other      119
Indian      28
dtype: int64

## Step 2: decide what features to use

What we're doing is basically language classification, and n-grams are a common choice of features for that. scikit-learn conveniently provides a class, `CountVectorizer`, that counts n-grams for us. Here we count character n-grams of one to four characters.

In [40]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(ngram_range=(1,4), analyzer='char')

# fit_transform for the training data
X_train_feats = vect.fit_transform(X_train)
# transform for the test data
# because we need to match the ngrams that were found in the training set 
X_test_feats  = vect.transform(X_test) 

print(type(X_train_feats))
print(X_train_feats.shape)
print(X_test_feats.shape)

<class 'scipy.sparse.csr.csr_matrix'>
(131, 1410)
(44, 1410)
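
To get a feel for what those 1410 columns hold, here is a rough pure-Python approximation of the character n-grams counted for a single name. `char_ngrams` is a hypothetical helper for illustration only; it ignores details of the real vectorizer such as whitespace normalisation and the shared vocabulary built across all training names.

```python
from collections import Counter


def char_ngrams(text, min_n=1, max_n=4):
    """Count every run of min_n..max_n consecutive characters in text."""
    text = text.lower()  # CountVectorizer lowercases by default
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


grams = char_ngrams("Adam")
print(sorted(grams.items()))
```

Each distinct n-gram seen in the training names becomes one column of the feature matrix, and each row records how often those n-grams occur in one road name.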


## Step 3: pick a classifier

<img width="80%" src="http://scikit-learn.org/stable/_static/ml_map.png">

According to this flowchart, we should be starting out with Linear SVC.

In [41]:
from sklearn.svm import LinearSVC
clf = LinearSVC()

## Step 4: Train the model

Use the classifier to fit a model based on the feature matrix of `X_train` and the label vector of `y_train`.

In [42]:
model = clf.fit(X_train_feats, y_train)

## Step 5: Predict the labels of the test set

Now that we have our model, we can use it to predict labels for the held-out test set.

In [43]:
y_predicted = model.predict(X_test_feats)

In [44]:
y_predicted

array(['Malay', 'Malay', 'British', 'Malay', 'British', 'British',
       'British', 'British', 'British', 'British', 'Malay', 'Chinese',
       'British', 'Chinese', 'British', 'Other', 'Generic', 'Malay',
       'Malay', 'Chinese', 'British', 'British', 'Malay', 'British',
       'British', 'Generic', 'Other', 'British', 'British', 'British',
       'British', 'British', 'Malay', 'Generic', 'Malay', 'Generic',
       'Malay', 'British', 'Malay', 'British', 'British', 'Malay', 'Malay',
       'Generic'], dtype=object)

## Step 6: select an evaluation metric

scikit-learn comes with a bunch of evaluation metrics. Which one should be chosen depends on what we're trying to minimise/maximise. In this case, we want to make as few errors as possible, so it makes sense to use accuracy as our metric.

$$ \text{accuracy} = \frac{\#\,\text{correct}}{\#\,\text{classified}} $$
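
The formula is simple enough to write by hand. A minimal sketch follows; the `accuracy` helper and the labels are made up for illustration, and the cells below use scikit-learn's `accuracy_score` instead.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction matches the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / float(len(true_labels))


# Made-up labels, purely for illustration
truth = ["Malay", "British", "Chinese", "Malay", "Generic"]
guess = ["Malay", "British", "Malay", "Malay", "British"]
result = accuracy(truth, guess)
print(result)  # 3 of the 5 guesses match
```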

In [45]:
from sklearn.metrics import accuracy_score

In [46]:
accuracy_score(y_true, y_predicted)

0.59090909090909094

So we got about 59% accuracy. Let's try it with a few more train/test splits to see whether this is typical.

In [47]:
def classify(X, y):
    # do the train-test split
    X_train, X_test, y_train, y_true = train_test_split(X, y)

    # get our features
    X_train_feats = vect.fit_transform(X_train)
    X_test_feats  = vect.transform(X_test) 

    # train our model
    model = clf.fit(X_train_feats, y_train)
    
    # predict labels on the test set
    y_predicted = model.predict(X_test_feats)
    
    # return the accuracy score obtained
    return accuracy_score(y_true, y_predicted)

In [50]:
scores = list()
num_expts = 100
for i in range(num_expts):
    score = classify(X,y)
    scores.append(score)
    
print(sum(scores) / num_expts)

0.551818181818
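
A single mean hides how much the score moves from split to split, which matters with a test set this small. One way to look at the spread is sketched below. The numbers in `example_scores` are hypothetical stand-ins, not results from this experiment; in the notebook you would pass the `scores` list instead.

```python
import statistics

# Hypothetical accuracy scores standing in for the `scores` list above
example_scores = [0.59, 0.52, 0.61, 0.50, 0.57, 0.48, 0.55, 0.58]

mean_score = statistics.mean(example_scores)
spread = statistics.stdev(example_scores)  # sample standard deviation
print("mean=%.3f stdev=%.3f min=%.2f max=%.2f" % (
    mean_score, spread, min(example_scores), max(example_scores)))
```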


## Conclusion

The accuracy we obtain with this set of features and this classifier is about 55%. This isn't completely terrible: with 6 categories, a classifier guessing uniformly at random would get only about 16.7% right, and one that always guessed the most common label (Malay, 614 of 1751 roads) would get about 35%. But 55% accuracy also means that I'd have to go through and correct nearly every other label. How can we improve this?
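
For reference, chance-level baselines can be worked out from the label counts in Step 1. The sketch below computes two: guessing uniformly at random, and always guessing the most common label. It assumes the 10% subset has roughly the same label mix as the full dataset.

```python
# Label counts copied from the value_counts() output in Step 1
label_counts = {
    "Malay": 614, "British": 518, "Generic": 255,
    "Chinese": 217, "Other": 119, "Indian": 28,
}
total = sum(label_counts.values())

# Guess uniformly at random among the six labels
uniform_baseline = 1.0 / len(label_counts)
# Always guess the most common label
majority_baseline = max(label_counts.values()) / float(total)

print("uniform:  %.3f" % uniform_baseline)
print("majority: %.3f" % majority_baseline)
```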

There are a few ways that spring to mind:

* Increase the amount of data - easier said than done
* Try different classifiers - scikit-learn makes this dead easy
* Use more features - worth a try (and we will)
* Adjust the hyperparameters of the classifiers - more on this later
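
As a taste of the second bullet, here is a rough sketch of swapping classifiers while keeping the same character n-gram features. The choice of `MultinomialNB` and `LogisticRegression` as alternatives is my assumption, not something tested above, and the handful of names (taken from the samples shown earlier) is far too small to give meaningful scores; it is just enough to run the loop.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative sample drawn from the road names shown earlier
train_names = ["Abu Talib", "Adat", "Ahmad Ibrahim", "Batalong", "Hikayat",
               "Abingdon", "Adam", "Admiralty",
               "Ah Hood", "Ah Soo", "Hai Sing", "Wan Tho"]
train_labels = ["Malay"] * 5 + ["British"] * 3 + ["Chinese"] * 4
test_names = ["Rengkam", "Sedap", "Stratton", "Keong Saik"]
test_labels = ["Malay", "Malay", "British", "Chinese"]

candidates = {
    "LinearSVC": LinearSVC(),
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

results = {}
for name, estimator in candidates.items():
    # Bundle the n-gram features and the classifier into one pipeline
    pipeline = make_pipeline(
        CountVectorizer(ngram_range=(1, 4), analyzer="char"), estimator)
    pipeline.fit(train_names, train_labels)
    # score() reports accuracy on the held-out names
    results[name] = pipeline.score(test_names, test_labels)

for name, score in sorted(results.items()):
    print("%-20s %.2f" % (name, score))
```

Wrapping the vectorizer and classifier in a pipeline also removes the reliance on the global `vect` and `clf` objects used in `classify`.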