# Adding more features with Pipelines

Much of the art of machine learning lies in choosing appropriate features. So far we've only used n-grams, but we often want to add more features in parallel. We might also want to transform features, for example by normalising them. If you look at high-scoring Kaggle competition entries, the classifiers often involve many features and transformations. You can imagine that the code for this can get pretty scraggly.

A solution to this is to use Pipelines. In this section, I'll add a few extra features using scikit-learn's Pipeline object.

The features we'll be adding are these:

* Number of words in road name
 * More words => more likely to be Chinese
* Average word length in road name
 * Longer words => more likely to be British or Indian
* Are all words in dictionary
 * If yes => likely to be Generic
* Is the road type Malay?
 * If yes => very correlated with being Malay

In [29]:
import pandas as pd
import numpy as np
df = pd.read_csv('singapore-roadnames-final-classified.csv')

In [30]:
# let's pick the same random 10% of the data to train with

import random
random.seed(1965)
train_test_set = df.loc[random.sample(list(df.index), len(df) // 10)]

X = train_test_set['road_name']
y = train_test_set['classification']

## Redo-ing our previous setup with Pipelines

As a first step, let's redo our previous process with Pipelines.

In [31]:
# our two ingredients: the ngram counter and the classifier
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(ngram_range=(1,4), analyzer='char')

from sklearn.svm import LinearSVC
clf = LinearSVC()

In [32]:
from sklearn.pipeline import Pipeline, FeatureUnion

# There are just two steps to our process: extracting the ngrams and
# putting them through the classifier. So our Pipeline looks like this:

pipeline = Pipeline([
    ('vect', vect),  # extract ngrams from roadnames
    ('clf', clf),    # feed the output through a classifier
])
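
As a sanity check, here is a minimal sketch showing that a two-step Pipeline gives the same prediction as running the vectoriser and classifier by hand. The road names and labels are a toy sample invented for illustration (they are not taken from the dataset), and `random_state=0` is only there to make the two fits comparable.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data, invented purely for illustration
toy_names = ["Jalan Besar", "Jalan Kayu", "Lorong Chuan",
             "Orchard Road", "Dover Road", "Victoria Street"]
toy_labels = ["Malay", "Malay", "Malay", "British", "British", "British"]

# Manual version: vectorise, then fit the classifier on the counts
manual_vect = CountVectorizer(ngram_range=(1, 4), analyzer='char')
manual_clf = LinearSVC(random_state=0)
manual_clf.fit(manual_vect.fit_transform(toy_names), toy_labels)
manual_pred = manual_clf.predict(manual_vect.transform(["Jalan Sultan"]))

# Pipeline version: the same two steps chained into a single estimator
toy_pipeline = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 4), analyzer='char')),
    ('clf', LinearSVC(random_state=0)),
])
toy_pipeline.fit(toy_names, toy_labels)
pipeline_pred = toy_pipeline.predict(["Jalan Sultan"])
```

The Pipeline takes care of calling `fit_transform` on the training data and `transform` on new data, which is exactly the bookkeeping the manual version does by hand.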

In [33]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def run_experiment(X, y, pipeline, num_expts=100):
    scores = list()
    for i in range(num_expts):
        X_train, X_test, y_train, y_true = train_test_split(X, y)
        model = pipeline.fit(X_train, y_train)  # train the classifier
        y_test = model.predict(X_test)          # apply the model to the test data
        score = accuracy_score(y_true, y_test)  # compare the results to the gold standard
        scores.append(score)

    print(sum(scores) / num_expts)

## Column selection

Previously, we were operating on a single column of our Pandas dataframe. But our dataframe really has two relevant columns - the text column and the boolean column indicating whether the name occurred with a Malay road tag or not. We'll modify our pipeline to operate on the entire dataframe, which means doing some column selection.

The way we'll do this is to write custom data transformers which we will use as initial steps in the pipeline. The output of this transformer will be passed on to further steps in the pipeline.

In [152]:
# The general shape of a custom data transformer is as follows:

from sklearn.base import TransformerMixin, BaseEstimator

class DataTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, vars):
        self.vars = vars  # this contains whatever variables you need
                          # to pass in for use in the `transform` step

    def transform(self, data):
        # this is the crucial method. It takes in whatever data is passed into
        # the transformer as a whole, such as a Pandas dataframe or a numpy array,
        # and returns the transformed data
        return mydatatransform(data, self.vars)

    def fit(self, *_):
        # most of the time, `fit` doesn't need to do anything
        # just return `self`
        # exceptions: if you're writing a custom classifier,
        # or if how the test data is transformed is dependent on
        # how the training data was transformed
        # Examples of the second type are scalers and the n-gram transformer
        return self
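
To illustrate the exception mentioned in the `fit` comments, here is a minimal sketch of a transformer whose `transform` depends on what was seen during training. `MeanCenterer` is a hypothetical class written for this example, not part of scikit-learn or of the original notebook.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical example: subtracts the column means learned in `fit`."""

    def fit(self, X, y=None):
        # here `fit` does real work: it remembers the training means
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        # the stored training means are applied to whatever data comes in
        return np.asarray(X, dtype=float) - self.mean_

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0]])

centerer = MeanCenterer().fit(train)
centered_test = centerer.transform(test)  # uses the training mean, not the test mean
```

The test value is shifted by the mean of the training data, which is why `fit` cannot be a no-op for this kind of transformer.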

In [153]:
# Now let's actually write our extractor

class TextExtractor(BaseEstimator, TransformerMixin):
    """Adapted from code by @zacstewart
    https://github.com/zacstewart/kaggle_seeclickfix/blob/master/estimator.py
    Also see Zac Stewart's excellent blogpost on pipelines:
    http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html
    """

    def __init__(self, column_name):
        self.column_name = column_name

    def transform(self, df):
        # select the relevant column and return it as a numpy array
        # set the array type to be string
        return np.asarray(df[self.column_name]).astype(str)

    def fit(self, *_):
        return self
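
To see what the column selection hands on to the next step, here is a small sketch that applies the same expression used in `transform` to a toy dataframe. The two rows are invented for illustration; only the column names match the ones used in this section.

```python
import numpy as np
import pandas as pd

# Toy dataframe with the two columns used in this section (values are made up)
toy_df = pd.DataFrame({
    'road_name': ['Jalan Besar', 'Orchard Road'],
    'has_malay_road_tag': [True, False],
})

# Same expression as in the body of the extractor's `transform` method
toy_column = np.asarray(toy_df['road_name']).astype(str)
```

The result is a 1-d numpy array of strings, which is the input format `CountVectorizer` expects.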

In [82]:
# Now let's update our previous code to operate on the full dataframe

random.seed(1965)
train_test_set = df.loc[random.sample(list(df.index), len(df) // 10)]

X = train_test_set[['road_name', 'has_malay_road_tag']]
y = train_test_set['classification']

In [77]:
pipeline = Pipeline([
    ('name_extractor', TextExtractor('road_name')),  # extract names from df
    ('vect', vect),                                  # extract ngrams from roadnames
    ('clf', clf),                                    # feed the output through a classifier
])

In [78]:
run_experiment(X, y, pipeline)

0.553409090909


## Adding new features based on `road_name`

The next feature to add is the number of words in the road name. For this we will need to operate on a numpy array of strings and transform it into the number of words in each string. We'll need to add similar functions for extracting the average word length, etc. For this reason, I'm going to define a very general Apply transformer that takes in a function and applies it element-wise to every element in the numpy array it's supplied with.

In [53]:
class Apply(BaseEstimator, TransformerMixin):
    """Applies a function f element-wise to the numpy array
    """

    def __init__(self, fn):
        self.fn = np.vectorize(fn)

    def transform(self, data):
        # note: reshaping is necessary because otherwise sklearn
        # interprets 1-d array as a single sample
        return self.fn(data.reshape(data.size, 1))

    def fit(self, *_):
        return self
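
The reshaping step is easier to see on a concrete array. The sketch below uses `np.vectorize` with a word-counting function on three example road names (chosen for illustration) and compares the output shapes with and without the reshape.

```python
import numpy as np

# Element-wise word counter, built the same way as in the transformer above
count_words = np.vectorize(lambda s: len(s.split()))

road_names = np.array(['Jalan Besar', 'Ang Mo Kio Avenue 1', 'Orchard Road'])

flat = count_words(road_names)                                # 1-d: shape (3,)
column = count_words(road_names.reshape(road_names.size, 1))  # 2-d: shape (3, 1)
```

The second form has one row per road name and one feature column, which is the layout scikit-learn expects for a feature matrix.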

However, adding this to our existing Pipeline just won't work. We aren't trying to serially transform the n-grams, but transform the text in parallel with the n-gram extractor. For this, we need to use a FeatureUnion.

In [48]:
# we already imported FeatureUnion earlier, so here goes

pipeline = Pipeline([
    ('name_extractor', TextExtractor('road_name')),  # extract names from df
    ('text_features', FeatureUnion([
        ('vect', vect),                                  # extract ngrams from roadnames
        ('num_words', Apply(lambda s: len(s.split()))),  # number of words in the name
    ])),
    ('clf', clf),  # feed the output through a classifier
])

In [49]:
run_experiment(X, y, pipeline)

0.559772727273
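
To make the "in parallel" part concrete, here is a minimal sketch of what a FeatureUnion produces: each transformer sees the same input, and their outputs are stacked side by side as columns. `WordCounter` is a hypothetical stand-in written for this example, and the three road names are illustrative.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

class WordCounter(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: one column holding the word count of each name."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(s.split())] for s in X])

road_names = np.array(['Jalan Besar', 'Orchard Road', 'Ang Mo Kio Avenue 1'])

union = FeatureUnion([
    ('vect', CountVectorizer(ngram_range=(1, 2), analyzer='char')),
    ('num_words', WordCounter()),
])
combined = union.fit_transform(road_names)

# Number of n-gram columns contributed by the first transformer
n_ngram_columns = len(dict(union.transformer_list)['vect'].vocabulary_)
```

The combined matrix has all the n-gram columns followed by a single extra column for the word count, so the classifier sees both kinds of feature at once.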


In [51]:
# Okay! That didn't really improve our accuracy that much...let's try another feature

pipeline = Pipeline([
    ('name_extractor', TextExtractor('road_name')),  # extract names from df
    ('text_features', FeatureUnion([
        ('vect', vect),                                  # extract ngrams from roadnames
        ('num_words', Apply(lambda s: len(s.split()))),  # number of words in the name
        ('ave_word_length', Apply(lambda s: np.mean([len(w) for w in s.split()]))),  # average word length
    ])),
    ('clf', clf),  # feed the output through a classifier
])

In [52]:
run_experiment(X, y, pipeline)

0.563863636364


In [63]:
# That didn't help much either. Let's write another transformer that returns True
# if all the words in the roadname are in the dictionary
# we could use Apply and a lambda function for this, but let's be good and pass
# in the dictionary of words for better replicability

class AllDictionaryWords(BaseEstimator, TransformerMixin):

    def __init__(self, dictloc='../resources/scowl-7.1/final/english-words*'):
        from glob import glob
        self.dictionary = dict()
        for dictfile in glob(dictloc):
            if dictfile.endswith('95'):
                continue
            with open(dictfile, 'r') as g:
                for line in g.readlines():
                    self.dictionary[line.strip()] = 1

        self.fn = np.vectorize(self.all_words_in_dict)

    def all_words_in_dict(self, s):
        # True only if every word of the road name is in the dictionary
        return all(word.lower() in self.dictionary
                   for word in s.split())

    def transform(self, data):
        # note: reshaping is necessary because otherwise sklearn
        # interprets 1-d array as a single sample
        return self.fn(data.reshape(data.size, 1))

    def fit(self, *_):
        return self
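
The dictionary check itself can be tried out without the SCOWL word lists. The sketch below uses a tiny in-memory word set, invented for illustration, in place of the real dictionary files; the standalone function and its `dictionary` argument are assumptions made for this example.

```python
def all_words_in_dict(name, dictionary):
    # True only if every lower-cased word of the name is in the dictionary
    return all(word.lower() in dictionary for word in name.split())

# Tiny stand-in for the real word lists (illustrative only)
toy_dictionary = {'orchard', 'road', 'street', 'hill'}

generic_name = all_words_in_dict('Orchard Road', toy_dictionary)
mixed_name = all_words_in_dict('Besar Road', toy_dictionary)
```

A name made up entirely of dictionary words passes the check, while a single out-of-dictionary word is enough to fail it.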

In [67]:
text_pipeline = Pipeline([
    ('name_extractor', TextExtractor('road_name')),  # extract names from df
    ('text_features', FeatureUnion([
        ('vect', vect),                                  # extract ngrams from roadnames
        ('num_words', Apply(lambda s: len(s.split()))),  # number of words in the name
        ('ave_word_length', Apply(lambda s: np.mean([len(w) for w in s.split()]))),  # average word length
        ('all_dictionary_words', AllDictionaryWords()),
    ])),
])

pipeline = Pipeline([
    ('text_pipeline', text_pipeline),  # all text features
    ('clf', clf),                      # feed the output through a classifier
])

In [68]:
run_experiment(X, y, pipeline)

0.583181818182


That saw a marginal improvement. Now let's add in the feature for the Malay roadnames - which is really just a Boolean column extraction operation.

In [79]:
class BooleanExtractor(BaseEstimator, TransformerMixin):

    def __init__(self, column_name):
        self.column_name = column_name

    def transform(self, df):
        # select the relevant column and return it as a numpy array
        # set the array type to be boolean
        return np.asarray(df[self.column_name]).astype(bool)

    def fit(self, *_):
        return self

In [85]:
malay_pipeline = Pipeline([
    ('malay_feature', BooleanExtractor('has_malay_road_tag')),
    # this is a bit silly but we need to do the transform
    # and this was the easiest way to do it
    ('identity', Apply(lambda x: x)),
])

pipeline = Pipeline([
    ('all_features', FeatureUnion([
        ('text_pipeline', text_pipeline),  # all text features
        ('malay_pipeline', malay_pipeline),
    ])),
    ('clf', clf),  # feed the output through a classifier
])

In [86]:
run_experiment(X, y, pipeline)

0.664545454545


Finally, some progress. Most of it comes from adding the Malay road tag feature, which is highly predictive of the Malay label. Moreover, Malay is the most common label, so it makes sense that predicting it better results in a larger increase in accuracy.

## Final notes

To be clear: adding Pipelines and FeatureUnions does not improve accuracy in and of itself.
It merely helps to organise one's code: if well-indented, it's quite easy to read off what steps are involved in the pipeline. Machine learning often involves a lot of experimentation, adding and subtracting features and transformations, so having a clear understanding of the pipeline is crucial.

Another point to note is that there are shortcut functions `make_pipeline` and `make_union` that simplify the writing of Pipelines by removing the need (or ability) to supply names for each of the steps. So we can rewrite the pipeline above as follows:

In [158]:
from sklearn.pipeline import make_pipeline, make_union

def num_words(s):
    return len(s.split())

def ave_word_length(s):
    return np.mean([len(w) for w in s.split()])

def identity(s):
    return s

from sklearn.preprocessing import StandardScaler, MinMaxScaler

pipeline = make_pipeline(
    # features
    make_union(
        # text features
        make_pipeline(
            TextExtractor('road_name'),
            make_union(
                CountVectorizer(ngram_range=(1,4), analyzer='char'),
                make_pipeline(
                    Apply(num_words), # number of words
                    MinMaxScaler()
                ),
#                 make_pipeline(
#                     Apply(ave_word_length), # average length of words
#                     StandardScaler()
#                 ),
                AllDictionaryWords(),
            ),
        ),
        # NOTE: AveWordLengthExtractor is not defined anywhere in this section
        AveWordLengthExtractor(),
        # malay feature
        make_pipeline(
            BooleanExtractor('has_malay_road_tag'),
            Apply(identity),
        )
    ),
    # classifier
    LinearSVC(),
)

In [159]:
run_experiment(X, y, pipeline)

0.662045454545
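
Although `make_pipeline` does not let you choose step names, the steps still get names, derived from the lower-cased class names. Here is a small sketch on a two-step pipeline; the value `C=0.5` is arbitrary and only shows that the generated names can be used to set parameters.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

auto_named = make_pipeline(
    CountVectorizer(ngram_range=(1, 4), analyzer='char'),
    LinearSVC(),
)

# Names are generated automatically from the class names
step_names = [name for name, _ in auto_named.steps]

# The generated names work with the usual `<step>__<param>` syntax
auto_named.set_params(linearsvc__C=0.5)
```

This matters if you later want to tune the pipeline, since parameter grids refer to steps by these names.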
