{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Adding more features with Pipelines\n", "\n", "Much of the art of machine learning lies in choosing appropriate features. So far we've only used n-grams. But we often want to add more features, in parallel. We might also want to perform transformations on features such as normalisation. If you look at high-scoring Kaggle competition entries, the classifiers often involve many features and transformations. You can imagine that the code for this can get pretty scraggly.\n", "\"\n", "A solution to this is to use Pipelines. In this section, I'll add a few extra features using scikit-learn's Pipeline object." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The features we'll be adding are these:\n", "\n", "* Number of words in road name\n", " * More words => more likely to be Chinese\n", "* Average word length in road name\n", " * Longer words => more likely to be British or Indian\n", "* Are all words in dictionary\n", " * If yes => likely to be Generic\n", "* Is the road type Malay?\n", " * If yes => very correlated with being Malay" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "df = pd.read_csv('singapore-roadnames-final-classified.csv')" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# let's pick the same random 10% of the data to train with\n", "\n", "import random\n", "random.seed(1965)\n", "train_test_set = df.loc[random.sample(df.index, int(len(df) / 10))]\n", "\n", "X = train_test_set['road_name']\n", "y = train_test_set['classification']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Redo-ing our previous setup with Pipelines\n", "\n", "As a first step, let's redo our previous process with Pipelines." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# our two ingredients: the ngram counter and the classifier\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "vect = CountVectorizer(ngram_range=(1,4), analyzer='char')\n", "\n", "from sklearn.svm import LinearSVC\n", "clf = LinearSVC()" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline, FeatureUnion\n", "\n", "# There are just two steps to our process: extracting the ngrams and\n", "# putting them through the classifier. 
So our Pipeline looks like this:\n", "\n", "pipeline = Pipeline([\n", " ('vect', vect), # extract ngrams from roadnames\n", " ('clf' , clf), # feed the output through a classifier\n", "])" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "from sklearn.metrics import accuracy_score\n", "\n", "def run_experiment(X, y, pipeline, num_expts=100):\n", " scores = list()\n", " for i in range(num_expts):\n", " X_train, X_test, y_train, y_test = train_test_split(X, y)\n", " model = pipeline.fit(X_train, y_train) # train the classifier\n", " y_pred = model.predict(X_test) # apply the model to the test data\n", " score = accuracy_score(y_test, y_pred) # compare the predictions to the gold standard\n", " scores.append(score)\n", "\n", " print sum(scores) / num_expts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Column selection\n", "\n", "Previously, we were operating on a single column of our Pandas dataframe. But our dataframe really has two relevant columns - the text column and the boolean column indicating whether the name occurred with a Malay road tag or not. We'll modify our pipeline to operate on the entire dataframe, which means doing some column selection.\n", "\n", "The way we'll do this is to write custom data transformers which we will use as initial steps in the pipeline. The output of these transformers will be passed on to further steps in the pipeline." ] }, { "cell_type": "code", "execution_count": 152, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# The general shape of a custom data transformer is as follows:\n", "\n", "from sklearn.base import TransformerMixin, BaseEstimator\n", "\n", "class DataTransformer(BaseEstimator, TransformerMixin):\n", " \n", " def __init__(self, vars):\n", " self.vars = vars # this contains whatever variables you need \n", " # to pass in for use in the `transform` step\n", " \n", " def transform(self, data):\n", " # this is the crucial method. It takes in whatever data is passed into\n", " # the transformer as a whole, such as a Pandas dataframe or a numpy array,\n", " # and returns the transformed data\n", " return mydatatransform(data, self.vars)\n", " \n", " def fit(self, *_):\n", " # most of the time, `fit` doesn't need to do anything\n", " # just return `self`\n", " # exceptions: if you're writing a custom classifier,\n", " # or if how the test data is transformed is dependent on\n", " # how the training data was transformed\n", " # Examples of the second type are scalers and the n-gram transformer\n", " return self" ] }
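, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the `fit` caveat concrete, here is a minimal sketch (a hypothetical transformer, not used in this notebook) of the second type: `fit` learns the column means from the training data, and `transform` reuses them, so test data gets centred with the training means rather than its own." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# hypothetical example: a transformer whose transform depends on fit\n", "\n", "class MeanCenterer(BaseEstimator, TransformerMixin):\n", " \n", " def fit(self, data, *_):\n", " # learn the column means from the training data\n", " self.means = data.mean(axis=0)\n", " return self\n", "\n", " def transform(self, data):\n", " # reuse the training means on both training and test data\n", " return data - self.means" ] }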
, { "cell_type": "code", "execution_count": 153, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Now let's actually write our extractor\n", "\n", "class TextExtractor(BaseEstimator, TransformerMixin):\n", " \"\"\"Adapted from code by @zacstewart \n", " https://github.com/zacstewart/kaggle_seeclickfix/blob/master/estimator.py\n", " Also see Zac Stewart's excellent blogpost on pipelines:\n", " http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html\n", " \"\"\"\n", " \n", " def __init__(self, column_name):\n", " self.column_name = column_name\n", "\n", " def transform(self, df):\n", " # select the relevant column and return it as a numpy array\n", " # set the array type to be string\n", " return np.asarray(df[self.column_name]).astype(str)\n", " \n", " def fit(self, *_):\n", " return self" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Now let's update our previous code to operate on the full dataframe\n", "\n", "random.seed(1965)\n", "train_test_set = df.loc[random.sample(df.index, int(len(df) / 10))]\n", "\n", "X = train_test_set[['road_name', 'has_malay_road_tag']]\n", "y = train_test_set['classification']" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pipeline = Pipeline([\n", " ('name_extractor', TextExtractor('road_name')), # extract names from df\n", " ('vect', vect), # extract ngrams from roadnames\n", " ('clf' , clf), # feed the output through a classifier\n", "])" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.553409090909\n" ] } ], "source": [ "run_experiment(X, y, pipeline)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Adding new features based on `road_name`\n", "\n", "The next feature to add is the number of words in the road name. For this we will need to operate on a numpy array of strings and transform it into an array containing the number of words in each string. We'll need similar functions for extracting the average word length, etc. For this reason, I'm going to define a very general `Apply` transformer that takes in a function and applies it element-wise to every element in the numpy array it's supplied with."
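] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before defining it, here's a quick sketch (with made-up example strings) of the two numpy mechanics `Apply` will rely on: `np.vectorize`, which applies a plain Python function element-wise, and reshaping the 1-d array into a column, because scikit-learn would otherwise interpret a 1-d array as a single sample rather than many samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# a quick aside, not part of the pipeline: element-wise application + reshaping\n", "demo = np.asarray(['Jalan Besar', 'Upper Changi Road'])\n", "count_words = np.vectorize(lambda s: len(s.split()))\n", "print count_words(demo.reshape(demo.size, 1)) # one column of word counts"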
] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class Apply(BaseEstimator, TransformerMixin):\n", " \"\"\"Applies a function f element-wise to the numpy array\n", " \"\"\"\n", " \n", " def __init__(self, fn):\n", " self.fn = np.vectorize(fn)\n", " \n", " def transform(self, data):\n", " # note: reshaping is necessary because otherwise sklearn\n", " # interprets 1-d array as a single sample\n", " return self.fn(data.reshape(data.size, 1))\n", "\n", " def fit(self, *_):\n", " return self" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, adding this to our existing Pipeline just won't work. We aren't trying to serially transform the n-grams, but transform the text in parallel with the n-gram extractor. For this, we need to use a FeatureUnion." ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# we already imported FeatureUnion earlier, so here goes\n", "\n", "pipeline = Pipeline([\n", " ('name_extractor', TextExtractor('road_name')), # extract names from df\n", " ('text_features', FeatureUnion([\n", " ('vect', vect), # extract ngrams from roadnames\n", " ('num_words', Apply(lambda s: len(s.split()))), # length of string\n", " ])),\n", " ('clf' , clf), # feed the output through a classifier\n", "])" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.559772727273\n" ] } ], "source": [ "run_experiment(X, y, pipeline)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Okay! That didn't really improve our accuracy that much...let's try another feature\n", "\n", "pipeline = Pipeline([\n", " ('name_extractor', TextExtractor('road_name')), # extract names from df\n", " ('text_features', FeatureUnion([\n", " ('vect', vect), # extract ngrams from roadnames\n", " ('num_words', Apply(lambda s: len(s.split()))), # length of string\n", " ('ave_word_length', Apply(lambda s: np.mean([len(w) for w in s.split()]))), # average word length\n", " ])),\n", " ('clf' , clf), # feed the output through a classifier\n", "])" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.563863636364\n" ] } ], "source": [ "run_experiment(X, y, pipeline)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# That didn't help much either. 
Let's write another transformer that returns True\n", "# if all the words in the roadname are in the dictionary\n", "# we could use Apply and a lambda function for this, but let's be good and build\n", "# the dictionary lookup into its own transformer for better replicability\n", "\n", "class AllDictionaryWords(BaseEstimator, TransformerMixin):\n", " \n", " def __init__(self, dictloc='../resources/scowl-7.1/final/english-words*'):\n", " from glob import glob\n", " self.dictionary = set()\n", " for dictfile in glob(dictloc):\n", " if dictfile.endswith('95'):\n", " continue # skip the size-95 lists, which are the most permissive\n", " with open(dictfile, 'r') as g:\n", " for line in g:\n", " self.dictionary.add(line.strip())\n", "\n", " self.fn = np.vectorize(self.all_words_in_dict)\n", " \n", " def all_words_in_dict(self, s):\n", " return all(word.lower() in self.dictionary\n", " for word in s.split())\n", "\n", " def transform(self, data):\n", " # note: reshaping is necessary because otherwise sklearn\n", " # interprets 1-d array as a single sample\n", " return self.fn(data.reshape(data.size, 1))\n", "\n", " def fit(self, *_):\n", " return self" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [], "source": [ "text_pipeline = Pipeline([\n", " ('name_extractor', TextExtractor('road_name')), # extract names from df\n", " ('text_features', FeatureUnion([\n", " ('vect', vect), # extract ngrams from roadnames\n", " ('num_words', Apply(lambda s: len(s.split()))), # number of words\n", " ('ave_word_length', Apply(lambda s: np.mean([len(w) for w in s.split()]))), # average word length\n", " ('all_dictionary_words', AllDictionaryWords()),\n", " ])),\n", "])\n", "\n", "pipeline = Pipeline([\n", " ('text_pipeline', text_pipeline), # all text features\n", " ('clf' , clf), # feed the output through a classifier\n", "])" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.583181818182\n" ] } ], "source": [ "run_experiment(X, y, pipeline)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That saw a marginal improvement. Now let's add in the feature for the Malay roadnames - which is really just a Boolean column extraction operation."
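] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check (output not shown here), we can peek at the two columns we're now working with:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# a quick look at the Boolean column we're about to turn into a feature\n", "print df[['road_name', 'has_malay_road_tag']].head()"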
] }, { "cell_type": "code", "execution_count": 79, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class BooleanExtractor(BaseEstimator, TransformerMixin):\n", " \n", " def __init__(self, column_name):\n", " self.column_name = column_name\n", "\n", " def transform(self, df):\n", " # select the relevant column and return it as a numpy array\n", " # set the array type to be string\n", " return np.asarray(df[self.column_name]).astype(np.bool)\n", " \n", " def fit(self, *_):\n", " return self" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "collapsed": true }, "outputs": [], "source": [ "malay_pipeline = Pipeline([\n", " ('malay_feature', BooleanExtractor('has_malay_road_tag')),\n", " ('identity', Apply(lambda x: x)), # this is a bit silly but we need to do the transform and this was the easiest way to do it\n", "])\n", "\n", "pipeline = Pipeline([\n", " ('all_features', FeatureUnion([\n", " ('text_pipeline', text_pipeline), # all text features\n", " ('malay_pipeline', malay_pipeline),\n", " ])),\n", " ('clf' , clf), # feed the output through a classifier\n", "])" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.664545454545\n" ] } ], "source": [ "run_experiment(X, y, pipeline)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, some progress - most of it from the addition of the Malay road tag feature, which is really highly predictive of the Malay label. Moreover, the Malay label is the most common label, so it makes sense that improving this results in a larger increase in accuracy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Final notes\n", "\n", "To be clear: adding Pipelines and FeatureUnions does not improve accuracy in and of itself.\n", "It merely helps to organise one's code: if well-indented, it's quite easy to read off what steps are involved in the pipeline. Machine learning often involves a lot of experimentation, adding and subtracting features and transformations, so having a clear understanding of the pipeline is crucial.\n", "\n", "Another point to note is that there are shortcut functions `make_pipeline` and `make_union` that simplify the writing of Pipelines by removing the need (or ability) to supply names for each of the steps. 
So we can rewrite the pipeline above as follows:" ] }, { "cell_type": "code", "execution_count": 158, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.pipeline import make_pipeline, make_union\n", "\n", "def num_words(s):\n", " return len(s.split())\n", "\n", "def ave_word_length(s):\n", " return np.mean([len(w) for w in s.split()])\n", "\n", "def identity(s):\n", " return s\n", "\n", "from sklearn.preprocessing import StandardScaler, MinMaxScaler\n", "\n", "pipeline = make_pipeline(\n", " # features\n", " make_union(\n", " # text features\n", " make_pipeline(\n", " TextExtractor('road_name'),\n", " make_union(\n", " CountVectorizer(ngram_range=(1,4), analyzer='char'),\n", " make_pipeline(\n", " Apply(num_words), # number of words\n", " MinMaxScaler()\n", " ),\n", " make_pipeline(\n", " Apply(ave_word_length), # average length of words\n", " StandardScaler()\n", " ),\n", " AllDictionaryWords(),\n", " ),\n", " ),\n", " # malay feature\n", " make_pipeline(\n", " BooleanExtractor('has_malay_road_tag'),\n", " Apply(identity),\n", " )\n", " ),\n", " # classifier\n", " LinearSVC(),\n", ")" ] }, { "cell_type": "code", "execution_count": 159, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.662045454545\n" ] } ], "source": [ "run_experiment(X, y, pipeline)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }