{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## What Rangespan to\n", "\n", "- Large catalog of products for retailers like Tesco and Argos\n", "- Middleman between Tesco/Argos and customer orders/returns\n", "- Offer a search engine over products, e.g. audio products.\n", "- Retailers think in terms of product categories, especially for search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Taxonomy Classification\n", "\n", "- Initially contracted out classification to manual workers\n", " - Amazon Mechanical Turk\n", " - Outsources to low-wage countries\n", "- Categories structured as hierarchical tree.\n", " - Root -> Electronics -> Audio -> Amps\n", "- Input: raw product data, output: category." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pipeline\n", "\n", "- Data gathering\n", " - Name, Manufacturer, Description, Label\n", "- Feature extraction\n", "\n", "## Feature extraction\n", "\n", "- Text cleaning (stopwords, lexicalisation)\n", "- Unigram and Bigram Features\n", "- Latent Dirichlet allocation (LDA) Topic Features\n", " - Run topic model on each product, unsupervised.\n", " - In production has 50 topics.\n", " - `gensim`, topic modelling library in Python\n", "- Not focus of talk\n", "\n", "## Training, Testing, and Labelling\n", "\n", "### Hierarchical Classification\n", "\n", "- One way, take hierarchy (multi-level) then flatten.\n", " - Root -> [A, B, C], B -> [D, E]\n", " - Then flatten to four-way classifier (A, C, D, E). B is internal node.\n", " - Take your favourite classifier, done.\n", "- But with 4000 classes, doesn't really scale.\n", "- Alternative, for every internal node create a classifier.\n", " - Classify [A, C] or [B].\n", " - If B, then classify [D, E].\n", " - Hence two classifiers.\n", " - 2 + 3 way multiclass classification\n", "\n", "### What classifier to use?\n", "\n", "- Want to extract value from all the feature engineering they did.\n", "- Want classifier that supports multiclass classification.\n", "- Ended up choosing logistic regression, easy.\n", " - Bag of words, weight each word for given classification label.\n", "- Need a probability output, normalised [0.0, 1.0].\n", "\n", "## How to Train Logistic Regression\n", "\n", "$ \\textrm{min}_{\\beta} \\sum_n \\textrm{log} p(y_n | X_n, \\beta) + \\lambda_1 ||\\beta||_1 + \\lambda_2 ||\\beta||_2$\n", "\n", "- Wealth of tools to optimise objective function.\n", "- Optimise using Wapiti [http://wapiti.limsi.fr](http://wapiti.limsi.fr)\n", " - Segments and labels sequences.\n", " - Not well known.\n", " - Extremely fast, vectorised C.\n", "- Nowadays could use scikit-learn.\n", "- lambda is regularization. You want to assign cost to extra parameters.\n", " - Just try different hyperparameters (lambdas) using grid search.\n", " - In production they try 20 hyperparameter values.\n", "\n", "##\u00a0What to train\n", "\n", "- One classifier for every internal node. ROOT node is an internal node.\n", "- Note that each data point spreads around to all internal nodes on the path to the respective leaf category it ends up in.\n", " - e.g. 
      "## What to train\n",
      "\n",
      "- One classifier for every internal node; the ROOT node is an internal node.\n",
      "- Note that each data point spreads to all internal nodes on the path to the leaf category it ends up in.\n",
      "    - e.g. a radio is in both ROOT and Electronics.\n",
      "    - Five levels implies five copies.\n",
      "\n",
      "## How to train\n",
      "\n",
      "- Two stages: cross-validation and calibration.\n",
      "- Cross-validation:\n",
      "    - Estimate classifier errors.\n",
      "    - Do not test on training data.\n",
      "    - Have three sets of data: training, cross-validation, testing.\n",
      "    - They split the training set into 5 chunks: 5-fold cross-validation.\n",
      "- Calibration:\n",
      "    - Are my probability estimates correct?\n",
      "    - Make sure that, of the labels predicted with 90% confidence, 90% are correct.\n",
      "\n",
      "## How to use model\n",
      "\n",
      "- Use the chain rule to combine the per-node classifiers:\n",
      "\n",
      "        p(ROOT, electronics, ... | X) = p(ROOT | X) * p(electronics | ROOT, X) * ...\n",
      "\n",
      "- Use a greedy algorithm to traverse the paths rather than scoring all of them.\n",
      "\n",
      "## How to re-use human knowledge\n",
      "\n",
      "- Active learning.\n",
      "- Some data is labelled, some data isn't.\n",
      "- Especially helpful for novel data, e.g. a vuvuzela, completely unseen before.\n",
      "- For unknown items, or decisions close to the decision boundary, send the data to humans: in practice, Amazon Mechanical Turk.\n",
      "\n",
      "## Implementation\n",
      "\n",
      "- A simple MapReduce task for data cleaning and feature extraction:\n",
      "\n",
      "        def mapper(id, category, features):\n",
      "            # each product is a training example for every classifier\n",
      "            # on the path from ROOT down to its leaf category\n",
      "            record_fold = hash(id) % num_folds\n",
      "            for subcat in lineage(category):\n",
      "                for hyper in hypers:                # e.g. 20 lambda values\n",
      "                    for fold in range(num_folds):   # cross-validation folds\n",
      "                        yield (fold, subcat, hyper), (id, record_fold, category, features)\n",
      "\n",
      "        def reducer(key, data):\n",
      "            fold, subcat, hyper = key\n",
      "            # train on everything outside the held-out fold\n",
      "            train = [d for d in data if d[1] != fold]\n",
      "            model = wapiti.train(train, hyper)\n",
      "            ...\n",
      "\n",
      "- Use Dumbo to run it on Hadoop.\n",
      "\n",
      "## Thoughts\n",
      "\n",
      "- Most thought went into feature engineering:\n",
      "    - the LDA topic model, and how to clean the text.\n",
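      "\n",
      "The notes credit `gensim` for the LDA topic features; here is a minimal sketch of that step on hypothetical toy data (production ran 50 topics over real product text, and everything below is illustrative):\n",
      "\n",
      "    from gensim import corpora, models\n",
      "\n",
      "    # hypothetical tokenised product descriptions, already cleaned\n",
      "    docs = [\n",
      "        ['portable', 'dab', 'radio', 'fm', 'tuner'],\n",
      "        ['valve', 'guitar', 'amp', 'speaker'],\n",
      "        ['hdmi', 'cable', 'gold', 'plated'],\n",
      "    ]\n",
      "    dictionary = corpora.Dictionary(docs)\n",
      "    corpus = [dictionary.doc2bow(doc) for doc in docs]\n",
      "\n",
      "    # 50 topics in production; 2 here for the toy corpus\n",
      "    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)\n",
      "\n",
      "    # topic mixture for a new product, usable as extra features\n",
      "    print(lda[dictionary.doc2bow(['dab', 'radio'])])\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}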