{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## What Rangespan to\n", "\n", "- Large catalog of products for retailers like Tesco and Argos\n", "- Middleman between Tesco/Argos and customer orders/returns\n", "- Offer a search engine over products, e.g. audio products.\n", "- Retailers think in terms of product categories, especially for search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Taxonomy Classification\n", "\n", "- Initially contracted out classification to manual workers\n", " - Amazon Mechanical Turk\n", " - Outsources to low-wage countries\n", "- Categories structured as hierarchical tree.\n", " - Root -> Electronics -> Audio -> Amps\n", "- Input: raw product data, output: category." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pipeline\n", "\n", "- Data gathering\n", " - Name, Manufacturer, Description, Label\n", "- Feature extraction\n", "\n", "## Feature extraction\n", "\n", "- Text cleaning (stopwords, lexicalisation)\n", "- Unigram and Bigram Features\n", "- Latent Dirichlet allocation (LDA) Topic Features\n", " - Run topic model on each product, unsupervised.\n", " - In production has 50 topics.\n", " - `gensim`, topic modelling library in Python\n", "- Not focus of talk\n", "\n", "## Training, Testing, and Labelling\n", "\n", "### Hierarchical Classification\n", "\n", "- One way, take hierarchy (multi-level) then flatten.\n", " - Root -> [A, B, C], B -> [D, E]\n", " - Then flatten to four-way classifier (A, C, D, E). B is internal node.\n", " - Take your favourite classifier, done.\n", "- But with 4000 classes, doesn't really scale.\n", "- Alternative, for every internal node create a classifier.\n", " - Classify [A, C] or [B].\n", " - If B, then classify [D, E].\n", " - Hence two classifiers.\n", " - 2 + 3 way multiclass classification\n", "\n", "### What classifier to use?\n", "\n", "- Want to extract value from all the feature engineering they did.\n", "- Want classifier that supports multiclass classification.\n", "- Ended up choosing logistic regression, easy.\n", " - Bag of words, weight each word for given classification label.\n", "- Need a probability output, normalised [0.0, 1.0].\n", "\n", "## How to Train Logistic Regression\n", "\n", "$ \\textrm{min}_{\\beta} \\sum_n \\textrm{log} p(y_n | X_n, \\beta) + \\lambda_1 ||\\beta||_1 + \\lambda_2 ||\\beta||_2$\n", "\n", "- Wealth of tools to optimise objective function.\n", "- Optimise using Wapiti [http://wapiti.limsi.fr](http://wapiti.limsi.fr)\n", " - Segments and labels sequences.\n", " - Not well known.\n", " - Extremely fast, vectorised C.\n", "- Nowadays could use scikit-learn.\n", "- lambda is regularization. You want to assign cost to extra parameters.\n", " - Just try different hyperparameters (lambdas) using grid search.\n", " - In production they try 20 hyperparameter values.\n", "\n", "##\u00a0What to train\n", "\n", "- One classifier for every internal node. ROOT node is an internal node.\n", "- Note that each data point spreads around to all internal nodes on the path to the respective leaf category it ends up in.\n", " - e.g. 
      "## What to train\n",
      "\n",
      "- One classifier for every internal node; the ROOT node is an internal node.\n",
      "- Note that each data point spreads to all internal nodes on the path to the leaf category it ends up in.\n",
      "    - e.g. a radio is in both ROOT and Electronics.\n",
      "    - Five levels implies five copies.\n",
      "\n",
      "## How to train\n",
      "\n",
      "- Two stages: cross-validation and calibration.\n",
      "- Cross-validation:\n",
      "    - Estimate classifier errors.\n",
      "    - Do not test on training data.\n",
      "    - Have three sets of data: training, cross-validation, testing.\n",
      "    - They split the training set into 5 chunks: 5-fold cross-validation.\n",
      "- Calibration:\n",
      "    - Are my probability estimates correct?\n",
      "    - Make sure that, of the labels predicted with 90% confidence, 90% are correct.\n",
      "\n",
      "## How to use model\n",
      "\n",
      "- Use the chain rule to combine the per-node classifiers:\n",
      "\n",
      "        p(ROOT, electronics, ... | X) = p(ROOT | X) * p(electronics | ROOT, X) * ...\n",
      "\n",
      "- Use a greedy algorithm to traverse the paths rather than scoring all of them.\n",
      "\n",
      "## How to re-use human knowledge\n",
      "\n",
      "- Active learning.\n",
      "- Some data is labelled, some data isn't.\n",
      "- Especially helpful for novel data, e.g. a vuvuzela, completely unseen before.\n",
      "- For unknown items, or decisions close to the decision boundary, send the data to humans: in practice, Amazon Mechanical Turk.\n",
      "\n",
      "## Implementation\n",
      "\n",
      "- A simple MapReduce task for data cleaning and feature extraction:\n",
      "\n",
      "        def mapper(id, category, features):\n",
      "            # each product is a training example for every classifier\n",
      "            # on the path from ROOT down to its leaf category\n",
      "            record_fold = hash(id) % num_folds\n",
      "            for subcat in lineage(category):\n",
      "                for hyper in hypers:                # e.g. 20 lambda values\n",
      "                    for fold in range(num_folds):   # cross-validation folds\n",
      "                        yield (fold, subcat, hyper), (id, record_fold, category, features)\n",
      "\n",
      "        def reducer(key, data):\n",
      "            fold, subcat, hyper = key\n",
      "            # train on everything outside the held-out fold\n",
      "            train = [d for d in data if d[1] != fold]\n",
      "            model = wapiti.train(train, hyper)\n",
      "            ...\n",
      "\n",
      "- Use Dumbo to run it on Hadoop.\n",
      "\n",
      "## Thoughts\n",
      "\n",
      "- Most thought went into feature engineering:\n",
      "    - the LDA topic model, and how to clean the text.\n",
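      "\n",
      "The notes credit `gensim` for the LDA topic features; here is a minimal sketch of that step on hypothetical toy data (production ran 50 topics over real product text, and everything below is illustrative):\n",
      "\n",
      "    from gensim import corpora, models\n",
      "\n",
      "    # hypothetical tokenised product descriptions, already cleaned\n",
      "    docs = [\n",
      "        ['portable', 'dab', 'radio', 'fm', 'tuner'],\n",
      "        ['valve', 'guitar', 'amp', 'speaker'],\n",
      "        ['hdmi', 'cable', 'gold', 'plated'],\n",
      "    ]\n",
      "    dictionary = corpora.Dictionary(docs)\n",
      "    corpus = [dictionary.doc2bow(doc) for doc in docs]\n",
      "\n",
      "    # 50 topics in production; 2 here for the toy corpus\n",
      "    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)\n",
      "\n",
      "    # topic mixture for a new product, usable as extra features\n",
      "    print(lda[dictionary.doc2bow(['dab', 'radio'])])\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}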