{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "%autosave 10" ], "language": "python", "metadata": {}, "outputs": [ { "javascript": [ "IPython.notebook.set_autosave_interval(10000)" ], "metadata": {}, "output_type": "display_data" }, { "output_type": "stream", "stream": "stdout", "text": [ "Autosaving every 10 seconds\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inputs\n", "\n", "- Want one, canonical ID for every unique company.\n", "- Fields\n", " - Account name\n", " - Contact name\n", " - Premises address\n", " - Billing address\n", "- Companies are duplicated, with different premises and/or billing addresses.\n", "- What about companies with same account name, in different places?\n", "- What about edge cases - almost same company names, different places - same or different company?\n", "\n", "## Tools\n", "\n", "- Just 1 week work, basic pre-processing to help interns.\n", "- Want to do clustering (are these different rows actually the same company?)\n", "\n", "1. Input rows -> `nltk.PunktTokeniser`. -> Cleaned, parsed, tokensized strings.\n", " - Sometimes need to preserve non-alphanumeric characters (e.g. \"203-205 Upper Street\").\n", "2. -> `sklearn.TfidVectorizer` -> TF-IDF 2D matrix\n", "3. -> `sklearn.TruncatedSVD` -> SVD N-D matrix\n", "\n", "Now, we can:\n", "\n", "1. Suggest similar accounts t- be grouped into manageable chunks for people to look at.\n", " - `sklearn.MiniBatchKmeans` (ridiculously fast, coupled with grid search for hyperparameters)\n", " - `sklearn.AffinityPropagation`\n", "2. Human validation + verification\n", "3. Incorporate and propagate *valid* groupings\n", " - `sklearn.RadiusNeighborsClassifier`\n", " - Supervised, passing through human knowledge\n", " \n", "##\u00a0Results\n", "\n", "- 93% accuracy, compared to human experts who validated the results\n", "- `nltk` and `scikit-learn` allowed rapid development and testing\n", "- Human input is important - no ML problem is an island." ] } ], "metadata": {} } ] }