## Inputs

- Want one canonical ID for every unique company.
- Fields:
  - Account name
  - Contact name
  - Premises address
  - Billing address
- Companies are duplicated, with different premises and/or billing addresses.
- What about companies with the same account name in different places?
- What about edge cases: nearly identical company names in different places. Same company or different?

## Tools

- Just 1 week of work: basic pre-processing to help interns.
- Want to do clustering (are these different rows actually the same company?)

1. Input rows -> `nltk` Punkt tokenizer (`nltk.tokenize.punkt`) -> cleaned, parsed, tokenized strings.
   - Sometimes need to preserve non-alphanumeric characters (e.g. "203-205 Upper Street").
2. -> `sklearn.feature_extraction.text.TfidfVectorizer` -> TF-IDF 2D matrix
3. -> `sklearn.decomposition.TruncatedSVD` -> SVD N-D matrix
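
A minimal sketch of the three steps above, under stated assumptions: the sample rows are invented, and a small regex tokenizer (hypothetical, not the `nltk` step from the notes) stands in for tokenization so the example runs without downloading `nltk` data. It shows one way to keep house-number ranges such as "203-205" as a single token before TF-IDF and SVD.

```python
import re

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical sample rows: account name and premises address in one string.
rows = [
    "Acme Ltd, 203-205 Upper Street, London",
    "ACME Limited, 203-205 Upper St, London",
    "Acme Ltd, 14 High Road, Leeds",
    "Bravo Catering, 7 Mill Lane, York",
    "Bravo Catering Co, 7 Mill Ln, York",
]

# Stand-in tokenizer (an assumption, not the nltk step): hyphenated number
# ranges are tried first so "203-205" survives as a single token.
TOKEN_RE = re.compile(r"\d+(?:-\d+)+|\w+")


def tokenize(text):
    """Lowercase the text and split it into word and number-range tokens."""
    return TOKEN_RE.findall(text.lower())


pipeline = make_pipeline(
    # token_pattern=None because the custom tokenizer replaces the default pattern.
    TfidfVectorizer(tokenizer=tokenize, lowercase=False, token_pattern=None),
    # n_components is the "N" in the N-D matrix; 3 is an arbitrary choice here.
    TruncatedSVD(n_components=3, random_state=0),
)

# One dense row of N features per input row.
features = pipeline.fit_transform(rows)
print(features.shape)
```

`TruncatedSVD` accepts the sparse TF-IDF matrix directly, which is why it fits this pipeline without densifying the data first.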

Now, we can:

1. Suggest similar accounts to be grouped into manageable chunks for people to look at.
   - `sklearn.cluster.MiniBatchKMeans` (ridiculously fast, coupled with grid search for hyperparameters)
   - `sklearn.cluster.AffinityPropagation`
2. Human validation + verification
3. Incorporate and propagate *valid* groupings
   - `sklearn.neighbors.RadiusNeighborsClassifier`
   - Supervised, passing through human knowledge
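
The loop above can be sketched roughly as follows. Everything concrete here is an assumption for illustration: the 2-D feature vectors stand in for the SVD output, the integer canonical IDs and the `radius` value are invented, and the "human validation" step is simulated by a hard-coded list of confirmed rows.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.neighbors import RadiusNeighborsClassifier

# Hypothetical feature vectors standing in for the SVD output.
features = np.array([
    [0.00, 0.00], [0.05, 0.00], [0.00, 0.05],  # rows that look like one company
    [1.00, 1.00], [1.05, 1.00], [1.00, 1.05],  # rows that look like another
    [5.00, 5.00],                              # a row unlike anything else
])

# Step 1: suggest chunks of similar rows for people to review.
kmeans = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0)
suggested = kmeans.fit_predict(features[:6])

# Step 2 (simulated): reviewers confirm canonical IDs for some rows.
validated_idx = [0, 1, 3, 4]
validated_ids = np.array([101, 101, 102, 102])  # hypothetical canonical IDs

# Step 3: propagate confirmed IDs to unreviewed rows within a fixed radius.
# outlier_label must share the dtype of the labels, hence the integer -1.
clf = RadiusNeighborsClassifier(radius=0.2, outlier_label=-1)
clf.fit(features[validated_idx], validated_ids)
propagated = clf.predict(features[[2, 5, 6]])
print(propagated)
```

Rows with no validated neighbour inside the radius receive the outlier label rather than a forced match, so they can be sent back for another round of human review.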
 
## Results

- 93% accuracy, compared to human experts who validated the results
- `nltk` and `scikit-learn` allowed rapid development and testing
- Human input is important - no ML problem is an island.