{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Building the baseline classifier\n", "\n", "We'll now do a basic round of supervised classification using scikit-learn. We start by loading the data. The dataset already contains the final classifications, so we can measure our accuracy later, but we'll ignore them initially and pretend we're starting from scratch." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = pd.read_csv('singapore-roadnames-final-classified.csv')" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
1751 rows × 5 columns
" ], "text/plain": [ " Unnamed: 0 road_name has_malay_road_tag classification \\\n", "0 0 Abingdon 0 British \n", "1 1 Abu Talib 1 Malay \n", "2 2 Adam 0 British \n", "3 3 Adat 1 Malay \n", "4 4 Adis 0 Other \n", "5 5 Admiralty 0 British \n", "6 6 Ah Hood 0 Chinese \n", "7 7 Ah Soo 1 Chinese \n", "8 8 Ahmad Ibrahim 1 Malay \n", "9 9 Aida 0 Other \n", "10 10 Airport 0 Generic \n", "11 11 Alexandra 0 British \n", "12 12 Aliwal 0 Indian \n", "13 13 Aljunied 0 Other \n", "14 14 Allanbrooke 0 British \n", "15 15 Allenby 0 British \n", "16 16 Almond 0 Generic \n", "17 17 Alnwick 0 British \n", "18 18 Alps 0 Other \n", "19 19 Ama Keng 0 Chinese \n", "20 20 Amber 0 Other \n", "21 21 Amoy 0 Chinese \n", "22 22 Ampang 1 Malay \n", "23 23 Ampas 1 Malay \n", "24 24 Ampat 1 Malay \n", "25 25 Anak Bukit 1 Malay \n", "26 26 Anak Patong 1 Malay \n", "27 27 Anamalai 0 Indian \n", "28 28 Anchorvale 0 Generic \n", "29 29 Anderson 0 British \n", "... ... ... ... ... \n", "1721 1721 Woodgrove 0 Generic \n", "1722 1722 Woodland 0 Generic \n", "1723 1723 Woodlands 0 Generic \n", "1724 1724 Woodleigh 0 British \n", "1725 1725 Woodsville 0 Generic \n", "1726 1726 Woollerton 0 British \n", "1727 1727 Worthing 0 British \n", "1728 1728 Xilin 0 Chinese \n", "1729 1729 Yan Kit 0 Chinese \n", "1730 1730 Yarrow 0 British \n", "1731 1731 Yarwood 0 British \n", "1732 1732 Yasin 1 Malay \n", "1733 1733 Yio Chu Kang 0 Chinese \n", "1734 1734 Yishun 0 Chinese \n", "1735 1735 Yong Siak 0 Chinese \n", "1736 1736 York 0 British \n", "1737 1737 Youngberg 0 British \n", "1738 1738 Yuan Ching 0 Chinese \n", "1739 1739 Yuk Tong 0 Chinese \n", "1740 1740 Yung An 0 Chinese \n", "1741 1741 Yung Ho 0 Chinese \n", "1742 1742 Yung Kuang 0 Chinese \n", "1743 1743 Yung Sheng 0 Chinese \n", "1744 1744 Yunnan 0 Chinese \n", "1745 1745 Zamrud 1 Malay \n", "1746 1746 Zehnder 0 Other \n", "1747 1747 Zion 0 Other \n", "1748 1748 Zubir Said 0 Malay \n", "1749 1749 kukoh 1 Malay \n", "1750 1750 one-north Gateway 0 Generic 
\n", "\n", " comment \n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 Indian Jewish \n", "5 NaN \n", "6 NaN \n", "7 NaN \n", "8 NaN \n", "9 NaN \n", "10 NaN \n", "11 NaN \n", "12 Battle of Aliwal in the Indo-Sikh war \n", "13 Arab \n", "14 NaN \n", "15 NaN \n", "16 NaN \n", "17 NaN \n", "18 NaN \n", "19 NaN \n", "20 after the Amber Trust fund established for poo... \n", "21 NaN \n", "22 NaN \n", "23 NaN \n", "24 NaN \n", "25 NaN \n", "26 NaN \n", "27 NaN \n", "28 marine theme \n", "29 NaN \n", "... ... \n", "1721 NaN \n", "1722 NaN \n", "1723 NaN \n", "1724 NaN \n", "1725 NaN \n", "1726 NaN \n", "1727 NaN \n", "1728 NaN \n", "1729 NaN \n", "1730 NaN \n", "1731 NaN \n", "1732 NaN \n", "1733 NaN \n", "1734 NaN \n", "1735 NaN \n", "1736 NaN \n", "1737 NaN \n", "1738 NaN \n", "1739 NaN \n", "1740 NaN \n", "1741 NaN \n", "1742 NaN \n", "1743 NaN \n", "1744 NaN \n", "1745 NaN \n", "1746 Eurasian \n", "1747 NaN \n", "1748 NaN \n", "1749 NaN \n", "1750 NaN \n", "\n", "[1751 rows x 5 columns]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this step, we'll use about 10% of the data to mimic the process I actually used." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 0: putting the data together" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# pick a random 10% of the rows to train and test with\n", "\n", "import random\n", "random.seed(1965)\n", "train_test_set = df.loc[random.sample(df.index, int(len(df) / 10))]\n", "\n", "X = train_test_set['road_name']\n", "y = train_test_set['classification']" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('Opal', 'Generic'),\n", " ('Club', 'Generic'),\n", " ('Minto', 'Other'),\n", " ('Woodlands', 'Generic'),\n", " ('Hai Sing', 'Chinese'),\n", " ('Batalong', 'Malay'),\n", " ('Hikayat', 'Malay'),\n", " ('Bassein', 'Other'),\n", " ('Mount Echo', 'Generic'),\n", " ('Kallang Pudding', 'Malay'),\n", " ('Republic', 'Generic'),\n", " ('Wan Tho', 'Chinese'),\n", " ('Rengkam', 'Malay'),\n", " ('Keong Saik', 'Chinese'),\n", " ('Sedap', 'Malay'),\n", " ('Stratton', 'British'),\n", " ('Seagull', 'Generic'),\n", " ('Manila', 'Other')]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "zip(X,y)[::10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should never train and test on the same data, so we'll split this 10% sample further into a training set and a test set. scikit-learn provides a convenient function for this, `train_test_split`, which by default holds out 25% of the rows for testing." ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# note: from scikit-learn 0.18 onwards this function lives in sklearn.model_selection\n", "from sklearn.cross_validation import train_test_split\n", "X_train, X_test, y_train, y_true = train_test_split(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Figure out your classification labels\n", "\n", "This was actually one of the trickiest parts of the process. 
These are the labels I finally decided on:\n", "\n", "* Malay (including Indonesian/Bugis names)\n", "* British\n", "* Chinese (all languages (\"dialects\"))\n", "* Indian (all languages)\n", "* Other (e.g. other European names, Jewish names, Armenian names...)\n", "* Generic (Temple Street, Sunrise Avenue, etc)\n", "\n", "Something to bear in mind is that some of the streets can be classified in multiple ways. For example, is Queen Street \"British\" or \"Generic\"? In this case I selected \"British\" because it was specifically named after Queen Victoria. I tried to be consistent in my criteria, but up to ~5% of the roads might be arguable. Also, there is insufficient information for some of the roads so I went with my gut feel about the orthotactics of the word (the letter patterns)." ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Malay 614\n", "British 518\n", "Generic 255\n", "Chinese 217\n", "Other 119\n", "Indian 28\n", "dtype: int64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.classification.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: decide what features to use\n", "\n", "What we're doing is basically language classification. Often, people use n-grams as features for this. scikit-learn conveniently provides a function that counts n-grams for us." 
] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'scipy.sparse.csr.csr_matrix'>\n", "(131, 1410)\n", "(44, 1410)\n" ] } ], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "vect = CountVectorizer(ngram_range=(1,4), analyzer='char')\n", "\n", "# fit_transform for the training data\n", "X_train_feats = vect.fit_transform(X_train)\n", "# transform for the test data\n", "# because we need to match the ngrams that were found in the training set \n", "X_test_feats = vect.transform(X_test) \n", "\n", "print type(X_train_feats)\n", "print X_train_feats.shape\n", "print X_test_feats.shape" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Step 3: pick a classifier\n", "\n", "\n", "\n", "According to scikit-learn's \"Choosing the right estimator\" flowchart, we should be starting out with Linear SVC." ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.svm import LinearSVC\n", "clf = LinearSVC()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: Train the model\n", "\n", "Use the classifier to fit a model based on the feature matrix of `X_train` and the label vector of `y_train`." ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model = clf.fit(X_train_feats, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 5: Predict the labels of the test set\n", "\n", "Now that we have our model, we can use it to predict labels on the held-out test set." 
] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [], "source": [ "y_predicted = model.predict(X_test_feats)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['Malay', 'Malay', 'British', 'Malay', 'British', 'British',\n", " 'British', 'British', 'British', 'British', 'Malay', 'Chinese',\n", " 'British', 'Chinese', 'British', 'Other', 'Generic', 'Malay',\n", " 'Malay', 'Chinese', 'British', 'British', 'Malay', 'British',\n", " 'British', 'Generic', 'Other', 'British', 'British', 'British',\n", " 'British', 'British', 'Malay', 'Generic', 'Malay', 'Generic',\n", " 'Malay', 'British', 'Malay', 'British', 'British', 'Malay', 'Malay',\n", " 'Generic'], dtype=object)" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_predicted" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 6: select an evaluation metric\n", "\n", "scikit-learn comes with a bunch of evaluation metrics. Which one should be chosen depends on what we're trying to minimise/maximise. In this case, we want to make as few errors as possible, so it makes sense to use accuracy as our metric.\n", "\n", "$$ \\text{accuracy} = \\frac{\\#\\ \\text{correct}}{\\#\\ \\text{classified}} $$" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.59090909090909094" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accuracy_score(y_true, y_predicted)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we got about 59% accuracy (26 of the 44 test names correct). Let's try it with a few more train/test splits to see whether this is typical." 
] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def classify(X, y):\n", "    # do the train-test split\n", "    X_train, X_test, y_train, y_true = train_test_split(X, y)\n", "\n", "    # get our features\n", "    X_train_feats = vect.fit_transform(X_train)\n", "    X_test_feats = vect.transform(X_test) \n", "\n", "    # train our model\n", "    model = clf.fit(X_train_feats, y_train)\n", "    \n", "    # predict labels on the test set\n", "    y_predicted = model.predict(X_test_feats)\n", "    \n", "    # return the accuracy score obtained\n", "    return accuracy_score(y_true, y_predicted)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.551818181818\n" ] } ], "source": [ "scores = list()\n", "num_expts = 100\n", "for i in range(num_expts):\n", "    score = classify(X,y)\n", "    scores.append(score)\n", "    \n", "print sum(scores) / num_expts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "The accuracy we obtain with this set of features and this classifier is about 55%. This isn't completely terrible. With 6 categories, a classifier that guesses uniformly at random would get only about 16.7% of them right, and one that always guesses the most common label (Malay, 614 of the 1751 roads) would get about 35%. But 55% accuracy also means that I'd have to go through and correct nearly every other label. 
How can we improve this?\n", "\n", "There are a few ways that spring to mind:\n", "\n", "* Increase the amount of data - easier said than done\n", "* Try different classifiers - scikit-learn makes this dead easy\n", "* Use more features - worth a try (and we will)\n", "* Adjust the hyperparameters of the classifiers - more on this later" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }