{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Building the baseline classifier\n",
    "\n",
    "We'll now run a basic round of supervised classification using scikit-learn, starting by loading the data. This dataset already contains the final classifications, so we can measure our accuracy later, but we'll ignore them at first and pretend we're starting from scratch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df = pd.read_csv('singapore-roadnames-final-classified.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Unnamed: 0</th>\n",
       "      <th>road_name</th>\n",
       "      <th>has_malay_road_tag</th>\n",
       "      <th>classification</th>\n",
       "      <th>comment</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0   </th>\n",
       "      <td>    0</td>\n",
       "      <td>          Abingdon</td>\n",
       "      <td> 0</td>\n",
       "      <td> British</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1   </th>\n",
       "      <td>    1</td>\n",
       "      <td>         Abu Talib</td>\n",
       "      <td> 1</td>\n",
       "      <td>   Malay</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2   </th>\n",
       "      <td>    2</td>\n",
       "      <td>              Adam</td>\n",
       "      <td> 0</td>\n",
       "      <td> British</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3   </th>\n",
       "      <td>    3</td>\n",
       "      <td>              Adat</td>\n",
       "      <td> 1</td>\n",
       "      <td>   Malay</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4   </th>\n",
       "      <td>    4</td>\n",
       "      <td>              Adis</td>\n",
       "      <td> 0</td>\n",
       "      <td>   Other</td>\n",
       "      <td>                                     Indian Jewish</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5   </th>\n",
       "      <td>    5</td>\n",
       "      <td>         Admiralty</td>\n",
       "      <td> 0</td>\n",
       "      <td> British</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6   </th>\n",
       "      <td>    6</td>\n",
       "      <td>           Ah Hood</td>\n",
       "      <td> 0</td>\n",
       "      <td> Chinese</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7   </th>\n",
       "      <td>    7</td>\n",
       "      <td>            Ah Soo</td>\n",
       "      <td> 1</td>\n",
       "      <td> Chinese</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8   </th>\n",
       "      <td>    8</td>\n",
       "      <td>     Ahmad Ibrahim</td>\n",
       "      <td> 1</td>\n",
       "      <td>   Malay</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9   </th>\n",
       "      <td>    9</td>\n",
       "      <td>              Aida</td>\n",
       "      <td> 0</td>\n",
       "      <td>   Other</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10  </th>\n",
       "      <td>   10</td>\n",
       "      <td>           Airport</td>\n",
       "      <td> 0</td>\n",
       "      <td> Generic</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11  </th>\n",
       "      <td>   11</td>\n",
       "      <td>         Alexandra</td>\n",
       "      <td> 0</td>\n",
       "      <td> British</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12  </th>\n",
       "      <td>   12</td>\n",
       "      <td>            Aliwal</td>\n",
       "      <td> 0</td>\n",
       "      <td>  Indian</td>\n",
       "      <td>             Battle of Aliwal in the Indo-Sikh war</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13  </th>\n",
       "      <td>   13</td>\n",
       "      <td>          Aljunied</td>\n",
       "      <td> 0</td>\n",
       "      <td>   Other</td>\n",
       "      <td>                                              Arab</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14  </th>\n",
       "      <td>   14</td>\n",
       "      <td>       Allanbrooke</td>\n",
       "      <td> 0</td>\n",
       "      <td> British</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15  </th>\n",
       "      <td>   15</td>\n",
       "      <td>           Allenby</td>\n",
       "      <td> 0</td>\n",
       "      <td> British</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16  </th>\n",
       "      <td>   16</td>\n",
       "      <td>            Almond</td>\n",
       "      <td> 0</td>\n",
       "      <td> Generic</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17  </th>\n",
       "      <td>   17</td>\n",
       "      <td>           Alnwick</td>\n",
       "      <td> 0</td>\n",
       "      <td> British</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18  </th>\n",
       "      <td>   18</td>\n",
       "      <td>              Alps</td>\n",
       "      <td> 0</td>\n",
       "      <td>   Other</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19  </th>\n",
       "      <td>   19</td>\n",
       "      <td>          Ama Keng</td>\n",
       "      <td> 0</td>\n",
       "      <td> Chinese</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20  </th>\n",
       "      <td>   20</td>\n",
       "      <td>             Amber</td>\n",
       "      <td> 0</td>\n",
       "      <td>   Other</td>\n",
       "      <td> after the Amber Trust fund established for poo...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21  </th>\n",
       "      <td>   21</td>\n",
       "      <td>              Amoy</td>\n",
       "      <td> 0</td>\n",
       "      <td> Chinese</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22  </th>\n",
       "      <td>   22</td>\n",
       "      <td>            Ampang</td>\n",
       "      <td> 1</td>\n",
       "      <td>   Malay</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23  </th>\n",
       "      <td>   23</td>\n",
       "      <td>             Ampas</td>\n",
       "      <td> 1</td>\n",
       "      <td>   Malay</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24  </th>\n",
       "      <td>   24</td>\n",
       "      <td>             Ampat</td>\n",
       "      <td> 1</td>\n",
       "      <td>   Malay</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25  </th>\n",
       "      <td>   25</td>\n",
       "      <td>        Anak Bukit</td>\n",
       "      <td> 1</td>\n",
       "      <td>   Malay</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26  </th>\n",
       "      <td>   26</td>\n",
       "      <td>       Anak Patong</td>\n",
       "      <td> 1</td>\n",
       "      <td>   Malay</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27  </th>\n",
       "      <td>   27</td>\n",
       "      <td>          Anamalai</td>\n",
       "      <td> 0</td>\n",
       "      <td>  Indian</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28  </th>\n",
       "      <td>   28</td>\n",
       "      <td>        Anchorvale</td>\n",
       "      <td> 0</td>\n",
       "      <td> Generic</td>\n",
       "      <td>                                      marine theme</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29  </th>\n",
       "      <td>   29</td>\n",
       "      <td>          Anderson</td>\n",
       "      <td> 0</td>\n",
       "      <td> British</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1721</th>\n",
       "      <td> 1721</td>\n",
       "      <td>         Woodgrove</td>\n",
       "      <td> 0</td>\n",
       "      <td> Generic</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1722</th>\n",
       "      <td> 1722</td>\n",
       "      <td>          Woodland</td>\n",
       "      <td> 0</td>\n",
       "      <td> Generic</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1723</th>\n",
       "      <td> 1723</td>\n",
       "      <td>         Woodlands</td>\n",
       "      <td> 0</td>\n",
       "      <td> Generic</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1724</th>\n",
       "      <td> 1724</td>\n",
       "      <td>         Woodleigh</td>\n",
       "      <td> 0</td>\n",
       "      <td> British</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1725</th>\n",
       "      <td> 1725</td>\n",
       "      <td>        Woodsville</td>\n",
       "      <td> 0</td>\n",
       "      <td> Generic</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1726</th>\n",
       "      <td> 1726</td>\n",
       "      <td>        Woollerton</td>\n",
       "      <td> 0</td>\n",
       "      <td> British</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1727</th>\n",
       "      <td> 1727</td>\n",
       "      <td>          Worthing</td>\n",
       "      <td> 0</td>\n",
       "      <td> British</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1728</th>\n",
       "      <td> 1728</td>\n",
       "      <td>             Xilin</td>\n",
       "      <td> 0</td>\n",
       "      <td> Chinese</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1729</th>\n",
       "      <td> 1729</td>\n",
       "      <td>           Yan Kit</td>\n",
       "      <td> 0</td>\n",
       "      <td> Chinese</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1730</th>\n",
       "      <td> 1730</td>\n",
       "      <td>            Yarrow</td>\n",
       "      <td> 0</td>\n",
       "      <td> British</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1731</th>\n",
       "      <td> 1731</td>\n",
       "      <td>           Yarwood</td>\n",
       "      <td> 0</td>\n",
       "      <td> British</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1732</th>\n",
       "      <td> 1732</td>\n",
       "      <td>             Yasin</td>\n",
       "      <td> 1</td>\n",
       "      <td>   Malay</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1733</th>\n",
       "      <td> 1733</td>\n",
       "      <td>      Yio Chu Kang</td>\n",
       "      <td> 0</td>\n",
       "      <td> Chinese</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1734</th>\n",
       "      <td> 1734</td>\n",
       "      <td>            Yishun</td>\n",
       "      <td> 0</td>\n",
       "      <td> Chinese</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1735</th>\n",
       "      <td> 1735</td>\n",
       "      <td>         Yong Siak</td>\n",
       "      <td> 0</td>\n",
       "      <td> Chinese</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1736</th>\n",
       "      <td> 1736</td>\n",
       "      <td>              York</td>\n",
       "      <td> 0</td>\n",
       "      <td> British</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1737</th>\n",
       "      <td> 1737</td>\n",
       "      <td>         Youngberg</td>\n",
       "      <td> 0</td>\n",
       "      <td> British</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1738</th>\n",
       "      <td> 1738</td>\n",
       "      <td>        Yuan Ching</td>\n",
       "      <td> 0</td>\n",
       "      <td> Chinese</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1739</th>\n",
       "      <td> 1739</td>\n",
       "      <td>          Yuk Tong</td>\n",
       "      <td> 0</td>\n",
       "      <td> Chinese</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1740</th>\n",
       "      <td> 1740</td>\n",
       "      <td>           Yung An</td>\n",
       "      <td> 0</td>\n",
       "      <td> Chinese</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1741</th>\n",
       "      <td> 1741</td>\n",
       "      <td>           Yung Ho</td>\n",
       "      <td> 0</td>\n",
       "      <td> Chinese</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1742</th>\n",
       "      <td> 1742</td>\n",
       "      <td>        Yung Kuang</td>\n",
       "      <td> 0</td>\n",
       "      <td> Chinese</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1743</th>\n",
       "      <td> 1743</td>\n",
       "      <td>        Yung Sheng</td>\n",
       "      <td> 0</td>\n",
       "      <td> Chinese</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1744</th>\n",
       "      <td> 1744</td>\n",
       "      <td>            Yunnan</td>\n",
       "      <td> 0</td>\n",
       "      <td> Chinese</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1745</th>\n",
       "      <td> 1745</td>\n",
       "      <td>            Zamrud</td>\n",
       "      <td> 1</td>\n",
       "      <td>   Malay</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1746</th>\n",
       "      <td> 1746</td>\n",
       "      <td>           Zehnder</td>\n",
       "      <td> 0</td>\n",
       "      <td>   Other</td>\n",
       "      <td>                                          Eurasian</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1747</th>\n",
       "      <td> 1747</td>\n",
       "      <td>              Zion</td>\n",
       "      <td> 0</td>\n",
       "      <td>   Other</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1748</th>\n",
       "      <td> 1748</td>\n",
       "      <td>        Zubir Said</td>\n",
       "      <td> 0</td>\n",
       "      <td>   Malay</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1749</th>\n",
       "      <td> 1749</td>\n",
       "      <td>             kukoh</td>\n",
       "      <td> 1</td>\n",
       "      <td>   Malay</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1750</th>\n",
       "      <td> 1750</td>\n",
       "      <td> one-north Gateway</td>\n",
       "      <td> 0</td>\n",
       "      <td> Generic</td>\n",
       "      <td>                                               NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1751 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      Unnamed: 0          road_name  has_malay_road_tag classification  \\\n",
       "0              0           Abingdon                   0        British   \n",
       "1              1          Abu Talib                   1          Malay   \n",
       "2              2               Adam                   0        British   \n",
       "3              3               Adat                   1          Malay   \n",
       "4              4               Adis                   0          Other   \n",
       "5              5          Admiralty                   0        British   \n",
       "6              6            Ah Hood                   0        Chinese   \n",
       "7              7             Ah Soo                   1        Chinese   \n",
       "8              8      Ahmad Ibrahim                   1          Malay   \n",
       "9              9               Aida                   0          Other   \n",
       "10            10            Airport                   0        Generic   \n",
       "11            11          Alexandra                   0        British   \n",
       "12            12             Aliwal                   0         Indian   \n",
       "13            13           Aljunied                   0          Other   \n",
       "14            14        Allanbrooke                   0        British   \n",
       "15            15            Allenby                   0        British   \n",
       "16            16             Almond                   0        Generic   \n",
       "17            17            Alnwick                   0        British   \n",
       "18            18               Alps                   0          Other   \n",
       "19            19           Ama Keng                   0        Chinese   \n",
       "20            20              Amber                   0          Other   \n",
       "21            21               Amoy                   0        Chinese   \n",
       "22            22             Ampang                   1          Malay   \n",
       "23            23              Ampas                   1          Malay   \n",
       "24            24              Ampat                   1          Malay   \n",
       "25            25         Anak Bukit                   1          Malay   \n",
       "26            26        Anak Patong                   1          Malay   \n",
       "27            27           Anamalai                   0         Indian   \n",
       "28            28         Anchorvale                   0        Generic   \n",
       "29            29           Anderson                   0        British   \n",
       "...          ...                ...                 ...            ...   \n",
       "1721        1721          Woodgrove                   0        Generic   \n",
       "1722        1722           Woodland                   0        Generic   \n",
       "1723        1723          Woodlands                   0        Generic   \n",
       "1724        1724          Woodleigh                   0        British   \n",
       "1725        1725         Woodsville                   0        Generic   \n",
       "1726        1726         Woollerton                   0        British   \n",
       "1727        1727           Worthing                   0        British   \n",
       "1728        1728              Xilin                   0        Chinese   \n",
       "1729        1729            Yan Kit                   0        Chinese   \n",
       "1730        1730             Yarrow                   0        British   \n",
       "1731        1731            Yarwood                   0        British   \n",
       "1732        1732              Yasin                   1          Malay   \n",
       "1733        1733       Yio Chu Kang                   0        Chinese   \n",
       "1734        1734             Yishun                   0        Chinese   \n",
       "1735        1735          Yong Siak                   0        Chinese   \n",
       "1736        1736               York                   0        British   \n",
       "1737        1737          Youngberg                   0        British   \n",
       "1738        1738         Yuan Ching                   0        Chinese   \n",
       "1739        1739           Yuk Tong                   0        Chinese   \n",
       "1740        1740            Yung An                   0        Chinese   \n",
       "1741        1741            Yung Ho                   0        Chinese   \n",
       "1742        1742         Yung Kuang                   0        Chinese   \n",
       "1743        1743         Yung Sheng                   0        Chinese   \n",
       "1744        1744             Yunnan                   0        Chinese   \n",
       "1745        1745             Zamrud                   1          Malay   \n",
       "1746        1746            Zehnder                   0          Other   \n",
       "1747        1747               Zion                   0          Other   \n",
       "1748        1748         Zubir Said                   0          Malay   \n",
       "1749        1749              kukoh                   1          Malay   \n",
       "1750        1750  one-north Gateway                   0        Generic   \n",
       "\n",
       "                                                comment  \n",
       "0                                                   NaN  \n",
       "1                                                   NaN  \n",
       "2                                                   NaN  \n",
       "3                                                   NaN  \n",
       "4                                         Indian Jewish  \n",
       "5                                                   NaN  \n",
       "6                                                   NaN  \n",
       "7                                                   NaN  \n",
       "8                                                   NaN  \n",
       "9                                                   NaN  \n",
       "10                                                  NaN  \n",
       "11                                                  NaN  \n",
       "12                Battle of Aliwal in the Indo-Sikh war  \n",
       "13                                                 Arab  \n",
       "14                                                  NaN  \n",
       "15                                                  NaN  \n",
       "16                                                  NaN  \n",
       "17                                                  NaN  \n",
       "18                                                  NaN  \n",
       "19                                                  NaN  \n",
       "20    after the Amber Trust fund established for poo...  \n",
       "21                                                  NaN  \n",
       "22                                                  NaN  \n",
       "23                                                  NaN  \n",
       "24                                                  NaN  \n",
       "25                                                  NaN  \n",
       "26                                                  NaN  \n",
       "27                                                  NaN  \n",
       "28                                         marine theme  \n",
       "29                                                  NaN  \n",
       "...                                                 ...  \n",
       "1721                                                NaN  \n",
       "1722                                                NaN  \n",
       "1723                                                NaN  \n",
       "1724                                                NaN  \n",
       "1725                                                NaN  \n",
       "1726                                                NaN  \n",
       "1727                                                NaN  \n",
       "1728                                                NaN  \n",
       "1729                                                NaN  \n",
       "1730                                                NaN  \n",
       "1731                                                NaN  \n",
       "1732                                                NaN  \n",
       "1733                                                NaN  \n",
       "1734                                                NaN  \n",
       "1735                                                NaN  \n",
       "1736                                                NaN  \n",
       "1737                                                NaN  \n",
       "1738                                                NaN  \n",
       "1739                                                NaN  \n",
       "1740                                                NaN  \n",
       "1741                                                NaN  \n",
       "1742                                                NaN  \n",
       "1743                                                NaN  \n",
       "1744                                                NaN  \n",
       "1745                                                NaN  \n",
       "1746                                           Eurasian  \n",
       "1747                                                NaN  \n",
       "1748                                                NaN  \n",
       "1749                                                NaN  \n",
       "1750                                                NaN  \n",
       "\n",
       "[1751 rows x 5 columns]"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this step, we'll use about 10% of the data to mimic the process I actually used."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 0: putting the data together"
   ]
  },
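  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# pick a random 10% of the rows to train and test with; the fixed seed makes the sample repeatable\n",
    "\n",
    "import random\n",
    "random.seed(1965)\n",
    "# list(...) gives random.sample a plain sequence; // keeps the sample size an integer\n",
    "train_test_set = df.loc[random.sample(list(df.index), len(df) // 10)]\n",
    "\n",
    "X = train_test_set['road_name']        # inputs: the raw road names\n",
    "y = train_test_set['classification']   # targets: the hand-assigned labels"
   ]
  },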
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('Opal', 'Generic'),\n",
       " ('Club', 'Generic'),\n",
       " ('Minto', 'Other'),\n",
       " ('Woodlands', 'Generic'),\n",
       " ('Hai Sing', 'Chinese'),\n",
       " ('Batalong', 'Malay'),\n",
       " ('Hikayat', 'Malay'),\n",
       " ('Bassein', 'Other'),\n",
       " ('Mount Echo', 'Generic'),\n",
       " ('Kallang Pudding', 'Malay'),\n",
       " ('Republic', 'Generic'),\n",
       " ('Wan Tho', 'Chinese'),\n",
       " ('Rengkam', 'Malay'),\n",
       " ('Keong Saik', 'Chinese'),\n",
       " ('Sedap', 'Malay'),\n",
       " ('Stratton', 'British'),\n",
       " ('Seagull', 'Generic'),\n",
       " ('Manila', 'Other')]"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# show every tenth (road name, label) pair from the sample\n",
    "list(zip(X, y))[::10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should never train and test on the same data, so we'll split this sample further into a training set and a test set. scikit-learn provides a convenient function for this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "# hold out part of the sample for testing (25% by default)\n",
    "X_train, X_test, y_train, y_true = train_test_split(X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: figure out your classification labels\n",
    "\n",
    "This was actually one of the trickiest parts of the process. These are the labels I finally decided on:\n",
    "\n",
    "* Malay (including Indonesian/Bugis names)\n",
    "* British\n",
    "* Chinese (all languages, or \"dialects\")\n",
    "* Indian (all languages)\n",
    "* Other (e.g. other European names, Jewish names, Armenian names...)\n",
    "* Generic (e.g. Temple Street, Sunrise Avenue)\n",
    "\n",
    "Bear in mind that some streets can be classified in more than one way. For example, is Queen Street \"British\" or \"Generic\"? In this case I chose \"British\" because it was specifically named after Queen Victoria. I tried to apply my criteria consistently, but up to ~5% of the roads are arguable. For some roads there was not enough information, so I went with my gut feel about the orthotactics of the word (its letter patterns)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Malay      614\n",
       "British    518\n",
       "Generic    255\n",
       "Chinese    217\n",
       "Other      119\n",
       "Indian      28\n",
       "dtype: int64"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.classification.value_counts()"
   ]
  },
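  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The counts above also give a simple reference point for later accuracy figures: a classifier that always predicts the most common label. The cell below is a minimal sketch of that arithmetic. The `label_counts` dictionary is copied by hand from the output above, and the variable names are illustrative rather than part of the original analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Label counts copied by hand from the value_counts() output above (illustrative)\n",
    "label_counts = {'Malay': 614, 'British': 518, 'Generic': 255,\n",
    "                'Chinese': 217, 'Other': 119, 'Indian': 28}\n",
    "\n",
    "total_roads = sum(label_counts.values())\n",
    "\n",
    "# Accuracy of always predicting the most common label\n",
    "majority_label = max(label_counts, key=label_counts.get)\n",
    "majority_accuracy = float(label_counts[majority_label]) / total_roads\n",
    "\n",
    "# Expected accuracy of guessing uniformly at random among the labels\n",
    "uniform_accuracy = 1.0 / len(label_counts)\n",
    "\n",
    "print('always %s: %.3f, uniform guess: %.3f'\n",
    "      % (majority_label, majority_accuracy, uniform_accuracy))"
   ]
  },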
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: decide what features to use\n",
    "\n",
    "What we're doing is basically language classification. Often, people use n-grams as features for this. scikit-learn conveniently provides a function that counts n-grams for us."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'scipy.sparse.csr.csr_matrix'>\n",
      "(131, 1410)\n",
      "(44, 1410)\n"
     ]
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "vect = CountVectorizer(ngram_range=(1,4), analyzer='char')\n",
    "\n",
    "# fit_transform for the training data\n",
    "X_train_feats = vect.fit_transform(X_train)\n",
    "# transform for the test data\n",
    "# because we need to match the ngrams that were found in the training set \n",
    "X_test_feats  = vect.transform(X_test) \n",
    "\n",
    "print type(X_train_feats)\n",
    "print X_train_feats.shape\n",
    "print X_test_feats.shape"
   ]
  },
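  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the features concrete, the cell below is a small sketch that lists the character n-grams extracted from a single road name. It uses a separate vectorizer, `demo_vect`, with a shorter range of (1, 2) purely to keep the output readable; the `demo_` names are illustrative and are not used elsewhere. Note that `CountVectorizer` lowercases its input by default, so 'Sedap' is analysed as 'sedap'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "# Separate vectorizer so that the notebook's own vect is left untouched.\n",
    "# The range (1, 2) is an assumption made for brevity; the notebook uses (1, 4).\n",
    "demo_vect = CountVectorizer(ngram_range=(1, 2), analyzer='char')\n",
    "\n",
    "# build_analyzer() returns the function that turns one string into its n-grams\n",
    "demo_analyzer = demo_vect.build_analyzer()\n",
    "demo_ngrams = demo_analyzer('Sedap')\n",
    "\n",
    "print(demo_ngrams)"
   ]
  },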
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Step 3: pick a classifier\n",
    "\n",
    "<img width=\"80%\" src=\"http://scikit-learn.org/stable/_static/ml_map.png\">\n",
    "\n",
    "According to this, we should be starting out with Linear SVC."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.svm import LinearSVC\n",
    "clf = LinearSVC()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Train the model\n",
    "\n",
    "Use the classifier to fit a model based on the feature matrix of `X_train` and the label vector of `y_train`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "model = clf.fit(X_train_feats, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5: Predict the labels of the test set\n",
    "\n",
    "Now that we have our model, we can use it to predict labels on a fresh test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "y_predicted = model.predict(X_test_feats)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['Malay', 'Malay', 'British', 'Malay', 'British', 'British',\n",
       "       'British', 'British', 'British', 'British', 'Malay', 'Chinese',\n",
       "       'British', 'Chinese', 'British', 'Other', 'Generic', 'Malay',\n",
       "       'Malay', 'Chinese', 'British', 'British', 'Malay', 'British',\n",
       "       'British', 'Generic', 'Other', 'British', 'British', 'British',\n",
       "       'British', 'British', 'Malay', 'Generic', 'Malay', 'Generic',\n",
       "       'Malay', 'British', 'Malay', 'British', 'British', 'Malay', 'Malay',\n",
       "       'Generic'], dtype=object)"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_predicted"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6: select an evaluation metric\n",
    "\n",
    "scikit-learn comes with a bunch of evaluation metrics. Which one should be chosen depends on what we're trying to minimise/maximise. In this case, we want to make as few errors as possible, so it makes sense to use accuracy as our metric.\n",
    "\n",
    "$$ accuracy = \\frac{\\# correct}{\\# classified} $$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics import accuracy_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.59090909090909094"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "accuracy_score(y_true, y_predicted)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So we got 60% accuracy. Let's try it with a few more train/test splits to see whether this is typical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def classify(X, y):\n",
    "    # do the train-test split\n",
    "    X_train, X_test, y_train, y_true = train_test_split(X, y)\n",
    "\n",
    "    # get our features\n",
    "    X_train_feats = vect.fit_transform(X_train)\n",
    "    X_test_feats  = vect.transform(X_test) \n",
    "\n",
    "    # train our model\n",
    "    model = clf.fit(X_train_feats, y_train)\n",
    "    \n",
    "    # predict labels on the test set\n",
    "    y_predicted = model.predict(X_test_feats)\n",
    "    \n",
    "    # return the accuracy score obtained\n",
    "    return accuracy_score(y_true, y_predicted)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.551818181818\n"
     ]
    }
   ],
   "source": [
    "scores = list()\n",
    "num_expts = 100\n",
    "for i in range(num_expts):\n",
    "    score = classify(X,y)\n",
    "    scores.append(score)\n",
    "    \n",
    "print sum(scores) / num_expts"
   ]
  },
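  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The repeated-split experiment above can also be written with scikit-learn's own helpers. The cell below is a sketch, not the method used in this notebook: it wraps the vectorizer and classifier in a pipeline, so the n-grams are re-learned on each training split, and scores it over repeated random splits with `ShuffleSplit`. To keep it self-contained it runs on a handful of (name, label) pairs copied from the sample printed earlier, so its scores say nothing about the real dataset. The names `sample_pairs`, `ngram_svc`, `splitter` and `split_scores`, the choice of 20 splits and the `random_state` are all assumptions made for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from sklearn.model_selection import ShuffleSplit, cross_val_score\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.svm import LinearSVC\n",
    "\n",
    "# Toy data copied from the sample output earlier in the notebook (illustration only)\n",
    "sample_pairs = [('Opal', 'Generic'), ('Club', 'Generic'), ('Minto', 'Other'),\n",
    "                ('Woodlands', 'Generic'), ('Hai Sing', 'Chinese'), ('Batalong', 'Malay'),\n",
    "                ('Hikayat', 'Malay'), ('Bassein', 'Other'), ('Mount Echo', 'Generic'),\n",
    "                ('Kallang Pudding', 'Malay'), ('Republic', 'Generic'), ('Wan Tho', 'Chinese'),\n",
    "                ('Rengkam', 'Malay'), ('Keong Saik', 'Chinese'), ('Sedap', 'Malay'),\n",
    "                ('Stratton', 'British'), ('Seagull', 'Generic'), ('Manila', 'Other')]\n",
    "sample_names = [name for name, _ in sample_pairs]\n",
    "sample_labels = [label for _, label in sample_pairs]\n",
    "\n",
    "# Vectorizer and classifier bundled together, refitted on every training split\n",
    "ngram_svc = make_pipeline(CountVectorizer(ngram_range=(1, 4), analyzer='char'),\n",
    "                          LinearSVC())\n",
    "\n",
    "# 20 random 75/25 splits, mirroring train_test_split's default proportions\n",
    "splitter = ShuffleSplit(n_splits=20, test_size=0.25, random_state=0)\n",
    "split_scores = cross_val_score(ngram_svc, sample_names, sample_labels,\n",
    "                               cv=splitter, scoring='accuracy')\n",
    "\n",
    "print('mean accuracy %.3f, standard deviation %.3f'\n",
    "      % (split_scores.mean(), split_scores.std()))"
   ]
  },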
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "The accuracy we obtain with this set of features and this classifier is about 55%. This isn't completely terrible. With 6 categories, a completely random classifier should expect to get only 16.6% of them right. But 55% accuracy also means that I'd have to go through and correct every other label. How can we improve this?\n",
    "\n",
    "There are a few ways that spring to mind:\n",
    "\n",
    "* Increase the amount of data - easier said than done\n",
    "* Try different classifiers - scikit-learn makes this dead easy\n",
    "* Use more features - worth a try (and we will)\n",
    "* Adjust the hyperparameters of the classifiers - more on this later"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}