{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "K-Means Clustering - Data Exploration\n",
    "==========================================\n",
    "***\n",
    "\n",
    "###UN Data on Countries of the World\n",
    "\n",
    "We are going to explore or dataset which we get in a csv format but may have missing values.\n",
    "We need to be able to drill down on useful dimensions to explore after cleaning up the data.\n",
    "Since we only have one observation per country, we may not have the option to use columns where there are many missing values as we are effectively going to drop many countries when we drop rows with missing values.\n",
    "But then how did we drop such rows before? Because in those cases there were many observations per individual entity and dropping some did not eliminate an entity altogether.\n",
    "\n",
    "So first we import the data and explore the columns and types - this time rather than doing it manually we are going to use the facilities in our software to do that.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Populating the interactive namespace from numpy and matplotlib\n",
      "----\n",
      "                  country   region   tfr  contraception  educationMale  \\\n",
      "0             Afghanistan     Asia  6.90            NaN            NaN   \n",
      "1                 Albania   Europe  2.60            NaN            NaN   \n",
      "2                 Algeria   Africa  3.81             52           11.1   \n",
      "3          American.Samoa     Asia   NaN            NaN            NaN   \n",
      "4                 Andorra   Europe   NaN            NaN            NaN   \n",
      "5                  Angola   Africa  6.69            NaN            NaN   \n",
      "6                 Antigua  America   NaN             53            NaN   \n",
      "7               Argentina  America  2.62            NaN            NaN   \n",
      "8                 Armenia   Europe  1.70             22            NaN   \n",
      "9               Australia  Oceania  1.89             76           16.3   \n",
      "10                Austria   Europe  1.42             71           14.4   \n",
      "11             Azerbaijan     Asia  2.30             17            NaN   \n",
      "12                Bahamas  America  1.95             62           12.1   \n",
      "13                Bahrain     Asia  2.97             53           12.6   \n",
      "14             Bangladesh     Asia  3.14             49            NaN   \n",
      "15               Barbados  America  1.73             55            NaN   \n",
      "16                Belarus   Europe  1.40             50            NaN   \n",
      "17                Belgium   Europe  1.62             79           15.6   \n",
      "18                 Belize  America  3.66             47           10.6   \n",
      "19                  Benin   Africa  5.83             16            NaN   \n",
      "20                 Bhutan     Asia  5.89             19            NaN   \n",
      "21                Bolivia  America  4.36             45            NaN   \n",
      "22                 Bosnia   Europe  1.40            NaN            NaN   \n",
      "23               Botswana   Africa  4.45             33           10.5   \n",
      "24                 Brazil  America  2.17             74            NaN   \n",
      "25                 Brunei     Asia  2.70            NaN           11.8   \n",
      "26               Bulgaria   Europe  1.45            NaN           11.8   \n",
      "27           Burkina.Faso   Africa  6.57              8            3.3   \n",
      "28                Burundi   Africa  6.28              9            5.1   \n",
      "29               Cambodia     Asia  4.50            NaN            NaN   \n",
      "..                    ...      ...   ...            ...            ...   \n",
      "177             Swaziland   Africa  4.46             20           11.5   \n",
      "178                Sweden   Europe  1.80             78           13.9   \n",
      "179           Switzerland   Europe  1.46             71           14.5   \n",
      "180                 Syria     Asia  4.00             36            9.8   \n",
      "181            Tajikistan     Asia  3.93             21            NaN   \n",
      "182              Tanzania   Africa  5.48             18            NaN   \n",
      "183              Thailand     Asia  1.74             74            NaN   \n",
      "184                  Togo   Africa  6.08             12            NaN   \n",
      "185                 Tonga  Oceania  4.02             74            NaN   \n",
      "186   Trinidad.and.Tobago  America  2.10             53           10.1   \n",
      "187               Tunisia   Africa  2.92             60            NaN   \n",
      "188                Turkey     Asia  2.50             63           10.6   \n",
      "189          Turkmenistan     Asia  3.58             20            NaN   \n",
      "190                Tuvalu  Oceania   NaN            NaN            NaN   \n",
      "191                Uganda   Africa  7.10             15            NaN   \n",
      "192               Ukraine   Europe  1.38             23            NaN   \n",
      "193  United.Arab.Emirates     Asia  3.46            NaN            9.8   \n",
      "194        United.Kingdom   Europe  1.72             82           16.1   \n",
      "195         United.States  America  1.96             71           15.4   \n",
      "196               Uruguay  America  2.25            NaN            NaN   \n",
      "197            Uzbekistan     Asia  3.48             56            NaN   \n",
      "198               Vanuatu  Oceania  4.36             15            NaN   \n",
      "199             Venezuela  America  2.98             52           10.2   \n",
      "200              Viet.Nam     Asia  2.97             65            NaN   \n",
      "201        Virgin.Islands  America  3.03            NaN            NaN   \n",
      "202        Western.Sahara   Africa  3.98            NaN            NaN   \n",
      "203                 Yemen     Asia  7.60              7            NaN   \n",
      "204            Yugoslavia   Europe  1.80            NaN            NaN   \n",
      "205                Zambia   Africa  5.49             25            7.9   \n",
      "206              Zimbabwe   Africa  4.68             48            NaN   \n",
      "\n",
      "     educationFemale  lifeMale  lifeFemale  infantMortality  GDPperCapita  \\\n",
      "0                NaN      45.0        46.0              154          2848   \n",
      "1                NaN      68.0        74.0               32           863   \n",
      "2                9.9      67.5        70.3               44          1531   \n",
      "3                NaN      68.0        73.0               11           NaN   \n",
      "4                NaN       NaN         NaN              NaN           NaN   \n",
      "5                NaN      44.9        48.1              124           355   \n",
      "6                NaN       NaN         NaN               24          6966   \n",
      "7                NaN      69.6        76.8               22          8055   \n",
      "8                NaN      67.2        74.0               25           354   \n",
      "9               16.1      75.4        81.2                6         20046   \n",
      "10              14.2      73.7        80.1                6         29006   \n",
      "11               NaN      66.5        74.5               33           321   \n",
      "12              13.2      70.5        77.1               14         12545   \n",
      "13              13.3      71.1        75.3               18          9073   \n",
      "14               NaN      58.1        58.2               78           280   \n",
      "15               NaN      73.6        78.7                9          7173   \n",
      "16               NaN      64.4        74.8               15           994   \n",
      "17              15.4      73.9        80.6                7         26582   \n",
      "18              10.4      73.4        76.1               30          2569   \n",
      "19               NaN      52.4        57.2               84           391   \n",
      "20               NaN      51.6        54.9              104           166   \n",
      "21               NaN      59.8        63.2               66           909   \n",
      "22               NaN      70.5        75.9               13           271   \n",
      "23              10.7      48.9        51.7               56          3640   \n",
      "24               NaN      63.4        71.2               42          4510   \n",
      "25              12.1      73.4        78.1                9         16683   \n",
      "26              12.5      67.8        74.9               16          1518   \n",
      "27               2.0      45.1        47.0               97           165   \n",
      "28               4.0      45.5        48.8              114           205   \n",
      "29               NaN      52.6        55.4              102           130   \n",
      "..               ...       ...         ...              ...           ...   \n",
      "177             10.8      57.7        62.3               65          1389   \n",
      "178             14.5      76.2        80.8                5         26253   \n",
      "179             13.5      75.3        81.8                5         42416   \n",
      "180              8.5      66.7        71.2               33          3573   \n",
      "181              NaN      64.2        70.2               56           122   \n",
      "182              NaN      50.0        52.8               80           139   \n",
      "183              NaN      66.3        72.3               30          2896   \n",
      "184              NaN      48.8        51.5               86           322   \n",
      "185              NaN      67.0        71.0                3          1787   \n",
      "186             11.3      71.5        76.2               14          4083   \n",
      "187              NaN      68.4        70.7               37          2030   \n",
      "188              8.7      66.5        71.7               44          2814   \n",
      "189              NaN      61.2        68.0               57           321   \n",
      "190              NaN       NaN         NaN              NaN           NaN   \n",
      "191              NaN      40.4        42.3              113           305   \n",
      "192              NaN      63.6        74.0               18           694   \n",
      "193             10.3      73.9        76.5               15         17690   \n",
      "194             16.6      74.5        79.8                6         18913   \n",
      "195             16.2      73.4        80.1                7         26037   \n",
      "196              NaN      69.6        76.1               17          5602   \n",
      "197              NaN      64.3        70.7               43           435   \n",
      "198              NaN      65.5        69.5               38          1289   \n",
      "199             10.7      70.0        75.7               21          3496   \n",
      "200              NaN      64.9        69.6               37           270   \n",
      "201              NaN       NaN         NaN               12           NaN   \n",
      "202              NaN      59.8        63.1               64           NaN   \n",
      "203              NaN      57.4        58.4               80           732   \n",
      "204              NaN      69.8        75.3               19          1487   \n",
      "205              6.8      42.2        43.7              103           382   \n",
      "206              NaN      47.6        49.4               68           786   \n",
      "\n",
      "     economicActivityMale  economicActivityFemale  illiteracyMale  \\\n",
      "0                    87.5                     7.2          52.800   \n",
      "1                     NaN                     NaN             NaN   \n",
      "2                    76.4                     7.8          26.100   \n",
      "3                    58.8                    42.4           0.264   \n",
      "4                     NaN                     NaN             NaN   \n",
      "5                     NaN                     NaN             NaN   \n",
      "6                    74.4                    56.2             NaN   \n",
      "7                    76.2                    41.3           3.800   \n",
      "8                    65.0                    52.0           0.300   \n",
      "9                    74.0                    53.8             NaN   \n",
      "10                   69.5                    47.7             NaN   \n",
      "11                    NaN                     NaN           0.300   \n",
      "12                   81.2                    67.0           1.500   \n",
      "13                   88.2                    29.2          10.900   \n",
      "14                   88.8                    55.9          50.600   \n",
      "15                   73.4                    61.4           2.000   \n",
      "16                   76.4                    61.3           0.300   \n",
      "17                    NaN                     NaN             NaN   \n",
      "18                   79.0                    34.0          21.252   \n",
      "19                   90.0                    57.8          51.300   \n",
      "20                    NaN                     NaN          43.800   \n",
      "21                   74.1                    56.3           9.500   \n",
      "22                    NaN                     NaN             NaN   \n",
      "23                   75.4                    41.7          19.500   \n",
      "24                   84.0                    53.6          16.700   \n",
      "25                   82.2                    46.4           7.400   \n",
      "26                   60.7                    53.8           0.924   \n",
      "27                   88.9                    79.4          70.500   \n",
      "28                   90.1                    90.6          50.700   \n",
      "29                   77.3                    84.7             NaN   \n",
      "..                    ...                     ...             ...   \n",
      "177                  64.3                    27.7          22.000   \n",
      "178                  80.0                    75.6             NaN   \n",
      "179                  78.5                    56.8             NaN   \n",
      "180                   NaN                     NaN          14.300   \n",
      "181                  75.0                    60.0           0.300   \n",
      "182                   NaN                     NaN          20.600   \n",
      "183                  83.8                    65.2           4.000   \n",
      "184                  77.5                    50.8          33.000   \n",
      "185                  74.2                    45.4           0.264   \n",
      "186                  75.5                    44.9           1.200   \n",
      "187                  75.4                    20.3          21.400   \n",
      "188                  75.9                    30.6           8.300   \n",
      "189                  78.0                    62.0           0.200   \n",
      "190                   NaN                     NaN             NaN   \n",
      "191                   NaN                     NaN          26.300   \n",
      "192                  69.1                    57.1           0.330   \n",
      "193                  92.5                    24.2          21.100   \n",
      "194                  71.9                    53.5             NaN   \n",
      "195                  74.9                    59.3           2.244   \n",
      "196                  74.0                    46.7           3.100   \n",
      "197                  75.0                    61.0           0.200   \n",
      "198                  88.6                    79.3          34.914   \n",
      "199                  82.1                    41.2           8.200   \n",
      "200                  81.6                    74.1           3.500   \n",
      "201                  72.3                    59.5             NaN   \n",
      "202                   NaN                     NaN             NaN   \n",
      "203                  80.6                     1.9          32.406   \n",
      "204                   NaN                     NaN           1.782   \n",
      "205                   NaN                     NaN          14.400   \n",
      "206                  77.7                    46.7           9.600   \n",
      "\n",
      "     illiteracyFemale  \n",
      "0              85.000  \n",
      "1                 NaN  \n",
      "2              51.000  \n",
      "3               0.360  \n",
      "4                 NaN  \n",
      "5                 NaN  \n",
      "6                 NaN  \n",
      "7               3.800  \n",
      "8               0.500  \n",
      "9                 NaN  \n",
      "10                NaN  \n",
      "11              0.500  \n",
      "12              2.000  \n",
      "13             20.600  \n",
      "14             73.900  \n",
      "15              3.200  \n",
      "16              0.600  \n",
      "17                NaN  \n",
      "18             23.472  \n",
      "19             74.200  \n",
      "20             71.900  \n",
      "21             24.000  \n",
      "22                NaN  \n",
      "23             40.100  \n",
      "24             16.800  \n",
      "25             16.600  \n",
      "26              2.376  \n",
      "27             90.800  \n",
      "28             77.500  \n",
      "29                NaN  \n",
      "..                ...  \n",
      "177            24.400  \n",
      "178               NaN  \n",
      "179               NaN  \n",
      "180            44.200  \n",
      "181             0.400  \n",
      "182            43.200  \n",
      "183             8.400  \n",
      "184            63.000  \n",
      "185             0.504  \n",
      "186             3.000  \n",
      "187            45.400  \n",
      "188            27.600  \n",
      "189             0.400  \n",
      "190               NaN  \n",
      "191            49.800  \n",
      "192             2.160  \n",
      "193            20.200  \n",
      "194               NaN  \n",
      "195             2.232  \n",
      "196             2.300  \n",
      "197             0.400  \n",
      "198            46.368  \n",
      "199             9.700  \n",
      "200             8.800  \n",
      "201               NaN  \n",
      "202               NaN  \n",
      "203            69.552  \n",
      "204             9.072  \n",
      "205            28.700  \n",
      "206            20.100  \n",
      "\n",
      "[207 rows x 14 columns]\n",
      "----\n",
      "Individual columns - Python data types\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[('country', str),\n",
       " ('region', str),\n",
       " ('tfr', numpy.float64),\n",
       " ('contraception', numpy.float64),\n",
       " ('educationMale', numpy.float64),\n",
       " ('educationFemale', numpy.float64),\n",
       " ('lifeMale', numpy.float64),\n",
       " ('lifeFemale', numpy.float64),\n",
       " ('infantMortality', numpy.float64),\n",
       " ('GDPperCapita', numpy.float64),\n",
       " ('economicActivityMale', numpy.float64),\n",
       " ('economicActivityFemale', numpy.float64),\n",
       " ('illiteracyMale', numpy.float64),\n",
       " ('illiteracyFemale', numpy.float64)]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%pylab inline\n",
    "import pandas as pd\n",
    "df = pd.read_csv('../datasets/UN.csv')\n",
    "print('----')\n",
    "# print the raw column information plus summary header\n",
    "print(df)\n",
    "print('----')\n",
    "# look at the types of each column explicitly\n",
    "print('Individual columns - Python data types')\n",
    "[(x, type(df[x][0])) for x in df.columns] "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we see that we have 14 columns with country and region being string types and the rest being floats.  We also see that the country column has 207 values, ie this is data on 207 countries.\n",
    "The region columns also has 207 entries, but the rest of the columns have many missing entries, indicated by number of non-null values less than 207.\n",
    "\n",
    "We see that tfr, lifeMale, lifeFemale and GDP, and infantMortality are the columns closest to 207.  That is, if we use these columns we will only drop a few countries and not whole clusters as we might if we used educationMale and educationFemale.  On the other hand were we to use educationMale and educatonFemale we would have to drop almost 2/3 of the data. So we focus on the columns with non-null values close to 207.\n",
    "\n",
    "So our short list is now, country, region, tfr, lifeMale, lifeFemale and GDP, and infantMortality.\n",
    "\n",
    "We suspect that there is clustering of lifeMale, lifeFemale and infantMortality according to GDP and we are going to pull out the heavy machinery of K-Means sofwtare to analyse this in detail and look at the clusters.  \n",
    "\n",
    "We don't know in advance how many clusters there will be which is different from the iris example where we had a 'species' label and there were three unique species.\n",
    "\n",
    "So while using our KMeans software we will also look at some analytical measures to decide what the right number of clusters might be after looking at multiple such possibilities from 1 through 10 candidate clusters.\n",
    "\n",
    "Finally, to be able to apply the KMeans modeling software we convert each field in our file to a scientific float format that the numerical algorithms expect.\n",
    "\n",
    "Onward!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>\n",
       "    @font-face {\n",
       "        font-family: \"Computer Modern\";\n",
       "        src: url('http://mirrors.ctan.org/fonts/cm-unicode/fonts/otf/cmunss.otf');\n",
       "    }\n",
       "    div.cell{\n",
       "        width:800px;\n",
       "        margin-left:auto;\n",
       "        margin-right:auto;\n",
       "    }\n",
       "    h1 {\n",
       "        font-family: \"Charis SIL\", Palatino, serif;\n",
       "    }\n",
       "    h4{\n",
       "        margin-top:12px;\n",
       "        margin-bottom: 3px;\n",
       "       }\n",
       "    div.text_cell_render{\n",
       "        font-family: Computer Modern, \"Helvetica Neue\", Arial, Helvetica, Geneva, sans-serif;\n",
       "        line-height: 145%;\n",
       "        font-size: 120%;\n",
       "        width:800px;\n",
       "        margin-left:auto;\n",
       "        margin-right:auto;\n",
       "    }\n",
       "    .CodeMirror{\n",
       "            font-family: \"Source Code Pro\", source-code-pro,Consolas, monospace;\n",
       "    }\n",
       "    .prompt{\n",
       "        display: None;\n",
       "    }\n",
       "    .text_cell_render h5 {\n",
       "        font-weight: 300;\n",
       "        font-size: 16pt;\n",
       "        color: #4057A1;\n",
       "        font-style: italic;\n",
       "        margin-bottom: .5em;\n",
       "        margin-top: 0.5em;\n",
       "        display: block;\n",
       "    }\n",
       "    \n",
       "    .warning{\n",
       "        color: rgb( 240, 20, 20 )\n",
       "        }\n",
       "</style>\n",
       "<script>\n",
       "    MathJax.Hub.Config({\n",
       "                        TeX: {\n",
       "                           extensions: [\"AMSmath.js\"]\n",
       "                           },\n",
       "                tex2jax: {\n",
       "                    inlineMath: [ ['$','$'], [\"\\\\(\",\"\\\\)\"] ],\n",
       "                    displayMath: [ ['$$','$$'], [\"\\\\[\",\"\\\\]\"] ]\n",
       "                },\n",
       "                displayAlign: 'center', // Change this to 'center' to center equations.\n",
       "                \"HTML-CSS\": {\n",
       "                    styles: {'.MathJax_Display': {\"margin\": 4}}\n",
       "                }\n",
       "        });\n",
       "</script>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML at 0x1092b1890>"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from IPython.core.display import HTML\n",
    "def css_styling():\n",
    "    styles = open(\"../styles/custom.css\", \"r\").read()\n",
    "    return HTML(styles)\n",
    "css_styling()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}