{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "K-Means Clustering - Data Exploration\n", "==========================================\n", "***\n", "\n", "###UN Data on Countries of the World\n", "\n", "We are going to explore or dataset which we get in a csv format but may have missing values.\n", "We need to be able to drill down on useful dimensions to explore after cleaning up the data.\n", "Since we only have one observation per country, we may not have the option to use columns where there are many missing values as we are effectively going to drop many countries when we drop rows with missing values.\n", "But then how did we drop such rows before? Because in those cases there were many observations per individual entity and dropping some did not eliminate an entity altogether.\n", "\n", "So first we import the data and explore the columns and types - this time rather than doing it manually we are going to use the facilities in our software to do that.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Populating the interactive namespace from numpy and matplotlib\n", "----\n", " country region tfr contraception educationMale \\\n", "0 Afghanistan Asia 6.90 NaN NaN \n", "1 Albania Europe 2.60 NaN NaN \n", "2 Algeria Africa 3.81 52 11.1 \n", "3 American.Samoa Asia NaN NaN NaN \n", "4 Andorra Europe NaN NaN NaN \n", "5 Angola Africa 6.69 NaN NaN \n", "6 Antigua America NaN 53 NaN \n", "7 Argentina America 2.62 NaN NaN \n", "8 Armenia Europe 1.70 22 NaN \n", "9 Australia Oceania 1.89 76 16.3 \n", "10 Austria Europe 1.42 71 14.4 \n", "11 Azerbaijan Asia 2.30 17 NaN \n", "12 Bahamas America 1.95 62 12.1 \n", "13 Bahrain Asia 2.97 53 12.6 \n", "14 Bangladesh Asia 3.14 49 NaN \n", "15 Barbados America 1.73 55 NaN \n", "16 Belarus Europe 1.40 50 NaN \n", "17 Belgium Europe 1.62 79 15.6 \n", "18 Belize America 3.66 47 10.6 \n", "19 Benin Africa 5.83 16 NaN \n", "20 Bhutan Asia 5.89 19 NaN \n", "21 Bolivia America 4.36 45 NaN \n", "22 Bosnia Europe 1.40 NaN NaN \n", "23 Botswana Africa 4.45 33 10.5 \n", "24 Brazil America 2.17 74 NaN \n", "25 Brunei Asia 2.70 NaN 11.8 \n", "26 Bulgaria Europe 1.45 NaN 11.8 \n", "27 Burkina.Faso Africa 6.57 8 3.3 \n", "28 Burundi Africa 6.28 9 5.1 \n", "29 Cambodia Asia 4.50 NaN NaN \n", ".. ... ... ... ... ... \n", "177 Swaziland Africa 4.46 20 11.5 \n", "178 Sweden Europe 1.80 78 13.9 \n", "179 Switzerland Europe 1.46 71 14.5 \n", "180 Syria Asia 4.00 36 9.8 \n", "181 Tajikistan Asia 3.93 21 NaN \n", "182 Tanzania Africa 5.48 18 NaN \n", "183 Thailand Asia 1.74 74 NaN \n", "184 Togo Africa 6.08 12 NaN \n", "185 Tonga Oceania 4.02 74 NaN \n", "186 Trinidad.and.Tobago America 2.10 53 10.1 \n", "187 Tunisia Africa 2.92 60 NaN \n", "188 Turkey Asia 2.50 63 10.6 \n", "189 Turkmenistan Asia 3.58 20 NaN \n", "190 Tuvalu Oceania NaN NaN NaN \n", "191 Uganda Africa 7.10 15 NaN \n", "192 Ukraine Europe 1.38 23 NaN \n", "193 United.Arab.Emirates Asia 3.46 NaN 9.8 \n", "194 United.Kingdom Europe 1.72 82 16.1 \n", "195 United.States America 1.96 71 15.4 \n", "196 Uruguay America 2.25 NaN NaN \n", "197 Uzbekistan Asia 3.48 56 NaN \n", "198 Vanuatu Oceania 4.36 15 NaN \n", "199 Venezuela America 2.98 52 10.2 \n", "200 Viet.Nam Asia 2.97 65 NaN \n", "201 Virgin.Islands America 3.03 NaN NaN \n", "202 Western.Sahara Africa 3.98 NaN NaN \n", "203 Yemen Asia 7.60 7 NaN \n", "204 Yugoslavia Europe 1.80 NaN NaN \n", "205 Zambia Africa 5.49 25 7.9 \n", "206 Zimbabwe Africa 4.68 48 NaN \n", "\n", " educationFemale lifeMale lifeFemale infantMortality GDPperCapita \\\n", "0 NaN 45.0 46.0 154 2848 \n", "1 NaN 68.0 74.0 32 863 \n", "2 9.9 67.5 70.3 44 1531 \n", "3 NaN 68.0 73.0 11 NaN \n", "4 NaN NaN NaN NaN NaN \n", "5 NaN 44.9 48.1 124 355 \n", "6 NaN NaN NaN 24 6966 \n", "7 NaN 69.6 76.8 22 8055 \n", "8 NaN 67.2 74.0 25 354 \n", "9 16.1 75.4 81.2 6 20046 \n", "10 14.2 73.7 80.1 6 29006 \n", "11 NaN 66.5 74.5 33 321 \n", "12 13.2 70.5 77.1 14 12545 \n", "13 13.3 71.1 75.3 18 9073 \n", "14 NaN 58.1 58.2 78 280 \n", "15 NaN 73.6 78.7 9 7173 \n", "16 NaN 64.4 74.8 15 994 \n", "17 15.4 73.9 80.6 7 26582 \n", "18 10.4 73.4 76.1 30 2569 \n", "19 NaN 52.4 57.2 84 391 \n", "20 NaN 51.6 54.9 104 166 \n", "21 NaN 59.8 63.2 66 909 \n", "22 NaN 70.5 75.9 13 271 \n", "23 10.7 48.9 51.7 56 3640 \n", "24 NaN 63.4 71.2 42 4510 \n", "25 12.1 73.4 78.1 9 16683 \n", "26 12.5 67.8 74.9 16 1518 \n", "27 2.0 45.1 47.0 97 165 \n", "28 4.0 45.5 48.8 114 205 \n", "29 NaN 52.6 55.4 102 130 \n", ".. ... ... ... ... ... \n", "177 10.8 57.7 62.3 65 1389 \n", "178 14.5 76.2 80.8 5 26253 \n", "179 13.5 75.3 81.8 5 42416 \n", "180 8.5 66.7 71.2 33 3573 \n", "181 NaN 64.2 70.2 56 122 \n", "182 NaN 50.0 52.8 80 139 \n", "183 NaN 66.3 72.3 30 2896 \n", "184 NaN 48.8 51.5 86 322 \n", "185 NaN 67.0 71.0 3 1787 \n", "186 11.3 71.5 76.2 14 4083 \n", "187 NaN 68.4 70.7 37 2030 \n", "188 8.7 66.5 71.7 44 2814 \n", "189 NaN 61.2 68.0 57 321 \n", "190 NaN NaN NaN NaN NaN \n", "191 NaN 40.4 42.3 113 305 \n", "192 NaN 63.6 74.0 18 694 \n", "193 10.3 73.9 76.5 15 17690 \n", "194 16.6 74.5 79.8 6 18913 \n", "195 16.2 73.4 80.1 7 26037 \n", "196 NaN 69.6 76.1 17 5602 \n", "197 NaN 64.3 70.7 43 435 \n", "198 NaN 65.5 69.5 38 1289 \n", "199 10.7 70.0 75.7 21 3496 \n", "200 NaN 64.9 69.6 37 270 \n", "201 NaN NaN NaN 12 NaN \n", "202 NaN 59.8 63.1 64 NaN \n", "203 NaN 57.4 58.4 80 732 \n", "204 NaN 69.8 75.3 19 1487 \n", "205 6.8 42.2 43.7 103 382 \n", "206 NaN 47.6 49.4 68 786 \n", "\n", " economicActivityMale economicActivityFemale illiteracyMale \\\n", "0 87.5 7.2 52.800 \n", "1 NaN NaN NaN \n", "2 76.4 7.8 26.100 \n", "3 58.8 42.4 0.264 \n", "4 NaN NaN NaN \n", "5 NaN NaN NaN \n", "6 74.4 56.2 NaN \n", "7 76.2 41.3 3.800 \n", "8 65.0 52.0 0.300 \n", "9 74.0 53.8 NaN \n", "10 69.5 47.7 NaN \n", "11 NaN NaN 0.300 \n", "12 81.2 67.0 1.500 \n", "13 88.2 29.2 10.900 \n", "14 88.8 55.9 50.600 \n", "15 73.4 61.4 2.000 \n", "16 76.4 61.3 0.300 \n", "17 NaN NaN NaN \n", "18 79.0 34.0 21.252 \n", "19 90.0 57.8 51.300 \n", "20 NaN NaN 43.800 \n", "21 74.1 56.3 9.500 \n", "22 NaN NaN NaN \n", "23 75.4 41.7 19.500 \n", "24 84.0 53.6 16.700 \n", "25 82.2 46.4 7.400 \n", "26 60.7 53.8 0.924 \n", "27 88.9 79.4 70.500 \n", "28 90.1 90.6 50.700 \n", "29 77.3 84.7 NaN \n", ".. ... ... ... \n", "177 64.3 27.7 22.000 \n", "178 80.0 75.6 NaN \n", "179 78.5 56.8 NaN \n", "180 NaN NaN 14.300 \n", "181 75.0 60.0 0.300 \n", "182 NaN NaN 20.600 \n", "183 83.8 65.2 4.000 \n", "184 77.5 50.8 33.000 \n", "185 74.2 45.4 0.264 \n", "186 75.5 44.9 1.200 \n", "187 75.4 20.3 21.400 \n", "188 75.9 30.6 8.300 \n", "189 78.0 62.0 0.200 \n", "190 NaN NaN NaN \n", "191 NaN NaN 26.300 \n", "192 69.1 57.1 0.330 \n", "193 92.5 24.2 21.100 \n", "194 71.9 53.5 NaN \n", "195 74.9 59.3 2.244 \n", "196 74.0 46.7 3.100 \n", "197 75.0 61.0 0.200 \n", "198 88.6 79.3 34.914 \n", "199 82.1 41.2 8.200 \n", "200 81.6 74.1 3.500 \n", "201 72.3 59.5 NaN \n", "202 NaN NaN NaN \n", "203 80.6 1.9 32.406 \n", "204 NaN NaN 1.782 \n", "205 NaN NaN 14.400 \n", "206 77.7 46.7 9.600 \n", "\n", " illiteracyFemale \n", "0 85.000 \n", "1 NaN \n", "2 51.000 \n", "3 0.360 \n", "4 NaN \n", "5 NaN \n", "6 NaN \n", "7 3.800 \n", "8 0.500 \n", "9 NaN \n", "10 NaN \n", "11 0.500 \n", "12 2.000 \n", "13 20.600 \n", "14 73.900 \n", "15 3.200 \n", "16 0.600 \n", "17 NaN \n", "18 23.472 \n", "19 74.200 \n", "20 71.900 \n", "21 24.000 \n", "22 NaN \n", "23 40.100 \n", "24 16.800 \n", "25 16.600 \n", "26 2.376 \n", "27 90.800 \n", "28 77.500 \n", "29 NaN \n", ".. ... \n", "177 24.400 \n", "178 NaN \n", "179 NaN \n", "180 44.200 \n", "181 0.400 \n", "182 43.200 \n", "183 8.400 \n", "184 63.000 \n", "185 0.504 \n", "186 3.000 \n", "187 45.400 \n", "188 27.600 \n", "189 0.400 \n", "190 NaN \n", "191 49.800 \n", "192 2.160 \n", "193 20.200 \n", "194 NaN \n", "195 2.232 \n", "196 2.300 \n", "197 0.400 \n", "198 46.368 \n", "199 9.700 \n", "200 8.800 \n", "201 NaN \n", "202 NaN \n", "203 69.552 \n", "204 9.072 \n", "205 28.700 \n", "206 20.100 \n", "\n", "[207 rows x 14 columns]\n", "----\n", "Individual columns - Python data types\n" ] }, { "data": { "text/plain": [ "[('country', str),\n", " ('region', str),\n", " ('tfr', numpy.float64),\n", " ('contraception', numpy.float64),\n", " ('educationMale', numpy.float64),\n", " ('educationFemale', numpy.float64),\n", " ('lifeMale', numpy.float64),\n", " ('lifeFemale', numpy.float64),\n", " ('infantMortality', numpy.float64),\n", " ('GDPperCapita', numpy.float64),\n", " ('economicActivityMale', numpy.float64),\n", " ('economicActivityFemale', numpy.float64),\n", " ('illiteracyMale', numpy.float64),\n", " ('illiteracyFemale', numpy.float64)]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%pylab inline\n", "import pandas as pd\n", "df = pd.read_csv('../datasets/UN.csv')\n", "print('----')\n", "# print the raw column information plus summary header\n", "print(df)\n", "print('----')\n", "# look at the types of each column explicitly\n", "print('Individual columns - Python data types')\n", "[(x, type(df[x][0])) for x in df.columns] " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we see that we have 14 columns with country and region being string types and the rest being floats. We also see that the country column has 207 values, ie this is data on 207 countries.\n", "The region columns also has 207 entries, but the rest of the columns have many missing entries, indicated by number of non-null values less than 207.\n", "\n", "We see that tfr, lifeMale, lifeFemale and GDP, and infantMortality are the columns closest to 207. That is, if we use these columns we will only drop a few countries and not whole clusters as we might if we used educationMale and educationFemale. On the other hand were we to use educationMale and educatonFemale we would have to drop almost 2/3 of the data. So we focus on the columns with non-null values close to 207.\n", "\n", "So our short list is now, country, region, tfr, lifeMale, lifeFemale and GDP, and infantMortality.\n", "\n", "We suspect that there is clustering of lifeMale, lifeFemale and infantMortality according to GDP and we are going to pull out the heavy machinery of K-Means sofwtare to analyse this in detail and look at the clusters. \n", "\n", "We don't know in advance how many clusters there will be which is different from the iris example where we had a 'species' label and there were three unique species.\n", "\n", "So while using our KMeans software we will also look at some analytical measures to decide what the right number of clusters might be after looking at multiple such possibilities from 1 through 10 candidate clusters.\n", "\n", "Finally, to be able to apply the KMeans modeling software we convert each field in our file to a scientific float format that the numerical algorithms expect.\n", "\n", "Onward!" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.core.display import HTML\n", "def css_styling():\n", " styles = open(\"../styles/custom.css\", \"r\").read()\n", " return HTML(styles)\n", "css_styling()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.8" } }, "nbformat": 4, "nbformat_minor": 0 }