{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Goals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some goals for this exercise:\n", "\n", "* to reacquaint ourselves with Python\n", "* to start learning how to use a particular IPython notebook environment, one which is easy to jump right into, namely [Wakari](https://www.wakari.io/)\n", "* to learn a bit about the international context before diving into the US Census.\n", "* to get ourselves into looking at the Wikipedia as a data source.\n", "\n", "Thinking about populations of various geographic entities is a good place to start with open data. We can to work with numbers without necessarily involving complicated mathematics. Just addition if we're lucky. We can also think about geographical locations. We can build out from our initial explorations in a systematic manner.\n" ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Things to Think About" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Off the top of your head:\n", " \n", " * What do you think is the current world population?\n", " * How many countries are there?\n", " * How many people are there in the USA? Canada? Mexico? Your favorite country?\n", " * What is the minimum number of countries to add up to 50% of the world's population? How about 90%?\n", " \n", "Now go answer these questions looking on the web. Find some a source or two or three." ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Data Source for Populations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Two open sources we'll consider:\n", "\n", "* [CIA World Factbook: Country Comparison Population](https://www.cia.gov/library/publications/the-world-factbook/rankorder/2119rank.html) (see also [The World Factbook: ABOUT :: COPYRIGHT AND CONTRIBUTORS](https://www.cia.gov/library/publications/the-world-factbook/docs/contributor_copyright.html))\n", "* [List of countries by population (United Nations) - Wikipedia, the free encyclopedia](https://en.wikipedia.org/w/index.php?title=List_of_countries_by_population_(United_Nations)) (see also [Wikipedia:Reusing Wikipedia content - Wikipedia, the free encyclopedia](https://en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_content)) -- btw, not the same as [List of countries by population - Wikipedia, the free encyclopedia](https://en.wikipedia.org/wiki/List_of_countries_by_population).\n", "\n", "We will study how to parse these data sources in a later exercise, but for this exercise, the data sets have been parsed into [JSON format](https://en.wikipedia.org/wiki/JSON), which is easily loadable in many languages, including Python using the [json Python standard library](http://docs.python.org/2/library/json.html). We'll also use [requests](http://docs.python-requests.org/en/latest/).\n", "\n", "Let's look first at the Wikipedia source." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# https://gist.github.com/rdhyee/8511607/raw/f16257434352916574473e63612fcea55a0c1b1c/population_of_countries.json\n", "# scraping of https://en.wikipedia.org/w/index.php?title=List_of_countries_by_population_(United_Nations)&oldid=590438477\n", "\n", "# read population in\n", "import json\n", "import requests\n", "\n", "pop_json_url = \"https://gist.github.com/rdhyee/8511607/raw/f16257434352916574473e63612fcea55a0c1b1c/population_of_countries.json\"\n", "pop_list= requests.get(pop_json_url).json()\n", "pop_list" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 1, "text": [ "[[1, u'China', 1385566537],\n", " [2, u'India', 1252139596],\n", " [3, u'United States', 320050716],\n", " [4, u'Indonesia', 249865631],\n", " [5, u'Brazil', 200361925],\n", " [6, u'Pakistan', 182142594],\n", " [7, u'Nigeria', 173615345],\n", " [8, u'Bangladesh', 156594962],\n", " [9, u'Russia', 142833689],\n", " [10, u'Japan', 127143577],\n", " [11, u'Mexico', 122332399],\n", " [12, u'Philippines', 98393574],\n", " [13, u'Ethiopia', 94100756],\n", " [14, u'Vietnam', 91679733],\n", " [15, u'Germany', 82726626],\n", " [16, u'Egypt', 82056378],\n", " [17, u'Iran', 77447168],\n", " [18, u'Turkey', 74932641],\n", " [19, u'Congo, Democratic Republic of the', 67513677],\n", " [20, u'Thailand', 67010502],\n", " [21, u'France', 64291280],\n", " [22, u'United Kingdom', 63136265],\n", " [23, u'Italy', 60990277],\n", " [24, u'Myanmar', 53259018],\n", " [25, u'South Africa', 52776130],\n", " [26, u'Korea, South', 49262698],\n", " [27, u'Tanzania', 49253126],\n", " [28, u'Colombia', 48321405],\n", " [29, u'Spain', 46926963],\n", " [30, u'Ukraine', 45238805],\n", " [31, u'Kenya', 44353691],\n", " [32, u'Argentina', 41446246],\n", " [33, u'Algeria', 39208194],\n", " [34, u'Poland', 38216635],\n", " [35, u'Sudan', 37964306],\n", " [36, u'Uganda', 37578876],\n", " [37, u'Canada', 35181704],\n", " [38, u'Iraq', 33765232],\n", " [39, u'Morocco', 33008150],\n", " [40, u'Afghanistan', 30551674],\n", " [41, u'Venezuela', 30405207],\n", " [42, u'Peru', 30375603],\n", " [43, u'Malaysia', 29716965],\n", " [44, u'Uzbekistan', 28934102],\n", " [45, u'Saudi Arabia', 28828870],\n", " [46, u'Nepal', 27797457],\n", " [47, u'Ghana', 25904598],\n", " [48, u'Mozambique', 25833752],\n", " [49, u'Korea, North', 24895480],\n", " [50, u'Yemen', 24407381],\n", " [51, u'Australia', 23342553],\n", " [52, u'Taiwan', 23329772],\n", " [53, u'Madagascar', 22924851],\n", " [54, u'Cameroon', 22253959],\n", " [55, u'Syria', 21898061],\n", " [56, u'Romania', 21698585],\n", " [57, u'Angola', 21471618],\n", " [58, u'Sri Lanka', 21273228],\n", " [59, u\"C\\xf4te d'Ivoire\", 20316086],\n", " [60, u'Niger', 17831270],\n", " [61, u'Chile', 17619708],\n", " [62, u'Burkina Faso', 16934839],\n", " [63, u'Netherlands', 16759229],\n", " [64, u'Kazakhstan', 16440586],\n", " [65, u'Malawi', 16362567],\n", " [66, u'Ecuador', 15737878],\n", " [67, u'Guatemala', 15468203],\n", " [68, u'Mali', 15301650],\n", " [69, u'Cambodia', 15135169],\n", " [70, u'Zambia', 14538640],\n", " [71, u'Zimbabwe', 14149648],\n", " [72, u'Senegal', 14133280],\n", " [73, u'Chad', 12825314],\n", " [74, u'Rwanda', 11776522],\n", " [75, u'Guinea', 11745189],\n", " [76, u'South Sudan', 11296173],\n", " [77, u'Cuba', 11265629],\n", " [78, u'Greece', 11127990],\n", " [79, u'Belgium', 11104476],\n", " [80, u'Tunisia', 10996515],\n", " [81, u'Czech Republic', 10702197],\n", " [82, u'Bolivia', 10671200],\n", " [83, u'Portugal', 10608156],\n", " [84, u'Somalia', 10495583],\n", " [85, u'Dominican Republic', 10403761],\n", " [86, u'Benin', 10323474],\n", " [87, u'Haiti', 10317461],\n", " [88, u'Burundi', 10162532],\n", " [89, u'Hungary', 9954941],\n", " [90, u'Sweden', 9571105],\n", " [91, u'Serbia; Kosovo', 9510506],\n", " [92, u'Azerbaijan', 9413420],\n", " [93, u'Belarus', 9356678],\n", " [94, u'United Arab Emirates', 9346129],\n", " [95, u'Austria', 8495145],\n", " [96, u'Tajikistan', 8207834],\n", " [97, u'Honduras', 8097688],\n", " [98, u'Switzerland', 8077833],\n", " [99, u'Israel', 7733144],\n", " [100, u'Papua New Guinea', 7321262],\n", " [101, u'Jordan', 7273799],\n", " [102, u'Bulgaria', 7222943],\n", " [None, u'Hong Kong', 7203836],\n", " [103, u'Togo', 6816982],\n", " [104, u'Paraguay', 6802295],\n", " [105, u'Laos', 6769727],\n", " [106, u'El Salvador', 6340454],\n", " [107, u'Eritrea', 6333135],\n", " [108, u'Libya', 6201521],\n", " [109, u'Sierra Leone', 6092075],\n", " [110, u'Nicaragua', 6080478],\n", " [111, u'Denmark', 5619096],\n", " [112, u'Kyrgyzstan', 5547548],\n", " [113, u'Slovakia', 5450223],\n", " [114, u'Finland', 5426323],\n", " [115, u'Singapore', 5411737],\n", " [116, u'Turkmenistan', 5240072],\n", " [117, u'Norway', 5042671],\n", " [118, u'Costa Rica', 4872166],\n", " [119, u'Lebanon', 4821971],\n", " [120, u'Ireland', 4627173],\n", " [121, u'Central African Republic', 4616417],\n", " [122, u'New Zealand', 4505761],\n", " [123, u'Congo, Republic of the', 4447632],\n", " [124, u'Georgia', 4340895],\n", " [125, u'Palestine', 4326295],\n", " [126, u'Liberia', 4294077],\n", " [127, u'Croatia', 4289714],\n", " [128, u'Mauritania', 3889880],\n", " [129, u'Panama', 3864170],\n", " [130, u'Bosnia and Herzegovina', 3829307],\n", " [None, u'Puerto Rico', 3688318],\n", " [131, u'Oman', 3632444],\n", " [132, u'Moldova', 3487204],\n", " [133, u'Uruguay', 3407062],\n", " [134, u'Kuwait', 3368572],\n", " [135, u'Albania', 3173271],\n", " [136, u'Lithuania', 3016933],\n", " [137, u'Armenia', 2976566],\n", " [138, u'Mongolia', 2839073],\n", " [139, u'Jamaica', 2783888],\n", " [140, u'Namibia', 2303315],\n", " [141, u'Qatar', 2168673],\n", " [142, u'Macedonia', 2107158],\n", " [143, u'Lesotho', 2074465],\n", " [144, u'Slovenia', 2071997],\n", " [145, u'Latvia', 2050317],\n", " [146, u'Botswana', 2021144],\n", " [147, u'Gambia', 1849285],\n", " [148, u'Guinea-Bissau', 1704255],\n", " [149, u'Gabon', 1671711],\n", " [150, u'Trinidad and Tobago', 1341151],\n", " [151, u'Bahrain', 1332171],\n", " [152, u'Estonia', 1287251],\n", " [153, u'Swaziland', 1249514],\n", " [154, u'Mauritius', 1244403],\n", " [155, u'Cyprus', 1141166],\n", " [156, u'Timor-Leste', 1132879],\n", " [157, u'Fiji', 881065],\n", " [None, u'R\\xe9union', 875375],\n", " [158, u'Djibouti', 872932],\n", " [159, u'Guyana', 799613],\n", " [160, u'Equatorial Guinea', 757014],\n", " [161, u'Bhutan', 753947],\n", " [162, u'Comoros', 734917],\n", " [163, u'Montenegro', 621383],\n", " [None, u'Western Sahara', 567315],\n", " [None, u'Macau', 566375],\n", " [164, u'Solomon Islands', 561231],\n", " [165, u'Suriname', 539276],\n", " [166, u'Luxembourg', 530380],\n", " [167, u'Cape Verde', 498897],\n", " [None, u'Guadeloupe', 465800],\n", " [168, u'Malta', 429004],\n", " [169, u'Brunei', 417784],\n", " [None, u'Martinique', 403682],\n", " [170, u'Bahamas', 377374],\n", " [171, u'Maldives', 345023],\n", " [172, u'Belize', 331900],\n", " [173, u'Iceland', 329535],\n", " [174, u'Barbados', 284644],\n", " [None, u'French Polynesia', 276831],\n", " [None, u'New Caledonia', 256496],\n", " [175, u'Vanuatu', 252763],\n", " [None, u'French Guiana', 249227],\n", " [None, u'Mayotte', 222152],\n", " [176, u'S\\xe3o Tom\\xe9 and Pr\\xedncipe', 192993],\n", " [177, u'Samoa', 190372],\n", " [178, u'Saint Lucia', 182273],\n", " [None, u'Guam', 165124],\n", " [None, u'Guernsey; Jersey', 162018],\n", " [None, u'Cura\\xe7ao', 158760],\n", " [179, u'Saint Vincent and the Grenadines', 109373],\n", " [None, u'Virgin Islands, United States', 106627],\n", " [180, u'Grenada', 105897],\n", " [181, u'Tonga', 105323],\n", " [182, u'Micronesia, Federated States of', 103549],\n", " [None, u'Aruba', 102911],\n", " [183, u'Kiribati', 102351],\n", " [184, u'Seychelles', 92838],\n", " [185, u'Antigua and Barbuda', 89985],\n", " [None, u'Isle of Man', 85888],\n", " [186, u'Andorra', 79218],\n", " [187, u'Dominica', 72003],\n", " [None, u'Bermuda', 65341],\n", " [None, u'Cayman Islands', 58435],\n", " [None, u'Greenland', 56987],\n", " [None, u'American Samoa', 55165],\n", " [188, u'Saint Kitts and Nevis', 54191],\n", " [None, u'Northern Mariana Islands', 53855],\n", " [189, u'Marshall Islands', 52634],\n", " [None, u'Faroe Islands', 49469],\n", " [None, u'Sint Maarten', 45233],\n", " [190, u'Monaco', 37831],\n", " [191, u'Liechtenstein', 36925],\n", " [None, u'Turks and Caicos Islands', 33098],\n", " [192, u'San Marino', 31448],\n", " [None, u'Gibraltar', 29310],\n", " [None, u'Virgin Islands, British', 28341],\n", " [193, u'Palau', 20918],\n", " [None, u'Cook Islands', 20629],\n", " [None, u'Caribbean Netherlands', 19130],\n", " [None, u'Anguilla', 14300],\n", " [None, u'Wallis and Futuna', 13272],\n", " [194, u'Nauru', 10051],\n", " [195, u'Tuvalu', 9876],\n", " [None, u'Saint Pierre and Miquelon', 6043],\n", " [None, u'Montserrat', 5091],\n", " [None, u'Saint Helena, Ascension and Tristan da Cunha', 4129],\n", " [None, u'Falkland Islands', 3044],\n", " [None, u'Niue', 1344],\n", " [None, u'Tokelau', 1195],\n", " [196, u'Vatican City', 799]]" ] } ], "prompt_number": 1 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "EXERCISES" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Show how to calculate the total population according to the list in the Wikipedia. (Answer: 7,162,119,434)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "total = 0\n", "for country in pop_list:\n", " total += country[2]\n", "print total" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "7162119434\n" ] } ], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "sum([country[2] for country in pop_list])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ "7162119434" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculate the total population of 196 entities that have a numeric rank. (Answer: 7,145,999,288) BTW, are those entities actually countries?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "sum([country[2] for country in pop_list if country[0] is not None])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 4, "text": [ "7145999288" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculate the total population according to [The World Factbook: Country Comparison Population](https://www.cia.gov/library/publications/the-world-factbook/rankorder/2119rank.html) (See https://gist.github.com/rdhyee/8530164)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# https://gist.github.com/rdhyee/8530164/raw/f8e842fe8ccd6e3bc424e3a24e41ef5c38f419e8/world_factbook_poulation.json\n", "# https://gist.github.com/rdhyee/8530164\n", "# https://www.cia.gov/library/publications/the-world-factbook/rankorder/2119rank.html\n", "# https://www.cia.gov/library/publications/the-world-factbook/rankorder/rawdata_2119.txt\n", "\n", "\n", "import json\n", "import requests\n", "\n", "cia_json_url = \"https://gist.github.com/rdhyee/8530164/raw/f8e842fe8ccd6e3bc424e3a24e41ef5c38f419e8/world_factbook_poulation.json\"\n", "cia_list= requests.get(cia_json_url).json()\n", "\n", "\n", "cia_world_pop = sum([r[2] for r in cia_list if r[1] != 'European Union'])\n", "cia_world_pop" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ "7091218583" ] } ], "prompt_number": 5 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "CHALLENGE EXERCISE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now for something more interesting. I'd like for us to get a feel of what it'd be like to pick a person completely at random from the world's population. Say, if you were picking 5 people -- where might these people be from? Of course, you won't be surprised to pick someone from China or India since those countries are so populous. But how likely will it be for someone from the USA to show up?\n", "\n", "To the end of answering this question, start thinking about writing a Python generator that will return the name of a country in which the probability of that country being returned is the proportion of the world's population represented by that country. \n", "\n", "Work with your neighbors -- we'll come back to this problem in detail in class on Thursday." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# This solution is a bit \"fancy\" and can certainly be better commented.\n", "\n", "import bisect\n", "import random\n", "from itertools import repeat, islice\n", "from collections import Counter\n", "\n", "# http://stackoverflow.com/a/15889203/7782\n", "def cumsum(lis):\n", " total = 0\n", " for x in lis:\n", " total += x\n", " yield total\n", "\n", "# depends on pop_list above\n", " \n", "world_pop = sum([r[2] for r in pop_list])\n", "cum_pop = list(cumsum((r[2] for r in pop_list)))\n", "\n", "def random_country_weighted_by_pop():\n", " while True:\n", " yield pop_list[bisect.bisect_left(cum_pop,random.randint(1,world_pop))][1]\n", " " ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "# generate 5 random countries and feed to Counter\n", "Counter(islice(random_country_weighted_by_pop(),5))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 7, "text": [ "Counter({u'Japan': 2, u'Egypt': 1, u'Romania': 1, u'India': 1})" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] } ], "metadata": {} } ] }