{ "metadata": { "name": "", "signature": "sha256:1c9c0d174dc331d2feb2d83313cdd1b55ff0be3459b17d693386fc33f646bd3b" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**Read in JSON and DataFrame Basics**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# read population in\n", "import json\n", "import requests\n", "from pandas import DataFrame\n", "\n", "# pop_json_url holds a \n", "pop_json_url = \"https://gist.github.com/rdhyee/8511607/raw/f16257434352916574473e63612fcea55a0c1b1c/population_of_countries.json\"\n", "pop_list= requests.get(pop_json_url).json()\n", "\n", "df = DataFrame(pop_list)\n", "df[:5]" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0 1 China 1385566537
1 2 India 1252139596
2 3 United States 320050716
3 4 Indonesia 249865631
4 5 Brazil 200361925
\n", "

5 rows \u00d7 3 columns

\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ " 0 1 2\n", "0 1 China 1385566537\n", "1 2 India 1252139596\n", "2 3 United States 320050716\n", "3 4 Indonesia 249865631\n", "4 5 Brazil 200361925\n", "\n", "[5 rows x 3 columns]" ] } ], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "df.dtypes" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 11, "text": [ "0 float64\n", "1 object\n", "2 int64\n", "dtype: object" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: Based on the above statement, which of these would you expect to see in pop_list?\n", " \n", "1. ```['1', 'United States', '320050716']```\n", "1. ```[1, 'United States', 320050716]```\n", "1. ```['United States', 320050716]```\n", "1. ```[1, 'United States', '320050716']```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: What is the relationship between `s` and the population of China?\n", "\n", " s = sum(df[df[1].str.startswith('C')][2])\n", " \n", "1. `s` is **greater** than the population of China\n", "1. `s` is the **same** as the population of China\n", "1. `s` is **less** than the population of China\n", "1. `s` is not a number.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: This statement does the following?\n", "\n", " df.columns = ['Number','Country','Population']\n", " \n", "1. Nothing\n", "1. df gets a new attribute called `columns`\n", "1. df's columns are renamed based on the list\n", "1. Throws an exception" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: How would you rewrite this statement to get the same result\n", "\n", " s = sum(df[df[1].str.startswith('C')][2])\n", "\n", "after running:\n", "\n", " df.columns = ['Number','Country','Population']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Series Examples**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from pandas import DataFrame, Series\n", "import numpy as np\n", "\n", "s1 = Series(np.arange(1,4))\n", "s1" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 54, "text": [ "0 1\n", "1 2\n", "2 3\n", "dtype: int64" ] } ], "prompt_number": 54 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: What is\n", "\n", " s1 + 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: What is\n", " \n", " s1.apply(lambda k: 2*k).sum()\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: What is \n", "\n", " s1.cumsum()[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: What is\n", "\n", " s1.cumsum() + s1.cumsum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: Describe what is happening in these statements:\n", "\n", " s1 + 1\n", "\n", "and\n", "\n", " s1.cumsum() + s1.cumsum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: What is\n", "\n", " np.any(s1 > 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** Census API Examples ** " ] }, { "cell_type": "code", "collapsed": false, "input": [ "from census import Census\n", "from us import states\n", "\n", "import settings\n", "\n", "c = Census(settings.CENSUS_KEY)\n", "c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 62, "text": [ "[{u'NAME': u'California', u'P0010001': u'37253956', u'state': u'06'}]" ] } ], "prompt_number": 62 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: What is the purpose of `settings.CENSUS_KEY`?\n", "\n", "1. It is the password for the Census Python package\n", "1. It is an API Access key for authentication with the Census API\n", "1. It is an API Access key for authentication with Github\n", "1. It is key shared by all users of the Census API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: What is the difference between `r1` and `r2`?\n", "\n", " r1 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:%s' % states.CA.fips})\n", " r2 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:*' })" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: Which is the correct geographic hierarchy?\n", "\n", "Nation > States = Nation is subdivided into States\n", "\n", "1. Counties > States\n", "1. Counties > Census Blocks > Census Tracks\n", "1. Places > Counties\n", "1. Census Tracts > Block Groups > Census Blocks" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from pandas import DataFrame\n", "\n", "r = c.sf1.get(('NAME', 'P0010001'), {'for': 'state:*'})\n", "df = DataFrame(r)\n", "\n", "df.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NAMEP0010001state
0 Alabama 4779736 01
1 Alaska 710231 02
2 Arizona 6392017 04
3 Arkansas 2915918 05
4 California 37253956 06
\n", "

5 rows \u00d7 3 columns

\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 72, "text": [ " NAME P0010001 state\n", "0 Alabama 4779736 01\n", "1 Alaska 710231 02\n", "2 Arizona 6392017 04\n", "3 Arkansas 2915918 05\n", "4 California 37253956 06\n", "\n", "[5 rows x 3 columns]" ] } ], "prompt_number": 72 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: Why does `df` have 52 items? Please explain" ] }, { "cell_type": "code", "collapsed": false, "input": [ "len(df)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 75, "text": [ "52" ] } ], "prompt_number": 75 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: Why are the results below different? Please explain" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print df.P0010001.sum()\n", "print\n", "print df.P0010001.astype(int).sum()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "477973671023163920172915918372539565029196357409789793460172318801310968765313603011567582128306326483802304635528531184339367453337213283615773552654762998836405303925296729759889279894151826341270055113164708791894205917919378102953548367259111536504375135138310741270237910525674625364814180634610525145561276388562574180010246724540185299456869865636263725789\n", "\n", "312471327\n" ] } ], "prompt_number": 84 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: Describe the output of the following:\n", "\n", " df.P0010001 = df.P0010001.astype(int)\n", " df[['NAME','P0010001']].sort('P0010001', ascending=False).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: After running:\n", "\n", " df.set_index('NAME', inplace=True)\n", "\n", "how would you access the Series for the state of Alaska?\n", "\n", "1. df['Alaska']\n", "1. df[1]\n", "1. df.ix['Alaska']\n", "1. df[df['NAME'] == 'Alaska']" ] }, { "cell_type": "code", "collapsed": false, "input": [ "np.in1d([ s.fips for s in states.STATES], df.state)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 90, "text": [ "array([ True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True], dtype=bool)" ] } ], "prompt_number": 90 }, { "cell_type": "code", "collapsed": false, "input": [ "df[np.in1d(df.state, [ s.fips for s in states.STATES])]" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NAMEP0010001state
0 Alabama 4779736 01
1 Alaska 710231 02
2 Arizona 6392017 04
3 Arkansas 2915918 05
4 California 37253956 06
5 Colorado 5029196 08
6 Connecticut 3574097 09
7 Delaware 897934 10
8 District of Columbia 601723 11
9 Florida 18801310 12
10 Georgia 9687653 13
11 Hawaii 1360301 15
12 Idaho 1567582 16
13 Illinois 12830632 17
14 Indiana 6483802 18
15 Iowa 3046355 19
16 Kansas 2853118 20
17 Kentucky 4339367 21
18 Louisiana 4533372 22
19 Maine 1328361 23
20 Maryland 5773552 24
21 Massachusetts 6547629 25
22 Michigan 9883640 26
23 Minnesota 5303925 27
24 Mississippi 2967297 28
25 Missouri 5988927 29
26 Montana 989415 30
27 Nebraska 1826341 31
28 Nevada 2700551 32
29 New Hampshire 1316470 33
30 New Jersey 8791894 34
31 New Mexico 2059179 35
32 New York 19378102 36
33 North Carolina 9535483 37
34 North Dakota 672591 38
35 Ohio 11536504 39
36 Oklahoma 3751351 40
37 Oregon 3831074 41
38 Pennsylvania 12702379 42
39 Rhode Island 1052567 44
40 South Carolina 4625364 45
41 South Dakota 814180 46
42 Tennessee 6346105 47
43 Texas 25145561 48
44 Utah 2763885 49
45 Vermont 625741 50
46 Virginia 8001024 51
47 Washington 6724540 53
48 West Virginia 1852994 54
49 Wisconsin 5686986 55
50 Wyoming 563626 56
\n", "

51 rows \u00d7 3 columns

\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 91, "text": [ " NAME P0010001 state\n", "0 Alabama 4779736 01\n", "1 Alaska 710231 02\n", "2 Arizona 6392017 04\n", "3 Arkansas 2915918 05\n", "4 California 37253956 06\n", "5 Colorado 5029196 08\n", "6 Connecticut 3574097 09\n", "7 Delaware 897934 10\n", "8 District of Columbia 601723 11\n", "9 Florida 18801310 12\n", "10 Georgia 9687653 13\n", "11 Hawaii 1360301 15\n", "12 Idaho 1567582 16\n", "13 Illinois 12830632 17\n", "14 Indiana 6483802 18\n", "15 Iowa 3046355 19\n", "16 Kansas 2853118 20\n", "17 Kentucky 4339367 21\n", "18 Louisiana 4533372 22\n", "19 Maine 1328361 23\n", "20 Maryland 5773552 24\n", "21 Massachusetts 6547629 25\n", "22 Michigan 9883640 26\n", "23 Minnesota 5303925 27\n", "24 Mississippi 2967297 28\n", "25 Missouri 5988927 29\n", "26 Montana 989415 30\n", "27 Nebraska 1826341 31\n", "28 Nevada 2700551 32\n", "29 New Hampshire 1316470 33\n", "30 New Jersey 8791894 34\n", "31 New Mexico 2059179 35\n", "32 New York 19378102 36\n", "33 North Carolina 9535483 37\n", "34 North Dakota 672591 38\n", "35 Ohio 11536504 39\n", "36 Oklahoma 3751351 40\n", "37 Oregon 3831074 41\n", "38 Pennsylvania 12702379 42\n", "39 Rhode Island 1052567 44\n", "40 South Carolina 4625364 45\n", "41 South Dakota 814180 46\n", "42 Tennessee 6346105 47\n", "43 Texas 25145561 48\n", "44 Utah 2763885 49\n", "45 Vermont 625741 50\n", "46 Virginia 8001024 51\n", "47 Washington 6724540 53\n", "48 West Virginia 1852994 54\n", "49 Wisconsin 5686986 55\n", "50 Wyoming 563626 56\n", "\n", "[51 rows x 3 columns]" ] } ], "prompt_number": 91 } ], "metadata": {} } ] }