{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "# our usual pylab import\n", "%pylab --no-import-all inline" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] } ], "prompt_number": 2 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Goal" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For background, see [Mapping Census Data](http://www.udel.edu/johnmack/frec682/census/), including the \n", "[scan of the 10-question form](http://www.udel.edu/johnmack/frec682/census/census_form.png). Keep in mind what people were asked and the range of data available in the census.\n", "\n", "Using the census API to get an understanding of some of the geographic entities in the **2010 census**. We'll specifically be using the variable `P0010001`, the total population. \n", "\n", "What you will do in this notebook:\n", "\n", " * Sum the population of the **states** (or state-like entity like DC) to get the total population of the **nation**\n", " * Add up the **counties** for each **state** and validate the sums\n", " * Add up the **census tracts** for each **county** and validate the sums\n", " \n", "We will make use of `pandas` in this notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I often have the following [diagram](http://www.census.gov/geo/reference/pdfs/geodiagram.pdf) in mind to help understand the relationship among entities. Also use the [list of example URLs](http://api.census.gov/data/2010/sf1/geo.html) -- it'll come in handy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"Census" ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Working out the geographical hierarchy for Cafe Milano" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's helpful to have a concrete instance of a place to work with, especially when dealing with rather intangible entities like census tracts, block groups, and blocks. You can use the [American FactFinder](http://factfinder2.census.gov/faces/nav/jsf/pages/index.xhtml) site to look up for any given US address the corresponding census geographies. \n", "\n", "Let's use Cafe Milano in Berkeley as an example. You can verify the following results by typing in the address into http://factfinder2.census.gov/faces/nav/jsf/pages/searchresults.xhtml?refresh=t. \n", "\n", "https://www.evernote.com/shard/s1/sh/dc0bfb96-4965-4fbf-bc28-c9d4d0080782/2bd8c92a045d62521723347d62fa2b9d\n", "\n", "2522 Bancroft Way, BERKELEY, CA, 94704\n", "\n", "* State: California\n", "* County: Alameda County\n", "* County Subdivision: Berkeley CCD, Alameda County, California\n", "* Census Tract: Census Tract 4228, Alameda County, California\n", "* Block Group: Block Group 1, Census Tract 4228, Alameda County, California\n", "* Block: Block 1001, Block Group 1, Census Tract 4228, Alameda County, California\n", "\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# YouTube video I made on how to use the American Factfinder site to look up addresses\n", "from IPython.display import YouTubeVideo\n", "YouTubeVideo('HeXcliUx96Y')" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", " \n", " " ], "metadata": {}, "output_type": "pyout", "prompt_number": 1, "text": [ "" ] } ], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "# standard numpy, pandas, matplotlib imports\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from pandas import DataFrame, Series, Index\n", "import pandas as pd" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "# check that CENSUS_KEY is defined\n", "import census\n", "import us\n", "\n", "import requests\n", "\n", "import settings\n", "assert settings.CENSUS_KEY is not None" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The census documentation has example URLs but needs your API key to work. In this notebook, we'll use the IPython notebook HTML display mechanism to help out.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "c = census.Census(key=settings.CENSUS_KEY)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: we can use `c.sf1` to access 2010 census (SF1: Census Summary File 1 (2010, 2000, 1990) available in API -- 2010 is the default)\n", "\n", "see documentation: [sunlightlabs/census](https://github.com/sunlightlabs/census)" ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Summing up populations by state " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make a `DataFrame` named `states_df` with columns `NAME`, `P0010001` (for population), and `state` (to hold the FIPS code). **Make sure to exclude Puerto Rico.**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# call the API and instantiate `df`\n", "df = DataFrame(c.sf1.get('NAME,P0010001', geo={'for':'state:*'}))\n", "# convert the population to integer\n", "df['P0010001'] = df['P0010001'].astype(np.int)\n", "df.head()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can filter Puerto Rico (PR) in a number of ways -- use the way you're most comfortable with. \n", "\n", "Optional fun: filter PR in the following way\n", "\n", "* calculate a `np.array` holding the the fips of the states\n", "* then use [numpy.in1d](http://docs.scipy.org/doc/numpy/reference/generated/numpy.in1d.html), which is a analogous to the [in](http://stackoverflow.com/a/3437130/7782) operator to test membership in a list" ] }, { "cell_type": "code", "collapsed": false, "input": [ "## FILL IN\n", "## calculate states_fips so that PR not included\n", "\n", "\n", "\n", "\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If `states_df` is calculated properly, the following asserts will pass silently." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# check that we have three columns\n", "assert set(states_df.columns) == set((u'NAME', u'P0010001', u'state'))\n", "\n", "# check that the total 2010 census population is correct\n", "assert np.sum(states_df.P0010001) == 308745538 \n", "\n", "# check that the number of states+DC is 51\n", "assert len(states_df) == 51" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Counties" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at http://api.census.gov/data/2010/sf1/geo.html, we see\n", "\n", " state-county: http://api.census.gov/data/2010/sf1?get=P0010001&for=county:*\n", " \n", "if we want to grab all counties in one go, or you can grab counties state-by-state:\n", "\n", " http://api.census.gov/data/2010/sf1?get=P0010001&for=county:*&in=state:06\n", " \n", "for all counties in the state with FIPS code `06` (which is what state?)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Here's a way to use translate \n", "# http://api.census.gov/data/2010/sf1?get=P0010001&for=county:*\n", "# into a call using the census.Census object\n", "\n", "r = c.sf1.get('NAME,P0010001', geo={'for':'county:*'})\n", "\n", "# ask yourself what len(r) means and what it should be\n", "len(r)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# let's try out one of the `census` object convenience methods\n", "# instead of using `c.sf1.get`\n", "\n", "r = c.sf1.state_county('NAME,P0010001',census.ALL,census.ALL)\n", "r" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# convert the json from the API into a DataFrame\n", "# coerce to integer the P0010001 column\n", "\n", "df = DataFrame(r)\n", "df['P0010001'] = df['P0010001'].astype('int')\n", "\n", "# display the first records\n", "df.head()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# calculate the total population \n", "# what happens when you google the number you get?\n", "\n", "np.sum(df['P0010001'])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# often you can use dot notation to access a DataFrame column\n", "df.P0010001.head()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "## FILL IN\n", "## compute counties_df\n", "## counties_df should have same columns as df\n", "## filter out PR -- what's the total population now\n", "\n", "\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check properties of `counties_df`" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# number of counties\n", "assert len(counties_df) == 3143 #3143 county/county-equivs in US" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# check that the total population by adding all counties == population by adding all states\n", "\n", "assert np.sum(counties_df['P0010001']) == np.sum(states_df.P0010001)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# check we have same columns between counties_df and df\n", "set(counties_df.columns) == set(df.columns)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Using FIPS code as the Index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From [Mapping Census Data](http://www.udel.edu/johnmack/frec682/census/):\n", "\n", "* Each state (SUMLEV = 040) has a 2-digit FIPS ID; Delaware's is 10.\n", "* Each county (SUMLEV = 050) within a state has a 3-digit FIPS ID, appended to the 2-digit state ID. New Castle County, Delaware, has FIPS ID 10003.\n", "* Each Census Tract (SUMLEV = 140) within a county has a 6-digit ID, appended to the county code. The Tract in New Castle County DE that contains most of the the UD campus has FIPS ID 10003014502.\n", "* Each Block Group (SUMLEV = 150) within a Tract has a single digit ID appended to the Tract ID. The center of campus in the northwest corner of the tract is Block Group100030145022.\n", "* Each Block (SUMLEV = 750) within a Block Group is identified by three more digits appended to the Block Group ID. Pearson Hall is located in Block 100030145022009." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# take a look at the current structure of counties_df\n", "\n", "counties_df.head()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you add all the counties on a state-by-state basisi, do you get the same populations for each state?\n", "\n", "* use `set_index` to make the FIPS code for the state the index for `states_df`\n", "* calculate the FIPS code for the counties and make the county FIPS code the index of of `county_fips`\n", "* use groupby on `counties_df` to compare the populations of states with that in `states_df`" ] }, { "cell_type": "code", "collapsed": false, "input": [ "## FILL IN\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Counties in California" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at home: California state and Alameda County" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# boolean indexing to pull up California\n", "states_df[states_df.NAME == 'California']" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# use .ix -- most general indexing \n", "# http://pandas.pydata.org/pandas-docs/dev/indexing.html#different-choices-for-indexing-loc-iloc-and-ix\n", "states_df.ix['06']" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# California counties\n", "\n", "counties_df[counties_df.state=='06']" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "counties_df[counties_df.NAME == 'Alameda County']" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "counties_df[counties_df.NAME == 'Alameda County']['P0010001']" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Different ways to read off the population of Alameda County -- still looking for the best way" ] }, { "cell_type": "code", "collapsed": false, "input": [ "counties_df[counties_df.NAME == 'Alameda County']['P0010001'].to_dict().values()[0]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "list(counties_df[counties_df.NAME == 'Alameda County']['P0010001'].iteritems())[0][1]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "int(counties_df[counties_df.NAME == 'Alameda County']['P0010001'].values)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you know the FIPS code for Alameda County, just read off the population using `.ix`" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# this is like accessing a cell in a spreadsheet -- row, col\n", "\n", "ALAMEDA_COUNTY_FIPS = '06001'\n", "\n", "counties_df.ix[ALAMEDA_COUNTY_FIPS,'P0010001']" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Reading off all the tracts in Alameda County" ] }, { "cell_type": "code", "collapsed": false, "input": [ "counties_df.ix[ALAMEDA_COUNTY_FIPS,'county']" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "## FILL IN \n", "## generate a DataFrame named alameda_county_tracts_df by\n", "## calling the census api and the state-county-tract technique\n", "\n", "## how many census tracts in Alameda County?\n", "## if you add up the population, what do you get?\n", "## generate the FIPS code for each tract\n", "\n", "\n", "\n", "\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "\n", "# confirm that you can find the census tract in which Cafe Milano is located\n", "# Cafe Milano is in tract 4228\n", "\n", "MILANO_TRACT_ID = '422800'\n", "alameda_county_tracts_df[alameda_county_tracts_df.tract==MILANO_TRACT_ID]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Using Generators to yield all the tracts in the country" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "http://www.jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/" ] }, { "cell_type": "code", "collapsed": false, "input": [ "## FILL IN\n", "## try to reproduce the generator I show in class for all the census tracts\n", "## start to think about how to do this for other geographical entities\n", "\n", "import time\n", "import us\n", "\n", "from itertools import islice\n", "\n", "def census_tracts(variable=('NAME','P0010001'), sleep_time=1.0):\n", " \n", " for state in us.states.STATES:\n", " print state\n", " for tract in c.sf1.get(variable, \n", " geo={'for':\"tract:*\", \n", " 'in':'state:{state_fips}'.format(state_fips=state.fips)\n", " }):\n", " yield tract\n", " # don't hit the API more than once a second \n", " time.sleep(sleep_time)\n", " \n", "# limit the number of tracts we crawl for until we're reading to get all of them \n", "tracts_df = DataFrame(list(islice(census_tracts(), 100)))\n", "tracts_df['P0010001'] = tracts_df['P0010001'].astype('int')\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "tracts_df.head()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "## EXERCISE for next time\n", "## write a generator all census places\n", "\n", "\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Compare with Tabulations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can compare the total number of tracts we calculate to:\n", "\n", "https://www.census.gov/geo/maps-data/data/tallies/tractblock.html\n", "\n", "and\n", "\n", "https://www.census.gov/geo/maps-data/data/docs/geo_tallies/Tract_Block2010.txt" ] } ], "metadata": {} } ] }