{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Census ACS\n", "\n", "Retrieves data from the Census Bureau's American Community Survey 5-year series API for zctas, pumas, and tracts. A specific list of census variables is passed into the script, which are retrieved from the four ACS profile tables. Variables must be retrieved in chunks because only 50 can be passed to the API at a time, and each url varies by geography and retreives them in different combinations. After some processing output is written to a SQLite database. An option to create a metadata table appears at the bottom, but should only be run once for a given extract (acs1 and acs2) and not for each individual geography.\n", "\n", "https://www.census.gov/data/developers/data-sets/acs-5year.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Variables" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os, requests, json, sqlite3, random, pandas as pd, numpy as np\n", "from IPython.display import clear_output" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "keyfile='census_key.txt'\n", "\n", "#API variables - UPDATE THE YEAR AND GEO\n", "year='2018'\n", "geo='zip code tabulation area' # 'zip code tabulation area' or 'public use microdata area' or 'tract'\n", "state='36'\n", "dsource='acs'\n", "dsource2='acs5'\n", "dname='profile'\n", "\n", "#Variables to read in from spreadsheet - UPDATE WORKSHEET\n", "worksheet='acs2' # 'acs1' or 'acs2'\n", "geoexcelsheet={'zip code tabulation area':'zctas', 'public use microdata area':'pumas', 'tract':'tracts'}\n", "geotype=geoexcelsheet.get(geo)\n", "\n", "#SQL output\n", "tabname='{}_{}{}'.format(geotype,year,worksheet)\n", "dbname=os.path.join('outputs','testdb.sqlite')\n", "\n", "#Dump files for api data storage\n", "jsonpath=os.path.join('outputs', tabname+'_retrieved_data.json')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Variable Lists\n", "Get full list of variables from the API, read in our retrieval list, and compare the varianle IDs and names to make sure nothing is missing and that nothing has changed since the last iteration. *Don't move on to the next block until both lists match.* Lastly, read in list of geographies." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('DP05_0086MA',\n", " {'label': 'Annotation of Margin of Error!!Total housing units',\n", " 'predicateType': 'string',\n", " 'group': 'DP05',\n", " 'limit': 0,\n", " 'predicateOnly': True}),\n", " ('DP02_0011PM',\n", " {'label': 'Percent Margin of Error!!HOUSEHOLDS BY TYPE!!Total households!!Nonfamily households!!Householder living alone',\n", " 'concept': 'SELECTED SOCIAL CHARACTERISTICS IN THE UNITED STATES',\n", " 'predicateType': 'float',\n", " 'group': 'DP02',\n", " 'limit': 0,\n", " 'predicateOnly': True})]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "datadict={}\n", "dps=['DP02','DP03','DP04','DP05']\n", "for p in dps:\n", " vars_url = f'https://api.census.gov/data/{year}/{dsource}/{dsource2}/{dname}/groups/{p}.json'\n", " response=requests.get(vars_url)\n", " var_data=response.json()\n", " datadict.update(var_data['variables'])\n", "random.sample(datadict.items(), 2)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
db_varcensus_varcensus_labeldtype
0HOC01_EDP04_0001EEstimate!!HOUSING OCCUPANCY!!Total housing unitsint
1HOC01_MDP04_0001MMargin of Error!!HOUSING OCCUPANCY!!Total hous...int
2HOC01_PCDP04_0001PEPercent Estimate!!HOUSING OCCUPANCY!!Total hou...int
3HOC01_PMDP04_0001PMPercent Margin of Error!!HOUSING OCCUPANCY!!To...int
4HOC02_EDP04_0002EEstimate!!HOUSING OCCUPANCY!!Total housing uni...int
\n", "
" ], "text/plain": [ " db_var census_var census_label \\\n", "0 HOC01_E DP04_0001E Estimate!!HOUSING OCCUPANCY!!Total housing units \n", "1 HOC01_M DP04_0001M Margin of Error!!HOUSING OCCUPANCY!!Total hous... \n", "2 HOC01_PC DP04_0001PE Percent Estimate!!HOUSING OCCUPANCY!!Total hou... \n", "3 HOC01_PM DP04_0001PM Percent Margin of Error!!HOUSING OCCUPANCY!!To... \n", "4 HOC02_E DP04_0002E Estimate!!HOUSING OCCUPANCY!!Total housing uni... \n", "\n", " dtype \n", "0 int \n", "1 int \n", "2 int \n", "3 int \n", "4 int " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfexcel = pd.read_excel(os.path.join('inputs','acs_variables.xlsx'),sheet_name=worksheet)\n", "dfexcel.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are an equal number of variables in both lists: 236\n" ] } ], "source": [ "dfvars = pd.DataFrame.from_dict(datadict,columns=['label'],orient='index')\n", "dfvars_selected=dfvars.loc[dfvars.index.isin(dfexcel['census_var'])]\n", "dfvars_count=len(dfvars_selected)\n", "dfexcel_count=len(dfexcel['census_var'])\n", "\n", "if dfvars_count==dfexcel_count:\n", " print('There are an equal number of variables in both lists:', dfvars_count)\n", "else:\n", " print('There is a mismatch in the number of variables; the api has,', dfvars_count, \n", " 'while the original list has',dfexcel_count,'. Missing:')\n", " nomatch=dfexcel[~dfexcel['census_var'].isin(dfvars_selected.index)]\n", " print(nomatch)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "All labels match\n" ] } ], "source": [ "mismatch=dfexcel[~dfexcel['census_label'].isin(dfvars_selected['label'])]\n", "\n", "if len (mismatch) ==0:\n", " print('All labels match')\n", "else:\n", " compare=pd.merge(mismatch,dfvars_selected, left_on='census_var', right_on=dfvars_selected.index)\n", " misfile=os.path.join('outputs','{}_mismatch.csv'.format(worksheet))\n", " compare.to_csv(misfile, columns=['db_var', 'census_var', 'census_label', 'label'],sep=',', index=False,\n", " header=['db_var','census_var' ,'oldlabel','newlabel'])\n", " print('* Mismatch file printed to outputs folder * \\n')\n", " print('These labels do not match:')\n", " print(compare[['census_var','census_label','label']])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of geographic indetifiers: 215\n" ] } ], "source": [ "# Geographic indetifiers: zctas to retrieve, pumas to filter by, and counties containing tracts to retrieve\n", "excelgeo = pd.read_excel(os.path.join('inputs','acs_variables.xlsx'),sheet_name=geotype, dtype=object)\n", "geoids = excelgeo['GEO'].tolist()\n", "print('Number of geographic indetifiers:',len(geoids))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Retrieve Data\n", "Given the large number of variables in the ACS and limits of the API, variables must be passed to the url in separate blocks or chunks. The first chunk that's captured is written to an empty datalist; the header row and then one row for each geography. Each subsequent chunk is iterated through by row, so each row is appended to the correct row in datalist. In all cases, the last values, identifiers automatically returned with each API call, are not appended." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def chunks(l, n):\n", " # For item i in a range that is a length of l,\n", " for i in range(0, len(l), n):\n", " # Create an index range for l of n items:\n", " yield l[i:i+n]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of chunks: 6\n" ] } ], "source": [ "reqvars=list(chunks(dfvars_selected.index.tolist(),46))\n", "reqvars[0].insert(0,'NAME')\n", "reqvars[0].insert(0,'GEO_ID')\n", "print('Number of chunks:',len(reqvars))" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://api.census.gov/data/2018/acs/acs5/profile'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with open(keyfile) as key:\n", " api_key=key.read().strip()\n", "\n", "base_url = f'https://api.census.gov/data/{year}/{dsource}/{dsource2}/{dname}'\n", "base_url" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "#Function for retrieving data; running this block loads it into memory\n", "#Different geographies have different urls, \n", "#and a different number of identifiers tacked on to the end of each request\n", "\n", "def getdata():\n", " dlist=[]\n", " for i, v in enumerate(reqvars):\n", " batchcols=','.join(v)\n", " if geotype=='zctas':\n", " data_url = f'{base_url}?get={batchcols}&for={geo}:{g}&key={api_key}'\n", " dropvar=-1\n", " elif geotype=='pumas':\n", " data_url = f'{base_url}?get={batchcols}&for={geo}:*&in=state:{state}&key={api_key}'\n", " dropvar=-2\n", " elif geotype=='tracts':\n", " data_url = f'{base_url}?get={batchcols}&for={geo}:*&in=state:{state}&in=county:{county}&key={api_key}'\n", " dropvar=-3\n", " else:\n", " print('Appropriate geography not specified in variables block')\n", " break \n", " response=requests.get(data_url)\n", " if response.status_code==200:\n", " clear_output(wait=True)\n", " data=response.json()\n", " for i2, v2 in enumerate(data):\n", " if i == 0:\n", " dlist.append(v2[:dropvar])\n", " else:\n", " for item in v2[:dropvar]:\n", " dlist[i2].append(item)\n", " else:\n", " print('***Problem with retrieval***, response code',response.status_code)\n", " return dlist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### ***THIS BLOCK IS A REQUESTS BLOCK!***\n", "*NOTE: ZCTA retrieval takes a long time - 15 mins for 215 ZCTAs*" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Retrieved data for 11697\n", "Retrieved 216 records and 51408 data points with 238.0 points for each record...\n", "Done - Data dumped to json file\n" ] } ], "source": [ "#If this block was run successfully for a given table and geography don't rerun - next block pulls from saved json\n", "datalist=[]\n", "if geotype=='zctas':\n", " for g in geoids:\n", " georecord=getdata()\n", " print('Retrieved data for',g)\n", " if len(datalist)==0:\n", " datalist.append(georecord[0])\n", " datalist.append(georecord[1])\n", " else:\n", " datalist.append(georecord[1])\n", "elif geotype=='pumas':\n", " datalist=getdata()\n", "elif geotype=='tracts':\n", " for county in geoids:\n", " georecord=getdata()\n", " print('Retrieved data for',county)\n", " if len(datalist)==0:\n", " for tract in georecord:\n", " datalist.append(tract)\n", " else:\n", " for tract in georecord[1:]:\n", " 
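{ "cell_type": "markdown", "metadata": {}, "source": [ "Optional: a minimal sketch of a more defensive request helper for long retrievals. It is not wired into getdata() above; get_with_retries, max_tries, and timeout are illustrative names, and only the standard requests and time libraries are assumed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import time\n", "\n", "def get_with_retries(url, max_tries=3, timeout=60):\n", "    #Sketch: retry a GET a few times with a growing pause between attempts\n", "    for attempt in range(max_tries):\n", "        try:\n", "            response = requests.get(url, timeout=timeout)\n", "            if response.status_code == 200:\n", "                return response\n", "            print('Response code', response.status_code, '- retrying...')\n", "        except requests.exceptions.RequestException as e:\n", "            print('Request failed:', e)\n", "        time.sleep(5 * (attempt + 1))\n", "    return None" ] },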
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Retrieved data for 11697\n", "Retrieved 216 records and 51408 data points with 238.0 points for each record...\n", "Done - Data dumped to json file\n" ] } ], "source": [ "#If this block was run successfully for a given table and geography, don't rerun - the next block pulls from the saved json\n", "datalist=[]\n", "if geotype=='zctas':\n", "    for g in geoids:\n", "        georecord=getdata()\n", "        print('Retrieved data for',g)\n", "        if len(datalist)==0:\n", "            datalist.append(georecord[0])\n", "            datalist.append(georecord[1])\n", "        else:\n", "            datalist.append(georecord[1])\n", "elif geotype=='pumas':\n", "    datalist=getdata()\n", "elif geotype=='tracts':\n", "    for county in geoids:\n", "        georecord=getdata()\n", "        print('Retrieved data for',county)\n", "        if len(datalist)==0:\n", "            for tract in georecord:\n", "                datalist.append(tract)\n", "        else:\n", "            for tract in georecord[1:]:\n", "                datalist.append(tract)\n", "\n", "dlrows=len(datalist)\n", "dlitems=sum(len(x) for x in datalist)\n", "dlbyrow=dlitems / dlrows\n", "print('Retrieved', dlrows, 'records and', dlitems, 'data points with', dlbyrow, 'points for each record...')\n", "\n", "with open(jsonpath, 'w') as f:\n", "    json.dump(datalist, f)\n", "print('Done - Data dumped to json file')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Process Data\n", "Replace footnote values with nulls, create a new GEOID2 column, and replace census variable names with database variable names." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "Index: 215 entries, 8600000US10001 to 8600000US11697\n", "Columns: 237 entries, GEOLABEL to DP05_0087PM\n", "dtypes: object(237)\n", "memory usage: 399.8+ KB\n" ] } ], "source": [ "with open(jsonpath, 'r') as f:\n", "    jsondata=json.load(f)\n", "alldata = pd.DataFrame(jsondata[1:],columns=jsondata[0],dtype=object).rename(columns={\n", "    'GEO_ID':'GEOID','NAME':'GEOLABEL'}).set_index('GEOID')\n", "alldata.info()\n", "# Index and column counts should each be 1 less than the previous counts (the header row and index column are excluded)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n",
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GEOLABELDP04_0001EDP04_0001MDP04_0001PEDP04_0001PMDP04_0002EDP04_0002MDP04_0002PEDP04_0002PMDP04_0003E...DP05_0082PEDP05_0082PMDP05_0083EDP05_0083MDP05_0083PEDP05_0083PMDP05_0087EDP05_0087MDP05_0087PEDP05_0087PM
GEOID
8600000US10001ZCTA5 100011414153914141-8888888881243152187.92.21710...0.50.53091331.30.61744198117441-888888888
8600000US10002ZCTA5 100023572445335724-8888888883354061493.91.22184...0.20.215164452.00.653042171553042-888888888
8600000US10003ZCTA5 100033087637830876-8888888882612470384.62.14752...0.30.210743042.00.644049156044049-888888888
\n", "

3 rows × 237 columns

\n", "
" ], "text/plain": [ " GEOLABEL DP04_0001E DP04_0001M DP04_0001PE DP04_0001PM \\\n", "GEOID \n", "8600000US10001 ZCTA5 10001 14141 539 14141 -888888888 \n", "8600000US10002 ZCTA5 10002 35724 453 35724 -888888888 \n", "8600000US10003 ZCTA5 10003 30876 378 30876 -888888888 \n", "\n", " DP04_0002E DP04_0002M DP04_0002PE DP04_0002PM DP04_0003E ... \\\n", "GEOID ... \n", "8600000US10001 12431 521 87.9 2.2 1710 ... \n", "8600000US10002 33540 614 93.9 1.2 2184 ... \n", "8600000US10003 26124 703 84.6 2.1 4752 ... \n", "\n", " DP05_0082PE DP05_0082PM DP05_0083E DP05_0083M DP05_0083PE \\\n", "GEOID \n", "8600000US10001 0.5 0.5 309 133 1.3 \n", "8600000US10002 0.2 0.2 1516 445 2.0 \n", "8600000US10003 0.3 0.2 1074 304 2.0 \n", "\n", " DP05_0083PM DP05_0087E DP05_0087M DP05_0087PE DP05_0087PM \n", "GEOID \n", "8600000US10001 0.6 17441 981 17441 -888888888 \n", "8600000US10002 0.6 53042 1715 53042 -888888888 \n", "8600000US10003 0.6 44049 1560 44049 -888888888 \n", "\n", "[3 rows x 237 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alldata.head(3)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "#This is a lousy solution, come up with something better in the future\n", "footnotes=['-999999999','-999999999.0', '-999999999.00',\n", " '-888888888','-888888888.0', '-888888888.00',\n", " '-666666666','-666666666.0', '-666666666.00',\n", " '-555555555','-555555555.0', '-555555555.00',\n", " '-333333333','-333333333.0', '-333333333.00',\n", " '-222222222','-222222222.0', '-222222222.00']\n", "alldata.replace(footnotes,np.nan,inplace=True)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GEOID2GEOLABELDP04_0001EDP04_0001MDP04_0001PEDP04_0001PMDP04_0002EDP04_0002MDP04_0002PEDP04_0002PM...DP05_0082PEDP05_0082PMDP05_0083EDP05_0083MDP05_0083PEDP05_0083PMDP05_0087EDP05_0087MDP05_0087PEDP05_0087PM
GEOID
8600000US1000110001ZCTA5 100011414153914141NaN1243152187.92.2...0.50.53091331.30.61744198117441NaN
8600000US1000210002ZCTA5 100023572445335724NaN3354061493.91.2...0.20.215164452.00.653042171553042NaN
8600000US1000310003ZCTA5 100033087637830876NaN2612470384.62.1...0.30.210743042.00.644049156044049NaN
\n", "

3 rows × 238 columns

\n", "
" ], "text/plain": [ " GEOID2 GEOLABEL DP04_0001E DP04_0001M DP04_0001PE \\\n", "GEOID \n", "8600000US10001 10001 ZCTA5 10001 14141 539 14141 \n", "8600000US10002 10002 ZCTA5 10002 35724 453 35724 \n", "8600000US10003 10003 ZCTA5 10003 30876 378 30876 \n", "\n", " DP04_0001PM DP04_0002E DP04_0002M DP04_0002PE DP04_0002PM \\\n", "GEOID \n", "8600000US10001 NaN 12431 521 87.9 2.2 \n", "8600000US10002 NaN 33540 614 93.9 1.2 \n", "8600000US10003 NaN 26124 703 84.6 2.1 \n", "\n", " ... DP05_0082PE DP05_0082PM DP05_0083E DP05_0083M DP05_0083PE \\\n", "GEOID ... \n", "8600000US10001 ... 0.5 0.5 309 133 1.3 \n", "8600000US10002 ... 0.2 0.2 1516 445 2.0 \n", "8600000US10003 ... 0.3 0.2 1074 304 2.0 \n", "\n", " DP05_0083PM DP05_0087E DP05_0087M DP05_0087PE DP05_0087PM \n", "GEOID \n", "8600000US10001 0.6 17441 981 17441 NaN \n", "8600000US10002 0.6 53042 1715 53042 NaN \n", "8600000US10003 0.6 44049 1560 44049 NaN \n", "\n", "[3 rows x 238 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "idxgeoid2={'zctas':-5, 'pumas':-7,'tracts':-11}\n", "alldata.insert(loc=0, column='GEOID2',value=alldata.index.str[idxgeoid2.get(geotype):])\n", "alldata.head(3)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(215, 238)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# For PUMAS filter all the geotype for the state by local areas\n", "if geotype == 'pumas':\n", " acsdata=alldata.loc[alldata.GEOID2.isin(geoids)].copy().astype(object).sort_index()\n", "else:\n", " acsdata=alldata.copy().astype(object).sort_index()\n", "acsdata.shape" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('DP05_0001M', 'SXAG01_M'),\n", " ('DP05_0036PM', 'RACE02_PM'),\n", " ('DP05_0039M', 'RACE05_M'),\n", " ('DP04_0134E', 'RENT09_E'),\n", " ('DP05_0037PM', 'RACE03_PM')]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Dictionary of column names from the census and the nyc geodatabase\n", "cv_to_db=dict(zip(dfexcel.census_var, dfexcel.db_var))\n", "random.sample(cv_to_db.items(), 5)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GEOID2GEOLABELHOC01_EHOC01_MHOC01_PCHOC01_PMHOC02_EHOC02_MHOC02_PCHOC02_PM...HISL09_PCHISL09_PMHISL10_EHISL10_MHISL10_PCHISL10_PMVOTE01_EVOTE01_MVOTE01_PCVOTE01_PM
GEOID
8600000US1000110001ZCTA5 100011414153914141NaN1243152187.92.2...0.50.53091331.30.61744198117441NaN
8600000US1000210002ZCTA5 100023572445335724NaN3354061493.91.2...0.20.215164452.00.653042171553042NaN
8600000US1000310003ZCTA5 100033087637830876NaN2612470384.62.1...0.30.210743042.00.644049156044049NaN
\n", "

3 rows × 238 columns

\n", "
" ], "text/plain": [ " GEOID2 GEOLABEL HOC01_E HOC01_M HOC01_PC HOC01_PM HOC02_E \\\n", "GEOID \n", "8600000US10001 10001 ZCTA5 10001 14141 539 14141 NaN 12431 \n", "8600000US10002 10002 ZCTA5 10002 35724 453 35724 NaN 33540 \n", "8600000US10003 10003 ZCTA5 10003 30876 378 30876 NaN 26124 \n", "\n", " HOC02_M HOC02_PC HOC02_PM ... HISL09_PC HISL09_PM HISL10_E \\\n", "GEOID ... \n", "8600000US10001 521 87.9 2.2 ... 0.5 0.5 309 \n", "8600000US10002 614 93.9 1.2 ... 0.2 0.2 1516 \n", "8600000US10003 703 84.6 2.1 ... 0.3 0.2 1074 \n", "\n", " HISL10_M HISL10_PC HISL10_PM VOTE01_E VOTE01_M VOTE01_PC \\\n", "GEOID \n", "8600000US10001 133 1.3 0.6 17441 981 17441 \n", "8600000US10002 445 2.0 0.6 53042 1715 53042 \n", "8600000US10003 304 2.0 0.6 44049 1560 44049 \n", "\n", " VOTE01_PM \n", "GEOID \n", "8600000US10001 NaN \n", "8600000US10002 NaN \n", "8600000US10003 NaN \n", "\n", "[3 rows x 238 columns]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Rename the census variables to nyc geodatabase variables\n", "acsdata.rename(columns=cv_to_db,inplace=True)\n", "acsdata.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Write to Database\n", "Update list of variables and data types, build create table string, create datatable in temporary database.\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
db_varcensus_varcensus_labeldtype
0HOC01_EDP04_0001EEstimate - HOUSING OCCUPANCY - Total housing u...INTEGER
1HOC01_MDP04_0001MMargin of Error - HOUSING OCCUPANCY - Total ho...INTEGER
2HOC01_PCDP04_0001PEPercent Estimate - HOUSING OCCUPANCY - Total h...INTEGER
3HOC01_PMDP04_0001PMPercent Margin of Error - HOUSING OCCUPANCY - ...INTEGER
4HOC02_EDP04_0002EEstimate - HOUSING OCCUPANCY - Total housing u...INTEGER
\n", "
" ], "text/plain": [ " db_var census_var census_label \\\n", "0 HOC01_E DP04_0001E Estimate - HOUSING OCCUPANCY - Total housing u... \n", "1 HOC01_M DP04_0001M Margin of Error - HOUSING OCCUPANCY - Total ho... \n", "2 HOC01_PC DP04_0001PE Percent Estimate - HOUSING OCCUPANCY - Total h... \n", "3 HOC01_PM DP04_0001PM Percent Margin of Error - HOUSING OCCUPANCY - ... \n", "4 HOC02_E DP04_0002E Estimate - HOUSING OCCUPANCY - Total housing u... \n", "\n", " dtype \n", "0 INTEGER \n", "1 INTEGER \n", "2 INTEGER \n", "3 INTEGER \n", "4 INTEGER " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfexcel.replace({'dtype': {'int': 'INTEGER', 'float': 'REAL'}},inplace=True)\n", "dfexcel.census_label.replace({'!!': ' - '},inplace=True, regex=True)\n", "dfexcel.head()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('HISL09_PM',\n", " ['DP05_0082PM',\n", " 'Percent Margin of Error - HISPANIC OR LATINO AND RACE - Total population - Not Hispanic or Latino - Some other race alone',\n", " 'REAL']),\n", " ('HISL03_E',\n", " ['DP05_0076E',\n", " 'Estimate - HISPANIC OR LATINO AND RACE - Total population - Not Hispanic or Latino',\n", " 'INTEGER'])]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vardict=dfexcel.set_index('db_var').T.to_dict('list')\n", "random.sample(vardict.items(), 2)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "con = sqlite3.connect(dbname) \n", "cur = con.cursor()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "CREATE TABLE zctas_2018acs2 (\n", "GEOID TEXT,\n", "GEOID2 TEXT NOT NULL PRIMARY KEY,\n", "GEOLABEL TEXT,\n", "HOC01_E INTEGER, \n", "HOC01_M INTEGER, \n", "HOC01_PC INTEGER, \n", "HOC01_PM INTEGER, \n", "HOC02_E INTEGER, \n", "HOC02_M INTEGER, \n", "HOC02_PC REAL, \n", "HOC02_PM REAL, \n", "HOC03_E INTEGER, \n", "HOC03_M INTEGER, \n", "HOC03_PC REAL, \n", "HOC03_PM REAL, \n", "HOC04_E REAL, \n", "HOC04_M REAL, \n", "HOC04_PC INTEGER, \n", "HOC04_PM INTEGER, \n", "HOC05_E REAL, \n", "HOC05_M REAL, \n", "HOC05_PC INTEGER, \n", "HOC05_PM INTEGER, \n", "HTEN01_E INTEGER, \n", "HTEN01_M INTEGER, \n", "HTEN01_PC INTEGER, \n", "HTEN01_PM INTEGER, \n", "HTEN02_E INTEGER, \n", "HTEN02_M INTEGER, \n", "HTEN02_PC REAL, \n", "HTEN02_PM REAL, \n", "HTEN03_E INTEGER, \n", "HTEN03_M INTEGER, \n", "HTEN03_PC REAL, \n", "HTEN03_PM REAL, \n", "HVAL01_E INTEGER, \n", "HVAL01_M INTEGER, \n", "HVAL01_PC INTEGER, \n", "HVAL01_PM INTEGER, \n", "MORT01_E INTEGER, \n", "MORT01_M INTEGER, \n", "MORT01_PC INTEGER, \n", "MORT01_PM INTEGER, \n", "MORT02_E INTEGER, \n", "MORT02_M INTEGER, \n", "MORT02_PC REAL, \n", "MORT02_PM REAL, \n", "MORT03_E INTEGER, \n", "MORT03_M INTEGER, \n", "MORT03_PC REAL, \n", "MORT03_PM REAL, \n", "RENT01_E INTEGER, \n", "RENT01_M INTEGER, \n", "RENT01_PC INTEGER, \n", "RENT01_PM INTEGER, \n", "RENT02_E INTEGER, \n", "RENT02_M INTEGER, \n", "RENT02_PC REAL, \n", "RENT02_PM REAL, \n", "RENT03_E INTEGER, \n", "RENT03_M INTEGER, \n", "RENT03_PC REAL, \n", "RENT03_PM REAL, \n", "RENT04_E INTEGER, \n", "RENT04_M INTEGER, \n", "RENT04_PC REAL, \n", "RENT04_PM REAL, \n", "RENT05_E INTEGER, \n", "RENT05_M INTEGER, \n", "RENT05_PC REAL, \n", "RENT05_PM REAL, \n", "RENT06_E INTEGER, \n", "RENT06_M INTEGER, \n", "RENT06_PC REAL, \n", "RENT06_PM REAL, \n", "RENT07_E INTEGER, \n", "RENT07_M 
INTEGER, \n", "RENT07_PC REAL, \n", "RENT07_PM REAL, \n", "RENT08_E INTEGER, \n", "RENT08_M INTEGER, \n", "RENT08_PC REAL, \n", "RENT08_PM REAL, \n", "RENT09_E INTEGER, \n", "RENT09_M INTEGER, \n", "RENT09_PC INTEGER, \n", "RENT09_PM INTEGER, \n", "RENT10_E INTEGER, \n", "RENT10_M INTEGER, \n", "RENT10_PC INTEGER, \n", "RENT10_PM INTEGER, \n", "SXAG01_E INTEGER, \n", "SXAG01_M INTEGER, \n", "SXAG01_PC INTEGER, \n", "SXAG01_PM INTEGER, \n", "SXAG02_E INTEGER, \n", "SXAG02_M INTEGER, \n", "SXAG02_PC REAL, \n", "SXAG02_PM REAL, \n", "SXAG03_E INTEGER, \n", "SXAG03_M INTEGER, \n", "SXAG03_PC REAL, \n", "SXAG03_PM REAL, \n", "SXAG04_E INTEGER, \n", "SXAG04_M INTEGER, \n", "SXAG04_PC REAL, \n", "SXAG04_PM REAL, \n", "SXAG05_E INTEGER, \n", "SXAG05_M INTEGER, \n", "SXAG05_PC REAL, \n", "SXAG05_PM REAL, \n", "SXAG06_E INTEGER, \n", "SXAG06_M INTEGER, \n", "SXAG06_PC REAL, \n", "SXAG06_PM REAL, \n", "SXAG07_E INTEGER, \n", "SXAG07_M INTEGER, \n", "SXAG07_PC REAL, \n", "SXAG07_PM REAL, \n", "SXAG08_E INTEGER, \n", "SXAG08_M INTEGER, \n", "SXAG08_PC REAL, \n", "SXAG08_PM REAL, \n", "SXAG09_E INTEGER, \n", "SXAG09_M INTEGER, \n", "SXAG09_PC REAL, \n", "SXAG09_PM REAL, \n", "SXAG10_E INTEGER, \n", "SXAG10_M INTEGER, \n", "SXAG10_PC REAL, \n", "SXAG10_PM REAL, \n", "SXAG11_E INTEGER, \n", "SXAG11_M INTEGER, \n", "SXAG11_PC REAL, \n", "SXAG11_PM REAL, \n", "SXAG12_E INTEGER, \n", "SXAG12_M INTEGER, \n", "SXAG12_PC REAL, \n", "SXAG12_PM REAL, \n", "SXAG13_E INTEGER, \n", "SXAG13_M INTEGER, \n", "SXAG13_PC REAL, \n", "SXAG13_PM REAL, \n", "SXAG14_E INTEGER, \n", "SXAG14_M INTEGER, \n", "SXAG14_PC REAL, \n", "SXAG14_PM REAL, \n", "SXAG15_E INTEGER, \n", "SXAG15_M INTEGER, \n", "SXAG15_PC REAL, \n", "SXAG15_PM REAL, \n", "SXAG16_E INTEGER, \n", "SXAG16_M INTEGER, \n", "SXAG16_PC REAL, \n", "SXAG16_PM REAL, \n", "SXAG17_E REAL, \n", "SXAG17_M REAL, \n", "SXAG17_PC INTEGER, \n", "SXAG17_PM INTEGER, \n", "RACE01_E INTEGER, \n", "RACE01_M INTEGER, \n", "RACE01_PC INTEGER, \n", "RACE01_PM INTEGER, \n", "RACE02_E INTEGER, \n", "RACE02_M INTEGER, \n", "RACE02_PC REAL, \n", "RACE02_PM REAL, \n", "RACE03_E INTEGER, \n", "RACE03_M INTEGER, \n", "RACE03_PC REAL, \n", "RACE03_PM REAL, \n", "RACE04_E INTEGER, \n", "RACE04_M INTEGER, \n", "RACE04_PC REAL, \n", "RACE04_PM REAL, \n", "RACE05_E INTEGER, \n", "RACE05_M INTEGER, \n", "RACE05_PC REAL, \n", "RACE05_PM REAL, \n", "RACE06_E INTEGER, \n", "RACE06_M INTEGER, \n", "RACE06_PC REAL, \n", "RACE06_PM REAL, \n", "RACE07_E INTEGER, \n", "RACE07_M INTEGER, \n", "RACE07_PC REAL, \n", "RACE07_PM REAL, \n", "RACE08_E INTEGER, \n", "RACE08_M INTEGER, \n", "RACE08_PC REAL, \n", "RACE08_PM REAL, \n", "RACE09_E INTEGER, \n", "RACE09_M INTEGER, \n", "RACE09_PC REAL, \n", "RACE09_PM REAL, \n", "HISL01_E INTEGER, \n", "HISL01_M INTEGER, \n", "HISL01_PC INTEGER, \n", "HISL01_PM INTEGER, \n", "HISL02_E INTEGER, \n", "HISL02_M INTEGER, \n", "HISL02_PC REAL, \n", "HISL02_PM REAL, \n", "HISL03_E INTEGER, \n", "HISL03_M INTEGER, \n", "HISL03_PC REAL, \n", "HISL03_PM REAL, \n", "HISL04_E INTEGER, \n", "HISL04_M INTEGER, \n", "HISL04_PC REAL, \n", "HISL04_PM REAL, \n", "HISL05_E INTEGER, \n", "HISL05_M INTEGER, \n", "HISL05_PC REAL, \n", "HISL05_PM REAL, \n", "HISL06_E INTEGER, \n", "HISL06_M INTEGER, \n", "HISL06_PC REAL, \n", "HISL06_PM REAL, \n", "HISL07_E INTEGER, \n", "HISL07_M INTEGER, \n", "HISL07_PC REAL, \n", "HISL07_PM REAL, \n", "HISL08_E INTEGER, \n", "HISL08_M INTEGER, \n", "HISL08_PC REAL, \n", "HISL08_PM REAL, \n", "HISL09_E INTEGER, \n", "HISL09_M INTEGER, \n", "HISL09_PC 
REAL, \n", "HISL09_PM REAL, \n", "HISL10_E INTEGER, \n", "HISL10_M INTEGER, \n", "HISL10_PC REAL, \n", "HISL10_PM REAL, \n", "VOTE01_E INTEGER, \n", "VOTE01_M INTEGER, \n", "VOTE01_PC INTEGER, \n", "VOTE01_PM INTEGER);\n" ] } ], "source": [ "cur.execute('DROP TABLE IF EXISTS {}'.format(tabname))\n", "dbstring=\"\"\"\n", "CREATE TABLE {} (\n", "GEOID TEXT,\n", "GEOID2 TEXT NOT NULL PRIMARY KEY,\n", "GEOLABEL TEXT,\n", "\"\"\".format(tabname)\n", "\n", "#Append one 'column-name datatype' pair per variable, then trim the trailing comma and close the statement\n", "for k,v in vardict.items():\n", "    dbstring=dbstring+k+' '+v[2]+', \\n'\n", "\n", "dbstring=dbstring[:-3]\n", "dbstring=dbstring+');'\n", "print(dbstring)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cur.execute(dbstring)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "acsdata.to_sql(name=tabname, if_exists='append', index=True, con=con)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "215 records written to zctas_2018acs2\n" ] } ], "source": [ "cur.execute('SELECT COUNT(*) FROM {};'.format(tabname))\n", "rows = cur.fetchone()\n", "print(rows[0], 'records written to', tabname)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "239 columns written to zctas_2018acs2\n" ] } ], "source": [ "cur.execute('SELECT * FROM {} LIMIT 1;'.format(tabname))\n", "col_names = [cn[0] for cn in cur.description]\n", "print(len(col_names), 'columns written to', tabname)\n", "#Number should be the same as the column count of df acsdata plus 1, since the index is not included in the df count" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "con.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Metadata Table\n", "DO NOT RERUN THIS SECTION FOR MULTIPLE GEOGRAPHIES. In the NYC Geodatabase there is only one metadata table for all of the ACS tables (acs1 and acs2) for all geographies. For whichever geography is processed first, set the action variable to 'create' and run this entire series of blocks for the acs1 table. For the acs2 table, set the action variable to 'append' and skip the table creation and identifier insertion blocks."
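, "\n", "As a worked example of what the insertion block below writes (using the acs2 worksheet and the first housing variable seen in the outputs above), the vardict entry\n", "```python\n", "('HOC01_E', ['DP04_0001E', 'Estimate - HOUSING OCCUPANCY - Total housing units', 'INTEGER'])\n", "```\n", "becomes the metadata row\n", "```python\n", "('acs2', 'HOC01', 'HOUSING OCCUPANCY - Total housing units')\n", "```\n", "since mk[:-2] drops the _E suffix and mv[1][11:] slices the leading 'Estimate - ' off the label."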
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Change table name and specify an action - you're creating the table for the first time with acs1 variables, \n", "#or appending the tables with acs2 variables\n", "\n", "metatab='acslookup2018'\n", "action='append' # 'create' or 'append'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "con = sqlite3.connect(dbname) \n", "cur = con.cursor()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Only run this block when creating initial table\n", "if action=='create':\n", " mdstring=\"\"\"\n", " CREATE TABLE {} (\n", " tabnum TEXT,\n", " est_id TEXT,\n", " est_value TEXT);\n", " \"\"\".format(metatab)\n", " cur.execute(mdstring)\n", "else:\n", " print('Block not executed because \"create\" not selected as an action in earlier block')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Only run this block when creating initial table\n", "if action=='create':\n", " exstring=\"\"\"\n", " INSERT INTO {} VALUES('both','NOTE','Each variable has 4 values that are identified by a particular suffix: E for estimate, M for margin of error for the estimate, PC for percent total, and PM for margin of error for the percent total');\n", " INSERT INTO {} VALUES('both','GEOID','Id');\n", " INSERT INTO {} VALUES('both','GEOID2','Id2');\n", " INSERT INTO {} VALUES('both','GEOLABEL','Geography');\n", " \"\"\".format(metatab,metatab,metatab,metatab)\n", " cur.executescript(exstring)\n", " con.commit()\n", "else:\n", " print('Block not executed because \"create\" not selected as an action in earlier block')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Run when creating table or when appending records\n", "#Keys and values - db ids and labels - are simplified and truncated to 1 entry for each 4-column group (E,M,PC,PM)\n", "if action in ('create','append'):\n", " for mk, mv in vardict.items():\n", " if mk.endswith('_E'):\n", " cur.execute(\"INSERT INTO {} values(?,?,?)\".format(metatab),(worksheet,mk[:-2],mv[1][11:]))\n", " con.commit()\n", "else:\n", " print('Block not executed because action not specified in earlier block')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cur.execute('SELECT COUNT(*) FROM {};'.format(metatab))\n", "rows = cur.fetchone()\n", "print(rows[0], 'records in', metatab)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "action=''\n", "con.close()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }