{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ZBP to ZCTA\n", "Retrieves data from the Census Bureau's ZIP Code Business Patterns API for a specific area and summarizes it by ZCTA. Three tables are generated: one for employees that contains employment, establishments, and wages, one for industries that contains counts of establishments by 2-digit sector NAICS codes, and one reference table that correlates sector numbers and names. Initial data retrieved from API is written to json, final output is written to a SQLite database. \n", "\n", "https://www.census.gov/data/developers/data-sets/cbp-nonemp-zbp/zbp-api.html\n", "\n", "NOTE - Works for 2016 ZBP, but NOT for 2017 due to changes in the API and the underlying dataset\n", "Use zbp_to_zcta notebook for current data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Variables" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd, requests, sqlite3, os, json\n", "from IPython.display import clear_output" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Crosswalk files - update only if necessary\n", "uszips_file='Zip_to_ZCTA_crosswalk_2015_JSI.csv'\n", "zcta_file='geocorr14_modified.csv'\n", "\n", "uszips_path=os.path.join('inputs',uszips_file)\n", "zcta_path=zcta_file=os.path.join('inputs',zcta_file)\n", "\n", "#Dump files for api data storage\n", "ejsonpath=os.path.join('outputs', 'emp_data.json')\n", "ijsonpath=os.path.join('outputs', 'ind_data.json')\n", "cjsonpath=os.path.join('outputs', 'codes_data.json')\n", "\n", "#API variables - UPDATE THE YEAR\n", "keyfile='census_key.txt'\n", "\n", "year='2017'\n", "dsource='zbp'\n", "state='36'\n", "ecols='ESTAB,EMP,PAYQTR1,PAYANN'\n", "icols='ESTAB'\n", "ncodes=['00','11','21','22','23','31-33','42','44-45','48-49',\n", " '51','52','53','54','55','56','61','62','71','72','81',\n", " '99']\n", "\n", "#SQL output - UPDATE EACH TABLE NAME\n", "dbname=os.path.join('outputs','testdb.sqlite')\n", "emptable='zbp2017emp'\n", "indtable='zbp2017ind'\n", "codetable='zbp2017indcodes'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Crosswalking\n", "Read in the files that relate US ZIP codes to ZCTAs (from JSI) and\n", "ZCTAs to counties for local area (from MCDC Geocorr), then join them by \n", "ZIP Code to create a ZCTA to ZIP table for the local area" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "uszips=pd.read_csv(uszips_path, sep=',', dtype={'ZIP':str, 'ZCTA':str})\n", "uszips.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#All ZIP Codes in US\n", "uszips.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "zcta=pd.read_csv(zcta_path, sep=',', dtype={'zcta5':str, 'county14':str})\n", "zcta.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#ZCTAs in local area\n", "zcta.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Merge ZIP Codes with ZCTAs for local area\n", "zip2zcta = pd.merge(uszips[['ZIP','ZIP_TYPE','PO_NAME','ZCTA']],zcta[['zcta5','county14']],how='right', \n", " left_on='ZCTA', right_on='zcta5').set_index('ZIP')\n", "zip2zcta.drop(columns=['ZCTA'],inplace=True)\n", "zip2zcta.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#ZIP Codes in local area\n", "zip2zcta.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## API Call and Processing\n", "Request the ZBP data from the Census Bureau for the state, join the ZBP data to the local ZCTA to ZIP table based on ZIP Code, and group the data by ZCTA " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open(keyfile) as key:\n", " api_key=key.read().strip()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "base_url = f'https://api.census.gov/data/{year}/{dsource}'\n", "base_url" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ZBP Employment Data\n", "This data can be requested in a single api call - do not rerun the requests block if retrieval is successful but subsequent notebook changes are needed. Proceed to the next block and pull data from json dump file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### ***THIS BLOCK IS A REQUESTS BLOCK!*** " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Code 200 = success, do not rerun this block unless it's necessary\n", "edata_url = f'{base_url}?get={ecols}&for=zipcode:*&ST={state}&key={api_key}'\n", "response=requests.get(edata_url)\n", "if response.status_code==200:\n", " emp_data=response.json()\n", " with open(ejsonpath, 'w') as f:\n", " json.dump(emp_data, f)\n", " print('Data dumped to json file')\n", "else:\n", " print('Problem with retrieval, response code',response.status_code)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open(ejsonpath, 'r') as f:\n", " ejsondata=json.load(f)\n", "zbpemp=pd.DataFrame(ejsondata[1:], columns=ejsondata[0]).rename(columns={'PAYQTR1':'PAYQ1','PAYANN':'PAYAN'}).set_index('zipcode')\n", "zbpemp.drop(columns=['ST'],inplace=True)\n", "for field in zbpemp.columns:\n", " zbpemp=zbpemp.astype(dtype={field:'int64'},inplace=True)\n", "zbpemp.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#ZIP Codes for the entire state\n", "zbpemp.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Flag columns count the number of establishments for which data is not disclosed\n", "flags=['FLAG_EMP','FLAG_PAYQ1','FLAG_PAYAN']\n", "for flagcol in flags:\n", " datacol=flagcol.split('_')[1]\n", " zbpemp[flagcol]=0\n", " zbpemp.loc[zbpemp[datacol] == 0, flagcol] = zbpemp['ESTAB']\n", "zbpemp.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Join to ZIP ZCTA crosswalk\n", "zbpemp2zcta = pd.merge(zip2zcta,zbpemp,how='inner',left_index=True,right_index=True)\n", "zbpemp2zcta.index.name = 'ZIP'\n", "zbpemp2zcta.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#ZIP codes in the local area that appear in the ZBP data\n", "zbpemp2zcta.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Aggregate to ZCTAs\n", "zctaemp=zbpemp2zcta[['zcta5','ESTAB','EMP','PAYQ1','PAYAN','FLAG_EMP','FLAG_PAYQ1','FLAG_PAYAN']].groupby(['zcta5'])[['ESTAB','EMP','PAYQ1','PAYAN','FLAG_EMP','FLAG_PAYQ1','FLAG_PAYAN']].sum()\n", "zctaemp.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ZBP Industry Data\n", "This data must be requested one record at a time. NOTE that this request can take a LONG TIME, up to one hour to complete for approx 300 ZIP Codes. Once the request is finished the data gets dumped into a json file. If the request is successful but subsequent blocks need to be modified, don't rerun the requests block - pull the data from the json file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "zipcodes=zbpemp2zcta.index.tolist()\n", "len(zipcodes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### ***THIS BLOCK IS A REQUESTS BLOCK!*** \n", "Retrieving approx 300 ZIP Codes takes 1 hour" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#If this block is successful but there are subsequent problems, do not rerun it - start from the following block.\n", "#For industry data, if there are no records for an industry create a blank record with zeros\n", "n=0\n", "z=0\n", "ind_data=[['estab','naics','zipcode']]\n", "for zcode in zipcodes:\n", "#for zcode in zipcodes[0:5]:\n", " clear_output(wait=True)\n", " for naics in ncodes:\n", " idata_url = f'{base_url}?get={icols}&NAICS2012={naics}&for=zipcode:{zcode}&key={api_key}'\n", " try:\n", " response=requests.get(idata_url)\n", " except requests.exceptions.RequestException as e:\n", " print (e)\n", " break\n", " if response.status_code==200:\n", " jsondata=response.json()\n", " ind_data.append(jsondata[1]) \n", " n=n+1\n", " elif response.status_code==204:\n", " record=['0',naics,zcode]\n", " ind_data.append(record)\n", " n=n+1\n", " else:\n", " print('Problem retrieving data, status code:',response.status_code)\n", " break\n", " z=z+1\n", " print(n,'records have been retrieved for',z,'ZIP codes...')\n", "print('Done')\n", "\n", "with open(ijsonpath, 'w') as f:\n", " json.dump(ind_data, f)\n", "print('Data dumped to json file')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open(ijsonpath, 'r') as f:\n", " ijsondata=json.load(f)\n", "zbpind = pd.DataFrame(ijsondata[1:],columns=ijsondata[0])\n", "zbpind['estab']=zbpind['estab'].astype('int64', inplace=True)\n", "zbpind.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Pivot data to move NAICS to columns\n", "zbpind_tab=zbpind.pivot(index='zipcode', columns='naics', values='estab')\n", "zbpind_tab=zbpind_tab.add_prefix('N')\n", "zbpind_tab.rename(columns=lambda x: x.replace('-', '_'),inplace=True)\n", "zbpind_tab.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Join to ZIP ZCTA crosswalk\n", "zbpind2zcta = pd.merge(zip2zcta[['zcta5']],zbpind_tab,how='inner',left_index=True,right_index=True)\n", "zbpind2zcta.index.name = 'ZIP'\n", "zbpind2zcta.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Aggregate to ZCTAs\n", "zctaind=zbpind2zcta.groupby(['zcta5']).sum()\n", "zctaind.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "#Generate and calculate percent total columns\n", "ncols=list(zctaind)\n", "for c in ncols[1:]:\n", " pct=c+'_PCT'\n", " zctaind[pct]=((zctaind[c]/zctaind['N00'])*100).round(2)\n", "zctaind.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NAICS Codes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### ***THIS BLOCK IS A REQUESTS BLOCK!***" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Need to input a ZIP Code that contains establishments in every industry\n", "#Don't rerun unless necessary\n", "ind_codes=[['name','naics','zip']]\n", "for naics in ncodes:\n", " data_url = f'{base_url}?get=NAICS2012_TTL&NAICS2012={naics}&for=zipcode:08088&key={api_key}'\n", " response=requests.get(data_url)\n", " jsondata=response.json()\n", " ind_codes.append(jsondata[1]) \n", "if len(ind_codes)==22:\n", " with open(cjsonpath, 'w') as f:\n", " json.dump(ind_codes, f)\n", " print('Retrieved all 21 codes and dumped to json')\n", "else:\n", " print('Some codes are missing; try a different ZIP')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open(cjsonpath, 'r') as f:\n", " cjsondata=json.load(f)\n", "codes=pd.DataFrame(cjsondata[1:],columns=cjsondata[0]).set_index('naics').drop(columns='zip')\n", "codes.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Quality Control Checks" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Does sum of industries equal industry total?\n", "indsum=zctaind['N00'].subtract(zctaind.iloc[:,1:21].sum(axis=1))\n", "if indsum.sum()==0:\n", " print (True)\n", "else:\n", " print(indsum.loc[indsum != 0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Is sum of percent totals approximately 100?\n", "ptotal=zctaind.iloc[:,21:].sum(axis=1)\n", "if ptotal.loc[(ptotal <= 99.05) | (ptotal >= 100.05)].empty:\n", " print(True)\n", "else:\n", " print(ptotal.loc[(ptotal <= 99.05) | (ptotal >= 100.05)])\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Do number of ZCTAs in employment table match the industries table?\n", "ecount=zctaemp.shape[0]\n", "icount=zctaind.shape[0]\n", "if ecount == icount:\n", " print (True)\n", "else:\n", " print('Mistmatched count between employment',ecount, 'rows and industry',icount, 'rows')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Does sum of estabslishments from employment table equal establishments in industries table?\n", "estsum=zctaemp['ESTAB'].subtract(zctaind['N00'])\n", "if estsum.sum()==0:\n", " print (True)\n", "else:\n", " print(estsum.loc[estsum != 0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Write to Database " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "con = sqlite3.connect(dbname) \n", "cur = con.cursor()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Employment table\n", "cur.execute('DROP TABLE IF EXISTS {};'.format(emptable))\n", "qcreate_emptab=\"\"\"\n", "CREATE TABLE {}(\n", "zcta5 TEXT NOT NULL PRIMARY KEY,\n", "estab INTEGER,\n", "emp INTEGER,\n", "payq1 INTEGER,\n", "payan INTEGER,\n", "flag_emp INTEGER,\n", "flag_payq1 INTEGER,\n", "flag_payan INTEGER);\n", "\"\"\".format(emptable)\n", "\n", "cur.execute(qcreate_emptab)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Don't run this block unless you've run the previous one\n", "zctaemp.to_sql(name='{}'.format(emptable), if_exists='append', index=True, con=con)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cur.execute('SELECT COUNT(*) FROM {};'.format(emptable))\n", "rows = cur.fetchone()\n", "print(rows[0], 'records written to', emptable)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Replace zeros with nulls, as these values really represent no data\n", "for col in zctaemp.columns[1:]:\n", " qupdate='UPDATE {} SET {} = NULL WHERE {} = 0;'.format(emptable,col,col)\n", " cur.execute(qupdate)\n", " print(cur.rowcount,'records updated for',col)\n", " con.commit()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Industry table\n", "cur.execute('DROP TABLE IF EXISTS {}'.format(indtable))\n", "qcreate_indtab=\"\"\"\n", "CREATE TABLE {} (\n", "zcta5 TEXT NOT NULL PRIMARY KEY, \n", "N00 INTEGER, \n", "N11 INTEGER, \n", "N21 INTEGER, \n", "N22 INTEGER, \n", "N23 INTEGER, \n", "N31_33 INTEGER, \n", "N42 INTEGER, \n", "N44_45 INTEGER, \n", "N48_49 INTEGER, \n", "N51 INTEGER, \n", "N52 INTEGER, \n", "N53 INTEGER, \n", "N54 INTEGER, \n", "N55 INTEGER, \n", "N56 INTEGER, \n", "N61 INTEGER, \n", "N62 INTEGER, \n", "N71 INTEGER, \n", "N72 INTEGER, \n", "N81 INTEGER, \n", "N99 INTEGER, \n", "N11_PCT REAL, \n", "N21_PCT REAL, \n", "N22_PCT REAL, \n", "N23_PCT REAL, \n", "N31_33_PCT REAL, \n", "N42_PCT REAL, \n", "N44_45_PCT REAL, \n", "N48_49_PCT REAL, \n", "N51_PCT REAL, \n", "N52_PCT REAL, \n", "N53_PCT REAL, \n", "N54_PCT REAL, \n", "N55_PCT REAL, \n", "N56_PCT REAL, \n", "N61_PCT REAL, \n", "N62_PCT REAL, \n", "N71_PCT REAL, \n", "N72_PCT REAL, \n", "N81_PCT REAL, \n", "N99_PCT REAL);\n", "\"\"\".format(indtable)\n", "\n", "cur.execute(qcreate_indtab)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Don't run this block unless you've run the previous one\n", "zctaind.to_sql(name='{}'.format(indtable), if_exists='append', index=True, con=con)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cur.execute('SELECT COUNT(*) FROM {};'.format(indtable))\n", "rows = cur.fetchone()\n", "print(rows[0], 'records written to', indtable)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Replace zeros with nulls, as these values really represent no data\n", "for col in zctaind.columns[21:]:\n", " qupdate='UPDATE {} SET {} = NULL WHERE {} = 0.0;'.format(indtable,col,col)\n", " cur.execute(qupdate)\n", " print(cur.rowcount,'records updated for',col)\n", " con.commit()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#NAICS code table\n", "cur.execute('DROP TABLE IF EXISTS {};'.format(codetable))\n", "qcreate_codetab=\"\"\"\n", "CREATE TABLE {}(\n", "naics TEXT NOT NULL PRIMARY KEY,\n", "name TEXT);\n", "\"\"\".format(codetable)\n", "\n", "cur.execute(qcreate_codetab)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Don't run this block unless you've run the previous one\n", "codes.to_sql(name='{}'.format(codetable), if_exists='append', index=True, con=con)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cur.execute('SELECT COUNT(*) FROM {};'.format(codetable))\n", "rows = cur.fetchone()\n", "print(rows[0], 'records written to', codetable)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "con.close()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }