{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ZBP to ZCTA\n", "Retrieves data from the Census Bureau's ZIP Code Business Patterns API for a specific area and summarizes it by ZCTA. Three tables are generated: one for employees that contains employment, establishments, and wages, one for industries that contains counts of establishments by 2-digit sector NAICS codes, and one reference table that correlates sector numbers and names. Initial data retrieved from API is written to json, final output is written to a SQLite database. \n", "\n", "Confirmed to work with the 2018 ZBP series\n", "\n", "https://www.census.gov/data/developers/data-sets/cbp-nonemp-zbp/zbp-api.html\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Variables" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd, requests, sqlite3, os, json\n", "from IPython.display import clear_output" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "#Crosswalk files - update only if necessary\n", "uszips_file='zip_to_zcta_2019_uds.csv'\n", "zcta_file='geocorr14_modified.csv'\n", "\n", "uszips_path=os.path.join('inputs',uszips_file)\n", "zcta_path=zcta_file=os.path.join('inputs',zcta_file)\n", "\n", "#Dump files for api data storage\n", "ejsonpath=os.path.join('outputs', 'emp_data.json')\n", "ijsonpath=os.path.join('outputs', 'ind_data.json')\n", "cjsonpath=os.path.join('outputs', 'codes_data.json')\n", "\n", "#API variables - UPDATE THE YEAR\n", "keyfile='census_key.txt'\n", "\n", "year='2018'\n", "dsource='zbp'\n", "state='36'\n", "ecols='ESTAB,EMP,PAYQTR1,PAYANN'\n", "icols='ESTAB'\n", "ncodes=['00','11','21','22','23','31-33','42','44-45','48-49',\n", " '51','52','53','54','55','56','61','62','71','72','81',\n", " '99']\n", "\n", "#SQL output - UPDATE EACH TABLE NAME\n", "dbname=os.path.join('outputs','testdb.sqlite')\n", "emptable='zbp2018emp'\n", "indtable='zbp2018ind'\n", "codetable='zbp2018indcodes'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Crosswalking\n", "Read in the files that relate US ZIP codes to ZCTAs (from JSI) and\n", "ZCTAs to counties for local area (from MCDC Geocorr), then join them by \n", "ZIP Code to create a ZCTA to ZIP table for the local area" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZIP_CODEPO_NAMESTATEZIP_TYPEZCTAzip_join_type
000501HoltsvilleNYPost Office or large volume customer11742Spatial join to ZCTA
100544HoltsvilleNYPost Office or large volume customer11742Spatial join to ZCTA
200601AdjuntasPRZip Code Area00601Zip Matches ZCTA
300602AguadaPRZip Code Area00602Zip Matches ZCTA
400603AguadillaPRZip Code Area00603Zip Matches ZCTA
\n", "
" ], "text/plain": [ " ZIP_CODE PO_NAME STATE ZIP_TYPE ZCTA \\\n", "0 00501 Holtsville NY Post Office or large volume customer 11742 \n", "1 00544 Holtsville NY Post Office or large volume customer 11742 \n", "2 00601 Adjuntas PR Zip Code Area 00601 \n", "3 00602 Aguada PR Zip Code Area 00602 \n", "4 00603 Aguadilla PR Zip Code Area 00603 \n", "\n", " zip_join_type \n", "0 Spatial join to ZCTA \n", "1 Spatial join to ZCTA \n", "2 Zip Matches ZCTA \n", "3 Zip Matches ZCTA \n", "4 Zip Matches ZCTA " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "uszips=pd.read_csv(uszips_path, sep=',', dtype={'ZIP_CODE':str, 'ZCTA':str})\n", "uszips.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(41107, 6)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#All ZIP Codes in US\n", "uszips.shape" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
zcta5county14cntyname2zipnamepop10afact
01000136061New York NYNew York, NY211021.0
11000236061New York NYNew York, NY814101.0
21000336061New York NYNew York, NY560241.0
31000436061New York NYNew York, NY30891.0
41000536061New York NYNew York, NY71351.0
\n", "
" ], "text/plain": [ " zcta5 county14 cntyname2 zipname pop10 afact\n", "0 10001 36061 New York NY New York, NY 21102 1.0\n", "1 10002 36061 New York NY New York, NY 81410 1.0\n", "2 10003 36061 New York NY New York, NY 56024 1.0\n", "3 10004 36061 New York NY New York, NY 3089 1.0\n", "4 10005 36061 New York NY New York, NY 7135 1.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "zcta=pd.read_csv(zcta_path, sep=',', dtype={'zcta5':str, 'county14':str})\n", "zcta.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(214, 6)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#ZCTAs in local area\n", "zcta.shape" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZIP_TYPEPO_NAMEzcta5county14
ZIP_CODE
10001Zip Code AreaNew York1000136061
10118Post Office or large volume customerNew York1000136061
10120Post Office or large volume customerNew York1000136061
10122Post Office or large volume customerNew York1000136061
10123Post Office or large volume customerNew York1000136061
\n", "
" ], "text/plain": [ " ZIP_TYPE PO_NAME zcta5 county14\n", "ZIP_CODE \n", "10001 Zip Code Area New York 10001 36061\n", "10118 Post Office or large volume customer New York 10001 36061\n", "10120 Post Office or large volume customer New York 10001 36061\n", "10122 Post Office or large volume customer New York 10001 36061\n", "10123 Post Office or large volume customer New York 10001 36061" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Merge ZIP Codes with ZCTAs for local area\n", "zip2zcta = pd.merge(uszips[['ZIP_CODE','ZIP_TYPE','PO_NAME','ZCTA']],zcta[['zcta5','county14']],how='right', \n", " left_on='ZCTA', right_on='zcta5').set_index('ZIP_CODE')\n", "zip2zcta.drop(columns=['ZCTA'],inplace=True)\n", "zip2zcta.head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(313, 4)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#ZIP Codes in local area\n", "zip2zcta.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## API Call and Processing\n", "Request the ZBP data from the Census Bureau for the state, join the ZBP data to the local ZCTA to ZIP table based on ZIP Code, and group the data by ZCTA " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "with open(keyfile) as key:\n", " api_key=key.read().strip()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://api.census.gov/data/2018/zbp'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "base_url = f'https://api.census.gov/data/{year}/{dsource}'\n", "base_url" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ZBP Employment Data\n", "This data is requested in a series of chunks which contain multiple ZIP Codes - do not rerun the requests block if retrieval is successful but subsequent notebook changes are needed. Proceed to the next block and pull data from json dump file." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def chunks(l, n):\n", " # For item i in a range that is a length of l,\n", " for i in range(0, len(l), n):\n", " # Create an index range for l of n items:\n", " yield l[i:i+n]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of chunks: 7\n" ] } ], "source": [ "reqzips=list(chunks(zip2zcta.index.tolist(),48))\n", "print('Number of chunks:',len(reqzips))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### ***THIS BLOCK IS A REQUESTS BLOCK!*** " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Retrieved data for chunk 6\n", "Done - Data dumped to json file\n" ] } ], "source": [ "#Code 200 = success, do not rerun this block unless it's necessary\n", "emp_data=[]\n", "for i, v in enumerate (reqzips):\n", " batchzips=','.join(v)\n", " edata_url = f'{base_url}?get={ecols}&EMPSZES=001&for=zipcode:{batchzips}&key={api_key}'\n", " response=requests.get(edata_url)\n", " if response.status_code==200:\n", " clear_output(wait=True)\n", " data=response.json()\n", " if i == 0: \n", " for record in data:\n", " emp_data.append(record)\n", " else:\n", " for record in data[1:]:\n", " emp_data.append(record) \n", " print('Retrieved data for chunk',i)\n", " else:\n", " print('***Problem with retrieval***, response code',response.status_code)\n", " break\n", "with open(ejsonpath, 'w') as f:\n", " json.dump(emp_data, f)\n", "print('Done - Data dumped to json file')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ESTABEMPPAYQ1PAYAN
zipcode
101183266358202796815737
1012070137841023132076
10121696104199451887899
10122177191740477156824
10123180203651353195513
\n", "
" ], "text/plain": [ " ESTAB EMP PAYQ1 PAYAN\n", "zipcode \n", "10118 326 6358 202796 815737\n", "10120 70 1378 41023 132076\n", "10121 69 6104 199451 887899\n", "10122 177 1917 40477 156824\n", "10123 180 2036 51353 195513" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with open(ejsonpath, 'r') as f:\n", " ejsondata=json.load(f)\n", "zbpemp=pd.DataFrame(ejsondata[1:], columns=ejsondata[0]).rename(columns={'PAYQTR1':'PAYQ1','PAYANN':'PAYAN','zip code':'zipcode'}).set_index('zipcode')\n", "for field in zbpemp.columns:\n", " zbpemp=zbpemp.astype(dtype={field:'int64'})\n", "zbpemp.drop(columns=['EMPSZES'],inplace=True)\n", "zbpemp.head()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(266, 4)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#ZIP Codes retrieved - may differ from zip2zcta as some zips have no businesses\n", "zbpemp.shape" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ESTABEMPPAYQ1PAYANFLAG_EMPFLAG_PAYQ1FLAG_PAYAN
zipcode
101183266358202796815737000
1012070137841023132076000
10121696104199451887899000
10122177191740477156824000
10123180203651353195513000
\n", "
" ], "text/plain": [ " ESTAB EMP PAYQ1 PAYAN FLAG_EMP FLAG_PAYQ1 FLAG_PAYAN\n", "zipcode \n", "10118 326 6358 202796 815737 0 0 0\n", "10120 70 1378 41023 132076 0 0 0\n", "10121 69 6104 199451 887899 0 0 0\n", "10122 177 1917 40477 156824 0 0 0\n", "10123 180 2036 51353 195513 0 0 0" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Flag columns count the number of establishments for which data is not disclosed\n", "flags=['FLAG_EMP','FLAG_PAYQ1','FLAG_PAYAN']\n", "for flagcol in flags:\n", " datacol=flagcol.split('_')[1]\n", " zbpemp[flagcol]=0\n", " zbpemp.loc[zbpemp[datacol] == 0, flagcol] = zbpemp['ESTAB']\n", "zbpemp.head()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZIP_TYPEPO_NAMEzcta5county14ESTABEMPPAYQ1PAYANFLAG_EMPFLAG_PAYQ1FLAG_PAYAN
ZIP_CODE
10001Zip Code AreaNew York10001360617248151769271718610646611000
10118Post Office or large volume customerNew York10001360613266358202796815737000
10120Post Office or large volume customerNew York100013606170137841023132076000
10122Post Office or large volume customerNew York1000136061177191740477156824000
10123Post Office or large volume customerNew York1000136061180203651353195513000
\n", "
" ], "text/plain": [ " ZIP_TYPE PO_NAME zcta5 county14 \\\n", "ZIP_CODE \n", "10001 Zip Code Area New York 10001 36061 \n", "10118 Post Office or large volume customer New York 10001 36061 \n", "10120 Post Office or large volume customer New York 10001 36061 \n", "10122 Post Office or large volume customer New York 10001 36061 \n", "10123 Post Office or large volume customer New York 10001 36061 \n", "\n", " ESTAB EMP PAYQ1 PAYAN FLAG_EMP FLAG_PAYQ1 FLAG_PAYAN \n", "ZIP_CODE \n", "10001 7248 151769 2717186 10646611 0 0 0 \n", "10118 326 6358 202796 815737 0 0 0 \n", "10120 70 1378 41023 132076 0 0 0 \n", "10122 177 1917 40477 156824 0 0 0 \n", "10123 180 2036 51353 195513 0 0 0 " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Join to ZIP ZCTA crosswalk\n", "zbpemp2zcta = pd.merge(zip2zcta,zbpemp,how='inner',left_index=True,right_index=True)\n", "zbpemp2zcta.index.name = 'ZIP_CODE'\n", "zbpemp2zcta.head()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(266, 11)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#ZIP codes in the local area that appear in the ZBP data\n", "zbpemp2zcta.shape" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ESTABEMPPAYQ1PAYANFLAG_EMPFLAG_PAYQ1FLAG_PAYAN
zcta5
100018006163513305321911948301000
10002296222737176288764720000
1000342739874218303696916902000
1000416606902523379567349828000
1000513974812930436207586070000
\n", "
" ], "text/plain": [ " ESTAB EMP PAYQ1 PAYAN FLAG_EMP FLAG_PAYQ1 FLAG_PAYAN\n", "zcta5 \n", "10001 8006 163513 3053219 11948301 0 0 0\n", "10002 2962 22737 176288 764720 0 0 0\n", "10003 4273 98742 1830369 6916902 0 0 0\n", "10004 1660 69025 2337956 7349828 0 0 0\n", "10005 1397 48129 3043620 7586070 0 0 0" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Aggregate to ZCTAs\n", "zctaemp=zbpemp2zcta[['zcta5','ESTAB','EMP','PAYQ1','PAYAN','FLAG_EMP','FLAG_PAYQ1','FLAG_PAYAN']].groupby(['zcta5'])[['ESTAB','EMP','PAYQ1','PAYAN','FLAG_EMP','FLAG_PAYQ1','FLAG_PAYAN']].sum()\n", "zctaemp.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ZBP Industry Data\n", "This data must be requested one record at a time. NOTE that this request can take a LONG TIME, up to one hour to complete for approx 300 ZIP Codes. Once the request is finished the data gets dumped into a json file. If the request is successful but subsequent blocks need to be modified, don't rerun the requests block - pull the data from the json file." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "266" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "zipcodes=zbpemp2zcta.index.tolist()\n", "len(zipcodes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### ***THIS BLOCK IS A REQUESTS BLOCK!*** \n", "Retrieving approx 300 ZIP Codes takes 1 hour\n", "\n", "*NOTE - revise in the future to retrieve chunks of zip codes*" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5586 records have been retrieved for 266 ZIP codes...\n", "Done\n", "Data dumped to json file\n" ] } ], "source": [ "#If this block is successful but there are subsequent problems, do not rerun it - start from the following block.\n", "#For industry data, if there are no records for an industry create a blank record with zeros\n", "n=0\n", "z=0\n", "ind_data=[['estab','naics','zipcode']]\n", "for zcode in zipcodes:\n", "#for zcode in zipcodes[0:5]:\n", " clear_output(wait=True)\n", " for naics in ncodes:\n", " idata_url = f'{base_url}?get={icols}&NAICS2017={naics}&for=zipcode:{zcode}&key={api_key}'\n", " try:\n", " response=requests.get(idata_url)\n", " except requests.exceptions.RequestException as e:\n", " print (e)\n", " break\n", " if response.status_code==200:\n", " jsondata=response.json()\n", " ind_data.append(jsondata[1]) \n", " n=n+1\n", " elif response.status_code==204:\n", " record=['0',naics,zcode]\n", " ind_data.append(record)\n", " n=n+1\n", " else:\n", " print('Problem retrieving data, status code:',response.status_code)\n", " break\n", " z=z+1\n", " print(n,'records have been retrieved for',z,'ZIP codes...')\n", "print('Done')\n", "\n", "with open(ijsonpath, 'w') as f:\n", " json.dump(ind_data, f)\n", "print('Data dumped to json file')" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
estabnaicszipcode
072480010001
101110001
202110001
352210001
42622310001
\n", "
" ], "text/plain": [ " estab naics zipcode\n", "0 7248 00 10001\n", "1 0 11 10001\n", "2 0 21 10001\n", "3 5 22 10001\n", "4 262 23 10001" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with open(ijsonpath, 'r') as f:\n", " ijsondata=json.load(f)\n", "zbpind = pd.DataFrame(ijsondata[1:],columns=ijsondata[0])\n", "zbpind['estab']=zbpind['estab'].astype('int64')\n", "zbpind.head()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
naicsN00N11N21N22N23N31_33N42N44_45N48_49N51...N53N54N55N56N61N62N71N72N81N99
zipcode
100017248005262188107965462418...50014531533631614053135285069
100022962000100601675204666...274323388272061006073143
1000342680005135944335195...37267225120753425616585048
1000416000003413686520109...8748312843485311311344
10005134600026639431172...784212574214130811023
\n", "

5 rows × 21 columns

\n", "
" ], "text/plain": [ "naics N00 N11 N21 N22 N23 N31_33 N42 N44_45 N48_49 N51 ... \\\n", "zipcode ... \n", "10001 7248 0 0 5 262 188 1079 654 62 418 ... \n", "10002 2962 0 0 0 100 60 167 520 46 66 ... \n", "10003 4268 0 0 0 51 35 94 433 5 195 ... \n", "10004 1600 0 0 0 34 13 68 65 20 109 ... \n", "10005 1346 0 0 0 26 6 39 43 11 72 ... \n", "\n", "naics N53 N54 N55 N56 N61 N62 N71 N72 N81 N99 \n", "zipcode \n", "10001 500 1453 153 363 161 405 313 528 506 9 \n", "10002 274 323 3 88 27 206 100 607 314 3 \n", "10003 372 672 25 120 75 342 561 658 504 8 \n", "10004 87 483 12 84 34 85 31 131 134 4 \n", "10005 78 421 25 74 21 41 30 81 102 3 \n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Pivot data to move NAICS to columns\n", "zbpind_tab=zbpind.pivot(index='zipcode', columns='naics', values='estab')\n", "zbpind_tab=zbpind_tab.add_prefix('N')\n", "zbpind_tab.rename(columns=lambda x: x.replace('-', '_'),inplace=True)\n", "zbpind_tab.head()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
naicsN00N11N21N22N23N31_33N42N44_45N48_49N51...N54N55N56N61N62N71N72N81N99NXX
zipcode
100017248005262188107965462418...145315336316140531352850690
100022962000100601675204666...3233882720610060731432
1000342680005135944335195...672251207534256165850482
1000416000003413686520109...483128434853113113441
10005134600026639431172...42125742141308110230
\n", "

5 rows × 22 columns

\n", "
" ], "text/plain": [ "naics N00 N11 N21 N22 N23 N31_33 N42 N44_45 N48_49 N51 ... \\\n", "zipcode ... \n", "10001 7248 0 0 5 262 188 1079 654 62 418 ... \n", "10002 2962 0 0 0 100 60 167 520 46 66 ... \n", "10003 4268 0 0 0 51 35 94 433 5 195 ... \n", "10004 1600 0 0 0 34 13 68 65 20 109 ... \n", "10005 1346 0 0 0 26 6 39 43 11 72 ... \n", "\n", "naics N54 N55 N56 N61 N62 N71 N72 N81 N99 NXX \n", "zipcode \n", "10001 1453 153 363 161 405 313 528 506 9 0 \n", "10002 323 3 88 27 206 100 607 314 3 2 \n", "10003 672 25 120 75 342 561 658 504 8 2 \n", "10004 483 12 84 34 85 31 131 134 4 1 \n", "10005 421 25 74 21 41 30 81 102 3 0 \n", "\n", "[5 rows x 22 columns]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Create column to summarize businesses that were not disclosed and categorized, but that are included in the total\n", "zbpind_tab['NXX']=zbpind_tab.loc[:,'N00'].subtract(zbpind_tab.loc[:,'N11':'N99'].sum(axis=1))\n", "zbpind_tab.head()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
zcta5N00N11N21N22N23N31_33N42N44_45N48_49...N54N55N56N61N62N71N72N81N99NXX
ZIP_CODE
10001100017248005262188107965462...145315336316140531352850690
10118100013260008045105...10312170829112305
10120100017000000880...1810003000011
1012210001177000001030...4507055301003
1012310001180000701440...7701371130806
\n", "

5 rows × 23 columns

\n", "
" ], "text/plain": [ " zcta5 N00 N11 N21 N22 N23 N31_33 N42 N44_45 N48_49 ... \\\n", "ZIP_CODE ... \n", "10001 10001 7248 0 0 5 262 188 1079 654 62 ... \n", "10118 10001 326 0 0 0 8 0 45 10 5 ... \n", "10120 10001 70 0 0 0 0 0 8 8 0 ... \n", "10122 10001 177 0 0 0 0 0 10 3 0 ... \n", "10123 10001 180 0 0 0 7 0 14 4 0 ... \n", "\n", " N54 N55 N56 N61 N62 N71 N72 N81 N99 NXX \n", "ZIP_CODE \n", "10001 1453 153 363 161 405 313 528 506 9 0 \n", "10118 103 12 17 0 8 29 11 23 0 5 \n", "10120 18 10 0 0 3 0 0 0 0 11 \n", "10122 45 0 7 0 5 53 0 10 0 3 \n", "10123 77 0 13 7 11 3 0 8 0 6 \n", "\n", "[5 rows x 23 columns]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Join to ZIP ZCTA crosswalk\n", "zbpind2zcta = pd.merge(zip2zcta[['zcta5']],zbpind_tab,how='inner',left_index=True,right_index=True)\n", "zbpind2zcta.index.name = 'ZIP_CODE'\n", "zbpind2zcta.head()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(266, 23)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "zbpind2zcta.shape" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
N00N11N21N22N23N31_33N42N44_45N48_49N51...N54N55N56N61N62N71N72N81N99NXX
zcta5
100018006005277188115667967450...1696175400168432398539547930
100022962000100601675204666...3233882720610060731432
1000342730005135944335195...672251207534256165850487
1000416600003413686520126...4911288348531131137412
10005139700026639431172...421257421413081102313
\n", "

5 rows × 22 columns

\n", "
" ], "text/plain": [ " N00 N11 N21 N22 N23 N31_33 N42 N44_45 N48_49 N51 ... N54 \\\n", "zcta5 ... \n", "10001 8006 0 0 5 277 188 1156 679 67 450 ... 1696 \n", "10002 2962 0 0 0 100 60 167 520 46 66 ... 323 \n", "10003 4273 0 0 0 51 35 94 433 5 195 ... 672 \n", "10004 1660 0 0 0 34 13 68 65 20 126 ... 491 \n", "10005 1397 0 0 0 26 6 39 43 11 72 ... 421 \n", "\n", " N55 N56 N61 N62 N71 N72 N81 N99 NXX \n", "zcta5 \n", "10001 175 400 168 432 398 539 547 9 30 \n", "10002 3 88 27 206 100 607 314 3 2 \n", "10003 25 120 75 342 561 658 504 8 7 \n", "10004 12 88 34 85 31 131 137 4 12 \n", "10005 25 74 21 41 30 81 102 3 13 \n", "\n", "[5 rows x 22 columns]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Aggregate to ZCTAs\n", "zctaind=zbpind2zcta.groupby(['zcta5']).sum()\n", "zctaind.head()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
N00N11N21N22N23N31_33N42N44_45N48_49N51...N54_PCTN55_PCTN56_PCTN61_PCTN62_PCTN71_PCTN72_PCTN81_PCTN99_PCTNXX_PCT
zcta5
100018006005277188115667967450...21.182.195.002.105.404.976.736.830.110.37
100022962000100601675204666...10.900.102.970.916.953.3820.4910.600.100.07
1000342730005135944335195...15.730.592.811.768.0013.1315.4011.790.190.16
1000416600003413686520126...29.580.725.302.055.121.877.898.250.240.72
10005139700026639431172...30.141.795.301.502.932.155.807.300.210.93
\n", "

5 rows × 43 columns

\n", "
" ], "text/plain": [ " N00 N11 N21 N22 N23 N31_33 N42 N44_45 N48_49 N51 ... \\\n", "zcta5 ... \n", "10001 8006 0 0 5 277 188 1156 679 67 450 ... \n", "10002 2962 0 0 0 100 60 167 520 46 66 ... \n", "10003 4273 0 0 0 51 35 94 433 5 195 ... \n", "10004 1660 0 0 0 34 13 68 65 20 126 ... \n", "10005 1397 0 0 0 26 6 39 43 11 72 ... \n", "\n", " N54_PCT N55_PCT N56_PCT N61_PCT N62_PCT N71_PCT N72_PCT N81_PCT \\\n", "zcta5 \n", "10001 21.18 2.19 5.00 2.10 5.40 4.97 6.73 6.83 \n", "10002 10.90 0.10 2.97 0.91 6.95 3.38 20.49 10.60 \n", "10003 15.73 0.59 2.81 1.76 8.00 13.13 15.40 11.79 \n", "10004 29.58 0.72 5.30 2.05 5.12 1.87 7.89 8.25 \n", "10005 30.14 1.79 5.30 1.50 2.93 2.15 5.80 7.30 \n", "\n", " N99_PCT NXX_PCT \n", "zcta5 \n", "10001 0.11 0.37 \n", "10002 0.10 0.07 \n", "10003 0.19 0.16 \n", "10004 0.24 0.72 \n", "10005 0.21 0.93 \n", "\n", "[5 rows x 43 columns]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Generate and calculate percent total columns\n", "ncols=list(zctaind)\n", "for c in ncols[1:]:\n", " pct=c+'_PCT'\n", " zctaind[pct]=((zctaind[c]/zctaind['N00'])*100).round(2)\n", "zctaind.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NAICS Codes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### ***THIS BLOCK IS A REQUESTS BLOCK!***" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "codedict={}\n", "codes_url=f'https://api.census.gov/data/2018/zbp/variables/NAICS2017.json'\n", "response=requests.get(codes_url)\n", "codes_data=response.json()\n", "codedict.update(codes_data['values']['item'])\n", "sectordict=dict((k, codedict[k]) for k in ncodes)\n", "sectordict['XX']='Establishments omitted from classification due to privacy regulations'" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
name
naics
00Total for all sectors
11Agriculture, forestry, fishing and hunting
21Mining, quarrying, and oil and gas extraction
22Utilities
23Construction
31-33Manufacturing
42Wholesale trade
44-45Retail trade
48-49Transportation and warehousing
51Information
52Finance and insurance
53Real estate and rental and leasing
54Professional, scientific, and technical services
55Management of companies and enterprises
56Administrative and support and waste managemen...
61Educational services
62Health care and social assistance
71Arts, entertainment, and recreation
72Accommodation and food services
81Other services (except public administration)
99Industries not classified
XXEstablishments omitted from classification due...
\n", "
" ], "text/plain": [ " name\n", "naics \n", "00 Total for all sectors\n", "11 Agriculture, forestry, fishing and hunting\n", "21 Mining, quarrying, and oil and gas extraction\n", "22 Utilities\n", "23 Construction\n", "31-33 Manufacturing\n", "42 Wholesale trade\n", "44-45 Retail trade\n", "48-49 Transportation and warehousing\n", "51 Information\n", "52 Finance and insurance\n", "53 Real estate and rental and leasing\n", "54 Professional, scientific, and technical services\n", "55 Management of companies and enterprises\n", "56 Administrative and support and waste managemen...\n", "61 Educational services\n", "62 Health care and social assistance\n", "71 Arts, entertainment, and recreation\n", "72 Accommodation and food services\n", "81 Other services (except public administration)\n", "99 Industries not classified\n", "XX Establishments omitted from classification due..." ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "codes=pd.DataFrame(list(sectordict.items()), columns=['naics', 'name']).set_index('naics')\n", "codes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Quality Control Checks" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n" ] } ], "source": [ "#Does sum of industries equal industry total?\n", "indsum=zctaind['N00'].subtract(zctaind.iloc[:,1:22].sum(axis=1))\n", "if indsum.sum()==0:\n", " print (True)\n", "else:\n", " print(indsum.loc[indsum != 0])" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n" ] } ], "source": [ "#Is sum of percent totals approximately 100?\n", "ptotal=zctaind.iloc[:,22:].sum(axis=1)\n", "if ptotal.loc[(ptotal <= 99.05) | (ptotal >= 100.05)].empty:\n", " print(True)\n", "else:\n", " print(ptotal.loc[(ptotal <= 99.05) | (ptotal >= 100.05)])\n" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n" ] } ], "source": [ "#Do number of ZCTAs in employment table match the industries table?\n", "ecount=zctaemp.shape[0]\n", "icount=zctaind.shape[0]\n", "if ecount == icount:\n", " print (True)\n", "else:\n", " print('Mistmatched count between employment',ecount, 'rows and industry',icount, 'rows')" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n" ] } ], "source": [ "#Does sum of estabslishments from employment table equal establishments in industries table?\n", "estsum=zctaemp['ESTAB'].subtract(zctaind['N00'])\n", "if estsum.sum()==0:\n", " print (True)\n", "else:\n", " print(estsum.loc[estsum != 0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Write to Database " ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "con = sqlite3.connect(dbname) \n", "cur = con.cursor()" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Employment table\n", "cur.execute('DROP TABLE IF EXISTS {};'.format(emptable))\n", "qcreate_emptab=\"\"\"\n", "CREATE TABLE {}(\n", "zcta5 TEXT NOT NULL PRIMARY KEY,\n", "estab INTEGER,\n", "emp INTEGER,\n", "payq1 INTEGER,\n", "payan INTEGER,\n", "flag_emp INTEGER,\n", "flag_payq1 INTEGER,\n", "flag_payan INTEGER);\n", "\"\"\".format(emptable)\n", "\n", "cur.execute(qcreate_emptab)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "#Don't run this block unless you've run the previous one\n", "zctaemp.to_sql(name='{}'.format(emptable), if_exists='append', index=True, con=con)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "212 records written to zbp2018emp\n" ] } ], "source": [ "cur.execute('SELECT COUNT(*) FROM {};'.format(emptable))\n", "rows = cur.fetchone()\n", "print(rows[0], 'records written to', emptable)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 records updated for EMP\n", "0 records updated for PAYQ1\n", "0 records updated for PAYAN\n", "212 records updated for FLAG_EMP\n", "212 records updated for FLAG_PAYQ1\n", "212 records updated for FLAG_PAYAN\n" ] } ], "source": [ "#Replace zeros with nulls, as these values really represent no data\n", "for col in zctaemp.columns[1:]:\n", " qupdate='UPDATE {} SET {} = NULL WHERE {} = 0;'.format(emptable,col,col)\n", " cur.execute(qupdate)\n", " print(cur.rowcount,'records updated for',col)\n", " con.commit()" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Industry table\n", "cur.execute('DROP TABLE IF EXISTS {}'.format(indtable))\n", "qcreate_indtab=\"\"\"\n", "CREATE TABLE {} (\n", "zcta5 TEXT NOT NULL PRIMARY KEY, \n", "N00 INTEGER, \n", "N11 INTEGER, \n", "N21 INTEGER, \n", "N22 INTEGER, \n", "N23 INTEGER, \n", "N31_33 INTEGER, \n", "N42 INTEGER, \n", "N44_45 INTEGER, \n", "N48_49 INTEGER, \n", "N51 INTEGER, \n", "N52 INTEGER, \n", "N53 INTEGER, \n", "N54 INTEGER, \n", "N55 INTEGER, \n", "N56 INTEGER, \n", "N61 INTEGER, \n", "N62 INTEGER, \n", "N71 INTEGER, \n", "N72 INTEGER, \n", "N81 INTEGER, \n", "N99 INTEGER,\n", "NXX INTEGER,\n", "N11_PCT REAL, \n", "N21_PCT REAL, \n", "N22_PCT REAL, \n", "N23_PCT REAL, \n", "N31_33_PCT REAL, \n", "N42_PCT REAL, \n", "N44_45_PCT REAL, \n", "N48_49_PCT REAL, \n", "N51_PCT REAL, \n", "N52_PCT REAL, \n", "N53_PCT REAL, \n", "N54_PCT REAL, \n", "N55_PCT REAL, \n", "N56_PCT REAL, \n", "N61_PCT REAL, \n", "N62_PCT REAL, \n", "N71_PCT REAL, \n", "N72_PCT REAL, \n", "N81_PCT REAL, \n", "N99_PCT REAL,\n", "NXX_PCT REAL);\n", "\"\"\".format(indtable)\n", "\n", "cur.execute(qcreate_indtab)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "#Don't run this block unless you've run the previous one\n", "zctaind.to_sql(name='{}'.format(indtable), if_exists='append', index=True, con=con)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "212 records written to zbp2018ind\n" ] } ], "source": [ "cur.execute('SELECT COUNT(*) FROM {};'.format(indtable))\n", "rows = cur.fetchone()\n", "print(rows[0], 'records written to', indtable)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "212 records updated for N11_PCT\n", "211 records updated for N21_PCT\n", "201 records updated for N22_PCT\n", "27 records updated for N23_PCT\n", "58 records updated for N31_33_PCT\n", "26 records updated for N42_PCT\n", "18 records updated for N44_45_PCT\n", "35 records updated for N48_49_PCT\n", "39 records updated for N51_PCT\n", "12 records updated for N52_PCT\n", "10 records updated for N53_PCT\n", "7 records updated for N54_PCT\n", "140 records updated for N55_PCT\n", "12 records updated for N56_PCT\n", "39 records updated for N61_PCT\n", "16 records updated for N62_PCT\n", "46 records updated for N71_PCT\n", "17 records updated for N72_PCT\n", "8 records updated for N81_PCT\n", "158 records updated for N99_PCT\n", "19 records updated for NXX_PCT\n" ] } ], "source": [ "#For percentages, replace zeros with nulls, as these values really represent no data\n", "for col in zctaind.columns[22:]:\n", " qupdate='UPDATE {} SET {} = NULL WHERE {} = 0.0;'.format(indtable,col,col)\n", " cur.execute(qupdate)\n", " print(cur.rowcount,'records updated for',col)\n", " con.commit()" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "193 records updated for N11\n", "192 records updated for N21\n", "184 records updated for N22\n", "27 records updated for N23\n", "58 records updated for N31_33\n", "26 records updated for N42\n", "18 records updated for N44_45\n", "35 records updated for N48_49\n", "39 records updated for N51\n", "12 records updated for N52\n", "10 records updated for N53\n", "7 records updated for N54\n", "132 records updated for N55\n", "12 records updated for N56\n", "39 records updated for N61\n", "16 records updated for N62\n", "46 records updated for N71\n", "17 records updated for N72\n", "8 records updated for N81\n", "147 records updated for N99\n", "0 records updated for NXX\n" ] } ], "source": [ "#For establishments, replace zeros with nulls unless establishments were omitted from classification\n", "for col in zctaind.columns[1:22]:\n", " qupdate='UPDATE {} SET {} = NULL WHERE {} = 0 AND NXX !=0;'.format(indtable,col,col)\n", " cur.execute(qupdate)\n", " print(cur.rowcount,'records updated for',col)\n", " con.commit()" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#NAICS code table\n", "cur.execute('DROP TABLE IF EXISTS {};'.format(codetable))\n", "qcreate_codetab=\"\"\"\n", "CREATE TABLE {}(\n", "naics TEXT NOT NULL PRIMARY KEY,\n", "name TEXT);\n", "\"\"\".format(codetable)\n", "\n", "cur.execute(qcreate_codetab)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "#Don't run this block unless you've run the previous one\n", "codes.to_sql(name='{}'.format(codetable), if_exists='append', index=True, con=con)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "22 records written to zbp2018indcodes\n" ] } ], "source": [ "cur.execute('SELECT COUNT(*) FROM {};'.format(codetable))\n", "rows = cur.fetchone()\n", "print(rows[0], 'records written to', codetable)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "con.close()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }