{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ZBP to ZCTA\n",
"Retrieves data from the Census Bureau's ZIP Code Business Patterns API for a specific area and summarizes it by ZCTA. Three tables are generated: one for employees that contains employment, establishments, and wages, one for industries that contains counts of establishments by 2-digit sector NAICS codes, and one reference table that correlates sector numbers and names. Initial data retrieved from API is written to json, final output is written to a SQLite database. \n",
"\n",
"Confirmed to work with the 2018 ZBP series\n",
"\n",
"https://www.census.gov/data/developers/data-sets/cbp-nonemp-zbp/zbp-api.html\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Variables"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd, requests, sqlite3, os, json\n",
"from IPython.display import clear_output"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"#Crosswalk files - update only if necessary\n",
"uszips_file='zip_to_zcta_2019_uds.csv'\n",
"zcta_file='geocorr14_modified.csv'\n",
"\n",
"uszips_path=os.path.join('inputs',uszips_file)\n",
"zcta_path=zcta_file=os.path.join('inputs',zcta_file)\n",
"\n",
"#Dump files for api data storage\n",
"ejsonpath=os.path.join('outputs', 'emp_data.json')\n",
"ijsonpath=os.path.join('outputs', 'ind_data.json')\n",
"cjsonpath=os.path.join('outputs', 'codes_data.json')\n",
"\n",
"#API variables - UPDATE THE YEAR\n",
"keyfile='census_key.txt'\n",
"\n",
"year='2018'\n",
"dsource='zbp'\n",
"state='36'\n",
"ecols='ESTAB,EMP,PAYQTR1,PAYANN'\n",
"icols='ESTAB'\n",
"ncodes=['00','11','21','22','23','31-33','42','44-45','48-49',\n",
" '51','52','53','54','55','56','61','62','71','72','81',\n",
" '99']\n",
"\n",
"#SQL output - UPDATE EACH TABLE NAME\n",
"dbname=os.path.join('outputs','testdb.sqlite')\n",
"emptable='zbp2018emp'\n",
"indtable='zbp2018ind'\n",
"codetable='zbp2018indcodes'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Crosswalking\n",
"Read in the files that relate US ZIP codes to ZCTAs (from JSI) and\n",
"ZCTAs to counties for local area (from MCDC Geocorr), then join them by \n",
"ZIP Code to create a ZCTA to ZIP table for the local area"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ZIP_CODE | \n",
" PO_NAME | \n",
" STATE | \n",
" ZIP_TYPE | \n",
" ZCTA | \n",
" zip_join_type | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 00501 | \n",
" Holtsville | \n",
" NY | \n",
" Post Office or large volume customer | \n",
" 11742 | \n",
" Spatial join to ZCTA | \n",
"
\n",
" \n",
" 1 | \n",
" 00544 | \n",
" Holtsville | \n",
" NY | \n",
" Post Office or large volume customer | \n",
" 11742 | \n",
" Spatial join to ZCTA | \n",
"
\n",
" \n",
" 2 | \n",
" 00601 | \n",
" Adjuntas | \n",
" PR | \n",
" Zip Code Area | \n",
" 00601 | \n",
" Zip Matches ZCTA | \n",
"
\n",
" \n",
" 3 | \n",
" 00602 | \n",
" Aguada | \n",
" PR | \n",
" Zip Code Area | \n",
" 00602 | \n",
" Zip Matches ZCTA | \n",
"
\n",
" \n",
" 4 | \n",
" 00603 | \n",
" Aguadilla | \n",
" PR | \n",
" Zip Code Area | \n",
" 00603 | \n",
" Zip Matches ZCTA | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ZIP_CODE PO_NAME STATE ZIP_TYPE ZCTA \\\n",
"0 00501 Holtsville NY Post Office or large volume customer 11742 \n",
"1 00544 Holtsville NY Post Office or large volume customer 11742 \n",
"2 00601 Adjuntas PR Zip Code Area 00601 \n",
"3 00602 Aguada PR Zip Code Area 00602 \n",
"4 00603 Aguadilla PR Zip Code Area 00603 \n",
"\n",
" zip_join_type \n",
"0 Spatial join to ZCTA \n",
"1 Spatial join to ZCTA \n",
"2 Zip Matches ZCTA \n",
"3 Zip Matches ZCTA \n",
"4 Zip Matches ZCTA "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"uszips=pd.read_csv(uszips_path, sep=',', dtype={'ZIP_CODE':str, 'ZCTA':str})\n",
"uszips.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(41107, 6)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#All ZIP Codes in US\n",
"uszips.shape"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" zcta5 | \n",
" county14 | \n",
" cntyname2 | \n",
" zipname | \n",
" pop10 | \n",
" afact | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 10001 | \n",
" 36061 | \n",
" New York NY | \n",
" New York, NY | \n",
" 21102 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 10002 | \n",
" 36061 | \n",
" New York NY | \n",
" New York, NY | \n",
" 81410 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 10003 | \n",
" 36061 | \n",
" New York NY | \n",
" New York, NY | \n",
" 56024 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 10004 | \n",
" 36061 | \n",
" New York NY | \n",
" New York, NY | \n",
" 3089 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 10005 | \n",
" 36061 | \n",
" New York NY | \n",
" New York, NY | \n",
" 7135 | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" zcta5 county14 cntyname2 zipname pop10 afact\n",
"0 10001 36061 New York NY New York, NY 21102 1.0\n",
"1 10002 36061 New York NY New York, NY 81410 1.0\n",
"2 10003 36061 New York NY New York, NY 56024 1.0\n",
"3 10004 36061 New York NY New York, NY 3089 1.0\n",
"4 10005 36061 New York NY New York, NY 7135 1.0"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"zcta=pd.read_csv(zcta_path, sep=',', dtype={'zcta5':str, 'county14':str})\n",
"zcta.head()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(214, 6)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#ZCTAs in local area\n",
"zcta.shape"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ZIP_TYPE | \n",
" PO_NAME | \n",
" zcta5 | \n",
" county14 | \n",
"
\n",
" \n",
" ZIP_CODE | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 10001 | \n",
" Zip Code Area | \n",
" New York | \n",
" 10001 | \n",
" 36061 | \n",
"
\n",
" \n",
" 10118 | \n",
" Post Office or large volume customer | \n",
" New York | \n",
" 10001 | \n",
" 36061 | \n",
"
\n",
" \n",
" 10120 | \n",
" Post Office or large volume customer | \n",
" New York | \n",
" 10001 | \n",
" 36061 | \n",
"
\n",
" \n",
" 10122 | \n",
" Post Office or large volume customer | \n",
" New York | \n",
" 10001 | \n",
" 36061 | \n",
"
\n",
" \n",
" 10123 | \n",
" Post Office or large volume customer | \n",
" New York | \n",
" 10001 | \n",
" 36061 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ZIP_TYPE PO_NAME zcta5 county14\n",
"ZIP_CODE \n",
"10001 Zip Code Area New York 10001 36061\n",
"10118 Post Office or large volume customer New York 10001 36061\n",
"10120 Post Office or large volume customer New York 10001 36061\n",
"10122 Post Office or large volume customer New York 10001 36061\n",
"10123 Post Office or large volume customer New York 10001 36061"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Merge ZIP Codes with ZCTAs for local area\n",
"zip2zcta = pd.merge(uszips[['ZIP_CODE','ZIP_TYPE','PO_NAME','ZCTA']],zcta[['zcta5','county14']],how='right', \n",
" left_on='ZCTA', right_on='zcta5').set_index('ZIP_CODE')\n",
"zip2zcta.drop(columns=['ZCTA'],inplace=True)\n",
"zip2zcta.head()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(313, 4)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#ZIP Codes in local area\n",
"zip2zcta.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## API Call and Processing\n",
"Request the ZBP data from the Census Bureau for the state, join the ZBP data to the local ZCTA to ZIP table based on ZIP Code, and group the data by ZCTA "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"with open(keyfile) as key:\n",
" api_key=key.read().strip()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'https://api.census.gov/data/2018/zbp'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"base_url = f'https://api.census.gov/data/{year}/{dsource}'\n",
"base_url"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ZBP Employment Data\n",
"This data is requested in a series of chunks which contain multiple ZIP Codes - do not rerun the requests block if retrieval is successful but subsequent notebook changes are needed. Proceed to the next block and pull data from json dump file."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"def chunks(l, n):\n",
" # For item i in a range that is a length of l,\n",
" for i in range(0, len(l), n):\n",
" # Create an index range for l of n items:\n",
" yield l[i:i+n]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of chunks: 7\n"
]
}
],
"source": [
"reqzips=list(chunks(zip2zcta.index.tolist(),48))\n",
"print('Number of chunks:',len(reqzips))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### ***THIS BLOCK IS A REQUESTS BLOCK!*** "
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Retrieved data for chunk 6\n",
"Done - Data dumped to json file\n"
]
}
],
"source": [
"#Code 200 = success, do not rerun this block unless it's necessary\n",
"emp_data=[]\n",
"for i, v in enumerate (reqzips):\n",
" batchzips=','.join(v)\n",
" edata_url = f'{base_url}?get={ecols}&EMPSZES=001&for=zipcode:{batchzips}&key={api_key}'\n",
" response=requests.get(edata_url)\n",
" if response.status_code==200:\n",
" clear_output(wait=True)\n",
" data=response.json()\n",
" if i == 0: \n",
" for record in data:\n",
" emp_data.append(record)\n",
" else:\n",
" for record in data[1:]:\n",
" emp_data.append(record) \n",
" print('Retrieved data for chunk',i)\n",
" else:\n",
" print('***Problem with retrieval***, response code',response.status_code)\n",
" break\n",
"with open(ejsonpath, 'w') as f:\n",
" json.dump(emp_data, f)\n",
"print('Done - Data dumped to json file')"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ESTAB | \n",
" EMP | \n",
" PAYQ1 | \n",
" PAYAN | \n",
"
\n",
" \n",
" zipcode | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 10118 | \n",
" 326 | \n",
" 6358 | \n",
" 202796 | \n",
" 815737 | \n",
"
\n",
" \n",
" 10120 | \n",
" 70 | \n",
" 1378 | \n",
" 41023 | \n",
" 132076 | \n",
"
\n",
" \n",
" 10121 | \n",
" 69 | \n",
" 6104 | \n",
" 199451 | \n",
" 887899 | \n",
"
\n",
" \n",
" 10122 | \n",
" 177 | \n",
" 1917 | \n",
" 40477 | \n",
" 156824 | \n",
"
\n",
" \n",
" 10123 | \n",
" 180 | \n",
" 2036 | \n",
" 51353 | \n",
" 195513 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ESTAB EMP PAYQ1 PAYAN\n",
"zipcode \n",
"10118 326 6358 202796 815737\n",
"10120 70 1378 41023 132076\n",
"10121 69 6104 199451 887899\n",
"10122 177 1917 40477 156824\n",
"10123 180 2036 51353 195513"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"with open(ejsonpath, 'r') as f:\n",
" ejsondata=json.load(f)\n",
"zbpemp=pd.DataFrame(ejsondata[1:], columns=ejsondata[0]).rename(columns={'PAYQTR1':'PAYQ1','PAYANN':'PAYAN','zip code':'zipcode'}).set_index('zipcode')\n",
"for field in zbpemp.columns:\n",
" zbpemp=zbpemp.astype(dtype={field:'int64'})\n",
"zbpemp.drop(columns=['EMPSZES'],inplace=True)\n",
"zbpemp.head()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(266, 4)"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#ZIP Codes retrieved - may differ from zip2zcta as some zips have no businesses\n",
"zbpemp.shape"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ESTAB | \n",
" EMP | \n",
" PAYQ1 | \n",
" PAYAN | \n",
" FLAG_EMP | \n",
" FLAG_PAYQ1 | \n",
" FLAG_PAYAN | \n",
"
\n",
" \n",
" zipcode | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 10118 | \n",
" 326 | \n",
" 6358 | \n",
" 202796 | \n",
" 815737 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 10120 | \n",
" 70 | \n",
" 1378 | \n",
" 41023 | \n",
" 132076 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 10121 | \n",
" 69 | \n",
" 6104 | \n",
" 199451 | \n",
" 887899 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 10122 | \n",
" 177 | \n",
" 1917 | \n",
" 40477 | \n",
" 156824 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 10123 | \n",
" 180 | \n",
" 2036 | \n",
" 51353 | \n",
" 195513 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ESTAB EMP PAYQ1 PAYAN FLAG_EMP FLAG_PAYQ1 FLAG_PAYAN\n",
"zipcode \n",
"10118 326 6358 202796 815737 0 0 0\n",
"10120 70 1378 41023 132076 0 0 0\n",
"10121 69 6104 199451 887899 0 0 0\n",
"10122 177 1917 40477 156824 0 0 0\n",
"10123 180 2036 51353 195513 0 0 0"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Flag columns count the number of establishments for which data is not disclosed\n",
"flags=['FLAG_EMP','FLAG_PAYQ1','FLAG_PAYAN']\n",
"for flagcol in flags:\n",
" datacol=flagcol.split('_')[1]\n",
" zbpemp[flagcol]=0\n",
" zbpemp.loc[zbpemp[datacol] == 0, flagcol] = zbpemp['ESTAB']\n",
"zbpemp.head()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ZIP_TYPE | \n",
" PO_NAME | \n",
" zcta5 | \n",
" county14 | \n",
" ESTAB | \n",
" EMP | \n",
" PAYQ1 | \n",
" PAYAN | \n",
" FLAG_EMP | \n",
" FLAG_PAYQ1 | \n",
" FLAG_PAYAN | \n",
"
\n",
" \n",
" ZIP_CODE | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 10001 | \n",
" Zip Code Area | \n",
" New York | \n",
" 10001 | \n",
" 36061 | \n",
" 7248 | \n",
" 151769 | \n",
" 2717186 | \n",
" 10646611 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 10118 | \n",
" Post Office or large volume customer | \n",
" New York | \n",
" 10001 | \n",
" 36061 | \n",
" 326 | \n",
" 6358 | \n",
" 202796 | \n",
" 815737 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 10120 | \n",
" Post Office or large volume customer | \n",
" New York | \n",
" 10001 | \n",
" 36061 | \n",
" 70 | \n",
" 1378 | \n",
" 41023 | \n",
" 132076 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 10122 | \n",
" Post Office or large volume customer | \n",
" New York | \n",
" 10001 | \n",
" 36061 | \n",
" 177 | \n",
" 1917 | \n",
" 40477 | \n",
" 156824 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 10123 | \n",
" Post Office or large volume customer | \n",
" New York | \n",
" 10001 | \n",
" 36061 | \n",
" 180 | \n",
" 2036 | \n",
" 51353 | \n",
" 195513 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ZIP_TYPE PO_NAME zcta5 county14 \\\n",
"ZIP_CODE \n",
"10001 Zip Code Area New York 10001 36061 \n",
"10118 Post Office or large volume customer New York 10001 36061 \n",
"10120 Post Office or large volume customer New York 10001 36061 \n",
"10122 Post Office or large volume customer New York 10001 36061 \n",
"10123 Post Office or large volume customer New York 10001 36061 \n",
"\n",
" ESTAB EMP PAYQ1 PAYAN FLAG_EMP FLAG_PAYQ1 FLAG_PAYAN \n",
"ZIP_CODE \n",
"10001 7248 151769 2717186 10646611 0 0 0 \n",
"10118 326 6358 202796 815737 0 0 0 \n",
"10120 70 1378 41023 132076 0 0 0 \n",
"10122 177 1917 40477 156824 0 0 0 \n",
"10123 180 2036 51353 195513 0 0 0 "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Join to ZIP ZCTA crosswalk\n",
"zbpemp2zcta = pd.merge(zip2zcta,zbpemp,how='inner',left_index=True,right_index=True)\n",
"zbpemp2zcta.index.name = 'ZIP_CODE'\n",
"zbpemp2zcta.head()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(266, 11)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#ZIP codes in the local area that appear in the ZBP data\n",
"zbpemp2zcta.shape"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ESTAB | \n",
" EMP | \n",
" PAYQ1 | \n",
" PAYAN | \n",
" FLAG_EMP | \n",
" FLAG_PAYQ1 | \n",
" FLAG_PAYAN | \n",
"
\n",
" \n",
" zcta5 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 10001 | \n",
" 8006 | \n",
" 163513 | \n",
" 3053219 | \n",
" 11948301 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 10002 | \n",
" 2962 | \n",
" 22737 | \n",
" 176288 | \n",
" 764720 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 10003 | \n",
" 4273 | \n",
" 98742 | \n",
" 1830369 | \n",
" 6916902 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 10004 | \n",
" 1660 | \n",
" 69025 | \n",
" 2337956 | \n",
" 7349828 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 10005 | \n",
" 1397 | \n",
" 48129 | \n",
" 3043620 | \n",
" 7586070 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ESTAB EMP PAYQ1 PAYAN FLAG_EMP FLAG_PAYQ1 FLAG_PAYAN\n",
"zcta5 \n",
"10001 8006 163513 3053219 11948301 0 0 0\n",
"10002 2962 22737 176288 764720 0 0 0\n",
"10003 4273 98742 1830369 6916902 0 0 0\n",
"10004 1660 69025 2337956 7349828 0 0 0\n",
"10005 1397 48129 3043620 7586070 0 0 0"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Aggregate to ZCTAs\n",
"zctaemp=zbpemp2zcta[['zcta5','ESTAB','EMP','PAYQ1','PAYAN','FLAG_EMP','FLAG_PAYQ1','FLAG_PAYAN']].groupby(['zcta5'])[['ESTAB','EMP','PAYQ1','PAYAN','FLAG_EMP','FLAG_PAYQ1','FLAG_PAYAN']].sum()\n",
"zctaemp.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ZBP Industry Data\n",
"This data must be requested one record at a time. NOTE that this request can take a LONG TIME, up to one hour to complete for approx 300 ZIP Codes. Once the request is finished the data gets dumped into a json file. If the request is successful but subsequent blocks need to be modified, don't rerun the requests block - pull the data from the json file."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"266"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"zipcodes=zbpemp2zcta.index.tolist()\n",
"len(zipcodes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### ***THIS BLOCK IS A REQUESTS BLOCK!*** \n",
"Retrieving approx 300 ZIP Codes takes 1 hour\n",
"\n",
"*NOTE - revise in the future to retrieve chunks of zip codes*"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5586 records have been retrieved for 266 ZIP codes...\n",
"Done\n",
"Data dumped to json file\n"
]
}
],
"source": [
"#If this block is successful but there are subsequent problems, do not rerun it - start from the following block.\n",
"#For industry data, if there are no records for an industry create a blank record with zeros\n",
"n=0\n",
"z=0\n",
"ind_data=[['estab','naics','zipcode']]\n",
"for zcode in zipcodes:\n",
"#for zcode in zipcodes[0:5]:\n",
" clear_output(wait=True)\n",
" for naics in ncodes:\n",
" idata_url = f'{base_url}?get={icols}&NAICS2017={naics}&for=zipcode:{zcode}&key={api_key}'\n",
" try:\n",
" response=requests.get(idata_url)\n",
" except requests.exceptions.RequestException as e:\n",
" print (e)\n",
" break\n",
" if response.status_code==200:\n",
" jsondata=response.json()\n",
" ind_data.append(jsondata[1]) \n",
" n=n+1\n",
" elif response.status_code==204:\n",
" record=['0',naics,zcode]\n",
" ind_data.append(record)\n",
" n=n+1\n",
" else:\n",
" print('Problem retrieving data, status code:',response.status_code)\n",
" break\n",
" z=z+1\n",
" print(n,'records have been retrieved for',z,'ZIP codes...')\n",
"print('Done')\n",
"\n",
"with open(ijsonpath, 'w') as f:\n",
" json.dump(ind_data, f)\n",
"print('Data dumped to json file')"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" estab | \n",
" naics | \n",
" zipcode | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 7248 | \n",
" 00 | \n",
" 10001 | \n",
"
\n",
" \n",
" 1 | \n",
" 0 | \n",
" 11 | \n",
" 10001 | \n",
"
\n",
" \n",
" 2 | \n",
" 0 | \n",
" 21 | \n",
" 10001 | \n",
"
\n",
" \n",
" 3 | \n",
" 5 | \n",
" 22 | \n",
" 10001 | \n",
"
\n",
" \n",
" 4 | \n",
" 262 | \n",
" 23 | \n",
" 10001 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" estab naics zipcode\n",
"0 7248 00 10001\n",
"1 0 11 10001\n",
"2 0 21 10001\n",
"3 5 22 10001\n",
"4 262 23 10001"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"with open(ijsonpath, 'r') as f:\n",
" ijsondata=json.load(f)\n",
"zbpind = pd.DataFrame(ijsondata[1:],columns=ijsondata[0])\n",
"zbpind['estab']=zbpind['estab'].astype('int64')\n",
"zbpind.head()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" naics | \n",
" N00 | \n",
" N11 | \n",
" N21 | \n",
" N22 | \n",
" N23 | \n",
" N31_33 | \n",
" N42 | \n",
" N44_45 | \n",
" N48_49 | \n",
" N51 | \n",
" ... | \n",
" N53 | \n",
" N54 | \n",
" N55 | \n",
" N56 | \n",
" N61 | \n",
" N62 | \n",
" N71 | \n",
" N72 | \n",
" N81 | \n",
" N99 | \n",
"
\n",
" \n",
" zipcode | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 10001 | \n",
" 7248 | \n",
" 0 | \n",
" 0 | \n",
" 5 | \n",
" 262 | \n",
" 188 | \n",
" 1079 | \n",
" 654 | \n",
" 62 | \n",
" 418 | \n",
" ... | \n",
" 500 | \n",
" 1453 | \n",
" 153 | \n",
" 363 | \n",
" 161 | \n",
" 405 | \n",
" 313 | \n",
" 528 | \n",
" 506 | \n",
" 9 | \n",
"
\n",
" \n",
" 10002 | \n",
" 2962 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 100 | \n",
" 60 | \n",
" 167 | \n",
" 520 | \n",
" 46 | \n",
" 66 | \n",
" ... | \n",
" 274 | \n",
" 323 | \n",
" 3 | \n",
" 88 | \n",
" 27 | \n",
" 206 | \n",
" 100 | \n",
" 607 | \n",
" 314 | \n",
" 3 | \n",
"
\n",
" \n",
" 10003 | \n",
" 4268 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 51 | \n",
" 35 | \n",
" 94 | \n",
" 433 | \n",
" 5 | \n",
" 195 | \n",
" ... | \n",
" 372 | \n",
" 672 | \n",
" 25 | \n",
" 120 | \n",
" 75 | \n",
" 342 | \n",
" 561 | \n",
" 658 | \n",
" 504 | \n",
" 8 | \n",
"
\n",
" \n",
" 10004 | \n",
" 1600 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 34 | \n",
" 13 | \n",
" 68 | \n",
" 65 | \n",
" 20 | \n",
" 109 | \n",
" ... | \n",
" 87 | \n",
" 483 | \n",
" 12 | \n",
" 84 | \n",
" 34 | \n",
" 85 | \n",
" 31 | \n",
" 131 | \n",
" 134 | \n",
" 4 | \n",
"
\n",
" \n",
" 10005 | \n",
" 1346 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 26 | \n",
" 6 | \n",
" 39 | \n",
" 43 | \n",
" 11 | \n",
" 72 | \n",
" ... | \n",
" 78 | \n",
" 421 | \n",
" 25 | \n",
" 74 | \n",
" 21 | \n",
" 41 | \n",
" 30 | \n",
" 81 | \n",
" 102 | \n",
" 3 | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 21 columns
\n",
"
"
],
"text/plain": [
"naics N00 N11 N21 N22 N23 N31_33 N42 N44_45 N48_49 N51 ... \\\n",
"zipcode ... \n",
"10001 7248 0 0 5 262 188 1079 654 62 418 ... \n",
"10002 2962 0 0 0 100 60 167 520 46 66 ... \n",
"10003 4268 0 0 0 51 35 94 433 5 195 ... \n",
"10004 1600 0 0 0 34 13 68 65 20 109 ... \n",
"10005 1346 0 0 0 26 6 39 43 11 72 ... \n",
"\n",
"naics N53 N54 N55 N56 N61 N62 N71 N72 N81 N99 \n",
"zipcode \n",
"10001 500 1453 153 363 161 405 313 528 506 9 \n",
"10002 274 323 3 88 27 206 100 607 314 3 \n",
"10003 372 672 25 120 75 342 561 658 504 8 \n",
"10004 87 483 12 84 34 85 31 131 134 4 \n",
"10005 78 421 25 74 21 41 30 81 102 3 \n",
"\n",
"[5 rows x 21 columns]"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Pivot data to move NAICS to columns\n",
"zbpind_tab=zbpind.pivot(index='zipcode', columns='naics', values='estab')\n",
"zbpind_tab=zbpind_tab.add_prefix('N')\n",
"zbpind_tab.rename(columns=lambda x: x.replace('-', '_'),inplace=True)\n",
"zbpind_tab.head()"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" naics | \n",
" N00 | \n",
" N11 | \n",
" N21 | \n",
" N22 | \n",
" N23 | \n",
" N31_33 | \n",
" N42 | \n",
" N44_45 | \n",
" N48_49 | \n",
" N51 | \n",
" ... | \n",
" N54 | \n",
" N55 | \n",
" N56 | \n",
" N61 | \n",
" N62 | \n",
" N71 | \n",
" N72 | \n",
" N81 | \n",
" N99 | \n",
" NXX | \n",
"
\n",
" \n",
" zipcode | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 10001 | \n",
" 7248 | \n",
" 0 | \n",
" 0 | \n",
" 5 | \n",
" 262 | \n",
" 188 | \n",
" 1079 | \n",
" 654 | \n",
" 62 | \n",
" 418 | \n",
" ... | \n",
" 1453 | \n",
" 153 | \n",
" 363 | \n",
" 161 | \n",
" 405 | \n",
" 313 | \n",
" 528 | \n",
" 506 | \n",
" 9 | \n",
" 0 | \n",
"
\n",
" \n",
" 10002 | \n",
" 2962 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 100 | \n",
" 60 | \n",
" 167 | \n",
" 520 | \n",
" 46 | \n",
" 66 | \n",
" ... | \n",
" 323 | \n",
" 3 | \n",
" 88 | \n",
" 27 | \n",
" 206 | \n",
" 100 | \n",
" 607 | \n",
" 314 | \n",
" 3 | \n",
" 2 | \n",
"
\n",
" \n",
" 10003 | \n",
" 4268 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 51 | \n",
" 35 | \n",
" 94 | \n",
" 433 | \n",
" 5 | \n",
" 195 | \n",
" ... | \n",
" 672 | \n",
" 25 | \n",
" 120 | \n",
" 75 | \n",
" 342 | \n",
" 561 | \n",
" 658 | \n",
" 504 | \n",
" 8 | \n",
" 2 | \n",
"
\n",
" \n",
" 10004 | \n",
" 1600 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 34 | \n",
" 13 | \n",
" 68 | \n",
" 65 | \n",
" 20 | \n",
" 109 | \n",
" ... | \n",
" 483 | \n",
" 12 | \n",
" 84 | \n",
" 34 | \n",
" 85 | \n",
" 31 | \n",
" 131 | \n",
" 134 | \n",
" 4 | \n",
" 1 | \n",
"
\n",
" \n",
" 10005 | \n",
" 1346 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 26 | \n",
" 6 | \n",
" 39 | \n",
" 43 | \n",
" 11 | \n",
" 72 | \n",
" ... | \n",
" 421 | \n",
" 25 | \n",
" 74 | \n",
" 21 | \n",
" 41 | \n",
" 30 | \n",
" 81 | \n",
" 102 | \n",
" 3 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 22 columns
\n",
"
"
],
"text/plain": [
"naics N00 N11 N21 N22 N23 N31_33 N42 N44_45 N48_49 N51 ... \\\n",
"zipcode ... \n",
"10001 7248 0 0 5 262 188 1079 654 62 418 ... \n",
"10002 2962 0 0 0 100 60 167 520 46 66 ... \n",
"10003 4268 0 0 0 51 35 94 433 5 195 ... \n",
"10004 1600 0 0 0 34 13 68 65 20 109 ... \n",
"10005 1346 0 0 0 26 6 39 43 11 72 ... \n",
"\n",
"naics N54 N55 N56 N61 N62 N71 N72 N81 N99 NXX \n",
"zipcode \n",
"10001 1453 153 363 161 405 313 528 506 9 0 \n",
"10002 323 3 88 27 206 100 607 314 3 2 \n",
"10003 672 25 120 75 342 561 658 504 8 2 \n",
"10004 483 12 84 34 85 31 131 134 4 1 \n",
"10005 421 25 74 21 41 30 81 102 3 0 \n",
"\n",
"[5 rows x 22 columns]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Create column to summarize businesses that were not disclosed and categorized, but that are included in the total\n",
"zbpind_tab['NXX']=zbpind_tab.loc[:,'N00'].subtract(zbpind_tab.loc[:,'N11':'N99'].sum(axis=1))\n",
"zbpind_tab.head()"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" zcta5 | \n",
" N00 | \n",
" N11 | \n",
" N21 | \n",
" N22 | \n",
" N23 | \n",
" N31_33 | \n",
" N42 | \n",
" N44_45 | \n",
" N48_49 | \n",
" ... | \n",
" N54 | \n",
" N55 | \n",
" N56 | \n",
" N61 | \n",
" N62 | \n",
" N71 | \n",
" N72 | \n",
" N81 | \n",
" N99 | \n",
" NXX | \n",
"
\n",
" \n",
" ZIP_CODE | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 10001 | \n",
" 10001 | \n",
" 7248 | \n",
" 0 | \n",
" 0 | \n",
" 5 | \n",
" 262 | \n",
" 188 | \n",
" 1079 | \n",
" 654 | \n",
" 62 | \n",
" ... | \n",
" 1453 | \n",
" 153 | \n",
" 363 | \n",
" 161 | \n",
" 405 | \n",
" 313 | \n",
" 528 | \n",
" 506 | \n",
" 9 | \n",
" 0 | \n",
"
\n",
" \n",
" 10118 | \n",
" 10001 | \n",
" 326 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 8 | \n",
" 0 | \n",
" 45 | \n",
" 10 | \n",
" 5 | \n",
" ... | \n",
" 103 | \n",
" 12 | \n",
" 17 | \n",
" 0 | \n",
" 8 | \n",
" 29 | \n",
" 11 | \n",
" 23 | \n",
" 0 | \n",
" 5 | \n",
"
\n",
" \n",
" 10120 | \n",
" 10001 | \n",
" 70 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 8 | \n",
" 8 | \n",
" 0 | \n",
" ... | \n",
" 18 | \n",
" 10 | \n",
" 0 | \n",
" 0 | \n",
" 3 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 11 | \n",
"
\n",
" \n",
" 10122 | \n",
" 10001 | \n",
" 177 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 10 | \n",
" 3 | \n",
" 0 | \n",
" ... | \n",
" 45 | \n",
" 0 | \n",
" 7 | \n",
" 0 | \n",
" 5 | \n",
" 53 | \n",
" 0 | \n",
" 10 | \n",
" 0 | \n",
" 3 | \n",
"
\n",
" \n",
" 10123 | \n",
" 10001 | \n",
" 180 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 7 | \n",
" 0 | \n",
" 14 | \n",
" 4 | \n",
" 0 | \n",
" ... | \n",
" 77 | \n",
" 0 | \n",
" 13 | \n",
" 7 | \n",
" 11 | \n",
" 3 | \n",
" 0 | \n",
" 8 | \n",
" 0 | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 23 columns
\n",
"
"
],
"text/plain": [
" zcta5 N00 N11 N21 N22 N23 N31_33 N42 N44_45 N48_49 ... \\\n",
"ZIP_CODE ... \n",
"10001 10001 7248 0 0 5 262 188 1079 654 62 ... \n",
"10118 10001 326 0 0 0 8 0 45 10 5 ... \n",
"10120 10001 70 0 0 0 0 0 8 8 0 ... \n",
"10122 10001 177 0 0 0 0 0 10 3 0 ... \n",
"10123 10001 180 0 0 0 7 0 14 4 0 ... \n",
"\n",
" N54 N55 N56 N61 N62 N71 N72 N81 N99 NXX \n",
"ZIP_CODE \n",
"10001 1453 153 363 161 405 313 528 506 9 0 \n",
"10118 103 12 17 0 8 29 11 23 0 5 \n",
"10120 18 10 0 0 3 0 0 0 0 11 \n",
"10122 45 0 7 0 5 53 0 10 0 3 \n",
"10123 77 0 13 7 11 3 0 8 0 6 \n",
"\n",
"[5 rows x 23 columns]"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Join to ZIP ZCTA crosswalk\n",
"zbpind2zcta = pd.merge(zip2zcta[['zcta5']],zbpind_tab,how='inner',left_index=True,right_index=True)\n",
"zbpind2zcta.index.name = 'ZIP_CODE'\n",
"zbpind2zcta.head()"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(266, 23)"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"zbpind2zcta.shape"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" N00 | \n",
" N11 | \n",
" N21 | \n",
" N22 | \n",
" N23 | \n",
" N31_33 | \n",
" N42 | \n",
" N44_45 | \n",
" N48_49 | \n",
" N51 | \n",
" ... | \n",
" N54 | \n",
" N55 | \n",
" N56 | \n",
" N61 | \n",
" N62 | \n",
" N71 | \n",
" N72 | \n",
" N81 | \n",
" N99 | \n",
" NXX | \n",
"
\n",
" \n",
" zcta5 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 10001 | \n",
" 8006 | \n",
" 0 | \n",
" 0 | \n",
" 5 | \n",
" 277 | \n",
" 188 | \n",
" 1156 | \n",
" 679 | \n",
" 67 | \n",
" 450 | \n",
" ... | \n",
" 1696 | \n",
" 175 | \n",
" 400 | \n",
" 168 | \n",
" 432 | \n",
" 398 | \n",
" 539 | \n",
" 547 | \n",
" 9 | \n",
" 30 | \n",
"
\n",
" \n",
" 10002 | \n",
" 2962 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 100 | \n",
" 60 | \n",
" 167 | \n",
" 520 | \n",
" 46 | \n",
" 66 | \n",
" ... | \n",
" 323 | \n",
" 3 | \n",
" 88 | \n",
" 27 | \n",
" 206 | \n",
" 100 | \n",
" 607 | \n",
" 314 | \n",
" 3 | \n",
" 2 | \n",
"
\n",
" \n",
" 10003 | \n",
" 4273 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 51 | \n",
" 35 | \n",
" 94 | \n",
" 433 | \n",
" 5 | \n",
" 195 | \n",
" ... | \n",
" 672 | \n",
" 25 | \n",
" 120 | \n",
" 75 | \n",
" 342 | \n",
" 561 | \n",
" 658 | \n",
" 504 | \n",
" 8 | \n",
" 7 | \n",
"
\n",
" \n",
" 10004 | \n",
" 1660 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 34 | \n",
" 13 | \n",
" 68 | \n",
" 65 | \n",
" 20 | \n",
" 126 | \n",
" ... | \n",
" 491 | \n",
" 12 | \n",
" 88 | \n",
" 34 | \n",
" 85 | \n",
" 31 | \n",
" 131 | \n",
" 137 | \n",
" 4 | \n",
" 12 | \n",
"
\n",
" \n",
" 10005 | \n",
" 1397 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 26 | \n",
" 6 | \n",
" 39 | \n",
" 43 | \n",
" 11 | \n",
" 72 | \n",
" ... | \n",
" 421 | \n",
" 25 | \n",
" 74 | \n",
" 21 | \n",
" 41 | \n",
" 30 | \n",
" 81 | \n",
" 102 | \n",
" 3 | \n",
" 13 | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 22 columns
\n",
"
"
],
"text/plain": [
" N00 N11 N21 N22 N23 N31_33 N42 N44_45 N48_49 N51 ... N54 \\\n",
"zcta5 ... \n",
"10001 8006 0 0 5 277 188 1156 679 67 450 ... 1696 \n",
"10002 2962 0 0 0 100 60 167 520 46 66 ... 323 \n",
"10003 4273 0 0 0 51 35 94 433 5 195 ... 672 \n",
"10004 1660 0 0 0 34 13 68 65 20 126 ... 491 \n",
"10005 1397 0 0 0 26 6 39 43 11 72 ... 421 \n",
"\n",
" N55 N56 N61 N62 N71 N72 N81 N99 NXX \n",
"zcta5 \n",
"10001 175 400 168 432 398 539 547 9 30 \n",
"10002 3 88 27 206 100 607 314 3 2 \n",
"10003 25 120 75 342 561 658 504 8 7 \n",
"10004 12 88 34 85 31 131 137 4 12 \n",
"10005 25 74 21 41 30 81 102 3 13 \n",
"\n",
"[5 rows x 22 columns]"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Aggregate to ZCTAs\n",
"zctaind=zbpind2zcta.groupby(['zcta5']).sum()\n",
"zctaind.head()"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" N00 | \n",
" N11 | \n",
" N21 | \n",
" N22 | \n",
" N23 | \n",
" N31_33 | \n",
" N42 | \n",
" N44_45 | \n",
" N48_49 | \n",
" N51 | \n",
" ... | \n",
" N54_PCT | \n",
" N55_PCT | \n",
" N56_PCT | \n",
" N61_PCT | \n",
" N62_PCT | \n",
" N71_PCT | \n",
" N72_PCT | \n",
" N81_PCT | \n",
" N99_PCT | \n",
" NXX_PCT | \n",
"
\n",
" \n",
" zcta5 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 10001 | \n",
" 8006 | \n",
" 0 | \n",
" 0 | \n",
" 5 | \n",
" 277 | \n",
" 188 | \n",
" 1156 | \n",
" 679 | \n",
" 67 | \n",
" 450 | \n",
" ... | \n",
" 21.18 | \n",
" 2.19 | \n",
" 5.00 | \n",
" 2.10 | \n",
" 5.40 | \n",
" 4.97 | \n",
" 6.73 | \n",
" 6.83 | \n",
" 0.11 | \n",
" 0.37 | \n",
"
\n",
" \n",
" 10002 | \n",
" 2962 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 100 | \n",
" 60 | \n",
" 167 | \n",
" 520 | \n",
" 46 | \n",
" 66 | \n",
" ... | \n",
" 10.90 | \n",
" 0.10 | \n",
" 2.97 | \n",
" 0.91 | \n",
" 6.95 | \n",
" 3.38 | \n",
" 20.49 | \n",
" 10.60 | \n",
" 0.10 | \n",
" 0.07 | \n",
"
\n",
" \n",
" 10003 | \n",
" 4273 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 51 | \n",
" 35 | \n",
" 94 | \n",
" 433 | \n",
" 5 | \n",
" 195 | \n",
" ... | \n",
" 15.73 | \n",
" 0.59 | \n",
" 2.81 | \n",
" 1.76 | \n",
" 8.00 | \n",
" 13.13 | \n",
" 15.40 | \n",
" 11.79 | \n",
" 0.19 | \n",
" 0.16 | \n",
"
\n",
" \n",
" 10004 | \n",
" 1660 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 34 | \n",
" 13 | \n",
" 68 | \n",
" 65 | \n",
" 20 | \n",
" 126 | \n",
" ... | \n",
" 29.58 | \n",
" 0.72 | \n",
" 5.30 | \n",
" 2.05 | \n",
" 5.12 | \n",
" 1.87 | \n",
" 7.89 | \n",
" 8.25 | \n",
" 0.24 | \n",
" 0.72 | \n",
"
\n",
" \n",
" 10005 | \n",
" 1397 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 26 | \n",
" 6 | \n",
" 39 | \n",
" 43 | \n",
" 11 | \n",
" 72 | \n",
" ... | \n",
" 30.14 | \n",
" 1.79 | \n",
" 5.30 | \n",
" 1.50 | \n",
" 2.93 | \n",
" 2.15 | \n",
" 5.80 | \n",
" 7.30 | \n",
" 0.21 | \n",
" 0.93 | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 43 columns
\n",
"
"
],
"text/plain": [
" N00 N11 N21 N22 N23 N31_33 N42 N44_45 N48_49 N51 ... \\\n",
"zcta5 ... \n",
"10001 8006 0 0 5 277 188 1156 679 67 450 ... \n",
"10002 2962 0 0 0 100 60 167 520 46 66 ... \n",
"10003 4273 0 0 0 51 35 94 433 5 195 ... \n",
"10004 1660 0 0 0 34 13 68 65 20 126 ... \n",
"10005 1397 0 0 0 26 6 39 43 11 72 ... \n",
"\n",
" N54_PCT N55_PCT N56_PCT N61_PCT N62_PCT N71_PCT N72_PCT N81_PCT \\\n",
"zcta5 \n",
"10001 21.18 2.19 5.00 2.10 5.40 4.97 6.73 6.83 \n",
"10002 10.90 0.10 2.97 0.91 6.95 3.38 20.49 10.60 \n",
"10003 15.73 0.59 2.81 1.76 8.00 13.13 15.40 11.79 \n",
"10004 29.58 0.72 5.30 2.05 5.12 1.87 7.89 8.25 \n",
"10005 30.14 1.79 5.30 1.50 2.93 2.15 5.80 7.30 \n",
"\n",
" N99_PCT NXX_PCT \n",
"zcta5 \n",
"10001 0.11 0.37 \n",
"10002 0.10 0.07 \n",
"10003 0.19 0.16 \n",
"10004 0.24 0.72 \n",
"10005 0.21 0.93 \n",
"\n",
"[5 rows x 43 columns]"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Generate and calculate percent total columns\n",
"ncols=list(zctaind)\n",
"for c in ncols[1:]:\n",
" pct=c+'_PCT'\n",
" zctaind[pct]=((zctaind[c]/zctaind['N00'])*100).round(2)\n",
"zctaind.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### NAICS Codes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### ***THIS BLOCK IS A REQUESTS BLOCK!***"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"codedict={}\n",
"codes_url=f'https://api.census.gov/data/2018/zbp/variables/NAICS2017.json'\n",
"response=requests.get(codes_url)\n",
"codes_data=response.json()\n",
"codedict.update(codes_data['values']['item'])\n",
"sectordict=dict((k, codedict[k]) for k in ncodes)\n",
"sectordict['XX']='Establishments omitted from classification due to privacy regulations'"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
"
\n",
" \n",
" naics | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 00 | \n",
" Total for all sectors | \n",
"
\n",
" \n",
" 11 | \n",
" Agriculture, forestry, fishing and hunting | \n",
"
\n",
" \n",
" 21 | \n",
" Mining, quarrying, and oil and gas extraction | \n",
"
\n",
" \n",
" 22 | \n",
" Utilities | \n",
"
\n",
" \n",
" 23 | \n",
" Construction | \n",
"
\n",
" \n",
" 31-33 | \n",
" Manufacturing | \n",
"
\n",
" \n",
" 42 | \n",
" Wholesale trade | \n",
"
\n",
" \n",
" 44-45 | \n",
" Retail trade | \n",
"
\n",
" \n",
" 48-49 | \n",
" Transportation and warehousing | \n",
"
\n",
" \n",
" 51 | \n",
" Information | \n",
"
\n",
" \n",
" 52 | \n",
" Finance and insurance | \n",
"
\n",
" \n",
" 53 | \n",
" Real estate and rental and leasing | \n",
"
\n",
" \n",
" 54 | \n",
" Professional, scientific, and technical services | \n",
"
\n",
" \n",
" 55 | \n",
" Management of companies and enterprises | \n",
"
\n",
" \n",
" 56 | \n",
" Administrative and support and waste managemen... | \n",
"
\n",
" \n",
" 61 | \n",
" Educational services | \n",
"
\n",
" \n",
" 62 | \n",
" Health care and social assistance | \n",
"
\n",
" \n",
" 71 | \n",
" Arts, entertainment, and recreation | \n",
"
\n",
" \n",
" 72 | \n",
" Accommodation and food services | \n",
"
\n",
" \n",
" 81 | \n",
" Other services (except public administration) | \n",
"
\n",
" \n",
" 99 | \n",
" Industries not classified | \n",
"
\n",
" \n",
" XX | \n",
" Establishments omitted from classification due... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name\n",
"naics \n",
"00 Total for all sectors\n",
"11 Agriculture, forestry, fishing and hunting\n",
"21 Mining, quarrying, and oil and gas extraction\n",
"22 Utilities\n",
"23 Construction\n",
"31-33 Manufacturing\n",
"42 Wholesale trade\n",
"44-45 Retail trade\n",
"48-49 Transportation and warehousing\n",
"51 Information\n",
"52 Finance and insurance\n",
"53 Real estate and rental and leasing\n",
"54 Professional, scientific, and technical services\n",
"55 Management of companies and enterprises\n",
"56 Administrative and support and waste managemen...\n",
"61 Educational services\n",
"62 Health care and social assistance\n",
"71 Arts, entertainment, and recreation\n",
"72 Accommodation and food services\n",
"81 Other services (except public administration)\n",
"99 Industries not classified\n",
"XX Establishments omitted from classification due..."
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"codes=pd.DataFrame(list(sectordict.items()), columns=['naics', 'name']).set_index('naics')\n",
"codes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Quality Control Checks"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"True\n"
]
}
],
"source": [
"#Does sum of industries equal industry total?\n",
"indsum=zctaind['N00'].subtract(zctaind.iloc[:,1:22].sum(axis=1))\n",
"if indsum.sum()==0:\n",
" print (True)\n",
"else:\n",
" print(indsum.loc[indsum != 0])"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"True\n"
]
}
],
"source": [
"#Is sum of percent totals approximately 100?\n",
"ptotal=zctaind.iloc[:,22:].sum(axis=1)\n",
"if ptotal.loc[(ptotal <= 99.05) | (ptotal >= 100.05)].empty:\n",
" print(True)\n",
"else:\n",
" print(ptotal.loc[(ptotal <= 99.05) | (ptotal >= 100.05)])\n"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"True\n"
]
}
],
"source": [
"#Do number of ZCTAs in employment table match the industries table?\n",
"ecount=zctaemp.shape[0]\n",
"icount=zctaind.shape[0]\n",
"if ecount == icount:\n",
" print (True)\n",
"else:\n",
" print('Mistmatched count between employment',ecount, 'rows and industry',icount, 'rows')"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"True\n"
]
}
],
"source": [
"#Does sum of estabslishments from employment table equal establishments in industries table?\n",
"estsum=zctaemp['ESTAB'].subtract(zctaind['N00'])\n",
"if estsum.sum()==0:\n",
" print (True)\n",
"else:\n",
" print(estsum.loc[estsum != 0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Write to Database "
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"con = sqlite3.connect(dbname) \n",
"cur = con.cursor()"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Employment table\n",
"cur.execute('DROP TABLE IF EXISTS {};'.format(emptable))\n",
"qcreate_emptab=\"\"\"\n",
"CREATE TABLE {}(\n",
"zcta5 TEXT NOT NULL PRIMARY KEY,\n",
"estab INTEGER,\n",
"emp INTEGER,\n",
"payq1 INTEGER,\n",
"payan INTEGER,\n",
"flag_emp INTEGER,\n",
"flag_payq1 INTEGER,\n",
"flag_payan INTEGER);\n",
"\"\"\".format(emptable)\n",
"\n",
"cur.execute(qcreate_emptab)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"#Don't run this block unless you've run the previous one\n",
"zctaemp.to_sql(name='{}'.format(emptable), if_exists='append', index=True, con=con)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"212 records written to zbp2018emp\n"
]
}
],
"source": [
"cur.execute('SELECT COUNT(*) FROM {};'.format(emptable))\n",
"rows = cur.fetchone()\n",
"print(rows[0], 'records written to', emptable)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 records updated for EMP\n",
"0 records updated for PAYQ1\n",
"0 records updated for PAYAN\n",
"212 records updated for FLAG_EMP\n",
"212 records updated for FLAG_PAYQ1\n",
"212 records updated for FLAG_PAYAN\n"
]
}
],
"source": [
"#Replace zeros with nulls, as these values really represent no data\n",
"for col in zctaemp.columns[1:]:\n",
" qupdate='UPDATE {} SET {} = NULL WHERE {} = 0;'.format(emptable,col,col)\n",
" cur.execute(qupdate)\n",
" print(cur.rowcount,'records updated for',col)\n",
" con.commit()"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Industry table\n",
"cur.execute('DROP TABLE IF EXISTS {}'.format(indtable))\n",
"qcreate_indtab=\"\"\"\n",
"CREATE TABLE {} (\n",
"zcta5 TEXT NOT NULL PRIMARY KEY, \n",
"N00 INTEGER, \n",
"N11 INTEGER, \n",
"N21 INTEGER, \n",
"N22 INTEGER, \n",
"N23 INTEGER, \n",
"N31_33 INTEGER, \n",
"N42 INTEGER, \n",
"N44_45 INTEGER, \n",
"N48_49 INTEGER, \n",
"N51 INTEGER, \n",
"N52 INTEGER, \n",
"N53 INTEGER, \n",
"N54 INTEGER, \n",
"N55 INTEGER, \n",
"N56 INTEGER, \n",
"N61 INTEGER, \n",
"N62 INTEGER, \n",
"N71 INTEGER, \n",
"N72 INTEGER, \n",
"N81 INTEGER, \n",
"N99 INTEGER,\n",
"NXX INTEGER,\n",
"N11_PCT REAL, \n",
"N21_PCT REAL, \n",
"N22_PCT REAL, \n",
"N23_PCT REAL, \n",
"N31_33_PCT REAL, \n",
"N42_PCT REAL, \n",
"N44_45_PCT REAL, \n",
"N48_49_PCT REAL, \n",
"N51_PCT REAL, \n",
"N52_PCT REAL, \n",
"N53_PCT REAL, \n",
"N54_PCT REAL, \n",
"N55_PCT REAL, \n",
"N56_PCT REAL, \n",
"N61_PCT REAL, \n",
"N62_PCT REAL, \n",
"N71_PCT REAL, \n",
"N72_PCT REAL, \n",
"N81_PCT REAL, \n",
"N99_PCT REAL,\n",
"NXX_PCT REAL);\n",
"\"\"\".format(indtable)\n",
"\n",
"cur.execute(qcreate_indtab)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"#Don't run this block unless you've run the previous one\n",
"zctaind.to_sql(name='{}'.format(indtable), if_exists='append', index=True, con=con)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"212 records written to zbp2018ind\n"
]
}
],
"source": [
"cur.execute('SELECT COUNT(*) FROM {};'.format(indtable))\n",
"rows = cur.fetchone()\n",
"print(rows[0], 'records written to', indtable)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"212 records updated for N11_PCT\n",
"211 records updated for N21_PCT\n",
"201 records updated for N22_PCT\n",
"27 records updated for N23_PCT\n",
"58 records updated for N31_33_PCT\n",
"26 records updated for N42_PCT\n",
"18 records updated for N44_45_PCT\n",
"35 records updated for N48_49_PCT\n",
"39 records updated for N51_PCT\n",
"12 records updated for N52_PCT\n",
"10 records updated for N53_PCT\n",
"7 records updated for N54_PCT\n",
"140 records updated for N55_PCT\n",
"12 records updated for N56_PCT\n",
"39 records updated for N61_PCT\n",
"16 records updated for N62_PCT\n",
"46 records updated for N71_PCT\n",
"17 records updated for N72_PCT\n",
"8 records updated for N81_PCT\n",
"158 records updated for N99_PCT\n",
"19 records updated for NXX_PCT\n"
]
}
],
"source": [
"#For percentages, replace zeros with nulls, as these values really represent no data\n",
"for col in zctaind.columns[22:]:\n",
" qupdate='UPDATE {} SET {} = NULL WHERE {} = 0.0;'.format(indtable,col,col)\n",
" cur.execute(qupdate)\n",
" print(cur.rowcount,'records updated for',col)\n",
" con.commit()"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"193 records updated for N11\n",
"192 records updated for N21\n",
"184 records updated for N22\n",
"27 records updated for N23\n",
"58 records updated for N31_33\n",
"26 records updated for N42\n",
"18 records updated for N44_45\n",
"35 records updated for N48_49\n",
"39 records updated for N51\n",
"12 records updated for N52\n",
"10 records updated for N53\n",
"7 records updated for N54\n",
"132 records updated for N55\n",
"12 records updated for N56\n",
"39 records updated for N61\n",
"16 records updated for N62\n",
"46 records updated for N71\n",
"17 records updated for N72\n",
"8 records updated for N81\n",
"147 records updated for N99\n",
"0 records updated for NXX\n"
]
}
],
"source": [
"#For establishments, replace zeros with nulls unless establishments were omitted from classification\n",
"for col in zctaind.columns[1:22]:\n",
" qupdate='UPDATE {} SET {} = NULL WHERE {} = 0 AND NXX !=0;'.format(indtable,col,col)\n",
" cur.execute(qupdate)\n",
" print(cur.rowcount,'records updated for',col)\n",
" con.commit()"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#NAICS code table\n",
"cur.execute('DROP TABLE IF EXISTS {};'.format(codetable))\n",
"qcreate_codetab=\"\"\"\n",
"CREATE TABLE {}(\n",
"naics TEXT NOT NULL PRIMARY KEY,\n",
"name TEXT);\n",
"\"\"\".format(codetable)\n",
"\n",
"cur.execute(qcreate_codetab)"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"#Don't run this block unless you've run the previous one\n",
"codes.to_sql(name='{}'.format(codetable), if_exists='append', index=True, con=con)"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"22 records written to zbp2018indcodes\n"
]
}
],
"source": [
"cur.execute('SELECT COUNT(*) FROM {};'.format(codetable))\n",
"rows = cur.fetchone()\n",
"print(rows[0], 'records written to', codetable)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [],
"source": [
"con.close()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}