{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Cooling Tower Identification: MODA Data Prep" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Import dependencies\n", "\n", "* This was written in Python 2.7" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import sys as sys\n", "sys.path.append('./code')\n", "import pandas as pd\n", "import numpy as np\n", "import os\n", "\n", "\n", "## available via pip install nyc_geoclient \n", "from nyc_geoclient import Geoclient\n", "\n", "# need API keys from Geoclient, substitute in your own here\n", "# https://developer.cityofnewyork.us/api/geoclient-api\n", "g = Geoclient(os.environ['GEOSUPPORT_ID'],os.environ['GEOSUPPORT_KEY']) \n", "\n", "\n", "## Some geocoding helper functions MODA wrote \n", "from geocoding import GetBBLFromBIN\n", "\n", "%matplotlib inline\n", "\n", "import warnings; warnings.simplefilter('ignore') # for presentation purposes " ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "###Some helper functions\n", "\n", "def Remove_Dummy_BINS(df):\n", " '''\n", " removes illegitimate BINs from input dataframe.\n", " '''\n", " dummy_BINs = [1000000 , 2000000, 3000000, 4000000, 5000000]\n", " df = df[~df.BIN.isin(dummy_BINs)]\n", " return df\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data \n", "\n", "Several of the source datasets are large and impractical to store in with the repository. Their locations are linked below. Getting the data from the provided links will also ensure getting the most up-to-date versions.\n", "\n", "To re-run the code, download these files to the '/input' directory and check that the file-path variables in the code match the actual file name.\n", "\n", "* [Cooling Towers](https://data.cityofnewyork.us/City-Government/Cooling-Tower/miz8-534t)\n", "* [PLUTO](https://data.cityofnewyork.us/City-Government/Primary-Land-Use-Tax-Lot-Output-PLUTO-/xuk2-nczf)\n", "* [PAD](https://data.cityofnewyork.us/City-Government/Property-Address-Directory/bc8t-ecyu)\n", "\n", "*Note: The Cooling Towers dataset linked above was derived by DOITT GIS from planimetrics data; the latest planimetrics images were taken in 2014 and the dataset was published in 2016 (more information on planimetrics can be found in [this GitHub repository](https://github.com/CityOfNewYork/nyc-planimetrics/blob/master/Capture_Rules.md).) The dataset used during the 2015 Cooling Towers activation was not based on this planimetrics data.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tower List" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3204\n", "2035\n" ] } ], "source": [ "\n", "\n", "path_towers = \"../input/Cooling_Towers.csv\" # or another location for the file \n", "\n", "if not os.path.isfile(path_towers):\n", " print \"Please download Cooling Tower Data to /input directory\"\n", "else:\n", " Towers = pd.read_csv(path_towers)\n", "\n", "print len(Towers)\n", "\n", "Towers = Remove_Dummy_BINS(Towers)\n", "\n", "## Drop any duplicate buildings ( a building may have more than one cooling tower)\n", "Towers.drop_duplicates(subset=['BIN'], inplace=True) \n", "\n", "Towers['Identified'] = 1 # noting that these buildings have a cooling tower\n", "\n", "print len(Towers)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### PLUTO" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "reading borough file: BK\n", "277748 rows in BK\n", "277748 rows in total PLUTO data\n", "reading borough file: BX\n", "89963 rows in BX\n", "367711 rows in total PLUTO data\n", "reading borough file: Mn\n", "43231 rows in Mn\n", "410942 rows in total PLUTO data\n", "reading borough file: QN\n", "324630 rows in QN\n", "735572 rows in total PLUTO data\n", "reading borough file: SI\n", "123892 rows in SI\n", "859464 rows in total PLUTO data\n" ] } ], "source": [ "## This is PLUTO data from the relevant timeperiod in 2015, updated versions are available. \n", "path_PLUTO = \"../input/nyc_pluto_15v1\"\n", "\n", "if not os.path.exists(path_PLUTO):\n", " print 'Please download PLUTO data to input/ directory'\n", "\n", "\n", "\n", "boros = ['BK','BX','Mn','QN','SI']\n", "\n", "PLUTO = pd.DataFrame()\n", "\n", "for b in boros:\n", " filename = path_PLUTO + \"/\"+ b + \".csv\"\n", " print 'reading borough file: ' + b\n", " temp = pd.read_csv(filename)\n", " \n", " print str(len(temp)) + ' rows in ' + b\n", " \n", " PLUTO = pd.concat([PLUTO,temp],axis=0)\n", " print str(len(PLUTO)) + ' rows in total PLUTO data'\n", " \n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### PAD File" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## PAD file is also from 2015\n", "path_PAD = \"../input/pad15b/bobaadr.txt\"\n", "\n", "if not os.path.isfile(path_PAD):\n", " print 'Please download PAD to /input directory'\n", "\n", "pad = pd.read_csv(path_PAD)\n", "\n", "pad['BBL'] = (pad['boro'].astype(str) + \\\n", " pad['block'].apply('{0:0>5}'.format) + \\\n", " pad['lot'].apply('{0:0>4}'.format)).astype(float)\n", "\n", "\n", "pad.rename(columns={'bin':'BIN'},inplace=True)\n", "\n", "pad.drop_duplicates(subset = ['BIN','BBL'], inplace=True)\n", "\n", "pad = Remove_Dummy_BINS(pad)\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# using Building Identification Number (BIN) as index, checking to make sure there's no duplicates.\n", "\n", "pad.set_index('BIN',inplace=True,verify_integrity=True)\n", "Towers.set_index('BIN',inplace=True,verify_integrity=True)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Important Note: BIN vs. BBL\n", "\n", "In New York City **a BIN identifies a building** while a **BBL identifies a lot of land**. Some lots have more than one building on them, so one BBL may have multiple associated BINs. \n", "\n", "As the code below shows, we joined a **list of buildings to a list of lots** using BBL. This was a simplifying assumption made in order to make use of as much data as possible as quickly as possible. Two things make this a palatable assumption:\n", "\n", "* ~70% of lots only have one building on them\n", "* Most of the lots with more than one building are residential homes or walkup apartments which are not part of the target population for finding cooling towers.\n", "* When more than one building is on a lot, PLUTO will list the attributes (e.g., number of floors) of the larger building, which is the more likely candidate for a commercial cooling tower\n", "\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## Run this code to test the BIN vs. BBL assumptions from above\n", "\n", "#PLUTO[PLUTO.NumBldgs > 1].BldgClass.value_counts(normalize = True).cumsum().head(10)\n", "#PLUTO.NumBldgs.value_counts(normalize=1)\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2035 buildings in tower data\n", "2035 rows joined to PAD\n", "1929 unique BBLs in tower set\n", "390 tower BBLs not in PLUTO\n" ] } ], "source": [ "print str(len(Towers)) + ' buildings in tower data'\n", "\n", "Towers = Towers.join(pad[['BBL']],how='left')\n", "\n", "print str(len(Towers)) + ' rows joined to PAD'\n", "\n", "Towers.reset_index(inplace=True)\n", "\n", "Towers.drop_duplicates(subset = ['BBL'],inplace=True) # collapsing towers to lot level\n", "\n", "print str(len(Towers)) + ' unique BBLs in tower set'\n", "\n", "\n", "orphan_towers = Towers[~Towers.BBL.isin(PLUTO.BBL)] ## these seem to be mostly condo BBLs\n", "\n", "print str(len(orphan_towers)) + ' tower BBLs not in PLUTO'\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Checking BBLs that don't match PLUTO against Geosupport \n", "\n", "Sometimes identifiers can be inconsistent across datasets and going 'to the source' of DCP's GeoSupport geocoding service is the easiest way to reconcile differences. \n", "\n", "\n", "*Note: Running this geocoding process now may produce results inconsistent with when it was originally run in 2015, since BINs and BBLs may have changed.*" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2118405\n", "2096626\n", "2096857\n", "2116415\n", "2117335\n", "2080263\n", "3393449\n", "3392416\n", "3396842\n", "3388709\n", "3348849\n", "3397489\n", "3058390\n", "3393008\n", "3061628\n", "3061652\n", "3396506\n", "3397446\n", "3391008\n", "3391222\n", "3329418\n", "3054615\n", "3397042\n", "3398702\n", "3388496\n", "3006148\n", "3396696\n", "3346104\n", "3002008\n", "3000259\n", "3392184\n", "3413889\n", "3391746\n", "3397720\n", "3000484\n", "3347736\n", "3392968\n", "3394020\n", "3000274\n", "3398101\n", "3391379\n", "1001531\n", "1002114\n", "1087716\n", "1087847\n", "1087848\n", "1087170\n", "1085867\n", "1085789\n", "1087238\n", "1082634\n", "1087700\n", "1083346\n", "1087842\n", "1000005\n", "1010899\n", "1088864\n", "1012223\n", "1086105\n", "1012238\n", "1067116\n", "1010382\n", "1009746\n", "1088431\n", "1080855\n", "1087241\n", "1088698\n", "1087833\n", "1016065\n", "1089677\n", "1088500\n", "1087725\n", "1016163\n", "1017954\n", "1069344\n", "1007638\n", "1007251\n", "1087488\n", "1002734\n", "1002167\n", "1002125\n", "1002814\n", "1020579\n", "1087667\n", "1088222\n", "1001877\n", "1001883\n", "1001838\n", "1060526\n", "1060788\n", "1087917\n", "1085673\n", "1015937\n", "1017178\n", "1017247\n", "1087746\n", "1027472\n", "1027366\n", "1013039\n", "1013011\n", "1087701\n", "1013854\n", "1087260\n", "1087721\n", "1080359\n", "1088885\n", "1015006\n", "1087767\n", "1087723\n", "1015146\n", "1087989\n", "1085695\n", "1076166\n", "1021925\n", "1016212\n", "1087902\n", "1088910\n", "1016877\n", "1087263\n", "1016884\n", "1089380\n", "1019668\n", "1019951\n", "1016907\n", "1087770\n", "1087726\n", "1087536\n", "1087841\n", "1018549\n", "1018474\n", "1087309\n", "1087537\n", "1080794\n", "1017097\n", "1022051\n", "1020338\n", "1020214\n", "1087674\n", "1035776\n", "1035052\n", "1076272\n", "1087269\n", "1076271\n", "1035078\n", "1038550\n", "1086160\n", "1036086\n", "1027191\n", "1087510\n", "1077844\n", "1070358\n", "1028950\n", "1084564\n", "1028839\n", "1028840\n", "1027074\n", "1087835\n", "1030315\n", "1030329\n", "1087704\n", "1070721\n", "1070689\n", "1070711\n", "1070703\n", "1070696\n", "1088576\n", "1081034\n", "1030352\n", "1086065\n", "1087072\n", "1088786\n", "1087660\n", "1030903\n", "1088301\n", "1087548\n", "1088733\n", "1087691\n", "1081578\n", "1088438\n", "1088432\n", "1088631\n", "1088152\n", "1088481\n", "1032491\n", "1087768\n", "1089333\n", "1087782\n", "1089046\n", "1051486\n", "1086570\n", "1048264\n", "1087920\n", "1045197\n", "1087912\n", "1082303\n", "1087742\n", "1076331\n", "1081307\n", "1086440\n", "1036153\n", "1028827\n", "1088175\n", "1025451\n", "1024900\n", "1070014\n", "1026318\n", "1034856\n", "1076843\n", "1023398\n", "1023760\n", "1024878\n", "1024898\n", "1086135\n", "1025450\n", "1023163\n", "1025211\n", "1087538\n", "1026676\n", "1026728\n", "1035253\n", "1071457\n", "1034537\n", "1071418\n", "1089376\n", "1036469\n", "1087513\n", "1035481\n", "1082746\n", "1038040\n", "1076268\n", "1087455\n", "1035407\n", "1038896\n", "1086068\n", "1071414\n", "1034238\n", "1034205\n", "1034227\n", "1035385\n", "1038761\n", "1038673\n", "1034212\n", "1037550\n", "1038658\n", "1088126\n", "1087672\n", "1038935\n", "1038909\n", "1071989\n", "1040356\n", "1044390\n", "1045271\n", "1045290\n", "1087926\n", "1087313\n", "1043870\n", "1072696\n", "1072694\n", "1023166\n", "1076262\n", "1088760\n", "1087978\n", "1087264\n", "1022706\n", "1022699\n", "1087187\n", "1024757\n", "1076193\n", "1024736\n", "1024771\n", "1076844\n", "1087119\n", "1022578\n", "1022579\n", "1015272\n", "1087186\n", "1088437\n", "1012987\n", "\n", "BUILDING IDENTIFICATION NUMBER (BIN) NOT FOUND\n", "1013558\n", "1015198\n", "1015191\n", "1088533\n", "1088726\n", "1085962\n", "1078988\n", "1078985\n", "1001050\n", "1001026\n", "1001041\n", "1001022\n", "1000794\n", "1088750\n", "1088442\n", "1001068\n", "1087167\n", "1001102\n", "1001193\n", "1001115\n", "1087755\n", "1001249\n", "1001215\n", "1001263\n", "1001199\n", "1075701\n", "1088793\n", "1072630\n", "1076299\n", "1043871\n", "1089700\n", "1040756\n", "1076296\n", "1043780\n", "1087103\n", "1043366\n", "1040766\n", "1041959\n", "1086581\n", "1087494\n", "1037174\n", "1042771\n", "1034520\n", "1014736\n", "1086104\n", "1085936\n", "1082441\n", "1089686\n", "\n", "BUILDING IDENTIFICATION NUMBER (BIN) NOT FOUND\n", "1088577\n", "1001137\n", "1001162\n", "1001292\n", "1087484\n", "1000830\n", "1000827\n", "1001003\n", "1001007\n", "1087459\n", "1087651\n", "1087518\n", "1087715\n", "1087549\n", "1007197\n", "4534816\n", "4536853\n", "1001389\n", "1087485\n", "1087756\n", "1079153\n", "1087951\n", "1001629\n", "1001451\n", "1088224\n", "1001228\n", "1087081\n", "1008455\n", "1083503\n", "1089381\n", "1009706\n", "1045282\n", "1042799\n", "1088565\n", "1056465\n", "1087671\n", "1015487\n", "1023732\n", "1088713\n", "1036458\n", "1087267\n", "1035762\n", "4312614\n", "4534920\n", "4003522\n", "4540105\n", "4312083\n", "4307915\n", "4592106\n", "4539469\n", "4050411\n", "\n", "BUILDING IDENTIFICATION NUMBER (BIN) NOT FOUND\n", "4045397\n", "4533805\n", "4080169\n", "4312087\n", "4538327\n", "4539429\n", "4536823\n", "4540156\n", "4539589\n", "4536915\n", "4014554\n", "4024498\n", "5089354\n" ] } ], "source": [ "orphan_towers_geosupport, errors = GetBBLFromBIN(orphan_towers,'BIN', geoclient_instance=g); # geoclient returns the most up-to-date BBL\n", "\n", "orphan_towers_geosupport[['BBL']] = orphan_towers_geosupport[['BBL']].astype(float)\n", "\n", "Towers.set_index('BIN',inplace=True,verify_integrity=True)\n", "\n", "orphan_towers_geosupport.set_index('BIN',inplace=True,verify_integrity=True)\n", "\n", "Towers.update(orphan_towers_geosupport) # sub in the geosupport BBLs\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4.0 towers not accounted for in PLUTO\n" ] } ], "source": [ "# Join with PLUTO DATA\n", "\n", "Towers.drop_duplicates(subset = ['BBL'],inplace=True) # collapsing towers to lot level\n", "\n", "tower_data = PLUTO.merge(Towers,how='left',on='BBL') \n", "print str(Towers.Identified.sum() - tower_data.Identified.sum()) + ' towers not accounted for in PLUTO'\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating derived variables\n", "\n", "**BC**: first character in building class, e.g., A is single-family home \n", "**BC_num**: second character in building class. Not used. \n", "**Zone**: First character in 'AllZoning1' field. \n", "\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "tower_data['Identified'] = tower_data['Identified'].fillna(0)\n", "\n", "\n", "tower_data['BC'] = tower_data['BldgClass'].str[0]\n", "tower_data['BC_num'] = tower_data['BldgClass'].str[1]\n", "tower_data['Zone'] = tower_data['AllZoning1'].str[0]\n", "\n", "tower_data['Log_BldgArea'] = np.log(tower_data['BldgArea'])\n", "tower_data['Root_BldgArea'] = np.sqrt(tower_data['BldgArea'])\n", "tower_data['Log_NumFloors'] = np.log(tower_data['NumFloors'])\n", "\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# save the file locally\n", "\n", "if not os.path.exists(\"../data\"):\n", " os.makedirs(\"../data\")\n", "\n", "tower_data.to_csv('../data/tower_data.csv')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 1 }