{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# NC Water Scraper\n", "This notebook consolidates water withdrawal, discharge, and transfer data from NCDEQs Water Withdrawal & Transfer Registry ([link](http://www.ncwater.org/Permits_and_Registration/Water_Withdrawal_and_Transfer_Registration/report)). The sequence of analysis is as follows:\n", "* First data from the registry's front page (link above) is scraped into a dataframe listing Registered Owner, Facility Name, Status, and Facility ID. \n", "* Then, using each entry's facility ID to access its annual report, data are scraped - one facility and one year at a time - to compile a table listing monthly withdrawals, discharges, and transfers for a set of years. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#import libraries\n", "import os,requests\n", "import pandas as pd\n", "from bs4 import BeautifulSoup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1. Create a dataframe of sites in the registry" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Extract the contents of the base web page into a 'soup' object for scraping\n", "baseURL = 'http://www.ncwater.org/Permits_and_Registration/Water_Withdrawal_and_Transfer_Registration/report'\n", "r = requests.get(baseURL)\n", "soup = BeautifulSoup(r.text,'lxml')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following works with the current format of the web page. If it changes, this may have to be revised. Here we select the items in the web form containing the data we want to extract. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Get the main table, identified with the id=main\n", "trTable = soup.find_all(id=\"main\")[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Get the second table contained in the table selected above\n", "dataTable = trTable.find_all('table')[1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Get all rows in the table selected above; these contain the data we want\n", "rows = dataTable.find_all('tr')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Initialize the dataframe that will hold our data\n", "colNames = ['Owner','Name','Status','Code']\n", "dfSites = pd.DataFrame(columns=colNames)\n", "dfSites.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Loop through each row (skipping the first, which contains headers), extracting data into our data frame\n", "for row in rows[1:]:\n", " \n", " #Create a collection of columns for the current row\n", " columns = row.find_all('td')\n", " \n", " #Construct a dictionary of the items we want\n", " dictR = {'Owner':columns[0].string,\n", " 'Name':columns[1].string,\n", " 'Status':columns[2].string,\n", " 'Code':columns[3].find(\"a\")['href'].split(\"/\")[-2]}\n", " \n", " #Append these data to our dataframe\n", " dfSites = dfSites.append(dictR,ignore_index=True)\n", " \n", "dfSites.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Create a folder to hold all the downloads\n", "outFolder = \"NCDEQ\"\n", "if not os.path.exists(outFolder): os.mkdir(outFolder)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Save contents to a file...\n", "dfSites.to_csv(\"NCDEQ/WithdrawalMaster.csv\",index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2. Extract report data for each site" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def unstackTable(dfStacked):\n", " '''\n", " Unstacks monthly tables presented in 2-column formats into a \n", " single column format. For example:\n", " | Jan | Jul | \n", " | Feb | Aug |\n", " | Mar | Sep | \n", " | Apr | Oct | \n", " | May | Nov | \n", " | Jun | Dec | \n", " is converted to a single column with associated data attached. \n", " '''\n", " #Copy the table\n", " df2 = dfStacked.copy(deep=True)\n", " \n", " #Convert the first row to columns, then drop the row\n", " colNames = df2.iloc[0]\n", " df2.columns = colNames\n", " df2.drop(0,inplace=True)\n", "\n", " #Convert two column format to one\n", " df2a = df2.iloc[:,:4]\n", " df2b = df2.iloc[:,4:]\n", " df2 = df2a.append(df2b)\n", "\n", " #Set month to be the index\n", " df2.set_index(\"Month\",inplace=True)\n", "\n", " #Convert data types for columns 2, 3, and 4 (days, avg, max)\n", " df2.iloc[:,0] = df2.iloc[:,0].fillna(0).astype(int)\n", " df2.iloc[:,1] = df2.iloc[:,1].astype(float)\n", " df2.iloc[:,2] = df2.iloc[:,2].astype(float)\n", " \n", " #Return the table\n", " return df2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def ScrapeSite(siteID, year, first=False):\n", " \n", " #--DATA EXTRACTION--\n", " #Construct the URL\n", " siteURL = 'http://www.ncwater.org/Permits_and_Registration/Water_Withdrawal_and_Transfer_Registration/report/view/{0}/{1}'.format(siteID,year)\n", "\n", " #Extract all tables from the URL into a collection of dataframes\n", " dfs = pd.read_html(siteURL,na_values='NaN')\n", "\n", " #Separate tables into labeled variables, unstacking as needed\n", " dfFacility = dfs[2] # Information on the facility\n", " dfWithdrawal = unstackTable(dfs[3]) # Monthly withdrawal data \n", " dfSource = dfs[4] # Information on where water was drawn\n", " dfDischarge = unstackTable(dfs[5]) # Monthly discharge data \n", " dfDischargeMethod = dfs[6] # Information on type and amounts of discharge\n", " dfTransferDescription = dfs[7] # Information on source and destination of transfers\n", " dfTransfer= unstackTable(dfs[8]) # Monthly transfer data\n", "\n", " #Extract facility information into variables\n", " registrant = dfFacility.iloc[0,1]\n", " facility_name = dfFacility.iloc[0,1]\n", " county = dfFacility.iloc[2,1]\n", " subbasin= dfFacility.iloc[2,3]\n", " facility_type = dfFacility.iloc[1,3]\n", " \n", " #--MONTHLY VOLUME DATA----------------------\n", " #Combine monthly withdrawal, discharge, and transfer tables\n", " dfSiteData = pd.concat([dfWithdrawal,dfDischarge,dfTransfer], axis=1).reset_index()\n", "\n", " #Add site information as columns\n", " dfSiteData['SiteID'] = siteID\n", " dfSiteData['Year'] = year\n", " dfSiteData['Registrant'] = registrant\n", " dfSiteData['Facility'] = facility_name\n", " dfSiteData['Type'] = facility_type\n", " dfSiteData['County'] = county\n", " dfSiteData['Subbasin'] = subbasin\n", "\n", " #Rearrange columns\n", " columns = dfSiteData.columns.tolist()[10:] + dfSiteData.columns.tolist()[:10]\n", " dfSiteData = dfSiteData[columns]\n", "\n", " #--WITHDRAWAL INFO--------------------------\n", " dfSource = dfs[4].copy(deep=True)\n", " dfSource.columns = ('Name','Type','AvgDaily','DaysUsed','Capacity_MGD')\n", " dfSource.drop(0,inplace=True)\n", " dfSource.insert(0,'SiteID',siteID)\n", " dfSource.insert(1,'Year',year) \n", " dfSource.insert(2,'FacilityType',facility_type)\n", " dfSource.insert(3,'County',county)\n", " dfSource.insert(4,'Subbasin',subbasin)\n", "\n", " #--DISCHARGE INFO-------------------------\n", " dfDischargeMethod = dfs[6].copy(deep=True)\n", " dfDischargeMethod.columns = ('Permit','Type','AvgDaily','DaysUsed','Capacity_MGD')\n", " dfDischargeMethod.drop(0,inplace=True)\n", " dfDischargeMethod.insert(0,'SiteID',siteID)\n", " dfDischargeMethod.insert(1,'Year',year)\n", " dfDischargeMethod.insert(2,'FacilityType',facility_type)\n", " dfDischargeMethod.insert(3,'County',county)\n", " dfDischargeMethod.insert(4,'Subbasin',subbasin)\n", " \n", " #--TRANSFER INFO------------------------------\n", " dfTransferDescription = dfs[7].copy(deep=True)\n", " dfTransferDescription.columns = ('Description','SourceBasin','ReceivingBasin','Capacity')\n", " dfTransferDescription.drop(0,inplace=True)\n", " dfTransferDescription.insert(0,'SiteID',siteID)\n", " dfTransferDescription.insert(1,'Year',year)\n", " dfTransferDescription.insert(2,'FacilityType',facility_type)\n", " dfTransferDescription.insert(3,'County',county)\n", " dfTransferDescription.insert(4,'Subbasin',subbasin) \n", "\n", " #-WRITE DATA TO OUTPUT FILES------------------\n", " outCSV1 = \"NCDEQ/MonthlyVolumeData.csv\"\n", " outCSV2 = \"NCDEQ/WithdrawalSourceData.csv\"\n", " outCSV3 = \"NCDEQ/DischargeMethods.csv\"\n", " outCSV4 = \"NCDEQ/TransferInfo.csv\"\n", " \n", " #If this is the first table, write to new csv files\n", " if first:\n", " outputType = 'w' #Write to new file\n", " head = True #Include header row\n", " else: \n", " outputType = 'a' #Append to existing file\n", " head = False #Don't include headers\n", " \n", " #Write monthly volume data to new file\n", " with open(outCSV1,outputType) as outFile:\n", " dfSiteData.to_csv(outFile,header=head,index=False)\n", "\n", " #Write source info data to new file \n", " with open(outCSV2, outputType) as outFile:\n", " dfSource.to_csv(outFile,header=head,index=False)\n", "\n", " #Write discharge info data to new file \n", " with open(outCSV3, outputType) as outFile:\n", " dfDischargeMethod.to_csv(outFile,header=head,index=False)\n", "\n", " #Write transfer info data to new file \n", " with open(outCSV4, outputType) as outFile:\n", " dfTransferDescription.to_csv(outFile,header=head,index=False)\n", " \n", " return (outCSV1,outCSV2, outCSV3, outCSV4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Set flag for the first file (to create a new output)\n", "firstFile = True\n", "\n", "#Loop through each site ID and scrape it's data\n", "for index, row in dfSites.iterrows():\n", " \n", " #Skip draft data\n", " if row['Status'] == 'Draft': continue\n", " \n", " #Get the code and loop through years\n", " siteID = row['Code']\n", " print(index,siteID,end=': ')\n", "\n", " #Loop through years 2010 to 2017 and scrape the data\n", " for year in range(2010,2018):\n", " print(year,end='...')\n", " outFiles = ScrapeSite(siteID,year,first=firstFile)\n", " firstFile = False\n", " print() " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }