{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Retrieve State Usage Data\n", "\n", "This notebook downloads water use data for a user specified state and year from the National Water Information System server, and translates these data into physical supply and use table (PSUT), using a preformatted PSUT template and value mapping table. \n", "\n", "Original data are from the USGS state water programs. Below is a sample URL for Louisiana:\n", "`https://waterdata.usgs.gov/la/nwis/water_use?format=rdb&rdb_compression=value&wu_area=County&wu_year=2000%2C2005%2C2010&wu_county=ALL&wu_category=ALL&wu_county_nms=--ALL%2BCounties--&wu_category_nms=--ALL%2BCategories--`\n", "\n", "##### Workflow:\n", "* Get the state and year from the user\n", "* Get the filenames for the PSUT template and value mapping table\n", "* Construct the url and retrieve the data into a pandas data frame\n", "* Melt/gather the usage columns into row values under the column name 'Group'\n", "* Remove rows with no usage data (identified by not having \"Mgal\" in the 'Group' name)\n", "\n", "#### Required 3rd party modules\n", "The following may require installation before running this notebook. These can each be installed using `pip`\n", "* Pandas` ------>pip install pandas`\n", "* OpenPyXl` ---->pip install openpyxl`\n", "---\n", "`Created September 2017`
`John.Fay@duke.edu`\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Begin by specifying the 2-character state code and the year to process" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Specify the state and year to process\n", "state = 'la' #Louisiana\n", "year = 2010" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Next, specify the path to the template PSUT and the value mapping filenames" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Get the input filenames\n", "templateFilename = '../Data/Templates/StatePSUT_Template.xlsx'\n", "remapFilename = '../Data/RemapTables/StatePSUTLookup.csv'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The rest is automated..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### First, we need to import required packages, installing 3rd party packages if required" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Import built-in modules\n", "import sys, os, urllib\n", "from shutil import copyfile" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Import pandas, install if necessary...\n", "try:\n", " import pandas as pd\n", "except:\n", " import pip\n", " pip.main(['install','pandas'])\n", " import pandas as pd" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Import openpyxl, install if necessary...\n", "try:\n", " from openpyxl import load_workbook\n", "except:\n", " import pip\n", " pip.main(['install','openpyxl'])\n", " from openpyxl import load_workbook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Output data will be named with the state code and year (e.g. `la_2010.csv`) and will be saved in the StateData subfolder of the Data directory.\n", "This folder will be created, if it does not exist already. And then the PSUT template will be copied to this location using the output filename. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Create the StateData folder, if not already present\n", "outFolder = '..\\\\Data\\\\StateData'\n", "if os.path.exists(outFolder) == False:\n", " os.mkdir(outFolder)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Output will be written to ..\\Data\\StateData\\la_2010.xlsx\n" ] } ], "source": [ "#Set the output filename, putting it in the output data folder\n", "outFilename = outFolder + os.sep + '{0}_{1}.xlsx'.format(state,year)\n", "print(\"Output will be written to \" + outFilename)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Copy the template to the output filename\n", "copyfile(src=templateFilename,dst=outFilename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Load in the value mapping table as a data frame" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Get the remap table and load as a dataframe\n", "dfRemap = pd.read_csv(remapFilename,dtype='str',index_col=\"Index\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Construct the URL for data retreival and then pull the on-line data to an in-memory data frame named `dfRaw`." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Set the data URL path and parameters and construct the url\n", "path = 'https://waterdata.usgs.gov/{}/nwis/water_use?'.format(state)\n", "values = {'format':'rdb',\n", " 'rdb_compression':'value',\n", " 'wu_area':'County',\n", " 'wu_year': year,\n", " 'wu_county':'ALL',\n", " 'wu_county_nms':'--ALL+Counties--',\n", " 'wu_category_nms':'--ALL+Categories--'\n", " }\n", "url = path + urllib.urlencode(values)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Pull data in using the URL and remove the 2nd row of headers\n", "dfRaw = pd.read_table(url,comment='#',header=[0,1],na_values='-')\n", "dfRaw.columns = dfRaw.columns.droplevel(level=1)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
state_cdstate_namecounty_cdcounty_nmyearTotal Population total population of area, in thousandsPublic Supply population served by groundwater, in thousandsPublic Supply population served by surface water, in thousandsPublic Supply total population served, in thousandsPublic Supply self-supplied groundwater withdrawals, fresh, in Mgal/d...Hydroelectric Power total offstream surface-water withdrawals in Mgal/dHydroelectric Power power generated by instream use, in gigawatt-hoursHydroelectric Power power generated by offstream use, in gigawatt-hoursHydroelectric Power total power generated, in gigawatt-hoursHydroelectric Power number of instream facilitiesHydroelectric Power number of offstream facilitiesHydroelectric Power total number of facilitiesWastewater Treatment returns by public wastewater facilities, in Mgal/dWastewater Treatment number of public wastewater facilitiesWastewater Treatment reclaimed wastewater released by wastewater facilities, in Mgal/d
022Louisiana1Acadia Parish201061.77344.6660.00044.6665.82...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
122Louisiana3Allen Parish201025.76422.5730.00022.5734.27...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
222Louisiana5Ascension Parish2010107.21529.51027.09456.6043.02...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
322Louisiana7Assumption Parish201023.4210.00023.04123.0410.00...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
422Louisiana9Avoyelles Parish201042.07339.7560.00039.7563.85...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "

5 rows × 281 columns

\n", "
" ], "text/plain": [ " state_cd state_name county_cd county_nm year \\\n", "0 22 Louisiana 1 Acadia Parish 2010 \n", "1 22 Louisiana 3 Allen Parish 2010 \n", "2 22 Louisiana 5 Ascension Parish 2010 \n", "3 22 Louisiana 7 Assumption Parish 2010 \n", "4 22 Louisiana 9 Avoyelles Parish 2010 \n", "\n", " Total Population total population of area, in thousands \\\n", "0 61.773 \n", "1 25.764 \n", "2 107.215 \n", "3 23.421 \n", "4 42.073 \n", "\n", " Public Supply population served by groundwater, in thousands \\\n", "0 44.666 \n", "1 22.573 \n", "2 29.510 \n", "3 0.000 \n", "4 39.756 \n", "\n", " Public Supply population served by surface water, in thousands \\\n", "0 0.000 \n", "1 0.000 \n", "2 27.094 \n", "3 23.041 \n", "4 0.000 \n", "\n", " Public Supply total population served, in thousands \\\n", "0 44.666 \n", "1 22.573 \n", "2 56.604 \n", "3 23.041 \n", "4 39.756 \n", "\n", " Public Supply self-supplied groundwater withdrawals, fresh, in Mgal/d \\\n", "0 5.82 \n", "1 4.27 \n", "2 3.02 \n", "3 0.00 \n", "4 3.85 \n", "\n", " ... \\\n", "0 ... \n", "1 ... \n", "2 ... \n", "3 ... \n", "4 ... \n", "\n", " Hydroelectric Power total offstream surface-water withdrawals in Mgal/d \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", " Hydroelectric Power power generated by instream use, in gigawatt-hours \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", " Hydroelectric Power power generated by offstream use, in gigawatt-hours \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", " Hydroelectric Power total power generated, in gigawatt-hours \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", " Hydroelectric Power number of instream facilities \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", " Hydroelectric Power number of offstream facilities \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", " Hydroelectric Power total number of facilities \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", " Wastewater Treatment returns by public wastewater facilities, in Mgal/d \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", " Wastewater Treatment number of public wastewater facilities \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", " Wastewater Treatment reclaimed wastewater released by wastewater facilities, in Mgal/d \n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", "[5 rows x 281 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#CHECK: Display a sample of the retrieved data\n", "dfRaw.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### With the data now held locally, we reformat the table so that each of the usage (and other) columns are \"[melted](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html)\" into a single column such that the original column name is stored in a new field and the value is stored in another. This facilitates subsequent analysis, which includes:\n", " * Removing rows (formerly columns) that don't report volume, e.g. population served values, and\n", " * Ensuring that volume data is a floating point number." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Data transformed from (64, 281) rows/columns to (17664, 7) rows/columns\n" ] } ], "source": [ "#Tidy the data: transform so data in each usage column become row values with a new column listing the usage type\n", "rowHeadings = ['county_cd', 'county_nm', 'state_cd', 'state_name', 'year']\n", "dfTidy = pd.melt(dfRaw,id_vars=rowHeadings,value_name='MGal',var_name='Group')\n", "print(\"Data transformed from {0} rows/columns to {1} rows/columns\".format(dfRaw.shape, dfTidy.shape))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(15040, 7)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Remove rows that don't have volume data (i.e. keep only columns with 'Mgal' in the name)\n", "dfTidy = dfTidy[dfTidy['Group'].str.contains('Mgal')].copy(deep=True)\n", "dfTidy.shape" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Change the type of the MGal column to float \n", "dfTidy['MGal'] = dfTidy.MGal.astype(\"float\")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
county_cdcounty_nmstate_cdstate_nameyearGroupMGal
2561Acadia Parish22Louisiana2010Public Supply self-supplied groundwater withdr...5.82
2573Allen Parish22Louisiana2010Public Supply self-supplied groundwater withdr...4.27
2585Ascension Parish22Louisiana2010Public Supply self-supplied groundwater withdr...3.02
2597Assumption Parish22Louisiana2010Public Supply self-supplied groundwater withdr...0.00
2609Avoyelles Parish22Louisiana2010Public Supply self-supplied groundwater withdr...3.85
\n", "
" ], "text/plain": [ " county_cd county_nm state_cd state_name year \\\n", "256 1 Acadia Parish 22 Louisiana 2010 \n", "257 3 Allen Parish 22 Louisiana 2010 \n", "258 5 Ascension Parish 22 Louisiana 2010 \n", "259 7 Assumption Parish 22 Louisiana 2010 \n", "260 9 Avoyelles Parish 22 Louisiana 2010 \n", "\n", " Group MGal \n", "256 Public Supply self-supplied groundwater withdr... 5.82 \n", "257 Public Supply self-supplied groundwater withdr... 4.27 \n", "258 Public Supply self-supplied groundwater withdr... 3.02 \n", "259 Public Supply self-supplied groundwater withdr... 0.00 \n", "260 Public Supply self-supplied groundwater withdr... 3.85 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#CHECK: Show the structure of the 'tidied' data frame\n", "dfTidy.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Next, we join the value mapping table to the tidied data frame, thus linking the values (in MGal/year) with the address of the PSUT cell in which they should be ported. " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
county_cdcounty_nmstate_cdstate_nameyearGroupMGalColumn1Row1Column2Row2Column3Row3
01Acadia Parish22Louisiana2010Public Supply self-supplied groundwater withdr...5.82N31NaNNaNNaNNaN
13Allen Parish22Louisiana2010Public Supply self-supplied groundwater withdr...4.27N31NaNNaNNaNNaN
25Ascension Parish22Louisiana2010Public Supply self-supplied groundwater withdr...3.02N31NaNNaNNaNNaN
37Assumption Parish22Louisiana2010Public Supply self-supplied groundwater withdr...0.00N31NaNNaNNaNNaN
49Avoyelles Parish22Louisiana2010Public Supply self-supplied groundwater withdr...3.85N31NaNNaNNaNNaN
\n", "
" ], "text/plain": [ " county_cd county_nm state_cd state_name year \\\n", "0 1 Acadia Parish 22 Louisiana 2010 \n", "1 3 Allen Parish 22 Louisiana 2010 \n", "2 5 Ascension Parish 22 Louisiana 2010 \n", "3 7 Assumption Parish 22 Louisiana 2010 \n", "4 9 Avoyelles Parish 22 Louisiana 2010 \n", "\n", " Group MGal Column1 Row1 \\\n", "0 Public Supply self-supplied groundwater withdr... 5.82 N 31 \n", "1 Public Supply self-supplied groundwater withdr... 4.27 N 31 \n", "2 Public Supply self-supplied groundwater withdr... 3.02 N 31 \n", "3 Public Supply self-supplied groundwater withdr... 0.00 N 31 \n", "4 Public Supply self-supplied groundwater withdr... 3.85 N 31 \n", "\n", " Column2 Row2 Column3 Row3 \n", "0 NaN NaN NaN NaN \n", "1 NaN NaN NaN NaN \n", "2 NaN NaN NaN NaN \n", "3 NaN NaN NaN NaN \n", "4 NaN NaN NaN NaN " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Join the remap table\n", "dfAll = pd.merge(dfTidy,dfRemap,how='inner',left_on=\"Group\",right_on=\"Group\")\n", "dfAll.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### With the tables joined, we can open the output worksheet and start porting over values" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "os.path.exists(outFilename)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "#Open the spreadsheet template\n", "wb = load_workbook(filename=outFilename)\n", "ws = wb['template']\n", "ws.title = str(year)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Notes on mapping values in the dataframe to PSUT cells\n", "* As some water usage categories are inserted into more than one cell - some in as many as three - in the PSUT, the value mapping file contains three sets of cell coordinates (i.e. column/row pairs). Thus, we repeat the process of reading the dataframe and mapping values to the PSUT three times. \n", "\n", "* In cases where two or more usage categories map to the same PSUT cell, the values are summed. This is done in the `dfAll.groupby` statement. \n", "\n", "* Where no data exist for a particular usage category, the value `n/a` is inserted. " ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Loop through the first set of row/columns and insert values into the Excel spreadsheet\n", "dfRound1 = dfAll.groupby(['Column1','Row1'])['MGal'].sum()\n", "dfRound1.fillna(value=\"n/a\",inplace=True)\n", "for (row,column), value in dfRound1.iteritems():\n", " #Set the value in the workbook\n", " rv = str(row)+ str(column)\n", " ws[rv] = value\n", "#Save the workbook\n", "wb.save(outFilename)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Loop through the second set of row/columns and insert values into the Excel spreadsheet\n", "dfRound2 = dfAll.groupby(['Column2','Row2'])['MGal'].sum()\n", "dfRound2.fillna(value=\"n/a\",inplace=True)\n", "for (row,column), value in dfRound2.iteritems():\n", " #Set the value in the workbook\n", " rv = str(row)+ str(column)\n", " ws[rv] = value\n", "#Save the workbook\n", "wb.save(outFilename)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Loop through the third set of row/columns and insert values into the Excel spreadsheet\n", "dfRound3 = dfAll.groupby(['Column3','Row3'])['MGal'].sum()\n", "dfRound3.fillna(value=\"n/a\",inplace=True)\n", "for (row,column), value in dfRound3.iteritems():\n", " #print row, column, value\n", " #Set the value in the workbook\n", " rv = str(row)+ str(column)\n", " ws[rv] = value\n", "#Save the workbook\n", "wb.save(outFilename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Open the notebook" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "os.startfile(outFilename)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 2 }