{ "cells": [ { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt, seaborn as sn, mpld3\n", "import pandas as pd, imp, glob, os, numpy as np\n", "from sqlalchemy import create_engine\n", "sn.set_context('notebook')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# TOC trends 2015: database clean-up (part 3)\n", "\n", "Back from holiday and trying to remember where I'd got to with the ICPW database. John has sent detailed replies to my questions regarding the US sites (see e-mails received 25/08/2016 at 01:45 and 03:33), so this seems like a good place to start.\n", "\n", "## 1. Checking US sites\n", "\n", "My initial clean-up of RESA2 ended with 72 sites linked to the `ICPW_TOCTRENDS_2015_US_LTM` project (see section2 of [this notebook](http://nbviewer.jupyter.org/url/www.googledrive.com/host/0BximeC_RweaeUy1jd2k3Nm1kdms/toc_trends_2015_data_cleaning.ipynb) for details) and, following advice from Tore, I also shifted 19 US sites into a new project called `ICPW_TOCTRENDS_2015_US_LTM_EXCLUDED`. In addition, manually checking RESA2 identifies 95 sites associated with the `ICPWaters US` project. \n", "\n", "Based on the information in John's emails, some of this needs revisiting.\n", "\n", "### 1.1. Connect to db and check site numbers are as expected" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Import custom functions and connect to db\n", "resa2_basic_path = (r'C:\\Data\\James_Work\\Staff\\Heleen_d_W\\ICP_Waters\\Upload_Template'\n", " r'\\useful_resa2_code.py')\n", "\n", "resa2_basic = imp.load_source('useful_resa2_code', resa2_basic_path)\n", "\n", "engine, conn = resa2_basic.connect_to_resa2()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1679 95\n", "3870 72\n", "4150 19\n" ] } ], "source": [ "# Check numbers of sites\n", "proj_codes = [1679, # ICPWaters US\n", " 3870, # ICPW_TOCTRENDS_2015_US_LTM\n", " 4150] # ICPW_TOCTRENDS_2015_US_LTM_EXCLUDED\n", "\n", "for code in proj_codes:\n", " # Read stations list\n", " sql = ('SELECT * '\n", " 'FROM resa2.projects_stations '\n", " 'WHERE project_id = %s' % code)\n", " df = pd.read_sql_query(sql, engine)\n", " \n", " print code, len(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2. Check sites against John's spreadsheet\n", "\n", "Based on John's e-mail, four of the sites that I previously excluded should actually be included in the trends project, so I need to move these back across:\n", "\n", "| Site ID | Site name |\n", "|:--------:|:-----------------------------:|\n", "| 040874O | Brook Trout Lake, Adirondacks |\n", "| 1A1-109O | Moss Lake, Adirondacks |\n", "| 1A1-110O | Lake Rondaxe, Adirondacks |\n", "| ME-4474 | Duck Pond, Maine |\n", "\n", "John has sent a spreadsheet (*U.S.Site.Reconciliation.August.2016.xlsx*) listing which sites should be included in which projects. First, compare the number of sites in John's spreadsheet with the numbers in RESA2." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "John, core ICPW sites: 95\n", "John, DOC trends sites: 76\n" ] } ], "source": [ "# Read John's spreadsheet\n", "in_xls = (r'C:\\Data\\James_Work\\Staff\\Heleen_d_W\\ICP_Waters\\Call_for_Data_2016'\n", " '\\Replies\\usa\\U.S.Site.Reconciliation.August.2016.xlsx')\n", "\n", "j_df = pd.read_excel(in_xls, sheetname='Sheet1')\n", "\n", "print 'John, core ICPW sites: ', len(j_df[j_df['INCLUDE IN ICP WATERS DATABASE']=='YES'])\n", "print 'John, DOC trends sites:', len(j_df[j_df['INCLUDE IN ICP DOC ANALYSIS']=='YES'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This looks promising. With a bit of luck, I can just move the four sites listed above from the `EXCLUDED` project back into `ICPW_TOCTRENDS_2015_US_LTM` and everything will be OK...\n", "\n", "...actually, no - it's not that simple. Of the 95 `ICPWaters US` sites in the spreadsheet, John only has RESA2 codes for 91 of them (four are marked NA). This implies the 95 sites currently in RESA2 are not the same as the 95 in the spreadsheet, so I need to work out where the discrepancies are." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
station_idstation_codestation_namelatitudelongitudeStation CodeNFC_SITEIDNFC_SITENAMELatitudeLongitude
6023647.0US126Little, Winhall, VermontNaNNaNNaNNaNNaNNaNNaN
6923643.0US122Cow Mountain, VermontNaNNaNNaNNaNNaNNaNNaN
7023645.0US124Kettle, VermontNaNNaNNaNNaNNaNNaNNaN
7123620.01C3-075ESouth � Marlboro, Vermont42.8439-72.7125NaNNaNNaNNaNNaN
95NaNNaNNaNNaNNaNNaN1A1-029OMIDDLE POND44.3369-74.3719
96NaNNaNNaNNaNNaNNaN1A2-076OBARNES LAKE43.5667-75.2278
97NaNNaNNaNNaNNaNNaN1C1-106ESUCKER42.8250-73.1292
98NaNNaNNaNNaNNaNNaNME-9997OTUNK LAKE44.5986-68.0592
\n", "
" ], "text/plain": [ " station_id station_code station_name latitude longitude \\\n", "60 23647.0 US126 Little, Winhall, Vermont NaN NaN \n", "69 23643.0 US122 Cow Mountain, Vermont NaN NaN \n", "70 23645.0 US124 Kettle, Vermont NaN NaN \n", "71 23620.0 1C3-075E South � Marlboro, Vermont 42.8439 -72.7125 \n", "95 NaN NaN NaN NaN NaN \n", "96 NaN NaN NaN NaN NaN \n", "97 NaN NaN NaN NaN NaN \n", "98 NaN NaN NaN NaN NaN \n", "\n", " Station Code NFC_SITEID NFC_SITENAME Latitude Longitude \n", "60 NaN NaN NaN NaN NaN \n", "69 NaN NaN NaN NaN NaN \n", "70 NaN NaN NaN NaN NaN \n", "71 NaN NaN NaN NaN NaN \n", "95 NaN 1A1-029O MIDDLE POND 44.3369 -74.3719 \n", "96 NaN 1A2-076O BARNES LAKE 43.5667 -75.2278 \n", "97 NaN 1C1-106E SUCKER 42.8250 -73.1292 \n", "98 NaN ME-9997O TUNK LAKE 44.5986 -68.0592 " ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get the 95 'ICPWaters US' sites from RESA2\n", "sql = ('SELECT * FROM resa2.stations '\n", " 'WHERE station_id IN (SELECT station_id '\n", " 'FROM resa2.projects_stations '\n", " 'WHERE project_id = 1679)')\n", "core_df = pd.read_sql_query(sql, engine)\n", "\n", "# Outer join based on RESA2 station code\n", "df = core_df.merge(j_df[j_df['INCLUDE IN ICP WATERS DATABASE']=='YES'], \n", " how='outer', \n", " left_on='station_code',\n", " right_on='Station Code')\n", "\n", "# Cols of interest\n", "cols = ['station_id', 'station_code', 'station_name', 'latitude', 'longitude',\n", " 'Station Code', 'NFC_SITEID', 'NFC_SITENAME', 'Latitude', 'Longitude']\n", "df = df[cols]\n", "\n", "# Get entries where no match was found\n", "df[pd.isnull(df['station_code']) | pd.isnull(df['Station Code'])]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, there are four stations (rows 60, 69, 70 and 71 in the table above) currently included in the `ICPWaters US` project that should be removed. There are also four stations (rows 95 to 98 above) that are present in John's spreadsheet but not associated with the project. Of these, two of them (Middle Pond and Tunk) are already in the database but associated with the `ICPW_TOCTRENDS_2015_US_LTM` project only, while one of them (Sucker) has been shifted to the `ICPW_TOCTRENDS_2015_US_LTM_EXCLUDED` project. The remaining site (Barnes Lake) does not appear anywhere in the database as far as I can tell.\n", "\n", "Next steps:\n", "\n", " * Remove four sites (Little, Cow Mountain, Kettle and South Marlboro) from the `ICPWaters US` project (and transfer them to `EXCLUDED`)\n", " * Add Middle Pond, Sucker and Tunk Lake to the `ICPWaters US` project\n", " * Create a new site for Barnes Lake and link it to the `ICPWaters US` project\n", " * Move four sites (Brook Trout Lake, Moss Lake, Lake Rondaxe and Duck Pond) from `ICPW_TOCTRENDS_2015_US_LTM_EXCLUDED` to `ICPW_TOCTRENDS_2015_US_LTM`\n", " \n", "These changes are most easily made manually using Access. 
As a quick check, let's make sure there are now 95 sites associated with `ICPWaters US` and 76 sites associated with `ICPW_TOCTRENDS_2015_US_LTM`." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1679 95\n", "3870 76\n" ] } ], "source": [ "# Check numbers of sites\n", "proj_codes = [1679, # ICPWaters US\n", " 3870] # ICPW_TOCTRENDS_2015_US_LTM\n", " \n", "for code in proj_codes:\n", " # Read stations list\n", " sql = ('SELECT * '\n", " 'FROM resa2.projects_stations '\n", " 'WHERE project_id = %s' % code)\n", " df = pd.read_sql_query(sql, engine)\n", " \n", " print code, len(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3. The Mystery of Mud Pond\n", "\n", "I wish this section were as exciting as the title makes it sound!\n", "\n", "Section 2 of the [previous notebook](http://nbviewer.jupyter.org/url/www.googledrive.com/host/0BximeC_RweaeUy1jd2k3Nm1kdms/toc_trends_2015_data_cleaning.ipynb) identified two Mud Ponds. The one with RESA2 `station_id=37063` is correct, but the other (`station_id=23709`) has incorrect co-ordinates and is several hundred kilometres from the true location. The incorrect site has become associated with lots of different projects, which means that deleting it is not straightforward. Let's get a list of projects using each of these sites." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Projects using station_id 23709:\n", "\n", " project_id project_name\n", "0 1559 ICP Waters\n", "1 1679 ICPWaters US\n", "2 1901 ICPW-Nitrogen\n", "3 1961 TOCICP_MAINE\n", "4 2240 ICPW_TRENDS\n", "5 3245 ICPW Regional Trend ESO4*\n", "6 3246 ICPW Regional Trend ECA*+EMG*\n", "7 3247 ICPW Regional Trend TOC/DOC\n", "8 3248 ICPW Regional Trend ANC\n", "9 3249 ICPW Regional Trend Alk\n", "10 3250 ICPW Regional Trend H+\n", "11 3251 ICPW Regional Trend ENO3\n", "12 3765 ICPW_TOCTRENDS2006\n", "13 3785 ICPW_TOCTRENDS2006_AM\n", "14 3789 TOCICP-US\n", "15 3870 ICPW_TOCTRENDS_2015_US_LTM\n", "\n", "\n", "Projects using station_id 37063:\n", "\n", " project_id project_name\n", "0 3870 ICPW_TOCTRENDS_2015_US_LTM\n", "\n", "\n" ] } ], "source": [ "# Find projects using the different versions of Mud Pond\n", "for site_id in [23709, 37063]:\n", " sql = ('SELECT project_id, project_name '\n", " 'FROM resa2.projects '\n", " 'WHERE project_id IN (SELECT project_id '\n", " 'FROM resa2.projects_stations '\n", " 'WHERE station_id = %s)' % site_id)\n", " df = pd.read_sql_query(sql, engine)\n", " \n", " print 'Projects using station_id %s:\\n' % site_id\n", " print df\n", " print '\\n'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, the incorrect site has been used in 16 different projects, most (all?) of which are linked to ICP Waters in some way. The only project using the correct site is `ICPW_TOCTRENDS_2015_US_LTM`. Before I can delete the incorrect site, I need to switch all these other projects to use `station_id=37063` instead.\n", "\n", "**NB:** Based on the above output, `ICPW_TOCTRENDS_2015_US_LTM` currently uses *both* Mud Ponds, so when I delete the incorrect one, there will only be 75 sites associated with this project. This means I'm still missing a site somewhere compared to John's spreadsheet. 
**Come back to this**.\n", "\n", "In Access:\n", "\n", " * Delete the link between `station_id=23709` and `ICPW_TOCTRENDS_2015_US_LTM` in the `projects_stations` table.\n", " * For the other 15 projects, switch the `station_id` from `23709` to `37063`.\n", " * Delete all references to site `23709` from the `stations_par_values` table.\n", " \n", "The tricky part is how to deal with the water chemistry results associated with each of these sites." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of samples associated with site 23709: 88\n", "Number of samples associated with site 37063: 107\n", "Number of samples with common dates: 63\n" ] } ], "source": [ "# Read sample info for incorrect site\n", "sql = ('SELECT water_sample_id, station_id, sample_date '\n", " 'FROM resa2.water_samples '\n", " 'WHERE station_id = 23709')\n", "df_23709 = pd.read_sql_query(sql, engine)\n", "\n", "# Read sample info for correct site\n", "sql = ('SELECT water_sample_id, station_id, sample_date '\n", " 'FROM resa2.water_samples '\n", " 'WHERE station_id = 37063')\n", "df_37063 = pd.read_sql_query(sql, engine)\n", "\n", "# Outer join\n", "df = df_23709.merge(df_37063, how='outer',\n", " on='sample_date',\n", " suffixes=['_23709', '_37063'])\n", "\n", "print 'Number of samples associated with site 23709:', len(df_23709)\n", "print 'Number of samples associated with site 37063:', len(df_37063)\n", "print 'Number of samples with common dates: ', len(df.dropna(how='any')) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks as though there is at least some overlap here, but it's not easy to see what's going on without some plots. Let's pick a commonly measured parameter ($Cl^-$ in $mg/l$) and plot the values for each site to see whether they're actually the same." ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Total number of samples in merged series: 132\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Site 23709\n", "sql = ('SELECT sample_id, value '\n", " 'FROM resa2.water_chemistry_values2 '\n", " 'WHERE method_id = 10253 '\n", " 'AND sample_id IN (SELECT water_sample_id '\n", " 'FROM resa2.water_samples '\n", " 'WHERE station_id = 23709)')\n", "df_23709 = pd.read_sql_query(sql, engine)\n", "\n", "# Join dates\n", "df_23709 = df_23709.merge(df, how='left',\n", " left_on='sample_id',\n", " right_on='water_sample_id_23709')\n", "\n", "# Tidy df\n", "df_23709.sort_values(by='sample_date', inplace=True, ascending='True')\n", "df_23709.index = df_23709['sample_date']\n", "df_23709 = df_23709[['value']]\n", "df_23709.columns = ['site_23709']\n", "\n", "# Site 37063\n", "sql = ('SELECT sample_id, value '\n", " 'FROM resa2.water_chemistry_values2 '\n", " 'WHERE method_id = 10253 '\n", " 'AND sample_id IN (SELECT water_sample_id '\n", " 'FROM resa2.water_samples '\n", " 'WHERE station_id = 37063)')\n", "df_37063 = pd.read_sql_query(sql, engine)\n", "\n", "# Join dates\n", "df_37063 = df_37063.merge(df, how='left',\n", " left_on='sample_id',\n", " right_on='water_sample_id_37063')\n", "\n", "# Tidy df\n", "df_37063.sort_values(by='sample_date', inplace=True, ascending='True')\n", "df_37063.index = df_37063['sample_date']\n", "df_37063 = df_37063[['value']]\n", "df_37063.columns = ['site_37063']\n", "\n", "# Merge\n", "merged_df = df_23709.merge(df_37063, how='outer',\n", " left_index=True,\n", " right_index=True)\n", "\n", "print 'Total number of samples in merged series:', len(merged_df)\n", "\n", "# Plot\n", "merged_df.plot(figsize=(12, 8))\n", "plt.title('Chloride concentration in mg/l at two Mud Ponds', fontsize=20)\n", "plt.tight_layout()\n", "mpld3.display()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's pretty clear that this is the same data series (although something weird happens in 1996 - use the pan and zoom tools to see this). However, there appears to be no consistent pattern regarding which samples have been assigned to which site. The 1996 discrepancy seems pretty minor, so the best way to fix this is probably as follows:\n", "\n", " * For dates where samples are recorded at both sites, choose data from site 37063 in preference to those from site 23709.\n", " * On dates where there are samples *only* for site 23709, transfer these to site 37063. \n", "\n", "This should create a single series for site 37063 with 132 water samples. Any remaining samples associated with site 23709 can then be deleted (as they're already duplicated for site 37063) and then, with a bit of luck, I should also be able to remove the \"bad\" Mud Pond from the database without inflicting too much damage elsewhere." ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Identify samples associated ONLY with site 23709\n", "trans_df = df[pd.isnull(df['station_id_37063'])]\n", "\n", "# These need transferring to site 37063\n", "sql = ('UPDATE resa2.water_samples '\n", " 'SET station_id = 37063 '\n", " 'WHERE water_sample_id IN %s' \n", " % str(tuple(trans_df['water_sample_id_23709'].map(int))))\n", "\n", "result = conn.execute(sql)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now plot the results for chloride again to make sure it's worked." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Site 23709\n", "# Get dates\n", "sql = ('SELECT water_sample_id, sample_date '\n", " 'FROM resa2.water_samples '\n", " 'WHERE station_id = 23709')\n", "dates_23709 = pd.read_sql_query(sql, engine)\n", "\n", "sql = ('SELECT sample_id, value '\n", " 'FROM resa2.water_chemistry_values2 '\n", " 'WHERE method_id = 10253 '\n", " 'AND sample_id IN (SELECT water_sample_id '\n", " 'FROM resa2.water_samples '\n", " 'WHERE station_id = 23709)')\n", "df_23709 = pd.read_sql_query(sql, engine)\n", "\n", "# Join dates\n", "df_23709 = df_23709.merge(dates_23709, how='left',\n", " left_on='sample_id',\n", " right_on='water_sample_id')\n", "\n", "# Tidy df\n", "df_23709.sort_values(by='sample_date', inplace=True, ascending='True')\n", "df_23709.index = df_23709['sample_date']\n", "df_23709 = df_23709[['value']]\n", "df_23709.columns = ['site_23709']\n", "\n", "# Site 37063\n", "# Get dates\n", "sql = ('SELECT water_sample_id, sample_date '\n", " 'FROM resa2.water_samples '\n", " 'WHERE station_id = 37063')\n", "dates_37063 = pd.read_sql_query(sql, engine)\n", "\n", "sql = ('SELECT sample_id, value '\n", " 'FROM resa2.water_chemistry_values2 '\n", " 'WHERE method_id = 10253 '\n", " 'AND sample_id IN (SELECT water_sample_id '\n", " 'FROM resa2.water_samples '\n", " 'WHERE station_id = 37063)')\n", "df_37063 = pd.read_sql_query(sql, engine)\n", "\n", "# Join dates\n", "df_37063 = df_37063.merge(dates_37063, how='left',\n", " left_on='sample_id',\n", " right_on='water_sample_id')\n", "\n", "# Tidy df\n", "df_37063.sort_values(by='sample_date', inplace=True, ascending='True')\n", "df_37063.index = df_37063['sample_date']\n", "df_37063 = df_37063[['value']]\n", "df_37063.columns = ['site_37063']\n", "\n", "# Merge\n", "merged_df = df_23709.merge(df_37063, how='outer',\n", " left_index=True,\n", " right_index=True)\n", "\n", "# Plot\n", "merged_df.plot(subplots=True, sharey=True, figsize=(12, 8))\n", "plt.tight_layout()\n", "mpld3.display()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This looks OK, so I can now delete all the remaining records associated with `station_id=23709`...\n", "\n", "...Actually, I can't delete these samples. They're protected by the database constraint `SS_WSID_FK`, which links `water_samples` to `sample_selections`. I'm not sure what the `sample_selections` tables is, and I'm wary of deleting records from tables I don't understand in case it breaks some other component of the wider LabWare-RESA2-NivaBase system.\n", "\n", "For now, I'll leave site 23709 in the database, along with the remaining samples that are associated with it. All i've actually done here is transfer 25 water samples from the early record of station 23709 to station 37063. It won't be difficult to undo these changes if necessary, but it would be good to tidy this up properly. **Come back to this** and find out what the `sample_selections` table actually does.\n", "\n", "### 1.4. The Mystery of Multiple sites\n", "\n", "It seems the Mystery of Mud Pond runs a bit deeper than originally thought. 
Many of the ICPW sites in RESA2 are duplicated, with the following as a typical example:\n", "\n", "| station_id | station_code | station_name | lat | long |\n", "|:----------:|:------------:|:------------------------:|:-------:|:--------:|\n", "| 23661 | US26 | Biscuit Brook, Catskills | 41.9869 | -74.5031 |\n", "| 37014 | X15:1434025 | BISCUIT BROOK | 42.0111 | -74.4147 |\n", "\n", "The first entry (`station_id 23661`) is associated with the \"core\" ICP Waters programme, whereas the second (`station_id 37014`) was added later as part of the trend analysis project. Despite the slight differences in co-ordinates, **these are actually the same site**.\n", "\n", "Adding stations twice is a little dangerous in a relational database, as it breaks the principle of database normalisation. Nevertheless, as Tore points out in his e-mail of 01/09/2016 at 08:51, it is sometimes justified as the most convenient thing to do. There are some potential problems with this, though. For example, when John and Don were here in May, they extracted site data for both the `ICPWaters US` and `ICPW_TOCTRENDS_2015_US_LTM` projects and were understandably confused by the different site codes and co-ordinates. More importantly, **when sites are duplicated, it's very easy to forget to update both versions when new data comes in**. The result, as illustrated by Mud Pond above, is that we end up with two patchy data records, neither of which tells the whole story by itself. This wouldn't be a problem if the `ICPWaters XX` projects (and the associated sites, e.g. US26 in the table above) were \"archived\" to represent the state of the data at the time of some previous analysis. Any more recent data would then be added to a new project. However, as far as I can tell, this isn't the case: each year new data from the Focal Centres are added to the `ICPWaters XX` sites, while `ICPW_TOCTRENDS_2015_XX` is more of a \"static\" dataset, based on a bulk data upload at a single point in time.\n", "\n", "For the US sites, John has produced a spreadsheet that attempts to sort all this out, by linking the RESA2 codes used by `ICPWaters` to USEPA codes and corrected site properties. The main question now is **how to use this information**.\n", "\n", "Given that the `ICPWaters XX` projects do not represent an accurate archive of the data at some previous time point (because more data has been added subsequently), and considering that many of the latitudes and longitudes for these stations are now known to be incorrect, there doesn't seem to be much justification for keeping the duplicate sites. However, because the water samples associated with the individual sites in each pair are different (see the Mud Pond example above), it's not necessarily straightforward to merge the duplicated datasets. A simpler option would be to keep the sites separate but to correct/update the site properties (names, co-ordinates etc.) based on John's spreadsheet, so at least the duplicated sites are *consistently* duplicated. This would require less effort, but would likely lead to further confusion in the future, and it also means the updated trends analysis would not make use of the full dataset.\n", "\n", "#### 1.4.1. Which US sites are in the database?\n", "\n", "Rather than worrying about which projects have which sites, a useful preliminary step will be to simply check which of the sites in John's spreadsheet are actually in the database, and whether the properties are correct. Even this isn't entirely straightforward, but I've had a go using the code below.
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
station_idstation_codestation_namelatitudelongitude
01101845-601Tennvatn67.59379615.487890
1111604-608�vre Jerpetjern59.6080009.427000
21121101-43Glypstadvatnet58.4872236.188384
31132030-701Serdivatnet69.60607530.374109
4115831-501Brårvatn59.2949217.727118
\n", "
" ], "text/plain": [ " station_id station_code station_name latitude longitude\n", "0 110 1845-601 Tennvatn 67.593796 15.487890\n", "1 111 604-608 �vre Jerpetjern 59.608000 9.427000\n", "2 112 1101-43 Glypstadvatnet 58.487223 6.188384\n", "3 113 2030-701 Serdivatnet 69.606075 30.374109\n", "4 115 831-501 Brårvatn 59.294921 7.727118" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read stations table\n", "sql = ('SELECT station_id, station_code, station_name, latitude, longitude '\n", " 'FROM resa2.stations')\n", "\n", "stn_df = pd.read_sql_query(sql, engine)\n", "stn_df.head()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
station_codestation_namesuggested_nameactivecoretrendnfc_codenfc_namelatitudelongitude
0US24New York, Catskill Mnt., Rondout CreekRondout Creek, CatskillsYESYESYES01364959RONDOUT CREEK41.9367-74.3764
1US25W Br Neversink R At Winnisook, CatskillsWest Branch Neversink River at Winnisook, Cats...YESYESYES01434021W BR NEVERSINK R AT WINNISOOK41.9725-74.4483
2US26Biscuit Brook, CatskillsBiscuit Brook, CatskillsYESYESYES01434025BISCUIT BROOK42.0111-74.4147
3US23New York, Catskill Mnt., E. Branch Neversink, ...East Branch Neversink River, CatskillsYESYESYES0143400680EAST BRANCH NEVERSINK, HEADWAT41.9869-74.5031
4US27Little Hope Pond, AdirondacksLittle Hope Pond, AdirondacksYESYESNO, road salt contamination020058OLITTLE HOPE POND44.5158-74.1252
\n", "
" ], "text/plain": [ " station_code station_name \\\n", "0 US24 New York, Catskill Mnt., Rondout Creek \n", "1 US25 W Br Neversink R At Winnisook, Catskills \n", "2 US26 Biscuit Brook, Catskills \n", "3 US23 New York, Catskill Mnt., E. Branch Neversink, ... \n", "4 US27 Little Hope Pond, Adirondacks \n", "\n", " suggested_name active core \\\n", "0 Rondout Creek, Catskills YES YES \n", "1 West Branch Neversink River at Winnisook, Cats... YES YES \n", "2 Biscuit Brook, Catskills YES YES \n", "3 East Branch Neversink River, Catskills YES YES \n", "4 Little Hope Pond, Adirondacks YES YES \n", "\n", " trend nfc_code nfc_name \\\n", "0 YES 01364959 RONDOUT CREEK \n", "1 YES 01434021 W BR NEVERSINK R AT WINNISOOK \n", "2 YES 01434025 BISCUIT BROOK \n", "3 YES 0143400680 EAST BRANCH NEVERSINK, HEADWAT \n", "4 NO, road salt contamination 020058O LITTLE HOPE POND \n", "\n", " latitude longitude \n", "0 41.9367 -74.3764 \n", "1 41.9725 -74.4483 \n", "2 42.0111 -74.4147 \n", "3 41.9869 -74.5031 \n", "4 44.5158 -74.1252 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get a subset of columns from John's spreadsheet\n", "us_df = j_df[['Station Code', 'Station name', 'Suggested Name',\n", " 'Active', 'INCLUDE IN ICP WATERS DATABASE', \n", " 'INCLUDE IN ICP DOC ANALYSIS', 'NFC_SITEID',\n", " 'NFC_SITENAME', 'Latitude', 'Longitude']]\n", "\n", "# Rename columns\n", "us_df.columns = ['station_code', 'station_name', 'suggested_name',\n", " 'active', 'core', 'trend', 'nfc_code', 'nfc_name', \n", " 'latitude', 'longitude']\n", "\n", "us_df.head()" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Data\\64_Bit_WinPython\\python-2.7.10.amd64\\lib\\site-packages\\ipykernel\\__main__.py:31: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
stn_codecode_presentname_oklat_oklon_ok
0US241FalseFalseFalse
1US251FalseFalseFalse
2US261TrueFalseFalse
3US231FalseFalseFalse
4US271TrueTrueTrue
\n", "
" ], "text/plain": [ " stn_code code_present name_ok lat_ok lon_ok\n", "0 US24 1 False False False\n", "1 US25 1 False False False\n", "2 US26 1 True False False\n", "3 US23 1 False False False\n", "4 US27 1 True True True" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Dict to store results\n", "res_dict = {'stn_code':[],\n", " 'code_present':[],\n", " 'name_ok':[],\n", " 'lat_ok':[],\n", " 'lon_ok':[]}\n", "\n", "# Iterate over spreadsheet\n", "for idx in range(len(us_df)):\n", " # Get stn id\n", " stn_cd = us_df.ix[idx, 'station_code']\n", " name = us_df.ix[idx, 'suggested_name'] # These are John's ideal names\n", " lat = us_df.ix[idx, 'latitude']\n", " lon = us_df.ix[idx, 'longitude']\n", " \n", " # Check stations table\n", " q_res = stn_df.query('station_code == @stn_cd')\n", " \n", " if len(q_res) == 0:\n", " # The site isn't present with this code\n", " res_dict['stn_code'].append(stn_cd)\n", " res_dict['code_present'].append(0)\n", " res_dict['name_ok'].append(np.nan)\n", " res_dict['lat_ok'].append(np.nan)\n", " res_dict['lon_ok'].append(np.nan)\n", " \n", " elif len(q_res) == 1:\n", " # Check the site properties\n", " res_dict['stn_code'].append(stn_cd)\n", " res_dict['code_present'].append(1)\n", " res_dict['name_ok'].append(name == q_res.station_name.values[0])\n", " res_dict['lat_ok'].append(lat == q_res.latitude.values[0])\n", " res_dict['lon_ok'].append(lon == q_res.longitude.values[0])\n", " \n", " else:\n", " # This should be impossible\n", " raise ValueError('Site code %s appears to be duplicated.' % stn_cd)\n", "\n", "# Build df \n", "res_df = pd.DataFrame(res_dict)\n", "res_df = res_df[['stn_code', 'code_present', \n", " 'name_ok', 'lat_ok', 'lon_ok']]\n", "\n", "res_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the Unicode warning here, highlighting that some matches on the `name` column have failed due to issues with special text characters, so the results are probably a worst case scenario. Nevertheless, some of the sites match exactly. Let's see how many matches and partial matches we have." ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total number of records in spreadsheet: 110\n", "Of these...\n", "Number matched by code: 95\n", "Number matched by code with correct lat and lon: 67\n", "Number matched by code with correct name, lat and lon: 58\n" ] } ], "source": [ "# How many matches?\n", "print 'Total number of records in spreadsheet: ', len(res_df)\n", "print 'Of these...'\n", "print 'Number matched by code: ', len(res_df.query('code_present==1'))\n", "print 'Number matched by code with correct lat and lon: ', len(res_df.query('lat_ok==1 & lon_ok==1'))\n", "print 'Number matched by code with correct name, lat and lon:', len(res_df.query('name_ok==1 & lat_ok==1 & lon_ok==1'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The script above uses RESA2 codes taken from John's spreadsheet. For the most part these are the old codes (e.g. `US26`) associated with the original `ICPWaters US` project (rather than the more recent codes, such as `X15:1434025`, used in the trends project). It seems that 58 of these sites have exactly the correct information (name, lat and long), and a further 9 have the correct lat and long, some do not use John's suggested ideal names. 
A further 28 sites can be matched by code, but have incorrect geographic co-ordinates.\n", "\n", "In total, there are 95 matches based on site codes. A further 15 sites (the ones marked `NA` in John's spreadsheet) cannot be matched at all but, of these, only 4 should be included in the core ICPW dataset and only 2 in the trends analysis. These four sites have already been dealt with above (see the start of section 1.2). They *are* in the database with suitable codes - it's just that these codes are not in John's workbook.\n", "\n", "What about the sites created as duplicates for the trend analysis project? How many of those can be matched from John's spreadsheet?" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nfc_codecode_presentname_oklat_oklon_ok
0013649591TrueTrueTrue
1014340211FalseTrueTrue
2014340251TrueTrueTrue
301434006801FalseTrueTrue
4020058O1TrueTrueTrue
\n", "
" ], "text/plain": [ " nfc_code code_present name_ok lat_ok lon_ok\n", "0 01364959 1 True True True\n", "1 01434021 1 False True True\n", "2 01434025 1 True True True\n", "3 0143400680 1 False True True\n", "4 020058O 1 True True True" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Dict to store results\n", "res_dict = {'nfc_code':[],\n", " 'code_present':[],\n", " 'name_ok':[],\n", " 'lat_ok':[],\n", " 'lon_ok':[]}\n", "\n", "# Iterate over spreadsheet\n", "for idx in range(len(us_df)):\n", " # Get stn id\n", " nfc_cd = us_df.ix[idx, 'nfc_code']\n", " name = us_df.ix[idx, 'nfc_name'] # These are John's ideal names\n", " lat = us_df.ix[idx, 'latitude']\n", " lon = us_df.ix[idx, 'longitude']\n", " \n", " # Check stations table. Need to add 'X15:' and allow for variants\n", " q_res = stn_df[(stn_df['station_code']=='X15:%s' % nfc_cd) |\n", " (stn_df['station_code']=='X15:%s' % nfc_cd[1:]) |\n", " (stn_df['station_code']=='X15:%s' % nfc_cd[:-1])]\n", " \n", " if len(q_res) == 0:\n", " # The site isn't present with this code\n", " res_dict['nfc_code'].append(nfc_cd)\n", " res_dict['code_present'].append(0)\n", " res_dict['name_ok'].append(np.nan)\n", " res_dict['lat_ok'].append(np.nan)\n", " res_dict['lon_ok'].append(np.nan)\n", " \n", " elif len(q_res) == 1:\n", " # Check the site properties\n", " res_dict['nfc_code'].append(nfc_cd)\n", " res_dict['code_present'].append(1)\n", " res_dict['name_ok'].append(name == q_res.station_name.values[0])\n", " res_dict['lat_ok'].append(lat == q_res.latitude.values[0])\n", " res_dict['lon_ok'].append(lon == q_res.longitude.values[0])\n", " \n", " else:\n", " # This should be impossible\n", " raise ValueError('Site code %s appears to be duplicated.' % stn_cd)\n", "\n", "# Build df \n", "res_df = pd.DataFrame(res_dict)\n", "res_df = res_df[['nfc_code', 'code_present', \n", " 'name_ok', 'lat_ok', 'lon_ok']]\n", "\n", "res_df.head()" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total number of records in spreadsheet: 110\n", "Of these...\n", "Number matched by code: 89\n", "Number matched by code with correct lat and lon: 74\n", "Number matched by code with correct name, lat and lon: 68\n" ] } ], "source": [ "# How many matches?\n", "print 'Total number of records in spreadsheet: ', len(res_df)\n", "print 'Of these...'\n", "print 'Number matched by code: ', len(res_df.query('code_present==1'))\n", "print 'Number matched by code with correct lat and lon: ', len(res_df.query('lat_ok==1 & lon_ok==1'))\n", "print 'Number matched by code with correct name, lat and lon:', len(res_df.query('name_ok==1 & lat_ok==1 & lon_ok==1'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This approach manages to match most (but not necessarily all) of the US sites using th `X15:` convention, but note that I had to jump through a few hoops to get here: the code above initially tried to find matches just using the raw NFC code with `X15:` prepended to it, but this failed in many cases. Two causes (there may be others) are:\n", "\n", " * Leading zeros on the NFC site codes have been truncated upon adding to RESA. For example, station `01364959` is in the database `X15:1364959`.

\n", " \n", " * Trailing letter codes have been removed (or weren't included originally) upon adding to RESA2. For example, station `1C1-112E` is in the database as `X15:1C1-112`.\n", " \n", "Unfortunately, these kinds of discrepancies makes automated cleaning of the database difficult.\n", "\n", "A final test for the moment is to see whether I've actually managed to find all the stations that John would like including in the trends analysis." ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
station_codestation_namesuggested_nameactivecoretrendnfc_codenfc_namelatitudelongitudecode_presentname_oklat_oklon_ok
65US74Mud Pond, MaineMud Pond, MaineYESYESYES1E1-134EMUD POND44.6330-68.0908NaNNaNNaNNaN
74US82Newbert Pond, MaineNewbert Pond, MaineYESYESYESME-9998ENEWBERT POND44.3364-69.2711NaNNaNNaNNaN
\n", "
" ], "text/plain": [ " station_code station_name suggested_name active core trend \\\n", "65 US74 Mud Pond, Maine Mud Pond, Maine YES YES YES \n", "74 US82 Newbert Pond, Maine Newbert Pond, Maine YES YES YES \n", "\n", " nfc_code nfc_name latitude longitude code_present name_ok lat_ok \\\n", "65 1E1-134E MUD POND 44.6330 -68.0908 NaN NaN NaN \n", "74 ME-9998E NEWBERT POND 44.3364 -69.2711 NaN NaN NaN \n", "\n", " lon_ok \n", "65 NaN \n", "74 NaN " ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Filter dfs\n", "match_df = res_df.query('code_present==1')\n", "trend_df = us_df.query('trend == \"YES\"')\n", "\n", "# Join\n", "trend_df = trend_df.merge(match_df, how='left',\n", " on='nfc_code')\n", "\n", "\n", "# Identify records with no match\n", "trend_df[pd.isnull(trend_df['code_present'])]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So I've manged to match all the sites for the trends work, with two exceptions. The first is (the increasingly troublesome) Mud Pond, which hasn't matched because when I tidied it up above I changed the site code for station `37063` back to the original `US74`, rather than the `X15:` version. The second is Newbert Pond, which is in the database as part of the core ICP Waters programme, but apparently hasn't previously been part of the trends analysis.\n", "\n", "### Summary\n", "\n", "This exploration proves that, with a bit of work, I can identify the sites that John would like including in each project and clean up their properties. I haven't really begun to tackle the problem of actually integrating the data, though - that's going to be much more challenging. \n", "\n", "It's also worth emphasising that the above was only really possible thanks to John's spreadsheet, so doing the same thing for other countries may be more difficult." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }