{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Jupyter Supported Interactive Data Processing Workflow \n", "\n", "A product of : \n", "\n", "> [Environmental Sytems Dynamics Laboratory (ESDL)](https://www.esdlberkeley.com/)\n", " University of California, Berkeley \n", "\n", "Authors: \n", "\n", "> > ***Edom Moges1, Liang Zhang 1, Laurel Larsen1, Fernando Perez1, and Lindsey Heagy1***\n", "\n", "1 University of California, Berkeley\n", "\n", "\n", "\n", ">\n", " ==================================================================================\n", " --------------------***********************************************---------------\n", " ==================================================================================" ] }, { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Table of Contents

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Purpose\n", "\n", "Raw hydrometeorological datasets contain errors, gaps and outliers that needs preprocessing. \n", "The objective of this work is developing an interactive data preprocessing platform that enables acquiring and transforming publicly available raw hydrometeorological data to a ready to use  dataset. This interactive platform is at the core of the Comprehensive Hydrologic Observatory SEnsor Network CHOSEN dataset (Zhang et al. 2021 submitted to HP). [CHOSEN](https://gitlab.com/esdl/chosen) provides a multitude of intensively measured hydrometeorological datasets (e.g., snow melt, soil moisture besides the common precipitation, air temperature and streamflow) across 30 watersheds covering the conterminous US. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Technical Contribution\n", "\n", "* Bringing together common data quality control (QC) and missing value filling techniques such as interpolation and regression\n", "* Making data QC and filling missing values interactive to facilitate ease of computation and choosing parameters\n", "* Developing a new missing value filling technique named the Climate Catalog method that leverages similarity in annual hydrological cycles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Methodology\n", "\n", "This notebook starts with a cell that acquires a standard raw hydrometeorological data table and proceeds with cells that perform interactive computation to fill missing values. Three data filling methods are adopted:\n", "1. Interpolation\n", "2. Regression\n", "3. Climate Caltalog\n", "\n", "The details of the methods are described in section 6." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Results\n", "\n", "This notebook presents data preprocessing performed on one of the [CHOSEN](https://gitlab.com/esdl/chosen) datasets, the Dry Creek watershed. Using this notebook the range of interpolation (interplimit), thresholds for potention regression data donors (RegThreshold) and climate catalog thresholds (corrThr and thrLen) are determined interactively. Through these three interactive data preprocessing methods missing values are filled. Having a filled hydrometeorological dataset enables hydrological modeling and other water resources management analysis. \n", "\n", "In summary, this work:\n", "\n", "* introduced a new missing value filling technique (Climate Catalog)\n", "* developed Jupyter based interactive open source data preprocessing tool\n", "* the tool interactively transforms raw hydrometeorological data to a ready to use dataset\n", "* the tool enables further collection and dissemination of datasets such as the CHOSEN data\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Citation\n", "\n", "Edom Moges, Liang Zhang, Laurel Larsen, Lindsey Heagy and Fernando Perez, 2021. EM_v01_Jupyter Supported Interactive Data Processing Workflow. Accessed 05/15/2021 at https://github.com/EMscience/CHOSENDryCreek" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Work in Progress \n", "\n", "* Linking the three steps of filling missing values\n", "* Parallelizing the regression method to enable faster computation\n", "* Releasing the local function as a python package" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Acknowledgements\n", "\n", "This work is supported by the NSF Earth Cube Program under awards 1928406 and 1928374." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Library import " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'E:\\\\My Drive\\\\DryCreek'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# data manuplation\n", "import numpy as np\n", "import pandas as pd\n", "# from scipy import signal\n", "\n", "# visualization\n", "import matplotlib.pyplot as plt\n", "from matplotlib import rcParams\n", "import ipywidgets\n", "\n", "# basic date and file\n", "import datetime as dt\n", "import copy\n", "import os\n", "\n", "\n", "# math and statistics\n", "from pandas.plotting import register_matplotlib_converters\n", "from sklearn import linear_model\n", "from sklearn.metrics import r2_score\n", "from math import sqrt, pi\n", "#import handcalcs.render\n", "\n", "rcParams[\"font.size\"]=14\n", "plt.rcParams.update({'figure.max_open_warning': 0})\n", "register_matplotlib_converters()\n", "os.getcwd()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Local library import" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import sys\n", "sys.path.insert(1, './Functions')\n", "\n", "from Source_QC_Widgets_functions_EM import regressorFunc, funcClimateCatalogWg,widgetInterpolation,widgetRegression,Date_to_float" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Parameter definitions\n", "\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "watershed = 'DryCreek' # name of the example watershed\n", "main_str = 'LG' # name of the main watershed station in focus" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data import\n", "\n", "The input data table needs to follow a standard naming. A column in the input table should have station name, variable name and measurement depths separated by underscore respectively.\n", "- e.g., LS_SoilTemperature_Pit2_2cm - represent a station LS, for soil temperature data, at pit 2 at a depth of 2cm.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TL_DischargeBSG_DischargeLG_DischargeC1E_DischargeC1W_DischargeC2E_DischargeC2M_DischargeBRW_PrecipitationLDP_PrecipitationSCR_Precipitation...LS_SoilTemperature_Pit1_30cmLS_SoilTemperature_Pit2_2cmLS_SoilTemperature_Pit2_15cmLS_SoilTemperature_Pit2_30cmLS_SoilTemperature_Pit3_2cmLS_SoilTemperature_Pit3_15cmLS_SoilTemperature_Pit3_30cmLS_SoilTemperature_Pit4_2cmLS_SoilTemperature_Pit4_15cmLS_SoilTemperature_Pit4_30cm
DateTime
1999-01-01NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
1999-01-02NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "

2 rows × 355 columns

\n", "
" ], "text/plain": [ " TL_Discharge BSG_Discharge LG_Discharge C1E_Discharge \\\n", "DateTime \n", "1999-01-01 NaN NaN NaN NaN \n", "1999-01-02 NaN NaN NaN NaN \n", "\n", " C1W_Discharge C2E_Discharge C2M_Discharge BRW_Precipitation \\\n", "DateTime \n", "1999-01-01 NaN NaN NaN NaN \n", "1999-01-02 NaN NaN NaN NaN \n", "\n", " LDP_Precipitation SCR_Precipitation ... \\\n", "DateTime ... \n", "1999-01-01 NaN NaN ... \n", "1999-01-02 NaN NaN ... \n", "\n", " LS_SoilTemperature_Pit1_30cm LS_SoilTemperature_Pit2_2cm \\\n", "DateTime \n", "1999-01-01 NaN NaN \n", "1999-01-02 NaN NaN \n", "\n", " LS_SoilTemperature_Pit2_15cm LS_SoilTemperature_Pit2_30cm \\\n", "DateTime \n", "1999-01-01 NaN NaN \n", "1999-01-02 NaN NaN \n", "\n", " LS_SoilTemperature_Pit3_2cm LS_SoilTemperature_Pit3_15cm \\\n", "DateTime \n", "1999-01-01 NaN NaN \n", "1999-01-02 NaN NaN \n", "\n", " LS_SoilTemperature_Pit3_30cm LS_SoilTemperature_Pit4_2cm \\\n", "DateTime \n", "1999-01-01 NaN NaN \n", "1999-01-02 NaN NaN \n", "\n", " LS_SoilTemperature_Pit4_15cm LS_SoilTemperature_Pit4_30cm \n", "DateTime \n", "1999-01-01 NaN NaN \n", "1999-01-02 NaN NaN \n", "\n", "[2 rows x 355 columns]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TL_DischargeBSG_DischargeLG_DischargeC1E_DischargeC1W_DischargeC2E_DischargeC2M_DischargeBRW_PrecipitationLDP_PrecipitationSCR_Precipitation...LS_SoilTemperature_Pit1_30cmLS_SoilTemperature_Pit2_2cmLS_SoilTemperature_Pit2_15cmLS_SoilTemperature_Pit2_30cmLS_SoilTemperature_Pit3_2cmLS_SoilTemperature_Pit3_15cmLS_SoilTemperature_Pit3_30cmLS_SoilTemperature_Pit4_2cmLS_SoilTemperature_Pit4_15cmLS_SoilTemperature_Pit4_30cm
DateTime
2020-05-10NaNNaNNaNNaNNaNNaNNaN0.218182NaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2020-05-11NaNNaNNaNNaNNaNNaNNaN0.108333NaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "

2 rows × 355 columns

\n", "
" ], "text/plain": [ " TL_Discharge BSG_Discharge LG_Discharge C1E_Discharge \\\n", "DateTime \n", "2020-05-10 NaN NaN NaN NaN \n", "2020-05-11 NaN NaN NaN NaN \n", "\n", " C1W_Discharge C2E_Discharge C2M_Discharge BRW_Precipitation \\\n", "DateTime \n", "2020-05-10 NaN NaN NaN 0.218182 \n", "2020-05-11 NaN NaN NaN 0.108333 \n", "\n", " LDP_Precipitation SCR_Precipitation ... \\\n", "DateTime ... \n", "2020-05-10 NaN NaN ... \n", "2020-05-11 NaN NaN ... \n", "\n", " LS_SoilTemperature_Pit1_30cm LS_SoilTemperature_Pit2_2cm \\\n", "DateTime \n", "2020-05-10 NaN NaN \n", "2020-05-11 NaN NaN \n", "\n", " LS_SoilTemperature_Pit2_15cm LS_SoilTemperature_Pit2_30cm \\\n", "DateTime \n", "2020-05-10 NaN NaN \n", "2020-05-11 NaN NaN \n", "\n", " LS_SoilTemperature_Pit3_2cm LS_SoilTemperature_Pit3_15cm \\\n", "DateTime \n", "2020-05-10 NaN NaN \n", "2020-05-11 NaN NaN \n", "\n", " LS_SoilTemperature_Pit3_30cm LS_SoilTemperature_Pit4_2cm \\\n", "DateTime \n", "2020-05-10 NaN NaN \n", "2020-05-11 NaN NaN \n", "\n", " LS_SoilTemperature_Pit4_15cm LS_SoilTemperature_Pit4_30cm \n", "DateTime \n", "2020-05-10 NaN NaN \n", "2020-05-11 NaN NaN \n", "\n", "[2 rows x 355 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Read the raw original data table\n", "table = pd.read_csv('1_'+watershed+'_Download_Aggregation.csv',header = 0,index_col = 'DateTime',\n", " parse_dates = True, infer_datetime_format = True,low_memory=False)\n", "display(table.head(2))\n", "display(table.tail(2))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "All stations names: ['TL' 'BSG' 'LG' 'C1E' 'C1W' 'C2E' 'C2M' 'BRW' 'LDP' 'SCR' 'LW' 'HN' 'HS'\n", " 'MHN' 'MHS' 'MLN' 'MLS' 'LN' 'LS']\n", " \n", "Discharge stations : ['TL', 'BSG', 'LG', 'C1E', 'C1W', 'C2E', 'C2M']\n", " \n", "Meteorology stations: ['BRW', 'LDP', 'SCR', 'TL', 'LW', 'HN', 'HS', 'MHN', 'MHS', 'MLN', 'MLS', 'LN', 'LS']\n" ] } ], "source": [ "# Double Check the station names\n", "# breakdown discharge and hydrometeorological station names\n", "all_stations = table.columns.str.extract(r'([^_]+)')[0]\n", "print('All stations names: ', all_stations.unique())\n", "print (' ')\n", "nameStrflwStation=[]\n", "nameHydrMetStation=[]\n", "for i in np.arange(len(table.columns)):\n", " if table.columns[i][-9:]=='Discharge': ### \n", " if not all_stations[i] in nameStrflwStation:\n", " nameStrflwStation.append(all_stations[i]) ### \n", " else:\n", " if not all_stations[i] in nameHydrMetStation:\n", " nameHydrMetStation.append(all_stations[i]) ### \n", "\n", "print('Discharge stations :',nameStrflwStation)\n", "print(' ')\n", "print('Meteorology stations:',nameHydrMetStation) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data processing and analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Trim the original table\n", "This step adjusts the input table and removes lengthy missing values at the beigning and end of the input table." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TL_DischargeBSG_DischargeLG_DischargeC1E_DischargeC1W_DischargeC2E_DischargeC2M_DischargeBRW_PrecipitationLDP_PrecipitationSCR_Precipitation...LS_SoilTemperature_Pit1_30cmLS_SoilTemperature_Pit2_2cmLS_SoilTemperature_Pit2_15cmLS_SoilTemperature_Pit2_30cmLS_SoilTemperature_Pit3_2cmLS_SoilTemperature_Pit3_15cmLS_SoilTemperature_Pit3_30cmLS_SoilTemperature_Pit4_2cmLS_SoilTemperature_Pit4_15cmLS_SoilTemperature_Pit4_30cm
DateTime
1999-01-01NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
1999-01-02NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "

2 rows × 355 columns

\n", "
" ], "text/plain": [ " TL_Discharge BSG_Discharge LG_Discharge C1E_Discharge \\\n", "DateTime \n", "1999-01-01 NaN NaN NaN NaN \n", "1999-01-02 NaN NaN NaN NaN \n", "\n", " C1W_Discharge C2E_Discharge C2M_Discharge BRW_Precipitation \\\n", "DateTime \n", "1999-01-01 NaN NaN NaN NaN \n", "1999-01-02 NaN NaN NaN NaN \n", "\n", " LDP_Precipitation SCR_Precipitation ... \\\n", "DateTime ... \n", "1999-01-01 NaN NaN ... \n", "1999-01-02 NaN NaN ... \n", "\n", " LS_SoilTemperature_Pit1_30cm LS_SoilTemperature_Pit2_2cm \\\n", "DateTime \n", "1999-01-01 NaN NaN \n", "1999-01-02 NaN NaN \n", "\n", " LS_SoilTemperature_Pit2_15cm LS_SoilTemperature_Pit2_30cm \\\n", "DateTime \n", "1999-01-01 NaN NaN \n", "1999-01-02 NaN NaN \n", "\n", " LS_SoilTemperature_Pit3_2cm LS_SoilTemperature_Pit3_15cm \\\n", "DateTime \n", "1999-01-01 NaN NaN \n", "1999-01-02 NaN NaN \n", "\n", " LS_SoilTemperature_Pit3_30cm LS_SoilTemperature_Pit4_2cm \\\n", "DateTime \n", "1999-01-01 NaN NaN \n", "1999-01-02 NaN NaN \n", "\n", " LS_SoilTemperature_Pit4_15cm LS_SoilTemperature_Pit4_30cm \n", "DateTime \n", "1999-01-01 NaN NaN \n", "1999-01-02 NaN NaN \n", "\n", "[2 rows x 355 columns]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TL_DischargeBSG_DischargeLG_DischargeC1E_DischargeC1W_DischargeC2E_DischargeC2M_DischargeBRW_PrecipitationLDP_PrecipitationSCR_Precipitation...LS_SoilTemperature_Pit1_30cmLS_SoilTemperature_Pit2_2cmLS_SoilTemperature_Pit2_15cmLS_SoilTemperature_Pit2_30cmLS_SoilTemperature_Pit3_2cmLS_SoilTemperature_Pit3_15cmLS_SoilTemperature_Pit3_30cmLS_SoilTemperature_Pit4_2cmLS_SoilTemperature_Pit4_15cmLS_SoilTemperature_Pit4_30cm
DateTime
2020-05-10NaNNaNNaNNaNNaNNaNNaN0.218182NaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2020-05-11NaNNaNNaNNaNNaNNaNNaN0.108333NaNNaN...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "

2 rows × 355 columns

\n", "
" ], "text/plain": [ " TL_Discharge BSG_Discharge LG_Discharge C1E_Discharge \\\n", "DateTime \n", "2020-05-10 NaN NaN NaN NaN \n", "2020-05-11 NaN NaN NaN NaN \n", "\n", " C1W_Discharge C2E_Discharge C2M_Discharge BRW_Precipitation \\\n", "DateTime \n", "2020-05-10 NaN NaN NaN 0.218182 \n", "2020-05-11 NaN NaN NaN 0.108333 \n", "\n", " LDP_Precipitation SCR_Precipitation ... \\\n", "DateTime ... \n", "2020-05-10 NaN NaN ... \n", "2020-05-11 NaN NaN ... \n", "\n", " LS_SoilTemperature_Pit1_30cm LS_SoilTemperature_Pit2_2cm \\\n", "DateTime \n", "2020-05-10 NaN NaN \n", "2020-05-11 NaN NaN \n", "\n", " LS_SoilTemperature_Pit2_15cm LS_SoilTemperature_Pit2_30cm \\\n", "DateTime \n", "2020-05-10 NaN NaN \n", "2020-05-11 NaN NaN \n", "\n", " LS_SoilTemperature_Pit3_2cm LS_SoilTemperature_Pit3_15cm \\\n", "DateTime \n", "2020-05-10 NaN NaN \n", "2020-05-11 NaN NaN \n", "\n", " LS_SoilTemperature_Pit3_30cm LS_SoilTemperature_Pit4_2cm \\\n", "DateTime \n", "2020-05-10 NaN NaN \n", "2020-05-11 NaN NaN \n", "\n", " LS_SoilTemperature_Pit4_15cm LS_SoilTemperature_Pit4_30cm \n", "DateTime \n", "2020-05-10 NaN NaN \n", "2020-05-11 NaN NaN \n", "\n", "[2 rows x 355 columns]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "trimmed row number is 0\n" ] } ], "source": [ "t = table.notna() \n", "t = ~np.isnan(table)\n", "col = len(t.columns)\n", "b = np.zeros([table.shape[1]])\n", "c = np.array([table.shape[0]] * table.shape[1])\n", "\n", "for i in range(col):\n", " if any(t.iloc[:,i]): # Since some are empty\n", " b[i] = list(np.where(t.iloc[:,i] == True))[0][0] # the first non nan value location\n", " c[i] = list(np.where(t.iloc[:,i] == True))[0][-1] # the last non nan value location\n", " \n", "st_tab = b.min()\n", "table1 = table.iloc[int(b.min()):int(c.max()) + 1,:] \n", "\n", "# Display the trimmed table\n", "display(table1.head(2))\n", "display(table1.tail(2))\n", "print('trimmed row number is ', int(table.shape[0] - table1.shape[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define Quantity of interest (QOI)\n", "\n", "QOI defines the variable name to be quality controlled and gap filled.\n", "\n", "This variable can be assigned to any variable among the column names of the above table (raw data table)." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "QOI = 'LS_SoilTemperature_Pit2_2cm'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Filling missing values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Interpolation\n", "\n", "- Interpolation is the first gap filling technique adopted to fill short length missing values. It is not recommended for longer period missing values as it predicts unrealiable/unrealistc values. \n", "\n", "- This unrealiabilty can easily be detected by the interactive plots under the control parameter interpolation length limit (interplimit).\n", "\n", "- As an example, compare interplimit = 7 and interplimit = 30 using the interactive widgets. This shows how the value 30 leads to unrealistic (e.g., horizontal straight lines) filling of missing values." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def plot_widgetInterpolation(QOI=QOI, interplimit=7, Intpmethod='time', Intplimit_direction='both',\n", " xmin=None, xmax=None, ymin=None, ymax=None):\n", " \n", " y, yIntp = widgetInterpolation(table1,QOI, interplimit, 'time', 'both')\n", " \n", " indx = Date_to_float(y.index)\n", " \n", " fig, ax = plt.subplots(1, 1, figsize=(9, 4))\n", " ax.plot(indx, yIntp,'r',linewidth=2,label='Interpolated')\n", " ax.plot(indx, y, 'b',linewidth=2,label='Original')\n", " \n", " ax.set_xlabel(\"Year\")\n", " ax.set_ylabel(QOI.split('_')[1])\n", " ax.set_title(QOI)\n", " ax.set_xlim([xmin,xmax])\n", " ax.set_ylim([ymin, ymax])\n", " ax.grid(color='gray', linestyle='-.', linewidth=1)\n", " #ax.legend(bbox_to_anchor=(1, 1), loc='upper left', fontsize='small')\n", " ax.legend(fontsize='small')\n", " \n", " # interaction responsiveness\n", " ax.relim()\n", " ax.autoscale_view()\n", " fig.canvas.draw_idle()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Beside the interplimit parameter, the interactive widget can also be used to choose different alternatives of the interpolation function (interpolation method and direction). Please take a look at the widget." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "018ce79ba1954a2aa534cb0d3a37ab89", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(Dropdown(description='QOI', index=346, options=('TL_Discharge', 'BSG_Discharge', 'LG_Dis…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "\n", "Interpolation_widget = ipywidgets.interactive(\n", " plot_widgetInterpolation,\n", " \n", " QOI=table1,\n", " Intpmethod = ['linear','time','space','index','pad','nearest', 'zero', 'slinear', 'quadratic', 'cubic'],\n", " Intplimit_direction=['forward', 'backward', 'both'],\n", " \n", " interplimit=ipywidgets.IntSlider(min=2, max=100, value=7),\n", " xmin=ipywidgets.FloatText(value=2008),\n", " xmax=ipywidgets.FloatText(value=2018),\n", " ymin=ipywidgets.FloatText(value=0),\n", " ymax=ipywidgets.FloatText(value=50)\n", ")\n", "Interpolation_widget" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regression\n", "\n", "\n", "As an alternative to interpolation, longer period missing values are filled  by linear one at a time regression. \n", "\n", "- Regression requires measurements from more than one station. By comparing these stations using correlation coefficient, a station with the highest correlation coefficient is adopted as a donor regression station. \n", "\n", "- However, if the donor station's correlation coefficient is below a user defined regression thresold (Regthresh), regression will not be adopted. \n", "\n", "- This threshold (Regthresh) is identified interactively.\n", "\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def plot_widgetRegression(QOI=QOI, RegThreshold=0.7, xmin=None, xmax=None, ymin=None, ymax=None):\n", " \n", " y, yReg = widgetRegression(table1, QOI, RegThreshold=RegThreshold)\n", " \n", " indx = Date_to_float(y.index)\n", " \n", " fig, ax = plt.subplots(1, 1, figsize=(9, 4))\n", " ax.plot(indx, yReg,'r',linewidth=2,label='After Regression')\n", " ax.plot(indx, y, 'b',linewidth=2,label='Raw Data')\n", " \n", " ax.set_xlabel(\"Year\")\n", " ax.set_ylabel(QOI.split('_')[1])\n", " ax.set_title(QOI)\n", " ax.set_xlim([xmin,xmax])\n", " ax.set_ylim([ymin, ymax])\n", " ax.grid(color='gray', linestyle='-.', linewidth=1)\n", " #ax.legend(bbox_to_anchor=(1, 1), loc='upper left', fontsize='small')\n", " ax.legend(fontsize='small')\n", " \n", " # interaction responsiveness\n", " ax.relim()\n", " ax.autoscale_view()\n", " fig.canvas.draw_idle()\n", " # fig.gcf().canvas.set_window_title('Regression')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consider setting the RegThreshold widget to different values to understand its significance in performing regression for the purpose of filling missing values. For instance, RegThreshold = 1.0 will lead to no regression. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "1748923a247441faad6bb53572b789a1", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(Dropdown(description='QOI', index=346, options=('TL_Discharge', 'BSG_Discharge', 'LG_Dis…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "# %matplotlib qt, - a separate plot\n", "# %matplotlib notebook - the plot reside within the notebook with save and zooming options\n", "# %matplotlib widget - for jupyter lab\n", "Regression_widget = ipywidgets.interactive(\n", " plot_widgetRegression,\n", " \n", " QOI=table1,\n", " \n", " RegThreshold=ipywidgets.FloatSlider(min=.5, max=1, value=.8),\n", " xmin=ipywidgets.FloatText(value=2008),\n", " xmax=ipywidgets.FloatText(value=2018),\n", " ymin=ipywidgets.FloatText(value=0),\n", " ymax=ipywidgets.FloatText(value=50)\n", ")\n", "Regression_widget" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Climate catalog\n", "\n", "\n", "Climate Catalog performs gap filling based on comparing the similarity of years. For instance, a missing value in a calendar day for a given year can be filled by a data from a year $(D_\\text{missing year})$ that has the highest correlation with the missing year $(D_\\text{year with the highest $r^2$})$ among the other years plus a sample from a normal distribution with a standard deviation of all the years for the missing calandar day $(N(0, \\sigma ^2))$. Below is the mathematical formulation:\n", "\n", "\\begin{equation}\n", "D_\\text{missing year} = D_\\text{year with the highest $r^2$} + N(0, \\sigma ^2)\n", "\\end{equation}\n", "\n", "\n", "- In using Climate Catalog, we set two interactive parameters: \n", " 1. corrThr - a minimum correaltion coefficient a  candidate year needs to satisfy to qualify as a donor year and \n", " 2. thrLen - the minimum number of days the missing year has to have in order to perform Climate Catalog based gap filling. " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def plot_widgetClimateCatalog(QOI=QOI, thrLen=200, corrThr=0.7, xmin=None, xmax=None, ymin=None, ymax=None):\n", " \n", " y_CC, yraw, indx_cc = funcClimateCatalogWg(table1, QOI, thrLen, corrThr)\n", " \n", " indx = Date_to_float(yraw.index)\n", " \n", " fig, ax = plt.subplots(1, 1, figsize=(9, 4))\n", " ax.plot(indx, y_CC,'r',linewidth=2,label='After Climate Catalog')\n", " ax.plot(indx, yraw, 'b',linewidth=2,label='Raw Data')\n", " \n", " ax.set_xlabel(\"Year\")\n", " ax.set_ylabel(QOI.split('_')[1])\n", " ax.set_title(QOI)\n", " ax.set_xlim([xmin,xmax])\n", " ax.set_ylim([ymin, ymax])\n", " ax.grid(color='gray', linestyle='-.', linewidth=1)\n", " #ax.legend(bbox_to_anchor=(1, 1), loc='upper left', fontsize='small')\n", " ax.legend(fontsize='small')\n", " \n", " # interaction responsiveness\n", " ax.relim()\n", " ax.autoscale_view()\n", " fig.canvas.draw_idle()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a demonstration, consider setting thrLen and corrThr at different values to understand the tradeoff between the two parameters in filling missing values using the climate catalog method." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "f94e30c3c5b5468abb9ff3df064cf3df", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(Dropdown(description='QOI', index=346, options=('TL_Discharge', 'BSG_Discharge', 'LG_Dis…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "\n", "# %matplotlib qt, - a separate plot\n", "# %matplotlib notebook - the plot reside within the notebook \n", "# %matplotlib widget - for jupyter lab\n", "ClimateCatalog_widget = ipywidgets.interactive(\n", " plot_widgetClimateCatalog,\n", " \n", " QOI=table1,\n", " \n", " corrThr=ipywidgets.FloatSlider(min=.5, max=1., value=.7),\n", " thrLen=ipywidgets.IntSlider(min=1, max=365, value=7),\n", " xmin=ipywidgets.FloatText(value=2008),\n", " xmax=ipywidgets.FloatText(value=2018),\n", " ymin=ipywidgets.FloatText(value=0),\n", " ymax=ipywidgets.FloatText(value=50)\n", ")\n", "ClimateCatalog_widget" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 4 }