{ "cells": [ { "cell_type": "markdown", "id": "42e07fb9", "metadata": { "vscode": { "languageId": "raw" } }, "source": [ "---\n", "title: \"Lab 04\"\n", "subtitle: \"Advanced Data Analysis and Statistics\"\n", "page-layout: full\n", "toc: true\n", "freeze: auto\n", "title-block-banner: false\n", "sidebar: course-ear-419\n", "---" ] }, { "cell_type": "markdown", "id": "e64e56ac", "metadata": {}, "source": [ "[{{< iconify ph:download-fill >}} Notebook](https://raw.githubusercontent.com/jaywt/jaywt.github.io/main/courses/ear-419-environmental-aqueous-geochemistry/labs/lab-04.ipynb){.btn target=_blank} [{{< iconify simple-icons:googlecolab >}} Open in Colab](https://colab.research.google.com/github/jaywt/jaywt.github.io/blob/main/courses/ear-419-environmental-aqueous-geochemistry/labs/lab-04.ipynb){.btn target=_blank}" ] }, { "cell_type": "markdown", "id": "30473151", "metadata": {}, "source": [ "## Pre-lab Prep\n", "\n", "- To use the cloud computing platform [Google Colab](https://colab.research.google.com){target=_blank}, you need a Google account and access to Google Drive. SU students can use their @g.syr.edu account.\n", "- Students who are not familiar with Google Colab are strongly encouraged to watch this [quickview video](https://youtu.be/inN8seMm7UI?si=kAOG8o4G_6rM4FNA){target=_blank} and visit the [Google Colab](https://colab.research.google.com){target=_blank} website to navigate through __Welcome to Colab__, __Overview of Colab__, and __Guide to Markdown__\n", "- Students are strongly encouraged to read through [**Lab 04 Instructions**](../files_to_share/Lab4_Instructions_AdvancedDataAnalysis.pdf) before class." ] }, { "cell_type": "markdown", "id": "c3391a12", "metadata": {}, "source": [ "## Lab Materials" ] }, { "cell_type": "markdown", "id": "4fa7476f", "metadata": {}, "source": [ "### 1. Lab 04 Instructions\n", "Please download [**Lab 04 Instructions**](../files_to_share/Lab4_Instructions_AdvancedDataAnalysis.pdf) and go through it before class." ] }, { "cell_type": "markdown", "id": "79a954ef", "metadata": {}, "source": [ "### 2. Lab 04 Demo\n", "- To save time setting up the coding environment and dependencies on your local computer, you can click the Open in Colab button at the top of this webpage to open it in Google Colab.\n", "- Once you have opened it in Google Colab, log in using your Google account then click on ‘Runtime’ in the menu bar. Then, select ‘Change runtime type’ and modify the runtime type from Python 3 to R." ] }, { "cell_type": "code", "execution_count": null, "id": "b3641609", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "###############################################################\n", "# Installing and Loading Packages\n", "###############################################################\n", "# Install all needed packages\n", "install.packages(\"dataRetrieval\")\n", "install.packages(\"dplyr\")\n", "install.packages(\"lubridate\")" ] }, { "cell_type": "code", "execution_count": null, "id": "c88df438", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# load these packages into the memory for later use\n", "library(dataRetrieval)\n", "library(dplyr)\n", "library(lubridate)" ] }, { "cell_type": "code", "execution_count": 121, "id": "05c81432", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "#######################################################################################\n", "# Downloading River Chemistry Time Series Data Using dataRetrieval package\n", "#######################################################################################\n", "# In the demo, the specific USGS site I am going to download Ca data for \n", "# has the site number USGS-01391500 (Saddle River at Lodi NJ)\n", "# let's define a variable to store the site number\n", "siteid <- \"USGS-01391500\"\n", "\n", "# In lab 04 deliverables, you need to explore the other two sites:\n", "# USGS-01111500 (BRANCH RIVER AT FORESTDALE, RI)\n", "# USGS-02336300 (PEACHTREE CREEK AT ATLANTA, GA)\n", "\n", "# USGS encode all chemicals as numeric codes. Calcium's code is 00915 while Sodium code is 00930\n", "parmCd <- \"00930\"\n", "\n", "# let's focus on water quality data collected from 1978 to 2018\n", "start.date = as.Date(\"1978-01-01\")\n", "end.date = as.Date(\"2019-01-01\")" ] }, { "cell_type": "code", "execution_count": null, "id": "f855beed", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# download data and assigned downloaded data to a variable named \"demo_site\"\n", "demo_site <- readWQPqw(siteNumbers = siteid,\n", " parameterCd = parmCd,\n", " startDate = start.date,\n", " endDate = end.date)" ] }, { "cell_type": "code", "execution_count": 123, "id": "392eadd0", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# extract year, month, month-day info of sampling date and add them into three columns\n", "demo_site$year <- year(demo_site$ActivityStartDate)\n", "demo_site$month <- month(demo_site$ActivityStartDate)\n", "demo_site$mday <- mday(demo_site$ActivityStartDate)" ] }, { "cell_type": "code", "execution_count": 124, "id": "bd682b39", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Simplify the dataset by keeping only most essential columns (i.e., location, sampling date, data value)\n", "demo_site <- demo_site %>%\n", " select(c(\"MonitoringLocationIdentifier\", \"ActivityStartDate\", \n", " \"year\", \"month\", \"mday\", \"ResultMeasureValue\")) %>%\n", " rename(site_no=MonitoringLocationIdentifier, sample_dt=ActivityStartDate,\n", " result_va=ResultMeasureValue)" ] }, { "cell_type": "code", "execution_count": null, "id": "0ec04ca4", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# open this refined dataset to overview\n", "View(demo_site)" ] }, { "cell_type": "code", "execution_count": 126, "id": "2bcf1b20", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# we can also save this dataset as a csv file for future use\n", "write.csv(demo_site, file = \"./demo_site.csv\", row.names = FALSE, na = \"\")\n", "\n", "# in the future, you can read this file by using read.csv function\n", "#read.csv(file = \"./demo_site.csv\", header = TRUE, na.strings = c(\"\", \"NA\"))" ] }, { "cell_type": "code", "execution_count": null, "id": "4389352c", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "#######################################################################################\n", "# Data Processing\n", "#######################################################################################\n", "\n", "#### Data overview\n", "# print top 6 rows\n", "head(demo_site)" ] }, { "cell_type": "code", "execution_count": null, "id": "4cc45e81", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# print top 5 rows\n", "head(demo_site,5)" ] }, { "cell_type": "code", "execution_count": null, "id": "787db9f0", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# print bottom 6 rows\n", "tail(demo_site)" ] }, { "cell_type": "code", "execution_count": null, "id": "f7a7c427", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# print the statistical summary of each column\n", "summary(demo_site)" ] }, { "cell_type": "code", "execution_count": null, "id": "bf50780f", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "#### Index and subset\n", "# print the cell that is at row 2 and column 3.\n", "demo_site[2,3]\n", "# In the above example, first number indicates row number while the second number indicates column number" ] }, { "cell_type": "code", "execution_count": null, "id": "4e8dbbeb", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# print the 2nd row\n", "demo_site[2,]" ] }, { "cell_type": "code", "execution_count": null, "id": "66ea2374", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# print the top 10 rows in the column of 'result_va'\n", "demo_site[1:10,\"result_va\"]" ] }, { "cell_type": "code", "execution_count": null, "id": "21389fd6", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# print the column named 'site_no'\n", "demo_site$site_no" ] }, { "cell_type": "code", "execution_count": null, "id": "2626523c", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# print out all rows whose column 'year' is larger than 2000\n", "demo_site[demo_site$year>=2000,]" ] }, { "cell_type": "code", "execution_count": null, "id": "7b1ec1ec", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "#### Column-wise and row-wise summary\n", "# print out the minimum value of column 'result_va'\n", "min(demo_site$result_va)" ] }, { "cell_type": "code", "execution_count": 137, "id": "72ca5108", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# What is the max Ca concentration?\n", "# hint: use the fucntion max" ] }, { "cell_type": "code", "execution_count": 138, "id": "1dca6384", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "### Helpful Functions apply() and tapply()\n", "# We create a new data frame with 2 columns. First column contains 1, 2, and 3.\n", "# Second column contains 4, 5, and 6\n", "test_df <- data.frame(c1=c(1,2,3),c2=c(4,5,6))" ] }, { "cell_type": "code", "execution_count": null, "id": "16558a4b", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# How does 'apply' work? \n", "# Try to run '?apply'\n", "apply(test_df, 1, sum)\n", "apply(test_df, 2, sum)" ] }, { "cell_type": "code", "execution_count": null, "id": "5925c93e", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Try '?tapply'\n", "tapply(demo_site$result_va, list(demo_site$year, demo_site$site_no), mean, na.rm=TRUE)" ] }, { "cell_type": "code", "execution_count": 141, "id": "9ebdfbea", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "#######################################################################################\n", "# Data Visualization and Regression Analysis\n", "#######################################################################################\n", "\n", "# aggregate data by year\n", "annual_summary <- tapply(demo_site$result_va, demo_site$year, mean, na.rm=TRUE)\n", "annual_summary <- data.frame(year=as.numeric(names(annual_summary)), \n", " result_annual=annual_summary)" ] }, { "cell_type": "code", "execution_count": null, "id": "48345efc", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# plotting\n", "plot(x = annual_summary$year,\n", " y = annual_summary$result_annual,\n", " xlab=\"Year\",\n", " ylab=\"Annual Mean Na Concentration (mg/L)\",\n", " main=\"Temporal Trend of Annual Mean Na Concentration 1978-2018\",\n", " type=\"b\")\n", "\n", "# regression analysis\n", "abline(lm(result_annual ~ year, data=annual_summary), col=2)\n", "summary(lm(result_annual ~ year, data=annual_summary))\n", "\n", "# save the figure to pdf" ] }, { "cell_type": "markdown", "id": "a0367380", "metadata": { "vscode": { "languageId": "r" } }, "source": [ "### 3. Lab 04 Deliverable\n", "- Modify and rerun the demo code to generate the temporal plot of Na concentration for the other two sites (USGS-01111500 and USGS-02336300) and perform the regression analysis for both sites\n", "- Submit a __single-page__ PDF file including these __two plots__ plus __2-3 paragraphs__ describing these two plots and what explain the difference (refer to papers in Lab 01 and previous lectures) in the temporal trend at these two sites" ] }, { "cell_type": "markdown", "id": "7db1d99c", "metadata": {}, "source": [ "## Deliverables\n", "\n", "| Deliverables | Date Assigned | Date Due |\n", "|:--------------------------------------------------|:----------------|-----------------------------|\n", "| Lab 04 ([refer to the SU Blackboard website](http://blackboard.syracuse.edu){target=_blank}) | Thur 10/24/2024 | Thur 10/31/2024, 12:30pm ET |" ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "4.3.3" } }, "nbformat": 4, "nbformat_minor": 5 }