---
title: "2_5_merging_temperature_data"
author: "Aspen Coyle"
date: "7/25/2022"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction

In the previous script, we fixed individual issues with our temperature data. 

Here, we'll do the following:

- Use a custom function to read in our Tidbit data

- Clean up the pot data. In some years, Tidbit IDs are in the comments - we need to move those over

- Merge out Tidbit data with our pot data

#### Load libraries (and install if necessary), and load packages

```{r libraries, message=FALSE, warning=FALSE}
# Add all required libraries here
list.of.packages <- c("tidyverse", "readxl", "lubridate")
# Get names of all required packages that aren't installed
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
# Install all new packages
if(length(new.packages)) install.packages(new.packages)


# Load all required libraries
lapply(list.of.packages, FUN = function(X) {
  do.call("require", list(X))
})

# Load custom functions
source("hemat_modeling_functions.R")
```

# Read in the Tidbit data for all years

```{r}
# For the sake of formatting, don't put a trailing / at the end of the path
full_dat <- read_tidbit_data(data_path = "../output/ADFG_SE_AK_pot_surveys/cleaned_data/temperature_data/2005")

years <- as.character(2006:2019)

for (i in 1:length(years)) {
  temp_dat <- read_tidbit_data(data_path = paste0("../output/ADFG_SE_AK_pot_surveys/cleaned_data/temperature_data/", years[i]))
  
  full_dat <- rbind(full_dat, temp_dat)
}


colSums(is.na(full_dat))

# Heck yes, no NA values!

# Now we'll check for weird values in all columns
table(full_dat$year)
table(full_dat$survey)
table(full_dat$tidbit_id)
head(sort(full_dat$tidbit_datetime))
head(sort(full_dat$tidbit_datetime, decreasing = TRUE))
head(sort(full_dat$Temp))
head(sort(full_dat$Temp, decreasing = TRUE))

# Max and min values all look good

# Convert formats to correct ones
full_dat$year <- as.numeric(full_dat$year)
full_dat$tidbit_datetime <- as_datetime(full_dat$tidbit_datetime, tz = "US/Alaska")

# To ensure all date/times were converted directly, check the hours for each year/survey combo
full_dat$hour <- hour(full_dat$tidbit_datetime)
table(full_dat$hour, full_dat$year)

# Everything looks great - no year has any sort of bias for pre-12pm, and values are typically within a few tens per hour.
full_dat <- full_dat %>%
  select(-hour)

# We'll double check that times weren't rounded by examining the distribution of minutes. Looking for odd numbers of :00
full_dat$min <- minute(full_dat$tidbit_datetime)

table(full_dat$min, full_dat$year)

# Looks good!!! (The excitement is because the first 20+ renditions were less successful).
full_dat[full_dat$tidbit_id == "17" & full_dat$year == 2016, ]

# We can now drop the min column
full_dat <- full_dat %>%
  select(-min)
```

We now have an extremely long dataframe consisting entirely of the following columns:
- year: 2005-2019
- survey: RKC or Tanner crab
- tidbit_id: A number unique to that year that identifies the Tidbit
- tidbit_datetime: The date and time measured by the Tidbit
- temp: The temperature measured by the Tidbit. 

Each row is one temperature measurement by one Tidbit. We want to do the following:

- Match each Tidbit with the pot it was in at the moment of measurement
- Average out temperatures across each pot deployment

To accomplish the first goal, we need to merge the Tidbit dataframe (described above) with our dataframe of pot data. However, due to the size of the Tidbit dataframe, it makes the most sense to do this on a year-by-year basis.

# Get the pot data cleaned up
```{r}
pot_dat <- read.csv(file = "../data/ADFG_SE_AK_pot_surveys/Pot_Set_Data_for_Tanner_and_RKC_surveys.csv")

# Remove spaces from column names
names(pot_dat) <- make.names(names(pot_dat), unique = TRUE)

# Eliminate columns we don't care about
pot_dat <- pot_dat %>%
  select(-c(Pot.Dimension.Feet, Pot.Escape.Device, Bait.Method, Weight.Of.Pot.Pounds,
            Pot.Type, Substrate.Type, Debris.Type))

# Change name of Tidbit ID column to match the one in use for tidbit_data
pot_dat <- rename(pot_dat, tidbit_id = Tidbit.No)

# Convert format of time columns from character to date
pot_dat$Time.Hauled <- mdy_hm(pot_dat$Time.Hauled, tz = "US/Alaska")
pot_dat$Time.Set <- mdy_hm(pot_dat$Time.Set, tz = "US/Alaska")

# Remove years before 2005 (first year with tidbits)
pot_dat <- pot_dat %>%
  filter(Year >= 2005)

# Check on frequency of tidbits
table(pot_dat$Year, pot_dat$tidbit_id, pot_dat$Project)

```

Despite having Tidbits, 2005-2007 have no tidbit_id values. After that, tidbit_id values are present for all surveys. 

A check of the data shows that in this years, tidbit_id is stored in the comments for these years. Additionally, it's often only in the comments for 2008. 
Oof, this is gonna take a while

We'll go through year by year and move comments into the tidbit_id section. We'll do this as follows:

1: Create new dataframe from pot_dat with post-2008 pot data
2: Create new dataframe from pot_dat with 2005 data, convert tidbit-related comments into tidbit_id values
3: Repeat step 2 with 2006, 2007, 2008 data
4: Merge 2005-2008 dataframes with dataframe created in Step 1. This'll be our new pot_dat column

##### 2005: Convert pot_dat comments to Tidbit IDs

```{r}
# Create version of pot_dat that only has years with tidbit_id info
full_pot_dat <- pot_dat %>%
  filter(Year >= 2009)

# Filter pot data to only include info from 2005 surveys
ohfive_pot_dat <- pot_dat %>%
  filter(Year == 2005)

tidbit_pots <- ohfive_pot_dat %>%
  filter(grepl("Tidbit", Pot.Comment, ignore.case = TRUE))

# Check we got all tidbit pots by looking at comments that don't match
non_tidbit_pots <- ohfive_pot_dat %>%
  filter(!grepl("Tidbit", Pot.Comment, ignore.case = TRUE))

# Yep, looks like tidbit_pots contains all 2005 tidbit pots!
# Now we need to extract the tidbit number
# In most, it's after a #

# Get all from single-digit tidbits 
tidbit_pots$tidbit_id <- str_extract(tidbit_pots$Pot.Comment, "#.")

# Overwrite single-digits that should be double-digits without adding NAs
# (prev. command read, say, #12 as #1)
new_dig <- str_extract(tidbit_pots$Pot.Comment, "#.\\d")
for (i in 1:length(new_dig)){
  if (!is.na(new_dig[i])){
    tidbit_pots$tidbit_id[i] <- new_dig[i]
  }
  else{}
}

# Remove the # from the column
tidbit_pots$tidbit_id <- gsub("#", "", tidbit_pots$tidbit_id)

# We still have a few left, but we can just add these manually
tidbit_pots[grep("5 Tidbit", tidbit_pots$Pot.Comment, ignore.case = TRUE), ]$tidbit_id <- "5"
  
tidbit_pots[grep("11 Tidbit", tidbit_pots$Pot.Comment, ignore.case = TRUE), ]$tidbit_id <- "11"

tidbit_pots[grep("10 Tidbit", tidbit_pots$Pot.Comment, ignore.case = TRUE), ]$tidbit_id <- "10"

tidbit_pots[grep("7 Tidbit", tidbit_pots$Pot.Comment, ignore.case = TRUE), ]$tidbit_id <- "7"

tidbit_pots[tidbit_pots$Pot.Comment == "Tidbit # 10 on this pot.", ]$tidbit_id <- "10"

tidbit_pots[tidbit_pots$Pot.Comment == "Tidbit 11 on this pot.", ]$tidbit_id <- "11"

# Alright, verify we did a good job here
any(is.na(tidbit_pots$tidbit_id))
table(tidbit_pots$tidbit_id)
# Looks great!

# Now we'll remerge the tidbit and non-tidbit pots to recreate the full 2005 dataset. It'll look the same, just with the Tidbit info added from the comments
ohfive_pot_dat <- rbind(tidbit_pots, non_tidbit_pots)
```

#### 2006: Convert pot_dat comments to Tidbit IDs

```{r}
ohsix_pot_dat <- pot_dat %>%
  filter(Year == 2006)

# All pots had some sort of comment noting the presence of Tidbits, so we don't need to filter
# out pots without tidbit references

# For most, the tidbit number is after an "#", so we'll just extract that.
ohsix_pot_dat$tidbit_id <- str_extract(ohsix_pot_dat$Pot.Comment, "#\\d{1,2}")

# For others, the tidbit number is just after "Tidbit", so we'll extract that too
new_digs <- str_extract(ohsix_pot_dat$Pot.Comment, "idbit \\d{1,2}")
new_digs <- str_extract(new_digs, "\\d{1,2}")
for (i in 1:length(new_digs)){
  if (!is.na(new_digs[i])) {
    ohsix_pot_dat$tidbit_id[i] <- new_digs[i]
  }
}

# For a few more, the tidbit number is after an "# " (ex: "Tidbit # 19"), so we'll extract that too
new_digs <- str_extract(ohsix_pot_dat$Pot.Comment, "# \\d{1,2}")
new_digs <- str_extract(new_digs, "\\d{1,2}")
for (i in 1:length(new_digs)){
  if (!is.na(new_digs[i])) {
    ohsix_pot_dat$tidbit_id[i] <- new_digs[i]
  }
}

# We still have a few left, but we can just add these manually
ohsix_pot_dat[grep("tidbit  20", ohsix_pot_dat$Pot.Comment, ignore.case = TRUE), ]$tidbit_id <- "20"
ohsix_pot_dat[grep("tidbit  13", ohsix_pot_dat$Pot.Comment, ignore.case = TRUE), ]$tidbit_id <- "13"
ohsix_pot_dat[grep("Tidibt 24", ohsix_pot_dat$Pot.Comment, ignore.case = TRUE), ]$tidbit_id <- "24"


# Alright, verify we did a good job here
ohsix_pot_dat[is.na(ohsix_pot_dat$tidbit_id), ]
# We have one NA, and it's a comment specifically noted as not having a tidbit. All good!
```
#### 2007: Convert pot_dat comments to Tidbit IDs

```{r}
ohseven_pot_dat <- pot_dat %>%
  filter(Year == 2007)

# All pots had some sort of comment noting the presence of Tidbits, so we don't need to filter
# out pots without tidbit references

# We'll add in all Tidbits after a # (no space)
new_digs <- str_extract(ohseven_pot_dat$Pot.Comment, "idbit #\\d{1,2}")
new_digs <- str_extract(new_digs, "\\d{1,2}")
ohseven_pot_dat$tidbit_id <- new_digs

# Some Tidbit IDs didn't have the # (ex: Tidbit 19, not Tidbit #19), we'll add those too
new_digs <- str_extract(ohseven_pot_dat$Pot.Comment, "idbit \\d{1,2}")
new_digs <- str_extract(new_digs, "\\d{1,2}")
for (i in 1:length(new_digs)){
  if (!is.na(new_digs[i])) {
    ohseven_pot_dat$tidbit_id[i] <- new_digs[i]
  }
}

# Instead of being prefaced with "Tidbit", some are prefaced with "TB" or "tb" (ex: TB 14)
new_digs <- str_extract(ohseven_pot_dat$Pot.Comment, "TB \\d{1,2}")
new_digs <- str_extract(new_digs, "\\d{1,2}")
for (i in 1:length(new_digs)){
  if (!is.na(new_digs[i])) {
    ohseven_pot_dat$tidbit_id[i] <- new_digs[i]
  } 
}

new_digs <- str_extract(ohseven_pot_dat$Pot.Comment, "tb \\d{1,2}")
new_digs <- str_extract(new_digs, "\\d{1,2}")
for (i in 1:length(new_digs)){
  if (!is.na(new_digs[i])) {
    ohseven_pot_dat$tidbit_id[i] <- new_digs[i]
  } 
}


# Now an edge case - some Tidbits are four digits, with a dash, space, or underscore between the first and last set. For all, the last two digits are 07. Let's extract those.
new_digs <- str_extract(ohseven_pot_dat$Pot.Comment, "\\d{2}.07")
# Replace last 3 digits (either " 07", "-07", or "_07") with "07"
new_digs <- str_replace(new_digs, ".07", "-07")
for (i in 1:length(new_digs)){
  if (!is.na(new_digs[i])) {
    ohseven_pot_dat$tidbit_id[i] <- new_digs[i]
  } 
}

# We also have a few Tidbits where it's four digits ending in 07 (ex: 2007, not 20-07. We'll convert those too.)
new_digs <- str_extract(ohseven_pot_dat$Pot.Comment, "\\d{2}07")
# Replace last 3 digits (either " 07", "-07", or "_07") with "07"
new_digs <- str_replace(new_digs, "07", "-07")
for (i in 1:length(new_digs)){
  if (!is.na(new_digs[i])) {
    ohseven_pot_dat$tidbit_id[i] <- new_digs[i]
  } 
}

# We still have a few left, but we can just add these manually
ohseven_pot_dat[grep("tb  03", ohseven_pot_dat$Pot.Comment, ignore.case = TRUE), ]$tidbit_id <- "03"
ohseven_pot_dat[grep("tibit 11", ohseven_pot_dat$Pot.Comment, ignore.case = TRUE), ]$tidbit_id <- "11"
ohseven_pot_dat[grep("Tidbit  11", ohseven_pot_dat$Pot.Comment, ignore.case = TRUE), ]$tidbit_id <- "11"
ohseven_pot_dat[grep("Tidbit # 25", ohseven_pot_dat$Pot.Comment, ignore.case = TRUE), ]$tidbit_id <- "25"
ohseven_pot_dat[grep("TB 17.-07", ohseven_pot_dat$Pot.Comment, ignore.case = TRUE), ]$tidbit_id <- "17-07"
ohseven_pot_dat[grep("TB 2-07", ohseven_pot_dat$Pot.Comment, ignore.case = TRUE), ]$tidbit_id <- "02-07"

# Alright, verify we did a good job here
ohseven_pot_dat[is.na(ohseven_pot_dat$tidbit_id), ]
# We have three NA, and none have comments noting tidbits. 
# All set!
```

#### 2008: Convert pot_dat comments to Tidbit IDs

```{r}
oheight_pot_dat <- pot_dat %>%
  filter(Year == 2008)

# All pots had some sort of comment noting the presence of Tidbits, so we don't need to filter
# out pots without tidbit references

# We'll add in all Tidbits after a # (no space)
new_digs <- str_extract(oheight_pot_dat$Pot.Comment, "#\\d{1,2}")
new_digs <- str_extract(new_digs, "\\d{1,2}")
for (i in 1:length(new_digs)){
  if (!is.na(new_digs[i])) {
    oheight_pot_dat$tidbit_id[i] <- new_digs[i]
  }
}

# Instead of being prefaced with "Tidbit", some are prefaced with "TB" or "tb" (ex: TB 14)
new_digs <- str_extract(oheight_pot_dat$Pot.Comment, "TB \\d{1,2}")
new_digs <- str_extract(new_digs, "\\d{1,2}")
for (i in 1:length(new_digs)){
  if (!is.na(new_digs[i])) {
    oheight_pot_dat$tidbit_id[i] <- new_digs[i]
  } 
}
new_digs <- str_extract(oheight_pot_dat$Pot.Comment, "tb \\d{1,2}")
new_digs <- str_extract(new_digs, "\\d{1,2}")
for (i in 1:length(new_digs)){
  if (!is.na(new_digs[i])) {
    oheight_pot_dat$tidbit_id[i] <- new_digs[i]
  } 
}
# Sometimes there's no space between the TB and the number (all these are capitalized). Let's fix
new_digs <- str_extract(oheight_pot_dat$Pot.Comment, "TB\\d{1,2}")
new_digs <- str_extract(new_digs, "\\d{1,2}")
for (i in 1:length(new_digs)){
  if (!is.na(new_digs[i])) {
    oheight_pot_dat$tidbit_id[i] <- new_digs[i]
  } 
}

# Now an edge case - some Tidbits are four digits, with a dash, space, or underscore between the first and last set. For all, the last two digits are 07. Let's extract those.
new_digs <- str_extract(oheight_pot_dat$Pot.Comment, "\\d{2}.07")
# Replace last 3 digits (either " 07", "-07", or "_07") with "07"
new_digs <- str_replace(new_digs, ".07", "-07")
for (i in 1:length(new_digs)){
  if (!is.na(new_digs[i])) {
    oheight_pot_dat$tidbit_id[i] <- new_digs[i]
  } 
}

# We also have a few Tidbits where it's four digits ending in 07 (ex: 2007, not 20-07. We'll convert those too.)
new_digs <- str_extract(oheight_pot_dat$Pot.Comment, "\\d{2}07")
# Replace last 3 digits (either " 07", "-07", or "_07") with "07"
new_digs <- str_replace(new_digs, "07", "-07")
for (i in 1:length(new_digs)){
  if (!is.na(new_digs[i])) {
    oheight_pot_dat$tidbit_id[i] <- new_digs[i]
  } 
}



# Alright, do a visual inspection. Looks good to me! We've got some overlap in tidbit names (ex: 20-07 and 2007), but we'll solve that later. 
```

### Merging all Pot Data

We now have five data tables - one for each year from 2005-2008 with accurate Tidbit IDs from that year, and one from 2009-2019 which already had accurate Tidbit IDs. Merge them all.

```{r}
pot_dat <- rbind(ohfive_pot_dat, ohsix_pot_dat, ohseven_pot_dat, oheight_pot_dat, full_pot_dat)

# Check other year comments for tidbit ID
tidbit_pots <- pot_dat %>%
  filter(grepl("Tidbit", Pot.Comment, ignore.case = TRUE)) %>%
  filter(Year > 2008)

# Alright, we've got a few notable commments, but almost none!
# Some tidbits are listed as missing, but have tidbit numbers in tidbit_id
# Others don't have a value in tidbit_id, but are listed in the comments
# Let's make some manual corrections
pot_dat[grep("no tidbit", pot_dat$Pot.Comment, ignore.case = TRUE), ]$tidbit_id <- NA
pot_dat[grep("tidbit missing", pot_dat$Pot.Comment, ignore.case = TRUE), ]$tidbit_id <- NA
pot_dat[grep("Pot open, no sample, Tidbit broken", pot_dat$Pot.Comment, ignore.case = TRUE), ]$tidbit_id <- NA
pot_dat[grep("tidbit 10", pot_dat$Pot.Comment, ignore.case = TRUE), ]$tidbit_id <- 10

# Let's also check about values labeled "tb"
tidbit_pots <- pot_dat %>%
  filter(grepl("TB", Pot.Comment, ignore.case = TRUE)) %>%
  filter(Year > 2008)

# Yep, we've got a bunch. Bummer! Let's fix those. All are from 2009, so we'll specify
# Simplest way is gonna be just removing the 2009 data, editing it separately, then reading it back in
# That'll eliminate issues with, say, 2008 pots with "TB 2007" in the comments being read in as tidbit_id = 20

ohnine_pot_dat <- pot_dat %>%
  filter(Year == 2009)

pot_dat <- pot_dat %>%
  filter(Year != 2009)

new_digs <- str_extract(ohnine_pot_dat$Pot.Comment, "TB \\d{1,2}")
new_digs <- str_extract(new_digs, "\\d{1,2}")
for (i in 1:length(new_digs)){
  if (!is.na(new_digs[i])) {
    ohnine_pot_dat$tidbit_id[i] <- new_digs[i]
  } 
}

# Re-merge 2009 data in
pot_dat <- rbind(ohnine_pot_dat, pot_dat)

# Manually scan through comments from un-examined years to look for any unusual references to tidbits or loggers
test <- pot_dat %>%
  filter(Year > 2009)
# None found! We can safely remove the comments column
pot_dat <- pot_dat %>%
  select(-Pot.Comment)
```

# Alter tidbit IDs in the pot data 

We want to make sure that tidbit IDs in the pot data match those in the Tidbit ID data, and alter them if they don't!

```{r}
# 2005
table(pot_dat[pot_dat$Year == 2005, ]$tidbit_id)
table(full_dat[full_dat$year == 2005, ]$tidbit_id)

# We've got a few that aren't present in the tidbit ID data (these are likely from the RKC survey), but all IDs in the tidbit data have a corollary in the pot data

# Searching for two things:
#   - pot_data tidbit ID values that appear to be duplicated, with both having a match to full_data tidbit ID values (ex: 2207 and 22-07)

#   - full_data values with errors in tidbit ID entries (ex: leading zeros)

# 2006
table(pot_dat[pot_dat$Year == 2006, ]$tidbit_id)
table(full_dat[full_dat$year == 2006, ]$tidbit_id)
# Oof, looks like a lot within the pot data are preceded by a #. Let's remove that from all throughout all years
pot_dat$tidbit_id <- sub("#", "", pot_dat$tidbit_id)
# Check again
names(table(pot_dat[pot_dat$Year == 2006, ]$tidbit_id))
names(table(full_dat[full_dat$year == 2006, ]$tidbit_id))
# Okay, we've still got issues. What's listed as Tidbit 2 in the tidbit data is listed as Tidbit 02 in the comments. This is an issue in many years, so let's just solve it all at once.
pot_dat$tidbit_id <- str_remove(pot_dat$tidbit_id, "^0")
# Check again
names(table(pot_dat[pot_dat$Year == 2006, ]$tidbit_id))
names(table(full_dat[full_dat$year == 2006, ]$tidbit_id))
# Looks good!

#2007
names(table(pot_dat[pot_dat$Year == 2007, ]$tidbit_id))
names(table(full_dat[full_dat$year == 2007, ]$tidbit_id))
# Looks good!

#2008
names(table(pot_dat[pot_dat$Year == 2008, ]$tidbit_id))
names(table(full_dat[full_dat$year == 2008, ]$tidbit_id))
# Alright, some tidbits from this year have multiple names. Let's modify
# Change 17-07 to 17
pot_dat[which(pot_dat$Year == 2008 & pot_dat$tidbit_id == "17-07"), ]$tidbit_id <- "17"
# Change 177 to 17
pot_dat[which(pot_dat$Year == 2008 & pot_dat$tidbit_id == "177"), ]$tidbit_id <- "17"
# Change 18-07 to 18
pot_dat[which(pot_dat$Year == 2008 & pot_dat$tidbit_id == "18-07"), ]$tidbit_id <- "18"
# Change 1807 to 18
pot_dat[which(pot_dat$Year == 2008 & pot_dat$tidbit_id == "1807"), ]$tidbit_id <- "18"
# Change 187 to 18
pot_dat[which(pot_dat$Year == 2008 & pot_dat$tidbit_id == "187"), ]$tidbit_id <- "18"
# Change 2-07 to 2
pot_dat[which(pot_dat$Year == 2008 & pot_dat$tidbit_id == "2-07"), ]$tidbit_id <- "2"
# Change 20-07 to 2007
pot_dat[which(pot_dat$Year == 2008 & pot_dat$tidbit_id == "20-07"), ]$tidbit_id <- "2007"
# Change 207 to 2
pot_dat[which(pot_dat$Year == 2008 & pot_dat$tidbit_id == "207"), ]$tidbit_id <- "2"
# Change 21-07, 2107, and 217 to 21
pot_dat[which(pot_dat$Year == 2008 & pot_dat$tidbit_id == "21-07" | pot_dat$tidbit_id == "2107" | pot_dat$tidbit_id == "217"), ]$tidbit_id <- "21"
# Change 22-07, 2207, and 227 to 21
pot_dat[which(pot_dat$Year == 2008 & pot_dat$tidbit_id == "22-07" | pot_dat$tidbit_id == "2207" | pot_dat$tidbit_id == "227"), ]$tidbit_id <- "22"
# Change 24-07, 2407, and 247 to 21
pot_dat[which(pot_dat$Year == 2008 & pot_dat$tidbit_id == "24-07" | pot_dat$tidbit_id == "2407" | pot_dat$tidbit_id == "247"), ]$tidbit_id <- "24"

# 2009
names(table(pot_dat[pot_dat$Year == 2009, ]$tidbit_id))
names(table(full_dat[full_dat$year == 2009, ]$tidbit_id))
# Looks good!

# 2010
names(table(pot_dat[pot_dat$Year == 2010, ]$tidbit_id))
names(table(full_dat[full_dat$year == 2010, ]$tidbit_id))
# Looks good!

# 2011
names(table(pot_dat[pot_dat$Year == 2011, ]$tidbit_id))
names(table(full_dat[full_dat$year == 2011, ]$tidbit_id))
# Looks like for Tanner survey, Leg 2, 5 got renamed 5_0.
# Let's fix
full_dat[which(full_dat$year == 2011 & full_dat$tidbit_id == "5_0"), ]$tidbit_id <- "5"
# Alright, now it looks good!

# 2012
names(table(pot_dat[pot_dat$Year == 2012, ]$tidbit_id))
names(table(full_dat[full_dat$year == 2012, ]$tidbit_id))
# Looks good!

# 2013
names(table(pot_dat[pot_dat$Year == 2013, ]$tidbit_id))
names(table(full_dat[full_dat$year == 2013, ]$tidbit_id))
# We've got some prefix zeroes in full_dat as well, let's remove those too
full_dat$tidbit_id <- str_remove(full_dat$tidbit_id, "^0")
# Looks good now!

# 2014
names(table(pot_dat[pot_dat$Year == 2014, ]$tidbit_id))
names(table(full_dat[full_dat$year == 2014, ]$tidbit_id))
# Looks good!

# 2015
names(table(pot_dat[pot_dat$Year == 2015, ]$tidbit_id))
names(table(full_dat[full_dat$year == 2015, ]$tidbit_id))
# Looks good!

# 2016
names(table(pot_dat[pot_dat$Year == 2016, ]$tidbit_id))
names(table(full_dat[full_dat$year == 2016, ]$tidbit_id))
# Looks good!

# 2017
names(table(pot_dat[pot_dat$Year == 2017, ]$tidbit_id))
names(table(full_dat[full_dat$year == 2017, ]$tidbit_id))
# Looks good!

# 2018
names(table(pot_dat[pot_dat$Year == 2018, ]$tidbit_id))
names(table(full_dat[full_dat$year == 2018, ]$tidbit_id))
# Looks good!

# 2019
names(table(pot_dat[pot_dat$Year == 2019, ]$tidbit_id))
names(table(full_dat[full_dat$year == 2019, ]$tidbit_id))
# Looks good!
```


# Merge Pot Data and Tidbit ID data

```{r}
# It won't work to merge both full dataframes at once (it's just too large - 400,000 rows x 10,000 rows)
# Instead, we'll try to merge each year individually using a for loop

# Initialize loop with matrix and vector of years
tidbit_pot_merged <- matrix(nrow = 0, ncol = 20)
years <- 2005:2019


for (i in 1:length(years)){

  crab_year <- years[i]
  
  # Filter pot data and tidbit data to only include a single year
  year_pot <- pot_dat %>%
    filter (Year == crab_year)
  year_tidbit <- full_dat %>%
    filter(year == crab_year)
  # Join pot and tidbit data by tidbit ID
  year_dat <- inner_join(year_pot, year_tidbit, by = "tidbit_id")
  # Select only true entries by choosing rows where the data point is at least 1 hour away from set or haul
  
  # Create new column calculating difference between tidbit time and haul time
  year_dat$change_time <- difftime(year_dat$tidbit_datetime, year_dat$Time.Hauled, tz = "US/Alaska", units = "mins")
  # Filter to only include rows 60+ min before haul time in case of minor timekeeping errors
  year_dat <- year_dat %>%
    filter(change_time < -61)
  
  # Change column to difference between tidbit time and set time
  year_dat$change_time <- difftime(year_dat$tidbit_datetime, year_dat$Time.Set, tz = "US/Alaska", units = "mins")
  # Filter to only include rows 60+ min after set time in case of minor timekeeping errors
  year_dat <- year_dat %>%
    filter(change_time > 61)
  
  # Remove column
  year_dat <- year_dat %>% 
    select(-change_time)
  
  # Calculate average temperature, adding a new column in the database for it
  year_dat <- year_dat %>%
    group_by(Time.Hauled) %>%
    mutate(mean_temp = mean(Temp)) %>%
    ungroup
  
  # We'll also calculate the difference between max and min temperatures. This will be a "red flag" column - 
  # entries with a large difference between max and min temps are likely to be the result of data entry errors.
  
  # Calculate max temperature, adding a new column in the database for it
  year_dat <- year_dat %>%
    group_by(Time.Hauled) %>%
    mutate(diff_temp = (max(Temp) - min(Temp))) %>%
    ungroup

  
  # Remove duplicate IDs from Time.Hauled. This leaves us with one row per haul. That eliminates temp (the specific measurements), date, time, and DateTime, but we don't care about that. 
  
  # It keeps average temperature of the pot, which is what we care about!
  
  year_dat <- year_dat[!duplicated(year_dat$Time.Hauled), ]
  
  # Append this to our merged data
  tidbit_pot_merged <- rbind(tidbit_pot_merged, year_dat)
}
```

Alright, looks good! However, we've got some errors. Let's try to fix them!

```{r}
# First, we'll ditch any line with a difference in temperature greater than 3, as it's likely to be an error in readings or in data entry (for the pot ID or tidbit ID).

tidbit_pot_merged <- tidbit_pot_merged %>%
  filter(diff_temp < 3)

# Next, we'll eliminate all pots with a set or haul time of 0:00:00. While I'm sure some may have been hauled or set at precisely midnight, it's more likely that the exact time was simply not entered (or present on the data forms), and thus our values are incorrect. 

# Look at Time.Set column, eliminate rows set at 0:00:00
tidbit_pot_merged$time <- tidbit_pot_merged$Time.Set %>%
  str_split(" ") %>%
  map_chr(2) %>%
  hms()

tidbit_pot_merged <- tidbit_pot_merged %>%
  filter(time != "0S")

# Look at Time.hauled column, eliminate rows set at 0:00:00
tidbit_pot_merged$time <- tidbit_pot_merged$Time.Hauled %>%
  str_split(" ") %>%
  map_chr(2) %>%
  hms()

tidbit_pot_merged <- tidbit_pot_merged %>%
  filter(time != "0S")

# Drop the time column
tidbit_pot_merged <- tidbit_pot_merged %>%
  select(-time)

# We'll now go through the top mean temperatures in search of data entry errors
# These are likely to be errors in noting which Tidbit is which, and thus can't really be searched for.
# We have to do it...sigh...manually

# 2014 RKC, Leg 3, Tidbit 22. Mean temp = 14.11C.
# First deployment of the leg, I think they forgot to add the Tidbit
# Following deployments were approx 6C, don't believe this was valid msmt
tidbit_pot_merged <- tidbit_pot_merged[as.character(tidbit_pot_merged$Time.Hauled) != "2014-07-23 10:29:00", ]

# 2011 Tanner crab, Leg 1
# Almost all our new highest temp values are from those set on 10-07 on this survey
# An examination of NOAA weather reports (at https://www.ndbc.noaa.gov/station_history.php?station=jnea2) don't reveal any unusual warmth
# Furthermore, the pots experience a dramatic cold snap (substantially cooler than air temperature at the time) after they're lifted from the water
# Finally, no pots are listed as being set on 10-08 or pulled on  10-09 which is quite unusual (especially as pots were set on 10-09).
# Finally, pots set the next day are quite cool, typically around 4C. 
# My hypothesis: the captain got their dates mixed up, and entered the pots as being set on the 7th and pulled on the 8th, whereas in reality they were
# set on the 8th and pulled on the 9th.

# Therefore, we'll remove all pots from this day and the one before. 

tidbit_pot_merged$date <- tidbit_pot_merged$Time.Set %>%
  str_split(" ") %>%
  map_chr(1)

tidbit_pot_merged <- tidbit_pot_merged %>%
  filter(date != "2011-10-07")

# 2016 RKC, Leg 3
# Our highest three mean temperatures are all from this survey, specifically from pots set on July 25th (tidbits 16, 7, and 13).
# All temperatures are approx. 9.3C
# However, these look quite legitimate. In all cases, the set time and tidbit data agree. Prior to the pot being set, the tidbit data gives temps
# around 13-14C. And after the pot was hauled, the tidbit data temps return to 13-14C. Therefore, it's quite likely that these temperatures are accurate.


# 2013 Tanner crab, Leg 2
# Most of the remaining top 20 or so temp values are from those set at 10-20 on this survey.
# However, these look legitimate as well. The temps are cooler than those from 2011 Tanner, Leg 2, plus the pots don't experience a cooling after being hauled. # Instead, they warm slightly. An example of this is Tidbit 20, which is approx. 9C from 8pm on Oct 19th until 11am on Oct 20th. This is then followed # by a few hours at approx 11C, then at 1pm it drops to 9C again. This corresponds with the pot data pattern of sets + hauls (set at 8pm on the 19th, hauled at 11am on the 20th, set again at 1pm on the 20th).
# NOAA sea surface temperatures are warm at this time in Juneau (7.7C), and could be warmer at the nearby Port Camden, where these pots were taken.
# Given that they align broadly with the pot data pattern of sets + hauls, and these temperatures are seen across many Tidbits, it's quite likely that these temperatures are accurate.

# Remove unnecessary columns we've created, plus some additional unneeded columns
tidbit_pot_merged <- tidbit_pot_merged %>%
  select(-c(year, survey, tidbit_datetime, Temp, diff_temp, date, Density.Strata, Time.Set))



# Change column names
tidbit_pot_merged <- tidbit_pot_merged %>%
  rename(temp = mean_temp)


# Convert 

# Write out data to file
write.csv(tidbit_pot_merged, file = "../output/ADFG_SE_AK_pot_surveys/cleaned_data/pot_data_with_temperature.csv",
          row.names = FALSE)
```