### R script for the hands-on examples
### Week 8


## Install {tidyverse} and Load the Package ---------------------------------------


## Mini Data Project -------------------------------------------------------------
# This mini data project is based on a real project that focuses on gene expression across different time points.
# A researcher has measured the expression levels of 20 genes (anonymed as 1 to 20) using the RT-qPCR technique.
# The gene expression was assessed in two structures of the mouse brain.
# Mice ranged in age from 10 to 60 days (10, 15, 20, 25, 30, 35, 40, 45, 50, 60 days),
# and the experiment was repeated with both male and female mice,
# with 6 animals (named from A to F) in each group.

# According to the researcher, the data was stored in two files, one for each brain structure.
# Within each file, rows represent the different ages,
# and columns represent the gene, sex, and animal.

# A small Gaussian noise has been added to the original data, preserving the overall structure.

# The data is available in two CSV files:
# - data_anonym_struc1_noise.csv
# - data_anonym_struc2_noise.csv

# We will focus on the data from the brain structure 1.

### Import the Data ---------------------------------------------------------------

# 1. Please download the `data_anonym_struc1_noise.csv` file.
# Observe your data file:
# - Is there a header line?
# - What is the separator between columns?
# - Which character was used for decimal points?
# - Which character was used for missing data (between two seperators where there's no value)?

# 2. Import the `data_anonym_struc1_noise.csv` into RStudio, you can use either:
# - the `read_csv2()` from the package {`readr`} (`?readr::read_csv2`), or
# - use the click-button way and copy-paste the code in your script.

## Don't forget to use/select the appropriate parameters to make sure you import correctly the data.

# Name the data as `data1`.
# Convert your imported data to tibble format if it's not the case.

# What is the data dimension?


# 3. Show the first 10 columns of your data.


# 4. Rename the first column as `age`.


### Reshape the Data --------------------------------------------------------------

# How should the data be organized?
# 5. Reshape data to "tidy" format with the `pivot_longer()` function.
# (tidy format: each variable is a column, each observation is a row.)
# What are the columns to be included to pivot into longer format?


# 6. Add a column `struc` which contains the name of the measured structure `s1`.


# 7. Extract information about gene, sex and animal from the column `id` using the `extract()` function. Name the new columns as "gene_id", "sex" and "animal".

# Hint: Find the patterns for the extraction.
# You can use AI to help you to write the regular expression.


# Now, the data is ready for downstream analysis.

### Manipulate the Data -----------------------------------------------------------

# For question 8 to 11, let's focus on gene 1 from the data.

# 8. At age of 10 days, which animal has the highest expression value for gene 1 overall?
#     And which animal has the highest expression value in each sex?


# 9. Is there any missing value for gene 1?
#     If yes, how to remove lines with NA?


# 10. After removing NAs, how many animals are there for each sex in gene 1?


# 11. Summarize the median, mean, and standard deviation of gene 1 expression for both sexes.


### Explore the Data ---------------------------------------------------------------

# What kind of analysis would you like to perform with this data?
# In statistics, it's common to begin by exploring the dataset as a whole and visualizing the relationships
#   between different variables.
# The basic R function `pairs()` (`?pairs`) is useful for creating a matrix of scatter plots to examine
#   the relationships between each pair of continuous variables.
# For instance, we can explore the relationships between continuous variables such as age and
#   the expression levels of genes 1, 2, 3, *etc.*

# 12. How will you reshape the `data1_long` to provide the necessary data for the `pairs()` function?


# To save space, we will focus on examining the relationship between age and the first 5 genes.
# 13. What did you observe from these scatter plots?

## put histograms on the diagonal
panel.hist <- function(x, ...) {
  usr <- par("usr")
  par(usr = c(usr[1:2], 0, 1.5) )
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y/max(y)
  rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
## put (absolute) correlations on the upper panels,
## with size proportional to the correlations.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y, use = "na.or.complete")) # modified to allow NA
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * r)
}

pairs(
  x = data1_wider[, c(1, 5:9)], # age and the first 5 genes
  diag.panel = panel.hist,
  lower.panel = panel.smooth,
  upper.panel = panel.cor
)


# 14. Calculate the correlation between gene 1 and 2. (`?cor`)


# It seems that there are two groups of mice that express genes 4 and 5 in a similar way.
# 15. Draw a scatter plot using {`ggplot2`} to show the expression levels of genes 4 and 5.
#     Color the points by different categorical variables that we have, *i.e.*, age, sex, and animal.
#     Is there any categorical variable that can explain the groups we observed in the figure?


# ## Bonus

# # Use the `read.table()` function to import the data and continue to reshape the data based on the imported data.
# # (Check the approporiate parameters to be included with `?read.table`)