### R script for the hands-on examples
### Week 6

## "Fix the Code" Challenge -------------------------------------------------------------------

## 1. Fix the data frame error.
data <- data.frame(
  gene = c("BRCA1", "TP53", "MYC")
  expression = c(10.5, 8.2, 12.7)
  condition = c("Tumor", "Normal", "Tumor")
)


## 2. Fix error when using the `mean()` function.
counts <- data.frame(
  sample1 = c(1, 2, 3),
  sample2 = c(1, 1, 3),
  sample3 = c(0, 1, 2),
  row.names = paste0("gene", 1:3)
)
mean(counts["gene3", ])


## 3. Fix error in condition.
##    The goal is to find which value is 3 from the vector.
vec_num <- c(4, 6, 8, 3, 2, 4, 7, 3, 9, 3)
which(vec_num = 3)


## 4. Fix error in ggplot2.
##    The goal is to show petal length with a boxplot for each species.

ggplot(iris, aes(x = Sepal.Length, y = Species))
geom_boxplot()


## 5. Fix error in ggplot2. The aim is to draw boxplot for each group.

# simulate data for two groups of samples.
set.seed(1)
data <- data.frame(
  gp = rep(1:2, each = 20),
  value = c(rnorm(20), rnorm(20, mean = 5))
)
str(data)
# draw boxplot by group.
ggplot(data, aes(x = gp, y = value)) + 
  geom_boxplot()


## 6. Fix code in data filtration.
# try to keep rows where the value is smaller than -0.5
data[data$value<-0.5, ]


## Mini Data Project -------------------------------------------------------------------------

## Gene expression data were collected from 12 samples of two groups.
## Help the researcher to perform data quality control and
## check if the expression is different between groups.


## Tasks:
## 1. Download the data: https://raw.githubusercontent.com/InforBio/ioc-r-2026/refs/heads/main/ioc_r/exos_data/rna_seq_simulated_data.csv
##    and import it into Rstudio, name it `rnaseq_data`.


## 2. Perform data quality control and describe the data.
## - If missing value exists, which variable(s) and sample(s) are concerned?
## - Do we need to recode variable?

## 3. Remove rows with missing values.
## Create a new dataset without missing values, name it `data_clean`.


## 4. Find the average expression (`mean()`) and the standard deviation (`sd()`)
##    for each sample group (Control *vs.* Case)


## 5. Use `data_clean` to visualize expression between two groups.


## 6. To go futher: Let's visualise the average expression of each group using a barplot,
##                  with the help of ChatGPT (or any other AI tool).

## 6a. Prepare data for barplot with error bars.
##    We need to reorganize the data in a dataframe with 2 rows and 3 columns:
##    - the column `group` contain the name of each group
##    - the column `mean_val` contain the average expression of each group
##    - the column `se_val` contain the standard error of each group


## 6b. Draw the bar plot:
##    - Plot bars for mean expression (`geom_bar(stat = "identity")`)
##    - Add error bars for standard errors (`geom_errorbar()`)
##    - Change the aesthetic apsects as you want, *e.g.*: color, title, legend, *etc.*


## 6c. What if we want to add the expression level of each sample to the bar plot?
## Hint: Add another layer for drawing points (`geom_point`),
##       using the data frame that contains the individual data (`data_clean`).


## 7. Is the expression different between groups?
## Select a test to compare the expression between groups.
## What is your conclusion?