---
title: "Programming in R"
subtitle: "GEN242: Data Analysis in Genome Biology"
author: "Thomas Girke"
date: today
format:
  revealjs:
    theme: [default, ../../assets/revealjs_custom.scss]
    slide-number: true
    progress: true
    scrollable: true
    smaller: true
    highlight-style: github
    code-block-height: 380px
    transition: slide
    embed-resources: true
    footer: "GEN242 · UC Riverside · [Tutorial](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/rprogramming/rprogramming_index.html) · [HW04](https://girke.bioinformatics.ucr.edu/GEN242/assignments/homework/hw04/hw04.html)"
    logo: "https://girke.bioinformatics.ucr.edu/GEN242/assets/logo_gen242.png"
execute:
  echo: true
  eval: false
---

## Overview

One of the main attractions of R is how easy it is to write custom functions
and programs — even for users with no prior programming experience. Once the
basic control structures are understood, R becomes a powerful environment for
complex custom analyses of almost any type of data.

Topics covered in this tutorial:

- Why programming in R?
- R scripts — structure and execution
- Control structures: `if`, `ifelse`
- Loops: `for`, `while`, `apply` family
- Speed performance of loops
- Writing custom functions
- Useful utilities: regex, string operations, debugging
- Executing R scripts from console and command-line
- Programming exercises (HW04)

::: {.callout-note}
**Prerequisite:** The [R Basics tutorial](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/rbasics/rbasics_index.html)
provides the foundational R knowledge assumed here.
:::

---

## Why Programming in R?

- Powerful statistical environment and programming language
- Facilitates **reproducible research** — analyses are scripts, not clicks
- Efficient data structures make programming very easy
- Easy to implement custom functions for any analysis
- Powerful, publication-quality graphics
- Access to a rapidly growing ecosystem of packages
- Widely used language in bioinformatics
- Standard for data mining and biostatistical analysis
- Free, open-source, available for all operating systems

---

## R Scripts {.scrollable}

An **R script** is a plain text file (`.R` or `.Rmd/.qmd`) containing R code and comments.
It is the primary way to write reproducible analyses.

### Structure of a well-organized R script

```{r}
#!/usr/bin/env Rscript
# -----------------------------------------------
# Script: my_analysis.R
# Author: Thomas Girke
# Description: Example analysis workflow
# -----------------------------------------------

## Section 1 — Load libraries
library(ggplot2)
library(dplyr)

## Section 2 — Import data
myDF <- read.delim("mydata.txt", sep="\t")

## Section 3 — Analysis
result <- myDF |> filter(value > 10) |> summarize(mean=mean(value))

## Section 4 — Export
write.table(result, "result.txt", sep="\t", quote=FALSE)
```

### Style guidelines

- Use `#` for comments — explain *why*, not just *what*
- Organize code into labeled sections with `## Section name`
- Keep functions in a separate file and load with `source()`
- Follow the [Tidyverse style guide](http://adv-r.had.co.nz/Style.html)
- Use the [formatR](https://yihui.org/formatr/) package to auto-format scripts

### Rmd and Quarto scripts

`.Rmd` and `.qmd` files extend R scripts with narrative text, results, and
formatted output. They render to HTML, PDF, and other formats.
Details in the [R Markdown tutorial](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/rmarkdown/rmarkdown_index.html).

---

## Control Structures — Operators

### Comparison operators

| Operator | Meaning |
|---|---|
| `==` | equal |
| `!=` | not equal |
| `>` / `>=` | greater than / or equal |
| `<` / `<=` | less than / or equal |

### Logical operators

| Operator | Meaning | Scope |
|---|---|---|
| `&` | AND | element-wise on vectors |
| `&&` | AND | first element only — use in `if` statements |
| `\|` | OR | element-wise on vectors |
| `\|\|` | OR | first element only — use in `if` statements |
| `!` | NOT | |

::: {.callout-tip}
Use `&&` and `||` in `if` statements (they evaluate only the first element and
short-circuit). Use `&` and `|` for element-wise operations on vectors.
:::

---

## Conditional Execution — `if` and `ifelse` {.scrollable}

### `if` statement — operates on a single logical value

```{r}
if (TRUE) {
    statements_1
} else {
    statements_2
}
```

::: {.callout-warning}
Keep `} else {` on the same line — avoid a newline before `else` or R will
misparse the statement.
:::

### Examples

```{r}
# Basic if / else
if (1 == 0) {
    print(1)
} else {
    print(2)     # runs this branch
}

# if / else if / else chain
if (1 == 0) {
    print(1)
} else if (1 == 2) {
    print(2)
} else {
    print(3)     # runs this branch
}
```

### `ifelse` — vectorized conditional, operates on entire vectors

```{r}
ifelse(test, true_value, false_value)   # syntax
```

```{r}
x <- 1:10
ifelse(x < 5, sqrt(x), 0)   # sqrt for values < 5, else 0
```

`ifelse` is much more efficient than a `for` loop with an `if` inside when
operating on vectors.

---

## `for` Loops {.scrollable}

Iterate over elements of a sequence:

```{r}
for (variable in sequence) {
    statements
}
```

### Example — compute row means (append approach)

```{r}
mydf <- iris
myve <- NULL
for (i in seq(along=mydf[,1])) {
    myve <- c(myve, mean(as.numeric(mydf[i, 1:3])))  # appends result each iteration
}
myve[1:8]
```

::: {.callout-warning}
**The append approach (`c()`) is slow for large objects** — each iteration
creates a new copy of the entire vector. Use the inject approach instead.
:::

### Inject approach — pre-allocate the result vector (much faster)

```{r}
myve <- numeric(length(mydf[,1]))   # pre-allocate vector of correct length
for (i in seq(along=myve)) {
    myve[i] <- mean(as.numeric(mydf[i, 1:3]))   # assign result by index
}
myve[1:8]
```

### Conditional stop inside a loop

Use `stop()` to break out of a loop with an error message when a condition is met:

```{r}
x <- 1:10
z <- NULL
for (i in seq(along=x)) {
    if (x[i] < 5) {
        z <- c(z, x[i]-1)
        print(z)
    } else {
        stop("values need to be < 5")   # breaks loop and prints error
    }
}
```

---

## `while` Loop

Iterates as long as a condition remains `TRUE`:

```{r}
while (condition) {
    statements
}
```

### Example

```{r}
z <- 0
while (z < 5) {
    z <- z + 2    # increment z each iteration
    print(z)      # prints: 2, 4, 6  (stops when z >= 5)
}
```

::: {.callout-tip}
Use `while` when the number of iterations is not known in advance.
Use `for` when iterating over a known sequence or vector.
:::

---

## The `apply` Function Family {.scrollable}

Apply functions avoid explicit loops and are often more readable and faster.

### `apply` — apply a function over rows or columns of a matrix/data.frame

```{r}
apply(X, MARGIN, FUN, ...)
# X:      matrix, array, or data.frame
# MARGIN: 1 = rows, 2 = columns
# FUN:    function to apply
```

```{r}
apply(iris[1:8, 1:3], 1, mean)    # row-wise mean for first 8 rows, cols 1-3
apply(iris[, 1:4], 2, mean)       # column-wise mean for all numeric columns
```

### `tapply` — apply a function to groups defined by a factor

```{r}
tapply(vector, factor, FUN)
```

```{r}
tapply(iris$Sepal.Length, iris$Species, mean)   # mean Sepal.Length per species
```

### `lapply` and `sapply` — apply a function to each element of a list or vector

```{r}
l <- list(a=1:10, beta=exp(-3:3), logic=c(TRUE,FALSE,FALSE,TRUE))

lapply(l, mean)    # returns a list
sapply(l, mean)    # returns a vector or matrix when possible
vapply(l, mean, FUN.VALUE=numeric(1))   # safer: enforces output type
```

Often used with an inline anonymous function:

```{r}
sapply(names(l), function(x) mean(l[[x]]))    # same result, explicit element access
```

### Choosing between `lapply`, `sapply`, `vapply`

| Function | Returns | Best for |
|---|---|---|
| `lapply` | always a list | when output types may vary |
| `sapply` | vector/matrix if possible, else list | interactive use |
| `vapply` | vector/array of specified type | scripts — safer, faster |

---

## Loop Speed Performance {.scrollable}

Looping over large data sets can be slow. The key principle: **avoid growing
objects inside loops** and prefer **vectorized operations** over loops entirely.

### Test matrix used in all benchmarks

```{r}
myMA <- matrix(rnorm(1000000), 100000, 10,
               dimnames=list(1:100000, paste("C", 1:10, sep="")))
```

### 1. `for` loop: append vs inject

```{r}
# SLOW — append approach: c() creates a new copy each iteration
results <- NULL
system.time(for(i in seq(along=myMA[,1])) results <- c(results, mean(myMA[i,])))
#  user: 39.2s   elapsed: 45.6s

# FAST — inject approach: pre-allocate, assign by index
results <- numeric(length(myMA[,1]))
system.time(for(i in seq(along=myMA[,1])) results[i] <- mean(myMA[i,]))
#  user: 1.6s    elapsed: 1.6s   → ~25× faster
```

### 2. `apply` vs `rowMeans`

```{r}
system.time(myMAmean <- apply(myMA, 1, mean))     # user: 1.5s
system.time(myMAmean <- rowMeans(myMA))            # user: 0.005s  → ~300× faster
```

### 3. `apply` vs vectorized calculation

```{r}
# apply loop for row-wise standard deviation
system.time(myMAsd <- apply(myMA, 1, sd))                               # user: 3.7s

# vectorized equivalent — same result, dramatically faster
system.time(myMAsd <- sqrt((rowSums((myMA - rowMeans(myMA))^2)) /
                            (length(myMA[1,]) - 1)))                    # user: 0.02s
```

### Key takeaways

- Use `rowMeans`, `rowSums`, `colMeans`, `colSums` instead of `apply` where possible
- Pre-allocate result objects before loops — never grow with `c()`, `cbind()`, `rbind()`
- Design data structures for **matrix-to-matrix** operations — eliminates loops entirely

---

## Fast Querying with Matrix Operations {.scrollable}

A practical example: filtering differentially expressed genes (DEGs) across
multiple contrasts using matrix-to-matrix logic — no looping required.

### Create a test DEG matrix (LFCs and p-values)

```{r}
lfcPvalMA <- function(Nrow=200, Ncol=4, stats_labels=c("lfc", "pval")) {
    set.seed(1410)
    assign(stats_labels[1], runif(n=Nrow*Ncol, min=-4, max=4))
    assign(stats_labels[2], runif(n=Nrow*Ncol, min=0,  max=1))
    lfc_ma  <- matrix(lfc,  Nrow, Ncol, dimnames=list(paste("g",1:Nrow,sep=""),
                      paste("t",1:Ncol,"_",stats_labels[1],sep="")))
    pval_ma <- matrix(pval, Nrow, Ncol, dimnames=list(paste("g",1:Nrow,sep=""),
                      paste("t",1:Ncol,"_",stats_labels[2],sep="")))
    statsMA <- cbind(lfc_ma, pval_ma)
    return(statsMA[, order(colnames(statsMA))])
}
degMA <- lfcPvalMA(Nrow=200, Ncol=4)
degMA[1:4,]
```

### Separate LFC and p-value matrices into a list

```{r}
degList <- list(
    lfc  = degMA[, grepl("lfc",  colnames(degMA))],
    pval = degMA[, grepl("pval", colnames(degMA))]
)
sapply(degList, dim)   # confirm dimensions match
```

### Apply combinatorial filter (|LFC| >= 1 AND pval <= 0.5)

```{r}
queryResult <- (degList$lfc >= 1 | degList$lfc <= -1) & degList$pval <= 0.5
colnames(queryResult) <- gsub("_.*", "", colnames(queryResult))
queryResult[1:4,]
```

### Extract matching gene IDs

```{r}
# Per-column: genes passing filter in each contrast
matchingIDlist <- sapply(colnames(queryResult),
                         function(x) names(queryResult[queryResult[,x], x]),
                         simplify=FALSE)

# Across columns: genes passing filter in > 2 contrasts
matchingID <- rowSums(queryResult) > 2
names(matchingID[matchingID])    # gene names meeting the threshold
```

::: {.callout-tip}
Storing LFC and p-value as two parallel matrices in a list enables
flexible, fast, zero-loop combinatorial filtering — a pattern worth
reusing in any multi-contrast analysis.
:::

---

## Functions — Overview and Syntax {.scrollable}

Functions are the primary way to organize and reuse code in R.
Almost everything in R is a function.

### Define a function

```{r}
myfct <- function(arg1, arg2, ...) {
    # function body — operations on the arguments
    result <- arg1 + arg2
    return(result)    # return value explicitly, or just: result
}
```

### Call a function

```{r}
myfct(arg1=3, arg2=4)   # with argument names (recommended)
myfct(3, 4)              # positional — order must match definition
```

### Key rules

| Concept | Rule |
|---|---|
| **Naming** | Avoid names of existing functions (e.g. don't name a function `mean`) |
| **Default args** | Provide defaults with `arg=value` — caller can then omit them |
| **Empty args** | `function() { ... }` — valid for functions that always return the same value |
| **`...`** | Pass unknown arguments through to another function |
| **Return value** | Last unassigned expression, or explicit `return()` |
| **Scope** | Variables inside a function are local — invisible outside |
| **Global assign** | Use `<<-` to force a variable to exist in the global environment |

---

## Functions — Examples {.scrollable}

### Define a function with a default argument

```{r}
myfct <- function(x1, x2=5) {   # x2 has default value 5
    z1 <- x1 / x1
    z2 <- x2 * x2
    myvec <- c(z1, z2)
    return(myvec)
}
```

### Call with and without the default argument

```{r}
myfct(x1=2, x2=5)   # explicit: returns c(1, 25)
myfct(2, 5)          # positional: same result
myfct(x1=2)          # uses default x2=5: same result
myfct                 # without () prints the function definition
```

### Scope — variables inside functions are local

```{r}
x <- 10                     # global x
myfct2 <- function() {
    x <- 99                 # local x — does not affect global x
    cat("inside:", x, "\n")
}
myfct2()                    # prints: inside: 99
cat("outside:", x, "\n")    # prints: outside: 10

# Force global assignment with <<-
myfct3 <- function() {
    x <<- 99                # modifies global x
}
myfct3()
cat("outside:", x, "\n")    # prints: outside: 99
```

::: {.callout-tip}
Avoid `<<-` in general — it makes code harder to reason about. Prefer
returning values explicitly with `return()` and assigning outside the function.
:::

---

## Useful Utilities — Debugging {.scrollable}

R provides several tools for finding and fixing errors in code:

| Function | Purpose |
|---|---|
| `traceback()` | Shows the call stack after an error |
| `browser()` | Insert a breakpoint — pauses execution and opens interactive prompt |
| `debug(myfct)` | Step through `myfct` line by line |
| `undebug(myfct)` | Remove debug mode from a function |
| `options(error=recover)` | On error, open interactive debugger at the call stack |
| `options(error=NULL)` | Reset to default error handling |

```{r}
# Example: use browser() as a breakpoint inside a function
myfct <- function(x) {
    browser()          # execution pauses here — inspect variables interactively
    result <- x^2
    return(result)
}
myfct(5)
```

Full guide: [Debugging in R (Advanced R)](https://adv-r.hadley.nz/debugging.html)

---

## Useful Utilities — Regular Expressions {.scrollable}

R's regex utilities work similarly to other languages. Main reference: `?regexp`

### Pattern matching with `grep`

```{r}
month.name[grep("^A", month.name)]        # months starting with A
grep("^J", month.name, value=TRUE)        # same with value=TRUE
grepl("^A", month.name)                   # returns logical vector
```

### String substitution with `gsub` and `sub`

```{r}
# gsub: replace ALL matches
gsub("(i.*a)", "xxx_\\1", "virginica", perl=TRUE)   # back reference with \\1

# sub: replace FIRST match only
sub("a", "X", "banana")    # returns "bXnana"
```

### String operations

```{r}
# Insert a character with back reference, then split on it
x <- gsub("(a)", "\\1_", month.name[1], perl=TRUE)   # "J_anu_ary"
strsplit(x, "_")                                       # split on "_"

# Reverse a string
paste(rev(unlist(strsplit("hello", NULL))), collapse="")   # "olleh"
```

### Import lines matching a pattern from a file

```{r}
cat(month.name, file="months.txt", sep="\n")         # write months to file
x <- readLines("months.txt")                          # read all lines
x[grep("^J", x, perl=TRUE)]                          # keep lines starting with J
```

---

## Useful Utilities — String and Time Functions {.scrollable}

### String paste and manipulation

```{r}
paste("sample", 1:5, sep="_")           # "sample_1" ... "sample_5"
paste0("C", 1:5)                         # "C1" ... "C5" (no separator)
paste(month.name[1:3], collapse=", ")    # "January, February, March"
nchar("hello")                           # 5  (string length)
toupper("hello"); tolower("HELLO")       # case conversion
trimws("  hello  ")                      # remove leading/trailing whitespace
```

### Interpret a string as R code

```{r}
myfct <- function(x) x^2
mylist <- ls()
n <- which(mylist %in% "myfct")

get(mylist[n])       # retrieves the object named by the string
get(mylist[n])(2)    # calls it as a function with argument 2
eval(parse(text=mylist[n]))   # alternative: parse string as expression
```

### Timing and system calls

```{r}
system.time(ls())       # measure time for an expression
date()                  # current system date and time
Sys.sleep(1)            # pause R for 1 second
```

### Call external command-line tools from R

```{r}
system("blastall -p blastp -i seq.fasta -d uniprot -o seq.blastp")
system2("blastp", args=c("-i", "seq.fasta", "-d", "uniprot"))
```

### File integrity check

```{r}
library(tools)
md5 <- as.vector(md5sum(dir(R.home(), pattern="^COPY", full.names=TRUE)))
identical(md5, md5)                        # TRUE
identical(md5, sub("^b", "z", md5))       # FALSE — detects any change
```

---

## Executing R Scripts {.scrollable}

### From the R console

```{r}
source("my_script.R")    # execute entire script, output printed to console
```

### From the command-line (preferred for automation)

```bash
Rscript myscript.R                   # standard method
./myscript.R                         # requires shebang + executable permission
R CMD BATCH myscript.R               # older alternative
R --slave < myscript.R               # older alternative
```

### Shebang line — required for `./myscript.R` execution

Add as the **first line** of the script:

```bash
#!/usr/bin/env Rscript
```

Then make the script executable:

```bash
chmod +x myscript.R
./myscript.R
```

### Passing arguments from command-line to R

Create `test.R`:

```{r}
myarg <- commandArgs()
print(iris[1:myarg[6], ])   # myarg[6] receives the first user-provided argument
```

Run it:

```bash
Rscript test.R 10    # prints first 10 rows of iris
```

::: {.callout-tip}
For scripts accepting multiple complex arguments, use the
[argparse](https://cran.r-project.org/web/packages/argparse/index.html)
or [optparse](https://cran.r-project.org/web/packages/optparse/index.html)
packages for clean argument parsing with help messages.
:::

---

## Programming Exercises {.scrollable}

### Exercise 1 — Comparing loop approaches for row-wise means

Create the test matrix:

```{r}
myMA <- matrix(rnorm(500), 100, 5,
               dimnames=list(1:100, paste("C", 1:5, sep="")))
```

**Task 1.1** — `for` loop with append (slow but instructive):

```{r}
myve_for <- NULL
for (i in seq(along=myMA[,1])) {
    myve_for <- c(myve_for, mean(as.numeric(myMA[i,])))
}
```

**Task 1.2** — `while` loop:

```{r}
z <- 1; myve_while <- NULL
while (z <= nrow(myMA)) {
    myve_while <- c(myve_while, mean(as.numeric(myMA[z,])))
    z <- z + 1
}
```

**Task 1.3** — confirm both methods give identical results:

```{r}
all(myve_for == myve_while)    # should return TRUE
```

**Task 1.4** — `apply` loop:

```{r}
myve_apply <- apply(myMA, 1, mean)
```

**Task 1.5** — built-in `rowMeans` (fastest):

```{r}
mymean <- rowMeans(myMA)
# Compare all approaches side by side:
myResult <- cbind(myMA, mean_for=myve_for, mean_while=myve_while,
                  mean_apply=myve_apply, mean_rowMeans=mymean)
myResult[1:4, -c(1,2,3)]    # show only the mean columns
```

---

## Programming Exercises (cont.) {.scrollable}

### Exercise 2 — Custom function for grouped column means

**Task 2.1** — implement a function that computes means for user-specified
column groups in any matrix or data frame:

```{r}
myMA <- matrix(rnorm(100000), 10000, 10,
               dimnames=list(1:10000, paste("C", 1:10, sep="")))

# Group columns: cols 1-3 → group 1, cols 4-6 → group 2, etc.
myList <- tapply(colnames(myMA), c(1,1,1,2,2,2,3,3,4,4), list)
names(myList) <- sapply(myList, paste, collapse="_")

# Apply mean to each column group
myMAmean <- sapply(myList, function(x) apply(myMA[,x], 1, mean))
myMAmean[1:4,]
```

### Exercise 3 — Nested loops: pairwise similarity matrix

**Task 3.1** — create a list of character vectors of varying lengths:

```{r}
setlist <- lapply(11:30, function(x) sample(letters, x, replace=TRUE))
names(setlist) <- paste("S", seq(along=setlist), sep="")
```

**Task 3.2** — compute all pairwise intersect sizes:

```{r}
setlist <- sapply(setlist, unique)    # remove duplicates first
olMA <- sapply(names(setlist), function(x)
               sapply(names(setlist), function(y)
               sum(setlist[[x]] %in% setlist[[y]])))
olMA[1:4, 1:4]
```

**Task 3.3** — plot as heatmap:

```{r}
library(pheatmap); library(RColorBrewer)
pheatmap(olMA, color=brewer.pal(9,"Blues"),
         cluster_rows=FALSE, cluster_cols=FALSE,
         display_numbers=TRUE, number_format="%.0f", fontsize_number=10)
```

---

## HW04 {.scrollable}

::: {.callout-important}
**Assignment:** [HW04 — Programming in R](https://girke.bioinformatics.ucr.edu/GEN242/assignments/homework/hw05/hw05/)

The programming exercises above (Exercises 1–3) form the basis of HW04.
:::

### Summary of exercise tasks

| Exercise | Task | Topic |
|---|---|---|
| **1.1** | `for` loop with append | Row means on matrix |
| **1.2** | `while` loop | Same computation |
| **1.3** | Confirm identical results | `all()` comparison |
| **1.4** | `apply` loop | Same computation |
| **1.5** | `rowMeans` | Fastest approach |
| **2.1** | Custom function | Grouped column means with `tapply` + `sapply` |
| **3.1** | Create list | Character vectors of varying lengths |
| **3.2** | Nested loops | Pairwise intersect matrix with `%in%` |
| **3.3** | Heatmap | Visualize similarity matrix with `pheatmap` |

::: {.callout-note}
For the full homework instructions and submission details, see the
[HW04 page](https://girke.bioinformatics.ucr.edu/GEN242/assignments/homework/hw04/hw04.html).
:::

---

## Summary — Key R Programming Commands

| Category | Command | Purpose |
|---|---|---|
| **Conditionals** | `if / else` | single-value branching |
| | `ifelse(test, yes, no)` | vectorized conditional |
| **Loops** | `for (i in seq)` | iterate over sequence |
| | `while (cond)` | iterate while condition is TRUE |
| | `stop("msg")` | break loop with error |
| **Apply family** | `apply(X, 1, FUN)` | rows of matrix |
| | `apply(X, 2, FUN)` | columns of matrix |
| | `tapply(vec, fac, FUN)` | by factor groups |
| | `sapply(list, FUN)` | returns vector/matrix |
| | `lapply(list, FUN)` | returns list |
| **Speed** | `rowMeans`, `rowSums` | fast built-in row ops |
| | pre-allocate with `numeric(n)` | avoid growing with `c()` |
| **Functions** | `function(arg1, arg2=default)` | define |
| | `return(value)` | explicit return |
| **Strings** | `grep`, `grepl` | pattern match |
| | `gsub`, `sub` | pattern substitute |
| | `strsplit`, `paste`, `paste0` | split / combine |
| **Scripts** | `source("script.R")` | run from R console |
| | `Rscript script.R` | run from command-line |
| **Timing** | `system.time(expr)` | benchmark expression |

**Next:** T5 — [Parallel R](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/rparallel/rparallel_index.html)