---
title: "STAT 302 Statistical Computing"
subtitle: "Lecture 1: Introduction and R Basics"
author: "Yikun Zhang (_Autumn 2023_)"
date: ""
output:
xaringan::moon_reader:
css: ["aut.css", "fonts.css"]
lib_dir: libs
nature:
highlightStyle: tomorrow-night-bright
highlightLines: true
countIncrementalSlides: false
titleSlideClass: ["center","top"]
---
```{r setup, include=FALSE, purl=FALSE}
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(comment = "##")
library(kableExtra)
```
# Outline
1. Course Overview
2. Introduction to R, RStudio, and R Markdown
3. Elementary Operations in R
4. Data Types in R
* Acknowledgement: Parts of the slides are modified from the course materials by Prof. Ryan Tibshirani, Prof. Yen-Chi Chen, Prof. Deborah Nolan, Bryan Martin, and Andrea Boskovic.
---
class: inverse
# Part 1: Course Overview and Logistics
---
# What Is Statistical Computing?
--
Answer: A computing program that does Statistics!
Well! Let's see some "official answers"...
--
From ChatGPT:
---
# What Is Statistical Computing?
--
Statistical computing is a course with intensive programming tasks that are related to Statistics.
--
In other words, there will be a lot of coding in this course!!
---
# Why Do We Learn Statistical Computing?
--
- We want to utilize (big) data to address scientific questions.
Cited from
https://bleuwire.com/5-biggest-big-data-challenges/.
---
# Why Do We Learn Statistical Computing?
**An Example From My Research**: Cosmic Web Detection with Observed Galaxies in the Sloan Digital Sky Survey.
See my paper at
MNRAS and our cosmic web catalog (i.e., a well-documented dataset).
--
One scientific question that we address here is "*how is the stellar mass of a galaxy correlated with its distance to nearby cosmic web structures?*"
---
# Why Do We Learn Statistical Computing?
- We need to conduct simulation studies to validate our statistical theory and methodology.
--
- For example, we can verify the asymptotic normality of our proposed statistical estimator with finite samples.
See my recent paper
https://arxiv.org/abs/2309.06429.
---
# Why Do We Learn Statistical Computing?
- Mastering statistical computing skills can give us better jobs.
--
Sources from US News in 2021.
---
# Syllabus
Let's spend some time going over the [Course Syllabus](https://canvas.uw.edu/courses/1662584/assignments/syllabus).
---
# Canvas Discussion
* It is worth up to 2% extra credit on the final grade.
* Only substantive and helpful questions will be counted.
.pull-left[
### Bad questions:
* How do you do Problem 2?
* Here's my code and it's broken. How can I fix it?
]
--
.pull-right[
### Good questions:
* Here's a snippet of code that I used for Problem 2:
`formatted code snippet`
It returned the following error:
`formatted error message`
Does anyone know why? I already tried...
* I don't understand the concept from Slide 18 today. Could anyone elaborate on why...?
]
---
# Canvas Discussion
* It is worth up to 2% extra credit on the final grade.
* Only substantive and helpful answers will be counted.
.pull-left[
### Bad or null answers:
* Here's my solution:
`formatted code snippet`
* The grader is wrong. You should ask the grader to add your points back... (*However, you are encouraged to point out my mistakes and typos during lectures or on the discussion board.*)
]
--
.pull-right[
### Good answers:
* This error message occurs because your variable is a string instead of a numeric.
Have you tried checking...?
* I think that Slide 18 in Lecture 2 will address your questions.
]
---
# Why R?
R is a programming language developed by statisticians for statistical computing.
### Pros:
* R is open-source and has a community of developers and users.
* It is convenient for statistical analysis and data visualization...
--
### Cons:
* R is slow unless we use parallel computing packages or [Rcpp](https://www.rcpp.org/).
* It is not very popular outside of the statistical community.
---
# Why R?
.pull-left[
Windows Interface
]
.pull-right[
Linux/Unix Terminal
]
--
It is not convenient to write programs with thousands of R code lines directly in the R interface!
---
# Why RStudio?
Luckily, we have [RStudio](https://posit.co/download/rstudio-desktop/), an integrated development environment (IDE) designed for writing and running R programming.
--
* We recommend to first install R. Then, Rstudio will automatically locate the R directory in our computer.
* It helps us organizes R scripts, files, plots, code console, etc.
* It provides helpful interactive graphical interface.
--
* And more essentially, it has R Markdown integration.
For the rest of the course, we will use Rstudio to write our code and finish homework/lab assignments.
---
# RStudio Interface
By default...
* *Top left*: Editor panel. Browse and edit scripts or data with tabs.
* *Top right*: List of objects in the Environment (recall `ls()`), code history, etc.
* *Bottom left*: Console for running R code line-by-line (`>` prompt)
* *Bottom right*: Files, plots, packages, help files, etc.
If the Edit window is not open, then choose File -> New File -> Choose R Script.
---
# Editor
* Our important code should be written here (**not** the console).
* Primarily used for writing and editing .R or .Rmd scripts.
* Try opening a file now using *File > New File > R Script*, write two lines of simple code, such as `1 + 3` or `a = 6`.
* Click `Run` in the bar above the script. What happens?
* Click on one of the lines of code. Press `Ctrl`/`⌘` + `Enter`. What happens?
--
.center[**Important:** Every part of our R workflow belongs in this window!]
---
# Console and Environment/History
#### Console
* It gives us an easy way to run and test individual lines of code.
* Nothing that we run here will be saved after we close Rstudio (unless you save the R history)!
--
#### Environment/History
* The variables that we defined can be seen in the _Environment_ tab.
* Click on the _History_ tab to see what it contains. Try searching!
* Select a line from the _History_ tab and click `To Source`. What happens?
- It is useful for adding lines that we tested in our Console to our R scripts.
---
# Files, Plots, Packages, Help
* _Files_ tab is used to browse the files on our computer.
* Open files/data, move files that we are working with, etc.
* **Use caution!** Changing files here is the same as changing them on our computer. If we delete something, it's gone!
* _Plots_ tab is used to display plots that we create in R.
* _Help_ tab is used to browse the documentations of functions. We can explore these by preceding a function name with `?`.
Try `?sqrt` to see its user documentation. (If we are unsure about any function, ask R in this way!)
* _Packages_ tab shows all the packages that we currently have installed. (We will discuss more about it later.)
---
# Why R Markdown?
[R Markdown](https://rmarkdown.rstudio.com/) is a markup language for combining R code with text.
* It facilitates the creations of those neat HTML files, PDF documents, slides (like the one I am using), webpages, books, etc.
--
* And more importantly, it is required for our lab and homework assignments!
---
# Create an R Markdown File
Let's try creating an R Markdown file:
1. Choose *File > New File > R Markdown...*.
2. Make sure *PDF Output* is selected and click OK.
3. Save the file in your new folder, call it `stat302_test1.Rmd`.
4. Click the *Knit* button
* After it is done, browse to the file location using the `Files` tab. What have been added?
Note: The PDF output requires an installation of $\LaTeX$; see the instructions [here](https://bookdown.org/yihui/rmarkdown/installation.html).
---
# R Markdown Syntax
.pull-left[
## Output
**bold/strong emphasis**
*italic/normal emphasis*
.forcehead[Header]
## Subheader
### Subsubheader
]
.pull-right[
## Syntax
**bold/strong emphasis**
*italic/normal emphasis*
# Header
## Subheader
### Subsubheader
]
---
# R Markdown Syntax
.pull-left[
## Output
1. Ordered list Item 1
1. Item 2
1. Even with sub-item 1
2. Sub-item 2
* Unordered lists Item 1
* Item 2
+ Sub-item
[URL link](http://www.uw.edu)
![Insert pictures](http://depts.washington.edu/uwcreate/img/UW_W-Logo_smallRGB.gif)
]
.pull-right[
## Syntax
1. Ordered list Item 1
1. Item 2
1. Even with sub-item 1
2. Sub-item 2
* Unordered lists Item 1
* Item 2
+ Sub-item
[URL link](http://www.uw.edu)
![Insert pictures](http://depts.washington.edu/uwcreate/img/UW_W-Logo_smallRGB.gif)
]
---
# R Markdown Syntax
.pull-left[
## Output
You can put some math $y= \left( \frac{5}{3} \right)^2$ right up in there.
$$\frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}_n$$
Or a sentence with `code-looking font`.
Or a block of code:
```
y <- 1:5
z <- y^2
```
]
.pull-right[
## Syntax
You can put some math $y= \left(\frac{5}{3}
\right)^2$ right up in there
$$\frac{1}{n} \sum_{i=1}^{n}
x_i = \bar{x}_n$$
Or a sentence with `code-looking font`.
Or a block of code:
```
y <- 1:5
z <- y^2
```
]
---
# R Code Within R Markdown
As in Lab 1, we can run and execute R code within R Markdown.
To do so, we need to encase our code as follows.
`r ''````{r, eval = TRUE, echo = TRUE}
# Your code goes here!
```
We can click the green triangle in the corner to evaluate that code chunk to preview the results without compiling the entire document.
---
# Useful Code Chunk Parameters
Parameters go into the opening brackets `{r}` and are separated by commas. Here are some useful options:
* `echo=FALSE`: Hide R code but keep results.
* `eval=FALSE`: Do not execute the R code.
* `include=FALSE`: Hide all outputs for this chunk (It is useful to load packages at the beginning of your document).
* `cache=TRUE`: Store the results of the chunk, and only re-run if the chunk is changed. (It is useful for files that take a while to compile).
* `fig.height=5, fig.width=5`: Modify the dimensions of any plots that are generated in the chunk (units are in inches).
Note: See the [R Markdown Reference Guide](https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf) for a complete list of knitr chunk options.
---
class: inverse
# Part 2: R Basics
---
# R as a Calculator
* **Binary (Arithmetic) Operators** take two arguments. For instance, +, -, *, /, %% (for mod), %/%(integer division), and ^ (exponentiation).
```{r}
# Addition
6 + 1
```
```{r}
# Subtraction
9 - 16.6
```
```{r}
# Multiplication
6 * 3
```
---
# R as a Calculator
* **Binary (Arithmetic) Operators** take two arguments. For instance, +, -, *, /, %% (for mod), %/% (integer division), and ^ (exponentiation).
```{r}
# Division
10 / 3
```
```{r}
# Mod
10 %% 3
```
```{r}
# Integer division
10 %/% 3
```
---
# R as a Calculator
* **Binary (Arithmetic) Operators** take two arguments. For instance, +, -, *, /, %% (for mod), %/% (integer division), and ^ (exponentiation).
```{r}
# Exponentiation
3^4
```
```{r}
# Exponentiation (same as the syntax in Python)
3**4
```
* **Unitary (Arithmetic) Operators** take only one argument. For example, - is for arithmetic negation.
---
# R as a Calculator
* We can also use some build-in functions in R to calculate more advanced math functions.
```{r}
# Exponentiation with natural basis "e"
exp(3)
```
```{r}
# Trigonometric functions
sin(pi)
cos(2*pi)
```
---
# R as a Calculator
```{r}
# Square root
sqrt(5)
```
```{r}
# Logarithm with natural base
log(10)
```
```{r}
# Logarithm with base 10
log(10, base=10)
# Ask R (in the console) if we are unsure of
# any function and its arguments
?log
```
---
# Comparison Operators
```{r}
# Strictly greater than
6 > 3
# Greater than or equal to
6 >= 6
```
```{r}
# Equal to
5 == 3
5 == 2 + 3
```
---
# Comparison Operators
```{r}
# Not equal to
6 != 3
```
```{r}
# Strictly less than
6 < 6
# Less than or equal to
6 <= 6
```
---
# Logical Operators
* **Logical Operators** take one or more "comparison statements" and return TRUE or FALSE.
```{r}
# AND
(6 < 5) & (1 < 3)
```
```{r}
# AND
(6 < 9) & (1 <= 3)
```
---
# Logical Operators
* **Logical Operators** take one or more "comparison statements" and return TRUE or FALSE.
```{r}
# OR
(6 < 5) | (1 < 3)
```
```{r}
# OR
(6 < 5) | (1 <= -3)
```
```{r eval=FALSE}
# Combine AND with OR operators
(6 < 5) & (7 > 2) | (1 <= 3)
```
--
```{r echo=FALSE}
# Combine AND with OR operators
(6 < 5) & (7 > 2) | (1 <= 3)
```
---
# Logical Operators
* **Logical Operators** take one or more "comparison statements" and return TRUE or FALSE.
```{r}
# Logical negation
!(6 < 5)
```
```{r}
# Logical negation
!(6 < 9)
```
---
class: inverse
# Part 3: Data Types in R
---
# Functional Programming
Functional programming in R comprises two basic types of things/objects: **data** and **functions**.
--
* **Data** are things like 8, "James", *NA*, and
$$\begin{bmatrix}
1 & 3 & 6\\
4 & 7 & -1\\
\end{bmatrix}.$$
--
* **Functions** are some programs that turns input objects, or *arguments*, into an output object or a return value (possibly with side effects), according to a definite rule.
--
* Good programming is writing functions to correctly and efficiently transform inputs into outputs. (We will discuss functions later...)
- The principle of good programming is to take a big transformation and break it down into smaller ones so that we can efficiently implement these smaller tasks (using built-in functions).
---
# Data Types
At the base level, all data can represented in binary format, by **bits** (i.e., TRUE/FALSE, YES/NO, 1/0). However, basic data types in R are:
- **Booleans** are direct binary values: `TRUE` or `FALSE` in R.
- **Integers** are whole numbers (positive, negative or zero), represented by a fixed-length block of bits.
- **Floating point numbers** are (some approximations) to rational numbers, i.e., $p/q$ where $p,q$ are both integers.
- **Complex numbers** are numbers like 1+2i.
- **Characters** are fixed-length blocks of bits, with special coding; **strings** are sequences of characters.
- **Missing or ill-defined values**: `NA`, `NaN`, etc.
---
# Data Types (Examples)
```{r}
?typeof()
typeof(TRUE)
# By default, R stores numeric values as 64 floating points.
typeof(6)
# We can coerce it into integer as follows.
typeof(as.integer(6))
typeof(as.integer(6.5))
```
---
# Data Types (Examples)
```{r}
as.integer(6.6)
# It rounded a floating point number 6.5 to the largest
# integer that is less than 6.5.
floor(6.6)
typeof("7")
length("7112")
```
---
# Data Types (Examples)
```{r}
is.character("7")
is.na(6.6)
is.na(NA)
?NaN
```
* We can also use the build-in function `class()` to determine the data type of an object.
---
# Variables in R
- With the preceding arithmetic operations, it is difficult for us to utilize the outputs.
--
- To better keep track of the intermediate results, we can assign the (outputs of) expressions to some **named variables**.
- Naming variables is the first step towards abstraction in functional programming.
--
```{r}
a = 1 + 2
course_code = "STAT 302"
dept = paste("Statistics", "Data Science")
# List all the variables that we have defined
ls()
```
Note: `<-` and `=` are both valid assignment operators.
---
# Variables in R
* A variable in R has its name and value.
* We can access a variable by its name.
```{r}
# Access variable `a`
a
# Check the data type of `course_code`
class(course_code)
# Remove a variable (from R memory)
rm("a")
```
Note: We can also keep track of all the defined variables in the _Environment_ tab (**Top right** in Rstudio).
---
# Rules for Variable's Name
* A variable's name must follow some rules:
- It cannot start with a digit or underscore `_`.
- It may contain characters, digits, and some punctuation (period `.` and underscore `_` are allowed, while others are generally prohibited).
- It is case-sensitive.
--
```{r}
w2v = 1 + 4
W2v = "word to vector"
w2v == W2v
```
---
# Reminder
Submit Lab 1 on Canvas by the end of Thursday (October 5)!!