---
title: "STAT 302 Statistical Computing"
subtitle: "Lecture 1: Introduction and R Basics"
author: "Yikun Zhang (_Winter 2024_)"
date: ""
output:
  xaringan::moon_reader:
    css: ["uw.css", "fonts.css"]
    lib_dir: libs
    nature:
      highlightStyle: tomorrow-night-bright
      highlightLines: true
      countIncrementalSlides: false
      titleSlideClass: ["center","top"]
---

```{r setup, include=FALSE, purl=FALSE}
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(comment = "##")
library(kableExtra)
```

# Outline

1. Course Overview

2. Introduction to R, RStudio, and R Markdown

3. Elementary Operations in R

4. Data Types in R

Appendices:

A. Probability

B. Random Variables

* Acknowledgement: Parts of the slides are modified from the course materials by Prof. Ryan Tibshirani, Prof. Yen-Chi Chen, Prof. Deborah Nolan, Bryan Martin, and Andrea Boskovic.

---

class: inverse

# Part 1: Course Overview and Logistics

---

# What Is Statistical Computing?

--

Answer: A computer program that does statistics!

Well! Let's see some "official answers"...

--

From ChatGPT:

[Image: ChatGPT's response on statistical computing]

---

# What Is Statistical Computing?


--

Statistical computing is a course with intensive programming tasks that are related to statistics.

--

In other words, there will be a lot of coding in this course!!

---

# Why Do We Learn Statistical Computing?

--

- We want to utilize (big) data to address scientific questions.

[Image: big data]

Cited from https://bleuwire.com/5-biggest-big-data-challenges/.

---

# Why Do We Learn Statistical Computing?

**An Example From My Research**: Cosmic Web Detection with Observed Galaxies in the Sloan Digital Sky Survey.

See my paper at MNRAS and our cosmic web catalog (i.e., a well-documented dataset).

One scientific question that we address here is "*how is the stellar mass of a galaxy correlated with its distance to nearby cosmic web structures?*"

---

# Why Do We Learn Statistical Computing?

- We need to conduct simulation studies to validate our statistical theory and methodology.

--

- For example, we can verify the asymptotic normality of our proposed statistical estimator with finite samples.

[Image: lasso]

See my recent paper https://arxiv.org/abs/2309.06429.

---

# Why Do We Learn Statistical Computing?

- Mastering statistical computing skills can lead to better jobs.

--

Source: US News (2021).

---

# Syllabus

Let's spend some time going over the [Course Syllabus](https://zhangyk8.github.io/teaching/file_stat302/Syllabus_Win2024.pdf). --- # Canvas Discussion * It is worth up to 2% extra credit on the final grade. * Only substantive and helpful questions will be counted. .pull-left[ ### Bad questions: * How do you do Problem 2? * Here's my code and it's broken. How can I fix it? ] -- .pull-right[ ### Good questions: * Here's a snippet of code that I used for Problem 2:
`formatted code snippet`
It returned the following error:
`formatted error message`
Does anyone know why? I already tried... * I don't understand the concept from Slide 18 today. Could anyone elaborate on why...? ] --- # Canvas Discussion * It is worth up to 2% extra credit on the final grade. * Only substantive and helpful answers will be counted. .pull-left[ ### Bad or null answers: * Here's my solution:
`formatted code snippet` * The grader is wrong. You should ask the grader to add your points back... (*However, you are encouraged to point out my mistakes and typos during lectures or on the discussion board.*) ] -- .pull-right[ ### Good answers: * This error message occurs because your variable is a string instead of a numeric. Have you tried checking...? * I think that Slide 18 in Lecture 2 will address your questions. ] --- # Why R? R is a programming language developed by statisticians for statistical computing. ### Pros: * R is open-source and has a community of developers and users. * It is convenient for statistical analysis and data visualization...

-- ### Cons: * R is slow unless we use parallel computing packages or [Rcpp](https://www.rcpp.org/). * It is not very popular outside of the statistical community. --- # Why R? .pull-left[ Windows Interface

] .pull-right[ Linux/Unix Terminal

]

--

It is not convenient to write programs with thousands of lines of R code directly in the R interface!

---

# Why RStudio?

Luckily, we have [RStudio](https://posit.co/download/rstudio-desktop/), an integrated development environment (IDE) designed for writing and running R programs.

--

* We recommend installing R first. RStudio will then automatically locate the R installation on our computer.

* It helps us organize R scripts, files, plots, the code console, etc.

* It provides a helpful interactive graphical interface.

--

* More importantly, it has R Markdown integration.

For the rest of the course, we will use RStudio to write our code and finish the lab assignments.

---

# RStudio Interface

By default...

* *Top left*: Editor panel. Browse and edit scripts or data with tabs.

* *Top right*: List of objects in the Environment (recall `ls()`), code history, etc.

* *Bottom left*: Console for running R code line-by-line (`>` prompt).

* *Bottom right*: Files, plots, packages, help files, etc.

If the Editor panel is not open, choose *File > New File > R Script*.

---

# Editor

* Our important code should be written here (**not** the console).
* Primarily used for writing and editing .R or .Rmd scripts.
* Try opening a file now using *File > New File > R Script*, and write two lines of simple code, such as `1 + 3` or `a = 6`.
* Click `Run` in the bar above the script. What happens?
* Click on one of the lines of code. Press `Ctrl`/`⌘` + `Enter`. What happens?

--

.center[**Important:** Every part of our R workflow belongs in this window!]

---

# Console and Environment/History

#### Console

* It gives us an easy way to run and test individual lines of code.
* Nothing that we run here will be saved after we close RStudio (unless we save the R history)!

--

#### Environment/History

* The variables that we defined can be seen in the _Environment_ tab.
* Click on the _History_ tab to see what it contains. Try searching!
* Select a line from the _History_ tab and click `To Source`. What happens?
    - It is useful for adding lines that we tested in our Console to our R scripts.

---

# Files, Plots, Packages, Help

* _Files_ tab is used to browse the files on our computer.
    * Open files/data, move files that we are working with, etc.
    * **Use caution!** Changing files here is the same as changing them on our computer. If we delete something, it's gone!

* _Plots_ tab is used to display plots that we create in R.

* _Help_ tab is used to browse the documentation of functions. We can look up a function by preceding its name with `?`. Try `?sqrt` to see its user documentation. (If we are unsure about any function, ask R in this way!)

* _Packages_ tab shows all the packages that we currently have installed. (We will discuss more about it later.)

---

# Why R Markdown?

[R Markdown](https://rmarkdown.rstudio.com/) is a markup language for combining R code with text.

* It facilitates the creation of neat HTML files, PDF documents, slides (like the ones I am using), webpages, books, etc.

--

* More importantly, it is required for our lab assignments and final project!

---

# Create an R Markdown File

Let's try creating an R Markdown file:

1. Choose *File > New File > R Markdown...*.
2. Make sure *PDF Output* is selected and click OK.
3. Save the file in your new folder, and call it `stat302_test1.Rmd`.
4. Click the *Knit* button.
    * After it is done, browse to the file location using the `Files` tab. What has been added?

Note: The PDF output requires an installation of $\LaTeX$; see the instructions [here](https://bookdown.org/yihui/rmarkdown/installation.html).

---

# R Markdown Syntax

.pull-left[

## Output

**bold/strong emphasis**

*italic/normal emphasis*

.forcehead[Header]

## Subheader

### Subsubheader

]

.pull-right[

## Syntax
**bold/strong emphasis**

*italic/normal emphasis*

# Header

## Subheader

### Subsubheader

] --- # R Markdown Syntax .pull-left[ ## Output 1. Ordered list Item 1 1. Item 2 1. Even with sub-item 1 2. Sub-item 2 * Unordered lists Item 1 * Item 2 + Sub-item [URL link](http://www.uw.edu) ![Insert pictures](http://depts.washington.edu/uwcreate/img/UW_W-Logo_smallRGB.gif) ] .pull-right[ ## Syntax
1. Ordered list Item 1
1. Item 2
  1. Even with sub-item 1
  2. Sub-item 2

* Unordered lists Item 1
* Item 2
  + Sub-item

[URL link](http://www.uw.edu)

![Insert pictures](http://depts.washington.edu/uwcreate/img/UW_W-Logo_smallRGB.gif)
] --- # R Markdown Syntax .pull-left[ ## Output You can put some math $y= \left( \frac{5}{3} \right)^2$ right up in there. $$\frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}_n$$ Or a sentence with `code-looking font`. Or a block of code: ``` y <- 1:5 z <- y^2 ``` ] .pull-right[ ## Syntax
You can put some math $y= \left(\frac{5}{3} 
\right)^2$ right up in there

$$\frac{1}{n} \sum_{i=1}^{n}
x_i = \bar{x}_n$$

Or a sentence with `code-looking font`.

Or a block of code:

    ```
    y <- 1:5
    z <- y^2
    ```
]
---

# R Code Within R Markdown

As in Lab 1, we can run R code within R Markdown. To do so, we need to enclose our code as follows.

`r ''````{r, eval = TRUE, echo = TRUE}
# Your code goes here!
```

We can click the green triangle in the corner of a code chunk to evaluate it and preview the results without compiling the entire document.

---

# Useful Code Chunk Parameters

Parameters go inside the opening braces `{r}` and are separated by commas. Here are some useful options:

* `echo=FALSE`: Hide the R code but keep the results.
* `eval=FALSE`: Do not execute the R code.
* `include=FALSE`: Run the chunk but hide its code and all of its output (It is useful for loading packages at the beginning of your document).
* `cache=TRUE`: Store the results of the chunk, and only re-run it if the chunk is changed. (It is useful for files that take a while to compile).
* `fig.height=5, fig.width=5`: Modify the dimensions of any plots that are generated in the chunk (units are in inches).

Note: See the [R Markdown Reference Guide](https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf) for a complete list of knitr chunk options.

---

class: inverse

# Part 2: R Basics

---

# R as a Calculator

* **Binary (Arithmetic) Operators** take two arguments. For instance, +, -, *, /, %% (for mod), %/% (integer division), and ^ (exponentiation).

```{r}
# Addition
6 + 1
```

```{r}
# Subtraction
9 - 16.6
```

```{r}
# Multiplication
6 * 3
```

---

# R as a Calculator

* **Binary (Arithmetic) Operators** take two arguments. For instance, +, -, *, /, %% (for mod), %/% (integer division), and ^ (exponentiation).

```{r}
# Division
10 / 3
```

```{r}
# Mod
10 %% 3
```

```{r}
# Integer division
10 %/% 3
```

---

# R as a Calculator

* **Binary (Arithmetic) Operators** take two arguments. For instance, +, -, *, /, %% (for mod), %/% (integer division), and ^ (exponentiation).
```{r}
# Exponentiation
3^4
```

```{r}
# Exponentiation (same as the syntax in Python)
3**4
```

* **Unary (Arithmetic) Operators** take only one argument. For example, - is for arithmetic negation.

---

# R as a Calculator

* We can also use some built-in functions in R to compute more advanced mathematical functions.

```{r}
# Exponentiation with natural base "e"
exp(3)
```

```{r}
# Trigonometric functions
sin(pi)
cos(2*pi)
```

---

# R as a Calculator

```{r}
# Square root
sqrt(5)
```

```{r}
# Logarithm with natural base
log(10)
```

```{r}
# Logarithm with base 10
log(10, base=10)
# Ask R (in the console) if we are unsure of
# any function and its arguments
?log
```

---

# Comparison Operators

```{r}
# Strictly greater than
6 > 3
# Greater than or equal to
6 >= 6
```

```{r}
# Equal to
5 == 3
5 == 2 + 3
```

---

# Comparison Operators

```{r}
# Not equal to
6 != 3
```

```{r}
# Strictly less than
6 < 6
# Less than or equal to
6 <= 6
```

---

# Logical Operators

* **Logical Operators** take one or more "comparison statements" and return TRUE or FALSE.

```{r}
# AND
(6 < 5) & (1 < 3)
```

```{r}
# AND
(6 < 9) & (1 <= 3)
```

---

# Logical Operators

* **Logical Operators** take one or more "comparison statements" and return TRUE or FALSE.

```{r}
# OR
(6 < 5) | (1 < 3)
```

```{r}
# OR
(6 < 5) | (1 <= -3)
```

```{r eval=FALSE}
# Combine AND with OR operators
(6 < 5) & (7 > 2) | (1 <= 3)
```

--

```{r echo=FALSE}
# Combine AND with OR operators
(6 < 5) & (7 > 2) | (1 <= 3)
```

---

# Logical Operators

* **Logical Operators** take one or more "comparison statements" and return TRUE or FALSE.

```{r}
# Logical negation
!(6 < 5)
```

```{r}
# Logical negation
!(6 < 9)
```

---

class: inverse

# Part 3: Data Types in R

---

# Functional Programming

Functional programming in R comprises two basic types of things/objects: **data** and **functions**.
--

* **Data** are things like 8, "James", *NA*, and
$$\begin{bmatrix} 1 & 3 & 6\\ 4 & 7 & -1\\ \end{bmatrix}.$$

--

* **Functions** are programs that turn input objects, or *arguments*, into an output object or a return value (possibly with side effects), according to a definite rule.

--

* Good programming is writing functions to correctly and efficiently transform inputs into outputs. (We will discuss functions later...)
    - The principle of good programming is to take a big transformation and break it down into smaller ones so that we can efficiently implement these smaller tasks (using built-in functions).

---

# Data Types

At the base level, all data can be represented in binary format, by **bits** (i.e., TRUE/FALSE, YES/NO, 1/0). On top of bits, the basic data types in R are:

- **Booleans** are direct binary values: `TRUE` or `FALSE` in R.
- **Integers** are whole numbers (positive, negative or zero), represented by a fixed-length block of bits.
- **Floating point numbers** are approximations to real numbers by rational numbers, i.e., $p/q$ where $p,q$ are both integers.
- **Complex numbers** are numbers like 1+2i.
- **Characters** are fixed-length blocks of bits, with special coding; **strings** are sequences of characters.
- **Missing or ill-defined values**: `NA`, `NaN`, etc.

---

# Data Types (Examples)

```{r}
?typeof
typeof(TRUE)
# By default, R stores numeric values as 64-bit floating point numbers.
typeof(6)
# We can coerce it into an integer as follows.
typeof(as.integer(6))
typeof(as.integer(6.5))
```

---

# Data Types (Examples)

We can also use the built-in function `class()` to determine the data type of an object. [This webpage](https://stackoverflow.com/questions/6258004/types-and-classes-of-variables) describes the differences between `typeof()` and `class()`.

- In short, `typeof()` or `mode()` represents how an object is stored in memory (numeric, character, list, or function), while `class()` represents its abstract type.
```{r}
class(6)
typeof(6)
mode(6)
```

---

# Data Types (Examples)

```{r}
as.integer(6.6)
# as.integer() truncates 6.6 toward zero, giving 6, the largest integer that is less than 6.6.
# Check its difference with the `ceiling()` function.
floor(6.6)
typeof("7")
length("7112")
```

---

# Data Types (Examples)

```{r}
is.character("7")
is.na(6.6)
is.na(NA)
```

--

```{r}
is.na(NaN)
is.nan(NA)
```

---

# Variables in R

- With the preceding arithmetic operations, it is difficult for us to reuse the outputs.

--

- To better keep track of the intermediate results, we can assign the (outputs of) expressions to some **named variables**.
- Naming variables is the first step towards abstraction in functional programming.

--

```{r}
a = 1 + 2
course_code = "STAT 302"
dept = paste("Statistics", "Data Science")
# List all the variables that we have defined
ls()
```

Note: `<-` and `=` are both valid assignment operators.

---

# Variables in R

* A variable in R has its name and value.
* We can access a variable by its name.

```{r}
# Access variable `a`
a
# Check the data type of `course_code`
class(course_code)
```

--

```{r}
# Remove a variable (from R memory)
rm("a")
```

Note: We can also keep track of all the defined variables in the _Environment_ tab (**Top right** in RStudio).

---

# Rules for Variable Names

* A variable's name must follow some rules:
    - It cannot start with a digit or underscore `_`.
    - It may contain letters, digits, and some punctuation (period `.` and underscore `_` are allowed, while others are generally prohibited).
    - It is case-sensitive.

--

```{r}
w2v = 1 + 4
W2v = "word to vector"
w2v == W2v
```

---

# Summary

- Statistical computing focuses on using computer programs to solve scientific problems with solid statistical methods.
- R is an open-source programming language for statistical computing.
- RStudio and R Markdown further enhance our R programming experience.
- R supports arithmetic, comparison, and logical operators.
- The basic data types in R enable us to represent Booleans, numbers, characters, etc.

Submit Lab 1 on Gradescope by the end of Monday (January 15)!!

---

class: inverse

# Appendix A. Probability

---

# Sample Space

A **sample space**, commonly denoted $\Omega$ or $S$, is the set of all possible outcomes from a random experiment. For example,

* Coin flip: $\Omega = \{H, T\}$;
* Two coin flips: $\Omega = \{HH, HT, TH, TT\}$;
* Rolling a 6-sided die: $\Omega = \{1, 2, 3, 4, 5, 6\}$;
* Hours spent sleeping in a day: $\Omega = \{x: x\in \mathbb{R}, 0 \leq x \leq 24\}$;
* A simulation from a normal distribution: $\Omega = (-\infty, \infty)$.

The elements of $\Omega$ must be **mutually exclusive** and **collectively exhaustive**.

---

# Events

An **event**, which we will call $A$, can be any subset of our sample space. For example,

* Heads in a coin flip: $A = \{H\}$;
* At least one heads in two coin flips: $A = \{HH, HT, TH\}$;
* Rolling an even number on a 6-sided die: $A = \{2, 4, 6\}$;
* Sleeping at least 8 hours in a day: $A = \{x: x \in \mathbb{R}, 8 \leq x \leq 24\}$;
* Simulating a number between 1 and 2, inclusive, from a normal distribution: $A = [1, 2]$.

---

# Probability

Informally, probability $P$ is often defined as the chance of something happening. More formally, it is a function that maps an event $A$ to a real number.

* $P(\text{heads in a fair coin flip}) = \dfrac{1}{2}$.
* $P(\text{at least one heads in two fair coin flips}) = \dfrac{3}{4}$.
* $P(\text{rolling an even number on a fair die}) = \dfrac{1}{2}$.

---

# Axioms of Probability

Probability always follows three basic principles, known as the **axioms of probability**.

1. The probability of any event $A$ must be between 0 and 1, inclusive.
    * $0 \leq P(A)\leq 1$.
2. The probability of the sample space is equal to 1.
    * $P(\Omega) = 1$.
3. If events $A$ and $B$ are **mutually exclusive**/**disjoint**, then the probability of *either* $A$ *or* $B$ is the same as the sum of the probability of $A$ and the probability of $B$.
    * $A \cap B = \emptyset \ \ \Rightarrow \ \ P(A\cup B) = P(A) + P(B)$

---

layout: true

# Probability Notation

---

## Intersection: $\cap$

$P(A \cap B)$: *joint* probability of $A$ *and* $B$.

.center[]

---

## Union: $\cup$

$P(A \cup B)$: probability of $A$ *or* $B$.

.center[]

---

## Complement: $A^c$

$P(A^c)$: probability of *not* $A$.

.center[]

---

## Difference: $A\setminus B$

$P(A\setminus B)$: probability of $A$ *and not* $B$.

.center[]

---

## Conditional: $A | B$

$P(A | B)$: probability of $A$ *conditional on*/*given* $B$.

.center[]

---

## Subset: $A \subseteq \Omega$

$A \subseteq \Omega$: $A$ is a *subset* of $\Omega$.

$A \subset \Omega$: $A$ is a *proper subset* of $\Omega$.

.center[]

---

## Superset: $\Omega \supseteq A$

$\Omega \supseteq A$: $\Omega$ is a *superset* of $A$.

$\Omega \supset A$: $\Omega$ is a *proper superset* of $A$.

.center[]

---

## Element of: $X \in A$

$X \in A$: $X$ is an *element of* $A$.

.center[]

---

## Empty set: $\varnothing$

$A\cap E = \varnothing$: the intersection of $A$ and $E$ is the empty set.
.center[]

---

layout: false

# Identities of Probability

* The probability of $A^c$ is $1$ minus the probability of $A$:
    * $P(A^c) = 1-P(A)$.
* If $A$ is a subset of $B$, then the probability of $A$ is less than or equal to the probability of $B$:
    * $A \subseteq B \implies P(A)\leq P(B)$.
* The probability of a union is equal to the sum of the probabilities minus the probability of the intersection:
    * $P(A \cup B) = P(A) + P(B) - P(A\cap B)$.

---

# De Morgan's Laws

* Complement of the union is equal to the intersection of the complements.

.center[]

---

# De Morgan's Laws

* Complement of the intersection is equal to the union of the complements.

.center[]

---

# De Morgan's Laws

1. Complement of the union is equal to the intersection of the complements:
    * $(A \cup B)^c = A^c \cap B^c$.
2. Complement of the intersection is equal to the union of the complements:
    * $(A \cap B)^c = A^c \cup B^c$.

.center[ ]

---

# Independence

We say that two events $A$ and $B$ are independent, $A\perp \!\!\! \perp B$, *if and only if* one of the following holds:

* $P(A \cap B) = P(A) P(B)$;
* $P(A|B) = P(A)$;
* $P(B|A) = P(B)$.

This is an *extremely* important concept in statistics!

---

# Conditional Probability

The conditional probability of $A$ given $B$ is equal to the joint probability of $A$ and $B$ divided by the marginal probability of $B$ (assuming $P(B) > 0$):
$$P(A|B) = \dfrac{P(A\cap B)}{P(B)}.$$

Note that this implies
$$P(A\cap B) = P(A|B) P(B).$$

---

# Bayes' Rule

$$P(A|B) = \dfrac{P(A\cap B)}{P(B)}$$
also implies
$$P(A|B) = \dfrac{P(B|A)P(A)}{P(B)},$$
which is commonly known as **Bayes' rule**!

We won't get into the details in this class, but this can be a very useful result for reversing the conditions in our analysis.

For example: There is a big difference between the probability of having a disease given a positive screening, and the probability of a positive screening given a disease! These concepts are often confused in popular media!
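---

# Bayes' Rule (A Numerical Sketch)

To make the screening example concrete, here is a minimal sketch in R. The prevalence, sensitivity, and false positive rate below are hypothetical numbers chosen only for illustration; they do not come from any real screening test. The denominator $P(+)$ is obtained by splitting on whether or not the person has the disease.

```{r}
# Hypothetical inputs (assumptions for illustration only)
prevalence  <- 0.01  # P(D): probability of having the disease
sensitivity <- 0.95  # P(+ | D): positive screening given the disease
false_pos   <- 0.05  # P(+ | D^c): positive screening given no disease

# P(+) = P(+ | D) P(D) + P(+ | D^c) P(D^c)
p_positive <- sensitivity * prevalence + false_pos * (1 - prevalence)

# Bayes' rule: P(D | +) = P(+ | D) P(D) / P(+)
p_disease_given_pos <- sensitivity * prevalence / p_positive
p_disease_given_pos
```

With these made-up inputs, $P(+ \mid D) = 0.95$ while $P(D \mid +)$ is only about $0.16$, which shows how different the two conditional probabilities can be.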
---

# Law of Total Probability

We say that a set of events is a **partition** if all of the following hold:

* The set does not contain the empty set;
* The union of the events in the set is equal to the sample space;
* The intersection of any two distinct events in the set is equal to the empty set.

Note that an event and its complement always define a partition (provided that neither of them is empty)!

.center[]

---

# Law of Total Probability

The **law of total probability** states that, given a partition $P_1, P_2, \ldots, P_n$ of the sample space,
$$P(A) = P(A|P_1)P(P_1) + P(A|P_2)P(P_2) + \cdots + P(A|P_n)P(P_n).$$

Commonly, our partition is some event $B$ and its complement:
$$P(A) = P(A|B)P(B) + P(A|B^c)P(B^c).$$

.center[]

---

class: inverse

# Appendix B. Random Variables

---

# Random Variables

Typically, we don't care about specific events occurring. Instead, we tend to focus on functions of our events. These functions are called **random variables**.

More formally<sup>1</sup>, a random variable $X$ can be defined as a function $X: \Omega \to \mathbb{R}$. For example,

* the number of heads out of 10 coin flips;
* the sum of 8 standard die rolls;
* the average value of 1,000 simulations from a $\mathcal{N}(0,1)$.

Typically, a random variable is denoted by an uppercase letter, such as $X$, and the values that the random variable takes are denoted by a lowercase letter, such as $x$. For example, we might ask for $P(X=x)$ for multiple values of $x$.

We call the set of all values a random variable can take the **support** of that random variable.

.footnote[[1] It is still not very formal.]

---

# Random Variables

Random variables are **not** events! This can be confusing because we often use similar notation for random variables and events. Think of an event as an outcome that can lead to a certain value of a random variable. For example,

* Event in 10 coin flips: $\{THHTTTHTTH\}$.
* Random variable representing the number of heads $X = 4$.
* Event in 8 standard die rolls: $\{2, 4, 2, 1, 5, 4, 2, 6\}$.
* Random variable representing the sum $X = 26$.

---

# Discrete Random Variables

Random variables are **discrete** if there are a finite<sup>1</sup> number of values in the support. We've already seen some examples of this in class from the binomial distribution!

We define the **probability mass function**, or **PMF**, of a discrete random variable $X \sim Bin(n,p)$ as
$$P(X=k|n,p) = \binom{n}{k} p^k(1-p)^{n-k}, \quad k = 0, 1, \ldots, n.$$

Probability mass functions must satisfy:

1. $0 \leq P(X=x) \leq 1$ for all $x$;
2. $\sum_{x \in support(X)} P(X = x) = 1$;
3. For any set $A\subseteq support(X)$, $P(X \in A) = \sum_{x \in A} P(X = x)$.

.footnote[[1] or countably infinite]

---

# Continuous Random Variables

Random variables are **continuous** if the support is uncountably infinite. We've already seen some examples of this in class from the normal distribution!

We define the **probability density function**, or **PDF**, of a continuous random variable $X \sim \mathcal{N}(\mu, \sigma^2)$ as
$$f_X(x) = \dfrac{1}{\sqrt{2\pi\sigma^2}} e^{-\dfrac{1}{2}\left(\dfrac{x-\mu}{\sigma}\right)^2}.$$

Probability density functions must satisfy:

1. $f_X(x) > 0$ for all $x \in support(X)$;
2. The area under the curve of the PDF over the support is equal to $1$. That is, $\int_{support(X)} f_X(x)dx = 1$;
3. If $A$ is some interval in the support of $X$, then $P(X \in A) = \int_A f_X(x)dx$.

Note that the second and third properties are essentially the continuous versions of the corresponding properties of discrete PMFs!

---

# Expected Value

The **expected value** or expectation of a random variable, denoted $E[X]$ or $\mathbb{E}[X]$, is the mean of the random variable. Intuitively, it can be thought of as a weighted average of all values in the support, weighted by their PMF/PDF values.
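As a small sketch of the weighted-average idea, consider one roll of a fair six-sided die, where every value in the support has probability $1/6$. The variable names `x`, `p`, and `expected_value` are only illustrative.

```{r}
# Support and PMF of a single fair six-sided die roll
x <- 1:6
p <- rep(1/6, 6)

# E[X] is the sum over the support of x * P(X = x)
expected_value <- sum(x * p)
expected_value
```

The result is 3.5, which is not itself a value in the support: the expectation is an average, not necessarily a possible outcome.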
Expected values satisfy the following properties for random variables $X$, $Y$ and constants $a$, $b$:

* $\mathbb{E}[a] = a$;
* $\mathbb{E}[aX + b] = a \cdot \mathbb{E}[X] + b$;
* $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$.

If $X$ and $Y$ are independent, then

* $\mathbb{E}[XY] = \mathbb{E}[X]\cdot \mathbb{E}[Y]$.

---

# Variance

The **variance** of a random variable is the expected squared difference between the random variable and its mean:
$$\text{Var}(X)=\mathbb{E}[(X - \mathbb{E}[X])^2].$$

Intuitively, this measures how far the values of $X$ are from their mean, on average. It is a measure of spread, or variability.

Variances satisfy the following properties for a random variable $X$ and constants $a, b$:

* $\text{Var}(a) = 0$;
* $\text{Var}(aX + b) = a^2\cdot \text{Var}(X)$.

The square root of the variance is known as the **standard deviation**, because it measures the typical (standard) magnitude of the difference (deviation) between a random variable and its mean.

A more detailed review of basic probability theory can be found in Section 3 of [these notes](https://zhangyk8.github.io/teaching/file_stat548/CS547_Proof_Probability_new.pdf).
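---

# Expected Value and Variance (A Simulation Sketch)

The linearity properties $\mathbb{E}[aX + b] = a \cdot \mathbb{E}[X] + b$ and $\text{Var}(aX + b) = a^2\cdot \text{Var}(X)$ can be checked numerically. The sketch below is only an illustration: the distribution $\mathcal{N}(2, 3^2)$, the constants `a` and `b`, the sample size, and the seed are all arbitrary assumptions.

```{r}
set.seed(302)                    # arbitrary seed, for reproducibility
n <- 1e5                         # number of simulated draws
x <- rnorm(n, mean = 2, sd = 3)  # draws of X ~ N(2, 9)
a <- 4                           # hypothetical constants
b <- -1

y <- a * x + b

# Sample mean of aX + b versus a * (sample mean of X) + b
c(mean(y), a * mean(x) + b)

# Sample variance of aX + b versus a^2 * (sample variance of X)
c(var(y), a^2 * var(x))
```

Within each pair, the two numbers agree up to floating point error because the same identities hold for sample means and sample variances. Both pairs should also be close to the theoretical values $4 \cdot 2 - 1 = 7$ and $4^2 \cdot 9 = 144$.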