# A walking introduction to `R`
```{r setup, echo=FALSE}
options(repos='http://lib.stat.cmu.edu/R/CRAN/')
set.seed(42)
```
### A conceptual overview
* Everything is *done* with functions.
* All *data* is in vectors (and things made from vectors).
### Functions
Anything you want to do in `R` is done by telling `R` to run a function.
To run a function with no arguments, follow its name with parentheses.
```{r, eval=FALSE}
help()
```
Arguments are passed inside the parentheses. Arguments are usually named, but you can omit the names when the meaning is unambiguous; unnamed arguments are matched by position.
```{r, eval=FALSE}
help(topic=getwd)
help(getwd)
```
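As a small sketch of how argument matching works, the three calls below give the same result. `seq()` and its `from`, `to`, and `by` arguments are part of base R; the example itself is only an illustration.
```{r, tidy=FALSE}
seq(from=1, to=9, by=2)   # every argument named
seq(1, 9, 2)              # no names: matched by position
seq(by=2, to=9, from=1)   # named arguments can come in any order
```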
If you don't include parentheses, R will try to give you the function itself. Try these:
```{r, eval=FALSE}
help
help.search
```
Even things that don't look like functions are functions. Arithmetic operations are functions.
```{r, tidy=FALSE}
5 + 7
"+"(5,7)
```
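Because operators are ordinary functions, they can be handed to other functions by name. A short sketch using `sapply()` and `Reduce()` from base R (both picked here just for illustration):
```{r, tidy=FALSE}
sapply(1:3, "+", 10)  # adds 10 to each of 1, 2, 3
Reduce("+", 1:4)      # ((1 + 2) + 3) + 4
"*"(6, 7)             # same as 6 * 7
```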
The colon operator `:` is a super handy function. It returns a vector containing a sequence of numbers in steps of one.
```{r, tidy=FALSE}
":"(1,10)
1:10
```
Convenient short-hand is available for other functions too. Get help fast:
```{r, tidy=FALSE, eval=FALSE}
?glm # This is identical to: help(glm)
```
And of course, assign things to variables:
```{r, tidy=FALSE, eval=FALSE}
my.thing <- 8 # You probably won't use the equivalent: "<-"(my.thing, 8)
# Okay, comments aren't functions.
```
Some `R` functions mirror Unix command-line conventions, for example `ls()`, `rm()`, `getwd()`, and `setwd()`.
### Vectors
Data in R is always in a vector. A single value is a vector of length one.
In output, the numbers in brackets tell you the position in the vector at the start of the line.
```{r}
42:100
42
```
`c()` is a function that combines vectors.
```{r, eval=FALSE, tidy=FALSE}
2, 4 # this will fail
c(2, 4) # this will make a vector containing first 2 then 4
```
Very often you will want to pass one vector as an argument to a function.
```{r, tidy=FALSE, eval=FALSE}
mean(2, 4) # this passes the function two arguments,
# a vector containing 2 and a vector containing 4
mean(c(2, 4)) # this passes the function one argument,
# a vector containing first 2 then 4
```
(This is an easy way to make a mistake.)
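To see the mistake in action: for a numeric vector, the second argument of `mean()` is `trim`, not more data, so the `4` in the first call is never averaged. A sketch of the two calls side by side:
```{r, tidy=FALSE}
mean(2, 4)     # 2: only the first argument is treated as data
mean(c(2, 4))  # 3: the mean of the vector containing 2 and 4
```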
So everything is a vector. Vector of what?
```{r, eval=FALSE}
typeof(TRUE); typeof(T); typeof(FALSE); typeof(F); # logical
typeof(1:10); typeof(42L); # integer
typeof(42); typeof(3.7); typeof(5e7); typeof(1/89) # numeric (double)
typeof("Aaron"); typeof("cow"); typeof("123"); typeof("TRUE") # character
# And then there are these guys... Kind of a different story.
class(factor(c("red", "green", "blue"))) # factor
class(factor(c("medium", "small", "small", "large"),
levels=c("small", "medium", "large"),
ordered=TRUE)) # ordered factor
```
A vector has exactly one type. When `c()` joins values of different types, it coerces them all to the most general type present.
```{r}
c(9, 7, TRUE, FALSE)
c(9, 7, TRUE, FALSE, "cow")
```
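The two results above follow a fixed order of generality: logical, then integer, then double, then character. A quick check with `typeof()`:
```{r, tidy=FALSE}
typeof(c(TRUE, 2L))    # "integer": TRUE becomes 1L
typeof(c(2L, 3.5))     # "double"
typeof(c(3.5, "cow"))  # "character": 3.5 becomes "3.5"
```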
Other special values: `NA` (missing), `NULL` (not a thing), `NaN` (not a number, e.g. `sqrt(-1)`), `Inf` (infinity, e.g. `1/0`).
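A brief sketch of how these special values behave, using only base R functions:
```{r, tidy=FALSE}
is.na(c(1, NA, 3))             # FALSE TRUE FALSE
mean(c(1, NA, 3))              # NA: missing values propagate
mean(c(1, NA, 3), na.rm=TRUE)  # 2: drop the NA first
length(NULL)                   # 0: NULL has no elements
is.nan(0 / 0)                  # TRUE
is.finite(1 / 0)               # FALSE: 1/0 is Inf
```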
You can coerce a vector to another type with the appropriate `as.` function.
```{r}
as.integer(c("42", "dog"))
```
R has vectorized operations and recycling. Most operations happen element-wise.
```{r}
c(1, 2, 3, 4) + c(100, 1000, 10000, 10000)
```
If the vectors have different lengths, the shorter one gets 'recycled'.
```{r}
c(1, 2, 3, 4) + c(100, 1000)
```
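One way to picture recycling: the shorter vector is repeated until it matches the longer one. The sketch below spells that out with `rep()` (`short.one` is a name made up for this example):
```{r, tidy=FALSE}
short.one <- c(100, 1000)
c(1, 2, 3, 4) + short.one          # recycled automatically
c(1, 2, 3, 4) + rep(short.one, 2)  # the same thing, written out
```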
What will happen with these?
```{r, eval=FALSE, tidy=FALSE}
c(1, 2) * c(4, 5, 6)
1 + 1:10
1:10 / 10
1:10 < 5
```
Things can have names.
```{r}
my.vector <- 101:105
my.vector
names(my.vector) <- c('a', 'b', 'c', 'd', 'e')
my.vector
```
You can select from vectors with `[ ]` in several ways.
```{r, tidy=FALSE}
my.vector[c(2, 4)] # by index numbers
my.vector[c('c', 'e')] # by names
my.vector[c(TRUE, FALSE, TRUE, FALSE, TRUE)] # with logicals
```
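Two more selection styles are worth knowing: negative index numbers drop entries, and `which()` turns logicals into index numbers. A sketch on a separate vector (`my.letters` is made up for this example):
```{r, tidy=FALSE}
my.letters <- c(a=1, b=2, c=3, d=4)
my.letters[-1]         # everything except the first entry
my.letters[-c(1, 2)]   # everything except the first two
which(my.letters > 2)  # positions where the condition is TRUE
```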
Logical selection is very useful.
```{r}
(my.numbers <- sample(1:10, 20, replace=TRUE))
```
How can we get just the entries less than five?
```{r}
my.numbers < 5
my.numbers[my.numbers < 5]
```
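Conditions can be combined with `&` (and), `|` (or), and `!` (not). A sketch on a small fixed vector (`some.numbers` is made up for this example, so its output does not depend on the random draw above):
```{r, tidy=FALSE}
some.numbers <- c(3, 8, 1, 6, 10, 4)
some.numbers[some.numbers > 2 & some.numbers < 8]  # 3 6 4
some.numbers[some.numbers < 2 | some.numbers > 9]  # 1 10
some.numbers[!(some.numbers < 5)]                  # 8 6 10
```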
Here are some things to do with vectors:
```{r, tidy=FALSE}
length(my.vector) # How long is my vector?
sum(my.vector) # What if I add up the numbers in my vector?
sum(my.vector < 104) # How many entries are below 104? Alternative: length(my.vector[my.vector < 104])
```
Data Frames are useful. They're the most common tabular data structure used in R. There are other data structures as well, of course.
* Matrices are vectors with a number of columns and a number of rows.
    * Multiplication is element-wise for `*`, matrix-wise for `%*%`.
* Lists are like vectors where each element can itself be a vector.
    * Compare `c(1:3, 4)` with `list(1:3, 4)`.
* Data frames are lists in which every vector has the same length, with row names and column names.
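A minimal sketch of these structures side by side. All object names here are invented for illustration:
```{r, tidy=FALSE}
my.matrix <- matrix(1:4, nrow=2)  # filled column by column
my.matrix * my.matrix             # element-wise
my.matrix %*% my.matrix           # matrix multiplication
length(c(1:3, 4))                 # 4: one flat vector
length(list(1:3, 4))              # 2: a vector and a number, kept apart
my.frame <- data.frame(id=1:3, score=c(9.5, 7, 8))
dim(my.frame)                     # 3 rows, 2 columns
```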
Some example datasets are included with R. (You can list them with `data()` and load one into your workspace with `data(dataSetName)`.) External datasets are often read into `R` data frames from CSV files using `read.csv`, which itself calls `read.table`. (Check out the help for a lot of useful details.)
```{r}
(my.data <- read.csv('http://bit.ly/NYUdataset'))
```
These are common ways of starting to work with data frames:
```{r, eval=FALSE}
str(my.data)
summary(my.data)
```
You can access a particular vector in a list or data frame in several ways:
```{r, eval=FALSE}
my.data$gender
my.data[[2]]
my.data[['gender']]
with(my.data, gender)
```
You can subset using `[row(s), column(s)]`, both parts just like selecting from a single vector.
```{r}
my.data[2, 'age']
```
How can we select the `time`s for females? (There are several possibilities.)
```{r, eval=FALSE}
my.data[my.data$gender=='F', "time"]
my.data$time[my.data$gender=='F']
subset(my.data, gender=='F', select="time")
```
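These forms do not all return the same kind of object. A sketch on a tiny data frame (`toy.data` and its values are invented here, not taken from the dataset above):
```{r, tidy=FALSE}
toy.data <- data.frame(gender=c('F', 'M', 'F'), time=c(12, 15, 9))
toy.data[toy.data$gender=='F', "time"]        # a plain vector: 12 9
toy.data$time[toy.data$gender=='F']           # the same vector
subset(toy.data, gender=='F', select="time")  # a one-column data frame
```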
To add / compute / make a new column, just assign to it:
```{r}
my.data$number.five <- 5
my.data$mean.1.2 <- (my.data$health1 + my.data$health2) / 2
my.data$health <- rowMeans(my.data[5:10])
```
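The same pattern on a tiny data frame, where the results are easy to check by hand (`toy.scores` and its columns are invented for this sketch):
```{r, tidy=FALSE}
toy.scores <- data.frame(q1=c(1, 4), q2=c(3, 8))
toy.scores$total <- toy.scores$q1 + toy.scores$q2          # 4 12
toy.scores$average <- rowMeans(toy.scores[c('q1', 'q2')])  # 2 6
toy.scores
```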
To drop / delete / remove a column, you have options:
```{r, tidy=FALSE}
my.data$number.five <- NULL # remove from the data frame 'in place'
my.new.data <- my.data[1:10] # make a new smaller data frame
my.new.data <- my.data[-c(11,12)] # same as last
```
Modulo is `%%`. That's the shortest of the `%` infix operators, such as `%in%` and `%*%`.
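A quick sketch of these operators, plus `%/%` (integer division), which is also part of base R:
```{r, tidy=FALSE}
7 %% 3            # 1: the remainder
7 %/% 3           # 2: the whole-number quotient
3 %in% 1:5        # TRUE
c(2, 9) %in% 1:5  # TRUE FALSE: checked element by element
```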
Some other very useful functions include `table()`, `sample()`, the distribution functions (`rnorm`, `runif`, etc.), `ifelse`, and the flow control devices `if`/`else`, `for`, `while`, and so on. Output can be forced with `print` and `cat`. The `apply` family of functions is quite useful. There's also `rbind`, `cbind`, `aggregate`, and tools in packages such as `plyr`. For dates and text: `as.Date`, `strptime`, `gsub`, `grep`, `toupper`, `tolower`, `substr`, `strsplit`. For reshaping and joining: `melt` and `cast` from the `reshape` package, and `merge`. You'll likely want to do `?formula` eventually. Also `which`, `max`, `min`, `pmax`, `pmin`, `unlist`, `unique`, `sort`, `order`, `setdiff`, `union`.
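A few of these in action, as a sketch on a made-up vector (`some.ages` and `age.group` are invented for this example):
```{r, tidy=FALSE}
some.ages <- c(15, 42, 8, 67, 30)
age.group <- ifelse(some.ages < 18, 'minor', 'adult')  # vectorized if/else
table(age.group)                         # counts per group: 3 adult, 2 minor
sort(some.ages)                          # 8 15 30 42 67
order(some.ages)                         # 3 1 5 2 4: positions that would sort it
which(some.ages == max(some.ages))       # 4
sapply(some.ages, function(a) a %% 10)   # apply a function to each entry
toupper('cow'); substr('walking', 1, 4)  # "COW" and "walk"
```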
### Packages
There are a *lot* of contributed packages (also known as *libraries*) for R. As of ``r date()``, there were this many on the Comprehensive R Archive Network (CRAN) - and there are many more available from other sources.
```{r}
length(unique(rownames(available.packages())))
```
You load a package with `library()`. If you don't yet have the package, this will fail.
```{r, error=TRUE}
library(somePackage)
```
You can install the package you want with `install.packages()`. Note the quotation marks:
```{r, eval=FALSE}
install.packages("somePackage")
```
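A common idiom, sketched here, installs a package only when it is missing and then loads it. `load.or.install` is a hypothetical helper written for this example, not a function that ships with R; `requireNamespace()`, `install.packages()`, and `library()` all come with a standard R installation.
```{r, tidy=FALSE}
load.or.install <- function(pkg) {
  # install only if the package cannot be found
  if (!requireNamespace(pkg, quietly=TRUE)) install.packages(pkg)
  # character.only=TRUE lets library() take the name as a string
  library(pkg, character.only=TRUE)
}
load.or.install("stats")  # ships with R, so nothing is installed here
```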
You need to *install* a package once (per machine). You need to *load* the package in each `R` session. Once the package is loaded, you have access to everything it provides (functions, datasets, etc.).
You now have access to a huge array of tools!