# A walking introduction to `R`

```{r setup, echo=FALSE}
options(repos='http://lib.stat.cmu.edu/R/CRAN/')
set.seed(42)
```


### A conceptual overview

 * Everything is *done* with functions.
 * All *data* is in vectors (and things made from vectors).


### Functions

Anything you want to do in `R` is done by telling `R` to run a function.

To run a function with no arguments, follow its name with parentheses.

```{r, eval=FALSE}
help()
```

Arguments are passed inside the parentheses. Arguments are usually named, but names can be omitted if it's unambiguous.

```{r, eval=FALSE}
help(topic=getwd)
help(getwd)
```

If you don't include parentheses, R will try to give you the function itself. Try these:

```{r, eval=FALSE}
help
help.search
```

Even things that don't look like functions are functions. Arithmetic operations are functions.

```{r, tidy=FALSE}
5 + 7
"+"(5,7)
```

This is a super handy function. It returns a vector.

```{r, tidy=FALSE}
":"(1,10)
1:10
```

Convenient short-hand is available for other functions too. Get help fast:

```{r, tidy=FALSE, eval=FALSE}
?glm             #  This is identical to: help(glm)
```

And of course, assign things to variables:

```{r, tidy=FALSE, eval=FALSE}
my.thing <- 8   #  You probably won't use the equivalent: "<-"(my.object, 8)

# Okay, comments aren't functions.
```

Some `R` functions are similar to unix command line conventions. For example, `ls()`, `rm()`, `getwd()`, and `setwd()`.


### Vectors

Data in R is always in a vector. A single value is a vector of length one.

In output, the numbers in brackets tell you the position in the vector at the start of the line.

```{r}
42:100
42
```

`c()` is a function that combines vectors.

```{r, eval=FALSE, tidy=FALSE}
2, 4      #  this will fail

c(2, 4)   #  this will make a vector containing first 2 then 4
```

Very often you will want to pass one vector as an argument to a function.

```{r, tidy=FALSE, eval=FALSE}
mean(2, 4)      #  this passes the function two arguments,
                #  a vector containing 2 and a vector containing 4

mean(c(2, 4))   #  this passes the function one argument,
                #  a vector containing first 2 then 4
```

(This is easy way to make a mistake.)

So everything is a vector. Vector of what?

```{r, eval=FALSE}
typeof(TRUE); typeof(T); typeof(FALSE); typeof(F);              #  logical
typeof(1:10); typeof(42L);                                      #  integer
typeof(42); typeof(3.7); typeof(5e7); typeof(1/89)              #  numeric (double)
typeof("Aaron"); typeof("cow"); typeof("123"); typeof("TRUE")   #  character

# And then there are these guys... Kind of a different story.
class(factor(c("red", "green", "blue")))                        #  factor
class(factor(c("medium", "small", "small", "large"),
             levels=c("small", "medium", "large"),
             ordered=TRUE))                                     #  ordered factor
```

Vectors have exactly one type, and are joined by the `c()` function.

```{r}
c(9, 7, TRUE, FALSE)
c(9, 7, TRUE, FALSE, "cow")
```

Other things: `NA` (missing), `NULL` (not a thing), `NaN` (`sqrt(-1)`), `Inf` (`1/0`).

You can coerce a vector to another type with the appropriate `as.` function.

```{r}
as.integer(c("42", "dog"))
```

R has vectorized operations and recycling. Most operations happen element-wise.

```{r}
c(1, 2, 3, 4) + c(100, 1000, 10000, 10000)
```

If the vectors have different lengths, the shorter one gets 'recycled'.

```{r}
c(1, 2, 3, 4) + c(100, 1000)
```

What will happen with these?

```{r, eval=FALSE, tidy=FALSE}
c(1, 2) * c(4, 5, 6)

1 + 1:10

1:10 / 10

1:10 < 5
```

Things can have names.

```{r}
my.vector <- 101:105
my.vector
names(my.vector) <- c('a', 'b', 'c', 'd', 'e')
my.vector
```

You can select from vectors with `[ ]` in several ways.

```{r, tidy=FALSE}
my.vector[c(2, 4)]                             # by index numbers
my.vector[c('c', 'e')]                         # by names
my.vector[c(TRUE, FALSE, TRUE, FALSE, TRUE)]   # with logicals
```

Logical selection is very useful.

```{r}
(my.numbers <- sample(1:10, 20, replace=TRUE))
```

How can we get just the entries less than five?

```{r}
my.numbers < 5
my.numbers[my.numbers < 5]
```

Here are some things to do with vectors:

```{r, tidy=FALSE}
length(my.vector)   #  How long is my vector?
sum(my.vector)      #  What if I add up the numbers in my vector?
sum(my.vector < 4)  #  Alternative: length(my.vector[my.vector < 4])
```

Data Frames are useful. They're the most common tabular data structure used in R. There are other data structures as well, of course.

* Matrices are vectors with a number of columns and a number of rows.
    * Multiplication is element-wise for `*`, matrix-wise for `%*%`.
* Lists are like vectors where each element could be itself a vector.
    * Compare `c(1:3, 4)` with `list(1:3, 4)`.
* Data frames are lists with every vector equal length, and you get row names and column names.

Some example datasets are included with R. (You can list them with `data()` and load them to your workspace with `data(dataSetName)`). External datasets are often read into `R` data frames from CSV files using `read.csv`, which itself calls `read.table`. (Check out the help for a lot of useful details.)

```{r}
(my.data <- read.csv('http://bit.ly/NYUdataset'))
```

These are common ways of starting to work with data frames:

```{r, eval=FALSE}
str(my.data)
summary(my.data)
```

You can access a particular vector in a list or data frame in several ways:

```{r, eval=FALSE}
my.data$gender
my.data[[2]]
my.data[['gender']]
with(my.data, gender)
```

You can subset using `[row(s), column(s)]`, both parts just like selecting from a single vector.

```{r}
my.data[2, 'age']
```

How can we select the `time`s for females? (There are several possibilities.)

```{r, eval=FALSE}
my.data[my.data$gender=='F', "time"]
my.data$time[my.data$gender=='F']
subset(my.data, gender=='F', select="time")
```

To add / compute / make a new column, just assign to it:

```{r}
my.data$number.five <- 5
my.data$mean.1.2 <- my.data$health1 + my.data$health2
my.data$health <- rowMeans(my.data[5:10])
```

To drop / delete / remove a column, you have options:

```{r, tidy=FALSE}
my.data$number.five <- NULL         #  remove from the data frame 'in place'
my.new.data <- my.data[1:10]        #  make a new smaller data frame
my.new.data <- my.data[-c(11,12)]   #  same as last
```

Modulo is `%%`. That's the shortest of the `%` infix operators, such as `%in%` and `%*%`.

Some other very useful function include `table()`, `sample()`, and the distribution functions (`rnorm`, `runif`, etc.), `ifelse`, and the flow control devices `if`/`else`, `for`, `while`, and so on. Output can be forced by `print` and `cat`. The `apply` family of functions are quite useful. There's also `rbind`, `cbind`, `aggregate`, and tools in packages such as `plyr`. And `as.Date`, `strptime`, `gsub`, `grep`, `toupper`, `tolower`, `sub_str`, `strsplit`. `melt` and `cast` from the `reshape` package. `merge`. You'll likely want to do `?formula` eventually. `which`, `max`, `min`, `pmax`, `pmin`. `unlist`, `unique`, `sort`, `order`, `setdiff`, `union`.


### Packages

There are a *lot* of contributed packages (also known as *libraries*) for R. As of ``r date()``, there were this many on the Comprehensive R Archive Network (CRAN) - and there are many more available from other sources.

```{r}
length(unique(rownames(available.packages())))
```

You load a package with `library()`. If you don't yet have the package, this will fail.

```{r}
library(somePackage)
```

You can install the package you want with `install.packages()`. Note the quotation marks:

```{r, eval=FALSE}
install.packages("somePackage")
```

You need to *install* a package once (per machine). You need to *load* the package each `R` session. Once the package is loaded, you have access to everything it provides (functions, datasets, etc.).

You now have access to a huge array of tools!