---
title: 'Linguistic Data: Quantitative Analysis and Visualisation'
author: "Ilya Schurov, Olga Lyashevskaya, George Moroz, Alla Tambovtseva"
date: "12 January 2019"
output:
  html_document: default
---
## Basics of R: variables, vectors and descriptive statistics
### R as calculator
Let's look at very basic calculations in R.
```{r}
1 + 4 # addition
```
```{r}
4 - 9 # subtraction
```
```{r}
6 * 5 + 7 / 2 # multiplication and division
```
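As a small illustration of operator precedence in the expression above: multiplication and division are evaluated before addition, and parentheses change the order. The second expression is an illustrative variation, not part of the original example.

```{r}
6 * 5 + 7 / 2     # evaluated as (6 * 5) + (7 / 2)
(6 * 5 + 7) / 2   # parentheses force the addition to happen first
```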
```{r}
sqrt(36) # taking a square root
```
```{r}
6 ^ 2 # raising to power
```
```{r}
6 ** 2 # the same
```
**Note:** if you plan to work in both R and Python, it is better to memorize the latter way of raising a number to a power (via `**`), since in Python the operator `^` is the bitwise exclusive OR (XOR), which has nothing to do with powers.
In R we can calculate logarithms as well. By default the `log()` function returns the natural logarithm, i.e. the logarithm to the base `e`. In English-language books it is often denoted as `log`, in Russian ones as `ln`.
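To see why the distinction matters: in Python `6 ^ 2` is a bitwise exclusive OR (XOR), not a power. Base R has its own function for that operation, `bitwXor()`, so the two results can be compared side by side. The variable names in this minimal sketch are illustrative only.

```{r}
power_result <- 6 ^ 2          # raising to a power in R
xor_result <- bitwXor(6L, 2L)  # bitwise XOR, which is what Python's 6 ^ 2 computes
power_result
xor_result
```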
```{r}
log(4)
```
We can also specify the base of a logarithm with the argument `base`:
```{r}
log(4, base = 2) # so 2^2 = 4
```
Or calculate a logarithm to base 10:
```{r}
log10(100) # the same as log(100, base=10)
```
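The `base` argument follows the usual change-of-base identity $\log_b x = \ln x / \ln b$. A quick check of this (the variable names are illustrative only):

```{r}
via_base <- log(100, base = 10)      # logarithm with an explicit base
via_identity <- log(100) / log(10)   # natural logarithms combined by change of base
all.equal(via_base, via_identity)    # comparison with a small numeric tolerance
```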
If we want to round the results obtained, we can use the function `round()`:
```{r}
round(12.57)
```
By default it rounds a value to the closest integer, so we got 13 above. However, we can specify the number of digits we want to keep after the decimal point:
```{r}
round(12.57, 1) # round to tenths, 1 digit after .
```
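One detail worth knowing about rounding to the closest integer: when a value lies exactly halfway between two integers, R follows the IEC 60559 convention and rounds to the even one. The related functions `floor()` and `ceiling()` always round down and up, respectively. A short sketch:

```{r}
round(0.5)      # halfway case: goes to the even neighbour, 0
round(1.5)      # halfway case: goes to the even neighbour, 2
round(2.5)      # halfway case: goes to the even neighbour, 2
floor(12.57)    # always rounds down
ceiling(12.57)  # always rounds up
```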
### Variables in R
Variable names in R can contain letters, digits, dots and underscores, but a name cannot start with a digit, an underscore, or a dot followed by a digit (similar restrictions exist in many programming languages). A name also cannot coincide with a reserved R word (like `if`, `else`, `for`, `while`, etc.).
Both operators `<-` and `=` can be used for assigning values to variables, but `<-` is a 'canonical' R operator that is usually applied in practice. In other words, writing code with `=` is technically correct, but not cool and has to be avoided :)
```{r}
a <- 3
a
```
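The naming rules above can be illustrated with a few examples. The names below are made up for this sketch; `make.names()` is a base R function that turns an arbitrary string into a syntactically valid name, which gives a quick way to see what R considers invalid.

```{r}
word_count <- 10          # letters and an underscore
word.count2 <- 20         # a dot and a digit (not in the first position)
make.names("2nd_word")    # starts with a digit, so R prepends "X"
make.names("if")          # a reserved word, so R appends "."
```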
We can change the value of a variable and save it again with the same name:
```{r}
x <- 2
x <- x + 3
x # updated, now it is 2 + 3 = 5
```
We can also assign text values to variables. Text must be written in quotes:
```{r}
s <- "hello"
s
```
It does not matter which quotes we use, single `''` or double `""`. The only important thing is that the opening and the closing quotes are of the same type, so it is not allowed to write something like this: `"hello'`.
There are many functions for working with text variables (in R they are called *character variables*), but we will not go into them now. Just as an example, look at the function `toupper()`, which converts all letters to uppercase:
```{r}
toupper(s)
```
Note that the original value of `s` has not changed; it is still in lowercase:
```{r}
s
```
To save changes we have to reassign a value to `s`:
```{r}
s <- toupper(s)
s # updated
```
### Vectors in R
A vector in R is an ordered sequence of elements of the same type. It is created with the special function `c()`:
```{r}
v <- c(1, 0, 0, 1, 2) # vector v
```
We can look at this vector:
```{r}
v
```
To get the type of a vector (at least, whether it is numeric or not), we can use the function `class()`:
```{r}
class(v) # numeric values, not text ones
```
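All elements of a vector share one type. If values of different types are passed to `c()`, R converts them to a common type. A small sketch (the vector `mixed` is just an illustration):

```{r}
mixed <- c(1, "a", TRUE)  # a number, a string and a logical value
mixed                     # every element has become a character string
class(mixed)
```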
We can also find *the length of a vector*, i.e. the number of its elements:
```{r}
length(v)
```
To select an element of a vector by its index (its position in the vector), we specify the index in square brackets:
```{r}
v[1] # first element
v[2] # second element
```
Note that in R indexing starts from 1, so if you are used to Python or other languages that count from 0, take this into account. Requesting the element at index 0 does not raise an error; it returns an empty vector:
```{r}
v[0] # no error, but no such element
```
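Square brackets also accept a vector of positions, a range, or negative positions (which exclude elements). A brief sketch using the same values as `v` above, redefined here so that the chunk runs on its own:

```{r}
v <- c(1, 0, 0, 1, 2)
v[c(1, 5)]     # first and fifth elements
v[2:4]         # elements from the second to the fourth
v[-1]          # everything except the first element
v[length(v)]   # last element
```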
Vectors do not have to be numeric; character vectors are possible as well:
```{r}
names <- c('Ann', 'Tom')
names
```
### Descriptive statistics in R
Consider the following sample (we save its elements to the vector `x`):
```{r}
x <- c(6, 6, 7, 0, 14, 24, 16, 15, 2, 0)
x
```
Let's calculate several descriptive statistics for a numeric sample.
```{r}
min(x) # minimum value
max(x) # maximum value
mean(x) # an average, a sample mean
median(x) # a median
var(x) # a sample variance
sd(x) # a standard deviation
```
**Note:** by default R computes the corrected (unbiased) sample variance, the one with $n-1$ in the denominator; `sd()` is its square root.
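As a check on the note above, the sample variance can be computed by hand with $n - 1$ in the denominator and compared with `var()`. The names `n` and `manual_var` are illustrative.

```{r}
x <- c(6, 6, 7, 0, 14, 24, 16, 15, 2, 0)
n <- length(x)
manual_var <- sum((x - mean(x))^2) / (n - 1)  # squared deviations divided by n - 1
all.equal(manual_var, var(x))                 # agrees with var()
all.equal(sqrt(manual_var), sd(x))            # sd() is the square root of var()
```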
And what if we work with a categorical sample? For example, we have a text (character) vector:
```{r}
y <- c("a", "b", "c", "a", "c", "c")
```
We can calculate the frequencies of the values using the function `table()`:
```{r}
table(y)
```
This function returns absolute frequencies. To get relative ones, we can compute them manually by dividing every absolute frequency by the sum of all frequencies in the sample:
```{r}
table(y)/sum(table(y))
```
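Base R also has a function for this, `prop.table()`, which divides each count by the total. A sketch showing that both ways agree (the names `manual` and `built_in` are chosen for this example):

```{r}
y <- c("a", "b", "c", "a", "c", "c")
manual <- table(y) / sum(table(y))  # division by the total count
built_in <- prop.table(table(y))    # the same result from a built-in function
all.equal(manual, built_in)
round(built_in, 2)                  # proportions rounded for display
```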
Now let us proceed to histograms (of course, they are suitable only for numeric vectors). We can plot a histogram of our sample `x`:
```{r}
hist(x) # hist - from histogram
```
By default the bars are white or light gray, depending on the version of R (the default fill became light gray in R 4.0.0), but you can choose a color:
```{r}
hist(x, col="red") # col for color
```
Or:
```{r}
hist(x, col="hotpink") # more interesting color
```
There are many named colors in R; see the full list [here](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf).
We will not focus on styling now (we will discuss it later), but we should mention two points that are important from a statistical standpoint: choosing the type of values shown on the vertical axis and choosing the number of bins (rectangles in a histogram).
We can show normalised frequencies (densities) on the vertical axis, i.e. values scaled in such a way that the histogram has a total area of one.
```{r}
# freq=FALSE: densities instead of absolute frequencies on the y-axis
hist(x, col="red", freq=FALSE)
```
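To confirm that the bars now have a total area of one, the histogram can be computed without drawing it (`plot = FALSE`) and each density multiplied by the width of its bin. A minimal sketch; `h` and `bin_widths` are names chosen for this example:

```{r}
x <- c(6, 6, 7, 0, 14, 24, 16, 15, 2, 0)
h <- hist(x, plot = FALSE)      # compute the histogram without plotting it
bin_widths <- diff(h$breaks)    # width of every bin
sum(h$density * bin_widths)     # total area of the bars
```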
To choose a number of bins different from the default (if you are interested, read about [Sturges' algorithm](https://en.wikipedia.org/wiki/Histogram#Sturges'_formula) or [other](https://www.rdocumentation.org/packages/grDevices/versions/3.5.1/topics/nclass) algorithms used in R), you can add the corresponding option:
```{r}
hist(x, col="red", freq=FALSE, breaks=3) # 3 bins
```
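A single number passed to `breaks` is treated by `hist()` as a suggestion, so it is worth inspecting the breakpoints that were actually used. For full control, a vector of breakpoints can be given instead. In this sketch the boundaries `seq(0, 25, by = 5)` are an arbitrary choice for illustration, and `suggested` and `exact` are made-up names.

```{r}
x <- c(6, 6, 7, 0, 14, 24, 16, 15, 2, 0)
suggested <- hist(x, breaks = 3, plot = FALSE)
suggested$breaks                 # breakpoints R actually chose
exact <- hist(x, breaks = seq(0, 25, by = 5), plot = FALSE)
exact$breaks                     # exactly the boundaries we asked for
exact$counts                     # number of observations in each bin
```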
That is all for today. If you need more, please see the optional materials for this course.