---
title: 'Linguistic Data: Quantitative Analysis and Visualisation'
subtitle: Lab on a Student's t-test
output:
  html_document:
    df_print: paged
  pdf_document: default
  word_document: default
---
### Aspiration and vowel duration in Icelandic 

This set is based on (Coretta 2017, [link](https://goo.gl/NrfgJm). This dissertation dealt with the relation between vowel duration and aspiration in consonants. Author carried out a data collection with 5 natives speakers of Icelandic. Then he extracted the duration of vowels followed by aspirated versus non-aspirated consonants. Check out whether vowels before aspirated consonants (like in Icelandic takka ‘key’ [tʰaʰka]) are signiﬁcantly shorter than vowels followed by non-aspirated consonants (like in kagga ‘barrel’ [kʰakka]).
[Link](http://math-info.hse.ru/f/2018-19/ling-data/icelandic.csv) to the dataset.

```{r}
df <- read.csv("http://math-info.hse.ru/f/2018-19/ling-data/icelandic.csv")
```

### Descriptive statistics

A general boxplot:

```{r}
boxplot(df$vowel.dur)
```

Get the number of outliers:

```{r}
length(boxplot(df$vowel.dur)$out)
```

Look at number of observations by groups (aspirated and non-aspirated cases):

```{r}
table(df$aspriration)
```

Choose two subsamples, one for words where vowels are followed by aspirated consonants and another for non-aspirated consonants.

```{r}
asp <- df[df$aspiration == 'yes',]
nasp <- df[df$aspiration == 'no',]
```

Summary for aspirated and non-aspirated cases:

```{r}
summary(asp$vowel.dur)
summary(nasp$vowel.dur)
```

Boxplot by groups:

```{r}
boxplot(df$vowel.dur ~ df$aspiration)
```

More interesting - let us create a boxplot by all groups (see the field `cons1`):

```{r}
boxplot(df$vowel.dur ~ df$cons1)
```

You can compare distribution of `vowel.dur` in asp(irated), fri(cative), nasp(non-aspirated), voi(ced), etc.

We can limit our data to just one type of vowels, say, middle vowels. Therefore, we will work with the same type of a consonant:

```{r}
asp <- df[df$aspiration == 'yes' & df$height == 'mid', ]
nasp <- df[df$aspiration == 'no' & df$height == 'mid', ]
```

Again, here is a summary for a corrected case:

```{r}
summary(asp$vowel.dur)
summary(nasp$vowel.dur)

nrow(asp)
nrow(nasp)
```

### T-test

Let us formulate the null hypothesis, the alternative hypotesis, and apply t-test to our dataset.

```{r}
t.test(asp$vowel.dur, nasp$vowel.dur)
```

By default, R calculates t.test with regard to the bi-directional alternative hypothesis, such as $\mu_1 \neq \mu_2$.

### Unidirectional t-test

H1: $\mu_{asp} \lt \mu_{nasp}$

```{r}
t.test(asp$vowel.dur, nasp$vowel.dur, alternative = "less")
```

### Density plots
```{r, message=FALSE, warning=FALSE}
require(tidyverse)
require(dplyr)
```

Let's get a descriptive summary of our data in a dplyr style.

```{r}
df %>% 
  group_by(aspiration) %>%
  summarise(mean = mean(vowel.dur),
            st.dev = sd(vowel.dur))
```

Density plots can be thought of as plots of smoothed histograms.

```{r, warning=FALSE, message=FALSE}
library(ggplot2)
df %>% 
  ggplot(aes(vowel.dur, fill = aspiration, color = aspiration))+
  geom_density(alpha = 0.4)+
  geom_rug()+
  labs(title = "Vowel duration density plot",
       caption = "Data from (Coretta 2017)",
       x = "vowel duration")
```

Density plot by speaker:

```{r}
df %>% 
  ggplot(aes(vowel.dur, fill = aspiration, color = aspiration))+
  geom_density(alpha = 0.4)+
  geom_rug()+
  facet_wrap(~speaker)+
  labs(title = "Vowel duration density plot, by speaker",
       caption = "Data from (Coretta 2017)",
       x = "vowel duration")
```

and descriptive statistics:

```{r}
df %>% 
  group_by(aspiration, speaker) %>%
  summarise(mean = mean(vowel.dur),
            st.dev = sd(vowel.dur))
```