--- title: 'Linguistic Data: Quantitative Analysis and Visualisation' author: "Ilya Schurov, Olga Lyashevskaya, George Moroz, Alla Tambovtseva" date: "09 February 2018" output: pdf_document: default html_document: default subtitle: Confidence intervals in R --- Install library `DescTools` and load it: ```{r, eval=FALSE} install.packages("DescTools") ``` ```{r, warning=FALSE, message=FALSE} library(DescTools) ``` ### Confidence intervals for proportions First, let us consider an abstract example so as to look at different effects connected with confidence intervals (the effect of a sample size and the effect of a confidence level). Suppose we tossed a coin 20 times and got 4 heads. ```{r} nheads1 <- 4 # number of heads n1 <- 20 # total number of tosses ``` What is the probability of getting a head in one tossing? We do not know it exactly since we know nothing about the features of our coin (at least, whether it is fair or not). However, we can calculate a confidence interval for it. Now let's calculate a 95% confidence interval for the probability of obtaining a head in one toss of a coin (proportion of heads in such an experiment). ```{r} BinomCI(nheads1, n1) # 95% CI by default ``` Calculate the length of a confidence interval: ```{r} ci.95 <- BinomCI(nheads1, n1) ci.95[3] - ci.95[2] ``` Let's increase the number of heads and the number of tosses (the proportion of heads remains the same): ```{r} nheads2 <- 40 n2 <- 200 # now 200 tosses ci.95.2 <- BinomCI(nheads2, n2) ci.95.2[3] - ci.95.2[2] # it shrinked ``` The confidence interval has become narrower. And the ratio of the lengths of two confidence intervals should be $\sqrt{N}$ approximately, where $N$ is a number of times we increase the sample size. In our case it is 10 (from 20 to 200). ```{r} sqrt(10) # square root of N 0.3353598/0.1104032 # ratio of lengths ``` Now let's keep the number of tosses equal to 200, but increase the confidence level: ```{r} ci.99 <- BinomCI(nheads2, n2, conf.level = 0.99) ci.99[3] - ci.99[2] # it extended ``` Now let's try to set a true probability of getting a head in one toss of a coin. ```{r} p0 <- 0.5 # true probability of getting a head in one tossing ``` Then take 1000 samples of size 100, calculate confidence intervals for proportion of ones in each sample and count how many intervals contain a population proportion (the true probability of getting a head in one toss of a coin). Now recall the code from our previous seminars and suppose we asked 1000 people to toss a coin 100 times and report the proportion of heads they obtained. ```{r} tosses <- 100 # series of 100 tosses (for one person) samples <- 1000 # 1000 people tossed a coin dat <- matrix(sample(c(0, 1), tosses * samples, replace=TRUE), ncol=tosses, byrow=TRUE) # recall how dat looks like head(dat, 2) ``` Now calculate confidence intervals for the probability of getting a head in one toss based on proportions on heads in each series of tosses (based on each row in `dat`): ```{r} cis <- BinomCI(rowSums(dat), tosses) head(cis) ``` So as to decide how many confidence intervals include true population proportion `p0` (probability of getting a head in one toss). To do so we need the second and the third column of `cis`: ```{r} head(cis[,"lwr.ci"]) head(cis[,"upr.ci"]) ``` Now we can check whether each confidence interval includes `p0`. If it is true, `p0` should be greater or equal to the lower bound of an interval and less or equal to the upper bound. ```{r} head(cis[,"lwr.ci"] <= p0 & cis[,"upr.ci"] >= p0) # count the proportion of CIs that include p0 mean(cis[,"lwr.ci"] <= p0 & cis[,"upr.ci"] >= p0) ``` It is approximately 0.95 as expected. ### Confidence intervals: real data Now let's proceed to real data and work with *Verses* data set. ```{r} verses <- read.csv("https://raw.githubusercontent.com/LingData2019/LingData/master/data/poetry_last_in_lines.csv", sep = "\t") #str(verses) # recall which variables are there ``` Calculate a confidence interval for the proportion of nouns at the end of lines: ```{r} nnouns <- nrow(verses[verses$UPoS == "NOUN", ]) total <- nrow(verses) BinomCI(nnouns, total) ``` ### Confidence intervals for means Now let's work with the data set on Icelandic language from our previuos class. ```{r} phono <- read.csv("http://math-info.hse.ru/f/2018-19/ling-data/icelandic.csv") ``` Choose aspirated and non-aspirated cases again: ```{r} asp <- phono[phono$aspiration == "yes", ] nasp <- phono[phono$aspiration == "no", ] ``` Calculate confidence intervals for mean values of vowel duration in each group: ```{r} MeanCI(asp$vowel.dur) MeanCI(nasp$vowel.dur) ``` Plot them using `sciplot`: ```{r, eval=FALSE} install.packages("sciplot") ``` ```{r, warning=FALSE, message=FALSE} library(sciplot) # specify data # response is a variable for which mean we plot a CI # x.factor is a grouping variable (as we create plots by groups) # ci.fun - function that calculates CI (1.96 multipled by standard error) lineplot.CI(data = phono, response = vowel.dur, x.factor = aspiration, ci.fun = function(x) c(mean(x)-1.96*se(x), mean(x)+1.96*se(x))) ``` Bold dots here correspond to sample means and whiskers (called error bars) correspond to the bounds of confidence intervals for means. From this graphs we can see, for example, whether confidence intervals overlap. Why it can be helpful, we will discuss right now. ### Confidence intervals and statistical significance of differences * If two CI's for a population parameter (proportion, mean, median, etc) do not overlap, it means that true values of population parameters are significantly different. * If two CI's for a population parameter overlap, true values of population parameters can coincide (be equal to each other), but **not** necessarily do so. For example, if two confidence intervals for means overlap, we cannot make a definite conclusion, more accurate testing is required (t-test). So, in general, comparison of confidence intervals (with the same confidence level, of course) is **not** equivalent to hypotheses testing. Consider a case when two CI's for means overlap, but population means are significantly different. Let's select only cases with aspirated consonants and compare the average vowel duration for round and unrounded vowels. ```{r} w1 <- phono[phono$aspiration == 'yes' & phono$roundness == "round", ] w2 <- phono[phono$aspiration == 'yes' & phono$roundness == "unrounded", ] ``` Do CI's overlap? ```{r} MeanCI(w1$vowel.dur) MeanCI(w2$vowel.dur) ``` They overlap! Can we conclude that mean vowel duration is different for round and unrounded vowels? In fact, no. Let us see. Now perform an accurate test, a two sample Student's t-test. ```{r} # reject or not reject H0 t.test(w1$vowel.dur, w2$vowel.dur) ``` Null hypotheses should be rejected, so population means are different. So, this is an illustration of the fact described above: two confidence intervals overlap, but population means are statistically different. Actually, testing hypothesis about the equality of population means is equivalent to finding whether *a CI for the difference of means* includes zero. ```{r} # CI for difference between means MeanDiffCI(w1$vowel.dur, w2$vowel.dur) ``` So, intersection of CI's for means (or for any population parameters) $\ne$ CI for the difference includes zero $\ne$ $H_0$ about equality should not be rejected.