--- title: "Lab 10. Binary logistic regression" --- {r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE)  ### Libraries {r, message= FALSE, include=TRUE} library(tidyverse) library(stats) # glm() function for logit regression models library(caret) # library to calculate confusion matrix and agreement library(pROC) # library to draw ROC curves  ## 1 Logit model with one numeric predictor It is interesting to know whether the languages with more consonants are more likely to have ejective sounds. So we collected data from phonological database LAPSyD: http://goo.gl/0btfKa. ### 1.1 Data summary {r} ej_cons <- read.csv("https://agricolamz.github.io/2018-MAG_R_course/data/correlation_regressions_ejectives.csv") ej_cons %>% ggplot(aes(ejectives, n.cons.lapsyd, color = ejectives))+ geom_jitter(width = 0.2)+ labs(title = "Number of consonants ~ presence of ejectives", x = "presence of ejectives", y = "number of consonants")+ theme_bw()  ### 1.2 Model without predictors {r} fit1 <- glm(ejectives~1, data = ej_cons, family = "binomial") summary(fit1)  How we get this estimate value? {r} table(ej_cons$ejectives) log(10/17)  What does this model say? This model says that if we have no predictors and take some language it has $\frac{1}{(1+e^{0.5306283})} = 0.37037$ probability to have ejectives. {r} 1/(1+exp(0.5306283))  ### 1.3 Model with numeric predictor {r} fit2 <- glm(ejectives~n.cons.lapsyd, data = ej_cons, family = "binomial") summary(fit2)  What does this model say? This model says: $$\log(odds(ej)) = \beta_o + \beta_1 \times n.cons.lapsyd = -9.9204 + 0.3797 \times n.cons.lapsyd$$ Lets visualize our model: {r} ej_cons %>% mutate(P(ejective) = as.numeric(ejectives) - 1) %>% ggplot(aes(x = n.cons.lapsyd, y = P(ejective)))+ geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) + geom_point()+ theme_bw()  So probability for a language that have 30 consonants will be $$\log(odds(ej)) = -9.9204 + 0.3797 \times 30 = 1.4706$$ $$P(ej) = \frac{1.47061}{1+1.4706}=0.8131486$$ ### 1.4 predict() {r} new.df <- data.frame(n.cons.lapsyd = c(30, 55, 34, 10)) predict(fit2, new.df) # odds predict(fit2, new.df, type = "response") # probabilities predict(fit2, new.df, type = "response", se.fit = TRUE) # probabilities and confidense interval  So we actually can create a plot with confidense intervals. {r} ej_cons_ci <- cbind.data.frame(ej_cons, predict(fit2, ej_cons, type = "response", se.fit = TRUE)[1:2]) ej_cons_ci  {r} ej_cons_ci %>% mutate(P(ejective) = as.numeric(ejectives) - 1) %>% ggplot(aes(x = n.cons.lapsyd, y = P(ejective)))+ geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE)+ geom_point() + geom_pointrange(aes(x = n.cons.lapsyd, ymin = fit - se.fit, ymax = fit + se.fit))+ labs(title = "P(ej) ~ number of consonants", x = "number of consonants", caption = "data from LAPSyD database")+ theme_bw()  ## 2. Choice betweeen two constructions in Russian The Russian verb _gruzit'_ load' is special for three reasons. First, this verb has two syntactic constructions it can appear in, second, it has three perfective counterparts with the prefixes _NA-_, _PO-_, and _ZA-_ that do not add to its lexical meaning (and thus can be cosidered Natural Perfectives), and third all three Natural Perfectives can also use both constructions. The two constructions that _gruzit'_ 'load' can appear in are called the ''THEME-object'' construction and the ''GOAL-object'' construction, and this phenomenon is known in many languages as Locative Alternation. We can illustrate these two constructions in Russian with the following examples: * THEME-object: _gruzit'_ _jaschiki_.ACC _na telegu_(PP) 'load the boxes.THEME onto the cart.GOAL'. The goal appears in a prepositional phrase in the theme-object construction, usually with the preposition _na_ onto' or _v_ into'. * GOAL-object: _gruzit'_ _telegu_.ACC _jaschikami_.INS 'load the cart.GOAL with boxes.THEME'. The theme in the GOAL-object construction appears in the instrumental case. _gruzit'_ The verb load' uses not just one, but three prefixes to form Natural Perfectives: _NA-_, _ZA-_, and _PO-_. Collectively we call these four verbs (the simplex and the three Natural Perfectives) ''the 'load' verbs''. All three Natural Perfectives can appear in both the THEME-object and the GOAL-object constructions. Janda et al. 2013, chapter 4 explores whether the choice of prefix makes a difference in the distribution of the THEME-object and GOAL-object constructions. Along with the prefixes, they test whether the passive construction (ie. construction with passive participle) and omission of the prepositional phrase (ie. reduced construction) could motivate the choice between the THEME-object and GOAL-object constructions. The dataset: There are 1920 lines of data, each corresponding to one of the examples extracted from the Russian National Corpus. The dataset includes four variables: * CONSTRUCTION: This is our dependent variable, and it has two values, theme, and goal. * VERB: This is an independent variable, and it has four values, \_zero (for the unprefixed verb _gruzit'_ 'load'), na, za, and po (for the three prefixed variants). * REDUCED: This is an independent variable, and it has two values, yes and no. This refers to whether the construction was reduced (yes) or full (no). * PARTICIPLE: This is an independent variable, and it has two values, yes and no. This refers to whether the construction was passive (yes) or active (no). Source: [Trolling repository](https://hdl.handle.net/10037.1/10022) References: Janda et al. (2013), Why Russian aspectual prefixes arenâ€™t empty: prefixes as verb classifiers. Bloomington, IN: Slavica Publishers. ### 2.1 Data summary {r 2.1} loaddata = read.csv('https://raw.githubusercontent.com/LingData2019/LingData/master/data/loaddata.csv') summary(loaddata)  ### 2.2 Formulate your hypothesis, what motivates the choice between two constructions? {2.2}  ### 2.3 Fit the simplest logistic regression model using VERB as the only factor. {r 2.3} # use glm() in the following way: fit <- glm(Dependent_variable ~ Factor_variable(s), family = binomial, data = ....)  ### 2.4 Formulate the results of your analysis as text: {2.4}  ### 2.5 Add more factors to your model, one by one. Note that we do not consider possible interactions here yet. {r 2.5}  ### 2.6 Which model fits your data the best according to AIC? Note that this model should include only significant factors. AIC (Akaike Information Criterion) is a goodness-of-fit measure to compare the models with different number of predictors. It penalizes a model for having too many predictors. The smaller AIC, the better. {2.6} Name of the model: AIC:  ### 2.7 Fit the model with all factors and all possible interactions. Hint: Dependent_variable ~ Factor1 \* Factor2 \* Factor3 (the same as: Factor1 + Factor2 + Factor3 + Factor1:Factor2 + ... + Factor1:Factor2:Factor3) {r 2.7}  ### 2.8 Remove all insignificant interactions and report the minimal optimal model here: {r 1.8}  ### 2.9 Check the 95% confidence intevals of the estimated coefficients. Use confint(model_name) to calculate them. {r 2.9} print("These are the confidence interval values:") confint(load.glm4)  If a 95% confidence interval contains zero, this indicates that the corresponding effect is not significant. You can also use exp(confint(...)) to obtain simple odds ratios. The confidence interval of a significant effect based on simple odds ratios should not include 1. We can plot this coefficients with confidence intervals at the same time. To do this we need the library jtools that requires ggstance: {r, eval=FALSE} install.packages("ggstance") install.packages("jtools")  {r} library(jtools) plot_summs(load.glm4)  ### 2.10 Report the odds of success for each predictor variable. Use exp(model_name$coefficients) {r 2.10} print("These are the odds of success for each predictor variable:")  ### 2.11 Additional code: stepwise selection of variables See examples from Levshina 2015: m0.glm <- glm(Aux ~ 1, data = doenLaten, family = binomial) m.fw <- step(m0.glm, direction = "forward", scope = ~ Causation + EPTrans + Country) m.glm <- glm(Aux ~ Causation + EPTrans + Country, data = doenLaten, family = binomial) m.bw <- step(m.glm, direction = "backward") {r 2.11} load.glm0 <- glm(CONSTRUCTION ~ 1, family=binomial, data=loaddata) load.glm.fw <- step(load.glm0, direction = "forward", scope = ~ VERB + REDUCED + PARTICIPLE) load.glm.bw <- step(load.glm2, direction = "backward")  ### 2.12 Additional code: variables' importance {r 2.12} library(caret) varImp(load.glm4)  ### 2.13 Model accuracy #### Dividing data into training and test sets The rule of thumb is to use 10% or 20% or 25% data points as a test set (usually not less than 20 data points). The model will be trained on the remaining data. {r 2.13.1} set.seed(42) load.test.index <- sample(1:nrow(loaddata), size=floor(nrow(loaddata)/10)) # select 10% random points, this will create a vector paste0("The test set size: ", floor(nrow(loaddata)/10), ". The training set size: ", nrow(loaddata)-floor(nrow(loaddata)/10), ".") load.test <- loaddata[load.test.index,] load.train <- loaddata[-load.test.index,] load.train %>% ggplot(aes(x=VERB, y=PARTICIPLE, col=CONSTRUCTION)) + scale_color_manual(values=c("black", "red")) + geom_point() + geom_jitter() + ggtitle("Constructions in the train set") load.test %>% ggplot(aes(x=VERB, y=PARTICIPLE, col=CONSTRUCTION)) + scale_color_manual(values=c("black", "red")) + geom_point() + geom_jitter() + ggtitle("Constructions in the test set")  * Training the model on the train set, making prediction on the test set {r 2.13.2} load.glm5 <- glm(CONSTRUCTION ~ VERB + REDUCED + PARTICIPLE + VERB:PARTICIPLE, family="binomial", data=load.train) #print(summary(load.glm5)) load.glm5.link.scores <- predict(load.glm5, newdata=load.test, type="link") load.glm5.response.scores <- predict(load.glm5, newdata=load.test, type="response") load.glm5.scores <- data.frame(link=load.glm5.link.scores, response=load.glm5.response.scores, construction_obs=load.test$CONSTRUCTION, stringsAsFactors=FALSE)  #### Confusion matrix and accuracy Confusion matrix counts the cases of correctly predicted classes as well as the cases of misclassification (false positives and false negatives) . __Accuracy__ are the counts on the backslash diagonal divided by the total counts. {r 2.13.3} load.test %>% count(CONSTRUCTION, VERB, REDUCED, PARTICIPLE) %>% select(-n, -CONSTRUCTION) %>% unique() -> load.test.pdata load.test.pdata %>% predict(load.glm5, newdata = ., type = "response") -> load.test.pdata$PREDICTION load.test.pdata %>% arrange(PREDICTION) load.test.pdata %>% arrange(desc(PREDICTION)) v <- rep(NA, nrow(load.glm5.scores)) v <- ifelse(load.glm5.scores$response >= .5, "theme", "goal") load.glm5.scores$construction_pred <- as.factor(v) confusionMatrix(data = load.glm5.scores$construction_pred, reference = load.glm5.scores$construction_obs, positive="theme")  #### Inspect false positives and false negatives {r 2.13.4} data.frame(load.test, response = load.glm5.scores$response, predicted = load.glm5.scores$construction_pred)[load.glm5.scores$construction_pred != load.glm5.scores$construction_obs,] %>% arrange(predicted, response) load.glm5.scores %>% ggplot(aes(x=link, y=response, col=construction_obs)) + scale_color_manual(values=c("black", "red")) + geom_point() + geom_rug() + geom_jitter(width=.7, height=.02) + geom_hline(yintercept=0.5, linetype="dashed") + annotate(geom="text", x=2, y=0.47, label="threshold at response prob = 0.5", color="darkblue") + ggtitle("Observed and predicted values in test data")  Yet another way to plot observed and predicted variables: {r 2.13.5} ggplot(data=load.glm5.scores, aes(x=construction_obs, y=response)) + geom_violin(fill=rgb(1,1,1,alpha=0.6), color="blue", width = 2) + geom_jitter(aes(color = construction_obs)) + geom_hline(yintercept=0.6, linetype="dashed", alpha=0.8) + # scale_color_discrete(name = "type") + labs(title="Threshold at responce probability 0.6")  At the threshold 0.6, there are 8 false positives (predicted as "theme" whereas the observed class is "goal") and 7 false negatives (predicted as "goal" whereas the observed class is "theme"). The violin plots show that the distribution of response probabilities within each class is quite good (most of the "goal" response probs are below 0.2, most of the "theme" response probs equal 1). ### 2.14 AUC (area under the ROC curve) The ROC* curve shows the trade off between the rate at which you can correctly predict something (True Positive rate) with the rate of incorrectly predicting something (False Positive rate). The curve starts in the bottom left corner and uses an ordered vector of prediction scores (e.g. load.glm5.scores$response above, ordered) to take the next step. Each time the curve "sees" the positive value (e.g. "goal") in the observed output it moves up (northward), and each time it sees the negative value (e.g. "theme") it takes a step right (eastward). Ideally, if we only have true positives and true negatives predicted by the model, the curve will move up till the top left corner and then move right till the top right corner. The area under the ROC curve (AUC) ranges from 0.50 to 1.00, where 0.50 is considered a random prediction, 0.70 is a borderline case, and 0.80 and above indicates that the model does a good job in discriminating between the two output values. The closer the ROC gets to the optimal point of perfect prediction in the top left corner the closer the AUC gets to 1. *ROC stands for Receiver Operating Characteristics. Read more: [https://www.r-bloggers.com/illustrated-guide-to-roc-and-auc/](https://www.r-bloggers.com/illustrated-guide-to-roc-and-auc/). {r 2.14} load.glm5.roc <- roc(load.test$CONSTRUCTION, load.glm5.response.scores, direction="<") plot(load.glm5.roc, col="green", lwd=3, main="AUC") simple_roc <- function(labels, scores){ labels <- labels[order(scores, decreasing=TRUE)] data.frame(TPR=cumsum(labels)/sum(labels), FPR=cumsum(!labels)/sum(!labels), labels) } #TPR - True Positive Ratio, FPR - False Positive Ratio load.glm5.simple.roc <- simple_roc(load.test$CONSTRUCTION=="theme", load.glm5.link.scores) with(load.glm5.simple.roc, points(1 - FPR, TPR, col=1 + labels)) auc(load.glm5.roc)