---
title: "Lab12. Decision Trees and Forests. Variable importance"
output:
  html_document: default
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
Sys.setlocale("LC_ALL","ru_RU.utf-8")
```

We will use the library `party`. However, there are a number of other packages for the classification and regression tree (CART) approach: `randomForest`, `rpart`, `caret`, `maptree`, `partykit` and others.

```{r, message=FALSE}
library(party)
```

### 1. Consonant drop in Russian

Our student Varvara Sveshnikova wrote her BA paper on two cases of consonant drop:  
(a) when, in the complex -stvov- (as in _beschinstVovat'_ 'to riot'), another labial consonant is pronounced after it, and   
(b) when no consonant follows (in two contexts: _beschinstVuju_ 'I riot', _beschinstVo_ 'roistering').  
The [dataset](https://raw.githubusercontent.com/agricolamz/r_on_line_course_data/master/Sveshnikova.2016.v.elision.csv) includes the following data:  
`v.elision` --- elision of [v] / no elision;  
`group` --- a group of test words, first (_beschinstvovat'_), second (_beschinstvuju_), third (_beschinstvo_);   
`word` --- root under analysis;  
`position` --- phrase position: strong, under logical stress (_I am not *CRYING*, I resent_), or weak (_He ALWAYS likes to *cry*_).  
Fit a CART model using the `ctree()` function, predicting the `v.elision` variable from all the others.  
1.1 Visualize the model using the `plot()` function. What is the number of observations in node 6?  
1.2 Print the model using the `print()` function. Which split has a statistic of 14.01?  
1.3 Predict the value of `v.elision` for a word with the root "попеч" in the third group, in a strong position.  

Fit a `cforest` model using the additional argument `controls=cforest_unbiased(ntree=1000, mtry=3)`.   
1.4 Predict the value of `v.elision` for a word with the root "попеч" in the third group, in a strong position, using the `cforest` model.  
You need to add the argument `OOB=TRUE` (answer e.g. yes).  
1.5 Calculate the variable importance of the `group` variable in the random forest model using the `varimp()` function.  

Code to use:
```{r}
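# stringsAsFactors = TRUE so that string columns become factors, which ctree() expects
# (since R 4.0, read.csv() keeps strings as character by default)
df <- read.csv("https://raw.githubusercontent.com/agricolamz/r_on_line_course_data/master/Sveshnikova.2016.v.elision.csv", stringsAsFactors = TRUE)
fit <- party::ctree(v.elision~., data = df) # use the argument controls = ctree_control(...) to control the max depth etc.
plot(fit)
print(fit)
plot(fit, type = "simple") # a simplified view
predict(fit, df[45,-1], type = "response") # predicted class for row 45, response column dropped
fit2 <- cforest(v.elision~., data = df, controls=cforest_unbiased(ntree=1000, mtry=3))
predict(fit2, df[45,-1], OOB=TRUE)
vi <- as.data.frame(sort(varimp(fit2), decreasing=TRUE)) # importance, largest first
vi
vi1 <- t(replicate(10, varimp(fit2))) # varimp() involves random permutations, so repeat it to see the spread
boxplot(vi1)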
```
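
The idea behind `varimp()` is permutation importance: shuffle one predictor and see how much the prediction accuracy drops. The chunk below is a minimal base-R sketch of that idea only, not `party`'s implementation (which works on the trees of the forest). The simulated data, the logistic regression used as a stand-in model, and the helper names `toy_accuracy()` and `perm_importance()` are all assumptions made for illustration.

```{r}
# Toy illustration of permutation importance (hypothetical data and helpers).
set.seed(1)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
# The outcome depends on x1 (plus noise) and not on x2.
toy$y <- factor(ifelse(toy$x1 + rnorm(n, sd = 0.5) > 0, "yes", "no"))

# Stand-in model: logistic regression instead of a forest.
m <- glm(y ~ x1 + x2, data = toy, family = binomial)

# Share of correctly classified observations.
toy_accuracy <- function(model, data) {
  p <- predict(model, newdata = data, type = "response") # P(y == "yes")
  pred <- ifelse(p > 0.5, "yes", "no")
  mean(pred == data$y)
}

# Mean drop in accuracy after shuffling one predictor n_rep times.
perm_importance <- function(model, data, var, n_rep = 20) {
  base <- toy_accuracy(model, data)
  drops <- replicate(n_rep, {
    shuffled <- data
    shuffled[[var]] <- sample(shuffled[[var]])
    base - toy_accuracy(model, shuffled)
  })
  mean(drops)
}

imp <- sapply(c("x1", "x2"), function(v) perm_importance(m, toy, v))
imp # x1 should come out clearly more important than x2
```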
Model accuracy:
```{r}
df.predicted <- predict(fit2, df[,-1], OOB=TRUE)
head(df.predicted)  
table(df[,1], df.predicted)
(sum(df[,1]==df.predicted)) / nrow(df) # accuracy
```
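
An accuracy value is easier to judge next to a baseline, for example the share of the most frequent class (what you would get by always predicting that class). The sketch below shows the comparison on made-up vectors; `observed` and `predicted` are hypothetical and do not come from the dataset.

```{r}
# Made-up observed and predicted labels, for illustration only.
observed  <- factor(c("yes", "no", "no", "no", "yes", "no", "no", "yes", "no", "no"))
predicted <- factor(c("yes", "no", "no", "yes", "no", "no", "no", "yes", "no", "no"),
                    levels = levels(observed))

conf_mat <- table(observed, predicted)        # confusion matrix
acc <- sum(diag(conf_mat)) / sum(conf_mat)    # share of correct predictions
baseline <- max(table(observed)) / length(observed) # always guess the majority class

c(accuracy = acc, baseline = baseline)
```

With real data, `observed` would be `df[,1]` and `predicted` would be `df.predicted` from the chunk above.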

### 2. /s/-deletion in Panamanian Spanish

Here's some data from Henrietta Cedergren's 1973 study of /s/-deletion in Panamanian Spanish (via Greg Guy and Scott Kiesling). Cedergren had noticed that speakers in Panama City, as in many dialects of Spanish, variably deleted the /s/ at the end of words. She undertook a study to find out whether there was a change in progress, that is, whether final /s/ was systematically dropping out of Panamanian Spanish.  
The attached data are from interviews she performed across the city in four different social classes (`1` = highest, `2` = second highest, `3` = second lowest, `4` = lowest), to see how the variation was structured in the community. She also investigated the linguistic constraints on deletion, so she coded for a phonetic constraint (whether the following segment was a consonant, a vowel, or a pause) and for the grammatical category of the word that the /s/ is part of:  

* monomorpheme, where the $s$ is part of the free morpheme (e.g. _menos_)  
* verb, where the $s$ is the second singular inflection (e.g. _tu tienes_, _el tienes_)  
* determiner, where $s$ is plural marked on a determiner (e.g. _los_, _las_)  
* adjective, where $s$ is a nominal plural agreeing with the noun (e.g. _buenos_)  
* noun, where $s$ marks a plural noun (e.g. _amigos_)  

Fit the CART model predicting $s$ deletion from the phonetic environment and social class.  
Data: [https://raw.githubusercontent.com/LingData2019/LingData/master/data/cedergren73.csv](https://raw.githubusercontent.com/LingData2019/LingData/master/data/cedergren73.csv)  
2.1 Visualize the model using the `plot()` function. What is the number of observations in node 6?  
2.2 Print the model using the `print()` function. Which split has a statistic of 61.559 (e.g. pause, vowel vs. consonant)?  
2.3 Predict the value of `s.deletion` for a word spoken by a person from class 1, before a consonant.  

Fit a `cforest` model using the additional argument `controls=cforest_unbiased(ntree=100, mtry=2)`.  
2.4 Calculate the variable importance for the random forest model using the `varimp()` function. Which variable is the most important?  
```{r}
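# stringsAsFactors = TRUE so that string columns become factors, which ctree() expects
df <- read.csv("https://raw.githubusercontent.com/LingData2019/LingData/master/data/cedergren73.csv", stringsAsFactors = TRUE)
str(df)

fit <- ctree(s.deletion~phon.cont+social, data = df)
plot(fit)
print(fit)
predict(fit, df[1,-c(1:2)], type = "response") # predicted class for row 1, first two columns dropped
fit2 <- cforest(s.deletion~., data = df, controls=cforest_unbiased(ntree=100, mtry=2))
varimp(fit2)
varimpAUC(fit2) # AUC-based importance for a binary response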
```
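
The prediction above relies on a fixed row index. One alternative is to look up a row by the values asked for in the task. The chunk below sketches this on a made-up miniature data frame: the column names follow the code above, but the rows and the level names (`consonant`, `vowel`, `pause`) are assumptions, so check `str(df)` before adapting it.

```{r}
# Made-up miniature data; values and level names are hypothetical.
toy_s <- data.frame(
  s.deletion = factor(c("yes", "no", "yes", "no")),
  phon.cont  = factor(c("vowel", "consonant", "pause", "consonant")),
  social     = c(2L, 1L, 4L, 3L)
)

# First row with social class 1 and a following consonant.
idx <- which(toy_s$social == 1 & toy_s$phon.cont == "consonant")[1]

# Keep only the predictor columns; this is what would go into predict() as new data.
new_obs <- toy_s[idx, c("phon.cont", "social")]
new_obs
```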
### 3. Vowel reduction in Russian

Pavel Duryagin ran an experiment on the perception of vowel reduction in Russian. The dataset `shva` includes the following variables:  

* `time1` - reaction time 1    
* `duration` - duration of the vowel in the stimulus (in milliseconds, ms)  
* `time2` - reaction time 2  
* `f1`, `f2`, `f3` - the 1st, 2nd and 3rd formants of the vowel, measured in Hz  
* `vowel` - vowel classified according to a three-way classification: `A` is _a_ under stress; `a` is _a/o_ in the first syllable before the stressed one; `y` (stands for schwa) is _a/o_ in the second (or earlier) syllable before the stressed one or after the stressed syllable, cf. `g[y]g[a]t[A]l[y]` _gogotala_ 'guffawed'.

The dataset is available at https://raw.githubusercontent.com/agricolamz/2018-MAG_R_course/master/data/duryagin_ReductionRussian.txt.  
Fit the CART model predicting `vowel` from `f1` and `f2`.  
3.1 Visualize the model using the `plot()` function. What is the number of observations in node 9?  
3.2 Predict the value of `vowel` for a sound with f1 = 600, f2 = 1300.  

Fit a `cforest` model using the additional argument `controls=cforest_unbiased(ntree=100, mtry=2)`.  
3.3 Predict the value of `vowel` for a sound with f1 = 600, f2 = 1300.  
You need to add the argument `OOB=TRUE`.   
3.4 Calculate the variable importance for the random forest model using the `varimp()` function. Which variable is the most important?  
```{r}
# stringsAsFactors = TRUE so that `vowel` becomes a factor, which ctree() expects
shva <- read.csv("https://raw.githubusercontent.com/agricolamz/2018-MAG_R_course/master/data/duryagin_ReductionRussian.txt", sep = "\t", stringsAsFactors = TRUE)
fit <- ctree(vowel~f1+f2, data = shva)
plot(fit)
print(fit)
# new observation with f1 = 600, f2 = 1300
new_vowel <- data.frame(f1 = as.integer(600), f2 = as.integer(1300))
predict(fit, newdata = new_vowel, type = "response")
fit2 <- cforest(vowel~f1+f2, data = shva, controls=cforest_unbiased(ntree=100, mtry=2))
varimp(fit2)
predict(fit2, newdata = new_vowel, OOB=TRUE)
```
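
The `as.integer()` calls above are presumably there because `f1` and `f2` are read in as integer columns, and new data passed to `predict()` should have the same column classes as the training data. The chunk below is a small base-R sketch of how one might check and align the classes. The values in `train_toy` are made up and the helper `match_integer_cols()` is hypothetical.

```{r}
# Made-up training values stored as integers, as read.csv() would do for whole numbers.
train_toy <- data.frame(f1 = c(550L, 620L, 700L), f2 = c(1200L, 1350L, 1500L))

# A new point typed by hand is stored as double ("numeric"), not integer.
new_point <- data.frame(f1 = 600, f2 = 1300)
sapply(train_toy, class)
sapply(new_point, class)

# Hypothetical helper: coerce columns of newdata to integer where the reference column is integer.
match_integer_cols <- function(newdata, reference) {
  for (v in names(newdata)) {
    if (is.integer(reference[[v]])) {
      newdata[[v]] <- as.integer(newdata[[v]])
    }
  }
  newdata
}

fixed_point <- match_integer_cols(new_point, train_toy)
sapply(fixed_point, class)
```

Writing `600L` and `1300L` directly would give integer columns as well.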

### References

* On bias in random forest variable importance metrics: [R blog](https://www.r-bloggers.com/be-aware-of-bias-in-rf-variable-importance-metrics/)

* Trees and forests with different R packages: [link](https://rpubs.com/YorkLin/cb103_20181106)