---
title: "Lab12. Decision Trees and Forests. Variable importance"
output:
  html_document: default
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
Sys.setlocale("LC_ALL","ru_RU.utf-8")
```

We will use the library `party`. However, there are a number of other packages for the classification and regression tree (CART) approach: `randomForest`, `rpart`, `caret`, `maptree`, `partykit` and others.

```{r, message=FALSE}
library(party)
```

### 1. Consonant drop in Russian

Our student Varvara Sveshnikova wrote her BA paper on two cases of consonant drop: (a) when in the complex -stvov- (as in _beschinstVovat'_ 'to riot') another labial consonant is pronounced after it, and (b) when no consonant follows (in two contexts: _beschinstVuju_ 'I riot', _beschinstVo_ 'roistering').

The [dataset](https://raw.githubusercontent.com/agricolamz/r_on_line_course_data/master/Sveshnikova.2016.v.elision.csv) includes the following data:

* `v.elision` - elision of [v] / no elision
* `group` - a group of test words: first (_beschinstvovat'_), second (_beschinstvuju_), third (_beschinstvo_)
* `word` - root under analysis
* `position` - phrase position: strong, under logical stress (_I am not *CRYING*, I resent_), or weak (_He ALWAYS likes to *cry*_)

Fit a CART model using the `ctree()` function, predicting the `v.elision` variable from all the others.

1.1 Visualize the model using the `plot()` function. What is the number of observations in node 6?

1.2 Inspect the model using the `print()` function. Which split has a statistic of 14.01?

1.3 Predict the value of `v.elision` for a word with the root "попеч" in the third group, in a strong position.

Fit a `cforest` model using the additional argument `controls=cforest_unbiased(ntree=1000, mtry=3)`.

1.4 Predict the value of `v.elision` for a word with the root "попеч" in the third group, in a strong position, using the `cforest` model. You need to add the argument `OOB=TRUE`, e. g.
yes

1.5 Calculate the variable importance of the `group` variable in the random forest model using the `varimp()` function.

Code to use:

```{r}
# read strings as factors (the default changed to FALSE in R 4.0); party expects factors
df <- read.csv("https://raw.githubusercontent.com/agricolamz/r_on_line_course_data/master/Sveshnikova.2016.v.elision.csv", stringsAsFactors = TRUE)
fit <- party::ctree(v.elision~., data = df)
# use the argument controls = ctree_control(...) to control the max depth etc.
plot(fit)
print(fit)
plot(fit, type = "simple") # a simplified view
predict(fit, df[45,-1], response = TRUE)
fit2 <- cforest(v.elision~., data = df, controls=cforest_unbiased(ntree=1000, mtry=3))
predict(fit2, df[45,-1],OOB=TRUE)
vi <- as.data.frame(sort(varimp(fit2), decreasing=TRUE))
vi
# varimp() is permutation-based, so repeated calls show its variability
vi1 <- t(replicate(10, varimp(fit2)))
boxplot(vi1)
```

Model accuracy:

```{r}
df.predicted <- predict(fit2, df[,-1], OOB=TRUE)
head(df.predicted)
table(df[,1], df.predicted) # confusion matrix: observed vs. predicted
(sum(df[,1]==df.predicted)) / nrow(df) # accuracy
```

### 2. /S/ deletion in Panamanian Spanish

Here's some data from Henrietta Cedergren's 1973 study of /s/-deletion in Panamanian Spanish (via Greg Guy and Scott Kiesling). Cedergren had noticed that speakers in Panama City, as in many dialects of Spanish, variably deleted the /s/ at the end of words. She undertook a study to find out if there was a change in progress: if final /s/ was systematically dropping out of Panamanian Spanish.

The attached data are from interviews she performed across the city in four different social classes (`1` = highest, `2` = second highest, `3` = second lowest, `4` = lowest), to see how the variation was structured in the community. She also investigated the linguistic constraints on deletion, so she coded for a phonetic constraint (whether the following segment was a consonant, vowel, or pause) and the grammatical category of the word that the /s/ is part of:

* monomorpheme, where the $s$ is part of the free morpheme (e.g. _menos_)
* verb, where the $s$ is the second-person singular inflection (e.g.
_tu tienes_, _el tienes_)
* determiner, where $s$ is plural marked on a determiner (e.g. _los_, _las_)
* adjective, where $s$ is a nominal plural agreeing with the noun (e.g. _buenos_)
* noun, where $s$ marks a plural noun (e.g. _amigos_)

Fit the CART model predicting $s$ deletion by phonetic environment and social class.

Data: [https://raw.githubusercontent.com/LingData2019/LingData/master/data/cedergren73.csv](https://raw.githubusercontent.com/LingData2019/LingData/master/data/cedergren73.csv)

2.1 Visualize the model using the `plot()` function. What is the number of observations in node 6?

2.2 Inspect the model using the `print()` function. Which split has a statistic of 61.559 (e. g. pause, vowel vs. consonant)?

2.3 Predict the value of `s.deletion` for a word said by a person from class 1, before a consonant.

Fit a `cforest` model using the additional argument `controls=cforest_unbiased(ntree=100, mtry=2)`.

2.4 Calculate the variable importance for the random forest model using the `varimp()` function. Which variable is more important?

```{r}
# read strings as factors (the default changed to FALSE in R 4.0); party expects factors
df <- read.csv("https://raw.githubusercontent.com/LingData2019/LingData/master/data/cedergren73.csv", stringsAsFactors = TRUE)
str(df)
fit <- ctree(s.deletion~phon.cont+social, data = df)
plot(fit)
print(fit)
predict(fit, df[1,-c(1:2)], response = TRUE)
fit2 <- cforest(s.deletion~., data = df, controls=cforest_unbiased(ntree=100, mtry=2))
varimp(fit2)
varimpAUC(fit2)
```

### 3. Vowel reduction in Russian

Pavel Duryagin ran an experiment on the perception of vowel reduction in Russian. The dataset `shva` includes the following variables:

* `time1` - reaction time 1
* `duration` - duration of the vowel in the stimuli (in milliseconds, ms)
* `time2` - reaction time 2
* `f1`, `f2`, `f3` - the 1st, 2nd and 3rd formant of the vowel, measured in Hz
* `vowel` - vowel classified according to a three-way classification (A - a under stress, a - a/o as in the first syllable before the stressed one, y (stands for schwa) - a/o as in the second etc.
syllable before the stressed one or after the stressed syllable, cf. `g[y]g[a]t[A]l[y]` _gogotala_ 'guffawed').

The dataset is available at <https://raw.githubusercontent.com/agricolamz/2018-MAG_R_course/master/data/duryagin_ReductionRussian.txt>.

Fit the CART model predicting `vowel` by `f1` and `f2`.

3.1 Visualize the model using the `plot()` function. What is the number of observations in node 9?

3.2 Predict the value of `vowel` for a sound with f1 = 600, f2 = 1300.

Fit a `cforest` model using the additional argument `controls=cforest_unbiased(ntree=100, mtry=2)`.

3.3 Predict the value of `vowel` for a sound with f1 = 600, f2 = 1300. You need to add the argument `OOB=TRUE`.

3.4 Calculate the variable importance for the random forest model using the `varimp()` function. Which variable is more important?

```{r}
# read strings as factors (the default changed to FALSE in R 4.0); party expects a factor response
shva <- read.csv("https://raw.githubusercontent.com/agricolamz/2018-MAG_R_course/master/data/duryagin_ReductionRussian.txt", sep = "\t", stringsAsFactors = TRUE)
fit <- ctree(vowel~f1+f2, data = shva)
plot(fit)
print(fit)
predict(fit, newdata = data.frame(f1 = as.integer(600), f2 = as.integer(1300)), response = TRUE)
fit2 <- cforest(vowel~f1+f2, data = shva, controls=cforest_unbiased(ntree=100, mtry=2))
varimp(fit2)
predict(fit2, newdata = data.frame(f1 = as.integer(600), f2 = as.integer(1300)),OOB=TRUE)
fit
```

### References

* on bias in RF variable importance metrics [R blog](https://www.r-bloggers.com/be-aware-of-bias-in-rf-variable-importance-metrics/)
* trees and forests with different R packages [link](https://rpubs.com/YorkLin/cb103_20181106)