--- title: "Dimensionality Reduction" output: html_document: df_print: paged pdf_document: default editor_options: chunk_output_type: console --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, message = FALSE) ``` ```{r} library(tidyverse) library(ggfortify) library(FactoMineR) # PCA and CA library(ca) # CA library(vcd) theme_set(theme_bw()) #Sys.setlocale(locale = "ru_RU.UTF-8") ``` ## Principal component analysis (PCA) ### 1. Main approaches to the dimensionality reduction Sometimes you have a huge amount of variables. To make your data profitable and more clearly structured, you may want to reduce the number of variables -- not loosing too much information. * Principal component analysis (PCA) * Linear discriminant analysis (LDA) * Multidimensional scaling (MDS) * ... * t-SNE For categorical variables: * Correspondense analysis (CA) * Multiple correspondence analysis (MCA) ### 2. Mother an Child data set This is a dataset from [Huttenlocher, Vasilyeva, Cymerman, Levine 2002]. The authors analysed 46 pairs of mothers and children (aged from 47 to 59 months, mean age -- 54). They recorded and transcribed 2 hours from each child per day. In their study, they compared the number of noun phrases per utterance in mother speech to the number of noun phrases per utterance in child speech. ```{r} df <- read_csv("https://raw.githubusercontent.com/LingData2019/LingData2020/master/data/Huttenlocher.csv") df %>% ggplot(aes(mother, child))+ geom_point(color = "darkgreen", size = 3)+ stat_ellipse(linetype=2)+ theme_bw() ``` ### 3. PCA PCA is essentially a rotation of the coordinate axes, chosen such that each successful axis captures as much variance as possible. #### Recap: PCA is alternative to regression One can reduce 2 dimensions to one using a regression: ```{r} fit <- lm(child~mother, data = df) df$model <- predict(fit) p1 <- df %>% ggplot(aes(mother, child)) + geom_line(aes(mother, model), color = "blue") + geom_point(color = "darkgreen", size = 3) + stat_ellipse(linetype=2) + scale_y_continuous(breaks = c(1.2, 1.4, 1.6, 1.8, 2.0)) + theme_bw() p1 ``` We used regression for predicting value of one variable by another variable. ```{r} p1 + # plot red arrows geom_segment(aes(x=min(mother), y=1.8, xend=2, yend=1.8), size=0.5, color = "red", arrow = arrow(angle = 10, type = "closed", ends = "first")) + geom_segment(aes(x=2, y=min(child), xend=2, yend=1.8), size=0.5, color = "red", arrow = arrow(angle = 10, type = "closed")) + # pin two points on axes theme(axis.text.x = element_text(color=c("black", "black", "black", "red", "black"), size=c(9, 9, 9, 14, 9)), axis.text.y = element_text(color=c("black", "black", "black", "red", "black", "black"), size=c(9, 9, 9, 14, 9, 9))) ``` In PCA we change coordinate system and start predicting variables' values using less variables. 
```{r echo=FALSE, warning=FALSE}
df <- read_csv("https://raw.githubusercontent.com/LingData2019/LingData2020/master/data/Huttenlocher.csv")
pca <- prcomp(df)
PC1 <- data.frame(t(t(matrix(c(seq(-1, 1, by = 0.1), rep(0, 41)), ncol = 2) %*% t(pca$rotation)) + pca$center))

df %>%
  ggplot(aes(mother, child))+
  stat_ellipse(linetype=2)+
  geom_segment(aes(x=1, y=1.9, xend=2, yend=1.9), size=0.5, color = "red",
               arrow = arrow(angle = 10, type = "closed", ends = "first"))+
  geom_segment(aes(x=2, y=1, xend=2, yend=1.9), size=0.5, color = "red",
               arrow = arrow(angle = 10, type = "closed", ends = "first"))+
  geom_line(data = PC1, aes(mother, child), color = "blue",
            arrow = arrow(angle = 10, type = "closed"))+
  geom_point(color = "darkgreen", size = 3)+
  scale_y_continuous(breaks = c(1.2, 1.6, 1.9, 2.0))+
  theme_bw()+
  theme(axis.text.x = element_text(color=c("black", "black", "red", "black"),
                                   size=c(9, 9, 14, 9)),
        axis.text.y = element_text(color=c("black", "black", "red", "black"),
                                   size=c(9, 9, 14, 9)))+
  annotate("text", x = 2.38, y = 2.3, label = "PC1")
```

The blue line is *the first principal component* (and it is NOT a regression line). The number of PCs is always equal to the number of variables, so we can also draw the second PC:

```{r echo=FALSE, warning=FALSE}
PC2 <- data.frame(t(t(matrix(c(rep(0, 41), seq(-0.7, 0.7, by = 0.1)), ncol = 2) %*% t(pca$rotation)) + pca$center))

df %>%
  ggplot(aes(mother, child))+
  stat_ellipse(linetype=2)+
  geom_line(data = PC1, aes(mother, child), color = "blue",
            arrow = arrow(angle = 10, type = "closed"))+
  geom_line(data = PC2, aes(mother, child), color = "blue",
            arrow = arrow(angle = 10, type = "closed", ends = "first"))+
  geom_segment(aes(x=1, y=1.9, xend=2, yend=1.9), size=0.5, color = "red",
               arrow = arrow(angle = 10, type = "closed", ends = "first"))+
  geom_segment(aes(x=2, y=1, xend=2, yend=1.9), size=0.5, color = "red",
               arrow = arrow(angle = 10, type = "closed", ends = "first"))+
  geom_point(color = "darkgreen", size = 3)+
  scale_y_continuous(breaks = c(1.2, 1.6, 1.9, 2.0))+
  theme_bw()+
  theme(axis.text.x = element_text(color=c("black", "black", "red", "black"),
                                   size=c(9, 9, 14, 9)),
        axis.text.y = element_text(color=c("black", "black", "red", "black"),
                                   size=c(9, 9, 14, 9)))+
  annotate("text", x = 2.38, y = 2.3, label = "PC1")+
  annotate("text", x = 1.39, y = 2.15, label = "PC2")
```

The main point of PCA is that if the cumulative proportion of explained variance of the first PCs is high, we can drop the remaining ones. So we need to know the following things:

* What is the cumulative proportion of explained variance?

```{r, echo = TRUE}
df <- read_csv("https://raw.githubusercontent.com/LingData2019/LingData2020/master/data/Huttenlocher.csv")
summary(prcomp(df))
```

We see that PC1 alone explains 78.9% of the variance in our data.

* How are the PCs rotated relative to the old axes?

```{r, echo = TRUE}
prcomp(df)
```

So the formula for the first component is

$$PC1 = 0.6724959 \times child + 0.7401009 \times mother$$

The formula for the second component is

$$PC2 = -0.7401009 \times child + 0.6724959 \times mother$$

Now we can change the axes. We use the `autoplot()` function from the `ggfortify` package to produce the graph:

```{r}
autoplot(pca,
         loadings = TRUE,
         loadings.label = TRUE) +
  theme_bw() +
  stat_ellipse(linetype=2)
```

### Summary:

* If the cumulative proportion of explained variance for some PCs is high, we can change the coordinate system and describe our observations using fewer variables.
* We can even fit a regression or clustering model on the component scores (see the sketch below).
* PCA for categorical variables is called Multiple correspondence analysis (MCA).
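Here is a minimal sketch of the second point, not part of the original lesson: a k-means clustering of the observations in the space of their component scores. It assumes the `pca` object from the chunks above; the number of clusters (2) and the seed are arbitrary illustration choices.

```{r}
# a sketch: k-means clustering on the principal component scores
set.seed(42)                              # arbitrary seed for reproducibility
clusters <- kmeans(pca$x, centers = 2)    # 2 clusters, chosen only for illustration

# attach the cluster labels to the scores and plot them in PC space
as.data.frame(pca$x) %>%
  mutate(cluster = factor(clusters$cluster)) %>%
  ggplot(aes(PC1, PC2, color = cluster))+
  geom_point(size = 3)
```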
### R functions

There are several functions for PCA, MCA and their visualisation.

* PCA: `prcomp()`
* PCA: `princomp()`
* PCA: `FactoMineR::PCA()`
* PCA: `ade4::dudi.pca()`
* PCA: `amap::acp()`
* PCA visualisation: `ggfortify::autoplot()`

### 2. Gospels' frequency word lists

The gospels of Matthew, Mark, and Luke are referred to as the Synoptic Gospels and stand in contrast to John, whose content is comparatively distinct. This dataset (https://tinyurl.com/y8tcf3uw) contains frequencies of selected words (excluding stopwords, pronouns, and the frequent word "Jesus") as attested in the four gospels of the New Testament. For some visualisations you will need to assign row names to the dataframe:

```{r}
gospels <- read.csv("https://tinyurl.com/y8tcf3uw")
row.names(gospels) <- gospels$word
```

#### 2.1 Apply PCA to the four continuous variables. Use the `prcomp()` function. What is the cumulative proportion of explained variance for the first and second components?

```{r}
PCA <- prcomp(gospels[,2:5], center = TRUE, scale. = TRUE)
summary(PCA)
```

#### 2.2 Use the `autoplot()` function from the `ggfortify` library to create a plot like this one. See more examples here: https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_pca.html

```{r}
autoplot(PCA,
         shape = FALSE,
         loadings = TRUE,
         label = TRUE,
         loadings.label = TRUE)+
  theme_bw()
```

#### 2.3 Predict the coordinates for the word "Jesus", which has the following frequencies: John = 0.05, Luke = 0.01, Mark = 0.02, Matthew = 0.02.

```{r}
predict(PCA, data.frame(John = 0.05, Luke = 0.01, Mark = 0.02, Matthew = 0.02))
```

We can also look at the PC2 and PC3 components:

```{r}
autoplot(PCA, x = 2, y = 3,
         shape = FALSE,
         loadings = TRUE,
         label = TRUE,
         loadings.label = TRUE)+
  theme_bw()
```

#### Exercise: Register variation in the British National Corpus

Dataset and description from Natalia Levshina's package `Rling`. This is a data set with relative frequencies (proportions) of different word classes in 69 subcorpora of the British National Corpus (the BYU-BNC version).

* `Reg` — a factor that describes the metaregister, with levels Acad, Fiction, Misc, News, NonacProse and Spok.
* `Ncomm` — a numeric vector with relative frequencies of common nouns.
* `Nprop` — a numeric vector with relative frequencies of proper nouns.
* `Vpres` — a numeric vector with relative frequencies of verbs in the present tense form, 3rd person singular.
* `Vpast` — a numeric vector with relative frequencies of verbs in the past tense form.
* `P1` — a numeric vector with relative frequencies of first-person pronouns.
* `P2` — a numeric vector with relative frequencies of second-person pronouns.
* `Adj` — a numeric vector with relative frequencies of adjectives.
* `ConjCoord` — a numeric vector with relative frequencies of coordinating conjunctions.
* `ConjSub` — a numeric vector with relative frequencies of subordinating conjunctions.
* `Interject` — a numeric vector with relative frequencies of interjections.
* `Num` — a numeric vector with relative frequencies of numerals.

Q1. Apply PCA to all numeric variables. What is the cumulative proportion of explained variance for the first, second and third components?

Q2. Extract the coordinates from the pca object (`pca$x`), merge them with the dataset itself, and create a visualisation of the first two components with confidence ellipses for each metaregister.

```{r}
reg_bnc <- read.csv("https://goo.gl/19QywL")
pca <- prcomp(reg_bnc[,-1], center = TRUE, scale. = TRUE)
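# A sketch, not in the original: the cumulative proportion of explained
# variance reported by summary() below can also be computed by hand from the
# component standard deviations stored in pca$sdev.
cumsum(pca$sdev^2) / sum(pca$sdev^2)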
summary(pca)
autoplot(pca,
         shape = FALSE,
         loadings = TRUE,
         label = TRUE,
         loadings.label = TRUE)+
  theme_bw()
reg_bnc <- cbind(reg_bnc, pca$x)
reg_bnc %>%
  ggplot(aes(PC1, PC2, color = Reg))+
  geom_point()+
  stat_ellipse()+
  theme_bw()
```

## 3. Grammatical profiles of Russian verbs dataset

In their article "Predicting Russian aspect by frequency across genres", Eckhoff et al. (2017) ask whether the aspect of individual verbs can be predicted from the statistical distribution of their inflectional forms. The dataset contains a sample of sentences taken from the Russian National Corpus. Each verb was annotated for Mood&Tense, Voice, Aspect and other grammatical features.

```{r 1.01}
ru <- read.csv('https://raw.githubusercontent.com/agricolamz/2018-MAG_R_course/master/data/RNCverbSamples_journ.csv', sep=';')
str(ru)
```

First, we will do some preprocessing. Here, we join future passives with other passive participles.

```{r 1.02}
ru$MoodTense[ru$Voice == 'pass' & ru$MoodTense == 'indicfut'] <- "partcppast"
```

Let's look at the top 10 verbs in the dataset:

```{r 1.03}
ru %>%
  group_by(LemmaTranslit) %>%
  summarize(count=n()) %>%
  arrange(desc(count))
```

### 3.1 Grammatical profiles

A grammatical profile is a vector that contains the number (or the ratio) of inflectional forms for an individual lemma. We can pick out a subset containing only the forms of the verb _chitat'_ 'read'.

```{r 1.1.1}
ru.chit <- droplevels(subset(ru, LemmaTranslit == "chitat'"))
```

This is the grammatical profile of _chitat'_:

```{r 1.1.2}
print("The grammatical profile of chitat'")
print(table(ru.chit$MoodTense))
print(prop.table(table(ru.chit$MoodTense))*100)
```

Now we will calculate the grammatical profile of each verb with at least 50 occurrences in our data set and split the resulting table into two parts: the grammatical forms themselves (numeric data, see `tt` below) and metadata (categories labeled in the RNC or by annotators: lemma, transitivity, aspect).

Table of tense-mood distribution per lemma:

```{r 1.1.3}
tab = table(ru$LemmaTranslit, ru$MoodTense)
# turns the table into a data frame
t = as.data.frame.matrix(tab)
# adds lemmas as a separate column
t$LemmaTranslit = row.names(t)
# adds metadata columns for transitivity and aspect, assuming that these are
# stable per lemma - just picking the first value per lemma
t$Trans = as.factor(unlist(lapply(t$LemmaTranslit, function(x) names(table(droplevels(subset(ru, LemmaTranslit == x)$Trans)))[1])))
t$Asp = as.factor(unlist(lapply(t$LemmaTranslit, function(x) names(table(droplevels(subset(ru, LemmaTranslit == x)$Aspect)))[1])))
```

Label the biaspectuals 'b':

```{r 1.1.4}
levels(t$Asp) <- c('i','p','b')
t[t$LemmaTranslit=="ispol'zovat'",]$Asp <- 'b'
t[t$LemmaTranslit=="obeschat'",]$Asp <- 'b'
```

Pick out the lemmas with 50 or more occurrences and split the data:

```{r 1.1.5}
tt <- t[rowSums(t[,1:9]) >= 50,]
```

## 4. t-SNE visualisation (for numeric variables)

The idea behind this kind of visualisation is to place similar observations close to each other and dissimilar observations far apart, so that different clusters end up as far from each other as possible and the points within each cluster still show the cluster's internal structure. Note that, unlike in PCA, the points' coordinates in a t-SNE plot cannot be interpreted directly (there is no linear mapping from the original space to the plot), and apparent linear correlations can be misleading.
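A practical note before the run below (a sketch, not part of the original analysis): the t-SNE optimisation is stochastic, so two runs generally give somewhat different pictures. Fixing the random seed makes a run reproducible; the seed value and the alternative perplexity used here are arbitrary illustration choices.

```{r}
library(Rtsne)
set.seed(42)  # arbitrary seed, chosen only to make the run reproducible
tsne_check <- Rtsne(as.matrix(tt[,1:9]), dims = 2, perplexity = 30)
plot(tsne_check$Y, xlab = "Dim 1", ylab = "Dim 2",
     main = "t-SNE of the grammatical profiles (perplexity = 30)")
```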
```{r 1.2}
library(Rtsne)
ru.tsne <- Rtsne(tt[,1:9], dims=2, perplexity=50, verbose=TRUE, max_iter = 2000)
tt <- cbind(tt, ru.tsne$Y)
tt %>%
  ggplot(aes(`1`, `2`, label = Asp, color = Asp))+
  geom_text()
```

## 5. Correspondence analysis

Correspondence analysis (CA) plays the role of PCA for count data: here we apply it to the lemma-by-form table and plot the row points (lemmas, in blue) and the column points (grammatical forms, in red) in the same space.

```{r}
tt_ca <- ca(tt[,1:9])
tt_col <- data.frame(tt_ca$colcoord)
tt_col$rows <- rownames(tt_ca$colcoord)
tt_row <- data.frame(tt_ca$rowcoord)
tt_row$rows <- rownames(tt_ca$rowcoord)

tt_col %>%
  ggplot(aes(Dim1, Dim2))+
  geom_hline(yintercept = 0, linetype = 2)+
  geom_vline(xintercept = 0, linetype = 2)+
  geom_point(data = tt_row, aes(Dim1, Dim2), color = "darkblue")+
  geom_text(aes(label = rows), color = "red")+
  labs(x = "Dim1 (39.06%)", y = "Dim2 (19.72%)")
```

## 6. Multiple correspondence analysis

Dataset and description from a [paper by Natalia Levshina](https://goo.gl/v6AmVj). Modern standard Dutch has two periphrastic causatives with the infinitive: the constructions with doen 'do' and laten 'let'. The study is based on an 8-million-token corpus of Netherlandic and Belgian Dutch. After manual cleaning, 6,808 observations remained; these were coded for seven semantic, syntactic, geographical and thematic variables.

* Aux --- a factor that specifies the causative auxiliary, with levels laten and doen.
* Country --- a factor with levels NL (the Netherlands) and BE (Belgium).
* Causation --- a factor that describes the type of causation, with levels Affective, Inducive, Physical and Volitional.
* EPTrans --- a factor that specifies the transitivity of the Effected Predicate, with levels Intr (intransitive) and Tr (transitive).
* EPTrans1 --- a factor with levels Intr and Tr. It is very similar to the previous one, except for a few observations.

```{r}
dutch_caus <- read.csv("https://goo.gl/2yAR3T")
MCA <- MASS::mca(dutch_caus[,-1])
MCA
dutch_caus <- cbind(dutch_caus, MCA$rs)
variables <- as_data_frame(MCA$cs)
variables$var_names <- rownames(MCA$cs)

dutch_caus %>%
  ggplot(aes(`1`, `2`))+
  geom_point(aes(color = Aux))+
  stat_ellipse(aes(color = Aux))+
  geom_text(data = variables, aes(`1`, `2`, label = var_names))+
  theme_bw()+
  scale_x_continuous(limits = c(-0.015, 0.02))
```

### Useful links

* FactoMineR for PCA: [link](http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/112-pca-principal-component-analysis-essentials/)
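As a complement to the link above, here is a minimal sketch (not from the original lesson) of running PCA with `FactoMineR`, which is already loaded in the setup chunk, and visualising the result with the `factoextra` package. The chunk is set to `eval=FALSE` because `factoextra` is not loaded at the top of this document and may not be installed; the gospels data is reused only as an illustration.

```{r, eval=FALSE}
library(factoextra)
res.pca <- FactoMineR::PCA(gospels[, 2:5], scale.unit = TRUE, graph = FALSE)
fviz_screeplot(res.pca)                 # proportion of variance per component
fviz_pca_biplot(res.pca, repel = TRUE)  # words and gospels in one biplot
```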