Dimensionality Reduction Techniques
========================================================
type: sub-section
For IODS by Tuomo Nieminen & Emma Kämäräinen
Powered by Rpresentation. The code for this presentation is [here](https://raw.githubusercontent.com/TuomoNieminen/Helsinki-Open-Data-Science/master/docs/dimensionality_reduction.Rpres)
```{r, include = F}
# setup
knitr::opts_chunk$set(echo = F, comment = NA)
mytheme <- function(axt_text = 18, axt_title = 22) {
  ggplot2::theme(axis.text = ggplot2::element_text(size = axt_text),
                 axis.title = ggplot2::element_text(size = axt_title, face = "bold"))
}
```
Principal component analysis
========================================================
type: prompt
incremental: false
```{r}
# 3d data
set.seed(333)
x <- rep(c(1,2,5), times = 50)
col <- x
x <- jitter(x, factor = 2)
X <- matrix(c(x, x + rnorm(length(x)), x + rnorm(length(x))), ncol = 3)
```
From high...
```{r}
# plot original 3d data
library(scatterplot3d)
scatterplot3d(X,
              cex.symbols = 2.0, pch = 20, box = FALSE,
              color = col + 1,
              xlab = "X1", ylab = "X2", zlab = "X3")
```
***
... to lower dimensionality
```{r}
par(mar = c(4, 4, 3, 1))
# PCA on the centered and scaled data; plot the first two principal components
pca <- prcomp(X, center = TRUE, scale. = TRUE)
newdata <- pca$x
eigenv <- pca$sdev^2
var_explained <- paste0("(", round(100 * (eigenv / sum(eigenv)), 1), "%)")
colnames(newdata) <- paste(colnames(newdata), var_explained)
plot(newdata[, 1:2], ylim = c(-5, 5),
     bty = "n",
     cex = 2.0, pch = 20, col = col + 1)
```
What is dimensionality?
========================================================
In statistical analysis, one can think of *dimensionality* as the number of variables (features) related to each observation in the data.
- If each observation is measured by $d$ features, then the data is $d$-dimensional. Each observation needs $d$ coordinates to define its location in a **mathematical space** (see the small example below).
- If there are a lot of features, some of them can relate to the same underlying dimensions (which are not directly measured)
- Some dimensions may be stronger and some weaker; they are not equally important
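As a concrete sketch using R's built-in `iris` data: its four numeric measurements make each flower a point in a 4-dimensional space.
```{r, echo = TRUE}
# a minimal illustration: 150 observations, each described by d = 4 features
dim(iris[, 1:4])
```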
Dimensionality reduction
========================================================
The original variables of high-dimensional data might contain "too much" information (along with noise or other random error) for representing the underlying phenomenon of interest.
- A solution is to reduce the number of dimensions and focus only on the **most essential dimensions** extracted from the data
- In practice we can *transform* the data and use only a few **principal components** for visualisation and/or analysis
- The hope is that the variance along a small number of principal components provides a reasonable characterization of the complete data set
Tools for dimensionality reduction
========================================================
On the linear algebra level, Singular Value Decomposition (SVD) is the most important tool for reducing the number of dimensions in multivariate data.
- The SVD literally *decomposes* a matrix into a product of smaller matrices and reveals the most important components (see the sketch below)
- Principal Component Analysis (PCA) is a statistical procedure that does the same thing
- Correspondence Analysis (CA) or Multiple CA (MCA) can be used if the data consists of categorical variables
- The classification method LDA can also be considered a dimensionality reduction technique
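A small sketch of the SVD-PCA connection (the object names `Z`, `s` and `pc_svd` are only for this illustration): the right singular vectors of the centered and scaled data give the PCA rotation, and the singular values give the component variances.
```{r, echo = TRUE}
# sketch: SVD of the standardized iris measurements reproduces prcomp()
Z <- scale(iris[, 1:4])   # center and scale the numeric columns
s <- svd(Z)               # Z = U D V'
pc_svd <- prcomp(Z)
all.equal(abs(s$v), abs(pc_svd$rotation), check.attributes = FALSE)  # same directions (up to sign)
all.equal(s$d^2 / (nrow(Z) - 1), pc_svd$sdev^2)                      # singular values give the PC variances
```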
Principal Component Analysis (PCA)
========================================================
In PCA, the data is first *transformed* to a new space with the same or a smaller number of dimensions (new features). These new features are called the **principal components**. They always have the following properties:
- The 1st principal component captures the maximum amount of variance from the features in the original data
- The 2nd principal component is orthogonal to the first and captures the maximum amount of the remaining variability
- The same is true for each subsequent principal component. They are all **uncorrelated** and each is less important than the previous one, in terms of captured variance (see the quick check below).
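A quick numerical check of these properties, sketched on the iris measurements (`pc_check` is just a temporary name):
```{r, echo = TRUE}
pc_check <- prcomp(iris[, 1:4], scale. = TRUE)
round(pc_check$sdev^2, 3)   # the variances decrease from PC1 to PC4
round(cor(pc_check$x), 3)   # the principal components are uncorrelated
```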
Reducing dimensionality with PCA
========================================================
incremental: false
Given the properties of the principal components, we can simply choose the first few principal components to represent our data.
- This will give us uncorrelated variables which capture the maximum amount of variation in the data!
***
```{r, fig.height = 5}
library(ggfortify)
library(dplyr)
ir <- iris
names(ir) <- c("Sep.Len", "Sep.Wid", "Pet.Len", "Pet.Wid", "Species")
pc <- prcomp(ir[-5], scale. = T)
vars <- pc$sdev**2
varcap <- (100*vars / (sum(vars))) %>% round(2)
lab <- paste0(colnames(pc$rotation), " (", varcap, "%)")
autoplot(pc, data = ir, colour = "Species", size = 3,
         loadings = TRUE, loadings.label = TRUE,
         loadings.label.size = 6, loadings.label.vjust = 1.5,
         xlab = lab[1], ylab = lab[2],
         main = "A biplot of iris's 2 principal components") + mytheme()
```
*The dimensionality of iris reduced to two principal components (PC). The first PC captures more than 70% of the total variance in the 4 original variables.*
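The captured variances can also be read from `summary()` of the same `prcomp` object (the 'Proportion of Variance' row):
```{r, echo = TRUE}
summary(pc)  # PC1 alone captures over 70% of the total variance
```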
About PCA
========================================================
Unlike LDA, PCA has no criterion or target variable. PCA may therefore be called an **unsupervised** method.
- PCA is sensitive to the relative scaling of the original features and assumes that features with larger variance are more important than features with smaller variance.
- **Standardization** of the features before PCA is therefore often a good idea (see the sketch below).
- PCA is powerful at encapsulating correlations between the original features into a smaller number of uncorrelated dimensions
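A sketch of the effect of scaling, comparing the proportions of variance captured with and without standardization (`pc_raw` and `pc_std` are temporary names for this illustration):
```{r, echo = TRUE}
pc_raw <- prcomp(iris[, 1:4])                  # original scales
pc_std <- prcomp(iris[, 1:4], scale. = TRUE)   # standardized features
round(pc_raw$sdev^2 / sum(pc_raw$sdev^2), 2)   # PC1 dominated by the largest-variance feature
round(pc_std$sdev^2 / sum(pc_std$sdev^2), 2)   # variance is shared more evenly
```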
About PCA (2)
========================================================
PCA is a mathematical tool, not a statistical model, which is why linear algebra (SVD) is enough.
- There is no statistical model for separating the sources of variance. All variance is thought to come from the same (although multidimensional) source.
- It is also possible to model the dimensionality using underlying latent variables with, for example, Factor Analysis (sketched briefly below)
- These advanced methods of multivariate analysis are not part of this course
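For comparison only, a one-factor latent variable model could be fitted with base R's `factanal()`; shown here as an unevaluated sketch:
```{r, echo = TRUE, eval = FALSE}
# sketch, assuming a single latent factor behind the four iris measurements
factanal(iris[, 1:4], factors = 1)
```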
Biplots
========================================================
type: prompt
incremental: false
```{r}
library(ggfortify)
library(dplyr)
ir <- iris
names(ir) <- c("Sep.Len", "Sep.Wid", "Pet.Len", "Pet.Wid", "Species")
pc <- prcomp(ir[-5], scale. = T)
vars <- pc$sdev**2
varcap <- (100*vars / (sum(vars))) %>% round(2)
lab <- paste0(colnames(pc$rotation), " (", varcap, "%)")
autoplot(pc, data = ir, colour = "Species", size = 3,
         loadings = TRUE, loadings.label = TRUE,
         loadings.label.size = 5, loadings.label.vjust = 1.5,
         xlab = lab[1], ylab = lab[2],
         main = "A biplot of iris's 2 principal components") + mytheme(18, 18)
```
***
Correlations of iris
```{r}
ir <- iris
names(ir) <- c("Sep.Len", "Sep.Wid", "Pet.Len", "Pet.Wid", "Species")
round(cor(ir[-5]),2)
```
The correlations (and more) can be interpreted from the biplot on the left, but how?
The 'Bi' in Biplots
========================================================
A biplot is a way of visualizing two representations of the same data. The biplot displays:
**(1)** The observations in a lower (2-)dimensional representation
- A scatter plot is drawn where the observations are placed on x and y coordinates defined by two principal components (PCs)
**(2)** The original features and their relationships with both each other and the principal components
- Arrows and/or labels are drawn to visualize the connections between the original features and the PCs.
Properties of biplots
========================================================
In a biplot, the following connections hold:
- The angle between arrows representing the original features can be interpreted as the correlation between the features. Small angle = high positive correlation.
- The angle between a feature and a PC axis can be interpreted as the correlation between the two. Small angle = high positive correlation.
- The lengths of the arrows are proportional to the standard deviations of the features (a numerical check follows below)
Biplots can be used to visualize the results of dimensionality reduction methods such as LDA, PCA, Correspondence Analysis (CA) and Multiple CA.
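As a numerical sketch of these properties (using the `pc` object from the iris slides): for standardized data, the loadings multiplied by the component standard deviations give the feature-PC correlations, and the first two PCs give the 2-dimensional approximation of `cor(ir[-5])` that the biplot displays.
```{r, echo = TRUE}
cors <- pc$rotation %*% diag(pc$sdev)      # correlations between the features and the PCs
colnames(cors) <- colnames(pc$rotation)
round(cors[, 1:2], 2)                      # small angle to a PC axis = high positive correlation
round(cors[, 1:2] %*% t(cors[, 1:2]), 2)   # 2-PC approximation of the correlation matrix
```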
Multiple Correspondence Analysis
========================================================
type: prompt
incremental: false
autosize: true
```{r, echo=FALSE}
# MCA of the FactoMineR 'hobbies' data: columns 19-22 are categorical and
# column 23 is a continuous supplementary (background) variable
library(FactoMineR)
library(RColorBrewer)
data(hobbies)
res.mca <- MCA(hobbies, quali.sup = 19:22, quanti.sup = 23, graph = FALSE)
plot(res.mca, invisible = c("ind", "quali.sup"), cex = .8,
     habillage = "quali", palette = palette(brewer.pal(8, 'Dark2')))
```
Multiple Correspondence Analysis
========================================================
incremental: false
autosize: true
- Dimensionality reduction method
- Analyses the pattern of relationships of several categorical
variables
- A generalization of PCA and an extension of correspondence analysis (CA)
- Deals with categorical variables, but continuous ones can be used as background (supplementary) variables
- Can be used with qualitative data, so few assumptions are made about the variables or the data in general. MCA uses frequencies, which can be counted even from text-based datasets.
Multiple Correspondence Analysis
========================================================
incremental: false
autosize: true
- For the categorical variables, you can use either the [indicator matrix or the Burt matrix](https://en.wikipedia.org/wiki/Multiple_correspondence_analysis#As_an_extension_of_correspondences_analysis) in the analysis (see the sketch after this list)
- The indicator matrix contains all the levels of the categorical variables as binary variables (1 = belongs to the category, 0 = does not)
- The Burt matrix is a matrix of two-way cross-tabulations between all the variables in the dataset
- The general aim is to condense and present the information of the cross-tabulations in a clear graphical form
- Correspondence Analysis (a special case of MCA) works similarly with a cross-table of only two categorical variables
- There are also several other variations of the CA methods
- And next, let's look at how the MCA outputs look in R!
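A minimal sketch of the two matrices with a toy data frame (`df` and `ind` are made up for this illustration; FactoMineR builds these matrices internally):
```{r, echo = TRUE}
df <- data.frame(A = factor(c("a1", "a2", "a1")), B = factor(c("b1", "b1", "b2")))
ind <- model.matrix(~ . - 1, data = df,
                    contrasts.arg = lapply(df, contrasts, contrasts = FALSE))
ind               # indicator matrix: one binary column per category level
t(ind) %*% ind    # Burt matrix: all two-way cross-tabulations between the variables
```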
MCA summary(1)
========================================================
incremental: false
autosize: true
left: 40%
Output of MCA summary contains...
- **Eigenvalues**: the variances and the percentage of variances retained by each dimension
- **Individuals**: the individuals' coordinates, their contributions (%) to the dimensions, and the cos2 values (squared correlations) on the dimensions.
***
```{r, echo=FALSE}
# read the tea data, keep three categorical variables and anonymize their labels
tea_time <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/tea_time.csv",
                       sep = ",", header = TRUE, stringsAsFactors = TRUE)
data <- dplyr::select(tea_time, dplyr::one_of(c("Tea", "lunch", "sugar")))
colnames(data) <- c("Var1", "Var2", "Var3")
levels(data$Var1) <- c("Label1", "Label2", "Label3")
levels(data$Var2) <- c("Level1", "Level2")
levels(data$Var3) <- c("Name1", "Name2")
res.mca <- MCA(data, graph = FALSE, method = "Indicator")
summary(res.mca, nbelements = 3)
```
MCA summary(2)
========================================================
incremental: false
autosize: true
left: 40%
Output of MCA summary contains...
- **Categories**: the coordinates of the variable categories, the contribution (%), the cos2 (the squared correlations) and the v.test value. The v.test follows a normal distribution: if the value falls outside $\pm 1.96$, the coordinate is significantly different from zero.
- **Categorical variables**: the squared correlation between each variable and the dimensions. A value close to one indicates a strong link between the variable and the dimension.
***
```{r, echo=FALSE}
summary(res.mca, nbelements = 3)
```
Read more from [here](http://www.sthda.com/english/wiki/multiple-correspondence-analysis-essentials-interpretation-and-application-to-investigate-the-associations-between-categories-of-multiple-qualitative-variables-r-software-and-data-mining) and [here](http://factominer.free.fr/classical-methods/multiple-correspondence-analysis.html)
MCA biplot(1)
========================================================
incremental: false
autosize: true
Visualizing MCA:
- You can plot the variables, the individuals and the background (supplementary) variables separately, or you can draw them in the same plot (a few options are sketched below).
- `plot.MCA()` function in R (from FactoMineR) has a lot of options for plotting
- See a [video](https://www.youtube.com/watch?v=reG8Y9ZgcaQ) of MCA (plotting options start at 5:36).
- Let's look at a minimal example on the next slide.
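A sketch of a few `plot.MCA()` options (unevaluated; assumes the `res.mca` object from the tea example, and is far from exhaustive):
```{r, echo = TRUE, eval = FALSE}
plot(res.mca, invisible = "var")   # individuals only
plot(res.mca, invisible = "ind")   # variable categories only
plot(res.mca, choix = "var")       # the variables themselves on the dimensions
```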
MCA biplot(2)
========================================================
incremental: false
autosize: true
left: 50%
- On the right we have the MCA factor map (biplot), where the variable categories are drawn on the first two dimensions
- The MCA biplot is a good visualization for spotting possible patterns among the variables
- The distance between variable categories gives a measure of their similarity
- For example, Label2 and Name2 are more similar than Label2 and Level2, and Label3 is different from all the other categories (the distances can also be computed directly, as sketched after the plot)
***
```{r, fig.height = 6}
plot(res.mca, invisible=c("ind"), habillage = "quali", cex = 1.3, palette=palette(brewer.pal(8, 'Dark2')))
```
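The similarity claim can be checked directly from the category coordinates (a small sketch):
```{r, echo = TRUE}
# distances between the category points on the first two dimensions
round(dist(res.mca$var$coord[, 1:2]), 2)
```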