---
title: "Discrimimant Analysis"
---
## Discriminant analysis
* ANOVA and MANOVA: predict a (counted/measured) response from group membership.
* Discriminant analysis: predict group membership based on counted/measured variables.
* Covers same ground as logistic regression (and its variations), but emphasis on classifying observed data into correct groups.
* Does so by searching for linear combination of original variables that best separates data into groups (canonical variables).
* Assumption here that groups are known (for data we have). If trying to "best separate" data into unknown groups, see *cluster analysis*.
## Packages
```{r bDiscrim-1, message=FALSE, warning=FALSE}
library(MASS)
library(tidyverse)
library(ggrepel)
library(ggbiplot)
library(MVTests) # for Box M test
library(conflicted)
conflict_prefer("arrange", "dplyr")
conflict_prefer("summarize", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")
conflict_prefer("mutate", "dplyr")
```
- `ggrepel` allows labelling points on a plot so they don't
overwrite each other.
- `ggbiplot` uses `plyr` rather than `dplyr`, which has functions by similar names.
## About `select`
* Both `dplyr` (in `tidyverse`) and `MASS`
have a function called `select`, and *they do
different things*.
* How do you know which `select` is going to get called?
* With `library`, the one loaded *last* is visible,
and others are not.
* Thus we can access the `select` in `dplyr` but
not the one in `MASS`. If we wanted that one, we'd have to
say `MASS::select`.
- Better: load `conflicted` package. Any time you load two packages containing functions with same name, you get error and have to choose between them.
## Example 1: seed yields and weights
```{r bDiscrim-2, message=FALSE}
my_url <- "http://ritsokiguess.site/datafiles/manova1.txt"
hilo <- read_delim(my_url, " ")
g <- ggplot(hilo, aes(x = yield, y = weight,
colour = fertilizer)) + geom_point(size = 4)
```
:::: {.columns}
::: {.column width="40%"}
Recall data from MANOVA: needed a multivariate analysis to find
difference in seed yield and weight based on whether they were high
or low fertilizer.
:::
::: {.column width="60%"}
```{r}
#| echo = FALSE
g
```
:::
::::
## Basic discriminant analysis
```{r bDiscrim-4 }
hilo.1 <- lda(fertilizer ~ yield + weight, data = hilo)
```
* Uses `lda` from package MASS.
* "Predicting" group membership from measured variables.
## Output
\small
```{r bDiscrim-5}
hilo.1
```
\normalsize
## Things to take from output
* Group means: high-fertilizer plants have (slightly) higher
mean yield and weight than low-fertilizer plants.
* "Coefficients of linear discriminants": \texttt{LD1,
LD2,}\ldots are scores constructed from observed variables that
best separate the groups.
* For any plant, get LD1 score by taking $-0.76$ times yield
plus $-1.25$ times weight, add up, standardize.
- the LD1 coefficients are like slopes:
- if yield higher, LD1 score for a plant lower
- if weight higher, LD1 score for a plant lower
* High-fertilizer plants have higher yield and weight, thus
low (negative) LD1 score. Low-fertilizer plants have low yield and
weight, thus high (positive) LD1 score.
* One LD1 score for each observation. Plot with actual groups.
## How many linear discriminants?
* Smaller of these:
* Number of variables
* Number of groups *minus 1*
* Seed yield and weight: 2 variables, 2 groups,
$\min(2,2-1)=1$.
## Getting LD scores
Feed output from LDA into `predict`:
\scriptsize
```{r bDiscrim-6 }
p <- predict(hilo.1)
hilo.2 <- cbind(hilo, p)
hilo.2
```
\normalsize
## LD1 scores in order
Most positive LD1 score is most obviously low fertilizer, most negative is most obviously high:
\footnotesize
```{r bDiscrim-7}
hilo.2 %>% select(fertilizer, yield, weight, LD1) %>%
arrange(desc(LD1))
```
\normalsize
High fertilizer have yield and weight high, negative LD1 scores.
## Plotting LD1 scores
With one LD score, plot against (true) groups, eg. boxplot:
```{r bDiscrim-8, fig.height=3.4}
ggplot(hilo.2, aes(x = fertilizer, y = LD1)) + geom_boxplot()
```
## What else is in `hilo.2`?
* `class`: predicted fertilizer level (based on values of
`yield` and `weight`).
* `posterior`: predicted probability of being low or high
fertilizer given `yield` and `weight`.
- `LD1`: scores for (each) linear discriminant (here is only LD1) on each observation.
## Predictions and predicted groups
\ldots based on `yield` and `weight`:
\footnotesize
```{r bDiscrim-12}
hilo.2 %>% select(yield, weight, fertilizer, class)
```
\normalsize
## Count up correct and incorrect classification
```{r}
with(hilo.2, table(obs = fertilizer, pred = class))
```
* Each predicted fertilizer level is exactly same as observed
one (perfect prediction).
* Table shows no errors: all values on top-left to bottom-right
diagonal.
## Posterior probabilities
\footnotesize
show how clear-cut the classification decisions were:
```{r bDiscrim-13}
hilo.2 %>%
mutate(across(starts_with("posterior"), \(p) round(p, 4))) %>%
select(-LD1)
```
Only obs.\ 7 has any doubt: `yield` low for a high-fertilizer,
but high `weight` makes up for it.
\normalsize
## Example 2: the peanuts
\scriptsize
```{r bDiscrim-14, message=F}
my_url <- "http://ritsokiguess.site/datafiles/peanuts.txt"
peanuts <- read_delim(my_url, " ")
peanuts
```
\normalsize
- Recall: `location` and `variety` both significant in
MANOVA. Make combo of them (over):
## Location-variety combos
\footnotesize
```{r combos}
peanuts %>%
unite(combo, c(variety, location)) -> peanuts.combo
peanuts.combo
```
\normalsize
## Discriminant analysis
\tiny
```{r bDiscrim-15}
peanuts.1 <- lda(combo ~ y + smk + w, data = peanuts.combo)
peanuts.1
```
\normalsize
## Comments
* Now 3 LDs (3 variables, 6 groups, $\min(3,6-1)=3$).
* Relationship of LDs to original variables. Look for
coeffs far from zero:
```{r}
peanuts.1$scaling
```
* high `LD1` mainly high `y`
or low `w`.
* high `LD2` mainly low `w`.
* Proportion of trace values show relative importance of LDs:
`LD1` much more important than `LD2`; `LD3` worthless.
## The predictions and misclassification
```{r bDiscrim-17 }
p <- predict(peanuts.1)
peanuts.2 <- cbind(peanuts.combo, p)
with(peanuts.2, table(obs = combo, pred = class))
```
Actually classified very well. Only one `6_2` classified as a
`5_1`, rest all correct.
## Posterior probabilities
\scriptsize
```{r bDiscrim-18}
peanuts.2 %>%
mutate(across(starts_with("posterior"), \(p) round(p, 2))) %>%
select(combo, class, starts_with("posterior"))
```
\small
*Some* doubt about which combo each plant belongs in, but not too
much. The one misclassified plant was a close call.
\normalsize
## Discriminant scores, again
* How are discriminant scores related to original variables?
* Construct data frame with original data and discriminant
scores side by side:
\footnotesize
```{r bDiscrim-19}
peanuts.1$scaling
```
\normalsize
* LD1 positive if `y` large and/or `w` small.
* LD2 positive if `w` small.
## Discriminant scores for data
\footnotesize
```{r bDiscrim-20}
peanuts.2 %>% select(y, w, starts_with("x"))
```
* Obs.\ 5 and 6 have most positive `LD1`: large `y`,
small `w`.
* Obs.\ 4 has most positive `LD2`: small `w`.
\normalsize
## Plot LD1 vs.\ LD2, labelling by combo
\small
```{r bDiscrim-24, fig.height=4}
g <- ggplot(peanuts.2, aes(x = x.LD1, y = x.LD2, colour = combo,
label = combo)) + geom_point() +
geom_text_repel() + guides(colour = "none")
g
```
\normalsize
## "Bi-plot" from `ggbiplot`
```{r bDiscrim-25, fig.height=3.3}
ggbiplot(peanuts.1, groups = factor(peanuts.combo$combo))
```
## Installing `ggbiplot`
* `ggbiplot` not on CRAN, so usual
`install.packages` will not work.
* Install package `devtools` first (once):
```{r bDiscrim-26, eval=F}
install.packages("devtools")
```
* Then install `ggbiplot` (once):
```{r bDiscrim-27, eval=F}
library(devtools)
install_github("vqv/ggbiplot")
```
## Cross-validation
* So far, have predicted group membership from same data used to
form the groups --- dishonest!
* Better: *cross-validation*: form groups from all
observations *except one*, then predict group membership for
that left-out observation.
* No longer cheating!
* Illustrate with peanuts data again.
## Misclassifications
* Fitting and prediction all in one go:
\small
```{r bDiscrim-28 }
p <- lda(combo ~ y + smk + w,
data = peanuts.combo, CV = TRUE)
peanuts.3 <- cbind(peanuts.combo, class = p$class,
posterior = p$posterior)
with(peanuts.3, table(obs = combo, pred = class))
```
\normalsize
* Some more misclassification this time.
## Repeat of LD plot
```{r graziani,fig.height=4.7}
g
```
## Posterior probabilities
\footnotesize
```{r bDiscrim-29}
peanuts.3 %>%
mutate(across(starts_with("posterior"), \(p) round(p, 3))) %>%
select(combo, class, starts_with("posterior"))
```
\normalsize
## Why more misclassification?
* When predicting group membership for one observation, only
uses the *other one* in that group.
* So if two in a pair are far apart, or if two groups overlap,
great potential for misclassification.
* Groups `5_1` and `6_2` overlap.
* `5_2` closest to `8_1`s looks more like an
`8_1` than a `5_2` (other one far away).
* `8_1`s relatively far apart and close to other things,
so one appears to be a `5_2` and the other an `8_2`.
## Example 3: professions and leisure activities
* 15 individuals from three different professions (politicians,
administrators and belly dancers) each participate in four
different leisure activities: reading, dancing, TV watching and
skiing. After each activity they rate it on a 0--10 scale.
* How can we best use the scores on the activities to predict a person's profession?
* Or, what combination(s) of scores best separate data into profession groups?
## The data
\footnotesize
```{r}
#| message = FALSE
my_url <- "http://ritsokiguess.site/datafiles/profile.txt"
active <- read_delim(my_url, " ")
active
```
\normalsize
## Discriminant analysis
\tiny
```{r bDiscrim-30, message=F}
active.1 <- lda(job ~ reading + dance + tv + ski, data = active)
active.1
```
\normalsize
## Comments
* Two discriminants, first fair bit more important than second.
* `LD1` depends (negatively) most on `dance`, a bit
on `tv`.
* `LD2` depends mostly (negatively) on `tv`.
## Misclassification
```{r bDiscrim-31 }
p <- predict(active.1)
active.2 <- cbind(active, p)
with(active.2, table(obs = job, pred = class))
```
Everyone correctly classified.
## Plotting LDs
\small
```{r bDiscrim-32, fig.height=4}
g <- ggplot(active.2, aes(x = x.LD1, y = x.LD2, colour = job, label = job)) +
geom_point() + geom_text_repel() + guides(colour = "none")
g
```
\normalsize
## Biplot
```{r bDiscrim-33, fig.height=4.3}
ggbiplot(active.1, groups = active$job)
```
## Comments on plot
* Groups well separated: bellydancers top left, administrators
top right, politicians lower middle.
* Bellydancers most negative on `LD1`: like dancing most.
* Administrators most positive on `LD1`: like dancing least.
* Politicians most negative on `LD2`: like TV-watching most.
## Plotting individual `persons`
Make `label` be identifier of person. Now need legend:
```{r bDiscrim-34, fig.height=3.5}
active.2 %>% mutate(person = row_number()) %>%
ggplot(aes(x = x.LD1, y = x.LD2, colour = job,
label = person)) +
geom_point() + geom_text_repel()
```
## Posterior probabilities
\scriptsize
```{r bDiscrim-35}
active.2 %>% mutate(across(starts_with("posterior"), \(p) round(p, 3))) %>%
select(job, class, starts_with("posterior"))
```
\normalsize
Not much doubt.
## Cross-validating the jobs-activities data
Recall: no need for `predict`:
```{r bDiscrim-36 }
p <- lda(job ~ reading + dance + tv + ski, data = active, CV = TRUE)
active.3 <- cbind(active, class = p$class, posterior = p$posterior)
with(active.3, table(obs = job, pred = class))
```
This time one of the bellydancers was classified as a politician.
## and look at the posterior probabilities
\scriptsize
```{r bDiscrim-37}
active.3 %>%
mutate(across(starts_with("posterior"), \(p) round(p, 3))) %>%
select(job, class, starts_with("post"))
```
\normalsize
## Comments
* Bellydancer was "definitely" a politician!
* One of the administrators might have been a politician too.
## Why did things get misclassified?
:::: {.columns}
::: {.column width="40%"}
Go back to plot of discriminant scores:
- one bellydancer much closer to the politicians,
- one administrator a bit closer to the politicians.
:::
::: {.column width="60%"}
```{r}
#| echo = FALSE
g
```
:::
::::
## Example 4: remote-sensing data
* View 25 crops from air, measure 4 variables `x1-x4`.
* Go back and record what each crop was.
* Can we use the 4 variables to distinguish crops?
## The data
\footnotesize
```{r bDiscrim-39 }
#| message = FALSE
my_url <- "http://ritsokiguess.site/datafiles/remote-sensing.txt"
crops <- read_table(my_url)
crops %>% print(n = 25)
```
\normalsize
## Discriminant analysis
\tiny
```{r bDiscrim-40 }
crops.1 <- lda(crop ~ x1 + x2 + x3 + x4, data = crops)
crops.1
```
\normalsize
## Assessing
* 3 LDs (four variables, four groups).
* 1st two important.
* `LD1` mostly `x1` (minus)
* `LD2` `x3` (minus)
## Predictions
* Thus:
\footnotesize
```{r bDiscrim-43}
p <- predict(crops.1)
crops.2 <- cbind(crops, p)
with(crops.2, table(obs = crop, pred = class))
```
\normalsize
* Not very good, eg. only half the Soybeans and Sugarbeets classified correctly.
## Plotting the LDs
```{r piacentini,fig.height=3.4}
ggplot(crops.2, aes(x = x.LD1, y = x.LD2, colour = crop)) +
geom_point()
```
Corn (red) mostly left, cotton (green) sort of right, soybeans and sugarbeets (blue and purple) mixed up.
## Biplot
```{r bDiscrim-45, fig.height=6}
ggbiplot(crops.1, groups = crops$crop)
```
## Comments
* Corn low on LD1 (left), hence low on `x1`
* Cotton tends to be high on LD1 (high `x1`)
- one cotton very low on LD2 (high `x3`?)
* Rather mixed up.
## Posterior probs (some)
\scriptsize
```{r bDiscrim-52 }
crops.2 %>% mutate(across(starts_with("posterior"), \(p) round(p, 3))) %>%
filter(crop != class) %>%
select(crop, class, starts_with("posterior"))
```
\normalsize
## Comments
* These were the misclassified ones, but the posterior probability
of being correct was not usually too low.
* The correctly-classified ones are not very clear-cut either.
## MANOVA
Began discriminant analysis as a followup to MANOVA. Do our variables
significantly separate the crops?
```{r bDiscrim-53 }
response <- with(crops, cbind(x1, x2, x3, x4))
crops.manova <- manova(response ~ crop, data = crops)
summary(crops.manova)
```
## Box's M test
We should also run Box's M test to check for equal variance of each variable across crops:
\small
```{r}
summary(BoxM(response, crops$crop))
```
\normalsize
- The P-value for the M test is smaller even than our guideline of 0.001. So we should not take the MANOVA seriously.
- *Apparently* at least one of the crops differs (in means) from the others. So
it is worth doing this analysis.
- We did this the wrong way around, though!
## The right way around
* *First*, do a MANOVA to see whether any of the groups
differ significantly on any of the variables.
- Check that the MANOVA is believable by using Box's M test.
* *If the MANOVA is significant*, do a discriminant
analysis in the hopes of understanding how the groups are different.
* For remote-sensing data (without Clover):
* LD1 a fair bit more important than LD2 (definitely ignore LD3).
* LD1 depends mostly on `x1`, on which Cotton was high
and Corn was low.
* Discriminant analysis in MANOVA plays the same kind of role
that Tukey does in ANOVA.