---
title: "Discriminant Analysis"
editor:
  markdown:
    wrap: 72
---

## Discriminant analysis

- ANOVA and MANOVA: predict a (counted/measured) response from group membership.
- Discriminant analysis: predict group membership based on counted/measured variables.
- Covers the same ground as logistic regression (and its variations), but the emphasis is on classifying observed data into the correct groups.
- Does so by searching for the linear combinations of the original variables that best separate the data into groups (canonical variables).
- Assumption here that the groups are known (for the data we have). If trying to "best separate" data into unknown groups, see *cluster analysis*.

## Packages

```{r bDiscrim-1, message=FALSE, warning=FALSE}
library(MASS, exclude = "select")
library(tidyverse)
library(ggrepel)
library(ggbiplot)
library(MVTests) # for Box M test
library(conflicted)
conflict_prefer("arrange", "dplyr")
conflict_prefer("summarize", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")
conflict_prefer("mutate", "dplyr")
```

- `ggrepel` allows labelling points on a plot so they don't overwrite each other.
- `ggbiplot` uses `plyr` rather than `dplyr`, which has functions with similar names.

## About `select`

- Both `dplyr` (in `tidyverse`) and `MASS` have a function called `select`, and *they do different things*.
- How do you know which `select` is going to get called?
- With `library`, the one loaded *last* is visible, and the others are not.
- Thus we can access the `select` in `dplyr` but not the one in `MASS`. If we wanted that one, we'd have to say `MASS::select`.
- Better: load the `conflicted` package. Any time you call a function whose name appears in more than one loaded package, you get an error and have to choose between them (with `conflict_prefer`).
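## Aside: seeing the `select` clash

A small sketch of the clash (not part of the main analysis), using the built-in `mtcars` data. With `conflicted` loaded, a bare call to the ambiguous `select` is an error until we state a preference:

```{r select-demo, message=FALSE}
library(conflicted)
library(MASS)  # has select()
library(dplyr) # also has select()

# calling select() here without a preference would be an error
# from conflicted, because the name is ambiguous
conflict_prefer("select", "dplyr")

# now select unambiguously means dplyr's version:
res <- mtcars %>% select(mpg, wt) %>% head(2)
res
```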
## Example 1: seed yields and weights

```{r bDiscrim-2, message=FALSE}
my_url <- "http://ritsokiguess.site/datafiles/manova1.txt"
hilo <- read_delim(my_url, " ")
g <- ggplot(hilo, aes(x = yield, y = weight,
                      colour = fertilizer)) + geom_point(size = 4)
```

::: columns
::: {.column width="40%"}
Recall the data from MANOVA: we needed a multivariate analysis to find a difference in seed yield and weight between the high- and low-fertilizer groups.
:::

::: {.column width="60%"}
```{r}
#| echo = FALSE
g
```
:::
:::

## Basic discriminant analysis

```{r bDiscrim-4}
hilo.1 <- lda(fertilizer ~ yield + weight, data = hilo)
```

- Uses `lda` from package `MASS`.
- "Predicting" group membership from measured variables.

## Output

\small

```{r bDiscrim-5}
hilo.1
```

\normalsize

## Things to take from output

- Group means: high-fertilizer plants have (slightly) higher mean yield and weight than low-fertilizer plants.
- "Coefficients of linear discriminants": \texttt{LD1, LD2,}\ldots are scores constructed from the observed variables that best separate the groups.
- For any plant, get its LD1 score by taking $-0.76$ times yield plus $-1.25$ times weight, adding up, and standardizing.
- The LD1 coefficients are like slopes:
  - if yield is higher, the plant's LD1 score is lower
  - if weight is higher, the plant's LD1 score is lower
- High-fertilizer plants have higher yield and weight, thus low (negative) LD1 scores. Low-fertilizer plants have low yield and weight, thus high (positive) LD1 scores.
- One LD1 score for each observation. Plot against the actual groups.

## How many linear discriminants?

- Smaller of these:
  - number of variables
  - number of groups *minus 1*
- Seed yield and weight: 2 variables, 2 groups, $\min(2, 2-1) = 1$.
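## Aside: LD scores by hand

A sketch of where the scores come from, using R's built-in `iris` data rather than the seed data (so the numbers are illustrative only). The number of discriminants is $\min(4, 3-1) = 2$, and the LD scores are each variable centred at the prior-weighted average of the group means, multiplied by the coefficients in `$scaling` (this is my understanding of what `predict` does internally):

```{r ld-by-hand}
library(MASS)
iris.lda <- lda(Species ~ ., data = iris)
ncol(iris.lda$scaling) # min(4 variables, 3 groups - 1) = 2 discriminants

# centre at the prior-weighted average of the group means,
# then multiply by the coefficients of linear discriminants
ctr <- colSums(iris.lda$prior * iris.lda$means)
manual <- scale(as.matrix(iris[, 1:4]), center = ctr, scale = FALSE) %*%
  iris.lda$scaling
max(abs(manual - predict(iris.lda)$x)) # should be essentially zero
```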
## Getting LD scores

Feed the output from LDA into `predict`:

\scriptsize

```{r bDiscrim-6}
p <- predict(hilo.1)
hilo.2 <- cbind(hilo, p)
hilo.2
```

\normalsize

## LD1 scores in order

Most positive LD1 score is most obviously low fertilizer, most negative is most obviously high:

\footnotesize

```{r bDiscrim-7}
hilo.2 %>% select(fertilizer, yield, weight, LD1) %>%
  arrange(desc(LD1))
```

\normalsize

High-fertilizer plants have high yield and weight, and hence negative LD1 scores.

## Plotting LD1 scores

With one LD score, plot it against the (true) groups, e.g. with a boxplot:

```{r bDiscrim-8, fig.height=3.4}
ggplot(hilo.2, aes(x = fertilizer, y = LD1)) + geom_boxplot()
```

## What else is in `hilo.2`?

- `class`: predicted fertilizer level (based on the values of `yield` and `weight`).
- `posterior`: predicted probability of being low or high fertilizer, given `yield` and `weight`.
- `LD1`: scores for (each) linear discriminant (here only LD1) on each observation.

## Predictions and predicted groups

\ldots based on `yield` and `weight`:

\footnotesize

```{r bDiscrim-12}
hilo.2 %>% select(yield, weight, fertilizer, class)
```

\normalsize

## Count up correct and incorrect classifications

```{r}
with(hilo.2, table(obs = fertilizer, pred = class))
```

- Each predicted fertilizer level is exactly the same as the observed one (perfect prediction).
- Table shows no errors: all values on the top-left to bottom-right diagonal.

## Posterior probabilities

\footnotesize

These show how clear-cut the classification decisions were:

```{r bDiscrim-13}
hilo.2 %>%
  mutate(across(starts_with("posterior"), \(p) round(p, 4))) %>%
  select(-LD1)
```

Only obs. 7 has any doubt: `yield` is low for a high-fertilizer plant, but the high `weight` makes up for it.

\normalsize

## Example 2: the peanuts

\scriptsize

```{r bDiscrim-14, message=F}
my_url <- "http://ritsokiguess.site/datafiles/peanuts.txt"
peanuts <- read_delim(my_url, " ")
peanuts
```

\normalsize

- Recall: `location` and `variety` both significant in MANOVA.
- Make a combo of them (over):

## Location-variety combos

\footnotesize

```{r combos}
peanuts %>%
  unite(combo, c(variety, location)) -> peanuts.combo
peanuts.combo
```

\normalsize

## Discriminant analysis

\tiny

```{r bDiscrim-15}
# peanuts.1 <- lda(str_c(location, variety, sep = "_") ~ y + smk + w, data = peanuts)
peanuts.1 <- lda(combo ~ y + smk + w, data = peanuts.combo)
peanuts.1
```

\normalsize

## Comments

- Now 3 LDs (3 variables, 6 groups, $\min(3, 6-1) = 3$).
- Relationship of LDs to the original variables: look for coefficients far from zero:

```{r}
peanuts.1$scaling
```

- high `LD1`: mainly high `y` or low `w`.
- high `LD2`: mainly low `w`.
- "Proportion of trace" values show the relative importance of the LDs: `LD1` much more important than `LD2`; `LD3` worthless.

## The predictions and misclassification

```{r bDiscrim-17}
p <- predict(peanuts.1)
peanuts.2 <- cbind(peanuts.combo, p)
peanuts.2
with(peanuts.2, table(obs = combo, pred = class))
```

Actually classified very well: only one `6_2` classified as a `5_1`, the rest all correct.

## Posterior probabilities

\scriptsize

```{r bDiscrim-18}
peanuts.2 %>%
  mutate(across(starts_with("posterior"), \(p) round(p, 2))) %>%
  select(combo, class, starts_with("posterior"))
```

\small

*Some* doubt about which combo each plant belongs in, but not too much. The one misclassified plant was a close call.

\normalsize

## Discriminant scores, again

- How are the discriminant scores related to the original variables?
- Look at the coefficients again:

\footnotesize

```{r bDiscrim-19}
peanuts.1$scaling
```

\normalsize

- LD1 positive if `y` large and/or `w` small.
- LD2 positive if `w` small.

## Discriminant scores for data

\footnotesize

```{r bDiscrim-20}
peanuts.2 %>% select(y, w, starts_with("x"))
```

- Obs. 5 and 6 have the most positive `LD1`: large `y`, small `w`.
- Obs. 4 has the most positive `LD2`: small `w`.

\normalsize

## Plot LD1 vs. LD2, labelling by combo

\small

```{r bDiscrim-24, fig.height=4}
g <- ggplot(peanuts.2, aes(x = x.LD1, y = x.LD2, colour = combo,
                           label = combo)) +
  geom_point() + geom_text_repel() + guides(colour = "none")
g
```

\normalsize

## "Bi-plot" from `ggbiplot`

```{r bDiscrim-25, fig.height=3.3}
ggbiplot(peanuts.1, groups = factor(peanuts.combo$combo))
```

## Installing `ggbiplot`

- `ggbiplot` is not on CRAN, so the usual `install.packages` will not work.
- Install package `devtools` first (once):

```{r bDiscrim-26, eval=F}
install.packages("devtools")
```

- Then install `ggbiplot` (once):

```{r bDiscrim-27, eval=F}
library(devtools)
install_github("vqv/ggbiplot")
```

## Cross-validation

- So far, we have predicted group membership from the same data used to form the groups --- dishonest!
- Better: *cross-validation*: form the groups from all observations *except one*, then predict group membership for that left-out observation.
- No longer cheating!
- Illustrate with the peanuts data again.

## Misclassifications

- Fitting and prediction all in one go:

\small

```{r bDiscrim-28}
p <- lda(combo ~ y + smk + w, data = peanuts.combo, CV = TRUE)
peanuts.3 <- cbind(peanuts.combo, class = p$class,
                   posterior = p$posterior)
with(peanuts.3, table(obs = combo, pred = class))
```

\normalsize

- Some more misclassification this time.

## Repeat of LD plot

```{r graziani, fig.height=4.7}
g
```

## Posterior probabilities

\footnotesize

```{r bDiscrim-29}
peanuts.3 %>%
  mutate(across(starts_with("posterior"), \(p) round(p, 3))) %>%
  select(combo, class, starts_with("posterior"))
```

\normalsize

## Why more misclassification?

- When predicting group membership for one observation, cross-validation uses only the *other* observation in that group.
- So if the two in a pair are far apart, or if two groups overlap, there is great potential for misclassification.
- Groups `5_1` and `6_2` overlap.
- The `5_2` closest to the `8_1`s looks more like an `8_1` than a `5_2` (the other `5_2` is far away).
- The `8_1`s are relatively far apart and close to other things, so one appears to be a `5_2` and the other an `8_2`.

## Example 3: professions and leisure activities

- 15 individuals from three different professions (politicians, administrators and belly dancers) each participate in four different leisure activities: reading, dancing, TV watching and skiing. After each activity they rate it on a 0--10 scale.
- How can we best use the scores on the activities to predict a person's profession?
- Or, what combination(s) of scores best separate the data into profession groups?

## The data

\footnotesize

```{r}
#| message = FALSE
my_url <- "http://ritsokiguess.site/datafiles/profile.txt"
active <- read_delim(my_url, " ")
active
```

\normalsize

## Discriminant analysis

\tiny

```{r bDiscrim-30, message=F}
active.1 <- lda(job ~ reading + dance + tv + ski, data = active)
active.1
```

\normalsize

## Comments

- Two discriminants, the first a fair bit more important than the second.
- `LD1` depends (negatively) most on `dance`, a bit on `tv`.
- `LD2` depends mostly (negatively) on `tv`.

## Misclassification

```{r bDiscrim-31}
p <- predict(active.1)
active.2 <- cbind(active, p)
with(active.2, table(obs = job, pred = class))
```

Everyone correctly classified.

## Plotting LDs

\small

```{r bDiscrim-32, fig.height=4}
g <- ggplot(active.2, aes(x = x.LD1, y = x.LD2, colour = job,
                          label = job)) +
  geom_point() + geom_text_repel() + guides(colour = "none")
g
```

\normalsize

## Biplot

```{r bDiscrim-33, fig.height=4.3}
ggbiplot(active.1, groups = active$job)
```

## Comments on plot

- Groups well separated: belly dancers top left, administrators top right, politicians lower middle.
- Belly dancers most negative on `LD1`: like dancing most.
- Administrators most positive on `LD1`: like dancing least.
- Politicians most negative on `LD2`: like TV-watching most.

## Plotting individual persons

Make `label` the identifier of each person.
Now we need a legend:

```{r bDiscrim-34, fig.height=3.5}
active.2 %>%
  mutate(person = row_number()) %>%
  ggplot(aes(x = x.LD1, y = x.LD2, colour = job,
             label = person)) +
  geom_point() + geom_text_repel()
```

## Posterior probabilities

\scriptsize

```{r bDiscrim-35}
active.2 %>%
  mutate(across(starts_with("posterior"), \(p) round(p, 3))) %>%
  select(job, class, starts_with("posterior"))
```

\normalsize

Not much doubt.

## Cross-validating the jobs-activities data

Recall: no need for `predict`:

```{r bDiscrim-36}
p <- lda(job ~ reading + dance + tv + ski, data = active, CV = TRUE)
active.3 <- cbind(active, class = p$class, posterior = p$posterior)
with(active.3, table(obs = job, pred = class))
```

This time one of the belly dancers was classified as a politician.

## And look at the posterior probabilities

\scriptsize

```{r bDiscrim-37}
active.3 %>%
  mutate(across(starts_with("posterior"), \(p) round(p, 3))) %>%
  select(job, class, starts_with("post"))
```

\normalsize

## Comments

- The misclassified belly dancer was "definitely" a politician!
- One of the administrators might have been a politician too.

## Why did things get misclassified?

::: columns
::: {.column width="40%"}
Go back to the plot of discriminant scores:

- one belly dancer is much closer to the politicians,
- one administrator is a bit closer to the politicians.
:::

::: {.column width="60%"}
```{r}
#| echo = FALSE
g
```
:::
:::

## Example 4: remote-sensing data

- View 25 crops from the air, measure 4 variables `x1-x4`.
- Go back and record what each crop was.
- Can we use the 4 variables to distinguish crops?

## The data

\footnotesize

```{r bDiscrim-39}
#| message = FALSE
my_url <- "http://ritsokiguess.site/datafiles/remote-sensing.txt"
crops <- read_table(my_url)
crops %>% print(n = 25)
```

\normalsize

## Discriminant analysis

\tiny

```{r bDiscrim-40}
crops.1 <- lda(crop ~ x1 + x2 + x3 + x4, data = crops)
crops.1
```

\normalsize

## Assessing

- 3 LDs (four variables, four groups, $\min(4, 4-1) = 3$).
- First two important.
- `LD1` mostly `x1` (minus).
- `LD2` mostly `x3` (minus).

## Predictions

- Thus:

\footnotesize

```{r bDiscrim-43}
p <- predict(crops.1)
crops.2 <- cbind(crops, p)
with(crops.2, table(obs = crop, pred = class))
```

\normalsize

- Not very good, e.g. only half the Soybeans and Sugarbeets classified correctly.

## Plotting the LDs

```{r piacentini, fig.height=3.4}
ggplot(crops.2, aes(x = x.LD1, y = x.LD2, colour = crop)) +
  geom_point()
```

Corn (red) mostly left, Cotton (green) sort of right, Soybeans and Sugarbeets (blue and purple) mixed up.

## Biplot

```{r bDiscrim-45, fig.height=6}
ggbiplot(crops.1, groups = crops$crop)
```

## Comments

- Corn low on LD1 (left), hence low on `x1`.
- Cotton tends to be high on LD1 (high `x1`).
- One Cotton very low on LD2 (high `x3`?).
- Rather mixed up.

## Posterior probs (some)

\scriptsize

```{r bDiscrim-52}
crops.2 %>%
  mutate(across(starts_with("posterior"), \(p) round(p, 3))) %>%
  filter(crop != class) %>%
  select(crop, class, starts_with("posterior"))
```

\normalsize

## Comments

- These were the misclassified ones, but the posterior probability of being correct was not usually too low.
- The correctly-classified ones are not very clear-cut either.

## MANOVA

We began discriminant analysis as a follow-up to MANOVA. Do our variables significantly separate the crops?

```{r bDiscrim-53}
response <- with(crops, cbind(x1, x2, x3, x4))
crops.manova <- manova(response ~ crop, data = crops)
summary(crops.manova)
```

## Box's M test

We should also run Box's M test to check the MANOVA assumption of equal covariance matrices across the crops:

\small

```{r}
summary(BoxM(response, crops$crop))
```

\normalsize

- The P-value for the M test is smaller even than our guideline of 0.001, so we should not take the MANOVA seriously.
- *Apparently* at least one of the crops differs (in means) from the others, so it is worth doing this analysis.
- We did this the wrong way around, though!
## The right way around

- *First*, do a MANOVA to see whether any of the groups differ significantly on any of the variables.
- Check that the MANOVA is believable by using Box's M test.
- *If the MANOVA is significant*, do a discriminant analysis in the hopes of understanding how the groups are different.
- For the remote-sensing data (without Clover):
  - LD1 a fair bit more important than LD2 (definitely ignore LD3).
  - LD1 depends mostly on `x1`, on which Cotton was high and Corn was low.
- Discriminant analysis in MANOVA plays the same kind of role that Tukey does in ANOVA.
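## Aside: the right way around, in code

A sketch of this workflow, run on R's built-in `iris` data rather than the crops (so the numbers here are illustrative only):

```{r right-way-sketch}
library(MASS)

# 1. MANOVA first: do the groups differ at all on any variable?
response <- as.matrix(iris[, 1:4])
iris.manova <- manova(response ~ Species, data = iris)
pval <- summary(iris.manova)$stats["Species", "Pr(>F)"]
pval # tiny: the species differ somewhere

# 2. in practice, also check believability with Box's M test here,
#    e.g. summary(BoxM(response, iris$Species)) from MVTests

# 3. only because the MANOVA is significant, follow up with lda
#    to understand *how* the groups differ:
iris.lda <- lda(Species ~ ., data = iris)
iris.lda$svd^2 / sum(iris.lda$svd^2) # proportion of trace: LD1 dominates
```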