---
title: "Factor analysis"
---

##  Vs. principal components 


* Principal components: 


  * Purely mathematical.

  * Find eigenvalues, eigenvectors of correlation matrix.

  * No testing whether observed components reproducible, or even probability model behind it.


* Factor analysis: 


  * some way towards fixing this (get test of appropriateness)

  * In factor analysis, each variable modelled as: "common factor" (eg. verbal ability) and "specific factor" (left over).

  * Choose the common factors to "best" reproduce pattern seen in correlation matrix.

  * Iterative procedure, different answer from principal components.


##  Packages

```{r bFactor-1, warning=F, message=F}
library(ggbiplot)
library(tidyverse)
library(conflicted)
conflict_prefer("mutate", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")
conflict_prefer("arrange", "dplyr")
```

 
##  Example


* 145 children given 5 tests, called PARA, SENT, WORD, ADD and DOTS. 3 linguistic tasks (paragraph comprehension, sentence completion  and word meaning), 2 mathematical ones (addition and counting dots).

* Correlation matrix of scores on the tests:


```

para 1     0.722 0.714 0.203 0.095
sent 0.722 1     0.685 0.246 0.181
word 0.714 0.685 1     0.170 0.113
add  0.203 0.246 0.170 1     0.585
dots 0.095 0.181 0.113 0.585 1

```


* Is there small number of underlying "constructs" (unobservable) that explains this pattern of correlations?


##  To start: principal components

Using correlation matrix. Read that first:

```{r kids-scree,message=F}
my_url <- "http://ritsokiguess.site/datafiles/rex2.txt"
kids <- read_delim(my_url, " ")
kids
```


## Principal components on correlation matrix


Turn into R `matrix`, using column `test` as column names:

```{r}
kids %>% 
column_to_rownames("test") %>% 
as.matrix() -> m
```

Principal components:

```{r bFactor-2}
kids.0 <- princomp(covmat = m) 
```

I used `kids.0` here since I want `kids.1` and `kids.2` later.


##  Scree plot
```{r bFactor-3, fig.height=3.5}
# ggscreeplot(kids.0)
```

 
##  Principal component results


* Need 2 components. Loadings:

\footnotesize
```{r bFactor-4}
kids.0$loadings
```
\normalsize

## Comments

* First component has a bit of everything, though especially the
first three tests.

* Second component rather more clearly `add` and `dots`.

* No scores, plots since no actual data.

- See how factor analysis compares on these data.


##  Factor analysis


* Specify number of factors first, get solution with exactly
that many factors.

* Includes hypothesis test, need to specify how many children
wrote the tests.

* Works from correlation matrix via `covmat` or actual
data, like `princomp`.

* Introduces extra feature, *rotation*, to make
interpretation of loadings (factor-variable relation) easier.


##  Factor analysis for the kids data


* Create "covariance list" to include number of children who
wrote the tests.

* Feed this into `factanal`, specifying how many factors (2).

- Start with the matrix we made before.

```{r bFactor-5 }
m
ml <- list(cov = m, n.obs = 145)
kids.2 <- factanal(factors = 2, covmat = ml)
```

 
##  Uniquenesses

```{r bFactor-6 }
kids.2$uniquenesses
```


* Uniquenesses say how "unique" a variable is (size of
specific factor). Small
uniqueness means that the variable is summarized by a factor (good).

* Very large uniquenesses are bad; `add`'s uniqueness is largest but not large enough to be worried about.

* Also see "communality" for this idea, where *large* is good and *small* is bad.


##  Loadings

\footnotesize
```{r bFactor-7}
kids.2$loadings
```
\normalsize

* Loadings show how each factor depends on variables. Blanks
indicate "small", less than 0.1.

## Comments

* Factor 1 clearly the "linguistic" tasks, factor 2 clearly
the "mathematical" ones.

* Two factors together explain 68\% of variability (like
regression R-squared).
  
- Which variables belong to which factor is *much* clearer than with principal components.

##  Are 2 factors enough? 
```{r bFactor-8 }
kids.2$STATISTIC
kids.2$dof
kids.2$PVAL
```

 
P-value not small, so 2 factors OK.


##  1 factor

```{r bFactor-9 }
kids.1 <- factanal(factors = 1, covmat = ml)
kids.1$STATISTIC
kids.1$dof
kids.1$PVAL
```

 
1 factor rejected (P-value small). Definitely need more than 1.

## Places rated, again

- Read data, transform, rerun principal components, get biplot: 

```{r bFactor-10, message=FALSE, fig.height=6}
my_url <- "http://ritsokiguess.site/datafiles/places.txt"
places0 <- read_table(my_url)
places0 %>% 
mutate(across(-id, \(x) log(x))) -> places
places %>% select(-id) -> places_numeric
places.1 <- princomp(places_numeric, cor = TRUE)
g <- ggbiplot(places.1, labels = places$id,
       labels.size = 0.8)
```

- This is all exactly as for principal components (nothing new here).

## The biplot

```{r bFactor-11, fig.height=3}
g
```

## Comments

- Most of the criteria are part of components 1 *and* 2. 
- If we can rotate the arrows counterclockwise:
  - economy and crime would point straight up
    - part of component 2 only
  - health and education would point to the right
    - part of component 1 only
- would be easier to see which variables belong to which component.
- Factor analysis includes a rotation to help with interpretation.

## Factor analysis

- Have to pick a number of factors *first*. 
- Do this by running principal components and looking at scree plot.
- In this case, 3 factors seemed good (revisit later):

```{r bFactor-12}
places.3 <- factanal(places_numeric, 3, scores = "r")
```

- There are different ways to get factor scores. These called "regression" scores.

## A bad biplot 

```{r bFactor-13, fig.height=4}
biplot(places.3$scores, places.3$loadings,
xlabs = places$id)
```


## Comments

- I have to find a way to make a better biplot!
- Some of the variables now point straight up and some straight across (if you look carefully for the red arrows among the black points).
- This should make the factors more interpretable than the components were.

## Factor loadings

\footnotesize
```{r bFactor-14}
places.3$loadings
```
\normalsize

## Comments on loadings

- These are at least somewhat clearer than for the principal components:
- Factor 1: health, education, arts: "well-being"
- Factor 2: housing, transportation, arts (again), recreation: "places to be"
- Factor 3: climate (only): "climate"
- In this analysis, economic factors don't seem to be important.

## Factor scores

- Make a dataframe with the city IDs and factor scores:

```{r bFactor-15}
cbind(id = places$id, places.3$scores) %>% 
as_tibble() -> places_scores
```

- Make percentile ranks again (for checking):

```{r bFactor-16}
places %>% 
mutate(across(-id, \(x) percent_rank(x))) -> places_pr
```

## Highest scores on factor 1, "well-being":

- for the top 4 places:

```{r bFactor-17}
places_scores %>% 
slice_max(Factor1, n = 4)
```

## Check percentile ranks for factor 1

```{r bFactor-18}
places_pr %>% 
select(id, health, educate, arts) %>% 
filter(id %in% c(213, 65, 234, 314))
```

- These are definitely high on the well-being variables.
- City #213 is not so high on education, but is highest of all on the others.

## Highest scores on factor 2, "places to be":

```{r bFactor-19}
places_scores %>% 
slice_max(Factor2, n = 4)
```

## Check percentile ranks for factor 2

```{r bFactor-20}
places_pr %>% 
select(id, housing, trans, arts, recreate) %>% 
filter(id %in% c(318, 12, 168, 44))
```

- These are definitely high on housing and recreation.
- Some are (very) high on transportation, but not so much on arts.
- Could look at more cities to see if #168 being low on arts is a fluke.

## Highest scores on factor 3, "climate":

```{r bFactor-21}
places_scores %>% 
slice_max(Factor3, n = 4)
```

## Check percentile ranks for factor 3

```{r bFactor-22}
places_pr %>% 
select(id, climate) %>% 
filter(id %in% c(227, 218, 269, 270))
```

This is very clear.

## Uniquenesses

- We said earlier that the economy was not part of any of our factors:

```{r bFactor-23}
places.3$uniquenesses
```

- The higher the uniqueness, the less the variable concerned is part of any of our factors (and that  maybe another factor is needed to accommodate it).
- This includes economy and maybe crime.


## Test of significance

We can test whether the three factors that we have is enough, or whether we need more to describe our data:

```{r bFactor-24}
places.3$PVAL
```

- 3 factors are not enough. 
- What would 5 factors look like?

## Five factors

\footnotesize
```{r bFactor-25}
places.5 <- factanal(places_numeric, 5, scores = "r")
places.5$loadings
```
\normalsize

## Comments 1/2

- On (new) 5 factors:
- Factor 1 is health, education, arts: same as factor 1 before.
- Factor 2 is housing, transportation, arts, recreation: as factor 2 before.
- Factor 3 is economy.
- Factor 4 is crime.
- Factor 5 is climate and housing: like factor 3 before.

## Comments 2/2

- The two added factors include the two "missing" variables.
- Is this now enough?

```{r bFactor-26}
places.5$PVAL
```

- No. My guess is that the authors of Places Rated chose their 9 criteria to capture different aspects of what makes a city good or bad to live in, and so it was too much to hope that a small number of factors would come out of these.


##  A bigger example: BEM sex role inventory


* 369 women asked to rate themselves on 60 traits, like "self-reliant" or "shy".

* Rating 1 "never or almost never true of me" to 7 ``always or
almost always true of me''.

* 60 personality traits is a lot. Can we find a smaller number
of factors that capture aspects of personality?

* The whole BEM sex role inventory on next page.


##  The whole inventory


![](bem.png){width=450px}


##  Some of the data


\scriptsize
```{r bem-scree, message=F}
my_url <- "http://ritsokiguess.site/datafiles/factor.txt"
bem <- read_tsv(my_url)
bem
```
\normalsize
 

##  Principal components first
\ldots to decide on number of factors:
```{r bFactor-27 }
bem.pc <- bem %>%
select(-subno) %>%
princomp(cor = T)
```

 
##  The scree plot
```{r genoa,fig.height=3.7}
(g <- ggscreeplot(bem.pc))
```


* No obvious elbow.


##  Zoom in to search for elbow

Possible elbows at 3 (2 factors) and 6 (5):

```{r bem-scree-two,fig.height=3.3,warning=F}
g + scale_x_continuous(limits = c(0, 8))
```


##  but is 2 really good?

```{r bFactor-28, include=FALSE}
options(width = 80)
```

\scriptsize
```{r bFactor-29 }
summary(bem.pc)
```
\normalsize

```{r bFactor-30, include=FALSE}
options(width = 60)
```

##  Comments


* Want overall fraction of variance explained (``cumulative
proportion'') to be reasonably high.

* 2 factors, 28.5\%. Terrible!

* Even 56\% (10 factors) not that good!

* Have to live with that.


##  Biplot

```{r bem-biplot,fig.height=3.5}
ggbiplot(bem.pc, alpha = 0.3)
```


##  Comments


* Ignore individuals for now.

* Most variables point to 1 o'clock or 4 o'clock.

* Suggests factor analysis with rotation will get interpretable
factors (rotate to 12 o'clock and 3 o'clock, for example).

* Try for 2-factor solution (rough interpretation, will be bad):

```{r bFactor-31 }
bem %>%
select(-subno) %>%
factanal(factors = 2) -> bem.2
```


* Show output in pieces (just print `bem.2` to see all of it).


##  Uniquenesses, sorted

\scriptsize
```{r bFactor-32, echo=-1}
options(width = 60)
sort(bem.2$uniquenesses)
```
\normalsize
 
## Comments

* Mostly high or very high (bad).

* Some smaller, eg.: Leadership ability (0.409),
Acts like leader (0.417),
Warm (0.476),
Tender (0.493).

* Smaller uniquenesses captured by one of our two factors.

- Larger uniquenesses are not: need more factors to capture them.


##  Factor loadings, some

\scriptsize
```{r bFactor-33}
bem.2$loadings
```
\normalsize


##  Making a data frame
There are too many to read easily, so make a data frame. A
bit tricky:

\footnotesize
```{r bFactor-34}
bem.2$loadings %>% 
unclass() %>% 
as_tibble() %>% 
mutate(trait = rownames(bem.2$loadings)) -> loadings
loadings %>% slice(1:8)
```
\normalsize
 

##  Pick out the big ones on factor 1

Arbitrarily defining $>0.4$ or $<-0.4$ as "big":

\scriptsize
```{r bFactor-35}
loadings %>% filter(abs(Factor1) > 0.4) 
```
\normalsize


##  Factor 2, the big ones

\footnotesize
```{r bFactor-36}
loadings %>% filter(abs(Factor2) > 0.4)
```
\normalsize
 

##  Plotting the two factors
- A bi-plot, this time with the variables reduced in size. Looking for
unusual individuals.

- Have to run `factanal` again to get factor scores for plotting.

```{r biplot-two-again, eval=F}
bem %>% select(-subno) %>% 
factanal(factors = 2, scores = "r") -> bem.2a
biplot(bem.2a$scores, bem.2a$loadings, cex = c(0.5, 0.5))
```


- Numbers on plot are row numbers of `bem`
data frame.


##  The (awful) biplot

```{r biplot-two-ag,fig.height=4,echo=F}
bem.2a <- factanal(bem[, -1], factors = 2, scores = "r")
biplot(bem.2a$scores, bem.2a$loadings, cex = c(0.5, 0.5))
```


##  Comments


* Variables mostly up ("feminine") and right ("masculine"),
accomplished by rotation.

* Some unusual individuals: 311, 214 (low on factor 2), 366
(high on factor 2),
359, 258
(low on factor 1), 230 (high on factor 1).


##  Individual 366

\tiny
```{r bFactor-37}
bem %>% slice(366) %>% glimpse()
```
\normalsize

## Comments


* Individual 366 high on factor 2, but hard to see which traits should have high scores
(unless we remember).

- Idea 1: use percentile ranks as before.

* Idea 2: Rating scale is easy to interpret. So
*tidy* original data frame to make easier to look
things up.


##  Tidying original data

\scriptsize
```{r bFactor-38}
bem %>%
ungroup() %>% 
mutate(row = row_number()) %>%
pivot_longer(c(-subno, -row), names_to="trait", 
             values_to="score") -> bem_tidy
bem_tidy
```
\normalsize
 

##  Recall data frame of loadings

\footnotesize
```{r bFactor-39}
loadings %>% slice(1:10)
```
\normalsize
 

Want to add the factor scores for each trait to our tidy data frame
`bem_tidy`. This is a left-join (over), matching on the column
`trait` that is in both data frames (thus, the default):


##  Looking up loadings

\scriptsize
```{r bFactor-40}
bem_tidy %>% left_join(loadings) -> bem_tidy
bem_tidy %>% sample_n(12)
```
\normalsize
 

##  Individual 366, high on Factor 2
So now pick out the rows of the tidy data frame that belong to
individual 366 (`row=366`) and for which the `Factor2`
score exceeds 0.4 in absolute value (our "big" from before):

\scriptsize
```{r bFactor-41}
bem_tidy %>% filter(row == 366, abs(Factor2) > 0.4)
```
\normalsize

As expected, high scorer on these.


##  Several individuals
Rows 311 and 214 were *low* on Factor 2, so their scores should
be low. Can we do them all at once?


\scriptsize
```{r bFactor-42}
bem_tidy %>% filter(
row %in% c(366, 311, 214),
abs(Factor2) > 0.4
)
```
\normalsize
 

Can we display each individual in own column?


##  Individual by column
Un-`tidy`, that is, `pivot_wider`:

\tiny
```{r bFactor-43}
bem_tidy %>%
filter(
  row %in% c(366, 311, 214),
  abs(Factor2) > 0.4
) %>%
select(-subno, -Factor1, -Factor2) %>%
pivot_wider(names_from=row, values_from=score)
```
\normalsize

366 high, 311 middling, 214 (sometimes) low.


##  Individuals 230, 258, 359
These were high, low, low on factor 1. Adapt code:

\tiny
```{r bFactor-44}
bem_tidy %>%
filter(row %in% c(359, 258, 230), abs(Factor1) > 0.4) %>%
select(-subno, -Factor1, -Factor2) %>%
pivot_wider(names_from=row, values_from=score)
```
\normalsize


##  Is 2 factors enough?
Suspect not:
```{r bFactor-45 }
bem.2$PVAL
```

 
2 factors resoundingly rejected. Need more. Have to go all the way to
15 factors to not reject:

```{r bFactor-46 }
bem %>%
select(-subno) %>%
factanal(factors = 15) -> bem.15
bem.15$PVAL
```


Even then, only just over 50\% of variability explained.

## What's important in 15 factors?

- Let's take a look at the important things in those 15 factors.

- Get 15-factor loadings into a data frame, as before:  

\small
```{r bFactor-47}
bem.15$loadings %>% 
unclass() %>% 
as_tibble() %>% 
mutate(trait = rownames(bem.15$loadings)) -> loadings
```
\normalsize
 

- then show the highest few loadings on each factor.


##  Factor 1 (of 15)

\footnotesize
```{r bFactor-48}
loadings %>%
arrange(desc(abs(Factor1))) %>%
select(Factor1, trait) %>%
slice(1:10)
```
\normalsize
 
Compassionate, understanding, sympathetic, soothing: thoughtful of
others. 


##  Factor 2

\footnotesize
```{r bFactor-49}
loadings %>%
arrange(desc(abs(Factor2))) %>%
select(Factor2, trait) %>%
slice(1:10)
```
\normalsize
 

Strong personality, forceful, assertive, dominant: getting ahead. 


##  Factor 3

\footnotesize
```{r bFactor-50}
loadings %>%
arrange(desc(abs(Factor3))) %>%
select(Factor3, trait) %>%
slice(1:10)
```
\normalsize
 

Self-reliant, self-sufficient, independent: going it alone.


##  Factor 4

\footnotesize
```{r bFactor-51}
loadings %>%
arrange(desc(abs(Factor4))) %>%
select(Factor4, trait) %>%
slice(1:10)
```
\normalsize
 

Gentle, tender, warm (affectionate): caring for others.


##  Factor 5

\scriptsize
```{r bFactor-52}
loadings %>%
arrange(desc(abs(Factor5))) %>%
select(Factor5, trait) %>%
slice(1:10)
```
\normalsize
 

Ambitious, competitive (with a bit of risk-taking and individualism):
Being the best.


##  Factor 6

\scriptsize
```{r bFactor-53}
loadings %>%
arrange(desc(abs(Factor6))) %>%
select(Factor6, trait) %>%
slice(1:10)
```
\normalsize
 

Acts like a leader, leadership ability (with a bit of Dominant):
Taking charge.


##  Factor 7

\footnotesize
```{r bFactor-54}
loadings %>%
arrange(desc(abs(Factor7))) %>%
select(Factor7, trait) %>%
slice(1:10)
```
\normalsize
 
Happy and cheerful.


##  Factor 8

\footnotesize
```{r bFactor-55}
loadings %>%
arrange(desc(abs(Factor8))) %>%
select(Factor8, trait) %>%
slice(1:10)
```
\normalsize
 
Affectionate, flattering: Making others feel good.


##  Factor 9

\footnotesize
```{r bFactor-56}
loadings %>%
arrange(desc(abs(Factor9))) %>%
select(Factor9, trait) %>%
slice(1:10)
```
\normalsize
 

Taking a stand.


##  Factor 10

\footnotesize
```{r bFactor-57}
loadings %>%
arrange(desc(abs(Factor10))) %>%
select(Factor10, trait) %>%
slice(1:10)
```
\normalsize

 
Feminine. (A little bit of not-masculine!)


##  Factor 11

\footnotesize
```{r bFactor-58}
loadings %>%
arrange(desc(abs(Factor11))) %>%
select(Factor11, trait) %>%
slice(1:10)
```
\normalsize
 

Loyal.


##  Factor 12

\footnotesize
```{r bFactor-59}
loadings %>%
arrange(desc(abs(Factor12))) %>%
select(Factor12, trait) %>%
slice(1:10)
```
\normalsize
 

Childlike. (With a bit of moody, shy, not-self-sufficient, not-conscientious.)


##  Factor 13

\footnotesize
```{r bFactor-60}
loadings %>%
arrange(desc(abs(Factor13))) %>%
select(Factor13, trait) %>%
slice(1:10)
```
\normalsize
 

Truthful. (With a bit of happy and not-gullible.)


##  Factor 14
\footnotesize

```{r bFactor-61}
loadings %>%
arrange(desc(abs(Factor14))) %>%
select(Factor14, trait) %>%
slice(1:10)
```
\normalsize
 

Decisive. (With a bit of self-sufficient and not-soft-spoken.)


##  Factor 15
\footnotesize
```{r bFactor-62}
loadings %>%
arrange(desc(abs(Factor15))) %>%
select(Factor15, trait) %>%
slice(1:10)
```
\normalsize
 

Not-compassionate, athletic, sensitive: A mixed bag. ("Cares about self"?)


##  Anything left out? Uniquenesses

\scriptsize
```{r bFactor-63}
enframe(bem.15$uniquenesses, name="quality", value="uniq") %>%
  slice_max(uniq, n = 10)
```
\normalsize
 

Uses foul language especially, also loves children and analytical. So
could use even more factors.