---
title: "Factor analysis"
---

## Vs. principal components

* Principal components:
  * Purely mathematical.
  * Find eigenvalues, eigenvectors of correlation matrix.
  * No testing whether observed components reproducible, or even probability model behind it.
* Factor analysis:
  * Goes some way towards fixing this (get test of appropriateness).
  * In factor analysis, each variable modelled as: "common factor" (e.g. verbal ability) and "specific factor" (left over).
  * Choose the common factors to "best" reproduce pattern seen in correlation matrix.
  * Iterative procedure, different answer from principal components.

## Packages

```{r bFactor-1, warning=F, message=F}
library(ggbiplot)
library(tidyverse)
library(conflicted)
conflict_prefer("mutate", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")
conflict_prefer("arrange", "dplyr")
```

## Example

* 145 children given 5 tests, called PARA, SENT, WORD, ADD and DOTS. 3 linguistic tasks (paragraph comprehension, sentence completion and word meaning), 2 mathematical ones (addition and counting dots).
* Correlation matrix of scores on the tests:

```
para 1     0.722 0.714 0.203 0.095
sent 0.722 1     0.685 0.246 0.181
word 0.714 0.685 1     0.170 0.113
add  0.203 0.246 0.170 1     0.585
dots 0.095 0.181 0.113 0.585 1
```

* Is there a small number of underlying "constructs" (unobservable) that explains this pattern of correlations?

## To start: principal components

Using correlation matrix. Read that first:

```{r kids-scree,message=F}
my_url <- "http://ritsokiguess.site/datafiles/rex2.txt"
kids <- read_delim(my_url, " ")
kids
```

## Principal components on correlation matrix

Turn into R `matrix`, using column `test` as row names:

```{r}
kids %>%
  column_to_rownames("test") %>%
  as.matrix() -> m
```

Principal components:

```{r bFactor-2}
kids.0 <- princomp(covmat = m)
```

I used `kids.0` here since I want `kids.1` and `kids.2` later.
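
## Aside: principal components by hand

The "find eigenvalues, eigenvectors of correlation matrix" step can be checked directly. This is only a sketch: it assumes the correlation matrix typed in below (copied from the table in the Example, to 3 decimals) matches the file read above, and the names `R_kids`, `e` and `pc` are made up for illustration.

```{r bFactor-aside-eigen}
# Correlation matrix typed in from the Example table (assumed to match rex2.txt)
tests <- c("para", "sent", "word", "add", "dots")
R_kids <- matrix(c(
  1,     0.722, 0.714, 0.203, 0.095,
  0.722, 1,     0.685, 0.246, 0.181,
  0.714, 0.685, 1,     0.170, 0.113,
  0.203, 0.246, 0.170, 1,     0.585,
  0.095, 0.181, 0.113, 0.585, 1
), nrow = 5, byrow = TRUE, dimnames = list(tests, tests))

e <- eigen(R_kids)               # eigen-decomposition of the correlation matrix
e$values                         # variance of each component
e$vectors[, 1:2]                 # first two sets of loadings (signs are arbitrary)

pc <- princomp(covmat = R_kids)  # same thing, done by princomp
cbind(by_hand = sqrt(e$values), princomp = pc$sdev)
```

The two columns of standard deviations should agree, and the eigenvalues add up to 5, the number of variables.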
## Scree plot

```{r bFactor-3, fig.height=3.5}
# ggscreeplot(kids.0)
```

## Principal component results

* Need 2 components. Loadings:

\footnotesize
```{r bFactor-4}
kids.0$loadings
```
\normalsize

## Comments

* First component has a bit of everything, though especially the first three tests.
* Second component rather more clearly `add` and `dots`.
* No scores, plots since no actual data.
- See how factor analysis compares on these data.

## Factor analysis

* Specify number of factors first, get solution with exactly that many factors.
* Includes hypothesis test, need to specify how many children wrote the tests.
* Works from correlation matrix via `covmat` or actual data, like `princomp`.
* Introduces extra feature, *rotation*, to make interpretation of loadings (factor-variable relation) easier.

## Factor analysis for the kids data

* Create "covariance list" to include number of children who wrote the tests.
* Feed this into `factanal`, specifying how many factors (2).
- Start with the matrix we made before.

```{r bFactor-5 }
m
ml <- list(cov = m, n.obs = 145)
kids.2 <- factanal(factors = 2, covmat = ml)
```

## Uniquenesses

```{r bFactor-6 }
kids.2$uniquenesses
```

* Uniquenesses say how "unique" a variable is (size of specific factor). Small uniqueness means that the variable is summarized by a factor (good).
* Very large uniquenesses are bad; `add`'s uniqueness is largest but not large enough to be worried about.
* Also see "communality" for this idea, where *large* is good and *small* is bad.

## Loadings

\footnotesize
```{r bFactor-7}
kids.2$loadings
```
\normalsize

* Loadings show how each factor depends on variables. Blanks indicate "small", less than 0.1.

## Comments

* Factor 1 clearly the "linguistic" tasks, factor 2 clearly the "mathematical" ones.
* Two factors together explain 68\% of variability (like regression R-squared).
- Which variables belong to which factor is *much* clearer than with principal components.

## Are 2 factors enough?
```{r bFactor-8 }
kids.2$STATISTIC
kids.2$dof
kids.2$PVAL
```

P-value not small, so 2 factors OK.

## 1 factor

```{r bFactor-9 }
kids.1 <- factanal(factors = 1, covmat = ml)
kids.1$STATISTIC
kids.1$dof
kids.1$PVAL
```

1 factor rejected (P-value small). Definitely need more than 1.

## Places rated, again

- Read data, transform, rerun principal components, get biplot:

```{r bFactor-10, message=FALSE, fig.height=6}
my_url <- "http://ritsokiguess.site/datafiles/places.txt"
places0 <- read_table(my_url)
places0 %>%
  mutate(across(-id, \(x) log(x))) -> places
places %>% select(-id) -> places_numeric
places.1 <- princomp(places_numeric, cor = TRUE)
g <- ggbiplot(places.1, labels = places$id,
              labels.size = 0.8)
```

- This is all exactly as for principal components (nothing new here).

## The biplot

```{r bFactor-11, fig.height=3}
g
```

## Comments

- Most of the criteria are part of components 1 *and* 2.
- If we can rotate the arrows counterclockwise:
  - economy and crime would point straight up
    - part of component 2 only
  - health and education would point to the right
    - part of component 1 only
  - would be easier to see which variables belong to which component.
- Factor analysis includes a rotation to help with interpretation.

## Factor analysis

- Have to pick a number of factors *first*.
- Do this by running principal components and looking at scree plot.
- In this case, 3 factors seemed good (revisit later):

```{r bFactor-12}
places.3 <- factanal(places_numeric, 3, scores = "r")
```

- There are different ways to get factor scores. These are called "regression" scores.

## A bad biplot

```{r bFactor-13, fig.height=4}
biplot(places.3$scores, places.3$loadings,
       xlabs = places$id)
```

## Comments

- I have to find a way to make a better biplot!
- Some of the variables now point straight up and some straight across (if you look carefully for the red arrows among the black points).
- This should make the factors more interpretable than the components were.
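
## Aside: what rotation does

To see what rotation changes and what it leaves alone, here is a small sketch using the kids correlation matrix typed in from the Example table (an assumption: that it matches the file used earlier). It fits the same 2-factor model with `rotation = "none"`, then applies `varimax` (the default rotation in `factanal`) by hand. The names `R_kids`, `fit_raw`, `L_raw`, `rot` and `L_rot` are made up for illustration.

```{r bFactor-aside-rotation}
# Correlation matrix typed in from the Example table (assumed to match rex2.txt)
tests <- c("para", "sent", "word", "add", "dots")
R_kids <- matrix(c(
  1,     0.722, 0.714, 0.203, 0.095,
  0.722, 1,     0.685, 0.246, 0.181,
  0.714, 0.685, 1,     0.170, 0.113,
  0.203, 0.246, 0.170, 1,     0.585,
  0.095, 0.181, 0.113, 0.585, 1
), nrow = 5, byrow = TRUE, dimnames = list(tests, tests))

# Unrotated 2-factor solution
fit_raw <- factanal(factors = 2,
                    covmat = list(cov = R_kids, n.obs = 145),
                    rotation = "none")
L_raw <- unclass(loadings(fit_raw))

# Rotate the loadings ourselves
rot <- varimax(L_raw)
L_rot <- unclass(rot$loadings)

round(cbind(L_raw, L_rot), 3)   # unrotated (left) vs rotated (right) loadings
rowSums(L_raw^2)                # communalities before rotation
rowSums(L_rot^2)                # and after: the same
```

The loadings move, but each variable's communality (row sum of squared loadings) does not: rotation only changes how the explained part is shared out among the factors, not how much is explained.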
## Factor loadings

\footnotesize
```{r bFactor-14}
places.3$loadings
```
\normalsize

## Comments on loadings

- These are at least somewhat clearer than for the principal components:
  - Factor 1: health, education, arts: "well-being"
  - Factor 2: housing, transportation, arts (again), recreation: "places to be"
  - Factor 3: climate (only): "climate"
- In this analysis, economic factors don't seem to be important.

## Factor scores

- Make a dataframe with the city IDs and factor scores:

```{r bFactor-15}
cbind(id = places$id, places.3$scores) %>%
  as_tibble() -> places_scores
```

- Make percentile ranks again (for checking):

```{r bFactor-16}
places %>%
  mutate(across(-id, \(x) percent_rank(x))) -> places_pr
```

## Highest scores on factor 1, "well-being":

- for the top 4 places:

```{r bFactor-17}
places_scores %>%
  slice_max(Factor1, n = 4)
```

## Check percentile ranks for factor 1

```{r bFactor-18}
places_pr %>%
  select(id, health, educate, arts) %>%
  filter(id %in% c(213, 65, 234, 314))
```

- These are definitely high on the well-being variables.
- City #213 is not so high on education, but is highest of all on the others.

## Highest scores on factor 2, "places to be":

```{r bFactor-19}
places_scores %>%
  slice_max(Factor2, n = 4)
```

## Check percentile ranks for factor 2

```{r bFactor-20}
places_pr %>%
  select(id, housing, trans, arts, recreate) %>%
  filter(id %in% c(318, 12, 168, 44))
```

- These are definitely high on housing and recreation.
- Some are (very) high on transportation, but not so much on arts.
- Could look at more cities to see if #168 being low on arts is a fluke.

## Highest scores on factor 3, "climate":

```{r bFactor-21}
places_scores %>%
  slice_max(Factor3, n = 4)
```

## Check percentile ranks for factor 3

```{r bFactor-22}
places_pr %>%
  select(id, climate) %>%
  filter(id %in% c(227, 218, 269, 270))
```

This is very clear.
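
## Aside: how percentile ranks work

The checks above lean on `percent_rank`, so here is a tiny illustration of what it returns. The data frame `toy` and its values are made up purely for this sketch; only `percent_rank` and the `across` pattern come from the code above.

```{r bFactor-aside-pr}
library(dplyr)

# Made-up scores for five hypothetical cities
toy <- tibble(id = 1:5, health = c(10, 40, 20, 50, 30))

# Same pattern as used for places_pr: rank every column except id
toy %>%
  mutate(across(-id, \(x) percent_rank(x))) -> toy_pr
toy_pr
```

The smallest value gets 0, the largest gets 1, and the rest are spread evenly in between according to rank, so a city with a percentile rank near 1 on a variable is among the highest on it.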
## Uniquenesses

- We said earlier that the economy was not part of any of our factors:

```{r bFactor-23}
places.3$uniquenesses
```

- The higher the uniqueness, the less the variable concerned is part of any of our factors (and that maybe another factor is needed to accommodate it).
- This includes economy and maybe crime.

## Test of significance

We can test whether the three factors that we have are enough, or whether we need more to describe our data:

```{r bFactor-24}
places.3$PVAL
```

- 3 factors are not enough.
- What would 5 factors look like?

## Five factors

\footnotesize
```{r bFactor-25}
places.5 <- factanal(places_numeric, 5, scores = "r")
places.5$loadings
```
\normalsize

## Comments 1/2

- On (new) 5 factors:
  - Factor 1 is health, education, arts: same as factor 1 before.
  - Factor 2 is housing, transportation, arts, recreation: as factor 2 before.
  - Factor 3 is economy.
  - Factor 4 is crime.
  - Factor 5 is climate and housing: like factor 3 before.

## Comments 2/2

- The two added factors include the two "missing" variables.
- Is this now enough?

```{r bFactor-26}
places.5$PVAL
```

- No. My guess is that the authors of Places Rated chose their 9 criteria to capture different aspects of what makes a city good or bad to live in, and so it was too much to hope that a small number of factors would come out of these.

## A bigger example: BEM sex role inventory

* 369 women asked to rate themselves on 60 traits, like "self-reliant" or "shy".
* Rating 1 "never or almost never true of me" to 7 "always or almost always true of me".
* 60 personality traits is a lot. Can we find a smaller number of factors that capture aspects of personality?
* The whole BEM sex role inventory on next page.
## The whole inventory

![](bem.png){width=450px}

## Some of the data

\scriptsize
```{r bem-scree, message=F}
my_url <- "http://ritsokiguess.site/datafiles/factor.txt"
bem <- read_tsv(my_url)
bem
```
\normalsize

## Principal components first

\ldots to decide on number of factors:

```{r bFactor-27 }
bem.pc <- bem %>%
  select(-subno) %>%
  princomp(cor = TRUE)
```

## The scree plot

```{r genoa,fig.height=3.7}
(g <- ggscreeplot(bem.pc))
```

* No obvious elbow.

## Zoom in to search for elbow

Possible elbows at 3 (2 factors) and 6 (5):

```{r bem-scree-two,fig.height=3.3,warning=F}
g + scale_x_continuous(limits = c(0, 8))
```

## but is 2 really good?

```{r bFactor-28, include=FALSE}
options(width = 80)
```

\scriptsize
```{r bFactor-29 }
summary(bem.pc)
```
\normalsize

```{r bFactor-30, include=FALSE}
options(width = 60)
```

## Comments

* Want overall fraction of variance explained ("cumulative proportion") to be reasonably high.
* 2 factors, 28.5\%. Terrible!
* Even 56\% (10 factors) not that good!
* Have to live with that.

## Biplot

```{r bem-biplot,fig.height=3.5}
ggbiplot(bem.pc, alpha = 0.3)
```

## Comments

* Ignore individuals for now.
* Most variables point to 1 o'clock or 4 o'clock.
* Suggests factor analysis with rotation will get interpretable factors (rotate to 12 o'clock and 3 o'clock, for example).
* Try for 2-factor solution (rough interpretation, will be bad):

```{r bFactor-31 }
bem %>%
  select(-subno) %>%
  factanal(factors = 2) -> bem.2
```

* Show output in pieces (just print `bem.2` to see all of it).

## Uniquenesses, sorted

\scriptsize
```{r bFactor-32, echo=-1}
options(width = 60)
sort(bem.2$uniquenesses)
```
\normalsize

## Comments

* Mostly high or very high (bad).
* Some smaller, e.g.: Leadership ability (0.409), Acts like leader (0.417), Warm (0.476), Tender (0.493).
* Smaller uniquenesses captured by one of our two factors.
- Larger uniquenesses are not: need more factors to capture them.
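
## Aside: uniqueness and communality

Uniqueness and communality are two views of the same thing: the communality of a variable is the sum of its squared loadings, and at the fitted solution uniqueness plus communality should come out at (approximately) 1 for standardized variables. A small sketch, run on the kids correlation matrix rather than the BEM data so that it is self-contained; the matrix is typed in from the Example table (assumed to match the file), and the names `R_kids`, `fit` and `check` are made up for illustration.

```{r bFactor-aside-communality}
# Correlation matrix typed in from the Example table (assumed to match rex2.txt)
tests <- c("para", "sent", "word", "add", "dots")
R_kids <- matrix(c(
  1,     0.722, 0.714, 0.203, 0.095,
  0.722, 1,     0.685, 0.246, 0.181,
  0.714, 0.685, 1,     0.170, 0.113,
  0.203, 0.246, 0.170, 1,     0.585,
  0.095, 0.181, 0.113, 0.585, 1
), nrow = 5, byrow = TRUE, dimnames = list(tests, tests))

fit <- factanal(factors = 2, covmat = list(cov = R_kids, n.obs = 145))

# Communality: part of each variable's variance explained by the factors
communality <- rowSums(unclass(loadings(fit))^2)

check <- data.frame(
  uniqueness = fit$uniquenesses,
  communality = communality,
  total = fit$uniquenesses + communality  # expect values close to 1
)
round(check, 3)
```

So a uniqueness of 0.4 says that roughly 60% of that variable's variance is accounted for by the factors, which is why small uniquenesses are the good ones.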
## Factor loadings, some

\scriptsize
```{r bFactor-33}
bem.2$loadings
```
\normalsize

## Making a data frame

There are too many to read easily, so make a data frame. A bit tricky:

\footnotesize
```{r bFactor-34}
bem.2$loadings %>%
  unclass() %>%
  as_tibble() %>%
  mutate(trait = rownames(bem.2$loadings)) -> loadings
loadings %>% slice(1:8)
```
\normalsize

## Pick out the big ones on factor 1

Arbitrarily defining $>0.4$ or $<-0.4$ as "big":

\scriptsize
```{r bFactor-35}
loadings %>% filter(abs(Factor1) > 0.4)
```
\normalsize

## Factor 2, the big ones

\footnotesize
```{r bFactor-36}
loadings %>% filter(abs(Factor2) > 0.4)
```
\normalsize

## Plotting the two factors

- A biplot, this time with the variables reduced in size. Looking for unusual individuals.
- Have to run `factanal` again to get factor scores for plotting.

```{r biplot-two-again, eval=F}
bem %>% select(-subno) %>%
  factanal(factors = 2, scores = "r") -> bem.2a
biplot(bem.2a$scores, bem.2a$loadings, cex = c(0.5, 0.5))
```

- Numbers on plot are row numbers of `bem` data frame.

## The (awful) biplot

```{r biplot-two-ag,fig.height=4,echo=F}
bem.2a <- factanal(bem[, -1], factors = 2, scores = "r")
biplot(bem.2a$scores, bem.2a$loadings, cex = c(0.5, 0.5))
```

## Comments

* Variables mostly up ("feminine") and right ("masculine"), accomplished by rotation.
* Some unusual individuals: 311, 214 (low on factor 2), 366 (high on factor 2), 359, 258 (low on factor 1), 230 (high on factor 1).

## Individual 366

\tiny
```{r bFactor-37}
bem %>% slice(366) %>% glimpse()
```
\normalsize

## Comments

* Individual 366 high on factor 2, but hard to see which traits should have high scores (unless we remember).
- Idea 1: use percentile ranks as before.
* Idea 2: Rating scale is easy to interpret. So *tidy* original data frame to make easier to look things up.
## Tidying original data

\scriptsize
```{r bFactor-38}
bem %>%
  ungroup() %>%
  mutate(row = row_number()) %>%
  pivot_longer(c(-subno, -row), names_to="trait",
               values_to="score") -> bem_tidy
bem_tidy
```
\normalsize

## Recall data frame of loadings

\footnotesize
```{r bFactor-39}
loadings %>% slice(1:10)
```
\normalsize

Want to add the factor loadings for each trait to our tidy data frame `bem_tidy`. This is a left-join (next slide), matching on the column `trait` that is in both data frames (thus, the default):

## Looking up loadings

\scriptsize
```{r bFactor-40}
bem_tidy %>% left_join(loadings) -> bem_tidy
bem_tidy %>% sample_n(12)
```
\normalsize

## Individual 366, high on Factor 2

So now pick out the rows of the tidy data frame that belong to individual 366 (`row=366`) and for which the `Factor2` loading exceeds 0.4 in absolute value (our "big" from before):

\scriptsize
```{r bFactor-41}
bem_tidy %>% filter(row == 366, abs(Factor2) > 0.4)
```
\normalsize

As expected, high scorer on these.

## Several individuals

Rows 311 and 214 were *low* on Factor 2, so their scores should be low. Can we do them all at once?

\scriptsize
```{r bFactor-42}
bem_tidy %>% filter(
  row %in% c(366, 311, 214),
  abs(Factor2) > 0.4
)
```
\normalsize

Can we display each individual in its own column?

## Individual by column

Un-`tidy`, that is, `pivot_wider`:

\tiny
```{r bFactor-43}
bem_tidy %>%
  filter(
    row %in% c(366, 311, 214),
    abs(Factor2) > 0.4
  ) %>%
  select(-subno, -Factor1, -Factor2) %>%
  pivot_wider(names_from=row, values_from=score)
```
\normalsize

366 high, 311 middling, 214 (sometimes) low.

## Individuals 230, 258, 359

These were high, low, low on factor 1. Adapt code:

\tiny
```{r bFactor-44}
bem_tidy %>%
  filter(row %in% c(359, 258, 230), abs(Factor1) > 0.4) %>%
  select(-subno, -Factor1, -Factor2) %>%
  pivot_wider(names_from=row, values_from=score)
```
\normalsize

## Are 2 factors enough?

Suspect not:

```{r bFactor-45 }
bem.2$PVAL
```

2 factors resoundingly rejected. Need more.
Have to go all the way to 15 factors to not reject:

```{r bFactor-46 }
bem %>%
  select(-subno) %>%
  factanal(factors = 15) -> bem.15
bem.15$PVAL
```

Even then, only just over 50\% of variability explained.

## What's important in 15 factors?

- Let's take a look at the important things in those 15 factors.
- Get 15-factor loadings into a data frame, as before:

\small
```{r bFactor-47}
bem.15$loadings %>%
  unclass() %>%
  as_tibble() %>%
  mutate(trait = rownames(bem.15$loadings)) -> loadings
```
\normalsize

- then show the highest few loadings on each factor.

## Factor 1 (of 15)

\footnotesize
```{r bFactor-48}
loadings %>%
  arrange(desc(abs(Factor1))) %>%
  select(Factor1, trait) %>%
  slice(1:10)
```
\normalsize

Compassionate, understanding, sympathetic, soothing: thoughtful of others.

## Factor 2

\footnotesize
```{r bFactor-49}
loadings %>%
  arrange(desc(abs(Factor2))) %>%
  select(Factor2, trait) %>%
  slice(1:10)
```
\normalsize

Strong personality, forceful, assertive, dominant: getting ahead.

## Factor 3

\footnotesize
```{r bFactor-50}
loadings %>%
  arrange(desc(abs(Factor3))) %>%
  select(Factor3, trait) %>%
  slice(1:10)
```
\normalsize

Self-reliant, self-sufficient, independent: going it alone.

## Factor 4

\footnotesize
```{r bFactor-51}
loadings %>%
  arrange(desc(abs(Factor4))) %>%
  select(Factor4, trait) %>%
  slice(1:10)
```
\normalsize

Gentle, tender, warm (affectionate): caring for others.

## Factor 5

\scriptsize
```{r bFactor-52}
loadings %>%
  arrange(desc(abs(Factor5))) %>%
  select(Factor5, trait) %>%
  slice(1:10)
```
\normalsize

Ambitious, competitive (with a bit of risk-taking and individualism): Being the best.

## Factor 6

\scriptsize
```{r bFactor-53}
loadings %>%
  arrange(desc(abs(Factor6))) %>%
  select(Factor6, trait) %>%
  slice(1:10)
```
\normalsize

Acts like a leader, leadership ability (with a bit of Dominant): Taking charge.
## Factor 7

\footnotesize
```{r bFactor-54}
loadings %>%
  arrange(desc(abs(Factor7))) %>%
  select(Factor7, trait) %>%
  slice(1:10)
```
\normalsize

Happy and cheerful.

## Factor 8

\footnotesize
```{r bFactor-55}
loadings %>%
  arrange(desc(abs(Factor8))) %>%
  select(Factor8, trait) %>%
  slice(1:10)
```
\normalsize

Affectionate, flattering: Making others feel good.

## Factor 9

\footnotesize
```{r bFactor-56}
loadings %>%
  arrange(desc(abs(Factor9))) %>%
  select(Factor9, trait) %>%
  slice(1:10)
```
\normalsize

Taking a stand.

## Factor 10

\footnotesize
```{r bFactor-57}
loadings %>%
  arrange(desc(abs(Factor10))) %>%
  select(Factor10, trait) %>%
  slice(1:10)
```
\normalsize

Feminine. (A little bit of not-masculine!)

## Factor 11

\footnotesize
```{r bFactor-58}
loadings %>%
  arrange(desc(abs(Factor11))) %>%
  select(Factor11, trait) %>%
  slice(1:10)
```
\normalsize

Loyal.

## Factor 12

\footnotesize
```{r bFactor-59}
loadings %>%
  arrange(desc(abs(Factor12))) %>%
  select(Factor12, trait) %>%
  slice(1:10)
```
\normalsize

Childlike. (With a bit of moody, shy, not-self-sufficient, not-conscientious.)

## Factor 13

\footnotesize
```{r bFactor-60}
loadings %>%
  arrange(desc(abs(Factor13))) %>%
  select(Factor13, trait) %>%
  slice(1:10)
```
\normalsize

Truthful. (With a bit of happy and not-gullible.)

## Factor 14

\footnotesize
```{r bFactor-61}
loadings %>%
  arrange(desc(abs(Factor14))) %>%
  select(Factor14, trait) %>%
  slice(1:10)
```
\normalsize

Decisive. (With a bit of self-sufficient and not-soft-spoken.)

## Factor 15

\footnotesize
```{r bFactor-62}
loadings %>%
  arrange(desc(abs(Factor15))) %>%
  select(Factor15, trait) %>%
  slice(1:10)
```
\normalsize

Not-compassionate, athletic, sensitive: A mixed bag. ("Cares about self"?)

## Anything left out? Uniquenesses

\scriptsize
```{r bFactor-63}
enframe(bem.15$uniquenesses, name="quality", value="uniq") %>%
  slice_max(uniq, n = 10)
```
\normalsize

Uses foul language especially, also loves children and analytical. So could use even more factors.
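
## Aside: top loadings for every factor at once

The fifteen near-identical chunks used to inspect the factors could be collapsed into one pipeline: make the loadings longer, then take the biggest few (in absolute value) within each factor. This is only a sketch on a made-up two-factor table; `toy_loadings` and its values are hypothetical and stand in for the `loadings` data frame.

```{r bFactor-aside-top}
library(dplyr)
library(tidyr)

# Made-up loadings for four traits on two factors (illustration only)
toy_loadings <- tibble(
  trait = c("warm", "tender", "forceful", "dominant"),
  Factor1 = c(0.8, 0.7, 0.1, -0.2),
  Factor2 = c(0.1, -0.1, 0.75, 0.6)
)

toy_loadings %>%
  pivot_longer(-trait, names_to = "factor", values_to = "loading") %>%
  group_by(factor) %>%
  slice_max(abs(loading), n = 2) %>%   # biggest 2 per factor, ignoring sign
  ungroup() -> top_loadings
top_loadings
```

With the real 15-factor loadings, `n = 10` would mimic the chunks above. One thing to watch: the groups come out in alphabetical order, so `Factor10` would be listed before `Factor2`.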