--- title: "Cluster analysis" editor: markdown: wrap: 72 --- ## Cluster Analysis - One side-effect of discriminant analysis: could draw picture of data (if 1st 2s `LD`s told most of story) and see which individuals "close" to each other. - Discriminant analysis requires knowledge of groups. - Without knowledge of groups, use *cluster analysis*: see which individuals close together, which groups suggested by data. - Idea: see how individuals group into "clusters" of nearby individuals. - Base on "dissimilarities" between individuals. - Or base on standard deviations and correlations between variables (assesses dissimilarity behind scenes). ## Packages ```{r bCluster-1} #| message = FALSE library(MASS) # for lda later library(tidyverse) library(spatstat) # for crossdist later library(ggrepel) library(conflicted) conflict_prefer("select", "dplyr") conflict_prefer("filter", "dplyr") ``` ## One to ten in 11 languages | | English | Norwegian | Danish | Dutch | German | |-----|---------|-----------|--------|-------|--------| | 1 | one | en | en | een | eins | | 2 | two | to | to | twee | zwei | | 3 | three | tre | tre | drie | drei | | 4 | four | fire | fire | vier | vier | | 5 | five | fem | fem | vijf | funf | | 6 | six | seks | seks | zes | sechs | | 7 | seven | sju | syv | zeven | sieben | | 8 | eight | atte | otte | acht | acht | | 9 | nine | ni | ni | negen | neun | | 10 | ten | ti | ti | tien | zehn | ## One to ten {.smaller} | | French | Spanish | Italian | Polish | Hungarian | Finnish | |-----|--------|---------|---------|----------|-----------|-----------| | 1 | un | uno | uno | jeden | egy | yksi | | 2 | deux | dos | due | dwa | ketto | kaksi | | 3 | trois | tres | tre | trzy | harom | kolme | | 4 | quatre | cuatro | quattro | cztery | negy | nelja | | 5 | cinq | cinco | cinque | piec | ot | viisi | | 6 | six | seis | sei | szesc | hat | kuusi | | 7 | sept | siete | sette | siedem | het | seitseman | | 8 | huit | ocho | otto | osiem | nyolc | kahdeksan | | 9 | neuf | nueve | nove | dziewiec | kilenc | yhdeksan | | 10 | dix | diez | dieci | dziesiec | tiz | kymmenen | ## Dissimilarities and languages example - Can define dissimilarities how you like (whatever makes sense in application). - Sometimes defining "similarity" makes more sense; can turn this into dissimilarity by subtracting from some maximum. - Example: numbers 1--10 in various European languages. Define similarity between two languages by counting how often the same number has a name starting with the same letter (and dissimilarity by how often number has names starting with different letter). - Crude (doesn't even look at most of the words), but see how effective. ## Two kinds of cluster analysis - Looking at process of forming clusters (of similar languages): **hierarchical cluster analysis** (`hclust`). - Start with each individual in cluster by itself. - Join "closest" clusters one by one until all individuals in one cluster. - How to define closeness of two *clusters*? Not obvious, investigate in a moment. - Know how many clusters: which division into that many clusters is "best" for individuals? **K-means clustering** (`kmeans`). ## Two made-up clusters ```{r bCluster-2, echo=F, fig.height=4} set.seed(457299) a <- data.frame(x = runif(5, 0, 20), y = runif(5, 0, 20)) b <- data.frame(x = runif(5, 20, 40), y = runif(5, 20, 40)) ddd <- bind_rows(a = a, b = b, .id = "cluster") g <- ggplot(ddd, aes(x = x, y = y, colour = cluster)) + geom_point() + coord_fixed(xlim = c(0, 40), ylim = c(0, 40)) g ``` How to measure distance between set of red points and set of blue ones? ## Single-linkage distance Find the red point and the blue point that are closest together: ```{r bCluster-3, echo=F,fig.height=3.1} distances <- crossdist(a$x, a$y, b$x, b$y) wm1 <- which.min(apply(distances, 1, min)) wm2 <- which.min(apply(distances, 2, min)) closest <- bind_rows(a = a[wm1, ], b = b[wm2, ], .id = "cluster") # single linkage distance g + geom_segment(data = closest, aes(x = x[1], y = y[1], xend = x[2], yend = y[2]), colour = "blue") ``` Single-linkage distance between 2 clusters is distance between their closest points. ## Complete linkage Find the red and blue points that are farthest apart: ```{r bCluster-4, echo=F,fig.height=3.4} wm1 <- which.max(apply(distances, 1, max)) wm2 <- which.max(apply(distances, 2, max)) closest <- bind_rows(a[wm1, ], b[wm2, ], .id = "cluster") g + geom_segment(data = closest, aes(x = x[1], y = y[1], xend = x[2], yend = y[2]), colour = "blue") ``` Complete-linkage distance is distance between farthest points. ## Ward's method Work out mean of each cluster and join point to its mean: ```{r bCluster-5, fig.height=3.2,echo=F} xm <- aggregate(x ~ cluster, ddd, mean) ym <- aggregate(y ~ cluster, ddd, mean) dm <- cbind(xm, y = ym[, 2]) # loop through data frame and create grp that links to cluster's mean new <- data.frame( x = double(), y = double(), cluster = character(), grp = integer(), stringsAsFactors = F ) count <- 0 for (i in 1:5) { count <- count + 1 new[2 * count - 1, ] <- c(a[i, ], cluster = "a", grp = count) new[2 * count, ] <- c(dm[1, -1], cluster = "a", grp = count) count <- count + 1 new[2 * count - 1, ] <- c(b[i, ], cluster = "b", grp = count) new[2 * count, ] <- c(dm[2, -1], cluster = "b", grp = count) } ggplot(ddd, aes(x = x, y = y, colour = cluster)) + coord_fixed(xlim = c(0, 40), ylim = c(0, 40)) + geom_point() + geom_point(data = dm, shape = 3) + geom_line(data = new, aes(group = grp), alpha = 0.5) ``` Work out (i) sum of squared distances of points from means. ## Ward's method part 2 Now imagine combining the two clusters and working out overall mean. Join each point to this mean: ```{r bCluster-6, echo=F,fig.height=3.1} ddd %>% summarize(x = mean(x), y = mean(y)) -> dm # loop through data frame and create grp that links to cluster's mean new <- data.frame( x = double(), y = double(), cluster = character(), grp = integer(), stringsAsFactors = F ) count <- 0 for (i in 1:5) { count <- count + 1 new[2 * count - 1, ] <- c(a[i, ], cluster = "a", grp = count) new[2 * count, ] <- c(dm[1, ], cluster = "a", grp = count) count <- count + 1 new[2 * count - 1, ] <- c(b[i, ], cluster = "b", grp = count) new[2 * count, ] <- c(dm[1, ], cluster = "b", grp = count) } ggplot(ddd, aes(x = x, y = y, colour = cluster)) + coord_fixed(xlim = c(0, 40), ylim = c(0, 40)) + geom_point() + geom_point(data = dm, aes(colour = NULL), shape = 3) + geom_line(data = new, aes(group = grp), alpha = 0.5) ``` Calc sum of squared distances (ii) of points to combined mean. ## Ward's method part 3 - Sum of squares (ii) will be bigger than (i) (points closer to own cluster mean than combined mean). - Ward's distance is (ii) minus (i). - Think of as "cost" of combining clusters: - if clusters close together, (ii) only a little larger than (i) - if clusters far apart, (ii) a lot larger than (i) (as in example). ## Hierarchical clustering revisited - Single linkage, complete linkage, Ward are ways of measuring closeness of clusters. - Use them, starting with each observation in own cluster, to repeatedly combine two closest clusters until all points in one cluster. - They will give different answers (clustering stories). - Single linkage tends to make "stringy" clusters because clusters can be very different apart from two closest points. - Complete linkage insists on whole clusters being similar. - Ward tends to form many small clusters first. ## Dissimilarity data in R Dissimilarities for language data were how many number names had *different* first letter: ```{r bCluster-7, include=F} options(width = 60) ``` \footnotesize ```{r bCluster-8, size="scriptsize",message=F} my_url <- "http://ritsokiguess.site/datafiles/languages.txt" (number.d <- read_table(my_url)) ``` \normalsize ## Making a distance object ```{r bCluster-9, size="footnotesize"} number.d %>% select(-la) %>% as.dist() -> d d class(d) ``` ## Cluster analysis and dendrogram ```{r bCluster-10, fig.height=4} d.hc <- hclust(d, method = "single") plot(d.hc) ``` ## Comments - Tree shows how languages combined into clusters. - First (bottom), Spanish, French, Italian joined into one cluster, Norwegian and Danish into another. - Later, English joined to Norse languages, Polish to Romance group. - Then German, Dutch make a Germanic group. - Finally, Hungarian and Finnish joined to each other and everything else. ## Clustering process \small ```{r bCluster-11 } enframe(d.hc$labels) ``` ```{r} d.hc$merge ``` ## Comments - Lines of `merge` show what was combined - First, languages 2 and 3 (`no` and `dk`) - Then languages 6 and 8 (`fr` and `it`) - Then #7 combined with cluster formed at step 2 (`es` joined to `fr` and `it`). - Then `en` joined to `no` and `dk` \ldots - Finally `fi` joined to all others. ## Complete linkage ```{r bCluster-12, fig.height=4} d.hc <- hclust(d, method = "complete") plot(d.hc) ``` ## Ward ```{r wardo,fig.height=4} d.hc <- hclust(d, method = "ward.D") plot(d.hc) ``` ## Chopping the tree - Three clusters (from Ward) looks good: ```{r bCluster-13 } cutree(d.hc, 3) ``` ## Turning the "named vector" into a data frame \footnotesize ```{r bCluster-14} cutree(d.hc, 3) %>% enframe(name="country", value="cluster") ``` \scriptsize ## Drawing those clusters on the tree ```{r asfsagd,fig.height=4} plot(d.hc) rect.hclust(d.hc, 3) ``` ## Comparing single-linkage and Ward - In Ward, Dutch and German get joined earlier (before joining to Germanic cluster). - Also Hungarian and Finnish get combined earlier. ## Making those dissimilarities Original data: \tiny ```{r bCluster-15, message=F} my_url <- "http://ritsokiguess.site/datafiles/one-ten.txt" lang <- read_delim(my_url, " ") lang ``` \normalsize It would be a lot easier to extract the first letter if the number names were all in one column. ## Tidy, and extract first letter \footnotesize ```{r bCluster-16} lang %>% mutate(number=row_number()) %>% pivot_longer(-number, names_to="language", values_to="name") %>% mutate(first=str_sub(name, 1, 1)) -> lang.long lang.long ``` \normalsize ## Calculating dissimilarity - Suppose we wanted dissimilarity between English and Norwegian. It's the number of first letters that are different. - First get the lines for English: \scriptsize ```{r bCluster-17} english <- lang.long %>% filter(language == "en") english ``` \normalsize ## And then the lines for Norwegian \footnotesize ```{r bCluster-18} norwegian <- lang.long %>% filter(language == "no") norwegian ``` \normalsize And now we want to put them side by side, matched by number. This is what `left_join` does. (A "join" is a lookup of values in one table using another.) ## The join ```{r bCluster-19, echo=F} options(width=65) ``` \scriptsize ```{r bCluster-20} english %>% left_join(norwegian, join_by(number)) ``` \normalsize `first.x` is 1st letter of English word, `first.y` 1st letter of Norwegian word. ## Counting the different ones \footnotesize ```{r bCluster-21 } english %>% left_join(norwegian, join_by(number)) %>% count(different=(first.x != first.y)) ``` \normalsize or \footnotesize ```{r bCluster-22 } english %>% left_join(norwegian, join_by(number)) %>% count(different=(first.x != first.y)) %>% filter(different) %>% pull(n) -> ans ans ``` \normalsize Words for 1 and 8 start with different letter; rest are same. ## A language with itself The answer should be zero: ```{r bCluster-23} english %>% left_join(english, join_by(number)) %>% count(different=(first.x != first.y)) %>% filter(different) %>% pull(n) -> ans ans ``` - but this is "an integer vector of length zero". - so we have to allow for this possibility when we write a function to do it. ## Function to do this for any two languages ```{r bCluster-24 } countdiff <- function(lang.1, lang.2, d) { d %>% filter(language == lang.1) -> lang1d d %>% filter(language == lang.2) -> lang2d lang1d %>% left_join(lang2d, join_by(number)) %>% count(different = (first.x != first.y)) %>% filter(different) %>% pull(n) -> ans # if ans has length zero, set answer to (integer) zero. ifelse(length(ans)==0, 0L, ans) } ``` ## Testing ```{r bCluster-25 } countdiff("en", "no", lang.long) countdiff("en", "en", lang.long) ``` English and Norwegian have two different; English and English have none different. Check. ## For all pairs of languages? - First need all the languages: ```{r bCluster-26, include=F} options(width = 50) ``` ```{r bCluster-27, size="footnotesize"} languages <- names(lang) languages ``` - and then all *pairs* of languages: ```{r bCluster-28, size="scriptsize"} pairs <- crossing(lang = languages, lang2 = languages) ``` ## The pairs \scriptsize ```{r bCluster-29} pairs ``` \normalsize ## Run `countdiff` for all those language pairs \footnotesize ```{r bCluster-30} pairs %>% rowwise() %>% mutate(diff = countdiff(lang, lang2, lang.long)) -> thediff thediff ``` \normalsize ## Make square table of these ```{r bCluster-31, echo=F} options(width = 60) ``` \scriptsize ```{r bCluster-32} thediff %>% pivot_wider(names_from=lang2, values_from=diff) ``` \normalsize and that was where we began. ## Another example Birth, death and infant mortality rates for 97 countries (variables not dissimilarities): ``` 24.7 5.7 30.8 Albania 12.5 11.9 14.4 Bulgaria 13.4 11.7 11.3 Czechoslovakia 12 12.4 7.6 Former_E._Germany 11.6 13.4 14.8 Hungary 14.3 10.2 16 Poland 13.6 10.7 26.9 Romania 14 9 20.2 Yugoslavia 17.7 10 23 USSR 15.2 9.5 13.1 Byelorussia_SSR 13.4 11.6 13 Ukrainian_SSR 20.7 8.4 25.7 Argentina 46.6 18 111 Bolivia 28.6 7.9 63 Brazil 23.4 5.8 17.1 Chile 27.4 6.1 40 Columbia 32.9 7.4 63 Ecuador 28.3 7.3 56 Guyana ... ``` - Want to find groups of similar countries (and how many groups, which countries in each group). - Tree would be unwieldy with 97 countries. - More automatic way of finding given number of clusters? ## Reading in ```{r bCluster-33 } url <- "http://ritsokiguess.site/datafiles/birthrate.txt" vital <- read_table(url) vital ``` ## Standardizing - Infant mortality rate numbers bigger than others, consequence of measurement scale (arbitrary). - Standardize (numerical) columns of data frame to have mean 0, SD 1, done by `scale`. ```{r bCluster-35, size="footnotesize"} vital %>% mutate(across(where(is.numeric), \(x) scale(x))) -> vital.s ``` ## Three clusters Pretend we know 3 clusters is good. Take off the column of countries, and run `kmeans` on the resulting data frame, asking for 3 clusters: ```{r bCluster-36, echo=FALSE} set.seed(457299) ``` ```{r bCluster-37 } vital.s %>% select(-country) %>% kmeans(3) -> vital.km3 names(vital.km3) ``` A lot of output, so look at these individually. ## What's in the output? - Cluster sizes: ```{r bCluster-38 } vital.km3$size ``` - Cluster centres: ```{r bCluster-39 } vital.km3$centers ``` - Cluster 2 has lower than average rates on everything; cluster 3 has much higher than average. ## Cluster sums of squares and membership ```{r bCluster-40 } vital.km3$withinss ``` Cluster 1 compact relative to others (countries in cluster 1 more similar). ```{r bCluster-41, size="scriptsize"} vital.km3$cluster ``` The cluster membership for each of the 97 countries. ## Store countries and clusters to which they belong ```{r bCluster-42 } vital.3 <- tibble( country = vital.s$country, cluster = vital.km3$cluster ) ``` Next, which countries in which cluster? Write function to extract them: ```{r bCluster-43 } get_countries <- function(i, d) { d %>% filter(cluster == i) %>% pull(country) } ``` ## Cluster membership: cluster 2 \scriptsize ```{r bCluster-45} get_countries(2, vital.3) ``` \normalsize ## Cluster 3 ```{r bCluster-47} get_countries(3, vital.3) ``` ## Cluster 1 ```{r bCluster-48} get_countries(1, vital.3) ``` ## Problem! - `kmeans` uses randomization. So result of one run might be different from another run. - Example: just run again on 3 clusters, `table` of results: ```{r bCluster-49, echo=FALSE} set.seed(457298) ``` \small ```{r bCluster-50, size="small"} vital.s %>% select(-country) %>% kmeans(3) -> vital.km3a table( first = vital.km3$cluster, second = vital.km3a$cluster ) ``` \normalsize - Clusters are similar but *not same*. ## Solution to this - `nstart` option on `kmeans` runs that many times, takes best. Should be same every time: \small ```{r bCluster-51, eval=FALSE,size="small"} vital.s %>% select(-country) %>% kmeans(3, nstart = 20) -> vital.km3b ``` \normalsize ## How many clusters? - Three was just a guess. - Idea: try a whole bunch of #clusters (say 2--20), obtain measure of goodness of fit for each, make plot. - Appropriate measure is `tot.withinss`. - Run `kmeans` for each #clusters, get `tot.withinss` each time. ## Function to get `tot.withinss` \ldots for an input number of clusters, taking only numeric columns of input data frame: ```{r bCluster-52 } ss <- function(i, d) { d %>% select(where(is.numeric)) %>% kmeans(i, nstart = 20) -> km km$tot.withinss } ``` Note: writing function to be as general as possible, so that we can re-use it later. ## Constructing within-cluster SS Make a data frame with desired numbers of clusters, and fill it with the total within-group sums of squares. `ss` expects a single number of clusters, not a vector of several, so run `rowwise`: \normalsize ```{r bCluster-53} tibble(clusters = 2:20) %>% rowwise() %>% mutate(wss = ss(clusters, vital.s)) -> ssd ssd ``` \normalsize ## Scree plot ```{r favalli,fig.height=4.5} ggplot(ssd, aes(x = clusters, y = wss)) + geom_point() + geom_line() ``` ## Interpreting scree plot - Lower `wss` better. - But lower for larger #clusters, harder to explain. - Compromise: low-ish `wss` and low-ish #clusters. - Look for "elbow" in plot. - Idea: this is where `wss` decreases fast then slow. - On our plot, small elbow at 6 clusters. Try this many clusters. ## Six clusters, using `nstart` ```{r} set.seed(457299) ``` ```{r bCluster-54 } vital.s %>% select(-country) %>% kmeans(6, nstart = 20) -> vital.km6 vital.km6$size vital.km6$centers ``` ## Make a data frame of countries and clusters ```{r bCluster-55} vital.6 <- tibble( country = vital.s$country, cluster = vital.km6$cluster ) vital.6 %>% sample_n(10) ``` ## Cluster 1 Below-average death rate, though other rates a little higher than average: ```{r bCluster-56 } get_countries(1, vital.6) ``` ## Cluster 2 High on everything: \normalsize ```{r bCluster-57, echo=2} options(width = 60) get_countries(2, vital.6) options(width = 50) ``` \normalsize ## Cluster 3 Low on everything: \footnotesize ```{r bCluster-58} get_countries(3, vital.6) ``` \normalsize ## Cluster 4 Very high death rate, just below average on all else: \small ```{r bCluster-59} get_countries(4, vital.6) ``` \normalsize ## Cluster 5 Very high on everything: ```{r bCluster-60 } get_countries(5, vital.6) ``` ## Cluster 6 A bit below average on everything: ```{r bCluster-61 } get_countries(6, vital.6) ``` ## Comparing our 3 and 6-cluster solutions ```{r bCluster-62 } table(three = vital.km3$cluster, six = vital.km6$cluster) ``` Compared to 3-cluster solution: - most of (old) cluster 1 gone to (new) cluster 6 - cluster 2 split into clusters 2 and 5 (two types of "poor" countries) - cluster 3 split into clusters 1 and 3 (two types of "intermediate" countries, divided by death rate). ## Getting a picture from `kmeans` - Use discriminant analysis on clusters found, treating them as "known" groups. ## Discriminant analysis - So what makes the groups different? - Uses package `MASS` (loaded): \footnotesize ```{r bCluster-63 } vital.lda <- lda(vital.km6$cluster ~ birth + death + infant, data = vital.s) vital.lda$svd vital.lda$scaling ``` \normalsize - LD1 is some of everything (high=poor, low=rich). - LD2 mainly death rate, high or low. ## A data frame to make plot from - Get predictions first: \small ```{r bCluster-64} vital.pred <- predict(vital.lda) d <- data.frame( country = vital.s$country, cluster = vital.km6$cluster, vital.pred$x ) d ``` \normalsize ## What's in there; making a plot - `d` contains country names, cluster memberships and discriminant scores. - Plot `LD1` against `LD2`, colouring points by cluster and labelling by country: ```{r bCluster-65, size="small"} g <- ggplot(d, aes( x = LD1, y = LD2, colour = factor(cluster), label = country )) + geom_point() + geom_text_repel(size = 2, max.overlaps = Inf) + guides(colour = "none") ``` ## The plot ```{r bCluster-66, fig.height=5} g ``` It would be better to zoom in on parts of this plot. ## Final example: a hockey league - An Ontario hockey league has teams in 21 cities. How can we arrange those teams into 4 geographical divisions? - Distance data in spreadsheet. - Take out spaces in team names. - Save as "text/csv". - Distances, so back to `hclust`. ## A map ![](map1.png) ## Attempt 1 ```{r bCluster-67, message=F} my_url <- "http://ritsokiguess.site/datafiles/ontario-road-distances.csv" ontario <- read_csv(my_url) ontario.d <- ontario %>% select(-1) %>% as.dist() ontario.hc <- hclust(ontario.d, method = "ward.D") ``` ## Plot, with 4 clusters ```{r bCluster-68, fig.height=4} plot(ontario.hc) rect.hclust(ontario.hc, 4) ``` ## Comments - Can't have divisions of 1 team! - "Southern" divisions way too big! - Try splitting into more. I found 7 to be good: ## Seven clusters ```{r bCluster-69, fig.height=4 } plot(ontario.hc) rect.hclust(ontario.hc, 7) ``` ## Divisions now - I want to put Huntsville and North Bay together with northern teams. - I'll put the Eastern teams together. Gives: - North: Sault Ste Marie, Sudbury, Huntsville, North Bay - East: Brockville, Cornwall, Ottawa, Peterborough, Belleville, Kingston - West: Windsor, London, Sarnia - Central: Owen Sound, Barrie, Toronto, Niagara Falls, St Catharines, Brantford, Hamilton, Kitchener - Getting them same size beyond us! ## Another map ![](map2.png)