Important note: We have noticed a lack of informative titles on the plots turned in for the Computer Assignments. This is partly our fault, since not all of the plots given as examples have had titles. For the remainder of the class we hope to improve on this, just as we hope to instill in each of you the habit of writing INFORMATIVE titles. Please try to make your plots as informative as possible moving forward. This is a valuable skill.
We will now use the low-dimensional projection to help us make a guess about the disputed authorship. The function predict() takes the information from a previous PCA and projects new data onto the PC dimensions. Use the following code to find the projections for the disputed and collaborative essays.
fed.disp = fed[auths == "DIS", -c(1,2)]
fed.collab = fed[auths == "COL", -c(1,2)]
disp.pred = predict(pc.fed, fed.disp)
collab.pred = predict(pc.fed, fed.collab)
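As an optional sanity check (purely illustrative, not part of the assignment), each projection should have one row per essay and one column per principal component:
dim(disp.pred)
dim(collab.pred)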
Make sure you label the axes and title the plot appropriately.
CODE FOR PLOTTING THE KNOWN DATA HERE
# To add the collaborative papers
## NOTE: Make sure to run the 'plot' and 'legend' commands together, as a single
## block. legend() will NOT WORK if you run the lines one at a time.
all.classifiers = as.factor(c(as.character(auths.known), rep("COL", times = nrow(collab.pred))))
plot(rbind(pc.fed$x[, 1:2], collab.pred[, 1:2]), col = all.classifiers,
     main = "Plot of First Two PCs; Colored by Author with Collabs")
legend(-10, 5, legend = c("AH", "COL", "JJ", "JM"), col = 1:4, pch = rep(1, times = 4))
Based on this plot, who do you think did the primary work on the collaborations?
YOUR ANSWER HERE
YOUR CODE HERE
Now that we have seen a basic example of how to use kmeans with the Federalist Papers, we will get further practice with a gene-expression data example.
Load and parse the TCGA data from the course website using the following commands:
#Load the data
trial.sample = read.table("TCGA_sample.txt", header = TRUE)
#Store the subtypes of tissue and the gene expression data
Subtypes = trial.sample[,1]
Gene.Expression = as.matrix(trial.sample[,2:2001])
To run the \(k\)-means algorithm on the rows of a matrix \(X\), use the command kmeans(X, k, iter.max), where \(k\) is the number of clusters to find and iter.max is the maximum number of iterations allowed when searching for the \(k\)-cluster partition. Once you run y = kmeans(X, k, iter.max), you can type y$cluster to obtain the cluster labels of the data points, and y$tot.withinss to obtain the total within-cluster sum of squares (WCSS) for the identified partition.
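For instance (the choice k = 3 below is arbitrary, purely to illustrate the accessors):
y = kmeans(Gene.Expression, 3, iter.max = 100)
head(y$cluster)    # cluster label for each tissue sample
y$tot.withinss     # total WCSS for this partition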
Recall that \(k\)-means searches for the partition of the data that minimizes the WCSS for a fixed \(k\). To get an idea of how \(k\) affects the WCSS, run \(k\)-means for \(k\) from 1 to 20 and calculate the WCSS for each partition. Then plot the WCSS across \(k\) using the following code:
withinss = rep(NA, 20)
for (k in 1:20) {
  z = kmeans(Gene.Expression, k, iter.max = 100)
  withinss[k] = z$tot.withinss
}
plot(withinss, xlab = "k", ylab = "WCSS", main = "k-means on TCGA: WCSS by Number of Clusters")
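Because kmeans starts from random centers, the curve can occasionally bump upward as \(k\) increases. An optional variant (nstart = 5 is an arbitrary choice) restarts the algorithm several times for each \(k\) and keeps the best run, which smooths out such bumps:
withinss.best = rep(NA, 20)
for (k in 1:20) {
  withinss.best[k] = kmeans(Gene.Expression, k, iter.max = 100, nstart = 5)$tot.withinss
}
lines(withinss.best, lty = 2)  # overlay on the plot above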
It would be nice if we could visualize the clusters we just found, as we did with the Federalist Papers data. However, the TCGA data we are using has 2000 dimensions, and this is just a subset of the full data, which has about 20,000 genes! Luckily, we already have a way to reduce the dimensions of a dataset: PCA!
Perform PCA on the dataset Gene.Expression using the following code:
pc.tcga = prcomp(Gene.Expression)
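Optionally, summary() shows how much of the total variance the leading PCs capture:
summary(pc.tcga)$importance[, 1:5]  # std. deviation, proportion of variance, cumulative proportion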
YOUR CODE HERE
YOUR CODE HERE
set.seed(1)
k.pca = kmeans(pc.tcga$x[,1:2], centers = 2)
YOUR CODE HERE
YOUR ANSWER HERE
When passing a number \(x\) as the argument centers to the kmeans algorithm, \(x\) cluster centers are chosen randomly, points are assigned to a cluster based on these centers, and then the cluster centers are iteratively updated in an attempt to find the centers that minimize the total within-cluster sum of squares. The number of times kmeans updates is controlled by the argument iter.max. To examine how the choice of initial centers affects the algorithm, we will disable the updating steps via iter.max and choose some initial cluster centers of our own.
library(ggplot2)
# First, we create a dummy data set
set.seed(10)
random_data = data.frame(x = runif(4000),
                         y = runif(4000),
                         group = rep(NA, times = 8000))
# Note: x and y (length 4000) are recycled to match group (length 8000),
# so each (x, y) point appears twice in the data set.
# Group 1: the four corner squares plus the center square; Group 2: everything else
random_data[which((random_data$x < 0.33 & random_data$y < 0.33) |
                  (random_data$x > 0.66 & random_data$y < 0.33) |
                  (random_data$x < 0.33 & random_data$y > 0.66) |
                  (random_data$x > 0.66 & random_data$y > 0.66) |
                  (random_data$x > 0.33 & random_data$x < 0.66 &
                   random_data$y > 0.33 & random_data$y < 0.66)),]$group = "Group 1"
random_data[which(is.na(random_data$group)),]$group = "Group 2"
# This data set looks like:
ggplot(random_data, aes(x = x, y = y, col = group)) + geom_point() +
  ggtitle("Plot of Dummy Data")
# We'll perform the kmeans with two random centers (iter.max = 1, so no further updating):
random.km = kmeans(random_data[,1:2], centers = 2, iter.max = 1)
ggplot(random_data, aes(x = x, y = y, col = group, pch = as.factor(random.km$cluster))) +
  geom_point() +
  scale_shape_discrete(name = "Cluster") +
  ggtitle("Plot of Dummy Data; Clustered with Random Centers")
# Which gives the following total within-cluster sum of squares:
random.km$tot.withinss
## [1] 829.424
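# Now we supply our own initial centers: here, the first two observations in the data.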
my_centers = random_data[1:2, 1:2]
random.km = kmeans(random_data[,1:2], centers = my_centers, iter.max = 1)
ggplot(random_data, aes(x = x, y = y, col = group, pch = as.factor(random.km$cluster))) +
  geom_point() +
  scale_shape_discrete(name = "Cluster") +
  ggtitle("Plot of Dummy Data; Clustered with Chosen Centers")
random.km$tot.withinss
## [1] 847.1953
Don’t just look at the number!! Can you see how the clusters themselves changed? Now, choose your own cluster centers! These do not need to be actual observed points from our data. Try something weird, and report the total within-cluster sum of squares. Did it change? Was it better or worse than the random choice? Comment on why you think that is. (A sketch of the mechanics follows; pick your own centers for the actual exercise.)
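As a sketch of the mechanics only (the coordinates below are arbitrary, not a suggested answer), custom centers can be supplied as any two-column matrix or data frame:
weird_centers = data.frame(x = c(-1, 2), y = c(-1, 2))  # arbitrary points, not from the data
weird.km = kmeans(random_data[, 1:2], centers = weird_centers, iter.max = 1)
weird.km$tot.withinss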
YOUR CODE, AND ANALYSIS, HERE