.BG
.FN pca
.TL
Principal Components Analysis
.DN
Finds a new coordinate system for multivariate data such that the first
coordinate has maximal variance, the second coordinate has maximal variance
subject to being orthogonal to the first, and so on.
.CS
pca(a, method=3)
.sp
or
.sp
pca(h, method=3, lev=length(h$order)-1)
.RA
.AG a
data matrix to be decomposed, with rows representing observations and
columns representing variables.  Missing values are not supported.
.sp
or
.sp
.AG h
object of class `hierarchy'.
.OA
.AG method
integer between 1 and 8.
`method' = 1 implies no transformation of the data matrix; the singular
value decomposition (SVD) is carried out on a sums of squares and
cross-products matrix.
`method' = 2 implies that the observations are centered to zero mean; the
SVD is carried out on a variance-covariance matrix.
`method' = 3 (default) implies that the observations are centered to zero
mean and additionally reduced to unit standard deviation, i.e. standardized;
the SVD is carried out on a correlation matrix.
`method' = 4 implies that the observations are normalized by range division,
and the variance-covariance matrix of the normalized data is then used.
`method' = 5 implies that the SVD is carried out on a Kendall (rank-order)
correlation matrix.
`method' = 6 implies that the SVD is carried out on a Spearman (rank-order)
correlation matrix.
`method' = 7 implies that the SVD is carried out on the sample covariance
matrix.
`method' = 8 implies that the SVD is carried out on the sample correlation
matrix.
.AG lev
when the object `h' is of class `hierarchy', a principal components analysis
of the partition associated with level `lev' of the hierarchy is produced.
.RT
list, of class `"reddim"', describing the principal components analysis:
.RC rproj
projections of the row points on the new axes.
.RC cproj
projections of the column points on the new axes.
.RC evals
eigenvalues associated with the new axes.  These provide figures of merit
for the `variance explained' by the new axes, and are usually quoted as
percentages of the total or as cumulative percentages of the total.
.RC evecs
eigenvectors associated with the new axes.  This orthogonal matrix describes
the rotation: the first column is the linear combination of the columns of
`a' defining the first principal component, and so on.
.SE
When carrying out a PCA of a hierarchy object, the partition is specified by
`lev'.  The level plus the associated number of groups always equals the
number of observations.
.SH NOTE
In the case of `method' = 3, if any column has zero standard deviation, a
value of 1 is substituted for that standard deviation.
.PP
Up to 7 principal axes are determined.  The inherent dimensionality of
either of the dual spaces is ordinarily `min(n,m)', where `n' and `m' are
respectively the numbers of rows and columns of `a'.  The centering
transformation which is part of `methods' 2 and 3 introduces a linear
dependency, reducing the inherent dimensionality to `min(n-1,m)'.  Hence the
number of columns returned in `rproj', `cproj', and `evecs' is the lesser of
this inherent dimensionality and 7.
.PP
In the case of `methods' 1 to 4, very small negative eigenvalues, if they
arise, are an artifact of the SVD algorithm used and may be treated as zero.
In the case of PCA using rank-order correlations (`methods' 5 and 6),
negative eigenvalues indicate that a Euclidean representation of the data is
not possible.
The approximate Euclidean representation given by the axes associated with
the positive eigenvalues can often be quite adequate for practical
interpretation of the data.
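.PP
As a sketch of how such eigenvalues might be handled in practice (the
`prim4' data set and `method' = 1 are assumed purely for illustration),
very small negative eigenvalues can be clamped to zero before quoting
cumulative percentages of variance explained:
.nf
   p <- pca(prim4, method=1)
   ev <- pmax(p$evals, 0)                 # treat tiny negative values as zero
   round(100 * cumsum(ev) / sum(ev), 2)   # cumulative percentage of variance
.fi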
.PP
Routine `prcomp' is identical, to within small numerical precision
differences, to `method' = 7 here.  The examples below show how to map the
outputs of the present implementation onto the outputs of `prcomp'.
.PP
Note that a very large number of columns in the input data matrix will cause
dynamic memory problems: the matrix to be diagonalized requires O(m^2)
storage, where m is the number of variables.
.SH METHOD
A singular value decomposition is carried out.
.SH BACKGROUND
Principal components analysis defines the axis which provides the best fit
to both the row points and the column points.  A second axis is determined
which best fits the data subject to being orthogonal to the first.  Third
and subsequent axes are found in the same way.  Best fit is in the least
squares sense.  By virtue of Pythagoras' theorem, the criterion which
optimizes the fit of the axes to the points simultaneously optimizes the
variance of the projections on the axes.
.PP
Principal components analysis is often used as a data reduction technique.
In the pattern recognition field it is often termed the Karhunen-Loeve
expansion, since the data matrix `a' may be written as a series expansion
using the eigenvectors and eigenvalues found.
.SH REFERENCES
Many multivariate statistics and data analysis books include a discussion of
principal components analysis.  A few examples:
C. Chatfield and A.J. Collins, `Introduction to Multivariate Analysis',
Chapman and Hall, 1980 (a good, all-round introduction);
M. Kendall, `Multivariate Analysis', Griffin, 1980 (dated in relation to
computing techniques, but exceptionally clear and concise in its treatment
of many practical aspects);
F.H.C. Marriott, `The Interpretation of Multiple Observations', Academic
Press, 1974 (a short, very readable textbook);
L. Lebart, A. Morineau, and K.M. Warwick, `Multivariate Descriptive
Statistical Analysis', Wiley, 1984 (an excellent geometric treatment of
PCA);
I.T. Jolliffe, `Principal Component Analysis', Springer, 1986.
.SA
`svd', `prcomp', `cancor'.
.EX
# principal components of the prim4 data
pcprim <- pca(prim4)
# plot of first and second principal components
plot(pcprim$rproj[,1], pcprim$rproj[,2])
# To label the points, use `plot' with parameter `type="n"', followed by
# `text': cf. the examples below.
# Place additional axes through x=0 and y=0:
plaxes(pcprim$rproj[,1], pcprim$rproj[,2])
# variance explained by the principal components
pcprim$evals*100.0/sum(pcprim$evals)
#
# The S function `prcomp' produces different results.  Here is how to obtain
# these results using the function `pca'.
# Consider the following result of `prcomp':
old <- prcomp(prim4)
# With `pca', one would do the following:
new <- pca(prim4, method=7)
# The data structures of `prcomp' relate to those of `pca' as follows:
n <- nrow(prim4)
old$sdev <- sqrt(new$evals/(n-1))
old$rotation <- new$evecs
center <- apply(old$x, 2, mean)
new$rproj[1,] <- old$x[1,] - center[1]
# One remark: the rotation matrix satisfies
# old$x == prim4 %*% old$rotation
# up to numerical precision.  However, only up to 7 principal components
# are determined here.
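#
# A hedged sketch of the series expansion mentioned under BACKGROUND: rebuild
# an approximation to the standardized data from the first two principal
# axes.  This assumes `rproj' holds the projections of the standardized
# observations and `evecs' the corresponding orthonormal axes (and that
# prim4 has no constant columns); verify against your data before relying
# on it.
ctr <- apply(prim4, 2, mean)
sdv <- sqrt(apply(prim4, 2, var))
z <- sweep(sweep(prim4, 2, ctr), 2, sdv, "/")   # standardized data (cf. method=3)
approx2 <- pcprim$rproj[,1:2] %*% t(pcprim$evecs[,1:2])
# When the first two eigenvalues dominate, approx2 should be close to z;
# the residual z - approx2 measures what a two-axis display loses.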
#
# Finally, a PCA of a `hierarchy' object:
pca(hierclust(indat))
pca(hierclust(distance(indat)))
# A four-panel set of PCAs of partitions:
motif()
par(mfrow=c(2,2))
h <- hierclust(indat)
n <- length(h$order)
#
pp <- pca(h)
plot(pp$rproj[,1], pp$rproj[,2], xlab="PC1", ylab="PC2",
     main="1 cluster", type="n")
text(pp$rproj[,1], pp$rproj[,2], 1:n)
#
pp <- pca(h, lev=(n-2))
plot(pp$rproj[,1], pp$rproj[,2], xlab="PC1", ylab="PC2",
     main="2 clusters", type="n")
text(pp$rproj[,1], pp$rproj[,2], pp$rlbls)
#
pp <- pca(h, lev=(n-3))
plot(pp$rproj[,1], pp$rproj[,2], xlab="PC1", ylab="PC2",
     main="3 clusters", type="n")
text(pp$rproj[,1], pp$rproj[,2], pp$rlbls)
#
pp <- pca(h, lev=(n-4))
plot(pp$rproj[,1], pp$rproj[,2], xlab="PC1", ylab="PC2",
     main="4 clusters", type="n")
text(pp$rproj[,1], pp$rproj[,2], pp$rlbls)
.KW multivariate
.KW algebra
.WR