Usage
clustering(
  .data,
  ...,
  by = NULL,
  scale = FALSE,
  selvar = FALSE,
  verbose = TRUE,
  distmethod = "euclidean",
  clustmethod = "average",
  nclust = NA
)
Arguments
- .data
The data to be analyzed. It can be a data frame, possibly with grouped data passed from dplyr::group_by().
- ...
The variables in .data used to compute the distances. Defaults to NULL, i.e., all the numeric variables in .data are used.
- by
One variable (factor) to compute the function by. It is a shortcut to dplyr::group_by(). To compute the statistics by more than one grouping variable, use that function.
- scale
Should the data be scaled before computing the distances? Defaults to FALSE. If TRUE, each observation is divided by the standard deviation of its variable: \(Z_{ij} = X_{ij} / sd_j\).
- selvar
Logical argument, defaults to FALSE. If TRUE, an algorithm for selecting variables is run. See the section Details for additional information.
- verbose
Logical argument. If TRUE (default), the results of the variable selection are shown in the console.
- distmethod
The distance measure to be used. This must be one of 'euclidean', 'maximum', 'manhattan', 'canberra', 'binary', 'minkowski', 'pearson', 'spearman', or 'kendall'. The last three are correlation-based distances.
- clustmethod
The agglomeration method to be used. This should be one of 'ward.D', 'ward.D2', 'single', 'complete', 'average' (= UPGMA), 'mcquitty' (= WPGMA), 'median' (= WPGMC), or 'centroid' (= UPGMC).
- nclust
The number of clusters to be formed. Defaults to NA.
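As a rough illustration of what the scale, distmethod, clustmethod, and nclust arguments control, the same choices can be sketched with base R functions (sd(), dist(), hclust(), cutree()). This is not the internal code of clustering(); the built-in USArrests data set and the object names are assumptions made for the example.

```r
# Base R sketch of the choices controlled by scale, distmethod,
# clustmethod, and nclust. Illustration only, not the internals of
# clustering(); USArrests is used just because it ships with R.
X <- as.matrix(USArrests)

# scale = TRUE: Z_ij = X_ij / sd_j (each column divided by its own sd)
sds <- apply(X, 2, sd)
Z <- sweep(X, 2, sds, "/")

# distmethod = "euclidean"
d <- dist(Z, method = "euclidean")

# clustmethod = "average" (UPGMA)
hc <- hclust(d, method = "average")

# nclust = 4: cut the dendrogram into four groups
groups <- cutree(hc, k = 4)
table(groups)
```

After scaling, every column of Z has unit standard deviation, so no single variable dominates the Euclidean distance because of its measurement units.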
Value
data The data that was used to compute the distances.
cutpoint The cutpoint of the dendrogram according to Mojena (1977).
distance The matrix with the distances.
de The distances in an object of class dist.
hc The hierarchical clustering.
Sqt The total sum of squares.
tab A table with the clusters and similarity.
clusters The sum of squares and the mean of the clusters for each variable.
cofgrap If selvar = TRUE, cofgrap is a ggplot2-based graphic showing the cophenetic correlation for each model (with a different number of variables). Otherwise, it is a NULL object.
statistics If selvar = TRUE, statistics shows the summary of the models fitted with different numbers of variables, including the cophenetic correlation, Mantel's correlation with the original distances (all variables), and the p-value associated with Mantel's test. Otherwise, it is a NULL object.
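To make the cutpoint and the cophenetic correlation more concrete, the sketch below computes both with base R. It assumes a Mojena-style rule of the form mean(height) + k * sd(height); the value k = 1.25 is an assumption chosen for illustration, since the constant used by clustering() is not stated here. The data set and object names are likewise hypothetical.

```r
# Sketch of a Mojena-style cutpoint and the cophenetic correlation.
# Assumptions: USArrests as example data, k = 1.25 as the multiplier.
d  <- dist(as.matrix(USArrests), method = "euclidean")
hc <- hclust(d, method = "average")

# Cutpoint: mean of the fusion heights plus k standard deviations
k <- 1.25
cutpoint <- mean(hc$height) + k * sd(hc$height)

# Number of groups obtained by cutting the dendrogram at that height
n_groups <- length(unique(cutree(hc, h = cutpoint)))

# Cophenetic correlation: agreement between the dendrogram distances
# and the original distances
coph <- cor(as.vector(cophenetic(hc)), as.vector(d))

cutpoint
n_groups
coph
```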
Details
When selvar = TRUE, a variable selection algorithm is executed. The objective is to select the group of variables that contributes most to explaining the variability of the original data. The selection is based on an eigenvalue/eigenvector solution and follows these steps:
1. Compute the distance matrix and the cophenetic correlation with the original variables (all numeric variables in the dataset);
2. Compute the eigenvalues and eigenvectors of the correlation matrix between the variables;
3. Delete the variable with the largest weight (the largest element of the eigenvector associated with the smallest eigenvalue);
4. Compute the distance matrix and the cophenetic correlation with the remaining variables;
5. Compute Mantel's correlation between the resulting distance matrix and the original distance matrix;
6. Iterate steps 2 to 5 p - 2 times, where p is the number of original variables.
At the end of the p - 2 iterations, a summary of the models is returned. The distance is calculated with the variables that generated the model with the largest cophenetic correlation. I suggest a careful evaluation aimed at choosing a parsimonious model, i.e., the one with the fewest variables that still presents an acceptable cophenetic correlation and high similarity with the original distances.
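The selection loop described above can be sketched in base R as follows. This is an illustrative reading of the steps, not the code used by clustering(): the mtcars columns, the helper coph_cor(), and the interpretation of "largest weight" as the largest absolute eigenvector element are assumptions. Mantel's correlation is computed as the Pearson correlation between the two distance vectors; the permutation-based p-value is omitted.

```r
# Illustrative sketch of the variable selection loop (base R only).
# Assumptions: mtcars columns as data, Euclidean distance, average linkage.
X <- as.matrix(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")])

# Hypothetical helper: cophenetic correlation for a numeric matrix
coph_cor <- function(M) {
  d <- dist(M)
  hc <- hclust(d, method = "average")
  cor(as.vector(cophenetic(hc)), as.vector(d))
}

p <- ncol(X)
d_full <- dist(X)  # step 1: distances with all variables
res <- data.frame(excluded = "-", cophenetic = coph_cor(X),
                  remaining = p, cormantel = 1,
                  stringsAsFactors = FALSE)

Xk <- X
for (i in seq_len(p - 2)) {
  # step 2: eigen() returns eigenvalues in decreasing order, so the
  # last eigenvector belongs to the smallest eigenvalue
  eig <- eigen(cor(Xk))
  last_vec <- eig$vectors[, ncol(Xk)]
  # step 3: drop the variable with the largest absolute weight
  idx <- which.max(abs(last_vec))
  dropped <- colnames(Xk)[idx]
  Xk <- Xk[, -idx, drop = FALSE]
  # steps 4 and 5: cophenetic and Mantel correlations for the reduced set
  res <- rbind(res, data.frame(
    excluded = dropped,
    cophenetic = coph_cor(Xk),
    remaining = ncol(Xk),
    cormantel = cor(as.vector(dist(Xk)), as.vector(d_full)),
    stringsAsFactors = FALSE
  ))
}

res
# Model with the largest cophenetic correlation
best <- which.max(res$cophenetic)
```

Each row of res corresponds to one model, so the table can be scanned for the smallest variable set whose cophenetic and Mantel correlations are still acceptable.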
References
Mojena, R. 1977. Hierarchical grouping methods and stopping rules: an evaluation. Comput. J. 20:359-363. doi:10.1093/comjnl/20.4.359
Author
Tiago Olivoto tiagoolivoto@gmail.com
Examples
# \donttest{
library(metan)
# All rows and all numeric variables from data
d1 <- clustering(data_ge2)
# Based on the mean for each genotype
mean_gen <-
  data_ge2 %>%
  mean_by(GEN) %>%
  column_to_rownames("GEN")
d2 <- clustering(mean_gen)
# Select the variables used to compute the distances
d3 <- clustering(mean_gen, selvar = TRUE)
#> EH excluded in this step |=== | 7%
#> EP excluded in this step |======= | 14%
#> CDED excluded in this step |========== | 21%
#> PH excluded in this step |============== | 29%
#> CL excluded in this step |================= | 36%
#> NR excluded in this step |===================== | 43%
#> PERK excluded in this step |======================= | 50%
#> EL excluded in this step |=========================== | 57%
#> CD excluded in this step |=============================== | 64%
#> ED excluded in this step |================================== | 71%
#> KW excluded in this step |====================================== | 79%
#> CW excluded in this step |========================================= | 86%
#> NKR excluded in this step |============================================ | 93%
#> TKW excluded in this step |===============================================| 100%
#> --------------------------------------------------------------------------
#>
#> Summary of the adjusted models
#> --------------------------------------------------------------------------
#> Model excluded cophenetic remaining cormantel pvmantel
#> Model 1 - 0.8656190 15 1.0000000 0.000999001
#> Model 2 EH 0.8656191 14 1.0000000 0.000999001
#> Model 3 EP 0.8656191 13 1.0000000 0.000999001
#> Model 4 CDED 0.8656191 12 1.0000000 0.000999001
#> Model 5 PH 0.8656189 11 1.0000000 0.000999001
#> Model 6 CL 0.8655939 10 0.9999996 0.000999001
#> Model 7 NR 0.8656719 9 0.9999982 0.000999001
#> Model 8 PERK 0.8657259 8 0.9999977 0.000999001
#> Model 9 EL 0.8657904 7 0.9999972 0.000999001
#> Model 10 CD 0.8658997 6 0.9999964 0.000999001
#> Model 11 ED 0.8658274 5 0.9999931 0.000999001
#> Model 12 KW 0.8643556 4 0.9929266 0.000999001
#> Model 13 CW 0.8640355 3 0.9927593 0.000999001
#> Model 14 NKR 0.8648384 2 0.9925396 0.000999001
#> --------------------------------------------------------------------------
#> Suggested variables to be used in the analysis
#> --------------------------------------------------------------------------
#> The clustering was calculated with the Model 10
#> The variables included in this model were...
#> ED CW KW NKR TKW NKE
#> --------------------------------------------------------------------------
#>
# Compute the distances with standardized data
# Define 4 clusters
d4 <- clustering(data_ge,
                 by = ENV,
                 scale = TRUE,
                 nclust = 4)
# }