#' Identify studies contributing to heterogeneity patterns found in GOSH plots
#'
#' This function uses three unsupervised learning algorithms
#' (\emph{k}-means, DBSCAN and Gaussian Mixture Models) to identify studies
#' contributing to the heterogeneity-effect size patterns found in GOSH (graphic
#' display of study heterogeneity) plots.
#'
#' @usage gosh.diagnostics(data, km = TRUE, db = TRUE, gmm = TRUE,
#'     km.params = list(centers = 3,
#'                      iter.max = 10, nstart = 1,
#'                      algorithm = c("Hartigan-Wong",
#'                                    "Lloyd", "Forgy", "MacQueen"),
#'                      trace = FALSE),
#'     db.params = list(eps = 0.15, MinPts = 5,
#'                      method = c("hybrid", "raw", "dist")),
#'     gmm.params = list(G = NULL, modelNames = NULL,
#'                       prior = NULL, control = emControl(),
#'                       initialization = list(hcPairs = NULL,
#'                                             subset = NULL,
#'                                             noise = NULL),
#'                       Vinv = NULL,
#'                       warn = mclust.options("warn"),
#'                       x = NULL, verbose = FALSE),
#'     seed = 123,
#'     verbose = TRUE)
#'
#' @param data An object of class \code{gosh.rma} created through the
#' \code{\link[metafor]{gosh}} function.
#' @param km Logical. Should the \emph{k}-means algorithm be used to identify
#' patterns in the GOSH plot matrix? \code{TRUE} by default.
#' @param db Logical. Should the DBSCAN algorithm be used to identify patterns
#' in the GOSH plot matrix? \code{TRUE} by default.
#' @param gmm Logical. Should a bivariate Gaussian Mixture Model be used to
#' identify patterns in the GOSH plot matrix? \code{TRUE} by default.
#' @param km.params A list containing the parameters for the \emph{k}-means algorithm
#' as implemented in \code{\link[stats]{kmeans}}. Run \code{?kmeans} for further details.
#' @param db.params A list containing the parameters for the DBSCAN algorithm
#' as implemented in \code{\link[fpc]{dbscan}}. Run \code{?fpc::dbscan} for further details.
#' @param gmm.params A list containing the parameters for the Gaussian Mixture Models
#' as implemented in \code{\link[mclust]{mclustBIC}}. Run \code{?mclust::mclustBIC} for further details.
#' @param seed Seed used for reproducibility. Default is \code{123}.
#' @param verbose Logical. Should a progress bar be printed in the console
#' during clustering? \code{TRUE} by default.
#'
#' @details
#'
#' \strong{GOSH Plots}
#'
#' GOSH (\emph{graphic display of study heterogeneity}) plots were proposed by
#' Olkin, Dahabreh and Trikalinos (2012) as a diagnostic plot to assess effect
#' size heterogeneity. GOSH plots facilitate the detection of both (i) outliers
#' and (ii) distinct homogeneous subgroups within the modeled data.
#'
#' Data for the plots is generated by fitting a random-effects model with the
#' same specifications as in the meta-analysis to all \eqn{2^k-1} possible
#' non-empty subsets of the \eqn{k} studies in an analysis, provided
#' \eqn{2^k-1 \leq 10^6}. For \eqn{2^k-1 > 10^6}, the
#' \code{\link[metafor]{gosh}} function randomly samples 1 million subsets
#' and uses those for model fitting.
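#'
#' As an illustration, GOSH plot data can be generated like so (a minimal
#' sketch using \code{metafor}'s built-in \code{dat.bcg} data; the
#' \code{subsets} value is an arbitrary choice for speed):
#'
#' \preformatted{
#' library(metafor)
#' dat <- escalc(measure = "RR", ai = tpos, bi = tneg,
#'               ci = cpos, di = cneg, data = dat.bcg)
#' fit <- rma(yi, vi, data = dat)
#' sav <- gosh(fit, subsets = 1000)  # object of class 'gosh.rma'
#' }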
#'
#' \strong{GOSH Plot Diagnostics}
#'
#' Although GOSH plots allow users to detect heterogeneity patterns and
#' distinct subgroups within the data, determining which studies contribute to
#' a certain subgroup or pattern is often difficult or computationally
#' intensive. To facilitate the detection of studies responsible for specific
#' patterns within the GOSH plots, this function randomly samples \eqn{10^4}
#' data points from the GOSH plot data (to speed up computation). Of these data
#' points, only the \eqn{z}-transformed \eqn{I^2} and effect size values are
#' used (as the other heterogeneity metrics produced for the GOSH plot data by
#' the \code{\link[metafor]{gosh}} function are linear combinations of
#' \eqn{I^2}). Three clustering algorithms are then applied to this data:
#'
#' \itemize{
#' \item The first algorithm is \emph{k}-means clustering using the algorithm
#' by Hartigan & Wong (1979) and three cluster centers by default. The function
#' uses the \code{\link[stats]{kmeans}} implementation to perform
#' \emph{k}-means clustering.
#' \item As \emph{k}-means does not perform well in the presence of arbitrarily
#' shaped subclusters and noise, the function also applies \strong{DBSCAN}
#' (\emph{density-based spatial clustering of applications with noise};
#' Schubert et al., 2017). The hyperparameters \eqn{\epsilon} and \eqn{MinPts}
#' can be tuned for each analysis to maintain a reasonable amount of
#' granularity while not producing too many subclusters. The function uses the
#' \code{\link[fpc]{dbscan}} implementation to perform the DBSCAN clustering.
#' \item Lastly, as a probabilistic model-based clustering approach, Gaussian
#' Mixture Models (GMM; Fraley & Raftery, 2002) are integrated into the
#' function using an internal call to the \code{\link[mclust]{mclustBIC}}
#' implementation. Clustering hyperparameters can be tuned by providing a list
#' of parameters of the \code{mclustBIC} function in the \code{mclust} package,
#' as sketched below.
#' }
#'
#' To assess which studies predominantly contribute to a detected cluster, the
#' function calculates the cluster imbalance of each study as the difference
#' between (i) the expected share of subsets containing that study if the
#' cluster makeup were purely random (viz., representative of the full sample)
#' and (ii) the actual share of subsets containing that study within the
#' cluster. Cook's distance for each study is then calculated based on an
#' intercept-only linear model of these differences to determine the leverage
#' of each study on a cluster's makeup. Studies with a Cook's distance more
#' than three times the mean in any of the generated clusters (for any of the
#' clustering algorithms used) are returned as potentially influential cases,
#' and the GOSH plot is redrawn with these studies highlighted.
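#'
#' All tuning parameters are passed as lists; parameters that are omitted fall
#' back to the defaults of the underlying implementations. A minimal sketch
#' (the values shown are illustrative, not recommendations):
#'
#' \preformatted{
#' data("m.gosh")
#' res <- gosh.diagnostics(m.gosh,
#'                         km.params = list(centers = 4),
#'                         db.params = list(eps = 0.1, MinPts = 10),
#'                         gmm.params = list(G = 1:4))
#' }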
#'
#' @references
#' Fraley, C. & Raftery, A. E. (2002). Model-Based Clustering, Discriminant Analysis,
#' and Density Estimation. \emph{Journal of the American Statistical Association, 97} (458). 611–631.
#'
#' Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A K-Means Clustering Algorithm.
#' \emph{Journal of the Royal Statistical Society. Series C (Applied Statistics), 28} (1). 100–108.
#'
#' Olkin, I., Dahabreh, I. J., Trikalinos, T. A. (2012). GOSH–a Graphical Display of Study Heterogeneity.
#' \emph{Research Synthesis Methods 3}, (3). 214–23.
#'
#' Schubert, E., Sander, J., Ester, M., Kriegel, H. P. & Xu, X. (2017). DBSCAN Revisited, Revisited:
#' Why and How You Should (Still) Use DBSCAN. \emph{ACM Transactions on Database Systems (TODS) 42}, (3). ACM: 19.
#'
#' @author Mathias Harrer & David Daniel Ebert
#'
#' @import grid gridExtra ggplot2
#' @importFrom fpc dbscan
#' @importFrom stats kmeans cooks.distance lm complete.cases
#' @importFrom mclust mclustBIC Mclust emControl mclust.options predict.Mclust
#' @importFrom grDevices rainbow
#'
#' @export gosh.diagnostics
#'
#' @seealso \code{\link{InfluenceAnalysis}}
#'
#' @examples
#' # Example: load gosh data (created with metafor's 'gosh' function),
#' # then use function
#' \dontrun{
#' data("m.gosh")
#' res <- gosh.diagnostics(m.gosh)
#'
#' # Look at results
#' summary(res)
#'
#' # Plot detected clusters
#' plot(res, which = "cluster")
#'
#' # Plot outliers
#' plot(res, which = "outlier")
#' }

gosh.diagnostics = function(data, km = TRUE, db = TRUE, gmm = TRUE,
                            km.params = list(centers = 3,
                                             iter.max = 10, nstart = 1,
                                             algorithm = c("Hartigan-Wong",
                                                           "Lloyd", "Forgy", "MacQueen"),
                                             trace = FALSE),
                            db.params = list(eps = 0.15, MinPts = 5,
                                             method = c("hybrid", "raw", "dist")),
                            gmm.params = list(G = NULL, modelNames = NULL,
                                              prior = NULL, control = emControl(),
                                              initialization = list(hcPairs = NULL,
                                                                    subset = NULL,
                                                                    noise = NULL),
                                              Vinv = NULL,
                                              warn = mclust.options("warn"),
                                              x = NULL, verbose = FALSE),
                            seed = 123, verbose = TRUE) {

  # Copy input and rename the algorithm switches so their names do not
  # collide with the 'km', 'db' and 'gmm' objects created during clustering
  sav = data
  do.km = km; rm(km)
  do.db = db; rm(db)
  do.gmm = gmm; rm(gmm)

  # Set seed for reproducibility
  set.seed(seed)

  # Check input
  if (!inherits(data, "gosh.rma")) {
    stop("Argument 'data' provided does not have class 'gosh.rma'.")
  }

  if (!do.km & !do.db & !do.gmm) {
    stop("At least one of 'km', 'db', or 'gmm' must be set to TRUE.")
  }

  # Start progress bar
  if (verbose == TRUE) {
    cat(" ", "\n", "Perform Clustering...", "\n")
    cat(" |")
  }

  # Create full dataset from the gosh output: model statistics plus
  # study inclusion indicators, with incomplete fits removed
  dat.full = sav$res[complete.cases(sav$res), ]
  dat.full = cbind(dat.full, sav$incl[complete.cases(sav$res), ])

  # Create z-standardized dataset for k-means
  dat.km = data.frame(scale(dat.full$I2, center = TRUE, scale = TRUE),
                      scale(dat.full$estimate, center = TRUE, scale = TRUE))
  colnames(dat.km) = c("I2", "estimate")

  # Create dataset for DBSCAN.
  # DBSCAN can become too computationally intensive for very large GOSH data,
  # so for more than 10,000 rows, 10,000 data points are randomly sampled.
  if (nrow(dat.full) < 10000) {
    dat.db.full = dat.full
  } else {
    dat.db.full = dat.full[sample(1:nrow(dat.full), 10000), ]  # Sample 10,000 rows
  }

  dat.db = data.frame(scale(dat.db.full$I2, center = TRUE, scale = TRUE),
                      scale(dat.db.full$estimate, center = TRUE, scale = TRUE))
  colnames(dat.db) = c("I2", "estimate")

  if (verbose == TRUE) {
    cat("==========")
  }

  # K-means
  km.params$x = dat.km
  km = do.call(stats::kmeans, km.params)

  # Only use 5000 rows for plotting to increase speed
  if (length(as.numeric(km$cluster)) > 5000) {
    km.plot.mask = sample(1:length(as.numeric(km$cluster)), 5000)
    km.plot = km
    km.plot$cluster = km$cluster[km.plot.mask]
    dat.km.plot = dat.km[km.plot.mask, ]
  } else {
    km.plot = km
    dat.km.plot = dat.km
  }

  levels.km = unique(km.plot$cluster)[order(unique(km.plot$cluster))]
  dat.km.plot$cluster = factor(km.plot$cluster, levels = levels.km)

  km.clusterplot = ggplot(data = dat.km.plot,
                          aes(x = estimate, y = I2, color = cluster)) +
    geom_point(cex = 0.5, alpha = 0.8) +
    ylab(expression(italic(I)^2~(z-score))) +
    xlab("Effect Size (z-score)") +
    theme_minimal() +
    ggtitle("K-means Algorithm") +
    labs(color = "Cluster")

  # DBSCAN
  db.params$data = dat.db
  db = do.call(fpc::dbscan, db.params)

  # Only use 5000 rows for plotting to increase speed
  if (length(as.numeric(db$cluster)) > 5000) {
    db.plot.mask = sample(1:length(as.numeric(db$cluster)), 5000)
    db.plot = db
    db.plot$cluster = db$cluster[db.plot.mask]
    dat.db.plot = dat.db[db.plot.mask, ]
  } else {
    db.plot = db
    dat.db.plot = dat.db
  }

  if (verbose == TRUE) {
    cat("==========")
  }

  levels.db = unique(db.plot$cluster)[order(unique(db.plot$cluster))]
  dat.db.plot$cluster = factor(db.plot$cluster, levels = levels.db)
  # DBSCAN labels noise points as cluster 0; relabel them as outliers
  levels(dat.db.plot$cluster)[levels(dat.db.plot$cluster) == "0"] = "Outlier"

  color.db = rainbow(nlevels(dat.db.plot$cluster))
  color.db[1] = "#000000"

  db.clusterplot = ggplot(data = dat.db.plot,
                          aes(x = estimate, y = I2, color = cluster)) +
    geom_point(cex = 0.5, alpha = 0.7) +
    ylab(expression(italic(I)^2~(z-score))) +
    xlab("Effect Size (z-score)") +
    theme_minimal() +
    ggtitle("DBSCAN Algorithm (black dots are outliers)") +
    scale_color_manual(values = color.db) +
    labs(color = "Cluster")

  # GMM: use the same (possibly subsampled) data as used for DBSCAN
  dat.gmm.full = dat.db.full
  dat.gmm = dat.db
  gmm.params$data = dat.gmm

  # Search for the optimal solution via BIC, then fit the selected model
  gmm.bic = do.call(mclust::mclustBIC, gmm.params)
  gmm = mclust::Mclust(data = dat.gmm, x = gmm.bic)

  # Only use 5000 rows for plotting to increase speed
  if (length(as.numeric(gmm$classification)) > 5000) {
    gmm.plot.mask = sample(1:length(as.numeric(gmm$classification)), 5000)
    dat.gmm.plot = dat.gmm[gmm.plot.mask, ]
    dat.gmm.plot$cluster = predict.Mclust(gmm)$classification[gmm.plot.mask]
  } else {
    dat.gmm.plot = dat.gmm
    dat.gmm.plot$cluster = predict.Mclust(gmm)$classification
  }

  if (verbose == TRUE) {
    cat("==========")
  }

  levels.gmm = unique(dat.gmm.plot$cluster)[order(unique(dat.gmm.plot$cluster))]
  dat.gmm.plot$cluster = factor(dat.gmm.plot$cluster, levels = levels.gmm)

  gmm.clusterplot = ggplot(data = dat.gmm.plot,
                           aes(x = estimate, y = I2, color = cluster)) +
    geom_point(cex = 0.5, alpha = 0.8) +
    ylab(expression(italic(I)^2~(z-score))) +
    xlab("Effect Size (z-score)") +
    theme_minimal() +
    ggtitle("Gaussian Mixture Model") +
    labs(color = "Cluster")

  if (verbose == TRUE) {
    cat("==========")
  }

  # Add cluster assignments to the full datasets
  dat.km.full = dat.full
  dat.km.full$cluster = km$cluster
  dat.db.full$cluster = db$cluster
  dat.gmm.full$cluster = gmm$classification
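  # What follows computes, for each clustering solution, the "cluster
  # imbalance" described in the documentation: for every study, the share of
  # that study's subsets falling into a given cluster is compared against the
  # mean share across studies (a proxy for the share expected if cluster
  # membership were random). Cook's distances from an intercept-only linear
  # model of these differences then quantify each study's leverage on the
  # cluster makeup.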
  ####################################################
  # Extract the percentages: k-means                 #
  ####################################################

  dat.km.full$cluster = as.factor(dat.km.full$cluster)
  n.levels.km = nlevels(dat.km.full$cluster)

  # Total n of subsets containing each study
  dat.km.full.total = dat.km.full[, -c(1:6, ncol(dat.km.full))]
  n.cluster.tots = apply(dat.km.full.total, 2, sum)
  n.cluster.tots = data.frame(unlist(as.matrix(n.cluster.tots)))
  colnames(n.cluster.tots) = c("n.tots")

  if (verbose == TRUE) {
    cat("==========")
  }

  # Cluster-wise n of subsets containing each study
  n = sapply(split(dat.km.full.total, dat.km.full$cluster),
             function(x) apply(x, 2, sum))

  # Calculate percentage differences (cluster imbalance)
  deltas = as.data.frame(apply(n, 2, function(x) (x / n.cluster.tots$n.tots) -
                                 mean(x / n.cluster.tots$n.tots)))

  # Generate the plot
  Cluster = factor(rep(1:n.levels.km, each = nrow(deltas)))
  Study = rep(1:nrow(deltas), n.levels.km)
  Delta_Percentage = unlist(deltas)
  delta.df = data.frame(Cluster, Delta_Percentage, Study)

  km.plot = ggplot(data = delta.df,
                   aes(x = Study, y = Delta_Percentage, group = Cluster)) +
    geom_line(aes(color = Cluster)) +
    geom_point(aes(color = Cluster)) +
    scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
    scale_y_continuous(name = "Delta Percentage") +
    theme(axis.text = element_text(size = 5)) +
    ggtitle("Cluster imbalance (K-Means algorithm)") +
    geom_hline(yintercept = 0, linetype = "dashed") +
    theme_minimal()

  ####################################################
  # Cook's distance plot: k-means                    #
  ####################################################

  m.cd.km = by(delta.df, as.factor(delta.df$Cluster),
               function(x) lm(Delta_Percentage ~ 1, data = x))
  m.cd.km = lapply(m.cd.km, cooks.distance)
  m.cd.km.df = data.frame(Cooks.Distance = matrix(unlist(m.cd.km)))
  m.cd.km.df$Cluster = as.factor(rep(1:(n.levels.km), each = nrow(deltas)))
  m.cd.km.df$Study = rep(1:nrow(deltas), times = (n.levels.km))
  outlier.cd.km = 3 * mean(m.cd.km.df$Cooks.Distance)

  # With only two clusters, the Cook's distance profiles of both clusters
  # coincide; a small offset is added so that both lines remain visible
  if (n.levels.km <= 2) {
    m.cd.km.df[m.cd.km.df$Cluster == "2", "Cooks.Distance"] =
      m.cd.km.df[m.cd.km.df$Cluster == "2", "Cooks.Distance"] + 0.01
  }

  km.cd.plot = ggplot(data = m.cd.km.df,
                      aes(x = Study, y = Cooks.Distance, group = Cluster)) +
    geom_line(aes(color = Cluster), alpha = 0.5) +
    geom_point(aes(color = Cluster)) +
    scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
    scale_y_continuous(name = "Cook's Distance") +
    theme(axis.text = element_text(size = 5)) +
    ggtitle("Cluster imbalance (Cook's Distance)") +
    geom_hline(yintercept = outlier.cd.km, linetype = "dashed") +
    geom_hline(yintercept = 0, linetype = "dashed") +
    theme_minimal()

  if (verbose == TRUE) {
    cat("==========")
  }
  ####################################################
  # Extract the percentages: DBSCAN                  #
  ####################################################

  dat.db.full$cluster = as.factor(dat.db.full$cluster)
  n.levels.db = nlevels(dat.db.full$cluster)

  # Total n of subsets containing each study
  dat.db.full.total = dat.db.full[, -c(1:6, ncol(dat.db.full))]
  n.cluster.tots = apply(dat.db.full.total, 2, sum)
  n.cluster.tots = data.frame(unlist(as.matrix(n.cluster.tots)))
  colnames(n.cluster.tots) = c("n.tots")

  # Cluster-wise n of subsets containing each study
  n = sapply(split(dat.db.full.total, dat.db.full$cluster),
             function(x) apply(x, 2, sum))

  # Calculate percentage differences (cluster imbalance)
  deltas = as.data.frame(apply(n, 2, function(x) (x / n.cluster.tots$n.tots) -
                                 mean(x / n.cluster.tots$n.tots)))

  # Generate the plot
  Cluster = factor(rep(colnames(deltas), each = nrow(deltas)))
  Study = rep(1:nrow(deltas), n.levels.db)
  Delta_Percentage = unlist(deltas)
  delta.df = data.frame(Cluster, Delta_Percentage, Study)
  delta.df = delta.df[delta.df$Cluster != 0, ]  # Remove the zero class (outliers)

  db.plot = ggplot(data = delta.df,
                   aes(x = Study, y = Delta_Percentage, group = Cluster)) +
    geom_line(aes(color = Cluster)) +
    geom_point(aes(color = Cluster)) +
    scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
    scale_y_continuous(name = "Delta Percentage") +
    theme(axis.text = element_text(size = 5)) +
    ggtitle("Cluster imbalance (Density-Based Clustering)") +
    geom_hline(yintercept = 0, linetype = "dashed") +
    theme_minimal()

  if (verbose == TRUE) {
    cat("==========")
  }

  ####################################################
  # Cook's distance plot: DBSCAN                     #
  ####################################################

  m.cd.db = by(delta.df, as.factor(delta.df$Cluster),
               function(x) lm(Delta_Percentage ~ 1, data = x))
  m.cd.db$`0` = NULL  # Drop the (now empty) outlier class
  m.cd.db = lapply(m.cd.db, cooks.distance)
  m.cd.db.df = data.frame(Cooks.Distance = matrix(unlist(m.cd.db)))
  m.cd.db.df$Cluster = as.factor(rep(1:(n.levels.db - 1), each = nrow(deltas)))
  m.cd.db.df$Study = rep(1:nrow(deltas), times = (n.levels.db - 1))
  outlier.cd.db = 3 * mean(m.cd.db.df$Cooks.Distance)

  # With only two clusters, the Cook's distance profiles of both clusters
  # coincide; a small offset is added so that both lines remain visible
  if (n.levels.db <= 2) {
    m.cd.db.df[m.cd.db.df$Cluster == "2", "Cooks.Distance"] =
      m.cd.db.df[m.cd.db.df$Cluster == "2", "Cooks.Distance"] + 0.01
  }

  db.cd.plot = ggplot(data = m.cd.db.df,
                      aes(x = Study, y = Cooks.Distance, group = Cluster)) +
    geom_line(aes(color = Cluster), alpha = 0.5) +
    geom_point(aes(color = Cluster)) +
    scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
    scale_y_continuous(name = "Cook's Distance") +
    theme(axis.text = element_text(size = 5)) +
    ggtitle("Cluster imbalance (Cook's Distance)") +
    geom_hline(yintercept = outlier.cd.db, linetype = "dashed") +
    geom_hline(yintercept = 0, linetype = "dashed") +
    theme_minimal()

  if (verbose == TRUE) {
    cat("==========")
  }

  ####################################################
  # Extract the percentages: GMM                     #
  ####################################################

  dat.gmm.full$cluster = as.factor(dat.gmm.full$cluster)
  n.levels.gmm = nlevels(dat.gmm.full$cluster)

  # Total n of subsets containing each study
  dat.gmm.full.total = dat.gmm.full[, -c(1:6, ncol(dat.gmm.full))]
  n.cluster.tots = apply(dat.gmm.full.total, 2, sum)
  n.cluster.tots = data.frame(unlist(as.matrix(n.cluster.tots)))
  colnames(n.cluster.tots) = c("n.tots")

  # Cluster-wise n of subsets containing each study
  n = sapply(split(dat.gmm.full.total, dat.gmm.full$cluster),
             function(x) apply(x, 2, sum))
  # Calculate percentage differences (cluster imbalance)
  deltas = as.data.frame(apply(n, 2, function(x) (x / n.cluster.tots$n.tots) -
                                 mean(x / n.cluster.tots$n.tots)))

  # Generate the plot
  Cluster = factor(rep(colnames(deltas), each = nrow(deltas)))
  Study = rep(1:nrow(deltas), n.levels.gmm)
  Delta_Percentage = unlist(deltas)
  delta.df = data.frame(Cluster, Delta_Percentage, Study)

  gmm.plot = ggplot(data = delta.df,
                    aes(x = Study, y = Delta_Percentage, group = Cluster)) +
    geom_line(aes(color = Cluster)) +
    geom_point(aes(color = Cluster)) +
    scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
    scale_y_continuous(name = "Delta Percentage") +
    theme(axis.text = element_text(size = 5)) +
    ggtitle("Cluster imbalance (GMM)") +
    geom_hline(yintercept = 0, linetype = "dashed") +
    theme_minimal()

  ####################################################
  # Cook's distance plot: GMM                        #
  ####################################################

  m.cd.gmm = by(delta.df, as.factor(delta.df$Cluster),
                function(x) lm(Delta_Percentage ~ 1, data = x))
  m.cd.gmm = lapply(m.cd.gmm, cooks.distance)
  m.cd.gmm.df = data.frame(Cooks.Distance = matrix(unlist(m.cd.gmm)))
  m.cd.gmm.df$Cluster = as.factor(rep(1:(n.levels.gmm), each = nrow(deltas)))
  m.cd.gmm.df$Study = rep(1:nrow(deltas), times = (n.levels.gmm))
  outlier.cd.gmm = 3 * mean(m.cd.gmm.df$Cooks.Distance)

  # With only two clusters, the Cook's distance profiles of both clusters
  # coincide; a small offset is added so that both lines remain visible
  if (n.levels.gmm <= 2) {
    m.cd.gmm.df[m.cd.gmm.df$Cluster == "2", "Cooks.Distance"] =
      m.cd.gmm.df[m.cd.gmm.df$Cluster == "2", "Cooks.Distance"] + 0.01
  }

  gmm.cd.plot = ggplot(data = m.cd.gmm.df,
                       aes(x = Study, y = Cooks.Distance, group = Cluster)) +
    geom_line(aes(color = Cluster), alpha = 0.5) +
    geom_point(aes(color = Cluster)) +
    scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
    scale_y_continuous(name = "Cook's Distance") +
    theme(axis.text = element_text(size = 5)) +
    ggtitle("Cluster imbalance (Cook's Distance)") +
    geom_hline(yintercept = outlier.cd.gmm, linetype = "dashed") +
    geom_hline(yintercept = 0, linetype = "dashed") +
    theme_minimal()

  #######################
  # Generate output plot#
  #######################

  returnlist = list()

  if (do.km == TRUE) {
    returnlist$km.clusters = n.levels.km
    km.sideplots = gridExtra::arrangeGrob(km.plot, km.cd.plot, nrow = 2)
    returnlist$km.plot = gridExtra::arrangeGrob(km.clusterplot, km.sideplots, ncol = 2)
  }

  if (do.db == TRUE) {
    returnlist$db.clusters = n.levels.db - 1
    db.sideplots = gridExtra::arrangeGrob(db.plot, db.cd.plot, nrow = 2)
    returnlist$db.plot = gridExtra::arrangeGrob(db.clusterplot, db.sideplots, ncol = 2)
  }

  if (do.gmm == TRUE) {
    returnlist$gmm.clusters = n.levels.gmm
    gmm.sideplots = gridExtra::arrangeGrob(gmm.plot, gmm.cd.plot, nrow = 2)
    returnlist$gmm.plot = gridExtra::arrangeGrob(gmm.clusterplot, gmm.sideplots, ncol = 2)
  }

  ############################################
  # Plot GOSH for potential outlying studies #
  ############################################
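  # Outlier identification rule: a study is flagged when its Cook's distance
  # exceeds three times the mean Cook's distance for the respective algorithm;
  # the union of studies flagged by the three algorithms is visualized below.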
  # Get outlying studies

  # k-means
  outlier.studies.km.df = m.cd.km.df[m.cd.km.df$Cooks.Distance > outlier.cd.km, ]
  outlier.studies.km = unique(outlier.studies.km.df$Study)

  # DBSCAN
  outlier.studies.db.df = m.cd.db.df[m.cd.db.df$Cooks.Distance > outlier.cd.db, ]
  outlier.studies.db = unique(outlier.studies.db.df$Study)

  # GMM
  outlier.studies.gmm.df = m.cd.gmm.df[m.cd.gmm.df$Cooks.Distance > outlier.cd.gmm, ]
  outlier.studies.gmm = unique(outlier.studies.gmm.df$Study)

  # Use all identified outliers
  outlier.studies.all = unique(c(outlier.studies.km,
                                 outlier.studies.db,
                                 outlier.studies.gmm))
  # Add 6 to skip the model statistic columns and index the inclusion columns
  outlier.studies.all.mask = outlier.studies.all + 6

  # Get plotting dataset; only choose outlier studies as mask, use db data
  if (length(as.numeric(db$cluster)) > 5000) {
    dat.all.outliers = dat.db.full[db.plot.mask, c(3, 6, outlier.studies.all.mask)]
  } else {
    dat.all.outliers = dat.db.full[, c(3, 6, outlier.studies.all.mask)]
  }

  if (length(outlier.studies.all) > 0) {

    # Loop through all identified outliers and redraw the GOSH plot,
    # highlighting the subsets that contain the flagged study
    for (i in 1:length(outlier.studies.all)) {

      outlier.plot = ggplot(data = dat.all.outliers,
                            aes(x = estimate, y = I2,
                                color = dat.all.outliers[, i + 2])) +
        geom_point(alpha = 0.8) +
        scale_color_manual(values = c("lightgrey", "#00BFC4")) +
        theme_minimal() +
        theme(legend.position = "none",
              plot.title = element_text(hjust = 0.5, face = "bold")) +
        xlab("Effect Size") +
        ylab("I-squared")

      # Marginal density of the effect size (top panel)
      density.db.upper = ggplot(data = dat.all.outliers,
                                aes(x = estimate, fill = dat.all.outliers[, i + 2])) +
        geom_density(alpha = 0.5) +
        theme_classic() +
        theme(axis.title.x = element_blank(),
              axis.text.x = element_blank(),
              axis.ticks = element_blank(),
              legend.position = "none",
              plot.background = element_blank(),
              axis.line.x = element_blank(),
              axis.title.y = element_text(color = "white"),
              axis.text.y = element_text(color = "white"),
              axis.line.y = element_line(color = "white")) +
        scale_fill_manual(values = c("lightgrey", "#00BFC4"))

      # Empty placeholder for the upper-right corner of the grid
      blankPlot = ggplot() + geom_blank(aes(1, 1)) +
        theme(plot.background = element_blank(),
              panel.grid.major = element_blank(),
              panel.grid.minor = element_blank(),
              panel.border = element_blank(),
              panel.background = element_blank(),
              axis.title.x = element_blank(),
              axis.title.y = element_blank(),
              axis.text.x = element_blank(),
              axis.text.y = element_blank(),
              axis.ticks = element_blank(),
              axis.line.x = element_blank(),
              axis.line.y = element_blank())

      # Marginal density of I-squared (right panel)
      density.db.right = ggplot(data = dat.all.outliers,
                                aes(x = I2, fill = dat.all.outliers[, i + 2])) +
        geom_density(alpha = 0.5) +
        theme_classic() +
        theme(axis.title.y = element_blank(),
              axis.text.y = element_blank(),
              axis.ticks = element_blank(),
              legend.position = "none",
              plot.background = element_blank(),
              axis.line.y = element_blank(),
              axis.title.x = element_text(color = "white"),
              axis.text.x = element_text(color = "white"),
              axis.line.x = element_line(color = "white")) +
        scale_fill_manual(values = c("lightgrey", "#00BFC4")) +
        coord_flip()

      returnlist[[paste0("plot.study", outlier.studies.all[i], ".removed")]] =
        gridExtra::arrangeGrob(density.db.upper, blankPlot,
                               outlier.plot, density.db.right,
                               nrow = 2, ncol = 2,
                               heights = c(1, 5), widths = c(4, 1),
                               top = paste("Study ", outlier.studies.all[i]))
    }
  }

  if (do.km == TRUE) {
    returnlist$outlier.studies.km = outlier.studies.km
  }
  if (do.db == TRUE) {
    returnlist$outlier.studies.db = outlier.studies.db
  }
  if (do.gmm == TRUE) {
    returnlist$outlier.studies.gmm = outlier.studies.gmm
  }

  if (verbose == TRUE) {
    cat("==========| DONE \n")
    cat("\n")
  }

  class(returnlist) = c("gosh.diagnostics", "list")
  return(returnlist)
}
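
# A minimal follow-up sketch (not part of the function itself): the IDs of
# studies flagged by each algorithm are returned in the result list, e.g.
#
#   res <- gosh.diagnostics(m.gosh)
#   res$outlier.studies.km    # flagged by k-means
#   res$outlier.studies.db    # flagged by DBSCAN
#   res$outlier.studies.gmm   # flagged by GMM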