--- title: "Why data visualization?" author: Abhijit Dasgupta date: BIOF 339 --- ```{r setup, include=F, child = here::here('slides/templates/setup.Rmd')} ``` ```{r, echo=FALSE, results='asis'} update_header('## The failed promise of data summaries') ``` --- It is tempting to try and summarize data just by some data summaries, like means, medians or standard deviations. Summaries, by their very nature, compress information Which means .heatinline[some information gets thrown out] -- Visualization gives us a way to see patterns in the data that would not be obvious from data summaries It also allows us to use our natural visual ability of pattern recognition to understand our data --- ### Anscombe's data .pull-left[ ```{r 05-Plotting-5, echo=F} pacman::p_load(datasauRus) d <- anscombe %>% mutate(ID = 1:n()) %>% tidyr::gather(variable, value, -ID) %>% separate(variable, c('dim','data'), sep=1) %>% mutate(data = paste0('dataset', data)) %>% spread(dim, value) %>% arrange(data) ggplot(d, aes(x, y))+ geom_point() + facet_wrap(~data) + theme_bw() + labs(caption = "Anscombe, 1973") ``` ] .pull-right[ | Statistic | Value | |:----------|------:| | `mean(x)` | 9 | | `mean(y)` | 7.5 | | `var(x)` | 11 | | `var(y)` | 4.13 | | `cor(x,y)` | 0.82 | The variables for each data set have the same values of data summaries ] --- ### The [DataSaurus](https://www.autodeskresearch.com/publications/samestats) dozen .pull-left[ ```{r 05-Plotting-6, echo=F} ggplot(datasaurus_dozen, aes(x=x, y=y, colour=dataset))+ geom_point()+ theme_void()+ theme(legend.position = "none")+ facet_wrap(~dataset, ncol=3)+ labs(caption="Matejka & Fitzmaurice, 2017") ``` ] .pull-right[ | Statistic | Value | |:----------|------:| | `mean(x)` | 54.3 | | `mean(y)` | 47.8 | | `var(x)` | 281 | | `var(y)` | 725 | | `cor(x,y)` | -0.07 | The variables for each data set have the same values of data summaries ] --- .left-column30[ A single point can completely change the computed correlation ] .right-column70[ ![](anim/corr1.gif) ] --- Data summaries are meant to help distinguish between different data sets Both Anscombe and Datasaurus show that this promise is not met by standard data summaries The previous example shows how a single point can change data summaries --- layout: true
BIOF 339: Practical R
--- ## Why visualize data? - Summary statistics cannot always distinguish datasets - Take advantage of humans' ability to visually recognize and remember patterns - Find discrepancies in the data more easily --- class: middle, center # Some examples --- ```{r, echo=FALSE, results='asis'} update_header('## Gallery') ``` --- ```{r 05-Plotting-7, echo = F, fig.align='center'} # install.packages('ggpubr') pacman::p_load(ggpubr) data("ToothGrowth") df <- ToothGrowth p <- ggboxplot(df, x = "dose", y = "len", color = "dose", palette =c("#00AFBB", "#E7B800", "#FC4E07"), add = "jitter", shape = "dose") my_comparisons <- list( c("0.5", "1"), c("1", "2"), c("0.5", "2") ) p + stat_compare_means(comparisons = my_comparisons)+ # Add pairwise comparisons p-value stat_compare_means(label.y = 50) ``` ??? This is a typical plot in scientific journals --- ### Kaplan-Meier plots ```{r km1, eval = T, echo = F} pacman::p_load(survminer) pacman::p_load(survival) fit <- survfit(Surv(time, status) ~ sex, data = lung) ggsurvplot(fit, pval = TRUE, conf.int = TRUE, risk.table = TRUE, risk.table.y.text.col = TRUE) ``` --- ```{r 05-Plotting-8, echo = F, fig.align = 'center'} pacman::p_load('cowplot') p1 <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) + geom_point() + facet_grid(. ~ Species) + stat_smooth(method = "lm") + background_grid(major = 'y', minor = "none") + panel_border() + theme(legend.position = "none") # plot B p2 <- ggplot(iris, aes(Sepal.Length, fill = Species)) + geom_density(alpha = .7) + theme(legend.justification = "top") p2a <- p2 + theme(legend.position = "none") # plot C p3 <- ggplot(iris, aes(Sepal.Width, fill = Species)) + geom_density(alpha = .7) + theme(legend.position = "none") # legend legend <- get_legend(p2) # align all plots vertically plots <- cowplot::align_plots(p1, p2a, p3, align = 'v', axis = 'l') # put together bottom row and then everything bottom_row <- cowplot::plot_grid(plots[[2]], plots[[3]], legend, labels = c("B", "C"), rel_widths = c(1, 1, .3), nrow = 1) cowplot::plot_grid(plots[[1]], bottom_row, labels = c("A"), ncol = 1) detach('package:cowplot') ``` ??? We can put ggplot figures together in a panel with some annotations very easily using the cowplot package. These graphs can be cleaned up some. --- ```{r 05-Plotting-9, echo = F, fig.align="center"} # pacman::p_load pacman::p_load(ggridges) pacman::p_load(ggplot2) # basic example ggplot(diamonds, aes(x = price, y = cut, fill = cut)) + geom_density_ridges() + theme_ridges() + scale_x_continuous(labels = scales::label_dollar())+ theme(legend.position = "none") ``` ??? This is a plot of the diamonds dataset that comes with ggplot2 --- ### Manhattan plot ```{r 05-Plotting-10, echo = F, fig.align="center", out.height = "500px", out.width="600px", message = F} pacman::p_load(qqman) manhattan(gwasResults, annotatePval = 0.01) detach('package:qqman') ``` ??? Manhattan plots are often used in GWAS studies. You can customize the annotations and the line for the significance levels --- ### Circular Manhattan plot ```{r cmplot, include=F} #install.packages('CMplot') pacman::p_load(CMplot) data(pig60K) CMplot(pig60K, plot.type="c", chr.labels=paste("Chr",c(1:18,"X", 'Y'),sep=""), r=0.4, cir.legend=TRUE, outward=FALSE, cir.legend.col="black", cir.chr.h=1.3 ,chr.den.col="black", file="jpg", memo="", dpi=300) detach('package:CMplot') ``` ```{r 05-Plotting-11, echo=F, out.height="500", out.width="500"} knitr::include_graphics(here::here('slides/lectures/week3','img/Circular-Manhattan.trait1.trait2.trait3.jpg')) ``` ??? This gives a different representation of the manhattan plot. This example looks at three traits simultaneously --- ### Maps .pull-left[ ```{r map2, echo=F, eval=FALSE} pacman::p_load(tmap) tmap_mode('view') #<< mp2 <- tm_shape(spData::world) + tm_polygons(col = 'pop', style='jenks', palette='Reds', title='Population', popup.vars = c('pop'), id='name_long') tmap_save(mp2, file = here::here('docs/slides/lectures/week3/img/map2.html')) ``` ] .pull-right[ ```{r map3, echo=F, eval=F, cache=F} pacman::p_load(leaflet) pacman::p_load(leaflet.providers) pacman::p_load(htmlwidgets) pacman::p_load(htmltools) load(file.path(here::here('slides','lectures','data', 'exdata.rda'))) m <- leaflet(gpx) %>% addTiles() %>% addCircleMarkers(~Longitude, ~Latitude, radius=1) withr::with_dir(here::here('docs/slides/lectures/week3/img'), saveWidget(m, file='map1.html')) ``` ] --- ### Heatmap .pull-left[ ```{r heatmap, echo=FALSE, fig.height=5.5} pacman::p_load(ComplexHeatmap) pacman::p_load(circlize) mat = readRDS(system.file("extdata", "measles.rds", package = "ComplexHeatmap")) ha1 = HeatmapAnnotation( dist1 = anno_barplot( colSums(mat), bar_width = 1, gp = gpar(col = "white", fill = "#FFE200"), border = FALSE, axis_param = list(at = c(0, 2e5, 4e5, 6e5, 8e5), labels = c("0", "200k", "400k", "600k", "800k")), height = unit(2, "cm") ), show_annotation_name = FALSE) ha2 = rowAnnotation( dist2 = anno_barplot( rowSums(mat), bar_width = 1, gp = gpar(col = "white", fill = "#FFE200"), border = FALSE, axis_param = list(at = c(0, 5e5, 1e6, 1.5e6), labels = c("0", "500k", "1m", "1.5m")), width = unit(2, "cm") ), show_annotation_name = FALSE) year_text = as.numeric(colnames(mat)) year_text[year_text %% 10 != 0] = "" ha_column = HeatmapAnnotation( year = anno_text(year_text, rot = 0, location = unit(1, "npc"), just = "top") ) col_fun = colorRamp2(c(0, 800, 1000, 127000), c("white", "cornflowerblue", "yellow", "red")) ht_list = Heatmap(mat, name = "cases", col = col_fun, cluster_columns = FALSE, show_row_dend = FALSE, rect_gp = gpar(col= "white"), show_column_names = FALSE, row_names_side = "left", row_names_gp = gpar(fontsize = 6), column_title = 'Measles cases in US states 1930-2001\nVaccine introduced 1961', top_annotation = ha1, bottom_annotation = ha_column, heatmap_legend_param = list(at = c(0, 5e4, 1e5, 1.5e5), labels = c("0", "50k", "100k", "150k"))) + ha2 draw(ht_list, ht_gap = unit(3, "mm")) decorate_heatmap_body("cases", { i = which(colnames(mat) == "1961") x = i/ncol(mat) grid.lines(c(x, x), c(0, 1), gp = gpar(lwd = 2, lty = 2)) grid.text("Vaccine introduced", x, unit(1, "npc") + unit(5, "mm")) }) ``` ] .pull-right[ ```{r hm2, echo=FALSE, fig.height=5.5} pacman::p_load(ComplexHeatmap) pacman::p_load(circlize) expr = readRDS(system.file(package = "ComplexHeatmap", "extdata", "gene_expression.rds")) mat = as.matrix(expr[, grep("cell", colnames(expr))]) base_mean = rowMeans(mat) mat_scaled = t(apply(mat, 1, scale)) type = gsub("s\\d+_", "", colnames(mat)) ha = HeatmapAnnotation(type = type, annotation_name_side = "left") ht_list = Heatmap(mat_scaled, name = "expression", row_km = 5, col = colorRamp2(c(-2, 0, 2), c("blue", "white", "orange")), top_annotation = ha, show_column_names = FALSE, row_title = NULL, show_row_dend = FALSE) + Heatmap(base_mean, name = "base mean", top_annotation = HeatmapAnnotation(summary = anno_summary(gp = gpar(fill = 2:6), height = unit(2, "cm"))), width = unit(15, "mm")) + rowAnnotation(length = anno_points(expr$length, pch = 16, size = unit(1, "mm"), axis_param = list(at = c(0, 2e5, 4e5, 6e5), labels = c("0kb", "200kb", "400kb", "600kb")), width = unit(2, "cm"))) + Heatmap(expr$type, name = "gene type", top_annotation = HeatmapAnnotation(summary = anno_summary(height = unit(2, "cm"))), width = unit(15, "mm")) ht_list = rowAnnotation(block = anno_block(gp = gpar(fill = 2:6, col = NA)), width = unit(2, "mm")) + ht_list draw(ht_list, ht_gap = unit(5, "mm")) ``` ] --- ### OncoPrint ```{r, echo=F, fig.height=5} pacman::p_load(ComplexHeatmap) mat = read.table(system.file("extdata", package = "ComplexHeatmap", "tcga_lung_adenocarcinoma_provisional_ras_raf_mek_jnk_signalling.txt"), header = TRUE, stringsAsFactors = FALSE, sep = "\t") mat[is.na(mat)] = "" rownames(mat) = mat[, 1] mat = mat[, -1] mat= mat[, -ncol(mat)] mat = t(as.matrix(mat)) col = c("HOMDEL" = "blue", "AMP" = "red", "MUT" = "#008000") alter_fun = list( background = function(x, y, w, h) { grid.rect(x, y, w-unit(2, "pt"), h-unit(2, "pt"), gp = gpar(fill = "#CCCCCC", col = NA)) }, # big blue HOMDEL = function(x, y, w, h) { grid.rect(x, y, w-unit(2, "pt"), h-unit(2, "pt"), gp = gpar(fill = col["HOMDEL"], col = NA)) }, # big red AMP = function(x, y, w, h) { grid.rect(x, y, w-unit(2, "pt"), h-unit(2, "pt"), gp = gpar(fill = col["AMP"], col = NA)) }, # small green MUT = function(x, y, w, h) { grid.rect(x, y, w-unit(2, "pt"), h*0.33, gp = gpar(fill = col["MUT"], col = NA)) } ) column_title = "OncoPrint for TCGA Lung Adenocarcinoma, genes in Ras Raf MEK JNK signalling" heatmap_legend_param = list(title = "Alternations", at = c("HOMDEL", "AMP", "MUT"), labels = c("Deep deletion", "Amplification", "Mutation")) oncoPrint(mat, alter_fun = alter_fun, col = col, remove_empty_columns = TRUE, remove_empty_rows = TRUE, column_title = column_title, heatmap_legend_param = heatmap_legend_param) ``` --- ### Interactive graphs ```{r 05-Plotting-13, echo=F, eval=T} pacman::p_load(plotly) pacman::p_load(fs) pacman::p_load(here) p=ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species, shape=Species)) + geom_point(size=2, alpha=0.6)+ theme_bw() mytext=paste("Sepal Length = ", iris$Sepal.Length, "\n" , "Sepal Width = ", iris$Sepal.Width, "\n", "Row Number: ",rownames(iris), sep="") pp=plotly_build(p) style( pp, text=mytext, hoverinfo = "text", traces = c(1, 2, 3) ) pg <- ggplotly(p) # htmltools::save_html(pg, file = here::here('slides','lectures','week3','why_plot_files','pl.html')) ``` ??? These graphs are clickable. For example, click on a symbol on the legend, or drag your mouse over a region with left button held down. --- ## Network graphs ```{r network, eval=!fs::file_exists(here::here('docs','slides','lectures','week3','anim/fn.html')), echo=FALSE} pacman::p_load(htmlwidgets) pacman::p_load(networkD3) data(MisLinks, MisNodes) fn <- forceNetwork(Links = MisLinks, Nodes = MisNodes, Source = "source", Target = "target", Value = "value", NodeID = "name", Group = "group", opacity = 0.4) htmltools::save_html(fn, file = here::here('docs','slides','lectures','week3','anim','fn.html')) ``` --- ## Animated graphs ```{r aniplot, eval=!fs::file_exists('anim/gapminder.gif'), echo=F, results='hide', cache=FALSE} pacman::p_load(ggplot2) pacman::p_load(gganimate) theme_set(theme_bw()) pacman::p_load(gapminder) p <- ggplot( gapminder, aes(x = gdpPercap, y=lifeExp, size = pop, colour = country) ) + geom_point(show.legend = FALSE, alpha = 0.7) + scale_color_viridis_d() + scale_size(range = c(2, 12)) + scale_x_log10(labels = scales::label_comma()) + labs(x = "GDP per capita", y = "Life expectancy") p <- p + transition_time(year) + labs(title = "Year: {frame_time}") animate(p, renderer = gifski_renderer(), device='png') anim_save('anim/gapminder.gif') ``` ![:scale 40%](anim/gapminder.gif) --- ```{r, echo=F, results='asis'} update_header() ``` --- class: middle, center, inverse # For more in-depth looks at data viz, consider BIOF 439 in the Spring