--- title: "Introduction to Networks and Trees" description: | Typical tasks and example network datasets. author: - name: Kris Sankaran affiliation: UW Madison date: 03-13-2021 output: distill::distill_article: self_contained: false --- ```{r setup, include=FALSE} library("knitr") opts_chunk$set(cache = FALSE, message = FALSE, warning = FALSE, echo = TRUE) ``` _[Reading 1](https://search.library.wisc.edu/catalog/9911196629502121) (Chapter 9), [Reading 2](https://web.stanford.edu/class/bios221/book/Chap-Graphs.html), [Recording](https://mediaspace.wisc.edu/media/Week%208%20%5B1%5D%20Introduction%20to%20Networks%20and%20Trees/1_qie3eehm), [Rmarkdown](https://raw.githubusercontent.com/krisrs1128/stat479/master/_posts/2021-03-06-week8-1/week8-1.Rmd)_ ```{r} library("ggplot2") library("ggraph") library("dplyr") library("tidygraph") theme_set(theme_graph()) ``` 1. Networks and trees can be used to represent information in a variety of contexts. Abstractly, networks and trees are types of *graphs*, which are defined by (a) a set $V$ of vertices and (b) a set $E$ of edges between pairs of vertices. 2. It is helpful to have a few specific examples in mind, * The Internet: $V = \{\text{All Webpages}\}, \left(v, v^{\prime}\right) \in E$ if there is a hyperlink between pages $v$ and $v^{\prime}$. * Evolutionary Tree: $V = \{\text{All past and present species}\}, \left(v, v^{\prime}\right) \in E$ if one of the species $v$ or $v^{\prime}$ is a descendant of the other. * Disease Transmission: $V = \{\text{Community Members}\}, \left(v, v^{\prime}\right) \in E$ if the two community members have come in close contact. * Directory Tree: $V = \{\text{All directories in a computer}\}, \left(v, v^{\prime}\right) \in E$ if one directory is contained in the other. ```{r, fig.cap = "A visualization of the internet, from the [Opte Project](https://opte.org).", echo = FALSE} include_graphics("https://uwmadison.box.com/shared/static/cjk3i7mephlyguka9y5h1d40uy9rhdk2.png") ``` ```{r, fig.cap = "An evolutionary tree, from the [Interactive Tree of Life](https://itol.embl.de/).", echo = FALSE} include_graphics("https://itol.embl.de/img/home/box2.png") ``` ```{r, fig.cap = "A COVID-19 transmission network, from *Clustering and superspreading potential of SARS-CoV-2 infections in Hong Kong*.", echo = FALSE} include_graphics("https://uwmadison.box.com/shared/static/wlnq401dxdd9pwdtg0ewgnhx7w9nzwbp.png") ``` ```{r, fig.cap = "Directories in a file system can be organized into a tree, with parent and child directories.", echo = FALSE} include_graphics("https://uwmadison.box.com/shared/static/erbi39htkgzndhxh8yhvcoq3imfkxdca.png") ``` 3. Either vertices or edges might have attributes. For example, in the directory tree, we might know the sizes of the files (vertex attribute), and in the disease transmission network we might know the duration of contact between individuals (edge attribute). 4. An edge may be either undirected or directed. In a directed edge, one vertex leads to the other, while in an undirected edge, there is no sense of ordering. 5. In R, the `tidygraph` package can be used to manipulate graph data. It's `tbl_graph` class stores node and edge attributes in a single data structure. and `ggraph` extends the usual ggplot2 syntax to graphs. ```{r} E <- data.frame( source = c(1, 2, 3, 4, 5), target = c(3, 3, 4, 5, 6) ) G <- tbl_graph(edges = E) G ``` This `tbl_graph` can be plotted using the code below. There are different geoms available for nodes and edges -- for example, what happens if you replace `geom_edge_link()` with `geom_edge_arc()`? ```{r} ggraph(G, layout = 'kk') + geom_edge_link() + geom_node_point() ``` 6. We can mutate node and edge attributes using dplyr-like syntax. Before mutating edges, it's necessary to call `activate(edges)`. ```{r} G <- G %>% mutate( id = row_number(), group = id < 4 ) %>% activate(edges) %>% mutate(width = runif(n())) G ``` Now we can visualize these derived attributes using an aesthetic mapping within the `geom_edge_link` and `geom_node_point` geoms. ```{r, fig.cap = "The same network as above, but with edge size encoding the weight attribute."} ggraph(G, layout = "kk") + geom_edge_link(aes(width = width)) + geom_node_label(aes(label = id)) ``` ### Example Tasks 5. What types of data that are amenable to representation by networks or trees? What visual comparisons do networks and trees facilitate? 6. Our initial examples suggest that trees and networks can be used to represent either physical interactions or conceptual relationships. Typical tasks include, * Searching for groupings * Following paths * Isolating key nodes 7. By "searching for groupings," we mean finding clusters of nodes that are highly interconnected, but which have few links outside the cluster. This kind of modular structure might lend itself to deeper investigation within each of the clusters. * Clusters in a network of political blogs might suggest an echo chamber effect. * Gene clusters in a differential expression study might suggest pathways needed for the production of an important protein. * Clusters in a recipe network could be used identify different culinary techniques or cuisines. ```{r, fig.cap = "A representation of 1200 blogs before the 2004 election, from *The political blogosphere and the 2004 US election: divided they blog*.", echo = FALSE} include_graphics("http://www.visualcomplexity.com/vc/images/227_big02.jpg") ``` 8. By "following paths," we mean tracing the paths out from a particular node, to see which other nodes it is close to. * Following paths in a citation network might reveal the chain of publications that led to an important discovery. * Following paths in a recommendation network might suggest other users who might be interested in watching a certain movie. ```{r, fig.cap = "A recommendation network, linking individuals and the movies that they viewed.", echo = FALSE} include_graphics("https://psl.linqs.org/assets/images/hyper/fig3.png") ``` 9. "Isolating key nodes" is a more fuzzy concept, usually referring to the task of finding nodes that are exceptional in some way. For example, it’s often interesting to find nodes with many more connections than others, or which link otherwise isolated clusters. * A node with many edges in a disease transmission network is a superspreader. * A node that links two clusters in a citation network might be especially interdisciplinary. * A node with large size in a directory tree might be a good target for reducing disk usage. ```{r, fig.cap = "The scientific journal, Social Networks, links several publication communities, as found by *Betweenness Centrality as an Indicator of the Interdisciplinarity of Scientific Journals*.", echo = FALSE} include_graphics("https://www.leydesdorff.net/betweenness/index_files/image008.jpg") ``` 10. If you find these questions interesting, you might enjoy the catalog of examples on the website [VisualComplexity](http://www.visualcomplexity.com/vc/).