{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import networkx as nx\n", "from datetime import datetime\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import warnings\n", "from custom import load_data as cf\n", "\n", "warnings.filterwarnings('ignore')\n", "\n", "%load_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "%config InlineBackend.figure_format = 'retina'" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Nodes and Edges: How do we represent relationships between individuals using NetworkX?\n", "\n", "As mentioned earlier, networks, also known as graphs, are comprised of individual entities and their representatives. The technical term for these are nodes and edges, and when we draw them we typically use circles (nodes) and lines (edges). \n", "\n", "In this notebook, we will work with a social network of seventh graders, in which nodes are individual students, and edges represent their relationships. Edges between individuals show how often the seventh graders indicated other seventh graders as their favourite.\n", "\n", "Data credit: http://konect.uni-koblenz.de/networks/moreno_seventh" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Data Representation\n", "\n", "In the `networkx` implementation, graph objects store their data in dictionaries. \n", "\n", "Nodes are part of the attribute `Graph.node`, which is a dictionary where the key is the node ID and the values are a dictionary of attributes. \n", "\n", "Edges are part of the attribute `Graph.edge`, which is a nested dictionary. Data are accessed as such: `G.edge[node1][node2]['attr_name']`.\n", "\n", "Because of the dictionary implementation of the graph, any hashable object can be a node. This means strings and tuples, but not lists and sets." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Load Data\n", "\n", "Let's load some real network data to get a feel for the NetworkX API. This [dataset](http://konect.uni-koblenz.de/networks/moreno_seventh) comes from a study of 7th grade students.\n", "\n", "> This directed network contains proximity ratings between studetns from 29 seventh grade students from a school in Victoria. Among other questions the students were asked to nominate their preferred classmates for three different activities. A node represents a student. An edge between two nodes shows that the left student picked the right student as his answer. The edge weights are between 1 and 3 and show how often the left student chose the right student as his favourite." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "G = cf.load_seventh_grader_network()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Basic Network Statistics\n", "Let's first understand how many students and friendships are represented in the network." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Who are represented in the network?\n", "G.nodes()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**API Note:** As of NetworkX 2.0, to select subset of nodes, you have to cast `G.nodes()` as a list, i.e. `list(G.nodes())[0:10]`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Exercise\n", "\n", "Can you write a single line of code that returns the number of nodes in the graph? (1 min.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Let's now figure out who is connected to who in the network" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Who is connected to who in the network?\n", "G.edges()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**API Note:** As of NetworkX 2.0, to select subset of edges, you have to cast `G.edges()` as a list, i.e. `list(G.edges())[0:10]`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Exercise\n", "\n", "Can you write a single line of code that returns the number of relationships represented? (1 min.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Concept\n", "\n", "A **network**, more technically known as a **graph**, is comprised of:\n", "\n", "- a set of nodes\n", "- joined by a set of edges\n", "\n", "They can be represented as two lists:\n", "\n", "1. A **node list**: a list of 2-tuples where the first element of each tuple is the representation of the node, and the second element is a dictionary of metadata associated with the node.\n", "2. An **edge list**: a list of 3-tuples where the first two elements are the nodes that are connected together, and the third element is a dictionary of metadata associated with the edge." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Since this is a social network of people, there'll be attributes for each individual, such as a student's gender. We can grab that data off from the attributes that are stored with each node." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Let's get a list of nodes with their attributes.\n", "list(G.nodes(data=True))\n", "\n", "# NetworkX will return a list of tuples in the form (node_id, attribute_dictionary) " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Exercise\n", "\n", "Can you count how many males and females are represented in the graph? (3 min.)\n", "\n", "Hint: You may want to use the Counter object from the collections module." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from collections import Counter\n", "mf_counts = Counter([d['_________'] for _, _ in G._____(data=_____)])\n", "\n", "def test_answer(mf_counts):\n", " assert mf_counts['female'] == 17\n", " assert mf_counts['male'] == 12\n", " \n", "test_answer(mf_counts)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Edges can also store attributes in their attribute dictionary." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "G.edges(data=True)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "In this synthetic social network, the number of times the left student indicated that the right student was their favourite is stored in the \"count\" variable." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Exercise\n", "\n", "Can you figure out the maximum times any student rated another student as their favourite? (3 min.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Answer\n", "counts = [d['_____'] for _, _, _ in G._______(_________)]\n", "maxcount = max(_________)\n", "\n", "def test_maxcount(maxcount):\n", " assert maxcount == 3\n", " \n", "test_maxcount(maxcount)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Exercise\n", "\n", "We found out that there are two individuals that we left out of the network, individual no. 30 and 31. They are one male (30) and one female (31), and they are a pair that just love hanging out with one another and with individual 7 (`count=3`), in both directions per pair. Add this information to the graph. (5 min.)\n", "\n", "If you need more help, check out https://networkx.github.io/documentation/stable/tutorial.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Answer: Follow the coding pattern.\n", "G.add_node(30, gender='male')\n", "\n", "\n", "G.add_edge(30, 31, count=3)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Verify that you have added in the edges and nodes correctly by running the following cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def test_graph_integrity(G):\n", " assert 30 in G.nodes()\n", " assert 31 in G.nodes()\n", " assert G.node[30]['gender'] == 'male'\n", " assert G.node[31]['gender'] == 'female'\n", " assert G.has_edge(30, 31)\n", " assert G.has_edge(30, 7)\n", " assert G.has_edge(31, 7)\n", " assert G.edges[30, 7]['count'] == 3\n", " assert G.edges[7, 30]['count'] == 3\n", " assert G.edges[31, 7]['count'] == 3\n", " assert G.edges[7, 31]['count'] == 3\n", " assert G.edges[30, 31]['count'] == 3\n", " assert G.edges[31, 30]['count'] == 3\n", " print('All tests passed.')\n", " \n", "test_graph_integrity(G)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**API Note:** Pre-NetworkX 2.0, the syntax for accessing any particular edge's metadata was `G.edge[node1][node2][key]`. \n", "\n", "Post-NetworkX 2.0, the syntax has changed to: `G.edges[node1, node2][key]`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Exercise (break-time)\n", "\n", "If you would like a challenge during the break, try figuring out which students have \"unrequited\" friendships, that is, they have rated another student as their favourite at least once, but that other student has not rated them as their favourite at least once.\n", "\n", "Specifically, get a list of edges for which the reverse edge is not present.\n", "\n", "**Hint:** You may need the class method `G.has_edge(n1, n2)`. This returns whether a graph has an edge between the nodes `n1` and `n2`." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2018-05-02T13:57:33.103586Z", "start_time": "2018-05-02T13:57:33.052329Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "124" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unrequitted_friendships = []\n", "# Fill in your answer below.\n", "\n", "\n", "\n", "assert len(unrequitted_friendships) == 124" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a previous session at ODSC East 2018, a few other class participants attempted this problem. You can find their solutions in the Instructor version of this notebook." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Tests\n", "\n", "A note about the tests: Testing is good practice when writing code. Well-crafted assertion statements help you program defensivel, by forcing you to explicitly state your assumptions about the code or data.\n", "\n", "For more references on defensive programming, check out Software Carpentry's website: http://swcarpentry.github.io/python-novice-inflammation/08-defensive/\n", "\n", "For more information on writing tests for your data, check out these slides from a lightning talk I gave at Boston Python and SciPy 2015: http://j.mp/data-test" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Coding Patterns\n", "\n", "These are some recommended coding patterns when doing network analysis using NetworkX, which stem from my roughly two years of experience with the package." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Iterating using List Comprehensions\n", "I would recommend that you use the following for compactness: \n", "\n", " [d['attr'] for n, d in G.nodes(data=True)]\n", "\n", "And if the node is unimportant, you can do:\n", "\n", " [d['attr'] for _, d in G.nodes(data=True)]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Iterating over Edges using List Comprehensions\n", "\n", "A similar pattern can be used for edges:\n", "\n", " [n2 for n1, n2, d in G.edges(data=True)]\n", "\n", "or\n", "\n", " [n2 for _, n2, d in G.edges(data=True)]\n", "\n", "If the graph you are constructing is a directed graph, with a \"source\" and \"sink\" available, then I would recommend the following pattern:\n", "\n", " [(sc, sk) for sc, sk, d in G.edges(data=True)]\n", "\n", "or \n", "\n", " [d['attr'] for sc, sk, d in G.edges(data=True)]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Drawing Graphs\n", "\n", "As illustrated above, we can draw graphs using the `nx.draw()` function. The most popular format for drawing graphs is the **node-link diagram**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hairballs\n", "\n", "Nodes are circles and lines are edges. Nodes more tightly connected with one another are clustered together. Large graphs end up looking like hairballs." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "nx.draw(G)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "If the network is small enough to visualize, and the node labels are small enough to fit in a circle, then you can use the `with_labels=True` argument." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "nx.draw(G, with_labels=True)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "However, note that if the number of nodes in the graph gets really large, node-link diagrams can begin to look like massive hairballs. This is undesirable for graph visualization.\n", "\n", "## Matrix Plot\n", "\n", "Instead, we can use a **matrix** to represent them. The nodes are on the x- and y- axes, and a filled square represent an edge between the nodes. This is done by using the `MatrixPlot` object from `nxviz`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from nxviz import MatrixPlot\n", "\n", "m = MatrixPlot(G)\n", "m.draw()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Arc Plot\n", "\n", "The Arc Plot is the basis of the next set of rational network visualizations." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2017-07-11T18:51:53.911943Z", "start_time": "2017-07-11T18:51:52.522004Z" }, "collapsed": true }, "outputs": [], "source": [ "from nxviz import ArcPlot\n", "\n", "a = ArcPlot(G, node_color='gender', node_grouping='gender')\n", "a.draw()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Circos Plot\n", "\n", "Let's try another visualization, the **Circos plot**. We can order the nodes in the Circos plot according to the node ID, but any other ordering is possible as well. Edges are drawn between two nodes.\n", "\n", "Credit goes to Justin Zabilansky (MIT) for the implementation, Jon Charest for subsequent improvements, and `nxviz` contributors for further development." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "scrolled": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from nxviz import CircosPlot\n", "\n", "c = CircosPlot(G, node_color='gender', node_grouping='gender')\n", "c.draw()\n", "plt.savefig('images/seventh.png', dpi=300)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "This visualization helps us highlight nodes that there are poorly connected, and others that are strongly connected.\n", "\n", "## Hive Plot\n", "\n", "Next up, let's try Hive Plots. HivePlots are not yet implemented in `nxviz` just yet, so we're going to be using the old `hiveplot` API for this. When HivePlots have been migrated over to `nxviz`, its API will resemble that of the CircosPlot's. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from hiveplot import HivePlot\n", "\n", "nodes = dict()\n", "nodes['male'] = [n for n,d in G.nodes(data=True) if d['gender'] == 'male']\n", "nodes['female'] = [n for n,d in G.nodes(data=True) if d['gender'] == 'female']\n", "\n", "edges = dict()\n", "edges['group1'] = G.edges(data=True)\n", "\n", "nodes_cmap = dict()\n", "nodes_cmap['male'] = 'blue'\n", "nodes_cmap['female'] = 'red'\n", "\n", "edges_cmap = dict()\n", "edges_cmap['group1'] = 'black'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "scrolled": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "h = HivePlot(nodes, edges, nodes_cmap, edges_cmap)\n", "h.draw()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Hive plots allow us to divide our nodes into sub-groups, and visualize the within- and between-group connectivity." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "nams", "language": "python", "name": "nams" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" }, "widgets": { "state": {}, "version": "1.1.1" } }, "nbformat": 4, "nbformat_minor": 2 }