{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Overview\n", "\n", "This week is about tying everything together. We try to combine our work on networks plus our work on language in order to understand and analyze human behavior. We'll be working on a cool dataset of humans playing a game on (a special version of) wikipedia called \"wikispeedia\". Note that the game is now called *The Wiki Game* and can be found at http://thewikigame.com/ ( ... while \"http://wikispeedia.org\" is a page about GPS coordinates of speed-traps). \n", "\n", "Here's how the game worked:\n", "\n", "> In the game, users are asked to navigate from a given _source_ article (e.g. https://en.wikipedia.org/wiki/Gold_dollar) to a given _target_ article (e.g. https://en.wikipedia.org/wiki/Ronald_Reagan), **by only clicking Wikipedia links**. A condensed version of Wikipedia (4,604 articles) is used. \n", "\n", "So this dataset contains **human navigation paths** (clicking from page to page to find a target) and today we will work on this dataset to see if we can use our skills to understand how human navigation works. We will think about the following questions.\n", "\n", "* Path lengths\n", "* Betweenness from the human perspective\n", "* What characterizes human paths?\n", "\n", "Since we're putting our existing skills to use, there is very little reading today - it's all about exploring the dataset. Let's get started." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download the dataset\n", "\n", "The first thing we're going to do is download the dataset. Today everything related to data is nice, clean, and easy to work with (Yay). You can get the dataset [here](https://snap.stanford.edu/data/wikispeedia.html). You will need to get \n", "\n", "* The list of wiki articles\n", "* The network connections\n", "* The navigation paths\n", "* Plaintext of the wiki articles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Path lengths\n", "\n", "The first thing we want to take a look at is path lengths. NetworkX allows us to calculate the shortest path between any pair of articles. We begin by comparing the length of human and shortests paths. \n", "\n", "_Exercises_\n", "> * For each _source_/_target_ pair in the list of human navigation paths, calculate the shortest path using NetworkX. Plot the distribution of path lengths. Mine looks something like this (if I use an undirected graph):\n", "![alt text](https://raw.githubusercontent.com/suneman/socialgraphs2016/master/files/shortest-path.png)\n", "Nevertheless, you should still see the difference between human and shortest paths using the directed graph as well, although they might be smaller.\n", "\n", "> * For each _source_/_target_ pair, calculate the length of the human path. The dataset contains information on people who regret a navigation step and hit the \"back\" button in their web-browser. It's up to you how to incorporate that information in the path. Justify your choice. Plot the distribution of human path lengths. If I ignore back steps, I get this on log-log scale:\n", "![alt text](https://raw.githubusercontent.com/suneman/socialgraphs2016/master/files/human-path.png)\n", "\n", "> * How much longer are the human paths on average?\n", "> * Create scatter plot where each point is a _source_/_target_ pair, and you have human path lengths on the $x$-axis and shortests paths on the $y$-axis.\n", "> * Is there a correlation between human/shortest path-lengths? What is the correlation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Betweenness\n", "\n", "An interesting definition of centrality is _betweenness centrality_ (here's a handy [link to the NetworkX documentation](http://networkx.readthedocs.io/en/stable/reference/generated/networkx.algorithms.centrality.betweenness_centrality.html)). In a traditional setting, this measure calculates all shortest paths in the network and then each node gets a score according to which fraction of all shortest paths pass through that node.\n", "\n", "\n", "In this part, we will create our own version of centrality, based on the _source_/_target_ pairs in our dataset. We define a nodes's **navigation centrality** as follows. \n", "\n", "> *Navigation centrality* of node $i$ is the fraction of all naviagtion paths that pass through $i$. We exclude the source and target from the count. If a node has not been visited by a search, the navigation centrality of that node is defined to be zero.\n", "\n", "In the exercises below, we investigate the relationship between navigation centrality and betweenness centrality." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Exercises*\n", "\n", "> Begin by calculating the betweenness centrality and navigation centrality of all nodes in the wikispedia dataset.\n", "> Note that calculating the betweenness centrality can take quite a long time, so you might start it running in a separate notebook while first estimating it based on the existing human path.\n", ">\n", "> * First, list the 5 pages with highest navigation centrality.\n", "> * Second, list the 5 pages with highest betweenness centrality.\n", "> * Compare the two lists. Explain the differences between the two lists in your own words.\n", "> * Create a scatterplot of betweenness centrality vs. navigation centrality.\n", "> * Let's explore the pages that have navigation centrality equal to zero.\n", "> * How many pages have zero navigation centrality?\n", "> * What is the the page with zero navigation centrality and highest betweenness centrality? Can you explain why no human navigated to this page? Can you explain why the page is central in the actual link network? (For example, you can take a look at the degree of the node).\n", "> * Plot the distribution of betweenness centrality for the pages with zero navigation centrality. My plot on log-log scale:\n", "![alt text](https://raw.githubusercontent.com/suneman/socialgraphs2016/master/files/betweenness.png)\n", "\n", "> * Now, let's *throw out all pages with zero navigation centrality* and compare navigation- and betweenness centrality for the remaining pages.\n", "> * What is the correlation between betweenness centrality and navigation centrality?\n", "> * Comment on the top 5 outliers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Bringing the text into the picture\n", "\n", "Now that we have an idea about the differences between how humans and computers search in networks, we are going to dig a little deeper using the page content to test a hypothesis to explain why the human navigation paths are longer. The general idea is that humans (who don't know about the global network structure) tend to jump between pages that have related _content_. For this reason we expect that (on average) human navigation paths have more similar content than the shortest paths in the network (which might take 'surprising' shortcuts via relatively unrelated pages). In short.\n", "\n", "> **Hypothesis H1**: Human navigation paths have more similar content than network shortest paths.\n", "\n", "The way we'll test this hypothesis is to first represent each page as a vector using a bag-of-words approach, then we can calculate a distance between pairs of pages using some vector-space difference, and finally we'll characterize each path by its average pair-wise distance. Below, I've set up that process as an exercise. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Exercises*\n", "\n", "> First, create a TF-IDF vector for each page. You already know all about TF-IDF from last week's exercise. The main difference is that we now _characterize **each page** by a TF-IDF vector_ and not a group of pages.\n", "> \n", "> Second, write a function that calculates the distance between a pair of vectors. There are many ways to calculate distances between a pair of vectors (try a Google search for `vector space distance measures` if you want to refresh your knowledge on this topic). You're free to choose what you want, but we recommend the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity).\n", ">\n", "> Now you're ready for the exercises\n", "> \n", "> * Calculate the average pairwise similarity for all human navigation paths (the _source_/_target_ pairs from above). With start and end at node $i,j$ we can call this similarity $s_{i,j}$. Calculate mean/variance of the $s_{ij}$'s.\n", "> * Calculate the average pairwise similarity for all shortest paths between the _source_/_target_ pairs ($S_{i,j}$). Calculate mean/variance of the $S_{i,j}$.\n", "> * Plot the distributions of average similarities for both human- and shortest paths in a single plot. If everything works well, you should see something similar to the following:\n", "![alt text](https://raw.githubusercontent.com/suneman/socialgraphs2016/master/files/path-similarity.png)\n", "\n", "> * Finally, for each source/target pair, compare the human-navigation average similarity with the betweenness based average similarity, testing what fraction of the time, the average similarity is lower in the case of human navigation.\n", "> * Comment on your findings. Is **H1** true?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.14" } }, "nbformat": 4, "nbformat_minor": 1 }