{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "> This is one of the 100 recipes of the [IPython Cookbook](http://ipython-books.github.io/), the definitive guide to high-performance scientific computing and data science in Python.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 14.2. Analyzing a social network with NetworkX" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, you need the *twitter* Python package for this recipe. You can install it with `pip install twitter`. (https://pypi.python.org/pypi/twitter)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, you need to obtain authentication codes in order to access Twitter data. The procedure is free. In addition to a Twitter account, you also need to create an *Application* on the Twitter Developers website. Then, you will be able to retrieve the OAuth authentication codes that are required for this recipe. (https://dev.twitter.com/apps)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note that access to the Twitter API is not unlimited**. Most methods can only be called a few times within a given time window. Unless you study small networks or look at small portions of large networks, you will need to throttle your requests. In this recipe, we only consider a small portion of the network, so that the API limit should not be reached. Otherwise, you will have to wait a few minutes before the next time window starts. (https://dev.twitter.com/docs/rate-limiting/1.1/limits)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Let's import a few packages." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import math\n", "import json\n", "import twitter\n", "import numpy as np\n", "import pandas as pd\n", "import networkx as nx\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. You need to create a `twitter.txt` text file in the current folder with the four authentication keys. You will find those in your Twitter Developers Application page, *OAuth tool* section. (https://dev.twitter.com/apps)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "(CONSUMER_KEY, CONSUMER_SECRET, \n", "OAUTH_TOKEN, OAUTH_TOKEN_SECRET) = open('twitter.txt', 'r').read().splitlines()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. We now create the `Twitter` instance that will give us access to the Twitter API." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,\n", " CONSUMER_KEY, CONSUMER_SECRET)\n", "tw = twitter.Twitter(auth=auth)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. We use the latest 1.1 version of the Twitter API in this recipe. The `twitter` library defines a direct mapping between the REST API and the attributes of the `Twitter` instance. Here, we execute the `account/verify_credentials` REST request to obtain the user id of the authenticated user (me here, or you if you execute this notebook yourself!)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "me = tw.account.verify_credentials()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "myid = me['id']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. Let's define a simple function that returns the identifiers of all followers of a given user (the authenticated user by default)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def get_followers_ids(uid=None):\n", " # Retrieve the list of followers' ids of the specified user.\n", " return tw.followers.ids(user_id=uid)['ids']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# We get the list of my followers.\n", "my_followers_ids = get_followers_ids()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6. Now, we define a function that retrieves the full profile of Twitter users. Since the `users/lookup` batch request is limited to 100 users per call, and that only a small number of calls are allowed within a time window, we only look at a subset of all the followers." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def get_users_info(users_ids, max=500):\n", " n = min(max, len(users_ids))\n", " # Get information about those users, using batch requests.\n", " users = [tw.users.lookup(user_id=users_ids[100*i:100*(i+1)])\n", " for i in range(int(math.ceil(n/100.)))]\n", " # We flatten this list of lists.\n", " users = [item for sublist in users for item in sublist]\n", " return {user['id']: user for user in users}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "users_info = get_users_info(my_followers_ids)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Let's save this dictionary on the disk.\n", "with open('my_followers.json', 'w') as f:\n", " json.dump(users_info, f, indent=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7. We also start to define the graph with the followers, as an adjacency list (technically, a dictionary of lists). This is called the **ego graph**. This graph represents all *following* connections between my followers." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "adjacency = {myid: my_followers_ids}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "8. Now, we are going to take a look at the part of the ego graph related to Python. Specifically, we will consider the followers of the 10 most followed users whom descriptions contain \"Python\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "my_followers_python = [user for user in users_info.values()\n", " if 'python' in user['description'].lower()]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "my_followers_python_best = sorted(my_followers_python, \n", " key=lambda u: u['followers_count'])[::-1][:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The request for retrieving the followers of a given user is rate-limited. Let's check how many calls remaining we have." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tw.application.rate_limit_status(resources='followers') \\\n", " ['resources']['followers']['/followers/ids']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for user in my_followers_python_best:\n", " # The call to get_followers_ids is rate-limited.\n", " adjacency[user['id']] = list(set(get_followers_ids(\n", " user['id'])).intersection(my_followers_ids))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "9. Now that our graph is defined as an adjacency list in a dictionary, we will load it in NetworkX." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "g = nx.Graph(adjacency)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# We only restrict the graph to the users for which we \n", "# were able to retrieve the profile.\n", "g = g.subgraph(users_info.keys())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# We also save this graph on disk.\n", "with open('my_graph.json', 'w') as f:\n", " json.dump(nx.to_dict_of_lists(g), f, indent=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# We remove isolated nodes for simplicity.\n", "g.remove_nodes_from([k for k, d in g.degree().items()\n", " if d <= 1])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Since I am connected to all nodes, by definition,\n", "# we can remove me for simplicity.\n", "g.remove_nodes_from([myid])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "10. Let's take a look at the graph's statistics." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "len(g.nodes()), len(g.edges())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "11. We are now going to plot this graph. We will use different sizes and colors for the nodes, according to the number of followers and the number of tweets for each user. Most followed users will be bigger. Most active users will be redder." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Update the dictionary.\n", "deg = g.degree()\n", "for user in users_info.values():\n", " fc = user['followers_count']\n", " sc = user['statuses_count']\n", " # Is this user a Pythonista?\n", " user['python'] = 'python' in user['description'].lower()\n", " # We compute the node size as a function of the \n", " # number of followers.\n", " user['node_size'] = math.sqrt(1 + 10 * fc)\n", " # The color is function of its activity on Twitter.\n", " user['node_color'] = 10 * math.sqrt(1 + sc)\n", " # We only display the name of the most followed users.\n", " user['label'] = user['screen_name'] if fc > 2000 else ''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "12. Finally, we use the `draw` function to display the graph. We need to specify the node sizes and colors as lists, and the labels as a dictionary." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "node_size = [users_info[uid]['node_size'] for uid in g.nodes()]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "node_color = [users_info[uid]['node_color'] for uid in g.nodes()]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "labels = {uid: users_info[uid]['label'] for uid in g.nodes()}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.figure(figsize=(6,6));\n", "nx.draw(g, cmap=plt.cm.OrRd, alpha=.8,\n", " node_size=node_size, node_color=node_color,\n", " labels=labels, font_size=4, width=.1);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## See also\n", "\n", "* Manipulate and visualize graphs with NetworkX" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).\n", "\n", "> [IPython Cookbook](http://ipython-books.github.io/), by [Cyrille Rossant](http://cyrille.rossant.net), Packt Publishing, 2014 (500 pages)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.2" } }, "nbformat": 4, "nbformat_minor": 0 }