{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#Mining the Social Web, 2nd Edition\n", "\n", "##Chapter 3: Mining LinkedIn: Faceting Job Titles, Clustering Colleagues, and More\n", "\n", "This IPython Notebook provides an interactive way to follow along with and explore the numbered examples from [_Mining the Social Web (2nd Edition)_](http://bit.ly/135dHfs). The intent behind this notebook is to reinforce the concepts from the sample code in a fun, convenient, and effective way. This notebook assumes that you are reading along with the book and have the context of the discussion as you work through these exercises.\n", "\n", "In the somewhat unlikely event that you've somehow stumbled across this notebook outside of its context on GitHub, [you can find the full source code repository here](http://bit.ly/16kGNyb).\n", "\n", "## Copyright and Licensing\n", "\n", "You are free to use or adapt this notebook for any purpose you'd like. However, please respect the [Simplified BSD License](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/LICENSE.txt) that governs its use." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#LinkedIn API Access\n", "\n", "LinkedIn implements OAuth 2.0 as one of its standard authentication mechanisms, but still supports OAuth 1.0a, which provides you with four credentials (\"API Key\", \"Secret Key\", \"OAuth User Token\", and \"OAuth User Secret\") that can be used to gain instant API access with no further fuss or redirections. 
You can create an app and retrieve these four credentials through the \"Developer\" section of your account settings as shown below or by navigating directly to https://www.linkedin.com/secure/developer.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that you'll need to install a third-party package called python-linkedin to use the code in this notebook. Installing it with pip from the terminal is easy:\n", "\n", "\n", "$ pip install python-linkedin\n", "" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 1. Using LinkedIn OAuth credentials to receive an access token suitable for development and accessing your own data" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from linkedin import linkedin # pip install python-linkedin\n", "\n", "# Define CONSUMER_KEY, CONSUMER_SECRET, \n", "# USER_TOKEN, and USER_SECRET from the credentials \n", "# provided in your LinkedIn application\n", "\n", "CONSUMER_KEY = ''\n", "CONSUMER_SECRET = ''\n", "USER_TOKEN = ''\n", "USER_SECRET = ''\n", "\n", "RETURN_URL = '' # Not required for developer authentication\n", "\n", "# Instantiate the developer authentication class\n", "\n", "auth = linkedin.LinkedInDeveloperAuthentication(CONSUMER_KEY, CONSUMER_SECRET, \n", "                                                USER_TOKEN, USER_SECRET, \n", "                                                RETURN_URL, \n", "                                                permissions=linkedin.PERMISSIONS.enums.values())\n", "\n", "# Pass it in to the app...\n", "\n", "app = linkedin.LinkedInApplication(auth)\n", "\n", "# Use the app...\n", "\n", "app.get_profile()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 2. 
Retrieving your LinkedIn connections and storing them to disk" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import json\n", "\n", "connections = app.get_connections()\n", "\n", "connections_data = 'resources/ch03-linkedin/linkedin_connections.json'\n", "\n", "f = open(connections_data, 'w')\n", "f.write(json.dumps(connections, indent=1))\n", "f.close()\n", "\n", "# You can reuse the data without using the API later like this...\n", "# connections = json.loads(open(connections_data).read())" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Execute this cell if you need to reload data...\n", "import json\n", "connections = json.loads(open('resources/ch03-linkedin/linkedin_connections.json').read())" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note**: Should you need to revoke account access from your application or any other OAuth application, you can do so at [https://www.linkedin.com/secure/settings?userAgree=&goback=%2Enas_*1_*1_*1](https://www.linkedin.com/secure/settings?userAgree=&goback=%2Enas_*1_*1_*1)" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 3. Pretty-printing your LinkedIn connections' data" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from prettytable import PrettyTable # pip install prettytable\n", "\n", "pt = PrettyTable(field_names=['Name', 'Location'])\n", "pt.align = 'l'\n", "\n", "[ pt.add_row((c['firstName'] + ' ' + c['lastName'], c['location']['name'])) \n", "  for c in connections['values']\n", "      if c.has_key('location')]\n", "\n", "print pt" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 4. 
Displaying job position history for your profile and a connection's profile" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import json\n", "\n", "# See http://developer.linkedin.com/documents/profile-fields#fullprofile\n", "# for details on additional field selectors that can be passed in for\n", "# retrieving additional profile information.\n", "\n", "# Display your own positions...\n", "\n", "my_positions = app.get_profile(selectors=['positions'])\n", "print json.dumps(my_positions, indent=1)\n", "\n", "# Display positions for someone in your network...\n", "\n", "# Get an id for a connection. We'll just pick the first one.\n", "connection_id = connections['values'][0]['id']\n", "connection_positions = app.get_profile(member_id=connection_id, \n", " selectors=['positions'])\n", "print json.dumps(connection_positions, indent=1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 5. Using field selector syntax to request additional details for APIs" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# See http://developer.linkedin.com/documents/understanding-field-selectors\n", "# for more information on the field selector syntax\n", "\n", "my_positions = app.get_profile(selectors=['positions:(company:(name,industry,id))'])\n", "print json.dumps(my_positions, indent=1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 6. 
Simple normalization of company suffixes from address book data" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import os\n", "import csv\n", "from collections import Counter\n", "from operator import itemgetter\n", "from prettytable import PrettyTable\n", "\n", "# XXX: Place your \"Outlook CSV\" formatted file of connections from \n", "# http://www.linkedin.com/people/export-settings at the following\n", "# location: resources/ch03-linkedin/my_connections.csv\n", "\n", "CSV_FILE = os.path.join(\"resources\", \"ch03-linkedin\", 'my_connections.csv')\n", "\n", "# Define a set of transforms that converts the first item\n", "# to the second item. Here, we're simply handling some\n", "# commonly known abbreviations, stripping off common suffixes, \n", "# etc.\n", "\n", "transforms = [(', Inc.', ''), (', Inc', ''), (', LLC', ''), (', LLP', ''),\n", "              (' LLC', ''), (' Inc.', ''), (' Inc', '')]\n", "\n", "csvReader = csv.DictReader(open(CSV_FILE), delimiter=',', quotechar='\"')\n", "contacts = [row for row in csvReader]\n", "companies = [c['Company'].strip() for c in contacts if c['Company'].strip() != '']\n", "\n", "for i, _ in enumerate(companies):\n", "    for transform in transforms:\n", "        companies[i] = companies[i].replace(*transform)\n", "\n", "pt = PrettyTable(field_names=['Company', 'Freq'])\n", "pt.align = 'l'\n", "c = Counter(companies)\n", "[pt.add_row([company, freq]) \n", " for (company, freq) in sorted(c.items(), key=itemgetter(1), reverse=True) \n", "     if freq > 1]\n", "print pt" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 7. 
Standardizing common job titles and computing their frequencies" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import os\n", "import csv\n", "from operator import itemgetter\n", "from collections import Counter\n", "from prettytable import PrettyTable\n", "\n", "# XXX: Place your \"Outlook CSV\" formatted file of connections from \n", "# http://www.linkedin.com/people/export-settings at the following\n", "# location: resources/ch03-linkedin/my_connections.csv\n", "\n", "CSV_FILE = os.path.join(\"resources\", \"ch03-linkedin\", 'my_connections.csv')\n", "\n", "transforms = [\n", "    ('Sr.', 'Senior'),\n", "    ('Sr', 'Senior'),\n", "    ('Jr.', 'Junior'),\n", "    ('Jr', 'Junior'),\n", "    ('CEO', 'Chief Executive Officer'),\n", "    ('COO', 'Chief Operating Officer'),\n", "    ('CTO', 'Chief Technology Officer'),\n", "    ('CFO', 'Chief Financial Officer'),\n", "    ('VP', 'Vice President'),\n", "    ]\n", "\n", "csvReader = csv.DictReader(open(CSV_FILE), delimiter=',', quotechar='\"')\n", "contacts = [row for row in csvReader]\n", "\n", "# Read in a list of titles and split apart\n", "# any combined titles like \"President/CEO.\"\n", "# Other variations could be handled as well, such\n", "# as \"President & CEO\", \"President and CEO\", etc.\n", "\n", "titles = []\n", "for contact in contacts:\n", "    titles.extend([t.strip() for t in contact['Job Title'].split('/')\n", "                  if contact['Job Title'].strip() != ''])\n", "\n", "# Replace common/known abbreviations\n", "\n", "for i, _ in enumerate(titles):\n", "    for transform in transforms:\n", "        titles[i] = titles[i].replace(*transform)\n", "\n", "# Print out a table of titles sorted by frequency\n", "\n", "pt = PrettyTable(field_names=['Title', 'Freq'])\n", "pt.align = 'l'\n", "c = Counter(titles)\n", "[pt.add_row([title, freq]) \n", " for (title, freq) in sorted(c.items(), key=itemgetter(1), reverse=True) \n", "     if freq > 1]\n", "print pt\n", "\n", "# Print out a table of tokens sorted by frequency\n", "\n", "tokens = []\n", "for title in titles:\n", "    tokens.extend([t.strip(',') for t in title.split()])\n", "pt = PrettyTable(field_names=['Token', 'Freq'])\n", "pt.align = 'l'\n", "c = Counter(tokens)\n", "[pt.add_row([token, freq]) \n", " for (token, freq) in sorted(c.items(), key=itemgetter(1), reverse=True) \n", "     if freq > 1 and len(token) > 2]\n", "print pt" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 8. Geocoding locations with Microsoft Bing" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from geopy import geocoders\n", "\n", "GEO_APP_KEY = '' # XXX: Get this from https://www.bingmapsportal.com\n", "g = geocoders.Bing(GEO_APP_KEY)\n", "print g.geocode(\"Nashville\", exactly_one=False)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 9. Geocoding locations of LinkedIn connections with Microsoft Bing" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from geopy import geocoders\n", "\n", "GEO_APP_KEY = '' # XXX: Get this from https://www.bingmapsportal.com\n", "g = geocoders.Bing(GEO_APP_KEY)\n", "\n", "transforms = [('Greater ', ''), (' Area', '')]\n", "\n", "results = {}\n", "for c in connections['values']:\n", "    if not c.has_key('location'): continue\n", "    \n", "    transformed_location = c['location']['name']\n", "    for transform in transforms:\n", "        transformed_location = transformed_location.replace(*transform)\n", "    geo = g.geocode(transformed_location, exactly_one=False)\n", "    if not geo: continue # Skip empty or missing results\n", "    results.update({ c['location']['name'] : geo })\n", "    \n", "print json.dumps(results, indent=1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 10. 
Parsing out states from Bing geocoder results using a regular expression" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import re\n", "\n", "# Most results contain a response that can be parsed by\n", "# picking out two consecutive upper case letters as a clue\n", "# for the state. Because the leading .* is greedy, the last\n", "# such pair in the string is the one that gets captured.\n", "pattern = re.compile('.*([A-Z]{2}).*')\n", " \n", "def parseStateFromBingResult(r):\n", "    result = pattern.search(r[0][0])\n", "    if result is None: \n", "        print \"Unresolved match:\", r\n", "        return \"???\"\n", "    elif len(result.groups()) == 1:\n", "        print result.groups()\n", "        return result.groups()[0]\n", "    else:\n", "        print \"Unresolved match:\", result.groups()\n", "        return \"???\"\n", "\n", " \n", "transforms = [('Greater ', ''), (' Area', '')]\n", "\n", "results = {}\n", "for c in connections['values']:\n", "    if not c.has_key('location'): continue\n", "    if not c['location']['country']['code'] == 'us': continue\n", "    \n", "    transformed_location = c['location']['name']\n", "    for transform in transforms:\n", "        transformed_location = transformed_location.replace(*transform)\n", "    \n", "    geo = g.geocode(transformed_location, exactly_one=False)\n", "    if not geo: continue # Skip empty or missing results\n", "    parsed_state = parseStateFromBingResult(geo)\n", "    if parsed_state != \"???\":\n", "        results.update({c['location']['name'] : parsed_state})\n", "    \n", "print json.dumps(results, indent=1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Here's how to power a Cartogram visualization with the data from the \"results\" variable**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import os\n", "import json\n", "from IPython.display import IFrame\n", "from IPython.core.display import display\n", "\n", "# Load in a data structure mapping state names to codes.\n", "# e.g. 
West Virginia is WV\n", "codes = json.loads(open('resources/ch03-linkedin/viz/states-codes.json').read())\n", "\n", "from collections import Counter\n", "c = Counter([r[1] for r in results.items()])\n", "states_freqs = { codes[k] : v for (k,v) in c.items() }\n", "\n", "# Lace in all of the other states and provide a minimum value for each of them\n", "states_freqs.update({v : 0.5 for v in codes.values() if v not in states_freqs.keys() })\n", "\n", "# Write output to file\n", "f = open('resources/ch03-linkedin/viz/states-freqs.json', 'w')\n", "f.write(json.dumps(states_freqs, indent=1))\n", "f.close()\n", "\n", "# IPython Notebook can serve files and display them into\n", "# inline frames. Prepend the path with the 'files' prefix\n", "\n", "display(IFrame('files/resources/ch03-linkedin/viz/cartogram.html', '100%', '600px'))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 11. Using NLTK to compute bigrams" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import nltk # pip install nltk\n", "\n", "# Wrap each call in list() so the bigrams can be printed and reused\n", "# whether nltk.bigrams returns a list or a generator\n", "ceo_bigrams = list(nltk.bigrams(\"Chief Executive Officer\".split(), pad_right=True, \n", "                                pad_left=True))\n", "cto_bigrams = list(nltk.bigrams(\"Chief Technology Officer\".split(), pad_right=True, \n", "                                pad_left=True))\n", "\n", "print ceo_bigrams\n", "print cto_bigrams\n", "print len(set(ceo_bigrams).intersection(set(cto_bigrams)))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 12. 
Clustering job titles using a greedy heuristic" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import os\n", "import csv\n", "from nltk.metrics.distance import jaccard_distance\n", "\n", "# XXX: Place your \"Outlook CSV\" formatted file of connections from \n", "# http://www.linkedin.com/people/export-settings at the following\n", "# location: resources/ch03-linkedin/my_connections.csv\n", "\n", "CSV_FILE = os.path.join(\"resources\", \"ch03-linkedin\", 'my_connections.csv')\n", "\n", "# Tweak this distance threshold and try different distance calculations \n", "# during experimentation\n", "DISTANCE_THRESHOLD = 0.5\n", "DISTANCE = jaccard_distance\n", "\n", "def cluster_contacts_by_title(csv_file):\n", "\n", "    transforms = [\n", "        ('Sr.', 'Senior'),\n", "        ('Sr', 'Senior'),\n", "        ('Jr.', 'Junior'),\n", "        ('Jr', 'Junior'),\n", "        ('CEO', 'Chief Executive Officer'),\n", "        ('COO', 'Chief Operating Officer'),\n", "        ('CTO', 'Chief Technology Officer'),\n", "        ('CFO', 'Chief Financial Officer'),\n", "        ('VP', 'Vice President'),\n", "        ]\n", "\n", "    separators = ['/', 'and', '&']\n", "\n", "    csvReader = csv.DictReader(open(csv_file), delimiter=',', quotechar='\"')\n", "    contacts = [row for row in csvReader]\n", "\n", "    # Normalize and/or replace known abbreviations\n", "    # and build up a list of common titles.\n", "\n", "    all_titles = []\n", "    for i, _ in enumerate(contacts):\n", "        if contacts[i]['Job Title'] == '':\n", "            contacts[i]['Job Titles'] = ['']\n", "            continue\n", "        titles = [contacts[i]['Job Title']]\n", "        for title in titles:\n", "            for separator in separators:\n", "                if title.find(separator) >= 0:\n", "                    titles.remove(title)\n", "                    titles.extend([title.strip() for title in title.split(separator)\n", "                                  if title.strip() != ''])\n", "\n", "        for transform in transforms:\n", "            titles = [title.replace(*transform) for title in titles]\n", "        contacts[i]['Job Titles'] = titles\n", "        all_titles.extend(titles)\n", "\n", "    all_titles = list(set(all_titles))\n", "\n", "    clusters = {}\n", "    for title1 in all_titles:\n", "        clusters[title1] = []\n", "        for title2 in all_titles:\n", "            if title2 in clusters[title1] or clusters.has_key(title2) and title1 \\\n", "                in clusters[title2]:\n", "                continue\n", "            distance = DISTANCE(set(title1.split()), set(title2.split()))\n", "\n", "            if distance < DISTANCE_THRESHOLD:\n", "                clusters[title1].append(title2)\n", "\n", "    # Flatten out clusters\n", "\n", "    clusters = [clusters[title] for title in clusters if len(clusters[title]) > 1]\n", "\n", "    # Round up contacts who are in these clusters and group them together\n", "\n", "    clustered_contacts = {}\n", "    for cluster in clusters:\n", "        clustered_contacts[tuple(cluster)] = []\n", "        for contact in contacts:\n", "            for title in contact['Job Titles']:\n", "                if title in cluster:\n", "                    clustered_contacts[tuple(cluster)].append('%s %s'\n", "                            % (contact['First Name'], contact['Last Name']))\n", "\n", "    return clustered_contacts\n", "\n", "\n", "clustered_contacts = cluster_contacts_by_title(CSV_FILE)\n", "print clustered_contacts\n", "for titles in clustered_contacts:\n", "    common_titles_heading = 'Common Titles: ' + ', '.join(titles)\n", "\n", "    descriptive_terms = set(titles[0].split())\n", "    for title in titles:\n", "        descriptive_terms.intersection_update(set(title.split()))\n", "    descriptive_terms_heading = 'Descriptive Terms: ' \\\n", "        + ', '.join(descriptive_terms)\n", "    print descriptive_terms_heading\n", "    print '-' * max(len(descriptive_terms_heading), len(common_titles_heading))\n", "    print '\\n'.join(clustered_contacts[titles])\n", "    print" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Incorporating random sampling can improve performance of the nested loops in Example 12**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import os\n", "import csv\n", "import random\n", "from nltk.metrics.distance import jaccard_distance\n", "\n", "# XXX: Place your \"Outlook CSV\" 
formatted file of connections from \n", "# http://www.linkedin.com/people/export-settings at the following\n", "# location: resources/ch03-linkedin/my_connections.csv\n", "\n", "CSV_FILE = os.path.join(\"resources\", \"ch03-linkedin\", 'my_connections.csv')\n", "\n", "# Tweak this distance threshold and try different distance calculations \n", "# during experimentation\n", "DISTANCE_THRESHOLD = 0.5\n", "DISTANCE = jaccard_distance\n", "\n", "# Adjust sample size as needed to reduce the runtime of the\n", "# nested loop that invokes the DISTANCE function\n", "SAMPLE_SIZE = 500\n", "\n", "def cluster_contacts_by_title(csv_file):\n", "\n", "    transforms = [\n", "        ('Sr.', 'Senior'),\n", "        ('Sr', 'Senior'),\n", "        ('Jr.', 'Junior'),\n", "        ('Jr', 'Junior'),\n", "        ('CEO', 'Chief Executive Officer'),\n", "        ('COO', 'Chief Operating Officer'),\n", "        ('CTO', 'Chief Technology Officer'),\n", "        ('CFO', 'Chief Financial Officer'),\n", "        ('VP', 'Vice President'),\n", "        ]\n", "\n", "    separators = ['/', 'and', '&']\n", "\n", "    csvReader = csv.DictReader(open(csv_file), delimiter=',', quotechar='\"')\n", "    contacts = [row for row in csvReader]\n", "\n", "    # Normalize and/or replace known abbreviations\n", "    # and build up list of common titles\n", "\n", "    all_titles = []\n", "    for i, _ in enumerate(contacts):\n", "        if contacts[i]['Job Title'] == '':\n", "            contacts[i]['Job Titles'] = ['']\n", "            continue\n", "        titles = [contacts[i]['Job Title']]\n", "        for title in titles:\n", "            for separator in separators:\n", "                if title.find(separator) >= 0:\n", "                    titles.remove(title)\n", "                    titles.extend([title.strip() for title in title.split(separator)\n", "                                  if title.strip() != ''])\n", "\n", "        for transform in transforms:\n", "            titles = [title.replace(*transform) for title in titles]\n", "        contacts[i]['Job Titles'] = titles\n", "        all_titles.extend(titles)\n", "\n", "    all_titles = list(set(all_titles))\n", "    clusters = {}\n", "    for title1 in all_titles:\n", "        clusters[title1] = []\n", "        for sample in xrange(SAMPLE_SIZE):\n", "            title2 = all_titles[random.randint(0, len(all_titles)-1)]\n", "            if title2 in clusters[title1] or clusters.has_key(title2) and title1 \\\n", "                in clusters[title2]:\n", "                continue\n", "            distance = DISTANCE(set(title1.split()), set(title2.split()))\n", "            if distance < DISTANCE_THRESHOLD:\n", "                clusters[title1].append(title2)\n", "\n", "    # Flatten out clusters\n", "\n", "    clusters = [clusters[title] for title in clusters if len(clusters[title]) > 1]\n", "\n", "    # Round up contacts who are in these clusters and group them together\n", "\n", "    clustered_contacts = {}\n", "    for cluster in clusters:\n", "        clustered_contacts[tuple(cluster)] = []\n", "        for contact in contacts:\n", "            for title in contact['Job Titles']:\n", "                if title in cluster:\n", "                    clustered_contacts[tuple(cluster)].append('%s %s'\n", "                            % (contact['First Name'], contact['Last Name']))\n", "\n", "    return clustered_contacts\n", "\n", "\n", "clustered_contacts = cluster_contacts_by_title(CSV_FILE)\n", "print clustered_contacts\n", "for titles in clustered_contacts:\n", "    common_titles_heading = 'Common Titles: ' + ', '.join(titles)\n", "\n", "    descriptive_terms = set(titles[0].split())\n", "    for title in titles:\n", "        descriptive_terms.intersection_update(set(title.split()))\n", "    descriptive_terms_heading = 'Descriptive Terms: ' \\\n", "        + ', '.join(descriptive_terms)\n", "    print descriptive_terms_heading\n", "    print '-' * max(len(descriptive_terms_heading), len(common_titles_heading))\n", "    print '\\n'.join(clustered_contacts[titles])\n", "    print" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**How to export data (contained in the \"clustered_contacts\" variable) to power faceted display as outlined in Figure 3.**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import json\n", "import os\n", "from IPython.display import IFrame\n", "from IPython.core.display import display\n", "\n", "data = {\"label\" : \"name\", \"temp_items\" : {}, \"items\" : []} \n", "for titles in clustered_contacts:\n", "    descriptive_terms = set(titles[0].split())\n", "    for title in titles:\n", "        descriptive_terms.intersection_update(set(title.split()))\n", "    descriptive_terms = ', '.join(descriptive_terms)\n", "\n", "    if data['temp_items'].has_key(descriptive_terms):\n", "        data['temp_items'][descriptive_terms].extend([{'name' : cc } for cc \n", "            in clustered_contacts[titles]])\n", "    else:\n", "        data['temp_items'][descriptive_terms] = [{'name' : cc } for cc \n", "            in clustered_contacts[titles]]\n", "\n", "for descriptive_terms in data['temp_items']:\n", "    data['items'].append({\"name\" : \"%s (%s)\" % (descriptive_terms, \n", "        len(data['temp_items'][descriptive_terms]),),\n", "        \"children\" : [i for i in \n", "            data['temp_items'][descriptive_terms]]})\n", "\n", "del data['temp_items']\n", "\n", "# Open the template and substitute the data\n", "\n", "TEMPLATE = 'resources/ch03-linkedin/viz/dojo_tree.html.template' \n", "OUT = 'resources/ch03-linkedin/viz/dojo_tree.html'\n", "\n", "viz_file = 'files/resources/ch03-linkedin/viz/dojo_tree.html'\n", "\n", "t = open(TEMPLATE).read()\n", "f = open(OUT, 'w')\n", "f.write(t % json.dumps(data, indent=4))\n", "f.close()\n", "\n", "# IPython Notebook can serve files and display them into\n", "# inline frames. 
Prepend the path with the 'files' prefix\n", "\n", "display(IFrame(viz_file, '400px', '600px'))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**How to export data to power a dendrogram and node-link tree visualization as outlined in Figure 4.**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import os\n", "import csv\n", "import json\n", "import random\n", "from nltk.metrics.distance import jaccard_distance\n", "from cluster import HierarchicalClustering\n", "\n", "# XXX: Place your \"Outlook CSV\" formatted file of connections from \n", "# http://www.linkedin.com/people/export-settings at the following\n", "# location: resources/ch03-linkedin/my_connections.csv\n", "\n", "CSV_FILE = os.path.join(\"resources\", \"ch03-linkedin\", 'my_connections.csv')\n", "\n", "OUT_FILE = 'resources/ch03-linkedin/viz/d3-data.json'\n", "\n", "# Tweak this distance threshold and try different distance calculations \n", "# during experimentation\n", "DISTANCE_THRESHOLD = 0.5\n", "DISTANCE = jaccard_distance\n", "\n", "# Adjust sample size as needed to reduce the runtime of the\n", "# nested loop that invokes the DISTANCE function\n", "SAMPLE_SIZE = 500\n", "\n", "def cluster_contacts_by_title(csv_file):\n", "\n", "    transforms = [\n", "        ('Sr.', 'Senior'),\n", "        ('Sr', 'Senior'),\n", "        ('Jr.', 'Junior'),\n", "        ('Jr', 'Junior'),\n", "        ('CEO', 'Chief Executive Officer'),\n", "        ('COO', 'Chief Operating Officer'),\n", "        ('CTO', 'Chief Technology Officer'),\n", "        ('CFO', 'Chief Financial Officer'),\n", "        ('VP', 'Vice President'),\n", "        ]\n", "\n", "    separators = ['/', 'and', '&']\n", "\n", "    csvReader = csv.DictReader(open(csv_file), delimiter=',', quotechar='\"')\n", "    contacts = [row for row in csvReader]\n", "\n", "    # Normalize and/or replace known abbreviations\n", "    # and build up list of common titles\n", "\n", "    all_titles = []\n", "    for i, _ in enumerate(contacts):\n", "        if contacts[i]['Job Title'] == '':\n", "            contacts[i]['Job Titles'] = ['']\n", "            continue\n", "        titles = [contacts[i]['Job Title']]\n", "        for title in titles:\n", "            for separator in separators:\n", "                if title.find(separator) >= 0:\n", "                    titles.remove(title)\n", "                    titles.extend([title.strip() for title in title.split(separator)\n", "                                  if title.strip() != ''])\n", "\n", "        for transform in transforms:\n", "            titles = [title.replace(*transform) for title in titles]\n", "        contacts[i]['Job Titles'] = titles\n", "        all_titles.extend(titles)\n", "\n", "    all_titles = list(set(all_titles))\n", "    \n", "    # Define a scoring function\n", "    def score(title1, title2): \n", "        return DISTANCE(set(title1.split()), set(title2.split()))\n", "\n", "    # Feed the class your data and the scoring function\n", "    hc = HierarchicalClustering(all_titles, score)\n", "\n", "    # Cluster the data according to a distance threshold\n", "    clusters = hc.getlevel(DISTANCE_THRESHOLD)\n", "\n", "    # Remove singleton clusters\n", "    clusters = [c for c in clusters if len(c) > 1]\n", "\n", "    # Round up contacts who are in these clusters and group them together\n", "\n", "    clustered_contacts = {}\n", "    for cluster in clusters:\n", "        clustered_contacts[tuple(cluster)] = []\n", "        for contact in contacts:\n", "            for title in contact['Job Titles']:\n", "                if title in cluster:\n", "                    clustered_contacts[tuple(cluster)].append('%s %s'\n", "                            % (contact['First Name'], contact['Last Name']))\n", "\n", "    return clustered_contacts\n", "\n", "def display_output(clustered_contacts):\n", "    \n", "    for titles in clustered_contacts:\n", "        common_titles_heading = 'Common Titles: ' + ', '.join(titles)\n", "\n", "        descriptive_terms = set(titles[0].split())\n", "        for title in titles:\n", "            descriptive_terms.intersection_update(set(title.split()))\n", "        descriptive_terms_heading = 'Descriptive Terms: ' \\\n", "            + ', '.join(descriptive_terms)\n", "        print descriptive_terms_heading\n", "        print '-' * max(len(descriptive_terms_heading), len(common_titles_heading))\n", "        print '\\n'.join(clustered_contacts[titles])\n", "        print\n", "\n", "def write_d3_json_output(clustered_contacts):\n", "    \n", "    json_output = {'name' : 'My LinkedIn', 'children' : []}\n", "\n", "    for titles in clustered_contacts:\n", "\n", "        descriptive_terms = set(titles[0].split())\n", "        for title in titles:\n", "            descriptive_terms.intersection_update(set(title.split()))\n", "\n", "        json_output['children'].append({'name' : ', '.join(descriptive_terms)[:30], \n", "                                    'children' : [ {'name' : c.decode('utf-8', 'replace')} for c in clustered_contacts[titles] ] } )\n", "    \n", "    f = open(OUT_FILE, 'w')\n", "    f.write(json.dumps(json_output, indent=1))\n", "    f.close()\n", "    \n", "clustered_contacts = cluster_contacts_by_title(CSV_FILE)\n", "display_output(clustered_contacts)\n", "write_d3_json_output(clustered_contacts)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Once you've run the code and produced the output for the dendrogram and node-link tree visualizations, here's one way to serve it.**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import os\n", "from IPython.display import IFrame\n", "from IPython.core.display import display\n", "\n", "# IPython Notebook can serve files and display them into\n", "# inline frames. Prepend the path with the 'files' prefix\n", "\n", "viz_file = 'files/resources/ch03-linkedin/viz/node_link_tree.html'\n", "\n", "# XXX: Another visualization you could try:\n", "#viz_file = 'files/resources/ch03-linkedin/viz/dendogram.html'\n", "\n", "display(IFrame(viz_file, '100%', '600px'))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 13. 
Clustering your LinkedIn professional network based upon the locations of your connections and emitting KML output for visualization with Google Earth" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import os\n", "import sys\n", "import json\n", "from urllib2 import HTTPError\n", "from geopy import geocoders\n", "from cluster import KMeansClustering, centroid\n", "\n", "# A helper function to munge data and build up an XML tree.\n", "# It references some code tucked away in another directory, so we have to\n", "# add that directory to the PYTHONPATH for it to be picked up.\n", "sys.path.append(os.path.join(os.getcwd(), \"resources\", \"ch03-linkedin\"))\n", "from linkedin__kml_utility import createKML\n", "\n", "# XXX: Try different values for K to see the difference in clusters that emerge\n", "\n", "K = 3\n", "\n", "# XXX: Get an API key and pass it in here. See https://www.bingmapsportal.com.\n", "GEO_API_KEY = ''\n", "g = geocoders.Bing(GEO_API_KEY)\n", "\n", "# Load this data from where you've previously stored it\n", "\n", "CONNECTIONS_DATA = 'resources/ch03-linkedin/linkedin_connections.json'\n", "\n", "OUT_FILE = \"resources/ch03-linkedin/viz/linkedin_clusters_kmeans.kml\"\n", "\n", "# Open up your saved connections with extended profile information\n", "# or fetch them again from LinkedIn if you prefer\n", "\n", "connections = json.loads(open(CONNECTIONS_DATA).read())['values']\n", "\n", "locations = [c['location']['name'] for c in connections if c.has_key('location')]\n", "\n", "# Some basic transforms may be necessary for geocoding services to function properly\n", "# Here are a couple that seem to help.\n", "\n", "transforms = [('Greater ', ''), (' Area', '')]\n", "\n", "# Step 1 - Tally the frequency of each location\n", "\n", "coords_freqs = {}\n", "for location in locations:\n", "\n", "    # Avoid unnecessary I/O and geo requests by building up a cache\n", "\n", "    if coords_freqs.has_key(location):\n", "        coords_freqs[location][1] += 1\n", "        continue\n", "    transformed_location = location\n", "\n", "    for transform in transforms:\n", "        transformed_location = transformed_location.replace(*transform)\n", "    \n", "    # Handle potential I/O errors with a retry pattern. The error\n", "    # counter is initialized once per location, outside the retry loop,\n", "    # so that the limit of three attempts can actually be reached.\n", "    \n", "    num_errors = 0\n", "    while True:\n", "        try:\n", "            results = g.geocode(transformed_location, exactly_one=False)\n", "            break\n", "        except HTTPError, e:\n", "            num_errors += 1\n", "            if num_errors >= 3:\n", "                sys.exit()\n", "            print >> sys.stderr, e\n", "            print >> sys.stderr, 'Encountered an urllib2 error. Trying again...'\n", "    \n", "    for result in results or []:\n", "        # Each result is of the form (\"Description\", (X,Y))\n", "        coords_freqs[location] = [result[1], 1]\n", "        break # Disambiguation strategy is \"pick first\"\n", "\n", "# Step 2 - Build up data structure for converting locations to KML \n", " \n", "# Here, you could optionally segment locations by continent or country\n", "# so as to avoid potentially finding a mean in the middle of the ocean.\n", "# The k-means algorithm will expect distinct points for each contact, so\n", "# build out an expanded list to pass it.\n", "\n", "expanded_coords = []\n", "for label in coords_freqs:\n", "    # Flip lat/lon for Google Earth\n", "    ((lat, lon), f) = coords_freqs[label]\n", "    expanded_coords.append((label, [(lon, lat)] * f))\n", "\n", "# No need to clutter the map with unnecessary placemarks...\n", "\n", "kml_items = [{'label': label, 'coords': '%s,%s' % coords[0]} for (label,\n", "             coords) in expanded_coords]\n", "\n", "# It would also be helpful to include names of your contacts on the map\n", "\n", "for item in kml_items:\n", "    item['contacts'] = '\\n'.join(['%s %s.' 
% (c['firstName'], c['lastName'])\n", "        for c in connections if c.has_key('location') and \n", "                                c['location']['name'] == item['label']])\n", "\n", "# Step 3 - Cluster locations and extend the KML data structure with centroids\n", " \n", "cl = KMeansClustering([coords for (label, coords_list) in expanded_coords\n", "                       for coords in coords_list])\n", "\n", "centroids = [{'label': 'CENTROID', 'coords': '%s,%s' % centroid(c)} for c in\n", "             cl.getclusters(K)]\n", "\n", "kml_items.extend(centroids)\n", "\n", "# Step 4 - Create the final KML output and write it to a file\n", "\n", "kml = createKML(kml_items)\n", "\n", "f = open(OUT_FILE, 'w')\n", "f.write(kml)\n", "f.close()\n", "\n", "print 'Data written to ' + OUT_FILE" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }