{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Mining the Social Web\n",
    "\n",
    "## Mining LinkedIn\n",
    "\n",
    "This Jupyter Notebook provides an interactive way to follow along with and explore the examples from the book or video series. The intent behind this notebook is to reinforce the concepts in a fun, convenient, and effective way.\n",
    "\n",
    "This code has been modified from the original book or video version in order to try to keep pace with developments at LinkedIn. Social media companies continue to modify their APIs and install additional privacy and security measures. Unfortunately, that means it's harder for a casual developer to begin exploring. Some of the code in this chapter requires higher-level permissions in order to run correctly. We have tried to keep the examples as accessible as possible."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# LinkedIn API Access\n",
    "\n",
    "LinkedIn implements OAuth 2.0 as one of its standard authentication mechanisms, but still supports OAuth 1.0a, which provides you with four credentials (\"API Key\", \"Secret Key\", \"OAuth User Token\", and \"OAuth User Secret\") that can be used to gain instant API access with no further fuss or redirections. You can create an app and retrieve these four credentials through the \"Developer\" section of your account settings as shown below or by navigating directly to https://www.linkedin.com/secure/developer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using LinkedIn OAuth credentials to receive an access token suitable for development and accessing your own data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "import string\n",
    "import random\n",
    "\n",
    "# Remember to first declare and setup a new application on the\n",
    "# LinkedIn Developer Console: https://www.linkedin.com/developer/apps\n",
    "# Copy the client ID, secret, and redirect URI in the fields below\n",
    "CLIENT_ID    = ''\n",
    "CLIENT_SECRET = ''\n",
    "REDIRECT_URI = 'http://localhost:8888'\n",
    "\n",
    "\n",
    "# Generate a random string to protect against cross-site request forgery\n",
    "letters = string.ascii_lowercase\n",
    "CSRF_TOKEN = ''.join(random.choice(letters) for i in range(24))\n",
    "\n",
    "\n",
    "# Request authentication URL\n",
    "auth_params = {'response_type': 'code',\n",
    "               'client_id': CLIENT_ID,\n",
    "               'redirect_uri': REDIRECT_URI,\n",
    "               'state': CSRF_TOKEN,\n",
    "               'scope': 'r_liteprofile,r_emailaddress,w_member_social'}\n",
    "\n",
    "html = requests.get(\"https://www.linkedin.com/oauth/v2/authorization\",\n",
    "                    params = auth_params)\n",
    "\n",
    "# Print the link to the approval page\n",
    "print(html.url)\n",
    "\n",
    "# Click the link below to be taken to your redirect page."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect the address bar of your browser once you reach your redirect page.\n",
    "# Copy the code after '&code=...', but don't include '&state=...'\n",
    "# Then paste it here:\n",
    "AUTH_CODE =''\n",
    "\n",
    "ACCESS_TOKEN_URL = 'https://www.linkedin.com/oauth/v2/accessToken'\n",
    "\n",
    "qd = {'grant_type': 'authorization_code',\n",
    "      'code': AUTH_CODE,\n",
    "      'redirect_uri': REDIRECT_URI,\n",
    "      'client_id': CLIENT_ID,\n",
    "      'client_secret': CLIENT_SECRET}\n",
    "\n",
    "response = requests.post(ACCESS_TOKEN_URL, data=qd, timeout=60)\n",
    "\n",
    "response = response.json()\n",
    "\n",
    "access_token = response['access_token']\n",
    "\n",
    "print (\"Access Token:\", access_token)\n",
    "print (\"Expires in (seconds):\", response['expires_in'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Retrieving your LinkedIn profile\n",
    "\n",
    "The LinkedIn API limits which data can be queried, e.g. you can no longer use the API to retrieve your list of connections on LinkedIn.\n",
    "\n",
    "Read about the changes made to LinkedIn's API in February 2015:\n",
    "\n",
    "https://developer.linkedin.com/blog/posts/2015/developer-program-changes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make HTTP request to retrieve personal profile\n",
    "import json\n",
    "params = {'oauth2_access_token': access_token}\n",
    "response = requests.get('https://api.linkedin.com/v2/me', params = params)\n",
    "\n",
    "print(json.dumps(response.json(), indent=1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Displaying specific fields\n",
    "\n",
    "See https://developer.linkedin.com/docs/fields/positions for details on additional field selectors that can be passed in for retrieving additional profile information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "# Make HTTP request to retrieve personal profile\n",
    "\n",
    "params = {'oauth2_access_token': access_token,\n",
    "          'fields': [\"localizedFirstName,localizedLastName,id\"]}\n",
    "response = requests.get('https://api.linkedin.com/v2/me', params = params)\n",
    "\n",
    "print(json.dumps(response.json(), indent=1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using field selector syntax to request additional details for APIs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# See https://developer.linkedin.com/docs/fields\n",
    "# for more information on the field selector syntax\n",
    "import json\n",
    "\n",
    "params = {'oauth2_access_token': access_token,\n",
    "          'fields': ['lastName:(preferredLocale:(country,language))']}\n",
    "response = requests.get('https://api.linkedin.com/v2/me', params = params)\n",
    "\n",
    "print(json.dumps(response.json(), indent=1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get access to more API endpoints, you need to join the LinkedIn Partner Program. We do not cover that here, but for more information visit:\n",
    "\n",
    "https://developer.linkedin.com/partner-programs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download your profile data and read in connections data as a CSV file\n",
    "Go download your LinkedIn data here:\n",
    "https://www.linkedin.com/psettings/member-data\n",
    "\n",
    "Once requested, LinkedIn will prepare an archive of your profile data, which you can then download. Place the contents of the archive in a subfolder, e.g. 'data'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import csv\n",
    "\n",
    "# Point this to your 'Connections.csv' file.\n",
    "CSV_FILE = os.path.join('resources', 'ch04-linkedin', 'Connections.csv')\n",
    "\n",
    "csvReader = csv.DictReader(open(CSV_FILE), delimiter=',', quotechar='\"')\n",
    "contacts = [row for row in csvReader]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Simple normalization of company suffixes from address book data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from prettytable import PrettyTable # pip install prettytable\n",
    "from collections import Counter\n",
    "from operator import itemgetter\n",
    "\n",
    "# Define a set of transforms that converts the first item\n",
    "# to the second item. Here, we're simply handling some\n",
    "# commonly known abbreviations, stripping off common suffixes, \n",
    "# etc.\n",
    "\n",
    "transforms = [(', Inc.', ''), (', Inc', ''), (', LLC', ''), (', LLP', ''),\n",
    "               (' LLC', ''), (' Inc.', ''), (' Inc', '')]\n",
    "\n",
    "companies = [c['Company'].strip() for c in contacts if c['Company'].strip() != '']\n",
    "\n",
    "for i, _ in enumerate(companies):\n",
    "    for transform in transforms:\n",
    "        companies[i] = companies[i].replace(*transform)\n",
    "\n",
    "pt = PrettyTable(field_names=['Company', 'Freq'])\n",
    "pt.align = 'l'\n",
    "c = Counter(companies)\n",
    "\n",
    "[pt.add_row([company, freq]) for (company, freq) in sorted(c.items(), key=itemgetter(1), reverse=True) if freq > 1]\n",
    "\n",
    "print(pt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Standardizing common job titles and computing their frequencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "transforms = [\n",
    "    ('Sr.', 'Senior'),\n",
    "    ('Sr', 'Senior'),\n",
    "    ('Jr.', 'Junior'),\n",
    "    ('Jr', 'Junior'),\n",
    "    ('CEO', 'Chief Executive Officer'),\n",
    "    ('COO', 'Chief Operating Officer'),\n",
    "    ('CTO', 'Chief Technology Officer'),\n",
    "    ('CFO', 'Chief Finance Officer'),\n",
    "    ('VP', 'Vice President'),\n",
    "    ]\n",
    "\n",
    "# Read in a list of titles and split apart\n",
    "# any combined titles like \"President/CEO.\"\n",
    "# Other variations could be handled as well, such\n",
    "# as \"President & CEO\", \"President and CEO\", etc.\n",
    "\n",
    "titles = []\n",
    "for contact in contacts:\n",
    "    titles.extend([t.strip() for t in contact['Position'].split('/')\n",
    "                  if contact['Position'].strip() != ''])\n",
    "\n",
    "# Replace common/known abbreviations\n",
    "\n",
    "for i, _ in enumerate(titles):\n",
    "    for transform in transforms:\n",
    "        titles[i] = titles[i].replace(*transform)\n",
    "\n",
    "# Print out a table of titles sorted by frequency\n",
    "\n",
    "pt = PrettyTable(field_names=['Job Title', 'Freq'])\n",
    "pt.align = 'l'\n",
    "c = Counter(titles)\n",
    "[pt.add_row([title, freq]) \n",
    " for (title, freq) in sorted(c.items(), key=itemgetter(1), reverse=True) \n",
    "     if freq > 1]\n",
    "print(pt)\n",
    "\n",
    "# Print out a table of tokens sorted by frequency\n",
    "\n",
    "tokens = []\n",
    "for title in titles:\n",
    "    tokens.extend([t.strip(',') for t in title.split()])\n",
    "pt = PrettyTable(field_names=['Token', 'Freq'])\n",
    "pt.align = 'l'\n",
    "c = Counter(tokens)\n",
    "[pt.add_row([token, freq]) \n",
    " for (token, freq) in sorted(c.items(), key=itemgetter(1), reverse=True) \n",
    "     if freq > 1 and len(token) > 2]\n",
    "print(pt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Geocoding locations with Google Maps\n",
    "\n",
    "Visit the Google Developer Console:\n",
    "\n",
    "https://console.developers.google.com\n",
    "\n",
    "From there, create an API key and enable the 'Google Maps Geocoding API'. Copy and paste the app key below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from geopy import geocoders # pip install geopy\n",
    "\n",
    "GOOGLEMAPS_APP_KEY = ''\n",
    "g = geocoders.GoogleV3(GOOGLEMAPS_APP_KEY)\n",
    "\n",
    "location = g.geocode(\"O'Reilly Media\")\n",
    "print(location)\n",
    "print('Lat/Lon: {0}, {1}'.format(location.latitude,location.longitude))\n",
    "print('https://www.google.ca/maps/@{0},{1},17z'.format(location.latitude,location.longitude))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Geocoding locations of LinkedIn connections with Google Maps\n",
    "\n",
    "Since LinkedIn only exports basic contact information, which does not include the contact's location, we can try to infer where they located from the company information. This is not inherently accurate, as many companies have multiple locations, but assuming that the person works at the headquarters, or the company is small as has only one location, then this approach should be at least somewhat accurate.\n",
    "\n",
    "For example, if someone declares that they work at the University of Toronto, then we can guess that they might also live in Toronto, although it's quite possible that they commute into the city from a nearby town."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for i, c in enumerate(contacts):\n",
    "    progress = '{0:3d} of {1:3d} - '.format(i+1,len(contacts))\n",
    "    company = c['Company']\n",
    "    try:\n",
    "        location = g.geocode(company, exactly_one=True)\n",
    "    except:\n",
    "        print('... Failed to get a location for {0}'.format(company))\n",
    "        location = None\n",
    "    \n",
    "    if location != None:\n",
    "        c.update([('Location', location)])\n",
    "        print(progress + company[:50] + ' -- ' + location.address)\n",
    "    else:\n",
    "        c.update([('Location', None)])\n",
    "        print(progress + company[:50] + ' -- ' + 'Unknown Location')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parsing out states from location results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def checkIfUSA(loc):\n",
    "    if loc == None: return False\n",
    "    for comp in loc.raw['address_components']:\n",
    "        if 'country' in comp['types']:\n",
    "            if comp['short_name'] == 'US':\n",
    "                return True\n",
    "            else:\n",
    "                return False\n",
    "    \n",
    "\n",
    "def parseStateFromGoogleMapsLocation(loc):\n",
    "    try:\n",
    "        address_components = loc.raw['address_components']\n",
    "        for comp in address_components:\n",
    "            if 'administrative_area_level_1' in comp['types']:\n",
    "                return comp['short_name']\n",
    "    except:\n",
    "        return None\n",
    "    \n",
    "results = {}\n",
    "for c in contacts:\n",
    "    loc = c['Location']\n",
    "    if loc == None: continue\n",
    "    if not checkIfUSA(loc): continue \n",
    "    state = parseStateFromGoogleMapsLocation(loc)\n",
    "    if state == None: continue\n",
    "    results.update({loc.address : state})\n",
    "    \n",
    "print(json.dumps(results, indent=1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Write the data to a JSON file, storing the address, latitude, and longitude data for location"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CONNECTIONS_DATA = 'linkedin_connections.json'\n",
    "\n",
    "# Loop over contacts and update the location information to store the\n",
    "# string address, also adding latitude and longitude information\n",
    "def serialize_contacts(contacts, output_filename):\n",
    "    for c in contacts:\n",
    "        location = c['Location']\n",
    "        if location != None:\n",
    "            # Convert the location to a string for serialization\n",
    "            c.update([('Location', location.address)])\n",
    "            c.update([('Lat', location.latitude)])\n",
    "            c.update([('Lon', location.longitude)])\n",
    "\n",
    "    f = open(output_filename, 'w')\n",
    "    f.write(json.dumps(contacts, indent=1))\n",
    "    f.close()\n",
    "    return\n",
    "\n",
    "serialize_contacts(contacts, CONNECTIONS_DATA)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Here's how to power a Cartogram visualization with the data from the \"results\" variable**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import IFrame\n",
    "from IPython.core.display import display\n",
    "\n",
    "# Load in a data structure mapping state names to codes.\n",
    "# e.g. West Virginia is WV\n",
    "codes = json.loads(open('resources/ch04-linkedin/viz/states-codes.json').read())\n",
    "\n",
    "from collections import Counter\n",
    "c = Counter([r[1] for r in results.items()])\n",
    "states_freqs = { codes[k] : v for (k,v) in c.items() }\n",
    "\n",
    "# Lace in all of the other states and provide a minimum value for each of them\n",
    "states_freqs.update({v : 0.2 for v in codes.values() if v not in states_freqs.keys() })\n",
    "\n",
    "# Write output to file\n",
    "f = open('resources/ch04-linkedin/viz/states-freqs.json', 'w')\n",
    "f.write(json.dumps(states_freqs, indent=1))\n",
    "f.close()\n",
    "\n",
    "# IPython Notebook can serve files and display them into\n",
    "# inline frames\n",
    "\n",
    "display(IFrame(src='resources/ch04-linkedin/viz/cartogram.html', width='100%', height='600px'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clustering job titles using a greedy heuristic\n",
    "\n",
    "In this section, we introduce the Jaccard distance for comparing two sets of strings. The Jaccard distance is based on the Jaccard Index, which is defined as:\n",
    "\n",
    "$$\n",
    "J(A,B) = {{|A \\cap B|}\\over{|A \\cup B|}} = {{|A \\cap B|}\\over{|A| + |B| - |A \\cap B|}}\n",
    "$$\n",
    "\n",
    "The Wikipedia page has a good introduction to the [Jaccard Index](https://en.wikipedia.org/wiki/Jaccard_index).\n",
    "![Intersection](resources/ch04-linkedin/images/Intersection_of_sets_A_and_B.png)\n",
    "![Union](resources/ch04-linkedin/images/Union_of_sets_A_and_B.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk.util import bigrams\n",
    "\n",
    "ceo_bigrams = list(bigrams(\"Chief Executive Officer\".split(), pad_left=True, pad_right=True))\n",
    "cto_bigrams = list(bigrams(\"Chief Technology Officer\".split(), pad_left=True, pad_right=True))\n",
    "\n",
    "print(ceo_bigrams)\n",
    "print(cto_bigrams)\n",
    "\n",
    "print(len(set(ceo_bigrams).intersection(set(cto_bigrams))))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Jaccard distance calculation**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk.metrics.distance import jaccard_distance # pip install nltk\n",
    "\n",
    "job_title_1 = 'Chief Executive Officer'.split()\n",
    "job_title_2 = 'Chief Technology Officer'.split()\n",
    "\n",
    "print(job_title_1)\n",
    "print(job_title_2)\n",
    "\n",
    "print()\n",
    "print('Intersection:')\n",
    "intersection = set(job_title_1).intersection(set(job_title_2))\n",
    "print(intersection)\n",
    "\n",
    "print()\n",
    "print('Union:')\n",
    "union = set(job_title_1).union(set(job_title_2))\n",
    "print(union)\n",
    "\n",
    "print()\n",
    "print('Similarity:', len(intersection) / len(union))\n",
    "print('Distance:', jaccard_distance(set(job_title_1), set(job_title_2)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "job_title_1 = 'Vice President, Sales'.split()\n",
    "job_title_2 = 'Vice President, Customer Relations'.split()\n",
    "\n",
    "print(job_title_1)\n",
    "print(job_title_2)\n",
    "\n",
    "print()\n",
    "print('Intersection:')\n",
    "intersection = set(job_title_1).intersection(set(job_title_2))\n",
    "print(intersection)\n",
    "\n",
    "print()\n",
    "print('Union:')\n",
    "union = set(job_title_1).union(set(job_title_2))\n",
    "print(union)\n",
    "\n",
    "print()\n",
    "print('Similarity:', len(intersection) / len(union))\n",
    "print('Distance:', jaccard_distance(set(job_title_1), set(job_title_2)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Tweak this distance threshold and try different distance calculations \n",
    "# during experimentation\n",
    "DISTANCE_THRESHOLD = 0.6\n",
    "DISTANCE = jaccard_distance\n",
    "\n",
    "with open(CONNECTIONS_DATA, 'r') as f:\n",
    "    contacts = json.load(f)\n",
    "\n",
    "def cluster_contacts_by_title():\n",
    "\n",
    "    transforms = [\n",
    "        ('Sr.', 'Senior'),\n",
    "        ('Sr', 'Senior'),\n",
    "        ('Jr.', 'Junior'),\n",
    "        ('Jr', 'Junior'),\n",
    "        ('CEO', 'Chief Executive Officer'),\n",
    "        ('COO', 'Chief Operating Officer'),\n",
    "        ('CTO', 'Chief Technology Officer'),\n",
    "        ('CFO', 'Chief Finance Officer'),\n",
    "        ('VP', 'Vice President'),\n",
    "        ]\n",
    "\n",
    "    separators = ['/', ' and ', ' & ', '|', ',']\n",
    "\n",
    "    # Normalize and/or replace known abbreviations\n",
    "    # and build up a list of common titles.\n",
    "\n",
    "    all_titles = []\n",
    "    for i, _ in enumerate(contacts):\n",
    "        if contacts[i]['Position'] == '':\n",
    "            contacts[i]['Position'] = ['']\n",
    "            continue\n",
    "        titles = [contacts[i]['Position']]\n",
    "        \n",
    "        all_titles.extend(titles)\n",
    "\n",
    "    all_titles = list(set(all_titles))\n",
    "\n",
    "    clusters = {}\n",
    "    for title1 in all_titles:\n",
    "        clusters[title1] = []\n",
    "        for title2 in all_titles:\n",
    "            if title2 in clusters[title1] or title2 in clusters and title1 \\\n",
    "                in clusters[title2]:\n",
    "                continue\n",
    "            try:\n",
    "                distance = DISTANCE(set(title1.split()), set(title2.split()))\n",
    "            except:\n",
    "                print(title1.split())\n",
    "                print(title2.split())\n",
    "                continue\n",
    "\n",
    "            if distance < DISTANCE_THRESHOLD:\n",
    "                clusters[title1].append(title2)\n",
    "\n",
    "    # Flatten out clusters\n",
    "    clusters = [clusters[title] for title in clusters if len(clusters[title]) > 1]\n",
    "\n",
    "    # Round up contacts who are in these clusters and group them together\n",
    "    clustered_contacts = {}\n",
    "    for cluster in clusters:\n",
    "        clustered_contacts[tuple(cluster)] = []\n",
    "        for contact in contacts:\n",
    "            for title in contact['Position']:\n",
    "                if title in cluster:\n",
    "                    clustered_contacts[tuple(cluster)].append('{0} {1}.'.format(\n",
    "                        contact['FirstName'], contact['LastName'][0]))\n",
    "\n",
    "    return clustered_contacts\n",
    "\n",
    "\n",
    "clustered_contacts = cluster_contacts_by_title()\n",
    "\n",
    "for titles in clustered_contacts:\n",
    "    common_titles_heading = 'Common Titles: ' + ', '.join(titles)\n",
    "\n",
    "    descriptive_terms = set(titles[0].split())\n",
    "    for title in titles:\n",
    "        descriptive_terms.intersection_update(set(title.split()))\n",
    "    if len(descriptive_terms) == 0: descriptive_terms = ['***No words in common***']\n",
    "    descriptive_terms_heading = 'Descriptive Terms: ' \\\n",
    "        + ', '.join(descriptive_terms)\n",
    "    print(common_titles_heading)\n",
    "    print('\\n'+descriptive_terms_heading)\n",
    "    print('-' * 70)\n",
    "    print('\\n'.join(clustered_contacts[titles]))\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**How to export data to power a dendogram and node-link tree visualization**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nltk\n",
    "nltk.download('stopwords')\n",
    "from nltk.metrics.distance import jaccard_distance\n",
    "from nltk.corpus import stopwords # nltk.download('stopwords')\n",
    "from cluster import HierarchicalClustering # pip install cluster\n",
    "\n",
    "CSV_FILE = os.path.join('resources', 'ch04-linkedin', 'Connections.csv')\n",
    "\n",
    "OUT_FILE = 'resources/ch04-linkedin/viz/d3-data.json'\n",
    "\n",
    "# Tweak this distance threshold and try different distance calculations \n",
    "# during experimentation\n",
    "DISTANCE_THRESHOLD = 0.5\n",
    "DISTANCE = jaccard_distance\n",
    "\n",
    "# Adjust sample size as needed to reduce the runtime of the\n",
    "# nested loop that invokes the DISTANCE function\n",
    "SAMPLE_SIZE = 500\n",
    "\n",
    "def cluster_contacts_by_title(csv_file):\n",
    "\n",
    "    csvReader = csv.DictReader(open(csv_file), delimiter=',', quotechar='\"')\n",
    "    contacts = [row for row in csvReader]\n",
    "    contacts = contacts[:SAMPLE_SIZE]\n",
    "    \n",
    "    transforms = [\n",
    "        ('Sr.', 'Senior'),\n",
    "        ('Sr', 'Senior'),\n",
    "        ('Jr.', 'Junior'),\n",
    "        ('Jr', 'Junior'),\n",
    "        ('CEO', 'Chief Executive Officer'),\n",
    "        ('COO', 'Chief Operating Officer'),\n",
    "        ('CTO', 'Chief Technology Officer'),\n",
    "        ('CFO', 'Chief Finance Officer'),\n",
    "        ('VP', 'Vice President'),\n",
    "        ]\n",
    "\n",
    "    separators = ['/', ' and ', '|', ',', ' & ']\n",
    "\n",
    "    # Normalize and/or replace known abbreviations\n",
    "    # and build up a list of common titles.\n",
    "\n",
    "    all_titles = []\n",
    "    for i, _ in enumerate(contacts):\n",
    "        if contacts[i]['Position'] == '':\n",
    "            contacts[i]['Position'] = ['']\n",
    "            continue\n",
    "        titles = [contacts[i]['Position']]\n",
    "        for separator in separators:\n",
    "            for title in titles:\n",
    "                if title.find(separator) >= 0:\n",
    "                    titles.remove(title)\n",
    "                    titles.extend([title.strip() for title in title.split(separator) if title.strip() != ''])\n",
    "\n",
    "        for transform in transforms:\n",
    "            titles = [title.replace(*transform) for title in titles]\n",
    "            \n",
    "        contacts[i]['Position'] = titles\n",
    "        all_titles.extend(titles)\n",
    "\n",
    "    all_titles = list(set(all_titles))\n",
    "    \n",
    "    # Define a scoring function\n",
    "    def score(title1, title2): \n",
    "        return DISTANCE(set(title1.split()), set(title2.split()))\n",
    "\n",
    "    # Feed the class your data and the scoring function\n",
    "    hc = HierarchicalClustering(all_titles, score)\n",
    "\n",
    "    # Cluster the data according to a distance threshold\n",
    "    clusters = hc.getlevel(DISTANCE_THRESHOLD)\n",
    "\n",
    "    # Remove singleton clusters\n",
    "    clusters = [c for c in clusters if len(c) > 1]\n",
    "\n",
    "    # Round up contacts who are in these clusters and group them together\n",
    "    clustered_contacts = {}\n",
    "    for cluster in clusters:\n",
    "        clustered_contacts[tuple(cluster)] = []\n",
    "        for contact in contacts:\n",
    "            for title in contact['Position']:\n",
    "                if title in cluster:\n",
    "                    clustered_contacts[tuple(cluster)].append('{0} {1}.'.format(\n",
    "                        contact['FirstName'], contact['LastName'][0]))\n",
    "\n",
    "    return clustered_contacts, clusters\n",
    "\n",
    "def get_descriptive_terms(titles):\n",
    "    flatten = lambda l: [item for sublist in l for item in sublist]\n",
    "    title_words = flatten([title.split() for title in titles])\n",
    "    filtered_words = [word for word in title_words \\\n",
    "                      if word not in stopwords.words('english')]\n",
    "    counter = Counter(filtered_words)\n",
    "    descriptive_terms = counter.most_common(2)\n",
    "    # Get the most common title words from a cluster, ignoring singletons\n",
    "    descriptive_terms = [t[0] for t in descriptive_terms if t[1] > 1]\n",
    "    return descriptive_terms\n",
    "\n",
    "\n",
    "def display_output(clustered_contacts, clusters):    \n",
    "    for title_cluster in clusters:\n",
    "        descriptive_terms = get_descriptive_terms(title_cluster)\n",
    "        common_titles_heading = 'Common Titles: ' + ', '.join((t for t in title_cluster))\n",
    "        descriptive_terms_heading =  'Descriptive Terms: ' + ', '.join((t for t in descriptive_terms))\n",
    "        \n",
    "        print(common_titles_heading)\n",
    "        print(descriptive_terms_heading)\n",
    "        print('-' * 70)\n",
    "        #print(title_cluster)\n",
    "        #print(clustered_contacts)\n",
    "        print('\\n'.join(clustered_contacts[tuple(title_cluster)]))\n",
    "        print()\n",
    "\n",
    "\n",
    "def write_d3_json_output(clustered_contacts):\n",
    "    \n",
    "    json_output = {'name' : 'My LinkedIn', 'children' : []}\n",
    "\n",
    "    for titles in clustered_contacts:\n",
    "\n",
    "        descriptive_terms = get_descriptive_terms(titles)\n",
    "\n",
    "        json_output['children'].append({'name' : ', '.join(descriptive_terms)[:30], \n",
    "                                    'children' : [ {'name' : c} for c in clustered_contacts[titles] ] } )\n",
    "    \n",
    "        f = open(OUT_FILE, 'w')\n",
    "        f.write(json.dumps(json_output, indent=1))\n",
    "        f.close()\n",
    "    \n",
    "clustered_contacts, clusters = cluster_contacts_by_title(CSV_FILE)\n",
    "display_output(clustered_contacts, clusters)\n",
    "write_d3_json_output(clustered_contacts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Once you've run the code and produced the output for the dendogram and node-link tree visualizations, here's one way to serve it.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import IFrame\n",
    "from IPython.core.display import display\n",
    "\n",
    "# Visualize clusters as a dendrogram\n",
    "viz_file = 'resources/ch04-linkedin/viz/dendogram.html'\n",
    "\n",
    "display(IFrame(viz_file, '100%', '600px'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "viz_file = 'resources/ch04-linkedin/viz/node_link_tree.html'\n",
    "\n",
    "# Visualize clusters as a node-link tree\n",
    "display(IFrame(viz_file, '100%', '600px'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clustering your LinkedIn professional network based upon the locations of your connections and emitting KML output for visualization with Google Earth"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import simplekml # pip install simplekml\n",
    "from cluster import KMeansClustering\n",
    "from cluster.util import centroid\n",
    "\n",
    "# Load this data from where you've previously stored it\n",
    "CONNECTIONS_DATA = 'linkedin_connections.json'\n",
    "\n",
    "# Open up your saved connections with extended profile information\n",
    "# or fetch them again from LinkedIn if you prefer\n",
    "connections = json.loads(open(CONNECTIONS_DATA).read())\n",
    "\n",
    "\n",
    "# A KML object for storing all your contacts\n",
    "kml_all = simplekml.Kml()\n",
    "\n",
    "for c in connections:\n",
    "    location = c['Location']\n",
    "    if location is not None:\n",
    "        lat, lon = c['Lat'], c['Lon']\n",
    "        kml_all.newpoint(name='{} {}'.format(c['FirstName'],c['LastName']), coords=[(lon,lat)]) # coords reversed\n",
    "\n",
    "kml_all.save('resources/ch04-linkedin/viz/connections.kml')\n",
    "\n",
    "\n",
    "# Now cluster your contacts using the K-Means algorithm into K clusters\n",
    "\n",
    "K = 10\n",
    "\n",
    "cl = KMeansClustering([(c['Lat'], c['Lon']) for c in connections if c['Location'] is not None])\n",
    "\n",
    "# Get the centroids for each of the K clusters\n",
    "centroids = [centroid(c) for c in cl.getclusters(K)]\n",
    "\n",
    "# A KML object for storing the locations of each of the clusters\n",
    "kml_clusters = simplekml.Kml()\n",
    "\n",
    "for i, c in enumerate(centroids):\n",
    "    kml_clusters.newpoint(name='Cluster {}'.format(i), coords=[(c[1],c[0])]) # coords reversed\n",
    "\n",
    "kml_clusters.save('resources/ch04-linkedin/viz/kmeans_centroids.kml')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}