{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Mining the Social Web\n", "\n", "## Mining Mailboxes\n", "\n", "This Jupyter Notebook provides an interactive way to follow along with and explore the examples from the video series. The intent behind this notebook is to reinforce the concepts in a fun, convenient, and effective way." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Converting a toy mailbox to JSON" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import mailbox # pip install mailbox\n", "import json" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "MBOX = 'resources/ch07-mailboxes/data/northpole.mbox'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# A routine that makes a ton of simplifying assumptions\n", "# about converting an mbox message into a Python object\n", "# given the nature of the northpole.mbox file in order\n", "# to demonstrate the basic parsing of an mbox with mail\n", "# utilities\n", "\n", "def objectify_message(msg):\n", " \n", " # Map in fields from the message\n", " o_msg = dict([ (k, v) for (k,v) in msg.items() ])\n", " \n", " # Assume one part to the message and get its content\n", " # and its content type\n", " \n", " part = [p for p in msg.walk()][0]\n", " o_msg['contentType'] = part.get_content_type()\n", " o_msg['content'] = part.get_payload()\n", " \n", " return o_msg" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Create an mbox that can be iterated over and transform each of its\n", "# messages to a convenient JSON representation\n", "\n", "mbox = mailbox.mbox(MBOX)\n", "\n", "messages = []\n", "\n", "for msg in mbox:\n", " messages.append(objectify_message(msg))\n", " \n", "print(json.dumps(messages, indent=1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Downloading the Enron email corpus" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import sys\n", "from urllib.request import urlopen\n", "import time\n", "import os\n", "import envoy # pip install envoy\n", "\n", "URL = \"http://www.cs.cmu.edu/~enron/enron_mail_20110402.tgz\"\n", "DOWNLOAD_DIR = \"resources/ch07-mailboxes/data\"\n", "\n", "# Downloads a file and displays a download status every 5 seconds\n", "\n", "def download(url, download_dir): \n", " file_name = url.split('/')[-1]\n", " u = urlopen(url)\n", " f = open(os.path.join(download_dir, file_name), 'wb')\n", " meta = u.info()\n", " file_size = int(meta['Content-Length'])\n", " print(\"Downloading: %s Bytes: %s\" % (file_name, file_size))\n", "\n", " file_size_dl = 0\n", " block_sz = 8192\n", " last_update = time.time()\n", " while True:\n", " buffer = u.read(block_sz)\n", " if not buffer:\n", " break\n", "\n", " file_size_dl += len(buffer)\n", " f.write(buffer)\n", " download_status = r\"%10d MB [%5.2f%%]\" % (file_size_dl / 1000000.0, file_size_dl * 100.0 / file_size)\n", " download_status = download_status + chr(8)*(len(download_status)+1)\n", " if time.time() - last_update > 5:\n", " print(download_status)\n", " sys.stdout.flush()\n", " last_update = time.time()\n", " f.close()\n", " return f.name\n", "\n", "# Extracts a gzipped tarfile. e.g. \"$ tar xzf filename.tgz\"\n", "\n", "def tar_xzf(f):\n", " # Call out to the shell for a faster decompression.\n", " # This will still take a while because Vagrant synchronizes\n", " # thousands of files that are extracted to the host machine\n", " r = envoy.run(\"tar xzf %s -C %s\" % (f, DOWNLOAD_DIR))\n", " print(r.std_out)\n", " print(r.std_err)\n", "\n", "f = download(URL, DOWNLOAD_DIR)\n", "print(\"Download complete: %s\" % (f,))\n", "tar_xzf(f)\n", "print(\"Decompression complete\")\n", "print(\"Data is ready\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Converting the Enron corpus to a standardized mbox format\n", "\n", "The results of the sample code below have been saved as a file, `enron.mbox.bz2`, in a compressed format. You may decompress is to `enron.mbox` using whatever tool you prefer, appropriate to your computer's operating system. On UNIX-like systems, the file may be decompressed with the command:\n", "\n", "`tar -xjf enron.mbox.bz2`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import re\n", "import email\n", "from time import asctime\n", "import os\n", "import sys\n", "from dateutil.parser import parse # pip install python_dateutil\n", "\n", "# XXX: Download the Enron corpus to resources/ch07-mailboxes/data\n", "# and unarchive it there.\n", "\n", "MAILDIR = 'resources/ch07-mailboxes/data/enron_mail_20110402/maildir' \n", "\n", "# Where to write the converted mbox\n", "MBOX = 'resources/ch07-mailboxes/data/enron.mbox'\n", "\n", "# Create a file handle that we'll be writing into...\n", "mbox = open(MBOX, 'w+')\n", "\n", "# Walk the directories and process any folder named 'inbox'\n", "\n", "for (root, dirs, file_names) in os.walk(MAILDIR):\n", "\n", " if root.split(os.sep)[-1].lower() != 'inbox':\n", " continue\n", "\n", " # Process each message in 'inbox'\n", "\n", " for file_name in file_names:\n", " file_path = os.path.join(root, file_name)\n", " message_text = open(file_path, errors='ignore').read()\n", "\n", " # Compute fields for the From_ line in a traditional mbox message\n", " _from = re.search(r\"From: ([^\\r\\n]+)\", message_text).groups()[0]\n", " _date = re.search(r\"Date: ([^\\r\\n]+)\", message_text).groups()[0]\n", "\n", " # Convert _date to the asctime representation for the From_ line\n", " _date = asctime(parse(_date).timetuple())\n", "\n", " msg = email.message_from_string(message_text)\n", " msg.set_unixfrom('From {0} {1}'.format(_from, _date))\n", "\n", " mbox.write(msg.as_string(unixfrom=True) + \"\\n\\n\")\n", " \n", "mbox.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the mailbox data into Pandas" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd # pip install pandas\n", "import mailbox\n", "\n", "MBOX = 'resources/ch07-mailboxes/data/enron.mbox'\n", "mbox = mailbox.mbox(MBOX)\n", "\n", "mbox_dict = {}\n", "for i, msg in enumerate(mbox):\n", " mbox_dict[i] = {}\n", " for header in msg.keys():\n", " mbox_dict[i][header] = msg[header]\n", " mbox_dict[i]['Body'] = msg.get_payload().replace('\\n', ' ').replace('\\t', ' ').replace('\\r', ' ').strip()\n", " \n", "df = pd.DataFrame.from_dict(mbox_dict, orient='index')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df.index = df['Date'].apply(pd.to_datetime)\n", "\n", "# Remove non-essential columns\n", "cols_to_keep = ['From', 'To', 'Cc', 'Bcc', 'Subject', 'Body']\n", "df = df[cols_to_keep]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Describe the DataFrame" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Investigate email volume by month" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "start_date = '2000-1-1'\n", "stop_date = '2003-1-1'\n", "\n", "datemask = (df.index > start_date) & (df.index <= stop_date)\n", "vol_by_month = df.loc[datemask].resample('1M').count()['To']\n", "\n", "print(vol_by_month)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from prettytable import PrettyTable\n", "\n", "pt = PrettyTable(field_names=['Year', 'Month', 'Num Msgs'])\n", "pt.align['Num Msgs'], pt.align['Month'] = 'r', 'r'\n", "[ pt.add_row([ind.year, ind.month, vol])\n", " for ind, vol in zip(vol_by_month.index, vol_by_month)]\n", "\n", "print(pt)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "vol_by_month[::-1].plot(kind='barh', figsize=(5,8), title='Email Volume by Month')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analyzing Patterns in Sender/Recipient Communications" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "senders = df['From'].unique()\n", "receivers = df['To'].unique()\n", "cc_receivers = df['Cc'].unique()\n", "bcc_receivers = df['Bcc'].unique()\n", "\n", "print('Num Senders:', len(senders))\n", "print('Num Receivers:', len(receivers))\n", "print('Num CC Receivers:', len(cc_receivers))\n", "print('Num BCC Receivers:', len(bcc_receivers))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "senders = set(senders)\n", "receivers = set(receivers)\n", "cc_receivers = set(cc_receivers)\n", "bcc_receivers = set(bcc_receivers)\n", "\n", "# Find the number of senders who were also direct receivers\n", "\n", "senders_intersect_receivers = senders.intersection(receivers)\n", "\n", "# Find the senders that didn't receive any messages\n", "\n", "senders_diff_receivers = senders.difference(receivers)\n", " \n", "# Find the receivers that didn't send any messages\n", "\n", "receivers_diff_senders = receivers.difference(senders)\n", "\n", "# Find the senders who were any kind of receiver by\n", "# first computing the union of all types of receivers\n", "\n", "all_receivers = receivers.union(cc_receivers, bcc_receivers)\n", "senders_all_receivers = senders.intersection(all_receivers)\n", "\n", "print(\"Num senders in common with receivers:\", len(senders_intersect_receivers))\n", "print(\"Num senders who didn't receive:\", len(senders_diff_receivers))\n", "print(\"Num receivers who didn't send:\", len(receivers_diff_senders))\n", "print(\"Num senders in common with *all* receivers:\", len(senders_all_receivers))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Who is Sending and Receiving the Most Email?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "\n", "top_senders = df.groupby('From')\n", "top_receivers = df.groupby('To')\n", "\n", "top_senders = top_senders.count()['To']\n", "top_receivers = top_receivers.count()['From']\n", "\n", "# Get the ordered indices of the top senders and receivers in descending order\n", "top_snd_ord = np.argsort(top_senders)[::-1]\n", "top_rcv_ord = np.argsort(top_receivers)[::-1]\n", "\n", "top_senders = top_senders[top_snd_ord]\n", "top_receivers = top_receivers[top_rcv_ord]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from prettytable import PrettyTable\n", "\n", "top10 = top_senders[:10]\n", "pt = PrettyTable(field_names=['Rank', 'Sender', 'Messages Sent'])\n", "pt.align['Messages Sent'] = 'r'\n", "[ pt.add_row([i+1, email, vol]) for i, email, vol in zip(range(10), top10.index.values, top10.values)]\n", "\n", "print(pt)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from prettytable import PrettyTable\n", "\n", "top10 = top_receivers[:10]\n", "pt = PrettyTable(field_names=['Rank', 'Receiver', 'Messages Received'])\n", "pt.align['Messages Sent'] = 'r'\n", "[ pt.add_row([i+1, email, vol]) for i, email, vol in zip(range(10), top10.index.values, top10.values)]\n", "\n", "print(pt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Searching by keyword" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import textwrap\n", "\n", "search_term = 'raptor'\n", "\n", "query = (df['Body'].str.contains(search_term, case=False) | df['Subject'].str.contains(search_term, case=False))\n", "\n", "results = df[query]\n", "\n", "print('{0} results found.'.format(query.sum()))\n", "print('Printing first 10 results...')\n", "for i in range(10):\n", " subject, body = results.iloc[i]['Subject'], results.iloc[i]['Body']\n", " print()\n", " print('SUBJECT: ', subject)\n", " print('-'*20)\n", " for line in textwrap.wrap(body, width=70, max_lines=5):\n", " print(line)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Accessing Your Gmail Programmatically" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import httplib2\n", "import os\n", "\n", "from apiclient import discovery\n", "from oauth2client import client\n", "from oauth2client import tools\n", "from oauth2client.file import Storage\n", "\n", "# If modifying these scopes, delete your previously saved credentials\n", "# at ~/.credentials/gmail-python-quickstart.json\n", "SCOPES = 'https://www.googleapis.com/auth/gmail.readonly'\n", "CLIENT_SECRET_FILE = 'client_secret.json'\n", "APPLICATION_NAME = 'Gmail API Python Quickstart'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_credentials():\n", " \"\"\"Gets valid user credentials from storage.\n", "\n", " If nothing has been stored, or if the stored credentials are invalid,\n", " the OAuth2 flow is completed to obtain the new credentials.\n", "\n", " Returns:\n", " Credentials, the obtained credential.\n", " \"\"\"\n", " home_dir = os.path.expanduser('~')\n", " credential_dir = os.path.join(home_dir, '.credentials')\n", " if not os.path.exists(credential_dir):\n", " os.makedirs(credential_dir)\n", " credential_path = os.path.join(credential_dir,\n", " 'gmail-python-quickstart.json')\n", "\n", " store = Storage(credential_path)\n", " credentials = store.get()\n", " if not credentials or credentials.invalid:\n", " flow = client.flow_from_clientsecrets(CLIENT_SECRET_FILE, SCOPES)\n", " flow.user_agent = APPLICATION_NAME\n", " if flags:\n", " credentials = tools.run_flow(flow, store, flags)\n", " else: # Needed only for compatibility with Python 2.6\n", " credentials = tools.run(flow, store)\n", " print('Storing credentials to ' + credential_path)\n", " return credentials" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "credentials = get_credentials()\n", "http = credentials.authorize(httplib2.Http())\n", "service = discovery.build('gmail', 'v1', http=http)\n", "\n", "results = service.users().labels().list(userId='me').execute()\n", "labels = results.get('labels', [])\n", "\n", "if not labels:\n", " print('No labels found.')\n", "else:\n", " print('Labels:')\n", " for label in labels:\n", " print(label['name'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fetch Gmail Messages" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "query = 'Mining'\n", "max_results = 10\n", "\n", "# Search for Gmail messages containing the query term\n", "results = service.users().messages().list(userId='me', q=query, maxResults=max_results).execute()\n", "\n", "for result in results['messages']:\n", " print(result['id'])\n", " # Retrieve the message itself\n", " msg = service.users().messages().get(userId='me', id=result['id'], format='minimal').execute()\n", " print(msg)" ] } ], "metadata": { "kernelspec": { "display_name": "mtsw", "language": "python", "name": "mtsw" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 1 }