{ "metadata": { "name": "", "signature": "sha256:36b6e90945207bc790ddbd40c314a984158569046f8e670ff553e63eea1f83ff" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Extracting Text From Heterogeneous Sources" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I consume online media through different apps on different devices (plain browsers, Feedly, Twitter, etc.)\n", "and to archive interesting reads for future reference I developed a habit of\n", "e-mailing links to myself.\n", "\n", "Gmail addresses are great for this purpose since an e-mail sent to *firstname.lastname+Links@gmail.com* still\n", "ends up in the inbox of *firstname.lastname@gmail.com*.\n", "Combining this with [Gmail filters](https://support.google.com/mail/answer/6579?hl=en) allows you to set up\n", "a rule so that all e-mails received on *your-gmail-address+Links@gmail.com* are archived automatically and\n", "are assigned a label that you can use to collect all blog posts you want to archive (I use the label *Links*).\n", "\n", "I already collected a couple of hundred links to blog articles, webpages, pdfs, etc. that I feel compelled to\n", "do something interesting with (such as topic modelling and automatic summarization).\n", "\n", "In this blog article I describe a pipeline that automatically accesses my Gmail account and extracts the readable\n", "text from all links I have sent myself over time.\n", "Of course I am opening a can of worms here since these e-mails come in lots of different\n", "formats and point to different materials:\n", "\n", "- some e-mails are just plain text with one or multiple links\n", "- many e-mails come straight out of Feedly on my iPad which tends to send entire blog articles with html formatting\n", "- many e-mails contain direct links to PDF files which are generally not as easy to parse as plain HTML\n", "\n", "Here I will use a rule-based approach to distinguish between these different formats." 
] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Import Python Libraries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are the Python imports I will need throughout the pipeline:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import httplib2\n", "\n", "from apiclient.discovery import build\n", "from oauth2client.client import flow_from_clientsecrets\n", "from oauth2client.file import Storage\n", "from oauth2client.tools import run\n", "\n", "import base64\n", "import re\n", "import requests\n", "from lxml import etree\n", "from StringIO import StringIO\n", "import itertools as it\n", "\n", "import urllib2\n", "from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter\n", "from pdfminer.converter import TextConverter\n", "from pdfminer.layout import LAParams\n", "from pdfminer.pdfpage import PDFPage\n", "from cStringIO import StringIO\n", "\n", "from collections import defaultdict\n", "\n", "from lxml.etree import fromstring\n", "\n", "import sqlite3\n", "from datetime import datetime\n", "import zlib" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Set Up Access to Gmail" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Google offer an API to many of their services including Gmail.\n", "The free of charge quota of the Gmail API is very generous and you will essentially not need to worry about\n", "going over the free allowance.\n", "\n", "To access Gmail through Python you need to followe the steps outlined here\n", "[here](https://developers.google.com/gmail/api/quickstart/quickstart-python)\n", "(all steps assume that you are logged into your Gmail account in the browser):\n", "\n", "- create a new project in the Google Developer Console\n", "- on the page of your new project, make certain to set a product name in the APIs&auth/consent screen tab\n", "- download the JSON from the APIs&auth/credentials tab and store it in a location accessible from this Python script (I called mine *client_secret.json*)\n", "\n", "Get the Google API Python package\n", "\n", " pip install google-api-python-client\n", " \n", "and the following code ([source](https://developers.google.com/gmail/api/quickstart/quickstart-python)) will\n", "connect to the Gmail API and expose an API resource object pointed to by `gmail_service`." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "# Path to the client_secret.json file downloaded from the Developer Console\n", "CLIENT_SECRET_FILE = 'client_secret.json'\n", "\n", "# Check https://developers.google.com/gmail/api/auth/scopes for all available scopes\n", "OAUTH_SCOPE = 'https://www.googleapis.com/auth/gmail.readonly'\n", "\n", "# Location of the credentials storage file\n", "STORAGE = Storage('gmail.storage')\n", "\n", "# Start the OAuth flow to retrieve credentials\n", "flow = flow_from_clientsecrets(CLIENT_SECRET_FILE, scope=OAUTH_SCOPE)\n", "http = httplib2.Http()\n", "\n", "# Try to retrieve credentials from storage or run the flow to generate them\n", "credentials = STORAGE.get()\n", "if credentials is None or credentials.invalid:\n", " credentials = run(flow, STORAGE, http=http)\n", "\n", "# Authorize the httplib2.Http object with our credentials\n", "http = credentials.authorize(http)\n", "\n", "# Build the Gmail service from discovery\n", "gmail_service = build('gmail', 'v1', http=http)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Retrieve Bodies of Relevant E-Mail Messages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Gmail API resource `gmail_service` is the only object we need to query to eventually download\n", "all e-mail bodies of interest.\n", "The [Gmail API reference](https://developers.google.com/gmail/api/v1/reference/) explains how to use the different\n", "resource types (labels, message lists, messages) that we need to work with." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Gmail label of interest is known to me as *Links* but internally Gmail identifies all labels by a unique ID.\n", "The following code identifies the label ID that corresponds to the label name *Links*:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "labels = gmail_service.users().labels().list(userId='me').execute()['labels'] # 'me' is the currently logged-in user\n", "label_id = filter(lambda x: x['name'] == 'Links', labels)[0]['id']" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now query `gmail_serice` for all e-mail messages with the label *Links* (as identified by `label_id`).\n", "The API returns messages in pages of at most 100 messages so that we will need to track a pointer to the\n", "next page (`nextPageToken`) until all pages are consumed." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "def get_message_ids():\n", " \"\"\" Page through all messages in `label_id` \"\"\"\n", " next_page = None\n", "\n", " while True:\n", " if next_page is not None:\n", " response = gmail_service.users().messages().list(userId='me', labelIds=[label_id], pageToken=next_page).execute()\n", " else:\n", " response = gmail_service.users().messages().list(userId='me', labelIds=[label_id]).execute()\n", "\n", " messages = response.get('messages')\n", " next_page = response.get('nextPageToken')\n", "\n", " for el in messages:\n", " yield el['id']\n", "\n", " if next_page is None:\n", " break" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To extract message bodies we need to distinguish between `text/plain` and `MIME` e-mails (to my understanding the latter allows\n", "embedded images and such).\n", "When an e-mail is just plain text then the body is found directly in the `payload`, however when the e-mail is MIME I will\n", "extract the body as the first of potentially many `parts`.\n", "See the [message API reference](https://developers.google.com/gmail/api/v1/reference/users/messages) for more detail." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def message_bodies():\n", " for ctr, message_id in enumerate(get_message_ids()):\n", " message = gmail_service.users().messages().get(userId='me', id=message_id, format='full').execute()\n", " \n", " try:\n", " body = message['payload']['parts'][0]['body']['data'] # MIME\n", " except KeyError:\n", " body = message['payload']['body']['data'] # text/plain\n", " \n", " body = base64.b64decode(str(body), '-_')\n", " \n", " yield body" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Parse Links (URLs) to Material of Interest" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I think of every e-mail with the *Links* label as a pointer to a resource on the Internet (be it a link to a webpage, PDF, etc.).\n", "Generally to extract URLs from plain text I will make use of the following regular expression formulated by\n", "[John Gruber](http://daringfireball.net/2010/07/improved_regex_for_matching_urls):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pattern = (r'(?i)\\b((?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])|'\n", " r'www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)'\n", " r'(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))'\n", " r'+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))* \\)|'\n", " r'[^\\s`!()\\[\\]{};:\\'\\\".,<>?\\\u00ab\\\u00bb\\\u201c\\\u201d\\\u2018\\\u2019]))')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When I sent myself a link through the Feedly iPad app, Feedly sometimes sends the entire article in the e-mail body\n", "but it does not do so consistently as far as I can judge.\n", "\n", "Some Feedly-originating e-mail bodies will contain multiple URLs (e.g. links as part of a blog article) but they mostly\n", "provide a link to the original story as the first URL in the e-mail.\n", "\n", "Other times I may type an e-mail myself in which I copy and paste multiple links to interesting resources - in which\n", "case I want to extract all URLs from the body and not just the first one." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "def is_feedly(body):\n", " return 'feedly.com' in body" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "def urls():\n", " for body in message_bodies():\n", " matches = re.findall(pattern, body)\n", " if is_feedly(body):\n", " match = matches[0]\n", " yield match[0] # Feedly e-mail: first URL is link to original story\n", " else:\n", " for match in matches:\n", " yield match[0]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I noticed a number of URLs in `urls()` that did not match my expectation - such as links to disqus comments and an online retailer.\n", "This is where a rule-based filter comes in:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "exclude = ['packtpub.com', 'disqus', '@', 'list-manage', 'utm_', 'ref=', 'campaign-archive']\n", "def urls_filtered():\n", " for url in urls():\n", " if not any([pattern in url.lower() for pattern in exclude]):\n", " yield url" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I quite often send myself a link to the Hacker News message thread on an article of interest.\n", "\n", "To extract the URL to the actual article we need to parse the HTML of the Hacker News discussion\n", "and locate the relevant URL in a `td` element with `class=\"title\"` - a rule that works for these particular Hacker News pages." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def is_hn(url):\n", " return 'news.ycombinator.com' in url" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "parser = etree.HTMLParser()\n", "def urls_hn_filtered():\n", " for url in urls_filtered():\n", " if is_hn(url) and (re.search(r'item\\?id=', url) is None):\n", " continue # do not keep HN links that do not point to an article\n", " elif is_hn(url):\n", " r = requests.get(url)\n", " if r.status_code != 200:\n", " continue # download of HN html failed, skip\n", " root = etree.parse(StringIO(r.text), parser).getroot()\n", " title = root.find(\".//td[@class='title']\")\n", " \n", " try:\n", " a = [child for child in title.getchildren() if child.tag == 'a'][0]\n", " except AttributeError:\n", " continue # title is None\n", "\n", " story_url = a.get('href')\n", " yield story_url\n", " else:\n", " yield url" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Poor memory gets the better of me quite often and I end up sending myself the same link multiple times.\n", "To avoid downloading the same resource twice and skewing follow-on work with duplicates I will track\n", "seen URLs with the following code:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def unique_urls():\n", " seen = defaultdict(bool)\n", " for url in urls_hn_filtered():\n", " key = hash(url)\n", " if seen[key]:\n", " continue\n", " else:\n", " seen[key] = True\n", " yield url" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Extract Readable Text from PDF and HTML Documents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a stream of unique URLs of interesting articles provided by `unique_urls` we can\n", "download each of these resources and extract their readable text.\n", "\n", "To extract readable text I will 
 "To extract readable text I will distinguish here between HTML and PDF documents.\n",
 "PDF text extraction is handled with `pdfminer` and I am making use of\n",
 "[this code](http://stackoverflow.com/a/23840353) shared in a StackOverflow answer.\n",
 "HTML text extraction is handled with `lxml` and [this code snippet](http://stackoverflow.com/a/23929292)."
 ] }, { "cell_type": "code", "collapsed": false, "input": [
 "def pdf_from_url_to_txt(url):\n",
 "    rsrcmgr = PDFResourceManager()\n",
 "    retstr = StringIO()\n",
 "    codec = 'utf-8'\n",
 "    laparams = LAParams()\n",
 "    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)\n",
 "    # Open the url provided as an argument to the function and read the content\n",
 "    f = urllib2.urlopen(urllib2.Request(url)).read()\n",
 "    # Cast to StringIO object\n",
 "    fp = StringIO(f)\n",
 "    interpreter = PDFPageInterpreter(rsrcmgr, device)\n",
 "    password = \"\"\n",
 "    maxpages = 0\n",
 "    caching = True\n",
 "    pagenos = set()\n",
 "    for page in PDFPage.get_pages(fp,\n",
 "                                  pagenos,\n",
 "                                  maxpages=maxpages,\n",
 "                                  password=password,\n",
 "                                  caching=caching,\n",
 "                                  check_extractable=True):\n",
 "        interpreter.process_page(page)\n",
 "    fp.close()\n",
 "    device.close()\n",
 "    string = retstr.getvalue()\n",
 "    retstr.close()\n",
 "    return string"
 ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [
 "def resource_text():\n",
 "    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:33.0) Gecko/20100101 Firefox/33.0'}\n",
 "    html_parser = etree.HTMLParser(recover=True, encoding='utf-8')\n",
 "\n",
 "    for url in unique_urls():\n",
 "        if url.endswith('.pdf'):\n",
 "            try:\n",
 "                text = pdf_from_url_to_txt(url)\n",
 "            except:\n",
 "                continue  # something went wrong, just skip ahead\n",
 "\n",
 "            yield url, text\n",
 "\n",
 "        else:\n",
 "            try:\n",
 "                r = requests.get(url, headers=headers)\n",
 "            except:\n",
 "                continue  # something went wrong with HTTP GET, just skip ahead\n",
 "\n",
 "            if r.status_code != 200:\n",
 "                continue\n",
 "            if 'text/html' not in r.headers.get('content-type', ''):\n",
 "                continue\n",
 "\n",
 "            # from: http://stackoverflow.com/a/23929292 and http://stackoverflow.com/a/15830619\n",
 "            try:\n",
 "                document = fromstring(r.text.encode('utf-8'), html_parser)\n",
 "            except:\n",
 "                continue  # error parsing document, just skip ahead\n",
 "\n",
 "            yield url, '\\n'.join(etree.XPath('//text()')(document))"
 ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Consume Pipeline and Store Extracted Text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "This concludes our pipeline: the generator exposed by `resource_text` returns our current best representation of\n",
 "useful text for each resource.\n",
 "\n",
 "All that remains now is to consume this generator and store the extracted text together with the source URL for future reference.\n",
 "\n",
 "Since this pipeline consists of a number of failure-prone steps (communicating with APIs, extracting text, etc.) I also\n",
 "want to make certain to catch errors and deal with them appropriately.\n",
 "In the generator `resource_text` I already included a number of places where we just skip ahead to the next resource in case of failure; however, failures may still occur and it is best to catch those errors right here where we consume the pipeline.\n",
 "\n",
 "I will be lazy here and just skip all resources that generate an error during text extraction.\n",
 "Hence I will consume the generator `resource_text` until it raises the `StopIteration` exception while skipping over all other exceptions.\n",
 "The `finally` clause makes certain that the database connection we write to is closed again on every iteration, including the one in which `StopIteration` is raised."
 ] }, { "cell_type": "code", "collapsed": false, "input": [
 "text_generator = resource_text()\n",
 "\n",
 "try:\n",
 "    db = sqlite3.connect('gmail_extracted_text.db')\n",
 "    db.execute('CREATE TABLE gmail (date text, url text, compression text, extracted blob)')\n",
 "    db.commit()\n",
 "    db.close()\n",
 "except sqlite3.OperationalError:\n",
 "    pass  # table gmail already exists\n",
 "\n",
 "while True:\n",
 "    try:\n",
 "        db = sqlite3.connect('gmail_extracted_text.db')\n",
 "\n",
 "        url, text = text_generator.next()\n",
 "\n",
 "        now = str(datetime.now())\n",
 "        if isinstance(text, unicode):\n",
 "            text = zlib.compress(text.encode('utf-8'))  # to decompress: zlib.decompress(text).decode('utf-8')\n",
 "        else:\n",
 "            text = zlib.compress(text)\n",
 "        db.execute('INSERT INTO gmail VALUES (?, ?, ?, ?)', (unicode(now), unicode(url), u'zlib', sqlite3.Binary(text)))\n",
 "        db.commit()\n",
 "    except StopIteration:\n",
 "        break  # generator consumed, stop calling .next() on it\n",
 "    except Exception, e:\n",
 "        print e\n",
 "        continue  # some other exception was thrown by the pipeline, just skip ahead\n",
 "    finally:\n",
 "        db.close()  # tidy up"
 ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "In my case the above pipeline extracted text from 571 resources and the resultant database is 7.9 MB in size."
 ] } ], "metadata": {} } ] }