{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Mining the Social Web\n",
    "\n",
    "## Mining Web Pages\n",
    "\n",
    "This Jupyter Notebook provides an interactive way to follow along with and explore the examples from the video series. The intent behind this notebook is to reinforce the concepts in a fun, convenient, and effective way."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Downloading nltk packages used in this example\n",
    "nltk.download('maxent_ne_chunker')\n",
    "nltk.download('words')\n",
    "\n",
    "ne_chunks = list(nltk.chunk.ne_chunk_sents(pos_tagged_tokens))\n",
    "print(ne_chunks)\n",
    "ne_chunks[0].pprint()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using boilerpipe to extract the text from a web page\n",
    "\n",
    "Example blog post:\n",
    "http://radar.oreilly.com/2010/07/louvre-industrial-age-henry-ford.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# May also require the installation of Java runtime libraries\n",
    "# pip install boilerpipe3\n",
    "from boilerpipe.extract import Extractor\n",
    "\n",
    "# If you're interested, learn more about how Boilerpipe works by reading\n",
    "# Christian Kohlschütter's paper: http://www.l3s.de/~kohlschuetter/boilerplate/\n",
    "\n",
    "URL='https://www.oreilly.com/ideas/ethics-in-data-project-design-its-about-planning'\n",
    "\n",
    "extractor = Extractor(extractor='ArticleExtractor', url=URL)\n",
    "\n",
    "print(extractor.getText())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using feedparser to extract the text (and other fields) from an RSS or Atom feed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import feedparser # pip install feedparser\n",
    "\n",
    "FEED_URL='http://feeds.feedburner.com/oreilly/radar/atom'\n",
    "\n",
    "fp = feedparser.parse(FEED_URL)\n",
    "\n",
    "for e in fp.entries:\n",
    "    print(e.title)\n",
    "    print(e.links[0].href)\n",
    "    print(e.content[0].value)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Harvesting blog data by parsing feeds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "import json\n",
    "import feedparser\n",
    "from bs4 import BeautifulSoup\n",
    "from nltk import clean_html\n",
    "\n",
    "FEED_URL = 'http://feeds.feedburner.com/oreilly/radar/atom'\n",
    "\n",
    "def cleanHtml(html):\n",
    "    if html == \"\": return \"\"\n",
    "\n",
    "    return BeautifulSoup(html, 'html5lib').get_text()\n",
    "\n",
    "fp = feedparser.parse(FEED_URL)\n",
    "\n",
    "print(\"Fetched {0} entries from '{1}'\".format(len(fp.entries[0].title), fp.feed.title))\n",
    "\n",
    "blog_posts = []\n",
    "for e in fp.entries:\n",
    "    blog_posts.append({'title': e.title, 'content'\n",
    "                      : cleanHtml(e.content[0].value), 'link': e.links[0].href})\n",
    "\n",
    "out_file = os.path.join('feed.json')\n",
    "f = open(out_file, 'w+')\n",
    "f.write(json.dumps(blog_posts, indent=1))\n",
    "f.close()\n",
    "\n",
    "print('Wrote output file to {0}'.format(f.name))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Starting to write a web crawler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import httplib2\n",
    "import re\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "http = httplib2.Http()\n",
    "status, response = http.request('http://www.nytimes.com')\n",
    "\n",
    "soup = BeautifulSoup(response, 'html5lib')\n",
    "\n",
    "links = []\n",
    " \n",
    "for link in soup.findAll('a', attrs={'href': re.compile(\"^http(s?)://\")}):\n",
    "    links.append(link.get('href'))\n",
    "\n",
    "for link in links:\n",
    "    print(link)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "Create an empty graph\n",
    "Create an empty queue to keep track of nodes that need to be processed\n",
    "\n",
    "Add the starting point to the graph as the root node\n",
    "Add the root node to a queue for processing\n",
    "\n",
    "Repeat until some maximum depth is reached or the queue is empty:\n",
    "  Remove a node from the queue \n",
    "  For each of the node's neighbors: \n",
    "    If the neighbor hasn't already been processed: \n",
    "      Add it to the queue \n",
    "      Add it to the graph \n",
    "      Create an edge in the graph that connects the node and its neighbor\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using NLTK to parse web page data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Naive sentence detection based on periods**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "text = \"Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.\"\n",
    "print(text.split(\".\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**More sophisticated sentence detection**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import nltk # Installation instructions: http://www.nltk.org/install.html\n",
    "\n",
    "# Downloading nltk packages used in this example\n",
    "nltk.download('punkt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sentences = nltk.tokenize.sent_tokenize(text)\n",
    "print(sentences)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "harder_example = \"\"\"My name is John Smith and my email address is j.smith@company.com.\n",
    "Mostly people call Mr. Smith. But I actually have a Ph.D.!\n",
    "Can you believe it? Neither can most people...\"\"\"\n",
    "\n",
    "sentences = nltk.tokenize.sent_tokenize(harder_example)\n",
    "print(sentences)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Word tokenization**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "text = \"Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.\"\n",
    "sentences = nltk.tokenize.sent_tokenize(text)\n",
    "\n",
    "tokens = [nltk.word_tokenize(s) for s in sentences]\n",
    "print(tokens)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Part of speech tagging for tokens**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Downloading nltk packages used in this example\n",
    "nltk.download('maxent_treebank_pos_tagger')\n",
    "\n",
    "pos_tagged_tokens = [nltk.pos_tag(t) for t in tokens]\n",
    "print(pos_tagged_tokens)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Alphabetical list of part-of-speech tags used in the Penn Treebank Project**\n",
    "\n",
    "See: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html\n",
    "\n",
    "| # | POS Tag | Meaning |\n",
    "|:-:|:-------:|:--------|\n",
    "| 1\t| CC | Coordinating conjunction|\n",
    "|2|\tCD\t|Cardinal number|\n",
    "|3|\tDT\t|Determiner|\n",
    "|4|\tEX\t|Existential there|\n",
    "|5|\tFW\t|Foreign word|\n",
    "|6|\tIN\t|Preposition or subordinating conjunction|\n",
    "|7|\tJJ\t|Adjective|\n",
    "|8|\tJJR\t|Adjective, comparative|\n",
    "|9|\tJJS\t|Adjective, superlative|\n",
    "|10|\tLS\t|List item marker|\n",
    "|11|\tMD\t|Modal|\n",
    "|12|\tNN\t|Noun, singular or mass|\n",
    "|13|\tNNS\t|Noun, plural|\n",
    "|14|\tNNP\t|Proper noun, singular|\n",
    "|15|\tNNPS\t|Proper noun, plural|\n",
    "|16|\tPDT\t|Predeterminer|\n",
    "|17|\tPOS\t|Possessive ending|\n",
    "|18|\tPRP\t|Personal pronoun|\n",
    "|19|\tPRP\\$\t|Possessive pronoun|\n",
    "|20|\tRB\t|Adverb|\n",
    "|21|\tRBR\t|Adverb, comparative|\n",
    "|22|\tRBS\t|Adverb, superlative|\n",
    "|23|\tRP\t|Particle|\n",
    "|24|\tSYM\t|Symbol|\n",
    "|25|\tTO\t|to|\n",
    "|26|\tUH\t|Interjection|\n",
    "|27|\tVB\t|Verb, base form|\n",
    "|28|\tVBD\t|Verb, past tense|\n",
    "|29|\tVBG\t|Verb, gerund or present participle|\n",
    "|30|\tVBN\t|Verb, past participle|\n",
    "|31|\tVBP\t|Verb, non-3rd person singular present|\n",
    "|32|\tVBZ\t|Verb, 3rd person singular present|\n",
    "|33|\tWDT\t|Wh-determiner|\n",
    "|34|\tWP\t|Wh-pronoun|\n",
    "|35|\tWP\\$|Possessive wh-pronoun|\n",
    "|36|\tWRB\t|Wh-adverb|"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Named entity extraction/chunking for tokens**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Downloading nltk packages used in this example\n",
    "nltk.download('maxent_ne_chunker')\n",
    "nltk.download('words')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "jim = \"Jim bought 300 shares of Acme Corp. in 2006.\"\n",
    "\n",
    "tokens = nltk.word_tokenize(jim)\n",
    "jim_tagged_tokens = nltk.pos_tag(tokens)\n",
    "\n",
    "ne_chunks = nltk.chunk.ne_chunk(jim_tagged_tokens)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "ne_chunks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "ne_chunks = [nltk.chunk.ne_chunk(ptt) for ptt in pos_tagged_tokens]\n",
    "\n",
    "ne_chunks[0].pprint()\n",
    "ne_chunks[1].pprint()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "ne_chunks[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "ne_chunks[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using NLTK’s NLP tools to process human language in blog data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import json\n",
    "import nltk\n",
    "\n",
    "BLOG_DATA = \"resources/ch06-webpages/feed.json\"\n",
    "\n",
    "blog_data = json.loads(open(BLOG_DATA).read())\n",
    "\n",
    "# Download nltk packages used in this example\n",
    "nltk.download('stopwords')\n",
    "\n",
    "# Customize your list of stopwords as needed. Here, we add common\n",
    "# punctuation and contraction artifacts.\n",
    "\n",
    "stop_words = nltk.corpus.stopwords.words('english') + [\n",
    "    '.',\n",
    "    ',',\n",
    "    '--',\n",
    "    '\\'s',\n",
    "    '?',\n",
    "    ')',\n",
    "    '(',\n",
    "    ':',\n",
    "    '\\'',\n",
    "    '\\'re',\n",
    "    '\"',\n",
    "    '-',\n",
    "    '}',\n",
    "    '{',\n",
    "    u'—',\n",
    "    ']',\n",
    "    '[',\n",
    "    '...'\n",
    "    ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "for post in blog_data:\n",
    "    sentences = nltk.tokenize.sent_tokenize(post['content'])\n",
    "\n",
    "    words = [w.lower() for sentence in sentences for w in\n",
    "             nltk.tokenize.word_tokenize(sentence)]\n",
    "\n",
    "    fdist = nltk.FreqDist(words)\n",
    "\n",
    "    # Remove stopwords from fdist\n",
    "    for sw in stop_words:\n",
    "        del fdist[sw]\n",
    "   \n",
    "    # Basic stats\n",
    "\n",
    "    num_words = sum([i[1] for i in fdist.items()])\n",
    "    num_unique_words = len(fdist.keys())\n",
    "\n",
    "    # Hapaxes are words that appear only once\n",
    "    num_hapaxes = len(fdist.hapaxes())\n",
    "\n",
    "    top_10_words_sans_stop_words = fdist.most_common(10)\n",
    "\n",
    "    print(post['title'])\n",
    "    print('\\tNum Sentences:'.ljust(25), len(sentences))\n",
    "    print('\\tNum Words:'.ljust(25), num_words)\n",
    "    print('\\tNum Unique Words:'.ljust(25), num_unique_words)\n",
    "    print('\\tNum Hapaxes:'.ljust(25), num_hapaxes)\n",
    "    print('\\tTop 10 Most Frequent Words (sans stop words):\\n\\t\\t', \\\n",
    "          '\\n\\t\\t'.join(['{0} ({1})'.format(w[0], w[1]) for w in top_10_words_sans_stop_words]))\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A document summarization algorithm based principally upon sentence detection and frequency analysis within sentences"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import json\n",
    "import nltk\n",
    "import numpy\n",
    "\n",
    "BLOG_DATA = \"feed.json\"\n",
    "\n",
    "blog_data = json.loads(open(BLOG_DATA).read())\n",
    "\n",
    "N = 100  # Number of words to consider\n",
    "CLUSTER_THRESHOLD = 5  # Distance between words to consider\n",
    "TOP_SENTENCES = 5  # Number of sentences to return for a \"top n\" summary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "stop_words = nltk.corpus.stopwords.words('english') + [\n",
    "    '.',\n",
    "    ',',\n",
    "    '--',\n",
    "    '\\'s',\n",
    "    '?',\n",
    "    ')',\n",
    "    '(',\n",
    "    ':',\n",
    "    '\\'',\n",
    "    '\\'re',\n",
    "    '\"',\n",
    "    '-',\n",
    "    '}',\n",
    "    '{',\n",
    "    u'—',\n",
    "    '>',\n",
    "    '<',\n",
    "    '...'\n",
    "    ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Approach taken from \"The Automatic Creation of Literature Abstracts\" by H.P. Luhn\n",
    "def _score_sentences(sentences, important_words):\n",
    "    scores = []\n",
    "    sentence_idx = 0\n",
    "\n",
    "    for s in [nltk.tokenize.word_tokenize(s) for s in sentences]:\n",
    "\n",
    "        word_idx = []\n",
    "\n",
    "        # For each word in the word list...\n",
    "        for w in important_words:\n",
    "            try:\n",
    "                # Compute an index for where any important words occur in the sentence.\n",
    "                word_idx.append(s.index(w))\n",
    "            except ValueError: # w not in this particular sentence\n",
    "                pass\n",
    "\n",
    "        word_idx.sort()\n",
    "\n",
    "        # It is possible that some sentences may not contain any important words at all.\n",
    "        if len(word_idx)== 0: continue\n",
    "\n",
    "        # Using the word index, compute clusters by using a max distance threshold\n",
    "        # for any two consecutive words.\n",
    "\n",
    "        clusters = []\n",
    "        cluster = [word_idx[0]]\n",
    "        i = 1\n",
    "        while i < len(word_idx):\n",
    "            if word_idx[i] - word_idx[i - 1] < CLUSTER_THRESHOLD:\n",
    "                cluster.append(word_idx[i])\n",
    "            else:\n",
    "                clusters.append(cluster[:])\n",
    "                cluster = [word_idx[i]]\n",
    "            i += 1\n",
    "        clusters.append(cluster)\n",
    "\n",
    "        # Score each cluster. The max score for any given cluster is the score \n",
    "        # for the sentence.\n",
    "\n",
    "        max_cluster_score = 0\n",
    "        \n",
    "        for c in clusters:\n",
    "            significant_words_in_cluster = len(c)\n",
    "            # true clusters also contain insignificant words, so we get \n",
    "            # the total cluster length by checking the indices\n",
    "            total_words_in_cluster = c[-1] - c[0] + 1\n",
    "            score = 1.0 * significant_words_in_cluster**2 / total_words_in_cluster\n",
    "\n",
    "            if score > max_cluster_score:\n",
    "                max_cluster_score = score\n",
    "\n",
    "        scores.append((sentence_idx, max_cluster_score))\n",
    "        sentence_idx += 1\n",
    "\n",
    "    return scores"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def summarize(txt):\n",
    "    sentences = [s for s in nltk.tokenize.sent_tokenize(txt)]\n",
    "    normalized_sentences = [s.lower() for s in sentences]\n",
    "\n",
    "    words = [w.lower() for sentence in normalized_sentences for w in\n",
    "             nltk.tokenize.word_tokenize(sentence)]\n",
    "\n",
    "    fdist = nltk.FreqDist(words)\n",
    "    \n",
    "    # Remove stopwords from fdist\n",
    "    for sw in stop_words:\n",
    "        del fdist[sw]\n",
    "\n",
    "    top_n_words = [w[0] for w in fdist.most_common(N)]\n",
    "\n",
    "    scored_sentences = _score_sentences(normalized_sentences, top_n_words)\n",
    "\n",
    "    # Summarization Approach 1:\n",
    "    # Filter out nonsignificant sentences by using the average score plus a\n",
    "    # fraction of the std dev as a filter\n",
    "\n",
    "    avg = numpy.mean([s[1] for s in scored_sentences])\n",
    "    std = numpy.std([s[1] for s in scored_sentences])\n",
    "    mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences\n",
    "                   if score > avg + 0.5 * std]\n",
    "\n",
    "    # Summarization Approach 2:\n",
    "    # Another approach would be to return only the top N ranked sentences\n",
    "\n",
    "    top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]\n",
    "    top_n_scored = sorted(top_n_scored, key=lambda s: s[0])\n",
    "\n",
    "    # Decorate the post object with summaries\n",
    "\n",
    "    return dict(top_n_summary=[sentences[idx] for (idx, score) in top_n_scored],\n",
    "                mean_scored_summary=[sentences[idx] for (idx, score) in mean_scored])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "for post in blog_data: \n",
    "    post.update(summarize(post['content']))\n",
    "\n",
    "    print(post['title'])\n",
    "    print('=' * len(post['title']))\n",
    "    print()\n",
    "    print('Top N Summary')\n",
    "    print('-------------')\n",
    "    print(' '.join(post['top_n_summary']))\n",
    "    print()\n",
    "    print('Mean Scored Summary')\n",
    "    print('-------------------')\n",
    "    print(' '.join(post['mean_scored_summary']))\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualizing document summarization results with HTML output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "from IPython.display import IFrame\n",
    "from IPython.core.display import display\n",
    "\n",
    "HTML_TEMPLATE = \"\"\"<html>\n",
    "    <head>\n",
    "        <title>{0}</title>\n",
    "        <meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"/>\n",
    "    </head>\n",
    "    <body>{1}</body>\n",
    "</html>\"\"\"\n",
    "\n",
    "for post in blog_data:\n",
    "   \n",
    "    # Uses previously defined summarize function.\n",
    "    post.update(summarize(post['content']))\n",
    "\n",
    "    # You could also store a version of the full post with key sentences marked up\n",
    "    # for analysis with simple string replacement...\n",
    "\n",
    "    for summary_type in ['top_n_summary', 'mean_scored_summary']:\n",
    "        post[summary_type + '_marked_up'] = '<p>{0}</p>'.format(post['content'])\n",
    "        \n",
    "        for s in post[summary_type]:\n",
    "            post[summary_type + '_marked_up'] = \\\n",
    "            post[summary_type + '_marked_up'].replace(s, '<strong>{0}</strong>'.format(s))\n",
    "\n",
    "        filename = post['title'].replace(\"?\", \"\") + '.summary.' + summary_type + '.html'\n",
    "        \n",
    "        f = open(os.path.join(filename), 'wb')\n",
    "        html = HTML_TEMPLATE.format(post['title'] + ' Summary', post[summary_type + '_marked_up'])    \n",
    "        f.write(html.encode('utf-8'))\n",
    "        f.close()\n",
    "\n",
    "        print(\"Data written to\", f.name)\n",
    "\n",
    "# Display any of these files with an inline frame. This displays the\n",
    "# last file processed by using the last value of f.name...\n",
    "print()\n",
    "print(\"Displaying {0}:\".format(f.name))\n",
    "display(IFrame('files/{0}'.format(f.name), '100%', '600px'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extracting entities from a text with NLTK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import nltk\n",
    "import json\n",
    "\n",
    "BLOG_DATA = \"feed.json\"\n",
    "\n",
    "blog_data = json.loads(open(BLOG_DATA).read())\n",
    "\n",
    "for post in blog_data:\n",
    "\n",
    "    sentences = nltk.tokenize.sent_tokenize(post['content'])\n",
    "    tokens = [nltk.tokenize.word_tokenize(s) for s in sentences]\n",
    "    pos_tagged_tokens = [nltk.pos_tag(t) for t in tokens]\n",
    "\n",
    "    # Flatten the list since we're not using sentence structure\n",
    "    # and sentences are guaranteed to be separated by a special\n",
    "    # POS tuple such as ('.', '.')\n",
    "\n",
    "    pos_tagged_tokens = [token for sent in pos_tagged_tokens for token in sent]\n",
    "\n",
    "    all_entity_chunks = []\n",
    "    previous_pos = None\n",
    "    current_entity_chunk = []\n",
    "    for (token, pos) in pos_tagged_tokens:\n",
    "\n",
    "        if pos == previous_pos and pos.startswith('NN'):\n",
    "            current_entity_chunk.append(token)\n",
    "        elif pos.startswith('NN'):\n",
    "            \n",
    "            if current_entity_chunk != []:\n",
    "                \n",
    "                # Note that current_entity_chunk could be a duplicate when appended,\n",
    "                # so frequency analysis again becomes a consideration\n",
    "\n",
    "                all_entity_chunks.append((' '.join(current_entity_chunk), pos))\n",
    "            current_entity_chunk = [token]\n",
    "\n",
    "        previous_pos = pos\n",
    "\n",
    "    # Store the chunks as an index for the document\n",
    "    # and account for frequency while we're at it...\n",
    "\n",
    "    post['entities'] = {}\n",
    "    for c in all_entity_chunks:\n",
    "        post['entities'][c] = post['entities'].get(c, 0) + 1\n",
    "\n",
    "    # For example, we could display just the title-cased entities\n",
    "\n",
    "    print(post['title'])\n",
    "    print('-' * len(post['title']))\n",
    "    proper_nouns = []\n",
    "    for (entity, pos) in post['entities']:\n",
    "        if entity.istitle():\n",
    "            print('\\t{0} ({1})'.format(entity, post['entities'][(entity, pos)]))\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Discovering interactions between entities"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import nltk\n",
    "import json\n",
    "\n",
    "BLOG_DATA = \"feed.json\"\n",
    "\n",
    "def extract_interactions(txt):\n",
    "    sentences = nltk.tokenize.sent_tokenize(txt)\n",
    "    tokens = [nltk.tokenize.word_tokenize(s) for s in sentences]\n",
    "    pos_tagged_tokens = [nltk.pos_tag(t) for t in tokens]\n",
    "\n",
    "    entity_interactions = []\n",
    "    for sentence in pos_tagged_tokens:\n",
    "\n",
    "        all_entity_chunks = []\n",
    "        previous_pos = None\n",
    "        current_entity_chunk = []\n",
    "\n",
    "        for (token, pos) in sentence:\n",
    "\n",
    "            if pos == previous_pos and pos.startswith('NN'):\n",
    "                current_entity_chunk.append(token)\n",
    "            elif pos.startswith('NN'):\n",
    "                if current_entity_chunk != []:\n",
    "                    all_entity_chunks.append((' '.join(current_entity_chunk),\n",
    "                            pos))\n",
    "                current_entity_chunk = [token]\n",
    "\n",
    "            previous_pos = pos\n",
    "\n",
    "        if len(all_entity_chunks) > 1:\n",
    "            entity_interactions.append(all_entity_chunks)\n",
    "        else:\n",
    "            entity_interactions.append([])\n",
    "\n",
    "    assert len(entity_interactions) == len(sentences)\n",
    "\n",
    "    return dict(entity_interactions=entity_interactions,\n",
    "                sentences=sentences)\n",
    "\n",
    "blog_data = json.loads(open(BLOG_DATA).read())\n",
    "\n",
    "# Display selected interactions on a per-sentence basis\n",
    "\n",
    "for post in blog_data:\n",
    "\n",
    "    post.update(extract_interactions(post['content']))\n",
    "\n",
    "    print(post['title'])\n",
    "    print('-' * len(post['title']))\n",
    "    for interactions in post['entity_interactions']:\n",
    "        print('; '.join([i[0] for i in interactions]))\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualizing interactions between entities with HTML output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import json\n",
    "import nltk\n",
    "from IPython.display import IFrame\n",
    "from IPython.core.display import display\n",
    "\n",
    "BLOG_DATA = \"feed.json\"\n",
    "\n",
    "HTML_TEMPLATE = \"\"\"<html>\n",
    "    <head>\n",
    "        <title>{0}</title>\n",
    "        <meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"/>\n",
    "    </head>\n",
    "    <body>{1}</body>\n",
    "</html>\"\"\"\n",
    "\n",
    "blog_data = json.loads(open(BLOG_DATA).read())\n",
    "\n",
    "for post in blog_data:\n",
    "\n",
    "    post.update(extract_interactions(post['content']))\n",
    "\n",
    "    # Display output as markup with entities presented in bold text\n",
    "\n",
    "    post['markup'] = []\n",
    "\n",
    "    for sentence_idx in range(len(post['sentences'])):\n",
    "\n",
    "        s = post['sentences'][sentence_idx]\n",
    "        for (term, _) in post['entity_interactions'][sentence_idx]:\n",
    "            s = s.replace(term, '<strong>{0}</strong>'.format(term))\n",
    "\n",
    "        post['markup'] += [s] \n",
    "            \n",
    "    filename = post['title'].replace(\"?\", \"\") + '.entity_interactions.html'\n",
    "    f = open(os.path.join(filename), 'wb')\n",
    "    html = HTML_TEMPLATE.format(post['title'] + ' Interactions', ' '.join(post['markup']))\n",
    "    f.write(html.encode('utf-8'))\n",
    "    f.close()\n",
    "\n",
    "    print('Data written to', f.name)\n",
    "    \n",
    "    # Display any of these files with an inline frame. This displays the\n",
    "    # last file processed by using the last value of f.name...\n",
    "    \n",
    "    print('Displaying {0}:'.format(f.name))\n",
    "    display(IFrame('files/{0}'.format(f.name), '100%', '600px'))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "mtsw",
   "language": "python",
   "name": "mtsw"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}