{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Relation extraction using distant supervision"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "__author__ = \"Bill MacCartney (wcmac@cs.stanford.edu)\"\n",
    "__version__ = \"CS224U, Stanford, Spring 2018\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Contents\n",
    "\n",
    "0. [Overview](#Overview)\n",
    "    0. [The task of relation extraction](#The-task-of-relation-extraction)\n",
    "    0. [Hand-built patterns](#Hand-built-patterns)\n",
    "    0. [Supervised learning](#Supervised-learning)\n",
    "    0. [Distant supervision](#Distant-supervision)\n",
    "0. [Set-up](#Set-up)\n",
    "0. [The corpus](#The-corpus)\n",
    "0. [The knowledge base](#The-knowledge-base)\n",
    "0. [Problem formulation](#Problem-formulation)\n",
    "    0. [Joining the corpus and the KB](#Joining-the-corpus-and-the-KB)\n",
    "    0. [Negative examples]([Negative-examples])\n",
    "    0. [Multi-label classification]([Multi-label-classification])\n",
    "    0. [Building datasets](#Building-datasets)\n",
    "0. [Evaluation](#Evaluation)\n",
    "    0. [Splitting the data](#Splitting-the-data)\n",
    "    0. [Choosing evaluation metrics](#Choosing-evaluation-metrics)\n",
    "    0. [Running evaluations](#Running-evaluations)\n",
    "    0. [Evaluating a random-guessing strategy](#Evaluating-a-random-guessing-strategy)\n",
    "0. [A simple baseline model](#A-simple-baseline-model)\n",
    "0. [Building a classifier](#Building-a-classifier)\n",
    "    0. [Featurizers](#Featurizers)\n",
    "    0. [Experiments](#Experiments)\n",
    "    0. [Examining the trained models](#Examining-the-trained-models)\n",
    "    0. [Discovering new relation instances](#Discovering-new-relation-instances)\n",
    "0. [Next steps](#Next-steps)\n",
    "0. [Homework 3](#Homework-3)\n",
    "0. [Bake-off](#Bake-off)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview\n",
    "\n",
    "This codebook illustrates an approach to [relation extraction](http://deepdive.stanford.edu/relation_extraction) using [distant supervision](http://deepdive.stanford.edu/distant_supervision). It uses a simplified version of the approach taken by Mintz et al. in their 2009 paper, [Distant supervision for relation extraction without labeled data](https://www.aclweb.org/anthology/P09-1113). If you haven't yet read that paper, read it now! The rest of the codebook will make a lot more sense after you're familiar with it.\n",
    "\n",
    "### The task of relation extraction\n",
    "\n",
    "Relation extraction is the task of extracting from natural language text relational triples such as:\n",
    "\n",
    "```\n",
    "(founders, SpaceX, Elon_Musk)\n",
    "(has_spouse, Elon_Musk, Talulah_Riley)\n",
    "(worked_at, Elon_Musk, Tesla_Motors)\n",
    "```\n",
    "\n",
    "If we can accumulate a large knowledge base (KB) of relational triples, we can use it to power question answering and other applications. Building a KB manually is slow and expensive, but much of the knowledge we'd like to capture is already expressed in abundant text on the web. The aim of relation extraction, therefore, is to accelerate the construction of new KBs — and facilitate the ongoing curation of existing KBs — by extracting relational triples from natural language text.\n",
    "\n",
    "### Hand-built patterns\n",
    "\n",
    "An obvious way to start is to write down a few patterns which express each relation. For example, we can use the pattern \"X is the founder of Y\" to find new instances of the `founders` relation. If we search a large corpus, we may find the phrase \"Elon Musk is the founder of SpaceX\", which we can use as evidence for the relational triple `(founders, SpaceX, Elon_Musk)`.\n",
    "\n",
    "Unfortunately, this approach doesn't get us very far. The central challenge of relation extraction is the fantastic diversity of language, the multitude of possible ways to express a given relation. For example, each of the following sentences expressed the relational triple `(founders, SpaceX, Elon_Musk)`:\n",
    "\n",
    "- \"You may also be thinking of _Elon Musk_ (founder of _SpaceX_), who started PayPal.\"\n",
    "- \"Interesting Fact: _Elon Musk_, co-founder of PayPal, went on to establish _SpaceX_, one of the most promising space travel startups in the world.\"\n",
    "- \"If Space Exploration (_SpaceX_), founded by Paypal pioneer _Elon Musk_ succeeds, commercial advocates will gain credibility and more support in Congress.\"\n",
    "\n",
    "The patterns which connect \"Elon Musk\" with \"SpaceX\" in these examples are not ones we could have easily anticipated. To do relation extraction effectively, we need to go beyond hand-built patterns.\n",
    "\n",
    "### Supervised learning\n",
    "\n",
    "Effective relation extraction will require applying machine learning methods. The natural place to start is with supervised learning. This means training an extraction model from a dataset of examples which have been labeled with the target output. Sentences like the three examples above would be annotated with the `founders` relation, but we'd also have sentences which include \"Elon Musk\" and \"SpaceX\" but do not express the `founders` relation, such as:\n",
    "\n",
    "- \"Billionaire entrepreneur _Elon Musk_ announced the latest addition to the _SpaceX_ arsenal: the 'Big F---ing Rocket' (BFR)\".\n",
    "\n",
    "Such \"negative examples\" would be labeled as such, and the fully-supervised model would then be able to learn from both positive and negative examples the linguistic patterns that indicate each relation.\n",
    "\n",
    "The difficulty with the fully-supervised approach is the cost of generating training data. Because of the great diversity of linguistic expression, our model will need lots and lots of training data: at least tens of thousands of examples, although hundreds of thousands or millions would be much better. But labeling the examples is just a slow and expensive as building the KB by hand would be.\n",
    "\n",
    "### Distant supervision\n",
    "\n",
    "The goal of distant supervision is to capture the benefits of supervised learning without paying the cost of labeling training data. Instead of labeling extraction examples by hand, we use existing relational triples to automatically identify extraction examples in a large corpus. For example, if we already have in our KB the relational triple `(founders, SpaceX, Elon_Musk)`, we can search a large corpus for sentences in which \"SpaceX\" and \"Elon Musk\" co-occur, make the (unreliable!) assumption that all the sentences express the `founder` relation, and then use them as training data for a learned model to identify new instances of the `founder` relation — all without doing any manual labeling.\n",
    "\n",
    "This is a powerful idea, but it has two limitations. The first is that, inevitably, some of the sentences in which \"SpaceX\" and \"Elon Musk\" co-occur will not express the `founder` relation — like the BFR example above. By making the blind assumption that all such sentences do express the `founder` relation, we are essentially injecting noise into our training data, and making it harder for our learning algorithms to learn good models. Distant supervision is effective in spite of this problem because it makes it possible to leverage vastly greater quantities of training data, and the benefit of more data outweighs the harm of noisier data.\n",
    "\n",
    "The second limitation is that we need an existing KB to start from. We can only train a model to extract new instances of the `founders` relation if we already have many instances of the `founders` relation. Thus, while distant supervision is a great way to extend an existing KB, it's not useful for creating a KB containing new relations from scratch.\n",
    "\n",
    "\\[ [top](#Relation-extraction-using-distant-supervision) \\]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set-up\n",
    "\n",
    "- Make sure your environment includes all the requirements for the cs224u repository.\n",
    "- Download the [data distribution for this unit](https://web.stanford.edu/class/cs224u/data/rel_ext_data.zip), unpack it, and place it in the directory containing the course repository. (If you want to put it somewhere else, change `rel_ext_data_home` below.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "rel_ext_data_home = 'rel_ext_data'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gzip\n",
    "import numpy as np\n",
    "import random\n",
    "import os\n",
    "\n",
    "from collections import Counter, defaultdict, namedtuple\n",
    "from sklearn.feature_extraction import DictVectorizer\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.metrics import precision_recall_fscore_support\n",
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\\[ [top](#Relation-extraction-using-distant-supervision) \\]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The corpus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As usual when we're doing NLP, we need to start with a _corpus_ — a large sample of natural language text. And because our goal is to do relation extraction with distant supervision, we need to be able to identify entities in the text and connect them to a knowledge base of relations between entities. So, we need a corpus in which entity mentions are annotated with _entity resolutions_ which map them to a unique, unambiguous identifiers. Entity resolution serves two purposes:\n",
    "\n",
    "0. It ensures that if an entity mention could refer to two different entities, it is properly disambiguated. For example, \"New York\" could refer to the city or the state.\n",
    "0. It ensures that if two different entity mentions refer to the same entity, they are properly identified. For example, both \"New York City\" and \"The Big Apple\" refer to New York City.\n",
    "\n",
    "The corpus we'll use for this project is derived from the [Wikilinks dataset](https://code.google.com/archive/p/wiki-links/) [announced by Google in 2013](https://research.googleblog.com/2013/03/learning-from-big-data-40-million.html). This dataset contains over 40M mentions of 3M distinct entities spanning 10M webpages. It provides entity resolutions by mapping each entity mention to a Wikipedia URL.\n",
    "\n",
    "Now, in order to do relation extraction, we actually need _pairs_ of entity mentions, and it's important to have the context around and between the two mentions. Fortunately, UMass has provided an [expanded version of Wikilinks](http://www.iesl.cs.umass.edu/data/data-wiki-links) which includes the context around each entity mention. We've written code to stitch together pairs of entity mentions along with their contexts, and we've filtered the examples extensively. The result is a compact corpus suitable for our purposes. Let's take a closer look."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Reading examples from rel_ext_data/corpus.tsv.gz\n",
      "Read 414123 examples\n"
     ]
    }
   ],
   "source": [
    "Example = namedtuple('Example',\n",
    "    'entity_1, entity_2, left, mention_1, middle, mention_2, right, '\n",
    "    'left_POS, mention_1_POS, middle_POS, mention_2_POS, right_POS')\n",
    "\n",
    "def read_examples():\n",
    "    examples = []\n",
    "    path = os.path.join(rel_ext_data_home, 'corpus.tsv.gz')\n",
    "    print('Reading examples from {}'.format(path))\n",
    "    with gzip.open(path) as f:\n",
    "        for line in f:\n",
    "            fields = line[:-1].decode('utf-8').split('\\t')\n",
    "            examples.append(Example(*fields))\n",
    "    print('Read {} examples'.format(len(examples)))            \n",
    "    return examples\n",
    "\n",
    "examples = read_examples()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great, that's a lot of examples! Let's take a closer look at one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Example(entity_1='New_Mexico', entity_2='Arizona', left='to all Spanish-occupied lands . The horno has a beehive shape and uses wood as the only heat source . The procedure still used in parts of', mention_1='New Mexico', middle='and', mention_2='Arizona', right='is to build a fire inside the Horno and , when the proper amount of time has passed , remove the embers and ashes and insert the', left_POS='to/TO all/DT Spanish-occupied/JJ lands/NNS ./. The/DT horno/NN has/VBZ a/DT beehive/NN shape/NN and/CC uses/VBZ wood/NN as/IN the/DT only/JJ heat/NN source/NN ./. The/DT procedure/NN still/RB used/VBN in/IN parts/NNS of/IN', mention_1_POS='New/NNP Mexico/NNP', middle_POS='and/CC', mention_2_POS='Arizona/NNP', right_POS='is/VBZ to/TO build/VB a/DT fire/NN inside/IN the/DT Horno/NNP and/CC ,/, when/WRB the/DT proper/JJ amount/NN of/IN time/NN has/VBZ passed/VBN ,/, remove/VB the/DT embers/NNS and/CC ashes/NNS and/CC insert/VB the/DT')\n"
     ]
    }
   ],
   "source": [
    "print(examples[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Every example represents a fragment of webpage text containing two entity mentions. The first two fields, `entity_1` and `entity_2`, contain unique identifiers for the two entities mentioned. We name entities using Wiki IDs, which you can think of as the last portion of a Wikipedia URL. Thus the Wiki ID `Barack_Obama` designates the entity described by [https://en.wikipedia.org/wiki/Barack_Obama](https://en.wikipedia.org/wiki/Barack_Obama).\n",
    "\n",
    "The next five fields represent the text surrounding the two mentions, divided into five chunks: `left` contains the text before the first mention, `mention_1` is the first mention itself, `middle` contains the text between the two mentions, `mention_2` is the second mention, and the field `right` contains the text after the second mention. Thus, we can reconstruct the context as a single string like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'to all Spanish-occupied lands . The horno has a beehive shape and uses wood as the only heat source . The procedure still used in parts of New Mexico and Arizona is to build a fire inside the Horno and , when the proper amount of time has passed , remove the embers and ashes and insert the'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ex = examples[1]\n",
    "' '.join((ex.left, ex.mention_1, ex.middle, ex.mention_2, ex.right))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last five fields contain the same five chunks of text, but this time annotated with part-of-speech (POS) tags, which may turn out to be useful when we start building models for relation extraction.\n",
    "\n",
    "Let's look at the distribution of entities over the corpus. How many entities are there, and what are the most common ones?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The corpus contains 107820 entities\n",
      "The most common entities are:\n",
      "      9399 India\n",
      "      6214 England\n",
      "      4585 Germany\n",
      "      4486 France\n",
      "      4128 Australia\n",
      "      3939 China\n",
      "      3930 Canada\n",
      "      3897 Italy\n",
      "      3368 California\n",
      "      3125 Pakistan\n",
      "      3103 Europe\n",
      "      3097 New_York_City\n",
      "      3025 London\n",
      "      2470 Japan\n",
      "      2468 United_Kingdom\n",
      "      2279 New_Zealand\n",
      "      2275 New_York\n",
      "      2259 Spain\n",
      "      2132 Philippines\n",
      "      2120 Asia\n"
     ]
    }
   ],
   "source": [
    "counter = Counter()\n",
    "for example in examples:\n",
    "    counter[example.entity_1] += 1\n",
    "    counter[example.entity_2] += 1\n",
    "print('The corpus contains {} entities'.format(len(counter)))\n",
    "counts = sorted([(count, key) for key, count in counter.items()], reverse=True)\n",
    "print('The most common entities are:')\n",
    "for count, key in counts[:20]:\n",
    "    print('{:10d} {}'.format(count, key))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because we're frequently going to want to retrieve corpus examples containing specific entities, it will be convenient to create a `Corpus` class which holds not only the examples themselves, but also a precomputed index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Corpus():\n",
    "\n",
    "    def __init__(self, examples):\n",
    "        self._examples = examples\n",
    "        self._examples_by_entities = {}\n",
    "        self._index_examples_by_entities()\n",
    "\n",
    "    def _index_examples_by_entities(self):\n",
    "        for ex in self._examples:\n",
    "            if ex.entity_1 not in self._examples_by_entities:\n",
    "                self._examples_by_entities[ex.entity_1] = {}\n",
    "            if ex.entity_2 not in self._examples_by_entities[ex.entity_1]:\n",
    "                self._examples_by_entities[ex.entity_1][ex.entity_2] = []\n",
    "            self._examples_by_entities[ex.entity_1][ex.entity_2].append(ex)\n",
    "    \n",
    "    def get_examples(self):\n",
    "        return iter(self._examples)\n",
    "        \n",
    "    def get_examples_for_entities(self, e1, e2):\n",
    "        try:\n",
    "            return self._examples_by_entities[e1][e2]\n",
    "        except KeyError:\n",
    "            return []\n",
    "        \n",
    "    def __repr__(self):\n",
    "        return 'Corpus with {} examples'.format(len(self._examples))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Corpus with 414123 examples"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corpus = Corpus(examples)\n",
    "corpus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The main benefit we gain from the `Corpus` class is the ability to retrieve examples containing specific entities. Let's find examples containing `Steve_Jobs` and `Pixar`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The first of 9 examples for Steve_Jobs and Pixar is:\n",
      "Example(entity_1='Steve_Jobs', entity_2='Pixar', left='of visual effects on films like The Abyss ( 1989 ) , Terminator 2 ( 1991 ) and Jurassic Park ( 1993 ) The computer graphics division of ILM was bought by', mention_1='Steve Jobs', middle='and became', mention_2='Pixar', right=', who would go on to make several groundbreaking animated films starting with Toy Story ( 1995 ) – more information on the history of that here', left_POS='of/IN visual/JJ effects/NNS on/IN films/NNS like/IN The/DT Abyss/NN -LRB-/-LRB- 1989/CD -RRB-/-RRB- ,/, Terminator/NNP 2/CD -LRB-/-LRB- 1991/CD -RRB-/-RRB- and/CC Jurassic/JJ Park/NN -LRB-/-LRB- 1993/CD -RRB-/-RRB- The/DT computer/NN graphics/NNS division/NN of/IN ILM/NNP was/VBD bought/VBN by/IN', mention_1_POS='Steve/NNP Jobs/NNP', middle_POS='and/CC became/VBD', mention_2_POS='Pixar/NNP', right_POS=',/, who/WP would/MD go/VB on/IN to/TO make/VB several/JJ groundbreaking/VBG animated/JJ films/NNS starting/VBG with/IN Toy/NNP Story/NNP -LRB-/-LRB- 1995/CD -RRB-/-RRB- --/: more/JJR information/NN on/IN the/DT history/NN of/IN that/DT here/RB')\n"
     ]
    }
   ],
   "source": [
    "def show_examples_for_pair(e1, e2, corpus):\n",
    "    exs = corpus.get_examples_for_entities(e1, e2)\n",
    "    if exs:\n",
    "        print('The first of {} examples for {} and {} is:'.format(len(exs), e1, e2))\n",
    "        print(exs[0])\n",
    "    else:\n",
    "        print('No examples for {} and {} is:'.format(e1, e2))\n",
    "\n",
    "show_examples_for_pair('Steve_Jobs', 'Pixar', corpus)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Actually, this might not be all of the examples containing `Steve_Jobs` and `Pixar`. It's only the examples where `Steve_Jobs` was mentioned first and `Pixar` second. There may be additional examples that have them in the reverse order. Let's check."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The first of 2 examples for Pixar and Steve_Jobs is:\n",
      "Example(entity_1='Pixar', entity_2='Steve_Jobs', left='in the visual accompaniment to his recordings of Bach ’ s Six Suites for Unaccompanied Cello . Ma has also been seen with Apple Inc. and former', mention_1='Pixar', middle='CEO', mention_2='Steve Jobs', right='. Ma is often invited to press events for Jobs ’ s companies , and has performed on stage during event keynote presentations , as well as appearing in', left_POS=\"in/IN the/DT visual/JJ accompaniment/NN to/TO his/PRP$ recordings/NNS of/IN Bach/NNP '/POS s/NNS Six/CD Suites/NNP for/IN Unaccompanied/NNP Cello/NNP ./. Ma/NNP has/VBZ also/RB been/VBN seen/VBN with/IN Apple/NNP Inc./NNP and/CC former/JJ\", mention_1_POS='Pixar/NNP', middle_POS='CEO/NNP', mention_2_POS='Steve/NNP Jobs/NNP', right_POS=\"./. Ma/NNP is/VBZ often/RB invited/VBN to/TO press/VB events/NNS for/IN Jobs/NNP '/POS s/NNS companies/NNS ,/, and/CC has/VBZ performed/VBN on/IN stage/NN during/IN event/NN keynote/NN presentations/NNS ,/, as/RB well/RB as/IN appearing/VBG in/IN\")\n"
     ]
    }
   ],
   "source": [
    "show_examples_for_pair('Pixar', 'Steve_Jobs', corpus)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sure enough. Going forward, we'll have to remember to check both \"directions\" when we're looking for examples contains a specific pair of entities.\n",
    "\n",
    "This corpus is not without flaws. As you get more familiar with it, you will likely discover that it contains many examples that are nearly — but not exactly — duplicates. This seems to be a consequence of the web document sampling methodology that was used in the construction of the Wikilinks dataset. However, despite a few warts, it will serve our purposes.\n",
    "\n",
    "One thing this corpus does _not_ include is any annotation about relations. Thus, it could not be used for the fully-supervised approach to relation extraction, because the fully-supervised approach requires that each pair of entity mentions be annotated with the relation (if any) that holds between the two entities. In order to make any headway, we'll need to connect the corpus with an external source of knowledge about relations. We need a knowledge base.\n",
    "\n",
    "\\[ [top](#Relation-extraction-using-distant-supervision) \\]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The knowledge base"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data distribution for this unit includes a _knowledge base_ (KB) ultimately derived from [Freebase](https://en.wikipedia.org/wiki/Freebase). Unfortunately, Freebase was shut down in 2016, but the Freebase data is still available from various sources and in various forms. The KB included here was extracted from the [Freebase Easy data dump](http://freebase-easy.cs.uni-freiburg.de/dump/).\n",
    "\n",
    "The KB is a collection of _relational triples_, each consisting of a _relation_, a _subject_, and an _object_. For example, here are three triples from the KB:\n",
    "\n",
    "```\n",
    "(place_of_birth, Barack_Obama, Honolulu)\n",
    "(has_spouse, Barack_Obama, Michelle_Obama)\n",
    "(author, The_Audacity_of_Hope, Barack_Obama)\n",
    "```\n",
    "\n",
    "As you might guess:\n",
    "\n",
    "- The relation is one of a handful of predefined constants, such as `place_of_birth` or `has_spouse`.\n",
    "- The subject and object are entities represented by Wiki IDs (that is, suffixes of Wikipedia URLs).\n",
    "\n",
    "Let's write some code to read the KB so that we can take a closer look."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Reading KB triples from rel_ext_data/kb.tsv.gz ...\n",
      "Read 56575 KB triples\n"
     ]
    }
   ],
   "source": [
    "KBTriple = namedtuple('KBTriple', 'rel, sbj, obj')\n",
    "\n",
    "def read_kb_triples():\n",
    "    kb_triples = []\n",
    "    path = os.path.join(rel_ext_data_home, 'kb.tsv.gz')\n",
    "    print('Reading KB triples from {} ...'.format(path))\n",
    "    with gzip.open(path) as f:\n",
    "        for line in f:\n",
    "            rel, sbj, obj = line[:-1].decode('utf-8').split('\\t')\n",
    "            kb_triples.append(KBTriple(rel, sbj, obj))\n",
    "    print('Read {} KB triples'.format(len(kb_triples)))\n",
    "    return kb_triples\n",
    "\n",
    "kb_triples = read_kb_triples()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "OK great, we have a KB!\n",
    "\n",
    "Now, just as we did for the corpus, we'll create a `KB` class to store the KB triples and some associated indexes. We'll want to be able to look up KB triples both by relation and by entities, so we'll create indexes for both of those access patterns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "class KB():\n",
    "\n",
    "    def __init__(self, kb_triples):\n",
    "        self._kb_triples = kb_triples\n",
    "        self._all_relations = []\n",
    "        self._all_entity_pairs = []\n",
    "        self._kb_triples_by_relation = {}\n",
    "        self._kb_triples_by_entities = {}\n",
    "        self._collect_all_entity_pairs()\n",
    "        self._index_kb_triples_by_relation()\n",
    "        self._index_kb_triples_by_entities()\n",
    "\n",
    "    def _collect_all_entity_pairs(self):\n",
    "        pairs = set()\n",
    "        for kbt in self._kb_triples:\n",
    "            pairs.add((kbt.sbj, kbt.obj))\n",
    "        self._all_entity_pairs = sorted(list(pairs))\n",
    "        \n",
    "    def _index_kb_triples_by_relation(self):\n",
    "        for kbt in self._kb_triples:\n",
    "            if kbt.rel not in self._kb_triples_by_relation:\n",
    "                self._kb_triples_by_relation[kbt.rel] = []\n",
    "            self._kb_triples_by_relation[kbt.rel].append(kbt)\n",
    "        self._all_relations = sorted(list(self._kb_triples_by_relation))\n",
    "    \n",
    "    def _index_kb_triples_by_entities(self):\n",
    "        for kbt in self._kb_triples:\n",
    "            if kbt.sbj not in self._kb_triples_by_entities:\n",
    "                self._kb_triples_by_entities[kbt.sbj] = {}\n",
    "            if kbt.obj not in self._kb_triples_by_entities[kbt.sbj]:\n",
    "                self._kb_triples_by_entities[kbt.sbj][kbt.obj] = []\n",
    "            self._kb_triples_by_entities[kbt.sbj][kbt.obj].append(kbt)\n",
    "\n",
    "    def get_triples(self):\n",
    "        return iter(self._kb_triples)\n",
    "        \n",
    "    def get_all_relations(self):\n",
    "        return self._all_relations\n",
    "            \n",
    "    def get_all_entity_pairs(self):\n",
    "        return self._all_entity_pairs\n",
    "            \n",
    "    def get_triples_for_relation(self, rel):\n",
    "        try:\n",
    "            return self._kb_triples_by_relation[rel]\n",
    "        except KeyError:\n",
    "            return []\n",
    "\n",
    "    def get_triples_for_entities(self, e1, e2):\n",
    "        try:\n",
    "            return self._kb_triples_by_entities[e1][e2]\n",
    "        except KeyError:\n",
    "            return []\n",
    "\n",
    "    def __repr__(self):\n",
    "        return 'KB with {} triples'.format(len(self._kb_triples))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "KB with 56575 triples"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kb = KB(kb_triples)\n",
    "kb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's get a sense of the high-level characteristics of this KB. Some questions we'd like to answer:\n",
    "\n",
    "- How many relations are there?\n",
    "- How big is each relation?\n",
    "- Examples of each relation.\n",
    "- How many unique entities does the KB include?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "16\n"
     ]
    }
   ],
   "source": [
    "all_relations = kb.get_all_relations()\n",
    "print(len(all_relations))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How big is each relation? That is, how many triples does each relation contain?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        2140 adjoins\n",
      "        3316 author\n",
      "         637 capital\n",
      "       22489 contains\n",
      "        4958 film_performance\n",
      "        2404 founders\n",
      "        1012 genre\n",
      "        3280 has_sibling\n",
      "        3774 has_spouse\n",
      "        3153 is_a\n",
      "        1981 nationality\n",
      "        2013 parents\n",
      "        1388 place_of_birth\n",
      "        1031 place_of_death\n",
      "        1526 profession\n",
      "        1473 worked_at\n"
     ]
    }
   ],
   "source": [
    "for rel in all_relations:\n",
    "    print('{:12d} {}'.format(len(kb.get_triples_for_relation(rel)), rel))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at one example from each relation, so that we can get a sense of what they mean."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('adjoins', 'Siegburg', 'Bonn')\n",
      "('author', 'Uncle_Silas', 'Sheridan_Le_Fanu')\n",
      "('capital', 'Tunisia', 'Tunis')\n",
      "('contains', 'Brickfields', 'Kuala_Lumpur_Sentral_railway_station')\n",
      "('film_performance', 'Colin_Hanks', 'The_Great_Buck_Howard')\n",
      "('founders', 'Bomis', 'Jimmy_Wales')\n",
      "('genre', 'SPARQL', 'Semantic_Web')\n",
      "('has_sibling', 'Ari_Emanuel', 'Rahm_Emanuel')\n",
      "('has_spouse', 'Percy_Bysshe_Shelley', 'Mary_Shelley')\n",
      "('is_a', 'Bhanu_Athaiya', 'Costume_designer')\n",
      "('nationality', 'Ruben_Rausing', 'Sweden')\n",
      "('parents', 'Prince_Arthur_of_Connaught', 'Prince_Arthur,_Duke_of_Connaught_and_Strathearn')\n",
      "('place_of_birth', 'William_Penny_Brookes', 'Much_Wenlock')\n",
      "('place_of_death', 'Jean_Drapeau', 'Montreal')\n",
      "('profession', 'Rufus_Wainwright', 'Actor')\n",
      "('worked_at', 'Ray_Jackendoff', 'Tufts_University')\n"
     ]
    }
   ],
   "source": [
    "for rel in all_relations:\n",
    "    print(tuple(kb.get_triples_for_relation(rel)[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `get_triples_for_entities()` method allows us to look up triples by the entities they contain. Let's use it to see what relation(s) hold between `France` and `Germany`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[KBTriple(rel='adjoins', sbj='France', obj='Germany')]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kb.get_triples_for_entities('France', 'Germany')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Relations like `adjoins` and `has_sibling` are intuitively symmetric — if the relation holds between _X_ and _Y_, then we expect it to hold between _Y_ and _X_ as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[KBTriple(rel='adjoins', sbj='Germany', obj='France')]"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kb.get_triples_for_entities('Germany', 'France')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, there's no guarantee that all such inverse triples actually appear in the KB. (You could write some code to check.)\n",
    "\n",
    "Most relations, however, are intuitively asymmetric. Let's see what relation holds between `Pixar` and `Steve_Jobs`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[KBTriple(rel='founders', sbj='Pixar', obj='Steve_Jobs')]"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kb.get_triples_for_entities('Pixar', 'Steve_Jobs')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's a bit arbitrary that the KB includes a given asymmetric relation rather than its inverse. For example, instead of the `founders` relation with triple `(founders, Pixar, Steve_Jobs)`, we might have had a `founder_of` relation with triple `(founder_of, Steve_Jobs, Pixar)`. It doesn't really matter.\n",
    "\n",
    "Although we don't have a `founder_of` relation, there might still be a relation between `Steve_Jobs` and `Pixar`. Let's check."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[KBTriple(rel='worked_at', sbj='Steve_Jobs', obj='Pixar')]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kb.get_triples_for_entities('Steve_Jobs', 'Pixar')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Aha, yes, that makes sense. So it can be the case that one relation holds between _X_ and _Y_, and a different relation holds between _Y_ and _X_.\n",
    "\n",
    "One more observation: there may be more than one relation that holds between a given pair of entities, even in one direction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[KBTriple(rel='has_sibling', sbj='Cleopatra', obj='Ptolemy_XIII_Theos_Philopator'),\n",
       " KBTriple(rel='has_spouse', sbj='Cleopatra', obj='Ptolemy_XIII_Theos_Philopator')]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kb.get_triples_for_entities('Cleopatra', 'Ptolemy_XIII_Theos_Philopator')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "No! What? Yup, it's true — [Cleopatra](https://en.wikipedia.org/wiki/Cleopatra) married her younger brother, [Ptolemy XIII](https://en.wikipedia.org/wiki/Ptolemy_XIII_Theos_Philopator). Wait, it gets worse — she also married her _even younger_ brother, [Ptolemy XIV](https://en.wikipedia.org/wiki/Ptolemy_XIV_of_Egypt). Apparently this was normal behavior in ancient Egypt.\n",
    "\n",
    "Moving on ...\n",
    "\n",
    "Let's look at the distribution of entities in the KB. How many entities are there, and what are the most common ones?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The KB contains 46275 entities\n",
      "The most common entities are:\n",
      "       962 England\n",
      "       815 India\n",
      "       465 London\n",
      "       456 Italy\n",
      "       437 France\n",
      "       420 Germany\n",
      "       412 California\n",
      "       396 United_Kingdom\n",
      "       378 Canada\n",
      "       324 New_York_City\n",
      "       262 Actor\n",
      "       248 New_York\n",
      "       244 Australia\n",
      "       235 China\n",
      "       226 Philippines\n",
      "       224 Japan\n",
      "       223 Russia\n",
      "       214 Scotland\n",
      "       204 Europe\n",
      "       177 Pakistan\n"
     ]
    }
   ],
   "source": [
    "counter = Counter()\n",
    "for kbt in kb.get_triples():\n",
    "    counter[kbt.sbj] += 1\n",
    "    counter[kbt.obj] += 1\n",
    "print('The KB contains {} entities'.format(len(counter)))\n",
    "counts = sorted([(count, key) for key, count in counter.items()], reverse=True)\n",
    "print('The most common entities are:')\n",
    "for count, key in counts[:20]:\n",
    "    print('{:10d} {}'.format(count, key))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The number of entities in the KB is less than half the number of entities in the corpus! Evidently the corpus has much broader coverage than the KB.\n",
    "\n",
    "Note that there is no promise or expectation that this KB is _complete_. Not only does the KB contain no mention of many entities from the corpus — even for the entities it does include, there may be possible triples which are true in the world but are missing from the KB. As an example, these triples are in the KB:\n",
    "\n",
    "```\n",
    "(founders, SpaceX, Elon_Musk)\n",
    "(founders, Tesla_Motors, Elon_Musk)\n",
    "(worked_at, Elon_Musk, Tesla_Motors)\n",
    "```\n",
    "\n",
    "but this one is not:\n",
    "\n",
    "```\n",
    "(worked_at, Elon_Musk, SpaceX)\n",
    "```\n",
    "\n",
    "In fact, the whole point of developing methods for automatic relation extraction is to extend existing KBs (and build new ones) by identifying new relational triples from natural language text. If our KBs were complete, we wouldn't have anything to do.\n",
    "\n",
    "\\[ [top](#Relation-extraction-using-distant-supervision) \\]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Problem formulation\n",
    "\n",
    "With our data assets in hand, it's time to provide a precise formulation of the prediction problem we aim to solve. We need to specify:\n",
    "\n",
    "- What is the input to the prediction?\n",
    "    - Is it a specific pair of entity _mentions_ in a specific context?\n",
    "    - Or is it a pair of _entities_, apart from any specific mentions?\n",
    "- What is the output of the prediction?\n",
    "    - Do we need to predict at most one relation label? (This is [multi-class classification](https://en.wikipedia.org/wiki/Multiclass_classification).)\n",
    "    - Or can we predict multiple relation labels? (This is [multi-label classification](https://en.wikipedia.org/wiki/Multi-label_classification).)\n",
    "\n",
    "### Joining the corpus and the KB\n",
    "\n",
    "In order to leverage the distant supervision paradigm, we'll need to connect information in the corpus with information in the KB. There are two possibilities, depending on how we formulate our prediction problem:\n",
    "\n",
    "- __Use the KB to generate labels for the corpus.__ If our problem is to classify a pair of entity _mentions_ in a specific example in the corpus, then we can use the KB to provide labels for training examples. Labeling specific examples is how the fully supervised paradigm works, so it's the obvious way to think about leveraging distant supervision as well. Although it can be made to work, it's not actually the preferred approach.\n",
    "- __Use the corpus to generate features for entity pairs.__ If instead our problem is to classify a pair of _entities_, then we can use all the examples from the corpus where those two entities co-occur to generate a feature representation describing the entity pair. This is the approach taken by [Mintz et al. 2009](https://www.aclweb.org/anthology/P09-1113), and it's the approach we'll pursue here.\n",
    "\n",
    "So we'll formulate our prediction problem such that the input is a pair of entities, and the goal is to predict what relation(s) the pair belongs to. The KB will provide the labels, and the corpus will provide the features.\n",
    "\n",
    "Let's determine how many examples we have for each triple in the KB. We'll compute averages per relation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                                             examples\n",
      "relation               examples    triples    /triple\n",
      "--------               --------    -------    -------\n",
      "adjoins                   85660       2140      40.03\n",
      "author                    15822       3316       4.77\n",
      "capital                   12520        637      19.65\n",
      "contains                  99572      22489       4.43\n",
      "film_performance          11195       4958       2.26\n",
      "founders                   8061       2404       3.35\n",
      "genre                      1941       1012       1.92\n",
      "has_sibling               12332       3280       3.76\n",
      "has_spouse                16188       3774       4.29\n",
      "is_a                       6955       3153       2.21\n",
      "nationality                4649       1981       2.35\n",
      "parents                    5387       2013       2.68\n",
      "place_of_birth             2214       1388       1.60\n",
      "place_of_death             2047       1031       1.99\n",
      "profession                 2876       1526       1.88\n",
      "worked_at                  4494       1473       3.05\n"
     ]
    }
   ],
   "source": [
    "def count_examples(corpus, kb):\n",
    "    counter = Counter()\n",
    "    for rel in all_relations:\n",
    "        for kbt in kb.get_triples_for_relation(rel):\n",
    "            # count examples in both forward and reverse directions\n",
    "            counter[rel] += len(corpus.get_examples_for_entities(kbt.sbj, kbt.obj))\n",
    "            counter[rel] += len(corpus.get_examples_for_entities(kbt.obj, kbt.sbj))\n",
    "    # report results\n",
    "    print('{:20s} {:>10s} {:>10s} {:>10s}'.format('', '', '', 'examples'))\n",
    "    print('{:20s} {:>10s} {:>10s} {:>10s}'.format('relation', 'examples', 'triples', '/triple'))\n",
    "    print('{:20s} {:>10s} {:>10s} {:>10s}'.format('--------', '--------', '-------', '-------'))\n",
    "    for rel in all_relations:\n",
    "        nx = counter[rel]\n",
    "        nt = len(kb.get_triples_for_relation(rel))\n",
    "        print('{:20s} {:10d} {:10d} {:10.2f}'.format(rel, nx, nt, 1.0 * nx / nt))\n",
    "        \n",
    "count_examples(corpus, kb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For most relations, the total number of examples is fairly large, so we can be optimistic about learning what linguistic patterns express a given relation. However, for individual entity pairs, the number of examples is often quite low. Of course, more data would be better — much better! But more data could quickly become unwieldy to work with in a notebook like this."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Negative instances\n",
    "\n",
    "By joining the corpus to the KB, we can obtain abundant positive instances for each relation. But a classifier cannot be trained on positive instances alone. In order to apply the distant supervision paradigm, we will also need some negative instances — that is, entity pairs which do not belong to any known relation. If you like, you can think of these entity pairs as being assigned to a special relation called `NO_RELATION`. We can find plenty of such pairs by searching for examples in the corpus which contain two entities which do not belong to any relation in the KB."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "def find_unrelated_pairs(corpus, kb):\n",
    "    unrelated_pairs = set()\n",
    "    for ex in corpus.get_examples():\n",
    "        if kb.get_triples_for_entities(ex.entity_1, ex.entity_2):\n",
    "            continue\n",
    "        if kb.get_triples_for_entities(ex.entity_2, ex.entity_1):\n",
    "            continue\n",
    "        unrelated_pairs.add((ex.entity_1, ex.entity_2))\n",
    "        unrelated_pairs.add((ex.entity_2, ex.entity_1))\n",
    "    return unrelated_pairs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found 301073 unrelated pairs, including:\n",
      "    ('Lainie_Kazan', 'Tab_Hunter')\n",
      "    ('City_of_Brussels', 'Belgian_Comic_Strip_Center')\n",
      "    ('John_Francis_Daley', 'Hart_Hanson')\n",
      "    ('Andrew_W._Mellon_Foundation', 'American_Council_on_Education')\n",
      "    ('Hollywood_Walk_of_Fame', 'Laura_Ingalls_Wilder_Medal')\n",
      "    ('Keeping_the_Faith', 'Great_Expectations')\n",
      "    ('American_Revolutionary_War', 'British_Empire')\n",
      "    ('Sino-Indian_War', 'Bangladesh_Liberation_War')\n",
      "    ('Greg_Howe', 'Richie_Kotzen')\n",
      "    ('A41_road', 'Marble_Arch')\n"
     ]
    }
   ],
   "source": [
    "unrelated_pairs = find_unrelated_pairs(corpus, kb)\n",
    "print('Found {} unrelated pairs, including:'.format(len(unrelated_pairs)))\n",
    "for pair in list(unrelated_pairs)[:10]:\n",
    "    print('   ', pair)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's a lot of negative instances! In fact, because these negative instances far outnumber our positive instances (that is, the triples in our KB), when we train models we'll wind up downsampling the negative instances substantially.\n",
    "\n",
    "Remember, though, that some of these supposedly negative instances may be false negatives. Our KB is not complete. A pair of entities might be related in real life even if they don't appear together in the KB."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Multi-label classification\n",
    "\n",
    "A given pair of entities can belong to more than one relation. In fact, this is quite common in our KB."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "def count_relation_combinations(kb):\n",
    "    counter = Counter()\n",
    "    for sbj, obj in kb.get_all_entity_pairs():\n",
    "        rels = tuple(sorted(set([kbt.rel for kbt in kb.get_triples_for_entities(sbj, obj)])))\n",
    "        if len(rels) > 1:\n",
    "            counter[rels] += 1\n",
    "    counts = sorted([(count, key) for key, count in counter.items()], reverse=True)\n",
    "    print('The most common relation combinations are:')\n",
    "    for count, key in counts:\n",
    "        print('{:10d} {}'.format(count, key))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The most common relation combinations are:\n",
      "      1526 ('is_a', 'profession')\n",
      "       495 ('capital', 'contains')\n",
      "       183 ('place_of_birth', 'place_of_death')\n",
      "        76 ('nationality', 'place_of_birth')\n",
      "        11 ('nationality', 'place_of_death')\n",
      "        11 ('adjoins', 'contains')\n",
      "         8 ('has_sibling', 'has_spouse')\n",
      "         3 ('nationality', 'place_of_birth', 'place_of_death')\n",
      "         2 ('parents', 'worked_at')\n",
      "         1 ('nationality', 'worked_at')\n",
      "         1 ('has_spouse', 'parents')\n",
      "         1 ('author', 'founders')\n"
     ]
    }
   ],
   "source": [
    "count_relation_combinations(kb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While a few of those combinations look like data errors, most look natural and intuitive. Multiple relations per entity pair is a commonplace phenomenon.\n",
    "\n",
    "This observation strongly suggests formulating our prediction problem as [multi-label classification](https://en.wikipedia.org/wiki/Multi-label_classification). We could instead treat it as [multi-class classification](https://en.wikipedia.org/wiki/Multiclass_classification) — and indeed, [Mintz et al. 2009](https://www.aclweb.org/anthology/P09-1113) did so — but if we do, we'll be faced with the problem of assigning a single relation label to entity pairs which actually belong to multiple relations. It's not obvious how best to do this (and Mintz et al. 2009 did not make their method clear).\n",
    "\n",
    "There are a number of ways to approach multi-label classification, but the most obvious is the [binary relevance method](https://en.wikipedia.org/wiki/Multi-label_classification#Problem_transformation_methods), which just factors multi-label classification over _n_ labels into _n_ independent binary classification problems, one for each label. A disadvantage of this approach is that, by treating the binary classification problems as independent, it fails to exploit correlations between labels. But it has the great virtue of simplicity, and it will suffice for our purposes.\n",
    "\n",
    "So our problem will be to take as input an entity pair and a candidate relation (label), and to return a binary prediction as to whether the entity pair belongs to the relation. Since a KB triple is precisely a relation and a pair of entities, we could say equivalently that our prediction problem amounts to binary classification of KB triples. Given a candidate KB triple, do we predict that it is valid?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Building datasets\n",
    "\n",
    "We're now in a position to write a function to build datasets suitable for training and evaluating predictive models. It will have the following characteristics:\n",
    "\n",
    "- Because we've formulated our problem as multi-label classification, and we'll be training separate models for each relation, we won't build a single dataset. Instead, we'll build a dataset for each relation, and our return value will be a map from relation names to datasets.\n",
    "- The dataset for each relation will consist of two parallel lists:\n",
    "  - A list of candidate `KBTriples` which combine the given relation with a pair of entities.\n",
    "  - A corresponding list of boolean labels indicating whether the given `KBTriple` belongs to the KB.\n",
    "- The dataset for each relation will include `KBTriples` derived from two sources:\n",
    "  - Positive instances will be drawn from the KB.\n",
    "  - Negative instances will be sampled from unrelated entity pairs, as described above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "def build_datasets(corpus, kb, include_positive=True, sampling_rate=0.1, seed=1):\n",
    "    unrelated_pairs = find_unrelated_pairs(corpus, kb)\n",
    "    random.seed(seed)\n",
    "    unrelated_pairs = random.sample(unrelated_pairs, int(sampling_rate * len(unrelated_pairs)))\n",
    "    kbts_by_rel = defaultdict(list)\n",
    "    labels_by_rel = defaultdict(list)\n",
    "    for index, rel in enumerate(all_relations):\n",
    "        if include_positive:\n",
    "            for kbt in kb.get_triples_for_relation(rel):\n",
    "                kbts_by_rel[rel].append(kbt)\n",
    "                labels_by_rel[rel].append(True)\n",
    "        for sbj, obj in unrelated_pairs:\n",
    "            kbts_by_rel[rel].append(KBTriple(rel, sbj, obj))\n",
    "            labels_by_rel[rel].append(False)        \n",
    "    return kbts_by_rel, labels_by_rel"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\\[ [top](#Relation-extraction-using-distant-supervision) \\]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation\n",
    "\n",
    "Before we start building models, let's set up a test harness that allows us to measure a model's performance. This may seem backwards, but it's analogous to the software engineering paradigm of [test-driven development](https://en.wikipedia.org/wiki/Test-driven_development): first, define success; then, pursue it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Splitting the data\n",
    "\n",
    "Whenever building a model from data, it's good practice to partition the data into a multiple _splits_ — minimally, a training split on which to train the model, and a test split on which to evaluate it. In fact, we'll go a bit further, and define four splits:\n",
    "\n",
    "- __The `tiny` split (1%).__ It's often useful to carve out a tiny chunk of data to use in place of training or test data during development. Of course, any quantitative results obtained by evaluating on the `tiny` split are nearly meaningless, but because evaluations run extremely fast, using this split is a good way to flush out bugs during iterative cycles of code development.\n",
    "- __The `train` split (69%).__ We'll use the majority of our data for training models, both during development and at final evaluation. Experiments with the `train` split may take longer to run, but they'll have much greater statistical power.\n",
    "- __The `dev` split (15%).__ We'll use the `dev` split as test data for intermediate (formative) evaluations during development. During routine experiments, all evaluations should use the `dev` split.\n",
    "- __The `test` split (15%).__ We'll reserve the `test` split for our final (summative) evaluation at the conclusion of our work. Running evaluations on the `test` split before you are ready to conclude your work is methodologically unsound and intellectually dishonest!\n",
    "\n",
    "Splitting our data assets is somewhat more complicated than in many other NLP problems, because we have both a corpus and KB. In order to minimize leakage of information from training data into test data, we'd like to split both the corpus and the KB. And in order to maximize the value of a finite quantity of data, we'd like to align the corpus splits and KB splits as closely as possible. In an ideal world, each split would have its own hermetically-sealed universe of entities, the corpus for that split would contain only examples mentioning those entities, and the KB for that split would contain only triples involving those entities. However, that ideal is not quite achievable in practice. In order to get as close as possible, we'll follow this plan:\n",
    "\n",
    "- First, we'll split the set of entities which appear as the subject in some KB triple.\n",
    "- Then, we'll split the set of KB triples based on their subject entity.\n",
    "- Finally, we'll split the set of corpus examples.\n",
    "  - If the first entity in the example has already been assigned to a split, we'll assign the example to the same split.\n",
    "  - Alternatively, if the second entity has already been assigned to a split, we'll assign the example to the same split.\n",
    "  - Otherwise, we'll assign the example to a split randomly.\n",
    "  \n",
    "<!-- \\[ TODO: figure out whether we actually need to split the _corpus_ -- any lift from testing on train corpus? \\] -->\n",
    "\n",
    "Here's code to implement the splits. It's OK to skip past the details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "def split_corpus_and_kb(\n",
    "        split_names=['tiny', 'train', 'dev', 'test'],\n",
    "        split_fracs=[0.01, 0.69, 0.15, 0.15],   \n",
    "        seed=1):\n",
    "    if len(split_names) != len(split_fracs):\n",
    "        raise ValueError('split_names and split_fracs must be of equal length')\n",
    "    if sum(split_fracs) != 1.0:\n",
    "        raise ValueError('split_fracs must sum to 1')\n",
    "    n = len(split_fracs) # for convenience only\n",
    "    \n",
    "    def split_list(xs):\n",
    "        xs = sorted(xs) # sorted for reproducibility\n",
    "        if seed:\n",
    "            random.seed(seed)\n",
    "        random.shuffle(xs)\n",
    "        split_points = [0] + [int(round(frac * len(xs))) for frac in np.cumsum(split_fracs)]\n",
    "        return [xs[split_points[i]:split_points[i + 1]] for i in range(n)]\n",
    "    \n",
    "    # first, split the entities that appear as subjects in the KB\n",
    "    sbjs = list(set([kbt.sbj for kbt in kb.get_triples()]))\n",
    "    sbj_splits = split_list(sbjs)\n",
    "    sbj_split_dict = dict([(sbj, i) for i, split in enumerate(sbj_splits) for sbj in split])\n",
    "    \n",
    "    # next, split the KB triples based on their subjects\n",
    "    kbt_splits = [[kbt for kbt in kb.get_triples() if sbj_split_dict[kbt.sbj] == i] for i in range(n)]\n",
    "    \n",
    "    # now split examples based on the entities they contain\n",
    "    ex_splits = [[] for i in range(n + 1)] # include an extra split\n",
    "    for ex in corpus.get_examples():\n",
    "        if ex.entity_1 in sbj_split_dict:\n",
    "            # if entity_1 is a sbj in the KB, assign example to split of that sbj\n",
    "            ex_splits[sbj_split_dict[ex.entity_1]].append(ex)\n",
    "        elif ex.entity_2 in sbj_split_dict:\n",
    "            # if entity_2 is a sbj in the KB, assign example to split of that sbj\n",
    "            ex_splits[sbj_split_dict[ex.entity_2]].append(ex)\n",
    "        else:\n",
    "            # otherwise, put in extra split to be redistributed\n",
    "            ex_splits[-1].append(ex)\n",
    "    # reallocate the examples that weren't assigned to a split on first pass\n",
    "    extra_ex_splits = split_list(ex_splits[-1])\n",
    "    ex_splits = [ex_splits[i] + extra_ex_splits[i] for i in range(n)]\n",
    "    \n",
    "    # create a Corpus and a KB for each split\n",
    "    data = {}\n",
    "    for i in range(n):\n",
    "        data[split_names[i]] = {'corpus': Corpus(ex_splits[i]), 'kb': KB(kbt_splits[i])}\n",
    "    data['all'] = {'corpus': corpus, 'kb': kb}\n",
    "    return data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great. Let's use it to create the splits."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'all': {'corpus': Corpus with 414123 examples, 'kb': KB with 56575 triples},\n",
       " 'dev': {'corpus': Corpus with 57241 examples, 'kb': KB with 7939 triples},\n",
       " 'test': {'corpus': Corpus with 63382 examples, 'kb': KB with 8053 triples},\n",
       " 'tiny': {'corpus': Corpus with 3458 examples, 'kb': KB with 425 triples},\n",
       " 'train': {'corpus': Corpus with 290042 examples, 'kb': KB with 40158 triples}}"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = split_corpus_and_kb(seed=1)\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So now we can use `data['train']['corpus']` to refer to the training corpus, or `data['dev']['kb']` to refer to the dev KB.\n",
    "\n",
    "As a convenience, let's add a function for creating datasets for a specific split:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "def build_datasets_for_split(split, include_positive=True, sampling_rate=0.1, seed=1):\n",
    "    return build_datasets(data[split]['corpus'], data[split]['kb'], include_positive, sampling_rate, seed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Choosing evaluation metrics\n",
    "\n",
    "Because we've formulated our prediction problem as a family of binary classification problems, one for each relation (label), choosing evaluation metrics is pretty straightforward. The standard metrics for evaluating binary classification are [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall), which are more meaningful than simple accuracy, particularly in problems with a highly biased label distribution (like ours). We'll compute and report precision and recall separately for each relation (label). There are only two wrinkles:\n",
    "\n",
    "0. __How best to combine precision and recall into a single metric.__ Having two evaluation metrics is often inconvenient. If we're considering a change to our model which improves precision but degrades recall, should we take it? To drive an iterative development process, it's useful to have a single metric on which to hill-climb. For binary classification, the standard answer is the [F<sub>1</sub>-score](https://en.wikipedia.org/wiki/F1_score), which is the harmonic mean of precision and recall. However, the F<sub>1</sub>-score gives equal weight to precision and recall. For our purposes, precision is probably more important than recall. If we're extracting new relation triples from (massively abundant) text on the web in order to augment a knowledge base, it's probably more important that the triples we extract are correct (precision) than that we extract all the triples we could (recall). Accordingly, instead of the F<sub>1</sub>-score, we'll use the F<sub>0.5</sub>-score, which gives precision twice as much weight as recall.\n",
    "\n",
    "0. __How to aggregate metrics across relations (labels).__ Reporting metrics separately for each relation is great, but in order to drive iterative development, we'd also like to have summary metrics which aggregate across all relations. There are two possible ways to do it: _micro-averaging_ will give equal weight to all problem instances, and thus give greater weight to relations with more instances, while _macro-averaging_ will give equal weight to all relations, and thus give lesser weight to problem instances in relations with more instances. Because the number of problem instances per relation is, to some degree, an accident of our data collection methodology, we'll choose macro-averaging.\n",
    "\n",
    "Thus, while every evaluation will report lots of metrics, when we need a single metric on which to hill-climb, it will be the macro-averaged F<sub>0.5</sub>-score."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Running evaluations\n",
    "\n",
    "It's time to write some code to run evaluations and report results. This is now straightforward. The `evaluate()` function takes as inputs:\n",
    "\n",
    "- `classifier`, which is just a function that takes a list of `KBTriples` and returns a list of boolean predictions;\n",
    "- `test_split`, the split on which to evaluate the classifier, `dev` by default;\n",
    "- `verbose`, a boolean indicating whether to print output.\n",
    "\n",
    "The other functions below are just helper functions to `evaluate()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "def print_statistics_header():\n",
    "    print('{:20s} {:>10s} {:>10s} {:>10s} {:>10s} {:>10s}'.format(\n",
    "        'relation', 'precision', 'recall', 'f-score', 'support', 'size'))\n",
    "    print('{:20s} {:>10s} {:>10s} {:>10s} {:>10s} {:>10s}'.format(\n",
    "        '-' * 18, '-' * 9, '-' * 9, '-' * 9, '-' * 9, '-' * 9))\n",
    "\n",
    "def print_statistics_row(rel, result):\n",
    "    print('{:20s} {:10.3f} {:10.3f} {:10.3f} {:10d} {:10d}'.format(rel, *result))\n",
    "\n",
    "def print_statistics_footer(avg_result):\n",
    "    print('{:20s} {:>10s} {:>10s} {:>10s} {:>10s} {:>10s}'.format(\n",
    "        '-' * 18, '-' * 9, '-' * 9, '-' * 9, '-' * 9, '-' * 9))\n",
    "    print('{:20s} {:10.3f} {:10.3f} {:10.3f} {:10d} {:10d}'.format('macro-average', *avg_result))\n",
    "\n",
    "def macro_average_results(results):\n",
    "    avg_result = [np.average([r[i] for r in results.values()]) for i in range(3)]\n",
    "    avg_result.append(np.sum([r[3] for r in results.values()]))\n",
    "    avg_result.append(np.sum([r[4] for r in results.values()]))\n",
    "    return avg_result\n",
    "    \n",
    "def evaluate(classifier, test_split='dev', verbose=True):\n",
    "    test_kbts_by_rel, true_labels_by_rel = build_datasets_for_split(test_split)\n",
    "    results = {}\n",
    "    if verbose:\n",
    "        print_statistics_header()\n",
    "    for rel in all_relations:\n",
    "        pred_labels = classifier(test_kbts_by_rel[rel])\n",
    "        stats = precision_recall_fscore_support(true_labels_by_rel[rel], pred_labels, beta=0.5)\n",
    "        stats = [stat[1] for stat in stats]  # stats[1] is the stat for label True\n",
    "        stats.append(len(pred_labels)) # number of examples\n",
    "        results[rel] = stats\n",
    "        if verbose:\n",
    "            print_statistics_row(rel, results[rel])\n",
    "    avg_result = macro_average_results(results)\n",
    "    if verbose:\n",
    "        print_statistics_footer(avg_result)\n",
    "    return avg_result[2]  # return f_0.5 score as summary statistic"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluating a random-guessing strategy\n",
    "\n",
    "In order to validate our evaluation framework, and to set a floor under expected results for future evaluations, let's implement and evaluate a random-guessing strategy. The random guesser is a classifier which completely ignores its input, and simply flips a coin."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "def lift(f):\n",
    "    return lambda xs: [f(x) for x in xs]\n",
    "\n",
    "def make_random_classifier(p=0.50):\n",
    "    def random_classify(kb_triple):\n",
    "        return random.random() < p\n",
    "    return lift(random_classify)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "relation              precision     recall    f-score    support       size\n",
      "------------------    ---------  ---------  ---------  ---------  ---------\n",
      "adjoins                   0.058      0.515      0.070        303       5319\n",
      "author                    0.088      0.508      0.106        480       5496\n",
      "capital                   0.019      0.539      0.024         89       5105\n",
      "contains                  0.349      0.502      0.371       2667       7683\n",
      "film_performance          0.138      0.491      0.162        822       5838\n",
      "founders                  0.064      0.482      0.078        359       5375\n",
      "genre                     0.039      0.608      0.048        166       5182\n",
      "has_sibling               0.092      0.493      0.109        513       5529\n",
      "has_spouse                0.110      0.530      0.130        575       5591\n",
      "is_a                      0.099      0.547      0.119        494       5510\n",
      "nationality               0.054      0.463      0.066        311       5327\n",
      "parents                   0.062      0.502      0.075        325       5341\n",
      "place_of_birth            0.040      0.488      0.049        217       5233\n",
      "place_of_death            0.028      0.490      0.034        145       5161\n",
      "profession                0.052      0.563      0.063        245       5261\n",
      "worked_at                 0.047      0.544      0.058        228       5244\n",
      "------------------    ---------  ---------  ---------  ---------  ---------\n",
      "macro-average             0.084      0.517      0.098       7939      88195\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0.09757501010273492"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluate(make_random_classifier())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The results are not too surprising. Recall is generally around 0.50, which makes sense: on any given example with label `True`, we are 50% likely to guess the right label. But precision is very poor, because most labels are not `True`, and because our classifier is completely ignorant of the features of specific problem instances. Accordingly, the F<sub>0.5</sub>-score is also very poor — first because even the equally-weighted F<sub>1</sub>-score is always closer to the lesser of precision and recall, and second because the F<sub>0.5</sub>-score weights precision twice as much as recall.\n",
    "\n",
    "Actually, the most remarkable result in this table is the comparatively good performance for the `contains` relation! What does this result tell us about the data?\n",
    "\n",
    "\\[ [top](#Relation-extraction-using-distant-supervision) \\]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A simple baseline model\n",
    "\n",
    "It shouldn't be too hard to do better than random guessing. But for now, let's aim low — let's use the data we have in the easiest and most obvious way, and see how far that gets us.\n",
    "\n",
    "We start from the intuition that the words between two entity mentions frequently tell us how they're related. For example, in the phrase \"SpaceX was founded by Elon Musk\", the words \"was founded by\" indicate that the `founders` relation holds between the first entity mentioned and the second. Likewise, in the phrase \"Elon Musk established SpaceX\", the word \"established\" indicates the `founders` relation holds between the second entity mentioned and the first.\n",
    "\n",
    "So let's write some code to find the most common phrases that appear between the two entity mentions for each relation. As the examples illustrate, we need to make sure to consider both directions: that is, where the subject of the relation appears as the first mention, and where it appears as the second."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "adjoins              fwd         8461 ,\n",
      "adjoins              fwd         5633 and\n",
      "adjoins              fwd          993 , and\n",
      "adjoins              fwd         5599 ,\n",
      "adjoins              fwd         3780 and\n",
      "adjoins              fwd          680 , and\n",
      "author               fwd         1214 by\n",
      "author               fwd          155 ,\n",
      "author               fwd          130 , by\n",
      "author               fwd         1106 's\n",
      "author               fwd          294 ‘ s\n",
      "author               fwd          175 ’ s\n",
      "capital              fwd           37 ,\n",
      "capital              fwd           19 in\n",
      "capital              fwd           18 (\n",
      "capital              fwd         3711 ,\n",
      "capital              fwd          178 in\n",
      "capital              fwd           87 , the capital of\n",
      "contains             fwd          460 's\n",
      "contains             fwd          355 ,\n",
      "contains             fwd          250 (\n",
      "contains             fwd        25095 ,\n",
      "contains             fwd         5603 in\n",
      "contains             fwd          668 in the\n",
      "film_performance     fwd          286 in\n",
      "film_performance     fwd          200 's\n",
      "film_performance     fwd          115 film\n",
      "film_performance     fwd          213 with\n",
      "film_performance     fwd          152 , starring\n",
      "film_performance     fwd          115 opposite\n",
      "founders             fwd           98 founder\n",
      "founders             fwd           59 co-founder\n",
      "founders             fwd           57 ,\n",
      "founders             fwd          180 's\n",
      "founders             fwd          104 of\n",
      "founders             fwd           77 ‘ s\n",
      "genre                fwd           28 , a\n",
      "genre                fwd           13 in 1994 , he became a central figure in the\n",
      "genre                fwd           11 is a\n",
      "genre                fwd          122 ,\n",
      "genre                fwd           62 series\n",
      "genre                fwd           23 \n",
      "has_sibling          fwd         1369 and\n",
      "has_sibling          fwd          614 ,\n",
      "has_sibling          fwd          139 , and\n",
      "has_sibling          fwd          930 and\n",
      "has_sibling          fwd          460 ,\n",
      "has_sibling          fwd           94 , and\n",
      "has_spouse           fwd         2029 and\n",
      "has_spouse           fwd          375 ,\n",
      "has_spouse           fwd          112 and his wife\n",
      "has_spouse           fwd         1382 and\n",
      "has_spouse           fwd          271 ,\n",
      "has_spouse           fwd           78 and his wife\n",
      "is_a                 fwd          120 ,\n",
      "is_a                 fwd           77 and\n",
      "is_a                 fwd           34 , a\n",
      "is_a                 fwd          252 ,\n",
      "is_a                 fwd          175 and\n",
      "is_a                 fwd           81 \n",
      "nationality          fwd          331 of\n",
      "nationality          fwd           79 in\n",
      "nationality          fwd           34 of the\n",
      "nationality          fwd           57 ,\n",
      "nationality          fwd           27 by\n",
      "nationality          fwd           25 under\n",
      "parents              fwd           77 , son of\n",
      "parents              fwd           50 and\n",
      "parents              fwd           44 ,\n",
      "parents              fwd          177 and\n",
      "parents              fwd          167 ,\n",
      "parents              fwd           47 and his son\n",
      "place_of_birth       fwd           90 of\n",
      "place_of_birth       fwd           64 was born in\n",
      "place_of_birth       fwd           37 in\n",
      "place_of_birth       fwd           17 ,\n",
      "place_of_birth       fwd           16 by\n",
      "place_of_birth       fwd           11 under\n",
      "place_of_death       fwd           73 in\n",
      "place_of_death       fwd           63 of\n",
      "place_of_death       fwd           17 at\n",
      "place_of_death       fwd           12 ,\n",
      "place_of_death       fwd           10 under\n",
      "place_of_death       fwd            9 mayor\n",
      "profession           fwd           65 ,\n",
      "profession           fwd           27 , a\n",
      "profession           fwd           20 and\n",
      "profession           fwd          114 ,\n",
      "profession           fwd           74 \n",
      "profession           fwd           24 and\n",
      "worked_at            fwd          103 of\n",
      "worked_at            fwd           82 at\n",
      "worked_at            fwd           66 's\n",
      "worked_at            fwd           37 ,\n",
      "worked_at            fwd           30 founder\n",
      "worked_at            fwd           25 co-founder\n"
     ]
    }
   ],
   "source": [
    "def find_common_middles(split='train', top_k=3, show_output=False):\n",
    "    corpus = data[split]['corpus']\n",
    "    kb = data[split]['kb']\n",
    "    mids_by_rel = {\n",
    "        'fwd': defaultdict(lambda: defaultdict(int)),\n",
    "        'rev': defaultdict(lambda: defaultdict(int)),\n",
    "    }\n",
    "    for rel in all_relations:\n",
    "        for kbt in kb.get_triples_for_relation(rel):\n",
    "            for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):\n",
    "                mids_by_rel['fwd'][rel][ex.middle] += 1\n",
    "            for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):\n",
    "                mids_by_rel['rev'][rel][ex.middle] += 1\n",
    "    def most_frequent(mid_counter):\n",
    "        return sorted([(cnt, mid) for mid, cnt in mid_counter.items()], reverse=True)[:top_k]\n",
    "    for rel in all_relations:\n",
    "        for dir in ['fwd', 'rev']:\n",
    "            top = most_frequent(mids_by_rel[dir][rel])\n",
    "            if show_output:\n",
    "                for cnt, mid in top:\n",
    "                    print('{:20s} {:5s} {:10d} {:s}'.format(rel, 'fwd', cnt, mid))\n",
    "            mids_by_rel[dir][rel] = set([mid for cnt, mid in top])\n",
    "    return mids_by_rel\n",
    "\n",
    "_ = find_common_middles(show_output=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A few observations here:\n",
    "\n",
    "- Some of the most frequent middles are natural and intuitive. For example, \", son of\" indicates a forward `parents` relation, while \"and his son\" indicates a reverse `parents` relation.\n",
    "- Punctuation and stop words such as \"and\" and \"of\" are extremely common. Unlike some other NLP applications, it's probably a bad idea to throw these away — they carry lots of useful information.\n",
    "- However, punctuation and stop words tend to be highly ambiguous. For example, a bare comma is a likely middle for almost every relation in at least one direction.\n",
    "- A few of the results reflect quirks of the dataset. For example, the appearance of the phrase \"in 1994 , he became a central figure in the\" as a common middle for the `genre` relation reflects both the relative scarcity of examples for that relation, and an unfortunate tendency of the Wikilinks dataset to include duplicate or near-duplicate source documents. (That middle connects the entities [Ready to Die](https://en.wikipedia.org/wiki/Ready_to_Die) — the first studio album by the Notorious B.I.G. — and [East Coast hip hop](https://en.wikipedia.org/wiki/East_Coast_hip_hop).)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "def train_top_k_middles_classifier(train_split='train', top_k=3):\n",
    "    corpus = data[train_split]['corpus']\n",
    "    top_k_mids_by_rel = find_common_middles(split=train_split, top_k=top_k)\n",
    "    def classify(kb_triple):\n",
    "        fwd_mids = top_k_mids_by_rel['fwd'][kb_triple.rel]\n",
    "        rev_mids = top_k_mids_by_rel['rev'][kb_triple.rel]\n",
    "        for ex in corpus.get_examples_for_entities(kb_triple.sbj, kb_triple.obj):\n",
    "            if ex.middle in fwd_mids:\n",
    "                return True\n",
    "        for ex in corpus.get_examples_for_entities(kb_triple.obj, kb_triple.sbj):\n",
    "            if ex.middle in rev_mids:\n",
    "                return True\n",
    "        return False\n",
    "    return lift(classify)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "relation              precision     recall    f-score    support       size\n",
      "------------------    ---------  ---------  ---------  ---------  ---------\n",
      "adjoins                   0.311      0.406      0.327        303       5319\n",
      "author                    0.212      0.058      0.139        480       5496\n",
      "capital                   0.087      0.191      0.098         89       5105\n",
      "contains                  0.494      0.066      0.214       2667       7683\n",
      "film_performance          0.250      0.002      0.012        822       5838\n",
      "founders                  0.177      0.061      0.129        359       5375\n",
      "genre                     0.000      0.000      0.000        166       5182\n",
      "has_sibling               0.295      0.222      0.277        513       5529\n",
      "has_spouse                0.359      0.249      0.330        575       5591\n",
      "is_a                      0.019      0.010      0.016        494       5510\n",
      "nationality               0.115      0.039      0.083        311       5327\n",
      "parents                   0.079      0.068      0.077        325       5341\n",
      "place_of_birth            0.052      0.023      0.042        217       5233\n",
      "place_of_death            0.011      0.007      0.010        145       5161\n",
      "profession                0.008      0.008      0.008        245       5261\n",
      "worked_at                 0.074      0.031      0.058        228       5244\n",
      "------------------    ---------  ---------  ---------  ---------  ---------\n",
      "macro-average             0.159      0.090      0.114       7939      88195\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0.11360108865631796"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluate(train_top_k_middles_classifier())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Not surprisingly, the performance of even this extremely simplistic model is noticeably better than random guessing. Of course, recall is much worse across the board, but precision and F<sub>0.5</sub>-score are sometimes much better. We observe big gains especially on `adjoins`, `author`, `contains`, `founders`, `has_sibling`, and `has spouse`. Then again, at least one relation actually got worse. (Can you offer any explanation for that?)\n",
    "\n",
    "Admittedly, performance is still not great in absolute terms. However, we should have modest expectations for performance on this task — we are unlikely ever to get anywhere near perfect precision with perfect recall. Why?\n",
    "\n",
    "- High precision will be hard to achieve because the KB is incomplete: some entity pairs that are related in the world — and in the corpus — may simply be missing from the KB.\n",
    "- High recall will be hard to achieve because the corpus is finite: some entity pairs that are related in the KB may not have any examples in the corpus.\n",
    "\n",
    "Because of these unavoidable obstacles, what matters is not so much absolute performance, but relative performance of different approaches.\n",
    "\n",
    "__Exercise:__ What's the optimal value for `top_k`, the number of most frequent middles to consider? What choice maximizes our chosen figure of merit, the macro-averaged F<sub>0.5</sub>-score?\n",
    "\n",
    "\\[ [top](#Relation-extraction-using-distant-supervision) \\]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Building a classifier\n",
    "\n",
    "OK, it's time to get (halfway) serious. Let's apply real machine learning to train a classifier on the training data, and see how it performs on the test data. We'll begin with one of the simplest machine learning setups: a bag-of-words feature representation, and a linear model trained using logistic regression.\n",
    "\n",
    "Just like we did in the unit on [supervised sentiment analysis](https://github.com/cgpotts/cs224u/blob/master/sst_02_hand_built_features.ipynb), we'll leverage the `sklearn` library, and we'll introduce functions for featurizing instances, training models, making predictions, and evaluating results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Featurizers\n",
    "\n",
    "Featurizers are functions which define the feature representation for our model. The primary input to a featurizer will be the `KBTriple` for which we are generating features. But since our features will be derived from corpus examples containing the entities of the `KBTriple`, we must also pass in a reference to a `Corpus`. And in order to make it easy to combine different featurizers, we'll also pass in a feature counter to hold the results.\n",
    "\n",
    "Here's an implementation for a very simple bag-of-words featurizer. It finds all the corpus examples containing the two entities in the `KBTriple`, breaks the phrase appearing between the two entity mentions into words, and counts the words. Note that it makes no distinction between \"forward\" and \"reverse\" examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "def simple_bag_of_words_featurizer(kbt, corpus, feature_counter):\n",
    "    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):\n",
    "        for word in ex.middle.split(' '):\n",
    "            feature_counter[word] += 1\n",
    "    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):\n",
    "        for word in ex.middle.split(' '):\n",
    "            feature_counter[word] += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can experiment with adding new kinds of features just by implementing additional featurizers, following `simple_bag_of_words_featurizer` as an example.\n",
    "\n",
    "Now, in order to apply machine learning algorithms such as those provided by `sklearn`, we need a way to convert datasets of `KBTriple`s into feature matrices. The function `featurize_datasets()` achieves that. It takes in a collection of `KBTriple`s grouped by relation, and returns a corresponding collection of feature matrices grouped by relation. It also needs a `Corpus` from which to extract features, and a list of featurizers to generate the features. Finally, it accepts a vectorizer as an optional argument. At training time, we won't supply a vectorizer, so this code will create a new one; at test time, we'll supply the vectorizer we created at training time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "def featurize_datasets(\n",
    "        kbts_by_rel,\n",
    "        corpus,\n",
    "        featurizers=[simple_bag_of_words_featurizer],\n",
    "        vectorizer=None):\n",
    "    # Create feature counters for all instances (kbts).\n",
    "    feat_counters_by_rel = defaultdict(list)\n",
    "    for rel, kbts in kbts_by_rel.items():\n",
    "        for kbt in kbts:\n",
    "            feature_counter = Counter()\n",
    "            for featurizer in featurizers:\n",
    "                featurizer(kbt, corpus, feature_counter)\n",
    "            feat_counters_by_rel[rel].append(feature_counter)\n",
    "    feat_matrices_by_rel = defaultdict(list)\n",
    "    # If we haven't been given a Vectorizer, create one and fit it to all the feature counters.\n",
    "    if vectorizer == None:\n",
    "        vectorizer = DictVectorizer(sparse=True)\n",
    "        def traverse_dicts():\n",
    "            for dict_list in feat_counters_by_rel.values():\n",
    "                for d in dict_list:\n",
    "                    yield d\n",
    "        vectorizer.fit(traverse_dicts())\n",
    "    # Now use the Vectorizer to transform feature dictionaries into feature matrices.\n",
    "    for rel, feat_counters in feat_counters_by_rel.items():\n",
    "        feat_matrices_by_rel[rel] = vectorizer.transform(feat_counters)\n",
    "    return feat_matrices_by_rel, vectorizer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Experiments\n",
    "\n",
    "Now we need some functions to train models, make predictions, and evaluate the results. We'll start with `train_models()`. This function takes as arguments a data split on which to train, a list of featurizers, and model factory, which is a function which initializes an `sklearn` classifier. It returns a dictionary holding the featurizers, the vectorizer that was used to generate the training matrix, and a dictionary holding the trained models, one per relation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "def train_models(\n",
    "        split='train',\n",
    "        featurizers=[simple_bag_of_words_featurizer],\n",
    "        model_factory=lambda: LogisticRegression(fit_intercept=True),\n",
    "        verbose=True):\n",
    "    if verbose: print('Building datasets')\n",
    "    train_o, train_y = build_datasets_for_split(split=split)\n",
    "    if verbose: print('Featurizing')\n",
    "    train_X, vectorizer = featurize_datasets(train_o, data[split]['corpus'], featurizers)\n",
    "    models = {}\n",
    "    if verbose: print('Training models')\n",
    "    for rel in all_relations:\n",
    "        models[rel] = model_factory()\n",
    "        models[rel].fit(train_X[rel], train_y[rel])\n",
    "    if verbose: print('Training complete\\n')\n",
    "    return {\n",
    "        'featurizers': featurizers,\n",
    "        'vectorizer': vectorizer,\n",
    "        'models': models,\n",
    "    }        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next comes `predict()`. This function takes as arguments a test split, a list of featurizers, the vectorizer that was used during training, and a dictionary holding the models, one per relation. It returns two parallel dictionaries: one holding the predictions (grouped by relation), the other holding the true labels (again, grouped by prediction)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "def predict(split, featurizers, vectorizer, models):\n",
    "    test_o, test_y = build_datasets_for_split(split=split)\n",
    "    test_X, _ = featurize_datasets(test_o, data[split]['corpus'], featurizers, vectorizer=vectorizer)\n",
    "    predictions = {}\n",
    "    for rel in all_relations:\n",
    "        predictions[rel] = models[rel].predict(test_X[rel])\n",
    "    return predictions, test_y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now `evaluate_predictions()`. This function takes as arguments the parallel dictionaries of predictions and true labels produced by `predict()`. It prints summary statistics for each relation, including precision, recall, and F<sub>0.5</sub>-score, and it returns the macro-averaged F<sub>0.5</sub>-score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "def evaluate_predictions(predictions, test_y, verbose=True):\n",
    "    results = {}  # one result row for each relation\n",
    "    if verbose:\n",
    "        print_statistics_header()\n",
    "    for rel in all_relations:\n",
    "        stats = precision_recall_fscore_support(test_y[rel], predictions[rel], beta=0.5)\n",
    "        stats = [stat[1] for stat in stats]  # stats[1] is the stat for label True\n",
    "        stats.append(len(test_y[rel]))\n",
    "        results[rel] = stats\n",
    "        if verbose:\n",
    "            print_statistics_row(rel, results[rel])\n",
    "    avg_result = macro_average_results(results)\n",
    "    if verbose:\n",
    "        print_statistics_footer(avg_result)\n",
    "    return avg_result[2]  # return f_0.5 score as summary statistic"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we introduce `experiment()`, which simply chains together `train_models()`, `predict()`, and `evaluate_predictions()`. For convenience, this function returns the output of `train_models()` as its result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "def experiment(\n",
    "        train_split='train',\n",
    "        test_split='dev',\n",
    "        featurizers=[simple_bag_of_words_featurizer],\n",
    "        model_factory=lambda: LogisticRegression(fit_intercept=True),\n",
    "        verbose=True):\n",
    "    train_result = train_models(train_split, featurizers, model_factory, verbose)\n",
    "    predictions, test_y = predict(test_split,\n",
    "                                  featurizers,\n",
    "                                  train_result['vectorizer'],\n",
    "                                  train_result['models'])\n",
    "    evaluate_predictions(predictions, test_y, verbose)\n",
    "    return train_result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Running `experiment()` in its default configuration will give us a baseline result for machine-learned models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Building datasets\n",
      "Featurizing\n",
      "Training models\n",
      "Training complete\n",
      "\n",
      "relation              precision     recall    f-score    support       size\n",
      "------------------    ---------  ---------  ---------  ---------  ---------\n",
      "adjoins                   0.874      0.459      0.740        303       5319\n",
      "author                    0.830      0.558      0.756        480       5496\n",
      "capital                   0.677      0.236      0.493         89       5105\n",
      "contains                  0.752      0.615      0.720       2667       7683\n",
      "film_performance          0.820      0.580      0.757        822       5838\n",
      "founders                  0.837      0.387      0.679        359       5375\n",
      "genre                     0.619      0.235      0.467        166       5182\n",
      "has_sibling               0.870      0.234      0.563        513       5529\n",
      "has_spouse                0.900      0.362      0.694        575       5591\n",
      "is_a                      0.710      0.223      0.494        494       5510\n",
      "nationality               0.584      0.145      0.363        311       5327\n",
      "parents                   0.877      0.569      0.791        325       5341\n",
      "place_of_birth            0.759      0.203      0.490        217       5233\n",
      "place_of_death            0.621      0.124      0.345        145       5161\n",
      "profession                0.639      0.159      0.399        245       5261\n",
      "worked_at                 0.740      0.250      0.532        228       5244\n",
      "------------------    ---------  ---------  ---------  ---------  ---------\n",
      "macro-average             0.757      0.334      0.580       7939      88195\n"
     ]
    }
   ],
   "source": [
    "_ = experiment()\n",
    "# _ = experiment(train_split='tiny', test_split='tiny') # better for rapid development"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Considering how vanilla our model is, these results are quite surprisingly good! We see huge gains for every relation over our `top_k_middles_classifier`. This strong performance is a powerful testament to the effectiveness of even the simplest forms of machine learning.\n",
    "\n",
    "But there is still much more we can do. To make further gains, we must not treat the model as a black box. We must open it up and get visibility into what it has learned, and more importantly, where it still falls down."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Examining the trained models\n",
    "\n",
    "One important way to gain understanding of our trained model is to inspect the model weights. What features are strong positive indicators for each relation, and what features are strong negative indicators?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "def examine_model_weights(\n",
    "        train_split='train',\n",
    "        featurizers=[simple_bag_of_words_featurizer],\n",
    "        model_factory=lambda: LogisticRegression(fit_intercept=True),\n",
    "        k=3,\n",
    "        verbose=True):\n",
    "    train_result = train_models(train_split, featurizers, model_factory, verbose)\n",
    "    feature_names = train_result['vectorizer'].get_feature_names()\n",
    "    for rel, model in train_result['models'].items():\n",
    "        print('Highest and lowest feature weights for relation {}:\\n'.format(rel))\n",
    "        sorted_weights = sorted([(wgt, idx) for idx, wgt in enumerate(model.coef_[0])], reverse=True)\n",
    "        for wgt, idx in sorted_weights[:k]:\n",
    "            print('{:10.3f} {}'.format(wgt, feature_names[idx]))\n",
    "        print('{:>10s} {}'.format('.....', '.....'))\n",
    "        for wgt, idx in sorted_weights[-k:]:\n",
    "            print('{:10.3f} {}'.format(wgt, feature_names[idx]))\n",
    "        print()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Building datasets\n",
      "Featurizing\n",
      "Training models\n",
      "Training complete\n",
      "\n",
      "Highest and lowest feature weights for relation adjoins:\n",
      "\n",
      "     2.523 Córdoba\n",
      "     2.255 Taluks\n",
      "     2.001 nearby\n",
      "     ..... .....\n",
      "    -1.193 an\n",
      "    -1.338 Egypt\n",
      "    -1.484 Caribbean\n",
      "\n",
      "Highest and lowest feature weights for relation author:\n",
      "\n",
      "     3.329 author\n",
      "     2.548 poem\n",
      "     2.463 wrote\n",
      "     ..... .....\n",
      "    -2.312 directed\n",
      "    -2.563 controversial\n",
      "    -4.219 1945\n",
      "\n",
      "Highest and lowest feature weights for relation capital:\n",
      "\n",
      "     3.833 capital\n",
      "     1.749 headquarters\n",
      "     1.735 towns\n",
      "     ..... .....\n",
      "    -1.635 during\n",
      "    -1.751 includes\n",
      "    -1.963 Westminster\n",
      "\n",
      "Highest and lowest feature weights for relation contains:\n",
      "\n",
      "     2.517 third-largest\n",
      "     2.434 Channel\n",
      "     2.308 districts\n",
      "     ..... .....\n",
      "    -2.316 rise\n",
      "    -3.798 Ceylon\n",
      "    -4.005 occupation\n",
      "\n",
      "Highest and lowest feature weights for relation film_performance:\n",
      "\n",
      "     4.141 starring\n",
      "     3.631 alongside\n",
      "     3.419 opposite\n",
      "     ..... .....\n",
      "    -1.874 Khakee\n",
      "    -1.994 Westminster\n",
      "    -3.850 Mohabbatein\n",
      "\n",
      "Highest and lowest feature weights for relation founders:\n",
      "\n",
      "     3.969 founder\n",
      "     3.888 founded\n",
      "     2.649 company\n",
      "     ..... .....\n",
      "    -1.682 William\n",
      "    -1.893 writing\n",
      "    -1.998 Griffith\n",
      "\n",
      "Highest and lowest feature weights for relation genre:\n",
      "\n",
      "     3.504 \n",
      "     3.184 series\n",
      "     2.783 album\n",
      "     ..... .....\n",
      "    -1.387 and\n",
      "    -1.465 Playhouse\n",
      "    -1.801 at\n",
      "\n",
      "Highest and lowest feature weights for relation has_sibling:\n",
      "\n",
      "     5.262 brother\n",
      "     4.058 sister\n",
      "     3.076 Marlon\n",
      "     ..... .....\n",
      "    -1.463 starring\n",
      "    -1.959 formed\n",
      "    -2.062 Her\n",
      "\n",
      "Highest and lowest feature weights for relation has_spouse:\n",
      "\n",
      "     5.185 wife\n",
      "     4.286 husband\n",
      "     4.216 married\n",
      "     ..... .....\n",
      "    -1.634 engineer\n",
      "    -2.341 Straus\n",
      "    -2.341 Isidor\n",
      "\n",
      "Highest and lowest feature weights for relation is_a:\n",
      "\n",
      "     4.043 stage\n",
      "     3.838 \n",
      "     2.800 theatre\n",
      "     ..... .....\n",
      "    -1.485 Texas\n",
      "    -1.557 at\n",
      "    -5.974 characin\n",
      "\n",
      "Highest and lowest feature weights for relation nationality:\n",
      "\n",
      "     2.586 born\n",
      "     2.035 -born\n",
      "     1.933 President\n",
      "     ..... .....\n",
      "    -1.584 or\n",
      "    -1.790 foreign\n",
      "    -1.900 state\n",
      "\n",
      "Highest and lowest feature weights for relation parents:\n",
      "\n",
      "     5.083 daughter\n",
      "     4.963 son\n",
      "     4.377 father\n",
      "     ..... .....\n",
      "    -1.374 need\n",
      "    -1.477 no\n",
      "    -2.788 Indian\n",
      "\n",
      "Highest and lowest feature weights for relation place_of_birth:\n",
      "\n",
      "     3.774 born\n",
      "     2.971 mayor\n",
      "     2.432 -born\n",
      "     ..... .....\n",
      "    -1.371 and\n",
      "    -1.447 or\n",
      "    -1.762 Indian\n",
      "\n",
      "Highest and lowest feature weights for relation place_of_death:\n",
      "\n",
      "     2.745 died\n",
      "     2.228 assassinated\n",
      "     1.930 Germany\n",
      "     ..... .....\n",
      "    -1.180 Belgium\n",
      "    -1.433 state\n",
      "    -1.953 Westminster\n",
      "\n",
      "Highest and lowest feature weights for relation profession:\n",
      "\n",
      "     4.132 \n",
      "     2.762 American\n",
      "     2.272 philosopher\n",
      "     ..... .....\n",
      "    -1.423 Texas\n",
      "    -1.425 on\n",
      "    -1.439 from\n",
      "\n",
      "Highest and lowest feature weights for relation worked_at:\n",
      "\n",
      "     3.425 professor\n",
      "     2.792 president\n",
      "     2.773 CEO\n",
      "     ..... .....\n",
      "    -1.507 confluence\n",
      "    -1.613 state\n",
      "    -1.718 or\n",
      "\n"
     ]
    }
   ],
   "source": [
    "examine_model_weights()\n",
    "# examine_model_weights(train_split='tiny') # better for rapid development"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By and large, the high-weight features for each relation are pretty intuitive — they are words that are used to express the relation in question. (The counter-intuitive results merit a bit of investigation!)\n",
    "\n",
    "The low-weight features (that is, features with large negative weights) may be a bit harder to understand. In some cases, however, they can be interpreted as features which indicate some _other_ relation which is anti-correlated with the target relation. (As an example, \"directed\" is a negative indicator for the `author` relation.)\n",
    "\n",
    "__Exercise:__ Investigate one of the counter-intuitive high-weight features. Find the training examples which caused the feature to be included. Given the training data, does it make sense that this feature is a good predictor for the target relation?\n",
    "\n",
    "<!--\n",
    "- SPOILER: Using `penalty='l1'` results in somewhat less intuitive feature weights, and about the same performance.\n",
    "- SPOILER: Using `penalty='l1', C=0.1` results in much more intuitive feature weights, but much worse performance.\n",
    "-->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Discovering new relation instances\n",
    "\n",
    "Another way to gain insight into our trained models is to use them to discover new relation instances that don't currently appear in the KB. In fact, this is the whole point of building a relation extraction system: to extend an existing KB (or build a new one) using knowledge extracted from natural language text at scale. Can the models we've trained do this effectively?\n",
    "\n",
    "Because the goal is to discover new relation instances which are _true_ but _absent from the KB_, we can't evalute this capability automatically. But we can generate candidate KB triples and manually evaluate them for correctness.\n",
    "\n",
    "To do this, we'll start from corpus examples containing pairs of entities which do not belong to any relation in the KB (earlier, we described these as \"negative examples\"). We'll then apply our trained models to each pair of entities, and sort the results by probability assigned by the model, in order to find the most likely new instances for each relation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": [
    "def find_new_relation_instances(\n",
    "        train_split='train',\n",
    "        test_split='dev',\n",
    "        featurizers=[simple_bag_of_words_featurizer],\n",
    "        model_factory=lambda: LogisticRegression(fit_intercept=True),\n",
    "        k=10,\n",
    "        verbose=True):\n",
    "\n",
    "    # train models\n",
    "    train_result = train_models(train_split, featurizers, model_factory, verbose)\n",
    "\n",
    "    # build datasets for negative instances only\n",
    "    neg_o, neg_y = build_datasets_for_split(test_split, include_positive=False, sampling_rate=1.0)\n",
    "    neg_X, _ = featurize_datasets(neg_o,\n",
    "                                  data[test_split]['corpus'],\n",
    "                                  featurizers,\n",
    "                                  train_result['vectorizer'])\n",
    "\n",
    "    # report highest confidence predictions\n",
    "    for rel, model in train_result['models'].items():\n",
    "        print('Highest probability examples for relation {}:\\n'.format(rel))\n",
    "        probs = model.predict_proba(neg_X[rel])\n",
    "        probs = [prob[1] for prob in probs] # probability for class True\n",
    "        sorted_probs = sorted([(p, idx) for idx, p in enumerate(probs)], reverse=True)\n",
    "        for p, idx in sorted_probs[:k]:\n",
    "            print('{:10.3f} {}'.format(p, neg_o[rel][idx]))\n",
    "        print()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Building datasets\n",
      "Featurizing\n",
      "Training models\n",
      "Training complete\n",
      "\n",
      "Highest probability examples for relation adjoins:\n",
      "\n",
      "     1.000 KBTriple(rel='adjoins', sbj='Sun', obj='Moon')\n",
      "     1.000 KBTriple(rel='adjoins', sbj='Moon', obj='Sun')\n",
      "     1.000 KBTriple(rel='adjoins', sbj='India', obj='Maharashtra')\n",
      "     1.000 KBTriple(rel='adjoins', sbj='Maharashtra', obj='India')\n",
      "     1.000 KBTriple(rel='adjoins', sbj='Europe', obj='Great_Britain')\n",
      "     1.000 KBTriple(rel='adjoins', sbj='Great_Britain', obj='Europe')\n",
      "     1.000 KBTriple(rel='adjoins', sbj='Isle_of_Wight', obj='Ryde')\n",
      "     1.000 KBTriple(rel='adjoins', sbj='Ryde', obj='Isle_of_Wight')\n",
      "     1.000 KBTriple(rel='adjoins', sbj='Uttar_Pradesh', obj='India')\n",
      "     1.000 KBTriple(rel='adjoins', sbj='India', obj='Uttar_Pradesh')\n",
      "\n",
      "Highest probability examples for relation author:\n",
      "\n",
      "     1.000 KBTriple(rel='author', sbj='Systema_Naturae', obj='Carl_Linnaeus')\n",
      "     1.000 KBTriple(rel='author', sbj='The_Doors_of_Perception', obj='Aldous_Huxley')\n",
      "     1.000 KBTriple(rel='author', sbj='Aldous_Huxley', obj='The_Doors_of_Perception')\n",
      "     1.000 KBTriple(rel='author', sbj='Carl_Linnaeus', obj='Systema_Naturae')\n",
      "     1.000 KBTriple(rel='author', sbj='Charlie_and_the_Chocolate_Factory', obj='Roald_Dahl')\n",
      "     1.000 KBTriple(rel='author', sbj='Roald_Dahl', obj='Charlie_and_the_Chocolate_Factory')\n",
      "     1.000 KBTriple(rel='author', sbj='Stephen_Hawking', obj='A_Brief_History_of_Time')\n",
      "     1.000 KBTriple(rel='author', sbj='A_Brief_History_of_Time', obj='Stephen_Hawking')\n",
      "     1.000 KBTriple(rel='author', sbj='Neil_Gaiman', obj='American_Gods')\n",
      "     1.000 KBTriple(rel='author', sbj='American_Gods', obj='Neil_Gaiman')\n",
      "\n",
      "Highest probability examples for relation capital:\n",
      "\n",
      "     1.000 KBTriple(rel='capital', sbj='Italy', obj='Rome')\n",
      "     1.000 KBTriple(rel='capital', sbj='Rome', obj='Italy')\n",
      "     1.000 KBTriple(rel='capital', sbj='Isle_of_Wight', obj='Ryde')\n",
      "     1.000 KBTriple(rel='capital', sbj='Ryde', obj='Isle_of_Wight')\n",
      "     1.000 KBTriple(rel='capital', sbj='India', obj='Maharashtra')\n",
      "     1.000 KBTriple(rel='capital', sbj='Maharashtra', obj='India')\n",
      "     1.000 KBTriple(rel='capital', sbj='Chernobyl_Nuclear_Power_Plant', obj='Ukraine')\n",
      "     1.000 KBTriple(rel='capital', sbj='Ukraine', obj='Chernobyl_Nuclear_Power_Plant')\n",
      "     1.000 KBTriple(rel='capital', sbj='Blarney', obj='Republic_of_Ireland')\n",
      "     1.000 KBTriple(rel='capital', sbj='Republic_of_Ireland', obj='Blarney')\n",
      "\n",
      "Highest probability examples for relation contains:\n",
      "\n",
      "     1.000 KBTriple(rel='contains', sbj='Italy', obj='Rome')\n",
      "     1.000 KBTriple(rel='contains', sbj='Uttar_Pradesh', obj='India')\n",
      "     1.000 KBTriple(rel='contains', sbj='Isle_of_Wight', obj='Ryde')\n",
      "     1.000 KBTriple(rel='contains', sbj='India', obj='Maharashtra')\n",
      "     1.000 KBTriple(rel='contains', sbj='Roman_Empire', obj='Rome')\n",
      "     1.000 KBTriple(rel='contains', sbj='Rome', obj='Italy')\n",
      "     1.000 KBTriple(rel='contains', sbj='India', obj='Uttar_Pradesh')\n",
      "     1.000 KBTriple(rel='contains', sbj='Rome', obj='Roman_Empire')\n",
      "     1.000 KBTriple(rel='contains', sbj='Ryde', obj='Isle_of_Wight')\n",
      "     1.000 KBTriple(rel='contains', sbj='Maharashtra', obj='India')\n",
      "\n",
      "Highest probability examples for relation film_performance:\n",
      "\n",
      "     1.000 KBTriple(rel='film_performance', sbj='Hong_Kong', obj='Shanghai_Noon')\n",
      "     1.000 KBTriple(rel='film_performance', sbj='Shanghai_Noon', obj='Hong_Kong')\n",
      "     1.000 KBTriple(rel='film_performance', sbj='Francis_Ford_Coppola', obj='Robin_Williams')\n",
      "     1.000 KBTriple(rel='film_performance', sbj='Robin_Williams', obj='Francis_Ford_Coppola')\n",
      "     1.000 KBTriple(rel='film_performance', sbj='The_Pink_Panther_2', obj='Harald_Zwart')\n",
      "     1.000 KBTriple(rel='film_performance', sbj='Harald_Zwart', obj='The_Pink_Panther_2')\n",
      "     1.000 KBTriple(rel='film_performance', sbj='Salman_Khan', obj='Tere_Naam')\n",
      "     1.000 KBTriple(rel='film_performance', sbj='Tere_Naam', obj='Salman_Khan')\n",
      "     1.000 KBTriple(rel='film_performance', sbj='Gia', obj='Angelina_Jolie')\n",
      "     1.000 KBTriple(rel='film_performance', sbj='Angelina_Jolie', obj='Gia')\n",
      "\n",
      "Highest probability examples for relation founders:\n",
      "\n",
      "     1.000 KBTriple(rel='founders', sbj='L._Ron_Hubbard', obj='Church_of_Scientology')\n",
      "     1.000 KBTriple(rel='founders', sbj='Church_of_Scientology', obj='L._Ron_Hubbard')\n",
      "     1.000 KBTriple(rel='founders', sbj='Insect', obj='Lepidoptera')\n",
      "     1.000 KBTriple(rel='founders', sbj='Lepidoptera', obj='Insect')\n",
      "     1.000 KBTriple(rel='founders', sbj='Illuminati', obj='Adam_Weishaupt')\n",
      "     1.000 KBTriple(rel='founders', sbj='Adam_Weishaupt', obj='Illuminati')\n",
      "     1.000 KBTriple(rel='founders', sbj='Austria', obj='Gaston_Glock')\n",
      "     1.000 KBTriple(rel='founders', sbj='Gaston_Glock', obj='Austria')\n",
      "     1.000 KBTriple(rel='founders', sbj='Sri_Lanka', obj='Matale_District')\n",
      "     1.000 KBTriple(rel='founders', sbj='Matale_District', obj='Sri_Lanka')\n",
      "\n",
      "Highest probability examples for relation genre:\n",
      "\n",
      "     1.000 KBTriple(rel='genre', sbj='Cartoon_Cartoons', obj=\"Dexter's_Laboratory\")\n",
      "     1.000 KBTriple(rel='genre', sbj=\"Dexter's_Laboratory\", obj='Cartoon_Cartoons')\n",
      "     1.000 KBTriple(rel='genre', sbj='All_We_Know_Is_Falling', obj='Taylor_York')\n",
      "     1.000 KBTriple(rel='genre', sbj='Taylor_York', obj='All_We_Know_Is_Falling')\n",
      "     1.000 KBTriple(rel='genre', sbj='Lanner_falcon', obj='Falcon')\n",
      "     1.000 KBTriple(rel='genre', sbj='Falcon', obj='Lanner_falcon')\n",
      "     0.998 KBTriple(rel='genre', sbj='Meg_Griffin', obj='Family_Guy')\n",
      "     0.998 KBTriple(rel='genre', sbj='Family_Guy', obj='Meg_Griffin')\n",
      "     0.982 KBTriple(rel='genre', sbj='Tattoo_artist', obj='Tattoo_artist')\n",
      "     0.977 KBTriple(rel='genre', sbj='Tunch_Ilkin', obj='Tunch_Ilkin')\n",
      "\n",
      "Highest probability examples for relation has_sibling:\n",
      "\n",
      "     1.000 KBTriple(rel='has_sibling', sbj='Ishmael', obj='Abraham')\n",
      "     1.000 KBTriple(rel='has_sibling', sbj='Abraham', obj='Ishmael')\n",
      "     1.000 KBTriple(rel='has_sibling', sbj='Sergey_Brin', obj='Larry_Page')\n",
      "     1.000 KBTriple(rel='has_sibling', sbj='Larry_Page', obj='Sergey_Brin')\n",
      "     1.000 KBTriple(rel='has_sibling', sbj='Isaac', obj='Abraham')\n",
      "     1.000 KBTriple(rel='has_sibling', sbj='Abraham', obj='Isaac')\n",
      "     1.000 KBTriple(rel='has_sibling', sbj='Jamie_Lee_Curtis', obj='Janet_Leigh')\n",
      "     1.000 KBTriple(rel='has_sibling', sbj='Janet_Leigh', obj='Jamie_Lee_Curtis')\n",
      "     1.000 KBTriple(rel='has_sibling', sbj='Karl_Marx', obj='Friedrich_Engels')\n",
      "     1.000 KBTriple(rel='has_sibling', sbj='Friedrich_Engels', obj='Karl_Marx')\n",
      "\n",
      "Highest probability examples for relation has_spouse:\n",
      "\n",
      "     1.000 KBTriple(rel='has_spouse', sbj='Sergey_Brin', obj='Larry_Page')\n",
      "     1.000 KBTriple(rel='has_spouse', sbj='Larry_Page', obj='Sergey_Brin')\n",
      "     1.000 KBTriple(rel='has_spouse', sbj='Isidor_Straus', obj='Denver')\n",
      "     1.000 KBTriple(rel='has_spouse', sbj='Denver', obj='Isidor_Straus')\n",
      "     1.000 KBTriple(rel='has_spouse', sbj='Karl_Marx', obj='Friedrich_Engels')\n",
      "     1.000 KBTriple(rel='has_spouse', sbj='Friedrich_Engels', obj='Karl_Marx')\n",
      "     1.000 KBTriple(rel='has_spouse', sbj='Rajiv_Gandhi', obj='Indira_Gandhi')\n",
      "     1.000 KBTriple(rel='has_spouse', sbj='Indira_Gandhi', obj='Rajiv_Gandhi')\n",
      "     0.999 KBTriple(rel='has_spouse', sbj='Anne_Boleyn', obj='Elizabeth_I_of_England')\n",
      "     0.999 KBTriple(rel='has_spouse', sbj='Elizabeth_I_of_England', obj='Anne_Boleyn')\n",
      "\n",
      "Highest probability examples for relation is_a:\n",
      "\n",
      "     1.000 KBTriple(rel='is_a', sbj='Insect', obj='Lepidoptera')\n",
      "     1.000 KBTriple(rel='is_a', sbj='Lepidoptera', obj='Insect')\n",
      "     1.000 KBTriple(rel='is_a', sbj='Apidae', obj='Bee')\n",
      "     1.000 KBTriple(rel='is_a', sbj='Bee', obj='Apidae')\n",
      "     1.000 KBTriple(rel='is_a', sbj='Odonata', obj='Insect')\n",
      "     1.000 KBTriple(rel='is_a', sbj='Insect', obj='Odonata')\n",
      "     1.000 KBTriple(rel='is_a', sbj='Malvaceae', obj='Hibiscus')\n",
      "     1.000 KBTriple(rel='is_a', sbj='Hibiscus', obj='Malvaceae')\n",
      "     1.000 KBTriple(rel='is_a', sbj='Okra', obj='Malvaceae')\n",
      "     1.000 KBTriple(rel='is_a', sbj='Malvaceae', obj='Okra')\n",
      "\n",
      "Highest probability examples for relation nationality:\n",
      "\n",
      "     1.000 KBTriple(rel='nationality', sbj='Sri_Lanka', obj='Matale_District')\n",
      "     1.000 KBTriple(rel='nationality', sbj='Matale_District', obj='Sri_Lanka')\n",
      "     1.000 KBTriple(rel='nationality', sbj='North_Island', obj='New_Zealand')\n",
      "     1.000 KBTriple(rel='nationality', sbj='New_Zealand', obj='North_Island')\n",
      "     1.000 KBTriple(rel='nationality', sbj='Systema_Naturae', obj='Carl_Linnaeus')\n",
      "     1.000 KBTriple(rel='nationality', sbj='Carl_Linnaeus', obj='Systema_Naturae')\n",
      "     1.000 KBTriple(rel='nationality', sbj='Insect', obj='Lepidoptera')\n",
      "     1.000 KBTriple(rel='nationality', sbj='Lepidoptera', obj='Insect')\n",
      "     1.000 KBTriple(rel='nationality', sbj='California', obj='San_Francisco_Bay_Area')\n",
      "     1.000 KBTriple(rel='nationality', sbj='San_Francisco_Bay_Area', obj='California')\n",
      "\n",
      "Highest probability examples for relation parents:\n",
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     1.000 KBTriple(rel='parents', sbj='Ishmael', obj='Abraham')\n",
      "     1.000 KBTriple(rel='parents', sbj='Isaac', obj='Abraham')\n",
      "     1.000 KBTriple(rel='parents', sbj='Kim_Jong-il', obj='Kim_Jong-un')\n",
      "     1.000 KBTriple(rel='parents', sbj='Kim_Jong-un', obj='Kim_Jong-il')\n",
      "     1.000 KBTriple(rel='parents', sbj='Abraham', obj='Isaac')\n",
      "     1.000 KBTriple(rel='parents', sbj='Abraham', obj='Ishmael')\n",
      "     1.000 KBTriple(rel='parents', sbj='Anne_Boleyn', obj='Elizabeth_I_of_England')\n",
      "     1.000 KBTriple(rel='parents', sbj='Elizabeth_I_of_England', obj='Anne_Boleyn')\n",
      "     1.000 KBTriple(rel='parents', sbj='Louis_the_Pious', obj='Charlemagne')\n",
      "     1.000 KBTriple(rel='parents', sbj='Charlemagne', obj='Louis_the_Pious')\n",
      "\n",
      "Highest probability examples for relation place_of_birth:\n",
      "\n",
      "     1.000 KBTriple(rel='place_of_birth', sbj='Sri_Lanka', obj='Matale_District')\n",
      "     1.000 KBTriple(rel='place_of_birth', sbj='Matale_District', obj='Sri_Lanka')\n",
      "     1.000 KBTriple(rel='place_of_birth', sbj='North_Island', obj='New_Zealand')\n",
      "     1.000 KBTriple(rel='place_of_birth', sbj='New_Zealand', obj='North_Island')\n",
      "     1.000 KBTriple(rel='place_of_birth', sbj='Illinois', obj='United_States_Senate')\n",
      "     1.000 KBTriple(rel='place_of_birth', sbj='United_States_Senate', obj='Illinois')\n",
      "     0.998 KBTriple(rel='place_of_birth', sbj='California', obj='San_Francisco_Bay_Area')\n",
      "     0.998 KBTriple(rel='place_of_birth', sbj='San_Francisco_Bay_Area', obj='California')\n",
      "     0.988 KBTriple(rel='place_of_birth', sbj='Pangasinan', obj='Philippines')\n",
      "     0.988 KBTriple(rel='place_of_birth', sbj='Philippines', obj='Pangasinan')\n",
      "\n",
      "Highest probability examples for relation place_of_death:\n",
      "\n",
      "     1.000 KBTriple(rel='place_of_death', sbj='Systema_Naturae', obj='Carl_Linnaeus')\n",
      "     1.000 KBTriple(rel='place_of_death', sbj='Carl_Linnaeus', obj='Systema_Naturae')\n",
      "     1.000 KBTriple(rel='place_of_death', sbj='Ishmael', obj='Abraham')\n",
      "     1.000 KBTriple(rel='place_of_death', sbj='Abraham', obj='Ishmael')\n",
      "     1.000 KBTriple(rel='place_of_death', sbj='Sri_Lanka', obj='Matale_District')\n",
      "     1.000 KBTriple(rel='place_of_death', sbj='Matale_District', obj='Sri_Lanka')\n",
      "     1.000 KBTriple(rel='place_of_death', sbj='North_Island', obj='New_Zealand')\n",
      "     1.000 KBTriple(rel='place_of_death', sbj='New_Zealand', obj='North_Island')\n",
      "     0.999 KBTriple(rel='place_of_death', sbj='Chernobyl_Nuclear_Power_Plant', obj='Ukraine')\n",
      "     0.999 KBTriple(rel='place_of_death', sbj='Ukraine', obj='Chernobyl_Nuclear_Power_Plant')\n",
      "\n",
      "Highest probability examples for relation profession:\n",
      "\n",
      "     1.000 KBTriple(rel='profession', sbj='Eyeless_in_Gaza', obj='Aldous_Huxley')\n",
      "     1.000 KBTriple(rel='profession', sbj='Aldous_Huxley', obj='Eyeless_in_Gaza')\n",
      "     0.999 KBTriple(rel='profession', sbj='Hispania', obj='Spain')\n",
      "     0.999 KBTriple(rel='profession', sbj='Spain', obj='Hispania')\n",
      "     0.996 KBTriple(rel='profession', sbj='Screenwriter', obj='Actor')\n",
      "     0.996 KBTriple(rel='profession', sbj='Actor', obj='Screenwriter')\n",
      "     0.995 KBTriple(rel='profession', sbj='Tunch_Ilkin', obj='Tunch_Ilkin')\n",
      "     0.995 KBTriple(rel='profession', sbj='Blog_award', obj='Blog_award')\n",
      "     0.995 KBTriple(rel='profession', sbj='Physicist', obj='Nikola_Tesla')\n",
      "     0.995 KBTriple(rel='profession', sbj='Guitarist', obj='Robby_Krieger')\n",
      "\n",
      "Highest probability examples for relation worked_at:\n",
      "\n",
      "     1.000 KBTriple(rel='worked_at', sbj='Sri_Lanka', obj='Matale_District')\n",
      "     1.000 KBTriple(rel='worked_at', sbj='Matale_District', obj='Sri_Lanka')\n",
      "     1.000 KBTriple(rel='worked_at', sbj='North_Island', obj='New_Zealand')\n",
      "     1.000 KBTriple(rel='worked_at', sbj='New_Zealand', obj='North_Island')\n",
      "     1.000 KBTriple(rel='worked_at', sbj='Insect', obj='Lepidoptera')\n",
      "     1.000 KBTriple(rel='worked_at', sbj='Lepidoptera', obj='Insect')\n",
      "     1.000 KBTriple(rel='worked_at', sbj='Austria', obj='Gaston_Glock')\n",
      "     1.000 KBTriple(rel='worked_at', sbj='Gaston_Glock', obj='Austria')\n",
      "     1.000 KBTriple(rel='worked_at', sbj='L._Ron_Hubbard', obj='Church_of_Scientology')\n",
      "     1.000 KBTriple(rel='worked_at', sbj='Church_of_Scientology', obj='L._Ron_Hubbard')\n",
      "\n"
     ]
    }
   ],
   "source": [
    "find_new_relation_instances()\n",
    "# find_new_relation_instances(train_split='tiny', test_split='tiny') # for rapid development"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are actually some good discoveries here! The predictions for the `author` relation seem especially good. Of course, there are also plenty of bad results, and a few that are downright comical. We may hope that as we improve our models and optimize performance in our automatic evaluations, the results we observe in this manual evaluation improve as well.\n",
    "\n",
    "__Exercise:__ Note that every time we predict that a given relation holds between entities `X` and `Y`, we also predict, with equal confidence, that it holds between `Y` and `X`. Why? How could we fix this?\n",
    "\n",
    "\\[ [top](#Relation-extraction-using-distant-supervision) \\]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Next steps\n",
    "\n",
    "Our current model is quite rudimentary — it's merely a starting point for further exploration. This section lists a number of suggestions for next steps. Pursue whatever ideas look most promising! Your immediate goal is to optimize macro-averaged F<sub>0.5</sub>-score — but don't blinker yourself. Consider other ways of evaluating performance, and remember that the ultimate goal is to extract new relational triples.\n",
    "\n",
    "\n",
    "### Experimental methodology\n",
    "\n",
    "- Write code that facilitates _error analysis_ — that is, code that enables you to inspect, analyze, and categorize specific classification errors. Error analysis is the best avenue to understanding how to improve your model.\n",
    "- Our code for building datasets provides the ability to downsample negative examples, and we've used a default sampling rate of 10%. Is that the best choice? What are the consequences of sampling at a higher or lower rate?\n",
    "\n",
    "### Feature representation\n",
    "\n",
    "- Add a feature that indicates the length of the middle.\n",
    "- Augment the bag-of-words representation to include bigrams or trigrams (not just unigrams).\n",
    "- Introduce features based on the entity mentions themselves. <!-- \\[SPOILER: it helps a lot, maybe 4% in F-score. And combines nicely with the directional features.\\] -->\n",
    "- Experiment with features based on the context outside (rather than between) the two entity mentions — that is, the words before the first mention, or after the second.\n",
    "- Try adding features which capture syntactic information, such as the dependency-path features used by Mintz et al. 2009. The [NLTK](https://www.nltk.org/) toolkit contains a variety of [parsing algorithms](http://www.nltk.org/api/nltk.parse.html) that may help.\n",
    "- The bag-of-words representation does not permit generalization across word categories such as names of people, places, or companies. Can we do better using word embeddings such as [GloVe](https://nlp.stanford.edu/projects/glove/)?\n",
    "\n",
    "### Model selection\n",
    "\n",
    "\n",
    "\n",
    "- The `LogisticRegression` model in `sklearn` does L2 regularization by default. Try using L1 regularization instead. <!-- \\[SPOILER: doesn't make huge difference\\] -->\n",
    "- Whether you're using L1 or L2 regularization, you may get better results by tuning the regularization parameter. <!-- \\[SPOILER: `C=8.0`, i.e. less regularization, helps a little\\] -->\n",
    "- Experiment with different model types. `sklearn` makes this very easy: it provides implementations for everything from [elastic nets](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNetCV.html) to [SVMs](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) to [gradient boosting](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html).\n",
    "- Explore ways to predict the relations that hold between a given pair of entities jointly, instead of independently, in order to exploit the correlations between relations. This could be done using an ensemble or hierarchial model, or a neural architecture.\n",
    "- Over the last few years, neural sequence models such as LSTMs have yielded dramatic gains on many NLP tasks. Investigate whether they help here.\n",
    "\n",
    "\\[ [top](#Relation-extraction-using-distant-supervision) \\]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Homework 3\n",
    "\n",
    "The purpose of this homework is to begin exploring larger, more diverse feature space, to figure out which ones lead to improved classifiers.\n",
    "\n",
    "As a reminder, our baseline classifier (using `simple_bag_of_words_featurizer` looks like this) when run on the `dev` set with the standard `model_factory`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Building datasets\n",
      "Featurizing\n",
      "Training models\n",
      "Training complete\n",
      "\n",
      "relation              precision     recall    f-score    support       size\n",
      "------------------    ---------  ---------  ---------  ---------  ---------\n",
      "adjoins                   0.874      0.459      0.740        303       5319\n",
      "author                    0.830      0.558      0.756        480       5496\n",
      "capital                   0.677      0.236      0.493         89       5105\n",
      "contains                  0.752      0.615      0.720       2667       7683\n",
      "film_performance          0.820      0.580      0.757        822       5838\n",
      "founders                  0.837      0.387      0.679        359       5375\n",
      "genre                     0.619      0.235      0.467        166       5182\n",
      "has_sibling               0.870      0.234      0.563        513       5529\n",
      "has_spouse                0.900      0.362      0.694        575       5591\n",
      "is_a                      0.710      0.223      0.494        494       5510\n",
      "nationality               0.584      0.145      0.363        311       5327\n",
      "parents                   0.877      0.569      0.791        325       5341\n",
      "place_of_birth            0.759      0.203      0.490        217       5233\n",
      "place_of_death            0.621      0.124      0.345        145       5161\n",
      "profession                0.639      0.159      0.399        245       5261\n",
      "worked_at                 0.740      0.250      0.532        228       5244\n",
      "------------------    ---------  ---------  ---------  ---------  ---------\n",
      "macro-average             0.757      0.334      0.580       7939      88195\n"
     ]
    }
   ],
   "source": [
    "baseline = experiment(featurizers=[simple_bag_of_words_featurizer])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Questions 1–3: Directional unigram features [3 points]\n",
    "\n",
    "The current bag-of-words representation makes no distinction between \"forward\" and \"reverse\" examples. But, intuitively, there is big difference between _X and his son Y_ and _Y and his son X_. This question asks you to modify `simple_bag_of_words_featurizer` to capture these differences. \n",
    "\n",
    "__To submit:__\n",
    "\n",
    "1. A feature function `directional_bag_of_words_featurizer` that is just like `simple_bag_of_words_featurizer` except that it distinguishes \"forward\" and \"reverse\". To do this, you just need to mark each word feature for whether it is derived from a subject–object example or from an object–subject example. The precise nature of the mark you add for the two cases doesn't make a difference to the model.\n",
    "\n",
    "2. The macro-average F-score on the `dev` set that you obtain from running `experiment` with `directional_bag_of_words_featurizer` as the only featurizer. (Aside from this, use all the default values for `experiment`.)\n",
    "\n",
    "3. `experiment` returns some of the core objects used in the experiment. How many feature names does the `vectorizer` have for the experiment run in the previous step? (Note: we're partly asking you to figure out how to get this value by using the sklearn documentation, so please don't ask how to do it on Piazza!)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Questions 4–5: The part-of-speech tags of the \"middle\" words [3 points]\n",
    "\n",
    "Our corpus distribution contains part-of-speech (POS) tagged versions of the core text spans. Let's begin to explore whether there is information in these sequences, focusing on `middle_POS`.\n",
    "\n",
    "__To submit:__\n",
    "\n",
    "1. A feature function `middle_bigram_pos_tag_featurizer` that is just like `simple_bag_of_words_featurizer` except that it creates a feature for bigram POS sequences. For example, given \n",
    "\n",
    "  `The/DT dog/N napped/V`\n",
    "  \n",
    "   we obtain the list of bigram POS sequences\n",
    "  \n",
    "   `['<s> DT', 'DT N', 'N V', 'V </s>']`. \n",
    "   \n",
    "   Don't forget the start and end tags, to model those environments properly!\n",
    "\n",
    "2. The macro-average F-score on the `dev` set that you obtain from running `experiment` with `middle_bigram_pos_tag_featurizer` as the only featurizer. (Aside from this, use all the default values for `experiment`.)\n",
    "\n",
    "Note: To parse `middle_POS`, one splits on whitespace to get the `word/TAG` pairs. Each of these pairs `s` can be parsed with `s.rsplit('/', 1)`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Questions 6–7: Bag of Synsets [3 points]\n",
    "\n",
    "The following allows you to use NLTK's WordNet API to get the synsets compatible with _dog_ as used as a noun:\n",
    "\n",
    "```\n",
    "from nltk.corpus import wordnet as wn\n",
    "dog = wn.synsets('dog', pos='n')\n",
    "```\n",
    "\n",
    "This question asks you to create synset-based features from the word/tag pairs in `middle_POS`.\n",
    "\n",
    "To convert the tags in the corpus to WordNet tags:\n",
    "\n",
    "| Tag begins with | WordNet `pos` value |\n",
    "|-----------------|---------------------|\n",
    "| `N`             | `'n'`               |\n",
    "| `V`             | `'v'`               |\n",
    "| `J`             | `'a'`               |\n",
    "| `R`             | `'r'`               |\n",
    "| Otherwise       | `None`              |\n",
    "\n",
    "__To submit:__\n",
    "\n",
    "1. A feature function `synset_featurizer` that is just like `simple_bag_of_words_featurizer` except that it creates a count dictionary where the keys are synsets, as derived from the unigrams in the `middle_POS` field using `wn.synsets`. Stringify the synsets with `str` so that they can be `dict` keys. Use the table above to convert tags to `pos` arguments usable by `wn.synsets`.\n",
    "\n",
    "2. The macro-average F-score on the `dev` set that you obtain from running `experiment` with `synset_featurizer` as the only featurizer. (Aside from this, use all the default values for `experiment`.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Questions 8–9: Bringing them all together [2 points]\n",
    "\n",
    "Run `experiment` with all `directional_bag_of_words_featurizer`, `middle_bigram_pos_tag_featurizer`, and `synset_featurizer` together as the featurizers. Let's see if all this work paid off in terms of raw peformance!\n",
    "\n",
    "__To submit:__\n",
    "\n",
    "1. The macro-average F-score on the `dev` set that you obtain from running this experiment. (Aside from `featurizers`, use all the default values for `experiment`.)\n",
    "\n",
    "1. The number of feature names contained in the `vectorizer` for this experiment.\n",
    "\n",
    "Note: You'll see that this is a *very* large model. We want you to submit with the default value for `model_factory`, but it is worth trying variants that are more suitable for these inputs – `penalty=\"l1\"` for `LogisticRegression` is a good starting point, as it will give 0 weight to many uninformative features."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\\[ [top](#Relation-extraction-using-distant-supervision) \\]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bake-off\n",
    "\n",
    "The goal of the bake-off for this unit is very simple: to achieve the best macro-averaged F<sub>0.5</sub>-score on the `test` split. The [Next steps](#Next-steps) section suggested a number of possible strategies for improving the baseline model, and [Homework 3](#Homework-3) explores several more. But of course these are by no means exhaustive. You can surely come up with lots of additional ideas. The sky is the limit!\n",
    "\n",
    "There's only one strict rule here: __you must not evaluate on the `test` split until you ready to report your bake-off results__. All evaluations during development must be on the `dev` split. To do otherwise would be methodologically unsound and intellectually dishonest!\n",
    "\n",
    "Your bake-off submission should include:\n",
    "\n",
    "- Your macro-averaged F<sub>0.5</sub>-score on the `test` split.\n",
    "- A brief description of the strategies you employed to achieve this result.\n",
    "\n",
    "Submission URL: https://goo.gl/forms/ohqzpnHMwHIT7f642\n",
    "\n",
    "\\[ [top](#Relation-extraction-using-distant-supervision) \\]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}