In [1]:
%%capture
%load_ext autoreload
%autoreload 2
import sys
sys.path.append("../statnlpbook/")
#import util
import ie
import tfutil
import random
import numpy as np
import tensorflow as tf
np.random.seed(1337)
tf.set_random_seed(1337)

#util.execute_notebook('relation_extraction.ipynb')

<!---
Latex Macros
-->
$$
\newcommand{\Xs}{\mathcal{X}}
\newcommand{\Ys}{\mathcal{Y}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\balpha}{\boldsymbol{\alpha}}
\newcommand{\bbeta}{\boldsymbol{\beta}}
\newcommand{\aligns}{\mathbf{a}}
\newcommand{\align}{a}
\newcommand{\source}{\mathbf{s}}
\newcommand{\target}{\mathbf{t}}
\newcommand{\ssource}{s}
\newcommand{\starget}{t}
\newcommand{\repr}{\mathbf{f}}
\newcommand{\repry}{\mathbf{g}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\prob}{p}
\newcommand{\a}{\alpha}
\newcommand{\b}{\beta}
\newcommand{\vocab}{V}
\newcommand{\params}{\boldsymbol{\theta}}
\newcommand{\param}{\theta}
\DeclareMathOperator{\perplexity}{PP}
\DeclareMathOperator{\argmax}{argmax}
\DeclareMathOperator{\argmin}{argmin}
\newcommand{\train}{\mathcal{D}}
\newcommand{\counts}[2]{\#_{#1}(#2) }
\newcommand{\length}[1]{\text{length}(#1) }
\newcommand{\indi}{\mathbb{I}}
$$

# Relation Extraction

##  Motivation 

* The amount of available information is growing exponentially
* Text contains a lot of information
* Only some of this information is relevant for each use case
* How can we automatically make sense of information?

**Information Extraction** addresses this

[Alchemy information extraction demo](https://alchemy-language-demo.mybluemix.net/)

[ReVerb demo](http://openie.allenai.org/)

## Subtasks of Information Extraction

* **Document** Classification:
    * Assign a label to each document, often representing the topic
    * Often treated as a standard [classification task](chapters/doc_classify.ipynb)
* **Named Entity Recognition (NER)**:
    * Recognise boundaries of entities in text, e.g. "New York", "New York Times" 
* **Named Entity Classification (NEC)**:
    * Assign a type to each entity (e.g. "New York" -> LOC, "New York Times" -> ORG)

## Named Entity Recognition and Classification
   
* NER and NEC are often approached together (NERC)
* They are often treated as a [sequence labelling task](chapters/sequence_labelling.ipynb)
* Every token is assigned 
    * a training label (e.g. PER)
    * and sometimes a positional tag, e.g. beginning of sequence (B), inside sequence (I), outside sequence (O)
* At test time, consecutive tokens with the same label are merged into one sequence

## Named Entity Recognition and Classification: Example
   

| Sebastian | Riedel | is | a | reader | at | University | College | London | 
|-|-|-|-|-|-|-|-|-|
| B-PER | I-PER      | O | O | O         | O  | B-LOC | I-LOC | I-LOC         | 

  Sebastian Riedel: PER  
  University College London: LOC  
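
The merging step at test time can be sketched as follows. This is a minimal illustration, not the implementation used in this chapter: the function name `merge_bio_tags` is hypothetical, and a stray `I-` tag without a preceding `B-` is simply treated as the start of a new entity.

```python
def merge_bio_tags(tokens, tags):
    """Merge token-level BIO tags into (entity text, entity type) spans."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # close the running entity (if any) and open a new one
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            # continue the running entity
            current_tokens.append(token)
        else:
            # "O" tag: close the running entity (if any)
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans

tokens = "Sebastian Riedel is a reader at University College London".split()
tags = ["B-PER", "I-PER", "O", "O", "O", "O", "B-LOC", "I-LOC", "I-LOC"]
print(merge_bio_tags(tokens, tags))
```

Applied to the example sentence, this recovers the two entities listed above.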

## Subtasks of Information Extraction

* **Relation** Extraction:
    * Recognise relations between entities, e.g. "S. Riedel reader-at UCL"
    * classification task
* **Temporal** Information Extraction:
    * Recognise and/or normalise temporal expressions, e.g. "tomorrow morning at 8" -> "2016-11-26 08:00:00"
    * sequence labelling task

## Subtasks of Information Extraction

* **Event** Extraction:
    * Recognise events, typically consisting of entities and relations between them at a point in time and place, e.g. an election
    * classification task

### Relation Extraction

Task of extracting **semantic relations between arguments**
* Arguments are entities
    * general concepts such as "a company" (ORG), "a person" (PER), "a location" (LOC)
    * instances of such concepts (e.g. "Microsoft", "Bill Gates"), which are called proper names or named entities (NEs)
* Relation extraction builds on the task of named entity recognition

Relation extraction is relevant for many high-level NLP tasks, such as
* for question answering, where users ask questions such as "Who founded Microsoft?",
* for information retrieval, which often relies on large collections of structured information as background data, and
* for text and data mining, where larger patterns in relations between concepts are discovered, e.g. temporal patterns about startups

## Relation Extraction as Structured Prediction
We can formalise relation extraction as an instance of [structured prediction](chapters/structured_prediction.ipynb)
* The input space $\mathcal{X}$ consists of pairs of arguments $\mathcal{E}$ and the supporting texts $\mathcal{S}$ in which those arguments appear
* The output space $\mathcal{Y}$ is a set of relation labels such as $\Ys=\{ \text{founder-of},\text{employee-at},\text{professor-at},\text{NONE}\}$. 

## Relation Extraction as Structured Prediction
* The goal is to define a model \\(s_{\params}(\x,y)\\) that assigns high *scores* to the label $y$ that fits the arguments and supporting text $\x$, and lower scores otherwise. 
* The model is parametrised by \\(\params\\), and we learn these parameters from a training set of $(\x,y)$ pairs
* To classify an input instance $\x$, again consisting of a pair of arguments and a supporting text, we have to solve the maximisation problem $\argmax_y s_{\params}(\x,y)$.
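
The maximisation over labels can be made concrete with a small sketch. The linear form of the score, the weights in `theta` and the function names are assumptions chosen for illustration only; the models later in this chapter define their scores differently.

```python
def score(theta, x, y):
    """Hypothetical linear score: sum of weights theta[(word, y)] over the supporting text."""
    entity_pair, text = x
    return sum(theta.get((tok, y), 0.0) for tok in text.lower().split())

def predict(theta, x, labels):
    """Solve argmax_y s_theta(x, y) by enumerating the (small) label set."""
    # max() returns the first maximal element, so "NONE" wins ties when listed first
    return max(labels, key=lambda y: score(theta, x, y))

# hand-set toy weights (in practice these would be learned from training data)
theta = {
    ("founded", "founder-of"): 2.0,
    ("works", "employee-at"): 2.0,
    ("professor", "professor-at"): 2.0,
}
labels = ["NONE", "founder-of", "employee-at", "professor-at"]
x = (("Bill Gates", "Microsoft"), "Bill Gates founded Microsoft")
print(predict(theta, x, labels))
```

Because the label set is small, exhaustive enumeration is enough here; no specialised search is needed.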

## Relation Extraction Approaches
* **Pattern-Based** Relation Extraction:
    * Extract relations via manually defined textual pattern matching
* **Bootstrapping**:
    * Learn to extract relations via manually defined textual patterns, and use those to find more patterns and so forth, iteratively

## Relation Extraction Approaches
* **Supervised** Relation Extraction:
    * Train a supervised model, from manually labelled training examples, to extract relations
* **Distantly Supervised** Relation Extraction:
    * Automatically annotate training data for supervised relation extraction, based on entries in a knowledge base

## Relation Extraction Approaches
* **Universal Schema** Relation Extraction:
    * Model relation types and their surface forms in the same space; a possible method for combining pattern-based, supervised and distantly supervised relation extraction


## Relation Extraction Example
* Extracting "method used for task" relations from sentences in computer science publications
* The first step would normally be to detect named entities, i.e. to determine those pairs of arguments $\mathcal{E}$. For simplicity, our training data already contains those annotations.


## Pattern-Based Extraction
* The simplest relation extraction model defines a set of textual patterns for each relation and then assigns labels to entity pairs whose sentences match that pattern. 
* The training data consists of entity pairs $\mathcal{E}$, patterns $A$ and labels $Y$.

In [2]:
training_patterns, training_entpairs = ie.readLabelledPatternData()
print("Training patterns and entity pairs for relation 'method used for task'")
[(tr_a, tr_e) for (tr_a, tr_e) in zip(training_patterns[:3], training_entpairs[:3])]

Training patterns and entity pairs for relation 'method used for task'


[('demonstrates XXXXX and clustering techniques for XXXXX',
  ['text mining', 'building domain ontology']),
 ('demonstrates text mining and XXXXX for building XXXXX',
  ['clustering techniques', 'domain ontology']),
 ('the XXXXX is able to enhance the XXXXX',
  ['ensemble classifier', 'detection of construction materials'])]

* The patterns are sentences in which the entity pairs were blanked with the placeholder 'XXXXX'
* For the training data, we also have labels
* There is only one positive label, 'method used for task'
* For the testing data, we do not know the relations for the instances

In [3]:
testing_patterns, testing_entpairs = ie.readPatternData()
print("Testing patterns and entity pairs")
[(tr_a, tr_e) for (tr_a, tr_e) in zip(testing_patterns[0:3], testing_entpairs[:3])]

Testing patterns and entity pairs


[('a method for estimation of XXXXX of XXXXX is presented',
  ['effective properties', 'porous materials']),
 ('accounting for XXXXX is essential for estimation of XXXXX',
  ['nonlinear effects', 'effective properties']),
 ('develops the heterogeneous XXXXX for fiber-reinforced XXXXX',
  ['feature model', 'object modeling'])]

* We build a scoring model to determine which of the testing instances are examples for the relation 'method used for task' and which ones are not
    * A pattern scoring model \\(s_{\params}(\x,y)\\) only has one parameter
    * It assigns scores to each relation label \\(y\\) proportional to the matches with the set of textual patterns
    * The final label assigned to each instance is then the one with the highest score.

* Here, our pattern scoring model is even simpler since we only have patterns for one relation
    * The final label assigned to each instance is 'method used for task' if there is a match with a pattern, and 'NONE' if there is no match.

Let's have a closer look at how pattern matching works:
* Recall that the original patterns in the training data are sentences where the entity pairs are blanked with 'XXXXX'
* We could use those patterns to find new sentences
* However, we are not likely to find many since the patterns are very specific to the example
* We need to generalise those patterns to less specific ones
* A simple way is to define the sequence of words between each entity pair as a pattern, like so:

In [4]:
def sentenceToShortPath(sent):
    """
    Returns the path between two arguments in a sentence, where the arguments have been masked
    Args:
        sent: the sentence
    Returns:
        the path between two arguments
    """
    sent_toks = sent.split(" ")
    indeces = [i for i, ltr in enumerate(sent_toks) if ltr == "XXXXX"]
    pattern = " ".join(sent_toks[indeces[0]+1:indeces[1]])
    return pattern

print(training_patterns[0])
sentenceToShortPath(training_patterns[0])

demonstrates XXXXX and clustering techniques for XXXXX


'and clustering techniques for'

* There are many alternatives to this method of shortening patterns.
* **Thought exercise**: 
    * what is a possible problem with this way of shortening patterns and what are better ways of generalising patterns?

* Once the sentence shortening / pattern generalisation is defined, we can apply those patterns to testing instances to classify them into 'method used for task' and 'NONE'
* In the example here, we return the instances which contain a 'method used for task' pattern

In [5]:
def patternExtraction(training_sentences, testing_sentences):
    """
    Given a set of patterns for a relation, searches for those patterns in other sentences
    Args:
        training_sentences: training sentences with arguments masked
        testing_sentences: testing sentences with arguments masked
    Returns:
        the testing sentences which the training patterns appeared in
    """
    # convert training and testing sentences to short paths to obtain patterns
    training_patterns = set([sentenceToShortPath(test_sent) for test_sent in training_sentences])
    testing_patterns = [sentenceToShortPath(test_sent) for test_sent in testing_sentences]
    # look for training patterns in testing patterns
    testing_extractions = []
    for i, testing_pattern in enumerate(testing_patterns):
        if testing_pattern in training_patterns: # look for exact matches of patterns
            testing_extractions.append(testing_sentences[i])
    return testing_extractions

patternExtraction(training_patterns[:300], testing_patterns[:300])

['paper reviews applications of XXXXX in XXXXX',
 'a novel approach was developed to determine the XXXXX in XXXXX',
 'four different types of insoles were examined in terms of their effects on XXXXX in XXXXX',
 'the findings can aid in better understanding the insole design features that could improve XXXXX in XXXXX',
 'this new approach provides more degrees of freedom and XXXXX in XXXXX']
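
The exact-match rule can be extended to several relations by scoring each label and taking the best one. The sketch below is an illustration under assumptions: the helper names and the hand-written toy patterns are hypothetical, and the score is simply 1 for a match and 0 otherwise.

```python
def short_path(sent, mask="XXXXX"):
    """Words between the first two masked arguments of a sentence."""
    toks = sent.split(" ")
    idx = [i for i, tok in enumerate(toks) if tok == mask]
    return " ".join(toks[idx[0] + 1:idx[1]])

def pattern_score(sent, label, patterns_by_label):
    """Score is 1 if the sentence's short path is a known pattern of `label`, else 0."""
    return int(short_path(sent) in patterns_by_label.get(label, set()))

def pattern_predict(sent, patterns_by_label, default="NONE"):
    """Pick the highest-scoring label; fall back to `default` if no pattern matches."""
    best = max(patterns_by_label,
               key=lambda y: pattern_score(sent, y, patterns_by_label))
    return best if pattern_score(sent, best, patterns_by_label) > 0 else default

# toy pattern set for a single relation (hand-written for this example)
patterns_by_label = {"method used for task": {"is proposed to solve the", "for"}}
print(pattern_predict("a strongly XXXXX is proposed to solve the XXXXX", patterns_by_label))
print(pattern_predict("the XXXXX has a specific XXXXX", patterns_by_label))
```

With only one relation in `patterns_by_label`, this reduces to the match / no-match decision described earlier.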

* One of the shortcomings of this pattern-based approach is that the set of patterns has to be defined manually
* Also, the model does not learn new patterns
* We will next look at an approach which addresses those two shortcomings


## Bootstrapping

* Bootstrapping relation extraction models take the same input as pattern-based approaches
    * a set of entity pairs and patterns
* Overall idea: extract more patterns and entity pairs iteratively
* For this, we need two helper methods: 
    * one that generalises from patterns to extract more patterns and entity pairs
    * another one that generalises from entity pairs to extract more patterns and entity pairs


In [6]:
# generalises from patterns to extract more patterns and entity pairs
def searchForPatternsAndEntpairsByPatterns(training_patterns, testing_patterns, testing_entpairs, testing_sentences):
    testing_extractions = []
    appearing_testing_patterns = []
    appearing_testing_entpairs = []
    for i, testing_pattern in enumerate(testing_patterns):
        if testing_pattern in training_patterns: # if there is an exact match of a pattern
            testing_extractions.append(testing_sentences[i])
            appearing_testing_patterns.append(testing_pattern)
            appearing_testing_entpairs.append(testing_entpairs[i])
    return testing_extractions, appearing_testing_patterns, appearing_testing_entpairs

In [7]:
# generalises from entity pairs to extract more patterns and entity pairs
def searchForPatternsAndEntpairsByEntpairs(training_entpairs, testing_patterns, testing_entpairs, testing_sentences):
    testing_extractions = []
    appearing_testing_patterns = []
    appearing_testing_entpairs = []
    for i, testing_entpair in enumerate(testing_entpairs):
        if testing_entpair in training_entpairs: # if there is an exact match of an entity pair
            testing_extractions.append(testing_sentences[i])
            appearing_testing_entpairs.append(testing_entpair)
            appearing_testing_patterns.append(testing_patterns[i])
    return testing_extractions, appearing_testing_patterns, appearing_testing_entpairs

The two helper functions are then applied iteratively:

In [8]:
def bootstrappingExtraction(train_sents, train_entpairs, test_sents, test_entpairs, num_iter=10):
    """
    Given a set of patterns and entity pairs for a relation, extracts more patterns and entity pairs iteratively
    Args:
        train_sents: training sentences with arguments masked
        train_entpairs: training entity pairs
        test_sents: testing sentences with arguments masked
        test_entpairs: testing entity pairs
    Returns:
        the testing sentences which the training patterns or any of the inferred patterns appeared in
    """



In [9]:
def bootstrappingExtraction(train_sents, train_entpairs, test_sents, test_entpairs, num_iter=10):
    # convert training and testing sentences to short paths to obtain patterns
    train_patterns = set([sentenceToShortPath(s) for s in train_sents])
    train_patterns.discard("in") # too general, remove this
    test_patterns = [sentenceToShortPath(s) for s in test_sents]

    # iteratively get more patterns and entity pairs
    test_extracts = []  # extractions collected so far; must exist before the first print
    for i in range(1, num_iter):
        print("Number extractions at iteration", str(i), ":", str(len(test_extracts)))
        print("Number patterns at iteration", str(i), ":", str(len(train_patterns)))
        print("Number entpairs at iteration", str(i), ":", str(len(train_entpairs)))
        # get more patterns and entity pairs
        test_extracts_p, ext_test_patterns_p, ext_test_entpairs_p = searchForPatternsAndEntpairsByPatterns(train_patterns, test_patterns, test_entpairs, test_sents)
        test_extracts_e, ext_test_patterns_e, ext_test_entpairs_e = searchForPatternsAndEntpairsByEntpairs(train_entpairs, test_patterns, test_entpairs, test_sents)
        # add them to the existing entity pairs for the next iteration
        train_patterns.update(ext_test_patterns_p)
        train_patterns.update(ext_test_patterns_e)
        train_entpairs.extend(ext_test_entpairs_p)
        train_entpairs.extend(ext_test_entpairs_e)
        test_extracts.extend(test_extracts_p)
        test_extracts.extend(test_extracts_e)

    return test_extracts, test_entpairs

In [10]:
test_extracts, test_entpairs = ie.bootstrappingExtraction(training_patterns, training_entpairs, testing_patterns, testing_entpairs)

Number extractions at iteration 0 : 0
Number patterns at iteration 0 : 19
Number entpairs at iteration 0 : 22
Number extractions at iteration 1 : 2
Number patterns at iteration 1 : 19
Number entpairs at iteration 1 : 24
Number extractions at iteration 2 : 6
Number patterns at iteration 2 : 19
Number entpairs at iteration 2 : 28
Number extractions at iteration 3 : 10
Number patterns at iteration 3 : 19
Number entpairs at iteration 3 : 32
Number extractions at iteration 4 : 14
Number patterns at iteration 4 : 19
Number entpairs at iteration 4 : 36
Number extractions at iteration 5 : 18
Number patterns at iteration 5 : 19
Number entpairs at iteration 5 : 40


One noticeable effect is that with each iteration the number of extractions increases, but the extractions become less accurate.

In [11]:
for (s, e) in zip(test_extracts[1:3], test_entpairs[1:3]):
    print(s, e)
print("")
for (s, e) in zip(test_extracts[-3:-1], test_entpairs[-3:-1]):
    print(s, e)

a strongly XXXXX is proposed to solve the XXXXX ['nonlinear effects', 'effective properties']
the novelties of our work are in both theory and application . we propose a new distance formula for interval type-2 fuzzy sets . based on the proposed distance formula , we propose a new XXXXX to solve a XXXXX and finally to illustrate the applicability of the proposed method , a case study is used ['feature model', 'object modeling']

a strongly XXXXX is proposed to solve the XXXXX ['sample size', 'gps data']
the novelties of our work are in both theory and application . we propose a new distance formula for interval type-2 fuzzy sets . based on the proposed distance formula , we propose a new XXXXX to solve a XXXXX and finally to illustrate the applicability of the proposed method , a case study is used ['memetic algorithm', 'genetic algorithm']


* One of the reasons is that the semantics of the patterns shift
    * here we try to find new patterns for 'method used for task'
    * but because the instances share similar contexts with other relations, the patterns and entity pairs iteratively move away from the 'method used for task' relation
    * Another example: the 'student-at' and 'lecturer-at' relations, which have many overlapping contexts


* One way of improving this is with confidence values for each entity pair and pattern
    * For example, we might want to avoid entity pairs or patterns which are too general and penalise them

In [12]:
from collections import Counter
te_cnt = Counter()
for te in test_extracts:
    te_cnt[sentenceToShortPath(te)] += 1
print(te_cnt)

Counter({'to solve a': 11, 'is proposed to solve the': 11})


* An example of such a 'noisy' pattern is the 'in' pattern found originally, which matches many contexts that are not 'method used for task' 
* **Thought exercise**: 
    * how would a confidence weighting for patterns work here?
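
One possible answer, sketched under assumptions: score each pattern by the fraction of its matches whose entity pair is already known to express the relation, and keep only patterns above a threshold. The function name, the toy matches and the threshold of 0.5 are all hypothetical, and other confidence measures are possible.

```python
from collections import Counter

def pattern_confidence(pattern_hits, positive_entpairs):
    """
    pattern_hits: list of (pattern, entity pair) matches found in the corpus
    positive_entpairs: entity pairs known to express the relation
    Returns a dict mapping each pattern to the fraction of its matches
    whose entity pair is a known positive.
    """
    known = {tuple(e) for e in positive_entpairs}
    total, positive = Counter(), Counter()
    for pattern, entpair in pattern_hits:
        total[pattern] += 1
        if tuple(entpair) in known:
            positive[pattern] += 1
    return {p: positive[p] / total[p] for p in total}

# toy matches (made up for illustration): 'in' fires in many unrelated contexts
hits = [
    ("is proposed to solve the", ["memetic algorithm", "scheduling problem"]),
    ("in", ["memetic algorithm", "scheduling problem"]),
    ("in", ["sample size", "gps data"]),
    ("in", ["porous materials", "effective properties"]),
    ("in", ["fuel consumption", "operations costs"]),
]
positives = [["memetic algorithm", "scheduling problem"]]

conf = pattern_confidence(hits, positives)
trusted = {p for p, c in conf.items() if c >= 0.5}  # hypothetical threshold
print(conf)
print(trusted)
```

A very general pattern such as 'in' matches many pairs that are not known positives, so it receives a low confidence and is filtered out before the next bootstrapping iteration.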

## Supervised Relation Extraction
* A different way of assigning a relation label to new instances is to follow the supervised learning paradigm
* We have already seen this paradigm for other structured prediction tasks
* For supervised relation extraction, the scoring model \\(s_{\params}(\x,y)\\) is estimated automatically based on training sentences $\mathcal{X}$ and their labels $\mathcal{Y}$
* We can use a range of different classifiers, e.g. a logistic regression model or an SVM
* At testing time, the predicted label for each testing instance is the highest-scoring one, i.e. $$ \y^* = \argmax_{\y\in\Ys} s(\x,\y) $$


* The training data consists again of patterns, entity pairs and labels
* This time, the given labels for the training instances are 'method used for task' or 'NONE', i.e. we have positive and negative training data

In [13]:
training_sents, training_entpairs, training_labels = ie.readLabelledData()
print("Manually labelled data set consists of", training_labels.count("NONE"), 
          "negative training examples and", training_labels.count("method used for task"), "positive training examples\n")
for (tr_s, tr_e, tr_l) in zip(training_sents[:3], training_entpairs[:3], training_labels[:3]):
    print(tr_s, tr_e, tr_l)

Manually labelled data set consists of 22 negative training examples and 22 positive training examples

demonstrates XXXXX and clustering techniques for XXXXX ['text mining', 'building domain ontology'] method used for task
demonstrates text mining and XXXXX for building XXXXX ['clustering techniques', 'domain ontology'] method used for task
the XXXXX is able to enhance the XXXXX ['ensemble classifier', 'detection of construction materials'] method used for task


* Next, we define how to transform training and testing data to features. 
* Features for the model are typically extracted from the shortest dependency path between two entities
* Basic features are n-gram features, or they can be based on the syntactic structure of the input, i.e. the dependency path ([parsing](statnlpbook/chapters/parsing))
* We assume again that entity pairs are part of the input, i.e. we assume the named entity recognition problem to be solved as part of the preprocessing of the data
* In reality, named entities have to be recognised first

* Let's look at an example, using one of sklearn's built-in feature extractors to transform sentences to n-grams

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

def featTransform(sents_train, sents_test):
    cv = CountVectorizer()
    cv.fit(sents_train)
    print(cv.get_params())
    features_train = cv.transform(sents_train)
    features_test = cv.transform(sents_test)
    return features_train, features_test, cv
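
To make n-gram features concrete, here is a plain-Python sketch of counting word n-grams in a short path. It is independent of sklearn and only illustrates the idea; the function name `ngram_counts` is hypothetical, and `CountVectorizer` additionally applies its own tokenisation and builds a shared vocabulary across all sentences.

```python
from collections import Counter

def ngram_counts(text, n_min=1, n_max=2):
    """Count word n-grams of length n_min..n_max in a whitespace-tokenised string."""
    toks = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # slide a window of n tokens over the sentence
        for i in range(len(toks) - n + 1):
            counts[" ".join(toks[i:i + n])] += 1
    return counts

features = ngram_counts("is proposed to solve the")
print(features)
```

Each distinct n-gram becomes one feature dimension; the value is how often it occurs in the pattern.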

We define a model, again with sklearn, using one of their built-in classifiers and a prediction function.

In [15]:
from sklearn.linear_model import LogisticRegression

def model_train(feats_train, labels):
    model = LogisticRegression(penalty='l2')  # logistic regression model with l2 regularisation
    model.fit(feats_train, labels) # fit the model to the transformed training data
    return model

def predict(model, features_test):
    """Find the most compatible output class"""
    preds = model.predict(features_test) # this returns the predicted labels
    #preds_prob = model.predict_proba(features_test)  # this returns probabilities instead of labels
    return preds

We further define a helper function for debugging that determines the most useful features learned by the model

In [16]:
def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))

Supervised relation extraction algorithm:

In [17]:
def supervisedExtraction(train_sents, train_entpairs, train_labels, test_sents, test_entpairs):
    """
    Given pos/neg training instances, train a logistic regression model with simple BOW features and predict labels on unseen test instances
    Args:
        train_sents: training sentences with arguments masked
        train_entpairs: training entity pairs
        train_labels: labels of training instances
        test_sents: testing sentences with arguments masked
        test_entpairs: testing entity pairs
    Returns:
        predictions for the testing sentences
    """

In [18]:
def supervisedExtraction(train_sents, train_entpairs, train_labels, test_sents, test_entpairs):

    # convert training and testing sentences to short paths to obtain patterns
    train_patterns = [sentenceToShortPath(test_sent) for test_sent in train_sents]
    test_patterns = [sentenceToShortPath(test_sent) for test_sent in test_sents]

    # extract features
    features_train, features_test, cv = featTransform(train_patterns, test_patterns)

    # train model
    model = model_train(features_train, train_labels)

    # show most informative features
    show_most_informative_features(cv, model)

    # get predictions
    predictions = predict(model, features_test)

    # show the predictions
    for tup in zip(predictions, test_sents, test_entpairs):
        print(tup)

    return predictions

In [19]:
ie.supervisedExtraction(training_sents, training_entpairs, training_labels, testing_patterns, testing_entpairs)

{'analyzer': 'word', 'binary': False, 'decode_error': 'strict', 'dtype': <class 'numpy.int64'>, 'encoding': 'utf-8', 'input': 'content', 'lowercase': True, 'max_df': 1.0, 'max_features': None, 'min_df': 1, 'ngram_range': (1, 1), 'preprocessor': None, 'stop_words': None, 'strip_accents': None, 'token_pattern': '(?u)\\b\\w\\w+\\b', 'tokenizer': None, 'vocabulary': None}
	-0.9520	of             		1.0542	is             
	-0.4352	specified      		0.9787	to             
	-0.4352	using          		0.8733	for            
	-0.4313	ann            		0.4851	and            
	-0.4313	find           		0.4785	solved         
	-0.3274	decreases      		0.4785	assists        
	-0.3258	that           		0.4309	are            
	-0.3181	allowing       		0.4151	solve          
	-0.3181	except         		0.4081	on             
	-0.3074	as             		0.4081	application    
	-0.2935	in             		0.3915	more           
	-0.2892	introduced     		0.3915	capable        
	-0.2892	unified        		0.3526	presente

('NONE', 'integration of XXXXX with XXXXX for policy iteration', ['reinforcement learning', 'gaussian processes'])
('method used for task', 'based on the adaptive neuro-fuzzy XXXXX ( anfis ) approach , a hazard modeling and survival prediction system is developed to assist clinicians in prognostic assessment of patients with esophageal cancer and prediction of individual XXXXX', ['inference system', 'patient survival'])
('method used for task', 'this may provide valuable prognostic information in addition to ajcc staging and aid the clinicians’ XXXXX for XXXXX', ['decision-making process', 'risk stratification'])
('method used for task', 'ann shows good XXXXX as compared to svr , rfr and XXXXX', ['prediction performance', 'mlr models'])
('method used for task', 'this type of XXXXX can help to represent XXXXX having certain order', ['soft sets', 'linguistic terms'])
('NONE', 'in certain XXXXX these XXXXX and operations on them can be very helpful', ['soft sets', 'decision making problem

('method used for task', 'we develop a XXXXX for convex XXXXX', ['polynomial time algorithm', 'compression cost function'])
('method used for task', 'we develop a new XXXXX for minmax regret XXXXX', ['lower bound', 'combinatorial optimization problems'])
('method used for task', 'our solution reduces two costs linked to this problem : XXXXX and handling XXXXX', ['fuel consumption', 'operations costs'])
('NONE', 'the XXXXX can be found by a XXXXX', ['dynamic programming algorithm', 'optimal routing strategy'])
('NONE', 'the XXXXX has a specific XXXXX', ['optimal strategy', 'threshold structure'])
('NONE', 'XXXXX and the value of the marginal product of inputs are measured in one step using various XXXXX', ['technical efficiency', 'model specifications'])
('NONE', 'we use a bootstrap dea technique to estimate the mean and 95 percent XXXXX of XXXXX and shadow prices', ['confidence intervals', 'technical efficiency'])
('NONE', 'we use a bootstrap dea technique to estimate the mean and 95 p

array(['NONE', 'method used for task', 'method used for task', ...,
       'method used for task', 'NONE', 'method used for task'],
      dtype='<U20')

* Some of the features are common words (i.e. 'stop words', such as 'is') and very broad
* Other features are very specific and thus might not appear very often
* Typically these problems can be mitigated by using more sophisticated features such as those based on syntax 
* Also, the current model does not take the entity pairs into account, only the path between the entity pairs. 
    * We will later examine a model that also takes entity pairs into account


* Finally, the model requires manually annotated training data, which might not always be available.
* Next, we will look at a method that provides a solution for the latter

## Distant Supervision
* Supervised learning typically requires large amounts of hand-labelled training examples
* Since it is **time-consuming and expensive** to manually label examples, it is desirable to find ways of automatically or semi-automatically producing more training data
    * We have already seen one example of this, bootstrapping
* Although bootstrapping can be useful, one of the downsides already discussed above is **semantic drift** due to the iterative nature of finding good entity pairs and patterns
* An alternative approach is distant supervision

* We still have a set of entity pairs $\mathcal{E}$, their relation types $\mathcal{R}$ and a set of sentences $\mathcal{X}$ as an input
    * but we do **not require pre-defined patterns**
* Instead, a large number of such entity pairs and relations are obtained from a **knowledge resource**, e.g. the [Wikidata knowledge base](https://www.wikidata.org), the [Yago knowledge base](www.yago-knowledge.org/) or tables
* These entity pairs and relations are then used to automatically label all sentences with relations if there exists an entity pair between which this relation holds according to the knowledge resource
* After sentences are labelled in this way, the rest of the algorithm is the same as the supervised relation extraction algorithm

In [20]:
def distantlySupervisedLabelling(kb_entpairs, unlab_sents, unlab_entpairs):
    """
    Label instances using distant supervision assumption
    Args:
        kb_entpairs: entity pairs for a specific relation
        unlab_sents: unlabelled sentences with entity pairs anonymised
        unlab_entpairs: entity pairs which were anonymised in unlab_sents

    Returns: train_sents, train_entpairs, train_labels

    """

In [21]:
def distantlySupervisedLabelling(kb_entpairs, unlab_sents, unlab_entpairs):
    train_sents, train_entpairs, train_labels = [], [], []
    for i, unlab_entpair in enumerate(unlab_entpairs):
        # if the entity pair is a KB tuple, it is a positive example for that relation
        if unlab_entpair in kb_entpairs:  
            train_entpairs.append(unlab_entpair)
            train_sents.append(unlab_sents[i])
            train_labels.append("method used for task")
        else: # else, it is a negative example for that relation
            train_entpairs.append(unlab_entpair)
            train_sents.append(unlab_sents[i])
            train_labels.append("NONE")

    return train_sents, train_entpairs, train_labels

In [22]:
def distantlySupervisedExtraction(kb_entpairs, unlab_sents, unlab_entpairs, test_sents, test_entpairs):
    # training_data <- Find training sentences with entity pairs
    train_sents, train_entpairs, train_labels = distantlySupervisedLabelling(kb_entpairs, unlab_sents, unlab_entpairs)
    
    print("Distantly supervised labelling results in", train_labels.count("NONE"), 
          "negative training examples and", train_labels.count("method used for task"), "positive training examples")
    
    # training works the same as for supervised RE
    supervisedExtraction(train_sents, train_entpairs, train_labels, test_sents, test_entpairs)

In [23]:
kb_entpairs, unlab_sents, unlab_entpairs = ie.readDataForDistantSupervision()
print(len(kb_entpairs), "'KB' entity pairs for relation 'method used for task' :", kb_entpairs[0:5])
print(len(unlab_entpairs), 'all entity pairs')
ie.distantlySupervisedExtraction(kb_entpairs, unlab_sents, unlab_entpairs, testing_patterns, testing_entpairs)

22 'KB' entity pairs for relation 'method used for task' : [['text mining', 'building domain ontology'], ['clustering techniques', 'domain ontology'], ['ensemble classifier', 'detection of construction materials'], ['autonomous system', 'thermal modeling'], ['optimization models', 'dynamic supply chain issue']]
44 all entity pairs
{'analyzer': 'word', 'binary': False, 'decode_error': 'strict', 'dtype': <class 'numpy.int64'>, 'encoding': 'utf-8', 'input': 'content', 'lowercase': True, 'max_df': 1.0, 'max_features': None, 'min_df': 1, 'ngram_range': (1, 1), 'preprocessor': None, 'stop_words': None, 'strip_accents': None, 'token_pattern': '(?u)\\b\\w\\w+\\b', 'tokenizer': None, 'vocabulary': None}
	-0.9520	of             		1.0542	is             
	-0.4352	specified      		0.9787	to             
	-0.4352	using          		0.8733	for            
	-0.4313	ann            		0.4851	and            
	-0.4313	find           		0.4785	solved         
	-0.3274	decreases      		0.4785	assists        
	-

('NONE', 'integration of XXXXX with XXXXX for policy iteration', ['reinforcement learning', 'gaussian processes'])
('method used for task', 'based on the adaptive neuro-fuzzy XXXXX ( anfis ) approach , a hazard modeling and survival prediction system is developed to assist clinicians in prognostic assessment of patients with esophageal cancer and prediction of individual XXXXX', ['inference system', 'patient survival'])
('method used for task', 'this may provide valuable prognostic information in addition to ajcc staging and aid the clinicians’ XXXXX for XXXXX', ['decision-making process', 'risk stratification'])
('method used for task', 'ann shows good XXXXX as compared to svr , rfr and XXXXX', ['prediction performance', 'mlr models'])
('method used for task', 'this type of XXXXX can help to represent XXXXX having certain order', ['soft sets', 'linguistic terms'])
('NONE', 'in certain XXXXX these XXXXX and operations on them can be very helpful', ['soft sets', 'decision making problem

('method used for task', 'the proposed model provide high-precision XXXXX and local XXXXX', ['load distribution', 'stress field'])
('method used for task', 'the new fe model considers both XXXXX and size effects for micro XXXXX of circular cups', ['surface roughness', 'deep drawing'])
('NONE', 'the new XXXXX considers both XXXXX and size effects for micro deep drawing of circular cups', ['surface roughness', 'fe model'])
('method used for task', 'the new XXXXX considers both surface roughness and size effects for micro XXXXX of circular cups', ['deep drawing', 'fe model'])
('NONE', 'XXXXX affects the springback , the drawability and the cups’ quality obviously in micro XXXXX', ['surface roughness', 'deep drawing'])
('method used for task', 'the XXXXX is given by a residual-stress dependent nonlinear elastic XXXXX in terms of invariants', ['material model', 'constitutive law'])
('method used for task', 'the dependence of bifurcation and postbifurcation behavior of tubes under torsion on

* The results we get here are the same as for supervised relation extraction. This is because the distant supervision heuristic identified the same positive and negative training examples as in the manually labelled dataset
* In practice, the distant supervision heuristic typically leads to noisy training data, for several reasons
    * Overlapping relations
        * For instance, 'lecturer-at' entails 'employee-of', and there are some overlapping entity pairs between the relations 'employee-of' and 'student-at'
    * Ambiguous entities
        * e.g. 'EM' has many possible meanings, only one of which is 'Expectation Maximisation', see [the Wikipedia disambiguation page for the acronym](https://en.wikipedia.org/wiki/EM)
    * Not every sentence that mentions an entity pair which is a positive example for a relation actually expresses that relation
        * e.g. compare the sentence from [the Wikipedia EM definition](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) 'Expectation–maximization algorithm, an algorithm for finding maximum likelihood estimates (...)' with 'In this section we introduce EM (...)'
        * The first one is a true mention of 'method used for task', whereas the second one is not
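
The noise introduced by the heuristic can be illustrated with a toy example. The sketch below restates the labelling rule in a minimal, self-contained form; the function name `label_by_kb`, the KB entry and the two sentences are invented for illustration and are not part of the dataset used in this chapter.

```python
def label_by_kb(kb_entpairs, entpairs, relation="method used for task"):
    """Label an instance as positive iff its entity pair is listed in the KB."""
    return [relation if pair in kb_entpairs else "NONE" for pair in entpairs]

# hypothetical KB with a single entity pair
kb = [["EM", "maximum likelihood estimation"]]

# two hypothetical sentences that mention the same entity pair
sents = [
    "XXXXX is an algorithm for XXXXX",               # expresses the relation
    "in this section we introduce XXXXX and XXXXX",  # does not express it
]
pairs = [["EM", "maximum likelihood estimation"],
         ["EM", "maximum likelihood estimation"]]

labels = label_by_kb(kb, pairs)
for sent, label in zip(sents, labels):
    print(label, "|", sent)
```

Both sentences receive the positive label, because the heuristic only looks at the entity pair and never at the sentence itself. The second sentence is therefore a noisy positive training example.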

## Universal Schema
* Recall that for the pattern-based and bootstrapping approaches earlier, we were looking for simplified paths between entity pairs $\mathcal{E}$ expressing a certain relation $\mathcal{R}$ which we defined beforehand
    * This **restricts the relation extraction problem to known relation types** $\mathcal{R}$
    * In order to overcome that limitation, we could have defined new relations on the spot and added them to $\mathcal{R}$ by introducing new relation types for certain simplified paths between entity pairs

* The goal of universal schemas is to overcome the limitation of having to pre-define relations, but within the supervised learning paradigm
* This is possible by thinking of paths between entity pairs **as relation expressions themselves**
* Simplified paths between entity pairs and relation labels are no longer considered separately; instead, paths between entity pairs and relations are **modelled in the same space**


The space of entity pairs and relations is defined by a matrix:

|  | demonstrates XXXXX for XXXXXX | XXXXX is capable of XXXXXX | an XXXXX model is employed for XXXXX | XXXXX decreases the XXXXX | method is used for task |
| ------ | ------ | ------ | ------ | ------ | ------ |
| 'text mining', 'building domain ontology' | 1 |  |  |  | 1 |
| 'ensemble classifier', 'detection of construction materials' |  |  | 1 |  | 1 |
| 'data mining', 'characterization of wireless systems performance'|  | 1 |  |  | ? |
| 'frequency domain', 'computational cost' |  |  |  | 1 | ? |

* Here, 'method is used for task' is a relation defined by a KB schema
* The other relations are surface pattern relations generated by blanking entity pairs in sentences
* Where an entity pair and a KB relation or surface pattern relation co-occur, this is signified by a '1'
* For some of the entity pairs, a label for 'method is used for task' is available, whereas for others it is not (signified by the '?')

* The task is to turn the '?'s into 0/1 predictions for the 'method is used for task' relation
* Note that this is the same relation extraction task, on the same data, as considered previously with supervised learning; only the data representation and the model are different
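
As a rough illustration, the matrix above can be built from a list of observed (entity pair, relation) co-occurrences. This is only a sketch using NumPy; the variable names (`observed`, `rows`, `cols`, `unknown`) are made up here and do not correspond to anything in the `ie` module. In this sketch a 0 means "not observed", which covers both the empty cells and the '?' cells.

```python
import numpy as np

# (entity pair, relation) co-occurrences, copied from the matrix above
observed = [
    (("text mining", "building domain ontology"), "demonstrates XXXXX for XXXXXX"),
    (("text mining", "building domain ontology"), "method is used for task"),
    (("ensemble classifier", "detection of construction materials"), "an XXXXX model is employed for XXXXX"),
    (("ensemble classifier", "detection of construction materials"), "method is used for task"),
    (("data mining", "characterization of wireless systems performance"), "XXXXX is capable of XXXXXX"),
    (("frequency domain", "computational cost"), "XXXXX decreases the XXXXX"),
]

# index rows (entity pairs) and columns (relations) in order of first appearance
rows = list(dict.fromkeys(pair for pair, _ in observed))
cols = list(dict.fromkeys(rel for _, rel in observed))

matrix = np.zeros((len(rows), len(cols)), dtype=int)
for pair, rel in observed:
    matrix[rows.index(pair), cols.index(rel)] = 1

# entity pairs for which the KB relation has not been observed (the '?' cells)
kb_col = cols.index("method is used for task")
unknown = [rows[i] for i in range(len(rows)) if matrix[i, kb_col] == 0]

print(matrix)
print(unknown)
```

The entity pairs collected in `unknown` are exactly the rows for which a prediction is needed.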

* In order to solve this prediction task, we will learn to fill in the empty cells in the matrix
* This is achieved by learning to distinguish between entity pairs and relations which co-occur in our training data and entity pairs and relations which are not known to co-occur (the empty cells)

* Each training instance consists of a surface pattern or KB relation $r_{pos}$ and an entity pair $e_{pos}$ that the relation co-occurs with, as well as a relation $r_{neg}$ and an entity pair $e_{neg}$ that do not co-occur in the training data
* The positive relations and entity pairs are taken directly from the annotated data
* The negative entity pairs and relations are sampled randomly from the data points represented by the empty cells in the matrix above. The goal is to estimate, for a relation $r$ such as 'method is used for task' and an unseen entity pair $e$ such as ('frequency domain', 'computational cost'), the probability $p(y_{r,e} = 1)$.
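
The sampling of negative instances from empty cells could look roughly as follows. This is a simplified sketch on a toy matrix, not the procedure used later in this chapter (which samples from manually labelled negative examples); all names are illustrative.

```python
import random
import numpy as np

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# toy co-occurrence matrix: rows are entity pairs, columns are relations
matrix = np.array([[1, 0, 0, 0, 1],
                   [0, 0, 1, 0, 1],
                   [0, 1, 0, 0, 0],
                   [0, 0, 0, 1, 0]])

# observed cells are positives, empty cells are candidate negatives
positives = [(int(e), int(r)) for e, r in zip(*np.nonzero(matrix))]
empty = [(int(e), int(r)) for e, r in zip(*np.nonzero(matrix == 0))]

# pair every observed cell with one randomly drawn empty cell
training_instances = [(pos, rng.choice(empty)) for pos in positives]

for (e_pos, r_pos), (e_neg, r_neg) in training_instances:
    print("pos:", (e_pos, r_pos), "neg:", (e_neg, r_neg))
```

Note that an empty cell is not guaranteed to be a true negative: some of the sampled cells may correspond to facts that simply have not been observed.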

In [24]:
# data reading
training_sents, training_entpairs, training_labels = ie.readLabelledData()

# split positive and negative training data
pos_train_ids, neg_train_ids = ie.split_labels_pos_neg(training_labels + training_labels)

training_toks_pos = [t.split(" ") for i, t in enumerate(training_sents + training_labels) if i in pos_train_ids]
training_toks_neg = [t.split(" ") for i, t in enumerate(training_sents + training_labels) if i in neg_train_ids]

training_ent_toks_pos = [" || ".join(t).split(" ") for i, t in enumerate(training_entpairs + training_entpairs) if i in pos_train_ids]
training_ent_toks_neg = [" || ".join(t).split(" ") for i, t in enumerate(training_entpairs + training_entpairs) if i in neg_train_ids]
testing_ent_toks = [" || ".join(t).split(" ") for t in testing_entpairs]

# print length statistics
lens_rel = [len(s) for s in training_toks_pos + training_toks_neg]
lens_ents = [len(s) for s in training_ent_toks_pos + training_ent_toks_neg + testing_ent_toks]
print("Max relation length:", max(lens_rel))
print("Max entity pair length:", max(lens_ents))

Max relation length: 16
Max entity pair length: 9


In [25]:
# vectorise data (assign IDs to words)
count_rels, dictionary_rels, reverse_dictionary_rels = ie.build_dataset(
        [token for senttoks in training_toks_pos + training_toks_neg for token in senttoks])

count_ents, dictionary_ents, reverse_dictionary_ents = ie.build_dataset(
        [token for senttoks in training_ent_toks_pos + training_ent_toks_neg for token in senttoks])

print(reverse_dictionary_rels)

Final vocab size: 163
Final vocab size: 138
{0: 'UNK', 1: 'XXXXX', 2: 'for', 3: 'used', 4: 'method', 5: 'task', 6: 'NONE', 7: 'the', 8: 'a', 9: 'of', 10: 'and', 11: 'to', 12: 'is', 13: 'in', 14: 'are', 15: 'model', 16: 'proposed', 17: 'we', 18: 'this', 19: 'new', 20: 'presented', 21: 'as', 22: 'with', 23: 'propose', 24: 'pso-based', 25: 'anfis', 26: 'approaches', 27: 'affective', 28: 'combination', 29: '(', 30: ')', 31: 'on', 32: 'using', 33: 'demonstrates', 34: 'paper', 35: 'proposes', 36: 'solve', 37: 'design', 38: 'an', 39: 'swarm', 40: 'intelligence', 41: 'introduced', 42: 'clustering', 43: 'techniques', 44: 'text', 45: 'mining', 46: 'building', 47: 'able', 48: 'enhance', 49: 'fully', 50: '3d', 51: 'buildings', 52: 'two', 53: 'more', 54: 'capable', 55: 'product', 56: 'customer', 57: 'satisfaction', 58: 'solved', 59: 'obtained', 60: 'optimal', 61: 'section', 62: 'shape', 63: 'sizing', 64: 'cable–truss', 65: 'structures', 66: 'employed', 67: 'assists', 68: 'chaos', 69: 'theory', 70: 

In [26]:
# transform sentences to IDs, pad vectors for each sentence so they have same length
rels_train_pos = [ie.transform_dict(dictionary_rels, senttoks, max(lens_rel)) for senttoks in training_toks_pos]
rels_train_neg = [ie.transform_dict(dictionary_rels, senttoks, max(lens_rel)) for senttoks in training_toks_neg]
ents_train_pos = [ie.transform_dict(dictionary_ents, senttoks, max(lens_ents)) for senttoks in training_ent_toks_pos]
ents_train_neg = [ie.transform_dict(dictionary_ents, senttoks, max(lens_ents)) for senttoks in training_ent_toks_neg]

print(rels_train_pos[0], "\n", rels_train_pos[1])

[33  1 10 42 43  2  1  0  0  0  0  0  0  0  0  0] 
 [33 44 45 10  1  2 46  1  0  0  0  0  0  0  0  0]
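
The ID lookup and padding step can be sketched without the `ie` module as below. The helper `pad_ids` is a hypothetical stand-in and not the actual implementation of `ie.transform_dict`; it assumes, as the printed vectors suggest, that padding positions share ID 0 with 'UNK'.

```python
def pad_ids(dictionary, tokens, max_len, unk_id=0, pad_id=0):
    """Map tokens to IDs (unknown words -> unk_id) and right-pad to max_len."""
    ids = [dictionary.get(tok, unk_id) for tok in tokens][:max_len]
    return ids + [pad_id] * (max_len - len(ids))

# hypothetical mini-vocabulary; the IDs mirror a few entries of the printed dictionary
vocab = {"UNK": 0, "XXXXX": 1, "for": 2, "demonstrates": 33}

padded = pad_ids(vocab, "demonstrates XXXXX for XXXXX".split(" "), max_len=8)
print(padded)  # [33, 1, 2, 1, 0, 0, 0, 0]
```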


In [27]:
# Negatively sample some entity pairs for training. Here we have some manually labelled neg ones, so we can sample from them.
ents_train_neg_samp = [random.choice(ents_train_neg) for _ in rels_train_neg]
    
ents_test_pos = [ie.transform_dict(dictionary_ents, senttoks, max(lens_ents)) for senttoks in testing_ent_toks]
# Sample those test entity pairs from the training ones as for those we have neg annotations
ents_test_neg_samp = [random.choice(ents_train_neg) for _ in ents_test_pos]  

vocab_size_rels = len(dictionary_rels)
vocab_size_ents = len(dictionary_ents) 

# for testing, we want to check if each unlabelled instance expresses the given relation "method for task"
rels_test_pos = [ie.transform_dict(dictionary_rels, training_toks_pos[-1], max(lens_rel)) for _ in testing_patterns]
rels_test_neg_samp = [random.choice(rels_train_neg) for _ in rels_test_pos]

In [28]:
data = ie.vectorise_data(training_sents, training_entpairs, training_labels, testing_patterns, testing_entpairs)

rels_train_pos, rels_train_neg, ents_train_pos, ents_train_neg_samp, rels_test_pos, rels_test_neg_samp, \
    ents_test_pos, ents_test_neg_samp, vocab_size_rels, vocab_size_ents, max_lens_rel, max_lens_ents, \
    dictionary_rels_rev, dictionary_ents_rev = data
  
# setting hyper-parameters
batch_size = 4
repr_dim = 30 # dimensionality of relation and entity pair vectors
learning_rate = 0.001
max_epochs = 31

Max relation length: 16
Max entity pair length: 9
Final vocab size: 163
Final vocab size: 138


In [29]:
def create_model_f_reader(max_rel_seq_length, max_cand_seq_length, repr_dim, vocab_size_rels, vocab_size_cands):
    """
    Create a Model F Universal Schema reader (Tensorflow graph).
    Args:
        max_rel_seq_length: maximum sentence sequence length
        max_cand_seq_length: maximum candidate sequence length
        repr_dim: dimensionality of vectors
        vocab_size_rels: size of relation vocabulary
        vocab_size_cands: size of candidate vocabulary
    Returns:
        dotprod_pos: dot product between positive entity pairs and relations
        dotprod_neg: dot product between negative entity pairs and relations
        diff_dotprod: difference in dot product of positive and negative instances, used for BPR loss (optional)
        [relations_pos, relations_neg, ents_pos, ents_neg]: placeholders, fed in during training for each batch
    """

In [30]:
# Placeholders (empty Tensorflow variables) for positive and negative relations and entity pairs
# In each training epoch, for each batch, those will be set through mini batching

relations_pos = tf.placeholder(tf.int32, [None, max_lens_rel], name='relations_pos')  # [batch_size, max_rel_seq_len]
relations_neg = tf.placeholder(tf.int32, [None, max_lens_rel], name='relations_neg')  # [batch_size, max_rel_seq_len]

ents_pos = tf.placeholder(tf.int32, [None, max_lens_ents], name="ents_pos") # [batch_size, max_ent_seq_len]
ents_neg = tf.placeholder(tf.int32, [None, max_lens_ents], name="ents_neg") # [batch_size, max_ent_seq_len]

In [31]:
# Creating latent representations of relations and entity pairs
# latent feature representation of all relations, which are initialised randomly
relation_embeddings = tf.Variable(tf.random_uniform([vocab_size_rels, repr_dim], -0.1, 0.1, dtype=tf.float32),
                                   name='rel_emb', trainable=True)

# latent feature representation of all entity pairs, which are initialised randomly
ent_embeddings = tf.Variable(tf.random_uniform([vocab_size_ents, repr_dim], -0.1, 0.1, dtype=tf.float32),
                                      name='cand_emb', trainable=True)

# look up latent feature representation for relations and entities in current batch
rel_encodings_pos = tf.nn.embedding_lookup(relation_embeddings, relations_pos)
rel_encodings_neg = tf.nn.embedding_lookup(relation_embeddings, relations_neg)

ent_encodings_pos = tf.nn.embedding_lookup(ent_embeddings, ents_pos)
ent_encodings_neg = tf.nn.embedding_lookup(ent_embeddings, ents_neg)

In [32]:
# our feature representation here is a vector for each word in a relation or entity 
# because our training data is so small
# we therefore take the sum of those vectors to get a representation of each relation or entity pair
rel_encodings_pos = tf.reduce_sum(rel_encodings_pos, 1)  # [batch_size, num_rel_toks, repr_dim] -> [batch_size, repr_dim]
rel_encodings_neg = tf.reduce_sum(rel_encodings_neg, 1)  # [batch_size, num_rel_toks, repr_dim] -> [batch_size, repr_dim]

ent_encodings_pos = tf.reduce_sum(ent_encodings_pos, 1)  # [batch_size, num_ent_toks, repr_dim] -> [batch_size, repr_dim]
ent_encodings_neg = tf.reduce_sum(ent_encodings_neg, 1)  # [batch_size, num_ent_toks, repr_dim] -> [batch_size, repr_dim]

In [33]:
# measuring compatibility between positive entity pairs and relations
# used for ranking test data
dotprod_pos = tf.reduce_sum(tf.multiply(ent_encodings_pos, rel_encodings_pos), 1)

# measuring compatibility between negative entity pairs and relations
dotprod_neg = tf.reduce_sum(tf.multiply(ent_encodings_neg, rel_encodings_neg), 1)

# difference in dot product of positive and negative instances
# used for BPR loss (ranking loss)
diff_dotprod = tf.reduce_sum(tf.multiply(ent_encodings_pos, rel_encodings_pos) - tf.multiply(ent_encodings_neg, rel_encodings_neg), 1)


To train this model, we define a loss that pushes the scores of positive instances up and the scores of negative instances down. One possible choice is the logistic loss.

$\sum -\log \sigma(v_{e_{pos}} \cdot a_{r_{pos}}) + \sum -\log\left(1 - \sigma(v_{e_{neg}} \cdot a_{r_{neg}})\right)$
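
The logistic loss is conveniently computed with the softplus function, $\text{softplus}(x) = \log(1 + e^x)$, because $-\log \sigma(x) = \text{softplus}(-x)$ and $-\log(1 - \sigma(x)) = \text{softplus}(x)$. The following NumPy sketch checks this identity on made-up dot products; the helper names and the numbers are purely illustrative.

```python
import numpy as np

def softplus(x):
    # numerically stable log(1 + exp(x))
    return np.logaddexp(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_loss(dot_pos, dot_neg):
    """Sum of -log sigma(pos) and -log(1 - sigma(neg)) over a batch."""
    return np.sum(softplus(-dot_pos) + softplus(dot_neg))

dot_pos = np.array([2.0, 0.5])   # made-up scores of positive instances
dot_neg = np.array([-1.0, 0.3])  # made-up scores of negative instances

loss = logistic_loss(dot_pos, dot_neg)
explicit = np.sum(-np.log(sigmoid(dot_pos)) - np.log(1.0 - sigmoid(dot_neg)))
print(loss, explicit)
```

The loss shrinks when positive dot products grow and negative dot products fall, which is the behaviour we want from training.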

Now that we have read in the data, vectorised it and created the universal schema relation extraction model, let's start training

In [34]:
# create the model / Tensorflow computation graph
dotprod_pos, dotprod_neg, diff_dotprod, placeholders = ie.create_model_f_reader(max_lens_rel, max_lens_ents, repr_dim, vocab_size_rels,
                          vocab_size_ents)

# logistic loss
loss = tf.reduce_sum(tf.nn.softplus(-dotprod_pos)+tf.nn.softplus(dotprod_neg))

# alternative: BPR loss (diff_dotprod is the positive minus the negative score, hence the negation)
#loss = tf.reduce_sum(tf.nn.softplus(-diff_dotprod))

In [35]:
data = [np.asarray(rels_train_pos), np.asarray(rels_train_neg), np.asarray(ents_train_pos), np.asarray(ents_train_neg_samp)]
data_test = [np.asarray(rels_test_pos), np.asarray(rels_test_neg_samp), np.asarray(ents_test_pos), np.asarray(ents_test_neg_samp)]

# define an optimiser. Here, we use the Adam optimiser
optimizer = tf.train.AdamOptimizer(learning_rate)
    
# training with mini-batches
batcher = tfutil.BatchBucketSampler(data, batch_size)
batcher_test = tfutil.BatchBucketSampler(data_test, 1, test=True)

In [36]:
with tf.Session() as sess:
    trainer = tfutil.Trainer(optimizer, max_epochs)
    trainer(batcher=batcher, placeholders=placeholders, loss=loss, session=sess)

    # we obtain test scores
    test_scores = trainer.test(batcher=batcher_test, placeholders=placeholders, model=tf.nn.sigmoid(dotprod_pos), session=sess)

Epoch  1 	Loss  46.5499911308
Epoch  2 	Loss  32.969190836
Epoch  3 	Loss  24.4356733561
Epoch  4 	Loss  16.987249732
Epoch  5 	Loss  12.225643754
Epoch  6 	Loss  8.93038511276
Epoch  7 	Loss  6.68729266524
Epoch  8 	Loss  4.87118721008
Epoch  9 	Loss  3.77563881874
Epoch  10 	Loss  3.15554860234
Epoch  11 	Loss  2.62767970562
Epoch  12 	Loss  2.10903429985
Epoch  13 	Loss  1.7483137697
Epoch  14 	Loss  1.47833425552
Epoch  15 	Loss  1.22717268765
Epoch  16 	Loss  1.12230950594
Epoch  17 	Loss  1.04518919438
Epoch  18 	Loss  0.888099145144
Epoch  19 	Loss  0.767774961889
Epoch  20 	Loss  0.717018354684
Epoch  21 	Loss  0.673434060067
Epoch  22 	Loss  0.581763647497
Epoch  23 	Loss  0.545647699386
Epoch  24 	Loss  0.502182915807
Epoch  25 	Loss  0.463427852839
Epoch  26 	Loss  0.414758887142
Epoch  27 	Loss  0.403060832992
Epoch  28 	Loss  0.376163519919
Epoch  29 	Loss  0.336928585544
Epoch  30 	Loss  0.309169012122


In [37]:
# show predictions
ents_test = [ie.reverse_dict_lookup(dictionary_ents_rev, e) for e in ents_test_pos]
rels_test = [ie.reverse_dict_lookup(dictionary_rels_rev, r) for r in rels_test_pos]
testresults = sorted(zip(test_scores, ents_test, rels_test), key=lambda t: t[0], reverse=True)  # sort for decreasing score

print("\nTest predictions by decreasing probability:")
for score, tup, rel in testresults:
    print('%f\t%s\tREL\t%s' % (score, " ".join(tup), " ".join(rel)))


Test predictions by decreasing probability:
0.999502	UNK optimization problem || UNK optimization problem	REL	method used for task
0.999185	genetic algorithm || optimization problem	REL	method used for task
0.999185	optimization problem || genetic algorithm	REL	method used for task
0.999185	optimization problem || genetic algorithm	REL	method used for task
0.998879	UNK search algorithm || hybrid system	REL	method used for task
0.998851	UNK optimization problem || optimal UNK UNK problem	REL	method used for task
0.998722	UNK set || topology optimization	REL	method used for task
0.998654	hybrid method || dynamic system	REL	method used for task
0.998432	genetic algorithm || UNK swarm optimization	REL	method used for task
0.998432	genetic algorithm || UNK swarm optimization	REL	method used for task
0.998432	genetic algorithm || UNK swarm optimization	REL	method used for task
0.998432	genetic algorithm || UNK swarm optimization	REL	method used for task
0.998426	hybrid method || UNK set the

0.861547	UNK model || UNK	REL	method used for task
0.861547	UNK UNK || UNK model	REL	method used for task
0.861547	UNK model || UNK	REL	method used for task
0.861547	UNK UNK || UNK model	REL	method used for task
0.861547	UNK model || UNK	REL	method used for task
0.861547	UNK model || UNK	REL	method used for task
0.861547	UNK UNK || UNK UNK model	REL	method used for task
0.861547	UNK UNK || UNK model	REL	method used for task
0.861547	model UNK || UNK	REL	method used for task
0.861547	model UNK || UNK	REL	method used for task
0.861547	UNK UNK || UNK model	REL	method used for task
0.861547	UNK model || UNK	REL	method used for task
0.861547	UNK UNK || UNK model	REL	method used for task
0.861547	UNK model || UNK	REL	method used for task
0.861547	UNK model || UNK	REL	method used for task
0.861547	UNK model || UNK	REL	method used for task
0.861547	UNK model || UNK	REL	method used for task
0.861547	UNK model || UNK	REL	method used for task
0.861547	model UNK || UNK	REL	method used for task
0.8

0.682130	UNK UNK || UNK	REL	method used for task
0.682130	UNK UNK || UNK	REL	method used for task
0.682130	UNK UNK || UNK	REL	method used for task
0.682130	UNK UNK || UNK	REL	method used for task
0.682130	UNK UNK || UNK	REL	method used for task
0.682130	UNK UNK || UNK	REL	method used for task
0.682130	UNK UNK || UNK	REL	method used for task
0.682130	UNK UNK || UNK	REL	method used for task
0.682130	UNK UNK || UNK	REL	method used for task
0.682130	UNK UNK || UNK	REL	method used for task
0.682130	UNK UNK || UNK	REL	method used for task
0.682130	UNK UNK || UNK	REL	method used for task
0.682130	UNK UNK || UNK	REL	method used for task
0.682130	UNK UNK || UNK	REL	method used for task
0.682130	UNK UNK || UNK	REL	method used for task
0.682130	UNK UNK || UNK	REL	method used for task
0.682130	UNK UNK || UNK	REL	method used for task
0.682130	UNK UNK || UNK	REL	method used for task
0.682130	UNK UNK || UNK	REL	method used for task
0.682130	UNK UNK || UNK	REL	method used for task
0.682130	UNK UNK || 

Test prediction probabilities are obtained by scoring each test instance with:

$\sigma(v_{e} \cdot a_{r})$

* Note that as input for the latent feature representation, we discarded words that only appeared twice
    * Hence, for those words we did not learn a representation, denoted here by 'UNK'
* This is also commonly done for other feature representations: if a feature is seen only once, it is difficult to learn reliable weights for it
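
The effect of 'UNK' on the scores can be reproduced in a small NumPy sketch. The vocabularies and embeddings below are hypothetical and randomly initialised rather than trained, and padding is left out for brevity; the point is only that entity pairs consisting entirely of unseen words receive the same summed representation and hence the same score.

```python
import numpy as np

rng = np.random.default_rng(0)
repr_dim = 4

# hypothetical vocabularies and random embeddings standing in for trained ones
ent_vocab = {"UNK": 0, "genetic": 1, "algorithm": 2, "||": 3}
rel_vocab = {"UNK": 0, "method": 1, "used": 2, "for": 3, "task": 4}
ent_emb = rng.uniform(-0.1, 0.1, size=(len(ent_vocab), repr_dim))
rel_emb = rng.uniform(-0.1, 0.1, size=(len(rel_vocab), repr_dim))

def encode(tokens, vocab, emb):
    """Sum the word vectors of a token sequence; unseen words map to 'UNK'."""
    ids = [vocab.get(tok, vocab["UNK"]) for tok in tokens]
    return emb[ids].sum(axis=0)

def score(ent_tokens, rel_tokens):
    """sigma(v_e . a_r): compatibility of an entity pair with a relation."""
    v_e = encode(ent_tokens, ent_vocab, ent_emb)
    a_r = encode(rel_tokens, rel_vocab, rel_emb)
    return 1.0 / (1.0 + np.exp(-(v_e @ a_r)))

rel = "method used for task".split(" ")
s1 = score("fuzzy logic || control".split(" "), rel)          # UNK UNK || UNK
s2 = score("neural network || forecasting".split(" "), rel)   # UNK UNK || UNK
print(s1, s2)
```

Both entity pairs are reduced to 'UNK UNK || UNK', so `s1` and `s2` are identical. This mirrors the long runs of equal scores in the test predictions above.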

**Thought Exercises**: 
* The scores shown here are for the relation 'method used for task'. However, we could also use our model to score the compatibility of entity pairs with other relations, e.g. 'demonstrates XXXXX for XXXXXX'. How could this be done here?
* How could we get around the problem of unseen words, as described above?
* What other possible problems can you see with the above formulation of universal schema relation extraction?
* What possible problems can you see with using latent word representations?

## Background Material

* Jurafsky, Dan and Martin, James H. (2016). Speech and Language Processing, Chapter 21 (Information Extraction): https://web.stanford.edu/~jurafsky/slp3/21.pdf

* Riedel, Sebastian and Yao, Limin and McCallum, Andrew and Marlin, Benjamin M. (2013). Extraction with Matrix Factorization and Universal Schemas. Proceedings of NAACL.  http://www.aclweb.org/anthology/N13-1008