{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Homework 3: Relation extraction using distant supervision" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "__author__ = \"Bill MacCartney\"\n", "__version__ = \"CS224U, Stanford, Spring 2019\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contents\n", "\n", "1. [Overview](#Overview)\n", "1. [Set-up](#Set-up)\n", "1. [Baseline](#Baseline)\n", "1. [Homework questions](#Homework-questions)\n", " 1. [Different model factory [1 point]](#Different-model-factory-[1-point])\n", " 1. [Directional unigram features [2 points]](#Directional-unigram-features-[2-points])\n", " 1. [The part-of-speech tags of the \"middle\" words [2 points]](#The-part-of-speech-tags-of-the-\"middle\"-words-[2-points])\n", " 1. [Your original system [4 points]](#Your-original-system-[4-points])\n", "1. [Bake-off [1 point]](#Bake-off-[1-point])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "\n", "This homework and associated bake-off are devoted to developing really effective relation extraction systems using distant supervision. \n", "\n", "As with the previous assignments, this notebook first establishes a baseline system. The initial homework questions ask you to create additional baselines and suggest areas for innovation, and the final homework question asks you to develop an original system to enter into the bake-off." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set-up\n", "\n", "See [the first notebook in this unit](rel_ext_01_task.ipynb#Set-up) for set-up instructions." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import os\n", "import rel_ext\n", "from sklearn.linear_model import LogisticRegression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As usual, we unite our corpus and KB into a dataset, and create some splits for experimentation:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "rel_ext_data_home = os.path.join('data', 'rel_ext_data')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "corpus = rel_ext.Corpus(os.path.join(rel_ext_data_home, 'corpus.tsv.gz'))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "kb = rel_ext.KB(os.path.join(rel_ext_data_home, 'kb.tsv.gz'))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "dataset = rel_ext.Dataset(corpus, kb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You are not wedded to this set-up for splits. 
The bake-off will be conducted on a previously unseen test-set, so all of the data in `dataset` is fair game:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "splits = dataset.build_splits(\n", " split_names=['tiny', 'train', 'dev'],\n", " split_fracs=[0.01, 0.79, 0.20],\n", " seed=1)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'tiny': Corpus with 3,474 examples; KB with 445 triples,\n", " 'train': Corpus with 263,285 examples; KB with 36,191 triples,\n", " 'dev': Corpus with 64,937 examples; KB with 9,248 triples,\n", " 'all': Corpus with 331,696 examples; KB with 45,884 triples}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "splits" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Baseline" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def simple_bag_of_words_featurizer(kbt, corpus, feature_counter):\n", " for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):\n", " for word in ex.middle.split(' '):\n", " feature_counter[word] += 1\n", " for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):\n", " for word in ex.middle.split(' '):\n", " feature_counter[word] += 1\n", " return feature_counter" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "featurizers = [simple_bag_of_words_featurizer]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "model_factory = lambda: LogisticRegression(fit_intercept=True, solver='liblinear')" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "relation precision recall f-score support size\n", "------------------ --------- --------- --------- --------- ---------\n", "adjoins 0.853 0.391 0.690 340 5716\n", "author 0.790 0.532 0.720 
509 5885\n", "capital 0.562 0.189 0.404 95 5471\n", "contains 0.793 0.596 0.744 3904 9280\n", "film_performance 0.781 0.563 0.725 766 6142\n", "founders 0.779 0.400 0.655 380 5756\n", "genre 0.686 0.141 0.387 170 5546\n", "has_sibling 0.878 0.230 0.562 499 5875\n", "has_spouse 0.853 0.323 0.643 594 5970\n", "is_a 0.689 0.249 0.509 497 5873\n", "nationality 0.571 0.186 0.404 301 5677\n", "parents 0.871 0.542 0.777 312 5688\n", "place_of_birth 0.686 0.206 0.468 233 5609\n", "place_of_death 0.459 0.107 0.277 159 5535\n", "profession 0.570 0.215 0.428 247 5623\n", "worked_at 0.693 0.252 0.513 242 5618\n", "------------------ --------- --------- --------- --------- ---------\n", "macro-average 0.720 0.320 0.557 9248 95264\n" ] } ], "source": [ "baseline_results = rel_ext.experiment(\n", " splits,\n", " train_split='train',\n", " test_split='dev',\n", " featurizers=featurizers,\n", " model_factory=model_factory,\n", " verbose=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Studying model weights might yield insights:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Highest and lowest feature weights for relation adjoins:\n", "\n", " 2.552 Córdoba\n", " 2.465 Taluks\n", " 2.426 Valais\n", " ..... .....\n", " -1.175 based\n", " -1.287 other\n", " -1.431 America\n", "\n", "Highest and lowest feature weights for relation author:\n", "\n", " 2.791 author\n", " 2.397 wrote\n", " 2.270 by\n", " ..... .....\n", " -1.978 or\n", " -2.115 directed\n", " -7.687 dystopian\n", "\n", "Highest and lowest feature weights for relation capital:\n", "\n", " 3.250 capital\n", " 1.713 city\n", " 1.552 posted\n", " ..... .....\n", " -1.343 and\n", " -2.718 Province\n", " -2.725 Isfahan\n", "\n", "Highest and lowest feature weights for relation contains:\n", "\n", " 2.547 southwestern\n", " 2.040 borders\n", " 2.007 affiliated\n", " ..... 
.....\n", " -2.419 2002\n", " -2.778 Isfahan\n", " -2.875 band\n", "\n", "Highest and lowest feature weights for relation film_performance:\n", "\n", " 4.198 starring\n", " 3.842 co-starring\n", " 3.283 movie\n", " ..... .....\n", " -2.220 Anjaani\n", " -2.220 Anjaana\n", " -3.996 double\n", "\n", "Highest and lowest feature weights for relation founders:\n", "\n", " 3.874 founded\n", " 3.645 founder\n", " 3.203 co-founder\n", " ..... .....\n", " -1.411 series\n", " -1.477 state\n", " -1.892 band\n", "\n", "Highest and lowest feature weights for relation genre:\n", "\n", " 2.804 series\n", " 2.574 album\n", " 2.405 movie\n", " ..... .....\n", " -1.482 ;\n", " -1.766 at\n", " -2.142 follows\n", "\n", "Highest and lowest feature weights for relation has_sibling:\n", "\n", " 5.312 brother\n", " 4.030 sister\n", " 3.043 nephew\n", " ..... .....\n", " -1.285 engineer\n", " -1.290 from\n", " -1.357 Jacob\n", "\n", "Highest and lowest feature weights for relation has_spouse:\n", "\n", " 5.395 wife\n", " 4.599 husband\n", " 4.389 widow\n", " ..... .....\n", " -1.350 on\n", " -1.625 engineer\n", " -1.646 Terri\n", "\n", "Highest and lowest feature weights for relation is_a:\n", "\n", " 2.940 family\n", " 2.882 genus\n", " 2.585 \n", " ..... .....\n", " -1.647 kamut\n", " -1.679 on\n", " -2.949 hibiscus\n", "\n", "Highest and lowest feature weights for relation nationality:\n", "\n", " 2.661 born\n", " 1.974 president\n", " 1.966 caliph\n", " ..... .....\n", " -1.344 state\n", " -1.395 and\n", " -1.677 American\n", "\n", "Highest and lowest feature weights for relation parents:\n", "\n", " 5.282 son\n", " 4.750 daughter\n", " 4.418 father\n", " ..... .....\n", " -1.737 Jacob\n", " -1.980 Jahangir\n", " -2.552 Kelly\n", "\n", "Highest and lowest feature weights for relation place_of_birth:\n", "\n", " 3.729 born\n", " 3.125 birthplace\n", " 2.820 mayor\n", " ..... 
.....\n", " -1.390 or\n", " -1.523 and\n", " -2.121 Oldham\n", "\n", "Highest and lowest feature weights for relation place_of_death:\n", "\n", " 2.826 died\n", " 1.934 under\n", " 1.870 where\n", " ..... .....\n", " -1.294 that\n", " -1.301 and\n", " -1.444 Siege\n", "\n", "Highest and lowest feature weights for relation profession:\n", "\n", " 3.103 \n", " 2.394 American\n", " 2.391 philosopher\n", " ..... .....\n", " -1.417 York\n", " -1.713 elder\n", " -2.205 on\n", "\n", "Highest and lowest feature weights for relation worked_at:\n", "\n", " 3.318 professor\n", " 3.124 president\n", " 2.824 CEO\n", " ..... .....\n", " -1.201 then-associate\n", " -1.254 NASA\n", " -1.650 or\n", "\n" ] } ], "source": [ "rel_ext.examine_model_weights(baseline_results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Homework questions\n", "\n", "Please embed your homework responses in this notebook, and do not delete any cells from the notebook. (You are free to add as many cells as you like as part of your responses.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Different model factory [1 point]\n", "\n", "The code in `rel_ext` makes it very easy to experiment with other classifier models: one need only redefine the `model_factory` argument. This question asks you to assess a [Support Vector Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).\n", "\n", "__To submit:__ A call to `rel_ext.experiment` training on the `train` part of `splits` and assessing on its `dev` part, with `featurizers` as defined above in this notebook and the `model_factory` set to one based on an `SVC` with `kernel='linear'` and all other arguments left at their default values." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Directional unigram features [2 points]\n", "\n", "The current bag-of-words representation makes no distinction between \"forward\" and \"reverse\" examples. But, intuitively, there is a big difference between _X and his son Y_ and _Y and his son X_. This question asks you to modify `simple_bag_of_words_featurizer` to capture these differences. \n", "\n", "__To submit:__\n", "\n", "1. A feature function `directional_bag_of_words_featurizer` that is just like `simple_bag_of_words_featurizer` except that it distinguishes \"forward\" and \"reverse\". To do this, you just need to mark each word feature for whether it is derived from a subject–object example or from an object–subject example. The precise nature of the mark you add for the two cases doesn't make a difference to the model.\n", "\n", "2. The macro-average F-score on the `dev` set that you obtain from running `rel_ext.experiment` with `directional_bag_of_words_featurizer` as the only featurizer. (Aside from this, use all the default values for `experiment` as exemplified above in this notebook.)\n", "\n", "3. `rel_ext.experiment` returns some of the core objects used in the experiment. How many feature names does the `vectorizer` have for the experiment run in the previous step? (Note: we're partly asking you to figure out how to get this value by using the sklearn documentation, so please don't ask how to do it on Piazza!)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The part-of-speech tags of the \"middle\" words [2 points]\n", "\n", "Our corpus distribution contains part-of-speech (POS) tagged versions of the core text spans. 
Let's begin to explore whether there is information in these sequences, focusing on `middle_POS`.\n", "\n", "__To submit:__\n", "\n", "1. A feature function `middle_bigram_pos_tag_featurizer` that is just like `simple_bag_of_words_featurizer` except that it creates a feature for bigram POS sequences. For example, given \n", "\n", " `The/DT dog/N napped/V`\n", " \n", " we obtain the list of bigram POS sequences\n", " \n", " `b = ['<s> DT', 'DT N', 'N V', 'V </s>']`. \n", " \n", " Of course, `middle_bigram_pos_tag_featurizer` should return count dictionaries defined in terms of such bigram POS lists, on the model of `simple_bag_of_words_featurizer`.\n", " \n", " Don't forget the start and end tags, to model those environments properly!\n", "\n", "2. The macro-average F-score on the `dev` set that you obtain from running `rel_ext.experiment` with `middle_bigram_pos_tag_featurizer` as the only featurizer. (Aside from this, use all the default values for `experiment` as exemplified above in this notebook.)\n", "\n", "Note: To parse `middle_POS`, one splits on whitespace to get the `word/TAG` pairs. Each of these pairs `s` can be parsed with `s.rsplit('/', 1)`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Your original system [4 points]\n", "\n", "There are many options, and this could easily grow into a project. Here are a few ideas:\n", "\n", "- Try out different classifier models, from `sklearn` and elsewhere.\n", "- Add a feature that indicates the length of the middle.\n", "- Augment the bag-of-words representation to include bigrams or trigrams (not just unigrams).\n", "- Introduce features based on the entity mentions themselves. 
\n", "- Experiment with features based on the context outside (rather than between) the two entity mentions — that is, the words before the first mention, or after the second.\n", "- Try adding features which capture syntactic information, such as the dependency-path features used by Mintz et al. 2009. The [NLTK](https://www.nltk.org/) toolkit contains a variety of [parsing algorithms](http://www.nltk.org/api/nltk.parse.html) that may help.\n", "- The bag-of-words representation does not permit generalization across word categories such as names of people, places, or companies. Can we do better using word embeddings such as [GloVe](https://nlp.stanford.edu/projects/glove/)?\n", "- Consider adding features based on WordNet synsets. Here's a little code to get you started with that:\n", " ```\n", " from nltk.corpus import wordnet as wn\n", " dog_compatible_synsets = wn.synsets('dog', pos='n')\n", " ```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bake-off [1 point]\n", "\n", "For the bake-off, we will release a test set right after class on April 29. The announcement will go out on Piazza. You will evaluate your custom model from the previous question on these new datasets using the function `rel_ext.bake_off_experiment`. Rules:\n", "\n", "1. Only one evaluation is permitted.\n", "1. No additional system tuning is permitted once the bake-off has started.\n", "\n", "To enter the bake-off, upload this notebook on Canvas:\n", "\n", "https://canvas.stanford.edu/courses/99711/assignments/187248\n", "\n", "The cells below this one constitute your bake-off entry.\n", "\n", "People who enter will receive the additional homework point, and people whose systems achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.\n", "\n", "The bake-off will close at 4:30 pm on May 1. 
Late entries will be accepted, but they cannot earn the extra 0.5 points. Similarly, you cannot win the bake-off unless your homework is submitted on time." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# Enter your bake-off assessment code in this cell. \n", "# Please do not remove this comment.\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# On an otherwise blank line in this cell, please enter\n", "# your macro-average f-score (an F_0.5 score) as reported \n", "# by the code above. Please enter only a number between \n", "# 0 and 1 inclusive. Please do not remove this comment.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" }, "widgets": { "state": {}, "version": "1.1.2" } }, "nbformat": 4, "nbformat_minor": 2 }