{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Natural language inference: task and datasets" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "__author__ = \"Christopher Potts\"\n", "__version__ = \"CS224u, Stanford, Spring 2020\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contents\n", "\n", "1. [Overview](#Overview)\n", "1. [Our version of the task](#Our-version-of-the-task)\n", "1. [Primary resources](#Primary-resources)\n", "1. [Set-up](#Set-up)\n", "1. [SNLI](#SNLI)\n", " 1. [SNLI properties](#SNLI-properties)\n", " 1. [Working with SNLI](#Working-with-SNLI)\n", "1. [MultiNLI](#MultiNLI)\n", " 1. [MultiNLI properties](#MultiNLI-properties)\n", " 1. [Working with MultiNLI](#Working-with-MultiNLI)\n", " 1. [Annotated MultiNLI subsets](#Annotated-MultiNLI-subsets)\n", "1. [Adversarial NLI](#Adversarial-NLI)\n", " 1. [Adversarial NLI properties](#Adversarial-NLI-properties)\n", " 1. [Working with Adversarial NLI](#Working-with-Adversarial-NLI)\n", "1. [Other NLI datasets](#Other-NLI-datasets)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "\n", "Natural Language Inference (NLI) is the task of predicting the logical relationships between words, phrases, sentences, (paragraphs, documents, ...). Such relationships are crucial for all kinds of reasoning in natural language: arguing, debating, problem solving, summarization, and so forth.\n", "\n", "[Dagan et al. (2006)](https://u.cs.biu.ac.il/~nlp/RTE1/Proceedings/dagan_et_al.pdf), one of the foundational papers on NLI (also called Recognizing Textual Entailment; RTE), make a case for the generality of this task in NLU:\n", "\n", "> It seems that major inferences, as needed by multiple applications, can indeed be cast in terms of textual entailment. For example, __a QA system__ has to identify texts that entail a hypothesized answer. [...] 
Similarly, for certain __Information Retrieval__ queries the combination of semantic concepts and relations denoted by the query should be entailed from relevant retrieved documents. [...] In __multi-document summarization__ a redundant sentence, to be omitted from the summary, should be entailed from other sentences in the summary. And in __MT evaluation__ a correct translation should be semantically equivalent to the gold standard translation, and thus both translations should entail each other. Consequently, we hypothesize that textual entailment recognition is a suitable generic task for evaluating and comparing applied semantic inference models. Eventually, such efforts can promote the development of entailment recognition \"engines\" which may provide useful generic modules across applications." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Our version of the task\n", "\n", "Our NLI data will look like this:\n", "\n", "| Premise | Relation | Hypothesis |\n", "|---------|---------------|------------|\n", "| turtle | contradicts | linguist |\n", "| A turtle danced | entails | A turtle moved |\n", "| Every reptile danced | entails | Every turtle moved |\n", "| Some turtles walk | contradicts | No turtles move |\n", "| James Byron Dean refused to move without blue jeans | entails | James Dean didn't dance without pants |\n", "\n", "In the [word-entailment bakeoff](hw_wordentail.ipynb), we looked at a special case of this where the premise and hypothesis are single words. This notebook begins to introduce the problem of NLI more fully." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Primary resources\n", "\n", "We're going to focus on three NLI corpora:\n", "\n", "* [The Stanford Natural Language Inference corpus (SNLI)](https://nlp.stanford.edu/projects/snli/)\n", "* [The Multi-Genre NLI Corpus (MultiNLI)](https://www.nyu.edu/projects/bowman/multinli/)\n", "* [The Adversarial NLI Corpus (ANLI)](https://github.com/facebookresearch/anli)\n", "\n", "The first was collected by a group at Stanford, led by [Sam Bowman](https://www.nyu.edu/projects/bowman/), and the second was collected by a group at NYU, also led by [Sam Bowman](https://www.nyu.edu/projects/bowman/). Both have the same format and were crowdsourced using the same basic methods. However, SNLI is entirely focused on image captions, whereas MultiNLI includes a greater range of contexts.\n", "\n", "The third corpus was collected by a group at Facebook AI and UNC Chapel Hill. The team's goal was to address the fact that datasets like SNLI and MultiNLI seem to be artificially easy – models trained on them can often surpass stated human performance levels but still fail on examples that are simple and intuitive for people. The dataset is \"Adversarial\" because the annotators were asked to try to construct examples that fooled strong models but still passed muster with other human readers.\n", "\n", "This notebook presents tools for working with these corpora. The [second notebook in the unit](nli_02_models.ipynb) concerns models of NLI." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set-up\n", "\n", "* As usual, you need to be fully set up to work with [the CS224u repository](https://github.com/cgpotts/cs224u/).\n", "\n", "* If you haven't already, download [the course data](http://web.stanford.edu/class/cs224u/data/data.tgz), unpack it, and place it in the directory containing the course repository – the same directory as this notebook. 
(If you want to put it somewhere else, change `DATA_HOME` below.)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import nli\n", "import os\n", "import pandas as pd\n", "import random" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "DATA_HOME = os.path.join(\"data\", \"nlidata\")\n", "\n", "SNLI_HOME = os.path.join(DATA_HOME, \"snli_1.0\")\n", "\n", "MULTINLI_HOME = os.path.join(DATA_HOME, \"multinli_1.0\")\n", "\n", "ANNOTATIONS_HOME = os.path.join(DATA_HOME, \"multinli_1.0_annotations\")\n", "\n", "ANLI_HOME = os.path.join(DATA_HOME, \"anli_v0.1\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SNLI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SNLI properties" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For SNLI (and MultiNLI), MTurk annotators were presented with premise sentences and asked to produce new sentences that entailed, contradicted, or were neutral with respect to the premise. A subset of the examples were then validated by an additional four MTurk annotators." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* All the premises are captions from the [Flickr30K corpus](http://shannon.cs.illinois.edu/DenotationGraph/).\n", "\n", "\n", "* Some of the sentences rather depressingly reflect stereotypes ([Rudinger et al. 
2017](https://aclanthology.coli.uni-saarland.de/papers/W17-1609/w17-1609)).\n", "\n", "\n", "* 550,152 train examples; 10K dev; 10K test\n", "\n", "\n", "* Mean length in tokens:\n", " * Premise: 14.1\n", " * Hypothesis: 8.3\n", "\n", "* Clause-types\n", " * Premise S-rooted: 74%\n", " * Hypothesis S-rooted: 88.9%\n", "\n", "\n", "* Vocab size: 37,026\n", "\n", "\n", "* 56,951 examples validated by four additional annotators\n", " * 58.3% examples with unanimous gold label\n", " * 91.2% of gold labels match the author's label\n", " * 0.70 overall Fleiss kappa\n", "\n", "\n", "* Top scores currently around 90%. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Working with SNLI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following readers should make it easy to work with SNLI:\n", " \n", "* `nli.SNLITrainReader`\n", "* `nli.SNLIDevReader`\n", "\n", "Writing a `Test` reader is easy and so left to the user who decides that a test-set evaluation is appropriate. We omit that code as a subtle way of discouraging use of the test set during project development.\n", "\n", "The base class, `nli.NLIReader`, is used by all the readers discussed here.\n", "\n", "Because the datasets are so large, it is often useful to be able to randomly sample from them. All of the reader classes discussed here support this with their keyword argument `samp_percentage`. 
For example, the following samples approximately 10% of the examples from the SNLI training set:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"NLIReader({'src_filename': 'data/nlidata/snli_1.0/snli_1.0_train.jsonl', 'filter_unlabeled': True, 'samp_percentage': 0.1, 'random_state': 42, 'gold_label_attr_name': 'gold_label'})" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nli.SNLITrainReader(SNLI_HOME, samp_percentage=0.10, random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The precise number of examples will vary somewhat because of the way the sampling is done. (Here, we trade efficiency for precision in the number of cases we return; see the implementation for details.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of the readers have a `read` method that yields `NLIExample` example instances. For SNLI, these have the following attributes:\n", "\n", "* __annotator_labels__: `list of str`\n", "* __captionID__: `str`\n", "* __gold_label__: `str`\n", "* __pairID__: `str`\n", "* __sentence1__: `str`\n", "* __sentence1_binary_parse__: `nltk.tree.Tree`\n", "* __sentence1_parse__: `nltk.tree.Tree`\n", "* __sentence2__: `str`\n", "* __sentence2_binary_parse__: `nltk.tree.Tree`\n", "* __sentence2_parse__: `nltk.tree.Tree`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following creates the label distribution for the training data:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "entailment 183416\n", "contradiction 183187\n", "neutral 182764\n", "- 785\n", "dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "snli_labels = pd.Series(\n", " [ex.gold_label for ex in nli.SNLITrainReader(\n", " SNLI_HOME, filter_unlabeled=False).read()])\n", "\n", "snli_labels.value_counts()" ] }, { 
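"cell_type": "markdown", "metadata": {}, "source": [
"The `-` label marks examples for which the annotators reached no majority. The following is a minimal sketch of that rule, assuming (as described in the SNLI documentation) that the gold label is the one chosen by a majority of the annotators. The function name `majority_gold_label` is hypothetical and is not part of `nli.py`:"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"from collections import Counter\n",
"\n",
"def majority_gold_label(annotator_labels, no_majority='-'):\n",
"    # Return the label chosen by a strict majority of the annotators,\n",
"    # or `no_majority` if no label has more than half of the votes.\n",
"    if not annotator_labels:\n",
"        return no_majority\n",
"    label, count = Counter(annotator_labels).most_common(1)[0]\n",
"    return label if count > len(annotator_labels) / 2 else no_majority\n",
"\n",
"# Hypothetical votes: three of the five annotators agree on 'neutral'.\n",
"majority_gold_label(['neutral', 'neutral', 'entailment', 'neutral', 'contradiction'])"
] }, {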
"cell_type": "markdown", "metadata": {}, "source": [ "Use `filter_unlabeled=True` (the default) to silently drop the examples for which `gold_label` is `-`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at a specific example in some detail:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "snli_iterator = iter(nli.SNLITrainReader(SNLI_HOME).read())" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "snli_ex = next(snli_iterator)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\"NLIExample({'annotator_labels': ['neutral'], 'captionID': '3416050480.jpg#4', 'gold_label': 'neutral', 'pairID': '3416050480.jpg#4r1n', 'sentence1': 'A person on a horse jumps over a broken down airplane.', 'sentence1_binary_parse': Tree('X', [Tree('X', [Tree('X', ['A', 'person']), Tree('X', ['on', Tree('X', ['a', 'horse'])])]), Tree('X', [Tree('X', ['jumps', Tree('X', ['over', Tree('X', ['a', Tree('X', ['broken', Tree('X', ['down', 'airplane'])])])])]), '.'])]), 'sentence1_parse': Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['A']), Tree('NN', ['person'])]), Tree('PP', [Tree('IN', ['on']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['horse'])])])]), Tree('VP', [Tree('VBZ', ['jumps']), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['broken']), Tree('JJ', ['down']), Tree('NN', ['airplane'])])])]), Tree('.', ['.'])])]), 'sentence2': 'A person is training his horse for a competition.', 'sentence2_binary_parse': Tree('X', [Tree('X', ['A', 'person']), Tree('X', [Tree('X', ['is', Tree('X', [Tree('X', ['training', Tree('X', ['his', 'horse'])]), Tree('X', ['for', Tree('X', ['a', 'competition'])])])]), '.'])]), 'sentence2_parse': Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['A']), Tree('NN', ['person'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('VP', 
[Tree('VBG', ['training']), Tree('NP', [Tree('PRP$', ['his']), Tree('NN', ['horse'])]), Tree('PP', [Tree('IN', ['for']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['competition'])])])])]), Tree('.', ['.'])])])})\n" ] } ], "source": [ "print(snli_ex)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"NLIExample({'annotator_labels': ['neutral'], 'captionID': '3416050480.jpg#4', 'gold_label': 'neutral', 'pairID': '3416050480.jpg#4r1n', 'sentence1': 'A person on a horse jumps over a broken down airplane.', 'sentence1_binary_parse': Tree('X', [Tree('X', [Tree('X', ['A', 'person']), Tree('X', ['on', Tree('X', ['a', 'horse'])])]), Tree('X', [Tree('X', ['jumps', Tree('X', ['over', Tree('X', ['a', Tree('X', ['broken', Tree('X', ['down', 'airplane'])])])])]), '.'])]), 'sentence1_parse': Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['A']), Tree('NN', ['person'])]), Tree('PP', [Tree('IN', ['on']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['horse'])])])]), Tree('VP', [Tree('VBZ', ['jumps']), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['broken']), Tree('JJ', ['down']), Tree('NN', ['airplane'])])])]), Tree('.', ['.'])])]), 'sentence2': 'A person is training his horse for a competition.', 'sentence2_binary_parse': Tree('X', [Tree('X', ['A', 'person']), Tree('X', [Tree('X', ['is', Tree('X', [Tree('X', ['training', Tree('X', ['his', 'horse'])]), Tree('X', ['for', Tree('X', ['a', 'competition'])])])]), '.'])]), 'sentence2_parse': Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['A']), Tree('NN', ['person'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('VP', [Tree('VBG', ['training']), Tree('NP', [Tree('PRP$', ['his']), Tree('NN', ['horse'])]), Tree('PP', [Tree('IN', ['for']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['competition'])])])])]), Tree('.', ['.'])])])})" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "snli_ex" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "As you can see from the above attribute list, there are __three versions__ of the premise and hypothesis sentences:\n", "\n", "1. Regular string representations of the data\n", "1. Unlabeled binary parses \n", "1. Labeled parses" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'A person on a horse jumps over a broken down airplane.'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "snli_ex.sentence1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The binary parses lack node labels; so that we can use `nltk.tree.Tree` with them, the label `X` is added to all of them:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "Tree('X', [Tree('X', [Tree('X', ['A', 'person']), Tree('X', ['on', Tree('X', ['a', 'horse'])])]), Tree('X', [Tree('X', ['jumps', Tree('X', ['over', Tree('X', ['a', Tree('X', ['broken', Tree('X', ['down', 'airplane'])])])])]), '.'])])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "snli_ex.sentence1_binary_parse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's the full parse tree with syntactic categories:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['A']), Tree('NN', ['person'])]), Tree('PP', [Tree('IN', ['on']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['horse'])])])]), Tree('VP', [Tree('VBZ', ['jumps']), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['broken']), Tree('JJ', ['down']), Tree('NN', ['airplane'])])])]), Tree('.', ['.'])])])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "snli_ex.sentence1_parse" ] }, { 
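"cell_type": "markdown", "metadata": {}, "source": [
"To make the relationship between the string and tree representations concrete, here is a minimal sketch of how an unlabeled binary parse could be turned into an `nltk.tree.Tree` by adding the label `X` to every node. It assumes the raw parses are bracketed strings like the toy one below; the helper name `binary_str2tree` is hypothetical, and `nli.py` may do this differently:"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"from nltk.tree import Tree\n",
"\n",
"def binary_str2tree(s):\n",
"    # Give every opening bracket the dummy label X so that\n",
"    # Tree.fromstring can parse the otherwise unlabeled structure.\n",
"    return Tree.fromstring(s.replace('(', '(X '))\n",
"\n",
"# Toy input in the assumed bracketed format:\n",
"toy_tree = binary_str2tree('( ( A person ) ( on ( a horse ) ) )')\n",
"\n",
"toy_tree.leaves()"
] }, {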
"cell_type": "markdown", "metadata": {}, "source": [ "The leaves of either tree give a tokenized version of the sentence:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['A',\n", " 'person',\n", " 'on',\n", " 'a',\n", " 'horse',\n", " 'jumps',\n", " 'over',\n", " 'a',\n", " 'broken',\n", " 'down',\n", " 'airplane',\n", " '.']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "snli_ex.sentence1_parse.leaves()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## MultiNLI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### MultiNLI properties\n", "\n", "\n", "* Train premises drawn from five genres: \n", " 1. Fiction: works from 1912–2010 spanning many genres\n", " 1. Government: reports, letters, speeches, etc., from government websites\n", " 1. The _Slate_ website\n", " 1. Telephone: the Switchboard corpus\n", " 1. Travel: Berlitz travel guides\n", "\n", "\n", "* Additional genres just for dev and test (the __mismatched__ condition): \n", " 1. The 9/11 report\n", " 1. Face-to-face: The Charlotte Narrative and Conversation Collection\n", " 1. Fundraising letters\n", " 1. Non-fiction from Oxford University Press\n", " 1. _Verbatim_ articles about linguistics\n", "\n", "\n", "* 392,702 train examples; 20K dev; 20K test\n", "\n", "\n", "* 19,647 examples validated by four additional annotators\n", " * 58.2% of examples with unanimous gold label\n", " * 92.6% of gold labels match the author's label\n", "\n", "\n", "* Test-set labels available as a Kaggle competition. \n", "\n", " * Top matched scores currently around 0.81.\n", " * Top mismatched scores currently around 0.83." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Working with MultiNLI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For MultiNLI, we have the following readers: \n", "\n", "* `nli.MultiNLITrainReader`\n", "* `nli.MultiNLIMatchedDevReader`\n", "* `nli.MultiNLIMismatchedDevReader`\n", "\n", "The MultiNLI test sets are available on Kaggle ([matched version](https://www.kaggle.com/c/multinli-matched-open-evaluation) and [mismatched version](https://www.kaggle.com/c/multinli-mismatched-open-evaluation))." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The interface to these is the same as for the SNLI readers:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"NLIReader({'src_filename': 'data/nlidata/multinli_1.0/multinli_1.0_train.jsonl', 'filter_unlabeled': True, 'samp_percentage': 0.1, 'random_state': 42, 'gold_label_attr_name': 'gold_label'})" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nli.MultiNLITrainReader(MULTINLI_HOME, samp_percentage=0.10, random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `NLIExample` instances for MultiNLI have the same attributes as those for SNLI. 
Here is the list repeated from above for convenience:\n", "\n", "* __annotator_labels__: `list of str`\n", "* __captionID__: `str`\n", "* __gold_label__: `str`\n", "* __pairID__: `str`\n", "* __sentence1__: `str`\n", "* __sentence1_binary_parse__: `nltk.tree.Tree`\n", "* __sentence1_parse__: `nltk.tree.Tree`\n", "* __sentence2__: `str`\n", "* __sentence2_binary_parse__: `nltk.tree.Tree`\n", "* __sentence2_parse__: `nltk.tree.Tree`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The full label distribution:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "contradiction 130903\n", "neutral 130900\n", "entailment 130899\n", "dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "multinli_labels = pd.Series(\n", " [ex.gold_label for ex in nli.MultiNLITrainReader(\n", " MULTINLI_HOME, filter_unlabeled=False).read()])\n", "\n", "multinli_labels.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No examples in the MultiNLI train set lack a gold label, so the value of the `filter_unlabeled` parameter has no effect here, but it does have an effect in the `Dev` versions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Annotated MultiNLI subsets\n", "\n", "MultiNLI includes additional annotations for a subset of the dev examples. The goal is to help people understand how well their models are doing on crucial NLI-related linguistic phenomena." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "matched_ann_filename = os.path.join(\n", " ANNOTATIONS_HOME,\n", " \"multinli_1.0_matched_annotations.txt\")\n", "\n", "mismatched_ann_filename = os.path.join(\n", " ANNOTATIONS_HOME, \n", " \"multinli_1.0_mismatched_annotations.txt\")" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def view_random_example(annotations, random_state=42):\n", " random.seed(random_state)\n", " ann_ex = random.choice(list(annotations.items()))\n", " pairid, ann_ex = ann_ex\n", " ex = ann_ex['example'] \n", " print(\"pairID: {}\".format(pairid))\n", " print(ann_ex['annotations'])\n", " print(ex.sentence1)\n", " print(ex.gold_label)\n", " print(ex.sentence2)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "matched_ann = nli.read_annotated_subset(matched_ann_filename, MULTINLI_HOME)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pairID: 63218c\n", "[]\n", "Recently, however, I have settled down and become decidedly less experimental.\n", "contradiction\n", "I am still as experimental as ever, and I am always on the move.\n" ] } ], "source": [ "view_random_example(matched_ann)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Adversarial NLI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adversarial NLI properties\n", "\n", "The ANLI dataset was created in response to evidence that datasets like SNLI and MultiNLI are artificially easy for modern machine learning models to solve. The team sought to tackle this weakness head-on, by designing a crowdsourcing task in which annotators were explicitly trying to confuse state-of-the-art models. In broad outline, the task worked like this:\n", "\n", "1. 
The crowdworker is presented with a premise (context) text and asked to construct a hypothesis sentence that entails, contradicts, or is neutral with respect to that premise. (The precise wording is more informal, along the lines of the SNLI/MultiNLI task.)\n", "\n", "1. The crowdworker submits a hypothesis text.\n", "\n", "1. The premise/hypothesis pair is fed to a trained model that makes a prediction about the correct NLI label.\n", "\n", "1. If the model's prediction is correct, then the crowdworker loops back to step 2 to try again. If the model's prediction is incorrect, then the example is validated by different crowdworkers.\n", "\n", "The dataset consists of three rounds, each involving a different model and a different set of sources for the premise texts:\n", "\n", "| Round | Model | Training data | Context sources | \n", "|:------:|:------------|:---------------------------|:-----------------|\n", "| 1 | [BERT-large](https://www.aclweb.org/anthology/N19-1423/) | SNLI + MultiNLI | Wikipedia |\n", "| 2 | [RoBERTa](https://arxiv.org/abs/1907.11692) | SNLI + MultiNLI + [NLI-FEVER](https://github.com/easonnie/combine-FEVER-NSMN/blob/master/other_resources/nli_fever.md) + Round 1 | Wikipedia |\n", "| 3 | [RoBERTa](https://arxiv.org/abs/1907.11692) | SNLI + MultiNLI + [NLI-FEVER](https://github.com/easonnie/combine-FEVER-NSMN/blob/master/other_resources/nli_fever.md) + Round 1 | Various |\n", "\n", "Each round has train/dev/test splits. The sizes of these splits and their label distributions are calculated just below.\n", "\n", "The [project README](https://github.com/facebookresearch/anli/blob/master/README.md) seeks to establish some rules for how the rounds can be used for training and evaluation." 
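] }, { "cell_type": "markdown", "metadata": {}, "source": [
"The collection loop described above can be sketched as follows. This is only an illustration of steps 1 to 4, not the ANLI team's code: the functions `write_hypothesis` and `model_predict` are hypothetical stand-ins for the crowdworker and the trained model, and the `max_tries` cap is an assumption added to keep the toy loop finite:"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"def collect_adversarial_example(\n",
"        premise, target_label, write_hypothesis, model_predict, max_tries=5):\n",
"    # Repeat steps 2-4: get a hypothesis, check the model's prediction,\n",
"    # and stop as soon as the model is fooled.\n",
"    for attempt in range(1, max_tries + 1):\n",
"        hypothesis = write_hypothesis(premise, target_label, attempt)\n",
"        predicted = model_predict(premise, hypothesis)\n",
"        if predicted != target_label:\n",
"            # The model was fooled; in ANLI, the pair would now be\n",
"            # passed to other crowdworkers for validation.\n",
"            return {\n",
"                'context': premise,\n",
"                'hypothesis': hypothesis,\n",
"                'label': target_label,\n",
"                'model_label': predicted}\n",
"    return None\n",
"\n",
"# Toy stand-in model: predicts entailment only for substring matches.\n",
"def toy_model(premise, hypothesis):\n",
"    return 'entailment' if hypothesis.lower() in premise.lower() else 'neutral'\n",
"\n",
"# Toy stand-in crowdworker: tries a fixed sequence of hypotheses.\n",
"def toy_writer(premise, target_label, attempt):\n",
"    hypotheses = ['a turtle danced', 'A reptile moved']\n",
"    return hypotheses[min(attempt, len(hypotheses)) - 1]\n",
"\n",
"toy_result = collect_adversarial_example(\n",
"    'A turtle danced in the rain', 'entailment', toy_writer, toy_model)\n",
"\n",
"toy_result"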
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Working with Adversarial NLI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For ANLI, we have the following readers: \n", "\n", "* `nli.ANLITrainReader`\n", "* `nli.ANLIDevReader`\n", "\n", "As with SNLI, we leave the writing of a `Test` version to the user, as a way of discouraging inadvertent use of the test set during project development." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because ANLI is distributed in three rounds, and the rounds can be used independently or pooled, the interface has a `rounds` argument. The default is `rounds=(1,2,3)`, but any subset of them can be specified. Here are some illustrations using the `Train` reader; the `Dev` interface is the same:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "R(1,): 16,946\n", "R(2,): 45,460\n", "R(3,): 100,459\n", "R(1, 2, 3): 162,865\n" ] } ], "source": [ "for rounds in ((1,), (2,), (3,), (1,2,3)):\n", " count = len(list(nli.ANLITrainReader(ANLI_HOME, rounds=rounds).read()))\n", " print(\"R{0:}: {1:,}\".format(rounds, count))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above figures correspond to those in Table 2 of the paper. I am not sure what accounts for the differences of 100 examples in round 2 (and, in turn, in the grand total)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ANLI uses a different set of attributes from SNLI/MultiNLI. 
Here is a summary of what `NLIExample` instances offer for this corpus:\n", "\n", "* __uid__: a unique identifier; akin to `pairID` in SNLI/MultiNLI \n", "* __context__: the premise; corresponds to `sentence1` in SNLI/MultiNLI\n", "* __hypothesis__: the hypothesis; corresponds to `sentence2` in SNLI/MultiNLI\n", "* __label__: the gold label; corresponds to `gold_label` in SNLI/MultiNLI\n", "* __model_label__: the label predicted by the model used in the current round\n", "* __reason__: a crowdworker's free-text hypothesis about why the model made an incorrect prediction for the current __context__/__hypothesis__ pair\n", "* __emturk__: for dev (and test), this is `True` if the annotator contributed only dev (test) examples, else `False`; in turn, it is `False` for all train examples.\n", "* __genre__: the source for the __context__ text\n", "* __tag__: information about the round and train/dev/test classification\n", "\n", "All these attributes are `str`-valued except for `emturk`, which is `bool`-valued." 
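] }, { "cell_type": "markdown", "metadata": {}, "source": [
"Because the attribute names differ, code written for SNLI/MultiNLI examples needs a small adapter to work with ANLI. Here is a minimal sketch that operates on plain dicts rather than `NLIExample` instances. The function name `anli_to_snli_fields` is hypothetical, and the single-letter label codes `e`, `n`, and `c` are assumed to stand for entailment, neutral, and contradiction:"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# Assumed correspondence between ANLI label codes and SNLI label names:\n",
"ANLI_LABELS = {'e': 'entailment', 'n': 'neutral', 'c': 'contradiction'}\n",
"\n",
"def anli_to_snli_fields(anli_ex):\n",
"    # Rename the ANLI fields to their SNLI/MultiNLI counterparts\n",
"    # and spell out the single-letter label.\n",
"    return {\n",
"        'pairID': anli_ex['uid'],\n",
"        'sentence1': anli_ex['context'],\n",
"        'sentence2': anli_ex['hypothesis'],\n",
"        'gold_label': ANLI_LABELS[anli_ex['label']]}\n",
"\n",
"# Made-up example, not drawn from the corpus:\n",
"toy_anli_ex = {\n",
"    'uid': 'toy-1',\n",
"    'context': 'A turtle danced.',\n",
"    'hypothesis': 'A turtle moved.',\n",
"    'label': 'e'}\n",
"\n",
"anli_to_snli_fields(toy_anli_ex)"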
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The labels in this dataset are conceptually the same as for SNLI/MultiNLI, but they are encoded differently:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "n 68789\n", "e 52111\n", "c 41965\n", "dtype: int64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anli_labels = pd.Series([ex.label for ex in nli.ANLITrainReader(ANLI_HOME).read()])\n", "\n", "anli_labels.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the dev set, the `label` and `model_label` values are always different, suggesting that these evaluations will be very challenging for present-day models:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False 3200\n", "dtype: int64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(\n", " [ex.label == ex.model_label for ex in nli.ANLIDevReader(ANLI_HOME).read()]\n", ").value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the train set, they do sometimes correspond, and you can track the changes in the rate of correct model predictions across the rounds:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True 0.821197\n", "False 0.178803\n", "Name: Round 1, dtype: float64\n", "\n", "True 0.932028\n", "False 0.067972\n", "Name: Round 2, dtype: float64\n", "\n", "True 0.915916\n", "False 0.084084\n", "Name: Round 3, dtype: float64\n", "\n" ] } ], "source": [ "for r in (1,2,3):\n", " dist = pd.Series(\n", " [ex.label == ex.model_label for ex in nli.ANLITrainReader(ANLI_HOME, rounds=(r,)).read()]\n", " ).value_counts()\n", " dist = dist / dist.sum()\n", " dist.name = \"Round {}\".format(r)\n", " print(dist, end=\"\\n\\n\")" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "This corresponds to Table 2, \"Model error rate (Verified)\", in the paper. (I am not sure what accounts for the slight differences in the percentages.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other NLI datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* [The FraCaS textual inference test suite](http://www-nlp.stanford.edu/~wcmac/downloads/) is a smaller, hand-built dataset that is great for evaluating a model's ability to handle complex logical patterns.\n", "\n", "* [SemEval 2013](https://www.cs.york.ac.uk/semeval-2013/) had a wide range of interesting data sets for NLI and related tasks.\n", "\n", "* [The SemEval 2014 semantic relatedness shared task](http://alt.qcri.org/semeval2014/task1/) used an NLI dataset called [Sentences Involving Compositional Knowledge (SICK)](http://alt.qcri.org/semeval2014/task1/index.php?id=data-and-tools).\n", "\n", "* [MedNLI](https://physionet.org/physiotools/mimic-code/mednli/) is specialized to the medical domain, using data derived from [MIMIC III](https://mimic.physionet.org).\n", "\n", "* [XNLI](https://github.com/facebookresearch/XNLI) is a multilingual NLI dataset derived from MultiNLI.\n", "\n", "* [Diverse Natural Language Inference Collection (DNC)](http://decomp.io/projects/diverse-natural-language-inference/) transforms existing annotations from other tasks into NLI problems for a diverse range of reasoning challenges.\n", "\n", "* [SciTail](http://data.allenai.org/scitail/) is an NLI dataset derived from multiple-choice science exam questions and Web text.\n", "\n", "* [NLI Style FEVER](https://github.com/easonnie/combine-FEVER-NSMN/blob/master/other_resources/nli_fever.md) is a version of [the FEVER dataset](http://fever.ai) put into a standard NLI format. 
It was used by the Adversarial NLI team to train models for their annotation round 2.\n", "\n", "* Models for NLI might be adapted for use with [the 30M Factoid Question-Answer Corpus](http://agarciaduran.org/).\n", "\n", "* Models for NLI might be adapted for use with [the Penn Paraphrase Database](http://paraphrase.org/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }