{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bake-off: Word-level entailment with neural networks" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "__author__ = \"Christopher Potts\"\n", "__version__ = \"CS224u, Stanford, Spring 2018\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contents\n", "\n", "0. [Overview](#Overview)\n", "0. [Set-up](#Set-up)\n", "0. [Data](#Data)\n", " 0. [Edge disjoint](#Edge-disjoint)\n", " 0. [Word disjoint](#Word-disjoint)\n", " 0. [Word disjoint and balanced](#Word-disjoint-and-balanced)\n", "0. [Baseline](#Baseline)\n", " 0. [Representing words: vector_func](#Representing-words:-vector_func)\n", " 0. [Combining words into inputs: vector_combo_func](#Combining-words-into-inputs:-vector_combo_func)\n", " 0. [Classifier model](#Classifier-model)\n", " 0. [Baseline results](#Baseline-results)\n", "0. [Bake-off submission](#Bake-off-submission)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Problem__: Word-level natural language inference.\n", "\n", "Training examples are pairs of words $(w_{L}, w_{R}), y$ with $y$ a relation in\n", "\n", "* __synonym__: very roughly identical meanings; symmetric\n", "* __hyponym__: e.g., _puppy_ is a hyponym of _dog_\n", "* __hypernym__: e.g., _dog_ is a hypernym of _puppy_\n", "* __antonym__: semantically opposed within a domain; symmetric\n", "\n", "The dataset is due to [Bowman et al. 2015](https://arxiv.org/abs/1406.1827). See [below](#Data) for details on how it was processed for this bake-off." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set-up" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "0. Make sure your environment includes all the requirements for [the cs224u repository](https://github.com/cgpotts/cs224u/).\n", "\n", "0. Make sure you have the [the Wikipedia 2014 + Gigaword 5 distribution](http://nlp.stanford.edu/data/glove.6B.zip) of pretrained GloVe vectors downloaded and unzipped, and that `glove_home` below is pointing to it.\n", "\n", "0. Make sure `wordentail_filename` below is pointing to the full path for `nli_wordentail_bakeoff_data.json`, which is included in [the nlidata.zip archive](http://web.stanford.edu/class/cs224u/data/nlidata.zip)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Applications/anaconda/envs/nlu/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n", " from ._conv import register_converters as _register_converters\n" ] } ], "source": [ "from collections import defaultdict\n", "import json\n", "import numpy as np\n", "import os\n", "import pandas as pd\n", "import tensorflow as tf\n", "from tf_shallow_neural_classifier import TfShallowNeuralClassifier\n", "import nli\n", "import utils" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "nlidata_home = 'nlidata'\n", "\n", "wordentail_filename = os.path.join(\n", " nlidata_home, 'nli_wordentail_bakeoff_data.json')\n", "\n", "glove_home = os.path.join(\"vsmdata\", \"glove.6B\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "\n", "As noted above, the dataset was originally released by [Bowman et al. 
2015](https://arxiv.org/abs/1406.1827), who derived it from [WordNet](https://wordnet.princeton.edu) using some heuristics (and thus it might contain some errors or unintuitive pairings).\n", "\n", "I've processed the data into three different train/test splits, in an effort to put some pressure on our models to actually learn these semantic relations, as opposed to exploiting regularities in the sample.\n", "\n", "* `edge_disjoint`: The `train` and `dev` __edge__ sets are disjoint, but many __words__ appear in both `train` and `dev`.\n", "* `word_disjoint`: The `train` and `dev` __vocabularies are disjoint__, and thus the edges are disjoint as well.\n", "* `word_disjoint_balanced`: Like `word_disjoint`, but with each word appearing at most one time as the left word and at most one time on the right for a given relation type.\n", "\n", "These are progressively harder problems\n", "\n", "* For `word_disjoint`, there is real pressure on the model to learn abstract relationships, as opposed to memorizing properties of individual words.\n", "\n", "* For `word_disjoint_balanced`, the model can't even learn that some terms tend to appear more on the left or the right. This might be a step too far. For example, appearing more on the right for `hypernym` corresponds in a deep way with being a more general term, which is a non-trivial lexical property that we want our models to learn." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "with open(wordentail_filename) as f:\n", " wordentail_data = json.load(f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The outer keys are the three splits plus a list giving the vocabulary for the entire dataset:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['edge_disjoint', 'vocab', 'word_disjoint', 'word_disjoint_balanced'])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordentail_data.keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Edge disjoint" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['dev', 'train'])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordentail_data['edge_disjoint'].keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is what the split looks like; all three have this same format:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[['archived', 'records'], 'synonym'],\n", " [['stage', 'station'], 'synonym'],\n", " [['engineers', 'design'], 'hypernym'],\n", " [['save', 'book'], 'hypernym'],\n", " [['match', 'supply'], 'hypernym']]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordentail_data['edge_disjoint']['dev'][: 5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's test to make sure no edges are shared between `train` and `dev`:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nli.get_edge_overlap_size(wordentail_data, 'edge_disjoint')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we expect, a *lot* of vocabulary items are shared between `train` and `dev`:" ] }, { "cell_type": "code", "execution_count": 9, 
"metadata": {}, "outputs": [ { "data": { "text/plain": [ "4769" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nli.get_vocab_overlap_size(wordentail_data, 'edge_disjoint')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a large percentage of the entire vocab:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6560" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(wordentail_data['vocab'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's the distribution of labels in the `train` set. It's highly imbalanced, which will pose a challenge. (I'll go ahead and reveal that the `dev` set is similarly distributed.)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def label_distribution(split):\n", " return pd.DataFrame(wordentail_data[split]['train'])[1].value_counts()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "synonym 8865\n", "hypernym 6475\n", "hyponym 1044\n", "antonym 629\n", "Name: 1, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "label_distribution('edge_disjoint')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Word disjoint" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['dev', 'train'])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordentail_data['word_disjoint'].keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the `word_disjoint` split, no __words__ are shared between `train` and `dev`:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nli.get_vocab_overlap_size(wordentail_data, 'word_disjoint')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because no words are shared between `train` and `dev`, no edges are either:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nli.get_edge_overlap_size(wordentail_data, 'word_disjoint')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The label distribution is similar to that of `edge_disjoint`, though the overall number of examples is a bit smaller:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "synonym 5610\n", "hypernym 3993\n", "hyponym 627\n", "antonym 386\n", "Name: 1, dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "label_distribution('word_disjoint')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is still an important bias in the data: some words appear much more often than others, and in specific positions. For example, the very general term `part` appears on the right in a large number of cases, many of them `hypernym`." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[['frames', 'part'], 'hypernym'],\n", " [['heaven', 'part'], 'hypernym'],\n", " [['pan', 'part'], 'synonym'],\n", " [['middle', 'part'], 'hypernym'],\n", " [['shared', 'part'], 'synonym'],\n", " [['shares', 'part'], 'synonym'],\n", " [['ended', 'part'], 'hypernym'],\n", " [['twin', 'part'], 'synonym'],\n", " [['meal', 'part'], 'synonym'],\n", " [['bit', 'part'], 'hypernym'],\n", " [['sections', 'part'], 'synonym'],\n", " [['capacity', 'part'], 'hypernym'],\n", " [['beginning', 'part'], 'hypernym'],\n", " [['divorce', 'part'], 'hypernym'],\n", " [['paradise', 'part'], 'hypernym'],\n", " [['ends', 'part'], 'hypernym'],\n", " [['reduced', 'part'], 'hypernym'],\n", " [['units', 'part'], 'hypernym'],\n", " [['corner', 'part'], 'hypernym'],\n", " [['air', 'part'], 'hypernym'],\n", " [['section', 'part'], 'synonym'],\n", " [['something', 'part'], 'synonym'],\n", " [['reduce', 'part'], 'hypernym'],\n", " [['some', 'part'], 'synonym'],\n", " [['heavy', 'part'], 'hypernym'],\n", " [['segment', 'part'], 'hypernym'],\n", " [['share', 'part'], 'synonym'],\n", " [['hat', 'part'], 'hypernym'],\n", " [['maria', 'part'], 'hypernym'],\n", " [['way', 'part'], 'hypernym'],\n", " [['interests', 'part'], 'synonym']]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[[ex, y] for ex, y in wordentail_data['word_disjoint']['train'] \n", " if ex[1] == 'part']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These tabulations suggest that a classifier could do well just by learning where words tend to appear:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def count_label_position_instances(split, pos=0):\n", " examples = wordentail_data[split]['train'] \n", " return pd.Series([(ex[pos], label) for ex, label in examples]).value_counts()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(forms, hypernym) 9\n", "(have, synonym) 8\n", "(question, synonym) 8\n", "(questions, synonym) 8\n", "(items, synonym) 8\n", "dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "count_label_position_instances('word_disjoint', pos=0).head()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(be, hypernym) 51\n", "(take, hypernym) 39\n", "(alter, hypernym) 38\n", "(person, hypernym) 33\n", "(modify, hypernym) 32\n", "dtype: int64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "count_label_position_instances('word_disjoint', pos=1).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Word disjoint and balanced\n", "\n", "To see how much our models are leveraging the uneven distribution of words across the left and right positions, we also have a split in which each word $w$ appears in at most one item $((w, w_{R}), y)$ and at most one item $((w_{L}, w), y)$.\n", "\n", "The following tests establish that the dataset has the desired properties:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['dev', 'train'])" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wordentail_data['word_disjoint_balanced'].keys()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, 
"outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nli.get_edge_overlap_size(wordentail_data, 'word_disjoint_balanced')" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nli.get_vocab_overlap_size(wordentail_data, 'word_disjoint_balanced')" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[['frames', 'part'], 'hypernym'], [['pan', 'part'], 'synonym']]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[[ex, y] for ex, y in wordentail_data['word_disjoint_balanced']['train'] \n", " if ex[1] == 'part']" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(remove, synonym) 1\n", "(close, hyponym) 1\n", "(seminar, hypernym) 1\n", "(wants, hyponym) 1\n", "(reform, synonym) 1\n", "dtype: int64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "count_label_position_instances('word_disjoint_balanced', pos=0).head()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(remove, synonym) 1\n", "(attitude, synonym) 1\n", "(relation, hypernym) 1\n", "(weak, synonym) 1\n", "(soon, synonym) 1\n", "dtype: int64" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "count_label_position_instances('word_disjoint_balanced', pos=1).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Baseline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even in deep learning, __feature representation is the most important thing and requires care!__\n", "For our task, feature representation has two parts: representing the individual words and combining those representations into a single network input." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Representing words: vector_func" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's consider two baseline word representations methods:\n", "\n", "1. Random vectors (as returned by `utils.randvec`).\n", "1. 50-dimensional GloVe representations." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "def randvec(w, n=50, lower=-1.0, upper=1.0):\n", " \"\"\"Returns a random vector of length `n`. `w` is ignored.\"\"\"\n", " return utils.randvec(n=n, lower=lower, upper=upper)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "# Any of the files in glove.6B will work here:\n", "glove50_src = os.path.join(glove_home, 'glove.6B.50d.txt')\n", "\n", "# Creates a dict mapping strings (words) to GloVe vectors:\n", "GLOVE50 = utils.glove2dict(glove50_src)\n", "\n", "def glove50vec(w): \n", " \"\"\"Return `w`'s GloVe representation if available, else return \n", " a random vector.\"\"\"\n", " return GLOVE50.get(w, randvec(w, n=50))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining words into inputs: vector_combo_func" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we decide how to combine the two word vectors into a single representation. 
In more detail, where `u` is a vector representation of the left word and `v` is a vector representation of the right word, we need a function `vector_combo_func` such that `vector_combo_func(u, v)` returns a new input vector `z` of dimension `m`. A simple example is concatenation:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "def vec_concatenate(u, v):\n", " \"\"\"Concatenate np.array instances `u` and `v` into a new np.array\"\"\"\n", " return np.concatenate((u, v))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`vector_combo_func` could instead be vector average, vector difference, etc. (even combinations of those) – there's lots of space for experimentation here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Classifier model\n", "\n", "For a baseline model, I chose `TfShallowNeuralClassifier` with a pretty large hidden layer and a correspondingly high number of iterations. " ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "net = TfShallowNeuralClassifier(hidden_dim=200, max_iter=500)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Baseline results\n", "\n", "The following puts the above pieces together, using `vector_func=glove50vec`, since `vector_func=randvec` seems so hopelessly misguided for `word_disjoint` and `word_disjoint_balanced`!\n", "\n", "First, we build the dataset:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "X = nli.build_bakeoff_dataset(\n", " wordentail_data, \n", " vector_func=glove50vec,\n", " vector_combo_func=vec_concatenate)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And then we run the experiment with `nli.bakeoff_experiment`. This trains and tests on all three splits, and additionally trains on `word_disjoint`'s `train` portion and tests on `word_disjoint_balanced`'s `dev` portion, to see what distribution of examples is more effective for this balanced evaluation.\n", "\n", "Since the bake-off focus is `word_disjoint`, you might want to run just that evaluation. 
To do that, use:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Iteration 500: loss: 9.8024865388870243" ] }, { "name": "stdout", "output_type": "stream", "text": [ "======================================================================\n", "word_disjoint\n", " precision recall f1-score support\n", "\n", " antonym 0.00 0.00 0.00 150\n", " hypernym 0.54 0.43 0.48 1594\n", " hyponym 0.22 0.01 0.03 275\n", " synonym 0.59 0.77 0.67 2229\n", "\n", "avg / total 0.52 0.57 0.53 4248\n", "\n" ] } ], "source": [ "nli.bakeoff_experiment(X, net, conditions=['word_disjoint'])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "This will run the complete evaluation:" ] },
{ "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Iteration 500: loss: 15.278596043586731/Applications/anaconda/envs/nlu/lib/python3.6/site-packages/sklearn/metrics/classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.\n", " 'precision', 'predicted', average, warn_for)\n", "Iteration 2: loss: 12.505775809288025" ] }, { "name": "stdout", "output_type": "stream", "text": [ "======================================================================\n", "edge_disjoint\n", " precision recall f1-score support\n", "\n", " antonym 0.00 0.00 0.00 392\n", " hypernym 0.58 0.43 0.49 4310\n", " hyponym 0.51 0.04 0.07 710\n", " synonym 0.59 0.80 0.68 5930\n", "\n", "avg / total 0.56 0.59 0.55 11342\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Iteration 3: loss: 4.142282485961914884" ] }, { "name": "stdout", "output_type": "stream", "text": [ "======================================================================\n", "word_disjoint\n", " precision recall f1-score support\n", "\n", " antonym 0.00 0.00 0.00 150\n", " hypernym 0.54 0.43 0.48 1594\n", " hyponym 0.33 0.02 0.04 275\n", " synonym 0.59 0.78 0.67 2229\n", "\n", "avg / total 0.53 0.57 0.53 4248\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Iteration 2: loss: 13.00054156780243554" ] }, { "name": "stdout", "output_type": "stream", "text": [ "======================================================================\n", "word_disjoint_balanced\n", " precision recall f1-score support\n", "\n", " antonym 0.00 0.00 0.00 115\n", " hypernym 0.48 0.28 0.35 511\n", " hyponym 0.27 0.03 0.05 118\n", " synonym 0.55 0.84 0.67 831\n", "\n", "avg / total 0.47 0.54 0.47 1575\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Iteration 500: loss: 9.7987087368965157" ] }, { "name": "stdout", "output_type": "stream", "text": [ "======================================================================\n", "word_disjoint_balanced, training on word_disjoint\n", " precision recall f1-score support\n", "\n", " antonym 0.00 0.00 0.00 115\n", " hypernym 0.49 0.43 0.46 511\n", " hyponym 0.20 0.01 0.02 118\n", " synonym 0.58 0.79 0.67 831\n", "\n", "avg / total 0.48 0.56 0.50 1575\n", "\n" ] } ], "source": [ "nli.bakeoff_experiment(X, net)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Bake-off submission" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "__The goal__: achieve the highest average F1 score on __word_disjoint__.\n", "\n", "__Submit:__\n", "\n", "* Your score on the `word_disjoint` split.\n", "* A description of the method you used:\n", "    * Your approach to representing words.\n", "    * Your approach to combining them into inputs.\n", "    * The model you used for predictions.\n", "\n", "__Submission URL__: https://goo.gl/forms/CizXwS3kfPjsThxA3 \n", "\n", "__Notes:__\n", "\n", "* For the methods, the only requirement is that they differ in some way from the baseline above. They don't have to be completely different, though. For example, you might want to stick with the model but represent examples differently, or the reverse.\n", "\n", "* You must train only on the `train` split. No outside training instances can be brought in. You can, though, bring in outside information via your input vectors, as long as this information is not from `dev` or `edge_disjoint`. \n", "\n", "* You can also augment your training data. For example, if `((A, B), synonym)` is a training instance, then `((B, A), synonym)` should be too. Similarly, if `((A, B), hyponym)` and `((B, C), hyponym)` are training cases, then `((A, C), hyponym)` should be as well. (A minimal sketch of the symmetric case appears in the final cell of this notebook.)\n", "\n", "* Since the evaluation is for `word_disjoint`, you're not going to get very far with random input vectors! A GloVe featurizer is defined above. Feel free to look around for new word vectors on the Web, or even train your own using our VSM notebooks.\n", "\n", "* You're not required to stick to `TfShallowNeuralClassifier`. For instance, you could create deeper feed-forward networks, change how they optimize, etc. As long as you have `fit` and `predict` methods with the same input and output types as our networks, you should be able to use `bakeoff_experiment`. For notes on how to extend the TensorFlow models included in this repository, see [tensorflow_models.ipynb](tensorflow_models.ipynb)." ] },
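{ "cell_type": "markdown", "metadata": {}, "source": [ "To make the augmentation note above concrete, here is a minimal sketch of the symmetric case. It relies only on facts stated in the Overview (synonym and antonym are symmetric relations); a transitive closure for `hyponym` could be added analogously. The function is illustrative, not part of the official bake-off code:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch: add the mirrored edge for each symmetric training example.\n", "# synonym and antonym are symmetric relations (see the Overview above).\n", "SYMMETRIC = {'synonym', 'antonym'}\n", "\n", "def augment_symmetric(train):\n", "    \"\"\"Return a copy of `train` with ((B, A), y) added for every ((A, B), y)\n", "    whose label y is symmetric, skipping pairs that are already present.\"\"\"\n", "    seen = {(tuple(ex), label) for ex, label in train}\n", "    augmented = list(train)\n", "    for (w1, w2), label in train:\n", "        if label in SYMMETRIC and ((w2, w1), label) not in seen:\n", "            augmented.append([[w2, w1], label])\n", "            seen.add(((w2, w1), label))\n", "    return augmented\n", "\n", "# Example usage:\n", "# aug_train = augment_symmetric(wordentail_data['word_disjoint']['train'])" ] }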
], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 1 }