{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Word-level entailment with neural networks" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "__author__ = \"Christopher Potts\"\n", "__version__ = \"CS224u, Stanford, Spring 2016\"" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## Contents\n", "\n", "0. [Overview](#Overview)\n", "0. [Set-up](#Set-up)\n", "0. [Data](#Data)\n", "0. [Neural network architecture](#Neural-network-architecture)\n", "0. [Shallow neural networks from scratch](#Shallow-neural-networks-from-scratch)\n", "0. [Input feature representation](#Input-feature-representation)\n", " 0. [Representing words](#Representing-words)\n", " 0. [Combining words into inputs](#Combining-words-into-inputs)\n", "0. [Building datasets for experiments](#Building-datasets-for-experiments)\n", "0. [Running experiments](#Running-experiments)\n", "0. [Shallow neural networks in TensorFlow](#Shallow-neural-networks-in-TensorFlow)\n", "0. [In-class bake-off](#In-class-bake-off)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Problem__: For two words $w_{1}$ and $w_{2}$, predict $w_{1} \\subset w_{2}$ or $w_{1} \\supset w_{2}$. This is a basic, word-level version of the task of __Natural Language Inference__ (NLI).\n", "\n", "__Approach__: Shallow feed-forward neural networks. Here's a broad overview of the model structure and task:\n", "\n", "![fig/wordentail.png](fig/wordentail.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set-up" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "0. Make sure your environment includes all the requirements for [the cs224u repository](https://github.com/cgpotts/cs224u), especially TensorFlow, which isn't included in the standard Anaconda distribution (but is [easily installed](https://anaconda.org/jjhelmus/tensorflow)).\n", "0. Make sure you have the [the Wikipedia 2014 + Gigaword 5 distribution](http://nlp.stanford.edu/data/glove.6B.zip) of pretrained GloVe vectors downloaded and unzipped, and that `glove_home` below is pointing to it.\n", "0. Make sure `wordentail_data_filename` below is pointing to the full path for `wordentail_data.pickle`, which is included in the cs224u repository." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "wordentail_data_filename = 'wordentail_data.pickle'\n", "glove_home = \"glove.6B\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\n", "import sys\n", "import pickle\n", "import random\n", "from collections import defaultdict\n", "import numpy as np\n", "from sklearn.metrics import classification_report\n", "import utils\n", "from shallow_neural_networks import ShallowNeuralNetwork" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As suggested by the task decription, the dataset consists of word pairs with a label indicating that the first entails the second or the second entails the first. 
\n", "\n", "The pickled data distribution is a tuple in which the first member is the vocabulary for the entire dataset and the second is a dictionary establishing train/test splits:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "wordentail_data = pickle.load(open(wordentail_data_filename, 'rb'))\n", "vocab, splits = wordentail_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The structure of `splits` creates a single training set and two different test sets:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "dict_keys(['disjoint_vocab_test', 'train', 'test'])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "splits.keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* All three sets are disjoint in terms of word pairs. \n", "\n", "* The `test` vocab is a subset of the `train` vocab. So every word seen at test time was seen in training. \n", "\n", "* The `disjoint_vocab_test` split has a vocabulary that is totally disjoint from `train`. So none of the words are seen in training. \n", "\n", "* All the words are in the GloVe vocabulary.\n", "\n", "Each split is itself a dict mapping class names to lists of word pairs. For example:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "dict_keys([1.0, -1.0])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "splits['train'].keys()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[['polynesian', 'inhabitant'],\n", " ['wiper', 'worker'],\n", " ['argonaut', 'adventurer'],\n", " ['bride', 'relative'],\n", " ['aramean', 'semite']]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "splits['train'][1.0][: 5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The class labels are `1.0` if the first word entails the second, and `-1.0` if the second entails the first. These labels are scaled to the particular neural models we'll be using, in particular, to the `tanh` activation functions they use by default. It's also worth noting that we'll be treating these labels using a one-dimensional output space, since they are completely complementary." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "SUBSET = 1.0 # Left word entails right, as in (hippo, mammal)\n", "SUPERSET = -1.0 # Right word entails left, as in (mammal, hippo)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Neural network architecture" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this notebook, we'll use a simple shallow neural network parameterized as follows:\n", "\n", "* A weight matrix $W^{1}$ of dimension $m \\times n$, where $m$ is the dimensionality of the input vector representations and $n$ is the dimensionality of the hidden layer.\n", "* A bias term $b^{1}$ of dimension $n$.\n", "* A weight matrix $W^{2}$ of dimension $n \\times p$, where $p$ is the dimensionality of the output vector.\n", "* A bias term $b^{2}$ of dimension $p$.\n", "\n", "The network is then defined as follows, with $x$ the input layer, $h$ the hidden layer of dimension $n$, and $y$ the output of dimension $1 \\times p$:\n", "\n", "$$h = \\tanh\\left(xW^{1} + b^{1}\\right)$$\n", "\n", "$$y = \\tanh\\left(hW^{2} + b^{2}\\right)$$\n", "\n", "We'll first implement this from scratch and then reimplement it in [TensorFlow](https://www.tensorflow.org). Our hope is that this will provide a firm foundation for your own exploration of neural models for NLI." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Shallow neural networks from scratch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before moving to TensorFlow, it's worth building up our simple shallow architecture from scratch, as a way to explore the concepts and avoid the dangers of black-box machine learning.\n", "The full implementation is in [shallow_neural_networks.py](shallow_neural_networks.py), as `ShallowNeuralNetwork`, so that we can use it as a free-standing module. Check it out — it's just a few dozen lines of code." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Input feature representation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even in deep learning, feature representation is the most important thing and requires care!\n", "For our task, feature representation has two parts: representing the individual words and combining those representations into a single network input." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Representing words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our baseline word representations will be random vectors. This works well for the `test` task but is of course hopeless for the `disjoint_vocab_test` one." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def randvec(w, n=50, lower=-0.5, upper=0.5):\n", " \"\"\"Returns a random vector of length `n`. `w` is ignored.\"\"\"\n", " return np.array([random.uniform(lower, upper) for i in range(n)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whereas random inputs are hopeless for `disjoint_vocab_test`, GloVe vectors might not be ..." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Any of the files in glove.6B will work here:\n", "glove50_src = os.path.join(glove_home, 'glove.6B.50d.txt')\n", "\n", "# Creates a dict mapping strings (words) to GloVe vectors:\n", "GLOVE50 = utils.glove2dict(glove50_src)\n", "\n", "def glove50vec(w): \n", " \"\"\"Return `w`'s GloVe representation if available, else return \n", " a random vector.\"\"\"\n", " return GLOVE50.get(w, randvec(w, n=50))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining words into inputs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we decide how to combine the two word vectors into a single representation. In more detail, where $x_{l}$ is a vector representation of the left word and $x_{r}$ is a vector representation of the right word, we need a function $\\textbf{combine}$ such that $\\textbf{combine}(x_{l}, x_{r})$ returns a new input vector $x$ of dimension $m$. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def vec_concatenate(u, v):\n", " \"\"\"Concatenate np.array instances `u` and `v` into a new np.array\"\"\"\n", " return np.concatenate((u, v))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$\\textbf{combine}$ could be concatenation as in `vec_concatenate`, or vector average, vector difference, etc. (even combinations of those) — there's lots of space for experimentation here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building datasets for experiments" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As usual, we define a function that featurizes the data (here, according to `vector_func` and `vector_combo_func`) and puts it into the right format for optimization." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def build_dataset(\n", " wordentail_data, \n", " vector_func=randvec, \n", " vector_combo_func=vec_concatenate): \n", " \"\"\"\n", " Parameters\n", " ---------- \n", " wordentail_data\n", " The pickled dataset at `wordentail_data_filename`.\n", " \n", " vector_func : (default: `randvec`)\n", " Any function mapping words in the vocab for `wordentail_data`\n", " to vector representations\n", " \n", " vector_combo_func : (default: `vec_concatenate`)\n", " Any function for combining two vectors into a new vector\n", " of fixed dimensionality.\n", " \n", " Returns\n", " -------\n", " dataset : defaultdict\n", " A map from split names (\"train\", \"test\", \"disjoint_vocab_test\")\n", " into data instances:\n", " \n", " {'train': [(vec, [cls]), (vec, [cls]), ...],\n", " 'test': [(vec, [cls]), (vec, [cls]), ...],\n", " 'disjoint_vocab_test': [(vec, [cls]), (vec, [cls]), ...]}\n", " \n", " \"\"\"\n", " # Load in the dataset:\n", " vocab, splits = wordentail_data\n", " # A mapping from words (as strings) to their vector\n", " # representations, as determined by vector_func:\n", " vectors = {w: vector_func(w) for w in vocab}\n", " # Dataset in the format required by the neural network:\n", " dataset = defaultdict(list)\n", " for split, data in splits.items():\n", " for clsname, word_pairs in data.items():\n", " for w1, w2 in word_pairs:\n", " # Use vector_combo_func to combine the word vectors for\n", " # w1 and w2, as given by the vectors dictionary above,\n", " # and pair it with the singleton array containing clsname:\n", " item = [vector_combo_func(vectors[w1], vectors[w2]), \n", " np.array([clsname])]\n", " dataset[split].append(item)\n", " return dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Running experiments" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function `experiment` trains its `network` parameters on `dataset['train']` and then evaluates its performance on all three splits:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def experiment(dataset, network): \n", " \"\"\"Train and evaluation code for the word-level entailment task.\n", " \n", " Parameters\n", " ---------- \n", " dataset : dict\n", " With keys 'train', 'test', and 'disjoint_vocab_test', each with \n", " values that are lists of vector pairs, the first giving the \n", " example representation and the second giving its 1d output vector. \n", " The expectation is that this was created by `build_dataset`.\n", " \n", " network\n", " This will be `ShallowNeuralNetwork` or `TfShallowNeuralNetwork`\n", " below, but it could be any function that can train and \n", " evaluate on `dataset`. The needed methods are `fit` and\n", " `predict`.\n", " \n", " Prints\n", " ------\n", " To standard ouput\n", " An sklearn classification report for all three splits.\n", " \n", " \"\"\" \n", " # Train the network:\n", " network.fit(dataset['train'])\n", " # The following is evaluation code. You won't have to alter it\n", " # unless you did something unexpected like transform the output\n", " # variables before training.\n", " for typ in ('train', 'test', 'disjoint_vocab_test'):\n", " data = dataset[typ]\n", " predictions = []\n", " cats = []\n", " for ex, cat in data: \n", " # The raw prediction is a singleton list containing a float,\n", " # either -1 or 1. 
We want only its contents:\n", " prediction = network.predict(ex)[0]\n", " # Categorize the prediction for accuracy comparison:\n", " prediction = SUPERSET if prediction <= 0.0 else SUBSET \n", " predictions.append(prediction)\n", " # Store the gold label for the classification report:\n", " cats.append(cat[0])\n", " # Report:\n", " print(\"=\"*70)\n", " print(typ)\n", " print(classification_report(cats, predictions, target_names=['SUPERSET', 'SUBSET']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a baseline experiment run:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "completed iteration 100; error is 70.0256787791" ] }, { "name": "stdout", "output_type": "stream", "text": [ "======================================================================\n", "train\n", " precision recall f1-score support\n", "\n", " SUPERSET 0.99 0.99 0.99 2000\n", " SUBSET 0.99 0.99 0.99 2000\n", "\n", "avg / total 0.99 0.99 0.99 4000\n", "\n", "======================================================================\n", "test\n", " precision recall f1-score support\n", "\n", " SUPERSET 0.89 0.85 0.87 200\n", " SUBSET 0.86 0.89 0.87 200\n", "\n", "avg / total 0.87 0.87 0.87 400\n", "\n", "======================================================================\n", "disjoint_vocab_test\n", " precision recall f1-score support\n", "\n", " SUPERSET 0.51 0.55 0.53 49\n", " SUBSET 0.51 0.47 0.49 49\n", "\n", "avg / total 0.51 0.51 0.51 98\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "baseline_dataset = build_dataset(\n", " wordentail_data, \n", " vector_func=randvec, \n", " vector_combo_func=vec_concatenate)\n", "\n", "baseline_network = ShallowNeuralNetwork()\n", "\n", "experiment(baseline_dataset, baseline_network)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Shallow neural networks in TensorFlow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now translate `ShallowNeuralNetwork` into TensorFlow. TensorFlow is a powerful library for building deep learning models. In essence, you define the model architecture and it handles the details of optimization. In addition, it is very high-performance, so it will scale to large datasets and complicated model designs. The full implementation is in [shallow_neural_networks.py](shallow_neural_networks.py), as `TfShallowNeuralNetwork`. It's even less code than our `ShallowNeuralNetwork`!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a baseline run with this new network, using `baseline_dataset` as created above for our other baseline experiment." 
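] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before running it, here's a rough sketch of the kind of computation graph such a network builds. This is not the actual `TfShallowNeuralNetwork` code, just an illustration in the same TensorFlow idiom, with placeholder dimensions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# A sketch only: not the actual TfShallowNeuralNetwork code.\n", "# The dimensions m, n, p are placeholder choices.\n", "try:\n", "    import tensorflow as tf\n", "    m, n, p = 100, 40, 1\n", "    x = tf.placeholder(tf.float32, [1, m])   # input layer\n", "    y_ = tf.placeholder(tf.float32, [1, p])  # gold label\n", "    W1 = tf.Variable(tf.random_uniform([m, n], -0.5, 0.5))\n", "    b1 = tf.Variable(tf.zeros([n]))\n", "    W2 = tf.Variable(tf.random_uniform([n, p], -0.5, 0.5))\n", "    b2 = tf.Variable(tf.zeros([p]))\n", "    h = tf.tanh(tf.matmul(x, W1) + b1)       # h = tanh(xW1 + b1)\n", "    y = tf.tanh(tf.matmul(h, W2) + b2)       # y = tanh(hW2 + b2)\n", "    cost = tf.reduce_sum(tf.square(y - y_))  # squared-error cost\n", "    train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cost)\n", "except ImportError:\n", "    pass"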
] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "try:\n", " import tensorflow \n", "except:\n", " print(\"Warning: TensorFlow is not installed, so you won't be able to use `TfShallowNeuralNetwork`.\")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "======================================================================\n", "train\n", " precision recall f1-score support\n", "\n", " SUPERSET 0.95 0.95 0.95 2000\n", " SUBSET 0.95 0.95 0.95 2000\n", "\n", "avg / total 0.95 0.95 0.95 4000\n", "\n", "======================================================================\n", "test\n", " precision recall f1-score support\n", "\n", " SUPERSET 0.86 0.84 0.85 200\n", " SUBSET 0.85 0.86 0.85 200\n", "\n", "avg / total 0.85 0.85 0.85 400\n", "\n", "======================================================================\n", "disjoint_vocab_test\n", " precision recall f1-score support\n", "\n", " SUPERSET 0.54 0.59 0.56 49\n", " SUBSET 0.55 0.49 0.52 49\n", "\n", "avg / total 0.54 0.54 0.54 98\n", "\n" ] } ], "source": [ "# Let's not try to run this if `tensorflow` isn't available:\n", "if 'tensorflow' in sys.modules:\n", " \n", " from shallow_neural_networks import TfShallowNeuralNetwork\n", " \n", " baseline_tfnetwork = TfShallowNeuralNetwork()\n", " \n", " experiment(baseline_dataset, baseline_tfnetwork)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## In-class bake-off" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "__The goal__: achieve the highest average F1 score __on disjoint_vocab_test__.\n", "\n", "__Notes__\n", "\n", "* You must train only on the `train` split. No outside training instances can be brought in. You can, though, bring in outside information via your input vectors, as long as this information is not from `test` or `disjoint_vocab_test`.\n", "\n", "* Since the evaluation is for `disjoint_vocab_test`, you're not going to get very far with random input vectors! A GloVe featurizer is defined above ([`glove50vec`](#Representing-words)). Feel free to look around for new word vectors on the Web, or even train your own using our `vsm` notebook.\n", "\n", "* You're not required to stick to the network structures defined above. For instance, you could create deeper versions of them. As long as you have `fit` and `predict` methods with the same input and output types as our networks, you should be able to use `experiment`. Using `experiment` is not a requirement, though.\n", "\n", "At the end of class, bring your score to one of the teaching team. We'll report the results in the class discussion forum." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }