{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Word-level entailment with neural networks" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "__author__ = \"Christopher Potts\"\n", "__version__ = \"CS224u, Stanford, Spring 2016\"" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## Contents\n", "\n", "0. [Overview](#Overview)\n", "0. [Set-up](#Set-up)\n", "0. [Data](#Data)\n", "0. [Neural network architecture](#Neural-network-architecture)\n", "0. [Shallow neural networks from scratch](#Shallow-neural-networks-from-scratch)\n", "0. [Input feature representation](#Input-feature-representation)\n", " 0. [Representing words](#Representing-words)\n", " 0. [Combining words into inputs](#Combining-words-into-inputs)\n", "0. [Building datasets for experiments](#Building-datasets-for-experiments)\n", "0. [Running experiments](#Running-experiments)\n", "0. [Shallow neural networks in TensorFlow](#Shallow-neural-networks-in-TensorFlow)\n", "0. [In-class bake-off](#In-class-bake-off)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Problem__: For two words $w_{1}$ and $w_{2}$, predict $w_{1} \\subset w_{2}$ or $w_{1} \\supset w_{2}$. This is a basic, word-level version of the task of __Natural Language Inference__ (NLI).\n", "\n", "__Approach__: Shallow feed-forward neural networks. Here's a broad overview of the model structure and task:\n", "\n", "![fig/wordentail.png](fig/wordentail.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set-up" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "0. Make sure your environment includes all the requirements for [the cs224u repository](https://github.com/cgpotts/cs224u), especially TensorFlow, which isn't included in the standard Anaconda distribution (but is [easily installed](https://anaconda.org/jjhelmus/tensorflow)).\n", "0. Make sure you have the [the Wikipedia 2014 + Gigaword 5 distribution](http://nlp.stanford.edu/data/glove.6B.zip) of pretrained GloVe vectors downloaded and unzipped, and that `glove_home` below is pointing to it.\n", "0. Make sure `wordentail_data_filename` below is pointing to the full path for `wordentail_data.pickle`, which is included in the cs224u repository." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "wordentail_data_filename = 'wordentail_data.pickle'\n", "glove_home = \"glove.6B\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\n", "import sys\n", "import pickle\n", "import random\n", "from collections import defaultdict\n", "import numpy as np\n", "from sklearn.metrics import classification_report\n", "import utils\n", "from shallow_neural_networks import ShallowNeuralNetwork" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As suggested by the task decription, the dataset consists of word pairs with a label indicating that the first entails the second or the second entails the first. 
\n", "\n", "The pickled data distribution is a tuple in which the first member is the vocabulary for the entire dataset and the second is a dictionary establishing train/test splits:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "wordentail_data = pickle.load(open(wordentail_data_filename, 'rb'))\n", "vocab, splits = wordentail_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The structure of `splits` creates a single training set and two different test sets:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "dict_keys(['disjoint_vocab_test', 'train', 'test'])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "splits.keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* All three sets are disjoint in terms of word pairs. \n", "\n", "* The `test` vocab is a subset of the `train` vocab. So every word seen at test time was seen in training. \n", "\n", "* The `disjoint_vocab_test` split has a vocabulary that is totally disjoint from `train`. So none of the words are seen in training. \n", "\n", "* All the words are in the GloVe vocabulary.\n", "\n", "Each split is itself a dict mapping class names to lists of word pairs. For example:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "dict_keys([1.0, -1.0])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "splits['train'].keys()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[['polynesian', 'inhabitant'],\n", " ['wiper', 'worker'],\n", " ['argonaut', 'adventurer'],\n", " ['bride', 'relative'],\n", " ['aramean', 'semite']]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "splits['train'][1.0][: 5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The class labels are `1.0` if the first word entails the second, and `-1.0` if the second entails the first. These labels are scaled to the particular neural models we'll be using, in particular, to the `tanh` activation functions they use by default. It's also worth noting that we'll be treating these labels using a one-dimensional output space, since they are completely complementary." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "SUBSET = 1.0 # Left word entails right, as in (hippo, mammal)\n", "SUPERSET = -1.0 # Right word entails left, as in (mammal, hippo)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Neural network architecture" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this notebook, we'll use a simple shallow neural network parameterized as follows:\n", "\n", "* A weight matrix $W^{1}$ of dimension $m \\times n$, where $m$ is the dimensionality of the input vector representations and $n$ is the dimensionality of the hidden layer.\n", "* A bias term $b^{1}$ of dimension $n$.\n", "* A weight matrix $W^{2}$ of dimension $n \\times p$, where $p$ is the dimensionality of the output vector.\n", "* A bias term $b^{2}$ of dimension $p$.\n", "\n", "The network is then defined as follows, with $x$ the input layer, $h$ the hidden layer of dimension $n$, and $y$ the output of dimension $1 \\times p$:\n", "\n", "$$h = \\tanh\\left(xW^{1} + b^{1}\\right)$$\n", "\n", "$$y = \\tanh\\left(hW^{2} + b^{2}\\right)$$\n", "\n", "We'll first implement this from scratch and then reimplement it in [TensorFlow](https://www.tensorflow.org). Our hope is that this will provide a firm foundation for your own exploration of neural models for NLI." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Shallow neural networks from scratch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before moving to TensorFlow, it's worth building up our simple shallow architecture from scratch, as a way to explore the concepts and avoid the dangers of black-box machine learning.\n", "The full implementation is in [shallow_neural_networks.py](shallow_neural_networks.py), as `ShallowNeuralNetwork`, so that we can use it as a free-standing module. Check it out — it's just a few dozen lines of code." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Input feature representation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even in deep learning, feature representation is the most important thing and requires care!\n", "For our task, feature representation has two parts: representing the individual words and combining those representations into a single network input." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Representing words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our baseline word representations will be random vectors. This works well for the `test` task but is of course hopeless for the `disjoint_vocab_test` one." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def randvec(w, n=50, lower=-0.5, upper=0.5):\n", " \"\"\"Returns a random vector of length `n`. `w` is ignored.\"\"\"\n", " return np.array([random.uniform(lower, upper) for i in range(n)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whereas random inputs are hopeless for `disjoint_vocab_test`, GloVe vectors might not be ..." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Any of the files in glove.6B will work here:\n", "glove50_src = os.path.join(glove_home, 'glove.6B.50d.txt')\n", "\n", "# Creates a dict mapping strings (words) to GloVe vectors:\n", "GLOVE50 = utils.glove2dict(glove50_src)\n", "\n", "def glove50vec(w): \n", " \"\"\"Return `w`'s GloVe representation if available, else return \n", " a random vector.\"\"\"\n", " return GLOVE50.get(w, randvec(w, n=50))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining words into inputs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we decide how to combine the two word vectors into a single representation. In more detail, where $x_{l}$ is a vector representation of the left word and $x_{r}$ is a vector representation of the right word, we need a function $\\textbf{combine}$ such that $\\textbf{combine}(x_{l}, x_{r})$ returns a new input vector $x$ of dimension $m$. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def vec_concatenate(u, v):\n", " \"\"\"Concatenate np.array instances `u` and `v` into a new np.array\"\"\"\n", " return np.concatenate((u, v))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$\\textbf{combine}$ could be concatenation as in `vec_concatenate`, or vector average, vector difference, etc. (even combinations of those) — there's lots of space for experimentation here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building datasets for experiments" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As usual, we define a function that featurizes the data (here, according to `vector_func` and `vector_combo_func`) and puts it into the right format for optimization." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def build_dataset(\n", " wordentail_data, \n", " vector_func=randvec, \n", " vector_combo_func=vec_concatenate): \n", " \"\"\"\n", " Parameters\n", " ---------- \n", " wordentail_data\n", " The pickled dataset at `wordentail_data_filename`.\n", " \n", " vector_func : (default: `randvec`)\n", " Any function mapping words in the vocab for `wordentail_data`\n", " to vector representations\n", " \n", " vector_combo_func : (default: `vec_concatenate`)\n", " Any function for combining two vectors into a new vector\n", " of fixed dimensionality.\n", " \n", " Returns\n", " -------\n", " dataset : defaultdict\n", " A map from split names (\"train\", \"test\", \"disjoint_vocab_test\")\n", " into data instances:\n", " \n", " {'train': [(vec, [cls]), (vec, [cls]), ...],\n", " 'test': [(vec, [cls]), (vec, [cls]), ...],\n", " 'disjoint_vocab_test': [(vec, [cls]), (vec, [cls]), ...]}\n", " \n", " \"\"\"\n", " # Load in the dataset:\n", " vocab, splits = wordentail_data\n", " # A mapping from words (as strings) to their vector\n", " # representations, as determined by vector_func:\n", " vectors = {w: vector_func(w) for w in vocab}\n", " # Dataset in the format required by the neural network:\n", " dataset = defaultdict(list)\n", " for split, data in splits.items():\n", " for clsname, word_pairs in data.items():\n", " for w1, w2 in word_pairs:\n", " # Use vector_combo_func to combine the word vectors for\n", " # w1 and w2, as given by the vectors dictionary above,\n", " # and pair it with the singleton array containing clsname:\n", " item = [vector_combo_func(vectors[w1], vectors[w2]), \n", " np.array([clsname])]\n", " dataset[split].append(item)\n", " return dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Running experiments" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function `experiment` trains its `network` parameters on `dataset['train']` and then evaluates its performance on all three splits:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def experiment(dataset, network): \n", " \"\"\"Train and evaluation code for the word-level entailment task.\n", " \n", " Parameters\n", " ---------- \n", " dataset : dict\n", " With keys 'train', 'test', and 'disjoint_vocab_test', each with \n", " values that are lists of vector pairs, the first giving the \n", " example representation and the second giving its 1d output vector. \n", " The expectation is that this was created by `build_dataset`.\n", " \n", " network\n", " This will be `ShallowNeuralNetwork` or `TfShallowNeuralNetwork`\n", " below, but it could be any function that can train and \n", " evaluate on `dataset`. The needed methods are `fit` and\n", " `predict`.\n", " \n", " Prints\n", " ------\n", " To standard ouput\n", " An sklearn classification report for all three splits.\n", " \n", " \"\"\" \n", " # Train the network:\n", " network.fit(dataset['train'])\n", " # The following is evaluation code. You won't have to alter it\n", " # unless you did something unexpected like transform the output\n", " # variables before training.\n", " for typ in ('train', 'test', 'disjoint_vocab_test'):\n", " data = dataset[typ]\n", " predictions = []\n", " cats = []\n", " for ex, cat in data: \n", " # The raw prediction is a singleton list containing a float,\n", " # either -1 or 1. 
We want only its contents:\n", " prediction = network.predict(ex)[0]\n", " # Categorize the prediction for accuracy comparison:\n", " prediction = SUPERSET if prediction <= 0.0 else SUBSET \n", " predictions.append(prediction)\n", " # Store the gold label for the classification report:\n", " cats.append(cat[0])\n", " # Report:\n", " print(\"=\"*70)\n", " print(typ)\n", " print(classification_report(cats, predictions, target_names=['SUPERSET', 'SUBSET']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a baseline experiment run:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "completed iteration 100; error is 70.0256787791" ] }, { "name": "stdout", "output_type": "stream", "text": [ "======================================================================\n", "train\n", " precision recall f1-score support\n", "\n", " SUPERSET 0.99 0.99 0.99 2000\n", " SUBSET 0.99 0.99 0.99 2000\n", "\n", "avg / total 0.99 0.99 0.99 4000\n", "\n", "======================================================================\n", "test\n", " precision recall f1-score support\n", "\n", " SUPERSET 0.89 0.85 0.87 200\n", " SUBSET 0.86 0.89 0.87 200\n", "\n", "avg / total 0.87 0.87 0.87 400\n", "\n", "======================================================================\n", "disjoint_vocab_test\n", " precision recall f1-score support\n", "\n", " SUPERSET 0.51 0.55 0.53 49\n", " SUBSET 0.51 0.47 0.49 49\n", "\n", "avg / total 0.51 0.51 0.51 98\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "baseline_dataset = build_dataset(\n", " wordentail_data, \n", " vector_func=randvec, \n", " vector_combo_func=vec_concatenate)\n", "\n", "baseline_network = ShallowNeuralNetwork()\n", "\n", "experiment(baseline_dataset, baseline_network)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Shallow neural networks in TensorFlow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now translate `ShallowNeuralNetwork` into TensorFlow. TensorFlow is a powerful library for building deep learning models. In essence, you define the model architecture and it handles the details of optimization. In addition, it is very high-performance, so it will scale to large datasets and complicated model designs. The full implementation is in [shallow_neural_networks.py](shallow_neural_networks.py), as `TfShallowNeuralNetwork`. It's even less code than our `ShallowNeuralNetwork`!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a baseline run with this new network, using `baseline_dataset` as created above for our other baseline experiment." 
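] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before running it, here's a rough sketch of the kind of computation graph such a network builds. This is not the actual `TfShallowNeuralNetwork` code, just an illustration in the same TensorFlow idiom, with placeholder dimensions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# A sketch only: not the actual TfShallowNeuralNetwork code.\n", "# The dimensions m, n, p are placeholder choices.\n", "try:\n", "    import tensorflow as tf\n", "    m, n, p = 100, 40, 1\n", "    x = tf.placeholder(tf.float32, [1, m])   # input layer\n", "    y_ = tf.placeholder(tf.float32, [1, p])  # gold label\n", "    W1 = tf.Variable(tf.random_uniform([m, n], -0.5, 0.5))\n", "    b1 = tf.Variable(tf.zeros([n]))\n", "    W2 = tf.Variable(tf.random_uniform([n, p], -0.5, 0.5))\n", "    b2 = tf.Variable(tf.zeros([p]))\n", "    h = tf.tanh(tf.matmul(x, W1) + b1)       # h = tanh(xW1 + b1)\n", "    y = tf.tanh(tf.matmul(h, W2) + b2)       # y = tanh(hW2 + b2)\n", "    cost = tf.reduce_sum(tf.square(y - y_))  # squared-error cost\n", "    train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cost)\n", "except ImportError:\n", "    pass"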
] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "try:\n", " import tensorflow \n", "except:\n", " print(\"Warning: TensorFlow is not installed, so you won't be able to use `TfShallowNeuralNetwork`.\")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "======================================================================\n", "train\n", " precision recall f1-score support\n", "\n", " SUPERSET 0.95 0.95 0.95 2000\n", " SUBSET 0.95 0.95 0.95 2000\n", "\n", "avg / total 0.95 0.95 0.95 4000\n", "\n", "======================================================================\n", "test\n", " precision recall f1-score support\n", "\n", " SUPERSET 0.86 0.84 0.85 200\n", " SUBSET 0.85 0.86 0.85 200\n", "\n", "avg / total 0.85 0.85 0.85 400\n", "\n", "======================================================================\n", "disjoint_vocab_test\n", " precision recall f1-score support\n", "\n", " SUPERSET 0.54 0.59 0.56 49\n", " SUBSET 0.55 0.49 0.52 49\n", "\n", "avg / total 0.54 0.54 0.54 98\n", "\n" ] } ], "source": [ "# Let's not try to run this if `tensorflow` isn't available:\n", "if 'tensorflow' in sys.modules:\n", " \n", " from shallow_neural_networks import TfShallowNeuralNetwork\n", " \n", " baseline_tfnetwork = TfShallowNeuralNetwork()\n", " \n", " experiment(baseline_dataset, baseline_tfnetwork)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## In-class bake-off" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "__The goal__: achieve the highest average F1 score __on disjoint_vocab_test__.\n", "\n", "__Notes__\n", "\n", "* You must train only on the `train` split. No outside training instances can be brought in. You can, though, bring in outside information via your input vectors, as long as this information is not from `test` or `disjoint_vocab_test`.\n", "\n", "* Since the evaluation is for `disjoint_vocab_test`, you're not going to get very far with random input vectors! A GloVe featurizer is defined above ([`glove50vec`](#Representing-words)). Feel free to look around for new word vectors on the Web, or even train your own using our `vsm` notebook.\n", "\n", "* You're not required to stick to the network structures defined above. For instance, you could create deeper versions of them. As long as you have `fit` and `predict` methods with the same input and output types as our networks, you should be able to use `experiment`. Using `experiment` is not a requirement, though.\n", "\n", "At the end of class, bring your score to one of the teaching team. We'll report the results in the class discussion forum." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }