{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2016-12-16T12:36:43.740813", "start_time": "2016-12-16T12:36:42.677256" }, "collapsed": true, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "#reveal configuration\n", "from notebook.services.config import ConfigManager\n", "cm = ConfigManager()\n", "cm.update('livereveal', {\n", " 'theme': 'white',\n", " 'transition': 'none',\n", " 'controls': 'false',\n", " 'progress': 'true',\n", "})\n", "\n", "import tensorflow as tf\n", "import numpy as np\n", "\n", "%load_ext tikzmagic" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2016-12-16T12:36:43.763888", "start_time": "2016-12-16T12:36:43.742208" }, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "application/javascript": [ "require(['base/js/utils'],\n", "function(utils) {\n", " utils.load_extensions('calico-spell-check', 'calico-document-tools', 'calico-cell-tools');\n", "});" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%javascript\n", "require(['base/js/utils'],\n", "function(utils) {\n", " utils.load_extensions('calico-spell-check', 'calico-document-tools', 'calico-cell-tools');\n", "});" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2016-12-16T12:36:44.433769", "start_time": "2016-12-16T12:36:44.429580" }, "hide_input": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "slide" } }, "source": [ "# Deep Learning for Natural Language Processing" ] }, { "cell_type": "markdown", "metadata": { "hidden": true, "slideshow": { "slide_type": "notes" } }, "source": [ "In this chapter we will look into applications of Recurrent Neural Networks (RNNs) for NLP. Deep Learning and RNNs are currently used in many tasks, for instance, machine translation, machine comprehension, question answering, text summarization and generation. In this chapter we take recognizing textual entailment (RTE) as an example task and further look into several tools that help training RNNs for this task." ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2016-12-02T16:54:38.546624", "start_time": "2016-12-02T16:54:38.540051" }, "cell_style": "center", "slideshow": { "slide_type": "slide" } }, "source": [ "# Outline\n", "- Use-case: Recognizing Textual Entailment\n", " - Conditional Encoding\n", " - Attention\n", "- Bag of Tricks\n", " - Continuous Optimization \n", " - Regularization \n", " - Hyper-parameter Optimization\n", " - Pre-trained Representations\n", " - Batching\n", " - Bucketing\n", " - TensorFlow `dynamic_rnn` \n", " - Bi-directional RNNs" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Recognizing Textual Entailment (RTE)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "RTE is the task of determining the logical relationship between two natural language sentences. 
Although this can simply be seen as a classification problem, it is a very difficult task, as it requires commonsense and background knowledge about the world, some understanding of natural language, as well as a form of fine-grained reasoning." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "- **A wedding party is taking pictures**\n", "  - There is a funeral\t\t\t\t\t: **Contradiction**\n", "  - They are outside\t\t\t\t\t: **Neutral**\n", "  - Someone got married\t\t\t\t : **Entailment**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### State of the Art until 2015" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Lai and Hockenmaier 2014, Jimenez et al. 2014, Zhao et al. 2014, Beltagy et al. 2015, etc.]\n", "\n", "- Engineered natural language processing pipelines\n", "- Various external resources\n", "- Specialized subcomponents\n", "- Extensive manual creation of **features**:\n", "  - Negation detection, word overlap, part-of-speech tags, dependency parses, alignment, unaligned matching, chunk alignment, synonym, hypernym, antonym, denotation graph" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Neural Networks for RTE" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "As shown above, models for RTE heavily relied on engineered features. One reason why neural networks were not applied to this task is the absence of high-quality large-scale RTE corpora." ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "-" } }, "source": [ "**Previous RTE corpora**:\n", "- Tiny data sets (1k-10k examples)\n", "- Partly synthetic examples" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "slideshow": { "slide_type": "fragment" } }, "source": [ "**Stanford Natural Language Inference Corpus (SNLI)**:\n", "- 500k sentence pairs\n", "- Two orders of magnitude larger than existing RTE data sets\n", "- All examples generated by humans\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Independent Sentence Encoding" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2016-12-11T15:48:37.753981", "start_time": "2016-12-11T15:48:37.702704" }, "slideshow": { "slide_type": "notes" } }, "source": [ "With the introduction of a large-scale RTE corpus, training neural networks became feasible. A first baseline model by Bowman et al. (2015) used LSTMs to encode the premise and hypothesis independently of each other." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Bowman et al., 2015]\n", "\n", "Same LSTM encodes premise and hypothesis\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": { "hide_input": true, "slideshow": { "slide_type": "slide" } }, "source": [ "### Independent Sentence Encoding" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "One way an RNN can encode a sentence is to apply the RNN to the sequence of tokens and subsequently take the last output vector as the representation of the sentence. After training the RNN, the hope is that this last output vector captures the semantics of the sentence needed for solving the downstream task (RTE in this case). A minimal sketch of this encoding step is shown below."
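] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The snippet below is a minimal sketch of this encoding step (our illustration, not the exact model of Bowman et al.): we run an LSTM over a batch of already-embedded token sequences and take the last output vector as the sentence representation. All dimensions and the random input are arbitrary choices for illustration." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "with tf.Graph().as_default():\n", "    embedding_size = 4; hidden_size = 3; batch_size = 2; max_length = 5\n", "    cell = tf.nn.rnn_cell.LSTMCell(hidden_size)\n", "    # [batch_size x max_length x embedding_size] batch of embedded token sequences\n", "    seq_embedded = tf.placeholder(tf.float32, [None, None, embedding_size], \"seq_embedded\")\n", "    outputs, states = tf.nn.dynamic_rnn(cell, seq_embedded, dtype=tf.float32)\n", "    # last output vector as the sentence representation (assumes no padding)\n", "    sentence = outputs[:, -1, :]\n", "    with tf.Session() as sess:\n", "        sess.run(tf.global_variables_initializer())\n", "        print(sess.run(sentence, {seq_embedded: np.random.randn(batch_size, max_length, embedding_size)}))"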
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Bowman et al, 2015]\n", "\n", "Same LSTM encodes premise and hypothesis\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "> You can’t cram the meaning of a whole\n", "%&!\\$# sentence into a single \\$&!#* vector!\n", ">\n", "> -- Raymond J. Mooney" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "The problem with this approach is that encoding a whole sentence into a single dense vector has obvious representational limitations. However, it is a simple model that sometimes works surprisingly well in practice and often serves as a baseline for more complex architectures." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Independent Sentence Encoding" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Once both sentences are encoded as vectors, we can use a multi-layer perceptron followed by a softmax to model the probability of each class (entailment, neutral, contradiction) given the two sentences." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Bowman et al, 2015]\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": { "hide_input": true, "slideshow": { "slide_type": "slide" } }, "source": [ "#### TensorFlow: Multi-layer Perceptron" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Below we show simple TensorFlow code for a multi-layer perceptron. Note that we sample two random sentence representations, whereas in practice we would use the output of an RNN as explained earlier." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2016-12-07T16:54:15.346133", "start_time": "2016-12-07T16:54:14.886423" }, "cell_style": "center", "hide_input": false, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 0.28892419 0.30673885 0.40433696]\n" ] } ], "source": [ "with tf.Graph().as_default():\n", " def mlp(input_vector, layers=3, hidden_dim=200, output_dim=3): \n", " # [input_size] => [input_size x 1] (column vector)\n", " tmp = tf.expand_dims(input_vector, 1)\n", " for i in range(layers+1):\n", " W = tf.get_variable(\n", " \"W_\"+str(i), [hidden_dim, hidden_dim])\n", " # tanh(Wx^T)\n", " tmp = tf.tanh(tf.matmul(W, tmp))\n", " W = tf.get_variable(\n", " \"W_\"+str(layers+1), [output_dim, hidden_dim])\n", " # [input_size x 1] => [input_size]\n", " return tf.squeeze(tf.matmul(W, tmp))\n", " premise = tf.placeholder(tf.float32, [None], \"premise\")\n", " hypothesis = tf.placeholder(tf.float32, [None], \"hypothesis\")\n", " output = tf.nn.softmax(mlp(tf.concat([premise, hypothesis], axis=0)))\n", " # in practice: outputs of an LSTM\n", " v1 = np.random.rand(100); v2 = np.random.rand(100)\n", " with tf.Session() as sess:\n", " sess.run(tf.global_variables_initializer())\n", " print(sess.run(output, {premise: v1, hypothesis: v2}))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Results\n", "\n", "| Model | k | θW+M | θM | Train | Dev | Test |\n", "|-|-|-|-|-|-|-|\n", "| LSTM [Bowman et al.] 
| 100 | \\\\(\\approx\\\\)10M | 221k | 84.4 | - | 77.6|\n", "| Classifier [Bowman et al.]| - | - | - | 99.7 | - | 78.2|" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "In fact, this simple way of encoding the relationship between two sentences is still inferior compared to a classifier with hand-engineered features. Next, we will investigate slightly more complex models." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Conditional Endcoding\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "If we think more about the problem, it does not seem natural to encode both sentences independently. Instead, the way we read the hypothesis could be influenced by our understanding of the premise. For designing a model we could thus use two LSTMs, one for the premise and one for the hypothesis, where the initial state of the hypothesis model is initialized by the hidden state of the premise model." ] }, { "cell_type": "markdown", "metadata": { "hide_input": true }, "source": [ "\\begin{align}\n", "\\text{softmax}(\\text{tanh}(\\mathbf{W}\\mathbf{h}_N))\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Instead of a concatenation of the premise and hypothesis representation, we now simply project the last output representation into the space of the three target classes and apply the softmax function." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Conditional Endcoding\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "This has the advantage that we do not need to encode both sentences as fixed vector representations anymore. While the premise still needs to be encoded (last hidden state, $\\mathbf{c}_5$), the LSTM used for the hypothesis can use all its parameters to track the state of the sequence comparison for the final prediction. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\\begin{align}\n", "\\text{softmax}(\\text{tanh}(\\mathbf{W}\\mathbf{h}_N))\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": { "hide_input": true, "slideshow": { "slide_type": "slide" } }, "source": [ "## Results\n", "\n", "| Model | k | θW+M | θM | Train | Dev | Test |\n", "|-|-|-|-|-|-|-|\n", "| LSTM [Bowman et al.] | 100 | \\\\(\\approx\\\\)10M | 221k | 84.4 | - | 77.6|\n", "| Classifier [Bowman et al.]| - | - | - | 99.7 | - | 78.2|\n", "| Conditional Endcoding | 159 | 3.9M | 252k | 84.4 | 83.0 | 81.4|" ] }, { "cell_type": "markdown", "metadata": { "hide_input": true, "slideshow": { "slide_type": "slide" } }, "source": [ "### Attention [Graves 2013, Bahdanau et al. 2015]\n", "\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", "\\begin{align}\n", " \\mathbf{M} &= \\tanh(\\mathbf{W}^y\\mathbf{Y}+ \\mathbf{W}^h\\mathbf{h}_N\\mathbf{1}^T_L)&\\mathbf{M}&\\in\\mathbb{R}^{k \\times L}\\\\\n", " \\alpha &= \\text{softmax}(\\mathbf{w}^T\\mathbf{M})&\\alpha&\\in\\mathbb{R}^L\\\\\n", " \\mathbf{r} &= \\mathbf{Y}\\alpha^T&\\mathbf{r}&\\in\\mathbb{R}^k\n", "\\end{align}\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Conditional encoding improves generalization compared to independent encoding of premise and hypothesis. Another improvement can be achieved by using a neural attention mechanism. To this end we compare the last output vector ($\\mathbf{h}_N$) with all premise output vectors (concatenated to $\\mathbf{Y}$) and model a probability distribution $\\alpha$ over premise output vectors using a softmax. Lastly, we obtain a context representation $\\mathbf{r}$ by weighting output vectors with the attention $\\alpha$, and use that context representation together with $\\mathbf{h}_N$ for prediction." ] }, { "cell_type": "markdown", "metadata": { "hide_input": true, "slideshow": { "slide_type": "slide" } }, "source": [ "### Attention [Graves 2013, Bahdanau et al. 2015]\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "
\n", "\\begin{align}\n", " \\mathbf{M} &= \\tanh(\\mathbf{W}^y\\mathbf{Y}+ \\mathbf{W}^h\\mathbf{h}_N\\mathbf{1}^T_L)&\\mathbf{M}&\\in\\mathbb{R}^{k \\times L}\\\\\n", " \\alpha &= \\text{softmax}(\\mathbf{w}^T\\mathbf{M})&\\alpha&\\in\\mathbb{R}^L\\\\\n", " \\mathbf{r} &= \\mathbf{Y}\\alpha^T&\\mathbf{r}&\\in\\mathbb{R}^k\n", "\\end{align}\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "The main benefit of attention is that we avoid encoding the entire premise in a single vector." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Contextual Understanding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Results\n", "\n", "| Model | k | θW+M | θM | Train | Dev | Test |\n", "|-|-|-|-|-|-|-|\n", "| LSTM [Bowman et al.] | 100 | \\\\(\\approx\\\\)10M | 221k | 84.4 | - | 77.6|\n", "| Classifier [Bowman et al.]| - | - | - | 99.7 | - | 78.2|\n", "| Conditional Encoding | 159 | 3.9M | 252k | 84.4 | 83.0 | 81.4|\n", "| Attention | 100 | 3.9M | 242k | 85.4 | 83.2 | 82.3 |" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Fuzzy Attention" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Word-by-word Attention [Bahdanau et al. 2015, Hermann et al. 2015, Rush et al. 2015]\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", "\\begin{align}\n", " \\mathbf{M}_t &= \\tanh(\\mathbf{W}^y\\mathbf{Y}+(\\mathbf{W}^h\\mathbf{h}_t+\\mathbf{W}^r\\mathbf{r}_{t-1})\\mathbf{1}^T_L) & \\mathbf{M}_t &\\in\\mathbb{R}^{k\\times L}\\\\\n", " \\alpha_t &= \\text{softmax}(\\mathbf{w}^T\\mathbf{M}_t)&\\alpha_t&\\in\\mathbb{R}^L\\\\\n", " \\mathbf{r}_t &= \\mathbf{Y}\\alpha^T_t + \\tanh(\\mathbf{W}^t\\mathbf{r}_{t-1})&\\mathbf{r}_t&\\in\\mathbb{R}^k\n", "\\end{align}\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "An extension to the attention mechanism described above is to allow the model to attend over premise output vectors for every word in the hypothesis. This can be achieved by a small adaption of the previous model." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Word-by-word Attention [Bahdanau et al. 2015, Hermann et al. 2015, Rush et al. 2015]\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "
\n", "\\begin{align}\n", " \\mathbf{M}_t &= \\tanh(\\mathbf{W}^y\\mathbf{Y}+(\\mathbf{W}^h\\mathbf{h}_t+\\mathbf{W}^r\\mathbf{r}_{t-1})\\mathbf{1}^T_L) & \\mathbf{M}_t &\\in\\mathbb{R}^{k\\times L}\\\\\n", " \\alpha_t &= \\text{softmax}(\\mathbf{w}^T\\mathbf{M}_t)&\\alpha_t&\\in\\mathbb{R}^L\\\\\n", " \\mathbf{r}_t &= \\mathbf{Y}\\alpha^T_t + \\tanh(\\mathbf{W}^t\\mathbf{r}_{t-1})&\\mathbf{r}_t&\\in\\mathbb{R}^k\n", "\\end{align}\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Reordering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Garbage Can = Trashcan" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Kids = Girl + Boy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Snow is outside" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Results\n", "\n", "| Model | k | θW+M | θM | Train | Dev | Test |\n", "|-|-|-|-|-|-|-|\n", "| LSTM [Bowman et al.] | 100 | \\\\(\\approx\\\\)10M | 221k | 84.4 | - | 77.6|\n", "| Classifier [Bowman et al.]| - | - | - | 99.7 | - | 78.2|\n", "| Conditional Encoding | 159 | 3.9M | 252k | 84.4 | 83.0 | 81.4|\n", "| Attention | 100 | 3.9M | 242k | 85.4 | 83.2 | 82.3 |\n", "| Word-by-word Attention | 100 | 3.9M | 252k | 85.3 | **83.7** | **83.5** |" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Bag of Tricks" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Continuous Optimization\n", "\n", "\n", "Source: Wikipedia" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2016-12-02T17:06:06.486431", "start_time": "2016-12-02T17:06:06.483991" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### Momentum\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### TensorFlow: Optimizers" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split" }, "source": [ "- GradientDescent\n", "- RMSProp\n", "- Adagrad" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split" }, "source": [ "- Adadelta\n", "- **Adam**" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2016-12-07T16:54:28.658189", "start_time": "2016-12-07T16:54:28.493181" }, "cell_style": "center", "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.201413\n", "0.183901\n", "0.167832\n", "0.153134\n", "0.139727\n", "0.127529\n", "0.116455\n", "0.106421\n", "0.0973447\n", "0.0891449\n" ] } ], "source": [ "with tf.Graph().as_default():\n", " x = tf.get_variable(\"param\", []) \n", " loss = -tf.log(tf.sigmoid(x)) # dummy example\n", " optim = tf.train.AdamOptimizer(learning_rate=0.1)\n", " min_op = optim.minimize(loss)\n", " with tf.Session() as sess:\n", " sess.run(tf.global_variables_initializer())\n", " sess.run(x.assign(1.5))\n", " for i in range(10):\n", " print(sess.run([min_op, loss], {})[1])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "There are various gradient based optimizers that you can use in TensorFlow. We recommend **Adam** with an initial learning rate of $0.1$ or $0.01$." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Gradient Clipping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2016-12-07T16:54:29.224953", "start_time": "2016-12-07T16:54:28.662320" }, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-0.182426 => -0.1\n", "-0.167982 => -0.1\n", "-0.154465 => -0.1\n", "-0.141851 => -0.1\n", "-0.130109 => -0.1\n", "-0.119203 => -0.1\n", "-0.109097 => -0.1\n", "-0.0997508 => -0.0997508\n", "-0.0911243 => -0.0911243\n", "-0.0832129 => -0.0832129\n" ] } ], "source": [ "with tf.Graph().as_default():\n", " x = tf.get_variable(\"param\", []) \n", " loss = -tf.log(tf.sigmoid(x)) # dummy example\n", " optim = tf.train.AdamOptimizer(learning_rate=0.1) \n", " gradients = optim.compute_gradients(loss)\n", " capped_gradients = \\\n", " [(tf.clip_by_value(grad, -0.1, 0.1), var) for grad, var in gradients]\n", " min_op = optim.apply_gradients(capped_gradients) \n", " with tf.Session() as sess:\n", " sess.run(tf.global_variables_initializer()); sess.run(x.assign(1.5)) \n", " for i in range(100):\n", " if i % 10 == 0:\n", " grads = sess.run([min_op, gradients, capped_gradients], {})[1:]\n", " print(\" => \".join([str(grad[0][0]) for grad in grads]))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "For more stable training, gradient clipping is often applied. It simply caps gradients at a predefined interval, or in the case of `tf.clip_by_norm`, constrains the norm of the gradient." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Regularization" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Dropout\n", "\n", "\n", "\n", "For RNNs: `tf.nn.rnn_cell.DropoutWrapper`" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2016-12-08T16:35:21.531004", "start_time": "2016-12-08T16:35:21.512108" }, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 0. 0.40834624 1.31100929 0. 1.08233535 0. ]\n" ] } ], "source": [ "with tf.Graph().as_default():\n", " x = tf.placeholder(tf.float32, [None], \"input\")\n", " x_dropped = tf.nn.dropout(x, 0.7) # keeps 70% of values\n", " \n", " with tf.Session() as sess: \n", " print(sess.run(x_dropped, {x: np.random.rand(6)}))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "One of the most successful regularization techniques is dropout. The idea is to randomly set output representations to $0$ at every execution of the computation graph. By repeating this produce, we are effectively averaging over different model architectures (see Fig b) and increase the robustness of the model. Common dropout values are $0.1$ or $0.2$, i.e., keeping 90% or 80% of the values in the output representation respectively." 
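] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Since the slide above mentions `tf.nn.rnn_cell.DropoutWrapper`, here is a minimal sketch (our illustration; sizes and keep probabilities are arbitrary) of wrapping an LSTM cell so that dropout is applied to its inputs and outputs. At test time one would use keep probabilities of $1.0$, e.g. by feeding them via placeholders." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "with tf.Graph().as_default():\n", "    input_size = 2; output_size = 3; batch_size = 5; max_length = 7\n", "    cell = tf.nn.rnn_cell.LSTMCell(output_size)\n", "    # apply dropout to the cell's inputs and outputs (keep 80% of the values)\n", "    cell = tf.nn.rnn_cell.DropoutWrapper(cell, input_keep_prob=0.8, output_keep_prob=0.8)\n", "    seq_embedded = tf.placeholder(tf.float32, [None, None, input_size], \"seq_embedded\")\n", "    outputs, states = tf.nn.dynamic_rnn(cell, seq_embedded, dtype=tf.float32)\n", "    with tf.Session() as sess:\n", "        sess.run(tf.global_variables_initializer())\n", "        print(sess.run(states.h, {seq_embedded: np.random.randn(batch_size, max_length, input_size)}))"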
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Early Stopping\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Another simple way of regularization is to stop training on the training set (blue) once overfitting on a dev set (red) occurs." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Hyper-parameter Optimization\n", "[Bergstra and Bengio 2012]\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "We often use grid search for finding a good combination of model hyper-parameters on a dev set. However, it turns out that a more effective method is random search by sampling uniform values for the hyper-parameters from a predefined interval." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Pre-trained Representations" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2016-12-09T16:25:24.379197", "start_time": "2016-12-09T16:25:24.351037" }, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 6. 7. 8.]\n" ] } ], "source": [ "with tf.Graph().as_default():\n", " vocab_size = 4; embedding_size = 3\n", " W = tf.get_variable(\"W\", [vocab_size, embedding_size], trainable=False)\n", " W = W.assign(np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]))\n", " seq = tf.placeholder(tf.int64, [None], \"seq\")\n", " seq_embedded = tf.nn.embedding_lookup(W, seq)\n", " with tf.Session() as sess:\n", " sess.run(tf.global_variables_initializer())\n", " print(sess.run(seq_embedded, {seq:[0, 3, 2, 3, 1]})[2])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Instead of training task-specific word representations, it is sometimes helpful to use pre-trained word vectors from word2vec or GloVe. Above, you can find a simply example of manually setting vectors of a word embedding matrix. Note that by setting `trainable=False`, the model cannot update these word embeddings during training. If we only want to use pre-trained word vectors as initializer so that we can fine-tune them based on our task, we would need set `trainable=True`." 
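] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "As a variation (our sketch; the made-up matrix stands in for vectors loaded from a word2vec or GloVe file), pre-trained vectors can also be passed directly as an initializer with `trainable=True`, so that they are fine-tuned during training:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "with tf.Graph().as_default():\n", "    # stand-in for vectors loaded from a word2vec/GloVe file\n", "    pretrained = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]], dtype=np.float32)\n", "    # trainable=True: the pre-trained vectors only initialize W and get fine-tuned\n", "    W = tf.get_variable(\"W\", initializer=pretrained, trainable=True)\n", "    seq = tf.placeholder(tf.int64, [None], \"seq\")\n", "    seq_embedded = tf.nn.embedding_lookup(W, seq)\n", "    with tf.Session() as sess:\n", "        sess.run(tf.global_variables_initializer())\n", "        print(sess.run(seq_embedded, {seq: [0, 3, 2, 3, 1]})[2])"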
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Projecting Representations" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2016-12-09T16:25:16.587731", "start_time": "2016-12-09T16:25:16.536600" }, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[-13.46509552 2.99458957]\n" ] } ], "source": [ "with tf.Graph().as_default():\n", " vocab_size = 4; embedding_size = 3; input_size = 2\n", " W = tf.get_variable(\"W\", [vocab_size, embedding_size], trainable=False)\n", " W = W.assign(np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]))\n", " seq = tf.placeholder(tf.int64, [None], \"seq\")\n", " seq_embedded = tf.contrib.layers.linear(tf.nn.embedding_lookup(W, seq), input_size)\n", " with tf.Session() as sess:\n", " sess.run(tf.global_variables_initializer())\n", " print(sess.run(seq_embedded, {seq:[0, 3, 2, 3, 1]})[2])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Sometimes we want to work with pre-trained word vectors (e.g. $300$ dimensional word2vec vectors), but use a different hidden size for our RNN. We can learn a linear projection from one vector space to another via `tf.contrib.layers.linear`. If we want to use a non-linear projection, we can simply call a non-linear function like `tanh` on the output of `linear`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Batching\n", "- Utilization of GPUs\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Batching is used to maximize utilization of GPUs. However, even if you run your models on CPUs, linear algebra libraries often can make use of batched computations. Furthermore, batching is needed for batch gradient descent. The idea is simple: instead of processing a single example such as a sequence of word vectors (represented as $N \\times k$ matrix above), we are going to process many of these at the same time (represented by the $mb \\times N \\times k$ order-3 tensor)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Bucketing\n", "- Increased computational efficiency by grouping data into buckets of similar length\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "When batching sequences, we will generally have sequences of different lengths in one batch. This is wasting computation on padding symbols of the shorter sequence. Hence, it is a good idea to divide training sequences up into buckets of roughly the same sequence length." 
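] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "A minimal bucketing sketch in plain Python (the `bucket_batches` helper below is our own illustration, not a TensorFlow API): sort the training sequences by length, cut the sorted list into batches, and pad only within each batch." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "def bucket_batches(seqs, batch_size):\n", "    # sort by length so that each batch contains sequences of similar length\n", "    seqs = sorted(seqs, key=len)\n", "    return [seqs[i:i + batch_size] for i in range(0, len(seqs), batch_size)]\n", "\n", "for batch in bucket_batches([[1], [2, 3], [4, 5], [6], [7, 8, 9], [10]], batch_size=2):\n", "    max_length = max(len(seq) for seq in batch)\n", "    # pad within the batch only, keeping the number of padding symbols small\n", "    print([seq + [0] * (max_length - len(seq)) for seq in batch])"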
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## TensorFlow: `dynamic_rnn`\n", "- Avoids padding (which can improve performance)\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2016-12-09T16:26:09.402343", "start_time": "2016-12-09T16:26:09.181380" }, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[-0.02504479 0.01778306 0.06259364]\n", " [-0.39973924 0.15913689 -0.24309666]\n", " [ 0.1272437 -0.11015443 0.27061945]\n", " [-0.16756612 0.08985849 -0.09469282]\n", " [-0.31299666 0.21844 -0.0656506 ]]\n" ] } ], "source": [ "with tf.Graph().as_default():\n", " input_size = 2; output_size = 3; batch_size = 5; max_length = 7\n", " cell = tf.nn.rnn_cell.LSTMCell(output_size)\n", " input_embedded = tf.placeholder(tf.float32, [None, None, input_size], \"input_embedded\")\n", " input_length = tf.placeholder(tf.int64, [None], \"input_length\")\n", " outputs, states = \\\n", " tf.nn.dynamic_rnn(cell, input_embedded, sequence_length=input_length, dtype=tf.float32) \n", " with tf.Session() as sess:\n", " sess.run(tf.global_variables_initializer())\n", " print(sess.run(states, {\n", " input_embedded: np.random.randn(batch_size, max_length, input_size), \n", " input_length: np.random.randint(1, max_length, batch_size)\n", " }).h)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "We encourage you to use `dynamic_rnn` for RNNs in TensorFlow. They avoid changes to the hidden state of the RNN once padding symbols are processed. Note that this is orthogonal to bucketing and `dynamic_rnn` still benefits from grouping sequences of similar length together in one batch." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Bi-directional RNNs\n", "`tf.nn.bidirectional_dynamic_rnn`\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Many current RNN systems use bidirectional RNNs to incorporate left-hand as well as right-hand side contexts into output representations. To this end, two independent RNNs are executed, one on the normal input sequence and one on its revers. TensorFlow is making this easy via `tf.nn.bidirectional_dynamic_rnn`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Background Material\n", "\n", "- [A Primer on Neural Network Models for Natural Language Processing](http://www.jair.org/media/4992/live-4992-9623-jair.pdf)\n", "- [Stanford Course: Deep Learning for Natural Language Processing](http://cs224d.stanford.edu/syllabus.html)\n", "- [Nando de Freitas' Deep Learning Lecture](https://www.youtube.com/watch?v=dV80NAlEins&list=PLE6Wd9FR--EfW8dtjAuPoTuPcqmOV53Fu)\n", "- [Deep Learning Book by Goodfellow et al.](http://www.deeplearningbook.org/)" ] } ], "metadata": { "celltoolbar": "Slideshow", "hide_input": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 1 }