{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Week8 bonus descriptions\n",
    "\n",
    "Here are some cool mini-projects you can try to dive deeper into the topic.\n",
    "\n",
    "## More metrics: BLEU (5+ pts)\n",
    "\n",
    "Pick BLEU or any other relevant metric, e.g. BLEU (e.g. from `nltk.bleu_score`).\n",
    "* Train model to maximize BLEU directly\n",
    "* How does levenshtein behave when maximizing BLEU and vice versa?\n",
    "* Compare this with how they behave when optimizing likelihood. \n",
    "\n",
    "(use default parameters for bleu: 4-gram, uniform weights)\n",
    "\n",
    "## Actor-critic (5+++ pts)\n",
    "\n",
    "While self-critical training provides a large reduction of gradient variance, it has a few drawbacks:\n",
    "- It requires a lot of additional computation during training\n",
    "- It doesn't adjust V(s) between decoder steps. (one value per sequence)\n",
    "\n",
    "There's a more general way of doing the same thing: learned baselines, also known as __advantage actor-critic__.\n",
    "\n",
    "There are two main ways to apply that:\n",
    "- __naive way__: compute V(s) once per training example.\n",
    "  - This only requires additional 1-unit linear dense layer that grows out of encoder, estimating V(s)\n",
    "  - (implement this to get main points)\n",
    "- __every step__: compute V(s) on each decoder step\n",
    "  - Again it's just an 1-unit dense layer (no nonlinearity), but this time it's inside decoder recurrence.\n",
    "  - (+3 pts additional for this guy)\n",
    "\n",
    "In both cases, you should train V(s) to minimize squared error $(V(s) - R(s,a))^2$ with R being actual levenshtein.\n",
    "You can then use $ A(s,a) = (R(s,a) - const(V(s))) $ for policy gradient.\n",
    "\n",
    "There's also one particularly interesting approach (+5 additional pts):\n",
    "- __combining SCST and actor-critic__:\n",
    "  - compute baseline $V(s)$ via self-critical sequence training (just like in main assignment)\n",
    "  - learn correction $ C(s,a_{:t}) = R(s,a) - V(s) $ by minimizing $(R(s,a) - V(s) - C(s,a_{:t}))^2 $\n",
    "  - use $ A(s,a_{:t}) = R(s,a) - V(s) - const(C(s,a_{:t})) $\n",
    "\n",
    "\n",
    "\n",
    "## Implement attention (5+++ pts)\n",
    "\n",
    "Some seq2seq tasks can benefit from the attention mechanism. In addition to taking the _last_ time-step of encoder hidden state, we can allow decoder to peek on any time-step of his choice.\n",
    "\n",
    "![img](https://s30.postimg.org/f8um3kt5d/google_seq2seq_attention.gif)\n",
    "\n",
    "\n",
    "#### Recommended steps:\n",
    "__1)__ Modify encoder-decoder\n",
    "\n",
    "Learn to feed the entire encoder into the decoder. You can do so by sending encoder rnn layer directly into decoder (make sure there's no `only_return_final=True` for encoder rnn layer).\n",
    "\n",
    "```\n",
    "class decoder:\n",
    "    ...\n",
    "    encoder_rnn_input = InputLayer(encoder.rnn.output_shape, name='encoder rnn input for decoder')\n",
    "    ...\n",
    "    \n",
    "#decoder Recurrence\n",
    "rec = Recurrence(...,\n",
    "                 input_nonsequences = {decoder.encoder_rnn_input: encoder.rnn},\n",
    "                 )\n",
    "\n",
    "```\n",
    "\n",
    "For starters, you can take it's last tick (via SliceLayer) inside the decoder step and feed it as input to make sure it works.\n",
    "\n",
    "__2)__ Implement attention mechanism\n",
    "\n",
    "Next thing we'll need is to implement the math of attention.\n",
    "\n",
    "The simplest way to do so is to write a special layer. We gave you a prototype and some tests below.\n",
    "\n",
    "__3)__ Use attention inside decoder\n",
    "\n",
    "That's almost it! Now use `AttentionLayer` inside the decoder and feed it to back to lstm/gru/rnn (see code demo below).\n",
    "\n",
    "Train the full network just like you did before attention.\n",
    "\n",
    "__More points__ will be awwarded for comparing learning results of attention Vs no attention.\n",
    "\n",
    "__Bonus bonus:__ visualize attention vectors (>= +3 points)\n",
    "\n",
    "The best way to make sure your attention actually works is to visualize it.\n",
    "\n",
    "A simple way to do so is to obtain attention vectors from each tick (values __right after softmax__, not the layer outputs) and drawing those as images.\n",
    "\n",
    "#### step-by-step guide:\n",
    "- split AttentionLayer into two layers: _\"from start to softmax\"_ and _\"from softmax to output\"_\n",
    "- add outputs of the first layer to recurrence's `tracked_outputs`\n",
    "- compile a function that computes them\n",
    "- plt.imshow(them)\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import theano,lasagne\n",
    "import theano.tensor as T\n",
    "from lasagne import init\n",
    "from lasagne.layers import *"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class AttentionLayer(MergeLayer):\n",
    "    def __init__(self,decoder_h,encoder_rnn):\n",
    "        #sanity checks\n",
    "        assert len(decoder_h.output_shape)==2,\"please feed decoder 1 step activation as first param \"\n",
    "        assert len(encoder_rnn.output_shape)==3, \"please feed full encoder rnn sequence as second param\"\n",
    "        \n",
    "        self.decoder_num_units = decoder_h.output_shape[-1]\n",
    "        self.encoder_num_units = encoder.output_shape[-1]\n",
    "\n",
    "        #Here you should initialize all trainable parameters.\n",
    "        #\n",
    "        \n",
    "        #use this syntax:\n",
    "        self.add_param(spec=init.Normal(std=0.01), #or other initializer\n",
    "                       shape=<shape tuple>,\n",
    "                       name='<param name here>')\n",
    "        \n",
    "        \n",
    "        MergeLayer.__init__(self,[decoder_h,encoder_rnn],name=\"attention\")\n",
    "        \n",
    "        \n",
    "    def get_output_shape_for(self,input_shapes,**kwargs):\n",
    "        \"\"\"return matrix of shape [batch_size, encoder num units]\"\"\"\n",
    "        return (None,self.encoder_num_units)\n",
    "        \n",
    "    def get_output_for(self,inputs,**kwargs):\n",
    "        \"\"\"\n",
    "        takes (decoder_h, encoder_seq)\n",
    "        decoder_h has shape [batch_size, decoder num_units]\n",
    "        encoder_seq has shape [batch_size, sequence_length, encoder num_units]\n",
    "        \n",
    "        returns attention output: matrix of shape [batch_size, encoder num units]\n",
    "        \n",
    "        please read comments carefully before you start implementing\n",
    "        \"\"\"\n",
    "        decoder_h,encoder_seq = inputs\n",
    "        \n",
    "        #get symbolic batch-size / seq length. Also don't forget self.decoder_num_units above\n",
    "        batch_size,seq_length,_ = tuple(encoder_seq.shape)\n",
    "        \n",
    "        #here's a recommended step-by-step guide for attention mechanism. \n",
    "        #You are free to ignore it alltogether if you so wish\n",
    "        \n",
    "        #we repeat decoder activations to allign with encoder\n",
    "        decoder_h_repeated = <cast decoder_h into [batch,seq_length,decoer_num_units] by \n",
    "                              repeating it _seq_length_ times>\n",
    "                             <use T.repeat and maybe some reshape>\n",
    "        # ^--shape=[batch,seq_length,decoder_n_units]\n",
    "        \n",
    "        encoder_and_decoder_together = <concatenate repeated decoder and encoder over last axis>\n",
    "        # ^--shape=[batch,seq_length,enc_n_units+dec_n_units]\n",
    "        \n",
    "        #here we flatten the tensor to simplify\n",
    "        encoder_and_decoder_flat = T.reshape(encoder_and_decoder_together,(-1,encoder_and_decoder_together.shape[-1]))\n",
    "        # ^--shape=[batch*seq_length,enc_n_units+dec_n_units]\n",
    "        \n",
    "        #here you use encoder_and_decoder_flat and some learned weights to predict attention logits\n",
    "        #don't use softmax yet\n",
    "        <your code here>\n",
    "        attention_logits_flat = <logits to be used as attention weights>\n",
    "        # ^--shape=[batch*seq_length,1]\n",
    "        \n",
    "        \n",
    "        #here we reshape flat logits back into correct form\n",
    "        assert attention_logits_flat.ndim==2\n",
    "        attention_logits = attention_logits_flat.reshape((batch_size,seq_length))\n",
    "        # ^--shape=[batch,seq_length]\n",
    "        \n",
    "        #here we apply softmax :)\n",
    "        attention = T.nnet.softmax(attention_logits)\n",
    "        # ^--shape=[batch,seq_length]\n",
    "        \n",
    "        #here we compute output\n",
    "        output = (attention[:,:,None]*encoder_seq).sum(axis=1) #sum over seq_length\n",
    "        # ^--shape=[batch,enc_n_units]\n",
    "        \n",
    "        return output\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#demo code\n",
    "\n",
    "from numpy.random import randn\n",
    "\n",
    "dec_h_prev = InputLayer((None,50),T.constant(randn(5,50)),name='decoder h mock')\n",
    "\n",
    "enc = InputLayer((None,None,32),T.constant(randn(5,20,32)),name='encoder sequence mock')\n",
    "\n",
    "attention = AttentionLayer(dec_h_prev,enc)\n",
    "\n",
    "#now you can use attention as additonal input to your decoder\n",
    "#LSTMCell(prev_cell,prev_out,input_or_inputs=(usual_input,attention))\n",
    "\n",
    "\n",
    "#sanity check\n",
    "demo_output = get_output(attention).eval()\n",
    "print 'actual shape:',demo_output.shape\n",
    "assert demo_output.shape == (5,32)\n",
    "assert np.isfinite(demo_output)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}