{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Week 8 bonus descriptions\n", "\n", "Here are some cool mini-projects you can try to dive deeper into the topic.\n", "\n", "## More metrics: BLEU (5+ pts)\n", "\n", "Pick BLEU or any other relevant metric (BLEU is available in `nltk.translate.bleu_score`).\n", "* Train the model to maximize BLEU directly.\n", "* How does Levenshtein distance behave when you maximize BLEU, and vice versa?\n", "* Compare this with how both metrics behave when optimizing likelihood.\n", "\n", "(use the default parameters for BLEU: 4-grams, uniform weights)\n", "\n", "## Actor-critic (5+++ pts)\n", "\n", "While self-critical training provides a large reduction in gradient variance, it has a few drawbacks:\n", "- It requires a lot of additional computation during training.\n", "- It doesn't adjust V(s) between decoder steps (one value per sequence).\n", "\n", "There's a more general way of doing the same thing: learned baselines, also known as __advantage actor-critic__.\n", "\n", "There are two main ways to apply it:\n", "- __naive way__: compute V(s) once per training example.\n", "  - This only requires an additional 1-unit linear dense layer that grows out of the encoder and estimates V(s).\n", "  - (implement this to get the main points)\n", "- __every step__: compute V(s) at each decoder step.\n", "  - Again, it's just a 1-unit dense layer (no nonlinearity), but this time it lives inside the decoder recurrence.\n", "  - (+3 additional pts for this option)\n", "\n", "In both cases, you should train V(s) to minimize the squared error $(V(s) - R(s,a))^2$, with R being the actual Levenshtein-based reward.\n", "You can then use $A(s,a) = R(s,a) - const(V(s))$ for the policy gradient.\n", "\n", "There's also one particularly interesting approach (+5 additional pts):\n", "- __combining SCST and actor-critic__:\n", "  - compute the baseline $V(s)$ via self-critical sequence training (just like in the main assignment)\n", "  - learn a correction $C(s,a_{:t}) = R(s,a) - V(s)$ by minimizing $(R(s,a) - V(s) - C(s,a_{:t}))^2$\n", "  - use $A(s,a_{:t}) = R(s,a) - V(s) - const(C(s,a_{:t}))$\n", "\n", "\n", "## Implement attention (5+++ pts)\n", "\n", "Some seq2seq tasks can benefit from the attention mechanism. In addition to taking the _last_ time-step of the encoder hidden state, we can allow the decoder to peek at any time-step of its choice.\n", "\n", "#### Recommended steps:\n", "__1)__ Modify the encoder-decoder\n", "\n", "Learn to feed the entire encoder sequence into the decoder. You can do so by sending the encoder rnn layer directly into the decoder (make sure there's no `only_return_final=True` for the encoder rnn layer).\n", "\n", "```\n", "class decoder:\n", "    ...\n", "    encoder_rnn_input = InputLayer(encoder.rnn.output_shape, name='encoder rnn input for decoder')\n", "    ...\n", "\n", "#decoder Recurrence\n", "rec = Recurrence(...,\n", "                 input_nonsequences = {decoder.encoder_rnn_input: encoder.rnn},\n", "                 )\n", "```\n", "\n", "For starters, you can take its last tick (via `SliceLayer`) inside the decoder step and feed it as input to make sure everything works.\n", "\n", "__2)__ Implement the attention mechanism\n", "\n", "The next thing we'll need is to implement the math of attention.\n", "\n", "The simplest way to do so is to write a special layer. We give you a prototype and some tests below.\n", "\n", "__3)__ Use attention inside the decoder\n", "\n", "That's almost it! Now use `AttentionLayer` inside the decoder and feed its output back into your lstm/gru/rnn cell (see the sketch and code demo below).\n",
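"\n", "A rough sketch of how that wiring might look, mirroring the `LSTMCell` hint in the demo cell below (`decoder.prev_hid`, `decoder.prev_cell` and `decoder.token_emb` are placeholders for whatever your decoder actually defines, and the two return values assume an LSTM-style cell):\n", "\n", "```\n", "#inside the decoder one-step graph\n", "attn = AttentionLayer(decoder.prev_hid, decoder.encoder_rnn_input)\n", "\n", "#feed the attention readout as an extra input to the recurrent cell\n", "new_cell, new_hid = LSTMCell(decoder.prev_cell, decoder.prev_hid,\n", "                             input_or_inputs=(decoder.token_emb, attn))\n", "```\n",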
"\n", "Train the full network just like you did before attention.\n", "\n", "__More points__ will be awarded for comparing the learning results of attention vs. no attention.\n", "\n", "__Bonus bonus:__ visualize attention vectors (>= +3 points)\n", "\n", "The best way to make sure your attention actually works is to visualize it.\n", "\n", "A simple way to do so is to obtain the attention vectors at each tick (the values __right after softmax__, not the layer outputs) and draw them as images.\n", "\n", "#### Step-by-step guide:\n", "- split AttentionLayer into two layers: _from start to softmax_ and _from softmax to output_\n", "- add the outputs of the first layer to the recurrence's `tracked_outputs`\n", "- compile a function that computes them\n", "- `plt.imshow` them\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import theano, lasagne\n", "import theano.tensor as T\n", "from lasagne import init\n", "from lasagne.layers import *" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class AttentionLayer(MergeLayer):\n", "    def __init__(self, decoder_h, encoder_rnn):\n", "        #sanity checks\n", "        assert len(decoder_h.output_shape) == 2, \"please feed decoder one-step activations as the first param\"\n", "        assert len(encoder_rnn.output_shape) == 3, \"please feed the full encoder rnn sequence as the second param\"\n", "        \n", "        self.decoder_num_units = decoder_h.output_shape[-1]\n", "        self.encoder_num_units = encoder_rnn.output_shape[-1]\n", "        \n", "        #initialize the base class first so that add_param below works\n", "        MergeLayer.__init__(self, [decoder_h, encoder_rnn], name=\"attention\")\n", "\n", "        #Here you should initialize all trainable parameters.\n", "        \n", "        #use this syntax:\n", "        self.add_param(spec=init.Normal(std=0.01), #or another initializer\n", "                       shape=<shape tuple>,\n", "                       name='<param name here>')\n", "        \n", "    def get_output_shape_for(self, input_shapes, **kwargs):\n", "        \"\"\"returns a matrix of shape [batch_size, encoder num units]\"\"\"\n", "        return (None, self.encoder_num_units)\n", "    \n", "    def get_output_for(self, inputs, **kwargs):\n", "        \"\"\"\n", "        takes (decoder_h, encoder_seq)\n", "        decoder_h has shape [batch_size, decoder num_units]\n", "        encoder_seq has shape [batch_size, sequence_length, encoder num_units]\n", "        \n", "        returns the attention output: a matrix of shape [batch_size, encoder num units]\n", "        \n", "        please read the comments carefully before you start implementing\n", "        \"\"\"\n", "        decoder_h, encoder_seq = inputs\n", "        \n", "        #get symbolic batch size / sequence length. Also don't forget self.decoder_num_units above\n", "        batch_size, seq_length, _ = tuple(encoder_seq.shape)\n", "        \n", "        #Here's a recommended step-by-step guide for the attention mechanism.\n",
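"        #(The scheme below is standard additive attention: a small learned transform scores each\n", "        # encoder time-step against the current decoder state, softmax turns the scores into\n", "        # weights, and the output is the weighted average of the encoder sequence.)\n",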
\n", " #You are free to ignore it alltogether if you so wish\n", " \n", " #we repeat decoder activations to allign with encoder\n", " decoder_h_repeated = <cast decoder_h into [batch,seq_length,decoer_num_units] by \n", " repeating it _seq_length_ times>\n", " <use T.repeat and maybe some reshape>\n", " # ^--shape=[batch,seq_length,decoder_n_units]\n", " \n", " encoder_and_decoder_together = <concatenate repeated decoder and encoder over last axis>\n", " # ^--shape=[batch,seq_length,enc_n_units+dec_n_units]\n", " \n", " #here we flatten the tensor to simplify\n", " encoder_and_decoder_flat = T.reshape(encoder_and_decoder_together,(-1,encoder_and_decoder_together.shape[-1]))\n", " # ^--shape=[batch*seq_length,enc_n_units+dec_n_units]\n", " \n", " #here you use encoder_and_decoder_flat and some learned weights to predict attention logits\n", " #don't use softmax yet\n", " <your code here>\n", " attention_logits_flat = <logits to be used as attention weights>\n", " # ^--shape=[batch*seq_length,1]\n", " \n", " \n", " #here we reshape flat logits back into correct form\n", " assert attention_logits_flat.ndim==2\n", " attention_logits = attention_logits_flat.reshape((batch_size,seq_length))\n", " # ^--shape=[batch,seq_length]\n", " \n", " #here we apply softmax :)\n", " attention = T.nnet.softmax(attention_logits)\n", " # ^--shape=[batch,seq_length]\n", " \n", " #here we compute output\n", " output = (attention[:,:,None]*encoder_seq).sum(axis=1) #sum over seq_length\n", " # ^--shape=[batch,enc_n_units]\n", " \n", " return output\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#demo code\n", "\n", "from numpy.random import randn\n", "\n", "dec_h_prev = InputLayer((None,50),T.constant(randn(5,50)),name='decoder h mock')\n", "\n", "enc = InputLayer((None,None,32),T.constant(randn(5,20,32)),name='encoder sequence mock')\n", "\n", "attention = AttentionLayer(dec_h_prev,enc)\n", "\n", "#now you can use attention as additonal input to your decoder\n", "#LSTMCell(prev_cell,prev_out,input_or_inputs=(usual_input,attention))\n", "\n", "\n", "#sanity check\n", "demo_output = get_output(attention).eval()\n", "print 'actual shape:',demo_output.shape\n", "assert demo_output.shape == (5,32)\n", "assert np.isfinite(demo_output)\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 2 }