{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "RNN Character Model\n", "===========\n", "\n", "This example trains a RNN to predict the next character in a sequence. Sampling from the trained model produces somewhat intelligble text, with vocabulary and style resembling the training corpus. For more background and details:\n", "- http://karpathy.github.io/2015/05/21/rnn-effectiveness/\n", "- https://github.com/karpathy/char-rnn\n", "\n", "The data used for training is a collection of patent claims obtained from\n", "http://www.cl.uni-heidelberg.de/statnlpgroup/pattr/" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Couldn't import dot_parser, loading of dot files will not be possible.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Using gpu device 0: Graphics Device (CNMeM is disabled, CuDNN 4007)\n", "/home/eben/venv/local/lib/python2.7/site-packages/Theano-0.8.0rc1-py2.7.egg/theano/tensor/signal/downsample.py:6: UserWarning: downsample module has been moved to the theano.tensor.signal.pool module.\n", " \"downsample module has been moved to the theano.tensor.signal.pool module.\")\n" ] } ], "source": [ "import numpy as np\n", "import theano\n", "import theano.tensor as T\n", "import lasagne\n", "from lasagne.utils import floatX\n", "\n", "import pickle\n", "import gzip\n", "import random\n", "from collections import Counter" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Load the corpus and look at an example\n", "corpus = gzip.open('claims.txt.gz').read()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'A fixed spool fishing reel according to claim 2, characterised in that the cog wheel (236) of greater diameter has in the range from ten teeth to sixteen teeth, and the cog wheel (232) of smaller diameter has in the range from five teeth to ten teeth. '" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corpus.split('\\n')[0]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Find the set of characters used in the corpus and construct mappings between characters,\n", "# integer indices, and one hot encodings\n", "VOCABULARY = set(corpus)\n", "VOCAB_SIZE = len(VOCABULARY)\n", "\n", "CHAR_TO_IX = {c: i for i, c in enumerate(VOCABULARY)}\n", "IX_TO_CHAR = {i: c for i, c in enumerate(VOCABULARY)}\n", "CHAR_TO_ONEHOT = {c: np.eye(VOCAB_SIZE)[i] for i, c in enumerate(VOCABULARY)}" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "SEQUENCE_LENGTH = 50\n", "BATCH_SIZE = 50\n", "RNN_HIDDEN_SIZE = 200" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Reserve 10% of the data for validation\n", "train_corpus = corpus[:(len(corpus) * 9 // 10)]\n", "val_corpus = corpus[(len(corpus) * 9 // 10):]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Our batch generator will yield sequential portions of the corpus of size SEQUENCE_LENGTH,\n", "# starting from random locations and wrapping around the end of the data.\n", "def data_batch_generator(corpus, size=BATCH_SIZE):\n", " startidx = np.random.randint(0, len(corpus) - SEQUENCE_LENGTH - 1, size=size)\n", "\n", " while True:\n", " items = np.array([corpus[start:start + SEQUENCE_LENGTH + 1] for start in startidx])\n", " startidx = (startidx + SEQUENCE_LENGTH) % (len(corpus) - SEQUENCE_LENGTH - 1)\n", " yield items" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['(27), and (i) calibrating the measurement level obt']\n", "['tained in step (h) on the basis of the output signa']\n", "['al level obtained in step (h). \\nA method for manufa']\n" ] } ], "source": [ "# Test it out\n", "gen = data_batch_generator(corpus, size=1)\n", "print(next(gen))\n", "print(next(gen))\n", "print(next(gen))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# After sampling a data batch, we transform it into a one hot feature representation\n", "# and create a target sequence by shifting by one character\n", "def prep_batch_for_network(batch):\n", " x_seq = np.zeros((len(batch), SEQUENCE_LENGTH, VOCAB_SIZE), dtype='float32')\n", " y_seq = np.zeros((len(batch), SEQUENCE_LENGTH), dtype='int32')\n", "\n", " for i, item in enumerate(batch):\n", " for j in range(SEQUENCE_LENGTH):\n", " x_seq[i, j] = CHAR_TO_ONEHOT[item[j]]\n", " y_seq[i, j] = CHAR_TO_IX[item[j + 1]]\n", "\n", " return x_seq, y_seq" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Symbolic variables for input. In addition to the usual features and target,\n", "# we need initial values for the RNN layer's hidden states\n", "x_sym = T.tensor3()\n", "y_sym = T.imatrix()\n", "hid_init_sym = T.matrix()\n", "hid2_init_sym = T.matrix()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [], "source": [ "l_input = lasagne.layers.InputLayer((None, SEQUENCE_LENGTH, VOCAB_SIZE))\n", "l_input_hid = lasagne.layers.InputLayer((None, RNN_HIDDEN_SIZE))\n", "l_input_hid2 = lasagne.layers.InputLayer((None, RNN_HIDDEN_SIZE))\n", "\n", "# Our network has two stacked GRU layers processing the input sequence.\n", "l_rnn = lasagne.layers.GRULayer(l_input,\n", " num_units=RNN_HIDDEN_SIZE,\n", " grad_clipping=5.,\n", " hid_init=l_input_hid,\n", " )\n", "\n", "l_rnn2 = lasagne.layers.GRULayer(l_rnn,\n", " num_units=RNN_HIDDEN_SIZE,\n", " grad_clipping=5.,\n", " hid_init=l_input_hid2,\n", " )\n", "\n", "\n", "l_shp = lasagne.layers.ReshapeLayer(l_rnn2, (-1, RNN_HIDDEN_SIZE))\n", "\n", "# Before the decoder layer, we need to reshape the sequence into the batch dimension,\n", "# so that timesteps are decoded independently.\n", "l_decoder = lasagne.layers.DenseLayer(l_shp,\n", " num_units=VOCAB_SIZE,\n", " nonlinearity=lasagne.nonlinearities.softmax)\n", "\n", "l_out = lasagne.layers.ReshapeLayer(l_decoder, (-1, SEQUENCE_LENGTH, VOCAB_SIZE))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# We extract the hidden state of each GRU layer as well as the output of the decoder.\n", "# Only the hidden state at the last timestep is needed\n", "hid_out, hid2_out, prob_out = lasagne.layers.get_output([l_rnn, l_rnn2, l_out],\n", " {l_input: x_sym,\n", " l_input_hid: hid_init_sym,\n", " l_input_hid2: hid2_init_sym,\n", " })\n", "\n", "hid_out = hid_out[:, -1]\n", "hid2_out = hid2_out[:, -1]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# We flatten the sequence into the batch dimension before calculating the loss\n", "def calc_cross_ent(net_output, targets):\n", " preds = T.reshape(net_output, (-1, VOCAB_SIZE))\n", " targets = T.flatten(targets)\n", " cost = T.nnet.categorical_crossentropy(preds, targets)\n", " return cost\n", "\n", "loss = T.mean(calc_cross_ent(prob_out, y_sym))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For stability during training, gradients are clipped and a total gradient norm constraint is also used\n", "MAX_GRAD_NORM = 15\n", "\n", "all_params = lasagne.layers.get_all_params(l_out, trainable=True)\n", "\n", "all_grads = T.grad(loss, all_params)\n", "all_grads = [T.clip(g, -5, 5) for g in all_grads]\n", "all_grads, norm = lasagne.updates.total_norm_constraint(\n", " all_grads, MAX_GRAD_NORM, return_norm=True)\n", "\n", "updates = lasagne.updates.adam(all_grads, all_params, learning_rate=0.002)\n", "\n", "f_train = theano.function([x_sym, y_sym, hid_init_sym, hid2_init_sym],\n", " [loss, norm, hid_out, hid2_out],\n", " updates=updates\n", " )\n", "\n", "f_val = theano.function([x_sym, y_sym, hid_init_sym, hid2_init_sym], [loss, hid_out, hid2_out])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Training takes a while - you may want to skip this and the next cell, and load the pretrained weights instead\n", "hid = np.zeros((BATCH_SIZE, RNN_HIDDEN_SIZE), dtype='float32')\n", "hid2 = np.zeros((BATCH_SIZE, RNN_HIDDEN_SIZE), dtype='float32')\n", "\n", "train_batch_gen = data_batch_generator(train_corpus)\n", "\n", "for iteration in range(20000):\n", " x, y = prep_batch_for_network(next(train_batch_gen))\n", " loss_train, norm, hid, hid2 = f_train(x, y, hid, hid2)\n", " \n", " if iteration % 250 == 0:\n", " print('Iteration {}, loss_train: {}, norm: {}'.format(iteration, loss_train, norm))" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [], "source": [ "param_values = lasagne.layers.get_all_param_values(l_out)\n", "d = {'param values': param_values,\n", " 'VOCABULARY': VOCABULARY, \n", " 'CHAR_TO_IX': CHAR_TO_IX,\n", " 'IX_TO_CHAR': IX_TO_CHAR,\n", " }\n", "#pickle.dump(d, open('gru_2layer_trained.pkl','w'), protocol=pickle.HIGHEST_PROTOCOL)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Load pretrained weights into network\n", "d = pickle.load(open('gru_2layer_trained.pkl', 'r'))\n", "lasagne.layers.set_all_param_values(l_out, d['param values'])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [], "source": [ "predict_fn = theano.function([x_sym, hid_init_sym, hid2_init_sym], [prob_out, hid_out, hid2_out])" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.896837\n" ] } ], "source": [ "# Calculate validation loss\n", "hid = np.zeros((BATCH_SIZE, RNN_HIDDEN_SIZE), dtype='float32')\n", "hid2 = np.zeros((BATCH_SIZE, RNN_HIDDEN_SIZE), dtype='float32')\n", "\n", "val_batch_gen = data_batch_generator(val_corpus)\n", "\n", "losses = []\n", "\n", "for iteration in range(50):\n", " x, y = prep_batch_for_network(next(val_batch_gen))\n", " loss_val, hid, hid2 = f_val(x, y, hid, hid2)\n", " losses.append(loss_val)\n", "print(np.mean(losses))" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For faster sampling, we rebuild the network with a sequence length of 1\n", "l_input = lasagne.layers.InputLayer((None, 1, VOCAB_SIZE))\n", "l_input_hid = lasagne.layers.InputLayer((None, RNN_HIDDEN_SIZE))\n", "l_input_hid2 = lasagne.layers.InputLayer((None, RNN_HIDDEN_SIZE))\n", "\n", "# Our network has two stacked GRU layers processing the input sequence.\n", "l_rnn = lasagne.layers.GRULayer(l_input,\n", " num_units=RNN_HIDDEN_SIZE,\n", " grad_clipping=5.,\n", " hid_init=l_input_hid,\n", " )\n", "\n", "l_rnn2 = lasagne.layers.GRULayer(l_rnn,\n", " num_units=RNN_HIDDEN_SIZE,\n", " grad_clipping=5.,\n", " hid_init=l_input_hid2,\n", " )\n", "\n", "\n", "l_shp = lasagne.layers.ReshapeLayer(l_rnn2, (-1, RNN_HIDDEN_SIZE))\n", "\n", "l_decoder = lasagne.layers.DenseLayer(l_shp,\n", " num_units=VOCAB_SIZE,\n", " nonlinearity=lasagne.nonlinearities.softmax)\n", "\n", "l_out = lasagne.layers.ReshapeLayer(l_decoder, (-1, 1, VOCAB_SIZE))\n", "\n", "hid_out, hid2_out, prob_out = lasagne.layers.get_output([l_rnn, l_rnn2, l_out],\n", " {l_input: x_sym,\n", " l_input_hid: hid_init_sym,\n", " l_input_hid2: hid2_init_sym,\n", " })\n", "hid_out = hid_out[:, -1]\n", "hid2_out = hid2_out[:, -1]\n", "prob_out = prob_out[0, -1]" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [], "source": [ "lasagne.layers.set_all_param_values(l_out, d['param values'])" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [], "source": [ "predict_fn = theano.function([x_sym, hid_init_sym, hid2_init_sym], [prob_out, hid_out, hid2_out])" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# We will use random sentences from the validation corpus to 'prime' the network\n", "primers = val_corpus.split('\\n')" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PRIMER: The connector assembly of claim 5, wherein said connector body includes means (56) for terminating two optical fibers and said connector body includes a pair of cantilevered arms (66) on each of two opposed sides thereof with a latching nub (70) attached to the free end (69) of each, and said housing including two openings on each of two opposed sides thereof. \n", "\n", "GENERATED: A spring mechanism with speciring machine comprising a first channel (130) to form step (IO, LS) of said second cascade (18) onlocing up to 0.65 and D. \n", "\n" ] } ], "source": [ "# We feed character one at a time from the priming sequence into the network.\n", "# To obtain a sample string, at each timestep we sample from the output probability distribution,\n", "# and feed the chosen character back into the network. We terminate after the first linebreak.\n", "sentence = ''\n", "hid = np.zeros((1, RNN_HIDDEN_SIZE), dtype='float32')\n", "hid2 = np.zeros((1, RNN_HIDDEN_SIZE), dtype='float32')\n", "x = np.zeros((1, 1, VOCAB_SIZE), dtype='float32')\n", "\n", "primer = np.random.choice(primers) + '\\n'\n", "\n", "for c in primer:\n", " p, hid, hid2 = predict_fn(x, hid, hid2)\n", " x[0, 0, :] = CHAR_TO_ONEHOT[c]\n", " \n", "for _ in range(500):\n", " p, hid, hid2 = predict_fn(x, hid, hid2)\n", " p = p/(1 + 1e-6)\n", " s = np.random.multinomial(1, p)\n", " sentence += IX_TO_CHAR[s.argmax(-1)]\n", " x[0, 0, :] = s\n", " if sentence[-1] == '\\n':\n", " break\n", " \n", "print('PRIMER: ' + primer)\n", "print('GENERATED: ' + sentence)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercises\n", "=====\n", "\n", "1. Implement sampling using the \"temperature softmax\": $$p(i) = \\frac{e^{\\frac{z_i}{T}}}{\\Sigma_k e^{\\frac{z_k}{T}}}$$\n", "\n", "This generalizes the softmax with a parameter $T$ which affects the \"sharpness\" of the distribution. Lowering $T$ will make samples less error-prone but more repetitive. " ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Uncomment and run this cell for a solution\n", "#%load spoilers/tempsoftmax.py" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }