{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Character-level language models\n", "\n", "This tutorial shows how to train a character-level language model with a multilayer recurrent neural network. In particular, we will train a multilayer LSTM network that is able to generate President Obama's speeches.\n", "\n", "## Prepare data\n", "We first download the dataset and show the first few characters." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Call to Renewal Keynote Address Call to Renewal Pt 1Call to Renewal Part 2 TOPIC: Our Past, Our Future & Vision for America June \n", "28, 2006 Call to Renewal' Keynote Address Complete Text Good morning. I appreciate the opportunity to speak here at the Call to R\n", "enewal's Building a Covenant for a New America conference. I've had the opportunity to take a look at your Covenant for a New Ame\n", "rica. It is filled with outstanding policies and prescriptions for much of what ails this country. So I'd like to congratulate yo\n", "u all on the thoughtful presentations you've given so far about poverty and justice in America, and for putting fire under the fe\n", "et of the political leadership here in Washington.But today I'd like to talk about the connection between religion and politics a\n", "nd perhaps offer some thoughts about how we can sort through some of the often bitter arguments that we've been seeing over the l\n", "ast several years.I do so because, as you all know, we can affirm the importance of povert\n" ] } ], "source": [ "import os\n", "import urllib\n", "import zipfile\n", "if not os.path.exists(\"char_lstm.zip\"):\n", " urllib.urlretrieve(\"http://data.mxnet.io/data/char_lstm.zip\", \"char_lstm.zip\")\n", "with zipfile.ZipFile(\"char_lstm.zip\",\"r\") as f:\n", " f.extractall(\"./\") \n", "with open('obama.txt', 'r') as f:\n", " print f.read()[0:1000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we define a few utility functions to pre-process the dataset." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "vocab size = 83\n" ] } ], "source": [ "def read_content(path):\n", " with open(path) as ins: \n", " return ins.read()\n", " \n", "# Return a dict which maps each char into an unique int id\n", "def build_vocab(path):\n", " content = list(read_content(path))\n", " idx = 1 # 0 is left for zero-padding\n", " the_vocab = {}\n", " for word in content:\n", " if len(word) == 0:\n", " continue\n", " if not word in the_vocab:\n", " the_vocab[word] = idx\n", " idx += 1\n", " return the_vocab\n", "\n", "# Encode a sentence with int ids\n", "def text2id(sentence, the_vocab):\n", " words = list(sentence)\n", " return [the_vocab[w] for w in words if len(w) > 0]\n", " \n", "# build char vocabluary from input\n", "vocab = build_vocab(\"./obama.txt\")\n", "print('vocab size = %d' %(len(vocab)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create LSTM Model\n", "\n", "Now we create the a multi-layer LSTM model. The definition of LSTM cell is implemented in [lstm.py](https://github.com/dmlc/mxnet-notebooks/blob/master/python/tutorials/lstm.py)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import lstm\n", "# Each line contains at most 129 chars. 
\n", "seq_len = 129\n", "# embedding dimension, which maps a character to a 256-dimension vector\n", "num_embed = 256\n", "# number of lstm layers\n", "num_lstm_layer = 3\n", "# hidden unit in LSTM cell\n", "num_hidden = 512\n", "\n", "symbol = lstm.lstm_unroll(\n", " num_lstm_layer, \n", " seq_len,\n", " len(vocab) + 1,\n", " num_hidden=num_hidden,\n", " num_embed=num_embed,\n", " num_label=len(vocab) + 1, \n", " dropout=0.2)\n", "\n", "\"\"\"test_seq_len\"\"\"\n", "data_file = open(\"./obama.txt\")\n", "for line in data_file:\n", " assert len(line) <= seq_len + 1, \"seq_len is smaller than maximum line length. Current line length is %d. Line content is: %s\" % (len(line), line)\n", "data_file.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train\n", "\n", "First, we create a DataIterator" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Summary of dataset ==================\n", "bucket of len 129 : 8290 samples\n" ] } ], "source": [ "import bucket_io\n", "\n", "# The batch size for training\n", "batch_size = 32\n", "\n", "# initalize states for LSTM\n", "init_c = [('l%d_init_c'%l, (batch_size, num_hidden)) for l in range(num_lstm_layer)]\n", "init_h = [('l%d_init_h'%l, (batch_size, num_hidden)) for l in range(num_lstm_layer)]\n", "init_states = init_c + init_h\n", "\n", "# Even though BucketSentenceIter supports various length examples,\n", "# we simply use the fixed length version here\n", "data_train = bucket_io.BucketSentenceIter(\n", " \"./obama.txt\", \n", " vocab, \n", " [seq_len], \n", " batch_size, \n", " init_states, \n", " seperate_char='\\n',\n", " text2id=text2id, \n", " read_content=read_content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we can train with the standard `model.fit` approach." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:root:Start training with [gpu(0)]\n", "INFO:root:Epoch[0] Batch [20]\tSpeed: 74.29 samples/sec\tTrain-Perplexity=38.179584\n", "INFO:root:Epoch[0] Batch [40]\tSpeed: 71.11 samples/sec\tTrain-Perplexity=24.131196\n", "INFO:root:Epoch[0] Batch [60]\tSpeed: 71.25 samples/sec\tTrain-Perplexity=23.802137\n", "INFO:root:Epoch[0] Batch [80]\tSpeed: 70.94 samples/sec\tTrain-Perplexity=23.195673\n", "INFO:root:Epoch[0] Batch [100]\tSpeed: 71.35 samples/sec\tTrain-Perplexity=22.974986\n", "INFO:root:Epoch[0] Batch [120]\tSpeed: 71.11 samples/sec\tTrain-Perplexity=22.783410\n", "INFO:root:Epoch[0] Batch [140]\tSpeed: 71.04 samples/sec\tTrain-Perplexity=22.826977\n", "INFO:root:Epoch[0] Batch [160]\tSpeed: 71.15 samples/sec\tTrain-Perplexity=22.681599\n", "INFO:root:Epoch[0] Batch [180]\tSpeed: 71.31 samples/sec\tTrain-Perplexity=22.268179\n", "INFO:root:Epoch[0] Batch [200]\tSpeed: 71.16 samples/sec\tTrain-Perplexity=22.548455\n", "INFO:root:Epoch[0] Batch [220]\tSpeed: 71.29 samples/sec\tTrain-Perplexity=22.224348\n", "INFO:root:Epoch[0] Batch [240]\tSpeed: 71.10 samples/sec\tTrain-Perplexity=22.563747\n", "INFO:root:Epoch[0] Resetting Data Iterator\n", "INFO:root:Epoch[0] Time cost=116.385\n", "INFO:root:Saved checkpoint to \"obama-0001.params\"\n" ] } ], "source": [ "# @@@ AUTOTEST_OUTPUT_IGNORED_CELL\n", "import mxnet as mx\n", "import numpy as np\n", "import logging\n", "logging.getLogger().setLevel(logging.DEBUG)\n", "\n", "# We will show a quick demo with only 1 epoch. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Inference\n", "\n", "We first define some utility functions to help us make inferences:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import random\n", "import bisect\n", "\n", "# helper structure for prediction: invert the vocabulary (int id -> char)\n", "def MakeRevertVocab(vocab):\n", "    dic = {}\n", "    for k, v in vocab.items():\n", "        dic[v] = k\n", "    return dic\n", "\n", "# write the int id of a char into the model's input array\n", "def MakeInput(char, vocab, arr):\n", "    idx = vocab[char]\n", "    tmp = np.zeros((1,))\n", "    tmp[0] = idx\n", "    arr[:] = tmp\n", "\n", "# helper function for random sampling: cumulative distribution of the weights\n", "def _cdf(weights):\n", "    total = sum(weights)\n", "    result = []\n", "    cumsum = 0\n", "    for w in weights:\n", "        cumsum += w\n", "        result.append(cumsum / total)\n", "    return result\n", "\n", "# draw one element from population according to the weights\n", "def _choice(population, weights):\n", "    assert len(population) == len(weights)\n", "    cdf_vals = _cdf(weights)\n", "    x = random.random()\n", "    idx = bisect.bisect(cdf_vals, x)\n", "    return population[idx]\n", "\n", "# map the predicted distribution to the next character, either greedily\n", "# (argmax) or by sampling from the temperature-scaled distribution\n", "def MakeOutput(prob, vocab, sample=False, temperature=1.):\n", "    if not sample:\n", "        idx = np.argmax(prob, axis=1)[0]\n", "        try:\n", "            char = vocab[idx]\n", "        except KeyError:\n", "            char = ''\n", "        return char\n", "    else:\n", "        fix_dict = [\"\"] + [vocab[i] for i in range(1, len(vocab) + 1)]\n", "        scale_prob = np.clip(prob, 1e-6, 1 - 1e-6)\n", "        rescale = np.exp(np.log(scale_prob) / temperature)\n", "        rescale[:] /= rescale.sum()\n", "        return _choice(fix_dict, rescale[0, :])" ] },
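{ "cell_type": "markdown", "metadata": {}, "source": [ "As a small illustration (not part of the original tutorial), drawing repeatedly from `_choice` should approximately reproduce the given weights; in `MakeOutput`, a `temperature` below 1 sharpens the distribution toward the argmax, while a value above 1 flattens it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import collections\n", "\n", "# Illustrative: empirical frequencies from _choice should roughly match\n", "# the weights over many draws (expect about a:700, b:200, c:100).\n", "counts = collections.Counter(_choice('abc', [0.7, 0.2, 0.1]) for _ in range(1000))\n", "print(counts)" ] },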
{ "cell_type": "markdown", "metadata": {}, "source": [ "Then we create the inference model:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import rnn_model\n", "\n", "# Load parameters from a checkpoint. Epoch 75 assumes a fully trained model;\n", "# if you only ran the 1-epoch demo above, load epoch 1 instead.\n", "_, arg_params, __ = mx.model.load_checkpoint(\"obama\", 75)\n", "\n", "# build an inference model that feeds in one character at a time\n", "model = rnn_model.LSTMInferenceModel(\n", "    num_lstm_layer,\n", "    len(vocab) + 1,\n", "    num_hidden=num_hidden,\n", "    num_embed=num_embed,\n", "    num_label=len(vocab) + 1,\n", "    arg_params=arg_params,\n", "    ctx=mx.gpu(),\n", "    dropout=0.2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can generate a sequence of 600 characters starting with \"The United States\"." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The United States of America. That's why I'm running for President.The first place we can do better than that they can afford to get the that they can afford to differ on the part of the political settlement. The second part of the problem is that the consequences would have to see the chance to starthe country that we can start by the challenges of the American people. The American people have been talking about how to compete with the streets of San Antonio who are serious about the courage to come together as one people. That the American people have been trying to get there. And they say\n" ] } ], "source": [ "seq_length = 600\n", "input_ndarray = mx.nd.zeros((1,))\n", "revert_vocab = MakeRevertVocab(vocab)\n", "# Feel free to change the starter sentence\n", "output = 'The United States'\n", "random_sample = False\n", "new_sentence = True\n", "\n", "ignore_length = len(output)\n", "\n", "for i in range(seq_length):\n", "    if i <= ignore_length - 1:\n", "        MakeInput(output[i], vocab, input_ndarray)\n", "    else:\n", "        MakeInput(output[-1], vocab, input_ndarray)\n", "    prob = model.forward(input_ndarray, new_sentence)\n", "    new_sentence = False\n", "    next_char = MakeOutput(prob, revert_vocab, random_sample)\n", "    if next_char == '':\n", "        new_sentence = True\n", "    if i >= ignore_length - 1:\n", "        output += next_char\n", "print(output)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }