{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using gpu device 2: GeForce GTX TITAN X (CNMeM is enabled with initial size: 90.0% of memory, cuDNN 4007)\n" ] } ], "source": [ "from theano.sandbox import cuda\n", "cuda.use('gpu2')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using Theano backend.\n" ] } ], "source": [ "from __future__ import division, print_function\n", "%matplotlib inline\n", "import utils; reload(utils)\n", "from utils import *" ] }, { "cell_type": "code", "execution_count": 144, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from keras.layers import TimeDistributed, Activation\n", "from numpy.random import choice" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We haven't yet looked in detail at how this works, so this notebook is provided for self-study for those who are interested. We'll look at it closely next week." 
] }, { "cell_type": "code", "execution_count": 107, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "corpus length: 600901\n" ] } ], "source": [ "path = get_file('nietzsche.txt', origin=\"https://s3.amazonaws.com/text-datasets/nietzsche.txt\")\n", "text = open(path).read().lower()\n", "print('corpus length:', len(text))" ] }, { "cell_type": "code", "execution_count": 272, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "are thinkers who believe in the saints.\r\n", "\r\n", "\r\n", "144\r\n", "\r\n", "It stands to reason that this sketch of the saint, made upon the model\r\n", "of the whole species, can be confronted with many opposing sketches that\r\n", "would create a more agreeable impression. There are certain exceptions\r\n", "among the species who distinguish themselves either by especial\r\n", "gentleness or especial humanity, and perhaps by the strength of their\r\n", "own personality. Others are in the highest degree fascinating because\r\n", "certain of their delusions shed a particular glow over their whole\r\n", "being, as is the case with the founder of christianity who took himself\r\n", "for the only begotten son of God and hence felt himself sinless; so that\r\n", "through his imagination--that should not be too harshly judged since the\r\n", "whole of antiquity swarmed with sons of god--he attained the same goal,\r\n", "the sense of complete sinlessness, complete irresponsibility, that can\r\n", "now be attained by every individual through science.--In the same manner\r\n", "I have viewed the saints of India who occupy an intermediate station\r\n", "between the christian saints and the Greek philosophers and hence are\r\n", "not to be regarded as a pure type. 
Knowledge and science--as far as they\r\n", "existed--and superiority to the rest of mankind by logical discipline\r\n", "and training of the intellectual powers were insisted upon by the\r\n", "Buddhists as essential to sanctity, just as they were denounced by the\r\n", "christian world as the indications of sinfulness." ] } ], "source": [ "!tail {path} -n25" ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "corpus length: 137587200\n" ] } ], "source": [ "#path = 'data/wiki/'\n", "#text = open(path+'small.txt').read().lower()\n", "#print('corpus length:', len(text))\n", "\n", "#text = text[0:1000000]" ] }, { "cell_type": "code", "execution_count": 124, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total chars: 60\n" ] } ], "source": [ "chars = sorted(list(set(text)))\n", "vocab_size = len(chars)+1\n", "print('total chars:', vocab_size)" ] }, { "cell_type": "code", "execution_count": 134, "metadata": { "collapsed": true }, "outputs": [], "source": [ "chars.insert(0, \"\\0\")" ] }, { "cell_type": "code", "execution_count": 270, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'\\n !\"\\'(),-.0123456789:;=?[]_abcdefghijklmnopqrstuvwxyz'" ] }, "execution_count": 270, "metadata": {}, "output_type": "execute_result" } ], "source": [ "''.join(chars[1:-6])" ] }, { "cell_type": "code", "execution_count": 135, "metadata": { "collapsed": false }, "outputs": [], "source": [ "char_indices = dict((c, i) for i, c in enumerate(chars))\n", "indices_char = dict((i, c) for i, c in enumerate(chars))" ] }, { "cell_type": "code", "execution_count": 136, "metadata": { "collapsed": true }, "outputs": [], "source": [ "idx = [char_indices[c] for c in text]" ] }, { "cell_type": "code", "execution_count": 276, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[43, 45, 32, 
33, 28, 30, 32, 1, 1, 1]" ] }, "execution_count": 276, "metadata": {}, "output_type": "execute_result" } ], "source": [ "idx[:10]" ] }, { "cell_type": "code", "execution_count": 274, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'preface\\n\\n\\nsupposing that truth is a woman--what then? is there not gro'" ] }, "execution_count": 274, "metadata": {}, "output_type": "execute_result" } ], "source": [ "''.join(indices_char[i] for i in idx[:70])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocess and create model" ] }, { "cell_type": "code", "execution_count": 139, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "nb sequences: 600862\n" ] } ], "source": [ "maxlen = 40\n", "sentences = []\n", "next_chars = []\n", "for i in range(0, len(idx) - maxlen+1):\n", " sentences.append(idx[i: i + maxlen])\n", " next_chars.append(idx[i+1: i+maxlen+1])\n", "print('nb sequences:', len(sentences))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sentences = np.concatenate([[np.array(o)] for o in sentences[:-2]])\n", "next_chars = np.concatenate([[np.array(o)] for o in next_chars[:-2]])" ] }, { "cell_type": "code", "execution_count": 277, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "((600860, 40), (600860, 40))" ] }, "execution_count": 277, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentences.shape, next_chars.shape" ] }, { "cell_type": "code", "execution_count": 213, "metadata": { "collapsed": true }, "outputs": [], "source": [ "n_fac = 24" ] }, { "cell_type": "code", "execution_count": 232, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model=Sequential([\n", " Embedding(vocab_size, n_fac, input_length=maxlen),\n", " LSTM(512, input_dim=n_fac,return_sequences=True, dropout_U=0.2, dropout_W=0.2,\n", " consume_less='gpu'),\n", " 
Dropout(0.2),\n", " LSTM(512, return_sequences=True, dropout_U=0.2, dropout_W=0.2,\n", " consume_less='gpu'),\n", " Dropout(0.2),\n", " TimeDistributed(Dense(vocab_size)),\n", " Activation('softmax')\n", " ]) " ] }, { "cell_type": "code", "execution_count": 233, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train" ] }, { "cell_type": "code", "execution_count": 219, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def print_example():\n", " seed_string=\"ethics is a basic foundation of all that\"\n", " for i in range(320):\n", " x=np.array([char_indices[c] for c in seed_string[-40:]])[np.newaxis,:]\n", " preds = model.predict(x, verbose=0)[0][-1]\n", " preds = preds/np.sum(preds)\n", " next_char = choice(chars, p=preds)\n", " seed_string = seed_string + next_char\n", " print(seed_string)" ] }, { "cell_type": "code", "execution_count": 236, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/1\n", "600860/600860 [==============================] - 640s - loss: 1.5152 \n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 236, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(sentences, np.expand_dims(next_chars,-1), batch_size=64, nb_epoch=1)" ] }, { "cell_type": "code", "execution_count": 220, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ethics is a basic foundation of all thatscrriets sdi ,s lrrbmh\n", "fceelsora tec\n", " n yiefma\n", "cnostencnrut - o\n", "pen.htt\" oaiosovo stialpts es rb b\n", "ea ie\n", "ohatnmauyielueysiutlmo,es etfrne oh\n", "ohnio iis e.eosme o rdorfdbteirnse ohdnotafi enicron e eietnyn sytt e ptsrdrede httmi ah\n", "oo, tdye es r,igyct toehitu abrh ei isiem-r natra lnspamlltefae a\n", "cni vuui\n", "twgt fatieh\n" ] } ], 
"source": [ "print_example()" ] }, { "cell_type": "code", "execution_count": 236, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/1\n", "600860/600860 [==============================] - 640s - loss: 1.5152 \n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 236, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(sentences, np.expand_dims(next_chars,-1), batch_size=64, nb_epoch=1)" ] }, { "cell_type": "code", "execution_count": 222, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ethics is a basic foundation of all that he maluces indofely and is; pticrast', and re onerog je ivesantamale as whered\n", "and ror and kytinf? on chaninn nurdeln--ans prory. heke the pepadinar; anf bom,\n", "puntely\"\" ones to bucf, alcherstol the qisleves: the the wite dadong the gur is prang not galcaula rewinl\n", "more by than sic appads not pepow o mee, a more\n", "bins c\n" ] } ], "source": [ "print_example()" ] }, { "cell_type": "code", "execution_count": 235, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model.optimizer.lr=0.001" ] }, { "cell_type": "code", "execution_count": 236, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/1\n", "600860/600860 [==============================] - 640s - loss: 1.5152 \n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 236, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(sentences, np.expand_dims(next_chars,-1), batch_size=64, nb_epoch=1)" ] }, { "cell_type": "code", "execution_count": 237, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ethics is a basic foundation of all that schools pedhaps a new prisons of the ashamed in which\n", "a coverbine estimates of the assumption that one avoid; he 
will curse about pain:\n", " people, he-equally present to\n", "the lalier,\n", "nature. that he has\n", "rendered and henceforth distrain and impulses to perceive that each other\n", "former and dangerous, and cannot at\n", "the pu\n" ] } ], "source": [ "print_example()" ] }, { "cell_type": "code", "execution_count": 250, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model.optimizer.lr=0.0001" ] }, { "cell_type": "code", "execution_count": 239, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/1\n", "600860/600860 [==============================] - 639s - loss: 1.2892 \n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 239, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(sentences, np.expand_dims(next_chars,-1), batch_size=64, nb_epoch=1)" ] }, { "cell_type": "code", "execution_count": 240, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ethics is a basic foundation of all that account has its granitify them.\n", "\n", "131. the new \"dilence,\" out of the\n", "same light,\n", "interpretation thereof: under the \"thinking\"\n", "there, to counter-arguments in the monality, so many language:\n", "though\n", "all nobilitys of higher impulses, man and hence to everything of seldom man.\n", "\n", "\n", "\n", "chapter i. 
woman decides according the injur\n" ] } ], "source": [ "print_example()" ] }, { "cell_type": "code", "execution_count": 242, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model.save_weights('data/char_rnn.h5')" ] }, { "cell_type": "code", "execution_count": 257, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model.optimizer.lr=0.00001" ] }, { "cell_type": "code", "execution_count": 243, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/1\n", "600860/600860 [==============================] - 640s - loss: 1.2544 \n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 243, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(sentences, np.expand_dims(next_chars,-1), batch_size=64, nb_epoch=1)" ] }, { "cell_type": "code", "execution_count": 249, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ethics is a basic foundation of all that is military\n", "contemplation of distance itself is in physician!\n", "\n", "249.. in every\n", "to strick in the man of disguise and in the\n", "will to wind at any progress, the\n", "religious estimates of vehapance has a powerful and religious nature of manner, who had the problem of\n", "decided expression of his equality, which, sometimes power? 
\n" ] } ], "source": [ "print_example()" ] }, { "cell_type": "code", "execution_count": 258, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/1\n", "600860/600860 [==============================] - 640s - loss: 1.2218 \n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 258, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(sentences, np.expand_dims(next_chars,-1), batch_size=64, nb_epoch=1)" ] }, { "cell_type": "code", "execution_count": 264, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ethics is a basic foundation of all that\n", "the belief in the importance. the employs concerning\n", "seriousness and\n", "materialism, it is circles which alone is already attained, that he sees\n", "also the day after thinking\n", "of mankind, brightness, resistance--and after the value of \"nature\" in order to nevertheless\n", "have taken a system of liberal fatalists are willing him\n" ] } ], "source": [ "print_example()" ] }, { "cell_type": "code", "execution_count": 283, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ethics is a basic foundation of all that were beought by the temptation of truth--for the rest of a sublime medely and take part of life, which lacks himself the\n", "credibility about this, in short, and raise such a\n", "gods; and on the\n", "other hand, the explanation of\n", "the case, as the most ingredient, and insight, and approach as to the\n", "peculiarly prolonged \"distrus\n" ] } ], "source": [ "print_example()" ] }, { "cell_type": "code", "execution_count": 282, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model.save_weights('data/char_rnn.h5')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": 
"Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" }, "nav_menu": {}, "toc": { "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 6, "toc_cell": true, "toc_section_display": "block", "toc_window_display": false }, "widgets": { "state": {}, "version": "1.1.2" } }, "nbformat": 4, "nbformat_minor": 0 }