{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Neural Language Model and Spinoza's Ethics\n", "\n", "In this post I will show how to build a language model for text generation using deep learning techniques.\n", "\n", "\n", "## Introduction\n", "\n", "Though natural language has, in principle, a formal structure and grammar, in practice it is full of ambiguities. Learning a statistical model from examples, instead of writing down rules by hand, is an interesting alternative. The definition of a (statistical) language model given by [Ref.1](https://en.wikipedia.org/wiki/Language_model) is:\n", "\n", "> A statistical language model is a probability distribution over sequences of words. Given such a sequence it assigns a probability to the whole sequence.\n", "\n", "Or equivalently, given a sequence $\\{w_1,...,w_n\\}$ of length $n$, the model assigns a probability \n", "\n", "$$P(w_1,...,w_n)$$\n", "\n", "to the whole sequence. In particular, a neural language model can predict the probability of the next word in a sentence (see [Ref.2](https://machinelearningmastery.com/develop-character-based-neural-language-model-keras/) for more details). \n", "\n", "The use of neural networks has become one of the main approaches to language modeling. Three properties describe this neural language modeling (NLM) approach succinctly ([Ref. 3](http://www.jmlr.org/papers/v3/bengio03a.html)): \n", "\n", "> We first associate words in the vocabulary with a distributed word feature vector, then express the joint probability function of word sequences in terms of the feature vectors of these words in the sequence and then learn simultaneously the word feature vector and the parameters of the probability function. \n", "\n", "In this project I used Spinoza's *Ethics* to build a character-level NLM." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## *The Ethics*\n", "\n", "From [Ref.4](https://en.wikipedia.org/wiki/Ethics_(Spinoza)):\n", "\n", "> Ethics, Demonstrated in Geometrical Order, usually known as the Ethics, is a philosophical treatise written by Benedict de Spinoza. \n", "\n", "The article goes on to say that:\n", "\n", "> The book is perhaps the most ambitious attempt to apply the method of Euclid in philosophy. Spinoza puts forward a small number of definitions and axioms from which he attempts to derive hundreds of propositions and corollaries [...]\n", "\n", "The book has the structure shown below. It is set out in geometrical form, paralleling the \"canonical example of a rigorous structure of argument producing unquestionable results: the example being the geometry of Euclid\" (see [link](https://timlshort.com/2010/06/21/spinozas-style-of-argument-in-ethics-i/))." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> # PART I. CONCERNING GOD.\n", ">\n", "> ## DEFINITIONS.\n", ">\n", "> I. By that which is self-caused, I mean that of which the essence involves existence, or that of which the nature is only conceivable as existent.\n", "\n", "> II. A thing is called finite after its kind, when it can be limited by another thing of the same nature; for instance, a body is called finite because we always conceive another greater body. So, also, a thought is limited by another thought, but a body is not limited by thought, nor a thought by body.\n", "\n", "> III. By substance, I mean that which is in itself, and is conceived through itself: in other words, that of which a conception can be formed independently of any other conception.\n", "\n", "> IV. 
By attribute, I mean that which the intellect perceives as constituting the essence of substance.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Imports" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/marcotavora/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n", " from ._conv import register_converters as _register_converters\n", "Using TensorFlow backend.\n" ] } ], "source": [ "from numpy import array\n", "from pickle import dump\n", "from keras.utils import to_categorical\n", "from keras.utils.vis_utils import plot_model\n", "from keras.models import Sequential\n", "from keras.layers import Dense\n", "from keras.layers import LSTM\n", "\n", "from IPython.core.interactiveshell import InteractiveShell\n", "InteractiveShell.ast_node_interactivity = \"all\" " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A function to load the text\n", "\n", "The function below:\n", "- Opens the given file (here `ethics.txt`)\n", "- Reads it into a string\n", "- Closes it" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "file = \"ethics.txt\"" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def load_txt(file):\n", "    # Read the whole file into one string; the with-block closes the file\n", "    with open(file, 'r') as f:\n", "        text = f.read()\n", "    return text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Printing out part of the string\n", "\n", "We see that it contains many newline characters." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'PART I. CONCERNING GOD.\\n\\nDEFINITIONS.\\n\\n\\nI. 
By that which is self--caused, I mean that of which the\\n'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "load_txt(file)[0:100]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the string" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "raw = load_txt('ethics.txt')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'PART I. CONCERNING GOD.\\n\\nDEFINITIONS.\\n\\n\\nI. By that which is self--caused, I mean that of which the\\n'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw[0:100]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocessing\n", "\n", "The first step is tokenization; the tokens are what the model is trained on. The `cleaner` function below performs these steps:\n", "- Replaces brackets and `--` with spaces\n", "- Splits the text into words (on white space)\n", "- Strips punctuation from each word\n", "- Drops non-alphabetic tokens (numbers, for example)\n", "- Converts the text to lower case\n", "\n", "Two other common steps, excluding stopwords (common words that add little meaning, such as \"I\" and \"am\") and stemming, are set up in the function but not applied, as the output below shows.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "# nltk.download('stopwords')\n", " \n", "from nltk.corpus import stopwords\n", "from nltk.stem import PorterStemmer\n", "import string\n", "\n", "def cleaner(text):\n", "    # Stemmer and stopword list are created but not applied below\n", "    stemmer = PorterStemmer()\n", "    stop = stopwords.words('english')\n", "    # Replace brackets and double hyphens with spaces, then split on white space\n", "    text = text.replace('[',' ').replace(']',' ').replace('--', ' ')\n", "    tokens = text.split()\n", "    # Strip punctuation from every token\n", "    table = str.maketrans('', '', string.punctuation)\n", "    tokens = [w.translate(table) for w in tokens]\n", "    # Keep alphabetic tokens only and lower-case them\n", "    tokens = [word for word in tokens if word.isalpha()]\n", "    tokens = [word.lower() for word in tokens]\n", "    return tokens" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, 
"outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['part', 'i', 'concerning', 'god', 'definitions', 'i', 'by', 'that', 'which', 'is', 'self', 'caused', 'i', 'mean', 'that', 'of', 'which', 'the', 'essence', 'involves', 'existence', 'or', 'that', 'of', 'which', 'the', 'nature', 'is', 'only', 'conceivable', 'as', 'existent', 'ii', 'a', 'thing', 'is', 'called', 'finite', 'after', 'its', 'kind', 'when', 'it', 'can', 'be', 'limited', 'by', 'another', 'thing', 'of', 'the', 'same', 'nature', 'for', 'instance', 'a', 'body', 'is', 'called', 'finite', 'because', 'we', 'always', 'conceive', 'another', 'greater', 'body', 'so', 'also', 'a', 'thought', 'is', 'limited', 'by', 'another', 'thought', 'but', 'a', 'body', 'is', 'not', 'limited', 'by', 'thought', 'nor', 'a', 'thought', 'by', 'body', 'iii', 'by', 'substance', 'i', 'mean', 'that', 'which', 'is', 'in', 'itself', 'and']\n" ] } ], "source": [ "print(cleaner(raw)[0:100])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['part', 'i', 'concerning', 'god', 'definitions', 'i', 'by', 'that', 'which', 'is']\n" ] } ], "source": [ "tokens = cleaner(raw)\n", "#raw.split()\n", "print(tokens[0:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Joining tokens to build `raw` after cleaning" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'part i concerning god definitions i by that which is sel'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw = ' '.join(tokens)\n", "raw[0:56]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Building sequences" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "n = 20\n", "sequences = list()\n", "for i in range(n, len(raw)):\n", " sequences.append(raw[i-n:i+1])" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "### Checking size\n", "\n", "There are around 180,000 sequences to be used for training." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "175247" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(sequences)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "first sequence is: part i concerning god\n", "second sequence is: art i concerning god \n", "third sequence is: rt i concerning god d\n" ] } ], "source": [ "print('first sequence is:',sequences[0])\n", "print('second sequence is:',sequences[1])\n", "print('third sequence is:',sequences[2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Saving our prepared sequences" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def save_txt(sequences, file):\n", " f = open(file, 'w')\n", " f.write('\\n'.join(sequences))\n", " f.close()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "out = 'ethics_sequences.txt';\n", "save_txt(sequences, out)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the sequences and checking for mistakes" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "first sequence is: part i concerning god\n", "second sequence is: art i concerning god \n", "third sequence is: rt i concerning god d\n" ] }, { "data": { "text/plain": [ "['part i concerning god',\n", " 'art i concerning god ',\n", " 'rt i concerning god d',\n", " 't i concerning god de',\n", " ' i concerning god def',\n", " 'i concerning god defi',\n", " ' concerning god defin',\n", " 'concerning god defini',\n", " 'oncerning god definit',\n", 
" 'ncerning god definiti']" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw = load_txt('ethics_sequences.txt')\n", "seqs = raw.split('\\n')\n", "print('first sequence is:',seqs[0])\n", "print('second sequence is:',seqs[1])\n", "print('third sequence is:',seqs[2])\n", "seqs[0:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Encoding\n", "\n", "We must now encode the sequences as a chain of integers. The list `unique_chars` is made of unique characters:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['\\n', ' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unique_chars = sorted(list(set(raw)))\n", "unique_chars[0:10]" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The values corresponding to keys:\n", "\n", "['\\n', ' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'à', 'â', 'æ']\n", "are:\n", "\n", "{'\\n': 0, ' ': 1, 'a': 2, 'b': 3, 'c': 4, 'd': 5, 'e': 6, 'f': 7, 'g': 8, 'h': 9, 'i': 10, 'j': 11, 'k': 12, 'l': 13, 'm': 14, 'n': 15, 'o': 16, 'p': 17, 'q': 18, 'r': 19, 's': 20, 't': 21, 'u': 22, 'v': 23, 'w': 24, 'x': 25, 'y': 26, 'z': 27, 'à': 28, 'â': 29, 'æ': 30}\n" ] } ], "source": [ "char_int_map = dict((a, b) for b, a in enumerate(unique_chars))\n", "print('The values corresponding to keys:\\n')\n", "print(unique_chars)\n", "print('are:\\n')\n", "print(char_int_map)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Process sequences using the dictionary" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "encoded_sequences = list()\n", "for seq in seqs:\n", " encoded_sequences.append([char_int_map[char] for char 
in seq])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Printing out sequences and their encoded form" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "part i concerning god\n", "[17, 2, 19, 21, 1, 10, 1, 4, 16, 15, 4, 6, 19, 15, 10, 15, 8, 1, 8, 16, 5]\n", "art i concerning god \n", "[2, 19, 21, 1, 10, 1, 4, 16, 15, 4, 6, 19, 15, 10, 15, 8, 1, 8, 16, 5, 1]\n" ] } ], "source": [ "print(sequences[0])\n", "print(encoded_sequences[0])\n", "print(sequences[1])\n", "print(encoded_sequences[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Building an array from the encoded sequences" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "encoded_sequences = array(encoded_sequences)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[17, 2, 19, ..., 8, 16, 5],\n", " [ 2, 19, 21, ..., 16, 5, 1],\n", " [19, 21, 1, ..., 5, 1, 5],\n", " ...,\n", " [24, 20, 1, ..., 17, 13, 2],\n", " [20, 1, 3, ..., 13, 2, 10],\n", " [ 1, 3, 6, ..., 2, 10, 15]])" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encoded_sequences" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "X,y = encoded_sequences[:,:-1], encoded_sequences[:,-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### One-hot encoding" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "sequences = [to_categorical(x, num_classes=len(char_int_map)) for x in X]" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n", " 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 
0., 0., 0., 0.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n", " 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n", " 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n", " 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n", " 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 
0., 0., 0., 0.],\n", " [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n", " 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sequences[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Features and targets" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "X = array(sequences)\n", "y = to_categorical(y, num_classes=len(char_int_map))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "size = len(char_int_map)\n", "\n", "def define_model(X):\n", " model = Sequential()\n", " model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))\n", " model.add(Dense(size, activation='softmax'))\n", " model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])\n", " model.summary()\n", " return model" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "lstm_1 (LSTM) (None, 75) 32100 \n", "_________________________________________________________________\n", "dense_1 (Dense) (None, 31) 2356 \n", "=================================================================\n", "Total params: 
34,456\n", "Trainable params: 34,456\n", "Non-trainable params: 0\n", "_________________________________________________________________\n" ] } ], "source": [ "model = define_model(X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fitting and saving model and dictionary" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/30\n", " - 230s - loss: 2.0191 - acc: 0.4063\n", "Epoch 2/30\n", " - 196s - loss: 1.5064 - acc: 0.5549\n", "Epoch 3/30\n", " - 197s - loss: 1.3316 - acc: 0.6036\n", "Epoch 4/30\n", " - 196s - loss: 1.2390 - acc: 0.6287\n", "Epoch 5/30\n", " - 194s - loss: 1.1790 - acc: 0.6443\n", "Epoch 6/30\n", " - 193s - loss: 1.1355 - acc: 0.6546\n", "Epoch 7/30\n", " - 195s - loss: 1.1036 - acc: 0.6623\n", "Epoch 8/30\n", " - 192s - loss: 1.0777 - acc: 0.6697\n", "Epoch 9/30\n", " - 192s - loss: 1.0562 - acc: 0.6759\n", "Epoch 10/30\n", " - 192s - loss: 1.0383 - acc: 0.6810\n", "Epoch 11/30\n", " - 193s - loss: 1.0229 - acc: 0.6846\n", "Epoch 12/30\n", " - 193s - loss: 1.0085 - acc: 0.6894\n", "Epoch 13/30\n", " - 193s - loss: 0.9964 - acc: 0.6921\n", "Epoch 14/30\n", " - 194s - loss: 0.9862 - acc: 0.6951\n", "Epoch 15/30\n", " - 194s - loss: 0.9770 - acc: 0.6976\n", "Epoch 16/30\n", " - 195s - loss: 0.9674 - acc: 0.7014\n", "Epoch 17/30\n", " - 195s - loss: 0.9597 - acc: 0.7029\n", "Epoch 18/30\n", " - 196s - loss: 0.9526 - acc: 0.7049\n", "Epoch 19/30\n", " - 199s - loss: 0.9453 - acc: 0.7071\n", "Epoch 20/30\n", " - 195s - loss: 0.9388 - acc: 0.7085\n", "Epoch 21/30\n", " - 195s - loss: 0.9330 - acc: 0.7106\n", "Epoch 22/30\n", " - 195s - loss: 0.9271 - acc: 0.7115\n", "Epoch 23/30\n", " - 196s - loss: 0.9222 - acc: 0.7132\n", "Epoch 24/30\n", " - 194s - loss: 0.9178 - acc: 0.7147\n", "Epoch 25/30\n", " - 195s - loss: 0.9124 - acc: 0.7169\n", "Epoch 26/30\n", " - 195s - loss: 0.9090 - acc: 0.7171\n", "Epoch 27/30\n", " - 195s - loss: 0.9046 - acc: 
0.7188\n", "Epoch 28/30\n", " - 200s - loss: 0.9003 - acc: 0.7192\n", "Epoch 29/30\n", " - 194s - loss: 0.8969 - acc: 0.7201\n", "Epoch 30/30\n", " - 194s - loss: 0.8939 - acc: 0.7220\n" ] } ], "source": [ "history = model.fit(X, y, epochs=30, verbose=2)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "loss = history.history['loss']" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "%matplotlib inline\n", "\n", "fig = plt.figure()\n", "plt.plot(loss)\n", "plt.title('Model Loss')\n", "plt.xlabel('Epoch')\n", "plt.ylabel('Loss')\n", "#pylab.xlim([0,60])\n", "fig.savefig('loss.png')\n", "plt.show();" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "model.save('model.h5')" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "dump(char_int_map, open('char_int_map.pkl', 'wb'))" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "keras.models.Sequential" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating sequences" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "from pickle import load\n", "from numpy import array\n", "from keras.models import load_model\n", "from keras.utils import to_categorical\n", "from keras.preprocessing.sequence import pad_sequences" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "def gen_seq(model, char_int_map, n_seq, test_seq, size_gen):\n", " num_classes=len(char_int_map)\n", " txt = test_seq\n", " print(txt)\n", " # generate a 
fixed number of characters\n", " for i in range(size_gen):\n", "     # encode the last n_seq characters of the text as one-hot vectors\n", "     encoded = pad_sequences([[char_int_map[c] for c in txt]], \n", "                             maxlen=n_seq, truncating='pre')\n", "     encoded = to_categorical(encoded, num_classes=num_classes)\n", "     ypred = model.predict_classes(encoded)\n", "     # map the predicted integer back to its character\n", "     int_to_char = ''\n", "     for c, idx in char_int_map.items():\n", "         if idx == ypred:\n", "             int_to_char = c\n", "             break\n", "     # append to input\n", "     txt += int_to_char\n", " return txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the model and the dictionary" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "# load the model\n", "model = load_model('model.h5')\n", "# load the mapping\n", "char_int_map = load(open('char_int_map.pkl', 'rb'))" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "that which is self caused\n", "that which is self caused by another modifications of the human b\n", "nature for instance a body\n", "nature for instance a body in the same as the constitution of the \n" ] } ], "source": [ "# generate 40 characters from two seed phrases taken from the text\n", "print(gen_seq(model, char_int_map, 20, 'that which is self caused', 40))\n", "print(gen_seq(model, char_int_map, 20, 'nature for instance a body', 40))" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'part i concerning god definitions i by that which is self caused i mean that of which the essence involves existence or that of which the nature is only conceivable as existent ii a thing is called finite after its kind when it can be limited by another thing of the same nature for instance a body is called finite because we always conceive another greater body so also a thought is limited by anot'" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw[0:400]" ] }, { "cell_type": "markdown", "metadata": 
{}, "source": [ "### Correct continuations\n", "\n", "For comparison, the original text continues the two seed phrases as follows:\n", "\n", " 1) \"that which is self caused i mean that of which the essence\"\n", " 2) \"nature for instance a body is called finite because we always conceive\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conclusion\n", "\n", "The generated text reads like plausible fragments of the *Ethics*, but it does not reproduce the original continuations. After 30 epochs the training accuracy was about 72% and the loss was still decreasing slowly, so the model would probably improve with more training epochs." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }