{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Using wrappers for Gensim models for working with Keras" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial is about using gensim models as a part of your Keras models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The wrappers available (as of now) are :\n", "* Word2Vec (uses the function ```get_keras_embedding``` defined in ```gensim.models.keyedvectors```)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Word2Vec" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "#### Integration with Keras : 20NewsGroups Task" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see how Gensim's Word2Vec model could be integrated with Keras while dealing with a supervised (classification) task, we consider the [20NewsGroups](qwone.com/~jason/20Newsgroups/) task. Here, we take a smaller version of this data by taking a subset of the documents to be classified. \n", "\n", "First, we import the necessary modules." ] }, { "cell_type": "code", "execution_count": 163, "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "import keras\n", "import numpy as np\n", "\n", "from gensim.models import word2vec\n", "\n", "from keras.models import Model\n", "from keras.preprocessing.text import Tokenizer, text_to_word_sequence\n", "from keras.preprocessing.sequence import pad_sequences\n", "from keras.utils.np_utils import to_categorical\n", "from keras.layers import Input, Dense, Flatten\n", "from keras.layers import Conv1D, MaxPooling1D\n", "\n", "from sklearn.datasets import fetch_20newsgroups" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first load the training data.\n", "Then, we format our text samples and labels into tensors that can be fed into a neural network. To do this, we rely on Keras utilities `keras.preprocessing.text.Tokenizer`, `keras.preprocessing.sequence.pad_sequences` and `from keras.utils.np_utils import to_categorical`.\n" ] }, { "cell_type": "code", "execution_count": 164, "metadata": {}, "outputs": [], "source": [ "dataset = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'comp.graphics', 'sci.space'])\n", "\n", "MAX_SEQUENCE_LENGTH = 1000\n", "\n", "# Vectorize the text samples into a 2D integer tensor\n", "tokenizer = Tokenizer()\n", "tokenizer.fit_on_texts(dataset.data)\n", "sequences = tokenizer.texts_to_sequences(dataset.data)\n", "\n", "x_train = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)\n", "y_train = to_categorical(np.asarray(dataset.target))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we train a Word2Vec model from the documents we have.\n", "From the word2vec model we construct the embedding layer to be used in our actual Keras model.\n", "\n", "The Keras tokenizer object maintains an internal vocabulary (a token to index mapping), which might be different from the vocabulary gensim builds when training the word2vec model. To align the vocabularies we pass the Keras tokenizer vocabulary to the `get_keras_embedding` function" ] }, { "cell_type": "code", "execution_count": 165, "metadata": {}, "outputs": [], "source": [ "keras_w2v = word2vec.Word2Vec([text_to_word_sequence(doc) for doc in dataset.data],min_count=0)\n", "embedding_layer = keras_w2v.wv.get_keras_embedding(word_index = tokenizer.word_index,train_embeddings=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we create a small 1D convnet to solve our classification problem." ] }, { "cell_type": "code", "execution_count": 166, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train on 1491 samples, validate on 166 samples\n", "Epoch 1/3\n", "1491/1491 [==============================] - 16s 11ms/step - loss: 1.0239 - acc: 0.5017 - val_loss: 0.9306 - val_acc: 0.5663\n", "Epoch 2/3\n", "1491/1491 [==============================] - 15s 10ms/step - loss: 0.6941 - acc: 0.7015 - val_loss: 0.6612 - val_acc: 0.7048\n", "Epoch 3/3\n", "1491/1491 [==============================] - 15s 10ms/step - loss: 0.4270 - acc: 0.8404 - val_loss: 0.5119 - val_acc: 0.7892\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 166, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')\n", "embedded_sequences = embedding_layer(sequence_input)\n", "x = Conv1D(128, 5, activation='relu')(embedded_sequences)\n", "x = MaxPooling1D(5)(x)\n", "x = Conv1D(128, 5, activation='relu')(x)\n", "x = MaxPooling1D(5)(x)\n", "x = Conv1D(128, 5, activation='relu')(x)\n", "x = MaxPooling1D(35)(x) # global max pooling\n", "x = Flatten()(x)\n", "x = Dense(128, activation='relu')(x)\n", "preds = Dense(y_train.shape[1], activation='softmax')(x)\n", "\n", "model = Model(sequence_input, preds)\n", "model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])\n", "\n", "model.fit(x_train, y_train, epochs=3, validation_split= 0.1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the model learns to reaches a reasonable accuracy, considering the small dataset.\n", "\n", "Alternatively, we can use embeddings pretrained on a different larger corpus (Glove), to see if performance impoves" ] }, { "cell_type": "code", "execution_count": 167, "metadata": {}, "outputs": [], "source": [ "import gensim.downloader as api\n", "\n", "glove_embeddings = api.load(\"glove-wiki-gigaword-100\")" ] }, { "cell_type": "code", "execution_count": 168, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train on 1491 samples, validate on 166 samples\n", "Epoch 1/3\n", "1491/1491 [==============================] - 17s 11ms/step - loss: 1.0564 - acc: 0.4514 - val_loss: 0.9083 - val_acc: 0.4578\n", "Epoch 2/3\n", "1491/1491 [==============================] - 16s 11ms/step - loss: 0.5122 - acc: 0.7901 - val_loss: 0.3278 - val_acc: 0.8855\n", "Epoch 3/3\n", "1491/1491 [==============================] - 16s 10ms/step - loss: 0.0902 - acc: 0.9718 - val_loss: 0.2187 - val_acc: 0.9398\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 168, "metadata": {}, "output_type": "execute_result" } ], "source": [ "glove_embedding_layer = glove_embeddings.get_keras_embedding(word_index = tokenizer.word_index,train_embeddings=True)\n", "\n", "embedded_sequences = glove_embedding_layer(sequence_input)\n", "x = Conv1D(128, 5, activation='relu')(embedded_sequences)\n", "x = MaxPooling1D(5)(x)\n", "x = Conv1D(128, 5, activation='relu')(x)\n", "x = MaxPooling1D(5)(x)\n", "x = Conv1D(128, 5, activation='relu')(x)\n", "x = MaxPooling1D(35)(x) # global max pooling\n", "x = Flatten()(x)\n", "x = Dense(128, activation='relu')(x)\n", "preds = Dense(y_train.shape[1], activation='softmax')(x)\n", "\n", "model = Model(sequence_input, preds)\n", "model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])\n", "\n", "model.fit(x_train, y_train, epochs=3, validation_split= 0.1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that pretrained embeddings result in a faster convergence" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 1 }