{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using wrappers for Gensim models for working with Keras"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This tutorial is about using gensim models as a part of your Keras models."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The wrappers available (as of now) are :\n",
    "* Word2Vec (uses the function ```get_keras_embedding``` defined in  ```gensim.models.keyedvectors```)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Word2Vec"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use Word2Vec, we import the corresponding module."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from gensim.models import word2vec"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we create a dummy set of sentences to train our Word2Vec model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "sentences = [\n",
    "    ['human', 'interface', 'computer'],\n",
    "    ['survey', 'user', 'computer', 'system', 'response', 'time'],\n",
    "    ['eps', 'user', 'interface', 'system'],\n",
    "    ['system', 'human', 'system', 'eps'],\n",
    "    ['user', 'response', 'time'],\n",
    "    ['trees'],\n",
    "    ['graph', 'trees'],\n",
    "    ['graph', 'minors', 'trees'],\n",
    "    ['graph', 'minors', 'survey']\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we create the Word2Vec model by passing appropriate parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = word2vec.Word2Vec(sentences, size=100, min_count=1, hs=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "#### Integration with Keras : Cosine Similarity Task"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an example of integration of Gensim's Word2Vec model with Keras, we consider a word similarity task where we compute the cosine distance as a measure of similarity between the two words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Using TensorFlow backend.\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "from keras.engine import Input\n",
    "from keras.models import Model\n",
    "from keras.layers.merge import dot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We would use the layer returned by the function `get_keras_embedding` in the Keras model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "wv = model.wv\n",
    "embedding_layer = wv.get_keras_embedding()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we construct the Keras model. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/ivan/.virtualenvs/ker/lib/python2.7/site-packages/ipykernel_launcher.py:7: UserWarning: Update your `Model` call to the Keras 2 API: `Model(outputs=Tensor(\"do..., inputs=[<tf.Tenso...)`\n",
      "  import sys\n"
     ]
    }
   ],
   "source": [
    "input_a = Input(shape=(1,), dtype='int32', name='input_a')\n",
    "input_b = Input(shape=(1,), dtype='int32', name='input_b')\n",
    "embedding_a = embedding_layer(input_a)\n",
    "embedding_b = embedding_layer(input_b)\n",
    "similarity = dot([embedding_a, embedding_b], axes=2, normalize=True)\n",
    "\n",
    "keras_model = Model(input=[input_a, input_b], output=similarity)\n",
    "keras_model.compile(optimizer='sgd', loss='mse')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we input the two words which we wish to compare and retrieve the value predicted by the model as the similarity score of the two words. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[[ 0.00596689]]]\n"
     ]
    }
   ],
   "source": [
    "word_a = 'graph'\n",
    "word_b = 'trees'\n",
    "# output is the cosine distance between the two words (as a similarity measure)\n",
    "output = keras_model.predict([np.asarray([model.wv.vocab[word_a].index]), np.asarray([model.wv.vocab[word_b].index])])\n",
    "\n",
    "print output"
   ]
  },
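  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the quantity computed by the `dot(..., normalize=True)` layer concrete, here is a minimal NumPy sketch of cosine similarity: the dot product of two vectors divided by the product of their L2 norms. The helper `cosine_similarity` and the toy vectors are hypothetical names introduced only for illustration; they are not part of Gensim or Keras."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def cosine_similarity(u, v):\n",
    "    # dot product of the two vectors divided by the product of their L2 norms\n",
    "    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))\n",
    "\n",
    "# hypothetical toy vectors standing in for two word embeddings\n",
    "vec_a = np.array([1.0, 2.0, 3.0])\n",
    "vec_b = np.array([2.0, 4.0, 6.0])   # same direction as vec_a\n",
    "vec_c = np.array([3.0, 0.0, -1.0])  # orthogonal to vec_a\n",
    "\n",
    "print(cosine_similarity(vec_a, vec_b))  # close to 1 for parallel vectors\n",
    "print(cosine_similarity(vec_a, vec_c))  # close to 0 for orthogonal vectors"
   ]
  },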
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "#### Integration with Keras : 20NewsGroups Task"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see how Gensim's Word2Vec model could be integrated with Keras while dealing with a real supervised (classification) task, we consider the [20NewsGroups](qwone.com/~jason/20Newsgroups/) task. Here, we take a smaller version of this data by taking a subset of the documents to be classified. \n",
    "\n",
    "First, we import the necessary modules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "import keras\n",
    "import numpy as np\n",
    "\n",
    "from gensim.models import word2vec\n",
    "\n",
    "from keras.models import Model\n",
    "from keras.preprocessing.text import Tokenizer\n",
    "from keras.preprocessing.sequence import pad_sequences\n",
    "from keras.utils.np_utils import to_categorical\n",
    "from keras.layers import Input, Dense, Flatten\n",
    "from keras.layers import Conv1D, MaxPooling1D\n",
    "\n",
    "from sklearn.datasets import fetch_20newsgroups"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As the first step of the task, we iterate over the folder in which our text samples are stored, and format them into a list of samples. Also, we prepare at the same time a list of class indices matching the samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "texts = []  # list of text samples\n",
    "texts_w2v = []  # used to train the word embeddings\n",
    "labels = []  # list of label ids\n",
    "\n",
    "#using 3 categories for training the classifier\n",
    "data = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'comp.graphics', 'sci.space'])\n",
    "\n",
    "for index in range(len(data)):\n",
    "    label_id = data.target[index]\n",
    "    file_data = data.data[index]\n",
    "    i = file_data.find('\\n\\n')  # skip header\n",
    "    if i > 0:\n",
    "        file_data = file_data[i:]\n",
    "    try:\n",
    "        curr_str = str(file_data)\n",
    "        sentence_list = curr_str.split('\\n')\n",
    "        for sentence in sentence_list:\n",
    "            sentence = (sentence.strip()).lower()\n",
    "            texts.append(sentence)\n",
    "            texts_w2v.append(sentence.split(' '))\n",
    "            labels.append(label_id)\n",
    "    except:\n",
    "        None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we format our text samples and labels into tensors that can be fed into a neural network. To do this, we rely on Keras utilities `keras.preprocessing.text.Tokenizer` and `keras.preprocessing.sequence.pad_sequences`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "MAX_SEQUENCE_LENGTH = 1000\n",
    "\n",
    "# Vectorize the text samples into a 2D integer tensor\n",
    "tokenizer = Tokenizer()\n",
    "tokenizer.fit_on_texts(texts)\n",
    "sequences = tokenizer.texts_to_sequences(texts)\n",
    "\n",
    "# word_index = tokenizer.word_index\n",
    "data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)\n",
    "labels = to_categorical(np.asarray(labels))\n",
    "\n",
    "x_train = data\n",
    "y_train = labels"
   ]
  },
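  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`), the sketch below left-pads short sequences with zeros and keeps only the last `maxlen` entries of long ones. The helper `pad_left` and the toy input are hypothetical and only mimic the default behaviour; this is not the Keras implementation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def pad_left(sequences, maxlen, value=0):\n",
    "    # start from a matrix filled with the padding value\n",
    "    out = np.full((len(sequences), maxlen), value, dtype='int32')\n",
    "    for row, seq in enumerate(sequences):\n",
    "        kept = seq[-maxlen:]  # truncate from the front: keep the last maxlen items\n",
    "        if len(kept) > 0:\n",
    "            out[row, maxlen - len(kept):] = kept  # right-align, so padding sits on the left\n",
    "    return out\n",
    "\n",
    "# hypothetical toy sequences of word indices\n",
    "toy_sequences = [[4, 7], [1, 2, 3, 4, 5, 6], []]\n",
    "print(pad_left(toy_sequences, maxlen=4))"
   ]
  },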
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As the next step, we prepare the embedding layer to be used in our actual Keras model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "Keras_w2v = word2vec.Word2Vec(min_count=1)\n",
    "Keras_w2v.build_vocab(texts_w2v)\n",
    "Keras_w2v.train(texts, total_examples=Keras_w2v.corpus_count, epochs=Keras_w2v.iter)\n",
    "Keras_w2v_wv = Keras_w2v.wv\n",
    "embedding_layer = Keras_w2v_wv.get_keras_embedding()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we create a small 1D convnet to solve our classification problem."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 1/5\n",
      "137/137 [==============================] - 0s - loss: 0.9653 - acc: 0.4891     \n",
      "Epoch 2/5\n",
      "137/137 [==============================] - 0s - loss: 0.9753 - acc: 0.5255     \n",
      "Epoch 3/5\n",
      "137/137 [==============================] - 0s - loss: 0.9001 - acc: 0.4453     \n",
      "Epoch 4/5\n",
      "137/137 [==============================] - 0s - loss: 0.8930 - acc: 0.4818     \n",
      "Epoch 5/5\n",
      "137/137 [==============================] - 0s - loss: 0.8888 - acc: 0.4307     \n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<keras.callbacks.History at 0x7fe87902c290>"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')\n",
    "embedded_sequences = embedding_layer(sequence_input)\n",
    "x = Conv1D(128, 5, activation='relu')(embedded_sequences)\n",
    "x = MaxPooling1D(5)(x)\n",
    "x = Conv1D(128, 5, activation='relu')(x)\n",
    "x = MaxPooling1D(5)(x)\n",
    "x = Conv1D(128, 5, activation='relu')(x)\n",
    "x = MaxPooling1D(35)(x)  # global max pooling\n",
    "x = Flatten()(x)\n",
    "x = Dense(128, activation='relu')(x)\n",
    "preds = Dense(y_train.shape[1], activation='softmax')(x)\n",
    "\n",
    "model = Model(sequence_input, preds)\n",
    "model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])\n",
    "\n",
    "model.fit(x_train, y_train, epochs=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As can be seen from the results above, the accuracy obtained is not that high. This is because of the small size of training data used and we could expect to obtain better accuracy for training data of larger size."
   ]
  },
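  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The comment `# global max pooling` on the last pooling layer of the convnet above can be checked with some arithmetic. Assuming the Keras defaults used there (stride-1 convolutions with `'valid'` padding, and pooling whose stride equals the pool size), each `Conv1D(128, 5)` shortens the sequence by 4 and each `MaxPooling1D(5)` divides its length by 5, rounding down. The helper functions below are hypothetical names introduced for this sketch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def conv1d_valid_length(length, kernel_size):\n",
    "    # output length of a stride-1 1D convolution with 'valid' padding\n",
    "    return length - kernel_size + 1\n",
    "\n",
    "def maxpool1d_length(length, pool_size):\n",
    "    # output length of max pooling with 'valid' padding and stride equal to pool_size\n",
    "    return length // pool_size\n",
    "\n",
    "length = 1000  # MAX_SEQUENCE_LENGTH\n",
    "for _ in range(2):  # the first two Conv1D + MaxPooling1D(5) pairs\n",
    "    length = conv1d_valid_length(length, 5)\n",
    "    length = maxpool1d_length(length, 5)\n",
    "length = conv1d_valid_length(length, 5)  # third Conv1D\n",
    "print(length)  # 35 time steps reach the last pooling layer\n",
    "\n",
    "final_length = maxpool1d_length(length, 35)\n",
    "print(final_length)  # 1, so MaxPooling1D(35) pools over the whole remaining sequence"
   ]
  },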
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "#### Integration with Keras : Another classification task"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this task, we train our model to predict the category of the input text. We start by importing the relevant modules and libraries : "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "from keras.models import Sequential\n",
    "from keras.layers import Dropout\n",
    "from keras.regularizers import l2\n",
    "from keras.models import Model\n",
    "from keras.engine import Input\n",
    "from keras.preprocessing.sequence import pad_sequences\n",
    "from keras.preprocessing.text import Tokenizer\n",
    "from gensim.models import keyedvectors\n",
    "from collections import defaultdict\n",
    "\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now define some global variables and utility functions which would be used in the code further : "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# global variables\n",
    "\n",
    "nb_filters = 1200  # number of filters\n",
    "n_gram = 2  # n-gram, or window size of CNN/ConvNet\n",
    "maxlen = 15  # maximum number of words in a sentence\n",
    "vecsize = 300  # length of the embedded vectors in the model \n",
    "cnn_dropout = 0.0  # dropout rate for CNN/ConvNet\n",
    "final_activation = 'softmax'  # activation function. Options: softplus, softsign, relu, tanh, sigmoid, hard_sigmoid, linear.\n",
    "dense_wl2reg = 0.0  # dense_wl2reg: L2 regularization coefficient\n",
    "dense_bl2reg = 0.0  # dense_bl2reg: L2 regularization coefficient for bias\n",
    "optimizer = 'adam'  # optimizer for gradient descent. Options: sgd, rmsprop, adagrad, adadelta, adam, adamax, nadam\n",
    "\n",
    "# utility functions\n",
    "\n",
    "def retrieve_csvdata_as_dict(filepath):\n",
    "    \"\"\"\n",
    "    Retrieve the training data in a CSV file, with the first column being the\n",
    "    class labels, and second column the text data. It returns a dictionary with\n",
    "    the class labels as keys, and a list of short texts as the value for each key.\n",
    "    \"\"\"\n",
    "    df = pd.read_csv(filepath)\n",
    "    category_col, descp_col = df.columns.values.tolist()\n",
    "    shorttextdict = dict()\n",
    "    for category, descp in zip(df[category_col], df[descp_col]):\n",
    "        if type(descp) == str:\n",
    "            shorttextdict.setdefault(category, []).append(descp)\n",
    "    return shorttextdict\n",
    "\n",
    "def subjectkeywords():\n",
    "    \"\"\"\n",
    "    Return an example data set, with three subjects and corresponding keywords.\n",
    "    This is in the format of the training input.\n",
    "    \"\"\"\n",
    "    data_path = os.path.join(os.getcwd(), 'datasets/keras_classifier_training_data.csv')\n",
    "    return retrieve_csvdata_as_dict(data_path)\n",
    "\n",
    "def convert_trainingdata(classdict):\n",
    "    \"\"\"\n",
    "    Convert the training data into format put into the neural networks.\n",
    "    \"\"\"\n",
    "    classlabels = classdict.keys()\n",
    "    lblidx_dict = dict(zip(classlabels, range(len(classlabels))))\n",
    "\n",
    "    # tokenize the words, and determine the word length\n",
    "    phrases = []\n",
    "    indices = []\n",
    "    for label in classlabels:\n",
    "        for shorttext in classdict[label]:\n",
    "            shorttext = shorttext if type(shorttext) == str else ''\n",
    "            category_bucket = [0]*len(classlabels)\n",
    "            category_bucket[lblidx_dict[label]] = 1\n",
    "            indices.append(category_bucket)\n",
    "            phrases.append(shorttext)\n",
    "\n",
    "    return classlabels, phrases, indices\n",
    "\n",
    "def process_text(text):\n",
    "    \"\"\" \n",
    "    Process the input text by tokenizing and padding it.\n",
    "    \"\"\"\n",
    "    tokenizer = Tokenizer()\n",
    "    tokenizer.fit_on_texts(text)\n",
    "    x_train = tokenizer.texts_to_sequences(text)\n",
    "\n",
    "    x_train = pad_sequences(x_train, maxlen=maxlen)\n",
    "    return x_train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We create our word2vec model first. We could either train our model or user pre-trained vectors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# we are training our Word2Vec model here\n",
    "w2v_training_data_path = os.path.join(os.getcwd(), 'datasets/word_vectors_training_data.txt')\n",
    "input_data = word2vec.LineSentence(w2v_training_data_path)\n",
    "w2v_model = word2vec.Word2Vec(input_data, size=300)\n",
    "w2v_model_wv = w2v_model.wv\n",
    "\n",
    "# Alternatively we could have imported pre-trained word-vectors like : \n",
    "# w2v_model_wv = keyedvectors.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)\n",
    "# The dataset 'GoogleNews-vectors-negative300.bin.gz' can be downloaded from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We load the training data for the Keras model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "trainclassdict = subjectkeywords()\n",
    "\n",
    "nb_labels = len(trainclassdict)  # number of class labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we create out Keras model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "# get embedding layer corresponding to our trained Word2Vec model\n",
    "embedding_layer = w2v_model_wv.get_keras_embedding()\n",
    "\n",
    "# create a convnet to solve our classification task\n",
    "sequence_input = Input(shape=(maxlen,), dtype='int32')\n",
    "embedded_sequences = embedding_layer(sequence_input)\n",
    "x = Conv1D(filters=nb_filters, kernel_size=n_gram, padding='valid', activation='relu', input_shape=(maxlen, vecsize))(embedded_sequences)\n",
    "x = MaxPooling1D(pool_size=maxlen - n_gram + 1)(x)\n",
    "x = Flatten()(x)\n",
    "preds = Dense(nb_labels, activation=final_activation, kernel_regularizer=l2(dense_wl2reg), bias_regularizer=l2(dense_bl2reg))(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we train the classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 1/10\n",
      "45/45 [==============================] - 0s - loss: 1.1035 - acc: 0.2000     \n",
      "Epoch 2/10\n",
      "45/45 [==============================] - 0s - loss: 1.0988 - acc: 0.3333     \n",
      "Epoch 3/10\n",
      "45/45 [==============================] - 0s - loss: 1.0972 - acc: 0.3333     \n",
      "Epoch 4/10\n",
      "45/45 [==============================] - 0s - loss: 1.0948 - acc: 0.6444     \n",
      "Epoch 5/10\n",
      "45/45 [==============================] - 0s - loss: 1.0938 - acc: 0.5778     \n",
      "Epoch 6/10\n",
      "45/45 [==============================] - 0s - loss: 1.0936 - acc: 0.5778     \n",
      "Epoch 7/10\n",
      "45/45 [==============================] - 0s - loss: 1.0900 - acc: 0.5111     \n",
      "Epoch 8/10\n",
      "45/45 [==============================] - 0s - loss: 1.0879 - acc: 0.5111     \n",
      "Epoch 9/10\n",
      "45/45 [==============================] - 0s - loss: 1.0856 - acc: 0.5778     \n",
      "Epoch 10/10\n",
      "45/45 [==============================] - 0s - loss: 1.0834 - acc: 0.5556     \n"
     ]
    }
   ],
   "source": [
    "classlabels, x_train, y_train = convert_trainingdata(trainclassdict)\n",
    "\n",
    "tokenizer = Tokenizer()\n",
    "tokenizer.fit_on_texts(x_train)\n",
    "x_train = tokenizer.texts_to_sequences(x_train)\n",
    "\n",
    "x_train = pad_sequences(x_train, maxlen=maxlen)\n",
    "\n",
    "model = Model(sequence_input, preds)\n",
    "model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])\n",
    "fit_ret_val = model.fit(x_train, y_train, epochs=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our classifier is now ready to predict classes for input data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'mathematics': 0.33123398, 'physics': 0.34042257, 'theology': 0.32834342}\n"
     ]
    }
   ],
   "source": [
    "input_text = 'artificial intelligence'\n",
    "\n",
    "matrix = process_text(input_text)\n",
    "\n",
    "predictions = model.predict(matrix)\n",
    "\n",
    "# get the actual categories from output\n",
    "scoredict = {}\n",
    "for idx, classlabel in zip(range(len(classlabels)), classlabels):\n",
    "    scoredict[classlabel] = predictions[0][idx]\n",
    "\n",
    "print scoredict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result above clearly suggests (~ 98% probability!) that the input `artificial intelligence` should belong to the category `mathematics`, which conforms very well with the expected output in this case.\n",
    "In general, the output could depend on several factors including the number of filters for the conv-net, the training data for the word-vectors, the training data for the classifier etc."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}