{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# A Neural Implementation of NBSVM in Keras\n",
    "\n",
    "NBSVM is a text classification model proposed by Wang and Manning in 2012 that takes a linear model such as SVM (or logistic regression) and infuses it with Bayesian probabilities by replacing word count features with Naive Bayes log count ratios.  Despite its simplicity, NBSVM models have been shown to be both fast and powerful across wide range of different text classification datasets.  In this notebook, we cover the following:\n",
    "\n",
    "* An NBSVM model is implemented as a neural network using the deep learning framework, *Keras*. \n",
    "* Using the well-studied IMDB movie review dataset, we show that this Keras implementation achieves a test accuracy of **92.5%** with only a few seconds of training.  This is competitive with deeper and more sophisticated neural network architectures that take much longer to train. \n",
    "* Source code and results are available on GitHub here in the form of a Jupyter notebook.\n",
    "\n",
    "\n",
    "Let's begin by importing some necessary modules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Using TensorFlow backend.\n"
     ]
    }
   ],
   "source": [
    "%reload_ext autoreload\n",
    "%autoreload 2\n",
    "%matplotlib inline\n",
    "import numpy as np\n",
    "from keras.layers.core import Activation\n",
    "from keras.models import Model\n",
    "from keras.layers import Input, Embedding, Flatten, dot\n",
    "from keras import backend as K\n",
    "from keras.optimizers import Adam\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from sklearn.datasets import load_files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the IMDB Dataset\n",
    "The IMDB training set consists of 25,000 movie reviews labeled as either positive or negative.  The test set consists of another 25,000 labeled movie reviews.  We will use the first set of 25,000 reviews to train a model to classify  movie reviews as positive or negative and evaluate the model on the second set of 25,000 review.  The dataset is first loaded as a document-term matrix where each row represents a review and each column represents a word, where each \"word\"is a string of either 1, 2, or 3 consecutive words in a review. That is, features are unigrams, bigrams, or trigrams.  Entries in the matrix are binarized word counts (i.e., 1 means the word appears at least once in the review and 0 means otherwise).  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "document-term matrix shape (training): (25000, 800000)\n",
      "document-term matrix shape (test): (25000, 800000)\n",
      "vocab size:800001\n"
     ]
    }
   ],
   "source": [
    "PATH_TO_IMDB = r'./data/aclImdb'\n",
    "\n",
    "def load_imdb_data(datadir):\n",
    "    # read in training and test corpora\n",
    "    categories = ['pos', 'neg']\n",
    "    train_b = load_files(datadir+'/train', shuffle=True, categories=categories)\n",
    "    test_b = load_files(datadir+'/test', shuffle=True, categories=categories)\n",
    "    train_b.data = [x.decode('utf-8') for x in train_b.data]\n",
    "    test_b.data =  [x.decode('utf-8') for x in test_b.data]\n",
    "    veczr =  CountVectorizer(ngram_range=(1,3), binary=True, \n",
    "                             token_pattern=r'\\w+', \n",
    "                             max_features=800000)\n",
    "    dtm_train = veczr.fit_transform(train_b.data)\n",
    "    dtm_test = veczr.transform(test_b.data)\n",
    "    y_train = train_b.target\n",
    "    y_test = test_b.target\n",
    "    print(\"document-term matrix shape (training): (%s, %s)\" % (dtm_train.shape))\n",
    "    print(\"document-term matrix shape (test): (%s, %s)\" % (dtm_train.shape))\n",
    "    num_words = len([v for k,v in veczr.vocabulary_.items()]) + 1 # add 1 for 0 padding\n",
    "    print('vocab size:%s' % (num_words))\n",
    "  \n",
    "    return (dtm_train, dtm_test), (y_train, y_test), num_words\n",
    "\n",
    "(dtm_train, dtm_test), (y_train, y_test), num_words = load_imdb_data(PATH_TO_IMDB)"
   ]
  },
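  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small illustration of the representation described above, the sketch below builds a binarized document-term matrix over unigrams, bigrams, and trigrams for a toy corpus. The names `toy_docs`, `toy_veczr`, and `toy_dtm` and the two example reviews are made up for this illustration; the vectorizer settings mirror those in `load_imdb_data`, except that `max_features` is omitted.\n",
    "\n",
    "```python\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "# Hypothetical two-review corpus, used only for illustration.\n",
    "toy_docs = ['a great great movie', 'not a great movie']\n",
    "\n",
    "# Same settings as load_imdb_data, minus max_features.\n",
    "toy_veczr = CountVectorizer(ngram_range=(1, 3), binary=True, token_pattern=r'\\w+')\n",
    "toy_dtm = toy_veczr.fit_transform(toy_docs)\n",
    "\n",
    "# Features are unigrams, bigrams, and trigrams.\n",
    "toy_vocab = sorted(toy_veczr.vocabulary_)\n",
    "print(toy_vocab)\n",
    "\n",
    "# binary=True caps every entry at 1, even though 'great'\n",
    "# occurs twice in the first review.\n",
    "print(toy_dtm.toarray().max())\n",
    "```\n",
    "\n",
    "Each row of `toy_dtm` marks which n-grams occur in the corresponding review, which is the form the real `dtm_train` and `dtm_test` take at a much larger scale."
   ]
  },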
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Converting the Document-Term Matrix to a List of Word ID Sequences\n",
    "\n",
    "In a binarized document-term matrix, each document is represented as a long one-hot-encoded vector with most entries being zero.  While our neural model could be implemented to accept rows from this matrix as input, we choose to represent each document as a sequence of word IDs with some fixed length, *maxlen*, by using an embedding layer.  An embedding layer in a neural network acts as a lookup-mechanism that accepts a word ID as input and returns a vector (or scalar) representation of that word. In our case, the embedding layer will return preset Naive Bayes log-count ratios for the words represented by word IDs in a document.  A model accepting documents represented as sequences of word IDs trains much faster than one accepting rows from a term-document matrix. While these two architectures technically have the same number of parameters, the look-up mechanism of an embedding layer reduces the number of features (i.e., words) and parameters under consideration at any iteration. That is, documents represented as a fixed-size sequence of word IDs are much more compact and efficient than large one-hot encoded vector from a term-document matrix with binarized counts. \n",
    "\n",
    "Here, we convert the document-term matrix to a list of word ID sequences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sequence stats: avg:430.9482, max:3052, min:20\n",
      "sequence stats: avg:390.72628, max:2950, min:7\n"
     ]
    }
   ],
   "source": [
    "def dtm2wid(dtm, maxlen=2000):\n",
    "    x = []\n",
    "    nwds = []\n",
    "    for idx, row in enumerate(dtm):\n",
    "        seq = []\n",
    "        indices = (row.indices + 1).astype(np.int64)\n",
    "        np.append(nwds, len(indices))\n",
    "        data = (row.data).astype(np.int64)\n",
    "        count_dict = dict(zip(indices, data))\n",
    "        for k,v in count_dict.items():\n",
    "            seq.extend([k]*v)\n",
    "        num_words = len(seq)\n",
    "        nwds.append(num_words)\n",
    "        # pad up to maxlen\n",
    "        if num_words < maxlen: \n",
    "            seq = np.pad(seq, (maxlen - num_words, 0), mode='constant')\n",
    "        # truncate down to maxlen\n",
    "        else:                  \n",
    "            seq = seq[-maxlen:]\n",
    "        x.append(seq)\n",
    "    nwds = np.array(nwds)\n",
    "    print('sequence stats: avg:%s, max:%s, min:%s' % (nwds.mean(), nwds.max(), nwds.min()) )\n",
    "    return np.array(x)\n",
    "\n",
    "maxlen = 2000\n",
    "x_train = dtm2wid(dtm_train, maxlen=maxlen)\n",
    "x_test = dtm2wid(dtm_test, maxlen=maxlen)  "
   ]
  },
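  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the conversion concrete, here is a minimal sketch on a tiny, made-up matrix. The names `toy_dtm`, `toy_maxlen`, and `toy_seqs` are hypothetical and chosen only for this example; the logic follows the same two ideas as `dtm2wid` (shift column indices by one so that 0 is free for padding, then left-pad to a fixed length), but it is a simplified illustration rather than a replacement for that function.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from scipy.sparse import csr_matrix\n",
    "\n",
    "# Hypothetical binarized document-term matrix: 2 documents, 5 features.\n",
    "toy_dtm = csr_matrix(np.array([[1, 0, 1, 0, 0],\n",
    "                               [0, 1, 1, 0, 1]]))\n",
    "toy_maxlen = 4  # illustrative sequence length\n",
    "\n",
    "toy_seqs = []\n",
    "for row in toy_dtm:\n",
    "    # Column indices of the nonzero entries, shifted by 1\n",
    "    # so that 0 stays free for padding.\n",
    "    ids = row.indices + 1\n",
    "    # Keep at most the last toy_maxlen IDs, then left-pad with zeros.\n",
    "    ids = ids[-toy_maxlen:]\n",
    "    padded = np.zeros(toy_maxlen, dtype=np.int64)\n",
    "    padded[toy_maxlen - len(ids):] = ids\n",
    "    toy_seqs.append(padded)\n",
    "\n",
    "toy_seqs = np.array(toy_seqs)\n",
    "print(toy_seqs)\n",
    "```\n",
    "\n",
    "The first document, which contains features 0 and 2, becomes the sequence `[0, 0, 1, 3]`; the second, with features 1, 2, and 4, becomes `[0, 2, 3, 5]`."
   ]
  },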
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Computing the Naive Bayes Log-Count Ratios\n",
    "The final data preparation step involves computing the Naive Bayes log-count ratios. This is more easily done using the original document-term matrix. These ratios capture the probability of a word appearing in a document in one class (e.g., positive) versus another."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "def pr(dtm, y, y_i):\n",
    "    p = dtm[y==y_i].sum(0)\n",
    "    return (p+1) / ((y==y_i).sum()+1)\n",
    "nbratios = np.log(pr(dtm_train, y_train, 1)/pr(dtm_train, y_train, 0))\n",
    "nbratios = np.squeeze(np.asarray(nbratios))"
   ]
  },
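  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A small worked example may help in reading these ratios. The sketch below applies the same smoothed formula to a made-up dense matrix; `toy_X`, `toy_y`, `toy_pr`, and `toy_ratios` are hypothetical names used only here.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Hypothetical binarized counts: 4 documents, 3 features.\n",
    "toy_X = np.array([[1, 1, 0],\n",
    "                  [1, 0, 0],\n",
    "                  [0, 1, 1],\n",
    "                  [0, 0, 1]])\n",
    "toy_y = np.array([1, 1, 0, 0])  # class label of each document\n",
    "\n",
    "def toy_pr(X, y, y_i):\n",
    "    # Smoothed fraction of class-y_i documents that contain each feature.\n",
    "    return (X[y == y_i].sum(axis=0) + 1) / ((y == y_i).sum() + 1)\n",
    "\n",
    "toy_ratios = np.log(toy_pr(toy_X, toy_y, 1) / toy_pr(toy_X, toy_y, 0))\n",
    "print(toy_ratios)\n",
    "```\n",
    "\n",
    "Feature 0 appears only in class-1 documents and gets a positive ratio (log 3), feature 1 appears equally often in both classes and gets 0, and feature 2 appears only in class-0 documents and gets a negative ratio (-log 3)."
   ]
  },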
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## NBSVM in Keras\n",
    "We are now ready to define our NBSVM model.  Our model utilizes two embedding layers.  The first, as mentioned above, stores the Naive Bayes log-count ratios.  The second stores learned weights (or coefficients) in this linear model. Our prediction, then, is simply the dot product of these two vectors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_model(num_words, maxlen, nbratios=None):\n",
    "    embedding_matrix = np.zeros((num_words, 1))\n",
    "    for i in range(1, num_words): # skip 0, the padding value\n",
    "        if nbratios is not None:\n",
    "            # if log-count ratios are supplied, then it's NBSVM\n",
    "            embedding_matrix[i] = nbratios[i-1]\n",
    "        else:\n",
    "            # if log-count rations are not supplied, this reduces to a logistic regression\n",
    "            embedding_matrix[i] = 1\n",
    "\n",
    "    # set up the model\n",
    "    inp = Input(shape=(maxlen,))\n",
    "    r = Embedding(num_words, 1, input_length=maxlen, weights=[embedding_matrix], trainable=False)(inp)\n",
    "    x = Embedding(num_words, 1, input_length=maxlen, embeddings_initializer='glorot_normal')(inp)\n",
    "    x = dot([r,x], axes=1)\n",
    "    x = Flatten()(x)\n",
    "    x = Activation('sigmoid')(x)\n",
    "    model = Model(inputs=inp, outputs=x)\n",
    "    model.compile(loss='binary_crossentropy',\n",
    "                  optimizer=Adam(lr=0.001),\n",
    "                  metrics=['accuracy'])\n",
    "    return model\n"
   ]
  },
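  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see why two scalar embeddings followed by a dot product amount to a linear model, the NumPy sketch below reproduces the forward pass for a single document outside of Keras. All names and values (`r_table`, `w_table`, `seq`, the random numbers) are hypothetical stand-ins for the frozen ratio embedding, the trainable weight embedding, and one padded word ID sequence.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "toy_num_words = 6  # IDs 1..5 are features, 0 is padding\n",
    "\n",
    "# Hypothetical fixed log-count ratios (entry 0 is the padding ID and stays 0).\n",
    "r_table = np.concatenate([[0.0], rng.normal(size=toy_num_words - 1)])\n",
    "# Hypothetical learned weights, one scalar per word ID.\n",
    "w_table = rng.normal(size=toy_num_words)\n",
    "\n",
    "seq = np.array([0, 2, 3, 5])  # a left-padded word ID sequence\n",
    "\n",
    "# Embedding lookups followed by a dot product along the sequence axis.\n",
    "logit_seq = np.dot(r_table[seq], w_table[seq])\n",
    "\n",
    "# The same quantity from a binary document-term row: sum_j x_j * r_j * w_j.\n",
    "x_row = np.zeros(toy_num_words)\n",
    "x_row[seq[seq > 0]] = 1.0\n",
    "logit_dtm = np.sum(x_row * r_table * w_table)\n",
    "\n",
    "# The sigmoid turns the score into a probability.\n",
    "prob = 1.0 / (1.0 + np.exp(-logit_seq))\n",
    "print(logit_seq, logit_dtm, prob)\n",
    "```\n",
    "\n",
    "The two scores agree because the padding ID maps to a ratio of zero and so contributes nothing to the sum. Only the entries for word IDs present in the document are touched, which is the source of the efficiency gain over feeding in full document-term rows."
   ]
  },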
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This simple model achieves a **92.5%** accuracy on the test set with only a few seconds of training on a Titan V GPU.  In fact, the model trains within seconds even on a CPU."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train on 25000 samples, validate on 25000 samples\n",
      "Epoch 1/3\n",
      "25000/25000 [==============================] - 5s 210us/step - loss: 0.2691 - acc: 0.9410 - val_loss: 0.2503 - val_acc: 0.9222\n",
      "Epoch 2/3\n",
      "25000/25000 [==============================] - 5s 204us/step - loss: 0.0776 - acc: 0.9916 - val_loss: 0.2253 - val_acc: 0.9253\n",
      "Epoch 3/3\n",
      "25000/25000 [==============================] - 5s 201us/step - loss: 0.0388 - acc: 0.9983 - val_loss: 0.2141 - val_acc: 0.9257\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<keras.callbacks.History at 0x7f3f874dd5f8>"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model = get_model(num_words, maxlen, nbratios=nbratios)\n",
    "model.fit(x_train, y_train,\n",
    "          batch_size=32,\n",
    "          epochs=3,\n",
    "          validation_data=(x_test, y_test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These results are competitive with more complex and sophisticated neural architectures.  Note that, when setting nbratios to None, our function *get_model* sets the embedding matrix, **r**, to all ones, which reduces our model to a logistic regression.³  To see the accuracy of this model, try it out by invoking: \n",
    "<pre>\n",
    "model = get_model(num_words, maxlen, nbratios=None)\n",
    "model.fit(x_train, y_train,\n",
    "          batch_size=32,\n",
    "          epochs=3,\n",
    "          validation_data=(x_test, y_test))\n",
    "</pre>.  This notebook was inspired by a tweet<sup>2</sup> from Jeremy Howard from September 2017.\n",
    " "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References\n",
    "\n",
    "<sup>1</sup> Sida Wang and Christopher D. Manning: Baselines and Bigrams: Simple, Good Sentiment and Topic Classification; ACL 2012.\n",
    "\n",
    "<sup>2</sup> https://twitter.com/jeremyphoward/status/905841365241565184?lang=en\n",
    "\n",
    "<sup>3</sup> Since, by definition, any document containing a word ID contains that word, if our embedding layer simply returns one (instead of a log-count ratio) for every word ID except zero (i.e., 0 is the dummy ID used to pad sequences), then our NBSVM model reduces to a logistic regression."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}