{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebooks is an experiment to see if a pure scikit-learn implementation of the fastText model can work better than a linear model on a small text classification problem: 20 newsgroups.\n", "\n", "http://arxiv.org/abs/1607.01759\n", "\n", "Those models are very similar to Deep Averaging Network (with only 1 hidden layer with a linear activation function):\n", "\n", "https://www.cs.umd.edu/~miyyer/pubs/2015_acl_dan.pdf\n", "\n", "\n", "Note that scikit-learn does not provide a hierarchical softmax implementation (but we don't need it on 20 newsgroups anyways)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import fetch_20newsgroups\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.feature_extraction.text import HashingVectorizer\n", "\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "twentyng_train = fetch_20newsgroups(\n", " subset='train',\n", " #remove=('headers', 'footers'),\n", ")\n", "docs_train, target_train = twentyng_train.data, twentyng_train.target\n", "\n", "\n", "twentyng_test = fetch_20newsgroups(\n", " subset='test',\n", " #remove=('headers', 'footers'),\n", ")\n", "\n", "docs_test, target_test = twentyng_test.data, twentyng_test.target" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "262144" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "2 ** 18" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following uses the hashing tricks on unigrams and bigrams. `binary=True` makes us ignore repeated words in a document. The `l1` normalization ensures that we \"average\" the embeddings of the tokens in the document instead of summing them." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 16.8 s, sys: 116 ms, total: 16.9 s\n", "Wall time: 16.9 s\n" ] } ], "source": [ "%%time\n", "vec = HashingVectorizer(\n", " encoding='latin-1', binary=True, ngram_range=(1, 2),\n", " norm='l1', n_features=2 ** 18)\n", "\n", "X_train = vec.transform(docs_train)\n", "X_test = vec.transform(docs_test)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 0., 0., 0., ..., 0., 0., 0.],\n", " [ 0., 0., 0., ..., 0., 0., 0.],\n", " [ 0., 0., 0., ..., 0., 0., 0.]])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_doc_vectors = X_train[:3].toarray()\n", "first_doc_vectors" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0., 0., 0.])" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_doc_vectors.min(axis=1)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0.0049505 , 0.00469484, 0.00200401])" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_doc_vectors.max(axis=1)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 1., 1., 1.])" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_doc_vectors.sum(axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Baseline: OvR logistic regression (the multinomial logistic regression loss is currently not implemented in scikit-learn). In practice, the OvR reduction seems to work well enough." ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 46s, sys: 6.69 s, total: 1min 53s\n", "Wall time: 11.1 s\n" ] } ], "source": [ "%%time\n", "from sklearn.linear_model import SGDClassifier\n", "\n", "lr = SGDClassifier(loss='log', alpha=1e-10, n_iter=50, n_jobs=-1)\n", "lr.fit(X_train, target_train)" ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train score: 1.000\n", "test score: 0.827\n", "CPU times: user 588 ms, sys: 289 ms, total: 877 ms\n", "Wall time: 602 ms\n" ] } ], "source": [ "%%time\n", "print(\"train score: %0.3f\" % lr.score(X_train, target_train))\n", "print(\"test score: %0.3f\" % lr.score(X_test, target_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now use the MLPClassifier of scikit-learn to add a single hidden layer with a small number of hidden units.\n", "\n", "Note: instead of tanh or relu we would rather like to use a linear / identity activation function for the hidden layer but this is not (yet) implemented in scikit-learn.\n", "\n", "In that respect the following model is closer to a Deep Averaging Network (without dropout) than fastText." 
] }, { "cell_type": "code", "execution_count": 90, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Iteration 1, loss = 2.94108225\n", "Validation score: 0.464664\n", "Iteration 2, loss = 2.49072336\n", "Validation score: 0.639576\n", "Iteration 3, loss = 1.63266821\n", "Validation score: 0.810954\n", "Iteration 4, loss = 0.90327443\n", "Validation score: 0.869258\n", "Iteration 5, loss = 0.48531751\n", "Validation score: 0.893993\n", "Iteration 6, loss = 0.27329257\n", "Validation score: 0.909894\n", "Iteration 7, loss = 0.16704835\n", "Validation score: 0.911661\n", "Iteration 8, loss = 0.11122343\n", "Validation score: 0.918728\n", "Iteration 9, loss = 0.07885910\n", "Validation score: 0.918728\n", "Iteration 10, loss = 0.05876991\n", "Validation score: 0.924028\n", "Iteration 11, loss = 0.04566916\n", "Validation score: 0.920495\n", "Iteration 12, loss = 0.03644058\n", "Validation score: 0.915194\n", "Iteration 13, loss = 0.02982519\n", "Validation score: 0.922261\n", "Validation score did not improve more than tol=0.000100 for two consecutive epochs. Stopping.\n", "CPU times: user 1min 21s, sys: 187 ms, total: 1min 21s\n", "Wall time: 1min 21s\n" ] } ], "source": [ "%%time\n", "from sklearn.neural_network import MLPClassifier\n", "\n", "mlp = MLPClassifier(algorithm='adam', learning_rate_init=0.01,\n", " hidden_layer_sizes=10, max_iter=100, activation='tanh', verbose=100,\n", " early_stopping=True, validation_fraction=0.05, alpha=1e-10)\n", "mlp.fit(X_train, target_train)" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train score: 0.996\n", "test score: 0.801\n", "CPU times: user 304 ms, sys: 54 µs, total: 304 ms\n", "Wall time: 303 ms\n" ] } ], "source": [ "%%time\n", "print(\"train score: %0.3f\" % mlp.score(X_train, target_train))\n", "print(\"test score: %0.3f\" % mlp.score(X_test, target_test))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.0" } }, "nbformat": 4, "nbformat_minor": 0 }