{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using wrappers for Scikit learn API"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This tutorial is about using gensim models as a part of your scikit learn workflow with the help of wrappers found at ```gensim.sklearn_integration```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The wrappers available (as of now) are :\n",
    "* LdaModel (```gensim.sklearn_api.ldamodel.LdaTransformer```), which implements gensim's ```LDA Model``` in a scikit-learn interface\n",
    "\n",
    "* LsiModel (```gensim.sklearn_api.lsimodel.LsiTransformer```), which implements gensim's ```LSI Model``` in a scikit-learn interface\n",
    "\n",
    "* RpModel (```gensim.sklearn_api.rpmodel.RpTransformer```), which implements gensim's ```Random Projections Model``` in a scikit-learn interface\n",
    "\n",
    "* LDASeq Model (```gensim.sklearn_api.ldaseqmodel.LdaSeqTransformer```), which implements gensim's ```LdaSeqModel``` in a scikit-learn interface\n",
    "\n",
    "* Word2Vec Model (```gensim.sklearn_api.w2vmodel.W2VTransformer```), which implements gensim's ```Word2Vec``` in a scikit-learn interface\n",
    "\n",
    "* AuthorTopicModel Model (```gensim.sklearn_api.atmodel.AuthorTopicTransformer```), which implements gensim's ```AuthorTopicModel``` in a scikit-learn interface\n",
    "\n",
    "* Doc2Vec Model (```gensim.sklearn_api.d2vmodel.D2VTransformer```), which implements gensim's ```Doc2Vec``` in a scikit-learn interface\n",
    "\n",
    "* Text2Bow Model (```gensim.sklearn_api.text2bow.Text2BowTransformer```), which implements gensim's ```Dictionary``` in a scikit-learn interface\n",
    "\n",
    "* TfidfModel Model (```gensim.sklearn_api.tfidf.TfIdfTransformer```), which implements gensim's ```TfidfModel``` in a scikit-learn interface\n",
    "\n",
    "* HdpModel Model (```gensim.sklearn_api.hdp.HdpTransformer```), which implements gensim's ```HdpModel``` in a scikit-learn interface"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### LDA Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use LdaModel begin with importing LdaModel wrapper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Using TensorFlow backend.\n"
     ]
    }
   ],
   "source": [
    "from gensim.sklearn_api import LdaTransformer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we will create a dummy set of texts and convert it into a corpus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from gensim.corpora import Dictionary\n",
    "texts = [\n",
    "    ['complier', 'system', 'computer'],\n",
    "    ['eulerian', 'node', 'cycle', 'graph', 'tree', 'path'],\n",
    "    ['graph', 'flow', 'network', 'graph'],\n",
    "    ['loading', 'computer', 'system'],\n",
    "    ['user', 'server', 'system'],\n",
    "    ['tree', 'hamiltonian'],\n",
    "    ['graph', 'trees'],\n",
    "    ['computer', 'kernel', 'malfunction', 'computer'],\n",
    "    ['server', 'system', 'computer']\n",
    "]\n",
    "dictionary = Dictionary(texts)\n",
    "corpus = [dictionary.doc2bow(text) for text in texts]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then to run the LdaModel on it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.85275316,  0.14724687],\n",
       "       [ 0.12390183,  0.87609816],\n",
       "       [ 0.46129951,  0.53870052],\n",
       "       [ 0.84924179,  0.15075824],\n",
       "       [ 0.49180096,  0.50819904],\n",
       "       [ 0.40086922,  0.59913075],\n",
       "       [ 0.28454426,  0.71545571],\n",
       "       [ 0.88776201,  0.11223802],\n",
       "       [ 0.84210372,  0.15789627]], dtype=float32)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model = LdaTransformer(num_topics=2, id2word=dictionary, iterations=20, random_state=1)\n",
    "model.fit(corpus)\n",
    "model.transform(corpus)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "#### Integration with Sklearn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To provide a better example of how it can be used with Sklearn, Let's use CountVectorizer method of sklearn. For this example we will use [20 Newsgroups data set](http://qwone.com/~jason/20Newsgroups/). We will only use the categories rec.sport.baseball and sci.crypt and use it to generate topics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from gensim import matutils\n",
    "from gensim.models.ldamodel import LdaModel\n",
    "from sklearn.datasets import fetch_20newsgroups\n",
    "from gensim.sklearn_api.ldamodel import LdaTransformer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "rand = np.random.mtrand.RandomState(1) # set seed for getting same result\n",
    "cats = ['rec.sport.baseball', 'sci.crypt']\n",
    "data = fetch_20newsgroups(subset='train', categories=cats, shuffle=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we use use the loaded data to create our dictionary and corpus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "data_texts = [_.split() for _ in data.data]\n",
    "id2word = Dictionary(data_texts)\n",
    "corpus = [id2word.doc2bow(i.split()) for i in data.data]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we just need to fit corpus and id2word to our Lda wrapper."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "obj = LdaTransformer(id2word=id2word, num_topics=5, iterations=20)\n",
    "lda = obj.fit(corpus)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "#### Example for Using Grid Search"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import GridSearchCV"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The inbuilt `score` function of Lda wrapper class provides two modes : `perplexity` and `u_mass` for computing the scores of the candidate models. The preferred mode for the scoring function is specified using `scorer` parameter of the wrapper as follows : "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'iterations': 20, 'num_topics': 2}"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj = LdaTransformer(id2word=id2word, num_topics=2, iterations=5, scorer='u_mass') # here 'scorer' can be 'perplexity' or 'u_mass'\n",
    "parameters = {'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)}\n",
    "\n",
    "# set `scoring` as `None` to use the inbuilt score function of `SklLdaModel` class\n",
    "model = GridSearchCV(obj, parameters, cv=3, scoring=None)\n",
    "model.fit(corpus)\n",
    "\n",
    "model.best_params_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also supply a custom scoring function of your choice using the `scoring` parameter of `GridSearchCV` function. The example shown below uses `c_v` mode of `CoherenceModel` class for computing the scores of the candidate models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'iterations': 20, 'num_topics': 2}"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from gensim.models.coherencemodel import CoherenceModel\n",
    "\n",
    "# supplying a custom scoring function\n",
    "def scoring_function(estimator, X, y=None):\n",
    "    goodcm = CoherenceModel(model=estimator.gensim_model, texts=data_texts, dictionary=estimator.gensim_model.id2word, coherence='c_v')\n",
    "    return goodcm.get_coherence()\n",
    "\n",
    "obj = LdaTransformer(id2word=id2word, num_topics=5, iterations=5)\n",
    "parameters = {'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)}\n",
    "\n",
    "# set `scoring` as your custom scoring function\n",
    "model = GridSearchCV(obj, parameters, cv=2, scoring=scoring_function)\n",
    "model.fit(corpus)\n",
    "\n",
    "model.best_params_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example of Using Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn import linear_model\n",
    "\n",
    "def print_features_pipe(clf, vocab, n=10):\n",
    "    ''' Better printing for sorted list '''\n",
    "    coef = clf.named_steps['classifier'].coef_[0]\n",
    "    print coef\n",
    "    print 'Positive features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[::-1][:n] if coef[j] > 0]))\n",
    "    print 'Negative features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[:n] if coef[j] < 0]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "id2word = Dictionary([_.split() for _ in data.data])\n",
    "corpus = [id2word.doc2bow(i.split()) for i in data.data]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ 0.3032212   0.53114732 -0.3556002   0.05528797 -0.23462074  0.10164825\n",
      " -0.34895972 -0.07528751 -0.31437197 -0.24760965 -0.27430636 -0.05328458\n",
      "  0.1792989  -0.11535102  0.98473296]\n",
      "Positive features: >Pat:0.98 considered,:0.53 Fame.:0.30 internet...:0.18 comp.org.eff.talk.:0.10 Keach:0.06\n",
      "Negative features: Fame,:-0.36 01101001B:-0.35 circuitry:-0.31 hanging:-0.27 red@redpoll.neoucom.edu:-0.25 comp.org.eff.talk,:-0.23 dome.:-0.12 *best*:-0.08 trawling:-0.05\n",
      "0.648489932886\n"
     ]
    }
   ],
   "source": [
    "model = LdaTransformer(num_topics=15, id2word=id2word, iterations=10, random_state=37)\n",
    "clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used\n",
    "pipe = Pipeline([('features', model,), ('classifier', clf)])\n",
    "pipe.fit(corpus, data.target)\n",
    "print_features_pipe(pipe, id2word.values())\n",
    "\n",
    "print(pipe.score(corpus, data.target))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### LSI Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use LsiModel begin with importing LsiModel wrapper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from gensim.sklearn_api import LsiTransformer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example of Using Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ 0.13655775  0.00381287  0.02643593 -0.08499907 -0.02387209  0.6004697\n",
      "  1.07090198  0.03926809  0.43769831  0.54886088 -0.20186911 -0.21785685\n",
      "  1.30488175  0.08663351  0.17558704]\n",
      "Positive features: internet...:1.30 01101001B:1.07 comp.org.eff.talk.:0.60 red@redpoll.neoucom.edu:0.55 circuitry:0.44 >Pat:0.18 Fame.:0.14 dome.:0.09 *best*:0.04 Fame,:0.03\n",
      "Negative features: trawling:-0.22 hanging:-0.20 Keach:-0.08 comp.org.eff.talk,:-0.02\n",
      "0.865771812081\n"
     ]
    }
   ],
   "source": [
    "model = LsiTransformer(num_topics=15, id2word=id2word)\n",
    "clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used\n",
    "pipe = Pipeline([('features', model,), ('classifier', clf)])\n",
    "pipe.fit(corpus, data.target)\n",
    "print_features_pipe(pipe, id2word.values())\n",
    "\n",
    "print(pipe.score(corpus, data.target))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Random Projections Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use RpModel begin with importing RpModel wrapper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from gensim.sklearn_api import RpTransformer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example of Using Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[-0.01217523  0.0109422 ]\n",
      "Positive features: considered,:0.01\n",
      "Negative features: Fame.:-0.01\n",
      "0.604865771812\n"
     ]
    }
   ],
   "source": [
    "model = RpTransformer(num_topics=2)\n",
    "np.random.mtrand.RandomState(1)  # set seed for getting same result\n",
    "clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used\n",
    "pipe = Pipeline([('features', model,), ('classifier', clf)])\n",
    "pipe.fit(corpus, data.target)\n",
    "print_features_pipe(pipe, id2word.values())\n",
    "\n",
    "print(pipe.score(corpus, data.target))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### LDASeq Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use LdaSeqModel begin with importing LdaSeqModel wrapper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from gensim.sklearn_api import LdaSeqTransformer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example of Using Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/chinmaya/GSOC/Gensim/gensim/gensim/models/ldaseqmodel.py:217: RuntimeWarning: divide by zero encountered in double_scalars\n",
      "  convergence = np.fabs((bound - old_bound) / old_bound)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[-0.04877324  0.04877324]\n",
      "Positive features: NLCS:0.05\n",
      "Negative features: What:-0.05\n",
      "1.0\n"
     ]
    }
   ],
   "source": [
    "test_data = data.data[0:2]\n",
    "test_target = data.target[0:2]\n",
    "id2word_ldaseq = Dictionary(map(lambda x: x.split(), test_data))\n",
    "corpus_ldaseq = [id2word_ldaseq.doc2bow(i.split()) for i in test_data]\n",
    "\n",
    "model = LdaSeqTransformer(id2word=id2word_ldaseq, num_topics=2, time_slice=[1, 1, 1], initialize='gensim')\n",
    "clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used\n",
    "pipe = Pipeline([('features', model,), ('classifier', clf)])\n",
    "pipe.fit(corpus_ldaseq, test_target)\n",
    "print_features_pipe(pipe, id2word_ldaseq.values())\n",
    "\n",
    "print(pipe.score(corpus_ldaseq, test_target))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### Word2Vec Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use Word2Vec model begin with importing Word2Vec wrapper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from gensim.sklearn_api import W2VTransformer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example of Using Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.7\n"
     ]
    }
   ],
   "source": [
    "w2v_texts = [\n",
    "    ['calculus', 'is', 'the', 'mathematical', 'study', 'of', 'continuous', 'change'],\n",
    "    ['geometry', 'is', 'the', 'study', 'of', 'shape'],\n",
    "    ['algebra', 'is', 'the', 'study', 'of', 'generalizations', 'of', 'arithmetic', 'operations'],\n",
    "    ['differential', 'calculus', 'is', 'related', 'to', 'rates', 'of', 'change', 'and', 'slopes', 'of', 'curves'],\n",
    "    ['integral', 'calculus', 'is', 'realted', 'to', 'accumulation', 'of', 'quantities', 'and', 'the', 'areas', 'under', 'and', 'between', 'curves'],\n",
    "    ['physics', 'is', 'the', 'natural', 'science', 'that', 'involves', 'the', 'study', 'of', 'matter', 'and', 'its', 'motion', 'and', 'behavior', 'through', 'space', 'and', 'time'],\n",
    "    ['the', 'main', 'goal', 'of', 'physics', 'is', 'to', 'understand', 'how', 'the', 'universe', 'behaves'],\n",
    "    ['physics', 'also', 'makes', 'significant', 'contributions', 'through', 'advances', 'in', 'new', 'technologies', 'that', 'arise', 'from', 'theoretical', 'breakthroughs'],\n",
    "    ['advances', 'in', 'the', 'understanding', 'of', 'electromagnetism', 'or', 'nuclear', 'physics', 'led', 'directly', 'to', 'the', 'development', 'of', 'new', 'products', 'that', 'have', 'dramatically', 'transformed', 'modern', 'day', 'society']\n",
    "]\n",
    "\n",
    "model = W2VTransformer(size=10, min_count=1)\n",
    "model.fit(w2v_texts)\n",
    "\n",
    "class_dict = {'mathematics': 1, 'physics': 0}\n",
    "train_data = [\n",
    "    ('calculus', 'mathematics'), ('mathematical', 'mathematics'), ('geometry', 'mathematics'), ('operations', 'mathematics'), ('curves', 'mathematics'),\n",
    "    ('natural', 'physics'), ('nuclear', 'physics'), ('science', 'physics'), ('electromagnetism', 'physics'), ('natural', 'physics')\n",
    "]\n",
    "\n",
    "train_input = list(map(lambda x: x[0], train_data))\n",
    "train_target = list(map(lambda x: class_dict[x[1]], train_data))\n",
    "\n",
    "clf = linear_model.LogisticRegression(penalty='l2', C=0.1)\n",
    "clf.fit(model.transform(train_input), train_target)\n",
    "text_w2v = Pipeline([('features', model,), ('classifier', clf)])\n",
    "score = text_w2v.score(train_input, train_target)\n",
    "\n",
    "print(score)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### AuthorTopic Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use AuthorTopic model begin with importing AuthorTopic wrapper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from gensim.sklearn_api import AuthorTopicTransformer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example of Using Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1 0 0]\n"
     ]
    }
   ],
   "source": [
    "from sklearn import cluster\n",
    "\n",
    "atm_texts = [\n",
    "    ['complier', 'system', 'computer'],\n",
    "    ['eulerian', 'node', 'cycle', 'graph', 'tree', 'path'],\n",
    "    ['graph', 'flow', 'network', 'graph'],\n",
    "    ['loading', 'computer', 'system'],\n",
    "    ['user', 'server', 'system'],\n",
    "    ['tree', 'hamiltonian'],\n",
    "    ['graph', 'trees'],\n",
    "    ['computer', 'kernel', 'malfunction', 'computer'],\n",
    "    ['server', 'system', 'computer'],\n",
    "]\n",
    "atm_dictionary = Dictionary(atm_texts)\n",
    "atm_corpus = [atm_dictionary.doc2bow(text) for text in atm_texts]\n",
    "author2doc = {'john': [0, 1, 2, 3, 4, 5, 6], 'jane': [2, 3, 4, 5, 6, 7, 8], 'jack': [0, 2, 4, 6, 8], 'jill': [1, 3, 5, 7]}\n",
    "\n",
    "model = AuthorTopicTransformer(id2word=atm_dictionary, author2doc=author2doc, num_topics=10, passes=100)\n",
    "model.fit(atm_corpus)\n",
    "\n",
    "# create and train clustering model\n",
    "clstr = cluster.MiniBatchKMeans(n_clusters=2)\n",
    "authors_full = ['john', 'jane', 'jack', 'jill']\n",
    "clstr.fit(model.transform(authors_full))\n",
    "\n",
    "# stack together the two models in a pipeline\n",
    "text_atm = Pipeline([('features', model,), ('cluster', clstr)])\n",
    "author_list = ['jane', 'jack', 'jill']\n",
    "ret_val = text_atm.predict(author_list)\n",
    "\n",
    "print(ret_val)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Doc2Vec Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use Doc2Vec model begin with importing Doc2Vec wrapper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from gensim.sklearn_api import D2VTransformer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example of Using Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.0\n"
     ]
    }
   ],
   "source": [
    "from gensim.models import doc2vec\n",
    "d2v_sentences = [doc2vec.TaggedDocument(words, [i]) for i, words in enumerate(w2v_texts)]\n",
    "\n",
    "model = D2VTransformer(min_count=1)\n",
    "model.fit(d2v_sentences)\n",
    "\n",
    "class_dict = {'mathematics': 1, 'physics': 0}\n",
    "train_data = [\n",
    "    (['calculus', 'mathematical'], 'mathematics'), (['geometry', 'operations', 'curves'], 'mathematics'),\n",
    "    (['natural', 'nuclear'], 'physics'), (['science', 'electromagnetism', 'natural'], 'physics')\n",
    "]\n",
    "train_input = list(map(lambda x: x[0], train_data))\n",
    "train_target = list(map(lambda x: class_dict[x[1]], train_data))\n",
    "\n",
    "clf = linear_model.LogisticRegression(penalty='l2', C=0.1)\n",
    "clf.fit(model.transform(train_input), train_target)\n",
    "text_d2v = Pipeline([('features', model,), ('classifier', clf)])\n",
    "score = text_d2v.score(train_input, train_target)\n",
    "\n",
    "print(score)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Text2Bow Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use Text2Bow model begin with importing Text2Bow wrapper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from gensim.sklearn_api import Text2BowTransformer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example of Using Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.947147651007\n"
     ]
    }
   ],
   "source": [
    "text2bow_model = Text2BowTransformer()\n",
    "lda_model = LdaTransformer(num_topics=2, passes=10, minimum_probability=0, random_state=np.random.seed(0))\n",
    "clf = linear_model.LogisticRegression(penalty='l2', C=0.1)\n",
    "text_t2b = Pipeline([('bow_model', text2bow_model), ('ldamodel', lda_model), ('classifier', clf)])\n",
    "text_t2b.fit(data.data, data.target)\n",
    "score = text_t2b.score(data.data, data.target)\n",
    "\n",
    "print(score)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### TfIdf Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use TfIdf model begin with importing TfIdf wrapper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from gensim.sklearn_api import TfIdfTransformer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example of Using Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.578859060403\n"
     ]
    }
   ],
   "source": [
    "tfidf_model = TfIdfTransformer()\n",
    "tfidf_model.fit(corpus)\n",
    "lda_model = LdaTransformer(num_topics=2, passes=10, minimum_probability=0, random_state=np.random.seed(0))\n",
    "clf = linear_model.LogisticRegression(penalty='l2', C=0.1)\n",
    "text_tfidf = Pipeline((('tfidf_model', tfidf_model), ('ldamodel', lda_model), ('classifier', clf)))\n",
    "text_tfidf.fit(corpus, data.target)\n",
    "score = text_tfidf.score(corpus, data.target)\n",
    "\n",
    "print(score)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### HDP Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use HDP model begin with importing HDP wrapper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from gensim.sklearn_api import HdpTransformer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example of Using Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.848154362416\n"
     ]
    }
   ],
   "source": [
    "model = HdpTransformer(id2word=id2word)\n",
    "clf = linear_model.LogisticRegression(penalty='l2', C=0.1)\n",
    "text_hdp = Pipeline([('features', model,), ('classifier', clf)])\n",
    "text_hdp.fit(corpus, data.target)\n",
    "score = text_hdp.score(corpus, data.target)\n",
    "\n",
    "print(score)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}