{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Using wrappers for the scikit-learn API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial is about using gensim models as part of your scikit-learn workflow, with the help of the wrappers found at ```gensim.sklearn_api```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The wrappers available (as of now) are:\n", "* LdaModel (```gensim.sklearn_api.ldamodel.LdaTransformer```), which implements gensim's ```LDA Model``` in a scikit-learn interface\n", "\n", "* LsiModel (```gensim.sklearn_api.lsimodel.LsiTransformer```), which implements gensim's ```LSI Model``` in a scikit-learn interface\n", "\n", "* RpModel (```gensim.sklearn_api.rpmodel.RpTransformer```), which implements gensim's ```Random Projections Model``` in a scikit-learn interface\n", "\n", "* LDASeq Model (```gensim.sklearn_api.ldaseqmodel.LdaSeqTransformer```), which implements gensim's ```LdaSeqModel``` in a scikit-learn interface\n", "\n", "* Word2Vec Model (```gensim.sklearn_api.w2vmodel.W2VTransformer```), which implements gensim's ```Word2Vec``` in a scikit-learn interface\n", "\n", "* AuthorTopic Model (```gensim.sklearn_api.atmodel.AuthorTopicTransformer```), which implements gensim's ```AuthorTopicModel``` in a scikit-learn interface\n", "\n", "* Doc2Vec Model (```gensim.sklearn_api.d2vmodel.D2VTransformer```), which implements gensim's ```Doc2Vec``` in a scikit-learn interface\n", "\n", "* Text2Bow Model (```gensim.sklearn_api.text2bow.Text2BowTransformer```), which implements gensim's ```Dictionary``` in a scikit-learn interface\n", "\n", "* TfIdf Model (```gensim.sklearn_api.tfidf.TfIdfTransformer```), which implements gensim's ```TfidfModel``` in a scikit-learn interface\n", "\n", "* HDP Model (```gensim.sklearn_api.hdp.HdpTransformer```), which implements gensim's ```HdpModel``` in a scikit-learn interface" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"### LDA Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use LdaModel, begin by importing the LdaModel wrapper:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from gensim.sklearn_api import LdaTransformer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we will create a dummy set of texts and convert it into a corpus:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from gensim.corpora import Dictionary\n", "texts = [\n", " ['complier', 'system', 'computer'],\n", " ['eulerian', 'node', 'cycle', 'graph', 'tree', 'path'],\n", " ['graph', 'flow', 'network', 'graph'],\n", " ['loading', 'computer', 'system'],\n", " ['user', 'server', 'system'],\n", " ['tree', 'hamiltonian'],\n", " ['graph', 'trees'],\n", " ['computer', 'kernel', 'malfunction', 'computer'],\n", " ['server', 'system', 'computer']\n", "]\n", "dictionary = Dictionary(texts)\n", "corpus = [dictionary.doc2bow(text) for text in texts]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then run the LdaModel on it:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.84165996, 0.15834005],\n", " [0.716593 , 0.28340697],\n", " [0.11434125, 0.88565874],\n", " [0.80545014, 0.19454984],\n", " [0.39609504, 0.603905 ],\n", " [0.80124027, 0.19875973],\n", " [0.19269218, 0.80730784],\n", " [0.8466452 , 0.15335481],\n", " [0.67057097, 0.32942903]], dtype=float32)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = LdaTransformer(num_topics=2, id2word=dictionary, iterations=20, random_state=1)\n", "model.fit(corpus)\n", "model.transform(corpus)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "#### Integration with scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To provide a better example of how it can be used with 
scikit-learn, let's use the [20 Newsgroups data set](http://qwone.com/~jason/20Newsgroups/). We will use only the categories rec.sport.baseball and sci.crypt to generate topics." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from gensim import matutils\n", "from gensim.models.ldamodel import LdaModel\n", "from sklearn.datasets import fetch_20newsgroups\n", "from gensim.sklearn_api.ldamodel import LdaTransformer" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "rand = np.random.mtrand.RandomState(1) # seed for reproducible shuffling\n", "cats = ['rec.sport.baseball', 'sci.crypt']\n", "data = fetch_20newsgroups(subset='train', categories=cats, shuffle=True, random_state=rand)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we use the loaded data to create our dictionary and corpus." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "data_texts = [doc.split() for doc in data.data]\n", "id2word = Dictionary(data_texts)\n", "corpus = [id2word.doc2bow(text) for text in data_texts]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we just need to fit the corpus and id2word to our LDA wrapper."
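] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before fitting, it can help to sanity-check what we just built. A minimal sketch, using only the variables defined above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# quick sanity checks on the corpus and dictionary built above\n", "print(len(corpus))       # number of documents\n", "print(len(id2word))      # vocabulary size\n", "print(corpus[0][:5])     # first few (token_id, count) pairs of the first document"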
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "obj = LdaTransformer(id2word=id2word, num_topics=5, iterations=20)\n", "lda = obj.fit(corpus)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "#### Example of Using Grid Search" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The built-in `score` function of the LDA wrapper class provides two modes, `perplexity` and `u_mass`, for computing the scores of candidate models. The preferred mode is specified using the `scorer` parameter of the wrapper, as follows:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'iterations': 20, 'num_topics': 3}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj = LdaTransformer(id2word=id2word, num_topics=2, iterations=5, scorer='u_mass') # here 'scorer' can be 'perplexity' or 'u_mass'\n", "parameters = {'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)}\n", "\n", "# set `scoring` to `None` to use the built-in score function of the `LdaTransformer` class\n", "model = GridSearchCV(obj, parameters, cv=3, scoring=None)\n", "model.fit(corpus)\n", "\n", "model.best_params_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also supply a custom scoring function using the `scoring` parameter of `GridSearchCV`. The example shown below uses the `c_v` mode of the `CoherenceModel` class for computing the scores of the candidate models."
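] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick standalone illustration (assuming the `lda` wrapper fitted earlier in this section), a single model can be scored with `CoherenceModel` directly, without a grid search:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gensim.models.coherencemodel import CoherenceModel\n", "\n", "# score one fitted model with c_v coherence; `gensim_model` is the wrapped gensim model\n", "cm = CoherenceModel(model=lda.gensim_model, texts=data_texts, dictionary=id2word, coherence='c_v')\n", "print(cm.get_coherence())"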
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'iterations': 50, 'num_topics': 2}" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from gensim.models.coherencemodel import CoherenceModel\n", "\n", "# supplying a custom scoring function\n", "def scoring_function(estimator, X, y=None):\n", " goodcm = CoherenceModel(model=estimator.gensim_model, texts=data_texts, dictionary=estimator.gensim_model.id2word, coherence='c_v')\n", " return goodcm.get_coherence()\n", "\n", "obj = LdaTransformer(id2word=id2word, num_topics=5, iterations=5)\n", "parameters = {'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)}\n", "\n", "# set `scoring` as your custom scoring function\n", "model = GridSearchCV(obj, parameters, cv=2, scoring=scoring_function)\n", "model.fit(corpus)\n", "\n", "model.best_params_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example of Using Pipeline" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "from sklearn import linear_model\n", "\n", "def print_features_pipe(clf, vocab, n=10):\n", " '''Print the top positive and negative features of a fitted pipeline.'''\n", " # note: `vocab` must correspond to the classifier's input features\n", " vocab = list(vocab) # allow indexing when a dict view is passed\n", " coef = clf.named_steps['classifier'].coef_[0]\n", " print(coef)\n", " print('Positive features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[::-1][:n] if coef[j] > 0])))\n", " print('Negative features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[:n] if coef[j] < 0])))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "id2word = Dictionary([doc.split() for doc in data.data])\n", "corpus = [id2word.doc2bow(doc.split()) for doc in data.data]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ 
"/home/misha/envs/gensim/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n", " FutureWarning)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "0.6459731543624161\n" ] } ], "source": [ "model = LdaTransformer(num_topics=15, id2word=id2word, iterations=10, random_state=37)\n", "clf = linear_model.LogisticRegression(penalty='l2', C=0.1) # l2 penalty used\n", "pipe = Pipeline([('features', model,), ('classifier', clf)])\n", "pipe.fit(corpus, data.target)\n", "# print_features_pipe(pipe, id2word.values())\n", "\n", "print(pipe.score(corpus, data.target))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### LSI Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use LsiModel, begin by importing the LsiModel wrapper:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "from gensim.sklearn_api import LsiTransformer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example of Using Pipeline" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/misha/envs/gensim/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. 
Specify a solver to silence this warning.\n", " FutureWarning)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "0.8657718120805369\n" ] } ], "source": [ "model = LsiTransformer(num_topics=15, id2word=id2word)\n", "clf = linear_model.LogisticRegression(penalty='l2', C=0.1) # l2 penalty used\n", "pipe = Pipeline([('features', model,), ('classifier', clf)])\n", "pipe.fit(corpus, data.target)\n", "# print_features_pipe(pipe, id2word.values())\n", "\n", "print(pipe.score(corpus, data.target))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Random Projections Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use RpModel, begin by importing the RpModel wrapper:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "from gensim.sklearn_api import RpTransformer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example of Using Pipeline" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/misha/envs/gensim/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. 
Specify a solver to silence this warning.\n", " FutureWarning)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "0.5461409395973155\n" ] } ], "source": [ "model = RpTransformer(num_topics=2)\n", "np.random.seed(1) # seed the global RNG for reproducible results\n", "clf = linear_model.LogisticRegression(penalty='l2', C=0.1) # l2 penalty used\n", "pipe = Pipeline([('features', model,), ('classifier', clf)])\n", "pipe.fit(corpus, data.target)\n", "# print_features_pipe(pipe, id2word.values())\n", "\n", "print(pipe.score(corpus, data.target))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### LDASeq Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use LdaSeqModel, begin by importing the LdaSeqModel wrapper:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "from gensim.sklearn_api import LdaSeqTransformer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example of Using Pipeline" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/misha/git/gensim/gensim/models/ldaseqmodel.py:293: RuntimeWarning: divide by zero encountered in double_scalars\n", " convergence = np.fabs((bound - old_bound) / old_bound)\n", "/home/misha/envs/gensim/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. 
Specify a solver to silence this warning.\n", " FutureWarning)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "1.0\n" ] } ], "source": [ "test_data = data.data[0:2]\n", "test_target = data.target[0:2]\n", "id2word_ldaseq = Dictionary([text.split() for text in test_data])\n", "corpus_ldaseq = [id2word_ldaseq.doc2bow(text.split()) for text in test_data]\n", "\n", "model = LdaSeqTransformer(id2word=id2word_ldaseq, num_topics=2, time_slice=[1, 1, 1], initialize='gensim')\n", "clf = linear_model.LogisticRegression(penalty='l2', C=0.1) # l2 penalty used\n", "pipe = Pipeline([('features', model,), ('classifier', clf)])\n", "pipe.fit(corpus_ldaseq, test_target)\n", "# print_features_pipe(pipe, id2word_ldaseq.values())\n", "\n", "print(pipe.score(corpus_ldaseq, test_target))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### Word2Vec Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use the Word2Vec model, begin by importing the Word2Vec wrapper:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "from gensim.sklearn_api import W2VTransformer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example of Using Pipeline" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/misha/envs/gensim/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. 
Specify a solver to silence this warning.\n", " FutureWarning)\n" ] } ], "source": [ "w2v_texts = [\n", " ['calculus', 'is', 'the', 'mathematical', 'study', 'of', 'continuous', 'change'],\n", " ['geometry', 'is', 'the', 'study', 'of', 'shape'],\n", " ['algebra', 'is', 'the', 'study', 'of', 'generalizations', 'of', 'arithmetic', 'operations'],\n", " ['differential', 'calculus', 'is', 'related', 'to', 'rates', 'of', 'change', 'and', 'slopes', 'of', 'curves'],\n", " ['integral', 'calculus', 'is', 'related', 'to', 'accumulation', 'of', 'quantities', 'and', 'the', 'areas', 'under', 'and', 'between', 'curves'],\n", " ['physics', 'is', 'the', 'natural', 'science', 'that', 'involves', 'the', 'study', 'of', 'matter', 'and', 'its', 'motion', 'and', 'behavior', 'through', 'space', 'and', 'time'],\n", " ['the', 'main', 'goal', 'of', 'physics', 'is', 'to', 'understand', 'how', 'the', 'universe', 'behaves'],\n", " ['physics', 'also', 'makes', 'significant', 'contributions', 'through', 'advances', 'in', 'new', 'technologies', 'that', 'arise', 'from', 'theoretical', 'breakthroughs'],\n", " ['advances', 'in', 'the', 'understanding', 'of', 'electromagnetism', 'or', 'nuclear', 'physics', 'led', 'directly', 'to', 'the', 'development', 'of', 'new', 'products', 'that', 'have', 'dramatically', 'transformed', 'modern', 'day', 'society']\n", "]\n", "\n", "model = W2VTransformer(size=10, min_count=1)\n", "model.fit(w2v_texts)\n", "\n", "class_dict = {'mathematics': 1, 'physics': 0}\n", "train_data = [\n", " ('calculus', 'mathematics'), ('mathematical', 'mathematics'), ('geometry', 'mathematics'), ('operations', 'mathematics'), ('curves', 'mathematics'),\n", " ('natural', 'physics'), ('nuclear', 'physics'), ('science', 'physics'), ('electromagnetism', 'physics'), ('natural', 'physics')\n", "]\n", "\n", "train_input = [x[0] for x in train_data]\n", "train_target = [class_dict[x[1]] for x in train_data]\n", "\n", "clf = linear_model.LogisticRegression(penalty='l2', 
C=0.1)\n", "clf.fit(model.transform(train_input), train_target)\n", "text_w2v = Pipeline([('features', model,), ('classifier', clf)])\n", "score = text_w2v.score(train_input, train_target)\n", "\n", "print(score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### AuthorTopic Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use the AuthorTopic model, begin by importing the AuthorTopic wrapper:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "from gensim.sklearn_api import AuthorTopicTransformer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example of Using Pipeline" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1 1 0]\n" ] } ], "source": [ "from sklearn import cluster\n", "\n", "atm_texts = [\n", " ['complier', 'system', 'computer'],\n", " ['eulerian', 'node', 'cycle', 'graph', 'tree', 'path'],\n", " ['graph', 'flow', 'network', 'graph'],\n", " ['loading', 'computer', 'system'],\n", " ['user', 'server', 'system'],\n", " ['tree', 'hamiltonian'],\n", " ['graph', 'trees'],\n", " ['computer', 'kernel', 'malfunction', 'computer'],\n", " ['server', 'system', 'computer'],\n", "]\n", "atm_dictionary = Dictionary(atm_texts)\n", "atm_corpus = [atm_dictionary.doc2bow(text) for text in atm_texts]\n", "author2doc = {'john': [0, 1, 2, 3, 4, 5, 6], 'jane': [2, 3, 4, 5, 6, 7, 8], 'jack': [0, 2, 4, 6, 8], 'jill': [1, 3, 5, 7]}\n", "\n", "model = AuthorTopicTransformer(id2word=atm_dictionary, author2doc=author2doc, num_topics=10, passes=100)\n", "model.fit(atm_corpus)\n", "\n", "# create and train clustering model\n", "clstr = cluster.MiniBatchKMeans(n_clusters=2)\n", "authors_full = ['john', 'jane', 'jack', 'jill']\n", "clstr.fit(model.transform(authors_full))\n", "\n", "# stack together the two models in a pipeline\n", "text_atm = Pipeline([('features', model,), ('cluster', clstr)])\n", 
"author_list = ['jane', 'jack', 'jill']\n", "ret_val = text_atm.predict(author_list)\n", "\n", "print(ret_val)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Doc2Vec Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use the Doc2Vec model, begin by importing the Doc2Vec wrapper:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "from gensim.sklearn_api import D2VTransformer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example of Using Pipeline" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/misha/envs/gensim/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n", " FutureWarning)\n" ] } ], "source": [ "from gensim.models import doc2vec\n", "d2v_sentences = [doc2vec.TaggedDocument(words, [i]) for i, words in enumerate(w2v_texts)]\n", "\n", "model = D2VTransformer(min_count=1)\n", "model.fit(d2v_sentences)\n", "\n", "class_dict = {'mathematics': 1, 'physics': 0}\n", "train_data = [\n", " (['calculus', 'mathematical'], 'mathematics'), (['geometry', 'operations', 'curves'], 'mathematics'),\n", " (['natural', 'nuclear'], 'physics'), (['science', 'electromagnetism', 'natural'], 'physics')\n", "]\n", "train_input = [x[0] for x in train_data]\n", "train_target = [class_dict[x[1]] for x in train_data]\n", "\n", "clf = linear_model.LogisticRegression(penalty='l2', C=0.1)\n", "clf.fit(model.transform(train_input), train_target)\n", "text_d2v = Pipeline([('features', model,), ('classifier', clf)])\n", "score = text_d2v.score(train_input, train_target)\n", "\n", "print(score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text2Bow Model" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "To use the Text2Bow model, begin by importing the Text2Bow wrapper:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "from gensim.sklearn_api import Text2BowTransformer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example of Using Pipeline" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/misha/envs/gensim/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n", " FutureWarning)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "0.9723154362416108\n" ] } ], "source": [ "text2bow_model = Text2BowTransformer()\n", "lda_model = LdaTransformer(num_topics=2, passes=10, minimum_probability=0, random_state=0)\n", "clf = linear_model.LogisticRegression(penalty='l2', C=0.1)\n", "text_t2b = Pipeline([('bow_model', text2bow_model), ('ldamodel', lda_model), ('classifier', clf)])\n", "text_t2b.fit(data.data, data.target)\n", "score = text_t2b.score(data.data, data.target)\n", "\n", "print(score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TfIdf Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use the TfIdf model, begin by importing the TfIdf wrapper:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "from gensim.sklearn_api import TfIdfTransformer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example of Using Pipeline" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/misha/envs/gensim/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. 
Specify a solver to silence this warning.\n", " FutureWarning)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "0.735738255033557\n" ] } ], "source": [ "tfidf_model = TfIdfTransformer()\n", "tfidf_model.fit(corpus)\n", "lda_model = LdaTransformer(num_topics=2, passes=10, minimum_probability=0, random_state=0)\n", "clf = linear_model.LogisticRegression(penalty='l2', C=0.1)\n", "text_tfidf = Pipeline([('tfidf_model', tfidf_model), ('ldamodel', lda_model), ('classifier', clf)])\n", "text_tfidf.fit(corpus, data.target)\n", "score = text_tfidf.score(corpus, data.target)\n", "\n", "print(score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### HDP Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use the HDP model, begin by importing the HDP wrapper:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "from gensim.sklearn_api import HdpTransformer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example of Using Pipeline" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/misha/envs/gensim/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. 
Specify a solver to silence this warning.\n", " FutureWarning)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "0.8271812080536913\n" ] } ], "source": [ "model = HdpTransformer(id2word=id2word)\n", "clf = linear_model.LogisticRegression(penalty='l2', C=0.1)\n", "text_hdp = Pipeline([('features', model,), ('classifier', clf)])\n", "text_hdp.fit(corpus, data.target)\n", "score = text_hdp.score(corpus, data.target)\n", "\n", "print(score)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 1 }