{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# `pyLDAvis.lda_model`\n", "\n", "pyLDAvis now also supports scikit-learn's LatentDirichletAllocation. Let's take a look at this in more detail, using the 20 newsgroups dataset provided by scikit-learn." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [] }, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore', category=DeprecationWarning)\n", "warnings.filterwarnings('ignore', category=FutureWarning)\n", "warnings.filterwarnings('ignore', category=UserWarning)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "import pyLDAvis\n", "import pyLDAvis.lda_model\n", "pyLDAvis.enable_notebook()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [] }, "outputs": [], "source": [ "from sklearn.datasets import fetch_20newsgroups\n", "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n", "from sklearn.decomposition import LatentDirichletAllocation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load 20 newsgroups dataset\n", "\n", "First, the 20 newsgroups dataset available in sklearn is loaded. As usual, the headers, footers and quotes are removed." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "11314\n" ] } ], "source": [ "newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))\n", "docs_raw = newsgroups.data\n", "print(len(docs_raw))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convert to document-term matrix\n", "\n", "Next, the raw documents are converted into a document-term matrix, either as raw counts or in TF-IDF form."
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(11314, 9144)\n" ] } ], "source": [ "tf_vectorizer = CountVectorizer(strip_accents='unicode',\n", "                                stop_words='english',\n", "                                lowercase=True,\n", "                                token_pattern=r'\\b[a-zA-Z]{3,}\\b',\n", "                                max_df=0.5,\n", "                                min_df=10)\n", "dtm_tf = tf_vectorizer.fit_transform(docs_raw)\n", "print(dtm_tf.shape)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(11314, 9144)\n" ] } ], "source": [ "tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())\n", "dtm_tfidf = tfidf_vectorizer.fit_transform(docs_raw)\n", "print(dtm_tfidf.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fit Latent Dirichlet Allocation models\n", "\n", "Finally, the LDA models are fitted." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
LatentDirichletAllocation(n_components=20, random_state=0)" ], "text/plain": [ "LatentDirichletAllocation(n_components=20, random_state=0)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fit an LDA model on the term-count matrix\n", "lda_tf = LatentDirichletAllocation(n_components=20, random_state=0)\n", "lda_tf.fit(dtm_tf)" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 4 }