{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "import logging\n", "from gensim.models import EnsembleLda, LdaMulticore\n", "from gensim.models.ensemblelda import rank_masking\n", "from gensim.corpora import OpinosisCorpus\n", "import os" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "enable the ensemble logger to show what it is doing currently" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "elda_logger = logging.getLogger(EnsembleLda.__module__)\n", "elda_logger.setLevel(logging.INFO)\n", "elda_logger.addHandler(logging.StreamHandler())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def pretty_print_topics():\n", " # note that the words are stemmed so they appear chopped off\n", " for t in elda.print_topics(num_words=7):\n", " print('-', t[1].replace('*',' ').replace('\"','').replace(' +',','), '\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Experiments on the Opinosis Dataset\n", "\n", "Opinosis [1] is a small (but redundant) corpus that contains 289 product reviews for 51 products. Since it's so small, the results are rather unstable.\n", "\n", "[1] Kavita Ganesan, ChengXiang Zhai, and Jiawei Han, _Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions [online],_ Proceedings of the 23rd International Conference on Computational Linguistics, Association for Computational Linguistics, 2010, pp. 340–348. Available from: https://kavita-ganesan.com/opinosis/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparing the corpus\n", "\n", "First, download the opinosis dataset. On linux it can be done like this for example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!mkdir ~/opinosis\n", "!wget -P ~/opinosis https://github.com/kavgan/opinosis/raw/master/OpinosisDataset1.0_0.zip\n", "!unzip ~/opinosis/OpinosisDataset1.0_0.zip -d ~/opinosis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path = os.path.expanduser('~/opinosis/')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Corpus and id2word mapping can be created using the load_opinosis_data function provided in the package.\n", "It preprocesses the data using the PorterStemmer and stopwords from the nltk package.\n", "\n", "The parameter of the function is the relative path to the folder, into which the zip file was extracted before. That folder contains a 'summaries-gold' subfolder." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "opinosis = OpinosisCorpus(path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**parameters**\n", "\n", "**topic_model_kind** ldamulticore is highly recommended for EnsembleLda. ensemble_workers and **distance_workers** are used to improve the time needed to train the models, as well as the **masking_method** 'rank'. ldamulticore is not able to fully utilize all cores on this small corpus, so **ensemble_workers** can be set to 3 to get 95 - 100% cpu usage on my i5 3470.\n", "\n", "Since the corpus is so small, a high number of **num_models** is needed to extract stable topics. The Opinosis corpus contains 51 categories, however, some of them are quite similar. For example there are 3 categories about the batteries of portable products. There are also multiple categories about cars. So I chose 20 for num_topics, which is smaller than the number of categories." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "elda = EnsembleLda(\n", " corpus=opinosis.corpus, id2word=opinosis.id2word, num_models=128, num_topics=20,\n", " passes=20, iterations=100, ensemble_workers=3, distance_workers=4,\n", " topic_model_class='ldamulticore', masking_method=rank_masking,\n", ")\n", "pretty_print_topics()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The default for **min_samples** would be 64, half of the number of models and **eps** would be 0.1. You basically play around with them until you find a sweetspot that fits for your needs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "elda.recluster(min_samples=55, eps=0.14)\n", "pretty_print_topics()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5" } }, "nbformat": 4, "nbformat_minor": 2 }