{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Benchmark testing of coherence pipeline on Movies dataset:\n", "## How to find how well coherence measure matches your manual annotators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Introduction__: For the validation of any model adapted from a paper, it is of utmost importance that the results of benchmark testing on the datasets listed in the paper match between the actual implementation (palmetto) and gensim. This coherence pipeline has been implemented from the work done by Roeder et al. The paper can be found [here](http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf).\n", "\n", "__Approach__ :\n", "1. We will use the Movies dataset first. This dataset along with the topics on which the coherence is calculated and the gold (human) ratings on these topics can be found [here](http://139.18.2.164/mroeder/palmetto/datasets/).\n", "2. We will then calculate the coherence on these topics using the pipeline implemented in gensim.\n", "3. Once we have got all our coherence values on these topics we will calculate the correlation with the human ratings using pearson's r.\n", "4. We will compare this final correlation value with the values listed in the paper and see if the pipeline is working as expected." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The line_profiler extension is already loaded. To reload it, use:\n", " %reload_ext line_profiler\n" ] } ], "source": [ "import re\n", "import os\n", "\n", "from scipy.stats import pearsonr\n", "from datetime import datetime\n", "\n", "from gensim.models import CoherenceModel\n", "from gensim.corpora.dictionary import Dictionary\n", "%load_ext line_profiler # This was used for finding out which line was taking maximum time for indirect confirmation measure" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download the dataset from the link and plug in the location here" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "prefix = \"/home/devashish/datasets/Movies/movie/\"" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time taken: 0:10:23.956500\n" ] } ], "source": [ "start = datetime.now()\n", "texts = []\n", "for fil in os.listdir(prefix):\n", " for line in open(prefix + fil):\n", " # lower case all words\n", " lowered = line.lower()\n", " #remove punctuation and split into seperate words\n", " words = re.findall(r'\\w+', lowered, flags = re.UNICODE | re.LOCALE)\n", " texts.append(words)\n", "end = datetime.now()\n", "print \"Time taken: %s\" % (end - start)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time taken: 0:01:44.047829\n" ] } ], "source": [ "start = datetime.now()\n", "dictionary = Dictionary(texts)\n", "corpus = [dictionary.doc2bow(text) for text in texts]\n", "end = datetime.now()\n", "print \"Time taken: %s\" % (end - start)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cross validate the numbers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "According to the paper the number of documents should be 108952 with a vocabulary of 1625124. The difference is because of a difference in preprocessing. 
{ "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time taken: 0:01:44.047829\n" ] } ], "source": [ "start = datetime.now()\n", "dictionary = Dictionary(texts)\n", "corpus = [dictionary.doc2bow(text) for text in texts]\n", "end = datetime.now()\n", "print \"Time taken: %s\" % (end - start)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Cross validate the numbers" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "According to the paper, the corpus should contain 108952 documents with a vocabulary of 1625124 unique tokens. Our numbers differ because of differences in preprocessing; however, the results obtained are still very similar." ] },
{ "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "124234\n", "Dictionary(758123 unique tokens: [u'schelberger', u'mdbg', u'shatzky', u'bhetan', u'verplank']...)\n" ] } ], "source": [ "print len(corpus)\n", "print dictionary" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[[]]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "topics = []  # the file contains 100 topics, one per line\n", "for l in open('/home/devashish/datasets/Movies/topicsMovie.txt'):\n", "    # wrap each topic in a list so it can be passed directly as `topics` to CoherenceModel\n", "    topics.append([l.split()])\n", "topics.pop(100)  # drop the empty entry produced by the trailing blank line" ] },
{ "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "human_scores = []\n", "for l in open('/home/devashish/datasets/Movies/goldMovie.txt'):\n", "    human_scores.append(float(l.strip()))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Start off with the u_mass coherence measure\n", "`u_mass` only needs document co-occurrence counts from the bag-of-words corpus, so it is the fastest of the four measures." ] },
{ "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time taken: 0:20:44.833342\n" ] } ], "source": [ "start = datetime.now()\n", "u_mass = []\n", "flags = []  # indices of topics whose words are missing from the dictionary\n", "for n, topic in enumerate(topics):\n", "    try:\n", "        cm = CoherenceModel(topics=topic, corpus=corpus, dictionary=dictionary, coherence='u_mass')\n", "        u_mass.append(cm.get_coherence())\n", "    except KeyError:\n", "        flags.append(n)\n", "end = datetime.now()\n", "print \"Time taken: %s\" % (end - start)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Start the c_v coherence measure\n", "This is expected to take much more time, since `c_v` uses a sliding window for probability estimation and the cosine-similarity indirect confirmation measure." ] },
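{ "cell_type": "markdown", "metadata": {}, "source": [ "Before kicking off the full run (which took almost 20 hours here), a minimal sketch on a toy corpus illustrates the same API call. The texts and the topic below are made up purely for illustration, and the resulting number is not meaningful; `window_size` is passed only because the toy documents are far shorter than the default window:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "toy_texts = [['human', 'interface', 'computer'],\n", "             ['survey', 'user', 'computer', 'system', 'response', 'time'],\n", "             ['eps', 'user', 'interface', 'system'],\n", "             ['system', 'human', 'system', 'eps']]\n", "toy_dictionary = Dictionary(toy_texts)\n", "# a single made-up topic, wrapped in a list exactly like the Movies topics above\n", "toy_topic = [['human', 'computer', 'system']]\n", "toy_cm = CoherenceModel(topics=toy_topic, texts=toy_texts, dictionary=toy_dictionary,\n", "                        coherence='c_v', window_size=2)\n", "print toy_cm.get_coherence()" ] },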
] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time taken: 19:50:11.214341\n" ] } ], "source": [ "start = datetime.now()\n", "c_v = []\n", "for n, topic in enumerate(topics):\n", " try:\n", " cm = CoherenceModel(topics=topic, texts=texts, dictionary=dictionary, coherence='c_v')\n", " c_v.append(cm.get_coherence())\n", " except KeyError:\n", " pass\n", "end = datetime.now()\n", "print \"Time taken: %s\" % (end - start)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Start c_uci and c_npmi coherence measures\n", "They should be taking lesser time than c_v but should have a higher correlation than u_mass" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time taken: 2:55:36.044760\n" ] } ], "source": [ "start = datetime.now()\n", "c_uci = []\n", "flags = []\n", "for n, topic in enumerate(topics):\n", " try:\n", " cm = CoherenceModel(topics=topic, texts=texts, dictionary=dictionary, coherence='c_uci')\n", " c_uci.append(cm.get_coherence())\n", " except KeyError:\n", " flags.append(n)\n", "end = datetime.now()\n", "print \"Time taken: %s\" % (end - start)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time taken: 2:53:55.424213\n" ] } ], "source": [ "start = datetime.now()\n", "c_npmi = []\n", "for n, topic in enumerate(topics):\n", " print n\n", " try:\n", " cm = CoherenceModel(topics=topic, texts=texts, dictionary=dictionary, coherence='c_npmi')\n", " c_npmi.append(cm.get_coherence())\n", " except KeyError:\n", " pass\n", "end = datetime.now()\n", "print \"Time taken: %s\" % (end - start)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": true }, "outputs": [], "source": [ "final_scores = []\n", "for n, score in enumerate(human_scores):\n", " if n not in flags:\n", " final_scores.append(score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One topic encountered a KeyError. This was because of a difference in preprocessing due to which one topic word wasn't found in the dictionary" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "99 99 99 99 99\n" ] } ], "source": [ "print len(u_mass), len(c_v), len(c_uci), len(c_npmi), len(final_scores)\n", "# 1 topic has word(s) that is not in the dictionary. Probably some difference\n", "# in preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The values in the paper were:\n", "\n", "__`u_mass` correlation__ : 0.093\n", "\n", "__`c_v` correlation__ : 0.548\n", "\n", "__`c_uci` correlation__ : 0.473\n", "\n", "__`c_npmi` correlation__ : 0.438\n", "\n", "Our values are also very similar to these values which is good. This validates the correctness of our pipeline." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.133916622716\n", "0.555948711374\n", "0.414722858726\n", "0.39935634517\n" ] } ], "source": [ "print pearsonr(u_mass, final_scores)[0]\n", "print pearsonr(c_v, final_scores)[0]\n", "print pearsonr(c_uci, final_scores)[0]\n", "print pearsonr(c_npmi, final_scores)[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Where do we go now?\n", "\n", "- Preprocessing can be improved for this notebook by following the exact process mentioned in [this](http://arxiv.org/pdf/1403.6397v1.pdf) paper.\n", "- The time required for completing all of these operations can be improved a lot by cythonising the operations." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }