{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Benchmark testing of coherence pipeline on Movies dataset\n",
    "## How to find how well coherence measure matches your manual annotators"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Introduction__: For the validation of any model adapted from a paper, it is of utmost importance that the results of benchmark testing on the datasets listed in the paper match between the actual implementation (palmetto) and gensim. This coherence pipeline has been implemented from the work done by Roeder et al. The paper can be found [here](http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf).\n",
    "\n",
    "__Approach__ :\n",
    "1. In this notebook, we'll use the Movies dataset mentioned in the paper. This dataset along with the topics on which the coherence is calculated and the gold (human) ratings on these topics can be found [here](http://139.18.2.164/mroeder/palmetto/datasets/).\n",
    "2. We will then calculate the coherence on these topics using the pipeline implemented in gensim.\n",
    "3. Once we have all our coherence values on these topics we will calculate the correlation with the human ratings using pearson's r.\n",
    "4. We will compare this final correlation value with the values listed in the paper and see if the pipeline is working as expected."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from __future__ import print_function\n",
    "\n",
    "import re\n",
    "import os\n",
    "\n",
    "from scipy.stats import pearsonr\n",
    "from datetime import datetime\n",
    "\n",
    "from gensim.models import CoherenceModel\n",
    "from gensim.corpora.dictionary import Dictionary\n",
    "from smart_open import smart_open"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Download the dataset (`movie.zip`) and gold standard data (`topicsMovie.txt` and `goldMovie.txt`) from the link and plug in the locations below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "base_dir = os.path.join(os.path.expanduser('~'), \"workshop/nlp/data/\")\n",
    "data_dir = os.path.join(base_dir, 'wiki-movie-subset')\n",
    "if not os.path.exists(data_dir):\n",
    "    raise ValueError(\"SKIP: Please download the movie corpus.\")\n",
    "\n",
    "ref_dir = os.path.join(base_dir, 'reference')\n",
    "topics_path = os.path.join(ref_dir, 'topicsMovie.txt')\n",
    "human_scores_path = os.path.join(ref_dir, 'goldMovie.txt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "PROGRESS: 10000/125384, preprocessed 9916, discarded 84\n",
      "PROGRESS: 20000/125384, preprocessed 19734, discarded 266\n",
      "PROGRESS: 30000/125384, preprocessed 29648, discarded 352\n",
      "PROGRESS: 50000/125384, preprocessed 37074, discarded 12926\n",
      "PROGRESS: 60000/125384, preprocessed 47003, discarded 12997\n",
      "PROGRESS: 70000/125384, preprocessed 56961, discarded 13039\n",
      "PROGRESS: 80000/125384, preprocessed 66891, discarded 13109\n",
      "PROGRESS: 90000/125384, preprocessed 76784, discarded 13216\n",
      "PROGRESS: 100000/125384, preprocessed 86692, discarded 13308\n",
      "PROGRESS: 110000/125384, preprocessed 96593, discarded 13407\n",
      "PROGRESS: 120000/125384, preprocessed 106522, discarded 13478\n",
      "CPU times: user 19.8 s, sys: 9.55 s, total: 29.4 s\n",
      "Wall time: 44.9 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "texts = []\n",
    "file_num = 0\n",
    "preprocessed = 0\n",
    "listing = os.listdir(data_dir)\n",
    "\n",
    "for fname in listing:\n",
    "    file_num += 1\n",
    "    if 'disambiguation' in fname:\n",
    "        continue  # discard disambiguation and redirect pages\n",
    "    elif fname.startswith('File_'):\n",
    "        continue  # discard images, gifs, etc.\n",
    "    elif fname.startswith('Category_'):\n",
    "        continue  # discard category articles\n",
    "        \n",
    "    # Not sure how to identify portal and redirect pages,\n",
    "    # as well as pages about a single year.\n",
    "    # As a result, this preprocessing differs from the paper.\n",
    "    \n",
    "    with smart_open(os.path.join(data_dir, fname), 'rb') as f:\n",
    "        for line in f:\n",
    "            # lower case all words\n",
    "            lowered = line.lower()\n",
    "            #remove punctuation and split into seperate words\n",
    "            words = re.findall(r'\\w+', lowered, flags = re.UNICODE | re.LOCALE)\n",
    "            texts.append(words)\n",
    "            \n",
    "    preprocessed += 1\n",
    "    if file_num % 10000 == 0:\n",
    "        print('PROGRESS: %d/%d, preprocessed %d, discarded %d' % (\n",
    "            file_num, len(listing), preprocessed, (file_num - preprocessed)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 1min 26s, sys: 1.1 s, total: 1min 27s\n",
      "Wall time: 1min 27s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "dictionary = Dictionary(texts)\n",
    "corpus = [dictionary.doc2bow(text) for text in texts]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cross validate the numbers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "According to the paper the number of documents should be 108,952 with a vocabulary of 1,625,124. The difference is because of a difference in preprocessing. However the results obtained are still very similar."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "111637\n",
      "Dictionary(756837 unique tokens: [u'verplank', u'mdbg', u'shatzky', u'duelcity', u'dulcitone']...)\n"
     ]
    }
   ],
   "source": [
    "print(len(corpus))\n",
    "print(dictionary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "100"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "topics = []  # list of 100 topics\n",
    "with smart_open(topics_path, 'rb') as f:\n",
    "    topics = [line.split() for line in f if line]\n",
    "len(topics)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "100"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "human_scores = []\n",
    "with smart_open(human_scores_path, 'rb') as f:\n",
    "    for line in f:\n",
    "        human_scores.append(float(line.strip()))\n",
    "len(human_scores)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Deal with any vocabulary mismatch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Topics with out-of-vocab terms: 72\n"
     ]
    }
   ],
   "source": [
    "# We first need to filter out any topics that contain terms not in our dictionary\n",
    "# These may occur as a result of preprocessing steps differing from those used to\n",
    "# produce the reference topics. In this case, this only occurs in one topic.\n",
    "invalid_topic_indices = set(\n",
    "    i for i, topic in enumerate(topics)\n",
    "    if any(t not in dictionary.token2id for t in topic)\n",
    ")\n",
    "print(\"Topics with out-of-vocab terms: %s\" % ', '.join(map(str, invalid_topic_indices)))\n",
    "usable_topics = [topic for i, topic in enumerate(topics) if i not in invalid_topic_indices]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Start off with u_mass coherence measure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Calculated u_mass coherence for 99 topics\n",
      "CPU times: user 7.22 s, sys: 141 ms, total: 7.36 s\n",
      "Wall time: 7.38 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "cm = CoherenceModel(topics=usable_topics, corpus=corpus, dictionary=dictionary, coherence='u_mass')\n",
    "u_mass = cm.get_coherence_per_topic()\n",
    "print(\"Calculated u_mass coherence for %d topics\" % len(u_mass))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Start c_v coherence measure\n",
    "This is expected to take much more time since `c_v` uses a sliding window to perform probability estimation and uses the cosine similarity indirect confirmation measure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Calculated c_v coherence for 99 topics\n",
      "CPU times: user 38.5 s, sys: 5.52 s, total: 44 s\n",
      "Wall time: 13min 8s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "cm = CoherenceModel(topics=usable_topics, texts=texts, dictionary=dictionary, coherence='c_v')\n",
    "c_v = cm.get_coherence_per_topic()\n",
    "print(\"Calculated c_v coherence for %d topics\" % len(c_v))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Start c_uci and c_npmi coherence measures\n",
    "c_v and c_uci and c_npmi all use the boolean sliding window approach of estimating probabilities. Since the `CoherenceModel` caches the accumulated statistics, calculation of c_uci and c_npmi are practically free after calculating c_v coherence. These two methods are simpler and were shown to correlate less with human judgements than c_v but more so than u_mass."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Calculated c_uci coherence for 99 topics\n",
      "CPU times: user 95 ms, sys: 8.87 ms, total: 104 ms\n",
      "Wall time: 97.2 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "cm.coherence = 'c_uci'\n",
    "c_uci = cm.get_coherence_per_topic()\n",
    "print(\"Calculated c_uci coherence for %d topics\" % len(c_uci))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Calculated c_npmi coherence for 99 topics\n",
      "CPU times: user 192 ms, sys: 6.38 ms, total: 198 ms\n",
      "Wall time: 194 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "cm.coherence = 'c_npmi'\n",
    "c_npmi = cm.get_coherence_per_topic()\n",
    "print(\"Calculated c_npmi coherence for %d topics\" % len(c_npmi))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "99"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "final_scores = [\n",
    "    score for i, score in enumerate(human_scores)\n",
    "    if i not in invalid_topic_indices\n",
    "]\n",
    "len(final_scores)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The [values in the paper](http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf) were:\n",
    "\n",
    "__`u_mass` correlation__ : 0.093\n",
    "\n",
    "__`c_v` correlation__    : 0.548\n",
    "\n",
    "__`c_uci` correlation__  : 0.473\n",
    "\n",
    "__`c_npmi` correlation__ : 0.438\n",
    "\n",
    "Our values are also very similar to these values which is good. This validates the correctness of our pipeline, as we can reasonably attribute the differences to differences in preprocessing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.158529392277\n",
      "0.530450687702\n",
      "0.406162050908\n",
      "0.46002144316\n"
     ]
    }
   ],
   "source": [
    "for our_scores in (u_mass, c_v, c_uci, c_npmi):\n",
    "    print(pearsonr(our_scores, final_scores)[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Where do we go now?\n",
    "\n",
    "- The time required for completing all of these operations can be improved a lot by cythonising them.\n",
    "- Preprocessing can be improved for this notebook by following the exact process mentioned in the reference paper. Specifically: _All corpora as well as the complete Wikipedia used as reference corpus are preprocessed using lemmatization and stop word removal. Additionally, we removed portal and category articles, redirection and disambiguation pages as well as articles about single years._ *Note*: we tried lemmatizing and found that significantly more of the reference topics had out-of-vocabulary terms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}