{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Finding similar documents with Word2Vec and Soft Cosine Measure \n",
    "\n",
    "Soft Cosine Measure (SCM) [1, 4] is a promising new tool in machine learning that allows us to submit a query and return the most relevant documents. In **part 1**, we will show how you can compute SCM between two documents using the `inner_product` method. In **part 2**, we will use `SoftCosineSimilarity` to retrieve documents most similar to a query and compare the performance against other similarity measures.\n",
    "\n",
    "First, however, we go through the basics of what Soft Cosine Measure is.\n",
    "\n",
    "## Soft Cosine Measure basics\n",
    "\n",
    "Soft Cosine Measure (SCM) is a method that allows us to assess the similarity between two documents in a meaningful way, even when they have no words in common. It uses a measure of similarity between words, which can be derived [2] using [word2vec][] [3] vector embeddings of words. It has been shown to outperform many of the state-of-the-art methods in the semantic text similarity task in the context of community question answering [2].\n",
    "\n",
    "[word2vec]: https://radimrehurek.com/gensim/models/word2vec.html\n",
    "\n",
    "SCM is illustrated below for two very similar sentences. The sentences have no words in common, but by modeling synonymy, SCM is able to accurately measure the similarity between the two sentences. The method also uses the bag-of-words vector representation of the documents (simply put, the word's frequencies in the documents). The intution behind the method is that we compute standard cosine similarity assuming that the document vectors are expressed in a non-orthogonal basis, where the angle between two basis vectors is derived from the angle between the word2vec embeddings of the corresponding words.\n",
    "\n",
    "![Soft Cosine Measure](soft_cosine_tutorial.png)\n",
    "\n",
    "This method was perhaps first introduced in the article “Soft Measure and Soft Cosine Measure: Measure of Features in Vector Space Model” by Grigori Sidorov, Alexander Gelbukh, Helena Gomez-Adorno, and David Pinto ([link to PDF](http://www.scielo.org.mx/pdf/cys/v18n3/v18n3a7.pdf)).\n",
    "\n",
    "In this tutorial, we will learn how to use Gensim's SCM functionality, which consists of the `inner_product` method for one-off computation, and the `SoftCosineSimilarity` class for corpus-based similarity queries.\n",
    "\n",
    "> **Note**:\n",
    ">\n",
    "> If you use this software, please consider citing [1] and [2].\n",
    ">\n",
    "\n",
    "## Running this notebook\n",
    "You can download this [Jupyter notebook](http://jupyter.org/), and run it on your own computer, provided you have installed the `gensim`, `jupyter`, `sklearn`, `pyemd`, and `wmd` Python packages.\n",
    "\n",
    "The notebook was run on an Ubuntu machine with an Intel core i7-6700HQ CPU 3.10GHz (4 cores) and 16 GB memory. Assuming all resources required by the notebook have already been downloaded, running the entire notebook on this machine takes about 30 minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize logging.\n",
    "import logging\n",
    "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 1: Computing the Soft Cosine Measure\n",
    "\n",
    "To use SCM, we need some word embeddings first of all. You could train a [word2vec][] (see tutorial [here](http://rare-technologies.com/word2vec-tutorial/)) model on some corpus, but we will use pre-trained word2vec embeddings.\n",
    "\n",
    "[word2vec]: https://radimrehurek.com/gensim/models/word2vec.html\n",
    "\n",
    "Let's create some sentences to compare."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()\n",
    "sentence_president = 'The president greets the press in Chicago'.lower().split()\n",
    "sentence_orange = 'Having a tough time finding an orange juice press machine?'.lower().split()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first two sentences have very similar content, and as such the SCM should be large. Before we compute the SCM, we want to remove stopwords (\"the\", \"to\", etc.), as these do not contribute a lot to the information in the sentences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2018-09-11 22:02:01,041 : INFO : 'pattern' package not found; tag filters are not available for English\n",
      "2018-09-11 22:02:01,044 : INFO : adding document #0 to Dictionary(0 unique tokens: [])\n",
      "2018-09-11 22:02:01,045 : INFO : built Dictionary(14 unique tokens: ['speaks', 'illinois', 'greets', 'juice', 'chicago']...) from 3 documents (total 15 corpus positions)\n"
     ]
    }
   ],
   "source": [
    "# Import and download stopwords from NLTK.\n",
    "from nltk.corpus import stopwords\n",
    "from nltk import download\n",
    "download('stopwords')  # Download stopwords list.\n",
    "\n",
    "# Remove stopwords.\n",
    "stop_words = stopwords.words('english')\n",
    "sentence_obama = [w for w in sentence_obama if w not in stop_words]\n",
    "sentence_president = [w for w in sentence_president if w not in stop_words]\n",
    "sentence_orange = [w for w in sentence_orange if w not in stop_words]\n",
    "\n",
    "# Prepare a dictionary and a corpus.\n",
    "from gensim import corpora\n",
    "documents = [sentence_obama, sentence_president, sentence_orange]\n",
    "dictionary = corpora.Dictionary(documents)\n",
    "\n",
    "# Convert the sentences into bag-of-words vectors.\n",
    "sentence_obama = dictionary.doc2bow(sentence_obama)\n",
    "sentence_president = dictionary.doc2bow(sentence_president)\n",
    "sentence_orange = dictionary.doc2bow(sentence_orange)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, as we mentioned earlier, we will be using some downloaded pre-trained embeddings. Note that the embeddings we have chosen here require a lot of memory. We will use the embeddings to construct a term similarity matrix that will be used by the `inner_product` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2018-09-11 22:02:01,236 : INFO : loading projection weights from /home/novotny/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz\n",
      "2018-09-11 22:02:26,984 : INFO : loaded (400000, 50) matrix from /home/novotny/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz\n",
      "2018-09-11 22:02:26,985 : INFO : constructing a sparse term similarity matrix using <gensim.models.keyedvectors.WordEmbeddingSimilarityIndex object at 0x7f8d6e8615c0>\n",
      "2018-09-11 22:02:26,986 : INFO : iterating over columns in dictionary order\n",
      "2018-09-11 22:02:27,273 : INFO : constructed a sparse term similarity matrix with 11.224490% density\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 27.8 s, sys: 2.43 s, total: 30.3 s\n",
      "Wall time: 26.2 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "import gensim.downloader as api\n",
    "from gensim.models import WordEmbeddingSimilarityIndex\n",
    "from gensim.similarities import SparseTermSimilarityMatrix\n",
    "\n",
    "w2v_model = api.load(\"glove-wiki-gigaword-50\")\n",
    "similarity_index = WordEmbeddingSimilarityIndex(w2v_model)\n",
    "similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's compute SCM using the `inner_product` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "similarity = 0.3790\n"
     ]
    }
   ],
   "source": [
    "similarity = similarity_matrix.inner_product(sentence_obama, sentence_president, normalized=True)\n",
    "print('similarity = %.4f' % similarity)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's try the same thing with two completely unrelated sentences. Notice that the similarity is smaller."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "similarity = 0.1108\n"
     ]
    }
   ],
   "source": [
    "similarity = similarity_matrix.inner_product(sentence_obama, sentence_orange, normalized=True)\n",
    "print('similarity = %.4f' % similarity)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 2: Similarity queries using `SoftCosineSimilarity`\n",
    "You can use SCM to get the most similar documents to a query, using the `SoftCosineSimilarity` class. Its interface is similar to what is described in the [Similarity Queries](https://radimrehurek.com/gensim/tut3.html) Gensim tutorial.\n",
    "\n",
    "### Qatar Living unannotated dataset\n",
    "Contestants solving the community question answering task in the [SemEval 2016][semeval16] and [2017][semeval17] competitions had an unannotated dataset of 189,941 questions and 1,894,456 comments from the [Qatar Living][ql] discussion forums. As our first step, we will use the same dataset to build a corpus.\n",
    "\n",
    "[semeval16]: http://alt.qcri.org/semeval2016/task3/\n",
    "[semeval17]: http://alt.qcri.org/semeval2017/task3/\n",
    "[ql]: http://www.qatarliving.com/forum"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package stopwords to\n",
      "[nltk_data]     /home/novotny/nltk_data...\n",
      "[nltk_data]   Package stopwords is already up-to-date!\n",
      "Number of documents: 3\n",
      "CPU times: user 2min 37s, sys: 1.62 s, total: 2min 39s\n",
      "Wall time: 2min 39s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "from itertools import chain\n",
    "import json\n",
    "from re import sub\n",
    "from os.path import isfile\n",
    "\n",
    "import gensim.downloader as api\n",
    "from gensim.utils import simple_preprocess\n",
    "from nltk.corpus import stopwords\n",
    "from nltk import download\n",
    "\n",
    "\n",
    "download(\"stopwords\")  # Download stopwords list.\n",
    "stopwords = set(stopwords.words(\"english\"))\n",
    "\n",
    "def preprocess(doc):\n",
    "    doc = sub(r'<img[^<>]+(>|$)', \" image_token \", doc)\n",
    "    doc = sub(r'<[^<>]+(>|$)', \" \", doc)\n",
    "    doc = sub(r'\\[img_assist[^]]*?\\]', \" \", doc)\n",
    "    doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', \" url_token \", doc)\n",
    "    return [token for token in simple_preprocess(doc, min_len=0, max_len=float(\"inf\")) if token not in stopwords]\n",
    "\n",
    "corpus = list(chain(*[\n",
    "    chain(\n",
    "        [preprocess(thread[\"RelQuestion\"][\"RelQSubject\"]), preprocess(thread[\"RelQuestion\"][\"RelQBody\"])],\n",
    "        [preprocess(relcomment[\"RelCText\"]) for relcomment in thread[\"RelComments\"]])\n",
    "    for thread in api.load(\"semeval-2016-2017-task3-subtaskA-unannotated\")]))\n",
    "\n",
    "print(\"Number of documents: %d\" % len(documents))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the corpus we have just build, we will now construct a [dictionary][], a [TF-IDF model][tfidf], a [word2vec model][word2vec], and a term similarity matrix.\n",
    "\n",
    "[dictionary]: https://radimrehurek.com/gensim/corpora/dictionary.html\n",
    "[tfidf]: https://radimrehurek.com/gensim/models/tfidfmodel.html\n",
    "[word2vec]: https://radimrehurek.com/gensim/models/word2vec.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2018-09-11 22:06:07,973 : INFO : built Dictionary(462807 unique tokens: ['pples', 'adib', 'strangers', 'kolayaalee', 'softpoint']...) from 2274338 documents (total 40096354 corpus positions)\n",
      "2018-09-11 22:06:09,432 : INFO : collecting all words and their counts\n",
      "2018-09-11 22:06:17,564 : INFO : collected 462807 word types from a corpus of 40096354 raw words and 2274338 sentences\n",
      "2018-09-11 22:06:17,565 : INFO : Loading a fresh vocabulary\n",
      "2018-09-11 22:06:18,002 : INFO : effective_min_count=5 retains 104360 unique words (22% of original 462807, drops 358447)\n",
      "2018-09-11 22:06:18,003 : INFO : effective_min_count=5 leaves 39565168 word corpus (98% of original 40096354, drops 531186)\n",
      "2018-09-11 22:06:18,454 : INFO : deleting the raw counts dictionary of 462807 items\n",
      "2018-09-11 22:06:18,474 : INFO : sample=0.001 downsamples 22 most-common words\n",
      "2018-09-11 22:06:18,475 : INFO : downsampling leaves estimated 38552993 word corpus (97.4% of prior 39565168)\n",
      "2018-09-11 22:06:18,907 : INFO : estimated required memory for 104360 words and 300 dimensions: 302644000 bytes\n",
      "2018-09-11 22:06:18,908 : INFO : resetting layer weights\n",
      "2018-09-11 22:06:21,082 : INFO : training model with 32 workers on 104360 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=5\n",
      "2018-09-11 22:06:53,894 : INFO : EPOCH - 1 : training on 40096354 raw words (38515351 effective words) took 32.8s, 1174692 effective words/s\n",
      "2018-09-11 22:07:27,121 : INFO : EPOCH - 2 : training on 40096354 raw words (38515107 effective words) took 33.2s, 1159858 effective words/s\n",
      "2018-09-11 22:08:00,122 : INFO : EPOCH - 3 : training on 40096354 raw words (38514587 effective words) took 33.0s, 1167509 effective words/s\n",
      "2018-09-11 22:08:32,976 : INFO : EPOCH - 4 : training on 40096354 raw words (38515500 effective words) took 32.8s, 1172993 effective words/s\n",
      "2018-09-11 22:09:06,211 : INFO : EPOCH - 5 : training on 40096354 raw words (38515593 effective words) took 33.2s, 1159566 effective words/s\n",
      "2018-09-11 22:09:06,212 : INFO : training on a 200481770 raw words (192576138 effective words) took 165.1s, 1166216 effective words/s\n",
      "2018-09-11 22:09:06,637 : INFO : constructing a sparse term similarity matrix using <gensim.models.keyedvectors.WordEmbeddingSimilarityIndex object at 0x7f8cde1dc9b0>\n",
      "2018-09-11 22:09:06,657 : INFO : iterating over columns in tf-idf order\n",
      "2018-09-11 22:25:34,416 : INFO : constructed a sparse term similarity matrix with 0.003654% density\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 4h 38min 32s, sys: 4h 24min 33s, total: 9h 3min 5s\n",
      "Wall time: 20min 43s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "from multiprocessing import cpu_count\n",
    "\n",
    "from gensim.corpora import Dictionary\n",
    "from gensim.models import TfidfModel\n",
    "from gensim.models import Word2Vec\n",
    "from gensim.models import WordEmbeddingSimilarityIndex\n",
    "from gensim.similarities import SparseTermSimilarityMatrix\n",
    "\n",
    "dictionary = Dictionary(corpus)\n",
    "tfidf = TfidfModel(dictionary=dictionary)\n",
    "w2v_model = Word2Vec(corpus, workers=cpu_count(), min_count=5, size=300, seed=12345)\n",
    "similarity_index = WordEmbeddingSimilarityIndex(w2v_model.wv)\n",
    "similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf, nonzero_limit=100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluation\n",
    "Next, we will load the validation and test datasets that were used by the SemEval 2016 and 2017 contestants. The datasets contain 208 original questions posted by the forum members. For each question, there is a list of 10 threads with a human annotation denoting whether or not the thread is relevant to the original question. Our task will be to order the threads so that relevant threads rank above irrelevant threads."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "datasets = api.load(\"semeval-2016-2017-task3-subtaskBC\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we will perform an evaluation to compare three unsupervised similarity measures – the Soft Cosine Measure, two different implementations of the [Word Mover's Distance][wmd], and standard cosine similarity. We will use the [Mean Average Precision (MAP)][map] as an evaluation measure and 10-fold cross-validation to get an estimate of the variance of MAP for each similarity measure.\n",
    "\n",
    "[wmd]: http://vene.ro/blog/word-movers-distance-in-python.html\n",
    "[map]: https://medium.com/@pds.bangalore/mean-average-precision-abd77d0b9a7e"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "from math import isnan\n",
    "from time import time\n",
    "\n",
    "from gensim.similarities import MatrixSimilarity, WmdSimilarity, SoftCosineSimilarity\n",
    "import numpy as np\n",
    "from sklearn.model_selection import KFold\n",
    "from wmd import WMD\n",
    "\n",
    "def produce_test_data(dataset):\n",
    "    for orgquestion in datasets[dataset]:\n",
    "        query = preprocess(orgquestion[\"OrgQSubject\"]) + preprocess(orgquestion[\"OrgQBody\"])\n",
    "        documents = [\n",
    "            preprocess(thread[\"RelQuestion\"][\"RelQSubject\"]) + preprocess(thread[\"RelQuestion\"][\"RelQBody\"])\n",
    "            for thread in orgquestion[\"Threads\"]]\n",
    "        relevance = [\n",
    "            thread[\"RelQuestion\"][\"RELQ_RELEVANCE2ORGQ\"] in (\"PerfectMatch\", \"Relevant\")\n",
    "            for thread in orgquestion[\"Threads\"]]\n",
    "        yield query, documents, relevance\n",
    "\n",
    "def cossim(query, documents):\n",
    "    # Compute cosine similarity between the query and the documents.\n",
    "    query = tfidf[dictionary.doc2bow(query)]\n",
    "    index = MatrixSimilarity(\n",
    "        tfidf[[dictionary.doc2bow(document) for document in documents]],\n",
    "        num_features=len(dictionary))\n",
    "    similarities = index[query]\n",
    "    return similarities\n",
    "\n",
    "def softcossim(query, documents):\n",
    "    # Compute Soft Cosine Measure between the query and the documents.\n",
    "    query = tfidf[dictionary.doc2bow(query)]\n",
    "    index = SoftCosineSimilarity(\n",
    "        tfidf[[dictionary.doc2bow(document) for document in documents]],\n",
    "        similarity_matrix)\n",
    "    similarities = index[query]\n",
    "    return similarities\n",
    "\n",
    "def wmd_gensim(query, documents):\n",
    "    # Compute Word Mover's Distance as implemented in PyEMD by William Mayner\n",
    "    # between the query and the documents.\n",
    "    index = WmdSimilarity(documents, w2v_model)\n",
    "    similarities = index[query]\n",
    "    return similarities\n",
    "\n",
    "def wmd_relax(query, documents):\n",
    "    # Compute Word Mover's Distance as implemented in WMD by Source{d}\n",
    "    # between the query and the documents.\n",
    "    words = [word for word in set(chain(query, *documents)) if word in w2v_model.wv]\n",
    "    indices, words = zip(*sorted((\n",
    "        (index, word) for (index, _), word in zip(dictionary.doc2bow(words), words))))\n",
    "    query = dict(tfidf[dictionary.doc2bow(query)])\n",
    "    query = [\n",
    "        (new_index, query[dict_index])\n",
    "        for new_index, dict_index in enumerate(indices)\n",
    "        if dict_index in query]\n",
    "    documents = [dict(tfidf[dictionary.doc2bow(document)]) for document in documents]\n",
    "    documents = [[\n",
    "        (new_index, document[dict_index])\n",
    "        for new_index, dict_index in enumerate(indices)\n",
    "        if dict_index in document] for document in documents]\n",
    "    embeddings = np.array([w2v_model.wv[word] for word in words], dtype=np.float32)\n",
    "    nbow = dict(((index, list(chain([None], zip(*document)))) for index, document in enumerate(documents)))\n",
    "    nbow[\"query\"] = tuple([None] + list(zip(*query)))\n",
    "    distances = WMD(embeddings, nbow, vocabulary_min=1).nearest_neighbors(\"query\")\n",
    "    similarities = [-distance for _, distance in sorted(distances)]\n",
    "    return similarities\n",
    "\n",
    "strategies = {\n",
    "    \"cossim\" : cossim,\n",
    "    \"softcossim\": softcossim,\n",
    "    \"wmd-gensim\": wmd_gensim,\n",
    "    \"wmd-relax\": wmd_relax}\n",
    "\n",
    "def evaluate(split, strategy):\n",
    "    # Perform a single round of evaluation.\n",
    "    results = []\n",
    "    start_time = time()\n",
    "    for query, documents, relevance in split:\n",
    "        similarities = strategies[strategy](query, documents)\n",
    "        assert len(similarities) == len(documents)\n",
    "        precision = [\n",
    "            (num_correct + 1) / (num_total + 1) for num_correct, num_total in enumerate(\n",
    "                num_total for num_total, (_, relevant) in enumerate(\n",
    "                    sorted(zip(similarities, relevance), reverse=True)) if relevant)]\n",
    "        average_precision = np.mean(precision) if precision else 0.0\n",
    "        results.append(average_precision)\n",
    "    return (np.mean(results) * 100, time() - start_time)\n",
    "\n",
    "def crossvalidate(args):\n",
    "    # Perform a cross-validation.\n",
    "    dataset, strategy = args\n",
    "    test_data = np.array(list(produce_test_data(dataset)))\n",
    "    kf = KFold(n_splits=10)\n",
    "    samples = []\n",
    "    for _, test_index in kf.split(test_data):\n",
    "        samples.append(evaluate(test_data[test_index], strategy))\n",
    "    return (np.mean(samples, axis=0), np.std(samples, axis=0))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 2.14 s, sys: 5.08 s, total: 7.22 s\n",
      "Wall time: 2min 51s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "from multiprocessing import Pool\n",
    "\n",
    "args_list = [\n",
    "    (dataset, technique)\n",
    "    for dataset in (\"2016-test\", \"2017-test\")\n",
    "    for technique in (\"softcossim\", \"wmd-gensim\", \"wmd-relax\", \"cossim\")]\n",
    "with Pool() as pool:\n",
    "    results = pool.map(crossvalidate, args_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The table below shows the pointwise estimates of means and standard variances for MAP scores and elapsed times. Baselines and winners for each year are displayed in bold. We can see that the Soft Cosine Measure gives a strong performance on both the 2016 and the 2017 dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "\n",
       "\n",
       "Dataset | Strategy | MAP score | Elapsed time (sec)\n",
       ":---|:---|:---|---:\n",
       "2016-test|softcossim|77.15 ±10.83|4.48 ±0.56\n",
       "2016-test|**Winner (UH-PRHLT-primary)**|76.70 ±0.00|\n",
       "2016-test|cossim|76.45 ±10.40|0.25 ±0.04\n",
       "2016-test|wmd-gensim|76.15 ±11.51|13.79 ±1.39\n",
       "2016-test|**Baseline 1 (IR)**|74.75 ±0.00|\n",
       "2016-test|wmd-relax|72.03 ±11.33|0.34 ±0.07\n",
       "2016-test|**Baseline 2 (random)**|46.98 ±0.00|\n",
       "\n",
       "\n",
       "Dataset | Strategy | MAP score | Elapsed time (sec)\n",
       ":---|:---|:---|---:\n",
       "2017-test|**Winner (SimBow-primary)**|47.22 ±0.00|\n",
       "2017-test|wmd-relax|45.04 ±15.44|0.39 ±0.07\n",
       "2017-test|cossim|44.38 ±14.71|0.29 ±0.05\n",
       "2017-test|softcossim|44.25 ±15.68|4.89 ±0.80\n",
       "2017-test|wmd-gensim|44.08 ±15.96|16.69 ±1.90\n",
       "2017-test|**Baseline 1 (IR)**|41.85 ±0.00|\n",
       "2017-test|**Baseline 2 (random)**|29.81 ±0.00|"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from IPython.display import display, Markdown\n",
    "\n",
    "output = []\n",
    "baselines = [\n",
    "    ((\"2016-test\", \"**Winner (UH-PRHLT-primary)**\"), ((76.70, 0), (0, 0))),\n",
    "    ((\"2016-test\", \"**Baseline 1 (IR)**\"), ((74.75, 0), (0, 0))),\n",
    "    ((\"2016-test\", \"**Baseline 2 (random)**\"), ((46.98, 0), (0, 0))),\n",
    "    ((\"2017-test\", \"**Winner (SimBow-primary)**\"), ((47.22, 0), (0, 0))),\n",
    "    ((\"2017-test\", \"**Baseline 1 (IR)**\"), ((41.85, 0), (0, 0))),\n",
    "    ((\"2017-test\", \"**Baseline 2 (random)**\"), ((29.81, 0), (0, 0)))]\n",
    "table_header = [\"Dataset | Strategy | MAP score | Elapsed time (sec)\", \":---|:---|:---|---:\"]\n",
    "for row, ((dataset, technique), ((mean_map_score, mean_duration), (std_map_score, std_duration))) \\\n",
    "        in enumerate(sorted(chain(zip(args_list, results), baselines), key=lambda x: (x[0][0], -x[1][0][0]))):\n",
    "    if row % (len(strategies) + 3) == 0:\n",
    "        output.extend(chain([\"\\n\"], table_header))\n",
    "    map_score = \"%.02f ±%.02f\" % (mean_map_score, std_map_score)\n",
    "    duration = \"%.02f ±%.02f\" % (mean_duration, std_duration) if mean_duration else \"\"\n",
    "    output.append(\"%s|%s|%s|%s\" % (dataset, technique, map_score, duration))\n",
    "\n",
    "display(Markdown('\\n'.join(output)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References\n",
    "\n",
    "1. Grigori Sidorov et al. *Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model*, 2014. ([link to PDF](http://www.scielo.org.mx/pdf/cys/v18n3/v18n3a7.pdf))\n",
    "2. Delphine Charlet and Geraldine Damnati, SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering, 2017. ([link to PDF](http://www.aclweb.org/anthology/S17-2051))\n",
    "3. Thomas Mikolov et al. Efficient Estimation of Word Representations in Vector Space, 2013. ([link to PDF](https://arxiv.org/pdf/1301.3781.pdf))\n",
    "4. Vít Novotný. *Implementation Notes for the Soft Cosine Measure*, 2018. ([link to PDF](https://arxiv.org/pdf/1808.09407))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}