{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# \"Document clustering applied to scientific articles in Python\"\n",
    "> \"In this article we look at document clustering implemented in Python with scikit-learn.\"\n",
    "- toc: true\n",
    "- branch: master\n",
    "- badges: true\n",
    "- comments: true\n",
    "- categories: [python, scikit-learn, nlp]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The Ten Most Similar PLoS Biology Articles"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "... at least by some measure.\n",
    "\n",
    "I recently downloaded 1754 PLoS Biology articles as XML files through the PLoS API\n",
    "and have looked at the distribution of the [time to publication](http://georg.io/2013/10/PLoS_Time_to_Publication/)\n",
    "of PLoS Biology and other PLoS journals."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here I will play a little with [scikit-learn](http://scikit-learn.org/stable/) to see if I can discover those\n",
    "PLoS Biology articles (in my data set) that are most similar to one another."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import Packages"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I started writing a [Python package (PLoSPy)](https://github.com/waltherg/PLoSPy) for more convient parsing\n",
    "of the XML files I have download from PLoS."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import plospy\n",
    "import os\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.metrics.pairwise import linear_kernel\n",
    "import itertools"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Discover Data Files on Hard Disk"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_names = [name for name in os.listdir('../plos/plos_biology/plos_biology_data') if '.dat' in name]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_names[0:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print len(all_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Vectorize all Articles"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To reduce memory use, I wrote the following method that returns an iterator over all article bodies.\n",
    "In passing this iterator to the vectorizer, we avoid loading all articles into memory at once - despite\n",
    "the use of an iterator here, I have not been able to repeat this experiment with all 65,000-odd PLoS ONE\n",
    "articles without running out of memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ids = []\n",
    "titles = []\n",
    "\n",
    "def get_corpus(all_names):\n",
    "    for name_i, name in enumerate(all_names):\n",
    "        docs = plospy.PlosXml('../plos/plos_biology/plos_biology_data/'+name)\n",
    "        for article in docs.docs:\n",
    "            ids.append(article['id'])\n",
    "            titles.append(article['title'])\n",
    "            yield article['body']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "corpus = get_corpus(all_names)\n",
    "tfidf = TfidfVectorizer().fit_transform(corpus)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just as a sanity check, the number of DOIs in our data set should now equal 1754 as this is the number\n",
    "of articles [I downloaded in the first place](http://georg.io/2013/10/PLoS_Time_to_Publication)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(ids)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The vectorizer generated a matrix with 139,748 columns (these are the tokens, i.e. probably unique words used in\n",
    "all 1754 PLoS Biology articles) and 1754 rows (corresponding to individual articles)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tfidf.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us now compute all pairwise cosine distances betweeen all 1754 vectors (articles) in matrix `tfidf`.\n",
    "I copied and pasted most of this from a StackOverflow answer that I cannot find now - I will\n",
    "add a link to the answer when I come across it again.\n",
    "\n",
    "To get the ten most similar articles, we track the top five pairwise matches."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "top_five = [[-1,-1,-1] for i in range(5)]\n",
    "threshold = -1.\n",
    "\n",
    "for index in range(len(ids)):\n",
    "    cosine_similarities = linear_kernel(tfidf[index:index+1], tfidf).flatten()\n",
    "    related_docs_indices = cosine_similarities.argsort()[:-5:-1]\n",
    "    \n",
    "    first = related_docs_indices[0]\n",
    "    second = related_docs_indices[1]\n",
    "    \n",
    "    if first != index:\n",
    "        print 'Error'\n",
    "        break\n",
    "\n",
    "    if cosine_similarities[second] > threshold:\n",
    "        if first not in [top[0] for top in top_five] and first not in [top[1] for top in top_five]:\n",
    "            scores = [top[2] for top in top_five]\n",
    "            replace = scores.index(min(scores))\n",
    "            # print 'replace',replace\n",
    "            top_five[replace] = [first, second, cosine_similarities[second]]\n",
    "            # print 'old threshold',threshold\n",
    "            threshold = min(scores)\n",
    "            # print 'new threshold',threshold"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The Most Similar Articles"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us now take a look at the results!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for tf in top_five:\n",
    "    print ''\n",
    "    print('Cosine Similarity: %.2f' % tf[2])\n",
    "    print('Title 1: %s' %titles[tf[0]])\n",
    "    print('http://www.plosbiology.org/article/info%3Adoi%2F'+str(ids[tf[0]]))\n",
    "    print ''\n",
    "    print('Title 2: %s' %titles[tf[1]])\n",
    "    print('http://www.plosbiology.org/article/info%3Adoi%2F'+str(ids[tf[1]]))\n",
    "    print ''"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}