{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "The Ten Most Similar PLoS Biology Articles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... at least by some measure.\n", "\n", "I recently downloaded 1754 PLoS Biology articles as XML files through the PLoS API\n", "and have looked at the distribution of the [time to publication](http://georg.io/2013/10/PLoS_Time_to_Publication/)\n", "of PLoS Biology and other PLoS journals." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here I will play a little with [scikit-learn](http://scikit-learn.org/stable/) to see if I can discover those\n", "PLoS Biology articles (in my data set) that are most similar to one another." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Import Packages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I started writing a [Python package (PLoSPy)](https://github.com/waltherg/PLoSPy) for more convient parsing\n", "of the XML files I have download from PLoS." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import plospy\n", "import os\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.metrics.pairwise import linear_kernel\n", "import itertools" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Discover Data Files on Hard Disk" ] }, { "cell_type": "code", "collapsed": false, "input": [ "all_names = [name for name in os.listdir('../plos/plos_biology/plos_biology_data') if '.dat' in name]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "all_names[0:10]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print len(all_names)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Vectorize all Articles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To reduce memory use, I wrote the following method that returns an iterator over all article bodies.\n", "In passing this iterator to the vectorizer, we avoid loading all articles into memory at once - despite\n", "the use of an iterator here, I have not been able to repeat this experiment with all 65,000-odd PLoS ONE\n", "articles without running out of memory." ] }, { "cell_type": "code", "collapsed": false, "input": [ "ids = []\n", "titles = []\n", "\n", "def get_corpus(all_names):\n", " for name_i, name in enumerate(all_names):\n", " docs = plospy.PlosXml('../plos/plos_biology/plos_biology_data/'+name)\n", " for article in docs.docs:\n", " ids.append(article['id'])\n", " titles.append(article['title'])\n", " yield article['body']" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "corpus = get_corpus(all_names)\n", "tfidf = TfidfVectorizer().fit_transform(corpus)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just as a sanity check, the number of DOIs in our data set should now equal 1754 as this is the number\n", "of articles [I downloaded in the first place](http://georg.io/2013/10/PLoS_Time_to_Publication)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "len(ids)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The vectorizer generated a matrix with 139,748 columns (these are the tokens, i.e. probably unique words used in\n", "all 1754 PLoS Biology articles) and 1754 rows (corresponding to individual articles)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "tfidf.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us now compute all pairwise cosine distances betweeen all 1754 vectors (articles) in matrix `tfidf`.\n", "I copied and pasted most of this from a StackOverflow answer that I cannot find now - I will\n", "add a link to the answer when I come across it again.\n", "\n", "To get the ten most similar articles, we track the top five pairwise matches." ] }, { "cell_type": "code", "collapsed": false, "input": [ "top_five = [[-1,-1,-1] for i in range(5)]\n", "threshold = -1.\n", "\n", "for index in range(len(ids)):\n", " cosine_similarities = linear_kernel(tfidf[index:index+1], tfidf).flatten()\n", " related_docs_indices = cosine_similarities.argsort()[:-5:-1]\n", " \n", " first = related_docs_indices[0]\n", " second = related_docs_indices[1]\n", " \n", " if first != index:\n", " print 'Error'\n", " break\n", "\n", " if cosine_similarities[second] > threshold:\n", " if first not in [top[0] for top in top_five] and first not in [top[1] for top in top_five]:\n", " scores = [top[2] for top in top_five]\n", " replace = scores.index(min(scores))\n", " # print 'replace',replace\n", " top_five[replace] = [first, second, cosine_similarities[second]]\n", " # print 'old threshold',threshold\n", " threshold = min(scores)\n", " # print 'new threshold',threshold" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "The Most Similar Articles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us now take a look at the results!" ] }, { "cell_type": "code", "collapsed": false, "input": [ "for tf in top_five:\n", " print ''\n", " print('Cosine Similarity: %.2f' % tf[2])\n", " print('Title 1: %s' %titles[tf[0]])\n", " print('http://www.plosbiology.org/article/info%3Adoi%2F'+str(ids[tf[0]]))\n", " print ''\n", " print('Title 2: %s' %titles[tf[1]])\n", " print('http://www.plosbiology.org/article/info%3Adoi%2F'+str(ids[tf[1]]))\n", " print ''" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }