{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# \"Linkage discovery between scientific articles in Python and with graphs\"\n", "> \"In this article we use Python and graphs to discover linkages between scientific papers.\"\n", "- toc: true\n", "- branch: master\n", "- badges: true\n", "- comments: true\n", "- categories: [python, scikit-learn, nlp]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# PLOS Biology-Inspired PLOS Biology Articles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This past week I had my first encounter with the concept of [graph databases](https://en.wikipedia.org/wiki/Graph_database)\n", "which lend themselves perfectly to modeling and capturing linked data.\n", "\n", "I started reading the free and brilliant book [Graph Databases](http://www.graphdatabases.com/) by Robinson, Webber, and Eifrem and\n", "began playing around with [Python bulbs](http://bulbflow.com/) by James Thornton.\n", "\n", "I further took the data set of 1754 PLOS Biology articles that I have examined on this blog multiple times and created a\n", "[Rexster](https://github.com/tinkerpop/rexster/wiki)-based graph database from them.\n", "Apart from the obvious authors, DOIs, and titles I also extracted references to other PLOS Biology articles.\n", "\n", "In this blog post I will examine these links between PLOS Biology articles.\n", "\n", "Let us first take a look at my database to get an idea of what this looks like." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "from matplotlib import pyplot" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from bulbs.rexster import Graph, Config, REXSTER_URI" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "REXSTER_URI = 'http://localhost:8182/graphs/plos'\n", "config = Config(REXSTER_URI)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "g = Graph(config)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The label `g` now holds a reference to our graph database." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python bulbs allows us to define classes for our data model which is something I did when creating this graph database in the first place.\n", "These are the node (vertex) types and edge (relationship) types I defined:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Bulbs Models\n", "from bulbs.model import Node, Relationship\n", "from bulbs.property import String, Integer, DateTime, List\n", "\n", "class Author(Node):\n", " element_type = 'author'\n", " name = String(nullable=False)\n", " \n", "class Article(Node):\n", " element_type = 'article'\n", " title = String(nullable=False)\n", " published = DateTime()\n", " doi = String()\n", " \n", "class Authorship(Relationship):\n", " label = 'authored'\n", "\n", "class Citation(Relationship):\n", " label = 'cites'\n", " reference_count = Integer(nullable=False)\n", " tag = String()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a very basic model of PLOS Biology articles that captures nothing more than authorship (edges between authors and articles) and\n", "citations (edges between articles).\n", "\n", "Some of these concepts can and should probably be decorated further: for instance 
`Authorship` edges could include author contributions (as provided at the bottom of most PLOS Biology articles)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "g.add_proxy('authors', Author)\n", "g.add_proxy('articles', Article)\n", "g.add_proxy('authored', Authorship)\n", "g.add_proxy('cites', Citation)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Usually we would use the built-in Rexster/Bulbs functions that rely on an internal index, but since that index seems to be broken for me right now\n", "I will simply collect all nodes and edges by hand and build Python dictionaries as indices.\n", "\n", "This is fine here because our database is very small, but it would likely be prohibitive for anything even marginally bigger." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "nodes = g.V\n", "edges = g.E" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "authors = {n.name: n for n in nodes if n.element_type == 'author'}" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[u'Shuguang Zhang',\n", " u'Ernst Hafen',\n", " u'Maren Brockmeyer',\n", " u'Bruno Eschli',\n", " u'David B. Gurevich',\n", " u'Michael Lynch',\n", " u'Alejandro Valbuena',\n", " u'Claudia Rutte',\n", " u'Matthew M Wyatt',\n", " u'Brianna B.
Williams']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "authors.keys()[:10]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "articles = {n.doi: n for n in nodes if n.element_type == 'article'}" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[u'10.1371/journal.pbio.0040216',\n", " u'10.1371/journal.pbio.0040215',\n", " u'10.1371/journal.pbio.0040210',\n", " u'10.1371/journal.pbio.0040368',\n", " u'10.1371/journal.pbio.0040369',\n", " u'10.1371/journal.pbio.0040362',\n", " u'10.1371/journal.pbio.0040363',\n", " u'10.1371/journal.pbio.0040360',\n", " u'10.1371/journal.pbio.0020275',\n", " u'10.1371/journal.pbio.0040366']" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "articles.keys()[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us now do a brief sanity check and count the number of PLOS Biology articles in our data set (this should equal 1754)." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1754" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(articles.keys())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us now pick an article at random and see how this article is connected to the remainder of the graph." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "article = articles['10.1371/journal.pbio.1000584']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the title of `article`:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "u'Clusters of Temporal Discordances Reveal Distinct Embryonic Patterning Mechanisms in Drosophila and Anopheles'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "article.title" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are the edges pointing to this article:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[,\n", " ,\n", " ]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(article.inE())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are three `Authorship` edges that point to this specific article.\n", "\n", "To get the node at the *base* of a directed edge we can either walk the incoming edges and ask each one for its *out-node* (`edge.outV()` for every `edge` in `article.inE()`)\n", "or simply ask for the *in-nodes* of the `article` node straight away via `article.inV()` - this should be equivalent!" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Yury Goltsev\n", "Michael Levine\n", "Dmitri Papatsenko\n" ] } ], "source": [ "for author in article.inV():\n", "    print author.name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick check online confirms that these are indeed the authors of `article`."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As I mentioned above, I also collected all references to other PLOS Biology articles in my data set and modeled those as `Citation` relationships (edges) between articles.\n", "\n", "The `article` we are currently looking at has one such out-edge to another PLOS Biology article:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(article.outE())" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Cell Cycle–Regulated Genes of Schizosaccharomyces pombe\n", "[u'Saumyadipta Pyne', u'Janet Leatherwood', u'Anna Oliva', u'Bruce Futcher', u'Adam Rosebrock', u'Steve Skiena', u'Francisco Ferrezuelo', u'Haiying Chen']\n", "10.1371/journal.pbio.0030225\n" ] } ], "source": [ "for citation in article.outV():\n", " print citation.title\n", " print [n.name for n in citation.inV() if n.element_type == 'author']\n", " print citation.doi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see above, querying our database for the authors of the PLOS Biology article that our current article (`article`) cites is simple." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many PLOS Biology articles in our data set of 1754 articles cite other PLOS Biology articles?\n", "\n", "(caveat: this only represents those citations that I detected when parsing my set of articles)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "526" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# count the articles that have at least one outgoing (cited) neighbour\n", "sum(1 for n in nodes if n.element_type == 'article' and n.outV() is not None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I not only extracted citation edges between PLOS Biology articles but also counted how often each citation occurs in the body of the citing article.\n", "\n", "For our `article` and its one cited PLOS Biology article I counted:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2\n" ] } ], "source": [ "for citation in article.outE():\n", "    print citation.reference_count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just to verify this, look up `article` online (DOI = 10.1371/journal.pbio.1000584) and look for reference *[4]*, which corresponds to this one cited PLOS Biology article." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us now take a look at the observed distribution of how often cited PLOS Biology articles are referenced in the main text of the citing PLOS Biology article."
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "citation_counts = []\n", "for doi in articles.keys():\n", " if articles[doi].outE():\n", " for e in articles[doi].outE():\n", " if e.label == 'cites':\n", " citation_counts.append(e.reference_count)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYQAAAEKCAYAAAASByJ7AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHhdJREFUeJzt3X9QlWX+//HnAaH8pmSKYuNhsrH8weEcOZC0Nuge3TUC\nylVzAg3aVWdCt7J0t2l3mhRrNj7b5phNbmPjj3ZjtoXWth8LOriuJzdLIcyMacfZMdk4NMhBVFAk\nRK7vHxxP/gwPHDigr8fMmTnn5lwX73M4c15c133f120xxhhEROS6FxbqAkREpG9QIIiICKBAEBER\nHwWCiIgACgQREfFRIIiICNADgdDS0sKkSZNwOp2MHTuWZcuWAZCXl4fVasXpdOJ0Otm6dau/TX5+\nPnFxcdjtdkpLS4NdkoiIXAVLT5yHcPr0aQYOHEhbWxspKSnk5+eza9cuBg8ezPLlyy94bkVFBYsX\nL2bPnj3U1taSkpLCwYMHiYyMDHZZIiLyA3pkymjgwIEAtLa2cvbsWWJiYgC4XPYUFxeTlZVFeHg4\no0aNwmazUVZW1hNliYjID+iRQGhvbychIYGYmBimTZtGXFwcAOvWrWPChAlkZ2fT0NAAQE1NDVar\n1d/WarXi8Xh6oiwREfkBA3qi07CwMPbv38+JEydITU3F7Xbz2GOPsWLFCqBjf8LSpUspKCi4qv4s\nFktPlCkics0LZK9Ajx5ldPPNN5ORkcGePXuIjo7GYrFgsVjIzc2lvLwc6BgRVFdX+9t4PB5iY2Mv\n6csYo1uQbitXrgx5DdfSTe+n3su+egtU0APh6NGjNDU1AR07l7dv347dbsfr9fqfs2XLFmw2GwDp\n6ekUFhbS1taGx+OhsrKS5OTkYJclIiKdCPqU0bfffssjjzyCMYaWlhbmz59PRkYGOTk5HDhwgNbW\nVm677TY2btwIQFJSErNnz8bhcBAWFsb69euJiIgIdlkiItKJHjnsNNgsFkuXhj9yeW63G5fLFeoy\nrhl6P4NH72VwBfrdqUAQEblGBfrdqaUrREQEUCCIiIiPAkFERAAFgoiI+Fw3gRAVNdR/YlxXb1FR\nQ0P9MkREesx1c5RRx/IX3X2pOtpJRPoPHWUkIiJdokAQERFAgSAiIj4KBBERARQIIiLio0AQERFA\ngSAiIj4KBBERARQIIiLio0AQERFAgSAiIj4KBBERARQIIiLio0AQERFAgSAiIj4KBBERAXogEFpa\nWpg0aRJOp5OxY8eybNkyABoaGpgxYwYOh4PU1FSOHz/ub5Ofn09cXBx2u53S0tJglyQiIlehR66Y\ndvr0aQYOHEhbWxspKSnk5+fz7rvvMmbMGJ566ileeeUVDh8+zNq1a6moqGDx4sXs2bOH2tpaUlJS\nOHjwIJGRkd8XqSumiYgErE9cMW3gwIEAtLa2cvbsWUaMGEFJSQk5OTkAZGdnU1xcDEBxcTFZWVmE\nh4czatQ
obDYbZWVlPVGWiIj8gAE90Wl7ezuJiYkcOnSIJUuWYLPZ8Hq9DBs2DIDo6Gjq6uoAqKmp\nYfr06f62VqsVj8dzSZ95eXn++y6XC5fL1ROli4j0W263G7fb3eX2PRIIYWFh7N+/nxMnTpCamsrO\nnTu73ef5gSAiIpe6+J/lVatWBdS+R48yuvnmm8nIyGDv3r0MHz6c+vp6ALxeLyNGjAA6RgTV1dX+\nNh6Ph9jY2J4sS0RELiPogXD06FGampqAjp3L27dvx263k56eTkFBAQAFBQWkp6cDkJ6eTmFhIW1t\nbXg8HiorK0lOTg52WSIi0omgTxl9++23PPLIIxhjaGlpYf78+WRkZDB58mQyMzPZtGkTI0eOpKio\nCICkpCRmz56Nw+EgLCyM9evXExEREeyyRESkEz1y2Gmw6bBTEZHA9YnDTkVEpP9RIIiICKBAEBER\nHwWCiIgACgQREfFRIIiICKBAEBERHwWCiIgACgQREfFRIIiICKBAEBERHwWCiIgACgQREfFRIIiI\nCKBAEBERHwWCiIgACgQREfFRIIiICKBAEBERHwWCiIgACgQREfFRIIiICKBAEBERn6AHQnV1NVOn\nTsVutzNu3DheeuklAPLy8rBarTidTpxOJ1u3bvW3yc/PJy4uDrvdTmlpabBLEhGRq2Axxphgdnjk\nyBG8Xi/x8fGcPHmSxMRE3nnnHd577z0GDx7M8uXLL3h+RUUFixcvZs+ePdTW1pKSksLBgweJjIz8\nvkiLhe6WabFYgO6+1O7XISLSWwL97gz6CCEmJob4+HgABg0ahMPhoKamBuCyhRUXF5OVlUV4eDij\nRo3CZrNRVlYW7LJERKQTA3qy86qqKsrLy9m8eTPl5eWsW7eODRs2kJSUxKuvvsrQoUOpqalh+vTp\n/jZWqxWPx3NJX3l5ef77LpcLl8vVk6WLiPQ7brcbt9vd5fZBnzI65+TJk0ybNo1nn32WWbNmUV9f\nz7Bhw4COL/dDhw5RUFBAbm4u06dPJzMzE4DFixfjcrnIysr6vkhNGYmIBCzkU0YAZ86c4cEHH2T+\n/PnMmjULgOjoaCwWCxaLhdzcXMrLy4GOEUF1dbW/rcfjITY2tifKEhGRHxD0QDDGsGjRIuLi4li2\nbJl/e11dnf/+li1bsNlsAKSnp1NYWEhbWxsej4fKykqSk5ODXZaIiHQi6PsQdu/eTUFBAQ6HA6fT\nCcCLL77IX/7yFw4cOEBrayu33XYbGzduBCApKYnZs2fjcDgICwtj/fr1REREBLssERHpRI/tQwgm\n7UMQEQlcn9iHICIi/Y8CQUREAAWCiIj4KBBERARQIIiIiI8CQUREAAWCiIj4KBBERARQIIiIiI8C\nQUREAAWCiIj4KBBERARQIIiIiI8CQUREAAWCiIj4KBBERARQIIiIiI8CQUREAAWCiIj4KBBERARQ\nIIiIiI8CQUREAAWCiIj4BD0QqqurmTp1Kna7nXHjxvHSSy8B0NDQwIwZM3A4HKSmpnL8+HF/m/z8\nfOLi4rDb7ZSWlga7JBERuQoWY4wJZodHjhzB6/USHx/PyZMnSUxM5J133mHDhg2MGTOGp556ilde\neYXDhw+zdu1aKioqWLx4MXv27KG2tpaUlBQOHjxIZGTk90VaLHS3TIvFAnT3pXa/DhGR3hLod2fQ\nRwgxMTHEx8cDMGjQIBwOBzU1NZSUlJCTkwNAdnY2xcXFABQXF5OVlUV4eDijRo3CZrNRVlYW7LJE\nRKQTA3qy86qqKsrLy9m0aRNer5dhw4YBEB0dTV1dHQA1NTVMnz7d38ZqteLxeC7pKy8vz3/f5XLh\ncrl6snQRkX7H7Xbjdru73L7HAuHkyZPMnTuXtWvXEhUV1e3+zg8EERG51MX/LK9atSqg9j1ylNGZ\nM2d48MEHefjhh5k1axYAw4cPp76+HgCv18uIESOAjhFBdXW1v63H4yE2N
rYnyhIRkR8Q9EAwxrBo\n0SLi4uJYtmyZf3t6ejoFBQUAFBQUkJ6e7t9eWFhIW1sbHo+HyspKkpOTg12WiIh0IuhHGX388cdM\nnToVh8PhO7Kn47DS5ORkMjMzOXLkCCNHjqSoqIghQ4YA8OKLL1JQUEBYWBirV68mNTX1wiJ1lJGI\nSMAC/e7sNBB+8pOfsGPHjk639SQFgohI4AL97rziTuXTp0/T3NyM1+uloaHBv/3UqVP873//616V\nIiLS51wxENavX8/atWv59ttvSUpK8m8fOHAgS5Ys6ZXiRESk93Q6ZfTqq6+ydOnS3qrnsjRlJCIS\nuKDvQzDGsGvXLqqrq2lvb/dvf+SRR7peZYAUCCIigQvaPoRzHnroIWpqakhISCA8PNy/vTcDQURE\nel6nI4SxY8dy8OBB/yGkoaARgohI4IK+uF1iYqJ/3SEREbl2dTplVFtby7hx40hOTuaGG24AOlLn\ngw8+6PHiRESk93QaCFpUTkTk+hD0pSt6gvYhiIgELuhHGQ0aNMi/Q7m1tZUzZ84waNAgGhsbu16l\niIj0OZ0GwsmTJ/3329vbKS4u5pNPPunRokREpPd1acrI6XTy+eef90Q9l6UpIxGRwAV9ymjLli3+\n++3t7VRUVHStMhER6dM6DYQPP/zQvw8hLCwMq9VKSUlJjxcmIiK9S0cZBdaLpoxEpN8I+pnKVVVV\npKWlERUVRVRUFBkZGVRVVXWnRhER6YM6DYTs7GzmzZvH0aNHOXr0KFlZWWRnZ/dGbSIi0os6nTKa\nOHEiX3zxxQXbHA4HBw4c6NHCzqcpIxGRwAV9yuimm27i7bff5uzZs5w9e5a3336bwYMHd6tIERHp\nezodIRw6dIglS5bw6aefYrFYuOeee1i3bh1jxozprRo1QhAR6YKgXzEtJyeH1157jZtvvhmA48eP\n8+STT/KnP/2pe5UGQIEgIhK4oE8ZVVZW+sMAYMiQIb26/0BERHpHp4Hw3XffXbCQ3YkTJ2hpabni\n8xcuXEhMTAx2u92/LS8vD6vVitPpxOl0snXrVv/P8vPziYuLw263U1pa2tXXISIi3dTpmcpPPvkk\nd911F5mZmRhjKCoq4le/+tUVn79gwQKeeOKJC665bLFYWL58OcuXL7/guRUVFbz77rt8+eWX1NbW\nkpKSwsGDB4mMjOzGSxIRka7odISQm5vLX//6V6KiohgyZAiFhYXk5uZe8flTpkzhlltuuWT75eax\niouLycrKIjw8nFGjRmGz2SgrKwvwJYiISDB0OkKAjusqJyYmdusXrVu3jg0bNpCUlMSrr77K0KFD\nqampYfr06f7nWK1WPB7PZduff+U2l8uFy+XqVj0iItcat9uN2+3ucvurCoTueuyxx1ixYgXQ8cW+\ndOlSCgoKAupDl/IUEflhF/+zvGrVqoDadzplFAzR0dFYLBYsFgu5ubmUl5cDHSOC6upq//M8Hg+x\nsbG9UZKIiFykVwKhrq7Of3/Lli3YbDYA0tPTKSwspK2tDY/HQ2VlJcnJyb1RkoiIXCToU0bz5s3j\no48+or6+ntjYWFatWsXOnTs5cOAAra2t3HbbbWzcuBGApKQkZs+ejcPhICwsjPXr1xMRERHskkRE\n5CroegiB9aIzlUWk3wj6mcoiInJ9UCCIiAigQBARER8FgoiIAAoEERHxUSCIiAigQBARER8FgoiI\nAAoEERHx6ZXVTq8dA3xnPHfd4MG30NjYEKR6RESCR0tXBNZLUProB2+5iFwDtHSFiIh0iQJBREQA\nBYKIiPgoEEREBFAgiIiIjwJBREQABYKIiPgoEEREBFAgiIiIjwJBREQABYKIiPgoEEREBOiBQFi4\ncCExMTHY7Xb/toaGBmbMmIHD4SA1NZXjx4/7f5afn09cXBx2u53S0tJglyMiIlcp6IGwYMECtm3b\ndsG2lStXkpGRwYEDB0hLS2PlypUAV
FRU8O677/Lll1+ybds2cnNzaW1tDXZJIiJyFYIeCFOmTOGW\nW265YFtJSQk5OTkAZGdnU1xcDEBxcTFZWVmEh4czatQobDYbZWVlwS5JRESuQq9cIMfr9TJs2DAA\noqOjqaurA6Cmpobp06f7n2e1WvF4PJftIy8vz3/f5XLhcrl6rF4Rkf7I7Xbjdru73L7fXDHt/EAQ\nEZFLXfzP8qpVqwJq3ytHGQ0fPpz6+nqgY7QwYsQIoGNEUF1d7X+ex+MhNja2N0oSEZGL9EogpKen\nU1BQAEBBQQHp6en+7YWFhbS1teHxeKisrCQ5Obk3ShIRkYsEfcpo3rx5fPTRR9TX1xMbG8vzzz/P\nqlWryMzMZNOmTYwcOZKioiIAkpKSmD17Ng6Hg7CwMNavX09ERESwSxIRkatgMf3giu+BXij6Sn1A\nd19qcProB2+5iFwDAv3u1JnKIiICKBBERMRHgSAiIoACQUREfBQIIiICKBBERMRHgSAiIoACQURE\nfBQIIiICKBBERMRHgSAiIoACQUREfBQIIiICKBBERMRHgdDrBmCxWLp8i4oaGuoXICLXKF0PIbBe\n+kAfup6CiFwdXQ9BRES6RIEgIiKAAkFERHwGhLqAqxUZ+f9CXYKIyDWt3wTCmTP13Wi9HlgerFJE\nRK5J/SYQoDsjhMigVSEicq3SPgQREQF6eYQwevRooqKiCA8PJyIigrKyMhoaGsjMzOTIkSPceuut\nFBYWMmTIkN4sS0RE6OURgsViwe128/nnn1NWVgbAypUrycjI4MCBA6SlpbFy5creLElERHx6fcro\n4rPmSkpKyMnJASA7O5vi4uLeLklEROjlKSOLxcKMGTNoa2vj0Ucf5fHHH8fr9TJs2DAAoqOjqaur\nu0LrvPPuu3w3ERE5x+1243a7u9y+V9cyqqurY8SIEXi9Xu677z5+//vfM2fOHBobG/3PiYqKuuAx\nBGMdonXA493sA7SWkYj0J316LaMRI0YAMHz4cObOnUt5eTnDhw+nvr7jHAOv1+t/jlxJ91ZL1Yqp\nInIlvRYIzc3NNDc3A3Dq1Cm2bduGzWYjPT2dgoICAAoKCkhPT++tkvqpNjpGGF2/NTUd6/2yRaTP\n67V9CEeOHGHWrFlYLBaam5vJyspi5syZpKSkkJmZyaZNmxg5ciRFRUW9VZKIiJyn31wPQfsQgltD\nP/izi0g39el9CCIi0ncpEEREBFAgiIiIjwJBREQABYKIiPgoEEREBFAgiIiIjwJBREQABYKIiPgo\nEEREBFAgiIiIT69eIEf6igG+9aG6IwI4060eBg++hcbGhm7WISLBokC4Lp1bQrs7ur/IXlNTd0NJ\nRIJJU0YiIgIoECSkunf1N135TSS4NGUkIdS9qStNOYkEl0YIIiICKBBERMRHgSAiIoD2IUi/1v3z\nKXQuhMj3FAjSj3X/fArtmBb5ngJBrnOhP2tboxTpK/rEPoRt27Zht9uJi4vj97//fajLkevKuVFG\nd25nutW+qelYz7/MfsLtdoe6hOtayEcI3333HUuWLOHjjz8mJiaGyZMnc++99+J0OkNdmkgvCf0o\npe/0EQa0d6sCjbi6LuQjhL1792Kz2Rg1ahQDBgwgMzOT4uLiUJcl0otCP0rpO320d7sGjbi6LuQj\nBI/HQ2xsrP+x1Wq97LDx5psf6PLv+O67w7S0dLm5iPQr18aIKxQjnZAHwtX+4U6c+Ecwfts10kdf\nqKGv9NEXaghGH32hhr7SR1848qu7YdD9PpqajgUh2AIT8kCwWq1UV1f7H1dXV18wYgAwprtLNYuI\nSGdCvg9h0qRJVFZWUlNTw5kzZygqKiItLS3UZYmIXHdCPkK48cYbef3110lNTaW9vZ2cnBwSExND\nXZaIyHUn5CMEgLS0NCorK/nqq6/47W9/e8HPdI5CcI0ePRqHw4HT6SQ5OTnU5fQrCxcuJCYmBrvd\n7
t/W0NDAjBkzcDgcpKamcvz48RBW2L9c7v3My8vDarXidDpxOp1s27YthBX2L9XV1UydOhW73c64\nceN46aWXgAA/o6YPa2lpMaNHjzYej8ecOXPG3HXXXWbfvn2hLqtfGz16tDl69Gioy+iXdu3aZfbt\n22fi4+P92x5//HGzZs0aY4wxa9asMUuXLg1Vef3O5d7PvLw8s3r16hBW1X/V1taaL7/80hhjTFNT\nk7nzzjvN/v37A/qM9okRwpXoHIWeYbSTvkumTJnCLbfccsG2kpIScnJyAMjOztbnMwCXez9Bn8+u\niomJIT4+HoBBgwbhcDioqakJ6DPapwPhcucoeDyeEFbU/1ksFv/w8bXXXgt1Of2e1+tl2LBhAERH\nR1NXVxfiivq/devWMWHCBLKzs2lo0BnHXVFVVUV5eTkpKSkBfUb7dCD09jG414M9e/awb98+duzY\nwebNm/nnP/8Z6pJE/B577DEOHTrEV199xZgxY1i6dGmoS+p3Tp48ydy5c1m7di1RUVEBte3TgXA1\n5yhIYEaMGAHA8OHDmTt3LuXl5SGuqH8bPnw49fX1QMdo4dz7K10THR2NxWLBYrGQm5urz2eAzpw5\nw4MPPsjDDz/MrFmzgMA+o306EHSOQnA1NzfT3NwMwKlTp9i2bRs2my3EVfVv6enpFBQUAFBQUEB6\nenqIK+rfzp/O2LJliz6fATDGsGjRIuLi4li2bJl/e0Cf0Z7f9909JSUlxmazmQkTJpgXX3wx1OX0\na19//bVxOBxm4sSJ5s477zTPPfdcqEvqV7Kyssytt95qIiIijNVqNZs2bTJHjx41P/3pT43dbjcz\nZswwx44dC3WZ/cbF7+fGjRtNdna2cTgcZvz48SY1NdV4PJ5Ql9lv/Pvf/zYWi8VMnDjRJCQkmISE\nBLN169aAPqMWY7RLX0RE+viUkYiI9B4FgoiIAAoEERHxUSCIiAigQJBrgMvloqKiosd/z5o1axg3\nbpx/GYBzvvjiC7Zu3ep//OGHH4Z8IcaMjAwaGxs5ceIEr7/+esDt8/LyWL16dQ9UJn2ZAkH6ve6c\n0X727Nmrfu4bb7zBzp07eeutty7Y/vnnn1NSUuJ//MADD/DMM890uaZgKC4uJioqimPHjvHHP/4x\n4PZaJeD6pECQXlFVVcWECRNYvHgx8fHxuFwuTp06BVz4H359fT233347AG+++SazZs0iLS2N22+/\nnddee42XX36Zu+66i8TERP/ZlwBvvfUWycnJjB8/nt27dwMdp/DPmzePiRMnYrPZeOedd/z9zpw5\nk9TUVO69995Lav3d737HhAkTmDBhgv8//cWLF/P1119z33338corr/if29rayooVKygsLMTpdFJU\nVMSbb77JE088AcAvfvELfvnLX5KSksKYMWNwu90sWLCA8ePHM3/+fH8/H3zwAUlJSdjtdn72s5/R\n1NQEwNNPP43NZiMhIYHly5dfUmtTUxNZWVnYbDYmTpzIli1bgI5lzo8ePcpvfvMbDh06hNPp9IfU\n888/j8PhYMKECRcsN79ixQruuOMOXC4XBw8evPo/rlw7eu2sCbmuHT582AwYMMC/PO9DDz1kNm/e\nbIwxxuVymYqKCmOMMV6v14wePdoYY8zmzZvNHXfcYU6fPm28Xq+JiooyGzZsMMYYs2zZMvOHP/zB\nGGPMj3/8Y7NkyRJjjDG7d+82Y8eO9T+noKDAGGPMsWPHzJgxY0xjY6PZvHmzsVqtprGx8ZI6d+/e\nbex2u/nuu+/M6dOnjc1mM3v37jXGXHnp8DfffNM88cQTFzx+/PHHjTHG/PznPzcPP/ywMcaY999/\n3wwePNj85z//Me3t7SYpKcmUl5eb2tpaM3nyZNPc3GyMMeb//u//zLPPPmvq6uqMzWbz93vy5MlL\nfvfSpUvNr3/9a//jEydOXFBrVVXVBctLv//+++bRRx81xhhz9ux
Zc//995vt27ebTz75xNjtdtPa\n2mpOnTpl7rjjDi1DfR0K+RXT5Ppx++23+5fnTUpKumCdqiuZNm0aN954IzfeeCNDhgzxn3Zvt9vZ\nv38/0DG98dBDDwFwzz330NLSgtfrpbS0lO3bt/Pyyy8D0NbWxjfffONf8XXw4MGX/L6PP/6YOXPm\nEBkZCcCcOXPYtWvXD15MyBhzxSWbLRYLGRkZAMTHxzNy5EjGjx8PgM1mo7q6mqqqKv773/9yzz33\nAB2jjrvvvpuhQ4cSERHBokWLSE9P54EHHrik/x07dvD+++/7H1+8mNnFdZWWllJaWorT6QQ6ljA5\nfPgwx48fZ86cOURERBAREcHMmTO1DPV1SIEgveaGG27w3w8PD/d/4YSFhdHe3g5AS0vLFduEhYX5\nH5/f5nLOzYF/8MEH/imocz777DNuuummK7Y7/4vQGNPpfHpnPz8XLufXf/FrSEtL489//vMlbffu\n3cuOHTvYsmUL69at41//+tclzwn0i/u5555j4cKFF2x7+eWXL3ndcv3RPgQJmXNfOlarlc8++wyA\nv//97wG1PXf/b3/7GwCffvopAwcOJDo6mtTU1At2qFZWVl7S9mIpKSm89957tLa20tLSwnvvvcfU\nqVN/sJaBAwf6Fw3srP+LWSwWpkyZws6dO/nmm2+AjlA8dOgQp06doqmpibS0NFavXs2+ffsuaT9j\nxgzWr1/vf9zY2PiDtaWmprJ582Z/8B45coT6+voLXndzczP/+Mc/tGP5OqRAkF5z8RfMucdPP/00\na9asYdKkSdTV1fm3n1sG+XLtz/+ZxWIhMjKSu+++mwULFrBp0yYAXnjhBerq6oiLi8PhcPh3ql7c\n7/kmT55MZmYmEydOxOl0kpOTw6RJky5b/znTpk2joqKCiRMnUlRU1GndF4uJieGNN95g5syZJCQk\nkJyczFdffUVjYyP33XcfTqeTKVOmsGbNmkvavvDCC3zzzTfExcWRkJDAjh07Luk7ISGBuLg4nnnm\nGR544AHuv/9+EhMTSUhIYObMmTQ1NfGjH/2IWbNmERcXR3p6uq63fZ3S4nYiIgJohCAiIj4KBBER\nARQIIiLio0AQERFAgSAiIj4KBBERAeD/A9SY/5xqO79gAAAAAElFTkSuQmCC\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pyplot.hist(citation_counts, bins=range(20))\n", "pyplot.xlabel('number of times cited')\n", "pyplot.ylabel('count')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This histogram has a surprisingly long tail. Let us take a look at some of the bigger values to see if these make sense." 
] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "def article_pp(article):\n", "    authors = unicode(', '.join([n.name for n in article.inV() if n.element_type == 'author']))\n", "    s = ('Title: %s\\n'\n", "         'Authors: %s\\n'\n", "         'DOI: %s' % (article.title, authors, article.doi))\n", "    \n", "    return s" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Citer:\n", "Title: A Feedback Loop between Dynamin and Actin Recruitment during Clathrin-Mediated Endocytosis\n", "Authors: Marko Lampe, Christien J. Merrifield, Marcus J. Taylor\n", "DOI: 10.1371/journal.pbio.1001302\n", "\n", "Citee\n", "Title: A High Precision Survey of the Molecular Dynamics of Mammalian Clathrin-Mediated Endocytosis\n", "Authors: Marcus J. Taylor, David Perrais, Christien J. Merrifield\n", "DOI: 10.1371/journal.pbio.1000604\n", "\n", "Citer cites citee 21 times.\n", "-----------------------------------------------------\n", "Citer:\n", "Title: H2A.Z-Mediated Localization of Genes at the Nuclear Periphery Confers Epigenetic Memory of Previous Transcriptional State \n", "Authors: Yvonne Fondufe-Mittendorf, Sara Ahmed, Jason H Brickner, Donna Garvey Brickner, Jonathan Widom, Ivelisse Cajigas, Pei-Chih Lee\n", "DOI: 10.1371/journal.pbio.0050081\n", "\n", "Citee\n", "Title: Gene Recruitment of the Activated INO1 Locus to the Nuclear Membrane\n", "Authors: Peter Walter, Jason H Brickner\n", "DOI: 10.1371/journal.pbio.0020342\n", "\n", "Citer cites citee 21 times.\n", "-----------------------------------------------------\n" ] } ], "source": [ "for edge in edges:\n", "    if edge.label == 'cites':\n", "        if edge.reference_count >= 21:\n", "            print('Citer:')\n", "            print(article_pp(edge.outV()))\n", "            print('')\n", "            print('Citee')\n", "            print(article_pp(edge.inV()))\n", "            print('')\n", "            print('Citer cites citee %d times.' % edge.reference_count)\n", "            print('-----------------------------------------------------')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Checking these by hand we verify that our counts are correct." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I think it is sensible to postulate that the more often one article cites another one, the more heavily the work presented in the *citer* was influenced by the *citee*.\n", "\n", "There is certainly some cut-off at which *importance* stops increasing - my point is simply that citing another article multiple times in your manuscript probably means that you are basing your work at least partially on the article you cite." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above list we can already see that one article titled *A sex-ratio Meiotic Drive System in Drosophila simulans. II: An X-linked Distorter* is a clear follow-up to the article titled *A sex-ratio Meiotic Drive System in Drosophila simulans. I: An Autosomal Suppressor*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One question I am interested in is: How inspired are authors by their own work (generally *very inspired* I would presume), and how inspiring are articles to a completely different group of authors?"
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In my opinion, if one group of authors inspires a completely different group of authors to carry out scientific work (be it to follow up, refute, or whatever) then that defines *knowledge transfer* and a point at which scientific knowledge really becomes worth the time and resources it cost to produce in the first place.\n", "\n", "(I am certain this statement can be refined further but roughly speaking this is what I think)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us redo the above histogram but exclude all cited PLOS Biology articles that have one or more authors in common with the citing article.\n", "\n", "(one more bracketed caveat: When constructing my database I assumed that every author name occurs exactly once and is therefore unique - this is a heuristic that breaks easily)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "def are_different_authors(article_1, article_2):\n", "    # True if the two articles have no author name in common\n", "    authors_1 = []\n", "    authors_2 = []\n", "    \n", "    for n in article_1.inV():\n", "        if n.element_type == 'author':\n", "            authors_1.append(n.name)\n", "    for n in article_2.inV():\n", "        if n.element_type == 'author':\n", "            authors_2.append(n.name)\n", "    \n", "    authors_1 = set(authors_1)\n", "    authors_2 = set(authors_2)\n", "    \n", "    return len(authors_1.intersection(authors_2)) == 0\n" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "citation_counts = []\n", "\n", "for edge in edges:\n", "    if edge.label == 'cites':\n", "        if are_different_authors(edge.inV(), edge.outV()):\n", "            citation_counts.append(edge.reference_count)" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png":
"iVBORw0KGgoAAAANSUhEUgAAAYQAAAEKCAYAAAASByJ7AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHEhJREFUeJzt3X9s1PUdx/Hnt6VMIlSU0mI4IgYV6PWOHpU6XGFFV0uP\nHwN1tmjrJiSzTEVxGmcWoWhmNydBDMxgENy8zFDF+WMtTR2j/kD5YVGxcWk2pLNXU3ql/CiUUgqf\n/dHjRilYrnfttfB6JJfcffv9fO5918v3dZ/v5/v9nmWMMYiIyCUvKtIFiIhI36BAEBERQIEgIiJ+\nCgQREQEUCCIi4qdAEBERoAcCoaWlhUmTJuFyubjhhhtYvHgxAI2NjWRkZOB0OsnMzOTgwYOBNoWF\nhSQmJuJwOCgrKwt3SSIicgGsnjgP4dixYwwaNIi2tjbS0tIoLCzkrbfeYsyYMTzyyCO88MIL7N27\nl5UrV1JRUUF+fj7btm2jrq6OtLQ0qqqqGDhwYLjLEhGR79Eju4wGDRoEQGtrKydPniQ+Pp6SkhLy\n8vIAyM3Npbi4GIDi4mJycnKIjo5m5MiR2O12duzY0RNliYjI9+iRQDh16hTJyckkJCQwbdo07HY7\nPp+PYcOGARAXF0d9fT0AtbW12Gy2QFubzYbX6+2JskRE5HsM6IlOo6Ki+OKLLzh06BCZmZls2bIl\npP4sywpTZSIil5ZgZgV69CijK664ghkzZrB9+3aGDx9OQ0MDAD6fj/j4eKB9RFBTUxNo4/V6GTVq\nVKe+jDG6hem2dOnSiNdwMd30fuq97Ku3YIU9EPbv309TUxPQPrn8/vvv43A4cLvdeDweADweD263\nGwC3282GDRtoa2vD6/VSWVlJampquMsSEZEuhH2X0Xfffce9996LMYaWlhbuvvtuZsyYweTJk8nO\nzmbdunWMGDGCoqIiAFJSUpg7dy5Op5OoqCjWrFlDTExMuMsSEZEu9Mhhp+FmWVa3hj9ybuXl5aSn\np0e6jIuG3s/w0XsZXsFuOxUIIiIXqWC3nZfMpStiY6/CsqyQbrGxV0X6ZYiI9JhLZoTQfuhqqC9V\nIxUR6T80QhARkW5RIIiICKBAEBERPwWCiIgACgQREfFTIIiICKBAEBERPwWCiIgACgQREfFTIIiI\nCKBAEBERPwWCiIgACgQREfFTIIiICKBAEBERPwWCiIgACgQREfFTIIiICKBAEBERPwWCiIgACgQR\nEfFTIIiICKBAEBERPwWCiIgACgQREfELeyDU1NQwdepUHA4HY8eO5bnnngOgoKAAm82Gy+XC5XKx\nadOmQJvCwkISExNxOByUlZWFuyQREbkAljHGhLPDffv24fP5SEpK4siRI0ycOJE33niDt99+myFD\nhvDoo492WL+iooL8/Hy2bdtGXV0daWlpVFVVMXDgwP8XaVmEWqZlWUCoLzX0OkREekuw286wjxAS\nEhJISkoCYPDgwTidTmprawHOWVhxcTE5OTlER0czcuRI7HY7O3bsCHdZIiLShR6dQ6iurmbnzp1M\nmTIFgNWrVzN+/Hhyc3NpbGwEoLa2FpvNFmhjs9nwer09WZaIiJzDgJ7q+MiRI/zsZz9j5cqVDBky\nhAceeIAlS5YA7fMJixYtwuPxXHB/BQUFgfvp6emkp6eHuWIRkf6tvLyc8vLybrcP+xwCwIkTJ5g5\ncybTp09n8eLFnf7+3XffMW3aNKqqqnjmmWcYNGgQjz32GAAzZ87kySef5Ec/+tH/i9QcgohI0CI+\nh2CMYcGCBSQmJnYIg/r6+sD9jRs3YrfbAXC73WzYsIG2tja8Xi+VlZWkpqaGuywREelC2HcZbd26\nFY/Hg9PpxOVyAfDss8/y17/+ld27d9Pa2so111zDK6+8AkBKSgpz587F6XQSFRXFmjVriImJCXdZ\nIiLShR7ZZRRu2mUkIhK8iO8yEhGR/kmBICIigAJBRET8FAgiI
gIoEERExE+BICIigAJBRET8FAgi\nIgIoEERExE+BICIigAJBRET8FAgiIgIoEERExE+BICIigAJBRET8FAgiIgIoEERExE+BICIigAJB\nRET8FAgiIgIoEERExE+BICIigAJBRET8FAgiIgIoEERExE+BICIigAJBRET8FAgiIgIoEERExC/s\ngVBTU8PUqVNxOByMHTuW5557DoDGxkYyMjJwOp1kZmZy8ODBQJvCwkISExNxOByUlZWFuyQREbkA\nljHGhLPDffv24fP5SEpK4siRI0ycOJE33niDtWvXMmbMGB555BFeeOEF9u7dy8qVK6moqCA/P59t\n27ZRV1dHWloaVVVVDBw48P9FWhahlmlZFhDqSw29DhGR3hLstjPsI4SEhASSkpIAGDx4ME6nk9ra\nWkpKSsjLywMgNzeX4uJiAIqLi8nJySE6OpqRI0dit9vZsWNHuMsSEZEuDOjJzqurq9m5cyfr1q3D\n5/MxbNgwAOLi4qivrwegtraWW265JdDGZrPh9Xo79VVQUBC4n56eTnp6ek+WLiLS75SXl1NeXt7t\n9j0WCEeOHOHOO+9k5cqVxMbGhtzfmYEgIiKdnf1ledmyZUG175GjjE6cOMEdd9zBPffcw5w5cwAY\nPnw4DQ0NAPh8PuLj44H2EUFNTU2grdfrZdSoUT1RloiIfI+wB4IxhgULFpCYmMjixYsDy91uNx6P\nBwCPx4Pb7Q4s37BhA21tbXi9XiorK0lNTQ13WSIi0oWwH2X08ccfM3XqVJxOp//InvbDSlNTU8nO\nzmbfvn2MGDGCoqIihg4dCsCzzz6Lx+MhKiqK5cuXk5mZ2bFIHWUkIhK0YLedYQ+EnqBAEBEJXsQP\nOxURkf5JgSAiIoACQURE/BQIIiICKBBERMRPgSAiIoACQURE/BQIIiICKBBERMRPgSAiIoACQURE\n/BQIIiICKBBERMRPgSAiIoACQURE/BQIIiICKBBERMRPgSAiIoACQURE/BQIIiICXEAg3HrrrRe0\nTERE+rcB5/vDsWPHaG5uxufz0djYGFh+9OhR/vvf//ZKcSIi0nvOGwhr1qxh5cqVfPfdd6SkpASW\nDxo0iIULF/ZKcSIi0nssY4z5vhVefPFFFi1a1Fv1nJNlWXRR5gX1AaH1AaHXISLSW4LddnYZCMYY\nPvzwQ2pqajh16lRg+b333tv9KoOkQBARCV6w287z7jI67a677qK2tpbk5GSio6MDy3szEEREpOd1\nOUK44YYbqKqq8n/DjgyNEEREghfstrPLw04nTpxIfX19SEWJiEjf12Ug1NXVMXbsWG677TZmzZrF\nrFmzmD179nnXnz9/PgkJCTgcjsCygoICbDYbLpcLl8vFpk2bAn8rLCwkMTERh8NBWVlZiC9HRES6\nq8tdRuXl5edcnp6efs7lH330EYMHD+bee+/lq6++AmDZsmUMGTKERx99tMO6FRUV5Ofns23bNurq\n6khLS6OqqoqBAwd2LFK7jEREghb2SeXzbfjPZ8qUKVRXV3dafq6iiouLycnJITo6mpEjR2K329mx\nYwdpaWlBPaeIiISuy0AYPHhwYEK5tbWVEydOMHjwYA4fPhzUE61evZq1a9eSkpLCiy++yFVXXUVt\nbS233HJLYB2bzYbX6z1n+4KCgsD99PT0oINKRORiV15eft69Oheiy0A4cuRI4P6pU6coLi7mk08+\nCepJHnjgAZYsWQK0b9gXLVqEx+MJqo8zA0FERDo7+8vysmXLgmof1NVOo6KimDVrFqWlpUE9SVxc\nHJZlYVkW999/Pzt37gTaRwQ1NTWB9bxeL6NGjQqqbxERCY8uRwgbN24M3D916hQVFRVBP0l9fT3x\n8fGB/ux2OwBut5v8/HweeeQR6urqqKysJDU1Nej+RUQkdF0GwnvvvReYQ4iKisJms1FSUnLe9efN\nm8cHH3xAQ0MDo0aNYtmyZ
WzZsoXdu3fT2trKNddcwyuvvAJASkoKc+fOxel0EhUVxZo1a4iJiQnT\nSxMRkWB0edhpX6DDTkVEghf2M5Wrq6vJysoiNjaW2NhYZsyYcc7DSkVEpH/rMhByc3OZN28e+/fv\nZ//+/eTk5JCbm9sbtYmISC/qcpfRhAkT+PLLLzssczqd7N69u0cLO5N2GYmIBC/su4wuv/xyXn/9\ndU6ePMnJkyd5/fXXGTJkSEhFiohI39PlCGHPnj0sXLiQTz/9FMuyuPnmm1m9ejVjxozprRo1QhAR\n6Yaw/2JaXl4eq1at4oorrgDg4MGDPPzww/z5z38OrdIgKBBERIIX9l1GlZWVgTAAGDp0aK/OH4iI\nSO/oMhCOHz/e4UJ2hw4doqWlpUeLEhGR3tflmcoPP/wwN954I9nZ2RhjKCoq4te//nVv1CYiIr3o\ngs5U3rVrF5s3b8ayLG699VZcLldv1BagOQQRkeCFfVK5L1AgiIgEL+yTyiIicmlQIIiICKBAEBER\nPwWCiIgACgQREfFTIIiICKBAEBERPwWCiIgACgQREfFTIIiICKBAEBERPwWCiIgACgQREfFTIIiI\nCKBAEBERPwWCiIgACgQREfELeyDMnz+fhIQEHA5HYFljYyMZGRk4nU4yMzM5ePBg4G+FhYUkJibi\ncDgoKysLdzkiInKBwh4I9913H6WlpR2WLV26lBkzZrB7926ysrJYunQpABUVFbz11lt89dVXlJaW\ncv/999Pa2hruksJoAJZlhXSLjb0q0i9CROScwh4IU6ZM4corr+ywrKSkhLy8PAByc3MpLi4GoLi4\nmJycHKKjoxk5ciR2u50dO3aEu6QwaqP9d5m7f2tqOtD7ZYuIXIABvfEkPp+PYcOGARAXF0d9fT0A\ntbW13HLLLYH1bDYbXq/3nH0UFBQE7qenp5Oent5j9YqI9Efl5eWUl5d3u32vBEI4nBkIIiLS2dlf\nlpctWxZU+145ymj48OE0NDQA7aOF+Ph4oH1EUFNTE1jP6/UyatSo3ihJRETO0iuB4Ha78Xg8AHg8\nHtxud2D5hg0baGtrw+v1UllZSWpqam+UJCIiZwn7LqN58+bxwQcf0NDQwKhRo3j66adZtmwZ2dnZ\nrFu3jhEjRlBUVARASkoKc+fOxel0EhUVxZo1a4iJiQl3SSIicgEsY4yJdBFdsSyLUMu0LIv2I31C\n6iUsffSDt1xELgLBbjt1prKIiAAKBBER8VMgiIgIoEAQERE/BYKIiAAKBBER8VMgiIgIoEAQERE/\nBYKIiAAKBBER8VMgiIgIoEAQERE/BYKIiAAKBBER8VMgiIgIoEAQERE/BYKIiAAKBBER8VMgiIgI\noEAQERE/BYKIiAAKBBER8VMgiIgIoEAQERE/BYKIiAAKBBER8RsQ6QIu1GuvvRbpEkRELmqWMcZE\nuoiuWJbF4MG53W5/4kQVx4/vBEJ9qVZY+ugHb7mIXAQsK7jtTa8GwujRo4mNjSU6OpqYmBh27NhB\nY2Mj2dnZ7Nu3j6uvvpoNGzYwdOjQjkVaoW6IVwMPhtgHKBBEpD8JNhB6dQ7BsizKy8v5/PPP2bFj\nBwBLly5lxowZ7N69m6ysLJYuXdqbJYmIiF+vTyqfnVYlJSXk5eUBkJubS3FxcW+XJCIiRGCEkJGR\ngdPpZNWqVQD4fD6GDRsGQFxcHPX19b1ZkoiI+PXqUUbbtm0jPj4en8/H9OnTGTduXBCtC864n+6/\niYjIaeXl5ZSXl3e7fcSOMiosLARg7dq1bN++nbi4OHw+H5MnT+Y///lPxyIvqknlGKCt262HDLmS\nw4cbQ6xBRC4FfXZSubm5mebmZgCOHj1KaWkpdrsdt9uNx+MBwOPx4Ha7e6ukCGmjPVS6d2tqOhCB\nmkXkUtBru4z27dvHnDlzsCyL5uZmcnJymD17NmlpaWRnZ7Nu3TpGjBhBUVFRb5UkIiJn6Dc
npl08\nu4xC7UPnMYjIhemzu4xERKRvUyCIiAigQBARET8FgoiIAAoEERHxUyCIiAigQBARET8FgoiIAAoE\nERHxUyCIiAigQBARET8FgoiIAL38AzkSDgP8F/vrPv2mgoiciwKh3zn9ewrd19QUWqCIyMVJu4xE\nRARQIIiIiJ8CQUREAAWCiIj4KRBERARQIIiIiJ8CQUREAAWCiIj46cS0S5LOdhaRzhQIlySd7Swi\nnSkQpJs0yhC52CgQpJvCMcqICSlUFCgi4aVAkAgKLVS020okvBQI0o9pt5VIOPWJw05LS0txOBwk\nJibyhz/8IdLlSL9xeoTR/VtT04GQqygvLw+5D2mn9zKyIh4Ix48fZ+HChZSWlrJ7927efPNNPv/8\n80iXJXLBMjOzsCyr27fY2Ksi/RL6DAVCZEU8ELZv347dbmfkyJEMGDCA7OxsiouLI12WXDIGhLQx\ntyyL1tYWQhulNIVcg2UN7BN9hBpuhYV/iHgNl7KIzyF4vV5GjRoVeGyz2c75LeGKK2Z1+zmOH99L\nS0u3m8tFLfSjpSDUye1w1RD5PkI9cqxdX6ghBjgRUg/9cX4q4oFwof+4Q4f+Ho5nu0j66As19JU+\n+kIN4eijL9QQrj5C1RdqCC0MAJqaDoQhmHpXxAPBZrNRU1MTeFxTU9NhxABgTKjffEREpCsRn0OY\nNGkSlZWV1NbWcuLECYqKisjKyop0WSIil5yIjxAuu+wyXnrpJTIzMzl16hR5eXlMnDgx0mWJiFxy\nIj5CAMjKyqKyspKvv/6aJ598ssPfdI5CeI0ePRqn04nL5SI1NTXS5fQr8+fPJyEhAYfDEVjW2NhI\nRkYGTqeTzMxMDh48GMEK+5dzvZ8FBQXYbDZcLhcul4vS0tIIVti/1NTUMHXqVBwOB2PHjuW5554D\ngvyMmj6spaXFjB492ni9XnPixAlz4403ml27dkW6rH5t9OjRZv/+/ZEuo1/68MMPza5du0xSUlJg\n2YMPPmhWrFhhjDFmxYoVZtGiRZEqr9851/tZUFBgli9fHsGq+q+6ujrz1VdfGWOMaWpqMtdff735\n4osvgvqM9okRwvnoHIWeYTRJ3y1Tpkzhyiuv7LCspKSEvLw8AHJzc/X5DMK53k/Q57O7EhISSEpK\nAmDw4ME4nU5qa2uD+oz26UA41zkKXq83ghX1f5ZlBYaPq1atinQ5/Z7P52PYsGEAxMXFUV9fH+GK\n+r/Vq1czfvx4cnNzaWzsX8fx9xXV1dXs3LmTtLS0oD6jfToQ+tsxvP3Btm3b2LVrF5s3b2b9+vX8\n4x//iHRJIgEPPPAAe/bs4euvv2bMmDEsWrQo0iX1O0eOHOHOO+9k5cqVxMbGBtW2TwfChZyjIMGJ\nj48HYPjw4dx5553s3LkzwhX1b8OHD6ehoQFoHy2cfn+le+Li4gKXoLj//vv1+QzSiRMnuOOOO7jn\nnnuYM2cOENxntE8Hgs5RCK/m5maam5sBOHr0KKWlpdjt9ghX1b+53W48Hg8AHo8Ht9sd4Yr6tzN3\nZ2zcuFGfzyAYY1iwYAGJiYksXrw4sDyoz2jPz32HpqSkxNjtdjN+/Hjz7LPPRrqcfu2bb74xTqfT\nTJgwwVx//fXmqaeeinRJ/UpOTo65+uqrTUxMjLHZbGbdunVm//795ic/+YlxOBwmIyPDHDhwINJl\n9htnv5+vvPKKyc3NNU6n04wbN85kZmYar9cb6TL7jY8++shYlmUmTJhgkpOTTXJystm0aVNQn1HL\nGE3pi4hIH99lJCIivUeBICIigAJBRET8FAgiIgIoEOQikJ6eTkVFRY8/z4oVKxg7dmzgMgCnffnl\nl2zatCnw+L333ov4hRhnzJjB4cOHOXToEC+99FLQ7QsKCli+fHkPVCZ9mQJB+r1Qzmg/efLkBa/7\n8ssvs2XLFl577bUOyz///HNKSkoCj2fNmsUTTzzR7Zr
Cobi4mNjYWA4cOMCf/vSnoNvrKgGXJgWC\n9Irq6mrGjx9Pfn4+SUlJpKenc/ToUaDjN/yGhgauvfZaAF599VXmzJlDVlYW1157LatWreL555/n\nxhtvZOLEiYGzLwFee+01UlNTGTduHFu3bgXaT+GfN28eEyZMwG6388YbbwT6nT17NpmZmdx2222d\nav3d737H+PHjGT9+fOCbfn5+Pt988w3Tp0/nhRdeCKzb2trKkiVL2LBhAy6Xi6KiIl599VUeeugh\nAH7xi1/wq1/9irS0NMaMGUN5eTn33Xcf48aN4+677w708+6775KSkoLD4eCnP/0pTU1NADz++OPY\n7XaSk5N59NFHO9Xa1NRETk4OdrudCRMmsHHjRqD9Muf79+/nN7/5DXv27MHlcgVC6umnn8bpdDJ+\n/PgOl5tfsmQJ1113Henp6VRVVV34P1cuHr121oRc0vbu3WsGDBgQuDzvXXfdZdavX2+MMSY9Pd1U\nVFQYY4zx+Xxm9OjRxhhj1q9fb6677jpz7Ngx4/P5TGxsrFm7dq0xxpjFixebP/7xj8YYY3784x+b\nhQsXGmOM2bp1q7nhhhsC63g8HmOMMQcOHDBjxowxhw8fNuvXrzc2m80cPny4U51bt241DofDHD9+\n3Bw7dszY7Xazfft2Y8z5Lx3+6quvmoceeqjD4wcffNAYY8zPf/5zc8899xhjjHnnnXfMkCFDzL/+\n9S9z6tQpk5KSYnbu3Gnq6urM5MmTTXNzszHGmN///vfmt7/9ramvrzd2uz3Q75EjRzo996JFi8xj\njz0WeHzo0KEOtVZXV3e4vPQ777xjfvnLXxpjjDl58qSZOXOmef/9980nn3xiHA6HaW1tNUePHjXX\nXXedLkN9CYr4L6bJpePaa68NXJ43JSWlw3WqzmfatGlcdtllXHbZZQwdOjRw2r3D4eCLL74A2ndv\n3HXXXQDcfPPNtLS04PP5KCsr4/333+f5558HoK2tjW+//TZwxdchQ4Z0er6PP/6Y22+/nYEDBwJw\n++238+GHH37vjwkZY857yWbLspgxYwYASUlJjBgxgnHjxgFgt9upqamhurqaf//739x8881A+6jj\npptu4qqrriImJoYFCxbgdruZNWtWp/43b97MO++8E3h89sXMzq6rrKyMsrIyXC4X0H4Jk71793Lw\n4EFuv/12YmJiiImJYfbs2boM9SVIgSC95gc/+EHgfnR0dGCDExUVxalTpwBoaWk5b5uoqKjA4zPb\nnMvpfeDvvvtuYBfUaZ999hmXX375eduduSE0xnS5P72rv58OlzPrP/s1ZGVl8Ze//KVT2+3bt7N5\n82Y2btzI6tWr+ec//9lpnWA33E899RTz58/vsOz555/v9Lrl0qM5BImY0xsdm83GZ599BsDf/va3\noNqevv/mm28C8OmnnzJo0CDi4uLIzMzsMKFaWVnZqe3Z0tLSePvtt2ltbaWlpYW3336bqVOnfm8t\ngwYNClw0sKv+z2ZZFlOmTGHLli18++23QHso7tmzh6NHj9LU1ERWVhbLly9n165dndpnZGSwZs2a\nwOPDhw9/b22ZmZmsX78+ELz79u2joaGhw+tubm7m73//uyaWL0EKBOk1Z29gTj9+/PHHWbFiBZMm\nTaK+vj6w/PRlkM/V/sy/WZbFwIEDuemmm7jvvvtYt24dAM888wz19fUkJibidDoDk6pn93umyZMn\nk52dzYQJE3C5XOTl5TFp0qRz1n/atGnTqKioYMKECRQVFXVZ99kSEhJ4+eWXmT17NsnJyaSmpvL1\n119z+PBhpk+fjsvlYsqUKaxYsaJT22eeeYZvv/2WxMREkpOT2bx5c6e+k5OTSUxM5IknnmDWrFnM\nnDmTiRMnkpyczOzZs2lqauKHP/whc+bMITExEbfbrd/bvkTp4nYiIgJohCAiIn4KBBERARQIIiLi\np0AQERFAgSAiIn4
KBBERAeB/+GL357hEL2gAAAAASUVORK5CYII=\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pyplot.hist(citation_counts, bins=range(20))\n", "pyplot.xlabel('number of times cited')\n", "pyplot.ylabel('count')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This histogram does not look very different from the one above.\n", "\n", "Let us take a look at data points in the tail:" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Citer:\n", "Title: Lack of Support for the Association between GAD2 Polymorphisms and Severe Human Obesity\n", "Authors: Frank Geller, John P Kane, Raphael Merriman, Christian Vaisse, Winfried Rief, Robert Dent, Johannes Hebebrand, Björn Waldenmaier, Franck Mauvais-Jarvis, Anke Hinney, Michael M Swarbrick, Clive R Pullinger, Mary Malloy, Len A Pennacchio, Anna Ustaszewska, Denise L Lind, Wen-Chi Hsueh, Ruth McPherson, Martha M Cavazos, André Scherag, Pui-Yan Kwok\n", "DOI: 10.1371/journal.pbio.0030315\n", "\n", "Citee\n", "Title: GAD2 on Chromosome 10p12 Is a Candidate Gene for Human Obesity\n", "Authors: Lynn Bekris, Valérie Vasseur-Delannoy, Philippe Boutin, Karin Séron, Philippe Froguel, Mohamed Chikri, Christian Dina, Laetitia Corset, M. Aline Charles, Séverine Dubois, Francis Vasseur, Janice Cabellon, Ake Lernmark, Bernadette Neve, Karine Clement\n", "DOI: 10.1371/journal.pbio.0000068\n", "\n", "Citer cites citee 17 times.\n", "-----------------------------------------------------\n", "Citer:\n", "Title: Structural Basis of Rap Phosphatase Inhibition by Phr Peptides\n", "Authors: Alberto Marina, Francisca Gallego del Sol\n", "DOI: 10.1371/journal.pbio.1001511\n", "\n", "Citee\n", "Title: Structural Basis of Response Regulator Inhibition by a Bacterial Anti-Activator Protein\n", "Authors: Matthew B. Neiditch, Melinda D. 
Baker\n", "DOI: 10.1371/journal.pbio.1001226\n", "\n", "Citer cites citee 16 times.\n", "-----------------------------------------------------\n" ] } ], "source": [ "for edge in edges:\n", "    if edge.label == 'cites':\n", "        if edge.reference_count >= 16 and are_different_authors(edge.inV(), edge.outV()):\n", "            print('Citer:')\n", "            print(article_pp(edge.outV()))\n", "            print('')\n", "            print('Citee')\n", "            print(article_pp(edge.inV()))\n", "            print('')\n", "            print('Citer cites citee %d times.' % edge.reference_count)\n", "            print('-----------------------------------------------------')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, the two article pairs in the tail of this updated distribution have lower reference counts than those we observed before filtering for disjoint author sets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, how inspiring are PLOS Biology authors to other (different) PLOS Biology authors?\n", "\n", "To answer this question, I would like to propose a measure that I have privately been calling the **Inspiration Factor** for some time now. One variant of the model I have in mind is this:\n", "\n", "Inspiration is an increasing function of the number of authors (unrelated to you) that you inspired to carry out scientific work.\n", "\n", "Since I do not want to count citations that are mentioned only once in the main text of an article, I will impose a threshold of at least *three references*.\n", "\n", "In the future, I should refine the way I parse articles to account for the context in which citations are referenced.\n", "\n", "Anyway, let us take a look at those PLOS Biology articles that have inspired at least three other PLOS Biology articles." 
] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [], "source": [ "inspirators = []\n", "\n", "for article in articles.values():\n", "    in_nodes = []\n", "    if article.inE():\n", "        for edge in article.inE():\n", "            if edge.label == 'cites':\n", "                if are_different_authors(edge.inV(), edge.outV()) and edge.reference_count >= 3:\n", "                    in_nodes.append([edge.outV(), edge.reference_count])\n", "\n", "    if len(in_nodes) >= 3:\n", "        inspirators.append([article, in_nodes])" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(inspirators)" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inspirator\n", "Title: The Evolution of Combinatorial Gene Regulation in Fungi\n", "Authors: Alexander D Johnson, Aaron D Hernday, Hao Li, Brian B Tuch, David J Galgoczy\n", "DOI: 10.1371/journal.pbio.0060038\n", "\n", "Inspired Article\n", "Title: Biofilm Matrix Regulation by Candida albicans Zap1\n", "Authors: Oliver R. Homann, Clarissa J. Nobile, Jean-Sebastien Deneault, Aaron P. Mitchell, Andre Nantel, Aaron D. Hernday, David R. Andes, Jeniel E. Nett, Alexander D. Johnson\n", "DOI: 10.1371/journal.pbio.1000133\n", "Cites inspirator 3 times.\n", "\n", "Inspired Article\n", "Title: Evolutionary Tinkering with Conserved Components of a Transcriptional Regulatory Network\n", "Authors: Jaideep Mallick, Adnane Sellam, Hugo Lavoie, Hervé Hogues, Malcolm Whiteway, André Nantel\n", "DOI: 10.1371/journal.pbio.1000329\n", "Cites inspirator 6 times.\n", "\n", "Inspired Article\n", "Title: Evolution of Phosphoregulation: Comparison of Phosphorylation Patterns across Yeast Species\n", "Authors: Assen Roguev, Dorothea Fiedler, Jonathan C. Trinidad, Wendell A. Lim, Pedro Beltrao, Kevan M. Shokat, Alma L. Burlingame, Nevan J. 
Krogan\n", "DOI: 10.1371/journal.pbio.1000134\n", "Cites inspirator 3 times.\n", "\n", "--------------------------------------\n", "\n", "Inspirator\n", "Title: Transcription Factors Bind Thousands of Active and Inactive Regions in the Drosophila Blastoderm \n", "Authors: Lisa Simirenko, Michael B Eisen, Mark Stapleton, Richard Weiszmann, Cris L. Luengo Hendriks, Tom Gingeras, Amy Beaton, Hou Cheng Chu, Xiao-yong Li, Terence P Speed, Victor Sementchenko, Mark D Biggin, Richard Bourgon, Stewart MacArthur, William Inwood, Susan E Celniker, Nobuo Ogawa, Venky N Iyer, David W Knowles, Daniel A Pollard, David Nix, Aaron Hechmer\n", "DOI: 10.1371/journal.pbio.0060027\n", "\n", "Inspired Article\n", "Title: Target Genes of the MADS Transcription Factor SEPALLATA3: Integration of Developmental and Hormonal Pathways in the Arabidopsis Flower\n", "Authors: Cezary Smaczniak, Kerstin Kaufmann, Pawel Krajewski, Ruy Jauregui, Chiara A Airoldi, Gerco C Angenent, Jose M Muiño\n", "DOI: 10.1371/journal.pbio.1000090\n", "Cites inspirator 3 times.\n", "\n", "Inspired Article\n", "Title: Evolutionary Plasticity of Polycomb/Trithorax Response Elements in Drosophila Species\n", "Authors: Arne Hauenschild, Leonie Ringrose, Renato Paro, Christina Altmutter, Marc Rehmsmeier\n", "DOI: 10.1371/journal.pbio.0060261\n", "Cites inspirator 6 times.\n", "\n", "Inspired Article\n", "Title: Quantitative Analysis of the Drosophila Segmentation Regulatory Network Using Pattern Generating Potentials\n", "Authors: Sudhir Kumar, Susan E. Celniker, Ann S. Hammonds, Saurabh Sinha, Majid Kazemian, Charles Blatti, Noriko Wakabayashi-Ito, Scot A. Wolfe, Adam Richards, Michael McCutchan, Michael H. 
Brodsky\n", "DOI: 10.1371/journal.pbio.1000456\n", "Cites inspirator 7 times.\n", "\n", "--------------------------------------\n", "\n" ] } ], "source": [ "for inspirator in inspirators:\n", "    print('Inspirator')\n", "    print(article_pp(inspirator[0]))\n", "    print('')\n", "    for el in inspirator[1]:\n", "        print('Inspired Article')\n", "        print(article_pp(el[0]))\n", "        print('Cites inspirator %d times.' % el[1])\n", "        print('')\n", "    print('--------------------------------------')\n", "    print('')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And that is it for now.\n", "\n", "I will expand my dataset to include more articles and think about how to enrich the data I extract from these articles.\n", "\n", "One question that I am eager to tackle soon is: how long a chain of scientific discovery do you trigger?\n", "\n", "I imagine that an article that lies at the beginning of a long chain of articles that inspired one another would have some significance." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.1" } }, "nbformat": 4, "nbformat_minor": 1 }