{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A Retrospective on the 2014 NeurIPS Experiment\n", "\n", "### [Neil D. Lawrence](http://inverseprobability.com), University of\n", "\n", "Cambridge\n", "\n", "### 2021-06-16" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Abstract**: In 2014, along with Corinna Cortes, I was Program Chair of\n", "the Neural Information Processing Systems conference. At the time, when\n", "wondering about innovations for the conference, Corinna and I decided it\n", "would be interesting to test the consistency of reviewing. With this in\n", "mind, we randomly selected 10% of submissions and had them reviewed by\n", "two independent committees. In this talk I will review the construction\n", "of the experiment, explain how the NeurIPS review process worked and\n", "talk about what I felt the implications for reviewing were, vs what the\n", "community reaction was." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "The NIPS experiment was an experiment to determine the consistency of\n", "the review process. After receiving papers, we selected 10% that would\n", "be independently rereviewed. The idea was to determine how consistent\n", "the decisions between the two sets of independent papers would be. In\n", "2014 NIPS received 1678 submissions and we selected 170 for the\n", "experiment. These papers are referred to below as ‘duplicated papers.’\n", "\n", "To run the experiment, we created two separate committees within the\n", "NIPS program committee. The idea was that the two separate committees\n", "would review each duplicated paper independently and results compared." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## NeurIPS in Numbers\n", "\n", "\\[edit\\]\n", "\n", "In 2014 the NeurIPS conference had 1474 active reviewers (up from 1133\n", "in 2013), 92 area chairs (up from 67 in 2013) and two program chairs,\n", "Corinna Cortes and me.\n", "\n", "The conference received 1678 submissions and presented 414 accepted\n", "papers, of which 20 were presented as talks in the single-track session,\n", "62 were presented as spotlights and 331 papers were presented as\n", "posters. Of the 1678 submissions, 19 papers were rejected without\n", "review." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The NeurIPS Experiment\n", "\n", "The objective of the NeurIPS experiment was to determine how consistent\n", "the process of peer review is. One way of phrasing this question is to\n", "ask: what would happen to submitted papers in the conference if the\n", "process was independently rerun?\n", "\n", "For the 2014 conference, to explore this question, we selected\n", "$\\approx 10\\%$ of submitted papers to be reviewed twice, by independent\n", "committees. This led to 170 papers being selected from the conference\n", "for dual reviewing. For these papers the program committee was divided\n", "into two. Reviewers were placed randomly on one side of the committee or\n", "the other. For Program Chairs we also engaged in some manual selection\n", "to ensure we had expert coverage in all the conference areas on both\n", "side of the committee." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Timeline for NeurIPS\n", "\n", "\\[edit\\]\n", "\n", "Chairing a conference starts with recruitment of the program committee,\n", "which is usually done in a few stages. The primary task is to recruit\n", "the area chairs. We sent out our program committee invites in three\n", "waves.\n", "\n", "- 17/02/2014\n", "- 08/03/2014\n", "- 09/04/2014\n", "\n", "By recruiting area chairs first, you can involve them in recruiting\n", "reviewers. We requested names of reviewers from ACs in two waves.\n", "\n", "- 25/03/2014\n", "- 11/04/2014\n", "\n", "In 2014, this wasn’t enough to obtain the requisite number of reviewers,\n", "so we used additional approaches. These included lists of previous\n", "NeurIPS authors. For each individual we were looking for at least two\n", "previously-published papers from NeurIPS and other leading leading ML\n", "venues like ICML, AISTATS, COLT, UAI etc.. We made extensive use of\n", "[DBLP](https://dblp.uni-trier.de/) for verifying each potential\n", "reviewer’s publication track record.\n", "\n", "- 14/04/2014\n", "- 28/04/2014\n", "- 09/05/2014\n", "- 10/06/2014 (note this is after deadline … lots of area chairs asked\n", " for reviewers after the deadline!). We invited them en-masse.\n", "\n", "- 06/06/2014 Submission Deadline\n", "- 12/06/2014 Bidding Open for Area Chairs (this was *delayed* by CMT\n", " issues)\n", "- 17/06/2014 Bidding Open for Reviewers\n", "- 01/07/2014 Start Reviewing\n", "- 21/07/2014 Reviewing deadline\n", "- 04/08/2014 Reviews to Authors\n", "- 11/08/2014 Author Rebuttal Due\n", "- 25/08/2014 Teleconferences Begin\n", "- 30/08/2014 Teleconferences End\n", "- 1/09/2014 Preliminary Decisions Made\n", "- 9/09/2014 Decisions Sent to Authors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Paper Scoring and Reviewer Instructions\n", "\n", "\\[edit\\]\n", "\n", "The instructions to reviewers for the 2014 conference are still\n", "available [online\n", "here](https://nips.cc/Conferences/2014/PaperInformation/ReviewerInstructions).\n", "\n", "To keep quality of reviews high, we tried to keep load low. We didn’t\n", "assign any reviewer more than 5 papers, most reviewers received 4\n", "papers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Quantitative Evaluation\n", "\n", "Reviewers give a score of between 1 and 10 for each paper. The program\n", "committee will interpret the numerical score in the following way:\n", "\n", "- 10: Top 5% of accepted NIPS papers, a seminal paper for the ages.\n", "\n", " I will consider not reviewing for NIPS again if this is rejected.\n", "\n", "- 9: Top 15% of accepted NIPS papers, an excellent paper, a strong\n", " accept.\n", "\n", " I will fight for acceptance.\n", "\n", "- 8: Top 50% of accepted NIPS papers, a very good paper, a clear\n", " accept.\n", "\n", " I vote and argue for acceptance.\n", "\n", "- 7: Good paper, accept.\n", "\n", " I vote for acceptance, although would not be upset if it were\n", " rejected.\n", "\n", "- 6: Marginally above the acceptance threshold.\n", "\n", " I tend to vote for accepting it, but leaving it out of the program\n", " would be no great loss.\n", "\n", "- 5: Marginally below the acceptance threshold.\n", "\n", " I tend to vote for rejecting it, but having it in the program would\n", " not be that bad.\n", "\n", "- 4: An OK paper, but not good enough. 
A rejection.\n", "\n", " I vote for rejecting it, although would not be upset if it were\n", " accepted.\n", "\n", "- 3: A clear rejection.\n", "\n", " I vote and argue for rejection.\n", "\n", "- 2: A strong rejection. I’m surprised it was submitted to this\n", " conference.\n", "\n", " I will fight for rejection.\n", "\n", "- 1: Trivial or wrong or known. I’m surprised anybody wrote such a\n", " paper.\n", "\n", " I will consider not reviewing for NIPS again if this is accepted.\n", "\n", "Reviewers should NOT assume that they have received an unbiased sample\n", "of papers, nor should they adjust their scores to achieve an artificial\n", "balance of high and low scores. Scores should reflect absolute judgments\n", "of the contributions made by each paper." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Impact Score\n", "\n", "The impact score was an innovation introduce in 2013 by Ghahramani and\n", "Welling that we retained for 2014. Quoting from the instructions to\n", "reviewers:\n", "\n", "> Independently of the Quality Score above, this is your opportunity to\n", "> identify papers that are very different, original, or otherwise\n", "> potentially impactful for the NIPS community.\n", ">\n", "> There are two choices:\n", ">\n", "> 2: This work is different enough from typical submissions to\n", "> potentially have a major impact on a subset of the NIPS community.\n", ">\n", "> 1: This work is incremental and unlikely to have much impact even\n", "> though it may be technically correct and well executed.\n", ">\n", "> Examples of situations where the impact and quality scores may point\n", "> in opposite directions include papers which are technically strong but\n", "> unlikely to generate much follow-up research, or papers that have some\n", "> flaw (e.g. not enough evaluation, not citing the right literature) but\n", "> could lead to new directions of research." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Confidence Score\n", "\n", "Reviewers also give a confidence score between 1 and 5 for each paper.\n", "The program committee will interpret the numerical score in the\n", "following way:\n", "\n", "5: The reviewer is absolutely certain that the evaluation is correct and\n", "very familiar with the relevant literature.\n", "\n", "4: The reviewer is confident but not absolutely certain that the\n", "evaluation is correct. It is unlikely but conceivable that the reviewer\n", "did not understand certain parts of the paper, or that the reviewer was\n", "unfamiliar with a piece of relevant literature.\n", "\n", "3: The reviewer is fairly confident that the evaluation is correct. It\n", "is possible that the reviewer did not understand certain parts of the\n", "paper, or that the reviewer was unfamiliar with a piece of relevant\n", "literature. Mathematics and other details were not carefully checked.\n", "\n", "2: The reviewer is willing to defend the evaluation, but it is quite\n", "likely that the reviewer did not understand central parts of the paper.\n", "\n", "1: The reviewer’s evaluation is an educated guess. Either the paper is\n", "not in the reviewer’s area, or it was extremely difficult to understand." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Qualitative Evaluation\n", "\n", "All NIPS papers should be good scientific papers, regardless of their\n", "specific area. 
We judge whether a paper is good using four criteria; a\n", "reviewer should comment on all of these, if possible:\n", "\n", "- Quality\n", "\n", " Is the paper technically sound? Are claims well-supported by\n", " theoretical analysis or experimental results? Is this a complete\n", " piece of work, or merely a position paper? Are the authors careful\n", " (and honest) about evaluating both the strengths and weaknesses of\n", " the work?\n", "\n", "- Clarity\n", "\n", " Is the paper clearly written? Is it well-organized? (If not, feel\n", " free to make suggestions to improve the manuscript.) Does it\n", " adequately inform the reader? (A superbly written paper provides\n", " enough information for the expert reader to reproduce its results.)\n", "\n", "- Originality\n", "\n", " Are the problems or approaches new? Is this a novel combination of\n", " familiar techniques? Is it clear how this work differs from previous\n", " contributions? Is related work adequately referenced? We recommend\n", " that you check the proceedings of recent NIPS conferences to make\n", " sure that each paper is significantly different from papers in\n", " previous proceedings. Abstracts and links to many of the previous\n", " NIPS papers are available from http://books.nips.cc\n", "\n", "- Significance\n", "\n", "Are the results important? Are other people (practitioners or\n", "researchers) likely to use these ideas or build on them? Does the paper\n", "address a difficult problem in a better way than previous research? Does\n", "it advance the state of the art in a demonstrable way? Does it provide\n", "unique data, unique conclusions on existing data, or a unique\n", "theoretical or pragmatic approach?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Speculation\n", "\n", "\\[edit\\]\n", "\n", "With the help of [Nicolo Fusi](http://nicolofusi.com/), [Charles\n", "Twardy](http://blog.scicast.org/tag/charles-twardy/) and the entire\n", "Scicast team we launched [a Scicast\n", "question](https://scicast.org/#!/questions/1083/trades/create/power) a\n", "week before the results were revealed. The comment thread for that\n", "question already had [an amount of interesting\n", "comment](https://scicast.org/#!/questions/1083/comments/power) before\n", "the conference. Just for informational purposes before we began\n", "reviewing Corinna forecast this figure would be 25% and I forecast it\n", "would be 20%. The box plot summary of predictions from Scicast is below.\n", "\n", "\n", "\n", "Figure: Summary forecast from those that responded to a scicast\n", "question about how consistent the decision making was." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## NeurIPS Experiment Results\n", "\n", "\\[edit\\]\n", "\n", "The results of the experiment were as follows. From 170 papers 4 had to\n", "be withdrawn or were rejected without completing the review process, for\n", "the remainder, the ‘confusion matrix’ for the two committee’s decisions\n", "is in Table .\n", "\n", "Table: Table showing the results from the two committees as a confusion\n", "matrix. Four papers were rejected or withdrawn without review.\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "\n", "Committee 1\n", "\n", "
\n", "\n", "\n", "Accept\n", "\n", "\n", "\n", "Reject\n", "\n", "
\n", "\n", "Committee 2\n", "\n", "\n", "\n", "Accept\n", "\n", "\n", "\n", "22\n", "\n", "\n", "\n", "22\n", "\n", "
\n", "\n", "Reject\n", "\n", "\n", "\n", "21\n", "\n", "\n", "\n", "101\n", "\n", "
\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summarizing the Table\n", "\n", "There are a few ways of summarizing the numbers in this table as percent\n", "or probabilities. First, the inconsistency, the proportion of decisions\n", "that were not the same across the two committees. The decisions were\n", "inconsistent for 43 out of 166 papers or 0.259 as a proportion. This\n", "number is perhaps a natural way of summarizing the figures if you are\n", "submitting your paper and wish to know an estimate of what the\n", "probability is that your paper would have different decisions according\n", "to the different committees. Secondly, the accept precision: if you are\n", "attending the conference and looking at any given paper, then you might\n", "want to know the probability that the paper would have been rejected in\n", "an independent rerunning of the conference. We can estimate this for\n", "Committee 1’s conference as 22/(22 + 22) = 0.5 (50%) and for Committee\n", "2’s conference as 21/(22+21) = 0.49 (49%). Averaging the two estimates\n", "gives us 49.5%. Finally, the reject precision: if your paper was\n", "rejected from the conference, you might like an estimate of the\n", "probability that the same paper would be rejected again if the review\n", "process had been independently rerun. That estimate is 101/(22+101) =\n", "0.82 (82%) for Committee 1 and 101/(21+101)=0.83 (83%) for Committee 2,\n", "or on average 82.5%. A final quality estimate might be the ratio of\n", "consistent accepts to consistent rejects, or the agreed accept rate,\n", "22/123 = 0.18 (18%).\n", "\n", "- *inconsistency*: 43/166 = **0.259**\n", " - proportion of decisions that were not the same\n", "- *accept precision* $0.5 \\times 22/44$ + $0.5 \\times 21/43$ =\n", " **0.495**\n", " - probability any accepted paper would be rejected in a rerunning\n", "- *reject precision* = $0.5\\times 101/(22+101)$ +\n", " $0.5\\times 101/(21 + 101)$ = **0.175**\n", " - probability any rejected paper would be rejected in a rerunning\n", "- *agreed accept rate* = 22/101 = **0.218**\n", "- ratio between agreed accepted papers and agreed rejected papers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reaction After Experiment\n", "\n", "\\[edit\\]\n", "\n", "There seems to have been a lot of discussion of the result, both at the\n", "conference and on bulletin boards since. Such discussion is to be\n", "encouraged, and for ease of memory, it is worth pointing out that the\n", "approximate proportions of papers in each category can be nicely divided\n", "in to eighths as follows. Accept-Accept 1 in 8 papers, Accept-Reject 3\n", "in 8 papers, Reject-Reject, 5 in 8 papers. This makes the statistics\n", "we’ve computed above: inconsistency 1 in 4 (25%) accept precision 1 in 2\n", "(50%) reject precision 5 in 6 (83%) and agreed accept rate of 1 in 6\n", "(20%). 
This compares with the accept rate of 1 in 4.\n", "\n", "- Public reaction after experiment [documented\n", " here](http://inverseprobability.com/2015/01/16/blogs-on-the-nips-experiment/)\n", "\n", "- [Open Data\n", " Science](http://inverseprobability.com/2014/07/01/open-data-science/)\n", " (see Heidelberg Meeting)\n", "\n", "- NIPS was run in a very open way.\n", " [Code](https://github.com/sods/conference) and [blog\n", " posts](http://inverseprobability.com/2014/12/16/the-nips-experiment/)\n", " all available!\n", "\n", "- Reaction triggered by [this blog\n", " post](http://blog.mrtz.org/2014/12/15/the-nips-experiment.html).\n", "\n", "Much of the discussion speculates on the number of consistent accepts in\n", "the process (using the main conference accept rate as a proxy). It\n", "therefore produces numbers that don’t match ours above. This is because\n", "the computed accept rate of the individual committees is different from\n", "that of the main conference. This could be due to a bias for the\n", "duplicated papers, or statistical sampling error. We look at these\n", "questions below. First, to get the reader primed for thinking about\n", "these numbers we discuss some context for placing these numbers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A Random Committee @ 25%\n", "\n", "\\[edit\\]\n", "\n", "The first context we can place around the numbers is what would have\n", "happened at the ‘Random Conference’ where we simply accept a quarter of\n", "papers at random. In this NIPS the expected numbers of accepts would\n", "then have been given as in Table .\n", "\n", "Table: Table shows the expected values for the confusion matrix if the\n", "committee was making decisions totally at random.\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "\n", "Committee 1\n", "\n", "
\n", "\n", "\n", "Accept\n", "\n", "\n", "\n", "Reject\n", "\n", "
\n", "\n", "Committee 2\n", "\n", "\n", "\n", "Accept\n", "\n", "\n", "\n", "10.4 (1 in 16)\n", "\n", "\n", "\n", "31.1 (3 in 16)\n", "\n", "
\n", "\n", "Reject\n", "\n", "\n", "\n", "31.1 (3 in 16)\n", "\n", "\n", "\n", "93.4 (9 in 16)\n", "\n", "
\n", "\n", "\n", "\n", "And for this set up we would expect *inconsistency* of 3 in 8 (37.5%)\n", "*accept precision* of 1 in 4 (25%) and a *reject precision* of 3 in 4\n", "(75%) and a *agreed accept rate* of 1 in 10 (10%). The actual committee\n", "made improvements on these numbers, the accept precision was markedly\n", "better with 50%: twice as many consistent accept decisions were made\n", "than would be expected if the process had been performed at random and\n", "only around two thirds as many inconsistent decisions were made as would\n", "have been expected if decisions were made at random. However, we should\n", "treat all these figures with some skepticism until we’ve performed some\n", "estimate of the uncertainty associated with them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stats for Random Committee\n", "\n", "- For random committee we expect:\n", " - *inconsistency* of 3 in 8 (37.5%)\n", " - *accept precision* of 1 in 4 (25%)\n", " - *reject precision* of 3 in 4 (75%) and a\n", " - *agreed accept rate* of 1 in 10 (10%).\n", "\n", "Actual committee’s accept precision markedly better with 50% accept\n", "precision." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Uncertainty: Accept Rate\n", "\n", "To get a handle on the uncertainty around these numbers we’ll start by\n", "making use of the\n", "binomial\n", "distribution. First, let’s explore the fact that for the overall\n", "conference the accept rate was around 23%, but for the duplication\n", "committees the accept rate was around 25%. If we assume decisions are\n", "made according to a binomial distribution, then is the accept rate for\n", "the duplicated papers too high?\n", "\n", "Note that for all our accept probability statistics we used as a\n", "denominator the number of papers that were initially sent for review,\n", "rather than the number where a final decision was made by the program\n", "committee. These numbers are different because some papers are withdrawn\n", "before the program committee makes its decision. Most commonly this\n", "occurs after authors have seen their preliminary reviews: for NIPS 2014\n", "we provided preliminary reviews that included paper scores. So for the\n", "official accept probability we use the 170 as denominator. The accept\n", "probabilities were therefore 43 out of 170 papers (25.3%) for Committee\n", "1 and 44 out of 170 (25.8%) for Committee 2. This compares with the\n", "overall conference accept rate for papers outside the duplication\n", "process of 349 out of 1508 (23.1%).\n", "\n", "If the true underlying probability of an accept were 0.23, independent\n", "of the paper, then the probability of generating accepts for any subset\n", "of the papers would be given by a binomial distribution. Combining\n", "across the two committees for the duplicated papers, we see that 87\n", "papers in total were recommended for accept out of a total of 340\n", "trials. out of 166 trials would be given by a binomial distribution as\n", "depicted below." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from scipy.stats import binom\n", "from IPython.display import HTML" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import cmtutils.plot as plot\n", "import mlai as ma" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rv = binom(340, 0.23)\n", "x = np.arange(60, 120)\n", "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "ax.bar(x, rv.pmf(x))\n", "display(HTML('
Number of Accepted Papers for p = 0.23
'))\n", "ax.axvline(87,linewidth=4, color='red')\n", "ma.write_figure(filename=\"uncertainty-accept-rate.svg\", directory=\"./neurips\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Number of accepted papers for $p=0.23$.\n", "\n", "From the plot, we can see that whilst the accept rate was slightly\n", "higher for the duplicated papers, the difference does not appear to be\n", "statistically significant: the observed count falls well within the\n", "probability mass of the binomial.\n", "\n", "Note that Area Chairs knew which papers were duplicates, whereas\n", "reviewers did not. Whilst we stipulated that duplicate papers should not\n", "be given any special treatment, we cannot discount the possibility that\n", "Area Chairs may have given slightly preferential treatment to duplicate\n", "papers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Uncertainty: Accept Precision\n", "\n", "For the accept precision, if we assume that accept decisions were drawn\n", "according to a binomial, then the distribution for consistent accepts is\n", "also binomial. Our best estimate of its parameter is 22/166 = 0.13\n", "(13%). If we had a binomial distribution with these parameters, then the\n", "distribution of consistent accepts would be as follows.\n", "\n", "- How reliable is the consistent accept score?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rv = binom(166, 0.13)\n", "x = np.arange(10, 30)\n", "fig, ax = plt.subplots(figsize=(10,5))\n", "ax.bar(x, rv.pmf(x))\n", "display(HTML('
Number of Consistent Accepts given p=0.13
'))\n", "ax.axvline(22,linewidth=4, color='red') \n", "ma.write_figure(filename=\"uncertainty-accept-precision.svg\", directory=\"./neurips\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Number of consistent accepts given $p=0.13$.\n", "\n", "We see immediately that there is a lot of uncertainty around this\n", "number, for the scale of the experiment as we have it. This suggests a\n", "more complex analysis is required to extract our estimates with\n", "uncertainty." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bayesian Analysis\n", "\n", "Before we start the analysis, it’s important to make some statements\n", "about the aims of our modelling here. We will make some simplifying\n", "modelling assumptions for the sake of a model that is understandable. We\n", "are looking to get a handle on the uncertainty associated with some of\n", "the probabilities associated with the NIPS experiment. [Some preliminary\n", "analyses have already been conducted on\n", "blogs](http://inverseprobability.com/2015/01/16/blogs-on-the-nips-experiment/).\n", "Those analyses don’t have access to information like paper scores etc.\n", "For that reason we also leave out such information in this preliminary\n", "analysis. We will focus only on the summary results from the experiment:\n", "how many papers were consistently accepted, consistently rejected, or\n", "had inconsistent decisions. For the moment we disregard the information\n", "we have about paper scores.\n", "\n", "In our analysis there are three possible outcomes for each paper:\n", "consistent accept, inconsistent decision and consistent reject. So, we\n", "need to perform the analysis with the [multinomial\n", "distribution](http://en.wikipedia.org/wiki/Multinomial_distribution).\n", "The multinomial is parameterized by the probabilities of the different\n", "outcomes. These are our parameters of interest; we would like to\n", "estimate these probabilities alongside their uncertainties. To make a\n", "Bayesian analysis we place a prior density over these probabilities,\n", "then we update the prior with the observed data, that gives us a\n", "posterior density, giving us an uncertainty associated with these\n", "probabilities.\n", "\n", "### Prior Density\n", "\n", "Choice of prior for the multinomial is typically straightforward, the\n", "[Dirichlet density](http://en.wikipedia.org/wiki/Dirichlet_distribution)\n", "is [conjugate](http://en.wikipedia.org/wiki/Conjugate_prior) and has the\n", "additional advantage that its parameters can be set to ensure it is\n", "*uninformative*, i.e. uniform across the domain of the prior.\n", "Combination of a multinomial likelihood and a Dirichlet prior is not\n", "new, and in this domain if we were to consider the mean the posterior\n", "density only, then the approach is known as [Laplace\n", "smoothing](http://en.wikipedia.org/wiki/Additive_smoothing).\n", "\n", "For our model we are assuming for our prior that the probabilities are\n", "drawn from a Dirichlet as follows, $$\n", "p \\sim \\text{Dir}(\\alpha_1, \\alpha_2, \\alpha_3),\n", "$$ with $\\alpha_1=\\alpha_2=\\alpha_3=1$. The Dirichlet density is\n", "conjugate to the [multinomial\n", "distribution](http://en.wikipedia.org/wiki/Multinomial_distribution),\n", "and we associate three different outcomes with the multinomial. For each\n", "of the 166 papers we expect to have a consistent accept (outcome 1), an\n", "inconsistent decision (outcome 2) or a consistent reject (outcome 3). 
If\n", "the counts for outcomes 1, 2 and 3 are represented by $k_1$, $k_2$ and\n", "$k_3$ and the associated probabilities are given by $p_1$, $p_2$ and\n", "$p_3$ then our model is $$\n", "\\mathbf{k}|\\mathbf{p} \\sim \\text{Mult}(n, \\mathbf{p}).\n", "$$ Due to the conjugacy the posterior is tractable\n", "and easily computed as a Dirichlet (see e.g. [Gelman et\n", "al](http://www.stat.columbia.edu/~gelman/book/)), where the parameters\n", "of the Dirichlet are given by the original vector from the Dirichlet\n", "prior plus the counts associated with each outcome. $$\n", "\\mathbf{p}|\\mathbf{k}, \\boldsymbol{\\alpha} \\sim \\text{Dir}(\\boldsymbol{\\alpha} + \\mathbf{k})\n", "$$ The mean probability for each outcome is then given by $$\n", "\\bar{p}_i = \\frac{\\alpha_i+k_i}{\\sum_{j=1}^3(\\alpha_j + k_j)}.\n", "$$ and the variance is $$\n", "\\mathrm{Var}[p_i] = \\frac{(\\alpha_i+k_i) (\\alpha_0-\\alpha_i + n + k_i)}{(\\alpha_0+n)^2 (\\alpha_0+n+1)},\n", "$$ where $n$ is the number of trials (166 in our case) and\n", "$\\alpha_0 = \\sum_{i=1}^3\\alpha_i$. This allows us to compute the\n", "expected value of the probabilities and their variances under the\n", "posterior as follows." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def posterior_mean_var(k, alpha):\n", " \"\"\"Compute the mean and variance of the Dirichlet posterior.\"\"\"\n", " alpha_0 = alpha.sum()\n", " n = k.sum()\n", " m = (k + alpha)\n", " m /= m.sum()\n", " v = (alpha+k)*(alpha_0 - alpha + n + k)/((alpha_0+n)**2*(alpha_0+n+1))\n", " return m, v\n", "\n", "k = np.asarray([22, 43, 101])\n", "alpha = np.ones((3,))\n", "m, v = posterior_mean_var(k, alpha)\n", "outcome = ['consistent accept', 'inconsistent decision', 'consistent reject']\n", "for i in range(3):\n", " display(HTML(\"
Probability of \" + outcome[i] +' ' + str(m[i]) + \"+/-\" + str(2*np.sqrt(v[i])) + \"
\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we have a probability of consistent accept as $0.136 \\pm 0.06$, the\n", "probability of inconsistent decision as $0.260 \\pm 0.09$ and probability\n", "of consistent reject as $0.60 \\pm 0.15$. Recall that if we’d selected\n", "papers at random (with accept rate of 1 in 4) then these values would\n", "have been 1 in 16 (0.0625), 3 in 8 (0.375) and 9 in 16 (0.5625).\n", "\n", "The other values we are interested in are the accept precision, reject\n", "precision and the agreed accept rate. Computing the probability density\n", "for these statistics is complex: it involves [Ratio\n", "Distributions](http://en.wikipedia.org/wiki/Ratio_distribution).\n", "However, we can use Monte Carlo to estimate the expected accept\n", "precision, reject precision, and agreed accept rate as well as their\n", "variances. We can use these results to give us error bars and histograms\n", "of these statistics." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def sample_precisions(k, alpha, num_samps):\n", " \"\"\"Helper function to sample from the posterior distibution of accept, \n", " reject and inconsistent probabilities and compute other statistics of interest \n", " from the samples.\"\"\"\n", "\n", " k = np.random.dirichlet(k+alpha, size=num_samps)\n", " # Factors of 2 appear because inconsistent decisions \n", " # are being accounted for across both committees.\n", " ap = 2*k[:, 0]/(2*k[:, 0]+k[:, 1])\n", " rp = 2*k[:, 2]/(k[:, 1]+2*k[:, 2])\n", " aa = k[:, 0]/(k[:, 0]+k[:, 2])\n", " return ap, rp, aa\n", "\n", "ap, rp, aa = sample_precisions(k, alpha, 10000)\n", "print(ap.mean(), '+/-', 2*np.sqrt(ap.var()))\n", "print(rp.mean(), '+/-', 2*np.sqrt(rp.var()))\n", "print(aa.mean(), '+/-', 2*np.sqrt(aa.var()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Giving an accept precision of $0.51 \\pm 0.13$, a reject precision of\n", "$0.82 \\pm 0.05$ and an agreed accept rate of $0.18 \\pm 0.07$. Note that\n", "the ‘random conference’ values of 1 in 4 for accept precision and 3 in 4\n", "for reject decisions are outside the two standard deviation error bars.\n", "If it is preferred medians and percentiles could also be computed from\n", "the samples above, but as we will see when we histogram the results the\n", "densities look broadly symmetric, so this is unlikely to have much\n", "effect." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Histogram of Monte Carlo Results\n", "\n", "Just to ensure that the error bars are reflective of the underlying\n", "densities we histogram the Monte Carlo results for accept precision,\n", "reject precision and agreed accept below. 
Shown on each histogram is a\n", "line representing the result we would get for the ‘random committee.’" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(1, 3, figsize=(15, 5))\n", "_ = ax[0].hist(ap, 20)\n", "_ = ax[0].set_title('Accept Precision')\n", "ax[0].axvline(0.25, linewidth=4, color=\"r\")\n", "_ = ax[1].hist(rp, 20)\n", "_ = ax[1].set_title('Reject Precision')\n", "ax[1].axvline(0.75, linewidth=4, color=\"r\")\n", "_ = ax[2].hist(aa, 20)\n", "_ = ax[2].set_title('Agreed Accept Rate')\n", "_ = ax[2].axvline(0.10, linewidth=4, color=\"r\")\n", "ma.write_figure(filename=\"random-committee-outcomes-vs-true.svg\", directory=\"./neurips\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Different statistics for the random committee outcomes versus\n", "the observed committee outcomes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model Choice and Prior Values\n", "\n", "In the analysis above we’ve minimized the modeling choices: we made use\n", "of a Bayesian analysis to capture the uncertainty in the counts that\n", "arises from statistical sampling error. To this end we chose an\n", "uninformative prior over these probabilities. However, one might argue\n", "that the prior should reflect something more about the underlying\n", "experimental structure: for example, we *know* that if the committees\n", "made their decisions independently it is unlikely that we’d obtain an\n", "inconsistency figure much greater than 37.5% because that would require\n", "committees to explicitly collude to make inconsistent decisions: the\n", "random conference is the worst case. Due to the accept rate, we also\n", "expect a larger number of reject decisions than accept decisions. This\n", "also isn’t captured in our prior. Such questions move us into the realms\n", "of modeling the process, rather than performing a sensitivity analysis.\n", "However, if we wish to model the decision process as a whole, we have a\n", "lot more information available, and we should make use of it. The\n", "analysis above is intended to exploit our randomized experiment to\n", "explore how inconsistent we expect two committees to be. It focusses on\n", "that single question; it doesn’t attempt to give answers on what the\n", "reasons for that inconsistency are and how it may be reduced. The\n", "additional maths was needed only to give a sense of the uncertainty in\n", "the figures. That uncertainty arises due to the limited number of papers\n", "in the experiment.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reviewer Calibration\n", "\n", "\\[edit\\]\n", "\n", "Calibration of reviewers is the process where different interpretations\n", "of the reviewing scale are addressed. The tradition of calibration goes\n", "at least as far back as John Platt’s Program Chairing, and included a\n", "Bayesian model by Ge, Welling and Ghahramani at NeurIPS 2013." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reviewer Calibration Model\n", "\n", "\\[edit\\]\n", "\n", "In this notebook we deal with reviewer calibration. Our assumption is\n", "that the score from the $j$th reviewer for the $i$th paper is given by $$\n", "y_{i,j} = f_i + b_j + \\epsilon_{i, j}\n", "$$ where $f_i$ is the ‘objective quality’ of paper $i$ and $b_j$ is an\n", "offset associated with reviewer $j$. 
$\\epsilon_{i,j}$ is a subjective\n", "quality estimate which reflects how a specific reviewer’s opinion\n", "differs from other reviewers (such differences in opinion may be due to\n", "differing expertise or perspective). The underlying ‘objective quality’\n", "of the paper is assumed to be the same for all reviewers and the\n", "reviewer offset is assumed to be the same for all papers.\n", "\n", "If we have $n$ papers and $m$ reviewers, then this implies $n$ + $m$ +\n", "$nm$ values need to be estimated. Naturally this is too many, and we can\n", "start by assuming that the subjective quality is drawn from a normal\n", "density with variance $\\sigma^2$ $$\n", "\\epsilon_{i, j} \\sim N(0, \\sigma^2 \\mathbf{I})\n", "$$ which reduces us to $n$ + $m$ + 1 parameters. Further we can assume\n", "that the objective quality is also normally distributed with mean $\\mu$\n", "and variance $\\alpha_f$, $$\n", "f_i \\sim N(\\mu, \\alpha_f)\n", "$$ this now reduces us to $m$+3 parameters. However, we only have\n", "approximately $4m$ observations (4 papers per reviewer) so parameters\n", "may still not be that well determined (particularly for those reviewers\n", "that have only one review). We, therefore, finally, assume that reviewer\n", "offset is normally distributed with zero mean, $$\n", "b_j \\sim N(0, \\alpha_b),\n", "$$ leaving us only four parameters: $\\mu$, $\\sigma^2$, $\\alpha_f$ and\n", "$\\alpha_b$. Combined together these three assumptions imply that $$\n", "\\mathbf{y} \\sim N(\\mu \\mathbf{1}, \\mathbf{K}),\n", "$$ where $\\mathbf{y}$ is a vector of stacked scores $\\mathbf{1}$ is the\n", "vector of ones and the elements of the covariance function are given by\n", "$$\n", "k(i,j; k,l) = \\delta_{i,k} \\alpha_f + \\delta_{j,l} \\alpha_b + \\delta_{i, k}\\delta_{j,l} \\sigma^2,\n", "$$ where $i$ and $j$ are the index of first paper and reviewer and $k$\n", "and $l$ are the index of second paper and reviewer. The mean is easily\n", "estimated by maximum likelihood and is given as the mean of all scores.\n", "\n", "It is convenient to reparametrize slightly into an overall scale\n", "$\\alpha_f$, and normalized variance parameters, $$\n", "k(i,j; k,l) = \\alpha_f\\left(\\delta_{i,k} + \\delta_{j,l} \\frac{\\alpha_b}{\\alpha_f} + \\delta_{i, k}\\delta_{j,l} \\frac{\\sigma^2}{\\alpha_f}\\right)\n", "$$ which we rewrite to give two ratios: offset/signal ratio,\n", "$\\hat{\\alpha}_b$ and noise/signal $\\hat{\\sigma}^2$ ratio. $$\n", "k(i,j; k,l) = \\alpha_f\\left(\\delta_{i,k} + \\delta_{j,l} \\hat{\\alpha}_b + \\delta_{i, k}\\delta_{j,l} \\hat{\\sigma}^2\\right)\n", "$$ The advantage of this parameterization is it allows us to optimize\n", "$\\alpha_f$ directly (with a fixed-point equation) and it will be very\n", "well determined. This leaves us with two free parameters, that we can\n", "explore on the grid. It is in these parameters that we expect the\n", "remaining underdetermindness of the model. We expect $\\alpha_f$ to be\n", "well determined because the negative log likelihood is now $$\n", "\\frac{|\\mathbf{y}|}{2}\\log\\alpha_f + \\frac{1}{2}\\log \\left|\\hat{\\mathbf{K}}\\right| + \\frac{1}{2\\alpha_f}\\mathbf{y}^\\top \\hat{\\mathbf{K}}^{-1} \\mathbf{y},\n", "$$ where $|\\mathbf{y}|$ is the length of $\\mathbf{y}$ (i.e. the number\n", "of reviews) and $\\hat{\\mathbf{K}}=\\alpha_f^{-1}\\mathbf{K}$ is the scale\n", "normalized covariance. 
This negative log likelihood is easily minimized\n", "to recover $$\n", "\\alpha_f = \\frac{1}{|\\mathbf{y}|} \\mathbf{y}^\\top \\hat{\\mathbf{K}}^{-1} \\mathbf{y}.\n", "$$ A Bayesian analysis of this parameter is possible with gamma priors,\n", "but it would merely show that this parameter is extremely well\n", "determined (the degrees of freedom parameter of the associated\n", "Student-$t$ marginal likelihood scales will the number of reviews, which\n", "will be around $|\\mathbf{y}| \\approx 6,000$ in our case.\n", "\n", "So, we propose to proceed as follows. Set the mean from the reviews\n", "($\\mu$) and then choose a two-dimensional grid of parameters for\n", "reviewer offset and diversity. For each parameter choice, optimize to\n", "find $\\alpha_f$ and then evaluate the liklihood. Worst case this will\n", "require us inverting $\\hat{\\mathbf{K}}$, but if the reviewer paper\n", "groups are disconnected, it can be done a lot quicker. Next stage is to\n", "load in the reviews for analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fitting the Model\n", "\n", "\\[edit\\]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import cmtutils as cu\n", "import os\n", "import pandas as pd\n", "import numpy as np\n", "import GPy\n", "from scipy.sparse.csgraph import connected_components\n", "from scipy.linalg import solve_triangular " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "date = '2014-09-06'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading in the Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "filename = date + '_reviews.xls'\n", "reviews = cu.CMT_Reviews_read(filename=filename)\n", "papers = list(sorted(set(reviews.reviews.index), key=int))\n", "reviews.reviews = reviews.reviews.loc[papers]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The maximum likelihood solution for $\\mu$ is simply the mean quality of\n", "the papers, this is easily computed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mu = reviews.reviews.Quality.mean()\n", "print(\"Mean value, mu = \", mu)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preparation\n", "\n", "We take the reviews, which are indexed by the paper number, and create a\n", "new data frame, that indexes by paper id and email combined. From these\n", "reviews we tokenize the `PaperID` and the `Email` to extract two\n", "matrices that can be used in creation of covariance matrices. We also\n", "create a target vector which is the mean centred vector of scores." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "r = reviews.reviews.reset_index()\n", "r.rename(columns={'ID':'PaperID'}, inplace=True)\n", "r.index = r.PaperID + '_' + r.Email\n", "X1 = pd.get_dummies(r.PaperID)\n", "X1 = X1[sorted(X1.columns, key=int)]\n", "X2 = pd.get_dummies(r.Email)\n", "X2 = X2[sorted(X2.columns, key=str.lower)]\n", "y = reviews.reviews.Quality - mu" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Constructing the Model in GPy\n", "\n", "Having reduced the model to two parameters, I was hopeful I could set\n", "parameters broadly by hand. My initial expectation was that `alpha_b`\n", "and `sigma2` would both be less than 1, but some playing with parameters\n", "showed this wasn’t the case. 
Rather than waste further time, I decided\n", "to use our [`GPy` Software](https://github.com/SheffieldML/GPy) (see\n", "below) to find a maximum likelihood solution for the parameters.\n", "\n", "Model construction firstly involves constructing covariance functions\n", "for the model and concatenating `X1` and `X2` to a new input matrix `X`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = X1.join(X2)\n", "kern1 = GPy.kern.Linear(input_dim=len(X1.columns), active_dims=np.arange(len(X1.columns)))\n", "kern1.name = 'K_f'\n", "kern2 = GPy.kern.Linear(input_dim=len(X2.columns), active_dims=np.arange(len(X1.columns), len(X.columns)))\n", "kern2.name = 'K_b'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, the covariance function is used to create a Gaussian process\n", "regression model with `X` as input and `y` as target. The covariance\n", "function is given by $\\mathbf{K}_f + \\mathbf{K}_b$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = GPy.models.GPRegression(X, y.to_numpy()[:, np.newaxis], kern1+kern2)\n", "model.optimize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can check the parameters of the result." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(model)\n", "print(model.log_likelihood())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Name : GP regression\n", " Objective : 10071.679092815619\n", " Number of Parameters : 3\n", " Number of Optimization Parameters : 3\n", " Updates : True\n", " Parameters:\n", " GP_regression. | value | constraints | priors\n", " sum.K_f.variances | 1.2782303448777643 | +ve | \n", " sum.K_b.variances | 0.2400098787580176 | +ve | \n", " Gaussian_noise.variance | 1.2683656892796749 | +ve | \n", " -10071.679092815619" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Construct the Model Without GPy\n", "\n", "The answer from the GPy solution is introduced here, alongside the code\n", "where the covariance matrices are explicitly created (above they are\n", "created using GPy’s high level code for kernel matrices, which may be\n", "less clear on the details)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# set parameter values to ML solutions given by GPy.\n", "alpha_f = model.sum.K_f.variances\n", "alpha_b = model.sum.K_b.variances/alpha_f\n", "sigma2 = model.Gaussian_noise.variance/alpha_f" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we create the covariance functions based on the tokenized paper IDs\n", "and emails." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "K_f = np.dot(X1, X1.T)\n", "K_b = alpha_b*np.dot(X2, X2.T)\n", "K = K_f + K_b + sigma2*np.eye(X2.shape[0])\n", "Kinv, L, Li, logdet = GPy.util.linalg.pdinv(K) # since we have GPy loaded in use their positive definite inverse.\n", "y = reviews.reviews.Quality - mu\n", "alpha = np.dot(Kinv, y)\n", "yTKinvy = np.dot(y, alpha)\n", "alpha_f = yTKinvy/len(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we have removed the data mean, the log likelihood we are\n", "interested in is the likelihood of a multivariate Gaussian with\n", "covariance $\\mathbf{K}$ and mean zero. This is computed below." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ll = 0.5*len(y)*np.log(2*np.pi*alpha_f) + 0.5*logdet + 0.5*yTKinvy/alpha_f \n", "print(\"negative log likelihood: \", ll)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Review Quality Prediction\n", "\n", "\\[edit\\]\n", "\n", "Now we wish to predict the bias-corrected scores for the papers. That\n", "involves considering a variable $s_{i,j} = f_i + e_{i,j}$ which is the\n", "score with the bias removed. That variable has covariance matrix\n", "$\\mathbf{K}_s=\\mathbf{K}_f + \\sigma^2 \\mathbf{I}$, and the cross covariance\n", "between $\\mathbf{y}$ and $\\mathbf{s}$ is also given by $\\mathbf{K}_s$.\n", "This means we can compute the posterior distribution of the scores as\n", "follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compute mean and covariance of quality scores\n", "K_s = K_f + np.eye(K_f.shape[0])*sigma2\n", "s = pd.Series(np.dot(K_s, alpha) + mu, index=X1.index)\n", "covs = alpha_f*(K_s - np.dot(K_s, np.dot(Kinv, K_s)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Monte Carlo Simulations for Probability of Acceptance\n", "\n", "\\[edit\\]\n", "\n", "We can now sample from this posterior distribution of bias-adjusted\n", "scores jointly, to get a set of scores for all papers. For this set of\n", "scores, we can perform a ranking and accept the top 400 papers. This\n", "gives us a sampled conference. If we do that 1,000 times then we can see\n", "how many times each paper was accepted to get a probability of\n", "acceptance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "number_accepts = 420 # more than 400 to allow for the 10% duplication of papers" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# place this in a separate box, because sampling can take a while.\n", "samples = 1000\n", "score = np.random.multivariate_normal(mean=s, cov=covs, size=samples).T\n", "# Use X1 which maps papers to paper/reviewer pairings to get the average score for each paper.\n", "paper_score = pd.DataFrame(np.dot(np.diag(1./X1.sum(0)), np.dot(X1.T, score)), index=X1.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can compute the probability of acceptance for each of the sampled\n", "rankings." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prob_accept = ((paper_score>paper_score.quantile(1-(float(number_accepts)/paper_score.shape[0]))).sum(1)/1000)\n", "prob_accept.name = 'AcceptProbability'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have the accept probabilities, we can decide on the boundaries\n", "of the grey area. These are set in `lower` and `upper`. The grey area is\n", "those papers that will be debated most heavily during the\n", "teleconferences between program chairs and area chairs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lower=0.1\n", "upper=0.9\n", "grey_area = ((prob_accept>lower) & (prob_accept<upper))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Histogram of the probability of accept as estimated by the\n", "Monte Carlo simulation across all papers submitted to NeurIPS 2014." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Some Sanity Checking Plots\n", "\n", "\\[edit\\]\n", "\n", "Here is the histogram of the reviewer scores after calibration."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "s.hist(bins=100, ax=ax)\n", "_ = ax.set_title('Calibrated Reviewer Scores')\n", "ma.write_figure(directory=\"./neurips\", filename=\"calibrated-reviewer-scores.svg\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Histogram of updated reviewer scores after the calibration\n", "process is applied." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adjustments to Reviewer Scores\n", "\n", "We can also compute the posterior distribution for the adjustments to\n", "the reviewer scores." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compute mean and covariance of review biases\n", "b = pd.Series(np.dot(K_b, alpha), index=X2.index)\n", "covb = alpha_f*(K_b - np.dot(K_b, np.dot(Kinv, K_b)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "reviewer_bias = pd.Series(np.dot(np.diag(1./X2.sum(0)), np.dot(X2.T, b)), index=X2.columns, name='ReviewerBiasMean')\n", "reviewer_bias_std = pd.Series(np.dot(np.diag(1./X2.sum(0)), np.dot(X2.T, np.sqrt(np.diag(covb)))), index=X2.columns, name='ReviewerBiasStd')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a histogram of the mean adjustment for the reviewers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "reviewer_bias.hist(bins=100, ax=ax)\n", "_ = ax.set_title('Reviewer Calibration Adjustments Histogram')\n", "ma.write_figure(directory=\"./neurips\", filename=\"reviewer-calibration-adjustments.svg\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Histogram of individual offsets associated with the reviewers\n", "as estimated by the model.\n", "\n", "Export a version of the bias scores for use in CMT." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bias_export = pd.DataFrame(data={'Quality Score - Does the paper deserves to be published?':reviewer_bias, \n", " 'Impact Score - Independently of the Quality Score above, this is your opportunity to identify papers that are very different, original, or otherwise potentially impactful for the NIPS community.':pd.Series(np.zeros(len(reviewer_bias)), index=reviewer_bias.index),\n", " 'Confidence':pd.Series(np.zeros(len(reviewer_bias)), index=reviewer_bias.index)})\n", "cols = bias_export.columns.tolist()\n", "cols = [cols[2], cols[1], cols[0]]\n", "bias_export = bias_export[cols]\n", "#bias_export.to_csv(os.path.join(cu.cmt_data_directory, 'reviewer_bias.csv'), sep='\\t', header=True, index_label='Reviewer Email')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sanity Check\n", "\n", "As a sanity check Corinna suggested it makes sense to plot the average\n", "raw score for the papers vs the probability of accept, just to ensure\n", "nothing weird is going on. To clarify the plot, I’ve actually plotted\n", "raw score vs log odds of accept." 
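, "\n", "\n", "The log odds for a paper with estimated acceptance probability $p$ are\n", "$\\log\\frac{p}{1-p}$. Because the probabilities are estimated from a finite\n", "number of Monte Carlo samples, some papers have $p$ of exactly 0 or 1, so in\n", "the cell below those values are first nudged slightly away from the\n", "boundaries before taking logs."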
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "raw_score = pd.Series(np.dot(np.diag(1./X1.sum(0)), np.dot(X1.T, r.Quality)), index=X1.columns)\n", "prob_accept[prob_accept==0] = 1/(10*samples)\n", "prob_accept[prob_accept==1] = 1-1/(10*samples)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "ax.plot(raw_score, np.log(prob_accept)- np.log(1-prob_accept), 'rx')\n", "ax.set_title('Raw Score vs Log odds of accept')\n", "ax.set_xlabel('raw score')\n", "_ = ax.set_ylabel('log odds of accept')\n", "ma.write_figure(directory=\"./neurips\", filename=\"raw-score-vs-log-odds.svg\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Scatter plot of the raw paper score against the log odds of\n", "paper acceptance, as estimated by Monte Carlo simulation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calibration Quality Sanity Checks" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s.name = 'CalibratedQuality'\n", "r = r.join(s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also look at a scatter plot of the review quality vs the\n", "calibrated quality." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import cmtutils.plot as plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "ax.plot(r.Quality, r.CalibratedQuality, 'r.', markersize=10)\n", "ax.set_xlim([0, 11])\n", "ax.set_xlabel('original review score')\n", "_ = ax.set_ylabel('calibrated review score')\n", "ma.write_figure(directory=\"./neurips\", filename=\"calibrated-review-score-vs-original-score.svg\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Scatter plot of the calibrated review scores against the\n", "original review scores." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Correlation of Duplicate Papers\n", "\n", "\\[edit\\]\n", "\n", "For NeurIPS 2014 we experimented with duplicate papers: we pushed papers\n", "through the system twice, exposing them to different subsets of the\n", "reviewers. The first thing we’ll look at is the duplicate papers.\n", "First, we identify them by matching on title." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "filename = date + '_paper_list.xls'\n", "papers = cu.CMT_Papers_read(filename=filename)\n", "duplicate_list = []\n", "for ID, title in papers.papers.Title.iteritems():\n", " if int(ID)>1779 and int(ID) != 1949:\n", " pair = list(papers.papers[papers.papers['Title'].str.contains(papers.papers.Title[ID].strip())].index)\n", " pair.sort(key=int)\n", " duplicate_list.append(pair)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we compute the correlation coefficients for the duplicated papers\n", "for the average impact and quality scores."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "quality = []\n", "calibrated_quality = []\n", "accept = []\n", "impact = []\n", "confidence = []\n", "for duplicate_pair in duplicate_list:\n", " quality.append([np.mean(r[r.PaperID==duplicate_pair[0]].Quality), np.mean(r[r.PaperID==duplicate_pair[1]].Quality)])\n", " calibrated_quality.append([np.mean(r[r.PaperID==duplicate_pair[0]].CalibratedQuality), np.mean(r[r.PaperID==duplicate_pair[1]].CalibratedQuality)])\n", " impact.append([np.mean(r[r.PaperID==duplicate_pair[0]].Impact), np.mean(r[r.PaperID==duplicate_pair[1]].Impact)])\n", " confidence.append([np.mean(r[r.PaperID==duplicate_pair[0]].Conf), np.mean(r[r.PaperID==duplicate_pair[1]].Conf)])\n", "quality = np.array(quality)\n", "calibrated_quality = np.array(calibrated_quality)\n", "impact = np.array(impact)\n", "confidence = np.array(confidence)\n", "quality_cor = np.corrcoef(quality.T)[0, 1]\n", "calibrated_quality_cor = np.corrcoef(calibrated_quality.T)[0, 1]\n", "impact_cor = np.corrcoef(impact.T)[0, 1]\n", "confidence_cor = np.corrcoef(confidence.T)[0, 1]\n", "print(\"Quality correlation: \", quality_cor)\n", "print(\"Calibrated Quality correlation: \", calibrated_quality_cor)\n", "print(\"Impact correlation: \", impact_cor)\n", "print(\"Confidence correlation: \", confidence_cor)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Quality correlation: 0.54403674862622\n", " Calibrated Quality correlation: 0.5455958618174274\n", " Impact correlation: 0.26945269236041036\n", " Confidence correlation: 0.3854251559444674" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Correlation Plots\n", "\n", "To visualize the quality score correlation, we plot the group 1 papers\n", "against the group 2 papers. Here we add a small amount of jitter to help\n", "visualize points that would otherwise fall on the same position." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_figsize)\n", "ax.plot(quality[:, 0]+np.random.randn(quality.shape[0])*0.06125, quality[:, 1]+np.random.randn(quality.shape[0])*0.06125, 'r.', markersize=10)\n", "lims = [1.5, 8.5]\n", "ax.set_xlim(lims)\n", "ax.set_ylim(lims)\n", "ax.plot(lims, lims, 'r-')\n", "_ = ax.set_title('Correlation: {cor:.2g}'.format(cor=quality_cor))\n", "ma.write_figure(directory=\"./neurips\",\n", " filename=\"quality-correlation.svg\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Correlation between reviewer scores across the duplicated\n", "committees (scores have jitter added to prevent too many points sitting\n", "on top of each other).\n", "\n", "Similarly for the calibrated quality of the papers."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_figsize)\n", "ax.plot(calibrated_quality[:, 0]+np.random.randn(calibrated_quality.shape[0])*0.06125, calibrated_quality[:, 1]+np.random.randn(calibrated_quality.shape[0])*0.06125, 'r.', markersize=10)\n", "lims = [1.5, 8.5]\n", "ax.set_xlim(lims)\n", "ax.set_ylim(lims)\n", "ax.plot(lims, lims, 'r-')\n", "_ = ax.set_title('Correlation: {cor:.2g}'.format(cor=calibrated_quality_cor))\n", "ma.write_figure(directory=\"./neurips\",\n", " filename=\"calibrated-quality-correlation.svg\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Correlation between calibrated reviewer scores across the two\n", "independent committees." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Apply Laplace smoothing to accept probabilities before incorporating them.\n", "revs = r.join((prob_accept+0.0002)/1.001, on='PaperID').join(reviewer_bias, on='Email').join(papers.papers['Number Of Discussions'], on='PaperID').join(reviewer_bias_std, on='Email').sort_values(by=['AcceptProbability','PaperID', 'CalibratedQuality'], ascending=False)\n", "revs.set_index(['PaperID'], inplace=True)\n", "def len_comments(x):\n", " return len(x.Comments)\n", "revs['comment_length']=revs.apply(len_comments, axis=1)\n", "# Save the computed information to disk\n", "#revs.to_csv(os.path.join(cu.cmt_data_directory, date + '_processed_reviews.csv'), encoding='utf-8')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conference Simulation\n", "\n", "\\[edit\\]\n", "\n", "Given the realization that roughly 50% of the score seems to be\n", "‘subjective’ and 50% seems to be ‘objective,’ we can simulate the\n", "conference and see what this implies for the consistency of accepts at\n", "different accept rates."
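, "\n", "Under this model, with subjectivity proportion $s$, the scores the two\n", "committees assign to the same paper share the objective component but have\n", "independent subjective components, so their correlation is\n", "$(1-s)^2/((1-s)^2 + s^2)$, which equals $0.5$ at $s=0.5$, roughly in line with\n", "the quality-score correlation of about 0.54 observed above."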
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "samples = 100000\n", "subjectivity_portion = 0.5" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "accept_rates = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0]\n", "consistent_accepts = []\n", "for accept_rate in accept_rates:\n", "    score_1 = []\n", "    score_2 = []\n", "    for i in range(samples):\n", "        # The two committees share the objective component of each paper's score\n", "        # but draw independent subjective components.\n", "        objective = (1-subjectivity_portion)*np.random.randn()\n", "        score_1.append(objective + subjectivity_portion*np.random.randn())\n", "        score_2.append(objective + subjectivity_portion*np.random.randn())\n", "\n", "    score_1 = np.asarray(score_1)\n", "    score_2 = np.asarray(score_2)\n", "\n", "    # Each committee accepts the same fraction of papers; the overlap between the\n", "    # two accepted sets measures consistency (which tail of the scores is taken\n", "    # does not affect the overlap statistic).\n", "    accept_1 = score_1.argsort()[:int(samples*accept_rate)]\n", "    accept_2 = score_2.argsort()[:int(samples*accept_rate)]\n", "\n", "    consistent_accept = len(set(accept_1).intersection(set(accept_2)))\n", "    consistent_accepts.append(consistent_accept/(samples*accept_rate))\n", "    print('Proportion consistently accepted: {prop}'.format(prop=consistent_accept/(samples*accept_rate)))\n", "\n", "consistent_accepts = np.array(consistent_accepts)\n", "accept_rates = np.array(accept_rates)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import mlai\n", "import mlai.plot as plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_figsize)\n", "ax.plot(accept_rates, consistent_accepts, \"r.\", markersize=10)\n", "ax.plot(accept_rates, accept_rates, \"k-\", linewidth=2)\n", "ax.set_xlabel(\"accept rate\")\n", "ax.set_ylabel(\"accept precision\")\n", "mlai.write_figure(filename=\"accept-precision-vs-accept-rate.svg\",\n", " directory=\"./neurips/\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Plot of the accept rate vs the consistency of the conference\n", "for 50% subjectivity." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_figsize)\n", "ax.plot(accept_rates, consistent_accepts-accept_rates, \"k-\", linewidth=2)\n", "ax.set_xlabel(\"accept rate\")\n", "ax.set_ylabel(\"(accept precision)-(accept rate)\")\n", "mlai.write_figure(filename=\"gain-in-consistency.svg\",\n", " directory=\"./neurips/\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Plot of the accept rate vs gain in consistency over a random\n", "conference for 50% subjectivity."
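] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a rough sanity check (a sketch that simply interpolates the arrays computed\n", "above), we can read off the simulated accept precision near the 2014 accept\n", "rate of roughly 25%." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Interpolate the simulated accept precision at an accept rate of roughly 25%.\n", "print(np.interp(0.25, accept_rates, consistent_accepts))"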
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Where do Rejected Papers Go?\n", "\n", "\\[edit\\]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import yaml" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open(os.path.join(nipsy.review_store, nipsy.outlet_name_mapping), 'r') as f:\n", " mapping = yaml.load(f, Loader=yaml.FullLoader)\n", "\n", "\n", "date = \"2021-06-11\"\n", "\n", "citations = nipsy.load_citation_counts(date=date)\n", "decisions = nipsy.load_decisions()\n", "nipsy.augment_decisions(decisions)\n", "joindf = nipsy.join_decisions_citations(decisions, citations)\n", "\n", "joindf['short_venue'] = joindf.venue.replace(mapping)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import plotly.graph_objects as go" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "thresh_to_show = 3\n", "\n", "label = ['submitted', 'oral', 'spotlight', 'poster', 'reject', '/dev/null']\n", "x = [0.1, 0.3, 0.3, 0.3, 0.3, 0.5]\n", "y = [0.4, 0.95, 0.9, 0.85, 0.3, 0.01]\n", "source = [0, 0, 0, 0, 4]\n", "target = [1, 2, 3, 4, 5]\n", "value = [(joindf['Status']=='Oral').sum(),\n", " (joindf['Status']=='Spotlight').sum(),\n", " (joindf['Status']=='Poster').sum(),\n", " (joindf['Status']=='Reject').sum(),\n", " joindf.loc[joindf.reject]['venue'].isna().sum()]\n", "\n", "venue_counts = joindf.loc[joindf.reject]['short_venue'].value_counts()\n", "venue_show = venue_counts[venue_counts>=thresh_to_show]\n", "target_val = target[-1]\n", "for venue,count in venue_show.items():\n", "    target_val += 1\n", "    value.append(count)\n", "    source.append(4)\n", "    label.append(venue)\n", "    target.append(target_val)\n", "    if venue=='ArXiv':\n", "        y.append(.15)\n", "        x.append(0.75)\n", "    elif venue == 'None':\n", "        y.append(.20)\n", "        x.append(0.75)\n", "    else:\n", "        y.append(.27)\n", "        x.append(0.8)\n", "\n", "# The remainder of this cell was truncated in the source; the lines below are a\n", "# reconstruction that aggregates the remaining low-count venues into a single\n", "# (hypothetical) 'other' node and draws the Sankey diagram with plotly.\n", "value.append(venue_counts[venue_counts<thresh_to_show].sum())\n", "source.append(4)\n", "target.append(target_val + 1)\n", "label.append('other')\n", "x.append(0.8)\n", "y.append(0.4)\n", "\n", "fig = go.Figure(go.Sankey(\n", "    arrangement='snap',\n", "    node=dict(label=label, x=x, y=y, pad=10),\n", "    link=dict(source=source, target=target, value=value)))\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Sankey diagram showing the flow of NeurIPS papers through the\n", "system from submission to eventual publication." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Effect of Late Reviews\n", "\n", "\\[edit\\]\n", "\n", "This notebook analyzes the reduction in reviewer confidence between\n", "reviewers who submit their reviews early and those whose reviews arrive\n", "late. The reviews are first loaded in from files Corinna and Neil saved\n", "and stored in a pickle. The function for doing that is\n", "`nipsy.load_review_history`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "plt.rcParams.update({'font.size': 22})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import cmtutils as cu\n", "import cmtutils.nipsy as nipsy \n", "import cmtutils.plot as plot\n", "\n", "import os\n", "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "reviews = nipsy.load_review_history()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Review Submission Times\n", "\n", "All the reviews are now in a `pandas` data frame called `reviews`, ready\n", "for processing. First of all, let's take a look at when the reviews were\n",
"submitted. The function `nipsy.reviews_before` gives a\n", "snapshot of the reviews as they stood at a particular date. So we simply\n", "create a data series across the date range of reviews\n", "(`nipsy.review_date_range`) that shows the counts." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "review_count = pd.Series(index=nipsy.review_date_range)\n", "for date in nipsy.review_date_range:\n", " review_count.loc[date] = nipsy.reviews_before(reviews, date).Quality.shape[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import mlai as ma" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "review_count.plot(linewidth=3, ax=ax)\n", "plot.deadlines(ax)\n", "ma.write_figure(filename='review-count.svg', directory='./neurips')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Cumulative count of the number of received reviews over time.\n", "\n", "We worked hard to try and ensure that all papers had three reviews\n", "before the start of the rebuttal. This next plot shows the number of\n", "papers that had fewer than three reviews across the review period. First\n", "let’s look at the overall statistics of the number of reviews per\n", "paper. Below we plot the mean, maximum, median, and minimum over time." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lastseen = reviews.drop_duplicates(subset='ID').set_index('ID')\n", "lastseen = lastseen['LastSeen']\n", "\n", "review_count = pd.DataFrame(index=reviews.ID.unique(), columns=nipsy.review_date_range)\n", "for date in nipsy.review_date_range:\n", "    counts = nipsy.reviews_status(reviews, date, column='Quality').count(level='ID')\n", "    review_count[date] = counts.fillna(0)\n", "review_count.fillna(0, inplace=True)\n", "review_count = review_count.T\n", "# Blank out dates after a review was last seen (e.g. reviews that were reallocated).\n", "for col in review_count.columns:\n", "    if pd.notnull(lastseen[col]):\n", "        review_count[col][review_count.index>lastseen[col]] = np.NaN\n", "\n", "review_count = review_count.T" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import mlai as ma" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "review_count.min().plot(linewidth=3, ax=ax)\n", "review_count.max().plot(linewidth=3, ax=ax)\n", "review_count.median().plot(linewidth=3, ax=ax)\n", "review_count.mean().plot(linewidth=3, ax=ax)\n", "plot.deadlines(ax)\n", "ma.write_figure(filename='number-of-reviews-over-time.svg', directory='./neurips')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Number of reviews per paper over time, showing the maximum,\n", "minimum, median, and mean.\n", "\n", "But perhaps the more important measure is how many papers had fewer than\n", "three reviews over time. In this plot you can see that by the time the\n", "rebuttal starts almost all papers have three reviews."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "count = pd.Series(index=nipsy.review_date_range)\n", "for date in nipsy.review_date_range:\n", " count[date] = (review_count[date]<3).sum()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import mlai as ma" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "count.plot(linewidth=3, ax=ax)\n", "plot.deadlines(ax)\n", "ma.write_figure(filename='paper-short-reviews.svg', directory='./neurips')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Number of papers with fewer than three reviews as a function\n", "of time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Review Confidence\n", "\n", "Now we will check the confidence of reviews as they come in over time.\n", "We’ve written a small helper function that looks in a four-day window\n", "around each time point and summarises the associated score (in the first\n", "case, confidence, `Conf`) with its mean across the window and 95%\n", "confidence intervals computed from the standard error of the mean\n", "estimate." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import mlai as ma" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "plot.evolving_statistic(reviews, 'Conf', window=4, ax=ax)\n", "ma.write_figure(filename='review-confidence-time.svg', directory='./neurips')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Average confidence of reviews as computed across a four-day\n", "moving window; the plot includes the standard error of the mean estimate.\n", "\n", "It looks like there might be a reduction in confidence as we pass the\n", "review deadline on 21st July, but is the difference in confidence for\n", "the reviews that came in later significant?\n", "\n", "We now simplify the question by looking at the average confidence for\n", "reviews that arrived before 21st July (the reviewing deadline) and\n", "reviews that arrived after the 21st July (i.e. those that were chased or\n", "were allocated late) but before the rebuttal period started (4th\n", "August). Below we select these two groups and estimate the mean\n", "confidence for each (again with error bars)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "column = \"Conf\"\n", "cat1, cat2 = nipsy.late_early_values(reviews, column)\n", "plot.late_early(cat1, cat2, column=column, ylim=(3.2, 3.8), ax=ax)\n", "ma.write_figure(filename='review-confidence-early-late.svg', directory='./neurips')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Average confidence for reviews that arrived before 21st July\n", "(the reviewing deadline) and reviews that arrived after. The plot shows\n", "mean values and confidence intervals.\n",
"A $t$-test shows the difference to be significant with a $p$-value of\n", "0.048%, although the magnitude of the difference is small (about 0.1).\n", "\n", "So, there is a small but statistically significant difference between the\n", "average confidence of reviews submitted before and after the deadline. The\n", "magnitude of the difference (about 0.1) is small, but it may indicate a\n", "tendency for later reviewers to be a little more rushed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Quality Score\n", "\n", "This begs the question: is there an effect on the other scores in the\n", "reviews, which cover ‘quality’ and ‘impact’? Quality is scored on a\n", "10-point scale, with a score of 6 or above indicating a recommendation to\n", "accept. We can form similar plots for quality, as shown in the figures\n", "below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "plot.evolving_statistic(reviews, column='Quality', window=4, ax=ax)\n", "ma.write_figure(filename='review-quality-time.svg', directory='./neurips')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Plot of average review quality score as a function of time\n", "using a four-day moving window. Standard error is also shown in the\n", "plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "column = \"Quality\"\n", "cat1, cat2 = nipsy.late_early_values(reviews, column)\n", "plot.late_early(cat1, cat2, column=column, ylim=(5.0, 5.6), ax=ax)\n", "ma.write_figure(filename='review-quality-early-late.svg', directory='./neurips')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Bar plot of average quality scores for on-time reviews and\n", "late reviews, standard errors shown. Under a $t$-test the difference in\n", "values is statistically significant with a $p$-value of 0.007%.\n", "\n", "There is another statistically significant difference: perceived quality\n", "scores are higher after the reviewing deadline than before. On average\n", "reviewers tend to be more generous in their quality perceptions when the\n", "review is late. The $p$-value is computed as 0.007%. We can also check\n", "if there is a similar effect on the impact score. The impact score was\n", "introduced by Ghahramani and Welling to get reviewers not just to think\n", "about the technical side of the paper, but whether it is driving the\n", "field forward. The score is binary, with 1 being for a paper that is\n", "unlikely to have high impact and 2 being for a paper that is likely to\n", "have a high impact." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "plot.evolving_statistic(reviews, 'Impact', window=4, ax=ax)\n", "ma.write_figure(filename='review-impact-time.svg', directory='./neurips')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Average impact score for papers over time, again using a\n", "moving average with a window of four days and with standard error of the\n", "mean computation shown."
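] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The significance levels quoted above come from two-sample $t$-tests comparing\n", "the early and late groups. As a minimal sketch of how such a test might be\n", "run (assuming `nipsy.late_early_values` returns the two groups of scores as\n", "array-like objects; the exact test variant used originally is not recorded\n", "here):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy import stats\n", "\n", "# Illustrative sketch: compare confidence scores for the early and late groups\n", "# with a two-sample t-test (Welch's variant is an assumption here).\n", "cat1, cat2 = nipsy.late_early_values(reviews, 'Conf')\n", "t_stat, p_value = stats.ttest_ind(cat1, cat2, equal_var=False)\n", "print('t = {t:.2f}, p = {p:.2g}'.format(t=t_stat, p=p_value))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same early/late comparison can be made for the impact score, shown below."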
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "column = \"Impact\"\n", "cat1, cat2 = nipsy.late_early_values(reviews, column)\n", "plot.late_early(cat1, cat2, column=column, ylim=(1, 1.4), ax=ax)\n", "ma.write_figure(filename='review-impact-early-late.svg', directory='./neurips')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Bar plot showing the average impact score of reviews\n", "submitted before the deadline and after the deadline. The difference in\n", "means did not prove to be statistically significant under a $t$-test\n", "($p$-value 5.9%).\n", "\n", "We find the difference is not quite statistically significant for the\n", "impact score ($p$-value of 5.9%), but if anything, there is a trend\n", "towards slightly higher impact scores for later reviews (see the figures\n", "above)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Review Length\n", "\n", "A final potential indicator of review quality is the length of the\n", "reviews. We can check whether there is a difference in the combined\n", "length of the review summary and the main body comments for late and\n", "early reviews (see the figures below)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "reviews['length'] = reviews['Comments'].apply(len) + reviews['Summary'].apply(len)\n", "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "plot.evolving_statistic(reviews, 'length', window=4, ax=ax)\n", "ma.write_figure(filename='review-length-time.svg', directory='./neurips')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Average length of reviews submitted plotted as a function of\n", "time with standard error of the mean computation included." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "column = \"length\"\n", "cat1, cat2 = nipsy.late_early_values(reviews, column)\n", "plot.late_early(cat1, cat2, column=column, ylim=(2000, 2500), ax=ax)\n", "ma.write_figure(filename='review-length-early-late.svg', directory='./neurips')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Bar plot of the average length of reviews submitted before\n", "and after the deadline with standard errors included. The difference of\n", "around 100 characters is statistically significant under a $t$-test\n", "($p$-value 0.55%).\n", "\n", "Once again we find a small but statistically significant difference;\n", "here, as we might expect, late reviews are shorter than those submitted\n", "on time, by about 100 characters in a roughly 2,400-character review." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see how the scores evolved, we next compute each paper’s mean quality\n", "score as it stood on each day of the review period." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "review_quality = pd.DataFrame(index=reviews.ID.unique(), columns=nipsy.review_date_range)\n", "for date in nipsy.review_date_range:\n", " qual = nipsy.reviews_status(reviews, date, column='Quality')\n", " review_quality[date] = qual.sum(level='ID')/qual.count(level='ID') # There's a bug where mean doesn't work in Pandas 1.2.4??"
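, " # A clearer equivalent in newer pandas versions (an untested alternative):\n", " # review_quality[date] = qual.groupby(level='ID').mean()"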
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "original_pairs = pd.read_csv(os.path.join(nipsy.review_store, 'Duplicate_PaperID_Pairs.csv'), index_col='original')\n", "duplicate_pairs = pd.read_csv(os.path.join(nipsy.review_store, 'Duplicate_PaperID_Pairs.csv'), index_col='duplicate')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perform an ‘inner join’ on duplicate papers and their originals with\n", "their reviews, and set the index of the duplicated papers to match the\n", "original. This gives us data frames with matching indices containing\n", "scores over time of the duplicate and original papers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "duplicate_reviews = duplicate_pairs.join(review_quality, how=\"inner\").set_index('original')\n", "original_reviews = original_pairs.join(review_quality, how=\"inner\")\n", "del original_reviews[\"duplicate\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corr_series = duplicate_reviews.corrwith(original_reviews)\n", "corr_series.index = pd.to_datetime(corr_series.index)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def bootstrap_index(df):\n", " n = len(df.index)\n", " return df.index[np.random.randint(n, size=n)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bootstrap_corr_df = pd.DataFrame(index=corr_series.index)\n", "for i in range(1000):\n", " ind = bootstrap_index(original_reviews)\n", " b_corr_series = duplicate_reviews.loc[ind].corrwith(original_reviews.loc[ind])\n", " b_corr_series.index = pd.to_datetime(b_corr_series.index)\n", " bootstrap_corr_df[i] = b_corr_series" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import datetime as dt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "final_vals = bootstrap_corr_df.loc[bootstrap_corr_df.index.max()]\n", "total_mean = final_vals.mean()\n", "# Shift each bootstrap trace so all traces converge to the mean final value,\n", "# which makes the trends comparable.\n", "(bootstrap_corr_df - final_vals+total_mean).plot(legend=False, ax=ax, linewidth=1, alpha=0.05, color='k')\n", "corr_series.plot(ax=ax, linewidth=3, color=\"w\")\n", "ax.set_ylim(0.45, 0.65)\n", "ax.set_xlim(dt.datetime(2014,7,23),nipsy.events['decisions_despatched'])\n", "ax.set_title(\"Correlation of Duplicate Reviews over time\")\n", "\n", "plot.deadlines(ax)\n", "ma.write_figure(filename='correlation-duplicate-reviews-bootstrap.svg', directory='./neurips')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Average correlation of duplicate papers over time. To give an\n", "estimate of the uncertainty, the correlation is computed with bootstrap\n", "samples. Here, to allow the trend lines to be compared, the bootstrap\n", "samples are shifted so they converge on the same point on the right of\n", "the graph."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "#\n", "import datetime as dt\n", "corr_series.plot(ax=ax, linewidth=3)\n", "ax.set_ylim(0.5, 0.6)\n", "ax.set_xlim(dt.datetime(2014,7,23),nipsy.events['decisions_despatched'])\n", "ax.set_title(\"Correlation of Duplicate Reviews over time\")\n", "\n", "plot.deadlines(ax)\n", "ma.write_figure(filename='correlation-duplicate-reviews.svg', directory='./neurips')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Average correlation of duplicate papers over time.\n", "\n", "We need to do a bit more analysis on the estimation of the correlation\n", "for the earlier submissions, but from what we see above, it looks like\n", "the correlation is being damaged by late reviews, and we never quite\n", "recover the consistency of reviews we had at the submission deadline\n", "even after the discussion phase is over." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Late Reviewers Summary\n", "\n", "In summary we find that late reviews are on average less confident and\n", "shorter, but rate papers as higher quality and perhaps as higher impact.\n", "Each of the effects is small (around 5%) but overall a picture emerges\n", "of a different category of review from those reviewers who delay their\n", "assessment." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Impact of Papers Seven Years On\n", "\n", "\\[edit\\]\n", "\n", "Now we look at the actual impact of the papers published using the\n", "Semantic Scholar database for tracking citations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "plt.rcParams.update({'font.size': 22})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import cmtutils as cu\n", "import cmtutils.nipsy as nipsy\n", "import cmtutils.plot as plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "papers = cu.Papers()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "UPDATE_IMPACTS = False # Set to True to download impacts from Semantic Scholar" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The impact of the different papers is downloaded from Semantic Scholar\n", "using their REST API. This can take some time, and they also throttle\n", "the calls. At the moment the code below doesn’t handle the throttling\n", "correctly. However, it will load the cached version of the citation\n", "scores from the given date." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if UPDATE_IMPACTS:\n", " from datetime import datetime\n", " date=datetime.today().strftime('%Y-%m-%d')\n", "else:\n", " date = \"2021-06-11\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Rerun to download impacts from Semantic Scholar\n", "if UPDATE_IMPACTS:\n", " semantic_ids = nipsy.load_semantic_ids()\n", " citations_dict = citations.to_dict(orient='index')\n",
" # Need to be a bit cleverer here. Semantic Scholar will throttle this call.\n", " sscholar = nipsy.download_citation_counts(citations_dict=citations_dict, semantic_ids=semantic_ids)\n", " citations = pd.DataFrame.from_dict(citations_dict, orient=\"index\")\n", " citations.to_pickle(date + '-semantic-scholar-info.pickle')\n", "else:\n", " citations = nipsy.load_citation_counts(date=date)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The final decision sheet provides information about what happened to all\n", "of the papers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "decisions = nipsy.load_decisions()\n", "nipsy.augment_decisions(decisions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is joined with the citation information to provide the basis for\n", "understanding the impact of these papers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "joindf = nipsy.join_decisions_citations(decisions, citations)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Correlation of Quality Scores and Citation\n", "\n", "Our first study will be to check the correlation between quality scores\n", "of papers and how many times the papers have been cited in\n", "practice. In the plot below, rejected papers are given as crosses,\n", "accepted papers are given as dots. We include all papers, whether\n", "published in a venue or just available through ArXiv or other preprint\n", "servers. We show the published/non-published quality scores and\n", "$\log_{10}(1+\text{citations})$ for all papers in the plot below. In the\n", "plot, each point is corrupted by some Laplacian noise and the axis ticks\n", "are removed. The idea is to give a sense of the distribution\n", "rather than reveal the score of a particular paper." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import mlai as ma" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "column = \"average_calibrated_quality\"\n", "filter_col = \"all\"\n", "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "plot.log_one_citations(column, joindf, filt=joindf[filter_col], ax=ax)\n", "ax.set_xticks([])\n", "ma.write_figure(filename=\"citations-vs-{col}-{filt}.svg\".format(filt=filter_col, col=column.replace(\"_\", \"-\")),\n", " directory=\"./neurips\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Scatter plot of $\log_{10}(1+\text{citations})$ against the\n", "average calibrated quality score for all papers. To prevent\n", "reidentification of individual papers’ quality scores and citation counts,\n", "each point is corrupted by differentially private noise in the plot\n", "(correlation is computed before adding differentially private\n", "noise).\n", "\n", "The correlation seems strong, but of course, we are looking at papers\n", "which were accepted and rejected by the conference. This is dangerous,\n", "as it is quite likely that presentation at the conference may provide\n", "some form of lift to the papers’ numbers of citations. So, the right\n", "thing to do is to look at the groups separately.\n", "\n", "Looking at the accepted papers only shows a very different picture.\n", "There is very little correlation between accepted papers’ quality scores\n", "and the number of citations they receive."
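] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To quantify this, we can compute the correlation within each group directly.\n", "This is a sketch assuming the `joindf` columns used elsewhere in this notebook\n", "(`accept`, `reject`, `average_calibrated_quality` and `numCitedBy`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: Pearson correlation between calibrated quality and log citations,\n", "# computed separately for the accepted and rejected groups.\n", "for group in ['accept', 'reject']:\n", "    sub = joindf.loc[joindf[group]]\n", "    rho = sub['average_calibrated_quality'].corr(np.log10(1 + sub['numCitedBy']))\n", "    print(group, 'correlation:', rho)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The scatter plot for the accepted papers is shown below."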
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "column = \"average_calibrated_quality\"\n", "filter_col = \"accept\"\n", "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "plot.log_one_citations(column, joindf, filt=joindf[filter_col], ax=ax)\n", "ma.write_figure(filename=\"citations-vs-{col}-{filt}.svg\".format(filt=filter_col, col=column.replace(\"_\", \"-\")),\n", " directory=\"./neurips\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Scatter plot of $\log_{10}(1+\text{citations})$ against the\n", "average calibrated quality score for accepted papers. To prevent\n", "reidentification of individual papers’ quality scores and citation counts,\n", "each point is corrupted by differentially private noise in the plot\n", "(correlation is computed before adding differentially private\n", "noise).\n", "\n", "Conversely, looking at rejected papers only, we do see a slight trend,\n", "with higher scoring papers achieving more citations on average. This,\n", "combined with the lower average number of citations in the rejected\n", "paper group, alongside their lower average scores, explains the\n", "correlation we originally observed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "column = \"average_calibrated_quality\"\n", "filter_col = \"reject\"\n", "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "plot.log_one_citations(column, joindf, filt=joindf[filter_col], ax=ax)\n", "ma.write_figure(filename=\"citations-vs-{col}-{filt}.svg\".format(filt=filter_col, col=column.replace(\"_\", \"-\")),\n", " directory=\"./neurips\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Scatter plot of $\log_{10}(1+\text{citations})$ against the\n", "average calibrated quality score for rejected papers. To prevent\n", "reidentification of individual papers’ quality scores and citation counts,\n", "each point is corrupted by differentially private noise in the plot\n", "(correlation is computed before adding differentially private\n", "noise).\n", "\n", "Welling and Ghahramani introduced an “impact” score in NeurIPS 2013, so we\n", "might expect the impact score to show correlation. And indeed, despite\n", "the lower range of the score (a reviewer can score either 1 or 2) we do\n", "see *some* correlation, although it is relatively weak." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "column = \"average_impact\"\n", "filter_col = \"accept\"\n", "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "plot.log_one_citations(column, joindf, filt=joindf[filter_col], ax=ax)\n", "ma.write_figure(filename=\"citations-vs-{col}-{filt}.svg\".format(filt=filter_col, col=column.replace(\"_\", \"-\")),\n", " directory=\"./neurips\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Scatter plot of $\log_{10}(1+\text{citations})$ against the\n", "average impact score for accepted papers. To prevent reidentification of\n", "individual papers’ quality scores and citation counts, each point is\n", "corrupted by differentially private noise in the plot (correlation is\n", "computed before adding differentially private noise).\n", "\n", "Finally, we also looked at the correlation between the *confidence* score\n", "and citation impact. Here the correlation is somewhat stronger. Why should\n", "confidence be an indicator of higher citations?\n",
"A plausible explanation is that there is a confounder driving both\n", "variables. For example, it might be that papers which are easier to\n", "understand (due to elegance of the idea, or quality of exposition) inspire\n", "greater reviewer confidence and increase the number of citations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "column = 'average_confidence'\n", "filter_col = \"accept\"\n", "fig, ax = plt.subplots(figsize=plot.big_wide_figsize)\n", "plot.log_one_citations(column, joindf, filt=joindf[filter_col], ax=ax)\n", "ma.write_figure(filename=\"citations-vs-{col}-{filt}.svg\".format(filt=filter_col, col=column.replace(\"_\", \"-\")),\n", " directory=\"./neurips\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Figure: Scatter plot of $\log_{10}(1+\text{citations})$ against the\n", "average confidence score for accepted papers. To prevent\n", "reidentification of individual papers’ quality scores and citation counts,\n", "each point is corrupted by differentially private noise in the plot\n", "(correlation is computed before adding differentially private\n", "noise)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def bootstrap_index(df):\n", " n = len(df.index)\n", " return df.index[np.random.randint(n, size=n)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Bootstrap estimate of the correlation between each average score and log\n", "# citations, restricted to accepted papers.\n", "for column in [\"average_quality\", \"average_impact\", \"average_confidence\"]:\n", "    cor = []\n", "    for i in range(1000):\n", "        ind = bootstrap_index(joindf.loc[joindf.accept])\n", "        cor.append(joindf.loc[ind][column].corr(np.log(1+joindf.loc[ind]['numCitedBy'])))\n", "    cora = np.array(cor)\n", "    rho = cora.mean()\n", "    twosd = 2*np.sqrt(cora.var())\n", "    print(\"{column}\".format(column=column.replace(\"_\", \" \")))\n", "    print(\"Mean correlation is {rho} +/- {twosd}\".format(rho=rho, twosd=twosd))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "\\[edit\\]\n", "\n", "Under the simple model we have outlined, we can be confident that there\n", "is inconsistency between two independent committees, but the level of\n", "inconsistency is much less than we would find for a random committee. If\n", "we accept that the bias introduced by the Area Chairs knowing when they\n", "were dealing with duplicates was minimal, then, were we to revisit the\n", "NIPS 2014 conference with an independent committee, we would expect\n", "between **38% and 64% of the presented papers to be the same**. If the\n", "conference were run at random, then we would only expect 25% of the\n", "papers to be the same.\n", "\n", "It’s apparent from comments and speculation about what these results\n", "mean that some people might be surprised by the size of this figure.\n", "However, it only requires a little thought to see that this figure is\n", "likely to be large for any highly selective conference if there is even\n", "a small amount of inconsistency in the decision-making process. This is\n", "because, once the conference has chosen to be ‘highly selective,’ then,\n", "by definition, only a small percentage of papers can be accepted. Now if\n", "we think of a type I error as accepting a paper which should be\n", "rejected, such errors are easier to make because, again by definition,\n", "many more papers should be rejected.\n",
"Type II errors (rejecting a paper that should be accepted) are less\n", "likely because (by setting the accept rate low) there are fewer papers\n", "that should be accepted in the first place. When there is a difference\n", "of opinion between reviewers, it does seem that many of the arguments\n", "can be distilled down to a (subjective) opinion about whether controlling\n", "for type I or type II errors is more important. Further, normally when\n", "discussing type I and type II errors we believe that the underlying\n", "system of study is genuinely binary: e.g., diseased or not diseased.\n", "However, for conferences the accept/reject boundary is not a clear\n", "separation point: there is a continuum (or spectrum) of paper quality\n", "(as there also is for some diseases). And the decision boundary often\n", "falls in a region of very high density.\n", "\n", "I would prefer a world where a conference is no longer viewed as a proxy\n", "for research quality. The true test of quality is time. In the current\n", "world, papers from conferences such as NeurIPS are being used to judge\n", "whether a researcher is worthy of a position at a leading company, or\n", "whether a researcher gets tenure. This is problematic and damaging for\n", "the community. Reviewing is an inconsistent process, but that is not a\n", "bad thing. It is far worse to have a reviewing system that is\n", "consistently wrong than one which is inconsistently wrong.\n", "\n", "My own view of a NeurIPS paper is inspired by the Millennium Galleries in\n", "Sheffield. There, among the exhibitions they sometimes have work done by\n", "apprentices in their ‘qualification.’ Sheffield is known for knives, and\n", "the work of the apprentices in making knives is sometimes very intricate\n", "indeed. But it does lead to some very impractical knives. NeurIPS seems\n", "to be good at judging technical skill, but not impact. And I suspect the\n", "same is true of many other meetings. So, a publication at NeurIPS does\n", "seem to indicate that the author has some of the skills required, but it\n", "does not necessarily imply that the paper will be impactful." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Thanks!\n", "\n", "For more information on these subjects and more you might want to check\n", "the following resources.\n", "\n", "- twitter: [@lawrennd](https://twitter.com/lawrennd)\n", "- podcast: [The Talking Machines](http://thetalkingmachines.com)\n", "- newspaper: [Guardian Profile\n", " Page](http://www.theguardian.com/profile/neil-lawrence)\n", "- blog:\n", " [http://inverseprobability.com](http://inverseprobability.com/blog.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References" ] } ], "nbformat": 4, "nbformat_minor": 5, "metadata": {} }