{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Modeling arXiv Submissions\n", "================\n", "\n", "The purpose of this notebook is to take a look at submissions to the [arXiv](http://www.arxiv.org). As I show below, it's very possible that the number of submissions per month will soon exceed 10000, a first in the 25-year history of the service.\n", "\n", "To do so, I start by building a simple time-dependent model for the number of submissions each month, $S(t)$, where $t$ indexes the number of months since the beginning of the arXiv. arXiv submission data is indexed to the beginning of each month; thus, the start date 1991-07-01 is set to $t=0$, 1991-08-01 is set to $t=1$, and so on.\n", "\n", "The tools I use are pretty standard in the Python software stack:\n", "\n", "* [`matplotlib`](http://matplotlib.org/): for plotting and visualizing data\n", "* [`seaborn`](https://github.com/mwaskom/seaborn): to make the matplotlib plots a bit prettier\n", "* [`pandas`](http://pandas.pydata.org/): for wrangling data\n", "* [`numpy`](http://www.numpy.org/): great support for scientific computation; lots of handy, miscellaneous functions\n" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ ":0: FutureWarning: IPython widgets are experimental and may change in the future.\n" ] } ], "source": [ "# Import the relevant packages\n", "from __future__ import division\n", "\n", "import pandas as pd\n", "import numpy as np\n", "\n", "import matplotlib.pyplot as plt\n", "\n", "# Don't want the matplotlib plots trying\n", "# to pop out of the screen!\n", "%matplotlib inline\n", "\n", "# If you just do \"import seaborn\",\n", "# then seaborn messes around with the\n", "# matplotlib backend parameters. I want\n", "# to avoid that, so I import only the API.\n", "\n", "import seaborn.apionly as sns\n", "\n", "# Set the style of the plots to something aesthetic\n", "sns.set_style('darkgrid')\n", "\n", "# Whether to save the plots.\n", "save = False" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Introduction\n", "==========" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "With those preliminaries out of the way, we can go ahead and download the data from the arXiv. Monthly submission data is published at [this URL](http://arxiv.org/stats/get_monthly_submissions). (NOTE: This is a download link!) The `pandas.read_csv()` function can read data directly from a URL, so that's what I do here. (That way, I don't have to download the file each time I want to update my results. In fact, I could turn this entire pipeline into an automated process that, sometime in the middle of each month, goes and gets the latest submission data.)" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Read in the data, using the dates as the index.\n", "df = pd.read_csv('http://arxiv.org/stats/get_monthly_submissions', index_col=0,\n", "                 parse_dates=True)\n", "\n", "# We don't want data from the current month, as it will be incomplete.\n", "# So we get rid of it.\n", "\n", "now = pd.Timestamp.now()\n", "df = df[df.index < '{0}-{1}-01'.format(now.year, now.month)]" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
| month | submissions | historical_delta |
|---|---|---|
| 1991-07-01 | 0 | -2 |
| 1991-08-01 | 27 | -1 |
| 1991-09-01 | 58 | 0 |
| 1991-10-01 | 76 | 0 |
| 1991-11-01 | 64 | 0 |
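The $t$ indexing described in the introduction can be sketched directly from rows like these. Below is a minimal example using the five rows shown above as a stand-in for the full downloaded data; the column name `t` is my own choice, not something defined elsewhere in the notebook:

```python
import pandas as pd

# Small sample standing in for the arXiv submission data,
# taken from the first five rows of the table above.
df = pd.DataFrame(
    {"submissions": [0, 27, 58, 76, 64]},
    index=pd.to_datetime(
        ["1991-07-01", "1991-08-01", "1991-09-01", "1991-10-01", "1991-11-01"]
    ),
)
df.index.name = "month"

# t = number of whole months since 1991-07-01, so that
# 1991-07-01 -> t = 0, 1991-08-01 -> t = 1, and so on.
start = df.index[0]
df["t"] = (df.index.year - start.year) * 12 + (df.index.month - start.month)
```

With the data loaded as above, `df["t"]` runs 0 through 4, matching the convention that the first month of the arXiv is $t=0$.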