:author: Project Jupyter :email: :institution: Project Jupyter :equal-contributor: :author: Matthias Bussonnier :email: :institution: UC Berkeley :equal-contributor: :author: Jessica Forde :email: :institution: Project Jupyter :equal-contributor: :author: Jeremy Freeman :email: :institution: :equal-contributor: :author: Brian Granger :email: :institution: Cal Poly, San Luis Obispo :equal-contributor: :author: Tim Head :email: tim@wildtreetech.com :institution: Wild Tree Tech, Switzerland :equal-contributor: :author: Chris Holdgraf :email: choldgraf@berkeley.edu :institution: UC Berkeley :corresponding: :author: Kyle Kelley :email: :institution: Netflix :equal-contributor: :author: Gladys Nalvarte :email: :institution: Simula Research Lab :equal-contributor: :author: Andrew Osheroff :email: :institution: :equal-contributor: :author: M Pacer :email: mpacer.phd@gmail.com :institution: Netflix :equal-contributor: :author: Yuvi Panda :email: :institution: UC Berkeley :equal-contributor: :author: Fernando Perez :email: :institution: UC Berkeley :equal-contributor: :author: Benjamin Ragan-Kelley :email: benjaminrk@gmail.com :institution: Simula Research Lab :equal-contributor: :author: Carol Willing :email: willingc@gmail.com :institution: Cal Poly, San Luis Obispo :equal-contributor: :bibliography: binderbib :video: https://youtu.be/KcC0W5LP9GM =================================================================================== Binder 2.0 - Reproducible, interactive, sharable environments for science at scale =================================================================================== .. class:: abstract Binder is an open source web service that lets users create sharable, interactive, reproducible environments in the cloud. It is powered by other core projects in the open source ecosystem, including JupyterHub and Kubernetes for managing cloud resources. 
Binder works with pre-existing workflows in the analytics community, aiming to create interactive versions of repositories that exist on sites like GitHub with minimal extra effort needed. This paper details several of the design decisions and goals that went into the development of the current generation of Binder. .. class:: keywords cloud computing, reproducibility, binder, mybinder.org, shared computing, accessibility, kubernetes, dev ops, jupyter, jupyterhub, jupyter notebooks, github, publishing, interactivity Binder is a free, open source, and massively publicly available tool for easily creating sharable, interactive, reproducible environments in the cloud. The scientific community is increasingly unified around reproducibility. A survey in 2016 of 1,576 researchers reported that 90% of respondents believed there exists a reproducibility crisis in the scientific community. A majority of respondents also reported difficulty reproducing the work of colleagues :cite:`Baker2016-gp`. Similar results have been reported in the cell biology community :cite:`The_American_Society_for_Cell_Biology_undated-yv` and the machine learning community :cite:`Pineau2017-sb`. Making research reproducible requires pursuing two sub-goals, both of which are difficult to achieve: - **technical reproducibility**: making reproducible scientific results possible at all - **practical reproducibility**: enabling others to reproduce results without difficulty Both technical and practical reproducibility depend upon the software and technology available to researchers at any moment in time. With the growth in open source tools for data analysis, as well as the “data heavy” approach many fields are adopting, these problems become more complex yet more tractable than ever before. Fortunately, as the problem has grown more complex, the open source community has risen to meet the challenge. 
Tools for packaging analytics environments into “containers” allow others to re-create the computational environments needed to run analyses and evaluate results. Online communities make it easier to share and discover scientific results. A myriad of open source tools are freely available for doing analytics in open and transparent ways. New paradigms for writing code and displaying results in rich, engaging formats allow results to live next to the prose that explains their purpose.

However, manual implementation of these processes is complex, and reproducing the full stack of another person’s work is too labor intensive and error-prone for day-to-day use. A recent study of scientific repositories found that citation of "both visualization tools as well as common software packages (such as MATLAB) was a widespread failure" :cite:`Stodden2018-fy`. As a result, the technical barriers limit practical reproducibility.

To lower the technical barriers of sharing computational work, we introduce Binder 2.0, a tool that we believe makes reproducibility more practically possible.

An overview of Binder
---------------------

Binder consists of a set of tools for creating sharable, interactive, and deterministic environments that run on personal computers and cloud resources. It manages the technical complexity around:

* creating containers to capture a code repository and its technical environment;
* generating user sessions that run the environment defined in those containers; and
* providing links that users can share with others to allow them to interact with these environments.

Binder is built on modern-day tools from the open source community and is itself fully open source for others to use.
You can access a public deployment of Binder at `mybinder.org `_, a web service that the Binder and JupyterHub teams run as a demonstration of the BinderHub technology and as digital public infrastructure for those who wish to share Binder links so that others may interact with their code repositories. It is meant to be a testing ground for different use cases in the Binder ecosystem as well as a public service for the scientific and educational community. `mybinder.org `_ serves nearly 9,000 daily sessions, and has already been used for reproducible publishing [#]_, sharing interactive course materials [#]_, at the university and high-school level, creating interactive package documentation in Python [#]_ with Sphinx Gallery, and sharing interactive content that requires a language-specific kernel in order to run [#]_. .. [#] https://github.com/minrk/ligo-binder .. [#] https://www.inferentialthinking.com/chapters/01/3/plotting-the-classics.html .. [#] https://sphinx-gallery.readthedocs.io/en/latest/advanced_configuration.html#binder-links .. [#] http://greenteapress.com/wp/think-dsp/ .. figure:: images/binder_uis.png :align: center :figclass: w :scale: 37 Two example user interfaces that users can run within Binder. Because BinderHub uses a JupyterHub for hosting all user sessions, one can specify an environment that serves any Jupyter-supported user interface, provided that it can run via the browser. A. Examining image data from Ross et al. on Binder with JupyterLab :cite:`Ross2017-ff`. JupyterLab provides access to the file system (left column), a notebook interface (middle column), as well as traditional script files and interactive kernels (right column). B. An RStudio interface running the modern RStudio and ``tidyverse`` stack. In both cases, users can explore the code and make their own modifications from within the Binder session, without any need to manually install dependencies. 
Binder continues in the tradition of promoting "the complete software development environment and the complete set of instructions which generated the figures" :cite:`Buckheit1995-ox` by effortlessly providing these tools to the general public in the cloud. The first iteration of Binder was released in 2016 :cite:`Freeman2016-jt` and provided a prototype that managed reproducible user environments in the cloud. In the years since, there have been several advances in technology for managing cloud resources, serving interactive user environments, and creating reproducible containers for analytics. Binder 2.0 utilizes these new tools, and it is more scalable and maintainable, is easier to deploy, and supports more analytic and scientific workflows than before.

While previous work has specified methods or file formats for the sharing of research :cite:`Buckheit1995-ox` :cite:`Gentleman2007-cz` :cite:`Liang2015-ay`, Binder only requires configuration files typically seen in contemporary software development. Related online platforms for reproducibility also have specific front ends for presenting research and commands for running code :cite:`Anjos2017-vb` :cite:`Liang2015-ay` :cite:`Stodden2012-sd`, while Binder flexibly allows users to interact with a repository using modern data science tools such as RStudio, Jupyter Notebook, and JupyterLab. By containerizing the environment and using these front-end data science tools, Binder prioritizes an interactive user experience so that "someone else can discover it for themselves" :cite:`Somers2018-bj`.

At the highest level, Binder is a particular combination of open source tools to achieve the goal of sharable, reproducible environments. This paper lays out the technical vision of Binder 2.0, including the guiding principles and goals behind each piece of technology it uses. It also discusses the guiding principles behind the *new* open source technology that the project has created.
Guiding Principles of Binder ---------------------------- Several high-level project goals drive the development of Binder 2.0. These are outlined below: **Deployability**. Binder is driven by open source technology, and the BinderHub server should be deployable by a diverse representation of people in the scientific, publishing, and data analytic communities. This often means that it must be maintained by people without an extensive background in cloud management and dev-ops skills. BinderHub (the underlying technology behind Binder) should thus be deployable on a number of cloud frameworks, and with minimal technical skills required. **Maintainability**. Deploying a service on cloud resources is important but happens less frequently than *maintaining* those cloud resources all day, every day. Binder is designed to utilize modern-day tools in cloud orchestration and monitoring. These tools minimize the time that individuals must spend ensuring that the service performs as expected. Recognizing the importance of maintainability, the Binder team continues to work hard to document effective organizational and technical processes around running a production BinderHub-powered service such as `mybinder.org `_. The goal of the project is to allow a BinderHub service to be run without specialized knowledge or extensive training in cloud orchestration. **Pluggability**. Binder’s goal is to make it easier to adopt and interact with existing tools in the open source ecosystem. As such, Binder is designed to work with a number of open source packages, languages, and user interfaces. In this way, Binder acts as glue to bring together pieces of the open source community, and it easily plugs into new developments in this space. **Accessibility**. Binder should be as accessible as possible to members of the open source, scientific, educational, and data science communities. 
By leveraging pre-existing workflows in these communities rather than requiring people to adopt new ones, Binder increases its adoption and user acceptance. Input and feedback from members of those communities guide future development of the technology. As a key goal, Binder should support pre-existing scientific workflows and improve them by adding sharability, reproducibility, and interactivity.

**Usability**. Finally, the Binder team wants simplicity and fast interaction to be core components of the service. Minimizing the number of steps towards making your work sharable via Binder helps provide an effective user experience. Consumers of shared work must be able to quickly begin using the Binder repository that another person has put together. To achieve these goals, creating multiple ways in which people can use Binder’s services is key. For example, easily sharing a link to the full Binder interface and offering a public API endpoint to request and interact with a kernel backed by an arbitrary environment increase usability.

In the following sections, we describe the three major technical components that the Jupyter and Binder teams have developed for the Binder project: JupyterHub, repo2docker, and BinderHub. All are open source, and rely heavily on other tools in the open source ecosystem. We'll discuss how each feeds into the principles we’ve outlined above.

Scalable interactive user sessions
----------------------------------

Binder runs as either a public or a private web service, and it needs to handle potentially large spikes in user sessions as well as sustained user activity over several minutes. It also needs to be deployable on a number of cloud providers in order to avoid locking in the technology to the offerings of a single cloud service. To accomplish this, Binder uses a deployment of JupyterHub that runs on Kubernetes, both of which contribute to BinderHub's scalability and maintainability.
JupyterHub, an open source tool from the Jupyter community, provides a centralized resource that serves interactive user sessions. It allows definition of a computational environment (e.g. a Docker image) that runs the Jupyter notebook server. A core principle of the Jupyter project is to be language- and workflow-agnostic, and JupyterHub is no exception. JupyterHub can be used to run dozens of languages served with a variety of user interfaces, including Jupyter Notebooks :cite:`Bussonnier2018-kc`, JupyterLab :cite:`Project_Jupyter_Contributors2017-yi`, RStudio :cite:`Project_Juptyer_Contributors2017-ra`, Stencila :cite:`RK_Min2018-eq`, and OpenRefine :cite:`Head2018-jf`. Another key benefit of JupyterHub is that it is straightforward to run on Kubernetes, a modern-day open source platform for orchestrating computational resources in the cloud. Kubernetes can be deployed on most major cloud providers, self-hosted infrastructure (such as OpenStack deployments), or even on an individual laptop or workstation. For example, Google Cloud Platform, Microsoft Azure, and Amazon AWS each have managed Kubernetes clusters that run with minimal user intervention. Thus, it is straightforward to deploy JupyterHub on any major cloud provider. Kubernetes is designed to be relatively self-healing, often automatically resolving problems that would normally disrupt the service. It also has a declarative syntax for defining the cloud resources that are needed to run a web service. Thus, maintainers can update a JupyterHub running on Kubernetes with minimal changes to configuration files for the deployment, providing the flexibility to configure the JupyterHub as needed, without requiring a lot of hands-on intervention and tinkering. Finally, Kubernetes is both extremely scalable and battle-tested because it was originally developed to run Google's web services. 
A cloud orchestration tool that can handle the usage patterns of a service like GMail can almost certainly handle the analytics environments that are served with Binder. In addition, by using Kubernetes, Binder (with JupyterHub) leverages the power of Kubernetes' strong open source community. As more companies, organizations, and universities adopt and contribute to the tool, the Binder community will benefit from these advances. There are several use-cases of JupyterHub being used for shared, interactive computing. For example, UC Berkeley hosts a Foundations in Data Science :cite:`Berkeley_Division_of_Data_Sciences_undated-nz` course that serves nearly 1,000 interactive student sessions simultaneously. The Wikimedia foundation also uses JupyterHub to facilitate users accessing the Wikipedia dataset :cite:`Wikimedia_undated-si`, allowing them to run bots and automate the editing process with a Jupyter interface. Finally, organizations such as the Open Humans Project provide a JupyterHub for their community :cite:`Open_Humans_Foundation_undated-ov` to analyze, explore, and discover interesting patterns in a shared dataset. Deterministic environment building - Repo2Docker ------------------------------------------------ Docker :cite:`Docker_Inc_undated-ai` is extremely flexible, and has been used throughout the scientific and data science community for standardizing environments that are sharable with other people. A Docker image contains nearly all of the pieces necessary to re-run an analysis. This provides the right balance between flexibility (e.g. a Docker image can contain basically any environment) and being lightweight to deploy and store in the cloud. JupyterHub can serve an arbitrary environment to users based off of a Docker image, but how is this image created in the first place? 
While it is possible (and common) to hand-craft a Docker image using a set of instructions called a Dockerfile, this step requires a considerable amount of knowledge about the Docker platform, which is a high barrier for the large majority of scientists and data analysts. Binder’s goal is to operate with many different workflows in data analytics, and requiring the use of a Dockerfile to define an environment is too restrictive.

At the same time, the analytics community already makes heavy use of online code repositories, often hosted on websites such as GitHub :cite:`GitHub_undated-wa` or Bitbucket :cite:`Atlassian_undated-ra`. These sites are home to tens of thousands of repositories containing the computational work for research, education, development, and general communication. Best practices in development already dictate storing the requirements needed (in text files such as ``environment.yml``) along with the code itself (which often lives in document structures such as Jupyter Notebooks or RMarkdown files). As a result, in many cases the repository already contains all the information needed to build the required environment.

Binder’s solution to this is a lightweight tool called “repo2docker” :cite:`Project_Jupyter_Contributors2017-no`. It is an open source command line tool that converts code repositories into a Docker image suitable for running with JupyterHub. Repo2docker:

1. is called with a single argument, a path to a git repository, and optionally a reference to a git branch, tag, or commit hash. The repository can either be online (such as on GitHub or GitLab) or local to the person’s computer.
2. clones the repository, then checks out the reference that it has been passed (or defaults to “master”).
3. looks for one or more “configuration” files that are used to define the environment needed to run the code inside the repository. These are generally files that *already exist* in the data science community. For example, if it finds a ``requirements.txt`` file, it assumes that the user wants a Python installation and installs everything inside the file. If it finds an ``install.R`` file, it assumes the user wants RStudio available, and pre-installs all the packages listed inside.
4. constructs a ``Dockerfile`` that builds the environment specified by the configuration files, and that is meant to be run via a Jupyter notebook server.
5. builds an image from this ``Dockerfile``, and then registers it online with a Docker repository of choice.

Repo2docker aims to be flexible in the analytics workflows it supports, and it minimizes the amount of effort needed to support a *new* workflow. A core building block of repo2docker is the “Build Pack” - a class that defines all of the operations needed to construct the environment needed for a particular analytics workflow. These Build Packs have a ``detect`` method that returns True when a particular configuration file is present (e.g. ``requirements.txt`` will trigger the Python build pack). They also have a method called ``get_assemble_scripts`` that inserts the necessary lines into a Dockerfile to support this workflow.

For example, below we show a simplified version of the Python build pack in ``repo2docker``. In this case, the ``detect`` method looks for a ``requirements.txt`` file and, if it exists, triggers the ``get_assemble_scripts`` method, which inserts lines into the Dockerfile that install Python and pip. Binder uses ``repo2docker`` to build repository images dynamically.

.. code-block:: python

    import os


    class PythonBuildPack(CondaBuildPack):
        """Setup Python for use with a repository."""

        def __init__(self):
            ...

        def get_assemble_scripts(self):
            """Return build-steps specific to this repo."""
            assemble_scripts = super().get_assemble_scripts()
            # KERNEL_PYTHON_PREFIX is the env with the kernel
            # whether it's distinct from the notebook
            # or the same.
            pip = '${KERNEL_PYTHON_PREFIX}/bin/pip'
            # install requirements.txt in the kernel env
            requirements_file = self.binder_path(
                'requirements.txt')
            if os.path.exists(requirements_file):
                assemble_scripts.append((
                    '${NB_USER}',
                    '{} install --no-cache-dir -r "{}"'.format(
                        pip, requirements_file)
                ))
            return assemble_scripts

        def detect(self):
            """Check if repo builds w/ Python buildpack."""
            requirements_txt = self.binder_path(
                'requirements.txt')
            return os.path.exists(requirements_txt)

Repo2docker also supports more generic configuration files that are applied regardless of the particular Build Pack that is detected. For example, a file called “postBuild” will be run from the shell after all dependencies are installed. This is often used to pre-compile code or download datasets from the web.

Finally, in the event that a particular setup is not natively supported, repo2docker will also build a Docker image from a plain ``Dockerfile``. This means users are never blocked by the design of repo2docker.

By modularizing the environment generation process in this fashion, it is possible to mix and match environments that are present in the final image. Repo2docker’s goal is to allow for a fully composable analytics environment. If a researcher requires Python 2 and 3, RStudio, and Julia simultaneously for their work, repo2docker should enable this.

.. figure:: images/binder_main_ui.png
   :align: center

   The BinderHub user interface. Users input a link to a public git repository. Binder will check out this repository and build the environment needed to run the code inside. It then provides you a link that can be shared with others so that they may run an interactive session that runs the repository’s code.

In addition, by capturing pre-existing workflows rather than requiring data analysts to adopt new ones, there is a minimal energy barrier towards using repo2docker to deterministically build images that run a code repository.
For example, if the following ``requirements.txt`` file is present in a repository, repo2docker will build an image with Python 3 and the listed packages installed with pip.

.. code-block:: bash

    $ cat requirements.txt
    numpy
    scipy
    matplotlib

Similarly, the following files will install RStudio, with these R commands run while the Docker image is built:

.. code-block:: bash

    $ cat binder/install.R
    install.packages("ggplot2")

    $ cat binder/runtime.txt
    r-2017-10-24

In this case, the date specified in ``runtime.txt`` instructs repo2docker to use a specific MRAN repository :cite:`Microsoft_undated-gd` date. In addition, note that these files exist in a folder called ``binder/`` (relative to the repository root). If repo2docker discovers a folder of this name, it will build the environment from the contents of this folder, ignoring any configuration files that are present in the project’s root. This allows users to dissociate the configuration files used to build the package from those used to share a Binder link.

.. figure:: images/binderhub_diagram.png
   :align: center
   :figclass: w

   The BinderHub architecture for interactive GUI sessions. Users connect to the Binder UI via a public URL. All computational infrastructure is managed with a Kubernetes deployment (light green) managing several pods (dark green) that make up the BinderHub service. Interactive user pods (blue squares) are spawned and managed by a JupyterHub.

By facilitating the process by which researchers create these reproducible images, repo2docker addresses the “works for me” problem that is common when sharing code. There are no longer breaking differences in the environment of two users if they are running code from the same image generated by repo2docker. Additionally, researchers can use repo2docker to confirm that all of the information needed to recreate their analysis is contained within their configuration files, creating a way to intuitively define “recipes” for reproducing one’s work.
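The configuration-file lookup described in this section, including the precedence of a ``binder/`` folder over the repository root, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not repo2docker's actual implementation: the names ``CONFIG_FILES``, ``config_dir``, ``detect_workflows``, and ``demo`` are hypothetical, and only the two configuration files mentioned in the text are considered.

```python
import os
import tempfile

# Hypothetical mapping from a configuration file to the workflow it signals.
# Only the two files discussed in the text are listed here.
CONFIG_FILES = {
    "requirements.txt": "python",
    "install.R": "r",
}


def config_dir(repo_root):
    """Return binder/ if the repository has one, else the repository root."""
    binder_dir = os.path.join(repo_root, "binder")
    return binder_dir if os.path.isdir(binder_dir) else repo_root


def detect_workflows(repo_root):
    """List the workflows whose configuration file is present."""
    base = config_dir(repo_root)
    return sorted(
        workflow
        for filename, workflow in CONFIG_FILES.items()
        if os.path.exists(os.path.join(base, filename))
    )


def demo():
    """Show that a binder/ folder hides configuration files in the root."""
    with tempfile.TemporaryDirectory() as repo:
        # A requirements.txt in the root signals a Python workflow.
        open(os.path.join(repo, "requirements.txt"), "w").close()
        root_only = detect_workflows(repo)
        # Once binder/ exists, only its contents are consulted.
        os.mkdir(os.path.join(repo, "binder"))
        open(os.path.join(repo, "binder", "install.R"), "w").close()
        with_binder = detect_workflows(repo)
    return root_only, with_binder
```

In this sketch, adding a ``binder/`` folder that contains only ``install.R`` changes the detected workflow from Python to R, because the ``requirements.txt`` in the root is no longer consulted.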
A web-interface to user-defined kernels and interactive sessions - BinderHub ---------------------------------------------------------------------------- JupyterHub can serve multiple interactive user sessions from pre-defined Docker images in the cloud. Repo2docker generates Docker images from the files in a git repository. BinderHub is the glue that binds these two open source tools together. It uses the building functionality of repo2docker, the kernel and user-session hosting of JupyterHub, and a Docker registry that connects these two processes together. BinderHub defines two primary patterns of interaction with this process: sharable, interactive, GUI-based sessions; and a REST API for building, requesting, and interacting with user-defined kernels. The BinderHub User Interface ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The primary pattern of interaction with BinderHub for an author is via its “build form” user interface. This form lets users point BinderHub to a public git repository. When the form is filled in and the “launch” button is clicked, BinderHub takes the following actions: 1. **Check out the repository** at the version that is specified. 2. **Check the latest commit hash**. BinderHub compares the version specified in the URL with the versions that have been previously built for this repository in the registry (if a branch is given, BinderHub checks the latest commit hash on this branch). 3. If the version has *not* been built, **launch a repo2docker process** that builds and registers an image from the repository, then returns a reference to the registered image. 4. **Create a temporary JupyterHub user account** for the visitor, with a private token. 5. **Launch a JupyterHub user session** that sources the repo2docker image in the registry. This session will serve the environment needed to run the repository, along with any GUI that the user specifies. 6. **Clean up the user session**. 
Once the user departs, Binder destroys the temporary user ID for the user's unique session, as well as their temporary files from their interactive session (steps 4 and 5). The Docker image for the repository persists, and will be used in subsequent launch attempts (as long as the repository commit hash does not change). Once a repository has been built with BinderHub, authors can then share a URL that triggers this process. URLs for BinderHub take the following form: .. code-block:: bash /v2//// For example, the URL for the ``binder-examples`` repository that builds a Julia environment is .. code-block:: bash mybinder.org/v2/gh/binder-examples/julia-python/master When a user clicks on this link, they will be taken to a brief loading page as a user session that serves this repository is created. Once this process is finished, they can immediately start interacting with the environment that the author has created. The BinderHub REST API ~~~~~~~~~~~~~~~~~~~~~~ While GUIs are preferable for most human interaction with a BinderHub, there are also situations when a programmatic or text-based interaction is preferable. For example, someone may wish to use BinderHub to request arbitrary kernels that power computations underlying a completely different GUI. For these use cases, BinderHub also provides a REST API that controls all of the steps described above. BinderHub currently provides a single REST endpoint that allows users to programmatically build and launch Binder repositories. It takes the following form: .. code-block:: bash /build// This follows a similar pattern to BinderHub's sharable URLs. For example, the following API request results in a Binder environment for the JupyterLab example repository on `mybinder.org `_: .. code-block:: bash mybinder.org/build/gh/binder-examples/jupyterlab/master Accessing this endpoint will trigger the following events: 1. Check if the image for this URL exists in the BinderHub cached image registry. If yes, launch it. 2. 
If it doesn’t exist in the image registry, check if a build is currently running. If there is **not**, then start a build process. If there **is**, then attach to the pre-existing build process.
3. Stream logs from the build process to the user.
4. If the build succeeds, contact the JupyterHub API, telling it to launch a user server with the environment that has just been built.
5. Once the server is launched, display a message showing the URL where they can connect to the notebook server (and thus connect with the Jupyter Notebook Server REST API).

Information about the process above is streamed to the user via a persistent HTTP connection with structured JSON messages via the EventStream protocol. Here's an example of the output for the above build::

    data: {"phase": "built",
           "imageName": "gcr.io/binder-prod/r2d-051...",
           "message": "Found built image, launching..."}
    data: {"phase": "launching",
           "message": "Launching..."}
    data: {"phase": "ready",
           "message": "server running at ",
           "url": "",
           "token": ""}

In this case, the user can then access the value in ``url:`` to use their Binder session (either via their browser, or programmatically via the notebook server REST API served at this URL).

.. figure:: images/nteract_ui.png
   :align: center

   play.nteract.io :cite:`Nteract_contributors2016-dg` is a GUI front-end that connects to the ``mybinder.org`` REST API. When a user opens the page, it requests a kernel from mybinder.org according to the environment chosen in the top-right menu. Once mybinder.org responds that it is ready, users can execute code that will be sent to their Binder kernel, with results displayed on the right.

There are already several examples of services that use BinderHub’s REST API to run webpages and applications that utilize arbitrary kernel execution. For example, thebelab :cite:`Min_undated-qd` makes it possible to deploy HTML with code blocks that are powered by a BinderHub kernel.
The website creator can define the environment needed to run code on the page, and the end user can generate interactive code output once they visit the webpage. There are also several applications that use BinderHub’s kernel API to power their computation. For example, the nteract :cite:`Nteract_contributors2016-dg` project uses BinderHub to run an interactive code sandbox that serves an nteract interface and can be powered by arbitrary kernels served by BinderHub. BinderHub is permissively licensed and intentionally modular in order to serve as many use cases as possible. Our goal is to provide the tools to allow any person or organization to provide arbitrary, user-defined kernels that run in the cloud. The Binder team runs one such service as a proof-of-concept of the technology, as well as digital public infrastructure that can be used to share interactive code repositories. This service runs at the URL `mybinder.org `_ and will be discussed in the final section. Mybinder.org: Maintaining and sustaining a public service --------------------------------------------------------- In addition to providing a showcase for the technical components of the BinderHub, repo2docker, and JupyterHub architecture, the Binder project is also a case study in the maintenance and deployment of an open-source service. Managing and providing a site such as `mybinder.org `_ is not trivial, with challenges in team operations, maintaining service stability without any full-time staff, and exploring models for keeping the project financially sustainable over time. This final section describes recent efforts to address some of these questions, and to explore possible outcomes for others. The Binder team (and thus `mybinder.org `_) runs on a model of transparency and openness in the tools it creates as well as the operations of `mybinder.org `_. 
The Binder team has put together several group processes and documentation to facilitate maintaining this public service, and to provide a set of resources for others who wish to do the same. For example, the Binder Site Reliability Guide [#]_ is continuously updated with team knowledge, incident reports, helper scripts, and a description of the technical deployment at `mybinder.org <https://mybinder.org>`_. There are also several data streams that the Binder team routinely makes available for others who are interested in deploying and maintaining a BinderHub service. For example, the Binder Billing [#]_ repository shows all of the cloud hardware costs for the last several months of `mybinder.org <https://mybinder.org>`_ operation. In addition, the Binder Grafana board [#]_ shows a high-level view of the status of the BinderHub, JupyterHub, and Kubernetes processes underlying the service.

.. [#] http://mybinder-sre.readthedocs.io/en/latest/
.. [#] https://github.com/jupyterhub/binder-billing
.. [#] https://grafana.mybinder.org

Cost of running the public Binder service
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Binder team has designed the public service to be as cost effective as possible. `mybinder.org <https://mybinder.org>`_ restricts users to one CPU and two GB of RAM. The service saves a great deal by not providing users with persistent storage across sessions. Users can only access public git repositories and are restricted in the kinds of network I/O that can take place. In addition, a BinderHub deployment efficiently uses its resources in order to avoid over-provisioning cloud resources.

.. figure:: images/cost_breakdown.png
   :align: center

   Cloud computing costs for running ``mybinder.org`` in 2018. The x axis shows one point per day. The number of daily unique users has consistently grown over this time, while modifications to the BinderHub codebase (as well as the cloud resources used) have kept costs relatively flat. As a result, ``mybinder.org`` currently operates at about 3 cents per user per day.

The decision to avoid the notion of a user "identity" in particular has strong effects on the cost of running a BinderHub server. Because users do not require persistent storage (e.g. the content of any changes they make to Jupyter Notebooks throughout a session), a significant cost of running a JupyterHub is avoided. In addition, a BinderHub deployment can efficiently use the resources available to it in order to avoid over-provisioning cloud resources as much as possible.

Currently, hosting `mybinder.org <https://mybinder.org>`_ costs around $180 per day, and the service sees around 7,000 users per day. This comes out to around :math:`\frac{\$180}{7000} \approx \$0.026`, or about 3 cents, per user per day. The `mybinder.org <https://mybinder.org>`_ team publishes its daily hosting costs in a public repository on GitHub :cite:`JupyterHub2018-ek`. The team hopes that this encourages other organizations to deploy BinderHub for their own purposes, since it is possible to do so in a cost-effective manner.

Finally, because Kubernetes is an open source system for managing containers, it has been deployed on a number of cloud providers as well as on self-owned hardware and virtual machines. While `mybinder.org <https://mybinder.org>`_ currently runs on the Google Cloud Platform, a BinderHub can run on any typical deployment of Kubernetes with minimal hardware requirements. This flexibility helps avoid vendor lock-in and is crucial for open source tools such as BinderHub and JupyterHub. It also makes it possible for `mybinder.org <https://mybinder.org>`_ (or other BinderHub deployments) to seek the most cost-effective option for its needs.

Models for sustainability
~~~~~~~~~~~~~~~~~~~~~~~~~

The Binder team is exploring multiple models for sustaining the public digital infrastructure of `mybinder.org <https://mybinder.org>`_, the team required to operate it, and the broader Binder ecosystem. At its current rate, the annual hosting cost of `mybinder.org <https://mybinder.org>`_ is around :math:`\$180 \times 365 \approx \$66,000`, an amount that could be sustainable with a grant-funded model.
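
The cost figures quoted in this section can be checked with a few lines of arithmetic. The sketch below uses only the approximate numbers given in the text ($180 per day and 7,000 users per day); the function and constant names are hypothetical and are not part of any Binder tooling.

.. code-block:: python

   # Approximate figures quoted in the text.
   DAILY_HOSTING_COST_USD = 180
   DAILY_USERS = 7_000


   def cost_per_user_cents(daily_cost_usd, daily_users):
       """Hosting cost per user per day, in cents."""
       return 100 * daily_cost_usd / daily_users


   def annual_cost_usd(daily_cost_usd, days=365):
       """Hosting cost over one year at a constant daily rate."""
       return daily_cost_usd * days


   per_user = cost_per_user_cents(DAILY_HOSTING_COST_USD, DAILY_USERS)
   annual = annual_cost_usd(DAILY_HOSTING_COST_USD)

   print(f"{per_user:.1f} cents per user per day")  # 2.6 cents per user per day
   print(f"${annual:,} per year")                   # $65,700 per year

Both results are consistent with the rounded values of about 3 cents per user per day and about $66,000 per year. They cover cloud hosting only and exclude the staffing costs of operating the service.
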
Operating and supporting the public digital infrastructure of `mybinder.org <https://mybinder.org>`_ requires several staff members distributed globally to provide reasonable coverage across time zones for user support and incident response. This means salary costs will require a significant amount of funding.

The Binder team is actively exploring a *federation model* for BinderHub servers. Other organizations, companies, or universities can deploy their own BinderHubs for their own users or students, either on their own hardware or on cloud providers such as Google, Amazon, or Microsoft. These organization-specific deployments could require authentication or provide access to more complex cloud resources. In this case, `mybinder.org <https://mybinder.org>`_ could serve as a hub that connects this federated network of BinderHubs together, directing the user to an organization-specific BinderHub provided that they have the proper credentials on their machine.

The future of Binder
--------------------

This paper outlines the technical infrastructure underlying `mybinder.org <https://mybinder.org>`_ and the BinderHub open source technology, including the guiding design principles and goals of the project. Binder is designed to be modular, to adapt itself to pre-existing tools and workflows in the open source community, and to be transparent in its development and operations.

Each of the tools described above is open source and permissively licensed, and we welcome contributions and input from others in the open source community. In particular, we are excited to pursue Binder’s development in the following scenarios:

1. **Reproducible publishing**. One of the core benefits of BinderHub is that it can generate deterministic environments that are linked to a code repository stored in a long-term archive like Zenodo [#]_. This makes it useful for generating static representations of the environment needed to reproduce a scientific result. Binder has already been used alongside scientific publications (:cite:`LIGO_Scientific_Collaboration_undated-xy, Ross2017-ff`, :cite:`Cornish2018-mo`, :cite:`Holdgraf2017-so`, :cite:`Rein2016-rd`, :cite:`Neyrinck2018-xy`) to provide an interactive and reproducible document with minimal added effort. In the future, the Binder project hopes to partner with academic publishers and professional societies to incorporate these reproducible environments into the publishing workflow.

   .. [#] https://zenodo.org

2. **Education and interactive materials**. Binder’s goal is to lower the barrier to interactivity, and to allow users to utilize code that is hosted in repository providers such as GitHub. Because Binder runs as a free and public service, it could be used in conjunction with academic programs to provide interactivity when teaching programming and computational material. For example, the Foundations in Data Science course at UC Berkeley already utilizes mybinder.org to provide free interactive environments for its open source textbook. The Binder team hopes to find new educational uses for the technology moving forward.

3. **Access to complex cloud infrastructure**. While mybinder.org provides users with restricted hardware to save costs, a BinderHub can be deployed on any cloud hardware that is desired. This opens the door for using BinderHub as a shared, interactive gateway that provides access to an otherwise inaccessible dataset or computational resource. For example, the GESIS Institute for Social Sciences provides a JupyterHub and BinderHub :cite:`GESIS_Leibniz_Institute_for_the_Social_Sciences_undated-sn` for their users at the university. The Binder team hopes to find new cases where BinderHub can be used as an entrypoint to provide individuals access to more sophisticated resources in the cloud.

Binder is a free, open source, and massively publicly available tool for easily creating sharable, interactive, reproducible environments in the cloud.
The Binder team is excited to see the Binder community continue to evolve and utilize BinderHub for new uses in reproducibility and interactive computing.