{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Parallelization in `ipyrad` using `ipyparallel`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the real strengths of `ipyrad` is its advanced parallelization, which can distribute work across arbitrarily large computing clusters, even when you are working interactively and remotely. This is done through the `ipyparallel` package, which is tightly linked to `ipython` and `jupyter`. When you run the command-line `ipyrad` program, all of the work of `ipyparallel` is hidden under the hood, which keeps the program user-friendly. When using the `ipyrad` API, however, we've taken the alternative approach of asking users to become familiar with the `ipyparallel` library so that they better understand how work is being distributed on their system. This allows more flexible parallelization setups, and also makes it easier for users to take advantage of `ipyparallel` for parallelizing downstream analyses, which we show many examples of in the `analysis-tools` section of the ipyrad documentation. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Required software\n", "All software required for this tutorial is installed during the ipyrad conda installation. \n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "## conda install ipyrad -c ipyrad" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Starting an `ipcluster` instance\n", "\n", "The tricky aspect of using `ipyparallel` inside of Python (i.e., in a jupyter-notebook) is that you first need to start a cluster instance by running a command-line program called ``ipcluster`` (alternatively, you can install an extension that makes it possible to start ipcluster from a tab in jupyter notebooks, but I feel the command-line tool is simpler). 
This command will start separate python \"kernels\" (instances) running on the cluster/computer and ensure that they can all talk to each other. Using advanced options you can even connect kernels across multiple computers or nodes on an HPC cluster, which we'll demonstrate. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "## ipcluster start --n=4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Start a `jupyter-notebook`\n", "When working on a laptop or a workstation, I typically open two terminals, one to start a notebook and one to start an ipcluster instance. " ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## jupyter-notebook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", " \n", " \n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Open a notebook\n", "Running `jupyter-notebook` will launch a server that opens a dashboard view in your browser, usually at `localhost:8888`. From the dashboard, go to the menu and select `new/notebook/Python` to open a new notebook. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Connect to `ipcluster` in your notebook\n", "Now from inside a notebook you can connect to the cluster using the `ipyparallel` library. Below we create a client without providing any additional arguments, which is sufficient in this case since we are using a very basic `ipcluster` setup. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6.0.2\n" ] } ], "source": [ "import ipyparallel as ipp\n", "print(ipp.__version__)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4 cores\n" ] } ], "source": [ "## connect to ipcluster using default arguments\n", "ipyclient = ipp.Client()\n", "\n", "## count how many engines are connected\n", "print('{} cores'.format(len(ipyclient)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `Profiles` in ipcluster and how ipyrad uses them\n", "Below we show an example of a common error that occurs when the Client cannot find the `ipcluster` instance, in this case because it has a different profile name. When you start an ipcluster instance, it keeps track of itself by using a specific name (its profile). The default profile is an empty string (\"\") and so this is the default profile that the `ipp.Client()` command will look for (and similarly the default profile that `ipyrad` will look for). If you change the name of the profile then you have to indicate this, like below. 
" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Waiting for connection file: ~/.ipython/profile_MPI/security/ipcontroller-client.json\n" ] }, { "ename": "IOError", "evalue": "Connection file '~/.ipython/profile_MPI/security/ipcontroller-client.json' not found.\nYou have attempted to connect to an IPython Cluster but no Controller could be found.\nPlease double-check your configuration and ensure that a cluster is running.", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mIOError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m