{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "*This notebook contains material from [PyRosetta](https://RosettaCommons.github.io/PyRosetta.notebooks);\n", "content is available [on Github](https://github.com/RosettaCommons/PyRosetta.notebooks.git).*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "< [Example of Using PyRosetta with GNU Parallel](http://nbviewer.jupyter.org/github/RosettaCommons/PyRosetta.notebooks/blob/master/notebooks/16.03-GNU-Parallel-Via-Slurm.ipynb) | [Contents](toc.ipynb) | [Index](index.ipynb) | [Part I: Parallelized Global Ligand Docking with `pyrosetta.distributed`](http://nbviewer.jupyter.org/github/RosettaCommons/PyRosetta.notebooks/blob/master/notebooks/16.05-Ligand-Docking-dask.ipynb) >
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Examples Using the `dask` Module"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### We can make use of the `dask` library to parallelize code"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Note:* This Jupyter notebook uses parallelization and is **not** meant to be executed within a Google Colab environment."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Note:* This Jupyter notebook requires the PyRosetta distributed layer which is obtained by building PyRosetta with the `--serialization` flag or installing PyRosetta from the RosettaCommons conda channel \n",
"\n",
"**Please see Chapter 16.00 for setup instructions**"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import dask\n",
"import dask.array as da\n",
"import graphviz\n",
"import logging\n",
"logging.basicConfig(level=logging.INFO)\n",
"import numpy as np\n",
"import os\n",
"import pyrosetta\n",
"import pyrosetta.distributed\n",
"import pyrosetta.distributed.dask\n",
"import pyrosetta.distributed.io as io\n",
"import random\n",
"import sys\n",
"\n",
"from dask.distributed import Client, LocalCluster, progress\n",
"from dask_jobqueue import SLURMCluster\n",
"from IPython.display import Image\n",
"\n",
"if 'google.colab' in sys.modules:\n",
" print(\"This Jupyter notebook uses parallelization and is therefore not set up for the Google Colab environment.\")\n",
" sys.exit(0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Initialize PyRosetta within this Jupyter notebook using custom command line PyRosetta flags:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:pyrosetta.distributed:maybe_init performing pyrosetta initialization: {'extra_options': '-out:level 100 -ignore_unrecognized_res 1 -ignore_waters 0 -detect_disulf 0', 'silent': True}\n",
"INFO:pyrosetta.rosetta:Found rosetta database at: /home/klimaj/anaconda3/envs/PyRosetta.notebooks/lib/python3.7/site-packages/pyrosetta/database; using it....\n",
"INFO:pyrosetta.rosetta:PyRosetta-4 2020 [Rosetta PyRosetta4.conda.linux.CentOS.python37.Release 2020.02+release.22ef835b4a2647af94fcd6421a85720f07eddf12 2020-01-05T17:31:56] retrieved from: http://www.pyrosetta.org\n",
"(C) Copyright Rosetta Commons Member Institutions. Created in JHU by Sergey Lyskov and PyRosetta Team.\n"
]
}
],
"source": [
"flags = \"\"\"-out:level 100\n",
"-ignore_unrecognized_res 1\n",
" -ignore_waters 0 \n",
" -detect_disulf 0 # Do not automatically detect disulfides\n",
"\"\"\" # These can be unformatted for user convenience, but no spaces in file paths!\n",
"pyrosetta.distributed.init(flags)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are running this example on a high-performance computing (HPC) cluster with SLURM scheduling, use the `SLURMCluster` class described below. For more information, visit https://jobqueue.dask.org/en/latest/generated/dask_jobqueue.SLURMCluster.html. **Note**: If you are running this example on a HPC cluster with a job scheduler other than SLURM, `dask_jobqueue` also works with other job schedulers: http://jobqueue.dask.org/en/latest/api.html\n",
"\n",
"The `SLURMCluster` class in the `dask_jobqueue` module is very useful! In this case, we are requesting four workers using `cluster.scale(4)`, and specifying each worker to have:\n",
"- one thread per worker with `cores=1`\n",
"- one process per worker with `processes=1`\n",
"- one CPU per task per worker with `job_cpu=1`\n",
"- a total of 4GB memory per worker with `memory=\"4GB\"`\n",
"- itself run on the \"short\" queue/partition on the SLURM scheduler with `queue=\"short\"`\n",
"- a maximum job walltime of 3 hours using `walltime=\"03:00:00\"`\n",
"- output dask files directed to `local_directory`\n",
"- output SLURM log files directed to file path and file name (and any other SLURM commands) with the `job_extra` option\n",
"- pre-initialization with the same custom command line PyRosetta flags used in this Jupyter notebook, using the `extra=pyrosetta.distributed.dask.worker_extra(init_flags=flags)` option\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"if not os.getenv(\"DEBUG\"):\n",
" scratch_dir = os.path.join(\"/net/scratch\", os.environ[\"USER\"])\n",
" cluster = SLURMCluster(\n",
" cores=1,\n",
" processes=1,\n",
" job_cpu=1,\n",
" memory=\"4GB\",\n",
" queue=\"short\",\n",
" walltime=\"02:59:00\",\n",
" local_directory=scratch_dir,\n",
" job_extra=[\"-o {}\".format(os.path.join(scratch_dir, \"slurm-%j.out\"))],\n",
" extra=pyrosetta.distributed.dask.worker_extra(init_flags=flags)\n",
" )\n",
" cluster.scale(4)\n",
" client = Client(cluster)\n",
"else:\n",
" cluster = None\n",
" client = None"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note**: The actual sbatch script submitted to the Slurm scheduler under the hood was:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"#!/usr/bin/env bash\n",
"\n",
"#SBATCH -J dask-worker\n",
"#SBATCH -p short\n",
"#SBATCH -n 1\n",
"#SBATCH --cpus-per-task=1\n",
"#SBATCH --mem=4G\n",
"#SBATCH -t 02:59:00\n",
"#SBATCH -o /net/scratch/klimaj/slurm-%j.out\n",
"\n",
"JOB_ID=${SLURM_JOB_ID%;*}\n",
"\n",
"/home/klimaj/anaconda3/envs/PyRosetta.notebooks/bin/python -m distributed.cli.dask_worker tcp://172.16.131.107:19949 --nthreads 1 --memory-limit 4.00GB --name name --nanny --death-timeout 60 --local-directory /net/scratch/klimaj --preload pyrosetta.distributed.dask.worker ' -out:level 100 -ignore_unrecognized_res 1 -ignore_waters 0 -detect_disulf 0'\n",
"\n"
]
}
],
"source": [
"if not os.getenv(\"DEBUG\"):\n",
" print(cluster.job_script())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Otherwise, if you are running this example locally on your laptop, you can still spawn workers and take advantage of the `dask` module:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# cluster = LocalCluster(n_workers=1, threads_per_worker=1)\n",
"# client = Client(cluster)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Open the `dask` dashboard, which shows diagnostic information about the current state of your cluster and helps track progress, identify performance issues, and debug failures:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Placeholder: the original HTML repr of the dask Client, a two-column table listing Client (scheduler address, dashboard link) and Cluster (workers, cores, memory) information]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"client"
]
},
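{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the cluster running, here is a minimal sketch of how it can be used (an illustrative addition, not part of the original setup above): submit a simple PyRosetta function to the workers with `client.submit` and collect the results with `client.gather`. The helper `score_sequence` and the poly-alanine test sequences below are hypothetical examples."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: score several poly-alanine sequences in parallel on the\n",
"# dask workers. `score_sequence` and the test sequences are hypothetical.\n",
"import pyrosetta.distributed.tasks.score as score\n",
"\n",
"def score_sequence(seq):\n",
"    # Build a PackedPose from the sequence, score it with the default score\n",
"    # function, and return the resulting total_score\n",
"    packed_pose = score.ScorePoseTask()(io.pose_from_sequence(seq))\n",
"    return packed_pose.scores[\"total_score\"]\n",
"\n",
"if not os.getenv(\"DEBUG\"):\n",
"    futures = [client.submit(score_sequence, \"A\" * n) for n in (10, 20, 30, 40)]\n",
"    print(client.gather(futures))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `dask`, `graphviz`, and `Image` imports above also support building and inspecting lazy task graphs. The following sketch (again illustrative, with hypothetical helper functions) composes a tiny graph with `dask.delayed`, renders it to a PNG with `.visualize()` (which requires `graphviz`), and computes it on the active client."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: a tiny lazy task graph built with dask.delayed\n",
"@dask.delayed\n",
"def double(x):\n",
"    return 2 * x\n",
"\n",
"@dask.delayed\n",
"def add(x, y):\n",
"    return x + y\n",
"\n",
"if not os.getenv(\"DEBUG\"):\n",
"    total = add(double(random.randint(0, 10)), double(random.randint(0, 10)))\n",
"    total.visualize(filename=\"task_graph.png\")  # write the task graph to a PNG\n",
"    print(total.compute())  # executes on the distributed client\n",
"    display(Image(\"task_graph.png\"))"
]
}
],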
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:PyRosetta.notebooks]",
"language": "python",
"name": "conda-env-PyRosetta.notebooks-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}