--- title: "Infrastructure and software environment" subtitle: "IFB National Network of Computing Resources" author: "Jacques van Helden" date: '`r Sys.Date()`' output: slidy_presentation: self_contained: false fig_caption: yes fig_height: 6 fig_width: 7 highlight: tango incremental: no keep_md: yes smaller: yes theme: cerulean toc: yes widescreen: yes beamer_presentation: colortheme: dolphin fig_caption: yes fig_height: 6 fig_width: 7 fonttheme: structurebold highlight: tango incremental: no keep_tex: no slide_level: 2 theme: Montpellier toc: yes html_document: self_contained: false fig_caption: yes highlight: zenburn theme: cerulean toc: yes toc_depth: 3 toc_float: yes pdf_document: fig_caption: yes highlight: zenburn toc: yes toc_depth: 3 ioslides_presentation: self_contained: false css: slides.css fig_caption: yes fig_height: 6 fig_width: 7 highlight: tango smaller: yes toc: yes widescreen: yes word_document: toc: yes toc_depth: 3 # font-import: http://fonts.googleapis.com/css?family=Risque font-family: Garamond transition: linear --- ```{r include=FALSE, echo=FALSE, eval=TRUE} library(knitr) options(width = 300) knitr::opts_chunk$set( fig.width = 7, fig.height = 5, fig.align = "center", fig.path = "figures/nncr-env_", size = "tiny", echo = FALSE, eval = TRUE, warning = FALSE, message = FALSE, results = TRUE, comment = "") # knitr::asis_output("\\footnotesize") # dir.slides <- "~/IFB/NNCR/using_IFB_NNCR/slides/" # setwd(dir.slides) ``` ## Command-line tools in bioinformatics - A large proportion of bioinformatics tools are available only on the command line. Moreover, even for tools equipped with a graphical user interface (e.g. BLAST, clustal, ...) the use of command-line can be necessary for some projects - Enables to automate the tasks - Managing repetitive processes: apply the same task to numerous data files or with many different options - Managing complex processes combining many tasks (workflows) - High performance computing (HPC) - running tasks that require huge resources of computing and storage - Traceability, reproducibility, reusability - traceability: keeping track of each step and parameter used to produce a result - reproducuibility; enabling to re-run the analysis and reproduce the same results - reusability: enabling to run the same analysis with different data ## Working environments Most bioinformatics tools can be installed on Unix-like operating systems (Linux, Mac OS X), and can be used in different environments. - Terminal of your own computer (Linux, Mac OS X) - Virtual Machine (e.g. [VirtualBox](https://www.virtualbox.org/), [VWMare](https://www.vmware.com/)) - Containers ([Docker](https://www.docker.com/), [singularity](https://www.sylabs.io/singularity/)) - Terminal of a remote computer (via an `ssh` connection) - Bureau Virtuel ## Virtual machines - Components - host: can be your own computer (note: also used to deploy services on HPC computers) - host operating system (Linux, Mac, Windows) - virtual machine (VM) : emulation of an other computer, that runs on the host machine - operating system of the VM: Linux, Windows, ... - hypervisor (= monitor): software that runs the virtual machines on the host - Typical applications - run a Linux OS on a Windows or Macintosh PC - test a software under different operating systems - isolate a service from the host system (security, resource segmentation) - Examples of hypervisors - VirtualBox () - VWMare () ## Container-based virtualisation - Applications run on a shared operating system without requiring a virtual machine - Advantages - modular combination of applications and libraries ("à la carte") - less resource-consuming than virtual machines - Container management software - Docker () - singularity () ## Virtual machines versus containers ```{r out.width="90%", fig.cap="**Comparaison of virtualisation solutions.** Right: Virtual Machine; Center: Docker container; right: Singularity container. Source: Greg Kurtzer keynote at HPC Advisory Council 2017 @ Stanford"} include_graphics("images/vm_vs_container.png") ``` ## Installation of software tools in the local operating system - Advantages - Immediate availability of the tools - Direct invocation by the native operating system (more efficient) - Weaknesses - Dependences (system libraries, language libraries, other executables) - Incompatibilities between dependences of different tools - The installation of some tools and libraries requires admin rights - OS-dependency of package managers ([package managers](https://en.wikipedia.org/wiki/Package_manager)): apt-get, yum, ports, brew, ... - Some applications and libraries are available only on some package managers. ## Conda packages - Doc : - Advantages - A multi-platform package manager (Linux, Mac OS X, Windows) - All the installations can be done at the user-level (no need to be admin) - Community project supported by a large community (computer scientists, statisticians, bioinformaticians, ...) $\rightarrow$ ever-increasing number of supported tools an libraries - Continuous integration $\rightarrow$ very fast response to requests - Very precise management of the dependencies and version - Seamingless management of uninstallation for no more required software - Trying it is adopting it! - Weaknesses - If each used installs each tool and dependencies in her/his own account, this creates redundancy and waste of storage space. ## Computer cluster A cluster is a set of computers (denotes as **nodes**) that work together and can be seen as a single system. Clusters are generally used to run parallel computing ```{r out.width="60%", fig.cap="**Grappe de serveurs.** En avant-plan: *Homo sapiens* tentant d'établir une interaction physique avec les machines. Source: "} include_graphics("images/600px-IBM_Blue_Gene_P_supercomputer.jpg") ``` ## Parallel computing Parallel computing consists in running simultaneously a series of processes on a computer system. Tasks can be distributed on several Computer Processing Units (**CPUs**) of a same computer and/or on several computers (cluster). The distribution of tasks on nodes and CPUs relies on a **job scheduler**. Users submit jobs (in the form of command lines or scripts) to the scheduler, which manages their execution on the different nodes and CPUs.