---
title: "Infrastructure and software environment"
subtitle: "IFB National Network of Computing Resources"
author: "Jacques van Helden"
date: '`r Sys.Date()`'
output:
slidy_presentation:
self_contained: false
fig_caption: yes
fig_height: 6
fig_width: 7
highlight: tango
incremental: no
keep_md: yes
smaller: yes
theme: cerulean
toc: yes
widescreen: yes
beamer_presentation:
colortheme: dolphin
fig_caption: yes
fig_height: 6
fig_width: 7
fonttheme: structurebold
highlight: tango
incremental: no
keep_tex: no
slide_level: 2
theme: Montpellier
toc: yes
html_document:
self_contained: false
fig_caption: yes
highlight: zenburn
theme: cerulean
toc: yes
toc_depth: 3
toc_float: yes
pdf_document:
fig_caption: yes
highlight: zenburn
toc: yes
toc_depth: 3
ioslides_presentation:
self_contained: false
css: slides.css
fig_caption: yes
fig_height: 6
fig_width: 7
highlight: tango
smaller: yes
toc: yes
widescreen: yes
word_document:
toc: yes
toc_depth: 3
# font-import: http://fonts.googleapis.com/css?family=Risque
font-family: Garamond
transition: linear
---
```{r include=FALSE, echo=FALSE, eval=TRUE}
library(knitr)
options(width = 300)
knitr::opts_chunk$set(
fig.width = 7, fig.height = 5,
fig.align = "center",
fig.path = "figures/nncr-env_",
size = "tiny",
echo = FALSE, eval = TRUE,
warning = FALSE, message = FALSE,
results = TRUE, comment = "")
# knitr::asis_output("\\footnotesize")
# dir.slides <- "~/IFB/NNCR/using_IFB_NNCR/slides/"
# setwd(dir.slides)
```
## Command-line tools in bioinformatics
- A large proportion of bioinformatics tools are available only on the command line. Moreover, even for tools equipped with a graphical user interface (e.g. BLAST, Clustal, ...), the command line can be necessary for some projects.
- Task automation
    - managing repetitive processes: applying the same task to numerous data files, or with many different options
    - managing complex processes that combine many tasks (workflows)
- High-performance computing (HPC)
    - running tasks that require huge computing and storage resources
- Traceability, reproducibility, reusability
    - traceability: keeping track of each step and parameter used to produce a result
    - reproducibility: enabling the analysis to be re-run and the same results to be reproduced
    - reusability: enabling the same analysis to be run with different data
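The repetitive-process case above can be sketched as a simple shell loop; the directory and file names below are hypothetical, created only for illustration:

```shell
# Create a small (hypothetical) data directory with two FASTA files
mkdir -p data
printf '>seq1\nACGT\n>seq2\nGGCC\n' > data/sample1.fasta
printf '>seqA\nTTAA\n' > data/sample2.fasta

# Apply the same task (counting sequences) to every file of the directory
for fasta in data/*.fasta; do
  n=$(grep -c '^>' "$fasta")   # each '>' header line starts one sequence
  echo "$fasta: $n sequences"
done
```

The same pattern scales from two files to thousands, which is precisely what a graphical interface cannot do conveniently.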
## Working environments
Most bioinformatics tools can be installed on Unix-like operating systems (Linux, Mac OS X), and can be used in different environments.

- Terminal of your own computer (Linux, Mac OS X)
- Virtual machine (e.g. [VirtualBox](https://www.virtualbox.org/), [VMware](https://www.vmware.com/))
- Containers ([Docker](https://www.docker.com/), [Singularity](https://www.sylabs.io/singularity/))
- Terminal of a remote computer (via an `ssh` connection)
- Virtual desktop ("Bureau Virtuel")
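Working in the terminal of a remote computer typically starts with an `ssh` session. A minimal sketch follows; the user and host names are hypothetical placeholders, not actual IFB addresses:

```shell
# Open an interactive session on a (hypothetical) remote server
ssh jdoe@cluster.example.org

# Run a single command remotely, without an interactive session
ssh jdoe@cluster.example.org 'ls -lh ~/data'

# Copy a local file to the remote server with scp
scp results.tsv jdoe@cluster.example.org:~/data/
```

Once connected, the remote shell behaves like a local terminal, so the same command-line tools and scripts apply.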
## Virtual machines
- Components
    - host: can be your own computer (note: VMs are also used to deploy services on HPC computers)
    - host operating system (Linux, Mac, Windows)
    - virtual machine (VM): emulation of another computer, which runs on the host machine
    - operating system of the VM: Linux, Windows, ...
    - hypervisor (= monitor): software that runs the virtual machines on the host
- Typical applications
    - run a Linux OS on a Windows or Macintosh PC
    - test a software tool under different operating systems
    - isolate a service from the host system (security, resource segmentation)
- Examples of hypervisors
    - [VirtualBox](https://www.virtualbox.org/)
    - [VMware](https://www.vmware.com/)
## Container-based virtualisation
- Applications run on a shared operating system without requiring a virtual machine
- Advantages
    - modular combination of applications and libraries ("à la carte")
    - less resource-consuming than virtual machines
- Container management software
    - [Docker](https://www.docker.com/)
    - [Singularity](https://www.sylabs.io/singularity/)
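As an illustration, pulling an image and running a command inside it might look as follows; the `ubuntu:22.04` image is a generic example, not a bioinformatics-specific one:

```shell
# Docker: pull an image and run a command in an ephemeral container,
# mounting the current directory so the container can see local data
docker pull ubuntu:22.04
docker run --rm -v "$(pwd)":/data ubuntu:22.04 ls /data

# Singularity: build an image from the same Docker Hub source and run it;
# unlike Docker, execution does not require root privileges
singularity pull docker://ubuntu:22.04
singularity exec ubuntu_22.04.sif ls "$PWD"
```

The absence of a root requirement at run time is one reason Singularity is often preferred on shared HPC clusters.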
## Virtual machines versus containers
```{r out.width="90%", fig.cap="**Comparison of virtualisation solutions.** Left: virtual machine; center: Docker container; right: Singularity container. Source: Greg Kurtzer keynote at HPC Advisory Council 2017 @ Stanford"}
include_graphics("images/vm_vs_container.png")
```
## Installation of software tools in the local operating system
- Advantages
    - Immediate availability of the tools
    - Direct invocation by the native operating system (more efficient)
- Weaknesses
    - Dependencies (system libraries, language libraries, other executables)
    - Incompatibilities between the dependencies of different tools
    - The installation of some tools and libraries requires admin rights
    - [Package managers](https://en.wikipedia.org/wiki/Package_manager) are OS-dependent: apt-get, yum, ports, brew, ...
    - Some applications and libraries are available only through certain package managers.
## Conda packages
- Doc:
- Advantages
    - A multi-platform package manager (Linux, Mac OS X, Windows)
    - All installations can be done at the user level (no need to be admin)
    - Community project supported by a large community (computer scientists, statisticians, bioinformaticians, ...) $\rightarrow$ ever-increasing number of supported tools and libraries
    - Continuous integration $\rightarrow$ very fast response to requests
    - Very precise management of dependencies and versions
    - Seamless uninstallation of software that is no longer required
    - Trying it is adopting it!
- Weaknesses
    - If each user installs each tool and its dependencies in her/his own account, this creates redundancy and wastes storage space.
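A typical conda workflow might look as follows; the environment name is illustrative, and `samtools` and `bwa` are examples of tools distributed through the Bioconda channel:

```shell
# Create an isolated environment and install tools from the Bioconda
# and conda-forge channels (no admin rights required)
conda create -y -n mapping -c bioconda -c conda-forge samtools bwa

# Activate the environment to make the tools available
conda activate mapping
samtools --version

# Deactivate it, or remove it entirely when no longer needed
conda deactivate
conda env remove -n mapping
```

Because each environment is isolated, two tools with incompatible dependencies can coexist on the same machine, each in its own environment.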
## Computer cluster
A cluster is a set of computers (denoted as **nodes**) that work together and can be seen as a single system. Clusters are generally used for parallel computing.
```{r out.width="60%", fig.cap="**Server cluster.** Foreground: *Homo sapiens* attempting to establish a physical interaction with the machines. Source: "}
include_graphics("images/600px-IBM_Blue_Gene_P_supercomputer.jpg")
```
## Parallel computing
Parallel computing consists of running a series of processes simultaneously on a computer system.
Tasks can be distributed over several Central Processing Units (**CPUs**) of the same computer and/or over several computers (cluster).
The distribution of tasks over nodes and CPUs relies on a **job scheduler**. Users submit jobs (in the form of command lines or scripts) to the scheduler, which manages their execution on the different nodes and CPUs.
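For example, on clusters managed by the SLURM scheduler (common on HPC systems, though which scheduler a given cluster runs should be checked locally), a job is described in a batch script whose `#SBATCH` directives request resources. The script name, resource values, and command below are hypothetical:

```shell
#!/bin/bash
# count_reads.sh -- hypothetical SLURM batch script
#SBATCH --job-name=count_reads       # job name shown by the scheduler
#SBATCH --cpus-per-task=4            # CPUs requested on one node
#SBATCH --mem=8G                     # memory requested
#SBATCH --time=01:00:00              # maximum run time (hh:mm:ss)
#SBATCH --output=count_reads_%j.log  # log file (%j = job ID)

# The commands below run on the node allocated by the scheduler
grep -c '^>' data/sample1.fasta
```

The script would be submitted with `sbatch count_reads.sh`, and its progress monitored with `squeue -u $USER`.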