---
title: "Run/manage workflows"
date: "Last update: `r format(Sys.time(), '%d %B, %Y')`"
vignette: |
  %\VignetteEncoding{UTF-8}
  %\VignetteIndexEntry{systemPipeR: Workflow design and reporting generation environment}
  %\VignetteEngine{knitr::rmarkdown}
fontsize: 14pt
editor_options:
  chunk_output_type: console
type: docs
weight: 3
---

```{r setup, echo=TRUE, message=FALSE, warning=FALSE}
suppressPackageStartupMessages({
    library(systemPipeR)
})
```

Until this point, you have learned how to create an SPR workflow [interactively](../step_interactive)
or use a template to [import/update](../step_import) the workflow. Next, we will
learn how to run and manage the workflow.

First, let's set up the workflow using the example workflow template. For real
production purposes, we recommend checking out the more complex templates
[here](/spr_wf/).

```{r eval=TRUE, include=FALSE}
# cleaning
try(unlink(".SPRproject", recursive = TRUE), TRUE)
try(unlink("data", recursive = TRUE), TRUE)
try(unlink("results", recursive = TRUE), TRUE)
try(unlink("param", recursive = TRUE), TRUE)
```

For demonstration purposes here, we still use the [simple workflow](https://raw.githubusercontent.com/systemPipeR/systemPipeR.github.io/main/static/en/sp/spr/sp_run/spr_simple_wf.md).

```{r}
sal <- SPRproject()
sal <- importWF(sal, file_path = system.file("extdata", "spr_simple_wf.Rmd", package = "systemPipeR"))
sal
```

## Before running

It is good practice to check whether the required command-line tools are installed
before running the workflow. SPR offers a few ways to look up tool information,
and the utilities below cover them.

### List/check all tools and modules

Two functions in SPR, `listCmdTools` and `listCmdModules`, are designed to
list/check whether the required tools/modules are installed. Both take the
`SYSargsList` workflow object as input and, by default, list all the
tools/modules required by the workflow as a dataframe.

```{r}
listCmdTools(sal)
```

However, if no modular system is installed, `listCmdModules` will just print a
warning message.

```{r}
listCmdModules(sal)
```

If `check_path = TRUE` is used for `listCmdTools`, the function will additionally
check whether the listed tools are in the PATH (callable) and fill the results
into the third column instead of using `NA`.

```{r}
listCmdTools(sal, check_path = TRUE)
```

The same applies to `listCmdModules(sal, check_module = TRUE)`: module
availability is checked when the argument is `TRUE`. The machine we use to render
this document does not have a modular system installed, so the result is not
displayed. Try the following if a modular system is accessible.

```{r eval=FALSE}
listCmdModules(sal, check_module = TRUE)
```

The `listCmdTools` function also has the `check_module` argument. When it is
`TRUE`, `listCmdTools` also performs the `listCmdModules` check. However, please
note that even if `check_module = TRUE`, `listCmdTools` always returns the check
results for tools only, not for modules.

#### Tool check in `importWF`

The easiest way to use the two functions mentioned above is through `importWF`.
As you may have noticed, at the end of the import, the tool check and module
check are performed automatically, as shown in the screenshot below. The only
difference is that `importWF` returns the `SYSargsList` object, whereas
`listCmdTools` and `listCmdModules` return an invisible dataframe.

![](../listCmdTools.png)

### Check single tool

`listCmdTools` and `listCmdModules` check tools in a batch, and can only check
tools required by the current workflow. If you have a tool of interest that is
not listed in your workflow, the following functions will be helpful. In other
cases, you may want to know which tool/module is used in a certain step.

#### Single tool/module in a workflow

There are a few accessor functions in SPR that list the tools/modules of a
certain step.

```{r}
# list tool of a step
baseCommand(stepsWF(sal)[[3]])
# list a module of a step
modules(stepsWF(sal)[[3]])
```

There is no module required for this simple workflow. Please see the screenshot
of a complex RNAseq example below:

![](../module_list_rnaseq.png)

#### Generic tools

For any other generic tools that may not be in a workflow, `tryCMD` can be used
to check whether a command-line tool is installed in the PATH.

```{r}
tryCMD(command="R")
tryCMD(command="hisat2")
tryCMD(command="blastp")
```

In the examples above, installed tools return the message `"All set up, proceed!"`,
and tools that are not installed return an error message, like the `blastp`
example above.

If you see the error message:

1. Check whether the tool is really installed by typing the same command in a
   terminal. If you cannot call it even from a terminal, you need to (re)install it.
2. If the tool is available in the terminal but you still see the error message
   in `tryCMD`, it is most likely due to a PATH problem. Make sure you
   [export the path](https://askubuntu.com/questions/720678/what-does-export-path-somethingpath-mean).
   Or try the following to set it up in R:

```{r eval=FALSE}
old_path <- Sys.getenv("PATH")
Sys.setenv(PATH = paste(old_path, "path/to/tool_directory", sep = ":"))
```

#### Generic modules

For any other generic modules that may not be in a workflow, the `module`
function group will be helpful. This part requires a modular system to be
installed on the current OS. Usually this is done by the admins of the HPCC.
Read more about [modules{blk}](https://modules.sourceforge.net/).

This group of functions can not only check for the presence of certain modules,
but also perform other module operations, such as loading/unloading modules or
listing all currently loaded modules. See more details in the help file
`?module`, or [here](https://systempipe.org/spr/funcs/spr/reference/moduleload.html).

```{r eval=FALSE}
module(action_type, module_name = NULL)
moduleload(module_name)
moduleUnload(module_name)
modulelist()
moduleAvail()
moduleClear()
moduleInit()
```

> Every modular system will be specialized to fit the needs of a given computing
> cluster. Therefore, there is a chance the functions above will not work on your
> particular system. If so, please contact your system admins for a solution,
> load the PATH of required tools using other methods, and check the PATH as
> mentioned above.

## Start running

To run the workflow, call the `runWF` function, which will execute all steps in
the workflow container.

```{r runWF, eval=TRUE}
sal <- runWF(sal)
sal
```

![](../runwf.png)

We can see the workflow status changed from `Pending` to `Success`.

## Run selected steps

The `runWF` function allows the user to choose one or multiple steps to execute
using the `steps` argument. However, the workflow dependency graph must be
followed: if a selected step depends on one or more previous steps that were not
executed, the execution will fail.

```{r runWF_error, eval=TRUE}
sal <- runWF(sal, steps = c(1,3))
```

We do not see any problem here because we have already run the entire workflow,
so all dependencies are satisfied. Let's reset the workflow and start from
scratch to see what happens if one or more dependencies are not met and we try
to run selected steps.

```{r error=FALSE}
sal <- SPRproject(overwrite = TRUE)
sal <- importWF(sal, file_path = system.file("extdata", "spr_simple_wf.Rmd", package = "systemPipeR"))
sal
sal <- runWF(sal, steps = c(1,3))
```

We can see that workflow step 3 is not run because of the dependency problem:

> ## export_iris
> ## have been not executed yet.

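
The dependency rule described above can be sketched in base R. This is only an illustration of the idea, not how `runWF` is implemented: the helper `unmet_deps`, the dependency map `deps`, and the `status` vector are all hypothetical, and the step names and their dependencies are assumed for the example rather than read from `sal`.

```{r}
# Assumed dependency map for illustration: step name -> steps it depends on
deps <- list(
    load_library = character(0),
    export_iris  = "load_library",
    gzip         = "export_iris",
    gunzip       = "gzip",
    stats        = "gunzip"
)

# Assumed status of every step before the run (nothing executed yet)
status <- c(load_library = "Pending", export_iris = "Pending",
            gzip = "Pending", gunzip = "Pending", stats = "Pending")

# Hypothetical helper: for each selected step, report dependencies that are
# neither already successful nor satisfied earlier in the same selection
unmet_deps <- function(selected, deps, status) {
    done <- names(status)[status == "Success"]
    out <- list()
    for (step in selected) {
        missing <- setdiff(deps[[step]], done)
        if (length(missing) > 0) {
            out[[step]] <- missing
        } else {
            done <- c(done, step)  # optimistic: assume the step will succeed
        }
    }
    out
}

# Same selection as steps = c(1, 3) above
selected <- names(deps)[c(1, 3)]
unmet <- unmet_deps(selected, deps, status)
unmet
```

Under these assumptions, the helper reports `export_iris` as the missing dependency of `gzip`, which mirrors the message printed above. An empty list would mean the selection is safe to run.
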
## Optional steps

By default all steps are `'mandatory'`, but you can change a step to `'optional'`:

```{r eval=FALSE}
SYSargsList(..., run_step = 'optional')
# or
LineWise(..., run_step = 'optional')
```

When the workflow is run by `runWF`, the default `'ALL'` runs all steps, but you
can choose to run only mandatory steps with `'mandatory'` or only optional steps
with `'optional'`.

```{r eval=FALSE}
# default
sal <- runWF(sal, run_step = "ALL")
# only mandatory
sal <- runWF(sal, run_step = "mandatory")
# only optional
sal <- runWF(sal, run_step = "optional")
```

## Force to run steps

- Force the execution of the steps, even if the status of the step is
  `'Success'` and all the expected `outfiles` exist.

```{r eval=FALSE}
sal <- runWF(sal, force = TRUE, ...)
```

- The `warning.stop` and `error.stop` arguments control whether the workflow
  stops on warnings and errors, respectively. Setting either one to `FALSE` lets
  the run continue past that kind of problem. The call below continues past
  warnings but still stops on errors.

```{r eval=FALSE}
sal <- runWF(sal, warning.stop = FALSE, error.stop = TRUE, ...)
```

- To force a step to run without checking its dependencies, we can use
  `ignore.dep = TRUE`. For example, let's run step 3, which could not be run
  because of the dependency problem.

```{r include=FALSE}
try(unlink("results", recursive = TRUE), TRUE)
try(dir.create("results", recursive = TRUE), TRUE)
```

```{r eval=TRUE, error=TRUE}
sal <- runWF(sal, steps = 3, ignore.dep = TRUE)
```

We can see the workflow failed because the required files from step 2 are
missing and we jumped directly to step 3. Therefore, skipping dependencies is
possible in SPR but **not recommended**.

## Workflow environment

When the project is initialized by the `SPRproject` function, an environment is
created to store all objects generated during the workflow preprocess code
execution or `LineWise` R code execution.
This environment can be accessed as follows:

```{r eval=TRUE, include=FALSE}
sal <- runWF(sal)
```

```{r runWF_env, eval=TRUE}
viewEnvir(sal)
```

We can see there are three objects, `"df"`, `"plot"`, and `"stats"`, and they
are created during the step 5 `LineWise` code execution. To access these
variables interactively from your global environment, use the `copyEnvir` method.

```{r collapse=TRUE}
copyEnvir(sal, c("df", "plot"))
exists("df", envir = globalenv())
exists("plot", envir = globalenv())
```

Now we see they are in our global environment, and we are free to do other
operations on them.

### Save environment

The workflow execution allows saving this environment for future recovery:

```{r runWF_saveenv, eval=FALSE}
sal <- runWF(sal, saveEnv = TRUE)
```

> Depending on what variables you have saved in the environment, it can become
> expensive (take up much space and be slow to load back in on resume).

## Parallelization on clusters

This section of the tutorial provides an introduction to the usage of the
_`systemPipeR`_ features on a cluster.

So far, all workflow steps have been run on the same computer where we manage
the workflow instance. This is called running in the `management` session.

Alternatively, the computation can be greatly accelerated by processing many
files in parallel using several compute nodes of a cluster, where a
scheduling/queuing system is used for load balancing. This is called running in
the `compute` session.

The behavior is controlled by the `run_session` argument in `SYSargsList`.

```{r eval=FALSE}
SYSargsList(..., run_session = "management")
# or
SYSargsList(..., run_session = "compute")
```

By default, all steps are run on `"management"`, and we can change it to use
`"compute"`. However, simply changing the value will not work; it must also be
coupled with computing resources (see below for what 'resources' means). The
resources need to be appended to the step by the `run_remote_resources` argument.

```{r eval=FALSE}
SYSargsList(..., run_session = "compute", run_remote_resources = list(...))
```

This is how to configure the running session for each step, but generally we can
use a more convenient method, `addResources`, to add resources (continue reading
below).

### Resources

Resources here refer to computer resources, like CPU, RAM, time limit, _etc._
The `resources` list object provides the number of independent parallel cluster
processes defined under the `Njobs` element in the list. The following example
will run 18 processes in parallel, each using 4 CPU cores, on a Slurm scheduler.
If the resources available on a cluster allow running all 18 processes at the
same time, then the shown sample submission will utilize a total of 72 CPU cores.

Note, `runWF` can be used with most queueing systems as it is based on utilities
from the `batchtools` package, which supports the use of template files (_`*.tmpl`_)
for defining the run parameters of different schedulers. To run the following
code, one needs to have both a `conffile` (see _`.batchtools.conf.R`_ samples
[here](https://mllg.github.io/batchtools/)) and a `template` file (see _`*.tmpl`_
samples [here](https://github.com/mllg/batchtools/tree/master/inst/templates))
for the queueing system available on a system. The following example uses the
sample `conffile` and `template` files for the Slurm scheduler provided by this
package.

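
The core count quoted above follows from simple arithmetic, sketched here in base R. The list `example_resources` only copies the two numbers from the text, and `available_cores` is a made-up cluster limit; the "waves" estimate is a rough assumption, since real schedulers start jobs as slots free up rather than in fixed batches.

```{r}
# Illustrative values only, mirroring the numbers quoted in the text
example_resources <- list(Njobs = 18, ncpus = 4)

# Cores needed if every job runs at the same time
total_cores <- example_resources$Njobs * example_resources$ncpus

# Hypothetical limit: how many jobs fit at once on a 40-core allocation
available_cores <- 40
concurrent_jobs <- min(example_resources$Njobs,
                       available_cores %/% example_resources$ncpus)

# Rough number of batches needed to get through all jobs
waves <- ceiling(example_resources$Njobs / concurrent_jobs)

c(total_cores = total_cores, concurrent_jobs = concurrent_jobs, waves = waves)
```

A quick estimate like this can help choose `Njobs` and `ncpus` values that match what the cluster can actually provide.
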
The resources can be appended when the step is generated, or added later, as in
the following example using the `addResources` function:

Before adding resources

```{r collapse=TRUE}
runInfo(sal)[['runOption']][['gzip']]
```

```{r runWF_cluster, eval=TRUE}
resources <- list(conffile=".batchtools.conf.R",
                  template="batchtools.slurm.tmpl",
                  Njobs=18,
                  walltime=120, ## minutes
                  ntasks=1,
                  ncpus=4,
                  memory=1024, ## Mb
                  partition = "short" # a partition (queue) called 'short'
                  )
sal <- addResources(sal, c("gzip"), resources = resources)
```

After adding resources

```{r collapse=TRUE}
runInfo(sal)[['runOption']][['gzip']]
```

You can see the step option is automatically changed from 'management' to 'compute'.

## Workflow status

To check the summary of the workflow, we can use:

```{r show_statusWF, eval=TRUE, collapse=TRUE}
sal
```

To access more details about the workflow instances, we can use the `statusWF` method:

```{r statusWF, eval=TRUE, collapse=TRUE}
statusWF(sal)
```

To access the options of each workflow step, for example, whether it is a
mandatory or optional step, where it is stored in the template, where to run the
step, _etc_., we can use the `runInfo` function.

```{r collapse=TRUE}
runInfo(sal)
```

## Visualize workflow

_`systemPipeR`_ workflow instances can be visualized with the `plotWF` function.

This function plots the selected workflow instance, and the following
information is displayed on the plot:

- Workflow structure (dependency graphs between different steps);
- Workflow step status, *e.g.* `Success`, `Error`, `Pending`, `Warnings`;
- Sample status and statistics;
- Workflow timing: running duration time.

If no argument is provided, the basic plot will automatically detect width,
height, layout, plot method, branches, _etc_.

```{r, eval=TRUE}
plotWF(sal, width = "80%", rstudio = TRUE)
```

We will discuss more advanced uses of the `plotWF` function in the next section.

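
The step statuses shown in the workflow summary and in the plot can also be tallied by hand. The following is a minimal base-R sketch that uses a made-up named status vector instead of values extracted from `sal`; the vector `step_status`, its step names, and `status_levels` are assumptions for illustration only.

```{r}
# Made-up step statuses, not read from a real workflow object
step_status <- c(load_library = "Success", export_iris = "Success",
                 gzip = "Success", gunzip = "Pending", stats = "Pending")

# Status labels as listed for the plot; unused labels are counted as zero
status_levels <- c("Pending", "Success", "Warnings", "Error")
status_counts <- table(factor(step_status, levels = status_levels))

# Steps that still need to run (or be rerun)
unfinished <- names(step_status)[step_status != "Success"]

status_counts
unfinished
```

Wrapping the vector in `factor()` with fixed levels keeps every status in the tally even when no step currently has it, which makes summaries from different runs easier to compare.
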
## High-level project control

If you want to resume or restart a project that was initialized in the past,
the `SPRproject` function allows this operation.

**Resume**

With the `resume` option, it is possible to load the `SYSargsList` object in R
and resume the analysis. Please make sure to provide the `logs.dir` location and
the corresponding `YAML` file name if the default names were not used when the
project was created.

```{r SPR_resume, eval=FALSE}
sal <- SPRproject(resume = TRUE, logs.dir = ".SPRproject",
                  sys.file = ".SPRproject/SYSargsList.yml")
```

If you chose to save the environment in the last analysis, you can recover all
the objects created in that particular session. The `SPRproject` function allows
this with the `load.envir` argument. Please note that the environment is saved
only if you ran the workflow in the last session (`runWF()`).

```{r resume_load, eval=FALSE}
sal <- SPRproject(resume = TRUE, load.envir = TRUE)
```

**Restart**

The `resume` option will keep all previous logs in the folder; however, if you
want to clean the execution history (delete all the log files) and restart the
workflow, the `restart=TRUE` option can be used.

```{r restart_load, eval=FALSE}
sal <- SPRproject(restart = TRUE, load.envir = FALSE)
```

**Overwrite**

The last and most drastic option of the `SPRproject` function is to `overwrite`
the logs and the `SYSargsList` object. This option will delete the hidden folder
and the information in the `SYSargsList.yml` file. It will not delete any
parameter files or any results created in previous runs. Please use with caution.

```{r SPR_overwrite, eval=FALSE}
sal <- SPRproject(overwrite = TRUE)
```

## Session

```{r}
sessionInfo()
```

```{r eval=TRUE, include=FALSE}
# cleaning
try(unlink(".SPRproject", recursive = TRUE), TRUE)
try(unlink("data", recursive = TRUE), TRUE)
try(unlink("results", recursive = TRUE), TRUE)
try(unlink("param", recursive = TRUE), TRUE)
```