---
title: "Automate Creation of CWL Instructions"
author: "Author: Daniela Cassol, Le Zhang, Thomas Girke"
date: "Last update: `r format(Sys.time(), '%d %B, %Y')`"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: true
      smooth_scroll: true
    toc_depth: 3
    fig_caption: yes
    code_folding: show
    number_sections: true
fontsize: 14pt
bibliography: bibtex.bib
weight: 19
type: docs
---
Source code downloads:
[ [.Rmd](https://raw.githubusercontent.com/tgirke/GEN242//main/content/en/tutorials/cmdToCwl/cmdToCwl.Rmd) ]
[ [.R](https://raw.githubusercontent.com/tgirke/GEN242//main/content/en/tutorials/cmdToCwl/cmdToCwl.R) ]
```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
A central concept for designing workflows within the `systemPipeR` environment
is the usage of workflow management containers. For describing analysis
workflows in a generic and flexible manner the [Common Workflow
Language](https://www.commonwl.org/) (CWL) has been adopted throughout the
environment including the workflow management containers [@Amstutz2016-ka].
Using the CWL community standard in `systemPipeR` has many advantages. For
instance, the integration of CWL allows running `systemPipeR` workflows from a
single specification instance either entirely from within R, from various
command line wrappers (e.g., cwl-runner) or from other languages (e.g., Bash or
Python). An important feature of `systemPipeR`'s CWL interface is that it
provides two options to run command-line tools and workflows based on CWL.
First, one can run CWL in its native way via an R-based wrapper utility for
`cwl-runner` or `cwltool` (CWL-based approach). Second, one can run workflows
using CWL's command-line and workflow instructions from within R (R-based
approach). In the latter case the same CWL workflow definition files (e.g.
*.cwl* and *.yml*) are used, but they are rendered and executed entirely with R
functions defined by `systemPipeR`. CWL thus serves mainly as a command-line and
workflow definition format rather than as the execution software for running workflows.
Moreover, `systemPipeR` provides several convenience functions that are useful
for designing and debugging workflows, such as a command-line rendering
function to retrieve the exact command-line strings for each data set and
processing step prior to running them.
This tutorial briefly introduces the basics of how CWL defines command-line
syntax. Next, it describes how to use CWL within `systemPipeR` for designing,
modifying and running workflows.
## Load package
Recent versions of R (>=4.0.0), Bioconductor (>=3.14) and `systemPipeR` (>=2.0.8)
need to be used to gain access to the functions described in this tutorial.
```{r load_library, eval=TRUE, include=FALSE}
library(systemPipeR)
```
## CWL command line specifications
CWL command line specifications are written in [YAML](http://yaml.org/) format.
In CWL, files with the extension `.cwl` define the parameters of a chosen
command line step or workflow, while files with the extension `.yml` define
the input variables of command line steps.
The following first introduces the basic structure of `.cwl` files.
```{r}
dir_path <- system.file("extdata/cwl/example/", package="systemPipeR")
cwl <- yaml::read_yaml(file.path(dir_path, "example.cwl"))
```
- The `cwlVersion` component specifies the version of CWL that is used here.
- The `class` component declares the usage of a command-line tool.
Note, CWL has another `class` called `Workflow`. The latter defines one
or more command-line tools, while `CommandLineTool` is limited to one.
```{r}
cwl[1:2]
```
- The `baseCommand` component contains the base name of the software to be executed.
```{r}
cwl[3]
```
- The `inputs` component provides the input information required for the command-line software. Important sub-components of this section are:
    - `id`: each input has an id assigning a name;
    - `type`: input type value (e.g. string, int, long, float, double,
      File, Directory or Any);
    - `inputBinding`: optional component indicating if the input
      parameter should appear on the command line. If missing, the
      parameter will not appear on the command line.
```{r}
cwl[4]
```
- The `outputs` component provides a list of the outputs expected after running a command-line tool.
  Important sub-components of this section are:
    - `id`: each output has an id assigning a name;
    - `type`: output type value (e.g. string, int, long, float, double,
      File, Directory, Any or `stdout`);
    - `outputBinding`: defines how to set the output values. The `glob` component defines the name of the output value.
```{r}
cwl[5]
```
- `stdout`: specifies a `filename` for capturing standard output.
  Note that the syntax used here takes advantage of the `inputs` section,
  combining the `results_path` and `SampleName` parameters to construct the `filename` of the output.
```{r}
cwl[6]
```
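Putting the components together, the overall shape of a `CommandLineTool` description can be sketched as a nested R list, similar in structure to what `yaml::read_yaml` returns. The field values below are illustrative assumptions for an `echo` tool and are not copied from `example.cwl`.

```r
# Hypothetical CommandLineTool description as a nested list
# (values are illustrative, not the contents of example.cwl).
tool <- list(
  cwlVersion  = "v1.0",
  class       = "CommandLineTool",
  baseCommand = "echo",
  inputs = list(
    message      = list(type = "string", inputBinding = list(position = 1)),
    SampleName   = list(type = "string"),
    results_path = list(type = "Directory")
  ),
  outputs = list(
    string = list(type = "stdout")
  ),
  stdout = "$(inputs.results_path.basename)/$(inputs.SampleName).txt"
)

# Only inputs that carry an inputBinding appear on the command line.
on_cmdline <- names(Filter(function(x) !is.null(x$inputBinding), tool$inputs))
names(tool)
on_cmdline
```

In this sketch only `message` is placed on the command line, while `SampleName` and `results_path` are used solely to build the `stdout` file name.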
Next, the structure and content of the `.yml` files will be introduced. The `.yml` file
provides the parameter values for the `.cwl` components described above.
The following example defines three parameters.
```{r}
yaml::read_yaml(file.path(dir_path, "example_single.yml"))
```
Importantly, if an input component is defined in the corresponding *.cwl* file, then the
required value needs to be provided by the corresponding component of the *.yml* file.
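This requirement can be checked with a few lines of base R before loading a workflow. The following is a minimal sketch on hypothetical lists that stand in for the parsed `.cwl` inputs and `.yml` values; it is not a `systemPipeR` function.

```r
# Hypothetical input declarations (.cwl side) and values (.yml side).
cwl_inputs <- list(
  message      = list(type = "string"),
  SampleName   = list(type = "string"),
  results_path = list(type = "Directory")
)
yml_values <- list(
  message    = "Hello World!",
  SampleName = "M1"
)

# Inputs declared in the .cwl file that have no value in the .yml file.
missing_inputs <- setdiff(names(cwl_inputs), names(yml_values))
if (length(missing_inputs) > 0) {
  message("No value provided for: ", paste(missing_inputs, collapse = ", "))
}
```

Here `results_path` is reported because it is declared as an input but has no matching entry among the values.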
### How to connect CWL description files within `systemPipeR`
A `SYSargsList` container stores several `SYSargs2` instances in a list-like object containing
all instructions required for processing a set of input files with a single or many command-line
steps within a workflow (i.e. several tools of one software or several independent software tools).
A single `SYSargs2` object is created and fully populated with the constructor functions
`loadWF` and `renderWF`.
The following imports a `.cwl` file (here `example.cwl`) for running a simple `echo Hello World`
example where a string `Hello World` will be printed to stdout and redirected to a file named
`M1.txt` located under a subdirectory named `results`.
```{r fromFile, eval=TRUE}
HW <- loadWF(wf_file="example.cwl", input_file="example_single.yml",
dir_path = dir_path)
HW <- renderWF(HW)
HW
cmdlist(HW)
```
The above example is limited to running only one command-line call, corresponding to one
input file, e.g. representing a single experimental sample. To scale to many command-line
calls, e.g. when processing many input samples, a simple solution offered by `systemPipeR`
is to use `variables`, one for each parameter with many inputs.
The following gives a simple example for defining and processing many inputs.
```{r}
yml <- yaml::read_yaml(file.path(dir_path, "example.yml"))
yml
```
Under the `message` and `SampleName` parameters, variables are used that will be populated
by values provided by a third file called `targets`.
The following shows the structure of a simple `targets` file.
```{r}
targetspath <- system.file("extdata/cwl/example/targets_example.txt", package="systemPipeR")
read.delim(targetspath, comment.char = "#")
```
With help of a `targets` file, one can define all input files, sample ids and
experimental variables relevant for an analysis workflow. In the above example,
strings defined under the `Message` column will be passed on to the `echo`
command-line tool. In addition, each command line will be assigned a label or
id specified under the `SampleName` column. Any number of additional columns can be
added as needed.
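As a minimal sketch of this layout, the following writes a small tab-delimited targets file to a temporary location and reads it back with the same `read.delim` call pattern used above. The rows and the extra `Factor` column are hypothetical and serve only as an illustration.

```r
# Write a small, hypothetical targets file to a temporary location.
targets_file <- tempfile(fileext = ".txt")
writeLines(c(
  "# Lines starting with '#' are skipped as comments when reading",
  "Message\tSampleName\tFactor",
  "Hello World!\tM1\tA",
  "Hello USA!\tM2\tA",
  "Bonjour le monde!\tM3\tB"
), targets_file)

# Read it back; the comment line is ignored and the next line is the header.
targets <- read.delim(targets_file, comment.char = "#")
targets
```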
Note that the usage of `targets` files is optional when using
`systemPipeR`'s CWL interface. Since targets files are very efficient for
organizing experimental variables, their usage is highly encouraged and well
supported in `systemPipeR`.
#### Connect parameter and targets files
The constructor functions create a `SYSargs2` instance from three input files:

- `.cwl` file path assigned to the `wf_file` argument
- `.yml` file path assigned to the `input_file` argument
- `targets` file path assigned to the `targets` argument
As mentioned above, the `targets` file is optional. The connection
between the input variables (here defined by the `input_file` argument) and the
`targets` file is defined under the `inputvars` argument. A named vector is
required, where each element name needs to match a column name in the
`targets` file, and each value must match the name of the corresponding *.yml* variable.
This mapping is used to replace the CWL variables and construct the command lines, usually
one for each input sample.
For consistency the pattern `_XXXX_` is used for variable naming in the `.yml` file, where the
name matches the corresponding column name in the targets file. This pattern is recommended
for easy identification but not enforced.
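The substitution idea behind `inputvars` can be sketched in base R as follows. This is only an illustration of the mapping, not how `renderWF` is implemented; the targets rows, the `.yml` values and the shape of the final command string are assumptions.

```r
# Hypothetical targets table and .yml values containing placeholders.
targets <- data.frame(
  Message    = c("Hello World!", "Hello USA!"),
  SampleName = c("M1", "M2"),
  stringsAsFactors = FALSE
)
yml <- list(message = "_STRING_", SampleName = "_SAMPLE_", results_path = "results")

# Names are targets columns, values are the placeholders used in the .yml file.
inputvars <- c(Message = "_STRING_", SampleName = "_SAMPLE_")

# For one targets row, replace every placeholder with the value of its column.
render_row <- function(i) {
  values <- yml
  for (column in names(inputvars)) {
    values <- lapply(values, function(v) {
      gsub(inputvars[[column]], targets[[column]][i], v, fixed = TRUE)
    })
  }
  sprintf("echo %s > %s/%s.txt", values$message, values$results_path, values$SampleName)
}

# One command string per targets row, labeled by sample name.
cmds <- vapply(seq_len(nrow(targets)), render_row, character(1))
names(cmds) <- targets$SampleName
cmds
```

Each row of the targets table yields one command string, which mirrors the one-command-per-sample behavior described above.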
The following imports a `.cwl` file (same example as above) for running
the `echo` example. However, now several command-line calls are constructed with the
information provided under the `Message` column of the targets file, which is passed on to
the matching component in the `.yml` file.
```{r fromFile_example, eval=TRUE}
HW_mul <- loadWorkflow(targets = targetspath, wf_file="example.cwl",
input_file="example.yml", dir_path = dir_path)
HW_mul <- renderWF(HW_mul, inputvars = c(Message = "_STRING_", SampleName = "_SAMPLE_"))
HW_mul
cmdlist(HW_mul)
```
Figure 1: Connectivity between CWL param files and targets files.
## Auto-creation of CWL param files from command-line
Users can define the command line in a pseudo-bash script format. The following uses
the command line for `HISAT2` as an example.
```{r cmd, eval=TRUE}
command <- "
hisat2 \
-S \
-x \
-k \
-min-intronlen \
-max-intronlen \
-threads \
-U
"
```
### Define prefix and defaults
- The first line is the base command. Each subsequent line is an argument with its default value.
- Argument lines starting with `-` or `--` followed by a non-space delimited
  letter/word will be interpreted as a prefix, e.g. `-S` or `--min`. Lines without
  such a prefix will be rendered as non-prefix arguments.
- All default settings are placed inside `<...>`. Omit this part for arguments without values
  such as `--verbose`.
- Inside `<...>`, the first keyword is the type of the input: `F` stands for "File", while
  "int" and "string" are used unchanged.
- Optional: the keyword `out` follows the type, separated by a `,` (comma), and indicates
  that this argument is also a CWL output.
- Use `:` to separate keywords and default values. Any non-space separated value after the `:`
  will be treated as the default value.
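To make these rules concrete, the following base-R sketch parses a command string that follows them into prefix, type, output flag and default value. It is an illustration of the syntax only and not how `createParamFiles` is implemented; the file paths and values in `command_example` are hypothetical.

```r
# Hypothetical command string following the rules above
# (paths and values are placeholders chosen for illustration).
command_example <- "
hisat2
-S <F, out: ./results/sample.sam>
-x <F: ./data/genome.fasta>
-k <int: 1>
--verbose
"

# Split into non-empty, trimmed lines; the first one is the base command.
cmd_lines <- trimws(strsplit(command_example, "\n", fixed = TRUE)[[1]])
cmd_lines <- cmd_lines[nzchar(cmd_lines)]
base_command <- cmd_lines[1]

# Parse one argument line into prefix, type, output flag and default value.
parse_arg <- function(line) {
  prefix   <- sub("\\s*<.*$", "", line)
  has_spec <- grepl("<.*>", line)
  spec     <- if (has_spec) sub("^.*<(.*)>.*$", "\\1", line) else ""
  keywords <- trimws(strsplit(sub(":.*$", "", spec), ",", fixed = TRUE)[[1]])
  data.frame(
    prefix  = prefix,
    type    = if (has_spec) keywords[1] else NA_character_,
    is_out  = "out" %in% keywords,
    default = if (has_spec) trimws(sub("^[^:]*:", "", spec)) else NA_character_,
    stringsAsFactors = FALSE
  )
}
parsed <- do.call(rbind, lapply(cmd_lines[-1], parse_arg))
parsed
```

In this sketch `-S` is flagged as an output because of the `out` keyword, and `--verbose` has neither a type nor a default because it carries no `<...>` part.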
### `createParamFiles` Function
The `createParamFiles` function accepts as input a command line provided in the string syntax described above.
The function returns a `cwl` object with the following components:
- `BaseCommand`: Specifies the program to execute
- `Inputs`: Defines the input parameters of the process
- `Outputs`: Defines the parameters representing the output of the process
The fourth component is the original command-line provided as input.
In interactive mode, the function will verify if everything is correct and
ask the user to proceed. The user can answer "no" and provide more information
at the string input level. Another question is whether to save the generated CWL
results to the corresponding `.cwl` and `.yml` files. When running the function
in non-interactive mode, the results will be returned without asking for confirmation
by the user.
```{r}
cmd <- createParamFiles(command, writeParamFiles = FALSE)
```
If the user chooses not to save the `param` files in the `createParamFiles` call directly,
then the `writeParamFiles` function can be used to do this in a separate step.
```{r saving, eval=TRUE}
writeParamFiles(cmd, overwrite = TRUE)
```
### Accessor functions
#### Print components
Note, the results of `createParamFiles` are stored in a `SYSargs2` container. The individual
components can be accessed as follows.
```{r}
printParam(cmd, position = "baseCommand") ## Print a baseCommand section
printParam(cmd, position = "outputs")
printParam(cmd, position = "inputs", index = 1:2) ## Print by index
printParam(cmd, position = "inputs", index = -1:-2) ## Negative indices exclude the given entries
cmdlist(cmd)
```
#### Subsetting the command-line
```{r}
cmd2 <- subsetParam(cmd, position = "inputs", index = 1:2, trim = TRUE)
cmdlist(cmd2)
cmd2 <- subsetParam(cmd, position = "inputs", index = c("S", "x"), trim = TRUE)
cmdlist(cmd2)
```
#### Replacing existing argument
```{r}
cmd3 <- replaceParam(cmd, "base", index = 1, replace = list(baseCommand = "bwa"))
cmdlist(cmd3)
```
```{r}
new_inputs <- list(
"new_input1" = list(type = "File", preF="-b", yml ="myfile"),
"new_input2" = "-L "
)
cmd4 <- replaceParam(cmd, "inputs", index = 1:2, replace = new_inputs)
cmdlist(cmd4)
```
#### Adding new arguments
```{r}
new_inputs <- list(
"new_input1" = list(type = "File", preF="-b1", yml ="myfile1"),
"new_input2" = list(type = "File", preF="-b2", yml ="myfile2"),
"new_input3" = "-b3 "
)
cmd5 <- appendParam(cmd, "inputs", index = 1:2, append = new_inputs)
cmdlist(cmd5)
cmd6 <- appendParam(cmd, "inputs", index = 1:2, after=0, append = new_inputs)
cmdlist(cmd6)
```
#### Editing `output` param
```{r}
new_outs <- list(
"sam_out" = ""
)
cmd7 <- replaceParam(cmd, "outputs", index = 1, replace = new_outs)
output(cmd7)
```
## Version information
```{r sessionInfo}
sessionInfo()
```
## Funding
This project is funded by NSF award [ABI-1661152](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1661152).
## References