
Nextflow training
Table of Contents
1. Environment setup
1.1. Requirements
1.2. Nextflow installation
1.3. Training material
2. Get started with Nextflow
2.1. Basic concepts
2.1.1. Processes and channels
2.1.2. Execution abstraction
2.1.3. Scripting language
2.2. Your first script
2.3. Modify and resume
2.4. Pipeline parameters
3. Channels
3.1. Channel types
3.1.1. Queue channel
3.1.2. Value channels
3.2. Channel factories
3.2.1. value
3.2.2. from
3.2.3. of
3.2.4. fromList
3.2.5. fromPath
3.2.6. fromFilePairs
3.2.7. fromSRA
4. Processes
4.1. Script
4.1.1. Script parameters
4.1.2. Conditional script
4.2. Inputs
4.2.1. Input values
4.2.2. Input files
4.2.3. Input path
4.2.4. Combine input channels
4.2.5. Input repeaters
4.3. Outputs
4.3.1. Output values
4.3.2. Output files
4.3.3. Multiple output files
4.3.4. Dynamic output file names
4.3.5. Composite inputs and outputs
4.4. When
4.5. Directives
4.5.1. Exercise

4.6. Organise outputs


4.6.1. PublishDir directive
4.6.2. Manage semantic sub-directories
5. Operators
5.1. Basic example
5.2. Basic operators
5.2.1. view
5.2.2. map
5.2.3. into
5.2.4. mix
5.2.5. flatten
5.2.6. collect
5.2.7. groupTuple
5.2.8. join
5.2.9. branch
5.3. More resources
6. Groovy basic structures and idioms
6.1. Printing values
6.2. Comments
6.3. Variables
6.4. Lists
6.5. Maps
6.6. String interpolation
6.7. Multi-line strings
6.8. If statement
6.9. For statement
6.10. Functions
6.11. Closures
6.12. More resources
7. Simple RNA-Seq pipeline
7.1. Define the pipeline parameters
7.1.1. Exercise
7.1.2. Exercise
7.1.3. Recap
7.2. Create transcriptome index file
7.2.1. Exercise
7.2.2. Exercise
7.2.3. Exercise
7.2.4. Recap
7.3. Collect read files by pairs
7.3.1. Exercise
7.3.2. Exercise
7.3.3. Recap
7.4. Perform expression quantification
7.4.1. Exercise
7.4.2. Exercise
7.4.3. Recap
7.5. Quality control

7.5.1. Exercise
7.5.2. Recap
7.6. MultiQC report
7.6.1. Recap
7.7. Handle completion event
7.8. Bonus!
7.9. Custom scripts
7.9.1. Recap
7.10. Metrics and reports
7.11. Run a project from GitHub
7.12. More resources
8. Manage dependencies & containers
8.1. Docker hands-on
8.1.1. Run a container
8.1.2. Pull a container
8.1.3. Run a container in interactive mode
8.1.4. Your first Dockerfile
8.1.5. Build the image
8.1.6. Add a software package to the image
8.1.7. Run Salmon in the container
8.1.8. File system mounts
8.1.9. Upload the container in the Docker Hub (bonus)
8.1.10. Run a Nextflow script using a Docker container
8.2. Singularity
8.2.1. Create a Singularity image
8.2.2. Running a container
8.2.3. Import a Docker image
8.2.4. Run a Nextflow script using a Singularity container
8.2.5. The Singularity Container Library
8.3. Conda/Bioconda packages
8.3.1. Bonus Exercise
8.4. BioContainers
8.5. More resources
9. Nextflow configuration
9.1. Configuration file
9.1.1. Config syntax
9.1.2. Config variables
9.1.3. Config comments
9.1.4. Config scopes
9.1.5. Config params
9.1.6. Config env
9.1.7. Config process
9.1.8. Config Docker execution
9.1.9. Config Singularity execution
9.1.10. Config Conda execution
10. Deployment scenarios
10.1. Cluster deployment
10.2. Managing cluster resources

10.2.1. Workflow wide resources
10.2.2. Configure process by name
10.2.3. Configure process by labels
10.2.4. Configure multiple containers
10.3. Configuration profiles
10.4. Cloud deployment
10.5. Volume mounts
10.6. Custom job definition
10.7. Custom image
10.8. Launch template
10.9. Hybrid deployments
11. Execution cache and resume
11.1. How resume works
11.2. Work directory
11.3. How to organize in silico experiments
11.4. Execution provenance
11.5. Resume troubleshooting
12. Error handling & troubleshooting
12.1. Execution errors debugging
12.2. Ignore errors
12.3. Automatic error fail-over
12.4. Retry with backoff
12.5. Dynamic resources allocation

1. Environment setup
1.1. Requirements
Nextflow can be used on any POSIX-compatible system (Linux, macOS, etc.). It requires Bash and Java 8 (or later, up to 12)
(http://www.oracle.com/technetwork/java/javase/downloads/index.html) to be installed.

Optional requirements for this workshop:

Docker (https://www.docker.com/) engine 1.10.x (or later)

Singularity (https://github.com/sylabs/singularity) 2.5.x (or later, optional)

Conda (https://conda.io/) 4.5 (or later, optional)

Graphviz (http://www.graphviz.org/) (optional)

AWS CLI (https://aws.amazon.com/cli/) (optional)

AWS Batch computing environment properly configured (optional)

1.2. Nextflow installation


Install the latest version of Nextflow by copying and pasting the following snippet in a terminal window:

BASH
curl get.nextflow.io | bash
mv nextflow ~/bin

Check that the installation worked by running the following command:

BASH
nextflow info

1.3. Training material


Download the training material by copying and pasting the following command in the terminal:

BASH
aws s3 sync s3://seqeralabs.com/public/nf-training .

Don’t miss the ending dot in the above command.

2. Get started with Nextflow


2.1. Basic concepts
Nextflow is a workflow orchestration engine and a programming domain-specific language (DSL) that eases the writing of data-intensive computational pipelines.

It is designed around the idea that the Linux platform is the lingua franca of data science. Linux provides many simple
but powerful command-line and scripting tools that, when chained together, facilitate complex data manipulations.

Nextflow extends this approach, adding the ability to define complex program interactions and a high-level parallel
computational environment based on the dataflow programming model. Nextflow core features are:

enable workflows portability & reproducibility

simplify parallelization and large scale deployment

easily integrate existing tools, systems & industry standards

2.1.1. Processes and channels


In practice a Nextflow pipeline script is made by joining together different processes. Each process can be written in
any scripting language that can be executed by the Linux platform (Bash, Perl, Ruby, Python, etc.).

Processes are executed independently and are isolated from each other, i.e. they do not share a common (writable)
state. The only way they can communicate is via asynchronous FIFO queues, called channels in Nextflow.

Any process can define one or more channels as input and output. The interaction between these processes, and
ultimately the pipeline execution flow itself, is implicitly defined by these input and output declarations.

2.1.2. Execution abstraction


While a process defines what command or script has to be executed, the executor determines how that script is
actually run in the target system.

If not otherwise specified, processes are executed on the local computer. The local executor is very useful for pipeline
development and testing purposes, but for real world computational pipelines an HPC or cloud platform is often
required.

In other words, Nextflow provides an abstraction between the pipeline’s functional logic and the underlying execution
system. Thus it is possible to write a pipeline once and to seamlessly run it on your computer, a grid platform, or the
cloud, without modifying it, by simply defining the target execution platform in the configuration file.
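For example, switching the pipeline from local execution to a cluster scheduler requires only a configuration setting, not a change to the pipeline code. A minimal sketch, assuming a SLURM cluster is available:

NEXTFLOW
// nextflow.config
process.executor = 'slurm'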

It provides out-of-the-box support for major batch schedulers and cloud platforms:

Grid engine (Open/Sun/Univa)

IBM Platform LSF

Linux SLURM

PBS Works

Torque

Moab

HTCondor

Amazon Batch

Google Life Sciences

Kubernetes

2.1.3. Scripting language

Nextflow implements a declarative domain-specific language (DSL) that simplifies the writing of complex data
analysis workflows as an extension of a general-purpose programming language.

This approach makes Nextflow very flexible because it combines, in the same computing environment, the benefits of a
concise DSL that handles recurrent use cases with ease, and the flexibility and power of a general-purpose
programming language to handle corner cases, which may be difficult to implement using a declarative approach.

In practical terms Nextflow scripting is an extension of the Groovy programming language (https://groovy-lang.org/),
which in turn is a super-set of the Java programming language. Groovy can be considered "Python for Java", in that it
simplifies the writing of code and is more approachable.

2.2. Your first script


Copy the following example into your favourite text editor and save it to a file named hello.nf :

NEXTFLOW
#!/usr/bin/env nextflow

params.greeting = 'Hello world!'
greeting_ch = Channel.from(params.greeting)

process splitLetters {

    input:
    val x from greeting_ch

    output:
    file 'chunk_*' into letters

    """
    printf '$x' | split -b 6 - chunk_
    """
}

process convertToUpper {

    input:
    file y from letters.flatten()

    output:
    stdout into result

    """
    cat $y | tr '[a-z]' '[A-Z]'
    """
}

result.view{ it.trim() }

This script defines two processes. The first splits a string into files containing chunks of 6 characters. The second
receives these files and transforms their contents to uppercase letters. The resulting strings are emitted on the result
channel and the final output is printed by the view operator.

Execute the script by entering the following command in your terminal:

CMD
nextflow run hello.nf

It will output something similar to the text shown below:

CMD
N E X T F L O W ~ version 20.01.0
Launching `hello.nf` [marvelous_plateau] - revision: 63f8ad7155
[warm up] executor > local
executor > local (3)
[19/c2f873] process > splitLetters [100%] 1 of 1 ✔
[05/5ff9f6] process > convertToUpper [100%] 2 of 2 ✔
HELLO
WORLD!

You can see that the first process is executed once, and the second twice. Finally the result string is printed.

It’s worth noting that the process convertToUpper is executed in parallel, so there’s no guarantee that the instance
processing the first split (the chunk Hello) will be executed before the one processing the second split (the chunk
world!).

Thus, it is perfectly possible that you will get the final result printed out in a different order:

WORLD!
HELLO

The hexadecimal numbers, like 22/7548fa , identify the unique process execution. These numbers
are also the prefix of the directories where each process is executed. You can inspect the files
produced by them by changing to the directory $PWD/work and using these numbers to find the
process-specific execution path.

2.3. Modify and resume


Nextflow keeps track of all the processes executed in your pipeline. If you modify some parts of your script, only the
processes that are actually changed will be re-executed. The execution of the processes that are not changed will be
skipped and the cached result used instead.

This helps a lot when testing or modifying part of your pipeline without having to re-execute it from scratch.

For the sake of this tutorial, modify the convertToUpper process in the previous example, replacing the process script
with the string rev $y , so that the process looks like this:

NEXTFLOW
process convertToUpper {

    input:
    file y from letters.flatten()

    output:
    stdout into result

    """
    rev $y
    """
}

Then save the file with the same name, and execute it by adding the -resume option to the command line:

nextflow run hello.nf -resume

It will print output similar to this:

N E X T F L O W ~ version 20.01.0
Launching `hello.nf` [naughty_tuckerman] - revision: 22eaa07be4
[warm up] executor > local
executor > local (2)
[19/c2f873] process > splitLetters [100%] 1 of 1, cached: 1 ✔
[a7/a410d3] process > convertToUpper [100%] 2 of 2 ✔
olleH
!dlrow

You will see that the execution of the process splitLetters is actually skipped (the process ID is the same), and its
results are retrieved from the cache. The second process is executed as expected, printing the reversed strings.

The pipeline results are cached by default in the directory $PWD/work . Depending on your script,
this folder can take up a lot of disk space. If you are sure you won’t resume your pipeline execution,
clean this folder periodically.

2.4. Pipeline parameters


Pipeline parameters are declared by prepending the prefix params , separated by a dot character, to a variable
name. Their value can be specified on the command line by prefixing the parameter name with a double dash
character, i.e. --paramName
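For example, a parameter with a default value can be declared in a script and overridden at launch time, as in this minimal sketch:

NEXTFLOW
params.greeting = 'Hello world!'   // default value, overridden by --greeting on the command line
println params.greeting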

For the sake of this tutorial, you can try to execute the previous example specifying a different input string parameter,
as shown below:

nextflow run hello.nf --greeting 'Bonjour le monde!'

The string specified on the command line will override the default value of the parameter. The output will look like
this:

N E X T F L O W ~ version 20.01.0
Launching `hello.nf` [wise_stallman] - revision: 22eaa07be4
[warm up] executor > local
executor > local (4)
[48/e8315b] process > splitLetters [100%] 1 of 1 ✔
[01/840ca7] process > convertToUpper [100%] 3 of 3 ✔
uojnoB
m el r
!edno

3. Channels
Channels are a key data structure of Nextflow that allows the implementation of reactive-functional oriented
computational workflows based on the Dataflow (https://en.wikipedia.org/wiki/Dataflow_programming) programming
paradigm.

They are used to logically connect tasks to each other or to implement functional-style data transformations.

3.1. Channel types


Nextflow distinguishes between two different kinds of channels: queue channels and value channels.

3.1.1. Queue channel


A queue channel is an asynchronous unidirectional FIFO queue which connects two processes or operators.

What does asynchronous mean? Operations are non-blocking.

What does unidirectional mean? Data flows from a producer to a consumer.

What does FIFO mean? The data is guaranteed to be delivered in the same order as it is produced.

A queue channel is implicitly created by process output definitions or by using channel factory methods such as
Channel.from (https://www.nextflow.io/docs/latest/channel.html#from) or Channel.fromPath
(https://www.nextflow.io/docs/latest/channel.html#frompath).

Try the following snippets:

NEXTFLOW
ch = Channel.from(1,2,3)
println(ch)   // (1)
ch.view()     // (2)

(1) Use the built-in println function to print the ch variable.

(2) Apply the view method to the ch channel, thus printing each item emitted by the channel.

Exercise
Try to execute this snippet; it will produce an error message.

NEXTFLOW
ch = Channel.from(1,2,3)
ch.view()
ch.view()

A queue channel can have one and exactly one producer and one and exactly one consumer.

3.1.2. Value channels


A value channel, a.k.a. singleton channel, is by definition bound to a single value and can be read an unlimited
number of times without consuming its content.

NEXTFLOW
ch = Channel.value('Hello')
ch.view()
ch.view()
ch.view()

It prints:

Hello
Hello
Hello

3.2. Channel factories


3.2.1. value
The value factory method is used to create a value channel. An optional non-null argument can be specified to bind
the channel to a specific value. For example:

NEXTFLOW
ch1 = Channel.value()                  // (1)
ch2 = Channel.value( 'Hello there' )   // (2)
ch3 = Channel.value( [1,2,3,4,5] )     // (3)

(1) Creates an empty value channel.

(2) Creates a value channel and binds a string to it.
(3) Creates a value channel and binds a list object to it that will be emitted as a sole emission.

3.2.2. from
The factory Channel.from allows the creation of a queue channel with the values specified as argument.

NEXTFLOW
ch = Channel.from( 1, 3, 5, 7 )
ch.view{ "value: $it" }

The first line in this example creates a variable ch which holds a channel object. This channel emits the values
specified as a parameter in the from method. Thus the second line will print the following:

value: 1
value: 3
value: 5
value: 7

Method Channel.from will be deprecated and replaced by Channel.of (see below).

3.2.3. of
The method Channel.of works in a similar manner to Channel.from , though it fixes some inconsistent behaviors of
the latter and provides better handling of ranges of values. For example:

NEXTFLOW
Channel
    .of(1..23, 'X', 'Y')
    .view()
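The range is automatically expanded, so the snippet above should print the numbers from 1 to 23, each on its own line, followed by X and Y.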

3.2.4. fromList

The method Channel.fromList creates a channel emitting the elements provided by a list object specified as an
argument:

NEXTFLOW
list = ['hello', 'world']

Channel
    .fromList(list)
    .view()

3.2.5. fromPath
The fromPath factory method creates a queue channel emitting one or more files matching the specified glob pattern.

NEXTFLOW
Channel.fromPath( '/data/big/*.txt' )

This example creates a channel and emits as many items as there are files with a txt extension in the /data/big
folder. Each element is a file object implementing the Path (https://docs.oracle.com/javase/8/docs/api/java/nio/file/Paths.html)
interface.

Two asterisks, i.e. ** , work like * but cross directory boundaries. This syntax is generally used
for matching complete paths. Curly brackets specify a collection of sub-patterns.

Table 1. Available options

Name Description

glob When true interprets characters * , ? , [] and {} as glob wildcards, otherwise handles them
as normal characters (default: true )

type Type of paths returned, either file , dir or any (default: file )

hidden When true includes hidden files in the resulting paths (default: false )

maxDepth Maximum number of directory levels to visit (default: no limit)

followLinks When true it follows symbolic links during directory tree traversal, otherwise they are
managed as files (default: true )

relative When true returned paths are relative to the top-most common directory (default: false )

checkIfExists When true throws an exception if the specified path does not exist in the file system (default:
false )

Learn more about the glob pattern syntax at this link (https://docs.oracle.com/javase/tutorial/essential/io/fileOps.html#glob).

Exercise
Use the Channel.fromPath method to create a channel emitting all files with the suffix .fq in the data/ggal/
directory and any of its subdirectories, then print the file names.
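One possible solution uses the ** wildcard to cross directory boundaries (a sketch, assuming the training data layout):

NEXTFLOW
Channel
    .fromPath('data/ggal/**.fq')
    .view { it.name }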

3.2.6. fromFilePairs
The fromFilePairs method creates a channel emitting the file pairs matching a glob pattern provided by the user.
The matching files are emitted as tuples in which the first element is the grouping key of the matching pair and the
second element is the list of files (sorted in lexicographical order).

NEXTFLOW
Channel
    .fromFilePairs('/my/data/SRR*_{1,2}.fastq')
    .view()

It will produce an output similar to the following:

[SRR493366, [/my/data/SRR493366_1.fastq, /my/data/SRR493366_2.fastq]]
[SRR493367, [/my/data/SRR493367_1.fastq, /my/data/SRR493367_2.fastq]]
[SRR493368, [/my/data/SRR493368_1.fastq, /my/data/SRR493368_2.fastq]]
[SRR493369, [/my/data/SRR493369_1.fastq, /my/data/SRR493369_2.fastq]]
[SRR493370, [/my/data/SRR493370_1.fastq, /my/data/SRR493370_2.fastq]]
[SRR493371, [/my/data/SRR493371_1.fastq, /my/data/SRR493371_2.fastq]]

The glob pattern must contain at least a star wildcard character.

Table 2. Available options

Name Description

type Type of paths returned, either file , dir or any (default: file )

hidden When true includes hidden files in the resulting paths (default: false )

maxDepth Maximum number of directory levels to visit (default: no limit)

followLinks When true it follows symbolic links during directory tree traversal, otherwise they are
managed as files (default: true )

size Defines the number of files each emitted item is expected to hold (default: 2). Set to -1 for any.

flat When true the matching files are produced as sole elements in the emitted tuples (default:
false ).

checkIfExists When true throws an exception if the specified path does not exist in the file system (default:
false )

Exercise
Use the fromFilePairs method to create a channel emitting all pairs of fastq reads in the data/ggal/ directory and
print them.

Then use the flat:true option and compare the output with the previous execution.
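One possible solution (a sketch, assuming read pairs named like gut_1.fq and gut_2.fq):

NEXTFLOW
Channel
    .fromFilePairs('data/ggal/*_{1,2}.fq')
    .view()

Channel
    .fromFilePairs('data/ggal/*_{1,2}.fq', flat: true)
    .view()

With flat: true the files are emitted as separate tuple elements instead of being grouped in a list.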

3.2.7. fromSRA
The Channel.fromSRA method makes it possible to query the NCBI SRA (https://www.ncbi.nlm.nih.gov/sra) archive and
returns a channel emitting the FASTQ files matching the specified selection criteria.

The query can be a project ID or accession number(s) supported by the NCBI ESearch API
(https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch). For example the following snippet:

NEXTFLOW
Channel
    .fromSRA('SRP043510')
    .view()

prints:

TEXT
[SRR1448794, ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR144/004/SRR1448794/SRR1448794.fastq.gz]
[SRR1448795, ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR144/005/SRR1448795/SRR1448795.fastq.gz]
[SRR1448792, ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR144/002/SRR1448792/SRR1448792.fastq.gz]
[SRR1448793, ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR144/003/SRR1448793/SRR1448793.fastq.gz]
[SRR1910483, ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR191/003/SRR1910483/SRR1910483.fastq.gz]
[SRR1910482, ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR191/002/SRR1910482/SRR1910482.fastq.gz]
(remaining omitted)

Multiple accession IDs can be specified using a list object:

NEXTFLOW
ids = ['ERR908507', 'ERR908506', 'ERR908505']
Channel
    .fromSRA(ids)
    .view()

TEXT
[ERR908507, [ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR908/ERR908507/ERR908507_1.fastq.gz, ftp://ftp.sra.ebi.ac.
[ERR908506, [ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR908/ERR908506/ERR908506_1.fastq.gz, ftp://ftp.sra.ebi.ac.
[ERR908505, [ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR908/ERR908505/ERR908505_1.fastq.gz, ftp://ftp.sra.ebi.ac.

Read pairs are implicitly managed and are returned as a list of files.

It’s straightforward to use this channel as an input using the usual Nextflow syntax. For example:

NEXTFLOW
params.accession = 'SRP043510'
reads = Channel.fromSRA(params.accession)

process fastqc {
    input:
    tuple sample_id, file(reads_file) from reads

    output:
    file("fastqc_${sample_id}_logs") into fastqc_ch

    script:
    """
    mkdir fastqc_${sample_id}_logs
    fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads_file}
    """
}

The code snippet above creates a channel containing 24 samples from a chromatin dynamics study and runs FASTQC
on the resulting files.

4. Processes
A process is the basic Nextflow computing primitive to execute foreign functions, i.e. custom scripts or tools.

The process definition starts with the keyword process , followed by the process name and finally the process body
delimited by curly brackets. The process body must contain a string which represents the command or, more generally,
a script that is executed by it.

A basic process looks like the following example:

NEXTFLOW
process sayHello {
    """
    echo 'Hello world!'
    """
}

A process may contain five definition blocks, respectively: directives, inputs, outputs, when clause and finally the
process script. The syntax is defined as follows:

process < name > {
    [ directives ]                    // (1)

    input:                            // (2)
    < process inputs >

    output:                           // (3)
    < process outputs >

    when:                             // (4)
    < condition >

    [script|shell|exec]:              // (5)
    < user script to be executed >
}

(1) Zero, one or more process directives

(2) Zero, one or more process inputs
(3) Zero, one or more process outputs
(4) An optional boolean conditional to trigger the process execution
(5) The command to be executed

4.1. Script
The script block is a string statement that defines the command that is executed by the process to carry out its task.

A process contains one and only one script block, and it must be the last statement when the process contains input
and output declarations.

The script block can be a simple string or a multi-line string. The latter simplifies the writing of non-trivial scripts
composed of multiple commands spanning multiple lines. For example:

NEXTFLOW
process example {
    script:
    """
    blastp -db /data/blast -query query.fa -outfmt 6 > blast_result
    cat blast_result | head -n 10 | cut -f 2 > top_hits
    blastdbcmd -db /data/blast -entry_batch top_hits > sequences
    """
}

By default the process command is interpreted as a Bash script. However, any other scripting language can be used
simply by starting the script with the corresponding Shebang (https://en.wikipedia.org/wiki/Shebang_(Unix)) declaration. For
example:

NEXTFLOW
process pyStuff {
    script:
    """
    #!/usr/bin/env python

    x = 'Hello'
    y = 'world!'
    print("%s - %s" % (x, y))
    """
}

This allows the composition, in the same workflow script, of tasks using different programming
languages which may better fit a particular job. However, for large chunks of code it is suggested to
save them into separate files and invoke them from the process script.

4.1.1. Script parameters


The process script can be defined dynamically using variable values, as in any other string.

NEXTFLOW
params.data = 'World'

process foo {
    script:
    """
    echo Hello $params.data
    """
}

A process script can contain any string format supported by the Groovy programming language. This
allows us to use string interpolation or multi-line strings as in the script above. Refer to String
interpolation for more information.

Since Nextflow uses the same Bash syntax for variable substitutions in strings, Bash environment
variables need to be escaped using the \ character.

NEXTFLOW
process foo {
    script:
    """
    echo "The current directory is \$PWD"
    """
}

Try to modify the above script using $PWD instead of \$PWD and check the difference.

This can be tricky when the script uses many Bash variables. A possible alternative is to use a script string delimited by
single-quote characters:

NEXTFLOW
process bar {
    script:
    '''
    echo $PATH | tr : '\\n'
    '''
}

However, this no longer allows the use of Nextflow variables in the command script.

Another alternative is to use a shell statement instead of script , which uses a different syntax for Nextflow
variables: !{..} . This allows the use of both Nextflow and Bash variables in the same script.

NEXTFLOW
params.data = 'le monde'

process baz {
    shell:
    '''
    X='Bonjour'
    echo $X !{params.data}
    '''
}

4.1.2. Conditional script


The process script can also be defined in a completely dynamic manner using an if statement or any other expression
evaluating to a string value. For example:

NEXTFLOW
params.aligner = 'kallisto'

process foo {
    script:
    if( params.aligner == 'kallisto' )
        """
        kallisto --reads /some/data.fastq
        """
    else if( params.aligner == 'salmon' )
        """
        salmon --reads /some/data.fastq
        """
    else
        throw new IllegalArgumentException("Unknown aligner $params.aligner")
}

Exercise
Write a custom function that, given the aligner name as a parameter, returns the command string to be executed. Then
use this function as the process script body.
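One possible solution, a sketch that moves the conditional logic into a function (the function name is illustrative):

NEXTFLOW
params.aligner = 'kallisto'

// returns the command string for the given aligner name
def getAlignerScript(aligner) {
    if( aligner == 'kallisto' )
        "kallisto --reads /some/data.fastq"
    else if( aligner == 'salmon' )
        "salmon --reads /some/data.fastq"
    else
        throw new IllegalArgumentException("Unknown aligner $aligner")
}

process foo {
    script:
    getAlignerScript(params.aligner)
}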

4.2. Inputs
Nextflow processes are isolated from each other but can communicate between themselves by sending values through
channels.

Inputs implicitly determine the dependencies and the parallel execution of the process. The process execution is fired
each time new data is ready to be consumed from the input channel:

The input block defines the channels from which the process expects to receive input data. You can only define one
input block at a time, and it must contain one or more input declarations.

The input block follows the syntax shown below:

NEXTFLOW
input:
<input qualifier> <input name> from <source channel>

4.2.1. Input values


The val qualifier allows you to receive data of any type as input. It can be accessed in the process script by using the
specified input name, as shown in the following example:

NEXTFLOW
num = Channel.from( 1, 2, 3 )

process basicExample {
    input:
    val x from num

    """
    echo process job $x
    """
}

In the above example the process is executed three times: each time a value is received from the channel num it is
used to run the script. Thus, it results in an output similar to the one shown below:

process job 3
process job 1
process job 2

The channel guarantees that items are delivered in the same order as they have been sent; but,
since the process is executed in a parallel manner, there is no guarantee that they are processed in
the same order as they are received.

4.2.2. Input files


The file qualifier allows the handling of file values in the process execution context. This means that Nextflow will
stage the file in the process execution directory, and it can be accessed in the script by using the name specified in the
input declaration.

NEXTFLOW
reads = Channel.fromPath( 'data/ggal/*.fq' )

process foo {
    input:
    file 'sample.fastq' from reads
    script:
    """
    your_command --reads sample.fastq
    """
}

The input file name can also be defined using a variable reference as shown below:

NEXTFLOW
reads = Channel.fromPath( 'data/ggal/*.fq' )

process foo {
    input:
    file sample from reads
    script:
    """
    your_command --reads $sample
    """
}

The same syntax can also handle more than one input file in the same execution; only the channel
composition needs to change.

NEXTFLOW
reads = Channel.fromPath( 'data/ggal/*.fq' )

process foo {
    input:
    file sample from reads.collect()
    script:
    """
    your_command --reads $sample
    """
}

When a process declares an input file, the corresponding channel elements must be file objects, i.e.
created with the file helper function or by a file-specific channel factory, e.g.
Channel.fromPath or Channel.fromFilePairs .

Consider the following snippet:

NEXTFLOW
params.genome = 'data/ggal/transcriptome.fa'

process foo {
    input:
    file genome from params.genome
    script:
    """
    your_command --reads $genome
    """
}

The above code creates a temporary file named input.1 with the string data/ggal/transcriptome.fa as its content.
That is likely not what you wanted to do.

4.2.3. Input path


As of version 19.10.0, Nextflow introduced a new path input qualifier that simplifies the handling of cases such as the
one shown above. In a nutshell, the path qualifier automatically handles string values as file objects. The following
example works as expected:

NEXTFLOW
params.genome = "$baseDir/data/ggal/transcriptome.fa"

process foo {
    input:
    path genome from params.genome
    script:
    """
    your_command --reads $genome
    """
}

The path qualifier should be preferred over file to handle process input files when using Nextflow
19.10.0 or later.

Exercise
Write a script that creates a channel containing all read files matching the pattern data/ggal/*_1.fq followed by a
process that concatenates them into a single file and prints the first 20 lines.
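One possible solution (a sketch; collect gathers all the files into a single emission, so the process runs once):

NEXTFLOW
reads = Channel.fromPath('data/ggal/*_1.fq')

process concatenate {
    echo true
    input:
    path fq from reads.collect()
    script:
    """
    cat $fq > merged.fq
    head -n 20 merged.fq
    """
}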

4.2.4. Combine input channels


A key feature of processes is the ability to handle inputs from multiple channels. However, it’s important to
understand how the content of channels and their semantics affect the execution of a process.

Consider the following example:

NEXTFLOW
process foo {
    echo true
    input:
    val x from Channel.from(1,2,3)
    val y from Channel.from('a','b','c')
    script:
    """
    echo $x and $y
    """
}

Both channels emit three values, therefore the process is executed three times, each time with a different pair:

(1, a)

(2, b)

(3, c)

What is happening is that the process waits until there’s a complete input configuration, i.e. it receives an input value
from all the channels declared as input.

When this condition is verified, it consumes the input values coming from the respective channels, spawns a task
execution, then repeats the same logic until one or more channels have no more content.

This means channel values are consumed serially one after another and the first empty channel causes the process
execution to stop, even if there are other values in other channels.

What happens when not all channels have the same cardinality (i.e. they emit a different number of
elements)?

For example:

NEXTFLOW
process foo {
    echo true
    input:
    val x from Channel.from(1,2)
    val y from Channel.from('a','b','c','d')
    script:
    """
    echo $x and $y
    """
}

In the above example the process is executed only two times, because when a channel has no more data to be processed
the process execution stops.

Note however that value channels do not affect the process termination.

To better understand this behavior compare the previous example with the following one:

NEXTFLOW
process bar {
    echo true
    input:
    val x from Channel.value(1)
    val y from Channel.from('a','b','c')
    script:
    """
    echo $x and $y
    """
}

Exercise
Write a process that is executed for each read file matching the pattern data/ggal/*_1.fq and use the same
data/ggal/transcriptome.fa in each execution.
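One possible solution, a sketch that wraps the transcriptome in a value channel so it can be read by every task without terminating the process:

NEXTFLOW
reads = Channel.fromPath('data/ggal/*_1.fq')
transcriptome = Channel.value(file('data/ggal/transcriptome.fa'))

process foo {
    echo true
    input:
    path seq from reads
    path genome from transcriptome
    script:
    """
    echo your_command --reads $seq --index $genome
    """
}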

4.2.5. Input repeaters


The each qualifier allows you to repeat the execution of a process for each item in a collection, every time new data
is received. For example:

NEXTFLOW
sequences = Channel.fromPath('data/prots/*.tfa')
methods = ['regular', 'expresso', 'psicoffee']

process alignSequences {
    input:
    path seq from sequences
    each mode from methods

    """
    t_coffee -in $seq -mode $mode
    """
}

In the above example, every time a file of sequences is received as input by the process, it executes three tasks, each
running an alignment with a different value for the mode option. This is useful when you need to repeat the same
task for a given set of parameters.

Exercise
Extend the previous example so a task is executed for each read file matching the pattern data/ggal/*_1.fq and
repeat the same task both with salmon and kallisto .
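One possible solution (a sketch; the aligner commands are placeholders, echoed rather than run):

NEXTFLOW
reads = Channel.fromPath('data/ggal/*_1.fq')
aligners = ['salmon', 'kallisto']

process quantify {
    echo true
    input:
    path seq from reads
    each aligner from aligners
    script:
    """
    echo $aligner --reads $seq
    """
}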

4.3. Outputs
The output declaration block allows you to define the channels used by the process to send out the results it produces.

At most one output block can be defined, and it can contain one or more output declarations. The output block
follows the syntax shown below:

output:
<output qualifier> <output name> into <target channel>[,channel,..]

4.3.1. Output values


The val qualifier allows you to output a value defined in the script context. In a common usage scenario, this is a value
which has been defined in the input declaration block, as shown in the following example:

NEXTFLOW
methods = ['prot','dna', 'rna']

process foo {
    input:
    val x from methods

    output:
    val x into receiver

    """
    echo $x > file
    """
}

receiver.view { "Received: $it" }

4.3.2. Output files


The file qualifier allows you to output one or more files, produced by the process, over the specified channel.

NEXTFLOW
process randomNum {

    output:
    file 'result.txt' into numbers

    '''
    echo $RANDOM > result.txt
    '''
}

numbers.view { "Received: " + it.text }

In the above example the process randomNum creates a file named result.txt containing a random number.

Since a file parameter using the same name is declared in the output block, when the task is completed that file is sent
over the numbers channel. A downstream process declaring the same channel as input will be able to receive it.

4.3.3. Multiple output files


When an output file name contains a * or ? wildcard character it is interpreted as a glob
(http://docs.oracle.com/javase/tutorial/essential/io/fileOps.html#glob) path matcher. This allows you to capture multiple files into a
list object and output them as a sole emission. For example:

NEXTFLOW
process splitLetters {

    output:
    file 'chunk_*' into letters

    '''
    printf 'Hola' | split -b 1 - chunk_
    '''
}

letters
    .flatMap()
    .view { "File: ${it.name} => ${it.text}" }

It prints:

File: chunk_aa => H
File: chunk_ab => o
File: chunk_ac => l
File: chunk_ad => a

Some caveats on glob pattern behavior:

Input files are not included in the list of possible matches.

The glob pattern matches against both file and directory paths.

When the two-stars pattern ** is used to recurse across directories, only file paths are matched, i.e. directories are
not included in the result list.

Exercise
Remove the flatMap operator and see how the output changes. The documentation for the flatMap operator is
available at this link (https://www.nextflow.io/docs/latest/operator.html#flatmap).

4.3.4. Dynamic output file names

When an output file name needs to be expressed dynamically, it is possible to define it using a dynamically evaluated
string which references values defined in the input declaration block or in the script global context. For example:

NEXTFLOW
process align {
    input:
    val x from species
    file seq from sequences

    output:
    file "${x}.aln" into genomes

    """
    t_coffee -in $seq > ${x}.aln
    """
}

In the above example, each time the process is executed an alignment file is produced whose name depends on the
actual value of the x input.

4.3.5. Composite inputs and outputs


So far we have seen how to declare multiple input and output channels, but each channel was handling only one value
at a time. However, Nextflow can handle tuples of values.

When using a channel emitting tuples of values, the corresponding input declaration must use a tuple
qualifier followed by the definition of each single element in the tuple.

In the same manner, an output channel emitting tuples of values can be declared using the tuple qualifier followed by
the definition of each element in the tuple.

NEXTFLOW
reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')

process foo {
    input:
    tuple val(sample_id), file(sample_files) from reads_ch
    output:
    tuple val(sample_id), file('sample.bam') into bam_ch
    script:
    """
    your_command_here --reads $sample_id > sample.bam
    """
}

bam_ch.view()

In previous versions of Nextflow tuple was called set , but it was used with exactly the same
semantics. It can still be used for backward compatibility.

Exercise
Modify the script of the previous exercise so that the bam file is named after the given sample_id .
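One possible solution, a sketch changing only the output declaration and the script of the previous example:

NEXTFLOW
process foo {
    input:
    tuple val(sample_id), file(sample_files) from reads_ch
    output:
    tuple val(sample_id), file("${sample_id}.bam") into bam_ch
    script:
    """
    your_command_here --reads $sample_id > ${sample_id}.bam
    """
}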

4.4. When
The when declaration allows you to define a condition that must be verified in order to execute the process. This can
be any expression that evaluates to a boolean value.

It is useful to enable/disable the process execution depending on the state of various inputs and parameters. For example:

NEXTFLOW
params.dbtype = 'nr'
params.prot = 'data/prots/*.tfa'
proteins = Channel.fromPath(params.prot)

process find {
    input:
    file fasta from proteins
    val type from params.dbtype

    when:
    fasta.name =~ /^BB11.*/ && type == 'nr'

    script:
    """
    blastp -query $fasta -db nr
    """
}

4.5. Directives
Directive declarations allow the definition of optional settings that affect the execution of the current process without
affecting the semantics of the task itself.

They must be entered at the top of the process body, before any other declaration blocks (i.e. input , output , etc).

Directives are commonly used to define the amount of computing resources to be used, or meta directives that allow
the definition of extra information for configuration or logging purposes. For example:

NEXTFLOW
process foo {
    cpus 2
    memory 8.GB
    container 'image/name'

    script:
    """
    your_command --this --that
    """
}

The complete list of directives is available at this link (https://www.nextflow.io/docs/latest/process.html#directives).

4.5.1. Exercise
Modify the script of the previous exercise adding a tag (https://www.nextflow.io/docs/latest/process.html#tag) directive logging
the sample_id in the execution output.
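One possible solution, a sketch adding the tag directive to the previous process so each task is logged with its sample_id:

NEXTFLOW
process foo {
    tag "$sample_id"
    input:
    tuple val(sample_id), file(sample_files) from reads_ch
    output:
    tuple val(sample_id), file("${sample_id}.bam") into bam_ch
    script:
    """
    your_command_here --reads $sample_id > ${sample_id}.bam
    """
}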

4.6. Organise outputs


4.6.1. PublishDir directive
Nextflow manages the intermediate results of the workflow execution independently from the pipeline's expected outputs.
Task output files are created in the task-specific execution directory, which is considered a temporary directory that can
be deleted upon completion.

The pipeline result files need to be marked explicitly using the directive publishDir
(https://www.nextflow.io/docs/latest/process.html#publishdir) in the process that creates such files. For example:

NEXTFLOW
process makeBams {
    publishDir "/some/directory/bam_files", mode: 'copy'

    input:
    file index from index_ch
    tuple val(name), file(reads) from reads_ch

    output:
    tuple val(name), file ('*.bam') into star_aligned

    """
    STAR --genomeDir $index --readFilesIn $reads
    """
}

The above example will copy all BAM files created by the STAR task into the directory path
/some/directory/bam_files .

The publish directory can be local or remote. For example, output files could be stored in an AWS S3
bucket (https://aws.amazon.com/s3/) just by using the s3:// prefix in the target path.

4.6.2. Manage semantic sub-directories


You can use more than one publishDir directive to keep different outputs in separate directories. For example:

NEXTFLOW
params.reads = 'data/reads/*_{1,2}.fq.gz'
params.outdir = 'my-results'

Channel
    .fromFilePairs(params.reads, flat: true)
    .set{ samples_ch }

process foo {
    publishDir "$params.outdir/$sampleId/", pattern: '*.fq'
    publishDir "$params.outdir/$sampleId/counts", pattern: "*_counts.txt"
    publishDir "$params.outdir/$sampleId/outlooks", pattern: '*_outlook.txt'

    input:
    set sampleId, file('sample1.fq.gz'), file('sample2.fq.gz') from samples_ch
    output:
    file "*"
    script:
    """
    < sample1.fq.gz zcat > sample1.fq
    < sample2.fq.gz zcat > sample2.fq

    awk '{s++}END{print s/4}' sample1.fq > sample1_counts.txt
    awk '{s++}END{print s/4}' sample2.fq > sample2_counts.txt

    head -n 50 sample1.fq > sample1_outlook.txt
    head -n 50 sample2.fq > sample2_outlook.txt
    """
}

The above example will create an output structure in the directory my-results , which contains a separate sub-
directory for each given sample ID, each of which contains the folders counts and outlooks .

5. Operators
Built-in functions applied to channels

Transform the content of channels

Can also be used to filter, fork and combine channels

5.1. Basic example


NEXTFLOW
nums = Channel.from(1,2,3,4)          // (1)
square = nums.map { it -> it * it }   // (2)
square.view()                         // (3)

(1) Create a queue channel emitting four values.

(2) Create a new channel, transforming each number into its square.
(3) Print the channel content.

Operators can be chained to implement custom behaviors:

NEXTFLOW
Channel.from(1,2,3,4)
    .map { it -> it * it }
    .view()

Operators can be separated into six groups:

Filtering operators

Transforming operators

Splitting operators

Combining operators

Forking operators

Maths operators

5.2. Basic operators


5.2.1. view
The view operator prints the items emitted by a channel to the console standard output, appending a new line
character to each of them. For example:

NEXTFLOW
Channel
    .from('foo', 'bar', 'baz')
    .view()

It prints:

foo
bar
baz

An optional closure parameter can be specified to customize how items are printed. For example:

NEXTFLOW
Channel
    .from('foo', 'bar', 'baz')
    .view { "- $it" }

It prints:

- foo
- bar
- baz

5.2.2. map
The map operator applies a function of your choosing to every item emitted by a channel, and returns the items so
obtained as a new channel. The function applied is called the mapping function and is expressed with a closure as
shown in the example below:

NEXTFLOW
Channel
    .from( 'hello', 'world' )
    .map { it -> it.reverse() }
    .view()

A map can associate to each element a generic tuple containing any data as needed.

NEXTFLOW
Channel
    .from( 'hello', 'world' )
    .map { word -> [word, word.size()] }
    .view { word, len -> "$word contains $len letters" }

Exercise
Use fromPath to create a channel emitting the fastq files matching the pattern data/ggal/*.fq , then chain with a
map to return a pair containing the file name and the path itself. Finally print the resulting channel.

NEXTFLOW
Channel.fromPath('data/ggal/*.fq')
    .map { file -> [ file.name, file ] }
    .view { name, file -> "> file: $name" }

5.2.3. into
The into operator connects a source channel to two or more target channels in such a way that the values emitted by the
source channel are copied to the target channels. For example:

NEXTFLOW
Channel
    .from( 'a', 'b', 'c' )
    .into{ foo; bar }

foo.view{ "Foo emits: " + it }
bar.view{ "Bar emits: " + it }

Note the use in this example of curly brackets and of ; as the channel-name separator. This is needed
because the actual parameter of into is a closure which defines the target channels to which the
source one is connected.

5.2.4. mix
The mix operator combines the items emitted by two (or more) channels into a single channel.

NEXTFLOW
c1 = Channel.from( 1,2,3 )
c2 = Channel.from( 'a','b' )
c3 = Channel.from( 'z' )

c1.mix(c2,c3).view()

1
2
a
3
b
z

The items in the resulting channel have the same order as in the respective original channels; however,
there is no guarantee that the elements of the second channel are appended after the elements of the
first. Indeed, in the above example the element a has been printed before 3 .

5.2.5. flatten
The flatten operator transforms a channel in such a way that every tuple is flattened so that each single entry is
emitted as a sole element by the resulting channel.

NEXTFLOW
foo = [1,2,3]
bar = [4, 5, 6]

Channel
    .from(foo, bar)
    .flatten()
    .view()

The above snippet prints:

1
2
3
4
5
6

5.2.6. collect

The collect operator collects all the items emitted by a channel into a list and returns the resulting object as a sole
emission.

NEXTFLOW
Channel
    .from( 1, 2, 3, 4 )
    .collect()
    .view()

It prints a single value:

[1,2,3,4]

The result of the collect operator is a value channel.

5.2.7. groupTuple
The groupTuple operator collects tuples (or lists) of values emitted by the source channel grouping together the
elements that share the same key. Finally it emits a new tuple object for each distinct key collected.

Try the following example:

NEXTFLOW
Channel
    .from( [1,'A'], [1,'B'], [2,'C'], [3, 'B'], [1,'C'], [2, 'A'], [3, 'D'] )
    .groupTuple()
    .view()

It shows:

[1, [A, B, C]]
[2, [C, A]]
[3, [B, D]]

This operator is useful to process together all the elements that share a common property or grouping key.

Exercise
Use fromPath to create a channel emitting the fastq files matching the pattern data/ggal/*.fq , then use a map to
associate to each file its name prefix. Finally group together all files having the same common prefix.
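One possible solution (a sketch; it assumes the grouping prefix is the part of the file name before the first underscore, e.g. gut for gut_1.fq):

NEXTFLOW
Channel
    .fromPath('data/ggal/*.fq')
    .map { file -> [ file.name.split('_')[0], file ] }
    .groupTuple()
    .view()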

5.2.8. join
The join operator creates a channel that joins together the items emitted by two channels for which a matching
key exists. The key is defined, by default, as the first element in each item emitted.

NEXTFLOW
left = Channel.from(['X', 1], ['Y', 2], ['Z', 3], ['P', 7])
right = Channel.from(['Z', 6], ['Y', 5], ['X', 4])
left.join(right).view()

The resulting channel emits:

[Z, 3, 6]
[Y, 2, 5]
[X, 1, 4]

5.2.9. branch
The branch operator allows you to forward the items emitted by a source channel to one or more output channels,
choosing one of them for each item.

The selection criterion is defined by specifying a closure that provides one or more boolean expressions, each of which is
identified by a unique label. For the first expression that evaluates to a true value, the current item is bound to the named
channel with the label identifier. For example:

NEXTFLOW
Channel
    .from(1,2,3,40,50)
    .branch {
        small: it < 10
        large: it > 10
    }
    .set { result }

result.small.view { "$it is small" }
result.large.view { "$it is large" }

The branch operator returns a multi-channel object, i.e. a variable that holds more than one
channel object.

5.3. More resources


Check the operators documentation (https://www.nextflow.io/docs/latest/operator.html) on the Nextflow web site.

6. Groovy basic structures and idioms


Nextflow is a DSL implemented on top of the Groovy programming language, which in turn is a super-set of the Java
programming language. This means that Nextflow can run any Groovy and Java code.

6.1. Printing values


Printing something is as easy as using one of the print or println methods.

GROOVY
println("Hello, World!")

The only difference between the two is that the println method implicitly appends a new line character to the
printed string.

Parentheses for function invocations are optional. Therefore the following is also valid syntax.

GROOVY
println "Hello, World!"

6.2. Comments
Comments use the same syntax as in the C-family programming languages:

GROOVY
// comment a single line

/*
    a comment spanning
    multiple lines
*/

6.3. Variables
To define a variable, simply assign a value to it:

GROOVY
x = 1
println x

x = new java.util.Date()
println x

x = -3.1499392
println x

x = false
println x

x = "Hi"
println x

Local variables are defined using the def keyword:

GROOVY
def x = 'foo'

It should always be used when defining variables local to a function or a closure.

6.4. Lists
A List object can be defined by placing the list items in square brackets:

GROOVY
list = [10,20,30,40]

You can access a given item in the list with square-bracket notation (indexes start at 0 ) or using the get method:

GROOVY
assert list[0] == 10
assert list[0] == list.get(0)

In order to get the length of the list use the size method:

GROOVY
assert list.size() == 4

Lists can also be indexed with negative indexes and reversed ranges.

GROOVY
list = [0,1,2]
assert list[-1] == 2
assert list[-1..0] == list.reverse()

List objects implement all the methods provided by the Java java.util.List
(https://docs.oracle.com/javase/8/docs/api/java/util/List.html) interface, plus the extension methods provided by the Groovy API
(http://docs.groovy-lang.org/latest/html/groovy-jdk/java/util/List.html).

GROOVY
assert [1,2,3] << 1 == [1,2,3,1]
assert [1,2,3] + [1] == [1,2,3,1]
assert [1,2,3,1] - [1] == [2,3]
assert [1,2,3] * 2 == [1,2,3,1,2,3]
assert [1,[2,3]].flatten() == [1,2,3]
assert [1,2,3].reverse() == [3,2,1]
assert [1,2,3].collect{ it+3 } == [4,5,6]
assert [1,2,3,1].unique().size() == 3
assert [1,2,3,1].count(1) == 2
assert [1,2,3,4].min() == 1
assert [1,2,3,4].max() == 4
assert [1,2,3,4].sum() == 10
assert [4,2,1,3].sort() == [1,2,3,4]
assert [4,2,1,3].find{it%2 == 0} == 4
assert [4,2,1,3].findAll{it%2 == 0} == [4,2]

6.5. Maps
Maps are like lists, except that they have an arbitrary type of key instead of an integer. Therefore, the syntax is very much aligned.

GROOVY
map = [a:0, b:1, c:2]

Maps can be accessed using the conventional square-bracket syntax or as if the key were a property of the map.

GROOVY
assert map['a'] == 0       // (1)
assert map.b == 1          // (2)
assert map.get('c') == 2   // (3)

(1) Use of the square brackets.

(2) Use of the dot notation.
(3) Use of the get method.

To add data to or modify a map, the syntax is similar to adding values to a list:

GROOVY
map['a'] = 'x'      // (1)
map.b = 'y'         // (2)
map.put('c', 'z')   // (3)
assert map == [a:'x', b:'y', c:'z']

(1) Use of the square brackets.

(2) Use of the dot notation.
(3) Use of the put method.

Map objects implement all the methods provided by the Java java.util.Map
(https://docs.oracle.com/javase/8/docs/api/java/util/Map.html) interface, plus the extension methods provided by the Groovy API
(http://docs.groovy-lang.org/latest/html/groovy-jdk/java/util/Map.html).

6.6. String interpolation


String literals can be defined by enclosing them in either single-quote or double-quote characters.

Double-quoted strings can contain the value of an arbitrary variable by prefixing its name with the $ character, or the
value of any expression by using the ${expression} syntax, similar to Bash/shell scripts:

GROOVY
foxtype = 'quick'
foxcolor = ['b', 'r', 'o', 'w', 'n']
println "The $foxtype ${foxcolor.join()} fox"

x = 'Hello'
println '$x + $y'

This code prints:

GROOVY
The quick brown fox
$x + $y

Note the different use of the $ and ${..} syntax to interpolate value expressions in a string literal.

Finally, string literals can also be defined using the / character as a delimiter. They are known as slashy strings and are
useful for defining regular expressions and patterns, as there is no need to escape backslashes. As with double-quoted
strings, they allow you to interpolate variables prefixed with a $ character.

Try the following to see the difference:

GROOVY
x = /tic\tac\toe/
y = 'tic\tac\toe'

println x
println y

It prints:

tic\tac\toe
tic ac oe

6.7. Multi-line strings


A block of text that spans multiple lines can be defined by delimiting it with triple single or double quotes:

GROOVY
text = """
    Hello there James
    how are you today?
    """

Finally, multi-line strings can also be defined with slashy strings. For example:

GROOVY
text = /
    This is a multi-line
    slashy string!
    It's cool, isn't it?!
    /

Like before, multi-line strings inside double quotes and slash characters support variable
interpolation, while single-quoted multi-line strings do not.

6.8. If statement
The if statement uses the same syntax common to other programming languages such as Java, C, JavaScript, etc.

GROOVY
if( < boolean expression > ) {
    // true branch
}
else {
    // false branch
}

The else branch is optional. Also, curly brackets are optional when the branch defines just a single statement.

GROOVY
x = 1
if( x > 10 )
    println 'Hello'

null , empty strings and empty collections are evaluated to false .

Therefore a statement like:

GROOVY
list = [1,2,3]
if( list != null && list.size() > 0 ) {
    println list
}
else {
    println 'The list is empty'
}

Can be written as:

GROOVY
if( list )
    println list
else
    println 'The list is empty'

See the Groovy-Truth (http://groovy-lang.org/semantics.html#Groovy-Truth) for details.

In some cases it can be useful to replace an if statement with a ternary expression, aka a conditional
expression. For example:

GROOVY
println list ? list : 'The list is empty'

The previous statement can be further simplified using the Elvis operator
(http://groovy-lang.org/operators.html#_elvis_operator) as shown below:

GROOVY
println list ?: 'The list is empty'

6.9. For statement


The classical for loop syntax is supported as shown here:

GROOVY
for (int i = 0; i < 3; i++) {
    println("Hello World $i")
}

Iteration over list objects is also possible using the syntax below:

GROOVY
list = ['a','b','c']

for( String elem : list ) {
    println elem
}

6.10. Functions
It is possible to define a custom function in a script, as shown here:

GROOVY
int fib(int n) {
    return n < 2 ? 1 : fib(n-1) + fib(n-2)
}

assert fib(10)==89

A function can take multiple arguments, separated by commas. The return keyword can be omitted and the
function implicitly returns the value of the last evaluated expression. Also, explicit types can be omitted (though not
recommended):

GROOVY
1 def fact( n ) {
2 n > 1 ? n * fact(n-1) : 1
3 }
4
5 assert fact(5) == 120

6.11. Closures
Closures are the swiss army knife of Nextflow/Groovy programming. In a nutshell, a closure is a block of code that
can be passed as an argument to a function; it can also be described as an anonymous function.

More formally, a closure allows the definition of functions as first class objects.

GROOVY
1 square = { it * it }

The curly brackets around the expression it * it tell the script interpreter to treat this expression as code. The it
identifier is an implicit variable that represents the value that is passed to the function when it is invoked.

Once compiled, the function object is assigned to the variable square like any other variable assignment shown
previously. To invoke the closure, use the special method call or just use round parentheses to specify
the closure parameter(s). For example:

GROOVY
1 assert square.call(5) == 25
2 assert square(9) == 81


This is not very interesting until we find that we can pass the function square as an argument to other functions or
methods. Some built-in functions take a function like this as an argument. One example is the collect method on
lists:

GROOVY
1 x = [ 1, 2, 3, 4 ].collect(square)
2 println x

It prints:

[ 1, 4, 9, 16 ]

By default, closures take a single parameter called it; to give it a different name use the -> syntax. For example:

GROOVY
1 square = { num -> num * num }

It’s also possible to define closures with multiple, custom-named parameters.

For example, the method each() when applied to a map takes a closure with two arguments, to which it passes the
key-value pair for each entry in the map object:

GROOVY
1 printMap = { a, b -> println "$a with value $b" }
2 values = [ "Yue" : "Wu", "Mark" : "Williams", "Sudha" : "Kumari" ]
3 values.each(printMap)

It prints:

Yue with value Wu
Mark with value Williams
Sudha with value Kumari

A closure has two other important features. First, it can access and modify variables in the scope where it is defined.

Second, a closure can be defined in an anonymous manner, meaning that it is not given a name, and is defined in the
place where it needs to be used.

As an example showing both these features, see the following code fragment:

GROOVY
1 result = 0                                       (1)
2 values = ["China": 1, "India": 2, "USA": 3]      (2)
3 values.keySet().each { result += values[it] }    (3)
4 println result

(1) Define a global variable.
(2) Define a map object.
(3) Invoke the each method passing a closure object which modifies the result variable.

Learn more about closures in the Groovy documentation (http://groovy-lang.org/closures.html).

6.12. More resources


The complete Groovy language documentation is available at this link
(http://groovy-lang.org/documentation.html#languagespecification).

A great resource to master Apache Groovy syntax is Groovy in Action
(https://www.manning.com/books/groovy-in-action-second-edition).

7. Simple Rna-Seq pipeline


During this tutorial you will implement a proof of concept RNA-Seq pipeline which:

1. Indexes a transcriptome file.

2. Performs quality control.

3. Performs quantification.

4. Creates a MultiQC report.

7.1. Define the pipeline parameters


The script script1.nf defines the pipeline input parameters.

NEXTFLOW
1 params.reads = "$baseDir/data/ggal/*_{1,2}.fq"
2 params.transcriptome = "$baseDir/data/ggal/transcriptome.fa"
3 params.multiqc = "$baseDir/multiqc"
4
5 println "reads: $params.reads"

Run it by using the following command:

nextflow run script1.nf

Try to specify a different input parameter, for example:

nextflow run script1.nf --reads this/and/that

7.1.1. Exercise
Modify script1.nf by adding a fourth parameter named outdir and set it to a default path that will be used as the
pipeline output directory.

7.1.2. Exercise
Modify script1.nf to print all the pipeline parameters by using a single log.info command and a multiline
string (https://www.nextflow.io/docs/latest/script.html#multi-line-strings) statement.

 See an example here (https://github.com/nextflow-io/rnaseq-nf/blob/3b5b49f/main.nf#L41-L48).
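
A minimal sketch of one possible solution, reusing the parameter names defined above (the exact layout of the message is up to you):

NEXTFLOW
log.info """\
         R N A S E Q - N F   P I P E L I N E
         ===================================
         reads         : ${params.reads}
         transcriptome : ${params.transcriptome}
         multiqc       : ${params.multiqc}
         outdir        : ${params.outdir}
         """
         .stripIndent()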

7.1.3. Recap
In this step you have learned:

1. How to define parameters in your pipeline script

2. How to pass parameters by using the command line

3. The use of $var and ${var} variable placeholders


4. How to use multiline strings

5. How to use log.info to print information and save it in the log execution file

7.2. Create transcriptome index file


Nextflow allows the execution of any command or user script by using a process definition.

A process is defined by providing three main declarations: the process inputs
(https://www.nextflow.io/docs/latest/process.html#inputs), the process outputs
(https://www.nextflow.io/docs/latest/process.html#outputs) and finally the command script
(https://www.nextflow.io/docs/latest/process.html#script).

The second example adds the index process.

NEXTFLOW
1 /*
2 * pipeline input parameters
3 */
4 params.reads = "$baseDir/data/ggal/*_{1,2}.fq"
5 params.transcriptome = "$baseDir/data/ggal/transcriptome.fa"
6 params.multiqc = "$baseDir/multiqc"
7 params.outdir = "results"
8
9 println """\
10 R N A S E Q - N F P I P E L I N E
11 ===================================
12 transcriptome: ${params.transcriptome}
13 reads : ${params.reads}
14 outdir : ${params.outdir}
15 """
16 .stripIndent()
17
18
19 /*
20 * define the `index` process that create a binary index
21 * given the transcriptome file
22 */
23 process index {
24
25 input:
26 path transcriptome from params.transcriptome
27
28 output:
29 path 'index' into index_ch
30
31 script:
32 """
33 salmon index --threads $task.cpus -t $transcriptome -i index
34 """
35 }

It takes the transcriptome params file as input and creates the transcriptome index by using the salmon tool.

Note how the input declaration defines a transcriptome variable in the process context that is used in the
command script to reference that file in the Salmon command line.

Try to run it by using the command:

nextflow run script2.nf

The execution will fail because Salmon is not installed in your environment.


Add the command line option -with-docker to launch the execution through a Docker container as shown below:

nextflow run script2.nf -with-docker

This time it works because it uses the Docker container nextflow/rnaseq-nf defined in the nextflow.config file.

To avoid adding the option -with-docker every time, add the following line in the nextflow.config file:

docker.enabled = true

7.2.1. Exercise
Enable the Docker execution by default adding the above setting in the nextflow.config file.

7.2.2. Exercise
Print the output of the index_ch channel by using the view operator (https://www.nextflow.io/docs/latest/operator.html#view).
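
For instance, appending the following line at the end of script2.nf should print the path of the index directory created by the process:

NEXTFLOW
index_ch.view()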

7.2.3. Exercise
Use the command tree work to see how Nextflow organizes the process work directory.

7.2.4. Recap
In this step you have learned:

1. How to define a process executing a custom command

2. How process inputs are declared

3. How process outputs are declared

4. How to access the number of available CPUs

5. How to print the content of a channel

7.3. Collect read files by pairs


This step shows how to match read files into pairs, so they can be mapped by Salmon.

Edit the script script3.nf and add the following statement as the last line:

read_pairs_ch.view()

Save it and execute it with the following command:

nextflow run script3.nf

It will print an output similar to the one shown below:

[ggal_gut, [/.../data/ggal/gut_1.fq, /.../data/ggal/gut_2.fq]]

The above example shows how the read_pairs_ch channel emits tuples composed of two elements, where the first is
the read pair prefix and the second is a list representing the actual files.

Try it again specifying different read files by using a glob pattern:


nextflow run script3.nf --reads 'data/ggal/*_{1,2}.fq'

File paths including one or more wildcards, i.e. * , ? , etc., MUST be wrapped in single-quote
 characters to prevent Bash from expanding the glob.

7.3.1. Exercise
Use the set (https://www.nextflow.io/docs/latest/operator.html#set) operator in place of = assignment to define the
read_pairs_ch channel.

7.3.2. Exercise
Use the checkIfExists option for the fromFilePairs (https://www.nextflow.io/docs/latest/channel.html#fromfilepairs) method
to check whether the specified path contains at least one file pair.
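
A minimal sketch covering both exercises, following the channel and parameter names used in the earlier scripts:

NEXTFLOW
Channel
    .fromFilePairs( params.reads, checkIfExists: true )
    .set { read_pairs_ch }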

7.3.3. Recap
In this step you have learned:

1. How to use fromFilePairs to handle read pair files

2. How to use the checkIfExists option to check input file existence

3. How to use the set operator to define a new channel variable

7.4. Perform expression quantification


The script script4.nf adds the quantification process.

In this script note how the index_ch channel, declared as output in the index process, is now used as a channel in the
input section.

Also note how the second input is declared as a tuple composed of two elements, the pair_id and the reads, in order
to match the structure of the items emitted by the read_pairs_ch channel.

Execute it by using the following command:

nextflow run script4.nf -resume

You will see the execution of the quantification process.

The -resume option causes any step that has already been processed to be skipped.

Try to execute it with more read files as shown below:

nextflow run script4.nf -resume --reads 'data/ggal/*_{1,2}.fq'

You will notice that the quantification process is executed more than one time.

Nextflow parallelizes the execution of your pipeline simply by providing multiple input data to your script.

7.4.1. Exercise
Add a tag (https://www.nextflow.io/docs/latest/process.html#tag) directive to the quantification process to provide a more
readable execution log.


7.4.2. Exercise
Add a publishDir (https://www.nextflow.io/docs/latest/process.html#publishdir) directive to the quantification process to
store the process results into a directory of your choice.
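
A sketch of how both directives might look in the quantification process; the process body below follows the shape of the rnaseq-nf example, so names and the exact Salmon command line may differ slightly in your copy of script4.nf:

NEXTFLOW
process quantification {
    tag "$pair_id"                            // shown in the execution log for each task
    publishDir params.outdir, mode: 'copy'    // copy the results to the output directory

    input:
    path index from index_ch
    tuple val(pair_id), path(reads) from read_pairs_ch

    output:
    path pair_id into quant_ch

    script:
    """
    salmon quant --threads $task.cpus --libType=U -i $index -1 ${reads[0]} -2 ${reads[1]} -o $pair_id
    """
}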

7.4.3. Recap
In this step you have learned:

1. How to connect two processes by using the channel declarations

2. How to resume the script execution skipping already computed steps

3. How to use the tag directive to provide a more readable execution output

4. How to use publishDir to store process results in a path of your choice

7.5. Quality control


This step implements quality control for your input reads. The inputs are the same read pairs that are provided to
the quantification step.

You can run it by using the following command:

nextflow run script5.nf -resume

The script will report the following error message:

Channel `read_pairs_ch` has been used twice as an input by process `fastqc` and process `quantification`

7.5.1. Exercise
Modify the creation of the read_pairs_ch channel by using an into (https://www.nextflow.io/docs/latest/operator.html#into)
operator in place of a set .

 see an example here (https://github.com/nextflow-io/rnaseq-nf/blob/3b5b49f/main.nf#L58).
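
A minimal sketch (the name of the second channel is an illustrative choice):

NEXTFLOW
Channel
    .fromFilePairs( params.reads, checkIfExists: true )
    .into { read_pairs_ch; read_pairs2_ch }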

7.5.2. Recap
In this step you have learned:

1. How to use the into operator to create multiple copies of the same channel

7.6. MultiQC report


This step collects the outputs from the quantification and fastqc steps to create a final report by using the MultiQC
(http://multiqc.info/) tool.

Execute the script with the following command:

nextflow run script6.nf -resume --reads 'data/ggal/*_{1,2}.fq'

It creates the final report in the results folder in the current work directory.


In this script note the use of the mix (https://www.nextflow.io/docs/latest/operator.html#mix) and collect
(https://www.nextflow.io/docs/latest/operator.html#collect) operators chained together to get all the outputs of the
quantification and fastqc processes as a single input.

7.6.1. Recap
In this step you have learned:

1. How to collect many outputs to a single input with the collect operator

2. How to mix two channels in a single channel

3. How to chain two or more operators together

7.7. Handle completion event


This step shows how to execute an action when the pipeline completes the execution.

Note that Nextflow processes define the execution of asynchronous tasks, i.e. they are not executed one after another
as they are written in the pipeline script, as would happen in a common imperative programming language.

The script uses the workflow.onComplete event handler to print a confirmation message when the script completes.
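
The handler has roughly the following shape (the message text is illustrative):

NEXTFLOW
workflow.onComplete {
    println ( workflow.success ? "\nDone! Open the following report in your browser --> $params.outdir/multiqc_report.html\n" : "Oops .. something went wrong" )
}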

Try to run it by using the following command:

nextflow run script7.nf -resume --reads 'data/ggal/*_{1,2}.fq'

7.8. Bonus!
Send a notification email when the workflow execution completes using the -N <email address> command line
option. Note: this requires the configuration of an SMTP server in the Nextflow config file. For the sake of this tutorial, add
the following settings in a file named mail.config, which is then passed on the command line as shown below:

CONFIG
1 mail {
2 from = 'info@nextflow.io'
3 smtp.host = 'email-smtp.eu-west-1.amazonaws.com'
4 smtp.port = 587
5 smtp.user = "xxxxx"
6 smtp.password = "yyyyy"
7 smtp.auth = true
8 smtp.starttls.enable = true
9 smtp.starttls.required = true
10 }

Then execute the previous example again, specifying your email address:

nextflow run script7.nf -resume --reads 'data/ggal/*_{1,2}.fq' -c mail.config -N <your email>

See mail documentation (https://www.nextflow.io/docs/latest/mail.html#mail-configuration) for details.

7.9. Custom scripts


Real world pipelines use a lot of custom user scripts (BASH, R, Python, etc.). Nextflow allows you to use and manage all
these scripts in a consistent manner. Simply put them in a directory named bin in the pipeline project root. They will be
automatically added to the pipeline execution PATH .

For example, create a file named fastqc.sh with the following content:


BASH
1 #!/bin/bash
2 set -e
3 set -u
4
5 sample_id=${1}
6 reads=${2}
7
8 mkdir fastqc_${sample_id}_logs
9 fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}

Save it, give it execute permission and move it into the bin directory as shown below:

BASH
1 chmod +x fastqc.sh
2 mkdir -p bin
3 mv fastqc.sh bin

Then, open the script7.nf file and replace the fastqc process' script with the following code:

NEXTFLOW
1 script:
2 """
3 fastqc.sh "$sample_id" "$reads"
4 """

Run it as before:

nextflow run script7.nf -resume --reads 'data/ggal/*_{1,2}.fq'

7.9.1. Recap
In this step you have learned:

1. How to write or use existing custom scripts in your Nextflow pipeline.

2. How to avoid the use of absolute paths by having your scripts in the bin/ project folder.

7.10. Metrics and reports


Nextflow is able to produce multiple reports and charts providing several runtime metrics and execution information.

Run the rnaseq-nf (https://github.com/nextflow-io/rnaseq-nf) pipeline previously introduced as shown below:

nextflow run rnaseq-nf -with-docker -with-report -with-trace -with-timeline -with-dag dag.png

The -with-report option enables the creation of the workflow execution report. Open the file report.html with a
browser to see the report created with the above command.

The -with-trace option enables the creation of a tab-separated file containing runtime information for each executed
task. Check the content of the file trace.txt for an example.

The -with-timeline option enables the creation of the workflow timeline report showing how processes were
executed over time. This may be useful to identify the most time-consuming tasks and bottlenecks. See an example at this
link (https://www.nextflow.io/docs/latest/tracing.html#timeline-report).


Finally the -with-dag option enables the rendering of the workflow execution directed acyclic graph representation.
Note: this feature requires the installation of Graphviz (http://www.graphviz.org/) on your computer. See here
(https://www.nextflow.io/docs/latest/tracing.html#dag-visualisation) for details.

Note: runtime metrics may be incomplete for short running tasks as in the case of this tutorial.

 You can view the HTML files by right-clicking on the file name in the left side-bar and choosing the Preview menu item.

7.11. Run a project from GitHub


Nextflow allows the execution of a pipeline project directly from a GitHub repository (or similar services, e.g. BitBucket
and GitLab).

This simplifies the sharing and the deployment of complex projects and tracking changes in a consistent manner.

The following GitHub repository hosts a complete version of the workflow introduced in this tutorial:

github.com/nextflow-io/rnaseq-nf

You can run it by specifying the project name as shown below:

nextflow run nextflow-io/rnaseq-nf -with-docker

It automatically downloads the project and stores it in the $HOME/.nextflow folder.

Use the command info to show the project information, e.g.:

nextflow info nextflow-io/rnaseq-nf

Nextflow allows the execution of a specific revision of your project by using the -r command line option. For
example:

nextflow run nextflow-io/rnaseq-nf -r dev

Revisions are defined by using Git tags or branches defined in the project repository.

This allows a precise control of the changes in your project files and dependencies over time.

7.12. More resources


Nextflow documentation (http://docs.nextflow.io) - The Nextflow docs home.

Nextflow patterns (https://github.com/nextflow-io/patterns) - A collection of Nextflow implementation patterns.

CalliNGS-NF (https://github.com/CRG-CNAG/CalliNGS-NF) - A variant calling pipeline implementing GATK best practices.

nf-core (http://nf-co.re/) - A community collection of production ready genomic pipelines.

8. Manage dependencies & containers


Computational workflows are rarely composed of a single script or tool. Most of the time they require the use of
dozens of different software components or libraries.


Installing and maintaining such dependencies is a challenging task and the most common source of irreproducibility in
scientific applications.

Containers are exceptionally useful in scientific workflows. They allow the encapsulation of software dependencies, i.e.
tools and libraries required by a data analysis application in one or more self-contained, ready-to-run, immutable
container images that can be easily deployed in any platform supporting the container runtime.

8.1. Docker hands-on


Get practice with basic Docker commands to pull, run and build your own containers.

A container is a ready-to-run Linux environment which can be executed in an isolated manner from the hosting
system. It has its own copy of the file system, process space, memory management, etc.

Containers rely on Linux features known as Control Groups or Cgroups (https://en.wikipedia.org/wiki/Cgroups), introduced with
kernel 2.6.

Docker adds to this concept a handy management tool to build, run and share container images.

These images can be uploaded and published in a centralised repository known as Docker Hub (https://hub.docker.com), or
hosted by other parties, for example Quay (https://quay.io).

8.1.1. Run a container


Running a container is as easy as using the following command:

BASH
docker run <container-name>

For example:

BASH
docker run hello-world

8.1.2. Pull a container


The pull command allows you to download a Docker image without running it. For example:

BASH
docker pull debian:stretch-slim

The above command downloads a Debian Linux image.

8.1.3. Run a container in interactive mode


Launching a BASH shell in the container allows you to operate in interactive mode in the containerised operating
system. For example:

BASH
docker run -it debian:stretch-slim bash

Once the container is launched you will notice that it's running as root (!). Use the usual commands to navigate the file
system.

To exit from the container, stop the BASH session with the exit command.

8.1.4. Your first Dockerfile


Docker images are created by using a so-called Dockerfile, i.e. a simple text file containing a list of commands to be
executed to assemble and configure the image with the software packages required.

In this step you will create a Docker image containing the Salmon tool.

Warning: the Docker build process automatically copies all files that are located in the current directory to the Docker
daemon in order to create the image. This can take a lot of time when many or big files exist. For this reason it's important
to always work in a directory containing only the files you really need to include in your Docker image. Alternatively
you can use the .dockerignore file to select the paths to exclude from the build.

Then use your favourite editor, e.g. vim, to create a file named Dockerfile and copy the following content:

DOCKER
1 FROM debian:stretch-slim
2
3 MAINTAINER <your name>
4
5 RUN apt-get update && apt-get install -y curl cowsay
6
7 ENV PATH=$PATH:/usr/games/

When done save the file.

8.1.5. Build the image


Build the Docker image by using the following command:

BASH
docker build -t my-image .

Note: don't miss the dot in the above command. When it completes, verify that the image has been created by listing all
available images:

BASH
docker images

You can try your new container by running this command:

BASH
docker run my-image cowsay Hello Docker!

8.1.6. Add a software package to the image


Add the Salmon package to the Docker image by adding the following snippet to the Dockerfile:

DOCKER
1 RUN curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v1.0.0/salmon-1.0.0_linux_x86_64.tar.gz | tar xz \
2 && mv /salmon-*/bin/* /usr/bin/ \
3 && mv /salmon-*/lib/* /usr/lib/

Save the file and build the image again with the same command as before:

BASH
docker build -t my-image .

You will notice that it creates a new Docker image with the same name but with a different image ID.

8.1.7. Run Salmon in the container


Check that everything is fine by running Salmon in the container as shown below:

BASH
docker run my-image salmon --version

You can even launch a container in an interactive mode by using the following command:

BASH
docker run -it my-image bash

Use the exit command to terminate the interactive session.

8.1.8. File system mounts


Create a genome index file by running Salmon in the container.

Try to run Salmon in the container with the following command:

BASH
docker run my-image \
salmon index -t $PWD/data/ggal/transcriptome.fa -i transcript-index

The above command fails because Salmon cannot access the input file.

This happens because the container runs in a completely separate file system and it cannot access the hosting file system
by default.

You will need to use the --volume command line option to mount the input file(s), e.g.

BASH
docker run --volume $PWD/data/ggal/transcriptome.fa:/transcriptome.fa my-image \
salmon index -t /transcriptome.fa -i transcript-index

 The generated transcript-index directory is still not accessible in the host file system (and it is actually lost).

 An easier way is to mount a parent directory to an identical one in the container. This allows you to use the same path when running it in the container, e.g.

BASH
docker run --volume $HOME:$HOME --workdir $PWD my-image \
salmon index -t $PWD/data/ggal/transcriptome.fa -i transcript-index

Check the content of the transcript-index folder by entering the command:

BASH
ls -la transcript-index

 Note that the permissions for files created by the Docker execution are root .

Exercise
Use the option -u $(id -u):$(id -g) to allow Docker to create files with the right permission.
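
Applied to the previous command, it could look like this:

BASH
docker run -u $(id -u):$(id -g) --volume $HOME:$HOME --workdir $PWD my-image \
  salmon index -t $PWD/data/ggal/transcriptome.fa -i transcript-index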

8.1.9. Upload the container in the Docker Hub (bonus)


Publish your container in the Docker Hub to share it with other people.

Create an account on the hub.docker.com web site. Then from your shell terminal run the following command, entering
the user name and password you specified when registering in the Hub:

BASH
docker login

Tag the image with your Docker user name account:

BASH
docker tag my-image <user-name>/my-image

Finally push it to the Docker Hub:

BASH
docker push <user-name>/my-image

After that anyone will be able to download it by using the command:

BASH
docker pull <user-name>/my-image

Note how after a pull and push operation, Docker prints the container digest number e.g.

BASH
Digest: sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266
Status: Downloaded newer image for nextflow/rnaseq-nf:latest

This is a unique and immutable identifier that can be used to reference a container image unambiguously. For
example:

BASH
docker pull nextflow/rnaseq-nf@sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266

8.1.10. Run a Nextflow script using a Docker container


The simplest way to run a Nextflow script with a Docker image is using the -with-docker command line option:

nextflow run script2.nf -with-docker my-image

We’ll see later how to configure in the Nextflow config file which container to use, instead of having to specify it every
time as a command line argument.

8.2. Singularity
Singularity (http://singularity.lbl.gov) is a container runtime designed to work in HPC data centers, where the usage of Docker
is generally not allowed due to security constraints.

Singularity implements a container execution model similar to Docker; however, it uses a completely different
implementation design.

A Singularity container image is archived as a plain file that can be stored in a shared file system and accessed by
many computing nodes managed by a batch scheduler.

8.2.1. Create a Singularity image


Singularity images are created using a Singularity file, in a similar manner to Docker though using a different syntax.

SINGULARITY
1 Bootstrap: docker
2 From: debian:stretch-slim
3
4 %environment
5 export PATH=$PATH:/usr/games/
6
7 %labels
8 AUTHOR <your name>
9
10 %post
11
12 apt-get update && apt-get install -y locales-all curl cowsay
13 curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v1.0.0/salmon-1.0.0_linux_x86_64.tar.gz | tar xz \
14 && mv /salmon-*/bin/* /usr/bin/ \
15 && mv /salmon-*/lib/* /usr/lib/

Once you have saved the Singularity file, create the image with this command:

BASH
sudo singularity build my-image.sif Singularity

Note: the build command requires sudo permissions. A common workaround consists of building the image on a local
workstation and then deploying it to the cluster by copying the image file.

8.2.2. Running a container


Once done, you can run your container with the following command:

BASH
singularity exec my-image.sif cowsay 'Hello Singularity'

By using the shell command you can enter the container in interactive mode. For example:

BASH
singularity shell my-image.sif

Once in the container instance run the following commands:

BASH
touch hello.txt
ls -la

Note how the files on the host environment are shown. Singularity automatically mounts the host
 $HOME directory and uses the current work directory.

8.2.3. Import a Docker image


An easier way to create a Singularity container, without requiring sudo permission and boosting container
interoperability, is to import a Docker container image by pulling it directly from a Docker registry. For example:

BASH
singularity pull docker://debian:stretch-slim

The above command automatically downloads the Debian Docker image and converts it to a Singularity image stored in
the current directory (e.g. debian_stretch-slim.sif ).


8.2.4. Run a Nextflow script using a Singularity container


Nextflow allows the transparent usage of Singularity containers as easily as Docker ones.

It only requires enabling the Singularity engine in place of Docker, for example by using the -with-singularity
command line option:

BASH
nextflow run script7.nf -with-singularity nextflow/rnaseq-nf

As before the Singularity container can also be provided in the Nextflow config file. We’ll see later how to do it.

8.2.5. The Singularity Container Library


The authors of Singularity, SyLabs (https://www.sylabs.io/), have their own repository of Singularity containers.

In the same way that we can push Docker images to Docker Hub, we can upload Singularity images to the Singularity
Library.

8.3. Conda/Bioconda packages


Conda is a popular package and environment manager. The built-in support for Conda allows Nextflow pipelines to
automatically create and activate the Conda environment(s) given the dependencies specified by each process.

A Conda environment is defined using a YAML file which lists the required software packages. For example:

YAML
name: nf-tutorial
channels:
- defaults
- bioconda
- conda-forge
dependencies:
- salmon=1.0.0
- fastqc=0.11.5
- multiqc=1.5

Given the recipe file, the environment is created using the command shown below:

BASH
conda env create --file env.yml

You can check the environment was created successfully with the command shown below:

BASH
conda env list

To enable the environment you can use the activate command:

BASH
conda activate nf-tutorial

Nextflow is able to manage the activation of a Conda environment when its directory is specified using the -with-
conda option. For example:

BASH
nextflow run script7.nf -with-conda /home/ubuntu/miniconda2/envs/nf-tutorial


 When a YAML recipe file is specified as the Conda environment, Nextflow automatically downloads the required dependencies, builds the environment and activates it.

This makes it easier to manage different environments for the processes in the workflow script.
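
For example, assuming the recipe shown earlier is saved in a file named env.yml (an illustrative name):

BASH
nextflow run script7.nf -with-conda env.yml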

See the Conda section (https://www.nextflow.io/docs/latest/conda.html) in the Nextflow documentation for details.

8.3.1. Bonus Exercise


Take a look at the Dockerfile of the rnaseq-nf (https://github.com/nextflow-io/rnaseq-nf) pipeline to determine how it is built.

8.4. BioContainers
Another useful resource linking together Bioconda and containers is the BioContainers (https://biocontainers.pro) project.
BioContainers is a community initiative that provides a registry of container images for every Bioconda recipe.

8.5. More resources


Marcel (https://github.com/brouberol/marcel) - the French Docker... (humor)

9. Nextflow configuration
A key Nextflow feature is the ability to decouple the workflow implementation from the configuration settings required by
the underlying execution platform.

This enables portable deployment without the need to modify the application code.

9.1. Configuration file


When a pipeline script is launched Nextflow looks for a file named nextflow.config in the current directory and in
the script base directory (if it is not the same as the current directory). Finally it checks for the file
$HOME/.nextflow/config .

When more than one of the above files exists, they are merged, so that the settings in the first override the same settings
that may appear in the second one, and so on.

The default config file search mechanism can be extended by providing an extra configuration file using the command
line option -c <config file> .

9.1.1. Config syntax


A Nextflow configuration file is a simple text file containing a set of properties defined using the syntax:

name = value

Please note, string values need to be wrapped in quotation characters while numbers and boolean
values ( true , false ) do not. Also note that values are typed, meaning, for example, that 1 is
 different from '1' , since the first is interpreted as the number one, while the latter is interpreted as
a string value.

9.1.2. Config variables


Configuration properties can be used as variables in the configuration file itself, by using the usual $propertyName or
${expression} syntax.


CONFIG
1 propertyOne = 'world'
2 anotherProp = "Hello $propertyOne"
3 customPath = "$PATH:/my/app/folder"

In the configuration file it’s possible to access any variable defined in the host environment such as
 $PATH , $HOME , $PWD , etc.

9.1.3. Config comments


Configuration files use the same conventions for comments used in the Nextflow script:

NEXTFLOW
1 // comment a single line
2
3 /*
4 a comment spanning
5 multiple lines
6 */

9.1.4. Config scopes


Configuration settings can be organized in different scopes by dot prefixing the property names with a scope identifier
or grouping the properties in the same scope using the curly brackets notation. This is shown in the following example:

CONFIG
1 alpha.x = 1
2 alpha.y = 'string value..'
3
4 beta {
5 p = 2
6 q = 'another string ..'
7 }

9.1.5. Config params


The scope params allows the definition of workflow parameters that override the values defined in the main
workflow script.

This is useful to consolidate one or more execution parameters in a separate file.

CONFIG
1 // config file
2 params.foo = 'Bonjour'
3 params.bar = 'le monde!'

NEXTFLOW
1 // workflow script
2 params.foo = 'Hello'
3 params.bar = 'world!'
4
5 // print both params
6 println "$params.foo $params.bar"

Exercise
Save the first snippet as nextflow.config and the second one as params.nf . Then run:

CMD
nextflow run params.nf

Execute it again, specifying the foo parameter on the command line:


CMD
nextflow run params.nf --foo Hola

Compare the result of the two executions.
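
If everything is wired correctly, the first run should print Bonjour le monde!, since the params defined in the config file override the defaults set in the script, while the second run should print Hola le monde!, because command line options take precedence over both.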

9.1.6. Config env


The env scope allows the definition of one or more variables that will be exported into the environment where the
workflow tasks will be executed.

CONFIG
1 env.ALPHA = 'some value'
2 env.BETA = "$HOME/some/path"

Exercise
Save the above snippet in a file named my-env.config . Then save the snippet below in a file named foo.nf :

NEXTFLOW
1 process foo {
2 echo true
3 '''
4 env | egrep 'ALPHA|BETA'
5 '''
6 }

Finally execute the following command:

nextflow run foo.nf -c my-env.config

9.1.7. Config process


The process directives (https://www.nextflow.io/docs/latest/process.html#directives) allow the specification of specific settings
for the task execution such as cpus , memory , container and other resources in the pipeline script.

This is useful especially when prototyping a small workflow script.

However it’s always a good practice to decouple the workflow execution logic from the process configuration settings,
i.e. it’s strongly suggested to define the process settings in the workflow configuration file instead of the workflow
script.

The process configuration scope allows the setting of any process directives
(https://www.nextflow.io/docs/latest/process.html#directives) in the Nextflow configuration file. For example:

CONFIG
1 process {
2 cpus = 10
3 memory = 8.GB
4 container = 'biocontainers/bamtools:v2.4.0_cv3'
5 }

The above config snippet defines the cpus , memory and container directives for all processes in your workflow
script.

The process selector (https://www.nextflow.io/docs/latest/config.html#process-selectors) can be used to apply the configuration
to a specific process or group of processes (discussed later).


 Memory and time duration units can be specified either using a string based notation, in which the
digit(s) and the unit can be separated by a blank, or using a numeric notation, in which the digit(s)
and the unit are separated by a dot character and the value is not enclosed in quote characters.

String syntax      Numeric syntax    Value
'10 KB'            10.KB             10240 bytes
'500 MB'           500.MB            524288000 bytes
'1 min'            1.min             60 seconds
'1 hour 25 sec'    -                 1 hour and 25 seconds

 The syntax for setting process directives in the configuration file requires = , i.e. the assignment operator, whereas it must not be used when setting process directives in the workflow script.

This is especially important when you want to define a config setting using a dynamic expression with a closure. For
example:

process {
memory = { 4.GB * task.cpus }
}

Directives that require more than one value, e.g. pod (https://www.nextflow.io/docs/latest/process.html#pod), need to be
expressed as a map object in the configuration file.

process {
pod = [env: 'FOO', value: '123']
}

Finally, directives that can be repeated in the process definition need to be defined as a list object in the configuration
file. For example:

process {
pod = [ [env: 'FOO', value: '123'],
[env: 'BAR', value: '456'] ]
}

9.1.8. Config Docker execution


The container image to be used for the process execution can be specified in the nextflow.config file:

CONFIG
1 process.container = 'nextflow/rnaseq-nf'
2 docker.enabled = true

 The use of the unique SHA256 image ID guarantees that the image content does not change over time.


CONFIG
1 process.container = 'nextflow/rnaseq-nf@sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266'
2 docker.enabled = true

9.1.9. Config Singularity execution


To run the workflow execution with a Singularity container, provide the container image file path in the Nextflow
config file using the container directive:

CONFIG
1 process.container = '/some/singularity/image.sif'
2 singularity.enabled = true

 The container image file must be an absolute path i.e. it must start with a / .

The following protocols are supported:

library:// download the container image from the Singularity Library service (https://cloud.sylabs.io/library).

shub:// download the container image from the Singularity Hub (https://singularity-hub.org/).

docker:// download the container image from the Docker Hub (https://hub.docker.com/) and convert it to the
Singularity format.

docker-daemon:// pull the container image from a local Docker installation and convert it to a Singularity image
file.

 When specifying a plain Docker container image name, Nextflow implicitly downloads and converts it to a Singularity image when the Singularity execution is enabled. For example:

CONFIG
1 process.container = 'nextflow/rnaseq-nf'
2 singularity.enabled = true

The above configuration instructs Nextflow to use the Singularity engine to run your script processes. The container is
pulled from the Docker registry and cached in the current directory to be used for further runs.

Alternatively, if you have a Singularity image file, its absolute path can be specified as the container name
either using the -with-singularity option or the process.container setting in the config file.

Try to run the script as shown below:

BASH
nextflow run script7.nf

Note: Nextflow will pull the container image automatically; it will require a few seconds depending on the network
connection speed.

9.1.10. Config Conda execution


The use of a Conda environment can also be configured by adding the following setting in the
nextflow.config file:

CONFIG
1 process.conda = "/home/ubuntu/miniconda2/envs/nf-tutorial"


You can either specify the path of an existing Conda environment directory or the path of a Conda environment YAML
file.

10. Deployment scenarios


Real world genomic applications can spawn the execution of thousands of jobs. In this scenario a batch scheduler is
commonly used to deploy a pipeline in a computing cluster, allowing the execution of many jobs in parallel across
many computing nodes.

Nextflow has built-in support for the most commonly used batch schedulers, such as Univa Grid Engine, SLURM
(https://slurm.schedmd.com/) and IBM LSF, among others. Check the Nextflow documentation for the complete list of
supported execution platforms (https://www.nextflow.io/docs/latest/executor.html).

10.1. Cluster deployment


A key Nextflow feature is the ability to decouple the workflow implementation from the actual execution platform
by implementing an abstraction layer that allows the deployment of the resulting workflow on any execution platform
supported by the framework.

To run your pipeline with a batch scheduler modify the nextflow.config file specifying the target executor and the
required computing resources if needed. For example:

CONFIG
1 process.executor = 'slurm'

10.2. Managing cluster resources


When using a batch scheduler it is generally necessary to specify the amount of resources, i.e. cpus, memory, execution time,
etc., required by each task.

This can be done using the following process directives:

queue - the cluster queue to be used for the computation (https://www.nextflow.io/docs/latest/process.html#queue)

cpus - the number of cpus to be allocated for a task execution (https://www.nextflow.io/docs/latest/process.html#cpus)

memory - the amount of memory to be allocated for a task execution (https://www.nextflow.io/docs/latest/process.html#memory)

time - the max amount of time to be allocated for a task execution (https://www.nextflow.io/docs/latest/process.html#time)

disk - the amount of disk storage required for a task execution (https://www.nextflow.io/docs/latest/process.html#disk)

10.2.1. Workflow wide resources


Use the scope process to define the resource requirements for all processes in your workflow applications. For
example:

CONFIG
1 process {
2 executor = 'slurm'
3 queue = 'short'
4 memory = '10 GB'
5 time = '30 min'
6 cpus = 4
7 }

10.2.2. Configure process by name


In real world applications different tasks need different amounts of computing resources. It is possible to define the
resources for a specific task using the selector withName: followed by the process name:

CONFIG
1 process {
2 executor = 'slurm'
3 queue = 'short'
4 memory = '10 GB'
5 time = '30 min'
6 cpus = 4
7
8 withName: foo {
9 cpus = 4
10 memory = '20 GB'
11 queue = 'short'
12 }
13
14 withName: bar {
15 cpus = 8
16 memory = '32 GB'
17 queue = 'long'
18 }
19 }

10.2.3. Configure process by labels


When a workflow application is composed of many processes, it can be overkill to list all the process names in the
configuration file to specify the resources for each of them.

A better strategy consists of annotating the processes with a label (https://www.nextflow.io/docs/latest/process.html#label)
directive. Then the resources are specified in the configuration file for all the processes having the same label.

The workflow script:

NEXTFLOW
1 process task1 {
2 label 'long'
3
4 """
5 first_command --here
6 """
7 }
8
9 process task2 {
10 label 'short'
11
12 """
13 second_command --here
14 """
15 }
16

The configuration file:

CONFIG
1 process {
2 executor = 'slurm'
3
4 withLabel: 'short' {
5 cpus = 4
6 memory = '20 GB'
7 queue = 'alpha'
8 }
9
10 withLabel: 'long' {
11 cpus = 8
12 memory = '32 GB'
13 queue = 'omega'
14 }
15 }

10.2.4. Configure multiple containers


It is possible to use a different container for each process in your workflow. Given a workflow script defining two
processes, it's possible to define a config file as shown below:

CONFIG
1 process {
2 withName: foo {
3 container = 'some/image:x'
4 }
5 withName: bar {
6 container = 'other/image:y'
7 }
8 }
9
10 docker.enabled = true


 A single fat container or many slim containers? Both approaches have pros & cons. A single
container is simpler to build and to maintain; however, when using many tools the image can
become very big and tools can conflict with each other. Using a container for each process can result in
many different images to build and to maintain, especially when processes in your workflow use
different tools for each task.

Read more about config process selectors at this link (https://www.nextflow.io/docs/latest/config.html#process-selectors).

10.3. Configuration profiles


Configuration files can contain the definition of one or more profiles. A profile is a set of configuration attributes that
can be activated/chosen when launching a pipeline execution by using the -profile command line option.

Configuration profiles are defined by using the special scope profiles which group the attributes that belong to the
same profile using a common prefix. For example:

CONFIG
1 profiles {
2
3 standard {
4 params.genome = '/local/path/ref.fasta'
5 process.executor = 'local'
6 }
7
8 cluster {
9 params.genome = '/data/shared/ref.fasta'
10 process.executor = 'sge'
11 process.queue = 'long'
12 process.memory = '10GB'
13 process.conda = '/some/path/env.yml'
14 }
15
16 cloud {
17 params.genome = '/data/shared/ref.fasta'
18 process.executor = 'awsbatch'
19 process.container = 'cbcrg/imagex'
20 docker.enabled = true
21 }
22
23 }

This configuration defines three different profiles: standard , cluster and cloud that set different process
configuration strategies depending on the target runtime platform. By convention the standard profile is implicitly
used when no other profile is specified by the user.

To enable a specific profile use the -profile option followed by the profile name:

CMD
nextflow run <your script> -profile cluster

Two or more configuration profiles can be specified by separating the profile names with a comma
 character:

CMD
nextflow run <your script> -profile standard,cloud

10.4. Cloud deployment


AWS Batch (https://aws.amazon.com/batch/) is a managed computing service that allows the execution of containerised
workloads in the Amazon cloud infrastructure.

Nextflow provides built-in support for AWS Batch, which allows the seamless deployment of a Nextflow pipeline in
the cloud, offloading the process executions as Batch jobs.

Once the Batch environment is configured, specifying the instance types to be used and the max number of cpus to be
allocated, you need to create a Nextflow configuration file like the one shown below:

CONFIG
process.executor = 'awsbatch'                            (1)
process.queue = 'nextflow-ci'                            (2)
process.container = 'nextflow/rnaseq-nf:latest'          (3)
workDir = 's3://nextflow-ci/work/'                       (4)
aws.region = 'eu-west-1'                                 (5)
aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'   (6)

(1) Set AWS Batch as the executor to run the processes in the workflow
(2) The name of the computing queue defined in the Batch environment
(3) The Docker container image to be used to run each job
(4) The workflow work directory must be an AWS S3 bucket
(5) The AWS region to be used
(6) The path of the AWS CLI tool required to download/upload files to/from the container

 The best practice is to keep these settings as a separate profile in your workflow config file. This allows the execution with a simple command:

nextflow run script7.nf

The complete details about AWS Batch deployment are available at this link
(https://www.nextflow.io/docs/latest/awscloud.html#aws-batch).

10.5. Volume mounts


EBS volumes (or other supported storage) can be mounted in the job container using the following configuration
snippet:

aws {
batch {
volumes = '/some/path'
}
}

Multiple volumes can be specified using comma-separated paths. The usual Docker volume mount syntax can be used
to define complex volumes for which the container path is different from the host path, or to specify a read-only
option:

aws {
region = 'eu-west-1'
batch {
volumes = ['/tmp', '/host/path:/mnt/path:ro']
}
}


IMPORTANT:

This is a global configuration that has to be specified in a Nextflow config file; as such it's applied to all process
executions.

Nextflow expects those paths to be available. It does not handle the provisioning of EBS volumes or other kinds of
storage.

10.6. Custom job definition


Nextflow automatically creates the Batch Job definitions
(https://docs.aws.amazon.com/batch/latest/userguide/job_definitions.html) needed to execute your pipeline processes. Therefore
it's not required to define them before running your workflow.

However, you may still need to specify a custom Job Definition to provide fine-grained control of the configuration
settings of a specific job e.g. to define custom mount paths or other special settings of a Batch Job.

To use your own job definition in a Nextflow workflow, use it in place of the container image name, prefixing it with
the job-definition:// string. For example:

process {
container = 'job-definition://your-job-definition-name'
}

10.7. Custom image


Since Nextflow requires the AWS CLI tool to be accessible in the computing environment, a common solution consists of
creating a custom AMI and installing the tool in a self-contained manner, e.g. using the Conda package manager.

When creating your custom AMI for AWS Batch, make sure to use the Amazon ECS-Optimized
 Amazon Linux AMI as the base image.

The following snippet shows how to install AWS CLI with Miniconda:

sudo yum install -y bzip2 wget


wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -f -p $HOME/miniconda
$HOME/miniconda/bin/conda install -c conda-forge -y awscli
rm Miniconda3-latest-Linux-x86_64.sh

 The aws tool will be placed in a directory named bin in the main installation folder. Modifying this directory structure after the installation will cause the tool not to work properly.

Finally specify the aws full path in the Nextflow config file as shown below:

aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'

10.8. Launch template


An alternative to creating a custom AMI is to use a Launch template
(https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-launch-templates.html) that installs the AWS CLI tool during the
instance boot via custom user-data.


In the EC2 dashboard create a Launch template specifying in the user data field:

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="//"

--//
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/sh
## install required deps
set -x
export PATH=/usr/local/bin:$PATH
yum install -y jq python27-pip sed wget bzip2
pip install -U boto3

## install awscli
USER=/home/ec2-user
wget -q https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -f -p $USER/miniconda
$USER/miniconda/bin/conda install -c conda-forge -y awscli
rm Miniconda3-latest-Linux-x86_64.sh
chown -R ec2-user:ec2-user $USER/miniconda

--//--

Then in the Batch dashboard create a new compute environment and specify the newly created launch template in the
corresponding field.

10.9. Hybrid deployments


Nextflow allows the use of multiple executors in the same workflow application. This feature enables the deployment
of hybrid workloads, in which some jobs are executed on the local computer or local computing cluster and some jobs are
offloaded to the AWS Batch service.

To enable this feature use one or more process selectors
(https://www.nextflow.io/docs/latest/config.html#config-process-selectors) in your Nextflow configuration file to apply the AWS
Batch configuration (https://www.nextflow.io/docs/latest/awscloud.html#awscloud-batch-config) only to a subset of processes in
your workflow. For example:

CONFIG
process {
    executor = 'slurm'               (1)
    queue = 'short'                  (2)

    withLabel: bigTask {             (3)
        executor = 'awsbatch'        (4)
        queue = 'my-batch-queue'     (5)
        container = 'my/image:tag'   (6)
    }
}

aws {
    region = 'eu-west-1'             (7)
}

(1) Set slurm as the default executor
(2) Set the queue for the SLURM cluster
(3) Settings for processes labelled bigTask
(4) Set awsbatch as the executor for the bigTask processes
(5) Set the queue for the bigTask processes
(6) Set the container image to deploy the bigTask processes
(7) Define the region for the Batch execution

11. Execution cache and resume


The Nextflow caching mechanism works by assigning a unique ID to each task, which is used to create a separate execution
directory where the task is executed and the results stored.

The task unique ID is generated as a 128-bit hash number obtained by composing the task input values, input files and the
command string.

The pipeline work directory is organized as shown below:

work/
├── 12
│   └── 1adacb582d2198cd32db0e6f808bce
│   ├── genome.fa -> /data/../genome.fa
│   └── index
│   ├── hash.bin
│   ├── header.json
│   ├── indexing.log
│   ├── quasi_index.log
│   ├── refInfo.json
│   ├── rsd.bin
│   ├── sa.bin
│   ├── txpInfo.bin
│   └── versionInfo.json
├── 19
│   └── 663679d1d87bfeafacf30c1deaf81b
│   ├── ggal_gut
│   │   ├── aux_info
│   │   │   ├── ambig_info.tsv
│   │   │   ├── expected_bias.gz
│   │   │   ├── fld.gz
│   │   │   ├── meta_info.json
│   │   │   ├── observed_bias.gz
│   │   │   └── observed_bias_3p.gz
│   │   ├── cmd_info.json
│   │   ├── libParams
│   │   │   └── flenDist.txt
│   │   ├── lib_format_counts.json
│   │   ├── logs
│   │   │   └── salmon_quant.log
│   │   └── quant.sf
│   ├── ggal_gut_1.fq -> /data/../ggal_gut_1.fq
│   ├── ggal_gut_2.fq -> /data/../ggal_gut_2.fq
│   └── index -> /data/../asciidocs/day2/work/12/1adacb582d2198cd32db0e6f808bce/index

11.1. How resume works


The -resume command line option allows the continuation of a pipeline execution from the last step that was
successfully completed:

nextflow run <script> -resume


In practical terms the pipeline is executed from the beginning; however, before launching the execution of a process,
Nextflow uses the task unique ID to check if the work directory already exists and contains a valid command exit
status and the expected output files.

If this condition is satisfied the task execution is skipped and previously computed results are used as the process
results.

The first task, for which a new output is computed, invalidates all downstream executions in the remaining DAG.

11.2. Work directory


The task work directories are created in the folder work in the launching path by default. This is supposed to be a
scratch storage area that can be cleaned up once the computation is completed.

 Workflow final outputs are supposed to be stored in a different location specified using one or more publishDir (https://www.nextflow.io/docs/latest/process.html#publishdir) directives.

A different location for the execution work directory can be specified using the command line option -w e.g.

nextflow run <script> -w /some/scratch/dir

 Deleting or moving the pipeline work directory will prevent the use of the resume feature in subsequent runs.

The hash code for input files is computed using:

The complete file path

The file size

The last modified timestamp

Therefore just touching a file will invalidate the related task execution.

11.3. How to organize in silico experiments


It’s a good practice to organize each experiment in its own folder. The experiment’s main input parameters should
be specified using a Nextflow config file. This makes it simple to track and replicate the experiment over time.

Note that within the same experiment the same pipeline can be executed multiple times; however, launching two (or
more) Nextflow instances in the same directory concurrently should be avoided.

The nextflow log command lists the executions run in the current folder:

BASH
1 $ nextflow log
2
3 TIMESTAMP DURATION RUN NAME STATUS REVISION ID SESSION ID
4 2019-05-06 12:07:32 1.2s focused_carson ERR a9012339ce 7363b3f0-09ac-495b-a947-28cf430d0b85
5 2019-05-06 12:08:33 21.1s mighty_boyd OK a9012339ce 7363b3f0-09ac-495b-a947-28cf430d0b85
6 2019-05-06 12:31:15 1.2s insane_celsius ERR b9aefc67b4 4dc656d2-c410-44c8-bc32-7dd0ea87bebf
7 2019-05-06 12:31:24 17s stupefied_euclid OK b9aefc67b4 4dc656d2-c410-44c8-bc32-7dd0ea87bebf

You can use either the session ID or the run name to recover a specific execution. For example:


nextflow run rnaseq-nf -resume mighty_boyd

11.4. Execution provenance


When provided with a run name or session ID, the log command can return much useful information about a
pipeline execution that can be used to create a provenance report.

By default, it lists the work directories used to compute each task. For example:

$ nextflow log tiny_fermat

/data/.../work/7b/3753ff13b1fa5348d2d9b6f512153a
/data/.../work/c1/56a36d8f498c99ac6cba31e85b3e0c
/data/.../work/f7/659c65ef60582d9713252bcfbcc310
/data/.../work/82/ba67e3175bd9e6479d4310e5a92f99
/data/.../work/e5/2816b9d4e7b402bfdd6597c2c2403d
/data/.../work/3b/3485d00b0115f89e4c202eacf82eba

Using the option -f (fields), it’s possible to specify which metadata should be printed by the log command. For
example:

$ nextflow log tiny_fermat -f 'process,exit,hash,duration'

index   0  7b/3753ff  2.0s
fastqc  0  c1/56a36d  9.3s
fastqc  0  f7/659c65  9.1s
quant   0  82/ba67e3  2.7s
quant   0  e5/2816b9  3.2s
multiqc 0  3b/3485d0  6.3s

The complete list of available fields can be retrieved with the command:

nextflow log -l

The option -F allows the specification of filtering criteria to print only a subset of tasks. For example:

$ nextflow log tiny_fermat -F 'process =~ /fastqc/'

/data/.../work/c1/56a36d8f498c99ac6cba31e85b3e0c
/data/.../work/f7/659c65ef60582d9713252bcfbcc310

This can be useful to locate the work directories of specific tasks.

Finally, the -t option allows the creation of a basic custom provenance report by providing a template file, in any format of
your choice. For example:


HTML
<div>
<h2>${name}</h2>
<div>
Script:
<pre>${script}</pre>
</div>

<ul>
<li>Exit: ${exit}</li>
<li>Status: ${status}</li>
<li>Work dir: ${workdir}</li>
<li>Container: ${container}</li>
</ul>
</div>

Save the above snippet in a file named template.html. Then run this command:

nextflow log tiny_fermat -t template.html > prov.html

Finally, open the prov.html file with a browser.

11.5. Resume troubleshooting


If your workflow execution is not resumed as expected and one or more tasks are re-executed every time, these may
be the most likely causes:

Input file changed: Make sure that there’s no change in your input files. Don’t forget the task unique hash is computed
taking into account the complete file path, the last modified timestamp and the file size. If any of this information
changes, the workflow will be re-executed even if the input content is the same.

A process modifies an input: A process should never alter input files, otherwise the resume of future executions
will be invalidated for the same reason explained in the previous point.
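As a hedged illustration of the safe pattern (process, channel and file names here are hypothetical), write the
result to a new output file instead of editing the input in place, e.g. with sed -i :

NEXTFLOW
1 process fix {
2 input: file 'sample.txt' from data_ch
3 output: file 'sample.fixed.txt' into fixed_ch
4 """
5 # write to a new file rather than modifying the input sample.txt in place
6 sed 's/foo/bar/' sample.txt > sample.fixed.txt
7 """
8 }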

Inconsistent file attributes: Some shared file systems, such as NFS (https://en.wikipedia.org/wiki/Network_File_System),
may report an inconsistent file timestamp, i.e. a different timestamp for the same file even if it has not been modified. To
prevent this problem use the lenient cache strategy (https://www.nextflow.io/docs/latest/process.html#cache).
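For example, the lenient cache mode can be enabled for all processes by adding this setting to the config file:

CONFIG
1 process.cache = 'lenient'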

Race condition in a global variable: Nextflow is designed to simplify parallel programming without having to take care
of race conditions and access to shared resources. One of the few cases in which a race condition can arise is
when a global variable is used by two (or more) operators. For example:
NEXTFLOW
1 Channel
2 .from(1,2,3)
3 .map { it -> X=it; X+=2 }
4 .view { "ch1 = $it" }
5
6 Channel
7 .from(1,2,3)
8 .map { it -> X=it; X*=2 }
9 .view { "ch2 = $it" }

The problem with this snippet is that the X variable in the closure definition is defined in the global scope. Therefore,
since operators are executed in parallel, the X value can be overwritten by the other map invocation.

The correct implementation requires the use of the def keyword to declare the variable as local.


NEXTFLOW
1 Channel
2 .from(1,2,3)
3 .map { it -> def X=it; X+=2 }
4 .println { "ch1 = $it" }
5
6 Channel
7 .from(1,2,3)
8 .map { it -> def X=it; X*=2 }
9 .println { "ch2 = $it" }

Non-deterministic input channels: While dataflow channel ordering is guaranteed, i.e. data is read in the same
order in which it’s written to the channel, when a process declares as input two or more channels, each of which is
the output of a different process, the overall input ordering is not consistent over different executions.

In practical terms, consider the following snippet:


NEXTFLOW
1 process foo {
2 input: set val(pair), file(reads) from ...
3 output: set val(pair), file('*.bam') into bam_ch
4 """
5 your_command --here
6 """
7 }
8
9 process bar {
10 input: set val(pair), file(reads) from ...
11 output: set val(pair), file('*.bai') into bai_ch
12 """
13 other_command --here
14 """
15 }
16
17 process gather {
18 input:
19 set val(pair), file(bam) from bam_ch
20 set val(pair), file(bai) from bai_ch
21 """
22 merge_command $bam $bai
23 """
24 }

The inputs declared at lines 19 and 20 can be delivered in any order because the execution order of the processes foo and
bar is not deterministic due to their parallel execution.

Therefore the input of the third process needs to be synchronized using the join
(https://www.nextflow.io/docs/latest/operator.html#join) operator or a similar approach. The third process should be
written as:
NEXTFLOW
1 ...
2
3 process gather {
4 input:
5 set val(pair), file(bam), file(bai) from bam_ch.join(bai_ch)
6 """
7 merge_command $bam $bai
8 """
9 }

12. Errors handling & troubleshooting


12.1. Execution errors debugging

When a process execution exits with a non-zero exit status, Nextflow stops the workflow execution and reports the
failing task:

CMD
ERROR ~ Error executing process > 'index'

Caused by: 1

Process `index` terminated with an error exit status (127)

Command executed: 2

salmon index --threads 1 -t transcriptome.fa -i index

Command exit status: 3

127

Command output: 4

(empty)

Command error: 5

.command.sh: line 2: salmon: command not found

Work dir: 6
/Users/pditommaso/work/0b/b59f362980defd7376ee0a75b41f62

1 A description of the error cause
2 The command executed
3 The command exit status
4 The command standard output, when available
5 The command standard error
6 The command work directory

Review all these data carefully; they can provide valuable information on the cause of the error.

If this is not enough, change into the task work directory. It contains all the files needed to replicate the issue in an isolated
manner.

The task execution directory contains these files:

.command.sh : The command script.

.command.run : The command wrapper used to run the job.

.command.out : The complete job standard output.

.command.err : The complete job standard error.

.command.log : The wrapper execution output.

.command.begin : Sentinel file created as soon as the job is launched.

.exitcode : A file containing the task exit code.

Task input files (symlinks)

Task output files

Verify that the .command.sh file contains the expected command to be executed and all variables are correctly
resolved.


Also verify the existence of the .exitcode file. If it is missing and the .command.begin file does not exist either, the task
was never executed by the subsystem (e.g. the batch scheduler). If the .command.begin file exists, the job was launched
but was likely killed abruptly.

You can replicate the failing execution using the command bash .command.run and verify the cause of the error.
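For example, using the work directory reported in the error message shown above:

$ cd /Users/pditommaso/work/0b/b59f362980defd7376ee0a75b41f62
$ bash .command.run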

12.2. Ignore errors


There are cases in which a process error may be expected and it should not stop the overall workflow execution.

To handle this use case, set the process errorStrategy to ignore :

NEXTFLOW
1 process foo {
2 errorStrategy 'ignore'
3 script:
4 """
5 your_command --this --that
6 """
7 }

If you want to ignore any error, set the same directive in the config file as the default setting:

CONFIG
1 process.errorStrategy = 'ignore'

12.3. Automatic error fail-over


In (rare) cases errors may be caused by transient conditions. In these situations, an effective strategy consists in
re-executing the failing task.

NEXTFLOW
1 process foo {
2 errorStrategy 'retry'
3 script:
4 """
5 your_command --this --that
6 """
7 }

With the retry error strategy, a task that returns a non-zero exit status is re-executed a second time before
the complete workflow execution is stopped.

The directive maxRetries (https://www.nextflow.io/docs/latest/process.html#maxretries) can be used to set the number of times
the task can be re-executed before declaring it failed with an error condition.
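For example, the following sketch combines the two directives so that the task is attempted up to three times
before the workflow is stopped:

NEXTFLOW
1 process foo {
2 errorStrategy 'retry'
3 maxRetries 3
4 script:
5 """
6 your_command --this --that
7 """
8 }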

12.4. Retry with backoff


There are cases in which the required execution resources may be temporarily unavailable, e.g. due to network congestion. In
these cases, simply re-executing the same task will likely result in an identical error. A retry with an exponential
backoff delay can better recover from these error conditions.


NEXTFLOW
1 process foo {
2 errorStrategy { sleep(Math.pow(2, task.attempt) * 200 as long); return 'retry' }
3 maxRetries 5
4 script:
5 '''
6 your_command --here
7 '''
8 }

12.5. Dynamic resources allocation


It’s a very common scenario that different instances of the same process have very different needs in terms of
computing resources. In such situations, requesting too low an amount of memory, for example, will cause some tasks
to fail. Instead, using a higher limit that fits all the tasks in your execution could significantly decrease the execution
priority of your jobs.

To handle this use case, you can use a retry error strategy and increase the computing resources allocated to the
job at each successive attempt.

NEXTFLOW
1 process foo {
2 cpus 4
3 memory { 2.GB * task.attempt } 1
4 time { 1.hour * task.attempt } 2
5 errorStrategy { task.exitStatus == 140 ? 'retry' : 'terminate' } 3
6 maxRetries 3 4
7
8 script:
9 """
10 your_command --cpus $task.cpus --mem $task.memory
11 """
12 }

1 The memory is defined in a dynamic manner: the first attempt is 2 GB, the second 4 GB, and so on.
2 The wall execution time is set dynamically as well: the first execution attempt is set to 1 hour, the second to 2
hours, and so on.
3 If the task returns an exit status equal to 140, the error strategy is set to retry, otherwise the execution
terminates.
4 It can retry the process execution up to three times.

Creative Commons CC BY-NC-ND (Attribution-NonCommercial-NoDerivatives)


Copyright 2020, Seqera Labs (https://www.seqera.io). All rights reserved. Not for redistribution.
