
Tools for Reproducible Research

Soroor Hediyeh-Zadeh
This book is for sale at http://leanpub.com/toolsforreproducibleresearch

This version was published on 2016-10-30

This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing
process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools and
many iterations to get reader feedback, pivot until you have the right book and build traction once
you do.

© 2016 Soroor Hediyeh-Zadeh


Contents

Preface

Reproducible Research with Docker

Minimal Docker tutorial
Writing Dockerfiles
Building Docker Images
Running Docker Images
Docker commands cheatsheet

Preface
Reproducibility of scientific experiments is vital in assessing the validity of the results. A recent
study¹ shows that the results of differential gene expression analysis on microarray data with custom
chips can change as different versions of the chip annotation become available, making such studies
irreproducible. Many leading journals now require authors to demonstrate the reproducibility
of their results by sharing the data and source code underlying the experiment. Even when
the data and source code are provided, readers may fail to reproduce the results because
of differences in operating systems, computing environments and software/library dependencies.
Recently, workflows have been developed that use existing technologies to create stable,
easy-to-share computing environments, or “digital archives”, that run the whole analysis in an
environment similar to the one used by the authors at the time the results were generated. Examples
of such digital archives include the “Reference Environments” proposed by Hurley et al. (2014)² and
the “Continuous Analysis” workflows proposed by Beaulieu-Jones et al. (2016).
This book presents a collection of tools and workflows that have been developed to reproduce the
results of scientific studies. The focus is on developing digital archives for analyses done in the R
statistical programming language; however, the workflows can easily be adapted for use with other
programming languages.

About the cover


The cover of the book is a photo of the gateway of “Shazdeh Garden”, the ninth Iranian garden
registered on UNESCO’s World Heritage List, in Kerman province, Iran. What is peculiar about this
garden is that it is located in a desert. Its water comes from streams originating in the
adjacent mountains. The garden is a perfect example of resource management. When I reflect on my
research career since my undergraduate studies, I see myself as that bare, dry land, which owes
its thriving and prosperity to its supervisor’s trust and support. Melissa Davis believed in my
potential and capabilities and gave me the chance to discover them. “Shazdeh Garden” is, therefore,
a nice metaphor for the flourishing of one’s potential that can be achieved through the belief and
support of one’s supervisors. I hope we see more such inspiring supervisors in the research community.
Soroor Hediyeh-Zadeh
¹Reproducible Computational Workflows with Continuous Analysis. Brett K Beaulieu-Jones, Casey S Greene. bioRxiv 056473; doi:
http://dx.doi.org/10.1101/056473
²Hurley, D. G., Budden, D. M., & Crampin, E. J. (2014). Virtual Reference Environments: a simple way to make research reproducible. Briefings
in Bioinformatics, bbu043. doi:10.1093/bib/bbu043

shazdeh garden
Reproducible Research with Docker
Docker³ can be thought of as a minimal virtual machine: a lightweight, isolated computing environment
(a container) in which one can install and run software without installing it on the host machine’s
operating system. For example, you can set up a Docker container to run R version 3.0 while you
have R version 3.2.0 installed on your own computer. In other words, with Docker you can run
multiple computing environments on one machine in parallel without one disrupting
another. In the context of research reproducibility, the data, source code and any
dependencies are built into a Docker image, the final computing environment, and a container is a
running instance of that image. Docker images can be very large. However,
they can be easily run and shared. For this reason, Docker is becoming a popular technology for
software development.
In this section I’ll provide complete instructions for creating a Docker image that starts an instance
of RStudio Server. The data and source code are made available in this instance of RStudio Server
through the Docker instructions, enabling readers to reproduce the results. Docker images are
programmed by writing Dockerfiles. A Dockerfile is a plain text file (named Dockerfile, without any
suffix) that contains a set of Docker-specific instructions. Like any other technology, Docker has its
own instruction syntax, but in general it is very similar to the Unix command line. Please note
that basic knowledge of the Unix command line is essential for building Docker images. I’ll start with
a minimal tutorial on the Dockerfile syntax.

Minimal Docker tutorial


A Dockerfile contains a set of instructions that tell Docker what needs to be done. The instructions
are followed by their arguments, often Unix commands. Just as a computer requires an operating
system to work, a Docker image needs an OS to be installed on it. For example:

$ cat Dockerfile
FROM ubuntu:14.04
RUN mkdir scripts
WORKDIR /scripts

The FROM instruction tells Docker to import the Ubuntu image, version 14.04, from the ubuntu
repository on DockerHub⁴. This, in effect, installs Ubuntu on our Docker image. The RUN instruction
tells Docker that what follows is a Unix command and should be executed as such. So Docker runs
mkdir scripts on Ubuntu and creates the scripts/ directory. The WORKDIR instruction is similar to cd
in Unix, and changes the current working directory to the scripts/ directory that was created
in the previous step.
³https://docs.docker.com/
⁴DockerHub is a repository for sharing Docker images. See https://hub.docker.com
The Docker instructions ADD and COPY can be used to add or copy a file from your local directory to
the Docker image. CMD specifies the program/executable, given by the command that follows, to run
when a container is started from the image.

FROM r-base
COPY . /usr/local/src/myscripts
WORKDIR /usr/local/src/myscripts
CMD ["Rscript", "myscript.R"]

In the above example, the latest version of the r-base image is imported. COPY copies the
contents of the current directory to the /usr/local/src/myscripts directory in the image. WORKDIR
changes the working directory to myscripts, and CMD runs myscript.R using the command-line executable
Rscript when a container is started. There are a couple more Docker instructions that you will see
in this chapter: EXPOSE and MAINTAINER. The EXPOSE instruction informs Docker that the container
listens on the specified network ports at runtime. The MAINTAINER instruction records who made the
image. For more details on Docker instructions see Best practices for writing Dockerfiles⁵.
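
As a rough sketch of how the COPY and CMD example above might be tried out, the following shell
commands create a throwaway build context holding that Dockerfile and a one-line myscript.R. The
scratch directory, the image tag r-demo and the contents of myscript.R are assumptions made for
illustration only. The docker commands are left commented out so that the snippet can be read, or
run, on a machine without Docker.

```shell
# Create a scratch build context (stand-in for your own project directory).
ctx=$(mktemp -d)

# A one-line R script for the container to run (illustrative content).
cat > "$ctx/myscript.R" <<'EOF'
cat("Hello from R inside Docker\n")
EOF

# The Dockerfile from the example above.
cat > "$ctx/Dockerfile" <<'EOF'
FROM r-base
COPY . /usr/local/src/myscripts
WORKDIR /usr/local/src/myscripts
CMD ["Rscript", "myscript.R"]
EOF

# With Docker installed, build the image and run it once
# (r-demo is a hypothetical tag; --rm deletes the container afterwards):
# docker build -t r-demo "$ctx"
# docker run --rm r-demo
```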

Writing Dockerfiles
As discussed, Docker provides an empty computing environment that can be programmed and
customised through Docker instructions. Therefore, in order to make Docker run an RStudio session,
we need an OS such as Ubuntu installed to have a programmable environment. We
can then install R, RStudio Server and any other dependencies. We can also install executables such
as Git or wget, if we are going to download the data from a link or clone it from a version control
repository, such as GitHub. In addition, we can set up user accounts and tell Docker which
network ports the container listens on at runtime. A Dockerfile containing such instructions defines
the “base image” for our main Docker image, which runs the actual analysis. To build the main Docker
image, we need another Dockerfile. This Dockerfile imports the base image (i.e. an Ubuntu machine
with R, RStudio Server etc. installed), downloads the data and source code and runs the analysis.
Writing such Dockerfiles from scratch could be time consuming. I am, therefore, going to go over
the principles by providing some templates that have already been developed for this purpose. The
template Dockerfiles are available on GitHub⁶.

The Base Image


The following is a template for the base Dockerfile that we would need to use to build the base
image.
⁵https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/
⁶https://github.com/soroorh/toolsforreproducibleresearch

1 FROM ubuntu:14.04
2 MAINTAINER Name Surname <surname.name@domain.edu.au>
3 RUN echo "deb http://cran.ms.unimelb.edu.au/bin/linux/ubuntu trusty/" >> /etc/apt/sources.list
4 RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
5 RUN apt-get update
6 RUN apt-get install -y software-properties-common libxml2-dev
7 RUN add-apt-repository ppa:marutter/rdev
8 RUN apt-get update
9 RUN apt-get upgrade -y
10 RUN apt-get install -y -q r-base r-base-dev gdebi-core libapparmor1 supervisor wget git
11 RUN (wget https://download2.rstudio.org/rstudio-server-0.99.902-amd64.deb && gdebi -n rstudio-server-0.99.902-amd64.deb)
12 RUN rm /rstudio-server-0.99.902-amd64.deb
13 RUN (adduser --disabled-password --gecos "" user && echo "user:user"|chpasswd)
14 RUN mkdir -p /var/log/supervisor
15 ADD supervisord.conf /etc/supervisor/conf.d/supervisord.conf
16 EXPOSE 8787
17 CMD ["/usr/bin/supervisord"]

Here, the FROM instruction in line 1 imports the Ubuntu version 14.04 image from the ubuntu
repository (on DockerHub). This installs Ubuntu on our empty computing environment. The MAINTAINER
instruction specifies the name of the author(s) of the image. In line 3 the RUN instruction is used
to run a command that adds the address of a particular CRAN mirror to the list of package sources
that Ubuntu searches. This ensures that, at each build, the latest version of R available on that
mirror is installed. Note that the R version inside a built image is the one used by the authors at
the time of the study, and does not change when the image is run.
However, if the image is re-built, the instruction tells Docker to check the specified CRAN mirror
and install the latest version available there. The address can be replaced with that of
your preferred CRAN mirror.
Lines 4-9 install the dependencies required to install R on an Ubuntu machine. apt-get is the Ubuntu
command-line tool for installing software. Line 10 is a nice example of how required programs can be
installed. Here, line 10 instructs Docker to install base R, r-base-dev (the tools needed to build
R packages), gdebi-core (required for installing RStudio Server), wget (to download from a link),
git (for version control) and supervisor (a process manager that starts services inside the
container). Line 11 downloads and installs RStudio Server version 0.99.902 for 64-bit systems. As
new versions of R become available, new RStudio Server versions become available, too. So, if
you’re using the template, make sure you replace the RStudio download link with the latest link
(visit the RStudio website for more information). Once RStudio Server is installed, the downloaded
installer is not required any more, and can be removed (line 12). The RUN instruction in line 13
sets up a user account named “user” and sets a password for this account (user:user). In line 14
Docker creates a directory called supervisor/ under the /var/log directory in Ubuntu, using the
mkdir command for creating directories. The ADD instruction in line 15 adds the supervisord.conf
file from the current local directory to /etc/supervisor/conf.d/ in the image. This configuration
file is available with the template in the GitHub repository. Line 16 uses the EXPOSE instruction
to tell Docker that the container listens on network port 8787 at runtime. This is the port on
which RStudio Server is served. Once the image is run from the command line/terminal, RStudio
Server waits for the user to connect to this port from the browser. In the final line, the
supervisord executable is run; it starts the services listed in supervisord.conf, which brings up
the RStudio Server login page.

login page
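
The contents of supervisord.conf are not reproduced in this chapter. Purely as a hedged illustration,
a configuration along the following lines would keep supervisor in the foreground and have it start
RStudio Server. The program name rserver, the path to the rserver binary and its option are
assumptions based on a standard RStudio Server installation; the file shipped with the template may
well differ.

```shell
# Scratch directory standing in for the folder that holds the base Dockerfile.
ctx=$(mktemp -d)

# Write a minimal, hypothetical supervisord.conf (not the template's own file).
cat > "$ctx/supervisord.conf" <<'EOF'
[supervisord]
; stay in the foreground so that the container keeps running
nodaemon=true

[program:rserver]
; assumed install path of the RStudio Server binary
command=/usr/lib/rstudio-server/bin/rserver --server-daemonize 0
EOF

# Show what was written.
cat "$ctx/supervisord.conf"
```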

The Main Image


Organisation of the data

Ideally, files related to a project should be organised in a meaningful and structured manner. For
example, the project directory should contain a /data directory that is home to all the data used in
the experiment. The source code goes under the /scripts directory, and the results are stored
in the /output directory. With the files organised this way, locating them becomes easier and
consistency is ensured. Since all project-related files are going to be made available through an
RStudio session, I recommend you organise your files as described here, so that your readers know
where to look for code, datasets and the expected outputs. You would then need to push your project
directory to a version control repository such as GitHub, Bitbucket or SVN.
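
The layout described above can be sketched with a few shell commands. The project name
sample_project_repo and the figures and results sub-directories are taken from the Dockerfile
listings in this chapter; the scratch location and the .gitkeep placeholder files (a common
convention, not a git feature) are assumptions for illustration.

```shell
# Scratch location for the demonstration; use your own path in practice.
project="$(mktemp -d)/sample_project_repo"

# Inputs, code and results each get their own directory.
mkdir -p "$project/data" "$project/scripts" \
         "$project/output/figures" "$project/output/results"

# Git does not track empty directories, so drop a placeholder file in each.
for d in data scripts output/figures output/results; do
    touch "$project/$d/.gitkeep"
done

# Show the resulting layout.
find "$project" -type d | sort
```

From inside the project directory, the usual git init, git add and git commit steps would then put
this skeleton under version control.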

The Dockerfile
Once the project files are pushed to a git repository, we can start writing our main Dockerfile.
As discussed earlier, the main Docker image that runs the analysis requires a base image on which all
the required software and its dependencies are installed. The main Dockerfile, therefore, begins
with a FROM instruction that imports the base image discussed in the earlier section.

FROM dockerhub-repo/docker-rstudio-base
MAINTAINER name <surname.name@domain.edu.au>

If the project directories (i.e. /scripts, /data, /output) are stored in a git repository, we can
instruct Docker to clone the repository through the RUN instruction. This is possible because git is
installed on the base image, which we have imported here. To save disk space, we can
then make symbolic links from the cloned repository to the home directory of the user account that
was set up in the base image Dockerfile, rather than copying all the cloned files across. Please note
that the symbolic links are made in the home directory of the user, because by default an RStudio
session uses the home directory as its working directory.

FROM dockerhub-repo/docker-rstudio-base
MAINTAINER name <surname.name@domain.edu.au>
RUN git clone https://username@bitbucket.org/username/sample_project_repo.git
# change 'user' below to the user account name set up in the base image
RUN ln -s /sample_project_repo/output /home/user/output
RUN ln -s /sample_project_repo/scripts /home/user/scripts
RUN ln -s /sample_project_repo/data /home/user/data
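
The effect of the ln -s instructions can be tried outside Docker. The sketch below mimics the two
locations in a scratch directory; the file example.csv and its contents are made up for
illustration.

```shell
# Scratch directory standing in for the root of the image's file system.
scratch=$(mktemp -d)
mkdir -p "$scratch/sample_project_repo/data" "$scratch/home/user"

# A made-up data file inside the "cloned repository".
echo "gene,value" > "$scratch/sample_project_repo/data/example.csv"

# Link the data directory into the user's home instead of copying it.
ln -s "$scratch/sample_project_repo/data" "$scratch/home/user/data"

# The file is reachable through the link, yet it is stored only once.
cat "$scratch/home/user/data/example.csv"
```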

Below is an illustration of how we would like project files to appear in the RStudio Server session
that Docker is going to launch.

Data Organisation in Docker-based RStudio session

The next step is to enable write permissions on the /output directory, and on any other
(sub)directory to which you wish to write files. This enables saving the main and/or supplementary
results, as well as the intermediate data. The RUN instruction can be used to apply chown and chmod
settings. chown -R makes the user ‘user’ the owner of all files in the directory, recursively (-R),
and chmod 700 gives that owner read, write and execute access. These settings should be applied to
all directories or sub-directories of interest.

FROM dockerhub-repo/docker-rstudio-base
MAINTAINER name <surname.name@domain.edu.au>
RUN git clone https://username@bitbucket.org/username/sample_project_repo.git
# change 'user' below to the user account name set up in the base image
RUN ln -s /sample_project_repo/output /home/user/output
RUN ln -s /sample_project_repo/scripts /home/user/scripts
RUN ln -s /sample_project_repo/data /home/user/data
RUN chown -R user:user /home/user/output/figures
RUN chmod 700 /home/user/output/figures
RUN chown -R user:user /home/user/output/results
RUN chmod 700 /home/user/output/results
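
What chmod 700 does can be checked on any Unix machine. The sketch below works in a scratch
directory, so nothing real is modified. Only chmod is demonstrated, since changing ownership with
chown to another user normally requires root privileges.

```shell
# Work in a scratch directory so that no real files are touched.
demo=$(mktemp -d)
mkdir -p "$demo/output/figures"

# 700 = read, write and execute for the owner, nothing for group or others.
chmod 700 "$demo/output/figures"

# Print the permission string of the directory itself.
ls -ld "$demo/output/figures" | cut -c1-10
```

The last command prints drwx------: a directory (d) that its owner can read, write and enter (rwx),
with no access for anyone else.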

Any data analysis in R might depend on CRAN or Bioconductor packages. Such dependencies can be
installed for each study in its main Dockerfile through the RUN instruction. The following are
examples of package installation from popular R package repositories. Note the use of the Rscript
executable to run the R functions source and install.packages from the command line.

FROM dockerhub-repo/docker-rstudio-base
MAINTAINER name <surname.name@domain.edu.au>
RUN git clone https://username@bitbucket.org/username/sample_project_repo.git
# change 'user' below to the user account name set up in the base image
RUN ln -s /sample_project_repo/output /home/user/output
RUN ln -s /sample_project_repo/scripts /home/user/scripts
RUN ln -s /sample_project_repo/data /home/user/data
RUN chown -R user:user /home/user/output/figures
RUN chmod 700 /home/user/output/figures
RUN chown -R user:user /home/user/output/results
RUN chmod 700 /home/user/output/results
# install CRAN packages
RUN (Rscript -e 'install.packages(c("dplyr","XML","ggplot2"), repos="http://cran.rstudio.com/")')
# install Bioconductor packages
RUN (Rscript -e 'source("http://bioconductor.org/biocLite.R"); biocLite(c("limma"))')

A nice way of running the analysis is to write a general script called
‘generate_all_experiments.R’ that goes over all the scripts in the study. This script can be sourced
by the reader once the RStudio session has started, to re-generate all the results in the experiment.

FROM dockerhub-repo/docker-rstudio-base
MAINTAINER name <surname.name@domain.edu.au>
RUN git clone https://username@bitbucket.org/username/sample_project_repo.git
# change 'user' below to the user account name set up in the base image
RUN ln -s /sample_project_repo/output /home/user/output
RUN ln -s /sample_project_repo/scripts /home/user/scripts
RUN ln -s /sample_project_repo/data /home/user/data
RUN chown -R user:user /home/user/output/figures
RUN chmod 700 /home/user/output/figures
RUN chown -R user:user /home/user/output/results
RUN chmod 700 /home/user/output/results
RUN (Rscript -e 'install.packages(c("dplyr","hexbin","colorRamps","survival","XML","ggplot2","matrixStats"), repos="http://cran.rstudio.com/")')
RUN (Rscript -e 'source("http://bioconductor.org/biocLite.R"); biocLite(c("limma","GSVA", "sva","org.Hs.eg.db"))')
WORKDIR /sample_project_repo
RUN mv generate_all_experiments.R /home/user

Note that all your scripts should be run from within the /scripts directory, and all the results
should be written to the /output directory using a path relative to the /scripts directory.
Similarly, the path to the data should be relative to the /scripts directory. So, be careful when
setting paths in your scripts. In fact, the reason why I change the working directory to the cloned
repository is that the paths in all my scripts are relative to the /scripts directory (i.e. it is
assumed that the current working directory is the /scripts directory). Also note that
‘generate_all_experiments.R’ is in the same git repository as /data, /output and /scripts.

Building Docker Images

Manual build with docker build


To create the actual image and publish it to a public repository such as DockerHub for sharing, the
Dockerfile needs to be built, either manually through the docker build command from the command
line/terminal, or through an automated build process that will be discussed later.
To manually build an image from a Dockerfile (here the main Dockerfile), enter at the
terminal:

docker build -t your-dockerHub-repo/image-repo . # note the single dot

where the -t flag tags the image with a name, here repository ‘image-repo’ under the DockerHub
account ‘your-dockerHub-repo’. The ‘.’ at the end of the statement means that Docker should look in
the current working directory for the Dockerfile, and for any other files/scripts (for example, a
script in your working directory that you import into the image using ADD or COPY). If you intend to
publish your image, then I would recommend you create an account with DockerHub, and make a
repository for your image. You can then push the image that was just built to DockerHub using:

docker push your-dockerHub-repo/image-repo
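
The build and push steps can be collected in a small script. The sketch below is a dry run by
default: it only prints the commands, and runs them when DRY_RUN=0 is set. The run helper and the
DRY_RUN variable are conveniences invented here, and the docker login step is an addition, on the
assumption that pushing requires being logged in to DockerHub.

```shell
# Image name in the form account/repository (placeholder values from the text).
image="your-dockerHub-repo/image-repo"

# Print each command; execute it as well only when DRY_RUN=0 is set.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

run docker login                  # authenticate against DockerHub
run docker build -t "$image" .    # build from the Dockerfile in this directory
run docker push "$image"          # publish the image
```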

Build automation with DockerHub


DockerHub provides a build automation system, where every push to a public repository that
contains the Dockerfile, such as one on GitHub or Bitbucket, triggers an image build. To create an
automated build:

• log in to your DockerHub account
• from your Dashboard, click ‘create’ then choose ‘create automated build’
• select GitHub or Bitbucket (depending on where you have stored the Dockerfile)
• select the repository that contains the Dockerfile

Running Docker Images


To run an image, use docker run:

docker run -p 49000:8787 -d your-dockerHub-repo/image-repo

You would then navigate in your web browser to http://0.0.0.0:49000, and this opens up RStudio
Server.
The -p flag in the run command tells Docker to map port 49000 on your machine to port 8787⁷ in the
container, so that requests to port 49000 (i.e. when http://0.0.0.0:49000 is entered in the browser)
are forwarded to RStudio Server. The -d flag tells Docker to run the container in the background
(detached mode). The final argument names the image to run: the ‘image-repo’ repository under the
user account ‘your-dockerHub-repo’ (this is the way that Docker addresses an image on DockerHub).
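
The order of the two port numbers is easy to get wrong, so the sketch below spells it out with named
variables and prints the resulting command instead of running it. The variable names are
illustrative, and the localhost address assumes that the browser runs on the same machine as Docker.

```shell
host_port=49000        # port opened on your own machine
container_port=8787    # port RStudio Server listens on inside the container
image="your-dockerHub-repo/image-repo"

# -p takes HOST:CONTAINER, so the host port comes first.
mapping="$host_port:$container_port"

# -d starts the container in the background (detached).
echo "docker run -d -p $mapping $image"
echo "then browse to http://localhost:$host_port"
```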
So let’s recap how the main Dockerfile can start an instance of RStudio Server. When the image is
built from the main Dockerfile, the FROM instruction imports the base image. The resulting image:

1. has Ubuntu, R, RStudio Server, Git and other dependencies installed
2. contains a clone/download of the project directory, linked symbolically into the home directory
of the user
3. has write permissions set up
4. contains any scripts copied/added from the current working directory
5. when run, listens on port 8787 (the default port of RStudio Server), as declared by
the EXPOSE instruction in the base Dockerfile
6. opens up RStudio Server when http://0.0.0.0:49000 is entered in the browser.

Docker commands cheatsheet


In this section, I will introduce some commonly used Docker commands.
To list all Docker images on your machine enter (in the terminal/command line):

docker images

To see all running containers enter:

docker ps

To remove a Docker container enter its name or ID, as listed by docker ps -a:

docker rm container-name-or-ID

Note that sometimes you get an error that a container can’t be removed, because it is running. In
that case you would need to stop the container first.
⁷By default RStudio Server runs on port 8787. see https://support.rstudio.com/hc/en-us/articles/200552306-Getting-Started

docker stop container-name-or-ID

Sometimes when you attempt to run an image using docker run, you get an error that the
connection can’t be made because the port is already allocated. This happens when another running
container is already using the port to which you have been trying to connect. In such
cases you need to stop the running containers to make the port available, using the following
command:

docker stop $(docker ps -aq)
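
Stopping every container is a blunt tool. A more targeted variant, sketched below, stops only the
containers that publish a given host port. The function name stop_port is made up here, and the
sketch assumes a Docker release whose docker ps command supports the publish filter.

```shell
# Stop only the running containers that publish the given host port.
stop_port() {
    ids=$(docker ps -q --filter "publish=$1")
    if [ -n "$ids" ]; then
        docker stop $ids   # unquoted on purpose: one argument per container ID
    fi
}

# Usage (requires Docker): free port 49000 before starting a new container.
# stop_port 49000
```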

Also note that when you attempt to run an image through the docker run command, Docker looks for
the image on your machine. If it can’t find an image with that name, it tries to pull the latest
version of the image from the (DockerHub) repository. However, if a new version of an image is
made available and you already have a version of that image on your machine, Docker runs
the existing version. If you wish to get the new version, you need to remove the
existing image with docker rmi image-name, and then run the docker run command.
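
As an alternative to removing the image first, docker pull fetches the newest version from the
repository directly. The sketch below wraps the two steps in a function; the name refresh_and_run
is made up for illustration, and the port numbers are the ones used earlier in this chapter.

```shell
image="your-dockerHub-repo/image-repo"

# Fetch the newest version of the image, then start a container from it.
refresh_and_run() {
    docker pull "$1" && docker run -p 49000:8787 -d "$1"
}

# Usage (requires Docker):
# refresh_and_run "$image"
```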
