Professional Documents
Culture Documents
Tools For Reproducible Research
Tools For Reproducible Research
Soroor Hediyeh-Zadeh
This book is for sale at http://leanpub.com/toolsforreproducibleresearch
This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing
process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools and
many iterations to get reader feedback, pivot until you have the right book and build traction once
you do.
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
shazdeh garden
Reproducible Research with Docker
Docker³ is a minimal virtual machine. A virtual machine is a lightweight computing environment on
which one can install and run software without installation of the software on user’s machine/ OS
system. For example, you can program a Docker container to run R version 3.0, whereas you might
have R version 3.2.0 installed on your own computer. In other words, with Docker you can run
multiple computing environments on one machine in parallel without disrupting the functionality
of one or the other. In the context of research reproducibility, the data, source codes and any
dependencies are programmed into containers. Docker connects these containers to produce the final
computing environment, or the Docker image. Docker images can be very large in size. However,
they can be easily run and shared. For this reason, Docker is becoming a popular technology for
software development.
In this section I’ll provide complete instructions to create a Docker image that initiates an instance
of RStudio Server. The data and source codes will be mounted on this instance of RStudio Server
by programming Docker, therefore, enabling to reproduce the results. Programming of Docker
containers is done by writing Dockerfiles. A Dockerfile is a simple text file (without any suffix)
that contains a set of Docker-specific instructions. Like any other technologies Docker has it’s own
markup language, but generally the language is very similar to unix command-lines. Please note
that basic knowledge of unix command lines is essential in building Docker images. I’ll first start by
a minimal tutorial on Docker markup language.
1 $ cat Dockerfile
2
3
4 FROM ubuntu:14.04
5 RUN mkdir scripts
6 WORKDIR /scripts
The FROM instruction tells docker to import Ubuntu image version 14.04 from ubuntu repository on
DockerHub ⁴. This, in fact, installs Ubuntu on our Docker image. The RUN instruction tells Docker
³https://docs.docker.com/
⁴DockerHub is a repository for sharing Docker images. See https://hub.docker.com
Reproducible Research with Docker 2
that the following command is a unix command and should be treated accordingly. So Docker runs
mkdir scripts on Ubuntu and creates the scripts/ directory. The WORKDIR instruction is similar cd
in unix, and changes the current working directory to the scripts/ directory that was just created
in the previous step.
Docker instructions ADD or COPY can be used to add or copy a file from your local directory to the
Docker container. CMD will run the program/executable specified by the following command.
1 FROM r-base
2 COPY . /usr/local/src/myscripts
3 WORKDIR /usr/local/src/myscripts
4 CMD ["Rscript", "myscript.R"]
In the above example, the latest version of base-r image is imported. COPY makes a copy of the
contents of the current directory to the /usr/local/src/myscripts directory on Docker. WORKDIR
changes the directory to myscripts and CMD runs myscript.R using the command-line executable
Rscripts. There are a couple more Docker instructions that you will see in this chapter: EXPOSE
and MAINTAINER. The EXPOSE instruction informs Docker that the container listens on the specified
network ports at runtime. The MAINTAINER instruction tells who has made the image. For more details
on Docker instructions see Best practices for writing Dockerfiles⁵
Writing Dockerfiles
As discussed, Docker provides a null computing environment that can be programmed and
customised through Docker instructions. Therefore, in order to make Docker run a RStudio session,
we would need to have an OS such as Ubuntu installed to have a programmable environment. We
can then install R, RStudio Server and any other dependencies. We can also install executables such
as Git or wget, if we are going to download from a link or clone the data from a version control
repository, such as GitHub. In addition, we can setup user accounts and tell Docker to listen for
network ports at the runtime. A Dockerfile containing such instructions would be the “base image”
to our main Docker image which runs the actual analysis. To build the main Docker image, we would
need another Dockerfile. This Dockerfile would import the base image (i.e. an Ubuntu machine with
R and RStudio Server etc. installed), download the data and source codes and run the analysis.
Writing such Dockerfiles from scratch could be time consuming. I am, therefore, going to go over
the principals by providing some templates that have already been developed for this purpose. The
template Dockerfiles are available on GitHub⁶
1 FROM ubuntu:14.04
2 MAINTAINER Name Surname <surname.name@domain.edu.au>
3 RUN echo "deb http://cran.ms.unimelb.edu.au/bin/linux/ubuntu trusty/" >> /etc/ap\
4 t/sources.list
5 RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
6 RUN apt-get update
7 RUN apt-get install -y software-properties-common libxml2-dev
8 RUN add-apt-repository ppa:marutter/rdev
9 RUN apt-get update
10 RUN apt-get upgrade -y
11 RUN apt-get install -y -q r-base r-base-dev gdebi-core libapparmor1 supervisor w\
12 get git
13 RUN (wget https://download2.rstudio.org/rstudio-server-0.99.902-amd64.deb && gde\
14 bi -n rstudio-server-0.99.902-amd64.deb)
15 RUN rm /rstudio-server-0.99.902-amd64.deb
16 RUN (adduser --disabled-password --gecos "" user && echo "user:user"|chpasswd)
17 RUN mkdir -p /var/log/supervisor
18 ADD supervisord.conf /etc/supervisor/conf.d/supervisord.conf
19 EXPOSE 8787
20 CMD ["/usr/bin/supervisord"]
Here, the FROM instruction in line 1 imports Ubuntu version 14.04 image from ubuntu repository (in
DockerHub). This installs Ubuntu on our null computing environment. The MAINTAINER instruction
specifies the name of the author(s) of the image. In line 3 the RUN instruction is used to run a
command. This command adds the path to download the latest version of R from a particular CRAN
minor repository to the relevant search path in ubuntu. This ensures that at each build run the latest
R version - that was used by the authors at the time of the study- is installed. Note that the R version
will be identical to the one used in the original study, and will not change with running the image.
However, if the image is re-built, then the instruction tells Docker to check for the latest version in
the specified CRAN minor and install the latest version. The path can be replaced with the path to
your desired CRAN minor repository.
lines 4-9 install required dependencies to install R on a ubuntu machine. apt-get is a unix command
for installing software. Line 10 is a nice example of how required programs can be installed. Here,
line 10 is instructing Docker to install programs such as base R, R-Devel (development version of
R), gdebi-core (required for installing RStudio Server), wget ( download from a link), git (for version
control) and supervisor (for setting up users accounts). Line 11 downloads and installs RStudio Server
version 0.99.902 for 64bit. As new versions of R become available, new RStudio Server versions
become available, too. So, make sure you replace the RStudio download link with the latest link
(visit RStudio website for more information), if you’re using the template. Once the RStudio Server
is installed, the source download file is not required any more, and can be removed (line 12). The RUN
instruction in Line 13 instructs Docker to set up a user account , named “user”, and sets a password
for this account (user:user). Docker then creates a directory called supervisor/ under /var/log
Reproducible Research with Docker 4
directory in ubuntu, using mkdir command for creating directories. The ADD instruction in line 15 is
used to add the supervisord.conf script from the current local directory to the directory that was
just created. This executable is available with the template on GitHub repository. Line 18 uses EXPOSE
instruction to tell the docker to listen for a port on the network to be called at the runtime. This is
the port that initiates an instance of RStudio Server. Once the image is run from the command-
line/terminal, Docker would wait for the user to call this port from the browser. It would then run
the image. In the final line, the executable supervisor is run, which is responsible to create a login
page.
login page
Ideally, files related to a project should be organised in a meaningful and structured manner. For
example, there should be a /data directory that is home to all the data used in the experiment in
Reproducible Research with Docker 5
the project directory. The source codes go under the /scripts directory, and the results are stored
in the/output directory. Having data organised this way, locating files becomes easier and ensures
consistency. Since all project-related files are going to be made available through a RStudio session,
I recommend you organise your files as described here, so that your readers would know where to
look to find codes, datasets and the expected outputs. You would then need to push your project
directory to a version control repository such as Github, BitBucket or SVN.
The Dockerfile
Once project files are pushed to a git repository, we can start with writing up our main Dockerfile.
As discussed earlier, the main Docker image that runs the analysis requires a base image on which all
the required softwares and their dependencies are installed. The main Dockerfile, therefore, begins
with a FROM instruction that imports the base image that was discussed in the earlier section.
1 FROM dockerhub-repo/docker-rstudio-base
2 MAINTAINER name <surname.name@domain.edu.au>
If project directories (i.e /scripts, /data, /output) are stored on a git repository, we can instruct
Docker to clone the repository through the RUN instruction. This is possible because we have installed
git on the base image, and we have imported the base image here. To save some memory, we could
then make symbolic links from the cloned repository to the home directory of the user account that
was set in the base image Dockerfile, rather than copying all the cloned files across. Please note that
the symbolic link is made to the home directory of the user, because by default RStudio session sets
the home directory as the working directory.
1 FROM dockerhub-repo/docker-rstudio-base
2 MAINTAINER name <surname.name@domain.edu.au>
3 RUN git clone https://username@bitbucket.org/username/sample_project_repo.git
4 RUN ln -s /sample_project_repo/output /home/user/output # change 'user' here \
5 to the user account name setup in base image
6 RUN ln -s /sample_project_repo/scripts /home/user/scripts
7 RUN ln -s /sample_project_repo/data /home/user/data
Below is an illustration of how we would like project files to appear in the RStudio Server session
that Docker is going to launch.
Reproducible Research with Docker 6
The next step is to enbale write permissions to the /output directory, and any other (sub)directory
to which you wish to write files. This enables saving the main and/or supplementary results, as well
as the intermediate data. The RUN instruction can be used to set chown and chmode settings. chown
-R provides the user ‘user’ access to all files in the directory recursively (-R) , and chmod 700 gives
write access to the ‘user’. These setting should be enabled for all directories or sub-directories of
interest.
1 FROM dockerhub-repo/docker-rstudio-base
2 MAINTAINER name <surname.name@domain.edu.au>
3 RUN git clone https://username@bitbucket.org/username/sample_project_repo.git
4 RUN ln -s /sample_project_repo/output /home/user/output # change 'user' here \
5 to the user account name setup in base image
6 RUN ln -s /sample_project_repo/scripts /home/user/scripts
7 RUN ln -s /sample_project_repo/data /home/user/data
8 RUN chown -R user:user /home/user/output/figures
9 RUN chmod 700 /home/user/output/figures
10 RUN chown -R user:user /home/user/output/results
11 RUN chmod 700 /home/user/output/results
Any data analysis in R might depend on CRAN or Bioconductor packages. Such dependencies can be
installed for each study on its main Dockerfile through the RUN instruction. Followings are examples
of package installation from popular R package repositories. Note use of Rscript executable to run
R functions source and install.packages in command-line.
Reproducible Research with Docker 7
1 FROM dockerhub-repo/docker-rstudio-base
2 MAINTAINER name <surname.name@domain.edu.au>
3 RUN git clone https://username@bitbucket.org/username/sample_project_repo.git
4 RUN ln -s /sample_project_repo/output /home/user/output # change 'user' here \
5 to the user account name setup in base image
6 RUN ln -s /sample_project_repo/scripts /home/user/scripts
7 RUN ln -s /sample_project_repo/data /home/user/data
8 RUN chown -R user:user /home/user/output/figures
9 RUN chmod 700 /home/user/output/figures
10 RUN chown -R user:user /home/user/output/results
11 RUN chmod 700 /home/user/output/results
12 RUN (Rscript -e 'install.packages(c("dplyr","XML","ggplot2"), repos="http://cran\
13 .rstudio.com/")') # install CRAN packages
14 RUN (Rscript -e 'source("http://bioconductor.org/biocLite.R"); biocLite(c("limma\
15 "))') # install Bioconductor packages
A nice way of running the analysis would be to write a general script called ‘generate_all_exper-
iments.R’ that would go over all the scripts in the study. This script can be sourced by the reader
once the RStudio session has started, to re-generate all the results in the experiment.
1 FROM dockerhub-repo/docker-rstudio-base
2 MAINTAINER name <surname.name@domain.edu.au>
3 RUN git clone https://username@bitbucket.org/username/sample_project_repo.git
4 RUN ln -s /sample_project_repo/output /home/user/output # change 'user' here \
5 to the user account name setup in base image
6 RUN ln -s /sample_project_repo/scripts /home/user/scripts
7 RUN ln -s /sample_project_repo/data /home/user/data
8 RUN chown -R user:user /home/user/output/figures
9 RUN chmod 700 /home/user/output/figures
10 RUN chown -R user:user /home/user/output/results
11 RUN chmod 700 /home/user/output/results
12 RUN (Rscript -e 'install.packages(c("dplyr","hexbin","colorRamps","survival","XM\
13 L","ggplot2","matrixStats"), repos="http://cran.rstudio.com/")')
14 RUN (Rscript -e 'source("http://bioconductor.org/biocLite.R"); biocLite(c("limma\
15 ","GSVA", "sva","org.Hs.eg.db"))')
16 WORKDIR /sample_project_repo
17 RUN mv generate_all_experiments.R /home/user
Note that all your scripts should be run from within the /scripts directory, and all the results should
be written to the /output directory relative to the /scripts directory. Similarly, the path to the data
should be relative to the /scripts directory So, be careful when setting paths in your scripts. In
fact, the reason why I change the working directory to the cloned repository is that the path in all
Reproducible Research with Docker 8
my scripts is relative to the /scripts directory (i.e. it is assumed that the current working directory
is the /scripts directory). Also note that ‘generate_all_experiments.R’ is in the with the same git
repository as /data, /output and /scripts.
where -t flag tells Docker that this image belongs to user ‘your-dockerHub-repo’ and repository
‘image-repo’. The ‘.’ at the end of the statement means that Docker should look the current working
directory for the Dockerfile, and any other files/scripts (for example, if you have a script in your
working directory that you would import to the image using ADD or COPY). If you intend to publish
your image, then I would recommend you create an account with DockerHub, and make a repository
for your image. You can then push the image that was just built to DockerHub using:
You would then navigate in your web browser to http://0.0.0.0:49000, and this opens up RStudio
Server.
The -p flag in the run command tells docker that it should connect to port 8787 ⁷ when port 49000 is
called at the runtime ( i.e when http://0.0.0.0:49000 is called at the browser). The flag -d tells docker
that the image is stored in the user account ‘your-dockerHub-repo’ under ‘image-repo’ repository
(this is the way that docker addresses an image on DockerHub).
So lets recap how the main Dockerfile can initiate an instance of RStudio Server. On running the
docker run command on the image built from the main Dockerfile, the FROM instruction imports
the base image. The base image:
1 docker images
1 docker ps
1 docker rm your-dockerHub-repo/image-repo
Note that sometimes you get an error that a container can’t be removed, because it is running. In
that case you would need to stop the container first.
⁷By default RStudio Server runs on port 8787. see https://support.rstudio.com/hc/en-us/articles/200552306-Getting-Started
Reproducible Research with Docker 10
Sometimes when you attempt to run an image using docker run, you get an error that the
connection can’t be made because the port is already allocated. This happens as results of a running
container/image running at the same port with which you having been trying to connect. In such
cases you would need to stop all running processes to make the port available using the following
command
Also note that when you attempt to run an image through docker run command, Docker looks for
the image on your machine. If it can’t find an image with that name, then it tries to push the latest
version of the image from the (DockerHub) repository. However, if a new version of an image is
made available and you have already a version of that image on your machine, Docker would run
the existing version of the image. If you wish to get the new version, you would need to remove the
existing image with docker rmi image-name, and the run the docker run command.