
Performance evaluation of MEGADOCK protein–protein interaction prediction system implemented with distributed containers on a cloud computing environment

Kento Aoyama1,2, Yuki Yamamoto1, Masahito Ohue1, and Yutaka Akiyama1
1 Department of Computer Science, School of Computing, Tokyo Institute of Technology, Tokyo, Japan
2 AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL), National Institute of Advanced Industrial Science and Technology, Ibaraki, Japan
Email: {aoyama, y_yamamoto}@bi.c.titech.ac.jp, {ohue, akiyama}@c.titech.ac.jp

Abstract— Container-based virtualization, a lightweight virtualization technology, has begun to be introduced into large-scale parallel computing environments. In the bioinformatics field, where various dependent libraries and software tools need to be combined, container technology, which isolates the software environment and enables rapid distribution in an immediately executable format, is expected to bring many benefits. In this study, we employed Docker, an implementation of Linux containers, to build a distributed computing environment for our protein–protein interaction prediction system, MEGADOCK, on virtual machine instances in the Microsoft Azure cloud, and evaluated its parallel performance. Both when MEGADOCK ran directly on the virtual machines and when it ran in Docker containers on the virtual machines, the execution speed was almost equal even as the number of worker cores was increased to approximately 500. By standardizing portable and executable software environments, container techniques can contribute substantially to the productivity and reproducibility of scientific research.

Keywords: container-based virtualization, cloud computing, MPI, Docker, MEGADOCK, protein–protein interaction (PPI)

1. Introduction
In the field of bioinformatics and computational biology, a wide variety of software is used in research activities, and managing software environments, including dependent software libraries, is one of the most challenging issues in computational research. As a solution to this growing complexity, container-based virtualization, a lightweight virtualization approach with excellent performance, is increasingly being adopted [1], [2]. In genome research in particular, pipeline systems composed of multiple pieces of software are common and tend to complicate the environment; case studies have therefore been reported on environment management and distributed processing using container-based virtualization [3], [4].
In container-based virtualization, a software execution environment, including dependent libraries and executable binaries, is isolated as a container, so software can be distributed immediately in an executable form [5], [6]. This eases the management and distribution of software environments, and thus the introduction of new software. It has also been reported that container-based virtualization performs better than the hypervisor-based virtualization used to implement common virtual machines (VMs) and, when properly configured, performs almost as well as a physical machine [7]. Container-based virtualization has matured in areas such as dynamic load balancing on parallel distributed platforms in cloud environments because it enables rapid environment building and application abstraction [8]. Its adoption at research institutes and universities has been slower owing to concerns about performance degradation from virtualization, but excellent benchmark results on parallel computing environments and application case reports are providing a tailwind.
Recently, container-based virtualization has begun to be adopted in supercomputing environments. For example, the National Energy Research Scientific Computing Center (NERSC) in the United States, which operates the large-scale supercomputer Cori, developed Shifter [9], an open-source container system for high-performance computing, and has reported its use in various application studies [10]. In addition, another container implementation, Singularity [11], is available on the TSUBAME 3.0 supercomputer of Tokyo Institute of Technology and on the AI Bridging Cloud Infrastructure (ABCI) of the National Institute of Advanced Industrial Science and Technology (AIST), both of which are first-tier supercomputing environments in Japan [12]. Support for container-based virtualization is therefore an urgent issue in the bioinformatics and computational biology fields as well.


Fig. 1: Overview of virtualization technologies: hypervisor-based (left) and container-based virtualization (right)

In this study, we focus on MEGADOCK [13], [14], a protein–protein interaction (PPI) prediction software that can predict PPIs between various proteins by parallel computing, as an example of bioinformatics software. We introduced distributed processing into MEGADOCK using Docker containers [1], an implementation of container-based virtualization, and evaluated its computational performance on the Microsoft Azure public cloud [15] by comparing it with a simple parallel implementation based on the message passing interface (MPI) [16].

2. Overview of container-based virtualization
There are two major virtualization approaches for applications running in a cloud environment: hypervisor-based and container-based.

2.1 Hypervisor-based virtualization
In hypervisor-based virtualization, the virtual environment is provided by a higher-level "hypervisor" that manages the operating system (the supervisor), which in turn manages the applications (Fig. 1, left). The "virtual machine" (VM) widely used in general cloud environments is provided by hypervisor-based virtualization, which lets users run various guest operating systems, such as Windows and Linux, under a hypervisor running on the host OS or directly on the hardware.
There are various implementations of hypervisor-based virtualization, such as the Kernel-based Virtual Machine (KVM) [17], Hyper-V [18] used in Microsoft Azure, Xen [19] used in Amazon Web Services, and VMware [20].

2.2 Container-based virtualization
In container-based virtualization, containers are realized by isolating the namespaces of user processes running on the host OS (Fig. 1, right). This virtualization is mainly implemented with namespaces [5], a Linux kernel feature. Namespaces can isolate user processes from the global namespaces into individual ones, and provide separate namespaces for mount points, processes, users, networks, hostnames, and so on. Users therefore see an isolated application environment that is separate from the host environment. Container-based virtualization is sometimes called kernel-sharing virtualization because all containers running on the same host use the same kernel.
According to a previous study, the performance overheads of containers are smaller than those of VMs in various respects because resource management in containers is under the direct control of the host kernel [7]. Moreover, container images tend to be smaller than VM images, which is a significant advantage for application deployment.

2.3 Docker
Docker [1] is the most popular set of tools and platform for managing, deploying, and sharing Linux containers. It is open-source software hosted on GitHub, maintained by the Moby project [21], written in Go, and developed by contributors worldwide. Several related toolsets and services form the Docker ecosystem, such as Docker Hub [6], the largest container image registry service for exchanging user-developed container images, and Docker Machine [22], which provides container environments on Windows and macOS by combining Docker with a hypervisor-based approach.

2.3.1 Sharing container images via Docker Hub
A container image can include all the dependencies necessary to execute the target application: code, runtime, system tools, system libraries, and configurations. It therefore allows us to reproduce the same application environment in a container exactly as it was built and to deploy it onto machines with different specifications. Users can easily share their own application environments by uploading (pushing) container images to Docker Hub [6] and downloading (pulling) the same images onto different machines (Fig. 2).

Fig. 2: Sharing a Docker container image via Docker Hub (an image is built from a Dockerfile, pushed to Docker Hub, then pulled and run on machines with different host OSs)

2.3.2 Filesystem of a Docker container image
Docker adopts a layered file system for container images to reduce their total size. Every image layer corresponds to a batch of user operations or a differential change to the file system in the container, and each layer has a hashed ID (Fig. 3). This layered file system is well suited to reproducing operations, rolling back, and reusing layers that share the same hash ID, all of which contribute to the reproducibility of applications and of research. Note, however, that a layered file system does not perform well in terms of file I/O latency because of its differential layer management, so data I/O is usually performed directly through a mount point where the target data are stored and attached to the container.

Fig. 3: A layered filesystem in a Docker container image built from a Dockerfile (the figure shows the image layers of megadock:latest, e.g., FROM ubuntu:18.04 and RUN apt/wget/make steps, each with a layer ID and size, and an example invocation: > docker run -it megadock:latest megadock -R data/1gcq_r.pdb …)
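As an illustration of the push/pull workflow of Sec. 2.3.1 and of bypassing the layered file system for data I/O as described in Sec. 2.3.2, the following is a minimal command-line sketch; the Docker Hub account name example, the image tag, and the host path /mnt/ssd/data are hypothetical placeholders rather than the exact names used in this study.

    # Build an image from a Dockerfile and share it via Docker Hub
    docker build -t example/megadock:latest .    # build from ./Dockerfile
    docker push example/megadock:latest          # upload (push) to the registry
    docker pull example/megadock:latest          # download (pull) on another machine

    # Run the container with a host directory bind-mounted so that input/output
    # files bypass the layered file system of the image
    docker run --rm -v /mnt/ssd/data:/data example/megadock:latest \
        megadock -R /data/receptor.pdb -L /data/ligand.pdb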


3. MEGADOCK
MEGADOCK is PPI prediction software for large-scale parallel computing environments, developed by Ohue et al. [13], [14]. It supports MPI, OpenMP, and GPU parallelization and has achieved massively parallel computation on TSUBAME 2.5/3.0, the K computer, and other systems. An MPI-parallel implementation on the Microsoft Azure [15] public cloud (MEGADOCK-Azure [16]) and the predicted-PPI database MEGADOCK-Web [23] have been developed to promote the use of the software in more general environments.

3.1 MEGADOCK-Azure [16]
MEGADOCK-Azure has three main functions on Microsoft Azure VMs: client, resource group management, and task distribution. Fig. 4 shows the system architecture of the parallel processing infrastructure built on Microsoft Azure VMs with MEGADOCK-Azure.

Fig. 4: System architecture of MEGADOCK-Azure [16] (a client uses the Azure CLI and SSH to deploy VMs and submit jobs to a resource group containing a virtual network, a master VM, worker VMs, and storage; container images are built from a Dockerfile and exchanged via Docker Hub)

The client function uses the Azure command line interface (Azure CLI) to allocate the necessary compute resources on Microsoft Azure, including VMs and virtual networks. It also transfers input data (e.g., protein 3-D coordinate (pdb) files) and results to and from a representative VM (the master VM).
The resource group management function provides control over VMs, virtual networks, storage, and so on. These resources are allocated to the same resource group on Microsoft Azure so that they can communicate with each other.
The task distribution function appropriately allocates the PPI prediction calculation for each protein pair to the secured VMs. One VM is assigned as the master and the rest as workers; the master dispatches MEGADOCK calculations to the workers by specifying pdb files, and the calculations are distributed in a master-worker model. Because MEGADOCK calculations can be performed independently for each protein pair, the software uses hybrid parallelization: thread-level parallelization over the rotational angles of a protein pair and process-level parallelization over different protein pairs [24]. The master-worker model is implemented with the MPIDP framework [24], [25], in which the master and the workers communicate over MPI and the workers do not communicate with each other. Each worker process executes its calculation with multiple threads using OpenMP (OpenMP and CUDA when GPUs are used).
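To make the hybrid MPI/OpenMP parallelization concrete, the following is a minimal launch sketch; the megadock-dp binary name, the -tb docking-table option, the host file, and the process count are assumptions based on our reading of the MEGADOCK distribution, not the exact configuration used in this paper.

    # Hypothetical hybrid MPI/OpenMP launch of a master-worker docking run:
    # one MPIDP master process plus worker processes, 4 OpenMP threads each.
    export OMP_NUM_THREADS=4
    mpirun -np 8 --hostfile hosts.txt -x OMP_NUM_THREADS \
        megadock-dp -tb docking_jobs.table   # table listing receptor/ligand pdb pairs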


3.2 MEGADOCK with container-based virtualization
The portability of software-dependent environments, with their deployment and management problems, and the improvement of execution performance remain pressing issues. Using a cloud computing environment, as in the VM-based MEGADOCK-Azure, is one solution, but concerns remain about the difficulty of reusing existing local computing resources, vendor lock-in, and the performance overhead of hypervisor-based virtualization.
In this study, we address these problems with Docker containers. Introducing container techniques into MEGADOCK has the following advantages:
• Docker containers can run on almost any environment, across various cloud computing infrastructures as well as on local environments, using the same container image.
• The container-based virtualization approach generally outperforms the hypervisor-based approach in both execution and deployment.
• Compatible container environments [9] are available on several high-performance computing (HPC) systems, such as the TSUBAME 3.0 and ABCI supercomputers, so a container can even serve as a model for standard application packaging in HPC environments.
To maintain compatibility between different environments, we implemented the MEGADOCK system using Docker containers running on VM instances of Microsoft Azure, following the MEGADOCK-Azure architecture. The system overview is shown in Fig. 5.
The Docker container image used for the system is generated from a build recipe (Dockerfile) and runs the same MEGADOCK calculation as the VM images. Containers on different VM instances are connected over an overlay network using Docker's networking functions and can communicate with each other through the MPI library. We can thereby run MEGADOCK docking calculations on the cloud environment using containers, and the system remains compatible with other environments.
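As a rough illustration of how such an MPI-capable overlay network can be set up with Docker's built-in swarm-mode networking, the commands below show one possible sequence on current Docker releases; the network name megadock-net, the image tag, and the container name are hypothetical, and the exact flags available differ between Docker versions (this study used Docker 1.12).

    # On the master VM: enable swarm mode and create an overlay network
    docker swarm init
    docker network create --driver overlay --attachable megadock-net

    # On each worker VM: join the swarm (token and address are printed by `docker swarm init`)
    docker swarm join --token <worker-token> <master-ip>:2377

    # Start one MEGADOCK container per VM attached to the overlay network,
    # so MPI processes in different containers can reach each other by name
    docker run -d --name worker1 --network megadock-net example/megadock:latest sleep infinity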
4. Performance evaluation
We present two experiments that evaluate the parallel performance of the container-distributed MEGADOCK system built on Docker features.

4.1 Experiment 1. Parallel performance on a cloud environment
First, we measured the execution time and parallel speed-up of the distributed MEGADOCK system while varying the number of worker processor cores of the VM instances on Microsoft Azure, under the master-worker model of the MPIDP framework.

4.1.1 Experimental setup
We selected Standard_D14_v2, a high-end VM instance on Microsoft Azure. The specifications of the instance are listed in Table 1, and the software environment is shown in Table 2.

Table 1: Experiment 1. Specifications of Standard_D14_v2
  CPU        Intel Xeon E5-2673, 2.40 [GHz] × 16 [cores]
  Memory     112 [GB]
  Local SSD  800 [GB]

Table 2: Experiment 1. Software environment
                  Virtual Machine                Docker
  OS (image)      SUSE Linux Enterprise Server   library/ubuntu
  Version (tag)   12                             14.04
  Linux Kernel    3.12.43                        N/A
  GCC             4.8.3                          4.8.4
  FFTW            3.3.4                          3.3.5
  OpenMPI         1.10.2                         1.6.5
  Docker Engine   1.12.6                         N/A

Fig. 5: Overview of the overlay network using Docker networking (Standard_D14_v2 instances connected by an overlay network in Swarm mode; each instance runs one Docker container whose MPIDP master/worker processes communicate over MPI and write to the local SSD)

The measured values were obtained with the time command, and the median of three runs was taken. To avoid slow data transfer between nodes, all output of the docking calculations was written to the local SSD attached to each VM instance. In the Docker container case, to avoid unnecessary performance degradation due to the container's layered file system, all output files were stored on a data volume on the local SSD mounted inside the container.
For the MPI and OpenMP configuration, the number of processes was chosen so that every node ran four processes, and the number of threads was fixed at four in all cases (OMP_NUM_THREADS=4). In the container case, each node ran one container for the MEGADOCK calculation, which performed the same computation as when running directly on a VM.

4.1.2 Dataset
Protein hetero-dimer complex structures from the protein–protein docking benchmark version 1.0 [26] were used for the performance evaluation. All 59 hetero-dimer protein complexes were used, and all-to-all combinations of the binding partners (59 × 59 = 3,481 pairs) were calculated to predict their possible PPIs.

4.1.3 Experiment result
The execution times of MEGADOCK running in Docker containers on the VM instances and of MEGADOCK running directly on the VM instances (MEGADOCK-Azure) are shown in Fig. 6. Each bar shows the execution time for a given number of VM instances, and the error bars show the standard deviation over the measurements.
Fig. 7 shows the strong-scaling behavior for the same results. The label "ideal" indicates ideal linear scaling with the number of worker cores.
According to the scalability results, both configurations achieved good speed-up up to 476 worker cores: ×35.5 when running directly on the VMs and ×36.6 when running in Docker containers on the VMs.
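One plausible accounting for the 476-worker-core figure, which we note as an assumption rather than something stated explicitly above: with 30 Standard_D14_v2 instances, four MPI processes per node, four OpenMP threads per process, and one process serving as the MPIDP master rather than as a worker, the worker-core count works out to

    (30 VMs × 4 processes − 1 master) × 4 threads = 119 × 4 = 476 worker cores.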





Fig. 6: Experiment 1: Execution time comparison between MEGADOCK running directly on VMs and in Docker containers (y-axis: time [sec]; x-axis: number of VMs, 16 cores per VM; series: VM, Docker on VM)

Fig. 7: Experiment 1: Strong-scaling performance of MEGADOCK (relative to VM = 1) on the benchmark dataset (y-axis: speed-up; x-axis: number of worker cores; series: Ideal, VM, Docker on VM)

The speed-ups were almost equivalent in this experiment, indicating that the performance overhead of the Docker containers is small when running on VM instances in the cloud environment.
The MEGADOCK workload consists mainly of independent 3-D fast Fourier transform (FFT) convolutions on each node, even in the MPI version, so it tends to be compute-intensive rather than data-I/O- or network-intensive. Therefore, as in the published Linux container performance profiles [7], MEGADOCK also performs well in a distributed container environment.

4.2 Experiment 2. Execution performance on a GPU-attached bare-metal node
Additionally, to investigate the overhead of container-based virtualization itself, we measured the execution time of MEGADOCK on a local node under several conditions and compared the results.

4.2.1 Experimental setup
We used a bare-metal (not virtualized) GPU-attached node for this experiment. Its specifications are listed in Table 3, and the software and environmental settings are shown in Table 4.

Table 3: Experiment 2. Specifications of the physical machine
  CPU        Intel Xeon E5-1630, 3.7 [GHz] × 8 [cores]
  Memory     32 [GB]
  Local SSD  128 [GB]
  GPU        NVIDIA Tesla K40

Table 4: Experiment 2. Software and environmental settings
                  bare-metal   Docker           Docker (GPU)
  OS (image)      CentOS       library/ubuntu   nvidia/cuda
  Version (tag)   7.2.1511     14.04            8.0-devel
  Linux Kernel    3.10.0       N/A              N/A
  GCC             4.8.5        4.8.4            4.8.4
  FFTW            3.3.5        3.3.5            3.3.5
  OpenMPI         1.10.0       1.6.5            N/A
  Docker Engine   1.12.3       N/A              N/A
  NVCC            8.0.44       N/A              8.0.44
  NVIDIA Docker   1.0.0 rc.3   N/A              N/A
  NVIDIA Driver   367.48       N/A              367.48

We measured the execution time of the MEGADOCK calculation for 100 pairs of pdb data on this physical machine under each of the following conditions (Fig. 8; a command-level sketch follows the list):
(a) MEGADOCK using the MPI library,
(b) MEGADOCK using the MPI library in a container,
(c) MEGADOCK using the GPU, and
(d) MEGADOCK using the GPU in a container.
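A minimal sketch of how conditions (a) through (d) might be invoked; the binary names (megadock-dp, megadock-gpu), options, paths, and image tags are illustrative assumptions (the paper itself only documents megadock -R …), and nvidia-docker refers to the NVIDIA Docker 1.x wrapper listed in Table 4.

    # (a) MPI version directly on the bare-metal host (4 processes x 4 threads)
    export OMP_NUM_THREADS=4
    mpirun -np 4 megadock-dp -tb docking_jobs.table

    # (b) MPI version inside a single Docker container (no inter-node communication)
    docker run --rm -v /data:/data example/megadock:latest \
        mpirun -np 4 -x OMP_NUM_THREADS=4 megadock-dp -tb /data/docking_jobs.table

    # (c) GPU version directly on the host
    megadock-gpu -R /data/receptor.pdb -L /data/ligand.pdb

    # (d) GPU version inside a container, with the GPU exposed via NVIDIA Docker 1.x
    nvidia-docker run --rm -v /data:/data example/megadock-gpu:latest \
        megadock-gpu -R /data/receptor.pdb -L /data/ligand.pdb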


Fig. 8: Experiment 2: Overview of experimental conditions: (a) MEGADOCK using the MPI library on the bare-metal machine, (b) MEGADOCK using the MPI library in a container, (c) MEGADOCK using the GPU on the bare-metal machine, and (d) MEGADOCK using the GPU in a container

We used NVIDIA Docker [27] to make the NVIDIA GPU device available inside the Docker container. All result data of the MEGADOCK calculations were written to network-attached storage (NAS) on the same network in all cases, and the "volume" option was used to mount the path to the NAS when running in Docker.
Execution times were measured with the time command from the start to the end of each run, and the median of six repeated runs was selected.
As in Experiment 1, the number of Docker containers per node was one, the number of processes was four, and the number of threads per process was fixed at four.

4.2.2 Dataset
PPI predictions were performed for 100 pairs of pdb data randomly selected from the KEGG pathway database [28].

4.2.3 Experiment result

Fig. 9: Experiment 2: Execution time comparison between MEGADOCK running on bare-metal and in a Docker container (y-axis: time [sec]; conditions: CPU (MPI) and GPU, each for bare-metal and Docker)

Fig. 9 shows the execution time of the MEGADOCK calculation for the 100 pairs under each condition. The MEGADOCK calculation using the MPI library in the Docker container (b) was approximately 6.3% slower than the same calculation using the MPI library in the bare-metal environment (a). Note that this experiment was performed on a single bare-metal node, so the communication cost was kept sufficiently small because there was no inter-node MPI communication.
On the other hand, the MEGADOCK calculation using the GPU in the Docker container (d) performed almost the same as the calculation using the GPU in the bare-metal environment (c). This means that MEGADOCK-GPU, which does not use the MPI library, runs at the same execution speed even inside a Docker container.

5. Discussion
In Experiment 1, the execution on a single VM instance with MEGADOCK-Azure (VM) took an unexpectedly long time for reasons that remain unknown, and this affected the scalability results: the anomalously slow baseline inflates the measured scalability beyond what would otherwise be expected. The cause should be investigated, although doing so is difficult because the calculation is time-consuming.
We did not attempt multi-node GPU parallelization with the MPI library in this study; however, VM instances equipped with NVIDIA GPU devices are now generally available. Moreover, Microsoft Azure offers more sophisticated VM instances for HPC applications that are interconnected by InfiniBand and support low-latency communication via remote direct memory access (RDMA). We have already performed experiments across multiple VM instances with GPUs [16] and with HPC instances; further experiments with Docker containers on them are our future work.
Additionally, an alternative approach to task distribution in MEGADOCK is possible. In this study we used the MPIDP framework, which relies on the MPI library for dynamic task distribution over multiple nodes, because it is built into MEGADOCK; it could, however, be replaced by another framework such as MapReduce. Moreover, the current implementation lacks the ability to recover from unpredictable failures of MPI processes, containers, VM instances, or hardware, so a more fault-tolerant framework with automatic recovery and redundant execution should be introduced. We are considering container orchestration frameworks such as Kubernetes [29] and Apache Mesos [30] to address these issues.


6. Conclusion
We implemented the protein–protein interaction prediction system MEGADOCK using Docker containers and Docker's networking functions on VM instances of Microsoft Azure, and confirmed through a protein docking benchmark that its performance is almost equivalent to the same calculation performed directly on the VM instances. Both when MEGADOCK ran directly on the virtual machines and when it ran in Docker containers on the virtual machines, the execution speed was almost equal even when the number of worker cores was increased to approximately 500.
In the second experiment, we ran the MPI and GPU versions of MEGADOCK on a single bare-metal machine, both inside a Docker container and directly on the host, to investigate the performance overhead of container virtualization. The MPI version showed a small degradation of approximately 6.3% in the container compared with bare-metal, whereas the GPU version performed almost equally in the container and on bare-metal.
Containers let us isolate software dependencies and system software stacks, which is a great advantage when sharing software packages across platforms and makes it easy to distribute the latest research achievements. Virtualization technologies evolved in the context of general cloud computing; today, however, many research institutions have introduced container environments into their HPC infrastructures. To improve productivity and retain scientific reproducibility, it is necessary to bring such software engineering techniques into research activities.

Code Availability
The entire source code of MEGADOCK is open source on GitHub. The repository also includes a build recipe (Dockerfile) for building Docker container images that run the PPI prediction calculations in various environments.
https://github.com/akiyamalab/MEGADOCK
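As a hedged illustration of how the published build recipe might be used (assuming the Dockerfile sits where the repository's README indicates; the image tag below follows the megadock:latest naming used in Fig. 3):

    # Fetch the source and build a MEGADOCK container image from the bundled Dockerfile
    git clone https://github.com/akiyamalab/MEGADOCK.git
    cd MEGADOCK
    docker build -t megadock:latest .   # adjust the Dockerfile path per the repository README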
Acknowledgment
This work was partially supported by KAKENHI (Grant Nos. 17H01814 and 18K18149) from the Japan Society for the Promotion of Science (JSPS); the Program for Building Regional Innovation Ecosystems "Program to Industrialize an Innovative Middle Molecule Drug Discovery Flow through Fusion of Computational Drug Design and Chemical Synthesis Technology" from the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT); the Research Complex Program "Wellbeing Research Campus: Creating new values through technological and social innovation" from JST; Microsoft Business Investment Funding from Microsoft Corp.; and Leave a Nest Co., Ltd.

References
[1] "Docker." https://www.docker.com/, [Accessed May 8, 2019].
[2] S. Graber, "LXC - Linux containers." https://linuxcontainers.org/, [Accessed May 8, 2019].
[3] P. Di Tommaso, E. Palumbo, M. Chatzou, et al., "The impact of Docker containers on the performance of genomic pipelines," PeerJ, vol. 3, no. e1273, 2015.
[4] P. Di Tommaso, A. B. Ramirez, et al., "Benchmark Report: Univa Grid Engine, Nextflow, and Docker for running Genomic Analysis Workflows," Univa White Paper, 2016.
[5] E. W. Biederman, "Multiple Instances of the Global Linux Namespaces," in Proc. 2006 Ottawa Linux Symp., vol. 1, pp. 101–112, 2006.
[6] "Docker Hub." https://hub.docker.com/, [Accessed May 8, 2019].
[7] W. Felter, A. Ferreira, R. Rajamony, J. Rubio, "An updated performance comparison of virtual machines and Linux containers," in Proc. IEEE ISPASS 2015, pp. 171–172, 2015.
[8] L. M. Vaquero, L. Rodero-Merino, R. Buyya, "Dynamically scaling applications in the cloud," ACM SIGCOMM Computer Communication Review, vol. 41(1), pp. 45–52, 2011.
[9] W. Bhimji, S. Canon, D. Jacobsen, et al., "Shifter: Containers for HPC," J. Phys. Conf. Ser., vol. 898, no. 082021, 2017.
[10] D. Bard, "Using containers and supercomputers to solve the mysteries of the Universe," DockerCon16, 2016.
[11] G. M. Kurtzer, V. Sochat, M. W. Bauer, "Singularity: Scientific containers for mobility of compute," PLoS One, vol. 12(5), pp. 1–20, 2017.
[12] "TOP500 Lists." https://www.top500.org/, [Accessed May 8, 2019].
[13] M. Ohue, T. Shimoda, S. Suzuki, et al., "MEGADOCK 4.0: An ultra-high-performance protein–protein docking software for heterogeneous supercomputers," Bioinformatics, vol. 30(22), pp. 3281–3283, 2014.
[14] M. Ohue, Y. Matsuzaki, N. Uchikoga, et al., "MEGADOCK: An all-to-all protein–protein interaction prediction system using tertiary structure data," Protein Pept. Lett., vol. 21(8), pp. 766–778, 2014.
[15] "Microsoft Azure." https://azure.microsoft.com/, [Accessed May 8, 2019].
[16] M. Ohue, Y. Yamamoto, Y. Akiyama, "Parallel computing of protein–protein interaction prediction system MEGADOCK on Microsoft Azure," IPSJ Tech. Rep., vol. 2017-BIO-49, no. 4, 2017.
[17] A. Kivity, U. Lublin, A. Liguori, et al., "kvm: the Linux virtual machine monitor," in Proc. Linux Symp., vol. 1, pp. 225–230, 2007.
[18] A. Velte, T. Velte, "Microsoft Virtualization with Hyper-V," 2010.
[19] "Xen Project." https://www.xen.org, [Accessed May 8, 2019].
[20] "VMware - Virtualization Overview." https://www.vmware.com/pdf/virtualization.pdf, [Accessed May 8, 2019].
[21] "Moby project." https://mobyproject.org, [Accessed May 8, 2019].
[22] "Docker Machine." https://docs.docker.com/machine/, [Accessed May 8, 2019].
[23] T. Hayashi, Y. Matsuzaki, K. Yanagisawa, et al., "MEGADOCK-Web: an integrated database of high-throughput structure-based protein–protein interaction predictions," BMC Bioinform., vol. 19(Suppl 4), no. 62, 2018.
[24] Y. Matsuzaki, N. Uchikoga, M. Ohue, et al., "MEGADOCK 3.0: a high-performance protein–protein interaction prediction software using hybrid parallel computing for petascale supercomputing environments," Source Code Biol. Med., vol. 8(1), no. 18, 2013.
[25] M. Kakuta, S. Suzuki, K. Izawa, et al., "A massively parallel sequence similarity search for metagenomic sequencing data," Int. J. Mol. Sci., vol. 18(10), no. 2124, 2017.
[26] R. Chen, J. Mintseris, J. Janin, Z. Weng, "A protein–protein docking benchmark," Proteins, vol. 52(1), pp. 88–91, 2003.
[27] "NVIDIA - nvidia-docker." https://github.com/NVIDIA/nvidia-docker, [Accessed May 8, 2019].
[28] M. Kanehisa, M. Furumichi, M. Tanabe, Y. Sato, K. Morishima, "KEGG: new perspectives on genomes, pathways, diseases and drugs," Nucl. Acids Res., vol. 45(D1), pp. D353–D361, 2017.
[29] B. Burns, B. Grant, D. Oppenheimer, E. Brewer, J. Wilkes, "Borg, Omega, and Kubernetes," ACM Queue, vol. 14, no. 1, 2016.
[30] B. Hindman, A. Konwinski, M. Zaharia, et al., "Mesos: A platform for fine-grained resource sharing in the data center," in Proc. 8th USENIX Conference on Networked Systems Design and Implementation (NSDI'11), pp. 295–308, 2011.
