Performant Container Support for HPC Applications
ABSTRACT
The demand for the ability to easily customize, reproduce and migrate applications and workflows has been steadily increasing among the HPC community as software environments and applications grow in complexity. Lightweight containers that are suitable for HPC applications at scale are considered to be a viable approach to meet this demand. Previous studies have addressed the performance aspects of most existing containers using microbenchmarks and revealed the performance overheads of the best implementations to be small. However, the feasibility of providing containerized, real-world HPC applications on HPC systems, and the impact on overall application performance at scale, has not yet been explored. Here we present a basic feasibility and performance study using the Singularity container. We evaluate what is required to enable container images to utilize the high-speed fabric present on most HPC systems and explore their performance by comparing real-world applications run both within a container and in the absence of a container (natively). The results indicate lightweight Singularity images are a promising approach to the HPC community's demands for not only customizability, reproducibility and portability, but also performance.

CCS CONCEPTS
• General and reference → Performance; • Software and its engineering → Application specific development environments; Software performance; • Networks → Network performance analysis.

KEYWORDS
Singularity, HPC, container

ACM Reference Format:
Yinzhi Wang, R. Todd Evans, and Lei Huang. 2019. Performant Container Support for HPC Applications. In Practice and Experience in Advanced Research Computing (PEARC '19), July 28-August 1, 2019, Chicago, IL, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3332186.3332226

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
PEARC '19, July 28-August 1, 2019, Chicago, IL, USA
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-7227-5/19/07…$15.00
https://doi.org/10.1145/3332186.3332226

1 INTRODUCTION
Both the size and complexity of HPC clusters have been growing over the past decades to meet the increasing demand for computational power in the science community. Such growth is accompanied by the deployment of many-core processors as well as low-latency, high-bandwidth interconnect fabrics that require specialized hardware drivers to utilize. Effective utilization of the processors and fabric can extend the capability and capacity of scientific applications to execute workloads at extreme scales. The increasing complexity of the software environments these applications are developed in has led to a growing demand for more customizable HPC software environments to run them in. Although the necessity of interfacing with hardware drivers limits customizability, it is not obvious how significant these limitations are. This paper explores the trade-offs between customizability and performance when running HPC applications at scale.

The demand for customizability is effectively tackled by virtualization technologies, which have become prevalent due to their hardware independence, isolation, and security features. Combined with the concept of Grid Computing, virtualization technology established the new infrastructure known as cloud computing. Hypervisor-based virtualization solutions, such as Xen, VMware ESX/ESXi, and KVM, are commonly implemented in commercial cloud computing platforms. Their substantial performance overhead [8][13], however, has prevented the adoption of virtualization in the support of conventional HPC applications, because the hypervisor layer precludes processor-specific optimizations and utilization of high-speed fabrics.

Meanwhile, lightweight virtualization solutions such as containers that exclude the hypervisor layer have gained substantial traction in the HPC community. Container-based technology facilitating the distribution and deployment of applications prevails in the research communities that emphasize the reproducibility of both scientific findings and computational environments. Docker [1] is among the most popular ones. However, due to the root privilege required to execute Docker containers and the associated security concerns [6], it cannot be easily adopted in most HPC environments. Singularity [6] was created for scientific application driven workloads to meet the demands of both users and administrators in the HPC environment. Singularity shares most of the benefits of the Docker container while mitigating the security concerns. Because of these characteristics, the usage of Singularity, or similar container technologies, in the support of conventional HPC applications is likely to increase.

1.1 Motivation
Figure 1 compares the layers of software between an application running on a virtual machine (VM) and one running in a Singularity
container. While the VM obscures hardware resources from hosted applications with the interposition of the hypervisor, Singularity containers can expose the same hardware resources to container-hosted applications as native applications. There are a number of studies that have analyzed the performance overhead of Singularity, including its impact on disk I/O, memory, and network bandwidth in the HPC context [3][7][12]. The microbenchmarks show negligible difference between running with or without the container (natively) and indicate that Singularity could be a great candidate to support HPC users requiring increased customizability, reproducibility and portability. For an HPC center, this involves more than just enabling Singularity on the clusters as a module for the users, but also releasing a selected number of basic container images that can be customized to fulfill user needs and still run seamlessly with all the native hardware to achieve high performance. We establish how effective a solution Singularity is for combining customizability and performance in an HPC environment by answering the following questions:

• What are the basic components required for containers to access the host's hardware?
• Are there any limitations on compatibility between different Linux distributions on the container and the host?
• How much performance overhead on real-world scientific applications is introduced by the container?

Figure 1: Architecture of hypervisor-based virtual environments (left), and Singularity as a container-based virtual environment (right). The Singularity daemon running along with other applications on the host launches the container. The applications running within the Singularity container can access system root through path binding or overlay.

1.2 Contribution
To address these questions, we build two CentOS images, one of which has a compiler and MPI tools built into the container and the other which utilizes the host's compiler and MPI through Singularity's bind path mechanism. We also built a third container using the latest version of Ubuntu, with the host compiler and MPI exposed, for comparison. Performance of the three images is evaluated by running a set of commonly used scientific applications including WRF, MILC, NAMD, and GROMACS. The results show that these applications, run on the images we built, experience little to no measurable performance overhead compared to running the applications natively on the system. The results also show the presence of a small amount of constant overhead, due mostly to the startup of the Singularity daemon. Through the binding of system directories, all these images can utilize all the pre-built tools and applications natively on the system and manage them with environment module systems such as Lmod [9].

2 IMPLEMENTATION

2.1 Resources
All the experiments in this study run on the Stampede2 supercomputer at the Texas Advanced Computing Center. Stampede2 hosts 4,200 Knights Landing nodes and 1,736 Intel Xeon Skylake (SKX) nodes. We chose to use the SKX nodes to run all the tests to achieve better performance. Each of the SKX nodes has two Intel Xeon Platinum 8160 processors and a total of 192GB DDR4 memory. The interconnect of the system is a 100Gb/sec Intel Omni-Path (OPA) network.

2.2 Singularity Images

Table 1: The three Singularity images built for this study.

Image    | OS Version      | Intel Compilers and MPI | Lmod Module System
CentOS1  | CentOS 7.4.1708 | Built-in                | Yes
CentOS2  | CentOS 7.4.1708 | With bind path          | Yes
Ubuntu   | Ubuntu 18.10    | With bind path          | Yes

A total of three Singularity images are built (Table 1). The first one is based on CentOS Linux 7.4.1708 with the Intel compiler and MPI tools copied into the image during the build (CentOS1). The second one is built with the same version of CentOS but with only the base GNU Compiler Collection (GCC) compiler included with CentOS in the container (CentOS2). The third is built with Ubuntu 18.10 and has only Ubuntu's base GCC built-in. In addition to the basic tools and libraries, including proper versions of the rdma and psm2 libraries for these Linux distributions, we also have Lmod installed to the same directory as the one on Stampede2, along with an initializing script, so that the environment within the container is similar to the host's.

All three images are built in a CentOS Linux environment with root access on a personal computer. When running on Stampede2, in addition to the user writeable directories, /home1, /work, and /scratch, the staff-maintained applications in /opt/apps are all mounted with the bind paths defined in Singularity. Exposing the host's /opt/intel directory then determines whether the host's Intel compiler and MPI are usable within the container.

The CentOS distribution that was running on Stampede2 at the time of this study was identical to those in the CentOS containers, so the purpose of having the first two images is to minimize the discrepancy between the container and native system, thus all the
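The image builds described above can be expressed as a Singularity recipe. The following is a minimal, hypothetical sketch for the Ubuntu image; the package names, mount-point paths, and MODULEPATH are illustrative assumptions, not the paper's exact build:

```
Bootstrap: docker
From: ubuntu:18.10

%post
    # Userspace fabric libraries the host MPI stack expects
    # (illustrative package names)
    apt-get update
    apt-get install -y librdmacm1 libpsm2-2

    # Create mount points so host directories can be bound at run time
    mkdir -p /home1 /work /scratch /opt/apps /opt/intel

%environment
    # Mirror the host's Lmod setup (path is an assumption)
    export MODULEPATH=/opt/apps/modulefiles
```

At run time the host software trees would then be bound in, e.g. `singularity exec -B /opt/apps -B /opt/intel ubuntu.simg …`, which is what makes the host's Intel compiler and MPI visible inside the container.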
pre-built modules on Stampede2 are expected to run seamlessly when mounted within the container through bind paths. Having the Intel compilers and MPI built into one of the CentOS images is to further minimize potential issues due to differences in system libraries. The Ubuntu image then represents the situation where most major system libraries are different inside the container. This corresponds to the cases where the user might demand some new features from the latest GNU C Library that are not available locally on an HPC system. With each of these three images, users have the freedom to install packages using the yum or apt-get command when building their own customized container images.

2.3 Applications
We picked four of the top 10 applications run on the Stampede2 supercomputer, WRF [11], MILC [2], NAMD [10], and GROMACS [4], to benchmark the performance of Singularity images. These applications are from different scientific domains, use different algorithms, and should be representative of much of Stampede2's workload. Note that, to make the comparison effective, we have four versions of each application built and compiled within the three container environments and the native Stampede2 environment, and then run them all in the Stampede2 environment. All of the applications except for NAMD are compiled with the Intel Compiler using Intel MPI with the same optimization and vectorization options enabled to achieve the best possible performance. The NAMD version used here has known performance issues when compiled with newer versions of the Intel compiler and MPI. Given that NAMD is installed as a module on Stampede2, we instead compare its performance by loading the module within the container environment. The Intel Compiler and Intel MPI version is 18.0.2 if not specified otherwise in the following discussion.

The Weather Research and Forecasting (WRF) model is a numerical weather prediction application designed for atmospheric research and operational forecasting. It is based on a Eulerian solver using a third-order Runge-Kutta time-integration scheme coupled with a split, explicit second-order time integration scheme. The benchmark code MILC is used to study quantum chromodynamics, the theory of the strong interactions of subatomic physics. Its runtime is dominated by sparse matrix solver algorithms. NAMD is a classical molecular dynamics application that simulates the interaction between atoms. It integrates the forces on all atoms with the explicit, reversible, and symplectic Verlet algorithm to simulate the dynamic evolution of the system. GROMACS is a package to perform molecular dynamics using Newtonian equations of motion for systems with hundreds to millions of particles.

2.4 Benchmark Descriptions
• WRF Our WRF benchmark uses the 2.5 KM CONUS Benchmark dataset from http://www2.mmm.ucar.edu/wrf/WG2/bench/Bench_V3_20081028.htm. In this dataset, the domain is 2.5 km in horizontal resolution on a 1500 by 1200 grid with 35 vertical levels and a time step of 15 seconds. We run the WRF benchmark on 1, 2, 4, 8, and 16 nodes, each with 4 MPI tasks per node and 12 OpenMP threads per task.
• MILC In this benchmark, the executable su3_rhmd_hisq evolves a 18x18x18x36 gauge configuration using a staggered-fermion rational hybrid Monte Carlo evolution code for 2 trajectories. The dataset is from https://portal.nersc.gov/project/m888/apex/MILC_lattices/. The MILC benchmark is run on 3, 6, 9, and 18 nodes with 48 MPI tasks per node.
• NAMD This benchmark simulates the 1M atom Satellite Tobacco Mosaic Virus. It is run on 1, 2, 4, 8, and 16 nodes with 4 MPI tasks per node, which is a more efficient configuration with multithreading enabled. The version of NAMD used is 2017_12_05, which is available as a module on Stampede2. It is built with the Intel Compiler and Intel MPI version 16.0.3.
• GROMACS In this benchmark, pure water solutions were simulated using GROMACS version 2018.3. The simulated systems consist of a total of 1.536 million atoms. The initial coordinates and simulation parameters were obtained from the GROMACS website at ftp://ftp.gromacs.org/pub/benchmarks/water_GMX50_bare.tar.gz. All simulations were performed in the isothermal-isobaric (NpT) ensemble at 300 K and 1 atm. Scaling behavior is determined using 1, 2, 4, 8, and 16 nodes with 48 MPI tasks per node.

All the benchmarks run on the aforementioned four different environments for comparison. Each individual test is run 3 times to take the average runtime.

3 RESULTS AND DISCUSSION

3.1 Container Usage in an HPC Environment
The two CentOS images build and run the applications flawlessly right out of the box. This is expected as both images have minimal differences from the native system. Building the Intel stack into the container image turns out to be unnecessary and redundant when a compiler on the host is available. This greatly improves the practicality of providing Intel-enabled images as products to users, because the Intel package itself is multiple gigabytes in size, and including it would significantly increase the size, and hence the disk-space requirements and upload/download times, of the container.

Due to the difference in the shared system libraries and headers, the mounted Intel compilers and MPI refuse to execute within the Ubuntu container. The compiler issue is specifically caused by the differences in the systems' math.h headers between the two Linux distributions. We fixed it by substituting Intel's math.h header for the system version within the Ubuntu container, achieved by defining __PURE_INTEL_C99_HEADERS__ when compiling. The Intel MPI runtime issue only occurs when the incompatible GNU C Library is loaded. We fixed it by applying a customized patch with LD_PRELOAD to replace the original strtok_r() function implemented in Ubuntu 18.10 with our modified implementation, shown as follows (the tail of the listing follows the standard glibc strtok_r tokenization):

#include <stdio.h>
#include <string.h>

char *strtok_r(char *s, const char *delim, char **save_ptr) {
    char *end;
    /* Modification: bail out early on a one-character remainder. */
    if ((s == NULL) && (*save_ptr != NULL)) {
        if ((*save_ptr)[1] == 0) {
            return NULL;
        }
    }
    if (s == NULL)
        s = *save_ptr;
    if (*s == '\0') {
        *save_ptr = s;
        return NULL;
    }
    /* Standard strtok_r logic from here on. */
    s += strspn(s, delim);          /* skip leading delimiters */
    if (*s == '\0') {
        *save_ptr = s;
        return NULL;
    }
    end = s + strcspn(s, delim);    /* find the end of the token */
    if (*end == '\0') {
        *save_ptr = end;
        return s;
    }
    *end = '\0';
    *save_ptr = end + 1;
    return s;
}
[Figure: Benchmark runtime (s) versus number of nodes (1, 2, 4, 8, and 16) for the Native, CentOS1, CentOS2, and Ubuntu environments; data labels give mean runtimes with upper and lower error bounds at each node count.]

[Figure 6: Per-application container overhead (s) for WRF, MILC, NAMD, and GROMACS under the CentOS1, CentOS2, and Ubuntu images.]
[…] performance difference of CentOS1 shown in the runtime where overhead is neglected.

The overhead of Ubuntu is comparable to CentOS2's in WRF and NAMD, but slightly less in MILC and GROMACS (Figure 6). Considering the image sizes for CentOS2 and Ubuntu are 309 MB and 212 MB respectively, such a difference is not likely solely from loading the images. We suspect the difference in Linux kernel and system libraries between the two images also contributes to the container's overhead. Such a small difference, however, is negligible given the comparable or even greater variance from noise and the orders of magnitude greater runtime for all the benchmark tests.

Overall, the container overhead stays constant with increasing node counts, so it would not deteriorate the scaling performance of HPC applications running within it. The overhead may vary among applications. However, as long as the container is lightweight in size, such overhead is harmless, especially compared to the usually much longer runtime.

4 CONCLUSION
Our benchmark results indicate it is viable to provide a selection of customizable images to HPC users, and still have their applications achieve optimal performance when running within the Singularity container. Container images with excessive size may harm the performance.

REFERENCES
[1] [n. d.]. Docker. https://www.docker.com
[2] [n. d.]. MIMD Lattice Computation (MILC) Collaboration Home Page. http://physics.indiana.edu/~sg/milc.html
[3] Carlos Arango, Rémy Dernat, and John Sanabria. 2017. Performance Evaluation of Container-based Virtualization for High Performance Computing Environments. arXiv:1709.10140 [cs] (Sept. 2017). http://arxiv.org/abs/1709.10140
[4] H. J. C. Berendsen, D. van der Spoel, and R. van Drunen. 1995. GROMACS: A message-passing parallel molecular dynamics implementation. Computer Physics Communications 91, 1 (Sept. 1995), 43–56. https://doi.org/10.1016/0010-4655(95)00042-E
[5] Todd Evans, William L. Barth, James C. Browne, Robert L. DeLeon, Thomas R. Furlani, Steven M. Gallo, Matthew D. Jones, and Abani K. Patra. 2014. Comprehensive Resource Use Monitoring for HPC Systems with TACC Stats. In Proceedings of the First International Workshop on HPC User Support Tools (HUST '14). IEEE Press, Piscataway, NJ, USA, 13–21. https://doi.org/10.1109/HUST.2014.7
[6] Gregory M. Kurtzer, Vanessa Sochat, and Michael W. Bauer. 2017. Singularity: Scientific containers for mobility of compute. PLOS ONE 12, 5 (May 2017), e0177459. https://doi.org/10.1371/journal.pone.0177459
[7] Emily Le and David Paz. 2017. Performance Analysis of Applications Using Singularity Container on SDSC Comet. In Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact (PEARC17). ACM, New York, NY, USA, 66:1–66:4. https://doi.org/10.1145/3093338.3106737
[8] Jiuxing Liu. 2010. Evaluating standard-based self-virtualizing devices: A performance study on 10 GbE NICs with SR-IOV support. In 2010 IEEE International Symposium on Parallel Distributed Processing (IPDPS). 1–12. https://doi.org/10.1109/IPDPS.2010.5470365
[9] Robert McLay, Karl W. Schulz, William L. Barth, and Tommy Minyard. 2011. Best Practices for the Deployment and Management of Production HPC Clusters. In State of the Practice Reports (SC '11). ACM, New York, NY, USA, Article 9, 11 pages. https://doi.org/10.1145/2063348.2063360
[10] James C. Phillips, Rosemary Braun, Wei Wang, James Gumbart, Emad Tajkhorshid, Elizabeth Villa, Christophe Chipot, Robert D. Skeel, Laxmikant Kalé, and Klaus Schulten. 2005. Scalable molecular dynamics with NAMD. Journal of Computational Chemistry 26, 16 (2005), 1781–1802. https://doi.org/10.1002/jcc.20289
[11] C. Skamarock, B. Klemp, Jimy Dudhia, O. Gill, Dale Barker, G. Duda, Xiang-yu Huang, Wei Wang, and G. Powers. 2008. A Description of the Advanced Research WRF Version 3. (2008). https://doi.org/10.5065/D68S4MVH
[12] Jonathan Sparks. 2017. HPC Containers in Use. Proceedings of the Cray User Group.
[13] M. G. Xavier, M. V. Neves, F. D. Rossi, T. C. Ferreto, T. Lange, and C. A. F. De Rose. 2013. Performance Evaluation of Container-Based Virtualization for High Performance Computing Environments. In 2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing. 233–240. https://doi.org/10.1109/PDP.2013.41