
Computer Physics Communications 299 (2024) 109135

Contents lists available at ScienceDirect

Computer Physics Communications


journal homepage: www.elsevier.com/locate/cpc

Computer Programs in Physics

Massively parallel implementation of iterative eigensolvers in large-scale plane-wave density functional theory ✩,✩✩
Junwei Feng, Lingyun Wan, Jielan Li, Shizhe Jiao, Xinhui Cui, Wei Hu ∗ , Jinlong Yang
School of Data Science, Key Laboratory of Precision and Intelligent Chemistry, and Anhui Center for Applied Mathematics, University of Science and Technology of
China, Hefei 230026, Anhui, China

A R T I C L E   I N F O

Keywords:
Kohn-Sham equation
Density functional theory
Iterative eigensolvers
Plane waves
Numerical algorithms

A B S T R A C T

The Kohn-Sham density functional theory (DFT) is a powerful method to describe the electronic structures of molecules and solids in condensed matter physics, computational chemistry and materials science. However, large and accurate DFT calculations within plane waves possess a cubic-scaling computational complexity, which is usually limited by expensive computation and communication costs. The rapid development of high performance computing (HPC) on leadership supercomputers brings new opportunities for developing plane-wave DFT calculations for large-scale systems. Here, we implement parallel iterative eigensolvers in large-scale plane-wave DFT calculations, including the Davidson, locally optimal block preconditioned conjugate gradient (LOBPCG), projected preconditioned conjugate gradient (PPCG) and Chebyshev subspace iteration (CheFSI) algorithms, and analyze the performance of these algorithms in massively parallel plane-wave computing tasks. We adopt a two-level parallelization strategy that combines the message passing interface (MPI) with open multi-processing (OpenMP) parallel programming to handle data exchange and matrix operations in the construction and diagonalization of the large-scale Hamiltonian matrix within plane waves. Numerical results illustrate that these iterative eigensolvers can scale up to 42,592 processing cores with a high peak performance of 30% on leadership supercomputers to study the electronic structures of bulk silicon systems containing 10,648 atoms.

Program summary
Program Title: Plane wave density functional theory (PWDFT)
CPC Library link to program files: https://doi.org/10.17632/c8v2mx5vn4.1
Developer’s repository link: https://bitbucket.org/berkeleylab/scales
Licensing provisions: BSD 3-clause
Programming language: C++
Nature of problem: PWDFT is used for electronic structure calculations based on Kohn-Sham density functional
theory. The key challenge to address is a constrained energy minimization problem, which can also be
formulated as a nonlinear eigenvalue problem. MPI/OpenMP-based approaches are employed to provide multi-
core acceleration for the study of the chemical and material properties of larger-scale molecules and solids.
Solution method: PWDFT implements self-consistent field (SCF) iterations and direct constrained minimization
algorithms with various acceleration strategies. It is written in C++ and offers parallel acceleration based on
MPI/OpenMP.


✩ The review of this paper was arranged by Prof. Weigel Martin.
✩✩ This paper and its associated computer program are available via the Computer Physics Communications homepage on ScienceDirect (http://www.sciencedirect.com/science/journal/00104655).
* Corresponding author.
E-mail address: whuustc@ustc.edu.cn (W. Hu).

https://doi.org/10.1016/j.cpc.2024.109135
Received 30 September 2023; Received in revised form 7 January 2024; Accepted 12 February 2024
Available online 22 February 2024
0010-4655/© 2024 Elsevier B.V. All rights reserved.

Table 1
Characteristics of several plane-wave DFT packages (Qbox, Quantum-Espresso, VASP, ABINIT and PWDFT), including the year of the reported test, programming language, test system, number of atoms, parallel scale (number of CPU cores) and machine.

Plane-wave DFT packages   Year   Language   System             Atoms   Scale            Machine
Qbox [18]                 2006   C++        Molybdenum         1k      8k CPU cores     BlueGene/L
Quantum-Espresso [19]     2017   Fortran    H2O                192     2.3k CPU cores   Edison
Quantum-Espresso [20]     2017   Fortran    Carbon nanotubes   1.5k    1k CPU cores     NEC SX-ACE
VASP [21]                 2017   Fortran    PdO4, Silicon      256     4k CPU cores     Cori
VASP [22]                 2019   Fortran    Graphene           11k     5k CPU cores     Cray XC-40
ABINIT [23]               2020   Fortran    Ga2O3              1.9k    5k CPU cores     Joliot-Curie
PWDFT [24]                2017   C++        Silicon            5k      8k CPU cores     Edison
PWDFT [25]                2021   C++        Silicon            5k      25k CPU cores    Cori
PWDFT (This work)         2024   C++        Silicon            10k     42k CPU cores    BSCC-T6

1. Introduction

The advancement of materials science has given rise to an increasing demand for understanding the intricate phenomena governing material properties and processes at the atomic level [1][2]. To address this, it is imperative to develop accurate and efficient techniques for solving the fundamental quantum-mechanical equations of complex many-body systems. A monumental leap in the application of quantum mechanics to challenging problems in chemistry is attributed to Kohn-Sham density functional theory (DFT) [3][4][5]. DFT's true strength lies in its superior cost-to-accuracy ratio when compared to electron-correlated wave-function-based methods like coupled cluster. As the most widely used electronic structure method, KS-DFT serves as a powerful tool for conducting first-principles calculations and finds diverse applications in materials science, solid state physics, medicine, and other fields. These applications encompass predicting quantum phenomena and designing new materials (e.g., photocatalysis, lithium-ion batteries).

It is noteworthy that despite being formulated 50 years ago, the significance and relevance of Kohn-Sham DFT persist to this day. Even though multi-electron problems are transformed into Kohn-Sham DFT, the time complexity remains O(N^3) (PW+LDA/GGA), and the computational power of a single processor core remains inadequate for modern electronic structure problems. Thankfully, high-performance computing (HPC) [6] on leadership supercomputers adeptly tackles this computationally intensive challenge. Among the major applications of HPC, one of the most important is DFT-based material simulation [7]. For example, on NERSC's HPC platforms, DFT-related applications occupy over 70% of the overall computing time for material simulations [8].

Traditionally, optimization involves dividing the DFT calculation task into multiple modules and assigning them to different CPU cores using data-level or task-level parallel methods. By employing parallel-optimized DFT software on high-performance computing clusters, or on tens of thousands of CPU cores in supercomputers, we can significantly accelerate the same computing task, enabling DFT to handle large-scale systems. However, designing efficient parallel codes presents a considerable challenge due to the diverse data structures and algorithms involved, as well as the frequent occurrence of inherently sequential control flow. Effectively utilizing a large number of processors becomes an intricate task. Despite these difficulties, several software packages have been successfully developed to effectively solve the KS equation, including SIESTA [9], CONQUEST [10], FHI-aims [11], BigDFT [12], HONPAS [13], LS3DF [14], RSDFT [15], DFT-FE [16], DGDFT [17], and more. These packages contribute significantly to advancing electronic structure calculations in diverse scientific applications.

In various DFT software packages, selecting suitable basis sets to expand the Kohn-Sham orbitals and effectively solve the eigenvalue problem of the Kohn-Sham equations on modern supercomputers is a fundamental challenge. The commonly used discrete basis sets include Gaussian-type basis sets (GTO) [26], numerical atomic basis sets (NAO) [27], plane-wave basis sets (PW) [28], and local real-space basis sets (RS) [29]. Among these, atomic basis sets are often employed for dealing with molecular systems, whereas PW basis sets are preferred for handling periodic systems.

Compared with the Gaussian basis set and the numerical atomic basis set, the plane-wave basis set yields denser matrices, requires less irregular memory access, is easy to vectorize, and can better exploit the floating-point performance of the computer. It is widely used in DFT codes such as Qbox [18], VASP [21], Quantum-Espresso [19], PWmat [30], ABINIT [23], and PWDFT [24]. By leveraging the advantages of plane-wave basis sets, these software packages significantly enhance the efficiency and capability of electronic structure calculations on powerful computing platforms. However, due to the global communication and memory requirements of plane waves, current plane-wave DFT software can only simulate systems with thousands of atoms, as shown in Table 1. The atomic basis set has the disadvantage that its irregular sparse matrix operations cannot fully exploit the floating-point performance of the computer. For this purpose, many new basis sets based on finite-element-like methods have been developed, such as PARSEC [31][32] (RS basis sets), DFT-FE [16] (RS-PW basis sets) and DGDFT [17] (DG-PW basis sets) mentioned above (the calculation scales of these basis sets can be found in the comparison table in this work), which are more suitable than plane waves for calculations of large systems. However, the construction of some finite-element basis sets relies on dividing the system into blocks and solving a plane-wave DFT problem on each block. Therefore, expanding the calculation scale of plane waves is also crucial to improving the calculation scale of these basis sets.

In the realm of plane waves, the most computationally demanding step is the diagonalization of the Hamiltonian [33]. This crucial step aims to find the smallest eigenvalues and corresponding eigenvectors of the Hamiltonian matrix, which correspond to the energies and orbital wave functions of the system's occupied states. Due to the high dimensionality of the Hamiltonian matrix in the plane-wave basis set, direct diagonalization methods like the QR and Jacobi algorithms become inefficient due to their slow computation and excessive memory usage, even after truncation. However, a promising aspect of DFT calculations is that obtaining all the eigenvalues and eigenvectors of the entire Hamiltonian matrix is not necessary. Instead, we only require a subset of the smallest eigenvalues and their corresponding eigenvectors, and these can be obtained through iterative diagonalization.

To enhance computational speed and accuracy, extensive research is dedicated to introducing more efficient iterative diagonalization methods. Among the most widely used methods are the typical Krylov [34] subspace methods, such as the Davidson [35], locally optimal block preconditioned conjugate gradient (LOBPCG) [36], and projected preconditioned conjugate gradient (PPCG) [37] algorithms. Additionally, improved power methods [38], like the Chebyshev subspace iteration (CheFSI) [39] algorithm, are also employed.


Table 2
Comparison of the characteristics of four different iterative eigensolvers in plane-wave DFT, including the Davidson, LOBPCG, PPCG and CheFSI algorithms.

Iterative eigensolvers   Subspace    Convergence   Efficiency   Memory
Davidson-9               Krylov      Very high     Very low     Very high
Davidson-3               Krylov      Medium        Medium       High
LOBPCG                   Krylov      Medium        Medium       High
PPCG                     Krylov      Medium        Very fast    Low
CheFSI                   Chebyshev   Low           Fast         Medium

By parallelizing diagonalization methods through high-performance techniques, DFT calculations can be significantly accelerated, enabling the scaling of simulated systems from tens or hundreds of atoms to tens of thousands of atoms. This enhancement empowers researchers to explore more extensive and complex systems, leading to new insights and breakthroughs in materials science and related disciplines.

In this work, we conduct a thorough comparison of several massively parallel matrix diagonalization algorithms for calculating the electronic structures of large-scale systems with tens of thousands of atoms by using the plane-wave DFT (PWDFT) [24][40] software. PWDFT is a sub-module of the discontinuous Galerkin density functional theory (DGDFT) package [41][42][43], is implemented in C/C++, and employs a multi-level parallel strategy to enable large-scale electronic structure calculations on supercomputers.

Compared to the large-scale eigenvalue method reported in ABINIT [44], which extends to 16,384 cores, we have addressed the parallelization issues associated with the Rayleigh-Ritz steps, resulting in a several-fold improvement in scalability. This allows us to handle parallel iterative diagonalization problems on over 40,000 processors.

In particular, we employ a combination of MPI (message passing interface) and OpenMP (open multi-processing) parallel methods to address the significant time and memory requirements involved in diagonalizing the Hamiltonian. Remarkably, our results demonstrate that the plane-wave density functional method can be effectively extended to tens of thousands of processors on modern heterogeneous supercomputers, enabling the simulation of systems composed of tens of thousands of silicon atoms. Furthermore, we perform a comparative analysis of the performance of the Davidson, LOBPCG, PPCG, and CheFSI diagonalization methods in massively parallel electronic structure calculations within the plane-wave framework. By thoroughly examining these algorithms, we gain valuable insights into their efficiency and suitability for large-scale electronic structure simulations.

2. Methodology

In this section, we delve into the implementation of Kohn-Sham density functional theory (DFT) [45][46] within the plane-wave framework. This set of theories ultimately leads to the solution of nonlinear eigenvalue problems. To tackle such eigenvalue problems, we have incorporated the Davidson, LOBPCG, PPCG, CheFSI, and other iterative eigensolvers into the density functional calculation program. Subsequently, we conduct a comprehensive comparison of the parallel efficiency and scalability of these algorithms for large-scale electronic structure calculations. The comparison encompasses crucial aspects such as convergence rates, memory requirements, and computational efficiency, all of which are outlined in Table 2.

2.1. Kohn-Sham density functional theory

The electronic structure problem of a physical system containing N_e electrons is a nonlinear eigenvalue problem of the form H ψ_i(r) = ε_i ψ_i(r), where i = 1, 2, ..., N_e and ε_i is the orbital energy of the i-th KS orbital, whose wave function corresponds to the eigenvector ψ_i. In the pseudopotential approximation, the Hamiltonian can be expanded in the form H = -(1/2)Δ + V_PS + V_H[ρ] + V_XC, where the first term is the kinetic energy of the system, which is given directly by the reciprocal lattice points. The second term V_PS is the pseudopotential operator, which describes the overall potential of the inner electrons and the nucleus. The third term V_H[ρ] is the Hartree potential operator, which describes the electron-electron interaction. The last term V_XC is the exchange-correlation potential, which contains the exchange and correlation potentials, V_XC = V_X + V_C, and describes the quantum many-body effects of the electrons. The accuracy of Kohn-Sham DFT strongly depends on the exchange-correlation functional, such as the local density approximation (LDA) [47], the generalized gradient approximation (GGA) [48], meta-GGA [49] and hybrid functionals (B3LYP [50] and HSE [51]).

When the KS equation is discretized with plane waves, the electronic structure problem is transformed into the eigenvalue problem H X = XΛ, where H is the Kohn-Sham Hamiltonian matrix and X ≡ (ψ_1, ψ_2, ..., ψ_{N_e}) collects the Kohn-Sham orbitals expanded in plane waves. The dimension of this matrix is N_r × N_e, where N_r is the number of reciprocal lattice points satisfying the truncation energy condition and N_e corresponds to the number of occupied orbitals. Λ ≡ diag(ε_1, ε_2, ..., ε_{N_e}) is the diagonal matrix of the corresponding energy eigenvalues, whose dimension is N_e × N_e. The SCF iterative process is shown in Fig. 1.

2.2. Plane-wave DFT

When dealing with periodic systems, we often use the plane-wave basis set

\psi_j(\mathbf{r}) = \frac{1}{\sqrt{\Omega}} \sum_{\mathbf{G}}^{N_g} \psi_{j,\mathbf{G}}\, e^{i\mathbf{G}\cdot\mathbf{r}}    (1)

where ψ_{j,G} are the Fourier coefficients, e^{iG·r} is the plane wave for a given wave vector G, and N_g is the number of reciprocal lattice points satisfying the truncation condition

|\mathbf{G}|^2 < 2E_{\mathrm{cut}}.    (2)

Under the plane-wave basis set, we can easily obtain the specific form of each term in the KS equation. For example, the kinetic energy term T_s is expressed as

T_s = \frac{1}{2}\sum_{i=1}^{N_e}\int d\mathbf{r}\, |\nabla\psi_i(\mathbf{r})|^2 = \frac{1}{2\Omega}\sum_{i=1}^{N_e}\sum_{\mathbf{G}}^{N_g} |\mathbf{G}|^2\, |\psi_{i,\mathbf{G}}|^2    (3)

and the Hartree term can be expressed as

E_H = \frac{1}{2}\int \mathcal{F}^{-1}\!\left[\frac{4\pi n_{\mathbf{G}}}{|\mathbf{G}|^2}\right] n(\mathbf{r})\, d\mathbf{r}    (4)

where

n(\mathbf{r}) = \frac{1}{\sqrt{\Omega}}\sum_{\mathbf{G}}^{N_g'} n_{\mathbf{G}}\, e^{i\mathbf{G}\cdot\mathbf{r}},    (5)

n_{\mathbf{G}} = \frac{1}{\sqrt{\Omega}}\int d\mathbf{r}\, n(\mathbf{r})\, e^{-i\mathbf{G}\cdot\mathbf{r}},    (6)

and F^{-1} is the inverse Fourier transform.


Fig. 1. Self-consistent field iterations in plane-wave density functional theory.

2.3. Iterative eigensolvers in plane-wave DFT

Under the plane-wave basis set, the dimension of the H matrix in Equation (3) is N_r × N_r, and its size is related to the truncation energy E_cut. It is worth mentioning that N_r is usually fairly large for a normalized plane-wave basis set (about 100 × 100 × 100 to 1,000 × 1,000 × 1,000), and appropriate iterative eigenvalue algorithms can effectively reduce the cost by keeping the Hamiltonian implicit in the computation.

These iterative diagonalization methods solve the linear eigenvalue problem of the form H X = XΛ by searching for the minimum of Tr[X^T H X].

In PWDFT, the basic linear algebra operations involved in the implementation of all methods, such as Gemm and Trsm, are provided by the BLAS and PBLAS libraries, the operations required for Fourier transforms are provided by FFTW, and more advanced linear algebra operations such as Syevd and Potrf are provided by LAPACK and ScaLAPACK.

2.3.1. Davidson

The classical Davidson [35] method searches for the minimum of Tr[X^T H X] in an orthogonal subspace S, and S keeps expanding as the number of iterations increases. After each iteration step, the new eigenfunctions are updated by

X ← S C    (7)

where C is obtained by solving the generalized eigenequation

S^T H S C = S^T S C Λ.    (8)

The new subspace is updated to

S ← [S, V]    (9)

at each iteration, where V is the preconditioned residual

V = T R = T (H X − X Λ)    (10)

and T is the Teter preconditioner.

The pseudocode implementation of the Davidson algorithm in PWDFT is shown in Algorithm 1.

Algorithm 1 Davidson method for solving the eigenvalue problem H x_i = λ_i x_i, i = 1, 2, ..., k in PWDFT.
Input: Hamiltonian H and initial wave functions {x_i}_{i=1}^k.
Output: Eigenvalues {λ_i}_{i=1}^k and wave functions {x_i}_{i=1}^k.
1: Initialize S_k = {s_i}_{i=1}^k and orthonormalize S_k.
2: while convergence not reached do
3:   Solve the generalized eigenequation S^T H S C = S^T S C Λ and get the coefficients C_k = {c_i}_{i=1}^k and eigenvalues Λ_k = {λ_i}_{i=1}^k
4:   Update the wave functions X_k ← S_k C_k
5:   Compute the residual vectors from the new wave functions: R_k ← H X_k − X_k Λ_k
6:   Compute the preconditioned residual V ← T R_k
7:   Update the subspace S ← [X_k, V]
8: end while
9: Update {x_i}_{i=1}^k ← X_k.

2.3.2. LOBPCG

LOBPCG [36] searches for the minimum of Tr[X^T H X] under the constraint X^T X = I in the third-order subspace [X, W, P] of width 3N_e. After each iteration step, the new eigenvectors are given by

X ← X C_X + W C_W + P C_P    (11)

where W is the residual multiplied by the preconditioner:

W = T R = T (H X − X (X^T H X)).    (12)

Here R is the residual. There are different ways to select preconditioners, and the purpose of applying a preconditioner is to make the algorithm converge faster. P denotes the conjugate direction. The coefficients C_X, C_W, C_P are obtained by solving for the eigenvectors of the generalized eigenequation

S^T H S C = S^T S C Λ    (13)

where S = [X, W, P] and C = [C_X, C_W, C_P]^T. The pseudocode implementation of the LOBPCG algorithm in PWDFT is shown in Algorithm 2.

Algorithm 2 LOBPCG method for solving the eigenvalue problem H x_i = λ_i x_i, i = 1, 2, ..., k in PWDFT.
Input: Hamiltonian H and initial wave functions {x_i}_{i=1}^k.
Output: Eigenvalues {λ_i}_{i=1}^k and wave functions {x_i}_{i=1}^k.
1: Initialize X = {x_i}_{i=1}^k and orthonormalize X.
2: while convergence not reached do
3:   Compute the residual vectors R ← H X − X (X^T H X)
4:   Select a suitable preconditioner T to compute the preconditioned residuals W ← T R
5:   Generate the subspace S ← [X, W, P]
6:   Solve the generalized eigenequation S^T H S C = S^T S C Λ to obtain the combination coefficients [C_X, C_W, C_P]^T of each vector in the next subspace
7:   Update the conjugate gradient direction P ← W C_W + P C_P
8:   Update the wave functions X ← X C_X + P
9: end while
10: Update {x_i}_{i=1}^k ← X.
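The projected problem S^T H S C = S^T S C Λ in Eqs. (8) and (13) is a dense generalized symmetric eigenproblem of modest size (3N_e for LOBPCG). The sketch below shows this Rayleigh-Ritz step in serial form using LAPACKE's dsygvd driver; the function name and matrix layout are illustrative only, and in PWDFT the corresponding step is carried out on distributed data with ScaLAPACK (Syevd/Potrf).

#include <lapacke.h>
#include <cstdio>
#include <vector>

// Minimal sketch of the Rayleigh-Ritz step of Eqs. (8)/(13):
// solve A C = B C Lambda with A = S^T H S and B = S^T S (both n x n,
// column-major, B symmetric positive definite, n = subspace dimension).
bool rayleigh_ritz(int n, std::vector<double>& A, std::vector<double>& B,
                   std::vector<double>& w)
{
    w.resize(n);
    // itype = 1: A*x = lambda*B*x; jobz = 'V': eigenvectors overwrite A.
    lapack_int info = LAPACKE_dsygvd(LAPACK_COL_MAJOR, 1, 'V', 'U', n,
                                     A.data(), n, B.data(), n, w.data());
    if (info != 0) {
        std::fprintf(stderr, "dsygvd failed, info = %d\n", (int)info);
        return false;
    }
    // On success, column j of A holds the coefficient vector c_j and w[j]
    // the corresponding Ritz value; X <- S*C then gives the new orbitals.
    return true;
}

The cost of this dense solve grows as O(N_e^3), which is exactly the bottleneck that PPCG avoids by splitting the subspace into N_e independent 3 × 3 problems (Section 2.3.3).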


2.3.3. PPCG

The LOBPCG method inevitably incurs a large communication overhead when performing the Rayleigh-Ritz step. The recently developed PPCG [37] method can effectively solve this problem. By adding the projection matrix I − X X^T, W and P are projected onto the directions orthogonal to X. In this case, S^T S is transformed into a block diagonal matrix, and the 3N_e × 3N_e characteristic equation is decomposed into N_e sub-problems, each of size 3 × 3. The subspace matrix diagonalization step is then undertaken by each core independently, saving the cost of the communication step. The pseudocode implementation of the PPCG algorithm in PWDFT is shown in Algorithm 3.

Algorithm 3 PPCG method for solving the eigenvalue problem H x_i = λ_i x_i, i = 1, 2, ..., k in PWDFT.
Input: Hamiltonian H and initial wave functions {x_i}_{i=1}^k.
Output: Eigenvalues {λ_i}_{i=1}^k and wave functions {x_i}_{i=1}^k.
1: Initialize X = {x_i}_{i=1}^k and orthonormalize X.
2: while convergence not reached do
3:   Compute the residual vectors R ← H X − X (X^T H X)
4:   Select a suitable preconditioner T to compute the preconditioned residuals W ← T R
5:   Project W ← (I − X X^T) W
6:   Do the Rayleigh-Ritz process on the partitioned subspace:
7:   for i = 1, 2, ..., N_e do
8:     Generate the subspace S ← [x_i, w_i, p_i]
9:     Solve the generalized eigenequation S^T H S c = θ S^T S c to obtain the combination coefficients [c_x, c_w, c_p]^T of each vector in the next subspace, where θ is the Ritz value after chunking
10:    Update the conjugate gradient direction p_i ← w_i c_w + p_i c_p
11:    Update the wave function x_i ← x_i c_x + p_i
12:   end for
13:   Update X = [x_1, x_2, ..., x_{N_e}], P = [p_1, p_2, ..., p_{N_e}]
14:   Orthonormalize X.
15:   If needed, solve the Rayleigh-Ritz process on the subspace span{X}
16: end while
17: Update {x_i}_{i=1}^k ← X.

2.3.4. CheFSI

The Chebyshev-filtered subspace iteration method [39][52] can be considered a power method, which differs from the classical power method by its use of Chebyshev polynomials instead of general polynomials. The Chebyshev polynomial [53] of degree m is defined as

C_m(x) = \begin{cases} \cos\left(m\cos^{-1}(x)\right), & |x| \le 1 \\ \cosh\left(m\cosh^{-1}(x)\right), & x > 1 \\ (-1)^m \cosh\left(m\cosh^{-1}(-x)\right), & x < -1 \end{cases}    (14)

It is easy to see that |C_m(x)| ≤ 1 in the interval |x| ≤ 1, while the polynomial grows rapidly outside this interval. This reveals that, in the eigenvalue problem, we can map the undesired eigenvalues into the interval [−1, 1], making their components minimal by exploiting this property of Chebyshev polynomials, and map the desired eigenvalues to other intervals to make their components maximal.

When solving the Kohn-Sham equation iteratively, we are interested in the eigenvalues of all occupied orbitals, so for the ground-state problem the energies of higher levels are not of concern to us.

Any first-guess wave function X_0 can be expanded linearly in the eigenvectors of the Hamiltonian as

X_0 = α_1 ψ_1 + α_2 ψ_2 + ⋯ + α_n ψ_n.    (15)

Applying a polynomial of the Hamiltonian to both sides of the equation gives

p(H) X_0 = α_1 p(ε_1) ψ_1 + α_2 p(ε_2) ψ_2 + ⋯ + α_n p(ε_n) ψ_n    (16)

where ε_i represents the eigenvalue of the i-th orbital and H is the Hamiltonian. We assume that the lowest unoccupied orbital has energy a, and the highest energy of the system is b. By using a linear mapping to map all the eigenvalues in the interval [a, b] to [−1, 1], and mapping the eigenvalues less than a to other intervals, we can maximize the components of the orbitals we need in the subspace. The mapping satisfying this condition is given by

L(H) = \left(H - \frac{b-a}{2}\right) \Big/ \left(\frac{b+a}{2}\right).    (17)

Combined with the recurrence relation of the Chebyshev polynomials of the first kind,

C_{m+1}(x) = 2x C_m(x) − C_{m−1}(x),  m = 1, 2, …    (18)

we obtain the recurrence relation of the wave functions

X_{k+1} = \frac{4}{b+a}\left(H - \frac{b-a}{2} I\right) X_k - X_{k-1},  k = 1, 2, …, m − 1    (19)

where the eigenvalue upper bound b and lower bound a can be estimated by iterative diagonalization methods such as Davidson. The pseudocode implementation of the CheFSI algorithm in PWDFT is shown in Algorithm 4.

Algorithm 4 CheFSI method for solving the eigenvalue problem H x_i = λ_i x_i, i = 1, 2, ..., k in PWDFT.
Input: Hamiltonian H and initial wave functions {x_i}_{i=1}^k.
Output: Eigenvalues {λ_i}_{i=1}^k and wave functions {x_i}_{i=1}^k.
1: Initialize the wave functions X = {x_i}_{i=1}^k
2: Use a few steps of another diagonalization method to estimate the upper bound b and lower bound a
3: while convergence not reached do
4:   Map the eigenvalues of the interval [a, b] to [−1, 1] and apply the m-degree Chebyshev filter Ỹ = p_m(L(H)) X
5:   Orthonormalize Ỹ ← Orth(Ỹ)
6:   Execute the Rayleigh-Ritz step: solve the characteristic equation Ỹ^T H Ỹ C = C Λ
7:   Update the wave functions X ← Ỹ C
8: end while
9: Update {x_i}_{i=1}^k ← X.
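For illustration, the C++ sketch below applies the three-term recurrence of Eq. (19) to a block of wave functions, given a user-supplied routine that applies the Hamiltonian to one column. It is a minimal dense sketch with an assumed applyH callback and column-major storage; it is not the PWDFT implementation, which applies H implicitly via FFTs and distributes the block over MPI ranks.

#include <functional>
#include <vector>

// Apply the degree-m Chebyshev filter of Eq. (19) to the block X (Nr x Ne,
// column-major).  applyH(in, out) computes out = H*in for one column of
// length Nr.  a and b are the estimated lower/upper spectral bounds
// (Algorithm 4, step 2).
void chebyshev_filter(int Nr, int Ne, int m, double a, double b,
                      const std::function<void(const double*, double*)>& applyH,
                      std::vector<double>& X)
{
    const double scale = 4.0 / (b + a);
    const double shift = (b - a) / 2.0;

    std::vector<double> Xprev(X);         // X_0
    std::vector<double> Xcur(static_cast<size_t>(Nr) * Ne);
    std::vector<double> hcol(Nr);

    // Degree-1 term: X_1 = L(H) X_0 = (scale/2) (H - shift*I) X_0.
    for (int j = 0; j < Ne; ++j) {
        applyH(&X[static_cast<size_t>(j) * Nr], hcol.data());
        for (int i = 0; i < Nr; ++i)
            Xcur[static_cast<size_t>(j) * Nr + i] =
                0.5 * scale * (hcol[i] - shift * X[static_cast<size_t>(j) * Nr + i]);
    }
    // Three-term recurrence X_{k+1} = scale*(H - shift*I) X_k - X_{k-1}.
    for (int k = 1; k < m; ++k) {
        std::vector<double> Xnext(static_cast<size_t>(Nr) * Ne);
        for (int j = 0; j < Ne; ++j) {
            applyH(&Xcur[static_cast<size_t>(j) * Nr], hcol.data());
            for (int i = 0; i < Nr; ++i)
                Xnext[static_cast<size_t>(j) * Nr + i] =
                    scale * (hcol[i] - shift * Xcur[static_cast<size_t>(j) * Nr + i])
                    - Xprev[static_cast<size_t>(j) * Nr + i];
        }
        Xprev.swap(Xcur);
        Xcur.swap(Xnext);
    }
    X.swap(Xcur);                         // filtered block p_m(L(H)) X
}

The dominant cost is the m applications of H per column, which is why CheFSI trades additional Hamiltonian applications for fewer Rayleigh-Ritz solves (cf. Table 2).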
2.4. Data-level parallel diagonalization method

Parallel matrix diagonalization assigns the wave functions to each core by column, and completes the following matrix operations on each core.

1). The Hamiltonian acting on the wave functions separately on each core. It is worth noting that the implicitly stored Hamiltonian is a matrix of size N_r × N_r, and the wave function block is of size N_r × N_e. In general, N_e is much smaller than N_r, so this step is the most time-consuming step in the whole diagonalization algorithm, and its time complexity is O(N_r N_r N_e).

2). Matrix multiplications of the form X^T X, X^T H X or X C. This type of matrix multiplication is encountered when calculating residual vectors or performing Cholesky-decomposition orthogonalization,

L = \mathrm{chol}(X^T X), \quad \mathrm{Orth}(X) = X L^{-1}.    (20)

The parallel method is to block the two matrices by row, complete the block multiplication on each core, and use MPI_Allreduce to merge the results. This type of matrix multiplication requires the wave functions to be divided into row blocks, and the conversion of the wave functions from column blocks to row blocks is completed by the global communication MPI_Alltoallv, as shown in Fig. 2 (a)-(c). The linear algebra operations after each block is parallelized are shown in Fig. 2 (d)-(f): the small-scale eigenproblem matrices of S^T H S C = S^T S C Λ as shown in (d), the calculation of H S as shown in (e), and the inverse projection X_k ← S_k C_k as shown in (f). A minimal sketch of this row-block pattern is given after this list.

3). The small-scale matrix diagonalization problem (Syevd) of the Rayleigh-Ritz process, solved in block parallel using ScaLAPACK.
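The sketch below illustrates the row-block pattern of item 2) and the Cholesky orthogonalization of Eq. (20): each rank holds a row block of X, forms its local contribution to the Gram matrix X^T X, sums it with MPI_Allreduce, factors it with LAPACK's dpotrf, and applies the inverse triangular factor to its row block with BLAS's dtrsm. Block sizes and names are illustrative and error handling is omitted; this is not the PWDFT code itself.

#include <mpi.h>
#include <cblas.h>
#include <lapacke.h>
#include <vector>

// Row-block Cholesky orthogonalization, Eq. (20).  Each rank owns nrow_loc
// consecutive rows of X (nrow_loc x Ne, column-major).  The Ne x Ne Gram
// matrix X^T X is small, so every rank keeps a full (replicated) copy.
void orthonormalize_rowblock(int nrow_loc, int Ne, std::vector<double>& Xloc,
                             MPI_Comm comm)
{
    // 1) Local contribution to the Gram matrix (upper triangle only).
    std::vector<double> G(static_cast<size_t>(Ne) * Ne, 0.0);
    cblas_dsyrk(CblasColMajor, CblasUpper, CblasTrans, Ne, nrow_loc,
                1.0, Xloc.data(), nrow_loc, 0.0, G.data(), Ne);

    // 2) Sum the row-block contributions over all ranks (Fig. 2 (d)).
    MPI_Allreduce(MPI_IN_PLACE, G.data(), Ne * Ne, MPI_DOUBLE, MPI_SUM, comm);

    // 3) Cholesky factorization X^T X = U^T U, redundantly on every rank.
    LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'U', Ne, G.data(), Ne);

    // 4) Apply the inverse factor to the local row block: Xloc <- Xloc * U^{-1}.
    cblas_dtrsm(CblasColMajor, CblasRight, CblasUpper, CblasNoTrans,
                CblasNonUnit, nrow_loc, Ne, 1.0, G.data(), Ne,
                Xloc.data(), nrow_loc);
}

Because the Gram matrix is only N_e × N_e, repeating the small Cholesky factorization on every rank is cheaper than communicating the triangular factor.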


Fig. 2. Two different block modes of the PWDFT wave functions. The column block pattern presented in (a) is used by the FFT to compute the results of the Hamiltonian acting on the wave functions in parallel. The row block pattern presented in (b) is used for parallel computation of matrix-matrix multiplication (Gemm) by BLAS [54]. The 2D block pattern given in (c) is used for SYEVD and the parallel computation of small-scale diagonalization problems. (d) Block mode of the S^T S or S^T H S matrix multiplication, merged using MPI_Allreduce. (e) Parallel operation of the Hamiltonian acting on the wave functions. (f) Block mode for the inverse projection computation X_k ← S_k C_k.

Fig. 3. Six parameter arrays required by MPI_Alltoallv. Different colors represent different processes; there are 2 processes in the figure. A is the column-partitioned wave function matrix and B is the row-partitioned one. The sendk and recvk arrays store the mappings between the wave function matrices A and B and the buffers sendbuf and recvbuf. The use of these arrays is reflected in Algorithm 5.

2.5. Two-level parallel processing method

When diagonalizing the Hamiltonian under the plane-wave basis set, simple MPI parallelism often results in memory overflow due to load imbalance. To address this challenge, a hybrid MPI and OpenMP parallel strategy proves highly effective by reducing the number of occupied nodes and optimizing node utilization. The use of MPI is shown in Fig. 2, and OpenMP is used in two ways. One is to use the OpenMP threading of Intel MKL to accelerate FFT, Gemm, SYEVD and other library functions. The other is to use OpenMP to accelerate the column and row transformation of the matrix, as shown in Algorithm 5 and Fig. 3.

The limitations of memory availability necessitate a careful allocation of the maximum wave function dimension for each node in large-scale computing tasks. Consequently, pure MPI parallelism may leave a substantial number of cores idle, which is particularly noticeable in extensive computational tasks. Here, leveraging OpenMP to harness the computing power of these idle cores emerges as a cost-effective acceleration approach.

Employing the multi-threaded Intel MKL library proves advantageous, offering accelerated performance across all levels of the BLAS and LAPACK library functions, such as Gemm, Potrf, Syevd, and so on. Additionally, the use of multi-threaded FFT reduces the computation time of the step in which the Hamiltonian acts on the wave functions, while the parallel buffer processing around MPI_Alltoallv minimizes the communication time of the row and column transformations.

By employing this hybrid approach, we achieve a significant improvement in the computational efficiency of large-scale electronic structure calculations, facilitating the exploration of complex systems and phenomena in materials science and beyond.

Algorithm 5 Column-to-row conversion by MPI_Alltoallv in PWDFT.
Input: Column-partitioned matrix A.
Output: Row-partitioned matrix B.
1: # pragma omp parallel
2: {
3: # pragma omp for schedule(dynamic,1)
4: Calculate the six arrays: sendk, sendcounts, senddispls, recvk, recvcounts, and recvdispls.
5: For i = 1 : numit
6:   sendcounts and recvcounts hold the number of data elements sent and received in the buffers.
7:   senddispls and recvdispls give, for each process of the communicator, the location of its data relative to the buffer.
8:   The index mappings sendk and recvk between the two-dimensional matrices and the one-dimensional buffers are obtained.
9: # pragma omp for schedule(dynamic,1)
10: for i = 1 : imax; j = 1 : jmax do
11:   sendbuf[sendk(i, j)] = A(i, j)
12: end for
13: MPI_Alltoallv(six arrays)
14: # pragma omp for schedule(dynamic,1)
15: for i = 1 : imax; j = 1 : jmax do
16:   B(i, j) = recvbuf[recvk(i, j)]
17: end for
18: }
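As a concrete, self-contained counterpart to Algorithm 5, the sketch below redistributes a column-partitioned block A (N_r × N_e/P per rank, column-major) into a row-partitioned block B (N_r/P × N_e per rank) with OpenMP-threaded packing and unpacking around a single MPI_Alltoallv call. For brevity it assumes N_r and N_e are divisible by the number of ranks P; the explicit index arithmetic plays the role of the sendk/recvk arrays of Fig. 3. It is an illustrative sketch, not the PWDFT routine.

#include <mpi.h>
#include <vector>

// Column-block -> row-block redistribution of the wave functions
// (cf. Algorithm 5 and Fig. 3).  A: Nr x (Ne/P) local columns, column-major.
// B: (Nr/P) x Ne local rows, column-major.  Assumes Nr % P == 0, Ne % P == 0.
void col_to_row(int Nr, int Ne, const std::vector<double>& A,
                std::vector<double>& B, MPI_Comm comm)
{
    int P;
    MPI_Comm_size(comm, &P);
    const int ncol = Ne / P, nrow = Nr / P, blk = nrow * ncol;

    std::vector<int> counts(P, blk), displs(P);
    for (int q = 0; q < P; ++q) displs[q] = q * blk;

    std::vector<double> sendbuf(static_cast<size_t>(blk) * P);
    std::vector<double> recvbuf(static_cast<size_t>(blk) * P);

    // Pack: for every local column j, the rows owned by destination rank q.
    #pragma omp parallel for collapse(2)
    for (int q = 0; q < P; ++q)
        for (int j = 0; j < ncol; ++j)
            for (int i = 0; i < nrow; ++i)
                sendbuf[q * blk + j * nrow + i] = A[j * Nr + q * nrow + i];

    MPI_Alltoallv(sendbuf.data(), counts.data(), displs.data(), MPI_DOUBLE,
                  recvbuf.data(), counts.data(), displs.data(), MPI_DOUBLE,
                  comm);

    // Unpack: the block received from rank p holds its columns of my rows.
    B.assign(static_cast<size_t>(nrow) * Ne, 0.0);
    #pragma omp parallel for collapse(2)
    for (int p = 0; p < P; ++p)
        for (int j = 0; j < ncol; ++j)
            for (int i = 0; i < nrow; ++i)
                B[(p * ncol + j) * nrow + i] = recvbuf[p * blk + j * nrow + i];
}

The reverse (row-to-column) conversion uses the same buffers with the roles of packing and unpacking exchanged.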
3. Results and discussion

All the DFT self-consistent field (SCF) iterative calculations are implemented in the PWDFT software. We compare the computational efficiency and algorithm convergence of various diagonalization methods for multiple systems spanning from Si256 to Si10648.


Table 3
Numerical accuracy of the total energy and atomic forces of PWDFT compared to ABINIT. Si64 was selected as the reference system, and the convergence accuracy was set to 10^-8. The LDA-PZ functional and HGH pseudopotentials were used, and the results were measured from Ecut = 10.0 Hartree to Ecut = 80.0 Hartree in ABINIT.

Ecut   ΔE (Hartree/atom)   ΔF (Hartree/Bohr)
10.0   4.03E-05            6.50E-09
20.0   9.90E-05            9.65E-09
40.0   2.40E-06            1.31E-08
60.0   6.26E-08            1.17E-08
80.0   2.50E-09            4.50E-09

3.1. Numerical accuracy

To verify the accuracy of the PWDFT calculation results, we selected five Si64 systems with energy cutoffs ranging from Ecut = 10 Hartree to 80 Hartree as benchmark systems. We define the errors of the total energy and atomic forces as ΔE = |E_PWDFT − E_ABINIT|/N_A and ΔF = max_I |F_I^PWDFT − F_I^ABINIT|. These systems were compared with the calculation results from ABINIT, and the comparison results are presented in Table 3.

As can be seen from the comparison, the calculated accuracy of PWDFT is almost the same as that of ABINIT. The differences between the two software packages in the total energy and atomic forces are far smaller than chemical accuracy.

3.2. Convergence analysis

In the convergence test, we selected the smaller system Si256 as the standard system, which is a closed-shell system with N_e/2 occupied states, where N_e is the total number of electrons; the number of occupied states in the system is 512. The kinetic energy cutoff is set to Ecut = 40.0 Hartree, and the number of parallel cores is fixed at 512 processors. Under this setting, the convergence of Davidson, LOBPCG, PPCG, and CheFSI is compared, and the convergence curves are shown in Fig. 4.

The order of the subspace is significantly correlated with the convergence of the algorithm. Here, we use n in Davidson-n to denote the maximum dimension of the Krylov subspace expansion in the Davidson algorithm divided by the dimension of the wave function block. Among the different algorithms, the convergence of Davidson-9, which does not limit the upper bound of the subspace dimension (the subspace of 8 iterations naturally expands to 9th order), is the best. However, since the time complexity of the QR method used to solve the Rayleigh-Ritz problem is O(N^3), the cost in time and memory is intolerable. The Davidson-3, LOBPCG and PPCG methods, with third-order subspace dimension, require a similar number of steps to converge. The CheFSI method is a power method based on Chebyshev polynomials, and its convergence is related to the order of the Chebyshev polynomial. When the order is high enough, CheFSI achieves good convergence, but the iteration time of a 15th-order Chebyshev polynomial is already quite high, greater than the iteration time of the other Krylov methods. Therefore, at a similar time cost, the convergence of the CheFSI method is the poorest.

Fig. 4. Convergence curves of different diagonalization algorithms in the Si256 system. Each diagonalization method is cycled 8 times in each round of SCF iteration, the order of CheFSI's Chebyshev polynomial is chosen as 15, and the upper and lower bounds are estimated using the Lanczos method. Davidson-3 means that the maximum order of the Krylov subspace in the Davidson method is set to 3, and Davidson-9 means that the maximum order of the subspace is set to 9. The SCF convergence error is expressed as norm(out − in)/norm(in), and the convergence criterion requires that this error be less than 10^-6.

3.3. Computational efficiency

Our analysis of computational efficiency covers two aspects, strong and weak scalability, and the selected test objects are bulk Si systems.

3.3.1. The OpenMP/MPI two-level parallel method extends the computable system scale

The memory requirement is an important factor affecting the size of the system that can be calculated by the plane-wave density functional method. In general, a pure MPI parallel scheme cannot meet the requirements of simulating systems with tens of thousands of atoms on a general supercomputer. The introduction of OpenMP for fine-grained parallelism can effectively extract additional node performance and extend the upper limit of the simulable system size.

Si8192 was selected as the test system, and 1,024 MPI processes were used to test the calculation time with different numbers of threads per process; the results are shown in Table 4.

Table 4
Si8192 calculation time as a function of the number of threads per MPI process.

Number of threads   Time (s)   Speedup
1                   1722.80    1.00
2                   1171.82    1.47
4                   713.01     2.41
6                   611.39     2.81
8                   494.43     3.48

In general, due to memory limitations, not all the cores of a single node can participate in the calculation in large-scale plane-wave simulations. We show that the MPI/OpenMP two-level parallel strategy can effectively utilize the computing resources of these otherwise idle cores and reduce the number of nodes occupied. All relevant information on the test systems is given in Table 5.

3.3.2. Weak scalability

We select Si4096 as the benchmark system to test the weak scalability of the LOBPCG, PPCG and CheFSI methods, with one thread per process, Ecut = 10.0 Hartree, and the HGH pseudopotential. The time required for each diagonalization algorithm to perform one iteration on 256, 512, 1,024, 2,048, and 4,096 cores is shown in Fig. 5 (a).

Fig. 5. (a) Weak scalability of different diagonalization methods in plane-wave electronic structure calculations: the time required by the iterative diagonalization algorithms in the Si4096 system for different numbers of MPI processes (256, 512, 1,024, 2,048, 4,096). (b) Strong scalability of different diagonalization methods in plane-wave electronic structure calculations: the time required by three iterative diagonalization algorithms in the Si1024, Si2048, Si3072, Si4096 and Si6144 systems with 1,024 MPI processes.

Fig. 6. Strong scalability of the different components in four iterative eigensolvers (Davidson, LOBPCG, PPCG and CheFSI) in PWDFT: the time required in the Si1024, Si2048, Si3072, Si4096, and Si6144 systems with 1,024 MPI processes. Gemm is the matrix multiplication function, FFT is the fast Fourier transform, Syevd is the ScaLAPACK-accelerated small-scale matrix diagonalization function, Filter is the Chebyshev polynomial recurrence, and Orth is the orthogonalization.


The matrix multiplications and the application of the Hamiltonian to the wave functions are the two most time-consuming parts of the Hamiltonian diagonalization step. By increasing the number of parallel MPI cores, the cost of the matrix multiplication (Gemm) required on each core becomes smaller, so that the time of the SCF iteration process can decrease almost linearly with the number of cores. However, the communication overhead between different cores increases. When the communication overhead balances the saving in matrix multiplication time, the parallel efficiency reaches its upper limit and it no longer makes sense to continue increasing the number of cores. The PPCG algorithm reduces the MPI communication overhead by dividing the large Rayleigh-Ritz eigenvalue problem into many small problems and assigning them to each core, and it has the best parallel performance among the three algorithms.

Table 5
Test systems and the corresponding test parameters, where N_a represents the number of atoms in the system, N_r represents the number of lattice points in real space, E_cut represents the truncation energy (Ha), #Band the number of bands, and C_max represents the maximum number of CPU cores used in the test.

System    N_a      N_r         E_cut   #Band    C_max
Si1024    1,024    424,800     10.0    2,048    1,024
Si2048    2,048    835,440     10.0    4,096    1,024
Si4096    4,096    1,643,032   10.0    8,192    8,192
Si6144    6,144    2,450,624   10.0    12,288   1,024
Si8192    8,192    3,258,216   10.0    16,384   8,192
Si10648   10,648   4,251,528   10.0    21,296   42,592

3.3.3. Strong scalability

To test strong scalability, we selected five standard systems, Si1024, Si2048, Si3072, Si4096 and Si6144, and tested the performance of the four algorithms on these systems using 1,024 cores. The comparison results are shown in Fig. 5 (b).

For the different diagonalization methods, we evaluate the trend of the execution time of each time-consuming function as the number of atoms increases, as shown in Fig. 6.

In the SCF iteration, there are five time-consuming steps in the diagonalization of the Hamiltonian: Syevd, Gemm, FFT, MPI_Allreduce and MPI_Alltoallv. The parallel computation mainly reduces the execution time of the Gemm and FFT functions through data-level partitioning of the matrices, while the Syevd function is mainly based on the QR method and can only achieve a limited degree of parallelism through ScaLAPACK. The number of cores at which this function reaches its maximum parallel efficiency is about 500 on the computing cluster we tested. The time cost of MPI_Allreduce and MPI_Alltoallv increases with the number of cores. However, for the LOBPCG, CheFSI and Davidson diagonalization methods, the time cost of Syevd is always nearly an order of magnitude higher than that of the two communication functions. Therefore, for these three methods, the main factor limiting the improvement of their parallel efficiency is the Syevd function. As for the PPCG method, its improvement lies in the data-level parallelism of the Rayleigh-Ritz problem through the projection step, which removes the limitation that the Syevd function can only be parallelized to a limited degree, so it has better parallel efficiency than the other three methods.

3.3.4. Parallel computing of the electronic structure of large-scale systems by the MPI/OpenMP two-level parallel method

Due to the limitations of memory and computing speed, it is very difficult to simulate a large-scale system with tens of thousands of atoms accurately with plane waves. Common plane-wave software can simulate systems ranging from several hundred to thousands of atoms, and the MPI/OpenMP two-level parallel approach can efficiently utilize node memory. Combined with the low-memory-footprint PPCG algorithm, we used 42,592 CPU cores to complete parallel computing tests at the level of tens of thousands of Si atoms, as shown in Fig. 7. In theory, for systems with larger truncation energies or atom counts, it should be possible to use such parallelism to further increase the parallelization scale, but unfortunately we currently lack a supercomputer that meets both the memory and core-count requirements for further testing.

Fig. 7. In the Si10648 system, the time spent by each step of the SCF with the PPCG algorithm as a function of the number of cores. Ecut = 10.0 Hartree, and ten PPCG iterations are performed for each SCF step. 1thread means one thread per MPI process, and 8thread means eight threads per MPI process.

3.3.5. In-node and inter-node high-performance computing performance analysis of PWDFT

Using the Intel VTune Profiler and Paratune analysis tools, the performance of the PWDFT program in large-scale parallel computing was quantitatively studied.

We performed an in-node computing performance analysis of PWDFT on a standalone server using the Intel VTune analysis tool, and the results are shown in Table 6.

(1) CPU instruction execution efficiency analysis: Modern CPU processors generally adopt superscalar pipelining technology, which can issue multiple instructions per clock cycle. For the Intel Xeon Gold 6342 processor, two multiplication and two addition instructions can be completed per clock cycle, giving a theoretical minimum CPI value of 0.25. Generally speaking, for ordinary plane-wave computing tasks, unreasonable data partitioning and memory access design lead to frequent memory accesses and memory-access stalls, resulting in an extremely high CPI. For PWDFT, the instruction execution efficiency is very close to the theoretical limit of the CPI, indicating that the whole program has a good memory access design.

(2) Analysis of computing efficiency within a node: The GFLOPS of PWDFT increase with the computing scale of the system, and the peak performance of PWDFT reaches 64% in the Si3072 system. The program as a whole exploits the hardware capability well.

(3) MPI imbalance analysis: For parallel computing, good scalability comes from balanced parallel task partitioning, and the basis of large-core-count parallelism is having very few non-parallel parts in the task. The balance of the MPI task division can be obtained from the MPI imbalance analysis. For PWDFT, as the system scale increases, the imbalance of the MPI task division always remains between 0.1% and 0.3%, so the parallel task division of the program can be considered essentially completely balanced, allowing large-scale parallel expansion.


Table 6
Single-node high-performance computing analysis for Si512-Si3072 using Intel VTune. All the tasks were performed with 16 MPI processes distributed over 16 physical cores. The node memory is 4 TB and the CPU of the node is an Intel Xeon Gold 6342 processor. One GFLOPS (gigaFLOPS) is equal to one billion floating-point operations per second. CPI is the average number of clock cycles required to execute an instruction. MPI Imbalance shows the CPU time spent by ranks spinning in waits on communication operations, normalized by the number of ranks. Time is the running time of the entire computing task.

System   GFLOPS    Average CPU Frequency   CPI Rate   MPI Imbalance (s)   Time (s)
Si512    459.62    3.27 GHz                0.33       0.08 (0.2%)         31.29
Si1024   315.20    3.35 GHz                0.31       0.32 (0.1%)         421.84
Si2048   958.18    3.28 GHz                0.38       2.33 (0.3%)         915.78
Si3072   1077.05   3.28 GHz                0.38       4.95 (0.2%)         2669.97

Table 7
Multi-node high-performance computing analysis. The performance analysis object is the Si10648 computing task, which uses 1,440 MPI processes with one thread per process on 210 compute nodes. Each node has 96 Intel Xeon Platinum CPU cores. One GFLOPS (gigaFLOPS) is equal to one billion floating-point operations per second. CPU (all)% indicates the percentage of the node's CPUs used in the calculation. GFLOPS% represents the percentage of the maximum GFLOPS of each node used by the program. MemRW (GB/s) represents how many gigabytes of memory bandwidth are used per second. IB Send and IB Recv represent the average IB network traffic in MB per second. Lustre Readiops (MB/s) represents how many megabytes of IO operations are performed per second.

GFLOPS   CPU (all)%   GFLOPS%   MemRW (GB/s)   IB Send (MB/s)   IB Recv (MB/s)   Lustre Readiops (MB/s)
176.43   7.99         2.5%      10.88          40.11            40.14            0.01

Furthermore, the performance of the Si10648 computing task is analyzed across nodes using the performance analysis support provided by Paratune. The information includes the GFLOPS, storage access efficiency, and communication efficiency of the computing task, as shown in Table 7.

(1) Computational efficiency analysis: Due to the high memory requirements of large system tasks, we could not run the computing task on all cores of a node, but the GFLOPS of our program accounted for 8% of the node while using only 8% of the CPU cores per node. The GFLOPS% of our computing task running on all cores would reach 30% of the whole node if memory issues are excluded. This value is higher than that of traditional plane-wave density functional software such as VASP and QE. It shows that our code has good instruction execution efficiency and vectorization optimization.

(2) Memory bandwidth analysis: For the matrix multiplication Gemm, the computational time complexity is O(N^3), while the memory access complexity is O(N^2). Therefore, it is both a memory-access-intensive and a compute-intensive task. However, the program's memory access bandwidth is far smaller than the CPU's floating-point throughput, and memory access latency would greatly drag down the overall instruction execution speed, thus slowing down the calculation speed of the whole program. To solve this problem, we implement a careful partitioning of the wave function matrix, so that the partitioned blocks fit into the cache and more memory access instructions obtain data directly from the cache, greatly reducing the memory access latency. The 8% of CPU cores used occupy only 3% of the memory bandwidth of the whole node, avoiding the problem of intensive memory access.

(3) IB traffic analysis: In parallel operation, large-scale global communication brings a huge communication cost, which restricts the maximum scale of parallel computing. To optimize the program, replacing costly global communication with local small-scale point-to-point communication can effectively improve the program's scalability. Nodes in zone T6 are connected using the IB network, and the IB network traffic effectively reflects the scale of global communication. After local communication optimization, the average communication traffic at 210 nodes is only 40 MB/s, indicating that our program has good scalability.

(4) IO analysis: From our IO data analysis on Lustre, our application performs minimal IO operations, which pose neither a computing bottleneck nor stress.

PWDFT exhibits typical compute-intensive characteristics in large-scale computing cases, and a 30% GFLOPS ratio can be achieved when running on all cores. Such a ratio is very high in the field of scientific computing. At the same time, there are no high memory bandwidth or communication requirements, so it can show high parallel scalability.

4. Conclusion and outlook

In summary, we have implemented a variety of matrix diagonalization algorithms in PWDFT, and compared the convergence steps, memory consumption, computing speed and scalability of the different diagonalization algorithms. This paper demonstrates how to solve the problem of memory consumption in the diagonalization step through a hybrid MPI and OpenMP two-level parallelization mechanism. By combining the PPCG diagonalization algorithm with the two-level parallelization mechanism, PWDFT achieves high-speed parallel computing at the scale of tens of thousands of silicon atoms. However, PWDFT can still be optimized in many details for higher performance.

Until now, the world's leading supercomputers have exhibited diverse system designs, such as Fugaku's ARM A64FX and Frontier's AMD GPU-based systems. Each of these systems possesses unique programming models and architectural characteristics. A significant professional challenge lies in optimizing software to harness the full potential of these varied computer system architectures, enhancing computational performance, and minimizing communication overhead. Currently, notable advancements have been made in enhancing the PWDFT version for NVIDIA GPUs. Furthermore, a series of novel algorithms geared towards reducing memory consumption during communication processes and enhancing overall communication efficiency are being progressively incorporated into the software package. We eagerly anticipate sharing these developments in the near future.

CRediT authorship contribution statement [11] A. Marek, V. Blum, R. Johanni, V. Havu, B. Lang, T. Auckenthaler, A. Heinecke,
H.-J. Bungartz, H. Lederer, The elpa library: scalable parallel eigenvalue solutions
for electronic structure theory and computational science, J. Phys. Condens. Matter
Junwei Feng: Methodology, Software, Writing – original draft, Data
26 (21) (2014) 213201.
curation. Lingyun Wan: Software, Writing – review & editing. Jielan [12] L. Genovese, A. Neelov, S. Goedecker, T. Deutsch, S.A. Ghasemi, A. Willand, D.
Li: Resources, Writing – review & editing. Shizhe Jiao: Writing – re- Caliste, O. Zilberberg, M. Rayson, A. Bergman, et al., Daubechies wavelets as a
view & editing. Xinhui Cui: Writing – review & editing. Wei Hu: Project basis set for density functional pseudopotential calculations, J. Chem. Phys. 129 (1)
administration, Resources, Writing – review & editing, Funding acquisi- (2008) 014109.
[13] H. Shang, L. Xu, B. Wu, X. Qin, Y. Zhang, J. Yang, The dynamic parallel distribution
tion. Jinlong Yang: Project administration, Resources, Writing – review
algorithm for hybrid density-functional calculations in honpas package, Comput.
& editing. Phys. Commun. 254 (2020) 107204.
[14] L. Wang, B. Lee, et al., Linearly scaling (3D) fragment method for large-scale elec-
Declaration of competing interest tronic structure calculations, in: Proceedings of the 2008 ACM/IEEE Conference on
Supercomputing, SC ’08, IEEE Press, 2008.
[15] Y. Hasegawa, J.-I. Iwata, et al., First-principles calculations of electron states of a
The authors declare the following financial interests/personal rela- silicon nanowire with 100,000 atoms on the k computer, in: Proceedings of 2011 In-
tionships which may be considered as potential competing interests: ternational Conference for High Performance Computing, Networking, Storage and
Wei Hu reports financial support was provided by University of Science Analysis, SC ’11, Association for Computing Machinery, New York, NY, USA, 2011.
and Technology of China. Wei Hu reports a relationship with University [16] S. Das, P. Motamarri, et al., Fast, scalable and accurate finite-element based ab ini-
tio calculations using mixed precision computing: 46 pflops simulation of a metallic
of Science and Technology of China that includes: employment.
dislocation system, in: Proceedings of the International Conference for High Per-
formance Computing, Networking, Storage and Analysis, SC ’19, Association for
Data availability Computing Machinery, New York, NY, USA, 2019.
[17] W. Hu, H. An, Z. Guo, Q. Jiang, X. Qin, J. Chen, W. Jia, C. Yang, Z. Luo, J. Li,
et al., 2.5 million-atom ab initio electronic-structure simulation of complex metal-
No data was used for the research described in the article.
lic heterostructures with dgdft, in: 2022 SC22: International Conference for High
Performance Computing, Networking, Storage and Analysis (SC), IEEE Computer
Acknowledgements

This work is partly supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB0450101), the Innovation Program for Quantum Science and Technology (2021ZD0303306), the National Natural Science Foundation of China (22173093, 22288201, 21688102), the Anhui Provincial Key Research and Development Program (2022a05020052), the Hefei National Laboratory for Physical Sciences at the Microscale (KF2020003), the Chinese Academy of Sciences Pioneer Hundred Talents Program (KJ2340000031, KJ2340007002), the National Key Research and Development Program of China (2016YFA0200604, 2021YFB0300600), the Anhui Initiative in Quantum Information Technologies (AHY090400), the CAS Project for Young Scientists in Basic Research (YSBR-005), the Hefei National Laboratory for Physical Sciences at the Microscale (SK2340002001), and the Fundamental Research Funds for the Central Universities (WK2340000091, WK2060000018). The authors acknowledge the Beijing Beilong Super Cloud Computing and Beijing Beilong Super Cloud Computing Co., Ltd for providing HPC resources (http://www.blsc.cn/).