A Graphics Processing Unit–Based, Industrial Grade Compositional Reservoir Simulator

K. Esler, R. Gandham, L. Patacchini*, T. Garipov, and A. Samardzic, Stone Ridge Technology; and P. Panfili, F. Caresani, A. Pizzolato, and A. Cominelli, Eni E&P

Summary
Recently, graphics processing units (GPUs) have been demonstrated to provide a significant performance benefit for black-oil reservoir
simulation, as well as flash calculations that serve an important role in compositional simulation. A comprehensive approach to compo-
sitional simulation based on GPUs has yet to emerge, and the question remains as to whether the benefits observed in black-oil simula-
tion persist with a more complex fluid description. We present a positive answer to this question through the extension of a commercial

GPU-based black-oil simulator to include a compositional description based on standard cubic equations of state (EOSs). We describe
the motivations for the selected nonlinear formulation, including the choice of primary variables and iteration scheme, and support for
both fully implicit methods (FIMs) and adaptive implicit methods (AIMs). We then present performance results on an example sector
model and simplified synthetic case designed to allow a detailed examination of runtime and memory scaling with respect to the
number of hydrocarbon components and model size, as well as the number of processors. We finally show results from two complex
asset models (synthetic and real) and examine performance scaling with respect to GPU generation, demonstrating that performance
correlates strongly with GPU memory bandwidth.

Introduction
Despite decades of advances in both computational hardware and software algorithms, the demand for increased performance in reser-
voir simulation remains unabated. Accurately modeling multiphase flow through highly heterogeneous subsurface formations and cap-
turing scale-dependent phenomena such as near-wellbore behavior, nearfield effects of hydraulic fractures, water coning, gas plume
formation, and polymer enhanced oil recovery requires highly refined grids. For large fields, adequately resolving this behavior with
quantitative accuracy often requires models with millions to several tens of millions of cells or more (Dogru et al. 2009; Cominelli
et al. 2014; Shotton et al. 2016). Compounding these requirements is the need to provide bounds on the uncertainty in expected produc-
tion because of inadequate knowledge of the subsurface structure, stratigraphy, and mineral properties. These uncertainty quantification
workflows may require the simulation of hundreds of independent model realizations to adequately sample the parameter space
(Perrone et al. 2017).
For approximately the past 15 years, the performance of individual processing cores has increased only very slowly in comparison
to the exponential performance growth enjoyed in previous decades (Feng et al. 2019). Supplying performance increases has thus only
been possible through algorithmic advances or through increasingly parallel computations taking advantage of more and more individ-
ual processors working together in concert. Since the 1990s, parallel reservoir simulators, primarily based on a message-passing
paradigm, have been developed to provide the ability to work with finer grids and more complex physics (Bowen and Leiknes 1995; Verdiere
et al. 1999; Dogru et al. 2002; Shiralkar et al. 2005; Gries et al. 2014). The advancement hinged on parallel algorithms for linear solvers
(Cao et al. 2005; Gries et al. 2014) and sometimes required the refactoring of existing software architecture (Bowen and Leiknes 1995;
Gries et al. 2014).
Over the last decade, GPUs have developed into a potent alternative to central processing units (CPUs) for high-performance com-
puting across a broad range of disciplines; today, the Top500 list (Strohmaier et al. 2020) of the most powerful computing clusters high-
lights at least six GPU-equipped systems in the first 10 positions. In some fields, including seismic imaging, the rate of adoption was
fast, with large-scale deployment occurring in the first few years of the availability of GPUs capable of general-purpose computing. In
these applications, the computational work is often concentrated in a few computational kernels with abundant natural parallelism,
allowing relatively straightforward initial ports (Komatitsch et al. 2010; Massidda et al. 2013; Ferraro et al. 2014).
Reservoir simulation has also benefited significantly from the use of GPUs to accelerate the computation (Appleyard et al. 2011; Yu
et al. 2012; Bayat and Killough 2013; Tchelepi and Zhou 2013; Esler et al. 2014). As computational processors, GPUs can provide
much higher aggregate floating point throughput than conventional CPUs. Perhaps more importantly for engineering applications, they
also exhibit much higher bandwidth to on-board memory. However, because state-of-the-art GPUs contain more than 5,000 cores, each
of which must be oversubscribed with multiple in-flight threads of execution to attain optimal performance, significant acceleration can
only be achieved in applications with an extreme level of concurrency. Simulations running on even a single GPU must expose tens of
thousands of independent tasks to be executed concurrently. For computations that are naturally or embarrassingly parallel, such as the
evaluation of properties for each cell from interpolation tables, exposing this concurrency is often relatively straightforward. For others,
including complex multilevel preconditioners (Esler et al. 2012; Naumov et al. 2015), very significant restructuring is required to
achieve good performance, often in addition to the domain decomposition methodology used in conventional CPU-based parallelism.
Thus far, documented benefits of GPU computing have been largely restricted to black-oil modeling, where hydrocarbon mixtures
are simplified by means of a two-component formulation with precomputed tabulations to describe pressure/volume/temperature (PVT)
properties (Chen et al. 2014; Esler et al. 2014). However, many scenarios warrant a more physically accurate, compositional description
of the fluid properties. These include reservoir conditions approaching the critical point, enhanced oil recovery with miscible or near-
miscible injection, or reservoirs with disconnected pay zones with disparate oil character that may begin to communicate during

*Corresponding author; email: lpatacchini@stoneridgetechnology.com


Copyright © 2022 Society of Petroleum Engineers

This paper (SPE 203929) was accepted for presentation at the SPE Reservoir Simulation Conference, On-Demand, 26 October 2021, and revised for publication. Original manuscript
received for review 21 April 2021. Revised manuscript received for review 25 June 2021. Paper peer approved 21 July 2021. This paper is published as part of the 2021 SPE Reservoir
Simulation Conference Special Issue.


recovery. Computational requirements for simulating large models are greatly exacerbated if a compositional description is required;
hence, engineers are frequently forced to trade off high spatial resolution for the more accurate description of phase behavior and trans-
port to achieve practical run times. In this respect, the application of large CPU clusters has proven useful in enabling the simulation of
highly detailed compositional models (Obi et al. 2014; Casciano et al. 2015) and ensembles of models.
The use of GPUs to accelerate such simulations may prove to be of even greater benefit and importance than in black-oil simula-
tions. Bogachev et al. (2018) describe a commercial simulator in which both the linear solver and the EOS computations are offloaded
to the GPU. Surprisingly, the authors report only a modest benefit from GPU acceleration of compositional simulation, with an average
reduction in runtime well under 50%. Khait and Voskov (2017) took one step further in achieving optimal use of the GPUs in composi-
tional simulation; using an operator-based approach, the nonlinear loop and linearization are entirely performed on the GPU, while
property-related computations, relatively inexpensive in their formulation, are left on the CPU.
From an algorithmic perspective, the use of GPUs for compositional simulation presents several challenges beyond those already
present in black-oil simulations. Notably, an efficient and robust means to calculate phase stability, composition, and constitutive proper-
ties is required. Gandham et al. (2016) showed that with a highly optimized implementation using mixed-precision, accelerated succes-
sive substitution, and Newton-Raphson iteration, stability and flash can be conducted on a GPU for many components and millions of
cells in a small fraction of a second, establishing that flash calculations do not present a performance bottleneck for compositional simu-
lation on GPUs. As we show through benchmarks on asset models, this consideration appears to be confirmed in practice.
A more pressing challenge is presented by the limited capacity of high-bandwidth memory present on GPU accelerators, which is
typically much smaller than the amount of random-access memory available on a CPU-based node. The amount of data required to

assemble, store, and solve the sparse linear system of equations resulting from the discretization of the flow equations grows quadrati-
cally with the number of pseudocomponents, $n_c$, retained in the simulation. A direct implementation of solution methods developed
originally for CPUs may severely limit the number of cells that may be simulated on each GPU.
In this paper, we present our successful attempt to overcome these challenges. In contrast to previous approaches that migrated parts
of the simulation code to the GPU in a piecemeal fashion, the entirety of our approach was designed from inception to take maximal
advantage of GPU hardware characteristics while compensating for potential bottlenecks, such as limited memory capacity and the rela-
tively high cost of CPU-GPU communication. By means of formulation, iteration scheme, and preconditioner customized for the com-
putational characteristics of GPUs, we have sought to enable the simulation of large compositional models with smaller hardware
requirements and reduced run times, both for fully implicit and adaptive implicit time marching schemes. In addition to performance,
we have endeavored to support the wide set of standard features that are required to simulate industrial models. The compositional for-
mulation has been validated on an extensive range of conventional and unconventional assets with diverse fluid behavior, relative per-
meability options, recovery strategies, and asset management.
We will first describe the simulator formulation, including the choice of primary variables and iteration scheme. We then present
performance results on an example sector model and simplified synthetic case designed to allow a detailed examination of scaling with
respect to the number of hydrocarbon components and model size, as well as the number of processors. Finally, we show results from
two complex, real-field asset models and examine performance scaling with respect to GPU generation, demonstrating that performance
correlates strongly with GPU memory bandwidth.

Compositional Formulation
Governing Equations. We consider a mixture of $n_h$ hydrocarbon components allowed to partition between the oil and gas phases (in this publication we liberally refer to any nonwater component as a hydrocarbon) and an immiscible water component identified with the aqueous (water) phase. By limiting ourselves to isothermal problems and omitting chemical reactions or mass exchange between fluids and the rock, the physics is governed by the following $n_c = n_h + 1$ conservation equations:

$$\frac{\partial}{\partial t}\left[\phi\left(x_i b_o S_o + y_i b_g S_g\right)\right] + \nabla\cdot\left(x_i b_o \mathbf{u}_o + y_i b_g \mathbf{u}_g\right) + q_i = 0, \quad \forall i \in [1, n_h],$$
$$\frac{\partial}{\partial t}\left(\phi\, b_w S_w\right) + \nabla\cdot\left(b_w \mathbf{u}_w\right) + q_w = 0, \qquad (1)$$
where $S_\alpha$, $\mathbf{u}_\alpha$, and $b_\alpha$ are the saturation, Darcy velocity, and molar density of phase $\alpha$, respectively, and $x$ and $y$ are the molar fraction arrays in the oil and gas phases, respectively; the subscript $w$ refers to both the water component and the water phase. $q_i$ is a source-term density for component $i$ (e.g., arising from producer or injector wells).
The conservation equations are closed by the multiphase extension of Darcy’s law (Bear 2013):
$$\mathbf{u}_\alpha = -\frac{k_{r\alpha}}{\mu_\alpha}\,\mathbf{K}\cdot\left(\nabla p_\alpha - \rho_\alpha\,\mathbf{g}\right), \quad \forall \alpha \in \{g, o, w\}, \qquad (2)$$

where $\mu_\alpha$, $\rho_\alpha$, $k_{r\alpha}$, and $p_\alpha$ are the viscosity, mass density, relative permeability, and pressure of phase $\alpha$, respectively; $\mathbf{K}$ is the (symmetric, positive definite) permeability tensor; and $\mathbf{g}$ is the gravitational acceleration vector. Following the common practice of compositional reservoir simulation, the molar compositions of the hydrocarbon phases are assumed to be in instantaneous thermodynamic equilibrium, with $n_h$ equilibrium constraints (Whitson and Brulé 2000; Michelsen and Mollerup 2007), defined as:

$$f_{i,g}\left(p_h, y\right) - f_{i,o}\left(p_h, x\right) = 0, \quad \forall i \in [1, n_h], \qquad (3)$$

where $f_{i,\alpha}$ is the fugacity of component $i$ in phase $\alpha$. In Eq. 3, we neglect the pressure difference between phases and evaluate the fugacities at the hydrocarbon saturation-weighted pressure $p_h$:

$$p_h = \frac{S_g\,p_g + S_o\,p_o}{S_g + S_o}. \qquad (4)$$

Note that the choice of reference pressure is arbitrary, and we could have for instance chosen to use the pressure of a selected phase
(e.g., gas). Notably, the instantaneous equilibrium assumption is an approximation, but a proper modeling of nonequilibrium would
require doubling the conservation equations, which is not practical (Patacchini et al. 2015).


Physical Models. Porosity is considered to be a function of the reference hydrocarbon pressure ph through simple correlations or tabu-
lations. The rock-fluid model is based on oil being the middle phase, with gas and water being the least and most wetting phases, respec-
tively. We therefore consider $P_{cgo} = p_g - p_o$ to be a function of $S_g$ and $P_{cow} = p_o - p_w$ to be a function of $S_w$; similarly, we consider $k_{rg}$
to be a function of Sg, krw to be a function of Sw, and kro to be a function of both Sg and Sw through a three-phase relative permeability
model (Baker 1988; Rasmussen et al. 2021). Near-critical relative permeability adjustment is performed using a simplified calculation
of the critical temperature based on Li’s correlation (Li 1971; Petitfrère et al. 2019), and gas-oil miscibility effects are taken into
account as described by Coats (1980).
Fugacities needed for the thermodynamic equilibrium constraints (Eq. 3) are obtained via a cubic EOS, such as Peng-Robinson; the
same EOS is used to evaluate the hydrocarbon phases' molar densities, whereas their viscosities are obtained from correlations (Whitson
and Brulé 2000). Water-phase properties (density and viscosity) are obtained from simple correlations as well.

Discretization and Choice of Variables. It is common practice in commercial reservoir simulators to discretize the conservation equations (Eq. 1) with a finite-volume approach that is low order in both space and time. We define the mobility of phase $\alpha$ as

$$\lambda_\alpha = \frac{k_{r\alpha}}{\mu_\alpha}, \qquad (5)$$

and the potential difference between two gridblocks $\ell$ and $\ell'$ as (Rasmussen et al. 2021):

$$\Delta\Phi_{\alpha,\ell\to\ell'} = p_{\alpha,\ell} - p_{\alpha,\ell'} - \rho_{\alpha,\ell,\ell'}\,g\,\Delta d_{\ell\to\ell'}, \qquad (6)$$

where $\Delta d_{\ell\to\ell'}$ is the depth difference between cells $\ell$ and $\ell'$, $g$ is the gravitational acceleration, and $\rho_{\alpha,\ell,\ell'}$ is the mass density of phase $\alpha$ at reservoir conditions at the interface between cells $\ell$ and $\ell'$, computed according to

$$\rho_{\alpha,\ell,\ell'} = \frac{\rho_{\alpha,\ell}\,S_{\alpha,\ell} + \rho_{\alpha,\ell'}\,S_{\alpha,\ell'}}{S_{\alpha,\ell} + S_{\alpha,\ell'}}. \qquad (7)$$

Using first-order upstream mobility weighting, time integration between $t^n$ and $t^{n+1}$ of the conservation equations yields the following mass-balance residual:

$$\mathrm{MBR}_{i,\ell} = \frac{N_{i,\ell}^{n+1} - N_{i,\ell}^{n}}{\Delta t^n} + \sum_{\ell' \in C(\ell)} T_{\ell,\ell'}\left[\left(x_i b_o \lambda_o\right)^{\uparrow}\Delta\Phi_{o,\ell\to\ell'} + \left(y_i b_g \lambda_g\right)^{\uparrow}\Delta\Phi_{g,\ell\to\ell'}\right] + Q_{i,\ell}, \quad \forall i \in [1, n_h],$$

$$\mathrm{MBR}_{w,\ell} = \frac{N_{w,\ell}^{n+1} - N_{w,\ell}^{n}}{\Delta t^n} + \sum_{\ell' \in C(\ell)} T_{\ell,\ell'}\left(b_w \lambda_w\right)^{\uparrow}\Delta\Phi_{w,\ell\to\ell'} + Q_{w,\ell}, \qquad (8)$$

where $\Delta t^n = t^{n+1} - t^n$ is the timestep length, $C(\ell)$ is the set of cells connected to cell $\ell$, $T_{\ell,\ell'}$ is the transmissibility between cells $\ell$ and $\ell'$, and the superscript $\uparrow$ indicates evaluation in the upstream cell for the phase at hand; note that we here limit ourselves to a two-point flux approximation.
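To make the upstream weighting in Eq. 8 concrete, the following C++ sketch accumulates the contribution of a single two-point connection and one hydrocarbon-bearing phase to the component mass-balance residuals of the two cells it joins. The data structures and function names are illustrative only and do not correspond to the simulator's actual implementation.

#include <cstddef>
#include <vector>

// Illustrative per-cell, per-phase state for a hydrocarbon phase (oil or gas).
struct PhaseState {
    double b;                    // molar density b_alpha
    double mobility;             // lambda_alpha = k_r,alpha / mu_alpha (Eq. 5)
    std::vector<double> frac;    // molar fractions x_i (oil) or y_i (gas)
};

// Add the contribution of one connection (cells l and l') and one phase to the
// component mass-balance residuals of Eq. 8, with first-order upstream weighting.
void addPhaseFlux(double T,                    // transmissibility T_{l,l'}
                  double dPhi,                 // potential difference of Eq. 6, l -> l'
                  const PhaseState& cellL,
                  const PhaseState& cellLp,
                  std::vector<double>& mbrL,   // residuals of cell l
                  std::vector<double>& mbrLp)  // residuals of cell l'
{
    // The upstream cell is the one the phase flows out of (dPhi > 0 means l -> l').
    const PhaseState& up = (dPhi > 0.0) ? cellL : cellLp;
    for (std::size_t i = 0; i < up.frac.size(); ++i) {
        const double flux = T * up.frac[i] * up.b * up.mobility * dPhi; // molar flux l -> l'
        mbrL[i]  += flux;   // leaves cell l
        mbrLp[i] -= flux;   // enters cell l'
    }
}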
We implemented the three following first-order time discretization schemes:
• The FIM: All properties in Eq. 8 are evaluated at timestep $t^{n+1}$, yielding a backward Euler scheme.
• The implicit-pressure, explicit-mobility method (IMPEM): The pressure terms in Eq. 6 are evaluated at timestep $t^{n+1}$, whereas the capillary pressure and density terms of Eq. 6, as well as the molar fractions and mobilities in Eq. 8, are evaluated at timestep $t^n$.
• The AIM: The pressure terms in Eq. 6 are evaluated at timestep $t^{n+1}$, whereas the remaining terms are evaluated either at timestep $t^n$ or $t^{n+1}$ depending on numerical stability considerations, as described later in the paper.
The presented set of governing equations can be solved using different sets of primary unknowns. The overall molar and natural var-
iables formulations are examples of the commonly used formulations and corresponding choice of independent variables. Every choice
of independent variables has advantages and disadvantages as shown by Voskov and Tchelepi (2012).
We use extensive variables, namely the mole numbers $(N_i, N_w),\ i \in [1, n_h]$, as primary unknowns; the flow physics is governed by $n_c = n_h + 1$ conservation equations, hence these would be sufficient as a primary variable set. However, we note that Eqs. 5 and 6 contain terms that are functions of pressure and saturations. We therefore add three more variables to the equation set ($p_h$, $S_w$, and $S_g$) and three more equations relating these variables to the mole number array; we choose these to be the volume balance equations:

$$\mathrm{VBR}_{\alpha,\ell} = \mathrm{PV}_\ell\!\left(p_h\right)S_\alpha - V_\alpha\!\left(p_h, N\right), \quad \forall \alpha \in \{g, o, w\}, \qquad (9)$$

where $N$ is the mole number array. Eq. 9 can be understood as follows. The volume occupied by the fluid phase $\alpha$ is the product of its saturation and the pore volume of the cell. This volume must match the extrinsic fluid volume, $V_\alpha\!\left(p_h, N\right)$, computed as the number of moles divided by the molar density of the phase obtained from the EOS. The thermodynamic constraints in Eq. 3 can be solved locally
for every gridblock (in the form of pressure-temperature flash iterations, as described later in the paper).
This formulation favors an exact conservation of mass at the linear level, because molar fluxes out of a cell to a connected neighbor
must naturally balance the flow into that neighbor; this is not necessarily true at the nonlinear level because heuristics limiting local
solution changes to avoid overshoot may yield temporary inconsistencies in the mole number updates. These inconsistencies are gener-
ally reconciled in subsequent Newton iterations. Note that a degree of local volume imbalance (associated with the volume balance resid-
ual; Eq. 9) is unavoidable until nonlinear convergence is reached for a formulation in which mass is conserved at the linear level. In
contrast, the natural variables formulation satisfies volume balance by construction, whereas some local mass imbalance is likewise
unavoidable until nonlinear convergence is achieved. Coats et al. (1998) observed that a strict conservation of mass typically results in
a higher rate of convergence than a strict volume conservation.
All in all, our formulation has $n_c + 3$ primary variables/gridblock. This is three more than the minimum requirement of the natural
variables formulation [e.g., used in the Intersect simulator (Intersect 2018)], and two more than the minimum requirement of a standard
mass-variables formulation [e.g., used in the Eclipse 300 simulator (Eclipse 2017) or in the Nexus simulator (Nexus 2014)]. However,
we find that the extra variables do not incur a significant performance penalty (except perhaps with a very low number of components)


and provide flexibility in the level of implicitness and preconditioning schemes. In particular, by including saturations in the primary
variables set, we can
• More easily implement nonlinear update heuristics; see the discussion after Algorithm 1.
• Implement an implicit pressure-saturation, explicit composition scheme, which could then lead to a variant of our AIM formula-
tion (Cao 2002). Note that we have not yet attempted to do so.

Well Modeling. We use bottomhole pressure as a single variable defining the well state (Coats et al. 1998) and compute hydrostatic
head by integrating the density along the wellbore based on explicit (i.e., from the previously converged timestep) inflows. Before each
timestep and/or Newton iteration, individual well solves are performed with frozen reservoir variables to determine the most limiting
constraint, as well as to determine production or injection potential and deliverability (Edwards and Guyaguler 2011). Standard field
management, encompassing group production, injection/reinjection, and stream management (including fuel and sales gas accounting),
is layered on the individual well model.
Because each well solve exposes limited parallelism, these tasks are dynamically assigned to a pool of CPU threads; this dynamic assignment to cores helps ensure effective load balance even with significant disparity in the time required to solve each well. The impact of this choice on overall performance is explored below in benchmarks on real asset models.
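As an illustration of this dynamic assignment, the following sketch distributes independent well solves over a pool of CPU threads by means of an atomic work counter, so that a slow well does not stall the remaining work; solveWell and the structure of the pool are placeholders rather than the simulator's actual interface.

#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder for an individual well solve with frozen reservoir variables.
void solveWell(std::size_t /*wellIndex*/) { /* ... */ }

// Dynamically assign well solves to a pool of CPU threads: each thread grabs
// the next unsolved well from an atomic counter, so slow wells do not hold up
// the others and the load stays balanced.
void solveAllWells(std::size_t numWells, unsigned numThreads)
{
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < numThreads; ++t) {
        pool.emplace_back([&] {
            for (std::size_t w = next.fetch_add(1); w < numWells; w = next.fetch_add(1)) {
                solveWell(w);
            }
        });
    }
    for (std::thread& th : pool) th.join();
}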

Adaptive Implicit Method (AIM). In the FIM, all fluid properties and saturation functions needed for stepping from time $t$ to $t + \Delta t$
are evaluated at time $t + \Delta t$. This implies that the residual must be linearized in all state variables, resulting in a large block-sparse

linear system of equations that must be solved at each Newton iteration to update those variables. This approach has the advantage that
the time marching is unconditionally stable regardless of the size of the timestep.
In an alternative scheme, component fluxes are computed by keeping the mobilities fixed at the previous time level, t. In this simpli-
fication, only the change in pressure is present in the off-diagonal blocks of the Jacobian; all other variables may be eliminated from the
linear system, resulting in a Jacobian containing only cell pressures and well bottomhole pressures. However, with this IMPEM
scheme, the time evolution is only stable for a sufficiently small timestep, determined from the Courant-Friedrichs-Lewy (CFL) condi-
tion for each cell. For stable simulation, the timestep must be selected as the minimum stable step taken over all cells. For practical
asset simulation, this will be extremely limiting, and very long runtimes will result. Numerical dispersion is lower in IMPEM than in
FIM (Lantz 1971); hence the IMPEM scheme can possibly be superior for modeling laboratory experiments.
The AIM (Young and Russell 1993) is an attempt to combine the most favorable attributes of the FIM and IMPEM solution
schemes. Cells are sorted by the size of their maximum stable step size at the beginning of each timestep. Those with stable stepsize
larger than the present $\Delta t$ are treated explicitly for that step, whereas the remaining cells are treated implicitly. We follow the approach
of Coats (2003) to obtain an estimate of the CFL accounting for gravity and capillary pressure effects. Because well constraints may
change during Newton iterations, an accurate prediction of the stability of completed cells a priori is impractical. Therefore, all com-
pleted cells are treated as implicit. Step size is chosen so that the fraction of implicit cells is kept below a target fraction, typically
between 5 and 10%. In many cases, this allows timesteps to be taken that are comparable in size to those used in the FIM, but the cost
of each linear solve (both in terms of time and memory requirements) is significantly reduced.
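The selection logic can be sketched as follows: given per-cell maximum stable (CFL-limited) step sizes, the timestep is capped so that the number of cells requiring implicit treatment stays below the target fraction, and those cells (plus all completed cells) are flagged implicit. This is a simplified illustration and not the simulator's actual implementation.

#include <algorithm>
#include <cstddef>
#include <vector>

struct AimSelection {
    double dt;                      // selected timestep length
    std::vector<bool> implicitCell; // true where the cell is treated implicitly
};

// stableDt[c]     : maximum stable explicit step of cell c (CFL estimate, Coats 2003)
// hasCompletion[c]: true if cell c is penetrated by a well (always treated implicitly)
// dtWanted        : step proposed by the time-truncation-error control
// targetFraction  : e.g., 0.05 to 0.10
AimSelection selectImplicitCells(const std::vector<double>& stableDt,
                                 const std::vector<bool>& hasCompletion,
                                 double dtWanted, double targetFraction)
{
    const std::size_t n = stableDt.size();
    std::vector<double> sorted(stableDt);
    std::sort(sorted.begin(), sorted.end());
    const std::size_t maxImplicit =
        static_cast<std::size_t>(targetFraction * static_cast<double>(n));

    // Largest dt that keeps the implicit-cell count below the target:
    // every cell with stableDt < dt must be implicit.
    double dt = dtWanted;
    if (maxImplicit < n && sorted[maxImplicit] < dtWanted)
        dt = sorted[maxImplicit];

    AimSelection sel{dt, std::vector<bool>(n, false)};
    for (std::size_t c = 0; c < n; ++c)
        sel.implicitCell[c] = hasCompletion[c] || stableDt[c] < dt;
    return sel;
}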

Timestep Selection. At the end of each converged timestep $n$, we compute the time-truncation error for each mole number $N_i$, defined as follows (Barkve 2000; Intersect 2018):

$$\mathrm{TTE}_i^n = \Delta t^n \cdot \max_{\text{all cells}}\ \frac{1}{\displaystyle\sum_{j \in [1..n_c]} N_j^n}\left|\frac{N_i^n - N_i^{n-1}}{\Delta t^n} - \frac{N_i^{n-1} - N_i^{n-2}}{\Delta t^{n-1}}\right|. \qquad (10)$$

The TTE is essentially a measure of the second time-derivative of a variable, and to avoid excessive numerical dispersion, we limit the duration of the next timestep as follows:

$$\Delta t^{n+1} \le \Delta t^n \sqrt{\frac{\mathrm{TTE}_{\max}}{\max_{i \in [1..n_c]} \mathrm{TTE}_i}}. \qquad (11)$$

We find that using a default $\mathrm{TTE}_{\max} = 0.2$ is a good compromise between performance and accuracy in most cases. Alternative criteria
based on solution change (i.e., first time-derivative) could also be used for the same purpose.
Failure to control time-truncation errors can result in significant performance improvement, particularly in the presence of grid-
blocks with small pore volume (e.g., with local grid refinement or with dual porosity grids). However, this is a problem for two reasons:
1. The solution per se may be meaningless.
2. The solution would depend on the timestep duration.
As a consequence, when making a change to the model that results in a different timestep sequence, it would be difficult to know whether the impact on production was due to the model change or to numerical artifacts. In addition, when operating in
IMPEM or AIM mode, the timestep may be further limited by CFL considerations (Coats 2003), as previously described.
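The control of Eqs. 10 and 11 amounts to the following sketch (illustrative only; the mole-number fields at the last three time levels are assumed to be available, and the maximum over components of Eq. 11 is folded into the same pass):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Mole numbers of every cell at a given time level: n[cell][component].
using MoleField = std::vector<std::vector<double>>;

// Eq. 10: time-truncation error, taken as a maximum over all cells and components.
double timeTruncationError(const MoleField& nNow,   // N^n
                           const MoleField& nPrev,  // N^(n-1)
                           const MoleField& nPrev2, // N^(n-2)
                           double dtNow, double dtPrev)
{
    double tte = 0.0;
    for (std::size_t c = 0; c < nNow.size(); ++c) {
        double total = 0.0;                          // sum_j N_j^n (normalization)
        for (double nj : nNow[c]) total += nj;
        if (total <= 0.0) continue;
        for (std::size_t i = 0; i < nNow[c].size(); ++i) {
            const double d1 = (nNow[c][i] - nPrev[c][i]) / dtNow;
            const double d2 = (nPrev[c][i] - nPrev2[c][i]) / dtPrev;
            tte = std::max(tte, dtNow * std::fabs(d1 - d2) / total);
        }
    }
    return tte;
}

// Eq. 11: limit the next step so the estimated error stays below tteMax (default 0.2).
double limitNextTimestep(double dtNow, double tte, double tteMax = 0.2)
{
    if (tte <= 0.0) return dtNow;                    // no curvature detected
    return dtNow * std::sqrt(tteMax / tte);
}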

Remark. We have so far only considered monolithic formulations in which all variables are solved simultaneously, which is currently
the state of the art in commercial simulators. An alternative would be to consider sequential-implicit schemes, possibly augmented by a
multiscale linear solver. This approach is in the early adoption stage in commercial settings for black-oil formulation (Kozlova et al.
2016; Lie et al. 2017) and in the research stage for compositional formulation (Ganapathy and Voskov 2018; Møyner and
Tchelepi 2018).

Solution Scheme
Nonlinear Iteration Scheme. Solution to the nonlinear system of equations assembled from Eqs. 8 and 9 is obtained through a
Newton-Raphson iteration scheme (shown in Algorithm 1). At each iteration, the thermodynamic constraints are exactly satisfied,
whereas the remaining equations are linearized, resulting in a sparse block linear system of equations. All property derivatives used to


construct this effective Jacobian include the terms resulting from the change in equilibrium partitioning. This means that at every non-
linear iteration, the solutions of thermodynamic equilibrium are calculated to obtain the phase properties such as phase molar density,
viscosity, and compressibility.

Algorithm 1—Nonlinear iteration scheme

Result: $X^{t+\Delta t}$: $\left(p^{t+\Delta t},\ S_w^{t+\Delta t},\ S_g^{t+\Delta t},\ N_i^{t+\Delta t}\right)$
$X^0 = X^t$: $\left(p^t,\ S_w^t,\ S_g^t,\ N_i^t\right)$;
Evaluate well potentials for guide rate calculations;
for $k = 0, 1, \ldots$ until convergence do
    Using $X^k$:
    1. Perform pressure-temperature flash to compute hydrocarbon phase compositions, densities, viscosities, and associated derivatives.
    2. Compute water properties and derivatives.
    3. Compute pore volume and rock compressibility.
    4. Compute relative permeability, capillary pressure, and corresponding derivatives.
    5. Balance group and well target rates, and solve each well for its active constraint.
    6. Compute nonlinear residual $R(X^k)$;
    if $k > 0$, $\Delta X^{k-1}$ is small, and the material balance error computed with $R(X^k)$ is small then
        $X^{t+\Delta t} = X^k$;
        stop;
    else
        Construct Jacobian matrix $J(X^k)$;
        Solve $J(X^k)\,\Delta X^k = -R(X^k)$;
        $X^{k+1} = X^k + \mathrm{damp}(\Delta X^k)$;
    end
end

This nested iteration scheme contrasts with alternative schemes in which the phase equilibrium constraints equations and flow equa-
tions are solved simultaneously in one monolithic Newton-Raphson iteration. We believe the nested approach is well-suited to the com-
putational characteristics of GPUs. The independent flash calculations can be performed almost entirely in GPU registers and therefore
reach a significant percentage of the peak floating point performance of the processors, as shown by Gandham et al. (2016). In contrast,
solution of the sparse system resulting from linearizing the flow equations is strongly limited by memory bandwidth, and therefore
throughput per cell is many times lower. By solving the flash in an inner iteration, the cells in which the thermodynamic equilibrium
constraints pose a challenging nonlinear system in themselves (e.g., near the critical point) can be iterated to convergence at very low
cost. In this way, the number of outer global Newton iterations needed to converge the system can potentially be lowered, and overall
performance can be increased. In addition, this nested iteration scheme could be generalized to accommodate a multiphase equilibrium
formulation with changes to the code base that are primarily localized to the EOS module.
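For illustration, a serial sketch of the inner pressure-temperature flash by plain successive substitution is given below. The fugacity-coefficient routines stand in for the cubic EOS module, the K-values are assumed to be initialized from the previous iteration or from a correlation, and a stability test (not shown) is assumed to have established that the cell is two-phase; the production code is the mixed-precision GPU implementation of Gandham et al. (2016) rather than this loop.

#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

// Fugacity coefficients of every component in a phase of given composition at
// fixed (p, T); assumed to be supplied by the cubic EOS module.
using FugCoeff = std::function<Vec(const Vec&)>;

// Solve the Rachford-Rice equation sum_i z_i (K_i - 1) / (1 + beta (K_i - 1)) = 0
// for the vapor fraction beta in (0, 1) by bisection (the function is decreasing).
double rachfordRice(const Vec& z, const Vec& K)
{
    auto f = [&](double beta) {
        double s = 0.0;
        for (std::size_t i = 0; i < z.size(); ++i)
            s += z[i] * (K[i] - 1.0) / (1.0 + beta * (K[i] - 1.0));
        return s;
    };
    double lo = 1.0e-12, hi = 1.0 - 1.0e-12;
    for (int it = 0; it < 100; ++it) {
        const double mid = 0.5 * (lo + hi);
        if (f(mid) > 0.0) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}

// Two-phase flash by successive substitution: splits the feed z into liquid (x)
// and vapor (y) compositions and returns the vapor fraction. The K-values are
// updated from the fugacity-coefficient ratio until the equilibrium constraints
// of Eq. 3 are satisfied.
double flashSSI(const Vec& z, Vec K,
                const FugCoeff& phiLiquid, const FugCoeff& phiVapor,
                Vec& x, Vec& y, double tol = 1.0e-10)
{
    const std::size_t nh = z.size();
    x.assign(nh, 0.0);
    y.assign(nh, 0.0);
    double beta = 0.5;
    for (int it = 0; it < 200; ++it) {
        beta = rachfordRice(z, K);
        for (std::size_t i = 0; i < nh; ++i) {
            x[i] = z[i] / (1.0 + beta * (K[i] - 1.0));
            y[i] = K[i] * x[i];
        }
        const Vec phiL = phiLiquid(x), phiV = phiVapor(y);
        double err = 0.0;
        for (std::size_t i = 0; i < nh; ++i) {
            const double Knew = phiL[i] / phiV[i];   // from f_{i,o} = f_{i,g}
            const double d = std::log(Knew / K[i]);
            err += d * d;
            K[i] = Knew;
        }
        if (err < tol) break;                        // equilibrium reached
    }
    return beta;
}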
In Algorithm 1, the nonlinear update is damped locally using a method often referred to as the "Appleyard chop" (Naccache 1997); we perform said damping on both the saturation and the composition arrays independently, on a cell-by-cell basis (a sketch is given after this list):
• The saturation array update $\Delta S$ is damped to avoid excessive variations (0.2 by default), as well as to avoid crossing the mobility limits in a single Newton iteration. A theoretical justification for the usefulness of this damping has been proposed, for example, by Younis et al. (2010).
• The mole number array update $\Delta N$ is damped such that the maximum change in mole fraction $z_i = N_i / \sum_j N_j$ for a given component $i$ is less than a prespecified value (0.2 by default). We did not attempt to justify the usefulness of this damping analytically; we simply observe that it slightly improves convergence by limiting the probability of strong mobility changes within the same Newton iteration (although phase flips are still possible).
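A minimal sketch of this cell-by-cell damping follows (default limits of 0.2, as above); the data layout is illustrative, the mobility-endpoint check is omitted, and the mole-fraction scaling is approximate because $z_i$ is a nonlinear function of the update.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Damp the Newton update of one cell, Appleyard-chop style: saturations and
// mole fractions may not change by more than maxChange in one iteration.
// The saturation and mole-number arrays are damped independently, each by a
// single scale factor so that the update direction within the array is kept.
void dampCellUpdate(std::vector<double>& dS,        // proposed update of (Sw, Sg)
                    std::vector<double>& dN,        // proposed update of mole numbers
                    const std::vector<double>& N,   // current mole numbers
                    double maxChange = 0.2)
{
    // Saturation limiter.
    double scaleS = 1.0;
    for (double ds : dS)
        if (std::fabs(ds) > maxChange)
            scaleS = std::min(scaleS, maxChange / std::fabs(ds));

    // Mole-fraction limiter: bound the change of z_i = N_i / sum_j N_j.
    double total = 0.0, totalNew = 0.0;
    for (std::size_t i = 0; i < N.size(); ++i) { total += N[i]; totalNew += N[i] + dN[i]; }
    double scaleN = 1.0;
    if (total > 0.0 && totalNew > 0.0) {
        for (std::size_t i = 0; i < N.size(); ++i) {
            const double dz = std::fabs((N[i] + dN[i]) / totalNew - N[i] / total);
            if (dz > maxChange) scaleN = std::min(scaleN, maxChange / dz);
        }
    }

    for (double& ds : dS) ds *= scaleS;
    for (double& dn : dN) dn *= scaleN;
}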

Linear Solution Scheme. Algorithm 1 requires, at each Newton iteration, the solution of $J\,\Delta X = -R$. We use a flexible generalized minimal residual (GMRES) iterative solution scheme (Saad and Schultz 1986) to solve equations of the form $Ax = b$, where $A$ is a sparse, nonsymmetric matrix, $x$ is the solution vector, and $b$ is the right-hand side. The GMRES method generates a sequence of successively better approximate solution vectors $x_0, \ldots, x_i$ reducing the residual $r_i = b - A x_i$, with $x_i$ built by minimizing $\|r_i\|$ in the space generated by $i$ orthonormal vectors $Q_i = (q_0, \ldots, q_{i-1})$. To move to the next iteration $i+1$, a new vector $q_i$ can be built using $r_i$ and then added to $Q_i$: the increase in the dimension of the orthonormal basis moves $x_i$ closer and closer to the solution while decreasing $r$. In principle, the storage of the matrix $Q_i$ and of the previous solution vectors requires the allocation of a number of vectors equal to twice the maximum number of allowed linear iterations, often called the GMRES stack.
By default, we allow a maximum of 20 linear iterations. It is optionally possible to use a restarted flexible GMRES algorithm with
reduced stack, reducing memory requirements at the potential expense of slower convergence; none of the test cases presented in this
paper use this option.
When used in combination with other approximate solution methods, known as preconditioners, the GMRES method can be made
to converge very rapidly. With preconditioning, the original linear system is modified by premultiplication by another matrix, $M^{-1}$, as $M^{-1}Ax = M^{-1}b$, where $M^{-1}$ is some approximation to the inverse of $A$. The product matrix $M^{-1}A$ is never explicitly formed; rather, $M^{-1}$ is only applied to linear residual vectors, $r_i$, to return approximate solution vector updates.
In reservoir simulation, a marked advance in the performance and robustness of linear solvers was made with the introduction of the
constrained pressure residual method (Wallis 1983; Wallis et al. 1985; Cao et al. 2005). The method is motivated by the fact that the
pressure degree of freedom has a parabolic form in which local variations have a long-range impact, whereas the saturation degrees of
freedom have hyperbolic character with short-range effects. Because of these different characteristics, different preconditioning
schemes are appropriate for these two variable sets. The constrained pressure residual preconditioner recognizes this distinction by
introducing two stages:


$$\begin{cases} x = P\,M_1^{-1}\left(R\,r_i\right) \\ \delta x_{\mathrm{CPR}} = x + M_2^{-1}\left[r_i - A\,x\right], \end{cases} \qquad (12)$$

where $\delta x_{\mathrm{CPR}}$ is the approximate solution update provided by the constrained pressure residual preconditioner; $M_1^{-1}$ and $M_2^{-1}$ are the first- and second-stage preconditioners, respectively; $P$ is a prolongation operator that projects the pressure-only solution to the full space of unknowns; and $R$ is a restriction operator that reduces the full right-hand side to a single value per cell.
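One application of Eq. 12 can be sketched as follows; the operator and preconditioner interfaces are placeholders for the simulator's GPU data structures, with the first stage corresponding to the AMG solve on the restricted pressure system and the second stage to one of the preconditioners described below.

#include <cstddef>
#include <vector>

using Vector = std::vector<double>;

// Placeholder interfaces for the operators appearing in Eq. 12.
struct Operator {                      // sparse matrix-vector product
    virtual Vector apply(const Vector&) const = 0;
    virtual ~Operator() = default;
};

struct Preconditioner {                // approximate inverse, e.g., AMG or ILU(0)
    virtual Vector apply(const Vector&) const = 0;
    virtual ~Preconditioner() = default;
};

// One application of the constrained-pressure-residual preconditioner (Eq. 12):
//   x      = P * M1^{-1} * (R * r)
//   dx_CPR = x + M2^{-1} * (r - A * x)
Vector applyCPR(const Operator& A,          // full Jacobian
                const Operator& R,          // restriction to the pressure system
                const Operator& P,          // prolongation back to all unknowns
                const Preconditioner& M1,   // first stage: AMG on the pressure matrix
                const Preconditioner& M2,   // second stage: e.g., (diagonal) ILU(0)
                const Vector& r)            // linear residual
{
    Vector x = P.apply(M1.apply(R.apply(r)));     // pressure correction, prolonged
    const Vector Ax = A.apply(x);
    Vector res(r.size());
    for (std::size_t i = 0; i < r.size(); ++i)    // r - A x
        res[i] = r[i] - Ax[i];
    Vector dx = M2.apply(res);
    for (std::size_t i = 0; i < dx.size(); ++i)   // x + M2^{-1}(r - A x)
        dx[i] += x[i];
    return dx;
}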
In the first stage, a pressure-only equation is constructed from the full Jacobian by taking a weighted average of the pressure deriva-
tives in each mass conservation equation and then discarding the saturation and composition derivatives. This equation has a parabolic
character, but because of spatial variation in permeability, the coefficients connecting neighbors can be highly heterogeneous. Algebraic
multigrid methods have been shown to be extremely effective in solving equations with these characteristics (Brandt et al. 1985). We
used the GPU Algebraic Multigrid PACKage (GAMPACK) (Esler et al. 2012). The operation of GAMPACK itself has two phases: a
setup phase, in which a hierarchy of coarsened pressure matrices is constructed, and a solution phase, in which this hierarchy is used to
compute an approximate solution to the pressure system. By default, the algebraic multigrid hierarchy is reconstructed at the beginning
of each Newton iteration. Heuristics aiming at either reducing the reconstruction frequency or alternating between full and partial
reconstruction have been proposed (Wobbes 2014; Intersect 2018), but we have not attempted to implement them yet.
The second stage includes all remaining reservoir degrees of freedom, and we use different preconditioners depending on the problem
at hand. These preconditioners have been highly optimized for execution on GPUs. In particular, specific treatments to reduce memory
requirements in the fully implicit case have been devised; their description would go beyond the scope of this paper. It is nevertheless

important to mention here that our second stage is based on one of two foundations: isotropic and anisotropic preconditioning.
The incomplete factorization method known as ILU(0) and its time- and memory-saving variant, the diagonal ILU(0), are examples of isotropic preconditioners. These are used in conjunction with multicolored reordering; each cell in the reser-
voir is assigned an integer (called a color in analogy with map-coloring problems in mathematics), so that connected cells have different
colors. Factorization and solution can then occur in parallel, with all cells of the same color solved simultaneously because there are no
dependencies between them.
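A greedy, serial sketch of such a multicolored reordering is given below; the production implementation uses a parallel coloring on the GPU, so this is only meant to illustrate the invariant (no two connected cells share a color).

#include <cstddef>
#include <vector>

// Assign a color to every cell so that no two connected cells share a color.
// All cells of one color can then be factorized/solved simultaneously.
std::vector<int> greedyColoring(const std::vector<std::vector<std::size_t>>& adjacency)
{
    const std::size_t n = adjacency.size();
    std::vector<int> color(n, -1);
    std::vector<char> used;                        // colors taken by colored neighbors
    for (std::size_t c = 0; c < n; ++c) {
        used.assign(n, 0);
        for (std::size_t nb : adjacency[c])
            if (color[nb] >= 0) used[static_cast<std::size_t>(color[nb])] = 1;
        int col = 0;
        while (used[static_cast<std::size_t>(col)]) ++col;   // smallest unused color
        color[c] = col;
    }
    return color;
}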
The massively parallel nested factorization (Tchelepi and Zhou 2013) is an example of an anisotropic preconditioner, the principle
of which is to take advantage of preferential flow directions in the model. Instead of using multicolored reordering, this type of precon-
ditioner can be used by ordering cells in vertical pillars or by following a maximum transmissibility path as discussed for example by
Fung and Dogru (2008).

Convergence Controls. The default convergence criterion for the linear solver is to achieve a linear residual reduction of $10^{-3}$ in the $L^2$ norm. The nonlinear convergence criterion is based on the change of the linearized variables. We use defaults of $|\delta p_h| \le 0.1$ atm and $|\delta V_i| \le 0.01\,\mathrm{PV}$, where we define $V_i$ as the volume occupied by component $i$:

$$\begin{cases} \delta V_i = \Delta N_i \left(\bar{\nu}_{gi} + \bar{\nu}_{oi}\right), & \forall i \in [1, n_h] \\ \delta V_w = \Delta N_w / b_w. \end{cases} \qquad (13)$$

$\bar{\nu}_{\alpha i}$ is the partial molar volume of component $i$ in phase $\alpha$, defined as the partial derivative of the volume of phase $\alpha$ with respect to the total (i.e., across all phases) number of moles of component $i$:

$$\bar{\nu}_{\alpha i} = \frac{\partial V_\alpha}{\partial N_i}. \qquad (14)$$
We furthermore control that the global relative material balance error is less than $10^{-6}$. This approach provides a convergence metric compatible with well-known legacy simulators. We prefer a criterion based on component volume rather than phase saturations
because in compositional runs, it is possible to encounter flips in phase identification in the supercritical region, in which case changes
in saturation are not meaningful. In contrast, component volumes are not sensitive to phase identification.
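A sketch of these nonlinear convergence checks follows (Eq. 13); the per-cell data are assumed to be available from the property evaluation, the unit handling is simplified, and the global material-balance check is omitted.

#include <cmath>
#include <cstddef>
#include <vector>

struct CellUpdate {
    double dPh;                    // change in hydrocarbon reference pressure (atm)
    std::vector<double> dN;        // change in hydrocarbon mole numbers
    double dNw;                    // change in the water mole number
    std::vector<double> nuGas;     // partial molar volumes in gas (Eq. 14)
    std::vector<double> nuOil;     // partial molar volumes in oil (Eq. 14)
    double bWater;                 // water molar density
    double poreVolume;
};

// Nonlinear convergence test: |dp_h| <= 0.1 atm and |dV_i| <= 0.01 PV in every cell.
bool isConverged(const std::vector<CellUpdate>& cells,
                 double dpTol = 0.1, double dvTolFraction = 0.01)
{
    for (const CellUpdate& c : cells) {
        if (std::fabs(c.dPh) > dpTol) return false;
        const double dvTol = dvTolFraction * c.poreVolume;
        for (std::size_t i = 0; i < c.dN.size(); ++i) {
            const double dV = c.dN[i] * (c.nuGas[i] + c.nuOil[i]);   // Eq. 13
            if (std::fabs(dV) > dvTol) return false;
        }
        if (std::fabs(c.dNw / c.bWater) > dvTol) return false;       // water volume change
    }
    return true;
}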

Comparison with the Black-Oil Formulation


We have already described our black-oil formulation in Esler et al. (2014), although the code has evolved significantly since then
(Vidyasagar et al. 2019). There are two approaches to developing a black-oil simulator. The first is to realize that a black-oil model is
equivalent to a two hydrocarbon-component compositional model; most of the code is therefore common to both fluid formulations,
including the Jacobian assembly and the linear solution strategy. Only the EOS engine needs to be replaced by a table lookup. This is
the approach used, for example, in the Eclipse 300 (Eclipse 2017), the Intersect (2018), and the Nexus (2014) simulators.
The second approach is to write the equations in black-oil form, that is to say as a set of three conservation equations for the stock-
tank volumes of gas, oil, and water phases. This is for example the Eclipse 100 (Eclipse 2017) or the open porous media simulator
(Rasmussen et al. 2021) approach. Our simulator is a single unified code with both black-oil and compositional capabilities, yet the
black-oil formulation is implemented in black-oil form. The preprocessing, grid generation, rock-fluid engine, first-stage linear solver
preconditioner, and field-management facilities are common to both formulations; the fluid engine, well model, and second-stage linear
solver preconditioner are specialized. This is a consequence of the development history of our code, but in hindsight, we realize that it
also allows for formulation-specific optimizations that may enhance performance.

Code and Hardware Considerations


The Code. The code base comprises a combination of standard C++17 and Nvidia CUDA C++ extensions. These extensions support
the definition and launching of GPU kernel functions. The code makes substantial use of object-oriented design and templates to
manage complexity, facilitate code reuse, and allow aggressive compiler optimizations. It targets both Windows and Linux operating
systems and can be compiled to support any of the CPU architectures for which CUDA is supported, including x86, OpenPOWER, and
ARM. All benchmarks presented in this paper are performed on an x86 Linux platform. The code supports execution on multiple
GPUs, using standard domain decomposition methods. Sparse halo exchanges are accomplished either through a message-passing inter-
face library or through direct GPU-to-GPU communication through the PCI Express bus or NVLink interconnects. The exchanges take
advantage of GPUDirect RDMA transfer when available, which allows data to be transferred directly between GPUs over an intercon-
nect fabric without passing through CPU memory. Input is provided in industry-standard formats, including both legacy ASCII formats,


as well as newer standards, such as RESQML (King et al. 2012). Scalar time series and 3D output are produced in legacy binary formats
for compatibility with existing workflows and in standard comma-separated value format.

Hardware Used for Preparation of This Paper. The tests presented in this paper have been performed with four generations of data center GPUs and one desktop GPU, the characteristics of which are summarized in Table 1.

Generation   Model         Year   Cores   DP GFLOPS   Capacity (GB)   Bandwidth (GB/sec)
Kepler       Tesla K80     2014   2,496   1,864       12              240
Pascal       Tesla P100    2016   3,584   4,670       16              732
Volta        Tesla V100    2017   5,120   7,800       32              900
Ampere       Nvidia A100   2020   6,912   9,700       40              1,555
Turing       RTX 2080Ti    2018   4,352   420         11              616

Table 1—Characteristics of four generations of Nvidia data center GPUs and a desktop GPU used for preparation of this paper. Note that the Tesla K80 card contains two distinct GPUs and memory buffers on one board. When we refer to the K80, we intend only a single GPU (i.e., half the board).

Test Model 1: Guava Sector—Scaling with the Number of Components


The Model. Guava is a giant greenfield comprising a gas cap and an oil rim. For the purpose of this test, we will consider a sector
derived from the full-field model in which grid properties and well trajectories are left unchanged but both schedule and fluid model
have been modified.
The sector model comprises 149,000 active cells (66 × 45 × 59 corner-point grid), with four horizontal producers and four
horizontal injectors. The simulation is performed for 30 years, following a water-alternating-gas schedule with 1-year periods. To keep
field management to a minimum, wells are controlled individually with rate and bottomhole pressure constraints. Fig. 1 illustrates
the geometry.

Fig. 1—(Left) Ternary (gas, oil, and water) view of the Guava sector model, magnified 20X in the vertical direction. (Right) Phase
envelope of the synthetic PVT used for the study.

The fluid is described using a synthetic Peng-Robinson EOS with two pseudocomponents, C1 and C7P2, the properties of which are
similar to those of the ARCO fluid description of the third SPE comparative solution project (Kenyon and Behie 1987); the model tem-
perature is arbitrarily set to 140°F, and the gas-oil contact pressure is set at 5,066 psi. A key advantage of using two hydrocarbon com-
ponents is that the model can be converted to black oil without loss of information. Fig. 1 shows the corresponding binary phase
envelope in stock-tank volumes, indicating that the fluid is far from being miscible (the miscibility pressure where $R_s = 1/R_v$ is greater
than 8,000 psi); it is therefore an “easy” fluid for a compositional reservoir simulator.
A common issue in the construction of a compositional reservoir model is the determination of the number of pseudocomponents
that should be used to preserve the correct PVT properties of the fluid while retaining good performance. It is instructive, then, to
explore how both runtime and memory use scale while increasing the number of components. To isolate the strict algorithmic scaling
from issues related to the physics of the problem, we have attempted to devise a series of tests that maintain a nearly identical physical
solution. This is accomplished by starting with the original two-component EOS and “cloning” the components successively, resulting
in models with 2 to 30 hydrocarbon components. The initial composition and injection streams are then chosen to yield equal mole frac-
tion in all component copies. We verified that, when simulated, these models yielded identical production within expected numerical tolerances.
We performed simulations with both FIM and AIM formulations, on both Nvidia Tesla V100 and GeForce RTX 2080Ti GPUs; in
all cases, four CPU threads were used for the well solves. The V100 is a server card with bandwidth of 900 GB/sec and high-
performance double precision capability, whereas the RTX2080Ti is a desktop card with bandwidth of 616 GB/sec and lower-
performance double precision capability (Table 1).

Fully Implicit Results. Fig. 2 first shows the plot of elapsed time and required GPU memory using the V100 card, normalized to the
runtime of the equivalent black-oil model, after verification that the total numbers of linear and nonlinear iterations were almost unaf-
fected by the number of components. Here, the FIM formulation was used. We observe that both runtime and memory requirements of
the two-component model are approximately 1.5× their black-oil counterpart. As the number of components increases, both runtime
and memory requirements increase superlinearly, although in the case of memory, the trend is almost linear.


Fig. 2—Test Model 1, fully implicit runs. (Left) Scaling of the simulator runtime and required on-chip memory as a function of the
number of hydrocarbon components, normalized to the figures of the equivalent black-oil model (using a single V100 GPU with 4
CPU threads for the wells). Quadratic trendlines are also shown. (Right) Ratio of the runtimes of simulations performed on Nvidia
RTX 2080Ti and V100.

Memory requirements for the cases of Fig. 2 are 0.78 GB with 2 components, 1.55 GB with 10 components, and 4.31 GB with 30
components. When the number of components exceeds approximately 10, memory use plateaus at approximately 1 kB/cell/component.
Note that the black-oil model needs 0.47 GB (i.e., also 1 kB/cell/component, including water).
A frequent consideration in the provisioning of hardware for simulation is the degree to which the computation is limited by
floating-point throughput or memory bandwidth. A common figure of merit for a given kernel of execution is the arithmetic (or compu-
tational) intensity, defined as the number of arithmetic operations performed per fetch from dynamic random-access memory. Dense linear algebra (e.g., matrix factorization) has high arithmetic intensity, whereas sparse linear solves generally
have much lower arithmetic intensity. Because simulation entails a very large number of different execution kernels with widely vary-
ing intensities, it is difficult to estimate a priori the degree of intensity for the program as a whole. This difficulty in estimation is com-
pounded by the need to account for hardware-specific details such as cache hit rate and register spilling.
However, a more empirical approach may yield insight by comparing relative execution performance on different processors with
very different ratios of floating-point performance to memory bandwidth. Notably, GPUs primarily targeting graphics are manufactured
with greatly reduced double-precision performance, but comparable bandwidth. By comparing normalized runtime between these
GPUs, we may infer the degree to which the double-precision throughput limits performance. Fig. 2 also shows how the relative perfor-
mance of the RTX 2080Ti and the V100 evolves with the number of components. The ratio is approximately 1.5 for both the black-oil and the two-component compositional models; this is the ratio of the GPU memory bandwidths (616 GB/sec vs. 900 GB/sec). As the number of components increases, the cost of EOS computations (namely, stability and flash, which we perform in mixed precision) increases, and we
plateau at a ratio of approximately 1.9, reflecting the better double-precision performance of compute-optimized GPUs. We may thus
infer that at low component counts, simulation performance is firmly bandwidth-bound, but at large component counts, a relatively
small part of the overall computation may become compute-bound.

Adaptive Implicit Results. Fig. 3 shows the runtime/(nonlinear iterations) and memory requirements ratios between AIM and FIM
runs, using the V100 card. The rationale behind using runtime/(nonlinear iterations) rather than the raw runtime here is that AIM and
FIM runs take a different sequence of timesteps.

Fig. 3—AIM/FIM scaling of the simulator runtime and required on-chip memory as a function of the number of hydrocarbon compo-
nents (using a single V100 GPU with 4 CPU threads for the wells) for Test Model 1.

We first observe that the runtime benefit of AIM is not very significant; AIM and FIM have similar runtimes with 2 components,
and the ratio plateaus at slightly greater than 0.8 as the number of components increases. The memory benefit on the other hand is
more significant, with a ratio of approximately 0.65 with 2 components down to 0.35 with 30 components. Our fully implicit solver
uses memory very efficiently; hence, most of the memory savings observed when moving from FIM to AIM arise from the lower
memory footprint of the GMRES stack (when the percentage of implicit cells is small, AIM needs 1 variable/cell vs. $n_c + 3$ in FIM).
Other significant consumers of GPU memory, including the algebraic multigrid hierarchy, are common between FIM and AIM, and the
memory use advantage provided by AIM is limited by these common elements.
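As a rough illustration of the stack-size argument only (the Jacobian, AMG hierarchy, and property arrays are not included), the following sketch estimates the GMRES stack footprint under the default of 20 retained iterations, two stored vectors per iteration, and 8-byte reals; the resulting figures are order-of-magnitude estimates rather than the measured memory breakdown.

#include <cstdio>

// Approximate GMRES stack size: 2 * maxIters vectors, each holding
// unknownsPerCell doubles for every cell.
double gmresStackGB(double cells, int unknownsPerCell, int maxIters = 20)
{
    return 2.0 * maxIters * cells * unknownsPerCell * 8.0 / 1.0e9;
}

int main()
{
    const double cells = 149000.0;            // Guava sector model
    const int nc = 31;                        // 30 hydrocarbon components + water
    std::printf("FIM stack (nc + 3 unknowns/cell): %.2f GB\n", gmresStackGB(cells, nc + 3));
    std::printf("AIM stack (~1 unknown/cell)     : %.2f GB\n", gmresStackGB(cells, 1));
    return 0;
}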


Memory requirements for the adaptive implicit cases of Fig. 3 are 0.52 GB with 2 components, 0.74 GB with 10 components, and
1.48 GB with 30 components. Contrary to the fully implicit case, here the memory cost/component keeps decreasing as the number of
components increases.
To better understand the modest reduction in runtime observed between FIM and AIM in this case, Fig. 4 shows the runtime break-
down of the FIM and AIM runs, considering the three most significant simulation elements: (1) the EOS, essentially corresponding to
the stability and flash calculations plus property evaluations; (2) the linear solver setup, essentially corresponding to the construction of
the algebraic multigrid hierarchy and second-stage preconditioner block inversions; and (3) the linear solve itself. “Other” refers to
remaining tasks such as computing the CFL (in AIM), checking convergence, performing well solves, constructing the well Jacobian
parts, and processing the input/output data. We observe that as the number of components increases, the relative cost of EOS calcula-
tions increases, whereas the relative cost of the solver setup decreases because there is only a single pressure variable. The relative cost
of the linear solve is reduced by a factor of 2 when moving from FIM (≈40%) to AIM (≈20%); this is consistent with the runtime
ratio plateau slightly greater than 0.8 observed in Fig. 3.

Fig. 4—Runtime breakdown of Test Model 1 as a function of the number of hydrocarbon components, for the most significant sim-
ulation elements, using a single V100 GPU with 4 CPU threads for the wells. (Left) FIM. (Right) AIM.

Model 2: Tiled SPE5—Experiments on a Synthetic Model


The Model. As noted in the introduction, optimal performance on GPUs requires exposing sufficient parallelism to keep all GPU cores
oversubscribed with sufficient tasks. This oversubscription (which is closely related to but not identical to simultaneous multithreading)
is the primary means that GPUs use to hide the latency of memory transactions. In the context of reservoir simulation, the amount of
parallelism strongly depends on the number of cells in the model. For small models, there may be insufficient parallelism to fully satu-
rate GPU performance. To quantify this concept, we begin with a highly regular, extremely simple model based on a refined version of
the Fifth Comparative Solution Project (SPE5) (Killough and Kossack 1987). We recognize that this is a “toy” model and not represen-
tative of a complex asset model; however, it serves as a convenient way to explore algorithmic scaling with system size. To accomplish
this, we tile the model a number of times horizontally, cloning all cell properties, wells, and well controls for each copy. We move the
injector and producer from the corners to be aligned with the x axis to increase symmetry. Because of this symmetry, each copy should
behave identically in a series of models with a progressively larger number of cells, with all tiles remaining fully connected; it is nevertheless important not to isolate the tiles from their neighbors, so that the condition number of the pressure matrix increases with the number of tiles. We then simulate this series of models, varying the number of GPUs used, and measure the overall runtime.

Weak and Strong Scalability. Fig. 5 shows the results of this exercise by plotting simulation throughput, in units of cell·days/sec.
The left plot includes four curves, corresponding to the use of one, two, four, or eight Nvidia Tesla A100 GPUs in fully implicit mode.
The corresponding throughput for adaptive implicit simulation is shown on the right plot, which includes the fit lines for fully implicit
simulations for comparison. Because AIM simulations use less memory, it was possible to run larger models than with FIM for the
same number of GPUs. Furthermore, simulation runtime was reduced for the same model size, leading to higher throughput.

Fig. 5—Simulation throughput on the refined and tiled SPE5 model described in the text. Throughput increases with model size, because more parallelism can be exploited in models with greater numbers of cells. Trendlines were generated by fitting a cubic polynomial for runtime vs. cell count and then computing the throughput from the interpolated runtime as T = cell count × simulated days/runtime. (Left) Fully implicit. (Right) Adaptive implicit compared to the fully implicit fits.


We observed that despite the large disparity in model sizes, the aggregate number of nonlinear and linear iterations required during
the simulation remains essentially unchanged (plots not shown here). This demonstrates the well-known property that optimal precondi-
tioners incorporating multigrid methods can give a rate of convergence that is largely independent of condition number and system
size. This observation further allows inference of some characteristics of both the weak and strong scaling of the approach.
It can be readily seen that aggregate simulation throughput increases with model size; this is expected because larger models provide
more parallelism, which allows more hiding of memory latency and a more complete use of the GPU’s resources. The asymptotic
throughput limit for a given GPU count indicates a scaling that is not very far from ideal weak scaling.
Strong scaling measures the degree to which increasing processor counts with a fixed problem size reduces the time to solution. As
is almost universally the case for reservoir simulators, strong scalability is more limited than weak, with performance gains diminishing
with each doubling of resources. Nonetheless, although performance is not directly proportional to GPU count for fixed size, it does
continue to yield appreciable gains up to eight GPUs. Because this corresponds to more than 50,000 GPU cores, this scaling is
still notable.

Running Multiple Models on the Same GPU. From Fig. 5, it appears that model sizes of approximately 500,000 cells/GPU are
required to reach 70% of the maximum GPU throughput, and 1 million cells/GPU are required to reach 90% (on A100 cards). Upon
first consideration, it would appear that a small model cannot provide enough parallelism to make optimal use of the available comput-
ing power, and CPU simulation may provide a better option.
However, small models are most often used in the context of multirealization workflows. In these workflows, additional parallelism
can be provided by running multiple, independent realizations simultaneously on one GPU. Nvidia provides a daemon that facilitates this
by dynamically scheduling kernels launched from multiple running processes to the same GPU. To test this concept, we take the smallest
SPE test case from Fig. 5 with 110,000 cells. We then run 1 to 10 copies of the simulation simultaneously on one A100 GPU and measure
the total simulation time required. We then compute an aggregate throughput using the same cells·days/sec metric used in Fig. 5. The result is shown in Fig. 6. As can be seen, this throughput appears to asymptote to just more than 160 million cells·days/sec, which is close to the asymptotic throughput for a single, large model on the same single GPU, shown in Fig. 5 to be ≈195 million cells·days/sec. At least in this simplified example, it appears that it is possible to make very efficient use of large data center GPUs to run ensembles of small reservoir models, even when each simulation case does not have sufficient parallelism to saturate GPU performance.
More details on the application of the Nvidia multiprocess service to reservoir simulation can be found in Gandham et al. (2021).
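For illustration only, the experiment reduces to launching several identical processes against the same GPU with the MPS daemon active and timing until the slowest copy finishes; in the sketch below the simulator command line is a placeholder, not the actual product interface:

    import subprocess
    import time

    def aggregate_throughput(n_copies, cells_per_model, simulated_days, cmd):
        # Launch n_copies of the same simulation against one GPU (the MPS
        # daemon is assumed to be running) and time the slowest copy.
        start = time.time()
        procs = [subprocess.Popen(cmd) for _ in range(n_copies)]
        for p in procs:
            p.wait()
        elapsed = time.time() - start
        # Aggregate throughput in cells*days/sec across all copies.
        return n_copies * cells_per_model * simulated_days / elapsed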

Fig. 6—Aggregate simulation throughput when running multiple copies of the base 110,000 cell model simultaneously on one
GPU, using the FIM.

To apply this method in practice, care must be taken to limit the number of simultaneous runs to ensure that the limited GPU
memory is not exhausted, resulting in job failure. Although several approaches are possible, we use a simple queuing utility that orches-
trates the execution of a list of models provided by the user. The user also provides an estimate for the amount of memory required by
each job, which can be obtained easily by checking the output of a previous run. As can be seen in Fig. 6, greater than 90% of maximum
throughput is achieved when the GPU’s memory is half full. A user can therefore err on the conservative side by overestimating the job
memory requirement with very little loss of efficiency and thereby avoid the possibility of job failure caused by memory exhaustion.
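A minimal sketch of such a queuing utility, assuming the user supplies per-job memory estimates and the GPU memory capacity; the scheduling policy shown (waiting on the oldest running job when memory is exhausted) is a simplification, not our actual utility:

    import subprocess

    def run_queue(jobs, gpu_mem_gb):
        # jobs: list of (command, estimated_mem_gb) tuples provided by the user.
        # Launch jobs in order without exceeding the available GPU memory.
        running, used = [], 0.0
        for cmd, mem in jobs:
            while running and used + mem > gpu_mem_gb:
                proc, m = running.pop(0)   # wait for the oldest job to free memory
                proc.wait()
                used -= m
            running.append((subprocess.Popen(cmd), mem))
            used += mem
        for proc, _ in running:
            proc.wait()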

Model 3: Mango—Experiments on a Realistic Model with a High Cell Count


The Model. Although sector models and synthetic test cases can provide a clean experimental test bed for investigating computational
scaling characteristics, it bears considering whether these results will carry over to more complex examples more reflective of common
industrial use. In this section, we consider a first example of a complex asset model.
The Mango field is a deepwater, undersaturated oil asset characterized by a complex structural framework. The baseline corner-point geological model is built on a 284 × 207 × 206 grid (horizontal block surface of 100 × 100 m, and average thickness of 8 ft), of
which 5.7 million cells are active (Fig. 7). This model is described in Cominelli et al. (2014); it is a greenfield development, spanning
15 years, based on an immiscible water-alternating-gas injection simulated using a black-oil formulation. The field is depleted using
seven production wells and four injection wells.
For the purpose of this paper, we replaced the black-oil PVT with the six-component EOS of the fifth SPE comparative solution pro-
ject (SPE5) (Killough and Kossack 1987); the initial pressure at datum (4,000 psi) and both the initial and injected gas compositions are as per
the SPE5 model. This choice of PVT was made to investigate the simulator performance in a multicontact miscible setting, which is
challenging for compositional reservoir simulators.


Fig. 7—The Mango model.

Strong Scalability. We first performed a strong scalability test by running the Mango model on 2 to 32 V100 GPUs, systematically
using four CPU threads for the well solves; the results are shown in Fig. 8 using our FIM formulation. We notice from the plot of field
gas/oil ratio that the solution is not affected by the GPU count, at least to within a visible margin. We also notice that the impact on the
cumulative number of linear and Newton iterations is small, indicating that our default linear and nonlinear convergence criteria are
appropriate and our solvers are robust. Strong scalability appears to be approximately 1.75 from 2 to 4 GPUs and approximately 1.5
from 4 to 8 GPUs.
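For reference, the quoted strong-scaling factors are simply ratios of elapsed times between successive GPU counts; the timings in the short example below are made up to reproduce factors of approximately 1.75 and 1.5 and are not the measured values:

    def scaling_factors(elapsed_by_gpus):
        # elapsed_by_gpus: dict mapping GPU count -> elapsed time (seconds).
        counts = sorted(elapsed_by_gpus)
        return {"%d->%d" % (a, b): elapsed_by_gpus[a] / elapsed_by_gpus[b]
                for a, b in zip(counts, counts[1:])}

    # Illustrative (made-up) timings:
    print(scaling_factors({2: 1000.0, 4: 570.0, 8: 380.0}))  # ~1.75, ~1.5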

Fig. 8—Scalability experiments on the Mango model, varying the number of V100 GPUs used (numbers in the legends), while hold-
ing the number of CPU threads for the well solves constant (four threads). The plots illustrate the field gas/oil ratio (GOR), cumula-
tive nonlinear (Newton) and linear iterations, and elapsed time. The FIM formulation is used.

Performance across GPU Generations. Fig. 9 shows a plot of Mango runtime on various data center GPU configurations vs. the
inverse of the aggregate GPU memory bandwidth of that configuration. Ideal scaling with bandwidth is represented by the dashed lines.
With the Mango model, the linear solver time, shown in blue, clearly dominates overall run time (here, contrary to Fig. 4, the “solver”
entry includes the setup). For this component of computation, the runtime correlates well with the inverse of memory bandwidth. The
correlation is not perfect, however. For example, the 4 × P100 configuration seems to have slightly subpar performance relative to its
nominal bandwidth, whereas the V100 performance slightly exceeds expected performance. It should be noted that the transition from
Kepler to Pascal generations involved a transition of memory technology (from GDDR5 to HBM2), and the first-generation memory
controller in the Pascal generation could not reach peak bandwidth in practice. In contrast, the second generation memory controllers in
the Volta generation greatly improved memory bandwidth use. The time required for well handling in the Mango model is almost negli-
gible, because the ratio of cells to wells is approximately 450,000.
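The dashed ideal-scaling lines of Fig. 9 follow from the assumption that the runtime of a bandwidth-bound workload scales with the inverse of peak memory bandwidth; as a sketch (bandwidth figures would be taken from vendor specifications, and the function name is ours):

    def ideal_runtime(ref_runtime_s, ref_bandwidth_gbs, bandwidth_gbs):
        # Predicted runtime for a memory-bandwidth-bound workload, relative
        # to a reference configuration (e.g., the K80 run used in Fig. 9).
        return ref_runtime_s * ref_bandwidth_gbs / bandwidth_gbs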

Refining the Model. We also considered a refined version of the Mango model with 21.5 million active cells and performed the same
study, varying the number of GPUs between 8 and 64; the results are shown in Fig. 10. As before, the reported field gas/oil ratio is inde-
pendent of the number of processors. The cumulative numbers of Newton and linear iterations increased by roughly a factor of two,
which may be expected with grid refinement (in particular because of the control on time-truncation error; see Eq. 10; the simulator
elects to take shorter timesteps even if it would be able to converge with longer ones). As before, the numbers of linear and nonlinear
iterations show little variation with GPU count. The performance continues to scale through 64 GPUs, with a parallel efficiency very
similar to that of the 5.7 million cell case.

Fig. 9—Scaling of simulation time with the inverse of memory bandwidth across four generations of data center Nvidia GPUs, run-
ning the Mango model with the FIM formulation. Dashed lines represent ideal scaling of simulation throughput with peak memory
bandwidth, using the runtime from the K80 as the reference point.

Fig. 10—Same scalability experiments as in Fig. 8 for the Mango model with 21.5 million active cells. GOR = gas/oil ratio.

Model 4: Kashagan—Running a Complex Real Asset Model


The Model. The Kashagan field is a strongly undersaturated light oil asset located below the Caspian Sea. It is produced by natural
depletion, with pressure support from partial reinjection of produced gas in first-contact miscible conditions.
The geologic setting, a huge carbonate platform surrounded by a naturally fractured rim, is modeled using a dual porosity and dual
permeability formulation. The tests are performed using the most recent model developed by Eni for internal evaluation purposes,
deriving from the work presented by Panfili et al. (2012); it contains an aggregate of 1.4 million active cells and several hundred
wells, with local grid refinement around the gas injectors. An eight-component Peng-Robinson EOS description similar to that of Coats
et al. (2007) is used. The simulation includes 3 years of history match, followed by 131 years of forecast. Fig. 11 illustrates the
reservoir structure.
The field must be operated under the constraint that all produced gas exceeding sales contracts be reinjected. This implies that over
the lifetime of the field, either the gas injection rate depends on the current rate of production, or alternatively, the production must be
throttled to respect the maximum injectivity. Which constraint is active will depend on the number of wells present and the degree of
pressure depletion at a given time. This is implemented in an automated group control logic that determines whether productivity or
injectivity is most limiting at each timestep.
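A heavily simplified sketch of this decision logic, purely illustrative of the rule stated above (variable names and units are ours, not the simulator's group-control implementation):

    def group_gas_control(production_potential_gas, sales_gas_rate, max_injection_rate):
        # All rates in consistent surface-gas units. Returns (production target,
        # injection target) consistent with reinjecting all gas in excess of sales.
        reinjection_demand = max(production_potential_gas - sales_gas_rate, 0.0)
        if reinjection_demand <= max_injection_rate:
            # Injectivity is sufficient: produce at potential, inject the excess gas.
            return production_potential_gas, reinjection_demand
        # Injectivity is limiting: throttle production so that the excess gas
        # matches what the injectors can accept.
        return sales_gas_rate + max_injection_rate, max_injection_rate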


Fig. 11—The Kashagan structure, after Panfili et al. (2012).

Performance across GPU Generations. Fig. 12 shows a plot of Kashagan runtime on various data center GPUs vs. the inverse of the
aggregate GPU memory bandwidth of that GPU (eight CPU cores were consistently used for well-related tasks). Ideal scaling with
bandwidth is represented by the dashed lines. The linear solver cost (shown in blue) correlates well with the inverse of memory band-
width, as observed with the Mango model (Fig. 9). However, the number of active cells for Kashagan is significantly less than for
Mango, whereas the well count is much larger; well-solver operations have, therefore, a larger impact on computational performance,
and the total runtime does not scale as per the ideal line.

Fig. 12—Scaling of Kashagan runtime with the inverse of memory bandwidth across four generations of Nvidia GPUs, using the
FIM formulation. Eight CPU cores were consistently used for the well solves, although the CPUs hosting each GPU differed, result-
ing in variable performance for these CPU-bound tasks. Dashed lines represent ideal scaling of simulation throughput with peak
memory bandwidth, using the runtime from the K80 as the reference point.

Scaling with Number of GPUs and CPU Cores. It would at first appear that well-solve time may limit the speedups that can be
achieved when moving between GPU generations. However, Fig. 12 fails to account for the fact that CPU core count and performance
are also increasing with time, in addition to GPU performance. GPU nodes with 128 CPU cores are now available; using a higher core
count for the well solves would significantly reduce this runtime.
Because individual well solves are independent, we would expect ideal scaling with the inverse of the number of cores applied to
this task. In practice, however, some wells are more expensive to solve than others, impacting the scaling as the number of cores
approaches the total number of active wells; to improve load balance, we use a dynamic and adaptive scheduling of well-solve tasks
across the thread pool.
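As a minimal sketch of this dynamic scheduling ("solve_well" stands in for the actual per-well solve, which is not shown, and wells are assumed to be identified by hashable IDs):

    from concurrent.futures import ThreadPoolExecutor

    def solve_wells(well_ids, n_threads, solve_well):
        # Submit one task per well; idle threads pick up the next pending well,
        # which balances load when some wells are much more expensive than others.
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            futures = {w: pool.submit(solve_well, w) for w in well_ids}
            return {w: f.result() for w, f in futures.items()}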
Fig. 13 first shows the Kashagan model runtime vs. the inverse of the number of CPU cores dedicated to well tasks, using a single
V100 GPU; we see that scaling is indeed almost ideal up to 16 cores. The difference between the runtime for a given number of cores and the extrapolated runtime gives an estimate of the time spent on well-related tasks. Here, using eight cores yields an estimate of 27%, while in Fig. 12,
wells for 1 V100 GPU and eight CPU cores are reported to cost 45% (green section of the bar). The reason is that the plots of Fig. 12
were prepared in December 2020 when some of the well-related tasks were not parallelized yet, whereas Fig. 13 was prepared more
recently upon suggestion from a technical reviewer.
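The 27% figure is obtained by fitting elapsed time as a linear function of the inverse core count and extrapolating to an infinite number of cores; a sketch of that estimate (using the fitted rather than the measured elapsed time at the chosen core count):

    import numpy as np

    def well_task_fraction(core_counts, elapsed_s, at_cores):
        # Fit elapsed = a + b / n_cores; "a" is the runtime extrapolated to an
        # infinite number of cores, i.e., with the parallel well tasks fully hidden.
        b, a = np.polyfit(1.0 / np.asarray(core_counts, dtype=float),
                          np.asarray(elapsed_s, dtype=float), deg=1)
        return 1.0 - a / (a + b / at_cores)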
Fig. 13 also shows the Kashagan model runtime vs. the inverse of the number of GPUs, adding four CPU cores for each GPU.
Strong scalability can be computed from the graph as 1.65 from 1 to 2 GPUs and 1.45 from 2 to 4 GPUs.
As a final note, we mention that our initial well-solve implementation was focused only on correctness and robustness, and we have
neglected several known optimizations to the individual well solves that will further reduce runtime. Some of these optimizations are
algorithmic: For instance, flash calculations currently always start from the Wilson initial guess (Michelsen and Mollerup 2007),
whereas in the case of separator calculations, it would be better to start from the converged solutions at the previous well-solve itera-
tion. Other known optimizations would rely on implementation details: For instance, it should be beneficial to move the computation of
well gravity head to the GPU. We believe that by running with a more realistic CPU core count and with these algorithmic and imple-
mentation improvements, well solves should not present a significant bottleneck, even on problems with very low cell-to-well ratio. As
such, our expectation is that GPU generational scaling will continue as silicon process technology improves and GPU performance
scales moving forward.
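For reference, the Wilson correlation used as the default initial guess takes the standard form below (see also Whitson and Brulé 2000); the sketch assumes absolute, consistent units for pressure, temperature, and the component critical properties:

    import math

    def wilson_k_values(p, t, p_crit, t_crit, acentric):
        # K_i = (Pc_i / P) * exp(5.373 * (1 + w_i) * (1 - Tc_i / T))
        return [pc / p * math.exp(5.373 * (1.0 + w) * (1.0 - tc / t))
                for pc, tc, w in zip(p_crit, t_crit, acentric)]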


Fig. 13—(Left) Scaling of Kashagan runtime with the inverse of the number of CPU cores, using a single V100 GPU. The distance
between the elapsed time for a specific point and the extrapolated elapsed time is an estimate of the cost of parallelized well tasks
(example for the eight-CPU-core case). (Right) Scaling with the inverse of the number of GPUs, adding four CPU cores for each
additional GPU. Ideal scaling is indicated by dotted lines.

Impact on Engineering Workflows. It is worth noting that the performance highlighted in Fig. 12, achieved with a single GPU, allows maximizing the efficiency of the GPU-based clusters that are becoming the norm in energy companies because of their adoption for seismic imaging workflows. For example, Eni’s HPC5 on-premise cluster (Strohmaier et al. 2020) is designed with compute nodes having four V100 GPUs and a total of 48 CPU cores. Reservoir engineers can run a typical ensemble of 200 such Kashagan models on 50 nodes in approximately 3 hours, which is a step change in workflows because multiple iterations and the associated analysis can be run in the same working day.

Conclusions
This paper provides the first documented development of an industrial grade computational reservoir simulator designed for GPUs,
with both fully implicit and adaptive implicit capabilities. The low-level code infrastructure, as well as some key components of the
simulator (e.g., the first-stage and elements of the second-stage linear solver preconditioners, or physical property modules such as those for the evaluation of relative permeabilities) are shared with the black-oil code first described by Esler et al. (2014) and further developed
since (Vidyasagar et al. 2019); other parts of the code are specifically implemented for the compositional formulation. The approach,
which can be seen as intermediate between that of using a unified formulation and that of using two separate codes, allows specific opti-
mization of both black-oil and compositional linear and nonlinear solvers, while limiting code duplication.
The selected examples show that new GPU-based hardware is very well-suited for running compositional models. By using the
enhanced throughput provided by GPUs, complex, high-resolution models can be simulated as part of everyday workflows. The benefit
of using a (compositional) GPU simulator is not limited to large models, however, and the paper demonstrates that running several
small models concurrently on the same GPU is able to saturate the available memory bandwidth. Furthermore, the use of GPUs for sim-
ulation workflows facilitates synergies in hardware requirements with other geoscience applications such as seismic imaging and
machine learning, allowing energy companies to concentrate their investments on a single high-performance computing solution. Nota-
bly, the Top500 (Strohmaier et al. 2020) reports Eni’s HPC5, Saudi Aramco’s Dammam-7, and Total’s Pangea3 as the top three pri-
vately owned clusters, all based on GPUs.
A distinctive feature of the presented simulator is that the performance difference between AIM and FIM is smaller than what is typically observed with other commercial software. At the moment, FIM is used by default and AIM is an option; this may change in the future
as the code is further optimized.
With regard to the extensibility of the GPU code base, no fundamental technological barrier has been identified, and future work will mostly involve the implementation of advanced physical options required by Eni and their peers, in the context of oil and gas production, as well as in preparation for the transition of energy companies toward decarbonization.

Acknowledgments
The results presented in this paper have been achieved in the furtherance of a cooperative development agreement between Stone Ridge
Technology and Eni S.p.A. The authors thank Oracle Cloud Infrastructure for providing access to a node with Nvidia A100 GPUs used
for the benchmarks in this work. Stone Ridge Technology thanks Eni S.p.A. for permission to use proprietary models in this publication
and acknowledges Marathon Oil Corporation for financial support. This work was partially supported under Small Business Innovative
Research Grant DE-SC0015214 from the US Department of Energy.

References
Appleyard, J., Appleyard, J., Wakefield, M. et al. 2011. Accelerating Reservoir Simulators Using GPU Technology. Paper presented at the SPE Reservoir
Simulation Symposium, The Woodlands, Texas, USA, 21–23 February. SPE-141402-MS. https://doi.org/10.2118/141402-MS.
Baker, L. 1988. Three-Phase Relative Permeability Correlations. Paper presented at the SPE Enhanced Oil Recovery Symposium, Tulsa, Oklahoma,
USA, 16–21 April. SPE-17369-MS. https://doi.org/10.2118/17369-MS.
Barkve, T. 2000. Application of a Truncation Error Estimate to Time Step Selection in a Reservoir Simulator. Paper presented at the ECMOR VII - 7th
European Conference on the Mathematics of Oil Recovery, Baveno, Italy, 5–8 September. https://doi.org/10.3997/2214-4609.201406132.
Bayat, M. and Killough, J. 2013. An Experimental Study of GPU Acceleration for Reservoir Simulation. Paper presented at the SPE Reservoir Simula-
tion Symposium, The Woodlands, Texas, USA, 18–20 February. SPE-163628-MS. https://doi.org/10.2118/163628-MS.
Bear, J. 2013. Dynamics of Fluids in Porous Media. North Chelmsford, Massachusetts, USA: Courier Corporation.
Bogachev, K., Milyutin, S., Telishev, A. et al. 2018. High-Performance Reservoir Simulations on Modern CPU-GPU Computational Platforms. Paper
presented at the AAPG International Conference and Exhibition, Cape Town, South Africa, 4–7 November.
Bowen, G. R. and Leiknes, J. 1995. Parallel Processing Applied to Local Grid Refinement. Paper presented at the Petroleum Computer Conference,
Houston, Texas, USA, 11–14 June. SPE-30203-MS. https://doi.org/10.2118/30203-MS.


Brandt, A., McCormick, S., and Ruge, J. 1985. Algebraic Multigrid (AMG) for Sparse Matrix Equations. In Sparsity and Its Applications, ed. D. J.
Evans, 257–284. Cambridge, England, UK: Cambridge University Press.
Cao, H. 2002. Development of Techniques for General Purpose Research Simulator. PhD dissertation, Stanford University, Stanford, California, USA.
Cao, H., Tchelepi, H. A., Wallis, J. R. et al. 2005. Parallel Scalable Unstructured CPR-Type Linear Solver for Reservoir Simulation. Paper presented at
the SPE Annual Technical Conference and Exhibition, Houston, Texas, USA, 9–12 October. SPE-96809-MS. https://doi.org/10.2118/96809-MS.
Casciano, C., Cominelli, A., and Bianchi, M. 2015. Latest Advances in Simulation Technology for High-Resolution Reservoir Models: Achievements
and Opportunities for Improvement. Paper presented at the SPE Reservoir Characterisation and Simulation Conference and Exhibition, Abu Dhabi,
UAE, 14–16 September. SPE-175633-MS. https://doi.org/10.2118/175633-MS.
Chen, Z., Liu, H., Yu, S. et al. 2014. GPU-Based Parallel Reservoir Simulators. In Domain Decomposition Methods in Science and Engineering XXI,
199–206. New York, New York, USA: Springer.
Coats, K. H. 1980. An Equation of State Compositional Model. SPE J. 20 (5): 363–376. SPE-8284-PA. https://doi.org/10.2118/8284-PA.
Coats, K. 2003. IMPES Stability: Selection of Stable Timesteps. SPE J. 8 (2): 181–187. SPE-84924-PA. https://doi.org/10.2118/84924-PA.
Coats, K., Thomas, L., and Pierson, R. 1998. Compositional and Black Oil Reservoir Simulation. SPE Res Eval & Eng 1 (4): 372–379. SPE-50990-PA.
https://doi.org/10.2118/50990-PA.
Coats, K., Thomas, L., and Pierson, R. 2007. Simulation of Miscible Flow Including Bypassed Oil and Dispersion Control. SPE Res Eval & Eng 10 (5):
500–507. SPE-90898-PA. https://doi.org/10.2118/SPE-90898-PA.
Cominelli, A., Casciano, C., Panfili, P. et al. 2014. Deployment of High-Resolution Reservoir Simulator: Methodology & Cases. Paper presented at the
30th Abu Dhabi International Petroleum Exhibition and Conference, Abu Dhabi, UAE, 10–13 November. SPE-171965-MS. https://doi.org/10.2118/171965-MS.
Dogru, A. H., Fung, L. S. K., Middya, U. et al. 2009. A Next-Generation Parallel Reservoir Simulator for Giant Reservoirs. Paper presented at the SPE
Reservoir Simulation Symposium, The Woodlands, Texas, USA, 2–4 February. SPE-119272-MS. https://doi.org/10.2118/119272-MS.
Dogru, A. H., Sunaidi, H., Fung, L. et al. 2002. A Parallel Reservoir Simulator for Large-Scale Reservoir Simulation. SPE Res Eval & Eng 5 (1): 11–23.
SPE-75805-PA. https://doi.org/10.2118/75805-PA.
Eclipse. 2017. Technical Description. Houston, Texas, USA: Schlumberger.
Edwards, D. A. and Guyaguler, B. 2011. Method To Improve Well Model Efficiency in Reservoir Simulation. Paper presented at the SPE Reservoir Sim-
ulation Symposium, The Woodlands, Texas, USA, 21–23 February. SPE-139571-MS. https://doi.org/10.2118/139571-MS.
Esler, K., Mukundakrishnan, K., Natoli, V. et al. 2014. Realizing the Potential of GPUs for Reservoir Simulation. Paper presented at the ECMOR XIV -
14th European Conference on the Mathematics of Oil Recovery, Baveno, Italy, 8–11 September. https://doi.org/10.3997/2214-4609.20141771.
Esler, K., Natoli, V., and Samardzic, A. 2012. GAMPACK (GPU Accelerated Algebraic Multigrid Package). Paper presented at the ECMOR XIII - 13th
European Conference on the Mathematics of Oil Recovery, Biarritz, France, 10–13 September. https://doi.org/10.3997/2214-4609.20143241.
Feng, S., Pal, S., Yang, Y. et al. 2019. Parallelism Analysis of Prominent Desktop Applications: An 18-Year Perspective. In 2019 IEEE International
Symposium on Performance Analysis of Systems and Software (ISPASS), 202–211. New York, New York, USA: Institute of Electrical and Electron-
ics Engineers. https://doi.org/10.1109/ISPASS.2019.00033.
Ferraro, L., Orlandini, S., Calonaci, C. et al. 2014. Seismic Imaging Software Design for Accelerators. Paper presented at the 76th EAGE Conference
and Exhibition, Amsterdam, The Netherlands, 16–19 June. https://doi.org/10.3997/2214-4609.20141137.
Fung, L. S. and Dogru, A. H. 2008. Parallel Unstructured-Solver Methods for Simulation of Complex Giant Reservoirs. SPE J. 13 (4): 440–446. SPE-
106237-PA. https://doi.org/10.2118/106237-PA.
Ganapathy, C. and Voskov, D. 2018. Multiscale Reconstruction in Physics for Compositional Simulation. J Comput Phys 375: 747–762. https://doi.org/
10.1016/j.jcp.2018.08.046.
Gandham, R., Esler, K., Mukundakrishnan, K. et al. 2016. GPU Acceleration of Equation of State Calculations in Compositional Reservoir Simulation.
Paper presented at the ECMOR XV - 15th European Conference on the Mathematics of Oil Recovery, Amsterdam, The Netherlands, 29 August–1
September. https://doi.org/10.3997/2214-4609.201601747.
Gandham, R., Zhang, Y., Esler, K. et al. 2021. Improving GPU Throughput of Reservoir Simulations Using NVIDIA MPS and MIG. In Conference Pro-
ceedings, Fifth EAGE Workshop on High Performance Computing for Upstream, Vol. 2021, 1–5. European Association of Geoscientists & Engi-
neers. https://doi.org/10.3997/2214-4609.2021612025.
Gries, S., Stüben, K., Brown, G. L. et al. 2014. Preconditioning for Efficiently Applying Algebraic Multigrid in Fully Implicit Reservoir Simulations.
SPE J. 19 (4): 726–736. SPE-163608-PA. https://doi.org/10.2118/163608-PA.
Intersect. 2018. Technical Description. Houston, Texas, USA: Schlumberger.
Kenyon, S. E. and Behie, G. 1987. Third SPE Comparative Solution Project: Gas Cycling of Retrograde Condensate Reservoirs. SPE J. 39 (8): 981–997.
SPE-12278-PA. https://doi.org/10.2118/12278-PA.
Khait, M. and Voskov, D. 2017. GPU-Offloaded General Purpose Simulator for Multiphase Flow in Porous Media. Paper presented at the SPE Reservoir
Simulation Conference, Montgomery, Texas, USA, 20–22 February. SPE-182633-MS. https://doi.org/10.2118/182663-MS.
Killough, J. and Kossack, C. 1987. Fifth Comparative Solution Project: Evaluation of Miscible Flood Simulators. Paper presented at the SPE Symposium
on Reservoir Simulation, San Antonio, Texas, USA, 1–4 February. SPE-16000-MS. https://doi.org/10.2118/16000-MS.
King, M. J., Ballin, P. R., Bennis, C. et al. 2012. Reservoir Modeling: From RESCUE to RESQML. SPE Res Eval & Eng 15 (2): 127–138. SPE-135280-
PA. https://doi.org/10.2118/135280-PA.
Komatitsch, D., Erlebacher, G., Göddeke, D. et al. 2010. High-Order Finite-Element Seismic Wave Propagation Modeling with MPI on a Large GPU
Cluster. J Comput Phys 229 (20): 7692–7714. https://doi.org/10.1016/j.jcp.2010.06.024.
Kozlova, A., Li, Z., Natvig, J. R. et al. 2016. A Real-Field Multiscale Black-Oil Reservoir Simulator. SPE J. 21 (6): 2049–2061. SPE-173226-PA.
https://doi.org/10.2118/173226-PA.
Lantz, R. 1971. Quantitative Evaluation of Numerical Diffusion (Truncation Error). SPE J. 11 (3): 315–320. SPE-2811-PA. https://doi.org/10.2118/
2811-PA.
Li, C. 1971. Critical Temperature Estimation for Simple Mixtures. Can J Chem Eng 49 (5): 709–710. https://doi.org/10.1002/cjce.5450490529.
Lie, K.-A., Møyner, O., Natvig, J. et al. 2017. Successful Application of Multiscale Methods in a Real Reservoir Simulator Environment. Comput Geosci
21 (5–6): 981–998. https://doi.org/10.1007/s10596-017-9627-2.
Massidda, L., Theis, D., Bonomi, E. et al. 2013. Spectral Elements for Very Large Offshore Acoustic-Elastic Wave Simulations. In SEG Technical Program
Expanded Abstracts 2013, 3391–3395. Tulsa, Oklahoma, USA: Society of Exploration Geophysicists. https://doi.org/10.1190/segam2013-0591.1.
Michelsen, M. and Mollerup, J. 2007. Thermodynamic Models: Fundamentals and Computational Aspects. Ronnebarvej, Denmark: Tie-Line
Publications.
Møyner, O. and Tchelepi, H. A. 2018. A Mass-Conservative Sequential Implicit Multiscale Method for Isothermal Equation-of-State Compositional
Problems. SPE J. 23 (6): 2376–2393. SPE-182679-PA. https://doi.org/10.2118/182679-PA.


Naccache, P. 1997. A Fully-Implicit Thermal Reservoir Simulator. Paper presented at the SPE Reservoir Simulation Symposium, Dallas, Texas, USA,
8–11 June. SPE-37985-MS. https://doi.org/10.2118/37985-MS.
Naumov, M., Arsaev, M., Castonguay, P. et al. 2015. AmgX: A Library for GPU Accelerated Algebraic Multigrid and Preconditioned Iterative Methods.
SIAM J Sci Comput 37(5): S602–S626. https://doi.org/10.1137/140980260.
Nexus Technical Reference Guide. 2014. Washington, DC, USA: Halliburton.
Obi, E., Eberle, N., Fil, A. et al. 2014. Giga Cell Compositional Simulation. Paper presented at the International Petroleum Technology Conference,
Doha, Qatar, 19–22 January. IPTC-17648-MS. https://doi.org/10.2523/IPTC-17648-MS.
Panfili, P., Cominelli, A., Calabrese, M. et al. 2012. Advanced Upscaling for Kashagan Reservoir Modeling. SPE Res Eval & Eng 15 (2): 150–164. SPE-
146508-PA. https://doi.org/10.2118/146508-PA.
Patacchini, L., Duchenne, S., Bourgeois, M. et al. 2015. Simulation of Residual Oil Saturation in Near-Miscible Gasflooding through Saturation-
Dependent Tuning of the Equilibrium Constants. SPE Res Eval & Eng 18 (3): 28–302. SPE-171806-MS. https://doi.org/10.2118/171806-PA.
Perrone, A., Pennadoro, F., Tiani, A. et al. 2017. Enhancing the Geological Models Consistency in Ensemble Based History Matching an Integrated
Approach. Paper presented at the SPE Reservoir Characterisation and Simulation Conference and Exhibition, Abu Dhabi, UAE, 8–10 May. SPE-
186049-MS. https://doi.org/10.2118/186049-MS.
Petitfrère, M., De Loubens, R., and Patacchini, L. 2019. Continuous Relative Permeability Model for Compositional Reservoir Simulation, Using the
True Critical Point and Accounting for Miscibility. Paper presented at the SPE Reservoir Simulation Conference, Houston, Texas, USA, 10–11
April. SPE-193826-MS. https://doi.org/10.2118/193826-MS.
Rasmussen, A. F., Sandve, T. H., Bao, K. et al. 2021. The Open Porous Media Flow Reservoir Simulator. Comput Math Appl 81: 159–185. https://doi.org/10.1016/j.camwa.2020.05.014.
Saad, Y. and Schultz, M. H. 1986. GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems. SIAM J Sci Comput
7 (3): 856–869. https://doi.org/10.1137/0907058.
Shiralkar, G., Fleming, G., Watts, J. et al. 2005. Development and Field Application of a High Performance, Unstructured Simulator with Parallel Capa-
bility. Paper presented at the SPE Reservoir Simulation Symposium, The Woodlands, Texas, USA, 31 January–2 February. SPE-93080-MS. https://
doi.org/10.2118/93080-MS.
Shotton, M., Stephen, K., and Giddins, M. A. 2016. High-Resolution Studies of Polymer Flooding in Heterogeneous Layered Reservoirs. Paper presented
at the SPE EOR Conference at Oil and Gas West Asia, Muskat, Oman, 21–23 March. SPE-179754-MS. https://doi.org/10.2118/179754-MS.
Strohmaier, E., Dongarra, J., and Horst, S. 2020. Top500 List. https://www.top500.org/lists/top500/2020/11/highs/ (accessed December 2020).
Tchelepi, H. and Zhou, Y. 2013. Multi-GPU Parallelization of Nested Factorization for Solving Large Linear Systems. Paper presented at the SPE Reser-
voir Simulation Symposium, The Woodlands, Texas, USA, 18–20 February. SPE-163588-MS. https://doi.org/10.2118/163588-MS.
Verdiere, S., Quettier, L., Samier, P. et al. 1999. Applications of a Parallel Simulator to Industrial Test Cases. Paper presented at the SPE Reservoir Sim-
ulation Symposium, Houston, Texas, USA, 14–17 February. SPE-51887-MS. https://doi.org/10.2118/51887-MS.
Vidyasagar, A., Patacchini, L., Panfili, P. et al. 2019. Full-GPU Reservoir Simulation Delivers on Its Promise for Giant Carbonate Fields. Paper presented
at the Third EAGE WIPIC Workshop: Reservoir Management in Carbonates, Doha, Qatar, 18–20 November. https://doi.org/10.3997/2214-
4609.201903118.
Voskov, D. V. and Tchelepi, H. A. 2012. Comparison of Nonlinear Formulations for Two-Phase Multi-Component EoS Based Simulation. J Pet Sci Eng
82-83: 101–111. https://doi.org/10.1016/j.petrol.2011.10.012.
Wallis, J. R. 1983. Incomplete Gaussian Elimination as a Preconditioning for Generalized Conjugate Gradient Acceleration. Paper presented at the SPE
Reservoir Simulation Symposium, San Francisco, California, USA, 15–18 November. SPE-12265-MS. https://doi.org/10.2118/12265-MS.
Wallis, J. R., Kendall, R., and Little, T. 1985. Constrained Residual Acceleration of Conjugate Residual Methods. Paper presented at the SPE Reservoir
Simulation Symposium, Dallas, Texas, USA, 10–13 February. SPE-13536-MS. https://doi.org/10.2118/13536-MS.
Whitson, C. H. and Brulé, M. R. 2000. Phase Behavior, Vol. 20. Richardson, Texas, USA: Henry L. Doherty Series, Society of Petroleum Engineers.
Wobbes, E. 2014. Reducing Communication in AMG for Reservoir Simulation: Aggressive Coarsening and Non-Galerkin Coarse-Grid Operators. PhD
dissertation, Delft University, Delft, The Netherlands.
Young, L. and Russell, T. 1993. Implementation of an Adaptive Implicit Method. Paper presented at the SPE Symposium on Reservoir Simulation, New
Orleans, Louisiana, USA, 28 February–3 March. SPE-25245-MS. https://doi.org/10.2118/25245-MS.
Younis, R., Tchelepi, H. A., and Aziz, K. 2010. Adaptively Localized Continuation-Newton Method–Nonlinear Solvers That Converge All the Time.
SPE J. 15 (2): 526–544. SPE-119147-PA. https://doi.org/10.2118/119147-PA.
Yu, S., Liu, H., Chen, Z. J. et al. 2012. GPU-Based Parallel Reservoir Simulation for Large-Scale Simulation Problems. Paper presented at the SPE
Europec/EAGE Annual Conference, Copenhagen, Denmark, 4–7 June. SPE-152271-MS. https://doi.org/10.2118/152271-MS.
