
Version of Record: https://www.sciencedirect.com/science/article/pii/S0309170819311649

Physics-Informed Neural Networks for Multiphysics Data Assimilation with Application to Subsurface Transport

QiZhi He (a), David Barajas-Solano (a), Guzel Tartakovsky (b), Alexandre M. Tartakovsky (a,*)

(a) Pacific Northwest National Laboratory, Richland, WA 99354
(b) INTERA Incorporated, Richland, WA 99354

Abstract

Data assimilation for parameter and state estimation in subsurface transport


problems remains a significant challenge because of the sparsity of measurements,
the heterogeneity of porous media, and the high computational cost of forward
numerical models. We present a multiphysics-informed deep neural network
machine learning method for estimating space-dependent hydraulic conductivity,
hydraulic head, and concentration fields from sparse measurements. In this
approach, we employ individual deep neural networks (DNNs) to approximate
the unknown parameters (e.g., hydraulic conductivity) and states (e.g., hydraulic
head and concentration) of a physical system. Next, we jointly train these
DNNs by minimizing the loss function that consists of the governing equations
residuals in addition to the error with respect to measurement data. We apply
this approach to assimilate conductivity, hydraulic head, and concentration
measurements for the joint inversion of these parameter and states in a steady-
state advection–dispersion problem. We study the accuracy of the proposed data
assimilation approach with respect to the data size (i.e., the number of measured
variables and the number of measurements of each variable), DNN size, and the
complexity of the parameter field. We demonstrate that the physics-informed
DNNs are significantly more accurate than the standard data-driven DNNs,
especially when the training set consists of sparse data. We also show that
the accuracy of parameter estimation increases as more multiphysics variables
are jointly inverted.

* Corresponding author
Email address: Alexandre.Tartakovsky@pnnl.gov (Alexandre M. Tartakovsky)

Preprint submitted to Advances in Water Resources, April 30, 2020

© 2020 published by Elsevier. This manuscript is made available under the Elsevier user license
https://www.elsevier.com/open-access/userlicense/1.0/
Keywords: Physics-informed deep neural networks, data assimilation,
parameter estimation, inverse problems, subsurface flow and transport

1. Introduction

Modeling of transport in heterogeneous porous media is a part of many


environmental and engineering applications, including hydrocarbon recovery [1],
hydraulic fracking [2], exploitation of geothermal energy [3], geologic disposal of
radioactive waste [4], and groundwater contamination assessment [5]. Numerical
models of transport in porous media require deterministic or statistical knowl-
edge of the subsurface properties (e.g., hydraulic conductivity) and initial and
boundary conditions [6, 7]. However, because of heterogeneity and data sparsity,
the parameters of natural systems are often not fully known. Despite significant
research in inverse methods [8, 9, 5, 10], parameter estimation at a resolution
required for accurate modeling of transport processes remains a challenge.
Parameter estimation is complicated by the fact that the parameters of
interest (e.g., hydraulic conductivity) are difficult to measure directly. Most
inverse (parameter estimation) methods use indirect measurements (in addition to
direct measurements) to estimate parameters. Data assimilation (or model–data
integration) has been well recognized as an effective technique to reduce predictive
uncertainties and improve model accuracy. Data assimilation is a process where
model parameters and system states are updated using measurements and
governing equations [5, 11, 12]. Data assimilation has been used in many
fields, including atmospheric and oceanic sciences [13, 14], hydrology [15, 9],
subsurface transport [5, 10, 16], and uncertainty quantification [17, 18]. Data
assimilation in subsurface applications is challenging because the subsurface flow
and transport equations are highly nonlinear and the states and parameters are
non-Gaussian [16]. This nonlinearity poses a difficulty for both the direct inverse
25 methods and Bayesian parameter estimation methods.

2
Recent advances in machine learning (ML) methods, automatic differentiation
(AD) [19], and ML libraries (e.g., TensorFlow [20] and Pytorch [21]) have made
them potentially powerful tools for parameter estimation and data assimilation.
For example, Schmidt and Lipson [22] applied symbolic regression to learn
conservation laws, and Brunton et al. [23] used sparse regression to discover
equations of nonlinear dynamics directly from data. Physics-informed (deep)
neural networks (PINNs) were used to learn solutions and parameters in partial
and ordinary differential equations [24, 25, 26]. Recently, PINNs were
extended for inverse problems associated with partial differential equations
(PDEs) with space-dependent coefficients (e.g., to estimate hydraulic conductivity
using sparse measurements of conductivity and hydraulic head) [27].
In this study, we extend the PINN-based parameter estimation method of [27]
to assimilate multiphysics measurements and refer to this multiphysics-informed
neural network approach as MPINN. We consider a subsurface transport problem
with sparse measurements of hydraulic conductivity, hydraulic head, and solute
concentration. In this approach, we use the Darcy and advection–dispersion
equations together with data to train deep neural networks (DNNs) that represent
space-dependent conductivity, head, and concentration fields. During training
of the DNNs, the governing equations and the associated boundary conditions
are enforced at the “residual” points over the domain. We demonstrate that
for sparse data, the MPINN approach significantly improves the accuracy of
parameter and state estimation as compared to standard DNNs trained with data
only. The MPINN approach can be easily extended to assimilate other types of
variables and physics laws, e.g., geophysical measurements and the corresponding
equations that describe the relationships between electrical resistivity, current,
and potential.
This paper is organized as follows. In Section 2, we describe the MPINN
method and its formulation for transport problems. The performance of the
MPINN approach for data assimilation, including the dependence of estimation
errors on the number of measurements, is given in Section 3. The effects of
the neural network size and the conductivity field correlation length on the

parameter estimation errors are discussed in Section 4. Conclusions are given in
Section 5.

2. MPINN for data assimilation

In this section, we present the MPINN approach for multiphysics data


assimilation and its formulation for subsurface transport applications. We also
discuss the automatic differentiation and two-step training algorithm that are
used in the MPINN approach.

2.1. General MPINN formulation

In the PINN approach and its MPINN extension, we employ fully connected
feed-forward networks to approximate unknown variables (states) and space-
dependent parameters, as described in Appendix A and shown in Figure A.17.
Given a sufficiently large number of hidden layers, DNNs have excellent
representative properties but require a lot of data to train them. This creates a
challenge in applying DNNs to subsurface problems where measurements are
usually sparse. For the purpose of this work, we define sparse measurements
as those that do not sufficiently cover the computational domain to accurately
estimate parameters with the standard data-driven DNNs method described
in Appendix A. In [27], we demonstrated that the Darcy law can be used as a
constraint for training a DNN model of conductivity that significantly improves
the predictive ability of the DNN model.
In the rest of this section, we extend the PINN parameter estimation method
of [27] to a data assimilation problem where different types of measurements are
used to estimate parameters and states. Consider a system of PDEs forming the
boundary value problem defined on the domain Ω ⊂ Rd with the boundary ∂Ω:

    L(u(x); p(x)) = 0,   x ∈ Ω,
    B(u(x); p(x)) = 0,   x ∈ ∂Ω,                                           (1)

where u is the (unknown) solution vector (can include head, concentrations,


saturation, electrical potential, and other variables), p is the (unknown) system

parameter vector (e.g., hydraulic and electric conductivities), L denotes the
known (nonlinear) differential operator, and the operator B expresses arbitrary
boundary conditions associated with the problem. The boundary conditions can
be of the Dirichlet and Neumann types applied on ∂D Ω and ∂N Ω, respectively,
such that ∂D Ω ∪ ∂N Ω = ∂Ω and ∂D Ω ∩ ∂N Ω = ∅.
We use the DNNs to approximate both state variables and unknown param-
eters, u(x) ≈ û(x; θ) and p(x) ≈ p̂(x; γ), x ∈ Ω, where θ and γ are weights
or parameters (which need to be estimated or trained) in the corresponding
DNNs. To determine these parameters, we minimize the loss function J(θ, γ)
with physics-informed penalty terms:

    (θ, γ) = arg min_{θ,γ} J(θ, γ),                                        (2)

where

    J(θ, γ) = J_d(θ, γ) + ω_f J_f(θ, γ) + ω_b J_b(θ, γ).                   (3)

Here, Jd (θ, γ) is the loss due to a mismatch with the data (i.e., the measurements
of u and p):

    J_d(θ, γ) = (1/|T_u|) Σ_{x ∈ T_u} (û(x; θ) − u*(x))² + (1/|T_p|) Σ_{x ∈ T_p} (p̂(x; γ) − p*(x))²,     (4)

J_f(θ, γ) is the loss due to mismatch with the governing PDEs L(u(x); p(x)) = 0:

    J_f(θ, γ) = (1/|T_f|) Σ_{x ∈ T_f} (L(û(x; θ); p̂(x; γ)))²,                                            (5)

and J_b(θ, γ) is the loss due to mismatch with the boundary conditions B(u(x); p(x)) = 0:

    J_b(θ, γ) = (1/|T_b|) Σ_{x ∈ T_b} (B(û(x; θ); p̂(x; γ)))².                                            (6)

In (3), ωf and ωb are weights that determine how strongly mismatch with
the governing PDEs and boundary conditions is penalized relative to data
mismatch. In this work, we assume that the measurements and physics model
are exact and set ωf = ωb = 1. The sets Tu = {x1 , x2 , ..., x|Tu | } ⊂ Ω and

Tp = {x1 , x2 , ..., x|Tp | } ⊂ Ω denote the measurement locations of u and p,
respectively, and u∗ (x), x ∈ Tu and p∗ (x), x ∈ Tp are the measured values of
u and p at these locations. The sets Tf = {x1 , x2 , ..., x|Tf | } ⊂ Ω and Tb =
{x1 , x2 , ..., x|Tb | } ⊂ ∂Ω denote locations of the “residual” points where Jf (θ, γ)
and Jb (θ, γ) are, respectively, minimized. The penalty terms Jf (θ, γ) and Jb (θ, γ)
force the DNN approximations of u and p to satisfy the governing equation (1)
at the residual points. Note that while it is preferable to enforce physics over
the whole domain, the computational cost of estimating and minimizing the
loss function (3) increases with the number of residual points. In this work, we
demonstrate convergence of the solution of (2) with an increasing number of
residual points, meaning that the DNNs û(x; θ) and p̂(x; γ) can be accurately
trained using a finite number of residual points. Similar convergence results for
solving PDEs with the PINN method were also observed in [24, 28, 27, 25, 29].
The loss Jf (θ, γ) is evaluated by computing spatial derivatives of û(x; θ) and
p̂(x; γ) using AD. AD is also used to evaluate the normal derivative n · ∇ in
the Neumann boundary condition in the loss Jb (θ, γ) (see details in Section 2.2).
AD is implemented in most ML libraries, including TensorFlow and Pytorch
[21], where it is mainly used to compute derivatives with respect to the DNN
weights (i.e., θ and γ). In the PINN method, AD allows the implementation of
any PDE and boundary condition constraints without numerically discretizing
and solving the PDEs.
Another benefit of enforcing PDE constraints via the penalty term Jf (θ, γ) is
that it allows using the corresponding weight ωf to account for the fidelity of the
PDE model. For example, we can assign a smaller weight to a low-fidelity PDE
model. In general, the number of unknown parameters in θ and γ is much larger
than the number of measurements, and training the DNNs requires regularization.
One can consider the losses Jb (θ, γ) and Jf (θ, γ) in the minimization problem
(2) as physics-informed regularization terms [27, 30].
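For concreteness, the following minimal sketch shows how the composite loss (2)–(6) can be assembled in a framework such as TensorFlow. The surrogate networks û and p̂ and the evaluators of the PDE and boundary residuals are assumed to be supplied as callables; all names are illustrative and not part of a reference implementation.

```python
# A sketch of the physics-informed loss (3): data misfit + PDE residual + BC residual.
import tensorflow as tf

def mpinn_loss(u_hat, p_hat, pde_residual, bc_residual,
               x_u, u_star, x_p, p_star, x_f, x_b, w_f=1.0, w_b=1.0):
    # J_d: mismatch with the u and p measurements, Eq. (4)
    J_d = tf.reduce_mean(tf.square(u_hat(x_u) - u_star)) \
        + tf.reduce_mean(tf.square(p_hat(x_p) - p_star))
    # J_f: PDE residual evaluated at the residual points, Eq. (5)
    J_f = tf.reduce_mean(tf.square(pde_residual(x_f)))
    # J_b: boundary-condition residual at the boundary residual points, Eq. (6)
    J_b = tf.reduce_mean(tf.square(bc_residual(x_b)))
    # Eq. (3); the paper sets w_f = w_b = 1
    return J_d + w_f * J_f + w_b * J_b
```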

2.2. Application of MPINN for subsurface transport problems

For sparsely sampled systems, data assimilation can significantly improve the
accuracy of parameter and state estimation. Here, we assume that the sparse
steady-state measurements of a synthetic tracer test in a heterogeneous porous
domain Ω = [0, L1 ] × [0, L2 ] are available, where the solute is continually injected
at the x1 = 0 boundary. This data includes the measurements of conductivity
K_i* := K(x_i^K), hydraulic head h_i* := h(x_i^h), and concentration C_i* := C(x_i^C) at
the locations {x_i^K}_{i=1}^{N_K}, {x_i^h}_{i=1}^{N_h}, and {x_i^C}_{i=1}^{N_C}, respectively, where NK , Nh , and

NC are the number of measurements of each variable. Our objective is to learn


the conductivity, head, and concentration fields based on these measurements.
We further assume that the concentration, hydraulic head, and conductivity
data can be accurately modeled by the steady-state Darcy flow:

    v(x) = −(K(x)/φ) ∇h(x),                x ∈ Ω
    ∇ · v(x) = 0,                          x ∈ Ω
    h(x) = H2,                             x1 = L1                         (7)
    −K(x) ∂h(x)/∂x1 = q,                   x1 = 0
    −K(x) ∂h(x)/∂x2 = 0,                   x2 = 0 or x2 = L2

and advection–dispersion equation:

    ∇ · [v(x)C(x)] = ∇ · [D∇C(x)],         x ∈ Ω
    C(x) = C0(x2),                         x1 = 0
    ∂C(x)/∂x1 = 0,                         x1 = L1                         (8)
    ∂C(x)/∂x2 = 0,                         x2 = 0 or x2 = L2

where φ is the effective porosity of the medium, v is the average pore velocity,
and D is the dispersion coefficient:

D = Dw τ I + α||v||2 . (9)

Here, I is the identity tensor, Dw is the diffusion coefficient, τ is the tortuosity


of the medium, and α is the dispersivity tensor with the diagonal components

αL and αT . The conductivity K(x) is assumed to be unknown except at the
measurement locations {x_i^K}_{i=1}^{N_K}.

In the following simulations, we set the parameters as: L1 = 1 m, L2 = 0.5 m,
H2 = 0 m, q = 1 m/hr, C0(x2) = c exp(−(x2 − L2/2)²/ℓ²), c = 1 Kg/m³, ℓ = 0.25 m,
φ = 0.317, Dw = 0.09 m²/hr, τ = φ^(1/3) = 0.681, αL = 0.01 m, and αT = 0.001 m.
We start by defining the DNN representations of K(x), h(x), and C(x) as:

K̂(x) := NK (x; θK )

ĥ(x) := Nh (x; θh ) (10)

Ĉ(x) := NC (x; θC )

where θK , θh , and θC are the vectors of parameters associated with each neural
network. For the considered two-dimensional problem, the dimension of the
input layers in these DNNs is two. The K, h, and C fields are scalar; therefore,
the dimensionality of the output layers in these DNNs is one.
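As an illustration, the three surrogate networks in Eq. (10) could be constructed as follows; this is a sketch using the Keras API, the layer sizes follow the 5 × 32 architecture used in the numerical examples below, and the function name is illustrative.

```python
import tensorflow as tf

def make_dnn(n_hidden=5, width=32):
    # fully connected tanh network with a 2-D input (x1, x2) and a scalar output
    layers = [tf.keras.Input(shape=(2,))]
    layers += [tf.keras.layers.Dense(width, activation='tanh') for _ in range(n_hidden)]
    layers += [tf.keras.layers.Dense(1)]          # linear output layer for K, h, or C
    return tf.keras.Sequential(layers)

K_net, h_net, C_net = make_dnn(), make_dnn(), make_dnn()   # K-hat, h-hat, C-hat in Eq. (10)
```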
The specific form of the general loss function (3) for training these DNNs is
given by equations (B.4)–(B.6) in Appendix B. In Eq. (B.6), PDEs (7) and (8)
are enforced at the residual points given by the sets Tfh and TfC , respectively,
where |Tfh | = Nfh and |TfC | = NfC . A schematic diagram of the MPINN method
for data assimilation in the transport problem described by Equations (7) and
(8) is shown in Figure 1.
In this work, we compare three approaches for training K̂: the MPINN
approach where we jointly train the DNNs K̂, ĥ, and Ĉ by minimizing the loss
function (B.4); the PINN–Darcy approach where we jointly train K̂ and ĥ by
only enforcing the Darcy equation and boundary conditions (7); the data-driven
DNN approach where we separately train K̂, ĥ, and Ĉ using only data. The
performance of these three approaches is investigated and compared in Sections
3 and 4.
Given that the loss function is highly nonlinear and non-convex with respect to
the network parameters θK , θh , and θC , we use gradient-based minimization
algorithms, including the Adam [31] and L-BFGS-B [32] methods. In the L-
BFGS-B optimizer, the iterative minimization process is terminated once the
relative change in the loss function becomes smaller than a prescribed value. In
the Adam method, the DNNs training stops once the total loss function becomes
smaller than a prescribed small value or the predefined number of iterations
(epochs) is completed. As suggested in [24, 27, 33, 34], L-BFGS-B, a quasi-
Newton method, shows superior performance with a better rate of convergence,
less gradient vanishing, and a lower computational cost for problems with a
relatively small amount of training data and/or residual points. In this study,
we employ the L-BFGS-B method for the data-driven DNN and PINN–Darcy
methods with the default settings from Scipy [35]. However, our numerical
experiments show that the L-BFGS-B algorithm has a slow convergence for the
MPINN method, where a relatively large number of residual points is used.

Figure 1: A schematic diagram of the MPINN method for multiphysics data assimilation in
subsurface transport problems. Three DNNs are used to represent the unknown K(x), h(x),
and C(x) fields. Spatial derivatives of these fields in the PDE and boundary condition residuals
are computed with AD. The multiphysics loss function J and PDE residuals f^h, f^h_N, f^C, and
f^C_N are defined in Appendix B.
We propose a two-step training algorithm, where the loss function is first

minimized by the Adam algorithm with a prescribed stop criterion followed
by the L-BFGS-B optimizer. In this work, we use the two-step algorithm in
all MPINN simulations unless it is stated otherwise. At the beginning of the
160 training process, the parameters of the neural networks are randomly initialized
using the Xavier scheme [36].
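A minimal sketch of this two-step procedure is given below. It assumes a list of the trainable variables of the three DNNs and a callable `loss_fn` returning the total loss; the variable-flattening helpers and the specific tolerance and learning rate are illustrative (values of the kind quoted later in the paper, e.g., a stopping tolerance of 5 × 10^-4), not a definitive implementation.

```python
# Sketch: Adam with a loss-based stopping criterion, followed by SciPy L-BFGS-B.
import numpy as np
import tensorflow as tf
from scipy.optimize import minimize

def train_two_step(loss_fn, variables, adam_tol=5e-4, max_adam_epochs=50000):
    # Step 1: Adam until the total loss drops below the prescribed value.
    opt = tf.keras.optimizers.Adam(learning_rate=2e-4)
    for _ in range(max_adam_epochs):
        with tf.GradientTape() as tape:
            loss = loss_fn()
        opt.apply_gradients(zip(tape.gradient(loss, variables), variables))
        if float(loss.numpy()) < adam_tol:
            break

    # Step 2: L-BFGS-B on the flattened parameter vector.
    shapes = [v.shape for v in variables]
    sizes = [int(tf.size(v)) for v in variables]

    def set_flat(x):
        offset = 0
        for v, s, n in zip(variables, shapes, sizes):
            v.assign(tf.reshape(tf.constant(x[offset:offset + n], dtype=v.dtype), s))
            offset += n

    def loss_and_grad(x):
        set_flat(x)
        with tf.GradientTape() as tape:
            loss = loss_fn()
        grads = tape.gradient(loss, variables)
        flat_grad = np.concatenate([g.numpy().ravel() for g in grads])
        return float(loss.numpy()), flat_grad.astype(np.float64)

    x0 = np.concatenate([v.numpy().ravel() for v in variables]).astype(np.float64)
    res = minimize(loss_and_grad, x0, jac=True, method='L-BFGS-B')
    set_flat(res.x)
    return res
```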

3. Data requirements and convergence properties of DNN methods

In this section, we study properties of the DNN methods, including data-


driven DNN, PINN–Darcy, and MPINN, for data assimilation in subsurface
transport problems described by Equations (7) and (8). We use a synthetic data
set where the conductivity field is computed as K(x) = 0.5 sin(4πx1 ) sin(4πx2 )+1
on the 256 × 128 uniform rectangular mesh. The synthetic h(x) and C(x) fields
are generated on the same mesh as numerical solutions of Equations (7) and (8).
These equations are solved using the finite-volume Subsurface Transport Over
Multiple Phase (STOMP) code [7]. The K(x), h(x), and C(x) fields are shown
in Figure 2, and in the following sections we refer to these fields as reference
fields and use them to test the accuracy of the considered DNN methods. We
select NK , Nh , and NC values of the K, h, and C fields at random locations on
the mesh as the measurements (training sets) of the respective fields. The rest
of the field values are used as the testing set to evaluate the accuracy of the
DNN approximations. In this case, all reference fields are nonlinear with the
h(x) and C(x) fields having the smallest and largest gradients, respectively.
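The reference conductivity field and the random training/testing split described above can be reproduced with a few lines of NumPy; the sketch below uses illustrative names and a fixed random seed.

```python
# Sketch: synthetic K field on the 256 x 128 grid and random measurement locations.
import numpy as np

nx, ny = 256, 128
x1, x2 = np.meshgrid(np.linspace(0.0, 1.0, nx), np.linspace(0.0, 0.5, ny), indexing='ij')
K_ref = 0.5 * np.sin(4 * np.pi * x1) * np.sin(4 * np.pi * x2) + 1.0

rng = np.random.default_rng(0)
N_K = 40                                            # number of K measurements
idx = rng.choice(nx * ny, size=N_K, replace=False)  # random measurement locations
x_K = np.column_stack([x1.ravel()[idx], x2.ravel()[idx]])
K_star = K_ref.ravel()[idx]                         # training set; the rest is the test set
```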
We quantify the accuracy of the DNN methods in terms of point errors and
the relative L2 errors that are defined as:
    ε^γ := ( ∫_Ω [γ(x) − γ̂(x, θγ)]² dx ) / ( ∫_Ω γ(x)² dx ),   for γ = K, h, C,            (11)

where γ(x) and γ̂(x, θγ ) denote the reference fields and the DNN approximations,
respectively.
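On the uniform mesh, the relative error (11) reduces to a ratio of sums over the grid points, e.g. (a sketch with an illustrative function name):

```python
import numpy as np

def relative_l2_error(field_ref, field_dnn):
    # Eq. (11): integrals approximated by sums over the uniform grid
    return np.sum((field_ref - field_dnn) ** 2) / np.sum(field_ref ** 2)
```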
We first investigate the effect of the DNN size nh × mh on the approximation
errors, where nh is the number of hidden layers and mh is the number of neurons

in each hidden layer. Note that all DNNs have a two-dimensional input layer
(corresponding to x1 and x2 ) and a one-dimensional output layer (corresponding
to scalar quantities K, h, or C).


Figure 2: Reference fields: (a) conductivity K, (b) hydraulic head h, and (c) concentration C.

3.1. Data-driven DNNs for parameter estimation

We test the accuracy of the data-driven DNN approach (i.e., regression) for
estimating K(x) to establish a baseline for comparison with the MPINN and
PINN–Darcy methods. The K̂ DNN sizes and the corresponding mean and
variance of the L2 errors are summarized in Table 1. The statistics of the L2
errors are computed from five simulations in which the DNNs are randomly
initialized using the Xavier algorithm. The size of the networks is varied by
changing the number of hidden layers nh , while the number of neurons per layer
is set to mh = 32. The regression errors decrease from more than 100% for
NK = 16 to less than 4% for NK = 96. We do not see a clear dependence of the
errors on nh . We attribute large errors for small NK to overfitting of the DNNs.

Table 1: The effect of L1 and L2 regularization on the accuracy of the data-driven DNN K(x)
estimation. The mean and standard deviation of ε^K as functions of the DNN size (the number
of hidden layers nh ) and the number of K measurements. The DNNs K̂(x, θK ) are trained
using the data-driven DNN method both with and without L1 or L2 regularization. The
corresponding standard deviations are given in parentheses.

Method    DNN size    NK = 16          NK = 32         NK = 48         NK = 64         NK = 80        NK = 96
DNN       3 × 32      156.1% (0.482)   53.5% (0.391)   44.6% (0.287)   17.5% (0.135)   6.0% (0.022)   3.2% (0.010)
DNN       4 × 32      206.9% (2.068)   64.9% (0.420)   42.8% (0.282)   11.4% (0.121)   4.5% (0.019)   3.3% (0.005)
DNN       5 × 32      128.0% (1.304)   53.1% (0.262)   49.7% (0.324)   21.4% (0.212)   4.3% (0.018)   3.0% (0.005)
DNN+L1    3 × 32      30.0% (0.051)    21.1% (0.030)   9.9% (0.023)    4.3% (0.008)    3.1% (0.011)   2.0% (0.002)
DNN+L1    4 × 32      28.6% (0.049)    23.0% (0.028)   11.7% (0.014)   5.3% (0.016)    2.6% (0.005)   1.7% (0.001)
DNN+L1    5 × 32      28.1% (0.046)    22.3% (0.057)   10.7% (0.029)   3.2% (0.009)    2.1% (0.003)   1.9% (0.002)
DNN+L2    3 × 32      26.4% (0.044)    20.0% (0.016)   7.7% (0.010)    3.38% (0.009)   2.8% (0.004)   2.0% (0.003)
DNN+L2    4 × 32      34.5% (0.124)    17.8% (0.031)   10.1% (0.009)   3.1% (0.004)    2.6% (0.007)   2.1% (0.003)
DNN+L2    5 × 32      28.8% (0.050)    19.5% (0.027)   9.6% (0.019)    2.9% (0.006)    2.5% (0.004)   1.9% (0.002)
Various regularization methods were introduced to reduce DNN overfitting,
including dropout [37] and L1 and L2 regularizers [38, 39]. In the L1 and L2
regularization methods, the L1 and L2 norms of the DNN parameters, multiplied
by the weight β, are added to the DNN loss function [30]. We investigate the effect of the
L1 and L2 regularizers on the regression errors for the considered examples. We
consider β = 10−5 , 10−6 , and 10−7 , and the best results are presented in Table
1. The L1 and L2 regularizers reduce errors to less than 35% for NK = 16 and
less than 2.1% for NK = 96.
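In a deep-learning framework, this amounts to adding a weighted norm of the network parameters to the data-misfit loss; the sketch below uses illustrative names and the β values considered above.

```python
# Sketch: L1 or L2 penalty with weight beta added to the data-misfit loss.
import tensorflow as tf

def regularized_loss(J_data, variables, beta=1e-6, norm='L2'):
    if norm == 'L1':
        penalty = tf.add_n([tf.reduce_sum(tf.abs(v)) for v in variables])
    else:
        penalty = tf.add_n([tf.reduce_sum(tf.square(v)) for v in variables])
    return J_data + beta * penalty
```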
We also find that the K error depends on the choice of a regularization
approach and the tuning coefficient β. We demonstrate below that the PINN–
Darcy and MPINN methods provide alternative, physics-based approaches to
regularizing DNN training. The advantage of using physics constraints is that the
resulting solutions satisfy the governing equations, while solutions obtained with
the L1 or L2 regularizations, in general, do not. In addition, the PINN–Darcy
and MPINN methods allow using indirect observations in addition to or instead of

direct observations in the K̂ DNN training.

3.2. The PINN–Darcy method

Here, we examine the accuracy of the PINN–Darcy approach, where the


measurements of conductivity and hydraulic head, as well as the Darcy equation
(7), are used to jointly train the K̂(x; θK ) and ĥ(x; θh ) DNNs. The mean and
standard deviations of ε^K and ε^h versus N = NK = Nh are plotted in Figures 3
(a) and (c), respectively, for Nfh = 0, 50, and 400. Note that the PINN–Darcy
method reduces to the data-driven DNN method for Nfh = 0. For each case,
the mean and standard deviation of errors are computed from the errors of five
DNNs, which are randomly initialized with the Xavier algorithm. In this test,
the measurements and residual point locations are randomly selected over the
domain. For comparison, we also estimate K and h using the data-driven DNN
and PINN–Darcy methods with L2 regularization; the corresponding ε^K and ε^h
errors are shown in Figures 3 (b) and (d), respectively.
As expected, the accuracy of the PINN–Darcy approach improves with in-
creasing N for all considered Nfh . For a relatively small number of measurements
(N < 80), the accuracy of the K approximation increases with increasing Nfh ,
i.e., both the mean ε^K and its standard deviation decrease with an increasing
number of residual points. The effect of enforcing the Darcy equation (i.e.,
having Nfh > 0) is especially pronounced for sparse data. For example, for N = 16,
ε^K in the PINN–Darcy method (Nfh = 400) is ≈ 0.2 versus ε^K ≈ 1.2 in the
data-driven DNN method. Note that for the same number of measurements, the
error in the estimated h is more than one order of magnitude smaller than that
in K for both the data-driven DNN and PINN–Darcy methods. The estimated
h field has smaller errors because of its near-linear behavior. Still, PINN–Darcy
yields a significantly smaller ε^h than the data-driven DNN method.
Notably, for a relatively large number of measurements (in this case, N > 80),
we observe that the K errors in PINN–Darcy are slightly larger than in the
data-driven DNN approach. There are several reasons for this, including: in this
example, N > 80 measurements are sufficient to accurately train K̂ without any

Figure 3: Mean of the ε^K error in the data-driven DNN and PINN–Darcy estimations of
K(x) (upper) and hydraulic head h(x) (bottom) as functions of N = NK = Nh and the
number of residual points Nfh . The right and left columns present results with and without L2
regularization, respectively. The bars correspond to one standard deviation of ε^K and quantify
uncertainty due to random initialization of DNNs. The K̂ and ĥ DNNs sizes are 5 × 32 and
3 × 32, respectively.

physics constraints, as made evident by the small ε^K ; the physics constraints in


PINN–Darcy make the loss function more complicated and harder to minimize;
and the physics model might not be exact. The synthetic h (and C) data are
sampled from a weak-form solution of the PDEs, while the strong form of the PDE
constraints is enforced in the loss function, which results in model errors. We
note that in real applications, measurements are sparse, and for sparse data, the
model errors are smaller than the regression errors; this is made evident by our
results for N < 80.

The L2 regularization significantly reduces the mean and standard deviation
of ε^K and ε^h . However, for N < 50, the PINN–Darcy method with Nfh =
400 provides more accurate results for both the K and h fields. Adding L2
regularization to the PINN method further reduces ε^K and ε^h , especially for a
relatively small number of residual points Nfh . Because the computational cost
of PINN–Darcy and MPINN increases with increasing Nfh , a combination of L2
regularization and physics constraints can potentially reduce the computational
cost of the PINN–Darcy and MPINN methods. Finally, we analyze the decay of
the loss functions during the K̂ DNN training in the data-driven DNN, data-
driven DNN with L2 regularization, and PINN–Darcy methods with NK = 36,
Nfh = 200, and the L-BFGS-B optimizer. These loss functions are shown in
Figure 4. The data-driven DNN method exhibits overfitting, as evident from
the small training error and large test error (see Figure 3 (c)), whereas both
PINN–Darcy and L2 regularizations prevent overfitting. We also see that the
L-BFGS-B optimizer is robust for these three approaches.


Figure 4: Loss functions in (a) data-driven DNN, (b) data-driven DNN with L2 regularization,
and (c) PINN–Darcy methods for estimating K(x) with NK = 36. In figure (c), Nh = 36,
Nfh = 200, J is the total loss, and JK and Jh are the parts of the loss function with respect to
K and h measurements, respectively. The DNN sizes are 5 × 32 for K̂ and 3 × 32 for ĥ.

3.3. The MPINN method

Here, we investigate the MPINN method for jointly training the K̂(x; θK ),
ĥ(x; θh ), and Ĉ(x; θC ) DNNs. Figure 5 shows the mean errors of the MPINN-
estimated fields as functions of NK , Nh , NC , and NfC . The number of points
where the residuals of the Darcy equation are minimized is set to Nfh = 200 in
all cases. For comparison, we also show the PINN–Darcy mean ε^K and ε^h errors.
For a small number of K and h measurements (N < 50), the MPINN method
reduces ε^K by approximately 25% and ε^h by ≈ 80% relative to the PINN–Darcy
method.

Figure 5: The relative L2 errors ε^K , ε^h , and ε^C in the MPINN estimation of (a) conductivity
K(x), (b) hydraulic head h(x), and (c) concentration C(x), respectively, versus the number
of measurements N = NK = Nh , NC , and the number of residual points NfC . Errors in the
PINN–Darcy K and h estimations are also provided in (a) and (b), respectively. In all cases,
Nfh = 200, and the K̂, ĥ, and Ĉ DNNs size is 5 × 32.
The MPINN method leads to an even bigger improvement in the C field
estimation relative to the data-driven DNN method. For example, for NC = 64,
ε^C is 0.02 in MPINN with NfC = 1000, while ε^C = 0.22 in the data-driven DNN
method. In addition, the C field estimation improves as NK and Nh increase.


Figure 6: Loss J as a function of the number of epochs in the MPINN joint training of
the K̂, ĥ, and Ĉ DNNs. Also shown are JfC (the part of the loss due to residual in the
advection–dispersion equation B.1(b)) and JK , Jh , and JC , which are parts of the loss function
with respect to K, h, and C measurements, respectively. The K̂, ĥ, and Ĉ DNNs size is 5 × 32
and NK = Nh = 36, NC = 64, Nfh = 200, and NfC = 1000.

This underscores the advantage of using MPINN instead of data-driven DNN


for data assimilation because it allows using indirect measurements to estimate
quantities of interest (e.g., using K and h measurements for estimating the C
field).
Finally, we note that the two-step Adam-L-BFGS-B minimization algorithm
must be used in MPINN (versus the one-step L-BFGS-B algorithm in the data-
driven DNN and PINN–Darcy methods) because adding multiphysics constraints
makes it more difficult to find a good minimum of the MPINN loss function.
To illustrate the need for the two-step approach, in Figure 6 we plot the (total)
MPINN loss function and the loss related to the PDE residual (Eq. (B.1b))
versus the number of epochs. In this case, we stop the Adam optimizer when the
total loss reaches a prescribed value of J = 5 × 10−4 and apply the L-BFGS-B
optimizer to achieve the final convergence of the loss function. The stochastic
gradient descent method in the Adam algorithm causes oscillations in the losses
that allow better DNN generalization. The quasi-Newton L-BFGS-B method
enables a higher rate of convergence to the minimum identified by the Adam

algorithm. The small final value of JfC indicates that the Ĉ, K̂, and ĥ DNNs
approximately satisfy the advection–dispersion equation (8).

(a) DNN: ε^K = 31.29%   (b) DNN + L2 Reg.: ε^K = 17.92%
(c) PINN–Darcy: ε^K = 16.06%   (d) MPINN: ε^K = 12.16%

Figure 7: Absolute point errors computed as the difference between the reference K(x) and
K̂(x, θK ) estimated with (a) data-driven DNN, (b) data-driven DNN with L2 regularization, (c)
PINN–Darcy, and (d) MPINN. In these simulations, NK = 36, Nh = 36, NC = 64, Nfh = 200,
and NfC = 1000. The locations of K measurements are denoted by black circles. Relative L2
errors ε^K are also provided for all DNN methods.

The distributions of absolute point errors in the K(x), h(x), and C(x) fields
learned with the data-driven DNN, data-driven DNN with L2 regularization,
PINN–Darcy, and MPINN methods are given in Figures 7–9. In this comparison
study, we use NK = Nh = 36, NC = 64, Nfh = 200, and NfC = 1000. As
expected from a regression method, the data-driven DNN method errors increase
as the distance from the measurement locations increases. L2 regularization helps
reduce these errors. The PINN–Darcy and MPINN methods further reduce point

(a) DNN: ε^h = 2.68%   (b) PINN–Darcy: ε^h = 1.47%   (c) MPINN: ε^h = 0.77%

Figure 8: Relative L2 error ε^h and absolute errors (differences) between the reference h(x)
and ĥ(x, θh ) trained with (a) data-driven DNN, (b) PINN–Darcy, and (c) MPINN. In these
examples, Nh = NK = 36, NC = 64, Nfh = 200, and NfC = 1000. Locations of h measurements
are denoted by black circles.

(a) DNN: ε^C = 21.16%   (b) MPINN: ε^C = 1.65%

Figure 9: Relative L2 errors ε^C and absolute point errors (differences) between the reference
C(x) and Ĉ(x, θC ) estimated with (a) data-driven DNN and (b) MPINN. In these examples,
NC = 64, NK = Nh = 36, Nfh = 200, and NfC = 1000. Locations of C measurements are
denoted by black circles.

errors, especially in parts of the domain with no measurements. For example, the
data-driven DNN method yields a poor approximation of C near the injection
point, with absolute point errors on the order of 0.1. In the MPINN method
with the same number of C measurements, the point errors in the same region
are on the order of 0.01.

4. DNN methods for estimating conductivity with complex correlation structure

Here, we investigate the performance of the DNN methods for estimating the
spatially correlated conductivity field K(x) = exp(Y (x)) with the exponential
covariance function CY (x, x′) = σ² exp(−||x − x′||/2λ²), where σ² and λ are
the variance and correlation length of Y (x), respectively. Specifically, we study
the performance of the DNN methods as a function of λ.
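One common way to draw such realizations, sketched below on a coarse illustrative grid, is to factorize the covariance matrix of Y(x) with a Cholesky decomposition and multiply it by a standard normal vector. The covariance expression simply follows the form written above, and all names and grid sizes are illustrative (the reference fields in this paper are generated on a 256 × 128 grid with the STOMP-based workflow described later).

```python
# Sketch: one log-normal conductivity realization K(x) = exp(Y(x)) via Cholesky sampling.
import numpy as np

def lognormal_field(coords, sigma2=1.0, lam=0.5, rng=np.random.default_rng(0)):
    # coords: (N, 2) array of grid-point coordinates
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    C = sigma2 * np.exp(-d / (2.0 * lam ** 2))        # covariance of Y, as written above
    L = np.linalg.cholesky(C + 1e-10 * np.eye(len(coords)))
    Y = L @ rng.standard_normal(len(coords))
    return np.exp(Y)                                   # K(x) = exp(Y(x))

x1, x2 = np.meshgrid(np.linspace(0, 1, 32), np.linspace(0, 0.5, 16), indexing='ij')
K = lognormal_field(np.column_stack([x1.ravel(), x2.ravel()]), lam=0.2)
```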

4.1. Optimal DNN size as a function of the correlation length λ

In Section 3, we showed that the network size affects the accuracy of the DNN
predictions, especially when the data is sparse. In this section, we study the
dependence of the optimal DNN size on the correlation length of the approximated
field.
We consider three K(x) fields generated as realizations of lognormal processes
with λ = 0.2, 0.5, and 1.0 (see Figure 10).

Table 2: The number of total tunable parameters corresponding to the DNN structure 3 × mh
as a function of mh .

mh 10 20 30 40 50 60 70 80 90 100

DOF 261 921 1981 3441 5301 7561 10221 13281 16741 20601

Here, we vary the DNN size by changing the number of neurons in each
hidden layer mh . The number of tunable parameters as a function of mh for the
chosen DNN architecture is given in Table 2. The conductivity fields in Figure
10 are generated on the domain Ω = [0, 1] × [0, 0.5] on a 256 × 128 grid with
32,768 grid points. Here, we use the values of K at 20,000 grid points to train
K̂(x, θK ) (without any physics constraints) and use the remaining K values to
evaluate the accuracy of K̂(x, θK ).
For this large number of measurements, we find that the L-BFGS-B algorithm
is not efficient for minimizing the loss function, especially for the field with λ = 0.2.


Figure 10: Reference conductivity fields with different correlation lengths: (a) λ = 0.2, (b)
λ = 0.5, and (c) λ = 1.0.

Therefore, we adopt the Adam method with an experimentally determined initial
learning rate of 0.0002 and a batch size of 1000. We find that 4 × 10^5 iterations are
needed to train K̂(x, θK ) for λ = 0.2, 3 × 10^5 iterations for λ = 0.5, and 2 × 10^5
iterations for λ = 1.0 to achieve a sufficiently low training error. Our results
show that fewer iterations are needed to train DNNs for smoother conductivity
fields (with larger correlation lengths).
Figure 11 shows the mean and standard deviation of ε^K as functions of mh
for the three correlation lengths. The statistical moments of ε^K are computed
from simulations with 10 different DNN initializations. Initially, the approxima-
tion error decreases as the DNN size increases because of the improved DNN
representation ability. We can see that a smaller DNN is sufficient to represent
a smoother field with larger correlation length. We also see that for fields with
the correlation lengths 0.5 and 1, the approximation error increases because of
overfitting once the DNN size exceeds the optimal size. For example, for the
K field with λ = 0.5 (see Figure 11 (b)), the smallest relative error of 0.52% is
reached at mh = 60. DNNs with mh < 60 are not representative enough, and
DNNs with mh > 60 cause overfitting. Therefore, we postulate that for a DNN
with three hidden layers, mh = 60 is optimal for this K field. For the K fields
with λ = 0.2 and λ = 1.0, the optimal DNN size is reached at mh = 90 and
mh = 40, respectively. For the K field with λ = 1.0, the minimum of the mean
error function (≈ 0.3%) is very shallow, as shown in Figure 11 (c). Therefore,
we select mh = 40 as the optimal DNN width because it results in the ε^K with
the smallest standard deviation.

Figure 11: The relative error of the DNN approximation as a function of mh (the number of
neurons in each hidden layer) for three conductivity fields with the correlation lengths: (a)
λ = 0.2, (b) λ = 0.5, and (c) λ = 1.0.

Figure 12: The optimal neural network size as a function of the correlation length λ.

Figure 12 shows the optimal DNN size as a function of λ. For the considered
range of λ, the DNN size decreases as a power of λ. It is important to note that
in addition to the correlation length of the modeled field, the optimal DNN size
depends on many other factors, including the type of activation function and
the number of hidden layers. In this study, we fix the number of hidden layers
and the activation function. Therefore, the results in Figure 12 might not apply
to other DNN architectures.

4.2. Data assimilation

Next, we compare the data-driven DNN, PINN–Darcy, and MPINN methods


for data assimilation as well as parameter and state estimation using the K,
h, and C measurements. The K fields shown in Figure 10 are used as the
ground truth. The reference h and C fields are generated as the solutions of the
Darcy and advection–dispersion equations using the STOMP code. Based on the
analysis in Section 4.1, we use the 3 × 60 DNN architecture for all three K fields
because this network size produces a reasonable fit for these fields. We adopt
the two-step training procedure, where the Adam algorithm with a learning rate
of 0.0002 is followed by the L-BFGS-B method with the threshold of 0.0005.

Figure 13: The relative error ε^K as a function of NK in the data-driven DNN, PINN–Darcy,
and MPINN methods in problems with (a) λ = 0.2, (b) λ = 0.5, and (c) λ = 1.0. In these
examples, Nh = 40, NC = 100, Nfh = 1000, NfC = 1000, and the DNNs size is 3 × 60.

Figure 13 compares the approximation errors in the data-driven DNN, PINN–
Darcy, and MPINN methods as functions of NK for the three conductivity fields.
In these simulations, we use Nh = 40, NC = 100, Nfh = 1000, and NfC = 1000.
For all three correlation lengths, we see that adding physics constraints
improves the accuracy of the DNN approximation of the K field. The biggest
reduction in the estimation error is achieved by adding h measurements and the
Darcy equation constraint, as shown by the comparison of the data-driven DNN
and PINN–Darcy estimation errors. Adding C measurements and advection–
dispersion equation constraints further reduces the approximation error. The
advantage of MPINN is especially pronounced for sparse data (small NK ) and

Table 3: Relative errors ε^h and ε^C for problems with λ = 0.2, 0.5, and 1.0. In these simulations,
Nh = 40, NC = 100, Nfh = 1000, NfC = 1000, and the DNNs size is 3 × 60.

                        ε^h (NK = 20, 40, 60, 80)            ε^C (NK = 20, 40, 60, 80)
λ = 0.2   DNN           4.72%                                18.25%
          PINN          6.82%   6.49%   4.74%   4.35%
          MPINN         6.71%   6.39%   5.19%   3.74%        8.60%   7.35%   5.91%   7.10%
λ = 0.5   DNN           1.75%                                18.65%
          PINN          0.94%   0.92%   0.75%   0.57%
          MPINN         1.04%   0.69%   0.75%   0.48%        2.02%   1.14%   1.41%   1.36%
λ = 1.0   DNN           1.28%                                16.72%
          PINN          6.43%   2.58%   0.72%   0.60%
          MPINN         2.53%   0.95%   0.74%   0.63%        3.51%   1.20%   1.13%   1.35%

small correlation lengths. For example, for λ = 0.2 and NK = 20, the ε^K errors
are 1.8, 0.66, and 0.57 in the data-driven DNN, PINN–Darcy, and MPINN
methods, respectively.
Table 3 lists ε^h and ε^C as functions of NK for the data-driven DNN, PINN–
Darcy, and MPINN methods and the K fields with λ = 0.2, 0.5, and 1. We
385 can see here that the PINN–Darcy and MPINN methods provide significantly
improved hydraulic head and concentration estimations compared to the data-
driven DNN method. We note that Nh and NC are fixed in this comparison
study, and the data-driven DNN hydraulic head and concentration estimates
do not depend on NK . Moreover, the PINN–Darcy and MPINN estimations of
ε^h and ε^C improve with increasing NK . This demonstrates the capability of the
physics-informed DNNs to learn from indirect measurements. The improvements
are particularly pronounced for estimating (highly nonlinear) C(x), e.g., for
λ = 1 and NC = 100, ε^C decreases from 16.72% in the data-driven DNN to
1.35% in MPINN.
Figures 14–16 show the K̂(x), ĥ(x), and Ĉ(x) DNNs estimated with the

data-driven DNN, PINN–Darcy, and MPINN methods, where λ = 0.5, NK = 40,
Nh = 40, NC = 100, Nfh = 1000, and NfC = 1000. In Figure 14, the comparison
of the estimated and reference K fields shows that PINN–Darcy significantly
improves the data-driven DNN prediction. MPINN further improves the K
estimation, as indicated by the smaller ε^K . The data-driven DNN approximation
near the upper left corner significantly differs from the ground truth K field due
to the lack of measurements in this region. However, the approximation error
around this area is greatly reduced in the PINN–Darcy and MPINN methods,
which leverage indirect observations (i.e., head and concentration observations)
located in this area, as shown in Figures 14 (c) and (d).

(a) Reference   (b) ε^K = 35.62%
(c) ε^K = 8.08%   (d) ε^K = 6.62%

Figure 14: (a) The reference K field (λ = 0.5) and the relative L2 error ε^K and absolute point
errors in K̂(x, θK ) trained with the (b) data-driven DNN, (c) PINN–Darcy, and (d) MPINN
methods. Locations of K measurements are denoted by black circles.

Although a relatively good approximation of h can be obtained with all


three methods (see Figure 15), we still observe some improvements using the

PINN–Darcy and MPINN approaches. For the highly nonlinear C field, the
data-driven DNN estimate is significantly less accurate than that found using
MPINN in terms of both the point and L2 errors, as shown in Figure 16. Notably,
MPINN is able to accurately describe the eye of the concentration plume with
very few direct measurements near this region. Once again, this demonstrates
that MPINN can use sparse direct and indirect measurements in combination
with PDEs to capture local features that otherwise cannot be described with
only direct measurements.

(a) Reference   (b) ε^h = 1.75%
(c) ε^h = 0.92%   (d) ε^h = 0.69%

Figure 15: (a) The reference h field and the relative L2 errors ε^h and absolute errors in ĥ(x, θh )
trained with the (b) data-driven DNN, (c) PINN–Darcy, and (d) MPINN methods. Locations
of h measurements are denoted by black circles.
Finally, we investigate whether using the optimal-size K̂ DNN, as determined
in Section 4.1, would reduce the error in the estimated K, h, and C fields. As
an example, we choose the case with λ = 0.2. According to Figure 11 (a),
the optimal K̂ size for a field with λ = 0.2 is mh = 90. Table 4 presents the

estimation errors for the K̂ DNNs with mh = 60, 90, and 120. In this comparison
study, we fix the ĥ and Ĉ DNNs’ size at mh = 60 and use NK = 80, Nh = 40,
NC = 100 measurements, and Nfh = 1000 and NfC = 1000 residual points. We
can see that the optimal-size K̂ produces the smallest estimation errors not only
for the K field but also for the h field in the data-driven DNN, PINN–Darcy, and
MPINN methods. For the C field, the smallest error is achieved with mh = 60
in the K̂ DNN. This indicates that a smaller estimation error in K and h does
not always translate to a smaller error in C.

(a) Reference   (b) ε^C = 18.65%   (c) ε^C = 1.14%

Figure 16: (a) The reference C field, the relative L2 errors ε^C , and absolute errors in Ĉ(x, θC )
trained with the (b) data-driven DNN and (c) MPINN methods. Locations of C measurements
are denoted by black circles.

Table 4: The relative errors ε^K, ε^h, and ε^C in the data-driven DNN, PINN–Darcy, and MPINN
methods as functions of mh in the K̂ DNN. The DNN architecture of the K̂, ĥ, and Ĉ DNNs
is 3 × mh ; and mh = 60 in the ĥ and Ĉ DNNs. In these examples, NK = 80, Nh = 40,
NC = 100, Nfh = 1000, NfC = 1000, and λ = 0.2.

                 mh = 60                     mh = 90                     mh = 120
         ε^K     ε^h     ε^C        ε^K      ε^h     ε^C        ε^K      ε^h     ε^C
DNN      64.8%                      54.2%                       60.5%
PINN     49.2%   4.35%              48.5%    3.95%              51.7%    4.40%
MPINN    41.9%   3.74%   7.10%      40.2%    3.64%   11.3%      53.1%    4.02%   8.95%

5. Conclusion

In this study, we presented the MPINN approach for data assimilation with
a focus on parameter and state estimation in subsurface transport problems. In
this approach, all unknown space-dependent parameters and states are modeled
with DNNs that are jointly trained by minimizing the loss function containing
the multiphysics data (e.g., conductivity, hydraulic head, and concentration
measurements) and the associated physics constraints, including the Darcy
and advection–dispersion equations. As a result, the DNNs can be trained
using indirect measurements and underlying physics in an unsupervised learning
fashion, which is important when the data is sparse.
We compared three DNN methods: (1) the pure data-driven DNN approach,
which only uses data to train DNNs; (2) the PINN approach, called "PINN–
Darcy," which utilizes the conductivity and hydraulic head measurements and the
Darcy equation; and (3) the MPINN approach, which combines the conductivity,
head, and concentration measurements with the Darcy and advection–dispersion
equations.
Our numerical results show that both physics-informed methods (PINN–
Darcy and MPINN) are significantly more accurate for parameter estimation
than the data-driven DNN method; the physics-informed methods provide
regularization and reduce the uncertainty in DNN predictions, especially when the

direct measurements are limited. Furthermore, MPINN yields better parameter
and state estimation than PINN–Darcy.
We investigated the effect of the neural network size on the accuracy of
parameter and state estimation as a function of the correlation length of the
modeled K field. We demonstrated that in pure data-driven regression, small
and large networks might result in poor representability or overfitting, and that
an optimal DNN size increases with decreasing correlation length. The physics
constraints and added measurements reduce dependence of the DNN prediction
on the DNN size given that the DNN is large (representative) enough. However,
for a small number of measurements, we demonstrated that an optimal-size DNN
outperforms both the larger and smaller DNNs.
In subsurface applications, data is usually sparse and is often indirect. There-
fore, the MPINN approach offers a flexible and unified framework to deal with
sparse and multiphysics data. Because the proposed method involves training
DNNs by minimizing the loss function, the performance of training algorithms is
crucial. In our study, we found that introducing nonlinear PDE constraints into
the loss function increases the computational cost of training. Application of the
physics-informed DNNs to large-scale problems will require access to multi-GPU
computers and scalable training algorithms. The selection of training algorithms
and hyperparameters (learning rate, architecture of DNNs, etc.) should also be
studied in more detail.

Acknowledgements

This research was partially supported by the U.S. Department of Energy
(DOE) Advanced Scientific Computing Research (ASCR) program. PNNL is operated by
Battelle for the DOE under Contract DE-AC05-76RL01830.

Appendix A. DNN approximation

In MPINN, we use a fully connected feed-forward network architecture known


as multilayer perceptrons, where the basic computing units (neurons) are stacked

in layers, as shown in Figure A.17.

Figure A.17: Schematic representation of a feed-forward deep neural network.

The DNN approximation û(x; θ) of a function u(x) is given as:

    u(x) ≈ û(x; θ) = y^{nl+1}(y^{nl}(...(y^2(x)))),                        (A.1)

where (·)̂ denotes the DNN approximation, and

    y^2(x) = σ(W^1 x + b^1)
    y^3(y^2) = σ(W^2 y^2 + b^2)
    ...                                                                    (A.2)
    y^{nl}(y^{nl−1}) = σ(W^{nl−1} y^{nl−1} + b^{nl−1})
    y^{nl+1}(y^{nl}) = W^{nl} y^{nl} + b^{nl}.

The first layer is called the input layer, and the last layer is the output layer,
while all the intermediate layers are known as hidden layers. Here, nl denotes
the number of hidden layers, σ is the predefined activation function, x ∈ Rd
denotes the input (d is the number of spatial dimensions), y nl +1 is the output
vector, and θ denotes all weight and bias parameters in the DNN approximation
of u:
θ = {W 1 , W 2 , ..., W nl , b1 , b2 , ..., bnl }. (A.3)

In the "data-driven" approach, θ is directly estimated from the measurements of
u by minimizing the loss function L(θ) = Σ_{x ∈ T_u} (û(x; θ) − u*(x))²:

    θ = arg min_θ Σ_{x ∈ T_u} (û(x; θ) − u*(x))²,                          (A.4)

where Tu = {x1 , x2 , ..., x|Tu | } ⊂ Ω denotes a set of measurement locations,


Ω ⊂ Rd is the domain of the function u, and u∗ (x), x ∈ Tu are the measured
values of u at these locations.
Some of the commonly used activation functions include logistic sigmoid,
hyperbolic tangent, ReLU, and leaky ReLU. Because the objective of this study
is to approximate differentiable functions (space-dependent parameters and state
variables in partial differential equations), we adopt the hyperbolic tangent
activation function σ(x) = tanh(x), which is infinitely differentiable.
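A plain NumPy sketch of the forward map (A.1)–(A.2), with tanh hidden layers and Xavier-initialized weights, is given below; layer sizes and names are illustrative.

```python
# Sketch: feed-forward approximation (A.1)-(A.2) with tanh activations.
import numpy as np

def init_params(layer_sizes, rng=np.random.default_rng(0)):
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # Xavier (Glorot) initialization of the weights
        W = rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))
        b = np.zeros(n_out)
        params.append((W, b))
    return params

def forward(params, x):
    y = x
    for W, b in params[:-1]:
        y = np.tanh(W @ y + b)     # hidden layers, Eq. (A.2)
    W, b = params[-1]
    return W @ y + b               # linear output layer

params = init_params([2, 32, 32, 32, 1])   # 2-D input, three hidden layers, scalar output
u_hat = forward(params, np.array([0.3, 0.1]))
```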

Appendix B. Multiphysics-informed neural networks for coupled Darcy


flow and advection–dispersion equations

Next, the residuals of Equations (7) and (8) are expressed in terms of θK ,
θh , and θC as:

    f^h(x; θK, θh) = ∇ · [K̂(x; θK) ∇ĥ(x; θh)],                                              (B.1a)

    f^C(x; θK, θh, θC) = −(1/φ) K̂(x; θK) ∇ĥ(x; θh) · ∇Ĉ(x; θC) − ∇ · [D ∇Ĉ(x; θC)].         (B.1b)
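As an illustration of how such residuals are evaluated with AD, the sketch below computes the Darcy residual (B.1a) for a batch of residual points using nested TensorFlow gradient tapes; `K_net` and `h_net` stand for the surrogate networks of Eq. (10), and the advection–dispersion residual (B.1b) can be assembled analogously. This is a sketch rather than the authors' implementation.

```python
# Sketch: Darcy residual f^h = div(K grad h) evaluated with automatic differentiation.
import tensorflow as tf

def darcy_residual(x, K_net, h_net):
    # x: tensor of residual-point coordinates, shape (N, 2)
    with tf.GradientTape() as outer:
        outer.watch(x)
        with tf.GradientTape() as inner:
            inner.watch(x)
            h = h_net(x)
        grad_h = inner.gradient(h, x)          # [dh/dx1, dh/dx2], shape (N, 2)
        flux = K_net(x) * grad_h               # K * grad(h), shape (N, 2)
    flux_jac = outer.batch_jacobian(flux, x)   # d(flux_i)/d(x_j), shape (N, 2, 2)
    # divergence of the flux: d(flux_1)/dx1 + d(flux_2)/dx2
    return flux_jac[:, 0, 0] + flux_jac[:, 1, 1]
```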

To enforce the Neumann boundary conditions for Equations (7) and (8), we
define DNNs that approximate fluxes at the boundaries:

    f^h_{N1}(x; θK, θh) = −K̂(x) ∂ĥ(x)/∂x1 − q,
    f^h_{N2}(x; θK, θh) = −K̂(x) ∂ĥ(x)/∂x2,                                 (B.2)

and

    f^C_{N1}(x; θC) = ∂Ĉ(x)/∂x1,
    f^C_{N2}(x; θC) = ∂Ĉ(x)/∂x2.                                           (B.3)

32
The loss function is then defined as:

    J(θK, θh, θC) = Jd(θK, θh, θC) + J^h_f(θK, θh) + J^C_f(θK, θh, θC)
                    + J^h_{N1}(θK, θh) + J^h_{N2}(θK, θh) + J^C_{N1}(θC) + J^C_{N2}(θC)      (B.4)
                    + J^h_b(θh) + J^C_b(θC),

where the loss due to mismatch with data is

    Jd(θK, θh, θC) = (1/NK) Σ_{i=1}^{NK} [K̂(x_i^K; θK) − K_i*]²
                     + (1/Nh) Σ_{i=1}^{Nh} [ĥ(x_i^h; θh) − h_i*]²                            (B.5)
                     + (1/NC) Σ_{i=1}^{NC} [Ĉ(x_i^C; θC) − C_i*]²,

and the losses due to the partial differential equation (PDE) constraints and boundary
conditions are:

    J^h_f(θK, θh) = (1/|T^h_f|) Σ_{x ∈ T^h_f} [f^h(x; θK, θh)]²,
    J^C_f(θK, θh, θC) = (1/|T^C_f|) Σ_{x ∈ T^C_f} [f^C(x; θK, θh, θC)]²,
    J^h_{N1}(θK, θh) = (1/|T^h_{N1}|) Σ_{x ∈ T^h_{N1}} [f^h_{N1}(x; θK, θh)]²,
    J^h_{N2}(θK, θh) = (1/|T^h_{N2}|) Σ_{x ∈ T^h_{N2}} [f^h_{N2}(x; θK, θh)]²,               (B.6)
    J^C_{N1}(θC) = (1/|T^C_{N1}|) Σ_{x ∈ T^C_{N1}} [f^C_{N1}(x; θC)]²,
    J^C_{N2}(θC) = (1/|T^C_{N2}|) Σ_{x ∈ T^C_{N2}} [f^C_{N2}(x; θC)]²,
    J^h_b(θh) = (1/|T^h_b|) Σ_{x ∈ T^h_b} [ĥ(x; θh) − h*(x)]²,
    J^C_b(θC) = (1/|T^C_b|) Σ_{x ∈ T^C_b} [Ĉ(x; θC) − C*(x)]².

In Equation (B.6), PDEs (7) and (8) are enforced at the residual points given
by the sets Tfh and TfC , respectively, where |Tfh | = Nfh and |TfC | = NfC . The
terms with the subscripts N1 or N2 enforce the Neumann boundary conditions,
and those with the subscript b enforce the Dirichlet boundary conditions.

[1] D. J. Hartmann, E. J. Beaumont, Predicting Reservoir System Quality and


Performance, in: Exploring for Oil and Gas Traps, 1999.

[2] M. K. Hubbert, D. G. Willis, Mechanics of hydraulic fracturing.

[3] E. Barbier, Geothermal energy technology and current status: An overview
(2002). doi:10.1016/S1364-0321(02)00002-3.

[4] J. C. Helton, Uncertainty and sensitivity analysis techniques for use in per-
formance assessment for radioactive waste disposal, Reliability Engineering
and System Safety. doi:10.1016/0951-8320(93)90097-I.

[5] A. I. Rajib, G. A. Assumaning, S.-Y. Chang, E. B. Addai, Use of Multiple


Data Assimilation Techniques in Groundwater Contaminant Transport
Modeling, Water Environment Research 89 (11) (2017) 1952–1960. doi:
10.2175/106143017x15051465918930.

[6] J. Bear, A. H.-D. Cheng, Modeling groundwater flow and contaminant


transport, Vol. 23, Springer Science & Business Media, 2010.

[7] M. D. White, M. Oostrom, R. J. Lenhard, Modeling fluid flow and


transport in variably saturated porous media with the STOMP simula-
tor. 1. Nonvolatile three-phase model description, Advances in Water Re-
sources. doi:10.1016/0309-1708(95)00018-E.

[8] R. J. Hoeksema, P. K. Kitanidis, An Application of the Geostatistical Ap-


proach to the Inverse Problem in Two-Dimensional Groundwater Modeling,
Water Resources Research. doi:10.1029/WR020i007p01003.

[9] J. A. Vrugt, P. H. Stauffer, T. Wöhling, B. A. Robinson, V. V. Vesselinov,


Inverse modeling of subsurface flow and transport properties: A review with
new developments (2008). doi:10.2136/vzj2007.0078.

[10] M. M. Rajabi, B. Ataie-Ashtiani, C. T. Simmons, Model-data interaction in


groundwater studies: Review of methods, applications and future directions,
Journal of Hydrology 567 (September) (2018) 457–477. doi:10.1016/j.
jhydrol.2018.09.053.
URL https://doi.org/10.1016/j.jhydrol.2018.09.053

[11] G. Evensen, Sequential data assimilation with a nonlinear quasi-geostrophic
model using Monte Carlo methods to forecast error statistics, Journal of
Geophysical Research.

[12] P. Rayner, A. M. Michalak, F. Chevallier, Fundamentals of Data Assim-


ilation, Geoscientific Model Development Discussions (July) (2016) 1–21.
doi:10.5194/gmd-2016-148.

[13] G. Evensen, The Ensemble Kalman Filter: Theoretical formulation and prac-
tical implementation, Ocean Dynamics. doi:10.1007/s10236-003-0036-9.

[14] P. L. Houtekamer, H. L. Mitchell, A sequential ensemble kalman filter for


atmospheric data assimilation, Monthly Weather Review 129 (1) (2001)
123–137.

[15] G. Christakos, Methodological developments in geophysical assimilation


modeling, Reviews of Geophysics 43 (2).

[16] Q. Zheng, J. Zhang, W. Xu, L. Wu, L. Zeng, Adaptive Multifidelity Data


Assimilation for Nonlinear Subsurface Flow Problems, Water Resources
Research 55 (1) (2019) 203–217. doi:10.1029/2018WR023615.

[17] J. A. Vrugt, B. A. Robinson, V. V. Vesselinov, Improved inverse modeling


for flow and transport in subsurface media: Combined parameter and state
estimation, Geophysical research letters 32 (18).

[18] Y. Liu, H. V. Gupta, Uncertainty in hydrologic modeling: Toward an


integrated data assimilation framework, Water Resources Research 43 (7)
(2007) 1–18. doi:10.1029/2006WR005756.

[19] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Automatic


differentiation in machine learning: a survey. arXiv:1502.05767, doi:10.
1016/j.advwatres.2018.01.009.
URL http://arxiv.org/abs/1502.05767

[20] B. Ramsundar, R. B. Zadeh, TensorFlow for deep learning: from linear
regression to reinforcement learning, " O’Reilly Media, Inc.", 2018.

[21] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin,


A. Desmaison, L. Antiga, A. Lerer, Automatic differentiation in pytorch.

[22] J. Bongard, H. Lipson, Automated reverse engineering of nonlinear dynami-


cal systems, Proceedings of the National Academy of Sciences of the United
States of Americadoi:10.1073/pnas.0609476104.

[23] S. L. Brunton, J. L. Proctor, J. N. Kutz, Discovering governing equations


from data: Sparse identification of nonlinear dynamical systems, Proceedings
of the National Academy of Sciences 113 (15) (2016) 3932–3937. arXiv:
1509.03580, doi:10.1073/pnas.1517384113.

[24] M. Raissi, P. Perdikaris, G. E. Karniadakis, Physics-informed neural net-


works: A deep learning framework for solving forward and inverse problems
involving nonlinear partial differential equations, Journal of Computational
Physics 378 (2019) 686–707. doi:10.1016/j.jcp.2018.10.045.
URL https://doi.org/10.1016/j.jcp.2018.10.045

[25] I. E. Lagaris, A. Likas, D. I. Fotiadis, Artificial Neural Networks for Solving


Ordinary and Partial Differential Equations 9 (5) (1997) 26. arXiv:9705023,
doi:10.1109/72.712178.
URL http://arxiv.org/abs/physics/9705023

[26] E. Weinan, B. Yu, The Deep Ritz Method: A Deep Learning-Based Nu-
merical Algorithm for Solving Variational Problems, Communications in
Mathematics and Statistics 6 (1) (2018) 1–14. arXiv:arXiv:1710.00211v1,
doi:10.1007/s40304-018-0127-z.

[27] A. M. Tartakovsky, C. O. Marrero, P. Perdikaris, G. D. Tartakovsky,


D. Barajas-Solano, Learning Parameters and Constitutive Relationships
with Physics Informed Deep Neural Networks. arXiv:1808.03398.
URL http://arxiv.org/abs/1808.03398

[28] L. Lu, X. Meng, Z. Mao, G. E. Karniadakis, DeepXDE: A deep learning
library for solving differential equations (2019) 1–17. arXiv:1907.04502.
URL http://arxiv.org/abs/1907.04502

[29] K. Rudd, S. Ferrari, E. J. Shaughnessy, J. D. Albertson, X. Sun, Solving


Partial Differential Equations Using Artificial Neural Networks, Ph.D. thesis,
Duke University (2013). doi:http://dx.doi.org/10.1002/hbm.21514.
URL http://lisc.mae.cornell.edu/PastThesis/KeithRuddPhD.pdf

[30] M. A. Nabian, H. Meidani, Physics-Driven Regularization of Deep Neu-


ral Networks for Enhanced Engineering Design and Analysis, Journal of
Computing and Information Science in Engineering 20 (1) (2020) 1–10.
arXiv:1810.05547, doi:10.1115/1.4044507.

[31] D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization (2014)
1–15. arXiv:1412.6980.
URL http://arxiv.org/abs/1412.6980

[32] R. H. Byrd, P. Lu, J. Nocedal, C. Zhu, A Limited Memory Algorithm


for Bound Constrained Optimization, SIAM Journal on Scientific Comput-
ing. doi:10.1137/0916069.

[33] J. Berg, K. Nyström, A unified deep artificial neural network approach to


partial differential equations in complex geometries, Neurocomputing 317
(2018) 28–41. arXiv:1711.06464, doi:10.1016/j.neucom.2018.06.056.
URL https://doi.org/10.1016/j.neucom.2018.06.056

[34] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, A. Y. Ng, On opti-


mization methods for deep learning, Proceedings of the 28th International
Conference on Machine Learning, ICML 2011 (2011) 265–272.

[35] E. Jones, T. Oliphant, P. Peterson, Others, SciPy: Open source scientific


tools for Python (2001).
600 URL http://www.scipy.org/

[36] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedfor-
ward neural networks, in: Journal of Machine Learning Research, 2010.

[37] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov,


Dropout: A simple way to prevent neural networks from overfitting, Journal
of Machine Learning Research.

[38] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning,


Vol. 1, Springer series in statistics, New York, 2009. arXiv:1010.3003,
doi:10.1007/b94608.

[39] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, Cambridge, MA,


2016.
