URTEC-208344-MS 3D Seismic Facies Classification On CPU and GPU HPC Clusters
This paper was prepared for presentation at the SPE/AAPG/SEG Asia Pacific Unconventional Resources Technology Conference to be held virtually on 16–18 November 2021.
The URTeC Technical Program Committee accepted this presentation on the basis of information contained in an abstract submitted by the author(s). The contents
of this paper have not been reviewed by URTeC and URTeC does not warrant the accuracy, reliability, or timeliness of any information herein. All information is the
responsibility of, and, is subject to corrections by the author(s). Any person or entity that relies on any information obtained from this paper does so at their own
risk. The information herein does not necessarily reflect any position of URTeC. Any reproduction, distribution, or storage of any part of this paper without the written
consent of URTeC is prohibited.
Abstract
The commonly used method of analyzing 3D seismic facies classification data by stitching together 2D
cross-sections can produce unrealistic discontinuities in geological features. Depending on the direction
in which the 2D cross-sections are taken, some features such as depositional geomorphology, channel
boundaries, and faults might not be fully visible, resulting in misleading labeled data and incorrect
interpretations. Hence, in this work, we propose the application of 3D machine learning models to solve
the problem of seismic facies classification. This introduces a two-fold challenge: first, using 3D models
substantially increases the memory requirements of the computational framework; second, neural network
design becomes increasingly challenging due to the higher number of parameters in the model and its
larger training time. We utilize distributed deep learning techniques in order to address these challenges,
and efficiently train 3D deep learning models for seismic facies classification on Microsoft Azure High-
Performance Computing (HPC) clusters. Using those techniques, we were able to train 3D networks with
millions of trainable parameters within a span of 3 hours, enabling rapid hyperparameter tuning and evaluation of different network architectures. We found that the networks performed better when the 3D seismic
input cuboids (and their corresponding labels) were longer along the depth dimension compared to the
X and Y axes. Data augmentation through the non-uniform overlap of the training cuboids (with more
overlap in areas of greater geological heterogeneities) was also shown to improve training performance.
Overall, domain knowledge of the problem along with distributed computing techniques helped improve
the efficiency and performance of deep learning-based 3D seismic facies classification.
Introduction
Seismic data plays a key role in understanding the subsurface, especially for the exploration and production industry. Apart from helping to identify structural features, seismic data is also used to identify stratigraphic features that are potential targets for drilling. The identification of different geological features from a subsurface seismic image is known as seismic facies classification. The identified seismic facies are used to build a geological model of the subsurface, which helps in the exploration for minerals and hydrocarbons. In traditional workflows, specialized seismic interpreters process the 3D data to identify seismic facies. In recent times, large 3D seismic datasets that cover several hundreds of km2 have become commonly available. Interpreting such large datasets manually is challenging, time-consuming, and subjective, as it relies heavily on the expertise of the interpreter.
With the advancements in machine learning, several researchers have attempted to solve the seismic
facies classification problem as a semantic segmentation task using machine-learning algorithms (Ulku
2020). Semantic segmentation is a fundamental problem in the fields of computer vision and machine
learning (Minaee 2020, Zhou Z. 2020, Rippel 2020).
Figure 1—A channel cut in the subsurface is a potential target during exploration (left).
Channel cuts are visible in 2D slices that are perpendicular to the channel axis (middle) but
cannot be identified in slices taken along the channel axis (right). Source: (McHargue 2011)
Recently, a few studies have been published that used 3D CNNs for geophysical problems, some of them specifically applied to seismic facies classification (Liu 2020, Pradhan 2020). In these works,
the entire seismic cube is divided into small sub-cubes with a maximum size of 64x64x64 (a total of 262,144 pixels) to overcome the computational challenges. However, the authors mention the need for larger input
sizes, which would provide a large enough receptive field and contextual information to distinguish between
distinct types of facies. In cases where the cube size is larger, the computational time to run 100 epochs of
training using a machine with 128 GB RAM and two 32GB Tesla V100 GPUs is around 48 hours (Pradhan
2020).
In this work, we have used distributed deep learning to address both challenges: large memory
requirements and long compute times. In the remainder of the paper, we will discuss the problems of
3D semantic segmentation, 3D seismic facies classification, and provide an overview of the 3D U-Net
network architecture. We will elaborate on its performance and memory footprint and discuss data-parallel
distributed deep-learning strategies and implementation details on CPU and GPU-based HPC clusters.
Finally, we will provide functional specifications of Intel, AMD, and NVIDIA HPC clusters on Microsoft
Azure, and report strong scaling and accuracy metrics for experiments performed on these clusters.
Convolutional neural networks (CNNs), and in particular the U-Net architecture, proved effective for the problem of medical image segmentation and were shown to perform well with few training images (Ronneberger 2015). The key idea was to add convolutional layers to the up-sampling branch, making the decoder section symmetric to the encoder, which gives the network its "U" shape. In addition, features from the decoder were combined with those from the encoder via skip connections, as shown in Figure 2, which was shown to improve localization.
Memory utilization during training is driven by three factors: (1) the number of model parameters in the network; (2) the mini-batch size; and (3) the sizes of temporary tensors created during forward and back-propagation. In
the case of U-Nets, assuming the feature maps have fixed sizes, the number of training parameters in the
model will be closely related to the network depth, defined here as the number of "levels" with symmetric
encoder/decoder blocks linked by skip connections (Figure 2). As their name suggests, skip connections
feed output from early layers in the network into deeper layers, thus skipping over certain operations. This
has been shown to significantly alleviate issues of accuracy degradation and gradient vanishing in deep
neural networks (He 2015).
Figure 3a shows how peak memory utilization increases with network depth while training U-Nets using
images of size 256x256x256 pixels (the batch size was fixed at one). It also shows the wall-clock time (in
seconds) to process one sample of size 256x256x256 pixels as a function of network depth. On the other
hand, Figure 3b shows peak memory and per-epoch wall-clock time during training of a depth-4 U-Net for
the same dataset as functions of input sample size (the batch size was fixed at 4 samples). Notice how peak
memory exceeds the capacity of the 24GB Tesla RTX6000 GPU for sample size 128x128x128 pixels, and
far exceeds even the capacity of the state-of-the-art 32GB Tesla V100 GPU for input size 256x256x256
pixels.
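This growth can be reproduced with back-of-the-envelope arithmetic. The sketch below estimates the activation footprint of a 3D U-Net under an illustrative schedule (32 base channels doubling per level, spatial size halving per level, two stored convolution outputs per encoder and decoder level, float32); the paper's exact network may differ, and real frameworks also keep gradients and workspace buffers, so this is a rough lower bound:

```python
def unet3d_activation_bytes(side, depth, base_ch=32, batch=1):
    """Rough float32 activation footprint (bytes) of a 3D U-Net.

    Illustrative assumptions: spatial size halves and channel count
    doubles at each level; two convolution outputs are stored per
    encoder level, mirrored in the decoder. A lower bound only.
    """
    total = 0
    for level in range(depth + 1):
        voxels = (side // 2 ** level) ** 3
        channels = base_ch * 2 ** level
        # encoder and decoder each store two conv outputs; the
        # bottleneck level has no decoder twin
        stored_maps = 4 if level < depth else 2
        total += stored_maps * channels * voxels * 4  # 4 bytes per float32
    return total * batch

gib = 1024 ** 3
print(unet3d_activation_bytes(256, depth=4, batch=4) / gib)  # tens of GiB
```

Even under these optimistic assumptions, a depth-4 network on 256x256x256 inputs with batch size 4 needs activations alone well beyond the 32 GB of a Tesla V100, consistent with the measurements in Figure 3.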
The results above demonstrate how infeasible it can be to run large-scale 3D segmentation problems on
single-GPU nodes, or even on small multi-GPU clusters, due to their limited device memory. In the next
section, we describe the distributed deep-learning solution we developed to address this challenge and show
how it enables massively parallel deep-learning jobs on HPC clusters with hundreds or thousands of CPU
or GPU cores.
Distributed Deep-Learning Solution. Our solution consists of several software components that
work together to allow seamless and efficient execution of parallel deep-learning workloads on cloud
infrastructure.
One of the most widely used techniques for performing distributed deep-learning training is the so-called data-parallel strategy, in which identical copies of the model are simultaneously trained by independent workers to minimize a common objective function (Ben-nun 2018). Such functions (e.g., root-mean-square error, binary cross-entropy, etc.) compute an error between the network output and the expected (target)
data. For this to be possible, the training data samples (and their corresponding labels, in case of supervised
learning) must be equally split among the workers. Since stochastic optimization-based training already
entails splitting the data into mini-batches, this means one must further split the mini-batches into local mini-
batches, which are then asynchronously processed via forward and back-propagation steps. Local gradients
are computed by each worker and collectively averaged via an all-reduce operation. Once each worker
possesses the global gradient vector, they invoke the optimizer to update their local network parameters,
which are now in-sync with every other worker. This process is illustrated in Figure 4.
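The worker loop described above can be sketched in a few lines. The fragment below simulates p workers in a single process, with the all-reduce emulated by a plain average over the local gradients (the function names and the least-squares objective are illustrative assumptions, not the paper's actual code):

```python
import numpy as np

def data_parallel_step(params, worker_batches, grad_fn, lr=0.1):
    """One synchronous data-parallel SGD step over p simulated workers."""
    # each worker runs forward/back-propagation on its local mini-batch
    local_grads = [grad_fn(params, batch) for batch in worker_batches]
    # all-reduce: every worker ends up holding the averaged gradient
    global_grad = np.mean(local_grads, axis=0)
    # identical optimizer update keeps all model copies in sync
    return params - lr * global_grad

# illustrative objective: mean-squared error of a linear model
def mse_grad(w, batch):
    X, y = batch
    return 2.0 * X.T @ (X @ w - y) / len(y)
```

Because each local gradient is a per-sample mean and the local mini-batches have equal sizes, the averaged gradient equals the gradient a single worker would compute on the whole global mini-batch, which is what makes the result independent of the number of workers.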
Care must be taken, however, to ensure that results are independent of the number of workers utilized,
which is an important tenet of high-performance computing. To accomplish that, we start by augmenting
the dataset to make the total number of training samples Ns divisible by the number of workers p. Then,
each global mini-batch of size bs is divided into p equal parts, which become the local mini-batches to be
dispatched to the p workers, as shown in Figure 5. This ensures that the union of the local mini-batches
across all workers will be identical to the (global) mini-batch of the corresponding single-processor run.
Modulo rounding errors during gradient communication, the above scheme thus guarantees that the solution
will be independent of the number of workers. It also follows from the arithmetic that, for any global mini-
batch size bs chosen, the local mini-batches processed by workers at any given time will have the same size,
thus optimizing load balance.
Figure 5—Data splitting across workers in a parallel run: local mini-batches are guaranteed to always have identical sizes at any given time, promoting optimal load balance.
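A minimal index-based sketch of this splitting scheme (padding by repeating initial samples stands in for the paper's more sophisticated data augmentation):

```python
def shard_minibatches(num_samples, bs, p):
    """Deal global mini-batches of size bs into p equal local parts.

    Pads the index list by repeating initial samples so the total count
    is divisible by p (an illustrative stand-in for the paper's data
    augmentation). Requires bs % p == 0 so every split is exact; the
    padded total then keeps even the last, smaller batch divisible by p.
    """
    assert bs % p == 0, "global mini-batch must split evenly across workers"
    order = list(range(num_samples)) + list(range((-num_samples) % p))
    global_batches = [order[i:i + bs] for i in range(0, len(order), bs)]
    # worker `rank` takes the rank-th contiguous slice of every batch
    return [
        [b[rank * len(b) // p:(rank + 1) * len(b) // p] for b in global_batches]
        for rank in range(p)
    ]
```

For example, with Ns = 10, bs = 8, and p = 4, every worker processes a local mini-batch of size 2 at the first step and size 1 at the second, and the union across workers reconstructs exactly the global mini-batches of a single-process run.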
Our data distribution system contains several other features that have been shown to improve training
performance or accuracy, some of the most important being:
• Shuffling: the training dataset is randomized before being split into local mini-batches and
distributed to workers. A-priori data shuffling has been shown to improve convergence time and
model generalization for Stochastic Gradient Descent (SGD)-based training.
• Pre-loading: optionally, the local mini-batches can be read upfront from disk and cached into
memory for the duration of the training process. The memory footprint of caching a subset of the
data is usually small on modern machines compared to the memory required during training, while
the runtime benefits of I/O-free execution can be significant.
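One common way to shuffle consistently in a distributed setting (shown here as an assumption about the implementation, not a detail stated in the paper) is to seed every worker's private RNG with the same shared value, so all ranks derive the identical permutation before slicing out their local shards, with no communication required:

```python
import random

def epoch_permutation(num_samples, epoch, shared_seed=1234):
    """Identical shuffled sample order on every rank.

    All workers seed a private RNG with the same (shared_seed + epoch)
    value, so they agree on the permutation without communicating, and
    the data is reshuffled differently at every epoch.
    """
    rng = random.Random(shared_seed + epoch)
    order = list(range(num_samples))
    rng.shuffle(order)
    return order
```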
Hybrid Parallel Distribution Engine. Our parallelization strategy leverages both distributed-memory Message Passing Interface (MPI)-based communication primitives, which handle data transfer across processes, and shared-memory Open Multi-Processing (OpenMP) or Compute Unified Device Architecture (CUDA)-based multi-threading, which exploits parallelism within a node. This combination of shared-memory and message-passing paradigms within the same application is known as hybrid programming (Duy 2012), and is illustrated in Figure 6. In the specific case of our deep-learning software, MPI collective all-reduce calls are invoked to handle gradient communication and averaging across workers.
Cluster Specifications
HPC-based ML/AI workflows demand premier computational performance, scalability, and cost efficiency.
Microsoft Azure provides supercomputing-grade infrastructure built to optimize time to solution and
minimize costs by leveraging cloud-leading flexibility and scalability. Azure HPC Virtual Machines (VMs)
can often match or even outperform on-premises cluster capabilities due to their adoption of leading-
edge HPC technologies, e.g., performance optimized hardware, NVIDIA GPU acceleration and InfiniBand
networking. We have performed our 3D semantic segmentation studies using the following Intel, AMD and
NVIDIA-based HPC clusters as discussed below.
Table I—Multiple choices of loss function, data augmentation strategy, input sample size, and
network architecture used for training a depth-4 3D U-Net with the Parihaka seismic dataset.

Knobs                      Values
Training data generation   Increasing z-overlap; Constant 90% z-overlap; Random overlap
Sample size (in pixels)    (64 × 64 × 256); (128 × 32 × 256); (128 × 32 × 512); (256 × 16 × 512)
Network architecture       3D U-Net with different depths; 3D U-Net with different backbones (ResNet, EfficientNet)
In Table II, we show the resulting precision, recall, and F1-score (Dice coefficient) for the different classes, achieved by training on 4160 samples of size 64 × 64 × 256 pixels using the NLL loss function and a 3D U-Net architecture with a ResNet backbone. In Figure 10, we show the confusion matrix for the test volume using the model trained with 64 × 64 × 256 pixel input cuboids. Inference was performed on the entire 1006 × 78 × 590 test block and took 50 seconds on a single HC-series VM. Due to the high class imbalance in the labeled data, we observe higher accuracy for classes 0-3 compared to classes 4 and 5. In Figure 11, we show that both majority and minority classes are predicted well for a slice at x = 723 in the test volume.
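The per-class scores in Table II follow directly from a confusion matrix like the one in Figure 10. As a reminder of the arithmetic (a generic sketch, not tied to the paper's actual matrix, and assuming every class appears at least once as both truth and prediction):

```python
import numpy as np

def per_class_scores(cm):
    """Precision, recall, and F1 (Dice) per class from a confusion
    matrix whose rows are true labels and columns are predictions."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / cm.sum(axis=0)  # column sums: everything predicted as class c
    recall = tp / cm.sum(axis=1)     # row sums: everything truly of class c
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For the binary case F1 coincides with the Dice coefficient, which is why Table II reports them as the same quantity.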
Figure 11—True vs. predicted seismic facies corresponding to cross-sections at x = 732 in the test volume
Table II—Highest precision, recall, and F1-score (Dice coefficient) results were achieved by training a 3D U-Net architecture with a ResNet backbone on 4160 samples of size 64 × 64 × 256 with constant 90% overlap in the z-direction, using the NLL loss function.
Conclusions
References
Alaudah, Y., Michałowicz, P., Alfarraj, M., and AlRegib, G. 2019. "A machine-learning benchmark for facies
classification." Interpretation, 7(3) SE175–SE187.
Ben-nun, T. and Hoefler, T. 2018. "Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency
Analysis." arXiv preprint arXiv:1802.09941v2.
Di, H., Wang, Z., and AlRegib, G. 2018. "Why using CNN for seismic interpretation? An investigation." In SEG Technical
Program Expanded Abstracts 2018, 2216–2220. Society of Exploration Geophysicists.
Dramsch, J.S. and Lüthje, M. 2018. "Deep-learning seismic facies on state-of-the-art CNN architectures." Society of Exploration Geophysicists, 2036–2040.
Dumoulin, V. and Visin, F. 2018. "A guide to convolution arithmetic for deep learning." arXiv:1603.07285v2.
Duy, T., Yamazaki, K., Ikegami, K., and Oyanagi, S. 2012. "Hybrid MPI-OpenMP Paradigm on SMP clusters: MPEG-2
Encoder and n-body Simulation." arXiv preprint arXiv:1211.2292.
He, K., Zhang, X., Ren, S., Sun, J. 2015. "Deep Residual Learning for Image Recognition." arXiv:1512.03385.
Houston, T.K., Treichler, S., Romero, J. et al. 2018. "Exascale Deep Learning for Climate Analytics." arXiv 1810.01993.
Kingma, D. P. and Ba, J. 2015. "Adam: A method for stochastic optimization." Proc. Int. Conf. Learning Representations
(ICLR), 2015.
Laanait, N., Romero, J., Yin, J. et al. 2019. "Exascale Deep Learning for Scientific Inverse Problems." arXiv preprint
arXiv:1909.11150v1.
Liang-Chieh, C., Papandreou, G., Schroff, F., Adam, H. 2017. "Rethinking Atrous Convolution for Semantic Image
Segmentation." arXiv:1706.05587.
Liu, M., Jervis, M., Li, W., and Nivlet, P. 2020. "Seismic facies classification using supervised convolutional neural networks and semi-supervised generative adversarial networks." Geophysics 85(4) O47–O58.
Long, J., Shelhamer, E., and Darrell, T. 2015. "Fully Convolutional Networks for Semantic Segmentation." arXiv preprint arXiv:1411.4038v2.
McHargue, T., Pyrcz, M.J., Sullivan, M.D., Clark, J.D., Fildani, A., Romans, B.W., Covault, J.A., Levy, M., Posamentier,
H.W. and Drinkwater, N.J. 2011. "Architecture of turbidite channel systems on the continental slope: patterns and
predictions." Marine and Petroleum Geology, 28(3) 728–743.
Minaee, S., Boykov, Y., Porikli, F. et al. 2020. "Image segmentation using deep learning: A survey." arXiv preprint arXiv:2001.05566v5.