Estimating PVT Properties of Crude Oil Systems Based On A Boosted Decision Tree Regression Modelling Scheme With K-Means Clustering

SPE-196453-MS
Meshal Almashan, Yoshiaki Narusue, and Hiroyuki Morikawa, Graduate School of Engineering, The University of
Tokyo
Copyright 2019, Society of Petroleum Engineers
This paper was prepared for presentation at the SPE/IATMI Asia Pacific Oil & Gas Conference and Exhibition held in Bali, Indonesia, 29-31 October 2019.
This paper was selected for presentation by an SPE program committee following review of information contained in an abstract submitted by the author(s). Contents
of the paper have not been reviewed by the Society of Petroleum Engineers and are subject to correction by the author(s). The material does not necessarily reflect
any position of the Society of Petroleum Engineers, its officers, or members. Electronic reproduction, distribution, or storage of any part of this paper without the written
consent of the Society of Petroleum Engineers is prohibited. Permission to reproduce in print is restricted to an abstract of not more than 300 words; illustrations may
not be copied. The abstract must contain conspicuous acknowledgment of SPE copyright.
Abstract
Over the past 20 years, machine learning has been applied successfully to the estimation of reservoir fluid
properties, competing with empirical correlations. One of the most commonly used modeling schemes
is the artificial neural network, which is known for its black-box nature: it does not show the steps
taken to reach the final estimate of the fluid properties. This study offers a different modeling approach
that overcomes this limitation of the currently implemented scheme, providing users with better
predictions and a deeper understanding of the key input parameters in modeling. The proposed model
predicts the bubble point pressure (Pb) and the oil formation volume factor at bubble point pressure
(Bob) as a function of oil and gas specific gravity, solution gas-oil ratio, and reservoir temperature by a
boosted decision tree regression (BDTR) predictive modeling scheme. The K-means clustering algorithm
is performed as a preprocessing step based on the Pressure-Volume-Temperature (PVT) input features to
increase the prediction accuracy. In addition, the predictive power of the built K-means clustered BDTR
model implemented in this study is compared against the most commonly used empirical correlations,
the ANNs, and the standalone BDTR model. Moreover, the feature importance of predicting Pb and Bob
is discussed. The universal dataset used in building the predictive model consists of 5200 experimentally
derived data points representing worldwide crude oils covering a wide range of geographical regions. The
built BDTR model is more accurate and it outperforms the most commonly used empirical correlations and
the previous machine learning models in predicting Pb and Bob in terms of the average absolute percent
relative error. Furthermore, the proposed model can be integrated into simulators, and it can also be applied
to predicting other oil and gas properties, to gas-liquid two-phase flow pattern identification,
and to predicting rock properties. As an interpretable approach in predicting the PVT properties of crude
oils, the proposed model can be used as an alternative modeling scheme in PVT characterization where
the importance of the input features can heuristically and accurately be determined. This can be applied
towards preventive maintenance and anomaly detection studies where prediction decisions can be further
investigated by the interpretable representation of the decision trees. In addition, it is the most accurate
model to date in predicting the bubble point pressure and the oil formation volume factor at bubble point
pressure of crude oils.
Keywords: Predictive model, PVT, Oil and gas, Reservoir characterization
Introduction
Reservoir fluid properties are crucial inputs to petroleum engineering computations, for example
material balance calculations, numerical reservoir simulation,
reserves estimation, and inflow performance calculations. Accurate PVT property data
are therefore essential to the success of these applications. In addition, the better the PVT property data
used in reservoir performance calculations, such as production design and operations, the better
the results. The more accurate the properties, the less cost, effort and time are spent on such processes.
Among the PVT properties, accurate and precise prediction of the bubble point pressure (Pb) and the oil
formation volume factor at bubble point pressure (Bob) is particularly important in production and reservoir
computations for determining oil reserves or the future production of oil wells. Furthermore, the API
gravity (γAPI) is an important PVT property of crude oils. Crude oil is classified by this property
according to its heaviness, which in turn determines its marketability. Table 1 shows a standard oil
classification by γAPI [1].
Table 1—The classification of crude oils based on API gravity.
Classification    °API Range
Light             °API > 31.1
Medium            22.3 ≤ °API ≤ 31.1
Heavy             °API < 22.3
Extra heavy       °API < 10.0
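The classification in Table 1 can be sketched as a small lookup. The function names are hypothetical, and two assumptions are made that the table does not state: the extra-heavy limit is tested first, because the Heavy and Extra heavy ranges overlap as printed, and °API is obtained from oil specific gravity with the standard definition °API = 141.5/γo - 131.5.

```python
def api_gravity(oil_specific_gravity: float) -> float:
    """Convert oil specific gravity (water = 1) to degrees API (standard definition)."""
    return 141.5 / oil_specific_gravity - 131.5


def classify_crude(api: float) -> str:
    """Return the Table 1 class for a given API gravity.

    Assumption: 'Extra heavy' takes precedence over 'Heavy' below 10.0 °API.
    """
    if api < 10.0:
        return "Extra heavy"
    if api < 22.3:
        return "Heavy"
    if api <= 31.1:
        return "Medium"
    return "Light"
```

For instance, an oil with a specific gravity of 0.85 has an API gravity of about 35 and falls in the Light class.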
The PVT simulators that are currently available can predict the properties of the fluids in the reservoir
with different degrees of accuracy depending on the model used, the geographical location and the overall
conditions. However, these PVT simulators all have a significant drawback as they do not estimate their
answers' accuracy. The PVT properties are best determined by experiments implemented at specialized
laboratories using oil samples collected from the wellbore or the wellhead. Nevertheless, these laboratory
experiments require much time, effort and cost. Consequently, graphical approaches, equations of state
(EOS), statistical regression techniques and empirical correlations are used as alternatives for the prediction
of the PVT properties. However, the EOS are complex in terms of computations, and they also require
comprehensive details of the compositions of the fluids of each reservoir and this requires time and
additional costs. Therefore, a significant number of empirical correlations have been developed as a
less expensive, less complex and quicker solution. This has resulted in a large body of research on
the development of empirical correlations for PVT properties. The developed correlations propose
mathematical and graphical models for the determination of Pb and Bob. Such correlations are based on
the assumption that Pb and Bob are functions of the reservoir temperature (T), solution gas–oil ratio
(Rs), oil specific gravity (γo), and gas specific gravity (γg) [2, 3, 4]. Furthermore, the accuracy of the estimates
generated by these correlations relies on using data within the range of the data used in their
development, and on the oil wells being in similar geographical locations that share the
same fluid compositions and oil API gravity (γAPI). Consequently, in an attempt to improve the results
generated by the correlations, crude oils were grouped by γAPI before a correlation was specified for
each group [5]. In other research studies, machine learning algorithms have been implemented
for more precise and accurate predictions of the PVT properties [3, 6, 7, 8, 9, 10].
In recent years, artificial neural networks (ANNs) have been applied to a variety of problems in the oil and
gas industry. Other machine learning algorithms used in the prediction of the PVT
properties include Support Vector Machine Regressions (SVRs) and Functional Networks (FNs). However,
neural network models exhibit some drawbacks, such as the inability to explicitly identify potential
causal relationships, a tendency to overfit, and the risk of converging to local minima of the cost function.
Moreover, very few previous works that applied machine learning to the prediction of
the PVT properties have considered the diversity of γAPI or of the other input properties. For example,
if the data used for training the machine learning model contains more heavy crude oils than light
crude oils, the resulting model will be biased towards the heavy crude oils. Such a dataset is termed an
"imbalanced dataset" [11]. The main objective of this study is to analyze the capability of boosted decision
tree regressions (BDTRs) on modeling PVT properties of crude oil systems and to avoid the mentioned
drawbacks of the ANNs. Consequently, in this research, we particularly investigate the capability of BDTRs
in modeling the bubble point pressure (Pb) and the oil formation volume factor at bubble point pressure
(Bob) using worldwide datasets representing crude oil samples from different geographical locations. In
addition, an integrated approach of K-means clustering and BDTR for predicting the crude oil systems PVT
properties is proposed. K-means clustering is implemented to produce clusters of the input dataset based on
γAPI and the other input features before the BDTR is used to perform predictions on the target PVT properties,
Pb and Bob. Furthermore, comparative studies are performed to compare the performance of the BDTR
model with K-means clustering against the most commonly used empirical correlations in
petroleum engineering. In addition, the performance of the integrated model (K-means clustering & BDTR)
is compared against the standalone BDTR model and against artificial neural networks (ANNs). The BDTR
model has been built using a worldwide crude oils dataset with 5200 data points.
Results show that the integrated model (K-means clustering & BDTR) outperforms ANNs and the
most commonly used empirical correlations in terms of the average absolute percent relative error (Ea) in
predicting Pb and Bob, and it outperforms the standalone BDTR model in predicting Pb. The proposed model
can be integrated into oil and gas reservoir simulators and software. This can also be extended towards
predicting other PVT properties for further estimations in the upstream exploration and production process
computations.
Literature review
In hydrocarbons, the pressure at which the first bubble of gas comes out of oil solution is referred to as
the bubble point pressure (Pb). At the bubble point pressure, the ratio of the volume of oil at reservoir
conditions to its volume at stock tank conditions is defined as the oil formation volume factor at bubble
point pressure (Bob). These two parameters are among the most important PVT properties of oil and gas
systems. Together with other properties, they are crucial in calculations related to reservoir and
production engineering. Examples include forecasting future reservoir performance, simulation of wells
and reservoirs, mass balance calculations, design of production facilities, flow
performance calculations, economic evaluation, and enhanced oil recovery projects [12, 13, 14].
By applying constant-composition expansion (CCE) laboratory tests on samples collected from reservoir
fluids, the bubble point pressure and the oil formation volume factor can be determined [15]. In the CCE
test, at a pressure initially higher than the reservoir pressure and at temperature similar to the reservoir
temperature, a sample from reservoir fluids is first placed in a visual PVT cell. In the next step, while
keeping the temperature at a constant level, the pressure is reduced gradually. A plot of the volume of the
hydrocarbon sample against the pressure is drawn. The bubble point pressure is then determined as the point
at which the slope starts to change [15]. Laboratory analysis is a very accurate method of determining Pb
and Bob. However, it consumes time, money and effort, and extra care needs to be taken
when handling the fluid samples from oil reservoirs [16].
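The slope-change criterion of the CCE test described above can be sketched numerically. This is a minimal illustration, not a laboratory procedure: it assumes that the pressure-volume data are approximately linear on each side of the bubble point, fits two straight lines by least squares for every admissible split, and returns the pressure at which the best pair of lines intersects. The function names and the synthetic data are hypothetical.

```python
def fit_line(points):
    """Least-squares line through (x, y) points; returns slope, intercept, SSE."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    return slope, intercept, sse


def bubble_point_from_cce(pressure_volume):
    """Estimate Pb as the intersection of two lines fitted to (pressure, volume) data."""
    data = sorted(pressure_volume, reverse=True)  # highest pressure first
    best = None
    for k in range(2, len(data) - 1):  # keep at least two points per segment
        m1, b1, sse1 = fit_line(data[:k])
        m2, b2, sse2 = fit_line(data[k:])
        if m1 == m2:
            continue  # parallel lines have no intersection
        total = sse1 + sse2
        if best is None or total < best[0]:
            best = (total, (b2 - b1) / (m1 - m2))
    if best is None:
        raise ValueError("need at least four points and a change of slope")
    return best[1]


# Synthetic data with a slope change at 2000 psi (illustrative values only).
synthetic = [(p, 100.0 - 0.001 * (p - 2000.0)) for p in (3000, 2750, 2500, 2250)]
synthetic += [(p, 100.0 + 0.02 * (2000.0 - p)) for p in (1750, 1500, 1250, 1000)]
estimated_pb = bubble_point_from_cce(synthetic)
```

Real two-phase data are curved below the bubble point, so in practice only the points close to the slope change would be used for the second line.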
As an alternative approach, equations of state (EOS) can be used to predict the PVT properties. However,
EOS are considered as computationally complex and require an extensive amount of data related to the
composition of the reservoir fluids. Consequently, for a long time, researchers have developed several
different empirical correlations to estimate the PVT properties with varying degrees of accuracy. Compared
to EOS, empirical correlations are much simpler in terms of calculations. For accuracy reasons, some of
these correlations were developed for certain geographical locations and chemical compositions of crude
oil systems. The main techniques used in developing these correlations are graphical, linear and non-linear
regression techniques. The major assumption in the development of the empirical correlations is that the
bubble point pressure and the oil formation volume factor are functions of the reservoir temperature (T),
solution gas–oil ratio (Rs), oil specific gravity (γo), and gas specific gravity (γg).
Standing [17] in 1947 developed graphical correlations for estimating the oil formation volume factor
(OFVF), the total OFVF and the bubble point pressure using 105 datasets generated from oil samples
collected from reservoirs in California. Standing's correlations exhibited average errors of 1.17%, 5% and
4.8% for OFVF, total OFVF and bubble point pressure respectively. Lasater [18] in 1958 developed an
empirical correlation to estimate the bubble point pressure using 158 datasets produced from oil samples
collected from North and South America. Lasater's correlation exhibited an average error of 3.8%. Vasquez
and Beggs [19] in 1980 proposed correlations for estimating different PVT properties, e.g., saturated
and undersaturated OFVF, solution gas-oil ratio and undersaturated oil viscosity. Their correlations were
developed based on 600 datasets from oil fields around the world. In addition, Vasquez and Beggs divided
the datasets in their experiments into two groups based on the oil API gravity (γAPI < 30 and γAPI
> 30). Glasø [20] in 1980 developed correlations for estimating OFVF, total OFVF, dead oil viscosity and
the bubble point pressure. Glasø [20] developed correlations based on 45 oil samples collected from the
North Sea. Furthermore, Glasø [20] developed a method for estimating the bubble point pressure using a
correlation that works with the presence of N2, CO2 and H2S gases. The correlations exhibited average
relative errors of 4.56%, 0.43% and 1.28% for total OFVF, OFVF and the bubble point pressure respectively.
Al-Marhoun [21] in 1988 proposed correlations for predicting OFVF and the bubble point pressure. An
average absolute relative error of 0.88% and 3.66% for OFVF and the bubble point pressure respectively
were reported by Al-Marhoun [21]. Other examples of publications proposing empirical correlations for
different oil fields around the world are Kartoatmodjo and Schmidt [5] for the oil fields in the Middle East,
Indonesia, North America and Latin America, Petrosky and Farshad [22] for the oil fields in the Gulf of
Mexico, Dokla and Osman [23] for the oil fields in the United Arab Emirates, Omar and Todd [24] for the
oil fields in Malaysia, Naseri et al. [25] for the oil fields in Iran, Dindoruk and Christman [26] for the oil
fields in the Gulf of Mexico, Khairy et al. [27] for the oil fields in Egypt, Macary and El-Batanoney [28]
for the oil fields in the Gulf of Suez, Labedi [29] for the oil fields in Africa and Farshad et al. [30] for the oil
fields in Colombia.
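To illustrate the functional form that such correlations take, the sketch below evaluates the closed-form fit of Standing's bubble point and formation volume factor correlations as commonly quoted in reservoir engineering textbooks. The coefficients are the usually cited textbook values, not values taken from this paper, and the function names are hypothetical. Inputs are assumed to be Rs in scf/stb, T in °F and oil gravity in °API.

```python
def standing_bubble_point(rs, gas_gravity, temp_f, api):
    """Bubble point pressure (psia) from the commonly quoted Standing form."""
    exponent = 0.00091 * temp_f - 0.0125 * api
    return 18.2 * ((rs / gas_gravity) ** 0.83 * 10.0 ** exponent - 1.4)


def standing_bob(rs, gas_gravity, temp_f, api):
    """Oil formation volume factor at bubble point (rb/stb), Standing form."""
    oil_gravity = 141.5 / (api + 131.5)  # specific gravity from °API
    term = rs * (gas_gravity / oil_gravity) ** 0.5 + 1.25 * temp_f
    return 0.9759 + 0.000120 * term ** 1.2


# Illustrative inputs only: Rs = 500 scf/stb, gas gravity 0.8, 180 °F, 35 °API.
example_pb = standing_bubble_point(500.0, 0.8, 180.0, 35.0)
example_bob = standing_bob(500.0, 0.8, 180.0, 35.0)
```

For these inputs the correlation gives a bubble point pressure of roughly 2000 psia and a formation volume factor of about 1.29 rb/stb. Both functions use the same four inputs that the machine learning models in this study use.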
These correlations are, however, unable to produce reliable results [8]. Consequently, researchers have
implemented artificial neural networks (ANNs) as an alternative reliable solution for estimating the PVT
properties of crude oils. Feed forward neural networks and the back-propagation (BP) training algorithm are
the most commonly used methods in modeling ANNs. Gharbi and Elsharkawy [31] in 1997 developed
two neural network models, one for predicting OFVF and another for predicting the bubble point pressure.
The datasets used in building the models were for crude oil systems in the Middle East.
The developed neural network models have two hidden layers. The 4-6-6-2 structure was used to estimate
the OFVF and the 4-8-4-2 structure was used to estimate the bubble point pressure. Compared to the most
commonly used empirical correlations for estimating OFVF and the bubble point pressure, their models
generated lower standard deviations and relative errors. Elsharkawy [32] in 1998 proposed a new method
for predicting oil viscosity, OFVF, saturated oil density, undersaturated oil compressibility, solution gas-oil
ratio and evolved gas using a radial basis function neural network model. Results showed that the proposed
model by Elsharkawy [32] outperformed the existing empirical correlations in terms of accuracy. Osman et
al. [2] in 2001 developed a neural network model with a 4-5-1 multilayer feed forward structure and a
back-propagation algorithm to estimate OFVF at bubble point pressure. Compared to the empirical correlations
developed by Standing [17], Glasø [20], Vasquez and Beggs [19], Al-Marhoun [21] and Al-Marhoun [33],
the proposed model has the lowest absolute percent relative error.
Malallah et al. [34] in 2006 estimated OFVF and the bubble point pressure by using the alternating
conditional expectation algorithm, as a new method. A dataset of 5200 data points for oil samples collected
from oil fields around the world (e.g., North and South America, the North Sea, Southeast Asia, the Middle
East, and Africa) was used in developing their model. Compared to the existing empirical correlations,
the developed model has a better prediction accuracy. Moghadassi et al. [35] in 2009 developed a model
to predict the PVT properties based on ANNs. The dataset that was used consisted of data points for
compressibility factor, reduced temperature and reduced pressure. The dataset was derived from the
Chemical Engineers' handbook [36]. For training this model by a back-propagation learning algorithm,
a number of different learning algorithms were tested and compared against each other. Examples are
Resilient Back Propagation (RP), Scaled Conjugate Gradient (SCG) and Levenberg-Marquardt (LM). Based
on the mean squared error (MSE) measure, the best model was constructed with an LM algorithm and one
hidden layer of 60 neurons. Results showed that the ANN model estimated the values of the PVT properties
accurately. Aref Hashemi Fath et al. [8] in 2018 developed a multilayer feed forward neural networks model
for predicting the bubble point pressure of crude oils. Derived from literature, a dataset of 760 data points
for crude oil samples collected from different oil fields around the world was used for training and testing
the model.
El-Sebakhy [37] in 2009, investigated the capability of the SVM regression as a predictive model in
estimating the PVT properties of crude oils. The predicted PVT properties are the bubble point pressure and
OFVF (Bo). Results showed that the proposed model by El-Sebakhy [37] outperformed the most commonly
used empirical correlations and the standard feedforward neural networks. Furthermore, El-Sebakhy [38] in
2009 investigated the capability of Type-1 fuzzy logic inference systems as a predictive model in estimating
the PVT properties of crude oils. The predicted PVT properties are Pb and Bob. Results showed that the
proposed model by El-Sebakhy [38] outperformed the most commonly applied empirical correlations, and a
feedforward neural network with a back-propagation learning algorithm and a sigmoid activation function,
in estimating Pb and Bob.
Munirudeen A. Oloso et al. [39] in 2017 proposed a hybrid model for predicting PVT properties. The
hybrid model components are K-means clustering and Functional Networks (FN). The predictive model
first clusters the input datasets based on the oil API gravity property and the other input features using
the K-means clustering algorithm. Subsequently, the PVT properties, Pb and Bob are predicted from the
generated clusters. The input parameters to the predictive model are the reservoir temperature, solution gas-
oil ratio, gas relative density and oil API gravity. The functional network used for prediction resembles
artificial neural networks in their architecture. However, functional networks do not suffer from the "black-
box" problem as their neurons are pre-defined functions. Therefore, functional networks are considered as
computationally complex alternative solutions. The best ANN model achieved in their study is a feedforward
ANN with a sigmoid activation function and one hidden layer of ten neurons. Results
showed that the proposed model by Munirudeen A. Oloso et al. [39] outperformed the most commonly
implemented empirical correlations, the standalone functional networks and the modeled feedforward neural
network.
The present study aims to build a universal and accurate predictive model for estimating the bubble
point pressure Pb and the oil formation volume factor at bubble point pressure Bob of crude oils based on
a boosted decision tree regression (BDTR) model. To this end, a dataset of 5200 data points is utilized. The
dataset is derived from oil samples covering a wide range of crude oils from different geographical locations
around the world [13]. Using this dataset, the model was built and evaluated. In addition, a comparative
study compares the performance of the BDTR model
with and without the K-means clustering preprocessing technique. The comparative
study also considers ANNs and the most commonly used empirical correlations. Furthermore, the
importance of each of the input parameters to the BDTR model used in estimating Pb and Bob is determined.
Data acquisition
The bubble point pressure (Pb) and the oil formation volume factor at bubble point pressure (Bob) are
functions of reservoir temperature (T), solution gas-oil ratio (Rs), gas specific gravity (γg) and oil API gravity
(γAPI). The development and performance of predictive models rely on the quality and diversity of the datasets
used for training, validation and testing. In this study, a large dataset consisting of diverse data points
generated from oil samples collected from different geographical locations with different types of crude oils
was used to build the BDTR model for predicting Pb and Bob.
A BDTR model has been trained and tested for the prediction of Pb and Bob using a worldwide crude oils
dataset with 5200 data points. The data points were collected from oil wells from different geographical
regions representing 350 different crudes from all over the world, including major oil fields in the North
Sea, Middle East, North and South America, Africa and South East Asia [13].
The datasets consist of the PVT properties: bubble point pressure (Pb), oil formation volume factor at
bubble point pressure (Bob), reservoir temperature (T), solution gas-oil ratio (Rs), gas specific gravity (γg) and
oil API gravity (γAPI). The statistical measures of the PVT datasets are shown in Table 2. The input parameters
of the BDTR predictive model are the reservoir temperature, solution gas-oil ratio, gas specific gravity, oil
API gravity (γAPI) and the cluster assignment number generated by the K-means clustering algorithm. The
bubble point pressure (Pb) and the oil formation volume factor at bubble point pressure (Bob) are the outputs.
As can be seen in Table 2, the datasets cover a wide range of values representing diverse crude oil samples
where Pb values range from 79 to 7130 psi, Bob values from 1.02 to 2.92 rb/stb, T values from 74 to
342 °F, Rs values from 9 to 3370 scf/stb, γg values from 0.5 to 1.67 and γAPI values from 14.3
to 59 °API. Of the dataset, 70% of the data points were used for training and 30% were used
for validation and testing.
Table 2—Statistical measures of the PVT datasets.
Property        Min     Max     Mean
T (°F)          74      342     184.2
Rs (scf/stb)    9       3370    496
γg (air = 1)    0.5     1.67    0.8
γAPI (°API)     14.3    59      36.82
Pb (psi)        79      7130    1644.4
Bob (rb/stb)    1.02    2.92    1.33
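The 70/30 partition of the data points described above can be sketched as a seeded shuffle followed by a cut. The function name, the seed and the use of a single random shuffle are assumptions made for illustration; the text does not state how the data points were assigned, nor how the 30% share was divided between validation and testing.

```python
import random


def split_dataset(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows reproducibly and split them into training and hold-out parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # local generator keeps the split repeatable
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# With 5200 data points this gives 3640 training and 1560 hold-out points.
train_rows, holdout_rows = split_dataset(range(5200))
```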
Methodology and experimental work

PVT properties predictions using empirical correlations
Several empirical correlations that predict the bubble point pressure Pb and the oil formation volume factor
at bubble point pressure Bob have been developed in the past, for example, Standing, 1977 [40]; Vazquez
& Beggs, 1980 [19]; Al-Marhoun, 1988 [21]; De Ghetto et al., 1995 [1]; Almehaideb, 1997 [41]; Petrosky
& Farshad, 1998 [42]; Al-Shammasi, 2001 [43]; Jarrahian, Moghadasi, & Heidaryan, 2015 [44]. These
empirical correlations have varying degrees of accuracy and limited ranges of the PVT input features.
PVT properties predictions using machine learning

As a step towards more accurate estimation of crude oil PVT properties, several machine learning
approaches have been applied in the field of PVT characterization. The Artificial Neural Network (ANN) is
one of the most widely implemented machine learning techniques [45]. In their
structure, ANNs are intended to replicate the way a human brain learns. A neural network consists
of an input layer, in most cases one or more hidden layers, and an output layer. Each layer consists of a
set of nodes (artificial neurons) that are interconnected, with connections directed from the input layer to
the output layer through the hidden layer(s). Each artificial neuron computes its output as a non-linear
function of the weighted sum of its inputs. The weights associated with the network's edges and nodes are
adjusted during the learning phase.
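The computation performed by each neuron can be made concrete with a forward pass through a small feedforward network. The sketch below uses a 4-5-1 layout (four inputs, five hidden neurons, one output) with a sigmoid hidden activation and a linear output. The function names, the choice of activation and the weight values are illustrative assumptions, and no training step is shown.

```python
import math


def sigmoid(z):
    """Logistic activation applied to a neuron's weighted input sum."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases, activation):
    """One layer: each neuron applies the activation to its weighted sum plus bias."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward_4_5_1(features, hidden_w, hidden_b, out_w, out_b):
    """Forward pass: 4 inputs -> 5 sigmoid hidden neurons -> 1 linear output."""
    hidden = dense_layer(features, hidden_w, hidden_b, sigmoid)
    return dense_layer(hidden, [out_w], [out_b], lambda z: z)[0]


# Placeholder weights: zero hidden weights make every hidden neuron output 0.5.
hidden_w = [[0.0] * 4 for _ in range(5)]
hidden_b = [0.0] * 5
output = forward_4_5_1([0.3, 0.5, 0.2, 0.7], hidden_w, hidden_b, [1.0] * 5, 0.0)
```

During learning, an algorithm such as back-propagation would adjust `hidden_w`, `hidden_b`, `out_w` and `out_b` to reduce the prediction error.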
However, some researchers have questioned the popularity of the ANN technique in this field, as it
represents a "black-box" modeling scheme and suffers from the local minima problem, which limits its
generalization ability on unseen datasets. In addition to ANNs, other machine learning techniques that have
been implemented for predicting PVT properties include genetic algorithms, support vector machine regressions
(SVR), Adaptive Neuro Fuzzy Inference Systems, Type-1 Fuzzy Logic Inference Systems, Type-2 Fuzzy
Logic Systems, Radial Basis Function Neural Network Models (RBFNM), Functional Networks (FN), and
others (Hajizadeh, 2007 [46]; El-Sebakhy et al., 2007 [3]; Khoukhi et al., 2011 [10]; Oloso et al., 2017 [6]).
Decision tree regressions for estimating the PVT properties

In decision forests, a large set of trees is constructed while avoiding correlations among the trees. To avoid
overfitting, the average of the decision forest is taken as the prediction. In boosted decision trees, on the
other hand, overfitting is avoided by limiting the number of subdivisions and the number of data points in each region.
The algorithm constructs a sequence of trees, each of which learns from the
error of the previous tree. The result of this sequence is a very accurate predictive model. Decision trees
are considered non-parametric machine learning models. They traverse a binary tree data structure
until they reach a decision (leaf node). Decision trees can represent non-linear decision boundaries.
In addition, they perform well in the presence of noise in the set of features. Furthermore, feature
selection and classification are integrated in their operation.
A decision forest regression is an ensemble of decision trees in which every tree
generates a prediction as a Gaussian distribution. The Gaussian distributions of all
trees in the model are then combined into a single distribution, for which the closest Gaussian distribution is
found by aggregating over the ensemble. In the proposed model, by contrast, a
boosted decision tree regression (BDTR) model is trained and tested. Two BDTR models are built, one for
predicting Pb and another for predicting Bob. Each tree in the BDTR model depends on the prior trees:
the algorithm learns by fitting the residual of the preceding trees, which is what makes it boosting.
Boosting is used in decision tree ensembles to improve accuracy, at a small risk of reduced coverage. In
our approach, the boosted decision trees model is used for regression as a supervised learning method in which
the label column is a numerical value, i.e., Pb or Bob. The MART gradient boosting algorithm, a technique
for machine learning regression problems, is implemented in our model. This algorithm builds each
regression tree in a step-wise fashion. At every step, the error is measured by a loss function and
corrected in the next step. The resulting BDTR predictive model is an ensemble of weaker predictive
models. An arbitrary differentiable loss function is used to select the optimal tree from the produced sequence
of trees, and this tree is then used in predicting future Pb or Bob values. The model's input features are solution
gas–oil ratio (Rs), reservoir temperature (T), oil gravity (γAPI), and gas specific gravity (γg).
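The residual-fitting idea behind boosting can be sketched with a generic gradient boosting loop. This is a simplified illustration and not the MART implementation used in the study: it assumes a squared-error loss, a single input feature, and trees with only two leaves (stumps). All function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that minimises the squared error of the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the largest x would leave an empty leaf
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        sse = (sum((r - left_value) ** 2 for r in left)
               + sum((r - right_value) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1:]


def boost(xs, ys, n_trees=200, learning_rate=0.1):
    """Start from the mean, then let each stump fit the residual of the ensemble so far."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the shrunken contributions of all stumps on top of the base value."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps
    )


def training_sse(model, xs, ys):
    """Sum of squared errors of the model on the given points."""
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys))
```

The learning rate shrinks each tree's contribution, which is why a small rate such as 0.01 is typically paired with a large number of trees.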
Different datasets gathered from different geographical locations require certain hyperparameters settings
for improving the accuracy and the precision of the model. In addition, a dataset of worldwide crude oils
with wide ranges of each of the input features (Table 2) requires an accurate and precise tuning of the
model's hyperparameters. Therefore, tuning the hyperparameters of the BDTR algorithm for generating the
best prediction results is one of the main tasks in modeling the PVT properties in this study.
The hyperparameter settings of the best BDTR model built in this study for estimating Pb are 7000 as
the number of the trees constructed, 2 as the minimum leaf instances, 18 as the number of leaves per tree
and 0.01 as the learning rate. The hyperparameter settings of the best BDTR model built in this study for
estimating Bob are 9200 as the number of the trees constructed, 2 as the minimum leaf instances, 15 as the
number of leaves per tree and 0.06 as the learning rate.
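The reported settings can be collected in a small configuration object so that both models are described in one place. The class and field names below are hypothetical; they would need to be mapped onto the parameter names of whichever boosting library is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BDTRSettings:
    """Hyperparameters of one boosted decision tree regression model."""
    n_trees: int
    min_leaf_instances: int
    leaves_per_tree: int
    learning_rate: float


# Values as reported in the text for the best Pb and Bob models.
PB_SETTINGS = BDTRSettings(n_trees=7000, min_leaf_instances=2,
                           leaves_per_tree=18, learning_rate=0.01)
BOB_SETTINGS = BDTRSettings(n_trees=9200, min_leaf_instances=2,
                            leaves_per_tree=15, learning_rate=0.06)
```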
K-means clustering
K-means clustering is an effective pattern recognition technique that groups data points in a predefined
number of clusters. For any given dataset, there is no particular way for defining the number of clusters that
would produce the best grouping results. Organizing datasets into clusters is considered as a fundamental
step in learning as clustering helps finding structures in data.
The K-means clustering algorithm is considered one of the most popular clustering algorithms. It was first
introduced in 1955. An overview of a number of clustering algorithms, and of the K-means
clustering algorithm in particular, is given in Jain, 2010 [47].
In our approach, the datasets are preprocessed before modeling. The preprocessing step clusters the
data points based on the four input features, Rs, T, γAPI, and γg. K-means clustering is used to generate
the clusters, with the number of clusters predefined for each model. Crude oil in the petroleum industry
is categorized into four groups based on γAPI, as seen in Table 1. In our model, however, the grouping is
not hardcoded; it is determined by the algorithm from the four input features and the predefined number
of clusters. After the clusters are generated and each data point is assigned to its cluster, the datasets are
fed into the BDTR model with the four input features plus one additional feature, the cluster assignment
number generated by the K-means clustering algorithm for each data point. For the Pb model, 20 clusters
are generated, and for the Bob model, 10 clusters are generated.
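The preprocessing step described above can be sketched as follows. This is a minimal illustration that assumes scikit-learn's KMeans. The helper name add_cluster_feature, the feature scaling before clustering, and fitting the clusters on the training split only are assumptions of this sketch, not steps stated in the text.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def add_cluster_feature(X_train, X_test, n_clusters, seed=0):
    """Append each sample's K-means cluster id as a fifth input column.

    X_train, X_test: arrays whose columns are [Rs, T, API gravity, gas gravity].
    Scaling is an assumption: it keeps large-valued features such as Rs
    from dominating the Euclidean distances used by K-means.
    """
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    scaler = StandardScaler().fit(X_train)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    train_ids = km.fit_predict(scaler.transform(X_train))
    # Unseen points are assigned to the nearest learned cluster center.
    test_ids = km.predict(scaler.transform(X_test))
    return (
        np.column_stack([X_train, train_ids]),
        np.column_stack([X_test, test_ids]),
    )
```

With the cluster counts given in the text, the call would use n_clusters=20 for the Pb model and n_clusters=10 for the Bob model, and the returned five-column arrays would be passed to the regressor.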
The overall performance of the BDTR model with a K-means clustering preprocessing step is compared
against the standalone BDTR model, i.e., without K-means clustering. In addition, the model is compared
against artificial neural networks (ANNs) and the most commonly used empirical correlations in predicting
Pb and Bob.
Evaluation methods
The following statistical accuracy measures were used to evaluate the predictive power of the built models.
Average percent relative error (Er). This statistical measure is the average relative deviation between the
experimental target values and the estimated ones. It is defined as follows:
$$E_r = \frac{1}{n} \sum_{i=1}^{n} E_i$$
where Ei denotes the percent relative deviation between the experimental value and its corresponding
predicted one:
$$E_i = \frac{x_{exp} - x_{pred}}{x_{exp}} \times 100$$
Here xexp denotes the experimental value, xpred denotes the predicted one, and n denotes the total number
of data points.
Average absolute percent relative error (Ea). This statistical accuracy measure is the average absolute
relative deviation between the experimental values and the estimated ones. A lower value of Ea
indicates higher accuracy. It is defined as follows:
$$E_a = \frac{1}{n} \sum_{i=1}^{n} \left| E_i \right|$$
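The two measures can be computed with a few lines of code. The sketch below assumes that the relative deviation is taken as the experimental value minus the predicted value, divided by the experimental value. The function names are illustrative.

```python
def percent_relative_errors(x_exp, x_pred):
    """Per-point percent relative deviation, (exp - pred) / exp * 100."""
    return [(e - p) / e * 100.0 for e, p in zip(x_exp, x_pred)]


def average_percent_relative_error(x_exp, x_pred):
    """Er: signed deviations are averaged, so errors of opposite sign cancel."""
    errors = percent_relative_errors(x_exp, x_pred)
    return sum(errors) / len(errors)


def average_absolute_percent_relative_error(x_exp, x_pred):
    """Ea: absolute deviations are averaged, so no cancellation occurs."""
    errors = percent_relative_errors(x_exp, x_pred)
    return sum(abs(err) for err in errors) / len(errors)


# Made-up values for illustration: one under-prediction and one over-prediction.
x_exp = [1000.0, 2000.0]
x_pred = [900.0, 2200.0]
er = average_percent_relative_error(x_exp, x_pred)
ea = average_absolute_percent_relative_error(x_exp, x_pred)
```

In this example the two deviations are +10% and -10%, so Er is zero while Ea is 10%. This is why a model can show an Er close to zero and still have a large Ea, and why Ea is the measure used for ranking.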
Results
Results show that the BDTR model with K-means clustering outperforms the best empirical
correlations in predicting Pb and Bob in terms of the average absolute percent relative error (Ea), where a
lower value indicates closer agreement between the predicted values and the experimental ones. The studied
empirical correlations include Glasø [20], Dokla and Osman [23], Petrosky [22], Al-Marhoun [21], Farshad
[30], Vasquez and Beggs [19], Kartoatmodjo [5], and Standing [17]. The accuracy of the BDTR model with
K-means clustering preprocessing is also compared against the standalone model. The K-means clustered
BDTR model outperforms the standalone model in predicting Pb (Ea of 8.07% versus 9.36%) and performs
essentially on par with the standalone model in predicting Bob (Ea of 0.856% versus 0.859%). In addition,
the BDTR predictive model with K-means clustering is compared against the ANN predictive model. The structure of
the ANN model consists of one hidden layer with five neurons [13]. The statistical results of the comparative
study are shown in Table 3 for Pb and in Table 4 for Bob. As shown in the two tables, the BDTR predictive
model with K-means clustering produced the lowest Ea in predicting Pb and Bob compared to the
most commonly used empirical correlations, the ANN, and the standalone model.
Table 3—Statistical measures for estimating Pb.
Correlation                      Ea (%)    Er (%)
Labedi                           27.82     22.59
Kartoatmodjo                     28.17     22.78
Vasquez and Beggs                32.98     29.99
Glasø                            34.32     30.96
Dokla and Osman                  39.82     30.88
Standing                         25.20     18.81
Al-Marhoun                       49.61     45.30
Farshad                          22.73     11.48
ANN [13]                         15.38     -0.02
Standalone BDTR                   9.36     -1.09
BDTR with K-Means clustering      8.07     -0.99
Table 4—Statistical measures for estimating Bob.
Correlation                      Ea (%)    Er (%)
Farshad                          2.38      -1.95
Abdul Majeed                     2.53      -0.73
Kartoatmodjo                     2.10      -1.35
Vasquez and Beggs                3.51      -2.67
Glasø                            4.32      -3.93
Dokla and Osman                  3.21      -1.76
Standing                         2.47      -1.32
Al-Marhoun                       2.35      -1.60
Petrosky                         2.52      -1.52
Labedi                           2.62      -1.68
Obomanu                          2.49       1.18
Elsharkawy                       3.40       1.21
ANN [13]                         2.04       1.39
Standalone BDTR                  0.859     -0.08
BDTR with K-Means clustering     0.856     -0.05
Fig. 1 shows a cross plot of the predicted Pb values against the corresponding experimental ones, and
Fig. 2 shows the same comparison for Bob. As shown in Fig. 1 and Fig. 2, the values estimated by the
BDTR model with K-means clustering are in good agreement with the experimental ones.
Figure 1—A cross plot of bubble point pressure for the BDTR model with K-means clustering.
Figure 2—A cross plot of oil formation volume factor at bubble point pressure for the BDTR model with K-means clustering.
Table 5 shows that the predictive model identifies the solution gas–oil ratio (Rs) as the most important
input feature in estimating both Pb and Bob. The Permutation Feature Importance (PFI) score is used to
determine the importance of the input features.
Table 5—The input feature importance to the estimated values.
                    Input Features (PFI Scores)
Output Features     Rs       γg       T        γAPI
Pb                  0.694    0.076    0.0147   0.102
Bob                 1.060    0.090    0.272    0.183
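For readers unfamiliar with permutation feature importance, the idea is to shuffle one input column at a time and record how much the model's score degrades. The sketch below assumes scikit-learn's permutation_importance with its default regressor score (R²). The text does not state which implementation or scoring metric produced the values in Table 5, and the data here are synthetic stand-ins, not the paper's dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

FEATURES = ["Rs", "gamma_g", "T", "gamma_API"]  # column labels, for illustration


def pfi_scores(model, X, y, n_repeats=10, seed=0):
    """Mean drop in the model's score when each column is shuffled in turn."""
    result = permutation_importance(
        model, X, y, n_repeats=n_repeats, random_state=seed
    )
    return dict(zip(FEATURES, result.importances_mean))


# Synthetic data (assumption): the target depends mostly on the first column,
# so that column is expected to receive the highest score.
rng = np.random.default_rng(0)
X = rng.random((300, 4))
y = 5.0 * X[:, 0] + 0.5 * X[:, 3] + 0.05 * rng.standard_normal(300)

model = GradientBoostingRegressor(n_estimators=100, random_state=0).fit(X, y)
scores = pfi_scores(model, X, y)
ranking = sorted(scores, key=scores.get, reverse=True)
```

The absolute scale of the scores depends on the scoring metric, so only the ranking of features should be compared between this sketch and Table 5.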
Conclusion
The BDTR model with K-means clustering outperforms the standalone model, ANNs and the most
commonly used empirical correlations in predicting Pb. For Bob, the K-means clustered BDTR model
outperforms the most commonly used empirical correlations and ANNs, and performs nearly the same
as the standalone model, because the clustering preprocessing step adds little predictive power
relative to the dominant importance of the Rs feature in predicting Bob.
The built model is accurate, and it can be reliably used for obtaining heuristic feature importance in an
interpretable representation. In addition, the K-means clustering integrated BDTR model can be extended
towards predicting other PVT properties and towards other estimations in upstream exploration and
production computations, such as predicting permeability and porosity. Moreover, this model can be
retrained and then applied to studying the gas-liquid multiphase flow phenomenon. Furthermore, this
model can be integrated into simulators used in the upstream oil and gas industry.
Acknowledgement
The authors acknowledge the support received from the Research and Development (R&D) department at
Kuwait Oil Company (KOC). The authors are especially indebted to Prof. Ridha Gharbi, a senior consultant
at KOC, for providing the worldwide crude oils' dataset.
Nomenclature
ANN Artificial Neural Network
Bob Oil formation volume factor at bubble point pressure, bbl/stb
Ea Average absolute percent relative error, %
EOS Equation of state
Er Average percent relative error, %
FN Functional Network
OFVF Oil formation volume factor
Pb Bubble point pressure, psi
PFI Permutation Feature Importance, score
PVT Pressure/volume/temperature
Rs Solution gas–oil ratio, scf/stb
STB Stock-tank barrel
SVM Support Vector Machine
SVR Support Vector Machine Regression
T Reservoir temperature, °F
γAPI Oil API gravity, °API
γg Gas specific gravity, air = 1.0
γo Oil specific gravity, water = 1.0
References
1. G. De Ghetto, et al, "Pressure-volume-temperature correlations for heavy and extra heavy oils,"
SPE International Heavy Oil Symposium, 1995.
2. E. A. Osman, et al, "Prediction of oil PVT properties using neural networks," SPE Middle East
Oil Show Society of Petroleum Engineers, 2001.
3. E. A. El-Sebakhy, et al, "Support vector machines framework for predicting the PVT properties
of crude oil systems," SPE Middle East oil and gas show and conference, Society of Petroleum
Engineers, 2007.
4. M. H. Goda, et al, "Prediction of the PVT data using neural network computing theory," The 27th
Annual SPE International Technical Conference and Exhibition in Abuja, Nigeria, August 4–6, p.
SPE85650, 2003.
5. T. Kartoatmodjo, et al, "Large data bank improves crude physical property correlations," Oil Gas
J., vol. 92, no. 27, pp. 51–55, 1994.
6. M. A. Oloso, et al, "Hybrid functional networks for oil reservoir PVT characterization," Expert
Systems with Applications, vol. 87, no. C, pp. 363–369, 2017.
7. R. B. Gharbi, et al, "Universal neural network based model for estimating the PVT properties of
crude oil systems," Energy & Fuels, vol. 13, no. 2, pp. 454–458, 1999.
8. A. Hashemi Fath, et al, "Development of an artificial neural network model for prediction of
bubble point pressure of crude oils," Petroleum, vol. 4, no. 3, pp. 281–291, 2018.
9. E. Osman, et al, "Artificial neural networks models for predicting PVT properties of oil field
brines," paper SPE 93765, 14th SPE Middle East Oil & Gas Show and Conference in Bahrain,
March 2005.
10. A. Khoukhi, et al, "Support vector regression and functional networks for viscosity and gas oil
ratio curves estimation," International Journal of Computational Intelligence and Applications,
vol. 10, no. 3, pp. 269–293, 2011.
11. D. Ramyachitra, et al, "Imbalanced dataset classification and solutions: A review," International
Journal of Computing and Business Research (IJCBR), vol. 5, pp. 2229–6166, 2014.
12. W. D. J. McCain, et al, "Correlation of bubblepoint pressures for reservoir oils-a comparative
study," SPE Eastern Regional Meeting, Pittsburgh, Pennsylvania, 1998.
13. R. Gharbi, et al, "Predicting the bubble-point pressure and formation-volume-factor of worldwide
crude oil systems," J. Pet. Sci. Technol, vol. 21, no. 1-2, pp. 53–79, 2003.
14. S. S. Rafiee-Taghanaki, et al, "Implementation of SVM framework to estimate PVT properties of
reservoir oil," Fluid Phase Equilib, vol. 346, pp. 25–32, 2013.
15. T. Ahmed, Hydrocarbon Phase Behavior, Gulf Publishing, Houston, 1989.
16. W.D. McCain, The Properties of Petroleum Fluids, PennWell Books, 1990.
17. M. Standing, "A pressure-volume-temperature correlation for mixtures of California oils and
gases," Drilling and Production Practice, pp. 275–287, 1947.
18. J. Lasater, "Bubble point pressure correlation," J. Petrol. Technol., vol. 10, no. 5, pp. 65–67,
1958.
19. M. Vazquez, et al, "Correlations for fluid physical property prediction," J. Pet. Technol, vol. 32,
no. 6, pp. 968–970, 1980.
20. O. Glasø, "Generalized pressure-volume-temperature correlations," J. Pet. Technol vol. 32, no. 5,
pp. 785–795, 1980.
21. M. Al-Marhoun, "PVT correlations for Middle East crude oils," J. Pet. Technol, vol. 40, no. 5, pp.
650–666, 1988.
22. G. E. Petrosky Jr., et al, "Pressure-volume-temperature Correlations for Gulf of Mexico Crude
Oils," SPE Annual Technical Conference and Exhibition, Society of Petroleum Engineers, Texas,
1993.
23. M. E. Dokla, et al, "Correlation of PVT properties for UAE crudes," SPE Form. Eval., vol. 7, no.
1, pp. 41–46, 1992.
24. M. I. Omar, et al, "Development of new modified black oil correlations for Malaysian Crudes,"
SPE Asia Pacific Oil and Gas Conference, Singapore, 1993.
25. A. Naseri, et al, "A correlation approach for prediction of crude oil viscosities," J. Petrol. Sci.
Eng., vol. 47, no. 3-4, pp.163–174, 2005.
26. B. Dindoruk, et al, "PVT properties and viscosity correlations for Gulf of Mexico Oils," SPE
Reservoir Eval. Eng., vol. 7, no. 6, pp. 427–437, 2004.
27. M. Khairy, et al, "PVT correlations developed for Egyptian crudes," Oil Gas J., vol. 96, no. 19,
pp.114–116, 1998.
28. S. Macary, et al, "Derivation of PVT correlations for the Gulf of Suez crude oils," Sekiyu Gakkai
shi, vol. 36, no. 6, pp. 472–478, 1993.
29. R. M. Labedi, "Use of production data to estimate the saturation pressure, solution GOR,
and chemical composition of reservoir fluids," SPE Latin America Petroleum Engineering
Conference, Rio de Janeiro, Brazil, 1990.
30. F. Farshad, et al, "Empirical PVT correlations for Colombian crude oils," SPE Latin America/
Caribbean Petroleum Engineering Conference, Port-of-Spain, Trinidad, 1996.
31. R. B. Gharbi, et al, "Neural network model for estimating the PVT properties of Middle East
crude oils," Middle East Oil Show and Conference Society of Petroleum Engineers, 1997.
32. A. M. Elsharkawy, "Modeling the properties of crude oil and gas systems using RBF network,"
SPE Asia Pacific Oil and Gas Conference and Exhibition, Society of Petroleum Engineers Inc,
Perth, Australia, 1998.
33. M. A. Al-Marhoun, "New correlations for formation volume factors of oil and gas mixtures," J.
Can. Pet. Technol., vol. 31, no. 3, pp. 22–26, 1992.
34. A. M. Malallah, et al, "Accurate estimation of the world crude oil PVT properties using graphical
alternating conditional expectation," Energy Fuels. vol. 20, no. 2, pp. 688–698, 2006.
35. A. R. Moghadassi, et al, "A new approach for estimation of PVT properties of pure gases based
on artificial neural network model," Braz. J. Chem. Eng., vol. 26, pp. 199–206, 2009.
36. R. Perry, H. Green, Perry's Chemical Engineers' Handbook, seventh ed., McGraw-Hill, New
York, 1999.
37. E. El-Sebakhy, "Forecasting PVT properties of crude oil systems based on support vector
machines modeling scheme," Journal of Petroleum Science and Engineering, vol. 64, no. 1-4, pp.
25–34, 2009.
38. E. A. El-Sebakhy, "Data mining in forecasting PVT correlations of crude oil systems based on
Type1 fuzzy logic inference systems," Computers & Geosciences, vol. 35, no. 9, pp. 1817–1826,
2009.
39. M. A. Oloso, et al, "Hybrid functional networks for oil reservoir PVT characterisation," Expert
Systems with Applications, vol. 87, pp. 363–369, 2017.
40. M. B. Standing, "Volumetric and phase behavior of oil field hydrocarbon systems," Dallas, Texas:
Society of Petroleum Engineers of AIME, 1977.
41. R. A. Almehaideb, "Improved PVT correlations for UAE crude oils," Middle east oil show and
conference, Society of Petroleum Engineers, 1997.
42. G. E. Petrosky Jr., et al, "Pressure-volume-temperature correlations for Gulf of Mexico crude
oils," SPE Reservoir Evaluation & Engineering, vol. 1, no. 5, pp. 416–420, 1998.
43. A. A. Al-Shammasi, "A review of bubblepoint pressure and oil formation volume factor
correlations," SPE Reservoir Evaluation & Engineering, vol. 4, no. 2, pp. 146–160, 2001.
44. A. Jarrahian, et al, "Empirical estimating of black oils bubblepoint (saturation) pressure," Journal
of Petroleum Science and Engineering, vol. 126, pp. 69–77, 2015.
45. R. Talebi, et al, "Application of soft computing approaches for modeling saturation pressure of
reservoir oils," Journal of Natural Gas Science and Engineering, vol. 20, pp. 8–15, 2014.
46. Y. Hajizadeh, "Intelligent prediction of reservoir fluid viscosity," Production and operations
symposium, Society of Petroleum Engineers, 2007.
47. A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no.
8, pp. 651–666, 2010.
48. O. O. Bello, et al, "Comparison of the performance of empirical models used for the prediction
of the PVT properties of crude oils of the Niger Delta," Petrol. Sci. Technol., vol. 26, no. 5, pp.
593–609, 2008.
49. M. A. Mahmood, et al, "Evaluation of empirically derived PVT properties for Pakistani crude
oils," J. Petrol. Sci. Eng., vol. 16, no. 4, pp. 275–290, 1996.
50. J. N. Moghadam, et al, "Introducing a new method for predicting PVT properties of Iranian crude
oils by applying artificial neural networks," Petrol. Sci. Technol., vol. 29, no. 10, pp. 1066–1079,
2011.
51. G. De Ghetto, et al, "Reliability analysis on PVT correlations," SPE European Petroleum
Conference, London, United Kingdom, 1994.
52. A. Al-Shammasi, "Bubble point pressure and oil formation volume factor correlations," SPE
Middle East Oil Show & Conference, vol. 5, pp. 241–256, 1999.
Keywords: Predictive model, PVT, Oil and gas, Reservoir characterization

Table 1—The classification of crude oils based on API gravity.
Classification    °API Range
Light             °API > 31.1
Medium            22.3 ≤ °API ≤ 31.1
Heavy             °API < 22.3
Extra heavy       °API < 10.0

Table 2—Statistical measures of the PVT datasets.
Property    Min    Max    Mean
T (°F)      74     342    184.2