
Dr. Srikanta Mishra
mishras@battelle.org
+1-614-424-5712

Application of Data Analytics in


Petroleum Engineering
Virtual Workshop Offered for
IIT(ISM) Dhanbad Dept. of Petroleum Engineering
25 September 2021
Speaker Introduction
• Technical Director, Geo-energy
Modeling & Analytics, Battelle Memorial
Institute, Columbus, Ohio, USA
• Recipient of SPE 2021 International
Award for Distinguished Membership
• Author of book “Applied Statistical
Modeling and Data Analytics”
• Instructor of multiple SPE and industry
workshops on petroleum data analytics
• SPE Distinguished Lecturer
(2018-19) on big data analytics
• Technical Lead, US Department of
Energy’s Subsurface ML Initiative
• PhD (Stanford), MS (Texas-Austin),
BTech (ISM) – all in Petroleum Engg.

Mishra - IIT(ISM) 2021 2


The Attraction / Challenge

Big Data + Analytics = Game Changer?

• Big Data = large volumes of data about subsurface, physical infrastructure and flows
• New insights about the reservoir from "data mining" can help increase operational efficiencies → actionable information

Mishra - IIT(ISM) 2021 3


Workshop Goals

• Provide a practical introduction to applied statistical


modeling and data analytics techniques
▪ Foundational concepts
▪ Linear regression and variants (for building simple input-output models)
▪ Multivariate data reduction and clustering (for finding sub-groups of data
that have similar attributes)
▪ Machine learning for regression and classification (for developing more
complex data-driven input-output models)
▪ R/RATTLE software demonstration (application of ML techniques)

• Emphasize practical aspects (not statistical rigor)

Mishra - IIT(ISM) 2021 4


Workshop Agenda

6:00 – 6:50p  Introduction, statistical foundations – regression analysis and multivariate statistics

6:50 – 7:00p  Break

7:00 – 7:50p  Machine learning basics, case studies

7:50 – 8:00p  Break

8:00 – 8:50p  Software demonstration

8:50 – 9:00p  Wrap up

Mishra - IIT(ISM) 2021 5


What is Statistics All About?

INFORMATION → Structure ~ Trends ~ Relationships

Relationships require specifying the form of the model

Mishra - IIT(ISM) 2021 6


Big Data & Analytics – What & Why?
• Big Data → Volume, Velocity, Variety
• Data Analytics → examine data, understand "what does the data say", learning and prediction → make better decisions

Data Analytics (aka Machine Learning, Knowledge Discovery, Data Mining)
helps understand hidden patterns and relationships in large, complex datasets –
the machine (algorithm) determines the form of the (black-box) model

Mishra - IIT(ISM) 2021 7


Competing on Analytics
by Tom Davenport
It’s virtually impossible to differentiate yourself from competitors
based on products alone. Your rivals sell offerings similar to yours.
And thanks to cheap offshore labor, you’re hard-pressed to beat
overseas competitors on product cost.

How to pull ahead of the pack? Become an analytics competitor:


Use sophisticated data-collection technology and analysis to wring
every last drop of value from all your business processes.

With analytics, you discern not only what your customers want but
also how much they’re willing to pay and what keeps them loyal.
You look beyond compensation costs to calculate your workforce’s
exact contribution to your bottom line. And you don’t just track
existing inventories; you also predict/prevent future inventory issues.

Harvard Business Review 84(1):98-107, January 2006

Mishra - IIT(ISM) 2021 8


Types of Analytics

Mishra - IIT(ISM) 2021 9


A Rose by Any Other Name!
Overlapping terms: Artificial Intelligence ~ Machine Learning ~ Knowledge Discovery ~ Data Mining ~ Data Analytics ~ Statistical Learning

Mishra - IIT(ISM) 2021 10


Data Analytics v/s Machine Learning
v/s Artificial Intelligence

• Data analytics → data collection and analysis to understand hidden patterns and relationships (steps A, B)
• Machine learning → building a model between predictors and response, most commonly using black-box methods (step B)
• Artificial intelligence → applying the predictive model with new data to make decisions, without human intervention (with possibility of feedback) (steps C, B, D, E)

Workflow steps in the slide diagram: A, C – Collect data; B – Infer rules (predictive model); D, E – Make decision

Mishra - IIT(ISM) 2021 11


Data Analytics Process
Exploratory Data Analysis
• Multi-dimensional data visualization
• Scatter-plot matrix, trellis plots

Unsupervised Learning
• Data reduction and clustering
• PCA, k-means, self-organizing maps

Supervised Learning
• Regression and classification
• Random forest, SVM, neural nets, kriging

Mishra - IIT(ISM) 2021 12


Supervised v/s Unsupervised Learning

Ma et al., 2018, Symmetry, 10, 734

Mishra - IIT(ISM) 2021 13


Areas of Application in O&G

• Exploration Data Mining – finding hidden patterns in large geologic datasets
• Predictive Maintenance – combining streaming data with past performance to predict potential failures
• Reservoir Management – identifying factors for improved performance
• Performance Forecasting – real-time prediction of system response (drilling, fluid injection)
• Proxy Modeling – creating fast system "emulators"

Reduce cost, improve productivity, increase efficiency

Mishra - IIT(ISM) 2021 14


Exponential Growth in ML Applications
"Machine Learning" Hits from OnePetro Database
6000

5000

4000

3000

2000

1000

0
1985 1990 1995 2000 2005 2010 2015 2020 2025

Mishra - IIT(ISM) 2021 15


Examples
• Predicting core-derived permeability from basic well-log attributes
• Predicting rock facies (originally based on core and advanced log data) from basic well log attributes
• Predicting incidence of vuggy zones in carbonate reservoirs (originally based on core and advanced log data) from basic well log attributes
• Predicting total organic carbon content in shale formations from basic well log attributes
• Predicting geomechanical properties in shale wells from basic well log attributes
• Predicting oil production from shale wells as a function of geologic and completion related parameters
• Calculating water saturation from basic well log attributes
• Predicting PVT properties from basic crude oil and reservoir characteristics
• Predicting oil production as a function of water injection rates of surrounding injectors to identify injector-producer connectivity
• Predicting bottom-hole and/or surface pressure as a function of production rate to flag potential anomalous trends and/or adverse events
• Identifying variables responsible for equipment failure (e.g., ESP) based on historical data, and calculating forward-looking failure probability using real-time data
• Analyzing drillers' logs, field reports etc. (using natural language processing) to identify incidence of adverse events and underlying causes
• Building predictive models for drilling rate of penetration using historical data, and forecasting future response using real-time data
• Building fast surrogate (proxy) models from reservoir simulation outputs for repetitive calculations (e.g., history matching, uncertainty quantification)
• Using data-driven fits to decline curves for predicting EUR

Mishra - IIT(ISM) 2021 16


Statistical Foundations

Basic Regression Analysis


Ch. 4, Mishra and Datta-Gupta (2017)

Mishra - IIT(ISM) 2021 17


Linear Regression

• Given data (x1,y1), (x2, y2)….(xn,yn)


• Postulate model: yᵢ = a + b·xᵢ + eᵢ ; e ~ N(0, σ)
• Minimize objective function: S(a,b) = Σ(yᵢ − a − b·xᵢ)²
• Residuals should be normally distributed with mean = 0 and SD = σ
• If trend in eᵢ versus xᵢ is non-random (i.e., cyclic, monotonic, etc.), then the linear model is not appropriate

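A minimal R sketch of this workflow on made-up data (the variable names net_pay and ipp are hypothetical stand-ins for the example on the next slide):

# Simple linear regression in base R (sketch with synthetic data)
set.seed(1)
net_pay <- runif(31, 5, 90)                          # predictor, e.g., net pay (ft)
ipp     <- 97 + 2.1 * net_pay + rnorm(31, sd = 45)   # response, e.g., initial potential (BOPD)

fit <- lm(ipp ~ net_pay)        # minimizes the sum of squared residuals S(a, b)
summary(fit)                    # coefficients, standard errors, t-stats, R-squared

# Residual diagnostics: look for non-random trends and departures from normality
plot(net_pay, resid(fit)); abline(h = 0, lty = 2)
qqnorm(resid(fit)); qqline(resid(fit))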
Mishra - IIT(ISM) 2021 18


Linear Regression Example
[Scatter plot: Initial Well Potential (BOPD) vs. Net Pay (ft), 0–90 ft; fitted line y = 2.0626x + 97.397, R² = 0.5385]

Mishra - IIT(ISM) 2021 19


Regression Coefficients
Regression Statistics
  Multiple R           0.733851
  R Square             0.538538   ← fraction of total variance explained by model
  Adjusted R Square    0.522625
  Standard Error       44.71329   ← estimated SD of error term in regression (~ RMSE)
  Observations         31

               Coeff    Std Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept      97.397   13.844      7.035    9.75E-08   69.082      125.711
X Variable 1   2.063    0.355       5.818    2.63E-06   1.337       2.788

Coeff and Std Error = mean and SD of regression coefficients; t Stat = Coeff/SE (the bigger the better); P-value: the smaller the better (likelihood that Coeff is different from zero); 95% confidence interval ~ Coeff ± 2·SE

Mishra - IIT(ISM) 2021 20


Diagnostic Plots
[Panels: Observed v/s Predicted (Y-predicted vs. Y-observed); X Variable 1 Residual Plot (residuals vs. X); Normal Probability Plot of residuals vs. standard normal deviate (fit y = 7.2525x, R² = 0.9879)]

Mishra - IIT(ISM) 2021 21


Non-Linear Regression

• Given data (x1,y1), (x2, y2)….(xn,yn)


• Postulate model: f(yᵢ) = a + b·g(xᵢ) + eᵢ ; e ~ N(0, σ)
• Minimize objective function: S(a,b) = Σ(f(yᵢ) − a − b·g(xᵢ))²  (can use Excel SOLVER)
• f(.) and g(.) → linearizing transforms
  ▪ Logarithmic  ▪ Power
  ▪ Exponential  ▪ Non-parametric

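A brief R sketch of both routes (linearizing transform and direct non-linear fit); the power-law data are a synthetic assumption:

# Non-linear regression via a linearizing transform vs. direct fit (sketch)
set.seed(2)
x <- runif(40, 1, 100)
y <- 5 * x^0.7 * exp(rnorm(40, sd = 0.1))   # synthetic power-law data with noise

# Power model y = c * x^b is linear after a log-log transform
fit_lin <- lm(log(y) ~ log(x))
coef(fit_lin)                               # intercept = log(c), slope = b

# Direct non-linear least squares (analogous to using Excel SOLVER)
fit_nls <- nls(y ~ c * x^b, start = list(c = 1, b = 1))
summary(fit_nls)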
Mishra - IIT(ISM) 2021 22


Multiple Linear Regression

• Also called Ordinary Least Squares (OLS) regression


• Assume a linear model between response and predictors

• Solve for bk by minimizing the sum of squared residuals

• Easily interpretable, fast, well-studied statistical properties


• Can capture some non-linear behavior through
transformation of variables (e.g., quadratic model)

Mishra - IIT(ISM) 2021 23


How Many Terms in Regression?
• Seek parsimonious
balance between
goodness of fit and
model complexity
• Minimize AIC (Akaike Information Criterion)
  AIC = n·log(SSE/n) + 2p
  where
  n = number of observations
  p = number of model parameters
  SSE = residual sum of squares

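A short R sketch of comparing candidate models by AIC (synthetic data; the model forms are purely illustrative):

# Model selection with AIC in base R (sketch)
set.seed(3)
x <- runif(50)
y <- 1 + 2 * x + 0.5 * x^2 + rnorm(50, sd = 0.2)

m1 <- lm(y ~ x)               # linear
m2 <- lm(y ~ x + I(x^2))      # quadratic
m3 <- lm(y ~ poly(x, 5))      # over-parameterized polynomial

AIC(m1, m2, m3)               # prefer the model with the smallest AIC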
Mishra - IIT(ISM) 2021 24


Key Takeaways

• First, plot the data !!!!


• Model structure important – may need to experiment
with multiple models for best fit
• Understand distribution of residuals
• Examine significance of regression coefficients
• Check for parsimonious outcomes
• Partition data into sub-populations prior to regression
analysis, if necessary

Mishra - IIT(ISM) 2021 25


Plotting Multivariate Data

Mishra et al., 2014, Env. Geosci., 21(2), 59-74.

Mishra - IIT(ISM) 2021 26


Visualizing Correlations

• Rank (Spearman) correlation → robust measure for strength of association (linear/non-linear)
• Rank samples from smallest (rank = 1) to largest (rank = N)
• Helps identify both
  ▪ Redundant variables
  ▪ Relevant variables
• Also referred to as "feature selection" when dealing with large data sets – multiple approaches possible
  https://www.kdnuggets.com/2021/06/feature-selection-overview.html

[Rattle heat map: Spearman correlations for CO2_MMP.csv variables (MMP, T, C1, C2.C6, C7., MWC7., MWC5., API, Vol, V.I, Int); color scale from -1 to 1]

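A minimal R sketch of computing the rank correlation matrix; the file name CO2_MMP.csv comes from the slide, but its column contents are assumed:

# Spearman rank correlation matrix in R (sketch; assumes all columns are numeric)
mmp <- read.csv("CO2_MMP.csv")

spear <- cor(mmp, method = "spearman", use = "pairwise.complete.obs")
round(spear, 2)

# Optional heat map, loosely similar to the Rattle display
heatmap(spear, symm = TRUE, margins = c(6, 6))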
Mishra - IIT(ISM) 2021 27


Statistical Foundations

Multivariate Analysis
Ch. 5, Mishra and Datta-Gupta (2017)

Mishra - IIT(ISM) 2021 28


Principal Component Analysis

• Statistical technique for


▪ reducing dimensionality

▪ making data independent of each other

▪ without significant loss of information

• PCs formed by weighted linear combination of


original variables (rotation and projection)
• Coordinate transformation into a new set of variance
maximizing, mutually orthogonal coordinates (PCs)
• PCs can be interpreted as surrogate variables

Mishra - IIT(ISM) 2021 29


PCA – Details

PC = a1x1 + a2x2 + …..anxn

Weighting factors Original variables

• Weighting factors given by eigenvectors of correlation


matrix of original variables
• Relative importance of PCs given by eigenvalues of
correlation matrix
• Correlation between PCs and original variables given
by factor loadings

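A short R sketch of PCA on the 3-variable example used on the following slides (standardize, then rotate):

# Principal component analysis in base R (sketch, using the 3-variable example data)
X <- data.frame(
  x1 = c(2, 4, 6, 7, 5, 8, 9, 7, 10, 8, 12, 9),
  x2 = c(1, 3, 7, 6, 4, 9, 8, 10, 11, 9, 11, 13),
  x3 = c(2, 1, 5, 3, 7, 6, 7, 9, 11, 8, 10, 14)
)

pca <- prcomp(X, center = TRUE, scale. = TRUE)   # work with standardized variables
pca$rotation        # eigenvectors (weighting factors / loadings)
pca$sdev^2          # eigenvalues = variances of the PCs
summary(pca)        # proportion of variance explained by each PC
head(pca$x)         # principal component scores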
Mishra - IIT(ISM) 2021 30


Raw and Standardized Data

x1 x2 x3 dx1 dx2 dx3


2 1 2 -1.92 -1.85 -1.29
4 3 1 -1.19 -1.30 -1.55
6 7 5 -0.46 -0.19 -0.50
7 6 3 -0.09 -0.46 -1.02
5 4 7 -0.82 -1.02 0.02
8 9 6 0.27 0.37 -0.24
9 8 7 0.64 0.09 0.02
7 10 9 -0.09 0.65 0.54
10 11 11 1.01 0.93 1.07
8 9 8 0.27 0.37 0.28
12 11 10 1.74 0.93 0.81
9 13 14 0.64 1.48 1.85
mean 7.25 7.67 6.92 0.00 0.00 0.00
variance 7.48 12.97 14.63 1.00 1.00 1.00

Mishra - IIT(ISM) 2021 31


Eigenvalues

Correlation Matrix
       x1      x2      x3
x1     1.000   0.886   0.750
x2     0.886   1.000   0.889
x3     0.750   0.889   1.000

Eigenvalues
λ1 = 2.685
λ2 = 0.251
λ3 = 0.065

• Eigenvalues are the solution of |C − λI| = 0, where C = correlation matrix, I = identity matrix, λ = vector of eigenvalues
• An N (obs) x P (var) dataset has P eigenvalues
• Eigenvalue = variance of corresponding PC
• Σλᵢ = 3 = P (trace of C)

Mishra - IIT(ISM) 2021


32
Selection of Key PCs
• Based on relative magnitude of eigenvalues
• Each eigenvalue represents fraction of total variance explained by corresponding PC
• Criteria for selecting key PCs
  ▪ Scree plot (keep all PCs above "floor" level)
  ▪ Kaiser criterion (keep all PCs with eigenvalue > 1)
  ▪ Variance threshold (keep all PCs explaining ~90% variance)

[Scree plot: eigenvalues of correlation matrix (active variables only) vs. eigenvalue number; variance explained per PC = 43.38%, 23.12%, 13.92%, 10.27%, 6.30%, 3.01%, ...]

Mishra - IIT(ISM) 2021 33


Eigenvectors

• Eigenvectors are the solution of (C − λᵢI)uᵢ = 0, where C = correlation matrix, I = identity matrix, λᵢ = eigenvalue for the i-th PC, uᵢ = eigenvector for λᵢ

Eigenvalues        Eigenvectors
λ1 = 2.685          u1      u2      u3
λ2 = 0.251          0.567   0.711   0.416
λ3 = 0.065          0.597  -0.007  -0.802
                    0.567  -0.703   0.429

• Eigenvectors are coefficients of variables in the linear equations defining PCs
• Also define rotation from original variable to PC space

Mishra - IIT(ISM) 2021


34
Principal Components

dx1 dx2 dx3 pc1 pc2 pc3


-1.92 -1.85 -1.29 -2.92 -0.45 0.13
-1.19 -1.30 -1.55 -2.33 0.25 -0.12
-0.46 -0.19 -0.50 -0.65 0.03 -0.26
-0.09 -0.46 -1.02 -0.91 0.66 -0.11
-0.82 -1.02 0.02 -1.06 -0.59 0.48
0.27 0.37 -0.24 0.24 0.36 -0.29
0.64 0.09 0.02 0.43 0.44 0.20
-0.09 0.65 0.54 0.64 -0.45 -0.32
1.01 0.93 1.07 1.73 -0.04 0.13
0.27 0.37 0.28 0.54 -0.01 -0.06
1.74 0.93 0.81 1.99 0.66 0.33
0.64 1.48 1.85 2.30 -0.86 -0.13
mean 0.00 0.00 0.00 0.00 0.00 0.00
variance 1.00 1.00 1.00 2.685 0.250 0.065

λ1 = 2.685 (variance of pc1)

Example (row 6): PC1 = 0.27×0.567 + 0.37×0.597 − 0.24×0.567 = 0.24

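A small R check tying the score calculation to the earlier prcomp sketch (eigenvector signs may be flipped relative to the slide, which is immaterial):

# PC scores = standardized data multiplied by the eigenvector matrix (sketch)
Z <- scale(X)                        # standardized data (dx1, dx2, dx3)
scores <- Z %*% pca$rotation         # same as pca$x (up to sign)
round(scores[6, ], 2)                # row 6: PC1 ~ 0.24 as in the worked example
apply(scores, 2, var)                # variances reproduce the eigenvalues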
Mishra - IIT(ISM) 2021 35


Example – PCA (Salt Creek Data)
[Bar chart: fraction of total variance explained by the first five PCs = 0.486, 0.272, 0.116, 0.098, 0.028]

Mishra - URTeC 2019


36
Cluster Analysis

• The goal is to group objects that are similar based on some


measured characteristics.
• Unsupervised classification since the operation is not guided
by a priori hypothesis or external models

Mishra - AAPG ES 2019 37


k-Means Clustering

• Modifies an initial classification by moving objects


from one group to another
• Requires a measure of distance between each pair
of points (x, y)
o Usually Euclidean distance

• Requires specifying number of clusters in advance


• Normalize variables for better, more stable results
• Needs an initial classification of the data (i.e., use
hierarchical approach for initialization)

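A minimal R sketch of k-means on a built-in dataset (the iris measurements stand in for any numeric well or reservoir attributes):

# k-means clustering in base R (sketch)
set.seed(4)
dat <- scale(iris[, 1:4])                     # normalize variables first

km <- kmeans(dat, centers = 3, nstart = 25)   # number of clusters chosen in advance
table(km$cluster)                             # cluster sizes
km$centers                                    # cluster centroids (standardized units)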
Mishra - URTeC 2019


38
Example – Clustering (Salt Creek Data)

Mishra - URTeC 2019


39
Hierarchical Clustering

• Start with each observation in


separate groups (i.e., k = n)

• Clusters are progressively


merged until all observations are
in a single group

• At each step, the clusters chosen


for the merge are the ones that
are the least dissimilar, as defined
by some dissimilarity measure D
• Merge two clusters with smallest
dissimilarity
• Compute dissimilarity between new
cluster and all remaining clusters

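A companion R sketch for agglomerative clustering on the same stand-in data; the cut tree can also seed the k-means initialization mentioned on the previous slide:

# Hierarchical (agglomerative) clustering in base R (sketch)
dat <- scale(iris[, 1:4])
d   <- dist(dat, method = "euclidean")   # pairwise dissimilarities

hc <- hclust(d, method = "ward.D2")      # progressively merge least-dissimilar clusters
plot(hc, labels = FALSE)                 # dendrogram

groups <- cutree(hc, k = 3)              # cut the tree into k groups
table(groups)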
Mishra - IIT(ISM) 2021 40


Key Takeaways

• Different (novel) possibilities for data visualization


• Principal component analysis for reducing data
dimensionality and identifying surrogate variables

• Cluster analysis for grouping data into statistically


homogeneous populations for model building

• 1st step in determining structure within the space of independent variables, before building input-output models

Mishra - IIT(ISM) 2021 41


Mishra - IIT(ISM) 2021 42
Machine Learning

Basic Concepts
Ch. 8, Mishra and Datta-Gupta (2017)

Mishra - IIT(ISM) 2021 43


Data-Driven Modeling

• Classical statisticspostulate model between independent


(predictor) and dependent (response) variables
• Need to look beyond linear regression (and variants) for
complex multi-dimensional data sets
• Idea is to extract the model from the data without making
any assumptions regarding the underlying functional form
(supervised learning)
▪ regression problems, where response variable is continuous
(e.g., permeability)
▪ classification problems, where response variable is categorical
(e.g., rock type)

Mishra - IIT(ISM) 2021 44


Classification v/s Regression

Outputs are categorical Outputs are continuous

https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d

Mishra - IIT(ISM) 2021 45


Why ML Models and When?
• Historically, subsurface science and engineering
analyses have relied on mechanistic models
• Incorporation of causal input-output relationship
• Experienced professionals are wary of purely “black-
box” ML models that lack such understanding
• Nevertheless, the use of ML models is easy to justify when
➢ the relevant physics-based model is computation intensive
and/or immature
➢ a suitable mechanistic modeling paradigm does not exist

Mishra - IIT(ISM) 2021 46


Three Cases for Black-Box Models

• When the cost of a wrong answer is low relative to the


value of a correct answer,
➢ Proxy models in history matching

• When they produce the best results,


➢ Image libraries for recognizing well-test response patterns

• As tools to inspire and guide human inquiry,


➢ Preventive maintenance applications

Holm, 2019, Science, 367, 26-27.

Mishra - IIT(ISM) 2021 47


Key Concepts for Model Building

• Predictive modeling methods


• Model evaluation and validation
• Automatic tuning of model parameters
• Model aggregation
• Variable Importance
• Classification problems

Mishra - IIT(ISM) 2021 48


Predictive Modeling Methods
• Regression & Classification Tree → partition inputs into rectangular regions with constant values or class labels
• Random Forest → build ensemble of trees considering random subsets of observations and predictors
• Gradient Boosting Machine → build sequence of trees that address shortcomings of each previous fitted tree
• Support Vector Machine → find hyperplane maximizing separation of data using a transform of predictor space
• Artificial Neural Network → inputs mapped to outputs via hidden units using a sequence of linear or nonlinear functions
• Gaussian Process Emulation → multidimensional interpolation using trend and autocorrelation structure of data

[Schematic: example tree with splits X1 < t1, X2 < t2, X2 < t3 leading to regions R1, R2, R3, R4]

Mishra - IIT(ISM) 2021 49


Regression Model Evaluation

Model overfitting likely, if evaluated solely


against training dataset
Mishra - IIT(ISM) 2021 50
Bias and Variance
Bias  difference between expected prediction and correct value
Variance  variability in predictions (due to model complexity)

Mishra - IIT(ISM) 2021 51


Goodness-of-fit Metrics
• Average absolute error (AAE)
  AAE = (1/n) Σᵢ |yᵢ − ŷᵢ|

• Mean or root mean squared error (MSE/RMSE)

• Pseudo-R² (not bounded between 0 and 1 for general regression!)

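The metrics written as small R helper functions (a sketch; y is observed, yhat is predicted, and the fit object in the comment is hypothetical):

# Goodness-of-fit metrics (sketch)
aae       <- function(y, yhat) mean(abs(y - yhat))
rmse      <- function(y, yhat) sqrt(mean((y - yhat)^2))
pseudo_r2 <- function(y, yhat) 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)

# Example with a fitted lm object called fit (hypothetical):
# aae(y, fitted(fit)); rmse(y, fitted(fit)); pseudo_r2(y, fitted(fit))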
Mishra - IIT(ISM) 2021 52


k-fold Cross Validation
Recommended, if an independent test dataset is not available

[Schematic: full dataset split into k folds; each fold in turn is held out for prediction while a model is trained on the remaining folds]

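A manual k-fold cross-validation sketch in R (synthetic data; in practice packages such as caret automate this loop):

# 5-fold cross-validation by hand (sketch)
set.seed(5)
dat <- data.frame(x = runif(100))
dat$y <- 2 * dat$x + rnorm(100, sd = 0.3)

k     <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))     # random fold assignment

cv_rmse <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = dat[folds != i, ])          # train on k-1 folds
  pred <- predict(fit, newdata = dat[folds == i, ])    # predict the held-out fold
  sqrt(mean((dat$y[folds == i] - pred)^2))
})
mean(cv_rmse)                                          # cross-validated RMSE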
Mishra - IIT(ISM) 2021 53


General Logistics of Model Fitting

Kuhn, M. and K. Johnson, Applied Predictive Modeling, 2013. Springer.

Mishra - IIT(ISM) 2021 54


Tuning Parameters

Method                      Tuning Parameters
Regression Tree             Tree depth, cost-complexity parameter
Random Forest               Number of randomly selected predictors
Gradient Boosting Machine   Number of iterations, tree depth, shrinkage
Support Vector Regression   Cost, sigma (radial basis function)
Neural Networks             Number of hidden units, weight decay

Mishra - IIT(ISM) 2021 55


Model Aggregation – Why?
• Model fits measured in terms of training or test error –
multiple competing models may arise!

• Aggregating over a large set of acceptable models can


provide more robust understanding and predictions
• Ensemble models (with predictions aggregated) top
performers in data science competitions
Mishra - IIT(ISM) 2021 56
Ensemble Modeling Methods
• Model aggregation strategies
▪ Simple averaging (direct average of constituent model predictions,
e.g., using arithmetic average)

▪ Weighted averaging (weighted averaging of constituent model


predictions, e.g., using inverse of RMSE)

▪ Stacking (predictions from the constituent models are used as


predictors in an aggregate model, e.g., NN training)

• Similar arguments underlie Beven’s concept of


“equifinality” in watershed hydrology modeling in the
GLUE framework (weighted averaging)
Schuetter et al., URTeC (2019)

Mishra - IIT(ISM) 2021 57


Variable Importance

• Some algorithm specific importance


metrics available (e.g., RF, GBM, NN)

• Not always consistent across models


(because of different approaches for
handling variable interaction)
• R²-loss → loss in explanatory power of the model if the variable of interest is dropped from the regression

• Simple and intuitive model-


independent measure

• Can be aggregated for “meta ranking”

Mishra - IIT(ISM) 2021 58


Variable Importance Approaches
Strategy (Notation) – Description

• Removing a variable (Remove) – Remove a variable from the model, re-train the model and compare the reduction in pseudo-R², i.e. R² loss.
• Permuting a variable (Permute) – Permute a variable's values, which breaks the relationship between the variable and the true outcome, then compare the reduction in pseudo-R² (R² loss) of the dataset with permuted values to that with true values.
• Partial Dependence Plot (PDP) – The partial dependence plot shows the marginal effect of different variables on the predicted outcome. PDPs are "flat" for less important variables, while variables whose PDP varies across a wider range of the response are more likely to be important.
• Accumulated Local Effects Plot (ALE) – Compare how the model predictions change in a small "window" of different variables. ALE plots are a faster and unbiased alternative to partial dependence plots.
• Local Interpretable Model-Agnostic Explanations (LIME) – LIME attempts to understand the model by perturbing the input of data samples and interpreting how the predictions change. Variable weights can then be extracted from a simple local model on the perturbed dataset to explain local behavior.
• SHapley Additive exPlanations (SHAP) – SHAP is a method to explain individual predictions based on the game-theoretically optimal Shapley values. A prediction can be explained by assuming that each feature value of the instance is a "player" in a game where the prediction is the payout. Shapley values – a method from coalitional game theory – tell us how to fairly distribute the "payout" among the features.

https://christophm.github.io/interpretable-ml-book/

Mishra - IIT(ISM) 2021 59


Classification Problems

• Goal is to use the predictors (x1, x2, …, xp) to determine


a group label for the observation
• Response y is categorical ; predictors xi can be
categorical or continuous
• As in regression, the model is trained using data
▪ Response is observed for n samples y1, y2, …, yn

▪ Predictors are observed for n samples, where the kth sample is


(x1k, x2k, …, xpk)
• Same concerns as before on evaluating model
performance (i.e., use training AND test data)

Mishra - IIT(ISM) 2021 60


Classification Model Evaluation

▪ Be aware of class imbalance, especially for rare events


− Many methods can use weights on observations, or include options for
balancing the analysis with respect to class sizes

▪ One useful visualization: ROC curve + area under curve (AUC)
  − Measures probability of separating two classes (true v/s false positive rate)
  − Shows performance (TPR vs. FPR) as the classification threshold is varied over a range
  − AUC = 1 → perfect prediction
  − AUC = 0.5 → no ability to distinguish between classes
  − AUC = 0 → perfect inverse prediction

[Plot: Receiver Operating Characteristic curve, True Positive Rate vs. False Positive Rate, each from 0 to 1]

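A base-R sketch of computing AUC and tracing ROC points from predicted probabilities (synthetic labels and scores; packages such as pROC provide this directly):

# ROC / AUC from predicted class probabilities (sketch)
set.seed(6)
y    <- rbinom(200, 1, 0.3)                 # true labels (0/1)
prob <- plogis(2 * y - 1 + rnorm(200))      # synthetic predicted probabilities

# AUC = P(random positive scores higher than random negative), ties counted as 1/2
pos <- prob[y == 1]; neg <- prob[y == 0]
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc

# Points on the ROC curve: TPR vs. FPR over a range of thresholds
thr <- seq(0, 1, by = 0.05)
tpr <- sapply(thr, function(t) mean(pos >= t))
fpr <- sapply(thr, function(t) mean(neg >= t))
plot(fpr, tpr, type = "b", xlab = "False Positive Rate", ylab = "True Positive Rate")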
Mishra - IIT(ISM) 2021 61


Other Classification Metrics

Recall (and related confusion-matrix metrics)

https://manisha-sirsat.blogspot.com/2019/04/confusion-matrix.html
Mishra - IIT(ISM) 2021 62
Machine Learning Methods

Case Studies & Examples


Ch. 8, Mishra and Datta-Gupta (2017)

Mishra - IIT(ISM) 2021 63


Case Study 1 – Shale Well Productivity
Schuetter et al., SPEJ (2018)

• Wolfcamp Shale horizontal wells
  ▪ Data from 476 wells
  ▪ Goal: fit M12CO as a function of the 12 predictors
  ▪ Multiple regression modeling methods
  ▪ Model validation + variable importance

Field         Description
ID            Well ID
M12CO         Cum. production of 1st 12 producing months (BBL)
MMO12         Max. monthly production of 1st 12 producing months (BBL)
MMO2CLAT      Production efficiency (MMO12/LATLEN)
Opt2          Categorized operator code
COMPYR        Well completion year
SurfX, SurfY  Geographic location
AZM           Azimuth angle
TVDSS         True vertical depth (ft)
DA            Drift angle
LATLEN        Total horizontal lateral length (ft)
STAGE         Frac stages
FLUID         Total frac fluid amount (gal)
PROP          Total proppant amount (lb)
PROPCON       Proppant concentration (lb/gal)
GORMM12       Gas-oil-ratio of the max. producing month (MMcf/BBL)
GORC12        Avg. gas-oil-ratio of the 1st 12 producing months (MMcf/BBL)

Mishra - IIT(ISM) 2021 64


Scatter Plot Matrix Analysis

Mishra - IIT(ISM) 2021 65


Model Fits

Variety of
modeling methods

Three types of
model validation
- full training data
- 10-fold CV
- held out test data

Mishra - IIT(ISM) 2021 66


Predictor Importance

Mishra - IIT(ISM) 2021 67


Conditional Sensitivity Analysis

Mishra - IIT(ISM) 2021 68


Classification Tree Analysis

▪ Binary classification to identify factors separating top 25% from bottom 25% producing wells
▪ Top 25%: not too shallow, not too deep, long lateral with more proppant, but not too long
▪ Accuracy:

             BOTTOM 25%   TOP 25%   CORRECT ID
BOTTOM 25%       62          18        78%
TOP 25%           7          73        91%
TOTAL            69          91        70%

Mishra - IIT(ISM) 2021 69


Ensemble Modeling
M1 — direct averaging; M2 — weighted averaging;
M3a — stacking with NN; M3b — stacking with RF

Model Name   RMSE (x1k BBL)
M1           37.57
M2           37.45
M3a          36.21
M3b          36.15
LPM          47.12
QPM          40.03
SVR          39.00
RF           38.33
GBM          40.40

Mishra - IIT(ISM) 2021 70


Case Study 2 – Vug Characterization
Howat et al., AAPG (2016)

• Vuggy zones create high-permeability pathways in carbonate rocks

• Generally identified from cores and FMI logs

• Challenge: Can vuggy zones be identified from well-log response alone?

• Approach: use exploratory data analysis and machine learning tools to create classification rules

[Core/image-log illustration: zone of high density vugs, large vug, crystalline dolomite]

Mishra - IIT(ISM) 2021 71


Machine Learning Phase-1

• Identify vugs in a
single well using
image logs and
core samples

• Using that “truth”


data, train several
models to detect
vugs using sensor
log data only

Mishra - IIT(ISM) 2021 72


Machine Learning Phase-2

• Identify vugs in multiple wells
• Evaluate using the best performing model from Phase I on the new data
  ▪ Wells held out one at a time
  ▪ Model trained using the other wells, then predicted on the held out well
• Vug correct identification rate range 60% – 90%

Held Out Well   Correct ID Rate
Well #1         0.721
Well #2         0.675
Well #3         0.748
Well #4         0.820
Well #5         0.767
Well #6         0.885
Well #7         0.733
Well #8         0.604
Well #9         0.810
Well #10        0.820

Mishra - IIT(ISM) 2021 73


Example Predictions on a Well

• Train a final model


using all the wells,
then use it to identify
vugs in wells for which
no image logs are
currently available

• Output file is a
Synthetic Vug Log:
SVL (0-1)

Mishra - IIT(ISM) 2021 74


Mapping Vugs in Multiple Wells

[Maps: probability of vugs at multiple wells; q = 5 bbl/min, q = 5 bbl/min, q = 1 bbl/min]

Mishra - IIT(ISM) 2021 75


Case Study 3 – Prediction & Optimization
of Drilling Rate of Penetration

• Goal  Fit data-driven


model to predict and
optimize ROP during drilling
• Example  Data from
vertical well in Texas
• Selected features
▪ Weight on bit (WOB)
▪ Rotary speed of drilling (RPM)
▪ Rock strength (UCS)
▪ Flow rate

Hegde and Gray, 2017, JNGSE, 40, 327-335
Mishra - IIT(ISM) 2021
Model Fitting and Validation

• Random Forest
▪ R^2 = 0.96
▪ RMSE = 7.4 ft/hr
▪ Mean error = ~5%

• Linear regression
▪ R^2 = 0.42
▪ RMSE = 18.4 ft/hr
▪ Mean error = ~14%

• Results above are for


Tyler sandstone
• RF error <10% for all
other formations

Mishra - IIT(ISM) 2021


(Near) Real-time Optimization

• Using data-driven model, explore


feature space for improved ROP
▪ Change WOB, RPM, flow
▪ Use feature subspace in vicinity of
depth of interest
▪ Should not extrapolate, but expand
range of training data

• Possible to improve efficiency by


increasing ROP
• Practical considerations would
call for optimization over finite
intervals
Improved performance and cost savings

Mishra - IIT(ISM) 2021


Example [1]
Perez et al.
SPERE April 2005

• Classification tree
analysis for
identifying rock
types from basic well
log attributes
• Accounting for
missing well logs
• Application for
permeability
prediction in Salt
Creek field

Mishra - IIT(ISM) 2021 79


Example [2]
Shelley et al.
SPE-171003, 2014

• Identifying performance
drivers and completion
effectiveness for
Marcellus shale wells
• Predictive model using
ANN (Artificial Neural
Networks)
• Role of different
variables evaluated

Mishra - IIT(ISM) 2021 80


Example [3]
Schuetter et al.
SPE 174905, 2015

• Building proxy model for CO2 geologic sequestration full-physics simulation output
• Compositional simulation with 9 inputs and 3 outputs
• Different designs (Box-
Behnken, Maximin LHS,
Maximum Entropy) used to
generate training runs
• Response fitted with
quadratic and kriging models

Mishra - IIT(ISM) 2021 81


Example [4]
Santos et al.
OTC-26275, 2014

• Building prognostic classifier for specific turbogenerator failures during startup
• Data from offshore facility – extraction of fuel-burning related features
• RUSBoost and RF models
• Multi-fold validation approach for evaluation

[Figure: average temperature (deg C) traces and test accuracy on the validation set]
Mishra - IIT(ISM) 2021 82
Example [5]
Arumugam et al.
SPE-184062, 2016

• Processing of daily drilling data to identify drilling anomalies / best practices
  ▪ Information retrieval
  ▪ Conversion to structured data
  ▪ Clustering
  ▪ Pattern identification
  ▪ Knowledge management

[Figure: example clusters of text-mined drilling observations, e.g., "Drill, connections increased, observed excess drag, observed fresh cuttings"; "Drill, maintained ROP/WOB/torque, no caving observed, good hole cleaning, no vibrations observed"; "Drill, tight spot, losses in trip tank, rare cavings, over pulled, observed torque, drags observed"]

Mishra - IIT(ISM) 2021 83


Recap of Lessons Learned

• Problem formulation is important


• Predictive modeling is nuanced
• Multiple competing models may exist
• Data quality/quantity can compromise results
• Unwrapping black-box models is difficult
• Communicating results can be challenging
• Text-based datasets mostly untapped

Mishra - IIT(ISM) 2021 84


Mishra - IIT(ISM) 2021 85
Machine Learning Methods

Software Demo

Mishra - IIT(ISM) 2021 86


R / RATTLE
• R – a freely available language and environment for statistical
computing and graphics which provides a wide variety of statistical
and graphical techniques: linear and nonlinear modelling, statistical
tests, time series analysis, classification, clustering, etc

• https://cran.r-project.org/

• Rattle – a popular GUI for data mining using R. It presents statistical


and visual summaries of data, transforms data so that it can be
readily modelled, builds both unsupervised and supervised machine
learning models from the data, presents the performance of models
graphically, and scores new datasets for deployment into production.

• https://rattle.togaware.com/

Mishra - IIT(ISM) 2021 87


Problem 1 (Regression)
• Data set => Salt_Creek_regress
• 5 well-logs as inputs (log10(LLD), GR, NPHI, RHOB, PEF)
• Permeability as output (ln(Kg))
• Exploratory data analysis (distribution, correlation)
• Modeling (linear, tree, random forest, neural net)
• Validation
• Variable importance

Mishra - IIT(ISM) 2021 88


Problem 2 (Classification)
• Data set => Salt_Creek_classify
• 5 well-logs as inputs (log10(LLD), GR, NPHI, RHOB, PEF)
• Class of permeability as output (high or low)
• Modeling (linear, tree, random forest, SVM, neural net)
• Validation
• Variable importance

Mishra - IIT(ISM) 2021 89


Wrap-up Comments

Mishra - IIT(ISM) 2021 90


Resources

• Mishra, Srikanta; and Akhil Datta-Gupta (2017). Applied Statistical Modeling


and Data Analytics for the Petroleum Geosciences, Elsevier, New York, NY.
• Mohaghegh, Shahab (2017). Data-Driven Reservoir Modeling, Society of
Petroleum Engineers, Richardson, TX.
• Misra, Siddharth; Hao Li; and Jiabo He (2019). Machine Learning for
Subsurface Characterization, Gulf Professional Publishing, Houston, TX.
• Holdaway, Keith (2014). Harness Oil and Gas Big Data with Analytics: Optimize
Exploration and Production with Data-Driven Models, John Wiley & Sons, New
York, NY.
• Hastie, T., R. Tibshirani, and J.H. Friedman, 2008. The Elements of Statistical
Learning: Data Mining, Inference, and Prediction, Springer, New York, NY.
• Kuhn, M. and K. Johnson, 2013, Applied Predictive Modeling, 2013. Springer,
New York, NY.

Mishra - IIT(ISM) 2021 91


Software Resources

• Mishra and Datta-Gupta (2017) companion site – which includes:


(a) GRACE (non-parametric regression), (b) E-FACIES
(multivariate analysis), (c) E-REGRESS (experimental design),
(d) misc R scripts for machine learning, and (e) example datasets
https://www.elsevier.com/books-and-journals/book-companion/9780128032794/software
• R – open source statistical analysis software
• Python – Programming language with built-in libraries
• WEKA – suite of machine learning software in Java
• Commercial packages (MATLAB, SAS, SPSS, …..)

Mishra - IIT(ISM) 2021 92


Combining the Old and the New
[Recap figure with panels for: Exploratory Data Analysis, Regression Modeling, Multivariate Analysis, Machine Learning, Experimental Design, Uncertainty Quantification]

Mishra - IIT(ISM) 2021 93


Recommended Workflow
• Framing the problem
• Checking the data
• Selecting the causal variables
• Picking the software
• Choosing the modeling technique(s)
• Validating the model
• Understanding and communicating the results

Mishra - IIT(ISM) 2021 94


Challenges for Acceptance of ML

• Our ML models are not very good.

• If I don’t understand the model,


how can I believe it?

• We are still waiting for the “Aha” moment!

• My staff need to learn data science, but how?


Mishra et al., 2021, JPT (March)

Mishra - IIT(ISM) 2021 95


Challenges for Acceptance of ML
Poor Model Quality Lack of Understanding
• Consumer marketing ML/AI models • Articulate adequacy of predictors
are not necessarily highly accurate!
• Demonstrate model robustness
• Need to manage expectations re.
quality of fit for subsurface models • Explain inner workings (key variables)
• Focus more on added value from ML • Use creative visualizations
models and complementary role

“Aha” Moment? Learning Data Science


• ML model may or may not produce • Significant (informal) self-learning to
new insights
become “citizen data scientists”
• Provides an alternative quantitative
input-output relationship • Need formal knowledge of
conventional data analysis, python/R
• Useful when physics-based model is
slow, data-intensive or immature programming, and machine learning

Mishra et al., 2021, JPT (March)


Mishra - IIT(ISM) 2021 96
Q1– Which Software Should I Use?
• Commercial visualization tools
▪ Spotfire, Tableau

• “Light and Easy” statistics tools


▪ Excel, JMP, SPSS, Minitab

• Commercial statistics software


▪ SAS, MATLAB, Stata

• Open source statistics software


▪ R/RATTLE, Python

Mishra - IIT(ISM) 2021 97


Q2– Do I Need Machine Learning?
[Comparison figure: linear model fit vs. random forest fit to the same data]

Mishra - IIT(ISM) 2021 98


Q3– Which Technique Works Best?

[Figure: "Power of Ensemble Modeling"]

• No single technique is a consistent best performer
• Often, multiple competing
models have equally good
fits (R2/RMSE)

• Aggregate models  robust


understanding
and predictions
• Pick “Forest” over “Trees”
• Baseline regression model + one(+) tree-based model (e.g.,
RF, GBM) + one(+) non-tree based model (e.g., SVM, ANN)

Mishra - IIT(ISM) 2021 99


Q4– How Much Data Do I Need?
• Large datasets (e.g., N = 2935) can produce poor results if key causal variables are not included in the model

• Robust models can be built with small datasets (e.g., N = 81) if all relevant causal variables are included in the model

Mishra - IIT(ISM) 2021 100


Q5– How Do I Learn Data Science?
• Citizen data scientist/analyst
(one who learns from data)
▪ Basic skills → domain knowledge (e.g., PE)

▪ New skills → statistics/ML, programming

• Core (data science) competencies


▪ Data collection, preparation, exploration

▪ Data storage and retrieval

▪ Computing with data

▪ Applied machine learning

▪ Data visualization/communication

Donoho, J. Comp. Graphical Stat., 2017
https://www.linkedin.com/pulse/new-venn-diagram-data-science-pierluigi-casale/

Mishra - IIT(ISM) 2021 101


Looking Ahead

• Growing trend towards the use of statistical and machine


learning techniques for oil and gas applications
• Goal  “mine” big data and develop data-driven insights
to improve reservoir description & performance prediction
• Broader applications for digital oil field data management,
real-time analytics and predictive maintenance
• Petro-techs need to develop better understanding of
full repertoire of available techniques and their potential
• Data scientists need to understand problem domain to
propose/apply appropriate techniques

Mishra - IIT(ISM) 2021 102


Gartner “Hype” Trajectory

Mishra - IIT(ISM) 2021 103


Final Thoughts

Beware the
hype / manage
expectations

ML comes
after posing
the problem

Don’t forget
the physics

Mishra - IIT(ISM) 2021 104


Dr Srikanta Mishra
Battelle, Columbus, OH
(614) 424-5712
mishras@battelle.org

Mishra - IIT(ISM) 2021 105


Machine Learning

Regression/Classification Techniques
Ch. 8, Mishra and Datta-Gupta (2017)

Mishra - IIT(ISM) 2021 106


Regression Trees (RT)

• Regression trees are simple, interpretive models to


describe how the predictors impact the response
• General idea:
▪ Split the predictor space into nested rectangular regions

▪ Within each region, predict the response with a constant value

▪ Can handle non-linear behavior using arbitrarily small regions

• Why is it called a tree?


▪ Rectangular regions are defined by using a branching structure

▪ Each branch is a split obtained by applying a threshold to the value


of one of the predictors

Mishra - IIT(ISM) 2021 107


RT – Schematic

[Schematic: tree with root split X1 < t1, then X2 < t2 and X2 < t3, giving regions R1, R2, R3, R4; equivalent partition of the (X1, X2) plane at thresholds t1, t2, t3]

Mishra - IIT(ISM) 2021 108


RT – Procedure

• Parameters to be chosen at each split:


▪ Predictor j ; Threshold value s; Predicted response c1 and c2

▪ Find j, s, c1 and c2 to minimize the squared prediction error over the two resulting regions (see the expression below)

• Given j and s, the best prediction within each node is just


the mean response
• Pruning (balance between complexity and fit quality)
▪ Grow a tree to nearly full size

▪ Select the subtree that optimizes a complexity criterion

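A sketch of the split criterion referred to above, in the standard CART form (the expression itself is not shown in the extracted slide):

\[
\min_{j,\,s}\ \Bigg[ \sum_{x_i \in R_1(j,s)} \big(y_i - c_1\big)^2 \;+\; \sum_{x_i \in R_2(j,s)} \big(y_i - c_2\big)^2 \Bigg],
\qquad c_m = \text{mean}\{\, y_i : x_i \in R_m(j,s) \,\}
\]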
Mishra - IIT(ISM) 2021 109


Example – RT Fit

Regression
True Surface Tree

Mishra - IIT(ISM) 2021 110


RT – Pruning

The cost-complexity criterion combines a summation term representing overall node impurity with a penalty term that multiplies a tuning (cost-complexity) parameter by the number of terminal nodes (see the expression below).

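A sketch of the standard cost-complexity pruning expression being described (α is the tuning parameter, |T| the number of terminal nodes):

\[
C_\alpha(T) \;=\; \sum_{m=1}^{|T|} N_m\, Q_m(T) \;+\; \alpha\,|T|
\]

where \(N_m\) is the number of observations in terminal node \(m\) and \(Q_m(T)\) is its impurity (e.g., within-node mean squared error).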
Mishra - IIT(ISM) 2021 111


Different Levels of Pruning

No Pruning More complicated, less pruning

Mishra - IIT(ISM) 2021 112


RT – Pros and Cons

• Advantages
▪ Interpretable

▪ Identifies important predictors and critical values

▪ Easy to fit, fast evaluation

▪ Invariant to monotone transformations of inputs

▪ Resistant to outliers

• Disadvantages
▪ Less accurate than other models

▪ Prone to overfitting (can use pruning to mitigate this)

Mishra - IIT(ISM) 2021 113


Classification Tree Analysis

• Analog of regression trees


• Provides rules that partition output into categories
based on input values

• Useful for determining prediction rules and finding


structure in data

• 2-variable “partition plot” helps visualize separation


of categorical outcomes

• Widely used in medical decision making

Mishra - IIT(ISM) 2021 114


Simple Classification Tree
x1   x2   y
1    3    High
3    1    High
4    5    High
5    4    Low
9    2    Low
11   6    Low
2    7    High
6    8    High
9    9    High
8    10   High
6    11   High
7    12   High

Resulting tree / rules:
• x1 < 4.5 → High
• x1 > 4.5 and x2 > 6.5 → High
• x1 > 4.5 and x2 < 6.5 → Low

Mishra - IIT(ISM) 2021 115


Random Forest Regression (RF)

• Random forest regression uses an ensemble of trees to


increase performance of a single regression tree

• Same data input always yields the same tree – how to


introduce variation?
▪ Fit each tree with only a subset of the data (a bootstrap sample)

▪ Constrain branches to select from a random selection of predictors

▪ Forces trees to view the dataset from multiple perspectives

• Called a “bagging” approach (bootstrap aggregation)

Mishra - IIT(ISM) 2021 116


RF – Schematic

Mishra - IIT(ISM) 2021 117


RF – Procedure

• Prediction
▪ Observation is passed through all of the trees in the ensemble

▪ Each tree produces a regression estimate

▪ Final estimate is an average of those tree-level estimates

• Built-in cross-validation
▪ Since each tree sees only a subset of the data, the remaining
observations are called out-of-bag samples
▪ For that tree, those out-of-bag samples are independent test data

▪ Used to get error rate estimates for gauging model performance

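A short R sketch using the randomForest package (the choice of package and the synthetic data are assumptions, not from the slides):

# Random forest regression with the randomForest package (sketch)
library(randomForest)
set.seed(7)

dat <- data.frame(x1 = runif(200), x2 = runif(200))
dat$y <- sin(2 * pi * dat$x1) + dat$x2^2 + rnorm(200, sd = 0.1)

rf <- randomForest(y ~ x1 + x2, data = dat, ntree = 500, importance = TRUE)
print(rf)                           # includes the out-of-bag (OOB) error estimate
importance(rf)                      # built-in variable importance
predict(rf, newdata = dat[1:5, ])   # average of the tree-level predictions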
Mishra - IIT(ISM) 2021 118


Example – RF Fit

True Surface (100 sampled points) Random Forest – 500 Trees

Mishra - IIT(ISM) 2021 119


RF – Pros and Cons

• Advantages
▪ Can handle highly non-linear behavior

▪ Useful built-in methods (cross-validation, variable importance,


proximity measure, missing data imputation)
▪ Invariant to monotone transformations of inputs

▪ Resistant to outliers

• Disadvantages
▪ Not easily interpretable

▪ Slower to fit, although it is parallelizable

▪ Can have more trouble modeling certain types of surfaces

Mishra - IIT(ISM) 2021 120


RF- Classification
• The random forest classifier is trained the same way as it
was in the regression setting
▪ Now classification trees are used instead of regression trees

• Prediction
▪ Observation is passed through all of the trees in the ensemble

▪ Each tree produces a predicted class label

▪ Final label is most popular vote among the trees

• Out-of-bag samples used to estimate misclassification


rate on independent test data
• Advantages and disadvantages are similar to those in the
regression case

Mishra - IIT(ISM) 2021 121


Gradient Boosting Machines (GBM)

• GBMs are a different kind of ensemble method where


models in the ensemble are trained sequentially
• Each model is trained to overcome weaknesses of
previous models in the sequence
• General procedure:
▪ Begin with a base model F0(x)

▪ Repeat for m = 1, …, M:
− Fit a model hm(x) to the negative gradient of the residuals y – Fm-1(x)
− Let Fm(x) = Fm-1(x) + hm(x)
▪ Make predictions with the final model FM(x)

Mishra - IIT(ISM) 2021 122


GBM – Schematic

Mishra - IIT(ISM) 2021 123


Example – GBM Fit

Random Forest – 500 Trees GBM – 150 Trees

Mishra - IIT(ISM) 2021 124


GBM – Pros and Cons

• Advantages
▪ Invariant under all monotone transformations of the input variables

▪ Robust against presence of irrelevant input variables

▪ Easy handling of missing values

▪ Useful built-in methods, similar to Random Forest

▪ Competitive accuracy

• Disadvantages
▪ Can easily overfit

▪ Can take a while to fit, but there are tricks for speeding this up

▪ Not easily interpretable

Mishra - IIT(ISM) 2021 125


GBM - Classification
• GBMs are easily ported over from regression to the
classification setting
• TreeBoost scenario:
▪ Rather than fitting a single model at each step, there are K trees,
one for each group
▪ Negative gradient is based on multinomial deviance, rather than
regression residuals
• Advantages and disadvantages are the same as in the
regression setting
• One of the better models out there for regression or
classification

Mishra - IIT(ISM) 2021 126


Support Vector Regression (SVR)
• SVR machines are linear models with ε-insensitive loss
• Errors within ε-tube are ignored
• vi are the training inputs
• w is the vector of parameters
▪ Goal is to minimize

▪ Alternate formulation:

• Kernel “trick” to handle non-linearities


Mishra - IIT(ISM) 2021 127
SVR – Kernel Trick
• The “kernel trick”
▪ The model is only specified through the dot product of the support
vectors and the predictors (vitx)
▪ The dot product can be replaced by any kernel function

Polynomial

Gaussian

Exponential

Hyperbolic Tangent

▪ Using kernels like the ones above can produce regression fits to
non-linear surfaces
Mishra - IIT(ISM) 2021 128
Kernel Trick - Schematic

Mishra - IIT(ISM) 2021 129


SVR – Pros and Cons

• Advantages
▪ Can capture non-linear behavior
− Kernel function allows adaptability to many situations
▪ Accurate predictor compared to most methods

• Disadvantages
▪ Not easily interpretable

▪ Prediction requires storage of training data

▪ Can be influenced by outliers

Mishra - IIT(ISM) 2021 130


Support Vector Machines
• SVMs [1-2] define a hyperplane that separates two classes
  ▪ The hyperplane β0 + xᵗβ = 0 maximizes the margin between the classes (margin boundaries β0 + xᵗβ = ±1; Group A: Y = 1, Group B: Y = -1)
  ▪ Prediction is made using the sign of the hyperplane equation
  ▪ β is a linear combination of vectors lying exactly on the margin
    − These are called support vectors
• The kernel trick works here as it does in SVR machines

1. Vapnik V., The Nature of Statistical Learning Theory, Springer, 2000.
2. Hastie, T., R. Tibshirani, and J.H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, 2008, Springer.

Mishra - IIT(ISM) 2021 131


Neural Networks (NN)

• Neural Nets mimic how the


human brain works
• Neurons collect electrical
impulses from surrounding
cells (Σ)
• These impulses are combined
in a non-linear fashion (σ)
• Electrical impulses are
attenuated by the strengths of
the interconnections between
neurons (weights, bias)

Mishra - IIT(ISM) 2021 132


NN – Procedure

• Each hidden unit (neuron) is a non-linear function of


weighted linear combinations of the inputs
▪ σ is the activation function

▪ Most commonly used


function today is a sigmoid

• Outputs fk(x) are different non-linear functions of weighted


linear combinations of the hidden units
• gk is the output function

Mishra - IIT(ISM) 2021 133


Building the NN Model

• How many parameters?


▪ In a fully connected network
− p+1 parameters for each of M1 units in the first hidden layer
− M1+1 parameters for each of M2 units in the second hidden layer
− Continue until the last layer, with K outputs
▪ Example
− p = 4 inputs
− Two hidden layers (M1 = 5, M2 = 3)
− K = 2 outputs
− That is 5(4+1) + 3(5+1) + 2(3+1) =
25 + 18 + 8 = 51 parameters

Mishra - IIT(ISM) 2021 134
