• univariate or multivariate
• categorical or continuous
• real-valued (numerical) or not real-valued
• observational or experimental
1. Geostatistical data
2. Lattice data
3. Spatial point patterns
A simple definition
Statistics:
• There are three kinds of lies: lies, damn lies and statistics.
(Benjamin Disraeli, 1804-1881)
Why is Statistics Important?
• Statistics is part of the quantitative approach to knowledge
In the past, geology has been qualitative, but it is now becoming increasingly quantitative.
Statistics can be used to quantify data, but oftentimes statistics are ignored or
misrepresented
• In the “old days” geologists would use more observation skills. “This looks like granite”
• Methods of statistical data analysis are required to retrieve information from the set of
numbers in the computer
“Explosion” of Observations
Bivariate Plots
(plots having, or relating
to, two variables)
• Are these fields distinct? Are trends significant?
• The use of graphs is a helpful tool for getting a first grasp of the information
content in the data
• However, the following questions remain:
- What do all of these show?
- Is there any critical analysis of these curves and fields?
• We need a better answer than “They look like it so they must be different”
• We need to employ more “rigorous” scientific methods to geologic
problems
- hypothesis testing
• Interpretation:
You are most likely to be familiar with this branch of statistics, because many examples arise in
everyday life.
Descriptive statistics form the basis for analysis and discussion in many fields.
Inferential Statistics
Examples:
- A survey of 2001 full- or part-time workers ages 50 to 70, conducted by the American
Association of Retired Persons (AARP), found that 70% of those polled planned to work past
the traditional mid-60s retirement age.
• This statistic could be used to draw conclusions about the population of all workers ages 50 to 70.
- Or, more technically: “The mean adjustment of any set of GPS points used for orthorectification is
no less than 4.3 m and no more than 6.1 m; this statement has a 5% probability of being wrong”
• Interpretation:
- If you use inferential statistics, you start with a hypothesis and look to see whether the data are
consistent with this hypothesis.
- Inferential statistical methods can be easily misapplied or misconstrued, and many methods require
the use of a calculator or computer.
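The AARP example above can be turned into a small inferential calculation. Below is a hedged sketch that uses a normal-approximation 95% confidence interval for a sample proportion; the resulting interval bounds are illustrative and do not appear in the source:

```python
import math

# Survey result from the text: 70% of n = 2001 sampled workers
# plan to work past the mid-60s retirement age.
n = 2001
p_hat = 0.70

# Standard error of a sample proportion, and a 95% normal-approximation
# confidence interval (z = 1.96) for the population proportion.
se = math.sqrt(p_hat * (1 - p_hat) / n)
lower = p_hat - 1.96 * se
upper = p_hat + 1.96 * se
print(f"95% CI for the population proportion: ({lower:.3f}, {upper:.3f})")
```

With a sample this large, the interval is narrow (roughly 68% to 72%), which is why the survey statistic can reasonably be generalized to the whole population of workers in that age range.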
What is Variance?
• Variance measures how a set of data values for a variable fluctuate around the
mean of that variable.
• Variance is the natural error, scatter, or variability in measurements; it can be
thought of as the natural spread of the data
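The definition above can be made concrete with a few lines of Python. The porosity values here are hypothetical and used only to show the calculation:

```python
import statistics

# Hypothetical porosity measurements (%) from a single well.
porosity = [12.1, 13.4, 11.8, 12.9, 13.1, 12.5]

mean = statistics.mean(porosity)
# Sample variance: average squared deviation from the mean (n - 1 divisor).
var = statistics.variance(porosity)
# Standard deviation: square root of the variance, in the same units as the data.
sd = statistics.stdev(porosity)

print(f"mean = {mean:.2f}, variance = {var:.3f}, std dev = {sd:.3f}")
```

The smaller the variance, the more tightly the measurements cluster around their mean.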
2. Bivariate
– two variables x versus y
There is a trade-off between how confident you can be about a conclusion and
how applicable a statistical test is to your data.
1. Parametric
- greatest degree of confidence, but assumes a distribution or model
- examples: t-test, F-test, ANOVA, and principal component analysis, which require a
“normal” distribution
2. Nonparametric
- not based on a model; lesser degree of confidence
- Let the data speak for themselves
– pebbles on a beach
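The trade-off between the two families can be illustrated with two summary statistics: the mean, which parametric tests build on, and the median, a nonparametric summary. The pebble sizes below are hypothetical:

```python
import statistics

# Hypothetical pebble-size measurements (mm); one outlier is added to
# show why model-free summaries can be safer when no distribution is assumed.
sizes = [10, 12, 11, 13, 12, 11, 12]
with_outlier = sizes + [95]  # a single anomalous pebble

# The mean shifts sharply with one outlier; the median barely moves.
print(statistics.mean(sizes), statistics.mean(with_outlier))
print(statistics.median(sizes), statistics.median(with_outlier))
```

If the data really are normal, the mean (and tests built on it) extracts more information; if they are not, the nonparametric summary is the one that "lets the data speak for themselves."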
Accuracy
• How close you are to the actual value
• The degree of perfection in measurement
• relates to the quality of the result
INFORMATION LIFECYCLE
Enable data-driven decisions
Achieve optimal technical, operational, and economic performance across the E&P Lifecycle
Exploration Lifecycle Management
Reservoir Lifecycle Management
Drilling Lifecycle Management
Volume Rendering
Multivariate Data Analysis
• Advanced analytical tools for examining relationships among multiple variables
at the same time.
• Analyze multiple variables
• Enhance sweet spot identification
• Optimize well placement
• Easy to visualize results in the form of a map or 3D model, e.g. quality indicator
Multivariate analysis comprises several advanced techniques for examining relationships among multiple variables at the same time. It is assumed that a given outcome of interest is affected or
influenced by more than one thing.
Multivariate analytic techniques have been used for many years in other industries, such as pharmaceuticals, medicine, and the stock market.
In O&G the application of multivariate analysis is relatively new, driven mainly by the emergence of unconventional resources, more specifically shale reservoirs, where good production targets
are a function of multiple variables belonging to the following categories: rock physics, geochemistry, geological location, and engineering parameters.
The main benefit of multivariate data analysis is the ability to easily and effectively combine multiple independent variables into a single “super variable” carrying all the information of the input
data, classifying the samples into good, average, and bad areas to drill. The final result can be visualized in maps, 3D grids, or cross sections, from which the asset team can plan their next wells.
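As a minimal sketch of the "super variable" idea, the snippet below standardizes several hypothetical well attributes and averages them into one drill-quality score per sample. The variables, weights, and values are all invented for illustration; a real workflow would apply a technique such as PCA to actual well data:

```python
import statistics

# Hypothetical well attributes (names and values are illustrative only).
samples = {
    "well_A": {"toc": 4.1, "porosity": 8.2, "brittleness": 0.55},
    "well_B": {"toc": 2.0, "porosity": 5.1, "brittleness": 0.40},
    "well_C": {"toc": 5.0, "porosity": 9.0, "brittleness": 0.62},
}
features = ["toc", "porosity", "brittleness"]

# Standardize each feature (z-score) so variables on different scales
# contribute comparably, then average into one score per sample.
feat_stats = {f: (statistics.mean(s[f] for s in samples.values()),
                  statistics.stdev(s[f] for s in samples.values()))
              for f in features}
scores = {
    name: statistics.mean((vals[f] - feat_stats[f][0]) / feat_stats[f][1]
                          for f in features)
    for name, vals in samples.items()
}

# Rank wells from best to worst candidate area.
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 2))
```

The single ranked score plays the role of the "super variable": it compresses the multivariate input into one quantity that can be mapped or gridded for well planning.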
Steps to answer the question about subsurface
What do we sample?
Sampling Types of Measures
Types of Measures
Sampling Representatively
Sampling Bias
Goal of Sampling and Statistics Example
Variable / Feature
Population and Sample
Parameters and Statistics
Predictor and Response Features
Inference
Prediction
Deterministic and Statistical Model
Deterministic model
Statistical model
Uncertainty
Reservoir modeling
Transfer Function
Exploratory data analysis (EDA):
• The following data are 26 acidity samples of precipitation from Allegheny County,
Pennsylvania, collected in 1973–1974. The data are pH values; pH 7 is neutral, and the
lower the value, the more acidic the precipitation. These data could bear on the
theory that air pollution makes rainfall more acidic.
Four histograms of the same data
A better approach is a stem and leaf diagram
Let the data do the talking:
1. Look at data, find maximum and minimum
5.78 – Max
4.12 – Min
1.66 – Difference
2. Choose a suitable pair of adjacent digit positions so you have 10 – 20 groups and split each data
value between the positions. For the pH data, split 4.57 into 4.5 7
The 45 are the leading digits and the 7 is the trailing digit.
3. Write down a column of all the possible sets of leading digits in order from lowest to highest.
These are the stems. We have a stem spacing of 0.1 pH units
4. For each data value, write down the first trailing digit on the line labeled by its leading digits.
These are the leaves, one leaf for each data value.
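The four steps above can be sketched in Python. Only a few of the 26 Allegheny County pH values appear in the text, so the list below is a hypothetical stand-in used to show the mechanics:

```python
from collections import defaultdict

# Hypothetical pH values standing in for the 26-sample batch.
ph = [4.12, 4.31, 4.52, 4.56, 4.57, 4.82, 4.45, 4.95, 5.10, 5.78]

# Step 2: split each value, e.g. 4.57 -> leading digits "45" (the stem)
# and trailing digit "7" (the leaf); stem spacing is 0.1 pH units.
leaves = defaultdict(list)
for x in sorted(ph):
    stem, leaf = divmod(round(x * 100), 10)
    leaves[stem].append(leaf)

# Steps 3-4: write every possible stem in order, with its leaves.
for stem in range(min(leaves), max(leaves) + 1):
    print(f"{stem} | {''.join(str(d) for d in leaves[stem])}")
```

Because every stem between the minimum and maximum is printed, even when empty, the diagram doubles as a crude histogram of the batch.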
Yes, but this ignores the fact that the student with the 89 is probably as deserving of an A
as the student with 90.
Approach 2: start with a stem and leaf or two
1. Order the data from lowest to highest. The stem and leaf is good for this.
2. Determine how far each value is from the low or high end of the batch. This
is called the depth of each value.
The “deepest” data value is called the median. It marks the middle of the batch
depth of median: d(M) = (n + 1)/2
In our example, n = 26, so d(M) = (26 + 1)/2 = 13.5 (the average of the 13th and 14th
data values, which are 4.52 and 4.56), so the MEDIAN is 4.54
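The depth rule for the median can be written directly as code. The six pH values below are a hypothetical ordered subset chosen so that the two middle values match the 4.52 and 4.56 quoted in the text:

```python
# Depth-based median, following d(M) = (n + 1) / 2.
data = sorted([4.12, 4.31, 4.52, 4.56, 4.82, 5.78])  # n = 6
n = len(data)

d_m = (n + 1) / 2  # depth of the median
if d_m == int(d_m):
    # Whole-number depth: a single middle value is the median.
    median = data[int(d_m) - 1]
else:
    # Half-depth (n even): average the two values straddling the middle.
    lo, hi = data[int(d_m) - 1], data[int(d_m)]
    median = (lo + hi) / 2

print(median)
```

Here d(M) = 3.5, so the median is the average of the 3rd and 4th ordered values, (4.52 + 4.56)/2 = 4.54, matching the value found for the full batch.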
The median divides the data into half. Similarly we can divide each half of the data
into halves, and that division is called a hinge (H).
Hinges (H): middle of each half of the data
d(H) = ([d(M)] + 1)/2, where [ ] means the integer part (truncate the value, so 13.5
becomes 13)
Other measures sometimes used are quartiles: d(quartile) = (n + 1)/4
Hinges and quartiles are not the same. For any batch there are two hinges, an upper
hinge and a lower hinge.
For our data,
d(H) = ([13.5] + 1)/2 = 14/2 = 7
Upper hinge = 4.82
Lower hinge = 4.31
On the other hand, quartiles
d(quartile) = (26+1)/4 = 27/4 = 6.75, which implies some tricky interpolation
between the depth 6 and depth 7 values. Most computer programs use quartiles
because they are easier for computers to handle.
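The depth formulas above can be checked numerically for the n = 26 pH batch (only the depths are computed here, since the full data set is not reproduced in the text):

```python
# Depths for the n = 26 batch, following the text's formulas.
n = 26
d_m = (n + 1) / 2          # depth of median: 13.5
d_h = (int(d_m) + 1) / 2   # hinge depth d(H) = ([d(M)] + 1)/2 = (13 + 1)/2
d_q = (n + 1) / 4          # quartile depth: fractional, needs interpolation

print(f"d(M) = {d_m}, d(H) = {d_h}, d(quartile) = {d_q}")
```

The hinge depth comes out a whole number (7), so the hinges are actual data values (4.31 and 4.82), while the quartile depth (6.75) falls between depths 6 and 7 and forces interpolation, which is exactly why the two measures differ.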
Similarly we can divide the first and last quarters of the data into half, called
eighths
d(E) = ([d(H)] +1)/2
= (7+1)/2 = 4
Beyond eighths are letters D, C, B, A, Z, Y, ...
The extremes are the largest and smallest values and
do not have letters: depth = 1
The letter values are summarized on a table:
The Mid values are the averages of the upper and lower values, and are useful
in examining the symmetry of the data. Note how the midpoint increases with
decreasing depth.
The spread is the difference between the upper and lower values and shows
something about the spread or variability of the data.
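Using the median, hinges, and extremes quoted earlier (4.54; 4.31/4.82; 4.12/5.78), the mid and spread at each letter value can be computed directly; note the mids increasing as depth decreases, a sign of right-skew in the batch:

```python
# Letter values quoted in the text: (lower, upper) at each depth.
letter_values = {
    "M": (4.54, 4.54),          # median, depth 13.5
    "H": (4.31, 4.82),          # hinges, depth 7
    "extremes": (4.12, 5.78),   # extremes, depth 1
}

for name, (low, high) in letter_values.items():
    mid = (low + high) / 2   # mids drifting away from the median signal skew
    spread = high - low      # a resistant measure of variability
    print(f"{name}: mid = {mid:.3f}, spread = {spread:.2f}")
```

For a perfectly symmetric batch every mid would equal the median; here the mids climb from 4.54 to 4.95, so the upper tail of the pH data is stretched out.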