Lecture Slide Research Methodology
The goal of statistics
- Statistics is an area of applied mathematics concerned with data collection, analysis, interpretation, and presentation.
- It is used in practically every scientific discipline.
- It discovers patterns and relationships in data and infers useful insights and knowledge from them.
- You always deal with data, right?
Basic terminologies in statistics (and probabilities)
- Random variable: a quantity having a numerical value for each member of a group, especially one whose values occur according to a frequency distribution.
- Probability: the chance that an event will or will not occur.
- A probability space (or probability triple (Ω, F, P)) is a mathematical construct that models a real-world process (or “experiment”) consisting of states that occur randomly.
Descriptive statistics
Measures of central tendency (three kinds of “averages”):
As the name suggests, these give you an impression of the central tendency of your data:
● Mean: add up all the numbers, then divide by the count of numbers.
● Median: the “middle” value in the sorted list of numbers.
● Mode: the value that occurs most often.
Measures of variability:
These give you information on how far your data deviate from the center:
● Standard deviation / variance.
● Maximum / minimum.
● Interquartile range (IQR).
Probability distribution (1)
● A probability distribution is a mathematical function that provides the probabilities
of occurrence of different possible outcomes in an experiment.
● In more technical terms, the probability distribution is a description of a random
phenomenon in terms of the probabilities of events.
● Knowing the shape of the distribution is crucial to the use of statistical methods in
research analysis.
● This is primarily because most methods make specific assumptions about the nature of the distribution curve.
Probability distribution (2)
● The probability that a discrete random variable X takes on a particular value x, that is, P(X = x), is frequently denoted f(x).
● For a continuous random variable, the probability distribution is commonly termed the probability density function.
Central tendency: A simple example
Sample: 42 20 79 95 13 3 84 70 67 7 74 39 65 17 70
Sorted: 3 13 17 20 39 42 65 67 70 70 70 74 79 84 85 95
Mean (arithmetic average) -> Expected value
● Mean = 55.8125
● Median = 68.5
● Mode = 70
● Min = 3
● Max = 95
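The figures above can be reproduced with Python's standard `statistics` module; a minimal sketch using the sorted sample from this slide:

```python
import statistics

data = [3, 13, 17, 20, 39, 42, 65, 67, 70, 70, 70, 74, 79, 84, 85, 95]

print(statistics.mean(data))    # 55.8125
print(statistics.median(data))  # 68.5
print(statistics.mode(data))    # 70
print(min(data), max(data))     # 3 95
```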
Case of mean and median
Sample: 10 20 30 10 50 10 70 50 80 60 100 100 5000
● Mean = 430
● Median = 50
The mean and median can give you different impressions of the data. Such differences might be due to:
● The presence of outliers.
● Incorrect/wrong sampling or other errors.
● A non-robust method (in the context of algorithm design).
In practice, it is better to observe both the mean and the median.
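A quick sketch of how the single extreme value drives the mean while the median stays put, using Python's `statistics` module:

```python
import statistics

data = [10, 20, 30, 10, 50, 10, 70, 50, 80, 60, 100, 100, 5000]
print(statistics.mean(data))    # 430
print(statistics.median(data))  # 50

# Removing the extreme value barely moves the median but shifts the mean a lot:
trimmed = [x for x in data if x != 5000]
print(statistics.median(trimmed))       # 50
print(statistics.mean(trimmed) < 100)   # True
```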
Measures of variability: range (1)
● Range is defined as the difference between the values of the extreme items
of a series.
● Pros: It gives an idea of the variability very quickly.
● Cons: the range is affected very greatly by sampling fluctuation; its value is never stable.
● It is typically used only as a rough measure of variability.
Measures of variability: range (2)
42 20 79 95 13 3 84 70 67 7
74 39 65 17 70
Range = (95-3) = 92
Measures of variability: range (3)
Sample: 10 20 30 10 50 10 70 50 80 60 100 100 5000
● Mean = 430
● Median = 50
● Range = 5000 - 10 = 4990
Measures of variability: Standard deviation
● The sample standard deviation measures the spread of the data around the mean: s = sqrt( Σ (xᵢ - x̄)² / (n - 1) ).
Why use squared values in standard deviation?
● Squaring makes every deviation positive, so deviations above and below the mean do not cancel, and it penalizes large deviations more heavily.
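A minimal sketch of the sample standard deviation and variance using Python's `statistics` module, with the sample from the earlier central-tendency slide:

```python
import statistics

data = [42, 20, 79, 95, 13, 3, 84, 70, 67, 7, 74, 39, 65, 17, 70]

s = statistics.stdev(data)       # sample standard deviation (n - 1 denominator)
var = statistics.variance(data)  # sample variance
print(s, var)
# The standard deviation is just the square root of the variance:
print(abs(s ** 2 - var) < 1e-9)  # True
```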
Measures of asymmetry: Skewness
● Skewness measures the asymmetry of a distribution about its mean.
● [Figure: two distributions, both with mean = 0 and std. dev. = 1; one with skew = 0.7 (right-skewed) and one with skew = -0.7 (left-skewed).]
Standardized moments
● The k-th standardized moment is the k-th central moment divided by σᵏ; skewness is the third standardized moment.
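Skewness (the third standardized moment) can be computed directly from its definition; a small sketch in plain Python:

```python
import statistics

def skewness(xs):
    # Third standardized moment: E[(x - mean)^3] / sigma^3 (population form).
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)

print(skewness([1, 2, 3, 4, 5]))       # 0.0 (symmetric data)
print(skewness([1, 1, 1, 2, 10]) > 0)  # True (long tail to the right)
```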
Intermezzo: quantile
● By a quantile, we mean the fraction (or percent) of points below the given value. That is, the 0.3 (or 30%) quantile is the point at which 30% of the data fall below and 70% fall above that value.
● For example, if the 0.8 quantile of the examination scores for subject A is 40, it means that 80% of the students obtained scores lower than 40.
Intermezzo: quartile
● Quartile is one form of quantile. Quartiles are the values that divide a list of numbers
into quarters:
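Python's `statistics.quantiles` (Python 3.8+) splits data into equal-probability intervals; with n=4 it returns the quartiles. A sketch with the sample from the central-tendency slide:

```python
import statistics

data = [3, 13, 17, 20, 39, 42, 65, 67, 70, 70, 70, 74, 79, 84, 85, 95]
q1, q2, q3 = statistics.quantiles(data, n=4)  # the three quartile cut points
print(q1, q2, q3)  # q2 is the median, 68.5
```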
Measures of variability: interquartile range (1)
● The Interquartile Range (IQR) may be used to characterize the data when
there may be extremities that skew the data.
● The interquartile range is a relatively robust (also sometimes called “resistant”) statistic compared to the range and standard deviation.
IQR = Q3 - Q1
Outliers (1)
● A data point on a graph or in a set of results that is very much bigger or smaller than the next nearest data point.
● Outliers can cause serious problems in statistical analyses (depending on the context).
● An outlier can mean one of two things: (1) measurement error, or (2) a heavy-tailed distribution.
● Now the question is: how do we detect outliers?
Outliers (2)
● There is no single universally accepted outlier detection method.
● There are some useful methods; one of them is Tukey’s fences.
● Tukey’s fences treat data points as outliers when they fall outside the range [Q1 - k·IQR, Q3 + k·IQR], commonly with k = 1.5.
Outliers (3)
● An example of outlier detection using Tukey’s fences.
● Consider the following scores from mid-term exams:
21 22 33 10 8 9 10 4 30 20 30 10 20 17 8 9 10 22 11 2 30 20 22 33 34 37 21 20 21 23 11 21 13 8 6 9 100 90 30 70
● Q1 = 10
● Q3 = 30
● IQR = 20
● The lower bound for outlier detection with k = 1.5 is 10 - 1.5 × 20 = -20.
● The upper bound for outlier detection with k = 1.5 is 30 + 1.5 × 20 = 60.
● Thus, the students with scores of 100, 90, and 70 are outliers.
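The worked example above can be sketched in Python; exact Q1/Q3 values can differ slightly between quantile conventions, but `statistics.quantiles` reproduces the slide's values here:

```python
import statistics

scores = [21, 22, 33, 10, 8, 9, 10, 4, 30, 20, 30, 10, 20, 17, 8, 9,
          10, 22, 11, 2, 30, 20, 22, 33, 34, 37, 21, 20, 21, 23,
          11, 21, 13, 8, 6, 9, 100, 90, 30, 70]

q1, _, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1
k = 1.5
lower, upper = q1 - k * iqr, q3 + k * iqr   # Tukey's fences
outliers = [s for s in scores if s < lower or s > upper]
print(lower, upper)  # -20.0 60.0
print(outliers)      # [100, 90, 70]
```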
Measures of relationship
Quiz: How would you describe the following relationship?
Correlation and dependence
● Statistical dependence or association describes a statistical relationship between two random variables.
● The relationship does not indicate causality; it describes correlation.
● It is a powerful tool when you want to analyze whether there is a clear relationship between two variables.
● The most widely used measure is Pearson’s correlation coefficient.
● How about cases with more than two variables?
Correlation matrix
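For more than two variables, the pairwise Pearson coefficients are collected in a correlation matrix; a sketch assuming NumPy is available (hypothetical data, rows are observations and columns are variables):

```python
import numpy as np

data = np.array([[1.0, 2.0, 10.0],
                 [2.0, 4.1, 8.0],
                 [3.0, 6.2, 6.5],
                 [4.0, 7.9, 4.0]])

corr = np.corrcoef(data, rowvar=False)  # 3x3 matrix of pairwise Pearson r
print(corr.round(2))  # diagonal is 1; off-diagonals are the pairwise correlations
```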
Presenting your results
Presenting your results: Histogram
● A histogram is a plot that lets you discover, and show, the underlying frequency distribution
(shape) of a set of data. This allows the inspection of the data for its underlying distribution (e.g.,
normal distribution), outliers, skewness, etc.
● A histogram is very useful if you want to visualize the general distribution of your data.
Histogram: Choosing the number of bins
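A sketch of building a histogram numerically, assuming NumPy is available; `np.histogram_bin_edges` also implements common bin-count rules such as Sturges' rule:

```python
import numpy as np

data = np.random.default_rng(0).normal(loc=0.0, scale=1.0, size=500)

counts, edges = np.histogram(data, bins=10)  # frequencies in 10 equal-width bins
print(counts.sum())  # 500: every sample falls into some bin

sturges_edges = np.histogram_bin_edges(data, bins="sturges")
print(len(sturges_edges) - 1)  # bin count suggested by Sturges' rule
```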
Histogram vs PDF plot
● A histogram is the empirical distribution of your sample.
● A PDF is used if you want to describe the hypothesized underlying distribution (normal? exponential?).
● A histogram is usually shown with frequencies on the y-axis, while for a PDF the density is shown (note that the right figure is only for illustrative purposes).
Presenting your results: Barplot
● The major difference is that a histogram is only used to plot the frequency of score
occurrences in a continuous data set that has been divided into classes, called bins. Bar
charts, on the other hand, can be used for a great deal of other types of variables.
● A barplot is very useful if you want to visualize data such as sales or sensitivity indices, to name a few.
Presenting your results: Barplot
● One common mistake: using a continuous (line) plot when the data should actually be visualized with a barplot (i.e., when the x-axis is a categorical variable).
Presenting your results: Scatterplot (2)
● A scatter plot can also be used for outlier detection.
Presenting your results: Boxplot
● Displaying the distribution of data based on a five number summary (remember
quartiles): Minimum, first quartile, median, third quartile, and maximum.
● A boxplot gives you a good indication of how your data are clustered and spread out.
● Another name for the boxplot is the “box and whiskers” plot.
● The whiskers can indicate: ±1.5 IQR; the minimum and maximum of the data; one standard deviation above and below the mean; or the 9th and 91st percentiles.
Presenting your results: Pie chart
● A pie chart is useful when you have data such as preferences, sensitivity indices, or grades, to name a few.
Presenting your results: Heat map
● A heat map is useful for visualizing data with arbitrary borders (e.g., geospatial data, CFD/FEM results).
Presenting your results: Contour map.
● A contour map visualizes data by creating multiple lines that share the same value.
● A contour map can be shown with or without colours.
● On some occasions, a contour map can be combined with a heat map.
Tips: Use publication-quality figures, please.
Statistical hypothesis test
Statistical hypothesis
● Let’s talk about hypotheses first.
● “A hypothesis is a formal question that the researcher intends to resolve.”
● Quite often a research hypothesis is a predictive statement, capable of being tested by scientific methods, that relates an independent variable to some dependent variable.
Statistical hypothesis
● A hypothesis test is a rule that specifies whether to accept or reject a claim about a population depending on the evidence provided by a sample of data.
● A hypothesis test examines two opposing hypotheses about a population: the null hypothesis and the alternative hypothesis.
● The null hypothesis is the statement being tested. Usually the null hypothesis is a statement of “no effect” or “no difference”.
● The alternative hypothesis is the statement you want to be able to conclude is true based on the evidence provided by the sample data.
The use of statistical hypothesis
Case: we develop method A so that we can obtain superior results compared to method B (let’s say, an optimization method).
● Null hypothesis (H0): “We are to compare method A with method B regarding its superiority, and we proceed on the assumption that both methods are equally good.”
● Alternative hypothesis (H1): “Method A is superior, or method B is inferior.”
● The alternative hypothesis is usually the one that one wishes to prove.
● The null hypothesis is the one that one wishes to disprove (the one we are trying to reject).
● How to assess this? Compare the p-value against the level of significance.
The use of statistical hypothesis
● Based on the sample data, the test determines whether to reject the null hypothesis.
● You use a p-value to make the determination. If the p-value is less than the significance level (denoted as α or alpha), then you can reject the null hypothesis.
● The most common significance level is 0.05: we reject H0 when the probability of observing data at least as extreme as ours, assuming H0 is true, is below 0.05.
● In other words, a 5 percent level of significance means that the researcher is willing to take up to a 5 percent risk of rejecting the null hypothesis when it is actually true.
● There is now a movement advocating that the threshold should be 0.005.
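A minimal sketch of computing a two-sided p-value for a one-sample z-test with only the standard library (hypothetical numbers: sample mean 52, hypothesized mean 50, known sigma 10, n = 100):

```python
import math

def z_test_p_value(sample_mean, mu0, sigma, n):
    # One-sample z-test: H0 says the population mean is mu0 (sigma known).
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF
    return z, 2 * (1 - phi)                            # two-sided p-value

z, p = z_test_p_value(sample_mean=52.0, mu0=50.0, sigma=10.0, n=100)
print(round(z, 2), round(p, 4))  # 2.0 0.0455 -> p < 0.05, reject H0
```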
Type I versus type II error
● In the context of testing of hypotheses, there are two types of errors we can make:
● We can reject H0 when H0 is true (Type I error), with probability denoted by α.
● We may accept H0 when in fact H0 is not true (Type II error), with probability denoted by β.
● The α error is understood as the level of significance of the hypothesis test.
● Let’s say the α error is fixed at 5 percent; this means there are about 5/100 chances that we will reject H0 when H0 is true.
Two-tailed and one-tailed tests
● A two-tailed test rejects the null hypothesis if, say, the sample mean is significantly higher or lower than the hypothesized mean of the population.
Parametric vs non-parametric test
● In a parametric test, we assume that the data follow a specific distribution (typically normal).
● In a non-parametric test, we do not assume that the data follow a specific distribution (a distribution-free test).
● However, parametric tests can also be used for non-normal data under some requirements (e.g. a sufficient sample size).
Non-parametric tests
● Mann-Whitney U or Wilcoxon rank-sum test: tests whether two samples are drawn from the same distribution, as compared to a given alternative hypothesis.
● McNemar's test: tests whether, in 2 × 2 contingency tables with a dichotomous trait and
matched pairs of subjects, row and column marginal frequencies are equal
● Median test: tests whether two samples are drawn from distributions with equal medians
● Pitman's permutation test: a statistical significance test that yields exact p values by
examining all possible rearrangements of labels
● Rank products: detects differentially expressed genes in replicated microarray experiments
● Siegel–Tukey test: tests for differences in scale between two groups
● Sign test: tests whether matched pair samples are drawn from distributions with equal
medians.
● And many more...
Parametric tests
● Z-test.
Based on the normal distribution and is used for judging several
statistical measures (particularly the mean).
● t-test.
Based on t-distribution and is used for judging the significance of a
sample mean or the difference between two means of two samples in case
of small samples.
● Chi-square test.
Based on chi-square distribution and is used for comparing a sample
variance to a theoretical population variance.
● F-test.
Based on the F-distribution and is used to compare the variances of two independent samples.
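As an illustration of the t-test idea, here is a sketch of Welch's two-sample t statistic in plain Python (hypothetical accuracy scores for two methods; a library such as `scipy.stats.ttest_ind` would also return the p-value):

```python
import math
import statistics

def welch_t(a, b):
    # Welch's t statistic for two independent samples (unequal variances).
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / se

method_a = [0.91, 0.88, 0.93, 0.90, 0.89, 0.92]  # hypothetical scores
method_b = [0.85, 0.83, 0.88, 0.84, 0.86, 0.82]
print(round(welch_t(method_a, method_b), 2))  # 5.0: a large separation of the means
```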
Case 1: Hypothesis testing of means
Statistical hypothesis testing, illustration 1
● In this problem, we would like to know whether these 400 male students represent the general population.
● The first step is to formulate the hypothesis. The null hypothesis is that the mean height of the population is equal to 67.39 inches.
● Given the information, we assume a normal distribution and use a z-test with the standard error of the mean; the result is to accept H0.
Statistical hypothesis testing, illustration 2: the computation leads us to accept H0.
Statistical hypothesis testing, illustration 3: the computation leads us to reject H0.
Case 2: Hypothesis testing for differences between means
● Used when we are interested in knowing whether the parameters of two
populations are alike or different.
● Example: In company A, do female workers earn less than male workers
for the same job?
● The null hypothesis is stated as H0 : μ1 = μ2.
Statistical hypothesis testing, illustration 4: the computation leads us to reject H0.
Statistical hypothesis testing, illustration 5: the computation leads us to reject H0.
Case 3: Hypothesis testing of proportions
● In the case of qualitative phenomena, we have data based on the presence or absence of an attribute.
● With such data, the sampling distribution may take the form of a binomial probability distribution whose mean equals n·p and whose standard deviation equals sqrt(n·p·q), where p represents the probability of success, q = 1 - p the probability of failure, and n the size of the sample.
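Under the binomial model above, the test statistic for a sample proportion is a z score; a minimal stdlib sketch (hypothetical numbers: 60 successes out of 100 trials against H0: p = 0.5):

```python
import math

def proportion_z(successes, n, p0):
    # z statistic for H0: the true proportion equals p0.
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)  # std. deviation of p_hat under H0
    return (p_hat - p0) / se

print(round(proportion_z(successes=60, n=100, p0=0.5), 2))  # 2.0
```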
Statistical hypothesis testing, illustration 7: the computation leads us to reject H0.
Statistical hypothesis testing, illustration 8: the computation leads us to accept H0.
Other cases
Non-parametric hypothesis test, an example
1. A popular non-parametric test to compare outcomes between two independent groups is the Mann-Whitney U test.
2. The Mann-Whitney U test tests whether two samples are likely to derive from the same population (i.e., that the two populations have the same shape).
3. Recall that the parametric test compares the means (H0: μ1 = μ2) between independent groups.
4. It uses the U value, i.e., a measurement of the difference between the ranked observations of the two samples.
Non-parametric hypothesis test, an example
● Restaurant A received Google Maps review scores of [5 4 4 3 3 2 5 5 4 4 3 3].
● Restaurant B received Google Maps review scores of [4 5 3 3 2 5 4 2 2 3 1 1].
● Is the difference statistically significant at the 10% level of significance?
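A sketch of computing the Mann-Whitney U statistic for the restaurant data by ranking the pooled scores (average ranks for ties); a library such as `scipy.stats.mannwhitneyu` would also supply the p-value to compare against the 10% level:

```python
def u_statistic(a, b):
    pooled = sorted(a + b)
    # Average rank for each distinct value (handles ties).
    rank = {v: (2 * pooled.index(v) + 1 + pooled.count(v)) / 2 for v in set(pooled)}
    r_a = sum(rank[v] for v in a)           # rank sum of sample a
    return r_a - len(a) * (len(a) + 1) / 2  # U statistic for sample a

a = [5, 4, 4, 3, 3, 2, 5, 5, 4, 4, 3, 3]  # Restaurant A
b = [4, 5, 3, 3, 2, 5, 4, 2, 2, 3, 1, 1]  # Restaurant B
print(u_statistic(a, b))  # 98.5
```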
Notes on hypothesis testing
1. Testing is not decision-making; tests are only useful aids for decision making.
2. Tests do not explain the reasons why a difference exists; they tell us whether the difference is due to chance or to specific causes.
3. Tests do not tell us “true” or “false”; they give us the strength of the evidence for accepting or rejecting a hypothesis (e.g. very strong evidence, weak evidence).
Regression techniques
Regression analysis
● Regression analysis is used to model the relationship between a response variable and one or more predictor variables.
● The goal is to understand the general trend that relates the independent and dependent variables.
● We can also use regression analysis to identify the impact of each variable.
Terminologies
● Independent variable / explanatory variable / regressor: the input x.
● Dependent variable / target variable / response variable: the output y.
● The regression model approximates the true function: ŷ = f̂(x) ≈ f(x).
Correlation vs regression
Deterministic vs statistical relationship
Linear regression: Linear model
Linear regression
● Where is the extremum of a function? Where its gradients are zero: for least squares, setting the gradients of the squared-error function to zero and solving yields the coefficients.
Linear regression: An example
Linear regression: How good is your model?
● R-squared is a statistical measure of how close
the data are to the fitted regression line. It is
also known as the coefficient of determination.
● The closer to 1, the better. A good value of R-squared is context dependent, but typically R-squared > 0.9 indicates a good fit.
Linear regression: case of multiple variables
● In some cases we are interested in the relationship between the dependent variable and multiple independent variables.
● In that case, we can use linear regression with multiple variables:
y = β0 + β1x1 + β2x2 + β3x3 + … + βm xm
● The coefficients tell us the importance of the variables.
● Now, how do we obtain the coefficients?
Ordinary least squares
● The coefficients minimize the sum of squared errors; setting the gradient to zero and rearranging gives the normal equations, β = (XᵀX)⁻¹ Xᵀy.
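A sketch of solving the normal equations with NumPy on synthetic, noise-free data, so the known coefficients are recovered exactly (up to floating point):

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # intercept + 2 variables
true_beta = np.array([1.0, 2.0, -3.0])
y = X @ true_beta

beta = np.linalg.solve(X.T @ X, X.T @ y)  # normal equations: (X'X) beta = X'y
print(np.round(beta, 6))  # recovers [1, 2, -3]
```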
Polynomial regression
● A linear model (y = β0 + β1x) can be extended to a quadratic model (y = β0 + β1x + β2x²) or higher orders, while remaining linear in the coefficients.
Polynomial regression
● Build the regression model ŷ = f̂(x) ≈ f(x), then evaluate the error e = y − ŷ on the validation set.
Error metrics
● There are several available error metrics, such as the mean absolute error (MAE), root mean squared error (RMSE), mean absolute relative error (MARE), and root mean squared relative error (RMSRE).
● RMSE gives more penalty to large errors, which means RMSE is most useful when large errors are particularly undesirable.
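MAE and RMSE are easy to implement directly; the sketch below shows how one large error inflates RMSE more than MAE:

```python
import math

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 8.0]  # a single large error of 4

print(mae(y_true, y_pred))   # 1.0
print(rmse(y_true, y_pred))  # 2.0: the squared term penalizes the large error more
```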
Is your model accurate or not? Do cross-validation!
● Cross-validation (CV) is a way to assess the accuracy of surrogate models without an external validation set.
● Why? Because in many cases, we don’t have an external validation set.
● CV separates the training set (e.g. the sampling points used to create a Kriging model) into a smaller training set and a validation set.
● Error metrics: RMSE, MAE, R² (coefficient of determination), mean absolute relative error, root mean squared relative error, etc.
Example: 5-fold cross-validation
Steps:
1. Use fold 2-5 to create a surrogate model, use fold 1 as
validation samples. Compute error values.
2. Use fold 1,3,4,5 to create a surrogate model, use fold
2 as validation samples. Compute error values.
3. Continue until all folds have been used.
4. Average the error values.
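The steps above can be sketched in plain Python; here the "model" is just the mean of the training folds, so the example stays self-contained (a real surrogate model would replace it):

```python
import statistics

def k_fold_cv(xs, k=5):
    folds = [xs[i::k] for i in range(k)]  # round-robin split into k folds
    errors = []
    for i in range(k):
        valid = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        model = statistics.fmean(train)  # "fit" on the k-1 training folds
        errors.append(statistics.fmean(abs(x - model) for x in valid))  # MAE on fold i
    return statistics.fmean(errors)      # step 4: average the error values

data = [float(v) for v in range(20)]
print(k_fold_cv(data))  # 5.0
```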
How to avoid underfit/overfit? Do cross validation
Multivariate polynomial regression
● The extension of multivariable linear regression.
● For each variable, we may have a quadratic, cubic, or higher-order trend.
● It can capture non-linearity in a high-dimensional space (a high number of independent variables).
● It might also involve interactions between independent variables.
Advanced regression model: GPR (Gaussian process regression)
• GPR constructs a surrogate model as realizations of a stochastic process (a Gaussian distribution).
• The process is defined by a covariance function (i.e. a measure of similarity between two points): y(x) = μ(x) + Z(x).
• The prediction comes with an uncertainty estimate from the stochastic term Z(x).
Sampling points
● It is important to decide the sampling points, i.e., the points where you will
evaluate y (e.g. by CFD, FEM, or physical experiments).
● Basically, we want to maximize the information from limited available
samples.
● To that end, the sampling points need to be carefully placed.
Sampling points: Example using MATLAB (LHS vs. random sampling)
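The idea behind Latin hypercube sampling (LHS), as opposed to purely random sampling, can be sketched in a few lines of Python: each axis is split into n strata, and every stratum holds exactly one sample:

```python
import random

def lhs(n, dims=2, seed=0):
    rng = random.Random(seed)
    cols = []
    for _ in range(dims):
        strata = [(i + rng.random()) / n for i in range(n)]  # one point per stratum
        rng.shuffle(strata)
        cols.append(strata)
    return list(zip(*cols))  # n points in the unit hypercube [0, 1)^dims

points = lhs(10)
print(len(points))  # 10 points; each axis has one point in each of its 10 strata
```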
An example of computer experiment
● One of my former students designed a computer experiment to study the impact of adding a plate to reduce vortex-induced vibration.
● He used GPR to create a regression model.
● His independent variables were the length and angle of the plate attached to the cylinder.
● His dependent variables (outputs of interest) were the mean drag coefficient and vibration.
SURROGATE ASSISTED GENETIC ALGORITHM FOR COMPUTATIONALLY EXPENSIVE FLUID FLOW
OPTIMIZATION
Master Thesis, Yohanes Bimo Dwianto, Bandung Institute of Technology, Dept. of Aerospace Engineering
ENGINEERING OPTIMIZATION: AN INTRODUCTION
Engineering design optimization
[Figure: baseline shape vs. optimized shape]
Goals of optimization:
• To optimize engineering systems in order to improve performance.
• To discover important knowledge, insights, physical phenomena, and design guidelines that will be useful for future designs.
Main components:
• Objective functions: performance measures to be optimized, expressed with the so-called objective functions (e.g. drag coefficient, adiabatic efficiency).
• Constraint functions: limitations or restrictions on the optimized performance (e.g. stress limit, moment coefficient).
Formulating your optimization problem
Defining your design variables and objective function
• x is your vector of design variables.
• f(x) is your objective function, your optimization goal.
• You want to find the x that optimizes f(x).
• Define design variables that create an airfoil (e.g. NACA, PARSEC); this is your x.
• For each x, you have a unique airfoil.
• Now suppose you want to maximize L/D: this is your f(x).
• How do you get f(x)? Use CFD!
• Your task is to find the x that generates an airfoil shape maximizing L/D.
[Figure: each point in the design space is a unique airfoil; one of them is your optimal airfoil.]
Definition of optimality
● A point with a non-zero gradient is not your optimal point; at the optimal point the gradient is zero.
● The convergence history will be your favorite plot: you observe improvements over time!
(Figure source: Quasi-Newton Methods, https://codesachin.wordpress.com/2016/01/16/nelder-mead-optimization/)
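The zero-gradient optimality condition is exactly what gradient-based optimizers drive toward; a toy sketch on f(x) = (x - 3)²:

```python
def grad(x):
    return 2.0 * (x - 3.0)  # df/dx of f(x) = (x - 3)^2

x = 0.0
for _ in range(200):
    x -= 0.1 * grad(x)  # step against the gradient: improvement over time

print(round(x, 6))  # 3.0: the gradient is (numerically) zero here
```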
Example: flow control in cylinder
Why do you need to know all of this?