
Statistics for research

(and also probability)

Pramudita Satria Palar, Ph.D.

1
The goal of statistics
- Area of applied mathematics concerned with data collection, analysis, interpretation, and presentation.
- Used in practically every scientific discipline.
- Discovers patterns and relationships in data and infers useful insights and knowledge from them.
- You always deal with data, right?

The course will cover:


1. Basic statistics.
2. Sampling methods.
3. Statistical hypothesis tests.
4. Basic regression techniques.
5. Data presentation.
2
Basics

3
Basic terminologies in statistics (and probabilities)
- Random variable: A quantity having a numerical value for each member of a group, especially one
whose values occur according to a frequency distribution.
- Probability: the chance that an event will occur.
- A probability space or a probability triple ( Ω , F , P ) is a mathematical construct that models a
real-world process (or “experiment”) consisting of states that occur randomly.

Ω = Sample space, a set of all possible outcomes.


F = Event space; each event is a set containing zero or more outcomes (i.e., a subset of the sample space).
P = Probability measure, which assigns a probability to each event.

4
Descriptive statistics
Measures of central tendency (three kinds of “averages”):
As the name suggests, it gives you an impression regarding the average
tendency of your data:
● Mean: Add up all the numbers, then divide by how many numbers there are.
● Median: The "middle" value in the sorted list of numbers.
● Mode: The value that occurs most often.

Measures of variability:
Gives you the information regarding how far your data deviates from the
center:
● Standard deviation / Variance.
● Maximum / minimum.
● Interquartile range (IQR)
5
Probability distribution (1)
● A probability distribution is a mathematical function that provides the probabilities
of occurrence of different possible outcomes in an experiment.
● In more technical terms, the probability distribution is a description of a random
phenomenon in terms of the probabilities of events.
● Knowing the shape of the distribution is crucial to the use of statistical methods in
research analysis.
● This is primarily because most methods make specific assumptions about the nature of
the distribution curve.

6
Probability distribution (2)
● The probability that a discrete random variable X takes on a particular value x, that is, P(X = x),
is frequently denoted f(x)
● For a continuous random variable, the probability distribution is commonly termed the
probability density function.
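To make this concrete, here is a minimal Python sketch (not part of the original slides) that evaluates a binomial PMF and a normal PDF directly from their textbook formulas:

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for a binomial random variable: C(n, k) * p^k * (1-p)^(n-k)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. dev. sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# A PMF assigns probabilities that sum to 1 over all possible outcomes:
total = sum(binom_pmf(k, 10, 0.3) for k in range(11))
print(round(total, 10))  # -> 1.0
```

A PDF, by contrast, must be integrated over an interval to give a probability; its value at a point is a density, not a probability.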

7
Probability distribution (3)

Probability mass function (e.g., the binomial distribution) vs. probability density function (e.g., the normal distribution).

8
Central tendency: A simple example
42 20 79 95 13 3 84 70 67 70
74 39 65 17 70

Sort the values first:

3 13 17 20 39 42 65 67 70 70 70 74 79 84 95
(Min = 3, Median = 67, Mode = 70, Max = 95)
Mean (arithmetic average) -> Expected value


● Mean = 53.87
● Median = 67
● Mode = 70
● Min = 3
● Max = 95
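These summary statistics can be checked with Python's standard `statistics` module. The sketch below uses the slide's data (with the lone 7 read as 70, consistent with the sorted list and the reported mode):

```python
from statistics import mean, median, mode

data = [42, 20, 79, 95, 13, 3, 84, 70, 67, 70, 74, 39, 65, 17, 70]

print(sorted(data))          # -> [3, 13, 17, 20, 39, 42, 65, 67, 70, 70, 70, 74, 79, 84, 95]
print(round(mean(data), 2))  # -> 53.87
print(median(data))          # -> 67   (8th of 15 sorted values)
print(mode(data))            # -> 70   (occurs three times)
print(min(data), max(data))  # -> 3 95
```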
9
Median in the case of even number
3 13 17 20 39 42 65 67 70 70 70 74 79 84 85 95

Sort the values first:

3 13 17 20 39 42 65 67 70 70 70 74 79 84 85 95
Median = (67 + 70) / 2 = 68.5 (average of the 8th and 9th values)
Mean (arithmetic average) -> Expected value
● Mean = 55.8125
● Median = 68.5
● Mode = 70
● Min = 3
● Max = 95
10
Case of mean and median
10 20 30 10 50 10 70 50 80 60 100 100 5000

Sort the values first:

10 10 10 20 30 50 50 60 70 80 100 100 5000

● Mean = 430
● Median = 50
Mean (arithmetic average) -> Expected value

Mean and median can give you different impressions of the data. Such differences might be due to:
● The presence of outliers.
● Incorrect sampling or other errors.
● A non-robust method (in the context of algorithm design).
In practice, it is better to observe both the mean and the median.
11
Measures of variability: range (1)

● Range is defined as the difference between the values of the extreme items
of a series.
● Pros: It gives an idea of the variability very quickly.
● Cons: The range is affected greatly by sampling fluctuations; its value is unstable.
● Typically only used as a rough measure of variability.

12
Measures of variability: range (2)
Data: 42 20 79 95 13 3 84 70 67 70 74 39 65 17 70

Sorted: 3 13 17 20 39 42 65 67 70 70 70 74 79 84 95

Range = (95-3) = 92
13
Measures of variability: range (3)
10 20 30 10 50 10 70 50 80 60 100 100 5000

Sort the values first.

Sorted: 10 10 10 20 30 50 50 60 70 80 100 100 5000

● Mean = 430
● Median = 50
● Range = (5000-10) = 4990

This is an example where the range might give you the wrong impression. However, the range can also be used to check the ‘sanity’ of your data.

14
Measures of variability: Standard deviation (1)

● Standard deviation is the most widely used measure of dispersion of a series.
● We can also use the variance (σ²), i.e., the mean squared deviation of the data relative to the mean.
● Standard deviation is more useful to describe the data according to our common sense, while the variance is more useful in mathematical derivations.
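A quick way to compute these is Python's standard `statistics` module. The sketch below uses illustrative data of my own (not from the slides) and the population forms (`pvariance`, `pstdev`); `variance` and `stdev` are the sample forms with the n − 1 divisor:

```python
from statistics import pstdev, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data; population mean = 5

print(pvariance(data))  # variance sigma^2: mean squared deviation from the mean (= 4)
print(pstdev(data))     # standard deviation sigma = sqrt(variance) -> 2.0
```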

15
Why use squared values in standard deviation?

16
Measures of variability: Standard deviation (2)

Two cases: Mean = 0, Std. dev. = 2; Mean = 0, Std. dev. = 1.
18


Measures of variability: Standard deviation (3)

Two cases: Mean = 5, Std. dev. = 2; Mean = 5, Std. dev. = 1.
19


Measures of asymmetry: Skewness (1)
● Skewness indicates a distribution that is asymmetric.
● If the distribution has a longer tail on the right side, we have positive skewness.
● If the distribution has a longer tail on the left side, we have negative skewness.

20
Measures of asymmetry: Skewness (2)

21
Measures of asymmetry: Skewness (3)

Two cases: Mean = 0, Std. Dev. = 1, Skew = 0.7; Mean = 0, Std. Dev. = 1, Skew = -0.7.

22
Standardized moments
The k-th standardized moment is the k-th central moment divided by σ^k (for example, skewness is the third standardized moment).

23
Intermezzo: quantile
● By a quantile, we mean the fraction (or percent) of points below the given value. That is,
the 0.3 (or 30%) quantile is the point below which 30% of the data fall and above which
70% fall.
● For example, if the 0.8 quantile of the examination scores for subject A is 40, this means
that 80% of the students obtained scores lower than 40.

24
Intermezzo: quartile
● Quartile is one form of quantile. Quartiles are the values that divide a list of numbers
into quarters:

25
Measures of variability: interquartile range (1)
● The Interquartile Range (IQR) may be used to characterize the data when
there may be extremities that skew the data.
● The interquartile range is a relatively robust (also called "resistant") statistic
compared to the range and standard deviation.

IQR = Q3-Q1

26
Measures of variability: interquartile range (2)

It is clear that Group I's values are typically higher, but Group I also has a wider spread, as indicated by the wider IQR shaded bar.

27
Outliers (1)
● A data point on a graph or in a set of
results that is very much bigger or
smaller than the next nearest data
point.
● Outliers can cause serious problems in
statistical analyses (depending on the
context).
● Outliers could mean two things: (1)
Measurement error (2) Heavy tailed
distribution.
● Now the question is: how to
detect outliers?

28
Outliers (2)
● There is no one universally accepted outlier detection method.
● There are some useful methods, one of them is Tukey’s fences.
● Tukey’s fences treat data as outliers when they fall outside the range
[Q1 − k·IQR, Q3 + k·IQR].

● If k = 1.5, data outside this range are treated as “outliers”.


● If k = 3, data outside this range is indicated as “far-out”.
● We’ll also illustrate this later by using a boxplot.

29
Outliers (3)
● An example of outlier detection using Tukey’s fences.
● Consider the following scores from mid-term exams:
21 22 33 10 8 9 10 4
30 20 30 10 20 17 8 9
10 22 11 2 30 20 22
33 34 37 21 20 21 23
11 21 13 8 6 9 100 90
30 70
● Q1 = 10
● Q3 = 30
● IQR = 20
● Lower bound for outlier detection with k = 1.5: 10 − 1.5 × 20 = −20.
● Upper bound for outlier detection with k = 1.5: 30 + 1.5 × 20 = 60.
● Thus, the scores of 100, 90, and 70 are outliers.
30
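The whole procedure can be sketched in plain Python. Quartiles are computed here as Tukey's hinges (medians of the lower and upper halves), one of several quartile conventions, chosen because it reproduces Q1 = 10 and Q3 = 30 for this data:

```python
# Tukey's fences on the mid-term scores from the example.
scores = [21, 22, 33, 10, 8, 9, 10, 4,
          30, 20, 30, 10, 20, 17, 8, 9,
          10, 22, 11, 2, 30, 20, 22,
          33, 34, 37, 21, 20, 21, 23,
          11, 21, 13, 8, 6, 9, 100, 90,
          30, 70]

def tukey_outliers(data, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR], quartiles via Tukey's hinges."""
    s = sorted(data)
    n = len(s)
    half = n // 2

    def med(xs):
        m = len(xs) // 2
        return xs[m] if len(xs) % 2 else (xs[m - 1] + xs[m]) / 2

    q1 = med(s[:half])            # median of the lower half
    q3 = med(s[half + n % 2:])    # median of the upper half
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return lo, hi, [x for x in data if x < lo or x > hi]

lo, hi, out = tukey_outliers(scores)
print(lo, hi, sorted(out))  # -> -20.0 60.0 [70, 90, 100]
```

With k = 3 instead of 1.5, the fences widen to [−50, 90] and only the "far-out" points remain flagged.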
Measures of
relationship

31
Quiz: How will you describe the following
relationship?

32
Correlation and dependence
● Statistical dependence or association describes a statistical relationship between two random variables.
● The relationship does not indicate causality, but describes correlation.
● It is a powerful tool when you want to analyze whether there is a clear relationship between two variables.

33
Correlation and dependence
● Statistical dependence or association describes a statistical relationship between two random variables.
● The relationship does not indicate causality, but describes correlation.
● It is a powerful tool when you want to analyze whether there is a clear relationship between two variables.
● The most widely used measure is the Pearson’s correlation coefficient.
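A minimal sketch of Pearson's correlation coefficient from its definition (the covariance divided by the product of the standard deviations); the data below are hypothetical:

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]                    # perfectly linear -> r = 1
print(round(pearson_r(x, y), 6))        # -> 1.0
print(round(pearson_r(x, y[::-1]), 6))  # reversed, perfectly anti-linear -> -1.0
```

Remember that r only measures linear association: a strong nonlinear dependence can still give r close to 0.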

34
● How about for cases with more than two variables?

35
Correlation matrix

● For cases with more than two variables, we can plot the correlation matrix.
● The diagonal shows the histogram of each variable, while the off-diagonal entries show each pairwise relationship and/or the correlation coefficient.

36
Presenting your results

37
Presenting your results: Histogram
● A histogram is a plot that lets you discover, and show, the underlying frequency distribution
(shape) of a set of data. This allows the inspection of the data for its underlying distribution (e.g.,
normal distribution), outliers, skewness, etc.
● Histogram is very useful if you want to visualize the general distribution of your data
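Underneath a histogram plot is just bin counting; a minimal sketch with equal-width bins (the data are hypothetical):

```python
def histogram_counts(data, bins, lo=None, hi=None):
    """Count how many data points fall into each of `bins` equal-width bins."""
    lo = min(data) if lo is None else lo
    hi = max(data) if hi is None else hi
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        i = min(int((x - lo) / width), bins - 1)  # clamp the max value into the last bin
        counts[i] += 1
    return counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
print(histogram_counts(data, 4))  # 4 bins of width 2 over [1, 9] -> [3, 5, 1, 1]
```

A plotting library then simply draws one bar per count; the statistical content is entirely in the binning.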

38
Histogram: Choosing the number of bins

Too many bins:
- The general pattern is difficult to see.
- Shows too much individual data.

Too few bins:
- Might give us an incomplete impression of the data.
39
Density plots as alternative to histogram
● A Density Plot visualises the distribution of data over a continuous interval or time period.
● Basically, it is a smoothed histogram via kernel smoothing.
● It gives us a clearer picture regarding the shape of the distribution.

40
Histogram vs PDF plot
● Histogram is the empirical distribution of your sample.
● PDF is used if you want to describe the hypothesized underlying distribution (normal? exponential?)
● Histogram is usually shown with frequencies as the y-axis, while for PDF density is shown (note that
the right figure is only for illustrative purpose).

41
Presenting your results: Barplot
● The major difference is that a histogram is only used to plot the frequency of score
occurrences in a continuous data set that has been divided into classes, called bins. Bar
charts, on the other hand, can be used for a great deal of other types of variables.
● A barplot is very useful if you want to visualize data such as sales or sensitivity indices, to name a
few.

42
Presenting your results: Barplot
● One common mistake: using a continuous plot when the data should actually be visualized
with a barplot (i.e., when the x-axis is a categorical variable).

Just right Umm, no..


43
Presenting your results: Scatterplot
● Scatter plot shows relationship between two sets of data by using points.
● Scatter plot is also typically used in multi-objective decision making/optimization to
visualize the Pareto front.

44
Presenting your results: Scatterplot (2)
● Scatter plot can also be used for outlier detection.

45
Presenting your results: Boxplot
● Displays the distribution of data based on a five-number summary (remember
quartiles): minimum, first quartile, median, third quartile, and maximum.
● A boxplot gives you a good indication of how your data are clustered and spread out.
● Another name for the boxplot is the “box and whiskers” plot.
● The whiskers can indicate: ±1.5 IQR, the minimum and maximum of the data, one standard
deviation above and below the mean, or the 9th and 91st percentiles.

46
Presenting your results: Pie chart
● A pie chart is useful when you have a data set such as preferences, sensitivity indices,
grades, to name a few.

47
Presenting your results: Heat map
● Heat map is useful to better visualize data with arbitrary borders (e.g., geospatial,
CFD/FEM results)

48
Presenting your results: Contour map.
● A contour map performs visualization by creating multiple lines that share the same
values.
● A contour map can be shown with or without colours.
● On some occasions, a contour map can be combined with a heat map.

49
Tips: Use publication-quality figures, please..

● Many journals will comment on the quality of your figures.
● Some reviewers will even ‘look down’ on your papers when they see poor-quality figures.
● Don’t forget to give details! Labels, legend, etc.
● Learn how to do this properly, properly, properly.

(Examples shown: not-acceptable vs. acceptable figures.)

50
Statistical hypothesis
test

51
Statistical hypothesis
● Let’s talk about hypothesis first..
● “Hypothesis is a formal question that the
researcher intends to resolve”.
● Quite often a research hypothesis is a predictive
statement, capable of being tested by scientific
methods, that relates an independent variable to
some dependent variable.

So in this case, the following are good hypotheses.


● “Car A consumes less fuel than car B”
● “Students who receive tutorial score higher than
other students who do not take tutorial”

52
Statistical hypothesis
● A hypothesis test is a rule that specifies
whether to accept or reject a claim about
a population, depending on the evidence
provided by a sample of data.
● A hypothesis test examines two opposing
hypotheses about a population: the null
hypothesis and the alternative
hypothesis.
● The null hypothesis is the statement being
tested. Usually the null hypothesis is a
statement of "no effect" or "no difference".
● The alternative hypothesis is the statement
you want to be able to conclude is true
based on evidence provided by the sample
data.
53
The use of statistical hypothesis

● Does my algorithm A perform significantly better than algorithm B? (according to statistical tests)
● Do the students from major X receive significantly higher scores than the students from major Y in calculus class?
● Do male and female undergraduates differ in height on average?

Statistical hypothesis tests try to answer these questions. Is the difference significant?
54
The use of statistical hypothesis
● Does my algorithm A perform significantly better than algorithm B? (according to statistical tests)
● Do the students from major X receive significantly higher scores than the students from major Y in calculus class?
● Do male and female undergraduates differ in height on average?

Let’s say that you answer “Yes” to one of the questions above; how can you be so sure?

55
The use of statistical hypothesis
Case : We develop method A so that we can obtain superior results
as compared to method B (let’s say, optimization method)
● Null hypothesis (H0): “We are to compare method A with method B regarding its
superiority, and we proceed on the assumption that both methods are equally good.”
● Alternative hypothesis (H1): “Method A is superior, or method B is inferior.”
● The alternative hypothesis is usually the one which one wishes to prove.
● The null hypothesis is the one which one wishes to disprove (the one we are
trying to reject).
● How to assess this? Compare the p-value against the level of significance.

56
The use of statistical hypothesis
Case : We develop method A so that we can obtain superior results
as compared to method B (let’s say, optimization method)

57
The use of statistical hypothesis
● Based on the sample data, the test determines whether to reject the null
hypothesis.
● You use a p-value, to make the determination. If the p-value is less than
the significance level (denoted as α or alpha), then you can reject the null
hypothesis.
● The most common significance level is 0.05: we reject H0 when the p-value, i.e., the
probability of obtaining data at least as extreme as observed if H0 were true, is below 0.05.
● In other words, the 5 percent level of significance means that the researcher is
willing to take as much as a 5 percent risk of wrongly rejecting the null hypothesis.
● There is now a movement that advocates lowering the threshold to 0.005.

58
Type I versus type II error
● In the context of testing of hypotheses, there are two types of errors we can make:
● We can reject H0 when H0 is true (Type I error), denoted by α.
● We may accept H0 when in fact H0 is not true (Type II error), denoted by β.
● The α error is understood as the level of significance of hypothesis testing.
● If the α error is fixed at 5 percent, there is about a 5/100 chance that we will reject H0 when H0 is true.
59
Two-tailed and
one-tailed tests

● A one-tailed test is used when we are to test whether the population mean is either lower than or higher than some hypothesized value.

60
Two-tailed and
one-tailed tests
● A two-tailed test rejects the null hypothesis if, say, the sample mean is significantly higher or lower than the hypothesized mean of the population.

For significance level 5 percent (p-value = 0.05)

61
Parametric vs non-parametric test
● In a parametric test, we assume that
the data follow a specific
distribution (typically normal).
● In a non-parametric test, we do not assume
that the data follow a specific
distribution (a distribution-free test).
● However, parametric tests can also
be used for non-normal data under
some requirements (e.g., sufficient sample size).

62
Non-parametric tests
● Mann–Whitney U or Wilcoxon rank-sum test: tests whether two samples are drawn from the same
distribution, as compared to a given alternative hypothesis.
● McNemar's test: tests whether, in 2 × 2 contingency tables with a dichotomous trait and
matched pairs of subjects, row and column marginal frequencies are equal
● Median test: tests whether two samples are drawn from distributions with equal medians
● Pitman's permutation test: a statistical significance test that yields exact p values by
examining all possible rearrangements of labels
● Rank products: detects differentially expressed genes in replicated microarray experiments
● Siegel–Tukey test: tests for differences in scale between two groups
● Sign test: tests whether matched pair samples are drawn from distributions with equal
medians.
● And many more...

63
Parametric tests
● Z-test.
Based on the normal distribution and is used for judging several
statistical measures (particularly the mean).
● t-test.
Based on t-distribution and is used for judging the significance of a
sample mean or the difference between two means of two samples in case
of small samples.
● Chi-square test.
Based on chi-square distribution and is used for comparing a sample
variance to a theoretical population variance.
● F-test.
Based on F-distribution and is used to compare the variance of two
independent samples

64
Case 1: Hypothesis testing of means

● Hypothesis testing of means is used in situations such as comparing the
mean of a smaller population with the general population (e.g., does the
behavior of group A represent the general population?).
● The null hypothesis is stated as H0 : μ = μH0, where μH0 is a specified value.
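As an illustrative sketch of this case (the sample figures below are hypothetical, not taken from the slides), the z statistic compares the sample mean against the hypothesized mean via the standard error:

```python
import math

def z_statistic(sample_mean, mu0, sigma, n):
    """z = (xbar - mu0) / (sigma / sqrt(n)): distance of the sample mean from mu0."""
    se = sigma / math.sqrt(n)  # standard error of the mean
    return (sample_mean - mu0) / se

# Hypothetical numbers: 400 students, hypothesized mean height 67.39 in,
# sample mean 67.47 in, population std. dev. 1.30 in.
z = z_statistic(67.47, 67.39, 1.30, 400)
print(round(z, 3))  # -> 1.231
z_crit = 1.96       # two-tailed critical value at the 5% significance level
print("reject H0" if abs(z) > z_crit else "accept H0")  # -> accept H0
```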

70
Statistical hypothesis testing, illustration 1

● In this problem, we would like to know whether these 400 male students represent the general
population.
● The first step is to formulate the hypothesis. The null hypothesis is that the mean height of the
population is equal to 67.39 inches:
Given information: we assume a normal distribution and use the z-test, with the standard error of the mean. Result: Accept H0.
71
Statistical hypothesis testing, illustration 2

Accept

72
Statistical hypothesis testing, illustration 3

Reject

73
Case 2: Hypothesis testing for differences between
means
● Used when we are interested in knowing whether the parameters of two
populations are alike or different.
● Example: In company A, do female workers earn less than male workers
for the same job?
● The null hypothesis is stated as H0 : μ1 = μ2.

74
Statistical hypothesis testing, illustration 4

Hypothesis Given information

Reject

75
Statistical hypothesis testing, illustration 5

Notice the large sample size here. Result: Reject H0.
76
Statistical hypothesis testing, illustration 6

Notice the small sample size here. Result: Reject H0.
77
Case 3: Hypothesis testing of proportions
● In case of qualitative phenomena, we have data on the basis of presence
or absence of an attribute
● With such data, the sampling distribution may take the form of a binomial
probability distribution, whose mean equals n ⋅ p and whose standard deviation
equals √(n ⋅ p ⋅ q), where p represents the probability of success, q = 1 − p
represents the probability of failure, and n is the size of the sample.
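A sketch of the corresponding z statistic in its proportion form, comparing p̂ against a hypothesized p0 with standard error √(p0·q0/n); the counts below are hypothetical:

```python
import math

def proportion_z(successes, n, p0):
    """One-sample z-test for a proportion: z = (p_hat - p0) / sqrt(p0*(1-p0)/n)."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)  # std. dev. of the sampling proportion
    return (p_hat - p0) / se

# Hypothetical: 64 "successes" in 400 trials, hypothesized proportion p0 = 0.20.
z = proportion_z(64, 400, 0.20)
print(round(z, 2))  # -> -2.0
```

|z| = 2.0 exceeds the two-tailed 5% critical value of 1.96, so for these made-up numbers H0 would be rejected.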

78
Statistical hypothesis testing, illustration 7

Reject
79
Statistical hypothesis testing, illustration 8

Accept 80
Other cases

● Hypothesis testing for comparing two related samples.


● Hypothesis testing for difference between proportions.
● Hypothesis testing for comparing a variance to some hypothesized
population variance.
● Hypothesis testing of correlation coefficients.

81
Non-parametric hypothesis test, an example
1. A popular nonparametric test to compare outcomes between two independent groups is the
Mann Whitney U test.
2. Mann Whitney U test tests whether two samples are likely to derive from the same
population (i.e., that the two populations have the same shape).
3. Recall that the parametric test compares the means (H0: μ1=μ2) between independent groups.
4. It uses the U value, i.e., a measure of the difference between the ranked observations of the two
samples.

82
83
Accept

84
Non-parametric hypothesis test, an example
● Restaurant A received reviews from google maps the score of [5 4 4 3 3 2 5 5 4 4 3 3]
● Restaurant B received reviews from google maps the score of [4 5 3 3 2 5 4 2 2 3 1 1]
● Is the difference statistically significant at the 10% level of significance?

85
Non-parametric hypothesis test, an example
● Restaurant A received reviews from google maps the score of [5 4 4 3 3 2 5 5 4 4 3 3]
● Restaurant B received reviews from google maps the score of [4 5 3 3 2 5 4 2 2 3 1 1]
● Is the difference statistically significant at the 10% level of significance?

Sum of ranks for restaurant A =
4.5 + 10 + 10 + 10 + 10 + 16.5 + 16.5 + 16.5 + 16.5 + 22 + 22 + 22 = 176.5
Sum of ranks for restaurant B =
1.5 + 1.5 + 4.5 + 4.5 + 4.5 + 10 + 10 + 10 + 16.5 + 16.5 + 22 + 22 = 123.5

U_A = 176.5 − 12(12+1)/2 = 98.5, U_B = 123.5 − 12(12+1)/2 = 45.5
Mean of U = n1·n2/2 = 72
Std. dev. of U = √(n1·n2·(n1+n2+1)/12) = √300 ≈ 17.3205

Compute upper and lower limits (z = 1.645 for the two-tailed 10% level):
upper limit = 72 + 1.645 × 17.3205 ≈ 100.49, lower limit = 72 − 1.645 × 17.3205 ≈ 43.51.
The U values lie between the upper and lower limits: Accept H0.
86
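The rank-sum computation for this example can be sketched in plain Python; tied values receive the average of the ranks they occupy (mid-ranks). Note that the listed ranks for restaurant A sum to 176.5, giving U_A = 98.5:

```python
def rank_sum_u(a, b):
    """Sum of mid-ranks for sample `a` and the Mann-Whitney U statistic for it."""
    combined = sorted(a + b)
    # average 1-based rank for each distinct value (ties share their mean rank)
    ranks = {}
    for v in set(combined):
        first = combined.index(v) + 1
        count = combined.count(v)
        ranks[v] = first + (count - 1) / 2
    r_a = sum(ranks[v] for v in a)
    u_a = r_a - len(a) * (len(a) + 1) / 2
    return r_a, u_a

a = [5, 4, 4, 3, 3, 2, 5, 5, 4, 4, 3, 3]  # restaurant A reviews
b = [4, 5, 3, 3, 2, 5, 4, 2, 2, 3, 1, 1]  # restaurant B reviews
r_a, u_a = rank_sum_u(a, b)
print(r_a, u_a)  # -> 176.5 98.5
```

Since 98.5 lies inside the acceptance limits (about 43.5 to 100.5 at the 10% two-tailed level), the conclusion is the same: fail to reject H0.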
Hypothesis testing: The debate
● “P-value < 0.05 is not strict, we should set p-value to < 0.005!”
Why?
1. By setting a lower threshold, science will benefit from a lower
probability of accepting false observations.
2. Increasing the reproducibility of research: studies that yielded highly
significant results (p < 0.01) are more likely to reproduce than those
that are just barely significant at the 0.05 level.

● “Statistical significance erodes the culture of science, they should be banned”.


Why?
1. Many researchers focus too much on answering whether there is an effect
or not, while giving less importance to why a particular phenomenon happens.
2. However, both things should go hand in hand; it is the culture that
should change.
87
Hypothesis testing: The debate
1. Although it is a common practice to use p<0.05, there are some disciplines
that already use very low p-values.
2. Genetics uses p < 0.00000005; astrophysics uses p < 0.0000003 (5-sigma,
which is how they could be so sure that the Higgs boson exists).

88
Notes on hypothesis testing
1. Testing is not decision-making; the tests are only useful aids for decision
making.
2. Tests do not explain the reasons why the difference exists.
They tell us whether the difference is due to chance or because of specific
reasons.
3. Tests do not tell us “true” or “false”. They give us the strength of the
evidence for accepting or rejecting a hypothesis (e.g., very strong evidence, weak evidence).

The take-home point is that statistical testing should be combined with adequate knowledge of the subject.

89
Regression techniques

90
Regression analysis
● Regression analysis is used to model the relationship between a response variable and one or more predictor variables.
● The goal is to understand the general trend that relates the independent and dependent variables.
● We can also use regression analysis to identify the impact of each variable.

91
Terminologies

y = f(x) ≈ f̂(x), where x = (x1, x2, …, xm)

● y is the dependent/target/response variable; the xi are the independent/explanatory/regressor variables; f̂ is the regression function.
● f(x) is the true mathematical relationship that, in practice, we do not know what it’s really like.
● f̂(x) is the approximation of the true mathematical relationship by a regression function/surrogate model.
92
Why do regression?
● Regression analysis eases the process of identifying relationships through a
quantitative measure.
● Regression analysis can also give us confidence in how accurately the
regression function approximates the relationship.
● Regression allows you to make predictions (the goal of machine learning),
but in statistics the goal is to uncover the relationship between
variables.

Thus, the following two aspects must be considered:


• Predictive power: accuracy of the regression model in capturing the
true function.
• Interpretability: the mathematical form of the regression function
gives us insight into the trend/behavior of the true function.

93
Correlation vs regression

● Correlation is described as the


analysis which lets us know the
association or the absence of the
relationship between two variables ‘x’
and ‘y’.
● Regression analysis predicts the value of the dependent variable based on the known value of the independent variable.
● More precisely, regression describes
how an independent variable is
numerically related to the dependent
variable.

94
Deterministic vs statistical relationship
Deterministic Statistical
relationship relationship

● Deterministic relationship is when the relationship is EXACTLY known.


● We use statistical relationship to model/approximate the deterministic
relationship that is not known. 95
Physical versus computer experiment

● A physical experiment observes the input/output at selected samples using an experimental apparatus.
● A computer experiment observes the input/output at selected samples using computer simulation.
● The outputs from physical experiments are typically corrupted by noise, while computer-experiment outputs are typically smooth.
● When using regression, the data could also be generated from a social experiment/census, to name a few.

96
Linear regression Linear model

Intercept Slope Error/noise

● Linear regression is used for finding a linear relationship between the target and one or more predictors.
● Linear regression is the simplest regression and is particularly useful when the relationship is linear.
● You can even do it with your calculator.

97
Linear regression Linear model

Intercept Slope Error/noise

● QUIZ: How do you really compute the


coefficients?

98
Linear regression Linear model

Intercept Slope Error/noise

● QUIZ: How do you really compute the coefficients? Use the least-squares approach, which minimizes the sum of the squared errors.

99
Linear regression

What is the minimum of a function? i.e., where the gradients are zero

Solutions

100
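The resulting closed-form solutions can be sketched directly: slope β̂1 = Σ(x−x̄)(y−ȳ)/Σ(x−x̄)² and intercept β̂0 = ȳ − β̂1·x̄. The data below are a made-up exact line:

```python
def fit_line(x, y):
    """Least-squares estimates for y = b0 + b1*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx  # the fitted line always passes through (xbar, ybar)
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]  # exactly y = 1 + 2x
b0, b1 = fit_line(x, y)
print(round(b0, 6), round(b1, 6))  # -> 1.0 2.0
```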
Linear regression: An example

101
Linear regression: How good is your model?
● R-squared is a statistical measure of how close
the data are to the fitted regression line. It is
also known as the coefficient of determination.
● The closer to 1, the better. A good value of R-
squared is context dependent, but typically R-
squared > 0.9 indicates a good fit.

Mean of data
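R² can be computed as 1 − SS_res/SS_tot, where SS_tot measures the spread of the data around its mean; a minimal sketch with hypothetical fitted values:

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SS_res/SS_tot: fraction of variance explained by the fit."""
    my = sum(y) / len(y)
    ss_tot = sum((v - my) ** 2 for v in y)          # spread around the mean
    ss_res = sum((v - p) ** 2 for v, p in zip(y, y_hat))  # residual error
    return 1 - ss_res / ss_tot

y     = [3, 5, 7, 9]
y_hat = [2.8, 5.2, 7.0, 9.0]  # hypothetical fitted values
print(round(r_squared(y, y_hat), 4))  # close to 1 -> a good fit (about 0.996 here)
```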

102
Linear regression: case of multiple variables
● There are some cases where we are
interested on the relationship between
dependent variable and multiple
independent variables.
● In that case, we can use linear regression
for multiple variables.
● The coefficients tell us the importance of
the variables.
● Now, how to obtain the coefficients?

𝑦 = 𝛽0+𝛽1𝑥1+𝛽2𝑥2+𝛽3𝑥3+…+𝛽𝑚𝑥𝑚
103
Ordinary least squares

Solutions

Rearrange

104
Polynomial regression

● Polynomial regression models the relationship between the independent and dependent variables by using an n-th order polynomial.
● Polynomial regression is useful when the relationship does not exhibit a clear linear relationship.
● Notice that linear regression is one particular case of polynomial regression, i.e., when n = 1.

Linear model

Quadratic model
105
Polynomial regression

● The higher the polynomial order, the more complex the behavior of the polynomial function.
● The best order totally depends on the true behavior of the function and the noise in the function.

106
Polynomial regression

● When using polynomial regression, it is important to avoid two particular cases:
● Underfitting: the polynomial regression misses the true behavior of the unknown function.
● Overfitting: the polynomial regression captures a “trend” that should not be there (e.g., noise).
107
Validation – Testing the accuracy of your model

● The goal of validation is to check the accuracy of our model.


● Validation is performed by testing the accuracy of the model on the
validation set.
● The ideal condition is to have a separate training set (the one that is
used to build the model) and validation set.

y = f(x) ≈ f̂(x)  (use this regression model)

e = y − ŷ  (evaluate this error on the validation set)
108
Error metrics
● There are several available error metrics, such as mean absolute error
(MAE), root mean squared error (RMSE), mean absolute
relative error (MARE), and root mean squared relative error (RMSRE).
● RMSE gives more penalty to large error. This means the RMSE is
most useful when large errors are particularly undesirable.

● Which one is better? That depends on your objective.


● RMSE and MAE might not be easy to interpret when multiple cases
are considered. In some cases, people use the relative value (RMSRE,
MARE, normalized MAE, normalized RMSE)
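Both metrics are one-liners; the sketch below (hypothetical data with one large error) shows how RMSE penalizes that error more than MAE:

```python
import math

def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Root mean squared error: penalizes large errors more than MAE."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

y     = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.0, 2.0, 3.0, 8.0]  # one large error of 4
print(mae(y, y_hat))   # -> 1.0
print(rmse(y, y_hat))  # -> 2.0  (the single large error dominates)
```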

109
Is your model accurate or not? Do cross-validation!
● Cross-validation (CV) is a way to assess the accuracy of surrogate models without an
external validation set.
● Why? Because in many cases, we don’t have an external validation set.
● CV separates the training set (the sampling points used to create the model, e.g., Kriging) into a smaller training set
and a validation set.

Two popular CV approaches:
● Leave-p-out CV.
● K-fold CV.

Error metrics:
● RMSE
● MAE
● R2 (coefficient of determination)
● Mean absolute relative error.
● Root-mean-squared relative error.
● etc.
110
Example: 5-fold cross-validation

Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

Steps:
1. Use fold 2-5 to create a surrogate model, use fold 1 as
validation samples. Compute error values.
2. Use fold 1,3,4,5 to create a surrogate model, use fold
2 as validation samples. Compute error values.
3. Continue until all folds have been used.
4. Average the error values.
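The fold bookkeeping from the steps above can be sketched as:

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k consecutive folds, as in the 5-fold example."""
    base, extra = divmod(n, k)  # spread any remainder over the first folds
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = k_fold_indices(10, 5)
print(folds)  # -> [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
# Fold i serves as the validation set; the remaining folds form the training set:
train_for_fold_0 = [j for f in folds[1:] for j in f]
```

In practice the indices are usually shuffled before splitting so each fold is representative of the whole sample.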
111
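The steps above can be sketched in Python (a minimal illustration; the function names and the NumPy-based splitting are my own, not from the slides):

```python
import numpy as np

def k_fold_cv(x, y, fit, predict, k=5, seed=0):
    """Estimate model error by k-fold cross-validation.

    fit(x_train, y_train) -> model; predict(model, x_val) -> predictions.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))            # shuffle, then split into k folds
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]                       # fold i is the validation set
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train], y[train])
        pred = predict(model, x[val])
        errors.append(np.sqrt(np.mean((y[val] - pred) ** 2)))  # RMSE per fold
    return np.mean(errors)                   # average over the k folds
```

For example, with `fit = lambda xt, yt: np.polyfit(xt, yt, 1)` and `predict = lambda m, xv: np.polyval(m, xv)`, this evaluates a linear regression by 5-fold CV.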
How to avoid underfitting/overfitting? Do cross-validation.
112
Multivariate polynomial regression
● The extension of multivariable linear
regression.
● For each variable, we may have a quadratic,
cubic, or higher-order trend.
● Can capture non-linearity in a high-dimensional
space (a high number of
independent variables).
● Might also involve interactions between
independent variables.

For example, the quadratic multivariate polynomial regression model
for two variables is

y = β₀ + β₁x₁ + β₂x₂ + β₃x₁² + β₄x₂² + β₅x₁x₂
113
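The two-variable quadratic model above can be fit by ordinary least squares. A Python sketch (the function names are mine):

```python
import numpy as np

def quad_features(x1, x2):
    # Basis for the two-variable quadratic model:
    # [1, x1, x2, x1^2, x2^2, x1*x2]
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])

def fit_quadratic(x1, x2, y):
    # Ordinary least squares for the coefficients beta_0 .. beta_5.
    beta, *_ = np.linalg.lstsq(quad_features(x1, x2), y, rcond=None)
    return beta
```

The same pattern extends to more variables or higher orders by adding columns to the feature matrix.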
Advanced regression model: RBF

• Radial basis function (RBF) models create an
interpolating/regression model by summing
several RBF terms.
• An RBF is a function whose value depends only on
a distance (e.g., from a center point).
114
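A minimal 1D sketch of an RBF interpolator in Python (the Gaussian kernel choice, the `eps` shape parameter, and the function name are illustrative assumptions, not from the slides):

```python
import numpy as np

def rbf_interpolate(x_train, y_train, x_new, eps=3.0):
    """Gaussian RBF interpolation in 1D.

    phi(r) = exp(-(eps * r)^2) depends only on the distance r.
    """
    phi = lambda r: np.exp(-(eps * r) ** 2)
    # Solve Phi w = y for the weights (Phi_ij = phi(|x_i - x_j|)).
    Phi = phi(np.abs(x_train[:, None] - x_train[None, :]))
    w = np.linalg.solve(Phi, y_train)
    # Prediction: weighted sum of basis functions centered at the data.
    return phi(np.abs(x_new[:, None] - x_train[None, :])) @ w
```

Because the model interpolates, it reproduces the training data exactly; a regression variant would add a regularization term to the diagonal of Phi.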
Advanced regression model: GPR
*Gaussian process regression
• GPR constructs the surrogate model as a realization
of a stochastic process.
• The process is defined by a covariance function
(i.e., a measure of similarity between two points):

y(x) = μ(x) + Z(x)

[Figure: Gaussian-process realizations with the prediction (the mean of the
realizations, i.e., the GP model) and its uncertainty band]

• Pros: Provides an extra uncertainty structure, highly
generalizable, high prediction capability.
• Cons: The training time can be significantly long;
not suitable for very high-dimensional regression
(>100 variables).
Bay, Xavier, Laurence Grammont, and Hassan Maatouk. "Generalization of the
Kimeldorf-Wahba correspondence for constrained interpolation." Electronic Journal
of Statistics 10.1 (2016): 1580-1595. 115
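A minimal 1D sketch of GPR prediction in Python, using a zero-mean process with a squared-exponential covariance (the kernel choice, `length`, `noise`, and the function name are illustrative assumptions; real GPR libraries also estimate the hyperparameters):

```python
import numpy as np

def gp_predict(x_train, y_train, x_new, length=0.3, noise=1e-8):
    """Posterior mean and variance of a zero-mean Gaussian process."""
    # Covariance: similarity decays with squared distance.
    k = lambda a, b: np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)
    K = k(x_train, x_train) + noise * np.eye(len(x_train))  # jitter for stability
    Ks = k(x_new, x_train)
    mean = Ks @ np.linalg.solve(K, y_train)       # posterior mean (prediction)
    # Posterior variance: the model's own uncertainty estimate.
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, var
```

Note how the variance is near zero at the training points and grows far from them; this is the "extra uncertainty structure" listed as a pro above.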
Advanced regression model:
Neural network
• A neural network (NN) models a function as a network of
interconnected "neurons".
• There are three kinds of layers in an NN: the input layer,
hidden layers, and the output layer.
• Hidden layers are where the inputs are
processed through activation functions (i.e., basis
functions).
• If there are many hidden layers, the
model is called a "deep neural network"
("deep learning").

• Pros: Can be used for general applications. The
deep version is very powerful and is now considered
the "holy grail" of machine learning.
• Cons: The model is not interpretable. Sometimes it
is difficult to tune the NN parameters (number of
hidden layers, etc.).
116
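The layer structure described above can be sketched as a single forward pass in Python (the function name is mine, and tanh stands in for a generic activation function):

```python
import numpy as np

def nn_forward(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer network.

    hidden = tanh(W1 x + b1): the activation (basis) functions
    output = W2 hidden + b2:  a linear output layer
    """
    hidden = np.tanh(W1 @ x + b1)
    return W2 @ hidden + b2
```

Training consists of adjusting W1, b1, W2, b2 (typically by gradient descent on a loss function); stacking more hidden layers gives the "deep" version.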
Designing your experiment
● Let's say that we want to perform
experiments to find the
relationship between x and y.
● Let's also say that our experiment is
expensive (e.g., 4 hours for one
experiment).
● We can design the experiment so that we
maximize the information that we
obtain.
Our typical goals are:
1. To find the relationship between x and y.
2. To identify important and less important
independent variables.

[Workflow: Design sampling points → Evaluate sampling points →
Create regression models → Perform analysis]
117
Sampling points
● It is important to decide the sampling points, i.e., the points where you will
evaluate y (e.g., by CFD, FEM, or physical experiments).
● Basically, we want to maximize the information from the limited available
samples.
● To that end, the sampling points need to be carefully placed.

[Figure: three candidate sampling plans. Which one do you think is the best?]
118
Sampling points
● There are several available techniques for generating sampling points.
● Among them are full factorial design, Latin hypercube sampling, and
low-discrepancy sequences.
● Good sampling points should 'fill' the space.

[Figure: full factorial design vs. Latin hypercube sampling vs.
low-discrepancy sequence]
119
Sampling points: Example using MATLAB
[Figure: LHS vs. purely random sampling]
120
Sampling points: Example using MATLAB
[Figure: LHS vs. purely random sampling]
121
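The slides generate these samples in MATLAB (`lhsdesign`); an equivalent sketch in Python, implementing the stratification by hand (the function name is mine; `scipy.stats.qmc.LatinHypercube` offers a library version):

```python
import numpy as np

def latin_hypercube(n, dim, seed=0):
    """Latin hypercube sample of n points in [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    samples = np.empty((n, dim))
    for d in range(dim):
        # One point in each of the n equal-width strata, randomly placed
        # within its stratum; strata are shuffled per dimension.
        samples[:, d] = (rng.permutation(n) + rng.uniform(size=n)) / n
    return samples
```

Unlike purely random sampling, every one of the n strata along each axis contains exactly one point, which is why LHS 'fills' the space more evenly.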
An example of computer experiment
● One of my former students designed a computer experiment to study the
impact of adding a plate to reduce vortex-induced vibration.
● He used GPR to create a regression model.
● His independent variables were the length and angle of the plate attached
to the cylinder.
● His dependent variables (outputs of interest) were the mean drag coefficient
and the vibration.
Surrogate Assisted Genetic Algorithm for Computationally Expensive Fluid Flow
Optimization.
Master Thesis, Yohanes Bimo Dwianto, Bandung Institute of Technology, Dept. of Aerospace Engineering

122
An example of computer experiment

[Workflow: Design sampling points → Perform experiment → Create regression
model → Analyze]
123
ENGINEERING OPTIMIZATION: AN
INTRODUCTION

124
Engineering design optimization

"The action of making the best or most effective use of a
situation or resource" - Oxford Dictionaries

Why optimize? To achieve higher efficiency, reduced fuel
burn, better stability, lower production cost, etc.
What is the role of optimization in your experiment?
You want to know the best set of parameters that yields the
best value.

[Figure: baseline shape vs. optimized shape; objective: decrease the drag
coefficient]

Goals of optimization:
• To optimize engineering systems in order to improve performance.
• To discover important knowledge, insight, physical phenomena, and design
guidelines that will be useful for future designs.

Main components:
• Objective functions. Performance measures to be optimized, expressed
with the so-called objective functions (e.g., drag coefficient, adiabatic
efficiency).
• Constraint functions. A limitation or restriction on the optimized
performance (e.g., stress limit, moment coefficient). 125
Formulating your optimization problem

• Design variables: discrete, combinatorial, continuous, or
mixed-integer.
• Number of objectives: single or multi-objective.
126
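A generic statement of the problem these components define (a standard form, not taken verbatim from the slides) is:

```latex
\min_{\mathbf{x}} \; f(\mathbf{x})
\quad \text{subject to} \quad
g_j(\mathbf{x}) \le 0, \;\; j = 1, \dots, m,
\qquad \mathbf{x}_L \le \mathbf{x} \le \mathbf{x}_U
```

Here f is the objective function, the g_j are the constraint functions, and x_L, x_U are the lower and upper bounds on the design variables.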
Defining your design variables and objective function
• x is your vector of design variables.
• f(x) is your objective function, i.e., your optimization goal.
• You want to find the x that optimizes f(x).
• Define your design variables to create an airfoil (e.g., NACA, PARSEC); this is your x.
• For each x, you have a unique airfoil.
• Now suppose you want to maximize L/D; this is your f(x).
• How do you get f(x)? Use CFD!
• Your task is to find the x that generates an airfoil shape maximizing L/D.

[Figure: a population of airfoils; each point is a unique airfoil, and one of
them is your optimal airfoil]

127
Definition of optimality

[Figure: at a point with a non-zero gradient you are not at the optimum; at a
point with zero gradient you have reached your optimal point. The convergence
history will be your favorite plot: you observe improvements over time!]

• A zero gradient is the mathematical condition for an optimum solution.
• But what about cases where you don't know the gradient? Then we aim for an
optimized (improved) solution rather than a provably optimal one.
128
Techniques for optimization

1. Gradient-based techniques
Pros: Can guarantee (local) optimality.
Cons: Work only on smooth functions; gradient information is
not always available.
Examples: Stochastic gradient descent, Newton's method,
quasi-Newton methods.

2. Gradient-free techniques
Pros: Do not need gradients; can be used for almost any problem.
Cons: Typically slow or even very slow; no guarantee of
optimality.
Examples: Nelder-Mead, evolutionary (genetic) algorithms,
particle swarm optimization, the Divided Rectangles (DIRECT) method.

[Figure: f(x) with a local optimum and a global optimum, both with zero
gradient; Nelder-Mead simplex illustration from
https://codesachin.wordpress.com/2016/01/16/nelder-mead-optimization/]
129
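A minimal sketch of a gradient-based technique in Python: plain gradient descent, which iterates toward a point where the gradient vanishes (the function name, learning rate, and step count are illustrative):

```python
def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Plain gradient descent on a function with gradient `grad`."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)   # move against the gradient
    return x
```

For example, minimizing f(x) = (x - 3)² means calling it with grad(x) = 2(x - 3); the iterates converge to x = 3, where the gradient is zero.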
Single- vs. multi-objective optimization

Single-objective optimization: Only one objective
function to be optimized (e.g., strictly reduce drag).

Multi-objective optimization (MOO): Two or more
objective functions to be optimized (e.g.,
minimize drag and maximize strength).

Why multi-objective optimization?
Sometimes we have conflicting objectives (e.g., drag
coefficient vs. structural strength). In MOO, we are
interested in finding the Pareto front.

Single-objective optimization: ONE optimal solution.
Multi-objective optimization: MANY optimal solutions.

[Figures: flow control in a cylinder (low drag vs. low vibration); supersonic
wing drag vs. bending moment, with the Pareto front highlighted]
Obayashi, Shigeru, et al. "Multiobjective evolutionary computation for supersonic
wing-shape optimization." IEEE Transactions on Evolutionary Computation 4.2
(2000): 182-187.
130
Forward vs. inverse design
Forward/direct design: Works by directly optimizing the
objective function: min f(x).
Pros: No necessity to specify a target pressure.
Cons: No control over the individual behavior of solutions.

Inverse design: The target (e.g., a pressure distribution) is
specified. The optimization aims to minimize the deviation from
the target: min |c_p,target − c_p(x)|.
Pros: Good control over the individual behavior of solutions.
Cons: It is not trivial to specify the target pressure.

131
Why you need to know all of these?

Basic statistics
• Know how to use different measures of central tendency
(e.g., mean and median), variability (e.g., standard
deviation, IQR), measures of relationship, PDFs, etc.

Presenting data
• Know how to use different plots for presenting data
and information (e.g., histogram, boxplot, etc.).

Statistical hypothesis tests
• Know how to use statistical hypothesis tests to differentiate a
significant effect/difference from pure chance.
• Know the concept of statistical significance, formulating a
hypothesis, etc.

Regression models
• Know how to use regression models to perform an experiment
where we want to know the relationship between independent
and dependent variables.
• Understand the basic concepts of designing such an experiment.

Optimization methods
• When necessary, optimization can be used to find
the optimal parameters for your research.

Questions?
