
Statistics for research

(and also probability)

Pramudita Satria Palar, Ph.D.

1
The goal of statistics
- Area of applied mathematics concerned with data collection, analysis, interpretation, and presentation.
- Used in practically every scientific discipline.
- Discovers patterns and relationships in data and infers useful insights and knowledge from them.
- You always deal with data, right?

The course will cover:


1. Basic statistics.
2. Sampling methods.
3. Statistical hypothesis tests.
4. Basic regression techniques.
5. Data presentation.
2
Basics

3
Basic terminologies in statistics (and probabilities)
- Random variable: A quantity having a numerical value for each member of a group, especially one
whose values occur according to a frequency distribution.
- Probability: the chance that an event will occur.
- A probability space or a probability triple ( Ω , F , P ) is a mathematical construct that models a
real-world process (or “experiment”) consisting of states that occur randomly.

Ω = Sample space, a set of all possible outcomes.


F = Event space; each event is a set containing zero or more outcomes (i.e., a subset of the sample space).
P = Probability measure, which assigns a probability to each event.

4
Descriptive statistics
Measures of central tendency (three kinds of “averages”):
As the name suggests, it gives you an impression regarding the average
tendency of your data:
● Mean: Add up all the numbers, then divide by how many numbers there are.
● Median: The "middle" value in the sorted list of numbers.
● Mode: The value that occurs most often.

Measures of variability:
Gives you the information regarding how far your data deviates from the
center:
● Standard deviation / Variance.
● Maximum / minimum.
● Interquartile range (IQR)
5
Probability distribution (1)
● A probability distribution is a mathematical function that provides the probabilities
of occurrence of different possible outcomes in an experiment.
● In more technical terms, the probability distribution is a description of a random
phenomenon in terms of the probabilities of events.
● Knowing the shape of the distribution is crucial to the use of statistical methods in
research analysis.
● This is primarily because most methods make specific assumptions about the nature of
the distribution curve.

6
Probability distribution (2)
● The probability that a discrete random variable X takes on a particular value x, that is, P(X = x),
is frequently denoted f(x)
● For a continuous random variable, the probability distribution is commonly termed the
probability density function.
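To make this concrete, here is a minimal Python sketch (not part of the original slides) that evaluates a binomial PMF and a normal PDF directly from their textbook formulas:

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for a binomial random variable: C(n, k) * p^k * (1-p)^(n-k)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. dev. sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# A PMF assigns probabilities that sum to 1 over all possible outcomes:
total = sum(binom_pmf(k, 10, 0.3) for k in range(11))
print(round(total, 10))  # -> 1.0
```

A PDF, by contrast, must be integrated over an interval to give a probability; its value at a point is a density, not a probability.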

7
Probability distribution (3)

Probability mass function (e.g., the binomial distribution) vs. probability density function (e.g., the normal distribution).

8
Central tendency: A simple example
42 20 79 95 13 3 84 70 67 70
74 39 65 17 70

Sort the values first:

3 13 17 20 39 42 65 67 70 70 70 74 79 84 95
(Min = 3, Median = 67, Mode = 70, Max = 95)
Mean (arithmetic average) -> Expected value


● Mean = 53.87
● Median = 67
● Mode = 70
● Min = 3
● Max = 95
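These summary statistics can be checked with Python's standard `statistics` module. The sketch below uses the slide's data (with the lone 7 read as 70, consistent with the sorted list and the reported mode):

```python
from statistics import mean, median, mode

data = [42, 20, 79, 95, 13, 3, 84, 70, 67, 70, 74, 39, 65, 17, 70]

print(sorted(data))          # -> [3, 13, 17, 20, 39, 42, 65, 67, 70, 70, 70, 74, 79, 84, 95]
print(round(mean(data), 2))  # -> 53.87
print(median(data))          # -> 67   (8th of 15 sorted values)
print(mode(data))            # -> 70   (occurs three times)
print(min(data), max(data))  # -> 3 95
```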
9
Median in the case of even number
3 13 17 20 39 42 65 67 70 70 70 74 79 84 85 95

Sort the values first:

3 13 17 20 39 42 65 67 70 70 70 74 79 84 85 95
Median = (67 + 70) / 2 = 68.5 (average of the 8th and 9th values)
Mean (arithmetic average) -> Expected value
● Mean = 55.8125
● Median = 68.5
● Mode = 70
● Min = 3
● Max = 95
10
Case of mean and median
10 20 30 10 50 10 70 50 80 60 100 100 5000

Sort the values first:

10 10 10 20 30 50 50 60 70 80 100 100 5000

● Mean = 430
● Median = 50
Mean (arithmetic average) -> Expected value

Mean and median can give you different impressions of the data. Such differences might be due to:
● The presence of outliers.
● Incorrect sampling or other errors.
● A non-robust method (in the context of algorithm design).
In practice, it is better to observe both the mean and the median.
11
Measures of variability: range (1)

● Range is defined as the difference between the values of the extreme items
of a series.
● Pros: It gives an idea of the variability very quickly.
● Cons: The range is affected greatly by sampling fluctuations; its value is unstable.
● Typically only used as a rough measure of variability.

12
Measures of variability: range (2)
Data: 42 20 79 95 13 3 84 70 67 70 74 39 65 17 70

Sorted: 3 13 17 20 39 42 65 67 70 70 70 74 79 84 95

Range = (95-3) = 92
13
Measures of variability: range (3)
10 20 30 10 50 10 70 50 80 60 100 100 5000

Sort the values first.

Sorted: 10 10 10 20 30 50 50 60 70 80 100 100 5000

● Mean = 430
● Median = 50
● Range = (5000-10) = 4990

This is an example where the range might give you the wrong impression. However, the range can also be used to check the ‘sanity’ of your data.

14
Measures of variability: Standard deviation (1)

● Standard deviation is the most widely used measure of dispersion of a series.
● We can also use the variance (σ²), i.e., the mean squared deviation of the data relative to the mean.
● Standard deviation is more useful to describe the data according to our common sense, while the variance is more useful in mathematical derivations.
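A quick way to compute these is Python's standard `statistics` module. The sketch below uses illustrative data of my own (not from the slides) and the population forms (`pvariance`, `pstdev`); `variance` and `stdev` are the sample forms with the n − 1 divisor:

```python
from statistics import pstdev, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data; population mean = 5

print(pvariance(data))  # variance sigma^2: mean squared deviation from the mean (= 4)
print(pstdev(data))     # standard deviation sigma = sqrt(variance) -> 2.0
```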

15
Why use squared values in standard deviation?

16
Measures of variability: Standard deviation (2)

Two cases: Mean = 0, Std. dev. = 2; Mean = 0, Std. dev. = 1.
18


Measures of variability: Standard deviation (3)

Two cases: Mean = 5, Std. dev. = 2; Mean = 5, Std. dev. = 1.
19


Measures of asymmetry: Skewness (1)
● Skewness indicates a distribution that is asymmetric.
● If the distribution has a longer tail on the right side, we have positive skewness.
● If the distribution has a longer tail on the left side, we have negative skewness.

20
Measures of asymmetry: Skewness (2)

21
Measures of asymmetry: Skewness (3)

Two cases: Mean = 0, Std. Dev. = 1, Skew = 0.7; Mean = 0, Std. Dev. = 1, Skew = -0.7.

22
Standardized moments
The k-th standardized moment is the k-th central moment divided by σ^k (for example, skewness is the third standardized moment).

23
Intermezzo: quantile
● By a quantile, we mean the fraction (or percent) of points below the given value. That is,
the 0.3 (or 30%) quantile is the point below which 30% of the data fall and above which
70% fall.
● For example, if the 0.8 quantile of the examination scores for subject A is 40, this means
that 80% of the students obtained scores lower than 40.

24
Intermezzo: quartile
● Quartile is one form of quantile. Quartiles are the values that divide a list of numbers
into quarters:

25
Measures of variability: interquartile range (1)
● The Interquartile Range (IQR) may be used to characterize the data when
there may be extremities that skew the data.
● The interquartile range is a relatively robust (also called "resistant") statistic
compared to the range and standard deviation.

IQR = Q3-Q1

26
Measures of variability: interquartile range (2)

It is clear that Group I's values are typically higher, but Group I also has a wider spread, as indicated by the wider IQR shaded bar.

27
Outliers (1)
● A data point on a graph or in a set of
results that is very much bigger or
smaller than the next nearest data
point.
● Outliers can cause serious problems in
statistical analyses (depending on the
context).
● Outliers could mean two things: (1)
Measurement error (2) Heavy tailed
distribution.
● Now the question is: how to
detect outliers?

28
Outliers (2)
● There is no one universally accepted outlier detection method.
● There are some useful methods, one of them is Tukey’s fences.
● Tukey’s fences treat data as outliers when they fall outside the range
[Q1 − k·IQR, Q3 + k·IQR].

● If k = 1.5, data outside this range are treated as “outliers”.


● If k = 3, data outside this range is indicated as “far-out”.
● We’ll also illustrate this later by using a boxplot.

29
Outliers (3)
● An example of outlier detection using Tukey’s fences.
● Consider the following scores from mid-term exams:
21 22 33 10 8 9 10 4
30 20 30 10 20 17 8 9
10 22 11 2 30 20 22
33 34 37 21 20 21 23
11 21 13 8 6 9 100 90
30 70
● Q1 = 10
● Q3 = 30
● IQR = 20
● Lower bound for outlier detection with k = 1.5: 10 − 1.5 × 20 = −20.
● Upper bound for outlier detection with k = 1.5: 30 + 1.5 × 20 = 60.
● Thus, the scores of 100, 90, and 70 are outliers.
30
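The whole procedure can be sketched in plain Python. Quartiles are computed here as Tukey's hinges (medians of the lower and upper halves), one of several quartile conventions, chosen because it reproduces Q1 = 10 and Q3 = 30 for this data:

```python
# Tukey's fences on the mid-term scores from the example.
scores = [21, 22, 33, 10, 8, 9, 10, 4,
          30, 20, 30, 10, 20, 17, 8, 9,
          10, 22, 11, 2, 30, 20, 22,
          33, 34, 37, 21, 20, 21, 23,
          11, 21, 13, 8, 6, 9, 100, 90,
          30, 70]

def tukey_outliers(data, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR], quartiles via Tukey's hinges."""
    s = sorted(data)
    n = len(s)
    half = n // 2

    def med(xs):
        m = len(xs) // 2
        return xs[m] if len(xs) % 2 else (xs[m - 1] + xs[m]) / 2

    q1 = med(s[:half])            # median of the lower half
    q3 = med(s[half + n % 2:])    # median of the upper half
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return lo, hi, [x for x in data if x < lo or x > hi]

lo, hi, out = tukey_outliers(scores)
print(lo, hi, sorted(out))  # -> -20.0 60.0 [70, 90, 100]
```

With k = 3 instead of 1.5, the fences widen to [−50, 90] and only the "far-out" points remain flagged.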
Measures of
relationship

31
Quiz: How will you describe the following
relationship?

32
Correlation and dependence
● Statistical dependence or association describes a statistical relationship between two random variables.
● The relationship does not indicate causality, but describes correlation.
● It is a powerful tool when you want to analyze whether there is a clear relationship between two variables.

33
Correlation and dependence
● Statistical dependence or association describes a statistical relationship between two random variables.
● The relationship does not indicate causality, but describes correlation.
● It is a powerful tool when you want to analyze whether there is a clear relationship between two variables.
● The most widely used measure is the Pearson’s correlation coefficient.
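A minimal sketch of Pearson's correlation coefficient from its definition (the covariance divided by the product of the standard deviations); the data below are hypothetical:

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]                    # perfectly linear -> r = 1
print(round(pearson_r(x, y), 6))        # -> 1.0
print(round(pearson_r(x, y[::-1]), 6))  # reversed, perfectly anti-linear -> -1.0
```

Remember that r only measures linear association: a strong nonlinear dependence can still give r close to 0.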

34
● How about for cases with more than two variables?

35
Correlation matrix

● For cases with more than two variables, we can plot the correlation matrix.
● The diagonal shows the histogram of each variable, while the off-diagonal entries show each pairwise relationship and/or the correlation coefficient.

36
Presenting your results

37
Presenting your results: Histogram
● A histogram is a plot that lets you discover, and show, the underlying frequency distribution
(shape) of a set of data. This allows the inspection of the data for its underlying distribution (e.g.,
normal distribution), outliers, skewness, etc.
● Histogram is very useful if you want to visualize the general distribution of your data
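Underneath a histogram plot is just bin counting; a minimal sketch with equal-width bins (the data are hypothetical):

```python
def histogram_counts(data, bins, lo=None, hi=None):
    """Count how many data points fall into each of `bins` equal-width bins."""
    lo = min(data) if lo is None else lo
    hi = max(data) if hi is None else hi
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        i = min(int((x - lo) / width), bins - 1)  # clamp the max value into the last bin
        counts[i] += 1
    return counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
print(histogram_counts(data, 4))  # 4 bins of width 2 over [1, 9] -> [3, 5, 1, 1]
```

A plotting library then simply draws one bar per count; the statistical content is entirely in the binning.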

38
Histogram: Choosing the number of bins

Too many bins:
- The general pattern is difficult to see.
- Shows too much individual data.

Too few bins:
- Might give us an incomplete impression of the data.
39
Density plots as alternative to histogram
● A Density Plot visualises the distribution of data over a continuous interval or time period.
● Basically, it is a smoothed histogram via kernel smoothing.
● It gives us a clearer picture regarding the shape of the distribution.

40
Histogram vs PDF plot
● Histogram is the empirical distribution of your sample.
● PDF is used if you want to describe the hypothesized underlying distribution (normal? exponential?)
● Histogram is usually shown with frequencies as the y-axis, while for PDF density is shown (note that
the right figure is only for illustrative purpose).

41
Presenting your results: Barplot
● The major difference is that a histogram is only used to plot the frequency of score
occurrences in a continuous data set that has been divided into classes, called bins. Bar
charts, on the other hand, can be used for a great deal of other types of variables.
● A barplot is very useful if you want to visualize data such as sales or sensitivity indices, to name a
few.

42
Presenting your results: Barplot
● One common mistake: using a continuous plot when the data should actually be visualized
with a barplot (i.e., when the x-axis is a categorical variable).

Just right Umm, no..


43
Presenting your results: Scatterplot
● Scatter plot shows relationship between two sets of data by using points.
● Scatter plot is also typically used in multi-objective decision making/optimization to
visualize the Pareto front.

44
Presenting your results: Scatterplot (2)
● Scatter plot can also be used for outlier detection.

45
Presenting your results: Boxplot
● Displays the distribution of data based on a five-number summary (remember
quartiles): minimum, first quartile, median, third quartile, and maximum.
● A boxplot gives you a good indication of how your data are clustered and spread out.
● Another name for the boxplot is the “box and whiskers” plot.
● The whiskers can indicate: ±1.5 IQR, the minimum and maximum of the data, one standard
deviation above and below the mean, or the 9th and 91st percentiles.

46
Presenting your results: Pie chart
● A pie chart is useful when you have a data set such as preferences, sensitivity indices,
grades, to name a few.

47
Presenting your results: Heat map
● Heat map is useful to better visualize data with arbitrary borders (e.g., geospatial,
CFD/FEM results)

48
Presenting your results: Contour map.
● A contour map performs visualization by creating multiple lines that share the same
values.
● A contour map can be shown with or without colours.
● On some occasions, a contour map can be combined with a heat map.

49
Tips: Use publication-quality figures, please..

● Many journals will comment on the quality of your figures.
● Some reviewers will even ‘look down’ on your papers when they see poor-quality figures.
● Don’t forget to give details! Labels, legend, etc.
● Learn how to do this properly, properly, properly.

(Examples shown: not-acceptable vs. acceptable figures.)

50
Statistical hypothesis
test

51
Statistical hypothesis
● Let’s talk about hypothesis first..
● “Hypothesis is a formal question that the
researcher intends to resolve”.
● Quite often a research hypothesis is a predictive
statement, capable of being tested by scientific
methods, that relates an independent variable to
some dependent variable.

So in this case, the following are good hypotheses.


● “Car A consumes less fuel than car B”
● “Students who receive tutorial score higher than
other students who do not take tutorial”

52
Statistical hypothesis
● A hypothesis test is a rule that specifies
whether to accept or reject a claim about
a population, depending on the evidence
provided by a sample of data.
● A hypothesis test examines two opposing
hypotheses about a population: the null
hypothesis and the alternative
hypothesis.
● The null hypothesis is the statement being
tested. Usually the null hypothesis is a
statement of "no effect" or "no difference".
● The alternative hypothesis is the statement
you want to be able to conclude is true
based on evidence provided by the sample
data.
53
The use of statistical hypothesis

● Does my algorithm A perform significantly better than algorithm B? (according to statistical tests)
● Do the students from major X receive significantly higher scores than the students from major Y in calculus class?
● Do male and female undergraduates differ in height on average?

Statistical hypothesis tests try to answer these questions. Is the difference significant?
54
The use of statistical hypothesis
● Does my algorithm A perform significantly better than algorithm B? (according to statistical tests)
● Do the students from major X receive significantly higher scores than the students from major Y in calculus class?
● Do male and female undergraduates differ in height on average?

Let’s say that you answer “Yes” to one of the questions above; how can you be so sure?

55
The use of statistical hypothesis
Case : We develop method A so that we can obtain superior results
as compared to method B (let’s say, optimization method)
● Null hypothesis (H0): “We are to compare method A with method B regarding its
superiority, and we proceed on the assumption that both methods are equally good.”
● Alternative hypothesis (H1): “Method A is superior, or method B is inferior.”
● The alternative hypothesis is usually the one which one wishes to prove.
● The null hypothesis is the one which one wishes to disprove (the one we are
trying to reject).
● How to assess this? Compare the p-value against the level of significance.

56
The use of statistical hypothesis
Case : We develop method A so that we can obtain superior results
as compared to method B (let’s say, optimization method)

57
The use of statistical hypothesis
● Based on the sample data, the test determines whether to reject the null
hypothesis.
● You use a p-value, to make the determination. If the p-value is less than
the significance level (denoted as α or alpha), then you can reject the null
hypothesis.
● The most common significance level is 0.05: we reject H0 when the p-value, i.e., the
probability of obtaining data at least as extreme as observed if H0 were true, is below 0.05.
● In other words, the 5 percent level of significance means that the researcher is
willing to take as much as a 5 percent risk of wrongly rejecting the null hypothesis.
● There is now a movement that advocates lowering the threshold to 0.005.

58
Type I versus type II error
● In the context of testing of hypotheses, there are two types of errors we can make:
● We can reject H0 when H0 is true (Type I error), denoted by α.
● We may accept H0 when in fact H0 is not true (Type II error), denoted by β.
● The α error is understood as the level of significance of hypothesis testing.
● If the α error is fixed at 5 percent, there is about a 5/100 chance that we will reject H0 when H0 is true.
59
Two-tailed and
one-tailed tests

● A one-tailed test is used when we are to test whether the population mean is either lower than or higher than some hypothesized value.

60
Two-tailed and
one-tailed tests
● A two-tailed test rejects the null hypothesis if, say, the sample mean is significantly higher or lower than the hypothesized mean of the population.

For significance level 5 percent (p-value = 0.05)

61
Parametric vs non-parametric test
● In a parametric test, we assume that
the data follow a specific
distribution (typically normal).
● In a non-parametric test, we do not assume
that the data follow a specific
distribution (a distribution-free test).
● However, parametric tests can also
be used for non-normal data under
some requirements (e.g., sufficient sample size).

62
Non-parametric tests
● Mann–Whitney U or Wilcoxon rank-sum test: tests whether two samples are drawn from the same
distribution, as compared to a given alternative hypothesis.
● McNemar's test: tests whether, in 2 × 2 contingency tables with a dichotomous trait and
matched pairs of subjects, row and column marginal frequencies are equal
● Median test: tests whether two samples are drawn from distributions with equal medians
● Pitman's permutation test: a statistical significance test that yields exact p values by
examining all possible rearrangements of labels
● Rank products: detects differentially expressed genes in replicated microarray experiments
● Siegel–Tukey test: tests for differences in scale between two groups
● Sign test: tests whether matched pair samples are drawn from distributions with equal
medians.
● And many more...

63
Parametric tests
● Z-test.
Based on the normal distribution and is used for judging several
statistical measures (particularly the mean).
● t-test.
Based on t-distribution and is used for judging the significance of a
sample mean or the difference between two means of two samples in case
of small samples.
● Chi-square test.
Based on chi-square distribution and is used for comparing a sample
variance to a theoretical population variance.
● F-test.
Based on F-distribution and is used to compare the variance of two
independent samples

64
Case 1: Hypothesis testing of means

● Hypothesis testing of means is used in situations such as comparing the
mean of a smaller population with the general population (e.g., does the
behavior of group A represent the general population?).
● The null hypothesis is stated as H0 : μ = μH0, where μH0 is a specified value.
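As an illustrative sketch of this case (the sample figures below are hypothetical, not taken from the slides), the z statistic compares the sample mean against the hypothesized mean via the standard error:

```python
import math

def z_statistic(sample_mean, mu0, sigma, n):
    """z = (xbar - mu0) / (sigma / sqrt(n)): distance of the sample mean from mu0."""
    se = sigma / math.sqrt(n)  # standard error of the mean
    return (sample_mean - mu0) / se

# Hypothetical numbers: 400 students, hypothesized mean height 67.39 in,
# sample mean 67.47 in, population std. dev. 1.30 in.
z = z_statistic(67.47, 67.39, 1.30, 400)
print(round(z, 3))  # -> 1.231
z_crit = 1.96       # two-tailed critical value at the 5% significance level
print("reject H0" if abs(z) > z_crit else "accept H0")  # -> accept H0
```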

70
Statistical hypothesis testing, illustration 1

● In this problem, we would like to know whether these 400 male students represent the general
population.
● The first step is to formulate the hypothesis. The null hypothesis is that the mean height of the
population is equal to 67.39 inches:
Given information: we assume a normal distribution and use the z-test, with the standard error of the mean. Result: Accept H0.
71
Statistical hypothesis testing, illustration 2

Accept

72
Statistical hypothesis testing, illustration 3

Reject

73
Case 2: Hypothesis testing for differences between
means
● Used when we are interested in knowing whether the parameters of two
populations are alike or different.
● Example: In company A, do female workers earn less than male workers
for the same job?
● The null hypothesis is stated as H0 : μ1 = μ2.

74
Statistical hypothesis testing, illustration 4

Hypothesis Given information

Reject

75
Statistical hypothesis testing, illustration 5

Notice the large sample size here. Result: Reject H0.
76
Statistical hypothesis testing, illustration 6

Notice the small sample size here. Result: Reject H0.
77
Case 3: Hypothesis testing of proportions
● In case of qualitative phenomena, we have data on the basis of presence
or absence of an attribute
● With such data, the sampling distribution may take the form of a binomial
probability distribution, whose mean equals n ⋅ p and whose standard deviation
equals √(n ⋅ p ⋅ q), where p represents the probability of success, q = 1 − p
represents the probability of failure, and n is the size of the sample.
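A sketch of the corresponding z statistic in its proportion form, comparing p̂ against a hypothesized p0 with standard error √(p0·q0/n); the counts below are hypothetical:

```python
import math

def proportion_z(successes, n, p0):
    """One-sample z-test for a proportion: z = (p_hat - p0) / sqrt(p0*(1-p0)/n)."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)  # std. dev. of the sampling proportion
    return (p_hat - p0) / se

# Hypothetical: 64 "successes" in 400 trials, hypothesized proportion p0 = 0.20.
z = proportion_z(64, 400, 0.20)
print(round(z, 2))  # -> -2.0
```

|z| = 2.0 exceeds the two-tailed 5% critical value of 1.96, so for these made-up numbers H0 would be rejected.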

78
Statistical hypothesis testing, illustration 7

Reject
79
Statistical hypothesis testing, illustration 8

Accept 80
Other cases

● Hypothesis testing for comparing two related samples.


● Hypothesis testing for difference between proportions.
● Hypothesis testing for comparing a variance to some hypothesized
population variance.
● Hypothesis testing of correlation coefficients.

81
Non-parametric hypothesis test, an example
1. A popular nonparametric test to compare outcomes between two independent groups is the
Mann Whitney U test.
2. Mann Whitney U test tests whether two samples are likely to derive from the same
population (i.e., that the two populations have the same shape).
3. Recall that the parametric test compares the means (H0: μ1=μ2) between independent groups.
4. It uses the U value, i.e., a measure of the difference between the ranked observations of the two
samples.

82
83
Accept

84
Non-parametric hypothesis test, an example
● Restaurant A received reviews from google maps the score of [5 4 4 3 3 2 5 5 4 4 3 3]
● Restaurant B received reviews from google maps the score of [4 5 3 3 2 5 4 2 2 3 1 1]
● Is the difference statistically significant at the 10% level of significance?

85
Non-parametric hypothesis test, an example
● Restaurant A received reviews from google maps the score of [5 4 4 3 3 2 5 5 4 4 3 3]
● Restaurant B received reviews from google maps the score of [4 5 3 3 2 5 4 2 2 3 1 1]
● Is the difference statistically significant at the 10% level of significance?

Sum of ranks for restaurant A =
4.5 + 10 + 10 + 10 + 10 + 16.5 + 16.5 + 16.5 + 16.5 + 22 + 22 + 22 = 176.5
Sum of ranks for restaurant B =
1.5 + 1.5 + 4.5 + 4.5 + 4.5 + 10 + 10 + 10 + 16.5 + 16.5 + 22 + 22 = 123.5

U_A = 176.5 − 12(12+1)/2 = 98.5, U_B = 123.5 − 12(12+1)/2 = 45.5
Mean of U = n1·n2/2 = 72
Std. dev. of U = √(n1·n2·(n1+n2+1)/12) = √300 ≈ 17.3205

Compute upper and lower limits (z = 1.645 for the two-tailed 10% level):
upper limit = 72 + 1.645 × 17.3205 ≈ 100.49, lower limit = 72 − 1.645 × 17.3205 ≈ 43.51.
The U values lie between the upper and lower limits: Accept H0.
86
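The rank-sum computation for this example can be sketched in plain Python; tied values receive the average of the ranks they occupy (mid-ranks). Note that the listed ranks for restaurant A sum to 176.5, giving U_A = 98.5:

```python
def rank_sum_u(a, b):
    """Sum of mid-ranks for sample `a` and the Mann-Whitney U statistic for it."""
    combined = sorted(a + b)
    # average 1-based rank for each distinct value (ties share their mean rank)
    ranks = {}
    for v in set(combined):
        first = combined.index(v) + 1
        count = combined.count(v)
        ranks[v] = first + (count - 1) / 2
    r_a = sum(ranks[v] for v in a)
    u_a = r_a - len(a) * (len(a) + 1) / 2
    return r_a, u_a

a = [5, 4, 4, 3, 3, 2, 5, 5, 4, 4, 3, 3]  # restaurant A reviews
b = [4, 5, 3, 3, 2, 5, 4, 2, 2, 3, 1, 1]  # restaurant B reviews
r_a, u_a = rank_sum_u(a, b)
print(r_a, u_a)  # -> 176.5 98.5
```

Since 98.5 lies inside the acceptance limits (about 43.5 to 100.5 at the 10% two-tailed level), the conclusion is the same: fail to reject H0.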
Hypothesis testing: The debate
● “P-value < 0.05 is not strict, we should set p-value to < 0.005!”
Why?
1. By setting a lower threshold, science will benefit from a lower
probability of accepting false observations.
2. Increasing the reproducibility of research: studies that yielded highly
significant results (p < 0.01) are more likely to reproduce than those
that are just barely significant at the 0.05 level.

● “Statistical significance erodes the culture of science, they should be banned”.


Why?
1. Many researchers focus too much on answering whether there is an effect
or not, while giving less importance to why a particular phenomenon happens.
2. However, both things should go hand in hand; it is the culture that
should change.
87
Hypothesis testing: The debate
1. Although it is a common practice to use p<0.05, there are some disciplines
that already use very low p-values.
2. Genetics uses p < 0.00000005; astrophysics uses p < 0.0000003 (5-sigma,
which is how they could be so sure that the Higgs boson exists).

88
Notes on hypothesis testing
1. Testing is not decision-making; the tests are only useful aids for decision
making.
2. Tests do not explain the reasons why the difference exists.
They tell us whether the difference is due to chance or because of specific
reasons.
3. Tests do not tell us “true” or “false”. They give us the strength of the
evidence for accepting or rejecting a hypothesis (e.g., very strong evidence, weak evidence).

The take-home point is that statistical testing should be combined with adequate knowledge of the subject.

89
Regression techniques

90
Regression analysis
● Regression analysis is used to model the relationship between a response variable and one or more predictor variables.
● The goal is to understand the general trend that relates the independent and dependent variables.
● We can also use regression analysis to identify the impact of each variable.

91
Terminologies

y = f(x) ≈ f̂(x), where x = (x1, x2, …, xm)

● y is the dependent/target/response variable; the xi are the independent/explanatory/regressor variables; f̂ is the regression function.
● f(x) is the true mathematical relationship that, in practice, we do not know what it’s really like.
● f̂(x) is the approximation of the true mathematical relationship by a regression function/surrogate model.
92
Why do regression?
● Regression analysis eases the process of identifying relationships through a
quantitative measure.
● Regression analysis can also give us confidence in how accurately the
regression function approximates the relationship.
● Regression allows you to make predictions (the goal of machine learning),
but in statistics the goal is to uncover the relationship between
variables.

Thus, the following two aspects must be considered:


• Predictive power: accuracy of the regression model in capturing the
true function.
• Interpretability: the mathematical form of the regression function
gives us insight into the trend/behavior of the true function.

93
Correlation vs regression

● Correlation is described as the


analysis which lets us know the
association or the absence of the
relationship between two variables ‘x’
and ‘y’.
● Regression analysis predicts the value of the dependent variable based on the known value of the independent variable.
● More precisely, regression describes
how an independent variable is
numerically related to the dependent
variable.

94
Deterministic vs statistical relationship
Deterministic Statistical
relationship relationship

● Deterministic relationship is when the relationship is EXACTLY known.


● We use statistical relationship to model/approximate the deterministic
relationship that is not known. 95
Physical versus computer experiment

● A physical experiment observes the input/output at selected samples using an experimental apparatus.
● A computer experiment observes the input/output at selected samples using computer simulation.
● The outputs from physical experiments are typically corrupted by noise, while computer-experiment outputs are typically smooth.
● When using regression, the data could also be generated from a social experiment/census, to name a few.

96
Linear regression Linear model

Intercept Slope Error/noise

● Linear regression is used for finding a linear relationship between the target and one or more predictors.
● Linear regression is the simplest regression and is particularly useful when the relationship is linear.
● You can even do it with your calculator.

97
Linear regression Linear model

Intercept Slope Error/noise

● QUIZ: How do you really compute the


coefficients?

98
Linear regression Linear model

Intercept Slope Error/noise

● QUIZ: How do you really compute the coefficients? Use the least-squares approach, which minimizes the sum of the squared errors.

99
Linear regression

What is the minimum of a function? i.e., where the gradients are zero

Solutions

100
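The resulting closed-form solutions can be sketched directly: slope β̂1 = Σ(x−x̄)(y−ȳ)/Σ(x−x̄)² and intercept β̂0 = ȳ − β̂1·x̄. The data below are a made-up exact line:

```python
def fit_line(x, y):
    """Least-squares estimates for y = b0 + b1*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx  # the fitted line always passes through (xbar, ybar)
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]  # exactly y = 1 + 2x
b0, b1 = fit_line(x, y)
print(round(b0, 6), round(b1, 6))  # -> 1.0 2.0
```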
Linear regression: An example

101
Linear regression: How good is your model?
● R-squared is a statistical measure of how close
the data are to the fitted regression line. It is
also known as the coefficient of determination.
● The closer to 1, the better. A good value of R-
squared is context dependent, but typically R-
squared > 0.9 indicates a good fit.

Mean of data
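R² can be computed as 1 − SS_res/SS_tot, where SS_tot measures the spread of the data around its mean; a minimal sketch with hypothetical fitted values:

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SS_res/SS_tot: fraction of variance explained by the fit."""
    my = sum(y) / len(y)
    ss_tot = sum((v - my) ** 2 for v in y)          # spread around the mean
    ss_res = sum((v - p) ** 2 for v, p in zip(y, y_hat))  # residual error
    return 1 - ss_res / ss_tot

y     = [3, 5, 7, 9]
y_hat = [2.8, 5.2, 7.0, 9.0]  # hypothetical fitted values
print(round(r_squared(y, y_hat), 4))  # close to 1 -> a good fit (about 0.996 here)
```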

102
Linear regression: case of multiple variables
● There are some cases where we are
interested on the relationship between
dependent variable and multiple
independent variables.
● In that case, we can use linear regression
for multiple variables.
● The coefficients tell us the importance of
the variables.
● Now, how to obtain the coefficients?

𝑦 = 𝛽0+𝛽1𝑥1+𝛽2𝑥2+𝛽3𝑥3+…+𝛽𝑚𝑥𝑚
103
Ordinary least squares

Solutions

Rearrange

104
Polynomial regression

● Polynomial regression models the relationship between the independent and dependent variables by using an n-th order polynomial.
● Polynomial regression is useful when the relationship does not exhibit a clear linear relationship.
● Notice that linear regression is one particular case of polynomial regression, i.e., when n = 1.

Linear model

Quadratic model
105
Polynomial regression

● The higher the polynomial order, the more complex the behavior of the polynomial function.
● The best order totally depends on the true behavior of the function and the noise in the function.

106
Polynomial regression

● When using polynomial regression, it is important to avoid two particular cases:
● Underfitting: the polynomial regression misses the true behavior of the unknown function.
● Overfitting: the polynomial regression captures a “trend” that should not be there (e.g., noise).
107
Validation – Testing the accuracy of your model

● The goal of validation is to check the accuracy of our model.


● Validation is performed by testing the accuracy of the model on the
validation set.
● The ideal condition is to have a separate training set (the one that is
used to build the model) and validation set.

y = f(x) ≈ f̂(x)  (use this regression model)

e = y − ŷ  (evaluate this error on the validation set)
108
Error metrics
● There are several available error metrics, such as mean absolute error
(MAE), root mean squared error (RMSE), mean absolute
relative error (MARE), and root mean squared relative error (RMSRE).
● RMSE gives more penalty to large error. This means the RMSE is
most useful when large errors are particularly undesirable.

● Which one is better? That depends on your objective.


● RMSE and MAE might not be easy to interpret when multiple cases
are considered. In some cases, people use the relative value (RMSRE,
MARE, normalized MAE, normalized RMSE)
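Both metrics are one-liners; the sketch below (hypothetical data with one large error) shows how RMSE penalizes that error more than MAE:

```python
import math

def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Root mean squared error: penalizes large errors more than MAE."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

y     = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.0, 2.0, 3.0, 8.0]  # one large error of 4
print(mae(y, y_hat))   # -> 1.0
print(rmse(y, y_hat))  # -> 2.0  (the single large error dominates)
```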

109
Is your model accurate or not? Do cross-validation!
● Cross-validation (CV) is a way to assess the accuracy of surrogate models without an
external validation set.
● Why? Because in many cases, we don’t have an external validation set.
● CV separates the training set (the sampling points used to create the model, e.g., Kriging) into a smaller training set
and a validation set.

Two popular CV approaches:
● Leave-p-out CV.
● K-fold CV.

Error metrics:
● RMSE
● MAE
● R2 (coefficient of determination)
● Mean absolute relative error.
● Root-mean-squared relative error.
● etc.
110
Example: 5-fold cross-validation

Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

Steps:
1. Use fold 2-5 to create a surrogate model, use fold 1 as
validation samples. Compute error values.
2. Use fold 1,3,4,5 to create a surrogate model, use fold
2 as validation samples. Compute error values.
3. Continue until all folds have been used.
4. Average the error values.
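The fold bookkeeping from the steps above can be sketched as:

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k consecutive folds, as in the 5-fold example."""
    base, extra = divmod(n, k)  # spread any remainder over the first folds
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = k_fold_indices(10, 5)
print(folds)  # -> [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
# Fold i serves as the validation set; the remaining folds form the training set:
train_for_fold_0 = [j for f in folds[1:] for j in f]
```

In practice the indices are usually shuffled before splitting so each fold is representative of the whole sample.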
111
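The steps above can be sketched in Python (a minimal illustration; the function names and the NumPy-based splitting are my own, not from the slides):

```python
import numpy as np

def k_fold_cv(x, y, fit, predict, k=5, seed=0):
    """Estimate model error by k-fold cross-validation.

    fit(x_train, y_train) -> model; predict(model, x_val) -> predictions.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))            # shuffle, then split into k folds
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]                       # fold i is the validation set
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train], y[train])
        pred = predict(model, x[val])
        errors.append(np.sqrt(np.mean((y[val] - pred) ** 2)))  # RMSE per fold
    return np.mean(errors)                   # average over the k folds
```

For example, with `fit = lambda xt, yt: np.polyfit(xt, yt, 1)` and `predict = lambda m, xv: np.polyval(m, xv)`, this evaluates a linear regression by 5-fold CV.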
How to avoid underfitting/overfitting? Do cross-validation.
112
Multivariate polynomial regression
● The extension of multivariable linear
regression.
● For each variable, we may have a quadratic,
cubic, or higher-order trend.
● Can capture non-linearity in a high-dimensional
space (a high number of
independent variables).
● Might also involve interactions between
independent variables.

For example, the quadratic multivariate polynomial regression model
for two variables is

y = β₀ + β₁x₁ + β₂x₂ + β₃x₁² + β₄x₂² + β₅x₁x₂
113
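The two-variable quadratic model above can be fit by ordinary least squares. A Python sketch (the function names are mine):

```python
import numpy as np

def quad_features(x1, x2):
    # Basis for the two-variable quadratic model:
    # [1, x1, x2, x1^2, x2^2, x1*x2]
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])

def fit_quadratic(x1, x2, y):
    # Ordinary least squares for the coefficients beta_0 .. beta_5.
    beta, *_ = np.linalg.lstsq(quad_features(x1, x2), y, rcond=None)
    return beta
```

The same pattern extends to more variables or higher orders by adding columns to the feature matrix.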
Advanced regression model: RBF

• Radial basis function (RBF) models create an
interpolating/regression model by summing
several RBF terms.
• An RBF is a function whose value depends only on
a distance (e.g., from a center point).
114
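A minimal 1D sketch of an RBF interpolator in Python (the Gaussian kernel choice, the `eps` shape parameter, and the function name are illustrative assumptions, not from the slides):

```python
import numpy as np

def rbf_interpolate(x_train, y_train, x_new, eps=3.0):
    """Gaussian RBF interpolation in 1D.

    phi(r) = exp(-(eps * r)^2) depends only on the distance r.
    """
    phi = lambda r: np.exp(-(eps * r) ** 2)
    # Solve Phi w = y for the weights (Phi_ij = phi(|x_i - x_j|)).
    Phi = phi(np.abs(x_train[:, None] - x_train[None, :]))
    w = np.linalg.solve(Phi, y_train)
    # Prediction: weighted sum of basis functions centered at the data.
    return phi(np.abs(x_new[:, None] - x_train[None, :])) @ w
```

Because the model interpolates, it reproduces the training data exactly; a regression variant would add a regularization term to the diagonal of Phi.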
Advanced regression model: GPR
*Gaussian process regression
• GPR constructs the surrogate model as a realization
of a stochastic process.
• The process is defined by a covariance function
(i.e., a measure of similarity between two points):

y(x) = μ(x) + Z(x)

[Figure: Gaussian-process realizations with the prediction (the mean of the
realizations, i.e., the GP model) and its uncertainty band]

• Pros: Provides an extra uncertainty structure, highly
generalizable, high prediction capability.
• Cons: The training time can be significantly long;
not suitable for very high-dimensional regression
(>100 variables).
Bay, Xavier, Laurence Grammont, and Hassan Maatouk. "Generalization of the
Kimeldorf-Wahba correspondence for constrained interpolation." Electronic Journal
of Statistics 10.1 (2016): 1580-1595. 115
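A minimal 1D sketch of GPR prediction in Python, using a zero-mean process with a squared-exponential covariance (the kernel choice, `length`, `noise`, and the function name are illustrative assumptions; real GPR libraries also estimate the hyperparameters):

```python
import numpy as np

def gp_predict(x_train, y_train, x_new, length=0.3, noise=1e-8):
    """Posterior mean and variance of a zero-mean Gaussian process."""
    # Covariance: similarity decays with squared distance.
    k = lambda a, b: np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)
    K = k(x_train, x_train) + noise * np.eye(len(x_train))  # jitter for stability
    Ks = k(x_new, x_train)
    mean = Ks @ np.linalg.solve(K, y_train)       # posterior mean (prediction)
    # Posterior variance: the model's own uncertainty estimate.
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, var
```

Note how the variance is near zero at the training points and grows far from them; this is the "extra uncertainty structure" listed as a pro above.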
Advanced regression model:
Neural network
• A neural network (NN) models a function as a network of
interconnected "neurons".
• There are three kinds of layers in an NN: the input layer,
hidden layers, and the output layer.
• Hidden layers are where the inputs are
processed through activation functions (i.e., basis
functions).
• If there are many hidden layers, the
model is called a "deep neural network"
("deep learning").

• Pros: Can be used for general applications. The
deep version is very powerful and is now considered
the "holy grail" of machine learning.
• Cons: The model is not interpretable. Sometimes it
is difficult to tune the NN parameters (number of
hidden layers, etc.).
116
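The layer structure described above can be sketched as a single forward pass in Python (the function name is mine, and tanh stands in for a generic activation function):

```python
import numpy as np

def nn_forward(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer network.

    hidden = tanh(W1 x + b1): the activation (basis) functions
    output = W2 hidden + b2:  a linear output layer
    """
    hidden = np.tanh(W1 @ x + b1)
    return W2 @ hidden + b2
```

Training consists of adjusting W1, b1, W2, b2 (typically by gradient descent on a loss function); stacking more hidden layers gives the "deep" version.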
Designing your experiment
● Let's say that we want to perform
experiments to find the
relationship between x and y.
● Let's also say that our experiment is
expensive (e.g., 4 hours for one
experiment).
● We can design the experiment so that we
maximize the information that we
obtain.
Our typical goals are:
1. To find the relationship between x and y.
2. To identify important and less important
independent variables.

[Workflow: Design sampling points → Evaluate sampling points →
Create regression models → Perform analysis]
117
Sampling points
● It is important to decide the sampling points, i.e., the points where you will
evaluate y (e.g., by CFD, FEM, or physical experiments).
● Basically, we want to maximize the information from the limited available
samples.
● To that end, the sampling points need to be carefully placed.

[Figure: three candidate sampling plans. Which one do you think is the best?]
118
Sampling points
● There are several available techniques for generating sampling points.
● Among them are full factorial design, Latin hypercube sampling, and
low-discrepancy sequences.
● Good sampling points should 'fill' the space.

[Figure: full factorial design vs. Latin hypercube sampling vs.
low-discrepancy sequence]
119
Sampling points: Example using MATLAB
[Figure: LHS vs. purely random sampling]
120
Sampling points: Example using MATLAB
[Figure: LHS vs. purely random sampling]
121
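The slides generate these samples in MATLAB (`lhsdesign`); an equivalent sketch in Python, implementing the stratification by hand (the function name is mine; `scipy.stats.qmc.LatinHypercube` offers a library version):

```python
import numpy as np

def latin_hypercube(n, dim, seed=0):
    """Latin hypercube sample of n points in [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    samples = np.empty((n, dim))
    for d in range(dim):
        # One point in each of the n equal-width strata, randomly placed
        # within its stratum; strata are shuffled per dimension.
        samples[:, d] = (rng.permutation(n) + rng.uniform(size=n)) / n
    return samples
```

Unlike purely random sampling, every one of the n strata along each axis contains exactly one point, which is why LHS 'fills' the space more evenly.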
An example of computer experiment
● One of my former students designed a computer experiment to study the
impact of adding a plate to reduce vortex-induced vibration.
● He used GPR to create a regression model.
● His independent variables were the length and angle of the plate attached
to the cylinder.
● His dependent variables (outputs of interest) were the mean drag coefficient
and the vibration.
Surrogate Assisted Genetic Algorithm for Computationally Expensive Fluid Flow
Optimization.
Master Thesis, Yohanes Bimo Dwianto, Bandung Institute of Technology, Dept. of Aerospace Engineering

122
An example of computer experiment

[Workflow: Design sampling points → Perform experiment → Create regression
model → Analyze]
123
ENGINEERING OPTIMIZATION: AN
INTRODUCTION

124
Engineering design optimization

"The action of making the best or most effective use of a
situation or resource" - Oxford Dictionaries

Why optimize? To achieve higher efficiency, reduced fuel
burn, better stability, lower production cost, etc.
What is the role of optimization in your experiment?
You want to know the best set of parameters that yields the
best value.

[Figure: baseline shape vs. optimized shape; objective: decrease the drag
coefficient]

Goals of optimization:
• To optimize engineering systems in order to improve performance.
• To discover important knowledge, insight, physical phenomena, and design
guidelines that will be useful for future designs.

Main components:
• Objective functions. Performance measures to be optimized, expressed
with the so-called objective functions (e.g., drag coefficient, adiabatic
efficiency).
• Constraint functions. A limitation or restriction on the optimized
performance (e.g., stress limit, moment coefficient). 125
Formulating your optimization problem

• Design variables: discrete, combinatorial, continuous, or
mixed-integer.
• Number of objectives: single or multi-objective.
126
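A generic statement of the problem these components define (a standard form, not taken verbatim from the slides) is:

```latex
\min_{\mathbf{x}} \; f(\mathbf{x})
\quad \text{subject to} \quad
g_j(\mathbf{x}) \le 0, \;\; j = 1, \dots, m,
\qquad \mathbf{x}_L \le \mathbf{x} \le \mathbf{x}_U
```

Here f is the objective function, the g_j are the constraint functions, and x_L, x_U are the lower and upper bounds on the design variables.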
Defining your design variables and objective function
• x is your vector of design variables.
• f(x) is your objective function, i.e., your optimization goal.
• You want to find the x that optimizes f(x).
• Define your design variables to create an airfoil (e.g., NACA, PARSEC); this is your x.
• For each x, you have a unique airfoil.
• Now suppose you want to maximize L/D; this is your f(x).
• How do you get f(x)? Use CFD!
• Your task is to find the x that generates an airfoil shape maximizing L/D.

[Figure: a population of airfoils; each point is a unique airfoil, and one of
them is your optimal airfoil]

127
Definition of optimality

[Figure: at a point with a non-zero gradient you are not at the optimum; at a
point with zero gradient you have reached your optimal point. The convergence
history will be your favorite plot: you observe improvements over time!]

• A zero gradient is the mathematical condition for an optimum solution.
• But what about cases where you don't know the gradient? Then we aim for an
optimized (improved) solution rather than a provably optimal one.
128
Techniques for optimization

1. Gradient-based techniques
Pros: Can guarantee (local) optimality.
Cons: Work only on smooth functions; gradient information is
not always available.
Examples: Stochastic gradient descent, Newton's method,
quasi-Newton methods.

2. Gradient-free techniques
Pros: Do not need gradients; can be used for almost any problem.
Cons: Typically slow or even very slow; no guarantee of
optimality.
Examples: Nelder-Mead, evolutionary (genetic) algorithms,
particle swarm optimization, the Divided Rectangles (DIRECT) method.

[Figure: f(x) with a local optimum and a global optimum, both with zero
gradient; Nelder-Mead simplex illustration from
https://codesachin.wordpress.com/2016/01/16/nelder-mead-optimization/]
129
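A minimal sketch of a gradient-based technique in Python: plain gradient descent, which iterates toward a point where the gradient vanishes (the function name, learning rate, and step count are illustrative):

```python
def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Plain gradient descent on a function with gradient `grad`."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)   # move against the gradient
    return x
```

For example, minimizing f(x) = (x - 3)² means calling it with grad(x) = 2(x - 3); the iterates converge to x = 3, where the gradient is zero.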
Single- vs. multi-objective optimization

Single-objective optimization: Only one objective
function to be optimized (e.g., strictly reduce drag).

Multi-objective optimization (MOO): Two or more
objective functions to be optimized (e.g.,
minimize drag and maximize strength).

Why multi-objective optimization?
Sometimes we have conflicting objectives (e.g., drag
coefficient vs. structural strength). In MOO, we are
interested in finding the Pareto front.

Single-objective optimization: ONE optimal solution.
Multi-objective optimization: MANY optimal solutions.

[Figures: flow control in a cylinder (low drag vs. low vibration); supersonic
wing drag vs. bending moment, with the Pareto front highlighted]
Obayashi, Shigeru, et al. "Multiobjective evolutionary computation for supersonic
wing-shape optimization." IEEE Transactions on Evolutionary Computation 4.2
(2000): 182-187.
130
Forward vs. inverse design
Forward/direct design: Works by directly optimizing the
objective function: min f(x).
Pros: No necessity to specify a target pressure.
Cons: No control over the individual behavior of solutions.

Inverse design: The target (e.g., a pressure distribution) is
specified. The optimization aims to minimize the deviation from
the target: min |c_p,target − c_p(x)|.
Pros: Good control over the individual behavior of solutions.
Cons: It is not trivial to specify the target pressure.

131
Why you need to know all of these?

Basic statistics
• Know how to use different measures of central tendency
(e.g., mean and median), variability (e.g., standard
deviation, IQR), measures of relationship, PDFs, etc.

Presenting data
• Know how to use different plots for presenting data
and information (e.g., histogram, boxplot, etc.).

Statistical hypothesis tests
• Know how to use statistical hypothesis tests to differentiate a
significant effect/difference from pure chance.
• Know the concept of statistical significance, formulating a
hypothesis, etc.

Regression models
• Know how to use regression models to perform an experiment
where we want to know the relationship between independent
and dependent variables.
• Understand the basic concepts of designing such an experiment.

Optimization methods
• When necessary, optimization can be used to find
the optimal parameters for your research.

Questions?
