Engineering Data Analysis Handsout Module 1 6
ENGINEERING DATA
ANALYSIS
(Summary)
Handout for CE 023 (Engineering Data Analysis) 1st Semester, S.Y. 2020-2021
Lesson 1: Data Collection
EXPERIMENTATION
SAMPLING
Lesson 2.1: Basic Rules of Combining Probabilities
There are basic rules to follow on combining probabilities:
1. ADDITION RULE
(a) If the events are mutually exclusive, the probability that one or another of them occurs is the sum of the probabilities of the separate events. Mutually exclusive events mean two or more events cannot happen at the same time.
(b) If the events are not mutually exclusive, there can be overlap between them. This can be visualized using a Venn diagram. The probability of overlap must be subtracted from the sum of probabilities of the separate events.
Set Relations on Venn Diagram
Let's look at the Venn diagram (b) and (c)
• P [A ∩ B] = P [occurrence of both A and B], the intersection of events A and B.
If three events A, B, and C are not mutually exclusive, ...
2. MULTIPLICATION RULE
(a) The basic idea of counting the number of ways can be described as follows: If an operation can be performed in n₁ ways, and if for each of these a second operation can be performed in n₂ ways, and for each of the first two a third operation can be performed in n₃ ways, and so forth, then the sequence of k operations can be performed in n₁n₂ ··· nk ways.
(b) The simplest form of the Multiplication Rule for probabilities is as follows: If the events are independent, then the occurrence of one event does not affect the probability of occurrence of another event. In that case, the probability of occurrence of more than one event together is the product of the probabilities of the separate events. (This is consistent with the basic idea of counting stated above.) If A and B are two separate events that are independent of one another, the probability of occurrence of both A and B together is
P [A ∩ B] = P [A] × P [B]
(c) If the events are not independent, conditional probability must be used. P [B | A] is read as the probability of B given A, or the probability of B on condition that A occurs. The multiplication rule then becomes
P [A ∩ B] = P [A] × P [B | A] = P [B] × P [A | B]
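The addition and multiplication rules can be checked by brute-force enumeration. The sketch below is only an illustration: the two-dice sample space and the events A and B are assumptions chosen for the example, not part of the handout.

```python
from fractions import Fraction
from itertools import product

# Assumed example: all equally likely outcomes of rolling two fair dice.
# By the counting rule there are 6 * 6 = 36 of them.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a true/false test on one outcome."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

def A(o):
    return o[0] == 6          # first die shows 6

def B(o):
    return o[1] == 6          # second die shows 6

def both(o):
    return A(o) and B(o)      # intersection, A ∩ B

def either(o):
    return A(o) or B(o)       # union, A ∪ B

# Addition rule for events that are not mutually exclusive:
# subtract the overlap from the sum of the separate probabilities.
p_union = prob(A) + prob(B) - prob(both)

# Multiplication rule: P[A ∩ B] = P[A] × P[B | A].
p_b_given_a = prob(both) / prob(A)

print(p_union, prob(either))             # both are 11/36
print(prob(A) * p_b_given_a, prob(both)) # both are 1/36
```

Because the two dice are independent, P [B | A] here equals P [B], so the general rule reduces to the simple product P [A] × P [B].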
Lesson 3: Permutation and Combination
Combinations - are similar to permutations, but with the important difference that combinations take no account of order. Thus, AB and BA are different permutations but the same combination of letters. Then the number of permutations must be larger than the number of combinations, and the ratio between them must be the number of ways the chosen items can be arranged.
In general, the number of combinations of n items taken r at a time is
nCr = n! / [(n - r)! r!]
Lesson 3.1: SAMPLING DISTRIBUTION
c. We may wish to draw conclusions about the fairness of a particular coin by tossing it repeatedly. The population consists of all possible tosses of the coin. A sample could be obtained by examining, say, the first 60 tosses of the coin and noting the percentages of heads and tails.
d. We may wish to draw conclusions about the colors of 200 marbles (the population) in an urn by selecting a sample of 20 marbles from the urn, where each marble selected is returned after its color is observed.
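The relation between permutations and combinations from Lesson 3 can be verified numerically. This is a minimal sketch using Python's standard library; the values n = 5 and r = 3 and the letters used are arbitrary choices for illustration.

```python
import math
from itertools import combinations, permutations

n, r = 5, 3  # illustrative values, not from the handout

n_perm = math.perm(n, r)   # ordered selections: n! / (n - r)!
n_comb = math.comb(n, r)   # unordered selections: n! / ((n - r)! r!)

# The ratio between the two counts is the number of ways to arrange
# the r chosen items, which is r!.
ratio = n_perm // n_comb

# AB and BA are different permutations but the same combination.
perms_ab = list(permutations("AB", 2))    # [('A', 'B'), ('B', 'A')]
combs_ab = list(combinations("AB", 2))    # [('A', 'B')]

print(n_perm, n_comb, ratio)  # 60 10 6
```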
Sampling With or Without Replacement
Sample Mean
Sample Distribution
Point Estimate
Unbiased Estimators
Suppose we have two measuring instruments; one instrument has been accurately calibrated, but the other systematically gives readings smaller than the true value being measured. When each instrument is used repeatedly on the same object, because of measurement error, the observed measurements will not be identical. However, the measurements produced by the first instrument will be distributed about the true value in such a way that on average this instrument measures what it purports to measure, so it is called an unbiased instrument. The second instrument yields observations that have a systematic error component or bias.
Note: A point estimator θ̂ is said to be an unbiased estimator of θ if E(θ̂) = θ. If θ̂ is not unbiased, the difference E(θ̂) - θ is called the bias of θ̂.
Point Estimates and Interval Estimates
An estimate of a population parameter given by a single number is called a point estimate of the parameter. An estimate of a population parameter given by two numbers between which the parameter may be considered to lie is called an interval estimate of the parameter.
Note: A statement of the error or precision of an estimate is often called its reliability.
Lesson 4: Introduction to Statistics
Statistics - is defined as a branch of mathematics that deals with the collection, analysis, interpretation, and presentation of masses of numerical data.
(2) Statistics helps in the proper and efficient planning of a statistical inquiry in any field of study.
(3) Statistics helps in collecting appropriate quantitative data.
(4) Statistics helps in presenting complex data in a suitable tabular, diagrammatic, and graphic form for easy and clear comprehension of the data.
(5) Statistics helps in understanding the nature and pattern of variability of a phenomenon through quantitative observations.
(6) Statistics helps in drawing valid inferences, along with a measure of their reliability, about the population parameters from the sample data.
Descriptive statistics - is the term given to the analysis of data that helps describe, show, or summarize data in a meaningful way such that, for example, patterns might emerge from the data. Descriptive statistics is at the heart of all quantitative analysis. Descriptive statistics do not, however, allow us to make conclusions beyond the data we have analyzed or reach conclusions regarding any hypotheses we might have made. They are simply a way to describe our data.
Typically, there are two general types of statistic that are used to describe data: Measures of Central Tendency and Measures of Variability.
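The two families of descriptive measures named in this lesson, central tendency and variability, can be sketched with Python's standard `statistics` module. The data set is made up for illustration. Two choices here are assumptions, since the handout does not fix them: the quartiles are taken as the medians of the lower and upper halves of the sorted data (averaging when two values straddle a quartile), and the variance is the sample variance with divisor n - 1.

```python
import statistics

# Hypothetical data set, chosen only for illustration.
data = [2, 4, 4, 5, 7, 9, 10, 12]

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (assumed method)."""
    ordered = sorted(values)          # values must be in numerical order
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, the middle value is left out of both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return statistics.median(lower), statistics.median(upper)

# Measures of central tendency
mean = statistics.mean(data)          # 6.625
median = statistics.median(data)      # 6.0
mode = statistics.mode(data)          # 4

# Measures of variability
data_range = max(data) - min(data)    # 10
q1, q3 = quartiles(data)              # 4.0 and 9.5
iqr = q3 - q1                         # 5.5
variance = statistics.variance(data)  # sample variance (divisor n - 1)
std_dev = statistics.stdev(data)      # square root of the variance

print(mean, median, mode, data_range, iqr, variance, std_dev)
```

If a population variance is wanted instead, `statistics.pvariance` and `statistics.pstdev` use the divisor n.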
Measures of Central Tendency - use a single value to describe the center of a data set. The mean, median, and mode are the three measures of central tendency.
1.) Mean - is the arithmetic average, calculated by finding the sum of the data and dividing it by the total number of data values.
2.) Median - is the middle value of the data when the values are listed in numerical order.
Measures of Variability
1. Range - the difference between the maximum and minimum data values.
2. Interquartile Range - quartiles divide the range of values into four parts, each containing one quarter of the values. The difference between Q3 and Q1 is called the interquartile range. As in finding the median, it is necessary to list the values in numerical order. If Q1 or Q3 falls between 2 values, take their average.
3. Variance - in statistics is a measurement of the spread between numbers in a data set. That is, it measures how far each number in the set is from the mean and therefore from every other number in the set.
4. Standard Deviation - It is defined as the square root of the variance.
NOTE:
Kurtosis - the sharpness of the peak of a frequency-distribution curve.
Skewness - the measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
Overfitting (low bias, high variance) - force-fitting the data, so that the fit looks too good to be true. If we have overfitted, we have more parameters than the actual underlying data can justify and have therefore built an overly complex model.
Regression - is a statistical method used to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables).
Simple Linear Regression
In statistics, simple linear regression is a linear regression model with a single explanatory variable.
That is, it concerns two-dimensional sample points with one independent variable and one dependent variable (conventionally, the x and y coordinates in a Cartesian coordinate system) and finds a linear function (a non-vertical straight line) that, as accurately as possible, predicts the dependent variable values as a function of the independent variable.
E(Y) = α + βx
Where: α and β are the regression coefficients. From a sample consisting of n pairs of data (x, y), we calculate estimates, a for α and b for β.
It makes sense that the line passes through the means. The point (mean of x, mean of y) is called the centroid or centroidal point.
Centroidal point = (mean of x, mean of y) = (x̄, ȳ)
Below are called the least-squares equations or normal equations for estimating the coefficients, a and b, by the points (xi, yi):
Σyi = na + bΣxi
Σxiyi = aΣxi + bΣxi²
To get the equation of regression given by the points (xi, yi), these are the formulas you need to remember:
y = a + bx
b = Sxy / Sxx
a = (mean of y) - b (mean of x) = ȳ - b x̄
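The formulas b = Sxy / Sxx and a = (mean of y) - b (mean of x) can be sketched in a few lines of Python. The handout does not spell out Sxy and Sxx, so the usual definitions are assumed here: Sxy = Σ(xi - x̄)(yi - ȳ) and Sxx = Σ(xi - x̄)². The function name `fit_line` and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the line y = a + b x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Assumed definitions of the corrected sums of squares and products.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope
    a = y_bar - b * x_bar      # intercept; forces the line through the centroid
    return a, b

# Hypothetical sample of n = 5 pairs (x, y).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

a, b = fit_line(xs, ys)
print(f"y = {a:.2f} + {b:.2f} x")   # y = 0.09 + 1.97 x

# The fitted line passes through the centroid (mean of x, mean of y).
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
print(abs((a + b * x_bar) - y_bar) < 1e-9)   # True
```

These estimates also satisfy the first normal equation, since Σy = 30 and na + bΣx = 5(0.09) + 1.97(15) = 30.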
Correlation
The word correlation is used in everyday life to denote some form of association. We might say that we have noticed a correlation between foggy days and attacks of wheeziness. However, in statistical terms we use correlation to denote association between two quantitative variables. We also assume that the association is linear, that one variable increases or decreases a fixed amount for a unit increase or decrease in the other.
Correlation coefficient
• if r is positive (up to r = 1), x and y are positively correlated: as x increases, y increases.
• if r is negative (down to r = -1), x and y are negatively correlated: as x increases, y decreases.
• if r = 0, there is no linear correlation.
The formula for the correlation coefficient, r:
r = Sxy / √(Sxx · Syy)
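As an illustration of how the correlation coefficient behaves, the sketch below computes r with the standard Pearson formula r = Sxy / √(Sxx · Syy), where Sxy = Σ(xi - x̄)(yi - ȳ), Sxx = Σ(xi - x̄)² and Syy = Σ(yi - ȳ)². These definitions, the function name `pearson_r`, and the data sets are assumptions made for the example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired data (assumed formula)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]

# Hypothetical data sets showing the three cases described in the text.
r_pos = pearson_r(xs, [3, 5, 7, 9, 11])      # y rises with x: r = 1
r_neg = pearson_r(xs, [10, 8, 6, 4, 2])      # y falls as x rises: r = -1
r_zero = pearson_r([1, 2, 3], [1, 2, 1])     # no linear trend: r = 0

print(r_pos, r_neg, r_zero)
```

Real data rarely fall exactly on a line, so r usually lies strictly between -1 and 1; its sign gives the direction of the association and its size the strength.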
H₁.
You will arrive at one of these conclusions:
• reject H₀ in favor of H₁ because of sufficient evidence in the data, or
• fail to reject H₀ because of insufficient evidence in the data.
Though the applications of hypothesis testing are quite abundant in scientific and engineering work, perhaps the best illustration for a novice lies in the predicament encountered in a jury trial. The null and alternative hypotheses are:
H₀: defendant is innocent
H₁: defendant is guilty
This is equivalent to testing the hypothesis that the probability of success on a given trial is p = 1/4 against the alternative that p > 1/4. This is usually written as follows:
H₀: p = 0.25
H₁: p > 0.25
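A one-sided test of H₀: p = 0.25 against H₁: p > 0.25 can be sketched with an exact binomial tail probability. The sample size n = 20 and the observed count of 9 successes are hypothetical values chosen for illustration, and `binom_upper_tail` is a helper name introduced here.

```python
from math import comb

def binom_upper_tail(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))

# Hypothetical experiment: 9 successes observed in 20 trials.
n_trials = 20
observed = 9
p_null = 0.25

# p-value: chance of a result at least this extreme if H0 (p = 0.25) is true.
p_value = binom_upper_tail(observed, n_trials, p_null)
print(round(p_value, 4))   # 0.0409

alpha = 0.05
if p_value <= alpha:
    print("Reject H0 in favor of H1: sufficient evidence in the data")
else:
    print("Fail to reject H0: insufficient evidence in the data")
```

With these assumed numbers the p-value is about 0.041, so H₀ is rejected at the 0.05 level; with 8 successes instead of 9 it would not be.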
When you perform a statistical test, a p-value helps you determine the significance of your results in relation to the null hypothesis.
How do you know if a p-value is highly significant?
The level of statistical significance is often expressed as a p-value between 0 and 1. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.
A p-value of 0.05 or less (≤ 0.05) is conventionally taken as statistically significant. It indicates strong evidence against the null hypothesis, as there is less than a 5% probability of obtaining results at least this extreme if the null hypothesis were true. Therefore, we reject the null hypothesis in favor of the alternative hypothesis.
However, this does not mean that there is a 95% probability that the research hypothesis is true. The p-value is conditional upon the null hypothesis being true and is unrelated to the truth or falsity of the research hypothesis.
A p-value higher than 0.05 (> 0.05) is not statistically significant: the data do not provide sufficient evidence against the null hypothesis. This means we retain (fail to reject) the null hypothesis. Note that you cannot accept the null hypothesis; you can only reject it or fail to reject it.
A statistically significant result cannot prove that a research hypothesis is correct (as this implies 100% certainty).
These are some procedures to test hypotheses on normally distributed data:
1. 1-Sample Z - used to make inferences on the mean of a population using a random sample mean. The random samples must be normally distributed and the population standard deviation must be known from previous information.
2. 1-Sample T - used to make inferences on the mean of a population using a random sample mean. The random samples must be normally distributed, and there is NO previous information on the population standard deviation.
3. 2-Sample T - requires independent normally distributed sample data and is used to compare the difference between two means and make inferences on whether it is equal to a target.
4. Paired t - used to analyze the difference between paired observations against a reference value, i.e. the target value is 0.
5. One-way ANOVA - ANOVA stands for ANalysis Of VAriance. It is used to determine if the means of several populations have statistically significant differences.
6. Two-way ANOVA - it is a general linear model procedure to conduct an ANOVA and test the hypothesis that the means of several populations are equal, with a response factor (Y) and multiple predictors (X1, X2).
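As an illustration of the 1-Sample Z procedure and of reading a p-value, here is a minimal sketch using only the standard library. The sample values, the hypothesized mean `mu0`, and the known population standard deviation `sigma` are all hypothetical; a two-sided alternative is assumed.

```python
import math
from statistics import NormalDist, mean

def one_sample_z(sample, mu0, sigma):
    """Return (z, two-sided p-value) for H0: population mean = mu0.

    Assumes a normally distributed sample and a known population
    standard deviation sigma.
    """
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / math.sqrt(n))
    # Probability of a |Z| at least this large when H0 is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical measurements; test H0: mean = 10 with known sigma = 1.
sample = [10.5, 11.5, 11.0, 11.0]
z, p_value = one_sample_z(sample, mu0=10.0, sigma=1.0)
print(z, round(p_value, 4))   # 2.0 0.0455

alpha = 0.05
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(decision)               # reject H0
```

When sigma is not known, the 1-Sample T procedure replaces it with the sample standard deviation and uses the t distribution, which the standard library does not provide.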
Reference:
Walpole, R., Myers, R., Myers, S., & Ye, K. (2014). Scientists & Engineers Guide to
Probability & Statistics (9th ed.). Pearson Education Inc., Prentice-Hall.
Decoursey, W.J. (2003). Statistics and Probability for Engineering Applications.
Newnes.