Engineering Data Analysis Handout Modules 1-6


Engineering Data Analysis (Technological Institute of the Philippines)


ENGINEERING DATA
ANALYSIS
(Summary)


Handout for CE 023 (Engineering Data Analysis) 1st Semester, S.Y. 2020-2021

Lesson 1: Data Collection

Data Collection - is a systematic way of gathering and measuring information on different groups of people. The data collected can be used for research, hypothesis testing, and other intended purposes.

TYPES OF DATA

1. Quantitative - sets of data in numerical form that can be either counted or measured.

• Discrete Data - data that can be "counted" (e.g. No. of Pencils, No. of People)
• Continuous Data - data that can be "measured" (e.g. Height, Weight, and Temperature)

2. Qualitative - sets of data that describe characteristics and classifications.

• Binary Data - falls under two mutually exclusive categories (e.g. right/wrong, true/false)
• Nominal Data - named categories with no specific rank or order (e.g. blue/red/green)
• Ordinal Data - categories with a specific rank or natural order (e.g. short, medium, tall)

Note: Data collection involves either sampling or experimentation.

SAMPLING

If you are collecting data about a small group of people, say, about 10 students, it is easy to tally and record them accordingly. But if the statistical population is too large to survey, it is better to gather data from a sample only. This process is called sampling. Sampling is the selection of a subset of individuals from within a statistical population to estimate characteristics of the whole population.

EXPERIMENTATION

On the other hand, experimentation is the collection of data in a more controlled manner. One example is the data you collect as a result of your laboratory experiments. Note that experimentation is not limited to the laboratory. Most companies use experimentation to test their hypotheses. For example, a company can launch a sales competition to test how salespeople react to different levels of performance incentives.

Lesson 2: Introduction to Probability

Probability - is a measure of the likelihood that a particular event will occur. To compute the probability of a particular event when all outcomes are equally likely:

Probability of an event = (number of ways it can happen) / (total number of outcomes)

If an event is certain to happen, its probability is 1. If an event is impossible, its probability is 0. Therefore, a probability value ranges from 0 to 1.

Probability can be expressed as a decimal, a fraction, or a percentage (for example, 1/4 = 0.25 = 25%).
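The probability formula from Lesson 2 (number of ways an event can happen divided by the total number of outcomes) can be sketched in a few lines of Python. The function name `probability` and the card-drawing numbers are assumptions made up for this illustration; the sketch also assumes that all outcomes are equally likely.

```python
from fractions import Fraction

def probability(favorable, total):
    """Probability of an event, assuming equally likely outcomes.

    `favorable` and `total` are hypothetical parameter names:
    the number of ways the event can happen and the total number of outcomes.
    """
    if total <= 0 or not 0 <= favorable <= total:
        raise ValueError("need 0 <= favorable <= total and total > 0")
    return Fraction(favorable, total)

# Illustrative case: drawing a heart from a standard 52-card deck (13 hearts).
p = probability(13, 52)

as_fraction = str(p)               # fraction form
as_decimal = float(p)              # decimal form
as_percent = f"{float(p):.0%}"     # percentage form
print(as_fraction, as_decimal, as_percent)
```

The same value is shown in the three forms mentioned in the lesson: fraction, decimal, and percentage.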

Lesson 2.1: Basic Rules of Combining Probabilities

There are basic rules to follow in combining probabilities:

1. ADDITION RULE

(a) If the events are mutually exclusive, there is no overlap: if one event occurs, the other events cannot occur. In that case, the probability of occurrence of one or another of more than one event is the sum of the probabilities of the separate events. Mutually exclusive events are two or more events that cannot happen at the same time.

(b) If the events are not mutually exclusive, there can be overlap between them. This can be visualized using a Venn diagram. The probability of overlap must be subtracted from the sum of the probabilities of the separate events.

Set Relations on Venn Diagram

Let's look at Venn diagrams (b) and (c):

• P(A ∩ B) = P(occurrence of both A and B), the intersection of events A and B.
• P(A ∪ B) = P(occurrence of A or B or both), the union of the two events A and B.
• If two events being considered, A and B, are not mutually exclusive, so that there may be overlap between them, the Addition Rule becomes P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

If three events A, B, and C are not mutually exclusive:

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C)

2. MULTIPLICATION RULE

(a) The basic idea for calculating the number of ways can be described as follows: If an operation can be performed in n₁ ways and if for each of these ways a second operation can be performed in n₂ ways, then the two operations can be performed together in n₁n₂ ways.

Note: For more than two operations: If an operation can be performed in n₁ ways, and if for each of these a second operation can be performed in n₂ ways, and for each of the first two a third operation can be performed in n₃ ways, and so forth, then the sequence of k operations can be performed in n₁n₂ ··· nₖ ways.

(b) The simplest form of the Multiplication Rule for probabilities is as follows: If the events are independent, then the occurrence of one event does not affect the probability of occurrence of another event. In that case, the probability of occurrence of more than one event together is the product of the probabilities of the separate events. (This is consistent with the basic idea of counting stated above.) If A and B are two separate events that are independent of one another, the probability of occurrence of both A and B together is given by P(A ∩ B) = P(A) × P(B)

(c) If the events are not independent, one event affects the probability of the other event. In this case, conditional probability must be used. The conditional probability of B given that A occurs, or on condition that A occurs, is written P(B | A). This is read as the probability of B given A, or the probability of B on condition that A occurs.

Note: The multiplication rule for the occurrence of both A and B together when they are not independent is the product of the probability of one event and the conditional probability of the other:

P(A ∩ B) = P(A) × P(B | A) = P(B) × P(A | B)
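The addition rule, the multiplication rule for independent events, and the conditional form can all be checked by brute-force counting on a small sample space. The sketch below uses two fair dice; the events chosen (first die shows 6, second die shows 6, sum at least 10) and every function name are assumptions made only for this illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability = favourable outcomes / total outcomes."""
    favourable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favourable, len(outcomes))

def first_is_six(o):       # event A
    return o[0] == 6

def second_is_six(o):      # event B, independent of A
    return o[1] == 6

def sum_at_least_ten(o):   # event C, not independent of A
    return o[0] + o[1] >= 10

p_a, p_b, p_c = prob(first_is_six), prob(second_is_six), prob(sum_at_least_ten)
p_a_and_b = prob(lambda o: first_is_six(o) and second_is_six(o))
p_a_or_b = prob(lambda o: first_is_six(o) or second_is_six(o))
p_a_and_c = prob(lambda o: first_is_six(o) and sum_at_least_ten(o))

# Addition rule for events that are not mutually exclusive.
addition_rule_holds = p_a_or_b == p_a + p_b - p_a_and_b

# Multiplication rule for independent events A and B.
independent_rule_holds = p_a_and_b == p_a * p_b

# Conditional probability P(C | A), obtained from P(A ∩ C) = P(A) × P(C | A).
p_c_given_a = p_a_and_c / p_a

print(addition_rule_holds, independent_rule_holds, p_c, p_c_given_a)
```

Because P(C | A) differs from P(C), events A and C are not independent, so the plain product P(A) × P(C) would give the wrong value for P(A ∩ C).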

Lesson 3: Permutation and Combination

Permutation - is an arrangement of all or part of a set of objects. The number of permutations is the number of different arrangements in which items can be placed. Notice that if the order of the items is changed, the arrangement is different, so we have a different permutation. In permutations, the order is important!

• Rule 1. The number of permutations of n distinct objects is n!
• Rule 2. The number of permutations of n distinct objects taken r at a time is nPr = n! / (n − r)!
• Rule 3. If n items are arranged in a circle, the arrangement doesn't change if every item is moved by one place to the left or the right. Therefore in this situation, one item can be placed at random, and all the other items are placed relative to the first item. The number of permutations of n objects arranged in a circle is (n − 1)!
• Rule 4. The number of distinct permutations of n things of which n₁ are of one kind, n₂ of a second kind, ..., nₖ of a kth kind is n! / (n₁! n₂! ··· nₖ!)

Combinations - are similar to permutations, but with the important difference that combinations take no account of order. Thus, AB and BA are different permutations but the same combination of letters. The number of permutations must therefore be larger than the number of combinations, and the ratio between them must be the number of ways the chosen items can be arranged.

In general, the number of combinations of n items taken r at a time is nCr = n! / (r! (n − r)!)

Lesson 3.1: SAMPLING DISTRIBUTION

Population and Sample

Often in practice, we are interested in drawing valid conclusions about a large group of individuals or objects. Instead of examining the entire group, called the population, which may be difficult or impossible to do, we may examine only a small part of this population, which is called a sample. We do this with the aim of inferring certain facts about the population from results found in the sample, a process known as statistical inference. The process of obtaining samples is called sampling. Let's take a look at the examples below.

a. We may wish to draw conclusions about the weights of 12,000 adult students (the population) by examining only 100 students (a sample) selected from this population.

b. We may wish to draw conclusions about the percentage of defective bolts produced in a factory during a given 6-day week by examining 20 bolts each day produced at various times during the day. In this case, all bolts produced during the week comprise the population, while the 120 selected bolts constitute a sample.

c. We may wish to draw conclusions about the fairness of a particular coin by tossing it repeatedly. The population consists of all possible tosses of the coin. A sample could be obtained by examining, say, the first 60 tosses of the coin and noting the percentages of heads and tails.

d. We may wish to draw conclusions about the colors of 200 marbles (the population) in an urn by selecting a sample of 20 marbles from the urn, where each marble selected is returned after its color is observed.
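Returning to the counting rules of Lesson 3, the four permutation rules and the combination formula can be checked numerically with Python's standard `math` module. The values n = 5 and r = 3, the helper name `multiset_permutations`, and the example word are assumptions chosen for illustration; the formula used for Rule 4 is the standard n! / (n₁! n₂! ··· nₖ!).

```python
from math import comb, factorial, perm

n, r = 5, 3  # illustrative values only

rule1 = factorial(n)        # Rule 1: permutations of n distinct objects, n!
rule2 = perm(n, r)          # Rule 2: nPr = n! / (n - r)!
rule3 = factorial(n - 1)    # Rule 3: n objects arranged in a circle, (n - 1)!

def multiset_permutations(counts):
    """Rule 4: n! / (n1! n2! ... nk!) where counts = [n1, n2, ..., nk]."""
    total = factorial(sum(counts))
    for count in counts:
        total //= factorial(count)  # each division is exact
    return total

# Letters of "STATISTICS": S x3, T x3, I x2, A x1, C x1 (10 letters in total).
rule4 = multiset_permutations([3, 3, 2, 1, 1])

combinations = comb(n, r)   # nCr = n! / (r! (n - r)!)

# The ratio of permutations to combinations is r!, the number of ways
# the r chosen items can be arranged among themselves.
ratio = rule2 // combinations

print(rule1, rule2, rule3, rule4, combinations, ratio)
```

The last line of the computation mirrors the statement that the ratio between permutations and combinations is the number of ways the chosen items can be arranged.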

Sampling With or Without Replacement

If we draw an object from an urn, we have the choice of replacing or not replacing the object into the urn before we draw again. In the first case, a particular object can come up again and again, whereas in the second it can come up only once. Sampling where each member of a population may be chosen more than once is called sampling with replacement, while sampling where each member cannot be chosen more than once is called sampling without replacement. A finite population that is sampled with replacement can theoretically be considered infinite since samples of any size can be drawn without exhausting the population. For most practical purposes, sampling from a finite population that is very large can be considered as sampling from an infinite population.

Sampling Distribution

The sampling distribution describes the expected behavior of a large number of simple random samples drawn from the same population.

Lesson 3.2: POINT ESTIMATION

Point Estimate

A Point Estimate of a parameter θ is a single number that can be regarded as a sensible value for θ. A point estimate is obtained by selecting a suitable statistic and computing its value from the given sample data. The selected statistic is called the point estimator of θ.

[Figure: Sample Mean]

[Figure: Sampling Distribution of Means]
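The ideas of sampling with and without replacement, the sample mean as a point estimate, and the sampling distribution of means can be illustrated with a small simulation. Everything concrete here is an assumption for the sketch: the population of 200 numbered items, the sample size of 20, the 2000 repetitions, and the fixed random seed.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical finite population: 200 items with values 1..200.
population = list(range(1, 201))
population_mean = statistics.mean(population)

# Sampling WITH replacement: the same member may be drawn more than once.
sample_with = random.choices(population, k=20)

# Sampling WITHOUT replacement: each member can be drawn at most once.
sample_without = random.sample(population, k=20)

# The sample mean is a point estimate of the population mean.
point_estimate = statistics.mean(sample_without)

# Sampling distribution of means: repeat the sampling many times and
# look at how the sample means behave as a group.
sample_means = [
    statistics.mean(random.sample(population, k=20)) for _ in range(2000)
]
mean_of_sample_means = statistics.mean(sample_means)
spread_of_sample_means = statistics.stdev(sample_means)

print(population_mean, point_estimate, mean_of_sample_means, spread_of_sample_means)
```

A single sample mean can land well away from the population mean, but the average of many sample means should sit close to it, which is the behavior the sampling distribution describes.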

Unbiased Estimators

Suppose we have two measuring instruments; one instrument has been accurately calibrated, but the other systematically gives readings smaller than the true value being measured. When each instrument is used repeatedly on the same object, because of measurement error, the observed measurements will not be identical. However, the measurements produced by the first instrument will be distributed about the true value in such a way that on average this instrument measures what it purports to measure, so it is called an unbiased instrument. The second instrument yields observations that have a systematic error component or bias.

Note: A point estimator θ̂ is said to be an unbiased estimator of θ if E(θ̂) = θ. If θ̂ is not unbiased, the difference E(θ̂) − θ is called the bias of θ̂.

Point Estimates and Interval Estimates

An estimate of a population parameter given by a single number is called a point estimate of the parameter. An estimate of a population parameter given by two numbers between which the parameter may be considered to lie is called an interval estimate of the parameter.

Note: A statement of the error or precision of an estimate is often called its reliability.

Lesson 4: Introduction to Statistics

Statistics - is defined as a branch of mathematics that deals with the collection, analysis, interpretation, and presentation of masses of numerical data.

What is the use of statistics?

(1) Statistics helps in providing a better understanding and exact description of a phenomenon of nature.

(2) Statistics helps in the proper and efficient planning of a statistical inquiry in any field of study.

(3) Statistics helps in collecting appropriate quantitative data.

(4) Statistics helps in presenting complex data in a suitable tabular, diagrammatic, and graphic form for easy and clear comprehension of the data.

(5) Statistics helps in understanding the nature and pattern of variability of a phenomenon through quantitative observations.

(6) Statistics helps in drawing valid inferences, along with a measure of their reliability, about the population parameters from the sample data.

Descriptive statistics - is the term given to the analysis of data that helps describe, show, or summarize data in a meaningful way such that, for example, patterns might emerge from the data. Descriptive statistics is at the heart of all quantitative analysis. Descriptive statistics do not, however, allow us to make conclusions beyond the data we have analyzed or reach conclusions regarding any hypotheses we might have made. They are simply a way to describe our data.

Typically, there are two general types of statistics that are used to describe data: Measures of Central Tendency and Measures of Variability.
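As a rough sketch of descriptive statistics in practice, the snippet below computes the usual measures of central tendency (mean, median, mode) and of variability (range, interquartile range, variance, standard deviation) with Python's standard `statistics` module. The data set is made up for illustration. Two caveats are assumptions of this sketch rather than statements from the handout: textbooks use slightly different quartile conventions, and the handout does not say whether the variance divides by n or by n − 1, so both are shown.

```python
import statistics

# Hypothetical data set, chosen only for illustration.
data = [2, 4, 4, 5, 7, 9, 11]

# Measures of central tendency
mean = statistics.mean(data)       # arithmetic average
median = statistics.median(data)   # middle value of the ordered data
mode = statistics.mode(data)       # most frequent value

# Measures of variability
data_range = max(data) - min(data)            # maximum minus minimum
q1, _, q3 = statistics.quantiles(data, n=4)   # quartile cut points
iqr = q3 - q1                                 # interquartile range, Q3 - Q1
sample_variance = statistics.variance(data)        # divides by n - 1
population_variance = statistics.pvariance(data)   # divides by n
sample_std_dev = statistics.stdev(data)            # square root of sample variance

print(mean, median, mode, data_range, iqr, sample_variance, sample_std_dev)
```

For this particular data set the quartiles returned by `statistics.quantiles` agree with the "median of each half" method, but that agreement is not guaranteed for every data set.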

Measures of Central Tendency - use a single value to describe the center of a data set. The mean, median, and mode are the three measures of central tendency.

1. Mean - is the arithmetic average, calculated by finding the sum of the data and dividing it by the total number of data values.

2. Median - is the middle value of the distribution. It is calculated by first listing the data in numerical order and then locating the value in the middle of the list.

• Odd set of data - the middle value
• Even set of data - the average of the two middle values

3. Mode - is the value that appears most frequently in the set of data.

Measures of Variation - indicate how spread out the data are from a central value, i.e. the mean.

The following are the commonly used measures of variation:

1. Range - the difference between the maximum and minimum data values.

2. Interquartile Range - quartiles divide the range of values into four parts, each containing one quarter of the values. The difference between Q3 and Q1 is called the interquartile range. As in finding the median, it is necessary to list the values in numerical order. If Q1 or Q3 falls between two values, take their average.

3. Variance - in statistics is a measurement of the spread between numbers in a data set. That is, it measures how far each number in the set is from the mean and therefore from every other number in the set.

4. Standard Deviation - is defined as the square root of the variance.

NOTE:

Kurtosis - the sharpness of the peak of a frequency-distribution curve.

Skewness - the measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.

Lesson 5: Curve Fitting, Regression, and Correlation

Curve Fitting

The general problem of finding equations of approximating curves that fit given sets of data is called curve fitting.

Underfitting (high bias, low variance) - too simple to explain the variance. If we have underfitted, this means that the model function does not have enough complexity (parameters) to fit the true function correctly.

Overfitting (low bias, high variance) - force-fitting, too good to be true. If we have overfitted, this means that we have too many parameters to be justified by the actual underlying data and have therefore built an overly complex model.

Regression - is a statistical method used to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables).

Simple Linear Regression

In statistics, simple linear regression is a linear regression model with a single explanatory variable.

That is, it concerns two-dimensional sample points with one independent variable and one dependent variable (conventionally, the x and y coordinates in a Cartesian coordinate system) and finds a linear function (a non-vertical straight line) that, as accurately as possible, predicts the dependent variable values as a function of the independent variable.

The simplest situation is a linear or straight-line relation between a single input and the response. Say the input and response are x and y, respectively. For this simple situation:

E(Y) = α + βx

Where:

α and β are constant parameters that we want to estimate. They are often called regression coefficients. From a sample consisting of n pairs of data (x, y), we calculate estimates, a for α and b for β. Substituting the estimates a and b, we have the fitted regression line:

y = a + bx

To compute the parameters a and b, we will use the Method of Least Squares.

METHOD OF LEAST SQUARES

The problem now is to determine a and b to give the best fit with the sample data. If the points given by (xi, yi) are close to a perfect straight line, it might be satisfactory to plot the points and draw the line by eye. The regression line is sometimes called the "line of best fit" or the "best fit line". Since it "best fits" the data, it makes sense that the line passes through the means. The point is called the centroid or centroidal point.

Centroidal point = (mean of x, mean of y) = (x̄, ȳ)

The equations below are called the least-squares equations or normal equations for estimating the coefficients a and b from the points (xi, yi):

Σyi = n a + b Σxi
Σxi yi = a Σxi + b Σxi²

Referring to the equation y = a + bx, if we get the equation of regression given by the points (xi, yi), these are the formulas you need to remember:

Sxx = Σxi² − (Σxi)² / n
Syy = Σyi² − (Σyi)² / n
Sxy = Σxi yi − (Σxi)(Σyi) / n

Wherein:
Sxx = sum of squares for x
Syy = sum of squares for y
Sxy = sum of products for x and y

Using these equations, our formulas for estimating the coefficients a and b are:

• b = Sxy / Sxx
• a = (mean of y) − b (mean of x) = ȳ − b x̄
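The least-squares recipe above (compute Sxx and Sxy, then b = Sxy / Sxx and a = ȳ − b x̄) can be sketched as a short function. The function name `least_squares_fit` and the five data points are hypothetical. The computational forms of Sxx, Syy, and Sxy are the standard ones and are an assumption here, since the handout's own formula images are not available. Syy is not needed for a and b; it is computed because the same sums give Pearson's correlation coefficient r = Sxy / √(Sxx · Syy), which the Correlation section discusses.

```python
from math import sqrt

def least_squares_fit(xs, ys):
    """Fit y = a + b x by least squares; also return Pearson's r."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)

    # Sums of squares and products about the means (computational forms).
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    syy = sum(y * y for y in ys) - sum_y ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n

    b = sxy / sxx                    # slope
    a = sum_y / n - b * sum_x / n    # intercept: mean of y minus b times mean of x
    r = sxy / sqrt(sxx * syy)        # correlation coefficient
    return a, b, r

# Hypothetical data lying close to a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.2, 4.1, 6.0, 8.1, 9.9]

a, b, r = least_squares_fit(xs, ys)

# The fitted line passes through the centroid (mean of x, mean of y).
x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
passes_through_centroid = abs((a + b * x_bar) - y_bar) < 1e-9

print(a, b, r, passes_through_centroid)
```

Solving the two normal equations directly gives the same a and b; the Sxx and Sxy form is just a more convenient way to organise the arithmetic.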

Correlation

The word correlation is used in everyday life to denote some form of association. We might say that we have noticed a correlation between foggy days and attacks of wheeziness. However, in statistical terms we use correlation to denote association between two quantitative variables. We also assume that the association is linear, that one variable increases or decreases a fixed amount for a unit increase or decrease in the other.

Correlation is commonly referred to as the degree to which a pair of variables are linearly related.

Correlation coefficient

The degree of association is measured by a correlation coefficient, denoted by r. It is sometimes called Pearson's correlation coefficient after its originator and is a measure of linear association. If a curved line is needed to express the relationship, other and more complicated measures of the correlation must be used.

The correlation coefficient is measured on a scale that varies from +1 through 0 to −1. Complete correlation between two variables is expressed by either +1 or −1. When one variable increases as the other increases, the correlation is positive; when one decreases as the other increases, it is negative. The complete absence of correlation is represented by 0.

[Figure: graphical representations of correlation]

• if r is positive (up to r = 1), the variables are positively correlated: as x increases, y increases.
• if r is negative (down to r = −1), the variables are negatively correlated: as x increases, y decreases.
• if r = 0, there is no linear correlation.

The formula for the correlation coefficient, r:

r = Sxy / √(Sxx · Syy)

Lesson 6: Test of Hypothesis

The Null and Alternative Hypotheses

The structure of hypothesis testing will be formulated with the use of the term null hypothesis, which refers to any hypothesis we wish to test and is denoted by H₀. The rejection of H₀ leads to the acceptance of an alternative hypothesis, denoted by H₁. An understanding of the different roles played by the null hypothesis (H₀) and the alternative hypothesis (H₁) is crucial to one's understanding of the rudiments of hypothesis testing. The alternative hypothesis H₁ usually represents the question to be answered or the theory to be tested, and thus its specification is crucial. The null hypothesis H₀ nullifies or opposes H₁ and is often the logical complement to H₁.

You will arrive at one of these conclusions:

• reject H₀ in favor of H₁ because of sufficient evidence in the data, or
• fail to reject H₀ because of insufficient evidence in the data.

Though the applications of hypothesis testing are quite abundant in scientific and engineering work, perhaps the best illustration for a novice lies in the predicament encountered in a jury trial. The null and alternative hypotheses are:

H₀: the defendant is innocent,
H₁: the defendant is guilty.

The indictment comes because of suspicion of guilt. The hypothesis H₀ (the status quo) stands in opposition to H₁ and is maintained unless H₁ is supported by evidence "beyond a reasonable doubt." However, "failure to reject H₀" in this case does not imply innocence, but merely that the evidence was insufficient to convict. So the jury does not necessarily accept H₀ but fails to reject H₀.

Testing Statistical Hypothesis

To illustrate the concepts used in testing a statistical hypothesis about a population, we present the following example:

A certain type of cold vaccine is known to be only 25% effective after a period of 2 years. To determine if a new and somewhat more expensive vaccine is superior in providing protection against the same virus for a longer period of time, suppose that 20 people are chosen at random and given the new vaccine. If more than 8 of those receiving the new vaccine surpass the 2-year period without contracting the virus, the new vaccine will be considered superior to the one presently in use.

This is equivalent to testing the hypothesis that the probability of success on a given trial is p = 1/4 against the alternative that p > 1/4. This is usually written as follows:

H₀: p = 0.25
H₁: p > 0.25

Testing Hypothesis for Normal Distribution

The normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. In graph form, the normal distribution appears as a bell curve.

P-value APPROACH
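For the cold vaccine example (20 people, new vaccine declared superior if more than 8 are protected), a natural question is how often that rule would declare the new vaccine superior when it is in fact no better than the old one (p = 0.25). Under that assumption the number protected follows a binomial distribution, so the answer is an upper-tail binomial probability. This is the same kind of tail probability that a p-value reports. The helper name `binomial_upper_tail` is hypothetical, and treating the 20 people as independent trials is an assumption of this sketch.

```python
from math import comb

def binomial_upper_tail(k_min, n, p):
    """P(X >= k_min) for X ~ Binomial(n, p), summed term by term."""
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )

n = 20        # people receiving the new vaccine
p_null = 0.25 # success probability if H0 is true

# "More than 8" protected means 9 or more.
prob_reject_when_h0_true = binomial_upper_tail(9, n, p_null)

# Sanity check: the probabilities of all possible counts add up to 1.
total_probability = binomial_upper_tail(0, n, p_null)

print(round(prob_reject_when_h0_true, 4), round(total_probability, 4))
```

The result is a little over 4%, so the "more than 8 out of 20" rule would rarely reject H₀ when H₀ is true.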

When you perform a statistical test, a p-value helps you determine the significance of your results in relation to the null hypothesis.

How do you know if a p-value is highly significant?

The level of statistical significance is often expressed as a p-value between 0 and 1. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.

• A p-value of 0.05 or less (≤ 0.05) is conventionally treated as statistically significant. It indicates strong evidence against the null hypothesis: if the null hypothesis were true, results at least this extreme would occur less than 5% of the time. Therefore, we reject the null hypothesis in favor of the alternative hypothesis.

However, this does not mean that there is a 95% probability that the research hypothesis is true. The p-value is conditional upon the null hypothesis being true and is unrelated to the truth or falsity of the research hypothesis.

• A p-value higher than 0.05 (> 0.05) is not statistically significant. It means the data do not provide sufficient evidence against the null hypothesis, so we fail to reject it. Note that you cannot accept the null hypothesis; you can only reject the null or fail to reject it.

A statistically significant result cannot prove that a research hypothesis is correct (as this implies 100% certainty).

Instead, we may state that our results "provide support for" or "give evidence for" our research hypothesis (as there is still a small probability, e.g. less than 5%, that results this extreme would occur by chance even if the null hypothesis were correct).

These are some procedures to test hypotheses on normally distributed data:

1. 1-Sample Z - used to make inferences on the mean of a population using a random sample mean. The random samples must be normally distributed, and the population standard deviation is known from previous information.

2. 1-Sample T - used to make inferences on the mean of a population using a random sample mean. The random samples must be normally distributed, and there is NO previous information on the population standard deviation.

3. 2-Sample T - requires independent, normally distributed sample data and is used to compare the difference between two means and make inferences on whether it is equal to a target.

4. Paired t - used to analyze the difference between paired observations against a reference value, i.e. the target value is 0.

5. One-way ANOVA - ANOVA stands for ANalysis Of VAriance. It is used to determine if the means of several populations have statistically significant differences.

6. Two-way ANOVA - a general linear model procedure to conduct an ANOVA and test the hypothesis that the means of several populations are equal, with a response (Y) and multiple predictors (X1, X2).
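To connect the p-value rule with the 1-Sample Z procedure, here is a minimal sketch of a one-sample z-test using only the Python standard library. The function name `one_sample_z`, its parameters, and the numbers in the example (sample mean 52, hypothesised mean 50, known σ = 5, n = 25) are all assumptions made up for illustration. The sketch assumes a normally distributed population with a known standard deviation.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(sample_mean, mu0, sigma, n, alternative="two-sided"):
    """Return (z, p_value) for H0: population mean = mu0, with sigma known."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "greater":    # H1: mean > mu0
        p_value = 1 - standard_normal.cdf(z)
    elif alternative == "less":     # H1: mean < mu0
        p_value = standard_normal.cdf(z)
    else:                           # H1: mean != mu0
        p_value = 2 * (1 - standard_normal.cdf(abs(z)))
    return z, p_value

# Hypothetical numbers, for illustration only.
z, p_value = one_sample_z(sample_mean=52.0, mu0=50.0, sigma=5.0, n=25)

alpha = 0.05
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(round(z, 3), round(p_value, 4), decision)
```

With these made-up numbers the two-sided p-value falls just under 0.05, so H₀ is rejected at the 5% level; a slightly smaller sample mean would flip the decision to "fail to reject H₀", which is different from accepting H₀.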


Reference:
Walpole, R., Myers, R., Myers, S., & Ye, K. (2014). Scientists & Engineers Guide to
Probability & Statistics (9th ed.). Pearson Education Inc., Prentice-Hall.
Decoursey, W.J. (2003). Statistics and Probability for Engineering Applications.
Newnes.
