
ENGINEERING DATA ANALYSIS

COURSE CODE: EM 1203 (3 UNITS)


INSTRUCTOR: ENGR. DANIELLE D. CABAÑA
SCHEDULE: MWF 9:30PM-12:30PM
ROOM: LB402TC
DATA PRESENTATION AND
DESCRIPTIVE STATISTICS
WHY SHOULD WE PAY ATTENTION?
THIS IS THE LINE
I DON’T WANT TO
HEAR FROM YOU!

• An essential component of scientific research;


• A must-have skill! (for any researcher, but also useful
in commerce, industry, and business)
• It will help you communicate your results more
effectively.
HOW TO CONDUCT A SCIENTIFIC
PROJECT?
❑ Research your topic.
❑ Make a hypothesis.
❑ Write down your procedure.
❑ Assemble your Materials.
❑ Conduct the experiment.
❑ Repeat the experiment.
❑ Analyze your results.
❑ Draw a conclusion.
WHAT ARE SOME EXAMPLES OF DATA
COLLECTED BY ENGINEERS?
• Measuring the ozone level of the air
• Classifying air as healthy or unhealthy based on Air Quality
Index.
• Measuring power output and efficiency of a wind turbine design
• Measuring the percent of chemical output of a process when
the temperature of the process is varied.
• Classifying the types of complaints seen in a consumer
response program.
• …and many more…
TYPES OF DATA AND FREQUENCY
DISTRIBUTION TABLES
TYPES OF DATA

QUALITATIVE (CATEGORICAL) DATA

QUANTITATIVE (NUMERICAL) DATA


QUALITATIVE (CATEGORICAL) DATA
• Data in the form of classification into
different groups or categories.
• Data that are categorically assigned.
e.g. the type of complaint received,
and the classification of air as healthy or unhealthy
QUANTITATIVE (NUMERICAL) DATA
• Data in the form of numerical measurements or counts.
e.g. measuring the power output and efficiency, number of
bridge operations per year, measuring percent of chemical
output, etc.
DISCRETE OR CONTINUOUS DATA
• Discrete data – values are distinct and separate, i.e. they can
be counted. (counted)
• Continuous data – values may take on any value within a finite
or infinite interval. (measured)
Discrete or Continuous?
• The number of suitcases lost by an airline.
• The height of apple trees.
• The number of apples produced.
• The number of green M&M's in a bag.
• The time it takes for a hard disk to fail.
• The production of cauliflower by weight.
OTHER CATEGORICAL DATA
• Categorical data – values can be sorted according to category
• Nominal data – values can be assigned a code in the form of a
number, where the numbers are simply labels
• Ordinal data – values can be ranked or have a rating scale
attached.
PRESENTING THE DATA
• Table 1.1 summarizes the types of complaints received by an
automobile paint shop over a period of time.
Definition
• Frequency of a measurement in a dataset is the number of
times that measurement occurs in the dataset. Frequency is
denoted by the letter f.
• Frequency distribution table is a table giving all the
measurements and their frequencies. The measurement can be
reported in a grouped or ungrouped manner.
• Relative frequency of an observation in a set with n
measurements is the ratio of its frequency to the total number of
measurements, denoted by rf, where
$rf = f/n$
• Cumulative relative frequency gives the proportion of
measurements less than or equal to a specified value, and is
denoted by crf.
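The definitions of f, rf, and crf can be sketched in a few lines of Python. The `ratings` list below is hypothetical illustration data (it does not come from Table 1.1), and the variable names are assumptions made for this sketch only.

```python
from collections import Counter

# Hypothetical ordered ratings (1 = poor ... 4 = excellent); not from the slides.
ratings = [1, 2, 2, 3, 3, 3, 4, 4, 2, 3]

n = len(ratings)               # total number of measurements
freq = Counter(ratings)        # f: number of times each value occurs

table = []                     # rows of (value, f, rf, crf)
running_total = 0
for value in sorted(freq):
    f = freq[value]
    running_total += f
    rf = f / n                 # relative frequency, rf = f / n
    crf = running_total / n    # proportion of measurements <= value
    table.append((value, f, rf, crf))

print("value   f     rf    crf")
for value, f, rf, crf in table:
    print(f"{value:>5} {f:>3} {rf:>6.2f} {crf:>6.2f}")
```

The last crf entry is always 1, which is a quick check that no measurement was missed.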
Table 1.2 compares the Air Quality Indices for Los Angeles and
Orlando, Florida, for 2007 by looking at the EPA standard
categories for air quality.
Exercise:
1. Table 1.3 displays the number of adults in the United States
that fall into various income categories, according to the number
of years of formal education.
a. Produce a companion table that shows relative frequencies
for the income categories within each education column.
b. Comment on the relationship between income and years of
education.
2. Classify the following data as quantitative or qualitative data.
a. The day of the week on which a power outage occurred.
b. Number of automobile accidents at a major intersection in
Pittsburgh on the fourth of July weekend.
c. Ignition times of material used in mattress covers (in seconds).
d. Breaking strength of threads (in ounces).
e. The gender of workers absent on a given day
f. Number of computer breakdowns per month.
g. Number of gamma rays emitted per second by a certain radioactive
substance.
h. The ethnicity of students in a class.
i. Daily consumption of electric power by a certain city (in millions of
kilowatt hours).
j. The level of ability to handle a spinner achieved by machinists at the
end of training.
TERMINOLOGIES
• Population – the collection of items under investigation.
• Sample – a representative subset of the population, used in the
experiments
• Variable – a characteristic that changes over time and/or for
different individuals or objects under consideration
• Observation – the value of a variable taken during one of the
experiments

[Figure: a sample shown as a subset of the population]
GRAPHING CATEGORICAL DATA
• Bar chart: Useful for summarizing
and displaying the patterns in
categorical data; is created by
plotting all the categories in the data
on one axis and the frequency (or
relative frequency or percentages) of
occurrence of each category in the
data on the other axis. Either
horizontal or vertical bars of height
(or length) equal to the frequency (or
relative frequency or percentages)
are drawn.
• Pie Charts: Suitable to represent
categorical data; used to show
percentages; areas are proportional to
the value of each category.
GRAPHING NUMERICAL DATA
• Histograms: The graphical
representation of a frequency
table; summarizes numerical
data grouped into classes;
displays bars vertically or
horizontally, where the area of
each bar is proportional to the
frequency of the observations
falling into the class.
Data Frequency Table
The following data represent the amount of temperature of
rainfall as part of a hydrological study of a certain location.

112 100 127 120 134 118 105 110 109 112

110 118 117 116 118 122 114 114 105 109

107 112 114 115 118 117 118 122 106 110

116 108 110 121 113 120 119 111 104 111

120 113 120 117 105 110 118 112 114 114
Frequency Distribution Table

Class limits   Class boundaries   Frequency   Cumulative frequency
100 – 104       99.5 – 104.5          2               2
105 – 109      104.5 – 109.5          8              10
110 – 114      109.5 – 114.5         18              28
115 – 119      114.5 – 119.5         13              41
120 – 124      119.5 – 124.5          7              48
125 – 129      124.5 – 129.5          1              49
130 – 134      129.5 – 134.5          1              50

n = Σf = 50

[Figures: Frequency Histogram; Frequency Polygon]
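The class frequencies can be reproduced from the 50 raw observations with a short script. This is a minimal sketch: the class width of 5 and the starting limit of 100 are taken from the table, while the variable names and the text histogram are assumptions added for illustration.

```python
# The 50 observations from the data frequency table, row by row.
observations = [
    112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
    110, 118, 117, 116, 118, 122, 114, 114, 105, 109,
    107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
    116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
    120, 113, 120, 117, 105, 110, 118, 112, 114, 114,
]

# Class limits 100-104, 105-109, ..., 130-134 (width 5).
class_limits = [(lower, lower + 4) for lower in range(100, 135, 5)]

# Count the observations that fall inside each pair of class limits.
frequencies = [
    sum(lower <= x <= upper for x in observations)
    for lower, upper in class_limits
]

# Running totals give the cumulative frequency column.
cumulative = []
total = 0
for f in frequencies:
    total += f
    cumulative.append(total)

# Print the table with class boundaries and a simple text histogram.
for (lower, upper), f, cf in zip(class_limits, frequencies, cumulative):
    boundaries = f"{lower - 0.5}-{upper + 0.5}"
    print(f"{lower}-{upper}  {boundaries:<12} {f:>2} {cf:>3}  {'*' * f}")
```

The row of asterisks is a rough stand-in for the frequency histogram; plotting `cumulative` against the upper class boundaries would give the ogive.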
The Ogive (Cumulative Frequency)
Visualizing Distributions
Three Components:
• Center – describes a point near the middle of a
distribution that might serve as a typical value or a balance
point for the distribution
• Spread – describes how the data spread out around
the center
• Shape – describes the basic pattern of the plotted
data, along with any notable departures from that pattern.
Shapes of Distribution

[Figure panels: Bell-shaped, Uniform, Right-skewed, Left-skewed, Bimodal, U-shaped]
I. Measure of Center
1. Mean
If the dataset contains n measurements labeled $x_1, x_2, \ldots, x_n$, then
the mean $\bar{x}$ (read as "x bar") is defined as

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

a. Population mean: $\mu$
b. Sample mean (from n measurements): $\bar{x}$
2. Median

For any dataset, the median is the middle of the ordered array of
numerical values.
• Arrange the n measurements in increasing order of value, in other
words, from smallest to largest.
• Compute $l = \frac{n+1}{2}$.
• Then the median M is the value of the lth measurement in the
ordered array of measurements. If l is not a whole number (n is
even), M is the average of the two middle measurements.
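The median procedure can be sketched as a small function. The function name and the sample lists are hypothetical; for an even n, where l = (n + 1)/2 is not a whole number, the sketch assumes the usual convention of averaging the two middle measurements.

```python
def median_by_position(values):
    """Median via the position l = (n + 1) / 2 in the ordered array."""
    ordered = sorted(values)          # arrange from smallest to largest
    n = len(ordered)
    l = (n + 1) / 2                   # 1-based position of the median

    if l.is_integer():
        # Odd n: the l-th ordered measurement is the median.
        return ordered[int(l) - 1]

    # Even n (assumed convention): average the two measurements
    # on either side of position l.
    lower = ordered[int(l) - 1]
    upper = ordered[int(l)]
    return (lower + upper) / 2


# Hypothetical illustration data.
print(median_by_position([7, 3, 9]))       # odd n: middle value of 3, 7, 9
print(median_by_position([7, 3, 9, 5]))    # even n: average of 5 and 7
```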
3. Mode
The mode is the most frequently occurring data value.
a) If all the elements in the data set have the same frequency of
occurrence, then the data set is said to have no mode.
b) If the data set has one value that occurs more frequently than
the rest of the values, then the data set is said to be unimodal.
c) If two elements of the data set are tied for the highest
frequency of occurrence, then the data set is said to be
bimodal.
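The three cases for the mode can be checked mechanically. The sketch below is an illustration with hypothetical data and a hypothetical function name; the "multimodal" label for three or more tied values is an assumption that goes beyond the three cases listed above.

```python
from collections import Counter


def describe_mode(values):
    """Classify a data set as no mode, unimodal, bimodal, or multimodal."""
    counts = Counter(values)
    highest = max(counts.values())
    modes = sorted(v for v, c in counts.items() if c == highest)

    # Every distinct value occurs equally often: no mode.
    if len(counts) > 1 and len(modes) == len(counts):
        return "no mode", []
    if len(modes) == 1:
        return "unimodal", modes
    if len(modes) == 2:
        return "bimodal", modes
    # Assumption: three or more tied values are labeled multimodal.
    return "multimodal", modes


# Hypothetical illustration data.
print(describe_mode([1, 2, 3]))          # all frequencies equal
print(describe_mode([1, 2, 2, 3]))       # one most frequent value
print(describe_mode([1, 1, 2, 2, 3]))    # two values tied for the highest frequency
```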
Example 1
The AQI data for the year 2003, listed in order of the cities, are
given below. Solve for the mean and median.

12, 8, 10, 5, 17, 19, 31, 11, 88, 11, 19, 37, 1, 2, 12
Mean = $\bar{x}$ = 18.87
Median = M = 12
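Example 1 can be checked with Python's standard `statistics` module. The variable names are chosen here for illustration only.

```python
import statistics

# AQI data for 2003 from Example 1.
aqi = [12, 8, 10, 5, 17, 19, 31, 11, 88, 11, 19, 37, 1, 2, 12]

mean_aqi = statistics.mean(aqi)       # sum of the values divided by n
median_aqi = statistics.median(aqi)   # middle value of the ordered data

print(f"n      = {len(aqi)}")
print(f"mean   = {mean_aqi:.2f}")
print(f"median = {median_aqi}")
```

The single large value 88 pulls the mean well above the median, which is typical of right-skewed data.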
Example 2
Mean transportation time for accidents in rural Alabama from
3,133 cases is 13.67 min, whereas that in urban Alabama in
2,065 cases is 8.97 min. The transportation time is defined as the
time to transport the vehicular accident victim from the site of
accident to the emergency care medical facility by the EMS
(emergency medical service) vehicle. What is the overall mean
transportation time for the state of Alabama?

Weighted mean: $\bar{x}_w = \frac{3133(13.67) + 2065(8.97)}{3133 + 2065} = \frac{61351.16}{5198} \approx 11.80$ min
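The overall mean weights each group mean by its number of cases. A minimal sketch of that arithmetic follows, with the dictionary layout and names assumed for illustration.

```python
# (number of cases, mean transportation time in minutes) from Example 2.
groups = {
    "rural": (3133, 13.67),
    "urban": (2065, 8.97),
}

total_cases = sum(cases for cases, _ in groups.values())

# Weighted mean: sum of (cases * group mean) divided by total cases.
weighted_mean = sum(cases * mean for cases, mean in groups.values()) / total_cases

print(f"total cases  = {total_cases}")
print(f"overall mean = {weighted_mean:.2f} min")  # prints 11.80 min
```

Simply averaging 13.67 and 8.97 would give 11.32 min, which ignores that rural cases outnumber urban ones.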


II. Measures of Variability
1. Range and Interquartile Range:
R = largest value − smallest value
IQR measures the spread of the middle 50% of an ordered data set: IQR = Q3 − Q1
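Both measures can be sketched on the 2003 AQI data used elsewhere in these slides. Quartile conventions differ between textbooks and software; this sketch assumes the default ("exclusive") method of Python's `statistics.quantiles`, so other conventions may give slightly different quartiles.

```python
import statistics

aqi = [12, 8, 10, 5, 17, 19, 31, 11, 88, 11, 19, 37, 1, 2, 12]

# Range: largest value minus smallest value.
data_range = max(aqi) - min(aqi)

# Quartiles using the module's default "exclusive" method (an assumption;
# the slides do not specify a quartile convention).
q1, q2, q3 = statistics.quantiles(aqi, n=4)
iqr = q3 - q1   # spread of the middle 50% of the ordered data

print(f"range = {data_range}")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

The range is driven entirely by the extreme value 88, while the IQR is unaffected by it.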

2. Variance and Standard deviation


If the center of the dataset is defined by the mean, then the
“typical” deviation from the mean is described by the standard
deviation. The variance is the typical squared deviation. It is
measured in squared units.

a) Population variance:

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$

b) Sample variance (from n measurements):

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

c) Standard deviation

a) Population: $\sigma = \sqrt{\sigma^2}$

b) Sample: $s = \sqrt{s^2}$
d. Coefficient of variation
A dimensionless quantity, the coefficient of variation is the ratio
of the standard deviation to the mean for the same
dataset, expressed as a percentage:

$$CV = \frac{\text{sample standard deviation}}{\text{sample mean}} \times 100\% = \frac{s}{\bar{x}} \times 100\%$$
In order of the cities, the AQI data for year 2003 are:

12, 8, 10, 5, 17, 19, 31, 11, 88, 11, 19, 37, 1, 2, 12

Mean                        18.87
Sample variance             462.12
Sample standard deviation   21.50
n                           15
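The summary values for the AQI data can be recomputed directly from the sample formulas (divisor n − 1). This is a sketch with variable names chosen for illustration.

```python
import math

aqi = [12, 8, 10, 5, 17, 19, 31, 11, 88, 11, 19, 37, 1, 2, 12]

n = len(aqi)
mean = sum(aqi) / n

# Sum of squared deviations from the sample mean.
sum_sq_dev = sum((x - mean) ** 2 for x in aqi)

sample_variance = sum_sq_dev / (n - 1)      # s^2
sample_sd = math.sqrt(sample_variance)      # s
cv_percent = sample_sd / mean * 100         # coefficient of variation, in %

print(f"mean     = {mean:.2f}")
print(f"variance = {sample_variance:.2f}")
print(f"std dev  = {sample_sd:.2f}")
print(f"CV       = {cv_percent:.1f}%")
```

Rounding only at the end matters here: squaring a standard deviation already rounded to 21.50 gives 462.25, while the unrounded sample variance is about 462.12.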
Significance of Standard deviation
Chebyshev’s Theorem

The proportion of any distribution that lies within k standard deviations of the mean is at
least $1 - \frac{1}{k^2}$, where k is any number greater than 1. The theorem applies to all
distributions of data.
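Chebyshev's bound can be compared with what is actually observed in a data set. In the sketch below, `k` is the number of standard deviations on either side of the mean, the AQI data from the earlier examples is reused, and the function names are assumptions made for illustration.

```python
import math


def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is informative only for k > 1")
    return 1 - 1 / k ** 2


def proportion_within(values, k):
    """Observed proportion of values within k sample standard deviations."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    inside = sum(abs(x - mean) <= k * sd for x in values)
    return inside / n


aqi = [12, 8, 10, 5, 17, 19, 31, 11, 88, 11, 19, 37, 1, 2, 12]

bound_k2 = chebyshev_bound(2)
observed_k2 = proportion_within(aqi, 2)

print(f"guaranteed at least {bound_k2:.2%}, observed {observed_k2:.2%}")
```

The bound is deliberately conservative: it must hold for every distribution, so real data usually exceed it by a wide margin.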
Empirical (Normal) Rule:

The Empirical Rule can be used only for relatively mound-shaped data sets.
III. Measures of relative standing
1. The z-score
The sample z-score of a value x is a measure of relative
standing defined by

$$z = \frac{x - \bar{x}}{s}$$

The z-score measures the distance between an observation and the
mean, in units of standard deviation.
z-scores between -2 and +2 are highly likely.
z-scores exceeding 3 in absolute value are very unlikely and can
be considered outliers (unusually large or small observations).
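As an illustration, the z-score can be used to flag the value 88 in the AQI data from the earlier examples. The function name and the choice of data are assumptions for this sketch.

```python
import math


def z_score(x, values):
    """Sample z-score of x relative to the data in `values`."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return (x - mean) / sd


aqi = [12, 8, 10, 5, 17, 19, 31, 11, 88, 11, 19, 37, 1, 2, 12]

for value in (12, 37, 88):
    z = z_score(value, aqi)
    label = "possible outlier" if abs(z) > 3 else "not unusual"
    print(f"x = {value:>2}: z = {z:+.2f} ({label})")
```

Only 88 lies more than three standard deviations from the mean, so it is the one observation this rule marks as an outlier.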
2. The pth percentile and the quartiles
Quartiles, deciles, and percentiles divide a frequency distribution into
a number of parts containing equal frequencies.

Quartiles divide the range of values into four parts, each
containing one quarter of the values.
Deciles divide the range of values into ten parts, each containing
one-tenth of the total frequency.
Percentiles divide the range of values into 100 parts, each
containing one-hundredth of the total frequency.
Suppose a set of n measurements on the variable x has been
arranged in order of magnitude.

The pth percentile is the value of x that exceeds p% of the
measurements and is less than the remaining (100 - p)%.
Example 3
Suppose in one major exam, you obtained a score of 150 (200
being the perfect score) and this placed you at the 60th
percentile in the distribution of scores among the examinees.
Where does your score of 150 stand in relation to the scores of
others who took the examination?
Solution
Scoring at the 60th percentile means that 60% of all
examination scores were lower than yours and 40% were
higher.
Percentile Corresponding to a given data value

• The percentile corresponding to a given data value x in a set is
obtained by:

$$\text{Percentile} = \frac{(\text{number of values below } x) + 0.5}{\text{number of values in data set}} \times 100\%$$

Example

Given the following measurements:

13, 11, 10, 13, 11, 10, 8, 12, 9, 9, 8, 9
What is the percentile rank for the measurement “12”?
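The percentile-rank formula can be applied to this example in a few lines. The sketch assumes the ratio is multiplied by 100 so that the result reads as a percentile; the function name is hypothetical.

```python
def percentile_rank(x, values):
    """Percentile of x: (number of values below x + 0.5) / n, times 100."""
    below = sum(v < x for v in values)   # strictly smaller values only
    return (below + 0.5) / len(values) * 100


measurements = [13, 11, 10, 13, 11, 10, 8, 12, 9, 9, 8, 9]

rank_of_12 = percentile_rank(12, measurements)
print(f"values below 12: {sum(v < 12 for v in measurements)}")
print(f"percentile rank of 12: {rank_of_12:.1f}")
```

Nine of the twelve measurements fall below 12, so the rank is (9 + 0.5)/12 × 100, or roughly the 79th percentile.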
Special percentiles

The median is the same as the 50th percentile.
The 25th and 75th percentiles are called the lower and upper quartiles.
Example 4
To start a program to improve the quality of production in a factory,
products are examined in groups of six and classified as “good” or
“defective”. The results for 60 groups (360 products in total) are shown.
Find the mean, median, first quartile, third quartile, proportion
defective in the sample, sample variance, sample standard deviation,
and coefficient of variation.

Numbers of Defectives in Groups of Six Items


1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 1 0
0 1 0 0 1 0 0 0 0 2 0 0
0 0 0 0 2 0 0 1 0 0 1 0
1 0 0 0 0 1 0 0 1 0 0 0
Answer
Number of defectives   Frequency
0                      48
1                      10
2                      2
>2                     0

Mean = 14/60 = 0.233
Median = value between the 30th and 31st ordered observations = 0
Mode = 0
First quartile = value between the 15th and 16th = 0
Third quartile = value between the 45th and 46th = 0
Proportion defective in the sample = 14/360 = 0.0389
Sample variance = 0.2497
Sample standard deviation = 0.4997
Coefficient of variation = 214%
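The answers to Example 4 can be verified from the frequency table alone. This is a sketch with illustrative variable names; it treats each of the 60 groups as one observation, and the proportion defective is taken over all 360 inspected items.

```python
import math
import statistics

# Number of defectives per group of six -> number of groups (from the answer table).
frequency_table = {0: 48, 1: 10, 2: 2}
items_per_group = 6

# Expand the table back into the 60 individual group counts.
defects = [value for value, f in frequency_table.items() for _ in range(f)]

n_groups = len(defects)
total_defective = sum(defects)

mean = total_defective / n_groups
median = statistics.median(defects)
sample_variance = sum((x - mean) ** 2 for x in defects) / (n_groups - 1)
sample_sd = math.sqrt(sample_variance)
cv_percent = sample_sd / mean * 100

# Proportion defective is per item, not per group: 14 out of 360.
proportion_defective = total_defective / (n_groups * items_per_group)

print(f"mean = {mean:.3f}, median = {median}")
print(f"variance = {sample_variance:.4f}, std dev = {sample_sd:.4f}")
print(f"CV = {cv_percent:.0f}%, proportion defective = {proportion_defective:.4f}")
```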
Assignment:
The accompanying specific gravity values for various wood types used
in construction appeared in the article “Bolted Connection Design
Values Based on European Yield Model” (J. of Structural Engr., 1993:
2169-2186).

Bin     Frequency
0.35    2
0.4     7
0.45    10
0.5     6
0.55    4
0.6     1
0.65    1
0.7     4
0.75    1

a. Construct a frequency histogram and describe the distribution by its
shape.
b. Determine the mean, median, mode, sample standard deviation,
sample variance, and coefficient of variation.
